Abstract:
Heterogeneous data of all kinds are growing rapidly on the web. Because web search results span many data types, classifying the results is a common way to locate the desired data. Many machine learning methods have been used to classify textual data. The main challenges in text classification are the cost of the classifier and the accuracy of classification. The vector space model (VSM) is a traditional representation in information retrieval (IR) and for text data; in this representation, the computational cost depends on the dimension of the vectors. Another problem is selecting effective features and pruning unwanted terms. Latent semantic indexing (LSI) is used to transform the VSM into an orthogonal semantic space that takes term relations into account. Experimental results showed that the LSI semantic space achieves better performance in both computation time and classification accuracy. This suggests that the semantic topic space contains less noise, which increases accuracy, while the lower vector dimension also reduces computational complexity.
Keywords: Persian Text Classification, Vector Space Model (VSM), Latent Semantic Indexing (LSI).
Machine-generated summary:
Experimental results showed that the LSI semantic space achieves better performance in both computation time and classification accuracy.
Therefore, word2vec first constructs a vocabulary from the training text data and then learns vector representations of words with its neural network model.
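The two stages described above can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: the toy corpus and window size are hypothetical, and the (center, context) pairs shown here would, in real word2vec training, be fed to a shallow neural network that learns the embeddings.

```python
# Stage 1 of word2vec: build a vocabulary from the training text.
# Stage 2 (sketched): generate skip-gram (center, context) training pairs.
# Corpus and window size are hypothetical toy values.
sentences = [["the", "bank", "raised", "rates"],
             ["the", "market", "fell"]]

# Vocabulary: every distinct word mapped to an integer index.
vocab = {w: i for i, w in enumerate(sorted({w for s in sentences for w in s}))}

def skipgram_pairs(sentences, window=1):
    """Emit (center, context) pairs within a fixed window around each word."""
    pairs = []
    for sent in sentences:
        for i, center in enumerate(sent):
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if j != i:
                    pairs.append((center, sent[j]))
    return pairs

pairs = skipgram_pairs(sentences)
print(len(vocab), pairs[:2])
```

The neural network then adjusts each word's vector so that words appearing in similar contexts end up close together in the embedding space.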
Rajan, Ramalingam, Ganesan, Palanivel & Palaniappan (2009) developed a Tamil text classification system based on the VSM and a neural network (NN) model.
Uysal and Gunal (2014) proposed a method based on genetic algorithm oriented latent semantic features (GALSF) for improving the representation of documents in text classification.
They developed a text classification system to compare the performance of the standard VSM against their proposed VSM variant that uses a title-vector-based document representation.
Pilevar, Feili & Soltani (2009) used the Learning Vector Quantization network for Persian document classification.
Persian Text Classification
The Vector Space Model (VSM) has been used in IR and NLP for many years (Wong, Ziarko, Raghavan & Wong, 1987).
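As a concrete illustration of the VSM, the sketch below builds TF-IDF document vectors over a toy English corpus (the documents, weighting details, and helper name are assumptions, not taken from the paper; a real system would use a preprocessed Persian corpus). It also shows why classification cost depends on vocabulary size: every document vector has one component per vocabulary term.

```python
import math
from collections import Counter

# Toy corpus standing in for preprocessed Persian documents (hypothetical data).
docs = [
    ["economy", "bank", "market"],
    ["football", "league", "goal"],
    ["bank", "loan", "market", "economy"],
]

# The vocabulary fixes the dimension of every document vector.
vocab = sorted({t for d in docs for t in d})

def tfidf_vector(doc, docs, vocab):
    """Represent one document as a TF-IDF vector over the vocabulary."""
    tf = Counter(doc)
    n = len(docs)
    vec = []
    for term in vocab:
        df = sum(1 for d in docs if term in d)          # document frequency
        idf = math.log(n / df) if df else 0.0            # inverse doc frequency
        vec.append(tf[term] / len(doc) * idf)            # normalized tf * idf
    return vec

vectors = [tfidf_vector(d, docs, vocab) for d in docs]
# Each vector has len(vocab) components, so classifier cost grows with
# vocabulary size -- the dimensionality problem that LSI addresses.
print(len(vocab), len(vectors[0]))
```

With a realistic Persian vocabulary of tens of thousands of terms, these vectors become very high-dimensional and sparse, which motivates the LSI reduction discussed next.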
Latent semantic analysis uses the singular value decomposition (SVD) to decompose a large term-document matrix, keeping only the k largest singular values and their corresponding orthogonal singular vectors.
Through this SVD low-rank approximation, LSI automatically transforms the original textual data into a smaller semantic space by exploiting some of the implicit higher-order structure in the associations of words with text objects (Landauer & Dumais, 2008; Landauer & Dumais, 2006).
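The rank-k truncation can be sketched with NumPy's SVD on a toy term-document matrix (the matrix values and k are hypothetical; this illustrates the general technique, not the paper's exact setup):

```python
import numpy as np

# Toy term-document matrix A (rows = terms, columns = documents);
# entries are hypothetical term frequencies.
A = np.array([[2., 0., 1., 0.],
              [1., 0., 2., 0.],
              [0., 3., 0., 1.],
              [0., 1., 0., 2.]])

# SVD: A = U * diag(s) * Vt, singular values in s sorted descending.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Rank-k LSI approximation: keep only the k largest singular values.
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Document representations in the k-dimensional semantic space:
# k components per document instead of one per vocabulary term.
docs_k = np.diag(s[:k]) @ Vt[:k, :]
print(A_k.shape, docs_k.shape)  # (4, 4) (2, 4)
```

Classification then operates on the k-dimensional columns of `docs_k`, which is where the reported savings in computation time come from, while the truncation discards the smallest singular directions, which tend to carry noise.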
The experiments showed that the reduced LSI semantic space yields better performance in Persian text classification.
Improving VSM text classification with a title-vector-based document representation method.