Source Information
Abstract
English
This paper addresses the automatic classification of large-scale Uyghur text collected from the Internet. We designed the functional block structure of a Uyghur text classification system, chose the k-nearest neighbor (KNN) algorithm as the classification engine, and implemented the system in C#. In the preprocessing stage, we drew on the lexical characteristics of the Uyghur language and introduced stem extraction, which greatly reduced the overall feature dimensionality. We report classification results on a large-scale corpus of more than 3,000 documents belonging to 10 categories, together with results for different numbers of features selected using the χ² (chi-square) statistic. The results show that only 3% to 5% of the full high-dimensional feature set is crucial for high classification accuracy. How to identify those best features, and how to reduce the feature space dimensionality further, are interesting questions for future work.
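The pipeline summarized in the abstract (χ² feature selection followed by KNN classification) can be illustrated with a minimal sketch. This is not the authors' C# implementation. It is a small Python illustration that assumes documents are already tokenized into stems, scores each term by its maximum χ² value over categories, and uses cosine similarity with majority voting for KNN. The function names, the max-over-categories aggregation, the cosine measure, and the toy English tokens standing in for Uyghur stems are all assumptions made for illustration.

```python
from collections import Counter
import math


def chi_square(docs, labels, term, category):
    """Chi-square statistic of one (term, category) pair from a 2x2 table."""
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_cat = label == category
        if has_term and in_cat:
            a += 1          # term present, document in category
        elif has_term:
            b += 1          # term present, document not in category
        elif in_cat:
            c += 1          # term absent, document in category
        else:
            d += 1          # term absent, document not in category
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def select_features(docs, labels, k):
    """Keep the k terms with the highest max-over-categories score (assumed rule)."""
    vocab = sorted({t for tokens in docs for t in tokens})
    cats = sorted(set(labels))
    scores = {t: max(chi_square(docs, labels, t, c) for c in cats) for t in vocab}
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(vocab, key=lambda t: (-scores[t], t))[:k]


def vectorize(tokens, features):
    """Term-frequency vector restricted to the selected features."""
    counts = Counter(tokens)
    return [counts[f] for f in features]


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return 0.0 if nu == 0 or nv == 0 else dot / (nu * nv)


def knn_predict(train_vecs, train_labels, query, k):
    """Majority vote among the k training vectors most similar to the query."""
    ranked = sorted(zip(train_vecs, train_labels),
                    key=lambda pair: cosine(pair[0], query), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy data: hypothetical stems, not taken from the paper's corpus.
docs = [["sport", "ball", "win"], ["sport", "team", "win"],
        ["bank", "money", "loan"], ["bank", "rate", "money"]]
labels = ["sports", "sports", "finance", "finance"]

features = select_features(docs, labels, 4)
train_vecs = [vectorize(tokens, features) for tokens in docs]
prediction = knn_predict(train_vecs, labels,
                         vectorize(["win", "sport", "goal"], features), k=3)
```

Restricting the vectors to a small set of top-scoring terms mirrors the abstract's observation that a small fraction of the features carries most of the discriminative power; the cutoff of 4 terms here is arbitrary.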
Table of Contents
1. Introduction
2. Uyghur Text Preprocessing
2.1. Uyghur Text Features
2.2. Uyghur Text Preprocessing
2.3. Feature Selection
3. Text Categorization Algorithm
4. Text Categorization Experiments and Analysis
4.1. Data Sets
4.2. Evaluation Parameters
4.3. Experimental Results Analysis
5. Conclusions
Acknowledgements
References
