Original Article Information
A Study on the Effectiveness of Bigrams in Text Categorization
Abstract
English
Text categorization systems generally use single words (unigrams) as features. This study investigates a deceptively simple algorithm for improving text categorization, based on an idea that earlier work reported as ineffective: identifying useful word pairs (bigrams) made up of adjacent unigrams. The bigrams the algorithm finds, though few in number, can substantially raise the quality of a feature set. The algorithm was tested on two pre-classified datasets, Reuters-21578 for English and Korea-web for Korean. The results show that it extracted high-quality bigrams and improved the overall quality of the features. To examine the role of bigrams, we trained Naive Bayes classifiers using both unigrams and bigrams as features. Recall was higher than with unigrams alone. Break-even points and F1 values improved in most cases, especially when documents were classified into the large classes. On Reuters-21578, break-even points increased by 2.1%, with the largest gain at 18.8%, and F1 improved by 1.5%, with the largest gain at 3.2%. On Korea-web, break-even points increased by 1.0%, with the largest gain at 4.5%, and F1 improved by 0.4%, with the largest gain at 4.2%. We conclude that text classification using unigrams and bigrams together is more effective than using unigrams alone.
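The general approach described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the abstract does not state how bigrams are selected, so the document-frequency threshold `min_df`, the helper names (`adjacent_bigrams`, `select_bigrams`, `featurize`), the small `NaiveBayes` class, and the toy training data are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def adjacent_bigrams(tokens):
    """Return word pairs formed from adjacent unigrams."""
    return [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]


def select_bigrams(docs, min_df=2):
    """Keep bigrams occurring in at least min_df documents.

    The threshold is an assumed selection criterion, not the paper's.
    """
    df = Counter()
    for tokens in docs:
        df.update(set(adjacent_bigrams(tokens)))
    return {bg for bg, n in df.items() if n >= min_df}


def featurize(tokens, kept_bigrams):
    """Unigrams plus only the selected bigrams."""
    return tokens + [bg for bg in adjacent_bigrams(tokens) if bg in kept_bigrams]


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing."""

    def fit(self, feature_lists, labels):
        self.class_counts = Counter(labels)
        self.feat_counts = defaultdict(Counter)
        for feats, label in zip(feature_lists, labels):
            self.feat_counts[label].update(feats)
        self.vocab = {f for c in self.feat_counts.values() for f in c}
        return self

    def predict(self, feats):
        total_docs = sum(self.class_counts.values())
        best, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.feat_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for f in feats:
                if f in self.vocab:  # features unseen in training are ignored
                    score += math.log((counts[f] + 1) / denom)
            if score > best_score:
                best, best_score = label, score
        return best


# Toy data (hypothetical; not drawn from Reuters-21578 or Korea-web).
train = [
    ("interest rate rises again".split(), "finance"),
    ("central bank cuts interest rate".split(), "finance"),
    ("home team wins final match".split(), "sports"),
    ("final match ends in draw".split(), "sports"),
]
docs = [tokens for tokens, _ in train]
labels = [label for _, label in train]

kept = select_bigrams(docs, min_df=2)
model = NaiveBayes().fit([featurize(d, kept) for d in docs], labels)
prediction = model.predict(featurize("bank raises interest rate".split(), kept))
print(sorted(kept), prediction)
```

In this sketch only a small number of bigrams survive selection and they are added to, not substituted for, the unigram features, which mirrors the abstract's description of using unigrams and bigrams together.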
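Table of Contents
1. Introduction
2. Text Categorization Using Phrases
3. Bigram Extraction Algorithm
4. Experiments
4.1 Experimental Data
4.2 Experimental Method
4.3 Performance Evaluation
4.4 Analysis of Bigram Extraction Results
4.5 Experimental Results and Analysis
5. Conclusion
5.1 Summary of Findings
5.2 Comparison with Previous Research
5.3 Analysis of Performance Degradation
References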