A Study on the Effectiveness of Bigrams in Text Categorization

Chan-Do Lee; Joon-Young Choi

한국정보기술응용학회 (Korea Society of Information Technology Applications), JITAM Vol. 12 No. 2, June 2005, pp. 15-27, KCI-listed

Abstract

Text categorization systems generally use single words (unigrams) as features. This study investigates a deceptively simple way to improve text categorization, an idea previously shown not to work: identifying useful word pairs (bigrams) made up of adjacent unigrams. The bigrams found, while small in number, can substantially raise the quality of feature sets. The algorithm was tested on two pre-classified datasets, Reuters-21578 for English and Korea-web for Korean. The results show that it extracted high-quality bigrams and increased the overall quality of the features. To assess the role of bigrams, we trained Naive Bayes classifiers using both unigrams and bigrams as features. Recall values were higher than with unigrams alone. Break-even points and F1 values improved for most documents, especially when documents were classified into the large classes. On Reuters-21578, break-even points increased by 2.1%, with the highest at 18.8%, and F1 improved by 1.5%, with the highest at 3.2%. On Korea-web, break-even points increased by 1.0%, with the highest at 4.5%, and F1 improved by 0.4%, with the highest at 4.2%. We conclude that text classification using unigrams and bigrams together is more effective than using unigrams alone.
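
As a rough illustration of the idea described in the abstract, the following Python sketch builds a feature set from unigrams plus adjacent-word bigrams and computes F1 from classification counts. It is not the paper's algorithm: the abstract does not state how bigrams are selected, so the frequency threshold `min_bigram_count`, the function names, and the `first_second` feature naming are all assumptions made for this example.

```python
from collections import Counter


def unigram_bigram_features(tokens, min_bigram_count=1):
    """Count unigrams plus adjacent-word bigrams for one tokenized document.

    The frequency threshold is a stand-in selection rule (an assumption);
    the paper's own bigram selection criteria are not given in the abstract.
    """
    features = Counter(tokens)
    # Pair each token with its right-hand neighbor to form adjacent bigrams.
    bigrams = Counter(zip(tokens, tokens[1:]))
    for (first, second), count in bigrams.items():
        if count >= min_bigram_count:
            features[f"{first}_{second}"] = count
    return features


def f1_score(true_positives, false_positives, false_negatives):
    """F1 is the harmonic mean of precision and recall."""
    if true_positives == 0:
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


doc = "interest rates rise as interest rates".split()
features = unigram_bigram_features(doc, min_bigram_count=2)
```

With the threshold set to 2, only the repeated pair `interest_rates` joins the unigram features; the resulting counts could then be passed to any Naive Bayes implementation. The break-even point mentioned in the abstract is the value at which precision and recall are equal, where F1 takes that same value.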

Abstract
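1. Introduction
2. Text Categorization Using Phrases
3. Bigram Extraction Algorithm
4. Experiments
  4.1 Experimental Data
  4.2 Experimental Method
  4.3 Performance Evaluation
  4.4 Analysis of Bigram Extraction Results
  4.5 Experimental Results and Analysis
5. Conclusion
  5.1 Summary of Results
  5.2 Comparison with Previous Studies
  5.3 Analysis of Performance Degradation
References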

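Author Information

Chan-Do Lee (이찬도), Division of Information, Communication and Internet Engineering, Daejeon University
Joon-Young Choi (최준영), Hyehwa Medical Center, Daejeon University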
