Ternary Decomposition and Dictionary Extension for Khmer Word Segmentation

Article Information

Thaileang Sung, Insoo Hwang

Abstract

English

In this paper, we propose a dictionary extension and a ternary decomposition technique to improve the effectiveness of Khmer word segmentation. Most word segmentation approaches depend on a dictionary. However, the dictionary in use is not fully reliable and cannot cover every word of the Khmer language, which leads to the problem of unknown, or out-of-vocabulary, words. Our approach is to extend the original dictionary with new words to make it more reliable, and to use ternary decomposition in the segmentation process. We also use the invisible space of Khmer Unicode (the zero-width space, U+200B) to segment our training corpus. With our segmentation algorithm, based on ternary decomposition and the invisible space, we extract new words from the training text and add them to the dictionary. To test on unannotated text, we use the extended word list together with a segmentation algorithm that does not rely on the invisible space. Our results remarkably outperformed other approaches: we achieved 88.8%, 91.8% and 90.6% for precision, recall and F-measure, respectively.
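The dictionary-extension idea described in the abstract can be illustrated with a minimal sketch. It assumes the training text marks word boundaries with U+200B and that the dictionary is a plain set of strings. The function names (`extract_new_words`, `extend_dictionary`, `longest_match_segment`) are hypothetical, and the segmenter is an ordinary greedy longest-match stand-in, not the paper's ternary decomposition, which the abstract does not specify. Latin strings are used as placeholders for Khmer words.

```python
ZWSP = "\u200b"  # zero-width space (U+200B), the "invisible space"


def extract_new_words(annotated_text, dictionary):
    """Return ZWSP-delimited tokens that are missing from the dictionary."""
    tokens = [t for t in annotated_text.split(ZWSP) if t.strip()]
    return {t for t in tokens if t not in dictionary}


def extend_dictionary(dictionary, annotated_text):
    """Return a new set: the original dictionary plus the extracted words."""
    return set(dictionary) | extract_new_words(annotated_text, dictionary)


def longest_match_segment(text, dictionary):
    """Greedy longest-match segmentation (a stand-in, NOT ternary decomposition).

    Any ZWSP in the input is dropped first, so the result does not depend
    on the invisible space. Unmatched characters become one-character tokens.
    """
    text = text.replace(ZWSP, "")
    max_len = max(1, max((len(w) for w in dictionary), default=1))
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary:
                break
        words.append(text[i:j])
        i = j
    return words


# Placeholder data: Latin strings standing in for Khmer words.
base_dictionary = {"ab", "cd"}
training_text = "ab" + ZWSP + "xyz" + ZWSP + "cd"
extended = extend_dictionary(base_dictionary, training_text)
```

With the placeholder data, segmenting the unannotated string `"abxyzcd"` against `base_dictionary` breaks the unknown span into single characters, while segmenting against `extended` keeps `"xyz"` as one word. This is the out-of-vocabulary effect that the dictionary extension is meant to reduce.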

Table of Contents

Abstract
 1. Introduction
 2. Khmer Language Overview
  2.1 Khmer Language
  2.2 Chuon Nath Dictionary
  2.3 Problems in Khmer Word Segmentation
 3. Research Reviews
  3.1 KCC Bigram
  3.2 Trainable Rule-based
 4. Proposed Approach
  4.1 Initialization
  4.2 Decomposition
  4.3 New Word Extraction
  4.4 Extended Dictionary
 5. Experiment
  5.1 Experimental Setup
  5.2 Experimental Results
  5.3 Discussion
 6. Conclusion
 References

Author Information

  • Thaileang Sung, Graduate Student, Information Systems Dept., Jeonju University
  • Insoo Hwang, Professor of Smart Media, Jeonju University
