

Ternary Decomposition and Dictionary Extension for Khmer Word Segmentation


Thaileang Sung, Insoo Hwang

In this paper, we propose a dictionary extension and a ternary decomposition technique to improve the accuracy of Khmer word segmentation. Most word segmentation approaches depend on a dictionary, but the dictionary in use is incomplete and cannot cover every word of the Khmer language, which leads to the problem of unknown (out-of-vocabulary) words. Our approach extends the original dictionary with new words to make it more reliable, and uses ternary decomposition in the segmentation process. We also use the invisible space of Khmer Unicode text (the zero-width space, U+200B) to segment our training corpus. With a segmentation algorithm based on ternary decomposition and the invisible space, we extract new words from the training text and add them to the dictionary. To test unannotated text, we use the extended word list together with a segmentation algorithm that does not rely on the invisible space. Our approach outperformed other approaches, achieving 88.8% precision, 91.8% recall, and a 90.6% F-measure.
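The dictionary-extension step described above can be illustrated with a minimal Python sketch. It assumes only that the training text marks boundaries with the zero-width space (U+200B) and that any resulting chunk missing from the dictionary is a candidate new word. The function names are hypothetical, and the ternary decomposition step is deliberately not shown, since its details are given in the body of the paper rather than here.

```python
ZWSP = "\u200b"  # zero-width space, the "invisible space" in Khmer Unicode text


def split_on_zwsp(text):
    """Split text at zero-width spaces and drop empty chunks."""
    return [chunk for chunk in text.split(ZWSP) if chunk]


def extract_new_words(text, dictionary):
    """Return chunks that are not in the dictionary, in first-seen order.

    Assumption for illustration: every chunk delimited by U+200B is
    treated as a word candidate. The paper additionally applies
    ternary decomposition, which this sketch omits.
    """
    seen = set()
    new_words = []
    for chunk in split_on_zwsp(text):
        if chunk not in dictionary and chunk not in seen:
            seen.add(chunk)
            new_words.append(chunk)
    return new_words


def extend_dictionary(dictionary, new_words):
    """Return a new word set containing the original and the new words."""
    return set(dictionary) | set(new_words)
```

Calling `extract_new_words(training_text, dictionary)` and passing the result to `extend_dictionary` gives an extended word list of the kind the abstract describes; how candidates are filtered before being accepted is left open in this sketch.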


 1. Introduction
 2. Khmer Language Overview
  2.1 Khmer Language
  2.2 Chuon Nath Dictionary
  2.3 Problems in Khmer Word Segmentation
 3. Research Reviews
  3.1 KCC Bigram
  3.2 Trainable Rule-based
 4. Proposed Approach
  4.1 Initialization
  4.2 Decomposition
  4.3 New Word Extraction
  4.4 Extended Dictionary
 5. Experiment
  5.1 Experimental Setup
  5.2 Experimental Results
  5.3 Discussion
 6. Conclusion


  • Thaileang Sung, Graduate Student of Information Systems Dept., Jeonju University
  • Insoo Hwang, Professor of Smart Media, Jeonju University


