A study on the statistical approach for language processing in large isolated vocabulary

이선정

Korean Institute of Next Generation Computing (한국차세대컴퓨팅학회), The Journal of Korean Institute of Next Generation Computing, Vol.9 No.1, 2013.02, pp.17-24, KCI-listed


Abstract

English

In this paper, we study a statistical approach to language processing for a very-large-vocabulary isolated-word recognizer. A representative system is navigation software, in which the POI (point of interest) vocabulary consists of more than one million isolated words. These words are used both to build an inverted index for search, for example by initial consonant, and for speech recognition. We propose an algorithm that converts an isolated POI word into a continuous sentence, that is, a sequence of words drawn from a unit word dictionary. First, a subset of the POIs is analyzed to build the unit word dictionary. The initial dictionary consists of POIs with two or three syllables and is then updated as POIs with four or more syllables are processed. Analyzing 653,939 POIs of 2-5 syllables yields a unit vocabulary of 174,535 words. Finally, statistical language modeling with unigrams, bigrams, and trigrams on the same POIs gives a perplexity of 485.73.
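The dictionary-building step can be illustrated with a minimal Python sketch. The abstract does not state the exact update rule, so the rule used here is an assumption: a longer POI that cannot be split into existing units is added to the dictionary as a new unit. The function names `segment` and `build_unit_dictionary` and the sample POIs are hypothetical, and syllables are counted as characters, which assumes precomposed Hangul strings without spaces.

```python
def segment(poi, units, max_len):
    """Split poi into dictionary units, preferring the fewest pieces.

    Returns a list of units, or None if no full segmentation exists.
    """
    n = len(poi)
    best = [None] * (n + 1)  # best[i]: segmentation of poi[:i], if any
    best[0] = []
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = poi[start:end]
            if best[start] is not None and piece in units:
                cand = best[start] + [piece]
                if best[end] is None or len(cand) < len(best[end]):
                    best[end] = cand
    return best[n]


def build_unit_dictionary(pois):
    """Seed with 2-3 syllable POIs, then process longer POIs by length."""
    units = {p for p in pois if 2 <= len(p) <= 3}
    longer = sorted((p for p in pois if len(p) >= 4), key=len)
    for poi in longer:
        max_len = max((len(u) for u in units), default=0)
        if segment(poi, units, max_len) is None:
            # Assumption: an unsegmentable POI becomes a new unit word.
            units.add(poi)
    return units


# Hypothetical sample data, not taken from the paper.
sample_pois = ["서울", "시청", "대공원", "서울시청", "서울대공원", "롯데월드"]
sample_units = build_unit_dictionary(sample_pois)
sample_split = segment("서울대공원", sample_units, 4)
```

With this rule, a POI such as 서울시청 is represented as the unit sequence 서울 + 시청 rather than stored as one isolated word, which is the kind of conversion the abstract describes.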

Korean

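This paper studies a statistical approach to language processing for large-vocabulary isolated words. Navigation POI (Point of Interest) words, a representative large isolated-word vocabulary, comprise more than one million isolated words. The isolated words used in navigation serve not only POI search, for example by initial consonant, but also speech recognition. We propose an algorithm that converts a POI, conventionally treated as a single isolated word, into a continuous word sequence composed of meaningful unit words. First, we analyzed navigation POIs made up of large-vocabulary isolated words to obtain a unit word set that can be used for speech recognition and POI search. The algorithm defines POIs of 2-3 syllables as the initial word set and then updates the unit word set as the number of syllables increases. Applying the proposed method to 653,939 POIs of 2-5 syllables produced a unit word set of 174,535 words, which was then used to redefine the existing POIs as continuous word sequences of unit words. Applying a statistical language model to these sequences gave a perplexity of 485.73.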
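To make the perplexity figure concrete, the following Python sketch shows how perplexity can be computed over POIs that have already been rewritten as unit-word sequences. It is not the paper's model: it uses a bigram model only, with add-one smoothing, and both choices are assumptions made for brevity (the abstract mentions unigram, bigram, and trigram processing but does not describe smoothing). The names `train_bigram` and `perplexity` and the sample sequences are hypothetical.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count bigram histories and bigrams over padded word sequences."""
    history, bigrams = Counter(), Counter()
    for words in sentences:
        padded = ["<s>"] + words + ["</s>"]
        history.update(padded[:-1])            # every token that has a successor
        bigrams.update(zip(padded, padded[1:]))
    # Predictable tokens: all unit words plus the end marker.
    vocab = {w for words in sentences for w in words} | {"</s>"}
    return history, bigrams, vocab


def perplexity(sentences, history, bigrams, vocab):
    """exp of the negative mean log-probability per predicted token."""
    v = len(vocab)
    log_prob, n_tokens = 0.0, 0
    for words in sentences:
        padded = ["<s>"] + words + ["</s>"]
        for prev, cur in zip(padded, padded[1:]):
            # Add-one smoothing (an assumption, not the paper's method).
            p = (bigrams[(prev, cur)] + 1) / (history[prev] + v)
            log_prob += math.log(p)
            n_tokens += 1
    return math.exp(-log_prob / n_tokens)


# Hypothetical segmented POIs, not taken from the paper.
segmented = [["서울", "시청"], ["서울", "대공원"], ["부산", "시청"]]
model = train_bigram(segmented)
ppl = perplexity(segmented, *model)
```

As in the abstract, this evaluates the model on the same sequences it was trained on, so the value measures how well the unit-word sequences are modeled rather than how the model generalizes to unseen POIs.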
