Encoding and Language Detection of Text Documents Using a Deep Learning Algorithm

김선범; 배준우; 박희진

Article information

Encoding and language detection of text document using Deep learning algorithm

김선범, 배준우, 박희진

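Korean Institute of Next Generation Computing, The Journal of Korean Institute of Next Generation Computing, Vol. 13, No. 5, October 2017, pp. 124-130 (KCI-indexed)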

Abstract

English

Character encoding is the method used to represent characters or symbols on a computer, and many software tools exist for detecting it. For the widely used encoding detection tool “uchardet”, the accuracy of encoding detection on unmodified, ordinary text documents is 91.39%, but the accuracy of language detection is only 32.09%. Moreover, if a text document is encrypted by substitution, the accuracy of encoding detection drops to 3.55% and the accuracy of language detection to 0.06%. In this paper, we therefore propose a method for detecting the encoding and language of text documents using the deep learning algorithm LSTM (Long Short-Term Memory). The LSTM results are better than those of “uchardet”. For ordinary text documents, the LSTM achieves an encoding detection accuracy of 99.89% and a language detection accuracy of 99.92%. For text documents encrypted by substitution, it achieves an encoding detection accuracy of 99.26% and a language detection accuracy of 99.77%.

Korean (translated)

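Character encoding is the method used to represent characters or symbols on a computer, and software for detecting character encodings exists. For “uchardet”, a widely used encoding detection tool, the encoding detection accuracy on unmodified, ordinary documents is 91.39%, but the language detection accuracy is only 32.09%. Furthermore, when a document is encrypted with a substitution cipher, the accuracy is very low: 3.55% for encoding detection and 0.06% for language detection. This paper therefore proposes a method for detecting the encoding and language of documents using LSTM (Long Short-Term Memory), a deep learning algorithm, and the method outperforms the existing encoding detection tool “uchardet”. With the proposed method, the encoding detection accuracy on ordinary documents is 99.89% and the language detection accuracy is 99.92%. When a document is encrypted with a substitution cipher, the proposed method remains highly accurate, with an encoding detection accuracy of 99.26% and a language detection accuracy of 99.77%.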
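
The abstract does not say how documents are "encrypted by substitution". As a rough illustration only, the sketch below assumes a byte-level monoalphabetic substitution: every byte value is replaced through a fixed random permutation of the 256 possible values. The function names, the seed, and the sample text are hypothetical and are not taken from the paper.

```python
import random


def make_substitution_table(seed: int) -> bytes:
    """Build a random permutation of the 256 byte values (hypothetical key)."""
    values = list(range(256))
    random.Random(seed).shuffle(values)
    return bytes(values)


def invert_table(table: bytes) -> bytes:
    """Compute the inverse permutation so the substitution can be undone."""
    inverse = bytearray(256)
    for plain_value, cipher_value in enumerate(table):
        inverse[cipher_value] = plain_value
    return bytes(inverse)


def substitute(data: bytes, table: bytes) -> bytes:
    """Replace each byte by its image under the 256-entry table."""
    return data.translate(table)


# Assumed sample: a short Korean string encoded as UTF-8.
plain = "문자 인코딩 판별".encode("utf-8")
table = make_substitution_table(seed=2017)
cipher = substitute(plain, table)
restored = substitute(cipher, invert_table(table))
```

Under this assumption, the byte values themselves no longer match any known encoding, which would be consistent with the sharp drop reported for "uchardet", while the length of the document and the positions of repeated bytes are preserved.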

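Table of contents

Abstract (Korean)
Abstract (English)
1. Introduction
2. Collection of experimental data
3. Encoding and language detection of ordinary documents
  3.1 Preprocessing of LSTM input data
  3.2 Experimental results
4. Encoding and language detection of documents encrypted by substitution
  4.1 Document modification by substitution
  4.2 Preprocessing of LSTM input data
  4.3 Experimental results
5. Conclusion
References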