Filter-mBART Based Neural Machine Translation Using Parallel Corpus Filtering

Hyeonseok Moon; Chanjun Park; Sugyeong Eo; JeongBae Park; Heuiseok Lim

Korea Convergence Society, Journal of the Korea Convergence Society, Vol. 12, No. 5, May 2021, pp. 1-7 (KCI-indexed)

Abstract

English

In recent machine translation research, a model is typically pretrained on a large monolingual corpus and then fine-tuned on a parallel corpus. Although many studies increase the amount of data used in the pretraining stage, it is hard to say that more data is required to improve machine translation performance. In this study, through experiments based on the mBART model with parallel corpus filtering, we show that high-quality data can yield better machine translation performance even when less data is used. We argue that data quality matters more than data quantity, and that this finding can serve as a guideline for building a training corpus.

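Korean (English translation)

Recent machine translation research pretrains a model on a large monolingual corpus and then fine-tunes it on a parallel corpus. Many studies keep increasing the amount of data used in the pretraining stage, but it is hard to argue that more data is strictly necessary to improve machine translation performance. Through experiments based on the mBART model with parallel corpus filtering, this study shows that a smaller amount of data can yield better machine translation performance if the data is of high quality. In the experiments, the pretrained model that went through parallel corpus filtering outperformed the model that did not. These results show that data quality matters more than data quantity, and that the process can serve as a guideline for future corpus construction.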
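
The abstract does not state which filtering criteria were used, so the following is only a minimal sketch of what parallel corpus filtering before fine-tuning might look like. It assumes a few commonly used heuristics (length bounds, a token-length ratio check, removal of untranslated copies, and removal of exact duplicates). The function name `filter_parallel_corpus` and all thresholds are hypothetical and are not taken from the paper.

```python
def filter_parallel_corpus(pairs, min_len=1, max_len=200, max_ratio=3.0):
    """Keep (source, target) pairs that pass simple quality heuristics.

    All thresholds are illustrative assumptions, not values from the paper.
    """
    seen = set()
    kept = []
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        # Whitespace tokenization is a simplification for this sketch.
        src_len, tgt_len = len(src.split()), len(tgt.split())

        # Drop pairs with an empty or overly long side.
        if not (min_len <= src_len <= max_len and min_len <= tgt_len <= max_len):
            continue
        # Drop pairs whose length ratio suggests misalignment.
        if max(src_len, tgt_len) > max_ratio * min(src_len, tgt_len):
            continue
        # Drop untranslated copies (source identical to target).
        if src == tgt:
            continue
        # Drop exact duplicate pairs.
        if (src, tgt) in seen:
            continue

        seen.add((src, tgt))
        kept.append((src, tgt))
    return kept
```

The filtered pairs would then replace the raw parallel corpus in the fine-tuning step. Real pipelines often add language identification or alignment scoring on top of rule-based checks like these.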

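Author Information

Hyeonseok Moon. Integrated M.S./Ph.D. student, Department of Computer Science, Korea University
Chanjun Park. Integrated M.S./Ph.D. student, Department of Computer Science, Korea University
Sugyeong Eo. Integrated M.S./Ph.D. student, Department of Computer Science, Korea University
JeongBae Park. Professor, Human Inspired AI Research Institute, Korea University
Heuiseok Lim. Professor, Department of Computer Science, Korea University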
