CEFRに対応した日本語例文自動分類システムのBERT適用による精度改善の試み

宮崎佳典; Cao Hoai Giang; 谷誠司; 安志英; 元裕璟

An Attempt to Improve Accuracy by Applying BERT to Japanese Document Classification System Compatible with CEFR

한국일본학회, 일본학보, Vol. 137, November 2023, pp. 43-62, KCI-listed

Abstract

English

The Common European Framework of Reference for Languages (CEFR) is a Can-Do language proficiency scale that has attracted considerable attention in recent years and has been introduced in foreign language education worldwide. By contrast, there are few examples of CEFR research related to Japanese language education, and as far as the present authors could determine, no CEFR-compliant Japanese text corpus exists. Building such a corpus requires annotating example sentences with the Can-Do Statements (CDSs) that describe CEFR reading comprehension. To reduce that annotation effort, we have been developing an automatic classification system. The system currently uses document type, specialty, sentence length, and Kanji rate as classification features, and it uses fastText to classify document type and specialty. The present study instead applied Bidirectional Encoder Representations from Transformers (BERT), which has been confirmed to work effectively in many natural language processing tasks. The results showed that prediction accuracy improved and that the number of predicted CDSs could be held to a more appropriate level. Future work includes improving precision by further narrowing the number of predictions, designing more effective features, and collecting more data annotated with CDS information. This study targets the Reading descriptors at CEFR proficiency levels PreA1, A1, A2, B1, and B2, to which 34 CDSs correspond.

Korean (English translation)

최근 CanDo에 의한 언어 능력 척도의 하나로 CEFR(유럽 언어 공통 기준)가 관심을 받고 있으며, 외국어 교육에 분야에서도 전세계적으로 도입되고 있는 것은 잘 알려진 사 실이다. 한편, 일본어 교육을 위한 CEFR의 연구 사례는 많지 않으며, 일본어 CEFR 준거 텍스트 코퍼스에 관한 연구도 거의 이루어지고 있지 않다. 본 연구에서는 코퍼스를 작성 할 때, 예문에 CEFR의 독해력을 나타내는 CDS(능력 기술문)에 관한 정보를 부여하는 것 이 상정되었을 때 그 노력을 경감하기 위한 자동 분류의 구현을 지속적으로 연구하고 있 다. 분류를 위한 특징량으로 현재 문서 타입, 전문성, 문장, 한자율을 채용하고 있으며, 그 속의 문서 타입이나 전문성 분류를 위해 fastText가 이용되고 있다. 이에 대해 본 논문 에서는 지금까지 많은 자연언어 처리 태스크에서 유효하게 동작하는 것이 확인되어 정평 이 있는 BERT 알고리즘(Bidirectional Encoder Representations from Transformers)을 새롭게 적용하여 시도한 결과, 예측 정밀도를 향상시키는 것에 성공하였으며, 예측한 CDS의 수 를 보다 적절하게 억제할 수 있었다. 향후 예측할 수 있는 수를 줄임으로써 적합률을 높 이고, 보다 효과적인 특징량의 작성을 검토하여 CDS 정보가 더 많은 데이터를 수집하고 자 한다. 또한 이번 연구 대상은 CEFR가 초기에 설정한 6단계의 언어 능력 레벨(초급 레 벨의 A1부터 A2, B1, B2, C1와 최상급 레벨의 C2까지) 중, 모국어화자라도 난이도가 높 은 C1과 C2를 뺀 A1, B2을 더하여 2017년 CEFR을 보완한 것으로 공개된 CEFR Companion Volume에서 새롭게 추가된 PreA1 레벨도 포함하였다. 또한 기술 항목은 Reading, Writing, Speaking, Listening, Interaction 중 Reading에 초점을 맞추고, 그에 대응하 는 CDS는 34으로 세었다. 즉, 본 연구는 입력 예문을 34개의 라벨로 분류하는 것을 중심 으로 하고 있다는 것을 의미한다.

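<Contents>
1. Introduction
2. Related Work and Prior Research
2.1. Related Research on CEFR
2.2. Prior Work Leading to This Study
3. Method Used in the Prior Work
3.1. Features
3.2. CDS Classification
4. Proposed Method
4.1. Use of BERT
4.2. Adjustment of the Experimental Data
5. Experimental Results
5.1. Classification Experiments for Document Type and Specialty
5.2. CDS Classification Experiment
6. Conclusion
References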

Author information

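宮崎佳典 (Yoshinori Miyazaki). Professor, Information Science Division, Informatics Domain, Academic Institute, Shizuoka University; information science
Cao Hoai Giang. Undergraduate student, Department of Information Science, Faculty of Informatics, Shizuoka University; information science
谷誠司 (곡성사). Professor, Department of Global Communication, Faculty of Foreign Languages, Tokoha University; Japanese language education
安志英 (안지영). Associate Professor, Japanese Language and Literature Major, Faculty of East Asian Studies, Kunsan National University; Japanese linguistics
元裕璟 (원유경). Doctoral student, Japanese Linguistics Major, Department of Chinese and Japanese Language and Literature, Graduate School, Korea University; discourse analysis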
