Abstract
English
This study aims to identify an AI classification model that is optimal for classifying linguistic data. For this purpose, three commonly used classification models (the XGBoost, Random Forest, and Support Vector Machine classifiers) are compared in terms of performance. Specifically, the three models are trained to classify input data into essays and dialogues based on syntactic complexity-related characteristics that distinguish the two genres. To determine whether a model that performs well on balanced data also performs well on imbalanced data, the three models' performances are measured under two conditions: when the training dataset is balanced and when it is imbalanced. On the first test dataset, the trained models are evaluated using accuracy, F1-score, the normalized confusion matrix, and the area under the receiver operating characteristic curve. On the second test dataset, performance is assessed in terms of accuracy, the confusion matrix, precision, and recall. The results demonstrate that the Random Forest classifier performs best among the three models regardless of whether the training data are balanced.
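The comparison described above can be sketched in code. The following is a minimal, hypothetical illustration using scikit-learn, not the paper's actual pipeline: synthetic features stand in for the syntactic-complexity features, and scikit-learn's GradientBoostingClassifier stands in for XGBoost (which is a separate third-party library). The metric choices mirror those named in the abstract.

```python
# Hypothetical sketch of the three-model comparison; synthetic data,
# not the paper's essay/dialogue corpus or feature set.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Synthetic stand-in for syntactic-complexity features (two classes:
# e.g., essay vs. dialogue).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "gradient_boosting (XGBoost stand-in)": GradientBoostingClassifier(random_state=0),
    "random_forest": RandomForestClassifier(random_state=0),
    "svm": SVC(probability=True, random_state=0),  # probability=True enables AUC
}

scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    proba = model.predict_proba(X_te)[:, 1]
    scores[name] = {
        "accuracy": accuracy_score(y_te, pred),
        "f1": f1_score(y_te, pred),
        "auc": roc_auc_score(y_te, proba),
    }

for name, s in scores.items():
    print(name, s)
```

A balanced-versus-imbalanced comparison, as in the study, would repeat this loop after resampling the training split (e.g., undersampling one class) while holding the test sets fixed.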
Table of Contents
2. Background
2.1 XGBoost Classifier
2.2 Random Forest Classifier
2.3 Support Vector Machine Classifier
3. Method
3.1 Data
3.2 Procedure
4. Results
4.1 Results from XGBoost Classifier
4.2 Results from Random Forest Classifier
4.3 Results from Support Vector Machine Classifier
5. Discussion and Conclusion
References
[Abstract]
