Abstract
English
This study aims to identify an AI classification model that is optimal for classifying linguistic data. For this purpose, three commonly used classification models (the XGBoost, Random Forest, and Support Vector Machine classifiers) are compared in terms of performance. Specifically, the three models are trained to classify input data into essays and dialogues based on syntactic complexity-related characteristics that distinguish the two genres. To determine whether a model that performs well on balanced data also performs well on imbalanced data, the three models' performances are measured under two conditions: when the training dataset is balanced and when it is imbalanced. On the first test dataset, the trained models are evaluated using accuracy, F1-score, the normalized confusion matrix, and the area under the receiver operating characteristic curve. On the second test dataset, performance is assessed in terms of accuracy, the confusion matrix, precision, and recall. The results demonstrate that the Random Forest classifier performs best among the three models regardless of whether the training data are balanced.
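The comparison described above can be sketched in code. The following is a minimal, hypothetical illustration using scikit-learn, not the paper's actual pipeline: synthetic features stand in for the syntactic-complexity features, and scikit-learn's GradientBoostingClassifier stands in for XGBoost (which is a separate third-party library). The metric choices mirror those named in the abstract.

```python
# Hypothetical sketch of the three-model comparison; synthetic data,
# not the paper's essay/dialogue corpus or feature set.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Synthetic stand-in for syntactic-complexity features (two classes:
# e.g., essay vs. dialogue).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "gradient_boosting (XGBoost stand-in)": GradientBoostingClassifier(random_state=0),
    "random_forest": RandomForestClassifier(random_state=0),
    "svm": SVC(probability=True, random_state=0),  # probability=True enables AUC
}

scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    proba = model.predict_proba(X_te)[:, 1]
    scores[name] = {
        "accuracy": accuracy_score(y_te, pred),
        "f1": f1_score(y_te, pred),
        "auc": roc_auc_score(y_te, proba),
    }

for name, s in scores.items():
    print(name, s)
```

A balanced-versus-imbalanced comparison, as in the study, would repeat this loop after resampling the training split (e.g., undersampling one class) while holding the test sets fixed.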
Table of Contents
2. Background
2.1 XGBoost Classifier
2.2 Random Forest Classifier
2.3 Support Vector Machine Classifier
3. Method
3.1 Data
3.2 Procedure
4. Results
4.1 Results from XGBoost Classifier
4.2 Results from Random Forest Classifier
4.3 Results from Support Vector Machine Classifier
5. Discussion and Conclusion
References
[Abstract]
