Source Information
The Potential of ChatGPT as a Translation Evaluator: Characteristics and Comparisons to Human Evaluation.
Abstract (English)
In this study, we carried out a series of experiments to explore how ChatGPT (version 4o) evaluated Korean-English translations. Using two datasets of human translations (n=57) and two datasets of post-edited translations (n=56), all drawn from Lee and Lee (2021), we adopted two evaluation approaches with strict prompt control. In Experiment A, ChatGPT rated the four datasets freely on a five-point scale without specific criteria. In Experiment B, which was conducted concurrently with Experiment A, ChatGPT rated the same datasets using a prescribed, criterion-referenced five-point scale. To assess intra-rater reliability, we repeated both experiments one month later. This study yielded both quantitative and qualitative findings, including the following: (1) ChatGPT’s average scores differed significantly from those of human raters; (2) correlations between human and ChatGPT scores ranged from ‘moderate’ to ‘strong’; (3) the use of the prescribed rating scale improved ChatGPT’s reliability as a rater; (4) ChatGPT exhibited very low intra-rater reliability; and (5) ChatGPT’s self-justifications for its ratings varied in quality, often failing to identify obvious errors.
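The abstract outlines a rating protocol (GPT-4o scoring translations on a five-point scale, with and without a prescribed rubric, then comparing the scores against human raters). A minimal sketch of what such a pipeline could look like is given below, assuming the OpenAI Python client and SciPy; the prompt wording, rubric text, function names, and sample scores are all illustrative and are not taken from the paper.

```python
# Illustrative sketch of a GPT-4o translation-rating pipeline; prompt text,
# rubric, and data are hypothetical, not the authors' actual materials.
from openai import OpenAI
from scipy.stats import spearmanr

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RUBRIC = (
    "Rate the English translation of the Korean source on a 5-point scale:\n"
    "5 = accurate and fluent; 4 = minor errors; 3 = noticeable errors;\n"
    "2 = frequent errors; 1 = largely inaccurate. Reply with the number only."
)

def rate_translation(source: str, translation: str, use_rubric: bool) -> int:
    """Ask GPT-4o for a 1-5 rating, with a prescribed rubric or with free
    criteria (roughly mirroring Experiments B and A, respectively)."""
    criteria = RUBRIC if use_rubric else (
        "Rate the English translation of the Korean source from 1 (worst) "
        "to 5 (best). Reply with the number only."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,  # limits run-to-run variation, relevant to intra-rater reliability
        messages=[
            {"role": "system", "content": criteria},
            {"role": "user", "content": f"Source: {source}\nTranslation: {translation}"},
        ],
    )
    return int(resp.choices[0].message.content.strip())

# Hypothetical parallel score lists for one dataset (ChatGPT vs. human raters).
gpt_scores = [4, 3, 5, 2, 4]
human_scores = [5, 3, 4, 2, 3]
rho, p = spearmanr(gpt_scores, human_scores)
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")
```

Repeating the same calls after an interval and correlating the two rounds of ChatGPT scores with each other would give a comparable check of intra-rater reliability, which the study found to be very low.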
Table of Contents
1. Introduction
2. Review of Previous Research
2.1. ChatGPT and Foreign-Language Writing/Translation
2.2. Studies Using ChatGPT as an Evaluation Tool
3. Research Methods
4. Analysis Results
4.1. Quantitative Analysis
4.1.1. Comparison of Means
4.1.2. Inter-rater Correlations
4.1.3. Intra-rater Reliability
4.2. Qualitative Analysis
4.2.1. Characteristics of Rating Justifications
4.2.2. Differences in Ratings by Scale
4.2.3. Examples of ChatGPT Evaluations
5. Conclusion
References
