Relationships among Different Effect-Size Indexes for Inter-Rater Agreement between Human and Automated Essay Scoring

Publisher: 학습자중심교과교육학회 (The Korean Association for Learner-Centered Curriculum and Instruction)
Author: 윤지여
Journal: 『학습자중심교과교육연구』 (Journal of Learner-Centered Curriculum and Instruction), Vol. 23, No. 18, pp. 901-919 (19 pages)
Subject: Social Sciences > Education
File format: PDF
Publication date: 2023.09.30

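Korean Abstract

Objectives: This study investigated the magnitudes of the effect sizes that represent agreement between human and machine raters in English writing assessment, and the relationships among the effect-size indexes.

Methods: Meta-analysis was used. First, studies were collected through a literature search and screened against inclusion and exclusion criteria. The selected studies were coded for assessment and effect-size characteristics, and effect sizes and their variances were computed. The rma, robust, and robu functions in the metafor and robumeta packages in R software Version 3.3.2 were used for the heterogeneity tests, the overall model analyses, and the moderator and contrast analyses.

Results: Under the overall random-effects model, agreement between machine and human scoring of essay writing had a mean correlation of .75, a mean kappa of .48, and a mean adjacent proportion of agreement of .99. However, the hierarchical weighted models and heterogeneity tests showed that these indexes varied across studies. In the moderator and contrast analyses, which examined between-study differences through moderator variables, correlations and kappa values on 6-point scales differed significantly from those on 3-, 4-, and 5-point scales, respectively. The exact and adjacent proportions of agreement on 3- and 4-point scales differed significantly from those on 5- and 6-point scales. The adjacent and exact proportions differed by 0.34 on average, and the variance of that difference was very small (0.004). Likewise, the correlation and kappa differed by 0.27 on average, and the variance of that difference was also very small (0.003).

Conclusions: Machine scoring closely resembles human scoring in terms of relative consistency and absolute agreement. Compared with the evaluation criteria proposed in prior studies, the correlation is above the threshold, kappa is moderate, and the adjacent proportion falls within the range for adjacent agreement rates. The magnitude of agreement for each index was not consistent across studies; that is, inter-rater agreement varies more because of between-study differences than within-study differences. Follow-up research is therefore needed on moderators that can explain the between-study differences. Agreement between human and machine raters in English writing assessment is affected by the score scale used. The correlation, kappa, and the exact and adjacent proportions are quite strongly related. Regardless of the score scale, kappa is on average 0.27 points lower than the correlation, and the exact proportion is on average 0.34 points lower than the adjacent proportion. Because each inter-rater agreement index has its own strengths and weaknesses, automated scoring research on English writing assessment should report several inter-rater agreement indexes in a complementary way.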
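To make the four agreement indexes named in the abstract concrete, the following is a minimal sketch of how each could be computed for one set of human and machine scores. It is an illustration under stated assumptions, not the study's own procedure: the function names and the score lists are hypothetical, kappa is taken to be unweighted Cohen's kappa (the abstract does not say whether a weighted variant was used), and adjacent agreement is assumed to mean scores within one point of each other, exact matches included.

```python
from collections import Counter
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation: relative consistency between two raters."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def cohen_kappa(x, y):
    """Unweighted Cohen's kappa (assumption: no quadratic weighting)."""
    n = len(x)
    p_observed = sum(a == b for a, b in zip(x, y)) / n
    cx, cy = Counter(x), Counter(y)
    # Chance agreement from the two raters' marginal score distributions.
    p_expected = sum(cx[k] * cy[k] for k in cx) / n ** 2
    return (p_observed - p_expected) / (1 - p_expected)


def exact_agreement(x, y):
    """Proportion of essays given identical scores."""
    return sum(a == b for a, b in zip(x, y)) / len(x)


def adjacent_agreement(x, y):
    """Proportion of essays scored within one point (assumed definition)."""
    return sum(abs(a - b) <= 1 for a, b in zip(x, y)) / len(x)


# Hypothetical scores on a 6-point scale, for illustration only.
human = [1, 2, 3, 4, 5, 6, 3, 4]
machine = [1, 2, 3, 5, 5, 4, 3, 3]

print("r       =", round(pearson_r(human, machine), 3))
print("kappa   =", round(cohen_kappa(human, machine), 3))
print("exact   =", exact_agreement(human, machine))
print("adjacent=", adjacent_agreement(human, machine))
```

Under these definitions the adjacent proportion can never be lower than the exact proportion, and kappa discounts chance agreement, which is in line with the abstract's observation that the exact proportion and kappa run lower than the adjacent proportion and the correlation.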

English Abstract

Objectives: The purpose of this study is to investigate the magnitudes of, and relationships among, different effect-size indexes for inter-rater agreement between human and machine scoring in writing assessments.

Methods: The meta-analytic procedure consisted of data gathering (a literature search with inclusion and exclusion criteria) and data analysis (data cleaning and coding, followed by tests of heterogeneity for each index, hierarchical weighted models, and moderator and contrast analyses). The analyses were conducted using the rma, robust, and robu functions in the metafor and robumeta packages in R software Version 3.3.2.

Results: The overall random-effects means for correlations, kappa values, and adjacent proportions of agreement between automated and human scoring of essay writing were .75, .48, and .99, respectively. The results from the hierarchical weighted models and heterogeneity tests indicate that the rates of agreement for each index were inconsistent across studies. According to the moderator and contrast analyses, correlations and kappa values using 6-point scales significantly differed from those using 3-, 4-, and 5-point scales, respectively, at alpha level .05. The adjacent proportions of agreement on either 3- or 4-point scales significantly differed from the adjacent proportions of agreement on the 5- and 6-point scales, respectively, at alpha level .01. For the exact and adjacent proportions of agreement, the mean discrepancy was 0.34, and the variance of the discrepancies was 0.004. Similarly, the mean discrepancy between the correlation and kappa was 0.27, and the variance of the discrepancies was 0.003.

Conclusions: According to these findings, machine scoring is similar to human scoring in terms of relative consistency and absolute consensus. Compared to the evaluation criteria suggested by prior studies, the degrees of inter-rater agreement seen in this study were above the thresholds for correlations, indicated moderate agreement for kappa, and fell in the range of consensus rates for adjacent proportions of agreement. The rates of agreement for each index were inconsistent across studies. This means that all agreement indexes had relatively large between-studies differences compared to the between-effects differences within studies. It is necessary to investigate whether some moderators explain the between-studies differences. The number of score-scale points used for measuring inter-rater agreement between human and machine scoring influenced the agreement rates. The relationships across the four indexes (i.e., correlation, kappa, exact proportion of agreement, and adjacent proportion of agreement) appear to be reasonably strong and linear. Regardless of the number of points on the score scales, kappa values are 0.27 points lower than correlations. In addition, the mean exact proportion of agreement is 0.34 points lower than the mean adjacent proportion of agreement. Since each inter-rater agreement index has its own disadvantages, such as scale dependency or not showing the degree of identical matching and the matching patterns, it is advised to report several inter-rater agreement indexes.
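The random-effects pooling and heterogeneity testing described in the abstract can be sketched as follows. This is an illustration only, not the study's pipeline: the study used the rma, robust, and robu functions in R, whereas this sketch uses the closed-form DerSimonian-Laird estimator in plain Python, assumes correlations are pooled on the Fisher z scale with variance 1/(n - 3), and omits hierarchical weighting and robust variance estimation. The function name and the study data are hypothetical.

```python
from math import atanh, sqrt, tanh


def random_effects_dl(effects, variances):
    """Pool effect sizes with the DerSimonian-Laird random-effects estimator.

    Returns (pooled effect, standard error, tau^2, Q statistic, I^2).
    """
    w = [1.0 / v for v in variances]
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sw
    # Cochran's Q: weighted squared deviations from the fixed-effect mean.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    df = len(effects) - 1
    c = sw - sum(wi ** 2 for wi in w) / sw
    # Between-study variance, truncated at zero.
    tau2 = max(0.0, (q - df) / c)
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = sqrt(1.0 / sum(w_star))
    # I^2: share of total variation attributed to between-study differences.
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return pooled, se, tau2, q, i2


# Hypothetical (correlation, sample size) pairs, for illustration only.
studies = [(0.80, 120), (0.65, 200), (0.78, 90), (0.70, 150)]

z_values = [atanh(r) for r, _ in studies]        # Fisher z transform
z_variances = [1.0 / (n - 3) for _, n in studies]

pooled_z, se, tau2, q, i2 = random_effects_dl(z_values, z_variances)
pooled_r = tanh(pooled_z)                        # back to the r scale

print("pooled r =", round(pooled_r, 3))
print("tau^2    =", round(tau2, 4))
print("Q        =", round(q, 2), "on", len(studies) - 1, "df")
print("I^2      =", round(i2, 3))
```

A Q statistic that is large relative to its degrees of freedom, or a high I^2, corresponds to the situation the abstract reports: agreement rates that are inconsistent across studies, which motivates moderator analyses such as the number of score-scale points.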

Keywords

inter-rater agreement, meta-analysis, random-effect model, moderator analysis, automated essay scoring
