This article surveys methods for measuring agreement among corpus annotators, focusing on agreement coefficients such as Krippendorff's alpha, Scott's pi, and Cohen's kappa. It discusses their mathematical foundations, assumptions, and applications in computational linguistics (CL). The article argues that weighted, alpha-like coefficients, which are less commonly used in CL than kappa-like measures, may be more appropriate for many corpus annotation tasks. However, their use complicates the interpretation of the coefficient values.
The article begins by highlighting the increasing importance of empirical methods in discourse research and the challenges of subjective judgments in creating annotated resources. It notes that early work on assessing coder agreement in discourse segmentation tasks led to the adoption of the K coefficient, a multi-coder generalization of Scott's pi commonly referred to as kappa, as a de facto standard for measuring agreement in CL. However, questions have been raised about the applicability and interpretation of K and similar coefficients, including issues related to coder bias, the effect of skewed distributions, and the way chance agreement is estimated.
The article then discusses the mathematics and assumptions of agreement coefficients, emphasizing the difference between chance agreement estimated from a single category distribution pooled across coders (pi and alpha) and chance agreement estimated from each coder's individual distribution (kappa). It explains how these coefficients are calculated, their limitations, and their suitability for different annotation tasks. The article also addresses the use of weighted coefficients, which allow for different magnitudes of disagreement, and highlights the importance of choosing a natural distance metric for nominal, ordinal, interval, and ratio scales.
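To make the pi/alpha versus kappa distinction concrete, here is a minimal Python sketch for two coders and nominal categories. It is not taken from the article: the function name `chance_corrected`, the `shared_distribution` flag, and the toy labels are invented for illustration. Expected agreement is computed either from one category distribution pooled across both coders (pi/alpha-style) or from each coder's own distribution (kappa-style).

```python
from collections import Counter

def chance_corrected(coder1, coder2, shared_distribution=True):
    """Chance-corrected agreement for two coders on nominal labels.

    shared_distribution=True  -> expected agreement from the pooled
                                 label distribution (Scott's pi / two-coder
                                 alpha style).
    shared_distribution=False -> expected agreement from each coder's own
                                 label distribution (Cohen's kappa style).
    """
    n = len(coder1)

    # Observed agreement: proportion of items labelled identically.
    a_o = sum(a == b for a, b in zip(coder1, coder2)) / n

    categories = set(coder1) | set(coder2)
    c1, c2 = Counter(coder1), Counter(coder2)

    if shared_distribution:
        # pi/alpha: one distribution over the 2n labels from both coders.
        pooled = Counter(coder1 + coder2)
        a_e = sum((pooled[k] / (2 * n)) ** 2 for k in categories)
    else:
        # kappa: each coder has an individual prior distribution.
        a_e = sum((c1[k] / n) * (c2[k] / n) for k in categories)

    return (a_o - a_e) / (1 - a_e)

# Toy nominal annotation (invented labels).
c1 = ["stat", "ireq", "stat", "stat", "chck", "stat", "ireq", "stat"]
c2 = ["stat", "ireq", "stat", "chck", "chck", "stat", "stat", "stat"]

print("pi/alpha-style:", round(chance_corrected(c1, c2, True), 3))   # ~0.536
print("kappa-style:   ", round(chance_corrected(c1, c2, False), 3))  # ~0.543
```

The weighted coefficients discussed in the article generalize this scheme by replacing the exact-match test with a distance metric over pairs of categories, for example a 0/1 distance for nominal scales or squared differences for interval scales.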
The article concludes by discussing the challenges of annotator bias and prevalence, where skewed data distributions can affect the interpretation of agreement coefficients. It emphasizes the need for careful consideration of the assumptions and interpretations of chance agreement when using these coefficients in CL. The article also highlights the importance of using appropriate coefficients for different annotation tasks and the need for further research to address the limitations and challenges of agreement measurement in CL.
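As a rough illustration of the prevalence effect (again a toy sketch with invented data, not an example from the article), the following computes a two-coder pi for two label sets that have identical observed agreement but very different category skew; the skewed case yields a much lower chance-corrected value because chance agreement is high when one category dominates.

```python
from collections import Counter

def scotts_pi(coder1, coder2):
    """Two-coder pi: chance agreement from the pooled label distribution."""
    n = len(coder1)
    a_o = sum(a == b for a, b in zip(coder1, coder2)) / n
    pooled = Counter(coder1 + coder2)
    a_e = sum((c / (2 * n)) ** 2 for c in pooled.values())
    return (a_o - a_e) / (1 - a_e)

# Balanced labels: 45 agreed "A", 45 agreed "B", 10 disagreements.
bal1 = ["A"] * 45 + ["B"] * 45 + ["A"] * 5 + ["B"] * 5
bal2 = ["A"] * 45 + ["B"] * 45 + ["B"] * 5 + ["A"] * 5

# Skewed labels: 88 agreed "A", 2 agreed "B", 10 disagreements.
skw1 = ["A"] * 88 + ["B"] * 2 + ["A"] * 5 + ["B"] * 5
skw2 = ["A"] * 88 + ["B"] * 2 + ["B"] * 5 + ["A"] * 5

# Both pairs agree on 90 of 100 items, yet the chance-corrected
# values differ sharply.
print("balanced:", round(scotts_pi(bal1, bal2), 3))  # ~0.80
print("skewed:  ", round(scotts_pi(skw1, skw2), 3))  # ~0.23
```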