4 Jun 2024 | Vilém Zouhar, Shuoyang Ding, Anna Currey, Tatyana Badeka, Jenyuan Wang, Brian Thompson
A new, extensive Multidimensional Quality Metrics (MQM)-annotated dataset covering 11 language pairs in the biomedical domain has been introduced. This dataset is used to investigate whether machine translation (MT) metrics fine-tuned on human-generated quality judgments are robust to domain shifts between training and inference. The study finds that fine-tuned metrics exhibit a substantial performance drop in the biomedical domain compared to metrics that rely on surface form or on pre-trained models not fine-tuned on MT quality judgments.
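To make the MQM annotation concrete, the sketch below collapses span-level MQM errors into a single segment score. The severity weights follow the common MQM convention (minor = 1, major = 5, critical = 25); the exact weighting used for this dataset is an assumption here, not a detail taken from the paper.

```python
# Minimal sketch: turning MQM error annotations into a segment-level score.
# Severity weights follow the widely used MQM convention; the weights applied
# to this particular dataset are an assumption.
SEVERITY_WEIGHTS = {"minor": 1.0, "major": 5.0, "critical": 25.0}

def mqm_segment_score(errors):
    """errors: list of (error_category, severity) pairs for one translated segment.
    Returns a non-positive penalty; 0.0 means no errors were annotated."""
    return -sum(SEVERITY_WEIGHTS.get(severity, 0.0) for _, severity in errors)

# One minor fluency error plus one major accuracy error -> -6.0
print(mqm_segment_score([("fluency/grammar", "minor"),
                         ("accuracy/mistranslation", "major")]))
```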
The research examines how different types of metrics perform on the new biomedical test set relative to the WMT test set. Fine-tuned metrics show lower correlation with human judgments on the biomedical data, whereas the other metric types actually correlate more strongly there. This indicates that fine-tuned metrics struggle when the inference-time domain does not match the training domain. The performance gap persists throughout the fine-tuning process and is not due to a deficiency in the pre-trained model.
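The comparison itself boils down to correlating each metric's segment scores with the MQM-derived human scores. The snippet below illustrates this with Kendall's tau on toy data; the choice of correlation statistic is an assumption for illustration, not necessarily the one reported in the paper.

```python
# Sketch: correlate a metric's segment scores with MQM-based human scores.
from scipy.stats import kendalltau

human_scores  = [-6.0, 0.0, -1.0, -25.0]   # MQM penalties per segment (toy data)
metric_scores = [0.62, 0.91, 0.80, 0.30]   # scores from some MT metric (toy data)

tau, p_value = kendalltau(metric_scores, human_scores)
print(f"Kendall tau = {tau:.3f} (p = {p_value:.3f})")
```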
The study also compares the performance of different metric types: Surface-Form metrics, Pre-trained+Fine-tuned metrics, Pre-trained+Algorithm metrics, and Pre-trained+Prompt metrics. Pre-trained+Algorithm metrics, which are not trained on WMT data, do not exhibit the same domain gap as fine-tuned metrics. Pre-trained+Prompt metrics also show a large performance gap, but their underlying models are not publicly available, which limits further analysis.
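As a rough illustration of two ends of this spectrum, the snippet below scores the same hypothesis with a surface-form metric (chrF via sacrebleu) and a pre-trained+fine-tuned metric (COMET via the unbabel-comet package). The API calls reflect recent releases of these packages and may differ by version; the checkpoint name is one commonly used COMET model, not necessarily the one evaluated in the paper.

```python
# Illustrative calls for two metric families on the same segment.
import sacrebleu
from comet import download_model, load_from_checkpoint

src = ["Le patient a reçu 5 mg du médicament."]
hyp = ["The patient was administered 5 mg of the drug."]
ref = ["The patient received 5 mg of the drug."]

# Surface-form metric: no pre-training, no fine-tuning on quality judgments.
chrf = sacrebleu.corpus_chrf(hyp, [ref])
print("chrF:", chrf.score)

# Pre-trained + fine-tuned metric: trained on WMT human quality judgments.
model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
out = model.predict([{"src": src[0], "mt": hyp[0], "ref": ref[0]}],
                    batch_size=1, gpus=0)
print("COMET:", out.system_score)
```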
The study also explores how the choice of pre-trained model affects domain robustness. Using a stronger pre-trained model improves BERTScore but not COMET, indicating that COMET's poor performance on the biomedical domain stems from its fine-tuning stages rather than from the pre-trained model itself.
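For BERTScore, this kind of probe amounts to swapping the encoder behind the metric. The bert-score package exposes this through its model_type argument; the checkpoint names below are illustrative assumptions rather than the exact models compared in the paper.

```python
# Sketch: swap the pre-trained encoder behind BERTScore to see whether a
# stronger model changes out-of-domain behavior. Checkpoints are illustrative.
from bert_score import score

hyp = ["The patient received 5 mg of the drug."]
ref = ["The patient was given 5 mg of the medication."]

for model_type in ["roberta-large", "microsoft/deberta-xlarge-mnli"]:
    P, R, F1 = score(hyp, ref, model_type=model_type)
    print(model_type, round(F1.mean().item(), 4))
```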
The paper concludes that while fine-tuned metrics like COMET perform well in-domain, they struggle under domain shift. Future work should focus on collecting more diverse human judgments for fine-tuning metrics and on improving their generalization during fine-tuning. The study also highlights the limitations of current approaches, including reliance on empirical assumptions and the lack of document-level evaluation in existing metrics.