4 Jun 2024 | Vilém Zouhar, Shuoyang Ding, Anna Currey, Tatyana Badeka, Jenyuan Wang, Brian Thompson
A new, extensive Multidimensional Quality Metrics (MQM)-annotated dataset covering 11 language pairs in the biomedical domain has been introduced. This dataset is used to investigate whether machine translation (MT) metrics fine-tuned on human-generated quality judgments are robust to domain shifts between training and inference. The study finds that fine-tuned metrics exhibit a substantial performance drop in the biomedical domain compared to metrics that rely on surface form or on pre-trained models not fine-tuned on MT quality judgments.
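To make the MQM annotation concrete, the sketch below collapses span-level MQM errors into a single segment score. The severity weights follow the common MQM convention (minor = 1, major = 5, critical = 25); the exact weighting used for this dataset is an assumption here, not a detail taken from the paper.

```python
# Minimal sketch: turning MQM error annotations into a segment-level score.
# Severity weights follow the widely used MQM convention; the weights applied
# to this particular dataset are an assumption.
SEVERITY_WEIGHTS = {"minor": 1.0, "major": 5.0, "critical": 25.0}

def mqm_segment_score(errors):
    """errors: list of (error_category, severity) pairs for one translated segment.
    Returns a non-positive penalty; 0.0 means no errors were annotated."""
    return -sum(SEVERITY_WEIGHTS.get(severity, 0.0) for _, severity in errors)

# One minor fluency error plus one major accuracy error -> -6.0
print(mqm_segment_score([("fluency/grammar", "minor"),
                         ("accuracy/mistranslation", "major")]))
```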
The research examines how different types of metrics perform on the new biomedical test set relative to the WMT test set. Fine-tuned metrics show lower correlation with human judgments on the biomedical data, whereas the other metric types actually correlate more strongly there. This indicates that fine-tuned metrics struggle when the inference-time domain does not match the training domain. The performance gap persists throughout the fine-tuning process and is not due to a deficiency in the pre-trained model.
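The comparison itself boils down to correlating each metric's segment scores with the MQM-derived human scores. The snippet below illustrates this with Kendall's tau on toy data; the choice of correlation statistic is an assumption for illustration, not necessarily the one reported in the paper.

```python
# Sketch: correlate a metric's segment scores with MQM-based human scores.
from scipy.stats import kendalltau

human_scores  = [-6.0, 0.0, -1.0, -25.0]   # MQM penalties per segment (toy data)
metric_scores = [0.62, 0.91, 0.80, 0.30]   # scores from some MT metric (toy data)

tau, p_value = kendalltau(metric_scores, human_scores)
print(f"Kendall tau = {tau:.3f} (p = {p_value:.3f})")
```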
The study also compares the performance of different metric types: Surface-Form metrics, Pre-trained+Fine-tuned metrics, Pre-trained+Algorithm metrics, and Pre-trained+Prompt metrics. Pre-trained+Algorithm metrics, which are not trained on WMT data, do not exhibit the same domain gap as fine-tuned metrics. Pre-trained+Prompt metrics also show a large performance gap, but their underlying models are not publicly available, which limits further analysis.
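As a rough illustration of two ends of this spectrum, the snippet below scores the same hypothesis with a surface-form metric (chrF via sacrebleu) and a pre-trained+fine-tuned metric (COMET via the unbabel-comet package). The API calls reflect recent releases of these packages and may differ by version; the checkpoint name is one commonly used COMET model, not necessarily the one evaluated in the paper.

```python
# Illustrative calls for two metric families on the same segment.
import sacrebleu
from comet import download_model, load_from_checkpoint

src = ["Le patient a reçu 5 mg du médicament."]
hyp = ["The patient was administered 5 mg of the drug."]
ref = ["The patient received 5 mg of the drug."]

# Surface-form metric: no pre-training, no fine-tuning on quality judgments.
chrf = sacrebleu.corpus_chrf(hyp, [ref])
print("chrF:", chrf.score)

# Pre-trained + fine-tuned metric: trained on WMT human quality judgments.
model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
out = model.predict([{"src": src[0], "mt": hyp[0], "ref": ref[0]}],
                    batch_size=1, gpus=0)
print("COMET:", out.system_score)
```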
The study also explores how the choice of pre-trained model affects domain robustness. Using a stronger pre-trained model improves BERTScore but not COMET, indicating that COMET's poor performance on the biomedical domain stems from its fine-tuning stages rather than from the pre-trained model itself.
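For BERTScore, this kind of probe amounts to swapping the encoder behind the metric. The bert-score package exposes this through its model_type argument; the checkpoint names below are illustrative assumptions rather than the exact models compared in the paper.

```python
# Sketch: swap the pre-trained encoder behind BERTScore to see whether a
# stronger model changes out-of-domain behavior. Checkpoints are illustrative.
from bert_score import score

hyp = ["The patient received 5 mg of the drug."]
ref = ["The patient was given 5 mg of the medication."]

for model_type in ["roberta-large", "microsoft/deberta-xlarge-mnli"]:
    P, R, F1 = score(hyp, ref, model_type=model_type)
    print(model_type, round(F1.mean().item(), 4))
```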
The paper concludes that while fine-tuned metrics like COMET perform well in-domain, they struggle under domain shift. Future work should focus on collecting more diverse human judgments for fine-tuning metrics and on improving their generalization during fine-tuning. The study also highlights the limitations of current approaches, including reliance on empirical assumptions and the lack of document-level evaluation in existing metrics.