Fine-Tuned Machine Translation Metrics Struggle in Unseen Domains

4 Jun 2024 | Vilém Zouhar, Shuoyang Ding, Anna Currey, Tatyana Badeka, Jenyuan Wang, Brian Thompson
This paper investigates how machine translation (MT) metrics perform across domains, focusing on the robustness of fine-tuned metrics. The authors introduce a new, extensive Multidimensional Quality Metrics (MQM) dataset covering 11 language pairs in the biomedical domain and use it to evaluate the domain robustness of three metric families: fine-tuned metrics, pre-trained metrics, and surface-form metrics. The study finds that fine-tuned metrics, which are trained on human-generated MT quality judgments, exhibit a significant performance drop in the unseen biomedical domain compared to metrics that rely on surface form or pre-trained models. The authors attribute this gap to the mismatch between the training and inference domains rather than to a deficiency in the underlying pre-trained model. They also analyze the fine-tuning process of COMET, a popular fine-tuned metric, and find that the domain gap persists throughout the fine-tuning stages. They conclude that while fine-tuned metrics perform well in the domains they were trained on, they struggle in unseen domains because of the domain-specific nature of their training data. The findings suggest that future work should focus on improving the generalization of fine-tuned metrics and on collecting more diverse human judgments for training them.
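The underlying methodology is metric meta-evaluation: score the same translations with each MT metric and with human MQM annotators, then measure how well the metric's rankings agree with the human ones. Below is a minimal, hypothetical sketch of that comparison using chrF (a surface-form metric from the sacrebleu library) and segment-level Kendall's tau; the segments and MQM scores are invented for illustration, and a fine-tuned metric such as COMET would be scored with its own library and plugged into the same correlation step.

```python
# Minimal sketch of segment-level metric meta-evaluation against human
# MQM judgments. The data below is hypothetical, for illustration only.
from sacrebleu.metrics import CHRF
from scipy.stats import kendalltau

segments = [
    # (hypothesis, reference, human MQM score; higher = fewer error penalties)
    ("The patient shows signs of improvement.",
     "The patient is showing signs of improvement.", -1.0),
    ("Administer dose twice daily.",
     "Administer the dose twice a day.", -5.0),
    ("Effects side may nausea include.",
     "Side effects may include nausea.", -12.0),
]

# Surface-form metric; a fine-tuned metric (e.g. COMET) would replace this.
chrf = CHRF()
metric_scores = [chrf.sentence_score(hyp, [ref]).score
                 for hyp, ref, _ in segments]
mqm_scores = [mqm for _, _, mqm in segments]

# Kendall's tau measures how consistently the metric ranks translations
# the same way human annotators do.
tau, _ = kendalltau(metric_scores, mqm_scores)
print(f"Kendall tau vs. MQM: {tau:.3f}")
```

In the paper's setup, the domain gap would surface as this correlation being markedly lower for fine-tuned metrics on out-of-domain (biomedical) segments than on in-domain data, while surface-form and pre-trained metrics degrade less.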