COMET: A Neural Framework for MT Evaluation

November 16–20, 2020 | Ricardo Rei, Craig Stewart, Ana C Farinha, Alon Lavie
COMET is a neural framework for training multilingual machine translation (MT) evaluation models that achieve new state-of-the-art levels of correlation with human judgments. The framework leverages recent breakthroughs in cross-lingual pretrained language modeling to create highly multilingual and adaptable MT evaluation models that use information from both the source input and a target-language reference translation to more accurately predict MT quality. Three models were trained on different types of human judgments: Direct Assessments (DA), Human-mediated Translation Edit Rate (HTER), and Multidimensional Quality Metrics (MQM). These models achieved new state-of-the-art performance on the WMT 2019 Metrics shared task and proved robust when evaluating high-performing systems.

The framework includes two distinct architectures: the Estimator model, which is trained to regress directly on a quality score, and the Translation Ranking model, which is trained to minimize the distance between a "better" hypothesis and both its corresponding reference and its original source. Both architectures use a cross-lingual encoder followed by a pooling layer to produce sentence embeddings. The Estimator model combines features derived from the source, hypothesis, and reference embeddings to predict quality scores, while the Translation Ranking model uses a triplet margin loss to optimize the embedding space.

The framework was evaluated on three corpora: the QT21 corpus, the WMT DARR corpus, and the MQM corpus. Across these, the COMET models outperformed existing metrics both in correlation with human judgments and in robustness to high-quality MT systems. The models generalized well across language pairs and remained effective even when trained on data that did not include English as a target language. Including the source-language input proved important for learning accurate predictions and improved overall correlation with human judgments.
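To make the Estimator architecture concrete, the sketch below combines pooled sentence embeddings of the hypothesis, source, and reference into a feature vector via element-wise products and absolute differences, then maps it to a scalar score with a linear layer. The exact feature set, dimensions, and regressor here are simplified, illustrative assumptions, not the released implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def pooled_embedding(dim: int) -> np.ndarray:
    # Stand-in for a pooled cross-lingual encoder embedding of one sentence.
    return rng.standard_normal(dim)

def estimator_features(hyp: np.ndarray, src: np.ndarray, ref: np.ndarray) -> np.ndarray:
    # Combine hypothesis, source, and reference embeddings into one feature
    # vector: raw embeddings plus element-wise products and absolute
    # differences (an assumed, simplified feature set).
    return np.concatenate([
        hyp, ref,
        hyp * src, hyp * ref,
        np.abs(hyp - src), np.abs(hyp - ref),
    ])

def regress(features: np.ndarray, weights: np.ndarray, bias: float) -> float:
    # A single linear layer standing in for the feed-forward regressor
    # that maps the features to a scalar quality score.
    return float(features @ weights + bias)

dim = 8
hyp, src, ref = (pooled_embedding(dim) for _ in range(3))
feats = estimator_features(hyp, src, ref)
weights = rng.standard_normal(feats.shape[0]) * 0.01
score = regress(feats, weights, 0.0)
```

In the trained model the regressor is a learned feed-forward network and the embeddings come from the fine-tuned cross-lingual encoder; here both are mocked so the feature-combination step stands out.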
The COMET framework is released to the research community, along with the trained MT evaluation models and detailed scripts for running all reported baselines. The framework is built on top of PyTorch Lightning, a lightweight PyTorch wrapper that provides maximal flexibility and reproducibility. The framework has the potential to be used for a wide range of MT evaluation tasks and is expected to contribute to the development of more accurate and robust MT evaluation metrics.
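The Translation Ranking objective described earlier can be sketched with a plain triplet margin loss: the anchor is the source (or reference) embedding, the positive is the "better" hypothesis, and the negative is the worse one. The Euclidean distance and the margin value below are illustrative assumptions, not the framework's exact configuration.

```python
import numpy as np

def euclidean(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.linalg.norm(a - b))

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    # The loss is zero once the better hypothesis is closer to the anchor
    # than the worse hypothesis by at least `margin`; otherwise training
    # pulls the positive in and pushes the negative away.
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

src = np.array([0.0, 0.0])      # anchor: pooled source embedding
better = np.array([0.0, 1.0])   # positive: "better" hypothesis embedding
worse = np.array([5.0, 0.0])    # negative: worse hypothesis embedding

loss_ok = triplet_margin_loss(src, better, worse)   # already well separated
loss_bad = triplet_margin_loss(src, worse, better)  # positive loss: needs optimizing
```

In COMET this loss is applied over embeddings produced by the shared cross-lingual encoder, so minimizing it shapes the embedding space itself rather than a task-specific head.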