21 May 2020 | Thibault Sellam, Dipanjan Das, Ankur P. Parikh
BLEURT is a learned evaluation metric for text generation, built on BERT, that can model human judgments from a few thousand training examples. Its key component is a novel pre-training scheme: before being fine-tuned on human ratings, the model is pre-trained on large amounts of synthetic data, which improves generalization. The pre-training uses multiple signals, including automatic metrics, backtranslation likelihood, and textual entailment. BLEURT achieves state-of-the-art results on the WMT Metrics Shared Task and the WebNLG Competition dataset, covering both translation and data-to-text tasks. It outperforms a BERT-based approach without this pre-training, especially when training data is scarce or out-of-distribution. It is also robust to quality drifts and can adapt to new tasks with limited data. The model is effective in both IID and out-of-distribution settings, and the experiments show that pre-training significantly improves its performance.
BLEURT is a competitive and robust metric for text generation evaluation.
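To make the synthetic pre-training idea more concrete, here is a minimal sketch of how perturbed sentence pairs and an automatic-metric signal could be generated. It is an illustration under stated assumptions, not the authors' pipeline: the function names (`drop_words`, `unigram_f1`, `make_synthetic_pairs`) are hypothetical, random word dropping stands in for the perturbation step, and a toy unigram-overlap F1 stands in for real signals such as BLEU, backtranslation likelihood, or entailment scores.

```python
import random
from collections import Counter


def drop_words(tokens, drop_prob, rng):
    """Randomly drop tokens to create a perturbed variant (illustrative perturbation)."""
    kept = [t for t in tokens if rng.random() >= drop_prob]
    # Keep at least one token so the perturbed sentence is never empty.
    return kept if kept else [tokens[0]]


def unigram_f1(reference, candidate):
    """Unigram-overlap F1: a toy stand-in for an automatic-metric signal."""
    overlap = sum((Counter(reference) & Counter(candidate)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


def make_synthetic_pairs(sentences, n_variants=2, drop_prob=0.3, seed=0):
    """Build (reference, candidate, signal) records for pre-training.

    All parameter names and defaults are assumptions made for this sketch.
    """
    rng = random.Random(seed)
    pairs = []
    for sentence in sentences:
        tokens = sentence.split()
        if not tokens:
            continue
        for _ in range(n_variants):
            perturbed = drop_words(tokens, drop_prob, rng)
            pairs.append({
                "reference": sentence,
                "candidate": " ".join(perturbed),
                "signal": unigram_f1(tokens, perturbed),
            })
    return pairs


if __name__ == "__main__":
    for pair in make_synthetic_pairs(["the cat sat on the mat"]):
        print(pair)
```

In a full setup, records like these would supply the pre-training targets for a BERT-based regressor, which would then be fine-tuned on human ratings; that training loop is omitted here.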