Date:2020-02-24
Author:Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, Yoav Artzi
Pages:43
Summary:BERTSCORE is an automatic evaluation metric for text generation, designed to address the limitations of existing metrics such as BLEU and METEOR. It computes a similarity score between each token in the candidate sentence and each token in the reference sentence, using contextual embeddings from pre-trained BERT models. BERTSCORE correlates better with human judgments and provides stronger model selection performance than existing metrics. It is also more robust than existing metrics to challenging examples in adversarial paraphrase detection. The metric is evaluated on machine translation and image captioning, where it shows high correlation with human evaluations. BERTSCORE is simple, task-agnostic, and easy to use, making it a practical tool for evaluating text generation systems.
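
Example:The token-to-token similarity described in the summary can be illustrated with a minimal sketch. It is not the authors' implementation. Small hand-written vectors stand in for BERT contextual embeddings. The summary does not say how the pairwise scores are combined, so the sketch assumes cosine similarity with greedy matching: each token is paired with its most similar counterpart, and the matches are averaged into precision, recall, and F1. The names `cosine` and `greedy_match_score` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_match_score(reference, candidate):
    """Return (precision, recall, f1) for two lists of token vectors.

    Assumption: each token is greedily matched to its most similar token
    in the other sentence, and the matched similarities are averaged.
    """
    # sim[i][j] = similarity of reference token i and candidate token j
    sim = [[cosine(r, c) for c in candidate] for r in reference]

    # Recall: best match in the candidate for every reference token.
    recall = sum(max(row) for row in sim) / len(reference)

    # Precision: best match in the reference for every candidate token.
    precision = sum(
        max(sim[i][j] for i in range(len(reference)))
        for j in range(len(candidate))
    ) / len(candidate)

    total = precision + recall
    f1 = 2 * precision * recall / total if total != 0 else 0.0
    return precision, recall, f1


# Toy 2-dimensional "embeddings"; a real run would use BERT outputs.
reference = [[1.0, 0.0], [0.0, 1.0]]
candidate = [[1.0, 0.0], [1.0, 1.0]]

precision, recall, f1 = greedy_match_score(reference, candidate)
print(round(precision, 4), round(recall, 4), round(f1, 4))
```

In practice the toy vectors would be replaced by per-token outputs of a pre-trained BERT model. The matching logic stays the same under the assumptions stated above.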