BERTScore: Evaluating Text Generation with BERT

24 Feb 2020 | Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, Yoav Artzi
BERTScore is an automatic evaluation metric for text generation that uses pre-trained BERT contextual embeddings to compute the similarity between candidate and reference sentences. Unlike traditional metrics such as BLEU and METEOR, which rely on exact n-gram matching, BERTScore measures token similarity with contextualized embeddings, which better capture semantic meaning and are more robust to paraphrases and semantic variations.

BERTScore has been evaluated on 363 machine translation and image captioning systems, showing strong correlation with human judgments and better model-selection performance than existing metrics. It is also robust on adversarial paraphrase detection tasks, outperforming other metrics on challenging examples.

BERTScore is computed by matching tokens between the candidate and reference sentences using the cosine similarity of their embeddings, optionally weighting tokens by inverse document frequency (IDF). The metric is designed to be simple, task-agnostic, and easy to use, and extensive experiments demonstrate its effectiveness across languages and tasks. It is particularly effective for machine translation and image captioning, and its performance improves further with importance weighting and an appropriate choice of contextual embedding model. The code for BERTScore is available at https://github.com/Tiiiger/bert_score.
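The token-matching step described above can be sketched in plain Python. This is a minimal illustration, not the official implementation: it assumes each token has already been mapped to a (toy) embedding vector, whereas real BERTScore obtains contextual embeddings from a BERT model and normalizes them. Each reference token is greedily matched to its most similar candidate token (giving recall), each candidate token to its most similar reference token (giving precision), and the two are combined into an F1 score:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def bertscore(cand_emb, ref_emb):
    """Greedy-matching BERTScore sketch over pre-computed token embeddings.

    cand_emb, ref_emb: lists of token embedding vectors (hypothetical inputs;
    real BERTScore derives these from a pre-trained BERT model).
    Returns (precision, recall, F1).
    """
    # Recall: each reference token matched to its best candidate token.
    recall = sum(max(cosine(r, c) for c in cand_emb) for r in ref_emb) / len(ref_emb)
    # Precision: each candidate token matched to its best reference token.
    precision = sum(max(cosine(c, r) for r in ref_emb) for c in cand_emb) / len(cand_emb)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy 2-D embeddings standing in for contextual BERT vectors.
candidate = [[1.0, 0.2], [0.1, 0.9]]
reference = [[0.9, 0.3], [0.0, 1.0]]
p, r, f = bertscore(candidate, reference)
```

The optional IDF weighting mentioned above would replace the uniform averages with IDF-weighted sums, so that rare (more informative) tokens contribute more to the score.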