LV-Eval: A Balanced Long-Context Benchmark with 5 Length Levels Up to 256K

6 Feb 2024 | Tao Yuan, Xuefei Ning, Dong Zhou, Zhijie Yang, Shiyao Li, Minghui Zhuang, Zheyue Tan, Zhuyu Yao, Dahua Lin, Boxun Li, Guohao Dai, Shengen Yan, Yu Wang
LV-Eval is a long-context benchmark with five length levels (16k, 32k, 64k, 128k, and 256k) designed to evaluate the long-context understanding ability of large language models (LLMs). It covers two main tasks, single-hop QA and multi-hop QA, across 11 bilingual datasets. The benchmark incorporates three key techniques: confusing facts insertion, keyword and phrase replacement, and a keyword-recall-based metric design. These techniques enable controllable evaluation across different context lengths, challenge models with confusing facts, mitigate knowledge leakage, and provide more objective evaluation.

Evaluating 10 LLMs on LV-Eval yields several findings. Commercial LLMs generally outperform open-source LLMs at length levels shorter than their claimed context lengths, but open-source LLMs with longer context windows surpass them on the benchmark overall. LLMs with extremely long context windows, such as Yi-6B-200K, show relatively gentle performance degradation as context length grows, yet may not necessarily outperform models with shorter context windows. Model performance can degrade significantly in the presence of confusing information, especially in "needle in a haystack"-style tasks, and models perform better on single-hop QA than on multi-hop QA. Knowledge leakage from pretraining data and inaccurate metrics both bias evaluation, and LV-Eval is designed to mitigate these effects.

Concretely, LV-Eval's data construction combines context mixing-up, confusing facts insertion, and keyword and phrase replacement to increase task difficulty and reduce knowledge leakage, while the keyword-recall-based metric focuses scoring on answer keywords, reducing the influence of non-informative words; both ideas are illustrated in the sketches below. Ablation studies demonstrate the effectiveness of these techniques, and the benchmark provides a more controllable and comprehensive evaluation of LLMs' long-context capabilities. The datasets and evaluation code are available at https://github.com/infinigence/LVEval, and LV-Eval serves as a valuable resource for future research on long-context LLMs.
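To make the keyword and phrase replacement idea concrete, here is a minimal sketch assuming a hand-built replacement map that is applied consistently to the context, question, and gold answer. The map, the example facts, and the helper `replace_keywords` are illustrative only and are not taken from LV-Eval's actual construction pipeline.

```python
import re


def replace_keywords(text: str, replacement_map: dict[str, str]) -> str:
    """Rewrite each keyword/phrase with its substitute, longest phrases first,
    so multi-word phrases are not partially rewritten by shorter keys."""
    for source in sorted(replacement_map, key=len, reverse=True):
        pattern = re.compile(re.escape(source), flags=re.IGNORECASE)
        text = pattern.sub(replacement_map[source], text)
    return text


# Hypothetical QA pair: applying the same map to context, question, and answer
# keeps the sample self-consistent while breaking the link to facts the model
# may have memorized during pretraining.
replacement_map = {"Marie Curie": "Lena Horvat", "Warsaw": "Velmora"}
context = "Marie Curie was born in Warsaw in 1867."
question = "In which city was Marie Curie born?"
answer = "Warsaw"

print(replace_keywords(context, replacement_map))   # Lena Horvat was born in Velmora in 1867.
print(replace_keywords(question, replacement_map))  # In which city was Lena Horvat born?
print(replace_keywords(answer, replacement_map))    # Velmora
```

After such a replacement, a model can only answer correctly by retrieving the rewritten fact from the long context rather than from memorized knowledge, which is the leakage-mitigation effect the benchmark aims for.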
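The keyword-recall-based metric is defined precisely in the LV-Eval repository; the sketch below only illustrates the general idea of gating a token-level F1 score on recall of annotated answer keywords, so that overlap on non-informative words alone cannot produce a high score. The function names and the 0.5 threshold are assumptions for illustration.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Plain token-level F1 between a prediction and the gold answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0 or not pred_tokens or not ref_tokens:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def keyword_recall(prediction: str, keywords: list[str]) -> float:
    """Fraction of annotated answer keywords that appear in the prediction."""
    if not keywords:
        return 1.0
    pred = prediction.lower()
    return sum(kw.lower() in pred for kw in keywords) / len(keywords)


def keyword_gated_f1(prediction: str, reference: str,
                     keywords: list[str], threshold: float = 0.5) -> float:
    """Score a prediction only if it recalls enough answer keywords;
    otherwise return 0, so non-informative word overlap cannot inflate it."""
    if keyword_recall(prediction, keywords) < threshold:
        return 0.0
    return token_f1(prediction, reference)


# Hypothetical example: the second prediction shares many filler words with
# the reference but misses both keywords, so it scores 0 instead of a high F1.
ref = "The treaty was signed in Velmora in 1903."
keys = ["Velmora", "1903"]
print(keyword_gated_f1("It was signed in Velmora in 1903.", ref, keys))       # > 0
print(keyword_gated_f1("The treaty was signed in the capital.", ref, keys))   # 0.0
```

A plain token-level F1 would give the second prediction roughly the same score as the first, since most of its tokens overlap with the reference; the keyword gate is what pushes it to zero, matching the stated goal of reducing the influence of non-informative words.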