6 Feb 2024 | Tao Yuan, Xuefei Ning, Dong Zhou, Zhijie Yang, Shiyao Li, Minghui Zhuang, Zheyue Tan, Zhuyu Yao, Dahua Lin, Boxun Li, Guohao Dai, Shengen Yan, Yu Wang
LV-Eval is a new benchmark designed to evaluate the long-context understanding capabilities of large language models (LLMs). It features five length levels (16k, 32k, 64k, 128k, and 256k words) and includes two main tasks: single-hop QA and multi-hop QA, with 11 bilingual datasets. The benchmark incorporates three key techniques: confusing facts insertion, keyword and phrase replacement, and a keyword-recall-based metric design. These techniques aim to make the evaluation more challenging, mitigate knowledge leakage, and ensure more objective scoring. The paper evaluates 10 LLMs on LV-Eval and conducts ablation studies to demonstrate the effectiveness of these techniques. Key findings include:
1. Commercial LLMs generally outperform open-source LLMs at shorter context lengths but are surpassed by open-source LLMs at longer context lengths.
2. Extremely long-context LLMs, such as Yi-6B-200k, show gentler performance degradation as context length grows, but may not consistently outperform shorter-context LLMs.
3. LLMs' performance significantly degrades in the presence of confusing information, especially in the "needle in a haystack" task.
4. Knowledge leakage and inaccurate metrics introduce bias into evaluation, both of which LV-Eval alleviates through keyword and phrase replacement and its keyword-recall-based metric.
The paper also provides detailed evaluation results and discusses the limitations and future directions for improving long-context understanding in LLMs.
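To make the keyword-recall-based metric mentioned above more concrete, here is a minimal sketch of how such a two-stage score could be computed: keyword recall over annotated answer keywords gates (and scales) a standard word-level F1. The function names, the 0.5 threshold, and the exact gating-then-scaling structure are illustrative assumptions, not the paper's precise formulation.

```python
from collections import Counter


def word_f1(prediction: str, reference: str) -> float:
    """Standard bag-of-words F1 between a prediction and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def keyword_recall_score(prediction: str, reference: str,
                         keywords: list[str], threshold: float = 0.5) -> float:
    """Sketch of a keyword-recall-gated score (hypothetical parameters).

    `keywords` are assumed to be answer-critical phrases annotated per example;
    `threshold` is an illustrative cutoff, not a value taken from the paper.
    """
    pred_lower = prediction.lower()
    hits = sum(1 for kw in keywords if kw.lower() in pred_lower)
    kw_recall = hits / len(keywords) if keywords else 1.0
    # If too few annotated keywords appear in the prediction, treat the answer
    # as missing the point and assign a zero score.
    if kw_recall < threshold:
        return 0.0
    # Otherwise scale the word-level F1 by keyword recall, so answers that
    # cover more of the critical facts score higher.
    return kw_recall * word_f1(prediction, reference)


if __name__ == "__main__":
    gold = "The treaty was signed in Vienna in 1815."
    print(keyword_recall_score("It was signed in Vienna in 1815.",
                               gold, keywords=["Vienna", "1815"]))  # nonzero
    print(keyword_recall_score("It was signed in Paris.",
                               gold, keywords=["Vienna", "1815"]))  # 0.0
```

The intent of a design like this is that superficially fluent answers which omit the annotated key facts receive little or no credit, which is how the benchmark aims to make scoring more objective than plain string-overlap metrics.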