6 Aug 2024 | Cheng-Ping Hsieh*, Simeng Sun*, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, Boris Ginsburg
RULER is a new synthetic benchmark designed to evaluate the long-context capabilities of language models (LMs). It introduces four task categories—retrieval, multi-hop tracing, aggregation, and question answering—to provide a more comprehensive assessment of LMs' long-context understanding beyond simple retrieval. The benchmark includes flexible configurations for sequence length and task complexity, allowing for diverse task setups.
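To make the idea of a configurable synthetic task concrete, the sketch below shows what a needle-in-a-haystack-style retrieval example with an adjustable haystack size and a variable number of distractor needles might look like. The function names, prompt wording, and distractor scheme are illustrative assumptions, not RULER's actual implementation.

```python
# Minimal sketch of a RULER-style synthetic retrieval (NIAH) task.
# All names and prompt templates here are hypothetical, chosen only to
# illustrate how sequence length and task complexity can be dialed up.
import random
import string
import uuid


def make_niah_example(num_filler_sentences: int = 200, num_distractors: int = 3):
    """Build one needle-in-a-haystack prompt with a configurable haystack
    size and a configurable number of distractor needles."""
    filler = "The grass is green. The sky is blue. The sun is yellow. "
    haystack = [filler] * num_filler_sentences

    def needle():
        key = "".join(random.choices(string.ascii_lowercase, k=8))
        value = str(uuid.uuid4())[:8]
        return key, value, f"The special magic number for {key} is {value}. "

    # One gold needle plus several distractor needles with different keys.
    gold_key, gold_value, gold_sentence = needle()
    sentences = haystack + [gold_sentence] + [needle()[2] for _ in range(num_distractors)]
    random.shuffle(sentences)

    prompt = (
        "".join(sentences)
        + f"\nWhat is the special magic number for {gold_key}? Answer: "
    )
    return {"prompt": prompt, "answer": gold_value}


if __name__ == "__main__":
    example = make_niah_example(num_filler_sentences=500, num_distractors=4)
    print(len(example["prompt"]), "characters of context")
    print("expected answer:", example["answer"])
```

Making the haystack length and distractor count explicit parameters is what lets a benchmark like this probe models at 4K, 32K, or 128K tokens with the same task definition.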
The authors evaluate 17 long-context LMs with RULER at context lengths ranging from 4K to 128K tokens. Despite achieving near-perfect scores on the vanilla needle-in-a-haystack (NIAH) test, almost all models exhibit large performance drops as the context length grows, and only about half maintain satisfactory performance at 32K tokens despite claiming to support that length or more. A deeper analysis of Yi-34B, which claims a 200K-token context window, reveals substantial room for improvement as input length and task complexity increase.
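One way to picture such a length sweep is sketched below. It assumes a generic model_generate(prompt) callable and reuses the hypothetical make_niah_example() helper from above; this is not the paper's evaluation harness, and a simple substring match stands in for a proper per-task metric.

```python
# Hypothetical accuracy sweep over increasing context lengths.
# `model_generate` is any callable that maps a prompt string to an output string.
CONTEXT_LENGTHS = [4_000, 8_000, 16_000, 32_000, 64_000, 128_000]


def sweep_accuracy(model_generate, examples_per_length: int = 50):
    results = {}
    for target_tokens in CONTEXT_LENGTHS:
        # Rough assumption: each filler sentence is on the order of a dozen tokens.
        num_filler = target_tokens // 13
        correct = 0
        for _ in range(examples_per_length):
            ex = make_niah_example(num_filler_sentences=num_filler)
            output = model_generate(ex["prompt"])
            correct += int(ex["answer"] in output)
        results[target_tokens] = correct / examples_per_length
    return results
```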
RULER is open-sourced to encourage further research on and evaluation of long-context LMs, highlighting the need for more robust and comprehensive benchmarks of these models' long-context capabilities. The study also identifies common failure modes, such as the inability to ignore distractors and ineffective utilization of the long context, and suggests that larger model sizes and longer training context lengths can improve performance.