2024 | Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, Boris Ginsburg
RULER is a synthetic benchmark for evaluating long-context language models (LMs). It comprises four task categories: retrieval, multi-hop tracing, aggregation, and question answering, and allows flexible configuration of sequence length and task complexity. RULER expands on the needle-in-a-haystack (NIAH) test with new tasks, such as multi-hop tracing and aggregation, that probe behaviors beyond simple retrieval from context. Evaluating 17 long-context LMs across 13 tasks shows that while nearly all models achieve strong scores on the vanilla NIAH test, their performance drops sharply as context length increases: despite claimed context sizes of 32K tokens or more, only half of the models maintain satisfactory performance at 32K. An analysis of Yi-34B, which supports a 200K context length, reveals large room for improvement as input length and task complexity grow. The results highlight the need for comprehensive evaluation of long-context LMs, since models that pass simple retrieval tests often fail on longer contexts or more complex tasks, and non-Transformer architectures such as RWKV and Mamba lag behind Transformers in long-context capability. The benchmark also surfaces characteristic failure modes, including a tendency to copy verbatim from context and an over-reliance on parametric knowledge. RULER is open-sourced to encourage further research on long-context LMs.
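To make the NIAH setup concrete, below is a minimal sketch of how such a retrieval example can be constructed: a key-value "needle" is buried at a random depth inside filler text, and the model is asked to retrieve the value at the end of the context. The function name, filler sentences, and rough token estimate are illustrative assumptions, not RULER's actual implementation.

```python
import random
import string

def make_niah_example(context_tokens: int = 4096, seed: int = 0) -> dict:
    """Build one needle-in-a-haystack example: hide a key-value 'needle'
    in filler text, then ask the model to retrieve the value.
    (Illustrative sketch only; RULER's real tasks are more varied.)"""
    rng = random.Random(seed)
    key = "".join(rng.choices(string.ascii_lowercase, k=8))
    value = "".join(rng.choices(string.digits, k=6))
    needle = f"The special magic number for {key} is {value}."
    # Repeated filler sentences approximate the haystack.
    filler = "The grass is green. The sky is blue. The sun is yellow. "
    n_repeats = max(1, context_tokens // 12)  # very rough tokens-per-filler estimate
    haystack = [filler] * n_repeats
    # Insert the needle at a random depth in the context.
    haystack.insert(rng.randrange(len(haystack)), needle)
    prompt = "".join(haystack) + f"\nWhat is the special magic number for {key}?"
    return {"prompt": prompt, "answer": value}

example = make_niah_example(context_tokens=8192, seed=42)
print(example["prompt"][-120:])   # question appears at the end of the context
print("expected:", example["answer"])
```

Varying `context_tokens` and the needle's insertion depth is what lets this style of test sweep sequence length and probe positional effects; RULER generalizes the idea further by varying the type and number of needles and adding distractors.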