19 Jun 2024 | Jinhyuk Lee, Anthony Chen, Zhuyun Dai, Dheeru Dua, Devendra Singh Sachan, Michael Boratko, Yi Luan, Sébastien M. R. Arnold, Vincent Perot, Siddharth Dalmia, Hexiang Hu, Xudong Lin, Panupong Pasupat, Aida Amini, Jeremy R. Cole, Sebastian Riedel, Iftekhar Naim, Ming-Wei Chang, Kelvin Guu
Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?
Long-context language models (LCLMs) have the potential to revolutionize tasks that traditionally rely on external tools such as retrieval systems or databases. By natively processing large corpora, LCLMs enable user-friendly, end-to-end modeling and open the door to advanced prompting techniques. The LOFT benchmark evaluates LCLMs on real-world tasks requiring contexts of up to millions of tokens, showing that they can rival state-of-the-art retrieval and RAG systems without any explicit training for these tasks. However, LCLMs still struggle with the compositional reasoning required for SQL-like tasks. Prompting strategies also significantly influence performance, highlighting the need for further research. LOFT provides a rigorous testing ground for LCLMs, showcasing their potential to supplant existing paradigms and tackle novel tasks as model capabilities scale.
LOFT is a benchmark with six tasks across text, visual, and audio modalities, designed to push LCLMs to their limits. It supports automatic creation of contexts of increasing length, currently up to one million tokens. LOFT focuses on areas where LCLMs have the potential to disrupt existing approaches: retrieval, RAG, SQL, and many-shot in-context learning (ICL). LCLMs can directly ingest and retrieve information from a corpus, simplifying tasks and eliminating the need for specialized models. They also excel in many-shot ICL by scaling the number of in-context examples.
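To make the corpus-scaling idea concrete, here is a minimal sketch of how contexts of increasing length could be assembled: answer-bearing documents are always kept, and the remaining budget is filled with distractors. The document fields, budget values, and whitespace-based token counting are assumptions for illustration, not LOFT's actual pipeline.

import random

# Hypothetical sketch: sample a corpus down to a target context-length budget.
# Token counts use a crude whitespace approximation rather than a real tokenizer.
def build_context(corpus, gold_ids, budget_tokens, seed=0):
    """Select documents until the token budget is reached.

    Gold (answer-bearing) documents are always included so the task stays
    solvable; the rest of the budget is filled with shuffled distractors.
    """
    rng = random.Random(seed)
    selected = [d for d in corpus if d["id"] in gold_ids]
    used = sum(len(d["text"].split()) for d in selected)

    distractors = [d for d in corpus if d["id"] not in gold_ids]
    rng.shuffle(distractors)
    for doc in distractors:
        cost = len(doc["text"].split())
        if used + cost > budget_tokens:
            break
        selected.append(doc)
        used += cost

    rng.shuffle(selected)  # avoid positional cues for the gold documents
    return selected

# Scale the same task from a 32k-token context to a 1M-token context:
# corpus = [{"id": "doc_17", "text": "..."}, ...]
# ctx_small = build_context(corpus, {"doc_17"}, budget_tokens=32_000)
# ctx_large = build_context(corpus, {"doc_17"}, budget_tokens=1_000_000)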
The LOFT benchmark includes diverse datasets for retrieval, RAG, SQL, and ICL. Retrieval tasks span text, visual, and audio modalities, while RAG tasks let LCLMs simplify pipelines by reasoning directly over the corpus. SQL tasks explore LCLMs' ability to process databases as text, enabling natural language querying. Many-shot ICL tasks evaluate LCLMs' ability to scale from tens of examples to hundreds or thousands.
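For the SQL setting, processing a database "as text" can be pictured as flattening each table into the prompt and asking the question in natural language. The tables, column names, and prompt wording below are made up for illustration and are not taken from LOFT.

# Hypothetical sketch: serialize a tiny database as plain text so an LCLM can
# answer a natural-language question over it without a SQL engine.
def serialize_table(name, columns, rows):
    header = " | ".join(columns)
    body = "\n".join(" | ".join(str(v) for v in row) for row in rows)
    return f"Table: {name}\n{header}\n{body}"

players = serialize_table(
    "players",
    ["player_id", "name", "team"],
    [(1, "Ada", "Red"), (2, "Grace", "Blue")],
)
scores = serialize_table(
    "scores",
    ["player_id", "points"],
    [(1, 31), (2, 27)],
)

prompt = (
    "Answer the question using only the tables below.\n\n"
    f"{players}\n\n{scores}\n\n"
    "Question: Which team does the highest-scoring player play for?\n"
    "Answer:"
)
print(prompt)  # the assembled text would be sent to the LCLM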
Corpus-in-Context (CiC) prompting, a novel approach, enables LCLMs to process large corpora directly within their context window. It combines established prompting strategies and tailors them to leverage LCLMs' capabilities for learning, retrieving, and reasoning over in-context corpora. Instructions, corpus formatting, few-shot examples, and query formatting are the key components of CiC prompting.
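A minimal sketch of how those four components might be assembled into a single CiC prompt is shown below; the section markers, document fields, and wording are illustrative assumptions, not the paper's verbatim template.

# Illustrative Corpus-in-Context (CiC) prompt assembly: instruction, formatted
# corpus with citable IDs, few-shot examples, and the query to answer.
def format_corpus(docs):
    # Give each document an ID the model can cite in its answer.
    return "\n".join(
        f"ID: {d['id']} | TITLE: {d['title']} | TEXT: {d['text']}" for d in docs
    )

def cic_prompt(instruction, docs, few_shot, query):
    examples = "\n".join(f"query: {q}\nanswer: {a}" for q, a in few_shot)
    return (
        f"{instruction}\n\n"          # task instruction
        "======= CORPUS =======\n"
        f"{format_corpus(docs)}\n\n"  # entire corpus placed in context
        "======= EXAMPLES =======\n"
        f"{examples}\n\n"             # few-shot examples grounded in the corpus
        "======= QUERY =======\n"
        f"query: {query}\nanswer:"
    )

# Toy usage:
docs = [{"id": "0", "title": "LOFT", "text": "LOFT scales contexts to 1M tokens."}]
print(cic_prompt(
    "Find the document that answers the query and cite its ID.",
    docs,
    few_shot=[("What does LOFT scale?", "ID: 0")],
    query="How long are LOFT contexts?",
))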
Results show that LCLMs like Gemini 1.5 Pro perform comparably to specialized models on retrieval, visual, audio, and RAG tasks. However, they lag in complex multi-hop reasoning tasks. Performance varies significantly based on prompting strategies, emphasizing the need for further research. LOFT demonstrates that LCLMs can match specialized models' performance while revealing room for improvement in long-context reasoning as context windows scale.
LOFT's tasks include text retrieval, visual retrieval, audio retrieval, RAG, SQL-like reasoning, and many-shot ICL. Results show that LCLMs perform well on these tasks, though they face challenges in complex reasoning. The benchmark highlights the potential of LCLMs to supplant existing paradigms and tackle novel tasks as model capabilities scale.