This paper introduces RetrievalQA, a new benchmark for evaluating adaptive retrieval-augmented generation (ARAG) in short-form open-domain question answering. The dataset contains 1,271 questions covering new world and long-tail knowledge, where the necessary information is absent from large language models' (LLMs) parametric knowledge. This makes the dataset well suited for testing ARAG methods, since LLMs must decide whether to retrieve external information in order to answer correctly. The authors find that calibration-based methods rely heavily on threshold tuning, while vanilla prompting is inadequate for guiding LLMs to make reliable retrieval decisions. Based on these findings, they propose Time-Aware Adaptive REtrieval (TA-ARE), a simple yet effective method that helps LLMs assess the necessity of retrieval without calibration or additional training.
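As a rough illustration of how such a prompt-based retrieval decision might look, the following minimal sketch assumes a generic `llm_complete` callable and a small pool of in-context demonstrations (both hypothetical placeholders, not code from the paper); the time-aware element is simply injecting today's date into the prompt so the model can recognize questions about events past its training cutoff.

```python
from datetime import date

def needs_retrieval(question: str, llm_complete, demos: list[str]) -> bool:
    """Ask the LLM whether external retrieval is needed before answering."""
    # Sketch of a TA-ARE-style prompt: current date + in-context demonstrations,
    # then a binary retrieve / don't-retrieve decision.
    prompt = (
        f"Today's date is {date.today().isoformat()}.\n"
        "Decide whether answering the question below requires retrieving "
        "external documents (e.g., it concerns recent events or long-tail "
        "facts you may not know). Respond with [YES] or [NO].\n\n"
        + "\n\n".join(demos)  # demonstrations of the decision, format assumed
        + f"\n\nQuestion: {question}\nDecision:"
    )
    return "[YES]" in llm_complete(prompt).upper()
```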
The paper evaluates existing methods on RetrievalQA, including calibration-based and model-based approaches. Results show that calibration-based Self-RAG requires threshold tuning to balance QA performance against retrieval efficiency, while vanilla prompting is insufficient to guide LLMs toward reliable retrieval decisions. TA-ARE significantly improves all baselines, with an average gain of 14.9% in retrieval accuracy and 6.7% in QA accuracy. The method reduces LLM overconfidence and uncertainty, demonstrating its effectiveness in helping LLMs assess the necessity of retrieval.
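To make the threshold-tuning burden concrete, here is a small illustrative sweep (an assumption about how such confidence scores would be consumed, not the authors' code): a calibration-based method like Self-RAG exposes a per-question score for "retrieval needed", and the chosen cutoff determines both how often retrieval fires and how accurate the decisions are.

```python
def sweep_thresholds(scores, gold_needs_retrieval, thresholds=(0.1, 0.3, 0.5, 0.8)):
    """`scores`: per-question model confidences that retrieval is needed.
    `gold_needs_retrieval`: corresponding ground-truth booleans."""
    for t in thresholds:
        decisions = [s >= t for s in scores]
        retrieval_rate = sum(decisions) / len(decisions)
        decision_acc = sum(d == g for d, g in zip(decisions, gold_needs_retrieval)) / len(decisions)
        # Each threshold gives a different trade-off between retrieval frequency
        # and decision accuracy, which is the tuning burden noted above.
        print(f"threshold={t:.1f}  retrieval_rate={retrieval_rate:.2f}  decision_acc={decision_acc:.2f}")
```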
The authors also conduct an error analysis, finding that LLMs can discern the need for retrieval but struggle with long-tail questions. They further show that TA-ARE improves performance on both long-tail and new world knowledge questions. The paper highlights the importance of evaluating ARAG methods on datasets that accurately reflect the limitations of LLMs, and proposes TA-ARE as a simple yet effective solution for adaptive retrieval without calibration or additional training. The study also acknowledges limitations, including that some questions may be answerable without external information and that further research is needed to improve retrieval relevance and accuracy.
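The per-category breakdown described above could be computed along the lines of the following sketch; it assumes short-form QA accuracy is measured by gold-answer containment in the model output (a common convention, not necessarily the paper's exact metric) and that each example carries a category label such as "long-tail" or "new world".

```python
from collections import defaultdict

def accuracy_by_category(examples):
    """`examples`: iterable of dicts with keys 'category', 'prediction',
    and 'answers' (a list of acceptable gold answer strings)."""
    correct, total = defaultdict(int), defaultdict(int)
    for ex in examples:
        total[ex["category"]] += 1
        # Count the prediction as correct if any gold answer appears in it.
        if any(ans.lower() in ex["prediction"].lower() for ans in ex["answers"]):
            correct[ex["category"]] += 1
    return {cat: correct[cat] / total[cat] for cat in total}
```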