7 May 2024 | Orion Weller, Benjamin Chang, Sean MacAvaney, Kyle Lo, Arman Cohan, Benjamin Van Durme, Dawn Lawrie, Luca Soldaini
The paper "FOLLOWIR: Evaluating and Teaching Information Retrieval Models to Follow Instructions" by Orion Weller et al. addresses the gap in the use of detailed instructions in Information Retrieval (IR) systems. Modern Language Models (LMs) are capable of following complex instructions, but IR models often fail to utilize these instructions effectively. The authors introduce FOLLOWIR, a dataset that includes a rigorous evaluation benchmark and a training set to help IR models learn to follow real-world instructions. FOLLOWIR repurposes detailed instructions, known as *narratives*, developed for professional assessors to evaluate retrieval systems. The dataset is built from three collections curated for shared tasks at the Text REtrieval Conference (TREC), each containing hundreds to thousands of labeled documents per query. The authors develop a new pairwise evaluation framework, *p-MRR*, to measure how well IR models follow instructions. Results show that existing retrieval models struggle with long-form information and use instructions for basic keyword search rather than understanding relevance. However, the authors demonstrate that it is possible for IR models to learn to follow complex instructions, with the FOLLOWIR-7B model showing significant improvements after fine-tuning on their training set. The paper contributes a benchmark for evaluating instruction following in retrieval, analysis of why current models fail to understand instructions, and a training dataset for teaching retrieval models to follow instructions.The paper "FOLLOWIR: Evaluating and Teaching Information Retrieval Models to Follow Instructions" by Orion Weller et al. addresses the gap in the use of detailed instructions in Information Retrieval (IR) systems. Modern Language Models (LMs) are capable of following complex instructions, but IR models often fail to utilize these instructions effectively. The authors introduce FOLLOWIR, a dataset that includes a rigorous evaluation benchmark and a training set to help IR models learn to follow real-world instructions. FOLLOWIR repurposes detailed instructions, known as *narratives*, developed for professional assessors to evaluate retrieval systems. The dataset is built from three collections curated for shared tasks at the Text REtrieval Conference (TREC), each containing hundreds to thousands of labeled documents per query. The authors develop a new pairwise evaluation framework, *p-MRR*, to measure how well IR models follow instructions. Results show that existing retrieval models struggle with long-form information and use instructions for basic keyword search rather than understanding relevance. However, the authors demonstrate that it is possible for IR models to learn to follow complex instructions, with the FOLLOWIR-7B model showing significant improvements after fine-tuning on their training set. The paper contributes a benchmark for evaluating instruction following in retrieval, analysis of why current models fail to understand instructions, and a training dataset for teaching retrieval models to follow instructions.