7 May 2024 | Orion Weller, Benjamin Chang, Sean MacAvaney, Kyle Lo, Arman Cohan, Benjamin Van Durme, Dawn Lawrie, Luca Soldaini
This paper introduces FOLLOWIR, a new benchmark for evaluating how well information retrieval (IR) models follow instructions. It is built from three TREC collections, whose detailed instructions were originally written to guide professional annotators in judging document relevance; FOLLOWIR repurposes these as rigorous, realistic retrieval instructions. The release includes both an evaluation benchmark for instruction following and a training set for helping IR models learn to follow real-world instructions.
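As a rough illustration of how such annotator guidance can be repurposed (the exact collections, fields, and preprocessing used by FOLLOWIR are assumptions here), a TREC-style topic's short title can serve as the query while its longer assessor narrative serves as the instruction:

```python
from dataclasses import dataclass


@dataclass
class InstructedQuery:
    query_id: str
    query: str        # short keyword query (e.g. a TREC topic "title")
    instruction: str  # detailed assessor guidance (e.g. a TREC topic "narrative")


def to_instructed_queries(topics: list[dict]) -> list[InstructedQuery]:
    """Pair each TREC-style topic's title with its assessor narrative.

    `topics` is assumed to be a list of dicts with "id", "title", and
    "narrative" keys, e.g. parsed from a topics file. This is only a
    sketch: the actual FOLLOWIR construction also revises and narrows
    the instructions to test instruction following.
    """
    return [
        InstructedQuery(t["id"], t["title"], t["narrative"])
        for t in topics
        if t.get("narrative")
    ]
```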
The paper shows that current retrieval models generally fail to use instructions correctly, often falling back on basic keyword matching rather than interpreting the full instruction. However, the authors demonstrate that IR models can learn to follow complex instructions: their new FOLLOWIR-7B model shows significant improvements after fine-tuning on the training set.
The paper also introduces p-MRR, a new pairwise evaluation framework that measures instruction following by comparing a model's rankings under the original and modified instructions. Results show that models with over 3B parameters, or those built on instruction-tuned LMs, follow instructions more reliably.
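For intuition only, the sketch below shows one plausible way to turn a pair of rankings (original vs. modified instruction) into a single score that rewards rank movement in the expected direction. The function names and the exact scoring rule are illustrative assumptions, not the paper's precise definition of p-MRR.

```python
def pairwise_rank_change_score(rank_original: int, rank_modified: int) -> float:
    """Score how a document's rank moves when the instruction changes.

    Hypothetical sketch of a pairwise comparison in the spirit of p-MRR:
    positive when the document is ranked higher under the modified
    instruction, negative when it drops, 0 when unchanged (ranks start at 1).
    """
    if rank_original > rank_modified:
        return rank_original / rank_modified - 1
    return 1 - rank_modified / rank_original


def paired_score(rank_pairs: list[tuple[int, int]]) -> float:
    """Average the per-document score over all (original, modified) rank pairs."""
    return sum(pairwise_rank_change_score(o, m) for o, m in rank_pairs) / len(rank_pairs)


# Example: one document moves from rank 4 to rank 2 after the instruction
# changes, another stays at rank 7.
print(paired_score([(4, 2), (7, 7)]))  # 0.5 -> ((4/2 - 1) + 0) / 2
```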
The authors also show that fine-tuning on a training set with longer instructions improves a model's ability to follow them. They build and release a training corpus for teaching retrieval models to follow instructions, and the resulting FOLLOWIR-7B model improves on both standard retrieval metrics and instruction following.
The paper concludes that it is possible to train IR models to be better instruction followers, and that the new benchmark and model can help the community develop more capable instruction-following retrieval models.