22 Feb 2024 | Hanseok Oh, Hyunji Lee, Seonghyeon Ye, Haebin Shin, Hansol Jang, Changwook Jun, Minjoon Seo
INSTRUCTIR is a novel benchmark designed to evaluate the instruction-following ability of information retrieval models. The benchmark focuses on user-aligned instructions tailored to each query instance, reflecting the diverse characteristics of real-world search scenarios. The dataset includes 9,906 instance-wise instructions that describe the search user, such as their job, background, situation, location, hobbies, interests, search goals, and preferred sources. The data creation pipeline involves selecting seed examples from the MSMARCO dataset, generating instructions for each query, revising target texts to align with these instructions, and systematically filtering the generated content. The resulting dataset forms the INSTRUCTIR benchmark, and its quality is verified through a combination of human evaluation and machine filtering.
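To make the pipeline concrete, here is a minimal Python sketch of the selection–generation–revision–filtering loop. The `llm_generate` interface, the prompt wording, and the yes/no filtering criterion are illustrative assumptions, not the authors' exact prompts, models, or thresholds.

```python
from dataclasses import dataclass

# Hypothetical LLM interface -- stands in for whatever generation backend the
# actual pipeline used; plug in your own client. Not the paper's exact setup.
def llm_generate(prompt: str) -> str:
    raise NotImplementedError("connect an LLM client here")

@dataclass
class InstructIRExample:
    query: str        # seed query drawn from MSMARCO
    instruction: str  # instance-wise, user-aligned instruction
    target: str       # target passage revised to satisfy the instruction

def build_example(seed_query: str, seed_passage: str) -> InstructIRExample:
    # 1) Generate a user-aligned instruction (job, background, situation, goals, ...)
    instruction = llm_generate(
        "Write a search instruction describing the user behind this query "
        f"(their job, background, situation, preferred sources):\n{seed_query}"
    )
    # 2) Revise the seed passage so it actually satisfies that instruction
    target = llm_generate(
        "Rewrite the passage so it answers the query for this specific user.\n"
        f"Instruction: {instruction}\nQuery: {seed_query}\nPassage: {seed_passage}"
    )
    return InstructIRExample(seed_query, instruction, target)

def keep(example: InstructIRExample) -> bool:
    # 3) Machine filtering: ask an LLM judge whether the revised target is
    #    relevant to the query *under* the instruction (illustrative criterion).
    verdict = llm_generate(
        "Does the passage satisfy both the query and the instruction? Answer yes or no.\n"
        f"Instruction: {example.instruction}\nQuery: {example.query}\nPassage: {example.target}"
    )
    return verdict.strip().lower().startswith("yes")
```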
The benchmark introduces a Robustness score as an evaluation metric, quantifying how reliably retrievers follow instructions. Twelve retriever baselines are evaluated, spanning both naïve retrievers (not explicitly instruction-tuned) and instruction-tuned retrievers. The results show that task-style instruction-tuned retrievers, such as INSTRUCTOR, consistently underperform their non-tuned counterparts, pointing to potential overfitting when retrievers are trained on existing task-style instruction-aware retrieval datasets.
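The paper gives the exact definition of the Robustness score; the sketch below shows one plausible way such a metric could be computed, assuming each query's per-instruction retrieval scores (e.g. nDCG@10) are aggregated conservatively by taking their minimum, so a retriever only scores well if it follows every instruction variant. The min aggregation and the example numbers are assumptions for illustration, not the paper's formula.

```python
from statistics import mean
from typing import Dict, List

def per_query_robustness(per_instruction_scores: Dict[str, List[float]]) -> Dict[str, float]:
    """Map each query id to a robustness score.

    `per_instruction_scores` maps a query id to the retrieval scores obtained
    under each of that query's instruction variants. Taking the minimum is an
    illustrative, conservative aggregation choice.
    """
    return {qid: min(scores) for qid, scores in per_instruction_scores.items()}

def benchmark_robustness(per_instruction_scores: Dict[str, List[float]]) -> float:
    # Final benchmark number: average the per-query robustness scores.
    return mean(per_query_robustness(per_instruction_scores).values())

# Example: two queries, each paired with three instruction variants.
scores = {"q1": [0.9, 0.8, 0.7], "q2": [0.6, 0.6, 0.2]}
print(benchmark_robustness(scores))  # (0.7 + 0.2) / 2 = 0.45
```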
The benchmark also highlights the importance of large, instruction-tuned models for following search instructions effectively. The study further shows that retriever performance is sensitive to the order in which the instruction and the query are concatenated, and that retrievers that weight individual lexical terms are sensitive to paraphrased instructions. Reliance on lexical overlap between the instruction and the target can likewise steer retrieval toward the wrong passages, lowering robustness scores.
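As a rough illustration of the order-sensitivity finding, the sketch below encodes the same instruction–query pair in both orders with an off-the-shelf dense encoder and compares the resulting passage rankings. The model choice, concatenation scheme, and example texts are assumptions, not the benchmark's evaluation setup.

```python
from sentence_transformers import SentenceTransformer, util

# Probe order sensitivity: does "instruction + query" rank passages the same
# way as "query + instruction"? Model and texts are illustrative only.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

instruction = "I am a nurse looking for patient-friendly sources."
query = "side effects of ibuprofen"
passages = [
    "A plain-language leaflet on common ibuprofen side effects for patients.",
    "A pharmacokinetics study of ibuprofen metabolism in healthy adults.",
]

inputs = {
    "instruction_first": f"{instruction} {query}",
    "query_first": f"{query} {instruction}",
}
passage_emb = model.encode(passages, convert_to_tensor=True)

for name, text in inputs.items():
    query_emb = model.encode(text, convert_to_tensor=True)
    sims = util.cos_sim(query_emb, passage_emb)[0]          # cosine similarity to each passage
    ranking = sims.argsort(descending=True).tolist()        # passage indices, best first
    print(name, [passages[i][:40] for i in ranking])
```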
The study concludes that larger models benefit more from instruction tuning, and that the performance of retrievers is influenced by the diversity of instructions and the complexity of the tasks. The benchmark provides valuable insights into the instruction-following capabilities of information retrieval models and contributes to the development of more sophisticated, controllable, and instruction-aware information access systems.