INSTRUCTIR: A Benchmark for Instruction Following of Information Retrieval Models

22 Feb 2024 | Hanseok Oh, Hyunji Lee, Seonghyeon Ye, Haebin Shin, Hansol Jang, Changwook Jun, Minjoon Seo
The paper introduces INSTRUCTIR, a benchmark designed to evaluate the instruction-following ability of information retrieval models. Unlike existing benchmarks that rely on coarse-grained task descriptions, INSTRUCTIR uses instance-wise, user-aligned instructions tailored to each query, varying along factors such as the user's job, background, situation, location, hobbies, interests, and search goals, so as to reflect the diversity of real-world search scenarios.

The dataset is built through a multi-stage pipeline: seed examples are drawn from MSMARCO, instructions are generated with GPT-4, target texts are revised, and the results are filtered for quality and diversity. Evaluation uses a Robustness score, which measures how consistently a retriever returns the correct targets as instructions evolve.

Experiments with a range of retriever systems, both non-instruction-tuned and instruction-tuned, show that task-style instruction-tuned retrievers often underperform their non-instruction-tuned counterparts, pointing to overfitting on task-level instructions. The study also finds that larger models, and models tuned on diverse instruction-aware retrieval datasets, perform better. The authors suggest future work such as Reinforcement Learning from Human Feedback (RLHF) to further align retrieval models with user intentions.
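To make the instance-wise setup concrete, the sketch below shows one way a dense retriever could be scored on such data: the user-specific instruction is prepended to the query before encoding, and passages are ranked by cosine similarity. This is not the paper's evaluation code; the model name (`all-MiniLM-L6-v2`), the simple concatenation template, and the example passages are illustrative assumptions.

```python
# Minimal sketch (not INSTRUCTIR's official evaluation code): rank a corpus
# against an instruction-augmented query with an off-the-shelf bi-encoder.
# Model name, prompt template, and example texts are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # any dense retriever works here

def retrieve(instruction: str, query: str, corpus: list[str], top_k: int = 3):
    """Rank corpus passages for a query prefixed with an instance-wise instruction."""
    # Plain concatenation; instruction-tuned retrievers may expect their own template.
    augmented_query = f"{instruction} {query}"
    q_emb = model.encode(augmented_query, convert_to_tensor=True)
    d_emb = model.encode(corpus, convert_to_tensor=True)
    scores = util.cos_sim(q_emb, d_emb)[0]
    ranked = scores.argsort(descending=True)[:top_k]
    return [(corpus[i], float(scores[i])) for i in ranked]

corpus = [
    "Beginner-friendly guide to trail running shoes.",
    "Technical review of carbon-plated racing shoes for marathoners.",
]
# Two different instructions paired with the same query should surface different targets.
print(retrieve("I am new to running and want simple advice.", "best running shoes", corpus))
print(retrieve("I am a competitive marathoner comparing race-day gear.", "best running shoes", corpus))
```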
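The Robustness score itself is defined in the paper; as a rough illustration of the underlying idea, the sketch below runs the same retriever over several instruction rewrites of one query and reports both the average and the worst-case top-1 accuracy against each instruction's intended target. It reuses the hypothetical `retrieve` helper and `corpus` from the previous sketch, and the mean/min aggregation is an assumption, not the paper's formula.

```python
# Illustrative robustness-style check, NOT the paper's Robustness score:
# evaluate one retriever over several instruction variants of the same query
# and summarize how consistently each instruction's target is ranked first.
# Reuses the hypothetical `retrieve` and `corpus` from the sketch above.
from statistics import mean

def robustness_sketch(variants, corpus):
    """variants: list of (instruction, query, gold_passage) triples sharing a query."""
    hits = []
    for instruction, query, gold in variants:
        top1, _ = retrieve(instruction, query, corpus, top_k=1)[0]
        hits.append(1.0 if top1 == gold else 0.0)
    # Report average accuracy plus the worst case across instruction rewrites.
    return {"mean_hit@1": mean(hits), "min_hit@1": min(hits)}

variants = [
    ("I am new to running and want simple advice.", "best running shoes",
     "Beginner-friendly guide to trail running shoes."),
    ("I am a competitive marathoner comparing race-day gear.", "best running shoes",
     "Technical review of carbon-plated racing shoes for marathoners."),
]
print(robustness_sketch(variants, corpus))
```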