20 Aug 2024 | David Wadden, Kejian Shi, Jacob Morrison, Aakanksha Naik, Shruti Singh, Nitzan Barzilay, Kyle Lo, Tom Hope, Luca Soldaini, Shannon Zejiang Shen, Doug Downey, Hannaneh Hajishirzi, Arman Cohan
SciRIFF is a dataset of 137,000 instruction-following demonstrations for 54 tasks covering five essential scientific literature understanding capabilities: information extraction, summarization, question answering, claim verification, and classification. The dataset includes tasks from five scientific domains, ranging from artificial intelligence to clinical medicine. SciRIFF is designed to enhance and evaluate instruction-following capabilities of large language models (LLMs) in the specialized domain of scientific literature understanding. The tasks are derived from existing scientific literature understanding datasets with human-annotated inputs and outputs, and are converted to a common instruction-following format via templates written by the paper authors. The dataset includes long input contexts and requires structured model responses.
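For illustration, here is a minimal sketch of what a single template-converted instance might look like. The field names, task identifier, and example content below are assumptions for the sake of the example, not SciRIFF's actual schema.

```python
import json

# Hypothetical SciRIFF-style instance: an instruction, a long document context,
# and a structured (JSON) target response. Field names are illustrative only.
example = {
    "task": "claim_verification",  # hypothetical task identifier
    "instruction": (
        "You will be shown a scientific claim and a paper abstract. "
        "Decide whether the abstract SUPPORTS or REFUTES the claim, and "
        "return a JSON object with a 'verdict' field and an 'evidence' "
        "field listing the supporting sentences."
    ),
    "input": (
        "Claim: Vitamin D supplementation reduces fracture risk in adults.\n"
        "Abstract: [full abstract text would appear here]"
    ),
    # The target output is itself structured, serialized as JSON.
    "output": json.dumps(
        {"verdict": "REFUTES", "evidence": ["Sentence 3", "Sentence 7"]}
    ),
}

print(json.dumps(example, indent=2))
```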
To demonstrate the utility of SciRIFF, we develop a sample-efficient strategy to adapt a general instruction-following model for science by performing additional finetuning on a mix of general-domain and SciRIFF demonstrations. In evaluations on nine held-out scientific tasks, our model—called SciTülu—improves over a strong LLM baseline by 28.1% and 6.5% at the 7B and 70B scales respectively, while maintaining general instruction-following performance within 2% of the baseline. We are optimistic that SciRIFF will facilitate the development and evaluation of LLMs to help researchers navigate the ever-growing body of scientific literature. We release our dataset, model checkpoints, and data processing and evaluation code to enable further research.