STaRK: Benchmarking LLM Retrieval on Textual and Relational Knowledge Bases

20 May 2024 | Shirley Wu*, Shiyu Zhao*, Michihiro Yasunaga*, Kexin Huang*, Kaidi Cao*, Qian Huang*, Vassilis N. Ioannidis†, Karthik Subbian†, James Zou†,§, Jure Leskovec†,§
STaRK is a large-scale benchmark designed to evaluate the performance of retrieval systems driven by large language models (LLMs) on semi-structured knowledge bases (SKBs). These SKBs integrate unstructured textual data with structured relational information, making them representative of complex real-world queries. The benchmark covers three domains: product search, academic paper search, and precision medicine queries. To bridge the gap between textual and relational retrieval tasks, STaRK employs a novel pipeline that synthesizes realistic user queries integrating diverse relational information and complex textual properties. The queries are designed to be natural-sounding and relevant to real-world scenarios, with ground-truth answers constructed using multiple language models. Additionally, STaRK includes 274 human-generated queries to provide an authentic reference. Experiments on STaRK reveal significant challenges for current retrieval systems, particularly in handling the interplay between textual and relational requirements. The benchmark highlights the need for more capable retrieval systems that can effectively manage large-scale SKBs and private knowledge bases. The data and code for STaRK are available on GitHub, making it a valuable resource for researchers and practitioners in the field of information retrieval.
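To make the textual-relational interplay concrete, the sketch below (not the authors' method; all node names, data, and helper functions are hypothetical) shows a toy hybrid retriever over a miniature SKB: candidates must first satisfy a relational constraint, and are then ranked by a simple word-overlap score standing in for an LLM or embedding scorer.

```python
# Toy sketch of hybrid retrieval over a semi-structured KB (illustrative only,
# not STaRK's pipeline): filter candidates by a relational constraint, then
# rank the survivors by textual similarity to the query.
from dataclasses import dataclass, field

@dataclass
class Node:
    doc: str                                        # unstructured textual property
    relations: dict = field(default_factory=dict)   # relation name -> set of node ids

def text_score(query: str, doc: str) -> float:
    """Bag-of-words overlap as a stand-in for an LLM/embedding relevance score."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

def retrieve(query: str, relation: str, target_id: int, kb: dict, k: int = 3):
    """Rank nodes by textual score among those satisfying the relational constraint."""
    candidates = [
        (node_id, text_score(query, node.doc))
        for node_id, node in kb.items()
        if target_id in node.relations.get(relation, set())
    ]
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)[:k]

# Miniature product-search SKB (hypothetical data).
kb = {
    0: Node("waterproof hiking boots with ankle support", {"also_bought": {2}}),
    1: Node("lightweight trail running shoes", {"also_bought": {2}}),
    2: Node("merino wool hiking socks"),
}

# The query mixes a textual requirement ("waterproof ... hiking") with a
# relational one (frequently bought together with node 2, the socks).
print(retrieve("waterproof boots for hiking", "also_bought", target_id=2, kb=kb))
# -> [(0, 0.75), (1, 0.0)]
```

Real SKBs in STaRK are far larger, and the two signals are entangled inside a single natural-language query, which is precisely what the benchmark shows current retrieval systems struggle with.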