20 May 2024 | Shirley Wu, Shiyu Zhao, Michihiro Yasunaga, Kexin Huang, Kaidi Cao, Qian Huang, Vassilis N. Ioannidis, Karthik Subbian, James Zou, Jure Leskovec
STARK is a large-scale benchmark for evaluating retrieval systems on semi-structured knowledge bases that combine textual and relational information. It covers three domains: product search, academic paper search, and precision medicine, with 9,100 synthesized queries for product search, 13,323 for academic papers, and 11,204 for precision medicine. The queries, both synthesized and human-generated, integrate textual and relational requirements, come with ground-truth answers, and are designed to mimic real-world scenarios that require context-specific reasoning. The benchmark provides extensive data statistics and evaluation metrics, including Hit@k, Recall@k, and MRR, along with a human evaluation of the naturalness, diversity, and practicality of the queries. Its goal is to assess how well retrieval systems driven by large language models (LLMs) handle semi-structured knowledge bases. The results show that current retrieval systems struggle to reason over textual and relational information jointly, highlighting the need for more capable retrieval systems. The benchmark data and code are available on GitHub.
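The metrics named above (Hit@k, Recall@k, MRR) are standard retrieval measures. As a rough illustration of how they are typically computed per query, here is a minimal Python sketch; the function names and the toy data are hypothetical, and this is not the benchmark's official evaluation code.

```python
# Minimal sketch of standard retrieval metrics: Hit@k, Recall@k, and
# reciprocal rank (averaged over queries to get MRR). Assumes each query
# yields a ranked candidate list and a ground-truth answer set.
# Illustrative only, not STARK's official evaluation code.

def hit_at_k(ranked, relevant, k):
    """1.0 if any relevant item appears in the top-k results, else 0.0."""
    return float(any(item in relevant for item in ranked[:k]))

def recall_at_k(ranked, relevant, k):
    """Fraction of relevant items retrieved within the top-k results."""
    return len(set(ranked[:k]) & relevant) / len(relevant)

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant item (0.0 if none is retrieved)."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

# Example: one query whose ground-truth answer set is {"e1", "e7"}.
ranked = ["e3", "e1", "e9", "e7", "e2"]   # system's ranked candidates
relevant = {"e1", "e7"}                    # ground-truth answers
print(hit_at_k(ranked, relevant, k=1))     # 0.0
print(recall_at_k(ranked, relevant, k=5))  # 1.0
print(reciprocal_rank(ranked, relevant))   # 0.5
# Corpus-level scores are the means of these per-query values
# (MRR is the mean reciprocal rank across all queries).
```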