Developing a Test Collection for the Evaluation of Integrated Search

2010 | Lykke, Marianne; Larsen, Birger; Lund, Haakon; Ingwersen, Peter
The paper describes the development of a test collection for evaluating integrated search, that is, searching across multiple sources through a single search box with a single ranked result list. The collection covers the physics domain and contains approximately 18,000 monographic records, 160,000 papers, and 275,000 abstracts, along with 65 real-world tasks and graded relevance assessments. It supports both system- and user-oriented evaluations.

Integrated search is challenging because document types, metadata levels, and vocabularies vary. Scientific publications may be available in full text, with or without metadata, or only as metadata records with or without abstracts. Treating all document types equally in indexing and retrieval may overemphasize certain types, such as full-text documents, which are more easily retrieved and ranked. Test collections with sufficient diversity in document types and comprehensive relevance assessments for each type are currently lacking, which makes it difficult to evaluate different approaches to integrated search. An appropriately designed test collection would support the development of integrated search algorithms that better identify and rank relevant documents across different types.

The test collection was developed using a semi-laboratory/semi-real-life approach that incorporates users' genuine information needs and non-binary relevance judgments. The collection includes documents from arXiv.org, a large open-access repository of physics papers, and a Danish national database of bibliographic records. Two subsets were extracted: 160,000+ full-text papers in PDF with metadata, and 274,000+ metadata records with abstracts. Additionally, 18,000+ bibliographic book records classified as physics were added. Search tasks were collected from 23 participants (physicists, PhDs, and MSc students), resulting in 65 natural search tasks.
These tasks were used to create a relevance assessment system, allowing users to assign relevance scores to documents. The system enabled the evaluation of different aspects of information situations and work task contexts, providing a realistic and controlled environment for testing integrated search systems.
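
To make the role of graded relevance assessments and mixed document types more concrete, the following is a minimal Python sketch of how a single ranked result list might be scored against such a collection. It is an illustration only: the summary does not say which evaluation measures the authors used, so nDCG is swapped in here as one common measure for non-binary judgments. The 0 to 3 grading scale, the document identifiers, and the type labels (`fulltext`, `abstract`, `book`) are all hypothetical.

```python
import math
from collections import Counter


def dcg(gains):
    """Discounted cumulative gain of a list of graded relevance scores."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg_at_k(ranking, judgments, k):
    """nDCG@k for one search task; unjudged documents count as grade 0."""
    gains = [judgments.get(doc_id, 0) for doc_id in ranking[:k]]
    ideal = sorted(judgments.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


def type_share_at_k(ranking, doc_types, k):
    """Fraction of the top-k results that belongs to each document type."""
    top = ranking[:k]
    counts = Counter(doc_types[doc_id] for doc_id in top)
    return {doc_type: n / len(top) for doc_type, n in counts.items()}


# Hypothetical graded judgments for one task (assumed scale: 0 = not relevant, 3 = highly relevant).
judgments = {"paper-1": 3, "abs-1": 2, "book-1": 1, "paper-2": 0}

# Hypothetical mapping from document id to document type.
doc_types = {
    "paper-1": "fulltext",
    "paper-2": "fulltext",
    "abs-1": "abstract",
    "book-1": "book",
}

# A ranked list as an integrated search system might return it.
ranking = ["paper-1", "paper-2", "abs-1", "book-1"]

score = ndcg_at_k(ranking, judgments, k=3)
shares = type_share_at_k(ranking, doc_types, k=3)
print(f"nDCG@3 = {score:.3f}")
print(f"type shares in top 3: {shares}")
```

Reporting the per-type share next to the effectiveness score is one simple way to notice when full-text documents crowd out metadata-only records near the top of the list, which is the kind of imbalance the summary warns about.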