RCV1: A New Benchmark Collection for Text Categorization Research


2004 | David D. Lewis, Yiming Yang, Tony G. Rose, Fan Li
RCV1 is a new benchmark collection for text categorization research, containing over 800,000 manually categorized English-language newswire stories produced by Reuters between August 1996 and August 1997. The data was produced under operational procedures at Reuters, with coding policies and quality-control measures in place. The paper refers to the original data as RCV1-v1 and to the corrected data as RCV1-v2; it describes the coding policies and category taxonomies, documents the corrections needed to remove errorful data, and benchmarks several supervised learning methods on RCV1-v2, illustrating the collection's properties and suggesting new research directions. Detailed experimental results and corrected category assignments are provided via online appendices.

Stories are categorized using three sets of codes: Topics, Industries, and Regions. The Topic codes are organized in four hierarchical groups, the Industry codes in ten subhierarchies, and the Region codes cover both geographic locations and economic/political groupings. The coding process involved three stages: autocoding, manual editing, and manual correction. Two policies governed assignment: the Minimum Code Policy, which required at least one Topic and one Region code per story, and the Hierarchy Policy, which required the most specific appropriate codes from the Topic and Industry sets along with all of their ancestors.

The RCV1 data has been used to evaluate text categorization methods, and the paper provides benchmark results for several supervised learning approaches, discussing the implications of the coding policies for algorithm design and evaluation. The original data was found to contain coding errors, including duplicate documents and inconsistencies in category assignments; the paper supplies corrections to these errors and argues that RCV1-v2 is the better test collection for text categorization research.
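The Hierarchy Policy described above can be sketched in a few lines: a document's code set is closed under the parent relation of the taxonomy. This is a minimal illustration; the parent map below is a tiny fragment using real RCV1 Topic codes (CCAT = Corporate/Industrial, C15 = Performance, C151 = Accounts/Earnings, C1511 = Annual Results), not the full hierarchy.

```python
# Minimal sketch of RCV1's Hierarchy Policy: every assigned code
# implies all of its ancestors in the Topic (or Industry) hierarchy.
# PARENT is a toy fragment of the Topics taxonomy, not the full tree.

PARENT = {           # child -> parent; top-level codes have no entry
    "C15": "CCAT",   # PERFORMANCE under CORPORATE/INDUSTRIAL
    "C151": "C15",   # ACCOUNTS/EARNINGS under PERFORMANCE
    "C1511": "C151", # ANNUAL RESULTS under ACCOUNTS/EARNINGS
}

def expand_with_ancestors(codes):
    """Return the code set closed under the parent relation."""
    expanded = set(codes)
    for code in codes:
        while code in PARENT:
            code = PARENT[code]
            expanded.add(code)
    return expanded

print(sorted(expand_with_ancestors({"C1511"})))
# ['C15', 'C151', 'C1511', 'CCAT']
```

One practical consequence noted by the paper: because ancestor codes are implied rather than independently chosen, learning algorithms evaluated on RCV1 must decide whether to treat them as ordinary labels or to exploit the hierarchy directly.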
The paper also discusses the implications of the coding policies for the use of RCV1 in research, and provides a detailed analysis of the coding process and the resulting data.
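Benchmark results on multilabel collections like RCV1 are typically summarized with micro- and macro-averaged F1 over the categories. The following is a small self-contained sketch of those two averages on toy gold/predicted label sets; the documents and assignments are invented for illustration, not results from the paper.

```python
# Hedged sketch: micro- vs macro-averaged F1 for multilabel category
# assignments, the usual summary measures in RCV1 benchmarking.
# Gold and predicted label sets below are toy data.

def f1(tp, fp, fn):
    """F1 from true-positive, false-positive, false-negative counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def micro_macro_f1(gold, pred, categories):
    # Per-category contingency counts: [tp, fp, fn]
    counts = {c: [0, 0, 0] for c in categories}
    for g, p in zip(gold, pred):
        for c in categories:
            if c in p and c in g:
                counts[c][0] += 1   # correct assignment
            elif c in p:
                counts[c][1] += 1   # spurious assignment
            elif c in g:
                counts[c][2] += 1   # missed assignment
    # Micro: pool counts across categories, then compute F1 once.
    micro = f1(*(sum(col) for col in zip(*counts.values())))
    # Macro: compute F1 per category, then average.
    macro = sum(f1(*counts[c]) for c in categories) / len(categories)
    return micro, macro

gold = [{"CCAT", "GCAT"}, {"MCAT"}, {"CCAT"}]
pred = [{"CCAT"}, {"MCAT", "GCAT"}, {"CCAT"}]
micro, macro = micro_macro_f1(gold, pred, ["CCAT", "GCAT", "MCAT"])
print(micro, macro)  # 0.75  0.666...
```

Micro-averaging weights each assignment equally, so frequent categories dominate; macro-averaging weights each category equally, exposing performance on the many rare categories that RCV1's taxonomies contain.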