Submitted 9/02; Published 4/04 | David D. Lewis, Yiming Yang, Tony G. Rose, Fan Li
The paper introduces the Reuters Corpus Volume 1 (RCV1), a large collection of manually categorized newswire stories, and discusses its potential as a benchmark for text categorization research. The authors describe the coding policies, quality control procedures, and category taxonomies used in producing RCV1, and present corrections to remove errors in the original data (RCV1-v1). They benchmark several supervised learning methods on the corrected data (RCV1-v2) to illustrate the collection's properties and suggest new research directions. The paper also provides detailed experimental results and corrected category assignments via online appendices. Key aspects include the hierarchical structure of the category sets, the implications of coding policies for algorithm design and evaluation, and the detection of coding errors using duplicate documents.