The Wall Street Journal-based Continuous Speech Recognition (CSR) Corpus (WSJ-CORPUS) is a significant addition to the DARPA Spoken Language System (SLS) corpus collection. It provides a large, general-purpose English corpus with substantial speech data (400 hours) and text data (47 million words), so that speech recognition and natural language processing can be integrated in practical applications. The corpus is designed to support the 1994 SLS research goals, including cooperative speech, speaker-adaptive and speaker-independent modes, and integrated speech and language processing in moderately noisy environments. It includes a wide range of data types, such as spontaneous dictation, Hansard, and other materials, and is structured to facilitate diagnostic research and comparative testing.
The WSJ-CORPUS is built to accommodate variable vocabulary sizes (5K, 20K, and larger), variable perplexities (80, 120, 160, 240, and larger), and both speaker-dependent (SD) and speaker-independent (SI) training. It contains equal portions of verbalized and non-verbalized punctuation, and provides separate speaker adaptation materials. The corpus supports both "open" and "closed" vocabulary tests, with testing paradigms that evaluate performance with and without "out-of-vocabulary" lexical items.
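The distinction between "open" and "closed" vocabulary tests can be illustrated with a minimal Python sketch. It assumes a vocabulary is simply the most frequent words of some training text, and that a test set is "closed" when every test word is in that vocabulary. The function names (`build_vocabulary`, `oov_rate`, `is_closed_vocabulary`) and the toy sentences are hypothetical illustrations, not part of the corpus or its tools.

```python
from collections import Counter


def build_vocabulary(training_tokens, size):
    """Return the `size` most frequent words in the training text."""
    counts = Counter(training_tokens)
    # Sort by descending count, then alphabetically, so ties break deterministically.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return {word for word, _ in ranked[:size]}


def oov_rate(test_tokens, vocabulary):
    """Fraction of test tokens that fall outside the vocabulary."""
    if not test_tokens:
        return 0.0
    missing = sum(1 for token in test_tokens if token not in vocabulary)
    return missing / len(test_tokens)


def is_closed_vocabulary(test_tokens, vocabulary):
    """A test set is 'closed' when it contains no out-of-vocabulary words."""
    return all(token in vocabulary for token in test_tokens)


# Toy data standing in for newspaper text (assumption, not corpus content).
training = "the market rose as the dollar fell and the market closed higher".split()
vocab = build_vocabulary(training, size=3)
test = "the dollar rose sharply".split()

print(sorted(vocab))
print(oov_rate(test, vocab))
print(is_closed_vocabulary(test, vocab))
```

With a real 5K or 20K word list, the same check would separate test sentences suitable for a closed-vocabulary test from those that exercise out-of-vocabulary handling.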
The WSJ-CORPUS is structured to allow efficient comparisons of SI and SD performance across a range of testing scenarios. It also includes a large amount of machine-readable text from the Wall Street Journal, which enables the generation of statistical language models and the evaluation of novel language models. The corpus includes a dictionary of 33,000 words, as well as baseline open and closed test vocabularies and language models for research and cross-site evaluation.
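As a rough illustration of how a statistical language model can be estimated from text and scored by perplexity, the following Python sketch trains a bigram model with add-one smoothing. This is a deliberately simple stand-in chosen for brevity, not the corpus's baseline language models; the function names, the `<s>`, `</s>`, and `<unk>` markers, and the toy sentences are all assumptions made for the example.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count bigrams and their histories over whitespace-tokenized sentences."""
    history_counts = Counter()
    bigram_counts = Counter()
    vocab = {"</s>", "<unk>"}  # end marker and unknown-word token
    for sentence in sentences:
        words = sentence.split()
        vocab.update(words)
        tokens = ["<s>"] + words + ["</s>"]
        history_counts.update(tokens[:-1])
        bigram_counts.update(zip(tokens[:-1], tokens[1:]))
    return history_counts, bigram_counts, vocab


def perplexity(sentences, model):
    """Per-word perplexity under add-one smoothed bigram probabilities."""
    history_counts, bigram_counts, vocab = model
    vocab_size = len(vocab)
    log_prob = 0.0
    predicted = 0
    for sentence in sentences:
        # Map words never seen in training to the unknown-word token.
        words = [w if w in vocab else "<unk>" for w in sentence.split()]
        tokens = ["<s>"] + words + ["</s>"]
        for prev, word in zip(tokens[:-1], tokens[1:]):
            p = (bigram_counts[(prev, word)] + 1) / (history_counts[prev] + vocab_size)
            log_prob += math.log2(p)
            predicted += 1
    return 2 ** (-log_prob / predicted)


# Toy training text (assumption, not corpus content).
model = train_bigram(["stocks rose", "stocks fell"])
seen = perplexity(["stocks rose"], model)
unseen = perplexity(["bonds plunged"], model)
print(seen, unseen)
```

Text resembling the training data receives a lower perplexity than text full of unseen words, which is the sense in which a perplexity figure characterizes how hard a test set is for a given language model.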
The WSJ-CORPUS is designed to support a wide range of research interests, including domain-independent acoustic and language models, and speaker adaptation. Its design allows for highly informative intra- and inter-group comparisons. The corpus includes a pilot database, which is a smaller version of the full corpus. The source text goes through several processing steps to ensure the quality and usability of the data: it is processed to remove ambiguity, ensure readability, and facilitate accurate transcription. The corpus also supports both case-sensitive and case-insensitive recognition, and provides a variety of text formats for different testing scenarios. Together, these data and tools make the WSJ-CORPUS a comprehensive resource for evaluating and improving spoken language technology.