The Design for the Wall Street Journal-based CSR Corpus

Douglas B. Paul and Janet M. Baker
The paper introduces the Wall Street Journal (WSJ) CSR Corpus, a major resource for advancing continuous speech recognition (CSR) and natural language processing (NLP) research. The corpus is designed to support the research goals of the DARPA Spoken Language System (SLS) community, whose long-term target is recognition of goal-directed, spontaneous, continuous speech from cooperative speakers in both speaker-adaptive and speaker-independent modes. With roughly 400 hours of speech and 47 million words of text, the WSJ corpus is presented as the largest general-purpose English, large-vocabulary, high-perplexity corpus available at the time.

The corpus is structured to support a range of vocabulary sizes and perplexities, as well as speaker-dependent and speaker-independent training with varying amounts of data. It includes equal portions of prompts with verbalized and non-verbalized punctuation (illustrated below), separate speaker-adaptation materials, and a diverse pool of speakers intended to reflect real-world conditions. Text preprocessing removes ambiguities in the source text and produces material suitable for training language models.

The WSJ-Pilot database, a smaller initial version of the full corpus, was designed so that training data could be shared among different research paradigms. It comprises 80 hours of recorded speech, with 50 hours for speaker-independent training and 30 hours for speaker-dependent training. The text-selection process filtered the source material for readability, ensuring high-quality sentences for recording. Additional components of the WSJ corpus, including a dictionary, language models, and baseline test vocabularies, are provided to support recognition experiments.
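For context, the "high perplexity" claim refers to the standard measure of how difficult a text source is for a language model. The definition below is the conventional one, given here as background rather than quoted from the paper: for a test set of N words w_1, ..., w_N,

\[
\mathrm{PP} = \hat{P}(w_1, w_2, \ldots, w_N)^{-1/N}
            = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N} \ln \hat{P}(w_i \mid w_1, \ldots, w_{i-1})\right),
\]

where \(\hat{P}\) is the language model's probability estimate. Higher perplexity corresponds roughly to a larger effective branching factor at each word, and hence a harder recognition task.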
The WSJ Corpus and its supporting components are designed to facilitate advanced CSR research and potentially broaden the practical applications of spoken language technology.
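The split between verbalized and non-verbalized punctuation mentioned above means that in one half of the prompts readers speak punctuation marks aloud as words, while in the other half punctuation is simply absent from the spoken form. The sketch below is only an illustration of that idea; the token spellings and helper functions are assumptions for demonstration, not the paper's actual text-processing pipeline.

```python
import re

# Illustrative mapping from punctuation marks to spoken tokens.
# The real WSJ processing used its own conventions; these spellings
# are assumptions for demonstration only.
VERBALIZED = {
    ",": ",COMMA",
    ".": ".PERIOD",
    "?": "?QUESTION-MARK",
    '"': '"QUOTE',
    "(": "(LEFT-PAREN",
    ")": ")RIGHT-PAREN",
}

def verbalize_punctuation(sentence: str) -> str:
    """Return the verbalized-punctuation form: marks become spoken tokens."""
    tokens = re.findall(r"\w+|[^\w\s]", sentence)
    return " ".join(VERBALIZED.get(tok, tok) for tok in tokens)

def strip_punctuation(sentence: str) -> str:
    """Return the non-verbalized form: punctuation is dropped entirely."""
    tokens = re.findall(r"\w+|[^\w\s]", sentence)
    return " ".join(tok for tok in tokens if tok not in VERBALIZED)

if __name__ == "__main__":
    s = "Stocks rose, but bonds fell."
    print(verbalize_punctuation(s))  # Stocks rose ,COMMA but bonds fell .PERIOD
    print(strip_punctuation(s))      # Stocks rose but bonds fell
```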