Charles T. Hemphill, John J. Godfrey, George R. Doddington
The Air Travel Information System (ATIS) Spoken Language Systems Pilot Corpus is a significant development in speech and natural language processing research, designed to measure progress in systems that combine a speech component with a natural language component. This pilot marks the first full-scale attempt to collect such a corpus and provides guidelines for future efforts. The corpus differs from its predecessor, the Resource Management corpus, in several key respects: the speech is spontaneous rather than read, it was collected in an office environment, the grammar is part of the system rather than a given, and the reference answers are the actual replies to the queries.
The ATIS corpus includes acoustic speech data, transcriptions, a set of tuples representing the answers, and SQL expressions for the queries. The corpus was collected through a simulated travel planning system, where subjects interacted with a "travel planner" and received both transcriptions and answers on a computer screen. The collection process involved detailed instructions, a structured session format, and a rigorous data processing pipeline to ensure the quality and reliability of the corpus.
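As a rough illustration of how these four components fit together, one utterance could be modeled as the record below. This is only a sketch: the class name, field names, file path, SQL table and column names, and the order-insensitive comparison rule are all assumptions made for the example, not the corpus's actual file layout or scoring procedure.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class AtisUtterance:
    """One corpus entry. All field names here are hypothetical."""
    audio_path: str       # location of the acoustic speech data
    transcription: str    # what the subject said
    sql: str              # SQL expression for the query
    reference_tuples: List[Tuple] = field(default_factory=list)  # answer rows


def matches_reference(utt: AtisUtterance, hypothesis: List[Tuple]) -> bool:
    """Compare a system's answer rows with the reference, ignoring row order.

    Assumed comparison rule for illustration only; duplicates are respected.
    """
    return Counter(utt.reference_tuples) == Counter(hypothesis)


# Illustrative values only; the path, SQL schema, and flight numbers are made up.
example = AtisUtterance(
    audio_path="session01/utt003.wav",
    transcription="show me flights from dallas to boston",
    sql="SELECT flight_number FROM flight "
        "WHERE origin = 'DFW' AND destination = 'BOS'",
    reference_tuples=[(101,), (205,)],
)

print(matches_reference(example, [(205,), (101,)]))  # same rows, different order
print(matches_reference(example, [(101,)]))          # one row missing
```

Keeping the reference answer as a set of tuples, rather than as text, is what lets a system be scored on the reply it produces instead of on the words it recognized.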
The pilot collected 41 sessions with 1041 utterances over 8 weeks, with 740 utterances judged as evaluable. The results showed that the corpus is valuable for objective evaluation of spoken language systems, providing realistic conditions for testing. The study also highlighted the challenges and benefits of collecting spontaneous speech data, emphasizing its potential for advancing the field of spoken language systems.