Charles T. Hemphill, John J. Godfrey, George R. Doddington
The ATIS Spoken Language Systems Pilot Corpus was created to evaluate progress in spoken language systems that combine speech and natural language processing. It marks the first large-scale attempt to collect such data and provides guidelines for future efforts. The corpus includes acoustic speech data, transcriptions, answer tuples, and SQL expressions for queries. It is based on a relational database derived from the Official Airline Guide (OAG), containing information about flights, fares, airlines, cities, airports, and ground services. The data were collected in simulated sessions in which subjects, acting as "travel planners," interacted with a system that recorded their speech through both a head-mounted microphone and a desk-mounted microphone. The system responded to each query with a transcription and an answer, and the session data were processed into three types of transcriptions: NL_input, prompting_text, and SR_output. The corpus also includes classifications of queries by context dependence, ambiguity, and other factors, along with reference answers and SQL expressions. The National Institute of Standards and Technology (NIST) distributed the corpus to interested sites. Over 8 weeks, 41 sessions containing 1041 utterances were collected, of which 740 (about 71%) were judged evaluable according to the June 1990 criteria. The pilot corpus has shown that objective evaluation of spoken language systems is both possible and beneficial, and it has clarified many points of data collection procedure. It has also shown that spontaneous speech corpora are more expensive to collect than read speech corpora, but that they provide a realistic opportunity to evaluate spoken language systems. The work was supported by the Defense Advanced Research Projects Agency and monitored by the Naval Space and Warfare Systems Command.
The authors acknowledge the publishers of the Official Airline Guide for travel data and consulting help, as well as the subjects and members of various committees for their contributions.