July 5–10, 2020 | Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, Sameer Singh
The paper introduces CheckList, a task-agnostic methodology for testing NLP models, inspired by behavioral testing principles in software engineering. CheckList provides a matrix of general linguistic capabilities and test types to facilitate comprehensive test ideation, along with a software tool for quickly generating large, diverse sets of test cases. The authors demonstrate the utility of CheckList on three tasks: sentiment analysis, duplicate question detection, and machine comprehension. They find critical failures in both commercial and state-of-the-art models, revealing problems with basic linguistic phenomena such as negation, named entities, and coreference. User studies show that practitioners using CheckList create more tests and uncover more bugs than practitioners without it. The paper also discusses related work and concludes by emphasizing the importance of systematic testing beyond accuracy on held-out data.
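To make the methodology concrete, the sketch below shows a CheckList-style Minimum Functionality Test (MFT) for the negation capability in sentiment analysis, built from a fill-in-the-blank template. It is a minimal illustration in plain Python: `predict_sentiment` is a deliberately naive, hypothetical stand-in for the model under test, and the template/lexicon names are invented for this example. The actual CheckList tool provides a richer Editor/templating interface and additional test types (invariance and directional expectation tests) beyond the MFT shown here.

```python
from itertools import product

def predict_sentiment(text: str) -> str:
    """Hypothetical model under test: a naive keyword matcher that
    ignores negation, so it should fail the MFT below."""
    positive_words = {"love", "enjoy", "like"}
    return "positive" if any(w in text for w in positive_words) else "negative"

# Template-based test generation: fill each slot with lexicon entries.
template = "I {negation} {verb} the {thing}."
lexicons = {
    "negation": ["didn't", "did not"],
    "verb": ["love", "enjoy", "like"],
    "thing": ["movie", "flight", "service"],
}

test_cases = [
    template.format(negation=n, verb=v, thing=t)
    for n, v, t in product(*lexicons.values())
]

# MFT expectation: every negated positive statement should be labeled negative.
failures = [s for s in test_cases if predict_sentiment(s) != "negative"]
print(f"Negation MFT: {len(failures)}/{len(test_cases)} failures "
      f"({len(failures) / len(test_cases):.0%} failure rate)")
```

Running this prints a 100% failure rate for the naive placeholder model, which is exactly the kind of capability-level blind spot that held-out accuracy alone would not surface.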