Beyond Accuracy: Behavioral Testing of NLP Models with CheckList

July 5–10, 2020 | Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, Sameer Singh
The paper introduces CheckList, a task-agnostic methodology for behavioral testing of NLP models, inspired by principles from software engineering. It provides a matrix of general linguistic capabilities and test types to guide comprehensive testing, along with a software tool for generating large numbers of diverse test cases quickly. The authors demonstrate the utility of CheckList on three NLP tasks, identifying critical failures in both commercial and state-of-the-art models. In user studies, teams using CheckList created more tests and found more bugs than teams working without it.

CheckList defines three test types: Minimum Functionality Tests (MFTs), Invariance tests (INVs), and Directional Expectation tests (DIRs); the latter two evaluate model behavior even on unlabeled data. The methodology also provides templates, masked language model suggestions, and other abstractions so that test cases can be generated efficiently. The paper shows that CheckList reveals severe bugs in commercial models, even those that have been extensively tested in production. It also highlights the importance of testing linguistic capabilities such as negation, named entity recognition, and coreference, which are often overlooked in traditional benchmarking. The authors argue that CheckList provides a more comprehensive evaluation than accuracy on held-out data alone, because it surfaces failures in specific behaviors that aggregate metrics can hide. The methodology and tooling are open source and applicable to a wide range of NLP tasks, making CheckList a valuable complement to standard benchmarks when evaluating NLP models.
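To make the three test types concrete, the sketch below shows how an MFT, an INV, and a DIR might be expressed as plain Python checks against a model's prediction function. This is a minimal illustration, not the authors' open-source checklist tool or its API: the toy sentiment model, the perturbation helper, and all names here are hypothetical, and in the real tool templates are expanded with lexicons and masked language model suggestions rather than a single format string.

```python
# Illustrative sketch of CheckList-style test types (MFT, INV, DIR).
# NOT the authors' released `checklist` package; the toy model and
# helpers below are hypothetical, for exposition only.

def toy_sentiment_model(text: str) -> float:
    """Hypothetical model: returns a positive-sentiment score in [0, 1]."""
    positive = {"good", "great", "wonderful", "amazing"}
    negative = {"bad", "terrible", "awful"}
    words = text.lower().replace(".", "").split()
    score = 0.5 + 0.1 * (sum(w in positive for w in words)
                         - sum(w in negative for w in words))
    # This toy model ignores negation entirely, so the MFT below fails.
    return max(0.0, min(1.0, score))

# --- Minimum Functionality Test (MFT): templated cases with known labels ---
template = "The movie was not {adj}."
adjectives = ["good", "great", "wonderful"]  # template fill-ins
mft_cases = [(template.format(adj=a), "negative") for a in adjectives]
mft_failures = [t for t, label in mft_cases
                if (toy_sentiment_model(t) > 0.5) != (label == "positive")]

# --- Invariance test (INV): prediction should not change under perturbation ---
def perturb_name(text: str) -> str:
    """Hypothetical perturbation: swap one person name for another."""
    return text.replace("Mary", "Aisha")

inv_cases = ["Mary thought the film was great.", "Mary said it was terrible."]
inv_failures = [t for t in inv_cases
                if (toy_sentiment_model(t) > 0.5)
                != (toy_sentiment_model(perturb_name(t)) > 0.5)]

# --- Directional Expectation test (DIR): score should move a known way ---
dir_cases = ["The plot was good."]
dir_failures = [t for t in dir_cases
                if toy_sentiment_model(t + " The acting was awful.")
                >= toy_sentiment_model(t)]

print(f"MFT (negation) failures: {len(mft_failures)}/{len(mft_cases)}")
print(f"INV (name change) failures: {len(inv_failures)}/{len(inv_cases)}")
print(f"DIR (append negative) failures: {len(dir_failures)}/{len(dir_cases)}")
```

The toy model's failure on the negation MFT mirrors the kind of capability-specific bug that aggregate accuracy on held-out data can hide, which is precisely the gap the paper's methodology is designed to expose.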