DeepXplore: Automated Whitebox Testing of Deep Learning Systems

2017 | Kexin Pei*, Yinzhi Cao†, Junfeng Yang*, Suman Jana*
DeepXplore is the first whitebox framework for systematically testing deep learning (DL) systems. It targets erroneous behaviors that are hard to detect because existing DL testing depends heavily on manually labeled data and because DL models are complex. DeepXplore introduces neuron coverage, a metric that measures how much of a DL system's logic is exercised by a set of test inputs. It also uses multiple DL systems with similar functionality as cross-referencing oracles, so erroneous corner cases can be identified without manual checking. Generating test inputs that both maximize neuron coverage and expose differential behaviors is formulated as a joint optimization problem, which is solved efficiently with gradient-based search.

DeepXplore finds thousands of incorrect corner-case behaviors in state-of-the-art DL models with thousands of neurons, trained on five popular datasets. Averaged over all tested models, it generated one test input demonstrating incorrect behavior within one second on a commodity laptop. The generated test inputs can also be used to retrain the corresponding DL model, improving its accuracy by up to 3%.

The main contributions are: introducing neuron coverage as a whitebox testing metric for DL systems; showing that finding behavioral differences between similar DL systems while maximizing neuron coverage can be formulated as a joint optimization problem; and implementing these techniques in DeepXplore, which exposed thousands of incorrect corner-case behaviors in 15 state-of-the-art DL models. DeepXplore also supports custom constraints that simulate different types of realistic inputs.
The framework has been evaluated on MNIST, ImageNet, Driving, Contagio/VirusTotal, and Drebin, and has shown significant improvements in neuron coverage and accuracy compared to random or adversarial inputs.
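
To make the neuron coverage metric concrete, here is a minimal sketch. It assumes the usual reading of the metric: a neuron counts as covered if its output exceeds a threshold for at least one test input, and coverage is the fraction of covered neurons. The toy network format, `TOY_NETWORK`, and all function names are hypothetical illustrations, not part of DeepXplore's code.

```python
from typing import List, Sequence, Tuple

# Hypothetical toy format: each layer is (weight rows, biases), one row per neuron.
Layer = Tuple[List[List[float]], List[float]]


def relu_layer(weights: List[List[float]], biases: Sequence[float],
               x: Sequence[float]) -> List[float]:
    """Outputs of one fully connected ReLU layer."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


def activations(network: List[Layer], x: Sequence[float]) -> List[List[float]]:
    """Outputs of every neuron, collected layer by layer."""
    outputs = []
    for weights, biases in network:
        x = relu_layer(weights, biases, x)
        outputs.append(x)
    return outputs


def neuron_coverage(network: List[Layer], inputs: List[Sequence[float]],
                    threshold: float = 0.0) -> float:
    """Fraction of neurons whose output exceeds `threshold` for any input."""
    total = sum(len(biases) for _, biases in network)
    covered = set()
    for x in inputs:
        for layer_idx, layer_out in enumerate(activations(network, x)):
            for neuron_idx, value in enumerate(layer_out):
                if value > threshold:
                    covered.add((layer_idx, neuron_idx))
    return len(covered) / total


# Toy network (assumption): 2 inputs -> 3 hidden neurons -> 2 output neurons.
TOY_NETWORK: List[Layer] = [
    ([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]], [0.0, 0.0, 0.0]),
    ([[1.0, 1.0, 0.0], [0.0, 0.0, 1.0]], [0.0, 0.0]),
]

if __name__ == "__main__":
    tests = [[1.0, 0.0], [0.0, 1.0]]
    print("coverage:", neuron_coverage(TOY_NETWORK, tests))
```

Adding an input such as `[-1.0, -1.0]` exercises the remaining hidden and output neurons, which is the kind of gap the metric is meant to reveal: inputs that look varied can still leave parts of the network untested.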
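
The joint optimization idea can be illustrated with a small sketch. It assumes one plausible form of the objective: keep one model's confidence in the seed's class high, push a second model's confidence down (weighted by `lam1`), and reward the output of a currently inactive neuron (weighted by `lam2`). The toy models, parameter values, and function names are hypothetical, and a finite-difference gradient stands in for backpropagation so the example needs no libraries; it is not DeepXplore's implementation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(model, x):
    """Return (P(class 1), hidden-neuron outputs) for a toy one-hidden-layer net."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + c)
              for row, c in zip(model["W"], model["c"])]
    logit = sum(v * h for v, h in zip(model["v"], hidden)) + model["b"]
    return sigmoid(logit), hidden


def predict(model, x):
    return int(forward(model, x)[0] > 0.5)


def joint_objective(x, keep, flip, cls, neuron, lam1=1.0, lam2=0.5):
    """Differential term plus coverage term (assumed form, see lead-in)."""
    p_keep, _ = forward(keep, x)
    p_flip, hidden = forward(flip, x)
    if cls == 0:  # work with the probability of the seed's class
        p_keep, p_flip = 1.0 - p_keep, 1.0 - p_flip
    return p_keep - lam1 * p_flip + lam2 * hidden[neuron]


def numeric_grad(f, x, eps=1e-4):
    """Central finite differences; a stand-in for backpropagation."""
    grad = []
    for i in range(len(x)):
        hi, lo = list(x), list(x)
        hi[i] += eps
        lo[i] -= eps
        grad.append((f(hi) - f(lo)) / (2 * eps))
    return grad


def find_difference(seed, keep, flip, neuron, step=0.2, max_iters=1000):
    """Gradient ascent on the joint objective until the two models disagree."""
    x = list(seed)
    cls = predict(keep, seed)
    for _ in range(max_iters):
        if predict(keep, x) != predict(flip, x):
            return x
        g = numeric_grad(
            lambda z: joint_objective(z, keep, flip, cls, neuron), x)
        x = [xi + step * gi for xi, gi in zip(x, g)]
    return x if predict(keep, x) != predict(flip, x) else None


# Two toy models (assumption): same hidden layer, different output weights.
HIDDEN_W = [[3.0, 0.0], [0.0, 3.0], [2.0, -2.0]]
HIDDEN_C = [0.0, 0.0, -1.0]
MODEL_A = {"W": HIDDEN_W, "c": HIDDEN_C, "v": [4.0, 0.5, 0.0], "b": -2.0}
MODEL_B = {"W": HIDDEN_W, "c": HIDDEN_C, "v": [0.5, 4.0, 0.0], "b": -2.0}

if __name__ == "__main__":
    seed = [1.0, 1.0]  # both models predict class 1 here
    found = find_difference(seed, MODEL_A, MODEL_B, neuron=2)
    print("difference-inducing input:", found)
    if found is not None:
        print("predictions:", predict(MODEL_A, found), predict(MODEL_B, found))
```

Because the two models act as oracles for each other, no label is needed: any input on which they disagree is a candidate erroneous case. In a real setting the search would also be subject to domain constraints so that the generated input stays realistic, which this sketch omits.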