Understanding Black-box Predictions via Influence Functions

29 Dec 2020 | Pang Wei Koh, Percy Liang
This paper introduces influence functions as a way to understand the predictions of black-box machine learning models by tracing them back to the training data. Influence functions, a classic technique from robust statistics, estimate how a model's predictions would change if the training data were changed, without retraining the model from scratch. Concretely, the method asks how upweighting or perturbing a single training point would shift the model's parameters, and through them the loss at a given test point. The approach applies to a range of models, from linear models to convolutional neural networks, and the paper demonstrates several uses: understanding model behavior, debugging models, detecting dataset errors, and crafting adversarial training examples that flip a neural network's predictions.
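As a concrete illustration, the sketch below computes the influence of one training point on the loss at a test point for L2-regularized logistic regression, where the Hessian is small enough to form and invert explicitly. It implements the paper's closed-form influence score, -grad L(z_test)^T H^{-1} grad L(z); the function names and the ridge damping term are illustrative choices, not taken from the authors' code.

```python
# Minimal sketch: influence of one training point on a test loss for
# L2-regularized logistic regression, with the Hessian formed explicitly.
# Influence score: -grad L(z_test)^T  H^{-1}  grad L(z_train)
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_loss(theta, x, y):
    """Gradient of the logistic loss at a single example (y in {0, 1})."""
    p = sigmoid(x @ theta)
    return (p - y) * x

def hessian(theta, X, Y, reg=1e-3):
    """Average Hessian of the training loss, plus a ridge term for invertibility."""
    P = sigmoid(X @ theta)
    W = P * (1 - P)                       # per-example curvature weights
    H = (X * W[:, None]).T @ X / len(Y)   # (1/n) * sum_i w_i x_i x_i^T
    return H + reg * np.eye(len(theta))

def influence(theta, X, Y, x_train, y_train, x_test, y_test):
    """Approximate change in the test loss from upweighting (x_train, y_train)."""
    H_inv = np.linalg.inv(hessian(theta, X, Y))
    g_test = grad_loss(theta, x_test, y_test)
    g_train = grad_loss(theta, x_train, y_train)
    return -g_test @ H_inv @ g_train
```

For larger models, forming and inverting the Hessian explicitly like this is exactly the cost the paper shows how to avoid, as described next.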
Although the theory behind influence functions assumes a convex, twice-differentiable loss, the paper shows empirically that they remain informative in non-convex and non-differentiable settings. The main computational obstacle is the Hessian of the training loss and its inverse, which cannot be formed explicitly for large models. The paper sidesteps this by computing inverse-Hessian-vector products implicitly, using conjugate gradient or stochastic second-order approximation, so that influence scores can be obtained even for models with millions of parameters.

Experiments on logistic regression and convolutional neural networks validate the approach: the influence estimates closely match leave-one-out retraining and remain accurate approximations of the effect of removing or perturbing training points even when the underlying assumptions are not strictly met. The authors use influence functions to identify the training points most responsible for a given prediction, to debug domain mismatch, and to flag and fix mislabeled examples, with case studies on medical data and email spam classification. Overall, the paper argues that understanding how a model is derived from its training data is central to interpreting, debugging, and hardening machine learning systems, and influence functions provide a practical and versatile tool for doing so.
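The sketch below illustrates that trick under the same logistic-regression assumptions as above: the Hessian is never formed, only Hessian-vector products, and the system H s = grad L(z_test) is solved with SciPy's conjugate gradient. For a neural network the Hessian-vector product would instead come from double backpropagation; the names, damping term, and iteration cap here are illustrative choices, not the authors' implementation.

```python
# Minimal sketch: the same influence score computed without forming the Hessian,
# using an implicit Hessian-vector product and conjugate gradient.
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def make_hvp(theta, X, reg=1e-3):
    """Return v -> H v for the average logistic loss, in O(n*d) time per product."""
    P = sigmoid(X @ theta)
    W = P * (1 - P)
    def hvp(v):
        # H v = (1/n) X^T (W * (X v)) + reg * v, without materializing H
        return X.T @ (W * (X @ v)) / X.shape[0] + reg * v
    return hvp

def influence_cg(theta, X, g_train, g_test, reg=1e-3):
    """-g_test^T H^{-1} g_train, solving H s = g_test by conjugate gradient."""
    d = theta.shape[0]
    H_op = LinearOperator((d, d), matvec=make_hvp(theta, X, reg), dtype=np.float64)
    s_test, _ = cg(H_op, g_test, maxiter=100)   # s_test approximates H^{-1} g_test
    return -s_test @ g_train
```

The per-example gradients g_train and g_test can be computed as in the earlier sketch; since only matrix-vector products with the Hessian are needed, memory stays linear in the number of parameters.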