A Multimodal Automated Interpretability Agent


22 Apr 2024 | Tamar Rott Shaham, Sarah Schwettmann, Franklin Wang, Achyuta Rajaram, Evan Hernandez, Jacob Andreas, Antonio Torralba
This paper introduces MAIA, a Multimodal Automated Interpretability Agent designed to automate neural model understanding tasks such as feature interpretation and failure mode discovery. MAIA is equipped with a pre-trained vision-language model and an API of tools for iterative experimentation on subcomponents of other models. These tools mirror those used by human interpretability researchers: synthesizing and editing inputs, computing maximally activating exemplars, and summarizing and describing experimental results. MAIA's interpretability experiments compose these tools to explain the behavior of the system under study. The paper evaluates MAIA's ability to describe features in learned representations of images, showing that it produces descriptions comparable to those written by expert human experimenters. MAIA also helps reduce sensitivity to spurious features and identify inputs likely to be misclassified. The evaluation demonstrates MAIA's effectiveness on both real and synthetic datasets, highlighting its potential to automate and scale model interpretability tasks. While MAIA still requires human supervision to avoid common pitfalls, the paper suggests that interpretability agents will become increasingly useful as they grow in sophistication.
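To make the tool-composition idea concrete, below is a minimal sketch of how an interpretability agent might chain an input-synthesis tool with an activation probe and log the results. All names here (synthesize_image, unit_activation, run_probe) are hypothetical placeholders for illustration only; they are not the paper's actual API, and the model calls are stubbed out so the example runs standalone.

```python
"""Hypothetical sketch of composing interpretability tools, in the spirit of
an agent that synthesizes inputs, probes a model subcomponent, and logs
activations. Names and behavior are illustrative assumptions, not MAIA's API."""
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Experiment:
    prompt: str
    activation: float


def synthesize_image(prompt: str) -> List[float]:
    # Placeholder for a text-to-image tool; returns a fake "image" whose
    # intensity depends only on prompt length so the script is runnable.
    return [min(1.0, len(prompt) / 40.0)] * 16


def unit_activation(image: List[float]) -> float:
    # Placeholder for querying a real neuron/unit; here it scores images
    # by mean intensity instead of running a vision model.
    return sum(image) / len(image)


def run_probe(prompts: List[str],
              synthesize: Callable[[str], List[float]],
              activate: Callable[[List[float]], float]) -> List[Experiment]:
    """Compose tools: synthesize candidate inputs, record unit activations."""
    return [Experiment(prompt=p, activation=activate(synthesize(p)))
            for p in prompts]


if __name__ == "__main__":
    hypotheses = ["a red fire truck", "a forest at dusk", "a close-up of fur"]
    log = run_probe(hypotheses, synthesize_image, unit_activation)
    # A summarization step would turn this experiment log into a natural-
    # language description of what the unit is selective for.
    for e in sorted(log, key=lambda r: -r.activation):
        print(f"{e.activation:.2f}  {e.prompt}")
```

In the actual system described by the paper, the agent iterates this loop, forming hypotheses from intermediate results and choosing follow-up experiments, rather than running a single fixed batch of prompts.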