24 Jun 2024 | Yossi Gandelsman, Alexei A. Efros, Jacob Steinhardt
The paper "Interpreting the Second-Order Effects of Neurons in CLIP" by Yossi Gandelsman, Alexei A. Efros, and Jacob Steinhardt from UC Berkeley explores the interpretability of individual neurons in the CLIP model. The authors introduce a "second-order lens" to analyze the indirect effects of neurons, which flow through subsequent attention heads to the output. They find that these effects are highly selective, significant for only a small fraction of images, and can be approximated by a single direction in the text-image space. Each neuron's effect is decomposed into sparse sets of text representations, revealing polysemantic behavior—each neuron corresponds to multiple, often unrelated concepts. This polysemy is leveraged to generate "semantic" adversarial examples by creating images with spuriously correlated concepts to the incorrect class. Additionally, the second-order effects are used for zero-shot segmentation and attribute discovery in images. The paper demonstrates that a scalable understanding of neurons can be used to improve model deception and introduce new capabilities.The paper "Interpreting the Second-Order Effects of Neurons in CLIP" by Yossi Gandelsman, Alexei A. Efros, and Jacob Steinhardt from UC Berkeley explores the interpretability of individual neurons in the CLIP model. The authors introduce a "second-order lens" to analyze the indirect effects of neurons, which flow through subsequent attention heads to the output. They find that these effects are highly selective, significant for only a small fraction of images, and can be approximated by a single direction in the text-image space. Each neuron's effect is decomposed into sparse sets of text representations, revealing polysemantic behavior—each neuron corresponds to multiple, often unrelated concepts. This polysemy is leveraged to generate "semantic" adversarial examples by creating images with spuriously correlated concepts to the incorrect class. Additionally, the second-order effects are used for zero-shot segmentation and attribute discovery in images. The paper demonstrates that a scalable understanding of neurons can be used to improve model deception and introduce new capabilities.