Interpreting the Second-Order Effects of Neurons in CLIP

24 Jun 2024 | Yossi Gandelsman, Alexei A. Efros, Jacob Steinhardt
This paper presents a method for interpreting the function of individual neurons in CLIP by analyzing their second-order effects. The authors find that the direct effects of neurons on the output are negligible, and that indirect effects alone do not capture the neurons' function. Instead, they introduce a "second-order lens" that analyzes the effect flowing from a neuron through the later attention heads directly to the output. These effects are highly selective: each neuron significantly affects only about 2% of images, and each effect can be approximated by a single direction in CLIP's joint text-image space. By decomposing these directions into sparse sets of text representations, the authors reveal that neurons are polysemantic: each neuron corresponds to multiple, often unrelated, concepts. They use this polysemy to generate semantic adversarial examples and to perform zero-shot segmentation and attribute discovery in images. The results show that a scalable understanding of neurons can be used both to deceive the model and to introduce new model capabilities.
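To make the pipeline concrete, here is a minimal, self-contained NumPy sketch of the two steps described above: (1) a neuron's second-order effect, i.e., the value it writes to the residual stream, routed through a later attention head's value-output circuit into the joint output space, and (2) a greedy sparse decomposition of the resulting direction into text embeddings. All dimensions, weights, and the matching-pursuit-style decomposition are illustrative placeholders, not the paper's actual implementation or CLIP's real weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Toy dimensions (placeholders, not CLIP's real config) ---
d_model, d_joint, n_texts = 64, 32, 500

# (1) Second-order effect of one neuron.
# The neuron writes (activation * its MLP output-weight row) into the
# residual stream; a later attention head's value-output circuit then
# carries that write to the output representation.
neuron_act = 1.7                                  # post-nonlinearity activation
w_out = rng.standard_normal(d_model)              # this neuron's MLP output row
residual_write = neuron_act * w_out

W_OV = rng.standard_normal((d_model, d_model))    # head's value-output map
attn = 0.3                                        # attention weight onto this token
P = rng.standard_normal((d_joint, d_model))       # projection to the joint space

direction = P @ (attn * (W_OV @ residual_write))  # the second-order direction
direction /= np.linalg.norm(direction)

# (2) Greedy sparse decomposition of the direction into text embeddings
# (a simple matching-pursuit stand-in for the paper's sparse decomposition).
text_embs = rng.standard_normal((n_texts, d_joint))
text_embs /= np.linalg.norm(text_embs, axis=1, keepdims=True)

def sparse_text_decomposition(direction, text_embs, k=5):
    """Pick k text embeddings that greedily reconstruct `direction`."""
    residual = direction.copy()
    chosen, coefs = [], []
    for _ in range(k):
        scores = text_embs @ residual             # alignment with each text
        idx = int(np.argmax(np.abs(scores)))
        chosen.append(idx)
        coefs.append(float(scores[idx]))
        residual = residual - scores[idx] * text_embs[idx]
    return chosen, coefs, residual

texts, weights, leftover = sparse_text_decomposition(direction, text_embs)
print(texts, np.round(weights, 3), float(np.linalg.norm(leftover)))
```

On the real model, the dictionary would come from CLIP's text encoder applied to a large pool of captions, and the decomposition would be run per neuron; the top-scoring texts then name the (often unrelated) concepts a polysemantic neuron represents.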