22 Jan 2024 | Wes Gurnee, Theo Horsley, Zifan Carl Guo, Tara Rezaei Kheirkhah, Qinyi Sun, Will Hathaway, Neel Nanda, Dimitris Bertsimas
This paper investigates the universality of individual neurons across GPT2 language models trained from different random initializations. The study finds that only 1-5% of neurons are universal, meaning they consistently activate on the same inputs across models. These universal neurons often have clear interpretations and can be grouped into a small number of neuron families. The study also identifies several universal functional roles, including deactivating attention heads, changing the entropy of the next-token distribution, and predicting or suppressing elements of the vocabulary. The results suggest that universal neurons are more likely to be interpretable and that studying them can provide insight into the underlying mechanisms of neural networks. The authors highlight the importance of understanding the functional roles of neurons in language models and the potential for developing a periodic table of neural circuits that could be automatically referenced when interpreting new models. The findings contribute to mechanistic interpretability by providing evidence that individual neurons are not the appropriate unit of analysis for most network behaviors, and that leveraging universality is an effective approach to identifying interpretable model components and important motifs. The study also notes several limitations, including the use of small models and a narrow form of universality, and suggests avenues for future research.
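To make the notion of cross-model universality concrete, the sketch below shows one way to operationalize it: correlate each neuron's activations over a shared set of inputs with every neuron in a second, independently trained model, and flag neurons whose best match is strongly correlated. This is a minimal sketch under stated assumptions; the function name, the 0.5 threshold, and the use of plain Pearson correlation are illustrative choices, not the paper's exact procedure.

```python
import numpy as np

def max_cross_model_correlations(acts_a: np.ndarray, acts_b: np.ndarray) -> np.ndarray:
    """For each neuron in model A, return its maximum Pearson correlation
    with any neuron in model B, computed over the same token inputs.

    acts_a: (n_tokens, n_neurons_a) activations from model A
    acts_b: (n_tokens, n_neurons_b) activations from model B
    """
    # Standardize each neuron's activations to zero mean and unit variance.
    za = (acts_a - acts_a.mean(0)) / (acts_a.std(0) + 1e-8)
    zb = (acts_b - acts_b.mean(0)) / (acts_b.std(0) + 1e-8)
    # Pearson correlation matrix over all neuron pairs: (n_neurons_a, n_neurons_b).
    corr = za.T @ zb / acts_a.shape[0]
    # Best-matching correlation for each model-A neuron.
    return corr.max(axis=1)

# Toy usage with random activations (hypothetical data): random neurons are
# essentially uncorrelated, so almost none clear the threshold.
rng = np.random.default_rng(0)
acts_a = rng.standard_normal((10_000, 512))
acts_b = rng.standard_normal((10_000, 512))
universal_mask = max_cross_model_correlations(acts_a, acts_b) > 0.5  # assumed threshold
print(f"{universal_mask.mean():.1%} of model-A neurons exceed the correlation threshold")
```

In practice the activations would come from running both models over the same token corpus and caching MLP neuron activations layer by layer; the paper's reported 1-5% figure refers to neurons that remain well correlated across all model pairs, which is a stricter criterion than the single pairwise check sketched here.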