22 Jan 2024 | Wes Gurnee, Theo Horsley, Zifan Carl Guo, Tara Rezaei Kheirkhah, Qinyi Sun, Will Hathaway, Neel Nanda, Dimitris Bertsimas
This paper investigates the universality of individual neurons across GPT2 language models trained from different random initializations. The study finds that only 1-5% of neurons are universal, meaning they consistently activate on the same inputs across models. These universal neurons often have clear interpretations and can be grouped into a small number of neuron families. The study also identifies several universal functional roles, including deactivating attention heads, changing the entropy of the next-token distribution, and predicting or suppressing elements of the vocabulary. The results suggest that universal neurons are more likely to be interpretable and that studying them can provide insight into the underlying mechanisms of neural networks. The authors highlight the importance of understanding the functional roles of neurons in language models and the potential for developing a periodic table of neural circuits that could be automatically referenced when interpreting new models. The findings contribute to mechanistic interpretability by providing evidence that individual neurons are not the appropriate unit of analysis for most network behaviors, and that leveraging universality is an effective approach to identifying interpretable model components and important motifs. The study also notes several limitations, including the use of small models and a narrow form of universality, and suggests avenues for future research.
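To make the notion of cross-model universality concrete, the sketch below shows one way to operationalize it: correlate each neuron's activations over a shared set of inputs with every neuron in a second, independently trained model, and flag neurons whose best match is strongly correlated. This is a minimal sketch under stated assumptions; the function name, the 0.5 threshold, and the use of plain Pearson correlation are illustrative choices, not the paper's exact procedure.

```python
import numpy as np

def max_cross_model_correlations(acts_a: np.ndarray, acts_b: np.ndarray) -> np.ndarray:
    """For each neuron in model A, return its maximum Pearson correlation
    with any neuron in model B, computed over the same token inputs.

    acts_a: (n_tokens, n_neurons_a) activations from model A
    acts_b: (n_tokens, n_neurons_b) activations from model B
    """
    # Standardize each neuron's activations to zero mean and unit variance.
    za = (acts_a - acts_a.mean(0)) / (acts_a.std(0) + 1e-8)
    zb = (acts_b - acts_b.mean(0)) / (acts_b.std(0) + 1e-8)
    # Pearson correlation matrix over all neuron pairs: (n_neurons_a, n_neurons_b).
    corr = za.T @ zb / acts_a.shape[0]
    # Best-matching correlation for each model-A neuron.
    return corr.max(axis=1)

# Toy usage with random activations (hypothetical data): random neurons are
# essentially uncorrelated, so almost none clear the threshold.
rng = np.random.default_rng(0)
acts_a = rng.standard_normal((10_000, 512))
acts_b = rng.standard_normal((10_000, 512))
universal_mask = max_cross_model_correlations(acts_a, acts_b) > 0.5  # assumed threshold
print(f"{universal_mask.mean():.1%} of model-A neurons exceed the correlation threshold")
```

In practice the activations would come from running both models over the same token corpus and caching MLP neuron activations layer by layer; the paper's reported 1-5% figure refers to neurons that remain well correlated across all model pairs, which is a stricter criterion than the single pairwise check sketched here.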