Confidence Regulation Neurons in Language Models


24 Jun 2024 | Alessandro Stolfo, Ben Wu, Wes Gurnee, Yonatan Belinkov, Xingyi Song, Mrinmaya Sachan, Neel Nanda
This paper investigates the mechanisms by which large language models (LLMs) regulate uncertainty in their next-token predictions, focusing on two components: entropy neurons and token frequency neurons.

Entropy neurons are characterized by high weight norms and low direct interaction with the unembedding matrix; they act on the final layer normalization (LayerNorm) scale to modulate the entropy of the model's output distribution. Such neurons are found across a range of models, including GPT-2, Pythia, Phi-2, Gemma 2B, and LLaMA2 7B, and they help manage confidence by adjusting output entropy, particularly in scenarios involving repeated subsequences (induction). The study shows that entropy neurons act as a hedging mechanism, reducing the model's confidence in cases where a high-confidence prediction could be wrong.

Token frequency neurons, a novel discovery, adjust the model's output distribution by modulating its distance from the unigram token frequency distribution. They are identified by their effect on the Kullback-Leibler divergence between the model's output and the token frequency distribution, and they shift the output closer to or further from that baseline.

The findings highlight the role of LayerNorm in indirectly modulating logit values and the importance of the effective null space of the unembedding matrix. Together, these results provide insight into how LLMs manage uncertainty and confidence, with implications for model calibration and safety.
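To make the entropy-neuron characterization concrete, the sketch below scores final-layer MLP neurons of GPT-2 small by output-weight norm and by the fraction of that weight lying in the effective null space of the unembedding matrix W_U. This is a minimal reconstruction based on the description above, not the authors' released code; the choice of GPT-2 small, the null-space size `k`, and the combined ranking score are assumptions made for the example.

```python
import torch
from transformers import GPT2LMHeadModel

# Minimal sketch (assumptions noted above): find entropy-neuron candidates in the
# last block of GPT-2 small. Candidates have large output-weight norm but write
# mostly into the effective null space of W_U, so their main influence on logits
# is indirect, via the final LayerNorm scale.

model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

W_U = model.lm_head.weight.detach()                          # (vocab_size, d_model)
# GPT-2 stores MLP weights as Conv1D: c_proj.weight is (d_mlp, d_model),
# so row i is the output direction w_out of neuron i in the final block.
w_out = model.transformer.h[-1].mlp.c_proj.weight.detach()   # (d_mlp, d_model)

# SVD of W_U; right singular vectors with the smallest singular values span the
# "effective null space" (directions the unembedding barely reads from).
U, S, Vh = torch.linalg.svd(W_U, full_matrices=False)        # Vh: (d_model, d_model)
k = 10                                  # assumed null-space size (hyperparameter)
V_null = Vh[-k:, :]                     # (k, d_model), smallest singular directions

norms = w_out.norm(dim=-1)                                       # (d_mlp,)
null_frac = (w_out @ V_null.T).norm(dim=-1) ** 2 / norms ** 2    # fraction of norm in null space

# Rank neurons: high norm combined with high null-space fraction flags candidates.
score = norms * null_frac
top = torch.topk(score, 10).indices
for i in top.tolist():
    print(f"neuron {i}: |w_out| = {norms[i].item():.2f}, "
          f"null-space fraction = {null_frac[i].item():.2f}")
```

The analogous diagnostic for token frequency neurons measures how far the model's next-token distribution sits from the corpus unigram distribution; a neuron whose ablation consistently shifts this divergence is a candidate. A small helper for the metric itself is sketched below (the add-one smoothing is an assumption, not specified by the paper):

```python
import torch

def kl_to_token_frequency(logits: torch.Tensor, freq_counts: torch.Tensor) -> torch.Tensor:
    """KL(p_model || p_freq) between the model's next-token distribution and the
    smoothed unigram token frequency distribution.

    logits: (vocab,) next-token logits; freq_counts: (vocab,) corpus unigram counts.
    """
    log_p_model = torch.log_softmax(logits, dim=-1)
    p_model = log_p_model.exp()
    # Add-one smoothed unigram distribution, in log space.
    log_p_freq = torch.log(freq_counts.float() + 1.0) - torch.log(
        freq_counts.float().sum() + freq_counts.numel()
    )
    return (p_model * (log_p_model - log_p_freq)).sum()

# Usage (hypothetical): compare the metric before and after ablating a neuron, e.g.
# kl = kl_to_token_frequency(model(input_ids).logits[0, -1], unigram_counts)
```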