24 Jun 2024 | Alessandro Stolfò, Ben Wu, Wes Gurnee, Yonatan Belinkov, Xingyi Song, Mrinmaya Sachan, Neel Nanda
This paper investigates the mechanisms by which large language models (LLMs) manage and regulate uncertainty in their predictions, focusing on two critical components: entropy neurons and token frequency neurons. Entropy neurons, characterized by high weight norms and minimal interaction with the unembedding matrix, influence the final layer normalization (LayerNorm) scale to modulate the entropy of the model's output distribution. Token frequency neurons, discovered in this study, adjust the model's output by aligning it closer to or further from the token frequency distribution. The authors demonstrate that entropy neurons operate by writing onto an effective null space of the unembedding matrix, allowing them to impact the residual stream norm with minimal direct effect on the logits. Token frequency neurons boost or suppress each token's logit based on its log frequency, shifting the output distribution. A case study on induction shows that entropy neurons act as a hedging mechanism, increasing entropy to manage confidence in repeated-sequence scenarios. The findings highlight the importance of these neurons in LLMs' confidence calibration and provide insights into their practical implications.
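To make the two mechanisms concrete, here is a minimal sketch (not the authors' code) of how one might score candidate neurons in a model's final MLP layer. It assumes access to the layer's output weights `W_out` (shape `d_mlp x d_model`), the unembedding matrix `W_U` (shape `d_model x d_vocab`), and a vector of log token frequencies `log_freq` from a reference corpus; all names and thresholds are illustrative assumptions, not the paper's exact procedure.

```python
# Illustrative sketch only: scoring candidate entropy neurons and token
# frequency neurons. Assumed inputs (not from the paper's released code):
#   W_out:    (d_mlp, d_model)  output weights of the final MLP layer
#   W_U:      (d_model, d_vocab) unembedding matrix
#   log_freq: (d_vocab,)        log token frequencies from a reference corpus
import numpy as np

def entropy_neuron_scores(W_out: np.ndarray, W_U: np.ndarray, k: int = 30) -> np.ndarray:
    """Fraction of each neuron's output-weight norm that lies in the effective
    null space of W_U (the k residual-stream directions with the smallest
    singular values). Scores near 1 suggest the neuron changes the residual
    stream norm (and hence the LayerNorm scale) with little direct logit effect."""
    # Left singular vectors of W_U give residual-stream directions,
    # ordered by descending singular value.
    U, S, _ = np.linalg.svd(W_U, full_matrices=False)   # U: (d_model, d_model)
    null_basis = U[:, -k:]                               # lowest-singular-value directions
    null_norm = np.linalg.norm(W_out @ null_basis, axis=1)
    total_norm = np.linalg.norm(W_out, axis=1) + 1e-12
    return null_norm / total_norm

def frequency_neuron_scores(W_out: np.ndarray, W_U: np.ndarray,
                            log_freq: np.ndarray) -> np.ndarray:
    """Cosine similarity between each neuron's direct logit effect (w_out @ W_U)
    and the mean-centered log token frequency vector. |score| near 1 suggests
    the neuron shifts the output toward or away from the frequency distribution."""
    logit_effect = W_out @ W_U                           # (d_mlp, d_vocab)
    f = log_freq - log_freq.mean()
    le = logit_effect - logit_effect.mean(axis=1, keepdims=True)
    return (le @ f) / (np.linalg.norm(le, axis=1) * np.linalg.norm(f) + 1e-12)
```

The choice of `k` (how many low-singular-value directions count as the "effective null space") is a free parameter in this sketch; in practice one would inspect the singular value spectrum of `W_U` and pick the clearly degenerate tail.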