This paper explores self-repair in large language models (LLMs), a phenomenon where components of the model compensate for the removal of other components. The study demonstrates that self-repair occurs across various model families and sizes when individual attention heads are ablated on the full training distribution. However, self-repair is imperfect and noisy, as the original direct effect of the ablated head is not fully restored, and the degree of self-repair varies significantly across different prompts. Two mechanisms contributing to self-repair are identified: changes in the final LayerNorm scaling factor and sparse sets of neurons implementing Anti-Erasure. The results have implications for interpretability practitioners, highlighting the complexity of self-repair and the need for further investigation.
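To make the measurement concrete, here is a minimal numeric sketch of how self-repair is typically quantified for a single head on a single prompt. The logit values are invented for illustration; the definition (comparing the observed post-ablation logit against the logit one would predict by simply subtracting the head's direct effect) follows the framing described above, not the paper's actual code.

```python
# Toy illustration (not the paper's code) of how self-repair is quantified.
# Assume we have, for one prompt, the correct-token logit before and after
# ablating a single attention head, plus that head's original direct effect
# (its contribution to the logit via the residual stream and unembedding).

clean_logit = 12.0          # correct-token logit on the clean forward pass
ablated_logit = 10.5        # correct-token logit after ablating the head
direct_effect = 3.0         # the head's original direct effect on that logit

# If nothing downstream reacted, ablation would remove the full direct effect:
expected_logit_without_repair = clean_logit - direct_effect   # 9.0

# Self-repair is the gap between what actually happened and that naive prediction.
self_repair = ablated_logit - expected_logit_without_repair   # 1.5
self_repair_fraction = self_repair / direct_effect            # 0.5, i.e. 50% "repaired"

print(f"self-repair: {self_repair:.2f} logits "
      f"({self_repair_fraction:.0%} of the direct effect)")
```

Averaged over a distribution of prompts this fraction is positive but far from one, and it varies widely from prompt to prompt, which is what the paper means by imperfect and noisy self-repair.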
Self-repair is also imprecise: the lost direct effect is usually only partially restored, and on some prompts the compensation overshoots the original effect. The final LayerNorm scaling factor plays a significant role, since removing a head's output changes that factor and thereby rescales the direct effects of all remaining components. In addition, sparse sets of neurons in downstream MLP layers perform Anti-Erasure, further compensating for the removed component.
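As a rough illustration of the LayerNorm mechanism only (Anti-Erasure is not modeled here), the sketch below shows how deleting a head's output can shrink the final LayerNorm's scaling factor, which in turn inflates the direct effects of everything left in the residual stream. All vectors, dimensions, and the alignment between components are invented, and the LayerNorm omits the learned gain and bias for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64

# Invented residual-stream pieces: the summed output of all other components,
# a head whose output partly overlaps with them, and a correct-token unembedding
# direction chosen to overlap with the remaining components.
resid_others = rng.normal(size=d_model)
head_output = 0.5 * resid_others + rng.normal(scale=0.3, size=d_model)
unembed_dir = resid_others / np.linalg.norm(resid_others) + rng.normal(scale=0.3, size=d_model)

def ln_scale(x):
    # Final LayerNorm scaling factor: std of the centered residual stream.
    centered = x - x.mean()
    return centered.std()

clean_resid = resid_others + head_output
ablated_resid = resid_others                      # head's contribution removed

# A component's direct effect is its centered output, divided by the LN scale,
# projected onto the unembedding direction. The scale is a shared denominator,
# so when it shrinks after ablation, the surviving direct effects all grow.
scale_clean, scale_ablated = ln_scale(clean_resid), ln_scale(ablated_resid)
centered_others = resid_others - resid_others.mean()
de_others_clean = centered_others @ unembed_dir / scale_clean
de_others_ablated = centered_others @ unembed_dir / scale_ablated

print(f"final LN scale: {scale_clean:.2f} (clean) -> {scale_ablated:.2f} (head ablated)")
print(f"remaining components' direct effect: {de_others_clean:.2f} -> {de_others_ablated:.2f}")
```

The boost here comes purely from the changed denominator; no downstream component has altered its behavior, which is why the paper treats LayerNorm rescaling as a distinct, somewhat mechanical source of apparent self-repair.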
The paper also discusses the Iterative Inference Hypothesis, which holds that models build up their final logits gradually, with many components each contributing a partial update toward the prediction. This view is supported by evidence that certain attention heads can signal downstream heads not to perform a task, and that some heads appear to be self-reinforcing or self-repressing. Understanding self-repair matters for the interpretability, safety, and control of these models, since ablation-based analyses can misattribute behavior when compensation goes unaccounted for. The findings indicate that self-repair is a complex phenomenon driven by multiple mechanisms, including LayerNorm scaling and sparse-neuron Anti-Erasure, and that further research is needed to fully explain it.
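The sketch below shows the accounting that underlies this incremental picture: holding the final LayerNorm scale fixed, the correct-token logit decomposes into per-component direct effects that accumulate as one moves through the model. This is a simplified direct-logit-attribution view, not the paper's implementation, and the shapes and values are invented.

```python
import numpy as np

rng = np.random.default_rng(1)
n_components, d_model = 6, 32

# One row per component (attention head or MLP layer) writing into the residual stream,
# plus an invented unembedding direction for the correct token.
component_outputs = rng.normal(size=(n_components, d_model))
unembed_dir = rng.normal(size=d_model)

# Final residual stream is the sum of component outputs; its std is the LN scale,
# treated as a fixed constant for this linear decomposition.
resid_final = component_outputs.sum(axis=0)
scale = (resid_final - resid_final.mean()).std()

# Direct logit attribution: each component's centered output, divided by the scale
# and projected onto the unembedding direction, is its contribution to the logit.
centered = component_outputs - component_outputs.mean(axis=1, keepdims=True)
direct_effects = centered @ unembed_dir / scale

# The logit is built up incrementally as components are added in order.
running_logit = np.cumsum(direct_effects)
print("per-component direct effects:", np.round(direct_effects, 2))
print("logit built up across components:", np.round(running_logit, 2))
```

Under this decomposition, ablating one component removes one term from the running sum, and self-repair corresponds to the remaining terms shifting so that the total moves back toward its original value.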