Explorations of Self-Repair in Language Models

26 May 2024 | Cody Rushing, Neel Nanda
This paper explores self-repair in large language models: when a component such as an attention head is ablated, downstream components change their behavior in ways that compensate for the loss. The authors build on previous research that identified self-repair on narrow distributions and extend it to a broader range of model families and sizes. They find that self-repair is imperfect, in that the ablated component's original effect is not fully restored, and noisy, in that the amount of repair varies considerably. A significant fraction of the self-repair is attributed to two mechanisms: changes in the LayerNorm scaling factor, and sparse sets of neurons implementing Anti-Erasure. The paper discusses the implications of these findings for interpretability practitioners and discusses the Iterative Inference hypothesis, under which self-repair may arise as a side effect of the model's internal mechanisms. The authors also highlight the challenges that self-repair poses for interpretability efforts, particularly circuit analysis, and offer suggestions for mitigating them.
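The LayerNorm contribution described in the summary can be made concrete with a toy calculation. The sketch below is illustrative only and is not the authors' code: the 4-dimensional vectors, the names `head`, `rest`, and `unembed_dir`, and the use of zero-ablation are all assumptions chosen for simplicity (the paper's own ablation procedure and models may differ), and the learned LayerNorm gain and bias are omitted. It shows that removing one component shrinks the LayerNorm scaling factor, which rescales the contribution of the remaining components and recovers part of the lost logit.

```python
import math

def centered(vec):
    """Subtract the mean, as LayerNorm does before scaling."""
    mean = sum(vec) / len(vec)
    return [x - mean for x in vec]

def layernorm_scale(resid, eps=1e-5):
    """Standard deviation LayerNorm divides by (learned gain/bias omitted)."""
    c = centered(resid)
    return math.sqrt(sum(x * x for x in c) / len(c) + eps)

def direct_effect(component, scale, unembed_dir):
    """Contribution of one component to a logit, given a fixed scaling factor."""
    return sum(c * w for c, w in zip(centered(component), unembed_dir)) / scale

# Hypothetical toy values: one attention head's output, everything else in the
# residual stream, and the unembedding direction of the correct token.
head = [2.0, -2.0, 0.0, 0.0]
rest = [1.0, -1.0, 0.5, -0.5]
unembed_dir = [1.0, -1.0, 0.0, 0.0]

clean_resid = [h + r for h, r in zip(head, rest)]
scale_clean = layernorm_scale(clean_resid)
scale_ablated = layernorm_scale(rest)  # zero-ablation: the head's output is removed

logit_clean = (direct_effect(head, scale_clean, unembed_dir)
               + direct_effect(rest, scale_clean, unembed_dir))
logit_ablated = direct_effect(rest, scale_ablated, unembed_dir)

# Drop we would expect if the scaling factor stayed frozen at its clean value.
expected_drop = direct_effect(head, scale_clean, unembed_dir)
actual_drop = logit_clean - logit_ablated
# The gap between the two is the repair attributable to the changed scale.
layernorm_repair = expected_drop - actual_drop

print(f"scale: clean={scale_clean:.3f}, ablated={scale_ablated:.3f}")
print(f"expected drop={expected_drop:.3f}, actual drop={actual_drop:.3f}")
print(f"repair from LayerNorm rescaling={layernorm_repair:.3f}")
```

In this toy setup the repair is positive but smaller than the expected drop, which mirrors the qualitative claim that self-repair is partial rather than complete. The specific magnitudes carry no meaning beyond the example.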