28 May 2024 | Yifei Wang*, Yuyang Wu*, Zeming Wei, Stefanie Jegelka, Yisen Wang
This paper develops a theoretical account of self-correction in large language models (LLMs) by framing it as in-context alignment. In a simplified setup that mirrors an alignment task, the authors show that when an LLM supplies reasonably accurate self-examinations as reward signals, it can refine its own responses in context. The analysis identifies the roles of key transformer components (softmax attention, multi-head attention, and the MLP block) in enabling this behavior: reward quality matters, softmax attention is needed for ranking candidate responses, and multi-head attention supports token discrimination. Theoretically, the paper shows that transformers can optimize common alignment objectives through in-context learning. The findings are validated on synthetic datasets and carried over to real-world settings, where self-correction substantially lowers the success rate of jailbreak attacks and helps alleviate social bias. Together, the results provide a theoretical foundation for understanding and enhancing self-correction in LLMs.
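To make the ranking role of softmax attention concrete, here is a minimal numerical sketch (my own toy construction, not the paper's proof or notation): in-context candidate responses are stored as value vectors, their self-evaluation rewards act as attention scores, and a scaling factor `beta` controls the sharpness of the softmax. As `beta` grows, the attention output converges to the highest-reward candidate, i.e., the attention layer performs a soft argmax over rewards.

```python
import numpy as np

# Toy illustration (assumed setup, not the paper's exact construction):
# softmax attention over in-context (response, reward) pairs approximates
# "select the highest-reward response" as the attention sharpness grows.

rng = np.random.default_rng(0)

n, d = 5, 8                                      # 5 candidate responses, embedding dim 8
responses = rng.normal(size=(n, d))              # value vectors: candidate responses
rewards = np.array([0.10, 0.70, 0.30, 0.95, 0.50])  # hypothetical self-evaluation scores


def attention_refine(beta: float) -> np.ndarray:
    """One attention read: attention scores are the (scaled) rewards,
    values are the candidate responses; returns their softmax-weighted mix."""
    scores = beta * rewards
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                     # softmax over the context
    return weights @ responses


best = responses[rewards.argmax()]
for beta in (1.0, 10.0, 100.0):
    out = attention_refine(beta)
    # distance to the best candidate shrinks as beta increases
    print(f"beta={beta:6.1f}  dist_to_best={np.linalg.norm(out - best):.4f}")
```

Running the sketch shows the output drifting toward the top-reward response as `beta` increases, which is the intuition behind why softmax (rather than linear) attention is needed for the ranking step in the paper's analysis.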