28 May 2024 | Yifei Wang*, Yuyang Wu*, Zeming Wei, Stefanie Jegelka, Yisen Wang
This paper develops a theoretical understanding of self-correction in large language models (LLMs) through the lens of in-context alignment. The authors argue that LLMs can improve their own responses by self-examining and correcting them, a capability previously thought to be unique to humans. Analyzing this process from an in-context learning perspective, they show that when LLMs receive reasonably accurate self-examinations as reward signals, they can refine their responses entirely in context. The analysis highlights the importance of key transformer components, namely softmax attention, multi-head attention, and the MLP block, in realizing this self-correction ability. Extensive synthetic experiments validate these findings, demonstrating that transformers can learn from noisy outputs when guided by accurate critics. The authors also examine real-world applications, such as alleviating social bias and defending against jailbreak attacks, where intrinsic self-correction yields significant improvements. The paper concludes with insights into the mechanisms of self-correction and directions for further enhancing LLMs' alignment capabilities.
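To make the described mechanism concrete, here is a minimal, hypothetical sketch of the intrinsic self-correction loop the summary outlines: the model's self-examination acts as a reward, and prior (response, reward) pairs are placed back into the context so the next generation can condition on them. The `llm` and `critic` callables, the reward threshold, and all names below are illustrative assumptions, not the paper's actual implementation.

```python
# Sketch of intrinsic self-correction via in-context alignment (assumed API):
# generate a response, self-examine it to obtain a reward-like critique,
# append (response, reward) to the context, and regenerate. The loop
# structure, not the model call, is the point of this illustration.

from typing import Callable, List, Tuple

def self_correct(
    llm: Callable[[str], str],             # prompt -> completion (hypothetical)
    critic: Callable[[str, str], float],   # (prompt, response) -> reward in [0, 1]
    prompt: str,
    rounds: int = 3,
) -> str:
    """Refine a response in-context using the model's own critiques."""
    history: List[Tuple[str, float]] = []
    response = llm(prompt)
    for _ in range(rounds):
        reward = critic(prompt, response)  # self-examination as a reward signal
        history.append((response, reward))
        if reward >= 0.9:                  # assumed stopping threshold
            break
        # Pack prior attempts and their rewards back into the context so the
        # next generation can improve on them (the in-context alignment step).
        context = prompt + "\n\nPrevious attempts (response, reward):\n"
        context += "\n".join(f"- ({r!r}, {s:.2f})" for r, s in history)
        context += "\nGive an improved response:"
        response = llm(context)
    return response
```

In this reading, no weights are updated: the critic's scores only enter through the prompt, which is what distinguishes in-context alignment from fine-tuning-based correction.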