28 Mar 2025 | Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Xian Li, Sainbayar Sukhbaatar, Jing Xu, Jason Weston
This paper introduces Self-Rewarding Language Models (SRLMs), in which the language model generates and evaluates its own training data during training. The key idea is to use the model itself as a judge (via LLM-as-a-Judge prompting) to assign rewards to its own responses, enabling continued improvement without depending on a separate, frozen reward model trained from human feedback. The approach is iterative: the model generates new prompts, samples several candidate responses for each, and scores those candidates with its own rewards; the resulting preference pairs are then used to train the model further via Direct Preference Optimization (DPO), improving both its instruction-following and its reward-modeling abilities.
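One iteration of this loop can be sketched as below. This is a minimal illustration, not the paper's implementation: `generate_response` and `judge_score` are hypothetical stand-ins for sampling from the language model and for the paper's LLM-as-a-Judge scoring prompt (which rates responses on a 0–5 scale), and the returned preference pairs would then feed a DPO training step.

```python
import random

def generate_response(model, prompt):
    # Placeholder: a real system would sample a completion from the LM.
    # The random tag just makes candidates distinguishable in this sketch.
    return f"response-{random.randrange(1000)} to: {prompt}"

def judge_score(model, prompt, response):
    # Placeholder: the paper prompts the same model to rate its own
    # response on a 0-5 additive rubric; here we return a random score.
    return random.uniform(0, 5)

def self_rewarding_iteration(model, prompts, n_candidates=4):
    """One self-rewarding iteration: sample candidates for each prompt,
    self-judge them, and keep (chosen, rejected) pairs for DPO training."""
    pairs = []
    for prompt in prompts:
        candidates = [generate_response(model, prompt) for _ in range(n_candidates)]
        scored = sorted(
            ((judge_score(model, prompt, c), c) for c in candidates),
            key=lambda sc: sc[0],
            reverse=True,
        )
        # Highest-scored candidate becomes "chosen", lowest "rejected";
        # pairs with tied scores are skipped, as in the paper.
        if scored[0][0] > scored[-1][0]:
            pairs.append({
                "prompt": prompt,
                "chosen": scored[0][1],
                "rejected": scored[-1][1],
            })
    return pairs
```

After DPO training on these pairs, the updated model plays both roles (generator and judge) in the next iteration, which is what lets instruction-following and reward-modeling quality improve together.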
The study shows that fine-tuning Llama 2 70B with three iterations of this self-rewarding approach outperforms existing models like Claude 2, Gemini Pro, and GPT-4 0613 on the AlpacaEval 2.0 leaderboard. The model's ability to generate and evaluate its own training data allows it to continuously improve, leading to better performance in both instruction-following and reward modeling tasks. The method also shows improvements in various NLP benchmarks and in human evaluations, where later iterations of the model outperform the baseline.
The paper highlights the potential of self-rewarding models to surpass traditional methods by continuously improving through self-generated data. However, it also notes that the effectiveness of this approach may be limited in real-world scenarios where human preferences are the primary training signal. The study suggests that further research is needed to explore the scalability and limitations of this approach, particularly in terms of safety and the long-term effects of iterative training. Overall, the work presents a promising direction for improving large language models through self-rewarding mechanisms.