28 Mar 2025 | Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Xian Li, Sainbayar Sukhbaatar, Jing Xu, Jason Weston
This paper introduces Self-Rewarding Language Models (SRLMs), in which the language model generates and evaluates its own training data during training. The key idea is to use the model itself as a judge (via LLM-as-a-Judge prompting) to assign rewards to its own responses, enabling continued improvement without depending on a separate, frozen reward model trained from human feedback. The approach is iterative: the model generates new prompts, samples several candidate responses for each, and scores those candidates with its own rewards; the resulting preference pairs are then used to train the model further via Direct Preference Optimization (DPO), improving both its instruction-following and its reward-modeling abilities.
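One iteration of this loop can be sketched as below. This is a minimal illustration, not the paper's implementation: `generate_response` and `judge_score` are hypothetical stand-ins for sampling from the language model and for the paper's LLM-as-a-Judge scoring prompt (which rates responses on a 0–5 scale), and the returned preference pairs would then feed a DPO training step.

```python
import random

def generate_response(model, prompt):
    # Placeholder: a real system would sample a completion from the LM.
    # The random tag just makes candidates distinguishable in this sketch.
    return f"response-{random.randrange(1000)} to: {prompt}"

def judge_score(model, prompt, response):
    # Placeholder: the paper prompts the same model to rate its own
    # response on a 0-5 additive rubric; here we return a random score.
    return random.uniform(0, 5)

def self_rewarding_iteration(model, prompts, n_candidates=4):
    """One self-rewarding iteration: sample candidates for each prompt,
    self-judge them, and keep (chosen, rejected) pairs for DPO training."""
    pairs = []
    for prompt in prompts:
        candidates = [generate_response(model, prompt) for _ in range(n_candidates)]
        scored = sorted(
            ((judge_score(model, prompt, c), c) for c in candidates),
            key=lambda sc: sc[0],
            reverse=True,
        )
        # Highest-scored candidate becomes "chosen", lowest "rejected";
        # pairs with tied scores are skipped, as in the paper.
        if scored[0][0] > scored[-1][0]:
            pairs.append({
                "prompt": prompt,
                "chosen": scored[0][1],
                "rejected": scored[-1][1],
            })
    return pairs
```

After DPO training on these pairs, the updated model plays both roles (generator and judge) in the next iteration, which is what lets instruction-following and reward-modeling quality improve together.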
The study shows that fine-tuning Llama 2 70B with three iterations of this self-rewarding approach outperforms existing models like Claude 2, Gemini Pro, and GPT-4 0613 on the AlpacaEval 2.0 leaderboard. The model's ability to generate and evaluate its own training data allows it to continuously improve, leading to better performance in both instruction-following and reward modeling tasks. The method also shows improvements in various NLP benchmarks and in human evaluations, where later iterations of the model outperform the baseline.
The paper highlights the potential of self-rewarding models to surpass traditional methods by continuously improving through self-generated data. However, it also notes that the effectiveness of this approach may be limited in real-world scenarios where human preferences are the primary training signal. The study suggests that further research is needed to explore the scalability and limitations of this approach, particularly in terms of safety and the long-term effects of iterative training. Overall, the work presents a promising direction for improving large language models through self-rewarding mechanisms.