Recursive Introspection: Teaching Language Model Agents How to Self-Improve


26 Jul 2024 | Yuxiao Qu, Tianjun Zhang, Naman Garg, Aviral Kumar
RISE (Recursive Introspection) enables large language models (LLMs) to self-improve over multiple turns by iteratively fine-tuning them to correct their own mistakes. This addresses a key limitation of current LLMs, which often fail to improve their responses sequentially even when explicitly told they are wrong.

RISE treats the problem as a multi-turn Markov decision process (MDP) in which the model learns to improve its responses through iterative data collection and training. It fine-tunes the model with reward-weighted regression, leveraging both successful and unsuccessful rollouts to strengthen its ability to self-correct, and it combines principles from online imitation learning and reinforcement learning, using either oracle responses or self-generated data to guide the model. At inference time, RISE is effective in both "with oracle" and "without oracle" modes; the latter selects a final answer by majority voting over multiple responses (a sketch of this mode is given below).

On math reasoning tasks, RISE outperforms single-turn strategies, achieving significant improvements on GSM8K and MATH. It scales well with more capable base models, generalizes to out-of-distribution prompts, and consistently improves performance over successive turns, outperforming approaches such as Self-Refine and GLoRE. The algorithm's design is crucial for enabling self-improvement: the model learns to correct its mistakes through iterative training rather than relying solely on expert supervision, and it remains effective with both self-generated and oracle data.
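The "without oracle" mode can be pictured as a simple multi-turn loop with a final vote. The sketch below is illustrative only: it assumes a hypothetical chat-style `model.generate(history)` API and a placeholder `extract_answer` parser, and the actual retry prompts and answer extraction used in RISE may differ.

```python
from collections import Counter

def extract_answer(response: str) -> str:
    """Placeholder parser (assumption): treat the last line of the
    model's response as its short final answer."""
    return response.strip().splitlines()[-1].strip()

def rise_inference_without_oracle(model, problem: str, num_turns: int = 5) -> str:
    """Sketch of the 'without oracle' mode: the model revises its answer
    over several turns with no external verifier, and the final answer is
    chosen by majority vote over the candidates from all turns."""
    history = [{"role": "user", "content": problem}]
    candidates = []
    for _ in range(num_turns):
        response = model.generate(history)  # assumed chat-style generate(history) -> str
        candidates.append(extract_answer(response))
        history.append({"role": "assistant", "content": response})
        # No oracle feedback is available, so simply ask the model to reconsider.
        history.append({
            "role": "user",
            "content": "Your previous answer may be incorrect. "
                       "Reflect on it and provide an improved solution.",
        })
    # Majority voting over the answers collected across turns.
    return Counter(candidates).most_common(1)[0][0]
```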
The results highlight the importance of multi-turn interaction history, weighted objectives, and on-policy data in training RISE, showing that it can significantly enhance the self-improvement capabilities of LLMs.
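The "weighted objectives" mentioned above refer to reward-weighted regression over multi-turn rollouts. The following is a minimal sketch of a generic reward-weighted regression update under stated assumptions, not a reproduction of the paper's exact objective: it assumes a PyTorch-style autograd model exposing a hypothetical `log_prob(context, response)` helper, and it weights each example by an exponentiated reward so successful rollouts dominate while unsuccessful ones still contribute signal.

```python
import math
from dataclasses import dataclass
from typing import List

@dataclass
class TurnExample:
    """One example from a multi-turn rollout: the conversation so far
    (problem, prior attempts, feedback), a candidate improved response,
    and a scalar reward (e.g. 1.0 if the final answer is correct)."""
    context: str
    response: str
    reward: float

def reward_weighted_regression_step(model, optimizer,
                                    batch: List[TurnExample],
                                    temperature: float = 1.0):
    """Sketch of one reward-weighted regression update: minimize the
    negative log-likelihood of each improved response, weighted by
    exp(reward / temperature) and normalized over the batch."""
    weights = [math.exp(ex.reward / temperature) for ex in batch]
    total = sum(weights)
    loss = 0.0
    for ex, w in zip(batch, weights):
        # model.log_prob is an assumed helper returning the summed token
        # log-probability of `response` given `context`.
        loss = loss - (w / total) * model.log_prob(ex.context, ex.response)
    optimizer.zero_grad()
    loss.backward()   # assumes a PyTorch-style differentiable loss
    optimizer.step()
    return loss
```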