10 Oct 2023 | Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, Shunyu Yao
Reflexion is a novel framework for verbal reinforcement learning that enables language agents to learn from trial and error by reflecting on feedback signals. Unlike traditional reinforcement learning methods that require extensive training samples and expensive model fine-tuning, Reflexion uses linguistic feedback to guide agents in improving their decision-making. The framework allows agents to maintain their own reflective text in an episodic memory buffer, which helps them make better decisions in subsequent trials. Reflexion is flexible and can incorporate various types and sources of feedback, including scalar values and free-form language. It shows significant improvements over baseline agents across diverse tasks, including sequential decision-making, coding, and language reasoning.
Reflexion agents verbally reflect on task feedback signals, then store this information in an episodic memory buffer to guide future actions. This self-reflective feedback acts as a semantic gradient signal, helping the agent learn from past mistakes. Reflexion is particularly effective in tasks that require reasoning, decision-making, and programming. For example, Reflexion achieves 91% pass@1 accuracy on the HumanEval coding benchmark, surpassing the previous state of the art set by GPT-4 at 80%. The paper also reports ablation studies across different feedback signals, feedback incorporation methods, and agent types, providing insight into how these factors affect performance.
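To make the mechanism concrete, here is a minimal sketch of how past reflections might be prepended to the prompt of the next trial. The function name, prompt format, and reflection window size are illustrative assumptions, not the paper's implementation:

```python
from typing import List

def build_prompt(task: str, reflections: List[str], max_reflections: int = 3) -> str:
    """Prepend the most recent verbal reflections to the task prompt.

    A small sliding window keeps the prompt within the context limit while
    still exposing lessons learned from earlier failed trials.
    """
    recent = reflections[-max_reflections:]
    memory_block = "\n".join(f"- {r}" for r in recent)
    return (
        "You have attempted this task before. Reflections from past trials:\n"
        f"{memory_block}\n\n"
        f"Task:\n{task}"
    )

# Usage: after each failed trial, append the self-reflection and retry.
reflections: List[str] = []
reflections.append("I opened the drawers first; next time check the countertop for the mug.")
prompt = build_prompt("Put a clean mug on the coffee machine.", reflections)
```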
Reflexion is implemented using three distinct models: an Actor, which generates text and actions; an Evaluator, which scores the outputs produced by the Actor; and a Self-Reflection model, which generates verbal reinforcement cues to assist the Actor in self-improvement. The framework has several advantages over traditional reinforcement learning approaches: it is lightweight, it allows for more nuanced feedback than a scalar reward, and it provides a more explicit and interpretable form of episodic memory. It also has disadvantages, such as relying on the LLM's self-evaluation capabilities and offering no formal guarantee of success.
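A high-level sketch of how the three components could interact across trials is shown below; `actor`, `evaluator`, and `self_reflect` stand in for LLM (or tool) calls and are assumptions for illustration rather than the reference implementation:

```python
def reflexion_loop(task, actor, evaluator, self_reflect, max_trials=5):
    """Run repeated trials, feeding verbal reflections back into the Actor.

    actor(task, memory)            -> trajectory (text/actions produced by the LLM)
    evaluator(task, trajectory)    -> (score, success) from internal or external feedback
    self_reflect(task, trajectory, score) -> verbal reflection string
    """
    memory = []  # episodic memory buffer of verbal reflections
    trajectory = None
    for trial in range(max_trials):
        trajectory = actor(task, memory)            # act, conditioned on past reflections
        score, success = evaluator(task, trajectory)
        if success:                                 # e.g. unit tests pass, answer is correct
            return trajectory
        # Convert the sparse feedback signal into a verbal "semantic gradient"
        reflection = self_reflect(task, trajectory, score)
        memory.append(reflection)
    return trajectory  # best effort after exhausting the trial budget
```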
Reflexion has been tested on decision-making, reasoning, and programming tasks. It outperforms strong baselines by an absolute margin of 22% on AlfWorld (decision-making), 20% on HotPotQA (reasoning), and 11% on HumanEval (programming), and improves performance by 14% on certain other reasoning tasks. In programming tasks, Reflexion agents achieve state-of-the-art results on several code-generation benchmarks. The paper also introduces LeetcodeHardGym, a code-generation RL gym environment consisting of 40 challenging Leetcode questions in 19 programming languages.
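For context on how these coding results are scored: code-generation benchmarks such as HumanEval are commonly evaluated with the unbiased pass@k estimator introduced alongside HumanEval (Chen et al., 2021), which for a single sample per problem reduces to the fraction of problems whose generated solution passes all unit tests. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples generated per problem, c of them correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# pass@1 with one sample per problem is simply the solve rate: 91% pass@1 on
# HumanEval means 91% of problems pass their unit tests on the first attempt.
assert abs(pass_at_k(n=1, c=1, k=1) - 1.0) < 1e-9
assert abs(pass_at_k(n=10, c=3, k=1) - 0.3) < 1e-9
```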
Reflexion has the potential to significantly improve the performance of language agents across decision-making, reasoning, and programming tasks. However, it also has limitations: it can struggle with tasks that demand substantial diversity and exploration, it depends on the LLM's ability to generate accurate self-reflections, and it offers no formal guarantee of success. Despite these limitations, Reflexion represents a promising approach to verbal reinforcement learning for language agents.