29 Feb 2024 | Yifei Zhou¹, Andrea Zanette¹, Jiayi Pan¹, Sergey Levine¹ and Aviral Kumar²
ArCHer is a hierarchical multi-turn reinforcement learning (RL) framework for training large language models (LLMs) to perform goal-directed tasks requiring multiple interactions. Unlike single-turn RL methods, which focus on immediate rewards, ArCHer addresses the challenges of multi-turn interactions by combining a high-level off-policy RL algorithm that learns a value function over utterances with a low-level RL algorithm that trains a token-by-token policy within each utterance. This hierarchical approach enables efficient and effective training of LLMs for tasks such as web navigation, tool use, and customer support, where long-term planning and information gathering are essential.
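In outline, the hierarchy alternates two updates: an off-policy TD update on an utterance-level critic, and a token-level policy update guided by that critic. The sketch below is a minimal illustration under toy assumptions (random vectors stand in for LLM representations of histories and utterances; `UtteranceCritic`, `critic_td_loss`, and the batch keys are hypothetical names), not the authors' implementation.

```python
# Toy sketch of ArCHer's high-level, utterance-level critic (hypothetical names).
import torch
import torch.nn as nn
import torch.nn.functional as F

EMB = 64  # placeholder size for history/utterance representations

class UtteranceCritic(nn.Module):
    """High level: Q(history, utterance) and V(history) as small MLP heads (toy stand-ins)."""
    def __init__(self):
        super().__init__()
        self.q_head = nn.Sequential(nn.Linear(2 * EMB, 128), nn.ReLU(), nn.Linear(128, 1))
        self.v_head = nn.Sequential(nn.Linear(EMB, 128), nn.ReLU(), nn.Linear(128, 1))

    def q(self, hist, utt):
        return self.q_head(torch.cat([hist, utt], dim=-1)).squeeze(-1)

    def v(self, hist):
        return self.v_head(hist).squeeze(-1)

def critic_td_loss(critic, target_critic, batch, gamma=0.95):
    """TD(0) over utterance-level transitions (hist, utt, reward, next_hist, done)."""
    q = critic.q(batch["hist"], batch["utt"])
    v = critic.v(batch["hist"])
    with torch.no_grad():
        target_q = batch["reward"] + gamma * (1.0 - batch["done"]) * target_critic.v(batch["next_hist"])
        # In the full method the utterance here would be re-sampled from the current
        # policy; the buffer utterance is reused to keep the sketch short.
        target_v = target_critic.q(batch["hist"], batch["utt"])
    return F.mse_loss(q, target_q) + F.mse_loss(v, target_v)

# Usage with random placeholder data:
critic, target_critic = UtteranceCritic(), UtteranceCritic()
target_critic.load_state_dict(critic.state_dict())
batch = {
    "hist": torch.randn(8, EMB), "utt": torch.randn(8, EMB),
    "reward": torch.randn(8), "next_hist": torch.randn(8, EMB),
    "done": torch.zeros(8),
}
critic_td_loss(critic, target_critic, batch).backward()
```

Because this update bootstraps over turns rather than tokens, the horizon the critic must reason over is the number of utterances, not the number of tokens.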
The framework preserves the flexibility of existing single-turn RL methods while accommodating multiple turns, long horizons, and delayed rewards. ArCHer substantially improves sample efficiency and performance on multi-turn tasks, reaching roughly 100x the sample efficiency of existing on-policy methods, and it continues to benefit from scaling model capacity up to 7 billion parameters. The high-level critic's utterance-level value function guides the low-level token-level policy, reducing the need for direct reward modeling and enabling faster convergence.
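Continuing the toy sketch above, the low-level, token-by-token update can be read as an advantage-weighted policy gradient: the utterance-level advantage A(h, a) = Q(h, a) - V(h) from the critic is broadcast to every token of that utterance and weights the summed token log-probabilities. Shapes and names below are illustrative assumptions, not the authors' code.

```python
# Toy sketch of the low-level token actor loss, reusing UtteranceCritic above.
import torch

def actor_loss(token_logprobs, token_mask, hist, utt, critic):
    """Advantage-weighted policy gradient: one utterance-level advantage per turn,
    broadcast over that turn's tokens.
    token_logprobs, token_mask: [batch, seq_len]; hist, utt: [batch, EMB]."""
    with torch.no_grad():
        advantage = critic.q(hist, utt) - critic.v(hist)          # [batch]
    per_turn_logprob = (token_logprobs * token_mask).sum(dim=-1)  # [batch]
    return -(advantage * per_turn_logprob).mean()
```

In effect the critic's advantage acts as a learned, turn-level reward for the token policy, so the low-level learner never has to propagate value estimates across the full multi-turn horizon.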
ArCHer is evaluated on several tasks, including the Detective Game, Twenty Questions, Guess My City, and WebShop, where it outperforms methods such as token-level PPO, filtered BC, and CHAI in both sample efficiency and final performance. Its hierarchical design allows training on off-policy data, which improves stability and performance. The framework is also tested in an offline setting, where it uses implicit Q-learning (IQL) and advantage-weighted regression (AWR) losses to handle out-of-distribution actions and improve policy performance.
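The offline losses named above have standard forms: expectile regression for the value function (as in IQL) and an exponentiated-advantage-weighted log-likelihood for the actor (AWR). The sketch below shows those standard forms; the exact variants and hyperparameters (e.g., the expectile tau and temperature beta) used in ArCHer's offline experiments are assumptions here.

```python
# Standard-form sketches of the offline losses mentioned above (assumed hyperparameters).
import torch

def expectile_loss(q_detached, v, tau=0.7):
    """IQL-style value loss: asymmetric L2 pushes V toward an upper expectile of Q,
    avoiding Q-queries on out-of-distribution utterances."""
    diff = q_detached - v
    weight = torch.abs(tau - (diff < 0).float())
    return (weight * diff.pow(2)).mean()

def awr_loss(token_logprobs, token_mask, advantage, beta=1.0, clip=100.0):
    """AWR-style actor loss: log-likelihood of dataset utterances weighted by exp(A / beta)."""
    weights = torch.clamp(torch.exp(advantage / beta), max=clip).detach()
    per_turn_logprob = (token_logprobs * token_mask).sum(dim=-1)
    return -(weights * per_turn_logprob).mean()
```

Keeping the policy close to the data in this way is what lets the offline variant avoid exploiting out-of-distribution actions that the critic cannot evaluate reliably.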
Theoretical analysis shows that ArCHer requires weaker convergence conditions compared to off-policy methods and provides improved guarantees on statistical error. The framework's hierarchical structure enables the use of different methods at different levels, optimizing for the specific requirements of each level. Overall, ArCHer demonstrates the effectiveness of hierarchical RL for training LLMs in multi-turn tasks, achieving better performance and efficiency than existing methods.