29 Feb 2024 | Yifei Zhou, Andrea Zanette, Jiayi Pan, Sergey Levine, and Aviral Kumar
The paper "ArCHer: Training Language Model Agents via Hierarchical Multi-Turn RL" addresses the challenge of training large language models (LLMs) for multi-turn decision-making tasks, which require intelligent interactions over multiple turns to accomplish a goal. Traditional reinforcement learning (RL) methods for LLMs focus on single-turn reward maximization, which are insufficient for multi-turn tasks. The authors propose a hierarchical RL framework called Actor-Critic Framework with a Hierarchical Structure (ArCHer) to address this issue. ArCHer combines a high-level off-policy RL algorithm that trains a value function to aggregate rewards over utterances and a low-level on-policy RL algorithm that uses this value function to train a token-by-token policy within each turn. This approach allows for efficient sample reuse and faster convergence while maintaining the flexibility of existing single-turn RL methods. Empirical results show that ArCHer significantly improves sample efficiency and performance compared to on-policy methods, achieving a 100x improvement over PPO. The framework is also shown to scale well with larger model capacities, up to 7 billion parameters. The paper includes theoretical analysis and ablation studies to support the effectiveness of ArCHer, demonstrating its robustness and scalability in various tasks.The paper "ArCHer: Training Language Model Agents via Hierarchical Multi-Turn RL" addresses the challenge of training large language models (LLMs) for multi-turn decision-making tasks, which require intelligent interactions over multiple turns to accomplish a goal. Traditional reinforcement learning (RL) methods for LLMs focus on single-turn reward maximization, which are insufficient for multi-turn tasks. The authors propose a hierarchical RL framework called Actor-Critic Framework with a Hierarchical Structure (ArCHer) to address this issue. ArCHer combines a high-level off-policy RL algorithm that trains a value function to aggregate rewards over utterances and a low-level on-policy RL algorithm that uses this value function to train a token-by-token policy within each turn. This approach allows for efficient sample reuse and faster convergence while maintaining the flexibility of existing single-turn RL methods. Empirical results show that ArCHer significantly improves sample efficiency and performance compared to on-policy methods, achieving a 100x improvement over PPO. The framework is also shown to scale well with larger model capacities, up to 7 billion parameters. The paper includes theoretical analysis and ablation studies to support the effectiveness of ArCHer, demonstrating its robustness and scalability in various tasks.