Easy-to-Hard Generalization: Scalable Alignment Beyond Human Supervision


10 Dec 2024 | Zhiqing Sun1*, Longhui Yu2*, Yikang Shen3, Weiyang Liu4,5, Yiming Yang1†, Sean Welleck1†, Chuang Gan3,6†
The paper "Easy-to-Hard Generalization: Scalable Alignment Beyond Human Supervision" addresses the challenge of aligning AI systems with human goals when those goals exceed human capabilities. The authors propose an approach called *easy-to-hard generalization*, which leverages human annotations on easier tasks to improve the performance of AI systems on harder tasks. This is particularly useful for complex reasoning tasks, such as solving mathematical problems, where human supervision of the hardest problems is limited. The key insight is that an evaluator (reward model) trained on human annotations for easier tasks can score candidate solutions to harder tasks, facilitating easy-to-hard generalization.

Concretely, the authors train process-supervised reward models on easy tasks and then use them to evaluate policy models on hard tasks. They demonstrate that this enables easy-to-hard generalization in generators through re-ranking or reinforcement learning (RL). Notably, their process-supervised 7b RL model and 34b model achieve significant improvements on the MATH500 dataset, despite only using human supervision on easy tasks. The paper also compares different re-ranking strategies and RL algorithms, finding that weighted voting and PPO (Proximal Policy Optimization) outperform the alternatives. The authors further extend their findings to the coding domain, showing that the approach applies to domains beyond mathematics.
Overall, the study advances the field of AI alignment by demonstrating the potential of easy-to-hard generalization, suggesting a scalable method for developing AI systems capable of advancing beyond human capabilities. However, the authors also acknowledge limitations and the need for further research on long-term implications and robustness.
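For illustration, the sketch below shows how a process-supervised reward model trained on easy problems might be used to re-rank sampled solutions to harder problems via weighted voting, the strategy the paper finds most effective. The helper names (`prm_step_scores`, `final_answer`) and the choice of min-aggregation over step scores are assumptions made for this sketch, not the paper's exact implementation.

```python
# Hypothetical sketch of reward-model-guided weighted voting: a process
# reward model (PRM) trained on easy problems scores each step of candidate
# solutions to a hard problem, and votes are weighted by the aggregated score.
from collections import defaultdict
from typing import Callable, List


def aggregate_step_scores(step_scores: List[float]) -> float:
    """Collapse per-step PRM scores into one solution-level score.

    Taking the minimum step score is one common choice: a solution is only
    as trustworthy as its weakest reasoning step.
    """
    return min(step_scores) if step_scores else 0.0


def weighted_vote(
    candidates: List[str],
    prm_step_scores: Callable[[str], List[float]],
    final_answer: Callable[[str], str],
) -> str:
    """Pick the answer whose supporting solutions carry the most PRM weight.

    candidates:       sampled chain-of-thought solutions for one hard problem
    prm_step_scores:  returns a per-step score in [0, 1] for a solution
    final_answer:     extracts the final answer string from a solution
    """
    answer_weight = defaultdict(float)
    for solution in candidates:
        score = aggregate_step_scores(prm_step_scores(solution))
        answer_weight[final_answer(solution)] += score
    # The answer with the largest total PRM-weighted support wins the vote.
    return max(answer_weight, key=answer_weight.get)
```

Weighted voting combines the robustness of majority voting with the reward model's judgment: an answer wins not just by appearing often, but by being supported by solutions the easy-task-trained evaluator rates highly.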