This paper proposes a method for learning reward functions for robot skills using Large Language Models (LLMs) through a self-alignment process. The method aims to learn reward functions efficiently and without human intervention by leveraging the task-related knowledge embedded in LLMs. The approach has two main components: first, the LLM proposes the features and parameterization of the reward function; second, the parameters are iteratively updated through a self-alignment process that minimizes the ranking inconsistency between the LLM's preferences and the learned reward function, based on execution feedback.
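To make the first component concrete, the sketch below shows what an LLM-proposed reward parameterization might look like for a pick-and-place style task. The feature names, observation keys, and initial weights are illustrative assumptions rather than the paper's actual prompt output; the point is that the LLM fixes the feature code while exposing the weights as the parameters that self-alignment later tunes.

```python
import numpy as np

# A minimal sketch of an LLM-proposed reward parameterization for a hypothetical
# pick-and-place task. Feature names, observation keys, and initial weights are
# illustrative assumptions, not the paper's actual prompt output.

def reward(obs, params):
    """Weighted sum of LLM-proposed features; only the weights are tuned later."""
    reach_dist = np.linalg.norm(obs["gripper_pos"] - obs["object_pos"])  # reach the object
    place_dist = np.linalg.norm(obs["object_pos"] - obs["goal_pos"])     # move it to the goal
    grasped = float(obs["is_grasped"])                                   # grasp bonus
    return (-params["w_reach"] * reach_dist
            + params["w_grasp"] * grasped
            - params["w_place"] * place_dist)

# Initial weights proposed by the LLM; the self-alignment loop updates these values only.
init_params = {"w_reach": 1.0, "w_grasp": 1.0, "w_place": 1.0}
```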
The method was validated on nine tasks across two simulation environments, demonstrating consistent improvements in training efficacy and efficiency while consuming significantly fewer GPT tokens than alternative mutation-based methods. The framework uses a bi-level optimization structure: the inner loop induces the optimal policy from the current reward function, samples trajectories, and generates execution descriptions, while the outer loop updates the reward parameters by aligning the trajectory ranking induced by the learned reward with the ranking the LLM produces from the execution descriptions.
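The following sketch outlines this bi-level structure under stated assumptions: `train_policy`, `rollout`, `describe`, `llm_rank`, and `update_params` are hypothetical stand-ins for the RL trainer, the simulator, the trajectory-to-text summarizer, the LLM preference query, and the outer-loop parameter update; none of these names come from the paper.

```python
# A sketch of the bi-level loop, assuming hypothetical helpers: train_policy (RL
# trainer), rollout (simulator), describe (trajectory-to-text summary), llm_rank
# (LLM preference query), and update_params (outer-loop parameter update).

def self_alignment_loop(reward_fn, params, train_policy, rollout, describe,
                        llm_rank, update_params, n_outer=5):
    for _ in range(n_outer):
        # Inner loop: induce a policy under the current reward, sample
        # trajectories, and turn them into textual execution descriptions.
        policy = train_policy(lambda obs: reward_fn(obs, params))
        trajectories = rollout(policy)
        descriptions = [describe(tau) for tau in trajectories]

        # Outer loop: the LLM ranks the executions from their descriptions; the
        # reward parameters are then adjusted so that the ranking induced by the
        # learned reward agrees with the LLM's ranking.
        llm_ranking = llm_rank(descriptions)
        reward_ranking = sorted(
            range(len(trajectories)),
            key=lambda i: -sum(reward_fn(obs, params) for obs in trajectories[i]),
        )
        params = update_params(params, reward_ranking, llm_ranking)
    return params
```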
The method is compared with alternative unsupervised update methods such as Text2Reward and Eureka, and achieves better training efficacy and efficiency at lower token consumption. The self-alignment process is shown to consistently improve training performance by reducing numerical imprecision and instability and by actively adjusting the reward parameters based on LLM feedback, while remaining significantly more token-efficient than mutation-based methods.
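One plausible form for the ranking-inconsistency objective that such a parameter update could minimize is a pairwise hinge loss over the LLM's preferences; the version below is an illustrative assumption about the general shape, not the paper's exact loss, and could serve as the objective inside the `update_params` step of the earlier sketch.

```python
# An illustrative pairwise form of a ranking-inconsistency objective (an
# assumption about the general shape, not the paper's exact loss): for each pair
# (i, j) where the LLM prefers trajectory i over trajectory j, penalize the
# parameters whenever the learned reward scores j above i.

def ranking_inconsistency(params, reward_fn, trajectories, llm_pairs, margin=0.0):
    """llm_pairs: list of (i, j) index pairs with i preferred over j by the LLM."""
    loss = 0.0
    for i, j in llm_pairs:
        r_i = sum(reward_fn(obs, params) for obs in trajectories[i])
        r_j = sum(reward_fn(obs, params) for obs in trajectories[j])
        loss += max(0.0, margin - (r_i - r_j))  # hinge on the reward gap
    return loss
```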
The paper also discusses the limitations of the method, including the need for more informative feedback and the challenges of preference ranking in sparse-reward tasks. The framework is validated on six ManiSkill2 tasks and three Isaac Gym tasks, where it induces successful policies, achieves higher training success rates than the alternatives at a fraction of the token cost, and remains effective on multi-step tasks with complex reward functions.