March 11-14, 2024 | Simon Holk, Daniel Marta, Iolanda Leite
PREDILECT is a framework that uses zero-shot language-based reasoning to improve preference-based reinforcement learning (RL). It leverages large language models (LLMs) to extract additional information from human preferences and the text explanations people provide alongside them, enabling more accurate reward function learning. By combining preferences with these explanations, PREDILECT reduces the amount of human feedback required and increases the granularity of the reward model. Concretely, the LLM maps each text explanation to relevant features, which are then used to highlight the key state-action pairs that should inform the reward function. This helps mitigate causal confusion and better aligns robot behavior with human preferences.
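To make the mechanism concrete, below is a minimal sketch (in PyTorch, not the authors' code) of how text-derived emphasis could enter a standard Bradley-Terry preference loss. The per-step weights stand in for the state-action pairs the LLM highlights from the explanation; the reward model architecture and the weighting scheme are illustrative assumptions.

```python
# Minimal sketch: preference learning over trajectory segments, where
# optional LLM-derived per-step weights emphasize highlighted state-action
# pairs. This is an illustration of the idea, not PREDILECT's implementation.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        # Per-step reward for each (state, action) pair in a segment.
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)

def preference_loss(reward_model, seg_a, seg_b, pref, w_a=None, w_b=None):
    """Bradley-Terry loss over two trajectory segments.

    seg_*: (obs, act) tensors of shape (T, obs_dim) / (T, act_dim)
    pref : 1.0 if segment A is preferred, 0.0 if segment B is preferred
    w_*  : optional per-step weights (T,), a stand-in for the state-action
           pairs the LLM highlights from the human's text explanation.
    """
    r_a = reward_model(*seg_a)
    r_b = reward_model(*seg_b)
    if w_a is not None:
        r_a = r_a * w_a  # emphasize the steps the explanation points to
    if w_b is not None:
        r_b = r_b * w_b
    logits = torch.stack([r_a.sum(), r_b.sum()])
    target = torch.tensor([pref, 1.0 - pref])
    return -(target * torch.log_softmax(logits, dim=0)).sum()
```

When a query comes with a preference but no explanation, the weights simply default to uniform and the loss reduces to ordinary preference-based reward learning.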
In simulated and real-world experiments, PREDILECT demonstrated superior performance compared to traditional preference-based RL. In simulated environments, it achieved faster convergence with fewer queries, and in a social navigation scenario, it produced policies that better aligned with human preferences. The LLM's ability to extract relevant features from text descriptions was validated, showing high accuracy in identifying features, sentiments, and magnitudes. However, there were instances where the LLM misinterpreted features or introduced false positives, highlighting the need for careful prompt design.
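The extraction step itself can be pictured as a zero-shot prompt-and-parse loop. The sketch below is illustrative only: `query_llm`, the feature list, the prompt wording, and the JSON schema are assumptions, not the paper's actual prompt or interface.

```python
# Hedged sketch of zero-shot feature extraction: the LLM maps a free-text
# explanation to (feature, sentiment, magnitude) annotations, which are then
# filtered against the known feature set to guard against false positives.
import json

FEATURES = ["distance_to_human", "speed", "path_smoothness"]  # example feature set

PROMPT_TEMPLATE = (
    "Given the robot features {features}, read the user's explanation and "
    "return a JSON list of objects with keys 'feature', 'sentiment' "
    "('positive' or 'negative'), and 'magnitude' (a number from 0.0 to 1.0).\n"
    "Explanation: \"{text}\""
)

def extract_feature_annotations(text: str, query_llm) -> list[dict]:
    prompt = PROMPT_TEMPLATE.format(features=FEATURES, text=text)
    raw = query_llm(prompt)  # zero-shot call to any LLM backend; no fine-tuning assumed
    try:
        annotations = json.loads(raw)
    except json.JSONDecodeError:
        return []  # malformed output: fall back to preference-only learning
    # Keep only features that can be mapped back to state-action pairs,
    # discarding hallucinated or misinterpreted entries.
    return [a for a in annotations if a.get("feature") in FEATURES]
```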
The framework's integration of textual explanations allows for more nuanced policy learning, focusing on specific objectives rather than generic preferences. This approach is particularly beneficial in social navigation tasks, where safety and human interaction are critical. Overall, PREDILECT shows promise in improving human-robot interaction by leveraging natural language to refine reward functions and enhance policy learning.