2024 | Seohong Park, Tobias Kreiman, Sergey Levine
This paper introduces Hilbert Foundation Policies (HILPs), an unsupervised pre-training framework for generalist policies that can adapt to a wide range of downstream tasks. HILPs learn a structured Hilbert representation of the state space that preserves the temporal structure of the environment: states are embedded so that distances in the latent space approximate optimal temporal distances between states. A latent-conditioned policy is then trained to move in arbitrary directions of this latent space, which lets it capture diverse, long-horizon behaviors from unlabeled offline data and be adapted to new tasks by zero-shot prompting with a task-specific latent vector.
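To make the pre-training objectives concrete, here is a minimal sketch of the two training signals described above, not the authors' implementation. The network sizes, observation dimension, expectile parameter, and the exact goal-reached check are illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of the two HILP training signals:
# (1) a Hilbert representation phi whose distances track temporal distances,
# (2) a directional intrinsic reward for the latent-conditioned policy.
import torch
import torch.nn as nn

obs_dim, d = 17, 32                      # illustrative sizes
phi = nn.Sequential(                     # state encoder phi: S -> R^d
    nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, d)
)

def value(s, g):
    # Parameterize the goal-conditioned value as the negative latent distance,
    # so that -V(s, g) estimates the temporal distance from s to g.
    return -torch.norm(phi(s) - phi(g), dim=-1)

def expectile_loss(diff, tau=0.7):
    # Asymmetric (IQL-style) expectile regression: weight |tau - 1(diff < 0)|.
    weight = torch.abs(tau - (diff < 0).float())
    return (weight * diff ** 2).mean()

def representation_loss(s, s_next, g, gamma=0.99):
    # TD target for the goal-reaching reward r = -1 until the goal is reached.
    reached = (s == g).all(dim=-1).float()
    target = (1.0 - reached) * (-1.0 + gamma * value(s_next, g).detach())
    return expectile_loss(target - value(s, g))

def skill_reward(s, s_next, z):
    # Intrinsic reward for pi(a | s, z): move the embedding along direction z
    # (z sampled from the unit sphere during pre-training).
    return ((phi(s_next) - phi(s)) * z).sum(dim=-1)
```

Parameterizing the value as a negative norm is what forces distances in the embedding to align with temporal distances.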
HILPs are evaluated on seven robotic locomotion and manipulation environments: Walker, Cheetah, Quadruped, Jaco, AntMaze-Large, AntMaze-Ultra, and Kitchen. The results show that HILPs outperform previous methods in zero-shot RL, offline goal-conditioned RL, and hierarchical RL. In zero-shot RL, HILPs achieve the best overall performance, adapting to arbitrary downstream reward functions without any additional training. In offline goal-conditioned RL, HILPs outperform even specialized goal-reaching methods. HILPs also benefit from test-time planning, where midpoint refinement of subgoals improves performance on long-horizon tasks.
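As an illustration of what adapting to arbitrary reward functions without additional training can look like, the sketch below infers a task latent by least-squares regression of observed rewards onto latent displacements and then prompts the frozen policy with it. The names phi and policy continue the sketch above, and the regression setup is an assumption about the prompting step, not a verbatim reproduction of the paper's procedure.

```python
# Hedged sketch of zero-shot prompting: fit the task latent z from a batch of
# reward-labeled transitions, then act with the frozen pretrained policy.
import torch

@torch.no_grad()
def infer_task_latent(states, next_states, rewards):
    # Features are the per-step displacements in the Hilbert space.
    features = phi(next_states) - phi(states)                    # (N, d)
    sol = torch.linalg.lstsq(features, rewards.unsqueeze(-1))    # solve F z ~ r
    z = sol.solution.squeeze(-1)
    return z / (z.norm() + 1e-8)                                 # unit-norm prompt

# At test time no gradient step is taken:
# z = infer_task_latent(batch_s, batch_s_next, batch_r)
# action = policy(state, z)          # `policy` is the pretrained pi(a | s, z)
```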
The paper also compares HILPs with previous skill-learning methods such as OPAL, a trajectory-based skill learning method, and shows that HILPs achieve better performance, especially on challenging tasks like AntMaze-Ultra. The structured Hilbert representation enables efficient test-time planning without the need for additional training, making HILPs a versatile and effective approach to offline policy pre-training. The results suggest that HILPs provide a promising direction for future research in offline reinforcement learning.
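The midpoint planning mentioned above can be sketched as a recursive subgoal refinement over the learned latent space: pick a dataset state whose embedding roughly bisects the remaining latent distance to the goal, then recurse on each half. The candidate set and recursion depth below are illustrative assumptions, not the paper's exact procedure.

```python
# Hedged sketch of midpoint-style subgoal refinement in the learned latent
# space; `phi` is the encoder from the earlier sketch and `candidates` is a
# batch of dataset states.
import torch

@torch.no_grad()
def refine_subgoals(s, g, candidates, depth=2):
    if depth == 0:
        return [g]
    zs, zg, zc = phi(s), phi(g), phi(candidates)
    # An approximate latent midpoint minimizes the longer of the two segments.
    cost = torch.maximum((zc - zs).norm(dim=-1), (zc - zg).norm(dim=-1))
    m = candidates[cost.argmin()]
    return (refine_subgoals(s, m, candidates, depth - 1)
            + refine_subgoals(m, g, candidates, depth - 1))
```

Because the refinement queries only the frozen encoder and the offline dataset, it adds no training at test time, which is consistent with the claim above.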