AnySkill: Learning Open-Vocabulary Physical Skill for Interactive Agents

19 Mar 2024 | Jieming Cui, Tengyu Liu, Nian Liu, Yaodong Yang, Yixin Zhu, Siyuan Huang
AnySkill is a novel hierarchical method for learning open-vocabulary physical skills for interactive agents. It combines a low-level controller with a high-level policy, enabling agents to learn physical interactions from open-vocabulary instructions. The low-level controller is trained with imitation learning to generate atomic actions, while the high-level policy selects and integrates these actions to maximize the CLIP similarity between the agent's rendered images and the text description. Because the high-level policy uses image-based rewards, no manual reward engineering is needed. AnySkill generates realistic, natural motion sequences in response to unseen instructions, making it the first method capable of open-vocabulary physical skill learning for interactive humanoid agents.

The low-level controller encodes unlabeled motions into a shared latent space Z and uses GAIL to ensure that the resulting atomic actions are physically plausible.

The high-level policy is trained separately for each open-vocabulary text instruction, using a flexible, generalizable image-based reward computed by a vision-language model (VLM). This design enables learning physical interactions with dynamic objects without handcrafted reward engineering. The policy is implemented as an MLP that takes the agent's state s as input and outputs a latent representation z close to the low-level controller's latent space Z. It is trained with a composite reward combining image-text similarity and latent-representation alignment: given state s and text description d, the agent's rendered image and the text are encoded with a pretrained, frozen CLIP model, and the similarity reward is the cosine similarity between the two feature vectors; an additional alignment reward draws z toward the latent distribution of the motion dataset M.
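The composite reward described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the feature vectors stand in for CLIP embeddings, and the L2 penalty and the weights `w_sim`/`w_align` are assumptions chosen for clarity.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def composite_reward(img_feat: np.ndarray, txt_feat: np.ndarray,
                     z: np.ndarray, z_prior_mean: np.ndarray,
                     w_sim: float = 1.0, w_align: float = 0.1) -> float:
    """Image-text similarity reward plus a latent-alignment reward.

    img_feat     : embedding of the rendered agent image (e.g., from CLIP)
    txt_feat     : embedding of the text instruction
    z            : latent output by the high-level policy
    z_prior_mean : mean of the low-level controller's latent distribution
                   (hypothetical summary of the distribution of Z)
    """
    r_sim = cosine_similarity(img_feat, txt_feat)
    # Alignment term pulls z toward the low-level latent distribution;
    # an L2 distance penalty is one simple choice (an assumption here).
    r_align = -float(np.linalg.norm(z - z_prior_mean))
    return w_sim * r_sim + w_align * r_align
```

In practice the image and text features would come from a frozen CLIP encoder, and the alignment term would be computed against the actual latent distribution learned by the low-level controller.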
The method is evaluated on 93 distinct motion records, primarily sourced from the CMU Graphics Lab Motion Capture Database and the SFU Motion Capture Database. The low-level controller is trained with PPO in IsaacGym on a single A100 GPU at a 120 Hz simulation frequency, taking four days to cover the 93 unique motion patterns; the high-level policy is trained on an NVIDIA RTX 3090 GPU. Compared with existing open-vocabulary motion generation approaches, AnySkill significantly surpasses current methods across all evaluated metrics. An ablation study underscores the importance of incorporating early termination into the training process. Experiments on text enhancement show that refined text descriptions significantly improve AnySkill's execution accuracy. On interaction motions, the method demonstrates a strong capability to interact with dynamic objects such as a soccer ball and a door. Finally, a reward function analysis shows that the proposed reward is effective at achieving text-to-motion alignment for open-vocabulary instructions and surpasses baseline reward functions in most metrics.
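The hierarchical control loop, in which the high-level policy outputs a latent z and the low-level controller decodes it into an atomic action, can be sketched as below. All dimensions, the random linear weights, and the function names are hypothetical stand-ins for the trained MLP and controller; this only illustrates the data flow between the two levels.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions chosen for illustration only.
STATE_DIM, LATENT_DIM, ACTION_DIM = 8, 4, 6

# Stand-ins for the trained networks: random linear maps with tanh.
W_high = rng.normal(size=(LATENT_DIM, STATE_DIM)) * 0.1
W_low = rng.normal(size=(ACTION_DIM, STATE_DIM + LATENT_DIM)) * 0.1

def high_level_policy(state: np.ndarray) -> np.ndarray:
    """Maps the agent state to a latent z in the shared space Z."""
    return np.tanh(W_high @ state)

def low_level_controller(state: np.ndarray, z: np.ndarray) -> np.ndarray:
    """Decodes (state, latent) into a joint-space atomic action."""
    return np.tanh(W_low @ np.concatenate([state, z]))

def step(state: np.ndarray):
    """One control step: high level picks z, low level executes it."""
    z = high_level_policy(state)
    action = low_level_controller(state, z)
    return z, action

state = rng.normal(size=STATE_DIM)
z, action = step(state)
```

In the actual system the low-level controller is the imitation-learned policy frozen after pretraining, and only the high-level MLP is updated per instruction using the image-based reward.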