The paper introduces AnySkill, a novel hierarchical framework for learning open-vocabulary physical interaction skills in interactive agents. AnySkill combines a low-level controller, trained via imitation learning to generate atomic actions, with a high-level policy that composes these actions according to open-vocabulary textual instructions. The high-level policy is optimized with image-based rewards from a Vision-Language Model (VLM) such as CLIP, maximizing the similarity between the agent's rendered images and the instruction text, while the imitation-based low-level controller keeps the resulting motions physically plausible and natural. The method is evaluated across a range of experiments and generates realistic, natural motion sequences for unseen instructions, outperforming existing methods on both qualitative and quantitative metrics. The paper also analyzes the impact of text enhancement on performance and showcases the agent's ability to interact with dynamic objects.
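
As a rough illustration of the VLM-based reward described above, the sketch below shows how a CLIP image-text similarity could be computed for a rendered frame and used as a per-frame reward signal. This is not the authors' implementation: the model checkpoint, function name, and the absence of any reward shaping or normalization are illustrative assumptions.

```python
# Minimal sketch of a CLIP-style image-text reward (illustrative, not the paper's code).
# Assumes rendered agent frames arrive as PIL images and instructions as plain strings.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

@torch.no_grad()
def clip_reward(frame: Image.Image, instruction: str) -> float:
    """Cosine similarity between a rendered frame and a text instruction."""
    inputs = processor(text=[instruction], images=frame,
                       return_tensors="pt", padding=True)
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])
    # Normalize embeddings so the dot product is a cosine similarity.
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    return (image_emb * text_emb).sum(dim=-1).item()

# Usage (hypothetical): score a frame against the instruction driving the high-level policy.
# reward = clip_reward(rendered_frame, "a person waves hello")
```

In a setup like the one the paper describes, a reward of this form would be queried on frames rendered from the simulator and fed to the high-level policy's RL objective, while the low-level imitation-learned controller constrains the actions to physically plausible motion.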