Ag2Manip is a framework that enables robots to learn novel manipulation skills without relying on domain-specific demonstrations. It addresses two challenges: bridging the domain gap between humans and robots, and improving the precision of robotic manipulation. Its key innovations are an agent-agnostic visual representation, which removes human-specific details from videos to improve generalizability, and an agent-agnostic action representation, which abstracts robot actions into a universal proxy focused on the crucial interactions between the end-effector and the object. Empirical validation across simulated benchmarks such as FrankaKitchen, ManiSkill, and PartManip shows a 325% increase in performance, reaching a 78.7% success rate. Ablation studies highlight the importance of both the visual and the action representations. In real-world experiments, Ag2Manip improves imitation learning success rates from 50% to 77.5%, demonstrating its effectiveness and generalizability across simulated and physical environments.
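The reported percentages mix two kinds of quantities: relative increases and absolute success rates. The short sketch below works through the arithmetic. It assumes that the 325% figure is a relative increase over a baseline success rate; the baseline it derives is an inference from that assumption, not a number stated in this summary. The function names are illustrative only.

```python
def relative_increase_pct(before: float, after: float) -> float:
    """Relative change from `before` to `after`, expressed in percent."""
    return (after - before) * 100.0 / before


def implied_baseline(after: float, increase_pct: float) -> float:
    """Baseline value that yields `after` when raised by `increase_pct` percent.

    Assumption: the increase is relative to the baseline, i.e.
    after = baseline * (1 + increase_pct / 100).
    """
    return after / (1.0 + increase_pct / 100.0)


# Real-world imitation learning: 50% -> 77.5% success rate.
real_world_points = 77.5 - 50.0                            # gain in percentage points
real_world_relative = relative_increase_pct(50.0, 77.5)    # gain relative to 50%

# Simulation: 78.7% success after a 325% increase (relative reading assumed).
sim_baseline = implied_baseline(78.7, 325.0)

print(f"Real-world gain: {real_world_points:.1f} points "
      f"({real_world_relative:.0f}% relative)")
print(f"Implied simulated baseline: {sim_baseline:.1f}%")
```

Under this reading, the real-world improvement is 27.5 percentage points (a 55% relative gain), and the simulated result would correspond to a baseline success rate of roughly 18.5%.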