8 Mar 2024 | Ge Yan*, Yueh-Hua Wu*, and Xiaolong Wang
DNAct is a multi-task policy learning framework for robotic manipulation. It combines neural rendering pre-training with diffusion training to learn multi-modal action sequences. In the pre-training phase, neural rendering distills 2D semantic features from foundation models such as Stable Diffusion into 3D space, giving the policy a comprehensive semantic understanding of the scene. This lets the model handle complex manipulation tasks that require rich 3D semantics and accurate geometry. The diffusion training phase then trains the model to reconstruct action sequences, which improves its ability to distinguish the different modalities present in multi-task demonstrations.
DNAct outperforms state-of-the-art NeRF-based multi-task manipulation approaches, with an improvement of over 30% in success rate. It is evaluated on both simulated and real-world robotic tasks, where it is robust and generalizes to novel objects and arrangements. DNAct's effectiveness is attributed to optimizing the learned representation from multiple perspectives, including knowledge distillation and action sequence reconstruction, which prevents overfitting to the training demonstrations.
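
To make the idea of diffusion training on action sequences more concrete, the following is a minimal sketch of a generic DDPM-style noising step and noise-prediction loss. It is not DNAct's implementation: the linear noise schedule, the function names, the toy action sequence, and the zero-predicting placeholder model are all assumptions chosen for illustration.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances, one per diffusion step (assumed schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alphas(betas):
    """Running product of (1 - beta_t), usually written alpha-bar_t."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out

def add_noise(actions, alpha_bar_t, rng):
    """Forward process: x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps."""
    noise = [[rng.gauss(0.0, 1.0) for _ in step] for step in actions]
    scale_x = math.sqrt(alpha_bar_t)
    scale_n = math.sqrt(1.0 - alpha_bar_t)
    noisy = [[scale_x * a + scale_n * n for a, n in zip(step, eps)]
             for step, eps in zip(actions, noise)]
    return noisy, noise

def noise_prediction_loss(predicted, target):
    """Mean squared error between predicted noise and the noise actually added."""
    diffs = [(p - t) ** 2
             for ps, ts in zip(predicted, target)
             for p, t in zip(ps, ts)]
    return sum(diffs) / len(diffs)

rng = random.Random(0)
alpha_bars = cumulative_alphas(linear_beta_schedule(100))

# Hypothetical 4-step action sequence with 3 values per step.
actions = [[0.1, 0.0, 0.2], [0.2, 0.1, 0.2], [0.3, 0.1, 0.1], [0.3, 0.2, 0.0]]
noisy, noise = add_noise(actions, alpha_bars[50], rng)

# A trained model would predict the noise from the noisy actions, the step
# index, and scene features; a placeholder that predicts zeros stands in here.
placeholder_prediction = [[0.0] * 3 for _ in actions]
loss = noise_prediction_loss(placeholder_prediction, noise)
```

In a setup like this, the network is trained to minimize the loss across randomly sampled diffusion steps. Because several distinct action sequences can be valid for the same scene, predicting noise rather than regressing a single action is one common way to represent such multi-modal demonstrations.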