8 Mar 2024 | Ge Yan*, Yueh-Hua Wu*, and Xiaolong Wang
DNAct is a novel multi-task object manipulation approach that integrates neural rendering pre-training and diffusion training to learn semantic-aware, multi-modal representations. Through knowledge distillation, 2D semantic features from foundation models such as Stable Diffusion are distilled into a 3D feature space, giving the policy a comprehensive semantic understanding of the scene. Diffusion training is then used to learn vision and language features that capture the inherent multi-modality of multi-task demonstrations: by reconstructing action sequences from different tasks via the denoising process, the model learns to distinguish the different modes, improving robustness and generalization. During policy training, a pre-trained 3D encoder and a point-cloud encoder work together to predict observation representations that preserve the multi-modality within trajectories.

DNAct significantly outperforms state-of-the-art NeRF-based multi-task manipulation approaches, with over a 30% improvement in success rate, and is particularly effective in scenarios with novel objects and arrangements, demonstrating strong generalization. Evaluated on both simulated and real-world tasks, DNAct achieves a 1.35x improvement in simulation and a 1.33x improvement in real-robot experiments, and still outperforms baselines by 1.25x when pre-trained on unrelated tasks. The method is also more parameter-efficient, using only 11.1M parameters compared to 33.2M for PerAct and 41.7M for GNFactor. Together, these results show that DNAct is both more efficient and capable of handling a wide variety of tasks.
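To make the two training signals described above concrete, here is a minimal sketch of the 2D-to-3D semantic distillation objective. This is not the authors' implementation; the function name, tensor shapes, and the choice of a cosine-similarity regression are illustrative assumptions. The idea is that per-ray features rendered from the learned 3D feature field are regressed onto 2D features extracted by a frozen foundation model (e.g. Stable Diffusion) for the same pixels.

```python
import torch
import torch.nn.functional as F

def distillation_loss(rendered_feats: torch.Tensor,  # (N_rays, D) features rendered from the 3D field
                      teacher_feats: torch.Tensor    # (N_rays, D) 2D features from the frozen foundation model
                      ) -> torch.Tensor:
    # Hypothetical objective: align rendered 3D features with the 2D teacher features
    # via cosine similarity (an L2 regression is an equally plausible choice).
    rendered = F.normalize(rendered_feats, dim=-1)
    teacher = F.normalize(teacher_feats, dim=-1)
    return (1.0 - (rendered * teacher).sum(dim=-1)).mean()
```

Similarly, the diffusion-training component can be pictured as a standard noise-prediction objective over demonstrated action sequences, conditioned on the fused scene/language embedding. The sketch below uses hypothetical names, an MLP denoiser, and a simple cosine noise schedule purely for illustration; the point is only that reconstructing action chunks from different tasks forces the conditioning representation to preserve multi-modality.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisePredictor(nn.Module):
    """Toy denoiser over flattened action chunks, conditioned on a scene/language embedding."""
    def __init__(self, action_dim: int, horizon: int, cond_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(action_dim * horizon + cond_dim + 1, hidden),
            nn.ReLU(),
            nn.Linear(hidden, action_dim * horizon),
        )

    def forward(self, noisy_actions, cond, t):
        # noisy_actions: (B, horizon * action_dim), cond: (B, cond_dim), t: (B, 1) diffusion step
        return self.net(torch.cat([noisy_actions, cond, t], dim=-1))

def diffusion_loss(model, actions, cond, num_steps: int = 100):
    # actions: (B, horizon * action_dim) ground-truth action sequence from a demonstration.
    b = actions.shape[0]
    t = torch.randint(0, num_steps, (b, 1), device=actions.device).float() / num_steps
    noise = torch.randn_like(actions)
    alpha = torch.cos(t * torch.pi / 2)                       # illustrative cosine schedule
    noisy = alpha * actions + (1 - alpha ** 2).sqrt() * noise  # corrupt the action chunk
    pred = model(noisy, cond, t)                               # predict the injected noise
    return F.mse_loss(pred, noise)
```

In a training loop, `cond` would be the representation produced by the pre-trained 3D encoder and the point-cloud encoder, so minimizing this loss pushes that representation to retain the different action modes present across tasks.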