The paper introduces 3D Diffuser Actor, a neural policy that combines 3D scene representations with diffusion models for robot manipulation. A 3D denoising transformer fuses information from the 3D visual scene, language instructions, and proprioception to predict the noise added to noised 3D robot pose trajectories. Trained on demonstration trajectories, the model outperforms existing 2D and 3D policies on the RLBench benchmark, achieving an 18.1% absolute gain on the multi-view setup and a 13.1% absolute gain on the single-view setup, and it achieves a 9% relative gain in zero-shot generalization to unseen scenes on CALVIN. It also learns to control a real-world robot manipulator from a few demonstrations.

Ablations support the design choices: the tokenized 3D scene representation is more robust to scene changes than holistic, non-tokenized 3D embeddings; relative 3D attentions yield better generalization and translation equivariance than absolute attentions; and the diffusion objective over 3D scene tokens outperforms regression and classification objectives as well as 2D representations. The model is implemented in Python, and the code is publicly available.
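To make the denoising formulation above concrete, here is a minimal sketch of an epsilon-prediction diffusion loss over pose trajectories, with a small transformer standing in for the paper's 3D denoising transformer. The noise schedule, tensor shapes, and names (PoseDenoiser, pose_dim=9 for a 3D position plus a 6D rotation representation) are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch: DDPM-style epsilon-prediction loss for robot pose trajectories.
# The transformer is an illustrative stand-in for the 3D denoising transformer.
import torch
import torch.nn as nn

T = 100                                   # number of diffusion steps (assumed)
betas = torch.linspace(1e-4, 0.02, T)     # linear noise schedule (assumed)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

class PoseDenoiser(nn.Module):
    """Predicts the noise added to a trajectory of 3D end-effector poses."""
    def __init__(self, pose_dim=9, ctx_dim=64, d_model=64):
        super().__init__()
        self.embed_pose = nn.Linear(pose_dim, d_model)
        self.embed_t = nn.Embedding(T, d_model)
        self.embed_ctx = nn.Linear(ctx_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, pose_dim)

    def forward(self, noisy_traj, t, scene_tokens):
        # Concatenate trajectory tokens with scene/language context tokens so
        # attention can fuse both; only trajectory tokens predict noise.
        traj = self.embed_pose(noisy_traj) + self.embed_t(t)[:, None, :]
        ctx = self.embed_ctx(scene_tokens)
        h = self.encoder(torch.cat([traj, ctx], dim=1))
        return self.head(h[:, : noisy_traj.shape[1]])

def diffusion_loss(model, clean_traj, scene_tokens):
    b = clean_traj.shape[0]
    t = torch.randint(0, T, (b,))
    eps = torch.randn_like(clean_traj)
    a = alphas_cumprod[t].sqrt()[:, None, None]
    s = (1.0 - alphas_cumprod[t]).sqrt()[:, None, None]
    noisy = a * clean_traj + s * eps        # forward diffusion q(x_t | x_0)
    return nn.functional.mse_loss(model(noisy, t, scene_tokens), eps)

model = PoseDenoiser()
loss = diffusion_loss(model, torch.randn(2, 8, 9), torch.randn(2, 16, 64))
loss.backward()
```

At inference, the same network would be applied iteratively to denoise a randomly initialized trajectory into an executable one, conditioned on the scene and instruction tokens.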
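The tokenized 3D scene representation can be illustrated by unprojecting per-pixel features from an RGB-D view into a set of (3D point, feature) tokens via the camera intrinsics. The function rgbd_to_tokens and all shapes below are hypothetical stand-ins; the paper's actual featurization pipeline may differ.

```python
# Hedged sketch: lift a 2D feature map plus depth into tokenized 3D scene tokens.
import torch

def rgbd_to_tokens(feats, depth, intrinsics):
    """feats: (C, H, W) 2D feature map; depth: (H, W); intrinsics: (3, 3)."""
    c, h, w = feats.shape
    v, u = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    fx, fy = intrinsics[0, 0], intrinsics[1, 1]
    cx, cy = intrinsics[0, 2], intrinsics[1, 2]
    # Pinhole unprojection: pixel (u, v) with depth z -> camera-frame XYZ.
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = torch.stack([x, y, z], dim=-1).reshape(-1, 3)   # (H*W, 3)
    tokens = feats.permute(1, 2, 0).reshape(-1, c)           # (H*W, C)
    return points, tokens

points, tokens = rgbd_to_tokens(torch.randn(64, 16, 16),
                                torch.rand(16, 16) + 0.5,
                                torch.tensor([[20., 0., 8.],
                                              [0., 20., 8.],
                                              [0., 0., 1.]]))
```

Because each token keeps its own 3D position, moving or removing one object only perturbs the affected tokens, which is one intuition for why a tokenized representation is more robust to scene changes than a single holistic embedding.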
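Relative 3D attention can be sketched by letting attention logits depend only on token features and on relative offsets between token positions, so a global translation of the scene leaves the attention pattern, and hence the output features, unchanged. The bias-MLP mechanism below is one common realization and is an assumption here, not necessarily the paper's exact design.

```python
# Hedged sketch: multi-head attention with a bias computed from relative
# 3D offsets p_i - p_j, which are invariant to global translations.
import torch
import torch.nn as nn

class Relative3DAttention(nn.Module):
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.rel_bias = nn.Sequential(      # maps an (x, y, z) offset to a
            nn.Linear(3, 32), nn.ReLU(),    # per-head attention bias (assumed)
            nn.Linear(32, n_heads),
        )
        self.out = nn.Linear(d_model, d_model)

    def forward(self, feats, positions):
        b, n, _ = feats.shape
        q, k, v = self.qkv(feats).chunk(3, dim=-1)
        q = q.view(b, n, self.n_heads, self.d_head).transpose(1, 2)
        k = k.view(b, n, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(b, n, self.n_heads, self.d_head).transpose(1, 2)
        # Relative offsets are unchanged by any global translation of `positions`.
        rel = positions[:, :, None, :] - positions[:, None, :, :]  # (b, n, n, 3)
        bias = self.rel_bias(rel).permute(0, 3, 1, 2)              # (b, h, n, n)
        logits = q @ k.transpose(-2, -1) / self.d_head**0.5 + bias
        attn = logits.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, -1)
        return self.out(out)

attn = Relative3DAttention()
feats, pos = torch.randn(1, 10, 64), torch.randn(1, 10, 3)
shifted = attn(feats, pos + torch.tensor([1.0, 2.0, 3.0]))
assert torch.allclose(attn(feats, pos), shifted, atol=1e-5)
```

The assertion at the end checks the translation-invariance property directly: shifting every token position by the same offset leaves the output features essentially unchanged, which is what lets a policy built on such attention generalize across workspace placements.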