RISE is an end-to-end imitation learning method that uses 3D perception to enable effective and accurate real-world robot imitation. It predicts continuous robot actions directly from a single-view point cloud. A sparse 3D encoder first compresses the point cloud into tokens, a transformer with sparse positional encoding then turns these tokens into features, and a diffusion head decodes the features into robot actions. RISE outperforms existing 2D and 3D policies in both accuracy and efficiency, and it generalizes well and remains robust under environmental changes.
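To make the "point cloud to tokens with sparse positional encoding" step more concrete, here is a minimal sketch in plain Python. It is not the RISE implementation. It assumes that tokenization can be approximated by grouping points into occupied voxels, and that "sparse positional encoding" means a sinusoidal encoding of each occupied voxel's integer coordinate. The function names (`voxelize`, `sinusoidal_encoding`, `tokenize`), the voxel size, and the encoding width are all hypothetical, and the per-voxel centroid merely stands in for the features a learned sparse 3D encoder would produce.

```python
import math
from collections import defaultdict


def voxelize(points, voxel_size):
    """Group 3D points by voxel; return {voxel_index: centroid}.

    Only occupied voxels get an entry, so the result is sparse.
    """
    buckets = defaultdict(list)
    for x, y, z in points:
        key = (
            math.floor(x / voxel_size),
            math.floor(y / voxel_size),
            math.floor(z / voxel_size),
        )
        buckets[key].append((x, y, z))

    centroids = {}
    for key, pts in buckets.items():
        n = len(pts)
        # Centroid is a placeholder for a learned per-voxel feature (assumption).
        centroids[key] = tuple(sum(p[i] for p in pts) / n for i in range(3))
    return centroids


def sinusoidal_encoding(index, dim_per_axis=4, base=10000.0):
    """Sinusoidal encoding of an integer voxel index (ix, iy, iz).

    Each axis contributes dim_per_axis values (sin/cos pairs),
    so the output has 3 * dim_per_axis entries.
    """
    enc = []
    for coord in index:
        for k in range(dim_per_axis // 2):
            freq = 1.0 / (base ** (2 * k / dim_per_axis))
            enc.append(math.sin(coord * freq))
            enc.append(math.cos(coord * freq))
    return enc


def tokenize(points, voxel_size=0.05, dim_per_axis=4):
    """Return a list of (voxel_index, feature, positional_encoding) tokens."""
    voxels = voxelize(points, voxel_size)
    return [
        (key, feature, sinusoidal_encoding(key, dim_per_axis))
        for key, feature in sorted(voxels.items())
    ]


if __name__ == "__main__":
    cloud = [(0.01, 0.01, 0.01), (0.02, 0.02, 0.02), (0.26, 0.0, 0.0)]
    for index, feature, pos in tokenize(cloud):
        print(index, feature, len(pos))
```

In this sketch the number of tokens depends on how many voxels are occupied rather than on the size of the full grid, which is what makes a sparse representation attractive for noisy single-view point clouds. A transformer and a diffusion head would then consume these tokens; both are omitted here.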
RISE is tested on six real-world tasks, including pick-and-place, 6-DoF pouring, push-to-goal, and long-horizon tasks. It achieves high completion rates on these tasks, even in complex environments with varying object locations and camera views, and it shows superior performance on tasks that require precise spatial understanding and dynamic adjustments. The method handles noisy single-view point clouds effectively and is robust to changes in camera view and environmental conditions. Its use of 3D perception lets it capture accurate spatial relationships, which leads to more precise and adaptable robot actions. Across the evaluated tasks, RISE shows significant improvements over existing approaches, particularly in complex and dynamic environments.