This paper proposes an economic framework for 6-DoF grasp detection that reduces training resource costs while maintaining strong grasp performance. Current methods face two key challenges: high resource costs and slow convergence caused by dense, ambiguous supervision. To address these issues, the authors introduce an economic supervision paradigm that selects key labels with minimal ambiguity, shrinking the supervision burden. Because the model can then focus on specific grasps, the authors design a focal representation module that improves grasp accuracy through interactive grasp heads and composite score estimation.
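To make the label-selection idea concrete, here is a minimal sketch assuming dense per-point annotations indexed by approach view, in-plane angle, and gripper depth. The selection criterion (take the highest-scoring grasp as the least ambiguous label) and all tensor shapes are illustrative assumptions, not the paper's exact procedure.

```python
# Hypothetical sketch of economic (key-label) selection: instead of
# supervising every (view, angle, depth) combination per point, keep only
# the single least ambiguous label. Shapes and criterion are assumptions.
import torch

def select_key_labels(scores: torch.Tensor):
    """Pick one key grasp label per point from dense annotations.

    scores: (N, V, A, D) grasp-quality scores for N points over
            V approach views, A in-plane angles, and D depths.
    Returns the per-point key score (N,) and its (view, angle, depth)
    indices (N, 3).
    """
    n, v, a, d = scores.shape
    flat = scores.view(n, -1)              # (N, V*A*D)
    key_score, flat_idx = flat.max(dim=1)  # highest score = least ambiguous
    view_idx = flat_idx // (a * d)
    angle_idx = (flat_idx % (a * d)) // d
    depth_idx = flat_idx % d
    return key_score, torch.stack([view_idx, angle_idx, depth_idx], dim=1)

# Usage on synthetic annotations for 1024 points:
dense = torch.rand(1024, 300, 12, 4)
key_score, key_index = select_key_labels(dense)
```

Keeping one label per point instead of the full (view, angle, depth) grid is what shrinks the supervision tensor, and with it the memory and storage footprint.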
The economic framework significantly reduces training time, memory usage, and storage costs compared to state-of-the-art methods: it achieves about a 3 AP improvement on average while cutting training time to roughly 1/4, memory cost to 1/8, and storage cost to 1/30. Evaluated on the GraspNet-1Billion dataset, the framework shows superior performance across seen, similar, and novel object settings. By focusing supervision on specific grasps, the economic paradigm enables efficient training and better performance at lower resource consumption.
The framework is implemented with a 14-layer 3D UNet backbone built on the Minkowski Engine, and the loss function jointly supervises view, angle, depth, width, and score prediction; minimal sketches of both components follow below. The model is trained and evaluated on data captured by both Kinect and RealSense cameras, showing robust performance in real-world scenarios. Overall, the results indicate that the economic framework is effective at reducing resource costs while maintaining high grasp performance, making it suitable for resource-constrained environments. Its ability to handle ambiguous supervision by focusing on specific grasps is the key contribution, demonstrating the potential of economic grasp detection in robotic manipulation.
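For the backbone, the following is a minimal sketch of a single sparse encoder/decoder stage assembled with Minkowski Engine, the library named above; the actual 14-layer sparse UNet is deeper and uses skip connections, and the channel widths here are illustrative assumptions.

```python
# Minimal sketch of one sparse encoder/decoder stage with Minkowski Engine.
# The paper's backbone is a deeper 14-layer sparse UNet; the channel widths
# below are illustrative assumptions.
import torch
import MinkowskiEngine as ME

class SparseStage(torch.nn.Module):
    def __init__(self, in_ch=3, mid_ch=32, out_ch=64, D=3):
        super().__init__()
        # Encoder: strided sparse convolution halves the spatial resolution.
        self.enc = torch.nn.Sequential(
            ME.MinkowskiConvolution(in_ch, mid_ch, kernel_size=3, stride=2, dimension=D),
            ME.MinkowskiBatchNorm(mid_ch),
            ME.MinkowskiReLU(),
        )
        # Decoder: transposed sparse convolution restores the resolution.
        self.dec = torch.nn.Sequential(
            ME.MinkowskiConvolutionTranspose(mid_ch, out_ch, kernel_size=3, stride=2, dimension=D),
            ME.MinkowskiBatchNorm(out_ch),
            ME.MinkowskiReLU(),
        )

    def forward(self, x: ME.SparseTensor) -> ME.SparseTensor:
        return self.dec(self.enc(x))

# Usage: quantize point coordinates, wrap them in a SparseTensor, run the stage.
coords = ME.utils.batched_coordinates([torch.randint(0, 100, (500, 3))])
feats = torch.rand(500, 3)
out = SparseStage()(ME.SparseTensor(features=feats, coordinates=coords))
```

Sparse convolutions only compute at occupied voxels, which is what keeps memory cost low on point-cloud input.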
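Finally, a hedged sketch of how a multi-task loss over the five listed prediction targets might be assembled; the use of cross-entropy for the discrete heads, smooth-L1 for the continuous ones, and the weights `w` are assumptions rather than the paper's exact formulation.

```python
# Hedged sketch of a multi-task grasp loss over the five targets listed above.
# Loss types and weights are assumptions, not the paper's exact formulation.
import torch
import torch.nn.functional as F

def grasp_loss(pred: dict, target: dict, w=(1.0, 1.0, 1.0, 10.0, 1.0)):
    """pred/target hold the five heads: view, angle, depth, width, score.

    view/angle/depth are treated as classification (logits vs. index
    labels); width and score are regressed scalars.
    """
    l_view = F.cross_entropy(pred["view"], target["view"])
    l_angle = F.cross_entropy(pred["angle"], target["angle"])
    l_depth = F.cross_entropy(pred["depth"], target["depth"])
    l_width = F.smooth_l1_loss(pred["width"], target["width"])
    l_score = F.smooth_l1_loss(pred["score"], target["score"])
    terms = (l_view, l_angle, l_depth, l_width, l_score)
    return sum(wi * li for wi, li in zip(w, terms))
```

Under the economic supervision described above, each term would be evaluated only at the selected key labels rather than over the full dense annotation grid.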