RealDex is a pioneering dataset of authentic dexterous-hand grasping motions that captures human behavioral patterns and is enriched with multi-view, multimodal visual data. Using a teleoperation system, RealDex synchronizes human and robot hand poses in real time, enabling dexterous hands to be trained to mimic human movements more naturally and precisely. The dataset covers 52 objects of varied scale, shape, and material, with 2.6k grasping-motion sequences and approximately 955k frames of visual data. This comprehensive resource is crucial for advancing humanoid robotics in automated perception, cognition, and manipulation in real-world scenarios.
The paper also introduces a dexterous grasping motion generation framework that aligns with human experience and improves real-world applicability by leveraging Multi-modal Large Language Models (MLLMs). Extensive experiments demonstrate that the method outperforms existing approaches on RealDex and other open datasets. The complete dataset and code will be made available upon publication.
The framework consists of two main stages: grasp pose generation and motion synthesis. In the first stage, a conditional Variational Autoencoder (cVAE) generates candidate grasp poses, which are then aligned with human preferences using MLLMs. In the second stage, an auto-regressive motion trajectory prediction model synthesizes the complete hand motion sequence for each selected pose. The framework's effectiveness is validated through user studies and comparisons with existing methods, showing superior grasping stability, human-like grasp quality, and generalization. Deployment on a real robot further demonstrates the method's practical value.
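To make the two-stage pipeline concrete, the sketch below wires together a minimal cVAE pose sampler, an MLLM preference-ranking stub, and an auto-regressive trajectory model. All names (PoseCVAE, rank_with_mllm, MotionGRU), the 24-DoF pose parameterization, and the network sizes are illustrative assumptions, not the paper's actual architecture; in particular, the MLLM ranking step is replaced by a placeholder, since the real system would render candidates and query a multimodal LLM.

```python
# Illustrative sketch of the two-stage framework (assumptions noted above).
import torch
import torch.nn as nn

POSE_DIM = 24       # assumed hand-pose parameterization (joint angles + wrist)
OBJ_FEAT_DIM = 128  # assumed object point-cloud feature size
LATENT_DIM = 32

class PoseCVAE(nn.Module):
    """Stage 1: conditional VAE sampling candidate grasp poses given an
    object feature vector (a stand-in for a point-cloud encoder)."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(POSE_DIM + OBJ_FEAT_DIM, 256), nn.ReLU(),
            nn.Linear(256, 2 * LATENT_DIM))  # -> (mu, log_var)
        self.decoder = nn.Sequential(
            nn.Linear(LATENT_DIM + OBJ_FEAT_DIM, 256), nn.ReLU(),
            nn.Linear(256, POSE_DIM))

    def forward(self, pose, obj_feat):
        mu, log_var = self.encoder(torch.cat([pose, obj_feat], -1)).chunk(2, -1)
        z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()  # reparameterize
        return self.decoder(torch.cat([z, obj_feat], -1)), mu, log_var

    @torch.no_grad()
    def sample(self, obj_feat, n):
        z = torch.randn(n, LATENT_DIM)
        return self.decoder(torch.cat([z, obj_feat.expand(n, -1)], -1))

def rank_with_mllm(candidate_poses):
    """Stage 1b (hypothetical stub): the paper scores rendered candidates
    with an MLLM for human-preference alignment; here we only return
    placeholder scores of the right shape."""
    return torch.rand(candidate_poses.shape[0])

class MotionGRU(nn.Module):
    """Stage 2: auto-regressive model rolling out a hand-motion sequence
    toward a chosen target grasp pose."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRUCell(2 * POSE_DIM, 256)  # input: current + target pose
        self.head = nn.Linear(256, POSE_DIM)      # predicts per-step pose delta

    def rollout(self, start_pose, target_pose, steps=30):
        h = torch.zeros(start_pose.shape[0], 256)
        pose, traj = start_pose, []
        for _ in range(steps):
            h = self.rnn(torch.cat([pose, target_pose], -1), h)
            pose = pose + self.head(h)  # feed prediction back in (auto-regressive)
            traj.append(pose)
        return torch.stack(traj, dim=1)  # (batch, steps, POSE_DIM)

# Wiring the stages together on dummy inputs:
obj_feat = torch.randn(1, OBJ_FEAT_DIM)
candidates = PoseCVAE().sample(obj_feat, n=16)          # candidate grasp poses
best = candidates[rank_with_mllm(candidates).argmax()]  # MLLM-preferred pose
motion = MotionGRU().rollout(torch.zeros(1, POSE_DIM), best.unsqueeze(0))
print(motion.shape)  # torch.Size([1, 30, 24])
```

Decoupling pose selection from trajectory rollout, as in this sketch, is what lets the preference-alignment step operate on static grasp candidates before any motion is committed.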