30 May 2024 | Ling-Hao Chen*, Shunlin Lu*, Ailing Zeng, Hao Zhang, Benyou Wang, Ruimao Zhang, Lei Zhang
MotionLLM is a framework for understanding human behaviors from both human motions and videos. By combining the video and motion modalities, it captures nuanced body dynamics alongside high-level semantics. A unified training strategy leverages coarse video-text data together with fine-grained motion-text data to obtain rich spatial-temporal insights. To support this, a large dataset, MoVid, is collected, containing diverse videos, motions, captions, and instructions, and MoVid-Bench is proposed for evaluating human behavior understanding on both video and motion.

Training follows a two-stage approach: motion and video data are first translated into the linguistic space of the language model, and the model is then fine-tuned with instruction-tuning data. Extensive experiments show that MotionLLM outperforms existing methods in captioning, spatial-temporal comprehension, and reasoning, with consistent improvements in both motion and video understanding. The framework is also effective in downstream applications, such as serving as a fitness coach that provides guidance for physical activities, including for the visually impaired. This work contributes to multi-modal human behavior understanding and provides a foundation for future research in this area.
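The two-stage recipe can be illustrated with a minimal sketch. The module names, dimensions, and freeze/unfreeze schedule below are assumptions for illustration rather than the actual MotionLLM implementation: stage 1 trains only the modality translators that project motion and video features into the LLM's embedding space, while stage 2 additionally tunes the language model (in practice, typically via adapters such as LoRA) on instruction data.

```python
# Minimal sketch of a two-stage "translate, then instruction-tune" pipeline.
# All module names and dimensions are illustrative assumptions, not the
# released MotionLLM code.
import torch
import torch.nn as nn


class ModalityTranslator(nn.Module):
    """Projects encoder features into the LLM's token-embedding space."""

    def __init__(self, feat_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(feat_dim, llm_dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, seq_len, feat_dim) -> (batch, seq_len, llm_dim)
        return self.proj(feats)


class BehaviorLLM(nn.Module):
    def __init__(self, motion_dim=263, video_dim=1024, llm_dim=4096):
        super().__init__()
        # Pretrained encoders and the LLM would be loaded here; linear
        # placeholders keep the sketch self-contained and runnable.
        self.motion_encoder = nn.Linear(motion_dim, motion_dim)  # placeholder
        self.video_encoder = nn.Linear(video_dim, video_dim)     # placeholder
        self.llm = nn.Linear(llm_dim, llm_dim)                   # placeholder
        self.motion_translator = ModalityTranslator(motion_dim, llm_dim)
        self.video_translator = ModalityTranslator(video_dim, llm_dim)

    def embed(self, motion=None, video=None):
        # Translate each available modality into the linguistic (LLM) space
        # and concatenate along the sequence dimension.
        tokens = []
        if motion is not None:
            tokens.append(self.motion_translator(self.motion_encoder(motion)))
        if video is not None:
            tokens.append(self.video_translator(self.video_encoder(video)))
        return torch.cat(tokens, dim=1)


def set_stage(model: BehaviorLLM, stage: int) -> None:
    """Stage 1: train only the translators. Stage 2: also tune the LLM."""
    for p in model.parameters():
        p.requires_grad = False
    for p in list(model.motion_translator.parameters()) + \
             list(model.video_translator.parameters()):
        p.requires_grad = True
    if stage == 2:
        for p in model.llm.parameters():  # in practice, LoRA adapters
            p.requires_grad = True


model = BehaviorLLM()
set_stage(model, 1)  # stage 1: modality translation on caption data
set_stage(model, 2)  # stage 2: instruction tuning on MoVid instructions
```

The design intent this sketch reflects is that coarse video-text and fine-grained motion-text supervision share one language backbone, so only lightweight translators differ per modality while the instruction-tuning stage aligns both with conversational behavior understanding.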