MotionLLM: Understanding Human Behaviors from Human Motions and Videos


30 May 2024 | Ling-Hao Chen, Shunlin Lu, Ailing Zeng, Hao Zhang, Benyou Wang, Ruimao Zhang, Lei Zhang
This study explores the understanding of human behavior from videos and motion sequences using Large Language Models (LLMs). Unlike previous LLMs that focus on video or motion understanding alone, the authors argue that a comprehensive understanding requires joint modeling of both modalities to capture nuanced body-part dynamics and semantics effectively. They introduce MotionLLM, a framework that leverages a unified video-motion training strategy to bridge the gap between video and motion data. The framework includes a V-L translator that projects visual inputs into a linguistic space and an LLM that reasons about the content. To evaluate MotionLLM, the authors construct the MoVid dataset, which includes diverse videos, motions, captions, and instructions, and the MoVid-Bench benchmark, which assesses model performance on various aspects of human behavior understanding. Extensive experiments show that MotionLLM outperforms existing models in captioning, spatial-temporal comprehension, and reasoning ability. The authors also demonstrate the versatility of MotionLLM in applications such as serving as a fitness coach for social good, particularly for the visually impaired community.
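To make the architecture concrete, below is a minimal sketch (not the authors' code) of how a "V-L translator" might project video and motion features into an LLM's token-embedding space before the language model reasons over them. All module names, dimensions, and encoders here are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch, assuming pre-extracted video/motion features and a
# hypothetical LLM hidden size. Not the authors' implementation.
import torch
import torch.nn as nn


class VLTranslator(nn.Module):
    """Projects per-frame visual or motion features into the LLM embedding space."""

    def __init__(self, feat_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(feat_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, num_frames, feat_dim) -> (batch, num_frames, llm_dim)
        return self.proj(feats)


if __name__ == "__main__":
    llm_dim = 4096                            # hypothetical LLM hidden size
    video_feats = torch.randn(2, 16, 1024)    # e.g. frame features from a video encoder
    motion_feats = torch.randn(2, 32, 256)    # e.g. tokens from a motion encoder

    video_translator = VLTranslator(1024, llm_dim)
    motion_translator = VLTranslator(256, llm_dim)

    # Both modalities become sequences of "soft tokens" in the same linguistic
    # space, so they can be concatenated with text embeddings and passed to the
    # LLM for captioning or instruction following.
    video_tokens = video_translator(video_feats)      # (2, 16, 4096)
    motion_tokens = motion_translator(motion_feats)   # (2, 32, 4096)
    unified = torch.cat([video_tokens, motion_tokens], dim=1)
    print(unified.shape)  # torch.Size([2, 48, 4096])
```

The design choice illustrated here is that a lightweight projector lets a frozen or lightly fine-tuned LLM consume both modalities in a unified token sequence, which is the gap-bridging idea the abstract describes.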