17 Jul 2024 | Lei Zhong¹, Yiming Xie¹, Varun Jampani², Deqing Sun³, and Huaizu Jiang¹
SMooDi is a novel stylized motion diffusion model that generates motion driven by content text and style motion sequences. Unlike existing methods that either generate motion of various content or transfer style from one sequence to another, SMooDi can rapidly generate motion across a broad range of content and diverse styles. The model is built upon a pre-trained motion latent diffusion model (MLD), which is customized for stylization. A style adaptor and a style guidance module are introduced to ensure that the generated motion closely matches the reference style while remaining realistic. The style adaptor predicts residual features conditioned on the style reference motion sequence, while the style guidance module combines classifier-free and classifier-based guidance to steer the stylized motion generation; both mechanisms are sketched below.

Experiments on the HumanML3D and 100STYLE datasets show that SMooDi outperforms existing methods in generating stylized motion driven by content text, excelling in both content preservation and style reflection. SMooDi integrates the diverse content of the HumanML3D dataset and the varied styles of the 100STYLE dataset into a single model without requiring additional tuning at inference time.

The contributions are threefold: SMooDi is the first method to adapt a pre-trained text-to-motion model for generating diverse stylized motion; it introduces a novel style modulation module that enables stylized motion generation while ensuring style reflection, content preservation, and realism; and it not only sets a new state of the art in stylized motion generation driven by content text but also achieves performance comparable to state-of-the-art methods in motion style transfer.
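To make the style adaptor concrete, here is a minimal PyTorch sketch of residual-feature prediction from a style motion sequence. It assumes a transformer- or RNN-style denoiser whose per-layer features receive additive residuals; the names (StyleEncoder via nn.GRU, residual_heads) and dimensions are hypothetical illustrations, not the paper's actual architecture. The zero-initialized heads make the adaptor start as an identity, a common trick for fine-tuning a frozen pre-trained model (cf. ControlNet).

```python
import torch
import torch.nn as nn

class StyleAdaptor(nn.Module):
    """Predicts per-layer residual features from a style motion sequence.

    Hypothetical sketch: the real SMooDi adaptor may differ in encoder
    choice, layer counts, feature sizes, and how residuals are injected.
    """
    def __init__(self, motion_dim=263, feat_dim=512, num_layers=4):
        super().__init__()
        self.encoder = nn.GRU(motion_dim, feat_dim, batch_first=True)
        # One zero-initialized projection per denoiser layer, so the
        # adaptor initially leaves the pre-trained model unchanged.
        self.residual_heads = nn.ModuleList(
            [nn.Linear(feat_dim, feat_dim) for _ in range(num_layers)]
        )
        for head in self.residual_heads:
            nn.init.zeros_(head.weight)
            nn.init.zeros_(head.bias)

    def forward(self, style_motion):
        # style_motion: (batch, frames, motion_dim)
        _, h = self.encoder(style_motion)   # h: (1, batch, feat_dim)
        style_feat = h.squeeze(0)           # (batch, feat_dim)
        # Residuals to be added to the frozen denoiser's layer features;
        # only the adaptor is trained.
        return [head(style_feat) for head in self.residual_heads]
```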
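The dual style guidance can be sketched in the same hedged spirit. The snippet below illustrates one way to combine classifier-free guidance over the two conditions (content text and style) with a classifier-based gradient term; the guidance weights w_c, w_s, w_g and the style_loss helper are assumptions for illustration, not the paper's exact formulation or values.

```python
import torch

def guided_noise(denoiser, x_t, t, text, style, style_loss,
                 w_c=7.0, w_s=1.5, w_g=0.1):
    """Noise estimate for one denoising step under dual style guidance.

    Sketch only: SMooDi's actual guidance decomposition, schedule,
    and weights may differ from this illustration.
    """
    # Classifier-free guidance: contrast predictions with and without
    # the content text and the style condition.
    eps_uncond = denoiser(x_t, t, text=None, style=None)
    eps_text   = denoiser(x_t, t, text=text, style=None)
    eps_full   = denoiser(x_t, t, text=text, style=style)
    eps = (eps_uncond
           + w_c * (eps_text - eps_uncond)
           + w_s * (eps_full - eps_text))

    # Classifier-based guidance: shift the noise estimate using the
    # gradient of a style-matching loss between the current sample and
    # the reference style (adding the loss gradient to the predicted
    # noise steers the sample toward lower style loss).
    with torch.enable_grad():
        x = x_t.detach().requires_grad_(True)
        loss = style_loss(x, style)
        grad = torch.autograd.grad(loss, x)[0]
    return eps + w_g * grad
```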