July 27-August 1, 2024, Denver, CO, USA
Setareh Cohan, Guy Tevet, Daniele Reda, Xue Bin Peng, Michiel van de Panne
The paper presents Conditional Motion Diffusion In-betweening (CondMDI), a flexible and unified diffusion model for generating diverse human motions guided by keyframes. Unlike previous in-betweening methods, CondMDI handles arbitrarily dense or sparse keyframe placement and partial keyframe constraints while producing high-quality, coherent motions. The model is trained on randomly sampled keyframes and joints, with a mask indicating which frames and features are observed. This allows temporal and spatial sparsity in the keyframes as well as partial pose specifications, optionally combined with text prompts. The authors evaluate CondMDI on the HumanML3D dataset, demonstrating its versatility and efficacy for keyframe in-betweening. They also explore alternative design choices, including imputation and reconstruction-guidance methods, and compare CondMDI against these approaches. The results show that CondMDI generates smooth, high-quality motions that adhere closely to the input keyframes, even under significant temporal and spatial sparsity, while remaining simple and flexible and matching state-of-the-art diffusion-based models in motion quality.
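To make the described training scheme concrete, the sketch below shows one way the random keyframe-and-joint masking could look in practice. This is a minimal illustration assuming a PyTorch-style setup; the names `denoiser`, `diffusion.q_sample`, the sampling probabilities, and the input layout (noisy motion concatenated with the masked observations and the mask itself) are assumptions for illustration, not the paper's actual code.

```python
import torch

def sample_keyframe_mask(n_frames: int, n_features: int,
                         p_frame: float = 0.1, p_feature: float = 0.5):
    """Hypothetical CondMDI-style conditioning mask.

    Each frame is independently chosen as a keyframe with probability
    p_frame (temporal sparsity); within a chosen keyframe, each feature
    is observed with probability p_feature (spatial sparsity / partial
    pose constraints).
    """
    frame_mask = (torch.rand(n_frames, 1) < p_frame).float()
    feature_mask = (torch.rand(n_frames, n_features) < p_feature).float()
    return frame_mask * feature_mask  # 1 = observed, 0 = to be generated

def training_step(denoiser, diffusion, x0, text_emb):
    """One conditional-diffusion training step on a clean motion x0 of
    shape (n_frames, n_features). `denoiser` and `diffusion` are
    placeholders for a motion-diffusion backbone and noise schedule.
    """
    mask = sample_keyframe_mask(*x0.shape)
    t = torch.randint(0, diffusion.num_steps, (1,))
    x_t = diffusion.q_sample(x0, t)  # forward-noise the motion to step t
    # Feed the observed (clean) keyframe values and the mask as extra
    # input channels, so the model knows which entries are constraints.
    model_in = torch.cat([x_t, x0 * mask, mask], dim=-1)
    x0_pred = denoiser(model_in, t, text_emb)  # predict the clean motion
    return torch.mean((x0_pred - x0) ** 2)     # simple x0 reconstruction loss
```

Because the mask is resampled every step, a single trained model covers the whole spectrum from dense keyframes to a handful of partially specified poses.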
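For contrast, the imputation baseline that the paper compares against can be sketched as a RePaint-style sampling loop: at each reverse step, the entries covered by the mask are overwritten with a suitably noised copy of the keyframes. Again, `diffusion.p_sample` and `diffusion.q_sample` are hypothetical stand-ins for a standard DDPM reverse step and forward-noising operator, not an API from the paper.

```python
import torch

@torch.no_grad()
def sample_with_imputation(denoiser, diffusion, keyframes, mask, text_emb):
    """Illustrative imputation baseline: project the sample onto the
    keyframe constraints after every denoising step."""
    x = torch.randn_like(keyframes)  # start from pure Gaussian noise
    for t in reversed(range(diffusion.num_steps)):
        # One reverse (denoising) step, conditioned on the text prompt.
        x = diffusion.p_sample(denoiser, x, t, text_emb)
        # Forward-noise the known keyframes to the current noise level
        # and overwrite the observed entries, pinning the constraints.
        x_known = diffusion.q_sample(keyframes, t - 1) if t > 0 else keyframes
        x = mask * x_known + (1 - mask) * x
    return x
```

Since CondMDI instead receives the observations and mask as input channels during training, it needs no such per-step projection at sampling time, which is part of what the abstract means by the method being simple and flexible.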