24 May 2024 | Mathis Petrovich, Or Litany, Umar Iqbal, Michael J. Black, Gül Varol, Xue Bin Peng, Davis Rempe
This paper introduces a new problem of multi-track timeline control for text-driven 3D human motion generation, enabling users to specify complex motion sequences with precise timing and spatial composition. The proposed method, Spatio-Temporal Motion Collage (STMC), operates at test time and leverages pre-trained motion diffusion models to generate realistic motions that accurately reflect the timeline. STMC processes each timeline interval independently, then aggregates the predictions according to the specific body parts involved in each action. The method handles both spatial and temporal composition, allowing seamless transitions between actions.

Experimental results show that STMC produces realistic motions that respect the semantics and timing of the given text prompts. The method also improves upon existing motion diffusion models by supporting the SMPL body representation and reducing runtime through fewer denoising steps. The paper presents a comprehensive evaluation on a new dataset of 500 multi-track timelines containing complex compositions of actions, including quantitative comparisons with baselines and a perceptual study with human raters. The results demonstrate that STMC outperforms existing methods in motion realism and semantic accuracy, providing a significant improvement over prior approaches to text-driven 3D human motion generation.
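The core idea of per-interval prediction followed by body-part-aware aggregation can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the `BODY_PARTS` joint grouping, the `stub_denoiser` stand-in for a pre-trained text-to-motion diffusion model, and the simple averaging of overlapping predictions are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical body-part -> joint-index grouping (illustrative only;
# the actual method partitions the SMPL body).
BODY_PARTS = {
    "legs": [0, 1, 2, 3],
    "torso": [4, 5],
    "arms": [6, 7, 8, 9],
    "head": [10],
}
NUM_JOINTS = 11
FEAT_DIM = 3

def stub_denoiser(text, num_frames):
    """Stand-in for one prediction of a pre-trained text-to-motion
    diffusion model; returns a dummy motion of shape
    (num_frames, NUM_JOINTS, FEAT_DIM)."""
    rng = np.random.default_rng(len(text))  # deterministic dummy output
    return rng.standard_normal((num_frames, NUM_JOINTS, FEAT_DIM))

def stmc_aggregate(timeline, total_frames):
    """Sketch of the spatio-temporal collage: predict each timeline
    interval independently, then combine predictions per body part,
    averaging wherever intervals overlap in time on the same joints."""
    acc = np.zeros((total_frames, NUM_JOINTS, FEAT_DIM))
    count = np.zeros((total_frames, NUM_JOINTS, 1))
    for text, start, end, parts in timeline:
        pred = stub_denoiser(text, end - start)
        joints = [j for p in parts for j in BODY_PARTS[p]]
        acc[start:end, joints] += pred[:, joints]
        count[start:end, joints] += 1
    # Joints/frames not covered by any prompt stay zero in this sketch;
    # the real method would fill them from a full-body prediction.
    return np.divide(acc, count, out=np.zeros_like(acc), where=count > 0)

# A two-track timeline: the prompts overlap in time but act on
# disjoint body parts, so both are preserved in the collage.
timeline = [
    ("walk forward", 0, 60, ["legs", "torso"]),
    ("wave the right hand", 30, 90, ["arms"]),
]
motion = stmc_aggregate(timeline, total_frames=90)
print(motion.shape)  # (90, 11, 3)
```

In the full method this aggregation happens inside every denoising step of the diffusion process rather than once on final motions, which is what lets the stitched regions blend into a single coherent body movement.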