GUESS: GradUally Enriching SyntheSis for Text-Driven Human Motion Generation

6 Jan 2024 | Xuehao Gao, Yang Yang, Zhenyu Xie, Shaoyi Du, Zhongqian Sun, and Yang Wu
GUESS is a cascaded diffusion-based generative framework for text-driven human motion synthesis. Its core strategy, GradUally Enriching SyntheSis, progressively abstracts the human pose into coarser skeletons at multiple granularity levels, so that motion can first be generated in a concise, stable form and then enriched with detail; this coarse-to-fine decomposition substantially benefits cross-modal motion synthesis.
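As a rough illustration of the pose-abstraction idea, each coarser skeleton can be obtained by merging groups of fine-grained joints into single "virtual" joints. The sketch below uses a hypothetical body-part grouping for a 22-joint skeleton; the exact grouping scheme is an illustrative assumption, not necessarily the one used in the paper.

```python
import numpy as np

# Hypothetical grouping of a 22-joint skeleton into five body parts
# (assumption for illustration; the paper's own grouping may differ).
JOINT_GROUPS = {
    "torso": [0, 3, 6, 9, 12, 15],
    "left_arm": [13, 16, 18, 20],
    "right_arm": [14, 17, 19, 21],
    "left_leg": [1, 4, 7, 10],
    "right_leg": [2, 5, 8, 11],
}

def abstract_pose(pose, groups=JOINT_GROUPS):
    """Coarsen a (J, 3) pose by averaging each joint group into one virtual joint."""
    return np.stack([pose[idx].mean(axis=0) for idx in groups.values()])

pose = np.random.randn(22, 3)                  # one fine-grained pose
coarse = abstract_pose(pose)                   # (5, 3): body-part level skeleton
coarser = coarse.mean(axis=0, keepdims=True)   # (1, 3): whole-body level
```

Applying the same averaging step repeatedly yields the sequence of progressively coarser skeletons that the cascade generates, from whole-body down to full joint detail.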
The framework divides text-driven motion synthesis across these abstraction levels and solves it with a multi-stage cascade of latent diffusion models: an initial generator produces the coarsest motion guess from the text description, and each subsequent generator enriches motion detail conditioned on both the text and the previous stage's result. A dynamic multi-condition fusion mechanism adaptively balances the contributions of the textual prompt and the synthesized coarser motion at every stage. Built on multi-scale pose representations, latent motion encoding, and cascaded latent diffusion, GUESS is evaluated on multiple large-scale datasets, where it outperforms existing methods in accuracy, realism, and diversity for both text-to-motion and action-to-motion synthesis.
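The dynamic multi-condition fusion can be pictured as a learned gate that turns the text and coarse-motion condition embeddings into a convex combination. The sketch below is a minimal stand-in, assuming a simple linear gate over the concatenated embeddings (`w_gate` and the embedding size are illustrative, not the paper's actual parameterization).

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def dynamic_fusion(text_emb, motion_emb, w_gate):
    """Adaptively weight the text and coarse-motion conditions.

    A (hypothetical) linear gate maps the concatenated embeddings to two
    logits; softmax turns them into weights for a convex combination.
    """
    logits = w_gate @ np.concatenate([text_emb, motion_emb])  # shape (2,)
    alpha = softmax(logits)
    return alpha[0] * text_emb + alpha[1] * motion_emb, alpha

d = 8
rng = np.random.default_rng(0)
text_emb = rng.standard_normal(d)     # condition from the text prompt
motion_emb = rng.standard_normal(d)   # condition from the previous stage's motion
w_gate = rng.standard_normal((2, 2 * d))
fused, alpha = dynamic_fusion(text_emb, motion_emb, w_gate)
```

Because the weights depend on the inputs themselves, later stages can lean more on the synthesized coarse motion when it is informative and fall back on the text otherwise.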