Understanding GUESS: GradUally Enriching SyntheSis for Text-Driven Human Motion Generation

The paper introduces GUESS (GradUally Enriching SyntheSis), a novel cascaded diffusion-based generative framework for text-driven human motion synthesis. Its strategy groups body joints that lie in close semantic proximity and replaces each group with a single body-part node, recursively abstracting a human pose into coarser skeletons at multiple granularity levels. As the abstraction level increases, the motion representation becomes more concise and stable, which benefits cross-modal motion synthesis.
The generation task is divided across these abstraction levels: an initial generator produces the coarsest human motion guess from the given text description, and each successive generator enriches the motion details, conditioned on the text and on the previously synthesized result. A dynamic multi-condition fusion mechanism balances the cooperative effects of the textual condition and the synthesized coarse motion prompt. Extensive experiments on large-scale datasets show that GUESS outperforms existing state-of-the-art methods in accuracy, realism, and diversity. The paper also discusses limitations and future work, including the potential for dynamic inference stages and temporal dimension expansion.
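
To make the joint-grouping idea concrete, the following is a minimal Python sketch of recursive skeleton abstraction. It is an illustration under stated assumptions, not the authors' implementation: the summary does not say how a group of joints is reduced to a body-part node, so the sketch assumes a simple mean of joint positions. The 6-joint toy skeleton, the groupings, and the names `abstract_pose` and `build_pyramid` are all hypothetical.

```python
from statistics import fmean


def abstract_pose(pose, groups):
    """Collapse each group of joints into one node.

    ASSUMPTION: the node position is the mean of its joints' positions;
    the pooling rule used in the paper is not given in this summary.
    """
    return [
        tuple(fmean(pose[j][axis] for j in group) for axis in range(3))
        for group in groups
    ]


def build_pyramid(pose, levels):
    """Apply successive groupings; returns poses ordered finest to coarsest."""
    pyramid = [pose]
    for groups in levels:
        pyramid.append(abstract_pose(pyramid[-1], groups))
    return pyramid


# Hypothetical toy skeleton: one (x, y, z) position per joint.
toy_pose = [
    (0.0, 1.0, 0.0),   # 0: pelvis
    (0.0, 2.0, 0.0),   # 1: chest
    (-1.0, 2.0, 0.0),  # 2: left hand
    (1.0, 2.0, 0.0),   # 3: right hand
    (-0.5, 0.0, 0.0),  # 4: left foot
    (0.5, 0.0, 0.0),   # 5: right foot
]

# Hypothetical groupings: indices refer to nodes of the previous level.
toy_levels = [
    [[0, 1], [2, 3], [4, 5]],  # 6 joints -> 3 body-part nodes
    [[0, 1, 2]],               # 3 body-part nodes -> 1 whole-body node
]

pyramid = build_pyramid(toy_pose, toy_levels)
for level, nodes in enumerate(pyramid):
    print(f"level {level}: {len(nodes)} node(s)")
```

In a cascade of the kind the summary describes, generation would run over such a pyramid in reverse order: the coarsest level is synthesized first from the text, and each finer level is then generated with the coarser result as an additional condition.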

GUESS: GradUally Enriching SyntheSis for Text-Driven Human Motion Generation

6 Jan 2024 | Xuehao Gao, Yang Yang, Zhenyu Xie, Shaoyi Du, Zhongqian Sun, and Yang Wu