1 Apr 2024 | Ekkasit Pinyoanuntapong, Muhammad Usama Saleem, Pu Wang, Minwoo Lee, Srijan Das, and Chen Chen
The paper introduces the Bidirectional Autoregressive Motion Model (BAMM), a novel framework for generating human motion from text. BAMM addresses the limitations of existing approaches, such as denoising (diffusion-based) motion models and autoregressive motion models, by combining a motion tokenizer with a masked self-attention transformer. The motion tokenizer converts 3D human motion into discrete latent tokens, while the masked self-attention transformer predicts these tokens autoregressively, using a hybrid attention-masking strategy that mixes unidirectional causal masks with bidirectional masks. This design captures bidirectional dependencies among motion tokens while retaining autoregressive generation, enabling high-quality motion synthesis with dynamically adjusted sequence lengths and improved usability.
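To make the hybrid masking concrete, here is a minimal, illustrative PyTorch sketch, not the authors' implementation: the per-step sampling scheme and the `p_causal` parameter are assumptions, but the two mask regimes match the unidirectional/bidirectional split described above.

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    # Unidirectional (autoregressive) mask: position i may attend
    # only to positions <= i, as in standard next-token prediction.
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

def bidirectional_mask(seq_len: int) -> torch.Tensor:
    # Bidirectional mask: every position attends to all positions,
    # as in BERT-style masked token modeling.
    return torch.ones(seq_len, seq_len, dtype=torch.bool)

def hybrid_mask(seq_len: int, p_causal: float = 0.5) -> torch.Tensor:
    # Hybrid strategy (illustrative): per training step, sample which
    # regime the transformer sees, so a single model learns both
    # left-to-right generation and bidirectional in-filling.
    if torch.rand(1).item() < p_causal:
        return causal_mask(seq_len)
    return bidirectional_mask(seq_len)
```

Alternating between the two regimes during training is what lets one model both generate tokens autoregressively (which supports dynamic sequence lengths) and re-predict masked tokens using context on both sides.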
Experiments on the HumanML3D and KIT-ML datasets demonstrate that BAMM outperforms state-of-the-art methods in both quality and capability. BAMM generates natural, detailed human movements aligned with textual descriptions, supports various motion editing tasks (e.g., inpainting, outpainting, prefix prediction, suffix completion), and can generate arbitrarily long motion sequences without prior knowledge of the motion length. The model's ability to predict motion lengths adaptively and its zero-shot motion editing capabilities make it a significant advancement in the field of text-to-motion generation.
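The zero-shot editing tasks follow naturally from the masked-prediction formulation: the tokens to be edited are replaced with a mask token and re-predicted conditioned on the text and the surrounding motion. A minimal sketch, assuming a hypothetical `MASK_ID` in the tokenizer's codebook and an already-tokenized motion sequence:

```python
import torch

MASK_ID = 512  # hypothetical id of the learned mask token (assumption)

def prepare_inpainting_input(tokens: torch.Tensor, start: int, end: int) -> torch.Tensor:
    # Inpainting (illustrative): keep the observed motion tokens and mask out
    # the span [start, end); the transformer then fills in the masked span
    # conditioned on the text prompt and the tokens on both sides.
    edited = tokens.clone()
    edited[start:end] = MASK_ID
    return edited

# Usage: mask frames-worth of tokens in the middle of a 60-token sequence.
tokens = torch.randint(0, 512, (60,))          # stand-in for real motion tokens
inpaint_input = prepare_inpainting_input(tokens, start=20, end=40)
```

Outpainting, prefix prediction, and suffix completion are the same operation with the masked span placed at the ends of the sequence rather than the middle.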