BAMM: Bidirectional Autoregressive Motion Model


1 Apr 2024 | Ekkasit Pinyoanuntapong, Muhammad Usama Saleem, Pu Wang, Minwoo Lee, Srijan Das, and Chen Chen
BAMM is a novel text-to-motion generation framework that addresses a key limitation of existing motion generation models: they require prior knowledge of the motion length, which limits their usability. BAMM instead predicts the end of the motion sequence itself, enabling more flexible and accurate motion generation.

BAMM combines two key components: a motion tokenizer that compresses raw 3D human motion into a sequence of discrete tokens in latent space, and a conditional masked self-attention transformer that autoregressively predicts randomly masked tokens. The transformer applies both a unidirectional (causal) attention mask and a bidirectional attention mask, integrating an autoregressive model and a generative masked model into a unified framework. Training follows a hybrid attention masking strategy: the two masks are applied at random, and the model must reconstruct the motion sequence under both. This lets BAMM capture rich, bidirectional dependencies among motion tokens while learning a probabilistic mapping from textual inputs to motion outputs with a dynamically adjusted sequence length, yielding high-quality generation with built-in motion editability.
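The summary does not include code, but the hybrid attention masking idea can be illustrated with a minimal sketch. The snippet below assumes a hypothetical conditional masked transformer `transformer(tokens, text_emb, attn_mask=...)` that returns per-position logits over the motion-token vocabulary; the 40% corruption ratio and the 50/50 choice between the two attention regimes are illustrative assumptions, not values taken from the paper.

```python
# Sketch of hybrid attention masking: each training step randomly uses either a
# unidirectional (causal) mask or a fully bidirectional mask, and the transformer
# must reconstruct the corrupted motion tokens under whichever regime was drawn.
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    """Unidirectional mask: position i may attend only to positions <= i."""
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

def bidirectional_mask(seq_len: int) -> torch.Tensor:
    """Bidirectional mask: every position may attend to every position."""
    return torch.ones(seq_len, seq_len, dtype=torch.bool)

def hybrid_training_step(transformer, text_emb, motion_tokens, mask_token_id, p_causal=0.5):
    """One training step under the hybrid masking strategy (illustrative sketch)."""
    seq_len = motion_tokens.size(1)

    # Randomly corrupt a subset of motion tokens that the model must reconstruct.
    corrupt = torch.rand(motion_tokens.shape) < 0.4                  # assumed masking ratio
    inputs = motion_tokens.masked_fill(corrupt, mask_token_id)

    # Randomly pick the attention regime for this step.
    if torch.rand(()).item() < p_causal:
        attn_mask = causal_mask(seq_len)
    else:
        attn_mask = bidirectional_mask(seq_len)

    # Hypothetical transformer call: logits over the motion-token vocabulary per position.
    logits = transformer(inputs, text_emb, attn_mask=attn_mask)

    # Reconstruction loss only on the corrupted positions.
    loss = torch.nn.functional.cross_entropy(logits[corrupt], motion_tokens[corrupt])
    return loss
```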
At inference time, BAMM introduces cascaded motion decoding. It first uses unidirectional autoregressive decoding to generate a coarse-grained motion sequence, implicitly predicting the motion length in the process. The coarse sequence is then refined by masking a portion of the motion tokens and regenerating them with bidirectional attention (see the decoding sketch below).

Because it predicts motion length on its own and can regenerate arbitrary spans of tokens, BAMM supports a wide range of motion editing tasks in a zero-shot manner, including motion inpainting, outpainting, prefix prediction, suffix completion, and long-sequence generation (an inpainting sketch is given at the end). Extensive experiments on the standard HumanML3D and KIT-ML text-to-motion datasets show that BAMM surpasses current state-of-the-art methods in both qualitative and quantitative measures, outperforming existing approaches in generation quality, editability, and the ability to generate motion without prior knowledge of its length.
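The cascaded decoding described above can be sketched as two passes over the same transformer. The interface (`transformer(..., causal=...)`, the `motion_decoder` that maps tokens back to 3D motion, the special start/end/mask token ids, the maximum length, and the refinement ratio) is assumed for illustration and is not taken from the paper.

```python
# Sketch of cascaded decoding: pass 1 generates tokens autoregressively until an
# end token appears (implicitly fixing the motion length); pass 2 masks a fraction
# of those tokens and re-predicts them with bidirectional attention to refine them.
import torch

@torch.no_grad()
def cascaded_decode(transformer, motion_decoder, text_emb, start_token_id,
                    end_token_id, mask_token_id, max_len=196, refine_ratio=0.4):
    # --- Pass 1: unidirectional autoregressive decoding (coarse motion, implicit length). ---
    tokens = [start_token_id]
    for _ in range(max_len):
        seq = torch.tensor(tokens, dtype=torch.long).unsqueeze(0)     # (1, t)
        logits = transformer(seq, text_emb, causal=True)              # assumed interface
        next_tok = int(logits[0, -1].argmax())
        if next_tok == end_token_id:                                  # model decides to stop here
            break
        tokens.append(next_tok)

    coarse = torch.tensor(tokens[1:], dtype=torch.long).unsqueeze(0)  # (1, T), start token dropped

    # --- Pass 2: mask a fraction of the tokens and regenerate them bidirectionally. ---
    num_mask = max(1, int(refine_ratio * coarse.size(1)))
    mask_idx = torch.randperm(coarse.size(1))[:num_mask]
    masked = coarse.clone()
    masked[0, mask_idx] = mask_token_id
    logits = transformer(masked, text_emb, causal=False)              # bidirectional attention
    refined = coarse.clone()
    refined[0, mask_idx] = logits[0, mask_idx].argmax(dim=-1)

    # Map the discrete motion tokens back to a 3D motion sequence with the tokenizer's decoder.
    return motion_decoder(refined)
```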
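As one concrete example of the zero-shot editing described above, inpainting can reuse the bidirectional pass directly: tokens outside the edited span are kept fixed, the span itself is replaced with mask tokens, and the model fills it in conditioned on the text. The function below is a hedged sketch using the same assumed interface as the decoding example; outpainting, prefix prediction, and suffix completion follow the same recipe with different masked regions.

```python
# Sketch of zero-shot motion inpainting via mask-and-regenerate (assumed interface).
import torch

@torch.no_grad()
def inpaint(transformer, motion_decoder, text_emb, tokens, edit_start, edit_end, mask_token_id):
    """Regenerate tokens[edit_start:edit_end] while keeping the surrounding motion fixed."""
    edited = tokens.clone()                                  # tokens: (1, T) motion-token ids
    edited[0, edit_start:edit_end] = mask_token_id           # hide the span to be rewritten
    logits = transformer(edited, text_emb, causal=False)     # bidirectional attention
    edited[0, edit_start:edit_end] = logits[0, edit_start:edit_end].argmax(dim=-1)
    return motion_decoder(edited)                            # back to a 3D motion sequence
```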