SMART: Scalable Multi-agent Real-time Simulation via Next-token Prediction

24 May 2024 | Wei Wu, Xiaoxin Feng, Ziyuan Gao, Yuheng Kan

This paper introduces SMART, a novel autonomous driving motion generation paradigm that models vectorized map and agent trajectory data as discrete sequence tokens. These tokens are processed by a decoder-only transformer architecture trained on a next-token prediction task over spatial-temporal series. This GPT-style approach allows the model to learn the motion distribution of real driving scenarios. SMART achieves state-of-the-art performance on most metrics of the generative Sim Agents challenge, ranking 1st on the Waymo Open Motion Dataset (WOMD) leaderboard while demonstrating remarkable inference speed. Moreover, SMART exhibits zero-shot generalization in the autonomous driving motion domain: trained only on the NuPlan dataset and validated on WOMD, it achieves a competitive score of 0.71 on the Sim Agents challenge. Lastly, we have collected over 1 billion motion tokens from multiple datasets, validating the model's scalability. These results suggest that SMART preliminarily exhibits two important properties, scalability and zero-shot generalization, and meets the needs of large-scale real-time simulation applications. We have released all the code to promote the exploration of motion generation models in the autonomous driving field.

SMART is an autoregressive generative model for dynamic driving scenarios. While both language and agent motions are sequential, they differ in representation: natural language consists of words from a finite vocabulary, whereas agent motions are continuous real-valued data. This distinction necessitates the agent motion and road vector tokenizers described in Sec. 3.1, including the construction of the vocabulary and the tokenization of motion sequences. Sec. 3.2 provides a comprehensive description of the model's architecture, and Sec. 3.3 elaborates on the training tasks through which the model learns the distribution of motion tokens within the temporal sequence and the distribution of road tokens within the spatial sequence. The model comprises an encoder for road map encoding and a motion decoder that predicts a categorical distribution based on motion token embeddings.
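To make the tokenization and the next-token prediction objective concrete, here is a minimal, self-contained PyTorch sketch. The vocabulary below is a random stand-in (in practice it would be built from real trajectory data, e.g. by clustering motion segments), the (dx, dy, dheading) delta format is an assumption, and TinyMotionLM is a hypothetical toy decoder, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# --- Tokenization: continuous motion deltas -> discrete vocabulary ids ---
V, D_MOTION, T = 1024, 3, 16          # vocab size, (dx, dy, dheading), steps
vocab = torch.randn(V, D_MOTION)      # stand-in for a clustered vocabulary

def tokenize(deltas: torch.Tensor) -> torch.Tensor:
    # Nearest-neighbour lookup: each continuous step becomes one token id.
    return torch.cdist(deltas, vocab).argmin(dim=-1)

# --- Next-token prediction: GPT-style objective over the token sequence ---
class TinyMotionLM(nn.Module):
    def __init__(self, dim: int = 128):
        super().__init__()
        self.embed = nn.Embedding(V, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, V)   # categorical distribution over vocab

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        x = self.embed(tokens)                                    # (B, T, dim)
        causal = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        return self.head(self.decoder(x, mask=causal))            # (B, T, V)

# Usage: tokenize one trajectory, train on shifted (next-step) targets.
tokens = tokenize(torch.randn(T, D_MOTION)).unsqueeze(0)   # (1, T)
logits = TinyMotionLM()(tokens[:, :-1])                    # predict steps 1..T-1
loss = F.cross_entropy(logits.reshape(-1, V), tokens[:, 1:].reshape(-1))
loss.backward()
```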
RoadNet: road token encoder

We employ multi-head self-attention (MHSA) to model the relationships among road tokens; the updated road token encodings then assist motion token decoding. For the i-th road token, we derive a query from its embedding r_i and let it attend to the neighboring tokens r_j ∈ R_i.
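The following is a minimal sketch of such a neighborhood-restricted MHSA layer. Defining the neighbor set R_i by k-nearest-neighbor distance between token centroids is an assumption made for illustration; the paper may define neighborhoods differently.

```python
import torch
import torch.nn as nn

def knn_mask(positions: torch.Tensor, k: int) -> torch.Tensor:
    """Boolean attention mask where True marks pairs to IGNORE.

    positions: (N, 2) road-token centroids. Each token may attend only to
    its k nearest neighbours (including itself); distance-based
    neighbourhoods are an assumed stand-in for how R_i is defined.
    """
    dists = torch.cdist(positions, positions)            # (N, N)
    knn = dists.topk(k, largest=False).indices           # (N, k)
    mask = torch.ones_like(dists, dtype=torch.bool)
    mask.scatter_(1, knn, False)                         # allow the k nearest
    return mask

class RoadNetLayer(nn.Module):
    """One MHSA layer over road-token embeddings (pre-norm, residual)."""

    def __init__(self, dim: int = 128, heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, r: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # r: (1, N, dim) road-token embeddings; query = key = value = r.
        h = self.norm(r)
        out, _ = self.attn(h, h, h, attn_mask=mask)
        return r + out

# Usage: 50 road tokens, each attending to its 8 nearest neighbours.
pos = torch.rand(50, 2) * 100.0
r = torch.randn(1, 50, 128)
updated = RoadNetLayer()(r, knn_mask(pos, k=8))   # (1, 50, 128)
```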
MotionNet: factorized agent motion decoder

Prevailing methods for encoding agents prioritize capturing the temporal dynamics of an agent's movements, followed by the integration of agent-map and agent-agent interactions, as highlighted by [35]. Factorized attention effectively captures detailed agent-map interactions across temporal scales [28]. In our work, we leverage a factorized Transformer architecture with multi-head cross-attention (MHCA) to decode complex road-agent and agent-agent relationships along the time series, akin to query-centric methodologies [52].
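Below is a sketch of one such factorized layer: causal temporal self-attention within each agent's own token sequence, MHCA from agent tokens to road tokens, then attention among agents at the same timestep. The tensor layout, the ordering of the three attention stages, and the module names are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class FactorizedMotionLayer(nn.Module):
    """One factorized decoder layer (a sketch of the described pattern):

    1. temporal self-attention within each agent's motion-token sequence
       (causal, so step t only sees steps <= t),
    2. cross-attention from agent tokens to road tokens (agent-map),
    3. attention among agents at the same timestep (agent-agent).
    """

    def __init__(self, dim: int = 128, heads: int = 8):
        super().__init__()
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.agent_map = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.agent_agent = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(3))

    def forward(self, a: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
        # a: (A, T, dim) motion-token embeddings for A agents over T steps.
        # r: (1, N, dim) road-token encodings from RoadNet.
        A, T, D = a.shape
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)

        h = self.norms[0](a)                        # temporal, per agent
        a = a + self.temporal(h, h, h, attn_mask=causal)[0]

        h = self.norms[1](a).reshape(1, A * T, D)   # agents -> map
        a = a + self.agent_map(h, r, r)[0].reshape(A, T, D)

        h = self.norms[2](a).transpose(0, 1)        # agent <-> agent, per step
        a = a + self.agent_agent(h, h, h)[0].transpose(0, 1)
        return a

# Usage: 4 agents, 16 timesteps, 50 road tokens.
agents = torch.randn(4, 16, 128)
roads = torch.randn(1, 50, 128)
out = FactorizedMotionLayer()(agents, roads)   # (4, 16, 128)
```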