DriveDreamer-2: LLM-Enhanced World Models for Diverse Driving Video Generation


11 Apr 2024 | Guosheng Zhao¹\*, Xiaofeng Wang¹\*, Zheng Zhu²\*✉, Xinze Chen², Guan Huang², Xiaoyi Bao¹, and Xingang Wang¹✉
**Project Page:** <https://drivedreamer2.github.io>

**Abstract:** World models have shown significant potential in autonomous driving, particularly in generating multi-view driving videos. However, creating customized driving videos remains challenging. This paper introduces *DriveDreamer-2*, an advanced world model that incorporates a Large Language Model (LLM) to generate user-defined driving videos. *DriveDreamer-2* first converts user queries into agent trajectories through an LLM interface, then generates a realistic HDMap that is consistent with those trajectories and with traffic regulations. A Unified Multi-View Model (UniMVM) improves the temporal and spatial coherence of the generated videos. *DriveDreamer-2* is the first world model to generate diverse, user-customized driving videos, including uncommon scenarios such as a vehicle abruptly cutting in. Experimental results show that *DriveDreamer-2* outperforms other state-of-the-art methods, achieving FID and FVD scores of 11.2 and 55.7, improvements of approximately 30% and 50%, respectively.

**Keywords:** World models, Autonomous driving, Video generation

**Introduction:** World models in autonomous driving have attracted significant attention for their predictive capabilities, which enable the generation of diverse driving videos. Generating customized driving videos, however, remains challenging. *DriveDreamer-2* addresses this with a user-friendly text-to-traffic simulation pipeline that produces diverse traffic conditions. The pipeline disentangles foreground and background conditions: a finetuned LLM generates agent trajectories from the user's text prompt, and an HDMap generator simulates road structures conditioned on those trajectories. The UniMVM framework then improves the temporal and spatial coherence of the generated multi-view videos. A minimal sketch of this staged conditioning flow follows below.
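The paper is described here only at a high level and no reference implementation is reproduced; the following is an illustrative sketch of the staged conditioning flow just described. All names (`TrafficCondition`, `text_to_traffic`, `generate_video`, and the methods on the `llm`, `hdmap_gen`, and `unimvm` objects) are hypothetical placeholders, not the authors' released API.

```python
# Illustrative sketch only -- class and method names are hypothetical,
# not DriveDreamer-2's actual API.
from dataclasses import dataclass


@dataclass
class TrafficCondition:
    trajectories: list  # per-agent waypoint sequences (foreground condition)
    hdmap: object       # road-structure layout, e.g. lanes (background condition)


def text_to_traffic(prompt: str, llm, hdmap_gen) -> TrafficCondition:
    """Stage 1: disentangled foreground/background condition generation.

    A finetuned LLM turns a free-form user query (e.g. "a car abruptly
    cuts in from the right") into agent trajectories; the HDMap generator
    then synthesizes a road layout consistent with those trajectories
    and with traffic regulations.
    """
    trajectories = llm.generate_trajectories(prompt)  # foreground agents
    hdmap = hdmap_gen.sample(trajectories)            # matching road structure
    return TrafficCondition(trajectories, hdmap)


def generate_video(condition: TrafficCondition, unimvm):
    """Stage 2: the Unified Multi-View Model renders multi-view driving
    video conditioned on the structured traffic layout."""
    return unimvm.sample(trajectories=condition.trajectories,
                         hdmap=condition.hdmap)
```

The key design point this sketch captures is the disentanglement: the user's text prompt only ever touches the LLM, while the video model consumes structured conditions (trajectories plus HDMap) rather than raw text.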
**Main Contributions:**
- *DriveDreamer-2* is the first world model to generate diverse driving videos in a user-friendly manner.
- A traffic simulation pipeline that uses only text prompts to generate diverse traffic conditions.
- UniMVM, which enhances spatial and temporal coherence in multi-view driving video generation.

**Experiments:**
- Extensive experiments show that *DriveDreamer-2* generates diverse, high-quality driving videos, including uncommon scenarios.
- The generated videos improve the training of driving perception methods, boosting 3D detection and tracking metrics.
- Ablation studies demonstrate the effectiveness of the diffusion backbone and of UniMVM.
- Video quality is reported with FID and FVD; a sketch of how such metrics are commonly computed appears after the conclusion.

**Conclusion:** *DriveDreamer-2* significantly advances the generation of diverse, high-quality driving videos, outperforming state-of-the-art methods on both FID and FVD.
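The reported FID of 11.2 and FVD of 55.7 measure, respectively, the frame-level and video-level distributional distance between generated and real driving footage. The paper's exact evaluation code is not reproduced here; as a reference point, the sketch below shows how FID is commonly computed with the off-the-shelf `torchmetrics` implementation. FVD follows the same Fréchet-distance recipe but replaces Inception image features with features from a video network such as I3D, for which no equally standard package call exists.

```python
# Minimal FID sketch using torchmetrics (not the paper's evaluation code).
# Requires: pip install torchmetrics[image]
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

fid = FrechetInceptionDistance(feature=2048)

# Frames must be uint8 tensors of shape (N, 3, H, W) in [0, 255];
# random data stands in here for real and generated video frames.
real_frames = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)
fake_frames = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)

fid.update(real_frames, real=True)   # accumulate Inception features of real frames
fid.update(fake_frames, real=False)  # accumulate features of generated frames
print(fid.compute())                 # Fréchet distance between the two feature sets
```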