DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models


25 Jun 2024 | Xiaoyu Tian1*, Junru Gu1*, Bailin Li2*, Yicheng Liu1*, Yang Wang2, Zhiyong Zhao2, Kun Zhan2, Peng Jia2, Xianpeng Lang2, Hang Zhao1†
This paper introduces **DriveVLM**, an autonomous driving system that leverages Vision-Language Models (VLMs) to enhance scene understanding and planning. DriveVLM runs a Chain-of-Thought (CoT) process built from three chained modules: scene description, scene analysis, and hierarchical planning.

To address the limitations of VLMs in spatial reasoning and computational efficiency, the authors further propose **DriveVLM-Dual**, a hybrid system that pairs DriveVLM with a traditional autonomous driving pipeline. The hybrid integrates 3D perception and planning modules to obtain precise spatial reasoning and real-time trajectory planning.

The authors define a Scene Understanding for Planning (SUP) task together with evaluation metrics for scene analysis and meta-action planning, and construct a Scene Understanding and Planning (SUP-AD) dataset through a comprehensive data mining and annotation pipeline. Extensive experiments on the nuScenes and SUP-AD datasets demonstrate the effectiveness of DriveVLM and DriveVLM-Dual in complex and unpredictable driving conditions. Finally, DriveVLM-Dual is deployed on a production vehicle, confirming its effectiveness in real-world autonomous driving.
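The three CoT modules can be read as a chained prompting loop over the camera stream. Below is a minimal sketch assuming a generic chat-style VLM client; the `vlm.ask` interface, prompt wording, and JSON output format are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of DriveVLM's chain-of-thought stages (scene description ->
# scene analysis -> hierarchical planning). The `vlm.ask` interface, prompts,
# and JSON schema are hypothetical, not the paper's code.
import json
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class DrivePlan:
    description: str                       # environment and critical objects
    analysis: str                          # how each critical object affects the ego vehicle
    meta_actions: List[str]                # e.g. ["decelerate", "yield", "turn left"]
    waypoints: List[Tuple[float, float]]   # coarse future ego positions


def parse_plan(text: str) -> Tuple[List[str], List[Tuple[float, float]]]:
    """Parse the VLM's JSON reply; fall back to a conservative plan on malformed output."""
    try:
        data = json.loads(text)
        return list(data["meta_actions"]), [tuple(p) for p in data["waypoints"]]
    except (json.JSONDecodeError, KeyError, TypeError):
        return ["decelerate", "stop"], []


def drive_vlm_step(vlm, images) -> DrivePlan:
    """Run the three chained stages on the current camera images."""
    description = vlm.ask(images=images,
                          prompt="Describe the driving scene and list critical objects.")
    analysis = vlm.ask(images=images,
                       prompt=f"Scene: {description}\n"
                              "Analyze how each critical object may influence the ego vehicle.")
    plan_text = vlm.ask(images=images,
                        prompt=f"Scene: {description}\nAnalysis: {analysis}\n"
                               "Return JSON with 'meta_actions' and 'waypoints' "
                               "for the ego vehicle over the next few seconds.")
    meta_actions, waypoints = parse_plan(plan_text)
    return DrivePlan(description, analysis, meta_actions, waypoints)
```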
**Contributions:**
1. Introduction of DriveVLM and DriveVLM-Dual, which leverage VLMs for enhanced scene understanding and planning.
2. Definition of the SUP task and its evaluation metrics.
3. Construction of the SUP-AD dataset and deployment of DriveVLM-Dual on a production vehicle.

**Keywords:** Autonomous Driving, Vision-Language Model, Dual System

**Related Works:**
- Vision-Language Models (VLMs)
- Learning-based Planning
- Driving Caption Datasets

**Experiments:**
- Performance on the SUP-AD and nuScenes datasets.
- Ablation studies on model design and on integration with the traditional autonomous driving pipeline.
- Qualitative results and inference performance across different LLMs and hardware.

**Conclusion:** DriveVLM and DriveVLM-Dual advance autonomous driving by leveraging VLMs for scene understanding and planning. The hybrid system effectively addresses spatial-reasoning and computational challenges, and handles complex, dynamic driving scenarios.
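To make the hybrid idea concrete, the sketch below shows one way a slow VLM branch and a fast classical stack could be wired together: the VLM proposes coarse waypoints at low frequency, while 3D perception and a trajectory planner refine them in real time. The threading layout, update rates, and the `vlm_branch` / `perception_3d` / `classical_planner` interfaces are hypothetical assumptions, not the deployed system.

```python
# Minimal sketch of a DriveVLM-Dual-style slow/fast loop. All interfaces and
# rates are illustrative; the real system's integration details are not shown
# in this summary.
import threading
import time


class DualSystemDriver:
    def __init__(self, vlm_branch, perception_3d, classical_planner):
        self.vlm_branch = vlm_branch            # slow: scene understanding + coarse waypoints
        self.perception_3d = perception_3d      # fast: 3D detections for spatial grounding
        self.classical_planner = classical_planner
        self.coarse_waypoints = []              # latest proposal from the slow branch
        self._lock = threading.Lock()

    def _slow_loop(self, camera_feed, hz=2.0):
        """VLM branch: refresh the high-level plan a few times per second at most."""
        while True:
            proposal = self.vlm_branch.plan(camera_feed.latest())
            with self._lock:
                self.coarse_waypoints = list(proposal.waypoints)
            time.sleep(1.0 / hz)

    def _fast_loop(self, sensor_feed, hz=10.0):
        """Classical branch: refine the coarse plan into a drivable trajectory in real time."""
        while True:
            obstacles = self.perception_3d.detect(sensor_feed.latest())
            with self._lock:
                coarse = list(self.coarse_waypoints)
            trajectory = self.classical_planner.refine(coarse, obstacles)
            self.execute(trajectory)
            time.sleep(1.0 / hz)

    def execute(self, trajectory):
        """Hand the refined trajectory to the vehicle controller (stub)."""
        pass

    def run(self, camera_feed, sensor_feed):
        threading.Thread(target=self._slow_loop, args=(camera_feed,), daemon=True).start()
        self._fast_loop(sensor_feed)
```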