Holistic Autonomous Driving Understanding by Bird's-Eye-View Injected Multi-Modal Large Models

2 Jan 2024 | Xinpeng Ding¹ Jianhua Han² Hang Xu² Xiaodan Liang³ Wei Zhang² Xiaomeng Li¹
This paper introduces NuInstruct, a dataset for holistic language-based autonomous driving that contains 91,355 instruction-response pairs spanning 17 subtasks. The pairs are generated with a SQL-based pipeline that follows the logical progression of human driving: perception, prediction, risk assessment, and planning with reasoning. Each sample takes multi-view video as input and demands the temporal, multi-view, and spatial information needed for comprehensive driving understanding.

To meet these demands, the authors propose BEV-InMLLM, an end-to-end method that integrates instruction-aware Bird's-Eye-View (BEV) features into existing multimodal large language models (MLLMs). By injecting multi-view, spatial, and temporal semantics, BEV-InMLLM equips MLLMs for more accurate autonomous driving understanding. The BEV injection module is designed as a plug-and-play component that aligns BEV features with the language features consumed by the LLM.

Experiments on NuInstruct show that BEV-InMLLM outperforms existing MLLMs by around 9% on various tasks, and ablation studies evaluate the contribution of each module. The paper also discusses the limitations of current language-based driving research, namely partial task coverage and incomplete input information, which NuInstruct is designed to address. The dataset and method are released to support future research.
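To make the BEV-injection idea more concrete, the sketch below shows one plausible way a plug-and-play module could align BEV features with language features via cross-attention. This is an illustration only, not the authors' released implementation: the class name `BEVInjection`, the feature dimensions, and the single-block design are all assumptions for the sake of the example.

```python
# Minimal sketch of a BEV-injection style module (illustrative, not the paper's code).
# Assumes flattened BEV features of shape (B, H*W, c_bev) from some BEV encoder and
# instruction-conditioned language tokens of shape (B, N, c_lm) from the host MLLM.
import torch
import torch.nn as nn


class BEVInjection(nn.Module):
    """Cross-attention block that injects BEV context into the MLLM token stream."""

    def __init__(self, c_lm: int = 768, c_bev: int = 256, n_heads: int = 8):
        super().__init__()
        self.bev_proj = nn.Linear(c_bev, c_lm)          # align BEV dim to the LM dim
        self.cross_attn = nn.MultiheadAttention(c_lm, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(c_lm)
        self.ffn = nn.Sequential(
            nn.Linear(c_lm, 4 * c_lm),
            nn.GELU(),
            nn.Linear(4 * c_lm, c_lm),
        )

    def forward(self, lm_tokens: torch.Tensor, bev_feats: torch.Tensor) -> torch.Tensor:
        # lm_tokens: (B, N, c_lm)   instruction-aware queries/tokens from the MLLM
        # bev_feats: (B, H*W, c_bev) flattened bird's-eye-view feature map
        bev = self.bev_proj(bev_feats)
        # Queries come from the language side, keys/values from BEV, so the attended
        # context is conditioned on the instruction ("instruction-aware").
        ctx, _ = self.cross_attn(query=lm_tokens, key=bev, value=bev)
        x = self.norm(lm_tokens + ctx)                  # residual injection
        return x + self.ffn(x)                          # refined tokens fed to the LLM


# Toy usage: a hypothetical 50x50 BEV grid and 32 language tokens.
tokens = torch.randn(2, 32, 768)
bev = torch.randn(2, 50 * 50, 256)
fused = BEVInjection()(tokens, bev)
print(fused.shape)  # torch.Size([2, 32, 768])
```

Because the module only reads BEV features and returns tokens of the LLM's own width, a block like this could in principle be inserted into an existing MLLM without modifying its backbone, which is what "plug-and-play" suggests here.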