Holistic Autonomous Driving Understanding by Bird's-Eye-View Injected Multi-Modal Large Models

2 Jan 2024 | Xinpeng Ding¹ Jianhua Han² Hang Xu² Xiaodan Liang³ Wei Zhang² Xiaomeng Li¹
This paper introduces NuInstruct, a dataset for holistic language-based autonomous driving that contains 91,355 instruction-response pairs spanning 17 subtasks. The pairs are generated with a SQL-based pipeline that follows the logical progression of human driving: perception, prediction, risk assessment, and planning with reasoning. Each sample takes multi-view video as input and demands the temporal, multi-view, and spatial information needed for comprehensive driving understanding.

To meet these demands, the authors propose BEV-InMLLM, an end-to-end method that integrates instruction-aware Bird's-Eye-View (BEV) features into existing multimodal large language models (MLLMs). By injecting multi-view, spatial, and temporal semantics, BEV-InMLLM equips MLLMs for more accurate autonomous driving understanding. The BEV injection module is designed as a plug-and-play component that aligns BEV features with the language features consumed by the LLM.

Experiments on NuInstruct show that BEV-InMLLM outperforms existing MLLMs by around 9% on various tasks, and ablation studies evaluate the contribution of each module. The paper also discusses the limitations of current language-based driving research, namely partial task coverage and incomplete input information, which NuInstruct is designed to address. The dataset and method are released to support future research.
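To make the BEV-injection idea more concrete, the sketch below shows one plausible way a plug-and-play module could align BEV features with language features via cross-attention. This is an illustration only, not the authors' released implementation: the class name `BEVInjection`, the feature dimensions, and the single-block design are all assumptions for the sake of the example.

```python
# Minimal sketch of a BEV-injection style module (illustrative, not the paper's code).
# Assumes flattened BEV features of shape (B, H*W, c_bev) from some BEV encoder and
# instruction-conditioned language tokens of shape (B, N, c_lm) from the host MLLM.
import torch
import torch.nn as nn


class BEVInjection(nn.Module):
    """Cross-attention block that injects BEV context into the MLLM token stream."""

    def __init__(self, c_lm: int = 768, c_bev: int = 256, n_heads: int = 8):
        super().__init__()
        self.bev_proj = nn.Linear(c_bev, c_lm)          # align BEV dim to the LM dim
        self.cross_attn = nn.MultiheadAttention(c_lm, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(c_lm)
        self.ffn = nn.Sequential(
            nn.Linear(c_lm, 4 * c_lm),
            nn.GELU(),
            nn.Linear(4 * c_lm, c_lm),
        )

    def forward(self, lm_tokens: torch.Tensor, bev_feats: torch.Tensor) -> torch.Tensor:
        # lm_tokens: (B, N, c_lm)   instruction-aware queries/tokens from the MLLM
        # bev_feats: (B, H*W, c_bev) flattened bird's-eye-view feature map
        bev = self.bev_proj(bev_feats)
        # Queries come from the language side, keys/values from BEV, so the attended
        # context is conditioned on the instruction ("instruction-aware").
        ctx, _ = self.cross_attn(query=lm_tokens, key=bev, value=bev)
        x = self.norm(lm_tokens + ctx)                  # residual injection
        return x + self.ffn(x)                          # refined tokens fed to the LLM


# Toy usage: a hypothetical 50x50 BEV grid and 32 language tokens.
tokens = torch.randn(2, 32, 768)
bev = torch.randn(2, 50 * 50, 256)
fused = BEVInjection()(tokens, bev)
print(fused.shape)  # torch.Size([2, 32, 768])
```

Because the module only reads BEV features and returns tokens of the LLM's own width, a block like this could in principle be inserted into an existing MLLM without modifying its backbone, which is what "plug-and-play" suggests here.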