RAG-Driver: Generalisable Driving Explanations with Retrieval-Augmented In-Context Learning in Multi-Modal Large Language Model

29 May 2024 | Jianhao Yuan, Shuyang Sun, Daniel Omeiza, Bo Zhao, Paul Newman, Lars Kunze, Matthew Gadd
RAG-Driver is a retrieval-augmented multi-modal large language model (MLLM) for generalisable, explainable end-to-end autonomous driving. It provides three key functions: (1) Action Explanation, (2) Action Justification, and (3) Next Control Signal Prediction.

The architecture couples a unified perception-and-planning unit, built on an MLLM, with a memory unit backed by a hybrid vector and textual database. A retrieval engine connects the two, enabling robust multi-modal in-context learning (ICL) during decision-making: retrieval-augmented in-context learning (RA-ICL) grounds each prediction in analogous past driving experiences, improving generalisation to unseen driving environments.

The model is trained in two stages on a curated multi-modal driving in-context instruction tuning dataset: the cross-modality projector is pre-trained first, and the model is then instruction-tuned on structured ICL examples derived from the BDD-X dataset. The retrieval engine is vector-similarity based and tailored specifically to driving applications, as sketched below.
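The paper's implementation is not reproduced here; the following is a minimal sketch of the general pattern the summary describes: vector-similarity retrieval over a hybrid memory, followed by assembly of the retrieved records into an in-context prompt. All names (`cosine_topk`, `build_icl_prompt`, the `<video>` placeholder, the record fields) are illustrative assumptions rather than the authors' code.

```python
import numpy as np

def cosine_topk(query_vec, memory_vecs, k=2):
    """Indices of the k stored experiences most similar to the query clip."""
    q = query_vec / np.linalg.norm(query_vec)
    m = memory_vecs / np.linalg.norm(memory_vecs, axis=1, keepdims=True)
    return np.argsort(-(m @ q))[:k]

def build_icl_prompt(examples, query_text):
    """Prepend retrieved demonstrations to the query as in-context examples."""
    blocks = [
        f"<video> Action: {ex['explanation']} "
        f"Justification: {ex['justification']} Control: {ex['signals']}"
        for ex in examples
    ]
    blocks.append(f"<video> {query_text}")
    return "\n".join(blocks)

# Hypothetical hybrid memory: a vector index paired with parallel textual
# records holding each clip's explanation, justification, and control signals.
rng = np.random.default_rng(0)
memory_vectors = rng.normal(size=(100, 512))   # stand-in video embeddings
memory_records = [
    {"explanation": f"The car slows down ({i}).",
     "justification": "Traffic ahead is stopping.",
     "signals": (0.0, -1.2)}
    for i in range(100)
]

query_embedding = rng.normal(size=512)         # embedding of the current clip
retrieved = [memory_records[i]
             for i in cosine_topk(query_embedding, memory_vectors, k=2)]
print(build_icl_prompt(
    retrieved, "Describe the action, justify it, and predict the control signal."))
```

The retrieved demonstrations act as analogical examples: the MLLM conditions on them at inference time, which is what lets the system adapt to unseen environments without any gradient update.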
RAG-Driver is evaluated on BDD-X, which comprises 77 hours of video under varied road and weather conditions, and on Spoken-SAX, a customised dataset of video sequences narrated by a professional driving instructor. Text generation quality is measured with BLEU, METEOR, and CIDEr; control signal prediction accuracy with RMSE.

Against existing methods, RAG-Driver achieves state-of-the-art performance in producing driving action explanations, justifications, and control signal predictions, both in-domain and, without any fine-tuning, in unseen environments. Because its outputs are grounded in retrieved analogical demonstrations, it generalises zero-shot to out-of-distribution scenarios, producing human-understandable explanations and accurate control signals even for conditions it was never trained on, while substantially reducing the need for continuous retraining.
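For reference, RMSE here is the root of the mean squared difference between predicted and logged control values. A minimal sketch with made-up numbers (the control signals in BDD-X-style annotations are, to our understanding, course and speed; the values below are purely illustrative):

```python
import numpy as np

def rmse(pred, target):
    """Root-mean-square error between predicted and ground-truth signals."""
    pred, target = np.asarray(pred, float), np.asarray(target, float)
    return float(np.sqrt(np.mean((pred - target) ** 2)))

# Illustrative values only: predicted vs. logged course (deg) and speed (m/s).
print(rmse([1.2, -0.5, 0.0], [1.0, -0.4, 0.1]))  # course RMSE
print(rmse([5.1, 4.8, 6.0], [5.0, 5.0, 5.9]))    # speed RMSE
```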