Multi-Frame, Lightweight & Efficient Vision-Language Models for Question Answering in Autonomous Driving

9 May 2024 | Akshay Gopalkrishnan, Ross Greer, Mohan Trivedi
This paper introduces EM-VLM4AD, an efficient, lightweight, multi-frame vision-language model for Visual Question Answering (VQA) in autonomous driving. Unlike existing approaches that rely on large language models (LLMs) with over a billion parameters, EM-VLM4AD requires at least 10 times less memory and fewer floating-point operations, while achieving higher CIDEr and ROUGE-L scores on the DriveLM dataset. It extracts the information in multi-view traffic scenes that is relevant to a prompt and answers questions for a range of autonomous driving subtasks.

The architecture pairs a custom image embedding network with a pre-trained T5 language model: the embedding network aggregates the multi-view images into a single embedding, which is concatenated with the text embedding and passed to the LM. Two lightweight LM backbones are explored, a fine-tuned T5-Base (Text-to-Text Transfer Transformer) and an 8-bit quantized T5-Large fine-tuned with low-rank adaptation (LoRA). The model is trained on the DriveLM dataset, which pairs real, multi-view traffic scene images with question/answer annotations.
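The summary gives no implementation details beyond this description, but a minimal PyTorch-style sketch of such a fusion scheme could look as follows. The per-view ViT encoder, the mean pooling over views, and the single projection layer are illustrative assumptions, not the authors' exact design.

```python
# Minimal sketch of a multi-view VQA model in the spirit of EM-VLM4AD.
# Assumptions (not from the summary): a ViT backbone per camera view,
# mean-pooled view aggregation, and a linear projection into T5's embedding space.
import torch
import torch.nn as nn
from transformers import T5ForConditionalGeneration, T5Tokenizer, ViTModel

class MultiViewVQA(nn.Module):
    def __init__(self, lm_name="t5-base", vit_name="google/vit-base-patch16-224"):
        super().__init__()
        self.lm = T5ForConditionalGeneration.from_pretrained(lm_name)
        self.tokenizer = T5Tokenizer.from_pretrained(lm_name)
        self.image_encoder = ViTModel.from_pretrained(vit_name)
        # Project ViT features into the T5 embedding dimension.
        self.proj = nn.Linear(self.image_encoder.config.hidden_size,
                              self.lm.config.d_model)

    def embed_views(self, images):
        # images: (batch, num_views, 3, 224, 224) -> one embedding per scene.
        b, v, c, h, w = images.shape
        feats = self.image_encoder(pixel_values=images.flatten(0, 1)).pooler_output
        feats = feats.view(b, v, -1).mean(dim=1)   # aggregate views (assumed: mean pooling)
        return self.proj(feats).unsqueeze(1)       # (batch, 1, d_model)

    def forward(self, images, questions, answers=None):
        enc = self.tokenizer(questions, return_tensors="pt", padding=True)
        text_emb = self.lm.get_input_embeddings()(enc.input_ids)
        # Concatenate the fused image embedding in front of the text embeddings.
        inputs_embeds = torch.cat([self.embed_views(images), text_emb], dim=1)
        attn = torch.cat([torch.ones(inputs_embeds.size(0), 1,
                                     dtype=enc.attention_mask.dtype),
                          enc.attention_mask], dim=1)
        labels = None
        if answers is not None:
            # For a real run, pad token ids in labels would normally be set to -100
            # so they are ignored by the loss.
            labels = self.tokenizer(answers, return_tensors="pt", padding=True).input_ids
        return self.lm(inputs_embeds=inputs_embeds, attention_mask=attn, labels=labels)
```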
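For the larger backbone, the summary only states that T5-Large is loaded in 8-bit and fine-tuned with LoRA. With Hugging Face transformers, bitsandbytes, and peft, that setup might look roughly like the sketch below; the rank, alpha, dropout, and target modules are assumed values, not the authors' configuration.

```python
# Rough sketch of an 8-bit quantized T5-Large with LoRA adapters.
from transformers import T5ForConditionalGeneration, BitsAndBytesConfig
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training

lm = T5ForConditionalGeneration.from_pretrained(
    "t5-large",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
)
lm = prepare_model_for_kbit_training(lm)

lora_cfg = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=16, lora_alpha=32, lora_dropout=0.05,   # illustrative hyperparameters
    target_modules=["q", "v"],                # T5 attention query/value projections
)
lm = get_peft_model(lm, lora_cfg)
lm.print_trainable_parameters()               # only the low-rank adapters are trained
```

Only the low-rank adapter weights are updated during fine-tuning, which is what keeps the memory footprint of the T5-Large variant manageable.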
Evaluation uses BLEU-4, ROUGE-L, METEOR, and CIDEr; despite being much smaller, EM-VLM4AD achieves stronger ROUGE-L and CIDEr scores than other multimodal LMs for autonomous driving, while requiring less memory, fewer computations, and fewer parameters. It performs VQA across perception, planning, and traffic-agent behavior prediction, but shows one specific weakness: behavior questions prompted with "Predict the behavior for the ego vehicle." Adding temporal context by feeding multi-view video to the network should improve results on this question type. With fast inference and low computational requirements, the model is well suited to real-time applications; future work aims to evolve it into a video-language model that generates responses from multi-view video inputs, strengthening its handling of temporal questions.
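As a point of reference, the n-gram metrics named above can be computed with the Hugging Face evaluate library, as in the sketch below. CIDEr is not bundled with evaluate and is typically computed with a separate package such as pycocoevalcap, so it is omitted here; the example prediction and reference are made up for illustration.

```python
# Sketch of computing caption-style metrics on predicted vs. reference answers.
import evaluate

predictions = ["The ego vehicle should slow down for the pedestrian."]
references = [["The ego vehicle should slow down because a pedestrian is crossing."]]

bleu = evaluate.load("bleu").compute(predictions=predictions,
                                     references=references, max_order=4)
rouge = evaluate.load("rouge").compute(predictions=predictions,
                                       references=[r[0] for r in references])
meteor = evaluate.load("meteor").compute(predictions=predictions,
                                         references=[r[0] for r in references])

print(f"BLEU-4: {bleu['bleu']:.3f}  "
      f"ROUGE-L: {rouge['rougeL']:.3f}  "
      f"METEOR: {meteor['meteor']:.3f}")
```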