Multi-Frame, Lightweight & Efficient Vision-Language Models for Question Answering in Autonomous Driving

9 May 2024 | Akshay Gopalkrishnan, Ross Greer, Mohan Trivedi
This paper introduces EM-VLM4AD, an efficient, lightweight, multi-frame vision-language model for Visual Question Answering (VQA) in autonomous driving. Unlike existing approaches that rely on large language models (LLMs) with over a billion parameters, EM-VLM4AD requires at least 10 times less memory and fewer floating-point operations, while achieving higher CIDEr and ROUGE-L scores on the DriveLM dataset. It extracts the information in multi-view traffic scenes that is relevant to a prompt and answers questions for a range of autonomous driving subtasks.

The architecture pairs a custom image embedding network with a pre-trained T5 language model: the embedding network aggregates the multi-view images into a single embedding, which is concatenated with the text embedding and passed to the LM. Two lightweight LM backbones are explored, a fine-tuned T5-Base (Text-to-Text Transfer Transformer) and an 8-bit quantized T5-Large fine-tuned with low-rank adaptation (LoRA). The model is trained on the DriveLM dataset, which pairs real, multi-view traffic scene images with question/answer annotations.
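The summary gives no implementation details beyond this description, but a minimal PyTorch-style sketch of such a fusion scheme could look as follows. The per-view ViT encoder, the mean pooling over views, and the single projection layer are illustrative assumptions, not the authors' exact design.

```python
# Minimal sketch of a multi-view VQA model in the spirit of EM-VLM4AD.
# Assumptions (not from the summary): a ViT backbone per camera view,
# mean-pooled view aggregation, and a linear projection into T5's embedding space.
import torch
import torch.nn as nn
from transformers import T5ForConditionalGeneration, T5Tokenizer, ViTModel

class MultiViewVQA(nn.Module):
    def __init__(self, lm_name="t5-base", vit_name="google/vit-base-patch16-224"):
        super().__init__()
        self.lm = T5ForConditionalGeneration.from_pretrained(lm_name)
        self.tokenizer = T5Tokenizer.from_pretrained(lm_name)
        self.image_encoder = ViTModel.from_pretrained(vit_name)
        # Project ViT features into the T5 embedding dimension.
        self.proj = nn.Linear(self.image_encoder.config.hidden_size,
                              self.lm.config.d_model)

    def embed_views(self, images):
        # images: (batch, num_views, 3, 224, 224) -> one embedding per scene.
        b, v, c, h, w = images.shape
        feats = self.image_encoder(pixel_values=images.flatten(0, 1)).pooler_output
        feats = feats.view(b, v, -1).mean(dim=1)   # aggregate views (assumed: mean pooling)
        return self.proj(feats).unsqueeze(1)       # (batch, 1, d_model)

    def forward(self, images, questions, answers=None):
        enc = self.tokenizer(questions, return_tensors="pt", padding=True)
        text_emb = self.lm.get_input_embeddings()(enc.input_ids)
        # Concatenate the fused image embedding in front of the text embeddings.
        inputs_embeds = torch.cat([self.embed_views(images), text_emb], dim=1)
        attn = torch.cat([torch.ones(inputs_embeds.size(0), 1,
                                     dtype=enc.attention_mask.dtype),
                          enc.attention_mask], dim=1)
        labels = None
        if answers is not None:
            # For a real run, pad token ids in labels would normally be set to -100
            # so they are ignored by the loss.
            labels = self.tokenizer(answers, return_tensors="pt", padding=True).input_ids
        return self.lm(inputs_embeds=inputs_embeds, attention_mask=attn, labels=labels)
```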
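For the larger backbone, the summary only states that T5-Large is loaded in 8-bit and fine-tuned with LoRA. With Hugging Face transformers, bitsandbytes, and peft, that setup might look roughly like the sketch below; the rank, alpha, dropout, and target modules are assumed values, not the authors' configuration.

```python
# Rough sketch of an 8-bit quantized T5-Large with LoRA adapters.
from transformers import T5ForConditionalGeneration, BitsAndBytesConfig
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training

lm = T5ForConditionalGeneration.from_pretrained(
    "t5-large",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
)
lm = prepare_model_for_kbit_training(lm)

lora_cfg = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=16, lora_alpha=32, lora_dropout=0.05,   # illustrative hyperparameters
    target_modules=["q", "v"],                # T5 attention query/value projections
)
lm = get_peft_model(lm, lora_cfg)
lm.print_trainable_parameters()               # only the low-rank adapters are trained
```

Only the low-rank adapter weights are updated during fine-tuning, which is what keeps the memory footprint of the T5-Large variant manageable.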
Evaluation uses BLEU-4, ROUGE-L, METEOR, and CIDEr; despite being much smaller, EM-VLM4AD achieves stronger ROUGE-L and CIDEr scores than other multimodal LMs for autonomous driving, while requiring less memory, fewer computations, and fewer parameters. It performs VQA across perception, planning, and traffic-agent behavior prediction, but shows one specific weakness: behavior questions prompted with "Predict the behavior for the ego vehicle." Adding temporal context by feeding multi-view video to the network should improve results on this question type. With fast inference and low computational requirements, the model is well suited to real-time applications; future work aims to evolve it into a video-language model that generates responses from multi-view video inputs, strengthening its handling of temporal questions.
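As a point of reference, the n-gram metrics named above can be computed with the Hugging Face evaluate library, as in the sketch below. CIDEr is not bundled with evaluate and is typically computed with a separate package such as pycocoevalcap, so it is omitted here; the example prediction and reference are made up for illustration.

```python
# Sketch of computing caption-style metrics on predicted vs. reference answers.
import evaluate

predictions = ["The ego vehicle should slow down for the pedestrian."]
references = [["The ego vehicle should slow down because a pedestrian is crossing."]]

bleu = evaluate.load("bleu").compute(predictions=predictions,
                                     references=references, max_order=4)
rouge = evaluate.load("rouge").compute(predictions=predictions,
                                       references=[r[0] for r in references])
meteor = evaluate.load("meteor").compute(predictions=predictions,
                                         references=[r[0] for r in references])

print(f"BLEU-4: {bleu['bleu']:.3f}  "
      f"ROUGE-L: {rouge['rougeL']:.3f}  "
      f"METEOR: {meteor['meteor']:.3f}")
```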