CarLLaVA: Vision language models for camera-only closed-loop driving

14 Jun 2024 | Katrin Renz, Long Chen, Ana-Maria Marcu, Jan Hünermann, Benoit Hanotte, Alice Karnsund, Jamie Shotton, Elahe Arani, Oleg Sinavski
CarLLaVA is a Vision Language Model (VLM) for autonomous driving, developed for the CARLA Autonomous Driving Challenge 2.0. It builds on the vision encoder of the LLaVA VLM with the LLaMA architecture as its backbone, and achieves state-of-the-art closed-loop driving performance from camera input alone, without complex or expensive labels. CarLLaVA uses a semi-disentangled output representation of both path predictions and waypoints, which yields better lateral and longitudinal control, together with an efficient training recipe that trains on large driving datasets without wasting compute on easy, trivial data. CarLLaVA ranks 1st in the sensor track of the CARLA Autonomous Driving Challenge 2.0, outperforming the previous state of the art by 458% and the best concurrent submission by 32.6%.
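The summary itself contains no code, but a minimal PyTorch-style sketch can make the semi-disentangled output representation concrete: the hidden states that the language model produces for a set of learned path queries and waypoint queries are decoded by two separate MLPs, so lateral control can be read from space-conditioned path points and longitudinal control from time-conditioned waypoints. All module names, query counts, and dimensions below are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch (not the authors' code): decode LLM hidden states at
# learned query positions into path points (space-conditioned, used for steering)
# and waypoints (time-conditioned, used for speed). Dimensions are assumptions.
import torch
import torch.nn as nn

class SemiDisentangledHead(nn.Module):
    def __init__(self, hidden_dim: int = 768):
        super().__init__()
        self.path_mlp = nn.Sequential(
            nn.Linear(hidden_dim, 256), nn.ReLU(), nn.Linear(256, 2))
        self.waypoint_mlp = nn.Sequential(
            nn.Linear(hidden_dim, 256), nn.ReLU(), nn.Linear(256, 2))

    def forward(self, path_hidden: torch.Tensor, wp_hidden: torch.Tensor):
        # path_hidden: (B, n_path, hidden_dim) LLM outputs at the path-query tokens
        # wp_hidden:   (B, n_wp, hidden_dim)   LLM outputs at the waypoint-query tokens
        path_points = self.path_mlp(path_hidden)   # (B, n_path, 2), spaced along the route
        waypoints = self.waypoint_mlp(wp_hidden)   # (B, n_wp, 2), at fixed future times
        return path_points, waypoints

# Usage: a downstream controller could steer along `path_points` (lateral control)
# and derive a target speed from the spacing of `waypoints` (longitudinal control).
head = SemiDisentangledHead()
path_points, waypoints = head(torch.rand(1, 10, 768), torch.rand(1, 4, 768))
```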
As a camera-only method, CarLLaVA does not rely on expensive labels such as bird's-eye view (BEV), depth, or semantic segmentation. It leverages a vision encoder pre-trained on internet-scale vision-language data, which transfers well to driving, and feeds it high-resolution input split into patches so the VLM can access small details in the driving images. Its efficient training recipe makes more use of interesting training samples, significantly reducing training time, and its semi-disentangled representation combines time-conditioned waypoints with space-conditioned path waypoints for better control.

Compared against related work, including foundation models for driving and end-to-end closed-loop driving in CARLA, CarLLaVA outperforms previous methods in both performance and efficiency, achieving state-of-the-art results on the CARLA Leaderboard 2.0. It can also generate language outputs that comment on the current driving behavior, although these are not intended as actual explanations. Because it requires neither expensive labels nor additional sensors, CarLLaVA is a scalable and cost-effective solution: it operates across varied environments and weather conditions and handles complex scenarios such as encountering pedestrians, navigating parking exits, executing unprotected turns, merging into ongoing traffic, passing construction sites, or avoiding vehicles with opening doors. Scaling up the LLaMA backbone and adding language predictions point to future research on VLMs for driving, and the results show a substantial improvement over previous methods, underscoring the potential of vision-language models for real-world autonomous driving.
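To illustrate the high-resolution input handling mentioned above, the sketch below splits a wide camera image into fixed-size tiles that a vision encoder could process separately before their tokens are concatenated for the language model. The tile size, image resolution, and encoder interface are assumptions made for the example, not the paper's exact configuration.

```python
# Hypothetical sketch of feeding a high-resolution camera image to a vision
# encoder by splitting it into tiles, so fine details (traffic lights, distant
# vehicles) survive downsampling. Tile size and grid are illustrative only.
import torch

def split_into_tiles(image: torch.Tensor, tile: int = 336) -> torch.Tensor:
    """image: (C, H, W) with H and W divisible by `tile`.
    Returns (num_tiles, C, tile, tile) in row-major order."""
    c, _, _ = image.shape
    tiles = image.unfold(1, tile, tile).unfold(2, tile, tile)  # (C, H//t, W//t, t, t)
    return tiles.permute(1, 2, 0, 3, 4).reshape(-1, c, tile, tile)

# Usage: encode each tile with the vision encoder and concatenate the resulting
# token sequences before the language model. `vision_encoder` is an assumed
# interface mapping (N, C, tile, tile) -> (N, num_tokens, dim).
image = torch.rand(3, 672, 1344)   # e.g. a wide front-camera crop
tiles = split_into_tiles(image)    # (8, 3, 336, 336)
# tokens = vision_encoder(tiles).flatten(0, 1)  # (8 * num_tokens, dim)
```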