14 Jun 2024 | Katrin Renz, Long Chen, Ana-Maria Marcu, Jan Hünermann, Benoit Hanotte, Alice Karnsund, Jamie Shotton, Elahe Arani, Oleg Sinavski
CarLLaVA is a Vision Language Model (VLM) for camera-only closed-loop driving, developed for the CARLA Autonomous Driving Challenge 2.0. The model builds on the vision encoder of the LLaVA VLM and the LLaMA architecture, achieving state-of-the-art performance with camera input alone and without complex or expensive labels. CarLLaVA uses a semi-disentangled output representation of path predictions and waypoints, improving both lateral and longitudinal control. Training is optimized to make efficient use of large driving datasets, reducing compute spent on trivial data. CarLLaVA ranks 1st on the sensor track of the CARLA Challenge, outperforming the previous state of the art by 458% and the best concurrent submission by 32.6%. The report details the architecture, training methodology, and experimental results, demonstrating the model's effectiveness across diverse driving scenarios and its potential for real-world deployment.
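The semi-disentangled output mentioned above splits the prediction into a geometric path (used for lateral control) and time-conditioned waypoints (used for longitudinal control). Below is a minimal PyTorch sketch of that output structure, assuming a simple transformer stand-in for the LLaVA vision encoder and LLaMA backbone; all module names, dimensions, and head designs are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of semi-disentangled driving outputs (path + waypoints).
# The backbone below is a placeholder; CarLLaVA itself uses the LLaVA vision
# encoder and a LLaMA decoder. Dimensions and head shapes are assumptions.
import torch
import torch.nn as nn


class SemiDisentangledHeads(nn.Module):
    def __init__(self, hidden_dim: int = 512, n_path: int = 10, n_waypoints: int = 8):
        super().__init__()
        self.n_path = n_path
        self.n_waypoints = n_waypoints
        # Stand-in for the vision-language backbone (LLaVA encoder + LLaMA).
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Path head: (x, y) points along the intended route -> lateral control.
        self.path_head = nn.Linear(hidden_dim, n_path * 2)
        # Waypoint head: time-conditioned (x, y) positions -> longitudinal control.
        self.waypoint_head = nn.Linear(hidden_dim, n_waypoints * 2)

    def forward(self, image_tokens: torch.Tensor):
        # image_tokens: (batch, seq_len, hidden_dim) patch/token embeddings.
        features = self.backbone(image_tokens)
        pooled = features.mean(dim=1)
        path = self.path_head(pooled).view(-1, self.n_path, 2)
        waypoints = self.waypoint_head(pooled).view(-1, self.n_waypoints, 2)
        return path, waypoints


if __name__ == "__main__":
    model = SemiDisentangledHeads()
    dummy_tokens = torch.randn(1, 64, 512)
    path, waypoints = model(dummy_tokens)
    print(path.shape, waypoints.shape)  # (1, 10, 2) and (1, 8, 2)
```

Separating the two heads in this way lets the path supervise steering geometry independently of the speed profile encoded by the time-conditioned waypoints, which is the intuition behind the semi-disentangled representation described in the report.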