25 Mar 2024 | Tianqi Wang, Enze Xie, Ruihang Chu, Zhenguo Li, and Ping Luo
DriveCoT is a new end-to-end driving dataset built with the CARLA simulator that pairs sensor data and control decisions with chain-of-thought (CoT) labels describing the reasoning behind each decision. It covers challenging scenarios such as high-speed driving and lane changing, and provides ground-truth labels both for individual reasoning aspects and for the final driving decision, so it can serve as an open-loop end-to-end driving benchmark that evaluates accuracy on each CoT aspect as well as on the final decision. The dataset contains 1058 scenarios and roughly 36K labeled samples, collected at 2 Hz with an average scenario length of 17 seconds, and is partitioned into training, validation, and testing sets at a 70%/15%/15% ratio. Scenarios span diverse weather and time-of-day conditions, and annotations are provided both as free-form text and as simplified classification labels for flexible use. A baseline model, DriveCoT-Agent, is trained on the dataset to produce CoT predictions and final decisions. It processes video from six surrounding cameras with a shared Video Swin Transformer, predicts CoT aspects such as potential collisions, traffic-light and stop-sign hazards, and the relation to the vehicle ahead, and uses a path GRU to predict planned waypoints. The model is evaluated open-loop on the DriveCoT validation split and closed-loop on the CARLA leaderboard 2.0 and the Town05 Long benchmark, showing strong performance in both settings and demonstrating the effectiveness of the proposed dataset.
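The paper's exact waypoint head is not reproduced here, but the "path GRU" idea (a recurrent decoder that rolls out planned waypoints from a fused video feature) can be sketched as follows. This is a minimal numpy sketch under assumptions: the hidden-state initialization from the video feature, the per-step displacement output, and all dimensions are illustrative, not the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def make_params(in_dim, hid_dim, rng):
    """Random GRU parameters for the update (z), reset (r), and candidate (n) gates."""
    gates = ("z", "r", "n")
    W = {g: rng.normal(0, 0.1, (in_dim, hid_dim)) for g in gates}
    U = {g: rng.normal(0, 0.1, (hid_dim, hid_dim)) for g in gates}
    b = {g: np.zeros(hid_dim) for g in gates}
    return W, U, b

def gru_step(x, h, W, U, b):
    """One standard GRU cell update."""
    z = sigmoid(x @ W["z"] + h @ U["z"] + b["z"])        # update gate
    r = sigmoid(x @ W["r"] + h @ U["r"] + b["r"])        # reset gate
    n = np.tanh(x @ W["n"] + (r * h) @ U["n"] + b["n"])  # candidate state
    return (1 - z) * h + z * n

def predict_waypoints(video_feat, n_points, W, U, b, W_out):
    """Autoregressively roll out future (x, y) waypoints.

    The hidden state starts from the pooled video feature (an assumption
    for this sketch); each step feeds the previous waypoint back in and
    emits a displacement that is accumulated into the next waypoint."""
    h = video_feat
    wp = np.zeros(2)  # ego position in a bird's-eye-view frame
    out = []
    for _ in range(n_points):
        h = gru_step(wp, h, W, U, b)
        wp = wp + h @ W_out  # accumulate the predicted displacement
        out.append(wp)
    return np.stack(out)

rng = np.random.default_rng(0)
HID = 64
W, U, b = make_params(2, HID, rng)
W_out = rng.normal(0, 0.1, (HID, 2))
feat = rng.normal(0, 1.0, HID)  # stand-in for the Video Swin Transformer feature
wps = predict_waypoints(feat, 4, W, U, b, W_out)
print(wps.shape)  # (4, 2): four planned (x, y) waypoints
```

In practice the trained model would produce `feat` from the six camera streams, and the waypoints would be consumed by a low-level controller.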
These results show the benefit of integrating chain-of-thought reasoning with end-to-end driving. The model also transfers to real-world data such as the nuScenes dataset with promising results, generating reasonable decisions together with the reasons behind them. Qualitatively, it recognizes red traffic lights and potential collisions with pedestrians and decides to brake, and it identifies green traffic lights with free space ahead and decides to drive normally toward the road speed limit. It correctly brakes for lane-merging vehicles and for pedestrians crossing in the middle of the road, and it produces appropriate speed decisions with respect to the vehicle ahead over time, based on the distance and time-to-collision information embedded in the video input.
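The qualitative behavior described above amounts to mapping predicted CoT aspects to a final speed decision. A toy rule set makes the mapping concrete; note that in DriveCoT-Agent these aspects are predicted from video and the thresholds below are hypothetical, chosen only for illustration.

```python
def speed_decision(light_state, collision_risk, lead_gap_m=None, lead_ttc_s=None):
    """Map chain-of-thought aspect outputs to a high-level speed decision.

    light_state: "red", "green", or "none"; collision_risk: True if a
    collision (e.g. with a crossing pedestrian or merging vehicle) is
    predicted; lead_gap_m / lead_ttc_s: distance and time-to-collision
    to the vehicle ahead, or None if the road ahead is free.
    All thresholds are hypothetical, for illustration only."""
    if light_state == "red" or collision_risk:
        return "brake"  # red light or imminent collision hazard
    if lead_gap_m is not None and lead_ttc_s is not None:
        if lead_gap_m < 10.0 or lead_ttc_s < 2.0:
            return "follow ahead vehicle"  # too close: match the lead vehicle
    return "drive normally toward speed limit"

# Example: green light, but a pedestrian steps into the road
print(speed_decision("green", collision_risk=True))  # brake
```

The end-to-end model learns this mapping implicitly; the explicit CoT labels are what make each intermediate aspect individually supervisable and evaluable.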