22 Jul 2024 | Shenyuan Gao, Jiazhi Yang, Li Chen, Kashyap Chitta, Yihang Qiu, Andreas Geiger, Jun Zhang, Hongyang Li
Vista is a generalizable driving world model with high fidelity and versatile controllability. It addresses three limitations of existing driving world models: poor generalization to unseen environments, low prediction fidelity, and limited action controllability. To improve generalization, Vista is trained on a large corpus of worldwide driving videos. To improve prediction fidelity and preserve structural details, it incorporates dynamic priors and two novel losses. For action controllability, Vista supports a versatile set of actions, from high-level intentions to low-level maneuvers, learned through an efficient training strategy. Vista generalizes seamlessly to different scenarios: it outperforms the most advanced general-purpose video generator in over 70% of comparisons and surpasses the best-performing driving world model by 55% in FID and 27% in FVD. Vista can also serve as a generalizable reward function that evaluates real-world driving actions without access to ground-truth actions. Its capabilities include high-resolution prediction, multi-modal action control, and long-horizon forecasting. It is trained with a two-phase pipeline: the first phase focuses on high-fidelity future prediction, and the second integrates versatile action controllability. Extensive experiments on multiple datasets validate its generalization and fidelity. The contributions are a generalizable driving world model with high spatiotemporal resolution, versatile action controllability, and a generalizable reward function.
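The FID and FVD figures quoted above are Fréchet distances between feature statistics of real and generated data, where lower is better. The sketch below illustrates the idea under two loudly stated assumptions: covariances are treated as diagonal (the standard metrics use full covariance matrices of learned network features), and the "55%" and "27%" gains are read as relative reductions against the baseline score. The function names `frechet_distance_diag` and `relative_reduction` are hypothetical helpers for illustration, not part of any Vista code.

```python
import math


def frechet_distance_diag(mu_a, var_a, mu_b, var_b):
    """Frechet distance between two Gaussians with DIAGONAL covariances.

    Simplifying assumption: real FID/FVD use full covariance matrices.
    With diagonal covariances the general formula
        ||mu_a - mu_b||^2 + Tr(S_a + S_b - 2 (S_a S_b)^(1/2))
    reduces to a per-dimension sum.
    """
    if not (len(mu_a) == len(var_a) == len(mu_b) == len(var_b)):
        raise ValueError("all inputs must have the same length")
    # Squared distance between the two mean vectors.
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_a, mu_b))
    # Per dimension: v_a + v_b - 2*sqrt(v_a*v_b) = (sqrt(v_a) - sqrt(v_b))^2.
    cov_term = sum(
        (math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(var_a, var_b)
    )
    return mean_term + cov_term


def relative_reduction(baseline, ours):
    """Fraction by which `ours` lowers a lower-is-better score."""
    return (baseline - ours) / baseline


# Toy feature statistics (made-up numbers, for illustration only).
real_mu, real_var = [0.0, 0.0], [1.0, 1.0]
gen_mu, gen_var = [1.0, 0.0], [4.0, 1.0]
score = frechet_distance_diag(real_mu, real_var, gen_mu, gen_var)
```

Under this reading, a model whose score falls from a hypothetical 10.0 to 4.5 would show a relative reduction of 0.55, which is how a "55% in FID" improvement is interpreted here.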