Vista: A Generalizable Driving World Model with High Fidelity and Versatile Controllability

22 Jul 2024 | Shenyuan Gao, Jiazhi Yang, Li Chen, Kashyap Chitta, Yihang Qiu, Andreas Geiger, Jun Zhang, Hongyang Li
**Abstract:** World models are crucial for autonomous driving as they can predict the outcomes of different actions. However, existing driving world models have limitations in generalizing to unseen environments, predicting critical details accurately, and controlling actions flexibly. This paper introduces Vista, a generalizable driving world model that addresses these limitations. Vista is designed to predict real-world dynamics at high resolution and support versatile action controllability. It introduces two novel losses to enhance dynamics and preserve structural details, and an effective latent replacement approach to inject historical frames as priors for coherent long-horizon rollouts. For action controllability, Vista incorporates a versatile set of controls, including high-level intentions and low-level maneuvers, through an efficient learning strategy. Extensive experiments on multiple datasets show that Vista outperforms state-of-the-art video generators and driving world models in terms of FID and FVD scores. Additionally, Vista can be used as a generalizable reward function to evaluate real-world driving actions without accessing ground truth actions.

**Key Contributions:**

1. **High-Fidelity Prediction:** Vista predicts realistic futures at high spatiotemporal resolution using two novel losses that capture dynamics and preserve structures.
2. **Versatile Action Controllability:** Vista integrates a versatile set of action formats, including high-level intentions and low-level maneuvers, through a unified conditioning interface.
3. **Generalizable Reward Function:** Vista can be used as a reward function to evaluate actions without referring to ground truth actions.

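
The reward-function idea can be illustrated with a small sketch. The summary only says that the reward is derived from the model's own uncertainty, so everything concrete here is an assumption made for illustration: the function name `uncertainty_reward`, the use of sample variance across several predicted futures as the uncertainty measure, and the `exp(-mean variance)` mapping to a score in (0, 1]. It is not taken from Vista's code.

```python
import math
from statistics import variance


def uncertainty_reward(samples):
    """Score an action by how much several predicted futures agree.

    `samples` is a list of M >= 2 predictions for the SAME action, each a
    flat list of floats of equal length (for example, flattened latents).
    Hypothetical helper: the variance-based formula is an assumption.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to estimate variance")
    length = len(samples[0])
    if length == 0 or any(len(s) != length for s in samples):
        raise ValueError("samples must be non-empty and of equal length")

    # Sample variance of each element across the M predictions.
    per_element_var = [variance(column) for column in zip(*samples)]
    mean_var = sum(per_element_var) / length

    # Low disagreement (low uncertainty) maps to a reward close to 1.
    return math.exp(-mean_var)


if __name__ == "__main__":
    agreeing = [[0.0, 0.0], [0.1, 0.1]]
    disagreeing = [[0.0, 0.0], [2.0, 2.0]]
    print(uncertainty_reward(agreeing) > uncertainty_reward(disagreeing))
```

Under these assumptions, an action whose sampled futures agree with each other scores higher than one whose futures diverge, and no ground truth action is needed to compute the score.
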
**Experiments:**

- **Generalization and Fidelity:** Vista outperforms state-of-the-art models in FID and FVD scores.
- **Action Controllability:** Vista effectively emulates various actions, demonstrating versatile controllability.
- **Reward Function:** Vista's reward function estimates uncertainty and can be used to evaluate actions without ground truth.

**Conclusion:** Vista is a generalizable driving world model that enhances fidelity and controllability. It predicts realistic futures at high resolution and supports versatile actions. The model's ability to serve as a reward function for action evaluation further highlights its potential in autonomous driving applications.
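
For readers unfamiliar with the FID and FVD metrics cited in the experiments: both compare generated and real data through the Fréchet distance between two Gaussians fitted to feature embeddings (image features for FID, video features for FVD), and lower is better. The sketch below is a simplified illustration only. It assumes diagonal covariances so that it runs with the standard library, whereas the real metrics use full covariance matrices and a matrix square root; the function name `frechet_distance_diag` is hypothetical.

```python
import math


def frechet_distance_diag(mu1, var1, mu2, var2):
    """Squared Frechet distance between two Gaussians with DIAGONAL covariance.

    mu1, mu2: per-dimension means; var1, var2: per-dimension variances.
    Simplifying assumption: the real FID/FVD use full covariance matrices.
    """
    if not (len(mu1) == len(var1) == len(mu2) == len(var2)):
        raise ValueError("all inputs must have the same length")
    if any(v < 0 for v in list(var1) + list(var2)):
        raise ValueError("variances must be non-negative")

    # Squared distance between the means.
    mean_term = sum((a - b) ** 2 for a, b in zip(mu1, mu2))
    # Trace term; for diagonal covariances the matrix square root
    # reduces to an element-wise square root.
    cov_term = sum(v1 + v2 - 2.0 * math.sqrt(v1 * v2)
                   for v1, v2 in zip(var1, var2))
    return mean_term + cov_term


if __name__ == "__main__":
    real_mu, real_var = [0.0, 0.0], [1.0, 1.0]
    fake_mu, fake_var = [3.0, 4.0], [1.0, 1.0]
    print(frechet_distance_diag(real_mu, real_var, fake_mu, fake_var))
```

Identical distributions give a distance of zero, and the value grows as the means or variances of the generated features drift away from those of the real data.
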