14 Apr 2017 | Namhoon Lee, Wongun Choi, Paul Vernaza, Christopher B. Choy, Philip H. S. Torr, Manmohan Chandraker
This paper introduces DESIRE, a Deep Stochastic IOC RNN Encoder-decoder framework for predicting the future of multiple interacting agents in dynamic scenes. DESIRE predicts future locations of objects by accounting for the multi-modal nature of the future, foreseeing potential outcomes, and reasoning jointly over past motion history, scene context, and interactions among agents. The model first generates a diverse set of hypothetical future trajectories with a conditional variational autoencoder (CVAE); an RNN scoring-and-regression module then ranks and refines these samples, while an RNN scene-context fusion module jointly captures past motion histories, semantic scene context, and agent interactions. A feedback mechanism iterates over the ranking and refinement steps to further improve prediction accuracy.

DESIRE is trained end-to-end with deep learning, which lets it incorporate multiple cues from past motion, scene context, and agent interactions. It is evaluated on two datasets, KITTI and the Stanford Drone Dataset, where it significantly improves prediction accuracy over baseline methods. The framework is general and applies to a range of future-prediction tasks, including traffic scene understanding for autonomous driving and behavior prediction in aerial surveillance. Its diversity, accuracy, and scalability make it well suited to time-profiled prediction of the distant future, and its ability to handle uncertainty and produce diverse hypotheses makes it effective in complex environments with many agents.
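The sample-then-rank pipeline described above can be sketched in a few lines. This is a toy NumPy illustration of the control flow only, not the paper's actual networks: the encoder, CVAE decoder, and scoring module are stand-in functions with made-up shapes and weights, and the "refinement" is a simple nudge toward the best-scored sample, mimicking the iterative feedback loop.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical, for illustration only).
T_past, T_future, latent_dim, n_samples = 8, 12, 16, 20

def encode_past(past_traj):
    # Stand-in for the RNN encoder: summarize past motion as a flat feature vector.
    return past_traj.reshape(-1)

def cvae_sample(past_feat, n):
    # Sample diverse latent codes z ~ N(0, I), condition on the past feature,
    # and decode each code into a candidate future trajectory (linear toy decoder).
    z = rng.standard_normal((n, latent_dim))
    W = rng.standard_normal((latent_dim + past_feat.size, T_future * 2)) * 0.1
    inp = np.concatenate([z, np.tile(past_feat, (n, 1))], axis=1)
    return (inp @ W).reshape(n, T_future, 2)

def score_and_refine(samples):
    # Stand-in for the RNN scoring/regression module: score each candidate
    # (here, smoother trajectories score higher) and nudge all samples toward
    # the best one, playing the role of the regression-based refinement.
    scores = -np.linalg.norm(np.diff(samples, axis=1), axis=(1, 2))
    best = samples[np.argmax(scores)]
    refined = samples + 0.1 * (best - samples)
    return refined, scores

past = rng.standard_normal((T_past, 2))      # observed (x, y) positions
feat = encode_past(past)
candidates = cvae_sample(feat, n_samples)

# Feedback mechanism: iterate scoring and refinement a few times.
for _ in range(3):
    candidates, scores = score_and_refine(candidates)

prediction = candidates[np.argmax(scores)]   # final trajectory, shape (T_future, 2)
```

In the actual model all three stand-ins are learned jointly end-to-end, and the scoring module also consumes the fused scene-context features rather than a hand-crafted smoothness heuristic.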