The paper introduces VIDEOPHY, a benchmark designed to evaluate the physical commonsense of generated videos. VIDEOPHY consists of 688 high-quality, human-verified captions that describe interactions among various material types (solid-solid, solid-fluid, fluid-fluid) and are used to generate videos with text-to-video (T2V) models. The evaluation focuses on two metrics: semantic adherence (SA) and physical commonsense (PC). SA assesses whether the text caption is semantically grounded in the video, while PC evaluates whether the depicted actions and object states follow real-world physical laws. Human evaluation reveals that existing T2V models largely fail to generate videos that adhere to both physical laws and text prompts; the best model, CogVideoX-5B, achieves only 39.6% joint adherence to SA and PC. To address the challenge of scalable and reliable evaluation, the authors propose VIDEOCON-PHYSICS, an auto-evaluator trained on the human annotations. This model outperforms existing baselines and generalizes to unseen generative models. The study highlights the gap between current T2V models and the goal of accurately simulating the physical world.
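As a minimal sketch of how the joint SA+PC number reported above could be computed, assume each generated video receives two binary human ratings (1 = pass, 0 = fail) for semantic adherence and physical commonsense; a model's score is the percentage of videos that pass both. The function and field layout below are illustrative, not the paper's actual annotation pipeline.

```python
# Hypothetical joint semantic-adherence (SA) + physical-commonsense (PC)
# scoring, assuming binary human labels per generated video.

def joint_adherence(ratings):
    """Return the percentage of videos rated 1 on BOTH SA and PC.

    ratings: list of (sa, pc) tuples with values in {0, 1}.
    """
    if not ratings:
        return 0.0
    passing = sum(1 for sa, pc in ratings if sa == 1 and pc == 1)
    return 100.0 * passing / len(ratings)

# Toy example: 5 videos, 2 of which pass both criteria.
example = [(1, 1), (1, 0), (0, 1), (1, 1), (0, 0)]
print(joint_adherence(example))  # 40.0
```

A stricter "both criteria" score like this is naturally lower than either metric alone, which is why even the best model's joint adherence sits below 40%.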