On the Content Bias in Fréchet Video Distance

18 Apr 2024 | Songwei Ge, Aniruddha Mahapatra, Gaurav Parmar, Jun-Yan Zhu, Jia-Bin Huang
The paper "On the Content Bias in Fréchet Video Distance" by Songwei Ge, Aniruddha Mahapatra, Gaurav Parmar, Jun-Yan Zhu, and Jia-Bin Huang explores the bias of the Fréchet Video Distance (FVD) metric towards per-frame quality over temporal realism in video generation models. The authors conduct experiments to quantify FVD's sensitivity to the temporal axis and analyze the generated videos to understand its impact. They find that FVD is strongly biased towards individual frame quality, as demonstrated by the metric favoring videos with spatial distortions over those with temporal inconsistencies. The bias is attributed to the features extracted from a supervised video classifier trained on content-biased datasets. The paper also shows that using features from self-supervised video models, such as VideoMAE-v2, reduces this bias. Additionally, the authors probe the perceptual null space in FVD, finding that it can be effectively reduced without improving temporal quality, highlighting the metric's insensitivity to temporal realism. Real-world examples are revisited to validate the hypothesis, and the paper concludes with discussions on the limitations and future directions for improving video generation evaluation metrics.The paper "On the Content Bias in Fréchet Video Distance" by Songwei Ge, Aniruddha Mahapatra, Gaurav Parmar, Jun-Yan Zhu, and Jia-Bin Huang explores the bias of the Fréchet Video Distance (FVD) metric towards per-frame quality over temporal realism in video generation models. The authors conduct experiments to quantify FVD's sensitivity to the temporal axis and analyze the generated videos to understand its impact. They find that FVD is strongly biased towards individual frame quality, as demonstrated by the metric favoring videos with spatial distortions over those with temporal inconsistencies. The bias is attributed to the features extracted from a supervised video classifier trained on content-biased datasets. The paper also shows that using features from self-supervised video models, such as VideoMAE-v2, reduces this bias. Additionally, the authors probe the perceptual null space in FVD, finding that it can be effectively reduced without improving temporal quality, highlighting the metric's insensitivity to temporal realism. Real-world examples are revisited to validate the hypothesis, and the paper concludes with discussions on the limitations and future directions for improving video generation evaluation metrics.