On the Content Bias in Fréchet Video Distance

18 Apr 2024 | Songwei Ge, Aniruddha Mahapatra, Gaurav Parmar, Jun-Yan Zhu, Jia-Bin Huang
This paper investigates the content bias in the Fréchet Video Distance (FVD), a widely used metric for evaluating video generation models. The study reveals that FVD disproportionately favors per-frame quality over temporal consistency: spatial distortions raise FVD sharply, while videos with severe temporal inconsistencies can still achieve low (ostensibly good) FVD scores. This discrepancy suggests that FVD measures individual frame quality more than the overall temporal quality of a video.

To quantify this bias, the study compares videos with similar frame quality but varying levels of temporal consistency. FVD proves far less sensitive to temporal corruption when its features come from a supervised video classifier trained on a content-biased dataset, as in the standard I3D-based FVD. When the features instead come from a self-supervised video model such as VideoMAE-v2, the bias is substantially reduced and the metric's sensitivity to temporal quality improves.
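For reference, both comparisons reduce to the same computation: FVD is the Fréchet (2-Wasserstein) distance between Gaussians fitted to backbone features of real and generated videos, and only the backbone changes between the supervised and self-supervised variants. Below is a minimal sketch of that distance, assuming feature matrices have already been extracted with whichever video model is under study; it is an illustration of the standard formula, not the paper's exact evaluation code.

```python
import numpy as np
from scipy import linalg


def frechet_distance(feats_real: np.ndarray, feats_fake: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two feature sets.

    feats_real, feats_fake: (num_videos, feature_dim) arrays of features
    from a video backbone (e.g., the I3D classifier of standard FVD, or
    VideoMAE-v2 as the paper advocates).
    """
    mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)

    # Matrix square root of the covariance product; numerical error can
    # introduce a tiny imaginary component, which we discard.
    covmean = linalg.sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):
        covmean = covmean.real

    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))
```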
The paper also probes the perceptual null space of FVD: ways to drive the score down without improving actual video quality. By resampling generated videos and optimizing sampling weights, the authors reduce FVD while leaving every individual video unchanged, showing that the score can be gamed through sample selection alone.

The findings are further validated in real-world settings such as long video generation, where standard FVD fails to capture motion artifacts that are readily perceptible to humans; with VideoMAE-v2 features, FVD scores align far more closely with human perception. Overall, the paper highlights the content bias in FVD and suggests that self-supervised features can mitigate it, leading to more faithful evaluations of video generation models. The findings underscore the need for evaluation metrics that account for both the spatial and the temporal quality of generated videos.
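The resampling probe can be illustrated with a deliberately simplified variant: instead of optimizing continuous sampling weights as the authors do, a plain random search over subsets of the generated videos already exposes the null space. This sketch reuses the frechet_distance helper above; the subset size and trial count are arbitrary illustrative choices, not values from the paper.

```python
import numpy as np


def resample_for_lower_fvd(feats_real, feats_fake, subset_size=512,
                           num_trials=200, seed=0):
    """Random-search probe of FVD's perceptual null space: find a subset
    of the generated videos that scores better, with no video changed."""
    rng = np.random.default_rng(seed)
    best_idx, best_score = None, np.inf
    for _ in range(num_trials):
        idx = rng.choice(len(feats_fake), size=subset_size, replace=False)
        score = frechet_distance(feats_real, feats_fake[idx])
        if score < best_score:
            best_idx, best_score = idx, score
    return best_idx, best_score
```

If best_score falls well below the FVD of the full generated set, the metric has been improved purely by sample selection, with no change to any video a viewer would actually see.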