The paper introduces Panda-70M, a large-scale video dataset with high-quality captions generated by multiple cross-modality vision-language models. The dataset pairs 70 million video clips with captions that are more precise and semantically coherent than those in existing large-scale video-text datasets. The captions draw on multiple modalities, including video descriptions, subtitles, and individual video frames.

The authors propose an automatic annotation pipeline that leverages these multimodal inputs to improve caption quality. They split 3.8 million high-resolution videos into semantically consistent clips and use multiple cross-modality teacher models to generate candidate captions for each clip. A fine-grained video-text retrieval model is then fine-tuned on a subset of clips for which human annotators selected the best caption, and this model is applied to the entire dataset to pick the most accurate caption for each clip.

The paper demonstrates the effectiveness of Panda-70M on three downstream tasks: video captioning, video and text retrieval, and text-driven video generation. Models trained on Panda-70M achieve significantly better performance on these tasks than models trained on existing datasets. The authors additionally propose a student captioning model that distills knowledge from the multiple teacher models, further improving performance. The paper also discusses the limitations of the dataset and suggests directions for future improvement.
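To make the caption-selection step concrete, the sketch below shows one plausible way the fine-tuned retrieval model could be used to rank teacher-generated candidates: embed the clip and each candidate caption, score them by cosine similarity, and keep the highest-scoring caption. The function and variable names (`embed_video`, `embed_text`, `select_best_caption`) are hypothetical and not from the paper; the random-vector encoders are placeholders standing in for the fine-tuned retrieval model's video and text towers so the example runs end to end.

```python
import numpy as np

# Placeholder encoders: in the actual pipeline these would be the video and
# text towers of the fine-tuned video-text retrieval model. Here they return
# random unit vectors so this sketch is self-contained and runnable.
rng = np.random.default_rng(0)

def embed_video(clip_path: str) -> np.ndarray:
    v = rng.normal(size=512)
    return v / np.linalg.norm(v)

def embed_text(caption: str) -> np.ndarray:
    t = rng.normal(size=512)
    return t / np.linalg.norm(t)

def select_best_caption(clip_path: str, candidates: list[str]) -> str:
    """Score each teacher-generated caption against the clip and keep the best."""
    v = embed_video(clip_path)
    scores = [float(v @ embed_text(c)) for c in candidates]
    return candidates[int(np.argmax(scores))]

if __name__ == "__main__":
    # Hypothetical candidate captions produced by different teacher models.
    candidates = [
        "A chef slices vegetables on a wooden cutting board.",
        "A person is cooking in a kitchen.",
        "Someone prepares food indoors.",
    ]
    print(select_best_caption("clip_000123.mp4", candidates))
```

In the paper's pipeline, the ranking model itself is fine-tuned on the human-annotated subset before being applied dataset-wide; the sketch only illustrates the selection logic, not that training step.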