OpenVid-1M: A Large-scale High-quality Dataset for Text-to-video Generation

2 Aug 2024 | Kepan Nan, Rui Xie, Penghao Zhou, Tiehan Fan, Zhenheng Yang, Zhijie Chen, Xiang Li, Jian Yang, Ying Tai
OpenVid-1M is a large-scale, high-quality text-to-video generation dataset containing over 1 million video clips with high aesthetics, clarity, and expressive captions. It addresses two key challenges in text-to-video (T2V) generation: the lack of a precise, high-quality dataset and the underutilization of textual information. OpenVid-1M is curated from multiple sources, including ChronoMagic, CelebV-HQ, Open-Sora-Plan, and Panda, with a focus on aesthetically pleasing and temporally consistent videos. It also includes OpenVidHD-0.4M, a subset of high-definition videos for advanced HD video generation.

The dataset is accompanied by a novel Multi-modal Video Diffusion Transformer (MVDiT) model that improves video quality by jointly exploiting visual and textual information. Extensive experiments and ablation studies demonstrate the superiority of OpenVid-1M over previous datasets and the effectiveness of MVDiT. The dataset is publicly available under a CC-BY-4.0 license and is actively maintained and updated. OpenVid-1M thus offers a valuable resource for advancing T2V research: expressive captions and a wide range of high-quality video content covering diverse scenarios.
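To make the curation idea concrete, here is a minimal sketch of the kind of quality filtering described above: keeping only clips above an aesthetic threshold and then carving out an HD subset by resolution. The field names (`aesthetic`, `height`) and thresholds are illustrative assumptions, not the paper's actual pipeline or scoring model.

```python
# Hypothetical curation filter in the spirit of OpenVid-1M's pipeline.
# Field names and thresholds are assumptions for illustration only.

def curate(clips, min_aesthetic=0.5, hd_height=1080):
    """Split clip metadata into a curated set and an HD subset.

    clips: list of dicts with (assumed) keys "id", "aesthetic", "height".
    Returns (curated, hd_subset), where hd_subset is a subset of curated.
    """
    # Keep only clips whose aesthetic score clears the threshold.
    curated = [c for c in clips if c["aesthetic"] >= min_aesthetic]
    # Of those, keep full-HD-or-better clips for the HD subset.
    hd_subset = [c for c in curated if c["height"] >= hd_height]
    return curated, hd_subset

clips = [
    {"id": "a", "aesthetic": 0.9, "height": 1080},
    {"id": "b", "aesthetic": 0.3, "height": 720},   # filtered out: low score
    {"id": "c", "aesthetic": 0.7, "height": 720},   # curated, but not HD
]
curated, hd = curate(clips)
# curated contains clips "a" and "c"; hd contains only clip "a"
```

In the real dataset the analogous steps would additionally involve clarity and temporal-consistency scoring, but the split into a large curated set plus a smaller HD subset (OpenVidHD-0.4M) follows this same pattern.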