MiraData: A Large-Scale Video Dataset with Long Durations and Structured Captions

8 Jul 2024 | Xuan Ju, Yiming Gao, Zhaoyang Zhang, Ziyang Yuan, Xintao Wang, Ailing Zeng, Yu Xiong, Qiang Xu, Ying Shan
MiraData is a large-scale video dataset with long durations and structured captions, designed to address the limitations of existing video generation datasets. It features videos with an average duration of 72.1 seconds and structured captions with an average length of 318 words.

The dataset is curated from diverse sources, including YouTube, Videvo, Pixabay, and Pexels, and undergoes a rigorous pipeline of selection, splitting, stitching, and annotation. MiraData includes five versions of the video clips, filtered based on color, aesthetic quality, motion strength, and the presence of NSFW content (a sketch of such a filtering pass appears below).

Each video is annotated with detailed captions, including dense captions and structured captions that separately describe the video's main subject, background, camera motion, and style (see the annotation sketch below).

To evaluate the effectiveness of MiraData, the authors introduce MiraBench, a benchmark of 150 evaluation prompts and 17 metrics covering temporal consistency, motion strength, 3D consistency, visual quality, text-video alignment, and distribution similarity. MiraBench extends existing benchmarks with 3D-consistency and tracking-based motion-strength metrics.

The authors also present MiraDiT, a video generation model trained on MiraData, which demonstrates superior motion strength and 3D consistency compared to models trained on other datasets. These results show that MiraData significantly improves the quality and accuracy of long video generation, making it a valuable resource for researchers in the field of video generation.
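To make the filtering stage concrete, here is a minimal sketch of a threshold-based clip-filtering pass. The score fields, thresholds, and function names are hypothetical assumptions for illustration; the paper's actual filters rely on specific predictors and per-version cutoffs that are not reproduced here.

```python
# Hypothetical sketch of a clip-filtering pass like the one described above.
# Score fields and thresholds are illustrative placeholders, not MiraData's
# actual filtering models or cutoffs.
from typing import Iterable, List, TypedDict

class ClipScores(TypedDict):
    clip_id: str
    aesthetic: float        # e.g. from an aesthetic-quality predictor
    motion_strength: float  # e.g. from optical-flow or tracking magnitude
    nsfw: float             # probability from an NSFW classifier
    color_ok: bool          # passes a basic color/exposure check

def passes_filters(s: ClipScores,
                   min_aesthetic: float = 5.0,
                   min_motion: float = 0.3,
                   max_nsfw: float = 0.1) -> bool:
    """Return True if a clip clears every quality filter (assumed thresholds)."""
    return (s["color_ok"]
            and s["aesthetic"] >= min_aesthetic
            and s["motion_strength"] >= min_motion   # drop near-static clips
            and s["nsfw"] <= max_nsfw)

def filter_clips(scored: Iterable[ClipScores]) -> List[str]:
    return [s["clip_id"] for s in scored if passes_filters(s)]

# Usage: clip "a" passes; clip "b" is rejected for weak motion.
clips: List[ClipScores] = [
    {"clip_id": "a", "aesthetic": 6.2, "motion_strength": 0.80,
     "nsfw": 0.01, "color_ok": True},
    {"clip_id": "b", "aesthetic": 6.8, "motion_strength": 0.05,
     "nsfw": 0.01, "color_ok": True},
]
print(filter_clips(clips))  # ['a']
```

Releasing multiple versions under progressively stricter cutoffs of this kind would explain how one curation pipeline yields the five clip sets mentioned above.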
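Similarly, the structured-caption annotation can be pictured as a small record per clip. The schema below is an illustrative assumption based on the four described fields (main subject, background, camera motion, and style), not MiraData's actual file format.

```python
# Illustrative sketch of a MiraData-style annotation record.
# Field names and example values are assumptions inferred from the summary,
# not the dataset's actual schema.
from dataclasses import dataclass

@dataclass
class StructuredCaption:
    main_subject: str   # who or what the video centers on
    background: str     # setting and environment
    camera_motion: str  # e.g. "static shot", "slow drone pull-back"
    style: str          # visual style, e.g. "cinematic, warm color grading"

@dataclass
class VideoAnnotation:
    clip_id: str
    duration_sec: float            # MiraData clips average 72.1 s
    dense_caption: str             # free-form description of the whole clip
    structured: StructuredCaption  # the structured fields average ~318 words

example = VideoAnnotation(
    clip_id="yt_abc123_clip004",
    duration_sec=71.5,
    dense_caption="A hiker walks along a mountain ridge at sunrise ...",
    structured=StructuredCaption(
        main_subject="a lone hiker with a red backpack",
        background="a mountain ridge under an orange sunrise",
        camera_motion="slow drone pull-back revealing the valley",
        style="cinematic, high dynamic range, warm color grading",
    ),
)
```

Separating the description into fixed fields in this way is what lets a text-to-video model be conditioned independently on subject, scene, camera motion, and style.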