MiraData: A Large-Scale Video Dataset with Long Durations and Structured Captions

8 Jul 2024 | Xuan Ju1,2*, Yiming Gao1**, Zhaoyang Zhang1†*, Ziyang Yuan1, Xintao Wang1, Ailing Zeng2, Yu Xiong2, Qiang Xu2, Ying Shan1
MiraData is a large-scale video dataset designed to address the limitations of existing datasets for generating high-quality, long-duration videos with detailed captions. Its videos average 72.1 seconds in duration, show strong motion intensity, and carry detailed structured captions averaging 318 words. The data curation process collects videos from diverse sources, segments and filters the clips, and annotates them with structured captions using GPT-4V. MiraBench, an enhanced benchmark, includes 17 metrics that evaluate temporal consistency, motion strength, 3D consistency, visual quality, text-video alignment, and distribution similarity. Experiments with the MiraDiT model trained on MiraData show better motion strength and 3D consistency than models trained on other datasets. The study highlights the importance of detailed captions in guiding the model to produce videos that closely match the desired descriptions while maintaining coherence and realism.
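As a rough illustration of the filtering step and of dataset-level statistics such as average duration and average caption length, the sketch below may help. It is not the authors' pipeline: the `Clip` record, its field names, the `motion_score` value, and the thresholds are all hypothetical, chosen only to show the general shape of such a step.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Clip:
    """Hypothetical record for one segmented clip (field names are assumptions)."""
    clip_id: str
    duration_s: float      # clip length in seconds
    motion_score: float    # assumed scalar motion-intensity score
    caption: str = ""      # structured caption flattened to plain text


def filter_clips(clips, min_duration_s, min_motion):
    """Keep clips that are long enough and have enough motion.

    Both thresholds are illustrative parameters, not values from the paper.
    """
    return [
        c for c in clips
        if c.duration_s >= min_duration_s and c.motion_score >= min_motion
    ]


def summarize(clips):
    """Return clip count, average duration, and average caption word count."""
    if not clips:
        raise ValueError("no clips to summarize")
    return {
        "num_clips": len(clips),
        "avg_duration_s": mean(c.duration_s for c in clips),
        "avg_caption_words": mean(len(c.caption.split()) for c in clips),
    }


if __name__ == "__main__":
    clips = [
        Clip("a", 80.0, 0.9, "a dog runs"),
        Clip("b", 10.0, 0.9, "short"),
        Clip("c", 60.0, 0.1, "static scene here"),
        Clip("d", 60.0, 0.5, "one two three four five"),
    ]
    kept = filter_clips(clips, min_duration_s=30.0, min_motion=0.3)
    print(summarize(kept))
```

Applied to the full dataset, the same kind of summary would yield figures comparable to the 72.1-second average duration and 318-word average caption length reported above.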