VidProM: A Million-scale Real Prompt-Gallery Dataset for Text-to-Video Diffusion Models

14 May 2024 | Wenhao Wang, Yi Yang
VidProM is a large-scale dataset of 1.67 million unique text-to-video prompts and 6.69 million videos generated from them by four state-of-the-art diffusion models. The prompts are real user inputs collected from Pika Discord channels, and the videos were generated by Pika, Text2Video-Zero, VideoCrafter2, and ModelScope. The dataset also includes NSFW probabilities, 3072-dimensional prompt embeddings, and additional metadata. Its prompts are semantically unique, which ensures a high level of diversity. VidProM is publicly available under the CC-BY-NC 4.0 License.

The paper argues that text-to-video generation needs a new prompt dataset of its own: VidProM differs from DiffusionDB, a text-to-image prompt dataset, in semantics, modality, and techniques. It also outlines research directions for text-to-video diffusion models that VidProM makes possible, including prompt engineering, efficient video generation, fake video detection, and video copy detection. Beyond these, the dataset supports multimodal learning tasks such as video-text retrieval and video captioning.

Finally, the paper discusses the limitations of the current dataset, including the short duration and lower quality of the videos. Future work aims to enhance the dataset by incorporating high-quality videos generated by more advanced models.
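The summary mentions semantically unique prompts and 3072-dimensional prompt embeddings but does not spell out how uniqueness is enforced. One common way to get there is to drop any prompt whose embedding is too close, by cosine similarity, to one already kept. The following is a minimal sketch of that idea, not the authors' pipeline: the function names, the 0.8 threshold, and the 3-dimensional toy vectors (standing in for real 3072-dimensional embeddings) are all assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def deduplicate(prompts, embeddings, threshold=0.8):
    """Greedy semantic de-duplication (hypothetical helper).

    A prompt is kept only if its embedding has cosine similarity below
    `threshold` with every prompt kept so far. The threshold value is an
    assumed example, not a figure taken from the summary.
    """
    kept = []  # indices of prompts kept so far
    for i, emb in enumerate(embeddings):
        if all(cosine_similarity(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return [prompts[i] for i in kept]


# Toy data: the first two prompts are near-duplicates, the third is distinct.
prompts = [
    "a cat playing piano",
    "a kitten playing the piano",
    "a rocket launching at dawn",
]
embeddings = [
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.0],
    [0.0, 0.1, 1.0],
]

unique_prompts = deduplicate(prompts, embeddings)
print(unique_prompts)
```

This greedy pass compares each prompt against everything kept so far, so it is quadratic in the worst case. At the scale of millions of prompts, an approximate nearest-neighbour index would be the more realistic choice.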