VidProM: A Million-scale Real Prompt-Gallery Dataset for Text-to-Video Diffusion Models


14 May 2024 | Wenhao Wang, Yi Yang
**Wenhao Wang**, University of Technology Sydney (wangwenhao0716@gmail.com) · **Yi Yang**, Zhejiang University (yangyics@zju.edu.cn)

**Abstract:** The introduction of Sora marks a new era for text-to-video diffusion models, significantly advancing video generation and its potential applications. These models, however, rely heavily on prompts, and no publicly available dataset is specifically designed for text-to-video prompts. This paper introduces VidProM, the first large-scale dataset of its kind, comprising 1.67 million unique text-to-video prompts from real users and 6.69 million videos generated by four state-of-the-art diffusion models. Each prompt is accompanied by an NSFW probability, a 3072-dimensional prompt embedding, and additional metadata. The paper demonstrates the necessity of a specialized prompt dataset by comparing VidProM with DiffusionDB, a text-to-image prompt dataset, and explores new research directions it inspires, such as text-to-video prompt engineering, efficient video generation, fake video detection, and video copy detection for diffusion models. The project is publicly available under the CC-BY-NC 4.0 License.

**Key Contributions:**
1. **First Text-to-Video Prompt-Gallery Dataset:** VidProM includes 1.67 million unique prompts and 6.69 million generated videos.
2. **Necessity of a New Prompt Dataset:** VidProM differs from DiffusionDB in both basic statistics and prompt semantics.
3. **New Research Directions:** VidProM inspires research in text-to-video prompt engineering, efficient video generation, fake video detection, and video copy detection for diffusion models.

**Introduction:** The arrival of Sora and other text-to-video diffusion models has revolutionized video generation. However, these models depend heavily on prompts, and no publicly available dataset exists for studying text-to-video prompts.
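Since each prompt carries an NSFW probability and a 3072-dimensional embedding, simple filtering and similarity queries follow naturally. Below is a minimal, self-contained sketch of both operations; the field names (`prompt`, `nsfw_prob`, `embedding`) and the toy 3-dimensional vectors are illustrative placeholders, not the dataset's actual schema.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical records; real VidProM embeddings are 3072-dimensional.
records = [
    {"prompt": "a corgi surfing a wave", "nsfw_prob": 0.01, "embedding": [0.9, 0.1, 0.0]},
    {"prompt": "a dog riding a surfboard", "nsfw_prob": 0.02, "embedding": [0.8, 0.2, 0.1]},
    {"prompt": "city timelapse at night", "nsfw_prob": 0.00, "embedding": [0.0, 0.9, 0.4]},
]

# Keep only prompts below an NSFW-probability threshold.
safe = [r for r in records if r["nsfw_prob"] < 0.5]

# Rank the remaining prompts by similarity to a query embedding.
query = [1.0, 0.0, 0.0]
ranked = sorted(safe, key=lambda r: cosine_similarity(query, r["embedding"]), reverse=True)
print(ranked[0]["prompt"])
```

The same pattern scales to the full dataset by swapping the toy list for the released prompt table and the pure-Python loop for a vectorized library.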
This paper addresses this gap by introducing VidProM, the first large-scale dataset for text-to-video prompts: 1.67 million unique prompts and 6.69 million videos generated by four state-of-the-art diffusion models. The dataset is curated through web scraping and local generation, and it includes NSFW probabilities and 3072-dimensional prompt embeddings. VidProM differs from existing datasets such as DiffusionDB in prompt semantics and modality; the paper demonstrates its necessity by comparison with DiffusionDB and outlines several research directions it inspires.

**Related Work:** The paper discusses existing text-to-video diffusion models and datasets, emphasizing the importance of prompt datasets for text-to-video generation.

**Curating VidProM:** The process of curating VidProM is detailed, including collecting source HTML files, extracting and
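The curation description is cut off in this summary, but the stated steps (collect HTML files, extract prompts) plus the "unique prompts" count suggest an extract-then-deduplicate pipeline. The sketch below illustrates that shape using Python's standard-library HTML parser; the `<div class="prompt">` markup is a hypothetical placeholder, not the actual structure of the scraped pages, and exact-match deduplication stands in for whatever matching the authors used.

```python
from html.parser import HTMLParser

class PromptExtractor(HTMLParser):
    """Collect text inside <div class="prompt"> elements (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self.in_prompt = False
        self.prompts = []

    def handle_starttag(self, tag, attrs):
        if tag == "div" and ("class", "prompt") in attrs:
            self.in_prompt = True

    def handle_endtag(self, tag):
        if tag == "div":
            self.in_prompt = False

    def handle_data(self, data):
        if self.in_prompt and data.strip():
            self.prompts.append(data.strip())

def extract_unique_prompts(html_pages):
    """Parse each page and keep only the first occurrence of each prompt."""
    seen, unique = set(), []
    for page in html_pages:
        parser = PromptExtractor()
        parser.feed(page)
        for p in parser.prompts:
            if p not in seen:
                seen.add(p)
                unique.append(p)
    return unique

pages = [
    '<div class="prompt">a cat playing piano</div><div class="prompt">sunset over the sea</div>',
    '<div class="prompt">a cat playing piano</div>',  # duplicate across pages
]
print(extract_unique_prompts(pages))
```

At million-prompt scale the `seen` set would typically be replaced by hashing or a database unique constraint, but the logic is the same.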