OpenVid-1M is a large-scale, high-quality dataset designed for text-to-video (T2V) generation research. It contains over 1 million video clips of high aesthetic quality and clarity, each paired with an expressive caption, addressing two challenges in the field: the scarcity of precise, open-sourced datasets and the underuse of textual information. The dataset is curated from multiple sources with filtering that prioritizes aesthetics, temporal consistency, motion difference, and clarity. Additionally, 433K 1080p videos are selected to form OpenVidHD-0.4M, a subset aimed at advancing high-definition video generation. A novel Multi-modal Video Diffusion Transformer (MVDiT) is proposed to improve video quality by mining both structural information from visual tokens and semantic information from text tokens; a minimal sketch of such a block follows. Extensive experiments and ablation studies demonstrate the superiority of OpenVid-1M over previous datasets and the effectiveness of MVDiT. The dataset and model are publicly available to facilitate further research in T2V generation.
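
To make the multi-modal idea concrete, below is a minimal PyTorch sketch of a transformer block in which visual and text tokens jointly attend to one another, so structural (visual) and semantic (textual) information mix in every layer. This is an illustrative assumption, not the published MVDiT architecture: the class name, dimensions, and the choice of joint self-attention over concatenated tokens are all hypothetical, and the real block may use per-modality projections, cross-attention, or gating instead.

```python
import torch
import torch.nn as nn


class MultiModalAttentionBlock(nn.Module):
    """Hypothetical sketch of a multi-modal DiT block: visual and text
    tokens are concatenated and processed by shared self-attention, so
    each modality can attend to the other. Not the official MVDiT code."""

    def __init__(self, dim: int = 1152, num_heads: int = 16):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, visual: torch.Tensor, text: torch.Tensor):
        # visual: (B, N_v, dim) patch tokens from the video latent
        # text:   (B, N_t, dim) caption embeddings
        n_v = visual.shape[1]
        x = torch.cat([visual, text], dim=1)  # fuse the two token streams
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)      # joint self-attention
        x = x + attn_out                      # residual connection
        x = x + self.mlp(self.norm2(x))       # position-wise MLP
        return x[:, :n_v], x[:, n_v:]         # split back by modality


# Example usage with made-up shapes (e.g. 256 latent patches, 77 caption tokens):
block = MultiModalAttentionBlock()
v = torch.randn(2, 256, 1152)
t = torch.randn(2, 77, 1152)
v_out, t_out = block(v, t)
```

The design point this sketch illustrates is that, unlike a standard DiT where text conditioning enters only through cross-attention or modulation, letting both token types share one attention operation allows textual semantics to shape visual structure (and vice versa) at every layer.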