6 Jun 2024 | Lin Chen, Xilin Wei, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Bin Lin, Zhenyu Tang, Li Yuan, Yu Qiao, Dahua Lin, Feng Zhao, Jiaqi Wang
The paper introduces ShareGPT4Video, a comprehensive video-caption dataset, and ShareCaptioner-Video, an advanced captioning model, both designed to improve video understanding and generation. ShareGPT4Video consists of 40K high-quality video-caption pairs collected from diverse sources, with detailed and precise captions generated by the multi-modal image model GPT4V. The captions cover rich world knowledge, object attributes, camera movements, and temporal descriptions.

ShareCaptioner-Video is an efficient and capable captioning model that can generate high-quality captions for arbitrary videos; it has been used to annotate 4.8M aesthetically appealing videos. The paper also presents ShareGPT4Video-8B, a large video-language model that achieves state-of-the-art performance on multiple video benchmarks. The models and datasets are open-sourced to advance the LVLM (large video-language model) and T2VM (text-to-video model) communities.

The paper identifies three core challenges in video captioning: inter-frame temporal-change understanding, intra-frame content description, and frame-number scalability. To address them, it proposes a differential sliding-window captioning strategy (DiffSW), sketched below. Extensive experiments demonstrate the effectiveness of the dataset and models in video understanding and generation tasks.
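Conceptually, a differential sliding-window scheme captions the first frame in full and then, for each subsequent frame, asks an image model to describe only what changed relative to the previous one, before summarizing all notes into a single temporally ordered caption. Below is a minimal sketch of this idea; the `caption_model(images, prompt)` wrapper and the prompts are hypothetical placeholders, not the paper's exact implementation.

```python
from typing import Callable, List

def diffsw_caption(
    frames: List[bytes],
    caption_model: Callable[[List[bytes], str], str],
) -> str:
    """Caption a video by describing frame-to-frame changes in a sliding window.

    A sketch of the DiffSW idea: caption_model is an assumed wrapper around an
    image LVLM (e.g., GPT4V) that takes a list of images and a text prompt.
    """
    if not frames:
        raise ValueError("diffsw_caption requires at least one frame")

    # 1) Describe the first frame in full detail.
    notes = [caption_model([frames[0]], "Describe this frame in detail.")]

    # 2) Slide a two-frame window: for each new frame, ask the model what
    #    changed relative to the previous frame (motion, camera, new objects).
    for prev, curr in zip(frames, frames[1:]):
        diff = caption_model(
            [prev, curr],
            "Image 1 is the earlier frame, image 2 the later one. "
            "Describe what changed between them: object motion, camera "
            "movement, and any appearing or disappearing content.",
        )
        notes.append(diff)

    # 3) Summarize the per-step notes into one temporally ordered caption.
    #    This final step needs no images, so it could equally be a text-only LLM call.
    summary_prompt = (
        "Combine the following frame description and frame-to-frame changes "
        "into a single detailed video caption:\n" + "\n".join(notes)
    )
    return caption_model([], summary_prompt)
```

Because each model call sees at most two frames, the cost per step stays constant regardless of video length, which is what makes the strategy scale with frame count.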