6 Jun 2024 | Lin Chen, Xilin Wei, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Bin Lin, Zhenyu Tang, Li Yuan, Yu Qiao, Dahua Lin, Feng Zhao, Jiaqi Wang
The paper introduces ShareGPT4Video, a comprehensive video-caption dataset, and ShareCaptioner-Video, an advanced captioning model, both designed to improve video understanding and generation. ShareGPT4Video consists of 40K high-quality video-caption pairs collected from diverse sources, with detailed and precise captions generated by the multi-modal image model GPT4V. The captions cover rich world knowledge, object attributes, camera movements, and temporal descriptions.

ShareCaptioner-Video is an efficient and capable captioning model that can generate high-quality captions for arbitrary videos; it has been used to annotate 4.8M aesthetically appealing videos. The paper also presents ShareGPT4Video-8B, a large video-language model that achieves state-of-the-art performance on multiple video benchmarks. The models and datasets are open-sourced to advance the LVLM (large video-language model) and T2VM (text-to-video model) communities.

The paper identifies three core challenges in video captioning: inter-frame temporal-change understanding, intra-frame content description, and frame-number scalability. To address them, it proposes a differential sliding-window captioning strategy (DiffSW), sketched below. Extensive experiments demonstrate the effectiveness of the dataset and models in video understanding and generation tasks.
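Conceptually, a differential sliding-window scheme captions the first frame in full and then, for each subsequent frame, asks an image model to describe only what changed relative to the previous one, before summarizing all notes into a single temporally ordered caption. Below is a minimal sketch of this idea; the `caption_model(images, prompt)` wrapper and the prompts are hypothetical placeholders, not the paper's exact implementation.

```python
from typing import Callable, List

def diffsw_caption(
    frames: List[bytes],
    caption_model: Callable[[List[bytes], str], str],
) -> str:
    """Caption a video by describing frame-to-frame changes in a sliding window.

    A sketch of the DiffSW idea: caption_model is an assumed wrapper around an
    image LVLM (e.g., GPT4V) that takes a list of images and a text prompt.
    """
    if not frames:
        raise ValueError("diffsw_caption requires at least one frame")

    # 1) Describe the first frame in full detail.
    notes = [caption_model([frames[0]], "Describe this frame in detail.")]

    # 2) Slide a two-frame window: for each new frame, ask the model what
    #    changed relative to the previous frame (motion, camera, new objects).
    for prev, curr in zip(frames, frames[1:]):
        diff = caption_model(
            [prev, curr],
            "Image 1 is the earlier frame, image 2 the later one. "
            "Describe what changed between them: object motion, camera "
            "movement, and any appearing or disappearing content.",
        )
        notes.append(diff)

    # 3) Summarize the per-step notes into one temporally ordered caption.
    #    This final step needs no images, so it could equally be a text-only LLM call.
    summary_prompt = (
        "Combine the following frame description and frame-to-frame changes "
        "into a single detailed video caption:\n" + "\n".join(notes)
    )
    return caption_model([], summary_prompt)
```

Because each model call sees at most two frames, the cost per step stays constant regardless of video length, which is what makes the strategy scale with frame count.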