ShareGPT4Video: Improving Video Understanding and Generation with Better Captions

6 Jun 2024 | Lin Chen, Xilin Wei, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Bin Lin, Zhenyu Tang, Li Yuan, Yu Qiao, Dahua Lin, Feng Zhao, Jiaqi Wang
The ShareGPT4Video project aims to enhance video understanding and generation by building high-quality video-caption datasets and models. It introduces ShareGPT4Video, a dataset of 40K high-quality video-caption pairs, and ShareCaptioner-Video, an efficient captioning model used to generate 4.8M high-quality captions. The dataset is constructed with a differential sliding-window captioning strategy, which yields precise temporal descriptions of events and efficient caption generation for videos of arbitrary length and resolution. The project also presents ShareGPT4Video-8B, a large video-language model that achieves state-of-the-art performance on three video benchmarks. The dataset and models are open-sourced to advance research on large video-language models (LVLMs) and text-to-video models (T2VMs).

The work addresses key challenges in video captioning: precise temporal understanding, detailed content description, and scalability to long videos. The differential captioning strategy also allows sub-clips to be re-captioned by reusing their differential captions, improving both the efficiency and the quality of caption generation; a sketch of the idea follows below. Experiments demonstrate the effectiveness of the dataset and models on video understanding and generation tasks, showing that high-quality captions significantly improve performance on benchmarks such as TempCompass and VideoBench. The authors highlight the importance of detailed, temporally accurate captions for bridging the video and language modalities, and note the need for further research to incorporate audio information and to address the potential social impacts of large-scale caption generation.
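The following is a minimal Python sketch of the differential sliding-window captioning idea described above, under the assumption that the first keyframe receives a full description, each later keyframe is described only by its difference from the previous one, and the per-step captions are then merged into a single video caption. The functions `describe_frame`, `describe_difference`, and `summarize_captions` are hypothetical placeholders standing in for calls to an image-captioning model; they are not the project's actual API.

```python
from typing import List, Tuple


def describe_frame(frame: str) -> str:
    """Hypothetical placeholder: detailed caption of a single keyframe."""
    return f"detailed description of {frame}"


def describe_difference(prev_frame: str, curr_frame: str) -> str:
    """Hypothetical placeholder: caption of what changed between adjacent keyframes."""
    return f"changes from {prev_frame} to {curr_frame}"


def summarize_captions(captions: List[str]) -> str:
    """Hypothetical placeholder: merge per-step captions into one temporally ordered caption."""
    return " ".join(captions)


def differential_sliding_window_caption(keyframes: List[str]) -> Tuple[str, List[str]]:
    """Caption a video of arbitrary length from sparsely sampled keyframes.

    The first keyframe gets a full description; every later keyframe is
    described only by its difference from the previous one, so the cost
    grows linearly with video length. The per-step captions are returned
    so they can be cached and reused.
    """
    if not keyframes:
        return "", []
    step_captions = [describe_frame(keyframes[0])]
    for prev, curr in zip(keyframes, keyframes[1:]):
        step_captions.append(describe_difference(prev, curr))
    return summarize_captions(step_captions), step_captions


def recaption_subclip(step_captions: List[str], start: int, end: int) -> str:
    """Re-caption the sub-clip covering keyframes [start, end) by reusing
    the cached differential captions instead of re-running the captioner."""
    return summarize_captions(step_captions[start:end])


if __name__ == "__main__":
    frames = ["frame_0.jpg", "frame_1.jpg", "frame_2.jpg", "frame_3.jpg"]
    full_caption, steps = differential_sliding_window_caption(frames)
    print(full_caption)
    print(recaption_subclip(steps, 1, 3))
```

The key design point illustrated here is that differential captions are computed once per adjacent frame pair and cached, so captioning a sub-clip only requires re-summarizing the relevant slice rather than re-describing every frame.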