10 Jun 2024 | Dongjie Yang, Suyuan Huang, Chengqiang Lu, Xiaodong Han, Haoxin Zhang, Yan Gao, Yao Hu, Hai Zhao
Vript is a high-quality video-text dataset containing 12,000 high-resolution videos with over 420,000 clips, each annotated with a detailed, dense, script-like caption of approximately 145 words. Unlike previous video-text datasets, Vript captures not only content descriptions but also camera operations, such as shot types and camera movements. The dataset is used to train Vriptor, a video captioning model that achieves state-of-the-art performance among open-source models, comparable to GPT-4V, and generates dense, detailed captions for both short and long videos. Additionally, Vript-Hard is introduced as a challenging benchmark consisting of three tasks: Vript-HAL, which evaluates object and action hallucinations in video LLMs; Vript-RR, which combines retrieval and reasoning for long-video QA; and Vript-ERO, which assesses event reordering in long videos. Vript-Hard addresses limitations of existing benchmarks by providing more challenging tasks and detailed ground truth. The dataset and models are available at https://github.com/mutonix/Vript.
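To make the annotation structure concrete, below is a minimal Python sketch of what a clip-level record pairing camera operations with a content caption might look like. The field names (`video_id`, `shot_type`, `camera_movement`, and so on) are illustrative assumptions, not the dataset's actual schema; consult the repository above for the real format.

```python
from dataclasses import dataclass


@dataclass
class ClipAnnotation:
    """One clip-level record. Field names are hypothetical, for illustration only."""
    video_id: str
    clip_id: str
    shot_type: str        # e.g. "close-up", "wide shot"
    camera_movement: str  # e.g. "slow pan right", "static"
    caption: str          # dense, script-like content description (~145 words in Vript)


def to_script_line(a: ClipAnnotation) -> str:
    # Combine camera operations with the content description into a single
    # script-style caption, mirroring how Vript pairs "how it is shot"
    # with "what is shown".
    return f"[{a.shot_type} | {a.camera_movement}] {a.caption}"


example = ClipAnnotation(
    video_id="v0001",
    clip_id="v0001_clip03",
    shot_type="medium shot",
    camera_movement="slow pan right",
    caption="A chef plates a dish while describing each ingredient...",
)
print(to_script_line(example))
```

Treating each clip as a structured record like this is what distinguishes Vript's captions from plain content descriptions: the camera-level fields can be kept, dropped, or rendered inline depending on the downstream task.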