30 Jun 2024 | Jiawei Wang*, Liping Yuan*, Yuchen Zhang*
The paper introduces Tarsier, a family of large-scale video-language models designed to generate high-quality video descriptions. Tarsier uses CLIP-ViT to encode frames independently and relies on an LLM to model the temporal relationships across them. Despite this simple architecture, Tarsier models substantially outperform existing open-source models at video description, winning human side-by-side evaluations by a +51.4% margin. They are also competitive with state-of-the-art proprietary models, showing a +12.3% advantage against GPT-4V and a −6.7% disadvantage against Gemini 1.5 Pro. Beyond description, Tarsier is versatile, setting new state-of-the-art results on a range of benchmarks, including multi-choice VQA, open-ended VQA, and zero-shot video captioning. The paper also introduces DREAM-1K, a new benchmark, and AutoDQ, an automatic evaluation method for assessing the quality of fine-grained video descriptions. Extensive ablation studies highlight the importance of multi-task pre-training and of fine-tuning on high-quality, fine-grained video description data.
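To make the "frame-wise CLIP-ViT encoder plus temporal-modeling LLM" design concrete, here is a minimal PyTorch sketch. It is not the authors' implementation: the checkpoint names, the single linear projector, and the plain concatenation of frame tokens ahead of the text tokens are assumptions for illustration only.

```python
import torch
import torch.nn as nn
from transformers import CLIPVisionModel, AutoModelForCausalLM

class VideoLMSketch(nn.Module):
    """Illustrative sketch of the Tarsier-style architecture, not the paper's code."""

    def __init__(self,
                 vision_name="openai/clip-vit-large-patch14",   # example checkpoint
                 llm_name="meta-llama/Llama-2-7b-hf"):          # example checkpoint
        super().__init__()
        self.vision = CLIPVisionModel.from_pretrained(vision_name)
        self.llm = AutoModelForCausalLM.from_pretrained(llm_name)
        # Hypothetical projector mapping CLIP features into the LLM embedding space.
        self.projector = nn.Linear(self.vision.config.hidden_size,
                                   self.llm.config.hidden_size)

    def forward(self, frames, text_embeds):
        # frames: (batch, num_frames, 3, H, W). Each frame is encoded
        # independently by CLIP-ViT, with no cross-frame fusion in the encoder.
        b, t = frames.shape[:2]
        patches = self.vision(pixel_values=frames.flatten(0, 1)).last_hidden_state
        visual = self.projector(patches).view(b, t * patches.shape[1], -1)
        # Frame tokens are concatenated in temporal order before the text tokens;
        # the LLM's self-attention alone models temporal relationships.
        inputs = torch.cat([visual, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs)
```

The point of the sketch is the division of labor: the vision tower sees one frame at a time, so all reasoning about motion and event order happens inside the LLM over the concatenated frame-token sequence.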