Tarsier: Recipes for Training and Evaluating Large Video Description Models

30 Jun 2024 | Jiawei Wang*, Liping Yuan*, Yuchen Zhang*
Tarsier is a family of large-scale video-language models designed to generate high-quality video descriptions. The architecture is deliberately simple: a CLIP-ViT encoder processes frames independently, and a large language model (LLM) then models temporal relationships across frames, trained with a standard next-token prediction loss. Despite this simplicity, Tarsier demonstrates substantially stronger video description capabilities than existing open-source models, with a +51.4% advantage over the strongest open-source model in human evaluation. It is also competitive with state-of-the-art proprietary models, showing a +12.3% advantage over GPT-4V and a -6.7% disadvantage against Gemini 1.5 Pro.

The paper introduces DREAM-1K, a new benchmark for evaluating video description models. It combines a challenging set of videos drawn from diverse sources and varying complexity with an automatic method for assessing the quality of fine-grained video descriptions. On DREAM-1K, Tarsier-34B outperforms all open-source models and GPT-4V in automatic evaluation and significantly outperforms the strongest open-source model in human evaluation.

Tarsier is trained in two stages: multi-task pre-training on large-scale, high-quality data (13.6M video-text pairs from public and in-house sources), followed by instruction tuning on human-annotated, multi-grained video description data. Ablation studies attribute the model's strong performance to this combination of extensive multi-task pre-training and fine-tuning on human-annotated description data.

Beyond description, Tarsier is a capable generalist model, setting new state-of-the-art results across nine public benchmarks covering multi-choice video QA, open-ended video QA, and zero-shot video captioning. Taken together, the automatic and human evaluations indicate that Tarsier generates detailed, accurate video descriptions and is a promising model for video understanding tasks.
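To make the frame-encode-then-LLM design concrete, the sketch below shows one way the described architecture can be wired up: each frame is encoded separately, visual features are projected into the LLM embedding space, and the LLM is trained with next-token prediction on the description tokens. The module names, the stand-in encoder/LLM blocks, and the projection layer are illustrative assumptions, not the authors' implementation (which uses CLIP-ViT and a pretrained LLM).

```python
import torch
import torch.nn as nn

class TarsierStyleModel(nn.Module):
    """Minimal sketch (assumptions, not the paper's code): per-frame vision
    encoder -> linear projector into the LLM embedding space -> decoder-style
    LLM over [visual tokens; text tokens] with next-token prediction on text."""

    def __init__(self, vision_dim=768, llm_dim=1024, vocab_size=32000,
                 n_layers=4, n_heads=8):
        super().__init__()
        # Stand-in per-frame encoder (CLIP-ViT in the actual model).
        self.frame_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(vision_dim, n_heads, batch_first=True),
            num_layers=2)
        # Maps visual features into the LLM token-embedding space.
        self.projector = nn.Linear(vision_dim, llm_dim)
        # Stand-in LLM: token embeddings + causal transformer + LM head.
        self.tok_embed = nn.Embedding(vocab_size, llm_dim)
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(llm_dim, n_heads, batch_first=True),
            num_layers=n_layers)
        self.lm_head = nn.Linear(llm_dim, vocab_size)

    def forward(self, frame_patches, text_ids):
        # frame_patches: (batch, num_frames, num_patches, vision_dim)
        # text_ids:      (batch, text_len) token ids of the target description
        b, t, p, d = frame_patches.shape
        # Encode each frame separately (frames folded into the batch dim).
        frames = self.frame_encoder(frame_patches.view(b * t, p, d))
        visual = self.projector(frames).view(b, t * p, -1)

        text = self.tok_embed(text_ids)
        seq = torch.cat([visual, text], dim=1)  # visual tokens come first

        # Causal mask: the LLM models temporal and textual order autoregressively.
        mask = nn.Transformer.generate_square_subsequent_mask(seq.size(1))
        hidden = self.llm(seq, mask=mask)
        logits = self.lm_head(hidden[:, visual.size(1):])  # predict text only

        # Next-token prediction loss on the description tokens.
        return nn.functional.cross_entropy(
            logits[:, :-1].reshape(-1, logits.size(-1)),
            text_ids[:, 1:].reshape(-1))
```

In this sketch, training reduces to sampling frames, tokenizing the reference description, and minimizing the returned loss; the same forward pass covers both pre-training and instruction tuning, with only the data mixture changing between stages.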
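The automatic evaluation of fine-grained descriptions can be illustrated with an event-level scoring scheme like the one below: events are extracted from both the reference and the generated description, cross-checked with an entailment test, and summarized as precision/recall/F1. The `extract_events` and `entails` callables are hypothetical stand-ins (for an LLM- or NLI-based component); the paper's actual procedure and prompts may differ.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class EventScore:
    precision: float  # fraction of generated events supported by the reference
    recall: float     # fraction of reference events covered by the generation
    f1: float

def score_description(
    reference: str,
    generated: str,
    extract_events: Callable[[str], List[str]],
    entails: Callable[[str, str], bool],
) -> EventScore:
    """Event-level scoring sketch for fine-grained video descriptions.
    The extractor and entailment checker are assumed external components."""
    ref_events = extract_events(reference)
    gen_events = extract_events(generated)

    # Precision: each generated event should be entailed by the reference.
    supported = sum(entails(reference, e) for e in gen_events)
    precision = supported / len(gen_events) if gen_events else 0.0

    # Recall: each reference event should be entailed by the generation.
    covered = sum(entails(generated, e) for e in ref_events)
    recall = covered / len(ref_events) if ref_events else 0.0

    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return EventScore(precision, recall, f1)
```

A benchmark score is then the average of these per-video scores, which rewards descriptions that both cover the reference events and avoid hallucinated ones.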