Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs

24 Jun 2024 | Shengbang Tong*, Ellis Brown*, Penghao Wu*, Sanghyun Woo, Manoj Middepogu, Sai Charitha Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, Austin Wang, Rob Fergus, Yann LeCun, Saining Xie†
Cambrian-1 is a family of multimodal large language models (MLLMs) designed with a vision-centric approach. The paper uses LLMs and visual instruction tuning as an interface to evaluate a wide range of visual representations, and it addresses the difficulty of consolidating and interpreting results across disparate tasks by introducing a new vision-centric benchmark, CV-Bench. It also proposes the Spatial Vision Aggregator (SVA), a dynamic, spatially-aware connector that integrates high-resolution features from multiple vision encoders with the LLM while reducing the number of visual tokens.

Beyond architecture, the paper curates high-quality visual instruction-tuning data from publicly available sources and shows that balancing data sources and tuning the distribution ratio across categories have a significant impact on downstream performance. Cambrian-1 achieves state-of-the-art results across a range of benchmarks and is positioned as a comprehensive, open cookbook for instruction-tuned MLLMs: the authors release model weights, code, supporting tools, datasets, and detailed instruction-tuning and evaluation recipes.
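The SVA is described as a set of learnable latent queries that cross-attend to features from multiple vision encoders, with each query responsible for a local spatial region of every encoder's feature map. The sketch below illustrates that idea under stated assumptions; the class name, the 2x2 window per query, the dimensions, and the single aggregation step are all illustrative choices of ours, not the paper's reference implementation (the paper's SVA additionally aggregates vision features at multiple LLM layers).

```python
# Minimal sketch of a spatially-aware cross-attention connector in the spirit
# of the Spatial Vision Aggregator (SVA). Names, shapes, and hyperparameters
# are illustrative assumptions, not the paper's reference implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAggregatorSketch(nn.Module):
    def __init__(self, query_grid=8, dim=1024, encoder_dims=(1024, 1536), num_heads=8):
        super().__init__()
        self.query_grid = query_grid
        # Learnable grid of latent queries; each becomes one visual token for the LLM.
        self.queries = nn.Parameter(torch.randn(query_grid * query_grid, dim) * 0.02)
        # Project each vision encoder's features into a shared width.
        self.proj = nn.ModuleList(nn.Linear(d, dim) for d in encoder_dims)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, feature_maps):
        # feature_maps: list of (B, H_i, W_i, C_i) tensors, one per vision encoder.
        B = feature_maps[0].shape[0]
        G = self.query_grid
        windows = []
        for fmap, proj in zip(feature_maps, self.proj):
            f = proj(fmap)                 # (B, H, W, dim)
            f = f.permute(0, 3, 1, 2)      # (B, dim, H, W)
            # Resample each map to a multiple of the query grid so every query
            # owns an aligned local window in every encoder (spatial inductive bias).
            f = F.adaptive_avg_pool2d(f, (G * 2, G * 2))  # assume a 2x2 window per query
            # Split into G*G windows of 2*2 tokens each.
            f = f.reshape(B, -1, G, 2, G, 2).permute(0, 2, 4, 3, 5, 1)
            windows.append(f.reshape(B, G * G, 4, -1))
        # Concatenate per-encoder windows along the token axis: each query
        # cross-attends only to its own spatial neighborhood from all encoders.
        ctx = torch.cat(windows, dim=2)                    # (B, G*G, 4 * num_encoders, dim)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)    # (B, G*G, dim)
        # Fold the query axis into the batch to run all local attentions at once.
        q_flat = q.reshape(B * G * G, 1, -1)
        ctx_flat = ctx.reshape(B * G * G, ctx.shape[2], -1)
        out, _ = self.attn(q_flat, ctx_flat, ctx_flat)
        return out.reshape(B, G * G, -1)                   # G*G visual tokens for the LLM

if __name__ == "__main__":
    agg = SpatialAggregatorSketch()
    feats = [torch.randn(2, 24, 24, 1024), torch.randn(2, 32, 32, 1536)]
    print(agg(feats).shape)  # torch.Size([2, 64, 1024])
```

The key design point the sketch preserves is token reduction with spatial grounding: the LLM sees a fixed budget of G*G visual tokens regardless of encoder resolution, yet each token is pooled only from the matching region of every encoder rather than from the whole image.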