Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models

27 May 2024 | Byung-Kwan Lee, Chae Won Kim, Beomchan Park, Yong Man Ro
Meteor is an efficient large language and vision model (LLVM) that leverages multifaceted rationales to improve its understanding and answering capabilities. It builds on the Mamba architecture, which processes sequential data in linear time, to embed lengthy rationales rich in information, and introduces the concept of rationale traversal to make that embedding efficient. The backbone multimodal language model (MLM) is then trained to generate answers with the aid of the embedded rationale.

Meteor achieves notable gains in vision-language performance across multiple evaluation benchmarks without scaling up the model size or relying on additional vision encoders or computer vision models. It is trained on 2.1 million question-answer pairs drawn from existing visual instruction tuning datasets; the accompanying rationales are generated with the Claude Haiku API and filtered by human reviewers. Performance is evaluated on benchmarks including QBench, SQA, AI2D, ChartQA, SEED, POPE, HallB, MME, MathVista, MMB, MM-Vet, and LLaVA, with particularly strong results on tasks requiring diverse capabilities such as image understanding, common-sense knowledge, and non-object concepts.

Ablation studies validate the model's ability to embed rationales, showing that both the Mamba architecture and rationale traversal contribute to the performance improvements. Meteor is designed to be efficient: its model size is relatively small compared to other LLVMs, and it requires no additional vision encoders or computer vision models. Further experiments show that it can effectively answer complex questions with the help of embedded rationales.
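The linear-time claim for Mamba comes from its recurrent state-space formulation: each token triggers one constant-cost state update, so a sequence of length L costs O(L), unlike the O(L^2) pairwise interactions of attention. A minimal NumPy sketch of that recurrence follows; the matrices and dimensions are illustrative toys, not Meteor's actual parameters, and real Mamba additionally makes the A, B, C projections input-dependent ("selective").

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Sequential state-space scan: O(L) in sequence length L.

    x: (L, d_in) input sequence (e.g. embedded rationale tokens)
    A: (d_state, d_state) state transition matrix
    B: (d_state, d_in) input projection
    C: (d_out, d_state) output projection
    """
    h = np.zeros(A.shape[0])
    ys = []
    for t in range(x.shape[0]):   # one fixed-cost state update per token
        h = A @ h + B @ x[t]      # recurrent state carries the whole prefix
        ys.append(C @ h)
    return np.stack(ys)

rng = np.random.default_rng(0)
L, d_in, d_state, d_out = 16, 4, 8, 4
y = ssm_scan(rng.normal(size=(L, d_in)),
             0.9 * np.eye(d_state),                  # toy decaying transition
             0.1 * rng.normal(size=(d_state, d_in)),
             0.1 * rng.normal(size=(d_out, d_state)))
print(y.shape)  # (16, 4)
```

The fixed-size hidden state `h` is what lets a Mamba-style module absorb an arbitrarily long rationale at linear cost before the result is handed to the language model.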
Feature analysis further shows that the model captures multifaceted information even without explicit rationales in natural language. Overall, Meteor delivers significant improvements in vision-language performance across a wide range of benchmarks and is a promising step toward more efficient LLVMs.
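One way to picture rationale traversal is that the long rationale never enters the language model as text: it is compressed by the Mamba module into a handful of feature vectors that are injected at placeholder positions in the prompt before answer generation. The sketch below illustrates that injection step; the placeholder id, helper names, and shapes are hypothetical assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

def build_prompt(question_ids, num_slots, answer_ids, slot_id=-1):
    """Interleave placeholder ids where rationale features will be injected.
    slot_id is a hypothetical stand-in for a special rationale token."""
    return question_ids + [slot_id] * num_slots + answer_ids

def inject_rationale(token_embeds, prompt_ids, rationale_feats, slot_id=-1):
    """Overwrite placeholder embeddings with compressed rationale features
    (here, stand-ins for the Mamba module's outputs)."""
    out = token_embeds.copy()
    slots = [i for i, t in enumerate(prompt_ids) if t == slot_id]
    for slot, feat in zip(slots, rationale_feats):
        out[slot] = feat
    return out

prompt = build_prompt([101, 102], num_slots=3, answer_ids=[201])
embeds = np.zeros((len(prompt), 4))   # toy token embeddings
feats = np.ones((3, 4))               # toy compressed-rationale features
mixed = inject_rationale(embeds, prompt, feats)
print(prompt)        # [101, 102, -1, -1, -1, 201]
print(mixed.sum())   # 12.0 (only the three slot rows were filled)
```

Because the rationale is carried as a few dense vectors rather than hundreds of text tokens, the backbone MLM can be conditioned on it at answer time without any growth in prompt length.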