27 May 2024 | Byung-Kwan Lee, Chae Won Kim, Beomchan Park, Yong Man Ro
The paper introduces Meteor, an efficient large language and vision model (LLVM) that leverages multifaceted rationale to enhance understanding and answering capabilities. Meteor responds to the rapid development of LLVMs driven by visual instruction tuning. The authors highlight the importance of multifaceted information, including fundamental image understanding, real-world knowledge, and step-by-step procedures for solving complex questions. To embed lengthy rationales, Meteor employs the Mamba architecture, which processes sequential data in linear time. The paper also introduces the concept of traversal of rationale, which enables efficient embedding of rationale during inference. Meteor is trained on a backbone multimodal language model (MLM) with a curated dataset of 2.1M question-rationale pairs, yielding significant improvements in vision-language performance across multiple benchmarks without scaling up the model size or adding extra vision encoders. The contributions include the introduction of Meteor and a demonstration of its effectiveness across diverse capabilities through extensive experiments and ablation studies.
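To give intuition for why a Mamba-style recurrence can embed long rationales efficiently, here is a minimal sketch of a linear-time sequential scan: each token is folded into a fixed-size hidden state with constant work per step, so cost grows linearly with sequence length (unlike the quadratic cost of full self-attention). This is an illustrative toy only; the matrices `A` and `B` are fixed here, whereas Mamba's actual selective state-space parameters are input-dependent, and none of the names below come from the Meteor codebase.

```python
import numpy as np

def linear_time_scan(tokens: np.ndarray, A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Toy recurrence h_t = A @ h_{t-1} + B @ x_t over a token sequence.

    Illustrative stand-in for a state-space scan: one pass, O(length) total,
    constant memory for the running state. Not Meteor's implementation.
    """
    d_state = A.shape[0]
    h = np.zeros(d_state)
    states = []
    for x in tokens:          # single pass over the sequence: O(length)
        h = A @ h + B @ x     # constant work per token
        states.append(h.copy())
    return np.stack(states)   # (length, d_state) hidden states

# A "rationale" of 1,000 hypothetical token embeddings, one linear pass.
rng = np.random.default_rng(0)
seq = rng.standard_normal((1000, 8))    # (length, embedding dim)
A = 0.9 * np.eye(4)                     # decays the hidden state
B = 0.1 * rng.standard_normal((4, 8))   # projects tokens into the state
out = linear_time_scan(seq, A, B)
print(out.shape)  # (1000, 4)
```

Because per-token work is constant, doubling the rationale length simply doubles the runtime, which is what makes embedding multi-thousand-token rationales practical.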