F-LMM: Grounding Frozen Large Multimodal Models

9 Jun 2024 | Size Wu¹, Sheng Jin²,³, Wenwei Zhang⁴, Lumin Xu⁵, Wentao Liu³,⁴, Wei Li¹, Chen Change Loy¹*
The paper introduces F-LMM (Frozen Large Multimodal Models), an approach to grounding large multimodal models (LMMs) in human-AI conversations while preserving their conversational capabilities. Existing methods typically fine-tune LMMs to learn additional segmentation tokens, which degrades general knowledge comprehension and instruction-following ability. F-LMM instead exploits the word-pixel correspondences already present in the attention weights of well-trained LMMs, which are normally not used for visual grounding. A few trainable CNN layers translate these attention weights into mask logits, which a SAM-based mask refiner then sharpens into final segmentation masks. Because the LMM itself stays frozen, F-LMM retains its original conversational ability while achieving competitive performance on visual grounding benchmarks. Experiments show that F-LMM outperforms existing grounding models on referring expression segmentation and panoptic narrative grounding, while also improving visual chain-of-thought reasoning and reducing object hallucinations. The paper also discusses broader impacts and limitations, emphasizing access controls, usage policies, and transparency to address potential biases and ethical concerns.
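To make the core idea concrete, below is a minimal PyTorch sketch of how attention maps from a frozen LMM might be converted into coarse mask logits by a small trainable CNN head. The module name `AttentionToMask`, the tensor shapes, and the hidden width are illustrative assumptions, not the paper's actual implementation; the SAM-based refinement step is only indicated in a comment.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionToMask(nn.Module):
    """Hypothetical sketch: turn frozen-LMM attention weights into mask logits.

    Assumes attention maps of shape (B, num_layers * num_heads, H, W), i.e. the
    word-to-image-patch attention for one text token, reshaped onto the patch grid.
    """
    def __init__(self, in_channels: int, hidden: int = 64):
        super().__init__()
        # A few trainable CNN layers; the frozen LMM itself is never updated.
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 1, kernel_size=1),  # per-patch mask logit
        )

    def forward(self, attn_maps: torch.Tensor) -> torch.Tensor:
        return self.net(attn_maps)  # (B, 1, H, W) coarse mask logits


# Toy usage: 8 layers x 16 heads of attention on a 24x24 patch grid.
if __name__ == "__main__":
    attn = torch.rand(1, 8 * 16, 24, 24)      # stand-in for LMM attention
    head = AttentionToMask(in_channels=8 * 16)
    coarse_logits = head(attn)                 # (1, 1, 24, 24)
    # Upsample coarse logits to image resolution before handing them to a
    # SAM-style mask refiner (not implemented in this sketch).
    coarse_mask = F.interpolate(coarse_logits, size=(384, 384), mode="bilinear")
    print(coarse_mask.shape)                   # torch.Size([1, 1, 384, 384])
```

The design point this sketch illustrates is that only the small CNN head (and, in the paper, the SAM-based refiner) receives gradients; the LMM's weights, and therefore its conversational behavior, are left untouched.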