F-LMM: Grounding Frozen Large Multimodal Models


9 Jun 2024 | Size Wu, Sheng Jin, Wenwei Zhang, Lumin Xu, Wentao Liu, Wei Li, Chen Change Loy
This paper introduces F-LMM, a method for grounding large multimodal models (LMMs) without fine-tuning their parameters. F-LMM exploits the visual grounding capability already present in well-trained LMMs by reading the word-pixel correspondences encoded in their attention weights. A few trainable CNN layers translate these word-pixel attention weights into mask logits, which are then refined by a SAM-based mask refiner (a minimal sketch of this pipeline is given in the code example below). Because the LMM itself stays frozen, F-LMM achieves competitive performance on referring expression segmentation and panoptic narrative grounding benchmarks while preserving the original conversational ability of the LMM. F-LMM can also perform visual chain-of-thought reasoning and is more resistant to object hallucinations.

The paper discusses the limitations of existing grounding LMMs, such as the loss of general knowledge and instruction-following ability, and proposes F-LMM as a solution to these issues. The method is implemented on several open-sourced LMMs, including LLaVA-1.5, LLaVA-Next, MiniGemini, DeepseekVL, and HPT-Air. Across benchmarks, F-LMM achieves competitive results on referring expression segmentation and phrase grounding, attains the best balance between grounding and chat capabilities, and outperforms existing grounding LMMs when both abilities are considered together.

An ablation study on the PNG benchmark shows that a U-Net architecture outperforms a plain CNN for decoding the attention maps. The combination of strong grounding and instruction-following abilities enables F-LMM to perform complex visual perception and reasoning tasks. The paper concludes that F-LMM is a promising approach for grounding LMMs without losing their conversational ability.
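Below is a minimal PyTorch sketch of the pipeline described above: word-pixel attention maps taken from a frozen LMM are stacked into a spatial tensor and decoded by a small trainable CNN into coarse mask logits, which would then be refined by a SAM-based mask refiner. The tensor shapes, module names, and the toy dimensions are illustrative assumptions, not the authors' implementation.

```python
# Sketch of the F-LMM idea: frozen LMM attention -> trainable CNN -> mask logits.
# All names and shapes here are assumptions for illustration only.
import torch
import torch.nn as nn


class AttentionToMask(nn.Module):
    """Trainable CNN that turns stacked word-pixel attention maps into mask logits."""

    def __init__(self, num_maps: int, hidden: int = 64):
        super().__init__()
        self.decoder = nn.Sequential(
            nn.Conv2d(num_maps, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 1, kernel_size=1),  # single-channel coarse mask logits
        )

    def forward(self, attn_maps: torch.Tensor) -> torch.Tensor:
        # attn_maps: (B, num_maps, H, W) -- attention from a grounded word to the
        # image-patch tokens, collected across layers/heads and reshaped to the grid.
        return self.decoder(attn_maps)


def word_to_patch_attention(attn: torch.Tensor, word_idx: int,
                            img_start: int, grid: int) -> torch.Tensor:
    """Slice the attention of one text token over the image-patch tokens and
    reshape it to a (layers * heads, grid, grid) spatial map.

    attn: (layers, heads, seq, seq) attention weights from the frozen LMM.
    """
    layers, heads, _, _ = attn.shape
    w2p = attn[:, :, word_idx, img_start:img_start + grid * grid]  # (L, H, grid*grid)
    return w2p.reshape(layers * heads, grid, grid)


if __name__ == "__main__":
    # Toy dimensions: 4 layers, 8 heads, a 24x24 image-patch grid (hypothetical).
    layers, heads, grid, seq = 4, 8, 24, 700
    attn = torch.rand(layers, heads, seq, seq)          # frozen LMM attention (no grads needed)
    maps = word_to_patch_attention(attn, word_idx=650, img_start=1, grid=grid)

    decoder = AttentionToMask(num_maps=layers * heads)  # only these parameters are trained
    coarse_logits = decoder(maps.unsqueeze(0))          # (1, 1, 24, 24)

    # In F-LMM the coarse mask is further refined by a SAM-based mask refiner,
    # e.g. by upsampling it and using it as a dense prompt; that step is omitted here.
    print(coarse_logits.shape)
```

The design choice mirrors the paper's key point: because only the small attention-to-mask decoder (plain CNN or, per the ablation, a U-Net) and the mask refiner carry trainable parameters, the LMM's weights and therefore its conversational ability remain untouched.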