Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception

25 Mar 2024 | Junwen He, Yifan Wang, Lijun Wang, Huchuan Lu, Jun-Yan He, Jin-Peng Lan, Bin Luo, and Xuansong Xie
AnyRef is a multi-modal instruction-tuned large language model (LLM) that generates pixel-level object perceptions and natural-language descriptions from multi-modal references such as text, boxes, images, or audio. It lets users interact with the model through prompts beyond text and regions, without requiring modality-specific designs.

AnyRef introduces a refocusing mechanism that improves grounding mask predictions by exploiting the correlations among generated tokens, strengthening region-level referring. The mechanism uses attention scores to weight these correlations and augments the mask embedding with additional grounded embeddings. Because the attention scores are already intermediate outputs of the self-attention layers, the extra computation introduced by refocusing is minimal.

The model achieves state-of-the-art results across multiple benchmarks, including referring segmentation from diverse modalities and region-level referring expression generation. It is built upon LLaVA-7B and can be fine-tuned efficiently on 8 NVIDIA 32G V100 GPUs, making the method reproducible at a reasonable computational cost.

The paper's contributions are: AnyRef, the first general MLLM capable of producing pixel-level object perceptions as well as region-aware referring descriptions; a simple yet effective refocusing mechanism that enhances grounded mask predictions; and thorough experiments on multiple datasets demonstrating the efficacy of the method, with state-of-the-art performance across a diverse range of multi-modal tasks.
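To make the refocusing idea more concrete, below is a minimal PyTorch-style sketch of one way such a step could be implemented: attention scores taken from the mask token's row of a self-attention layer are used as weights over the hidden states of the generated tokens, and the resulting "grounded" embedding is added to the mask embedding before it goes to the mask decoder. The tensor names, the head-averaged scores, and the fusion by simple addition are illustrative assumptions, not the authors' exact formulation.

```python
import torch

def refocus_mask_embedding(hidden_states: torch.Tensor,
                           attn_scores: torch.Tensor,
                           seg_token_idx: int) -> torch.Tensor:
    """Hedged sketch of a refocusing step (not the paper's exact code).

    hidden_states : (seq_len, dim) final-layer hidden states of generated tokens
    attn_scores   : (seq_len,) attention scores from the mask token's row of a
                    self-attention layer, assumed already averaged over heads
    seg_token_idx : position of the mask/segmentation token in the sequence
    """
    # Mask embedding produced by the LLM for the segmentation token.
    mask_embedding = hidden_states[seg_token_idx]                 # (dim,)

    # Use attention scores as weights over previously generated tokens,
    # excluding the mask token itself; renormalize so the weights sum to 1.
    weights = attn_scores.clone()
    weights[seg_token_idx] = 0.0
    weights = weights / (weights.sum() + 1e-6)

    # Grounded embedding: attention-weighted sum of generated-token states.
    grounded_embedding = (weights.unsqueeze(-1) * hidden_states).sum(dim=0)

    # Enhance the mask embedding with the grounded embedding before passing
    # it to the mask decoder (simple addition is an assumption here).
    return mask_embedding + grounded_embedding


# Toy usage with random tensors, just to show the shapes involved.
seq_len, dim = 16, 4096
hidden = torch.randn(seq_len, dim)
scores = torch.softmax(torch.randn(seq_len), dim=0)
refocused = refocus_mask_embedding(hidden, scores, seg_token_idx=seq_len - 1)
print(refocused.shape)  # torch.Size([4096])
```

Because the weights come from attention scores that the transformer computes anyway, the only added work is the weighted sum and the addition, which is consistent with the paper's claim that the extra cost of refocusing is minimal.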