Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception

25 Mar 2024 | Junwen He, Yifan Wang, Lijun Wang, Huchuan Lu, Jun-Yan He, Jin-Peng Lan, Bin Luo, and Xuansong Xie
AnyRef is a multi-modal instruction-tuned large language model (LLM) that generates pixel-level object perceptions and natural-language descriptions from multi-modal references such as text, boxes, images, or audio. It lets users interact with the model through prompts beyond text and regions, without requiring modality-specific designs.

AnyRef introduces a refocusing mechanism that improves grounding mask predictions by exploiting the correlations among generated tokens, strengthening region-level referring. The mechanism uses attention scores to weight these correlations and augments the mask embedding with additional grounded embeddings. Because the attention scores are already intermediate outputs of the self-attention layers, the extra computation introduced by refocusing is minimal.

The model achieves state-of-the-art results across multiple benchmarks, including referring segmentation from diverse modalities and region-level referring expression generation. It is built upon LLaVA-7B and can be fine-tuned efficiently on 8 NVIDIA 32G V100 GPUs, making the method reproducible at a reasonable computational cost.

The paper's contributions are: AnyRef, the first general MLLM capable of producing pixel-level object perceptions as well as region-aware referring descriptions; a simple yet effective refocusing mechanism that enhances grounded mask predictions; and thorough experiments on multiple datasets demonstrating the efficacy of the method, with state-of-the-art performance across a diverse range of multi-modal tasks.
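To make the refocusing idea more concrete, below is a minimal PyTorch-style sketch of one way such a step could be implemented: attention scores taken from the mask token's row of a self-attention layer are used as weights over the hidden states of the generated tokens, and the resulting "grounded" embedding is added to the mask embedding before it goes to the mask decoder. The tensor names, the head-averaged scores, and the fusion by simple addition are illustrative assumptions, not the authors' exact formulation.

```python
import torch

def refocus_mask_embedding(hidden_states: torch.Tensor,
                           attn_scores: torch.Tensor,
                           seg_token_idx: int) -> torch.Tensor:
    """Hedged sketch of a refocusing step (not the paper's exact code).

    hidden_states : (seq_len, dim) final-layer hidden states of generated tokens
    attn_scores   : (seq_len,) attention scores from the mask token's row of a
                    self-attention layer, assumed already averaged over heads
    seg_token_idx : position of the mask/segmentation token in the sequence
    """
    # Mask embedding produced by the LLM for the segmentation token.
    mask_embedding = hidden_states[seg_token_idx]                 # (dim,)

    # Use attention scores as weights over previously generated tokens,
    # excluding the mask token itself; renormalize so the weights sum to 1.
    weights = attn_scores.clone()
    weights[seg_token_idx] = 0.0
    weights = weights / (weights.sum() + 1e-6)

    # Grounded embedding: attention-weighted sum of generated-token states.
    grounded_embedding = (weights.unsqueeze(-1) * hidden_states).sum(dim=0)

    # Enhance the mask embedding with the grounded embedding before passing
    # it to the mask decoder (simple addition is an assumption here).
    return mask_embedding + grounded_embedding


# Toy usage with random tensors, just to show the shapes involved.
seq_len, dim = 16, 4096
hidden = torch.randn(seq_len, dim)
scores = torch.softmax(torch.randn(seq_len), dim=0)
refocused = refocus_mask_embedding(hidden, scores, seg_token_idx=seq_len - 1)
print(refocused.shape)  # torch.Size([4096])
```

Because the weights come from attention scores that the transformer computes anyway, the only added work is the weighted sum and the addition, which is consistent with the paper's claim that the extra cost of refocusing is minimal.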