Lumen: Unleashing Versatile Vision-Centric Capabilities of Large Multimodal Models

28 May 2024 | Yang Jiao, Shaoxiang Chen, Zequn Jie, Jingjing Chen, Lin Ma, Yu-Gang Jiang
The paper introduces Lumen, a novel Large Multimodal Model (LMM) architecture designed to unleash versatile vision-centric capabilities. LMMs have shown significant potential across many fields, but current methods typically adapt visual task outputs to language-oriented formats, which overlooks the intrinsic characteristics of diverse visual tasks and hinders the learning of perception capabilities. To address this issue, Lumen decouples the learning of perception capabilities into task-agnostic and task-specific stages. In the first stage, Lumen promotes fine-grained vision-language concept alignment and produces a shared representation for all vision-centric tasks. In the second stage, this shared representation is flexibly routed to lightweight task decoders for task-specific decoding. Experimental results on multiple vision-centric and VQA benchmarks demonstrate that Lumen matches or surpasses existing LMM-based approaches while preserving general visual understanding and instruction-following capabilities. Lumen's contributions include its ability to adapt seamlessly to various vision-centric tasks without specialized dialogue datasets, and its strong performance on fundamental vision tasks while retaining versatile capabilities.
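To make the two-stage decoupling more concrete, the following is a minimal PyTorch-style sketch, not the paper's actual implementation. The module names (SharedAlignmentHead, LightweightTaskDecoders), the per-patch alignment map used as the shared representation, and the specific decoder heads are illustrative assumptions; only the overall structure (a task-agnostic alignment stage whose output is routed to small task-specific decoders) follows the description above.

```python
# Sketch of Lumen's two-stage decoupling (assumed interfaces, not the official code).
import torch
import torch.nn as nn


class SharedAlignmentHead(nn.Module):
    """Stage 1 (task-agnostic): align LMM hidden states with image patch features
    into one shared representation reused by every vision-centric task."""

    def __init__(self, dim: int):
        super().__init__()
        self.query_proj = nn.Linear(dim, dim)
        self.key_proj = nn.Linear(dim, dim)

    def forward(self, lmm_hidden: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # lmm_hidden: (B, 1, D) instruction-conditioned token from the LMM
        # image_feats: (B, HW, D) patch features from the vision encoder
        q = self.query_proj(lmm_hidden)                       # (B, 1, D)
        k = self.key_proj(image_feats)                        # (B, HW, D)
        # Fine-grained vision-language alignment as a per-patch score map.
        return torch.softmax(q @ k.transpose(1, 2) / k.shape[-1] ** 0.5, dim=-1)  # (B, 1, HW)


class LightweightTaskDecoders(nn.Module):
    """Stage 2 (task-specific): route the shared representation to small heads."""

    def __init__(self, num_patches: int):
        super().__init__()
        self.decoders = nn.ModuleDict({
            "detection": nn.Linear(num_patches, 4),               # box coordinates
            "segmentation": nn.Linear(num_patches, num_patches),  # coarse mask logits
        })

    def forward(self, shared_rep: torch.Tensor, task: str) -> torch.Tensor:
        return self.decoders[task](shared_rep.squeeze(1))


# Usage: the same shared representation feeds whichever decoder the instruction requires.
B, HW, D = 2, 196, 256
align = SharedAlignmentHead(D)
decoders = LightweightTaskDecoders(HW)
shared = align(torch.randn(B, 1, D), torch.randn(B, HW, D))
boxes = decoders(shared, "detection")       # (B, 4)
masks = decoders(shared, "segmentation")    # (B, 196)
```

The key design point illustrated here is that only the small heads in stage 2 are task-specific; everything up to and including the shared representation is trained once and reused, which is what lets the model add vision-centric tasks without task-specialized dialogue data.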