28 May 2024 | Yang Jiao, Shaoxiang Chen, Zequn Jie, Jingjing Chen, Lin Ma, Yu-Gang Jiang
Lumen is a large multimodal model that enhances vision-centric capabilities by decoupling task-agnostic and task-specific learning. It first aligns visual and language concepts into a shared representation common to a range of vision tasks; lightweight task-specific decoders then consume this representation, letting the model handle diverse tasks such as object detection, instance segmentation, and pose estimation. Lumen achieves state-of-the-art performance on multiple vision-centric benchmarks while preserving general visual understanding and instruction-following ability. Its design adapts to varied tasks without requiring task-specialized datasets, and it outperforms existing methods on tasks such as object detection and visual grounding. By focusing on fundamental vision capabilities, Lumen addresses the limitations of prior approaches on complex visual scenarios, and extensive experiments across benchmarks demonstrate its versatility and strong performance.
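To make the decoupled design concrete, here is a minimal PyTorch sketch of the idea described above: one task-agnostic stage aligns visual and language features into a shared representation, and separate lightweight decoders turn that representation into task-specific outputs. All class and variable names are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of Lumen's decoupling (illustrative names, not the paper's code).

class TaskAgnosticAligner(nn.Module):
    """Task-agnostic stage: fuse image and instruction features into one shared representation."""
    def __init__(self, dim=256):
        super().__init__()
        self.fuse = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, visual_tokens, text_tokens):
        # Cross-attend vision to language so the output encodes both modalities.
        shared, _ = self.fuse(visual_tokens, text_tokens, text_tokens)
        return shared  # (B, N, dim), reused by every task decoder

class BoxDecoder(nn.Module):
    """Lightweight task-specific head: shared representation -> box coordinates."""
    def __init__(self, dim=256):
        super().__init__()
        self.head = nn.Linear(dim, 4)  # (cx, cy, w, h), normalized

    def forward(self, shared):
        return self.head(shared).sigmoid()

class MaskDecoder(nn.Module):
    """Lightweight task-specific head: shared representation -> flattened mask logits."""
    def __init__(self, dim=256, mask_size=64 * 64):
        super().__init__()
        self.head = nn.Linear(dim, mask_size)

    def forward(self, shared):
        return self.head(shared)

# Usage: one shared alignment stage, many cheap task-specific decoders.
aligner = TaskAgnosticAligner()
decoders = {"detection": BoxDecoder(), "segmentation": MaskDecoder()}

visual = torch.randn(1, 196, 256)  # e.g. 14x14 image patch tokens
text = torch.randn(1, 32, 256)     # encoded instruction tokens
shared = aligner(visual, text)     # task-agnostic representation
boxes = decoders["detection"](shared)
masks = decoders["segmentation"](shared)
```

The design point this sketch captures is that only the small heads differ per task, so adding a new vision task means attaching another cheap decoder rather than retraining the alignment stage.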