Matryoshka Multimodal Models

29 Jul 2024 | Mu Cai, Jianwei Yang, Jianfeng Gao, Yong Jae Lee
Matryoshka Multimodal Models (M³) are introduced to address the inefficiency of large multimodal models (LMMs) in handling dense visual content. Traditional LMMs embed each image into a fixed, typically large number of visual tokens, which is wasteful for high-resolution images and long videos. M³ instead learns nested sets of visual tokens that capture information at multiple coarse-to-fine granularities, enabling explicit control over visual granularity per test instance during inference, so the number of tokens can be adjusted to match the complexity of the input. M³ also provides a framework for analyzing the granularity that different datasets actually require, showing that many benchmarks can be handled with as few as 9 visual tokens, and it offers a foundation for exploring the optimal trade-off between performance and visual token length. Experiments demonstrate that M³ matches or exceeds the performance of existing methods while being significantly more efficient. The model is trained with the same architecture and data as LLaVA-1.5 and LLaVA-NeXT, and it is publicly available for further research.
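To make the nested-token idea concrete, the sketch below is a minimal, hypothetical illustration rather than the authors' released implementation. It assumes a LLaVA-style 24×24 grid of CLIP patch features and average-pools the grid at successively coarser scales to produce nested token sets of 576, 144, 36, 9, and 1 tokens; the function name and scale choices are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def nested_visual_tokens(patch_tokens, scales=(24, 12, 6, 3, 1)):
    """Pool a (B, N, D) grid of visual patch tokens into nested granularities.

    Hypothetical sketch: assumes N is a perfect square (e.g. 576 = 24x24,
    as in LLaVA-1.5) and uses 2D average pooling to form coarser token sets.
    Returns a dict mapping token count (s*s) -> (B, s*s, D) tensors.
    """
    B, N, D = patch_tokens.shape
    side = int(N ** 0.5)  # 24 for 576 tokens
    # Reshape the flat token sequence back into a 2D feature grid: (B, D, 24, 24)
    grid = patch_tokens.view(B, side, side, D).permute(0, 3, 1, 2)
    nested = {}
    for s in scales:
        pooled = F.adaptive_avg_pool2d(grid, output_size=s)   # (B, D, s, s)
        nested[s * s] = pooled.flatten(2).transpose(1, 2)      # (B, s*s, D)
    return nested

# Usage: choose a granularity at inference time based on the input's complexity.
tokens = torch.randn(2, 576, 1024)        # a batch of CLIP visual features
sets = nested_visual_tokens(tokens)
coarse = sets[9]                          # feed only 9 visual tokens to the LLM
```

In this coarse-to-fine layout, each smaller token set is a pooled summary of the finer one, so a single trained model can serve requests at any of the nested granularities without retraining.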