Towards Semantic Equivalence of Tokenization in Multimodal LLM

27 Jun 2024 | Shengqiong Wu, Hao Fei, Xiangtai Li, Jiayi Ji, Hanwang Zhang, Tat-Seng Chua, Shuicheng Yan
This paper addresses the challenge of semantic alignment between vision and language in Multimodal Large Language Models (MLLMs) by proposing a dynamic Semantic-Equivalent Vision Tokenizer (SeTok). SeTok groups visual features into semantic units with a dynamic clustering algorithm, so the number of tokens is flexible and adapts to the complexity of each image. This approach preserves semantic integrity and captures both low-frequency and high-frequency visual features. The proposed MLLM, SETOKIM, equipped with SeTok, demonstrates superior performance across understanding, generation, segmentation, and editing tasks. The paper also describes the methodology, including the clustering mechanism and a three-stage training procedure. Experimental results show that SETOKIM outperforms existing MLLMs on multiple benchmarks, highlighting its effectiveness in improving the alignment between vision and language. The paper concludes by discussing potential limitations and future directions, such as model scale, image resolution, and hallucination issues.
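
The summary only names the clustering idea, so the sketch below illustrates, under assumptions, how a dynamic tokenizer of this kind could group ViT patch features into a variable number of pooled "semantic" tokens. The similarity threshold, the greedy centroid assignment, and the function name `dynamic_cluster_tokens` are illustrative choices and not the paper's actual SeTok mechanism.

```python
# Hypothetical sketch of dynamic clustering for vision tokenization:
# group patch features into a variable number of clusters and pool each
# cluster into one "semantic" token. Threshold and greedy assignment are
# illustrative, not the paper's exact algorithm.
import torch
import torch.nn.functional as F

def dynamic_cluster_tokens(patch_feats: torch.Tensor, sim_threshold: float = 0.8):
    """patch_feats: [N, D] patch features -> ([K, D] pooled tokens, [N] cluster ids).

    K varies per image: visually complex images tend to produce more clusters.
    """
    feats = F.normalize(patch_feats, dim=-1)           # work in cosine-similarity space
    centers = []                                       # running cluster centroids
    assignments = torch.full((feats.size(0),), -1, dtype=torch.long)

    for i, f in enumerate(feats):
        if centers:
            sims = torch.stack(centers) @ f            # similarity to each centroid
            best = int(sims.argmax())
            if sims[best] >= sim_threshold:            # close enough: join that cluster
                assignments[i] = best
                members = patch_feats[assignments == best]
                centers[best] = F.normalize(members.mean(0), dim=-1)
                continue
        assignments[i] = len(centers)                  # otherwise start a new cluster
        centers.append(f)

    # Pool the raw features of each cluster into one semantic token.
    tokens = torch.stack([patch_feats[assignments == k].mean(0)
                          for k in range(len(centers))])
    return tokens, assignments

# Usage example: the number of output tokens depends on the image content,
# not on a fixed grid size.
feats = torch.randn(196, 768)                          # e.g. 14x14 ViT patch features
tokens, assign = dynamic_cluster_tokens(feats)
print(tokens.shape)                                    # [K, 768], K determined per image
```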