Towards Semantic Equivalence of Tokenization in Multimodal LLM

27 Jun 2024 | Shengqiong Wu, Hao Fei, Xiangtai Li, Jiayi Ji, Hanwang Zhang, Tat-Seng Chua, Shuicheng Yan
This paper proposes a novel dynamic Semantic-Equivalent Vision Tokenizer (SeTok) to enhance the semantic alignment between vision and language in Multimodal Large Language Models (MLLMs). Existing vision tokenizers, which fragment visual input into fixed patches or codebook entries, cause semantic fragmentation and loss of high-frequency visual information. SeTok addresses this by dynamically clustering visual features into semantic units, allowing the number of tokens to vary with image complexity. This approach preserves semantic integrity and captures both low- and high-frequency visual features.

The proposed MLLM, SETOKIM, is trained with SeTok and demonstrates superior performance across various tasks, including image understanding, generation, segmentation, and editing. The dynamic clustering mechanism enables flexible tokenization that adapts to different image complexities and semantic structures, and the tokenizer is integrated with a vision decoder and a mask decoder to generate realistic images and semantic masks. Experimental results show that SeTok significantly improves vision-language alignment, enhances interpretability, and reduces training and inference costs. The paper also provides a comprehensive analysis of how different clustering strategies affect model performance, highlighting the effectiveness of dynamic clustering in achieving semantic equivalence between vision and language tokens.
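To make the dynamic clustering idea concrete, below is a minimal PyTorch sketch of how a SeTok-style tokenizer could group patch features into a variable number of semantic tokens. This is an illustrative approximation rather than the authors' implementation: the greedy cosine-threshold clustering, the class name SemanticTokenizer, and the threshold tau are assumptions introduced here for clarity.

```python
# Minimal sketch (not the paper's code) of a SeTok-style dynamic vision tokenizer.
# Assumption: patch features come from a frozen vision encoder (e.g., a ViT), and
# cluster anchors are chosen greedily with a cosine-similarity threshold `tau`,
# so the number of semantic tokens varies with image complexity.
import torch
import torch.nn.functional as F


class SemanticTokenizer(torch.nn.Module):
    def __init__(self, dim: int, tau: float = 0.7):
        super().__init__()
        self.tau = tau                          # similarity threshold controlling cluster granularity
        self.proj = torch.nn.Linear(dim, dim)   # project pooled clusters into the LLM token space

    @torch.no_grad()
    def _assign_clusters(self, feats: torch.Tensor) -> torch.Tensor:
        """Greedy clustering: a patch starts a new cluster if it is not
        sufficiently similar (cosine > tau) to any existing cluster center."""
        normed = F.normalize(feats, dim=-1)               # [N, D]
        centers, labels = [], torch.zeros(len(feats), dtype=torch.long)
        for i, f in enumerate(normed):
            if centers:
                sims = torch.stack(centers) @ f           # cosine similarity to current centers
                best = sims.argmax()
                if sims[best] > self.tau:
                    labels[i] = best
                    continue
            centers.append(f)                             # open a new semantic cluster
            labels[i] = len(centers) - 1
        return labels

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        """feats: [N, D] patch features -> [K, D] semantic tokens, where K varies per image."""
        labels = self._assign_clusters(feats)
        k = int(labels.max()) + 1
        tokens = torch.stack([feats[labels == c].mean(dim=0) for c in range(k)])
        return self.proj(tokens)


# Usage: 196 ViT patch features of width 768 -> a handful of semantic tokens.
patch_feats = torch.randn(196, 768)
tokenizer = SemanticTokenizer(dim=768, tau=0.7)
semantic_tokens = tokenizer(patch_feats)
print(semantic_tokens.shape)  # torch.Size([K, 768]); K depends on image complexity
```

In such a design, the resulting variable-length token sequence would be interleaved with text tokens and fed to the LLM, while the pooled cluster features could also condition a vision decoder or mask decoder, matching the architecture described above at a high level.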
The proposed method is evaluated on multiple benchmarks, demonstrating its superiority in visual understanding, generation, and segmentation tasks. The results indicate that SeTok enables more accurate and detailed visual segmentation and enhances the model's ability to follow instructions for image editing. The approach is also shown to be effective in generating high-fidelity images and accurately identifying objects in complex scenes. Overall, the paper presents a significant advancement in vision tokenization for MLLMs, enabling more effective and accurate multimodal understanding and generation.