This paper introduces Morph-Tokens to resolve the conflict between visual comprehension and generation in multimodal large language models (MLLMs): comprehension requires abstracting visual features, while generation requires preserving visual details. Morph-Tokens serve a dual purpose, acting as abstract visual prompts for comprehension and as complete visual tokens for image reconstruction. A three-stage training strategy realizes this. In Stage 1, the model is trained on image-text pairs to expand the token vocabulary. In Stage 2, the model is trained to auto-encode morph-tokens, enabling both comprehension and generation. In Stage 3, the model is instruction-tuned to strengthen its capabilities in complex scenarios.
The resulting model achieves state-of-the-art performance on both comprehension and generation benchmarks, surpassing existing MLLMs, with notable gains in multi-turn image editing and in-context learning, and it strongly preserves image fidelity during generation. The project is available at https://github.com/DCDmllm/MorphTokens.