This paper introduces a method that lets large language models (LLMs) comprehend visual signals without fine-tuning on multi-modal datasets. The key idea is to treat an image as a "foreign language" and translate it into a sequence of tokens drawn from the LLM's vocabulary using a Vision-to-Language Tokenizer (V2L Tokenizer). The tokenizer employs an encoder-quantizer-decoder structure that uses the LLM's vocabulary as its codebook, converting images into tokens the LLM can process directly. It produces both global tokens, which capture semantic information, and local tokens, which capture detailed patch-level features. With this translation in place, a frozen LLM can perform image recognition, captioning, visual question answering, and image-denoising tasks without any fine-tuning.
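As a rough illustration of the quantization step, the sketch below maps encoder patch features to their nearest embeddings in a frozen LLM vocabulary used as the codebook. The names (`quantize_to_vocab`, `vocab_embed`, `patch_feats`), the cosine-similarity lookup, and the tensor shapes are assumptions for illustration, not the paper's exact procedure.

```python
# Minimal sketch of vocabulary-as-codebook quantization (illustrative only).
# Assumption: `vocab_embed` holds frozen LLM token embeddings projected into the
# image-feature space, and `patch_feats` are encoder outputs for one image.
import torch
import torch.nn.functional as F

def quantize_to_vocab(patch_feats: torch.Tensor, vocab_embed: torch.Tensor):
    """Map each patch feature to its nearest LLM vocabulary embedding.

    patch_feats: (num_patches, dim) local features from the vision encoder.
    vocab_embed: (vocab_size, dim) frozen LLM token embeddings (the codebook).
    Returns the selected token indices and the quantized features.
    """
    # Cosine-style nearest-neighbour lookup over the vocabulary codebook.
    feats = F.normalize(patch_feats, dim=-1)
    codes = F.normalize(vocab_embed, dim=-1)
    sim = feats @ codes.t()                 # (num_patches, vocab_size)
    token_ids = sim.argmax(dim=-1)          # one LLM token per patch
    quantized = vocab_embed[token_ids]      # embeddings passed on to the decoder
    return token_ids, quantized

# Usage with dummy shapes: 256 patches, 512-dim features, 32k-token vocabulary.
ids, q = quantize_to_vocab(torch.randn(256, 512), torch.randn(32000, 512))
```

Because the codebook is the LLM's own vocabulary, the resulting token indices can be fed to the frozen LLM like ordinary text tokens.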
The V2L Tokenizer is trained as a combination of an encoder, a projector, and a decoder: the encoder extracts local and global features from the image, and the decoder reconstructs the image from the resulting tokens. Training combines vector quantization with perceptual and GAN losses so that the selected tokens accurately represent the image. The method is evaluated on image classification, captioning, and denoising tasks, showing superior performance compared to existing methods. These results demonstrate that the V2L Tokenizer enables a frozen LLM to understand and generate visual content without extensive training on multi-modal datasets, offering an efficient and effective way to leverage LLMs for visual signal comprehension and image restoration.
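To make the loss composition concrete, here is a hedged sketch of how such a training objective might be assembled. The helper names (`tokenizer_loss`, `perceptual_fn`, `disc_fn`) and the loss weights are illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch of a tokenizer training objective combining reconstruction,
# VQ commitment, perceptual, and adversarial terms (weights are assumptions).
import torch
import torch.nn.functional as F

def tokenizer_loss(image, recon, feats, quantized,
                   perceptual_fn, disc_fn,
                   beta=0.25, w_percep=1.0, w_gan=0.1):
    """Total loss for one batch; only the encoder/projector/decoder are updated."""
    # Pixel-level reconstruction between the decoded image and the input.
    l_rec = F.l1_loss(recon, image)
    # Commitment term pulls encoder features toward their (detached) codebook
    # entries; the codebook itself is the frozen LLM vocabulary and is not trained.
    l_commit = beta * F.mse_loss(feats, quantized.detach())
    # Perceptual loss (e.g. an LPIPS-style metric) and a GAN generator loss
    # from a patch discriminator, both supplied as callables here.
    l_percep = perceptual_fn(recon, image)
    l_gan = -disc_fn(recon).mean()
    return l_rec + l_commit + w_percep * l_percep + w_gan * l_gan

# Example with stand-in networks (real training would use LPIPS and a discriminator):
# loss = tokenizer_loss(img, recon, feats, q,
#                       perceptual_fn=lambda a, b: F.mse_loss(a, b),
#                       disc_fn=lambda x: x.mean(dim=(1, 2, 3)))
```

The point of the sketch is simply that the gradient flows through the encoder, projector, and decoder while the LLM vocabulary codebook stays frozen, which is what allows the downstream LLM to remain untouched.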