Beyond Text: Frozen Large Language Models in Visual Signal Comprehension

12 Mar 2024 | Lei Zhu, Fangyun Wei, Yanye Lu
This paper explores the potential of a large language model (LLM) to directly comprehend visual signals without the need for fine-tuning on multi-modal datasets. The authors introduce the Vision-to-Language Tokenizer (V2L Tokenizer), which transforms images into a set of discrete tokens derived from the LLM's vocabulary. This method views images as linguistic entities and translates them into a "foreign language" using an encoder-decoder structure, a CLIP model, and the LLM's vocabulary. The V2L Tokenizer enables the LLM to perform visual comprehension tasks such as image recognition, captioning, and visual question answering, as well as image denoising tasks like inpainting, outpainting, deblurring, and shift restoration. The paper includes rigorous experiments and comparisons with prior methods, demonstrating the effectiveness of the proposed approach. The code and models are available at <https://github.com/zj460045050/V2L-Tokenizer>.
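The core idea of mapping image content onto tokens from a language model's vocabulary can be illustrated with a toy sketch. It assumes that each image patch has already been encoded as a feature vector and that each vocabulary word has an embedding in the same space, and it picks the closest word by cosine similarity. The function names, the three-word vocabulary, and the 2-D embeddings are all hypothetical and chosen for illustration; the summary above does not specify the actual quantization procedure, so this is not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def quantize_to_vocab(patch_features, vocab_embeddings):
    """Replace each patch feature with the vocabulary token whose embedding is most similar.

    patch_features: list of feature vectors, one per image patch (assumed precomputed).
    vocab_embeddings: dict mapping token string -> embedding vector (hypothetical values).
    """
    tokens = []
    for feat in patch_features:
        best = max(vocab_embeddings, key=lambda tok: cosine(feat, vocab_embeddings[tok]))
        tokens.append(best)
    return tokens


# Made-up 2-D embeddings standing in for a real vocabulary embedding table.
vocab = {"sky": [0.0, 1.0], "grass": [1.0, 0.0], "dog": [0.7, 0.7]}

# Made-up patch features standing in for the output of an image encoder.
patches = [[0.1, 0.9], [0.9, 0.2], [0.6, 0.6]]

image_tokens = quantize_to_vocab(patches, vocab)
print(image_tokens)  # ['sky', 'grass', 'dog']
```

The resulting token list is ordinary text from the model's point of view, which is what would let a frozen LLM consume it in a prompt without any multi-modal fine-tuning.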