A Vision Check-up for Language Models


3 Jan 2024 | Pratyusha Sharma, Tamar Rott Shaham, Manel Baradad, Stephanie Fu, Adrián Rodríguez-Muñoz, Shivam Duggal, Phillip Isola, Antonio Torralba
This paper explores the visual capabilities of large language models (LLMs) by evaluating their ability to generate and recognize visual concepts. The study introduces a hierarchical visual aptitude dataset spanning shapes, objects, and scenes, and assesses LLMs on generating, recognizing, and correcting visual concepts through text-based feedback. The results show that LLMs can generate complex visual scenes and recognize visual concepts from code, although they struggle with aspects such as texture and precise shape, and text-based feedback improves the quality of the generated images. Despite never processing visual input directly, LLMs learn visual properties of the real world and produce images whose properties complement traditional procedural generation approaches. These LLM-generated images can be used to train vision models that transfer to semantic tasks on natural images, such as classification and retrieval, yielding effective visual representations for downstream use. Although the generated images still fall short of natural images in realism, the study highlights the potential of LLMs for visual representation learning and image generation.
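To make the generation-and-correction setup concrete, here is a minimal Python sketch of that loop. It assumes a hypothetical `query_llm` helper standing in for any text-only LLM API, and it is not the paper's released code; the LLM is asked to write drawing code, the code is rendered, and text-only feedback drives revision.

```python
# Minimal sketch of the generate-render-improve loop described in the summary.
# `query_llm` is a placeholder for any text-only LLM API (an assumption, not the paper's code).

def query_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to a text-only LLM and return its text reply."""
    raise NotImplementedError("wire this up to your LLM provider of choice")

def render(code: str) -> None:
    """Execute LLM-written drawing code (e.g., matplotlib) that saves its own output image."""
    namespace: dict = {}
    exec(code, namespace)  # the generated code is expected to save e.g. 'scene.png' itself

CONCEPT = "a cozy living room with a sofa, a lamp, and a window"

# 1. Generation: ask the LLM for self-contained code that draws the concept.
code = query_llm(
    f"Write self-contained Python matplotlib code that draws {CONCEPT} "
    "and saves the figure to 'scene.png'."
)
render(code)

# 2. Text-based feedback: critique and revise the drawing code in text only
#    (self-critique here stands in for whatever feedback signal is used).
for _ in range(3):
    feedback = query_llm(
        f"Here is code meant to draw {CONCEPT}:\n{code}\n"
        "List concrete ways the drawing could better match the description."
    )
    code = query_llm(
        "Improve this drawing code according to the feedback.\n"
        f"Code:\n{code}\nFeedback:\n{feedback}\n"
        "Return only the full revised Python code."
    )
    render(code)
```

The key point the sketch illustrates is that both generation and correction happen entirely in text: the model never sees the rendered image, only the code and a textual critique of it.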