A Vision Check-up for Language Models


3 Jan 2024 | Pratyusha Sharma, Tamar Rott Shaham, Manel Baradad, Stephanie Fu, Adrián Rodríguez-Muñoz, Shivam Duggal, Phillip Isola, Antonio Torralba
This paper explores the visual capabilities of large language models (LLMs) by evaluating their ability to generate and recognize visual concepts. The study introduces a hierarchical visual aptitude dataset spanning shapes, objects, and scenes, and assesses LLMs on generating, recognizing, and correcting visual concepts through text-based feedback. The results show that LLMs can generate complex visual scenes and recognize visual concepts from code, although they struggle with aspects such as texture and precise shape, and text-based feedback improves the quality of the generated images. Despite never processing visual input directly, LLMs learn visual properties of the real world and produce images whose properties complement traditional procedural generation approaches. These LLM-generated images can be used to train vision models that transfer to semantic tasks on natural images, such as classification and retrieval, yielding effective visual representations for downstream use. Although the generated images still fall short of natural images in realism, the study highlights the potential of LLMs for visual representation learning and image generation.
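To make the generation-and-correction setup concrete, here is a minimal Python sketch of that loop. It assumes a hypothetical `query_llm` helper standing in for any text-only LLM API, and it is not the paper's released code; the LLM is asked to write drawing code, the code is rendered, and text-only feedback drives revision.

```python
# Minimal sketch of the generate-render-improve loop described in the summary.
# `query_llm` is a placeholder for any text-only LLM API (an assumption, not the paper's code).

def query_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to a text-only LLM and return its text reply."""
    raise NotImplementedError("wire this up to your LLM provider of choice")

def render(code: str) -> None:
    """Execute LLM-written drawing code (e.g., matplotlib) that saves its own output image."""
    namespace: dict = {}
    exec(code, namespace)  # the generated code is expected to save e.g. 'scene.png' itself

CONCEPT = "a cozy living room with a sofa, a lamp, and a window"

# 1. Generation: ask the LLM for self-contained code that draws the concept.
code = query_llm(
    f"Write self-contained Python matplotlib code that draws {CONCEPT} "
    "and saves the figure to 'scene.png'."
)
render(code)

# 2. Text-based feedback: critique and revise the drawing code in text only
#    (self-critique here stands in for whatever feedback signal is used).
for _ in range(3):
    feedback = query_llm(
        f"Here is code meant to draw {CONCEPT}:\n{code}\n"
        "List concrete ways the drawing could better match the description."
    )
    code = query_llm(
        "Improve this drawing code according to the feedback.\n"
        f"Code:\n{code}\nFeedback:\n{feedback}\n"
        "Return only the full revised Python code."
    )
    render(code)
```

The key point the sketch illustrates is that both generation and correction happen entirely in text: the model never sees the rendered image, only the code and a textual critique of it.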