Getting it Right: Improving Spatial Consistency in Text-to-Image Models

6 Aug 2024 | Agneet Chatterjee *1, Gabriela Ben Melech Stan *2, Estelle Aflalo2, Sayak Paul3, Dhruba Ghosh1, Tejas Gokhale5, Ludwig Schmidt4, Hannaneh Hajishirzi4, Vasudev Lal2, Chitta Baral1, and Yezhou Yang1
The paper "Getting it Right: Improving Spatial Consistency in Text-to-Image Models" addresses the key shortcoming of current text-to-image (T2I) models, which is their inability to consistently generate images that accurately follow the spatial relationships specified in the text prompt. The authors investigate this limitation and develop datasets and methods to improve spatial reasoning in T2I models. They find that spatial relationships are underrepresented in existing vision-language datasets and create SPRIGHT, a large-scale dataset focused on spatially focused captions. By re-captioning 6 million images from four widely used vision datasets, SPRIGHT significantly improves the proportion of spatial relationships in existing datasets. The authors demonstrate the efficacy of SPRIGHT by showing that using only 0.25% of SPRIGHT results in a 22% improvement in generating spatially accurate images, as well as improvements in FID and CMMD scores. They also show that training on images with a larger number of objects leads to substantial improvements in spatial consistency, achieving state-of-the-art results on the T2I-CompBench benchmark with a spatial score of 0.2133. The paper includes a comprehensive evaluation and analysis of the generated captions, as well as ablation studies and analyses to understand the factors affecting spatial consistency in T2I models.The paper "Getting it Right: Improving Spatial Consistency in Text-to-Image Models" addresses the key shortcoming of current text-to-image (T2I) models, which is their inability to consistently generate images that accurately follow the spatial relationships specified in the text prompt. The authors investigate this limitation and develop datasets and methods to improve spatial reasoning in T2I models. They find that spatial relationships are underrepresented in existing vision-language datasets and create SPRIGHT, a large-scale dataset focused on spatially focused captions. By re-captioning 6 million images from four widely used vision datasets, SPRIGHT significantly improves the proportion of spatial relationships in existing datasets. The authors demonstrate the efficacy of SPRIGHT by showing that using only 0.25% of SPRIGHT results in a 22% improvement in generating spatially accurate images, as well as improvements in FID and CMMD scores. They also show that training on images with a larger number of objects leads to substantial improvements in spatial consistency, achieving state-of-the-art results on the T2I-CompBench benchmark with a spatial score of 0.2133. The paper includes a comprehensive evaluation and analysis of the generated captions, as well as ablation studies and analyses to understand the factors affecting spatial consistency in T2I models.