6 Aug 2024 | Agneet Chatterjee, Gabriela Ben Melech Stan, Estelle Aflalo, Sayak Paul, Dhruva Ghosh, Tejas Gokhale, Ludwig Schmidt, Hannaneh Hajishirzi, Vasudev Lal, Chitta Baral, Yezhou Yang
This paper presents SPRIGHT, a new large-scale, spatially focused vision-language dataset designed to improve spatial consistency in text-to-image (T2I) models. Current T2I models struggle to generate images that accurately reflect the spatial relationships described in text prompts. SPRIGHT addresses this by re-captioning 6 million images from four widely used vision datasets with captions that emphasize spatial relationships. Compared with the original captions, SPRIGHT captions are of higher quality, contain a significantly higher proportion of spatial relationships, are longer and more linguistically diverse, and capture more object occurrences.

Fine-tuning on only 0.25% of SPRIGHT yields a 22% improvement in spatial accuracy on T2I-CompBench, along with improvements in FID and CMMD scores. The authors also find that training on images containing a larger number of objects further improves spatial consistency, achieving state-of-the-art results on T2I-CompBench with a spatial score of 0.2133.

Through controlled experiments and ablations, the authors document additional findings to support future research on spatial consistency in T2I models: longer spatial captions lead to better spatial consistency; fine-tuning the CLIP text encoder on spatial captions improves its ability to represent spatial relationships; and training on captions containing negations improves spatial consistency, although performance drops when evaluating on prompts that contain only negations.
Overall, the paper contributes a new dataset and methods to improve spatial reasoning in T2I models, with significant improvements in spatial consistency and image fidelity.
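The proportion of captions that express an explicit spatial relationship is the paper's key dataset-level statistic. As a rough illustration only, and not the authors' implementation, here is a minimal Python sketch of such a measurement; the phrase list, the spatial_proportion function, and the example captions are assumptions made for this sketch.

```python
import re

# Illustrative list of spatial relationship phrases (an assumption;
# the paper's exact phrase inventory is not reproduced here).
SPATIAL_PHRASES = [
    "left of", "right of", "above", "below", "behind", "in front of",
    "on top of", "under", "next to", "beside", "between", "near",
]

_pattern = re.compile(
    r"\b(" + "|".join(re.escape(p) for p in SPATIAL_PHRASES) + r")\b",
    flags=re.IGNORECASE,
)


def spatial_proportion(captions):
    """Return the fraction of captions mentioning at least one spatial phrase."""
    if not captions:
        return 0.0
    hits = sum(1 for caption in captions if _pattern.search(caption))
    return hits / len(captions)


if __name__ == "__main__":
    # Toy examples: a generic caption set vs. a spatially re-captioned set.
    original = ["A dog and a ball in a park.", "A man riding a horse."]
    recaptioned = [
        "A dog sits to the left of a red ball, with trees behind them.",
        "A man rides a horse; a fence is in front of the horse.",
    ]
    print(f"original:     {spatial_proportion(original):.2f}")
    print(f"re-captioned: {spatial_proportion(recaptioned):.2f}")
```

A phrase-match heuristic like this only approximates spatial coverage in the text itself; spatial accuracy of the generated images is what benchmarks such as T2I-CompBench measure.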