From Pixels to Prose: A Large Dataset of Dense Image Captions


14 Jun 2024 | Vasu Singla, Kaiyu Yue, Sukriti Paul, Reza Shirkavand, Mayuka Jayawardhana, Alireza Ganjdanesh, Heng Huang, Abhinav Bhatele, Gowthami Somepalli, Tom Goldstein
PixelProse is a large-scale dataset of more than 16 million synthetic image captions, designed to provide high-quality, detailed descriptions for training vision-language models (VLMs) and diffusion models. It addresses the limitations of existing web-scraped datasets, whose alt-text captions are often noisy, incomplete, or inaccurate. PixelProse uses a state-of-the-art vision-language model to generate detailed and accurate captions, and protects data integrity by filtering out problematic content such as child sexual abuse material (CSAM), personally identifiable information (PII), and toxic text. Each sample also carries metadata such as watermark presence and an aesthetic score, which supports further filtering and analysis.

The images are drawn from three web-scraped sources: CommonPool, CC12M, and RedCaps. Each source contributes images of different quality and aesthetic value: CommonPool supplies a broad range of lower-quality but highly diverse images, while CC12M and RedCaps offer higher-quality, more curated content.

Captions are generated with the Google Gemini 1.0 Pro Vision model, using prompts designed to elicit detailed descriptions of objects, their attributes, spatial relationships, and any text appearing in the image; a sketch of this kind of dense-captioning call is shown below.
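As an illustration only, the snippet below sketches a single dense-captioning request of this kind with the google-generativeai Python SDK. The model name, prompt wording, and image path are placeholders, not the exact prompts or production pipeline used to build PixelProse.

```python
# Illustrative only: one dense-captioning call via the google-generativeai SDK.
# The prompt, model name, and file path are placeholders, not PixelProse's own.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")  # placeholder credential
model = genai.GenerativeModel("gemini-pro-vision")

prompt = (
    "Describe this image in detail. List the salient objects and their "
    "attributes, describe the spatial relationships between them, and "
    "transcribe any visible text."
)

image = Image.open("example.jpg")  # placeholder local image
response = model.generate_content([prompt, image])
print(response.text)
```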
Negative descriptions are also included, to address the difficulty text-to-image models have with negative instructions. To meet safety and ethical standards, PixelProse is filtered extensively: images are screened through multiple commercial APIs, including PhotoDNA, to detect and remove CSAM, and the data is further checked for PII and toxic content. The dataset also contains a substantial amount of text-recognition data, with manual checks to verify accuracy.

Compared with the original alt-text, PixelProse captions are more detailed and more diverse, which makes them easy to repurpose into other data formats such as image captioning and visual question answering (VQA) pairs. PixelProse is therefore a valuable resource for training vision-language models: a large, high-quality dataset of detailed, accurate, and diverse image descriptions that can be used on its own or refactored with an LLM for downstream tasks. Its wide range of image properties and styles makes it suitable for many applications in vision-language research, and the released metadata can be used to subset the data for a particular use case, as sketched below.
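As a minimal sketch of metadata-based filtering, the snippet below streams the dataset from the Hugging Face Hub and keeps only samples with a reasonably high aesthetic score and a low watermark score. The repository ID and the column names ("aesthetic_score", "watermark_class_score", "vlm_caption") are assumptions; consult the dataset card for the actual schema before running this.

```python
# Minimal filtering sketch. Repository ID and column names are assumptions;
# check the PixelProse dataset card for the real schema.
from datasets import load_dataset

ds = load_dataset("tomg-group-umd/pixelprose", split="train", streaming=True)

def keep(example):
    # Keep reasonably aesthetic images that are unlikely to carry a watermark.
    return (
        example["aesthetic_score"] >= 5.0
        and example["watermark_class_score"] < 0.5
    )

for example in ds.filter(keep).take(3):
    print(example["vlm_caption"][:200])
```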