From Pixels to Prose: A Large Dataset of Dense Image Captions


14 Jun 2024 | Vasu Singla, Kaiyu Yue, Sukriti Paul, Reza Shirkavand, Mayuka Jayawardhana, Alireza Ganjdanesh, Heng Huang, Abhinav Bhatele, Gowthami Somepalli, Tom Goldstein
The paper introduces PixelProse, a dataset of over 16 million synthetically generated image captions designed to bridge the gap between existing web-scraped datasets and high-quality, detailed image descriptions. The captions are generated with advanced vision-language models, yielding detailed and accurate descriptions. The dataset is rigorously filtered to remove problematic content, including child sexual abuse material (CSAM), personally identifiable information (PII), and toxic text, and it also provides useful metadata such as watermark presence and aesthetic scores. PixelProse is intended for pre-training, image captioning, and reformatting into other formats such as visual question answering (VQA) pairs and instruction-tuning data. The paper discusses the dataset's creation, image sources, captioning process, ethical considerations, and related work, highlighting its potential as a valuable resource for future vision-language research.
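
For readers who want to work with the dataset directly, a minimal sketch of loading PixelProse and filtering on its metadata is shown below. It assumes the dataset is hosted on the Hugging Face Hub under the ID tomg-group-umd/pixelprose and uses the datasets library in streaming mode; the column names vlm_caption, aesthetic_score, and watermark_class_id are illustrative assumptions and should be checked against the released schema.

    from itertools import islice
    from datasets import load_dataset

    # Stream the dataset rather than downloading all ~16M records up front.
    ds = load_dataset("tomg-group-umd/pixelprose", split="train", streaming=True)

    # Inspect a handful of records and apply a simple metadata filter.
    for example in islice(ds, 5):
        caption = example.get("vlm_caption")           # dense synthetic caption (assumed field name)
        score = example.get("aesthetic_score")         # aesthetic metadata (assumed field name)
        watermark = example.get("watermark_class_id")  # watermark flag (assumed field name)
        # Keep only high-aesthetic, watermark-free samples; the 5.0 threshold is arbitrary.
        if caption and score is not None and score > 5.0 and watermark == 0:
            print(caption[:200])

Streaming keeps the example lightweight: it avoids committing disk space to the full dataset before deciding which filters or subsets are actually needed.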