The paper introduces PixelProse, a comprehensive dataset of over 16 million synthetically generated image captions designed to bridge the gap between existing web-scraped datasets and high-quality, detailed image descriptions. The captions are generated with advanced vision-language models, yielding detailed and accurate descriptions. The dataset is rigorously filtered to remove problematic content, including child sexual abuse material (CSAM), personally identifiable information (PII), and toxicity, and it provides valuable metadata such as watermark presence and aesthetic scores. PixelProse is intended for pre-training, image captioning, and reformatting into other formats such as visual question answering (VQA) and instruction-tuning data. The paper discusses the dataset's creation, image sources, captioning process, ethical considerations, and related work, highlighting its potential as a valuable resource for future vision-language research.
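Since the summary notes that PixelProse is meant for pre-training and reformatting into other formats, a minimal sketch of how one might stream the dataset with the Hugging Face `datasets` library is shown below. The repository id and the field names (`url`, `vlm_caption`) are assumptions for illustration, not confirmed by this summary; check the dataset card for the actual schema.

```python
# Minimal sketch: streaming PixelProse captions for inspection or pre-training.
# Assumptions: the Hugging Face repo id "tomg-group-umd/pixelprose" and the
# field names "url" and "vlm_caption" are illustrative guesses about the schema.
from datasets import load_dataset

# streaming=True avoids downloading all 16M+ records up front
ds = load_dataset("tomg-group-umd/pixelprose", split="train", streaming=True)

for i, example in enumerate(ds):
    # Print the image URL and a truncated caption for a quick sanity check
    print(example.get("url"), str(example.get("vlm_caption"))[:120])
    if i >= 4:  # look at the first five examples only
        break
```

From a stream like this, the caption text could then be rewritten into VQA pairs or instruction-following examples, as the paper suggests.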