Xiaohua Zhai*, Basil Mustafa, Alexander Kolesnikov, Lucas Beyer*
The paper introduces a new loss function for language-image pre-training, the pairwise Sigmoid loss (SigLIP), which operates solely on image-text pairs and does not require a global normalization of similarity scores across the batch. This simplifies the distributed implementation, improves efficiency, and enables larger batch sizes. The authors compare the sigmoid loss with the standard softmax (contrastive) loss and find that the sigmoid loss performs better, especially at smaller batch sizes. They show that a batch size of 32k is sufficient for near-optimal performance, and that scaling the batch size up toward one million yields quickly diminishing returns. The paper also studies the impact of batch size, the negative-to-positive pair ratio, and robustness to data noise. The authors release their models and hope to inspire further research into improving language-image pre-training.
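To make the idea concrete, below is a minimal NumPy sketch of the pairwise sigmoid loss for one mini-batch, following the structure of the pseudocode in the paper rather than any released implementation. Every image-text pair in the batch is scored with a sigmoid, with matched pairs as positives and all other pairings as negatives; the function name `sigmoid_loss`, the toy inputs, and the initial values of the learnable temperature and bias are illustrative assumptions.

```python
import numpy as np

def sigmoid_loss(img_emb, txt_emb, t_prime, b):
    """Pairwise sigmoid loss over one mini-batch of image-text pairs.

    img_emb, txt_emb: [n, dim] embeddings from the image and text towers.
    t_prime, b: learnable log-temperature and bias (scalars).
    """
    n = img_emb.shape[0]
    t = np.exp(t_prime)  # temperature
    # L2-normalize both sets of embeddings
    zimg = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    ztxt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    # All pairwise logits within the batch: [n, n]
    logits = zimg @ ztxt.T * t + b
    # +1 on the diagonal (matched image-text pairs), -1 everywhere else
    labels = 2 * np.eye(n) - np.ones((n, n))
    # -log(sigmoid(x)) = log(1 + exp(-x)), computed stably with logaddexp
    loss = np.sum(np.logaddexp(0.0, -labels * logits)) / n
    return loss

# Toy usage with random embeddings (batch of 8, dimension 16)
rng = np.random.default_rng(0)
img = rng.normal(size=(8, 16))
txt = rng.normal(size=(8, 16))
print(sigmoid_loss(img, txt, t_prime=np.log(10.0), b=-10.0))
```

Because every entry of the logit matrix is scored independently, the loss needs no batch-wide softmax normalization, which is what lets the distributed implementation exchange embeddings chunk by chunk instead of materializing a globally normalized similarity matrix.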