Xiaohua Zhai*, Basil Mustafa, Alexander Kolesnikov, Lucas Beyer*
This paper introduces a simple pairwise sigmoid loss for Language-Image Pre-training (SigLIP). Unlike standard contrastive learning with softmax normalization, the sigmoid loss operates on image-text pairs independently and does not require a global view of all pairwise similarities for normalization. This both allows scaling to larger batch sizes and improves performance at smaller batch sizes. Combined with Locked-image Tuning, a SigLiT model reaches 84.5% ImageNet zero-shot accuracy in two days using only four TPUv4 chips. Disentangling the batch size from the loss also makes it possible to study the effect of the number of examples versus the number of pairs, and of the negative-to-positive ratio. The batch size is pushed up to one million, but performance saturates at around 32k. Models are released at https://github.com/google-research/big_vision.
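To make the pairwise formulation concrete, here is a minimal sketch of the sigmoid loss in JAX, in the spirit of the pseudocode given in the paper; the function and variable names (sigmoid_loss, img_emb, txt_emb, t_prime, b) are illustrative rather than the exact big_vision implementation.

    import jax
    import jax.numpy as jnp

    def sigmoid_loss(img_emb, txt_emb, t_prime, b):
        # img_emb, txt_emb: [n, d] image/text embeddings for one mini-batch.
        # t_prime, b: learnable log-temperature and bias (scalars).
        n = img_emb.shape[0]
        t = jnp.exp(t_prime)
        zimg = img_emb / jnp.linalg.norm(img_emb, axis=-1, keepdims=True)
        ztxt = txt_emb / jnp.linalg.norm(txt_emb, axis=-1, keepdims=True)
        logits = zimg @ ztxt.T * t + b          # [n, n] pairwise similarities
        # +1 on the diagonal (matching pairs), -1 elsewhere (non-matching pairs).
        labels = 2.0 * jnp.eye(n) - jnp.ones((n, n))
        # Every pair is an independent binary classification problem;
        # no softmax normalization across the batch is needed.
        return -jnp.sum(jax.nn.log_sigmoid(labels * logits)) / n

Because each pair contributes an independent binary term, the loss for one pair does not depend on any other pair in the batch, which is what decouples the loss from the batch size.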
The paper evaluates SigLiT and SigLIP across a range of batch sizes, showing that the sigmoid loss outperforms the softmax loss at small batch sizes and matches it at larger ones. SigLiT reaches 79.7% zero-shot ImageNet accuracy in one day on four TPUv4 chips, while SigLIP reaches 73.4% in five days on 32 chips. For multilingual pre-training covering more than 100 languages, a batch size of 32k again proves sufficient. Because the sigmoid loss is more memory efficient than the softmax loss, it also supports larger batch sizes without additional accelerator resources. The paper further analyzes the learned bias term and the ratio of positive to negative pairs in the sigmoid loss, and finds the loss to be more robust to data noise and effective at smaller batch sizes.
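One reason the bias term matters: in a batch of n pairs only the n matching pairs are positives, while the remaining n² - n pairs are negatives, so the imbalance grows linearly with the batch size (the paper initializes the bias to a large negative value to counteract this imbalance at the start of training). A small illustrative calculation:

    # For a batch of n image-text pairs, only the n "diagonal" pairs are
    # positives; the other n*n - n pairs are negatives, a ratio of n - 1 : 1.
    for n in (512, 32_768, 1_048_576):
        positives = n
        negatives = n * n - n
        print(f"batch {n:>9,}: negatives per positive = {negatives // positives:,}")
    # batch       512: negatives per positive = 511
    # batch    32,768: negatives per positive = 32,767
    # batch 1,048,576: negatives per positive = 1,048,575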
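The memory-efficiency claim above also follows from the pairwise formulation: since no normalization over the full batch is required, the n × n logit matrix never has to be materialized at once. The paper's implementation distributes this computation across devices, swapping text representations between them; the sketch below only illustrates the chunking idea on a single host, and the helper name chunked_sigmoid_loss and the chunk size are assumptions rather than the paper's code.

    import jax
    import jax.numpy as jnp

    def chunked_sigmoid_loss(zimg, ztxt, t, b, chunk=4096):
        # zimg, ztxt: [n, d] L2-normalized image/text embeddings; t, b: scalars.
        # Accumulate the loss one block of text embeddings at a time, so only
        # an [n, chunk] slice of the logit matrix exists in memory at any step.
        n = zimg.shape[0]
        loss = 0.0
        for start in range(0, n, chunk):
            block = ztxt[start:start + chunk]            # [c, d]
            logits = zimg @ block.T * t + b              # [n, c]
            labels = -jnp.ones_like(logits)
            # Positives sit on the global diagonal: row i matches column i - start.
            rows = jnp.arange(start, min(start + chunk, n))
            labels = labels.at[rows, rows - start].set(1.0)
            loss = loss - jnp.sum(jax.nn.log_sigmoid(labels * logits))
        return loss / n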