18 May 2022 | Tianyu Gao, Xingcheng Yao, Danqi Chen
SimCSE is a simple contrastive learning framework that significantly improves state-of-the-art sentence embeddings, and it comes in both an unsupervised and a supervised variant.

The unsupervised approach predicts the input sentence itself, using standard dropout as the only noise: the same sentence is encoded twice with different dropout masks, and the two encodings form a positive pair. This simple method achieves performance comparable to earlier supervised methods.

The supervised approach incorporates annotated pairs from natural language inference (NLI) datasets, using entailment pairs as positives and contradiction pairs as hard negatives.

On standard semantic textual similarity (STS) tasks, SimCSE with BERT_base reaches an average Spearman's correlation of 76.3% (unsupervised) and 81.6% (supervised), a 4.2% and 2.2% improvement over previous best results.

Analysis shows that the contrastive objective regularizes pre-trained embeddings to be more uniformly distributed and, when supervised signals are available, better aligns positive pairs. It also flattens the singular value distribution of the sentence embeddings, making the representation space more isotropic. SimCSE performs well across a range of STS and transfer tasks, with consistent results across settings, and the framework's simplicity suggests potential for broader applications in NLP.
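The unsupervised objective described above can be sketched in a few lines of PyTorch. This is a minimal, illustrative sketch rather than the authors' released code: it assumes a Hugging Face encoder such as `bert-base-uncased`, uses the `[CLS]` vector as the sentence embedding, and all function and variable names are my own.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
encoder.train()  # keep dropout active: the two passes must see different masks

def embed(sentences):
    """Encode a batch of sentences and return the [CLS] representation."""
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    return encoder(**batch).last_hidden_state[:, 0]  # (batch, hidden)

def unsup_simcse_loss(sentences, temperature=0.05):
    # Encode the same sentences twice; dropout produces two different "views",
    # which serve as the positive pairs.
    z1 = embed(sentences)
    z2 = embed(sentences)
    # Pairwise cosine similarities between the two views: (batch, batch).
    sim = F.cosine_similarity(z1.unsqueeze(1), z2.unsqueeze(0), dim=-1) / temperature
    # The i-th sentence's positive is its own second view; the other sentences
    # in the batch act as in-batch negatives.
    labels = torch.arange(sim.size(0))
    return F.cross_entropy(sim, labels)
```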
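The supervised variant can be sketched under the same assumptions. Each training example is assumed to be a triple of premise, entailment hypothesis (positive), and contradiction hypothesis (hard negative); the logits concatenate similarities to every positive and every hard negative in the batch.

```python
def sup_simcse_loss(premises, entailments, contradictions, temperature=0.05):
    h = embed(premises)             # anchor sentences
    h_pos = embed(entailments)      # entailment pairs -> positives
    h_neg = embed(contradictions)   # contradiction pairs -> hard negatives
    # Similarities of each anchor to all positives and all hard negatives.
    sim_pos = F.cosine_similarity(h.unsqueeze(1), h_pos.unsqueeze(0), dim=-1)
    sim_neg = F.cosine_similarity(h.unsqueeze(1), h_neg.unsqueeze(0), dim=-1)
    logits = torch.cat([sim_pos, sim_neg], dim=1) / temperature  # (batch, 2*batch)
    # The true positive for anchor i sits on the diagonal of sim_pos.
    labels = torch.arange(logits.size(0))
    return F.cross_entropy(logits, labels)
```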
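The alignment and uniformity properties mentioned above follow the definitions of Wang and Isola (2020). A small sketch of those two metrics, assuming L2-normalized embeddings, may help make the analysis concrete; lower values are better for both.

```python
def alignment(x, y, alpha=2):
    # x, y: (N, H) normalized embeddings of positive pairs.
    return (x - y).norm(dim=1).pow(alpha).mean()

def uniformity(x, t=2):
    # x: (N, H) normalized embeddings; measures how evenly the
    # embeddings spread over the unit hypersphere.
    return torch.pdist(x, p=2).pow(2).mul(-t).exp().mean().log()
```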