## SimCSE: Simple Contrastive Learning of Sentence Embeddings

This repository contains the code and pre-trained models for our paper [SimCSE: Simple Contrastive Learning of Sentence Embeddings](https://arxiv.org/abs/2104.08821).

**************************** **Updates** ****************************

* 4/23: We released our [training code](#training).
* 4/20: We released our [model checkpoints](#use-our-models-out-of-the-box) and [evaluation code](#evaluation).
* 4/18: We released [our paper](https://arxiv.org/pdf/2104.08821.pdf). Check it out!

## Quick links

- [Overview](#overview)
- [Pre-trained sentence embeddings](#use-our-models-out-of-the-box)
- [Requirements](#requirements)
- [Evaluation](#evaluation)
- [Training](#training)
- [Bugs or Questions?](#bugs-or-questions)
- [Citation](#citation)
- [SimCSE Elsewhere](#simcse-elsewhere)

## Overview

We propose a simple contrastive learning framework that works with both unlabeled and labeled data. Unsupervised SimCSE simply takes an input sentence and predicts itself in a contrastive learning framework, with only standard dropout used as noise. Our supervised SimCSE incorporates annotated pairs from NLI datasets into contrastive learning by using `entailment` pairs as positives and `contradiction` pairs as hard negatives. The following figure is an illustration of our models.

![](figure/model.png)
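To make the unsupervised objective concrete, below is a minimal sketch of the dropout-as-augmentation contrastive loss. This is an illustration only, not the repository's actual implementation; `encoder` is assumed to be any HuggingFace encoder kept in training mode so that dropout is active.

```python
import torch
import torch.nn.functional as F

def unsup_simcse_loss(encoder, input_ids, attention_mask, temp=0.05):
    # Encode the same batch twice: because dropout is active in training
    # mode, the two passes yield two different "views" of each sentence,
    # which form the positive pairs.
    z1 = encoder(input_ids, attention_mask=attention_mask).pooler_output
    z2 = encoder(input_ids, attention_mask=attention_mask).pooler_output
    # Pairwise cosine similarities between the two views: (batch, batch).
    sim = F.cosine_similarity(z1.unsqueeze(1), z2.unsqueeze(0), dim=-1) / temp
    # For sentence i, the positive is its own second view (the diagonal);
    # all other sentences in the batch act as in-batch negatives.
    labels = torch.arange(sim.size(0), device=sim.device)
    return F.cross_entropy(sim, labels)
```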
## Use our models out of the box

Our pre-trained models are now publicly available with [HuggingFace's Transformers](https://github.com/huggingface/transformers). Models and their performance are presented as follows:

| Model | Avg. STS |
|:-------------------------------:|:--------:|
| [unsup-simcse-bert-base-uncased](https://huggingface.co/princeton-nlp/unsup-simcse-bert-base-uncased) | 74.54 |
| [unsup-simcse-bert-large-uncased](https://huggingface.co/princeton-nlp/unsup-simcse-bert-large-uncased) | 76.05 |
| [unsup-simcse-roberta-base](https://huggingface.co/princeton-nlp/unsup-simcse-roberta-base) | 76.50 |
| [unsup-simcse-roberta-large](https://huggingface.co/princeton-nlp/unsup-simcse-roberta-large) | 77.47 |
| [sup-simcse-bert-base-uncased](https://huggingface.co/princeton-nlp/sup-simcse-bert-base-uncased) | 81.57 |
| [sup-simcse-bert-large-uncased](https://huggingface.co/princeton-nlp/sup-simcse-bert-large-uncased) | 82.21 |
| [sup-simcse-roberta-base](https://huggingface.co/princeton-nlp/sup-simcse-roberta-base) | 82.52 |
| [sup-simcse-roberta-large](https://huggingface.co/princeton-nlp/sup-simcse-roberta-large) | 83.76 |

**Naming rules**: `unsup` and `sup` represent "unsupervised" (trained on the Wikipedia corpus) and "supervised" (trained on NLI datasets), respectively.

You can easily import our model in an out-of-the-box way with HuggingFace's API:

```python
import torch
from scipy.spatial.distance import cosine
from transformers import AutoModel, AutoTokenizer

# Import our models. The package will take care of downloading the models automatically
tokenizer = AutoTokenizer.from_pretrained("princeton-nlp/sup-simcse-bert-base-uncased")
model = AutoModel.from_pretrained("princeton-nlp/sup-simcse-bert-base-uncased")

# Tokenize input texts
texts = [
    "There's a kid on a skateboard.",
    "A kid is skateboarding.",
    "A kid is inside the house."
]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

# Get the embeddings
with torch.no_grad():
    embeddings = model(**inputs, output_hidden_states=True, return_dict=True).pooler_output

# Calculate cosine similarities
# Cosine similarities are in [-1, 1]. Higher means more similar
cosine_sim_0_1 = 1 - cosine(embeddings[0], embeddings[1])
cosine_sim_0_2 = 1 - cosine(embeddings[0], embeddings[2])

print("Cosine similarity between \"%s\" and \"%s\" is: %.3f" % (texts[0], texts[1], cosine_sim_0_1))
print("Cosine similarity between \"%s\" and \"%s\" is: %.3f" % (texts[0], texts[2], cosine_sim_0_2))
```

If you encounter any problem when directly loading the models with HuggingFace's API, you can also download the models manually from the table above and use `model = AutoModel.from_pretrained({PATH TO THE DOWNLOADED MODEL})`.

If you only want to use our models in an out-of-the-box way, installing the latest versions of `torch`, `transformers`, and `scipy` is enough. If you want to use our training or evaluation code, see the requirements section below.

## Requirements

First, install PyTorch by following the instructions on [the official website](https://pytorch.org). To faithfully reproduce our results, please use the `1.7.1` build corresponding to your platform/CUDA version. PyTorch versions higher than `1.7.1` should also work. For example, if you use Linux and **CUDA11** ([how to check your CUDA version](https://varhowto.com/check-cuda-version/)), install PyTorch with the following command:

```bash
pip install torch==1.7.1+cu110 -f https://download.pytorch.org/whl/torch_stable.html
```

If you instead use **CUDA** `<11` or **CPU**, install PyTorch with the following command:

```bash
pip install torch==1.7.1
```

Then run the following script to install the remaining dependencies:

```bash
pip install -r requirements.txt
```

## Evaluation

Our evaluation code for sentence embeddings is based on a modified version of [SentEval](https://github.com/facebookresearch/SentEval). It evaluates sentence embeddings on semantic textual similarity (STS) tasks and downstream transfer tasks. For STS tasks, our evaluation takes the "all" setting and reports Spearman's correlation. See [our paper](https://arxiv.org/pdf/2104.08821.pdf) (Appendix B) for evaluation details.

Before evaluation, please download the evaluation datasets by running

```bash
cd SentEval/data/downstream/
bash download_dataset.sh
```

Then return to the root directory. You can evaluate any `transformers`-based pre-trained model with our evaluation code. For example,

```bash
python evaluation.py \
    --model_name_or_path princeton-nlp/sup-simcse-bert-base-uncased \
    --pooler cls \
    --task_set sts \
    --mode test
```

which is expected to output the results in a tabular format:

```
------ test ------
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| STS12 | STS13 | STS14 | STS15 | STS16 | STSBenchmark | SICKRelatedness | Avg.  |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| 75.30 | 84.67 | 80.19 | 85.40 | 80.82 |    84.26     |      80.39      | 81.58 |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
```

Arguments for the evaluation script are as follows:

* `--model_name_or_path`: The name or path of a `transformers`-based pre-trained checkpoint. You can directly use the models in the table above, e.g., `princeton-nlp/sup-simcse-bert-base-uncased`.
* `--pooler`: Pooling method. We currently support the options below (see the sketch after this list for what each option computes):
  * `cls` (default): Use the representation of the `[CLS]` token. A linear+activation layer is applied after the representation (it is part of the standard BERT implementation). If you use SimCSE, you should use this option.
  * `cls_before_pooler`: Use the representation of the `[CLS]` token without the extra linear+activation.
  * `avg`: Average embeddings of the last layer. If you use checkpoints of SBERT/SRoBERTa ([paper](https://arxiv.org/abs/1908.10084)), you should use this option.
  * `avg_top2`: Average embeddings of the last two layers.
  * `avg_first_last`: Average embeddings of the first and last layers. If you use vanilla BERT or RoBERTa, this works best.
* `--mode`: Evaluation mode
  * `test` (default): The default test mode. To faithfully reproduce our results, you should use this option.
  * `dev`: Report development set results. Note that in STS tasks, only `STS-B` and `SICK-R` have development sets, so we only report their numbers. This mode also uses a fast mode for transfer tasks, so the running time is much shorter than in `test` mode (though the numbers are slightly lower).
  * `fasttest`: The same as `test`, but with a fast mode so the running time is much shorter, though the reported numbers may be lower (this only affects transfer tasks).
* `--task_set`: Which set of tasks to evaluate on (if set, it overrides `--tasks`)
  * `sts` (default): Evaluate on STS tasks, including `STS 12~16`, `STS-B` and `SICK-R`. This is the most commonly used set of tasks for evaluating the quality of sentence embeddings.
  * `transfer`: Evaluate on transfer tasks.
  * `full`: Evaluate on both STS and transfer tasks.
  * `na`: Manually set tasks via `--tasks`.
* `--tasks`: Which dataset(s) to evaluate on. Overridden if `--task_set` is not `na`. See the code for a full list of tasks.
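For reference, here is a rough sketch of what each `--pooler` option computes, assuming a HuggingFace-style model output obtained with `output_hidden_states=True`. This is an illustration, not the evaluation script's exact code:

```python
import torch

def pool(outputs, attention_mask, pooler="cls"):
    # outputs: a HuggingFace model output with output_hidden_states=True.
    last = outputs.last_hidden_state   # (batch, seq_len, dim)
    hidden = outputs.hidden_states     # tuple of (batch, seq_len, dim)
    mask = attention_mask.unsqueeze(-1).float()
    if pooler == "cls":
        # [CLS] after BERT's extra linear+activation (the pooler head).
        return outputs.pooler_output
    if pooler == "cls_before_pooler":
        return last[:, 0]
    if pooler == "avg":
        # Mean over non-padding tokens of the last layer.
        return (last * mask).sum(1) / mask.sum(1)
    if pooler == "avg_first_last":
        return ((hidden[0] + hidden[-1]) / 2.0 * mask).sum(1) / mask.sum(1)
    if pooler == "avg_top2":
        return ((hidden[-1] + hidden[-2]) / 2.0 * mask).sum(1) / mask.sum(1)
    raise ValueError("unknown pooler: %s" % pooler)
```

With `output_hidden_states=True`, `hidden[0]` is the embedding-layer output (the "first layer" in `avg_first_last`) and `hidden[-1]` is the last transformer layer.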
## Training

### Data

For unsupervised SimCSE, we sample 1 million sentences from English Wikipedia; for supervised SimCSE, we use the SNLI and MNLI datasets. You can run `data/download_wiki.sh` and `data/download_nli.sh` to download the two datasets.

### Training scripts

We provide example training scripts for both unsupervised and supervised SimCSE. In `run_unsup_example.sh`, we provide a single-GPU (or CPU) example for the unsupervised version, and in `run_sup_example.sh` we give a **multiple-GPU** example for the supervised version. Both scripts call `train.py` for training. We explain the arguments as follows:

* `--train_file`: Training file path. We support "txt" files (one sentence per line) and "csv" files (2-column: pair data with no hard negative; 3-column: pair data with one corresponding hard negative instance). You can use our provided Wikipedia or NLI data, or your own data in the same format.
* `--model_name_or_path`: Pre-trained checkpoints to start with. For now we support BERT-based models (`bert-base-uncased`, `bert-large-uncased`, etc.) and RoBERTa-based models (`roberta-base`, `roberta-large`, etc.).
* `--temp`: Temperature for the contrastive loss.
* `--pooler_type`: Pooling method. It is the same as the `--pooler` argument in the [evaluation part](#evaluation).
* `--hard_negative_weight`: If using hard negatives (i.e., there are 3 columns in the training file), this is the logarithm of the weight applied to hard negatives. For example, for a weight of 1, this argument should be set to 0 (the default value); see the sketch after this list for how it enters the loss.
* `--do_mlm`: Whether to use the MLM auxiliary objective. If set:
  * `--mlm_weight`: Weight for the MLM objective.
  * `--mlm_probability`: Masking rate for the MLM objective.
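To make `--temp` and `--hard_negative_weight` concrete, below is a minimal sketch of the supervised objective with hard negatives. This is a simplified illustration under our own assumptions, not the exact code in `train.py`; `z1`, `z2`, and `z3` are assumed to be premise, entailment, and contradiction embeddings.

```python
import torch
import torch.nn.functional as F

def sup_simcse_loss(z1, z2, z3, temp=0.05, hard_negative_weight=0.0):
    # z1: premise embeddings, z2: entailment embeddings (positives),
    # z3: contradiction embeddings (hard negatives); all (batch, dim).
    def cos(a, b):
        return F.cosine_similarity(a.unsqueeze(1), b.unsqueeze(0), dim=-1)

    # Logits against all in-batch positives, then all hard negatives:
    # shape (batch, 2 * batch).
    sim = torch.cat([cos(z1, z2), cos(z1, z3)], dim=1) / temp
    idx = torch.arange(z1.size(0), device=z1.device)
    # hard_negative_weight is a log-weight added to each sentence's own
    # hard-negative logit, i.e., that negative is up-weighted by
    # exp(hard_negative_weight); the default 0.0 means a weight of 1.
    weights = torch.zeros_like(sim)
    weights[idx, z1.size(0) + idx] = hard_negative_weight
    # The positive for sentence i is its entailment pair (column i).
    return F.cross_entropy(sim + weights, idx)
```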
All the other arguments are standard HuggingFace `transformers` training arguments. Some often-used arguments are `--output_dir`, `--learning_rate`, and `--per_device_train_batch_size`. In our example scripts, we also evaluate the model on the STS-B development set (you need to download the dataset following the [evaluation](#evaluation) section) and save the best checkpoint.

**REPRODUCTION**: For the results in the paper, we use NVIDIA 3090 GPUs with CUDA 11. Using different types of devices or different versions of CUDA/other software may lead to slightly different performance.

### Convert models

**IMPORTANT**: Our saved checkpoints are slightly different from HuggingFace's pre-trained checkpoints. Run `python simcse_to_huggingface.py --path {PATH_TO_CHECKPOINT_FOLDER}` to convert a checkpoint. After that, you can evaluate it with our [evaluation](#evaluation) code or use it directly [out of the box](#use-our-models-out-of-the-box).

## Bugs or questions?

If you have any questions related to the code or the paper, feel free to email Tianyu (`tianyug@cs.princeton.edu`) and Xingcheng (`yxc18@mails.tsinghua.edu.cn`). If you encounter any problems when using the code, or want to report a bug, you can open an issue. Please try to describe the problem in detail so we can help you better and more quickly!

## Citation

Please cite our paper if you use SimCSE in your work:

```
@article{gao2021simcse,
   title={{SimCSE}: Simple Contrastive Learning of Sentence Embeddings},
   author={Gao, Tianyu and Yao, Xingcheng and Chen, Danqi},
   journal={arXiv preprint arXiv:2104.08821},
   year={2021}
}
```

## SimCSE Elsewhere

We thank the community's efforts in extending SimCSE!

- [Jianlin Su](https://github.com/bojone) has provided [a Chinese version of SimCSE](https://github.com/bojone/SimCSE).
- [AK391](https://github.com/AK391) has provided a [Gradio Web Demo](https://gradio.app/g/AK391/SimCSE) of SimCSE.
- [Nils Reimers](https://github.com/nreimers) has implemented a `sentence-transformers`-based [training code](https://colab.research.google.com/drive/1gAjXcI4uSxDE_IcvZdswFYVAo7XvPeoU?usp=sharing#scrollTo=UXUsikOc6oiB) for SimCSE.