# trans-encoder

**Repository Path**: mirrors_amzn/trans-encoder

## Basic Information

- **Project Name**: trans-encoder
- **Description**: Trans-Encoder: Unsupervised sentence-pair modelling through self- and mutual-distillations
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2021-11-25
- **Last Updated**: 2026-03-28

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

Trans-Encoder

[arxiv] · [amazon.science blog] · [5min-video] · [talk@RIKEN] · [openreview]

Code repo for **ICLR 2022** paper **_[Trans-Encoder: Unsupervised sentence-pair modelling through self- and mutual-distillations](https://arxiv.org/abs/2109.13059)_**
by [Fangyu Liu](http://fangyuliu.me/about.html), [Yunlong Jiao](https://yunlongjiao.github.io/), [Jordan Massiah](https://www.linkedin.com/in/jordan-massiah-562862136/?originalSubdomain=uk), [Emine Yilmaz](https://sites.google.com/site/emineyilmaz/), [Serhii Havrylov](https://serhii-havrylov.github.io/). Trans-Encoder is a state-of-the-art unsupervised sentence-similarity model. It performs self-knowledge distillation on top of pretrained language models by alternating between their bi- and cross-encoder forms.

## Huggingface pretrained models for STS
#### Base models

| model | STS avg. |
|--------|--------|
| baseline: [unsup-simcse-bert-base](https://huggingface.co/princeton-nlp/unsup-simcse-bert-base-uncased) | 76.21 |
| [trans-encoder-bi-simcse-bert-base](https://huggingface.co/cambridgeltl/trans-encoder-bi-simcse-bert-base) | 80.41 |
| [trans-encoder-cross-simcse-bert-base](https://huggingface.co/cambridgeltl/trans-encoder-cross-simcse-bert-base) | 79.90 |
| baseline: [unsup-simcse-roberta-base](https://huggingface.co/princeton-nlp/unsup-simcse-roberta-base) | 76.10 |
| [trans-encoder-bi-simcse-roberta-base](https://huggingface.co/cambridgeltl/trans-encoder-bi-simcse-roberta-base) | 80.47 |
| [trans-encoder-cross-simcse-roberta-base](https://huggingface.co/cambridgeltl/trans-encoder-cross-simcse-roberta-base) | **81.15** |

#### Large models

| model | STS avg. |
|--------|--------|
| baseline: [unsup-simcse-bert-large](https://huggingface.co/princeton-nlp/unsup-simcse-bert-large-uncased) | 78.42 |
| [trans-encoder-bi-simcse-bert-large](https://huggingface.co/cambridgeltl/trans-encoder-bi-simcse-bert-large) | 82.65 |
| [trans-encoder-cross-simcse-bert-large](https://huggingface.co/cambridgeltl/trans-encoder-cross-simcse-bert-large) | 82.52 |
| baseline: [unsup-simcse-roberta-large](https://huggingface.co/princeton-nlp/unsup-simcse-roberta-large) | 78.92 |
| [trans-encoder-bi-simcse-roberta-large](https://huggingface.co/cambridgeltl/trans-encoder-bi-simcse-roberta-large) | **82.93** |
| [trans-encoder-cross-simcse-roberta-large](https://huggingface.co/cambridgeltl/trans-encoder-cross-simcse-roberta-large) | **82.93** |
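Assuming the released bi-encoder checkpoints load as standard Hugging Face models, scoring a sentence pair could look like the sketch below. It uses plain `transformers` with CLS pooling (the pooling mode is an assumption based on the `--bi_encoder*_pooling_mode cls` default in the training command); requires network access to the Hub.

```python
# Sketch (not the official usage): score one sentence pair with a released
# bi-encoder checkpoint via CLS-token embeddings and cosine similarity.
import torch
from transformers import AutoModel, AutoTokenizer

name = "cambridgeltl/trans-encoder-bi-simcse-roberta-base"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)
model.eval()

sents = ["A man is playing a guitar.", "Someone is playing an instrument."]
batch = tok(sents, padding=True, return_tensors="pt")
with torch.no_grad():
    # CLS pooling: take the first token's hidden state as the sentence embedding
    emb = model(**batch).last_hidden_state[:, 0]

score = torch.nn.functional.cosine_similarity(emb[0], emb[1], dim=0).item()
print(f"similarity: {score:.3f}")
```

Higher cosine scores indicate more similar sentence pairs; the cross-encoder checkpoints instead take both sentences in one input and are evaluated via `src/eval.py` as shown below.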
## Dependencies

```
torch==1.8.1
transformers==4.9.0
sentence-transformers==2.0.0
```

Please view [requirements.txt](https://github.com/amzn/trans-encoder/blob/main/requirements.txt) for more details.

## Data

All training and evaluation data will be automatically downloaded when running the scripts. See [src/data.py](https://github.com/amzn/trans-encoder/blob/main/src/data.py) for details.

## Train

`--task` options: `sts` (STS2012-2016 and STS-b), `sickr`, `sts_sickr` (STS2012-2016, STS-b, and SICK-R), `qqp`, `qnli`, `mrpc`, `snli`, `custom`. See [src/data.py](https://github.com/amzn/trans-encoder/blob/main/src/data.py) for task data details. By default, all STS data (`sts_sickr`) is used.

#### Self-distillation

```bash
>> bash train_self_distill.sh 0
```

`0` denotes the GPU device index.

#### Mutual-distillation

```bash
>> bash train_mutual_distill.sh 0,1
```

Two GPUs are needed; by default, SimCSE BERT and RoBERTa base models are used for ensembling. Add `--use_large` to switch to large models.

#### Train with your custom corpus

```bash
>> CUDA_VISIBLE_DEVICES=0,1 python src/mutual_distill_parallel.py \
    --batch_size_bi_encoder 128 \
    --batch_size_cross_encoder 64 \
    --num_epochs_bi_encoder 10 \
    --num_epochs_cross_encoder 1 \
    --cycle 3 \
    --bi_encoder1_pooling_mode cls \
    --bi_encoder2_pooling_mode cls \
    --init_with_new_models \
    --task custom \
    --random_seed 2021 \
    --custom_corpus_path CORPUS_PATH
```

`CORPUS_PATH` should point to your custom corpus, in which every line is a sentence pair in the form `sent1||sent2`.
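Before launching a run on a custom corpus, it can help to sanity-check that every line follows the `sent1||sent2` format. A minimal stdlib sketch (the file name is hypothetical):

```python
# Sketch: write and validate a custom corpus in the sent1||sent2 line format
# expected by --custom_corpus_path. "custom_corpus.txt" is a hypothetical name.
from pathlib import Path

corpus = Path("custom_corpus.txt")
corpus.write_text(
    "A man is playing a guitar.||Someone plays an instrument.\n"
    "The weather is nice today.||It is raining heavily.\n",
    encoding="utf-8",
)

pairs = []
for i, line in enumerate(corpus.read_text(encoding="utf-8").splitlines(), 1):
    parts = line.split("||")
    if len(parts) != 2:
        raise ValueError(f"line {i} is not in sent1||sent2 form: {line!r}")
    pairs.append((parts[0], parts[1]))

print(f"{len(pairs)} valid sentence pairs")
```

Note that `||` is used as a field separator, so sentences themselves should not contain it.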
## Evaluate

#### Evaluate a single model

Bi-encoder:

```bash
>> python src/eval.py \
    --model_name_or_path "cambridgeltl/trans-encoder-bi-simcse-roberta-large" \
    --mode bi \
    --task sts_sickr
```

Cross-encoder:

```bash
>> python src/eval.py \
    --model_name_or_path "cambridgeltl/trans-encoder-cross-simcse-roberta-large" \
    --mode cross \
    --task sts_sickr
```

#### Evaluate ensemble

Bi-encoder:

```bash
>> python src/eval.py \
    --model_name_or_path1 "cambridgeltl/trans-encoder-bi-simcse-bert-large" \
    --model_name_or_path2 "cambridgeltl/trans-encoder-bi-simcse-roberta-large" \
    --mode bi \
    --ensemble \
    --task sts_sickr
```

Cross-encoder:

```bash
>> python src/eval.py \
    --model_name_or_path1 "cambridgeltl/trans-encoder-cross-simcse-bert-large" \
    --model_name_or_path2 "cambridgeltl/trans-encoder-cross-simcse-roberta-large" \
    --mode cross \
    --ensemble \
    --task sts_sickr
```

## Authors

- [**Fangyu Liu**](http://fangyuliu.me/about.html): Main contributor

## Security

See [CONTRIBUTING](CONTRIBUTING.md#security-issue-notifications) for more information.

## License

This project is licensed under the Apache-2.0 License.