# bert-nmt

# Introduction

This repository contains the code for BERT-fused NMT, introduced in the ICLR 2020 paper [Incorporating BERT into Neural Machine Translation](https://openreview.net/forum?id=Hyl7ygStwB). If you find this work helpful in your research, please cite it as:

```
@inproceedings{
Zhu2020Incorporating,
title={Incorporating BERT into Neural Machine Translation},
author={Jinhua Zhu and Yingce Xia and Lijun Wu and Di He and Tao Qin and Wengang Zhou and Houqiang Li and Tieyan Liu},
booktitle={International Conference on Learning Representations},
year={2020},
url={https://openreview.net/forum?id=Hyl7ygStwB}
}
```

*NOTE: We have updated our [code](https://github.com/bert-nmt/bert-nmt/tree/update-20-10) to enable you to use the more powerful pretrained models provided by [huggingface/transformers](https://github.com/huggingface/transformers). With `bert-base-german-dbmdz-uncased`, we obtain a new result of 37.34 on the IWSLT'14 de->en task.*

# Requirements and Installation

* [PyTorch](http://pytorch.org/) version == 1.0.0/1.1.0
* Python version >= 3.5

**Installing from source**

To install fairseq from source and develop locally:
```
git clone https://github.com/bert-nmt/bert-nmt
cd bert-nmt
pip install --editable .
```

# Getting Started

### Data Preprocessing

First, run a Fairseq `prepare-xxx.sh` script to obtain tokenized and BPE-encoded files like:
```
train.en train.de valid.en valid.de test.en test.de
```
Then use [makedataforbert.sh](examples/translation/makedataforbert.sh) to generate the input files for the BERT model (please make sure the path is correct). You will get:
```
train.en train.de valid.en valid.de test.en test.de
train.bert.en valid.bert.en test.bert.en
```
Then binarize the data as in Fairseq:
```
python preprocess.py --source-lang src_lng --target-lang tgt_lng \
  --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
  --destdir destdir --joined-dictionary --bert-model-name bert-base-uncased
```
*Note: For the other language pairs used in our paper, please refer to another [repo](https://github.com/teslacool/preprocess_iwslt/blob/master/preprocess.sh).*

### Train a vanilla NMT model using [Fairseq](https://github.com/pytorch/fairseq)

Using the data above and the standard [Fairseq](https://github.com/pytorch/fairseq) repository, you can train a standard NMT model, which will serve as the pretrained NMT model.

*Note: `update_freq` is set to 2 for IWSLT en->zh translation; the other hyper-parameters are the same as for de<->en.*

### Train a BERT-fused NMT model

The important options we add are:
```
parser.add_argument('--bert-model-name', default='bert-base-uncased', type=str)
parser.add_argument('--warmup-from-nmt', action='store_true')
parser.add_argument('--warmup-nmt-file', default='checkpoint_nmt.pt')
parser.add_argument('--encoder-bert-dropout', action='store_true')
parser.add_argument('--encoder-bert-dropout-ratio', default=0.25, type=float)
```
1. `--bert-model-name` specifies the BERT model name; the available names are listed in [bert/modeling.py](bert/modeling.py).
2. `--warmup-from-nmt` indicates that a pretrained NMT model will also be used to train your BERT-fused NMT model. If you use this option, we suggest you also use `--reset-lr-scheduler`.
3. `--warmup-nmt-file` specifies the NMT checkpoint file name (in your `$savedir`).
4. `--encoder-bert-dropout` enables the drop-net trick (see the sketch after this list).
5. `--encoder-bert-dropout-ratio` specifies the ratio ($\in [0, 0.5]$) used in drop-net.
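To make the drop-net trick concrete, here is a minimal PyTorch-style sketch of the idea: at training time each layer randomly keeps only one of the two attention branches (the standard attention and the BERT attention), and at inference time it averages both. The function name `dropnet_merge` and the exact gating scheme are illustrative simplifications, not the implementation in this repository.

```
import torch

def dropnet_merge(self_attn_out, bert_attn_out, ratio=0.25, training=True):
    """Combine the standard attention branch with the BERT attention branch.

    `ratio` plays the role of --encoder-bert-dropout-ratio. This is only a
    sketch of the drop-net idea, not the repository's implementation.
    """
    if training:
        u = torch.rand(1).item()
        if u < ratio:
            # keep only the BERT attention branch for this forward pass
            return bert_attn_out
        if u > 1.0 - ratio:
            # keep only the standard attention branch
            return self_attn_out
    # otherwise (and always at inference) average the two branches
    return 0.5 * (self_attn_out + bert_attn_out)
```

With `ratio` in $[0, 0.5]$, each branch is dropped with probability `ratio` during training, consistent with the range noted in item 5 above.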
This is an example training script:
```
#!/usr/bin/env bash
nvidia-smi
cd /yourpath/bertnmt
python3 -c "import torch; print(torch.__version__)"
src=en
tgt=de
bedropout=0.5
ARCH=transformer_s2_iwslt_de_en
DATAPATH=/yourdatapath
SAVEDIR=checkpoints/iwed_${src}_${tgt}_${bedropout}
mkdir -p $SAVEDIR
if [ ! -f $SAVEDIR/checkpoint_nmt.pt ]
then
    cp /your_pretrained_nmt_model $SAVEDIR/checkpoint_nmt.pt
fi
if [ ! -f "$SAVEDIR/checkpoint_last.pt" ]
then
    warmup="--warmup-from-nmt --reset-lr-scheduler"
else
    warmup=""
fi

python train.py $DATAPATH \
    -a $ARCH --optimizer adam --lr 0.0005 -s $src -t $tgt --label-smoothing 0.1 \
    --dropout 0.3 --max-tokens 4000 --min-lr '1e-09' --lr-scheduler inverse_sqrt --weight-decay 0.0001 \
    --criterion label_smoothed_cross_entropy --max-update 150000 --warmup-updates 4000 --warmup-init-lr '1e-07' \
    --adam-betas '(0.9,0.98)' --save-dir $SAVEDIR --share-all-embeddings $warmup \
    --encoder-bert-dropout --encoder-bert-dropout-ratio $bedropout | tee -a $SAVEDIR/training.log
```

### Generate

Using `generate.py` to test a model is the same as in Fairseq, except that you should add `--bert-model-name` to indicate your BERT model name.

Using `interactive.py` is a little different from Fairseq. You should follow this procedure:
```
sed -r 's/(@@ )|(@@ ?$)//g' $bpefile > $bpefile.debpe
$MOSES/scripts/tokenizer/detokenizer.perl -l $src < $bpefile.debpe > $bpefile.debpe.detok
paste -d "\n" $bpefile $bpefile.debpe.detok > $bpefile.in
cat $bpefile.in | python interactive.py -s $src -t $tgt \
    --buffer-size 1024 --batch-size 128 --beam 5 --remove-bpe > output.log
```
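The `paste -d "\n"` command above interleaves the two files line by line, so each source sentence appears twice in `$bpefile.in`: first the BPE-tokenized line for the NMT encoder, then the detokenized line for the BERT side (mirroring the `*.bert.*` files created during preprocessing). If you prefer to build this input programmatically, here is a minimal Python sketch; the file names are hypothetical placeholders:

```
# Build the interleaved input for interactive.py: BPE line, then detokenized
# line, for every sentence. File names below are placeholders.
bpe_path = "test.bpe.en"
detok_path = "test.bpe.en.debpe.detok"
out_path = "test.bpe.en.in"

with open(bpe_path) as f_bpe, open(detok_path) as f_detok, open(out_path, "w") as f_out:
    for bpe_line, detok_line in zip(f_bpe, f_detok):
        f_out.write(bpe_line.rstrip("\n") + "\n")
        f_out.write(detok_line.rstrip("\n") + "\n")
```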