# pytorch_basic_nmt

**Repository Path**: lin323/pytorch_basic_nmt

## Basic Information

- **Project Name**: pytorch_basic_nmt
- **Description**: A simple yet strong implementation of neural machine translation in PyTorch
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2020-03-04
- **Last Updated**: 2025-01-14

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

## A Basic PyTorch Implementation of Attentional Neural Machine Translation

This is a basic implementation of attentional neural machine translation (Bahdanau et al., 2015; Luong et al., 2015) in PyTorch 0.4. It implements the model described in [Luong et al., 2015](https://arxiv.org/abs/1508.04025), and supports label smoothing, beam-search decoding and random sampling. With a 256-dimensional LSTM hidden size, it achieves a BLEU score of 28.13 on the IWSLT 2014 German-English dataset (Ranzato et al., 2015). A minimal sketch of the attention computation is included at the end of this README.

This codebase is used for instructional purposes in Stanford [CS224N Natural Language Processing with Deep Learning](http://web.stanford.edu/class/cs224n/) and CMU [11-731 Machine Translation and Sequence-to-Sequence Models](http://www.phontron.com/class/mtandseq2seq2018/).

### File Structure

* `nmt.py`: contains the neural machine translation model and the training/testing code.
* `vocab.py`: a script that extracts a vocabulary from the training data.
* `util.py`: contains utility/helper functions.

### Example Dataset

We provide a preprocessed version of the IWSLT 2014 German-English translation task used in (Ranzato et al., 2015) [[script]](https://github.com/harvardnlp/BSO/blob/master/data_prep/MT/prepareData.sh). To download the dataset:

```bash
wget http://www.cs.cmu.edu/~pengchey/iwslt2014_ende.zip
unzip iwslt2014_ende.zip
```

Running the script extracts a `data/` folder containing the IWSLT 2014 dataset, which has 150K German-English training sentences. The `data/` folder contains a copy of the public release of the dataset. Files with the suffix `*.wmixerprep` are pre-processed versions of the dataset from Ranzato et al., 2015, with long sentences chopped and rare words replaced by a special `<unk>` token. You could use the pre-processed training files for training/development (or come up with your own pre-processing strategy), but for testing you have to use the **original** version of the test files, i.e., `test.de-en.(de|en)`.

### Environment

The code is written in Python 3.6 using some supporting third-party libraries. We provide a conda environment file to install Python 3.6 with the required libraries. Simply run

```bash
conda env create -f environment.yml
```

### Usage

Each runnable script (`nmt.py`, `vocab.py`) is annotated using `docopt`. Please refer to the source files for complete usage; a minimal docopt sketch is also included at the end of this README.

First, we extract a vocabulary file from the training data using the command:

```bash
python vocab.py \
    --train-src=data/train.de-en.de.wmixerprep \
    --train-tgt=data/train.de-en.en.wmixerprep \
    data/vocab.json
```

This generates a vocabulary file `data/vocab.json`. The script also has options to control the cutoff frequency and the size of the generated vocabulary, which you may play with.

To start training and evaluation, simply run `data/train.sh`. After training and decoding, we call the official evaluation script `multi-bleu.perl` to compute the corpus-level BLEU score of the decoding results against the gold-standard references.
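For reference, a typical invocation of the Moses `multi-bleu.perl` script looks like the following; `decode.txt` is a placeholder name for the system output, not a path produced by this codebase:

```bash
# Score a decoded hypothesis file against the reference translations.
# The reference file is passed as an argument, the hypotheses on stdin.
perl multi-bleu.perl data/test.de-en.en < decode.txt
```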
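As mentioned in the Usage section, the runnable scripts are annotated with `docopt`, which parses the module docstring to build the command-line interface. The following is a minimal, self-contained sketch of how such an interface works; the usage string is illustrative only and is not the actual one from `nmt.py` or `vocab.py`:

```python
"""
A toy docopt-annotated script (illustrative only).

Usage:
    toy_vocab.py --train-src=<file> --train-tgt=<file> VOCAB_FILE [--size=<n>]

Options:
    --train-src=<file>  Source-side training corpus.
    --train-tgt=<file>  Target-side training corpus.
    --size=<n>          Maximum vocabulary size [default: 50000].
"""
from docopt import docopt

if __name__ == '__main__':
    # docopt parses sys.argv against the usage pattern in __doc__
    # and returns a dict keyed by option/argument names.
    args = docopt(__doc__)
    print(args['--train-src'], args['--train-tgt'], args['VOCAB_FILE'], args['--size'])
```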
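As noted in the Example Dataset section, the `*.wmixerprep` files have rare words replaced by a special unknown-word token. A minimal sketch of that kind of frequency-cutoff preprocessing (not the actual script used to produce those files) might look like this:

```python
from collections import Counter

def replace_rare_words(sentences, freq_cutoff=2, unk_token='<unk>'):
    """Replace words occurring fewer than `freq_cutoff` times with `unk_token`.

    `sentences` is a list of tokenized sentences (lists of strings).
    """
    counts = Counter(word for sent in sentences for word in sent)
    return [[w if counts[w] >= freq_cutoff else unk_token for w in sent]
            for sent in sentences]

corpus = [['the', 'cat', 'sat'], ['the', 'zebra', 'sat']]
print(replace_rare_words(corpus))
# [['the', '<unk>', 'sat'], ['the', '<unk>', 'sat']]  -- 'cat' and 'zebra' each occur only once
```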
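Finally, since the model follows Luong et al., 2015, here is a minimal, codebase-independent sketch of the "general" global attention score from that paper; the dimensions, class name and variable names are assumptions for illustration, not the ones used in `nmt.py`:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LuongGeneralAttention(nn.Module):
    """Bilinear attention score followed by a softmax over source positions."""

    def __init__(self, hidden_size=256):
        super().__init__()
        self.W = nn.Linear(hidden_size, hidden_size, bias=False)

    def forward(self, dec_hidden, enc_outputs):
        # dec_hidden:  (batch, hidden)           -- current decoder state
        # enc_outputs: (batch, src_len, hidden)  -- encoder states
        scores = torch.bmm(enc_outputs, self.W(dec_hidden).unsqueeze(2)).squeeze(2)  # (batch, src_len)
        alpha = F.softmax(scores, dim=-1)                       # attention weights
        context = torch.bmm(alpha.unsqueeze(1), enc_outputs)    # (batch, 1, hidden)
        return context.squeeze(1), alpha

# Toy usage with random tensors.
attn = LuongGeneralAttention(hidden_size=256)
ctx, weights = attn(torch.randn(4, 256), torch.randn(4, 7, 256))
print(ctx.shape, weights.shape)  # torch.Size([4, 256]) torch.Size([4, 7])
```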
### License

This work is licensed under a Creative Commons Attribution 4.0 International License.