Language: 简体中文 | [English](https://github.com/hemingkx/ChineseNMT/blob/master/README-en.md)

# ChineseNMT

A Transformer-based English-to-Chinese translation model 🤗.

For a walkthrough of the project, see the Zhihu article: [教你用PyTorch玩转Transformer英译中翻译模型!](https://zhuanlan.zhihu.com/p/347061440)

## Data

The dataset is from the [WMT 2018 Chinese-English track](http://statmt.org/wmt18/translation-task.html) (news area only).

## Data Process

### Tokenization

- Tool: [sentencepiece](https://github.com/google/sentencepiece)
- Preprocessing: `./data/get_corpus.py` extracts the bilingual sentence pairs from the train, dev, and test splits and saves them to `corpus.en` and `corpus.ch`, one sentence per line.
- Training the tokenizer: `./tokenizer/tokenize.py` calls `sentencepiece.SentencePieceTrainer.Train()` to train tokenization models on `corpus.en` and `corpus.ch`. After training, `chn.model`, `chn.vocab`, `eng.model`, and `eng.vocab` are generated under `./tokenizer`, where the `.model` files are the tokenizer models and the `.vocab` files are the corresponding vocabularies. A minimal training sketch is shown below.
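To make the tokenizer training concrete, here is a minimal sketch of the `SentencePieceTrainer.Train()` call for the Chinese side. The `vocab_size`, `model_type`, and `character_coverage` values are illustrative assumptions rather than the project's actual settings, which live in `./tokenizer/tokenize.py`.

```python
import sentencepiece as spm

# Train a tokenizer on the extracted Chinese corpus. The hyperparameters
# below (vocab_size, model_type, character_coverage) are illustrative
# assumptions; the project's real settings live in ./tokenizer/tokenize.py.
spm.SentencePieceTrainer.Train(
    '--input=corpus.ch --model_prefix=chn '
    '--vocab_size=32000 --model_type=bpe --character_coverage=0.9995'
)

# Load the resulting model and tokenize a sample sentence.
sp = spm.SentencePieceProcessor()
sp.Load('chn.model')
print(sp.EncodeAsPieces('近期的政策对策很明确。'))  # subword pieces
print(sp.EncodeAsIds('近期的政策对策很明确。'))     # integer ids
```

The English side is trained the same way with `--input=corpus.en` and `--model_prefix=eng`.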
## Model

The model is Harvard's open-source [transformer-pytorch](http://nlp.seas.harvard.edu/2018/04/03/attention.html) (The Annotated Transformer); a Chinese-language explanation is available [here](https://zhuanlan.zhihu.com/p/144825330).

## Requirements

This repo was tested on Python 3.6+ and PyTorch 1.5.1. The main requirements are:

- tqdm
- pytorch >= 1.5.1
- sacrebleu >= 1.4.14
- sentencepiece >= 0.1.94

To set up the environment quickly, run:

```
pip install -r requirements.txt
```

## Usage

Model hyperparameters are set in `config.py`.

- Because of the Transformer's GPU-memory requirements, multi-GPU training is supported. Set the `device_id` list in `config.py` and `os.environ['CUDA_VISIBLE_DEVICES']` in `main.py`.

To run the model, enter the following on the command line:

```
python main.py
```

Training results are logged to `./experiment/train.log`, and the test-set translations are written to `./experiment/output.txt`.

> On two GeForce GTX 1080 Ti cards, each epoch takes about an hour.

## Results

| Model | NoamOpt | LabelSmoothing | Best Dev BLEU | Test BLEU |
| :---: | :-----: | :------------: | :-----------: | :-------: |
|   1   |   No    |       No       |     24.07     |   24.03   |
|   2   |   Yes   |       No       |   **26.08**   | **25.94** |
|   3   |   No    |      Yes       |     23.92     |   23.84   |

## Pretrained Model

The trained Model 2 (the current best model) can be downloaded directly from the link below 😊:

Link: https://pan.baidu.com/s/1RKC-HV_UmXHq-sy1-yZd2Q Password: g9wl

## Beam Search

Test results of the current best model (Model 2) with beam search:

| Beam size |   2   |   3   |   4   |     5     |
| :-------: | :---: | :---: | :---: | :-------: |
| Test BLEU | 26.59 | 26.80 | 26.84 | **26.86** |

## One Sentence Translation

Name your trained model (or the pretrained model above) `model.pth` and save it under `./experiment`. Then run `translate_example` in `main.py` to translate a single sentence (a minimal sketch of this flow is given at the end of this README).

For example, given the English input:

```
The near-term policy remedies are clear: raise the minimum wage to a level that will keep a fully employed worker and his or her family out of poverty, and extend the earned-income tax credit to childless workers.
```

the ground truth is:

```
近期的政策对策很明确:把最低工资提升到足以一个全职工人及其家庭免于贫困的水平,扩大对无子女劳动者的工资所得税减免。
```

and the translation with beam size = 3 is:

```
短期政策方案很清楚:把最低工资提高到充分就业的水平,并扩大向无薪工人发放所得的税收信用。
```

## Mention

The code released in this repository has only been tested successfully on **Linux**. If you want to try it on **Windows**, the steps below, mentioned in [issue 2](https://github.com/hemingkx/ChineseNMT/issues/2), may be useful:

1. **adding UTF-8 encoding declarations:**

   In lines 16 and 19 of `get_corpus.py`:

   ```
   with open(ch_path, "w", encoding="utf-8") as fch:
   with open(en_path, "w", encoding="utf-8") as fen:
   ```

   In line 165 of `train.py`:

   ```
   with open(config.output_path, "w", encoding="utf-8") as fp:
   ```

2. **using the conda command to install sacrebleu if Anaconda is used for building your virtual env:**

   ```
   conda install -c conda-forge sacrebleu
   ```

For any other problems you run into in your own project, feel free to open an issue or send me an email 😊~
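To round this out, here is a minimal sketch of the single-sentence translation flow referenced in the One Sentence Translation section, using greedy decoding instead of beam search for brevity. The `make_model` import path, the vocabulary sizes, and the plain `state_dict` checkpoint format are assumptions; the `encode`/`decode`/`generator` interface follows the Annotated Transformer that this project builds on, and the repo's real entry point is `translate_example` in `main.py`.

```python
import torch
import sentencepiece as spm

# Hedged sketch of single-sentence translation with greedy decoding
# (the repo's reported BLEU scores use beam search instead).
from model import make_model  # assumed import path for the Annotated-Transformer builder

def subsequent_mask(size):
    # Allow each target position to attend only to itself and earlier positions.
    return torch.tril(torch.ones(1, size, size, dtype=torch.bool))

src_sp = spm.SentencePieceProcessor(); src_sp.Load('./tokenizer/eng.model')
tgt_sp = spm.SentencePieceProcessor(); tgt_sp.Load('./tokenizer/chn.model')

model = make_model(src_vocab=32000, tgt_vocab=32000)  # vocab sizes are assumptions
# Assumes the checkpoint is a plain state_dict saved as ./experiment/model.pth.
model.load_state_dict(torch.load('./experiment/model.pth', map_location='cpu'))
model.eval()

sentence = ("The near-term policy remedies are clear: raise the minimum wage "
            "to a level that will keep a fully employed worker and his or her "
            "family out of poverty, and extend the earned-income tax credit "
            "to childless workers.")
src = torch.LongTensor([src_sp.EncodeAsIds(sentence)])
src_mask = torch.ones(1, 1, src.size(1), dtype=torch.bool)  # single unpadded sentence

with torch.no_grad():
    memory = model.encode(src, src_mask)
    ys = torch.LongTensor([[tgt_sp.bos_id()]])          # start with BOS
    for _ in range(100):                                # cap the output length
        out = model.decode(memory, src_mask, ys, subsequent_mask(ys.size(1)))
        next_id = model.generator(out[:, -1]).argmax(dim=-1).item()
        ys = torch.cat([ys, torch.LongTensor([[next_id]])], dim=1)
        if next_id == tgt_sp.eos_id():
            break

# Drop BOS/EOS and detokenize back to a Chinese sentence.
out_ids = [i for i in ys[0, 1:].tolist() if i != tgt_sp.eos_id()]
print(tgt_sp.DecodeIds(out_ids))
```

Beam search, as used for the reported BLEU scores, replaces the argmax step with a search that keeps the `beam_size` highest-scoring partial hypotheses at each step.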