# Question Data Augmentation Based on Rewriting in Continuous Space - Sichuan University

**Repository Path**: dicalab/CRQDA

## Basic Information

- **Project Name**: Question Data Augmentation Based on Rewriting in Continuous Space - Sichuan University
- **Description**: A question data augmentation algorithm based on rewriting in continuous space; it generates context-relevant answerable and unanswerable questions as augmented data
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2021-07-05
- **Last Updated**: 2021-07-05

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# Tell Me How to Ask Again: Question Data Augmentation with Controllable Rewriting in Continuous Space

We provide a PyTorch implementation of the following paper:

> **Tell Me How to Ask Again: Question Data Augmentation with Controllable Rewriting in Continuous Space**, Dayiheng Liu, Yeyun Gong, Jie Fu, Yu Yan, Jiusheng Chen, Jiancheng Lv, Nan Duan and Ming Zhou. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020). [[paper]](https://www.aclweb.org/anthology/2020.emnlp-main.467.pdf)

# Prerequisites

- Python 3.6
- TensorFlow 1.10.0+
- PyTorch 1.3.0+
- nltk 3.3+
- CUDA 9.0

Please install the Huggingface transformers package locally as follows:

```
cd pytorch-transformers-master
python setup.py install
```

# Datasets

Download the SQuAD 2.0 dataset files (`train-v2.0.json` and `dev-v2.0.json`) from [here](https://rajpurkar.github.io/SQuAD-explorer/).

The Transformer autoencoder can be trained on the questions in `train-v2.0.json`. It can also be pre-trained on our collected 2M-question corpus, which contains about 2M questions drawn from the training sets of several MRC and QA datasets, including SQuAD 2.0, Natural Questions, NewsQA, QuAC, TriviaQA, CoQA, HotpotQA, DuoRC, and MS MARCO. This corpus can be downloaded [here](https://drive.google.com/file/d/1SkjAqTlM3KWZbX66fTAs6rCze0TyOJxu/view?usp=sharing).

In addition, the Transformer autoencoder can be pre-trained on the large-scale English Wikipedia and BookCorpus corpora; please refer to [this guide](https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow/LanguageModeling/BERT#quick-start-guide) to download and preprocess the data. Afterwards, you will obtain a text file `wikicorpus_en_one_article_per_line.txt` for Transformer autoencoder pre-training.

# CRQDA

## Model Training

### Pre-trained Language Model based MRC Model

We adopt the BERT-based (`BertForQuestionAnswering`) and RoBERTa-based (`RobertaForQuestionAnswering`) models from [Huggingface](https://github.com/huggingface/transformers/tree/master/examples/question-answering) as the SQuAD 2.0 MRC models. We provide a well-trained RoBERTa SQuAD 2.0 MRC model whose checkpoint can be downloaded [here](https://drive.google.com/file/d/1B4I7tGp2pnUs0kQhg4Pn1FgBnr3fbNB1/view?usp=sharing).

### Transformer-based Autoencoder

Before training the Transformer-based autoencoder, put the checkpoint files of the well-trained RoBERTa SQuAD 2.0 MRC model into the default directory `crqda/data/mrc_model`, and put `wikicorpus_en_one_article_per_line.txt` (or another dataset, such as the 2M-question corpus) into the default directory `crqda/data/`. Then train the Transformer-based autoencoder with this script:

```
cd crqda
./run_train.sh
```

The Transformer-based autoencoder will be saved at `data/ae_models`.
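For readers who want a rough picture of what the Transformer-based autoencoder does, the sketch below illustrates the encode-and-reconstruct idea in plain PyTorch. It is only an illustration under simplifying assumptions: the class name, layer sizes, and toy vocabulary size are ours, and causal masking / teacher forcing are omitted. The actual model and training code live in the `crqda/` directory.

```python
# A minimal, hypothetical sketch of a Transformer question autoencoder.
# Illustration only; it is not the repository's implementation.
import torch
import torch.nn as nn

class QuestionAutoencoder(nn.Module):
    def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=4, max_len=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)   # token embeddings (continuous space)
        self.pos = nn.Embedding(max_len, d_model)        # learned positional embeddings
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)    # maps hidden states back to tokens

    def _embed(self, token_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return self.embed(token_ids) + self.pos(positions)

    def encode(self, token_ids):
        # Continuous question representation: one vector per token position.
        return self.encoder(self._embed(token_ids))

    def forward(self, token_ids):
        memory = self.encode(token_ids)
        hidden = self.decoder(self._embed(token_ids), memory)
        return self.lm_head(hidden)                      # reconstruction logits over the vocabulary

# Toy reconstruction training step (masking / teacher forcing omitted for brevity):
model = QuestionAutoencoder(vocab_size=50265)            # 50265 is just a placeholder vocab size
question_ids = torch.randint(0, 50265, (2, 16))          # a toy batch of question token ids
logits = model(question_ids)
loss = nn.functional.cross_entropy(logits.reshape(-1, logits.size(-1)), question_ids.reshape(-1))
loss.backward()
```

The continuous representation returned by `encode` plays the role of the space in which CRQDA later rewrites questions.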
## Rewriting Questions with Gradient-based Optimization

To rewrite the questions and obtain the augmented dataset, run this script:

```
cd crqda
python inference.py \
    --OS_ID 0 \
    --GAP 33000 \
    --NEG \
    --ae_model_path 'data/ae_models/pytorch_model.bin'
```

Set `--NEG` to generate unanswerable questions and `--para` to generate answerable questions. Since the rewriting process is slow, we provide a manual parallelization scheme: `--OS_ID` indicates which GPU should be used for a given rewriting job, and `--GAP` is the number of original training samples to be rewritten on that GPU. (A simplified, conceptual sketch of the rewriting loop is given in the appendix at the end of this README.)

We also provide an augmented SQuAD 2.0 dataset, which contains the original SQuAD 2.0 training pairs plus unanswerable question pairs generated by CRQDA. It can be downloaded [here](https://drive.google.com/file/d/1nZGjQfxP1pSUu3siiJxdQsm1y-xHNIjB/view?usp=sharing).

## Fine-tuning the MRC Model with the Augmented Dataset

After question data augmentation with CRQDA, we can fine-tune the BERT-large model on the augmented dataset with the script:

```
cd pytorch-transformers-master/examples
./run_fine_tune_bert_with_crqda.sh
```

You should obtain results similar to:

```
"best_exact": 80.56093657879222,
"best_f1": 83.3359726931614,
"exact": 80.03032089615093,
"f1": 82.97608915068454
```

# Other Baselines

We also provide implementations of the baselines, including EDA, Back-Translation, and Text-VAE, which can be found in `baselines/EDA`, `baselines/Style-Transfer-Through-Back-Translation`, and `baselines/Mu-Forcing-VRAE`, respectively.

# Citation

```
@inproceedings{liu2020crqda,
  title = "Tell Me How to Ask Again: Question Data Augmentation with Controllable Rewriting in Continuous Space",
  author = "Liu, Dayiheng and Gong, Yeyun and Fu, Jie and Yan, Yu and Chen, Jiusheng and Lv, Jiancheng and Duan, Nan and Zhou, Ming",
  booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
  year = "2020"
}
```
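# Appendix: A Conceptual Sketch of the Rewriting Step

As referenced in the rewriting section above, here is a simplified, hypothetical sketch of the gradient-based optimization loop, written against generic tensors rather than the actual models. The function name, the answerability callable, the loss, and the proximity weight are illustrative assumptions; the real procedure is implemented in `crqda/inference.py` and differs in its details.

```python
# Hypothetical sketch: nudge a continuous question representation toward a
# target answerability while staying close to the original question.
import torch

def rewrite_in_latent_space(z0, answerability_fn, steps=30, lr=1e-2,
                            make_unanswerable=True, proximity_weight=0.1):
    """z0:               the original question's continuous representation (a tensor)
    answerability_fn: hypothetical callable returning an answerability logit
                      for the (rewritten question, context) pair"""
    z = z0.clone().detach().requires_grad_(True)
    optimizer = torch.optim.Adam([z], lr=lr)
    # Convention used here: target 0 = unanswerable, 1 = answerable.
    target = torch.zeros(1) if make_unanswerable else torch.ones(1)

    for _ in range(steps):
        optimizer.zero_grad()
        logit = answerability_fn(z).view(1)
        # Push the representation toward the desired answerability label ...
        loss = torch.nn.functional.binary_cross_entropy_with_logits(logit, target)
        # ... while penalizing drift away from the original question.
        loss = loss + proximity_weight * (z - z0).pow(2).mean()
        loss.backward()
        optimizer.step()

    # In CRQDA, a representation like this is decoded back into question text
    # by the Transformer autoencoder's decoder.
    return z.detach()

# Toy usage with a stand-in "answerability" probe:
z0 = torch.randn(1, 64)
probe = torch.nn.Linear(64, 1)
z_new = rewrite_in_latent_space(z0, probe, make_unanswerable=True)
```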