# GECToR – Grammatical Error Correction: Tag, Not Rewrite
This repository provides code for training and testing state-of-the-art models for grammatical error correction with the official PyTorch implementation of the following paper:
> [GECToR – Grammatical Error Correction: Tag, Not Rewrite](https://arxiv.org/abs/2005.12592)
> [Kostiantyn Omelianchuk](https://github.com/komelianchuk), [Vitaliy Atrasevych](https://github.com/atrasevych), [Artem Chernodub](https://github.com/achernodub), [Oleksandr Skurzhanskyi](https://github.com/skurzhanskyi)
> Grammarly
> [15th Workshop on Innovative Use of NLP for Building Educational Applications (co-located with ACL 2020)](https://sig-edu.org/bea/current)
It is mainly based on `AllenNLP` and `transformers`.
## Installation
The following command installs all necessary packages:
```.bash
pip install -r requirements.txt
```
The project was tested using Python 3.7.
## Datasets
All the public GEC datasets used in the paper can be downloaded from [here](https://www.cl.cam.ac.uk/research/nl/bea2019st/#data).
Synthetically created datasets can be generated/downloaded [here](https://github.com/awasthiabhijeet/PIE/tree/master/errorify).
To train the model, the data has to be preprocessed and converted to a special format with the following command:
```.bash
python utils/preprocess_data.py -s SOURCE -t TARGET -o OUTPUT_FILE
```
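Conceptually, the script aligns each source sentence with its corrected target and assigns one edit tag per source token, so that applying the tags reproduces the target ("tag, not rewrite"). Below is a minimal, hand-worked Python sketch of that idea using the basic tags from the paper; the actual output format of `utils/preprocess_data.py` and the full tag inventory (which also covers case, merge, split, and verb-form transformations) are richer than this toy:
```python
# A hand-worked toy example of GECToR-style token-level edit tags
# (illustrative only; utils/preprocess_data.py produces the actual
# training format, and the paper's full tag inventory also includes
# transformations for case, merging, splitting, and verb forms).

source = ["He", "go", "to", "school"]
target = ["He", "went", "to", "school", "."]

# One edit tag per source token that rewrites `source` into `target`:
tags = ["$KEEP", "$REPLACE_went", "$KEEP", "$APPEND_."]

def apply_tags(tokens, tags):
    """Apply basic KEEP/DELETE/REPLACE/APPEND tags left to right."""
    out = []
    for tok, tag in zip(tokens, tags):
        if tag == "$KEEP":
            out.append(tok)                           # copy token unchanged
        elif tag == "$DELETE":
            continue                                  # drop the token
        elif tag.startswith("$REPLACE_"):
            out.append(tag[len("$REPLACE_"):])        # substitute the token
        elif tag.startswith("$APPEND_"):
            out.extend([tok, tag[len("$APPEND_"):]])  # insert after the token
    return out

assert apply_tags(source, tags) == target
```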
## Pretrained models
| Pretrained encoder | Confidence bias | Min error prob | CoNLL-2014 (test) | BEA-2019 (test) |
| --- | --- | --- | --- | --- |
| BERT [link] | 0.10 | 0.41 | 63.0 | 67.6 |
| RoBERTa [link] | 0.20 | 0.50 | 64.0 | 71.5 |
| XLNet [link] | 0.35 | 0.66 | 65.3 | 72.4 |
| RoBERTa + XLNet | 0.24 | 0.45 | 66.0 | 73.7 |
| BERT + RoBERTa + XLNet | 0.16 | 0.40 | 66.5 | 73.6 |
## Train model
To train the model, simply run:
```.bash
python train.py --train_set TRAIN_SET --dev_set DEV_SET \
--model_dir MODEL_DIR
```
There are many parameters to specify; among them:
- `cold_steps_count` - the number of epochs during which only the last linear layer is trained (see the sketch below)
- `transformer_model {bert,distilbert,gpt2,roberta,transformerxl,xlnet,albert}` - the encoder model
- `tn_prob` - the probability of keeping sentences with no errors; helps to balance precision/recall
- `pieces_per_token` - the maximum number of subword pieces per token; helps to avoid CUDA out-of-memory errors

In our experiments we used a 98/2 train/dev split.
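As a point of reference for `cold_steps_count`, here is a minimal PyTorch sketch of the "cold" epochs idea (the module name `classifier` is a hypothetical placeholder; the repository's actual training loop lives in `train.py` and is built on AllenNLP):
```python
import torch.nn as nn

def set_cold_mode(model: nn.Module, cold: bool) -> None:
    """Freeze everything except the final linear layer while `cold` is True."""
    for name, param in model.named_parameters():
        # "classifier" is a hypothetical name for the last linear layer
        param.requires_grad = name.startswith("classifier") if cold else True

# hypothetical usage inside the epoch loop:
# set_cold_mode(model, cold=(epoch < cold_steps_count))
```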
## Training parameters
We described all parameters that we use for training and evaluating [here](https://github.com/grammarly/gector/blob/master/docs/training_parameters.md).
## Model inference
To run your model on an input file, use the following command:
```.bash
python predict.py --model_path MODEL_PATH [MODEL_PATH ...] \
--vocab_path VOCAB_PATH --input_file INPUT_FILE \
--output_file OUTPUT_FILE
```
Among the parameters:
- `min_error_probability` - minimum error probability (as in the paper; sketched below)
- `additional_confidence` - confidence bias (as in the paper; sketched below)
- `special_tokens_fix` - set this to reproduce some of the reported results of the pretrained models
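Both flags implement the inference tweaks from the paper: `additional_confidence` adds a permanent positive bias to the probability of the `$KEEP` tag, and `min_error_probability` leaves a sentence unchanged when the model is not confident enough that it contains an error. A minimal NumPy sketch of the idea (the function name, array shapes, and the way the error probability is estimated here are assumptions, not the repository's API):
```python
import numpy as np

def adjust_tag_predictions(tag_probs: np.ndarray, keep_idx: int,
                           additional_confidence: float,
                           min_error_probability: float) -> np.ndarray:
    """tag_probs: (num_tokens, num_tags) softmax scores for one sentence."""
    probs = tag_probs.copy()
    probs[:, keep_idx] += additional_confidence  # bias predictions toward $KEEP
    best = probs.argmax(axis=1)
    # assumption: estimate the sentence's error probability as the highest
    # probability assigned to any non-$KEEP tag
    max_error_prob = np.delete(probs, keep_idx, axis=1).max()
    if max_error_prob < min_error_probability:
        best[:] = keep_idx  # below the threshold: leave the sentence unchanged
    return best
```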
For evaluation use [M^2Scorer](https://github.com/nusnlp/m2scorer) and [ERRANT](https://github.com/chrisjbryant/errant).
## Text Simplification
The code and README for the Text Simplification version of GECToR will be added soon.
## Citation
If you find this work useful for your research, please cite our paper:
```
@inproceedings{omelianchuk-etal-2020-gector,
    title = "{GECT}o{R} {--} Grammatical Error Correction: Tag, Not Rewrite",
    author = "Omelianchuk, Kostiantyn and
      Atrasevych, Vitaliy and
      Chernodub, Artem and
      Skurzhanskyi, Oleksandr",
    booktitle = "Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications",
    month = jul,
    year = "2020",
    address = "Seattle, WA, USA → Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.bea-1.16",
    pages = "163--170",
    abstract = "In this paper, we present a simple and efficient GEC sequence tagger using a Transformer encoder. Our system is pre-trained on synthetic data and then fine-tuned in two stages: first on errorful corpora, and second on a combination of errorful and error-free parallel corpora. We design custom token-level transformations to map input tokens to target corrections. Our best single-model/ensemble GEC tagger achieves an F{\_}0.5 of 65.3/66.5 on CONLL-2014 (test) and F{\_}0.5 of 72.4/73.6 on BEA-2019 (test). Its inference speed is up to 10 times as fast as a Transformer-based seq2seq GEC system.",
}
```