# GECToR – Grammatical Error Correction: Tag, Not Rewrite
This repository provides code for training and testing state-of-the-art models for grammatical error correction with the official PyTorch implementation of the following paper:
> [GECToR – Grammatical Error Correction: Tag, Not Rewrite](https://arxiv.org/abs/2005.12592)
> [Kostiantyn Omelianchuk](https://github.com/komelianchuk), [Vitaliy Atrasevych](https://github.com/atrasevych), [Artem Chernodub](https://github.com/achernodub), [Oleksandr Skurzhanskyi](https://github.com/skurzhanskyi)
> Grammarly
> [15th Workshop on Innovative Use of NLP for Building Educational Applications (co-located with ACL 2020)](https://sig-edu.org/bea/current)
It is mainly based on `AllenNLP` and `transformers`.
## Installation
The following command installs all necessary packages:
```.bash
pip install -r requirements.txt
```
The project was tested using Python 3.7.
## Datasets
All the public GEC datasets used in the paper can be downloaded from [here](https://www.cl.cam.ac.uk/research/nl/bea2019st/#data).
Synthetically created datasets can be generated/downloaded [here](https://github.com/awasthiabhijeet/PIE/tree/master/errorify).
To train the model, the data has to be preprocessed and converted to a special format with the following command:
```.bash
python utils/preprocess_data.py -s SOURCE -t TARGET -o OUTPUT_FILE
```
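Conceptually, the script aligns each source sentence with its corrected target and assigns one edit tag per source token, so that applying the tags reproduces the target ("tag, not rewrite"). Below is a minimal, hand-worked Python sketch of that idea using the basic tags from the paper; the actual output format of `utils/preprocess_data.py` and the full tag inventory (which also covers case, merge, split, and verb-form transformations) are richer than this toy:
```python
# A hand-worked toy example of GECToR-style token-level edit tags
# (illustrative only; utils/preprocess_data.py produces the actual
# training format, and the paper's full tag inventory also includes
# transformations for case, merging, splitting, and verb forms).

source = ["He", "go", "to", "school"]
target = ["He", "went", "to", "school", "."]

# One edit tag per source token that rewrites `source` into `target`:
tags = ["$KEEP", "$REPLACE_went", "$KEEP", "$APPEND_."]

def apply_tags(tokens, tags):
    """Apply basic KEEP/DELETE/REPLACE/APPEND tags left to right."""
    out = []
    for tok, tag in zip(tokens, tags):
        if tag == "$KEEP":
            out.append(tok)                           # copy token unchanged
        elif tag == "$DELETE":
            continue                                  # drop the token
        elif tag.startswith("$REPLACE_"):
            out.append(tag[len("$REPLACE_"):])        # substitute the token
        elif tag.startswith("$APPEND_"):
            out.extend([tok, tag[len("$APPEND_"):]])  # insert after the token
    return out

assert apply_tags(source, tags) == target
```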
## Pretrained models
| Pretrained encoder | Confidence bias | Min error prob | CoNLL-2014 (test) | BEA-2019 (test) |
| --- | --- | --- | --- | --- |
| BERT [link] | 0.10 | 0.41 | 63.0 | 67.6 |
| RoBERTa [link] | 0.20 | 0.50 | 64.0 | 71.5 |
| XLNet [link] | 0.35 | 0.66 | 65.3 | 72.4 |
| RoBERTa + XLNet | 0.24 | 0.45 | 66.0 | 73.7 |
| BERT + RoBERTa + XLNet | 0.16 | 0.40 | 66.5 | 73.6 |
## Train model
To train the model, simply run:
```.bash
python train.py --train_set TRAIN_SET --dev_set DEV_SET \
--model_dir MODEL_DIR
```
There are many parameters to specify; among them:
- `cold_steps_count` - the number of epochs during which only the last linear layer is trained (see the sketch below)
- `transformer_model {bert,distilbert,gpt2,roberta,transformerxl,xlnet,albert}` - the encoder model
- `tn_prob` - the probability of keeping sentences with no errors; helps to balance precision/recall
- `pieces_per_token` - the maximum number of subword pieces per token; helps to avoid CUDA out-of-memory errors

In our experiments we used a 98/2 train/dev split.
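As a point of reference for `cold_steps_count`, here is a minimal PyTorch sketch of the "cold" epochs idea (the module name `classifier` is a hypothetical placeholder; the repository's actual training loop lives in `train.py` and is built on AllenNLP):
```python
import torch.nn as nn

def set_cold_mode(model: nn.Module, cold: bool) -> None:
    """Freeze everything except the final linear layer while `cold` is True."""
    for name, param in model.named_parameters():
        # "classifier" is a hypothetical name for the last linear layer
        param.requires_grad = name.startswith("classifier") if cold else True

# hypothetical usage inside the epoch loop:
# set_cold_mode(model, cold=(epoch < cold_steps_count))
```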
## Training parameters
We described all parameters that we use for training and evaluating [here](https://github.com/grammarly/gector/blob/master/docs/training_parameters.md).
## Model inference
To run your model on an input file, use the following command:
```.bash
python predict.py --model_path MODEL_PATH [MODEL_PATH ...] \
--vocab_path VOCAB_PATH --input_file INPUT_FILE \
--output_file OUTPUT_FILE
```
Among the parameters:
- `min_error_probability` - minimum error probability (as in the paper; sketched below)
- `additional_confidence` - confidence bias (as in the paper; sketched below)
- `special_tokens_fix` - set this to reproduce some of the reported results of the pretrained models
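Both flags implement the inference tweaks from the paper: `additional_confidence` adds a permanent positive bias to the probability of the `$KEEP` tag, and `min_error_probability` leaves a sentence unchanged when the model is not confident enough that it contains an error. A minimal NumPy sketch of the idea (the function name, array shapes, and the way the error probability is estimated here are assumptions, not the repository's API):
```python
import numpy as np

def adjust_tag_predictions(tag_probs: np.ndarray, keep_idx: int,
                           additional_confidence: float,
                           min_error_probability: float) -> np.ndarray:
    """tag_probs: (num_tokens, num_tags) softmax scores for one sentence."""
    probs = tag_probs.copy()
    probs[:, keep_idx] += additional_confidence  # bias predictions toward $KEEP
    best = probs.argmax(axis=1)
    # assumption: estimate the sentence's error probability as the highest
    # probability assigned to any non-$KEEP tag
    max_error_prob = np.delete(probs, keep_idx, axis=1).max()
    if max_error_prob < min_error_probability:
        best[:] = keep_idx  # below the threshold: leave the sentence unchanged
    return best
```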
For evaluation use [M^2Scorer](https://github.com/nusnlp/m2scorer) and [ERRANT](https://github.com/chrisjbryant/errant).
## Text Simplification
The code and README for the Text Simplification version of GECToR will be added soon.
## Citation
If you find this work useful for your research, please cite our paper:
```
@inproceedings{omelianchuk-etal-2020-gector,
    title = "{GECT}o{R} {--} Grammatical Error Correction: Tag, Not Rewrite",
    author = "Omelianchuk, Kostiantyn and
      Atrasevych, Vitaliy and
      Chernodub, Artem and
      Skurzhanskyi, Oleksandr",
    booktitle = "Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications",
    month = jul,
    year = "2020",
    address = "Seattle, WA, USA → Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.bea-1.16",
    pages = "163--170",
    abstract = "In this paper, we present a simple and efficient GEC sequence tagger using a Transformer encoder. Our system is pre-trained on synthetic data and then fine-tuned in two stages: first on errorful corpora, and second on a combination of errorful and error-free parallel corpora. We design custom token-level transformations to map input tokens to target corrections. Our best single-model/ensemble GEC tagger achieves an F{\_}0.5 of 65.3/66.5 on CONLL-2014 (test) and F{\_}0.5 of 72.4/73.6 on BEA-2019 (test). Its inference speed is up to 10 times as fast as a Transformer-based seq2seq GEC system.",
}
```