# Learning attention for historical text normalization by learning to pronounce

This repository contains information on how to train and run the models described in:

+ Marcel Bollmann, Joachim Bingel, and Anders Søgaard (2017). *Learning attention for historical text normalization by learning to pronounce.* In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017).

## Requirements

+ Python 3
+ [Keras 1.x](https://keras.io) (the last 1.x release, 1.2.2, is recommended)
+ [mblearn](https://bitbucket.org/mbollmann/mblearn) (a repository containing all of our code that builds on top of Keras)

(The code may not currently run on a TensorFlow backend: it was only ever tested on Theano, and some of the custom extensions might use Theano-specific code.)

You can install all dependencies via:

```bash
pip install -r requirements.txt
```

We **strongly recommend creating a virtualenv** to install these dependencies in.

## Running the code

The supplied bash script `run_example.sh` contains some examples of how to run the code; it will train and evaluate two models on the sample data contained in this repo:

1. a bi-directional encoder/decoder model with attention; and
2. an encoder/decoder model (without attention) using a multi-task learning setup.

Feel free to look at the script and play around with the included parameters.
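Before running the example script, the setup recommended above can be sketched as follows (a sketch assuming `python3 -m venv` is available; the environment name `acl2017-env` is just an example):

```bash
# Create an isolated environment so the Keras 1.x pin does not clash
# with other projects, then install the pinned dependencies into it.
python3 -m venv acl2017-env
source acl2017-env/bin/activate
pip install -r requirements.txt
```

With the environment activated, `run_example.sh` can then be run from the repository root.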
The sample data files used by this script are excerpts from [the Anselm corpus](https://www.linguistics.rub.de/anselm/), taken from texts "B" (`example.train` and `example.test`) and "Me" (`example.aux_train`).

## Data

For our experiments, we used data from [the Anselm corpus](https://www.linguistics.rub.de/anselm/) and the [CELEX2 lexical database](https://catalog.ldc.upenn.edu/ldc96l14). Unfortunately, the full Anselm corpus is not yet publicly available (though a first release is planned for late 2017).

We used the German phonology/wordforms database of CELEX2, with lower-case wordforms only. If you have access to CELEX2, you can prepare it like this to obtain the exact same dataset we used:

```bash
./celex_extract_phonemes.py -r /german/gpw/gpw.cd > gpw.tsv
awk -F '\t' 'BEGIN {OFS=FS} {$1=tolower($1); print}' gpw.tsv > gpw_lower.tsv
```

## Contact

For any questions concerning the code, please contact Marcel Bollmann ().
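As a sanity check for the `awk` lowercasing step in the Data section, here is what it does to a single toy tab-separated line (the wordform/phoneme pair below is made up for illustration and is not taken from CELEX2):

```bash
# Lowercase only the first (wordform) column; the phoneme column stays untouched.
printf 'Haus\thaUs\n' | awk -F '\t' 'BEGIN {OFS=FS} {$1=tolower($1); print}'
# → haus	haUs
```

Setting `OFS=FS` ensures the output keeps the same tab separator as the input.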