# Learning attention for historical text normalization by learning to pronounce

This repository contains information on how to train and run the models described in:

+ Marcel Bollmann, Joachim Bingel, and Anders Søgaard (2017). *Learning attention for historical text normalization by learning to pronounce.* In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017).

## Requirements

+ Python 3
+ [Keras 1.x](https://keras.io) (the last 1.x release, 1.2.2, is recommended)
+ [mblearn](https://bitbucket.org/mbollmann/mblearn) (a repository containing all of our code that builds on top of Keras)

(The code may not currently run on a TensorFlow backend: it was only ever tested on Theano, and some of the custom extensions might use Theano-specific code.)

You can install all dependencies via:

```bash
pip install -r requirements.txt
```

We **strongly recommend creating a virtualenv** to install these dependencies in.

## Running the code

The supplied bash script `run_example.sh` contains some examples of how to run the code; it will train and evaluate two models on the sample data contained in this repo:

1. a bi-directional encoder/decoder model with attention; and
2. an encoder/decoder model (without attention) using a multi-task learning setup.

Feel free to look at the script and play around with the included parameters.
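Before running the example script, the setup recommended above can be sketched as follows (a sketch assuming `python3 -m venv` is available; the environment name `acl2017-env` is just an example):

```bash
# Create an isolated environment so the Keras 1.x pin does not clash
# with other projects, then install the pinned dependencies into it.
python3 -m venv acl2017-env
source acl2017-env/bin/activate
pip install -r requirements.txt
```

With the environment activated, `run_example.sh` can then be run from the repository root.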
The sample data files used by this script are excerpts from [the Anselm corpus](https://www.linguistics.rub.de/anselm/), taken from texts "B" (`example.train` and `example.test`) and "Me" (`example.aux_train`).

## Data

For our experiments, we used data from [the Anselm corpus](https://www.linguistics.rub.de/anselm/) and the [CELEX2 lexical database](https://catalog.ldc.upenn.edu/ldc96l14). Unfortunately, the full Anselm corpus is not yet publicly available (though a first release is planned for late 2017).

We used the German phonology/wordforms database of CELEX2, with lower-case wordforms only. If you have access to CELEX2, you can prepare it like this to obtain the exact same dataset we used:

```bash
./celex_extract_phonemes.py -r /german/gpw/gpw.cd > gpw.tsv
awk -F '\t' 'BEGIN {OFS=FS} {$1=tolower($1); print}' gpw.tsv > gpw_lower.tsv
```

## Contact

For any questions concerning the code, please contact Marcel Bollmann ().
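As a sanity check for the `awk` lowercasing step in the Data section, here is what it does to a single toy tab-separated line (the wordform/phoneme pair below is made up for illustration and is not taken from CELEX2):

```bash
# Lowercase only the first (wordform) column; the phoneme column stays untouched.
printf 'Haus\thaUs\n' | awk -F '\t' 'BEGIN {OFS=FS} {$1=tolower($1); print}'
# → haus	haUs
```

Setting `OFS=FS` ensures the output keeps the same tab separator as the input.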