# data2text-plan-py

**Repository Path**: henry-nlp/data2text-plan-py

## Basic Information

- **Project Name**: data2text-plan-py
- **Description**: No description available
- **Primary Language**: Python
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2020-03-22
- **Last Updated**: 2020-12-19

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# data2text-plan-py

This repo contains code for [Data-to-Text Generation with Content Selection and Planning](https://arxiv.org/abs/1809.00582) (Puduppully, R., Dong, L., & Lapata, M.; AAAI 2019). The code is based on an earlier fork of [OpenNMT-py](https://github.com/OpenNMT/OpenNMT-py). The PyTorch version is 0.3.1.

## Requirements

All dependencies can be installed via:

```bash
pip install -r requirements.txt
```

Note that the PyTorch version is 0.3.1 and the Python version is 2.7. The path to the PyTorch wheel in `requirements.txt` is configured for CUDA 8.0; change it if you need a different CUDA version.

## Dataset

The boxscore-data json files can be downloaded from the [boxscore-data repo](https://github.com/harvardnlp/boxscore-data). The input dataset for data2text-plan-py can be created by running the script `create_dataset.py` in the `scripts` folder.
The resulting dataset is available at https://drive.google.com/open?id=1R_82ifGiybHKuXnVnC8JhBTW8BAkdwek

## Preprocessing

Assuming the OpenNMT-py input files reside at `~/boxscore-data`, the following commands preprocess the data:

```
BASE=~/boxscore-data
IDENTIFIER=cc

mkdir $BASE/preprocess
python preprocess.py -train_src1 $BASE/rotowire/src_train.txt -train_tgt1 $BASE/rotowire/train_content_plan.txt -train_src2 $BASE/rotowire/inter/train_content_plan.txt -train_tgt2 $BASE/rotowire/tgt_train.txt -valid_src1 $BASE/rotowire/src_valid.txt -valid_tgt1 $BASE/rotowire/valid_content_plan.txt -valid_src2 $BASE/rotowire/inter/valid_content_plan.txt -valid_tgt2 $BASE/rotowire/tgt_valid.txt -save_data $BASE/preprocess/roto -src_seq_length 1000 -tgt_seq_length 1000 -dynamic_dict -train_ptr $BASE/rotowire/train-roto-ptrs.txt
```

The `train-roto-ptrs.txt` file is available along with the dataset and can also be created with the following command:

```
python data_utils.py -mode ptrs -input_path $BASE/rotowire/train.json -train_content_plan $BASE/rotowire/inter/train_content_plan.txt -output_fi $BASE/rotowire/train-roto-ptrs.txt
```

## Training (and Downloading Trained Models)

The command for training the Neural Content Planning model with conditional copy (NCP+CC) is as follows:

```
BASE=~/boxscore-data
IDENTIFIER=cc

python train.py -data $BASE/preprocess/roto -save_model $BASE/gen_model/$IDENTIFIER/roto -encoder_type1 mean -decoder_type1 pointer -enc_layers1 1 -dec_layers1 1 -encoder_type2 brnn -decoder_type2 rnn -enc_layers2 2 -dec_layers2 2 -batch_size 5 -feat_merge mlp -feat_vec_size 600 -word_vec_size 600 -rnn_size 600 -seed 1234 -start_checkpoint_at 4 -epochs 25 -optim adagrad -learning_rate 0.15 -adagrad_accumulator_init 0.1 -report_every 100 -copy_attn -truncated_decoder 100 -gpuid $GPUID -attn_hidden 64 -reuse_copy_attn -start_decay_at 4 -learning_rate_decay 0.97 -valid_batch_size 5
```

`GPUID` is not set above; set it to the id of the GPU to use before running the command.

The NCP+CC model can be downloaded from
https://www.dropbox.com/sh/vo5wb2fuq7m0bk0/AABikW0KomOKIor24wD8VSFWa?dl=0

## Generation

During inference, we first generate the content plan:

```
MODEL_PATH=   # path to the model passed to -model
python translate.py -model $MODEL_PATH -src1 $BASE/rotowire/inf_src_valid.txt -output $BASE/gen/roto_stage1_$IDENTIFIER-beam5_gens.txt -batch_size 10 -max_length 80 -gpu $GPUID -min_length 35 -stage1
```

The following script takes the content plan expressed as indices and produces the content plan expressed as records:

```
python scripts/create_content_plan_from_index.py $BASE/rotowire/inf_src_valid.txt $BASE/gen/roto_stage1_$IDENTIFIER-beam5_gens.txt $BASE/transform_gen/roto_stage1_$IDENTIFIER-beam5_gens.h5-tuples.txt $BASE/gen/roto_stage1_inter_$IDENTIFIER-beam5_gens.txt
```

The accuracy of the first-stage content plan can be evaluated using the following command:

```
python non_rg_metrics.py $BASE/transform_gen/roto-gold-val-beam5_gens.h5-tuples.txt $BASE/transform_gen/roto_stage1_$IDENTIFIER-beam5_gens.h5-tuples.txt
```

The output summary is generated using the following command:

```
MODEL_PATH2=   # path to the model passed to -model2
python translate.py -model $MODEL_PATH -model2 $MODEL_PATH2 -src1 $BASE/rotowire/inf_src_valid.txt -tgt1 $BASE/gen/roto_stage1_$IDENTIFIER-beam5_gens.txt -src2 $BASE/gen/roto_stage1_inter_$IDENTIFIER-beam5_gens.txt -output $BASE/gen/roto_stage2_$IDENTIFIER-beam5_gens.txt -batch_size 10 -max_length 850 -min_length 150 -gpu $GPUID
```

## Automatic evaluation using IE metrics

The RG (relation generation), CS (content selection), and CO (content ordering) metrics are computed using the commands below.
```
python data_utils.py -mode prep_gen_data -gen_fi $BASE/gen/roto_stage2_$IDENTIFIER-beam5_gens.txt -dict_pfx "roto-ie" -output_fi $BASE/transform_gen/roto_stage2_$IDENTIFIER-beam5_gens.h5 -input_path "/boxcore-json/rotowire"
th extractor.lua -gpuid $GPUID -datafile roto-ie.h5 -preddata $BASE/transform_gen/roto_stage2_$IDENTIFIER-beam5_gens.h5 -dict_pfx "roto-ie" -just_eval
python non_rg_metrics.py $BASE/transform_gen/roto-gold-val-beam5_gens.h5-tuples.txt $BASE/transform_gen/roto_stage2_$IDENTIFIER-beam5_gens.h5-tuples.txt
```

## Evaluation using BLEU script

The BLEU perl script can be obtained from https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/multi-bleu.perl

Command to compute the BLEU score:

```
~/multi-bleu.perl $BASE/rotowire/inf_tgt_valid.txt < $BASE/gen/roto_stage2_$IDENTIFIER-beam5_gens.txt
```

## IE models

To train the IE models, follow the updated code in https://github.com/ratishsp/data2text-1, which contains bug fixes for number handling. That repo also provides download links for the IE models.
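
As an optional sanity check before the BLEU step: `multi-bleu.perl` pairs the reference file and the generated file line by line, so the two files should contain the same number of lines. The sketch below is not part of this repo. The helper names `count_lines` and `check_parallel` are hypothetical, and it assumes both files are UTF-8 text.

```python
import io


def count_lines(path):
    """Return the number of lines in a text file (assumed to be UTF-8)."""
    with io.open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def check_parallel(reference_path, hypothesis_path):
    """Return the shared line count, or raise ValueError on a mismatch.

    Hypothetical helper: it only compares line counts and does not
    inspect the content of either file.
    """
    n_ref = count_lines(reference_path)
    n_hyp = count_lines(hypothesis_path)
    if n_ref != n_hyp:
        raise ValueError(
            "line count mismatch: %d reference vs %d generated" % (n_ref, n_hyp)
        )
    return n_ref
```

For example, `check_parallel` could be called with the paths of `inf_tgt_valid.txt` and the stage-2 generation output before invoking `multi-bleu.perl`. `io.open` is used so the same code runs on Python 2.7 as well as Python 3.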