# topically-driven-language-model

**Repository Path**: pdsxsf/topically-driven-language-model

## Basic Information

- **Project Name**: topically-driven-language-model
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2017-09-03
- **Last Updated**: 2021-06-20

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# Requirements

- python2.7 (python3 code available in python3 branch)
- gensim: `pip install gensim`
- tensorflow 0.8-0.12

# Data format

- One line per document
- Sentences are delimited by tabs in each document
- See examples in data/

# Running the code (example.sh)

#### Train a word2vec model using gensim

This step is *optional*; you only need it if you want to initialise TDLM with pre-trained embeddings. The word2vec model settings are in the python file (word2vec_train.py).

`python word2vec_train.py`

#### Train a model; configurations/hyper-parameters are defined in tdlm_config.py

`python tdlm_train.py`

#### All test inferences are invoked with tdlm_test.py, e.g. to compute language and topic model perplexity

`python tdlm_test.py -m output/toy-model/ -d data/toy-valid.txt --print_perplexity`

#### Print topics (to topics.txt)

`python tdlm_test.py -m output/toy-model/ -d data/toy-valid.txt --output_topic topics.txt`

#### Infer topic distribution in documents (saved as a npy file)

`python tdlm_test.py -m output/toy-model/ -d data/toy-valid.txt --output_topic_dist topic-dist.npy`

#### Generate sentences conditioned on topics

`python tdlm_test.py -m output/toy-model/ -d data/toy-valid.txt --gen_sent_on_topic topic-sents.txt`

#### tdlm_test.py arguments:

```
usage: tdlm_test.py [-h] -m MODEL_DIR [-d INPUT_DOC] [-l INPUT_LABEL]
                    [-t INPUT_TAG] [--print_perplexity] [--print_acc]
                    [--output_topic OUTPUT_TOPIC]
                    [--output_topic_dist OUTPUT_TOPIC_DIST]
                    [--output_tag_embedding OUTPUT_TAG_EMBEDDING]
                    [--gen_sent_on_topic GEN_SENT_ON_TOPIC]
                    [--gen_sent_on_doc GEN_SENT_ON_DOC]

Given a trained TDLM model, perform various test inferences

optional arguments:
  -h, --help            show this help message and exit
  -m MODEL_DIR, --model_dir MODEL_DIR
                        directory of the saved model
  -d INPUT_DOC, --input_doc INPUT_DOC
                        input file containing the test documents
  -l INPUT_LABEL, --input_label INPUT_LABEL
                        input file containing the test labels
  -t INPUT_TAG, --input_tag INPUT_TAG
                        input file containing the test tags
  --print_perplexity    print topic and language model perplexity of the input
                        test documents
  --print_acc           print supervised classification accuracy
  --output_topic OUTPUT_TOPIC
                        output file to save the topics (prints top-N words of
                        each topic)
  --output_topic_dist OUTPUT_TOPIC_DIST
                        output file to save the topic distribution of input
                        docs (npy format)
  --output_tag_embedding OUTPUT_TAG_EMBEDDING
                        output tag embeddings to file (npy format)
  --gen_sent_on_topic GEN_SENT_ON_TOPIC
                        generate sentences conditioned on topics
  --gen_sent_on_doc GEN_SENT_ON_DOC
                        generate sentences conditioned on input test documents
```

# Publication

Lau, Jey Han, Timothy Baldwin and Trevor Cohn (to appear) [Topically Driven Neural Language Model](https://arxiv.org/abs/1704.08012). In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017), Vancouver, Canada.
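
# Example: reading and writing the data format

The data format this README describes (one document per line, sentences separated by tabs) can be produced and checked with a few lines of Python. The sketch below is not part of the repository; the helper names `read_documents` and `write_documents` are hypothetical, and skipping blank lines and using UTF-8 are assumptions rather than documented behaviour of TDLM.

```python
import io


def read_documents(path):
    """Return a list of documents; each document is a list of sentence strings."""
    documents = []
    with io.open(path, encoding="utf-8") as handle:  # assumption: files are UTF-8
        for line in handle:
            line = line.rstrip("\n")
            if not line.strip():
                continue  # assumption: blank lines are skipped
            # Sentences within one document are tab-delimited.
            sentences = [s.strip() for s in line.split("\t") if s.strip()]
            documents.append(sentences)
    return documents


def write_documents(documents, path):
    """Write documents as one line each, joining sentences with tabs."""
    with io.open(path, "w", encoding="utf-8") as handle:
        for sentences in documents:
            handle.write(u"\t".join(sentences) + u"\n")
```

For instance, `read_documents("data/toy-valid.txt")` should give one list of sentences per document in the toy validation file, which is a quick way to confirm that a new corpus follows the same layout before training.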
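
# Example: scripting the test inferences

The `tdlm_test.py` invocations listed in this README can also be driven from Python. The following is a minimal sketch, not the contents of example.sh: the function names are hypothetical, and it uses only the flags and toy paths shown in the README. It assumes it is run from the repository root with a trained model already present in `output/toy-model/`.

```python
import subprocess


def build_test_command(model_dir, input_doc, extra_args):
    """Build the argument list for one tdlm_test.py invocation."""
    return ["python", "tdlm_test.py", "-m", model_dir, "-d", input_doc] + list(extra_args)


def toy_inference_commands(model_dir="output/toy-model/", input_doc="data/toy-valid.txt"):
    """Return the four README example invocations as argument lists."""
    return [
        build_test_command(model_dir, input_doc, ["--print_perplexity"]),
        build_test_command(model_dir, input_doc, ["--output_topic", "topics.txt"]),
        build_test_command(model_dir, input_doc, ["--output_topic_dist", "topic-dist.npy"]),
        build_test_command(model_dir, input_doc, ["--gen_sent_on_topic", "topic-sents.txt"]),
    ]


def run_all(commands):
    """Run each command in order, stopping at the first failure."""
    for argv in commands:
        subprocess.check_call(argv)  # raises CalledProcessError on a non-zero exit
```

Calling `run_all(toy_inference_commands())` would execute the four inferences one after another. Keeping command construction separate from execution makes it easy to print or inspect the commands before running them.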