# CNN- and LSTM-based Claim Classification in Online User Comments

This project contains experimental code for classifying claims using convolutional neural networks (CNNs) and long short-term memory networks (LSTMs).

Please use the following citation:

```
@inproceedings{guggilla2016cnn,
    author    = {Chinnappa Guggilla and Tristan Miller and Iryna Gurevych},
    title     = {{CNN}- and {LSTM}-based Claim Classification in Online User Comments},
    year      = 2016,
    booktitle = {Proceedings of the 26th International Conference on Computational Linguistics: Technical Papers (COLING 2016)},
    month     = dec,
    url       = {https://www.ukp.tu-darmstadt.de/fileadmin/user_upload/Group_UKP/publikationen/2016/2016_COLING_CG.pdf},
    pages     = {2740--2751},
    isbn      = {978-4-87974-702-0},
}
```

> **Abstract:** When processing arguments in online user interactive discourse, it is often necessary to determine their bases of support. In this paper, we describe a supervised approach, based on deep neural networks, for classifying the claims made in online arguments. We conduct experiments using convolutional neural networks (CNNs) and long short-term memory networks (LSTMs) on two claim data sets compiled from online user comments. Using different types of distributional word embeddings, but without incorporating any rich, expensive set of features, we achieve a significant improvement over the state of the art for one data set (which categorizes arguments as factual vs.
emotional), and performance comparable to the state of the art on the other data set (which categorizes claims according to their verifiability). Our approach has the advantages of using a generalized, simple, and effective methodology that works for claim categorization on different data sets and tasks.

Contact person: Tristan Miller, miller@ukp.informatik.tu-darmstadt.de

https://www.ukp.tu-darmstadt.de/

https://www.tu-darmstadt.de/

Don't hesitate to send us an e-mail or report an issue if something is broken (and it shouldn't be) or if you have further questions.

> This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.

## Project structure

* `cnn_claim_classification` -- Experiments using Kim's [Convolutional Neural Networks for Sentence Classification](http://arxiv.org/abs/1408.5882): an embedding layer followed by a convolution layer.
* `lstm_claim_classification` -- Experiments similar to the above, but concatenating an embedding layer with an [LSTM-RNN module](http://deeplearning.net/tutorial/lstm.html#lstm).
* `compile_embeddings_factbankcorpus` -- Code for compiling the factual embeddings used in the above experiments.

## Requirements

* Software dependencies
    * Perl 5
    * Python 2.7
    * [Theano](http://deeplearning.net/software/theano/)
    * [pandas](http://pandas.pydata.org/)
    * [gensim](https://radimrehurek.com/gensim/install.html)
* Data sets
    * Park & Cardie's [verifiable and unverifiable claims data set](http://www.aclweb.org/anthology/W14-2105)
    * Oraby et al.'s [Fact-Feeling data set](https://nlds.soe.ucsc.edu/node/33)
* Embeddings
    * Word2Vec (Mikolov)
    * Dependency embeddings (Levy et al.)
    * Factual embeddings -- compile these from the FactBank 1.0 corpus using Gensim; see the instructions in [`compile_embeddings_factbankcorpus`](compile_embeddings_factbankcorpus/README.md)
    * Concatenated embeddings -- concatenate all three embeddings into stacked embeddings of 300 dimensions

Embeddings should be placed in an `embeddings` folder.

## Running the experiments

1. Preprocessing

   To preprocess the data sets using the different embeddings for CNN claim classification, run these scripts, which convert the data sets into embedding vectors/matrices:

   ````
   python preprocess_data_verify.py ../embeddings/GoogleNews-vectors-negative300.bin ../embeddings/deps.words ../embeddings/factual.en.word2vec.model.bin
   python preprocess_data_factfeel.py ../embeddings/GoogleNews-vectors-negative300.bin ../embeddings/deps.words ../embeddings/factual.en.word2vec.model.bin
   ````

   To preprocess the data sets using the different embeddings for LSTM claim classification:

   ````
   python preprocess_data_verify.py ../embeddings/GoogleNews-vectors-negative300.bin ../embeddings/deps.words ../embeddings/factual.en.word2vec.model.bin
   python preprocess_data_factfeel.py ../embeddings/GoogleNews-vectors-negative300.bin ../embeddings/deps.words ../embeddings/factual.en.word2vec.model.bin
   ````

   In both cases, the scripts will create the embedding weight matrices and a word dictionary in pickle format. The files are generated in the current directory.

2. Create the output directories for predictions

   ````
   mkdir cnn_claim_classification/predictions
   mkdir lstm_claim_classification/predictions
   ````

3. Perform CNN-based classification

   * Verifiable and unverifiable claims data set:

     ````
     python conv_net_sentence_verify.py -word2vec
     python conv_net_sentence_verify.py -dep2vec
     python conv_net_sentence_verify.py -fact2vec
     python conv_net_sentence_verify.py -concat
     ````

   * Fact-Feeling data set:

     ````
     python conv_net_sentence_factfeel.py -word2vec
     python conv_net_sentence_factfeel.py -dep2vec
     python conv_net_sentence_factfeel.py -fact2vec
     python conv_net_sentence_factfeel.py -concat
     ````

4. Perform LSTM-based classification

   * Verifiable and unverifiable claims data set:

     ````
     python lstm_verify.py -word2vec
     python lstm_verify.py -dep2vec
     python lstm_verify.py -fact2vec
     python lstm_verify.py -concat
     ````

   * Fact-Feeling data set:

     ````
     python lstm_factfeel.py -word2vec
     python lstm_factfeel.py -dep2vec
     python lstm_factfeel.py -fact2vec
     python lstm_factfeel.py -concat
     ````

5. Check the output

   The claim predictions for each embedding type will be stored in the corresponding `predictions` folder. Accuracies are reported in the console output; choose the best accuracy across iterations.
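
Since the best accuracy must be picked from the per-iteration console output, a small helper like the following can do it automatically if you redirect the output to a log file. This is a sketch, not part of the repository; the line format matched by the regular expression (something like `val accuracy: 0.74`) is an assumption, so adapt the pattern to whatever the scripts actually print.

```python
import re

def best_accuracy(log_text):
    """Return the highest accuracy found in console output, or None.

    Assumes (hypothetically) that each iteration prints a line containing
    something like 'accuracy: 0.7342'; adjust the pattern to match the
    actual log format of the classification scripts.
    """
    scores = [float(m) for m in
              re.findall(r"accuracy[:=]?\s*([0-9.]+)", log_text,
                         flags=re.IGNORECASE)]
    return max(scores) if scores else None

if __name__ == "__main__":
    sample = "epoch 1, val accuracy: 0.71\nepoch 2, val accuracy: 0.74"
    print(best_accuracy(sample))  # 0.74
```

For example, after `python lstm_verify.py -concat > run.log`, pass the contents of `run.log` to `best_accuracy`.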
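
For reference, the concatenated embeddings mentioned under Requirements amount to stacking the per-word vectors from the three models. A minimal NumPy sketch, assuming each vector has already been looked up from its respective model (the function and variable names here are illustrative, not from the repository):

```python
import numpy as np

def concat_embedding(word2vec_vec, dep_vec, fact_vec):
    """Stack three per-word embedding vectors into one.

    The inputs are assumed (illustratively) to be 1-D numpy arrays looked
    up from the Word2Vec, dependency, and factual embedding models; the
    result's dimensionality is the sum of the input dimensionalities.
    """
    return np.concatenate([word2vec_vec, dep_vec, fact_vec])

# Toy example with made-up dimensions:
v = concat_embedding(np.zeros(300), np.zeros(300), np.zeros(100))
print(v.shape)  # (700,)
```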