# ClusType

**Repository Path**: as11221208/ClusType

## Basic Information

- **Project Name**: ClusType
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: GPL-3.0
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2018-08-07
- **Last Updated**: 2020-12-19

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# ClusType

Source code for SIGKDD'15 paper *[ClusType: Effective Entity Recognition and Typing by Relation Phrase-Based Clustering](http://web.engr.illinois.edu/~xren7/fp611-ren.pdf)* ([Slides](http://web.engr.illinois.edu/~xren7/KDD15-ClusType_v3.pdf)). 

Given a *text corpus* (e.g., a collection of news articles), it performs automatically [entity extraction and typing](https://en.wikipedia.org/wiki/Named-entity_recognition) using [distant supervision](http://deepdive.stanford.edu/distant_supervision) (i.e., examples from external knowledge bases like Freebase). For example, from the sentence "`The best BBQ I’ve tasted in Phoenix `" the system will recognize `BBQ` as *food* and `phoenix` as *location*. More background can be found in our [WWW'16 tutorial](http://web.engr.illinois.edu/~elkishk2/www2016/).

ClusType works on coarse-grained entity types (e.g., Person, Location, Organization); for more fine-grained entity typing, please use [AFET](https://github.com/shanzhenren/AFET) (Ren et al., EMNLP'16).

## Data

- NYT:
  - Corpus: 110k New York Times news articles ([download](https://www.dropbox.com/s/y20wv7xmfgcjx65/nyt13_110k.txt?dl=0))
  - Seed entities: entity linking result by DBpediaSpotlight ([download](https://www.dropbox.com/s/k0qzsvbbpngptjt/seed_nyt.txt?dl=0))  
- Yelp:
  - Corpus: 230k Yelp reviews sampled from [Yelp Dataset](https://www.yelp.com/dataset_challenge) ([download](https://www.dropbox.com/s/nqouxgqmz2fdemy/yelp_230k.txt?dl=0))
  - Seed entities: entity linking result by DBpediaSpotlight ([download](https://www.dropbox.com/s/w628rwpb3kbmuea/seed_yelp.txt?dl=0))
- Tweet:
  - Corpus: 302k tweets from May 2011 ([download](https://www.dropbox.com/s/tlf4qi5siqka14n/tweet_302k.txt?dl=0))
  - Seed entities: entity linking result by DBpediaSpotlight ([download](https://www.dropbox.com/s/c1yuqy3fakga015/tweet_seed.txt?dl=0))


## System Output & Evaluation

The system output on NYT dataset can be downloaded from [here](https://www.dropbox.com/s/s1cqym4qmub3jkt/results.txt?dl=0). We evaluated the result over ~1k (20,874 annotated entity mentions) [gold standard set](https://www.dropbox.com/s/n46gr1aented5n1/gt_nyt.txt?dl=0). Sample output on 50k Yelp reviews can be download from [here](https://www.dropbox.com/s/opzbmth7kq6qe0c/results.txt?dl=0). 

Evaluate the result:
```
python src/evaluation.py -ResultPath -GroundTruthPath
```


## Dependencies

* python 2.7
* numpy, scipy, scikit-learn, lxml, TextBlob and related corpora
```
$ sudo pip install numpy scipy sklearn lxml textblob
$ sudo python -m textblob.download_corpora
```

## Default Run

```
$ ./run.sh  
```

## Run.sh - File path setup
We take Yelp dataset as an example.

Input: text corpus path.
```
RawText='data/yelp/yelp_230k.txt'
```
- format: "docId \TAB document \n"


Input: type mapping file path.
```
TypeFile='data/yelp/type_tid.txt'
```
- format: "type name \TAB typeId \n". "NIL" means "Not-of-Interest"


Input: mapping between Freebase and DBpedia entities. 
```
FreebaseMap='data/freebase_links.nt'
```
- Download [Freebase-to-DBpedia mapping file](https://drive.google.com/open?id=0Bw2KHcvHhx-gQ2RJVVJLSHJGYlk). Place it under "data/" directory


Output: output file from candidate generation (format: "docId \TAB segmented sentence \n").
```
SegmentOutFile='result/segment.txt'
```
- Segments are separated by ",". Entity mention candidates are marked with ":EP". Relation phrases are marked with ":RP".


Output: entity linking result (please download the corresponding seed entity files).
```
SeedFile='data/yelp/seed_yelp.txt'
```
- Format: "docId \TAB entity name \TAB Original Freebase Type \TAB Refined Type \TAB Freebase EntityID \TAB Similarity Score \TAB Relative Rank \n". 
- NOTE: Our entity linking module calls [DBpediaSpotLight Web service](https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/Web-service), which has limited querying speed. This process can be largely accelarated by installing the tool on your local machine [Link](https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/Installation).


Output: entity mentions found in each document.
```
ResultFile='result/yelp/results.txt'
```
- Format: "docId \TAB entity mention \TAB entity type \n".


Output: In-text annotation of entity mentions. 
```
ResultFileInText='result/yelp/resultsInText.txt'
```

## Run.sh - Model parameters

Threshold on significance score for candidate generation.
```
significance="2"
```

Switch on capitalization feature for candidate generation.
```
capitalize="1"
```

Maximal phrase length for candidate generation.
```
maxLength='4'
```

Minimal support of phrases for candidate generation.
```
minSup='30'
```

Number of relation phrase clusters.
```
NumRelationPhraseClusters='500'
```


## Reference

```
@inproceedings{ren2015clustype,
  title={Clustype: Effective entity recognition and typing by relation phrase-based clustering},
  author={Ren, Xiang and El-Kishky, Ahmed and Wang, Chi and Tao, Fangbo and Voss, Clare R and Han, Jiawei},
  booktitle={Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining},
  pages={995--1004},
  year={2015},
  organization={ACM}
}
```