# Text-Complexity-Classification

**Repository Path**: tony-han/Text-Complexity-Classification

## Basic Information

- **Project Name**: Text-Complexity-Classification
- **Description**: The goal of our project is to build a classifier that assigns English texts to one of three readability levels, intended for readers who have some difficulty understanding English texts, especially language learners. We also tested the hypothesis that clustering texts by topic can improve text complexity classification, because texts within a topic are more pertinent to one another.
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2021-04-10
- **Last Updated**: 2021-06-24

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# Text-Complexity-Classification
The goal of our project is to build a classifier that assigns English texts to one of three readability levels, intended for readers who have some difficulty understanding English texts, especially language learners. We also tested the hypothesis that clustering texts by topic can improve text complexity classification, because texts within a topic are more pertinent to one another.

After clustering the texts by topic with K-Means, we applied three classification models (SVM, logistic regression, and C-LSTM) both to the whole dataset and to each topic subgroup. The results show that on the unclustered data the SVM model is the most accurate of the three. On the clustered data, however, it does not outperform the logistic regression and C-LSTM models. C-LSTM is more resilient to the size of the training data and better captures both global and local contextual information.

We conducted a detailed error analysis of the low-accuracy issue, which indicates that the success of text classification may depend on data pre-processing, the choice of model assumptions, and the size of the training data. Overall, our analysis and experiments highlight the important roles of feature design, the choice of word representation, and the ability of models to capture generative text information in text readability classification.
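
The two-stage setup described above (topic clustering with K-Means, then classification within each cluster) can be sketched roughly as follows. This is a minimal illustration using scikit-learn, not the project's actual code: the TF-IDF features, the level names, the toy texts, and the helper names `fit_clustered_classifiers` and `predict_levels` are all assumptions made for the example, and a linear SVM stands in for any of the three classifiers.

```python
from collections import defaultdict

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC


def fit_clustered_classifiers(texts, labels, n_clusters=2, seed=0):
    """Cluster texts by topic, then fit one linear SVM per cluster.

    Hypothetical helper: a classifier trained on all data is kept as a
    fallback for clusters that contain only one readability level.
    """
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(texts)

    # Stage 1: topic clustering on the TF-IDF vectors.
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    cluster_ids = kmeans.fit_predict(X)

    # Group training row indices by their assigned cluster.
    rows_by_cluster = defaultdict(list)
    for row, cluster in enumerate(cluster_ids):
        rows_by_cluster[int(cluster)].append(row)

    # Baseline: one classifier on the unclustered data.
    global_clf = LinearSVC().fit(X, labels)

    # Stage 2: one classifier per cluster, where at least two levels occur.
    per_cluster = {}
    for cluster, rows in rows_by_cluster.items():
        y = [labels[r] for r in rows]
        if len(set(y)) > 1:
            per_cluster[cluster] = LinearSVC().fit(X[rows], y)

    return vectorizer, kmeans, per_cluster, global_clf


def predict_levels(model, texts):
    """Route each text to its topic cluster and use that cluster's classifier."""
    vectorizer, kmeans, per_cluster, global_clf = model
    X = vectorizer.transform(texts)
    clusters = kmeans.predict(X)
    return [
        per_cluster.get(int(c), global_clf).predict(X[i:i + 1])[0]
        for i, c in enumerate(clusters)
    ]


# Toy data (invented for illustration): two topics, three assumed level names.
TEXTS = [
    "The dog ran to the cat.",
    "The cat and the dog play in the park every morning.",
    "Canine and feline behaviour exhibits considerable interspecies variation.",
    "The bank has money.",
    "The bank raised interest rates on savings accounts this year.",
    "Monetary tightening compressed equity valuations across financial markets.",
]
LABELS = [
    "elementary", "intermediate", "advanced",
    "elementary", "intermediate", "advanced",
]

model = fit_clustered_classifiers(TEXTS, LABELS, n_clusters=2)
preds = predict_levels(model, ["The dog and the cat play.", "The bank raised rates."])
```

Comparing the predictions of `global_clf` alone with those of the per-cluster classifiers on a held-out set would mirror the clustered versus unclustered comparison described above. With subgroups this small, the per-cluster models see far less training data, which is one plausible way the training-size sensitivity noted in the error analysis can arise.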