# speech-enhancement

**Repository Path**: linan2/speech-enhancement

## Basic Information

- **Project Name**: speech-enhancement
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2022-01-05
- **Last Updated**: 2022-01-05

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# Speech Enhancement

Tinkering with speech enhancement models.

Borrowed code, models and techniques from:

- Improved Speech Enhancement with the Wave-U-Net ([arXiv](https://arxiv.org/abs/1811.11307))
- Wave-U-Net: a multi-scale neural network for end-to-end audio source separation ([arXiv](https://arxiv.org/pdf/1806.03185.pdf))
- Speech Denoising with Deep Feature Losses ([arXiv](https://arxiv.org/abs/1806.10522), [sound examples](https://ccrma.stanford.edu/~francois/SpeechDenoisingWithDeepFeatureLosses/), [GitHub](https://github.com/francoisgermain/SpeechDenoisingWithDeepFeatureLosses))
- MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis ([arXiv](https://arxiv.org/abs/1910.06711), [sound examples](https://melgan-neurips.github.io/), [GitHub](https://github.com/seungwonpark/melgan))

### Datasets

The following datasets are used:

- The University of Edinburgh [Noisy speech database](https://datashare.is.ed.ac.uk/handle/10283/2791) for the speech enhancement task
- The TUT Acoustic scenes 2016 [dataset](https://zenodo.org/record/45739) is used to train the scene classifier network, which is used for the loss function. ([dataset paper](http://www.cs.tut.fi/~mesaros/pubs/mesaros_eusipco2016-dcase.pdf))
- The CHiME-Home (Computational Hearing in Multisource Environments) [dataset](https://archive.org/details/chime-home) (2015) is also used to train the scene classifier in some experiments
- The "train-clean-100" dataset from [Librispeech](http://www.openslr.org/12), mixed with the TUT acoustic scenes dataset.

### Data format

At the moment, the algorithm requires 32-bit floating-point audio files at a 16 kHz sampling rate to work correctly. You can use `sox` to convert your files. For example, to convert `audiofile.wav` to 32-bit floating-point audio at a 16 kHz sampling rate, run:

```bash
sox audiofile.wav -r 16000 -b 32 -e float audiofile.float.wav
```
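
If you have a whole directory of files to convert, the same command can be wrapped in a small loop. The following is only a sketch, not part of this repository: the function names `float_name` and `convert_dir` are hypothetical, and it assumes `bash` with `sox` on your `PATH` and input files ending in `.wav`.

```bash
#!/usr/bin/env bash

# Hypothetical helper: derive the output name used in the example above
# (audiofile.wav -> audiofile.float.wav).
float_name() {
  printf '%s\n' "${1%.wav}.float.wav"
}

# Hypothetical helper: convert every .wav file in a directory
# (default: current directory) to 32-bit float at 16 kHz.
convert_dir() {
  local dir="${1:-.}" f
  for f in "$dir"/*.wav; do
    [ -e "$f" ] || continue                   # directory has no .wav files
    case "$f" in *.float.wav) continue ;; esac # skip files already converted
    sox "$f" -r 16000 -b 32 -e float "$(float_name "$f")"
  done
}
```

After sourcing the snippet, `convert_dir path/to/wavs` writes a `.float.wav` file next to each input. You can check a result with `soxi -r`, `soxi -b` and `soxi -e`, which print the sample rate, bits per sample and encoding of a file.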