# radtts
**Repository Path**: mirrors_NVIDIA/radtts
## Basic Information
- **Project Name**: radtts
- **Description**: Provides training, inference and voice conversion recipes for RADTTS and RADTTS++: Flow-based TTS models with Robust Alignment Learning, Diverse Synthesis, and Generative Modeling and Fine-Grained Control over of Low Dimensional (F0 and Energy) Speech Attributes.
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2022-06-28
- **Last Updated**: 2026-03-14
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
# Flow-based TTS with Robust Alignment Learning, Diverse Synthesis, and Generative Modeling and Fine-Grained Control over of Low Dimensional (F0 and Energy) Speech Attributes.
This repository contains the source code and several checkpoints for our work based on RADTTS. RADTTS is a normalizing-flow-based TTS framework with state of the art acoustic fidelity and a highly robust audio-transcription alignment module. Our project page and some samples can be found [here](https://nv-adlr.github.io/RADTTS), with relevant works listed [here](#relevant-papers).
This repository can be used to train the following models:
- A normalizing-flow bipartite architecture for mapping text to mel spectrograms
- A variant of the above, conditioned on F0 and Energy
- Normalizing flow models for explicitly modeling text-conditional phoneme duration, fundamental frequency (F0), and energy
- A standalone alignment module for learning unspervised text-audio alignments necessary for TTS training
## HiFi-GAN vocoder pre-trained models
We provide a [checkpoint](https://drive.google.com/file/d/1lD62jl5hF6T5AkGoWKOcgMZuMR4Ir76d/view?usp=sharing) and [config](https://drive.google.com/file/d/1WRtyvkmQxlYShkeTwWmlj7_WiS70R7Jb/view?usp=sharing) for a HiFi-GAN vocoder trained on LibriTTS 100 and 360.
For a HiFi-GAN vocoder trained on LJS, please download the v1 model provided by the HiFi-GAN authors [here](https://github.com/jik876/hifi-gan), .
## RADTTS pre-trained models
| Model name | Description | Dataset |
|---------------------------|---------------------------------------------------------|----------------------------------------------|
| [RADTTS++DAP-LJS](https://drive.google.com/file/d/1Rb2VMUwQahGrnpFSlAhCPh7OpDN3xgOr/view?usp=sharing) | RADTTTS model conditioned on F0 and Energy with deterministic attribute predictors | LJSpeech Dataset
We will soon provide more pre-trained RADTTS models with generative attribute predictors trained on LJS and LibriTTS. Stay tuned!
## Setup
1. Clone this repo: `git clone https://github.com/NVIDIA/RADTTS.git`
2. Install python requirements or build docker image
- Install python requirements: `pip install -r requirements.txt`
3. Update the filelists inside the filelists folder and json configs to point to your data
- `basedir` – the folder containing the filelists and the audiodir
- `audiodir` – name of the audiodir
- `filelist` – | (pipe) separated text file with relative audiopath, text, speaker, and optionally categorical label and audio duration in seconds
## Training RADTTS (without pitch and energy conditioning)
1. Train the decoder
`python train.py -c config_ljs_radtts.json -p train_config.output_directory=outdir`
2. Further train with the duration predictor
`python train.py -c config_ljs_radtts.json -p train_config.output_directory=outdir_dir train_config.warmstart_checkpoint_path=model_path.pt model_config.include_modules="decatndur"`
## Training RADTTS++ (with pitch and energy conditioning)
1. Train the decoder
`python train.py -c config_ljs_decoder.json -p train_config.output_directory=outdir`
2. Train the attribute predictor: autoregressive flow (agap), bi-partite flow (bgap) or deterministic (dap)
`python train.py -c config_ljs_{agap,bgap,dap}.json -p train_config.output_directory=outdir_wattr train_config.warmstart_checkpoint_path=model_path.pt`
## Training starting from a pre-trained model, ignoring the speaker embedding table
1. Download our pre-trained model
2. `python train.py -c config.json -p train_config.ignore_layers_warmstart=["speaker_embedding.weight"] train_config.warmstart_checkpoint_path=model_path.pt`
## Multi-GPU (distributed)
1. `python -m torch.distributed.launch --use_env --nproc_per_node=NUM_GPUS_YOU_HAVE train.py -c config.json -p train_config.output_directory=outdir`
## Inference demo
1. `python inference.py -c CONFIG_PATH -r RADTTS_PATH -v HG_PATH -k HG_CONFIG_PATH -t TEXT_PATH -s ljs --speaker_attributes ljs --speaker_text ljs -o results/`
## Inference Voice Conversion demo
1. `python inference_voice_conversion.py --radtts_path RADTTS_PATH --radtts_config_path RADTTS_CONFIG_PATH --vocoder_path HG_PATH --vocoder_config_path HG_CONFIG_PATH --f0_mean=211.413 --f0_std=46.6595 --energy_mean=0.724884 --energy_std=0.0564605 --output_dir=results/ -p data_config.validation_files="{'Dummy': {'basedir': 'data/', 'audiodir':'22khz', 'filelist': 'vc_audiopath_txt_speaker_emotion_duration_filelist.txt'}}"`
## Config Files
| Filename | Description | Nota bene |
|--------------------------|---------------------------------------------------------|------------------------------------------------|
| [config\_ljs_decoder.json](https://github.com/NVIDIA/radtts/blob/main/configs/config_ljs_decoder.json) | Config for the decoder conditioned on F0 and Energy | |
| [config\_ljs_radtts.json](https://github.com/NVIDIA/radtts/blob/main/configs/config_ljs_radtts.json) | Config for the decoder not conditioned on F0 and Energy | |
| [config\_ljs_agap.json](https://github.com/NVIDIA/radtts/blob/main/configs/config_ljs_agap.json) | Config for the Autoregressive Flow Attribute Predictors | Requires at least pre-trained alignment module |
| [config\_ljs_bgap.json](https://github.com/NVIDIA/radtts/blob/main/configs/config_ljs_bgap.json) | Config for the Bi-Partite Flow Attribute Predictors | Requires at least pre-trained alignment module |
| [config\_ljs_dap.json](https://github.com/NVIDIA/radtts/blob/main/configs/config_ljs_dap.json) | Config for the Deterministic Attribute Predictors | Requires at least pre-trained alignment module |
## LICENSE
Unless otherwise specified, the source code within this repository is provided under the
[MIT License](LICENSE)
## Acknowledgements
The code in this repository is heavily inspired by or makes use of source code from the following works:
- Tacotron implementation from [Keith Ito](https://github.com/keithito/tacotron/)
- STFT code from [Prem Seetharaman](https://github.com/pseeth/pytorch-stft)
- [Masked Autoregressive Flows](https://arxiv.org/abs/1705.07057)
- [Flowtron](https://arxiv.org/abs/2005.05957)
- Source for neural spline functions used in this work: https://github.com/ndeutschmann/zunis
- Original Source for neural spline functions: https://github.com/bayesiains/nsf
- Bipartite Architecture based on code from [WaveGlow](https://github.com/NVIDIA/waveglow)
- [HiFi-GAN](https://github.com/jik876/hifi-gan)
- [Glow-TTS](https://github.com/jaywalnut310/glow-tts)
## Relevant Papers
Rohan Badlani, Adrian Łańcucki, Kevin J. Shih, Rafael Valle, Wei Ping, Bryan Catanzaro.
[One TTS Alignment to Rule Them All.](https://ieeexplore.ieee.org/abstract/document/9747707) ICASSP 2022
Kevin J Shih, Rafael Valle, Rohan Badlani, Adrian Lancucki, Wei Ping, Bryan Catanzaro.
[RAD-TTS: Parallel flow-based TTS with robust alignment learning and diverse synthesis.](https://openreview.net/pdf?id=0NQwnnwAORi)
ICML Workshop on Invertible Neural Networks, Normalizing Flows, and Explicit Likelihood Models 2021
Kevin J Shih, Rafael Valle, Rohan Badlani, João Felipe Santos, Bryan Catanzaro.
[Generative Modeling for Low Dimensional Speech Attributes with Neural Spline Flows.](https://arxiv.org/pdf/2203.01786) Technical Report