# radtts

**Repository Path**: mirrors_NVIDIA/radtts

## Basic Information

- **Project Name**: radtts
- **Description**: Provides training, inference and voice conversion recipes for RADTTS and RADTTS++: flow-based TTS models with robust alignment learning, diverse synthesis, and generative modeling and fine-grained control over low-dimensional (F0 and energy) speech attributes.
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: main
- **Homepage**: None
- **Created**: 2022-06-28
- **Last Updated**: 2026-03-14

## README

# Flow-based TTS with Robust Alignment Learning, Diverse Synthesis, and Generative Modeling and Fine-Grained Control over Low-Dimensional (F0 and Energy) Speech Attributes

This repository contains the source code and several checkpoints for our work based on RADTTS. RADTTS is a normalizing-flow-based TTS framework with state-of-the-art acoustic fidelity and a highly robust audio-transcription alignment module. Our project page and some samples can be found [here](https://nv-adlr.github.io/RADTTS), with relevant works listed [here](#relevant-papers).

This repository can be used to train the following models:

- A normalizing-flow bipartite architecture for mapping text to mel spectrograms
- A variant of the above, conditioned on F0 and energy
- Normalizing-flow models for explicitly modeling text-conditional phoneme duration, fundamental frequency (F0), and energy
- A standalone alignment module for learning the unsupervised text-audio alignments necessary for TTS training

## HiFi-GAN vocoder pre-trained models

We provide a [checkpoint](https://drive.google.com/file/d/1lD62jl5hF6T5AkGoWKOcgMZuMR4Ir76d/view?usp=sharing) and [config](https://drive.google.com/file/d/1WRtyvkmQxlYShkeTwWmlj7_WiS70R7Jb/view?usp=sharing) for a HiFi-GAN vocoder trained on LibriTTS 100 and 360.
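The bipartite architecture mentioned in the model list above is built from coupling steps: half of the channels pass through unchanged and parameterize an invertible affine transform of the other half. Below is a minimal, purely illustrative sketch of one such step, with toy fixed matrices standing in for the coupling networks; it is not the repository's implementation.

```python
import math
import random

random.seed(0)
# Toy parameters standing in for the scale/shift coupling networks.
w_s = [[random.gauss(0, 0.1) for _ in range(2)] for _ in range(2)]
w_t = [[random.gauss(0, 0.1) for _ in range(2)] for _ in range(2)]

def matvec(w, v):
    return [sum(wi * vi for wi, vi in zip(row, v)) for row in w]

def forward(x):
    """Affine coupling: transform the second half conditioned on the first."""
    xa, xb = x[:2], x[2:]
    log_s, t = matvec(w_s, xa), matvec(w_t, xa)
    yb = [b * math.exp(s) + tt for b, s, tt in zip(xb, log_s, t)]
    log_det = sum(log_s)  # Jacobian log-determinant, used for the flow loss
    return xa + yb, log_det

def inverse(y):
    """Exact inverse: the pass-through half reproduces log_s and t."""
    ya, yb = y[:2], y[2:]
    log_s, t = matvec(w_s, ya), matvec(w_t, ya)
    xb = [(b - tt) * math.exp(-s) for b, s, tt in zip(yb, log_s, t)]
    return ya + xb

x = [0.3, -1.2, 0.7, 2.0]
y, log_det = forward(x)
x_rec = inverse(y)
assert all(abs(a - b) < 1e-12 for a, b in zip(x, x_rec))  # invertible
```

Real flows stack many such steps with learned networks and permutations between them; the key property sketched here is that the transform stays exactly invertible with a cheap log-determinant.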
For a HiFi-GAN vocoder trained on LJS, please download the v1 model provided by the HiFi-GAN authors [here](https://github.com/jik876/hifi-gan).

## RADTTS pre-trained models

| Model name | Description | Dataset |
|------------|-------------|---------|
| [RADTTS++DAP-LJS](https://drive.google.com/file/d/1Rb2VMUwQahGrnpFSlAhCPh7OpDN3xgOr/view?usp=sharing) | RADTTS model conditioned on F0 and Energy with deterministic attribute predictors | LJSpeech Dataset |

We will soon provide more pre-trained RADTTS models with generative attribute predictors trained on LJS and LibriTTS. Stay tuned!

## Setup

1. Clone this repo: `git clone https://github.com/NVIDIA/RADTTS.git`
2. Install python requirements or build docker image
    - Install python requirements: `pip install -r requirements.txt`
3. Update the filelists inside the filelists folder and the json configs to point to your data
    - `basedir` – the folder containing the filelists and the audiodir
    - `audiodir` – name of the audiodir
    - `filelist` – | (pipe) separated text file with relative audiopath, text, speaker, and optionally categorical label and audio duration in seconds

## Training RADTTS (without pitch and energy conditioning)

1. Train the decoder:
   `python train.py -c config_ljs_radtts.json -p train_config.output_directory=outdir`
2. Further train with the duration predictor:
   `python train.py -c config_ljs_radtts.json -p train_config.output_directory=outdir_dir train_config.warmstart_checkpoint_path=model_path.pt model_config.include_modules="decatndur"`

## Training RADTTS++ (with pitch and energy conditioning)

1. Train the decoder:
   `python train.py -c config_ljs_decoder.json -p train_config.output_directory=outdir`
2. Train the attribute predictor: autoregressive flow (agap), bi-partite flow (bgap) or deterministic (dap):
   `python train.py -c config_ljs_{agap,bgap,dap}.json -p train_config.output_directory=outdir_wattr train_config.warmstart_checkpoint_path=model_path.pt`

## Training starting from a pre-trained model, ignoring the speaker embedding table

1. Download our pre-trained model
2. `python train.py -c config.json -p train_config.ignore_layers_warmstart=["speaker_embedding.weight"] train_config.warmstart_checkpoint_path=model_path.pt`

## Multi-GPU (distributed)

1. `python -m torch.distributed.launch --use_env --nproc_per_node=NUM_GPUS_YOU_HAVE train.py -c config.json -p train_config.output_directory=outdir`

## Inference demo

1. `python inference.py -c CONFIG_PATH -r RADTTS_PATH -v HG_PATH -k HG_CONFIG_PATH -t TEXT_PATH -s ljs --speaker_attributes ljs --speaker_text ljs -o results/`

## Inference Voice Conversion demo

1. `python inference_voice_conversion.py --radtts_path RADTTS_PATH --radtts_config_path RADTTS_CONFIG_PATH --vocoder_path HG_PATH --vocoder_config_path HG_CONFIG_PATH --f0_mean=211.413 --f0_std=46.6595 --energy_mean=0.724884 --energy_std=0.0564605 --output_dir=results/ -p data_config.validation_files="{'Dummy': {'basedir': 'data/', 'audiodir':'22khz', 'filelist': 'vc_audiopath_txt_speaker_emotion_duration_filelist.txt'}}"`

## Config Files

| Filename | Description | Nota bene |
|----------|-------------|-----------|
| [config\_ljs_decoder.json](https://github.com/NVIDIA/radtts/blob/main/configs/config_ljs_decoder.json) | Config for the decoder conditioned on F0 and Energy | |
| [config\_ljs_radtts.json](https://github.com/NVIDIA/radtts/blob/main/configs/config_ljs_radtts.json) | Config for the decoder not conditioned on F0 and Energy | |
| [config\_ljs_agap.json](https://github.com/NVIDIA/radtts/blob/main/configs/config_ljs_agap.json) | Config for the Autoregressive Flow Attribute Predictors | Requires at least pre-trained alignment module |
| [config\_ljs_bgap.json](https://github.com/NVIDIA/radtts/blob/main/configs/config_ljs_bgap.json) | Config for the Bi-Partite Flow Attribute Predictors | Requires at least pre-trained alignment module |
| [config\_ljs_dap.json](https://github.com/NVIDIA/radtts/blob/main/configs/config_ljs_dap.json) | Config for the Deterministic Attribute Predictors | Requires at least pre-trained alignment module |

## LICENSE

Unless otherwise specified, the source code within this repository is provided under the [MIT License](LICENSE).

## Acknowledgements

The code in this repository is heavily inspired by or makes use of source code from the following works:

- Tacotron implementation from [Keith Ito](https://github.com/keithito/tacotron/)
- STFT code from [Prem Seetharaman](https://github.com/pseeth/pytorch-stft)
- [Masked Autoregressive Flows](https://arxiv.org/abs/1705.07057)
- [Flowtron](https://arxiv.org/abs/2005.05957)
- Source for the neural spline functions used in this work: https://github.com/ndeutschmann/zunis
- Original source for neural spline functions: https://github.com/bayesiains/nsf
- Bipartite architecture based on code from [WaveGlow](https://github.com/NVIDIA/waveglow)
- [HiFi-GAN](https://github.com/jik876/hifi-gan)
- [Glow-TTS](https://github.com/jaywalnut310/glow-tts)

## Relevant Papers

Rohan Badlani, Adrian Łańcucki, Kevin J. Shih, Rafael Valle, Wei Ping, Bryan Catanzaro.
[One TTS Alignment to Rule Them All.](https://ieeexplore.ieee.org/abstract/document/9747707) ICASSP 2022

Kevin J Shih, Rafael Valle, Rohan Badlani, Adrian Lancucki, Wei Ping, Bryan Catanzaro.
[RAD-TTS: Parallel flow-based TTS with robust alignment learning and diverse synthesis.](https://openreview.net/pdf?id=0NQwnnwAORi)
ICML Workshop on Invertible Neural Networks, Normalizing Flows, and Explicit Likelihood Models 2021

Kevin J Shih, Rafael Valle, Rohan Badlani, João Felipe Santos, Bryan Catanzaro.
[Generative Modeling for Low Dimensional Speech Attributes with Neural Spline Flows.](https://arxiv.org/pdf/2203.01786) Technical Report
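As a quick reference, the pipe-separated filelist format described in the Setup section can be parsed with a few lines of Python. The field names below are illustrative labels assumed from that description; the repository's data loader is the authoritative reference for the schema.

```python
def parse_filelist_line(line):
    """Parse one filelist entry of the form
    audiopath|text|speaker[|label][|duration_seconds]
    Field names are illustrative, assumed from the Setup description."""
    fields = line.rstrip("\n").split("|")
    if len(fields) < 3:
        raise ValueError(f"expected at least audiopath|text|speaker, got: {line!r}")
    entry = {"audiopath": fields[0], "text": fields[1], "speaker": fields[2]}
    if len(fields) > 3:
        entry["label"] = fields[3]            # optional categorical label
    if len(fields) > 4:
        entry["duration"] = float(fields[4])  # optional duration in seconds
    return entry

row = parse_filelist_line("wavs/LJ001-0001.wav|Printing, in the only sense.|0|neutral|9.65")
assert row["speaker"] == "0" and row["duration"] == 9.65
```

Because the text field may contain commas and quotes, the pipe delimiter keeps parsing trivial; just make sure transcripts themselves contain no `|` characters.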