RAD-TTS: Parallel Flow-Based TTS with Robust Alignment Learning and Diverse Synthesis


RAD-TTS

Kevin Shih, Rafael Valle, Rohan Badlani, Adrian Lancucki, Wei Ping and Bryan Catanzaro


In our recent paper, we propose RAD-TTS: a parallel flow-based generative network for text-to-speech synthesis. It extends prior parallel approaches by additionally modeling speech rhythm as a separate generative distribution, which allows token durations to vary during inference. We further propose a robust framework for the online extraction of speech-text alignments: a critical yet highly unstable learning problem in end-to-end TTS frameworks. Our experiments demonstrate that the proposed techniques yield improved alignment quality and better output diversity compared to controlled baselines.

RAD-TTS is trained by maximizing the likelihood of the training data, which makes the training procedure simple and stable. We additionally use an unsupervised alignment learning objective that maximizes the likelihood of text given mels. This additional objective allows us to learn speech-text alignments online as the RAD-TTS model trains.
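
For readers less familiar with flows, the likelihood being maximized can be written with the standard change-of-variables formula. This is a generic sketch, not an equation taken from the paper; the symbols x (mel-spectrogram), c (text conditioning), z (latent), and f_θ (the invertible network) are our own notation.

```latex
z = f_\theta(x; c), \qquad
\log p_\theta(x \mid c) = \log p_Z(z) + \log \left| \det \frac{\partial f_\theta(x; c)}{\partial x} \right|
```

Here p_Z is a simple prior such as a standard Gaussian. Training maximizes the right-hand side over the dataset, and synthesis runs the inverse map on a sample of z.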

An important contribution of our work is the generative modelling of token durations, in place of the deterministic regression models that most non-autoregressive TTS models currently use. Utilizing a normalizing flow for duration modelling allows us to resolve the output diversity issue in parallel TTS architectures.

The following diagram summarizes the inference pipeline for RAD-TTS. The duration normalizing flow first samples the phoneme durations, which are then used to prepare the input to the parallel Mel-Decoder flow.
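
The step of preparing the decoder input from durations can be sketched as follows. This is a minimal illustration, assuming the preparation amounts to repeating each token's encoding for its number of frames; the function name `expand_tokens`, the placeholder phoneme strings, and the hard-coded durations are hypothetical and not taken from the RAD-TTS code.

```python
def expand_tokens(token_encodings, durations):
    """Repeat each token encoding for its number of mel frames."""
    if len(token_encodings) != len(durations):
        raise ValueError("need exactly one duration per token")
    frames = []
    for encoding, num_frames in zip(token_encodings, durations):
        frames.extend([encoding] * num_frames)
    return frames


# Placeholder phoneme encodings; a real model would use learned vectors.
tokens = ["K", "L", "AY", "M", "AH", "T"]

# Pretend these frame counts were sampled from the duration flow.
durations = [3, 2, 5, 2, 2, 4]

# One entry per mel frame, ready to condition a parallel decoder.
decoder_input = expand_tokens(tokens, durations)
print(len(decoder_input), decoder_input[:6])
```

Because every frame's conditioning is known once the durations are fixed, the decoder can generate all frames in parallel.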

Figure: RAD-TTS inference pipeline (RAD-TTS_MODEL)

Below we provide samples produced with RAD-TTS for mel-spectrogram synthesis and WaveGlow for waveform synthesis.

1. Mean Opinion Score Comparison

RAD-TTS has Mean Opinion Scores (MOS) comparable to state-of-the-art parallel text-to-speech models. Compared to the most similar model, GlowTTS, our overall quality is a bit worse, likely because our architecture is much larger and therefore less data-efficient on LJSpeech. The following table compares the mean opinion scores of models trained only on LJSpeech.

Here we provide samples from RAD-TTS and GlowTTS trained on the LJSpeech dataset.

LJSpeech Ground Truth | RAD-TTS with prior (σ²=0.667) | GlowTTS w/ blanks

2. Diversity in samples generated

Similar to Flowtron, we can control the amount of prosodic variation in speech by adjusting σ² for the duration generative model in RAD-TTS. Here we evaluate RAD-TTS against other non-autoregressive models in terms of variability in speech.
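
To make the role of σ² concrete, here is a toy sketch of sampling durations from a one-dimensional flow whose latent is drawn from N(0, σ²). The affine-then-exp map and its constants `shift` and `log_scale` are made up for illustration; in the real model the transformation is learned and conditioned on text.

```python
import math
import random
import statistics


def sample_duration(sigma_sq, rng, shift=1.5, log_scale=-0.9):
    """Draw one duration (in frames) from a toy one-dimensional flow.

    shift and log_scale are hypothetical constants standing in for
    values a trained, text-conditioned flow would provide.
    """
    z = rng.gauss(0.0, math.sqrt(sigma_sq))          # z ~ N(0, sigma^2)
    log_duration = shift + math.exp(log_scale) * z   # invertible affine map
    return math.exp(log_duration)                    # exp keeps durations positive


def duration_spread(sigma_sq, num_samples=100, seed=0):
    """Sample standard deviation of durations drawn at a given sigma^2."""
    rng = random.Random(seed)
    samples = [sample_duration(sigma_sq, rng) for _ in range(num_samples)]
    return statistics.stdev(samples)


for sigma_sq in (0.0, 0.333, 0.667, 1.0):
    print(f"sigma^2={sigma_sq}: std of durations = {duration_spread(sigma_sq):.3f}")
```

With σ² = 0 every draw collapses to the same duration, which mimics a deterministic regressor; larger values widen the spread of sampled rhythms.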

Phoneme-level duration distributions for the word Climate with 95% confidence intervals obtained from 100 samples collected from different models conditioned on the phrase 'Climate change knows no borders'. Explicit generative models (shades of green and blue) provide high diversity in speech rhythm by adjusting σ, whereas test-time dropout (yellow) provides limited variability.

3. Online Alignment Learning Algorithm


Please visit our blogpost for details on our online unsupervised alignment learning framework. We provide samples and results demonstrating the effectiveness of our alignment learning framework.
The following table compares the alignment errors using RAD-TTS.
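
One standard way to compute the likelihood of text given mels without alignment labels is to sum over all monotonic alignments with a forward algorithm, as in HMMs and CTC. The sketch below illustrates that idea on a toy example. It is not the released implementation, and the constraints (start at the first token, end at the last, stay or advance by one token per frame) are assumptions stated for this example.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))


def forward_sum(log_probs):
    """Log-likelihood of the token sequence, summed over monotonic alignments.

    log_probs[t][n] is the log-probability of token n at mel frame t.
    """
    num_frames, num_tokens = len(log_probs), len(log_probs[0])
    alpha = [-math.inf] * num_tokens
    alpha[0] = log_probs[0][0]                 # paths must start at token 0
    for t in range(1, num_frames):
        prev = alpha
        alpha = []
        for n in range(num_tokens):
            stay = prev[n]
            advance = prev[n - 1] if n > 0 else -math.inf
            alpha.append(logsumexp([stay, advance]) + log_probs[t][n])
    return alpha[-1]                           # paths must end at the last token


# Toy example: 3 mel frames, 2 tokens.
probs = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]]
log_probs = [[math.log(p) for p in row] for row in probs]
log_likelihood = forward_sum(log_probs)
print(math.exp(log_likelihood))
```

In this example only two paths are valid (token indices 0,0,1 and 0,1,1), so the summed probability is 0.9·0.6·0.8 + 0.9·0.4·0.8 = 0.72. Maximizing this quantity during training pushes probability mass toward plausible alignments.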

4. RAD-TTS++ : explicitly conditioning RAD-TTS decoder on f0 (pitch) and energy

We also attempt to explicitly condition the RAD-TTS decoder on f0 (pitch) and energy. Since this model is an extension of RAD-TTS, we call it RAD-TTS++. This allows us to explicitly control the pitch and energy of the output. More importantly, similar to modelling durations using a generative model (as done in RAD-TTS), we can use a generative model for f0 (pitch) and energy to enhance the expressivity and diversity of synthesized samples.
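
As a rough illustration of what explicit conditioning makes possible, the sketch below appends per-frame f0 and energy values to each frame's text features and transposes the pitch contour before synthesis. The helper names, the feature layout, concatenation as the conditioning mechanism, and the convention that unvoiced frames have f0 = 0 are all assumptions for this example, not details confirmed by the page.

```python
def build_decoder_context(text_frames, f0, energy):
    """Append per-frame f0 (Hz) and energy to each frame's text features."""
    if not (len(text_frames) == len(f0) == len(energy)):
        raise ValueError("text_frames, f0 and energy must have one entry per frame")
    return [list(feat) + [pitch, e] for feat, pitch, e in zip(text_frames, f0, energy)]


def shift_pitch(f0, semitones):
    """Transpose voiced frames by a number of semitones.

    Frames with f0 == 0 are treated as unvoiced and left at 0 (assumed convention).
    """
    factor = 2.0 ** (semitones / 12.0)
    return [pitch * factor if pitch > 0 else 0.0 for pitch in f0]


# Three frames of made-up 2-dimensional text features.
text_frames = [[0.1, 0.3], [0.1, 0.3], [0.7, 0.2]]
f0 = [220.0, 0.0, 246.9]      # Hz; the middle frame is unvoiced
energy = [0.8, 0.1, 0.6]

# Raise the contour by one octave, then build the decoder conditioning.
context = build_decoder_context(text_frames, shift_pitch(f0, 12), energy)
print(context[0])
```

The same mechanism accepts a pitch and energy contour taken from another recording, such as a song, in place of a predicted one.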

The following are some samples from RAD-TTS++, in which we use a discriminative model to predict f0 (pitch) and energy and condition the decoder on the predicted f0 and energy. Here is a RAD-TTS++ synthesized sample from a speaker in our dataset:



We also train our RAD-TTS++ model on a speaker from the Blizzard Challenge. The following is a synthesized sample for that speaker:



This conditioning on f0 and energy allows us to make any speaker sing or rap songs. Here's a rap sample from RAD-TTS in which we make one of the RAD-TTS++ speakers rap by explicitly conditioning on the f0 and energy of the ground truth 'The Real Slim Shady' rap song by Eminem.



In the following samples, we take Etta James' song 'At Last' and make one of RAD-TTS++ speakers sing the same song. We then overlay the original and synthesized audio to generate a duet for them!

Implementation details

Code for training and inference, along with pretrained models on LJS, will be released soon.

Citation

@inproceedings{
shih2021radtts,
title={RAD-TTS: Parallel Flow-Based {TTS} with Robust Alignment Learning and Diverse Synthesis},
author={Kevin J. Shih and Rafael Valle and Rohan Badlani and Adrian Lancucki and Wei Ping and Bryan Catanzaro},
booktitle={ICML Workshop on Invertible Neural Networks, Normalizing Flows, and Explicit Likelihood Models},
year={2021},
url={https://openreview.net/forum?id=0NQwnnwAORi}
}