1. Mean Opinion Score Comparison

RAD-TTS achieves Mean Opinion Scores (MOS) comparable to state-of-the-art parallel text-to-speech models. Compared to the most similar model, GlowTTS, our overall quality is slightly worse, likely because our architecture is much larger and therefore less data-efficient on LJSpeech. The following table compares the mean opinion scores of models trained only on LJSpeech.

Here we provide samples from RAD-TTS and GlowTTS trained on the LJSpeech dataset.

- LJSpeech Ground Truth
- RAD-TTS with prior (σ² = 0.667)
- GlowTTS w/ blanks

2. Diversity in Generated Samples

Similar to Flowtron, we can control the amount of prosodic variation in speech by adjusting σ² for the duration generative model in RAD-TTS. Here we evaluate RAD-TTS against other non-autoregressive models in terms of variability in speech.

Phoneme-level duration distributions for the word 'Climate', with 95% confidence intervals obtained from 100 samples drawn from different models conditioned on the phrase 'Climate change knows no borders'. Explicit generative models (shades of green and blue) provide high diversity in speech rhythm by adjusting σ, whereas test-time dropout (yellow) provides limited variability.
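To make the role of σ concrete, here is a minimal sketch of temperature-controlled duration sampling. A toy affine mapping in log-duration space stands in for the actual flow-based duration model, and the per-phoneme location parameters are invented for illustration — only the sampling pattern (scale the latent noise by σ, then invert the flow) reflects the approach described above:

```python
import math
import random

def sample_durations(mean_log_dur, sigma, rng):
    """Sample per-phoneme frame counts with temperature sigma.

    In a flow-based duration model, sampling draws z ~ N(0, sigma^2 I)
    and maps it through the inverse flow; here a toy affine "flow"
    (exp of a shifted latent) stands in for that mapping.
    mean_log_dur: per-phoneme log-duration locations (illustrative values).
    """
    durations = []
    for mu in mean_log_dur:
        z = rng.gauss(0.0, sigma)                    # temperature-scaled latent
        durations.append(max(1, round(math.exp(mu + z))))  # frame count >= 1
    return durations

rng = random.Random(0)
mean_log_dur = [math.log(d) for d in (5.0, 8.0, 3.0, 6.0)]  # toy phoneme durations

# sigma = 0 collapses to the deterministic mode; larger sigma spreads the rhythm.
deterministic = sample_durations(mean_log_dur, 0.0, rng)
varied = [sample_durations(mean_log_dur, 1.0, rng) for _ in range(100)]
```

With σ = 0 every draw returns the same durations; raising σ widens the per-phoneme distributions, which is exactly the spread visible in the figure above.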

3. Online Alignment Learning Algorithm

Please visit our blog post for details on our online unsupervised alignment learning framework. We provide samples and results demonstrating its effectiveness.
The following table compares alignment errors using RAD-TTS.
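For intuition, the core of such an alignment objective can be sketched as a log-domain forward algorithm that sums over all monotonic alignments of frames to tokens. This is a generic CTC-style forward-sum sketch, not the exact RAD-TTS implementation; the function name and input layout are assumptions:

```python
import math

NEG_INF = float("-inf")

def logaddexp(a, b):
    # numerically stable log(exp(a) + exp(b))
    if a == NEG_INF:
        return b
    if b == NEG_INF:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def forward_sum(log_probs):
    """Log-likelihood that T frames align monotonically to N text tokens.

    log_probs[t][n]: log P(frame t | token n), e.g. from a soft attention
    matrix. Each frame attends to exactly one token, tokens are consumed
    in order, and every token receives at least one frame -- the
    monotonic-alignment constraint used when learning alignments online.
    """
    T, N = len(log_probs), len(log_probs[0])
    alpha = [NEG_INF] * N
    alpha[0] = log_probs[0][0]           # the first frame must use the first token
    for t in range(1, T):
        new = [NEG_INF] * N
        for n in range(N):
            stay = alpha[n]                                  # keep emitting token n
            advance = alpha[n - 1] if n > 0 else NEG_INF     # move to the next token
            new[n] = logaddexp(stay, advance) + log_probs[t][n]
        alpha = new
    return alpha[N - 1]                  # all tokens consumed by the final frame
```

Maximizing this quantity pushes the attention matrix toward valid monotonic alignments without any external aligner, which is what "online" alignment learning refers to.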

4. RAD-TTS++: Explicitly Conditioning the RAD-TTS Decoder on f0 (Pitch) and Energy

We also explicitly condition the RAD-TTS decoder on f0 (pitch) and energy. Since this model extends RAD-TTS, we call it RAD-TTS++. This allows us to explicitly control pitch and energy. More importantly, just as RAD-TTS models durations with a generative model, we can use a generative model for f0 (pitch) and energy to enhance the expressivity and diversity of synthesized samples.
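A minimal sketch of what "conditioning the decoder on f0 and energy" means at the feature level: token embeddings are expanded to frame rate via the durations, then per-frame pitch and energy values are appended as extra channels. The function name and layout here are assumptions for illustration, not the RAD-TTS++ API:

```python
def build_decoder_conditioning(text_embeddings, durations, f0, energy):
    """Expand token embeddings to frame rate and append f0/energy channels.

    text_embeddings: one vector per token; durations: frames per token;
    f0, energy: per-frame scalars with len == sum(durations).
    Returns one conditioning vector per mel frame.
    """
    frames = []
    for emb, dur in zip(text_embeddings, durations):
        frames.extend([list(emb)] * dur)        # repeat token embedding per frame
    assert len(frames) == len(f0) == len(energy)
    # Append the explicit f0 and energy channels the decoder is conditioned on.
    return [frame + [p, e] for frame, p, e in zip(frames, f0, energy)]
```

Because the decoder sees pitch and energy as explicit inputs, swapping in edited or externally supplied tracks at inference time changes the output correspondingly — which is what enables the singing and rapping samples below.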

Below are some samples from RAD-TTS++, in which we use a discriminative model to predict f0 (pitch) and energy and condition the decoder on the predicted values. The first is a RAD-TTS++ synthesized sample from a speaker in our dataset:

We also train our RAD-TTS++ model on a speaker from the Blizzard Challenge; the following is a synthesized sample for that speaker:

Conditioning on f0 and energy allows us to make any speaker sing or rap. Here is a sample in which we make one of the RAD-TTS++ speakers rap by explicitly conditioning on f0 and energy extracted from the ground-truth recording of 'The Real Slim Shady' by Eminem.
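To show what "f0 and energy from the ground truth" amounts to, here is a toy per-frame extractor: RMS energy plus a naive autocorrelation pitch estimate. This is a stand-in for a proper pitch tracker (e.g. YIN/pYIN), which is what one would actually use on a real recording; the function name and frame parameters are assumptions:

```python
import math

def frame_features(samples, sr, frame_len=1024, hop=256):
    """Per-frame RMS energy and a naive autocorrelation f0 estimate.

    samples: mono waveform as floats; sr: sample rate in Hz.
    Searches lags corresponding to an assumed 80-400 Hz pitch range.
    """
    f0s, energies = [], []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        energies.append(math.sqrt(sum(x * x for x in frame) / frame_len))
        # Pick the autocorrelation peak inside the plausible pitch range.
        best_lag, best_corr = 0, 0.0
        for lag in range(sr // 400, sr // 80):
            corr = sum(frame[i] * frame[i - lag] for i in range(lag, frame_len))
            if corr > best_corr:
                best_lag, best_corr = lag, corr
        f0s.append(sr / best_lag if best_lag else 0.0)  # 0.0 marks unvoiced frames
    return f0s, energies
```

Feeding tracks like these, extracted from the reference song, into the conditioning channels makes the synthesized speaker follow the reference's melody and dynamics.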

In the following samples, we take Etta James' song 'At Last' and have one of the RAD-TTS++ speakers sing it. We then overlay the original and synthesized audio to create a duet!