Flowtron: an Autoregressive Flow-based Generative Network for Text-to-Speech Synthesis



Rafael Valle, Kevin Shih, Ryan Prenger, and Bryan Catanzaro

In our recent paper, we propose Flowtron: an autoregressive flow-based generative network for text-to-speech synthesis with control over speech variation and style transfer. Flowtron combines insights from Inverse Autoregressive Flows (IAF) with Tacotron 2 in order to provide high-quality and controllable mel-spectrogram synthesis.

Flowtron is trained by maximizing the likelihood of the training data, which makes the training procedure simple and stable. Flowtron learns an invertible mapping from data to a latent space that can be manipulated to influence many aspects of mel-spectrogram synthesis.
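The likelihood-based training objective follows the standard change-of-variables formula for normalizing flows: log p(x) = log N(f(x); 0, I) + log |det ∂f/∂x|. The following is a minimal sketch of that objective using an illustrative affine flow; it is not Flowtron's actual architecture, just a toy stand-in to show how the exact negative log-likelihood is computed.

```python
import numpy as np

def affine_flow_forward(x, log_scale, shift):
    """Toy invertible transform: z = x * exp(log_scale) + shift.

    Stand-in for a trained flow network; the per-sample log|det| of this
    element-wise affine map is simply the sum of log_scale.
    """
    z = x * np.exp(log_scale) + shift
    log_det = np.full(x.shape[0], np.sum(log_scale))
    return z, log_det

def negative_log_likelihood(x, log_scale, shift):
    """Exact NLL under a standard-normal prior plus the Jacobian term."""
    z, log_det = affine_flow_forward(x, log_scale, shift)
    log_prior = -0.5 * np.sum(z**2 + np.log(2.0 * np.pi), axis=1)
    return -(log_prior + log_det).mean()

# Batch of 8 toy "mel frames" of dimension 4; identity-transform parameters.
x = np.random.randn(8, 4)
print(negative_log_likelihood(x, np.zeros(4), np.zeros(4)))
```

Because the log-likelihood is exact (no variational bound), the loss can be minimized directly with gradient descent, which is what makes training simple and stable.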

Below we provide samples produced with Flowtron for mel-spectrogram synthesis and WaveGlow for waveform synthesis. Code for training and inference, along with pretrained models on LJS and LibriTTS, will be available on our GitHub repository.

Flowtron narrating the I AM AI GTC 2020 Video

4.2 Mean Opinion Score Comparison

Flowtron achieves Mean Opinion Scores (MOS) comparable to state-of-the-art text-to-speech models. Here we provide a sample from Flowtron and Tacotron 2, each trained on the LJSpeech dataset.
LJSpeech Ground Truth · Flowtron · Tacotron 2

4.3.1 Sampling the Prior (Speech Variation)

With Flowtron we can control the amount of prosodic variation in speech by adjusting σ². Despite the variability added by increasing σ², all the samples synthesized with Flowtron remain high-quality speech. The three columns contain three separate samples, so you can compare the variation within each value of σ² and against Tacotron 2. With Flowtron, we can create samples with highly varying prosody, which makes the voice much less monotonous.
Flowtron σ²=0
Flowtron σ²=0.5
Flowtron σ²=1
Tacotron 2 p=0.5
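Inference-time variation comes from sampling the latent prior z ~ N(0, σ²I) and running the flow in reverse to produce a mel-spectrogram. A minimal sketch of the sampling step, where the inverse flow itself is a hypothetical stand-in for the trained network:

```python
import numpy as np

def sample_latent(shape, sigma, seed=None):
    """Draw a latent tensor z ~ N(0, sigma^2 I).

    sigma = 0 collapses to the deterministic prior mean; larger sigma
    injects more prosodic variability into the synthesized speech.
    (The trained inverse flow that would map z to a mel-spectrogram is
    omitted here.)
    """
    rng = np.random.default_rng(seed)
    return sigma * rng.standard_normal(shape)

# 80 mel channels x 100 frames, illustrative sizes only.
z_det = sample_latent((80, 100), sigma=0.0, seed=0)   # deterministic mode
z_var = sample_latent((80, 100), sigma=1.0, seed=0)   # full prior variance
print(z_det.std(), z_var.std())
```

Different random draws of z at the same σ² yield the distinct-but-equally-natural samples shown in the columns above.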

4.3.2 Sampling the Prior (Interpolation between samples)

Using a Flowtron model with speaker embeddings, we interpolate between two random z-vectors for the speaker Sally and the phrase "It is well known that deep generative models have a rich latent space".
Flowtron same speaker
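Because the mapping to z-space is invertible, interpolating linearly between two latent vectors produces a smooth path of valid mel-spectrograms. A sketch of the interpolation itself (the flow network that would decode each point is omitted):

```python
import numpy as np

def interpolate(z0, z1, n_steps):
    """Linear interpolation between two latent vectors, endpoints included.

    Each row of the result is (1 - t) * z0 + t * z1 for t in [0, 1];
    decoding each row through the inverse flow would give a smooth
    transition between the two original samples.
    """
    ts = np.linspace(0.0, 1.0, n_steps)
    return np.stack([(1.0 - t) * z0 + t * z1 for t in ts])

z0 = np.zeros(4)          # illustrative 4-dim latents; Flowtron's are larger
z1 = np.ones(4)
path = interpolate(z0, z1, 5)
print(path[2])            # midpoint between z0 and z1
```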

Please visit our blogpost for examples in which we interpolate between z-vectors producing speech from Sally and Helen with the phrase "We are testing this model".

4.4.1 Sampling the Posterior (Seen speaker without alignments)

We compare Sally samples from Flowtron and Tacotron 2 GST generated by conditioning on the posterior computed over 30 Helen samples with the highest variance in fundamental frequency. The goal is to make speech from a monotone speaker more expressive by sampling a region of Flowtron's z-space associated with a different, more expressive speaker.
Flowtron Style Transfer
Tacotron GST Style Transfer
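A simple way to locate such a region is to push the expressive reference samples through the forward flow and aggregate their latents. The sketch below uses random stand-ins for the reference latents and a plain mean as the aggregate; the forward flow and the exact posterior computation are hypothetical simplifications of Flowtron's procedure.

```python
import numpy as np

def posterior_mean_latent(reference_latents):
    """Aggregate latents from expressive reference samples.

    reference_latents: array of shape (n_refs, dim), each row the latent
    obtained by running one reference mel-spectrogram through the forward
    flow (omitted here). Sampling near this mean steers synthesis toward
    the references' style.
    """
    return np.mean(reference_latents, axis=0)

rng = np.random.default_rng(0)
# 30 stand-in "Helen" latents, offset from the origin to mimic a
# style-specific region of z-space.
ref_z = rng.standard_normal((30, 8)) + 2.0
z_style = posterior_mean_latent(ref_z)
print(z_style.shape)
```

Decoding z-vectors drawn near z_style for the Sally speaker embedding would then combine Sally's voice with Helen's expressivity.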

4.4.2 Sampling the Posterior (Seen speaker with alignments)

We illustrate Flowtron's ability to learn and transfer acoustic characteristics that are hard to express algorithmically but easy to perceive acoustically. We transfer a style with a distinctive nasal voice and oscillating fundamental frequency to our Flowtron baseline speaker.
Flowtron · Style · Flowtron Style Transfer

4.4.3 Sampling the Posterior (Unseen speaker style)

We modify a speaker's style using data from the same speaker, but from a style not seen during training. Flowtron succeeds in transferring the somber tone and the long pauses associated with the narrative style.
Flowtron Style Transfer
Tacotron GST Style Transfer

4.4.4 Sampling the Posterior (Unseen speaker)

We transfer the style of speaker ID 03 from RAVDESS, with the label "surprised", to Sally. Flowtron is able to make Sally sound surprised, which is drastically different from her monotonous baseline.
Flowtron · Style · Flowtron Style Transfer

We transfer Richard Feynman's prosody and acoustic characteristics to Sally. Flowtron is able to pick up some of the prosody and articulation details particular to Feynman's speaking style and transfer them to Sally.
Flowtron · Style · Flowtron Style Transfer

4.5.2 Sampling the Gaussian Mixture (Translating dimensions)

We select a single component from the Gaussian mixture and translate a dimension associated with pitch. Although the samples have different pitch contours, their durations remain similar.
μ (a-flat) · μ − 2σ (c) · μ − 4σ (e-flat)

We select a single component from the Gaussian mixture and translate a dimension associated with speech rate. Although the samples have different speech rates, they have similar pitch contours.
μ · μ − 2σ · μ − 4σ
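Translating a dimension means taking a mixture component's mean μ, shifting one coordinate by a multiple of that component's standard deviation σ, and decoding the result. A sketch of the latent manipulation (the dimension index and the decode step are illustrative; Flowtron's actual pitch- and rate-associated dimensions are found empirically):

```python
import numpy as np

def translate_dimension(mu, sigma, dim, n_sigmas):
    """Shift one latent dimension of a mixture component by n_sigmas * sigma.

    mu, sigma: the selected Gaussian component's mean and per-dimension
    standard deviation. All other dimensions stay at the component mean,
    so only the attribute tied to `dim` (e.g. pitch or speech rate)
    changes in the decoded sample.
    """
    z = mu.copy()
    z[dim] = mu[dim] + n_sigmas * sigma[dim]
    return z

mu = np.zeros(4)
sigma = np.ones(4)
pitch_dim = 2                      # hypothetical pitch-associated dimension
z_minus_2s = translate_dimension(mu, sigma, pitch_dim, -2)  # the "μ − 2σ" sample
print(z_minus_2s)
```

Because only one coordinate moves, the other attributes of the sample (duration in the pitch experiment, pitch contour in the speech-rate experiment) stay nearly unchanged.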

Extra Flowtron samples

To reverb or not to reverb
Queen's accent