Flowtron achieves Mean Opinion Scores (MOS) comparable to those of state-of-the-art text-to-speech models. Here we provide a sample from Flowtron and a sample from Tacotron 2, both trained on the LJSpeech dataset.

LJSpeech Ground Truth	Flowtron	Tacotron 2

4.3.1 Sampling the Prior ( Speech Variation )

With Flowtron we can control the amount of prosodic variation in speech by adjusting σ². Even as increasing σ² adds variability, every sample synthesized with Flowtron remains high-quality speech. The three columns contain three separate samples, so that you can compare the variation at each value of σ² and also compare it with the variation in Tacotron 2. With Flowtron, we can create samples with highly varying prosody, which can make the voice much less monotonous.

Flowtron σ²=0
Flowtron σ²=0.5
Flowtron σ²=1
Tacotron 2 p=0.5
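
As a rough illustration of what adjusting σ² means, the sketch below draws latent vectors from a zero-mean spherical Gaussian with variance σ². The function name `sample_prior`, the shape `(n_frames, n_mel)`, and the use of NumPy are assumptions made for illustration; this is not Flowtron's inference code. Note that the standard deviation is the square root of σ², and that σ²=0 collapses every draw to the zero vector, so repeated draws are identical.

```python
import numpy as np

def sample_prior(sigma_sq, n_frames, n_mel, rng):
    """Draw z ~ N(0, sigma_sq * I) with shape (n_frames, n_mel)."""
    if sigma_sq < 0:
        raise ValueError("sigma_sq must be non-negative")
    std = np.sqrt(sigma_sq)  # the Gaussian is scaled by the standard deviation
    return std * rng.standard_normal((n_frames, n_mel))

rng = np.random.default_rng(0)
for sigma_sq in (0.0, 0.5, 1.0):
    # 400 frames and 80 channels are hypothetical sizes, chosen only for the demo
    z = sample_prior(sigma_sq, n_frames=400, n_mel=80, rng=rng)
    print(f"sigma^2={sigma_sq}: empirical variance {z.var():.3f}")
```

In a real system each `z` would be passed through the inverse flow, together with the text and speaker, to produce a mel-spectrogram; that step is omitted here.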

4.3.2 Sampling the Prior ( Interpolation between samples )

We use a Flowtron model with speaker embeddings and interpolate between two random z-vectors with the speaker Sally and the phrase "It is well known that deep generative models have a rich latent space".

	1/100	33/100	66/100	100/100
Flowtron same speaker
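
The interpolation can be sketched as follows, assuming it is linear in z-space and that step 1 corresponds to the first z-vector and step 100 to the second; neither detail is stated on this page. The helper `interpolate_z` and the array shapes are hypothetical.

```python
import numpy as np

def interpolate_z(z_a, z_b, steps=100):
    """Return `steps` latents moving linearly from z_a (first) to z_b (last)."""
    weights = np.linspace(0.0, 1.0, steps)
    return [(1.0 - w) * z_a + w * z_b for w in weights]

rng = np.random.default_rng(0)
z_a = rng.standard_normal((400, 80))  # first random z-vector (hypothetical shape)
z_b = rng.standard_normal((400, 80))  # second random z-vector
path = interpolate_z(z_a, z_b)

# Pick steps 1, 33, 66 and 100, mirroring the column labels of the samples
selected = [path[i - 1] for i in (1, 33, 66, 100)]
```

Each selected latent would then be decoded with the same speaker and text, so that only the position in z-space changes between samples.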

Please visit our blogpost for examples in which we interpolate between z-vectors producing speech from Sally and Helen with the phrase "We are testing this model".

4.4.1 Sampling the Posterior ( Seen speaker without alignments )

We compare Sally samples from Flowtron and Tacotron 2 GST generated by conditioning on the posterior computed over the 30 Helen samples with the highest variance in fundamental frequency. The goal is to make speech from a monotonous speaker more expressive by sampling a region of Flowtron's z-space that is associated with a different, more expressive speaker.

Flowtron
Flowtron Style Transfer
Tacotron GST Style Transfer
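
To make "conditioning on the posterior" more concrete, here is a minimal sketch using the textbook conjugate update for the mean of a spherical Gaussian with a zero-mean Gaussian prior. It assumes each evidence sample has already been mapped to a fixed-length latent vector; how latents of different lengths are pooled, and the exact update Flowtron uses, are not described on this page and may differ. The names `gaussian_posterior`, `prior_var`, and `obs_var` are hypothetical.

```python
import numpy as np

def gaussian_posterior(z_evidence, prior_var=1.0, obs_var=1.0):
    """Posterior over the mean of a spherical Gaussian with a N(0, prior_var * I) prior.

    z_evidence: array of shape (n_samples, dim), one latent vector per evidence sample.
    Returns the posterior mean, shape (dim,), and the scalar posterior variance.
    """
    n = z_evidence.shape[0]
    precision = 1.0 / prior_var + n / obs_var
    post_mean = (z_evidence.sum(axis=0) / obs_var) / precision
    return post_mean, 1.0 / precision

rng = np.random.default_rng(0)
# Stand-in for the latents of 30 expressive evidence samples (synthetic data)
z_evidence = rng.standard_normal((30, 80)) + 0.5
mean, var = gaussian_posterior(z_evidence)

# Draw a latent from the posterior region instead of the zero-mean prior
z_style = mean + np.sqrt(var) * rng.standard_normal(mean.shape)
```

With unit prior and observation variances, the posterior mean is the evidence mean shrunk by a factor of n/(n+1), so more evidence pulls the sampling region further from the prior and closer to the style of the evidence.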

4.4.2 Sampling the Posterior ( Seen speaker with alignments )

We illustrate Flowtron's ability to learn and transfer acoustic characteristics that are hard to express algorithmically but easy to perceive acoustically. We transfer a style with a distinctive nasal voice and oscillation in fundamental frequency to our Flowtron baseline speaker.

Flowtron	Style	Flowtron Style Transfer

4.4.3 Sampling the Posterior ( Unseen speaker style )

We modify a speaker's style by using data from the same speaker but from a style not seen during training. Flowtron succeeds in transferring the somber style and the long pauses associated with the narrative style.

Flowtron
Style
Flowtron Style Transfer
Tacotron GST Style Transfer

4.4.4 Sampling the Posterior ( Unseen speaker )

We transfer the style from speaker ID 03 from RAVDESS and the label "surprised" to Sally. Flowtron is able to make Sally sound surprised, which is drastically different from the monotonous baseline.

Flowtron	Style	Flowtron Style Transfer

We transfer Richard Feynman's prosody and acoustic characteristics to Sally. Flowtron is able to pick up some of the prosody and articulation details particular to Feynman's speaking style and transfer them to Sally.

Flowtron	Style	Flowtron Style Transfer

4.5.2 Sampling the Gaussian Mixture ( Translating dimensions )

We select a single component from the Gaussian mixture and translate a dimension associated with pitch. Although the samples have different pitch contours, they have similar durations.

μ (a-flat)	μ - 2σ (c)	μ - 4σ (e-flat)

We select a single component from the Gaussian mixture and translate a dimension associated with speech rate. Although the samples have different speech rates, they have similar pitch contours.

μ	μ - 2σ	μ - 4σ
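
Translating a dimension of a mixture component, as in the two examples above, can be sketched as shifting one coordinate of the component mean by a multiple of that coordinate's standard deviation. The component parameters, the dimension index, and the helper `translate_dimension` below are hypothetical placeholders; identifying which dimension tracks pitch or speech rate is not covered here.

```python
import numpy as np

def translate_dimension(mu, sigma, dim, n_sigmas):
    """Return a copy of mu shifted by n_sigmas * sigma[dim] along one dimension."""
    shifted = mu.copy()
    shifted[dim] += n_sigmas * sigma[dim]
    return shifted

rng = np.random.default_rng(0)
mu = rng.standard_normal(80)   # hypothetical mean of the selected component
sigma = np.full(80, 0.5)       # hypothetical per-dimension standard deviation
pitch_dim = 3                  # hypothetical index of the pitch-related dimension

# mu, mu - 2*sigma and mu - 4*sigma along the chosen dimension only
variants = [translate_dimension(mu, sigma, pitch_dim, k) for k in (0, -2, -4)]
```

Because every other coordinate is left untouched, the remaining attributes encoded by the component should stay roughly constant, which is consistent with the similar durations and pitch contours noted above.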

Extra Flowtron samples

To reverb or not to reverb
Queen's accent

Flowtron narrating the I AM AI GTC 2020 Video

4.2 Mean Opinion Score Comparison