With Flowtron we can control the amount of prosodic variation in speech by adjusting σ². Even as increasing σ² adds variability, every sample synthesized with Flowtron remains high-quality speech. The three columns contain three separate samples for each value of σ², so that you can compare the variation at each setting and against the variation of Tacotron 2. With Flowtron, we can create samples with highly varying prosody, which can make the voice much less monotonous.
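As a rough illustration of what adjusting σ² means, the sketch below draws latent values from a zero-mean spherical Gaussian with variance σ². It assumes this is the distribution being scaled. The function name `sample_latent`, the 80-channel shape and the frame count are hypothetical, and the Flowtron model that would turn these latents into a mel-spectrogram is not shown.

```python
import numpy as np


def sample_latent(n_frames, n_mel_channels=80, sigma_sq=0.5, seed=0):
    """Draw z ~ N(0, sigma_sq * I) with shape (n_frames, n_mel_channels).

    Hypothetical helper for illustration only; not part of any Flowtron API.
    """
    rng = np.random.default_rng(seed)
    sigma = np.sqrt(sigma_sq)  # standard deviation from variance
    return sigma * rng.standard_normal((n_frames, n_mel_channels))


if __name__ == "__main__":
    # sigma_sq = 0 collapses every draw to the mean (no variation);
    # larger values spread the latents and so allow more prosodic variation.
    for sigma_sq in (0.0, 0.5, 1.0):
        z = sample_latent(400, sigma_sq=sigma_sq)
        print(f"sigma^2={sigma_sq}: empirical variance={z.var():.3f}")
```

Using a different `seed` for each draw would correspond to the separate samples per σ² value described above.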
We compare Sally samples from Flowtron and Tacotron 2 GST, generated by conditioning on the posterior computed over the 30 Helen samples with the highest variance in fundamental frequency. The goal is to make speech from a monotone speaker more expressive by sampling a region of Flowtron's z-space associated with a different, more expressive speaker.
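The two steps described above, picking the samples with the highest variance in fundamental frequency and computing a posterior over their latents, might be sketched as follows. This is only an assumption-laden illustration: it uses the standard conjugate update for a Gaussian prior with a Gaussian likelihood of known variance, which may differ from the procedure actually used, and the helper names, the unvoiced-frame convention (F0 of 0) and the `lik_var` parameter are all hypothetical.

```python
import numpy as np


def top_k_by_f0_variance(f0_tracks, k=30):
    """Indices of the k utterances whose voiced F0 has the highest variance.

    Assumes each track is a 1-D array in Hz with 0 marking unvoiced frames.
    """
    variances = []
    for f0 in f0_tracks:
        voiced = f0[f0 > 0]
        variances.append(voiced.var() if voiced.size else 0.0)
    order = np.argsort(variances)[::-1]  # highest variance first
    return order[:k]


def gaussian_posterior(z_evidence, prior_mean=0.0, prior_var=1.0, lik_var=1.0):
    """Conjugate Gaussian update, applied independently per latent dimension.

    z_evidence has shape (n_samples, dim): one latent vector per
    reference utterance. Returns (posterior mean, posterior variance).
    """
    n = z_evidence.shape[0]
    z_bar = z_evidence.mean(axis=0)
    post_var = 1.0 / (1.0 / prior_var + n / lik_var)
    post_mean = post_var * (prior_mean / prior_var + n * z_bar / lik_var)
    return post_mean, post_var


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-ins for F0 tracks and encoded latents of 100 utterances.
    f0_tracks = [rng.uniform(80, 300, size=200) for _ in range(100)]
    latents = rng.standard_normal((100, 80))

    chosen = top_k_by_f0_variance(f0_tracks, k=30)
    mean, var = gaussian_posterior(latents[chosen])
    # Sample a new latent from the posterior instead of the zero-mean prior.
    z = mean + np.sqrt(var) * rng.standard_normal(mean.shape)
    print(z.shape)
```

Sampling around the posterior mean rather than around zero is what moves generation toward the region of z-space occupied by the more expressive reference samples.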
We illustrate Flowtron's ability to learn and transfer acoustic characteristics that are hard to express algorithmically but easy to perceive acoustically. We transfer a style with a distinctive nasal voice and oscillation in fundamental frequency to our Flowtron baseline speaker.
We modify a speaker's style by using data from the same speaker but in a style not seen during training. Flowtron succeeds in transferring the somber style and the long pauses associated with the narrative style.
We transfer the style of RAVDESS speaker ID 03 with the label "surprised" to Sally. Flowtron is able to make Sally sound surprised, which is drastically different from the monotonous baseline.
Flowtron Style Transfer
We transfer Richard Feynman's prosody and acoustic characteristics to Sally. Flowtron is able to pick up some of the prosody and articulation details particular to Feynman's speaking style and transfer them to Sally.