Ryan Prenger, Rafael Valle, and Bryan Catanzaro
In our recent paper, we propose WaveGlow: a flow-based network capable of generating high-quality speech from mel-spectrograms. WaveGlow combines insights from Glow and WaveNet to provide fast, efficient, and high-quality audio synthesis without the need for auto-regression. WaveGlow is implemented as a single network and trained with a single cost function, maximizing the likelihood of the training data, which makes the training procedure simple and stable.
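To make the single likelihood objective concrete, here is a minimal sketch of one affine coupling step and the resulting negative log-likelihood under a zero-mean spherical Gaussian. It is an illustration under stated assumptions, not the WaveGlow implementation: `toy_scale_and_shift` is a hypothetical stand-in for the learned network, and the mel-spectrogram conditioning and the invertible 1x1 convolutions of the full model are omitted.

```python
import math


def toy_scale_and_shift(x_a):
    # Hypothetical stand-in for the learned network that predicts a
    # log-scale and a shift from the half of the input left unchanged.
    log_s = [0.1 * a for a in x_a]
    t = [0.5 * a for a in x_a]
    return log_s, t


def coupling_forward(x):
    # Training direction: map data x to latent z and return log|det J|.
    half = len(x) // 2
    x_a, x_b = x[:half], x[half:]
    log_s, t = toy_scale_and_shift(x_a)
    z_b = [math.exp(ls) * xb + tt for ls, xb, tt in zip(log_s, x_b, t)]
    return x_a + z_b, sum(log_s)


def coupling_inverse(z):
    # Synthesis direction: map latent z back to data x.
    half = len(z) // 2
    z_a, z_b = z[:half], z[half:]
    log_s, t = toy_scale_and_shift(z_a)
    x_b = [(zb - tt) * math.exp(-ls) for ls, zb, tt in zip(log_s, z_b, t)]
    return z_a + x_b


def negative_log_likelihood(z, log_det, sigma=1.0):
    # Gaussian prior term minus the log-determinant of the transform
    # (additive constants dropped).
    return sum(v * v for v in z) / (2.0 * sigma ** 2) - log_det


x = [1.0, -2.0, 0.5, 3.0]
z, log_det = coupling_forward(x)
loss = negative_log_likelihood(z, log_det)
x_rec = coupling_inverse(z)
```

Because the transform is invertible in closed form, synthesis amounts to drawing `z` from the Gaussian and running the inverse, with no sample-by-sample loop. That is the property that lets a flow-based model avoid auto-regression.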
Our PyTorch implementation produces audio samples at a rate of 1200 kHz on an NVIDIA V100 GPU. Mean Opinion Scores show that it delivers audio quality as good as the best publicly available WaveNet implementation.
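As a rough guide to what this generation rate means in practice, the sketch below converts it to a real-time factor. The 22,050 Hz playback rate is an assumption (the sampling rate of the LJ Speech recordings), not a figure stated on this page.

```python
GENERATION_RATE_HZ = 1_200_000  # 1200 kHz, as reported above
PLAYBACK_RATE_HZ = 22_050       # assumption: sampling rate of the output audio

# How many seconds of audio are produced per second of compute.
real_time_factor = GENERATION_RATE_HZ / PLAYBACK_RATE_HZ

# Compute time needed for a 10-second utterance at that rate.
seconds_to_generate_10s = 10.0 / real_time_factor
```

Under that assumption, synthesis runs roughly 54 times faster than real time.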
Below we provide real samples and synthesized samples using our WaveGlow model, Griffin-Lim and an open source WaveNet implementation. We also provide WaveGlow samples using mel-spectrograms produced with our Tacotron 2 implementation.
Code for training and inference, along with a model pretrained on the LJ Speech (LJS) dataset, is available in our GitHub repository.
|Tacotron 2 + WaveGlow|