BigVGAN: A Universal Neural Vocoder with Large-Scale Training

Author: Sang-gil Lee, Wei Ping, Boris Ginsburg, Bryan Catanzaro, Sungroh Yoon

Posted: Wei Ping

Overview

Despite recent progress in generative adversarial network (GAN)-based vocoders, which generate raw waveforms conditioned on mel spectrograms, it remains challenging to synthesize high-fidelity audio for numerous speakers across varied recording environments. In our recent paper, we present BigVGAN, a universal audio synthesizer that generalizes well to various unseen conditions in a zero-shot setting. More interestingly, although it is trained only on speech data, it shows extraordinary zero-shot generalization to non-speech vocalizations (e.g., laughter, applause), singing voices, music, and instrumental audio, even when recorded in varied noisy environments!

The following are the key highlights of our work:

  • We introduce periodic nonlinearities and anti-aliased representation into the generator, which brings the desired inductive bias for waveform synthesis and significantly improves audio quality.
  • Based on our improved generator and state-of-the-art discriminators, we train our GAN vocoder at the largest scale to date, up to 112M parameters, which is unprecedented in the literature. In particular, we identify and address training instabilities specific to this scale while maintaining high-fidelity output without over-regularization.
  • Our BigVGAN achieves the state-of-the-art zero-shot performance for various out-of-distribution scenarios, including new speakers, novel languages, singing voices, music and instrumental audio in unseen (even noisy) recording environments.
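To make the first highlight concrete, the periodic nonlinearity we use is the Snake activation, f(x) = x + (1/α) sin²(αx), where α controls the frequency of the periodic component. The following is a minimal NumPy sketch for illustration only (in the actual model, α is a learnable per-channel parameter inside the network):

```python
import numpy as np

def snake(x, alpha=1.0):
    """Snake activation: x + (1/alpha) * sin^2(alpha * x).

    The sin^2 term injects a periodic inductive bias that helps a
    waveform generator model oscillatory structure, while the identity
    term keeps the function easy to optimize. In BigVGAN, alpha is a
    learnable per-channel parameter; here it is a fixed scalar.
    """
    return x + np.sin(alpha * x) ** 2 / alpha

# The activation passes through the origin and adds a non-negative,
# bounded periodic ripple on top of the identity:
x = np.linspace(-np.pi, np.pi, 9)
y = snake(x, alpha=1.0)
```

Because the ripple sin²(αx)/α is non-negative, the output never falls below the identity line, and the activation reduces to x wherever αx is a multiple of π.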

We release our code and model at link.

Method

The diagram below illustrates BigVGAN’s generator architecture. It is composed of multiple blocks of transposed 1-D convolution, each followed by the proposed anti-aliased multi-periodicity composition (AMP) module. The AMP module sums features from multiple residual blocks, each imposing a different channel-wise periodicity through dilated 1-D convolutions. It applies the Snake function to provide a periodic inductive bias, and filtered nonlinearities for anti-aliasing. We combine this improved generator with state-of-the-art discriminators.
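The filtered-nonlinearity idea can be sketched as follows: a pointwise nonlinearity such as Snake creates harmonics above the signal's Nyquist frequency, which alias back into the band. The standard remedy is to upsample, apply the nonlinearity at the higher rate, low-pass filter, and downsample. Below is a simplified NumPy sketch of that pattern using a windowed-sinc FIR filter; it is an illustration of the principle, not the BigVGAN implementation (filter design and rates are assumptions here):

```python
import numpy as np

def snake(x, alpha=1.0):
    # Periodic Snake activation (repeated here so the sketch is self-contained).
    return x + np.sin(alpha * x) ** 2 / alpha

def lowpass_kernel(num_taps=33, cutoff=0.5):
    # Windowed-sinc low-pass FIR; `cutoff` is a fraction of the Nyquist rate.
    n = np.arange(num_taps) - (num_taps - 1) / 2
    h = cutoff * np.sinc(cutoff * n) * np.hamming(num_taps)
    return h / h.sum()

def anti_aliased_snake(x, alpha=1.0):
    """Apply Snake at 2x the sample rate, then filter back down.

    Harmonics generated by the nonlinearity above the original Nyquist
    frequency are attenuated by the low-pass filter before decimation,
    instead of folding back into the band as aliasing.
    """
    h = lowpass_kernel()
    # Upsample 2x by zero-stuffing, then low-pass to interpolate
    # (factor 2 compensates the energy lost to the inserted zeros).
    x_up = np.zeros(2 * len(x))
    x_up[::2] = x
    x_up = 2 * np.convolve(x_up, h, mode="same")
    y_up = snake(x_up, alpha)           # nonlinearity at the higher rate
    y_up = np.convolve(y_up, h, mode="same")  # remove out-of-band harmonics
    return y_up[::2]                    # decimate back to the original rate

# A sine near Nyquist is where a naive pointwise activation aliases most:
t = np.arange(256)
x = np.sin(0.9 * np.pi * t)
y = anti_aliased_snake(x)
```

The design choice being illustrated is that anti-aliasing is applied around every nonlinearity, not just at the output, so aliasing artifacts cannot accumulate across the generator's blocks.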

[Figure: BigVGAN generator architecture]

Audio Samples

Out-of-distribution robustness: in-the-wild YouTube clips

BigVGAN, trained only on LibriTTS, exhibits strong zero-shot performance and robustness in out-of-distribution scenarios. It is capable of synthesizing non-speech vocalizations and other audio, such as laughter, singing, and all types of in-the-wild audio from YouTube clips.

[Audio samples: Ground-Truth, HiFi-GAN (V1), UnivNet-c32 (train-clean-360), BigVGAN, BigVGAN-base]

Out-of-distribution robustness: MUSDB18-HQ

Although trained only on LibriTTS, BigVGAN is capable of synthesizing a wide range of singing voices, music, and instrumental audio that are unseen during training.

[Audio samples for each source type (Others/Guitars, Vocal, Drums, Bass, and two Mixtures), comparing Ground-Truth, HiFi-GAN (V1), UnivNet-c32 (train-clean-360), BigVGAN, and BigVGAN-base]

LibriTTS test-other samples from unseen speakers

[Audio samples: Ground-Truth, HiFi-GAN (V1), UnivNet-c32 (train-clean-360), BigVGAN, BigVGAN-base]

Unseen languages and recording environments

[Audio samples: Ground-Truth, HiFi-GAN (V1), UnivNet-c32 (train-clean-360), BigVGAN, BigVGAN-base]