Speech Denoising in the Waveform Domain with Self-Attention

Published: February 01, 2022

Authors: Zhifeng Kong, Wei Ping, Ambrish Dantrey, Bryan Catanzaro

Posted by: Wei Ping

Overview

In our recent paper, we present CleanUNet, a causal speech denoising model that operates directly on the raw waveform. The key highlights of our work are:

The proposed model is based on an encoder-decoder U-Net architecture combined with several self-attention blocks that refine its bottleneck representations. This refinement is crucial for obtaining good results.
The model is optimized through a set of losses defined over both the waveform and the multi-resolution STFT. We find that minimizing only the high-band STFT loss together with the waveform loss improves the human perceptual quality of the denoised audio.
The proposed method outperforms state-of-the-art models in denoised speech quality across various objective and subjective evaluation metrics.

Method

The diagram below illustrates the CleanUNet architecture, which contains an encoder, a decoder, and a bottleneck between them. The encoder and decoder have the same number of causal convolutional layers and are connected via skip connections (identity maps). The bottleneck is composed of N self-attention blocks. Each self-attention block consists of a multi-head self-attention layer followed by a position-wise fully connected layer. Each multi-head self-attention layer has 8 heads and a model dimension of 512, and its attention map is masked to ensure causality.
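To make the causal masking concrete, here is a minimal NumPy sketch of masked multi-head self-attention. It is an illustration under our own assumptions, not the CleanUNet implementation: the function name, the projection matrices w_q, w_k and w_v, and the random inputs are hypothetical, and the output projection, residual connections, normalization, and position-wise fully connected layer of a full block are omitted.

```python
import numpy as np

def causal_self_attention(x, w_q, w_k, w_v, n_heads=8):
    """Multi-head self-attention in which frame t attends only to frames <= t.

    x: (T, d_model) sequence of bottleneck frames.
    w_q, w_k, w_v: (d_model, d_model) projection matrices (hypothetical names).
    """
    T, d_model = x.shape
    d_head = d_model // n_heads

    def split_heads(m):
        # (T, d_model) -> (n_heads, T, d_head)
        return m.reshape(T, n_heads, d_head).transpose(1, 0, 2)

    q = split_heads(x @ w_q)
    k = split_heads(x @ w_k)
    v = split_heads(x @ w_v)

    # Scaled dot-product scores, shape (n_heads, T, T).
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)

    # Causal mask: True above the diagonal marks future frames.
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)

    # Row-wise softmax; masked (future) entries receive zero weight.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    out = weights @ v  # (n_heads, T, d_head)
    return out.transpose(1, 0, 2).reshape(T, d_model)

# Example with 8 heads and model dimension 512, as described above.
rng = np.random.default_rng(0)
T, d_model = 6, 512
x = rng.standard_normal((T, d_model))
w_q, w_k, w_v = (
    rng.standard_normal((d_model, d_model)) / np.sqrt(d_model) for _ in range(3)
)
y = causal_self_attention(x, w_q, w_k, w_v)
```

Because future positions are masked out before the softmax, changing a later input frame leaves the outputs at all earlier frames unchanged, which is the property a causal (streaming) denoiser needs.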

We optimize a loss function defined as a combination of an ℓ1 waveform loss and a multi-resolution STFT (M-STFT) loss. In practice, we find that the full-band M-STFT loss sometimes leads to low-frequency noise in the silent parts of the denoised speech, which lowers scores in human listening tests. On the other hand, if our model is trained with only the ℓ1 waveform loss, the silent parts of the output are clean, but the high-frequency bands are less accurate than those of a model trained with the M-STFT loss. This motivated us to define a high-band M-STFT loss, which computes the loss only on the high-frequency band (i.e., 4 kHz to 8 kHz for 16 kHz audio).
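The following NumPy sketch shows one way such a high-band loss could be computed. It is a hedged illustration rather than the paper's exact definition: the per-resolution loss here is a spectral-convergence term plus a mean log-magnitude difference, and the function names, the normalization, the three STFT resolutions, and the stft_weight parameter are all assumptions made for this example.

```python
import numpy as np

def stft_magnitude(x, n_fft, hop, win_length):
    """Magnitude spectrogram of a 1-D signal, shape (frames, n_fft // 2 + 1)."""
    window = np.hanning(win_length)
    n_frames = 1 + (len(x) - win_length) // hop
    frames = np.stack(
        [x[i * hop : i * hop + win_length] * window for i in range(n_frames)]
    )
    # Frames are zero-padded to n_fft before the real FFT.
    return np.abs(np.fft.rfft(frames, n=n_fft, axis=-1))

def high_band_stft_loss(clean, denoised, n_fft, hop, win_length, eps=1e-7):
    """STFT loss restricted to the upper half of the spectrum."""
    s_clean = stft_magnitude(clean, n_fft, hop, win_length)
    s_den = stft_magnitude(denoised, n_fft, hop, win_length)

    # Bin k sits at k * sr / n_fft, so bin n_fft // 4 is sr / 4
    # (4 kHz for 16 kHz audio) and the last bin is the Nyquist frequency.
    lo = n_fft // 4
    s_clean, s_den = s_clean[:, lo:], s_den[:, lo:]

    # Spectral convergence: relative Frobenius-norm error.
    sc = np.linalg.norm(s_clean - s_den) / (np.linalg.norm(s_clean) + eps)
    # Mean absolute log-magnitude difference (normalization is an assumption).
    log_mag = np.mean(np.abs(np.log(s_clean + eps) - np.log(s_den + eps)))
    return sc + log_mag

# Illustrative (n_fft, hop, win_length) settings, not taken from the paper.
RESOLUTIONS = [(512, 50, 240), (1024, 120, 600), (2048, 240, 1200)]

def total_loss(clean, denoised, stft_weight=1.0):
    """L1 waveform loss plus the high-band loss averaged over resolutions."""
    l1 = np.mean(np.abs(clean - denoised))
    m_stft = np.mean(
        [high_band_stft_loss(clean, denoised, *res) for res in RESOLUTIONS]
    )
    return l1 + stft_weight * m_stft
```

In this sketch the waveform term still penalizes errors at all frequencies, including residual noise in silent regions, while the STFT term only sees the 4 kHz to 8 kHz bins, which mirrors the division of labor described above.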

Results

Objective and subjective evaluation results for denoising on the DNS (2020)

Denoised Samples

Keyboard / Mechanical noise:

Noisy	CleanUNet	FAIR-denoiser

Clean	FullSubNet

Dog barking:

Noisy	CleanUNet	FAIR-denoiser

Clean	FullSubNet

Human talking:

Noisy	CleanUNet	FAIR-denoiser

Clean	FullSubNet

Indoor noise:

Noisy	CleanUNet	FAIR-denoiser

Clean	FullSubNet

Street noise:

Noisy	CleanUNet	FAIR-denoiser

Clean	FullSubNet

Shrill noise:

Noisy	CleanUNet	FAIR-denoiser

Clean	FullSubNet

Wind noise:

Noisy	CleanUNet	FAIR-denoiser

Clean	FullSubNet

Citation

@inproceedings{kong2022speech,
      title={Speech Denoising in the Waveform Domain with Self-Attention}, 
      author={Zhifeng Kong and Wei Ping and Ambrish Dantrey and Bryan Catanzaro},
      booktitle={ICASSP},
      year={2022},
}


NVIDIA ADLR