Sitemap

SDCNet: Video Prediction Using Spatially Displaced Convolution

Published in European Conference on Computer Vision (ECCV), 2018

SDCNet is a 3D convolutional neural network proposed for frame prediction. The model takes as input a sequence of past frames and their inter-frame optical flows and generates a per-pixel kernel and motion vector. A future frame is then synthesised by sampling past frames guided by the motion vectors and weighted by the learned kernels.

Recommended citation: Fitsum A. Reda, Guilin Liu, Kevin J. Shih, Robert Kirby, Jon Barker, David Tarjan, Andrew Tao, Bryan Catanzaro, SDCNet: Video Prediction Using Spatially Displaced Convolution. ECCV 2018. https://arxiv.org/abs/1811.00684
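
The sampling step described above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a single-channel frame stored as nested lists, integer-valued motion vectors, and border clamping, and the name `sdc_sample` is hypothetical. Each output pixel is a kernel-weighted sum over a K x K patch of the past frame, centred at the pixel's location displaced by its motion vector.

```python
def sdc_sample(frame, flow, kernels):
    """Toy spatially displaced convolution (illustrative assumptions only).

    frame:   H x W nested list, one past frame (single channel)
    flow:    H x W nested list of (dx, dy) integer motion vectors
    kernels: H x W nested list of K x K per-pixel kernels (K odd)
    """
    H, W = len(frame), len(frame[0])
    K = len(kernels[0][0])
    r = K // 2

    def clamp(v, lo, hi):
        return max(lo, min(hi, v))

    out = [[0.0] * W for _ in range(H)]
    for y in range(H):
        for x in range(W):
            dx, dy = flow[y][x]
            # Displace the sampling centre by the per-pixel motion vector.
            cx, cy = x + dx, y + dy
            acc = 0.0
            for i in range(K):
                for j in range(K):
                    # Clamp sample coordinates so border patches stay in bounds.
                    sy = clamp(cy + i - r, 0, H - 1)
                    sx = clamp(cx + j - r, 0, W - 1)
                    acc += kernels[y][x][i][j] * frame[sy][sx]
            out[y][x] = acc
    return out


# Usage: an identity kernel with a one-pixel horizontal displacement
# copies each pixel from its right-hand neighbour (clamped at the border).
frame = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
kernels = [[identity for _ in range(3)] for _ in range(3)]
flow = [[(1, 0) for _ in range(3)] for _ in range(3)]
shifted = sdc_sample(frame, flow, kernels)
```

With an identity kernel the operation reduces to plain motion-vector warping, and with zero motion it reduces to a per-pixel adaptive kernel; the description above combines the two.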

NVIDIA Applied Deep Learning Research

NVIDIA ADLR

Archive Layout with Content

BigVGAN: A Universal Neural Vocoder with Large-Scale Training

Published: June 10, 2022

We present BigVGAN, a universal neural vocoder. It is trained only on speech data but shows extraordinary zero-shot generalization to non-speech vocalizations (laughter, applause), singing voices, music, and instrumental audio, even when recorded in varied noisy environments.

Posts by Category

Speech Denoising in the Waveform Domain with Self-Attention

Published: February 01, 2022

We present CleanUNet, a speech denoising model that operates on the raw waveform. It is based on an encoder-decoder architecture combined with several self-attention blocks that refine its bottleneck representations, which is crucial for obtaining good results. It outperforms state-of-the-art models in denoised speech quality across various objective and subjective evaluation metrics.

Posts by Collection

CV

Flowtron: an Autoregressive Flow-based Generative Network for Text-to-Speech Synthesis

Published: May 12, 2020

Flowtron is an autoregressive flow-based generative network for text-to-speech synthesis with direct control over speech variation and style transfer.

Large Scale Language Modeling: Converging on 40GB of Text in Four Hours

Published in 2018 High Performance Machine Learning Workshop, 2018

This paper shows how to do large scale distributed, large batch, mixed precision training of language models with investigations into the successes and limitations of large batch training on publicly available language datasets.

Recommended citation: Raul Puri, Robert Kirby, Nikolai Yakovenko, Bryan Catanzaro, Large Scale Language Modeling: Converging on 40GB of Text in Four Hours. arXiv. 2018. https://arxiv.org/abs/1808.01371

Malware Detection by Eating a Whole EXE

Published in 2018 AAAI Workshop on AI for Cyber Security, 2018

This paper shows how to do whole binary classification for malware detection with a convolutional neural network. Done in collaboration with researchers at the University of Maryland.

Recommended citation: Edward Raff, Jon Barker, Jared Sylvester, Robert Brandon, Bryan Catanzaro, Charles Nicholas, Malware Detection by Eating a Whole EXE. arXiv. 2017. http://arxiv.org/abs/1710.09435

Markdown

MegatronLM’s Supercharged V1.0

Published: May 15, 2020

We release version 1.0 of Megatron, which makes training large NLP models even faster and sustains 62.4 teraFLOPs in end-to-end training, 48% of the theoretical peak FLOPS for a single GPU in a DGX2-H server.

MegatronLM: Training Billion+ Parameter Language Models Using GPU Model Parallelism

Published: August 13, 2019

We train an 8.3 billion parameter transformer language model with 8-way model parallelism and 64-way data parallelism on 512 GPUs, making it the largest transformer-based language model ever trained, at 24x the size of BERT and 5.6x the size of GPT-2.
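
As a rough illustration of how 8-way model parallelism and 64-way data parallelism account for 512 GPUs, the sketch below partitions GPU ranks into groups. The function name `parallel_groups` and the layout (consecutive ranks form a model-parallel group, strided ranks form a data-parallel group) are assumptions made for this example, not a description of the released code.

```python
def parallel_groups(world_size, model_parallel_size):
    """Split ranks into model-parallel and data-parallel groups (assumed layout)."""
    assert world_size % model_parallel_size == 0
    # Each model-parallel group holds one full copy of the model, split across its ranks.
    model_groups = [
        list(range(start, start + model_parallel_size))
        for start in range(0, world_size, model_parallel_size)
    ]
    # Each data-parallel group holds the ranks that own the same model shard.
    data_groups = [
        list(range(offset, world_size, model_parallel_size))
        for offset in range(model_parallel_size)
    ]
    return model_groups, data_groups


model_groups, data_groups = parallel_groups(world_size=512, model_parallel_size=8)
# 64 model replicates of 8 GPUs each; every shard is replicated across 64 GPUs.
print(len(model_groups), len(model_groups[0]))  # 64 8
print(len(data_groups), len(data_groups[0]))    # 8 64
```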

Mellotron: Multispeaker expressive voice synthesis by conditioning on rhythm, pitch and global style tokens

Published: October 23, 2019

Mellotron is a multispeaker voice synthesis model that can make a voice emote and sing without emotive or singing training data.

One TTS Alignment to Rule Them All

Published: August 20, 2021

We present an unsupervised alignment learning framework that learns speech-text alignments online in text-to-speech models. We show that this framework can be applied to any TTS model, removing the dependency of TTS systems on external aligners. It also improves speech quality as judged by human evaluators.

Page Archive

Image Inpainting for Irregular Holes Using Partial Convolutions

Published in ECCV 2018, 2018

Recommended citation: Guilin Liu, Fitsum A. Reda, Kevin J. Shih, Ting-Chun Wang, Andrew Tao, Bryan Catanzaro, Image Inpainting for Irregular Holes Using Partial Convolutions, Proceedings of the European Conference on Computer Vision (ECCV) 2018. https://arxiv.org/abs/1804.07723

Partial Convolution based Padding

Published in arXiv, 2018

Recommended citation: Guilin Liu, Kevin J. Shih, Ting-Chun Wang, Fitsum A. Reda, Karan Sapra, Zhiding Yu, Andrew Tao, Bryan Catanzaro, Partial Convolution based Padding, arXiv:1811.11718, 2018. https://arxiv.org/abs/1811.11718

Portfolio

Projects

Publications

RAD-TTS: Parallel Flow-Based TTS with Robust Alignment Learning and Diverse Synthesis

Published: August 16, 2021

RAD-TTS is a parallel flow-based generative network for text-to-speech synthesis which does not rely on external aligners to learn speech-text alignments and supports diversity in generated speech by modeling speech rhythm as a separate generative distribution.

Improving Semantic Segmentation via Video Propagation and Label Relaxation

Published in arXiv, 2018

This paper shows how to scale up training sets for semantic segmentation by using a video prediction-based data synthesis method. Our proposed joint propagation strategy and boundary relaxation technique can alleviate the label noise in the synthesized samples and lead to state-of-the-art performance on three benchmark datasets: Cityscapes, CamVid and KITTI.

Recommended citation: Yi Zhu, Karan Sapra, Fitsum A. Reda, Kevin J. Shih, Shawn Newsam, Andrew Tao and Bryan Catanzaro, Improving Semantic Segmentation via Video Propagation and Label Relaxation, arXiv:1812.01593, 2018. https://arxiv.org/abs/1812.01593

Sitemap

Posts by Tags

Talk map

Talks and presentations

Teaching

Terms and Privacy Policy

Fine Detailed Texture Learning for 3D Meshes with Generative Models

Published in arXiv, 2022

Recommended citation: Aysegul Dundar, Jun Gao, Andrew Tao, Bryan Catanzaro, Fine Detailed Texture Learning for 3D Meshes with Generative Models, arXiv:2203.09362, 2022. https://arxiv.org/abs/2203.09362

Long-Short Transformer: Efficient Transformers for Language and Vision

Published: July 29, 2021

Long-Short Transformer is an efficient self-attention mechanism for modeling long sequences with linear complexity for both language and vision tasks.

Unsupervised Video Interpolation Using Cycle Consistency

Published in International Conference on Computer Vision (ICCV), 2019

We propose unsupervised techniques to synthesize high frame rate videos directly from low frame rate videos using cycle consistency. We also introduce a pseudo-supervised loss term that enforces the interpolated frames to be consistent with predictions of a pre-trained interpolation model. The pseudo-supervised loss term, used together with cycle consistency, can effectively adapt a pre-trained model to a new target domain. We show results that significantly reduce the domain gap problem in video frame interpolation.

Recommended citation: Fitsum A. Reda, Deqing Sun, Aysegul Dundar, Mohammad Shoeybi, Guilin Liu, Kevin J. Shih, Andrew Tao, Jan Kautz, Bryan Catanzaro, "Unsupervised Video Interpolation Using Cycle Consistency". In ICCV 2019. https://arxiv.org/abs/1906.05928

View Generalization for Single Image Textured 3D Models

Published in CVPR 2021, 2021

Recommended citation: Anand Bhattad, Aysegul Dundar, Guilin Liu, Andrew Tao, Bryan Catanzaro, View Generalization for Single Image Textured 3D Models, Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR) 2021.

WaveGlow: a Flow-based Generative Network for Speech Synthesis

Published: October 29, 2018

WaveGlow is an invertible neural network that can generate high quality speech efficiently from mel-spectrograms.

Blog posts

Jupyter notebook markdown generator

Malware Detection by Eating a Whole EXE

Published: February 02, 2018

This paper shows how to do whole binary classification for malware detection with a convolutional neural network. Done in collaboration with researchers at the University of Maryland.

Recommended citation: Edward Raff, Jon Barker, Jared Sylvester, Robert Brandon, Bryan Catanzaro, Charles Nicholas, Malware Detection by Eating a Whole EXE. arXiv. 2017. http://arxiv.org/abs/1710.09435

Large Scale Language Modeling: Converging on 40GB of Text in Four Hours

Published: August 03, 2018

This paper shows how to do large scale distributed, large batch, mixed precision training of language models with investigations into the successes and limitations of large batch training on publicly available language datasets.

Recommended citation: Raul Puri, Robert Kirby, Nikolai Yakovenko, Bryan Catanzaro, Large Scale Language Modeling: Converging on 40GB of Text in Four Hours. arXiv. 2018. https://arxiv.org/abs/1808.01371

SDCNet: Video Prediction Using Spatially Displaced Convolution

Published: September 08, 2018

SDCNet is a 3D convolutional neural network proposed for frame prediction. The model takes as input a sequence of past frames and their inter-frame optical flows and generates a per-pixel kernel and motion vector. A future frame is then synthesised by sampling past frames guided by the motion vectors and weighted by the learned kernels.

Recommended citation: Fitsum A. Reda, Guilin Liu, Kevin J. Shih, Robert Kirby, Jon Barker, David Tarjan, Andrew Tao, Bryan Catanzaro, SDCNet: Video Prediction Using Spatially Displaced Convolution. ECCV 2018. https://arxiv.org/abs/1811.00684

Improving Semantic Segmentation via Video Propagation and Label Relaxation

Published: December 05, 2018

This paper shows how to scale up training sets for semantic segmentation by using a video prediction-based data synthesis method. Our proposed joint propagation strategy and boundary relaxation technique can alleviate the label noise in the synthesized samples and lead to state-of-the-art performance on three benchmark datasets: Cityscapes, CamVid and KITTI.

Recommended citation: Yi Zhu, Karan Sapra, Fitsum A. Reda, Kevin J. Shih, Shawn Newsam, Andrew Tao and Bryan Catanzaro, Improving Semantic Segmentation via Video Propagation and Label Relaxation, arXiv:1812.01593, 2018. https://arxiv.org/abs/1812.01593

Image Inpainting for Irregular Holes Using Partial Convolutions

Published: December 09, 2018

Recommended citation: Guilin Liu, Fitsum A. Reda, Kevin J. Shih, Ting-Chun Wang, Andrew Tao, Bryan Catanzaro, Image Inpainting for Irregular Holes Using Partial Convolutions, Proceedings of the European Conference on Computer Vision (ECCV) 2018. https://arxiv.org/abs/1804.07723

Partial Convolution based Padding

Published: December 10, 2018

Recommended citation: Guilin Liu, Kevin J. Shih, Ting-Chun Wang, Fitsum A. Reda, Karan Sapra, Zhiding Yu, Andrew Tao, Bryan Catanzaro, Partial Convolution based Padding, arXiv:1811.11718, 2018. https://arxiv.org/abs/1811.11718

Unsupervised Video Interpolation Using Cycle Consistency

Published: September 26, 2019

We propose unsupervised techniques to synthesize high frame rate videos directly from low frame rate videos using cycle consistency. We also introduce a pseudo-supervised loss term that enforces the interpolated frames to be consistent with predictions of a pre-trained interpolation model. The pseudo-supervised loss term, used together with cycle consistency, can effectively adapt a pre-trained model to a new target domain. We show results that significantly reduce the domain gap problem in video frame interpolation.

Recommended citation: Fitsum A. Reda, Deqing Sun, Aysegul Dundar, Mohammad Shoeybi, Guilin Liu, Kevin J. Shih, Andrew Tao, Jan Kautz, Bryan Catanzaro, "Unsupervised Video Interpolation Using Cycle Consistency". In ICCV 2019. https://arxiv.org/abs/1906.05928

View Generalization for Single Image Textured 3D Models

Published: June 13, 2021

Recommended citation: Anand Bhattad, Aysegul Dundar, Guilin Liu, Andrew Tao, Bryan Catanzaro, View Generalization for Single Image Textured 3D Models, Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR) 2021.

Fine Detailed Texture Learning for 3D Meshes with Generative Models

Published: March 17, 2022

Recommended citation: Aysegul Dundar, Jun Gao, Andrew Tao, Bryan Catanzaro, Fine Detailed Texture Learning for 3D Meshes with Generative Models, arXiv:2203.09362, 2022. https://arxiv.org/abs/2203.09362

Pages

Posts

projects