This paper shows how to perform large-scale distributed, large-batch, mixed-precision training of language models, and investigates the successes and limitations of large-batch training on publicly available language datasets.
Recommended citation: Raul Puri, Robert Kirby, Nikolai Yakovenko, Bryan Catanzaro, Large Scale Language Modeling: Converging on 40GB of Text in Four Hours. arXiv. 2018. https://arxiv.org/abs/1808.01371
SDCNet is a 3D convolutional neural network proposed for frame prediction. The model takes as input a sequence of past frames and their inter-frame optical flows and generates a per-pixel kernel and motion vector. A future frame is then synthesised by sampling past frames guided by the motion vectors and weighted by the learned kernels.
Recommended citation: Fitsum A. Reda, Guilin Liu, Kevin J. Shih, Robert Kirby, Jon Barker, David Tarjan, Andrew Tao, Bryan Catanzaro, SDCNet: Video Prediction Using Spatially Displaced Convolution. ECCV 2018. https://arxiv.org/abs/1811.00684
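The sampling step in the SDCNet summary above (past-frame pixels fetched at a motion-displaced location, then weighted by a per-pixel kernel) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `sdc_predict`, the array layouts, the single grayscale frame, the nearest-neighbour rounding of the displacement, and the edge clamping are all assumptions made to keep the example short.

```python
import numpy as np

def sdc_predict(frame, flow, kernels):
    """Illustrative displaced-kernel sampling (hypothetical helper).

    frame:   (H, W) past frame, grayscale for simplicity
    flow:    (H, W, 2) per-pixel motion vector stored as (dx, dy)
    kernels: (H, W, K, K) per-pixel weights, K odd
    """
    H, W = frame.shape
    K = kernels.shape[2]
    r = K // 2
    out = np.zeros((H, W), dtype=np.float64)
    for y in range(H):
        for x in range(W):
            # Centre of the sampling window, displaced by the motion vector.
            # Assumption: round to the nearest pixel instead of interpolating.
            cx = int(round(float(x + flow[y, x, 0])))
            cy = int(round(float(y + flow[y, x, 1])))
            acc = 0.0
            for i in range(K):
                for j in range(K):
                    # Clamp coordinates so windows near the border stay valid.
                    sy = min(max(cy + i - r, 0), H - 1)
                    sx = min(max(cx + j - r, 0), W - 1)
                    acc += kernels[y, x, i, j] * frame[sy, sx]
            out[y, x] = acc
    return out

# Usage: identity kernels plus a uniform one-pixel shift in x.
frame = np.arange(16, dtype=np.float64).reshape(4, 4)
kernels = np.zeros((4, 4, 3, 3))
kernels[:, :, 1, 1] = 1.0          # all weight on the window centre
flow = np.zeros((4, 4, 2))
flow[:, :, 0] = 1.0                # dx = +1 for every pixel
shifted = sdc_predict(frame, flow, kernels)
```

With identity kernels the result reduces to pure motion-guided sampling; non-trivial kernels blend the neighbourhood around the displaced location.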
This paper shows how to scale up training sets for semantic segmentation using a video prediction-based data synthesis method. Our proposed joint propagation strategy and boundary relaxation technique alleviate the label noise in the synthesized samples and lead to state-of-the-art performance on three benchmark datasets: Cityscapes, CamVid, and KITTI.
Recommended citation: Yi Zhu, Karan Sapra, Fitsum A. Reda, Kevin J. Shih, Shawn Newsam, Andrew Tao and Bryan Catanzaro, Improving Semantic Segmentation via Video Propagation and Label Relaxation, arXiv:1812.01593, 2018. https://arxiv.org/abs/1812.01593
Recommended citation: Guilin Liu, Fitsum A. Reda, Kevin J. Shih, Ting-Chun Wang, Andrew Tao, Bryan Catanzaro, Image Inpainting for Irregular Holes Using Partial Convolutions, Proceedings of the European Conference on Computer Vision (ECCV) 2018. https://arxiv.org/abs/1804.07723
We train an 8.3 billion parameter transformer language model with 8-way model parallelism and 64-way data parallelism on 512 GPUs, making it the largest transformer-based language model ever trained, at 24x the size of BERT and 5.6x the size of GPT-2.
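The GPU count follows from the two parallelism degrees: 8 model-parallel ranks times 64 data-parallel replicas gives 512. The sketch below shows one way such a grid could be indexed. The rank layout (consecutive ranks sharing a model-parallel group) and the helper names are assumptions for illustration and are not taken from the paper.

```python
MODEL_PARALLEL = 8    # from the text: 8-way model parallelism
DATA_PARALLEL = 64    # from the text: 64-way data parallelism
WORLD_SIZE = MODEL_PARALLEL * DATA_PARALLEL  # total GPUs

def rank_to_coords(rank):
    """Map a global rank to (data-parallel index, model-parallel index).

    Assumption: ranks are laid out so that each run of MODEL_PARALLEL
    consecutive ranks holds the shards of one model replica.
    """
    if not 0 <= rank < WORLD_SIZE:
        raise ValueError(f"rank {rank} outside 0..{WORLD_SIZE - 1}")
    return rank // MODEL_PARALLEL, rank % MODEL_PARALLEL

def model_parallel_group(rank):
    """Ranks that jointly hold one copy of the model."""
    start = (rank // MODEL_PARALLEL) * MODEL_PARALLEL
    return list(range(start, start + MODEL_PARALLEL))

def data_parallel_group(rank):
    """Ranks that hold the same model shard across all replicas."""
    return list(range(rank % MODEL_PARALLEL, WORLD_SIZE, MODEL_PARALLEL))

print(WORLD_SIZE, rank_to_coords(13), model_parallel_group(13))
```

Under this layout, gradients would be averaged within each data-parallel group, while activations are exchanged within each model-parallel group.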
We propose unsupervised techniques to synthesize high frame rate videos directly from low frame rate videos using cycle consistency. We also introduce a pseudo-supervised loss term that constrains the interpolated frames to be consistent with the predictions of a pre-trained interpolation model. Used together with cycle consistency, the pseudo-supervised loss term can effectively adapt a pre-trained model to a new target domain. We show results that significantly reduce the domain gap in video frame interpolation.
Recommended citation: Fitsum A. Reda, Deqing Sun, Aysegul Dundar, Mohammad Shoeybi, Guilin Liu, Kevin J. Shih, Andrew Tao, Jan Kautz, Bryan Catanzaro, "Unsupervised Video Interpolation Using Cycle Consistency". In ICCV 2019. https://arxiv.org/abs/1906.05928
We release version 1.0 of Megatron, which makes training large NLP models even faster and sustains 62.4 teraFLOPs in end-to-end training, which is 48% of the theoretical peak FLOPS for a single GPU in a DGX2-H server.
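As a quick check on the figures above, the per-GPU peak implied by the two stated numbers can be recovered by dividing the sustained throughput by the utilization fraction. The helper name below is hypothetical, and the result is derived only from the numbers in the text, not from a hardware specification.

```python
def implied_peak_tflops(sustained_tflops, fraction_of_peak):
    """Peak throughput implied by a sustained rate and its share of peak."""
    if not 0 < fraction_of_peak <= 1:
        raise ValueError("fraction_of_peak must be in (0, 1]")
    return sustained_tflops / fraction_of_peak

# Values quoted in the text: 62.4 teraFLOPs sustained at 48% of peak.
peak = implied_peak_tflops(62.4, 0.48)
print(f"Implied per-GPU peak: {peak:.1f} teraFLOPs")
```

The two figures together imply a per-GPU peak of roughly 130 teraFLOPs.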