MegatronLM’s Supercharged V1.0

Published: May 15, 2020

We recently released version 1.0 of Megatron-lm in our github repository. In addition to training support for the world’s largest BERT models which established state-of-the-art results on the RACE leaderboard, we performed several software optimizations to make the training of large NLP models even faster. As a result, our baseline model with 1.2 billion parameters now achieves 62.4 teraFLOPs which is 48% of the theoretical peak FLOPS for a single GPU in a DGX2-H server. This is a 60% improvement over our previously 39 teraFLOPs published number.

In addition, to test the effect of the optimizations on our model parallel scaling, we considered four GPT2 configurations ranging from 1.2B to 8.7B parameters with eight-way model parallelism. We fixed the batch size to 8 and increased the model parallel size as the model size increases. The scaling results are shown in Table 1. We observed excellent scaling numbers in both settings. For example, our experiments with 8.7 billion parameters and 8-way (8 GPUs) model parallelism achieved 79.6% of linear scaling.

**Table 1:** Weak model parallel scaling.
Number of Parameters (billions)	Model Parallel GPUs	Iteration Time (ms)	Weak Scaling
1.2	1	1288	Baseline (100%)
2.0	2	1242	90.7%
4.2	4	1357	86.5%
8.7	8	1508	79.6%

We are constantly improving the computational efficiency of our code-base and will release the latest advancements in large scale LM training through our github repository.

Share on

Twitter Facebook LinkedIn

NVIDIA ADLR

Share on