This paper demonstrates distributed, large-batch, mixed-precision training of language models, with an investigation into the successes and limitations of large-batch training on publicly available language datasets. We train a 4096-dimensional multiplicative LSTM on the Amazon Reviews dataset, an expensive process requiring roughly 20 exaflops per epoch. Training is performed in mixed precision with Tensor Cores on 128 GPUs across 16 DGX-1V nodes. This work enables new large language models to be trained in hours instead of days or weeks. Our code is available on GitHub.
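Mixed-precision training of the kind described above typically pairs FP16 arithmetic with dynamic loss scaling: the loss is multiplied by a large factor so that small gradients do not flush to zero in FP16, and gradients are divided back before the optimizer step. The sketch below illustrates only that bookkeeping in plain Python; the class and its parameter names are ours for illustration, not an interface from this paper's codebase.

```python
class DynamicLossScaler:
    """Minimal sketch of dynamic loss scaling for mixed-precision training.

    The loss is multiplied by `scale` before the backward pass so small
    FP16 gradients survive; gradients are divided by `scale` before the
    optimizer step. On overflow the scale is halved and the step skipped;
    after `growth_interval` consecutive good steps it is doubled again.
    """

    def __init__(self, init_scale=2.0 ** 15, growth_interval=2000):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self._good_steps = 0

    def unscale(self, grads):
        # Divide scaled gradients back to their true magnitude.
        return [g / self.scale for g in grads]

    def update(self, found_overflow):
        if found_overflow:
            # Gradients overflowed in FP16: back off and reset the streak.
            self.scale /= 2.0
            self._good_steps = 0
        else:
            self._good_steps += 1
            if self._good_steps % self.growth_interval == 0:
                # A long run of stable steps: try a larger scale again.
                self.scale *= 2.0
```

In a training loop, the step is skipped whenever `found_overflow` is true, so occasional FP16 overflows cost one update rather than corrupting the model.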