The full-stack NVIDIA accelerated computing platform has once again demonstrated exceptional performance in the latest MLPerf Training v4.0 benchmarks, according to the NVIDIA Blog.
Unprecedented Performance in Large Language Models
NVIDIA more than tripled its performance on the large language model (LLM) benchmark, based on GPT-3 175B, compared to its previous record-setting submission. This feat was achieved using an AI supercomputer featuring 11,616 NVIDIA H100 Tensor Core GPUs connected with NVIDIA Quantum-2 InfiniBand networking, a more than threefold increase over the 3,584 H100 GPUs used last year. Reaching that scale efficiently reflects extensive full-stack engineering by NVIDIA.
The scalability of the NVIDIA AI platform enables faster training of massive AI models like GPT-3 175B, and that speed translates into significant business opportunities. For instance, NVIDIA's recent earnings call highlighted that LLM service providers could turn a single dollar invested into seven dollars of revenue over four years by serving the Llama 3 70B model on NVIDIA HGX H200 servers.
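To make the arithmetic behind that claim concrete, here is a back-of-envelope sketch in Python. Every input below is an illustrative assumption (the server cost, throughput, token price, and utilization are not NVIDIA-published figures), chosen only to show how such a revenue multiple could be computed:

```python
# Back-of-envelope token economics for LLM serving.
# All inputs are illustrative assumptions, not NVIDIA-published figures.
SERVER_COST_USD = 300_000        # assumed all-in cost of one HGX H200 server
TOKENS_PER_SECOND = 24_000       # assumed aggregate Llama 3 70B serving throughput
USD_PER_MILLION_TOKENS = 0.90    # assumed market price for served tokens
UTILIZATION = 0.70               # assumed average utilization over the period
YEARS = 4

seconds = YEARS * 365 * 24 * 3600
tokens_served = TOKENS_PER_SECOND * UTILIZATION * seconds
revenue = tokens_served / 1e6 * USD_PER_MILLION_TOKENS
print(f"Revenue per dollar of server cost: {revenue / SERVER_COST_USD:.1f}x")
```

With these placeholder inputs the multiple lands in the same ballpark as the quoted figure (roughly 6x); the real economics depend entirely on actual pricing, throughput, and utilization.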
NVIDIA H200 GPU: Pushing Boundaries
The NVIDIA H200 Tensor Core GPU, built on the Hopper architecture, offers 141 GB of HBM3e memory and over 40% more memory bandwidth than the H100 GPU. In its MLPerf Training debut, the H200 outperformed the H100 by up to 47%, pushing the boundaries of AI training capabilities.
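As a quick sanity check on the bandwidth claim, the approximate spec-sheet numbers (about 3.35 TB/s for the SXM H100 and about 4.8 TB/s for the H200) imply an uplift of roughly 43%:

```python
# HBM bandwidth uplift from public spec sheets (approximate values).
h100_bandwidth_tb_s = 3.35   # H100 SXM
h200_bandwidth_tb_s = 4.8    # H200
uplift = h200_bandwidth_tb_s / h100_bandwidth_tb_s - 1
print(f"H200 memory bandwidth uplift over H100: {uplift:.0%}")  # ~43%
```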
Software Optimizations Drive Performance Gains
NVIDIA also reported a 27% performance improvement from its 512-GPU H100 configuration compared to the previous year, thanks to numerous software stack optimizations. This underscores how continuous software work keeps raising performance on the same hardware.
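One publicly documented example of this kind of software optimization is FP8 precision training on Hopper GPUs via NVIDIA Transformer Engine. The minimal sketch below shows the general pattern of running a layer under an FP8 autocast context; it is an illustration of the technique, not NVIDIA's actual MLPerf submission code, and assumes a CUDA machine with the transformer_engine package installed:

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

# HYBRID format keeps E4M3 for forward tensors and E5M2 for gradients.
fp8_recipe = DelayedScaling(margin=0, fp8_format=Format.HYBRID)

layer = te.Linear(768, 768, bias=True).cuda()
inp = torch.randn(64, 768, device="cuda", requires_grad=True)

# GEMMs inside the context run in FP8; the backward pass follows automatically.
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = layer(inp)
out.sum().backward()
```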
The large-scale submission also demonstrated near-linear scaling, with performance increasing almost proportionally as the GPU count grew from 3,584 to 11,616.
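Scaling efficiency here is simply the measured speedup divided by the ideal speedup implied by the GPU-count ratio. A minimal sketch, using the GPU counts from the article and placeholder training times chosen only for illustration:

```python
# GPU counts are from the article; the training times below are
# illustrative placeholders, not NVIDIA's measured results.
gpus_before, gpus_after = 3_584, 11_616
minutes_before, minutes_after = 10.9, 3.4   # assumed wall-clock training times

speedup = minutes_before / minutes_after    # measured speedup
ideal = gpus_after / gpus_before            # perfect linear scaling
print(f"{speedup:.2f}x speedup vs {ideal:.2f}x ideal "
      f"-> {speedup / ideal:.0%} scaling efficiency")
```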
Excellence in LLM Fine-Tuning
LLM fine-tuning, a critical workload for enterprises customizing pretrained large language models, was another highlight. On the new benchmark, which applies low-rank adaptation (LoRA) to Llama 2 70B, NVIDIA scaled from eight to 1,024 GPUs and completed the test in a record 1.5 minutes.
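The appeal of LoRA is that it trains only small low-rank adapter matrices rather than the full model. As a rough illustration of the technique itself (not the MLPerf reference implementation), here is a minimal sketch using the Hugging Face PEFT library, with a small GPT-2 model standing in for the 70B-parameter target:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Small stand-in model; the actual benchmark fine-tunes Llama 2 70B.
model = AutoModelForCausalLM.from_pretrained("gpt2")

# LoRA injects low-rank adapter matrices into the attention projections,
# so only a tiny fraction of the parameters needs training.
config = LoraConfig(
    r=16,                        # adapter rank
    lora_alpha=32,               # adapter scaling factor
    target_modules=["c_attn"],   # GPT-2's fused attention projection
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # reports the small trainable fraction
```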
Accelerating Stable Diffusion and GNN Training
NVIDIA achieved up to an 80% increase in Stable Diffusion v2 training performance at the same system scales as the previous round. Additionally, the H200 GPU delivered a 47% boost in single-node graph neural network (GNN) training compared to the H100, demonstrating the powerful performance and efficiency of NVIDIA GPUs for various AI applications.
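The MLPerf GNN test is based on a relational graph attention network (R-GAT). As a rough illustration of what a graph attention layer looks like, here is a minimal homogeneous two-layer GAT in PyTorch Geometric; it is a simplification of the relational model the benchmark uses, not the reference implementation:

```python
import torch
from torch_geometric.nn import GATConv

class TinyGAT(torch.nn.Module):
    """Two-layer graph attention network for node classification."""
    def __init__(self, in_dim: int, hidden_dim: int, num_classes: int, heads: int = 4):
        super().__init__()
        self.conv1 = GATConv(in_dim, hidden_dim, heads=heads)
        self.conv2 = GATConv(hidden_dim * heads, num_classes, heads=1)

    def forward(self, x: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor:
        x = torch.relu(self.conv1(x, edge_index))   # attention-weighted aggregation
        return self.conv2(x, edge_index)

# Tiny synthetic graph: 4 nodes with 3 features each, a ring of directed edges.
x = torch.randn(4, 3)
edge_index = torch.tensor([[0, 1, 2, 3], [1, 2, 3, 0]])
logits = TinyGAT(3, 8, 2)(x, edge_index)
```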
Broad Ecosystem Support
The breadth of the NVIDIA AI ecosystem was evident with 10 partners, including ASUS, Dell Technologies, and Lenovo, submitting their own impressive benchmark results. This widespread participation underscores the industry’s trust in NVIDIA’s AI platform.
MLCommons continues to play a vital role in AI computing by enabling peer-reviewed comparisons of AI and HPC platforms. This is crucial for guiding important purchasing decisions in a rapidly evolving field.
Looking ahead, the NVIDIA Blackwell platform promises next-level AI performance for trillion-parameter generative AI models, both in training and inference.