Nvidia's New MoE Kernels Promise 93% Speedup for AI Training
Rongchai Wang Jun 15, 2026 17:29
Nvidia unveils advanced MoE training kernels, boosting AI model throughput by up to 93% in GPT pre-training and redefining large-scale efficiency.
Nvidia has introduced fused kernels for Mixture-of-Experts (MoE) models aimed at raising training throughput. The new kernels, available via cuDNN Frontend, Transformer Engine, and Megatron Core, promise a 1.3x-2.1x speedup at the kernel level. End to end, they deliver up to a 93% increase in training throughput for GPT-OSS pre-training, according to Nvidia's internal testing, as reported on June 15, 2026.
MoE architectures have become critical in scaling AI models, enabling massive parameter counts while keeping computational costs manageable. Nvidia's new kernels aim to address key bottlenecks in MoE training, including memory overhead, CPU-GPU synchronization delays, and inefficiencies in activation and quantization routines. Using the CuTe DSL, the Python kernel-authoring interface in its CUTLASS library, Nvidia has re-engineered its software stack with the goal of keeping Tensor Cores busy throughout the training process.
Breaking Down the Bottlenecks
Three major challenges have historically hindered MoE training efficiency:
- Activation bottlenecks: Activation functions run as separate, memory-bound passes, leaving Tensor Cores idle while intermediate tensors are written out and read back.
- CPU overhead: Dynamic token routing across experts introduces significant CPU-GPU synchronization delays.
- Quantization inefficiencies: Converting tensors to lower precision adds unnecessary memory-bound operations.
To solve these issues, Nvidia has developed custom fused kernels that integrate operations like grouped GEMM, activation functions (SwiGLU, GeGLU, sReLU), and quantization into single CUDA kernels. This eliminates intermediate tensor reads/writes and reduces memory overhead, particularly for low-precision formats like MXFP8 and NVFP4.
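To make the idea of fusion concrete, here is a minimal pure-Python sketch of a grouped expert layer with a SwiGLU activation, written once as two separate passes and once with the activation folded into the matrix-multiply loop. It is a conceptual illustration only, not Nvidia's kernels: the function names, the toy weights, and the routing table are all hypothetical, and quantization is left out for brevity.

```python
import math

def silu(x):
    # SiLU (swish), the gating nonlinearity inside SwiGLU
    return x / (1.0 + math.exp(-x))

def matvec(w, x):
    # w is a list of rows; returns the product w @ x
    return [sum(a * b for a, b in zip(row, x)) for row in w]

def unfused_expert_ffn(tokens, w_gate, w_up):
    # Pass 1: compute and store both projections for every token
    gate = [matvec(w_gate, t) for t in tokens]
    up = [matvec(w_up, t) for t in tokens]
    # Pass 2: a separate activation step re-reads the stored intermediates
    return [[silu(g) * u for g, u in zip(g_row, u_row)]
            for g_row, u_row in zip(gate, up)]

def fused_expert_ffn(tokens, w_gate, w_up):
    # Single pass: the activation is applied as soon as each pair of
    # dot products is ready, so no intermediate tensor is stored
    out = []
    for t in tokens:
        row = []
        for g_row, u_row in zip(w_gate, w_up):
            g = sum(a * b for a, b in zip(g_row, t))
            u = sum(a * b for a, b in zip(u_row, t))
            row.append(silu(g) * u)
        out.append(row)
    return out

def run_grouped(tokens, expert_ids, experts, ffn):
    # "Grouped GEMM" in miniature: gather each expert's tokens, run that
    # expert's weights on the group, and scatter results back in order
    out = [None] * len(tokens)
    for e, (w_gate, w_up) in enumerate(experts):
        idx = [i for i, eid in enumerate(expert_ids) if eid == e]
        results = ffn([tokens[i] for i in idx], w_gate, w_up)
        for i, r in zip(idx, results):
            out[i] = r
    return out

# Hypothetical toy data: three tokens routed to two experts
tokens = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]
expert_ids = [0, 1, 0]
experts = [
    ([[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [1.0, -1.0]]),
    ([[2.0, 1.0], [1.0, 2.0]], [[1.0, 0.0], [0.0, 1.0]]),
]

out_unfused = run_grouped(tokens, expert_ids, experts, unfused_expert_ffn)
out_fused = run_grouped(tokens, expert_ids, experts, fused_expert_ffn)
```

Both versions produce the same numbers; the difference is that the fused one never stores the gate and up projections. On a GPU, that saved round trip through memory is where a fused kernel would be expected to gain.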
Real-World Impact: GPT and DeepSeek Speedups
Nvidia reports an 8% end-to-end speedup for its DeepSeek-V3 pre-training setup and a 93% improvement for GPT-OSS pre-training. Such gains matter as competition in AI intensifies and organizations increasingly rely on MoE to scale models efficiently. Nvidia's advancements come at a time when the U.S. government is scrutinizing top AI models for national security risks, as noted in a June 2, 2026 executive order.
These performance boosts also have strategic implications for Nvidia's partnerships. The Pentagon, for instance, recently inked deals with Nvidia, Microsoft, and AWS to deploy AI on classified networks. Faster training cycles could accelerate model readiness for such high-stakes applications.
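For readers who want to put the reported percentages in perspective, the short sketch below converts a throughput gain into wall-clock time saved and applies Amdahl's law to relate a kernel-level speedup to an end-to-end one. It assumes the 8% and 93% figures are throughput gains over the same workload. The kernel-time fraction in the last call is a hypothetical input, not a number from Nvidia, and Amdahl's law ignores other effects such as removed synchronization stalls.

```python
def time_saved_fraction(throughput_gain_pct):
    # A gain of X% means the same work finishes in 1 / (1 + X/100) of the time
    speedup = 1.0 + throughput_gain_pct / 100.0
    return 1.0 - 1.0 / speedup

def amdahl_speedup(kernel_time_fraction, kernel_speedup):
    # Overall speedup when only a fraction of step time is accelerated
    p = kernel_time_fraction
    return 1.0 / ((1.0 - p) + p / kernel_speedup)

# 93% higher throughput cuts training time roughly in half (about 48%)
gpt_oss_saved = time_saved_fraction(93.0)

# 8% higher throughput saves about 7% of training time
deepseek_saved = time_saved_fraction(8.0)

# Hypothetical: if half of each step ran in kernels that became 2.0x
# faster, the whole step would be about 1.33x faster
example_overall = amdahl_speedup(0.5, 2.0)
```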
How to Access the Technology
Nvidia's fused MoE kernels are already integrated into its software ecosystem. Developers can access them through:
- cuDNN Frontend: Available in version 1.23.0+, this library allows direct invocation or use via a wrapper API for cached, reusable compilation.
- Transformer Engine: Version 2.15+ supports these kernels, enabling seamless integration with PyTorch workflows.
- Megatron Core: Starting with version 26.04-alpha.rc2, users can activate the kernels by adjusting runtime configurations.
For those interested in trying the technology, detailed benchmarks and instructions are available on Nvidia's GitHub repository.
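Before trying the kernels, it may help to confirm that installed packages meet the minimum versions listed above. The following is a rough sketch using only the Python standard library. The distribution names are assumptions about how the packages are published and may differ in a given environment. Megatron Core is omitted because its quoted version carries a pre-release tag that this simple numeric comparison does not handle.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Minimum versions as quoted in the article; the distribution names
# used as keys are assumptions and may need adjusting
MINIMUMS = {
    "nvidia-cudnn-frontend": (1, 23, 0),
    "transformer_engine": (2, 15),
}

def release_tuple(version_string):
    # Keep only the leading numeric release part, e.g. "2.15.0+cu12" -> (2, 15, 0)
    match = re.match(r"\d+(?:\.\d+)*", version_string)
    if match is None:
        return ()
    return tuple(int(part) for part in match.group(0).split("."))

def meets_minimum(installed, minimum):
    # Pad both tuples with zeros so "2.15" and "2.15.0" compare as equal
    have = release_tuple(installed)
    width = max(len(have), len(minimum))
    have = have + (0,) * (width - len(have))
    need = tuple(minimum) + (0,) * (width - len(minimum))
    return have >= need

def report():
    # Print one status line per package
    for name, minimum in MINIMUMS.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            print(f"{name}: not installed")
            continue
        status = "ok" if meets_minimum(installed, minimum) else "too old"
        print(f"{name}: {installed} ({status})")

if __name__ == "__main__":
    report()
```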
Why It Matters
Nvidia's advancements highlight the ongoing push to optimize AI at scale. With MoE models dominating frontier research since 2023, the ability to train these architectures efficiently has become a top priority for both commercial entities and governments. Nvidia's focus on hardware-aware software design is meant to keep its GPUs at the center of that effort.
As MoE adoption grows in domains like language, vision, and multimodal AI systems, faster training is not just a technical milestone but a strategic advantage. If the reported gains hold up outside Nvidia's own testing, the new kernels could change how organizations train and deploy large-scale AI models.