NVIDIA Unveils NVFP4 Checkpoint for Nemotron 3 Ultra

NVIDIA has introduced the Nemotron 3 Ultra NVFP4 checkpoint, a quantized release of its Nemotron 3 Ultra model. Using NVFP4, a 4-bit floating-point format supported natively on the Blackwell GPU architecture, the company reports up to 5.9x higher inference throughput on decode-heavy tasks compared to traditional FP4 models, while maintaining BF16-level accuracy on nearly all benchmarks.

Quantization, the process of compressing model weights into lower-precision data formats, is at the heart of this release. NVIDIA’s Model Optimizer was used to convert the 550-billion-parameter Nemotron 3 Ultra model to NVFP4, reducing its footprint from 1,121 GB to 352.3 GB, a roughly 3.2x size reduction. The smaller checkpoint lowers hardware requirements and widens deployment options across NVIDIA's Hopper and Blackwell GPU architectures. On Hopper, the model switches to W4A16 (4-bit weights, 16-bit activations), while Blackwell GPUs use native W4A4 (4-bit weights, 4-bit activations) for maximum efficiency.
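The size figures above can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes 1 GB means 10^9 bytes and treats the parameter count as exactly 550 billion, neither of which the article states.

```python
# Figures quoted in the article.
BF16_SIZE_GB = 1121.0
NVFP4_SIZE_GB = 352.3
PARAMS = 550e9  # "550-billion-parameter", assumed exact for this estimate


def compression_ratio(before_gb: float, after_gb: float) -> float:
    """How many times smaller the quantized checkpoint is."""
    return before_gb / after_gb


def bits_per_param(size_gb: float, params: float) -> float:
    """Average storage per parameter, assuming 1 GB = 1e9 bytes."""
    return size_gb * 1e9 * 8 / params


ratio = compression_ratio(BF16_SIZE_GB, NVFP4_SIZE_GB)
print(f"size reduction: {ratio:.2f}x")
print(f"BF16 checkpoint:  {bits_per_param(BF16_SIZE_GB, PARAMS):.1f} bits/param")
print(f"NVFP4 checkpoint: {bits_per_param(NVFP4_SIZE_GB, PARAMS):.1f} bits/param")
```

Under these assumptions the quantized checkpoint averages somewhat more than 4 bits per parameter, which is what one would expect if some layers stay in higher precision and the 4-bit weights carry extra scale factors.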

What sets the NVFP4 checkpoint apart is its precision management: not every layer is stored in NVFP4. Sensitive layers such as the attention linear layers remain in BF16 to preserve accuracy, while other components, such as the Mixture of Experts (MoE) routed experts, are quantized to NVFP4 or FP8 depending on their precision requirements. This selective quantization strategy is meant to keep accuracy high while minimizing memory and compute demands.
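One way to picture selective quantization is as a rule table that assigns a storage format to each layer by name. The sketch below is hypothetical: the layer names, the patterns, and the FP8 default are assumptions made for illustration and are not taken from NVIDIA's recipe or from the Model Optimizer API.

```python
from fnmatch import fnmatchcase

# Hypothetical rules; the first matching pattern wins.
# Layer names and the FP8 fallback are illustrative assumptions.
PRECISION_RULES = [
    ("*.attention.*", "bf16"),     # sensitive attention linears stay in BF16
    ("*.moe.experts.*", "nvfp4"),  # routed MoE experts go to 4-bit
    ("*", "fp8"),                  # assumed default for everything else
]


def precision_for(layer_name: str, rules=PRECISION_RULES) -> str:
    """Return the storage format chosen for a layer name."""
    for pattern, precision in rules:
        if fnmatchcase(layer_name, pattern):
            return precision
    raise ValueError(f"no rule matches {layer_name!r}")


for name in (
    "layers.0.attention.q_proj",
    "layers.0.moe.experts.7.up_proj",
    "layers.0.moe.shared_expert.up_proj",
):
    print(f"{name:40s} -> {precision_for(name)}")
```

Putting the most specific patterns first keeps the exceptions (layers held in BF16) visible at the top of the table, with a catch-all rule at the end.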

Technical Innovations and Industry Context

Quantizing to NVFP4 depends on how scale factors are chosen for the weights. NVIDIA tested several scaling methods, including max scaling, mean squared error (MSE) scaling, and a novel "four-over-six" scaling approach. The last of these proved instrumental in minimizing weight reconstruction error, significantly boosting downstream task accuracy without inflating the model's storage size. For instance, the four-over-six method achieved a 16.4% reduction in median reconstruction MSE across 48 MoE expert layers in Nemotron 3 Ultra.
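The three scaling strategies can be compared on a single block of weights. The sketch below is a simplified illustration, not NVIDIA's implementation: it assumes 16-element blocks and the FP4 (E2M1) magnitude grid, keeps scale factors in full precision rather than FP8, and assumes "four-over-six" means trying both 6 and 4 as the value the block maximum maps to and keeping whichever reconstructs the block with lower error. The article does not spell out that definition, and the search grid used for MSE scaling is arbitrary.

```python
import random

FP4_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)  # E2M1 magnitudes
BLOCK_SIZE = 16  # assumed block size for this sketch


def quantize_with_scale(block, scale):
    """Round each value to the nearest FP4 grid point, then dequantize."""
    if scale == 0.0:
        return [0.0] * len(block)
    out = []
    for x in block:
        target = abs(x) / scale
        mag = min(FP4_GRID, key=lambda g: abs(target - g))
        out.append(mag * scale if x >= 0 else -mag * scale)
    return out


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def amax(block):
    return max(abs(x) for x in block)


def max_scaling(block):
    """Map the largest magnitude in the block to 6, the top of the grid."""
    return quantize_with_scale(block, amax(block) / 6.0)


def mse_scaling(block, shrink=(1.0, 0.9, 0.8, 0.7, 0.6)):
    """Search a few clipped scales and keep the lowest-error one."""
    a = amax(block)
    candidates = [quantize_with_scale(block, a * s / 6.0) for s in shrink]
    return min(candidates, key=lambda q: mse(block, q))


def four_over_six(block):
    """Assumed reading: map the block max to 6 or to 4, keep the better fit."""
    a = amax(block)
    candidates = [quantize_with_scale(block, a / t) for t in (6.0, 4.0)]
    return min(candidates, key=lambda q: mse(block, q))


random.seed(0)
block = [random.gauss(0.0, 1.0) for _ in range(BLOCK_SIZE)]
for name, fn in (("max", max_scaling), ("mse", mse_scaling),
                 ("four-over-six", four_over_six)):
    print(f"{name:14s} reconstruction MSE = {mse(block, fn(block)):.6f}")
```

Because both alternative methods include the max-scaling candidate in their search, neither can do worse than max scaling on a given block under this setup. The storage cost is also unchanged, since each block still holds one scale and sixteen 4-bit values.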

NVIDIA's advancements align with its broader strategy to dominate the AI hardware and software ecosystem. The Nemotron 3 Ultra NVFP4 checkpoint benefits from integration with NVIDIA Model Optimizer, an open-source library designed to compress and accelerate AI models. This tool has become critical as enterprises adopt larger models for agentic AI, multimodal tasks, and robotics. Recent product launches, such as Nemotron 3 Super and the Vera Rubin GPU platform, underline NVIDIA's commitment to enabling efficient, scalable AI deployments.

Why This Matters

For enterprises, the ability to compress models like Nemotron 3 Ultra without sacrificing accuracy translates to lower inference costs, higher throughput, and reduced energy consumption. These optimizations are particularly relevant as AI use cases expand into resource-intensive domains like natural language processing, agentic AI, and robotics. The NVFP4 checkpoint positions NVIDIA to address these demands head-on, offering solutions that balance performance and efficiency.

With a market cap exceeding $4.7 trillion as of June 26, 2026, NVIDIA continues to solidify its role as a leader in AI innovation. The Nemotron 3 Ultra NVFP4 checkpoint could accelerate adoption of NVIDIA’s Blackwell and Hopper architectures, reinforcing the company's dominance in the AI hardware and software markets.

Developers and enterprises can begin experimenting with the NVFP4 format through Model Optimizer 0.46, set for release in July. The accompanying technical report and open-source recipes on GitHub provide detailed guidance for replicating NVIDIA’s results.

Sources: NVIDIA Developer Blog, NVIDIA Model Optimizer GitHub, Market Data as of 2026-06-26

