FlashAttention-4 Hits 71% GPU Utilization on NVIDIA Blackwell B200 - Blockchain.News


Terrill Dicki Mar 05, 2026 14:04

Together AI's FlashAttention-4 achieves 1,605 TFLOPs/s on B200 GPUs, up to 2.7x faster than Triton. New pipelining overcomes asymmetric hardware scaling bottlenecks.


Together AI has released FlashAttention-4, achieving up to 1,605 TFLOPs/s on NVIDIA's Blackwell B200 GPUs—representing 71% hardware utilization and marking a 2.7x speedup over Triton implementations. The release addresses a fundamental challenge in modern AI hardware: tensor core throughput is scaling far faster than other critical resources.

For context, NVIDIA's market cap sits at $4.49 trillion as of March 4, 2026, with shares trading at $179.86. The company released its own Flash Attention optimization guide for Blackwell GPUs just yesterday, signaling the growing importance of attention optimization in production AI workloads.

The Asymmetric Scaling Problem

Here's what makes this interesting. From Hopper H100 to Blackwell B200, BF16 tensor core throughput jumped from roughly 1 to 2.25 PFLOPS. But the special function units that handle exponential operations, and shared memory bandwidth? Unchanged. That asymmetry creates a bottleneck traditional kernel designs weren't built to handle.
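A back-of-envelope model makes the imbalance concrete. The tensor-core figures below are the article's; the exponential-unit rate is a normalized placeholder held constant across generations, which is exactly the point:

```python
# Rough comparison of tensor-core throughput vs. softmax exponential
# throughput across GPU generations. Tensor numbers (PFLOPS, BF16
# dense) are from the article; the SFU exp rate is an illustrative
# normalized placeholder that did NOT scale between generations.

TENSOR_PFLOPS = {"H100": 1.00, "B200": 2.25}
SFU_EXP_RATE  = {"H100": 1.00, "B200": 1.00}   # unchanged hardware

for gpu in ("H100", "B200"):
    ratio = TENSOR_PFLOPS[gpu] / SFU_EXP_RATE[gpu]
    print(f"{gpu}: tensor-to-exp balance = {ratio:.2f}x")

# Matmul got 2.25x faster while the exp units stood still, so the
# softmax exponentials, not the matmuls, become the critical path.
```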

The Together AI team discovered that the forward pass isn't compute-bound at all on B200—it's bottlenecked by exponential calculations in softmax. The backward pass? Shared memory traffic dominates. Traditional attention optimization focused on the wrong constraints.

How FA4 Solves It

The forward pass uses a ping-pong schedule processing two query tiles per CTA, with dedicated warpgroups handling softmax while others issue matrix operations. The clever bit: software emulation of the exponential function using FMA units alongside hardware MUFU.EX2, effectively doubling exponential throughput.

Conditional online softmax rescaling skips small corrections entirely. If the max jump stays below a threshold, the kernel avoids unnecessary vector operations. Final normalization still produces correct results—but the critical path shrinks considerably.
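The reason skipping is safe: in online softmax the running max only guards against overflow, so a slightly stale max is still mathematically exact. A minimal sketch (the threshold value and tiling are illustrative assumptions, not FA4's actual settings):

```python
import math

def online_softmax(x, tiles=4, threshold=1e-3):
    """Online softmax with conditional rescaling of the running sum.

    The running max m only prevents overflow, so keeping a stale m is
    still exact; we skip the rescale of l unless the max jumps by more
    than `threshold` (an illustrative value, not FA4's actual one).
    """
    m = -math.inf   # running max
    l = 0.0         # running sum of exp(x_j - m)
    step = max(1, len(x) // tiles)
    for start in range(0, len(x), step):
        tile = x[start:start + step]
        tile_max = max(tile)
        if tile_max > m + threshold:   # max moved enough: rescale once
            l *= math.exp(m - tile_max)
            m = tile_max
        l += sum(math.exp(v - m) for v in tile)
    return [math.exp(v - m) / l for v in x]

# Usage: tiles whose maxima differ by less than the threshold trigger
# no rescale, yet the final normalization is still correct.
x = [0.1, 2.0, 2.0005, -1.0, 1.9, 0.5, 2.0002, 0.0]
print(online_softmax(x))
```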

The backward pass exploits Blackwell's new 2-CTA MMA mode, partitioning output accumulators across CTA pairs. Each CTA stages half of operand B while keeping only its accumulator slice, roughly halving shared memory traffic. Global atomic reductions for dQ gradients also drop by half.
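A toy byte count shows where the halving comes from. The tile shape and dtype below are illustrative assumptions, not FA4's actual configuration:

```python
# Toy model of shared-memory staging for operand B in the backward
# pass. Tile shape (K x N) and BF16 dtype are illustrative only.

BYTES = 2            # BF16
K, N = 128, 128      # operand B tile, assumed shape

single_cta = K * N * BYTES          # each CTA stages all of B
two_cta    = (K * N // 2) * BYTES   # a CTA pair splits B in half

print(single_cta, two_cta)  # per-CTA staging traffic halves
```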

Performance Numbers

Against cuDNN 9.13, FlashAttention-4 delivers 1.1-1.3x improvement on forward passes and consistent gains on backward passes at large sequence lengths. The Triton comparison shows the starkest difference—up to 2.7x faster forward performance.

Deterministic mode, which serializes global reductions for reproducible training, still achieves 85-90% of non-deterministic throughput. That's significant for teams requiring exact reproducibility across training runs.
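The underlying issue is that floating-point addition is not associative, so gradient reductions landing in a nondeterministic (atomic) order can differ bitwise between runs; serializing them fixes the order at some throughput cost. A minimal illustration with arbitrary values:

```python
import random

# Floating-point addition is not associative: the same values summed
# in different orders can produce different results. Deterministic
# training modes fix the reduction order so repeat runs match exactly.

vals = [1e16, 1.0, -1e16, 1.0] * 250   # mix prone to cancellation

def reduce_in_order(xs):
    total = 0.0
    for v in xs:                       # one fixed, serial order
        total += v
    return total

fixed = reduce_in_order(vals)
shuffled = vals[:]
random.Random(0).shuffle(shuffled)     # stand-in for atomic arrival order

# Same order is bit-exact across runs; a different order may not be.
print(fixed, reduce_in_order(shuffled))
```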

The Broader Picture

FlashAttention has evolved rapidly since its May 2022 debut. Version 1 achieved 25-40% utilization on A100s. FA2 pushed that to 50-73% in July 2023. FA3 targeted Hopper GPUs specifically, hitting 75% utilization with FP16 and nearly 1.2 PFLOPS with FP8.

FA4 represents a philosophical shift—algorithm and kernel co-design that accounts for asymmetric hardware evolution. The techniques have already been partially incorporated into cuDNN 9.13 and 9.14 through collaboration with NVIDIA's teams.

The implementation uses CuTe-DSL, CUTLASS's Python kernel DSL, cutting compile times by 20-30x versus C++ templates. For teams running large-scale training on Blackwell hardware, the efficiency gains compound across millions of attention operations daily.
