AI INFERENCE
Mamba-3 SSM Drops With Inference-First Design, Matching Transformers at Decode
Together.ai releases Mamba-3, an open-source state space model built for inference that outperforms Mamba-2 and matches Transformer decode speeds at 16K sequences.
NVIDIA Unveils Groq 3 LPX Rack System for Ultra-Low Latency AI Inference
NVIDIA's new Groq 3 LPX delivers 315 PFLOPS and 35x better inference throughput per megawatt, targeting agentic AI workloads on the Vera Rubin platform.
NVIDIA Blackwell Smashes Finance AI Benchmark With 3.2x Speed Gains
NVIDIA's GB200 NVL72 sets new STAC-AI record for LLM inference in financial trading, delivering up to 3.2x performance over Hopper architecture.
NVIDIA Blackwell Delivers 4x Inference Boost for India's Sarvam AI Models
NVIDIA's hardware-software co-design achieves 4x inference speedup for Sarvam AI's 30B parameter sovereign models, showcasing Blackwell's NVFP4 capabilities.
NVIDIA TensorRT for RTX Brings Self-Optimizing AI to Consumer GPUs
NVIDIA's TensorRT for RTX introduces adaptive inference that automatically optimizes AI workloads at runtime, delivering 1.32x performance gains on RTX 5090.
NVIDIA Achieves 10x AI Image Generation Speedup on Blackwell Data Center GPUs
NVIDIA's new NVFP4 optimizations deliver 10.2x faster FLUX.2 inference on Blackwell B200 GPUs versus H200, with near-linear multi-GPU scaling.
NVIDIA Grove Simplifies AI Inference on Kubernetes
NVIDIA introduces Grove, a Kubernetes API that streamlines complex AI inference workloads, enhancing scalability and orchestration of multi-component systems.
NVIDIA Enhances AI Inference with Dynamo and Kubernetes Integration
NVIDIA's Dynamo platform now integrates with Kubernetes to streamline AI inference management, offering improved performance and reduced costs for data centers.
NVIDIA Dynamo Tackles KV Cache Bottlenecks in AI Inference
NVIDIA Dynamo introduces KV Cache offloading to address memory bottlenecks in AI inference, enhancing efficiency and reducing costs for large language models.
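For context, KV caching stores each attention layer's keys and values so that every decode step only computes attention for the newest token; the memory bottleneck Dynamo targets arises because this cache grows with sequence length, and offloading moves it out of GPU memory. A toy single-head sketch of the cache mechanics (pure Python, illustrative numbers only; not Dynamo's implementation):

```python
import math

def attend(q, keys, values):
    """Single-head scaled dot-product attention over cached keys/values."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    weights = [w / z for w in weights]
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(d)]

class KVCache:
    """Append-only key/value store: each decode step adds one entry, so
    per-step attention cost is O(seq_len) instead of recomputing O(seq_len^2)."""
    def __init__(self):
        self.keys, self.values = [], []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

    def __len__(self):
        return len(self.keys)

cache = KVCache()
for step in range(3):            # pretend these come from the model's K/V projections
    k = v = [float(step), 1.0]
    cache.append(k, v)
    out = attend([1.0, 0.0], cache.keys, cache.values)

print(len(cache))  # 3 cached positions after 3 decode steps
```

In a real deployment each cached entry is a per-layer, per-head tensor, which is why long contexts make the cache large enough to be worth offloading to CPU or storage tiers.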
Reducing AI Inference Latency with Speculative Decoding
Explore how speculative decoding techniques, including EAGLE-3, reduce latency and enhance efficiency in AI inference, optimizing large language model performance on NVIDIA GPUs.
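The core idea of speculative decoding can be sketched with toy stand-ins for the draft and target models (greedy and deterministic here; EAGLE-3 and production systems are far more sophisticated). A cheap draft model proposes several tokens, and the expensive target model verifies them in one pass, accepting the matching prefix:

```python
def speculative_decode(target, draft, prompt, k=4, max_new=8):
    """Greedy speculative decoding sketch: `draft` proposes k tokens cheaply,
    `target` verifies them; the first mismatch is replaced by the target's
    own token, so output always equals plain target-only decoding."""
    seq = list(prompt)
    while len(seq) - len(prompt) < max_new:
        # 1. Draft model proposes k tokens autoregressively (cheap).
        proposal, ctx = [], seq[:]
        for _ in range(k):
            tok = draft(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # 2. Target verifies the proposal (one batched pass in practice).
        accepted, ctx = [], seq[:]
        for tok in proposal:
            t = target(ctx)
            if t == tok:                      # draft agreed: accept token
                accepted.append(tok)
                ctx.append(tok)
            else:                             # mismatch: take target's token
                accepted.append(t)
                break
        else:
            accepted.append(target(ctx))      # bonus token when all k accepted
        seq += accepted
    return seq[:len(prompt) + max_new]

# Toy "model": next token is (last + 1) mod 10. With a perfect draft, each
# target pass advances k+1 tokens instead of 1.
model = lambda ctx: (ctx[-1] + 1) % 10
out = speculative_decode(model, model, [0], k=4, max_new=8)
print(out)  # [0, 1, 2, 3, 4, 5, 6, 7, 8]
```

The latency win comes from step 2: verifying k drafted tokens costs roughly one target forward pass, while generating them directly would cost k passes, and a mismatching draft only reduces speed, never correctness.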
NVIDIA Enhances AI Scalability with NIM Operator 3.0.0 Release
NVIDIA's NIM Operator 3.0.0 introduces advanced features for scalable AI inference, enhancing Kubernetes deployments with multi-LLM and multi-node capabilities, and efficient GPU utilization.
NVIDIA's Rubin CPX GPU Revolutionizes Long-Context AI Inference
NVIDIA unveils Rubin CPX GPU, enhancing AI inference with unprecedented efficiency for 1M+ token workloads, transforming sectors like software development and video generation.
NVIDIA NVLink and Fusion Drive AI Inference Performance
NVIDIA's NVLink and NVLink Fusion technologies are redefining AI inference performance with enhanced scalability and flexibility to meet the exponential growth in AI model complexity.
Enhancing AI Model Efficiency: Torch-TensorRT Speeds Up PyTorch Inference
Discover how Torch-TensorRT optimizes PyTorch models for NVIDIA GPUs, doubling inference speed for diffusion models with minimal code changes.
NVIDIA Dynamo Expands AWS Support for Enhanced AI Inference Efficiency
NVIDIA Dynamo now supports AWS services, offering developers enhanced efficiency for large-scale AI inference. The integration promises performance improvements and cost savings.