NVIDIA and AWS Join Forces to Enhance AI Training Scalability
Iris Coleman Jun 24, 2025 12:39
NVIDIA Run:ai and AWS SageMaker HyperPod integrate to streamline AI training, offering enhanced scalability and resource management across hybrid cloud environments.

NVIDIA Run:ai and Amazon Web Services (AWS) have unveiled a strategic integration aimed at enhancing the scalability and management of complex AI training workloads. This collaboration merges AWS SageMaker HyperPod with NVIDIA Run:ai's advanced AI workload and GPU orchestration platform, promising improved efficiency and flexibility, according to NVIDIA.
Streamlining AI Infrastructure
AWS SageMaker HyperPod is designed to provide a resilient, persistent cluster for large-scale distributed training and inference. By optimizing resource utilization across multiple GPUs, it significantly cuts model training times. The service works with any model architecture, allowing teams to scale their training jobs effectively.
Moreover, SageMaker HyperPod improves resiliency by automatically detecting and handling infrastructure failures, so interrupted training jobs recover without significant downtime. This capability accelerates the machine learning lifecycle and boosts productivity.
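The recovery pattern described here typically rests on frequent checkpointing: when a node fails, the job is restarted from the last saved state rather than from scratch. The sketch below is illustrative only, not HyperPod's actual mechanism; the checkpoint file name and the `fail_at` parameter (which simulates a node failure) are hypothetical.

```python
import json
import os

CHECKPOINT = "checkpoint.json"  # hypothetical local checkpoint path

def load_checkpoint():
    """Return the last completed step, or 0 if no checkpoint exists."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)["step"]
    return 0

def save_checkpoint(step):
    """Persist progress so a restarted job can resume here."""
    with open(CHECKPOINT, "w") as f:
        json.dump({"step": step}, f)

def train(total_steps, fail_at=None):
    """Run (or resume) a training loop, checkpointing after each step.

    `fail_at` simulates an infrastructure failure mid-run.
    """
    start = load_checkpoint()  # resume from last checkpoint, if any
    for step in range(start, total_steps):
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated hardware failure")
        # ... one real training step would execute here ...
        save_checkpoint(step + 1)
    return total_steps
```

A first run that fails at step 3 leaves a checkpoint at step 3; rerunning `train(5)` then resumes from there instead of step 0, which is the behavior that keeps downtime small.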
Centralized Management with NVIDIA Run:ai
NVIDIA Run:ai offers a centralized interface for AI workload and GPU orchestration across hybrid environments, spanning on-premises and cloud setups. This approach lets IT administrators efficiently manage GPU resources across geographic locations and burst to the cloud seamlessly when demand spikes.
The integration has been thoroughly tested by technical teams from both AWS and NVIDIA Run:ai. It allows users to leverage SageMaker HyperPod’s flexibility while benefiting from NVIDIA Run:ai’s GPU optimization and resource-management features.
Dynamic and Cost-Effective Scaling
The collaboration enables organizations to extend their AI infrastructure seamlessly across on-premises and cloud environments. NVIDIA Run:ai's control plane lets enterprises manage GPU resources efficiently, whether on-prem or in the cloud, and supports dynamic scaling without over-provisioning hardware, reducing costs while maintaining performance.
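The placement logic behind this kind of cloud bursting can be summarized in a few lines: prefer free on-prem GPUs, spill to the cloud cluster when local capacity runs out, and queue only if bursting is disabled. This is a minimal sketch of the general idea, not Run:ai's actual scheduler; the function and its parameters are hypothetical.

```python
def place_job(gpus_needed, on_prem_free, cloud_enabled=True):
    """Decide where a training job runs under a simple burst policy.

    on_prem_free: GPUs currently idle in the on-prem cluster.
    cloud_enabled: whether bursting to a cloud cluster
    (e.g. a SageMaker HyperPod cluster) is permitted.
    """
    if gpus_needed <= on_prem_free:
        return "on-prem"      # local capacity suffices
    if cloud_enabled:
        return "cloud"        # burst: no need to over-provision on-prem
    return "queued"           # wait for local GPUs to free up
```

Under this policy an 8-GPU on-prem cluster runs a 4-GPU job locally, bursts a 16-GPU job to the cloud, and only queues it if bursting is turned off; the point is that peak demand never forces permanent hardware purchases.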
SageMaker HyperPod’s flexible infrastructure is ideal for large-scale model training and inference, making it suitable for enterprises focused on training or fine-tuning foundation models, such as Llama or Stable Diffusion.
Enhanced Resource Management
NVIDIA Run:ai ensures that AI infrastructure is used efficiently, thanks to its advanced scheduling and GPU fractioning capabilities. This flexibility is particularly beneficial for enterprises managing fluctuating demand, as it adapts to shifts in compute needs, reducing idle time and maximizing GPU return on investment.
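GPU fractioning means that several small workloads can share one physical GPU instead of each idling most of a dedicated card. A toy first-fit packer conveys why this raises utilization; it is an illustration of the concept, not Run:ai's allocation algorithm, and the function name is hypothetical.

```python
def pack_fractional(requests):
    """Pack fractional GPU requests (e.g. 0.5 = half a GPU) onto
    whole GPUs using first-fit decreasing; returns per-GPU load lists."""
    gpus = []
    for frac in sorted(requests, reverse=True):  # place big requests first
        for gpu in gpus:
            if sum(gpu) + frac <= 1.0 + 1e-9:    # fits on an existing GPU
                gpu.append(frac)
                break
        else:
            gpus.append([frac])                  # open a new GPU
    return gpus
```

For example, five inference workloads requesting 0.5, 0.5, 0.25, 0.25, and 0.5 of a GPU fit on two fully loaded GPUs instead of five mostly idle ones, which is the utilization gain the article attributes to fractioning.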
As part of the validation process, NVIDIA Run:ai tested several key capabilities, including hybrid and multi-cluster management, automatic job resumption after hardware failures, and inference serving. This integration represents a significant step forward in managing AI workloads across hybrid environments.