Together AI Upgrades GPU Clusters With Autoscaling and Self-Healing Features - Blockchain.News

Lawrence Jengar Mar 10, 2026 17:34

Together AI adds enterprise-grade autoscaling, RBAC, observability dashboards, and self-healing node repair to GPU Clusters as company pursues $1B funding round.

Together AI has rolled out a significant infrastructure upgrade to its GPU Clusters platform, adding autoscaling, role-based access control, full-stack observability, and self-healing node repair capabilities. The enhancements arrive as the AI cloud company pursues $1 billion in fresh funding, according to reports from earlier this month.

The timing isn't coincidental. Enterprise customers running distributed training workloads across hundreds of GPUs need more than raw compute—they need infrastructure that doesn't require babysitting.

Autoscaling Targets GPU Waste

The new autoscaling feature, powered by the Kubernetes Cluster Autoscaler, monitors for GPU-constrained workloads and automatically provisions or decommissions nodes based on real-time demand. For teams running variable inference workloads or bursty training jobs, this means no more paying for idle hardware during quiet periods.

Static GPU provisioning has been a persistent pain point. Organizations either overprovision (expensive) or underprovision (performance bottlenecks during demand spikes). Together's approach lets clusters expand during peak load and contract when demand subsides.
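The expand-and-contract behavior described above can be sketched as a simple capacity calculation. This is an illustrative toy in the spirit of the Kubernetes Cluster Autoscaler, not Together AI's implementation; the function name, node shape, and limits are all assumptions.

```python
# Toy sketch of an autoscaling decision: given GPU demand and current
# nodes, compute how many nodes the cluster should run. Illustrative
# only -- not Together AI's or the Cluster Autoscaler's actual code.

def scale_decision(pending_gpu_requests, nodes, gpus_per_node=8,
                   min_nodes=1, max_nodes=64):
    """Return the target node count for current demand, clamped to limits."""
    in_use = sum(n["gpus_in_use"] for n in nodes)
    demand = in_use + pending_gpu_requests   # total GPUs needed right now
    needed = -(-demand // gpus_per_node)     # ceiling division
    return max(min_nodes, min(max_nodes, needed))

nodes = [{"gpus_in_use": 8}, {"gpus_in_use": 3}]   # 2 nodes, 11 GPUs busy
print(scale_decision(13, nodes))                   # bursty job arrives -> 3
print(scale_decision(0, [{"gpus_in_use": 2}]))     # quiet period -> 1
```

The key property is symmetry: the same demand calculation that adds nodes during a spike shrinks the cluster back down when pending requests drop to zero, which is what eliminates paying for idle hardware.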

Self-Healing Addresses Hardware Reality

GPU hardware fails. In large fleets, it's not a question of if but when. For distributed training, a single unstable node can invalidate hours of compute time.

Together's solution: self-serve health checks that users can trigger before launching major training jobs. Tests range from basic DCGM diagnostics to multi-node NCCL and InfiniBand bandwidth tests. When a node does fail, a three-click self-repair process automatically cordons, drains, and recreates the node—bringing clusters back to healthy status within minutes rather than hours.

Acceptance tests now run automatically during provisioning. Clusters won't be marked ready until they pass.
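The cordon-drain-recreate sequence the article describes can be modeled as a short state machine. The sketch below is an assumption-laden illustration of that flow; the state fields and step names are invented for clarity, and the health check stands in for the real DCGM, NCCL, and InfiniBand diagnostics.

```python
# Illustrative model of the self-repair flow: cordon (stop scheduling),
# drain (evict workloads), recreate (replace with a healthy node).
# Field names and steps are assumptions, not Together AI's internals.

def health_check(node):
    """Stand-in for DCGM / multi-node NCCL / InfiniBand diagnostics."""
    return node["healthy"]

def repair_node(node):
    """Walk a failed node through cordon, drain, and recreate."""
    steps = []
    node["schedulable"] = False                   # cordon: no new pods land here
    steps.append("cordoned")
    node["workloads"] = []                        # drain: evict running workloads
    steps.append("drained")
    node.update(healthy=True, schedulable=True)   # recreate: fresh node in place
    steps.append("recreated")
    return steps

node = {"healthy": False, "schedulable": True, "workloads": ["train-job-1"]}
if not health_check(node):
    print(repair_node(node))   # ['cordoned', 'drained', 'recreated']
print(health_check(node))      # True
```

The ordering matters: cordoning before draining ensures no new work lands on the node mid-repair, which is why a bad node can be swapped out without invalidating the rest of a distributed training run.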

Enterprise Access Controls

The RBAC implementation introduces "Projects" as isolation boundaries for teams. Two default roles split responsibilities cleanly: Admins get full control plane access for cluster creation and deletion, while Members can access GPU worker nodes and run workloads without touching infrastructure provisioning.

This matters for organizations where platform engineers need to lock down infrastructure while giving ML researchers freedom to experiment.
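The split described above can be sketched as a small permission model. This is a generic toy, assuming hypothetical permission names; Together's actual role definitions are not public in this article beyond the Admin/Member distinction.

```python
# Toy model of Projects as isolation boundaries with two default roles.
# Permission names ("create_cluster", etc.) are assumptions for the sketch.

ROLE_PERMISSIONS = {
    "admin":  {"create_cluster", "delete_cluster", "run_workload", "access_worker"},
    "member": {"run_workload", "access_worker"},   # no control-plane access
}

class Project:
    """Isolation boundary: role grants are scoped to one project."""
    def __init__(self, name):
        self.name = name
        self.roles = {}            # user -> role, within this project only

    def grant(self, user, role):
        self.roles[user] = role

    def can(self, user, action):
        role = self.roles.get(user)
        return role is not None and action in ROLE_PERMISSIONS[role]

proj = Project("ml-research")
proj.grant("alice", "admin")       # platform engineer
proj.grant("bob", "member")        # ML researcher
print(proj.can("bob", "run_workload"))    # True
print(proj.can("bob", "delete_cluster"))  # False
```

Because grants live inside the Project, a user's role in one project says nothing about their access in another, which is the isolation property the feature is built around.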

Observability Gets Native

Every GPU Cluster project now includes a dedicated Grafana instance with pre-built dashboards. Telemetry covers GPU utilization via DCGM metrics, InfiniBand and NIC-level networking data, storage I/O performance, and Kubernetes orchestration health. The feature is currently in private preview.
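A dashboard panel over that telemetry boils down to rollups over per-device samples. The sketch below shows the kind of aggregation a GPU-utilization panel computes; the sample format is an assumption (DCGM exposes comparable per-GPU utilization gauges, but not with these exact field names).

```python
# Sketch of a dashboard-style rollup over per-GPU utilization samples.
# Field names are illustrative; DCGM's real metrics differ in shape.

def utilization_summary(samples, idle_threshold=20):
    """Aggregate per-GPU samples into mean utilization and a list of idle GPUs."""
    utils = [s["util_pct"] for s in samples]
    return {
        "mean_util_pct": sum(utils) / len(utils),
        "idle_gpus": [s["gpu"] for s in samples if s["util_pct"] < idle_threshold],
    }

samples = [
    {"gpu": 0, "util_pct": 92}, {"gpu": 1, "util_pct": 88},
    {"gpu": 2, "util_pct": 15}, {"gpu": 3, "util_pct": 97},
]
print(utilization_summary(samples))
# {'mean_util_pct': 73.0, 'idle_gpus': [2]}
```

Spotting the one straggler GPU in a fleet (here, GPU 2) is exactly the case where a pre-built dashboard saves a debugging session.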

Market Context

Together AI has been building momentum in the GPU-as-a-service space. The company introduced Instant GPU Clusters at NVIDIA GTC in March 2025 and launched self-service GPU infrastructure that September. The platform supports NVIDIA Hopper (H100) and Blackwell (B200) GPUs, with Instant Clusters scaling up to 64 GPUs and Dedicated Clusters reaching 1,000 GPUs.

With a reported $7.5 billion market cap and a potential billion-dollar funding round in progress, Together is positioning itself as a serious alternative to hyperscaler GPU offerings—targeting teams that want bare-metal performance without the operational overhead of managing their own hardware.

The new features are available immediately to existing Together GPU Clusters customers.
