Felix Pinkston
Apr 09, 2026 17:23
NVIDIA’s Slinky project enables running Slurm clusters on Kubernetes, and is already deployed on 8,000+ GPU systems for large-scale AI training infrastructure.
NVIDIA has launched Slinky, an open-source project that bridges the gap between Slurm, the job scheduler running over 65% of TOP500 supercomputers, and Kubernetes, the dominant platform for managing GPU infrastructure at scale. The company already runs Slinky in production across clusters with more than 8,000 GPUs.
The technical problem here is real: organizations have years invested in Slurm job scripts, fair-share policies, and accounting workflows. But Kubernetes has become the standard for managing GPU infrastructure. Running two separate environments creates operational headaches that compound at scale.
How Slinky Actually Works
Slinky’s slurm-operator represents each Slurm component (scheduling, accounting, compute workers, API access) as Kubernetes Custom Resource Definitions. You define a Slurm cluster using Custom Resources, and Slinky spins up containerized Slurm daemons in their own pods.
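To make the Custom Resource idea concrete, the sketch below shows the general shape such a resource might take. The API group, kind, and all field names here are hypothetical illustrations, not Slinky’s actual CRD schema; consult the SlinkyProject repository for the real definitions.

```python
# Hypothetical sketch of a Slurm cluster expressed as a Kubernetes
# Custom Resource. Field names are illustrative, not Slinky's schema.
import json

slurm_cluster = {
    "apiVersion": "slinky.example.com/v1alpha1",  # hypothetical group/version
    "kind": "Cluster",
    "metadata": {"name": "demo-cluster", "namespace": "slurm"},
    "spec": {
        # One block per Slurm component: controller (scheduling),
        # accounting, REST API access, and compute workers.
        "controller": {"replicas": 1},
        "accounting": {"enabled": True},
        "restApi": {"replicas": 1},
        "workers": [
            {
                "name": "gpu-workers",
                "replicas": 4,
                "nodeSelector": {"nvidia.com/gpu.present": "true"},
            }
        ],
    },
}

# An operator watches resources like this and spins up containerized
# Slurm daemons (slurmctld, slurmdbd, slurmd) in their own pods.
print(json.dumps(slurm_cluster, indent=2))
```

Applying a manifest like this (via `kubectl apply`) is what replaces hand-managing Slurm daemons on bare metal.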
The high-availability story matters for production deployments. Slinky handles control plane HA through pod regeneration rather than Slurm’s native mechanism. Configuration changes propagate automatically with zero scheduler downtime. Workers can autoscale based on cluster metrics, and on scale-in, Slinky fully drains nodes before terminating pods, letting running workloads complete first.
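The drain-before-terminate pattern described above can be sketched in a few lines. This is not Slinky’s code; it illustrates the general logic under stated assumptions: mark the node as draining so Slurm stops scheduling onto it, wait for its running jobs to finish, then remove the worker pod.

```python
# Illustrative sketch of drain-before-terminate scale-in; the node dict
# and running_jobs callable stand in for slurmctld state and queries.
import time

def scale_in(node, running_jobs, poll_interval=0.01):
    """Drain `node`, wait for its jobs to complete, then terminate it.

    `running_jobs` is a callable returning the number of jobs still
    active on the node (a stand-in for querying the Slurm controller).
    """
    node["state"] = "DRAIN"          # no new work lands on the node
    while running_jobs(node) > 0:    # running workloads finish first
        time.sleep(poll_interval)
    node["state"] = "TERMINATED"     # now safe to delete the worker pod
    return node

# Toy usage: a node whose active-job count drops on each poll.
remaining = [2, 1, 0]
node = {"name": "worker-0", "state": "IDLE"}
scale_in(node, running_jobs=lambda n: remaining.pop(0) if remaining else 0)
print(node["state"])  # -> TERMINATED
```

The same ordering (drain, wait, delete) is what distinguishes a safe scale-in from simply killing pods under active training jobs.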
For NVIDIA’s GB200 NVL72 architecture, where GPUs communicate across nodes through multi-node NVLink, Slinky enables ComputeDomains that dynamically manage high-bandwidth GPU-to-GPU connectivity. Distributed training jobs receive full NVLink bandwidth across node boundaries.
Production Results at NVIDIA
NVIDIA reports that GPU communication benchmarks (NCCL all-reduce and all-gather) match non-containerized Slurm deployments, with no measurable impact from the Kubernetes layer. New clusters reportedly go from zero to running jobs in hours using Helm charts.
The operational wins compound at scale: Prometheus scrapes Slurm metrics alongside standard Kubernetes metrics. When health checks flag an unhealthy node, the state syncs automatically between the two systems. Rolling updates proceed while training jobs continue on remaining capacity.
One constraint worth noting: Slinky currently assumes one worker pod per node. If you’re running exclusively single-node Slurm jobs, this over-provisions relative to what you need.
What’s New in v1.1.0
The recently released slurm-operator v1.1.0 adds dynamic topology support: worker pods now register with topology based on their Kubernetes node, enabling topology-aware scheduling as pods move. DaemonSet-style scaling ties pods to their nodeSelector, simplifying operations for clusters where every GPU node should run a Slurm worker.
The roadmap includes graceful cluster upgrades, planned outage workflows, and configuration rollback. For AI infrastructure teams weighing build-versus-integrate decisions, Slinky represents a serious option that didn’t exist a year ago. The code is available on GitHub under the SlinkyProject organization.
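DaemonSet-style scaling boils down to a simple rule: instead of setting a replica count, one worker pod is placed on every node whose labels match the nodeSelector. The sketch below illustrates that matching rule only; the label key and data shapes are hypothetical, not Slinky’s internals.

```python
# Illustrative nodeSelector matching for DaemonSet-style scaling:
# every matching node gets exactly one Slurm worker pod.

def nodes_needing_worker(cluster_nodes, node_selector):
    """Return the nodes a DaemonSet-style worker set would cover."""
    return [
        n for n in cluster_nodes
        if all(n["labels"].get(k) == v for k, v in node_selector.items())
    ]

nodes = [
    {"name": "gpu-a", "labels": {"nvidia.com/gpu.present": "true"}},
    {"name": "gpu-b", "labels": {"nvidia.com/gpu.present": "true"}},
    {"name": "cpu-a", "labels": {}},   # no GPU label: no worker pod
]
selector = {"nvidia.com/gpu.present": "true"}
print([n["name"] for n in nodes_needing_worker(nodes, selector)])
# -> ['gpu-a', 'gpu-b']
```

As nodes join or leave the cluster, the covered set changes automatically, which is what makes this mode convenient for "every GPU node runs a worker" deployments.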
Image source: Shutterstock
