Felix Pinkston
Apr 09, 2026 17:23
NVIDIA’s Slinky project enables running Slurm clusters on Kubernetes, and is already deployed on 8,000+ GPU systems for large-scale AI training infrastructure.
NVIDIA has launched Slinky, an open-source project that bridges the gap between Slurm, the job scheduler running over 65% of TOP500 supercomputers, and Kubernetes, the dominant platform for managing GPU infrastructure at scale. The company already runs Slinky in production across clusters with more than 8,000 GPUs.
The technical problem here is real: organizations have years invested in Slurm job scripts, fair-share policies, and accounting workflows. But Kubernetes has become the standard for managing GPU infrastructure. Running two separate environments creates operational headaches that compound at scale.
How Slinky Actually Works
Slinky’s slurm-operator represents each Slurm component (scheduling, accounting, compute workers, API access) as Kubernetes Custom Resource Definitions. You define a Slurm cluster using Custom Resources, and Slinky spins up containerized Slurm daemons in their own pods.
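To make the Custom Resource idea concrete, the sketch below shows the general shape such a resource might take. The API group, kind, and all field names here are hypothetical illustrations, not Slinky’s actual CRD schema; consult the SlinkyProject repository for the real definitions.

```python
# Hypothetical sketch of a Slurm cluster expressed as a Kubernetes
# Custom Resource. Field names are illustrative, not Slinky's schema.
import json

slurm_cluster = {
    "apiVersion": "slinky.example.com/v1alpha1",  # hypothetical group/version
    "kind": "Cluster",
    "metadata": {"name": "demo-cluster", "namespace": "slurm"},
    "spec": {
        # One block per Slurm component: controller (scheduling),
        # accounting, REST API access, and compute workers.
        "controller": {"replicas": 1},
        "accounting": {"enabled": True},
        "restApi": {"replicas": 1},
        "workers": [
            {
                "name": "gpu-workers",
                "replicas": 4,
                "nodeSelector": {"nvidia.com/gpu.present": "true"},
            }
        ],
    },
}

# An operator watches resources like this and spins up containerized
# Slurm daemons (slurmctld, slurmdbd, slurmd) in their own pods.
print(json.dumps(slurm_cluster, indent=2))
```

Applying a manifest like this (via `kubectl apply`) is what replaces hand-managing Slurm daemons on bare metal.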
The high-availability story matters for production deployments. Slinky handles control plane HA through pod regeneration rather than Slurm’s native mechanism. Configuration changes propagate automatically with zero scheduler downtime. Workers can autoscale based on cluster metrics, and on scale-in, Slinky fully drains nodes before terminating pods, letting running workloads complete first.
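The drain-before-terminate pattern described above can be sketched in a few lines. This is not Slinky’s code; it illustrates the general logic under stated assumptions: mark the node as draining so Slurm stops scheduling onto it, wait for its running jobs to finish, then remove the worker pod.

```python
# Illustrative sketch of drain-before-terminate scale-in; the node dict
# and running_jobs callable stand in for slurmctld state and queries.
import time

def scale_in(node, running_jobs, poll_interval=0.01):
    """Drain `node`, wait for its jobs to complete, then terminate it.

    `running_jobs` is a callable returning the number of jobs still
    active on the node (a stand-in for querying the Slurm controller).
    """
    node["state"] = "DRAIN"          # no new work lands on the node
    while running_jobs(node) > 0:    # running workloads finish first
        time.sleep(poll_interval)
    node["state"] = "TERMINATED"     # now safe to delete the worker pod
    return node

# Toy usage: a node whose active-job count drops on each poll.
remaining = [2, 1, 0]
node = {"name": "worker-0", "state": "IDLE"}
scale_in(node, running_jobs=lambda n: remaining.pop(0) if remaining else 0)
print(node["state"])  # -> TERMINATED
```

The same ordering (drain, wait, delete) is what distinguishes a safe scale-in from simply killing pods under active training jobs.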
For NVIDIA’s GB200 NVL72 architecture, where GPUs communicate across nodes through multi-node NVLink, Slinky enables ComputeDomains that dynamically manage high-bandwidth GPU-to-GPU connectivity. Distributed training jobs receive full NVLink bandwidth across node boundaries.
Production Results at NVIDIA
NVIDIA reports that GPU communication benchmarks (NCCL all-reduce and all-gather) match non-containerized Slurm deployments, with no measurable impact from the Kubernetes layer. New clusters reportedly go from zero to running jobs in hours using Helm charts.
The operational wins compound at scale: Prometheus scrapes Slurm metrics alongside standard Kubernetes metrics. When health checks flag an unhealthy node, the state syncs automatically between the two systems. Rolling updates proceed while training jobs continue on remaining capacity.
One constraint worth noting: Slinky currently assumes one worker pod per node. If you’re running exclusively single-node Slurm jobs, this over-provisions relative to what you need.
What’s New in v1.1.0
The recently released slurm-operator v1.1.0 adds dynamic topology support: worker pods now register with topology based on their Kubernetes node, enabling topology-aware scheduling as pods move. DaemonSet-style scaling ties pods to their nodeSelector, simplifying operations for clusters where every GPU node should run a Slurm worker.
The roadmap includes graceful cluster upgrades, planned outage workflows, and configuration rollback. For AI infrastructure teams weighing build-versus-integrate decisions, Slinky represents a serious option that didn’t exist a year ago. The code is available on GitHub under the SlinkyProject organization.
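DaemonSet-style scaling boils down to a simple rule: instead of setting a replica count, one worker pod is placed on every node whose labels match the nodeSelector. The sketch below illustrates that matching rule only; the label key and data shapes are hypothetical, not Slinky’s internals.

```python
# Illustrative nodeSelector matching for DaemonSet-style scaling:
# every matching node gets exactly one Slurm worker pod.

def nodes_needing_worker(cluster_nodes, node_selector):
    """Return the nodes a DaemonSet-style worker set would cover."""
    return [
        n for n in cluster_nodes
        if all(n["labels"].get(k) == v for k, v in node_selector.items())
    ]

nodes = [
    {"name": "gpu-a", "labels": {"nvidia.com/gpu.present": "true"}},
    {"name": "gpu-b", "labels": {"nvidia.com/gpu.present": "true"}},
    {"name": "cpu-a", "labels": {}},   # no GPU label: no worker pod
]
selector = {"nvidia.com/gpu.present": "true"}
print([n["name"] for n in nodes_needing_worker(nodes, selector)])
# -> ['gpu-a', 'gpu-b']
```

As nodes join or leave the cluster, the covered set changes automatically, which is what makes this mode convenient for "every GPU node runs a worker" deployments.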
Image source: Shutterstock
