Kubernetes for ML Workloads: A Practical Guide to GPU Scheduling, Ray, and KubeFlow

After spending the last three years building and operating ML platforms on Kubernetes across two companies, I have strong opinions about what works and what is a waste of engineering time. Kubernetes is not a silver bullet for ML infrastructure, but when configured correctly, it is the closest thing we have to a universal substrate for the full ML lifecycle—from exploratory notebooks to distributed training to production inference. This guide covers everything I wish I had known before my first GPU node pool went sideways at 2 AM.

Why Kubernetes for ML Workloads

The pitch is simple: ML teams need heterogeneous compute (CPUs for preprocessing, GPUs for training, TPUs if you are on GCP), resource isolation between teams, and the ability to scale from zero to hundreds of nodes for a training run and back down again. Kubernetes gives you all of that with a single control plane.

But the real reason Kubernetes wins for ML is not the orchestration—it is the ecosystem. Every major ML framework, from PyTorch to TensorFlow to JAX, ships with first-class Kubernetes operators. Tools like KubeFlow, Ray, MLflow, and Seldon all assume Kubernetes as the deployment target. If you try to build an ML platform on bare EC2 instances or docker-compose, you end up reimplementing half of what Kubernetes already provides: service discovery, health checks, resource quotas, rolling deployments, and log aggregation.

The three things that make Kubernetes particularly well-suited for ML:

  • Resource isolation via namespaces and quotas. Your training jobs do not compete with your serving endpoints for GPU memory. You can give the research team a namespace with 8 A100s and the production team a separate namespace with guaranteed resources.
  • GPU sharing and scheduling. The NVIDIA device plugin, combined with time-slicing or MIG, lets you pack multiple workloads onto a single GPU—critical when you are paying $3/hour per A100.
  • Autoscaling. Cluster autoscaler provisions GPU nodes only when jobs are queued, and scales back to zero when idle. Combined with spot/preemptible instances, this can cut your GPU bill by 60-70%.

GPU Scheduling on Kubernetes: Getting It Right

GPU scheduling is where most teams hit their first wall. The NVIDIA device plugin exposes GPUs as an extended resource (nvidia.com/gpu), and by default each GPU can only be assigned to a single pod. This works fine for large training runs but is wasteful for inference workloads or development notebooks that use a fraction of the GPU memory.

Installing the NVIDIA Device Plugin

First, your nodes need the NVIDIA driver and container toolkit installed. On managed Kubernetes (EKS, GKE, AKS), the GPU node pools come preconfigured. On bare metal, you will need to install nvidia-container-toolkit and verify with nvidia-smi inside a test pod.
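
A quick way to verify is a one-shot pod that runs nvidia-smi and exits. A minimal sketch (the CUDA image tag is an example — any CUDA base image that matches your driver version works):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  containers:
    - name: cuda
      # Example tag; pick a CUDA version compatible with your node driver
      image: nvidia/cuda:12.1.0-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: "1"

If the pod completes and its logs show the driver version and GPU table, the node is healthy and the device plugin is advertising GPUs correctly.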

The device plugin itself is a DaemonSet:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: nvidia-device-plugin-ctr
          image: nvcr.io/nvidia/k8s-device-plugin:v0.15.0
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins

A Basic GPU Pod

Here is a minimal pod spec that requests a single GPU. This is the building block for everything else:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-training-pod
  labels:
    app: model-training
    team: ml-platform
spec:
  restartPolicy: Never
  tolerations:
    - key: "nvidia.com/gpu"
      operator: "Exists"
      effect: "NoSchedule"
  containers:
    - name: trainer
      image: pytorch/pytorch:2.2.0-cuda12.1-cudnn8-runtime
      command: ["python", "train.py", "--epochs", "50", "--batch-size", "64"]
      resources:
        requests:
          cpu: "4"
          memory: "16Gi"
          nvidia.com/gpu: "1"
        limits:
          cpu: "8"
          memory: "32Gi"
          nvidia.com/gpu: "1"
      volumeMounts:
        - name: training-data
          mountPath: /data
        - name: model-output
          mountPath: /output
      env:
        - name: NCCL_DEBUG
          value: "INFO"
        # Do not set CUDA_VISIBLE_DEVICES here: the device plugin already
        # exposes only the allocated GPU to the container, and "all" is not
        # a valid value for that variable anyway.
  nodeSelector:
    accelerator: nvidia-a100
  volumes:
    - name: training-data
      persistentVolumeClaim:
        claimName: training-dataset-pvc
    - name: model-output
      persistentVolumeClaim:
        claimName: model-artifacts-pvc

A few things I have learned the hard way: always set restartPolicy: Never for training jobs (you do not want a failed training run restarting in a loop and burning GPU hours). Always set both requests and limits for GPU: Kubernetes treats extended resources like nvidia.com/gpu as non-overcommittable, so requests and limits must be equal. And always use a nodeSelector or node affinity to target the right GPU type. Scheduling an A10G workload onto a T4 node because you forgot the selector is a debugging session you only want to have once.
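
When a job can run on more than one GPU type, node affinity expresses that where a single nodeSelector cannot. A sketch reusing the accelerator label from the pod above (the second label value is an assumption — use whatever labels your node pools carry):

spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: accelerator
                operator: In
                values: ["nvidia-a100", "nvidia-a10g"]  # any acceptable GPU type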

GPU Sharing: Time-Slicing and MIG

For inference and development workloads, dedicating a full GPU per pod is wasteful. Two approaches exist:

Time-slicing lets multiple pods share a GPU by context-switching between them. It is configured via a ConfigMap:

apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-config
  namespace: kube-system
data:
  config.yaml: |
    version: v1
    sharing:
      timeSlicing:
        renameByDefault: false
        resources:
          - name: nvidia.com/gpu
            replicas: 4

This makes each physical GPU appear as 4 logical GPUs. The downside: there is no memory isolation. One pod can OOM-kill another pod's GPU process. I only use this for development environments where the risk is acceptable.
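
The ConfigMap does nothing on its own — the device plugin has to be pointed at it. One way is to mount it into the plugin DaemonSet and pass --config-file (a sketch; check the plugin's documentation for your version, since the Helm chart can also wire this up via a config.name value):

containers:
  - name: nvidia-device-plugin-ctr
    image: nvcr.io/nvidia/k8s-device-plugin:v0.15.0
    args: ["--config-file=/etc/nvidia/config.yaml"]
    volumeMounts:
      - name: plugin-config
        mountPath: /etc/nvidia
volumes:
  - name: plugin-config
    configMap:
      name: nvidia-device-plugin-config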

Multi-Instance GPU (MIG) is available on A100 and H100 GPUs and provides true hardware-level partitioning. Each partition gets dedicated compute cores and memory. It is more complex to set up but gives you the isolation you need for production mixed workloads. An A100 80GB can be split into up to 7 MIG instances, each with its own memory and compute slice.
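
With the plugin's MIG strategy set to mixed, each partition profile is advertised as its own extended resource. A sketch of a pod requesting a single 1g.10gb slice of an A100 80GB (the exact resource name depends on how your nodes' MIG profiles are configured):

apiVersion: v1
kind: Pod
metadata:
  name: mig-inference-pod
spec:
  containers:
    - name: inference
      image: pytorch/pytorch:2.2.0-cuda12.1-cudnn8-runtime
      resources:
        limits:
          nvidia.com/mig-1g.10gb: "1"  # one hardware-isolated MIG slice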

Distributed Training on Kubernetes

Single-GPU training hits a ceiling fast. Once your model or dataset outgrows one GPU, you need distributed training. On Kubernetes, the standard approach is the Kubeflow Training Operator (which consolidated the earlier per-framework operators such as tf-operator), providing CRDs for PyTorchJob, TFJob, MPIJob, and others.

PyTorchJob for Multi-Node Training

Here is a real PyTorchJob manifest I use for distributed training with PyTorch DDP (Distributed Data Parallel):

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: llm-finetune-job
  namespace: ml-training
spec:
  elasticPolicy:
    rdzvBackend: etcd
    rdzvHost: etcd-service
    minReplicas: 2
    maxReplicas: 8
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          tolerations:
            - key: "nvidia.com/gpu"
              operator: "Exists"
              effect: "NoSchedule"
          containers:
            - name: pytorch
              image: registry.internal/ml-training:v2.4.1
              command:
                - python
                - -m
                - torch.distributed.run
                - --nproc_per_node=4
                - --nnodes=4
                - train.py
                - --model=llama-7b
                - --data=/data/training-set
                - --output=/output/checkpoints
                - --lr=2e-5
                - --epochs=3
              resources:
                requests:
                  cpu: "8"
                  memory: "64Gi"
                  nvidia.com/gpu: "4"
                limits:
                  cpu: "16"
                  memory: "128Gi"
                  nvidia.com/gpu: "4"
              volumeMounts:
                - name: shared-data
                  mountPath: /data
                - name: output
                  mountPath: /output
                - name: shm
                  mountPath: /dev/shm
          volumes:
            - name: shared-data
              persistentVolumeClaim:
                claimName: training-data-nfs
            - name: output
              persistentVolumeClaim:
                claimName: model-output-pvc
            - name: shm
              emptyDir:
                medium: Memory
                sizeLimit: "16Gi"
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        spec:
          tolerations:
            - key: "nvidia.com/gpu"
              operator: "Exists"
              effect: "NoSchedule"
          containers:
            - name: pytorch
              image: registry.internal/ml-training:v2.4.1
              command:
                - python
                - -m
                - torch.distributed.run
                - --nproc_per_node=4
                - train.py
                - --model=llama-7b
                - --data=/data/training-set
                - --output=/output/checkpoints
              resources:
                requests:
                  cpu: "8"
                  memory: "64Gi"
                  nvidia.com/gpu: "4"
                limits:
                  cpu: "16"
                  memory: "128Gi"
                  nvidia.com/gpu: "4"
              volumeMounts:
                - name: shared-data
                  mountPath: /data
                - name: output
                  mountPath: /output
                - name: shm
                  mountPath: /dev/shm
          volumes:
            - name: shared-data
              persistentVolumeClaim:
                claimName: training-data-nfs
            - name: output
              persistentVolumeClaim:
                claimName: model-output-pvc
            - name: shm
              emptyDir:
                medium: Memory
                sizeLimit: "16Gi"

Critical detail: the /dev/shm mount. PyTorch uses shared memory for data loading, and the default 64MB Docker shm size will cause your training to crash silently or slow to a crawl. I always mount an emptyDir with medium: Memory and set it to at least 8-16GB.

The elastic training policy is a game-changer. If a spot instance gets preempted and a worker dies, the job continues with the remaining workers and scales back up when a new node joins. This is essential for cost-effective training on spot instances.

Ray on Kubernetes with KubeRay

Ray has become my go-to framework for anything beyond simple DDP training. Ray Train, Ray Tune, and Ray Serve cover the full lifecycle, and KubeRay makes running Ray on Kubernetes straightforward. The key abstraction is the RayCluster CRD.

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: ml-platform-ray
  namespace: ml-platform
spec:
  rayVersion: "2.9.3"
  headGroupSpec:
    rayStartParams:
      dashboard-host: "0.0.0.0"
      num-cpus: "0"  # head node should not run tasks
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray-ml:2.9.3-py310-gpu
            ports:
              - containerPort: 6379  # GCS
              - containerPort: 8265  # dashboard
              - containerPort: 10001 # client
            resources:
              requests:
                cpu: "4"
                memory: "16Gi"
              limits:
                cpu: "4"
                memory: "16Gi"
            volumeMounts:
              - name: ray-logs
                mountPath: /tmp/ray
        volumes:
          - name: ray-logs
            emptyDir: {}
  workerGroupSpecs:
    - groupName: gpu-workers
      replicas: 4
      minReplicas: 0
      maxReplicas: 16
      rayStartParams:
        num-gpus: "1"
      template:
        spec:
          tolerations:
            - key: "nvidia.com/gpu"
              operator: "Exists"
              effect: "NoSchedule"
          containers:
            - name: ray-worker
              image: rayproject/ray-ml:2.9.3-py310-gpu
              resources:
                requests:
                  cpu: "4"
                  memory: "32Gi"
                  nvidia.com/gpu: "1"
                limits:
                  cpu: "8"
                  memory: "64Gi"
                  nvidia.com/gpu: "1"
              volumeMounts:
                - name: shared-storage
                  mountPath: /mnt/data
          volumes:
            - name: shared-storage
              persistentVolumeClaim:
                claimName: ray-shared-nfs
    - groupName: cpu-workers
      replicas: 2
      minReplicas: 0
      maxReplicas: 32
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-worker
              # CPU-only image; no need for the GPU variant on CPU workers
              image: rayproject/ray-ml:2.9.3-py310
              resources:
                requests:
                  cpu: "8"
                  memory: "32Gi"
                limits:
                  cpu: "16"
                  memory: "64Gi"

What I like about KubeRay is the autoscaling. Set minReplicas: 0 and the GPU workers only spin up when a Ray job requests GPU resources. Combined with Kubernetes cluster autoscaler, this means you are truly paying only for what you use. I have seen teams go from a steady $40K/month GPU bill to $12K/month by switching from always-on GPU instances to Ray autoscaling on spot nodes.

Ray also shines for hyperparameter tuning. Ray Tune can run hundreds of trials in parallel across your cluster, with built-in support for early stopping and population-based training. Doing this with vanilla Kubernetes Jobs is possible but requires writing a lot of orchestration code that Ray gives you for free.
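
KubeRay's RayJob CRD is a convenient way to run a Tune sweep: it submits an entrypoint script and can target an existing cluster. A sketch (the script name and selector label follow KubeRay's RayJob conventions, but verify the field names against your installed KubeRay version):

apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: hp-tune-sweep
  namespace: ml-platform
spec:
  entrypoint: python tune_sweep.py  # hypothetical Ray Tune driver script
  clusterSelector:
    ray.io/cluster: ml-platform-ray  # reuse the RayCluster defined above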

KubeFlow vs Building Your Own ML Platform

This is the question every ML platform team faces. KubeFlow promises a complete ML platform: pipelines, notebooks, training, serving, feature store, experiment tracking. The reality is more nuanced.

KubeFlow works well when:

  • You want a complete, opinionated ML platform without building one
  • Your team is comfortable with Kubernetes operations
  • You need KubeFlow Pipelines for reproducible ML workflows
  • You are already using Istio (KubeFlow depends on it heavily)

KubeFlow is painful when:

  • You only need one or two components (the installation is monolithic)
  • You do not want Istio as a service mesh (it adds significant operational overhead)
  • You need to upgrade individual components independently
  • Your team is small and cannot dedicate someone to KubeFlow maintenance

My recommendation: use the individual KubeFlow components that solve real problems for your team. The Training Operator (PyTorchJob, etc.) is excellent and can be installed standalone. KubeFlow Pipelines is solid for workflow orchestration if you prefer it over Argo Workflows or Airflow. But installing the full KubeFlow distribution "because we might need it" has been the source of more regret than value in teams I have worked with.

For most mid-size teams (5-15 ML engineers), I recommend this stack:

  • Training: Kubeflow Training Operator (PyTorchJob)
  • Distributed compute: KubeRay
  • Pipelines: Argo Workflows or Airflow (not KubeFlow Pipelines, unless you are already committed)
  • Experiment tracking: MLflow or Weights & Biases
  • Serving: KServe or Seldon Core
  • Notebooks: JupyterHub on Kubernetes

Serving Models on Kubernetes: KServe and Seldon

Training a model is half the battle. Serving it reliably with low latency, autoscaling, canary deployments, and A/B testing is where Kubernetes really earns its keep.

KServe (formerly KFServing) is the standard for model serving on Kubernetes. It provides a CRD that wraps your model in an inference service with built-in support for scaling to zero, GPU sharing, and multi-model serving. It supports every major framework out of the box: TensorFlow, PyTorch, XGBoost, ONNX, and custom containers.
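
A minimal InferenceService looks like this (a sketch — the service name and storage URI are placeholders, and scale-to-zero assumes the Knative-backed serverless deployment mode):

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sentiment-classifier
  namespace: ml-serving
spec:
  predictor:
    minReplicas: 0   # scale to zero when idle
    model:
      modelFormat:
        name: pytorch
      storageUri: s3://models/sentiment-classifier/v3  # placeholder path
      resources:
        limits:
          nvidia.com/gpu: "1"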

Seldon Core is the alternative, stronger in multi-step inference pipelines where you need preprocessing, model ensemble, and postprocessing as separate containers. If your inference graph is a single model, KServe is simpler. If you need a pipeline of transformers and models, Seldon is worth evaluating.

Both tools solve the same core problem: you should not be writing Flask/FastAPI wrappers for every model you deploy. The boilerplate around health checks, readiness probes, metrics, logging, batching, and GPU management is the same for every model. Let the serving framework handle it.

Cost Management: The Part Nobody Talks About

GPU compute on Kubernetes gets expensive fast. A single 8-GPU A100 instance (a p4d.24xlarge on AWS) costs roughly $25,000/month on-demand. Here is how I keep costs under control:

Spot Instances for Training

Training workloads are inherently fault-tolerant (you checkpoint periodically and resume from the last checkpoint). This makes them perfect for spot instances, which are 60-70% cheaper than on-demand. Configure your node pools with spot instances and set up your training code to checkpoint every N steps. The elastic training policy in PyTorchJob or Ray handles the rest.
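
Getting training pods onto spot capacity is mostly node selection. A sketch for EKS with Karpenter (the label key varies by environment: GKE uses cloud.google.com/gke-spot, plain EKS managed node groups use eks.amazonaws.com/capacityType):

spec:
  nodeSelector:
    karpenter.sh/capacity-type: spot
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule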

Cluster Autoscaler and Karpenter

The Kubernetes Cluster Autoscaler watches for unschedulable pods and provisions new nodes. Karpenter (AWS-specific but excellent) goes further: it selects the optimal instance type based on your pod's resource requirements. If your pod needs 1 GPU and 32GB RAM, Karpenter will choose a g5.2xlarge instead of a p4d.24xlarge.

Critical setting: configure scale-down delay. The default is 10 minutes, which means your expensive GPU node hangs around for 10 minutes after the last pod finishes. For development clusters, I set it to 2 minutes. For production, 5 minutes is a reasonable balance.
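
With the Cluster Autoscaler the knob is a container flag; with Karpenter it lives in the NodePool. Sketches of both, using the 2-minute value from above (fragments only — API versions and field names shift between releases, so check your installed version):

# Cluster Autoscaler: add to the deployment's container args
- --scale-down-unneeded-time=2m

# Karpenter NodePool fragment (v1beta1 API)
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: gpu-pool
spec:
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 2m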

Resource Quotas Per Team

Without quotas, one team will inevitably consume all the GPUs. Set resource quotas per namespace:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: ml-team-gpu-quota
  namespace: ml-research
spec:
  hard:
    requests.nvidia.com/gpu: "8"
    limits.nvidia.com/gpu: "8"
    requests.cpu: "64"
    requests.memory: "256Gi"
    pods: "20"

Priority Classes for Preemption

Use PriorityClasses to ensure production serving workloads are never evicted in favor of training jobs. I typically set up three tiers: production-serving (priority 1000), training-critical (priority 500), and development (priority 100). If the cluster runs out of capacity, development pods get evicted first.
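
Each tier is a few lines of YAML. The serving tier, for example:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: production-serving
value: 1000
globalDefault: false
description: "Inference workloads; never evicted in favor of training."

Pods opt in with priorityClassName: production-serving in their spec; the scheduler evicts lower-priority pods first when capacity runs out.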

Managed ML Platforms vs DIY Kubernetes

The final question: should you build this yourself on Kubernetes, or use a managed platform like AWS SageMaker, Google Vertex AI, or Azure ML?

Factor                        | Managed (SageMaker/Vertex)   | DIY Kubernetes
------------------------------|------------------------------|---------------------------------
Time to first training job    | Hours                        | Days to weeks
Operational overhead          | Low                          | High (dedicated platform team)
Cost at scale (100+ GPUs)     | Higher (managed markup)      | Lower (spot, autoscaling tuning)
Flexibility and customization | Limited to platform features | Unlimited
Multi-cloud / hybrid          | Vendor lock-in               | Portable with effort
GPU type selection            | Limited options              | Full control
Custom serving pipelines      | Constrained                  | Fully flexible
Team size needed              | 2-3 ML engineers             | 5+ including platform engineers

My rule of thumb: if your ML team is under 10 people and runs on a single cloud, start with a managed platform. The time you save on infrastructure is better spent on model development. If you have a dedicated platform team, run multi-cloud or on-premise, or need fine-grained control over scheduling and cost, Kubernetes is the right choice.

Many teams land in the middle: they use managed Kubernetes (EKS, GKE) with tools like KubeRay and the Training Operator, combining the reliability of a managed control plane with the flexibility of custom ML infrastructure. This is the sweet spot I recommend for most organizations doing serious ML work.

Lessons from Three Years of ML on Kubernetes

Here are the condensed lessons I would give to anyone starting an ML platform on Kubernetes:

  1. Start with the Training Operator and KubeRay. These two cover 80% of ML compute needs without the complexity of a full KubeFlow installation.
  2. Always mount /dev/shm as a memory-backed emptyDir. This one setting prevents the single most common failure mode in distributed PyTorch training on Kubernetes.
  3. Use spot instances for training from day one. Retrofitting fault tolerance into existing training code is painful. Build it in from the start with periodic checkpointing.
  4. Set GPU resource quotas per namespace. Without them, your GPU cluster becomes a tragedy of the commons within weeks.
  5. Do not install KubeFlow unless you need three or more of its components. Each standalone component (Training Operator, KServe) is easier to operate individually.
  6. Monitor GPU utilization, not just GPU allocation. A pod can hold a GPU at 5% utilization for hours. Use DCGM exporter with Prometheus to track actual usage and identify waste.
  7. Use priority classes aggressively. Production inference must never be evicted by a training experiment. Set this up before you need it, not after your first incident.
  8. Invest in a shared filesystem early. NFS or a cloud-native equivalent (EFS, Filestore) for training data avoids the copy-data-to-every-node problem that kills training startup time.
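
Lesson 8 in practice is usually a ReadWriteMany PVC backed by NFS or a managed filesystem. A sketch (the storage class name is an assumption — use whatever your EFS/Filestore CSI driver registers):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-data-shared
  namespace: ml-training
spec:
  accessModes: ["ReadWriteMany"]  # mountable by every training pod at once
  storageClassName: efs-sc        # assumption: an EFS/Filestore-backed class
  resources:
    requests:
      storage: 500Gi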

Kubernetes for ML is not about making things easy—it is about making things possible at scale. The complexity is real, but the alternative is a fragile patchwork of scripts, SSH tunnels, and shared GPU boxes. Pick your complexity carefully, and invest in the parts that compound: autoscaling, resource isolation, and reproducible training infrastructure.

The ML infrastructure landscape moves fast. New tools like vLLM for LLM serving, Skypilot for multi-cloud GPU orchestration, and the evolving Kubernetes Device Plugin API will continue to simplify GPU workloads on Kubernetes. But the fundamentals covered here—resource management, distributed training, autoscaling, and cost control—remain stable. Master them, and you will be well-equipped to adopt whatever comes next.
