After spending the last three years building and operating ML platforms on Kubernetes across two companies, I have strong opinions about what works and what is a waste of engineering time. Kubernetes is not a silver bullet for ML infrastructure, but when configured correctly, it is the closest thing we have to a universal substrate for the full ML lifecycle—from exploratory notebooks to distributed training to production inference. This guide covers everything I wish I had known before my first GPU node pool went sideways at 2 AM.
Why Kubernetes for ML Workloads
The pitch is simple: ML teams need heterogeneous compute (CPUs for preprocessing, GPUs for training, TPUs if you are on GCP), resource isolation between teams, and the ability to scale from zero to hundreds of nodes for a training run and back down again. Kubernetes gives you all of that with a single control plane.
But the real reason Kubernetes wins for ML is not the orchestration—it is the ecosystem. Every major ML framework—PyTorch, TensorFlow, JAX—has first-class Kubernetes operators available, and tools like KubeFlow, Ray, MLflow, and Seldon all assume Kubernetes as the deployment target. If you try to build an ML platform on bare EC2 instances or docker-compose, you end up reimplementing half of what Kubernetes already provides: service discovery, health checks, resource quotas, rolling deployments, and log aggregation.
The three things that make Kubernetes particularly well-suited for ML:
- Resource isolation via namespaces and quotas. Your training jobs do not compete with your serving endpoints for GPU memory. You can give the research team a namespace with 8 A100s and the production team a separate namespace with guaranteed resources.
- GPU sharing and scheduling. The NVIDIA device plugin, combined with time-slicing or MIG, lets you pack multiple workloads onto a single GPU—critical when you are paying $3/hour per A100.
- Autoscaling. Cluster autoscaler provisions GPU nodes only when jobs are queued, and scales back to zero when idle. Combined with spot/preemptible instances, this can cut your GPU bill by 60-70%.
GPU Scheduling on Kubernetes: Getting It Right
GPU scheduling is where most teams hit their first wall. The NVIDIA device plugin exposes GPUs as an extended resource (nvidia.com/gpu), and by default each GPU can only be assigned to a single pod. This works fine for large training runs but is wasteful for inference workloads or development notebooks that use a fraction of the GPU memory.
Installing the NVIDIA Device Plugin
First, your nodes need the NVIDIA driver and container toolkit installed. On managed Kubernetes (EKS, GKE, AKS), the GPU node pools come preconfigured. On bare metal, you will need to install nvidia-container-toolkit and verify with nvidia-smi inside a test pod.
The device plugin itself is a DaemonSet:
```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      containers:
      - name: nvidia-device-plugin-ctr
        image: nvcr.io/nvidia/k8s-device-plugin:v0.15.0
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
```
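Once the plugin is registered, a quick way to verify the whole chain (driver, toolkit, plugin) is a throwaway pod that runs nvidia-smi and exits. A minimal sketch, with an illustrative CUDA base image tag:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: nvidia-smi
    image: nvidia/cuda:12.1.0-base-ubuntu22.04  # illustrative tag
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: "1"
```

If `kubectl logs gpu-smoke-test` prints the familiar driver table, the node is ready for real workloads.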
A Basic GPU Pod
Here is a minimal pod spec that requests a single GPU. This is the building block for everything else:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-training-pod
  labels:
    app: model-training
    team: ml-platform
spec:
  restartPolicy: Never
  tolerations:
  - key: "nvidia.com/gpu"
    operator: "Exists"
    effect: "NoSchedule"
  containers:
  - name: trainer
    image: pytorch/pytorch:2.2.0-cuda12.1-cudnn8-runtime
    command: ["python", "train.py", "--epochs", "50", "--batch-size", "64"]
    resources:
      requests:
        cpu: "4"
        memory: "16Gi"
        nvidia.com/gpu: "1"
      limits:
        cpu: "8"
        memory: "32Gi"
        nvidia.com/gpu: "1"
    volumeMounts:
    - name: training-data
      mountPath: /data
    - name: model-output
      mountPath: /output
    env:
    - name: NCCL_DEBUG
      value: "INFO"
  nodeSelector:
    accelerator: nvidia-a100
  volumes:
  - name: training-data
    persistentVolumeClaim:
      claimName: training-dataset-pvc
  - name: model-output
    persistentVolumeClaim:
      claimName: model-artifacts-pvc
```
A few things I have learned the hard way: always set restartPolicy: Never for training jobs (you do not want a failed training run restarting in a loop and burning GPU hours). Always set both requests and limits for GPU—unlike CPU, GPU resources are not compressible, so requests and limits should match. And always use a nodeSelector or node affinity to target the right GPU type. Scheduling an A10G workload onto a T4 node because you forgot the selector is a debugging session you only want to have once.
GPU Sharing: Time-Slicing and MIG
For inference and development workloads, dedicating a full GPU per pod is wasteful. Two approaches exist:
Time-slicing lets multiple pods share a GPU by context-switching between them. It is configured via a ConfigMap:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-config
  namespace: kube-system
data:
  config.yaml: |
    version: v1
    sharing:
      timeSlicing:
        renameByDefault: false
        resources:
        - name: nvidia.com/gpu
          replicas: 4
```
This makes each physical GPU appear as 4 logical GPUs. The downside: there is no memory isolation. One pod can OOM-kill another pod's GPU process. I only use this for development environments where the risk is acceptable.
Multi-Instance GPU (MIG) is available on A100 and H100 GPUs and provides true hardware-level partitioning. Each partition gets dedicated compute cores and memory. It is more complex to set up but gives you the isolation you need for production mixed workloads. An A100 80GB can be split into up to 7 MIG instances, each with its own memory and compute slice.
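Once MIG partitions are carved out and the device plugin is running with its MIG strategy set to mixed, each profile shows up as its own extended resource. A sketch of a pod requesting one slice (the exact resource name depends on the profile you configured; 1g.10gb is one of the A100 80GB profiles):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mig-inference-pod
spec:
  containers:
  - name: inference
    image: pytorch/pytorch:2.2.0-cuda12.1-cudnn8-runtime
    resources:
      limits:
        nvidia.com/mig-1g.10gb: "1"  # one MIG slice, not a full GPU
```

The scheduler treats each slice like any other resource, so seven such pods can land on a single A100 with hard memory isolation between them.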
Distributed Training on Kubernetes
Single-GPU training hits a ceiling fast. Once your model or dataset outgrows one GPU, you need distributed training. On Kubernetes, the standard approach is the Kubeflow Training Operator, which provides CRDs for PyTorchJob, TFJob, MPIJob, and others.
PyTorchJob for Multi-Node Training
Here is a real PyTorchJob manifest I use for distributed training with PyTorch DDP (Distributed Data Parallel):
```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: llm-finetune-job
  namespace: ml-training
spec:
  elasticPolicy:
    rdzvBackend: etcd
    rdzvHost: etcd-service
    minReplicas: 2
    maxReplicas: 8
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          tolerations:
          - key: "nvidia.com/gpu"
            operator: "Exists"
            effect: "NoSchedule"
          containers:
          - name: pytorch
            image: registry.internal/ml-training:v2.4.1
            command:
            - python
            - -m
            - torch.distributed.run
            - --nproc_per_node=4
            - --nnodes=4
            - train.py
            - --model=llama-7b
            - --data=/data/training-set
            - --output=/output/checkpoints
            - --lr=2e-5
            - --epochs=3
            resources:
              requests:
                cpu: "8"
                memory: "64Gi"
                nvidia.com/gpu: "4"
              limits:
                cpu: "16"
                memory: "128Gi"
                nvidia.com/gpu: "4"
            volumeMounts:
            - name: shared-data
              mountPath: /data
            - name: output
              mountPath: /output
            - name: shm
              mountPath: /dev/shm
          volumes:
          - name: shared-data
            persistentVolumeClaim:
              claimName: training-data-nfs
          - name: output
            persistentVolumeClaim:
              claimName: model-output-pvc
          - name: shm
            emptyDir:
              medium: Memory
              sizeLimit: "16Gi"
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        spec:
          tolerations:
          - key: "nvidia.com/gpu"
            operator: "Exists"
            effect: "NoSchedule"
          containers:
          - name: pytorch
            image: registry.internal/ml-training:v2.4.1
            command:
            - python
            - -m
            - torch.distributed.run
            - --nproc_per_node=4
            - train.py
            - --model=llama-7b
            - --data=/data/training-set
            - --output=/output/checkpoints
            resources:
              requests:
                cpu: "8"
                memory: "64Gi"
                nvidia.com/gpu: "4"
              limits:
                cpu: "16"
                memory: "128Gi"
                nvidia.com/gpu: "4"
            volumeMounts:
            - name: shared-data
              mountPath: /data
            - name: output
              mountPath: /output
            - name: shm
              mountPath: /dev/shm
          volumes:
          - name: shared-data
            persistentVolumeClaim:
              claimName: training-data-nfs
          - name: output
            persistentVolumeClaim:
              claimName: model-output-pvc
          - name: shm
            emptyDir:
              medium: Memory
              sizeLimit: "16Gi"
```
Critical detail: the /dev/shm mount. PyTorch uses shared memory for data loading, and the default 64MB Docker shm size will cause your training to crash silently or slow to a crawl. I always mount an emptyDir with medium: Memory and set it to at least 8-16GB.
The elastic training policy is a game-changer. If a spot instance gets preempted and a worker dies, the job continues with the remaining workers and scales back up when a new node joins. This is essential for cost-effective training on spot instances.
Ray on Kubernetes with KubeRay
Ray has become my go-to framework for anything beyond simple DDP training. Ray Train, Ray Tune, and Ray Serve cover the full lifecycle, and KubeRay makes running Ray on Kubernetes straightforward. The key abstraction is the RayCluster CRD.
```yaml
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: ml-platform-ray
  namespace: ml-platform
spec:
  rayVersion: "2.9.3"
  headGroupSpec:
    rayStartParams:
      dashboard-host: "0.0.0.0"
      num-cpus: "0"  # head node should not run tasks
    template:
      spec:
        containers:
        - name: ray-head
          image: rayproject/ray-ml:2.9.3-py310-gpu
          ports:
          - containerPort: 6379   # GCS
          - containerPort: 8265   # dashboard
          - containerPort: 10001  # client
          resources:
            requests:
              cpu: "4"
              memory: "16Gi"
            limits:
              cpu: "4"
              memory: "16Gi"
          volumeMounts:
          - name: ray-logs
            mountPath: /tmp/ray
        volumes:
        - name: ray-logs
          emptyDir: {}
  workerGroupSpecs:
  - groupName: gpu-workers
    replicas: 4
    minReplicas: 0
    maxReplicas: 16
    rayStartParams:
      num-gpus: "1"
    template:
      spec:
        tolerations:
        - key: "nvidia.com/gpu"
          operator: "Exists"
          effect: "NoSchedule"
        containers:
        - name: ray-worker
          image: rayproject/ray-ml:2.9.3-py310-gpu
          resources:
            requests:
              cpu: "4"
              memory: "32Gi"
              nvidia.com/gpu: "1"
            limits:
              cpu: "8"
              memory: "64Gi"
              nvidia.com/gpu: "1"
          volumeMounts:
          - name: shared-storage
            mountPath: /mnt/data
        volumes:
        - name: shared-storage
          persistentVolumeClaim:
            claimName: ray-shared-nfs
  - groupName: cpu-workers
    replicas: 2
    minReplicas: 0
    maxReplicas: 32
    rayStartParams: {}
    template:
      spec:
        containers:
        - name: ray-worker
          image: rayproject/ray-ml:2.9.3-py310-gpu
          resources:
            requests:
              cpu: "8"
              memory: "32Gi"
            limits:
              cpu: "16"
              memory: "64Gi"
```
What I like about KubeRay is the autoscaling. Set minReplicas: 0 and the GPU workers only spin up when a Ray job requests GPU resources. Combined with Kubernetes cluster autoscaler, this means you are truly paying only for what you use. I have seen teams go from a steady $40K/month GPU bill to $12K/month by switching from always-on GPU instances to Ray autoscaling on spot nodes.
Ray also shines for hyperparameter tuning. Ray Tune can run hundreds of trials in parallel across your cluster, with built-in support for early stopping and population-based training. Doing this with vanilla Kubernetes Jobs is possible but requires writing a lot of orchestration code that Ray gives you for free.
KubeFlow vs Building Your Own ML Platform
This is the question every ML platform team faces. KubeFlow promises a complete ML platform: pipelines, notebooks, training, serving, feature store, experiment tracking. The reality is more nuanced.
KubeFlow works well when:
- You want a complete, opinionated ML platform without building one
- Your team is comfortable with Kubernetes operations
- You need KubeFlow Pipelines for reproducible ML workflows
- You are already using Istio (KubeFlow depends on it heavily)
KubeFlow is painful when:
- You only need one or two components (the installation is monolithic)
- You do not want Istio as a service mesh (it adds significant operational overhead)
- You need to upgrade individual components independently
- Your team is small and cannot dedicate someone to KubeFlow maintenance
My recommendation: use the individual KubeFlow components that solve real problems for your team. The Training Operator (PyTorchJob, etc.) is excellent and can be installed standalone. KubeFlow Pipelines is solid for workflow orchestration if you prefer it over Argo Workflows or Airflow. But installing the full KubeFlow distribution "because we might need it" has been the source of more regret than value in teams I have worked with.
For most mid-size teams (5-15 ML engineers), I recommend this stack:
- Training: Kubeflow Training Operator (PyTorchJob)
- Distributed compute: KubeRay
- Pipelines: Argo Workflows or Airflow (not KubeFlow Pipelines, unless you are already committed)
- Experiment tracking: MLflow or Weights & Biases
- Serving: KServe or Seldon Core
- Notebooks: JupyterHub on Kubernetes
Serving Models on Kubernetes: KServe and Seldon
Training a model is half the battle. Serving it reliably with low latency, autoscaling, canary deployments, and A/B testing is where Kubernetes really earns its keep.
KServe (formerly KFServing) is the standard for model serving on Kubernetes. It provides a CRD that wraps your model in an inference service with built-in support for scaling to zero, GPU sharing, and multi-model serving. It supports every major framework out of the box: TensorFlow, PyTorch, XGBoost, ONNX, and custom containers.
Seldon Core is the alternative, stronger in multi-step inference pipelines where you need preprocessing, model ensemble, and postprocessing as separate containers. If your inference graph is a single model, KServe is simpler. If you need a pipeline of transformers and models, Seldon is worth evaluating.
Both tools solve the same core problem: you should not be writing Flask/FastAPI wrappers for every model you deploy. The boilerplate around health checks, readiness probes, metrics, logging, batching, and GPU management is the same for every model. Let the serving framework handle it.
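As a sketch of how little that takes, here is a minimal KServe InferenceService for a scikit-learn model (the storageUri bucket and model name are hypothetical):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: churn-model
  namespace: ml-serving
spec:
  predictor:
    minReplicas: 0                        # scale to zero when idle
    sklearn:
      storageUri: s3://models/churn/v3    # hypothetical model location
      resources:
        requests:
          cpu: "1"
          memory: "2Gi"
```

KServe pulls the model artifact, wraps it in a standard inference server with health checks and metrics, and exposes a versioned HTTP endpoint—no hand-rolled Flask wrapper involved.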
Cost Management: The Part Nobody Talks About
GPU compute on Kubernetes gets expensive fast. An 8-node A100 cluster on AWS costs roughly $25,000/month on-demand. Here is how I keep costs under control:
Spot Instances for Training
Training workloads are inherently fault-tolerant (you checkpoint periodically and resume from the last checkpoint). This makes them perfect for spot instances, which are 60-70% cheaper than on-demand. Configure your node pools with spot instances and set up your training code to checkpoint every N steps. The elastic training policy in PyTorchJob or Ray handles the rest.
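The checkpoint-resume pattern itself is framework-agnostic. A minimal pure-Python sketch (in a real job the model and optimizer state would be serialized alongside the step counter, and the checkpoint path would sit on a PVC such as /output):

```python
import json
import os

CHECKPOINT = "/tmp/train_ckpt_demo.json"  # use a PVC path in a real job
CHECKPOINT_EVERY = 100                    # steps between checkpoints
TOTAL_STEPS = 1000

def load_checkpoint():
    """Resume from the last checkpoint if one exists, else start fresh."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)["step"]
    return 0

def save_checkpoint(step):
    """Write-then-rename so a preempted pod never leaves a torn file."""
    tmp = CHECKPOINT + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step}, f)
    os.replace(tmp, CHECKPOINT)  # atomic rename on POSIX filesystems

def train():
    start = load_checkpoint()
    for step in range(start, TOTAL_STEPS):
        # train_step(batch) would run here in a real training loop
        if (step + 1) % CHECKPOINT_EVERY == 0:
            save_checkpoint(step + 1)
    return TOTAL_STEPS - start  # steps actually executed this attempt

steps_run = train()
```

When a spot node is reclaimed mid-run, the restarted pod calls `load_checkpoint()` and loses at most `CHECKPOINT_EVERY` steps of work; the atomic rename matters because preemption can kill the process mid-write.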
Cluster Autoscaler and Karpenter
The Kubernetes Cluster Autoscaler watches for unschedulable pods and provisions new nodes. Karpenter (AWS-specific but excellent) goes further: it selects the optimal instance type based on your pod's resource requirements. If your pod needs 1 GPU and 32GB RAM, Karpenter will choose a g5.2xlarge instead of a p4d.24xlarge.
Critical setting: configure scale-down delay. The default is 10 minutes, which means your expensive GPU node hangs around for 10 minutes after the last pod finishes. For development clusters, I set it to 2 minutes. For production, 5 minutes is a reasonable balance.
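For the cluster autoscaler, that delay is a startup flag. A sketch of the relevant container args (flag names per the cluster-autoscaler documentation; verify defaults against your version):

```yaml
# cluster-autoscaler deployment args (fragment)
- --scale-down-unneeded-time=2m    # dev: release idle GPU nodes quickly
- --scale-down-delay-after-add=5m  # wait after a scale-up before scaling down
```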
Resource Quotas Per Team
Without quotas, one team will inevitably consume all the GPUs. Set resource quotas per namespace:
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: ml-team-gpu-quota
  namespace: ml-research
spec:
  hard:
    requests.nvidia.com/gpu: "8"
    limits.nvidia.com/gpu: "8"
    requests.cpu: "64"
    requests.memory: "256Gi"
    pods: "20"
```
Priority Classes for Preemption
Use PriorityClasses to ensure production serving workloads are never evicted in favor of training jobs. I typically set up three tiers: production-serving (priority 1000), training-critical (priority 500), and development (priority 100). If the cluster runs out of capacity, development pods get evicted first.
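The three tiers map directly to PriorityClass objects, for example:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: production-serving
value: 1000
globalDefault: false
description: "Inference workloads; never evicted in favor of training."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: training-critical
value: 500
description: "Scheduled training jobs; may preempt development pods."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: development
value: 100
description: "Notebooks and experiments; first to be evicted under pressure."
```

Pods then reference a tier with `priorityClassName: production-serving` in their spec.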
Managed ML Platforms vs DIY Kubernetes
The final question: should you build this yourself on Kubernetes, or use a managed platform like AWS SageMaker, Google Vertex AI, or Azure ML?
| Factor | Managed (SageMaker/Vertex) | DIY Kubernetes |
|---|---|---|
| Time to first training job | Hours | Days to weeks |
| Operational overhead | Low | High (dedicated platform team) |
| Cost at scale (100+ GPUs) | Higher (managed markup) | Lower (spot, autoscaling tuning) |
| Flexibility and customization | Limited to platform features | Unlimited |
| Multi-cloud / hybrid | Vendor lock-in | Portable with effort |
| GPU type selection | Limited options | Full control |
| Custom serving pipelines | Constrained | Fully flexible |
| Team size needed | 2-3 ML engineers | 5+ including platform engineers |
My rule of thumb: if your ML team is under 10 people and runs on a single cloud, start with a managed platform. The time you save on infrastructure is better spent on model development. If you have a dedicated platform team, run multi-cloud or on-premise, or need fine-grained control over scheduling and cost, Kubernetes is the right choice.
Many teams land in the middle: they use managed Kubernetes (EKS, GKE) with tools like KubeRay and the Training Operator, combining the reliability of a managed control plane with the flexibility of custom ML infrastructure. This is the sweet spot I recommend for most organizations doing serious ML work.
Lessons from Three Years of ML on Kubernetes
Here are the condensed lessons I would give to anyone starting an ML platform on Kubernetes:
- Start with the Training Operator and KubeRay. These two cover 80% of ML compute needs without the complexity of a full KubeFlow installation.
- Always mount /dev/shm as a memory-backed emptyDir. This one setting prevents the single most common failure mode in distributed PyTorch training on Kubernetes.
- Use spot instances for training from day one. Retrofitting fault tolerance into existing training code is painful. Build it in from the start with periodic checkpointing.
- Set GPU resource quotas per namespace. Without them, your GPU cluster becomes a tragedy of the commons within weeks.
- Do not install KubeFlow unless you need three or more of its components. Each standalone component (Training Operator, KServe) is easier to operate individually.
- Monitor GPU utilization, not just GPU allocation. A pod can hold a GPU at 5% utilization for hours. Use DCGM exporter with Prometheus to track actual usage and identify waste.
- Use priority classes aggressively. Production inference must never be evicted by a training experiment. Set this up before you need it, not after your first incident.
- Invest in a shared filesystem early. NFS or a cloud-native equivalent (EFS, Filestore) for training data avoids the copy-data-to-every-node problem that kills training startup time.
Kubernetes for ML is not about making things easy—it is about making things possible at scale. The complexity is real, but the alternative is a fragile patchwork of scripts, SSH tunnels, and shared GPU boxes. Pick your complexity carefully, and invest in the parts that compound: autoscaling, resource isolation, and reproducible training infrastructure.
The ML infrastructure landscape moves fast. New tools like vLLM for LLM serving, Skypilot for multi-cloud GPU orchestration, and the evolving Kubernetes Device Plugin API will continue to simplify GPU workloads on Kubernetes. But the fundamentals covered here—resource management, distributed training, autoscaling, and cost control—remain stable. Master them, and you will be well-equipped to adopt whatever comes next.