Kubernetes has become the default substrate for running ML inference workloads at scale. It provides the scheduling flexibility, resource isolation, and operational tooling that large-scale inference requires. But Kubernetes was not designed for ML workloads, and the defaults are wrong for several important configuration decisions. This guide covers the practical details that matter when you move from a hello-world deployment to a production inference cluster.
Sizing Pods Correctly
The most common mistake in Kubernetes ML deployments is setting resource requests and limits incorrectly. For CPU inference pods, set requests equal to the steady-state CPU usage of a single inference worker and limits 20-30% higher to allow for bursts. Setting requests too low lets the scheduler pack your pods onto nodes that cannot actually serve them, leading to CPU contention and throttling under load. Setting requests too high reserves capacity you never use, causing resource fragmentation and poor bin-packing across nodes.
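As a concrete sketch, a container spec following this sizing rule might look like the fragment below. The names, image, and numbers are hypothetical; substitute the steady-state CPU usage you actually measure under load.

```yaml
# Container spec fragment (hypothetical values): requests match measured
# steady-state usage; the CPU limit sits ~25% above for burst headroom.
containers:
- name: worker
  image: registry.example.com/inference:latest  # placeholder image
  resources:
    requests:
      cpu: "2"       # measured steady-state usage of one inference worker
      memory: 4Gi
    limits:
      cpu: "2500m"   # ~25% above the request to absorb bursts
      memory: 4Gi    # memory request == limit avoids overcommit surprises
```

Keeping the memory request equal to the limit gives the pod the Guaranteed QoS class for memory, which makes it less likely to be evicted under node pressure.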
For GPU pods, the situation is simpler and stricter. GPU resources are not divisible by default in Kubernetes; a pod either gets a full GPU or none. Request exactly as many GPUs as your serving process requires. For most single-model serving deployments, that is one GPU per pod. If you need to share GPUs across models for cost efficiency, look at NVIDIA's MIG (Multi-Instance GPU) support or a time-slicing configuration, but be aware that these come with their own scheduling complexity.
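For the common single-model case, the GPU request is a one-line addition to the container spec. Note that extended resources like nvidia.com/gpu are specified under limits; the image name below is a placeholder.

```yaml
# GPU container fragment: nvidia.com/gpu is all-or-nothing per pod
# unless MIG or time-slicing is configured on the node.
containers:
- name: model-server
  image: registry.example.com/gpu-inference:latest  # placeholder image
  resources:
    limits:
      nvidia.com/gpu: 1   # one full GPU per serving pod
```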
Memory planning for GPU pods must account for both CPU RAM and GPU VRAM, but Kubernetes memory limits govern only the former: model weights loaded into GPU memory are invisible to Kubernetes resource accounting. Monitor actual GPU memory usage with nvidia-smi or a GPU metrics exporter such as NVIDIA's DCGM exporter to catch GPU OOM conditions before they cause pod restarts.
Choosing a Serving Framework
The serving framework running inside your pod has a significant effect on throughput and latency. NVIDIA Triton Inference Server offers multi-framework model support, built-in dynamic batching, and concurrent model execution on a single GPU. TorchServe is the natural choice for PyTorch models and supports custom handlers for complex preprocessing. TensorFlow Serving is the go-to for TF models, with a strong production track record. BentoML and Ray Serve provide framework-agnostic serving with Python-native preprocessing pipelines.
For most teams, the choice is simpler than the options suggest. If your model is in PyTorch and your preprocessing is complex Python, use TorchServe or BentoML. If your model is in ONNX or TensorRT and you need maximum throughput with dynamic batching, use Triton. If your team is already running Ray for distributed compute, use Ray Serve. Avoid building a custom serving layer unless you have a specific requirement that none of the above frameworks meet.
Autoscaling Inference Pods
Horizontal Pod Autoscaler (HPA) works well for CPU-based inference but requires careful metric selection. CPU utilization is a poor autoscaling signal for ML workloads because it is not directly proportional to request rate or latency. A better signal is request queue depth or requests per second per pod, measured via a Prometheus adapter and surfaced to HPA as a custom metric. Scale up when queue depth exceeds a threshold; scale down when queue depth drops below a lower threshold with a cooldown window to prevent flapping.
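A queue-depth-driven HPA might look like the sketch below, assuming a Prometheus adapter already exposes a per-pod metric; the metric name, deployment name, and thresholds are hypothetical.

```yaml
# HPA v2 scaling on a custom per-pod queue-depth metric (hypothetical names).
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: cpu-inference              # hypothetical deployment
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Pods
    pods:
      metric:
        name: inference_queue_depth  # surfaced via a Prometheus adapter
      target:
        type: AverageValue
        averageValue: "10"           # scale up above ~10 queued requests/pod
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # cooldown window to prevent flapping
```

The scaleDown stabilization window is the cooldown mentioned above: the HPA only scales down to the highest replica count recommended during that window.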
For GPU workloads, HPA needs KEDA (Kubernetes Event-Driven Autoscaling) or similar to scale based on GPU utilization or custom metrics. Node autoscaling for GPU nodes requires the cluster autoscaler configured with appropriate scale-up and scale-down delays. GPU nodes are expensive; configure the scale-down threshold conservatively to avoid thrashing between node provision and termination cycles.
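One way to wire this up is a KEDA ScaledObject driven by a Prometheus query over GPU utilization. The sketch below assumes a DCGM exporter feeding Prometheus; the deployment name, server address, and thresholds are hypothetical.

```yaml
# KEDA ScaledObject scaling a GPU deployment on average GPU utilization.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: gpu-inference-scaler
spec:
  scaleTargetRef:
    name: gpu-inference              # hypothetical deployment
  minReplicaCount: 1                 # keep one warm pod for latency-sensitive traffic
  maxReplicaCount: 8
  cooldownPeriod: 600                # conservative scale-down: GPU nodes are expensive
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring:9090  # placeholder address
      query: avg(DCGM_FI_DEV_GPU_UTIL{deployment="gpu-inference"})
      threshold: "70"                # target average GPU utilization (%)
```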
Scale-to-zero is attractive for cost but problematic for latency. Cold-starting a GPU pod from zero takes 2-5 minutes depending on the node provisioning time and model load time. For latency-sensitive endpoints, keep a minimum of one pod warm at all times. For batch or asynchronous inference, scale-to-zero is viable with appropriate timeout handling on the client side.
Health Checks for Model Pods
Kubernetes liveness and readiness probes for ML pods have different requirements than for web servers. A liveness probe that checks only whether the process is running will not catch a model stuck in a degraded state where it accepts requests but returns garbage. A useful liveness check runs a lightweight inference request against the model with a known input and asserts that the output is within expected bounds.
Readiness probes should block traffic until the model is fully loaded. Model loading can take 30-120 seconds for large models. A pod that passes liveness before the model is loaded will receive requests it cannot serve. The readiness probe should check that the model weights are loaded and a warmup inference pass has completed successfully.
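Putting both probes together, a probe fragment might look like this. The endpoint paths and port are hypothetical; the serving process is assumed to expose a readiness endpoint that only returns 200 after weights are loaded and warmup has completed, and a liveness endpoint that runs a cheap known-input inference.

```yaml
# Probe fragment (hypothetical endpoints and timings).
readinessProbe:
  httpGet:
    path: /ready            # 200 only after weights load + warmup pass
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 12      # tolerate up to ~120s of model loading
livenessProbe:
  httpGet:
    path: /live             # runs a known-input inference, checks output bounds
    port: 8080
  initialDelaySeconds: 180  # never fire before loading could possibly finish
  periodSeconds: 30
  timeoutSeconds: 10        # inference check needs more headroom than a ping
```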
Persistent Storage and Model Loading
Baking model weights into the container image is convenient for small models but unworkable for models over a few hundred megabytes. Container image pulls are slow and image registries are not designed for binary artifact storage. The standard pattern is to store model artifacts in object storage (S3, GCS, or equivalent) and load them at pod startup using an init container or the serving framework's remote artifact loading support.
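The init-container variant of this pattern is sketched below, assuming an S3 bucket and the public aws-cli image; the bucket path, image, and mount paths are placeholders.

```yaml
# Pod spec fragment: an init container downloads weights into a shared
# emptyDir before the serving container starts (hypothetical names).
initContainers:
- name: fetch-model
  image: amazon/aws-cli:latest         # placeholder puller image
  command: ["aws", "s3", "cp", "s3://example-models/my-model/", "/models/", "--recursive"]
  volumeMounts:
  - name: model-store
    mountPath: /models
containers:
- name: model-server
  image: registry.example.com/inference:latest  # placeholder image
  volumeMounts:
  - name: model-store
    mountPath: /models
    readOnly: true
volumes:
- name: model-store
  emptyDir: {}
```

Because init containers run to completion before the main container starts, the serving process never observes a partially downloaded model directory.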
For fast pod startup, consider using a PersistentVolumeClaim with a ReadOnlyMany access mode backed by a high-throughput network filesystem. Pods on the same node can share the volume and avoid redundant model downloads. This is particularly valuable for large models where each download takes minutes.
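A minimal PVC for this pattern, assuming a storage class backed by a network filesystem that supports ReadOnlyMany (the class name and size are placeholders):

```yaml
# PVC fragment: a read-only shared volume for pre-populated model weights.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-weights
spec:
  accessModes:
  - ReadOnlyMany               # many pods mount one pre-populated copy
  storageClassName: nfs-fast   # placeholder: high-throughput network filesystem
  resources:
    requests:
      storage: 50Gi
```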
Networking and Load Balancing
Use a Kubernetes Service of type ClusterIP in front of your inference pods with an Ingress controller handling external traffic. Configure graceful termination on the pods themselves, with a preStop hook and an adequate terminationGracePeriodSeconds, so in-flight requests are not dropped during rolling updates; Services do not drain connections for you. For gRPC-based serving frameworks, ensure your ingress controller supports HTTP/2 and gRPC properly. Many default nginx ingress configurations do not.
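With ingress-nginx, gRPC support comes down to one annotation, sketched below with a hypothetical host, Service name, and port. Note that ingress-nginx requires TLS on the gRPC listener.

```yaml
# Ingress fragment for gRPC serving behind ingress-nginx (hypothetical names).
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: inference-grpc
  annotations:
    nginx.ingress.kubernetes.io/backend-protocol: "GRPC"  # proxy HTTP/2 gRPC
spec:
  ingressClassName: nginx
  tls:
  - hosts: [inference.example.com]
    secretName: inference-tls       # gRPC through ingress-nginx requires TLS
  rules:
  - host: inference.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: inference-svc     # hypothetical ClusterIP Service
            port:
              number: 8001
```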
MLPipeX's Kubernetes integration deploys a lightweight agent into your cluster that manages all of the above: resource sizing recommendations, autoscaling configuration, health check templates, and storage mounting. The MLPipeX control plane handles orchestration while your cluster handles execution.
Conclusion
Kubernetes is a powerful platform for ML inference when configured correctly. The defaults are wrong for most ML workloads. Getting resource sizing, autoscaling signals, health checks, and storage configuration right upfront prevents a class of operational problems that are difficult to debug after the fact. Start with a single model serving deployment, validate each configuration decision under realistic load, and build your cluster practices from real observations rather than defaults.