Introducing multi-cluster GKE Inference Gateway: Scale AI workloads around the world

The world of artificial intelligence is moving fast, and so is the need to serve models reliably and at scale. Today, we're thrilled to announce the preview of multi-cluster GKE Inference Gateway to enhance the scalability, resilience, and efficiency of your AI/ML inference workloads across multiple Google Kubernetes Engine (GKE) clusters — even those spanning different Google Cloud regions.

Built as an extension of the GKE Gateway API, the multi-cluster Inference Gateway leverages the power of multi-cluster Gateways to provide intelligent, model-aware load balancing for your most demanding AI applications.


Why multi-cluster for AI inference?

As AI models grow in complexity and users become more global, single-cluster deployments can face limitations:

  • Availability risks: Regional outages or cluster maintenance can impact service.

  • Scalability caps: You can hit hardware limits (GPUs/TPUs) within a single cluster or region.

  • Resource silos: Underutilized accelerator capacity in one cluster can't be used by another.

  • Latency: Users far from your serving cluster may experience higher latency.

The multi-cluster GKE Inference Gateway addresses these challenges head-on, providing a variety of features and benefits:

  • Enhanced reliability and fault tolerance: Intelligently route traffic across multiple GKE clusters, including across different regions. If one cluster or region experiences issues, traffic is automatically re-routed, minimizing downtime.

  • Improved scalability and optimized resource usage: Pool and leverage GPU/TPU resources from various clusters. Handle demand spikes by bursting beyond the capacity of a single cluster and efficiently utilize available accelerators across your entire fleet.

  • Globally optimized, model-aware routing: The Inference Gateway can make smart routing decisions using advanced signals. With GCPBackendPolicy, you can configure load balancing based on real-time custom metrics, such as the model server's KV cache utilization metric, so that requests are sent to the best-equipped backend instance. Other modes like in-flight request limits are also supported.

  • Simplified operations: Manage traffic to a globally distributed AI service through a single Inference Gateway configuration in a dedicated GKE "config cluster," while your models run in multiple "target clusters."

How it works

In GKE Inference Gateway there are two foundational resources, InferencePool and InferenceObjective. An InferencePool acts as a resource group for pods that share the same compute hardware (like GPUs or TPUs) and model configuration, helping to ensure scalable and high-availability serving. An InferenceObjective defines the specific model names and assigns serving priorities, allowing Inference Gateway to intelligently route traffic and multiplex latency-sensitive tasks alongside less urgent workloads.
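As a rough sketch, the two resources might look like the following. The API groups, versions, and field names here follow the Gateway API Inference Extension but are assumptions that may differ in your release, and names like `vllm-llama3-pool` and the model name are placeholders:

```yaml
# InferencePool: groups model-server pods that share the same
# accelerator hardware and model configuration.
apiVersion: inference.networking.k8s.io/v1
kind: InferencePool
metadata:
  name: vllm-llama3-pool
spec:
  # Select the model-server pods that back this pool.
  selector:
    matchLabels:
      app: vllm-llama3
  targetPortNumber: 8000
---
# InferenceObjective: names a served model and assigns it a
# serving priority so latency-sensitive traffic can be favored.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceObjective
metadata:
  name: llama3-chat
spec:
  modelName: meta-llama/Llama-3.1-8B-Instruct
  priority: 10          # higher value = preferred under contention
  poolRef:
    name: vllm-llama3-pool
```

With both in place, the gateway can multiplex a high-priority chat objective alongside lower-priority batch objectives on the same pool of accelerators.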


With this release, the system uses Kubernetes Custom Resources to manage your distributed inference service. InferencePool resources in each "target cluster" group model-server backends. These backends are exported and become visible as GCPInferencePoolImport resources in the "config cluster." Standard Gateway and HTTPRoute resources in the config cluster define the entry point and routing rules, directing traffic to these imported pools. Fine-grained load-balancing behaviors, such as the CUSTOM_METRICS or IN_FLIGHT balancing modes, are configured using a GCPBackendPolicy resource attached to the GCPInferencePoolImport.
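The config-cluster wiring described above can be sketched roughly as follows. Resource names are placeholders, and the gatewayClassName, policy fields, and metric name are assumptions; consult the documentation for the exact schema in your release:

```yaml
# Entry point for global inference traffic.
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: global-inference-gateway
spec:
  gatewayClassName: gke-l7-global-external-managed-mc  # multi-cluster class (assumed name)
  listeners:
  - name: http
    protocol: HTTP
    port: 80
---
# Route traffic to the pool exported from the target clusters.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llama3-route
spec:
  parentRefs:
  - name: global-inference-gateway
  rules:
  - backendRefs:
    - group: networking.gke.io
      kind: GCPInferencePoolImport
      name: vllm-llama3-pool
---
# Tune load balancing for the imported pool, e.g. on a
# model-server custom metric such as KV cache utilization.
apiVersion: networking.gke.io/v1
kind: GCPBackendPolicy
metadata:
  name: llama3-lb-policy
spec:
  targetRef:
    group: networking.gke.io
    kind: GCPInferencePoolImport
    name: vllm-llama3-pool
  default:
    balancingMode: CUSTOM_METRICS
    customMetrics:
    - name: gke.kv_cache_utilization   # illustrative metric name
      dryRun: false
```

Because the Gateway, HTTPRoute, and policy all live in the config cluster, adding or draining a target cluster does not require touching the routing configuration.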

This architecture enables use cases like global low-latency serving, disaster recovery, capacity bursting, and efficient use of heterogeneous hardware.

For more information about GKE Inference Gateway core concepts, check out our guide.

Get started today

As you scale your AI inference serving workloads to more users in more places, we're excited for you to try multi-cluster GKE Inference Gateway. To learn more and get started, check out the documentation.
