Receive daily AI-curated summaries of engineering articles from top tech companies worldwide.
Endigest AI Core Summary
This post introduces GKE Inference Gateway, a unified platform on Google Kubernetes Engine for running both real-time and async AI inference workloads on shared GPU/TPU infrastructure.
• Real-time inference uses latency-aware scheduling based on metrics like KV cache utilization to minimize time-to-first-token under heavy load.
• Async inference is handled by an Async Processor Agent that integrates with Cloud Pub/Sub, routing batch requests as 'sheddable' traffic during idle accelerator cycles.
• Real-time traffic always takes strict priority at the gateway level; async tasks fill unused compute capacity between real-time spikes.
• Initial testing showed that without the Async Processor, unmanaged multiplexing caused a 99% message drop rate; with it, 100% of latency-tolerant requests were served.
The entire stack is open source, enabling use across multiple clouds and environments, with deadline-aware scheduling planned for the next phase.
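The priority rule the summary describes can be sketched as a simple two-queue scheduler: real-time requests are always drained first, and sheddable async work runs only when no real-time work is pending. This is an illustrative sketch, not the actual GKE Inference Gateway API; the class and method names below are hypothetical.

```python
import queue

class InferenceScheduler:
    """Hypothetical sketch of strict-priority multiplexing: real-time
    traffic preempts 'sheddable' async batch work, which only runs
    during otherwise-idle accelerator cycles."""

    def __init__(self):
        self.realtime = queue.Queue()     # strict-priority traffic
        self.async_batch = queue.Queue()  # sheddable, latency-tolerant

    def submit_realtime(self, req):
        self.realtime.put(req)

    def submit_async(self, req):
        self.async_batch.put(req)

    def next_request(self):
        """Return the next (tier, request) pair to run, or None if idle.
        Real-time work is always drained before any async work."""
        if not self.realtime.empty():
            return ("realtime", self.realtime.get())
        if not self.async_batch.empty():
            return ("async", self.async_batch.get())
        return None

# Usage: an async job queued first still yields to a later real-time spike.
sched = InferenceScheduler()
sched.submit_async("batch-1")
sched.submit_realtime("chat-1")
order = []
while (item := sched.next_request()) is not None:
    order.append(item)
# order == [("realtime", "chat-1"), ("async", "batch-1")]
```

In the real system the same idea operates at the gateway level, with Pub/Sub feeding the async queue; deadline-aware scheduling would extend the async tier with per-request deadlines.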
This summary was automatically generated by AI based on the original article and may not be fully accurate.