Endigest AI Core Summary
This post walks through deploying an LLM (DeepSeek) on Google Kubernetes Engine using managed DRANET and GKE Inference Gateway with NVIDIA B200 GPUs.
• DRANET (Dynamic Resource Allocation Networking) enables pods to request and allocate RDMA-capable network interfaces for high-performance GPU-to-GPU communication
• The RDMA network runs on an isolated VPC using the RoCEv2 network profile, dedicated to low-latency inter-node GPU traffic
• The setup uses A4 VMs (8x NVIDIA B200 GPUs each), provisioned via future reservations tied to a specific zone
• Three VPCs are involved: one manually created standard VPC, and two auto-created by GKE managed DRANET (one standard, one RDMA)
• The model is served privately via GKE Inference Gateway using a regional internal Application Load Balancer (gke-l7-rilb)
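DRANET builds on the standard Kubernetes Dynamic Resource Allocation (DRA) API, in which a pod references a ResourceClaim instead of a plain resource limit. A minimal sketch of what requesting an RDMA NIC might look like under that pattern follows; the names, device class, and image below are illustrative assumptions, not taken from the article, and the actual DeviceClass is installed by the managed DRANET driver on the cluster:

```yaml
# Sketch only: a DRA ResourceClaimTemplate requesting an RDMA-capable NIC.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: rdma-nic-template            # hypothetical name
spec:
  spec:
    devices:
      requests:
        - name: rdma-nic
          deviceClassName: rdma-nic  # assumed; check the DeviceClasses the DRANET driver installs
---
# A pod then references the template so each replica is allocated its own RDMA interface:
apiVersion: v1
kind: Pod
metadata:
  name: inference-worker             # hypothetical name
spec:
  containers:
    - name: worker
      image: example.com/deepseek-server:latest   # placeholder image
      resources:
        claims:
          - name: rdma
  resourceClaims:
    - name: rdma
      resourceClaimTemplateName: rdma-nic-template
```

Because the allocation happens per pod at scheduling time, each inference replica gets a dedicated interface on the isolated RDMA VPC rather than sharing the node's default network.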
This summary was automatically generated by AI based on the original article and may not be fully accurate.