Receive daily AI-curated summaries of engineering articles from top tech companies worldwide.
Endigest AI Core Summary
Google Cloud announces llm-d as an official CNCF Sandbox project, positioning Kubernetes as the foundation for large-scale AI inference infrastructure.
• llm-d is co-founded by Google Cloud, Red Hat, IBM Research, CoreWeave, and NVIDIA with the vision of supporting any model, any accelerator, any cloud
• GKE Inference Gateway uses llm-d's Endpoint Picker (EPP) to route requests based on KV-cache hit rates, inflight requests, and queue depth
• Model-aware routing reduced time-to-first-token (TTFT) latency by 35% for Qwen Coder and improved P95 tail latency by 52% for DeepSeek workloads
• Prefix cache hit rate on Vertex AI doubled from 35% to 70%, lowering recomputation overhead and cost per token
• The LeaderWorkerSet (LWS) API enables disaggregated prefill and decode phases, managing large fleets of TPUs and GPUs at global scale
• vLLM, extended natively for Cloud TPUs, delivers up to 5x throughput gains with Ragged Paged Attention v3
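The routing bullet above can be illustrated with a minimal sketch: an endpoint picker that scores each model-server replica by KV-cache affinity minus a load penalty. The `Endpoint` fields mirror the signals named in the summary (KV-cache hit rate, inflight requests, queue depth); the scoring weights and function shape are illustrative assumptions, not llm-d's actual EPP implementation.

```python
from dataclasses import dataclass

@dataclass
class Endpoint:
    name: str
    kv_cache_hit_rate: float  # fraction of the prompt prefix already cached (0..1)
    inflight_requests: int    # requests currently being served
    queue_depth: int          # requests waiting in the local queue

def score(ep: Endpoint, w_cache: float = 1.0, w_load: float = 0.1) -> float:
    # Prefer replicas that already hold the request's KV cache, and
    # penalize replicas with in-flight or queued work. Weights are
    # illustrative; a real picker would tune or learn them.
    return w_cache * ep.kv_cache_hit_rate - w_load * (ep.inflight_requests + ep.queue_depth)

def pick_endpoint(endpoints: list[Endpoint]) -> Endpoint:
    # Route the request to the highest-scoring replica.
    return max(endpoints, key=score)

replicas = [
    Endpoint("pod-a", kv_cache_hit_rate=0.70, inflight_requests=4, queue_depth=2),
    Endpoint("pod-b", kv_cache_hit_rate=0.10, inflight_requests=1, queue_depth=0),
]
print(pick_endpoint(replicas).name)  # pod-a wins on cache locality despite higher load
```

Cache-aware scoring like this is what lets routing cut TTFT: a replica with a warm prefix cache can skip most of the prefill work for a matching request.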
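The prefix-cache bullet can be made concrete with back-of-envelope arithmetic, under the simplifying assumption that prefill compute scales linearly with the fraction of tokens not covered by a cache hit:

```python
def prefill_recompute_fraction(hit_rate: float) -> float:
    # Tokens covered by a prefix-cache hit skip recomputation during
    # prefill; only the uncached remainder must be recomputed.
    return 1.0 - hit_rate

before = prefill_recompute_fraction(0.35)  # 65% of prefill tokens recomputed
after = prefill_recompute_fraction(0.70)   # 30% recomputed
print(f"prefill recompute cut by {(before - after) / before:.0%}")
```

Under this linear model, doubling the hit rate from 35% to 70% cuts recomputed prefill tokens by roughly half, which is the mechanism behind the lower cost-per-token claim.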
This summary was automatically generated by AI based on the original article and may not be fully accurate.