Receive daily AI-curated summaries of engineering articles from top tech companies worldwide.
Endigest AI Core Summary
Cloudflare explains how they built infrastructure to run extra-large language models like Kimi K2.5 on Workers AI, optimizing for agentic use cases with long contexts and frequent tool calls.
• Hardware configurations are tuned based on input/output token patterns, with emphasis on fast input processing and tool calling for agent workloads
• Prefill/Decode disaggregation separates input processing from token generation on different servers, reducing p90 latency by 3x while maintaining 20-30ms inter-token latency
• Prompt caching with session affinity headers increased cache hit ratios from 60% to 80%, significantly boosting throughput for interactive sessions
• Mooncake Transfer Engine enables efficient KV cache sharing across multiple GPUs via RDMA, with Mooncake Store extending cache beyond GPU VRAM using NVMe storage
• Speculative decoding with the NVIDIA EAGLE-3 draft model accelerates token generation, particularly effective for predictable tool calls and structured outputs
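The session-affinity idea behind the prompt-caching bullet can be illustrated with a minimal routing sketch. This is not Cloudflare's implementation — the `route_by_session` helper and server names are hypothetical — but it shows the core mechanism: hashing an affinity key deterministically sends every request from a session to the same server, so that server's KV cache stays warm and cache hits rise.

```python
import hashlib

def route_by_session(session_id: str, servers: list[str]) -> str:
    """Pick a server for a session by hashing its affinity key.

    Hypothetical sketch: the same session_id always maps to the same
    server (as long as the server list is unchanged), so cached prompt
    prefixes for that session are reused instead of recomputed.
    """
    digest = int(hashlib.sha256(session_id.encode()).hexdigest(), 16)
    return servers[digest % len(servers)]

servers = ["gpu-node-0", "gpu-node-1", "gpu-node-2"]
# Repeated requests from one session land on one node:
first = route_by_session("session-abc", servers)
second = route_by_session("session-abc", servers)
```

A real deployment would layer this behind a load balancer with health checks and rebalancing; the sketch only captures the determinism that makes caching effective.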
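Speculative decoding, the final bullet, follows a simple propose-then-verify loop. The toy sketch below uses trivial deterministic stand-ins for the real draft (EAGLE-3) and target models, and verifies greedily token by token; in a real system the target model checks all drafted tokens in a single batched forward pass, which is where the speedup comes from.

```python
def draft_model(prefix: list[int], k: int) -> list[int]:
    # Stand-in for a cheap draft model: propose k candidate tokens at once.
    return [(prefix[-1] + 1 + i) % 100 for i in range(k)]

def target_model(prefix: list[int]) -> int:
    # Stand-in for the expensive target model: the one "correct" next token.
    return (prefix[-1] + 1) % 100

def speculative_decode(prompt: list[int], n_tokens: int, k: int = 4) -> list[int]:
    """Generate n_tokens after prompt, accepting draft tokens the target agrees with."""
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        proposal = draft_model(out, k)
        accepted = 0
        for tok in proposal:
            # Verify each drafted token against the target; stop at first mismatch.
            # (Real systems verify the whole draft in one target forward pass.)
            if tok == target_model(out):
                out.append(tok)
                accepted += 1
            else:
                break
        if accepted < len(proposal):
            # On rejection, fall back to one token from the target model.
            out.append(target_model(out))
        # Trim any overshoot from accepting a full draft near the end.
        if len(out) - len(prompt) >= n_tokens:
            out = out[:len(prompt) + n_tokens]
    return out
```

Tool calls and structured outputs are highly predictable, so the draft model's proposals are usually accepted, which is why the article singles them out as the best case for this technique.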
This summary was automatically generated by AI based on the original article and may not be fully accurate.