Receive daily AI-curated summaries of engineering articles from top tech companies worldwide.
Endigest AI Core Summary
Cloudflare explains how they built infrastructure to run extra-large language models like Kimi K2.5 on Workers AI, optimizing for agentic use cases with long contexts and frequent tool calls.
• Hardware configurations are tuned based on input/output token patterns, with emphasis on fast input processing and tool calling for agent workloads
• Prefill/Decode disaggregation separates input processing from token generation on different servers, reducing p90 latency by 3x while maintaining 20-30ms inter-token latency
• Prompt caching with session affinity headers increased cache hit ratios from 60% to 80%, significantly boosting throughput for interactive sessions
• Mooncake Transfer Engine enables efficient KV cache sharing across multiple GPUs via RDMA, with Mooncake Store extending cache beyond GPU VRAM using NVMe storage
• Speculative decoding with the NVIDIA EAGLE-3 draft model accelerates token generation, particularly effective for predictable tool calls and structured outputs
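The session-affinity idea behind the prompt-caching bullet can be illustrated with a minimal routing sketch. This is not Cloudflare's implementation — the `route_by_session` helper and server names are hypothetical — but it shows the core mechanism: hashing an affinity key deterministically sends every request from a session to the same server, so that server's KV cache stays warm and cache hits rise.

```python
import hashlib

def route_by_session(session_id: str, servers: list[str]) -> str:
    """Pick a server for a session by hashing its affinity key.

    Hypothetical sketch: the same session_id always maps to the same
    server (as long as the server list is unchanged), so cached prompt
    prefixes for that session are reused instead of recomputed.
    """
    digest = int(hashlib.sha256(session_id.encode()).hexdigest(), 16)
    return servers[digest % len(servers)]

servers = ["gpu-node-0", "gpu-node-1", "gpu-node-2"]
# Repeated requests from one session land on one node:
first = route_by_session("session-abc", servers)
second = route_by_session("session-abc", servers)
```

A real deployment would layer this behind a load balancer with health checks and rebalancing; the sketch only captures the determinism that makes caching effective.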
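Speculative decoding, the final bullet, follows a simple propose-then-verify loop. The toy sketch below uses trivial deterministic stand-ins for the real draft (EAGLE-3) and target models, and verifies greedily token by token; in a real system the target model checks all drafted tokens in a single batched forward pass, which is where the speedup comes from.

```python
def draft_model(prefix: list[int], k: int) -> list[int]:
    # Stand-in for a cheap draft model: propose k candidate tokens at once.
    return [(prefix[-1] + 1 + i) % 100 for i in range(k)]

def target_model(prefix: list[int]) -> int:
    # Stand-in for the expensive target model: the one "correct" next token.
    return (prefix[-1] + 1) % 100

def speculative_decode(prompt: list[int], n_tokens: int, k: int = 4) -> list[int]:
    """Generate n_tokens after prompt, accepting draft tokens the target agrees with."""
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        proposal = draft_model(out, k)
        accepted = 0
        for tok in proposal:
            # Verify each drafted token against the target; stop at first mismatch.
            # (Real systems verify the whole draft in one target forward pass.)
            if tok == target_model(out):
                out.append(tok)
                accepted += 1
            else:
                break
        if accepted < len(proposal):
            # On rejection, fall back to one token from the target model.
            out.append(target_model(out))
        # Trim any overshoot from accepting a full draft near the end.
        if len(out) - len(prompt) >= n_tokens:
            out = out[:len(prompt) + n_tokens]
    return out
```

Tool calls and structured outputs are highly predictable, so the draft model's proposals are usually accepted, which is why the article singles them out as the best case for this technique.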
This summary was automatically generated by AI based on the original article and may not be fully accurate.