How Workers powers our internal maintenance scheduling pipeline
2025-12-22
14 min read
by Kevin Deems
Endigest AI Core Summary
This post explains how Cloudflare built an internal maintenance scheduling system on Cloudflare Workers to safely coordinate data center operations across 330+ cities globally.
- The scheduler enforces maintenance constraints so that redundant edge routers, or the data centers backing a customer's Aegis egress IP pool, are never taken down at the same time.
- The initial approach of loading all data into a single Worker caused out-of-memory errors, which forced a more targeted data-loading strategy.
- Cloudflare adopted a graph-based data model inspired by Facebook's TAO paper, using typed object/association interfaces to fetch only the data relevant to a region.
- Response payload sizes dropped 100x after switching from a few large requests to many small, targeted ones, though the higher request count ran into the Workers subrequest limit.
- A middleware fetch pipeline with request deduplication (the singleflight pattern), LRU caching, CDN caching via caches.default.match, and retries with backoff keeps the scheduler within Workers platform limits.
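
The fetch pipeline in the last point can be sketched as composable wrappers around a fetch function. This is a minimal illustration, not Cloudflare's actual code: `Fetcher`, `LruCache`, `withLru`, `withSingleflight`, and `withRetry` are hypothetical names, the injected `sleep` function is an assumption made to keep the sketch testable, and the CDN caching layer (`caches.default.match`) is left out because it only exists inside the Workers runtime.

```typescript
// Hypothetical sketch: every name here is illustrative, not from the original post.
type Fetcher<T> = (key: string) => Promise<T>;

/** Small LRU cache that relies on Map preserving insertion order. */
class LruCache<V> {
  private readonly entries = new Map<string, V>();
  private readonly capacity: number;

  constructor(capacity: number) {
    this.capacity = capacity;
  }

  get(key: string): V | undefined {
    if (!this.entries.has(key)) return undefined;
    const value = this.entries.get(key) as V;
    // Re-insert so this key becomes the most recently used.
    this.entries.delete(key);
    this.entries.set(key, value);
    return value;
  }

  set(key: string, value: V): void {
    this.entries.delete(key);
    this.entries.set(key, value);
    if (this.entries.size > this.capacity) {
      // The first key in iteration order is the least recently used.
      const oldest = this.entries.keys().next().value as string;
      this.entries.delete(oldest);
    }
  }
}

/** Serve from the in-memory LRU cache when possible. */
function withLru<T>(next: Fetcher<T>, cache: LruCache<T>): Fetcher<T> {
  return async (key) => {
    const hit = cache.get(key);
    if (hit !== undefined) return hit;
    const value = await next(key);
    cache.set(key, value);
    return value;
  };
}

/** Singleflight: concurrent calls for one key share a single in-flight promise. */
function withSingleflight<T>(next: Fetcher<T>): Fetcher<T> {
  const inFlight = new Map<string, Promise<T>>();
  return (key) => {
    const pending = inFlight.get(key);
    if (pending) return pending;
    // Clear the entry once the request settles, whether it succeeded or failed.
    const started = next(key).then(
      (value) => { inFlight.delete(key); return value; },
      (err) => { inFlight.delete(key); throw err; },
    );
    inFlight.set(key, started);
    return started;
  };
}

/** Retry failed calls, doubling the delay after each failed attempt. */
function withRetry<T>(
  next: Fetcher<T>,
  attempts: number,
  baseMs: number,
  sleep: (ms: number) => Promise<void>,
): Fetcher<T> {
  return async (key) => {
    let lastError: unknown;
    for (let i = 0; i < attempts; i++) {
      try {
        return await next(key);
      } catch (err) {
        lastError = err;
        if (i < attempts - 1) await sleep(baseMs * Math.pow(2, i));
      }
    }
    throw lastError;
  };
}
```

Composed as `withLru(withSingleflight(withRetry(origin, 3, 50, sleep)), new LruCache(100))`, a repeated key is answered from memory, concurrent misses for the same key collapse into one upstream call, and only that one call is retried. Ordering the layers this way is one plausible reading of how such a pipeline conserves subrequests; the original post may arrange them differently.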
Tags:
#Cloudflare Workers
#Reliability
#Prometheus
#Infrastructure
