Building Prometheus: How Backend Aggregation Enables Gigawatt-Scale AI Clusters
2026-02-09
6 min read
Endigest AI Core Summary
Meta details how Backend Aggregation (BAG) enables its gigawatt-scale Prometheus AI cluster by interconnecting tens of thousands of GPUs across multiple data centers.
- BAG is a centralized, Ethernet-based super-spine network layer that connects multiple spine fabrics across data centers, with inter-BAG capacity reaching 16-48 Pbps per region pair
- Two L2 fabric technologies are used: Disaggregated Scheduled Fabric (DSF) and Non-Scheduled Fabric (NSF), each connecting to BAG through a different topology
- Inter-BAG connectivity uses either a planar (one-to-one) or a spread connection topology, chosen based on site size and fiber availability
- Hardware uses modular chassis with Jericho3 (J3) ASIC line cards providing up to 432x800G ports; routing uses eBGP with UCMP for load balancing
- BAG-to-BAG connections are secured with MACsec; oversubscription from L2 to BAG is typically 4.5:1
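
The figures in the points above can be checked with some quick arithmetic. The following is a minimal Python sketch, not Meta's tooling: the function names, the 450 Tbps downlink figure, and the example path names are hypothetical. It also assumes that UCMP weights are proportional to path capacity, which is the general idea behind unequal-cost multipath; the summary does not say how the weights are actually derived.

```python
from fractions import Fraction

# Figures taken from the summary above.
PORTS_PER_CHASSIS = 432      # "up to 432x800G ports"
PORT_SPEED_GBPS = 800
OVERSUBSCRIPTION = 4.5       # typical L2-to-BAG ratio (4.5:1)


def chassis_capacity_tbps(ports=PORTS_PER_CHASSIS, speed_gbps=PORT_SPEED_GBPS):
    """Aggregate port capacity of one fully populated chassis, in Tbps."""
    return ports * speed_gbps / 1000


def bag_uplink_tbps(l2_downlink_tbps, ratio=OVERSUBSCRIPTION):
    """Uplink capacity toward BAG implied by an oversubscription ratio."""
    return l2_downlink_tbps / ratio


def ucmp_shares(path_capacity_gbps):
    """Traffic share per path, assuming weights proportional to capacity.

    This proportional rule is an illustrative assumption, not a documented
    detail of the deployment described in the article.
    """
    total = sum(path_capacity_gbps.values())
    return {path: Fraction(cap, total) for path, cap in path_capacity_gbps.items()}


if __name__ == "__main__":
    print(chassis_capacity_tbps())      # 432 * 800 Gbps expressed in Tbps
    print(bag_uplink_tbps(450))         # hypothetical 450 Tbps of L2 capacity
    print(ucmp_shares({"bag_a": 1600, "bag_b": 800}))  # hypothetical paths
```

With equal capacities the shares collapse to plain ECMP; the unequal case matters when paths to a remote BAG have different amounts of available fiber.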
Tags:
#Data Center Engineering
