Engineering at Meta

Building Prometheus: How Backend Aggregation Enables Gigawatt-Scale AI Clusters

2026-02-09

Endigest AI Core Summary

Meta details how Backend Aggregation (BAG) enables its gigawatt-scale Prometheus AI cluster by interconnecting tens of thousands of GPUs across multiple data centers.

  • BAG is a centralized Ethernet-based super spine network layer connecting multiple spine fabrics across data centers, with inter-BAG capacities reaching 16-48 Pbps per region pair
  • Two L2 fabric technologies are used: Disaggregated Scheduled Fabric (DSF) and Non-Scheduled Fabric (NSF), each connecting to BAG via a different topology
  • Inter-BAG connectivity uses either planar (one-to-one) or spread connection topology, chosen based on site size and fiber availability
  • Hardware is a modular chassis with Jericho3 (J3) ASIC line cards, providing up to 432x800G ports; routing uses eBGP with unequal-cost multipath (UCMP) for load balancing
  • BAG-to-BAG connections are secured with MACsec; oversubscription from L2 to BAG is typically 4.5:1
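
The figures in the list above lend themselves to some quick arithmetic. The Python sketch below is illustrative only and is not Meta's tooling. It assumes that the 432x800G figure describes a single fully populated chassis, that the 4.5:1 ratio means downlink bandwidth divided by uplink bandwidth toward BAG, and that UCMP weights are proportional to path capacity (the summary does not say how the weights are actually derived). The function names, the 450 Tbps downlink figure, and the path names are hypothetical.

```python
from fractions import Fraction

PORT_SPEED_GBPS = 800  # from the summary: 800G ports
MAX_PORTS = 432        # from the summary: up to 432 ports (assumed per chassis)


def chassis_capacity_tbps(ports=MAX_PORTS, port_speed_gbps=PORT_SPEED_GBPS):
    """Aggregate one-direction port bandwidth of one chassis, in Tbps."""
    return ports * port_speed_gbps / 1000


def uplink_capacity_tbps(downlink_tbps, oversubscription=4.5):
    """Uplink bandwidth implied by a downlink:uplink oversubscription ratio."""
    return downlink_tbps / oversubscription


def ucmp_shares(path_capacities_gbps):
    """Split traffic across paths in proportion to each path's capacity.

    Capacities must be integers (Gbps). Proportional weighting is an
    assumption made for illustration, not a detail taken from the summary.
    """
    total = sum(path_capacities_gbps.values())
    return {
        path: Fraction(capacity, total)
        for path, capacity in path_capacities_gbps.items()
    }


if __name__ == "__main__":
    # 432 ports x 800 Gbps = 345,600 Gbps
    print(chassis_capacity_tbps())      # 345.6

    # Hypothetical fabric with 450 Tbps of downlink capacity at 4.5:1
    print(uplink_capacity_tbps(450.0))  # 100.0

    # Hypothetical paths: one with twice the capacity of the other two
    shares = ucmp_shares({"path-a": 1600, "path-b": 800, "path-c": 800})
    for path, share in shares.items():
        print(path, share)              # path-a 1/2, path-b 1/4, path-c 1/4
```

Unlike equal-cost multipath, which would give each of the three example paths a third of the traffic, the unequal-cost split sends half to the larger path, so a reduced-capacity path is not overloaded.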