DrP: Meta’s Root Cause Analysis Platform at Scale
2025-12-19
Endigest AI Core Summary
Meta's DrP is a root cause analysis platform that automates incident investigation across large-scale systems, reducing mean time to resolve (MTTR) by 20-80%.
- Engineers author investigation workflows called analyzers using a type-safe SDK with built-in ML algorithms for anomaly detection, time series correlation, and dimension analysis
- The scalable backend manages a request queue and worker pool, executing 50,000 automated analyses daily across 300+ teams and 2,000 analyzers
- Analyzers integrate with alerting and incident management tools to auto-trigger on alert activation, delivering immediate results to on-call engineers
- A post-processing system takes automated actions based on analysis results, such as creating tasks or pull requests for mitigation
- DrP is evolving toward an AI-native platform as part of Meta's AI4Ops vision, with ongoing enhancements to ML algorithms, SDK, and integrations
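
The analyzer, worker pool, and post-processing flow summarized above can be illustrated with a small sketch. This is not DrP's SDK or backend: every name here (`Finding`, `zscore_analyzer`, `ANALYZERS`, `run_requests`, `post_process`) is hypothetical, and a simple z-score check stands in for the platform's anomaly detection algorithms, which the summary does not describe.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Finding:
    """Hypothetical typed result returned by an analyzer."""
    analyzer: str
    summary: str
    anomalous: bool


def zscore_analyzer(series: List[float], threshold: float = 3.0) -> Finding:
    """Flag the latest point if it deviates strongly from the earlier baseline.

    Assumption: a z-score test is used only as a stand-in for anomaly detection.
    """
    baseline, latest = series[:-1], series[-1]
    mu, sigma = mean(baseline), pstdev(baseline)
    if sigma == 0:
        # Flat baseline: any change at all counts as a deviation.
        z = 0.0 if latest == mu else float("inf")
    else:
        z = (latest - mu) / sigma
    return Finding("zscore", f"latest z-score = {z:.2f}", abs(z) >= threshold)


# Hypothetical registry mapping analyzer names to callables.
ANALYZERS: Dict[str, Callable[[List[float]], Finding]] = {"zscore": zscore_analyzer}


def run_requests(
    requests: List[Tuple[str, List[float]]], max_workers: int = 4
) -> List[Finding]:
    """Run queued (analyzer name, series) requests on a worker pool, in order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(ANALYZERS[name], series) for name, series in requests]
        return [future.result() for future in futures]


def post_process(finding: Finding) -> Optional[str]:
    """Turn an anomalous finding into a follow-up action (here, just a string)."""
    return f"open task: {finding.summary}" if finding.anomalous else None


if __name__ == "__main__":
    queue = [
        ("zscore", [10, 10, 11, 9, 10, 10, 30]),  # spike on the last point
        ("zscore", [5, 5, 5, 5]),                 # flat series
    ]
    for finding in run_requests(queue):
        print(finding, post_process(finding))
```

In this sketch an alert would enqueue a request, a worker would run the matching analyzer, and the post-processing step would decide whether to open a follow-up task. A production system would replace the in-process pool with a durable queue and separate workers.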
Tags:
#Data Infrastructure
#ML Applications
