Engineering at Meta | DevOps

DrP: Meta’s Root Cause Analysis Platform at Scale

2025-12-19
8 min read

Endigest AI Core Summary

Meta's DrP is a root cause analysis platform that automates incident investigation across large-scale systems, reducing mean time to resolve (MTTR) by 20-80%.

  • Engineers author investigation workflows, called analyzers, using a type-safe SDK with built-in ML algorithms for anomaly detection, time series correlation, and dimension analysis
  • The scalable backend manages a request queue and worker pool, executing 50,000 automated analyses daily across 2,000 analyzers for 300+ teams
  • Analyzers integrate with alerting and incident management tools to auto-trigger on alert activation, delivering immediate results to on-call engineers
  • A post-processing system takes automated actions based on analysis results, such as creating tasks or pull requests for mitigation
  • DrP is evolving toward an AI-native platform as part of Meta's AI4Ops vision, with ongoing enhancements to ML algorithms, SDK, and integrations
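
The flow described in the bullets above (analyzers, a request queue drained by a worker pool, and a post-processing step) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not DrP's actual SDK or backend: every name here (`Alert`, `Finding`, `regression_analyzer`, `run_analyses`, `post_process`) is hypothetical, and the 20% threshold is an arbitrary stand-in for the platform's ML-based detection.

```python
import queue
import threading
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Alert:
    """Hypothetical alert payload that triggers an analysis."""
    alert_id: str
    metric: str
    baseline: float  # assumed non-zero in this sketch
    current: float


@dataclass(frozen=True)
class Finding:
    """Hypothetical result produced by one analyzer for one alert."""
    alert_id: str
    analyzer: str
    summary: str
    needs_action: bool


Analyzer = Callable[[Alert], Finding]


def regression_analyzer(alert: Alert) -> Finding:
    """Toy analyzer: flags a metric that moved more than 20% from baseline."""
    change = (alert.current - alert.baseline) / alert.baseline
    return Finding(
        alert_id=alert.alert_id,
        analyzer="regression_analyzer",
        summary=f"{alert.metric} changed by {change:+.0%}",
        needs_action=abs(change) > 0.2,
    )


def run_analyses(alerts: List[Alert], analyzers: List[Analyzer],
                 num_workers: int = 4) -> List[Finding]:
    """Enqueue one request per (analyzer, alert) pair and drain with workers."""
    requests: queue.Queue = queue.Queue()
    for alert in alerts:
        for analyzer in analyzers:
            requests.put((analyzer, alert))

    findings: List[Finding] = []
    lock = threading.Lock()

    def worker() -> None:
        while True:
            try:
                analyzer, alert = requests.get_nowait()
            except queue.Empty:
                return  # queue drained, worker exits
            finding = analyzer(alert)
            with lock:
                findings.append(finding)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return findings


def post_process(findings: List[Finding]) -> List[str]:
    """Stand-in for automated follow-up, e.g. filing a task per actionable finding."""
    return [f"TASK for {f.alert_id}: {f.summary}" for f in findings if f.needs_action]


if __name__ == "__main__":
    alerts = [
        Alert("a1", "latency_ms", baseline=100.0, current=150.0),
        Alert("a2", "error_rate", baseline=1.0, current=1.05),
    ]
    for task in post_process(run_analyses(alerts, [regression_analyzer])):
        print(task)
```

In this sketch the queue decouples alert triggers from execution, so the number of workers can change without touching analyzer code. A production system would presumably add persistence, retries, and per-team isolation, none of which is shown here.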
Tags:
#Data Infrastructure
#ML Applications