o11y-bench benchmarks AI agents on observability workflows.
- Tests with real Grafana, Prometheus, Loki, and Tempo services
- 63 tasks covering metrics, logs, traces, incident investigation, and dashboards
- Verifies results against ground-truth queries rather than evaluating responses alone
- Prioritizes consistency (Pass^3) over best-of-three success (Pass@3)
- Opus 4.7 without reasoning achieved the top consistency score; dashboard tasks remain the most challenging
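To make the Pass^3 versus Pass@3 distinction concrete, here is a minimal sketch. It assumes each task is run three times, that Pass@3 counts a task as solved when at least one run passes, and that Pass^3 counts it only when all three runs pass. The task names and results are hypothetical and are not taken from the benchmark.

```python
from typing import Dict, List


def pass_at_k(results: Dict[str, List[bool]]) -> float:
    """Fraction of tasks where at least one run passed (best-of-k success)."""
    return sum(any(runs) for runs in results.values()) / len(results)


def pass_hat_k(results: Dict[str, List[bool]]) -> float:
    """Fraction of tasks where every run passed (consistency)."""
    return sum(all(runs) for runs in results.values()) / len(results)


# Hypothetical per-task outcomes for three runs each (illustrative only).
runs = {
    "metrics-query": [True, True, True],
    "log-search": [True, False, True],
    "trace-lookup": [True, True, False],
    "dashboard-edit": [False, False, False],
}

print(f"Pass@3: {pass_at_k(runs):.2f}")
print(f"Pass^3: {pass_hat_k(runs):.2f}")
```

In this made-up sample, three of the four tasks pass at least once but only one passes every time, so the consistency metric is much stricter than best-of-three success.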