Receive daily AI-curated summaries of engineering articles from top tech companies worldwide.
Endigest AI Core Summary
IBM Research and UC Berkeley applied the MAST (Multi-Agent System Failure Taxonomy) framework to diagnose why LLM agents fail in enterprise IT automation, analyzing 310 ITBench SRE traces across three models.
• Stronger models fail cleanly: Gemini-3-Flash averages 2.6 failure modes per failed trace, while GPT-OSS-120B compounds to 5.3, with a single early reasoning error cascading into full derailment.
• The strongest universal predictor of failure is FM-3.3 (Incorrect Verification): agents declare success without checking ground truth, and the mode appears 52% more often in Gemini's failed traces.
• Kimi-K2 shows spikes in premature termination (+46%) and unawareness of termination conditions (+43%), often quitting just before solving the problem.
• MAST classifies failures as non-fatal (e.g., step repetition, which appears in over 90% of successful Kimi-K2 runs) versus fatal (verification errors, termination failures, reasoning-action mismatch).
• Recommended mitigations: externalize verification with hard
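The externalized-verification mitigation can be sketched as a thin wrapper that refuses to accept an agent's self-reported "done" signal (the FM-3.3 pattern) unless an independent check against ground truth also passes. This is a minimal illustrative sketch, not code from the article; the function and field names are assumptions:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class TraceResult:
    claimed_success: bool  # what the agent itself reported
    verified: bool         # what the external check confirmed

def run_with_external_verification(
    agent_step: Callable[[], bool],
    verifier: Callable[[], bool],
) -> TraceResult:
    """Guard against FM-3.3 (Incorrect Verification): never trust the
    agent's own success claim; require an independent verifier to pass."""
    claimed = agent_step()             # agent declares success on its own
    verified = claimed and verifier()  # hard check against ground truth
    return TraceResult(claimed_success=claimed, verified=verified)

# Example: agent claims the incident is fixed, but an external health
# probe (hypothetical here) still fails, so the trace is not verified.
result = run_with_external_verification(
    agent_step=lambda: True,   # agent says "I resolved it"
    verifier=lambda: False,    # e.g., service health check still red
)
print(result.verified)  # False
```

The key design point is that the verifier runs outside the agent's reasoning loop, so a derailed chain of thought cannot talk itself into declaring success.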
This summary was automatically generated by AI based on the original article and may not be fully accurate.