Stripe built a benchmark to evaluate whether AI agents can autonomously build complete, production-accurate Stripe integrations end-to-end.
- 11 diverse environments were created covering backend-only tasks, full-stack tasks, and gym problem sets targeting specific Stripe features like Checkout and subscriptions
- Agents were given a goose-based harness with MCP server access to a terminal, browser, and Stripe-specific search tools for consistent evaluation
- Claude Opus 4.5 scored a 92% average on full-stack API integration tasks; OpenAI GPT-5.2 scored 73% on gym problem sets; best-performing runs averaged 63 turns
- Agents exceeded expectations in browser use, navigating UIs and reverse-engineering Checkout Session API parameters with over 80% accuracy
- Key failure modes include mishandling ambiguous situations (accepting 400 errors as success) and getting stuck in browser interactions due to focus loss in form fields
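The "accepting 400 errors as success" failure mode above can be made concrete with a minimal sketch. This is not code from Stripe's harness; the `ApiResponse` type and both helper functions are hypothetical, standing in for whatever response object an agent inspects after calling the Stripe API:

```python
from dataclasses import dataclass

@dataclass
class ApiResponse:
    status_code: int
    body: dict

def naive_is_success(resp: ApiResponse) -> bool:
    # The failure mode described in the benchmark: the agent treats any
    # completed request with a response body as success, even when the
    # body is a 400 validation error.
    return resp.body is not None

def checked_is_success(resp: ApiResponse) -> bool:
    # Correct handling: only 2xx status codes count as success.
    return 200 <= resp.status_code < 300

# A typical Stripe-style 400 validation error (illustrative payload).
bad = ApiResponse(400, {"error": {"message": "Missing required param: line_items"}})
print(naive_is_success(bad))    # the agent wrongly reports success
print(checked_is_success(bad))  # the status check catches the failure
```

An agent that only looks at whether the call returned a body will report the integration as working while the Checkout Session was never created, which is exactly the ambiguity-handling gap the benchmark flags.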