coSTAR: How We Ship AI Agents at Databricks Fast, Without Breaking Things

2026-03-20

Tags: Mosaic Research

Endigest AI Core Summary

This post introduces coSTAR, Databricks' methodology for reliably testing and iterating on AI agents using MLflow.

•coSTAR runs two coupled STAR loops: an agent loop that uses judges to auto-score traces and refine the agent, and a judge loop that aligns judges with human expert assessments.
•Scenario definitions act as test fixtures, bundling initial state, user prompts, and expected outcomes in a portable, reusable structure.
•Trace capture decouples execution from scoring, allowing judges to be re-run against persisted traces without repeating expensive agent runs.
•Judges are implemented as agentic LLMs equipped with tools to selectively inspect traces, avoiding the quality degradation of feeding full traces into a single context window.
•The test suite spans three categories: deterministic checks (syntax, schema, tool sequence), LLM-based judgment (code quality, best practices), and operational metrics (token usage, latency, tool call failure rates).
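To make the scenario and trace-capture ideas concrete, here is a minimal sketch in plain Python. It is not the coSTAR implementation and does not use the MLflow API. Every name in it (`Scenario`, `Trace`, `run_agent`, `check_tool_sequence`, `check_output_schema`, `score`) is a hypothetical stand-in, and the agent run is replaced by a canned result. It only illustrates one plausible shape: a scenario bundles initial state, prompt, and expectations; the trace is persisted once; deterministic checks are then run against the reloaded trace without re-running the agent.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Scenario:
    """Hypothetical test fixture: initial state, prompt, expected outcome."""
    name: str
    initial_state: dict
    user_prompt: str
    expected_tools: list  # expected tool-call sequence


@dataclass
class Trace:
    """Hypothetical record of one agent run, kept separate from scoring."""
    scenario: str
    tool_calls: list
    output: str
    tokens: int
    latency_s: float


def run_agent(scenario: Scenario) -> Trace:
    # Stand-in for an expensive agent run; returns a canned trace.
    return Trace(
        scenario=scenario.name,
        tool_calls=["read_table", "write_sql"],
        output='{"sql": "SELECT 1"}',
        tokens=1200,
        latency_s=3.5,
    )


def check_tool_sequence(scenario: Scenario, trace: Trace) -> bool:
    # Deterministic check: the tools were called in the expected order.
    return trace.tool_calls == scenario.expected_tools


def check_output_schema(scenario: Scenario, trace: Trace) -> bool:
    # Deterministic check: the output is a JSON object with a "sql" key.
    try:
        payload = json.loads(trace.output)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and "sql" in payload


def score(scenario: Scenario, trace: Trace, checks: list) -> dict:
    # Scoring reads only the trace, so it can be repeated at no agent cost.
    return {check.__name__: check(scenario, trace) for check in checks}


scenario = Scenario(
    name="simple_query",
    initial_state={"tables": ["orders"]},
    user_prompt="Write a query that returns one row.",
    expected_tools=["read_table", "write_sql"],
)

trace = run_agent(scenario)              # expensive step, done once
persisted = json.dumps(asdict(trace))    # persist the trace
reloaded = Trace(**json.loads(persisted))

# Checks can be added or changed later and re-run on the stored trace.
results = score(scenario, reloaded, [check_tool_sequence, check_output_schema])
print(results)
```

In this sketch an LLM-based judge would be one more function with the same `(scenario, trace)` signature, and operational metrics such as `tokens` and `latency_s` can be read directly from the stored trace.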
