A practical blueprint for evaluating conversational AI at scale
2025-10-02
18 min read
by
Hicham Badri, Appu Shaji, Craig Wilhite, Josh Clemm, Jason Shang, Artem Nabirkin, Dropbox Team, Ameya Bhatawdekar, Sean-Michael Lewis, Ranjitha Gurunath Kulkarni, Gonzalo Garcia
Summary
This post outlines Dropbox's systematic evaluation framework for conversational AI, developed while building Dropbox Dash.
- Public datasets (Natural Questions, MS MARCO, MuSiQue) were combined with internal production logs to create representative evaluation sets covering diverse real-world queries
- Traditional NLP metrics (BLEU, ROUGE, BERTScore) proved insufficient for production AI, failing to catch hallucinations, missing citations, and factual errors
- An LLM-as-a-judge approach was adopted, using structured rubrics to score factual accuracy, citation correctness, clarity, and formatting, with both scalar and categorical outputs
- Three metric types enforce quality gates: boolean gates (hard fails), scalar budgets (deployment blockers), and rubric scores (monitored on dashboards)
- The Braintrust platform was adopted to centralize dataset management, versioned experiment tracking, and automated regression testing across PR, staging, and production
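The three metric types above can be sketched as a simple gating function. This is an illustrative Python sketch, not Dropbox's actual implementation; the field names, the latency budget, and the `gate_deployment` helper are all assumptions chosen to show how boolean gates hard-fail, scalar budgets block deployment, and rubric scores are only monitored:

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    # Hypothetical per-example judge outputs
    citation_present: bool     # boolean gate: must always hold
    latency_ms: float          # scalar metric with a budget
    factual_accuracy: float    # rubric score in [0, 1], dashboarded only

def gate_deployment(results, latency_budget_ms=2000.0):
    """Apply the three metric types from the post's framework."""
    # Boolean gate: any single missing citation is a hard fail.
    if any(not r.citation_present for r in results):
        return False, "boolean gate failed: missing citation"
    # Scalar budget: mean latency over budget blocks the deployment.
    avg_latency = sum(r.latency_ms for r in results) / len(results)
    if avg_latency > latency_budget_ms:
        return False, f"scalar budget exceeded: {avg_latency:.0f} ms"
    # Rubric scores: reported to dashboards, never block on their own.
    avg_accuracy = sum(r.factual_accuracy for r in results) / len(results)
    return True, f"passed; mean factual accuracy {avg_accuracy:.2f} (monitored)"
```

For example, a batch where every answer carries a citation and mean latency stays under budget passes even if rubric scores dip, while one missing citation fails the batch outright.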
Tags:
#models
#AI
#Machine Learning
#Dash
#Testing
