Dropbox Tech Blog
Machine Learning

A practical blueprint for evaluating conversational AI at scale

2025-10-02
18 min read
by Hicham Badri, Appu Shaji, Craig Wilhite, Josh Clemm, Jason Shang, Artem Nabirkin, Dropbox Team, Ameya Bhatawdekar, Sean-Michael Lewis, Ranjitha Gurunath Kulkarni, Gonzalo Garcia

Summary

This post outlines Dropbox's systematic evaluation framework for conversational AI, developed while building Dropbox Dash.

  • Public datasets (Natural Questions, MS MARCO, MuSiQue) were combined with internal production logs to create representative evaluation sets covering diverse real-world queries
  • Traditional NLP metrics (BLEU, ROUGE, BERTScore) proved insufficient for production AI, failing to catch hallucinations, missing citations, and factual errors
  • LLM-as-a-judge approach was adopted, using structured rubrics to score factual accuracy, citation correctness, clarity, and formatting with both scalar and categorical outputs
  • Three metric types enforce quality gates: boolean gates (hard failures), scalar budgets (deployment blockers), and rubric scores (monitored on dashboards)
  • The Braintrust platform was adopted to centralize dataset management, versioned experiment tracking, and automated regression testing across PR, staging, and production
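The LLM-as-a-judge approach above can be sketched in miniature. This is an illustrative example, not Dropbox's actual schema: the rubric fields, scoring scale, and the stubbed verdict string are all assumptions, and a real pipeline would send the rubric prompt to a judge model rather than hard-coding its output.

```python
import json

# Hypothetical rubric prompt; fields and scales are assumptions for the sketch.
RUBRIC_PROMPT = """Score the answer against the source documents.
Return JSON with exactly these keys:
  factual_accuracy: integer 1-5
  citation_correct: true or false
  clarity: integer 1-5
  formatting: "good" | "acceptable" | "poor"
"""

def parse_judge_output(raw: str) -> dict:
    """Parse the judge model's JSON verdict and validate its structure."""
    verdict = json.loads(raw)
    if not 1 <= verdict["factual_accuracy"] <= 5:
        raise ValueError("factual_accuracy out of range")
    if not isinstance(verdict["citation_correct"], bool):
        raise ValueError("citation_correct must be boolean")
    return verdict

# Stubbed judge response in place of a real model call:
raw = '{"factual_accuracy": 4, "citation_correct": true, "clarity": 5, "formatting": "good"}'
verdict = parse_judge_output(raw)
```

Validating the judge's structured output before trusting it matters in practice: a malformed or out-of-range verdict should fail loudly rather than silently pass a quality gate.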
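The three metric roles (boolean gates, scalar budgets, monitored rubric scores) can be combined into a single release check. A minimal sketch, assuming hypothetical metric names and thresholds not taken from the post:

```python
def evaluate_release(metrics: dict) -> tuple[bool, list[str]]:
    """Return (release_ok, issues) from aggregated eval metrics."""
    issues = []

    # 1. Boolean gate: any hard failure blocks the release outright.
    if not metrics["no_hallucinated_citations"]:
        issues.append("hard fail: hallucinated citation detected")

    # 2. Scalar budget: a regression past the budget blocks deployment.
    if metrics["factual_accuracy_mean"] < 4.0:  # assumed budget value
        issues.append("budget exceeded: mean factual accuracy below 4.0")

    # 3. Rubric score: logged for dashboards, monitored but non-blocking.
    print(f"clarity (monitored, non-blocking): {metrics['clarity_mean']:.2f}")

    return (len(issues) == 0, issues)

ok, issues = evaluate_release({
    "no_hallucinated_citations": True,
    "factual_accuracy_mean": 4.3,
    "clarity_mean": 4.1,
})
```

The design point is the asymmetry: gates and budgets fail closed and stop a deploy, while rubric scores only feed dashboards so gradual drift is visible without blocking every release.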
Tags:
#models
#AI
#Machine Learning
#Dash
#Testing