A practical blueprint for evaluating conversational AI at scale
2025-10-02
18 min read
by
Hicham Badri, Appu Shaji, Craig Wilhite, Josh Clemm, Jason Shang, Artem Nabirkin, Dropbox Team, Ameya Bhatawdekar, Sean-Michael Lewis, Ranjitha Gurunath Kulkarni, Gonzalo Garcia
Summary
This post outlines Dropbox's systematic evaluation framework for conversational AI, developed while building Dropbox Dash.
- Public datasets (Natural Questions, MS MARCO, MuSiQue) were combined with internal production logs to create representative evaluation sets covering diverse real-world queries
- Traditional NLP metrics (BLEU, ROUGE, BERTScore) proved insufficient for production AI, failing to catch hallucinations, missing citations, and factual errors
- An LLM-as-a-judge approach was adopted, using structured rubrics to score factual accuracy, citation correctness, clarity, and formatting, with both scalar and categorical outputs
- Three metric types enforce quality gates: boolean gates (hard fails), scalar budgets (deployment blockers), and rubric scores (monitored on dashboards)
- The Braintrust platform was adopted to centralize dataset management, versioned experiment tracking, and automated regression testing across PR, staging, and production
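The three metric types above can be sketched as a simple gating function. This is an illustrative Python sketch, not Dropbox's actual implementation; the field names, the latency budget, and the `gate_deployment` helper are all assumptions chosen to show how boolean gates hard-fail, scalar budgets block deployment, and rubric scores are only monitored:

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    # Hypothetical per-example judge outputs
    citation_present: bool     # boolean gate: must always hold
    latency_ms: float          # scalar metric with a budget
    factual_accuracy: float    # rubric score in [0, 1], dashboarded only

def gate_deployment(results, latency_budget_ms=2000.0):
    """Apply the three metric types from the post's framework."""
    # Boolean gate: any single missing citation is a hard fail.
    if any(not r.citation_present for r in results):
        return False, "boolean gate failed: missing citation"
    # Scalar budget: mean latency over budget blocks the deployment.
    avg_latency = sum(r.latency_ms for r in results) / len(results)
    if avg_latency > latency_budget_ms:
        return False, f"scalar budget exceeded: {avg_latency:.0f} ms"
    # Rubric scores: reported to dashboards, never block on their own.
    avg_accuracy = sum(r.factual_accuracy for r in results) / len(results)
    return True, f"passed; mean factual accuracy {avg_accuracy:.2f} (monitored)"
```

For example, a batch where every answer carries a citation and mean latency stays under budget passes even if rubric scores dip, while one missing citation fails the batch outright.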
Tags:
#models
#AI
#Machine Learning
#Dash
#Testing
