Dropbox Tech Blog | AI

Using LLMs to amplify human labeling and improve Dash search relevance

2026-02-26
13 min read
by Dmitriy Meyerzon

Endigest AI Core Summary

This post explains how Dropbox Dash trains its search ranking model by combining small-scale human labeling with LLM-generated relevance judgments to produce training data at scale.

  • Dash follows a RAG pattern where enterprise search retrieves candidate documents before an LLM generates answers, making search ranking quality critical to overall response quality
  • The ranking model uses XGBoost trained on query-document pairs annotated with 1-5 relevance scores, where higher scores indicate closer alignment with user intent
  • Human labeling is expensive and hard to scale, while LLMs offer cheaper and more consistent relevance judgments across large multilingual datasets
  • A small human-labeled dataset is used to tune LLM prompts and validate quality thresholds before deploying the LLM to generate hundreds of thousands to millions of training labels
  • LLM accuracy is measured via mean squared error against human judgments, and document sampling prioritizes cases where LLM predictions di
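
The summary says LLM accuracy is measured as mean squared error against human judgments on the 1-5 relevance scale, and that the small human-labeled set is used to tune prompts and validate a quality threshold. A minimal sketch of that comparison follows. The function names, the sample scores, the prompt labels, and the 0.5 threshold are all hypothetical illustrations, not values or code from the Dash team.

```python
def mean_squared_error(human_scores, llm_scores):
    """Average squared gap between human and LLM relevance scores (1-5 scale)."""
    if not human_scores or len(human_scores) != len(llm_scores):
        raise ValueError("score lists must be non-empty and the same length")
    squared_gaps = [(h - l) ** 2 for h, l in zip(human_scores, llm_scores)]
    return sum(squared_gaps) / len(squared_gaps)


def pick_best_prompt(human_scores, llm_scores_by_prompt):
    """Return (prompt_name, mse) for the prompt whose labels best match humans."""
    mse_by_prompt = {
        name: mean_squared_error(human_scores, scores)
        for name, scores in llm_scores_by_prompt.items()
    }
    best = min(mse_by_prompt, key=mse_by_prompt.get)
    return best, mse_by_prompt[best]


# Hypothetical human labels for five query-document pairs.
human = [5, 3, 1, 4, 2]

# Hypothetical LLM labels for the same pairs under two candidate prompts.
candidates = {
    "prompt_a": [4, 3, 2, 4, 2],
    "prompt_b": [5, 1, 1, 2, 2],
}

# Assumed quality threshold; the source does not state the real one.
MAX_ACCEPTABLE_MSE = 0.5

best_prompt, best_mse = pick_best_prompt(human, candidates)
ready_to_scale = best_mse <= MAX_ACCEPTABLE_MSE
print(best_prompt, round(best_mse, 2), ready_to_scale)
```

Squaring the gap penalizes a two-point miss four times as heavily as a one-point miss, so a prompt that is occasionally far off scores worse than one that is frequently off by a single grade.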
Tags:
#LLM
#models
#Search
#Machine Learning
#Dash
#RAG