Using LLMs to amplify human labeling and improve Dash search relevance
2026-02-26
13 min read
by Dmitriy Meyerzon
Endigest AI Core Summary
This post explains how Dropbox Dash trains its search ranking model by combining small-scale human labeling with LLM-generated relevance judgments to produce training data at scale.
- Dash follows a RAG pattern where enterprise search retrieves candidate documents before an LLM generates answers, making search ranking quality critical to overall response quality
- The ranking model uses XGBoost trained on query-document pairs annotated with 1-5 relevance scores, where higher scores indicate closer alignment with user intent
- Human labeling is expensive and hard to scale, while LLMs offer cheaper and more consistent relevance judgments across large multilingual datasets
- A small human-labeled dataset is used to tune LLM prompts and validate quality thresholds before deploying the LLM to generate hundreds of thousands to millions of training labels
- LLM accuracy is measured via mean squared error against human judgments, and document sampling prioritizes cases where LLM predictions diverge from the human labels
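The validation step the summary describes — scoring LLM labels against a small human-labeled set with mean squared error, then surfacing the largest disagreements for review — can be sketched roughly as follows. This is an illustrative sketch, not Dropbox's implementation; the function names, the example scores, and the disagreement threshold are all hypothetical:

```python
# Hypothetical sketch: compare LLM relevance judgments (1-5) against human
# labels on a validation set, then flag large disagreements for human review.
# All names, data, and the min_gap threshold are illustrative.

def mse(llm_scores, human_scores):
    """Mean squared error between paired 1-5 relevance judgments."""
    assert len(llm_scores) == len(human_scores) and llm_scores
    return sum((l - h) ** 2 for l, h in zip(llm_scores, human_scores)) / len(llm_scores)

def disagreements(pairs, llm_scores, human_scores, min_gap=2):
    """Query-document pairs where LLM and human labels differ by >= min_gap."""
    return [p for p, l, h in zip(pairs, llm_scores, human_scores)
            if abs(l - h) >= min_gap]

# Toy validation set of query-document pair IDs with both label sources.
pairs = ["q1-d1", "q1-d2", "q2-d1", "q2-d2", "q3-d1"]
human = [5, 4, 1, 3, 2]
llm   = [5, 3, 3, 3, 2]

print(mse(llm, human))                   # 1.0
print(disagreements(pairs, llm, human))  # ['q2-d1']
```

In this framing, a low MSE on the human-labeled set gates whether the prompt is trusted to label the full corpus, and the disagreement list drives where additional human judgments are spent.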
Tags:
#LLM
#models
#Search
#Machine Learning
#Dash
#RAG
