How Superhuman and Databricks built a 200K QPS inference platform together | Endigest
Superhuman and Databricks partnered to build a 200K QPS inference platform serving grammatical error correction at massive scale.
- Achieved peak traffic of 200K+ QPS with P99 latency under 1 second and 99.99% reliability using Databricks model serving
- Migrated from a DIY vLLM-based serving stack to Databricks model serving to improve operational efficiency and performance consistency
- Implemented a power-of-two-choices load balancing algorithm in the Endpoint Discovery Service to eliminate hotspots at high QPS
- Accelerated container startup via a lazy-loading image format, reducing pod start time from minutes to seconds for dynamic scaling
- Improved per-pod throughput by 60% (750 to 1,200 QPS) through FP8 quantization (30% gain) and a multiprocessing runtime (20% gain)
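The power-of-two-choices technique mentioned above routes each request by sampling two backends at random and picking the less-loaded one, which keeps load tightly balanced without the coordination cost of a global least-loaded scan. The article does not show the Endpoint Discovery Service implementation; the sketch below is a minimal illustration of the general algorithm, with hypothetical pod names and an in-memory load counter standing in for real load metrics.

```python
import random

def pick_backend(backends, load):
    """Power-of-two-choices: sample two distinct backends at random
    and route to whichever currently has lower load.
    `load` is a hypothetical stand-in for a real load signal
    (e.g. in-flight request count per pod)."""
    a, b = random.sample(backends, 2)
    return a if load[a] <= load[b] else b

# Simulate routing 10,000 requests across four pods.
backends = ["pod-1", "pod-2", "pod-3", "pod-4"]
load = {b: 0 for b in backends}
for _ in range(10_000):
    chosen = pick_backend(backends, load)
    load[chosen] += 1
```

Even though each decision looks at only two pods, the resulting per-pod load spread stays very small, which is why the technique avoids hotspots at high QPS far better than uniform random routing while staying O(1) per request.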
This summary was automatically generated by AI based on the original article and may not be fully accurate.