How Superhuman and Databricks built a 200K QPS inference platform together | Endigest
Superhuman and Databricks partnered to build a 200K QPS inference platform serving grammatical error correction at massive scale.
- Achieved peak traffic of 200K+ QPS with P99 latency under 1 second and 99.99% reliability using Databricks model serving
- Migrated from a DIY vLLM-based serving stack to Databricks model serving to improve operational efficiency and performance consistency
- Implemented a power-of-two-choices load balancing algorithm in the Endpoint Discovery Service to eliminate hotspots at high QPS
- Accelerated container startup via a lazy-loading image format, reducing pod start time from minutes to seconds for dynamic scaling
- Improved per-pod throughput by 60% (750 to 1,200 QPS) through FP8 quantization (30% gain) and a multiprocessing runtime (20% gain)
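The power-of-two-choices technique mentioned above routes each request by sampling two backends at random and picking the less-loaded one, which keeps load tightly balanced without the coordination cost of a global least-loaded scan. The article does not show the Endpoint Discovery Service implementation; the sketch below is a minimal illustration of the general algorithm, with hypothetical pod names and an in-memory load counter standing in for real load metrics.

```python
import random

def pick_backend(backends, load):
    """Power-of-two-choices: sample two distinct backends at random
    and route to whichever currently has lower load.
    `load` is a hypothetical stand-in for a real load signal
    (e.g. in-flight request count per pod)."""
    a, b = random.sample(backends, 2)
    return a if load[a] <= load[b] else b

# Simulate routing 10,000 requests across four pods.
backends = ["pod-1", "pod-2", "pod-3", "pod-4"]
load = {b: 0 for b in backends}
for _ in range(10_000):
    chosen = pick_backend(backends, load)
    load[chosen] += 1
```

Even though each decision looks at only two pods, the resulting per-pod load spread stays very small, which is why the technique avoids hotspots at high QPS far better than uniform random routing while staying O(1) per request.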
This summary was automatically generated by AI based on the original article and may not be fully accurate.