Databricks operates a large-scale LLM inference platform that serves frontier models and processes more than 120 trillion tokens per month.
- Supports OpenAI, Gemini, Claude, and open-source models for major customer applications
- Introduces "model units" to estimate request costs based on token distribution and hardware type
- Uses cost-based load balancing (Dicer) with model unit metrics for optimal routing
- Achieves 80% GPU savings through autoscaling based on model unit utilization
- Detects failures using black-box health checks with priority scheduling
This summary was automatically generated by AI based on the original article and may not be fully accurate.
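
The summary names model units, cost-based routing, and utilization-based autoscaling without showing how they fit together. The sketch below is one way the three ideas could connect. It is not Dicer's algorithm or Databricks' actual cost formula. The per-token weights, the hardware names `gpu_a` and `gpu_b`, the `Replica` class, and the functions `model_units`, `route`, and `desired_replicas` are all hypothetical and chosen only for illustration.

```python
import math
from dataclasses import dataclass

# Hypothetical per-token weights by hardware type (an assumption, not from
# the article). Output tokens are weighted more heavily than input tokens
# because decoding is typically the more expensive phase.
HARDWARE_WEIGHTS = {
    "gpu_a": {"input": 1.0, "output": 4.0},
    "gpu_b": {"input": 0.5, "output": 2.0},
}


def model_units(input_tokens: int, output_tokens: int, hardware: str) -> float:
    """Estimate a request's cost in model units per 1000 weighted tokens."""
    w = HARDWARE_WEIGHTS[hardware]
    return (input_tokens * w["input"] + output_tokens * w["output"]) / 1000.0


@dataclass
class Replica:
    name: str
    hardware: str
    capacity: float         # model units the replica can hold in flight
    in_flight: float = 0.0  # model units currently assigned


def route(replicas: list[Replica], input_tokens: int, output_tokens: int) -> Replica:
    """Send the request to the replica with the lowest projected utilization."""
    def projected(r: Replica) -> float:
        cost = model_units(input_tokens, output_tokens, r.hardware)
        return (r.in_flight + cost) / r.capacity

    best = min(replicas, key=projected)
    best.in_flight += model_units(input_tokens, output_tokens, best.hardware)
    return best


def desired_replicas(total_units: float, capacity: float, target_utilization: float) -> int:
    """Replica count that keeps average utilization at or below the target."""
    return max(1, math.ceil(total_units / (capacity * target_utilization)))


if __name__ == "__main__":
    fleet = [Replica("a", "gpu_a", 10.0, 6.0), Replica("b", "gpu_a", 10.0, 2.0)]
    chosen = route(fleet, input_tokens=1000, output_tokens=500)
    print(chosen.name, chosen.in_flight)
    print(desired_replicas(36.0, 10.0, 0.5))
```

Routing on projected model-unit utilization rather than request count means one long-generation request weighs as much as many short ones. The same utilization figure can then drive the scaling decision in `desired_replicas`.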