This article introduces Meta's Adaptive Ranking Model for serving LLM-scale ad recommendations at sub-second latency.
- Addresses the inference trilemma: balancing model complexity, latency, and cost at billion-user scale
- Request-Oriented Optimization computes user signals once per request, making scaling costs sub-linear
- Wukong Turbo architecture mitigates numeric instability via a No-Bias approach and small parameter delegation
- Selective FP8 post-training quantization improves hardware throughput with negligible quality loss
- Multi-card GPU infrastructure enables O(1T) parameter scaling; launched on Instagram in Q4 2025 with +3% ad conversions and +5% CTR
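The Request-Oriented Optimization idea can be illustrated with a toy sketch (not Meta's implementation; `encode_user`, `score`, and `rank_request` are hypothetical names): the expensive user-side computation runs once per request, so per-candidate cost is only the cheap scoring step.

```python
# Illustrative sketch of request-level user-signal reuse. The user
# "tower" is a stand-in for an expensive model; it runs ONCE per request,
# while each candidate ad only pays for a cheap dot-product score.
def encode_user(user_features):
    # Stand-in for an expensive user-side model producing an embedding.
    return [sum(user_features), max(user_features)]

def score(user_emb, ad_emb):
    # Cheap per-candidate scoring: dot product of the two embeddings.
    return sum(u * a for u, a in zip(user_emb, ad_emb))

def rank_request(user_features, candidate_ads):
    user_emb = encode_user(user_features)  # computed once for the request
    return sorted(candidate_ads,
                  key=lambda ad: score(user_emb, ad),
                  reverse=True)
```

With N candidate ads per request, the expensive user computation is amortized across all N, which is the sense in which scaling cost becomes sub-linear in candidates.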
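Selective FP8 post-training quantization can likewise be sketched in miniature (again hypothetical, not the article's code): weights in chosen layers are rounded to an fp8 e4m3-style grid with a 3-bit mantissa, while the remaining layers stay in full precision.

```python
import math

def quantize_e4m3(x, scale=1.0):
    # Toy fp8 e4m3-style rounding: keep 4 significant bits (1 implicit
    # + 3 mantissa bits), ignoring exponent range and special values.
    v = x / scale
    if v == 0.0:
        return 0.0
    m, e = math.frexp(v)       # v = m * 2**e with 0.5 <= |m| < 1
    m_q = round(m * 16) / 16   # snap the significand to the 4-bit grid
    return m_q * (2 ** e) * scale

def quantize_weights(layers, fp8_layers):
    # The "selective" part: only layers named in fp8_layers are
    # quantized; all others keep their original full-precision weights.
    out = {}
    for name, weights in layers.items():
        if name in fp8_layers:
            out[name] = [quantize_e4m3(w) for w in weights]
        else:
            out[name] = list(weights)
    return out
```

For example, `quantize_e4m3(0.3)` lands on 0.3125, the nearest point on the 4-bit-significand grid; the small rounding error per weight is the "negligible quality loss" traded for higher FP8 hardware throughput.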
This summary was automatically generated by AI based on the original article and may not be fully accurate.