Beyond Two Towers: Re-architecting the Serving Stack for Next-Gen Ads Lightweight Ranking Models…
2026-02-02
10 min read
by Pinterest Engineering
Endigest AI Core Summary
Pinterest's ads team describes how they re-architected their serving stack to replace the Two-Tower model with more expressive neural networks capable of deep feature interactions.
- Two-Tower models are efficient but cannot leverage interaction features, target attention, or early feature crossing between user and item representations
- Features for O(1M) high-value candidates are embedded directly in the model file as PyTorch registered buffers, eliminating network I/O and host-to-GPU transfer overhead
- Business logic (utility calculation, diversity rules, top-k selection) was moved inside the PyTorch model, reducing GPU-to-CPU data transfer from O(100K) to O(1K) documents
- GPU inference latency was reduced from 4000 ms p90 to 20 ms via multi-stream CUDA, worker-to-core alignment, Triton kernel fusion, and BF16 precision
- Retrieval data flow was restructured to return only IDs and bids in a column-wise format first, deferring the heavy metadata fetch until after ranking has reduced the candidate set
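
The last point, retrieving only IDs and bids column-wise and fetching metadata after ranking, can be illustrated with a small sketch in plain Python. This is not Pinterest's implementation: `RetrievedColumns`, `rank_then_hydrate`, the toy scoring function, and the utility formula (score times bid) are all hypothetical names and assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass
class RetrievedColumns:
    """Column-wise retrieval result: parallel lists, one entry per candidate."""
    ids: List[int]
    bids: List[float]


def rank_then_hydrate(
    columns: RetrievedColumns,
    score_fn: Callable[[Sequence[int]], List[float]],
    fetch_metadata: Callable[[Sequence[int]], Dict[int, dict]],
    k: int,
) -> List[dict]:
    # Stage 1: score every candidate from its ID alone; no metadata is needed yet.
    scores = score_fn(columns.ids)
    # Assumed utility for this sketch: model score multiplied by bid.
    utilities = [s * b for s, b in zip(scores, columns.bids)]
    # Stage 2: keep the indices of the k highest-utility candidates.
    order = sorted(range(len(utilities)), key=lambda i: utilities[i], reverse=True)[:k]
    top_ids = [columns.ids[i] for i in order]
    # Stage 3: fetch heavy metadata only for the survivors.
    metadata = fetch_metadata(top_ids)
    return [
        {"id": columns.ids[i], "utility": utilities[i], **metadata[columns.ids[i]]}
        for i in order
    ]


# Toy stand-ins for the ranking model and the metadata store (hypothetical).
TOY_SCORES = {101: 0.1, 102: 0.9, 103: 0.2, 104: 0.8}
fetch_calls: List[List[int]] = []


def toy_score(ids: Sequence[int]) -> List[float]:
    return [TOY_SCORES[i] for i in ids]


def toy_fetch(ids: Sequence[int]) -> Dict[int, dict]:
    fetch_calls.append(list(ids))  # record how many IDs were hydrated
    return {i: {"title": f"ad-{i}"} for i in ids}


columns = RetrievedColumns(ids=[101, 102, 103, 104], bids=[2.0, 1.0, 4.0, 0.5])
result = rank_then_hydrate(columns, toy_score, toy_fetch, k=2)
print([r["id"] for r in result])
```

The metadata store is called once, and only for the two surviving candidates, which is the point of deferring the fetch: the cost of hydration scales with the post-ranking set rather than with the full retrieved set.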
Tags:
#engineering
#pinterest
#monetization
