Drastically Reducing Out-of-Memory Errors in Apache Spark at Pinterest
2026-02-17
16 min read
by Pinterest Engineering
Endigest AI Core Summary
This post describes Pinterest's Auto Memory Retries feature for Apache Spark, which automatically retries tasks that fail with out-of-memory (OOM) errors on larger executors, reducing job failures and wasted resources.
- Over 4.6% of Pinterest's 90k+ daily Spark jobs fail due to OOM errors, consuming significant compute and creating on-call burden
- The hybrid retry strategy first doubles CPU-per-task (letting the task monopolize executor memory), then launches physically larger executors (2x, 3x, 4x profiles) if OOM persists
- Core Spark classes (Task, TaskSetManager, TaskSchedulerImpl, ExecutorAllocationManager) were extended rather than using a Spark listener approach for finer scheduling control
- Off-heap memory (used with Apache Gluten/Velox) is also doubled during retries; SparkUI was updated to display task resource profile IDs
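
The hybrid retry strategy in the summary can be read as an escalation ladder. The Python sketch below is only an illustration of that reading, not Pinterest's implementation (which lives inside Spark's Scala scheduler classes). The names `RetryStep`, `EXECUTOR_SCALES`, and `next_step_after_oom` are hypothetical, and the sketch assumes one escalation step per OOM failure, which the summary does not state. Off-heap doubling is not modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RetryStep:
    action: str  # "double_task_cpus" or "larger_executor" (hypothetical labels)
    factor: int  # multiplier applied by that action


# Executor size multipliers named in the summary (2x, 3x, 4x profiles).
EXECUTOR_SCALES = (2, 3, 4)


def next_step_after_oom(oom_failures: int) -> Optional[RetryStep]:
    """Return the escalation step for a task that has hit OOM `oom_failures` times.

    Assumption: each OOM failure moves the task one rung up the ladder.
    Returns None once the ladder is exhausted; the summary does not say
    what happens after the 4x profile.
    """
    if oom_failures < 1:
        raise ValueError("oom_failures must be at least 1")
    if oom_failures == 1:
        # First OOM: stay on the existing executor but reserve twice the CPU
        # slots, so fewer tasks share that executor's memory.
        return RetryStep("double_task_cpus", 2)
    index = oom_failures - 2
    if index < len(EXECUTOR_SCALES):
        # Later OOMs: move to the next physically larger executor profile.
        return RetryStep("larger_executor", EXECUTOR_SCALES[index])
    return None


if __name__ == "__main__":
    for failures in range(1, 6):
        print(failures, next_step_after_oom(failures))
```

Trying the cheap step first (more CPU slots on an executor that already exists) before requesting larger executors matches the ordering the summary describes; the exact thresholds and limits are in the original post.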
Tags:
#engineering
#data
#pinterest
#apache-spark
#open-source
