Pinterest Engineering Blog - Medium

Drastically Reducing Out-of-Memory Errors in Apache Spark at Pinterest

2026-02-17
16 min read
by Pinterest Engineering

Endigest AI Core Summary

This post describes Pinterest's Auto Memory Retries feature for Apache Spark, which automatically retries OOM-failed tasks on larger executors to reduce failures and resource waste.

  • More than 4.6% of Pinterest's 90k+ daily Spark jobs (upward of 4,000 jobs a day) fail with OOM errors, wasting significant compute and adding to on-call burden
  • The hybrid retry strategy first doubles CPU-per-task (letting the task monopolize executor memory), then launches physically larger executors (2x, 3x, 4x profiles) if OOM persists
  • Pinterest extended core Spark classes (Task, TaskSetManager, TaskSchedulerImpl, ExecutorAllocationManager) instead of relying on a Spark listener, because modifying the scheduler directly gives finer control over scheduling
  • Off-heap memory (used with Apache Gluten/Velox) is also doubled during retries, and the Spark UI was updated to display each task's resource profile ID
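
The hybrid retry strategy in the summary can be read as a fixed escalation ladder. The sketch below is a minimal illustration in plain Python, not Pinterest's implementation and not Spark API: `RetryStep`, `LADDER`, and `next_step` are hypothetical names, and the assumption that each repeated OOM failure moves the task exactly one rung up is ours.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RetryStep:
    """One rung of the escalation ladder (hypothetical model)."""
    action: str  # "scale_task_cpus" or "larger_executor"
    factor: int  # multiplier relative to the job's default setting


# Order taken from the summary: double CPU-per-task first, then move to
# physically larger executors (2x, 3x, 4x profiles) if OOM persists.
LADDER = (
    RetryStep("scale_task_cpus", 2),
    RetryStep("larger_executor", 2),
    RetryStep("larger_executor", 3),
    RetryStep("larger_executor", 4),
)


def next_step(oom_failures: int) -> Optional[RetryStep]:
    """Return the escalation for a task that has hit OOM `oom_failures` times.

    Returns None once the ladder is exhausted. Assumption: one rung per
    OOM failure; the summary does not spell out the exact mapping.
    """
    if oom_failures < 1:
        raise ValueError("oom_failures must be at least 1")
    index = oom_failures - 1
    return LADDER[index] if index < len(LADDER) else None


if __name__ == "__main__":
    for failures in range(1, 6):
        print(failures, next_step(failures))
```

Keeping the ladder as data rather than branching logic makes the order of escalation easy to inspect and change, which matters because the first rung is cheap (no new executors) while later rungs require launching larger ones.
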
Tags:
#engineering
#data
#pinterest
#apache-spark
#open-source