Pinterest Engineering Blog - Medium
Data Engineering

Next Generation DB Ingestion at Pinterest

2026-02-05
10 min read
by Pinterest Engineering

Endigest AI Core Summary

Pinterest describes its next-generation database ingestion framework built on CDC, Kafka, Flink, Spark, and Iceberg to replace legacy batch-based pipelines.

  • Legacy batch systems suffered from data latency of 24+ hours, ran full-table scans even though daily change rates were under 5%, had no row-level deletion support, and relied on fragmented pipelines
  • The new unified CDC-based framework uses Debezium/TiCDC to capture changes, writes events to Kafka in under one second, and processes them via Flink into CDC Iceberg tables on S3
  • Spark jobs run every 15 minutes and use MERGE INTO statements to upsert the captured changes into a base Iceberg table, reducing latency from hours to between 15 minutes and 1 hour
  • Merge-on-Read (MOR) was chosen over Copy-on-Write (COW) due to significantly lower storage and compute costs
  • Partitioning base tables by a hash of the primary key (the bucket function) lets upserts run in parallel across partitions; WRITE DISTRIBUTED BY PARTITION was added to mitigate the small-file explosion
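
The upsert and bucketing ideas in the summary can be illustrated with a minimal in-memory sketch. It is not Pinterest's implementation and does not use Spark or Iceberg: the event fields (`pk`, `op`, `seq`, `row`), the bucket count, and all function names are hypothetical, and CRC32 stands in for Iceberg's actual bucket hash. The sketch keeps only the newest change per primary key, routes each key to a bucket, and applies inserts, updates, and deletes to that bucket, which is roughly what a MERGE INTO upsert against a bucket-partitioned base table does.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical bucket count, chosen only for illustration


def bucket_for(primary_key, num_buckets=NUM_BUCKETS):
    """Map a primary key to a bucket. CRC32 is an assumption standing in
    for the real bucket hash; any stable hash shows the same idea."""
    return zlib.crc32(str(primary_key).encode("utf-8")) % num_buckets


def latest_change_per_key(cdc_events):
    """Keep only the newest event per primary key (highest 'seq')."""
    latest = {}
    for event in cdc_events:
        current = latest.get(event["pk"])
        if current is None or event["seq"] > current["seq"]:
            latest[event["pk"]] = event
    return latest


def merge_into(base, cdc_events):
    """Apply deduplicated changes to the base table.

    `base` maps bucket -> {primary key -> row}. Returns the set of buckets
    that were touched; each bucket is independent of the others, so a real
    engine could process them in parallel.
    """
    touched = set()
    for pk, event in latest_change_per_key(cdc_events).items():
        bucket = bucket_for(pk)
        touched.add(bucket)
        if event["op"] == "delete":
            base[bucket].pop(pk, None)      # row-level delete
        else:
            base[bucket][pk] = event["row"]  # insert or update (upsert)
    return touched


def snapshot(base):
    """Flatten all buckets into one {primary key -> row} view."""
    rows = {}
    for bucket_rows in base.values():
        rows.update(bucket_rows)
    return rows


# Hypothetical batch of change events, as one 15-minute run might see them.
base = defaultdict(dict)
events = [
    {"pk": 1, "op": "insert", "seq": 1, "row": {"name": "a"}},
    {"pk": 2, "op": "insert", "seq": 2, "row": {"name": "b"}},
    {"pk": 1, "op": "update", "seq": 3, "row": {"name": "a2"}},
    {"pk": 2, "op": "delete", "seq": 4, "row": None},
]
merge_into(base, events)
print(snapshot(base))  # {1: {'name': 'a2'}}
```

Only the rows named in the change events are read or written here, which mirrors why an incremental upsert is cheaper than a full-table scan when few rows change per day.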
Tags:
#pinterest
#icebergs
#change-data-capture
#spark
#engineering