Data Engineering

Next Generation DB Ingestion at Pinterest

2026-02-05

10 min read

by Pinterest Engineering

Tags:

icebergs

change-data-capture

spark

engineering

Endigest AI Core Summary

Pinterest describes its next-generation database ingestion framework built on CDC, Kafka, Flink, Spark, and Iceberg to replace legacy batch-based pipelines.

• Legacy batch systems suffered from 24+ hour data latency, full-table scans despite <5% daily change rates, no row-level deletion support, and fragmented pipelines
• The new unified CDC-based framework uses Debezium/TiCDC to capture changes, writes events to Kafka in under one second, and processes them via Flink into CDC Iceberg tables on S3
• Spark jobs run every 15 minutes using MERGE INTO statements to upsert a base Iceberg table, reducing latency from hours to 15 minutes–1 hour
• Merge-on-Read (MOR) was chosen over Copy-on-Write (COW) due to significantly lower storage and compute costs
• Partitioning base tables by primary key hash (bucket function) parallelizes upserts across partitions; WRITE DISTRIBUTED BY PARTITION was added to mitigate small file explosion
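
The upsert and bucketing steps above can be illustrated with a small in-memory sketch. It is not Pinterest's Spark job and does not use Iceberg. It only mimics what a MERGE INTO over a bucket-partitioned base table does logically: keep the newest change per primary key, then insert, update, or delete that key within its own bucket. The event fields (`id`, `op`, `ts`, `row`), the bucket count, and the CRC32 hash are all assumptions made for illustration; Iceberg's own bucket transform uses a different hash, and the summary does not say how events are deduplicated.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical bucket count, not taken from the article


def bucket_of(primary_key: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a primary key to a bucket (CRC32 is an illustrative stand-in hash)."""
    return zlib.crc32(primary_key.encode("utf-8")) % num_buckets


def latest_per_key(events):
    """Keep only the newest change event for each primary key (assumed dedup step)."""
    latest = {}
    for ev in events:
        current = latest.get(ev["id"])
        if current is None or ev["ts"] >= current["ts"]:
            latest[ev["id"]] = ev
    return latest


def merge_into(base, events):
    """Apply CDC events to `base` (bucket -> {id: row}) in place, bucket by bucket."""
    by_bucket = defaultdict(list)
    for ev in latest_per_key(events).values():
        by_bucket[bucket_of(ev["id"])].append(ev)

    # Each bucket touches only its own rows, so buckets could be merged in parallel.
    for bucket, bucket_events in by_bucket.items():
        rows = base.setdefault(bucket, {})
        for ev in bucket_events:
            if ev["op"] == "delete":
                rows.pop(ev["id"], None)   # row-level delete
            else:
                rows[ev["id"]] = ev["row"]  # insert or update (upsert)
    return base


def flatten(base):
    """Collapse the bucketed table into a single {id: row} view for reading."""
    return {key: row for rows in base.values() for key, row in rows.items()}


# Two merge cycles, standing in for two consecutive 15-minute runs.
base_table = {}
merge_into(base_table, [
    {"id": "pin-1", "op": "insert", "ts": 1, "row": {"title": "a"}},
    {"id": "pin-2", "op": "insert", "ts": 2, "row": {"title": "b"}},
    {"id": "pin-1", "op": "update", "ts": 3, "row": {"title": "a2"}},
])
merge_into(base_table, [
    {"id": "pin-2", "op": "delete", "ts": 4, "row": None},
])
snapshot = flatten(base_table)
```

After both cycles, `snapshot` holds only `pin-1` with its updated row: the update superseded the earlier insert, and the delete removed `pin-2`. Because every key hashes to exactly one bucket, no two buckets ever write the same row, which is the property that lets the upsert run in parallel across partitions.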
