Spark Declarative Pipelines (SDP) extends declarative data processing from individual queries to entire pipelines in Apache Spark, reducing operational burden for data engineering teams.
- Data engineers currently spend most of their time on operational glue work (orchestration, incremental processing, data quality, backfills) rather than business logic
- SDP lets engineers declare what datasets should exist, while the framework handles dependency inference, execution ordering, incremental updates, and failure recovery automatically
- A weekly sales pipeline that requires hundreds of lines in PySpark or dbt with external tools like Airflow can be expressed in ~20 lines with SDP
- Built-in capabilities include automatic incremental processing, inline data quality via @dp.expect_or_drop, dependency tracking, retries, and a monitoring UI; no external orchestrator is needed
- SDP ships with Python and SQL APIs, batch and streaming support, and a CLI for scaffolding, validating, and running pipelines
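To make the declarative model concrete, here is a minimal, self-contained toy sketch of the idea, not the real SDP API: dataset definitions are registered with a decorator, the "framework" infers a run order from declared dependencies, and a data-quality rule drops bad rows inline (analogous to @dp.expect_or_drop). All names here are hypothetical illustrations.

```python
# Toy sketch of declarative dataset registration and dependency-ordered
# execution. This is NOT the SDP API, just an illustration of the model.
from graphlib import TopologicalSorter

_datasets = {}  # dataset name -> (definition function, upstream dependencies)

def dataset(*, depends_on=()):
    """Register a dataset definition; run order is decided by the framework."""
    def wrap(fn):
        _datasets[fn.__name__] = (fn, tuple(depends_on))
        return fn
    return wrap

@dataset()
def raw_sales():
    # Hypothetical source data, including one bad row
    return [{"week": 1, "amount": 100}, {"week": 1, "amount": -5}]

@dataset(depends_on=("raw_sales",))
def clean_sales():
    # Inline data-quality rule: drop rows failing the expectation,
    # analogous to @dp.expect_or_drop("positive_amount", "amount > 0")
    return [r for r in raw_sales() if r["amount"] > 0]

@dataset(depends_on=("clean_sales",))
def weekly_sales():
    totals = {}
    for r in clean_sales():
        totals[r["week"]] = totals.get(r["week"], 0) + r["amount"]
    return totals

def run_pipeline():
    # The engineer never wrote this ordering logic: it is inferred
    # from the declared dependencies via a topological sort.
    graph = {name: deps for name, (_, deps) in _datasets.items()}
    order = list(TopologicalSorter(graph).static_order())
    return order, {name: _datasets[name][0]() for name in order}

order, results = run_pipeline()
```

The point of the sketch is the division of labor: each function states only what its dataset is, while ordering, wiring, and the quality rule live in reusable framework code. SDP applies the same idea at Spark scale, adding incremental processing, retries, and monitoring on top.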