DazzleDuck SQL Spark Integration

The DazzleDuck SQL Spark Integration enables Apache Spark SQL to query data served by DazzleDuck SQL Server over Apache Arrow Flight SQL. It allows Spark to treat remote DuckDB-backed datasets as native Spark tables, without copying data into Spark-managed storage.

This module is designed for distributed analytics, where Spark handles execution planning and parallelism, while DazzleDuck acts as a high-performance, Arrow-native storage and query backend.

Purpose

This integration exists to:

Query remote DuckDB / DuckLake / Parquet data directly from Spark
Avoid data duplication between DuckDB and Spark
Use Arrow Flight SQL for efficient, columnar data transfer without row-level conversion
Enable Spark-based analytics on top of DazzleDuck-managed warehouses

It is particularly useful when:

Spark is used for large-scale joins or ML pipelines
DuckDB/DazzleDuck is used for ingestion, compaction, and storage
You want Spark and DuckDB to coexist in the same data architecture

What This Module Provides

A Spark DataSource V2 implementation
An Arrow Flight SQL–backed table provider
Partition-aware reading for parallel Spark execution
Predicate- and aggregation-aware query planning

Spark sees DazzleDuck tables as temporary views backed by a remote Arrow RPC source.
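
As a rough illustration of how such a temporary view might be registered, the sketch below renders a Spark SQL CREATE TEMPORARY VIEW ... USING ... OPTIONS statement. Only the ArrowRPCTableProvider name comes from this page. The view name, the url and path option keys, and the endpoint address are placeholders chosen for illustration, not documented settings of this module, and the provider would normally be given by its fully qualified class name.

```python
def build_temp_view_ddl(view_name, provider, options):
    """Render a Spark SQL CREATE TEMPORARY VIEW statement for a data source provider."""

    def quote(value):
        # Spark SQL string literals: escape backslashes first, then single quotes.
        return "'" + str(value).replace("\\", "\\\\").replace("'", "\\'") + "'"

    rendered = ",\n".join(f"  {key} {quote(val)}" for key, val in options.items())
    return (
        f"CREATE TEMPORARY VIEW {view_name}\n"
        f"USING {provider}\n"
        f"OPTIONS (\n{rendered}\n)"
    )


# All values below are hypothetical placeholders.
ddl = build_temp_view_ddl(
    "events",
    "ArrowRPCTableProvider",  # replace with the provider's fully qualified class name
    {
        "url": "grpc://localhost:59307",  # assumed endpoint, not a documented default
        "path": "/warehouse/events",      # assumed option key and value
    },
)
print(ddl)
```

With the connector on Spark's classpath, a statement of this shape would be passed to spark.sql(ddl) on an active SparkSession, after which the view can be queried like any other Spark table.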

High-Level Architecture

Spark SQL
  │
  │ DataSource V2
  ▼
ArrowRPCTableProvider
  │
  │ Arrow Flight SQL (gRPC)
  ▼
DazzleDuck SQL Server
  │
  │ DuckDB / DuckLake
  ▼
Warehouse (Parquet / DuckLake)

Spark drives query execution
DazzleDuck serves Arrow record batches
Data remains columnar end-to-end
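
The division of labor above can be sketched in plain Python. In Arrow Flight, a client first asks the server for a description of a result set and receives a list of endpoints, each carrying an opaque ticket; it then streams record batches per ticket. The sketch assumes one Spark read task per endpoint, which is an illustrative guess rather than a statement of how this module plans partitions. Endpoint, ReadTask, and the host names are hypothetical stand-ins, and threads stand in for Spark executors.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Endpoint:
    """Stand-in for a Flight endpoint: an opaque ticket plus a server location."""
    ticket: bytes
    location: str


@dataclass(frozen=True)
class ReadTask:
    """Stand-in for one Spark input partition."""
    index: int
    endpoint: Endpoint


def plan_read_tasks(endpoints: List[Endpoint]) -> List[ReadTask]:
    # Assumption: each endpoint becomes exactly one parallel read task.
    return [ReadTask(i, ep) for i, ep in enumerate(endpoints)]


def run_tasks(tasks: List[ReadTask], fetch: Callable[[Endpoint], int]) -> List[int]:
    # Threads play the role of executors; map() returns results in task order.
    with ThreadPoolExecutor(max_workers=max(1, len(tasks))) as pool:
        return list(pool.map(lambda task: fetch(task.endpoint), tasks))


endpoints = [
    Endpoint(b"ticket-0", "grpc://host-a:59307"),  # hypothetical locations
    Endpoint(b"ticket-1", "grpc://host-b:59307"),
]
tasks = plan_read_tasks(endpoints)

# A fake fetch that returns a number instead of streaming Arrow record batches.
results = run_tasks(tasks, lambda ep: len(ep.ticket))
```

In a real reader, the fetch step would open a stream for the ticket and hand Arrow record batches to Spark as columnar batches, so no row-by-row conversion is needed along the way.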

Relationship to DazzleDuck SQL Server

This project is not part of dazzleduck-sql-server, but depends on it at runtime.

Component	Responsibility
Spark	Distributed execution & scheduling
DazzleDuck SQL Server	Storage, ingestion, Arrow Flight SQL
Arrow Flight	Transport layer

Typical Use Cases

Distributed analytics over Parquet or DuckLake
Spark-based ETL reading from DuckDB-managed data
Federated architectures (DuckDB for storage, Spark for compute)
Large partitioned reads with predicate pushdown
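
To make predicate pushdown concrete, here is a minimal sketch of how a connector could fold Spark's selected columns and simple comparison filters into the SQL sent to the remote engine, and hand anything it cannot translate back to Spark for evaluation after the scan. The function names, the filter tuple shape, and the set of supported operators are assumptions made for illustration; they are not taken from this module's source.

```python
from typing import List, Sequence, Tuple

Filter = Tuple[str, str, object]  # (column, operator, literal)

# Assumed set of operators the remote side can evaluate.
PUSHABLE_OPS = {"=", "<>", "<", "<=", ">", ">="}


def sql_literal(value: object) -> str:
    # bool is checked before int because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "TRUE" if value else "FALSE"
    if isinstance(value, (int, float)):
        return repr(value)
    return "'" + str(value).replace("'", "''") + "'"


def quote_ident(name: str) -> str:
    return '"' + name.replace('"', '""') + '"'


def build_remote_query(
    table: str, columns: Sequence[str], filters: Sequence[Filter]
) -> Tuple[str, List[Filter]]:
    """Return the remote SQL plus the filters Spark must still apply itself."""
    select_list = ", ".join(quote_ident(c) for c in columns) or "*"
    pushed, residual = [], []
    for column, op, value in filters:
        if op in PUSHABLE_OPS:
            pushed.append(f"{quote_ident(column)} {op} {sql_literal(value)}")
        else:
            residual.append((column, op, value))
    # The table expression is used as given and is assumed to be trusted input.
    sql = f"SELECT {select_list} FROM {table}"
    if pushed:
        sql += " WHERE " + " AND ".join(pushed)
    return sql, residual


sql, residual = build_remote_query(
    "events",  # hypothetical table name
    ["user_id", "ts"],
    [("p", "=", 3), ("ts", ">=", "2024-01-01"), ("name", "LIKE", "a%")],
)
```

Here the filter on the partition column p and the range filter on ts travel to the server, so only matching rows cross the network, while the LIKE filter stays with Spark.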

Scope & Non-Goals

In scope:

Reading data from DazzleDuck into Spark
Partition-aware parallelism

Out of scope:

Writing data back to DazzleDuck from Spark
Replacing Spark’s execution engine
Acting as a Spark catalog
