Architecture
The Spark integration is built as a Spark DataSource V2 that communicates with DazzleDuck SQL Server using Arrow Flight SQL.
Key Components
ArrowRPCTableProvider
- Entry point used by Spark (`USING io.dazzleduck.sql.spark.ArrowRPCTableProvider`)
- Parses Spark OPTIONS
- Constructs Arrow RPC–backed tables
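As a usage sketch, a table backed by the provider might be declared as follows. The option keys (`url`, `user`, `password`) are illustrative assumptions, not confirmed option names; check the project documentation for the actual keys.

```sql
-- Hypothetical example: option keys below are assumptions
CREATE TABLE events
USING io.dazzleduck.sql.spark.ArrowRPCTableProvider
OPTIONS (
  url 'jdbc:arrow-flight-sql://localhost:32010',  -- assumed key
  user 'spark',                                   -- assumed key
  password 'secret'                               -- assumed key
);
```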
ArrowRPCTable
- Represents a logical Spark table
- Exposes schema and partitioning information
- Delegates scanning to Arrow RPC readers
ArrowRPCScan / ArrowRPCScanBuilder
- Converts Spark logical plans into Arrow Flight SQL queries
- Handles projection, filtering, and aggregation rewriting
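A minimal sketch of the kind of query the scan builder produces after Spark pushes down a projection and filters. Plain strings stand in for Spark's connector interfaces (`SupportsPushDownRequiredColumns`, `SupportsPushDownFilters`); all names here are illustrative, not the real `ArrowRPCScanBuilder` API.

```java
import java.util.List;

// Illustrative sketch only: the real scan builder walks Spark expression
// trees rather than taking pre-rendered strings.
public class ScanSketch {
    // Build the SQL shipped to the server after Spark pushes down a
    // column projection and a set of filter predicates.
    public static String buildQuery(String table, List<String> columns,
                                    List<String> filters) {
        String select = columns.isEmpty() ? "*" : String.join(", ", columns);
        String sql = "SELECT " + select + " FROM " + table;
        if (!filters.isEmpty()) {
            sql += " WHERE " + String.join(" AND ", filters);
        }
        return sql;
    }

    public static void main(String[] args) {
        System.out.println(buildQuery("events",
                List.of("id", "ts"), List.of("ts >= '2024-01-01'")));
    }
}
```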
ArrowPartitionReaderFactory
- Creates per-partition readers
- Enables Spark parallelism
- Each Spark task opens its own Arrow Flight stream
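The one-reader-per-partition shape can be sketched as follows. In the real integration each reader would open its own Arrow Flight stream; here each reader just owns an independent iterator, and every name is an illustrative assumption rather than the actual `ArrowPartitionReaderFactory` API.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Illustrative sketch: readers share no state, mirroring the fact that
// each Spark task opens its own stream to the server.
public class ReaderFactorySketch {
    record Partition(String id) {}

    // One reader per partition; in the real code this is where an
    // Arrow Flight SQL stream would be opened.
    static Iterator<String> createReader(Partition p) {
        return List.of(p.id() + "-batch-0", p.id() + "-batch-1").iterator();
    }

    public static void main(String[] args) {
        List<String> seen = new ArrayList<>();
        for (Partition p : List.of(new Partition("p0"), new Partition("p1"))) {
            createReader(p).forEachRemaining(seen::add);
        }
        System.out.println(seen);
    }
}
```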
FlightSqlClientPool
- Manages reusable Arrow Flight SQL connections
- Handles timeouts and authentication
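A fixed-size pool in the spirit of `FlightSqlClientPool` can be sketched with a blocking queue. `T` stands in for an Arrow Flight SQL client; the class name, methods, and timeout behavior are assumptions, not the real implementation.

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Illustrative sketch of a reusable-connection pool with timeouts.
public class ClientPoolSketch<T> {
    private final BlockingQueue<T> idle;

    public ClientPoolSketch(List<T> clients) {
        this.idle = new ArrayBlockingQueue<>(clients.size(), true, clients);
    }

    // Borrow a client, waiting up to timeoutMillis; null signals a timeout,
    // which callers can surface as a connection-acquisition failure.
    public T borrow(long timeoutMillis) throws InterruptedException {
        return idle.poll(timeoutMillis, TimeUnit.MILLISECONDS);
    }

    // Return a client so another task can reuse the connection.
    public void release(T client) {
        idle.add(client);
    }

    public static void main(String[] args) throws InterruptedException {
        ClientPoolSketch<String> pool = new ClientPoolSketch<>(List.of("client-1"));
        String c = pool.borrow(100);
        pool.release(c);
        System.out.println(pool.borrow(100));
    }
}
```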
Query Lifecycle
1. Spark parses the SQL query
2. The logical plan is translated into Arrow-compatible SQL
3. Spark requests partitions
4. Each task:
   - Opens an Arrow Flight SQL connection
   - Streams Arrow record batches
5. Spark consumes the batches as internal rows
Query Rewriting
The integration rewrites Spark expressions into DuckDB-compatible SQL using:
- DuckDBExpressionSQLBuilder
- QueryBuilderV2
- Expression utilities (literals, field references)
Unsupported expressions are filtered early to fail fast.
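The flavor of this rewriting can be sketched with the two simplest cases, field references and string literals, rendered with DuckDB's quoting rules (double quotes for identifiers, single quotes for strings, doubled to escape). The class and method names are illustrative, not the real `DuckDBExpressionSQLBuilder` API.

```java
// Illustrative sketch of expression-to-SQL rewriting.
public class ExpressionSketch {
    // Quote an identifier DuckDB-style (double quotes, doubled to escape).
    public static String field(String name) {
        return "\"" + name.replace("\"", "\"\"") + "\"";
    }

    // Render a string literal (single quotes, doubled to escape).
    public static String literal(String value) {
        return "'" + value.replace("'", "''") + "'";
    }

    // Combine both into an equality predicate.
    public static String eq(String fieldSql, String literalSql) {
        return fieldSql + " = " + literalSql;
    }

    public static void main(String[] args) {
        System.out.println(eq(field("country"), literal("US")));
    }
}
```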
Partition Awareness
- Partition columns are declared explicitly
- Spark splits work across partitions
- DazzleDuck applies partition pruning where possible
This allows Spark to scale reads efficiently.
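The effect of pruning can be sketched as follows: a predicate on a declared partition column eliminates whole partitions before any data is read. All names here are illustrative assumptions, not the actual pruning code.

```java
import java.util.List;
import java.util.function.Predicate;

// Illustrative sketch of partition pruning on a date-partitioned table.
public class PruneSketch {
    record Partition(String date) {}

    // Keep only partitions whose partition-column value passes the filter;
    // the rest are never scanned at all.
    public static List<Partition> prune(List<Partition> parts, Predicate<String> keep) {
        return parts.stream().filter(p -> keep.test(p.date())).toList();
    }

    public static void main(String[] args) {
        List<Partition> parts = List.of(
                new Partition("2024-01-01"), new Partition("2024-02-01"));
        System.out.println(prune(parts, d -> d.compareTo("2024-01-15") >= 0));
    }
}
```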
Security Model
- Authentication handled via Arrow Flight SQL
- Username/password passed via JDBC-style URL
- TLS supported when enabled on server
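For illustration, a connection URL in the style of the Arrow Flight SQL JDBC driver might look like the following. The parameter names (`user`, `password`, `useEncryption`) are assumptions to verify against the server's documentation.

```
jdbc:arrow-flight-sql://dazzleduck-host:32010/?user=spark&password=secret&useEncryption=true
```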