Skip to main content

Ingestion Utilities

Shared helpers for Arrow-based ingestion, partitioned writes, and Parquet output.


Overview

Ingestion utilities provide reusable components used by:

  • HTTP ingestion (/v1/ingest)
  • Flight SQL ingestion
  • Batch pipelines

They coordinate Arrow readers, partition logic, transformations, and file output.


Key Components

ComponentPurpose
ParquetIngestionQueueBuffered, batched Parquet writes
BulkIngestQueueTime / size-based batching
PostIngestionTaskMetadata and registration hooks
MappedReaderColumn transformation

ParquetIngestionQueue

Responsibilities

  • Accumulate Arrow batches
  • Apply transformations
  • Partition data
  • Execute DuckDB COPY statements
  • Emit ingestion metadata

Writes occur when:

  • Bucket size threshold is reached, or
  • Max delay expires

SQL Generation

Ingestion uses DuckDB's native COPY:

COPY (
SELECT *, expr AS col
FROM read_arrow([...])
ORDER BY col
)
TO 'path'
(FORMAT parquet, PARTITION_BY(col), RETURN_FILES);

Cancellation Support

  • Each write task registers a cancel hook
  • Active DuckDB statements are cancelled safely

Production Considerations

  • Tune batch sizes carefully
  • Monitor memory usage
  • Prefer partitioned writes
  • Handle retries at producer layer