DazzleDuck Log Tail → Arrow Pipeline
This project implements a log ingestion pipeline that tails JSON log files from disk, converts them into Apache Arrow format, and sends them to a DazzleDuck HTTP server for ingestion and storage (Parquet).
It supports real-time log ingestion, batching, fault tolerance, and full end-to-end testing with real servers and real files.
What This Project Does
- Watches a directory for log files (
*.log) - Reads new JSON log entries incrementally (tailing)
- Converts log records into Apache Arrow batches
- Sends Arrow data to a DazzleDuck HTTP ingestion endpoint
- Writes ingested data as Parquet in the warehouse
- Supports local testing, unit tests, and real end-to-end runs
Core Components
Log Processing
-
LogFileTailReader
Detects new log files and tails appended lines safely. -
LogTailToArrowProcessor
Orchestrates tailing → JSON parsing → Arrow conversion → sending. -
JsonToArrowConverter
Converts log JSON records into Arrow vectors using a fixed schema.
Sending & Ingestion
- HttpSender
Handles authentication (JWT), batching, retries, and backpressure when sending Arrow streams.
Log Generation (Testing)
-
SimpleLogGenerator
Writes basic static JSON logs (unit tests). -
LogFileGenerator
Generates realistic rolling log files for end-to-end testing.
Running the Log Processor
The processor can be run as a standalone application.
Entry Point
LogProcessorMain
What it does
- Reads configuration from
application.conf - Starts directory monitoring
- Continuously processes logs until shutdown
End-to-End Testing
EndToEndTest
Runs a real pipeline:
- Starts the real DazzleDuck HTTP server
- Creates temporary log & warehouse directories
- Generates real log files on disk
- Tails logs and sends Arrow data
- Verifies Parquet ingestion using DuckDB
- Cleans up resources
This test validates:
- File tailing
- JSON parsing
- Arrow conversion
- HTTP ingestion
- Parquet output correctness
Unit Tests
LogTailToArrowProcessorTest
Covers:
- Single and multiple log files
- Invalid JSON handling
- Empty files
- Missing files
- Correct Parquet record counts
All tests use temporary directories and clean up automatically.
Log Format
Logs must be one JSON object per line, for example:
{
"timestamp": "2024-01-01T10:00:00Z",
"level": "INFO",
"thread": "main",
"logger": "App",
"message": "Hello world"
}
Invalid JSON lines are safely skipped.
Requirements
- Java 21+
- Apache Arrow
- DuckDB
- SLF4J
- DazzleDuck SQL Server (HTTP mode)
Design Goals
- Streaming-friendly
- Low memory overhead
- Safe file tailing
- Backpressure-aware ingestion
- Production-like testing with real servers
When to Use This
Use this project if you need:
- File-based log ingestion
- Arrow-based transport
- Real-time or near-real-time analytics
- Reliable end-to-end validation
Status
✅ Fully working
✅ End-to-end verified
✅ Production-ready pipeline