pg_duckdb for Analytics and Data Lake Queries
DuckDB's vectorized columnar engine, embedded as a Postgres extension. Run dashboard queries in your OLTP without an external warehouse.
Row Store vs Column Store
PostgreSQL stores rows together. That is perfect for OLTP — fetch one row by primary key and you read one disk block. It is the wrong shape for dashboards: SELECT region, SUM(amount) FROM events GROUP BY region over 100M rows reads every column, not just the two you need, and decompresses values one tuple at a time.
Column stores flip this: each column is its own contiguous block, perfectly compressible, perfectly vectorizable. DuckDB is a fast in-process column engine. pg_duckdb embeds it inside Postgres so analytical queries get column-store performance without leaving the database.
What pg_duckdb Is
A Postgres extension built jointly by MotherDuck and Hydra that loads DuckDB into the backend process. Queries that would benefit from columnar execution are routed through DuckDB; OLTP queries continue to use the native Postgres planner.
Install pg_duckdb
# Add the Hydra/pg_duckdb apt repo:
curl -fsSL https://repo.hydra.so/apt/hydra.gpg \
| sudo gpg --dearmor -o /usr/share/keyrings/hydra.gpg
echo "deb [signed-by=/usr/share/keyrings/hydra.gpg] \
https://repo.hydra.so/apt $(lsb_release -cs) main" | \
sudo tee /etc/apt/sources.list.d/hydra.list
sudo apt update
sudo apt install -y postgresql-17-pg-duckdb
# Required preload + restart:
echo "shared_preload_libraries = 'pg_duckdb'" | \
sudo tee -a /etc/postgresql/17/main/conf.d/30-pgduckdb.conf
sudo systemctl restart postgresql@17-main
sudo -u postgres psql -d appdb -c "CREATE EXTENSION pg_duckdb;"Enable and Force Execution
-- Per-session: prefer DuckDB execution where it makes sense:
SET duckdb.force_execution = true;
-- Permanently per-user:
ALTER ROLE analytics SET duckdb.force_execution = true;Auto-detection kicks in for queries against DuckDB-backed tables and external data sources. Force it on for analytical sessions; leave it off in connections that handle OLTP traffic.
Query Parquet Files on Object Storage
-- Single file:
SELECT region, SUM(amount) AS total
FROM read_parquet('s3://my-bucket/events/2026-05.parquet')
GROUP BY region;
-- Glob multiple files:
SELECT date_trunc('day', ts) AS day, COUNT(*)
FROM read_parquet('s3://my-bucket/events/2026/*.parquet')
GROUP BY 1 ORDER BY 1;Object Storage Credentials via duckdb.secrets
-- Backblaze B2, Wasabi, RamNode Cloud Object Storage — all S3-compatible:
INSERT INTO duckdb.secrets (type, key_id, secret, region, endpoint, scope)
VALUES ('S3',
'AKIA....',
'SECRET....',
'us-east-1',
's3.us-east-005.backblazeb2.com',
's3://my-bucket/');CSV, JSON, and HTTP Sources
SELECT * FROM read_csv('/srv/imports/orders.csv', header=true);
SELECT * FROM read_json_auto('https://example.com/data.json');Joining External Data with Native Postgres Tables
This is where pg_duckdb stops being a curiosity and starts being indispensable:
SELECT u.email, region.name, SUM(e.amount) AS lifetime_spend
FROM read_parquet('s3://my-bucket/events/2026/*.parquet') e
JOIN users u ON u.id = e.user_id
JOIN regions region ON region.code = e.region_code
WHERE u.tenant_id = $1
GROUP BY 1, 2
ORDER BY 3 DESC;Writing Results Back to Parquet
COPY (
SELECT date_trunc('day', created_at) AS day, COUNT(*) AS signups
FROM users GROUP BY 1
) TO 's3://my-bucket/exports/signups.parquet' (FORMAT PARQUET, COMPRESSION ZSTD);MotherDuck Integration (Optional)
-- Attach a MotherDuck cloud database from inside Postgres:
SELECT duckdb.attach('md:my_warehouse');
SELECT * FROM md.my_warehouse.fact_orders LIMIT 10;Materialized View Patterns for "Warm" Analytics
CREATE MATERIALIZED VIEW daily_revenue AS
SELECT date_trunc('day', e.ts) AS day,
e.tenant_id,
SUM(e.amount) AS revenue
FROM read_parquet('s3://my-bucket/events/2026/*.parquet') e
GROUP BY 1, 2;
CREATE INDEX ON daily_revenue (tenant_id, day);
-- Refresh from cron / pg_cron once an hour:
REFRESH MATERIALIZED VIEW CONCURRENTLY daily_revenue;End-to-End: SaaS Metrics Dashboard
Event data lives as daily Parquet files in object storage (cheap, immutable, infinite retention). Dimension tables — users, tenants, plans — live in native Postgres (small, frequently updated). The dashboard query joins them at read time. No ETL pipeline, no warehouse to provision, one place to back up, one place to grant access.
Performance Comparison
Aggregating 100M rows with native PostgreSQL vs pg_duckdb on a 4 vCPU / 8 GB instance, data already cached:
Native Postgres heap scan + GROUP BY: ~14 seconds
pg_duckdb columnar execution: ~1.2 secondsOrder-of-magnitude speedups are typical for wide-aggregate queries. Lookups and short OLTP queries see no benefit and should not use DuckDB execution.
Limitations and Gotchas
- • Writes still go through Postgres — DuckDB execution is read-side
- • Some query shapes do not push down (CTEs with side-effecting functions, certain LATERAL joins)
- • Memory pressure: a wide aggregate over a 10 GB Parquet file will allocate working memory inside DuckDB. Constrain with
SET duckdb.memory_limit = '2GB';on small VPS - •
EXPLAINoutput looks unfamiliar — DuckDB has its own plan format
What's Next
The single-node story is complete. Part 6 turns it into a high-availability cluster with Patroni, etcd, and HAProxy. Tuning from Part 2 matters even more once replication is in the picture.
