Part 5 of 7

pg_duckdb for Analytics and Data Lake Queries

DuckDB's vectorized columnar engine, embedded as a Postgres extension. Run dashboard queries in your OLTP without an external warehouse.

55 minutes

4 GB+ recommended

Prerequisites

Part 1 + Part 2

Time to Complete

50–70 minutes

Recommended Plan

8 GB+ for large scans

Row Store vs Column Store

PostgreSQL stores rows together. That is perfect for OLTP — fetch one row by primary key and you read one disk block. It is the wrong shape for dashboards: SELECT region, SUM(amount) FROM events GROUP BY region over 100M rows reads every column, not just the two you need, and decompresses values one tuple at a time.

Column stores flip this: each column is its own contiguous block, perfectly compressible, perfectly vectorizable. DuckDB is a fast in-process column engine. pg_duckdb embeds it inside Postgres so analytical queries get column-store performance without leaving the database.

What pg_duckdb Is

A Postgres extension built jointly by MotherDuck and Hydra that loads DuckDB into the backend process. Queries that would benefit from columnar execution are routed through DuckDB; OLTP queries continue to use the native Postgres planner.

Install pg_duckdb

# Add the Hydra/pg_duckdb apt repo:
curl -fsSL https://repo.hydra.so/apt/hydra.gpg \
  | sudo gpg --dearmor -o /usr/share/keyrings/hydra.gpg

echo "deb [signed-by=/usr/share/keyrings/hydra.gpg] \
  https://repo.hydra.so/apt $(lsb_release -cs) main" | \
  sudo tee /etc/apt/sources.list.d/hydra.list

sudo apt update
sudo apt install -y postgresql-17-pg-duckdb

# Required preload + restart:
echo "shared_preload_libraries = 'pg_duckdb'" | \
  sudo tee -a /etc/postgresql/17/main/conf.d/30-pgduckdb.conf

sudo systemctl restart postgresql@17-main
sudo -u postgres psql -d appdb -c "CREATE EXTENSION pg_duckdb;"

Enable and Force Execution

-- Per-session: prefer DuckDB execution where it makes sense:
SET duckdb.force_execution = true;

-- Permanently per-user:
ALTER ROLE analytics SET duckdb.force_execution = true;

Auto-detection kicks in for queries against DuckDB-backed tables and external data sources. Force it on for analytical sessions; leave it off in connections that handle OLTP traffic.

Query Parquet Files on Object Storage

-- Single file:
SELECT region, SUM(amount) AS total
FROM read_parquet('s3://my-bucket/events/2026-05.parquet')
GROUP BY region;

-- Glob multiple files:
SELECT date_trunc('day', ts) AS day, COUNT(*)
FROM read_parquet('s3://my-bucket/events/2026/*.parquet')
GROUP BY 1 ORDER BY 1;

Object Storage Credentials via duckdb.secrets

-- Backblaze B2, Wasabi, RamNode Cloud Object Storage — all S3-compatible:
INSERT INTO duckdb.secrets (type, key_id, secret, region, endpoint, scope)
VALUES ('S3',
        'AKIA....',
        'SECRET....',
        'us-east-1',
        's3.us-east-005.backblazeb2.com',
        's3://my-bucket/');

CSV, JSON, and HTTP Sources

SELECT * FROM read_csv('/srv/imports/orders.csv', header=true);
SELECT * FROM read_json_auto('https://example.com/data.json');

Joining External Data with Native Postgres Tables

This is where pg_duckdb stops being a curiosity and starts being indispensable:

SELECT u.email, region.name, SUM(e.amount) AS lifetime_spend
FROM read_parquet('s3://my-bucket/events/2026/*.parquet') e
JOIN users   u      ON u.id = e.user_id
JOIN regions region ON region.code = e.region_code
WHERE u.tenant_id = $1
GROUP BY 1, 2
ORDER BY 3 DESC;

Writing Results Back to Parquet

COPY (
  SELECT date_trunc('day', created_at) AS day, COUNT(*) AS signups
  FROM users GROUP BY 1
) TO 's3://my-bucket/exports/signups.parquet' (FORMAT PARQUET, COMPRESSION ZSTD);

MotherDuck Integration (Optional)

-- Attach a MotherDuck cloud database from inside Postgres:
SELECT duckdb.attach('md:my_warehouse');
SELECT * FROM md.my_warehouse.fact_orders LIMIT 10;

Materialized View Patterns for "Warm" Analytics

CREATE MATERIALIZED VIEW daily_revenue AS
SELECT date_trunc('day', e.ts) AS day,
       e.tenant_id,
       SUM(e.amount) AS revenue
FROM read_parquet('s3://my-bucket/events/2026/*.parquet') e
GROUP BY 1, 2;

CREATE INDEX ON daily_revenue (tenant_id, day);

-- Refresh from cron / pg_cron once an hour:
REFRESH MATERIALIZED VIEW CONCURRENTLY daily_revenue;

End-to-End: SaaS Metrics Dashboard

Event data lives as daily Parquet files in object storage (cheap, immutable, infinite retention). Dimension tables — users, tenants, plans — live in native Postgres (small, frequently updated). The dashboard query joins them at read time. No ETL pipeline, no warehouse to provision, one place to back up, one place to grant access.

Performance Comparison

Aggregating 100M rows with native PostgreSQL vs pg_duckdb on a 4 vCPU / 8 GB instance, data already cached:

Native Postgres heap scan + GROUP BY:    ~14 seconds
pg_duckdb columnar execution:             ~1.2 seconds

Order-of-magnitude speedups are typical for wide-aggregate queries. Lookups and short OLTP queries see no benefit and should not use DuckDB execution.

Limitations and Gotchas

• Writes still go through Postgres — DuckDB execution is read-side
• Some query shapes do not push down (CTEs with side-effecting functions, certain LATERAL joins)
• Memory pressure: a wide aggregate over a 10 GB Parquet file will allocate working memory inside DuckDB. Constrain with SET duckdb.memory_limit = '2GB'; on small VPS
• EXPLAIN output looks unfamiliar — DuckDB has its own plan format

What's Next

The single-node story is complete. Part 6 turns it into a high-availability cluster with Patroni, etcd, and HAProxy. Tuning from Part 2 matters even more once replication is in the picture.

← Part 4: ParadeDB Part 6: Patroni HA