Patroni High Availability Cluster
Three Postgres nodes, three-node etcd, HAProxy in front. Automatic failover, no manual intervention, no split-brain.
Why Patroni Won the Postgres HA Wars
Patroni delegates leader election to a real consensus store (etcd, Consul, or ZooKeeper) instead of trying to invent quorum on top of Postgres replication. That single decision is why it is the HA framework used internally by major cloud providers, Crunchy Data, and Zalando — the alternatives (repmgr, Stolon, repmgrd hand-rolled scripts) all lose either to operational complexity or to subtle split-brain conditions Patroni does not have.
Cluster Architecture
┌──────────────────┐
client ──▶ │ HAProxy / VIP │ ── 5000 (writer) ──▶ leader
│ (or PgBouncer) │ ── 5001 (reader) ──▶ replicas
└──────────────────┘
│
┌──────────┼──────────┐
▼ ▼ ▼
┌──────┐ ┌──────┐ ┌──────┐
│ pg-1 │◀─▶│ pg-2 │◀─▶│ pg-3 │ Postgres + Patroni
└──┬───┘ └──┬───┘ └──┬───┘
▼ ▼ ▼
┌──────┐ ┌──────┐ ┌──────┐
│etcd-1│◀─▶│etcd-2│◀─▶│etcd-3│ Consensus (Raft)
└──────┘ └──────┘ └──────┘Why Three Nodes and Not Two
Quorum-based consensus needs ⌊N/2⌋ + 1 healthy nodes to elect a leader. With two nodes, losing one drops you to 50% — no quorum, no leader, the cluster freezes. With three, you can lose any one and still have a majority. Always provision odd numbers (3 or 5) for the consensus tier.
Network Topology on RamNode
Use the project's private network for inter-node traffic — replication and Raft messages should never cross the public internet. If your target region does not yet expose private networking, lay a Tailscale or WireGuard mesh on top:
# Tailscale on every node:
curl -fsSL https://tailscale.com/install.sh | sh
sudo tailscale up --advertise-tags=tag:postgres
# nodes are now reachable over their 100.x tailnet IPsProvisioning Three Nodes
Repeat Part 1 on three VPS, with one critical deviation: do not initialize the cluster yourself. Drop the package-created cluster on each node so Patroni can bootstrap from scratch:
sudo pg_dropcluster --stop 17 mainInstall and Configure etcd 3.5+
# On each of pg-1, pg-2, pg-3:
sudo apt install -y etcd-server etcd-clientETCD_NAME=etcd-1
ETCD_DATA_DIR=/var/lib/etcd/postgres
ETCD_LISTEN_PEER_URLS=http://10.0.0.11:2380
ETCD_LISTEN_CLIENT_URLS=http://10.0.0.11:2379,http://127.0.0.1:2379
ETCD_INITIAL_ADVERTISE_PEER_URLS=http://10.0.0.11:2380
ETCD_ADVERTISE_CLIENT_URLS=http://10.0.0.11:2379
ETCD_INITIAL_CLUSTER="etcd-1=http://10.0.0.11:2380,etcd-2=http://10.0.0.12:2380,etcd-3=http://10.0.0.13:2380"
ETCD_INITIAL_CLUSTER_TOKEN=postgres-cluster
ETCD_INITIAL_CLUSTER_STATE=newsudo systemctl enable --now etcd
etcdctl --endpoints=http://10.0.0.11:2379 endpoint healthInstall Patroni 4.x in a virtualenv
sudo apt install -y python3-venv python3-pip
sudo python3 -m venv /opt/patroni
sudo /opt/patroni/bin/pip install --upgrade pip
sudo /opt/patroni/bin/pip install 'patroni[etcd3]' psycopg[binary][Unit]
Description=Patroni
After=network.target etcd.service
Wants=etcd.service
[Service]
Type=simple
User=postgres
Group=postgres
ExecStart=/opt/patroni/bin/patroni /etc/patroni/patroni.yml
Restart=on-failure
RestartSec=5
[Install]
WantedBy=multi-user.targetpatroni.yml Walkthrough
scope: pg-cluster
namespace: /service/
name: pg-1
restapi:
listen: 10.0.0.11:8008
connect_address: 10.0.0.11:8008
etcd3:
hosts: 10.0.0.11:2379,10.0.0.12:2379,10.0.0.13:2379
bootstrap:
dcs:
ttl: 30
loop_wait: 10
retry_timeout: 10
maximum_lag_on_failover: 1048576
synchronous_mode: false
postgresql:
use_pg_rewind: true
parameters:
shared_buffers: 1GB
effective_cache_size: 2GB
wal_level: replica
hot_standby: on
max_wal_senders: 10
max_replication_slots: 10
initdb:
- encoding: UTF8
- locale: C.UTF-8
- data-checksums
pg_hba:
- host replication replicator 10.0.0.0/24 scram-sha-256
- host all all 10.0.0.0/24 scram-sha-256
postgresql:
listen: 0.0.0.0:5432
connect_address: 10.0.0.11:5432
data_dir: /var/lib/postgresql/17/main
bin_dir: /usr/lib/postgresql/17/bin
authentication:
superuser: { username: postgres, password: 'change-me' }
replication: { username: replicator, password: 'change-me' }
tags:
nofailover: false
noloadbalance: false
clonefrom: falseBootstrap the First Node
# On pg-1 only:
sudo systemctl enable --now patroni
sudo journalctl -u patroni -f
# Watch for: "i am the leader with the lock"Add Replica Nodes
# On pg-2 and pg-3 (after editing patroni.yml with their own name/IP):
sudo systemctl enable --now patroni
patronictl -c /etc/patroni/patroni.yml list+ Cluster: pg-cluster (7359...) -+---------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+--------+-------------+---------+---------+----+-----------+
| pg-1 | 10.0.0.11 | Leader | running | 1 | |
| pg-2 | 10.0.0.12 | Replica | running | 1 | 0 |
| pg-3 | 10.0.0.13 | Replica | running | 1 | 0 |
+--------+-------------+---------+---------+----+-----------+Replication Slots
Patroni manages physical replication slots automatically when use_slots: true (the default). Slots prevent the primary from recycling WAL a replica still needs, but they also pin WAL on the primary if a replica disappears — monitor pg_replication_slots and remove dead slots after a confirmed node loss.
Synchronous vs Asynchronous Replication
Async replication is faster and tolerates network blips; you can lose committed transactions on failover. Synchronous replication guarantees zero data loss but pauses commits when the sync replica is unreachable.
synchronous_mode: true
synchronous_mode_strict: false # allow degrading to async if no sync standby is healthy
synchronous_node_count: 1For most production workloads on three nodes, synchronous_mode: true with strict: false is the right balance.
HAProxy Writer/Reader Endpoints
global
maxconn 5000
defaults
timeout connect 5s
timeout client 1h
timeout server 1h
listen postgres_write
bind *:5000
option httpchk OPTIONS /master
http-check expect status 200
default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
server pg-1 10.0.0.11:5432 check port 8008
server pg-2 10.0.0.12:5432 check port 8008
server pg-3 10.0.0.13:5432 check port 8008
listen postgres_read
bind *:5001
balance roundrobin
option httpchk OPTIONS /replica
http-check expect status 200
server pg-1 10.0.0.11:5432 check port 8008
server pg-2 10.0.0.12:5432 check port 8008
server pg-3 10.0.0.13:5432 check port 8008Apps point at haproxy:5000 for writes and haproxy:5001 for read-only traffic.
Testing Failover
# Planned switchover (zero data loss):
patronictl -c /etc/patroni/patroni.yml switchover
# Forced failover (simulating a dead leader):
patronictl -c /etc/patroni/patroni.yml failover
# Real-world: kill the leader's network and watch:
sudo iptables -A INPUT -p tcp --dport 5432 -j DROP
sudo iptables -A INPUT -p tcp --dport 8008 -j DROP
# In another shell on a healthy node:
patronictl -c /etc/patroni/patroni.yml listWebhook Callbacks (Slack, Not Email)
postgresql:
callbacks:
on_role_change: /usr/local/bin/patroni-notify.sh#!/bin/bash
ROLE=$1; STATE=$2; CLUSTER=$3
curl -s -X POST -H 'Content-Type: application/json' \
-d "{\"text\":\":rotating_light: Patroni: $CLUSTER -> role=$ROLE state=$STATE on $(hostname)\"}" \
https://hooks.slack.com/services/T000/B000/XXXXMonitoring with patronictl and the REST API
patronictl -c /etc/patroni/patroni.yml list
patronictl -c /etc/patroni/patroni.yml topology
# REST API endpoints — perfect for Prometheus/blackbox checks:
curl http://10.0.0.11:8008/cluster
curl http://10.0.0.11:8008/health
curl http://10.0.0.11:8008/replicaPitfalls
- • Clock skew — Patroni TTLs are wall-clock based. Run
chronyorsystemd-timesyncdon every node. - • etcd quorum loss — losing two of three etcd nodes freezes leader election. Restore quorum before touching Patroni.
- • Replication lag — set
maximum_lag_on_failoverin DCS so a stale replica is never promoted. - • pg_rewind requirements — needs
wal_log_hints = on(Patroni handles this whenuse_pg_rewind: true).
Rolling Upgrades
Minor version: upgrade replicas first, switch over to a freshly-upgraded replica, upgrade the old leader.
Major version: stand up a new cluster on the new major version, use logical replication to sync, cut traffic over. pg_upgrade in-place is possible but coordinated through the Patroni REST API and adds risk; prefer the parallel-cluster approach.
What's Next
A healthy HA cluster is one ransomware event or accidental DROP TABLE from being useless without backups. Part 7 covers continuous archiving, encrypted offsite backups, and verified PITR — including restoring to a new node when a Patroni leader dies for good.
