Part 6 of 7

Patroni High Availability Cluster

Three Postgres nodes, three-node etcd, HAProxy in front. Automatic failover, no manual intervention, no split-brain.

75 minutes

3× 2 GB minimum

Prerequisites

Part 1 + Part 2 on three VPS

Time to Complete

70–95 minutes

Recommended Plan

3 × 2 GB minimum, 4 GB preferred

Why Patroni Won the Postgres HA Wars

Patroni delegates leader election to a real consensus store (etcd, Consul, or ZooKeeper) instead of trying to invent quorum on top of Postgres replication. That single decision is why it is the HA framework used internally by major cloud providers, Crunchy Data, and Zalando — the alternatives (repmgr, Stolon, repmgrd hand-rolled scripts) all lose either to operational complexity or to subtle split-brain conditions Patroni does not have.

Cluster Architecture

              ┌──────────────────┐
   client ──▶ │   HAProxy / VIP  │ ── 5000 (writer) ──▶ leader
              │  (or PgBouncer)  │ ── 5001 (reader) ──▶ replicas
              └──────────────────┘
                       │
            ┌──────────┼──────────┐
            ▼          ▼          ▼
        ┌──────┐   ┌──────┐   ┌──────┐
        │ pg-1 │◀─▶│ pg-2 │◀─▶│ pg-3 │   Postgres + Patroni
        └──┬───┘   └──┬───┘   └──┬───┘
           ▼          ▼          ▼
        ┌──────┐   ┌──────┐   ┌──────┐
        │etcd-1│◀─▶│etcd-2│◀─▶│etcd-3│   Consensus (Raft)
        └──────┘   └──────┘   └──────┘

Why Three Nodes and Not Two

Quorum-based consensus needs ⌊N/2⌋ + 1 healthy nodes to elect a leader. With two nodes, losing one drops you to 50% — no quorum, no leader, the cluster freezes. With three, you can lose any one and still have a majority. Always provision odd numbers (3 or 5) for the consensus tier.

Network Topology on RamNode

Use the project's private network for inter-node traffic — replication and Raft messages should never cross the public internet. If your target region does not yet expose private networking, lay a Tailscale or WireGuard mesh on top:

# Tailscale on every node:
curl -fsSL https://tailscale.com/install.sh | sh
sudo tailscale up --advertise-tags=tag:postgres
# nodes are now reachable over their 100.x tailnet IPs

Provisioning Three Nodes

Repeat Part 1 on three VPS, with one critical deviation: do not initialize the cluster yourself. Drop the package-created cluster on each node so Patroni can bootstrap from scratch:

sudo pg_dropcluster --stop 17 main

Install and Configure etcd 3.5+

# On each of pg-1, pg-2, pg-3:
sudo apt install -y etcd-server etcd-client

/etc/default/etcd (per node, adjust NAME/IP)

ETCD_NAME=etcd-1
ETCD_DATA_DIR=/var/lib/etcd/postgres
ETCD_LISTEN_PEER_URLS=http://10.0.0.11:2380
ETCD_LISTEN_CLIENT_URLS=http://10.0.0.11:2379,http://127.0.0.1:2379
ETCD_INITIAL_ADVERTISE_PEER_URLS=http://10.0.0.11:2380
ETCD_ADVERTISE_CLIENT_URLS=http://10.0.0.11:2379
ETCD_INITIAL_CLUSTER="etcd-1=http://10.0.0.11:2380,etcd-2=http://10.0.0.12:2380,etcd-3=http://10.0.0.13:2380"
ETCD_INITIAL_CLUSTER_TOKEN=postgres-cluster
ETCD_INITIAL_CLUSTER_STATE=new

sudo systemctl enable --now etcd
etcdctl --endpoints=http://10.0.0.11:2379 endpoint health

Install Patroni 4.x in a virtualenv

sudo apt install -y python3-venv python3-pip
sudo python3 -m venv /opt/patroni
sudo /opt/patroni/bin/pip install --upgrade pip
sudo /opt/patroni/bin/pip install 'patroni[etcd3]' psycopg[binary]

/etc/systemd/system/patroni.service

[Unit]
Description=Patroni
After=network.target etcd.service
Wants=etcd.service

[Service]
Type=simple
User=postgres
Group=postgres
ExecStart=/opt/patroni/bin/patroni /etc/patroni/patroni.yml
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target

patroni.yml Walkthrough

/etc/patroni/patroni.yml (per node, adjust name + IPs)

scope: pg-cluster
namespace: /service/
name: pg-1

restapi:
  listen: 10.0.0.11:8008
  connect_address: 10.0.0.11:8008

etcd3:
  hosts: 10.0.0.11:2379,10.0.0.12:2379,10.0.0.13:2379

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    synchronous_mode: false
    postgresql:
      use_pg_rewind: true
      parameters:
        shared_buffers: 1GB
        effective_cache_size: 2GB
        wal_level: replica
        hot_standby: on
        max_wal_senders: 10
        max_replication_slots: 10
  initdb:
    - encoding: UTF8
    - locale: C.UTF-8
    - data-checksums

  pg_hba:
    - host replication replicator 10.0.0.0/24 scram-sha-256
    - host all          all        10.0.0.0/24 scram-sha-256

postgresql:
  listen: 0.0.0.0:5432
  connect_address: 10.0.0.11:5432
  data_dir: /var/lib/postgresql/17/main
  bin_dir: /usr/lib/postgresql/17/bin
  authentication:
    superuser:   { username: postgres,    password: 'change-me' }
    replication: { username: replicator, password: 'change-me' }

tags:
  nofailover: false
  noloadbalance: false
  clonefrom: false

Bootstrap the First Node

# On pg-1 only:
sudo systemctl enable --now patroni
sudo journalctl -u patroni -f
# Watch for: "i am the leader with the lock"

Add Replica Nodes

# On pg-2 and pg-3 (after editing patroni.yml with their own name/IP):
sudo systemctl enable --now patroni
patronictl -c /etc/patroni/patroni.yml list

+ Cluster: pg-cluster (7359...) -+---------+----+-----------+
| Member | Host        | Role    | State   | TL | Lag in MB |
+--------+-------------+---------+---------+----+-----------+
| pg-1   | 10.0.0.11   | Leader  | running |  1 |           |
| pg-2   | 10.0.0.12   | Replica | running |  1 |         0 |
| pg-3   | 10.0.0.13   | Replica | running |  1 |         0 |
+--------+-------------+---------+---------+----+-----------+

Replication Slots

Patroni manages physical replication slots automatically when use_slots: true (the default). Slots prevent the primary from recycling WAL a replica still needs, but they also pin WAL on the primary if a replica disappears — monitor pg_replication_slots and remove dead slots after a confirmed node loss.

Synchronous vs Asynchronous Replication

Async replication is faster and tolerates network blips; you can lose committed transactions on failover. Synchronous replication guarantees zero data loss but pauses commits when the sync replica is unreachable.

Patroni DCS config

synchronous_mode: true
synchronous_mode_strict: false   # allow degrading to async if no sync standby is healthy
synchronous_node_count: 1

For most production workloads on three nodes, synchronous_mode: true with strict: false is the right balance.

HAProxy Writer/Reader Endpoints

/etc/haproxy/haproxy.cfg

global
    maxconn 5000
defaults
    timeout connect 5s
    timeout client 1h
    timeout server 1h

listen postgres_write
    bind *:5000
    option httpchk OPTIONS /master
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server pg-1 10.0.0.11:5432 check port 8008
    server pg-2 10.0.0.12:5432 check port 8008
    server pg-3 10.0.0.13:5432 check port 8008

listen postgres_read
    bind *:5001
    balance roundrobin
    option httpchk OPTIONS /replica
    http-check expect status 200
    server pg-1 10.0.0.11:5432 check port 8008
    server pg-2 10.0.0.12:5432 check port 8008
    server pg-3 10.0.0.13:5432 check port 8008

Apps point at haproxy:5000 for writes and haproxy:5001 for read-only traffic.

Testing Failover

# Planned switchover (zero data loss):
patronictl -c /etc/patroni/patroni.yml switchover

# Forced failover (simulating a dead leader):
patronictl -c /etc/patroni/patroni.yml failover

# Real-world: kill the leader's network and watch:
sudo iptables -A INPUT -p tcp --dport 5432 -j DROP
sudo iptables -A INPUT -p tcp --dport 8008 -j DROP
# In another shell on a healthy node:
patronictl -c /etc/patroni/patroni.yml list

Webhook Callbacks (Slack, Not Email)

patroni.yml

postgresql:
  callbacks:
    on_role_change: /usr/local/bin/patroni-notify.sh

/usr/local/bin/patroni-notify.sh

#!/bin/bash
ROLE=$1; STATE=$2; CLUSTER=$3
curl -s -X POST -H 'Content-Type: application/json' \
  -d "{\"text\":\":rotating_light: Patroni: $CLUSTER -> role=$ROLE state=$STATE on $(hostname)\"}" \
  https://hooks.slack.com/services/T000/B000/XXXX

Monitoring with patronictl and the REST API

patronictl -c /etc/patroni/patroni.yml list
patronictl -c /etc/patroni/patroni.yml topology

# REST API endpoints — perfect for Prometheus/blackbox checks:
curl http://10.0.0.11:8008/cluster
curl http://10.0.0.11:8008/health
curl http://10.0.0.11:8008/replica

Pitfalls

• Clock skew — Patroni TTLs are wall-clock based. Run chrony or systemd-timesyncd on every node.
• etcd quorum loss — losing two of three etcd nodes freezes leader election. Restore quorum before touching Patroni.
• Replication lag — set maximum_lag_on_failover in DCS so a stale replica is never promoted.
• pg_rewind requirements — needs wal_log_hints = on (Patroni handles this when use_pg_rewind: true).

Rolling Upgrades

Minor version: upgrade replicas first, switch over to a freshly-upgraded replica, upgrade the old leader.

Major version: stand up a new cluster on the new major version, use logical replication to sync, cut traffic over. pg_upgrade in-place is possible but coordinated through the Patroni REST API and adds risk; prefer the parallel-cluster approach.

What's Next

A healthy HA cluster is one ransomware event or accidental DROP TABLE from being useless without backups. Part 7 covers continuous archiving, encrypted offsite backups, and verified PITR — including restoring to a new node when a Patroni leader dies for good.

← Part 5: pg_duckdb Part 7: pgBackRest