Modern PostgreSQL Superstack Series
    Part 6 of 7

    Patroni High Availability Cluster

    Three Postgres nodes, three-node etcd, HAProxy in front. Automatic failover, no manual intervention, no split-brain.

    75 minutes
    3× 2 GB minimum
    Prerequisites

    Part 1 + Part 2 on three VPS

    Time to Complete

    70–95 minutes

    Recommended Plan

    3 × 2 GB minimum, 4 GB preferred

    Why Patroni Won the Postgres HA Wars

    Patroni delegates leader election to a real consensus store (etcd, Consul, or ZooKeeper) instead of trying to invent quorum on top of Postgres replication. That single decision is why it is the HA framework used internally by major cloud providers, Crunchy Data, and Zalando — the alternatives (repmgr, Stolon, repmgrd hand-rolled scripts) all lose either to operational complexity or to subtle split-brain conditions Patroni does not have.

    Cluster Architecture

                  ┌──────────────────┐
       client ──▶ │   HAProxy / VIP  │ ── 5000 (writer) ──▶ leader
                  │  (or PgBouncer)  │ ── 5001 (reader) ──▶ replicas
                  └──────────────────┘
                           │
                ┌──────────┼──────────┐
                ▼          ▼          ▼
            ┌──────┐   ┌──────┐   ┌──────┐
            │ pg-1 │◀─▶│ pg-2 │◀─▶│ pg-3 │   Postgres + Patroni
            └──┬───┘   └──┬───┘   └──┬───┘
               ▼          ▼          ▼
            ┌──────┐   ┌──────┐   ┌──────┐
            │etcd-1│◀─▶│etcd-2│◀─▶│etcd-3│   Consensus (Raft)
            └──────┘   └──────┘   └──────┘

    Why Three Nodes and Not Two

    Quorum-based consensus needs ⌊N/2⌋ + 1 healthy nodes to elect a leader. With two nodes, losing one drops you to 50% — no quorum, no leader, the cluster freezes. With three, you can lose any one and still have a majority. Always provision odd numbers (3 or 5) for the consensus tier.

    Network Topology on RamNode

    Use the project's private network for inter-node traffic — replication and Raft messages should never cross the public internet. If your target region does not yet expose private networking, lay a Tailscale or WireGuard mesh on top:

    # Tailscale on every node:
    curl -fsSL https://tailscale.com/install.sh | sh
    sudo tailscale up --advertise-tags=tag:postgres
    # nodes are now reachable over their 100.x tailnet IPs

    Provisioning Three Nodes

    Repeat Part 1 on three VPS, with one critical deviation: do not initialize the cluster yourself. Drop the package-created cluster on each node so Patroni can bootstrap from scratch:

    sudo pg_dropcluster --stop 17 main

    Install and Configure etcd 3.5+

    # On each of pg-1, pg-2, pg-3:
    sudo apt install -y etcd-server etcd-client
    /etc/default/etcd (per node, adjust NAME/IP)
    ETCD_NAME=etcd-1
    ETCD_DATA_DIR=/var/lib/etcd/postgres
    ETCD_LISTEN_PEER_URLS=http://10.0.0.11:2380
    ETCD_LISTEN_CLIENT_URLS=http://10.0.0.11:2379,http://127.0.0.1:2379
    ETCD_INITIAL_ADVERTISE_PEER_URLS=http://10.0.0.11:2380
    ETCD_ADVERTISE_CLIENT_URLS=http://10.0.0.11:2379
    ETCD_INITIAL_CLUSTER="etcd-1=http://10.0.0.11:2380,etcd-2=http://10.0.0.12:2380,etcd-3=http://10.0.0.13:2380"
    ETCD_INITIAL_CLUSTER_TOKEN=postgres-cluster
    ETCD_INITIAL_CLUSTER_STATE=new
    sudo systemctl enable --now etcd
    etcdctl --endpoints=http://10.0.0.11:2379 endpoint health

    Install Patroni 4.x in a virtualenv

    sudo apt install -y python3-venv python3-pip
    sudo python3 -m venv /opt/patroni
    sudo /opt/patroni/bin/pip install --upgrade pip
    sudo /opt/patroni/bin/pip install 'patroni[etcd3]' psycopg[binary]
    /etc/systemd/system/patroni.service
    [Unit]
    Description=Patroni
    After=network.target etcd.service
    Wants=etcd.service
    
    [Service]
    Type=simple
    User=postgres
    Group=postgres
    ExecStart=/opt/patroni/bin/patroni /etc/patroni/patroni.yml
    Restart=on-failure
    RestartSec=5
    
    [Install]
    WantedBy=multi-user.target

    patroni.yml Walkthrough

    /etc/patroni/patroni.yml (per node, adjust name + IPs)
    scope: pg-cluster
    namespace: /service/
    name: pg-1
    
    restapi:
      listen: 10.0.0.11:8008
      connect_address: 10.0.0.11:8008
    
    etcd3:
      hosts: 10.0.0.11:2379,10.0.0.12:2379,10.0.0.13:2379
    
    bootstrap:
      dcs:
        ttl: 30
        loop_wait: 10
        retry_timeout: 10
        maximum_lag_on_failover: 1048576
        synchronous_mode: false
        postgresql:
          use_pg_rewind: true
          parameters:
            shared_buffers: 1GB
            effective_cache_size: 2GB
            wal_level: replica
            hot_standby: on
            max_wal_senders: 10
            max_replication_slots: 10
      initdb:
        - encoding: UTF8
        - locale: C.UTF-8
        - data-checksums
    
      pg_hba:
        - host replication replicator 10.0.0.0/24 scram-sha-256
        - host all          all        10.0.0.0/24 scram-sha-256
    
    postgresql:
      listen: 0.0.0.0:5432
      connect_address: 10.0.0.11:5432
      data_dir: /var/lib/postgresql/17/main
      bin_dir: /usr/lib/postgresql/17/bin
      authentication:
        superuser:   { username: postgres,    password: 'change-me' }
        replication: { username: replicator, password: 'change-me' }
    
    tags:
      nofailover: false
      noloadbalance: false
      clonefrom: false

    Bootstrap the First Node

    # On pg-1 only:
    sudo systemctl enable --now patroni
    sudo journalctl -u patroni -f
    # Watch for: "i am the leader with the lock"

    Add Replica Nodes

    # On pg-2 and pg-3 (after editing patroni.yml with their own name/IP):
    sudo systemctl enable --now patroni
    patronictl -c /etc/patroni/patroni.yml list
    + Cluster: pg-cluster (7359...) -+---------+----+-----------+
    | Member | Host        | Role    | State   | TL | Lag in MB |
    +--------+-------------+---------+---------+----+-----------+
    | pg-1   | 10.0.0.11   | Leader  | running |  1 |           |
    | pg-2   | 10.0.0.12   | Replica | running |  1 |         0 |
    | pg-3   | 10.0.0.13   | Replica | running |  1 |         0 |
    +--------+-------------+---------+---------+----+-----------+

    Replication Slots

    Patroni manages physical replication slots automatically when use_slots: true (the default). Slots prevent the primary from recycling WAL a replica still needs, but they also pin WAL on the primary if a replica disappears — monitor pg_replication_slots and remove dead slots after a confirmed node loss.

    Synchronous vs Asynchronous Replication

    Async replication is faster and tolerates network blips; you can lose committed transactions on failover. Synchronous replication guarantees zero data loss but pauses commits when the sync replica is unreachable.

    Patroni DCS config
    synchronous_mode: true
    synchronous_mode_strict: false   # allow degrading to async if no sync standby is healthy
    synchronous_node_count: 1

    For most production workloads on three nodes, synchronous_mode: true with strict: false is the right balance.

    HAProxy Writer/Reader Endpoints

    /etc/haproxy/haproxy.cfg
    global
        maxconn 5000
    defaults
        timeout connect 5s
        timeout client 1h
        timeout server 1h
    
    listen postgres_write
        bind *:5000
        option httpchk OPTIONS /master
        http-check expect status 200
        default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
        server pg-1 10.0.0.11:5432 check port 8008
        server pg-2 10.0.0.12:5432 check port 8008
        server pg-3 10.0.0.13:5432 check port 8008
    
    listen postgres_read
        bind *:5001
        balance roundrobin
        option httpchk OPTIONS /replica
        http-check expect status 200
        server pg-1 10.0.0.11:5432 check port 8008
        server pg-2 10.0.0.12:5432 check port 8008
        server pg-3 10.0.0.13:5432 check port 8008

    Apps point at haproxy:5000 for writes and haproxy:5001 for read-only traffic.

    Testing Failover

    # Planned switchover (zero data loss):
    patronictl -c /etc/patroni/patroni.yml switchover
    
    # Forced failover (simulating a dead leader):
    patronictl -c /etc/patroni/patroni.yml failover
    
    # Real-world: kill the leader's network and watch:
    sudo iptables -A INPUT -p tcp --dport 5432 -j DROP
    sudo iptables -A INPUT -p tcp --dport 8008 -j DROP
    # In another shell on a healthy node:
    patronictl -c /etc/patroni/patroni.yml list

    Webhook Callbacks (Slack, Not Email)

    patroni.yml
    postgresql:
      callbacks:
        on_role_change: /usr/local/bin/patroni-notify.sh
    /usr/local/bin/patroni-notify.sh
    #!/bin/bash
    ROLE=$1; STATE=$2; CLUSTER=$3
    curl -s -X POST -H 'Content-Type: application/json' \
      -d "{\"text\":\":rotating_light: Patroni: $CLUSTER -> role=$ROLE state=$STATE on $(hostname)\"}" \
      https://hooks.slack.com/services/T000/B000/XXXX

    Monitoring with patronictl and the REST API

    patronictl -c /etc/patroni/patroni.yml list
    patronictl -c /etc/patroni/patroni.yml topology
    
    # REST API endpoints — perfect for Prometheus/blackbox checks:
    curl http://10.0.0.11:8008/cluster
    curl http://10.0.0.11:8008/health
    curl http://10.0.0.11:8008/replica

    Pitfalls

    • Clock skew — Patroni TTLs are wall-clock based. Run chrony or systemd-timesyncd on every node.
    • etcd quorum loss — losing two of three etcd nodes freezes leader election. Restore quorum before touching Patroni.
    • Replication lag — set maximum_lag_on_failover in DCS so a stale replica is never promoted.
    • pg_rewind requirements — needs wal_log_hints = on (Patroni handles this when use_pg_rewind: true).

    Rolling Upgrades

    Minor version: upgrade replicas first, switch over to a freshly-upgraded replica, upgrade the old leader.

    Major version: stand up a new cluster on the new major version, use logical replication to sync, cut traffic over. pg_upgrade in-place is possible but coordinated through the Patroni REST API and adds risk; prefer the parallel-cluster approach.

    What's Next

    A healthy HA cluster is one ransomware event or accidental DROP TABLE from being useless without backups. Part 7 covers continuous archiving, encrypted offsite backups, and verified PITR — including restoring to a new node when a Patroni leader dies for good.