WireGuard Mesh & Tunnel Series
    Part 4 of 6

    Nebula Decentralized Overlay with Self-Managed PKI

    Slack's overlay protocol: Noise + an offline CA + group-aware in-tunnel firewall. Lighthouses for discovery only — the control plane never sits in the data path.

    75 minutes
    2 GB RamNode VPS for the lighthouse, peers anywhere
    Prerequisites

    Ubuntu 24.04, at least one VPS with a stable public IP for the lighthouse

    Time

    ~75 minutes

    Outcome

    Decentralized overlay with PKI-backed identity and group-aware firewall

    Why Nebula Instead of a Managed Mesh

    Nebula is the overlay Slack open-sourced after running it across tens of thousands of nodes internally. It is architecturally different from Netbird and Netmaker in two ways that matter operationally:

    • No persistent control plane in the data path. Lighthouses help peers discover each other; once a tunnel is established, the lighthouse can disappear and existing tunnels keep working for as long as the peers' certificates stay valid and their endpoints stay reachable.
    • Identity is a certificate, not a database row. Every host gets an X.509-style certificate signed by a CA you keep offline. Expiry, group membership, the overlay IP, and any routable subnets are baked into the cert itself.

    The trade-off is operational: there is no dashboard. Provisioning, group changes, and revocation are CLI workflows you script yourself. In return you get an overlay with no SaaS dependency, no MQTT broker, and no failure mode where a downed control plane locks operators out of production.

    The Noise Protocol & PKI Model

    Nebula's handshake is built on the Noise Protocol Framework, the same framework WireGuard builds on (Nebula uses the IX pattern, WireGuard uses IK), and runs over UDP. Each peer presents a Nebula certificate that includes:

    • The peer's overlay IP (e.g. 10.42.0.7), bound to the cert, not negotiated.
    • A list of groups (e.g. prod, db, ops).
    • The CA's signature and an expiry timestamp.

    Because the CA key is kept offline, the only way to add a node is for an operator holding ca.key to sign a new cert. This is a security feature, not a bug.

    Sizing on RamNode

    Role                         RAM     Notes
    Lighthouse (any size mesh)   2 GB    Tiny memory footprint, mostly idle
    Edge nodes                   Any     Nebula uses ~10 MB RSS per agent
    High-throughput relay        4 GB+   For lighthouses that also relay

    For the rest of this part assume one RamNode VPS as the lighthouse with a stable public IP, and at least two more peers anywhere (other VPS, home, laptop).

    Bootstrapping the CA

    Do this on a workstation you trust, not on the lighthouse. The CA private key never touches a public-facing machine.

    # Download the latest release for your platform
    curl -L https://github.com/slackhq/nebula/releases/latest/download/nebula-linux-amd64.tar.gz \
      | tar xz nebula-cert nebula
    
    # Create the CA — long expiry; this is your trust root
    ./nebula-cert ca -name "RamNode Mesh CA" -duration 87600h
    ls -l ca.key ca.crt

    Store ca.key in a password manager or hardware-backed vault. Distribute ca.crt with every host config: it is public, and it is what peers use to verify each other's certificates.
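
    Before ca.key moves into long-term storage, it helps to tighten file permissions on the workstation copy. The sketch below is one way to do that; the function name lock_down_ca and its directory argument are illustrative and not part of Nebula, and stat -c assumes GNU coreutils.

```shell
#!/usr/bin/env bash
# Hypothetical helper: restrict permissions on the CA files in a directory.
# ca.key becomes owner-read-only; ca.crt stays world-readable because it is public.
lock_down_ca() {
  local dir="${1:-.}"
  [ -f "$dir/ca.key" ] && [ -f "$dir/ca.crt" ] || {
    echo "ca.key or ca.crt missing in $dir" >&2
    return 1
  }
  chmod 400 "$dir/ca.key"
  chmod 444 "$dir/ca.crt"
  # Print the resulting modes so the operator can confirm them (GNU stat).
  stat -c '%a %n' "$dir/ca.key" "$dir/ca.crt"
}
```

    Run it as lock_down_ca . in the directory where nebula-cert ca wrote its output. File permissions are only a stopgap; the offline vault remains the real control.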

    Lighthouse Setup

    Sign a cert for the lighthouse, then write its config:

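    ./nebula-cert sign -name "lighthouse-1" -ip "10.42.0.1/24" -groups "lighthouse,ops"
    # Create the config directory, then copy the binary (assumes the lighthouse is also linux-amd64) and credentials
    ssh root@<lighthouse-ip> 'mkdir -p /etc/nebula'
    scp nebula root@<lighthouse-ip>:/usr/local/bin/
    scp lighthouse-1.crt lighthouse-1.key ca.crt root@<lighthouse-ip>:/etc/nebula/

    /etc/nebula/config.yml (lighthouse)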
    pki:
      ca: /etc/nebula/ca.crt
      cert: /etc/nebula/lighthouse-1.crt
      key: /etc/nebula/lighthouse-1.key
    
    static_host_map:
      "10.42.0.1": ["<lighthouse-public-ip>:4242"]
    
    lighthouse:
      am_lighthouse: true
      interval: 60
    
    listen:
      host: 0.0.0.0
      port: 4242
    
    punchy:
      punch: true
    
    tun:
      dev: nebula1
      drop_local_broadcast: false
      drop_multicast: false
      tx_queue: 500
      mtu: 1300
    
    firewall:
      outbound: [{port: any, proto: any, host: any}]
      inbound:
        - {port: any, proto: icmp, host: any}
        - {port: any, proto: any, groups: [ops]}

    /etc/systemd/system/nebula.service
    [Unit]
    Description=Nebula
    After=network.target
    
    [Service]
    ExecStart=/usr/local/bin/nebula -config /etc/nebula/config.yml
    Restart=always
    
    [Install]
    WantedBy=multi-user.target

    systemctl daemon-reload
    systemctl enable --now nebula
    journalctl -u nebula -f

    Open UDP/4242 on the lighthouse — it is the only ingress requirement.

    Adding Nodes

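    On the workstation, sign a cert per node with the appropriate groups, then ship it. Without -duration, nebula-cert issues a cert that lasts until just before the CA expires; add -duration 2160h for the 90-day lifetime recommended in the hardening checklist.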

    ./nebula-cert sign -name "web-1"  -ip "10.42.0.10/24" -groups "web,prod"
    ./nebula-cert sign -name "db-1"   -ip "10.42.0.20/24" -groups "db,prod"
    ./nebula-cert sign -name "laptop" -ip "10.42.0.50/24" -groups "ops"
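
    With more than a handful of nodes, signing is easier to drive from an inventory file. The sketch below is a dry run that only prints nebula-cert sign commands; the function name sign_commands and the three-column inventory format (name, overlay IP, groups) are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: turn "name ip groups" inventory lines on stdin into
# nebula-cert sign commands. It only prints the commands (a dry run);
# review the output, then pipe it to sh on the CA workstation to sign.
sign_commands() {
  local duration="${1:-2160h}"   # 2160h = 90 days
  local name ip groups
  while read -r name ip groups; do
    # Skip blank lines and comments.
    case "$name" in ''|'#'*) continue ;; esac
    printf './nebula-cert sign -name "%s" -ip "%s" -groups "%s" -duration %s\n' \
      "$name" "$ip" "$groups" "$duration"
  done
}

# Example inventory matching the hosts above.
sign_commands 2160h <<'EOF'
# name   overlay-ip      groups
web-1    10.42.0.10/24   web,prod
db-1     10.42.0.20/24   db,prod
laptop   10.42.0.50/24   ops
EOF
```

    Keeping the inventory in version control gives you a record of which name, IP, and groups each cert was issued with, which the rotation step later needs.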

    Each peer's config.yml looks like the lighthouse config but flips am_lighthouse: false and adds the lighthouse to the lighthouse.hosts list:

    /etc/nebula/config.yml (peer excerpt)
    lighthouse:
      am_lighthouse: false
      interval: 60
      hosts:
        - "10.42.0.1"
    
    static_host_map:
      "10.42.0.1": ["<lighthouse-public-ip>:4242"]

    Start the service. Within a few seconds the nebula1 interface appears, the host registers with the lighthouse, and it begins UDP hole-punching to other peers as traffic to them requires.

    In-Tunnel Firewall and Groups

    The Nebula firewall block is evaluated inside the tunnel and is the primary access control mechanism. Rules can target by group, by individual cert name, by CIDR, by port, or by protocol. A practical example for the db-1 host:

    /etc/nebula/config.yml (db-1 firewall)
    firewall:
      conntrack:
        tcp_timeout:    12m
        udp_timeout:    3m
        default_timeout: 10m
    
      outbound:
        - {port: any, proto: any, host: any}
    
      inbound:
        # ICMP from anyone, for sanity checks
        - {port: any, proto: icmp, host: any}
        # PostgreSQL only from the web tier in prod
        - {port: 5432, proto: tcp, groups: [web, prod]}
        # SSH only from ops
        - {port: 22,   proto: tcp, group: ops}

    Two important rules: groups: (plural, list) requires all listed groups; group: (singular) requires just that one. Mixing them up is the most common source of "why can my laptop hit Postgres".
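
    The all-versus-one distinction can be modeled in a few lines of shell. This is an illustrative model of the matching rule described above, not Nebula's implementation; rule_allows is a made-up name.

```shell
#!/usr/bin/env bash
# Illustrative model only, not Nebula's code.
# rule_allows RULE_GROUPS CERT_GROUPS
#   Both arguments are comma-separated group lists.
#   Succeeds only if every group in the rule is present in the cert.
rule_allows() {
  local rule_groups="$1" cert_groups="$2" g
  local IFS=','
  for g in $rule_groups; do
    case ",$cert_groups," in
      *",$g,"*) ;;          # this required group is present
      *) return 1 ;;        # one missing group fails the whole rule
    esac
  done
  return 0
}

# The db-1 PostgreSQL rule uses groups: [web, prod]
rule_allows "web,prod" "web,prod" && echo "web-1 (web,prod): allowed"
rule_allows "web,prod" "ops"      || echo "laptop (ops): denied"
rule_allows "web,prod" "db,prod"  || echo "db host (db,prod): denied"
```

    A cert carrying only prod fails the rule because web is missing, which is the behavior to keep in mind when a list rule seems to block a host it "obviously" covers.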

    Certificate Rotation Playbook

    Cert expiry is enforced — an expired cert means the host disconnects. Build a rotation runbook before you need it.

    1. On the offline CA workstation, sign a new cert with the same name, IP, and groups.
    2. Push .crt and .key over an existing trusted channel (SSH or even the existing Nebula tunnel).
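    3. Reload Nebula by sending it SIGHUP: systemctl kill -s HUP nebula. It re-reads the cert without dropping established tunnels. (systemctl reload only works if the unit defines ExecReload, which the unit shown earlier does not.)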
    4. Inspect the new cert's expiry: nebula-cert print -path /etc/nebula/host.crt.
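
    To decide when to run the playbook, a small check against the cert's expiry timestamp is enough. The sketch below assumes GNU date and takes the timestamp as an argument; days_until_expiry and needs_rotation are hypothetical helper names, and the 14-day default threshold is an arbitrary choice.

```shell
#!/usr/bin/env bash
# Hypothetical helper: whole days between "now" and a cert's expiry.
# Requires GNU date (standard on Ubuntu).
#   $1  expiry timestamp in a format date -d understands (e.g. RFC 3339)
#   $2  optional "now" as epoch seconds, useful for testing
days_until_expiry() {
  local not_after="$1" now="${2:-$(date +%s)}" expiry
  expiry=$(date -u -d "$not_after" +%s) || return 2
  echo $(( (expiry - now) / 86400 ))
}

# Exit status 0 when fewer than $2 days (default 14) remain, 1 otherwise,
# 2 if the timestamp could not be parsed.
needs_rotation() {
  local days
  days=$(days_until_expiry "$1" "${3:-}") || return 2
  [ "$days" -lt "${2:-14}" ]
}
```

    The expiry timestamp can be read from the output of nebula-cert print; the -json flag makes it easier to parse, though the field layout may differ between certificate versions, so check it against your own output before scripting around it.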

    For revocation: Nebula does not have an online CRL. The pattern is to issue short-lived certs (90 days) and let expiry handle the bulk of "revocation". For a single compromised host, add its certificate fingerprint to pki.blocklist in every peer's config and reload. For a compromised CA key, rotate the CA and reissue everything, which is painful and is why CA key custody is the single most important operational discipline here.

    Unsafe Routes & Egress

    Nebula calls non-overlay subnets unsafe routes. To make 192.168.50.0/24 behind a peer reachable from the overlay, sign that gateway peer's cert with -subnets 192.168.50.0/24, then add the route to the config of every other peer that needs to reach the subnet:

    config.yml (peers using the route)
    tun:
      unsafe_routes:
        - route: 192.168.50.0/24
          via: 10.42.0.30   # overlay IP of the gateway peer
          mtu: 1300
          install: true

    The gateway peer itself does not need this entry. It needs IP forwarding enabled (net.ipv4.ip_forward=1) and, unless the LAN routes 10.42.0.0/24 back through it, a NAT rule for traffic leaving toward 192.168.50.0/24. The signing step is what authorizes the route: a peer cannot act as gateway for subnets its cert does not include.

    Metrics & Observability

    Nebula exposes a Prometheus endpoint:

    config.yml
    stats:
      type: prometheus
      listen: 127.0.0.1:8080
      path: /metrics
      namespace: nebula
      interval: 10s

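    Useful series include nebula_handshakes, nebula_lighthouse_messages_received_bucket, and nebula_meta_chan_tx; exact names vary between releases, so check the output of curl http://127.0.0.1:8080/metrics on your build. Pair with the standard Prometheus + Grafana stack from the Monitoring series.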
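
    To see which series a given build actually exports, the endpoint output can be reduced to a list of metric names. A minimal sketch, assuming the stats block above is enabled on 127.0.0.1:8080; metric_names is a hypothetical helper, not a Nebula tool.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read Prometheus text-format output on stdin and
# print the unique metric names, dropping comment lines, blank lines,
# label sets, and sample values.
metric_names() {
  grep -v '^#' | grep -v '^[[:space:]]*$' | sed -E 's/[{ ].*$//' | sort -u
}

# Usage on a host with the stats block enabled:
#   curl -s http://127.0.0.1:8080/metrics | metric_names | grep '^nebula_'
```

    Running this once after each upgrade is a cheap way to catch renamed series before dashboards and alerts go quiet.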

    Hardening Checklist

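    1. CA key custody. The single thing that matters most. Keep it offline, in a hardware-backed vault or password manager only.
    2. Short cert lifetimes. 90 days for host certs, with automated rotation. No 10-year host certs; the 10-year lifetime used earlier is for the CA only.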
    3. Default-deny inbound firewall. Only the rules you mean to allow.
    4. Run nebula as non-root with CAP_NET_ADMIN capability instead of full root.
    5. Lock down the lighthouse. Only UDP/4242 inbound. SSH on a non-overlay path with key auth + fail2ban.
    6. Audit who has the CA key annually. Rotate operators out promptly.
    7. Ship logs off-host. journalctl -u nebula to your central syslog or Loki.

    Troubleshooting

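    • Peer never appears. Is UDP/4242 on the lighthouse reachable? UDP is connectionless, so nc -u -v <lighthouse> 4242 can report success even when packets are dropped; run tcpdump -ni any udp port 4242 on the lighthouse while the peer starts to confirm packets arrive. Most failures are firewall, not Nebula.
    • Cert rejected on start. Run nebula-cert print -path on both the host .crt and ca.crt, or nebula-cert verify -ca ca.crt -crt <host>.crt; an expired cert or the wrong CA is the usual cause.
    • Two peers behind the same NAT cannot punch. Set a host both can reach (typically the lighthouse) as a relay (relay.am_relay: true) and list its overlay IP in relay.relays on each peer that cannot be reached directly.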
    • Firewall rule "obviously" matches but traffic dropped. Check groups vs group. Set logging.level: debug briefly to see why a packet was denied.
    • Throughput lower than WireGuard. Expected — Nebula is userspace; WireGuard is in-kernel. For most overlays it does not matter; for 10 Gbps line-rate it does.