Part 4 of 6

Nebula: Decentralized Overlay with Self-Managed PKI

Slack's overlay protocol: Noise + an offline CA + group-aware in-tunnel firewall. Lighthouses for discovery only — the control plane never sits in the data path.

75 minutes

2 GB RamNode VPS for the lighthouse, peers anywhere

Prerequisites

Ubuntu 24.04, at least one VPS with a stable public IP for the lighthouse

Time

~75 minutes

Outcome

Decentralized overlay with PKI-backed identity and group-aware firewall

Why Nebula Instead of a Managed Mesh

Nebula is the overlay Slack open-sourced after running it across tens of thousands of nodes internally. It is architecturally different from Netbird and Netmaker in two ways that matter operationally:

• No persistent control plane in the data path. Lighthouses help peers discover each other; once a tunnel is established, the lighthouse can disappear and existing tunnels keep working for as long as the peers' underlay addresses stay the same and their certs remain valid.
• Identity is a certificate, not a database row. Every host gets a certificate in Nebula's own format (similar in spirit to X.509, but not X.509) signed by a CA you keep offline. Expiry, group membership, the overlay IP, and any routed subnets are baked into the cert itself.

The trade-off is operational: there is no dashboard. Provisioning, group changes, and revocation are CLI workflows you script yourself. In return you get an overlay with no SaaS dependency, no MQTT broker, and no failure mode where a downed control plane locks operators out of production.

The Noise Protocol & PKI Model

Nebula uses the Noise IX handshake pattern over UDP (WireGuard builds on the same Noise Protocol Framework, using the IK pattern). Each peer presents a Nebula certificate that includes:

• The peer's overlay IP (e.g. 10.42.0.7) — bound to the cert, not negotiated.
• A list of groups (e.g. prod, db, ops).
• The CA's signature and an expiry timestamp.

Because the CA private key stays offline, the only way to add a node is for an operator holding ca.key to sign a new cert. This is a security feature, not a bug.

Sizing on RamNode

Role                         RAM     Notes
Lighthouse (any size mesh)   2 GB    Tiny memory footprint, mostly idle
Edge nodes                   Any     Nebula uses ~10 MB RSS per agent
High-throughput relay        4 GB+   For lighthouses that also relay

For the rest of this part assume one RamNode VPS as the lighthouse with a stable public IP, and at least two more peers anywhere (other VPS, home, laptop).

Bootstrapping the CA

Do this on a workstation you trust, not on the lighthouse. The CA private key never touches a public-facing machine.

# Download the latest release for your platform (linux-amd64 shown)
curl -fL https://github.com/slackhq/nebula/releases/latest/download/nebula-linux-amd64.tar.gz \
  | tar -xzf - nebula-cert nebula

# Create the CA — long expiry; this is your trust root
./nebula-cert ca -name "RamNode Mesh CA" -duration 87600h
ls -l ca.key ca.crt

Store ca.key in a password manager or hardware-backed vault. Distribute ca.crt with every host config — it is public and how peers verify each other.
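Since ca.key must never leave the workstation, a small pre-flight check before copying files to a host can catch mistakes. The following is a minimal sketch, not part of Nebula: the function name bundle_is_safe and the idea of a per-host staging directory are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical pre-flight check: refuse to ship a staging directory that
# contains the CA private key. The function name and the staging-directory
# layout are assumptions for illustration, not part of Nebula.
bundle_is_safe() {
    dir=$1
    if [ -e "$dir/ca.key" ]; then
        echo "refusing: $dir contains ca.key" >&2
        return 1
    fi
    # ca.crt is public and must be present so the host can verify peers.
    if [ ! -e "$dir/ca.crt" ]; then
        echo "refusing: $dir is missing ca.crt" >&2
        return 1
    fi
    return 0
}

# Example: stage only what one host needs, then check before copying.
#   mkdir -p stage/host-1
#   cp ca.crt host-1.crt host-1.key stage/host-1/
#   bundle_is_safe stage/host-1 && echo "ok to copy stage/host-1"
```

Running the check from the same script that performs the copy keeps the rule enforced even when someone copies with a wildcard.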

Lighthouse Setup

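Sign a cert for the lighthouse and copy it to /etc/nebula (create the directory first). The lighthouse also needs the nebula binary from the release tarball installed at /usr/local/bin/nebula, the path the systemd unit expects. Then write its config: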

./nebula-cert sign -name "lighthouse-1" -ip "10.42.0.1/24" -groups "lighthouse,ops"
scp lighthouse-1.crt lighthouse-1.key ca.crt root@<lighthouse-ip>:/etc/nebula/

/etc/nebula/config.yml (lighthouse)

pki:
  ca: /etc/nebula/ca.crt
  cert: /etc/nebula/lighthouse-1.crt
  key: /etc/nebula/lighthouse-1.key

static_host_map:
  "10.42.0.1": ["<lighthouse-public-ip>:4242"]

lighthouse:
  am_lighthouse: true
  interval: 60

listen:
  host: 0.0.0.0
  port: 4242

punchy:
  punch: true

tun:
  dev: nebula1
  drop_local_broadcast: false
  drop_multicast: false
  tx_queue: 500
  mtu: 1300

firewall:
  outbound: [{port: any, proto: any, host: any}]
  inbound:
    - {port: any, proto: icmp, host: any}
    - {port: any, proto: any, groups: [ops]}

/etc/systemd/system/nebula.service

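[Unit]
Description=Nebula
Wants=network-online.target
After=network-online.target

[Service]
ExecStart=/usr/local/bin/nebula -config /etc/nebula/config.yml
# SIGHUP makes Nebula re-read its config and certs; enables "systemctl reload nebula"
ExecReload=/bin/kill -HUP $MAINPID
Restart=always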

[Install]
WantedBy=multi-user.target

systemctl daemon-reload
systemctl enable --now nebula
journalctl -u nebula -f

Open UDP/4242 on the lighthouse — it is the only ingress requirement.

Adding Nodes

On the workstation, sign a cert per node with the appropriate groups, then ship it:

./nebula-cert sign -name "web-1"  -ip "10.42.0.10/24" -groups "web,prod"
./nebula-cert sign -name "db-1"   -ip "10.42.0.20/24" -groups "db,prod"
./nebula-cert sign -name "laptop" -ip "10.42.0.50/24" -groups "ops"
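The sign commands can also be generated from a small inventory, so the same list can be replayed at rotation time. This is a sketch under assumptions: the sign_cmd helper and the "name ip groups" inventory format are hypothetical, and -duration 2160h (90 days) is added to match the short-lifetime advice in the hardening checklist. It only prints the commands; nothing is signed.

```shell
#!/bin/sh
# Print (not run) one nebula-cert sign command per inventory line.
# The inventory format "name ip/prefix groups" is an assumption for this sketch.
sign_cmd() {
    # $1 = cert name, $2 = overlay IP with prefix, $3 = comma-separated groups
    printf './nebula-cert sign -name "%s" -ip "%s" -groups "%s" -duration 2160h\n' "$1" "$2" "$3"
}

inventory='web-1 10.42.0.10/24 web,prod
db-1 10.42.0.20/24 db,prod
laptop 10.42.0.50/24 ops'

echo "$inventory" | while read -r name ip groups; do
    sign_cmd "$name" "$ip" "$groups"
done
```

Review the printed commands, then run them on the CA workstation in the directory that holds ca.key and ca.crt.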

Each peer's config.yml looks like the lighthouse config, but it points pki.cert and pki.key at the peer's own files, sets am_lighthouse: false, and adds the lighthouse's overlay IP to the lighthouse.hosts list:

/etc/nebula/config.yml (peer excerpt)

lighthouse:
  am_lighthouse: false
  interval: 60
  hosts:
    - "10.42.0.1"

static_host_map:
  "10.42.0.1": ["<lighthouse-public-ip>:4242"]

Start the service. Within a few seconds the nebula1 interface appears, the host registers with the lighthouse, and it begins UDP hole punching toward other peers once it has traffic for them.
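Because only the lighthouse addresses differ between deployments, the peer excerpt can be rendered from two values. A minimal sketch: the render_peer_lighthouse function and its arguments are hypothetical, 203.0.113.10 is a documentation address standing in for the lighthouse's public IP, and the YAML keys are the ones from the peer excerpt.

```shell
#!/bin/sh
# Render the peer-specific part of config.yml. The function name and its
# arguments are hypothetical; the YAML keys come from the peer excerpt.
render_peer_lighthouse() {
    lh_overlay_ip=$1   # lighthouse overlay IP, e.g. 10.42.0.1
    lh_public_addr=$2  # lighthouse public address and port (placeholder below)
    cat <<EOF
lighthouse:
  am_lighthouse: false
  interval: 60
  hosts:
    - "$lh_overlay_ip"

static_host_map:
  "$lh_overlay_ip": ["$lh_public_addr"]
EOF
}

render_peer_lighthouse 10.42.0.1 203.0.113.10:4242
```

The output is meant to be merged into the rest of the peer's config.yml, not used as a complete config.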

In-Tunnel Firewall and Groups

The Nebula firewall block is evaluated inside the tunnel and is the primary access control mechanism. Rules can target by group, by individual cert name, by CIDR, by port, or by protocol. A practical example for the db-1 host:

/etc/nebula/config.yml (db-1 firewall)

firewall:
  conntrack:
    tcp_timeout:    12m
    udp_timeout:    3m
    default_timeout: 10m

  outbound:
    - {port: any, proto: any, host: any}

  inbound:
    # ICMP from anyone, for sanity checks
    - {port: any, proto: icmp, host: any}
    # PostgreSQL only from the web tier in prod
    - {port: 5432, proto: tcp, groups: [web, prod]}
    # SSH only from ops
    - {port: 22,   proto: tcp, group: ops}

Two important rules: groups: (plural, list) requires all listed groups; group: (singular) requires just that one. Mixing them up is the most common source of "why can my laptop hit Postgres".
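The difference is easier to see when it is modeled directly. The sketch below is an illustrative model of the matching semantics only, not Nebula's implementation; has_all_groups is a hypothetical helper, and the group sets are the ones given to web-1, db-1, and laptop when their certs were signed.

```shell
#!/bin/sh
# Illustrative model of the matching semantics only; not Nebula's code.
# has_all_groups REQUIRED CERT_GROUPS   (both comma-separated)
has_all_groups() {
    required=$1
    cert_groups=",$2,"
    old_ifs=$IFS
    IFS=,
    for g in $required; do
        case $cert_groups in
            *",$g,"*) ;;                    # this required group is present
            *) IFS=$old_ifs; return 1 ;;    # one missing group fails the rule
        esac
    done
    IFS=$old_ifs
    return 0
}

# groups: [web, prod] requires both groups on the caller's cert
has_all_groups "web,prod" "web,prod" && echo "web-1: allowed on 5432"
has_all_groups "web,prod" "db,prod"  || echo "db-1: denied on 5432"
has_all_groups "web,prod" "ops"      || echo "laptop: denied on 5432"
# group: ops is the single-group case
has_all_groups "ops" "ops"           && echo "laptop: allowed on 22"
```

Writing the Postgres rule as two separate rules, one with group: web and one with group: prod, would instead admit any cert carrying either group.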

Certificate Rotation Playbook

Cert expiry is enforced — an expired cert means the host disconnects. Build a rotation runbook before you need it.

1. On the offline CA workstation, sign a new cert with the same name, IP, and groups.
2. Push the .crt and .key over an existing trusted channel (SSH, or even the existing Nebula tunnel).
3. Reload Nebula: systemctl reload nebula. Nebula re-reads the cert on SIGHUP without dropping established tunnels. This needs an ExecReload=/bin/kill -HUP $MAINPID line in the unit; without one, use systemctl kill -s HUP nebula.
4. Inspect the new cert's expiry: nebula-cert print -path /etc/nebula/host.crt.

For revocation: Nebula has no online CRL. The main pattern is to issue short-lived certs (90 days) and let expiry handle the bulk of "revocation". For a single compromised host, add its cert fingerprint (shown by nebula-cert print) to pki.blocklist in every peer's config and reload. For a compromised CA key, rotate the CA and reissue everything, which is painful and is why CA key custody is the single most important operational discipline here.
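With short-lived certs, rotation is driven by a "days until expiry" check. The following is a sketch under assumptions: days_left and needs_rotation are hypothetical helpers, GNU date is required (as on Ubuntu), and the jq path in the comment assumes nebula-cert print -json exposes the expiry as details.notAfter, which should be verified against your Nebula version.

```shell
#!/bin/sh
# days_left NOT_AFTER [NOW_EPOCH]: whole days until an RFC 3339 timestamp.
# Relies on GNU date (-d). NOW_EPOCH defaults to the current time.
days_left() {
    not_after_epoch=$(date -u -d "$1" +%s) || return 1
    now_epoch=${2:-$(date -u +%s)}
    echo $(( (not_after_epoch - now_epoch) / 86400 ))
}

# needs_rotation NOT_AFTER THRESHOLD_DAYS [NOW_EPOCH]
# Succeeds when fewer than THRESHOLD_DAYS remain.
needs_rotation() {
    [ "$(days_left "$1" "$3")" -lt "$2" ]
}

# Fixed "now" 30 days before the expiry, so the result is reproducible.
days_left "2030-01-01T00:00:00Z" 1890864000   # prints 30

# Against a real cert (assumption: expiry is at .details.notAfter):
#   not_after=$(nebula-cert print -json -path /etc/nebula/host.crt | jq -r '.details.notAfter')
#   needs_rotation "$not_after" 14 && echo "rotate soon"
```

Run from cron or a systemd timer, a check like this turns the rotation runbook into an alert well before the cert lapses.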

Unsafe Routes & Egress

Nebula calls non-overlay subnets unsafe routes. To make 192.168.50.0/24 behind a gateway peer (overlay IP 10.42.0.30) reachable from the overlay, sign the gateway's cert with -subnets 192.168.50.0/24, enable IP forwarding on it (plus NAT or return routes on the LAN), and make sure its inbound firewall permits the traffic. Then add the route to the config of every peer that should reach the subnet:

config.yml (peers using the gateway)

tun:
  unsafe_routes:
    - route: 192.168.50.0/24
      via: 10.42.0.30   # the gateway peer's overlay IP
      mtu: 1300
      install: true

The gateway itself does not need the unsafe_routes entry. The signing step is what authorizes the route: a peer cannot act as gateway for subnets its cert does not include.
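The peer that fronts the subnet also has to forward packets between the overlay and its LAN. A minimal sketch of one common approach on Linux, assuming source NAT is acceptable: gateway_cmds is a hypothetical helper, eth0 is a placeholder for the LAN-facing interface, and the commands are only printed for review rather than executed. A return route on the LAN router is an alternative to masquerading.

```shell
#!/bin/sh
# Print (not run) the commands a Linux gateway peer would typically need so
# overlay traffic can reach the LAN behind it. The function name is
# hypothetical; eth0 below is a placeholder interface name.
gateway_cmds() {
    overlay_cidr=$1   # overlay network, e.g. 10.42.0.0/24
    lan_if=$2         # LAN-facing interface on the gateway
    echo "sysctl -w net.ipv4.ip_forward=1"
    echo "iptables -t nat -A POSTROUTING -s $overlay_cidr -o $lan_if -j MASQUERADE"
}

gateway_cmds 10.42.0.0/24 eth0
```

Both settings are lost on reboot unless persisted, for example through /etc/sysctl.d and the distribution's firewall persistence mechanism.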

Metrics & Observability

Nebula exposes a Prometheus endpoint:

config.yml

stats:
  type: prometheus
  listen: 127.0.0.1:8080
  path: /metrics
  namespace: nebula
  interval: 10s

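Series such as nebula_handshakes, nebula_lighthouse_messages_received_bucket, and nebula_meta_chan_tx are a starting point, but exported names vary with the Nebula version and the namespace setting, so check the endpoint's output before building dashboards. Pair with the standard Prometheus + Grafana stack from the Monitoring series.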
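To see which series a given node actually exports, the endpoint's text output can be reduced to a list of names. A small sketch: metric_names is a hypothetical helper that reads Prometheus text exposition format on stdin, and the URL in the comment is built from the listen and path values in the stats block.

```shell
#!/bin/sh
# List the distinct series names in Prometheus text exposition format read
# from stdin, skipping comment lines (# HELP / # TYPE) and blank lines.
metric_names() {
    # Strip everything from the first "{" or space, leaving the bare name.
    awk '!/^#/ && NF { sub(/[{ ].*/, "", $0); print }' | sort -u
}

# Against a running node:
#   curl -s http://127.0.0.1:8080/metrics | metric_names
```

Piping the result through grep handshake or grep lighthouse is a quick way to confirm a name before referencing it in an alert rule.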

Hardening Checklist

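CA key custody. The single thing that matters most. Keep the key offline, in a hardware-backed vault or a password manager.
Short host cert lifetimes. 90 days, automated rotation: pass -duration 2160h to nebula-cert sign, because without -duration a host cert stays valid until just before the CA expires. No 10-year host certs.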
Default-deny inbound firewall. Only the rules you mean to allow.
Run nebula as non-root with CAP_NET_ADMIN capability instead of full root.
Lock down the lighthouse. Only UDP/4242 inbound. SSH on a non-overlay path with key auth + fail2ban.
Audit who has the CA key annually. Rotate operators out promptly.
Ship logs off-host. journalctl -u nebula to your central syslog or Loki.
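For the non-root item, one option is a systemd drop-in on top of the existing unit. This is a sketch under assumptions: the nebula user and group are hypothetical and must be created first, /etc/nebula must be readable by that user, and nebula_dropin is only a helper that prints the fragment.

```shell
#!/bin/sh
# Print a systemd drop-in that runs Nebula as an unprivileged user while
# keeping the one capability it needs to manage the tun device.
# The "nebula" user and group names are assumptions for this sketch.
nebula_dropin() {
    cat <<'EOF'
[Service]
User=nebula
Group=nebula
AmbientCapabilities=CAP_NET_ADMIN
CapabilityBoundingSet=CAP_NET_ADMIN
EOF
}

nebula_dropin
# To apply (as root):
#   mkdir -p /etc/systemd/system/nebula.service.d
#   nebula_dropin > /etc/systemd/system/nebula.service.d/nonroot.conf
#   systemctl daemon-reload && systemctl restart nebula
```

After the restart, ps -o user= -C nebula should show the unprivileged user, and the nebula1 interface should still come up.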

Troubleshooting

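• Peer never appears. Is lighthouse UDP/4242 reachable? nc -u -v <lighthouse> 4242 proves little, because UDP is connectionless and nc reports success without a reply; run tcpdump -ni any udp port 4242 on the lighthouse and watch for the peer's packets instead. Most failures are firewall, not Nebula.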
• Cert rejected on start. nebula-cert print -path on both .crt and ca.crt — expired or wrong CA is the usual cause.
• Two peers behind restrictive NATs cannot punch. Set a host both can reach, usually the lighthouse, as a relay (relay.am_relay: true) and list its overlay IP in each peer's relay.relays.
• Firewall rule "obviously" matches but traffic dropped. Check groups vs group. Set logging.level: debug briefly to see why a packet was denied.
• Throughput lower than WireGuard. Expected — Nebula is userspace; WireGuard is in-kernel. For most overlays it does not matter; for 10 Gbps line-rate it does.
