Nebula Decentralized Overlay with Self-Managed PKI
Slack's overlay protocol: Noise + an offline CA + group-aware in-tunnel firewall. Lighthouses for discovery only — the control plane never sits in the data path.
Ubuntu 24.04, at least one VPS with a stable public IP for the lighthouse
~75 minutes
Decentralized overlay with PKI-backed identity and group-aware firewall
Why Nebula Instead of a Managed Mesh
Nebula is the overlay Slack open-sourced after running it across tens of thousands of nodes internally. It is architecturally different from Netbird and Netmaker in two ways that matter operationally:
- No persistent control plane in the data path. Lighthouses help peers discover each other; once a tunnel is established, the lighthouse can disappear and existing tunnels keep working indefinitely.
- Identity is a certificate, not a database row. Every host gets an x509-style certificate signed by a CA you keep offline. Expiry, group membership, the overlay IP, and routable subnets are baked into the cert itself.
The trade-off is operational: there is no dashboard. Provisioning, group changes, and revocation are CLI workflows you script yourself. In return you get an overlay with no SaaS dependency, no MQTT broker, and no failure mode where a downed control plane locks operators out of production.
The Noise Protocol & PKI Model
Nebula uses the Noise IX handshake (from the same Noise Protocol Framework that WireGuard builds on) over UDP. Each peer presents a Nebula certificate that includes:
- The peer's overlay IP (e.g. 10.42.0.7), bound to the cert, not negotiated.
- A list of groups (e.g. prod,db,ops).
- The CA's signature and an expiry timestamp.
Because the CA root key is offline, the only way to add a node is for an operator with the ca.key to sign a new cert. This is the security feature, not a bug.
Sizing on RamNode
| Role | RAM | Notes |
|---|---|---|
| Lighthouse (any size mesh) | 2 GB | Tiny memory footprint, mostly idle |
| Edge nodes | Any | Nebula uses ~10 MB RSS per agent |
| High-throughput relay | 4 GB+ | For lighthouses that also relay |

For the rest of this part, assume one RamNode VPS as the lighthouse with a stable public IP, and at least two more peers anywhere (other VPS, home, laptop).
Bootstrapping the CA
Do this on a workstation you trust, not on the lighthouse. The CA private key never touches a public-facing machine.
# Download the latest release for your platform
curl -L https://github.com/slackhq/nebula/releases/latest/download/nebula-linux-amd64.tar.gz \
| tar xz nebula-cert nebula
# Create the CA — long expiry; this is your trust root
./nebula-cert ca -name "RamNode Mesh CA" -duration 87600h
ls -l ca.key ca.crt

Store ca.key in a password manager or hardware-backed vault. Distribute ca.crt with every host config; it is public, and it is how peers verify each other.
Lighthouse Setup
Sign a cert for the lighthouse, then write its config:
./nebula-cert sign -name "lighthouse-1" -ip "10.42.0.1/24" -groups "lighthouse,ops"
scp lighthouse-1.crt lighthouse-1.key ca.crt root@<lighthouse-ip>:/etc/nebula/

# /etc/nebula/config.yml
pki:
  ca: /etc/nebula/ca.crt
  cert: /etc/nebula/lighthouse-1.crt
  key: /etc/nebula/lighthouse-1.key
static_host_map:
  "10.42.0.1": ["<lighthouse-public-ip>:4242"]
lighthouse:
  am_lighthouse: true
  interval: 60
listen:
  host: 0.0.0.0
  port: 4242
punchy:
  punch: true
tun:
  dev: nebula1
  drop_local_broadcast: false
  drop_multicast: false
  tx_queue: 500
  mtu: 1300
firewall:
  outbound: [{port: any, proto: any, host: any}]
  inbound:
    - {port: any, proto: icmp, host: any}
    - {port: any, proto: any, groups: [ops]}

[Unit]
Description=Nebula
After=network.target
[Service]
ExecStart=/usr/local/bin/nebula -config /etc/nebula/config.yml
ExecReload=/bin/kill -HUP $MAINPID
Restart=always
[Install]
WantedBy=multi-user.target

systemctl daemon-reload
systemctl enable --now nebula
journalctl -u nebula -f

Open UDP/4242 on the lighthouse; it is the only ingress requirement.
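Before starting the service, it can help to confirm that the files the pki block points at are actually in place. The following is a minimal sketch, not part of Nebula itself: the helper name check_pki_files is hypothetical, and it assumes the cert and key are named after the host, as in the lighthouse config above.

```shell
# check_pki_files DIR NAME
# Prints each missing PKI file and fails if any of ca.crt, NAME.crt,
# or NAME.key is absent from DIR.
# The body runs in a subshell so its variables stay local.
check_pki_files() (
  dir=$1
  name=$2
  missing=0
  for f in "$dir/ca.crt" "$dir/$name.crt" "$dir/$name.key"; do
    if [ ! -f "$f" ]; then
      echo "missing: $f"
      missing=1
    fi
  done
  exit "$missing"
)
```

A possible use is `check_pki_files /etc/nebula lighthouse-1 && nebula -test -config /etc/nebula/config.yml`, where the -test flag asks Nebula to validate the config and exit without bringing up the tunnel.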
Adding Nodes
On the workstation, sign a cert per node with the appropriate groups, then ship it:
./nebula-cert sign -name "web-1" -ip "10.42.0.10/24" -groups "web,prod"
./nebula-cert sign -name "db-1" -ip "10.42.0.20/24" -groups "db,prod"
./nebula-cert sign -name "laptop" -ip "10.42.0.50/24" -groups "ops"

Each peer's config.yml looks like the lighthouse config but flips am_lighthouse: false and adds the lighthouse to the lighthouse.hosts list:
lighthouse:
  am_lighthouse: false
  interval: 60
  hosts:
    - "10.42.0.1"
static_host_map:
  "10.42.0.1": ["<lighthouse-public-ip>:4242"]

Start the service. Within a few seconds a nebula1 interface appears, registers with the lighthouse, and begins UDP hole-punching to the other peers it has reason to talk to.
In-Tunnel Firewall and Groups
The Nebula firewall block is evaluated inside the tunnel and is the primary access control mechanism. Rules can target by group, by individual cert name, by CIDR, by port, or by protocol. A practical example for the db-1 host:
firewall:
  conntrack:
    tcp_timeout: 12m
    udp_timeout: 3m
    default_timeout: 10m
  outbound:
    - {port: any, proto: any, host: any}
  inbound:
    # ICMP from anyone, for sanity checks
    - {port: any, proto: icmp, host: any}
    # PostgreSQL only from the web tier in prod
    - {port: 5432, proto: tcp, groups: [web, prod]}
    # SSH only from ops
    - {port: 22, proto: tcp, group: ops}

Two important rules: groups: (plural, list) requires all listed groups; group: (singular) requires just that one. Mixing them up is the most common source of "why can my laptop hit Postgres".
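The AND behavior of groups: is easy to misread, so here is a small model of it in shell. This is an illustration of the matching rule described above, not Nebula's implementation; the function name has_all_groups is hypothetical, and both arguments are comma-separated lists like the ones passed to -groups.

```shell
# has_all_groups CERT_GROUPS REQUIRED_GROUPS
# Succeeds only if every required group appears in the cert's groups,
# mirroring how a `groups: [a, b]` rule requires both a and b.
# The body runs in a subshell so IFS and variables stay local.
has_all_groups() (
  cert=",$1,"
  IFS=,
  for g in $2; do
    case "$cert" in
      *",$g,"*) ;;       # this required group is present
      *) exit 1 ;;       # one missing group is enough to reject
    esac
  done
  exit 0
)
```

With the certs signed earlier, `has_all_groups "web,prod" "web,prod"` succeeds for web-1, while `has_all_groups "ops" "web,prod"` fails for the laptop, which is the outcome the PostgreSQL rule is meant to produce.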
Certificate Rotation Playbook
Cert expiry is enforced — an expired cert means the host disconnects. Build a rotation runbook before you need it.
- On the offline CA workstation, sign a new cert with the same name, IP, and groups.
- Push .crt and .key over an existing trusted channel (SSH or even the existing Nebula tunnel).
- Reload Nebula by sending it SIGHUP: systemctl kill -s HUP nebula (or systemctl reload nebula if the unit defines ExecReload). Nebula re-reads the cert without dropping established tunnels.
- Inspect the new cert's expiry: nebula-cert print -path /etc/nebula/host.crt.
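The first step above, re-signing with the same name, IP, and groups, is easier to repeat when those values live in one inventory file. The sketch below assumes a hypothetical whitespace-separated inventory format (name, overlay IP with prefix, comma-separated groups) and a DRY_RUN switch that is not part of nebula-cert; the 2160h default corresponds to 90 days.

```shell
# resign_inventory FILE [DURATION]
# Signs one cert per inventory line: "name overlay-ip/prefix groups".
# With DRY_RUN=1 the commands are printed instead of executed.
# The body runs in a subshell so its variables stay local.
resign_inventory() (
  inventory=$1
  duration=${2:-2160h}   # 90 days
  while read -r name ip groups; do
    case "$name" in ''|'#'*) continue ;; esac   # skip blank lines and comments
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "nebula-cert sign -name $name -ip $ip -groups $groups -duration $duration"
    else
      # </dev/null keeps nebula-cert from consuming the inventory on stdin
      ./nebula-cert sign -name "$name" -ip "$ip" -groups "$groups" \
        -duration "$duration" </dev/null
    fi
  done < "$inventory"
)
```

Running `DRY_RUN=1 resign_inventory nodes.txt` first shows what would be signed. nebula-cert may refuse to overwrite existing output files, so it is probably safest to run the real pass in an empty working directory that contains only ca.crt and ca.key.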
For revocation: Nebula does not have an online CRL. The pattern is to issue short-lived certs (90 days) and let expiry handle the bulk of "revocation". A single compromised host can be cut off sooner by adding its certificate fingerprint to pki.blocklist in every peer's config. If the CA key itself is compromised, rotate the CA and reissue everything; that is painful, which is why CA key custody is the single most important operational discipline here.
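Short-lived certs only work if something notices an approaching expiry. A minimal sketch of that check follows; the function names and the 14-day threshold in the usage comment are illustrative, and the usage comment assumes GNU date, jq, and that the JSON output of nebula-cert print exposes the expiry as details.notAfter, which should be verified against the installed version.

```shell
# days_left EXPIRY_EPOCH NOW_EPOCH
# Prints whole days until expiry, truncated toward zero (negative once
# the cert has been expired for a full day or more).
days_left() (
  echo $(( ($1 - $2) / 86400 ))
)

# needs_rotation EXPIRY_EPOCH NOW_EPOCH THRESHOLD_DAYS
# Succeeds when the remaining lifetime is at or below the threshold.
needs_rotation() (
  [ "$(days_left "$1" "$2")" -le "$3" ]
)

# Usage sketch (see the assumptions in the lead-in):
#   not_after=$(nebula-cert print -json -path /etc/nebula/host.crt | jq -r '.details.notAfter')
#   if needs_rotation "$(date -u -d "$not_after" +%s)" "$(date -u +%s)" 14; then
#     echo "rotate /etc/nebula/host.crt"
#   fi
```

Keeping the arithmetic separate from the date parsing makes the threshold logic easy to test with fixed timestamps, and the usage lines can be dropped into a cron job or systemd timer.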
Unsafe Routes & Egress
Nebula calls non-overlay subnets unsafe routes. To make 192.168.50.0/24 behind a peer reachable from the overlay, sign that peer's cert with -subnets 192.168.50.0/24, then add the route to the config of each peer that should reach the subnet:
tun:
  unsafe_routes:
    - route: 192.168.50.0/24
      via: 10.42.0.30 # overlay IP of the gateway peer
      mtu: 1300
      install: true

Every peer that uses the route carries the same unsafe_routes entry pointing to the same gateway, and the gateway itself needs IP forwarding enabled. The signing step is what authorizes the route: a peer cannot route for subnets its cert does not include.
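To make the authorization idea concrete, the check "is this destination inside a subnet the cert was signed for" can be modeled in a few lines of shell. This is an IPv4-only illustration with hypothetical function names, not Nebula's own code.

```shell
# ip_to_int A.B.C.D
# Prints the address as a 32-bit integer.
ip_to_int() (
  IFS=.
  set -- $1
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
)

# cidr_contains CIDR IP
# Succeeds if IP falls inside CIDR (IPv4 only).
cidr_contains() (
  net=${1%/*}
  bits=${1#*/}
  mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  [ $(( $(ip_to_int "$net") & mask )) -eq $(( $(ip_to_int "$2") & mask )) ]
)
```

For the example route, `cidr_contains 192.168.50.0/24 192.168.50.17` succeeds and `cidr_contains 192.168.50.0/24 192.168.51.1` fails, which corresponds to traffic that a cert signed with -subnets 192.168.50.0/24 would and would not cover.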
Metrics & Observability
Nebula exposes a Prometheus endpoint:
stats:
  type: prometheus
  listen: 127.0.0.1:8080
  path: /metrics
  namespace: nebula
  interval: 10s

Useful series: nebula_handshakes, nebula_lighthouse_messages_received_bucket, nebula_meta_chan_tx. Pair with the standard Prometheus + Grafana stack from the Monitoring series.
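For a quick look without a full Prometheus stack, a single series can be pulled out of the endpoint's text output. The helper below is a sketch with a hypothetical name; it only matches samples that carry no labels, so series exposed with labels would need a looser match.

```shell
# metric_value NAME
# Reads Prometheus text exposition format on stdin and prints the value
# of the first unlabeled sample called NAME. Fails if none is found.
metric_value() (
  awk -v name="$1" '
    $1 == name { print $2; found = 1; exit }
    END { exit !found }
  '
)
```

With the stats block above, a call would look like `curl -s http://127.0.0.1:8080/metrics | metric_value nebula_handshakes`.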
Hardening Checklist
- CA key custody. The single thing that matters most. Offline, hardware-backed, password-manager only.
- Short cert lifetimes. 90 days, automated rotation. No 10-year certs.
- Default-deny inbound firewall. Only the rules you mean to allow.
- Run nebula as non-root with the CAP_NET_ADMIN capability instead of full root.
- Lock down the lighthouse. Only UDP/4242 inbound. SSH on a non-overlay path with key auth + fail2ban.
- Audit who has the CA key annually. Rotate operators out promptly.
- Ship logs off-host. journalctl -u nebula to your central syslog or Loki.
Troubleshooting
- Peer never appears. Lighthouse UDP/4242 reachable? nc -u -v <lighthouse> 4242. Most failures are firewall, not Nebula.
- Cert rejected on start. nebula-cert print -path on both .crt and ca.crt: expired or wrong CA is the usual cause.
- Two peers behind the same NAT cannot punch. Set one as a relay (relay.am_relay: true) and list it in the other peer's relay.relays.
- Firewall rule "obviously" matches but traffic dropped. Check groups vs group. Set logging.level: debug briefly to see why a packet was denied.
- Throughput lower than WireGuard. Expected: Nebula is userspace; WireGuard is in-kernel. For most overlays it does not matter; for 10 Gbps line-rate it does.
