You've deployed your infrastructure and applications. Now comes the question every self-hoster faces at 2 AM: "Is everything still running?"
In this guide, we'll use Claude Code to build a complete monitoring stack from scratch—Prometheus for metrics collection, Grafana for visualization, Alertmanager for notifications, and custom exporters for your specific needs.
Deploy Monitoring Separately
For production reliability, run your monitoring stack on a separate VPS instance in a different region than the infrastructure you're monitoring. If your primary server goes down, your monitoring should still be able to detect and alert you. A small 1GB instance is sufficient for monitoring several servers.
Prerequisites
- Claude Code installed (see Part 1)
- Docker and Docker Compose on your VPS
- Basic understanding of metrics (we'll explain as we go)
The Monitoring Stack We'll Build
| Component | Purpose |
|---|---|
| Prometheus | Time-series database, scrapes and stores metrics |
| Grafana | Visualization dashboards |
| Alertmanager | Routes alerts to Discord, Slack, email, etc. |
| Node Exporter | Host metrics (CPU, memory, disk, network) |
| cAdvisor | Container metrics |
| Blackbox Exporter | Endpoint probing (HTTP, TCP, ICMP) |
Core Monitoring Stack
Let's start with the foundation. Ask Claude Code:
Create a Docker Compose Monitoring Stack
Services:
- Prometheus (latest) with 15-day retention
- Grafana (latest) with anonymous access disabled
- Node Exporter for host metrics
- cAdvisor for container metrics
Requirements:
- All on a dedicated monitoring network
- Prometheus config mounted from ./prometheus/
- Grafana data persisted
- Grafana provisioned with Prometheus as default datasource
- Node Exporter needs access to host /proc, /sys, and root filesystem (read-only)
- cAdvisor needs Docker socket access
Expose only Grafana on port 3000. Prometheus should be internal-only.
Generated docker-compose.yml
services:
prometheus:
image: prom/prometheus:latest
container_name: prometheus
restart: unless-stopped
command:
- --config.file=/etc/prometheus/prometheus.yml
- --storage.tsdb.path=/prometheus
- --storage.tsdb.retention.time=15d
- --web.console.libraries=/etc/prometheus/console_libraries
- --web.console.templates=/etc/prometheus/consoles
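# Allow live config reloads via an HTTP POST to /-/reload instead of restarting the container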
- --web.enable-lifecycle
volumes:
- ./prometheus:/etc/prometheus:ro
- prometheus-data:/prometheus
networks:
- monitoring
healthcheck:
test: ["CMD", "wget", "-q", "--spider", "http://localhost:9090/-/healthy"]
interval: 30s
timeout: 10s
retries: 3
grafana:
image: grafana/grafana:latest
container_name: grafana
restart: unless-stopped
ports:
- "3000:3000"
environment:
- GF_SECURITY_ADMIN_USER=${GRAFANA_ADMIN_USER:-admin}
- GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_ADMIN_PASSWORD:?Grafana password required}
- GF_USERS_ALLOW_SIGN_UP=false
- GF_AUTH_ANONYMOUS_ENABLED=false
- GF_SERVER_ROOT_URL=${GRAFANA_ROOT_URL:-http://localhost:3000}
volumes:
- grafana-data:/var/lib/grafana
- ./grafana/provisioning:/etc/grafana/provisioning:ro
networks:
- monitoring
depends_on:
- prometheus
node-exporter:
image: prom/node-exporter:latest
container_name: node-exporter
restart: unless-stopped
command:
- '--path.procfs=/host/proc'
- '--path.sysfs=/host/sys'
- '--path.rootfs=/rootfs'
- '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($|/)'
volumes:
- /proc:/host/proc:ro
- /sys:/host/sys:ro
- /:/rootfs:ro
networks:
- monitoring
pid: host
cadvisor:
image: gcr.io/cadvisor/cadvisor:latest
container_name: cadvisor
restart: unless-stopped
privileged: true
volumes:
- /:/rootfs:ro
- /var/run:/var/run:ro
- /sys:/sys:ro
- /var/lib/docker/:/var/lib/docker:ro
- /dev/disk/:/dev/disk:ro
networks:
- monitoring
devices:
- /dev/kmsg
networks:
monitoring:
driver: bridge
volumes:
prometheus-data:
grafana-data:
Prometheus Configuration
global:
scrape_interval: 15s
evaluation_interval: 15s
scrape_configs:
# Prometheus self-monitoring
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
# Node Exporter - Host metrics
- job_name: 'node'
static_configs:
- targets: ['node-exporter:9100']
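# NOTE: Prometheus does not expand environment variables in this file, so replace the
# ${HOSTNAME:-server} placeholder below with the real host name (or render the file before deploying)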
relabel_configs:
- source_labels: [__address__]
target_label: instance
replacement: '${HOSTNAME:-server}'
# cAdvisor - Container metrics
- job_name: 'cadvisor'
static_configs:
- targets: ['cadvisor:8080']
Grafana Datasource Provisioning
apiVersion: 1
datasources:
- name: Prometheus
type: prometheus
access: proxy
url: http://prometheus:9090
isDefault: true
editable: false
Environment Configuration
# Grafana
GRAFANA_ADMIN_USER=admin
GRAFANA_ADMIN_PASSWORD=change-me-secure-password
GRAFANA_ROOT_URL=http://localhost:3000
# Host identification
HOSTNAME=my-server
Adding Alertmanager
Now let's add alerting:
Add Alertmanager to the stack with:
- Discord webhook notifications (primary)
- Email as backup notification channel
- Alert grouping by severity and service
- 5-minute group wait, 10-minute group interval
- Silence and inhibition rules for maintenance
Create alert rules for:
- Instance down (any target unreachable for 2 minutes)
- High CPU (>80% for 5 minutes)
- High memory (>85% for 5 minutes)
- Disk space low (<15% free)
- Container restart loop (>3 restarts in 10 minutes)
Update Prometheus to load the rules and connect to Alertmanager.
Add to docker-compose.yml
alertmanager:
image: prom/alertmanager:latest
container_name: alertmanager
restart: unless-stopped
command:
- '--config.file=/etc/alertmanager/alertmanager.yml'
- '--storage.path=/alertmanager'
- '--web.external-url=${ALERTMANAGER_URL:-http://localhost:9093}'
volumes:
- ./alertmanager:/etc/alertmanager:ro
- alertmanager-data:/alertmanager
networks:
- monitoring
healthcheck:
test: ["CMD", "wget", "-q", "--spider", "http://localhost:9093/-/healthy"]
interval: 30s
timeout: 10s
retries: 3
# Add to volumes:
alertmanager-data:
Alertmanager Configuration
global:
resolve_timeout: 5m
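# NOTE: Alertmanager does not expand ${VAR} placeholders itself; render this file from your
# .env (for example with envsubst) before starting the container, or hard-code the values here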
smtp_smarthost: '${SMTP_HOST:-smtp.example.com}:${SMTP_PORT:-587}'
smtp_from: '${SMTP_FROM:-alerts@example.com}'
smtp_auth_username: '${SMTP_USER:-}'
smtp_auth_password: '${SMTP_PASSWORD:-}'
route:
group_by: ['alertname', 'severity', 'service']
group_wait: 5m
group_interval: 10m
repeat_interval: 4h
receiver: 'discord-notifications'
routes:
- match:
severity: critical
receiver: 'discord-notifications'
group_wait: 30s
repeat_interval: 1h
- match:
severity: warning
receiver: 'discord-notifications'
group_wait: 5m
repeat_interval: 6h
receivers:
- name: 'discord-notifications'
discord_configs:
- webhook_url: '${DISCORD_WEBHOOK_URL}'
title: '{{ template "discord.title" . }}'
message: '{{ template "discord.message" . }}'
email_configs:
- to: '${ALERT_EMAIL:-admin@example.com}'
send_resolved: true
inhibit_rules:
- source_match:
severity: 'critical'
target_match:
severity: 'warning'
equal: ['alertname', 'instance']
templates:
- '/etc/alertmanager/templates/*.tmpl'
Discord Message Template
{{ define "discord.title" }}
[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .CommonLabels.alertname }}
{{ end }}
{{ define "discord.message" }}
{{ range .Alerts }}
**Alert:** {{ .Labels.alertname }}
**Severity:** {{ .Labels.severity }}
**Instance:** {{ .Labels.instance }}
**Description:** {{ .Annotations.description }}
**Started:** {{ .StartsAt.Format "2006-01-02 15:04:05 UTC" }}
{{ if eq .Status "resolved" }}**Resolved:** {{ .EndsAt.Format "2006-01-02 15:04:05 UTC" }}{{ end }}
---
{{ end }}
{{ end }}
Prometheus Alert Rules
groups:
- name: instance_alerts
interval: 30s
rules:
- alert: InstanceDown
expr: up == 0
for: 2m
labels:
severity: critical
annotations:
summary: "Instance {{ $labels.instance }} is down"
description: "{{ $labels.job }} on {{ $labels.instance }} has been unreachable for more than 2 minutes."
- name: host_alerts
interval: 30s
rules:
- alert: HighCpuUsage
expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
for: 5m
labels:
severity: warning
service: host
annotations:
summary: "High CPU usage on {{ $labels.instance }}"
description: "CPU usage is above 80% for more than 5 minutes."
- alert: HighMemoryUsage
expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
for: 5m
labels:
severity: warning
service: host
annotations:
summary: "High memory usage on {{ $labels.instance }}"
description: "Memory usage is above 85%."
- alert: DiskSpaceLow
expr: (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes) * 100 < 15
for: 5m
labels:
severity: warning
service: host
annotations:
summary: "Low disk space on {{ $labels.instance }}"
description: "Disk {{ $labels.mountpoint }} has less than 15% free space."
- alert: DiskSpaceCritical
expr: (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes) * 100 < 5
for: 2m
labels:
severity: critical
service: host
annotations:
summary: "Critical disk space on {{ $labels.instance }}"
description: "Disk {{ $labels.mountpoint }} has less than 5% free space."
- name: container_alerts
interval: 30s
rules:
- alert: ContainerRestartLoop
expr: increase(container_restart_count[10m]) > 3
for: 1m
labels:
severity: warning
service: container
annotations:
summary: "Container {{ $labels.name }} is restarting frequently"
description: "Container has restarted multiple times in the last 10 minutes."Updated Prometheus Config
global:
scrape_interval: 15s
evaluation_interval: 15s
alerting:
alertmanagers:
- static_configs:
- targets: ['alertmanager:9093']
rule_files:
- '/etc/prometheus/rules/*.yml'
scrape_configs:
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
- job_name: 'node'
static_configs:
- targets: ['node-exporter:9100']
- job_name: 'cadvisor'
static_configs:
- targets: ['cadvisor:8080']
- job_name: 'alertmanager'
static_configs:
- targets: ['alertmanager:9093']
Blackbox Exporter for Endpoint Monitoring
Monitor your services from the outside:
Add Blackbox Exporter to probe:
- HTTP endpoints (status codes, SSL cert expiry, response time)
- TCP connectivity (database ports, Redis)
- ICMP ping (for network debugging)
Create probes for:
- My main website (https://example.com) - check every 30s
- API health endpoint (https://api.example.com/health)
- Database connectivity (internal PostgreSQL on port 5432)
Add alerts for:
- Endpoint down for 1 minute
- SSL certificate expiring within 14 days
- Response time > 2 seconds
Add to docker-compose.yml
blackbox:
image: prom/blackbox-exporter:latest
container_name: blackbox
restart: unless-stopped
volumes:
- ./blackbox:/etc/blackbox:ro
command:
- '--config.file=/etc/blackbox/blackbox.yml'
networks:
- monitoring
Blackbox Configuration
modules:
http_2xx:
prober: http
timeout: 10s
http:
valid_http_versions: ["HTTP/1.1", "HTTP/2"]
valid_status_codes: [] # Defaults to 2xx
method: GET
http_2xx_ssl:
prober: http
timeout: 10s
http:
valid_http_versions: ["HTTP/1.1", "HTTP/2"]
valid_status_codes: []
method: GET
tls_config:
insecure_skip_verify: false
tcp_connect:
prober: tcp
timeout: 5s
icmp:
prober: icmp
timeout: 5s
Prometheus Scrape Config for Blackbox
# Blackbox HTTP probes
- job_name: 'blackbox-http'
metrics_path: /probe
params:
module: [http_2xx_ssl]
static_configs:
- targets:
- https://example.com
- https://api.example.com/health
relabel_configs:
- source_labels: [__address__]
target_label: __param_target
- source_labels: [__param_target]
target_label: instance
- target_label: __address__
replacement: blackbox:9115
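# Blackbox ICMP probes (a hedged sketch for the ping checks requested in the prompt;
# the target below is a placeholder, and ICMP probing may need extra privileges on the
# blackbox container, e.g. cap_add: NET_RAW, depending on your setup)
- job_name: 'blackbox-icmp'
  metrics_path: /probe
  params:
    module: [icmp]
  static_configs:
    - targets:
        - 1.1.1.1
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: blackbox:9115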
# Blackbox TCP probes
- job_name: 'blackbox-tcp'
metrics_path: /probe
params:
module: [tcp_connect]
static_configs:
- targets:
- postgres:5432
- redis:6379
relabel_configs:
- source_labels: [__address__]
target_label: __param_target
- source_labels: [__param_target]
target_label: instance
- target_label: __address__
replacement: blackbox:9115
Blackbox Alert Rules
groups:
- name: blackbox_alerts
interval: 30s
rules:
- alert: EndpointDown
expr: probe_success == 0
for: 1m
labels:
severity: critical
service: endpoint
annotations:
summary: "Endpoint {{ $labels.instance }} is down"
description: "{{ $labels.instance }} has been unreachable for more than 1 minute."
- alert: SSLCertExpiringSoon
expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 14
for: 1h
labels:
severity: warning
service: ssl
annotations:
summary: "SSL certificate expiring soon for {{ $labels.instance }}"
description: "SSL certificate expires in less than 14 days."
- alert: SSLCertExpiryCritical
expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 3
for: 1h
labels:
severity: critical
service: ssl
annotations:
summary: "SSL certificate expiring very soon for {{ $labels.instance }}"
description: "SSL certificate expires in less than 3 days!"
- alert: SlowResponse
expr: sum by (instance) (probe_http_duration_seconds) > 2
for: 5m
labels:
severity: warning
service: endpoint
annotations:
summary: "Slow response from {{ $labels.instance }}"
description: "Response time is greater than 2 seconds."Custom Monitoring Scripts
Sometimes you need metrics Prometheus doesn't have exporters for. Claude Code can generate custom collectors:
Metrics needed:
- Number of files in /var/log older than 7 days
- Size of Docker volumes in MB
- Number of failed SSH login attempts today
- Last backup age in hours (check /backup/latest timestamp)
- Number of pending apt upgrades
Output in Prometheus format to /var/lib/node_exporter/textfile_collector/custom.prom
Custom Metrics Script
#!/bin/bash
set -euo pipefail
OUTPUT_FILE="/var/lib/node_exporter/textfile_collector/custom.prom"
TMP_FILE="${OUTPUT_FILE}.tmp"
# Initialize output
cat > "$TMP_FILE" << 'EOF'
# HELP custom_old_log_files Number of log files older than 7 days
# TYPE custom_old_log_files gauge
# HELP custom_docker_volume_size_mb Size of Docker volumes in MB
# TYPE custom_docker_volume_size_mb gauge
# HELP custom_failed_ssh_attempts Failed SSH login attempts today
# TYPE custom_failed_ssh_attempts gauge
# HELP custom_backup_age_hours Age of last backup in hours
# TYPE custom_backup_age_hours gauge
# HELP custom_pending_upgrades Number of pending apt upgrades
# TYPE custom_pending_upgrades gauge
EOF
# Old log files (older than 7 days)
old_logs=$(find /var/log -type f -mtime +7 2>/dev/null | wc -l)
echo "custom_old_log_files ${old_logs}" >> "$TMP_FILE"
# Failed SSH attempts today
if [[ -f /var/log/auth.log ]]; then
# Traditional syslog timestamps pad single-digit days with a space, so match with %e (adjust if your distro logs ISO 8601 timestamps)
today=$(date '+%b %e')
failed_ssh=$(grep "Failed password" /var/log/auth.log 2>/dev/null | grep -c "$today") || failed_ssh=0
else
failed_ssh=0
fi
echo "custom_failed_ssh_attempts ${failed_ssh}" >> "$TMP_FILE"
# Backup age
backup_file="/backup/latest"
if [[ -f "$backup_file" ]]; then
backup_time=$(stat -c %Y "$backup_file")
current_time=$(date +%s)
age_hours=$(( (current_time - backup_time) / 3600 ))
echo "custom_backup_age_hours ${age_hours}" >> "$TMP_FILE"
else
echo "custom_backup_age_hours -1" >> "$TMP_FILE"
fi
# Pending apt upgrades
if command -v apt &> /dev/null; then
pending=$(apt list --upgradable 2>/dev/null | grep -c upgradable) || pending=0
else
pending=0
fi
echo "custom_pending_upgrades ${pending}" >> "$TMP_FILE"
# Atomic move
mv "$TMP_FILE" "$OUTPUT_FILE"Cron Job
# Run every 5 minutes
*/5 * * * * /opt/monitoring/scripts/custom_metrics.sh
Update Node Exporter
node-exporter:
image: prom/node-exporter:latest
container_name: node-exporter
restart: unless-stopped
command:
- '--path.procfs=/host/proc'
- '--path.sysfs=/host/sys'
- '--path.rootfs=/rootfs'
- '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($|/)'
- '--collector.textfile.directory=/var/lib/node_exporter/textfile_collector'
volumes:
- /proc:/host/proc:ro
- /sys:/host/sys:ro
- /:/rootfs:ro
- /var/lib/node_exporter/textfile_collector:/var/lib/node_exporter/textfile_collector:ro
networks:
- monitoring
pid: host
Grafana Dashboard Provisioning
Let's auto-provision a useful dashboard:
Create a Grafana Dashboard JSON
Row 1 - System Overview:
- CPU usage gauge
- Memory usage gauge
- Disk usage gauge
- Uptime stat
Row 2 - Resource Trends:
- CPU usage over time (line graph)
- Memory usage over time (line graph)
- Network traffic in/out (line graph)
Row 3 - Container Status:
- Table of all containers with CPU, memory, status
- Container restart count over time
Row 4 - Alerts:
- Current firing alerts table
- Alert history timeline
Make it responsive and use a dark theme. Include template variables for instance selection.
Dashboard Provisioning Config
apiVersion: 1
providers:
- name: 'default'
orgId: 1
folder: ''
type: file
disableDeletion: false
updateIntervalSeconds: 30
options:
path: /etc/grafana/provisioning/dashboards
Claude Code will generate a complete JSON dashboard file with all the panels, queries, and styling. The dashboard includes template variables for filtering by instance and proper dark theme styling.
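Once the stack is up (next section), you can confirm that dashboard and datasource provisioning took effect by listing dashboards through Grafana's HTTP API. A minimal check, assuming the admin credentials from your .env are exported in your shell:
curl -s -u "${GRAFANA_ADMIN_USER}:${GRAFANA_ADMIN_PASSWORD}" \
  "http://localhost:3000/api/search?type=dash-db" | jq '.[] | {title, uid}'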
Deployment and Verification
Deploy the Complete Stack
# Create directories
mkdir -p prometheus/rules alertmanager/templates blackbox grafana/provisioning/{datasources,dashboards}
# Set permissions for node_exporter textfile collector
sudo mkdir -p /var/lib/node_exporter/textfile_collector
sudo chmod 755 /var/lib/node_exporter/textfile_collector
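# (Optional) validate the Prometheus config and alert rules before starting the stack;
# promtool ships inside the prom/prometheus image, so nothing extra is installed on the host
docker run --rm -v "$(pwd)/prometheus:/etc/prometheus:ro" --entrypoint promtool prom/prometheus:latest check config /etc/prometheus/prometheus.yml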
# Deploy
docker compose up -d
# Verify all services are healthy
docker compose ps
# Check Prometheus targets (Prometheus is internal-only, so query it from inside the container)
docker compose exec prometheus wget -qO- http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'
# Check alerting rules
docker compose exec prometheus wget -qO- http://localhost:9090/api/v1/rules | jq '.data.groups[].rules[] | {alert: .name, state: .state}'
Testing Alerts
# Trigger a test alert (CPU stress; run longer than the 5-minute "for" window on HighCpuUsage)
docker run --rm -it progrium/stress --cpu 4 --timeout 360s
# Check Prometheus for firing alerts
docker compose exec prometheus wget -qO- http://localhost:9090/api/v1/alerts | jq '.data.alerts'
# Check that Alertmanager received it (via its v2 API)
docker compose exec alertmanager wget -qO- http://localhost:9093/api/v2/alerts | jq '.'
Tips for Effective Monitoring Generation
- Start with the basics. CPU, memory, disk, and network cover 80% of issues.
- Add application-specific metrics gradually. Database connections, queue depth, cache hit rates.
- Test your alerts. An alert that never fires (or always fires) is useless; see the example after this list.
- Set reasonable thresholds. 80% CPU for 5 minutes is a warning; 95% for 1 minute is critical.
- Group related alerts. Don't wake yourself up for 10 separate disk alerts on one server.
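One low-impact way to test the notification path end to end is to push a synthetic alert straight into Alertmanager and watch for the Discord message. A minimal sketch using Alertmanager's v2 API from inside the container; the label values are placeholders:
docker compose exec alertmanager wget -qO- \
  --header 'Content-Type: application/json' \
  --post-data '[{"labels":{"alertname":"SyntheticTest","severity":"warning","instance":"test"},"annotations":{"description":"Manual test alert"}}]' \
  http://localhost:9093/api/v2/alerts
The alert expires on its own after resolve_timeout (5m in the config above), so you can fire it repeatedly while tuning grouping and templates.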
Quick Reference: Monitoring Prompts
| Need | Prompt Pattern |
|---|---|
| New exporter | "Add [exporter] to monitor [service] with metrics for [specifics]" |
| Alert rule | "Create alert for [condition] lasting [duration] with [severity]" |
| Dashboard | "Create Grafana dashboard showing [metrics] with [visualization types]" |
| Custom metric | "Generate script to expose [metric] for Prometheus textfile collector" |
| Notification | "Configure Alertmanager to send [alert type] to [Discord/Slack/email]" |
What's Next
You now have production-grade monitoring generated through conversation. In Part 5, we'll cover Infrastructure as Code: From Zero to GitOps, generating Ansible playbooks, CI/CD pipelines, and version-controlled infrastructure.
Continue to Part 5