You've deployed your infrastructure and applications. Now comes the question every self-hoster faces at 2 AM: "Is everything still running?"
In this guide, we'll use Claude Code to build a complete monitoring stack from scratch—Prometheus for metrics collection, Grafana for visualization, Alertmanager for notifications, and custom exporters for your specific needs.
Deploy Monitoring Separately
For production reliability, run your monitoring stack on a separate VPS instance in a different region than the infrastructure you're monitoring. If your primary server goes down, your monitoring should still be able to detect and alert you. A small 1GB instance is sufficient for monitoring several servers.
Prerequisites
- Claude Code installed (see Part 1)
- Docker and Docker Compose on your VPS
- Basic understanding of metrics (we'll explain as we go)
The Monitoring Stack We'll Build
| Component | Purpose |
|---|---|
| Prometheus | Time-series database, scrapes and stores metrics |
| Grafana | Visualization dashboards |
| Alertmanager | Routes alerts to Discord, Slack, email, etc. |
| Node Exporter | Host metrics (CPU, memory, disk, network) |
| cAdvisor | Container metrics |
| Blackbox Exporter | Endpoint probing (HTTP, TCP, ICMP) |
Core Monitoring Stack
Let's start with the foundation. Ask Claude Code:
Create a Docker Compose Monitoring Stack
Services:
- Prometheus (latest) with 15-day retention
- Grafana (latest) with anonymous access disabled
- Node Exporter for host metrics
- cAdvisor for container metrics
Requirements:
- All on a dedicated monitoring network
- Prometheus config mounted from ./prometheus/
- Grafana data persisted
- Grafana provisioned with Prometheus as default datasource
- Node Exporter needs access to host /proc, /sys, and root filesystem (read-only)
- cAdvisor needs Docker socket access
Expose only Grafana on port 3000. Prometheus should be internal-only.
Generated docker-compose.yml
services:
prometheus:
image: prom/prometheus:latest
container_name: prometheus
restart: unless-stopped
command:
- --config.file=/etc/prometheus/prometheus.yml
- --storage.tsdb.path=/prometheus
- --storage.tsdb.retention.time=15d
- --web.console.libraries=/etc/prometheus/console_libraries
- --web.console.templates=/etc/prometheus/consoles
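# Allow live config reloads via an HTTP POST to /-/reload instead of restarting the container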
- --web.enable-lifecycle
volumes:
- ./prometheus:/etc/prometheus:ro
- prometheus-data:/prometheus
networks:
- monitoring
healthcheck:
test: ["CMD", "wget", "-q", "--spider", "http://localhost:9090/-/healthy"]
interval: 30s
timeout: 10s
retries: 3
grafana:
image: grafana/grafana:latest
container_name: grafana
restart: unless-stopped
ports:
- "3000:3000"
environment:
- GF_SECURITY_ADMIN_USER=${GRAFANA_ADMIN_USER:-admin}
- GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_ADMIN_PASSWORD:?Grafana password required}
- GF_USERS_ALLOW_SIGN_UP=false
- GF_AUTH_ANONYMOUS_ENABLED=false
- GF_SERVER_ROOT_URL=${GRAFANA_ROOT_URL:-http://localhost:3000}
volumes:
- grafana-data:/var/lib/grafana
- ./grafana/provisioning:/etc/grafana/provisioning:ro
networks:
- monitoring
depends_on:
- prometheus
node-exporter:
image: prom/node-exporter:latest
container_name: node-exporter
restart: unless-stopped
command:
- '--path.procfs=/host/proc'
- '--path.sysfs=/host/sys'
- '--path.rootfs=/rootfs'
- '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($|/)'
volumes:
- /proc:/host/proc:ro
- /sys:/host/sys:ro
- /:/rootfs:ro
networks:
- monitoring
pid: host
cadvisor:
image: gcr.io/cadvisor/cadvisor:latest
container_name: cadvisor
restart: unless-stopped
privileged: true
volumes:
- /:/rootfs:ro
- /var/run:/var/run:ro
- /sys:/sys:ro
- /var/lib/docker/:/var/lib/docker:ro
- /dev/disk/:/dev/disk:ro
networks:
- monitoring
devices:
- /dev/kmsg
networks:
monitoring:
driver: bridge
volumes:
prometheus-data:
grafana-data:
Prometheus Configuration
global:
scrape_interval: 15s
evaluation_interval: 15s
scrape_configs:
# Prometheus self-monitoring
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
# Node Exporter - Host metrics
- job_name: 'node'
static_configs:
- targets: ['node-exporter:9100']
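# NOTE: Prometheus does not expand environment variables in this file, so replace the
# ${HOSTNAME:-server} placeholder below with the real host name (or render the file before deploying)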
relabel_configs:
- source_labels: [__address__]
target_label: instance
replacement: '${HOSTNAME:-server}'
# cAdvisor - Container metrics
- job_name: 'cadvisor'
static_configs:
- targets: ['cadvisor:8080']
Grafana Datasource Provisioning
apiVersion: 1
datasources:
- name: Prometheus
type: prometheus
access: proxy
url: http://prometheus:9090
isDefault: true
editable: false
Environment Configuration
# Grafana
GRAFANA_ADMIN_USER=admin
GRAFANA_ADMIN_PASSWORD=change-me-secure-password
GRAFANA_ROOT_URL=http://localhost:3000
# Host identification
HOSTNAME=my-server
Adding Alertmanager
Now let's add alerting:
Add Alertmanager to the stack with:
- Discord webhook notifications (primary)
- Email as backup notification channel
- Alert grouping by severity and service
- 5-minute group wait, 10-minute group interval
- Silence and inhibition rules for maintenance
Create alert rules for:
- Instance down (any target unreachable for 2 minutes)
- High CPU (>80% for 5 minutes)
- High memory (>85% for 5 minutes)
- Disk space low (<15% free)
- Container restart loop (>3 restarts in 10 minutes)
Update Prometheus to load the rules and connect to Alertmanager.
Add to docker-compose.yml
alertmanager:
image: prom/alertmanager:latest
container_name: alertmanager
restart: unless-stopped
command:
- '--config.file=/etc/alertmanager/alertmanager.yml'
- '--storage.path=/alertmanager'
- '--web.external-url=${ALERTMANAGER_URL:-http://localhost:9093}'
volumes:
- ./alertmanager:/etc/alertmanager:ro
- alertmanager-data:/alertmanager
networks:
- monitoring
healthcheck:
test: ["CMD", "wget", "-q", "--spider", "http://localhost:9093/-/healthy"]
interval: 30s
timeout: 10s
retries: 3
# Add to volumes:
alertmanager-data:
Alertmanager Configuration
global:
resolve_timeout: 5m
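# NOTE: Alertmanager does not expand ${VAR} placeholders itself; render this file from your
# .env (for example with envsubst) before starting the container, or hard-code the values here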
smtp_smarthost: '${SMTP_HOST:-smtp.example.com}:${SMTP_PORT:-587}'
smtp_from: '${SMTP_FROM:-alerts@example.com}'
smtp_auth_username: '${SMTP_USER:-}'
smtp_auth_password: '${SMTP_PASSWORD:-}'
route:
group_by: ['alertname', 'severity', 'service']
group_wait: 5m
group_interval: 10m
repeat_interval: 4h
receiver: 'discord-notifications'
routes:
- match:
severity: critical
receiver: 'discord-notifications'
group_wait: 30s
repeat_interval: 1h
- match:
severity: warning
receiver: 'discord-notifications'
group_wait: 5m
repeat_interval: 6h
receivers:
- name: 'discord-notifications'
discord_configs:
- webhook_url: '${DISCORD_WEBHOOK_URL}'
title: '{{ template "discord.title" . }}'
message: '{{ template "discord.message" . }}'
email_configs:
- to: '${ALERT_EMAIL:-admin@example.com}'
send_resolved: true
inhibit_rules:
- source_match:
severity: 'critical'
target_match:
severity: 'warning'
equal: ['alertname', 'instance']
templates:
- '/etc/alertmanager/templates/*.tmpl'
Discord Message Template
{{ define "discord.title" }}
[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .CommonLabels.alertname }}
{{ end }}
{{ define "discord.message" }}
{{ range .Alerts }}
**Alert:** {{ .Labels.alertname }}
**Severity:** {{ .Labels.severity }}
**Instance:** {{ .Labels.instance }}
**Description:** {{ .Annotations.description }}
**Started:** {{ .StartsAt.Format "2006-01-02 15:04:05 UTC" }}
{{ if eq .Status "resolved" }}**Resolved:** {{ .EndsAt.Format "2006-01-02 15:04:05 UTC" }}{{ end }}
---
{{ end }}
{{ end }}
Prometheus Alert Rules
groups:
- name: instance_alerts
interval: 30s
rules:
- alert: InstanceDown
expr: up == 0
for: 2m
labels:
severity: critical
annotations:
summary: "Instance {{ $labels.instance }} is down"
description: "{{ $labels.job }} on {{ $labels.instance }} has been unreachable for more than 2 minutes."
- name: host_alerts
interval: 30s
rules:
- alert: HighCpuUsage
expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
for: 5m
labels:
severity: warning
service: host
annotations:
summary: "High CPU usage on {{ $labels.instance }}"
description: "CPU usage is above 80% for more than 5 minutes."
- alert: HighMemoryUsage
expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
for: 5m
labels:
severity: warning
service: host
annotations:
summary: "High memory usage on {{ $labels.instance }}"
description: "Memory usage is above 85%."
- alert: DiskSpaceLow
expr: (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes) * 100 < 15
for: 5m
labels:
severity: warning
service: host
annotations:
summary: "Low disk space on {{ $labels.instance }}"
description: "Disk {{ $labels.mountpoint }} has less than 15% free space."
- alert: DiskSpaceCritical
expr: (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes) * 100 < 5
for: 2m
labels:
severity: critical
service: host
annotations:
summary: "Critical disk space on {{ $labels.instance }}"
description: "Disk {{ $labels.mountpoint }} has less than 5% free space."
- name: container_alerts
interval: 30s
rules:
- alert: ContainerRestartLoop
expr: increase(container_restart_count[10m]) > 3
for: 1m
labels:
severity: warning
service: container
annotations:
summary: "Container {{ $labels.name }} is restarting frequently"
description: "Container has restarted multiple times in the last 10 minutes."Updated Prometheus Config
global:
scrape_interval: 15s
evaluation_interval: 15s
alerting:
alertmanagers:
- static_configs:
- targets: ['alertmanager:9093']
rule_files:
- '/etc/prometheus/rules/*.yml'
scrape_configs:
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
- job_name: 'node'
static_configs:
- targets: ['node-exporter:9100']
- job_name: 'cadvisor'
static_configs:
- targets: ['cadvisor:8080']
- job_name: 'alertmanager'
static_configs:
- targets: ['alertmanager:9093']
Blackbox Exporter for Endpoint Monitoring
Monitor your services from the outside:
Add Blackbox Exporter to probe:
- HTTP endpoints (status codes, SSL cert expiry, response time)
- TCP connectivity (database ports, Redis)
- ICMP ping (for network debugging)
Create probes for:
- My main website (https://example.com) - check every 30s
- API health endpoint (https://api.example.com/health)
- Database connectivity (internal PostgreSQL on port 5432)
Add alerts for:
- Endpoint down for 1 minute
- SSL certificate expiring within 14 days
- Response time > 2 seconds
Add to docker-compose.yml
blackbox:
image: prom/blackbox-exporter:latest
container_name: blackbox
restart: unless-stopped
volumes:
- ./blackbox:/etc/blackbox:ro
command:
- '--config.file=/etc/blackbox/blackbox.yml'
networks:
- monitoring
Blackbox Configuration
modules:
http_2xx:
prober: http
timeout: 10s
http:
valid_http_versions: ["HTTP/1.1", "HTTP/2"]
valid_status_codes: [] # Defaults to 2xx
method: GET
http_2xx_ssl:
prober: http
timeout: 10s
http:
valid_http_versions: ["HTTP/1.1", "HTTP/2"]
valid_status_codes: []
method: GET
tls_config:
insecure_skip_verify: false
tcp_connect:
prober: tcp
timeout: 5s
icmp:
prober: icmp
timeout: 5s
Prometheus Scrape Config for Blackbox
# Blackbox HTTP probes
- job_name: 'blackbox-http'
metrics_path: /probe
params:
module: [http_2xx_ssl]
static_configs:
- targets:
- https://example.com
- https://api.example.com/health
relabel_configs:
- source_labels: [__address__]
target_label: __param_target
- source_labels: [__param_target]
target_label: instance
- target_label: __address__
replacement: blackbox:9115
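# Blackbox ICMP probes (a hedged sketch for the ping checks requested in the prompt;
# the target below is a placeholder, and ICMP probing may need extra privileges on the
# blackbox container, e.g. cap_add: NET_RAW, depending on your setup)
- job_name: 'blackbox-icmp'
  metrics_path: /probe
  params:
    module: [icmp]
  static_configs:
    - targets:
        - 1.1.1.1
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: blackbox:9115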
# Blackbox TCP probes
- job_name: 'blackbox-tcp'
metrics_path: /probe
params:
module: [tcp_connect]
static_configs:
- targets:
- postgres:5432
- redis:6379
relabel_configs:
- source_labels: [__address__]
target_label: __param_target
- source_labels: [__param_target]
target_label: instance
- target_label: __address__
replacement: blackbox:9115
Blackbox Alert Rules
groups:
- name: blackbox_alerts
interval: 30s
rules:
- alert: EndpointDown
expr: probe_success == 0
for: 1m
labels:
severity: critical
service: endpoint
annotations:
summary: "Endpoint {{ $labels.instance }} is down"
description: "{{ $labels.instance }} has been unreachable for more than 1 minute."
- alert: SSLCertExpiringSoon
expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 14
for: 1h
labels:
severity: warning
service: ssl
annotations:
summary: "SSL certificate expiring soon for {{ $labels.instance }}"
description: "SSL certificate expires in less than 14 days."
- alert: SSLCertExpiryCritical
expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 3
for: 1h
labels:
severity: critical
service: ssl
annotations:
summary: "SSL certificate expiring very soon for {{ $labels.instance }}"
description: "SSL certificate expires in less than 3 days!"
- alert: SlowResponse
expr: sum by (instance) (probe_http_duration_seconds) > 2
for: 5m
labels:
severity: warning
service: endpoint
annotations:
summary: "Slow response from {{ $labels.instance }}"
description: "Response time is greater than 2 seconds."Custom Monitoring Scripts
Sometimes you need metrics Prometheus doesn't have exporters for. Claude Code can generate custom collectors:
Metrics needed:
- Number of files in /var/log older than 7 days
- Size of Docker volumes in MB
- Number of failed SSH login attempts today
- Last backup age in hours (check /backup/latest timestamp)
- Number of pending apt upgrades
Output in Prometheus format to /var/lib/node_exporter/textfile_collector/custom.prom
Custom Metrics Script
#!/bin/bash
set -euo pipefail
OUTPUT_FILE="/var/lib/node_exporter/textfile_collector/custom.prom"
TMP_FILE="${OUTPUT_FILE}.tmp"
# Initialize output
cat > "$TMP_FILE" << 'EOF'
# HELP custom_old_log_files Number of log files older than 7 days
# TYPE custom_old_log_files gauge
# HELP custom_docker_volume_size_mb Size of Docker volumes in MB
# TYPE custom_docker_volume_size_mb gauge
# HELP custom_failed_ssh_attempts Failed SSH login attempts today
# TYPE custom_failed_ssh_attempts gauge
# HELP custom_backup_age_hours Age of last backup in hours
# TYPE custom_backup_age_hours gauge
# HELP custom_pending_upgrades Number of pending apt upgrades
# TYPE custom_pending_upgrades gauge
EOF
# Old log files (older than 7 days)
old_logs=$(find /var/log -type f -mtime +7 2>/dev/null | wc -l)
echo "custom_old_log_files ${old_logs}" >> "$TMP_FILE"
# Failed SSH attempts today
if [[ -f /var/log/auth.log ]]; then
# Traditional syslog timestamps pad single-digit days with a space, so match with %e (adjust if your distro logs ISO 8601 timestamps)
today=$(date '+%b %e')
failed_ssh=$(grep "Failed password" /var/log/auth.log 2>/dev/null | grep -c "$today") || failed_ssh=0
else
failed_ssh=0
fi
echo "custom_failed_ssh_attempts ${failed_ssh}" >> "$TMP_FILE"
# Backup age
backup_file="/backup/latest"
if [[ -f "$backup_file" ]]; then
backup_time=$(stat -c %Y "$backup_file")
current_time=$(date +%s)
age_hours=$(( (current_time - backup_time) / 3600 ))
echo "custom_backup_age_hours ${age_hours}" >> "$TMP_FILE"
else
echo "custom_backup_age_hours -1" >> "$TMP_FILE"
fi
# Pending apt upgrades
if command -v apt &> /dev/null; then
pending=$(apt list --upgradable 2>/dev/null | grep -c upgradable) || pending=0
else
pending=0
fi
echo "custom_pending_upgrades ${pending}" >> "$TMP_FILE"
# Atomic move
mv "$TMP_FILE" "$OUTPUT_FILE"Cron Job
# Run every 5 minutes
*/5 * * * * /opt/monitoring/scripts/custom_metrics.sh
Update Node Exporter
node-exporter:
image: prom/node-exporter:latest
container_name: node-exporter
restart: unless-stopped
command:
- '--path.procfs=/host/proc'
- '--path.sysfs=/host/sys'
- '--path.rootfs=/rootfs'
- '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($|/)'
- '--collector.textfile.directory=/var/lib/node_exporter/textfile_collector'
volumes:
- /proc:/host/proc:ro
- /sys:/host/sys:ro
- /:/rootfs:ro
- /var/lib/node_exporter/textfile_collector:/var/lib/node_exporter/textfile_collector:ro
networks:
- monitoring
pid: host
Grafana Dashboard Provisioning
Let's auto-provision a useful dashboard:
Create a Grafana Dashboard JSON
Row 1 - System Overview:
- CPU usage gauge
- Memory usage gauge
- Disk usage gauge
- Uptime stat
Row 2 - Resource Trends:
- CPU usage over time (line graph)
- Memory usage over time (line graph)
- Network traffic in/out (line graph)
Row 3 - Container Status:
- Table of all containers with CPU, memory, status
- Container restart count over time
Row 4 - Alerts:
- Current firing alerts table
- Alert history timeline
Make it responsive and use a dark theme. Include template variables for instance selection.
Dashboard Provisioning Config
apiVersion: 1
providers:
- name: 'default'
orgId: 1
folder: ''
type: file
disableDeletion: false
updateIntervalSeconds: 30
options:
path: /etc/grafana/provisioning/dashboards
Claude Code will generate a complete JSON dashboard file with all the panels, queries, and styling. The dashboard includes template variables for filtering by instance and proper dark theme styling.
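Once the stack is up (next section), you can confirm that dashboard and datasource provisioning took effect by listing dashboards through Grafana's HTTP API. A minimal check, assuming the admin credentials from your .env are exported in your shell:
curl -s -u "${GRAFANA_ADMIN_USER}:${GRAFANA_ADMIN_PASSWORD}" \
  "http://localhost:3000/api/search?type=dash-db" | jq '.[] | {title, uid}'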
Deployment and Verification
Deploy the Complete Stack
# Create directories
mkdir -p prometheus/rules alertmanager/templates blackbox grafana/provisioning/{datasources,dashboards}
# Set permissions for node_exporter textfile collector
sudo mkdir -p /var/lib/node_exporter/textfile_collector
sudo chmod 755 /var/lib/node_exporter/textfile_collector
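# (Optional) validate the Prometheus config and alert rules before starting the stack;
# promtool ships inside the prom/prometheus image, so nothing extra is installed on the host
docker run --rm -v "$(pwd)/prometheus:/etc/prometheus:ro" --entrypoint promtool prom/prometheus:latest check config /etc/prometheus/prometheus.yml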
# Deploy
docker compose up -d
# Verify all services are healthy
docker compose ps
# Check Prometheus targets (Prometheus is internal-only, so query it from inside the container)
docker compose exec prometheus wget -qO- http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'
# Check alerting rules
docker compose exec prometheus wget -qO- http://localhost:9090/api/v1/rules | jq '.data.groups[].rules[] | {alert: .name, state: .state}'
Testing Alerts
# Trigger a test alert (CPU stress; run longer than the 5-minute "for" window on HighCpuUsage)
docker run --rm -it progrium/stress --cpu 4 --timeout 360s
# Check Prometheus for firing alerts
docker compose exec prometheus wget -qO- http://localhost:9090/api/v1/alerts | jq '.data.alerts'
# Check that Alertmanager received it (via its v2 API)
docker compose exec alertmanager wget -qO- http://localhost:9093/api/v2/alerts | jq '.'
Tips for Effective Monitoring Generation
- Start with the basics. CPU, memory, disk, and network cover 80% of issues.
- Add application-specific metrics gradually. Database connections, queue depth, cache hit rates.
- Test your alerts. An alert that never fires (or always fires) is useless; see the example after this list.
- Set reasonable thresholds. 80% CPU for 5 minutes is a warning; 95% for 1 minute is critical.
- Group related alerts. Don't wake yourself up for 10 separate disk alerts on one server.
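One low-impact way to test the notification path end to end is to push a synthetic alert straight into Alertmanager and watch for the Discord message. A minimal sketch using Alertmanager's v2 API from inside the container; the label values are placeholders:
docker compose exec alertmanager wget -qO- \
  --header 'Content-Type: application/json' \
  --post-data '[{"labels":{"alertname":"SyntheticTest","severity":"warning","instance":"test"},"annotations":{"description":"Manual test alert"}}]' \
  http://localhost:9093/api/v2/alerts
The alert expires on its own after resolve_timeout (5m in the config above), so you can fire it repeatedly while tuning grouping and templates.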
Quick Reference: Monitoring Prompts
| Need | Prompt Pattern |
|---|---|
| New exporter | "Add [exporter] to monitor [service] with metrics for [specifics]" |
| Alert rule | "Create alert for [condition] lasting [duration] with [severity]" |
| Dashboard | "Create Grafana dashboard showing [metrics] with [visualization types]" |
| Custom metric | "Generate script to expose [metric] for Prometheus textfile collector" |
| Notification | "Configure Alertmanager to send [alert type] to [Discord/Slack/email]" |
What's Next
You now have production-grade monitoring generated through conversation. In Part 5, we'll cover Infrastructure as Code: From Zero to GitOps, generating Ansible playbooks, CI/CD pipelines, and version-controlled infrastructure.
Continue to Part 5