Part 4 of 10

    Automated Server Monitoring & Alerting

    Build a complete observability stack—Prometheus, Grafana, and Alertmanager—through conversation. Know what's happening before your users do.

    You've deployed your infrastructure and applications. Now comes the question every self-hoster faces at 2 AM: "Is everything still running?"

    In this guide, we'll use Claude Code to build a complete monitoring stack from scratch—Prometheus for metrics collection, Grafana for visualization, Alertmanager for notifications, and custom exporters for your specific needs.

    Deploy Monitoring Separately

    For production reliability, run your monitoring stack on a separate VPS in a different region from the infrastructure it watches. If your primary server goes down, monitoring hosted elsewhere can still detect the outage and alert you. A small 1GB instance is enough to monitor several servers.
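
    If you go that route, each monitored server only needs its Node Exporter reachable from the monitoring box (ideally firewalled to that one IP). A minimal sketch of the extra scrape job; the IP and labels below are placeholders for your own hosts:

    prometheus/prometheus.yml (remote host sketch)
      - job_name: 'remote-node'
        static_configs:
          - targets: ['203.0.113.10:9100']  # placeholder: your app server's IP
            labels:
              instance: 'app-server-1'      # hypothetical name used in dashboards and alerts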

    1. Prerequisites

    • Claude Code installed (see Part 1)
    • Docker and Docker Compose on your VPS
    • Basic understanding of metrics (we'll explain as we go)

    2. The Monitoring Stack We'll Build

    Component           Purpose
    Prometheus          Time-series database, scrapes and stores metrics
    Grafana             Visualization dashboards
    Alertmanager        Routes alerts to Discord, Slack, email, etc.
    Node Exporter       Host metrics (CPU, memory, disk, network)
    cAdvisor            Container metrics
    Blackbox Exporter   Endpoint probing (HTTP, TCP, ICMP)

    3. Core Monitoring Stack

    Let's start with the foundation. Ask Claude Code:

    Prompt to Claude Code
    Create a Docker Compose Monitoring Stack
    
    Services:
    - Prometheus (latest) with 15-day retention
    - Grafana (latest) with anonymous access disabled
    - Node Exporter for host metrics
    - cAdvisor for container metrics
    
    Requirements:
    - All on a dedicated monitoring network
    - Prometheus config mounted from ./prometheus/
    - Grafana data persisted
    - Grafana provisioned with Prometheus as default datasource
    - Node Exporter needs access to host /proc, /sys, and root filesystem (read-only)
    - cAdvisor needs Docker socket access
    
    Expose only Grafana on port 3000. Prometheus should be internal-only.

    Generated docker-compose.yml

    docker-compose.yml
    services:
      prometheus:
        image: prom/prometheus:latest
        container_name: prometheus
        restart: unless-stopped
        command:
          - --config.file=/etc/prometheus/prometheus.yml
          - --storage.tsdb.path=/prometheus
          - --storage.tsdb.retention.time=15d
          - --web.console.libraries=/etc/prometheus/console_libraries
          - --web.console.templates=/etc/prometheus/consoles
          - --web.enable-lifecycle
        volumes:
          - ./prometheus:/etc/prometheus:ro
          - prometheus-data:/prometheus
        networks:
          - monitoring
        healthcheck:
          test: ["CMD", "wget", "-q", "--spider", "http://localhost:9090/-/healthy"]
          interval: 30s
          timeout: 10s
          retries: 3
    
      grafana:
        image: grafana/grafana:latest
        container_name: grafana
        restart: unless-stopped
        ports:
          - "3000:3000"
        environment:
          - GF_SECURITY_ADMIN_USER=${GRAFANA_ADMIN_USER:-admin}
          - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_ADMIN_PASSWORD:?Grafana password required}
          - GF_USERS_ALLOW_SIGN_UP=false
          - GF_AUTH_ANONYMOUS_ENABLED=false
          - GF_SERVER_ROOT_URL=${GRAFANA_ROOT_URL:-http://localhost:3000}
        volumes:
          - grafana-data:/var/lib/grafana
          - ./grafana/provisioning:/etc/grafana/provisioning:ro
        networks:
          - monitoring
        depends_on:
          - prometheus
    
      node-exporter:
        image: prom/node-exporter:latest
        container_name: node-exporter
        restart: unless-stopped
        command:
          - '--path.procfs=/host/proc'
          - '--path.sysfs=/host/sys'
          - '--path.rootfs=/rootfs'
          - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($|/)'
        volumes:
          - /proc:/host/proc:ro
          - /sys:/host/sys:ro
          - /:/rootfs:ro
        networks:
          - monitoring
        pid: host
    
      cadvisor:
        image: gcr.io/cadvisor/cadvisor:latest
        container_name: cadvisor
        restart: unless-stopped
        privileged: true
        volumes:
          - /:/rootfs:ro
          - /var/run:/var/run:ro
          - /sys:/sys:ro
          - /var/lib/docker/:/var/lib/docker:ro
          - /dev/disk/:/dev/disk:ro
        networks:
          - monitoring
        devices:
          - /dev/kmsg
    
    networks:
      monitoring:
        driver: bridge
    
    volumes:
      prometheus-data:
      grafana-data:

    Prometheus Configuration

    prometheus/prometheus.yml
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
    
    scrape_configs:
      # Prometheus self-monitoring
      - job_name: 'prometheus'
        static_configs:
          - targets: ['localhost:9090']
    
      # Node Exporter - Host metrics
      - job_name: 'node'
        static_configs:
          - targets: ['node-exporter:9100']
            labels:
              # Prometheus does not expand environment variables in this file;
              # set the instance name literally or template the file at deploy time
              instance: 'my-server'
    
      # cAdvisor - Container metrics
      - job_name: 'cadvisor'
        static_configs:
          - targets: ['cadvisor:8080']

    Grafana Datasource Provisioning

    grafana/provisioning/datasources/datasources.yml
    apiVersion: 1
    
    datasources:
      - name: Prometheus
        type: prometheus
        access: proxy
        url: http://prometheus:9090
        isDefault: true
        editable: false
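
    Once the stack is up, you can confirm provisioning worked without clicking through the UI; this check uses Grafana's HTTP API with the admin credentials from your .env:

    Terminal
    curl -s -u "admin:${GRAFANA_ADMIN_PASSWORD}" http://localhost:3000/api/datasources | jq '.[] | {name, type, isDefault}'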

    Environment Configuration

    .env.example
    # Grafana
    GRAFANA_ADMIN_USER=admin
    GRAFANA_ADMIN_PASSWORD=change-me-secure-password
    GRAFANA_ROOT_URL=http://localhost:3000
    
    4. Adding Alertmanager

    Now let's add alerting:

    Prompt to Claude Code
    Add Alertmanager to the stack with:
    - Discord webhook notifications (primary)
    - Email as backup notification channel
    - Alert grouping by severity and service
    - 5-minute group wait, 10-minute group interval
    - Silence and inhibition rules for maintenance
    
    Create alert rules for:
    - Instance down (any target unreachable for 2 minutes)
    - High CPU (>80% for 5 minutes)
    - High memory (>85% for 5 minutes)
    - Disk space low (<15% free)
    - Container restart loop (>3 restarts in 10 minutes)
    
    Update Prometheus to load the rules and connect to Alertmanager.

    Add to docker-compose.yml

    docker-compose.yml (addition)
      alertmanager:
        image: prom/alertmanager:latest
        container_name: alertmanager
        restart: unless-stopped
        command:
          - '--config.file=/etc/alertmanager/alertmanager.yml'
          - '--storage.path=/alertmanager'
          - '--web.external-url=${ALERTMANAGER_URL:-http://localhost:9093}'
        volumes:
          - ./alertmanager:/etc/alertmanager:ro
          - alertmanager-data:/alertmanager
        networks:
          - monitoring
        healthcheck:
          test: ["CMD", "wget", "-q", "--spider", "http://localhost:9093/-/healthy"]
          interval: 30s
          timeout: 10s
          retries: 3
    
    # Add to volumes:
      alertmanager-data:

    Alertmanager Configuration

    Note that Alertmanager, like Prometheus, does not expand environment variables inside its config file. Render this file at deploy time (for example with envsubst) or use the file-based secret options such as webhook_url_file.

    alertmanager/alertmanager.yml
    global:
      resolve_timeout: 5m
      smtp_smarthost: '${SMTP_HOST:-smtp.example.com}:${SMTP_PORT:-587}'
      smtp_from: '${SMTP_FROM:-alerts@example.com}'
      smtp_auth_username: '${SMTP_USER:-}'
      smtp_auth_password: '${SMTP_PASSWORD:-}'
    
    route:
      group_by: ['alertname', 'severity', 'service']
      group_wait: 5m
      group_interval: 10m
      repeat_interval: 4h
      receiver: 'discord-notifications'
      routes:
        - match:
            severity: critical
          receiver: 'discord-notifications'
          group_wait: 30s
          repeat_interval: 1h
        - match:
            severity: warning
          receiver: 'discord-notifications'
          group_wait: 5m
          repeat_interval: 6h
    
    receivers:
      - name: 'discord-notifications'
        discord_configs:
          - webhook_url: '${DISCORD_WEBHOOK_URL}'
            title: '{{ template "discord.title" . }}'
            message: '{{ template "discord.message" . }}'
        email_configs:
          - to: '${ALERT_EMAIL:-admin@example.com}'
            send_resolved: true
    
    inhibit_rules:
      - source_match:
          severity: 'critical'
        target_match:
          severity: 'warning'
        equal: ['alertname', 'instance']
    
    templates:
      - '/etc/alertmanager/templates/*.tmpl'
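
    Before (re)starting the stack, validate the rendered file (after envsubst has filled in the placeholders). A sketch using the amtool binary bundled in the Alertmanager image:

    Terminal
    docker run --rm -v "$(pwd)/alertmanager:/etc/alertmanager:ro" --entrypoint amtool prom/alertmanager:latest check-config /etc/alertmanager/alertmanager.yml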

    Discord Message Template

    alertmanager/templates/discord.tmpl
    {{ define "discord.title" }}
    [{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .CommonLabels.alertname }}
    {{ end }}
    
    {{ define "discord.message" }}
    {{ range .Alerts }}
    **Alert:** {{ .Labels.alertname }}
    **Severity:** {{ .Labels.severity }}
    **Instance:** {{ .Labels.instance }}
    **Description:** {{ .Annotations.description }}
    **Started:** {{ .StartsAt.Format "2006-01-02 15:04:05 UTC" }}
    {{ if eq .Status "resolved" }}**Resolved:** {{ .EndsAt.Format "2006-01-02 15:04:05 UTC" }}{{ end }}
    ---
    {{ end }}
    {{ end }}

    Prometheus Alert Rules

    prometheus/rules/alerts.yml
    groups:
      - name: instance_alerts
        interval: 30s
        rules:
          - alert: InstanceDown
            expr: up == 0
            for: 2m
            labels:
              severity: critical
            annotations:
              summary: "Instance {{ $labels.instance }} is down"
              description: "{{ $labels.job }} on {{ $labels.instance }} has been unreachable for more than 2 minutes."
    
      - name: host_alerts
        interval: 30s
        rules:
          - alert: HighCpuUsage
            expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
            for: 5m
            labels:
              severity: warning
              service: host
            annotations:
              summary: "High CPU usage on {{ $labels.instance }}"
              description: "CPU usage is above 80% for more than 5 minutes."
    
          - alert: HighMemoryUsage
            expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
            for: 5m
            labels:
              severity: warning
              service: host
            annotations:
              summary: "High memory usage on {{ $labels.instance }}"
              description: "Memory usage is above 85%."
    
          - alert: DiskSpaceLow
            expr: (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes) * 100 < 15
            for: 5m
            labels:
              severity: warning
              service: host
            annotations:
              summary: "Low disk space on {{ $labels.instance }}"
              description: "Disk {{ $labels.mountpoint }} has less than 15% free space."
    
          - alert: DiskSpaceCritical
            expr: (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes) * 100 < 5
            for: 2m
            labels:
              severity: critical
              service: host
            annotations:
              summary: "Critical disk space on {{ $labels.instance }}"
              description: "Disk {{ $labels.mountpoint }} has less than 5% free space."
    
      - name: container_alerts
        interval: 30s
        rules:
          - alert: ContainerRestartLoop
            # cAdvisor exposes no restart counter; count jumps in container start time instead
            expr: changes(container_start_time_seconds{name!=""}[10m]) > 3
            for: 1m
            labels:
              severity: warning
              service: container
            annotations:
              summary: "Container {{ $labels.name }} is restarting frequently"
              description: "Container has restarted multiple times in the last 10 minutes."

    Updated Prometheus Config

    prometheus/prometheus.yml (updated)
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
    
    alerting:
      alertmanagers:
        - static_configs:
            - targets: ['alertmanager:9093']
    
    rule_files:
      - '/etc/prometheus/rules/*.yml'
    
    scrape_configs:
      - job_name: 'prometheus'
        static_configs:
          - targets: ['localhost:9090']
    
      - job_name: 'node'
        static_configs:
          - targets: ['node-exporter:9100']
            labels:
              instance: 'my-server'  # literal; Prometheus won't expand env vars
    
      - job_name: 'cadvisor'
        static_configs:
          - targets: ['cadvisor:8080']
    
      - job_name: 'alertmanager'
        static_configs:
          - targets: ['alertmanager:9093']
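
    Because Prometheus runs with --web.enable-lifecycle, config changes can be applied with a POST to its reload endpoint. Prometheus isn't published on the host, so run it from inside the container:

    Terminal
    docker compose exec prometheus wget -qO- --post-data='' http://localhost:9090/-/reload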

    5. Blackbox Exporter for Endpoint Monitoring

    Monitor your services from the outside:

    Prompt to Claude Code
    Add Blackbox Exporter to probe:
    - HTTP endpoints (status codes, SSL cert expiry, response time)
    - TCP connectivity (database ports, Redis)
    - ICMP ping (for network debugging)
    
    Create probes for:
    - My main website (https://example.com) - check every 30s
    - API health endpoint (https://api.example.com/health)
    - Database connectivity (internal PostgreSQL on port 5432)
    
    Add alerts for:
    - Endpoint down for 1 minute
    - SSL certificate expiring within 14 days
    - Response time > 2 seconds

    Add to docker-compose.yml

    docker-compose.yml (addition)
      blackbox:
        image: prom/blackbox-exporter:latest
        container_name: blackbox
        restart: unless-stopped
        # ICMP probes require raw sockets inside the container
        cap_add:
          - NET_RAW
        volumes:
          - ./blackbox:/etc/blackbox:ro
        command:
          - '--config.file=/etc/blackbox/blackbox.yml'
        networks:
          - monitoring

    Blackbox Configuration

    blackbox/blackbox.yml
    modules:
      http_2xx:
        prober: http
        timeout: 10s
        http:
          valid_http_versions: ["HTTP/1.1", "HTTP/2"]
          valid_status_codes: []  # Defaults to 2xx
          method: GET
    
      http_2xx_ssl:
        prober: http
        timeout: 10s
        http:
          valid_http_versions: ["HTTP/1.1", "HTTP/2"]
          valid_status_codes: []
          method: GET
          tls_config:
            insecure_skip_verify: false
    
      tcp_connect:
        prober: tcp
        timeout: 5s
    
      icmp:
        prober: icmp
        timeout: 5s
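
    You can exercise a module by hand before Prometheus ever scrapes it. This runs wget from the Prometheus container (which has it) against blackbox on the shared network:

    Terminal
    docker compose exec prometheus wget -qO- 'http://blackbox:9115/probe?module=http_2xx_ssl&target=https://example.com' | grep -E '^probe_(success|duration_seconds)'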

    Prometheus Scrape Config for Blackbox

    Add to prometheus/prometheus.yml scrape_configs
      # Blackbox HTTP probes
      - job_name: 'blackbox-http'
        metrics_path: /probe
        params:
          module: [http_2xx_ssl]
        static_configs:
          - targets:
              - https://example.com
              - https://api.example.com/health
        relabel_configs:
          - source_labels: [__address__]
            target_label: __param_target
          - source_labels: [__param_target]
            target_label: instance
          - target_label: __address__
            replacement: blackbox:9115
    
      # Blackbox TCP probes
      - job_name: 'blackbox-tcp'
        metrics_path: /probe
        params:
          module: [tcp_connect]
        static_configs:
          - targets:
              - postgres:5432
              - redis:6379
        relabel_configs:
          - source_labels: [__address__]
            target_label: __param_target
          - source_labels: [__param_target]
            target_label: instance
          - target_label: __address__
            replacement: blackbox:9115

    Blackbox Alert Rules

    prometheus/rules/blackbox_alerts.yml
    groups:
      - name: blackbox_alerts
        interval: 30s
        rules:
          - alert: EndpointDown
            expr: probe_success == 0
            for: 1m
            labels:
              severity: critical
              service: endpoint
            annotations:
              summary: "Endpoint {{ $labels.instance }} is down"
              description: "{{ $labels.instance }} has been unreachable for more than 1 minute."
    
          - alert: SSLCertExpiringSoon
            expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 14
            for: 1h
            labels:
              severity: warning
              service: ssl
            annotations:
              summary: "SSL certificate expiring soon for {{ $labels.instance }}"
              description: "SSL certificate expires in less than 14 days."
    
          - alert: SSLCertExpiryCritical
            expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 3
            for: 1h
            labels:
              severity: critical
              service: ssl
            annotations:
              summary: "SSL certificate expiring very soon for {{ $labels.instance }}"
              description: "SSL certificate expires in less than 3 days!"
    
          - alert: SlowResponse
            # probe_http_duration_seconds is split per phase; alert on the total instead
            expr: probe_duration_seconds > 2
            for: 5m
            labels:
              severity: warning
              service: endpoint
            annotations:
              summary: "Slow response from {{ $labels.instance }}"
              description: "Response time is greater than 2 seconds."

    6. Custom Monitoring Scripts

    Sometimes you need metrics Prometheus doesn't have exporters for. Claude Code can generate custom collectors:

    Prompt to Claude Code
    Write a bash script for the Node Exporter textfile collector.
    
    Metrics needed:
    - Number of files in /var/log older than 7 days
    - Size of Docker volumes in MB
    - Number of failed SSH login attempts today
    - Last backup age in hours (check /backup/latest timestamp)
    - Number of pending apt upgrades
    
    Output in Prometheus format to /var/lib/node_exporter/textfile_collector/custom.prom

    Custom Metrics Script

    scripts/custom_metrics.sh
    #!/bin/bash
    set -euo pipefail
    
    OUTPUT_FILE="/var/lib/node_exporter/textfile_collector/custom.prom"
    TMP_FILE="${OUTPUT_FILE}.tmp"
    
    # Initialize output
    cat > "$TMP_FILE" << 'EOF'
    # HELP custom_old_log_files Number of log files older than 7 days
    # TYPE custom_old_log_files gauge
    # HELP custom_docker_volume_size_mb Size of Docker volumes in MB
    # TYPE custom_docker_volume_size_mb gauge
    # HELP custom_failed_ssh_attempts Failed SSH login attempts today
    # TYPE custom_failed_ssh_attempts gauge
    # HELP custom_backup_age_hours Age of last backup in hours
    # TYPE custom_backup_age_hours gauge
    # HELP custom_pending_upgrades Number of pending apt upgrades
    # TYPE custom_pending_upgrades gauge
    EOF
    
    # Old log files (older than 7 days)
    old_logs=$(find /var/log -type f -mtime +7 2>/dev/null | wc -l)
    echo "custom_old_log_files ${old_logs}" >> "$TMP_FILE"
    
    # Docker volume sizes in MB (needs root to read /var/lib/docker/volumes)
    if [[ -d /var/lib/docker/volumes ]]; then
      for vol in /var/lib/docker/volumes/*/; do
        [[ -d "$vol" ]] || continue
        size_mb=$(du -sm "$vol" 2>/dev/null | cut -f1 || true)
        echo "custom_docker_volume_size_mb{volume=\"$(basename "$vol")\"} ${size_mb:-0}" >> "$TMP_FILE"
      done
    fi
    
    # Failed SSH attempts today
    if [[ -f /var/log/auth.log ]]; then
      today=$(date '+%b %e')  # %e pads the day with a space, matching syslog timestamps
      failed_ssh=$(grep "Failed password" /var/log/auth.log 2>/dev/null | grep -c "$today" || true)
    else
      failed_ssh=0
    fi
    echo "custom_failed_ssh_attempts ${failed_ssh}" >> "$TMP_FILE"
    
    # Backup age
    backup_file="/backup/latest"
    if [[ -f "$backup_file" ]]; then
      backup_time=$(stat -c %Y "$backup_file")
      current_time=$(date +%s)
      age_hours=$(( (current_time - backup_time) / 3600 ))
      echo "custom_backup_age_hours ${age_hours}" >> "$TMP_FILE"
    else
      echo "custom_backup_age_hours -1" >> "$TMP_FILE"
    fi
    
    # Pending apt upgrades
    if command -v apt &> /dev/null; then
      pending=$(apt list --upgradable 2>/dev/null | grep -c upgradable || true)
    else
      pending=0
    fi
    echo "custom_pending_upgrades ${pending}" >> "$TMP_FILE"
    
    # Atomic move
    mv "$TMP_FILE" "$OUTPUT_FILE"

    Cron Job

    Add to crontab
    # Run every 5 minutes from root's crontab (the script reads auth.log and /var/lib/docker)
    */5 * * * * /opt/monitoring/scripts/custom_metrics.sh

    Update Node Exporter

    docker-compose.yml (updated node-exporter)
      node-exporter:
        image: prom/node-exporter:latest
        container_name: node-exporter
        restart: unless-stopped
        command:
          - '--path.procfs=/host/proc'
          - '--path.sysfs=/host/sys'
          - '--path.rootfs=/rootfs'
          - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($|/)'
          - '--collector.textfile.directory=/var/lib/node_exporter/textfile_collector'
        volumes:
          - /proc:/host/proc:ro
          - /sys:/host/sys:ro
          - /:/rootfs:ro
          - /var/lib/node_exporter/textfile_collector:/var/lib/node_exporter/textfile_collector:ro
        networks:
          - monitoring
        pid: host
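
    After the next scrape interval, confirm a custom metric made it end to end:

    Terminal
    docker compose exec prometheus wget -qO- 'http://localhost:9090/api/v1/query?query=custom_backup_age_hours' | jq '.data.result'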

    7. Grafana Dashboard Provisioning

    Let's auto-provision a useful dashboard:

    Prompt to Claude Code
    Create a Grafana Dashboard JSON
    
    Row 1 - System Overview:
    - CPU usage gauge
    - Memory usage gauge
    - Disk usage gauge
    - Uptime stat
    
    Row 2 - Resource Trends:
    - CPU usage over time (line graph)
    - Memory usage over time (line graph)
    - Network traffic in/out (line graph)
    
    Row 3 - Container Status:
    - Table of all containers with CPU, memory, status
    - Container restart count over time
    
    Row 4 - Alerts:
    - Current firing alerts table
    - Alert history timeline
    
    Make it responsive and use a dark theme. Include template variables for instance selection.

    Dashboard Provisioning Config

    grafana/provisioning/dashboards/dashboard.yml
    apiVersion: 1
    
    providers:
      - name: 'default'
        orgId: 1
        folder: ''
        type: file
        disableDeletion: false
        updateIntervalSeconds: 30
        options:
          path: /etc/grafana/provisioning/dashboards

    Claude Code will generate a complete JSON dashboard file with all the panels, queries, and styling. The generated dashboard includes template variables for filtering by instance, plus dark-theme styling.
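
    If you want to sanity-check the provisioning pipeline before generating the full dashboard, a minimal hand-written file is enough. This sketch defines a single CPU gauge; the uid and title are arbitrary choices, not required values:

    grafana/provisioning/dashboards/server-overview.json (minimal sketch)
    {
      "title": "Server Overview",
      "uid": "server-overview",
      "timezone": "browser",
      "refresh": "30s",
      "schemaVersion": 39,
      "panels": [
        {
          "type": "gauge",
          "title": "CPU Usage",
          "gridPos": { "h": 6, "w": 6, "x": 0, "y": 0 },
          "targets": [
            { "refId": "A", "expr": "100 - (avg(rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)" }
          ],
          "fieldConfig": { "defaults": { "unit": "percent", "min": 0, "max": 100 } }
        }
      ]
    }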

    8. Deployment and Verification

    Deploy the Complete Stack

    Terminal
    # Create directories
    mkdir -p prometheus/rules alertmanager/templates blackbox grafana/provisioning/{datasources,dashboards}
    
    # Set permissions for node_exporter textfile collector
    sudo mkdir -p /var/lib/node_exporter/textfile_collector
    sudo chmod 755 /var/lib/node_exporter/textfile_collector
    
    # Deploy
    docker compose up -d
    
    # Verify all services are healthy
    docker compose ps
    
    # Check Prometheus targets (Prometheus is internal-only, so run the query inside the container)
    docker compose exec prometheus wget -qO- http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'
    
    # Check alerting rules
    docker compose exec prometheus wget -qO- http://localhost:9090/api/v1/rules | jq '.data.groups[].rules[] | {alert: .name, state: .state}'

    Testing Alerts

    Verify alerting pipeline
    # Trigger a test alert: stress CPU for longer than the alert's 5-minute "for" clause
    docker run --rm -it progrium/stress --cpu 4 --timeout 360s
    
    # Check Prometheus for firing alerts (internal-only, so exec into the container)
    docker compose exec prometheus wget -qO- http://localhost:9090/api/v1/alerts | jq '.data.alerts'
    
    # Check Alertmanager received it (API v1 was removed; use v2)
    docker compose exec alertmanager wget -qO- http://localhost:9093/api/v2/alerts | jq '.'
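
    The Alertmanager prompt asked for maintenance silencing; silences are created at runtime rather than in the config file. A sketch using amtool from inside the container:

    Terminal
    docker compose exec alertmanager amtool --alertmanager.url=http://localhost:9093 silence add alertname=HighCpuUsage --author=admin --comment="planned maintenance" --duration=2h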

    9. Tips for Effective Monitoring Generation

    • Start with the basics. CPU, memory, disk, and network cover 80% of issues.
    • Add application-specific metrics gradually. Database connections, queue depth, cache hit rates.
    • Test your alerts. An alert that never fires (or always fires) is useless; promtool can unit-test rules offline, as sketched after this list.
    • Set reasonable thresholds. 80% CPU for 5 minutes is a warning; 95% for 1 minute is critical.
    • Group related alerts. Don't wake yourself up for 10 separate disk alerts on one server.
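
    For the alert-testing point, here is a minimal rule unit test against the InstanceDown rule from earlier; the tests/ directory and file name are my own choices:

    prometheus/tests/alerts_test.yml
    rule_files:
      - ../rules/alerts.yml
    
    evaluation_interval: 1m
    
    tests:
      - interval: 1m
        # up drops to 0 at t=2m; with "for: 2m" the alert should be firing by t=5m
        input_series:
          - series: 'up{job="node", instance="my-server"}'
            values: '1 1 0 0 0 0'
        alert_rule_test:
          - eval_time: 5m
            alertname: InstanceDown
            exp_alerts:
              - exp_labels:
                  severity: critical
                  job: node
                  instance: my-server
                exp_annotations:
                  summary: "Instance my-server is down"
                  description: "node on my-server has been unreachable for more than 2 minutes."

    Terminal
    docker run --rm -v "$(pwd)/prometheus:/cfg:ro" --entrypoint promtool prom/prometheus:latest test rules /cfg/tests/alerts_test.yml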

    Quick Reference: Monitoring Prompts

    Need            Prompt Pattern
    New exporter    "Add [exporter] to monitor [service] with metrics for [specifics]"
    Alert rule      "Create alert for [condition] lasting [duration] with [severity]"
    Dashboard       "Create Grafana dashboard showing [metrics] with [visualization types]"
    Custom metric   "Generate script to expose [metric] for Prometheus textfile collector"
    Notification    "Configure Alertmanager to send [alert type] to [Discord/Slack/email]"

    What's Next

    You now have production-grade monitoring generated through conversation. In Part 5, we'll cover Infrastructure as Code: From Zero to GitOps, generating Ansible playbooks, CI/CD pipelines, and version-controlled infrastructure.

    Continue to Part 5