Part 3 of 7

    Docker in Production

    Monitoring, security, and maintenance practices that separate hobby deployments from production-ready systems.

    Monitoring
    Security
    Backups

    Getting containers running is one thing. Keeping them running reliably, securely, and efficiently on a production VPS is another challenge entirely. This guide covers the operational practices that separate a hobby deployment from a production-ready system.

    1

    Resource Monitoring

    On a VPS with limited resources, visibility into what your containers are consuming is essential. A runaway container can starve others or crash your server.

    Quick Monitoring with Docker Stats

    The built-in command provides real-time resource usage:

    Real-time stats
    docker stats
    Example output
    CONTAINER ID   NAME        CPU %     MEM USAGE / LIMIT     MEM %     NET I/O
    a1b2c3d4e5f6   wordpress   0.50%     128MiB / 512MiB       25.00%    1.2MB / 800KB
    f6e5d4c3b2a1   mariadb     1.20%     256MiB / 512MiB       50.00%    800KB / 1.2MB

    Setting Resource Limits

    Prevent any single container from monopolizing your VPS:

    Resource limits in Compose
    services:
      app:
        image: your-app:latest
        deploy:
          resources:
            limits:
              cpus: '1.0'
              memory: 512M
            reservations:
              cpus: '0.25'
              memory: 128M
    • limits: Hard ceiling—container is throttled (CPU) or killed (memory) if exceeded
    • reservations: Guaranteed minimum resources
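
    To confirm the limits actually took effect, you can inspect the running container (the container name below is illustrative; check docker ps for the name Compose generated):

    Verify applied limits
    # NanoCpus is in billionths of a CPU, Memory is in bytes
    docker inspect --format 'CPUs={{.HostConfig.NanoCpus}} Memory={{.HostConfig.Memory}}' myproject-app-1
    docker stats --no-stream myproject-app-1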

    Lightweight Monitoring with cAdvisor

    For a small VPS, a full Prometheus + Grafana stack might be overkill:

    cAdvisor setup
    services:
      cadvisor:
        image: gcr.io/cadvisor/cadvisor:latest
        ports:
          - "127.0.0.1:8080:8080"
        volumes:
          - /:/rootfs:ro
          - /var/run:/var/run:ro
          - /sys:/sys:ro
          - /var/lib/docker/:/var/lib/docker:ro
        restart: unless-stopped

    Access the web UI at http://localhost:8080 via SSH tunnel.
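
    Because the port is bound to 127.0.0.1 on the VPS, forward it to your workstation first (replace user@your-vps with your own login):

    SSH tunnel
    ssh -L 8080:127.0.0.1:8080 user@your-vps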

    Full Monitoring Stack

    For multiple services or when you need alerting:

    Prometheus + Grafana stack
    services:
      prometheus:
        image: prom/prometheus:latest
        ports:
          - "127.0.0.1:9090:9090"
        volumes:
          - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
          - prometheus-data:/prometheus
        command:
          - '--config.file=/etc/prometheus/prometheus.yml'
          - '--storage.tsdb.retention.time=15d'
        restart: unless-stopped
    
      grafana:
        image: grafana/grafana:latest
        ports:
          - "127.0.0.1:3000:3000"
        volumes:
          - grafana-data:/var/lib/grafana
        environment:
          - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD}
          - GF_USERS_ALLOW_SIGN_UP=false
        restart: unless-stopped
    
      node-exporter:
        image: prom/node-exporter:latest
        ports:
          - "127.0.0.1:9100:9100"
        volumes:
          - /proc:/host/proc:ro
          - /sys:/host/sys:ro
          - /:/rootfs:ro
        command:
          - '--path.procfs=/host/proc'
          - '--path.sysfs=/host/sys'
          - '--path.rootfs=/rootfs'
        restart: unless-stopped
    
    volumes:
      prometheus-data:
      grafana-data:

    Create prometheus.yml:

    prometheus.yml
    global:
      scrape_interval: 15s
    
    scrape_configs:
      - job_name: 'prometheus'
        static_configs:
          - targets: ['localhost:9090']
    
      - job_name: 'node'
        static_configs:
          - targets: ['node-exporter:9100']
    
      - job_name: 'cadvisor'
        static_configs:
          - targets: ['cadvisor:8080']

    This stack uses around 300-500MB of RAM in total. Note that the node-exporter and cadvisor targets in prometheus.yml are resolved by Compose service name, so Prometheus can only reach them if they run in the same Compose project (or share a Docker network).
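
    Once the stack is up, a quick way to confirm Prometheus is actually scraping its targets is to query its HTTP API from the VPS:

    Check scrape targets
    curl -s http://127.0.0.1:9090/api/v1/targets | grep -o '"health":"[^"]*"'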

    2

    Logging Strategies

    Container logs can quickly fill your disk if left unchecked. Docker's default logging driver keeps logs indefinitely.

    Configure Log Rotation

    Set global defaults in /etc/docker/daemon.json:

    /etc/docker/daemon.json
    {
      "log-driver": "json-file",
      "log-opts": {
        "max-size": "10m",
        "max-file": "3"
      }
    }

    Restart Docker to apply:

    Apply changes
    sudo systemctl restart docker

    This limits each container to 30MB of logs (3 files × 10MB). Existing containers need to be recreated.
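
    Recreating is a one-liner per Compose project; the new containers then pick up the daemon-wide defaults:

    Recreate containers
    docker compose up -d --force-recreate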

    Per-Container Log Configuration

    Override per container
    services:
      app:
        image: your-app:latest
        logging:
          driver: json-file
          options:
            max-size: "50m"
            max-file: "5"

    Centralized Logging with Loki

    For searching and analyzing logs across containers:

    Loki + Promtail
    services:
      loki:
        image: grafana/loki:latest
        ports:
          - "127.0.0.1:3100:3100"
        volumes:
          - loki-data:/loki
        restart: unless-stopped
    
      promtail:
        image: grafana/promtail:latest
        volumes:
          - /var/lib/docker/containers:/var/lib/docker/containers:ro
          - /var/run/docker.sock:/var/run/docker.sock
          - ./promtail.yml:/etc/promtail/promtail.yml:ro
        command: -config.file=/etc/promtail/promtail.yml
        restart: unless-stopped
    
    volumes:
      loki-data:
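
    The promtail.yml referenced above isn't shown here; a minimal sketch that tails every container's JSON log file from the mounted directory and pushes to the loki service could look like this (job label is illustrative):

    promtail.yml
    server:
      http_listen_port: 9080

    positions:
      filename: /tmp/positions.yaml

    clients:
      - url: http://loki:3100/loki/api/v1/push

    scrape_configs:
      - job_name: containers
        static_configs:
          - targets: [localhost]
            labels:
              job: docker
              __path__: /var/lib/docker/containers/*/*-json.log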

    3

    Automated Updates with Watchtower

    Keeping container images updated is critical for security patches. Watchtower automatically pulls new images and restarts containers.

    Basic Watchtower Setup

    Watchtower
    services:
      watchtower:
        image: containrrr/watchtower
        volumes:
          - /var/run/docker.sock:/var/run/docker.sock
        environment:
          - WATCHTOWER_CLEANUP=true
          - WATCHTOWER_SCHEDULE=0 0 4 * * *
        restart: unless-stopped

    This checks for updates daily at 4 AM and removes old images after updating.
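
    To trigger an immediate check instead of waiting for the schedule, Watchtower can also be run as a one-off with its --run-once flag:

    One-off update check
    docker run --rm \
      -v /var/run/docker.sock:/var/run/docker.sock \
      containrrr/watchtower --run-once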

    Selective Updates

    Update only specific containers by labeling them:

    Selective updates with labels
    services:
      watchtower:
        image: containrrr/watchtower
        volumes:
          - /var/run/docker.sock:/var/run/docker.sock
        environment:
          - WATCHTOWER_LABEL_ENABLE=true
          - WATCHTOWER_CLEANUP=true
          - WATCHTOWER_SCHEDULE=0 0 4 * * *
        restart: unless-stopped
    
      # This container will be auto-updated
      nginx:
        image: nginx:latest
        labels:
          - com.centurylinklabs.watchtower.enable=true
    
      # This container will NOT be auto-updated
      database:
        image: postgres:16
        labels:
          - com.centurylinklabs.watchtower.enable=false

    When Not to Use Automatic Updates

    Disable automatic updates for:

    • Databases: Schema changes in new versions can break things
    • Stateful applications: Where updates require migration steps
    • Production-critical services: Where you want to test updates first

    For these, pin specific versions:

    Pin versions
    services:
      database:
        image: postgres:16.1  # Pinned version, not :latest or :16
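
    Updating a pinned service then becomes a deliberate step: bump the tag in the Compose file, pull, and recreate only that service when you're ready:

    Manual update
    docker compose pull database
    docker compose up -d database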

    4

    Backup Strategies

    Containers are ephemeral, but data isn't. A solid backup strategy covers both volumes and configuration.

    Direct Volume Backup

    Backup a volume
    # Stop the container first for consistency
    docker compose stop db
    
    # Create a backup
    docker run --rm \
      -v db-data:/source:ro \
      -v $(pwd)/backups:/backup \
      alpine tar czf /backup/db-backup-$(date +%Y%m%d).tar.gz -C /source .
    
    # Restart
    docker compose start db

    Database-Specific Dumps (Preferred)

    Database dumps
    # PostgreSQL (-T disables the pseudo-TTY so the redirected dump stays clean)
    docker compose exec -T db pg_dump -U postgres mydb > backup.sql
    
    # MySQL/MariaDB (assumes MARIADB_ROOT_PASSWORD was set on the container)
    docker compose exec -T db sh -c 'mariadb-dump -u root -p"$MARIADB_ROOT_PASSWORD" mydb' > backup.sql
    
    # MongoDB
    docker compose exec mongo mongodump --out /backup

    Database dumps are more portable and can be restored to different versions.

    Automated Backup Script

    Create /opt/docker-backup.sh:

    /opt/docker-backup.sh
    #!/bin/bash
    set -e
    
    BACKUP_DIR="/opt/backups"
    RETENTION_DAYS=7
    DATE=$(date +%Y%m%d_%H%M%S)
    
    mkdir -p "$BACKUP_DIR"
    
    # Backup Docker volumes
    echo "Backing up Docker volumes..."
    for volume in $(docker volume ls -q); do
        echo "  Backing up $volume"
        docker run --rm \
            -v "$volume":/source:ro \
            -v "$BACKUP_DIR":/backup \
            alpine tar czf "/backup/${volume}_${DATE}.tar.gz" -C /source .
    done
    
    # Backup Compose files
    echo "Backing up Compose configurations..."
    tar czf "$BACKUP_DIR/compose-configs_${DATE}.tar.gz" /opt/docker-apps/
    
    # Clean old backups
    echo "Removing backups older than $RETENTION_DAYS days..."
    find "$BACKUP_DIR" -name "*.tar.gz" -mtime +$RETENTION_DAYS -delete
    
    echo "Backup complete!"

    Schedule with cron:

    Schedule backup
    chmod +x /opt/docker-backup.sh
    
    # Run daily at 2 AM
    echo "0 2 * * * root /opt/docker-backup.sh >> /var/log/docker-backup.log 2>&1" | sudo tee /etc/cron.d/docker-backup

    Off-Site Backups

    Sync to S3-compatible object storage:

    rclone sync
    # Install rclone
    curl https://rclone.org/install.sh | sudo bash
    
    # Configure (interactive)
    rclone config
    
    # Add to backup script
    rclone sync /opt/backups remote:docker-backups --transfers 4

    Restoring from Backup

    Restore commands
    # Restore a volume
    docker compose stop app
    docker run --rm \
      -v app-data:/target \
      -v $(pwd)/backups:/backup:ro \
      alpine sh -c "rm -rf /target/* && tar xzf /backup/app-data_20240115.tar.gz -C /target"
    docker compose start app
    
    # Restore a database dump
    docker compose exec -T db psql -U postgres mydb < backup.sql

    5

    Security Hardening

    Docker's defaults prioritize convenience over security. For production, tighten things up.

    Run Containers as Non-Root

    By default, processes inside containers run as root. If an attacker escapes the container, they have root on the host.

    Non-root user
    services:
      app:
        image: your-app:latest
        user: "1000:1000"  # Run as non-root user
        
      nginx:
        image: nginx:latest
        # the stock nginx image expects to start as root; a fully non-root
        # setup may need the nginxinc/nginx-unprivileged image instead
        user: "nginx"
        
      postgres:
        image: postgres:16
        user: "postgres"

    Read-Only Filesystems

    Prevent attackers from modifying the container filesystem:

    Read-only container
    services:
      app:
        image: your-app:latest
        read_only: true
        tmpfs:
          - /tmp
          - /var/run
        volumes:
          - app-data:/data  # Writable volume for legitimate data

    The tmpfs mounts provide writable temporary directories in memory.

    Drop Unnecessary Capabilities

    Linux capabilities grant specific privileges. Drop all and add only what's needed:

    Drop capabilities
    services:
      app:
        image: your-app:latest
        cap_drop:
          - ALL
        cap_add:
          - NET_BIND_SERVICE  # Only if binding to ports < 1024

    Network Segmentation

    Isolate sensitive services:

    Network isolation
    services:
      nginx:
        networks:
          - frontend
    
      app:
        networks:
          - frontend
          - backend
    
      database:
        networks:
          - backend  # Not accessible from nginx
    
    networks:
      frontend:
      backend:
        internal: true  # No external access at all

    Protect the Docker Socket

    For services that need Docker access, use socket proxies with limited permissions:

    Docker socket proxy
    services:
      docker-proxy:
        image: tecnativa/docker-socket-proxy
        environment:
          - CONTAINERS=1
          - IMAGES=1
          - POST=0  # Read-only
        volumes:
          - /var/run/docker.sock:/var/run/docker.sock:ro
        networks:
          - docker-api
    
      app-needing-docker:
        environment:
          - DOCKER_HOST=tcp://docker-proxy:2375
        networks:
          - docker-api
    
    networks:
      docker-api:
        internal: true

    Scan Images for Vulnerabilities

    Trivy scan
    docker run --rm \
      -v /var/run/docker.sock:/var/run/docker.sock \
      aquasec/trivy image your-image:tag

    Security Checklist

    • Containers run as non-root where possible
    • Resource limits set on all containers
    • Log rotation configured
    • Unnecessary capabilities dropped
    • Sensitive services on internal networks
    • Docker socket only mounted when required
    • Images from trusted sources only
    • Regular backup schedule configured
    • Update strategy defined (automatic or manual)
    • Firewall rules limit exposed ports

    6

    Health Checks and Self-Healing

    Production containers should recover from transient failures automatically.

    Define Health Checks

    Health check configuration
    services:
      web:
        image: nginx:latest
        healthcheck:
          test: ["CMD", "curl", "-f", "http://localhost/health"]
          interval: 30s
          timeout: 10s
          retries: 3
          start_period: 40s
    • test: Command run inside the container to check health; the binary it calls (curl here) must exist in the image, and exit code 0 means healthy
    • interval: Time between checks
    • timeout: Max time for check to complete
    • retries: Consecutive failures before unhealthy
    • start_period: Grace period for startup
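
    The current health state shows up in docker ps and can also be read directly:

    Check health status
    # only populated if the container defines a healthcheck
    docker inspect --format '{{.State.Health.Status}}' your-container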

    Common Health Check Patterns

    Health check patterns
    # HTTP endpoint
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
    
    # TCP port check
    healthcheck:
      test: ["CMD", "nc", "-z", "localhost", "5432"]
    
    # PostgreSQL
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
    
    # MySQL/MariaDB
    healthcheck:
      test: ["CMD", "healthcheck.sh", "--connect", "--innodb_initialized"]
    
    # Redis
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]

    Combine with Restart Policies

    Self-healing container
    services:
      app:
        restart: unless-stopped
        healthcheck:
          test: ["CMD", "curl", "-f", "http://localhost/health"]
          interval: 30s
          retries: 3

    Note that the restart policy only restarts containers that crash or exit; plain Docker does not restart a container just because its health check fails. The unhealthy status shows up in docker ps, gates dependent services via depends_on, and tools like willfarrell/autoheal can watch it and restart the container for you.
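
    Health status is also what depends_on can wait on, so a dependent service only starts once its backing service passes its check; a small sketch with illustrative service names:

    Wait for a healthy dependency
    services:
      db:
        image: postgres:16
        healthcheck:
          test: ["CMD-SHELL", "pg_isready -U postgres"]
          interval: 10s
          retries: 5

      app:
        image: your-app:latest
        restart: unless-stopped
        depends_on:
          db:
            condition: service_healthy  # start only after db reports healthy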

    7

    Production Compose File Template

    Here's a template incorporating all the production practices:

    Production docker-compose.yml
    services:
      app:
        image: your-app:latest
        restart: unless-stopped
        user: "1000:1000"
        read_only: true
        tmpfs:
          - /tmp
        cap_drop:
          - ALL
        deploy:
          resources:
            limits:
              cpus: '1.0'
              memory: 512M
        healthcheck:
          test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
          interval: 30s
          timeout: 10s
          retries: 3
        logging:
          driver: json-file
          options:
            max-size: "10m"
            max-file: "3"
        networks:
          - frontend
          - backend
        volumes:
          - app-data:/data
    
      database:
        image: postgres:16
        restart: unless-stopped
        user: "postgres"
        deploy:
          resources:
            limits:
              memory: 1G
        healthcheck:
          test: ["CMD-SHELL", "pg_isready -U postgres"]
          interval: 10s
          timeout: 5s
          retries: 5
        logging:
          driver: json-file
          options:
            max-size: "10m"
            max-file: "3"
        networks:
          - backend
        volumes:
          - db-data:/var/lib/postgresql/data
        environment:
          POSTGRES_PASSWORD_FILE: /run/secrets/db_password
        secrets:
          - db_password
    
    networks:
      frontend:
      backend:
        internal: true
    
    volumes:
      app-data:
      db-data:
    
    secrets:
      db_password:
        file: ./secrets/db_password.txt
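
    Before deploying, it can help to validate the file and then bring the stack up:

    Validate and deploy
    docker compose config --quiet && docker compose up -d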

    What's Next

    Your Docker deployments are now production-ready: monitored, logged, backed up, and secured. But Docker Compose has limits—it runs on a single host and lacks advanced orchestration features. In Part 4, we'll introduce Kubernetes concepts and explore when you might need to move beyond Compose to a full container orchestration platform.