A running Kubernetes cluster is just the beginning. Production readiness means knowing what's happening inside your cluster, responding to problems before users notice, scaling to meet demand, and recovering when things go wrong.
This guide covers the operational practices that keep Kubernetes clusters healthy: monitoring with Prometheus and Grafana, alerting, horizontal pod autoscaling, Helm for package management, backup strategies, and disaster recovery.
Monitoring with Prometheus and Grafana
Kubernetes generates a wealth of metrics. Prometheus collects and stores them; Grafana visualizes them. Together, they're the standard monitoring stack for Kubernetes.
Installing the kube-prometheus-stack
The kube-prometheus-stack Helm chart bundles Prometheus, Grafana, Alertmanager, and pre-configured dashboards.
```bash
# Install Helm
curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash

# Add Prometheus community repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
```

Create a monitoring-values.yaml file to customize the stack:

```yaml
grafana:
  adminPassword: "your-secure-password"
  persistence:
    enabled: true
    size: 5Gi
  ingress:
    enabled: true
    annotations:
      cert-manager.io/cluster-issuer: letsencrypt-prod
    hosts:
      - grafana.yourdomain.com
    tls:
      - secretName: grafana-tls
        hosts:
          - grafana.yourdomain.com

prometheus:
  prometheusSpec:
    retention: 15d
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 20Gi

alertmanager:
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 5Gi
```

Then install into its own namespace:

```bash
kubectl create namespace monitoring

helm install kube-prometheus prometheus-community/kube-prometheus-stack \
  -n monitoring \
  -f monitoring-values.yaml

# Wait for pods
kubectl get pods -n monitoring -w
```
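If you skipped the ingress, you can still reach Grafana locally with a port-forward. A minimal sketch, assuming the chart's usual service naming convention (release name plus "-grafana"):

```bash
# Forward the Grafana service to localhost:3000
kubectl port-forward -n monitoring svc/kube-prometheus-grafana 3000:80

# Then log in at http://localhost:3000 with admin / your-secure-password
```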
Key Metrics to Watch
Cluster-level
- node_cpu_seconds_total: CPU usage per node
- node_memory_MemAvailable_bytes: Available memory
- node_filesystem_avail_bytes: Disk space
Kubernetes-level
- kube_pod_status_phase: Pod states
- kube_deployment_status_replicas_unavailable: Deployments running fewer replicas than desired
- container_memory_working_set_bytes: Per-container memory usage
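These raw metrics become useful when combined in PromQL. A couple of illustrative queries (adjust label names to your exporter versions):

```promql
# Percentage of memory available per node
100 * node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes

# Number of pods stuck in Pending across the cluster
sum(kube_pod_status_phase{phase="Pending"})
```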
Adding Custom Application Metrics
If your application exposes a Prometheus endpoint, a ServiceMonitor tells the Prometheus Operator to scrape it:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: monitoring
  labels:
    release: kube-prometheus  # Must match Prometheus selector
spec:
  namespaceSelector:
    matchNames:
      - default
  selector:
    matchLabels:
      app: my-app
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
```
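The selector matches labels on a Service, and port: metrics refers to a named port on that Service, not on the pod. A sketch of a Service this ServiceMonitor would pick up (names are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app
  namespace: default
  labels:
    app: my-app        # Matched by the ServiceMonitor's selector
spec:
  selector:
    app: my-app
  ports:
    - name: metrics    # Matched by the ServiceMonitor's "port: metrics"
      port: 8080
      targetPort: 8080
```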
Alerting
Metrics are useless if no one sees them when things go wrong. Alertmanager handles alert routing and notifications.
Configure routing and receivers in the Helm values:

```yaml
alertmanager:
  config:
    global:
      resolve_timeout: 5m
    route:
      group_by: ['alertname', 'namespace']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
      receiver: 'default'
      routes:
        - match:
            severity: critical
          receiver: 'critical'
    receivers:
      - name: 'default'
        email_configs:
          - to: 'alerts@yourdomain.com'
            from: 'alertmanager@yourdomain.com'
            smarthost: 'smtp.yourdomain.com:587'
            auth_username: 'smtp-user'
            auth_password: 'smtp-password'
      - name: 'critical'
        slack_configs:
          - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
            channel: '#alerts-critical'
```
Built-in Alerts
The kube-prometheus-stack includes alerts for common issues:
- KubePodCrashLooping: Pod repeatedly crashing
- KubePodNotReady: Pod stuck in non-ready state
- NodeNotReady: Node offline
- NodeMemoryHighUtilization: Node running low on memory
- NodeFilesystemSpaceFillingUp: Disk filling up
Creating Custom Alerts
Define your own alert rules with a PrometheusRule resource:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-alerts
  namespace: monitoring
  labels:
    release: kube-prometheus
spec:
  groups:
    - name: my-app
      rules:
        - alert: HighErrorRate
          expr: |
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High error rate detected"
            description: "Error rate is {{ $value | humanizePercentage }}"
        - alert: HighLatency
          expr: |
            histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High latency detected"
            description: "95th percentile latency is {{ $value }}s"
```
Helm for Package Management
Helm is the package manager for Kubernetes. Instead of managing dozens of YAML files, you install applications as "charts" with configurable values.
Helm Concepts
- Chart: A package containing Kubernetes manifests, templates, and default values
- Release: An installed instance of a chart
- Repository: A collection of charts (like apt or npm registries)
- Values: Configuration that customizes a chart for your needs
```bash
# Search for charts
helm search hub wordpress
helm search repo prometheus

# Add a repository
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update

# Show chart information
helm show chart bitnami/wordpress
helm show values bitnami/wordpress  # See configurable values

# Install a chart
helm install my-release bitnami/wordpress -f values.yaml

# List installed releases
helm list -A

# Upgrade a release
helm upgrade my-release bitnami/wordpress -f values.yaml

# Rollback to previous version
helm rollback my-release 1

# Uninstall
helm uninstall my-release
```

Example: Installing Redis with Helm
```bash
# Add Bitnami repo
helm repo add bitnami https://charts.bitnami.com/bitnami

# Create values file
cat <<EOF > redis-values.yaml
architecture: standalone
auth:
  enabled: true
  password: "your-redis-password"
master:
  persistence:
    size: 2Gi
  resources:
    limits:
      memory: 256Mi
      cpu: 250m
EOF

# Install
helm install redis bitnami/redis -f redis-values.yaml -n default

# Get connection info
helm status redis
```
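helm status prints connection notes, which typically include how to read the generated password back out of the release's Secret. A sketch, assuming the chart's usual secret name and key (redis / redis-password):

```bash
# Decode the Redis password from the release secret
kubectl get secret redis -n default \
  -o jsonpath='{.data.redis-password}' | base64 -d
```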
Creating Your Own Helm Charts

```bash
# Generate chart structure
helm create my-app

# Structure created:
# my-app/
# ├── Chart.yaml        # Chart metadata
# ├── values.yaml       # Default configuration
# ├── templates/
# │   ├── deployment.yaml
# │   ├── service.yaml
# │   ├── ingress.yaml
# │   └── _helpers.tpl  # Template helpers
# └── charts/           # Dependencies

# Install from local chart
helm install my-app ./my-app -f production-values.yaml
```
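The files under templates/ are Go templates rendered against values.yaml. An illustrative fragment, simplified from what helm create generates:

```yaml
# templates/deployment.yaml (excerpt)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ include "my-app.fullname" . }}
spec:
  replicas: {{ .Values.replicaCount }}
  template:
    spec:
      containers:
        - name: {{ .Chart.Name }}
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
```

You can render templates locally with helm template ./my-app to inspect the generated manifests before installing.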
Horizontal Pod Autoscaling
Kubernetes can automatically scale deployments based on CPU, memory, or custom metrics. HPA relies on the metrics-server, which k3s ships by default; verify it's running:

```bash
kubectl get deployment metrics-server -n kube-system
```

Basic CPU-Based Autoscaling
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

Important: Your deployment must have resource requests defined for CPU autoscaling to work. Without requests, HPA can't calculate utilization percentage.
```yaml
spec:
  containers:
    - name: app
      resources:
        requests:
          cpu: 100m
          memory: 128Mi
        limits:
          cpu: 500m
          memory: 512Mi
```

Multi-Metric Autoscaling with Behavior
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5 min before scaling down
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15
        - type: Pods
          value: 4
          periodSeconds: 15
      selectPolicy: Max
```

Monitor autoscaler activity:

```bash
kubectl get hpa
kubectl describe hpa my-app-hpa
kubectl get hpa -w  # Watch in real-time
```

Backup Strategies
Kubernetes state lives in etcd and persistent volumes. Back up both.
Backing Up etcd (k3s)
```bash
# Create snapshot
sudo k3s etcd-snapshot save --name manual-backup-$(date +%Y%m%d)

# List snapshots
sudo k3s etcd-snapshot ls

# Snapshots stored in /var/lib/rancher/k3s/server/db/snapshots/
```

Schedule snapshots with cron:

```bash
# /etc/cron.d/k3s-etcd-backup
0 */6 * * * root /usr/local/bin/k3s etcd-snapshot save --name scheduled-$(date +\%Y\%m\%d-\%H\%M)
```
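k3s can also schedule snapshots itself, without cron. A sketch using the server's config file, with key names taken from the k3s etcd snapshot options:

```yaml
# /etc/rancher/k3s/config.yaml
etcd-snapshot-schedule-cron: "0 */6 * * *"
etcd-snapshot-retention: 28
```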
Backing Up with Velero
Velero backs up Kubernetes resources and persistent volumes to object storage.
```bash
# Install CLI
wget https://github.com/vmware-tanzu/velero/releases/download/v1.13.0/velero-v1.13.0-linux-amd64.tar.gz
tar -xvf velero-v1.13.0-linux-amd64.tar.gz
sudo mv velero-v1.13.0-linux-amd64/velero /usr/local/bin/

# Install in cluster (with S3-compatible storage)
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.9.0 \
  --bucket your-backup-bucket \
  --secret-file ./credentials-velero \
  --backup-location-config region=us-east-1,s3ForcePathStyle=true,s3Url=https://your-s3-endpoint \
  --use-volume-snapshots=false \
  --use-node-agent
```
```bash
# Backup entire cluster
velero backup create full-backup

# Backup specific namespace
velero backup create wordpress-backup --include-namespaces wordpress

# Schedule automatic backups
velero schedule create daily-backup --schedule="0 2 * * *" --ttl 168h    # Keep 7 days
velero schedule create weekly-backup --schedule="0 3 * * 0" --ttl 720h   # Keep 30 days

# List and restore
velero backup get
velero restore create --from-backup full-backup
```
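Don't assume a backup succeeded; check its status and logs before you rely on it:

```bash
# Show phase, warnings, errors, and per-resource details
velero backup describe full-backup --details

# Inspect the logs for a specific backup
velero backup logs full-backup
```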
Disaster Recovery
When things go wrong, you need clear procedures.
Scenario: Single Node Failure (Multi-Node Cluster)
Kubernetes handles this automatically: Node becomes NotReady, pods are evicted after ~5 minutes, and rescheduled to healthy nodes.
```bash
kubectl get nodes
kubectl get pods -o wide
kubectl get events --sort-by='.lastTimestamp'
```
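The ~5-minute eviction delay comes from the default not-ready/unreachable tolerations (300s), which Kubernetes adds to every pod. Workloads that should fail over faster can shorten it in the pod template:

```yaml
# In the pod template spec: evict after 60s instead of the 300s default
tolerations:
  - key: node.kubernetes.io/unreachable
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 60
  - key: node.kubernetes.io/not-ready
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 60
```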
Scenario: etcd Failure (HA Cluster)
With 3 server nodes, the cluster tolerates 1 failure. If you lose quorum (2+ failures):
```bash
# On a surviving server node
sudo k3s server --cluster-reset

# This resets to single-node using local etcd data
# Then re-add other nodes
```

Scenario: Complete Cluster Loss
```bash
# Restore etcd snapshot on new server
sudo k3s server \
  --cluster-init \
  --cluster-reset \
  --cluster-reset-restore-path=/path/to/snapshot

# Or restore with Velero:
# 1. Install fresh cluster
# 2. Install Velero with same configuration
# 3. Restore from backup
velero restore create --from-backup full-backup
```

Disaster Recovery Checklist
Before an incident:
- etcd snapshots running on schedule
- Velero backing up to off-site storage
- Backup restoration tested recently
- All manifests stored in version control
- Runbooks documented for common scenarios
During an incident:
- Assess scope (single pod, node, or cluster-wide)
- Check events: kubectl get events --sort-by='.lastTimestamp'
- Check node status: kubectl get nodes
- Review logs in Grafana/Loki
- Follow relevant runbook
Resource Optimization
Running efficiently on VPS means right-sizing resources.
```bash
# View current usage
kubectl top nodes
kubectl top pods -A

# Compare to requests/limits
kubectl describe node node-name | grep -A 5 "Allocated resources"
```

Vertical Pod Autoscaler (VPA)
VPA recommends or automatically adjusts resource requests. Note that VPA is not built into Kubernetes; its components must be installed separately before this resource does anything:
```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Off"  # Just recommendations, no auto-update
```
Pod Disruption Budgets
Ensure availability during voluntary disruptions (upgrades, node maintenance):
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 2  # Or use maxUnavailable: 1
  selector:
    matchLabels:
      app: my-app
```
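Confirm the budget actually matches pods and see how many disruptions are currently allowed:

```bash
kubectl get pdb my-app-pdb
```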
Pod Priority and Preemption
When the cluster is full, the scheduler can evict lower-priority pods to make room for higher-priority ones:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "High priority for critical workloads"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: critical-app
spec:
  selector:
    matchLabels:
      app: critical-app
  template:
    metadata:
      labels:
        app: critical-app
    spec:
      priorityClassName: high-priority
      containers:
        - name: app
          image: critical-app:latest
```

Production Checklist
Reliability
- Resource requests and limits set
- Liveness and readiness probes configured
- Pod Disruption Budgets for critical services
- HPA configured where appropriate
- Multiple replicas for stateless services
Observability
- Prometheus + Grafana deployed
- Alertmanager configured with notifications
- Alerts defined for critical conditions
- Log aggregation in place (Loki)
Security
- Network policies restricting pod communication
- Secrets stored properly (or external vault)
- RBAC configured (not using cluster-admin)
- Pod security standards enforced
Backup & Recovery
- etcd snapshots scheduled
- Velero configured for resource backups
- Persistent volume backups in place
- Restoration procedure tested
What's Next
Your Kubernetes cluster is now production-ready: monitored, alerting, auto-scaling, and backed up. You understand the operational practices that keep clusters healthy.
In Part 7, we'll put your infrastructure to work by deploying real-world applications like Nextcloud and Gitea, configuring production databases, and implementing high-availability patterns.
This guide is part of the Docker & Kubernetes series:
