A running Kubernetes cluster is just the beginning. Production readiness means knowing what's happening inside your cluster, responding to problems before users notice, scaling to meet demand, and recovering when things go wrong.
This guide covers the operational practices that keep Kubernetes clusters healthy: monitoring with Prometheus and Grafana, alerting, horizontal pod autoscaling, Helm for package management, backup strategies, and disaster recovery.
Monitoring with Prometheus and Grafana
Kubernetes generates a wealth of metrics. Prometheus collects and stores them; Grafana visualizes them. Together, they're the standard monitoring stack for Kubernetes.
Installing the kube-prometheus-stack
The kube-prometheus-stack Helm chart bundles Prometheus, Grafana, Alertmanager, and pre-configured dashboards.
```bash
# Install Helm
curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash

# Add Prometheus community repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
```

Create a monitoring-values.yaml file to customize the stack:

```yaml
grafana:
  adminPassword: "your-secure-password"
  persistence:
    enabled: true
    size: 5Gi
  ingress:
    enabled: true
    annotations:
      cert-manager.io/cluster-issuer: letsencrypt-prod
    hosts:
      - grafana.yourdomain.com
    tls:
      - secretName: grafana-tls
        hosts:
          - grafana.yourdomain.com

prometheus:
  prometheusSpec:
    retention: 15d
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 20Gi

alertmanager:
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 5Gi
```

Then install into its own namespace:

```bash
kubectl create namespace monitoring

helm install kube-prometheus prometheus-community/kube-prometheus-stack \
  -n monitoring \
  -f monitoring-values.yaml

# Wait for pods
kubectl get pods -n monitoring -w
```
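If you skipped the ingress, you can still reach Grafana locally with a port-forward. A minimal sketch, assuming the chart's usual service naming convention (release name plus "-grafana"):

```bash
# Forward the Grafana service to localhost:3000
kubectl port-forward -n monitoring svc/kube-prometheus-grafana 3000:80

# Then log in at http://localhost:3000 with admin / your-secure-password
```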
Key Metrics to Watch
Cluster-level
- node_cpu_seconds_total: CPU usage per node
- node_memory_MemAvailable_bytes: Available memory
- node_filesystem_avail_bytes: Disk space
Kubernetes-level
- kube_pod_status_phase: Pod states
- kube_deployment_status_replicas_unavailable: Deployments running fewer replicas than desired
- container_memory_working_set_bytes: Per-container memory usage
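These raw metrics become useful when combined in PromQL. A couple of illustrative queries (adjust label names to your exporter versions):

```promql
# Percentage of memory available per node
100 * node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes

# Number of pods stuck in Pending across the cluster
sum(kube_pod_status_phase{phase="Pending"})
```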
Adding Custom Application Metrics
If your application exposes a Prometheus endpoint, a ServiceMonitor tells the Prometheus Operator to scrape it:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: monitoring
  labels:
    release: kube-prometheus  # Must match Prometheus selector
spec:
  namespaceSelector:
    matchNames:
      - default
  selector:
    matchLabels:
      app: my-app
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
```
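The selector matches labels on a Service, and port: metrics refers to a named port on that Service, not on the pod. A sketch of a Service this ServiceMonitor would pick up (names are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app
  namespace: default
  labels:
    app: my-app        # Matched by the ServiceMonitor's selector
spec:
  selector:
    app: my-app
  ports:
    - name: metrics    # Matched by the ServiceMonitor's "port: metrics"
      port: 8080
      targetPort: 8080
```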
Alerting
Metrics are useless if no one sees them when things go wrong. Alertmanager handles alert routing and notifications.
Configure routing and receivers in the Helm values:

```yaml
alertmanager:
  config:
    global:
      resolve_timeout: 5m
    route:
      group_by: ['alertname', 'namespace']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
      receiver: 'default'
      routes:
        - match:
            severity: critical
          receiver: 'critical'
    receivers:
      - name: 'default'
        email_configs:
          - to: 'alerts@yourdomain.com'
            from: 'alertmanager@yourdomain.com'
            smarthost: 'smtp.yourdomain.com:587'
            auth_username: 'smtp-user'
            auth_password: 'smtp-password'
      - name: 'critical'
        slack_configs:
          - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
            channel: '#alerts-critical'
```
Built-in Alerts
The kube-prometheus-stack includes alerts for common issues:
- KubePodCrashLooping: Pod repeatedly crashing
- KubePodNotReady: Pod stuck in non-ready state
- NodeNotReady: Node offline
- NodeMemoryHighUtilization: Node running low on memory
- NodeFilesystemSpaceFillingUp: Disk filling up
Creating Custom Alerts
Define your own alert rules with a PrometheusRule resource:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-alerts
  namespace: monitoring
  labels:
    release: kube-prometheus
spec:
  groups:
    - name: my-app
      rules:
        - alert: HighErrorRate
          expr: |
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High error rate detected"
            description: "Error rate is {{ $value | humanizePercentage }}"
        - alert: HighLatency
          expr: |
            histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High latency detected"
            description: "95th percentile latency is {{ $value }}s"
```
Helm for Package Management
Helm is the package manager for Kubernetes. Instead of managing dozens of YAML files, you install applications as "charts" with configurable values.
Helm Concepts
- Chart: A package containing Kubernetes manifests, templates, and default values
- Release: An installed instance of a chart
- Repository: A collection of charts (like apt or npm registries)
- Values: Configuration that customizes a chart for your needs
```bash
# Search for charts
helm search hub wordpress
helm search repo prometheus

# Add a repository
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update

# Show chart information
helm show chart bitnami/wordpress
helm show values bitnami/wordpress  # See configurable values

# Install a chart
helm install my-release bitnami/wordpress -f values.yaml

# List installed releases
helm list -A

# Upgrade a release
helm upgrade my-release bitnami/wordpress -f values.yaml

# Rollback to previous version
helm rollback my-release 1

# Uninstall
helm uninstall my-release
```

Example: Installing Redis with Helm
```bash
# Add Bitnami repo
helm repo add bitnami https://charts.bitnami.com/bitnami

# Create values file
cat <<EOF > redis-values.yaml
architecture: standalone
auth:
  enabled: true
  password: "your-redis-password"
master:
  persistence:
    size: 2Gi
  resources:
    limits:
      memory: 256Mi
      cpu: 250m
EOF

# Install
helm install redis bitnami/redis -f redis-values.yaml -n default

# Get connection info
helm status redis
```
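helm status prints connection notes, which typically include how to read the generated password back out of the release's Secret. A sketch, assuming the chart's usual secret name and key (redis / redis-password):

```bash
# Decode the Redis password from the release secret
kubectl get secret redis -n default \
  -o jsonpath='{.data.redis-password}' | base64 -d
```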
Creating Your Own Helm Charts

```bash
# Generate chart structure
helm create my-app

# Structure created:
# my-app/
# ├── Chart.yaml        # Chart metadata
# ├── values.yaml       # Default configuration
# ├── templates/
# │   ├── deployment.yaml
# │   ├── service.yaml
# │   ├── ingress.yaml
# │   └── _helpers.tpl  # Template helpers
# └── charts/           # Dependencies

# Install from local chart
helm install my-app ./my-app -f production-values.yaml
```
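The files under templates/ are Go templates rendered against values.yaml. An illustrative fragment, simplified from what helm create generates:

```yaml
# templates/deployment.yaml (excerpt)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ include "my-app.fullname" . }}
spec:
  replicas: {{ .Values.replicaCount }}
  template:
    spec:
      containers:
        - name: {{ .Chart.Name }}
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
```

You can render templates locally with helm template ./my-app to inspect the generated manifests before installing.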
Horizontal Pod Autoscaling
Kubernetes can automatically scale deployments based on CPU, memory, or custom metrics. HPA relies on the metrics-server, which k3s ships by default; verify it's running:

```bash
kubectl get deployment metrics-server -n kube-system
```

Basic CPU-Based Autoscaling
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

Important: Your deployment must have resource requests defined for CPU autoscaling to work. Without requests, HPA can't calculate utilization percentage.
```yaml
spec:
  containers:
    - name: app
      resources:
        requests:
          cpu: 100m
          memory: 128Mi
        limits:
          cpu: 500m
          memory: 512Mi
```

Multi-Metric Autoscaling with Behavior
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5 min before scaling down
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15
        - type: Pods
          value: 4
          periodSeconds: 15
      selectPolicy: Max
```

Monitor autoscaler activity:

```bash
kubectl get hpa
kubectl describe hpa my-app-hpa
kubectl get hpa -w  # Watch in real-time
```

Backup Strategies
Kubernetes state lives in etcd and persistent volumes. Back up both.
Backing Up etcd (k3s)
```bash
# Create snapshot
sudo k3s etcd-snapshot save --name manual-backup-$(date +%Y%m%d)

# List snapshots
sudo k3s etcd-snapshot ls

# Snapshots stored in /var/lib/rancher/k3s/server/db/snapshots/
```

Schedule snapshots with cron:

```bash
# /etc/cron.d/k3s-etcd-backup
0 */6 * * * root /usr/local/bin/k3s etcd-snapshot save --name scheduled-$(date +\%Y\%m\%d-\%H\%M)
```
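k3s can also schedule snapshots itself, without cron. A sketch using the server's config file, with key names taken from the k3s etcd snapshot options:

```yaml
# /etc/rancher/k3s/config.yaml
etcd-snapshot-schedule-cron: "0 */6 * * *"
etcd-snapshot-retention: 28
```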
Backing Up with Velero
Velero backs up Kubernetes resources and persistent volumes to object storage.
```bash
# Install CLI
wget https://github.com/vmware-tanzu/velero/releases/download/v1.13.0/velero-v1.13.0-linux-amd64.tar.gz
tar -xvf velero-v1.13.0-linux-amd64.tar.gz
sudo mv velero-v1.13.0-linux-amd64/velero /usr/local/bin/

# Install in cluster (with S3-compatible storage)
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.9.0 \
  --bucket your-backup-bucket \
  --secret-file ./credentials-velero \
  --backup-location-config region=us-east-1,s3ForcePathStyle=true,s3Url=https://your-s3-endpoint \
  --use-volume-snapshots=false \
  --use-node-agent
```
```bash
# Backup entire cluster
velero backup create full-backup

# Backup specific namespace
velero backup create wordpress-backup --include-namespaces wordpress

# Schedule automatic backups
velero schedule create daily-backup --schedule="0 2 * * *" --ttl 168h    # Keep 7 days
velero schedule create weekly-backup --schedule="0 3 * * 0" --ttl 720h   # Keep 30 days

# List and restore
velero backup get
velero restore create --from-backup full-backup
```
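Don't assume a backup succeeded; check its status and logs before you rely on it:

```bash
# Show phase, warnings, errors, and per-resource details
velero backup describe full-backup --details

# Inspect the logs for a specific backup
velero backup logs full-backup
```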
Disaster Recovery
When things go wrong, you need clear procedures.
Scenario: Single Node Failure (Multi-Node Cluster)
Kubernetes handles this automatically: Node becomes NotReady, pods are evicted after ~5 minutes, and rescheduled to healthy nodes.
```bash
kubectl get nodes
kubectl get pods -o wide
kubectl get events --sort-by='.lastTimestamp'
```
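The ~5-minute eviction delay comes from the default not-ready/unreachable tolerations (300s), which Kubernetes adds to every pod. Workloads that should fail over faster can shorten it in the pod template:

```yaml
# In the pod template spec: evict after 60s instead of the 300s default
tolerations:
  - key: node.kubernetes.io/unreachable
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 60
  - key: node.kubernetes.io/not-ready
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 60
```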
Scenario: etcd Failure (HA Cluster)
With 3 server nodes, the cluster tolerates 1 failure. If you lose quorum (2+ failures):
```bash
# On a surviving server node
sudo k3s server --cluster-reset

# This resets to single-node using local etcd data
# Then re-add other nodes
```

Scenario: Complete Cluster Loss
```bash
# Restore etcd snapshot on new server
sudo k3s server \
  --cluster-init \
  --cluster-reset \
  --cluster-reset-restore-path=/path/to/snapshot

# Or restore with Velero:
# 1. Install fresh cluster
# 2. Install Velero with same configuration
# 3. Restore from backup
velero restore create --from-backup full-backup
```

Disaster Recovery Checklist
Before an incident:
- etcd snapshots running on schedule
- Velero backing up to off-site storage
- Backup restoration tested recently
- All manifests stored in version control
- Runbooks documented for common scenarios
During an incident:
- Assess scope (single pod, node, or cluster-wide)
- Check events: kubectl get events --sort-by='.lastTimestamp'
- Check node status: kubectl get nodes
- Review logs in Grafana/Loki
- Follow relevant runbook
Resource Optimization
Running efficiently on VPS means right-sizing resources.
```bash
# View current usage
kubectl top nodes
kubectl top pods -A

# Compare to requests/limits
kubectl describe node node-name | grep -A 5 "Allocated resources"
```

Vertical Pod Autoscaler (VPA)
VPA recommends or automatically adjusts resource requests. Note that VPA is not built into Kubernetes; its components must be installed separately before this resource does anything:
```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Off"  # Just recommendations, no auto-update
```
Pod Disruption Budgets
Ensure availability during voluntary disruptions (upgrades, node maintenance):
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 2  # Or use maxUnavailable: 1
  selector:
    matchLabels:
      app: my-app
```
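Confirm the budget actually matches pods and see how many disruptions are currently allowed:

```bash
kubectl get pdb my-app-pdb
```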
Pod Priority and Preemption
When the cluster is full, the scheduler can evict lower-priority pods to make room for higher-priority ones:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "High priority for critical workloads"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: critical-app
spec:
  selector:
    matchLabels:
      app: critical-app
  template:
    metadata:
      labels:
        app: critical-app
    spec:
      priorityClassName: high-priority
      containers:
        - name: app
          image: critical-app:latest
```

Production Checklist
Reliability
- Resource requests and limits set
- Liveness and readiness probes configured
- Pod Disruption Budgets for critical services
- HPA configured where appropriate
- Multiple replicas for stateless services
Observability
- Prometheus + Grafana deployed
- Alertmanager configured with notifications
- Alerts defined for critical conditions
- Log aggregation in place (Loki)
Security
- Network policies restricting pod communication
- Secrets stored properly (or external vault)
- RBAC configured (not using cluster-admin)
- Pod security standards enforced
Backup & Recovery
- etcd snapshots scheduled
- Velero configured for resource backups
- Persistent volume backups in place
- Restoration procedure tested
What's Next
Your Kubernetes cluster is now production-ready: monitored, alerting, auto-scaling, and backed up. You understand the operational practices that keep clusters healthy.
In Part 7, we'll put your infrastructure to work by deploying real-world applications like Nextcloud and Gitea, configuring production databases, and implementing high-availability patterns.
This guide is part of the Docker & Kubernetes series:
