Part 6 of 6
    45 min

    Production Optimization & Scaling

    Scale your CI/CD platform for high availability, performance, and reliability

    Running Concourse in production requires attention to scaling, monitoring, and maintenance. This final guide covers everything you need to operate Concourse reliably at scale.

    High Availability Architecture

    Multi-Node Web Cluster

    For high availability, run multiple ATC (web) nodes behind a load balancer:

                        ┌─────────────────┐
                        │  Load Balancer  │
                        │  (Nginx/HAProxy)│
                        └────────┬────────┘
                                 │
                ┌────────────────┼────────────────┐
                │                │                │
         ┌──────┴──────┐  ┌──────┴──────┐  ┌──────┴──────┐
         │   ATC-1     │  │   ATC-2     │  │   ATC-3     │
         │  (Web/API)  │  │  (Web/API)  │  │  (Web/API)  │
         └──────┬──────┘  └──────┬──────┘  └──────┬──────┘
                │                │                │
                └────────────────┼────────────────┘
                                 │
                        ┌────────┴────────┐
                        │   PostgreSQL    │
                        │   (Primary)     │
                        └────────┬────────┘
                                 │
                        ┌────────┴────────┐
                        │   PostgreSQL    │
                        │   (Replica)     │
                        └─────────────────┘

    Docker Compose for Multi-Web Setup

    docker-compose.ha.yml
    version: '3.8'
    
    services:
      concourse-db:
        image: postgres:15
        environment:
          POSTGRES_DB: concourse
          POSTGRES_USER: concourse_user
          POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
        volumes:
          - pgdata:/var/lib/postgresql/data
        networks:
          - concourse-net
        deploy:
          resources:
            limits:
              memory: 2G
    
      concourse-web-1:
        image: concourse/concourse:7.11
        command: web
        depends_on:
          - concourse-db
        environment: &web-env
          CONCOURSE_POSTGRES_HOST: concourse-db
          CONCOURSE_POSTGRES_USER: concourse_user
          CONCOURSE_POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
          CONCOURSE_POSTGRES_DATABASE: concourse
          CONCOURSE_EXTERNAL_URL: https://concourse.example.com
          CONCOURSE_SESSION_SIGNING_KEY: /keys/session_signing_key
          CONCOURSE_TSA_HOST_KEY: /keys/tsa_host_key
          CONCOURSE_TSA_AUTHORIZED_KEYS: /keys/authorized_worker_keys
          CONCOURSE_CLUSTER_NAME: production
          CONCOURSE_ENABLE_GLOBAL_RESOURCES: "true"
        volumes:
          - ./keys:/keys:ro
        networks:
          - concourse-net
    
      concourse-web-2:
        image: concourse/concourse:7.11
        command: web
        depends_on:
          - concourse-db
        environment: *web-env
        volumes:
          - ./keys:/keys:ro
        networks:
          - concourse-net
    
      concourse-web-3:
        image: concourse/concourse:7.11
        command: web
        depends_on:
          - concourse-db
        environment: *web-env
        volumes:
          - ./keys:/keys:ro
        networks:
          - concourse-net
    
      nginx:
        image: nginx:alpine
        ports:
          - "80:80"
          - "443:443"
          - "2222:2222"  # TSA
        volumes:
          - ./nginx.conf:/etc/nginx/nginx.conf:ro
          - ./certs:/etc/nginx/certs:ro
        depends_on:
          - concourse-web-1
          - concourse-web-2
          - concourse-web-3
        networks:
          - concourse-net
    
    networks:
      concourse-net:
    
    volumes:
      pgdata:
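
    Docker Compose resolves ${POSTGRES_PASSWORD} from the shell environment or from a .env file placed next to the compose file; the value below is only a placeholder:

    .env
    POSTGRES_PASSWORD=replace-with-a-strong-generated-password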

    Load Balancer Configuration

    nginx.conf
    events {
        worker_connections 1024;
    }
    
    http {
        upstream concourse_web {
            least_conn;
            server concourse-web-1:8080;
            server concourse-web-2:8080;
            server concourse-web-3:8080;
        }
    
        server {
            listen 80;
            return 301 https://$host$request_uri;
        }
    
        server {
            listen 443 ssl http2;
            
            ssl_certificate /etc/nginx/certs/server.crt;
            ssl_certificate_key /etc/nginx/certs/server.key;
    
            location / {
                proxy_pass http://concourse_web;
                proxy_http_version 1.1;
                proxy_set_header Upgrade $http_upgrade;
                proxy_set_header Connection "upgrade";
                proxy_set_header Host $host;
                proxy_set_header X-Real-IP $remote_addr;
                proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
                proxy_set_header X-Forwarded-Proto https;
                proxy_read_timeout 900s;
                proxy_buffering off;
            }
        }
    }
    
    stream {
        upstream concourse_tsa {
            server concourse-web-1:2222;
            server concourse-web-2:2222;
            server concourse-web-3:2222;
        }
    
        server {
            listen 2222;
            proxy_pass concourse_tsa;
        }
    }
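
    To confirm requests are reaching the ATCs through the load balancer, Concourse's unauthenticated info endpoint is a quick check (the hostname matches CONCOURSE_EXTERNAL_URL above):

    # Should return JSON including the Concourse version and cluster name
    curl https://concourse.example.com/api/v1/info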

    Worker Scaling

    Horizontal Worker Scaling

    Add workers to handle more concurrent builds:

    docker-compose.workers.yml
    version: '3.8'
    
    services:
      worker-1:
        image: concourse/concourse:7.11
        command: worker
        privileged: true
        environment: &worker-env
          CONCOURSE_TSA_HOST: concourse.example.com:2222
          CONCOURSE_TSA_PUBLIC_KEY: /keys/tsa_host_key.pub
          CONCOURSE_TSA_WORKER_PRIVATE_KEY: /keys/worker_key
          CONCOURSE_RUNTIME: containerd
          CONCOURSE_BAGGAGECLAIM_DRIVER: overlay
          CONCOURSE_WORK_DIR: /worker-state
        volumes:
          - ./keys:/keys:ro
          - worker1-state:/worker-state
    
      worker-2:
        image: concourse/concourse:7.11
        command: worker
        privileged: true
        environment: *worker-env
        volumes:
          - ./keys:/keys:ro
          - worker2-state:/worker-state
    
      worker-3:
        image: concourse/concourse:7.11
        command: worker
        privileged: true
        environment: *worker-env
        volumes:
          - ./keys:/keys:ro
          - worker3-state:/worker-state
    
    volumes:
      worker1-state:
      worker2-state:
      worker3-state:

    Worker Tags for Specialization

    Route specific workloads to appropriate workers:

    # GPU worker
    concourse worker \
      --tag=gpu \
      --tag=high-memory
    
    # ARM worker
    concourse worker \
      --tag=arm64
    
    # Windows worker (for Windows builds)
    concourse worker \
      --tag=windows
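
    If you deploy workers with Docker Compose as above, the same tags can be set through the environment instead of CLI flags; CONCOURSE_TAG is the env-var form of --tag. A sketch with a hypothetical GPU worker service (multiple tags are typically comma-separated; verify with `concourse worker --help`):

    docker-compose.workers.yml (additional service)
      worker-gpu:
        image: concourse/concourse:7.11
        command: worker
        privileged: true
        environment:
          <<: *worker-env                   # reuse the shared worker settings
          CONCOURSE_TAG: "gpu,high-memory"  # env-var equivalent of --tag
        volumes:
          - ./keys:/keys:ro
          - worker-gpu-state:/worker-state  # declare under top-level volumes: too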

    Use tags in your pipeline:

    jobs:
    - name: train-ml-model
      plan:
      - task: train
        tags: [gpu]
        config:
          platform: linux
          # ...
    
    - name: build-arm-image
      plan:
      - task: build
        tags: [arm64]
        config:
          platform: linux
          # ...

    Worker Resource Limits

    environment:
      # These GARDEN_* options apply to the Guardian runtime; with the containerd
      # runtime used in the worker example above, see `concourse worker --help`
      # for the corresponding CONTAINERD_*-prefixed options.
      
      # Limit concurrent containers per worker
      CONCOURSE_GARDEN_MAX_CONTAINERS: 250
      
      # Default memory limit per container
      CONCOURSE_GARDEN_DEFAULT_CONTAINER_MEMORY_LIMIT: 4g
      
      # Default CPU shares per container
      CONCOURSE_GARDEN_DEFAULT_CONTAINER_CPU_SHARES: 1024
      
      # Directory where baggageclaim stores build volumes (use fast local storage)
      CONCOURSE_BAGGAGECLAIM_VOLUMES_DIR: /worker-state/volumes

    Performance Tuning

    Database Optimization

    PostgreSQL tuning for Concourse (example for 8GB RAM):

    postgresql.conf
    # Memory settings
    shared_buffers = 2GB                  # 25% of RAM
    effective_cache_size = 6GB            # 75% of RAM
    maintenance_work_mem = 512MB
    work_mem = 256MB
    
    # SSD storage optimizations
    random_page_cost = 1.1
    effective_io_concurrency = 200
    
    # Connection pooling
    max_connections = 200
    
    # Write performance
    wal_buffers = 64MB
    checkpoint_completion_target = 0.9
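
    If PostgreSQL runs as the concourse-db container from the HA compose file (rather than a managed database), one way to apply these settings is to pass them as server flags on the container command. A sketch mirroring the values above:

    docker-compose.ha.yml (excerpt)
      concourse-db:
        image: postgres:15
        command:
          - postgres
          - -c
          - shared_buffers=2GB
          - -c
          - effective_cache_size=6GB
          - -c
          - max_connections=200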

    Connection Pooling with PgBouncer

    For high-traffic installations:

    pgbouncer.ini
    [databases]
    concourse = host=postgres port=5432 dbname=concourse
    
    [pgbouncer]
    listen_addr = 0.0.0.0
    listen_port = 6432
    auth_type = md5
    auth_file = /etc/pgbouncer/userlist.txt
    pool_mode = transaction
    max_client_conn = 1000
    default_pool_size = 50
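
    To route Concourse through the pooler, point the web nodes at PgBouncer instead of PostgreSQL directly (a sketch assuming PgBouncer runs as a pgbouncer container on the same network):

    environment:
      CONCOURSE_POSTGRES_HOST: pgbouncer
      CONCOURSE_POSTGRES_PORT: 6432
      CONCOURSE_POSTGRES_USER: concourse_user
      CONCOURSE_POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
      CONCOURSE_POSTGRES_DATABASE: concourse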

    Web Node Tuning

    environment:
      # Build scheduling
      CONCOURSE_BUILD_TRACKER_INTERVAL: 5s
      CONCOURSE_RESOURCE_CHECKING_INTERVAL: 30s
      
      # Limit expensive API actions (format is action:limit, e.g. the dashboard's ListAllJobs)
      CONCOURSE_CONCURRENT_REQUEST_LIMIT: "ListAllJobs:5"
      
      # Garbage collection
      CONCOURSE_GC_INTERVAL: 30s
      CONCOURSE_GC_ONE_OFF_GRACE_PERIOD: 5m
      
      # Instanced pipelines and caching of streamed volumes
      CONCOURSE_ENABLE_PIPELINE_INSTANCES: "true"
      CONCOURSE_ENABLE_CACHE_STREAMED_VOLUMES: "true"

    Global Resources

    Reduce redundant resource checks across pipelines:

    environment:
      CONCOURSE_ENABLE_GLOBAL_RESOURCES: "true"

    This deduplicates resource checking when multiple pipelines use the same resource definition.
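
    For example, if several pipelines declare a resource with an identical type and source, the web nodes can share a single check for all of them; the definitions must match exactly. A sketch using a hypothetical shared repository:

    # Declared identically in pipeline-a.yml and pipeline-b.yml;
    # with global resources enabled, both share one check.
    resources:
    - name: shared-lib
      type: git
      source:
        uri: https://github.com/example-org/shared-lib.git
        branch: main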

    Monitoring

    Prometheus Metrics

    Concourse exposes Prometheus metrics:

    environment:
      CONCOURSE_PROMETHEUS_BIND_IP: 0.0.0.0
      CONCOURSE_PROMETHEUS_BIND_PORT: 9391

    Prometheus scrape config:

    prometheus.yml
    scrape_configs:
    - job_name: 'concourse'
      static_configs:
      - targets: 
        - 'concourse-web-1:9391'
        - 'concourse-web-2:9391'
        - 'concourse-web-3:9391'
      metrics_path: /metrics

    Key Metrics to Monitor

    # Build queue depth
    concourse_builds_running
    concourse_builds_pending
    
    # Worker health
    concourse_workers_registered
    concourse_workers_containers
    
    # Resource check performance
    concourse_resource_checks_total
    concourse_resource_check_duration_seconds
    
    # Database connections
    concourse_db_connections_total
    
    # API latency
    concourse_http_responses_duration_seconds_bucket

    Alerting Rules

    prometheus-alerts.yml
    groups:
    - name: concourse
      rules:
      - alert: ConcourseWorkerDown
        expr: concourse_workers_registered < 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "No workers registered"
    
      - alert: ConcourseHighBuildQueue
        expr: concourse_builds_pending > 50
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High build queue ({{ $value }} pending)"
    
      - alert: ConcourseResourceCheckSlow
        expr: histogram_quantile(0.95, sum by (le) (rate(concourse_resource_check_duration_seconds_bucket[5m]))) > 60
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Resource checks taking too long"

    Backup & Recovery

    Database Backups

    backup-concourse.sh
    #!/bin/bash
    set -euo pipefail
    
    BACKUP_DIR="/backups/concourse"
    TIMESTAMP=$(date +%Y%m%d_%H%M%S)
    BACKUP_FILE="${BACKUP_DIR}/concourse_${TIMESTAMP}.sql.gz"
    
    mkdir -p "$BACKUP_DIR"
    
    # Create backup (supply credentials via PGPASSWORD or ~/.pgpass)
    pg_dump -h localhost -U concourse_user -d concourse | gzip > "$BACKUP_FILE"
    
    # Keep only last 7 days
    find "$BACKUP_DIR" -name "concourse_*.sql.gz" -mtime +7 -delete
    
    # Upload to S3
    aws s3 cp "$BACKUP_FILE" s3://my-backups/concourse/

    Schedule with cron:

    0 2 * * * /opt/scripts/backup-concourse.sh

    Key Backups

    backup-keys.sh
    #!/bin/bash
    set -euo pipefail
    
    ARCHIVE="/backups/concourse-keys-$(date +%Y%m%d).tar.gz"
    
    tar -czf "$ARCHIVE" /opt/concourse/keys/
    
    # Encrypt before storing (produces ${ARCHIVE}.gpg)
    gpg --symmetric --cipher-algo AES256 "$ARCHIVE"
    
    # Remove the unencrypted archive once encryption succeeds
    rm "$ARCHIVE"

    Disaster Recovery

    Restore procedure:

    # 1. Stop Concourse
    docker compose down
    
    # 2. Restore database (into a freshly created, empty concourse database)
    gunzip -c backup.sql.gz | psql -h localhost -U concourse_user -d concourse
    
    # 3. Restore keys (if the archive was gpg-encrypted, decrypt it first:
    #    gpg --decrypt concourse-keys-backup.tar.gz.gpg > concourse-keys-backup.tar.gz)
    tar -xzf concourse-keys-backup.tar.gz -C /
    
    # 4. Start Concourse
    docker compose up -d
    
    # 5. Verify workers reconnect
    fly -t main workers

    Maintenance Tasks

    Garbage Collection

    Concourse automatically cleans up, but you can tune it:

    environment:
      # How often to run GC
      CONCOURSE_GC_INTERVAL: 30s
      
      # How long to keep one-off build containers
      CONCOURSE_GC_ONE_OFF_GRACE_PERIOD: 5m
      
      # How long to keep missing workers before pruning
      CONCOURSE_GC_MISSING_GRACE_PERIOD: 10m
      
      # Failed containers grace period
      CONCOURSE_GC_FAILED_GRACE_PERIOD: 120h

    Manual Cleanup

    # Prune stalled workers
    fly -t main prune-worker -w stale-worker-name
    
    # Archive old pipelines
    fly -t main archive-pipeline -p old-pipeline
    
    # Clear cached resource volumes
    fly -t main clear-resource-cache -r my-pipeline/my-resource
    
    # Force a check from a specific version
    fly -t main check-resource -r my-pipeline/my-resource --from version:1.2.3

    Upgrading Concourse

    # 1. Review release notes for breaking changes
    # https://github.com/concourse/concourse/releases
    
    # 2. Backup everything
    ./backup-concourse.sh
    ./backup-keys.sh
    
    # 3. Update image version
    # docker-compose.yml: image: concourse/concourse:7.12
    
    # 4. Rolling restart (for HA setups)
    docker compose up -d --no-deps concourse-web-1
    # Wait for healthy
    docker compose up -d --no-deps concourse-web-2
    docker compose up -d --no-deps concourse-web-3
    
    # 5. Update workers
    docker compose -f docker-compose.workers.yml up -d
    
    # 6. Verify
    fly -t main workers
    fly -t main status
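
    After upgrading, keep the fly CLI in step with the cluster; fly sync downloads the binary matching the new web version:

    # Update the local fly binary to match the upgraded Concourse
    fly -t main sync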

    Cost Optimization

    Right-Size Workers

    Monitor actual resource usage and adjust:

    # Check container resource usage
    fly -t main containers
    
    # Worker resource utilization
    curl -s http://localhost:9391/metrics | grep concourse_workers

    Scheduled Workers

    For non-critical workloads, scale down during off-hours:

    Kubernetes KEDA example
    apiVersion: keda.sh/v1alpha1
    kind: ScaledObject
    metadata:
      name: concourse-worker
    spec:
      scaleTargetRef:
        name: concourse-worker
      minReplicaCount: 1
      maxReplicaCount: 10
      triggers:
      - type: prometheus
        metadata:
          # serverAddress is required by KEDA's Prometheus scaler; adjust to your setup
          serverAddress: http://prometheus:9090
          query: concourse_builds_pending
          threshold: "5"

    Pipeline Efficiency

    Reduce build times and resource usage:

    # Use caching aggressively
    caches:
    - path: .npm
    - path: node_modules
    - path: .cache/pip
    
    # Avoid redundant work
    - get: source
      params:
        depth: 1  # Shallow clone
    
    # Parallelize where possible
    - in_parallel:
        limit: 4  # Control parallelism
        steps:
        - task: test-1
        - task: test-2

    Troubleshooting Guide

    Workers Not Connecting

    # Check TSA logs
    docker compose logs concourse-web-1 | grep -i tsa
    
    # Verify key permissions
    ls -la keys/
    # Should be: -rw-r--r-- (644) for public keys
    #            -rw------- (600) for private keys
    
    # Test TSA connectivity from worker
    nc -zv concourse.example.com 2222

    Builds Stuck Pending

    # Check worker capacity
    fly -t main workers
    
    # Look for resource issues
    fly -t main containers
    
    # Check for job locks
    fly -t main builds -j pipeline/job

    Resource Checks Failing

    # Force resource check
    fly -t main check-resource -r pipeline/resource
    
    # View resource check errors
    fly -t main resource-versions -r pipeline/resource
    
    # Debug with intercept
    fly -t main intercept -j pipeline/job -s get-resource

    High Memory Usage

    # Check container counts per worker
    curl -s http://localhost:9391/metrics | grep containers
    
    # Reduce concurrent builds
    fly -t main pause-job -j pipeline/expensive-job
    
    # As a last resort, restart a web node (GC runs again shortly after startup)
    docker compose restart concourse-web-1

    Production Checklist

    Initial Deployment

    • Multi-node web cluster for HA
    • Multiple workers across availability zones
    • PostgreSQL with replication
    • TLS/HTTPS configured
    • Authentication backend configured
    • Credential manager integrated
    • Monitoring and alerting set up
    • Backup procedures tested

    Ongoing Operations

    • Daily database backups verified
    • Log aggregation configured
    • Alerts responding to pages
    • Regular Concourse updates planned
    • Capacity planning reviewed quarterly
    • Security patches applied promptly
    • Runbooks documented
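
    A quick smoke test covers several of these items after a deployment or upgrade (a sketch; adjust the target name and metrics address to your setup):

    fly -t main status      # API reachable and login valid
    fly -t main workers     # all expected workers registered
    fly -t main pipelines   # pipelines present and unpaused as expected
    curl -sf http://concourse-web-1:9391/metrics > /dev/null && echo "metrics OK"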

    Series Complete!

    Congratulations! You've completed the Concourse CI Mastery Series. You now have the knowledge to deploy, configure, and operate Concourse CI at any scale—from a single VPS to enterprise high-availability clusters.

    Quick Reference

    Topic                     Guide
    Installation              Part 1
    Core Concepts             Part 2
    Pipeline Development      Part 3
    Advanced Patterns         Part 4
    Security                  Part 5
    Production Operations     Part 6 (this guide)

    Additional Resources