AI / ML Guide

    Deploy TorchServe

    Production-ready PyTorch model serving infrastructure. Deploy inference APIs with REST and gRPC on RamNode's VPS hosting.

    PyTorch
    TorchServe
    Java 17
    Nginx
    Let's Encrypt SSL

    Overview

    TorchServe is an open-source, production-grade model serving framework built by AWS and Meta for deploying PyTorch models at scale. It provides a robust REST API and gRPC interface for real-time inference, batch processing, model management, and monitoring — all without building custom serving infrastructure.

    What You Will Build

    • A TorchServe instance running behind a reverse proxy with TLS termination
    • A pre-packaged DenseNet-161 image classification model as a working example
    • Systemd service management for automatic restarts and boot persistence
    • Monitoring and logging infrastructure for production observability
    • Firewall rules and security hardening for public-facing inference APIs

    Prerequisites

    Recommended VPS Specifications

    Component | Minimum          | Recommended
    CPU       | 2 vCPUs          | 4+ vCPUs
    RAM       | 4 GB             | 8–16 GB
    Storage   | 40 GB SSD        | 80+ GB NVMe SSD
    OS        | Ubuntu 22.04 LTS | Ubuntu 24.04 LTS
    Network   | 1 Gbps           | 1 Gbps unmetered

    💡 Sizing Tip: For lightweight models (ResNet, BERT-base), a 4 GB plan works well. For larger models (GPT-2, ViT-Large) or concurrent requests, start with 8 GB+. TorchServe's memory footprint scales directly with model size and worker count.

    Software Requirements

    • SSH access to your RamNode VPS with root or sudo privileges
    • A domain name pointed to your VPS IP (optional, for TLS configuration)
    • Basic familiarity with Linux system administration and PyTorch concepts

    Initial Server Setup

    System Updates and Dependencies

    Java is required because TorchServe's model server runs on the JVM. Python 3.10+ is needed for the model handling and archiving tools.

    Install dependencies
    sudo apt update && sudo apt upgrade -y
    sudo apt install -y openjdk-17-jdk python3 python3-pip python3-venv \
      git wget curl unzip nginx certbot python3-certbot-nginx

    Create a Dedicated Service User

    Running TorchServe under a dedicated non-root user limits the blast radius of any potential vulnerabilities.

    Create torchserve user
    sudo useradd -r -m -d /opt/torchserve -s /bin/bash torchserve
    sudo mkdir -p /opt/torchserve/{model-store,logs,config}
    sudo chown -R torchserve:torchserve /opt/torchserve

    Install TorchServe

    Set Up the Python Environment

    Create virtual environment
    sudo -u torchserve bash -c '
    cd /opt/torchserve
    python3 -m venv venv
    source venv/bin/activate
    pip install --upgrade pip setuptools wheel
    '

    Install PyTorch and TorchServe

    Install the CPU-optimized build. For GPU-equipped plans, substitute the CUDA-enabled variant.

    Install packages
    sudo -u torchserve bash -c '
    source /opt/torchserve/venv/bin/activate
    pip install torch torchvision --index-url \
      https://download.pytorch.org/whl/cpu
    pip install torchserve torch-model-archiver torch-workflow-archiver
    '

    Verify Installation

    Check versions
    sudo -u torchserve bash -c '
    source /opt/torchserve/venv/bin/activate
    torchserve --version
    torch-model-archiver --version
    '

    Both commands should return version numbers. If you see import failures, verify Java 17 is accessible by running java -version.


    Package and Register a Model

    TorchServe uses the MAR (Model Archive) format to bundle model weights, handler code, and metadata into a single deployable artifact. This example uses DenseNet-161 for image classification.

    Download the Pre-trained Model

    Download model, model definition, and labels
    sudo -u torchserve bash -c '
    cd /opt/torchserve

    # Download DenseNet-161 model weights (a state_dict, so the model
    # definition file below is also required when archiving)
    wget -q https://download.pytorch.org/models/densenet161-8d451a50.pth

    # Download the DenseNet-161 model definition
    wget -q https://raw.githubusercontent.com/pytorch/serve/master/examples/image_classifier/densenet_161/model.py

    # Download ImageNet class labels
    wget -q https://raw.githubusercontent.com/pytorch/serve/master/examples/image_classifier/index_to_name.json
    '

    Create the Model Archive

    Archive the model
    sudo -u torchserve bash -c '
    source /opt/torchserve/venv/bin/activate
    cd /opt/torchserve

    torch-model-archiver \
      --model-name densenet161 \
      --version 1.0 \
      --model-file model.py \
      --serialized-file densenet161-8d451a50.pth \
      --handler image_classifier \
      --extra-files index_to_name.json \
      --export-path model-store/
    '

    This produces model-store/densenet161.mar, the deployable artifact TorchServe will load.
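A .mar file is a standard zip archive whose MAR-INF/MANIFEST.json records the model name, version, and handler. If you want to sanity-check what went into it, a stdlib-only sketch (shown here against an in-memory stand-in archive rather than a real model archive):

```python
import io
import zipfile

def list_mar_contents(path_or_buf):
    """List the files bundled in a .mar archive (it is a plain zip)."""
    with zipfile.ZipFile(path_or_buf) as z:
        return sorted(z.namelist())

# Demo on an in-memory stand-in archive:
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("MAR-INF/MANIFEST.json", "{}")
    z.writestr("densenet161-8d451a50.pth", b"")
print(list_mar_contents(buf))  # ['MAR-INF/MANIFEST.json', 'densenet161-8d451a50.pth']
```

Pointing it at model-store/densenet161.mar on the server should show the serialized weights, the labels file, and the MAR-INF metadata.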


    Configure TorchServe

    The configuration below is tuned for a RamNode VPS with 4–8 GB RAM, binding APIs to localhost so they're only accessible through the Nginx reverse proxy.

    Create /opt/torchserve/config/config.properties
    # TorchServe Configuration — RamNode VPS
    
    # Inference API (port 8080, localhost only)
    inference_address=http://127.0.0.1:8080
    
    # Management API (port 8081, localhost only)
    management_address=http://127.0.0.1:8081
    
    # Metrics API (port 8082, localhost only)
    metrics_address=http://127.0.0.1:8082
    
    # Model store directory
    model_store=/opt/torchserve/model-store
    
    # Load all models on startup
    load_models=all
    
    # Worker configuration
    default_workers_per_model=1
    job_queue_size=100
    
    # Memory management
    vmargs=-Xmx2g -XX:+UseG1GC -XX:MaxGCPauseMillis=100
    
    # Request limits
    max_request_size=6553600
    max_response_size=6553600
    
    # Logging
    async_logging=true

    Configuration Parameters

    Parameter                 | Value          | Purpose
    inference_address         | 127.0.0.1:8080 | Binds inference API to localhost for reverse-proxy access only
    management_address        | 127.0.0.1:8081 | Restricts model management to local access
    default_workers_per_model | 1              | One worker per model; increase based on available RAM
    vmargs -Xmx2g             | 2 GB heap      | JVM heap limit; set to ~25% of total RAM
    job_queue_size            | 100            | Max queued requests before rejecting new ones
    max_request_size          | 6553600        | ~6.5 MB limit for image uploads

    ⚠️ Memory Planning: Each TorchServe worker loads a full copy of the model into memory. A single DenseNet-161 worker requires ~300 MB. Formula: total memory = (model size × workers) + 2 GB JVM overhead + 1 GB OS buffer.
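That formula is easy to script when planning worker counts. A quick sketch; the 2 GB JVM and 1 GB OS figures are the rule-of-thumb constants from the note above:

```python
def required_memory_mb(model_size_mb, workers,
                       jvm_overhead_mb=2048, os_buffer_mb=1024):
    """total memory = (model size x workers) + JVM overhead + OS buffer."""
    return model_size_mb * workers + jvm_overhead_mb + os_buffer_mb

# Two DenseNet-161 workers at ~300 MB each:
print(required_memory_mb(300, 2))  # 3672 -> tight on a 4 GB plan, comfortable on 8 GB
```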


    Create a Systemd Service

    Create /etc/systemd/system/torchserve.service
    [Unit]
    Description=TorchServe Model Serving
    After=network.target
    Wants=network-online.target
    
    [Service]
    Type=simple
    User=torchserve
    Group=torchserve
    WorkingDirectory=/opt/torchserve
    Environment=JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
    Environment=PATH=/opt/torchserve/venv/bin:/usr/local/bin:/usr/bin
    # --foreground keeps the JVM attached so Type=simple can track it;
    # without it, torchserve --start daemonizes and systemd marks the unit dead
    ExecStart=/opt/torchserve/venv/bin/torchserve \
      --start \
      --foreground \
      --ts-config /opt/torchserve/config/config.properties \
      --model-store /opt/torchserve/model-store \
      --ncs
    ExecStop=/opt/torchserve/venv/bin/torchserve --stop
    Restart=on-failure
    RestartSec=10
    LimitNOFILE=65536
    StandardOutput=journal
    StandardError=journal
    
    [Install]
    WantedBy=multi-user.target
    Enable and start the service
    sudo systemctl daemon-reload
    sudo systemctl enable torchserve
    sudo systemctl start torchserve
    
    # Verify the service is running
    sudo systemctl status torchserve
    curl -s http://127.0.0.1:8080/ping

    The ping endpoint should return {"status": "Healthy"} once models finish loading. Initial startup may take 30–60 seconds.
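The same readiness check can be scripted for deploy pipelines. A stdlib-only sketch; the localhost URL and 90-second budget are assumptions, not TorchServe defaults:

```python
import json
import time
import urllib.error
import urllib.request

def is_healthy(body):
    """Parse TorchServe's /ping response body: {"status": "Healthy"}."""
    try:
        return json.loads(body).get("status") == "Healthy"
    except (ValueError, AttributeError):
        return False

def wait_for_ready(url="http://127.0.0.1:8080/ping", timeout_s=90):
    """Poll /ping until healthy; startup can take 30-60 s while models load."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if is_healthy(resp.read()):
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet; retry
        time.sleep(3)
    return False
```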


    Nginx Reverse Proxy with TLS

    Nginx handles TLS termination, request buffering, rate limiting, and provides a clean API boundary for clients.

    Create /etc/nginx/sites-available/torchserve
    upstream torchserve_inference {
        server 127.0.0.1:8080;
        keepalive 32;
    }

    # Rate-limiting zone (limit_req_zone must be declared at the http level,
    # i.e. outside any server block, or nginx -t fails)
    limit_req_zone $binary_remote_addr zone=inference:10m rate=10r/s;

    server {
        listen 80;
        server_name your-domain.com;
    
        # Inference API
        location /predictions/ {
            limit_req zone=inference burst=20 nodelay;
            proxy_pass http://torchserve_inference;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;
            proxy_read_timeout 60s;
            proxy_send_timeout 60s;
            client_max_body_size 10m;
        }
    
        # Health check endpoint
        location /ping {
            proxy_pass http://torchserve_inference;
        }
    
        # Defense in depth: refuse /models paths here too
        # (the management API itself listens on 8081 and is never proxied)
        location /models {
            return 403;
        }
    }
    Enable site and obtain TLS
    sudo ln -s /etc/nginx/sites-available/torchserve /etc/nginx/sites-enabled/
    sudo nginx -t
    sudo systemctl reload nginx
    
    # Obtain TLS certificate
    sudo certbot --nginx -d your-domain.com

    Firewall Configuration

    Configure UFW
    sudo ufw default deny incoming
    sudo ufw default allow outgoing
    sudo ufw allow ssh
    sudo ufw allow 'Nginx Full'
    sudo ufw enable
    
    # Verify rules
    sudo ufw status verbose

    🔒 Security Note: The TorchServe management API (port 8081) and metrics API (port 8082) are bound to localhost, so they are unreachable from outside the VPS; UFW blocks the ports as an extra layer, and Nginx never proxies them. For remote management access, use an SSH tunnel: ssh -L 8081:127.0.0.1:8081 user@your-vps-ip


    Test Your Deployment

    Health Check

    Verify service is healthy
    curl -s https://your-domain.com/ping | python3 -m json.tool
    # Expected: {"status": "Healthy"}

    Run an Inference Request

    Test image classification
    # Download a test image
    wget -q -O kitten.jpg \
      https://raw.githubusercontent.com/pytorch/serve/master/examples/image_classifier/kitten.jpg
    
    # Send inference request
    curl -X POST https://your-domain.com/predictions/densenet161 \
      -T kitten.jpg \
      -H 'Content-Type: application/octet-stream' | python3 -m json.tool

    Expected Response

    Classification results
    {
        "tabby": 0.4664836823940277,
        "tiger_cat": 0.4645617604255676,
        "Egyptian_cat": 0.06619937717914581,
        "lynx": 0.0012969186063855886,
        "plastic_bag": 0.00022856894403230399
    }
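The same request can be issued from Python with no extra dependencies. A sketch: the endpoint URL mirrors the curl call above, and top_k is a small helper for ranking a response like the one shown:

```python
import json
import urllib.request

def classify(image_path, url="https://your-domain.com/predictions/densenet161"):
    """POST raw image bytes to the inference endpoint, return parsed JSON."""
    with open(image_path, "rb") as f:
        req = urllib.request.Request(
            url, data=f.read(),
            headers={"Content-Type": "application/octet-stream"})
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.loads(resp.read())

def top_k(predictions, k=3):
    """Rank a classification response (label -> confidence) by confidence."""
    return sorted(predictions.items(), key=lambda kv: kv[1], reverse=True)[:k]

# Ranking the sample response above:
sample = {"tabby": 0.4665, "tiger_cat": 0.4646, "Egyptian_cat": 0.0662}
print(top_k(sample, 2))  # [('tabby', 0.4665), ('tiger_cat', 0.4646)]
```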

    Model Management (localhost only)

    Manage models
    # Check model status
    curl -s http://127.0.0.1:8081/models | python3 -m json.tool
    
    # Scale workers for higher throughput
    curl -X PUT 'http://127.0.0.1:8081/models/densenet161?min_worker=2&max_worker=4'

    Monitoring & Logging

    TorchServe Metrics

    TorchServe exposes Prometheus-compatible metrics on the metrics API endpoint:

    View metrics
    curl -s http://127.0.0.1:8082/metrics

    Key Metrics to Monitor

    Metric                           | Description                 | Alert Threshold
    ts_inference_latency_microsecond | Per-request inference time  | > 5000 ms at p99
    ts_queue_latency_microsecond     | Time spent in request queue | > 1000 ms at p99
    ts_inference_requests_total      | Total requests processed    | Trending to zero
    MemoryUsed                       | JVM heap consumption        | > 85% of -Xmx
    WorkerThreadTime                 | Worker processing time      | Increasing trend
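For ad-hoc checks without a full Prometheus install, the text exposition format is simple enough to parse with the standard library. A sketch; the sample lines below are illustrative, not captured output:

```python
def parse_prom_metrics(text):
    """Map 'name{labels}' -> value from Prometheus text exposition lines,
    skipping comment/TYPE lines and anything without a numeric value."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        try:
            metrics[name] = float(value)
        except ValueError:
            continue
    return metrics

sample = """# TYPE ts_inference_requests_total counter
ts_inference_requests_total{model_name="densenet161"} 42.0
MemoryUsed{Level="Host"} 2048.0"""
print(parse_prom_metrics(sample))
```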

    Journald Log Access

    View logs
    # View live logs
    sudo journalctl -u torchserve -f
    
    # View logs from the last hour
    sudo journalctl -u torchserve --since '1 hour ago'
    
    # Filter for errors only
    sudo journalctl -u torchserve -p err

    Log Rotation

    Create /etc/logrotate.d/torchserve
    /opt/torchserve/logs/*.log {
        daily
        rotate 7
        compress
        delaycompress
        missingok
        notifempty
        copytruncate
    }

    Deploy Your Own Models

    Write a Custom Handler

    /opt/torchserve/handlers/custom_handler.py
    import torch
    import json
    from ts.torch_handler.base_handler import BaseHandler
    
    class MyModelHandler(BaseHandler):
        def preprocess(self, data):
            """Transform raw input into model-ready tensors."""
            inputs = []
            for row in data:
                input_data = row.get('data') or row.get('body')
                if isinstance(input_data, (bytes, bytearray)):
                    input_data = input_data.decode('utf-8')
                parsed = json.loads(input_data)
                tensor = torch.tensor(parsed['input'])
                inputs.append(tensor)
            return torch.stack(inputs)
    
        def postprocess(self, inference_output):
            """Convert model output to API response."""
            return inference_output.tolist()
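Clients of this handler must send the JSON body that preprocess parses. A stdlib sketch of the matching payload (the "input" field name comes from the handler above):

```python
import json

def build_payload(values):
    """Serialize the {"input": [...]} body MyModelHandler.preprocess expects."""
    return json.dumps({"input": values}).encode("utf-8")

body = build_payload([1.0, 2.0, 3.0])
print(body)  # b'{"input": [1.0, 2.0, 3.0]}'
```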

    Archive and Register

    Package and hot-deploy
    sudo -u torchserve bash -c '
    source /opt/torchserve/venv/bin/activate
    
    torch-model-archiver \
      --model-name my_model \
      --version 1.0 \
      --serialized-file /path/to/model.pt \
      --handler handlers/custom_handler.py \
      --export-path model-store/
    '
    
    # Register with the running server (hot-deploy)
    curl -X POST 'http://127.0.0.1:8081/models?url=my_model.mar&model_name=my_model&initial_workers=1'

    Performance Tuning

    Worker Scaling Guidelines

    VPS Plan       | RAM   | Workers | Estimated Throughput
    Standard 4 GB  | 4 GB  | 1–2     | 10–20 req/sec
    Standard 8 GB  | 8 GB  | 2–4     | 20–50 req/sec
    Standard 16 GB | 16 GB | 4–8     | 50–120 req/sec
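The throughput column follows from a simple capacity model that ignores queueing and batching effects; the 80 ms per-request latency below is an assumed figure, not a measurement:

```python
def estimated_throughput_rps(workers, latency_ms):
    """Rough upper bound: each worker completes one request per forward pass."""
    return workers * 1000.0 / latency_ms

print(estimated_throughput_rps(4, 80))  # 50.0 req/sec with 4 workers at 80 ms
```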

    CPU Optimization

    Enable OpenMP and Intel MKL threading for CPU-bound inference:

    Add to systemd service [Service] section
    Environment=OMP_NUM_THREADS=4
    Environment=MKL_NUM_THREADS=4
    Environment=TORCH_NUM_THREADS=4

    Request Batching

    Dynamic batching improves throughput by combining multiple requests into a single forward pass:

    Append to config.properties
    # Dynamic batching
    batch_size=4
    max_batch_delay=100

    The max_batch_delay (ms) controls how long TorchServe waits to fill a batch. Lower values reduce latency; higher values improve throughput under load.
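The trade-off can be reasoned about directly: under light traffic the batch never fills, so a request pays the full max_batch_delay before inference starts. A sketch; the 80 ms forward-pass time is an assumed figure:

```python
def worst_case_latency_ms(inference_ms, max_batch_delay_ms, batch_fills):
    """Client-visible latency: queue wait (up to max_batch_delay when the
    batch does not fill) plus the forward pass itself."""
    wait_ms = 0 if batch_fills else max_batch_delay_ms
    return wait_ms + inference_ms

# batch_size=4, max_batch_delay=100, ~80 ms forward pass:
print(worst_case_latency_ms(80, 100, batch_fills=False))  # 180 -> light traffic
print(worst_case_latency_ms(80, 100, batch_fills=True))   # 80  -> saturated
```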


    Troubleshooting

    TorchServe fails to start

    Java not found or wrong version. Run java -version and ensure OpenJDK 17 is installed.

    curl /ping returns connection refused

    Service not running or still loading. Check systemctl status torchserve and wait 30–60 seconds for model loading.

    Out of memory errors

    Model + JVM exceeds available RAM. Reduce -Xmx, decrease workers, or upgrade your VPS plan.

    502 Bad Gateway from Nginx

    TorchServe not responding on port 8080. Verify the service is running and check logs with journalctl -u torchserve.

    Slow inference (>5s per request)

    Insufficient CPU or unoptimized threading. Set OMP/MKL thread counts and enable request batching.

    Model not loading

    Corrupt .mar file or missing dependencies. Re-archive the model and check handler imports.

    TorchServe Deployed Successfully!

    Your TorchServe instance is now running in production on a RamNode VPS with Nginx reverse proxy, TLS encryption, rate limiting, and Prometheus-compatible monitoring.

    Next Steps:

    • Integrate Prometheus and Grafana for real-time inference dashboards
    • Set up A/B testing with TorchServe's model versioning API
    • Deploy multiple specialized models on the same instance
    • Implement client-side request batching for high-throughput workloads
    • Configure webhook-based model updates from your CI/CD pipeline