OpenVINO Backend
    OpenAI-Compatible

    Deploy Aphrodite Engine on a VPS

    A high-throughput OpenAI-compatible LLM server built on PagedAttention. The OpenVINO backend is the fastest CPU path Aphrodite offers: it supports INT8 weight compression and continuous batching, so a single commodity box can serve multiple concurrent clients.

    At a Glance

    Project: aphrodite-engine/aphrodite-engine
    Backend: OpenVINO CPU (INT8 + AVX2/AVX512)
    Recommended Plan: Premium VPS 16GB or VDS 16GB (7B INT8); VDS preferred for production
    OS / Python: Ubuntu 24.04 LTS / Python 3.11 (OpenVINO wheel constraint)
    Reverse Proxy: Nginx with Let's Encrypt + bearer-token API key

    Sizing rules

    Plan for at least 1.5x the model size in RAM. A 7B INT8 model is ~7GB of weights, so a 16GB plan leaves roughly 4–6GB for KV cache on top of the OS and runtime overhead. Prefer VDS over Premium VPS for production: Aphrodite's continuous-batching loop is sensitive to noisy neighbors, and dedicated cores produce far more predictable latency.

    1

    Verify CPU Instruction Set Support

    OpenVINO needs AVX2 at minimum and is much faster with AVX512. RamNode Premium VPS and VDS run on modern Intel Xeon hardware, so AVX2 is essentially guaranteed, but verify before installing:

    Check feature flags
    grep -om1 avx2 /proc/cpuinfo
    grep -om1 'avx512[a-z]*' /proc/cpuinfo
    2

    Initial Server Hardening

    Sudo user + SSH lockdown
    adduser aphrodite
    usermod -aG sudo aphrodite
    
    mkdir -p /home/aphrodite/.ssh
    cp ~/.ssh/authorized_keys /home/aphrodite/.ssh/
    chown -R aphrodite:aphrodite /home/aphrodite/.ssh
    chmod 700 /home/aphrodite/.ssh
    chmod 600 /home/aphrodite/.ssh/authorized_keys
    
    # Ubuntu cloud images may also set these in /etc/ssh/sshd_config.d/ drop-ins; audit those too
    sed -i 's/^#*PasswordAuthentication.*/PasswordAuthentication no/' /etc/ssh/sshd_config
    sed -i 's/^#*PermitRootLogin.*/PermitRootLogin no/' /etc/ssh/sshd_config
    sshd -t && systemctl restart ssh
    
    ufw allow OpenSSH
    ufw allow 80/tcp
    ufw allow 443/tcp
    ufw --force enable

    Aphrodite's default 2242 stays closed externally — the API will sit behind Nginx on 443.

    3

    System Dependencies

    OpenVINO needs Python 3.9–3.11; install 3.11 explicitly. Aphrodite requires gcc/g++ ≥ 12.3.0:

    Toolchain + Python 3.11
    sudo apt-get update
    sudo apt-get install -y \
      python3.11 python3.11-venv python3.11-dev \
      build-essential gcc-12 g++-12 \
      libnuma-dev libtcmalloc-minimal4 \
      git wget curl nginx certbot python3-certbot-nginx
    
    sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 \
      --slave /usr/bin/g++ g++ /usr/bin/g++-12
    8GB swap (essential for safe model load)
    sudo fallocate -l 8G /swapfile
    sudo chmod 600 /swapfile
    sudo mkswap /swapfile && sudo swapon /swapfile
    echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
    
    echo 'vm.swappiness=10' | sudo tee -a /etc/sysctl.conf
    sudo sysctl -p
    4

    Create the Python Environment

    Dedicated venv — never install Aphrodite system-wide, and do not use system Python 3.12 on Ubuntu 24.04 (OpenVINO wheel availability lags behind):

    venv
    mkdir -p ~/aphrodite && cd ~/aphrodite
    
    python3.11 -m venv venv
    source venv/bin/activate
    
    pip install -U pip wheel packaging ninja "setuptools>=49.4.0" numpy
    5

    Install Aphrodite with OpenVINO Backend

    The OpenVINO backend is built from source — upstream wheels target NVIDIA. Build against the CPU PyTorch index:

    Clone + build
    cd ~/aphrodite
    git clone https://github.com/aphrodite-engine/aphrodite-engine.git
    cd aphrodite-engine
    
    pip install -r requirements/build.txt \
      --extra-index-url https://download.pytorch.org/whl/cpu
    
    PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu" \
      APHRODITE_TARGET_DEVICE=openvino \
      pip install -e .

    Compilation takes 10–30 minutes depending on the plan. If Ninja workers get killed (out of memory), drop parallelism with MAX_JOBS ≈ floor(RAM_GB / 4):

    Lower parallelism if OOM
    export MAX_JOBS=2
    PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu" \
      APHRODITE_TARGET_DEVICE=openvino \
      pip install -e .
    6

    Pick a Model

    Path A — let Aphrodite quantize on load: point at any FP16 HF model and enable APHRODITE_OPENVINO_ENABLE_QUANTIZED_WEIGHTS=ON. Simplest start.

    Path B — pre-export to OpenVINO IR with optimum-cli for smaller, faster-loading artifacts:

    optimum-cli export
    pip install huggingface_hub optimum[openvino]
    huggingface-cli login   # for gated models like Llama
    
    optimum-cli export openvino \
      --model meta-llama/Llama-3.2-3B-Instruct \
      --weight-format int8 \
      ~/aphrodite/models/Llama-3.2-3B-Instruct-ov-int8

    Good fits for 16GB plans: Llama-3.2-3B-Instruct, Phi-3.5-mini-instruct, Qwen2.5-7B-Instruct INT8, Mistral-7B-Instruct-v0.3 INT8. Skip 13B+ on a 16GB box unless you can live with a tiny KV cache.

    7

    Launch the API Server

    Interactive run
    cd ~/aphrodite
    source venv/bin/activate
    
    export APHRODITE_OPENVINO_KVCACHE_SPACE=4
    export APHRODITE_OPENVINO_ENABLE_QUANTIZED_WEIGHTS=ON
    
    aphrodite run meta-llama/Llama-3.2-3B-Instruct \
      --host 127.0.0.1 \
      --port 2242 \
      --enable-chunked-prefill \
      --max-num-batched-tokens 256 \
      --max-model-len 4096 \
      --api-keys "sk-yourlongrandomkeyhere"

    Generate a real key with openssl rand -hex 32 — do not skip --api-keys even on a private VPS. --max-num-batched-tokens 256 is the OpenVINO-recommended chunked-prefill batch size.

    Smoke test
    curl http://127.0.0.1:2242/v1/chat/completions \
      -H "Authorization: Bearer sk-yourlongrandomkeyhere" \
      -H "Content-Type: application/json" \
      -d '{
        "model": "meta-llama/Llama-3.2-3B-Instruct",
        "messages": [{"role": "user", "content": "Write one sentence about RamNode."}],
        "max_tokens": 50
      }'
    8

    Run as a systemd Service

    API key file
    echo 'APHRODITE_API_KEY=sk-yourlongrandomkeyhere' > ~/aphrodite/.env
    chmod 600 ~/aphrodite/.env
    /etc/systemd/system/aphrodite.service
    [Unit]
    Description=Aphrodite Engine LLM API Server
    After=network-online.target
    Wants=network-online.target
    
    [Service]
    Type=simple
    User=aphrodite
    Group=aphrodite
    WorkingDirectory=/home/aphrodite/aphrodite
    
    # Performance tuning
    Environment="LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4"
    Environment="APHRODITE_OPENVINO_KVCACHE_SPACE=4"
    Environment="APHRODITE_OPENVINO_ENABLE_QUANTIZED_WEIGHTS=ON"
    Environment="HF_HOME=/home/aphrodite/aphrodite/hf_cache"
    
    EnvironmentFile=/home/aphrodite/aphrodite/.env
    
    ExecStart=/home/aphrodite/aphrodite/venv/bin/aphrodite run meta-llama/Llama-3.2-3B-Instruct \
      --host 127.0.0.1 \
      --port 2242 \
      --enable-chunked-prefill \
      --max-num-batched-tokens 256 \
      --max-model-len 4096 \
      --api-keys ${APHRODITE_API_KEY}
    
    Restart=on-failure
    RestartSec=10
    LimitNOFILE=65536
    
    [Install]
    WantedBy=multi-user.target

    The LD_PRELOAD line forces tcmalloc, which the Aphrodite docs strongly recommend for the CPU backend. Verify the path with find / -name '*libtcmalloc*' 2>/dev/null as it occasionally moves between Ubuntu releases.

    Enable + watch
    sudo systemctl daemon-reload
    sudo systemctl enable aphrodite
    sudo systemctl start aphrodite
    sudo journalctl -u aphrodite -f
    9

    Nginx Reverse Proxy + TLS

    /etc/nginx/sites-available/aphrodite
    server {
        listen 80;
        server_name api.yourdomain.com;
        location /.well-known/acme-challenge/ { root /var/www/html; }
        location / { return 301 https://$host$request_uri; }
    }
    
    server {
        listen 443 ssl http2;
        server_name api.yourdomain.com;
    
        # certbot will fill in ssl_certificate / ssl_certificate_key
    
        proxy_read_timeout 600s;
        proxy_send_timeout 600s;
        proxy_connect_timeout 60s;
    
        proxy_buffering off;
        proxy_cache off;
    
        client_max_body_size 10M;
    
        location / {
            proxy_pass http://127.0.0.1:2242;
            proxy_http_version 1.1;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;
    
            # Required for streaming chat completions
            proxy_set_header Connection "";
        }
    }
    Enable + cert
    sudo ln -s /etc/nginx/sites-available/aphrodite /etc/nginx/sites-enabled/
    sudo rm /etc/nginx/sites-enabled/default
    sudo nginx -t
    sudo certbot --nginx -d api.yourdomain.com
    
    curl https://api.yourdomain.com/v1/models \
      -H "Authorization: Bearer sk-yourlongrandomkeyhere"
    10

    Performance Tuning

    • KV cache size: APHRODITE_OPENVINO_KVCACHE_SPACE (GB). Start at 4 on a 16GB box; raise only if no OOMs and concurrency is bottlenecked.
    • KV cache precision: APHRODITE_OPENVINO_CPU_KV_CACHE_PRECISION=u8 roughly doubles cached tokens per GB at a small quality cost most apps cannot detect.
    • OpenMP thread isolation: on shared-core Premium VPS, pin Aphrodite to a subset of cores with OMP_NUM_THREADS + taskset -c 0-3. On dedicated-core VDS, just let OpenMP use all cores.
    • Chunked prefill: --max-num-batched-tokens 256 is the OpenVINO default. For many short prompts, try 512 and benchmark.
    11

    Monitoring

    Prometheus metrics
    curl https://api.yourdomain.com/metrics \
      -H "Authorization: Bearer sk-yourlongrandomkeyhere" | head -50

    Key series to watch:

    • aphrodite:num_requests_running vs aphrodite:num_requests_waiting: waiting consistently above running means the box is CPU-bound.
    • aphrodite:gpu_cache_usage_perc: the name says GPU, but it reflects the OpenVINO KV cache; near 1.0 means raise KVCACHE_SPACE.
    • time_to_first_token_seconds and time_per_output_token_seconds: the per-request latency picture.

    OpenVINO backend limitations

    • No LoRA, vision/embedding models, or tensor/pipeline parallelism. Plan separately (llama.cpp, sentence-transformers) or move to GPU.
    • Modest CPU throughput: expect single-digit tokens/sec per concurrent request on a 7B INT8 model on a 16GB VPS, with aggregate climbing as continuous batching engages. Fine for chat, agents, lightly trafficked features. Not enough for real-time voice or high-concurrency consumer chat.
    • Significant load time: a 7B INT8 model takes 2–5 minutes to load on a Premium VPS 16GB. Keep the service running rather than starting it on demand; scale up over scaling out.