OpenVINO Backend
    OpenAI-Compatible

    Deploy Aphrodite Engine on a VPS

    A high-throughput OpenAI-compatible LLM server built on PagedAttention. The OpenVINO backend is the fastest CPU path Aphrodite offers: it supports INT8 weight compression and continuous batching, so a single commodity box can serve multiple concurrent clients.

    At a Glance

    Project: aphrodite-engine/aphrodite-engine
    Backend: OpenVINO CPU (INT8 + AVX2/AVX512)
    Recommended Plan: Premium VPS 16GB or VDS 16GB (7B INT8); VDS preferred for production
    OS / Python: Ubuntu 24.04 LTS / Python 3.11 (OpenVINO wheel constraint)
    Reverse Proxy: Nginx with Let's Encrypt + bearer-token API key

    Sizing rules

    Plan for at least 1.5x the model size in RAM. A 7B INT8 model is ~7GB of weights, so a 16GB plan leaves roughly 4–6GB for KV cache on top of the OS and runtime overhead. Prefer VDS over Premium VPS for production: Aphrodite's continuous-batching loop is sensitive to noisy neighbors, and dedicated cores produce far more predictable latency.

    1

    Verify CPU Instruction Set Support

    OpenVINO needs AVX2 at minimum and is much faster with AVX512. RamNode Premium VPS and VDS run on modern Intel Xeon hardware, so AVX2 is essentially guaranteed, but verify before installing:

    Check feature flags
    grep -om1 avx2 /proc/cpuinfo
    grep -om1 'avx512[a-z]*' /proc/cpuinfo
    2

    Initial Server Hardening

    Sudo user + SSH lockdown
    adduser aphrodite
    usermod -aG sudo aphrodite
    
    mkdir -p /home/aphrodite/.ssh
    cp ~/.ssh/authorized_keys /home/aphrodite/.ssh/
    chown -R aphrodite:aphrodite /home/aphrodite/.ssh
    chmod 700 /home/aphrodite/.ssh
    chmod 600 /home/aphrodite/.ssh/authorized_keys
    
    # Ubuntu cloud images may also set these in /etc/ssh/sshd_config.d/ drop-ins; audit those too
    sed -i 's/^#*PasswordAuthentication.*/PasswordAuthentication no/' /etc/ssh/sshd_config
    sed -i 's/^#*PermitRootLogin.*/PermitRootLogin no/' /etc/ssh/sshd_config
    sshd -t && systemctl restart ssh
    
    ufw allow OpenSSH
    ufw allow 80/tcp
    ufw allow 443/tcp
    ufw --force enable

    Aphrodite's default 2242 stays closed externally — the API will sit behind Nginx on 443.

    3

    System Dependencies

    OpenVINO needs Python 3.9–3.11; install 3.11 explicitly. Aphrodite requires gcc/g++ ≥ 12.3.0:

    Toolchain + Python 3.11
    sudo apt-get update
    sudo apt-get install -y \
      python3.11 python3.11-venv python3.11-dev \
      build-essential gcc-12 g++-12 \
      libnuma-dev libtcmalloc-minimal4 \
      git wget curl nginx certbot python3-certbot-nginx
    
    sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 \
      --slave /usr/bin/g++ g++ /usr/bin/g++-12
    8GB swap (essential for safe model load)
    sudo fallocate -l 8G /swapfile
    sudo chmod 600 /swapfile
    sudo mkswap /swapfile && sudo swapon /swapfile
    echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
    
    echo 'vm.swappiness=10' | sudo tee -a /etc/sysctl.conf
    sudo sysctl -p
    4

    Create the Python Environment

    Dedicated venv — never install Aphrodite system-wide, and do not use system Python 3.12 on Ubuntu 24.04 (OpenVINO wheel availability lags behind):

    venv
    mkdir -p ~/aphrodite && cd ~/aphrodite
    
    python3.11 -m venv venv
    source venv/bin/activate
    
    pip install -U pip wheel packaging ninja "setuptools>=49.4.0" numpy
    5

    Install Aphrodite with OpenVINO Backend

    The OpenVINO backend is built from source — upstream wheels target NVIDIA. Build against the CPU PyTorch index:

    Clone + build
    cd ~/aphrodite
    git clone https://github.com/aphrodite-engine/aphrodite-engine.git
    cd aphrodite-engine
    
    pip install -r requirements/build.txt \
      --extra-index-url https://download.pytorch.org/whl/cpu
    
    PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu" \
      APHRODITE_TARGET_DEVICE=openvino \
      pip install -e .

    Compilation takes 10–30 minutes depending on the plan. If Ninja workers get killed (out of memory), drop parallelism with MAX_JOBS ≈ floor(RAM_GB / 4):

    Lower parallelism if OOM
    export MAX_JOBS=2
    PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu" \
      APHRODITE_TARGET_DEVICE=openvino \
      pip install -e .
    6

    Pick a Model

    Path A — let Aphrodite quantize on load: point at any FP16 HF model and enable APHRODITE_OPENVINO_ENABLE_QUANTIZED_WEIGHTS=ON. Simplest start.

    Path B — pre-export to OpenVINO IR with optimum-cli for smaller, faster-loading artifacts:

    optimum-cli export
    pip install huggingface_hub optimum[openvino]
    huggingface-cli login   # for gated models like Llama
    
    optimum-cli export openvino \
      --model meta-llama/Llama-3.2-3B-Instruct \
      --weight-format int8 \
      ~/aphrodite/models/Llama-3.2-3B-Instruct-ov-int8

    Good fits for 16GB plans: Llama-3.2-3B-Instruct, Phi-3.5-mini-instruct, Qwen2.5-7B-Instruct INT8, Mistral-7B-Instruct-v0.3 INT8. Skip 13B+ on a 16GB box unless you can live with a tiny KV cache.

    7

    Launch the API Server

    Interactive run
    cd ~/aphrodite
    source venv/bin/activate
    
    export APHRODITE_OPENVINO_KVCACHE_SPACE=4
    export APHRODITE_OPENVINO_ENABLE_QUANTIZED_WEIGHTS=ON
    
    aphrodite run meta-llama/Llama-3.2-3B-Instruct \
      --host 127.0.0.1 \
      --port 2242 \
      --enable-chunked-prefill \
      --max-num-batched-tokens 256 \
      --max-model-len 4096 \
      --api-keys "sk-yourlongrandomkeyhere"

    Generate a real key with openssl rand -hex 32 — do not skip --api-keys even on a private VPS. --max-num-batched-tokens 256 is the OpenVINO-recommended chunked-prefill batch size.

    Smoke test
    curl http://127.0.0.1:2242/v1/chat/completions \
      -H "Authorization: Bearer sk-yourlongrandomkeyhere" \
      -H "Content-Type: application/json" \
      -d '{
        "model": "meta-llama/Llama-3.2-3B-Instruct",
        "messages": [{"role": "user", "content": "Write one sentence about RamNode."}],
        "max_tokens": 50
      }'
    8

    Run as a systemd Service

    API key file
    echo 'APHRODITE_API_KEY=sk-yourlongrandomkeyhere' > ~/aphrodite/.env
    chmod 600 ~/aphrodite/.env
    /etc/systemd/system/aphrodite.service
    [Unit]
    Description=Aphrodite Engine LLM API Server
    After=network-online.target
    Wants=network-online.target
    
    [Service]
    Type=simple
    User=aphrodite
    Group=aphrodite
    WorkingDirectory=/home/aphrodite/aphrodite
    
    # Performance tuning
    Environment="LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4"
    Environment="APHRODITE_OPENVINO_KVCACHE_SPACE=4"
    Environment="APHRODITE_OPENVINO_ENABLE_QUANTIZED_WEIGHTS=ON"
    Environment="HF_HOME=/home/aphrodite/aphrodite/hf_cache"
    
    EnvironmentFile=/home/aphrodite/aphrodite/.env
    
    ExecStart=/home/aphrodite/aphrodite/venv/bin/aphrodite run meta-llama/Llama-3.2-3B-Instruct \
      --host 127.0.0.1 \
      --port 2242 \
      --enable-chunked-prefill \
      --max-num-batched-tokens 256 \
      --max-model-len 4096 \
      --api-keys ${APHRODITE_API_KEY}
    
    Restart=on-failure
    RestartSec=10
    LimitNOFILE=65536
    
    [Install]
    WantedBy=multi-user.target

    The LD_PRELOAD line forces tcmalloc, which the Aphrodite docs strongly recommend for the CPU backend. Verify the path with find / -name '*libtcmalloc*' 2>/dev/null as it occasionally moves between Ubuntu releases.

    Enable + watch
    sudo systemctl daemon-reload
    sudo systemctl enable aphrodite
    sudo systemctl start aphrodite
    sudo journalctl -u aphrodite -f
    9

    Nginx Reverse Proxy + TLS

    /etc/nginx/sites-available/aphrodite
    server {
        listen 80;
        server_name api.yourdomain.com;
        location /.well-known/acme-challenge/ { root /var/www/html; }
        location / { return 301 https://$host$request_uri; }
    }
    
    server {
        listen 443 ssl http2;
        server_name api.yourdomain.com;
    
        # certbot will fill in ssl_certificate / ssl_certificate_key
    
        proxy_read_timeout 600s;
        proxy_send_timeout 600s;
        proxy_connect_timeout 60s;
    
        proxy_buffering off;
        proxy_cache off;
    
        client_max_body_size 10M;
    
        location / {
            proxy_pass http://127.0.0.1:2242;
            proxy_http_version 1.1;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;
    
            # Required for streaming chat completions
            proxy_set_header Connection "";
        }
    }
    Enable + cert
    sudo ln -s /etc/nginx/sites-available/aphrodite /etc/nginx/sites-enabled/
    sudo rm /etc/nginx/sites-enabled/default
    sudo nginx -t
    sudo certbot --nginx -d api.yourdomain.com
    
    curl https://api.yourdomain.com/v1/models \
      -H "Authorization: Bearer sk-yourlongrandomkeyhere"
    10

    Performance Tuning

    • KV cache size: APHRODITE_OPENVINO_KVCACHE_SPACE (GB). Start at 4 on a 16GB box; raise only if no OOMs and concurrency is bottlenecked.
    • KV cache precision: APHRODITE_OPENVINO_CPU_KV_CACHE_PRECISION=u8 roughly doubles cached tokens per GB at a small quality cost most apps cannot detect.
    • OpenMP thread isolation: on shared-core Premium VPS, pin Aphrodite to a subset of cores with OMP_NUM_THREADS + taskset -c 0-3. On dedicated-core VDS, just let OpenMP use all cores.
    • Chunked prefill: --max-num-batched-tokens 256 is the OpenVINO default. For many short prompts, try 512 and benchmark.
    11

    Monitoring

    Prometheus metrics
    curl https://api.yourdomain.com/metrics \
      -H "Authorization: Bearer sk-yourlongrandomkeyhere" | head -50

    Key series to watch:

    • aphrodite:num_requests_running vs aphrodite:num_requests_waiting: waiting consistently above running means the box is CPU-bound.
    • aphrodite:gpu_cache_usage_perc: the name says GPU, but it reflects the OpenVINO KV cache; near 1.0 means raise KVCACHE_SPACE.
    • time_to_first_token_seconds and time_per_output_token_seconds: the per-request latency picture.

    OpenVINO backend limitations

    • No LoRA, vision/embedding models, or tensor/pipeline parallelism. Plan separately (llama.cpp, sentence-transformers) or move to GPU.
    • Modest CPU throughput: expect single-digit tokens/sec per concurrent request on a 7B INT8 model on a 16GB VPS, with aggregate climbing as continuous batching engages. Fine for chat, agents, lightly trafficked features. Not enough for real-time voice or high-concurrency consumer chat.
    • Significant load time: a 7B INT8 model takes 2–5 minutes to load on a Premium VPS 16GB. Keep the service running rather than starting it on demand; scale up over scaling out.