Deploy Aphrodite Engine on a VPS
A high-throughput OpenAI-compatible LLM server built on PagedAttention. The OpenVINO backend delivers the best CPU throughput, supports INT8 weight compression, and gives you continuous batching for serving multiple concurrent clients on commodity hardware.
At a Glance
| Project | aphrodite-engine/aphrodite-engine |
| Backend | OpenVINO CPU (INT8 + AVX2/AVX512) |
| Recommended Plan | Premium VPS 16GB or VDS 16GB (7B INT8); VDS preferred for production |
| OS / Python | Ubuntu 24.04 LTS / Python 3.11 (OpenVINO wheel constraint) |
| Reverse Proxy | Nginx with Let's Encrypt + bearer-token API key |
Sizing rules
Plan for at least 1.5x model size in RAM. A 7B INT8 model is ~7GB of weights, so 16GB total gives room for 6–8GB of KV cache + OS + overhead. Prefer VDS over Premium VPS for production — Aphrodite's continuous-batching loop is sensitive to noisy neighbors, and dedicated cores produce far more predictable latency.
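As a rough sketch of the arithmetic above, the 1.5x rule and the leftover headroom can be worked out in whole gigabytes. The function names are illustrative and not part of Aphrodite, and the 13 GB figure for a 13B INT8 model is an assumption that scales the 7B estimate at about one byte per parameter.

```shell
# Hypothetical helpers for the sizing rule; all inputs are whole GB.

# Minimum RAM under the 1.5x rule, rounded up: ceil(weights * 1.5)
min_ram_gb() {
  echo $(( ($1 * 3 + 1) / 2 ))
}

# RAM left for KV cache, OS, and overhead once the weights are loaded
headroom_gb() {
  echo $(( $1 - $2 ))
}

min_ram_gb 7       # 7B INT8, ~7 GB of weights
headroom_gb 16 7   # what a 16 GB plan leaves after the weights
min_ram_gb 13      # 13B INT8, assumed ~13 GB of weights
```

For the 7B case this prints 11 and 9: a 16GB plan clears the 1.5x minimum, and the 9 GB remainder is what the KV cache, OS, and overhead have to share. The 13B case prints 20, which is more than a 16GB plan offers.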
Verify CPU Instruction Set Support
OpenVINO needs AVX2 minimum and is much faster on AVX512. RamNode Premium VPS and VDS run on modern Intel Xeon hardware so AVX2 is essentially guaranteed, but verify before installing:
grep -o 'avx2' /proc/cpuinfo | head -1
grep -o 'avx512' /proc/cpuinfo | head -1
Initial Server Hardening
adduser aphrodite
usermod -aG sudo aphrodite
mkdir -p /home/aphrodite/.ssh
cp ~/.ssh/authorized_keys /home/aphrodite/.ssh/
chown -R aphrodite:aphrodite /home/aphrodite/.ssh
chmod 700 /home/aphrodite/.ssh
chmod 600 /home/aphrodite/.ssh/authorized_keys
sed -i 's/^#*PasswordAuthentication.*/PasswordAuthentication no/' /etc/ssh/sshd_config
sed -i 's/^#*PermitRootLogin.*/PermitRootLogin no/' /etc/ssh/sshd_config
systemctl restart ssh
ufw allow OpenSSH
ufw allow 80/tcp
ufw allow 443/tcp
ufw --force enable
Aphrodite's default port 2242 stays closed externally; the API will sit behind Nginx on 443.
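The sed edits only touch /etc/ssh/sshd_config, and on some images a drop-in under /etc/ssh/sshd_config.d/ may still override them. A minimal sketch for checking the effective settings, assuming the lowercase "key value" output format of sshd -T (the function name is illustrative):

```shell
# Reads `sshd -T` output on stdin; succeeds only if both settings are "no".
check_sshd_hardening() {
  awk '
    $1 == "passwordauthentication" && $2 == "no" { pw = 1 }
    $1 == "permitrootlogin"        && $2 == "no" { root = 1 }
    END { exit !(pw && root) }
  '
}

# Usage on the server (sshd -T needs root to read the host keys):
#   sudo sshd -T | check_sshd_hardening && echo "ssh hardened"
```

Run it from a second terminal before closing the current root session, so a mistake in the SSH setup does not lock you out.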
System Dependencies
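OpenVINO needs Python 3.9–3.11; install 3.11 explicitly. Ubuntu 24.04 ships Python 3.12 in its default repositories, so python3.11 has to come from an extra source such as the deadsnakes PPA. Aphrodite requires gcc/g++ ≥ 12.3.0:
sudo apt-get update
sudo apt-get install -y software-properties-common
sudo add-apt-repository -y ppa:deadsnakes/ppa
sudo apt-get update
sudo apt-get install -y \
python3.11 python3.11-venv python3.11-dev \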
build-essential gcc-12 g++-12 \
libnuma-dev libtcmalloc-minimal4 \
git wget curl nginx certbot python3-certbot-nginx
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 \
--slave /usr/bin/g++ g++ /usr/bin/g++-12
# Create and enable an 8 GB swap file
sudo fallocate -l 8G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile && sudo swapon /swapfile
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
echo 'vm.swappiness=10' | sudo tee -a /etc/sysctl.conf
sudo sysctl -p
Create the Python Environment
Dedicated venv — never install Aphrodite system-wide, and do not use system Python 3.12 on Ubuntu 24.04 (OpenVINO wheel availability lags behind):
mkdir -p ~/aphrodite && cd ~/aphrodite
python3.11 -m venv venv
source venv/bin/activate
pip install -U pip wheel packaging ninja "setuptools>=49.4.0" numpy
Install Aphrodite with OpenVINO Backend
The OpenVINO backend is built from source — upstream wheels target NVIDIA. Build against the CPU PyTorch index:
cd ~/aphrodite
git clone https://github.com/aphrodite-engine/aphrodite-engine.git
cd aphrodite-engine
pip install -r requirements/build.txt \
--extra-index-url https://download.pytorch.org/whl/cpu
PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu" \
APHRODITE_TARGET_DEVICE=openvino \
pip install -e .
Compilation takes 10–30 minutes depending on the plan. If Ninja workers get killed (out of memory), drop parallelism with MAX_JOBS ≈ floor(RAM_GB / 4):
export MAX_JOBS=2
PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu" \
APHRODITE_TARGET_DEVICE=openvino \
pip install -e .
Pick a Model
Path A — let Aphrodite quantize on load: point at any FP16 HF model and enable APHRODITE_OPENVINO_ENABLE_QUANTIZED_WEIGHTS=ON. Simplest start.
Path B — pre-export to OpenVINO IR with optimum-cli for smaller, faster-loading artifacts:
pip install huggingface_hub "optimum[openvino]"
huggingface-cli login # for gated models like Llama
optimum-cli export openvino \
--model meta-llama/Llama-3.2-3B-Instruct \
--weight-format int8 \
~/aphrodite/models/Llama-3.2-3B-Instruct-ov-int8
Good fits for 16GB plans: Llama-3.2-3B-Instruct, Phi-3.5-mini-instruct, Qwen2.5-7B-Instruct INT8, Mistral-7B-Instruct-v0.3 INT8. Skip 13B+ on a 16GB box unless you can live with a tiny KV cache.
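After a Path B export it can help to confirm the output directory is complete before pointing anything at it. The sketch below assumes the export writes openvino_model.xml and openvino_model.bin, which is what optimum-cli typically produces for text-generation models. Adjust the names if your export differs. The function name is illustrative.

```shell
# Succeeds if DIR looks like a complete OpenVINO IR export.
# Assumed file names: openvino_model.xml (graph) and openvino_model.bin (weights).
check_ov_export() {
  for f in openvino_model.xml openvino_model.bin; do
    if [ ! -s "$1/$f" ]; then
      echo "missing or empty: $1/$f" >&2
      return 1
    fi
  done
  echo "ok: $1"
}

# Usage:
#   check_ov_export ~/aphrodite/models/Llama-3.2-3B-Instruct-ov-int8
```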
Launch the API Server
cd ~/aphrodite
source venv/bin/activate
export APHRODITE_OPENVINO_KVCACHE_SPACE=4
export APHRODITE_OPENVINO_ENABLE_QUANTIZED_WEIGHTS=ON
aphrodite run meta-llama/Llama-3.2-3B-Instruct \
--host 127.0.0.1 \
--port 2242 \
--enable-chunked-prefill \
--max-num-batched-tokens 256 \
--max-model-len 4096 \
--api-keys "sk-yourlongrandomkeyhere"
Generate a real key with openssl rand -hex 32, and do not skip --api-keys even on a private VPS. --max-num-batched-tokens 256 is the OpenVINO-recommended chunked-prefill batch size.
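One way to produce such a key, as a sketch: the sk- prefix simply mirrors the placeholder used in this guide and is not known to be required by the server. The od fallback is an assumption for hosts without the openssl CLI.

```shell
# Prints one key: "sk-" followed by 64 hex characters (32 random bytes).
gen_api_key() {
  hex=$(openssl rand -hex 32 2>/dev/null) ||
    hex=$(od -An -N32 -tx1 /dev/urandom | tr -d ' \n')
  printf 'sk-%s\n' "$hex"
}

# Usage:
#   API_KEY=$(gen_api_key)
#   aphrodite run ... --api-keys "$API_KEY"
```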
curl http://127.0.0.1:2242/v1/chat/completions \
-H "Authorization: Bearer sk-yourlongrandomkeyhere" \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.2-3B-Instruct",
"messages": [{"role": "user", "content": "Write one sentence about RamNode."}],
"max_tokens": 50
}'
Run as a systemd Service
echo 'APHRODITE_API_KEY=sk-yourlongrandomkeyhere' > ~/aphrodite/.env
chmod 600 ~/aphrodite/.env
Save the following unit as /etc/systemd/system/aphrodite.service:
[Unit]
Description=Aphrodite Engine LLM API Server
After=network-online.target
Wants=network-online.target
[Service]
Type=simple
User=aphrodite
Group=aphrodite
WorkingDirectory=/home/aphrodite/aphrodite
# Performance tuning
Environment="LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4"
Environment="APHRODITE_OPENVINO_KVCACHE_SPACE=4"
Environment="APHRODITE_OPENVINO_ENABLE_QUANTIZED_WEIGHTS=ON"
Environment="HF_HOME=/home/aphrodite/aphrodite/hf_cache"
EnvironmentFile=/home/aphrodite/aphrodite/.env
ExecStart=/home/aphrodite/aphrodite/venv/bin/aphrodite run meta-llama/Llama-3.2-3B-Instruct \
--host 127.0.0.1 \
--port 2242 \
--enable-chunked-prefill \
--max-num-batched-tokens 256 \
--max-model-len 4096 \
--api-keys ${APHRODITE_API_KEY}
Restart=on-failure
RestartSec=10
LimitNOFILE=65536
[Install]
WantedBy=multi-user.target
The LD_PRELOAD line forces tcmalloc, which the Aphrodite docs strongly recommend for the CPU backend. Verify the path with find / -name '*libtcmalloc*' 2>/dev/null, as it occasionally moves between Ubuntu releases.
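A narrower version of that search, sketched as a helper that prints a single candidate path. The function name and the /usr/lib default are assumptions for illustration.

```shell
# Prints the first libtcmalloc_minimal shared object found under ROOT
# (default /usr/lib); returns non-zero if there is none.
find_tcmalloc() {
  path=$(find "${1:-/usr/lib}" -name 'libtcmalloc_minimal.so*' 2>/dev/null | sort | head -n 1)
  [ -n "$path" ] && echo "$path"
}

# Usage:
#   find_tcmalloc   # compare the result with the LD_PRELOAD line in the unit
```

If the printed path differs from the one in the unit file, update the Environment="LD_PRELOAD=..." line before starting the service.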
sudo systemctl daemon-reload
sudo systemctl enable aphrodite
sudo systemctl start aphrodite
sudo journalctl -u aphrodite -f
Nginx Reverse Proxy + TLS
server {
listen 80;
server_name api.yourdomain.com;
location /.well-known/acme-challenge/ { root /var/www/html; }
location / { return 301 https://$host$request_uri; }
}
server {
listen 443 ssl http2;
server_name api.yourdomain.com;
# certbot will fill in ssl_certificate / ssl_certificate_key
proxy_read_timeout 600s;
proxy_send_timeout 600s;
proxy_connect_timeout 60s;
proxy_buffering off;
proxy_cache off;
client_max_body_size 10M;
location / {
proxy_pass http://127.0.0.1:2242;
proxy_http_version 1.1;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
# Required for streaming chat completions
proxy_set_header Connection "";
}
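}
Save the config as /etc/nginx/sites-available/aphrodite, then enable it and request a certificate:
sudo ln -s /etc/nginx/sites-available/aphrodite /etc/nginx/sites-enabled/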
sudo rm /etc/nginx/sites-enabled/default
sudo nginx -t
sudo certbot --nginx -d api.yourdomain.com
curl https://api.yourdomain.com/v1/models \
-H "Authorization: Bearer sk-yourlongrandomkeyhere"
Performance Tuning
- KV cache size: APHRODITE_OPENVINO_KVCACHE_SPACE (GB). Start at 4 on a 16GB box; raise only if no OOMs and concurrency is bottlenecked.
- KV cache precision: APHRODITE_OPENVINO_CPU_KV_CACHE_PRECISION=u8 roughly doubles cached tokens per GB at a small quality cost most apps cannot detect.
- OpenMP thread isolation: on shared-core Premium VPS, pin Aphrodite to a subset of cores with OMP_NUM_THREADS + taskset -c 0-3. On dedicated-core VDS, just let OpenMP use all cores.
- Chunked prefill: --max-num-batched-tokens 256 is the OpenVINO default. For many short prompts, try 512 and benchmark.
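Put together, the tuning knobs above might look like the following for a shared-core plan. This is a sketch with example values only. core_range is a hypothetical helper, and the final line prints the pinned command rather than running it.

```shell
# Hypothetical helper: CPU list covering the first N cores, e.g. 4 -> 0-3.
core_range() {
  echo "0-$(( $1 - 1 ))"
}

# Example values for a shared-core plan; tune them for your own box.
export OMP_NUM_THREADS=4
export APHRODITE_OPENVINO_KVCACHE_SPACE=4
export APHRODITE_OPENVINO_CPU_KV_CACHE_PRECISION=u8

# Show the pinned launch command (replace "..." with your usual arguments).
echo "taskset -c $(core_range "$OMP_NUM_THREADS") aphrodite run ..."
```

When the server runs under systemd, the same pinning can likely be expressed in the [Service] section with CPUAffinity=0-3 and an extra Environment="OMP_NUM_THREADS=4" line instead of wrapping ExecStart in taskset.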
Monitoring
curl https://api.yourdomain.com/metrics \
-H "Authorization: Bearer sk-yourlongrandomkeyhere" | head -50
Watch aphrodite:num_requests_running vs aphrodite:num_requests_waiting (waiting consistently above running means the server is CPU-bound), aphrodite:gpu_cache_usage_perc (the metric is named for GPU but reflects the OpenVINO KV cache; near 1.0 means raise APHRODITE_OPENVINO_KVCACHE_SPACE), and time_to_first_token_seconds / time_per_output_token_seconds.
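To turn those three gauges into a quick health line, the metrics output can be piped through a small filter. This sketch assumes the usual Prometheus text format ("name{labels} value" with the value as the last field). The function name and the 0.95 threshold are illustrative choices, not Aphrodite defaults.

```shell
# Reads Prometheus text metrics on stdin and prints a one-line summary
# plus hints when the queue backs up or the KV cache is nearly full.
summarize_metrics() {
  awk '
    /^aphrodite:num_requests_running/ { running = $NF }
    /^aphrodite:num_requests_waiting/ { waiting = $NF }
    /^aphrodite:gpu_cache_usage_perc/ { cache = $NF }
    END {
      printf "running=%d waiting=%d kv_cache=%.2f\n", running, waiting, cache
      if (waiting + 0 > running + 0) print "hint: queue is backing up (likely CPU-bound)"
      if (cache + 0 >= 0.95) print "hint: KV cache nearly full"
    }
  '
}

# Usage:
#   curl -s https://api.yourdomain.com/metrics \
#     -H "Authorization: Bearer $API_KEY" | summarize_metrics
```

A single reading is noisy, so sample it a few times under real load before changing any settings.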
OpenVINO backend limitations
- No LoRA, vision/embedding models, or tensor/pipeline parallelism. Plan separately (llama.cpp, sentence-transformers) or move to GPU.
- Modest CPU throughput: expect single-digit tokens/sec per concurrent request on a 7B INT8 model on a 16GB VPS, with aggregate climbing as continuous batching engages. Fine for chat, agents, lightly trafficked features. Not enough for real-time voice or high-concurrency consumer chat.
- Significant load time: a 7B INT8 model takes 2–5 minutes to load on a Premium VPS 16GB. Keep the service running rather than starting it on demand; scale up over scaling out.
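Because of that load time, deploy scripts and health checks should wait for the API instead of assuming it is up right after a restart. A generic retry helper is sketched below. The function name and the 600-second budget in the usage example are assumptions, and the probe reuses the /v1/models endpoint from the Nginx test.

```shell
# Retries a command every INTERVAL seconds until it succeeds or TIMEOUT
# seconds have passed. Usage: wait_until TIMEOUT INTERVAL COMMAND [ARGS...]
wait_until() {
  timeout_s=$1; interval_s=$2; shift 2
  waited=0
  until "$@"; do
    waited=$(( waited + interval_s ))
    [ "$waited" -gt "$timeout_s" ] && return 1
    sleep "$interval_s"
  done
}

# Usage after `systemctl restart aphrodite`:
#   wait_until 600 10 curl -sf -o /dev/null \
#     -H "Authorization: Bearer $API_KEY" http://127.0.0.1:2242/v1/models \
#     && echo "model loaded"
```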
