CPU Inference
    OpenAI-Compatible

    Deploy SGLang on a VPS

    A high-performance LLM serving framework with a fully supported CPU backend — run Llama 3.2, Qwen 2.5, or Phi-3 on commodity Xeon hardware behind an OpenAI-compatible endpoint, no GPU required.

    At a Glance

    Project: sgl-project/sglang
    Stack: Python 3.12 + PyTorch (CPU) + SGLang SRT
    Recommended Plan: Cloud VPS 8GB (Qwen 2.5 3B INT8); 12GB+ for Llama 3.2 3B BF16
    OS: Ubuntu 24.04 LTS
    Reverse Proxy: Nginx with bearer-token auth + Let's Encrypt

    Sizing rule of thumb

    Allocate model weights + 30 to 50% extra for KV cache, request buffers, and the Python runtime. A 3B BF16 model is ~6GB on disk but wants 10 to 12GB RAM to serve without thrashing. Always pre-download weights before the first server start.
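
    A quick sanity check on that arithmetic (a sketch; the 1.5x overhead multiplier and the ~1.5GB runtime allowance are assumptions taken from the rule above):

    Sizing math (Python)
    # Rough RAM estimate: weights + 30-50% for KV cache and request buffers,
    # plus ~1.5GB for the Python runtime (assumed figures, per the rule above).
    def estimate_ram_gb(params_billion: float, bytes_per_param: float = 2.0,
                        overhead: float = 1.5, runtime_gb: float = 1.5) -> float:
        weights_gb = params_billion * bytes_per_param  # 1e9 params * N bytes ~= N GB
        return weights_gb * overhead + runtime_gb

    print(f"{estimate_ram_gb(3.0):.1f} GB")       # 3B BF16: ~6GB weights -> 10.5GB
    print(f"{estimate_ram_gb(3.0, 1.0):.1f} GB")  # 3B INT8: ~3GB weights -> 6.0GB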

    1. Server Preparation

    System packages
    apt update && apt upgrade -y
    apt install -y build-essential git curl wget htop \
        python3.12 python3.12-venv python3-pip \
        libnuma-dev numactl ca-certificates
    Dedicated user
    useradd -m -s /bin/bash sglang
    usermod -aG sudo sglang
    mkdir -p /home/sglang/.ssh
    cp /root/.ssh/authorized_keys /home/sglang/.ssh/
    chown -R sglang:sglang /home/sglang/.ssh
    chmod 700 /home/sglang/.ssh && chmod 600 /home/sglang/.ssh/authorized_keys

    On under-8GB plans, add a 4GB swap file so HuggingFace's loader does not get OOM-killed during the load phase (swap is not a substitute for RAM during inference itself):

    Swap + swappiness
    fallocate -l 4G /swapfile
    chmod 600 /swapfile && mkswap /swapfile && swapon /swapfile
    echo '/swapfile none swap sw 0 0' >> /etc/fstab
    sysctl vm.swappiness=10
    echo 'vm.swappiness=10' >> /etc/sysctl.conf

    2. Install SGLang in CPU Mode

    Switch to the sglang user, create a venv, then install the CPU build of PyTorch first. This is the single most common pitfall — if pip resolves to the CUDA wheel, SGLang will fail to import on a CPU host with cryptic CUDA errors.

    venv + CPU PyTorch
    su - sglang
    python3.12 -m venv ~/sglang-env
    source ~/sglang-env/bin/activate
    pip install --upgrade pip wheel
    
    pip install torch torchvision torchaudio \
        --index-url https://download.pytorch.org/whl/cpu
    
    python -c "import torch; print(torch.__version__, torch.cuda.is_available())"

    The output should end with +cpu False. If it says True, uninstall and reinstall against the CPU index.
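
    If you script the install, a stricter guard can fail fast (a sketch; it relies on torch.version.cuda being None in CPU-only wheels):

    CPU-build guard (Python)
    import torch

    # CPU wheels report a version like "2.x.y+cpu" and carry no CUDA runtime.
    assert torch.version.cuda is None and not torch.cuda.is_available(), (
        f"CUDA build of PyTorch detected ({torch.__version__}); "
        "reinstall from the CPU index."
    )
    print(f"CPU build confirmed: {torch.__version__}")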

    SGLang + helpers
    pip install "sglang[srt]"
    pip install transformers accelerate sentencepiece protobuf
    python -c "import sglang; print(sglang.__version__)"

    3. Download a Model

    Pre-pull weights so the first server start does not block on a slow HuggingFace download. Qwen 2.5 1.5B is the easiest smoke test; bump to 3B if you have 8GB+ of free RAM.

    huggingface-cli
    pip install huggingface_hub
    huggingface-cli download Qwen/Qwen2.5-1.5B-Instruct \
        --local-dir ~/models/qwen2.5-1.5b-instruct

    For gated models (Llama 3.2, etc.), run huggingface-cli login first and paste an access token from your HF account.
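
    The same pull can be scripted through the huggingface_hub API (a sketch; snapshot_download resumes partial downloads and reuses the token saved by huggingface-cli login for gated repos):

    Programmatic download (Python)
    from huggingface_hub import snapshot_download

    # Equivalent to the CLI call above.
    path = snapshot_download(
        repo_id="Qwen/Qwen2.5-1.5B-Instruct",
        local_dir="/home/sglang/models/qwen2.5-1.5b-instruct",
    )
    print(f"Weights in {path}")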

    4. Launch the SGLang Server

    Interactive launch
    export SGLANG_USE_CPU_ENGINE=1
    
    python -m sglang.launch_server \
        --model-path ~/models/qwen2.5-1.5b-instruct \
        --device cpu \
        --host 127.0.0.1 \
        --port 30000 \
        --mem-fraction-static 0.8 \
        --max-total-tokens 8192 \
        --disable-overlap-schedule \
        --trust-remote-code

    Startup takes 30 to 90 seconds while weights load; wait for "The server is fired up and ready to roll!" in the log. --host 127.0.0.1 keeps the API loopback-only. Never bind to 0.0.0.0: nothing in this setup authenticates requests until nginx is in front of the server (step 6). --disable-overlap-schedule is recommended for the CPU backend.
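
    Because the load takes a while, scripts should poll the OpenAI-compatible /v1/models route before sending traffic (a standard-library sketch; the 180-second deadline is an assumption sized for small CPU plans):

    Wait for readiness (Python)
    import time
    import urllib.error
    import urllib.request

    URL = "http://127.0.0.1:30000/v1/models"
    deadline = time.monotonic() + 180  # CPU model load can legitimately take minutes
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(URL, timeout=5) as resp:
                if resp.status == 200:
                    print("server ready")
                    break
        except (urllib.error.URLError, OSError):
            pass  # connection refused while weights are still loading
        time.sleep(3)
    else:
        raise SystemExit("server did not come up before the deadline")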

    Smoke test
    curl http://127.0.0.1:30000/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{
            "model": "qwen2.5-1.5b-instruct",
            "messages": [{"role": "user", "content": "Write one sentence about RamNode."}],
            "temperature": 0.7,
            "max_tokens": 80
        }'
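
    The same request from Python, using only the standard library (a sketch; the reply follows the OpenAI chat-completions shape, so the text lives at choices[0].message.content):

    Smoke test (Python)
    import json
    import urllib.request

    req = urllib.request.Request(
        "http://127.0.0.1:30000/v1/chat/completions",
        data=json.dumps({
            "model": "qwen2.5-1.5b-instruct",
            "messages": [{"role": "user",
                          "content": "Write one sentence about RamNode."}],
            "temperature": 0.7,
            "max_tokens": 80,
        }).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    print(body["choices"][0]["message"]["content"])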

    5. Run as a systemd Service

    /etc/systemd/system/sglang.service
    [Unit]
    Description=SGLang Inference Server
    After=network-online.target
    Wants=network-online.target
    
    [Service]
    Type=simple
    User=sglang
    Group=sglang
    WorkingDirectory=/home/sglang
    Environment="SGLANG_USE_CPU_ENGINE=1"
    Environment="HF_HOME=/home/sglang/.cache/huggingface"
    Environment="PATH=/home/sglang/sglang-env/bin:/usr/local/bin:/usr/bin:/bin"
    ExecStart=/home/sglang/sglang-env/bin/python -m sglang.launch_server \
        --model-path /home/sglang/models/qwen2.5-1.5b-instruct \
        --device cpu \
        --host 127.0.0.1 \
        --port 30000 \
        --mem-fraction-static 0.8 \
        --max-total-tokens 8192 \
        --disable-overlap-schedule \
        --trust-remote-code
    Restart=on-failure
    RestartSec=15
    TimeoutStartSec=300
    LimitNOFILE=65535
    
    # Hardening
    NoNewPrivileges=true
    PrivateTmp=true
    ProtectSystem=strict
    ProtectHome=read-only
    ReadWritePaths=/home/sglang
    ProtectKernelTunables=true
    ProtectKernelModules=true
    ProtectControlGroups=true
    
    [Install]
    WantedBy=multi-user.target

    The generous TimeoutStartSec=300 matters — model load on a small CPU VPS legitimately takes a couple of minutes; you do not want systemd killing the service while weights are still mapping into memory.

    Enable + watch
    sudo systemctl daemon-reload
    sudo systemctl enable --now sglang
    sudo journalctl -u sglang -f

    6. Nginx with Bearer-Token Auth + TLS

    Install + cert
    sudo apt install -y nginx certbot python3-certbot-nginx
    sudo certbot --nginx -d llm.yourdomain.com
    
    # Generate a strong API key
    openssl rand -hex 32

    Replace the auto-generated nginx site at /etc/nginx/sites-available/llm.yourdomain.com:

    nginx config
    map $http_authorization $auth_ok {
        default                                  0;
        "Bearer sk-ramnode-PASTE_YOUR_KEY_HERE"  1;
    }
    
    server {
        listen 80;
        listen [::]:80;
        server_name llm.yourdomain.com;
        return 301 https://$host$request_uri;
    }
    
    server {
        listen 443 ssl http2;
        listen [::]:443 ssl http2;
        server_name llm.yourdomain.com;
    
        ssl_certificate     /etc/letsencrypt/live/llm.yourdomain.com/fullchain.pem;
        ssl_certificate_key /etc/letsencrypt/live/llm.yourdomain.com/privkey.pem;
        ssl_protocols       TLSv1.2 TLSv1.3;
    
        client_max_body_size 8m;
        proxy_read_timeout   600s;
        proxy_send_timeout   600s;
        proxy_buffering      off;
    
        location / {
            if ($auth_ok = 0) {
                return 401 '{"error":"unauthorized"}';
            }
            add_header Content-Type application/json always;
    
            proxy_pass http://127.0.0.1:30000;
            proxy_http_version 1.1;
            proxy_set_header Host              $host;
            proxy_set_header X-Real-IP         $remote_addr;
            proxy_set_header X-Forwarded-For   $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;
            proxy_set_header Connection        "";
        }
    }
    Verify
    # Should return 401
    curl -s https://llm.yourdomain.com/v1/models
    
    # Should return the served model list
    curl -s https://llm.yourdomain.com/v1/models \
        -H "Authorization: Bearer sk-ramnode-PASTE_YOUR_KEY_HERE"

    Any OpenAI SDK now works against your endpoint by setting base_url and api_key. For multi-tenant scenarios, swap the simple map directive for oauth2-proxy or a small FastAPI sidecar that validates per-tenant keys.
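
    With the official openai Python package, for example (a sketch; substitute your real domain and key):

    OpenAI SDK client (Python)
    from openai import OpenAI

    client = OpenAI(
        base_url="https://llm.yourdomain.com/v1",
        api_key="sk-ramnode-PASTE_YOUR_KEY_HERE",  # the key baked into the nginx map
    )
    resp = client.chat.completions.create(
        model="qwen2.5-1.5b-instruct",
        messages=[{"role": "user", "content": "Write one sentence about RamNode."}],
        max_tokens=80,
    )
    print(resp.choices[0].message.content)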

    7. Benchmark Your Deployment

    bench_serving
    source ~/sglang-env/bin/activate
    python -m sglang.bench_serving \
        --backend sglang \
        --model qwen2.5-1.5b-instruct \
        --base-url http://127.0.0.1:30000 \
        --dataset-name random \
        --random-input 256 \
        --random-output 128 \
        --num-prompts 50 \
        --max-concurrency 4

    Watch Mean TTFT (time to first token) and Output throughput. On a 4 vCPU box with a 1.5B model expect roughly 10 to 30 tokens/sec per request, with aggregate throughput rising as you batch concurrent requests. If numbers are unacceptable: pick a smaller model, switch to a more aggressive quantization (INT8, AWQ 4-bit), or move to a GPU-backed deployment — the API surface is identical, so client code does not change.
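
    To see the batching effect directly, fire the same request at a few concurrency levels and compare aggregate throughput (a rough sketch; it assumes the server is on loopback and reads generated-token counts from the usage field of each response):

    Concurrency probe (Python)
    import json
    import time
    import urllib.request
    from concurrent.futures import ThreadPoolExecutor

    URL = "http://127.0.0.1:30000/v1/chat/completions"

    def one_request(_=None) -> int:
        req = urllib.request.Request(
            URL,
            data=json.dumps({
                "model": "qwen2.5-1.5b-instruct",
                "messages": [{"role": "user", "content": "Count from 1 to 50."}],
                "max_tokens": 128,
            }).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)["usage"]["completion_tokens"]

    for concurrency in (1, 2, 4):
        start = time.monotonic()
        with ThreadPoolExecutor(max_workers=concurrency) as pool:
            tokens = sum(pool.map(one_request, range(concurrency * 4)))
        elapsed = time.monotonic() - start
        print(f"concurrency={concurrency}: {tokens / elapsed:.1f} tok/s aggregate")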

    Troubleshooting

    • Server hangs at startup, never logs "ready": almost always a memory issue. Watch htop — if the process is OOM-killed, drop to a smaller model or resize.
    • ImportError related to CUDA: you have the CUDA build of PyTorch. pip uninstall -y torch torchvision torchaudio and reinstall with the CPU index URL.
    • Inference dramatically slower than expected: check vmstat 2 for swap activity. High swap = the model is overflowing RAM.
    • HTTP 401 with a valid header: the nginx map is a literal string match. Reproduce with curl -v — whitespace, missing Bearer prefix, or trailing newlines all fail.
    • Custom architectures (Qwen, DeepSeek): ensure --trust-remote-code is in both the systemd unit and any manual launch.