Self-Hosted AI Stack Series
    Part 1 of 8

    Ollama — Run LLMs on CPU

    Deploy large language models on your VPS without a GPU. Model selection for 2GB, 4GB, and 8GB RAM tiers starting at $10/month.

    25 minutes
    Any RamNode VPS plan
    Prerequisites

    RamNode VPS (any plan), Ubuntu 22.04/24.04, SSH access

    Time to Complete

    20–30 minutes

    Recommended Plan

    2GB ($10/mo) for small models; 4GB ($20/mo) recommended; 8GB ($40/mo) for best selection

    Looking for a quick-start guide? Check out our standalone Ollama Deployment Guide for a streamlined setup walkthrough.

    Introduction

    ChatGPT Plus costs $20/month per user. OpenAI API costs scale unpredictably with usage. And every prompt you send leaves your infrastructure, exposing sensitive data to third-party servers.

    There's a better way. Modern quantized language models run surprisingly well on CPU-only VPS instances — no expensive GPU required. By the end of this 8-part series, you'll have a fully private AI platform running on a single RamNode VPS for a fraction of the cost.

    💰 Series Cost Comparison

    Commercial AI stack (ChatGPT Team + Copilot + Pinecone + Zapier AI + API costs): $390–690+/month. Your RamNode VPS running the complete self-hosted stack: $40/month.

    Why Ollama on CPU?

    GPU instances cost $50–300+/month. But modern quantized models in GGUF format run well on CPU-only hardware thanks to efficient inference engines.

    Understanding Quantization

    Quantization reduces model precision from 16-bit floats to lower bit widths, dramatically shrinking memory requirements with minimal quality loss:

    Quantization   Size Reduction   Quality Impact   Best For
    Q8_0           ~50%             Minimal          8GB+ RAM, best quality
    Q5_K_M         ~65%             Very low         4–8GB RAM, great balance
    Q4_K_M         ~75%             Low              Sweet spot for CPU inference
    Q3_K_M         ~80%             Moderate         2GB RAM, constrained setups

    Q4_K_M is the sweet spot — it preserves most of the model's capability while fitting comfortably in modest RAM allocations. RamNode's VPS plans offer the best price-performance ratio for this workload.
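    As a sanity check before pulling, you can estimate a quantized model's download size from its parameter count. The bits-per-weight figures below are rough averages for each scheme (an assumption, not exact GGUF numbers):

```shell
# Rough GGUF size estimate: params (billions) * bits-per-weight / 8 ~= GB.
# Bits-per-weight values are approximate averages for each quantization scheme.
est_gb() {  # usage: est_gb <params_in_billions> <bits_per_weight>
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f", p * b / 8 }'
}

echo "7B at Q4_K_M (~4.5 bpw): $(est_gb 7 4.5) GB"   # ~3.9 GB
echo "7B at Q8_0   (~8.5 bpw): $(est_gb 7 8.5) GB"   # ~7.4 GB
```

    The Q4_K_M estimate lines up with the ~4 GB download you'll see for 7B models later in this guide.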

    Installing Ollama

    Ollama provides a one-line installer that handles everything:

    Install Ollama
    curl -fsSL https://ollama.com/install.sh | sh

    Verify the installation:

    Verify Installation
    ollama --version

    Configure for Network Access

    By default, Ollama listens only on localhost. For later parts of this series (Open WebUI, n8n, etc.), configure it to accept connections from Docker containers:

    Edit systemd service
    sudo systemctl edit ollama.service

    Add the following override:

    ollama.service override
    [Service]
    Environment="OLLAMA_HOST=0.0.0.0"
    Environment="OLLAMA_ORIGINS=*"

    Reload and restart:

    sudo systemctl daemon-reload
    sudo systemctl restart ollama

    ⚠️ Security Note: Setting OLLAMA_HOST=0.0.0.0 exposes Ollama on all interfaces. Protect it with a firewall — only allow access from localhost and your Docker network:

    sudo ufw allow from 172.16.0.0/12 to any port 11434
    sudo ufw deny 11434
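    Rather than hard-coding 172.16.0.0/12, you can derive the actual Docker bridge subnet and generate a tighter rule. This is a sketch that assumes the default `bridge` network; the fallback is Docker's usual default subnet:

```shell
# Look up the Docker bridge subnet; fall back to Docker's default if the
# daemon isn't available yet. Then print the ufw rule to run.
subnet="$(docker network inspect bridge \
  --format '{{(index .IPAM.Config 0).Subnet}}' 2>/dev/null || echo 172.17.0.0/16)"
echo "sudo ufw allow from ${subnet} to any port 11434 proto tcp"
```

    Printing the rule instead of running it lets you review it before applying.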

    Model Selection by RAM Tier

    This is the centerpiece of your Ollama deployment. Choose models based on your VPS plan's available RAM:

    2GB RAM — $10/month

    Small but capable models for lightweight tasks:

    Model       Size     RAM Usage   Best For
    tinyllama   637 MB   ~1.1 GB     Quick Q&A, summarization
    phi         1.6 GB   ~1.8 GB     Reasoning, general tasks
    gemma:2b    1.4 GB   ~1.6 GB     Google's efficient small model

    Pull a 2GB-tier model
    ollama pull tinyllama
    ollama pull phi

    4GB RAM — $20/month
    Recommended

    The sweet spot — access to powerful 7B parameter models:

    Model         Size     RAM Usage   Best For
    mistral       4.1 GB   ~3.8 GB     General purpose, conversation
    llama3.1:8b   4.7 GB   ~3.9 GB     Meta's latest, excellent reasoning
    codegemma     5.0 GB   ~3.8 GB     Code generation & completion

    Pull 4GB-tier models
    ollama pull mistral
    ollama pull llama3.1:8b

    8GB RAM — $40/month

    Full model selection — larger quantizations and bigger models:

    Model                 Size     RAM Usage   Best For
    llama3.1:8b (Q5)      5.5 GB   ~6.2 GB     Higher quality reasoning
    deepseek-coder:6.7b   3.8 GB   ~5.5 GB     Specialized code generation

    Note that mixtral is a ~26 GB download at Q4 and needs roughly that much RAM to run, so it does not fit on an 8GB plan; stick with dense 7–8B models at this tier.

    Pull 8GB-tier models
    ollama pull llama3.1:8b
    ollama pull deepseek-coder:6.7b

    Running Your First Inference

    Interactive Chat

    Start an interactive session with your model:

    ollama run mistral

    Type your prompt and press Enter. Use /bye to exit.

    REST API

    Ollama exposes a REST API for programmatic access — this is how Open WebUI, n8n, and CrewAI will connect in later parts:

    Non-streaming request
    curl http://localhost:11434/api/generate -d '{
      "model": "mistral",
      "prompt": "Explain containers in one paragraph.",
      "stream": false
    }'

    Streaming request
    curl http://localhost:11434/api/generate -d '{
      "model": "mistral",
      "prompt": "Write a bash script to monitor disk usage.",
      "stream": true
    }'

    Chat API (conversation format)
    curl http://localhost:11434/api/chat -d '{
      "model": "mistral",
      "messages": [
        {"role": "system", "content": "You are a helpful DevOps assistant."},
        {"role": "user", "content": "How do I set up a cron job?"}
      ],
      "stream": false
    }'
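    The non-streaming endpoints return a single JSON object whose generated text lives in `response` (for /api/generate) or `message.content` (for /api/chat). With `jq` you can extract just the text; the payload below is a trimmed stand-in for a live reply:

```shell
# Extract the generated text from a (sample) non-streaming reply.
# In practice you would pipe the curl output from above straight into jq.
reply='{"model":"mistral","response":"Containers bundle an app with its dependencies.","done":true}'
echo "$reply" | jq -r '.response'
```

    For /api/chat, the equivalent filter is `.message.content`.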

    Performance Tuning

    Parallel Requests

    Control how many requests Ollama handles simultaneously. For CPU inference, keep this conservative:

    Add to systemd override
    [Service]
    Environment="OLLAMA_NUM_PARALLEL=2"
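    On RAM-constrained plans it also helps to cap how many models Ollama keeps resident at once. Ollama's OLLAMA_MAX_LOADED_MODELS setting controls this; combined with the parallel limit, the override would look like:

```ini
[Service]
Environment="OLLAMA_NUM_PARALLEL=2"
Environment="OLLAMA_MAX_LOADED_MODELS=1"
```

    With a value of 1, pulling up a second model evicts the first instead of doubling your memory footprint.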

    Context Window vs RAM Tradeoff

    Larger context windows use more RAM. The default is typically 2048 tokens. Adjust per-model:

    Create a Modelfile with custom context
    FROM mistral
    PARAMETER num_ctx 4096

    Then build the variant from it:

    ollama create mistral-4k -f Modelfile

    Context Size     Extra RAM   Good For
    2048 (default)   Baseline    Short conversations, Q&A
    4096+            ~500 MB     Longer conversations, code review
    8192+            ~1.5 GB     Document analysis, RAG (Part 3)
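    Where does that extra RAM go? Mostly into the KV cache, which grows linearly with context length. A back-of-envelope calculation for a Mistral-7B-shaped model (32 layers, 8 KV heads of 128 dimensions, fp16 cache; these are assumed figures, so check your model's card):

```shell
# KV cache bytes per token = 2 (K and V) * layers * kv_heads * head_dim * 2 bytes.
# Model dimensions below are Mistral 7B's grouped-query-attention shape (assumed).
kv_cache_mb() {  # usage: kv_cache_mb <context_tokens>
  awk -v ctx="$1" 'BEGIN { printf "%d", 2 * 32 * 8 * 128 * 2 * ctx / 1048576 }'
}

echo "ctx=4096: $(kv_cache_mb 4096) MB"   # 512 MB
echo "ctx=8192: $(kv_cache_mb 8192) MB"   # 1024 MB
```

    Actual usage runs somewhat higher once activations and buffers are counted, which is why the table shows ~1.5 GB at 8192.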

    Swap Space as Safety Net

    Add swap to prevent OOM kills if a model temporarily exceeds available RAM:

    Create 2GB swap
    sudo fallocate -l 2G /swapfile
    sudo chmod 600 /swapfile
    sudo mkswap /swapfile
    sudo swapon /swapfile
    echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
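    Since swap on a VPS is slow, you may also want the kernel to prefer RAM and touch swap only under real pressure. A common tuning (not a requirement) is to lower swappiness:

```
# /etc/sysctl.d/99-ollama-swap.conf
# Prefer RAM; use swap only when memory is genuinely tight.
vm.swappiness=10
```

    Apply it with `sudo sysctl --system` (or reboot).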

    Monitor Resource Usage

    # Watch memory and CPU in real-time
    htop
    
    # Check Ollama-specific resource usage
    ollama ps
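    For unattended monitoring, a tiny check like this can feed a cron job or alert hook. The 500 MB floor is an arbitrary example threshold:

```shell
# Report available memory and flag when it drops below a threshold (in MB).
THRESHOLD_MB=500  # example value; tune to your plan
avail_mb=$(awk '/^MemAvailable:/ { printf "%d", $2 / 1024 }' /proc/meminfo)

if [ "$avail_mb" -lt "$THRESHOLD_MB" ]; then
  echo "LOW MEMORY: ${avail_mb} MB available"
else
  echo "OK: ${avail_mb} MB available"
fi
```

    Schedule it with cron and pipe the output to your notifier of choice.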

    Persistence & Auto-Start

    Ollama's installer configures systemd automatically. Verify it survives reboots:

    sudo systemctl enable ollama
    sudo systemctl status ollama

    Model Pre-Pull Script

    Create a script that ensures your preferred models are available after a fresh deployment:

    /usr/local/bin/ollama-pull-models.sh
    #!/bin/bash
    # Pre-pull models after restart
    MODELS="mistral llama3.1:8b"
    
    for model in $MODELS; do
      echo "Pulling $model..."
      ollama pull "$model"
    done
    echo "All models ready."

    Make the script executable:

    sudo chmod +x /usr/local/bin/ollama-pull-models.sh
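    To run the script automatically once Ollama is up after a reboot, a oneshot unit along these lines could work (the unit and file names here are illustrative):

```ini
# /etc/systemd/system/ollama-pull-models.service (illustrative name)
[Unit]
Description=Pre-pull Ollama models
After=ollama.service network-online.target
Wants=network-online.target

[Service]
Type=oneshot
ExecStart=/usr/local/bin/ollama-pull-models.sh

[Install]
WantedBy=multi-user.target
```

    Enable it with `sudo systemctl daemon-reload && sudo systemctl enable ollama-pull-models.service`.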

    Health Check

    Verify Ollama is responding:

    curl http://localhost:11434/api/tags

    This returns a JSON list of all available models — useful for monitoring and integration testing.
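    Wrapped in a small script, the same endpoint becomes a probe you can call from cron or an uptime monitor. The endpoint and timeout below are sensible defaults, not requirements:

```shell
# Probe Ollama and report a one-word status without failing the shell.
status=$(curl -sf --max-time 5 http://localhost:11434/api/tags >/dev/null \
  && echo healthy || echo unreachable)
echo "ollama: ${status}"
```

    `curl -sf` stays silent and returns a non-zero exit code on HTTP errors, which is what flips the status to "unreachable".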

    What's Next?

    You now have a local LLM inference server running on your VPS. In Part 2: Open WebUI, we'll give your Ollama instance a polished ChatGPT-like interface with:

    • Multi-user support with role-based access control
    • Conversation history and model switching
    • File uploads and document preview
    • Custom system prompts and model presets

    Running LLMs on a $10/month VPS is just the beginning. Part 2 turns it into a team-ready AI chat platform.