Self-Hosted AI Stack Series
    Part 1 of 8

    Ollama — Run LLMs on CPU

    Deploy large language models on your VPS without a GPU. Model selection for 2GB, 4GB, and 8GB RAM tiers starting at $10/month.

    25 minutes
    Any RamNode VPS plan
    Prerequisites

    RamNode VPS (any plan), Ubuntu 22.04/24.04, SSH access

    Time to Complete

    20–30 minutes

    Recommended Plan

    2GB ($10/mo) for small models; 4GB ($20/mo) recommended; 8GB ($40/mo) for best selection

    Looking for a quick-start guide? Check out our standalone Ollama Deployment Guide for a streamlined setup walkthrough.

    Introduction

    ChatGPT Plus costs $20/month per user. OpenAI API costs scale unpredictably with usage. And every prompt you send leaves your infrastructure, exposing sensitive data to third-party servers.

    There's a better way. Modern quantized language models run surprisingly well on CPU-only VPS instances — no expensive GPU required. By the end of this 8-part series, you'll have a fully private AI platform running on a single RamNode VPS for a fraction of the cost.

    💰 Series Cost Comparison

    Commercial AI stack (ChatGPT Team + Copilot + Pinecone + Zapier AI + API costs): $390–690+/month. Your RamNode VPS running the complete self-hosted stack: $40/month.

    Why Ollama on CPU?

    GPU instances cost $50–300+/month. But modern quantized models in GGUF format run well on CPU-only hardware thanks to efficient inference engines.

    Understanding Quantization

    Quantization reduces model precision from 16-bit floats to lower bit widths, dramatically shrinking memory requirements with minimal quality loss:

    Quantization   Size Reduction   Quality Impact   Best For
    Q8_0           ~50%             Minimal          8GB+ RAM, best quality
    Q5_K_M         ~65%             Very low         4–8GB RAM, great balance
    Q4_K_M         ~75%             Low              Sweet spot for CPU inference
    Q3_K_M         ~80%             Moderate         2GB RAM, constrained setups

    Q4_K_M is the sweet spot — it preserves most of the model's capability while fitting comfortably in modest RAM allocations. RamNode's VPS plans offer the best price-performance ratio for this workload.
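    As a sanity check before pulling, you can estimate a quantized model's download size from its parameter count. The bits-per-weight figures below are rough averages for each scheme (an assumption, not exact GGUF numbers):

```shell
# Rough GGUF size estimate: params (billions) * bits-per-weight / 8 ~= GB.
# Bits-per-weight values are approximate averages for each quantization scheme.
est_gb() {  # usage: est_gb <params_in_billions> <bits_per_weight>
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f", p * b / 8 }'
}

echo "7B at Q4_K_M (~4.5 bpw): $(est_gb 7 4.5) GB"   # ~3.9 GB
echo "7B at Q8_0   (~8.5 bpw): $(est_gb 7 8.5) GB"   # ~7.4 GB
```

    The Q4_K_M estimate lines up with the ~4 GB download you'll see for 7B models later in this guide.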

    Installing Ollama

    Ollama provides a one-line installer that handles everything:

    Install Ollama
    curl -fsSL https://ollama.com/install.sh | sh

    Verify the installation:

    Verify Installation
    ollama --version

    Configure for Network Access

    By default, Ollama listens only on localhost. For later parts of this series (Open WebUI, n8n, etc.), configure it to accept connections from Docker containers:

    Edit systemd service
    sudo systemctl edit ollama.service

    Add the following override:

    ollama.service override
    [Service]
    Environment="OLLAMA_HOST=0.0.0.0"
    Environment="OLLAMA_ORIGINS=*"

    Reload and restart:

    sudo systemctl daemon-reload
    sudo systemctl restart ollama

    ⚠️ Security Note: Setting OLLAMA_HOST=0.0.0.0 exposes Ollama on all interfaces. Protect it with a firewall — only allow access from localhost and your Docker network:

    sudo ufw allow from 172.16.0.0/12 to any port 11434
    sudo ufw deny 11434
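    Rather than hard-coding 172.16.0.0/12, you can derive the actual Docker bridge subnet and generate a tighter rule. This is a sketch that assumes the default `bridge` network; the fallback is Docker's usual default subnet:

```shell
# Look up the Docker bridge subnet; fall back to Docker's default if the
# daemon isn't available yet. Then print the ufw rule to run.
subnet="$(docker network inspect bridge \
  --format '{{(index .IPAM.Config 0).Subnet}}' 2>/dev/null || echo 172.17.0.0/16)"
echo "sudo ufw allow from ${subnet} to any port 11434 proto tcp"
```

    Printing the rule instead of running it lets you review it before applying.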

    Model Selection by RAM Tier

    This is the centerpiece of your Ollama deployment. Choose models based on your VPS plan's available RAM:

    2GB RAM — $10/month

    Small but capable models for lightweight tasks:

    Model       Size     RAM Usage   Best For
    tinyllama   637 MB   ~1.1 GB     Quick Q&A, summarization
    phi         1.6 GB   ~1.8 GB     Reasoning, general tasks
    gemma:2b    1.4 GB   ~1.6 GB     Google's efficient small model

    Pull a 2GB-tier model
    ollama pull tinyllama
    ollama pull phi

    4GB RAM — $20/month
    Recommended

    The sweet spot — access to powerful 7B parameter models:

    Model         Size     RAM Usage   Best For
    mistral       4.1 GB   ~3.8 GB     General purpose, conversation
    llama3.1:8b   4.7 GB   ~3.9 GB     Meta's latest, excellent reasoning
    codegemma     5.0 GB   ~3.8 GB     Code generation & completion

    Pull 4GB-tier models
    ollama pull mistral
    ollama pull llama3.1:8b

    8GB RAM — $40/month

    Full model selection — larger quantizations and bigger models:

    Model                 Size     RAM Usage   Best For
    llama3.1:8b (Q5)      5.5 GB   ~6.2 GB     Higher quality reasoning
    deepseek-coder:6.7b   3.8 GB   ~5.5 GB     Specialized code generation

    Note that mixtral is a ~26 GB download at Q4 and needs roughly that much RAM to run, so it does not fit on an 8GB plan; stick with dense 7–8B models at this tier.

    Pull 8GB-tier models
    ollama pull llama3.1:8b
    ollama pull deepseek-coder:6.7b

    Running Your First Inference

    Interactive Chat

    Start an interactive session with your model:

    ollama run mistral

    Type your prompt and press Enter. Use /bye to exit.

    REST API

    Ollama exposes a REST API for programmatic access — this is how Open WebUI, n8n, and CrewAI will connect in later parts:

    Non-streaming request
    curl http://localhost:11434/api/generate -d '{
      "model": "mistral",
      "prompt": "Explain containers in one paragraph.",
      "stream": false
    }'

    Streaming request
    curl http://localhost:11434/api/generate -d '{
      "model": "mistral",
      "prompt": "Write a bash script to monitor disk usage.",
      "stream": true
    }'

    Chat API (conversation format)
    curl http://localhost:11434/api/chat -d '{
      "model": "mistral",
      "messages": [
        {"role": "system", "content": "You are a helpful DevOps assistant."},
        {"role": "user", "content": "How do I set up a cron job?"}
      ],
      "stream": false
    }'
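    The non-streaming endpoints return a single JSON object whose generated text lives in `response` (for /api/generate) or `message.content` (for /api/chat). With `jq` you can extract just the text; the payload below is a trimmed stand-in for a live reply:

```shell
# Extract the generated text from a (sample) non-streaming reply.
# In practice you would pipe the curl output from above straight into jq.
reply='{"model":"mistral","response":"Containers bundle an app with its dependencies.","done":true}'
echo "$reply" | jq -r '.response'
```

    For /api/chat, the equivalent filter is `.message.content`.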

    Performance Tuning

    Parallel Requests

    Control how many requests Ollama handles simultaneously. For CPU inference, keep this conservative:

    Add to systemd override
    [Service]
    Environment="OLLAMA_NUM_PARALLEL=2"
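    On RAM-constrained plans it also helps to cap how many models Ollama keeps resident at once. Ollama's OLLAMA_MAX_LOADED_MODELS setting controls this; combined with the parallel limit, the override would look like:

```ini
[Service]
Environment="OLLAMA_NUM_PARALLEL=2"
Environment="OLLAMA_MAX_LOADED_MODELS=1"
```

    With a value of 1, pulling up a second model evicts the first instead of doubling your memory footprint.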

    Context Window vs RAM Tradeoff

    Larger context windows use more RAM. The default is typically 2048 tokens. Adjust per-model:

    Create a Modelfile with custom context
    FROM mistral
    PARAMETER num_ctx 4096

    Then build the variant from it:

    ollama create mistral-4k -f Modelfile

    Context Size     Extra RAM   Good For
    2048 (default)   Baseline    Short conversations, Q&A
    4096+            ~500 MB     Longer conversations, code review
    8192+            ~1.5 GB     Document analysis, RAG (Part 3)
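    Where does that extra RAM go? Mostly into the KV cache, which grows linearly with context length. A back-of-envelope calculation for a Mistral-7B-shaped model (32 layers, 8 KV heads of 128 dimensions, fp16 cache; these are assumed figures, so check your model's card):

```shell
# KV cache bytes per token = 2 (K and V) * layers * kv_heads * head_dim * 2 bytes.
# Model dimensions below are Mistral 7B's grouped-query-attention shape (assumed).
kv_cache_mb() {  # usage: kv_cache_mb <context_tokens>
  awk -v ctx="$1" 'BEGIN { printf "%d", 2 * 32 * 8 * 128 * 2 * ctx / 1048576 }'
}

echo "ctx=4096: $(kv_cache_mb 4096) MB"   # 512 MB
echo "ctx=8192: $(kv_cache_mb 8192) MB"   # 1024 MB
```

    Actual usage runs somewhat higher once activations and buffers are counted, which is why the table shows ~1.5 GB at 8192.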

    Swap Space as Safety Net

    Add swap to prevent OOM kills if a model temporarily exceeds available RAM:

    Create 2GB swap
    sudo fallocate -l 2G /swapfile
    sudo chmod 600 /swapfile
    sudo mkswap /swapfile
    sudo swapon /swapfile
    echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
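    Since swap on a VPS is slow, you may also want the kernel to prefer RAM and touch swap only under real pressure. A common tuning (not a requirement) is to lower swappiness:

```
# /etc/sysctl.d/99-ollama-swap.conf
# Prefer RAM; use swap only when memory is genuinely tight.
vm.swappiness=10
```

    Apply it with `sudo sysctl --system` (or reboot).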

    Monitor Resource Usage

    # Watch memory and CPU in real-time
    htop
    
    # Check Ollama-specific resource usage
    ollama ps
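    For unattended monitoring, a tiny check like this can feed a cron job or alert hook. The 500 MB floor is an arbitrary example threshold:

```shell
# Report available memory and flag when it drops below a threshold (in MB).
THRESHOLD_MB=500  # example value; tune to your plan
avail_mb=$(awk '/^MemAvailable:/ { printf "%d", $2 / 1024 }' /proc/meminfo)

if [ "$avail_mb" -lt "$THRESHOLD_MB" ]; then
  echo "LOW MEMORY: ${avail_mb} MB available"
else
  echo "OK: ${avail_mb} MB available"
fi
```

    Schedule it with cron and pipe the output to your notifier of choice.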

    Persistence & Auto-Start

    Ollama's installer configures systemd automatically. Verify it survives reboots:

    sudo systemctl enable ollama
    sudo systemctl status ollama

    Model Pre-Pull Script

    Create a script that ensures your preferred models are available after a fresh deployment:

    /usr/local/bin/ollama-pull-models.sh
    #!/bin/bash
    # Pre-pull models after restart
    MODELS="mistral llama3.1:8b"
    
    for model in $MODELS; do
      echo "Pulling $model..."
      ollama pull "$model"
    done
    echo "All models ready."

    Make the script executable:

    sudo chmod +x /usr/local/bin/ollama-pull-models.sh
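    To run the script automatically once Ollama is up after a reboot, a oneshot unit along these lines could work (the unit and file names here are illustrative):

```ini
# /etc/systemd/system/ollama-pull-models.service (illustrative name)
[Unit]
Description=Pre-pull Ollama models
After=ollama.service network-online.target
Wants=network-online.target

[Service]
Type=oneshot
ExecStart=/usr/local/bin/ollama-pull-models.sh

[Install]
WantedBy=multi-user.target
```

    Enable it with `sudo systemctl daemon-reload && sudo systemctl enable ollama-pull-models.service`.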

    Health Check

    Verify Ollama is responding:

    curl http://localhost:11434/api/tags

    This returns a JSON list of all available models — useful for monitoring and integration testing.
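    Wrapped in a small script, the same endpoint becomes a probe you can call from cron or an uptime monitor. The endpoint and timeout below are sensible defaults, not requirements:

```shell
# Probe Ollama and report a one-word status without failing the shell.
status=$(curl -sf --max-time 5 http://localhost:11434/api/tags >/dev/null \
  && echo healthy || echo unreachable)
echo "ollama: ${status}"
```

    `curl -sf` stays silent and returns a non-zero exit code on HTTP errors, which is what flips the status to "unreachable".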

    What's Next?

    You now have a local LLM inference server running on your VPS. In Part 2: Open WebUI, we'll give your Ollama instance a polished ChatGPT-like interface with:

    • Multi-user support with role-based access control
    • Conversation history and model switching
    • File uploads and document preview
    • Custom system prompts and model presets

    Running LLMs on a $10/month VPS is just the beginning. Part 2 turns it into a team-ready AI chat platform.