Prerequisites
Choose your RamNode plan based on your workload type. Predictive inference (scikit-learn, XGBoost) runs well on mid-range VPS plans, while generative AI and LLM serving require more resources.
| Resource | Predictive AI | Small LLMs | Large LLMs |
|---|---|---|---|
| CPU Cores | 4+ | 8+ | 16+ |
| RAM | 8 GB | 32 GB | 64+ GB |
| Storage | 40 GB SSD | 100 GB NVMe | 200+ GB NVMe |
| OS | Ubuntu 24.04 | Ubuntu 24.04 | Ubuntu 24.04 |
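If you want a quick pre-flight check, the table above can be encoded as a lookup; in this sketch the tier keys and helper name are shorthand invented here, while the thresholds come straight from the table:

```python
# Minimum requirements per workload tier, taken from the sizing table above.
PLAN_REQUIREMENTS = {
    "predictive": {"cpu_cores": 4, "ram_gb": 8, "storage_gb": 40},
    "small_llm": {"cpu_cores": 8, "ram_gb": 32, "storage_gb": 100},
    "large_llm": {"cpu_cores": 16, "ram_gb": 64, "storage_gb": 200},
}

def meets_requirements(workload: str, cpu_cores: int, ram_gb: int, storage_gb: int) -> bool:
    """Return True if the given VPS specs satisfy the workload tier's minimums."""
    req = PLAN_REQUIREMENTS[workload]
    return (cpu_cores >= req["cpu_cores"]
            and ram_gb >= req["ram_gb"]
            and storage_gb >= req["storage_gb"])

# An 8-core / 32 GB / 100 GB NVMe VPS can serve small LLMs but not large ones
print(meets_requirements("small_llm", 8, 32, 100))   # → True
print(meets_requirements("large_llm", 8, 32, 100))   # → False
```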
Software Prerequisites
- curl — pre-installed on Ubuntu 24.04
- kubectl — Kubernetes CLI for cluster management
- Helm 3 — Kubernetes package manager
- A running Kubernetes cluster — we will install K3s in this guide
Initial Server Setup
apt update && apt upgrade -y
apt install -y curl wget git apt-transport-https ca-certificates
Configure Firewall
ufw allow 22/tcp # SSH
ufw allow 6443/tcp # Kubernetes API
ufw allow 80/tcp # HTTP inference endpoints
ufw allow 443/tcp # HTTPS inference endpoints
ufw allow 8080/tcp # KServe model endpoints
ufw allow 10250/tcp # Kubelet
ufw --force enable
Set Kernel Parameters
cat <<EOF >> /etc/sysctl.d/99-kubernetes.conf
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward = 1
EOF
modprobe br_netfilter
sysctl --system
Install K3s Kubernetes
K3s is a lightweight, certified Kubernetes distribution ideal for VPS deployments. It bundles containerd, CoreDNS, and other essentials in a single binary. We disable Traefik because KServe manages traffic routing through Gateway API.
curl -sfL https://get.k3s.io | sh -s - \
--write-kubeconfig-mode 644 \
--disable traefik
kubectl get nodes
# Expected output:
# NAME      STATUS   ROLES                  AGE   VERSION
# ramnode   Ready    control-plane,master   1m    v1.32.x
mkdir -p ~/.kube
cp /etc/rancher/k3s/k3s.yaml ~/.kube/config
export KUBECONFIG=~/.kube/config
echo "export KUBECONFIG=~/.kube/config" >> ~/.bashrc
Install Helm
curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
helm version
Install Cert Manager
Cert Manager handles TLS certificate provisioning for KServe's webhook server. The minimum required version is 1.15.0.
kubectl apply -f \
https://github.com/cert-manager/cert-manager/releases/download/v1.16.3/cert-manager.yaml
# Wait for pods to be ready
kubectl wait --for=condition=ready pod -l app=cert-manager \
-n cert-manager --timeout=120s
kubectl get pods -n cert-manager
# All three pods should show Running status:
# cert-manager
# cert-manager-cainjector
# cert-manager-webhook
Install Gateway API & Envoy Gateway
KServe recommends Gateway API for network configuration, providing flexible and standardized traffic management — especially important for generative AI workloads with streaming responses and long-lived connections.
kubectl apply -f \
https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.2.1/standard-install.yaml
helm install eg oci://docker.io/envoyproxy/gateway-helm \
--version v1.3.0 \
-n envoy-gateway-system \
--create-namespace
# Wait for Envoy Gateway to be ready
kubectl wait --for=condition=ready pod \
-l app.kubernetes.io/name=gateway-helm \
-n envoy-gateway-system --timeout=120s
cat <<EOF | kubectl apply -f -
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: envoy
spec:
  controllerName: gateway.envoyproxy.io/gatewayclass-controller
EOF
Install KServe
KServe supports three deployment modes. For RamNode VPS deployments, we recommend Standard mode for full resource control.
| Mode | Best For | Notes |
|---|---|---|
| Standard (recommended) | GPU models, LLMs, all workloads | Full resource control |
| Knative (Serverless) | Burst traffic, scale-to-zero | Higher overhead |
| ModelMesh | High-density multi-model | Many small models |
Install with Helm (Recommended)
# Install KServe CRDs
helm install kserve-crd oci://ghcr.io/kserve/charts/kserve-crd \
--version v0.15.2
# Install KServe in Standard deployment mode
helm install kserve oci://ghcr.io/kserve/charts/kserve \
--version v0.15.2 \
--set kserve.controller.deploymentMode=Standard \
--set kserve.controller.gateway.ingressGateway.enableGatewayApi=true \
--set kserve.controller.gateway.ingressGateway.kserveGateway=kserve/kserve-ingress-gateway
Alternative: Install with kubectl
# Install KServe CRDs and Controller
kubectl apply --server-side -f \
https://github.com/kserve/kserve/releases/download/v0.15.2/kserve.yaml
# Install built-in ClusterServingRuntimes
kubectl apply --server-side -f \
https://github.com/kserve/kserve/releases/download/v0.15.2/kserve-cluster-resources.yaml
# Patch config for Standard deployment mode
kubectl patch configmap/inferenceservice-config -n kserve \
--type=strategic -p '{"data": {"deploy": "{\"defaultDeploymentMode\": \"Standard\"}"}}'
The --server-side flag is required when applying KServe CRDs due to the large size of the InferenceService custom resource definition.
kubectl get pods -n kserve
# Expected output:
# NAME                          READY   STATUS    AGE
# kserve-controller-manager-0   2/2     Running   2m
kubectl get clusterservingruntimes
# Should list built-in runtimes: kserve-huggingfaceserver,
# kserve-sklearnserver, kserve-xgbserver, kserve-torchserve, etc.
Deploy Your First Models
Example 1: Scikit-Learn Iris Classifier
The simplest way to verify your KServe installation. This deploys a pre-trained scikit-learn model for iris flower classification.
cat <<EOF | kubectl apply -f -
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      protocolVersion: v2
      storageUri: gs://kfserving-examples/models/sklearn/1.0/model
      resources:
        requests:
          cpu: "1"
          memory: 2Gi
        limits:
          cpu: "1"
          memory: 2Gi
EOF
# Wait for READY=True
kubectl get inferenceservice sklearn-iris
# Create test input
cat <<EOF > /tmp/iris-input.json
{"instances": [[6.8, 2.8, 4.8, 1.4], [6.0, 3.4, 4.5, 1.6]]}
EOF
# Get the service URL
SERVICE_HOSTNAME=$(kubectl get inferenceservice sklearn-iris \
-o jsonpath='{.status.url}' | cut -d/ -f3)
# Send prediction request
curl -v -H "Host: ${SERVICE_HOSTNAME}" \
http://localhost:8080/v1/models/sklearn-iris:predict \
-d @/tmp/iris-input.json
Example 2: Hugging Face LLM (Qwen 2.5)
For generative AI, KServe provides native Hugging Face integration with OpenAI-compatible chat completion APIs.
cat <<EOF | kubectl apply -f -
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: qwen-llm
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      args:
        - "--model_id"
        - "Qwen/Qwen2.5-0.5B-Instruct"
      resources:
        requests:
          cpu: "4"
          memory: 8Gi
        limits:
          cpu: "4"
          memory: 8Gi
EOF
SERVICE_HOSTNAME=$(kubectl get inferenceservice qwen-llm \
-o jsonpath='{.status.url}' | cut -d/ -f3)
curl -v -H "Host: ${SERVICE_HOSTNAME}" \
http://localhost:8080/openai/v1/chat/completions \
-d '{
"model": "qwen-llm",
"messages": [{"role": "user", "content": "Hello, what can you do?"}],
"max_tokens": 100
}'
🚀 GPU-Accelerated LLMs
For larger models like Llama 3.1 8B, add nvidia.com/gpu: "1" to your resource limits and ensure your RamNode VPS has GPU support configured. KServe v0.15 also supports multi-node inference for models that exceed single-GPU memory.
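As an illustrative sketch only (the resource sizes are assumptions, and requesting nvidia.com/gpu presumes the NVIDIA device plugin is running on the node), a GPU-backed predictor differs from the CPU examples above mainly in its resource block:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-gpu
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      args:
        - "--model_id"
        - "meta-llama/Llama-3.1-8B-Instruct"
      resources:
        requests:
          cpu: "8"
          memory: 24Gi
          nvidia.com/gpu: "1"   # requires the NVIDIA device plugin (assumption)
        limits:
          cpu: "8"
          memory: 24Gi
          nvidia.com/gpu: "1"
```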
Production Configuration
Configure Autoscaling
KServe supports request-based autoscaling in Knative mode and metric-based autoscaling via KEDA in Standard mode.
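Whichever autoscaler you use, the underlying decision is the same control loop: scale to roughly ceil(observed load ÷ per-replica target), clamped to the configured replica bounds. A minimal sketch of that arithmetic (the function and its inputs are illustrative, not a KServe API):

```python
import math

def desired_replicas(observed_load: float, target_per_replica: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Clamp ceil(load / target) to the [min_replicas, max_replicas] bounds."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    raw = math.ceil(observed_load / target_per_replica)
    return max(min_replicas, min(max_replicas, raw))

# 42 concurrent requests at a target of 10 per replica, bounds 1..5
print(desired_replicas(42, 10, 1, 5))  # → 5
```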
cat <<EOF | kubectl apply -f -
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 5
    model:
      modelFormat:
        name: sklearn
      storageUri: gs://kfserving-examples/models/sklearn/1.0/model
EOF
Enable Model Caching
KServe v0.15 introduced LocalModelCache, reducing model loading times from 15–20 minutes to approximately one minute for large models.
cat <<EOF | kubectl apply -f -
apiVersion: serving.kserve.io/v1alpha1
kind: LocalModelCache
metadata:
  name: llm-cache
spec:
  sourceModelUri: hf://Qwen/Qwen2.5-0.5B-Instruct
  nodeGroups:
    - default
EOF
Resource Defaults
Customize global resource defaults for all InferenceServices to prevent models from exceeding your VPS limits.
kubectl edit configmap inferenceservice-config -n kserve
# Key settings in the inferenceService section:
# "cpuLimit": "2" — Max CPU per model container
# "memoryLimit": "4Gi" — Max memory per model container
# "cpuRequest": "1" — Default CPU request
# "memoryRequest": "2Gi" — Default memory request
Monitoring & Troubleshooting
Health Checks
# Check overall KServe status
kubectl get inferenceservices --all-namespaces
# View detailed model status
kubectl describe inferenceservice sklearn-iris
# Check KServe controller logs
kubectl logs -n kserve -l control-plane=kserve-controller-manager -c manager
Common Issues
| Problem | Cause | Solution |
|---|---|---|
| InferenceService stuck in Unknown | Storage init container failing | Check storage-initializer logs |
| CrashLoopBackOff on predictor | Insufficient memory | Increase memory limits |
| Model download timeout | Slow network or large model | Use LocalModelCache |
| 502 Bad Gateway | Model not yet ready | Wait for READY=True status |
| Webhook errors during install | Cert Manager not ready | Verify cert-manager pods |
Debug Commands
# Check pod events for errors
kubectl get events --sort-by=.lastTimestamp
# Inspect storage init container logs
kubectl logs <pod-name> -c storage-initializer
# Check Gateway resources
kubectl get gateways,httproutes -A
# View model server logs
kubectl logs <pod-name> -c kserve-container
Supported ML Frameworks
KServe provides built-in serving runtimes for all major machine learning frameworks. Each runtime supports the V2 inference protocol.
| Framework | Model Format | Use Case |
|---|---|---|
| Scikit-Learn | sklearn | Classification, regression, clustering |
| XGBoost | xgboost | Gradient boosting models |
| TensorFlow | tensorflow | Deep learning models |
| PyTorch (TorchServe) | pytorch | Custom deep learning |
| ONNX | onnxruntime | Cross-framework optimized |
| Hugging Face | huggingface | LLMs, NLP, transformers |
| vLLM | vllm | High-throughput LLM serving |
| Custom Containers | custom | Any framework |
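A V2-protocol request wraps each input as a named tensor with an explicit shape and datatype. A minimal sketch of building such a body for a tabular model like the iris classifier (field names follow the protocol; the tensor name `input-0` is an assumption, as runtimes differ):

```python
import json

def v2_infer_request(name: str, data: list[list[float]]) -> str:
    """Build a V2 (Open Inference) protocol request body for a 2-D float input."""
    rows, cols = len(data), len(data[0])
    body = {
        "inputs": [{
            "name": name,       # tensor name expected by the serving runtime
            "shape": [rows, cols],
            "datatype": "FP64",
            "data": data,
        }]
    }
    return json.dumps(body)

payload = v2_infer_request("input-0", [[6.8, 2.8, 4.8, 1.4], [6.0, 3.4, 4.5, 1.6]])
print(payload)
```

You would POST this payload to the runtime's /v2/models/<name>/infer endpoint instead of the V1 :predict path shown earlier.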
Canary Deployments
KServe supports canary rollouts for safely deploying model updates. Split traffic between current and new model versions, gradually shifting as you gain confidence.
cat <<EOF | kubectl apply -f -
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
spec:
  predictor:
    canaryTrafficPercent: 20
    model:
      modelFormat:
        name: sklearn
      storageUri: gs://kfserving-examples/models/sklearn/2.0/model
EOF
This routes 20% of traffic to the new model version while 80% continues serving from the stable revision. Increase the canary percentage incrementally as you validate performance.
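Because the split is proportional, you can estimate how many of N requests each revision will serve; a trivial sketch of that arithmetic (an illustrative helper, not part of KServe):

```python
def canary_split(total_requests: int, canary_percent: int) -> tuple[int, int]:
    """Return (stable, canary) request counts for a given canaryTrafficPercent."""
    canary = total_requests * canary_percent // 100
    return total_requests - canary, canary

# With canaryTrafficPercent: 20, roughly 1 in 5 requests hits the new revision
print(canary_split(1000, 20))  # → (800, 200)
```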
Cleanup
# Delete InferenceServices
kubectl delete inferenceservice --all
# Uninstall KServe
helm uninstall kserve
helm uninstall kserve-crd
# Uninstall Envoy Gateway
helm uninstall eg -n envoy-gateway-system
# Uninstall Cert Manager
kubectl delete -f \
https://github.com/cert-manager/cert-manager/releases/download/v1.16.3/cert-manager.yaml
KServe Deployed Successfully!
Your KServe environment is now running on a RamNode VPS with K3s Kubernetes, Envoy Gateway traffic management, Cert Manager TLS, and production-ready model serving. RamNode's flexible VPS infrastructure provides the compute resources that ML inference workloads demand — starting at just $4/month.
