Prerequisites
Choose your RamNode plan based on your workload type. Predictive inference (scikit-learn, XGBoost) runs well on mid-range VPS plans, while generative AI and LLM serving require more resources.
| Resource | Predictive AI | Small LLMs | Large LLMs |
|---|---|---|---|
| CPU Cores | 4+ | 8+ | 16+ |
| RAM | 8 GB | 32 GB | 64+ GB |
| Storage | 40 GB SSD | 100 GB NVMe | 200+ GB NVMe |
| OS | Ubuntu 24.04 | Ubuntu 24.04 | Ubuntu 24.04 |
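If you want a quick pre-flight check, the table above can be encoded as a lookup; in this sketch the tier keys and helper name are shorthand invented here, while the thresholds come straight from the table:

```python
# Minimum requirements per workload tier, taken from the sizing table above.
PLAN_REQUIREMENTS = {
    "predictive": {"cpu_cores": 4, "ram_gb": 8, "storage_gb": 40},
    "small_llm": {"cpu_cores": 8, "ram_gb": 32, "storage_gb": 100},
    "large_llm": {"cpu_cores": 16, "ram_gb": 64, "storage_gb": 200},
}

def meets_requirements(workload: str, cpu_cores: int, ram_gb: int, storage_gb: int) -> bool:
    """Return True if the given VPS specs satisfy the workload tier's minimums."""
    req = PLAN_REQUIREMENTS[workload]
    return (cpu_cores >= req["cpu_cores"]
            and ram_gb >= req["ram_gb"]
            and storage_gb >= req["storage_gb"])

# An 8-core / 32 GB / 100 GB NVMe VPS can serve small LLMs but not large ones
print(meets_requirements("small_llm", 8, 32, 100))   # → True
print(meets_requirements("large_llm", 8, 32, 100))   # → False
```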
Software Prerequisites
- curl — pre-installed on Ubuntu 24.04
- kubectl — Kubernetes CLI for cluster management
- Helm 3 — Kubernetes package manager
- A running Kubernetes cluster — we will install K3s in this guide
Initial Server Setup
apt update && apt upgrade -y
apt install -y curl wget git apt-transport-https ca-certificates
Configure Firewall
ufw allow 22/tcp # SSH
ufw allow 6443/tcp # Kubernetes API
ufw allow 80/tcp # HTTP inference endpoints
ufw allow 443/tcp # HTTPS inference endpoints
ufw allow 8080/tcp # KServe model endpoints
ufw allow 10250/tcp # Kubelet
ufw --force enable
Set Kernel Parameters
cat <<EOF >> /etc/sysctl.d/99-kubernetes.conf
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward = 1
EOF
modprobe br_netfilter
sysctl --system
Install K3s Kubernetes
K3s is a lightweight, certified Kubernetes distribution ideal for VPS deployments. It bundles containerd, CoreDNS, and other essentials in a single binary. We disable Traefik because KServe manages traffic routing through Gateway API.
curl -sfL https://get.k3s.io | sh -s - \
--write-kubeconfig-mode 644 \
--disable traefik
kubectl get nodes
# Expected output:
# NAME      STATUS   ROLES                  AGE   VERSION
# ramnode   Ready    control-plane,master   1m    v1.32.x
mkdir -p ~/.kube
cp /etc/rancher/k3s/k3s.yaml ~/.kube/config
export KUBECONFIG=~/.kube/config
echo "export KUBECONFIG=~/.kube/config" >> ~/.bashrc
Install Helm
curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
helm version
Install Cert Manager
Cert Manager handles TLS certificate provisioning for KServe's webhook server. The minimum required version is 1.15.0.
kubectl apply -f \
https://github.com/cert-manager/cert-manager/releases/download/v1.16.3/cert-manager.yaml
# Wait for pods to be ready
kubectl wait --for=condition=ready pod -l app=cert-manager \
-n cert-manager --timeout=120s
kubectl get pods -n cert-manager
# All three pods should show Running status:
# cert-manager
# cert-manager-cainjector
# cert-manager-webhook
Install Gateway API & Envoy Gateway
KServe recommends Gateway API for network configuration, providing flexible and standardized traffic management — especially important for generative AI workloads with streaming responses and long-lived connections.
kubectl apply -f \
https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.2.1/standard-install.yaml
helm install eg oci://docker.io/envoyproxy/gateway-helm \
--version v1.3.0 \
-n envoy-gateway-system \
--create-namespace
# Wait for Envoy Gateway to be ready
kubectl wait --for=condition=ready pod \
-l app.kubernetes.io/name=gateway-helm \
-n envoy-gateway-system --timeout=120s
cat <<EOF | kubectl apply -f -
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: envoy
spec:
  controllerName: gateway.envoyproxy.io/gatewayclass-controller
EOF
Install KServe
KServe supports three deployment modes. For RamNode VPS deployments, we recommend Standard mode for full resource control.
| Mode | Best For | Notes |
|---|---|---|
| Standard (recommended) | GPU models, LLMs, all workloads | Full resource control |
| Knative (Serverless) | Burst traffic, scale-to-zero | Higher overhead |
| ModelMesh | High-density multi-model | Many small models |
Install with Helm (Recommended)
# Install KServe CRDs
helm install kserve-crd oci://ghcr.io/kserve/charts/kserve-crd \
--version v0.15.2
# Install KServe in Standard deployment mode
helm install kserve oci://ghcr.io/kserve/charts/kserve \
--version v0.15.2 \
--set kserve.controller.deploymentMode=Standard \
--set kserve.controller.gateway.ingressGateway.enableGatewayApi=true \
--set kserve.controller.gateway.ingressGateway.kserveGateway=kserve/kserve-ingress-gateway
Alternative: Install with kubectl
# Install KServe CRDs and Controller
kubectl apply --server-side -f \
https://github.com/kserve/kserve/releases/download/v0.15.2/kserve.yaml
# Install built-in ClusterServingRuntimes
kubectl apply --server-side -f \
https://github.com/kserve/kserve/releases/download/v0.15.2/kserve-cluster-resources.yaml
# Patch config for Standard deployment mode
kubectl patch configmap/inferenceservice-config -n kserve \
--type=strategic -p '{"data": {"deploy": "{\"defaultDeploymentMode\": \"Standard\"}"}}'
The --server-side flag is required when applying KServe CRDs due to the large size of the InferenceService custom resource definition.
kubectl get pods -n kserve
# Expected output:
# NAME                          READY   STATUS    AGE
# kserve-controller-manager-0   2/2     Running   2m
kubectl get clusterservingruntimes
# Should list built-in runtimes: kserve-huggingfaceserver,
# kserve-sklearnserver, kserve-xgbserver, kserve-torchserve, etc.
Deploy Your First Models
Example 1: Scikit-Learn Iris Classifier
The simplest way to verify your KServe installation. This deploys a pre-trained scikit-learn model for iris flower classification.
cat <<EOF | kubectl apply -f -
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      protocolVersion: v2
      storageUri: gs://kfserving-examples/models/sklearn/1.0/model
      resources:
        requests:
          cpu: "1"
          memory: 2Gi
        limits:
          cpu: "1"
          memory: 2Gi
EOF
# Wait for READY=True
kubectl get inferenceservice sklearn-iris
# Create test input
cat <<EOF > /tmp/iris-input.json
{"instances": [[6.8, 2.8, 4.8, 1.4], [6.0, 3.4, 4.5, 1.6]]}
EOF
# Get the service URL
SERVICE_HOSTNAME=$(kubectl get inferenceservice sklearn-iris \
-o jsonpath='{.status.url}' | cut -d/ -f3)
# Send prediction request
curl -v -H "Host: ${SERVICE_HOSTNAME}" \
http://localhost:8080/v1/models/sklearn-iris:predict \
-d @/tmp/iris-input.json
Example 2: Hugging Face LLM (Qwen 2.5)
For generative AI, KServe provides native Hugging Face integration with OpenAI-compatible chat completion APIs.
cat <<EOF | kubectl apply -f -
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: qwen-llm
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      args:
        - "--model_id"
        - "Qwen/Qwen2.5-0.5B-Instruct"
      resources:
        requests:
          cpu: "4"
          memory: 8Gi
        limits:
          cpu: "4"
          memory: 8Gi
EOF
SERVICE_HOSTNAME=$(kubectl get inferenceservice qwen-llm \
-o jsonpath='{.status.url}' | cut -d/ -f3)
curl -v -H "Host: ${SERVICE_HOSTNAME}" \
http://localhost:8080/openai/v1/chat/completions \
-d '{
"model": "qwen-llm",
"messages": [{"role": "user", "content": "Hello, what can you do?"}],
"max_tokens": 100
}'
🚀 GPU-Accelerated LLMs
For larger models like Llama 3.1 8B, add nvidia.com/gpu: "1" to your resource limits and ensure your RamNode VPS has GPU support configured. KServe v0.15 also supports multi-node inference for models that exceed single-GPU memory.
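As an illustrative sketch only (the resource sizes are assumptions, and requesting nvidia.com/gpu presumes the NVIDIA device plugin is running on the node), a GPU-backed predictor differs from the CPU examples above mainly in its resource block:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-gpu
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      args:
        - "--model_id"
        - "meta-llama/Llama-3.1-8B-Instruct"
      resources:
        requests:
          cpu: "8"
          memory: 24Gi
          nvidia.com/gpu: "1"   # requires the NVIDIA device plugin (assumption)
        limits:
          cpu: "8"
          memory: 24Gi
          nvidia.com/gpu: "1"
```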
Production Configuration
Configure Autoscaling
KServe supports request-based autoscaling in Knative mode and metric-based autoscaling via KEDA in Standard mode.
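Whichever autoscaler you use, the underlying decision is the same control loop: scale to roughly ceil(observed load ÷ per-replica target), clamped to the configured replica bounds. A minimal sketch of that arithmetic (the function and its inputs are illustrative, not a KServe API):

```python
import math

def desired_replicas(observed_load: float, target_per_replica: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Clamp ceil(load / target) to the [min_replicas, max_replicas] bounds."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    raw = math.ceil(observed_load / target_per_replica)
    return max(min_replicas, min(max_replicas, raw))

# 42 concurrent requests at a target of 10 per replica, bounds 1..5
print(desired_replicas(42, 10, 1, 5))  # → 5
```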
cat <<EOF | kubectl apply -f -
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 5
    model:
      modelFormat:
        name: sklearn
      storageUri: gs://kfserving-examples/models/sklearn/1.0/model
EOF
Enable Model Caching
KServe v0.15 introduced LocalModelCache, reducing model loading times from 15–20 minutes to approximately one minute for large models.
cat <<EOF | kubectl apply -f -
apiVersion: serving.kserve.io/v1alpha1
kind: LocalModelCache
metadata:
  name: llm-cache
spec:
  sourceModelUri: hf://Qwen/Qwen2.5-0.5B-Instruct
  nodeGroups:
    - default
EOF
Resource Defaults
Customize global resource defaults for all InferenceServices to prevent models from exceeding your VPS limits.
kubectl edit configmap inferenceservice-config -n kserve
# Key settings in the inferenceService section:
# "cpuLimit": "2" — Max CPU per model container
# "memoryLimit": "4Gi" — Max memory per model container
# "cpuRequest": "1" — Default CPU request
# "memoryRequest": "2Gi" — Default memory request
Monitoring & Troubleshooting
Health Checks
# Check overall KServe status
kubectl get inferenceservices --all-namespaces
# View detailed model status
kubectl describe inferenceservice sklearn-iris
# Check KServe controller logs
kubectl logs -n kserve -l control-plane=kserve-controller-manager -c manager
Common Issues
| Problem | Cause | Solution |
|---|---|---|
| InferenceService stuck in Unknown | Storage init container failing | Check storage-initializer logs |
| CrashLoopBackOff on predictor | Insufficient memory | Increase memory limits |
| Model download timeout | Slow network or large model | Use LocalModelCache |
| 502 Bad Gateway | Model not yet ready | Wait for READY=True status |
| Webhook errors during install | Cert Manager not ready | Verify cert-manager pods |
Debug Commands
# Check pod events for errors
kubectl get events --sort-by=.lastTimestamp
# Inspect storage init container logs
kubectl logs <pod-name> -c storage-initializer
# Check Gateway resources
kubectl get gateways,httproutes -A
# View model server logs
kubectl logs <pod-name> -c kserve-container
Supported ML Frameworks
KServe provides built-in serving runtimes for all major machine learning frameworks. Each runtime supports the V2 inference protocol.
| Framework | Model Format | Use Case |
|---|---|---|
| Scikit-Learn | sklearn | Classification, regression, clustering |
| XGBoost | xgboost | Gradient boosting models |
| TensorFlow | tensorflow | Deep learning models |
| PyTorch (TorchServe) | pytorch | Custom deep learning |
| ONNX | onnxruntime | Cross-framework optimized |
| Hugging Face | huggingface | LLMs, NLP, transformers |
| vLLM | vllm | High-throughput LLM serving |
| Custom Containers | custom | Any framework |
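A V2-protocol request wraps each input as a named tensor with an explicit shape and datatype. A minimal sketch of building such a body for a tabular model like the iris classifier (field names follow the protocol; the tensor name `input-0` is an assumption, as runtimes differ):

```python
import json

def v2_infer_request(name: str, data: list[list[float]]) -> str:
    """Build a V2 (Open Inference) protocol request body for a 2-D float input."""
    rows, cols = len(data), len(data[0])
    body = {
        "inputs": [{
            "name": name,       # tensor name expected by the serving runtime
            "shape": [rows, cols],
            "datatype": "FP64",
            "data": data,
        }]
    }
    return json.dumps(body)

payload = v2_infer_request("input-0", [[6.8, 2.8, 4.8, 1.4], [6.0, 3.4, 4.5, 1.6]])
print(payload)
```

You would POST this payload to the runtime's /v2/models/<name>/infer endpoint instead of the V1 :predict path shown earlier.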
Canary Deployments
KServe supports canary rollouts for safely deploying model updates. Split traffic between current and new model versions, gradually shifting as you gain confidence.
cat <<EOF | kubectl apply -f -
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
spec:
  predictor:
    canaryTrafficPercent: 20
    model:
      modelFormat:
        name: sklearn
      storageUri: gs://kfserving-examples/models/sklearn/2.0/model
EOF
This routes 20% of traffic to the new model version while 80% continues serving from the stable revision. Increase the canary percentage incrementally as you validate performance.
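Because the split is proportional, you can estimate how many of N requests each revision will serve; a trivial sketch of that arithmetic (an illustrative helper, not part of KServe):

```python
def canary_split(total_requests: int, canary_percent: int) -> tuple[int, int]:
    """Return (stable, canary) request counts for a given canaryTrafficPercent."""
    canary = total_requests * canary_percent // 100
    return total_requests - canary, canary

# With canaryTrafficPercent: 20, roughly 1 in 5 requests hits the new revision
print(canary_split(1000, 20))  # → (800, 200)
```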
Cleanup
# Delete InferenceServices
kubectl delete inferenceservice --all
# Uninstall KServe
helm uninstall kserve
helm uninstall kserve-crd
# Uninstall Envoy Gateway
helm uninstall eg -n envoy-gateway-system
# Uninstall Cert Manager
kubectl delete -f \
https://github.com/cert-manager/cert-manager/releases/download/v1.16.3/cert-manager.yaml
KServe Deployed Successfully!
Your KServe environment is now running on a RamNode VPS with K3s Kubernetes, Envoy Gateway traffic management, Cert Manager TLS, and production-ready model serving. RamNode's flexible VPS infrastructure provides the compute resources that ML inference workloads demand — starting at just $4/month.
