AI / ML Guide

    Deploy BentoML

    Build, serve, and deploy production-grade AI model inference APIs. Self-host on RamNode's VPS hosting with full control over your infrastructure and data.

    Python 3.11
    BentoML
    Docker
    Nginx
    Let's Encrypt SSL

    1. What Is BentoML?

    BentoML is an open-source Python framework for building production-ready AI inference APIs. It lets you turn any machine learning model — from text summarization to image generation — into a scalable REST API with just a few lines of code. BentoML handles dynamic batching, model parallelism, Docker containerization, and environment management out of the box.

    By deploying BentoML on a RamNode VPS, you maintain full control over your infrastructure, data, and costs — no vendor lock-in to managed ML platforms. This makes it ideal for developers, startups, and AI-focused teams who want affordable, high-performance model serving.

    2. Prerequisites

    Recommended VPS Specifications

    Requirement         Details
    VPS Plan            RamNode Cloud VPS — 4 GB RAM minimum (8 GB+ for larger models)
    Operating System    Ubuntu 24.04 LTS
    CPU                 2+ vCPUs (4+ recommended for inference workloads)
    Storage             40 GB+ SSD (model artifacts can be large)
    Python              3.9 or higher (3.11 recommended)
    Network             Public IPv4 address with SSH access

    💡 RamNode Pricing Advantage: RamNode Cloud VPS plans start at just $4/month with $500 in annual credits — significantly cheaper than managed ML platforms that charge per-inference or per-GPU-hour.

    What You'll Need

    • A provisioned RamNode VPS with Ubuntu 24.04 LTS installed
    • SSH access to your server (root or sudo user)
    • A domain name (optional, for HTTPS/reverse proxy)
    • Basic familiarity with Python and the Linux command line

    3. Initial Server Setup

    Connect and Update

    SSH and update system
    ssh root@YOUR_SERVER_IP
    apt update && apt upgrade -y
    reboot      # if kernel was updated

    Create a Dedicated User

    Running services as root is a security risk. Create a dedicated user for BentoML:

    Create bentoml user
    adduser bentoml
    usermod -aG sudo bentoml
    su - bentoml

    Configure the Firewall

    Enable UFW
    sudo ufw allow OpenSSH
    sudo ufw allow 3000/tcp   # BentoML default port
    sudo ufw enable
    sudo ufw status

    ⚠️ Security Note: Port 3000 should only be exposed directly during development. In production, place BentoML behind a reverse proxy (Nginx) with HTTPS — covered in the Production Hardening section.

    4. Python Environment Setup

    Install Python 3.11 and Dependencies

    Ubuntu 24.04 ships with Python 3.12 and does not include Python 3.11 in its default repositories, so add the deadsnakes PPA first (BentoML works best with Python 3.11):

    Install Python 3.11
    sudo apt install -y software-properties-common
    sudo add-apt-repository -y ppa:deadsnakes/ppa
    sudo apt update
    sudo apt install -y python3.11 python3.11-venv python3.11-dev \
      build-essential curl git

    Create a Virtual Environment

    Set up venv
    python3.11 -m venv ~/bentoml-env
    source ~/bentoml-env/bin/activate
    
    # Verify Python version
    python --version    # Should output Python 3.11.x
    
    # Auto-activate on login
    echo "source ~/bentoml-env/bin/activate" >> ~/.bashrc

    Install BentoML

    Install from PyPI
    pip install --upgrade pip
    pip install bentoml
    
    # Verify the installation
    bentoml --version
    bentoml env

    5. Build Your First BentoML Service

    Create the Project Directory

    Scaffold project
    mkdir ~/my-bento-service && cd ~/my-bento-service

    Write the Service

    Create service.py with a text summarization service powered by Hugging Face Transformers:

    service.py
    # service.py
    import bentoml

    @bentoml.service(
        image=bentoml.images.Image(python_version="3.11")
            .python_packages("torch", "transformers"),
        resources={"cpu": "2"},
        traffic={"timeout": 120},
    )
    class Summarization:
        def __init__(self) -> None:
            import torch
            from transformers import pipeline

            device = "cuda" if torch.cuda.is_available() else "cpu"
            self.pipeline = pipeline(
                "summarization",
                model="sshleifer/distilbart-cnn-12-6",
                device=device,
            )

        @bentoml.api(batchable=True)
        def summarize(self, texts: list[str]) -> list[str]:
            # Batch of input documents in, one summary string out per document
            results = self.pipeline(texts)
            return [item["summary_text"] for item in results]

    💡 Model Choice: The distilbart-cnn-12-6 model is lightweight (~1.2 GB) and runs well on CPU — perfect for testing on a standard RamNode VPS. For GPU-accelerated inference, consider a plan with NVIDIA GPU support.

    Install Model Dependencies

    Install ML libraries
    pip install torch transformers

    Serve Locally

    Start the dev server
    bentoml serve service:Summarization
    # Serves at http://localhost:3000
    # Swagger UI at http://localhost:3000/docs

    Test the API

    Test with curl
    curl -X POST http://localhost:3000/summarize \
      -H "Content-Type: application/json" \
      -d '{"texts": ["BentoML is a Python library for building online serving systems optimized for AI apps and model inference. It lets you easily build APIs for any AI/ML model and simplifies Docker container management."]}'
    Test with Python client
    import bentoml

    with bentoml.SyncHTTPClient("http://localhost:3000") as client:
        result = client.summarize(
            texts=["BentoML simplifies deploying ML models to production."]
        )
        print(result)

    6. Containerize with Docker

    BentoML can package your service into a portable Docker image — ideal for reproducible deployments and scaling.

    Install Docker

    Install Docker
    sudo apt install -y docker.io
    sudo systemctl enable --now docker
    sudo usermod -aG docker bentoml
    newgrp docker

    Create the Bento Build File

    bentofile.yaml
    # bentofile.yaml
    service: 'service:Summarization'
    labels:
      owner: my-team
      project: summarization-api
    include:
      - '*.py'
    python:
      packages:
        - torch
        - transformers
    docker:
      python_version: "3.11"
      distro: debian

    Build and Containerize

    Build Bento and Docker image
    # Build the Bento
    bentoml build
    
    # List available Bentos
    bentoml list
    
    # Containerize into a Docker image
    bentoml containerize summarization:latest
    
    # Verify the image
    docker images | grep summarization

    Run the Docker Container

    Run container
    docker run -d \
      --name bentoml-summarization \
      -p 3000:3000 \
      --restart unless-stopped \
      summarization:latest

    7. Systemd Service (Non-Docker)

    If you prefer running BentoML directly without Docker, create a systemd unit file for automatic startup and process management:

    Create /etc/systemd/system/bentoml.service
    [Unit]
    Description=BentoML Inference Server
    After=network.target
    
    [Service]
    Type=simple
    User=bentoml
    WorkingDirectory=/home/bentoml/my-bento-service
    Environment=PATH=/home/bentoml/bentoml-env/bin:/usr/bin:/bin
    ExecStart=/home/bentoml/bentoml-env/bin/bentoml serve service:Summarization --host 0.0.0.0 --port 3000
    Restart=always
    RestartSec=10
    
    [Install]
    WantedBy=multi-user.target
    Enable and start
    sudo systemctl daemon-reload
    sudo systemctl enable --now bentoml
    sudo systemctl status bentoml

    8. Production Hardening

    Nginx Reverse Proxy with SSL

    Never expose BentoML directly to the internet in production. Set up Nginx as a reverse proxy with Let's Encrypt SSL:

    Install Nginx and Certbot
    sudo apt install -y nginx certbot python3-certbot-nginx
    Create /etc/nginx/sites-available/bentoml
    server {
        listen 80;
        server_name api.yourdomain.com;
    
        location / {
            proxy_pass http://127.0.0.1:3000;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;
    
            # WebSocket support for streaming
            proxy_http_version 1.1;
            proxy_set_header Upgrade $http_upgrade;
            proxy_set_header Connection "upgrade";
    
            # Increase timeouts for model inference
            proxy_read_timeout 300s;
            proxy_send_timeout 300s;
        }
    }
    Enable site and obtain SSL
    sudo ln -s /etc/nginx/sites-available/bentoml /etc/nginx/sites-enabled/
    sudo nginx -t && sudo systemctl reload nginx
    
    # Obtain SSL certificate
    sudo certbot --nginx -d api.yourdomain.com
    
    # Lock down the firewall for production
    sudo ufw delete allow 3000/tcp
    sudo ufw allow 'Nginx Full'

    API Authentication

    Add a simple API key middleware for basic access control. Create auth.py and register the middleware with your service's ASGI app:

    auth.py
    # auth.py
    import os
    from starlette.middleware.base import BaseHTTPMiddleware
    from starlette.responses import JSONResponse
    
    API_KEY = os.environ.get('BENTOML_API_KEY', 'change-me')
    
    class APIKeyMiddleware(BaseHTTPMiddleware):
        async def dispatch(self, request, call_next):
            if request.url.path in ('/healthz', '/docs', '/schema'):
                return await call_next(request)
    
            api_key = request.headers.get('Authorization', '').replace('Bearer ', '')
            if api_key != API_KEY:
                return JSONResponse(
                    status_code=401,
                    content={"error": "Invalid API key"}
                )
            return await call_next(request)
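    One detail worth tightening: comparing secrets with `==` can leak timing information, so the standard library's `hmac.compare_digest` is preferable for the key check. A stdlib-only sketch of the comparison the middleware performs (the `is_authorized` helper name is illustrative, not part of BentoML):

```python
import hmac
import os

# Same default as auth.py above; override via the environment in production
API_KEY = os.environ.get("BENTOML_API_KEY", "change-me")

def is_authorized(authorization_header: str) -> bool:
    """Validate a 'Bearer <key>' header value against the configured API key,
    using a constant-time comparison to avoid timing side channels."""
    presented = authorization_header.removeprefix("Bearer ").strip()
    return hmac.compare_digest(presented, API_KEY)
```

    Swapping this into APIKeyMiddleware.dispatch replaces the `api_key != API_KEY` comparison with a call to is_authorized.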

    Resource Limits

    Configure BentoML's resource management to prevent OOM errors:

    Service decorator with limits
    @bentoml.service(
        resources={
            "cpu": "2",
            "memory": "2Gi",
        },
        traffic={
            "timeout": 120,
            "max_concurrency": 4,
        },
    )
    class Summarization:
        ...

    For Docker deployments, set container memory limits:

    Docker with memory limits
    docker run -d \
      --name bentoml-summarization \
      --memory=3g --memory-swap=4g \
      -p 3000:3000 \
      --restart unless-stopped \
      summarization:latest

    9. Health Checks & Log Management

    BentoML exposes a built-in health check endpoint at /healthz. Use it for monitoring and load balancer health checks:

    Health check
    curl http://localhost:3000/healthz
    # Returns: {"status": "ok"}
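    For simple external monitoring (say, from a cron job or another host), the same endpoint can be polled with a stdlib-only script. A minimal sketch, assuming the default port used throughout this guide:

```python
import urllib.error
import urllib.request

def check_health(url: str = "http://localhost:3000/healthz",
                 timeout: float = 5.0) -> bool:
    """Return True if the health endpoint answers with HTTP 200,
    False if the service is down, unreachable, or erroring."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False
```

    Pair this with cron and your alerting of choice, or with a systemd OnFailure= hook, to get notified when the service stops responding.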

    View Logs

    Log management
    # Systemd service logs
    journalctl -u bentoml -f --no-pager
    
    # Docker container logs
    docker logs -f bentoml-summarization

    10. Updating & Redeploying

    When you update your model or service code, follow this workflow:

    1. Update your service.py or model dependencies
    2. Rebuild the Bento: bentoml build
    3. Rebuild the Docker image: bentoml containerize summarization:latest
    4. Stop the old container: docker stop bentoml-summarization && docker rm bentoml-summarization
    5. Run the new container with the same docker run command from Step 6

    For zero-downtime deployments, consider running two containers behind your Nginx reverse proxy and switching traffic between them.
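    A sketch of that blue/green setup in Nginx (the second container's host port, 3001, is an assumption; map it with -p 3001:3000 when starting the standby container):

```nginx
# /etc/nginx/sites-available/bentoml: replace the hardcoded proxy_pass
# target with an upstream block so traffic can be switched in one place
upstream bentoml_backend {
    server 127.0.0.1:3000;      # "blue" container, currently live
    # server 127.0.0.1:3001;   # "green" container; swap the comments,
                               # then run `sudo nginx -s reload` to switch
}

server {
    listen 80;
    server_name api.yourdomain.com;

    location / {
        proxy_pass http://bentoml_backend;
    }
}
```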

    11. Troubleshooting

    Out of memory during model load

    Upgrade to a VPS with more RAM (8 GB+) or use a smaller model variant.

    Port 3000 already in use

    Check with sudo lsof -i :3000 — kill the process or change the port.

    bentoml command not found

    Ensure your virtual environment is activated: source ~/bentoml-env/bin/activate

    Slow first request

    Normal — the model loads into memory on the first request. Subsequent requests will be much faster.
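    To keep that first-request delay away from real users, fire one throwaway request right after each deploy. A stdlib sketch (the URL and payload shape follow this guide's summarization example; adjust both for your own service):

```python
import json
import urllib.error
import urllib.request

def warm_up(url: str = "http://localhost:3000/summarize") -> bool:
    """POST one tiny request so the model loads before real traffic arrives.
    Returns True on HTTP 200, False if the service is unreachable or errors."""
    payload = json.dumps({"texts": ["warm-up"]}).encode()
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    try:
        # Generous timeout: the first request includes model load time
        with urllib.request.urlopen(request, timeout=300) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False
```

    Call warm_up() at the end of your redeploy workflow, before switching production traffic to the new container.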

    Docker build fails

    Ensure Docker has enough disk space: docker system prune -a to clean up old images.

    SSL certificate renewal fails

    Verify your domain's DNS points to the VPS IP and port 80 is accessible.

    BentoML Deployed Successfully!

    Your BentoML AI inference API is now running in production on a RamNode VPS with Docker containerization, Nginx reverse proxy, SSL encryption, and API authentication.

    Next Steps:

    • Multi-model pipelines: Chain multiple models in a single service for RAG or multi-stage inference
    • GPU inference: Deploy on a GPU-equipped VPS for compute-heavy models (LLMs, Stable Diffusion)
    • Custom Docker images: Use BentoML's image API for optimized containers with specific CUDA versions
    • Horizontal scaling: Run multiple BentoML containers behind a load balancer for high-throughput workloads