Self-Hosted AI Stack Series
    Part 3 of 8

    RAG Pipeline — Qdrant + Document Ingestion

    Make your AI answer questions from your actual documents, codebase, and knowledge base.

    40 minutes
    4GB+ recommended
    Prerequisites

    Completed Parts 1–2 (Ollama + Open WebUI), Docker, basic Python familiarity

    Time to Complete

    35–45 minutes

    Recommended Plan

    4GB ($20/mo) minimum. 8GB ($40/mo) for larger document sets

    Introduction

    LLMs are powerful but they hallucinate and don't know your data. RAG (Retrieval-Augmented Generation) fixes this by feeding your actual documents into the LLM's context at query time. This part builds a production-grade pipeline — not a toy demo.

    💰 Cost comparison: Pinecone Serverless costs $70+/month for vector storage. Qdrant on your VPS: $0 — unlimited vectors, unlimited queries.

    RAG Architecture Overview

    RAG Pipeline Flow
    Documents → Chunking → Embedding → Vector DB → Query → Context Injection → LLM Response
    
    ┌──────────┐    ┌──────────┐    ┌──────────────┐    ┌─────────┐
    │  Upload   │───▸│ Chunking │───▸│  Embedding   │───▸│ Qdrant  │
    │  PDFs,    │    │ Split    │    │ nomic-embed  │    │ Vector  │
    │  Docs,    │    │ into     │    │ via Ollama   │    │   DB    │
    │  Code     │    │ segments │    │              │    │         │
    └──────────┘    └──────────┘    └──────────────┘    └────┬────┘
                                                             │
    ┌──────────┐    ┌──────────┐    ┌──────────────┐         │
    │   LLM    │◂───│ Context  │◂───│   Retrieve   │◂────────┘
    │ Response │    │ Inject   │    │  Top-K docs  │
    └──────────┘    └──────────┘    └──────────────┘

    The embedding model (nomic-embed-text) runs locally via Ollama — no external API calls needed.

    Pull the embedding model
    ollama pull nomic-embed-text

    Deploying Qdrant

    Add Qdrant to your AI stack:

    mkdir -p ~/ai-stack/qdrant && cd ~/ai-stack/qdrant
    docker-compose.yml
    version: "3.8"
    
    services:
      qdrant:
        image: qdrant/qdrant:latest
        container_name: qdrant
        restart: unless-stopped
        ports:
          - "6333:6333"
          - "6334:6334"
        environment:
          - QDRANT__SERVICE__API_KEY=your-qdrant-api-key-change-this
        volumes:
          - qdrant-data:/qdrant/storage
    
    volumes:
      qdrant-data:
    docker compose up -d

    Qdrant dashboard is available at http://your-server-ip:6333/dashboard.

    Why Qdrant over ChromaDB?

    FeatureQdrantChromaDB
    Performance at scaleExcellentDegrades
    FilteringAdvanced payload filteringBasic metadata
    Production readinessBuilt for productionBetter for prototyping
    Memory efficiencyOn-disk + quantizationPrimarily in-memory

    Document Ingestion Pipeline

    Create a Python-based ingestion script using LangChain:

    Install dependencies
    pip install langchain langchain-community qdrant-client pypdf2 pdfplumber
    ingest.py
    #!/usr/bin/env python3
    """Document ingestion pipeline for Qdrant + Ollama RAG."""
    
    import os
    from langchain_community.document_loaders import PyPDFLoader, TextLoader
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from langchain_community.embeddings import OllamaEmbeddings
    from qdrant_client import QdrantClient
    from qdrant_client.models import Distance, VectorParams, PointStruct
    import uuid
    
    # Configuration
    OLLAMA_URL = "http://localhost:11434"
    QDRANT_URL = "http://localhost:6333"
    QDRANT_API_KEY = "your-qdrant-api-key-change-this"
    COLLECTION_NAME = "documents"
    EMBEDDING_MODEL = "nomic-embed-text"
    CHUNK_SIZE = 500
    CHUNK_OVERLAP = 50
    
    # Initialize
    embeddings = OllamaEmbeddings(
        base_url=OLLAMA_URL,
        model=EMBEDDING_MODEL
    )
    qdrant = QdrantClient(url=QDRANT_URL, api_key=QDRANT_API_KEY)
    
    # Create collection if it doesn't exist
    collections = [c.name for c in qdrant.get_collections().collections]
    if COLLECTION_NAME not in collections:
        qdrant.create_collection(
            collection_name=COLLECTION_NAME,
            vectors_config=VectorParams(size=768, distance=Distance.COSINE)
        )
    
    # Text splitter
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=CHUNK_SIZE,
        chunk_overlap=CHUNK_OVERLAP,
        separators=["\n\n", "\n", ". ", " "]
    )
    
    def ingest_file(filepath: str):
        """Ingest a single file into the vector database."""
        ext = os.path.splitext(filepath)[1].lower()
        
        if ext == ".pdf":
            loader = PyPDFLoader(filepath)
        elif ext in [".txt", ".md", ".py", ".js", ".ts"]:
            loader = TextLoader(filepath)
        else:
            print(f"Skipping unsupported file: {filepath}")
            return
        
        docs = loader.load()
        chunks = splitter.split_documents(docs)
        
        for chunk in chunks:
            vector = embeddings.embed_query(chunk.page_content)
            qdrant.upsert(
                collection_name=COLLECTION_NAME,
                points=[PointStruct(
                    id=str(uuid.uuid4()),
                    vector=vector,
                    payload={
                        "text": chunk.page_content,
                        "source": filepath,
                        "page": chunk.metadata.get("page", 0)
                    }
                )]
            )
        
        print(f"Ingested {len(chunks)} chunks from {filepath}")
    
    if __name__ == "__main__":
        import sys
        for path in sys.argv[1:]:
            ingest_file(path)
    Usage
    # Ingest a single PDF
    python3 ingest.py /path/to/document.pdf
    
    # Ingest multiple files
    python3 ingest.py docs/*.pdf docs/*.md

    Chunking Strategy

    Chunk SizeOverlapBest For
    200–30030FAQ, short-form Q&A
    500 (default)50General documents, best balance
    1000–1500200Technical docs, code files

    Connecting RAG to Open WebUI

    Configure Open WebUI to use your Qdrant-based RAG pipeline:

    1. Navigate to Admin Panel → Settings → Documents
    2. Set the Embedding Model to nomic-embed-text
    3. Configure the vector database connection to Qdrant
    4. Set Top-K to 5 (retrieve 5 most relevant chunks)
    5. Set Similarity Threshold to 0.7 (filter out weak matches)

    Test by uploading a document through Open WebUI and asking questions about its content. Compare the quality of responses with and without RAG.

    Optimizing Retrieval Quality

    Troubleshooting Common Issues

    ProblemCauseFix
    Irrelevant resultsChunks too largeReduce chunk size to 300–500
    Missing contextChunks too smallIncrease chunk size, add overlap
    Hallucination despite RAGLow similarity thresholdIncrease threshold to 0.75+
    Slow retrievalLarge collectionEnable on-disk storage, add indexes

    Batch Ingestion Script

    A production-ready script that watches a directory for new documents:

    watch-and-ingest.sh
    #!/bin/bash
    # Watch a directory and auto-ingest new documents
    WATCH_DIR="/home/user/documents"
    LOG_FILE="/var/log/rag-ingestion.log"
    
    inotifywait -m -r -e create -e moved_to "$WATCH_DIR" |
    while read dir action file; do
        filepath="$dir$file"
        echo "$(date): New file detected: $filepath" >> "$LOG_FILE"
        python3 /home/user/ai-stack/rag/ingest.py "$filepath" >> "$LOG_FILE" 2>&1
    done
    sudo apt install -y inotify-tools
    chmod +x watch-and-ingest.sh

    What's Next?

    Your AI now answers from your actual documents — company wikis, technical docs, contracts, code. Zero data sent to third parties. In Part 4: AnythingLLM, we'll put this power in the hands of non-technical team members with a no-code AI app builder.