{"id":6022,"date":"2026-04-13T17:27:52","date_gmt":"2026-04-13T16:27:52","guid":{"rendered":"https:\/\/upcloud.com\/global\/?post_type=tutorial&#038;p=6022"},"modified":"2026-04-13T17:27:52","modified_gmt":"2026-04-13T16:27:52","slug":"building-self-hosted-rag-system-open-source-tools","status":"publish","type":"tutorial","link":"https:\/\/upcloud.com\/global\/resources\/tutorials\/building-self-hosted-rag-system-open-source-tools\/","title":{"rendered":"Building a Self-Hosted RAG System with Open-Source Tools"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">Large language models are trained on a fixed dataset and don\u2019t have access to new or private data by default. This makes them unreliable for up-to-date or context-specific questions. <em>Retrieval-Augmented Generation<\/em> (RAG) addresses this limitation: developers combine a language model with vector search so applications can answer questions using their own documents.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Introduction<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Most tutorials show this workflow using hosted APIs for embeddings, vector search, and model inference because it allows a working prototype to appear within minutes. And while that convenience works well early on, teams often begin to reconsider the architecture once real usage starts. Request-based pricing grows with traffic, documents move through several external services, and core application logic ends up tied to specific AI platforms. It also becomes harder to control where data is processed, how it is stored, and how portable the system is across environments.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">As a result, open-source tooling and self-hosted stacks become more relevant. Teams start looking for ways to keep sensitive data within their own infrastructure, reduce reliance on vendor-specific services, and retain the flexibility to move across providers. 
That shift leads to a practical question in engineering discussions: can the same system run using open-source tools on infrastructure the team controls?<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The answer today is increasingly yes. Modern inference servers expose OpenAI-compatible APIs, embedding models run comfortably on standard machines, and PostgreSQL can perform vector search through the pgvector extension. With object storage holding source documents and a small API service orchestrating retrieval and generation, the entire RAG pipeline can run on ordinary infrastructure rather than managed AI platforms. This reduces vendor lock-in and cost variability, but it also means taking on responsibility for operating and scaling the system, and performance may not match fully managed services without careful tuning or dedicated hardware.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">This tutorial walks through building that stack using open-source components on UpCloud. Instead of focusing on RAG theory, the goal is to show how realistic it is to run a minimal but practical system using an open LLM runtime, a local embedding model, PostgreSQL with pgvector, and object storage. By the end, you will have a working self-hosted RAG pipeline that mirrors the architecture many teams now deploy in production.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">What a Self-Hosted RAG Stack Looks Like<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Before diving into the implementation, it helps to understand the main components that make up a typical RAG system. Even in self-hosted environments, most deployments follow a similar structure with a few clearly defined layers.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">1. LLM Runtime<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">The language model runtime is responsible for generating answers. 
In self-hosted setups, this usually means running an open-source model behind an inference server that exposes an API.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Several runtimes now provide <em>OpenAI-compatible endpoints<\/em>, which allow applications to interact with local models using the same SDKs commonly used for hosted AI services. Popular options include <a href=\"https:\/\/docs.vllm.ai\/en\/latest\/\" target=\"_blank\" rel=\"noopener\">vLLM<\/a>, <a href=\"https:\/\/ollama.com\/\" target=\"_blank\" rel=\"noopener\">Ollama<\/a>, and <a href=\"https:\/\/github.com\/ggml-org\/llama.cpp\" target=\"_blank\" rel=\"noopener\">llama.cpp<\/a>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">vLLM is typically used for GPU-backed inference, while CPU-focused setups are better suited to runtimes like llama.cpp or Ollama. Running larger models (such as Mistral-7B) purely on CPU is possible but tends to be significantly slower, so CPU mode is often used for demos or small-scale setups rather than production deployments.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">2. Embedding Model<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Embeddings convert text into numerical vectors that capture semantic meaning. These vectors allow the system to find relevant information even when the wording of a query differs from the source document.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Common open models include <a href=\"https:\/\/huggingface.co\/BAAI\/bge-m3\" target=\"_blank\" rel=\"noopener\">BGE<\/a>, <a href=\"https:\/\/huggingface.co\/intfloat\/e5-large\" target=\"_blank\" rel=\"noopener\">E5<\/a>, and various sentence-transformer models. They can run locally using Python libraries such as <a href=\"https:\/\/huggingface.co\/sentence-transformers\" target=\"_blank\" rel=\"noopener\">sentence-transformers<\/a>, making them easy to integrate into ingestion pipelines or query processing workflows.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">3. 
Vector Search<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">The vector database stores embeddings and performs similarity searches to find relevant document chunks.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">There are many systems designed for this task, including <a href=\"https:\/\/qdrant.tech\/\" target=\"_blank\" rel=\"noopener\">Qdrant<\/a>, <a href=\"https:\/\/milvus.io\/\" target=\"_blank\" rel=\"noopener\">Milvus<\/a>, and <a href=\"https:\/\/opensearch.org\/\" target=\"_blank\" rel=\"noopener\">OpenSearch<\/a>. In this tutorial, we use <a href=\"https:\/\/github.com\/pgvector\/pgvector\" target=\"_blank\" rel=\"noopener\">pgvector<\/a>, a PostgreSQL extension that adds vector search capabilities directly to a relational database. It is widely supported, simple to deploy, and sufficient for many RAG workloads.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">4. Document Storage<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">The original documents used by the system are usually stored separately from the vector index. <a href=\"https:\/\/upcloud.com\/global\/products\/object-storage\/\">Object storage<\/a> works well for this layer because it handles large files efficiently and keeps the source material independent from derived embeddings.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">During ingestion, documents are retrieved from storage, split into smaller chunks, and converted into embeddings before being stored in the vector database. This design allows documents to be reprocessed later if the chunking strategy or embedding model changes.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">5. Retrieval and Application Layer<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">The final layer is a small application service that orchestrates the RAG workflow.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">This service receives user queries, generates query embeddings, searches the vector database, assembles relevant document context, and sends a prompt to the language model. 
Frameworks such as <a href=\"https:\/\/fastapi.tiangolo.com\/\" target=\"_blank\" rel=\"noopener\">FastAPI<\/a>, <a href=\"https:\/\/flask.palletsprojects.com\/\" target=\"_blank\" rel=\"noopener\">Flask<\/a>, or lightweight <a href=\"https:\/\/nodejs.org\/en\" target=\"_blank\" rel=\"noopener\">Node.js<\/a> APIs are commonly used to implement this layer.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Together, these five components form the core of most RAG systems.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">High-Level Architecture<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Now that you know what the core components are, it helps to see how they interact during a typical request. A RAG system works by retrieving relevant information from a document index and using that information to guide the language model\u2019s response.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">At a high level, the flow looks like this:<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" src=\"https:\/\/upcloud.com\/media\/rag-system-infrastructure.png\" alt=\"-\" class=\"wp-image-79001\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">When a user submits a question, the API service first converts the query into an embedding using the same model that was used during document ingestion. That embedding is then used to search the vector database for document chunks with similar semantic meaning.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The retrieved chunks are combined with the user\u2019s question to construct a prompt. This prompt is sent to the language model running on the inference server, which generates a response grounded in the retrieved information.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">This tutorial is intentionally minimal. It is designed to show the core mechanics of RAG without adding too many moving parts. 
In production systems, this pipeline is usually extended with better chunking strategies, reranking steps, caching, structured prompt templates, and monitoring around retrieval quality and latency.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Most teams also place a decision layer in front of retrieval, using either lightweight heuristics or an LLM to classify the query first. That layer can decide whether to trigger RAG at all, which index or tenant-specific corpus to search, and whether the request should be handled by a normal model response, a retrieval workflow, or another tool altogether. Let\u2019s now understand the infrastructure needed to build this setup.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Infrastructure Used in This Tutorial<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">With the architecture in place, the next step is choosing the infrastructure needed to run it. A self-hosted RAG system does not require an overly complex environment. In many cases, a small number of services are enough to support a fully functional pipeline.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The most important component is compute for running the language model inference server. 
Depending on the model size, this can run either on CPUs or GPUs.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">A typical CPU-based setup might look like this:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>4\u20138 vCPU<\/li>\n\n\n\n<li>16\u201332 GB RAM<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">This configuration is often sufficient for smaller instruction-tuned models such as Mistral or similar open models used in lightweight RAG systems.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">For higher throughput or larger models, a GPU instance can significantly improve inference performance:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>1\u00d7 NVIDIA GPU<\/li>\n\n\n\n<li>32 GB RAM<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">Next, the vector database stores embeddings and handles similarity search. In this tutorial, we will use PostgreSQL with the pgvector extension, which will allow vectors to be stored and queried directly inside a relational database. A <a href=\"https:\/\/upcloud.com\/global\/postgresql-managed-databases\/\">managed PostgreSQL service<\/a> simplifies setup while still keeping the architecture portable.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Document storage will be handled using <a href=\"https:\/\/upcloud.com\/global\/products\/object-storage\/\">object storage<\/a>, which holds the original files that will later be processed and indexed. Keeping source documents separate from embeddings will allow the ingestion pipeline to be rerun later if chunking strategies or embedding models change.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Finally, a small Python API service will act as the application layer. 
This service performs document ingestion, generates embeddings, executes vector searches, and sends prompts to the language model.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Together, these components will mirror the infrastructure used by many production RAG deployments.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Now, let\u2019s get started!<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Step 1: Provision the UpCloud Infrastructure<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Before setting up the RAG components, provision three resources on UpCloud:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>A VM for the application and model server<\/li>\n\n\n\n<li>A managed PostgreSQL database for pgvector<\/li>\n\n\n\n<li>An object storage bucket for source documents<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">Create an <strong>Ubuntu VM<\/strong> with enough compute for the API service and model inference.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Example configuration:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>4\u20138 vCPU<\/li>\n\n\n\n<li>16\u201332 GB RAM<\/li>\n\n\n\n<li>Ubuntu 22.04 LTS<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">Next, create an <strong>UpCloud Managed PostgreSQL<\/strong> instance. This database will store the document chunks and embeddings used during retrieval.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Finally, create an <strong>UpCloud Managed Object Storage bucket<\/strong>. This bucket will hold the documents that the RAG system will index. 
Name the bucket <code>rag-documents<\/code>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Next, you will upload three small text files to this bucket so the system has a simple knowledge base to work with.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Example Documents for the Tutorial<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Create the following files locally and upload them to your object storage bucket manually from the UpCloud dashboard:<\/p>\n\n\n\n<h4 class=\"wp-block-heading\"><code>pricing.txt<\/code><\/h4>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">Product Pricing Plans\n\nStarter Plan\nPrice: $19 per month\nIncludes:\n- 5 GB storage\n- 1 team member\n- Email support\n\nPro Plan\nPrice: $49 per month\nIncludes:\n- 50 GB storage\n- Up to 10 team members\n- Priority email support\n- API access\n\nEnterprise Plan\nPrice: Custom pricing\nIncludes:\n- Unlimited storage\n- Unlimited team members\n- Dedicated support\n- SLA guarantees\n<\/code><\/pre>\n\n\n\n<h4 class=\"wp-block-heading\"><code>faq.txt<\/code><\/h4>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">Frequently Asked Questions\n\nQ: How do I upgrade my plan?\nA: You can upgrade your plan from the billing section of the dashboard.\n\nQ: Do you offer refunds?\nA: Monthly subscriptions can be cancelled anytime but payments are not refunded.\n\nQ: Is there an API available?\nA: Yes. 
The Pro and Enterprise plans include API access.\n\nQ: Where is customer support available?\nA: Support is available through email for Starter users and priority email for Pro users.<\/code><\/pre>\n\n\n\n<h4 class=\"wp-block-heading\"><code>architecture.txt<\/code><\/h4>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">System Architecture Overview\n\nThe application runs on a cloud virtual machine.\n\nComponents:\n- FastAPI service that handles API requests\n- PostgreSQL database with pgvector for semantic search\n- Object storage for storing raw documents\n- vLLM server running an open source language model\n\nUser queries are embedded using a local embedding model.\nThe embedding is compared with stored vectors in PostgreSQL to retrieve relevant document chunks.\nThe retrieved chunks are then sent to the language model to generate the final response.<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">These files give the RAG system a small but predictable knowledge base. For example, the following questions should return grounded answers:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>\u201cWhat does the Pro plan include?\u201d<\/li>\n\n\n\n<li>\u201cDoes the service provide API access?\u201d<\/li>\n\n\n\n<li>\u201cWhat components make up the system architecture?\u201d<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">Once you build a working pipeline, you can replace these files with real documentation or internal datasets.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Here\u2019s how your bucket should look after you\u2019ve uploaded all the files:<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" src=\"https:\/\/upcloud.com\/media\/rag-documents-object-storage.png\" alt=\"-\" class=\"wp-image-79000\" \/><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">Step 2: Run an OpenAI-Compatible LLM Locally<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Next, SSH into your VM and install Python and its required dependencies on the 
UpCloud VM.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">apt install -y python3-pip python3-venv git gcc-12 g++-12 libnuma-dev libtcmalloc-minimal4 python3-dev\npython3 -m venv .venv\nsource .venv\/bin\/activate\npip install --upgrade pip build<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Since we\u2019re using a CPU-only server for this tutorial, we\u2019ll need to build vLLM from source, as the prebuilt vLLM wheels are meant for GPU-based VMs. To do that, clone the repository and install the CPU dependencies.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">git clone https:\/\/github.com\/vllm-project\/vllm.git\ncd vllm<\/code><\/pre>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">pip install -v -r requirements\/cpu-build.txt --extra-index-url https:\/\/download.pytorch.org\/whl\/cpu\npip install -v -r requirements\/cpu.txt --extra-index-url https:\/\/download.pytorch.org\/whl\/cpu\nVLLM_TARGET_DEVICE=cpu python -m build --wheel --no-isolation\npip install dist\/*.whl<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">The CPU runtime also expects certain memory and threading libraries to be preloaded. These improve memory allocation performance and ensure OpenMP threading works correctly.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">export VLLM_CPU_KVCACHE_SPACE=4\nexport OMP_NUM_THREADS=4\nexport MKL_NUM_THREADS=4\nTC_PATH=$(find \/ -iname 'libtcmalloc_minimal.so.4' 2>\/dev\/null | head -n 1)\nIOMP_PATH=$(find \/ -iname 'libiomp5.so' 2>\/dev\/null | head -n 1)\nexport LD_PRELOAD=\"$TC_PATH:$IOMP_PATH:$LD_PRELOAD\"<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Now start the inference server.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">vllm serve mistralai\/Mistral-7B-Instruct-v0.2 --max-model-len 4096<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">The reduced <code>--max-model-len<\/code> keeps memory usage manageable on smaller machines. 
Without it, the model attempts to reserve a very large KV cache based on its default 32k context window, which can exhaust RAM even on a 4-CPU \/ 32 GB VM.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Once started, vLLM exposes a local API server that follows the OpenAI API format. Your apps can now send <code>chat\/completions<\/code> or <code>completions<\/code> requests to the server just like they would with the OpenAI API.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Step 3: Set Up Vector Search with PostgreSQL and pgvector<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Next, to set up your vector database, first connect to the UpCloud PostgreSQL instance using the <code>psql<\/code> CLI and your public database URL:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">psql \"postgres:\/\/username:password@host:5432\/defaultdb\"<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Once connected, enable pgvector:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">CREATE EXTENSION IF NOT EXISTS vector;<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">And create the table that will store document chunks and embeddings:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">CREATE TABLE documents (\n  id BIGSERIAL PRIMARY KEY,\n  source_name TEXT NOT NULL,\n  chunk_index INTEGER NOT NULL,\n  content TEXT NOT NULL,\n  embedding VECTOR(768) NOT NULL\n);<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Here, 768 must match the dimensionality of the embedding model you use. 
If you switch to a model with a different embedding size, inserts will fail unless you update the schema accordingly.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">To make similarity search scale beyond very small datasets, add a vector index and analyze the table:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">CREATE INDEX ON documents USING ivfflat (embedding vector_cosine_ops);\nANALYZE documents;<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Note that pgvector\u2019s documentation recommends creating IVFFlat indexes after the table has some data, because the index clusters the existing rows at build time. If retrieval later returns fewer or less relevant rows than expected, rebuild the index once ingestion has finished (for example with <code>REINDEX TABLE documents;<\/code>), or drop it entirely for very small datasets.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Each row will eventually contain a chunk from <code>pricing.txt<\/code>, <code>faq.txt<\/code>, or <code>architecture.txt<\/code>.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Step 4: Build the Document Ingestion Pipeline<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Now that the database and object storage bucket are ready, the next step is to turn the uploaded text files into searchable vectors.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The ingestion flow looks like this:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">UpCloud Managed Object Storage\n            \u2193\n      Document Loader\n            \u2193\n          Chunking\n            \u2193\n      Embedding Model\n            \u2193\n   PostgreSQL + pgvector<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Install the required Python packages<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">If you have reconnected to the VM since installing vLLM, activate the virtual environment again first:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">source .venv\/bin\/activate<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Then install the packages needed for ingestion:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">pip install boto3 sentence-transformers psycopg[binary] python-dotenv<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Create a <code>.env<\/code> file<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Create a file named <code>.env<\/code> in your project directory:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code 
class=\"\">nano .env<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Add your database and object storage credentials:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">DATABASE_URL=postgres:\/\/username:password@host:5432\/defaultdb?sslmode=require\nS3_ENDPOINT=https:\/\/your-object-storage-endpoint\nS3_ACCESS_KEY=your-access-key\nS3_SECRET_KEY=your-secret-key\nS3_BUCKET=rag-documents<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Use the actual values from your UpCloud account. Also, make sure to add your VM\u2019s IP to the allowlist of your managed PostgreSQL instance before running the script.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Create the ingestion script<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Create <code>ingest.py<\/code>:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">nano ingest.py<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Paste the following code:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">import os\nfrom dotenv import load_dotenv\nimport psycopg\nimport boto3\nfrom sentence_transformers import SentenceTransformer\n\nload_dotenv(\".env\")\n\nDATABASE_URL = os.environ[\"DATABASE_URL\"]\nS3_ENDPOINT = os.environ[\"S3_ENDPOINT\"]\nS3_ACCESS_KEY = os.environ[\"S3_ACCESS_KEY\"]\nS3_SECRET_KEY = os.environ[\"S3_SECRET_KEY\"]\nS3_BUCKET = os.environ[\"S3_BUCKET\"]\n\nmodel = SentenceTransformer(\"BAAI\/bge-base-en-v1.5\")\n\ns3 = boto3.client(\n    \"s3\",\n    endpoint_url=S3_ENDPOINT,\n    aws_access_key_id=S3_ACCESS_KEY,\n    aws_secret_access_key=S3_SECRET_KEY,\n)\n\ndef load_document(key):\n    obj = s3.get_object(Bucket=S3_BUCKET, Key=key)\n    return obj[\"Body\"].read().decode(\"utf-8\")\n\ndef chunk_text(text, chunk_size=120, overlap=30):\n    words = text.split()\n    chunks = []\n    step = chunk_size - overlap\n\n    for i in range(0, len(words), step):\n        chunk = \" \".join(words[i:i + chunk_size])\n        if chunk.strip():\n            chunks.append(chunk)\n\n    return 
chunks\n\ndef ingest_document(key):\n    print(f\"Ingesting {key}...\")\n    text = load_document(key)\n    chunks = chunk_text(text)\n    embeddings = model.encode(chunks, normalize_embeddings=True, batch_size=32)\n\n    with psycopg.connect(DATABASE_URL, connect_timeout=5) as conn:\n        with conn.cursor() as cur:\n            for index, (chunk, embedding) in enumerate(zip(chunks, embeddings)):\n                vector_str = \"[\" + \",\".join(str(x) for x in embedding.tolist()) + \"]\"\n                cur.execute(\n                    \"\"\"\n                    INSERT INTO documents (source_name, chunk_index, content, embedding)\n                    VALUES (%s, %s, %s, %s::vector)\n                    \"\"\",\n                    (key, index, chunk, vector_str),\n                )\n        conn.commit()\n\nif __name__ == \"__main__\":\n    ingest_document(\"pricing.txt\")\n    ingest_document(\"faq.txt\")\n    ingest_document(\"architecture.txt\")<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">This script loads each document from object storage (<code>load_document<\/code>), splits it into overlapping word-based chunks (<code>chunk_text<\/code>), then embeds the chunks and inserts them into PostgreSQL (<code>ingest_document<\/code>). It repeats this for all three files in the bucket.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In the <code>ingest_document<\/code> step, <code>normalize_embeddings=True<\/code> scales each vector to unit length, which keeps cosine-based retrieval consistent. <code>batch_size=32<\/code> controls how many chunks are encoded at a time, which lets you trade memory usage against throughput on larger datasets.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Run the ingestion script<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Next, you need to run the script. 
To do that, run the following on your VM:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">python ingest.py<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">A successful run should look like this:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">Ingesting pricing.txt...\nIngesting faq.txt...\nIngesting architecture.txt...<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">At this point, the files stored in object storage have been converted into embeddings and inserted into PostgreSQL.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Step 5: Implement the Retrieval Step<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Now that the documents are indexed, the next step is querying PostgreSQL for the most relevant chunks when a user asks a question.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The retrieval flow looks like this:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Embed the user\u2019s question<\/li>\n\n\n\n<li>Compare that embedding against the stored vectors<\/li>\n\n\n\n<li>Return the closest chunks<\/li>\n<\/ol>\n\n\n\n<h3 class=\"wp-block-heading\">Create the retrieval module<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Create <code>retrieval.py<\/code>:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">nano retrieval.py<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Paste the following:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">import os\nfrom dotenv import load_dotenv\nimport psycopg\nfrom sentence_transformers import SentenceTransformer\n\nload_dotenv(\".env\")\n\nDATABASE_URL = os.environ[\"DATABASE_URL\"]\nmodel = SentenceTransformer(\"BAAI\/bge-base-en-v1.5\")\n\ndef retrieve_documents(query, limit=2):\n    # Normalize the query embedding the same way as during ingestion\n    query_embedding = model.encode(query, normalize_embeddings=True)\n    vector_str = \"[\" + \",\".join(str(x) for x in query_embedding.tolist()) + \"]\"\n\n    with psycopg.connect(DATABASE_URL, connect_timeout=5) as conn:\n        with conn.cursor() as cur:\n            cur.execute(\n                \"\"\"\n          
      SELECT content\n                FROM documents\n                ORDER BY embedding &lt;=> %s::vector\n                LIMIT %s\n                \"\"\",\n                (vector_str, limit),\n            )\n            results = cur.fetchall()\n\n    return [row[0] for row in results]<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">The <code>::vector<\/code> cast is important here. The query embedding is sent as a text literal such as <code>[0.1,0.2,...]<\/code>, and the cast tells PostgreSQL explicitly to parse it as a pgvector value so the similarity operator can be applied without relying on type inference.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The choice of distance operator also matters. pgvector\u2019s <code>&lt;-><\/code> operator uses Euclidean distance (L2), which measures straight-line distance between two vectors. That can work, but many text embedding models such as BGE are more commonly compared using cosine similarity, which focuses on how closely two vectors point in the same direction rather than how far apart they are in raw space.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Since the embeddings are normalized to unit length during ingestion and the index was created with <code>vector_cosine_ops<\/code>, cosine distance is the more appropriate choice here, so this example uses <code>&lt;=><\/code> instead of <code>&lt;-><\/code> for more reliable semantic retrieval.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Test retrieval on its own<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Create <code>test_retrieval.py<\/code>:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">nano test_retrieval.py<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Paste:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">from retrieval import retrieve_documents\n\nquery = \"What does the Pro plan include?\"\nresults = retrieve_documents(query)\n\nfor i, chunk in enumerate(results, 1):\n    print(f\"\\nResult {i}:\\n{chunk}\")<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Run it:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">python 
test_retrieval.py<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">If everything is working, the results should include the pricing content and possibly the FAQ content, since both mention API access and plan details:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/upcloud.com\/media\/python-test-retrieval-1024x573.png\" alt=\"-\" class=\"wp-image-78998\" \/><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">Step 6: Generate Answers with the LLM<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Once retrieval works, the next step is sending the retrieved chunks to the local vLLM server and asking the model to answer using that context.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Before continuing, make sure the vLLM server is running. If it is not, start it again in another terminal with the same <code>--max-model-len<\/code> setting as in Step 2 (in a new shell, re-export the environment variables from Step 2 first):<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">source .venv\/bin\/activate\nvllm serve mistralai\/Mistral-7B-Instruct-v0.2 --max-model-len 4096<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">By default, vLLM exposes an OpenAI-compatible API on port <code>8000<\/code>.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Install the OpenAI client library<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">If it is not already installed inside the virtual environment:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">pip install openai<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Create the generation module<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Create <code>generation.py<\/code>:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">nano generation.py<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Paste:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">from openai import OpenAI\n\nclient = OpenAI(\n    base_url=\"http:\/\/127.0.0.1:8000\/v1\",\n    api_key=\"dummy\",\n)\n\ndef generate_answer(query, context_chunks):\n    context = \"\\n\\n\".join(context_chunks)\n\n    prompt = 
f\"\"\"Context:\n{context}\n\nQuestion:\n{query}\n\nAnswer using only the context above. If the answer is not in the context, say you do not know.\n\"\"\"\n\n    response = client.chat.completions.create(\n        model=\"mistralai\/Mistral-7B-Instruct-v0.2\",\n        messages=[{\"role\": \"user\", \"content\": prompt}],\n        temperature=0.2,\n        max_tokens=300,\n    )\n\n    return response.choices[0].message.content<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Test retrieval and generation together<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Create <code>test_rag.py<\/code>:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">nano test_rag.py<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Paste:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">from retrieval import retrieve_documents\nfrom generation import generate_answer\n\nquery = \"What does the Pro plan include?\"\n\nchunks = retrieve_documents(query)\nanswer = generate_answer(query, chunks)\n\nprint(\"Retrieved chunks:\")\nfor i, chunk in enumerate(chunks, 1):\n    print(f\"\\nChunk {i}:\\n{chunk}\")\n\nprint(\"\\nFinal answer:\")\nprint(answer)<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Run it:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">python test_rag.py<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">A successful result should print the retrieved chunks followed by a grounded answer, such as:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/upcloud.com\/media\/rag-test-pythong-script-1024x776.png\" alt=\"-\" class=\"wp-image-78996\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">That confirms the retrieval and generation parts are now working together.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Step 7: Expose the End-to-End RAG API<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">With retrieval and generation working, the last step is wrapping everything in a small FastAPI application.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Install FastAPI and 
Uvicorn<\/strong><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Make sure the virtual environment is active:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">source .venv\/bin\/activate<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Then install the API dependencies:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">pip install fastapi uvicorn<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Create the API application<\/strong><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Create <code>app.py<\/code>:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">nano app.py<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">And paste the following code in it:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">from fastapi import FastAPI\nfrom pydantic import BaseModel\nfrom retrieval import retrieve_documents\nfrom generation import generate_answer\n\napp = FastAPI()\n\nclass Question(BaseModel):\n    query: str\n\n@app.post(\"\/ask\")\ndef ask(question: Question):\n    chunks = retrieve_documents(question.query)\n    answer = generate_answer(question.query, chunks)\n\n    return {\n        \"query\": question.query,\n        \"retrieved_chunks\": chunks,\n        \"answer\": answer,\n    }<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Start the API server<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Finally, start the app with this command:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">uvicorn app:app --host 0.0.0.0 --port 8080<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">The API will now be listening on port <code>8080<\/code>.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Send a test request<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">In another terminal on the VM, run:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">curl -X POST http:\/\/127.0.0.1:8080\/ask \\\n  -H \"Content-Type: application\/json\" \\\n  -d '{\"query\":\"What does the Pro plan include?\"}'<\/code><\/pre>\n\n\n\n<p 
class=\"wp-block-paragraph\">Or, you can use your VM\u2019s public IP to send requests from your local machine:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/upcloud.com\/media\/rag-testing-from-local-1024x851.png\" alt=\"-\" class=\"wp-image-78993\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">This request triggers the full RAG flow:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>The API receives the question<\/li>\n\n\n\n<li>It embeds the query<\/li>\n\n\n\n<li>PostgreSQL returns the nearest chunks using pgvector<\/li>\n\n\n\n<li>Those chunks are assembled into a prompt<\/li>\n\n\n\n<li>The prompt is sent to the local vLLM server<\/li>\n\n\n\n<li>The final answer is returned as JSON<\/li>\n<\/ol>\n\n\n\n<p class=\"wp-block-paragraph\">At this point, the self-hosted RAG pipeline is fully working on UpCloud using object storage, PostgreSQL with pgvector, a local embedding model, and a vLLM-served open model.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Running This System in Production<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">The system built in this tutorial is intentionally minimal so the core ideas remain clear. In production environments, teams usually extend several parts of the pipeline to improve retrieval quality, reliability, and performance as traffic grows.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">One of the first improvements typically happens in the document ingestion pipeline. 
Instead of basic chunking, production systems often introduce:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Overlapping chunks<\/strong> to preserve context across document sections<\/li>\n\n\n\n<li><strong>Parsers for structured formats<\/strong> such as PDFs, Markdown, and HTML<\/li>\n\n\n\n<li><strong>Metadata fields<\/strong> like document titles, section headings, source URLs, or timestamps<\/li>\n\n\n\n<li><strong>Automated ingestion jobs<\/strong> that index new files when they appear in object storage<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">These changes make the retrieval layer more accurate and help trace answers back to the original source documents.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The retrieval strategy itself may also evolve. While vector similarity search works well for many queries, production systems often combine it with other signals:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><a href=\"https:\/\/www.analyticsvidhya.com\/blog\/2024\/12\/contextual-rag-systems-with-hybrid-search-and-reranking\/\" target=\"_blank\" rel=\"noopener\"><strong>Hybrid search<\/strong><\/a> that blends vector similarity with keyword search<\/li>\n\n\n\n<li><a href=\"https:\/\/codesignal.com\/learn\/courses\/scaling-up-rag-with-vector-databases\/lessons\/metadata-based-filtering-in-rag-systems\" target=\"_blank\" rel=\"noopener\"><strong>Metadata filters<\/strong><\/a> that restrict results by document type or category<\/li>\n\n\n\n<li><strong>Re-ranking models<\/strong> that refine the top retrieved passages before sending them to the LLM<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">These techniques help improve answer quality, especially when datasets grow larger or more complex.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The LLM inference layer usually scales next. 
A single inference server is enough for testing, but production systems typically run:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Multiple vLLM servers behind a load balancer<\/li>\n\n\n\n<li>GPU-backed instances for faster generation<\/li>\n\n\n\n<li>Query caching for frequently asked questions<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">This helps maintain consistent latency even as query volume increases.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Finally, teams usually introduce monitoring and operational tooling to track signals like:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>API request latency<\/li>\n\n\n\n<li>Embedding generation time<\/li>\n\n\n\n<li>Vector search query time<\/li>\n\n\n\n<li>LLM inference latency<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">These metrics help identify bottlenecks and guide infrastructure scaling decisions.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Why Self-Hosting RAG Is More Practical Today<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Running a full RAG system on infrastructure you control used to be much harder than it is today. A few years ago, open-source models were weaker, inference servers were harder to operate, and vector search required specialized systems that were not widely adopted. As a result, most developers relied on managed AI services.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Several changes across the ecosystem have made self-hosted RAG systems far more practical.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Stronger open-source models<\/strong>: Open models have improved rapidly. Models such as Mistral and newer Llama-family variants perform well for tasks like documentation search, support assistants, and internal knowledge retrieval.<\/li>\n\n\n\n<li><strong>Production-ready inference servers<\/strong>: Running models locally has become much easier thanks to modern inference servers. 
Tools like vLLM provide efficient model serving and expose APIs compatible with the OpenAI API format.<\/li>\n\n\n\n<li><strong>Vector search inside existing databases<\/strong>: Vector search is no longer limited to specialized databases. Extensions like pgvector make it possible for many applications to handle both structured data and vector retrieval in the same setup.<\/li>\n\n\n\n<li><strong>Widely available infrastructure<\/strong>: High-memory VMs and GPU instances are now available from many infrastructure providers, including UpCloud. This allows teams to run inference workloads on standard cloud infrastructure rather than depending entirely on managed AI platforms.<\/li>\n\n\n\n<li><strong>S3-compatible object storage<\/strong>: Object storage systems that follow the S3 API have also simplified document pipelines. Raw documents can be stored in object storage while the vector database stores the processed chunks and embeddings. This separation makes it easy to rebuild indexes, change embedding models, or reprocess documents later.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">Taken together, these improvements have lowered the barrier to running AI systems outside proprietary platforms.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Final Thoughts<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">In this tutorial, we built a complete self-hosted Retrieval-Augmented Generation pipeline using open-source tools and standard infrastructure.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">One of the main takeaways from this exercise should be how practical self-hosting has become. Platforms such as <a href=\"https:\/\/upcloud.com\/global\/\">UpCloud<\/a> provide the infrastructure needed to run this architecture, including compute for model inference, managed PostgreSQL for vector search, and object storage for document ingestion. 
Because each component of the system scales independently, you can start with a small deployment like the one built here and expand it as your application grows.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">If you want to explore further, try extending this example by indexing larger document collections, experimenting with different embedding models, or adding hybrid retrieval. Once the core pipeline is in place, these improvements can significantly enhance the quality and performance of your RAG system.<\/p>\n","protected":false},"author":82,"featured_media":0,"comment_status":"open","ping_status":"closed","template":"","community-category":[244,223,232],"class_list":["post-6022","tutorial","type-tutorial","status-publish","hentry"],"acf":[],"_links":{"self":[{"href":"https:\/\/upcloud.com\/global\/wp-json\/wp\/v2\/tutorial\/6022","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/upcloud.com\/global\/wp-json\/wp\/v2\/tutorial"}],"about":[{"href":"https:\/\/upcloud.com\/global\/wp-json\/wp\/v2\/types\/tutorial"}],"author":[{"embeddable":true,"href":"https:\/\/upcloud.com\/global\/wp-json\/wp\/v2\/users\/82"}],"replies":[{"embeddable":true,"href":"https:\/\/upcloud.com\/global\/wp-json\/wp\/v2\/comments?post=6022"}],"version-history":[{"count":9,"href":"https:\/\/upcloud.com\/global\/wp-json\/wp\/v2\/tutorial\/6022\/revisions"}],"predecessor-version":[{"id":6039,"href":"https:\/\/upcloud.com\/global\/wp-json\/wp\/v2\/tutorial\/6022\/revisions\/6039"}],"wp:attachment":[{"href":"https:\/\/upcloud.com\/global\/wp-json\/wp\/v2\/media?parent=6022"}],"wp:term":[{"taxonomy":"community-category","embeddable":true,"href":"https:\/\/upcloud.com\/global\/wp-json\/wp\/v2\/community-category?post=6022"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}