TechTrends Now - Tech News for Builders and Operators

Everyone reaching for a vector database when building RAG is solving the wrong problem first. For most domain-specific corpora — technical documentation, company knowledge bases, article archives — BM25 retrieval is competitive with semantic search, costs a fraction of the compute, and is dramatically simpler to operate. This tutorial shows you how to build a full RAG pipeline using Meilisearch as the retrieval backend, stream responses from an LLM API, and evaluate hit rate without a single embedding model.

Why RAG, and why not a vector database

Retrieval-Augmented Generation solves a fundamental problem: LLMs have a knowledge cutoff and a finite context window. You want answers grounded in your documents, not hallucinated from pre-training.

The standard advice is to use a vector database (Pinecone, Weaviate, Chroma). Vector search is powerful for open-domain retrieval where semantic similarity matters. But on a domain-specific corpus with consistent terminology — think a cybersecurity knowledge base or a medical reference — BM25 with typo tolerance typically achieves 85–95% of the recall you'd get from embeddings, with zero GPU cost, sub-10ms latency, and no embedding pipeline to maintain.

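Meilisearch gives you this kind of lexical retrieval out of the box, plus typo tolerance, faceted filtering, and a simple REST API. Strictly speaking, it does not implement the BM25 formula: it orders results with a cascade of ranking rules (words, typo, proximity, attribute, sort, exactness). For this pipeline it fills the same role as BM25, which is keyword matching with no embeddings. It's what I use to power the search across 1,600+ articles at AYI NEDJIMI Consultants.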

Setup

pip install meilisearch openai

Run Meilisearch locally:

docker run -d -p 7700:7700 getmeili/meilisearch:latest

Step 1: Index your documents

Your documents need an id, searchable content, and any filter attributes you want to use at query time.

import meilisearch
import hashlib
import json

MEILI_URL = "http://127.0.0.1:7700"
MEILI_KEY = "your_master_key"  # or "" for local dev
INDEX_NAME = "knowledge_base"

client = meilisearch.Client(MEILI_URL, MEILI_KEY)

def get_or_create_index():
    try:
        index = client.get_index(INDEX_NAME)
    except meilisearch.errors.MeilisearchApiError:
        # Index does not exist yet: create it and wait for the async task to finish
        task = client.create_index(INDEX_NAME, {"primaryKey": "id"})
        client.wait_for_task(task.task_uid)
        index = client.get_index(INDEX_NAME)

    # Configure searchable attributes, filters, ranking, and typo tolerance.
    # The ranking rules below are Meilisearch's defaults, listed explicitly.
    task = index.update_settings({
        "searchableAttributes": ["title", "content", "tags"],
        "filterableAttributes": ["category", "doc_type"],
        "rankingRules": [
            "words", "typo", "proximity", "attribute", "sort", "exactness"
        ],
        "typoTolerance": {
            "enabled": True,
            "minWordSizeForTypos": {"oneTypo": 4, "twoTypos": 8}
        }
    })
    # Settings updates are asynchronous too; wait so later calls see them applied
    client.wait_for_task(task.task_uid)
    return index

def index_documents(documents: list[dict]):
    """
    Each document: {"id": str, "title": str, "content": str,
                    "tags": list[str], "category": str, "doc_type": str}
    """
    index = get_or_create_index()

    # Add stable IDs if not present
    for doc in documents:
        if "id" not in doc:
            doc["id"] = hashlib.sha256(doc["content"].encode()).hexdigest()[:16]

    task = index.add_documents(documents, primary_key="id")
    client.wait_for_task(task.task_uid)
    print(f"Indexed {len(documents)} documents.")

# Example: load from a JSONL file
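def load_and_index(filepath: str):
    docs = []
    with open(filepath, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines, which json.loads would reject
            docs.append(json.loads(line))
    index_documents(docs)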

Step 2: Retrieve top-k documents

def retrieve(query: str, top_k: int = 5, filters: str = "") -> list[dict]:
    """
    Returns top_k documents matching the query.
    filters example: "category = 'security' AND doc_type = 'guide'"
    """
    # client.index() returns a local handle without an extra HTTP round trip
    index = client.index(INDEX_NAME)

    search_params = {
        "limit": top_k,
        "attributesToRetrieve": ["id", "title", "content", "category"],
        "attributesToHighlight": ["content"],
        "highlightPreTag": "**",
        "highlightPostTag": "**",
    }

    if filters:
        search_params["filter"] = filters

    results = index.search(query, search_params)
    return results["hits"]
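
Filter strings assembled by hand tend to break as soon as a value contains a quote. The following is a minimal sketch of a helper that builds the filter expression passed to retrieve(). The name build_filter is hypothetical (it is not part of the Meilisearch SDK), it only handles equality conditions joined with AND, and it assumes backslash escaping of quotes as described in Meilisearch's filter expression reference.

```python
def build_filter(**conditions: str) -> str:
    """Build a filter string such as
    "category = 'security' AND doc_type = 'guide'".

    Hypothetical helper: only equality conditions joined with AND are handled.
    Empty values are skipped so optional filters can be passed straight through.
    """
    parts = []
    for attribute, value in conditions.items():
        if not value:
            continue
        # Escape backslashes first, then single quotes, so the value
        # cannot terminate its own quoted string early.
        escaped = value.replace("\\", "\\\\").replace("'", "\\'")
        parts.append(f"{attribute} = '{escaped}'")
    return " AND ".join(parts)


if __name__ == "__main__":
    print(build_filter(category="security", doc_type="guide"))
```

The result can be passed as the filters argument, for example retrieve(query, filters=build_filter(category="security")).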

Step 3: Construct the prompt

The prompt structure is critical. You want the model to be explicitly grounded — it should cite only what's in the retrieved chunks, not hallucinate.

def build_prompt(query: str, retrieved_docs: list[dict]) -> list[dict]:
    context_blocks = []
    for i, doc in enumerate(retrieved_docs, 1):
        context_blocks.append(
            f"[Source {i}] {doc['title']}\n{doc['content'][:1200]}"
        )

    context = "\n\n---\n\n".join(context_blocks)

    system_prompt = (
        "You are a technical assistant. Answer the user's question using ONLY "
        "the provided sources. If the answer is not in the sources, say so explicitly. "
        "Cite sources by number, e.g. [Source 1]."
    )

    user_message = f"""Sources:
{context}

---

Question: {query}"""

    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_message},
    ]

Step 4: Stream the LLM response

Never buffer the full response before sending it to the user. Streaming is essential for UX on long answers.

from openai import OpenAI  # generic llm_client — swap for any compatible SDK

llm_client = OpenAI(
    api_key="your_api_key",
    base_url="https://api.your-llm-provider.com/v1",  # adjust per provider
)

def rag_stream(query: str, category_filter: str = ""):
    """Generator that yields text chunks as they arrive from the LLM."""
    # Escape single quotes so the category value cannot break out of the filter string
    safe_category = category_filter.replace("'", "\\'")
    filters = f"category = '{safe_category}'" if category_filter else ""
    docs = retrieve(query, top_k=5, filters=filters)

    if not docs:
        yield "No relevant documents found in the knowledge base."
        return

    messages = build_prompt(query, docs)

    stream = llm_client.chat.completions.create(
        model="gpt-4o-mini",  # or your preferred model
        messages=messages,
        stream=True,
        temperature=0.2,  # lower temp for factual retrieval tasks
        max_tokens=800,
    )

    for chunk in stream:
        # Some providers emit chunks with an empty choices list; skip those
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta
        if delta.content:
            yield delta.content

Step 5: Wire it together — a minimal CLI

import sys

def main():
    query = " ".join(sys.argv[1:]) if len(sys.argv) > 1 else input("Query: ")
    print(f"\nQuery: {query}\n{'='*60}\n")

    for token in rag_stream(query):
        print(token, end="", flush=True)

    print("\n")

if __name__ == "__main__":
    main()

Usage:

python rag.py "What are the key requirements of NIS 2 for SMEs?"

Step 6: Evaluate hit rate

Before deploying, measure whether your retrieval is actually finding the right documents. You need a small golden dataset: query → expected document ID.

def evaluate_hit_rate(golden_set: list[dict], top_k: int = 5) -> float:
    """
    golden_set: [{"query": "...", "expected_id": "doc_id"}, ...]
    Returns hit rate @ top_k.
    """
    if not golden_set:
        # Avoid a ZeroDivisionError and make the misconfiguration obvious
        raise ValueError("golden_set is empty; add at least one query/expected_id pair")

    hits = 0
    for item in golden_set:
        results = retrieve(item["query"], top_k=top_k)
        retrieved_ids = {r["id"] for r in results}
        if item["expected_id"] in retrieved_ids:
            hits += 1

    hit_rate = hits / len(golden_set)
    print(f"Hit rate @{top_k}: {hit_rate:.2%} ({hits}/{len(golden_set)})")
    return hit_rate

# Example usage
golden = [
    {"query": "NIS 2 SME requirements", "expected_id": "nis2-guide-001"},
    {"query": "ISO 27001 certification steps", "expected_id": "iso27001-checklist"},
    {"query": "penetration testing methodology", "expected_id": "pentest-guide-002"},
]

evaluate_hit_rate(golden, top_k=5)

On a 1,600-article cybersecurity corpus, this setup achieves roughly 91% hit rate at k=5 — without a single embedding model call.
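
A single number at k=5 hides how quickly the right document surfaces. As a rough sketch, the same metric can be computed at several cutoffs from one set of ranked results, without querying the index again for each k. The function name hit_rate_at_ks and its input shape (one ranked list of IDs per query) are assumptions made for this illustration.

```python
def hit_rate_at_ks(
    ranked_ids: list[list[str]],
    expected_ids: list[str],
    ks: tuple[int, ...] = (1, 3, 5),
) -> dict[int, float]:
    """Hit rate at several cutoffs.

    ranked_ids[i] is the ranked list of document IDs returned for query i;
    expected_ids[i] is the single correct document ID for that query.
    """
    if len(ranked_ids) != len(expected_ids):
        raise ValueError("ranked_ids and expected_ids must have the same length")
    if not expected_ids:
        return {k: 0.0 for k in ks}

    total = len(expected_ids)
    return {
        # A query counts as a hit at k if its expected ID is in the top k results
        k: sum(exp in ranked[:k] for ranked, exp in zip(ranked_ids, expected_ids)) / total
        for k in ks
    }


if __name__ == "__main__":
    ranked = [["a", "b", "c"], ["x", "y", "z"], ["m", "n", "o"]]
    expected = ["a", "z", "q"]
    print(hit_rate_at_ks(ranked, expected, ks=(1, 3)))
```

A large gap between the k=1 and k=5 figures suggests the right documents are being found but ranked too low.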

Production considerations

Chunking strategy: For long documents, chunk at 512–800 tokens with 10% overlap. Store doc_id and chunk_index so you can reconstruct the full document if needed.
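
A minimal sketch of that chunking scheme follows. It approximates tokens with whitespace-separated words, which is an assumption: a real tokenizer will count differently, so treat the sizes as rough. The function name chunk_document and the "<doc_id>-<chunk_index>" ID format are hypothetical choices for this example.

```python
def chunk_document(doc: dict, chunk_size: int = 600, overlap_ratio: float = 0.1) -> list[dict]:
    """Split doc["content"] into overlapping chunks.

    Words stand in for tokens here (an approximation). Each chunk keeps
    doc_id and chunk_index so the full document can be reassembled later.
    """
    words = doc["content"].split()
    overlap = int(chunk_size * overlap_ratio)
    step = max(1, chunk_size - overlap)  # how far the window advances each time

    chunks = []
    for chunk_index, start in enumerate(range(0, len(words), step)):
        window = words[start:start + chunk_size]
        chunks.append({
            "id": f"{doc['id']}-{chunk_index}",  # hypothetical chunk ID format
            "doc_id": doc["id"],
            "chunk_index": chunk_index,
            "title": doc.get("title", ""),
            "content": " ".join(window),
        })
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the document
    return chunks


if __name__ == "__main__":
    sample = {"id": "a1", "title": "T", "content": " ".join(f"w{i}" for i in range(25))}
    for chunk in chunk_document(sample, chunk_size=10):
        print(chunk["id"], chunk["content"])
```

The resulting chunk dictionaries can be indexed in place of whole documents; adding doc_id to filterableAttributes would let you fetch all chunks of one document.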

Re-ranking: If your hit rate plateaus below 85%, add a lightweight cross-encoder re-ranker as a second stage. cross-encoder/ms-marco-MiniLM-L-6-v2 from Sentence Transformers works locally and adds ~30ms latency.
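
The second stage can be sketched as a small function that re-orders retrieved documents by a relevance score. To keep the example free of heavy dependencies, the scorer is passed in as a callable. The names rerank and score_pairs are hypothetical, and the word-overlap scorer in the demo is only a stand-in for a real cross-encoder.

```python
from typing import Callable, Sequence


def rerank(
    query: str,
    docs: list[dict],
    score_pairs: Callable[[list[tuple[str, str]]], Sequence[float]],
    top_n: int = 5,
) -> list[dict]:
    """Re-order docs by the scores that score_pairs assigns to (query, content) pairs.

    score_pairs must return one score per pair, higher meaning more relevant.
    """
    if not docs:
        return []
    pairs = [(query, doc["content"]) for doc in docs]
    scores = score_pairs(pairs)
    ranked = sorted(zip(docs, scores), key=lambda item: item[1], reverse=True)
    return [doc for doc, _ in ranked[:top_n]]


if __name__ == "__main__":
    # Stand-in scorer for the demo: counts words shared by query and passage
    def overlap_scorer(pairs):
        return [len(set(q.split()) & set(p.split())) for q, p in pairs]

    docs = [
        {"id": "a", "content": "cooking pasta"},
        {"id": "b", "content": "nis 2 sme requirements"},
    ]
    print([d["id"] for d in rerank("nis 2 requirements", docs, overlap_scorer)])
```

With sentence-transformers installed, CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2").predict should slot in as score_pairs, since it takes a list of (query, passage) pairs and returns one score per pair. A common pattern is to retrieve a wider set (say top_k=20) and re-rank down to 5.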

Context window budget: At 5 docs × 1,200 chars, you're using roughly 1,500 tokens of context. Adjust top_k and content truncation to stay within your model's window while leaving room for the answer.
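
The arithmetic behind that estimate is 5 × 1,200 = 6,000 characters, and at roughly 4 characters per token for English text that comes to about 1,500 tokens. The sketch below applies the same heuristic to trim retrieved documents to a budget. The 4-characters-per-token ratio is an approximation, not a tokenizer, and fit_to_budget is a hypothetical helper name.

```python
def fit_to_budget(
    docs: list[dict],
    max_context_tokens: int = 1500,
    chars_per_token: int = 4,
) -> list[dict]:
    """Keep documents in rank order until the estimated token budget is spent.

    The last document that fits only partially is truncated; the rest are dropped.
    Token counts are estimated as characters / chars_per_token (a rough heuristic).
    """
    budget_chars = max_context_tokens * chars_per_token
    selected = []
    for doc in docs:
        if budget_chars <= 0:
            break
        content = doc["content"][:budget_chars]
        selected.append({**doc, "content": content})  # copy, do not mutate the input
        budget_chars -= len(content)
    return selected


if __name__ == "__main__":
    docs = [{"id": i, "content": "x" * 1200} for i in range(5)]
    print([len(d["content"]) for d in fit_to_budget(docs, max_context_tokens=500)])
```

Because documents arrive in rank order, the budget is spent on the best matches first.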

Caching: Cache retrieval results for identical queries with a TTL of 5–15 minutes using Redis or even a simple in-memory dict. LLM call results can be cached longer for factual queries.
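
For the in-memory option, a minimal sketch of a TTL cache is shown below. The class name TTLCache and its get_or_compute method are hypothetical, it is not thread-safe, and it never evicts expired entries until they are requested again, so it only suits small single-process deployments.

```python
import time
from typing import Any, Callable, Hashable


class TTLCache:
    """Tiny in-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float = 600.0, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._store: dict[Hashable, tuple[float, Any]] = {}

    def get_or_compute(self, key: Hashable, compute: Callable[[], Any]) -> Any:
        """Return the cached value for key, or call compute() and cache its result."""
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]
        value = compute()
        self._store[key] = (now, value)
        return value


if __name__ == "__main__":
    cache = TTLCache(ttl_seconds=600)
    # The lambda stands in for a real retrieval call
    first = cache.get_or_compute(("nis 2", 5, ""), lambda: ["doc-1", "doc-2"])
    second = cache.get_or_compute(("nis 2", 5, ""), lambda: ["never computed"])
    print(first == second)
```

A reasonable cache key is the tuple (query.strip().lower(), top_k, filters), so that trivially different spellings of the same query share an entry.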

This pipeline — retrieval with Meilisearch, prompt construction, streaming output — is what I run in production. No embedding pipeline, no vector database operational overhead. For domain-specific retrieval, BM25 is frequently the pragmatic choice. Reach for semantic search when your query vocabulary genuinely diverges from your document vocabulary; otherwise, ship the simpler thing.

How to build a production RAG pipeline in Python (without a vector database)
