Everyone reaching for a vector database when building RAG is solving the wrong problem first. For most domain-specific corpora (technical documentation, company knowledge bases, article archives), BM25 retrieval is competitive with semantic search, costs a fraction of the compute, and is dramatically simpler to operate. This tutorial shows you how to build a full RAG pipeline using Meilisearch as the retrieval backend, stream responses from an LLM API, and evaluate hit rate without a single embedding model.
Why RAG, and why not a vector database
Retrieval-Augmented Generation solves a fundamental problem: LLMs have a knowledge cutoff and a finite context window. You want answers grounded in your documents, not hallucinated from pre-training.
The standard advice is to use a vector database (Pinecone, Weaviate, Chroma). Vector search is powerful for open-domain retrieval where semantic similarity matters. But on a domain-specific corpus with consistent terminology (think a cybersecurity knowledge base or a medical reference), BM25 with typo tolerance typically achieves 85–95% of the recall you'd get from embeddings, with zero GPU cost, sub-10ms latency, and no embedding pipeline to maintain.
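Meilisearch gives you fast keyword retrieval out of the box, plus typo tolerance, faceted filtering, and a simple REST API. Strictly speaking, it orders results with a configurable chain of ranking rules rather than a classical BM25 score, but it fills the same lexical-retrieval role in this pipeline. It's what I use to power the search across 1,600+ articles at AYI NEDJIMI Consultants.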
Setup
pip install meilisearch openai httpx
Run Meilisearch locally:
docker run -d -p 7700:7700 getmeili/meilisearch:latest
Step 1: Index your documents
Your documents need an id, searchable content, and any filter attributes you want to use at query time.
import meilisearch
import hashlib
import json
MEILI_URL = "http://127.0.0.1:7700"
MEILI_KEY = "your_master_key" # or "" for local dev
INDEX_NAME = "knowledge_base"
client = meilisearch.Client(MEILI_URL, MEILI_KEY)
def get_or_create_index():
    """Return the index, creating it on first use, and apply search settings."""
    try:
        index = client.get_index(INDEX_NAME)
    except meilisearch.errors.MeilisearchApiError:
        # Index does not exist yet: create it and wait for the task to finish
        task = client.create_index(INDEX_NAME, {"primaryKey": "id"})
        client.wait_for_task(task.task_uid)
        index = client.get_index(INDEX_NAME)
    # Configure searchable attributes and filters
    task = index.update_settings({
        "searchableAttributes": ["title", "content", "tags"],
        "filterableAttributes": ["category", "doc_type"],
        "rankingRules": [
            "words", "typo", "proximity", "attribute", "sort", "exactness"
        ],
        "typoTolerance": {
            "enabled": True,
            "minWordSizeForTypos": {"oneTypo": 4, "twoTypos": 8}
        }
    })
    # Settings updates are asynchronous; wait so later searches see them
    client.wait_for_task(task.task_uid)
    return index
def index_documents(documents: list[dict]):
    """
    Each document: {"id": str, "title": str, "content": str,
                    "tags": list[str], "category": str, "doc_type": str}
    """
    index = get_or_create_index()
    # Add stable IDs if not present
    for doc in documents:
        if "id" not in doc:
            doc["id"] = hashlib.sha256(doc["content"].encode()).hexdigest()[:16]
    task = index.add_documents(documents, primary_key="id")
    client.wait_for_task(task.task_uid)
    print(f"Indexed {len(documents)} documents.")
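# Example: load from a JSONL file (one JSON object per line)
def load_and_index(filepath: str):
    docs = []
    with open(filepath, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines, which json.loads would reject
                docs.append(json.loads(line))
    index_documents(docs)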
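One detail worth checking before indexing is the shape of the IDs themselves. To the best of my knowledge, Meilisearch only accepts primary key values that are integers or strings made of letters, digits, hyphens, and underscores, so a slug or a hex digest works but a URL or file path does not. The sketch below is a hypothetical helper (`ensure_valid_id` is not part of the Meilisearch SDK) that keeps a valid ID and otherwise falls back to the same content hash used above.

```python
import hashlib
import re

# Assumption: Meilisearch accepts IDs made only of a-z, A-Z, 0-9, '-' and '_'.
_VALID_ID = re.compile(r"[A-Za-z0-9_-]+")


def ensure_valid_id(doc: dict) -> dict:
    """Return a copy of doc whose "id" is safe to use as a primary key."""
    doc_id = str(doc.get("id", ""))
    if _VALID_ID.fullmatch(doc_id):
        return {**doc, "id": doc_id}
    # Missing or invalid ID: derive a stable one from the content instead.
    digest = hashlib.sha256(doc["content"].encode("utf-8")).hexdigest()[:16]
    return {**doc, "id": digest}
```

Running every document through a helper like this before `index_documents` turns a rejected indexing task into a predictable fallback.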
Step 2: Retrieve top-k documents
def retrieve(query: str, top_k: int = 5, filters: str = "") -> list[dict]:
    """
    Returns top_k documents matching the query.
    filters example: "category = 'security' AND doc_type = 'guide'"
    """
    index = client.get_index(INDEX_NAME)
    search_params = {
        "limit": top_k,
        "attributesToRetrieve": ["id", "title", "content", "category"],
        "attributesToHighlight": ["content"],
        "highlightPreTag": "**",
        "highlightPostTag": "**",
    }
    if filters:
        search_params["filter"] = filters
    results = index.search(query, search_params)
    return results["hits"]
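Because `filters` is a plain string, values that come from user input can break the expression if they contain a quote. A minimal sketch of a filter builder follows; `quote_filter_value` and `build_filter` are hypothetical names, and the backslash escaping is an assumption about Meilisearch's filter syntax that is worth verifying against your server version.

```python
def quote_filter_value(value: str) -> str:
    """Wrap a value in single quotes, escaping backslashes and quotes.

    Assumption: Meilisearch filter strings accept backslash-escaped quotes.
    """
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"


def build_filter(**conditions: str) -> str:
    """Join attribute = value pairs with AND, skipping empty values."""
    parts = [
        f"{attr} = {quote_filter_value(val)}"
        for attr, val in conditions.items()
        if val
    ]
    return " AND ".join(parts)
```

For example, `build_filter(category="security", doc_type="guide")` produces the same string as the docstring example in `retrieve`, and an empty result can be passed straight through as "no filter".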
Step 3: Construct the prompt
The prompt structure is critical. You want the model to be explicitly grounded: it should cite only what's in the retrieved chunks, not hallucinate.
def build_prompt(query: str, retrieved_docs: list[dict]) -> list[dict]:
    context_blocks = []
    for i, doc in enumerate(retrieved_docs, 1):
        context_blocks.append(
            f"[Source {i}] {doc['title']}\n{doc['content'][:1200]}"
        )
    context = "\n\n---\n\n".join(context_blocks)
    system_prompt = (
        "You are a technical assistant. Answer the user's question using ONLY "
        "the provided sources. If the answer is not in the sources, say so explicitly. "
        "Cite sources by number, e.g. [Source 1]."
    )
    user_message = f"""Sources:
{context}
---
Question: {query}"""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_message},
    ]
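The system prompt asks for `[Source N]` citations, which makes grounding cheap to spot-check after the fact. The following is a small sketch, not part of the original pipeline, that extracts the cited numbers from a finished answer and flags any that do not correspond to a retrieved source. The function names are illustrative.

```python
import re


def cited_sources(answer: str) -> set[int]:
    """Return the source numbers cited as [Source N] in the answer."""
    return {int(n) for n in re.findall(r"\[Source (\d+)\]", answer)}


def invalid_citations(answer: str, num_sources: int) -> set[int]:
    """Return cited numbers that do not match any retrieved source."""
    return {n for n in cited_sources(answer) if not 1 <= n <= num_sources}
```

A non-empty result from `invalid_citations` is a useful signal to log: the model cited a source that was never in its context.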
Step 4: Stream the LLM response
Never buffer the full response before sending it to the user. Streaming is essential for UX on long answers.
from openai import OpenAI  # generic llm_client; swap for any compatible SDK
llm_client = OpenAI(
    api_key="your_api_key",
    base_url="https://api.your-llm-provider.com/v1",  # adjust per provider
)
def rag_stream(query: str, category_filter: str = ""):
    """Generator that yields text chunks as they arrive from the LLM."""
    filters = f"category = '{category_filter}'" if category_filter else ""
    docs = retrieve(query, top_k=5, filters=filters)
    if not docs:
        yield "No relevant documents found in the knowledge base."
        return
    messages = build_prompt(query, docs)
    stream = llm_client.chat.completions.create(
        model="gpt-4o-mini",  # or your preferred model
        messages=messages,
        stream=True,
        temperature=0.2,  # lower temp for factual retrieval tasks
        max_tokens=800,
    )
    for chunk in stream:
        # Some providers send chunks with no choices (e.g. usage-only chunks)
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta
        if delta.content:
            yield delta.content
Step 5: Wire it together: a minimal CLI
import sys
def main():
    query = " ".join(sys.argv[1:]) if len(sys.argv) > 1 else input("Query: ")
    print(f"\nQuery: {query}\n{'='*60}\n")
    for token in rag_stream(query):
        print(token, end="", flush=True)
    print("\n")
if __name__ == "__main__":
    main()
Usage:
python rag.py "What are the key requirements of NIS 2 for SMEs?"
Step 6: Evaluate hit rate
Before deploying, measure whether your retrieval is actually finding the right documents. You need a small golden dataset: query → expected document ID.
def evaluate_hit_rate(golden_set: list[dict], top_k: int = 5) -> float:
    """
    golden_set: [{"query": "...", "expected_id": "doc_id"}, ...]
    Returns hit rate @ top_k.
    """
    if not golden_set:
        raise ValueError("golden_set must contain at least one query")
    hits = 0
    for item in golden_set:
        results = retrieve(item["query"], top_k=top_k)
        retrieved_ids = {r["id"] for r in results}
        if item["expected_id"] in retrieved_ids:
            hits += 1
    hit_rate = hits / len(golden_set)
    print(f"Hit rate @{top_k}: {hit_rate:.2%} ({hits}/{len(golden_set)})")
    return hit_rate
# Example usage
golden = [
    {"query": "NIS 2 SME requirements", "expected_id": "nis2-guide-001"},
    {"query": "ISO 27001 certification steps", "expected_id": "iso27001-checklist"},
    {"query": "penetration testing methodology", "expected_id": "pentest-guide-002"},
]
evaluate_hit_rate(golden, top_k=5)
On a 1,600-article cybersecurity corpus, this setup achieves roughly 91% hit rate at k=5, without a single embedding model call.
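Hit rate tells you whether the expected document shows up in the top k, but not how high it ranks or which queries failed. The sketch below extends the same golden-set idea with mean reciprocal rank (MRR), a metric the original setup does not report, and a list of missed queries. The retriever is passed in as a function so the evaluation can be run against a stub, and `evaluate_retrieval` is an illustrative name.

```python
from typing import Callable


def evaluate_retrieval(
    golden_set: list[dict],
    retrieve_fn: Callable[[str, int], list[dict]],
    top_k: int = 5,
) -> dict:
    """Compute hit rate and mean reciprocal rank (MRR) at top_k."""
    if not golden_set:
        return {"hit_rate": 0.0, "mrr": 0.0, "misses": []}
    hits, reciprocal_ranks, misses = 0, 0.0, []
    for item in golden_set:
        ids = [r["id"] for r in retrieve_fn(item["query"], top_k)]
        if item["expected_id"] in ids:
            hits += 1
            # Rank is 1-based: a hit in first position contributes 1.0
            reciprocal_ranks += 1.0 / (ids.index(item["expected_id"]) + 1)
        else:
            misses.append(item["query"])
    n = len(golden_set)
    return {"hit_rate": hits / n, "mrr": reciprocal_ranks / n, "misses": misses}
```

To run it against the live index, pass `lambda q, k: retrieve(q, top_k=k)` as `retrieve_fn`. The `misses` list is usually the most actionable output, since it shows where query and document vocabulary diverge.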
Production considerations
Chunking strategy: For long documents, chunk at 512–800 tokens with 10% overlap. Store doc_id and chunk_index so you can reconstruct the full document if needed.
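A minimal sketch of that chunking scheme, under one loud assumption: whitespace-separated words stand in for tokens, so the sizes are approximate. Swap in a real tokenizer if you need exact budgets. The `chunk_document` name and the `{doc_id}_{chunk_index}` ID format are illustrative choices, not something the pipeline above prescribes.

```python
def chunk_document(doc: dict, chunk_size: int = 600, overlap_ratio: float = 0.1) -> list[dict]:
    """Split doc["content"] into overlapping chunks.

    Assumption: whitespace-separated words approximate tokens.
    """
    words = doc["content"].split()
    overlap = int(chunk_size * overlap_ratio)
    step = max(chunk_size - overlap, 1)
    chunks = []
    for chunk_index, start in enumerate(range(0, len(words), step)):
        window = words[start:start + chunk_size]
        chunks.append({
            "id": f"{doc['id']}_{chunk_index}",
            "doc_id": doc["id"],
            "chunk_index": chunk_index,
            "title": doc.get("title", ""),
            "content": " ".join(window),
        })
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the document
    return chunks
```

Each chunk is indexed as its own document, and sorting by `chunk_index` within a `doc_id` recovers the original order.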
Re-ranking: If your hit rate plateaus below 85%, add a lightweight cross-encoder re-ranker as a second stage. cross-encoder/ms-marco-MiniLM-L-6-v2 from Sentence Transformers works locally and adds ~30ms latency.
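The second stage can be sketched independently of any particular model. The hypothetical `rerank` helper below takes a scoring function that maps a list of (query, text) pairs to a list of scores, which is the shape a Sentence Transformers `CrossEncoder.predict` call accepts. The commented usage at the bottom is an assumption about how you would wire it in, not tested code from this pipeline.

```python
from typing import Callable, Sequence

ScoreFn = Callable[[list[tuple[str, str]]], Sequence[float]]


def rerank(query: str, docs: list[dict], score_pairs: ScoreFn, top_n: int = 5) -> list[dict]:
    """Re-order docs by the relevance score of each (query, content) pair."""
    if not docs:
        return []
    scores = score_pairs([(query, doc["content"]) for doc in docs])
    ranked = sorted(zip(scores, docs), key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in ranked[:top_n]]


# With Sentence Transformers installed, the scorer could be wired in like this:
#   from sentence_transformers import CrossEncoder
#   model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
#   top_docs = rerank(query, retrieve(query, top_k=20), model.predict)
```

The usual pattern is to over-retrieve (say, 20 hits) and let the re-ranker pick the 5 that go into the prompt.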
Context window budget: At 5 docs × 1,200 chars, you're using roughly 1,500 tokens of context. Adjust top_k and content truncation to stay within your model's window while leaving room for the answer.
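The arithmetic behind that estimate is 5 × 1,200 = 6,000 characters, divided by roughly 4 characters per token. The sketch below makes the calculation explicit and inverts it to get a per-document truncation length. The 4-characters-per-token ratio is a rule-of-thumb assumption for English prose, and both function names are illustrative.

```python
CHARS_PER_TOKEN = 4  # assumption: rough average for English prose


def estimate_context_tokens(top_k: int, max_chars: int) -> int:
    """Approximate tokens used by top_k documents truncated to max_chars."""
    return (top_k * max_chars) // CHARS_PER_TOKEN


def max_chars_per_doc(context_window: int, answer_tokens: int,
                      overhead_tokens: int, top_k: int) -> int:
    """Characters each document may use after reserving answer and prompt overhead."""
    budget = context_window - answer_tokens - overhead_tokens
    if budget <= 0 or top_k <= 0:
        return 0
    return (budget * CHARS_PER_TOKEN) // top_k
```

Here `estimate_context_tokens(5, 1200)` gives 1,500, matching the figure above. Use your tokenizer's real counts when you are close to the limit.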
Caching: Cache retrieval results for identical queries with a TTL of 5–15 minutes using Redis or even a simple in-memory dict. LLM call results can be cached longer for factual queries.
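For the in-memory variant, a minimal sketch could look like the following. `TTLCache` is a hypothetical class, single-process and not thread-safe, with no eviction beyond expiry. The clock is injectable only to make expiry easy to test.

```python
import time
from typing import Callable


class TTLCache:
    """Tiny in-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float = 600.0,
                 clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store: dict = {}

    def get_or_compute(self, key, compute: Callable[[], object]):
        """Return the cached value for key, recomputing it if missing or expired."""
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self.ttl:
            return entry[1]
        value = compute()
        self._store[key] = (now, value)
        return value
```

A call such as `cache.get_or_compute((query, top_k, filters), lambda: retrieve(query, top_k, filters))` keys the cache on everything that affects the result, so a filtered search never returns an unfiltered one.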
This pipeline (retrieval with Meilisearch, prompt construction, streaming output) is what I run in production. No embedding pipeline, no vector database operational overhead. For domain-specific retrieval, BM25 is frequently the pragmatic choice. Reach for semantic search when your query vocabulary genuinely diverges from your document vocabulary; otherwise, ship the simpler thing.