RAG (Retrieval-Augmented Generation) has become the standard way to connect your business data to an LLM. Here's a concrete guide to building a reliable RAG architecture for production with LangChain and pgvector.

Why RAG over fine-tuning?

Fine-tuning an LLM on your data is expensive, time-consuming, and goes stale as soon as your data changes. RAG, by contrast, is dynamic: you update the knowledge base continuously, and every request retrieves from the latest data.

That's why, for 90% of enterprise use cases (internal documentation, customer support, contract analysis), RAG is the right answer.

Target architecture

Our reference stack for a production-ready RAG:

01. Ingestion & chunking

Load documents (PDF, Word, web, APIs), split into coherent chunks with overlap, clean and normalize text.
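
As an illustration of the "clean and normalize" step, here is a minimal sketch using only the standard library. The function name `normalize_text` and the specific cleaning rules are assumptions chosen for the example, not a fixed recipe; adapt them to what your loaders actually produce.

```python
import re
import unicodedata


def normalize_text(raw: str) -> str:
    """Clean extracted text before chunking (hypothetical helper)."""
    # NFKC folds compatibility characters: ligatures, non-breaking spaces, full-width forms.
    text = unicodedata.normalize("NFKC", raw)
    # Heuristic: re-join words hyphenated across a line break ("retrie-\nval" -> "retrieval").
    # This can also merge genuine hyphenated compounds split at a line end.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Drop control characters, keeping newlines and tabs.
    text = "".join(
        ch for ch in text if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )
    # Collapse runs of spaces and tabs, trim spaces around newlines,
    # and cap consecutive blank lines at one.
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r" ?\n ?", "\n", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

Running this before chunking keeps PDF extraction quirks out of the chunks, so the same sentence embeds the same way regardless of which loader produced it.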

02. Embedding & vector storage

Vectorize with text-embedding-3-large (OpenAI) or a local model. Store in Postgres with the pgvector extension for native vector queries.
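
To make "native vector queries" concrete: pgvector's `<=>` operator returns cosine distance, which is 1 minus cosine similarity, and an `ORDER BY embedding <=> query` clause sorts rows by it. The sketch below reproduces that ordering in plain Python on toy vectors. The function names and the `(id, vector)` row shape are assumptions for illustration only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """Same quantity pgvector's <=> operator returns: 1 - cosine similarity."""
    return 1.0 - cosine_similarity(a, b)


def rank_by_cosine_distance(query, rows):
    """Sort (id, vector) rows by ascending distance to the query vector."""
    return sorted((cosine_distance(query, vec), row_id) for row_id, vec in rows)


rows = [("x", [0.0, 1.0]), ("y", [1.0, 0.0]), ("z", [1.0, 1.0])]
ranked = rank_by_cosine_distance([1.0, 0.0], rows)
print([row_id for _, row_id in ranked])  # -> ['y', 'z', 'x']
```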

03. Hybrid retrieval

Combine semantic search (cosine similarity) and BM25 search (full-text) to maximize the precision of retrieved context.
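
One common way to combine the two result lists is Reciprocal Rank Fusion (RRF). The article does not prescribe a fusion method, so treat this as one possible choice rather than the reference implementation. The sketch assumes each search returns an ordered list of document IDs; `k=60` is the constant conventionally used with RRF.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_n=None):
    """Fuse several ranked lists of doc IDs into one list.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    so documents ranked well by both searches rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties on doc ID for deterministic output.
    fused = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return fused if top_n is None else fused[:top_n]


semantic_hits = ["a", "b", "c"]  # hypothetical IDs from the vector search
bm25_hits = ["c", "a", "d"]      # hypothetical IDs from the full-text search
print(reciprocal_rank_fusion([semantic_hits, bm25_hits]))  # -> ['a', 'c', 'b', 'd']
```

RRF only needs ranks, not raw scores, which sidesteps the problem that cosine similarities and BM25 scores live on different scales.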

04. Generation & evaluation

Build the prompt with retrieved context, call the LLM, and automatically evaluate responses using RAGAS or LangSmith.

The code that matters

Python: pgvector setup
from langchain_postgres import PGVector
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

vectorstore = PGVector(
    embeddings=embeddings,
    collection_name="knowledge_base",
    connection="postgresql+psycopg://user:pass@localhost/ragdb",
    use_jsonb=True,
)
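Python: RAG chain
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

# Chat model used for generation. The model name is an assumption;
# substitute whichever model you actually deploy.
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# MMR retrieval: fetch the 20 nearest chunks, then keep the 6 that
# best balance relevance and diversity.
retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 6, "fetch_k": 20},
)

# {context} receives the retrieved chunks, {input} the user's question.
prompt = ChatPromptTemplate.from_template("""
Answer the question based solely on the provided context.
If you can't find the answer, say so clearly.

Context: {context}
Question: {input}
""")

# Stuff all retrieved chunks into a single prompt, then wire the
# retriever and the LLM together.
chain = create_retrieval_chain(
    retriever,
    create_stuff_documents_chain(llm, prompt),
)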
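
The retriever above uses `search_type="mmr"` (Maximal Marginal Relevance). To show what `k` and `fetch_k` control, here is a from-scratch sketch of the selection loop on toy vectors. It is not LangChain's implementation; the function names and the `lambda_mult=0.5` trade-off value are assumptions for the example.

```python
import math


def _cos(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr_select(query_vec, doc_vecs, k=6, fetch_k=20, lambda_mult=0.5):
    """Return indices of up to k documents chosen by Maximal Marginal Relevance."""
    sims = [_cos(query_vec, vec) for vec in doc_vecs]
    # Step 1: keep only the fetch_k candidates most similar to the query.
    candidates = sorted(range(len(doc_vecs)), key=lambda i: -sims[i])[:fetch_k]
    selected = []
    # Step 2: greedily pick the candidate that is relevant to the query
    # but not redundant with what has already been selected.
    while candidates and len(selected) < k:
        def mmr_score(i):
            redundancy = max(
                (_cos(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * sims[i] - (1 - lambda_mult) * redundancy

        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Documents 0 and 1 are near-duplicates; document 2 covers a different angle.
docs = [[1.0, 0.2], [1.0, 0.21], [0.2, 1.0]]
print(mmr_select([1.0, 1.0], docs, k=2))  # -> [1, 2]
```

A plain top-2 similarity search would return both near-duplicates here; MMR spends the second slot on the chunk that adds new information.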

Classic mistakes to avoid

Chunks that are too large or too small

A 2000-token chunk buries relevant information in noise. A 50-token chunk loses the context needed for understanding. Our sweet spot: 400–600 tokens with 80-token overlap.
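
Here is a minimal sliding-window sketch of that sizing (500 tokens per chunk, 80 tokens of overlap). It splits on whitespace as a stand-in for a real tokenizer, which is an assumption: whitespace words only approximate model tokens, so swap in your embedding model's tokenizer for accurate counts. The function names are illustrative.

```python
def chunk_tokens(tokens, chunk_size=500, overlap=80):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once this window reaches the end, to avoid a redundant tail chunk.
        if start + chunk_size >= len(tokens):
            break
    return chunks


def chunk_text(text, chunk_size=500, overlap=80):
    """Whitespace-token approximation of token-based chunking."""
    return [" ".join(chunk) for chunk in chunk_tokens(text.split(), chunk_size, overlap)]
```

With these defaults, each chunk starts 420 tokens after the previous one, so a sentence that straddles a boundary appears whole in at least one chunk as long as it is shorter than the overlap.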

Skipping the evaluation phase

A RAG pipeline without evaluation metrics is flying blind. We systematically use RAGAS to measure response faithfulness, retrieval relevance, and absence of hallucinations.

Pro tip: Enable retrieval logging from day one. It's your best debugging tool: you'll immediately see whether the problem lies in retrieval or in generation.
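
A minimal sketch of such logging with the standard `logging` module follows. The wrapper name `logged_retrieve`, the logger name, and the dict shape of each result (`id`, `score`, `text`) are assumptions for illustration; map them onto whatever your retriever actually returns.

```python
import json
import logging
import time

logger = logging.getLogger("rag.retrieval")


def logged_retrieve(retrieve_fn, query, **kwargs):
    """Call retrieve_fn(query, **kwargs) and log one JSON line per request."""
    start = time.perf_counter()
    docs = retrieve_fn(query, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000
    record = {
        "query": query,
        "latency_ms": round(elapsed_ms, 1),
        "n_results": len(docs),
        # Keep a short preview so the log shows what the LLM was given
        # without storing whole documents.
        "results": [
            {
                "id": doc.get("id"),
                "score": doc.get("score"),
                "preview": doc.get("text", "")[:80],
            }
            for doc in docs
        ],
    }
    logger.info(json.dumps(record, ensure_ascii=False))
    return docs
```

When an answer is wrong, the log line tells you which case you are in: if the right chunk is missing from `results`, fix retrieval; if it is present, look at the prompt and the model.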

Production checklist

Before going to prod, verify: incremental indexing (no full re-indexing on every update), batch embedding for reduced API costs, caching of frequent embeddings, P95 latency monitoring, and automated regression tests on a golden dataset.
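
For the incremental indexing item, one simple approach is to store a content hash per document and re-embed only what changed. The sketch below assumes documents arrive as a `doc_id -> text` mapping and that the hashes from the previous run are available as `doc_id -> hash`; both shapes and the function names are hypothetical.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_incremental_update(documents, indexed_hashes):
    """Decide which documents to (re-)embed and which to remove.

    documents:      dict of doc_id -> current text
    indexed_hashes: dict of doc_id -> hash recorded at the last indexing run
    """
    # New or modified documents: their hash is missing or differs.
    to_upsert = sorted(
        doc_id
        for doc_id, text in documents.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    )
    # Documents that disappeared from the source must leave the index too.
    to_delete = sorted(doc_id for doc_id in indexed_hashes if doc_id not in documents)
    return to_upsert, to_delete


previous = {"faq": content_hash("old answer"), "terms": content_hash("v1"), "gone": content_hash("x")}
current = {"faq": "new answer", "terms": "v1", "pricing": "2025 grid"}
print(plan_incremental_update(current, previous))  # -> (['faq', 'pricing'], ['gone'])
```

Only the documents in the first list need new embedding calls, which is also where the API cost savings come from.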


With care,

Sylvie Wendkuni NITIEMA
Founder & Data Scientist · DataSAI