RAG (Retrieval-Augmented Generation) has become the standard for connecting your business data to an LLM. Here's a concrete guide to setting up a reliable RAG architecture in production, using LangChain and pgvector.
Why RAG over fine-tuning?
Fine-tuning an LLM on your data is expensive, time-consuming, and becomes stale as soon as your data evolves. RAG, by contrast, is dynamic: your knowledge base is updated continuously, and each request retrieves the most recent data before the model answers.
That's why RAG is the right answer for 90% of enterprise use cases, such as internal documentation, customer support, and contract analysis.
Target architecture
Our reference stack for a production-ready RAG:
Ingestion & chunking
Load documents (PDF, Word, web, APIs), split into coherent chunks with overlap, clean and normalize text.
Embedding & vector storage
Vectorize with text-embedding-3-large (OpenAI) or a local model. Store in Postgres with the pgvector extension for native vector queries.
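To make "native vector queries" concrete: pgvector's `<=>` operator orders rows by cosine distance, which is 1 minus cosine similarity. The sketch below reproduces that ordering in plain Python on toy 3-dimensional vectors. The document ids and vectors are made up for illustration; real embeddings have far more dimensions and the ranking would run inside Postgres.

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance = 1 - cosine similarity; 0.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)

def nearest(query: list[float], rows: list[tuple[str, list[float]]], k: int = 3) -> list[str]:
    """Return the ids of the k rows closest to the query (ascending distance)."""
    ranked = sorted(rows, key=lambda row: cosine_distance(query, row[1]))
    return [doc_id for doc_id, _ in ranked[:k]]

# Hypothetical rows: (doc_id, embedding)
rows = [
    ("refund-policy", [0.9, 0.1, 0.0]),
    ("api-limits", [0.0, 1.0, 0.0]),
    ("returns-faq", [0.8, 0.2, 0.1]),
]
top = nearest([1.0, 0.0, 0.0], rows, k=2)
```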
Hybrid retrieval
Combine semantic search (cosine similarity) and BM25 search (full-text) to maximize the precision of retrieved context.
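The text above does not say how the semantic and full-text result lists are merged. One common option is Reciprocal Rank Fusion (RRF), sketched here as an assumption rather than the stack's actual method. The function name, the document ids, and the constant `k=60` (a conventional default) are illustrative.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of doc ids; each list contributes 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

semantic = ["doc_a", "doc_b", "doc_c"]  # hypothetical vector-search results
keyword = ["doc_c", "doc_a", "doc_d"]   # hypothetical full-text results
fused = reciprocal_rank_fusion([semantic, keyword])
```

RRF only needs ranks, not scores, so it sidesteps the problem that cosine similarities and full-text scores live on different scales.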
Generation & evaluation
Build the prompt with retrieved context, call the LLM, and automatically evaluate responses using RAGAS or LangSmith.
The code that matters
from langchain_postgres import PGVector
from langchain_openai import OpenAIEmbeddings

# Embedding model used for both indexing and querying
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

# Vector store backed by Postgres with the pgvector extension
vectorstore = PGVector(
    embeddings=embeddings,
    collection_name="knowledge_base",
    connection="postgresql+psycopg://user:pass@localhost/ragdb",
    use_jsonb=True,
)

from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

# Assumption: the original snippet never defines `llm`; any chat model works here
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# MMR fetches 20 candidates, then keeps 6 that are relevant but not redundant
retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 6, "fetch_k": 20},
)

prompt = ChatPromptTemplate.from_template("""
Answer the question based solely on the provided context.
If you can't find the answer, say so clearly.
Context: {context}
Question: {input}
""")

# Retrieve documents, stuff them into the prompt, then call the LLM
chain = create_retrieval_chain(
    retriever,
    create_stuff_documents_chain(llm, prompt),
)

Classic mistakes to avoid
Chunks that are too large or too small
A 2000-token chunk buries relevant information in noise. A 50-token chunk loses the context needed for understanding. Our sweet spot: 400–600 tokens with 80-token overlap.
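Here is a minimal sketch of the sliding-window split described above, using 500-token chunks with an 80-token overlap. It assumes whitespace-separated words as a rough stand-in for tokens; a real pipeline would count tokens with the embedding model's tokenizer, so actual sizes will differ. The function name is illustrative.

```python
def chunk_tokens(tokens: list[str], chunk_size: int = 500, overlap: int = 80) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window reaches the end, to avoid a redundant tail chunk.
        if start + chunk_size >= len(tokens):
            break
    return chunks

# Approximation: one whitespace-separated word counts as one token.
words = ("lorem ipsum " * 600).split()
chunks = chunk_tokens(words)
```

The overlap means a sentence cut at a chunk boundary still appears whole in at least one chunk, at the cost of embedding some text twice.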
Skipping the evaluation phase
A RAG pipeline without evaluation metrics is flying blind. We systematically use RAGAS to measure response faithfulness, retrieval relevance, and absence of hallucinations.
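RAGAS scores are computed with LLM calls, so it helps to also have a cheap, deterministic retrieval check. The sketch below is not RAGAS: it is a simplified stand-in that measures how often the expected document shows up in the top-k results. The golden questions, document ids, and the `fake_retrieve` function are hypothetical placeholders for your own retriever.

```python
from typing import Callable

def hit_rate_at_k(
    golden: list[tuple[str, str]],
    retrieve: Callable[[str], list[str]],
    k: int = 6,
) -> float:
    """Share of questions whose expected doc id appears in the top-k results."""
    if not golden:
        raise ValueError("golden set is empty")
    hits = sum(
        1 for question, expected_id in golden
        if expected_id in retrieve(question)[:k]
    )
    return hits / len(golden)

# Hypothetical retriever output: question -> ranked doc ids
fake_index = {
    "How do refunds work?": ["refund-policy", "returns-faq"],
    "What is the API rate limit?": ["pricing", "api-limits"],
    "Who approves contracts?": ["onboarding", "holidays"],
    "How do I reset my password?": ["password-reset"],
}

def fake_retrieve(question: str) -> list[str]:
    return fake_index.get(question, [])

golden = [
    ("How do refunds work?", "refund-policy"),
    ("What is the API rate limit?", "api-limits"),
    ("Who approves contracts?", "contract-approval"),
    ("How do I reset my password?", "password-reset"),
]
score = hit_rate_at_k(golden, fake_retrieve, k=6)
```

A low hit rate points at retrieval (chunking, embeddings, search settings); a high hit rate with poor answers points at the prompt or the model.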
Pro tip: Enable retrieval logging from day one. It's your best debugging tool: you'll immediately see whether the problem is in retrieval or generation.
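One possible shape for that logging is a thin wrapper that writes one JSON line per query. This is a sketch under assumptions: `retrieve` is a hypothetical callable returning `(doc_id, score)` pairs, and the logger name and field names are arbitrary choices, not part of LangChain.

```python
import json
import logging
import time
from typing import Callable

logger = logging.getLogger("rag.retrieval")

def build_log_record(query: str, results: list[tuple[str, float]], elapsed_ms: float) -> dict:
    """Collect the fields worth inspecting when an answer looks wrong."""
    return {
        "query": query,
        "doc_ids": [doc_id for doc_id, _ in results],
        "scores": [round(score, 4) for _, score in results],
        "latency_ms": round(elapsed_ms, 2),
    }

def logged_retrieve(
    retrieve: Callable[[str], list[tuple[str, float]]],
    query: str,
    k: int = 6,
) -> list[tuple[str, float]]:
    """Run the retriever, log what it returned, and pass the results through."""
    start = time.perf_counter()
    results = retrieve(query)[:k]
    elapsed_ms = (time.perf_counter() - start) * 1000
    logger.info(json.dumps(build_log_record(query, results, elapsed_ms)))
    return results
```

If the right document ids appear in the log but the answer is still wrong, the issue sits on the generation side; if they are missing, look at retrieval first.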
Production checklist
Before going to prod, verify: incremental indexing (no full re-indexing on every update), batch embedding for reduced API costs, caching of frequent embeddings, P95 latency monitoring, and automated regression tests on a golden dataset.
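The first two checklist items can be sketched together: hash each document's content, embed only what changed, and send the changed texts in batches. This is a minimal illustration, assuming an `embed_batch` callable that takes a list of texts and returns one vector per text; the function names and the in-memory `seen_hashes` dict are hypothetical, and in production the hashes would live in a table next to the vectors.

```python
import hashlib
from typing import Callable, Iterator

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def batched(items: list, size: int) -> Iterator[list]:
    for i in range(0, len(items), size):
        yield items[i:i + size]

def incremental_index(
    docs: dict[str, str],
    seen_hashes: dict[str, str],
    embed_batch: Callable[[list[str]], list[list[float]]],
    batch_size: int = 64,
) -> dict[str, list[float]]:
    """Embed only new or changed docs; update seen_hashes in place."""
    changed = [
        (doc_id, text) for doc_id, text in docs.items()
        if seen_hashes.get(doc_id) != content_hash(text)
    ]
    new_vectors: dict[str, list[float]] = {}
    for batch in batched(changed, batch_size):
        # One embedding call per batch instead of one per document.
        vectors = embed_batch([text for _, text in batch])
        for (doc_id, text), vector in zip(batch, vectors):
            new_vectors[doc_id] = vector
            seen_hashes[doc_id] = content_hash(text)
    return new_vectors
```

The same hash can serve as a cache key for embeddings, which covers the caching item on the checklist as well.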
Finally a RAG tutorial that cuts to the chase. The pgvector + MMR retriever config is exactly what I needed. Adapted the example to my use case in under an hour, fantastic write-up.
Great to hear, Alex! MMR is really the way to go to avoid redundancy in the retrieved context. Good luck with your project!
The 400-600 token chunk advice saved me hours of debugging. I had chunks at 1500 tokens and couldn't understand why my answers were so vague. Immediate improvement after adjusting.
Really solid technical article. Any plans for a follow-up on RAG evaluation with RAGAS? I'm struggling to interpret some of the metrics and a practical guide would be very useful.
Noted, Ravi! A dedicated article on RAG evaluation with RAGAS is in our content pipeline. Stay tuned!