In our sector, everyone publishes their successes. We will tell you about a failure. Our first RAG deployment in production was stopped after 6 weeks. Here is what happened, why, and how we rebuilt everything correctly the second time.
Too much confidence, not enough rigour
The project: an AI assistant for a technical support team at an industrial company. 4,000 pages of technical documentation. We had done RAG internally before. We were confident. Too confident.
Error 1: neglecting data quality
We indexed all 4,000 pages directly, without preprocessing. The index ended up containing scanned documents with poor OCR, badly parsed tables, and technical diagrams with no textual context. As a result, the retriever returned incoherent chunks.
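A preprocessing gate of this kind can be sketched in a few lines. The example below is not the pipeline we ran; it is a minimal illustration that assumes pages have already been extracted to plain text, and it uses a deliberately crude heuristic (the share of tokens that look like real words or numbers) to flag pages for re-OCR or manual review. The function names and the 0.6 threshold are hypothetical.

```python
import re

# A token counts as "word-like" if it is a plain word (hyphens/apostrophes allowed)
# or a plain number, optionally followed by one punctuation mark.
WORDLIKE = re.compile(
    r"^[A-Za-z]+(?:[-'][A-Za-z]+)*[.,;:!?]?$"   # words such as "pump." or "re-start"
    r"|^\d+(?:[.,]\d+)?[.,;:]?$"                 # numbers such as "45" or "3.5,"
)


def text_quality_score(text: str) -> float:
    """Return the fraction of whitespace-separated tokens that look like words or numbers."""
    tokens = text.split()
    if not tokens:
        return 0.0
    good = sum(1 for token in tokens if WORDLIKE.match(token))
    return good / len(tokens)


def filter_pages(pages: dict[str, str], threshold: float = 0.6) -> tuple[dict[str, str], list[str]]:
    """Split pages into those clean enough to index and those flagged for review.

    The threshold is an assumption; it should be tuned on a sample of real pages.
    """
    kept: dict[str, str] = {}
    flagged: list[str] = []
    for page_id, text in pages.items():
        if text_quality_score(text) >= threshold:
            kept[page_id] = text
        else:
            flagged.append(page_id)
    return kept, flagged


# Illustrative input: one clean page and one page with typical OCR noise.
pages = {
    "p1": "Tighten the bolt to 45 Nm before restarting the pump.",
    "p2": "T1gh+en th3 b0|t ~~ #@ 4S N/m §§ r3st@rt",
}
kept, flagged = filter_pages(pages)
```

A heuristic like this only catches garbled text. Badly parsed tables and diagrams without context need their own checks.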
Error 2: default chunking
We used LangChain's default chunking: 1,000 characters with a 200-character overlap. For technical documentation with 15-step procedures, this was catastrophic: a procedure was often cut in the middle of a critical step.
The rule we learned: chunking must follow the semantic structure of the document, not an arbitrary character count. For technical docs: chunk by section or complete procedure.
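To make the rule concrete, here is a minimal sketch of structure-based chunking. It assumes the preprocessed documents mark section titles with Markdown-style `#` headings, which is an assumption made for illustration and may not match your format. Each section, including all of its numbered steps, becomes one chunk. The function name `chunk_by_section` is hypothetical.

```python
import re

# Assumption: section titles are lines starting with 1 to 6 '#' characters.
HEADING = re.compile(r"^#{1,6}\s+(.*)$")


def chunk_by_section(document: str) -> list[dict[str, str]]:
    """Split a document into one chunk per section, keeping each procedure whole."""
    chunks: list[dict[str, str]] = []
    title = "Preamble"          # label for any text that appears before the first heading
    lines: list[str] = []

    def flush() -> None:
        # Emit the section collected so far, skipping empty sections.
        body = "\n".join(lines).strip()
        if body:
            chunks.append({"title": title, "text": body})

    for line in document.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            title, lines = match.group(1).strip(), []
        else:
            lines.append(line)
    flush()
    return chunks


# Illustrative document with two short procedures.
doc = (
    "# Pump restart\n"
    "1. Close valve A.\n"
    "2. Wait 30 s.\n"
    "3. Open valve B.\n"
    "# Filter change\n"
    "1. Stop the line.\n"
)
chunks = chunk_by_section(doc)
```

Very long sections may still exceed the embedding model's input limit. In that case, splitting on step boundaries inside the section is a reasonable fallback, as long as the section title travels with every piece.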
Error 3: no systematic evaluation
We evaluated on 20 hand-written questions during development. In production, technicians asked 300 very different questions per day. Without automatic evaluation, we did not see the degradation coming.
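Even a very small automatic check would have surfaced the problem earlier. The sketch below is not RAGAS and not our evaluation setup. It computes a plain retrieval hit rate over a golden dataset, under the assumption that each golden question is labelled with the ID of the chunk that answers it. The `retrieve` callable, the field names, and the toy retriever are all hypothetical stand-ins.

```python
from typing import Callable, Sequence

# Hypothetical golden-dataset entry: {"question": ..., "expected_chunk_id": ...}
GoldenItem = dict[str, str]


def hit_rate_at_k(
    golden: Sequence[GoldenItem],
    retrieve: Callable[[str, int], list[str]],
    k: int = 5,
) -> float:
    """Return the share of questions whose expected chunk appears in the top-k results."""
    if not golden:
        raise ValueError("golden dataset is empty")
    hits = 0
    for item in golden:
        retrieved_ids = retrieve(item["question"], k)[:k]
        if item["expected_chunk_id"] in retrieved_ids:
            hits += 1
    return hits / len(golden)


def toy_retrieve(question: str, k: int) -> list[str]:
    """Stand-in for a real retriever: naive keyword lookup, for illustration only."""
    index = {"pump": "pump-restart", "filter": "filter-change"}
    matches = [chunk_id for keyword, chunk_id in index.items() if keyword in question.lower()]
    return matches[:k]


golden = [
    {"question": "How do I restart the pump?", "expected_chunk_id": "pump-restart"},
    {"question": "How often should the belt be replaced?", "expected_chunk_id": "belt-replacement"},
]
score = hit_rate_at_k(golden, toy_retrieve, k=3)
print(f"hit rate@3: {score:.2f}")
```

Running a check like this on every index rebuild, and failing the build when the score drops below an agreed threshold, turns evaluation from a one-off exercise into a regression test.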
Version 2: what we changed
We changed four things: full document preprocessing, semantic chunking by procedure section, automatic evaluation with RAGAS on a 200-question golden dataset, and production monitoring with Langfuse. Result: an 87% satisfaction rate within 4 weeks.
With care,
Excellent article, this matches exactly what we're seeing with our enterprise clients. The section on inference costs is especially valuable. It's a topic most articles gloss over but it's make-or-break at scale.
Thanks James! Inference cost optimization is often deprioritized during prototyping but becomes critical in production. Feel free to book a session if you'd like to go deeper on this.
Sharing this with my whole team. The distinction between an impressive demo and robust production is exactly the debate we're having internally right now. The human checkpoint advice is immediately actionable.
Great article. I'd push back slightly on the 18-day deployment estimate: in our experience with enterprise security and GDPR requirements, 4–6 weeks is more realistic for a first production agent.
Completely fair point David. The 18 days refers to a scoped first agent in a test environment. For full enterprise production with security constraints, your estimate is accurate.