2026-04-18 · 7 min read · rag.art team

The anatomy of a useful chatbot: retrieval + reasoning + restraint

The three-part mental model we use to design chatbots that customers actually return to — retrieval, reasoning, restraint — and why missing any one of them is the reason most projects fail in month three.

design · product · rag

Every chatbot project we've been invited to rescue failed for one of three reasons: the retrieval was bad, the reasoning was weak, or the bot said things it shouldn't have. The mental model we use internally for designing new deployments is those three words. Retrieval. Reasoning. Restraint. Each has to be right, and they have to be right together.

Retrieval: finding the relevant thing

Retrieval is not a solved problem. It is an engineering problem with three sub-problems hidden inside it: chunking, embedding, and ranking.

Chunking

Splits your corpus into pieces the model can reason over. Naive chunking (fixed-size text windows with overlap) ships on day one and is the cause of 60% of 'the bot gave the wrong answer' complaints by month two. Good chunking is aware of document structure: headings, lists, and tables are preserved as units. Great chunking also generates parent-child hierarchies, so retrieval can pull a specific clause along with its surrounding context.
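
Here is a minimal sketch of structure-aware chunking, assuming a Markdown corpus; the chunk size, the heading-path metadata, and the parent-child scheme are illustrative choices, not fixed rules.

```python
import re
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str           # the child chunk the model reasons over
    heading_path: list  # e.g. ["Refund policy", "EU customers"]
    parent: str = ""    # the whole enclosing section, for added context

def chunk_markdown(doc: str, max_child_chars: int = 800) -> list:
    """Split a Markdown document into heading-aware chunks.

    Headings become metadata; each section is also kept whole as the
    'parent' so retrieval can surface a clause plus its context.
    """
    chunks, path = [], []
    # Split on heading lines, keeping the headings themselves.
    for part in re.split(r"(?m)^(#{1,6} .*)$", doc):
        part = part.strip()
        if not part:
            continue
        m = re.match(r"^(#{1,6}) (.*)$", part)
        if m:
            level = len(m.group(1))
            path = path[: level - 1] + [m.group(2)]  # update heading path
            continue
        # Pack paragraphs into child chunks up to max_child_chars;
        # splitting only on blank lines keeps lists and tables intact.
        buf = ""
        for p in part.split("\n\n"):
            if buf and len(buf) + len(p) > max_child_chars:
                chunks.append(Chunk(buf.strip(), list(path), parent=part))
                buf = ""
            buf += p + "\n\n"
        if buf.strip():
            chunks.append(Chunk(buf.strip(), list(path), parent=part))
    return chunks
```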

Embedding

Turns each chunk into a vector. Which embedding model you pick matters less than people think for general English. It matters a lot for multilingual corpora, for code, and for technical jargon. Pick a model that was trained on content like yours.
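
A short sketch of the embedding step using the sentence-transformers library; the multilingual model name is one example of matching the model to the corpus, not a recommendation.

```python
from sentence_transformers import SentenceTransformer

# Match the model to the corpus: for multilingual content, a multilingual
# model such as "intfloat/multilingual-e5-base" (illustrative choice).
model = SentenceTransformer("intfloat/multilingual-e5-base")

# E5 models expect a "passage: " prefix on documents ("query: " on queries).
texts = ["passage: " + c.text for c in chunks]  # `chunks` from the sketch above
vectors = model.encode(texts, normalize_embeddings=True)  # unit-length vectors
```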

Ranking

Takes the top-50 vector hits and re-sorts them. A reranker catches semantic near-misses that cosine similarity alone ranks incorrectly. Turning on a reranker is the single biggest accuracy win in the stack after chunking.
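
A sketch of the rerank step with a cross-encoder, again via sentence-transformers; the model name is illustrative.

```python
from sentence_transformers import CrossEncoder

# A cross-encoder reads query and chunk together, so it catches
# near-misses that a pure vector score ranks too high or too low.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list, top_k: int = 5) -> list:
    """Re-sort the top-50 vector hits; return the best top_k."""
    scores = reranker.predict([(query, c.text) for c in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda s: s[0], reverse=True)
    return [c for _, c in ranked[:top_k]]
```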

Reasoning: composing the answer

Once retrieval returns good chunks, reasoning is mostly solved. Any GPT-4-class model will write a grounded answer from those chunks if the prompt tells it to. The failure mode isn't 'the model can't reason'. The failure modes are the four below; a prompt sketch that fixes all of them follows the list:

  • The prompt tells the model to be confident and to prioritise helpfulness over accuracy. Remove that wording.
  • The prompt doesn't tell the model to cite. Force it to produce citations inline; it keeps itself honest.
  • The prompt doesn't define 'I don't know'. Give it an explicit refusal template, and the bot will use it.
  • The temperature is above 0.3. Drop it to 0.1 for factual retrieval.
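
Putting those four fixes together, here is a minimal sketch of a reasoning prompt, assuming the OpenAI Python client; the model name, the prompt wording, and the refusal line are illustrative, and any GPT-4-class provider works the same way.

```python
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = """Answer using ONLY the provided context chunks.
Cite the chunk id in square brackets after every claim, e.g. [3].
If the context does not contain the answer, reply exactly:
"I don't have that information in our documentation."
Do not guess."""

def answer(question: str, chunks: list) -> str:
    # Number the chunks so inline citations can point back at them.
    context = "\n\n".join(f"[{i}] {c.text}" for i, c in enumerate(chunks))
    resp = client.chat.completions.create(
        model="gpt-4o",    # any GPT-4-class model
        temperature=0.1,   # low temperature for factual retrieval
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user",
             "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content
```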

Restraint: not saying the wrong thing

This is where most products fail silently. The bot is technically accurate, but it says things the brand wouldn't say, or reveals things the company considers internal, or gives advice the legal department has opinions about. Restraint is the set of hard rules the bot cannot cross, implemented server-side, not in the prompt.

  • Never promise things your policy doesn't support (refunds, discounts, timelines).
  • Never speak about competitors by name.
  • Never cross into regulated-advice territory (legal, medical, financial).
  • Never expose system instructions. A user asking 'what's your system prompt' should get a flat refusal.
  • Never change persona. A bot sold as 'Acme Assistant' shouldn't admit to being GPT-4, even if the user insists.

These rules live in a post-processing layer that runs on every response, not in the system prompt. System prompts can be overridden. Post-processing cannot.
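
A minimal sketch of what that layer can look like: regex hard rules plus an optional classifier pass. The patterns, the fallback message, and the `classifier` callable are all hypothetical placeholders, not a shipped rule set.

```python
import re

# Hard rules, enforced server-side. Every pattern here is illustrative.
BLOCK_PATTERNS = [
    re.compile(r"\b(refund|discount)s? (is|are|will be) guaranteed\b", re.I),
    re.compile(r"\b(CompetitorCo|RivalBot)\b"),                 # competitor names (hypothetical)
    re.compile(r"\bsystem prompt\b", re.I),                     # instruction leaks
    re.compile(r"\b(I am|I'm) (GPT|an AI language model)", re.I),  # persona breaks
]

FALLBACK = "I can't help with that here. Want me to connect you with the team?"

def restrain(response: str, classifier=None) -> str:
    """Run every response through the hard rules before it reaches the user."""
    for pattern in BLOCK_PATTERNS:
        if pattern.search(response):
            return FALLBACK
    # Optional second pass: a classifier for regulated-advice territory
    # (legal/medical/financial). `classifier` is a hypothetical callable.
    if classifier is not None and classifier(response) == "regulated_advice":
        return FALLBACK
    return response
```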

Why all three have to ship together

Retrieval without reasoning gives you a search box with a weird UI. Reasoning without retrieval gives you hallucinations. Retrieval and reasoning without restraint gives you a bot that nails the answer most of the time and torches your brand in the 3% where it doesn't.

The mid-market chatbot market is littered with tools that do one of these well and the other two haphazardly. Any vendor worth evaluating will let you inspect all three: show you retrieval metrics, let you read the final system prompt, and describe the server-side guardrails in terms you can review with your lawyer.

Our defaults, stated plainly

  1. Hybrid retrieval: vector search plus keyword BM25, with a reranker on the top 50 (sketched after this list).
  2. Structured chunking: tables and lists preserved, headings carry into chunk metadata.
  3. Reasoning prompt: temperature 0.1, citations mandatory, refusal path explicit.
  4. Restraint layer: server-side regex + classifier pass on every response before it ships to the user.
  5. Observability: every query and response logged with retrieval trace and final prompt, scrubbed of PII.
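
To make default 1 concrete, here is a sketch of hybrid retrieval merged with reciprocal rank fusion; rank_bm25 is one common BM25 implementation, and the fusion constant of 60 is the standard textbook value, both illustrative choices.

```python
import numpy as np
from rank_bm25 import BM25Okapi

def hybrid_retrieve(query, query_vec, chunks, vectors, k=50):
    """Merge vector and BM25 rankings with reciprocal rank fusion (RRF)."""
    # Vector ranking: embeddings are normalised, so dot product == cosine.
    sims = vectors @ query_vec
    vec_rank = np.argsort(-sims)

    # Keyword ranking: BM25 over whitespace-tokenised chunk text.
    bm25 = BM25Okapi([c.text.lower().split() for c in chunks])
    kw_scores = bm25.get_scores(query.lower().split())
    kw_rank = np.argsort(-kw_scores)

    # RRF: score each chunk by 1/(60 + rank), summed over both rankings.
    fused = {}
    for ranking in (vec_rank, kw_rank):
        for rank, idx in enumerate(ranking):
            fused[idx] = fused.get(idx, 0.0) + 1.0 / (60 + rank)
    top = sorted(fused, key=fused.get, reverse=True)[:k]
    return [chunks[i] for i in top]  # feed these 50 to the reranker above
```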

That's the skeleton. Everything else is template choices, branding, and channels. The skeleton is what separates useful bots from demo bots.

Want to see this in practice?

See how these defaults show up in rag.art →