Context Engineering: Why Your AI App Keeps Getting It Wrong
Your AI app hallucinates, ignores your data, or gives confidently wrong answers. The problem is almost certainly not the model — it's your context architecture. Here's what context engineering actually is and a checklist to fix it.
You gave your AI app careful instructions. You wrote the prompts. You connected it to your database. And it still returns wrong answers, hallucinates facts, or ignores the context you gave it.
The problem is almost certainly not the model.
It's your context architecture.
What "Context" Actually Means
When you send a message to an AI model, the model only knows three things:
- What you told it to be (system prompt)
- What data you retrieved and passed to it
- The conversation history so far
That's it. The model has no memory. No instinct. No common sense beyond its training. Every response is only as good as what you put in the context window.
Context engineering is the discipline of designing exactly what goes into that window — and what stays out.
The Myths That Break Most AI Apps
Myth 1: "Better prompts will fix my AI app."
The reality: Prompts are about 10% of why an AI app works. The other 90% is what data you retrieve, when you retrieve it, how you chunk it, what you filter out, and how you structure conversation history.
A brilliant prompt with the wrong documents retrieved = wrong answer. A mediocre prompt with exactly the right context = correct answer.
I've seen this dozens of times. The developer rewrites the prompt 20 times. Marginally better. Then I look at the retrieval layer — it's fetching the three least relevant documents in the entire database.
Fix the retrieval. The prompt becomes almost irrelevant.
Myth 2: "More context = better answers."
The reality: This is one of the most damaging mistakes.
Models exhibit what researchers call "lost in the middle" behaviour — they attend strongly to the start and end of the context window, and largely ignore everything in between.
If you retrieve 20 documents when you need 2, the model frequently uses the wrong one.
The principle: give the model exactly what it needs, nothing more.
Good retrieval is about relevance, not volume.
Myth 3: "My retrieval is working, so why are answers wrong?"
The reality: Retrieval and generation are two separate failure modes.
Common generation failures even with good retrieval:
- The retrieved text is too long and the key fact is buried in the middle
- The question is ambiguous and the model picks the wrong interpretation
- The context contains contradictions the model does not resolve
- The model defaults to its training data instead of the retrieved content
Fix: Add a re-ranking step after retrieval. Score retrieved documents by relevance before passing to the model. Cut anything below threshold.
Myth 4: "AI hallucinations are a model problem — nothing I can do."
The reality: Hallucinations are almost always a context problem.
When a model hallucinates, it fills in gaps with plausible-sounding content. The gap exists because the context did not have the right information at the right time.
Solutions that actually work:
- Grounding: Force the model to only answer from retrieved content — "If the answer is not in the provided documents, say: I don't have that information."
- Citation: Require the model to cite which document it is drawing from. This forces precision and makes hallucinations instantly visible.
- Confidence gates: If retrieval similarity is below 0.65, skip the LLM entirely and return a fallback response.
Myth 5: "Prompt engineering and context engineering are the same thing."
The reality:
- Prompt engineering = how you phrase your instructions.
- Context engineering = the entire information architecture of your AI system.
Context engineering includes:
- Retrieval strategy — which documents to fetch, how many, by what method
- Chunking — how to split documents (chunk size matters enormously for retrieval quality)
- Re-ranking — scoring and filtering retrieved results before they reach the model
- Memory architecture — what to remember across sessions, how to summarise long conversations
- Tool definitions — how you describe available functions to the model
- Conversation structure — how you format dialogue history passed to each API call
- What you leave out — deliberately removing information that would confuse or distract
Prompt engineering is one tool inside this larger architecture.
The Diagnostic Checklist
If your AI app is behaving badly, run through this in order:
Retrieval
- Are you retrieving the right documents? (Log what gets retrieved for failing queries)
- Are your chunks the right size? (200–500 tokens is usually optimal)
- Are you re-ranking retrieved results by relevance score?
- Are you passing too many documents? (Try cutting to top 3)
Generation
- Is the model instructed to only use retrieved content?
- Are you requiring citations in responses?
- Is the conversation history growing too long? (Add summarisation above 4,000 tokens)
- Are there contradictions in your context?
Memory
- Does the model need cross-session memory? (Add persistent store)
- Is conversation history trimmed when it gets long?
- Are you passing duplicate information?
Error handling
- Do you have fallback responses when retrieval confidence is low?
- Are you logging what context went into each failed response?
- Do you have retry logic with exponential backoff?
What Good Context Engineering Looks Like
This pipeline works for most RAG applications:
User query
↓
Query rewriting (rephrase for better retrieval)
↓
Retrieval (top 10 candidates)
↓
Re-ranking (cut to top 3 by relevance score)
↓
Relevance gate (if max score < 0.65 → return fallback)
↓
Context assembly (docs + conversation history + system prompt)
↓
Generation (with citation requirement)
↓
Response validation (does it cite? does it stay grounded?)
↓
User
Most broken AI apps skip steps 3–6 entirely.
Where to Start If Your App Is Broken Right Now
Step 1 — Log your retrieval. For every failing query, log exactly which documents were retrieved and their similarity scores. You will usually see the problem immediately.
Step 2 — Test retrieval in isolation. Disconnect the LLM and just test whether the right documents come back for your test queries. Fix this layer first.
Step 3 — Add a relevance threshold. If the best retrieved document scores below 0.65 cosine similarity, do not call the LLM. Return: "I don't have reliable information on that." This alone eliminates most hallucinations.
Step 4 — Reduce context. Cut retrieved documents from 10 to 3. You will almost certainly see immediate improvement.
Step 5 — Add grounding instructions. Add to your system prompt: "Only answer using the provided context. Never use information from outside the provided documents."
If you've tried the above and it's still broken, the issue is likely architectural. I do AI project rescue — free 48-hour audit, honest diagnosis, fixed in 1–2 weeks.