
Intelligence Architecture — Published: 23 March 2026 · 10 min read

The Card in the Machine: Why RAG Is the Most Important Idea in Enterprise AI Right Now

Your company holds a library of irreplaceable knowledge — a century of formulas, decisions, and expertise. The AI can’t read all of it at once. But with Retrieval-Augmented Generation, it doesn’t need to. It only needs the one piece that matters.


Imagine a room. Not a small room — a vast one. Floor-to-ceiling shelves receding into the distance, every shelf packed tight with binders, ledgers, lab notebooks, printed reports, technical manuals, and supplier correspondence going back a century. This is the knowledge base of Cavendish & Co., a beverage company that has spent 100 years perfecting its drinks.

In that room lives everything: the proprietary citrus extraction ratios, the seasonal blending guides, the handwritten notes from the founding chemist, the supplier contracts, the quality control records, the reformulation history for every SKU. Somewhere north of one gigabyte of institutional memory — the beating heart of the company’s competitive advantage.

Now a junior product developer needs a quick, accurate answer to a seemingly simple question: “What is the approved citrus concentration for the summer-run elderflower tonic?”

Without AI, this is a 45-minute expedition through shared drives, a call to someone who might remember which binder it’s in, and a healthy dose of luck. But surely, in 2026, we can do better? Surely you can just feed the machine your library?

Here is where most people hit a wall they did not know existed.

The analogy

The machine at the end of the corridor.

Picture, at the far end of that library, a machine unlike any other. A heavy steel box, the size of a wardrobe, finished in brushed gunmetal. Its face is dominated by a single green phosphor monitor — the kind you remember from the early 1980s, text glowing with that distinctive soft-edged light. Below the screen, a small brass plaque reads: Ask anything. Receive the truth.

Beneath the plaque is a slot. A narrow, precise opening — exactly the width and height of an index card. Not a door, not a tray, not a drawer. A slot. Like the card reader on an ATM, it will accept one thing and one thing only: a small piece of paper, no larger than the palm of your hand, fed in face-up.

The machine is extraordinary. Its reasoning is deep, its language fluent, its ability to synthesise and explain — nothing short of astonishing. A wonder brain in a metal cabinet. But it has one immovable, physical constraint: it can only read what you slide through that slot.

You cannot carry armfuls of binders to it. You cannot wheel in a trolley of folders. The slot is the slot. If you tried to force in a thick document, it would jam. The machine would see nothing.

Fig. 1  The Machine — An Analogy for How LLMs Receive Context. [Diagram: the library, ~1 GB of institutional knowledge, stays in place; the RAG retriever transcribes ~3 KB of signal onto an index card; the machine — the LLM, reading only its context window — scans the card and returns a natural-language answer to the user.]
“The machine is not limited by intelligence. It is limited by the size of the opening. RAG is the system that ensures you always slide in exactly the right card.”

The technical reality

The context window: the slot has a name.

In the world of Large Language Models, that slot has a technical name: the context window. It is the total volume of text a model can read and reason over in a single interaction — its working memory, its field of view, the sum total of what it can know about your question in this moment.

The context window is measured in tokens — a token is roughly three-quarters of a word, or about four characters of English text — and while it has grown considerably in recent years, it remains finite. The table below shows where the most widely deployed models currently stand:

Model | Provider | Context Window | Approx. Text Size
GPT-4o | OpenAI | 128K tokens | ~0.5 MB
GPT-4.1 | OpenAI | 1M tokens | ~4 MB
Claude Opus 4 | Anthropic | 200K tokens | ~0.8 MB
Gemini 2.0 Flash | Google DeepMind | 1M tokens | ~4 MB
Gemini 1.5 Pro | Google DeepMind | 2M tokens | ~8 MB
Llama 3.1 405B | Meta (open) | 128K tokens | ~0.5 MB
Mistral Large | Mistral AI | 128K tokens | ~0.5 MB
DeepSeek-V3 | DeepSeek | 128K tokens | ~0.5 MB

(Text sizes assume the common rule of thumb of ~4 characters per token.)

↑ Even the largest commercial context window — roughly 8 MB of text — is a rounding error against a 1 GB knowledge base. The gap is on the order of 125×.

That gap is not a software problem to be patched in the next release. It is a fundamental architectural reality: no model will have a context window large enough to ingest an entire enterprise knowledge base in a single pass, and the knowledge base keeps growing. The question is not how do we make the slot bigger but how do we always slide in exactly the right card?
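The arithmetic behind that gap fits in a few lines. This is a back-of-envelope estimate, not a measurement — it assumes the common rule of thumb of roughly four characters (about four bytes of plain English text) per token:

```python
# Back-of-envelope arithmetic for the knowledge-base vs. context-window gap.
# Assumes ~4 characters (~4 bytes of plain English text) per token.

BYTES_PER_TOKEN = 4

def tokens_for(size_bytes: int) -> int:
    """Rough token count for a given volume of plain text."""
    return size_bytes // BYTES_PER_TOKEN

knowledge_base_tokens = tokens_for(1_000_000_000)  # ~1 GB of text
largest_window_tokens = 2_000_000                  # largest window in the table above

print(f"Knowledge base: ~{knowledge_base_tokens:,} tokens")  # ~250,000,000
print(f"Largest window: {largest_window_tokens:,} tokens")
print(f"Gap: ~{knowledge_base_tokens // largest_window_tokens}×")  # ~125×
```

The exact ratio moves with the bytes-per-token assumption, but the conclusion does not: the library is two orders of magnitude larger than the slot.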

The solution

The librarian who prepares the card.

This is precisely what Retrieval-Augmented Generation — RAG — does. It does not try to feed the whole library through the slot. Instead, it employs an extraordinarily fast and intelligent librarian who knows the entire collection by heart, can locate the precise passage you need in milliseconds, and transcribes only the relevant information onto a single index card — together with your original question — before sliding it through.

For our product developer, that card might contain exactly three things: the original question about citrus concentration, a single paragraph from the 2019 summer formulation guide specifying the approved ratio, and one cross-reference note from a 2023 quality review confirming it was revalidated. Perhaps 3 kilobytes of text. The other 999,997 kilobytes of the knowledge base remain safely on the shelf — untouched, unseen by the machine, and entirely under Cavendish & Co.’s control.

The machine scans the card. Its wonder brain goes to work. And thirty seconds later, the green monitor displays a precise, accurate, natural-language answer — as if the world’s foremost authority on elderflower tonics had just read the relevant documents and was speaking directly to you.

Architecture

RAG in technical terms: step by step.

The library analogy maps cleanly onto the technical architecture. Here is exactly what happens when the product developer types their question.

① The query arrives. The developer’s question enters the RAG system. Before it reaches the LLM, it is intercepted by the retrieval layer.

② The knowledge base is searched. The system converts the question into a mathematical vector — a numerical fingerprint of its semantic meaning. It searches a vector store where all company documents have been pre-indexed in the same format. This is the librarian working at superhuman speed, comparing the shape of the question against the shape of every passage in the collection, and surfacing the two or three most relevant ones.

③ The card is prepared. The retrieved passages — a few paragraphs at most — are combined with the original question to form the assembled prompt. Total size: 3–5 kilobytes. The entire knowledge base remains in the vector store. Nothing else moves.

④ The card is slid through the slot. The assembled prompt is passed to the LLM. It reads only this card — no more, no less. It processes what it has been given, applies its formidable reasoning capabilities, and returns a precise, grounded answer.
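The four steps above can be sketched end to end. Everything here is illustrative: a toy bag-of-words "embedding" stands in for a real embedding model, a Python list stands in for the vector store, and the Cavendish & Co. document chunks are invented. Step ④ — the actual LLM call — is represented by the assembled card, since that string is all the model would ever receive:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """① Toy semantic fingerprint: lowercase word counts (a real system
    would call an embedding model and return a dense vector)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Similarity between two fingerprints."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# ② Pre-indexed knowledge base: (chunk text, its vector). Contents invented.
chunks = [
    "Summer formulation guide 2019: approved citrus concentration for elderflower tonic is 4.2 g/L.",
    "Supplier contract renewal terms for glass bottles, 2021.",
    "Quality review 2023: elderflower tonic citrus ratio revalidated at 4.2 g/L.",
]
index = [(c, embed(c)) for c in chunks]

def retrieve(query: str, k: int = 2) -> list[str]:
    """The librarian: surface the top-k most relevant chunks."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [c for c, _ in ranked[:k]]

def assemble_card(query: str) -> str:
    """③ The index card: retrieved context + original question, a few KB at most."""
    context = "\n".join(retrieve(query))
    return f"Context:\n{context}\n\nQuestion: {query}"

# ④ This string — and only this string — is what gets slid through the slot.
card = assemble_card("What is the approved citrus concentration for the elderflower tonic?")
```

Running this retrieves the 2019 formulation paragraph and the 2023 revalidation note, while the supplier contract stays on the shelf — the librarian at work in miniature.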

Fig. 2  RAG Architecture — From 1 GB Knowledge Store to Precise LLM Response. [Diagram: ① the user query → ② the vector store, an embeddings index over PDF, DOCX, and wiki sources (~1 GB knowledge base, stays in your environment) → ③ the RAG retriever assembles the top-K chunks plus the prompt into a ~3–5 KB "index card" → ④ the LLM returns a precise, grounded natural-language answer to the user; your IP never enters the LLM.]

The IP advantage

The library never leaves.

There is a dimension to this architecture that goes well beyond convenience — and it is the one that should matter most to any company with genuinely valuable proprietary knowledge: the LLM sees only the card, never the library.

When you interact with a commercial AI system without RAG — pasting documents directly into a chat interface, uploading files — the contents of those documents travel to the model provider’s servers. Depending on the service’s data policies, that information may be used to improve future versions of the model. For a company like Cavendish & Co., whose century-old formulas constitute the core of its competitive advantage, this is not a theoretical risk. It is a material one.

In a properly architected RAG system, the library never leaves. The vector store and the original documents sit entirely within your environment — your cloud account, your private infrastructure, your control. The only thing that travels to the LLM is the assembled card: the question plus the retrieved context. A few kilobytes of information. Specific, necessary, and minimal.

Under standard enterprise API terms, the LLM does not train on the card and does not retain it beyond the request. It reads the card, answers the question, and moves on. Just as the ATM card reader scans your card and hands it back — your account details never become part of the machine itself.
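In code, the trust boundary is easy to see: the only value that ever appears in the outbound request is the assembled card. The payload shape below is a hypothetical sketch — not any specific provider's API — and the card text is invented for illustration:

```python
# What actually crosses the trust boundary in a RAG request.
# Hypothetical sketch: payload shape and card text are illustrative,
# not any specific provider's API.

card = (
    "Context: Approved citrus concentration for the summer-run "
    "elderflower tonic is 4.2 g/L (2019 guide, revalidated 2023).\n"
    "Question: What is the approved citrus concentration?"
)

payload = {"prompt": card}  # the only data that leaves your environment

# The ~1 GB knowledge base is never part of the request; the card
# stays comfortably within the 3–5 KB budget described above.
assert len(payload["prompt"].encode("utf-8")) < 5_000
```

Auditing that boundary is straightforward precisely because it is so narrow: log the outbound payload, and you have a complete record of what the model was ever shown.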

“RAG lets you use the world’s most powerful reasoning engines without ever showing them your crown jewels.”

Precision

Grounded, not guessed

Answers come directly from your documents. Because the model is constrained to the retrieved passages, hallucination and invention are sharply reduced — what it says is what is written in your own files.

IP Protection

The library stays with you

Only the assembled card reaches the machine. Your proprietary data never touches the model provider’s infrastructure.

Scale

Unlimited knowledge base

As your data grows from one gigabyte to ten, RAG scales without modification: the retriever still selects a few kilobytes per query, so the context window never becomes the bottleneck.

Currency

Always up to date

Update a document today, the retriever finds the new version tomorrow. No model retraining, no downtime, no stale answers.
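The "update a document today" property is just an index operation. A minimal sketch, with assumptions labelled: `embed()` is a placeholder for a real embedding model, the dict stands in for a vector store keyed by document ID, and the chunk text is invented:

```python
# Sketch of the "currency" property: updating a document means re-embedding
# and replacing its chunks in the index — the model itself is never retrained.
# embed() is a placeholder; the dict stands in for a vector store.

doc_index: dict[str, list[tuple[str, list[str]]]] = {}

def embed(text: str) -> list[str]:
    """Placeholder embedding: a real system would return a dense vector."""
    return text.lower().split()

def upsert_document(doc_id: str, chunks: list[str]) -> None:
    """Replace every chunk for a document; the next query sees only the new version."""
    doc_index[doc_id] = [(c, embed(c)) for c in chunks]

# Index the 2019 guide, then supersede it with a revision (invented values).
upsert_document("formulation_guide", ["Citrus concentration: 4.2 g/L (2019)."])
upsert_document("formulation_guide", ["Citrus concentration: 4.5 g/L (2026 revision)."])
# From this point on, the retriever can only ever surface the 2026 revision.
```

Production stores batch and version these upserts, but the principle holds: freshness is a property of the index, not of the model.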

What this means in practice

The slot is not a limitation. It is a feature.

At OntosLab, RAG is the foundation of how we build AI systems for clients with complex, proprietary knowledge. The pattern is consistent: whether it is a century-old beverage formulator, a financial institution with terabytes of regulatory filings, or an engineering firm with decades of technical drawings — the core architecture is always a well-designed retrieval layer sitting between the knowledge and the model.

We design the retrieval layer around your existing data structures. We ensure the knowledge remains in your infrastructure. And we pass only the relevant, assembled context to the model — giving you intelligent, accurate, and defensible AI responses without surrendering an ounce of your intellectual property.

The context window is not a problem to engineer around. It is a feature, when the retrieval layer is built correctly. The machine can only read what you slide through the slot — and with RAG, you will always slide in exactly the right card.

RAG at a glance
~125× Typical gap between enterprise knowledge and LLM context window
3–5 KB Typical assembled context passed to the LLM per query
Zero Proprietary data that needs to leave your environment
Common Stack
Vector DB · Embeddings Model · Azure OpenAI · LangChain / LlamaIndex · Private Cloud

Build with us

Ready to put RAG to work in your organisation?

If you are sitting on a large, valuable knowledge base and your current AI setup cannot reach most of it — or if you need answers grounded in your own data without exposing your IP — this is exactly what we build. Let’s talk.

Book a discovery call