
How to set up a well-structured AI knowledge base: the secret to consistent, on-brand output

A common way to set up an AI knowledge base backfires. The instinct is to put everything into the vault - all the context, all the company knowledge, all your files, all your documents - everything needs to live in there. Then shove the AI straight into that and tell it to use all of it, all the time.

High-volume context reduces output quality. And there is a growing body of research that explains exactly why.

Volume reduces output quality

The way large language models work clashes with large, complex knowledge bases. When you shove everything into one knowledge base and tell the system to always work through it, it will try to read everything and bring it all into every output. It is trying to remember your strategy, your ICP, your writing rules, and every constraint you have ever written down - all at once.

The problem is that everything at the top gets preferential treatment, everything at the bottom might be forgotten, and a random transcript from a sales meeting can skew the whole output.

Stanford and UC Berkeley researchers tested this in "Lost in the Middle." They found a U-shaped attention curve - LLMs consistently favour content at the beginning and end of their context window and neglect what sits in the middle. When key information was placed in the middle of the input, accuracy dropped by over 30%. In some cases, the models performed worse with relevant information in the middle than they did with no documents at all - zero context outperformed buried context.

MIT researchers followed up in 2025 and identified the root cause as a structural feature of how transformer attention works. Larger context windows do not resolve this - it is built into how the architecture processes attention.

So if there is a little rule in your knowledge base that says "don't use em dashes" or "always use sentence case" and it is sitting somewhere in the middle of a 15-page vault, the model might just forget to apply it because it is dealing with so much context. We have seen this happen repeatedly.

Context contamination

Irrelevant content actively degrades output - it steals attention from what matters and lowers the quality of everything the model produces.

If there is a transcript from a sales meeting in there, the system might put emphasis on it and skew everything towards a single conversation. You are trying to write an SEO blog and the model is pulling phrasing and framing from a sales call that happened to be sitting in the same vault.

Research from ICLR 2025 found that RAG system accuracy consistently falls below retrieval recall when irrelevant passages are present, meaning the presence of relevant information does not guarantee correct answers when it is surrounded by noise. Irrelevant documents do not just sit there passively. They actively steal attention from the documents that matter and degrade the quality of every output.

A separate benchmark testing across production RAG systems found that performance can drop by 10% or more simply by increasing the volume of irrelevant pages - same questions, same relevant documents, just more noise around them.

A two-layer architecture

The better way to do it is to think of your knowledge base as two separate things. There is a company knowledge base, and there is a library or archive. Each layer has a distinct purpose, and each works best when used on its own terms.

Layer one: The company knowledge base (always on)

This is the foundation. It stays lean, it stays clean, and it stays to the point.

What belongs here:

Your strategy and positioning - who you are, what you stand for, where you sit in the market. This is the north star that every piece of content should align with, whether it is a tweet or a 2,000-word article.

Your ICP and audience - who you are talking to, what they care about, what language resonates with them. This shapes every word the model writes.

Your product information - what you sell, what the benefit is, and how you talk about it. Keep this factual and tight. Feature lists, core value propositions, key differentiators.

Your brand identity and writing style - the rules that define how your content shows up. Reading level, sentence structure preferences, formatting conventions. Things like "use grade eight reading level" or "short paragraphs, two to four sentences."

Your hard rules - the non-negotiables. Never do this, always do that. These are the constraints that protect brand consistency. Examples: never use em dashes, always use sentence case for headings, never use the word "leverage," always use Oxford commas. Hard rules need to be explicit and positioned prominently - small rules buried in large contexts are the first things to get dropped.

Your tone of voice - how you sound when you speak. The personality, the register, the feel, the vibe of the brand. Are you formal or conversational? Witty or straightforward? This defines the human quality of everything the AI produces.

These rarely shift - unless you launch a new product, rebrand, or pivot your positioning. The knowledge base should be stable and ever-present. Whatever the output is - an SEO blog, a social media post, or a thought leadership article - this piece of context should always come in.

How to structure it practically

Try to keep the total length under 5,000 words. Every section should have a clear heading. Hard rules should sit near the top or be given their own clearly labeled section - the attention bias research shows the model attends to early content most reliably, so put what matters first.
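Those two checks - total length and rule placement - are easy to automate. Here is a minimal sketch of a lint pass you could run over your knowledge base file; the word limit, the "hard rules" heading text, and the first-third threshold are all illustrative assumptions, not a fixed standard.

```python
def check_knowledge_base(text: str, word_limit: int = 5000) -> list[str]:
    """Return warnings about knowledge-base length and hard-rule placement.

    Assumes the hard rules live under a heading containing the phrase
    "hard rules" (illustrative convention, adjust to your own headings).
    """
    warnings = []
    words = text.split()
    if len(words) > word_limit:
        warnings.append(
            f"Knowledge base is {len(words)} words; keep it under {word_limit}."
        )
    rules_pos = text.lower().find("hard rules")
    if rules_pos == -1:
        warnings.append("No clearly labeled hard-rules section found.")
    elif rules_pos > len(text) // 3:
        # Attention-bias research: early content is attended to most reliably.
        warnings.append("Hard rules sit past the first third; move them up.")
    return warnings
```

Run it whenever the knowledge base changes - an empty list means the file passes both checks.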

Use direct, declarative language. Do not write your knowledge base like marketing copy. Write it like instructions. "The brand voice is confident and warm. Sentences are short. Paragraphs never exceed four sentences. We do not use jargon." The cleaner the instructions, the more reliably the model follows them.

Do not feed it AI-generated slop - monkey see, monkey do. You cannot tell the model to avoid em dashes while your example text is full of them. And give it examples of what the output should look like and match. Do not leave "warm but witty" up to interpretation for a large language model - it will fail you every single time.

Test it by running the same prompt through different workflow types. If your social post output and your blog output both sound like your brand, the knowledge base is working. If one feels off, the knowledge base is probably too cluttered or the relevant rules are buried too deep.

Layer two: The library or archive (selective)

This is where your files live. The key principle is simple: the library stays separate and is never ever-present - you pull in only what you need for each specific task.

What belongs here:

Transcripts - meeting transcripts, podcast recordings, webinar recordings, sales calls, customer interviews. These are source material, not standing instructions.

Playbooks and documents - process docs, campaign briefs, research reports, competitive analyses. Useful for specific tasks, irrelevant for others.

Previous content - past blog posts, articles, social content that you might want to reference or repurpose.

Raw notes and research - anything captured that might feed into future content but should not influence every output.

The rule is that nothing in the library gets pulled into a workflow unless you deliberately select it for that specific task.

How to pull library items for surgical, on-brand output

This is where the two layers work together. The company knowledge base shapes how the AI writes, and library items supply what it writes about.

Here is the practical workflow:

Start with the task. Define what you are creating - a thought leadership article, a social post, an SEO blog, a newsletter. The company knowledge base is already loaded. It is always on. The model already knows your voice, your audience, your rules.

Select your library inputs deliberately. Ask yourself: what specific source material does this task need? If you recorded a podcast yesterday and want to turn it into an article, pull in that transcript. Just that transcript. Not every transcript from the last six months.

Work through a real example to see the method in practice. Say you run a podcast series and you ask every guest the same question across 24 episodes. You want to write a roundup article pulling insights from all of them. In that case, you pull all 24 transcripts into the workflow. But you do not want a sales call from last week to influence that piece. The library setup lets you make that choice.

Use library files, notes and documents as specific input. Say you have a customer case study interview transcript and you want to turn it into a LinkedIn post. You pull in the transcript, the company knowledge base shapes the voice and format, and the output is surgical - informed by that specific conversation, written in your brand's style, untouched by anything irrelevant.

The fewer library items you pull in per task, the cleaner the output. This is directly supported by the research - every irrelevant document in context degrades the signal. Be selective. Be deliberate. Cherry-pick what you want to use.
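The workflow above can be sketched as a small context-assembly step. Everything here is illustrative - the function name, the library keys, and the plain-text joining are assumptions, not a prescribed implementation - but the shape is the point: the knowledge base is always included, and library items enter only when explicitly named.

```python
def build_context(knowledge_base: str, library: dict[str, str],
                  selected: list[str]) -> str:
    """Combine the always-on knowledge base with hand-picked library items."""
    missing = [name for name in selected if name not in library]
    if missing:
        raise KeyError(f"Not in library: {missing}")
    # Layer one: always on, every task, every output.
    parts = ["# Company knowledge base (always on)", knowledge_base]
    # Layer two: only what was deliberately chosen for this task.
    for name in selected:
        parts.append(f"# Library item: {name}")
        parts.append(library[name])
    return "\n\n".join(parts)

library = {
    "podcast-ep12-transcript": "...",
    "sales-call-2024-06-03": "...",
}
# A blog drawn from the podcast: pull that transcript and nothing else.
context = build_context("Brand voice: confident and warm.", library,
                        ["podcast-ep12-transcript"])
```

The sales call never reaches the context window, so it cannot contaminate the output - the selection is the safeguard, not the prompt.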

Why this works (the research summary)

The two-layer approach works because it directly addresses both failure modes that the research has identified:

Attention bias is managed by keeping the always-on knowledge base small enough that nothing important gets lost in the middle. When your knowledge base is 2,000 words instead of 20,000, the model can attend to all of it reliably. Your hard rules do not get buried. Your tone of voice does not get forgotten.

Context contamination is eliminated by keeping task-irrelevant files out of the context entirely. A sales transcript cannot skew your blog post if it was never in the context window to begin with. The library architecture ensures that the only source material influencing an output is the material you specifically chose for that output.

Keep the knowledge base small and clean. Let your library do the heavy lifting when - and only when - you tell it to.