Beyond ChatGPT: How to Implement RAG for a Custom AI Knowledge Base
Stop giving generic answers. Learn to build an AI chatbot that truly understands your business using Retrieval-Augmented Generation (RAG).

Your AI Knows Everything Except Your Business
The first RAG demo I ever shipped confidently told a customer that a sofa came with a 90-day return window. The actual policy was 30 days. The model didn't lie on purpose. It pulled a chunk from an old blog post, padded it with what sounded plausible, and handed back a clean, well-formatted, completely wrong answer. That's the thing nobody warns you about: a bad RAG pipeline doesn't fail loudly. It fails politely, and it sounds right.
So let me skip the textbook intro. You already know GPT-class models can write code and explain quantum physics. The problem is they have zero idea what your return policy is, what your flagship product costs, or whether the Azure Dream sofa ships to a given zip code. Retrieval-Augmented Generation (RAG) is how you close that gap, and most of the work is in the unglamorous parts: how you chunk, what you retrieve, and how hard you force the model to stick to what it was given.
I've built this for real. The Ask Shopify assistant on this site is a RAG system over Shopify's docs. Everything below comes from making that work, not from a tutorial.
What RAG actually is
RAG connects a language model to an external knowledge source so it answers from your data instead of its training set. Two moving parts:
- The retriever searches your knowledge base (docs, FAQs, support tickets, product specs) and pulls back the snippets most relevant to the question.
- The generator is the LLM. It gets the question plus those snippets and writes an answer grounded in them.
That's it. The interesting failures all live in the retriever, which is the part most tutorials wave away.
When RAG is overkill (and when it isn't)
Here's the opinionated part most posts skip. RAG is not always the answer.
If your knowledge base is small and static, say a few hundred lines of policy text, you don't need a vector database. Stuff the whole thing in the system prompt. Modern context windows are huge. A 200KB FAQ is very roughly 50,000 tokens, which fits in a 128K-token window with room to spare, and you skip an entire category of retrieval bugs. I've shipped "RAG" features that were really just a well-organized prompt, and they outperformed the fancy pipeline because nothing could be retrieved wrong.
You reach for real RAG when:
- The corpus is too big to fit in context (full product catalogs, thousands of support tickets, large doc sites).
- Content changes often and you don't want to redeploy to update an answer.
- You need citations pointing back to a specific source.
If none of those are true, build the simple thing first. I've watched teams spend a month standing up Pinecone for content that would've fit in a single prompt. Don't be that team. If you're weighing whether AI even belongs in a given workflow, I wrote a more general take in this guide to automating workflows with AI.
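A quick way to sanity-check the "would this fit in a prompt?" question is a back-of-the-envelope token estimate. The sketch below assumes roughly 4 characters per token, which is a common rule of thumb for English text and not an exact count. The function names and the 4,000-token reserve are made up for illustration; use your model's real tokenizer before making a final call.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate (assumption: ~4 characters per token for English)."""
    return int(len(text) / chars_per_token) + 1


def fits_in_prompt(corpus: str, context_window: int, reserve: int = 4000) -> bool:
    """True if the whole corpus fits, leaving `reserve` tokens for the question and answer."""
    return estimate_tokens(corpus) <= context_window - reserve


# A small, repetitive stand-in for a real FAQ file.
faq = "Returns are accepted within 30 days of delivery.\n" * 200
print(fits_in_prompt(faq, context_window=128_000))
```

If this comes back true with plenty of headroom and the content rarely changes, the plain-prompt approach is probably worth trying before any vector database.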
How RAG works, end to end
Step 1: Chunking (where most pipelines quietly break)
You gather your sources (PDFs, Markdown, scraped pages, a Notion export) and split them into chunks. This sounds trivial. It is not. Chunking is the single biggest lever on answer quality, and it's where I see the most damage.
Chunk too large and the embedding becomes a blurry average of five different topics, so retrieval gets vague. Chunk too small and you sever the context a sentence needs to make sense. A pricing table split across two chunks will hand the model half the numbers and it'll invent the rest.
A few things I've learned the hard way:
- Split on structure (headings, sections), not a blind character count, whenever the source has structure.
- Use overlap so a thought that straddles a boundary survives in at least one chunk. 10–20% is a reasonable starting point.
- Keep metadata on every chunk: source URL, title, section. You need it for citations and for filtering retrieval later.
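The three rules above can be sketched in plain Python. This is a minimal illustration, not a production splitter: it assumes the source is Markdown with `#` headings, measures size in characters instead of tokens, and the `Chunk` class and `chunk_markdown` function are hypothetical names.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    source: str   # e.g. the URL or file the text came from
    section: str  # nearest heading, useful for citations and filtering


def chunk_markdown(doc: str, source: str, max_chars: int = 1000, overlap: int = 150) -> list[Chunk]:
    """Split on Markdown headings, then window long sections with overlap."""
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must be smaller than max_chars")
    chunks: list[Chunk] = []
    # Zero-width split: each part begins at a heading line (or the document start).
    for part in re.split(r"(?m)^(?=#{1,6} )", doc):
        part = part.strip()
        if not part:
            continue
        first_line = part.splitlines()[0]
        section = first_line.lstrip("# ").strip() if first_line.startswith("#") else "(intro)"
        step = max_chars - overlap
        for start in range(0, len(part), step):
            chunks.append(Chunk(part[start:start + max_chars], source, section))
            if start + max_chars >= len(part):
                break  # the window already reached the end of the section
    return chunks


doc = "# Returns\nItems can be returned within 30 days.\n\n# Shipping\n" + "We ship daily. " * 100
for chunk in chunk_markdown(doc, source="policies.md"):
    print(chunk.section, len(chunk.text))
```

The short Returns section stays whole, while the long Shipping section becomes two windows that share 150 characters, so a sentence cut at the boundary still appears intact in one of them.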
Step 2: Embeddings and indexing
Each chunk gets converted to a vector, a list of numbers that encodes its meaning, by an embedding model. Those vectors go into a vector database (Pinecone, Chroma, Weaviate, pgvector if you already run Postgres and don't want another service).
The point of embeddings is that chunks with similar meaning land near each other in vector space, so you can search by meaning instead of exact keywords. One rule: use the same embedding model for your chunks and your queries. Mix two models and your retrieval quietly turns to noise, with no error to tell you why.
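To make "near each other in vector space" concrete, here is a toy sketch using cosine similarity. The three-dimensional vectors are made up by hand (real embeddings have hundreds or thousands of dimensions), and `TinyIndex` is a hypothetical in-memory stand-in for a vector database. It also records which model built the index, which is one simple way to turn the same-model rule into a loud error instead of silent noise.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class TinyIndex:
    """In-memory stand-in for a vector database that remembers its embedding model."""

    def __init__(self, model_name: str):
        self.model_name = model_name
        self.rows: list[tuple[str, list[float]]] = []

    def add(self, text: str, vector: list[float]) -> None:
        self.rows.append((text, vector))

    def search(self, query_vector: list[float], model_name: str, k: int = 2) -> list[str]:
        # Refuse queries embedded with a different model than the stored chunks.
        if model_name != self.model_name:
            raise ValueError(f"index built with {self.model_name!r}, query used {model_name!r}")
        ranked = sorted(self.rows, key=lambda row: cosine_similarity(query_vector, row[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


index = TinyIndex("toy-embedder-v1")  # hypothetical model name
index.add("Returns are accepted within 30 days.", [0.9, 0.1, 0.0])
index.add("Shipping takes 3 to 5 business days.", [0.1, 0.9, 0.0])
index.add("The sofa frame is solid oak.", [0.0, 0.2, 0.9])

# Pretend this vector is the embedding of "Can I send my sofa back?"
print(index.search([0.8, 0.2, 0.1], model_name="toy-embedder-v1", k=1))
```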
Step 3: Retrieval
The user's question gets embedded with that same model, and the database returns the nearest chunks. This is semantic search: it finds what's contextually relevant, not just what shares keywords.
Pure vector search isn't enough on its own. Two upgrades that earned their keep for me:
- Hybrid search. Combine vector similarity with old-fashioned keyword (BM25) matching. Vectors are weak on exact tokens: SKUs, model numbers, error codes. Keyword search nails those. Together they cover each other's blind spots.
- Reranking. Pull a wider net, say the top 20 chunks, then run a reranker (a cross-encoder model) to reorder by actual relevance and keep the top 4 or 5. This was the single biggest quality jump in my Ask Shopify build. Plain nearest-neighbor retrieval surfaces chunks that look related but don't answer the question. A reranker filters those out.
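One common way to combine the two result lists in hybrid search is reciprocal rank fusion (RRF). It is not necessarily what any particular vector database does internally, and the text above does not prescribe a method, so treat this as one reasonable option. The sketch assumes you already have two ranked lists of chunk IDs (the IDs here are invented) and uses the conventional constant `k=60`.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists: each list contributes 1 / (k + rank) per document."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["care-guide", "returns-policy", "sku-AZ-100"]     # from semantic search
keyword_hits = ["sku-AZ-100", "shipping-faq", "returns-policy"]  # from BM25

print(reciprocal_rank_fusion([vector_hits, keyword_hits]))
```

Chunks that appear in both lists rise to the top, which is the point: the exact-match SKU chunk that vector search ranked third ends up first once keyword search vouches for it.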
Step 4: Generation
The question and the top chunks get assembled into one augmented prompt and sent to the LLM. The prompt is roughly:
Context: [the retrieved chunks]

Question: [the user's question]

Answer using only the context above. If the context doesn't contain the answer, say so.
That last sentence is load-bearing. Without an explicit instruction to refuse, the model will happily fill gaps with confident nonsense, which is exactly how I ended up with that wrong sofa return policy. With it, you get "I don't have that information," which is the correct answer when retrieval misses. A wrong answer costs you a customer. An honest "I don't know" doesn't.
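Assembling that prompt is a few lines of string handling. In this sketch, `build_prompt` and the chunk dictionary shape (`source`, `text`) are assumptions for illustration. Numbering each chunk and including its source is one way to give the model something to cite.

```python
def build_prompt(question: str, chunks: list[dict]) -> str:
    """Build the augmented prompt: numbered context, the question, then the refusal rule."""
    context = "\n\n".join(
        f"[{i}] ({chunk['source']}) {chunk['text']}"
        for i, chunk in enumerate(chunks, start=1)
    )
    if not context:
        # Make an empty retrieval explicit so the model has nothing to improvise from.
        context = "(no relevant context was retrieved)"
    return (
        f"Context:\n{context}\n\n"
        f"Question: {question}\n\n"
        "Answer using only the context above. "
        "If the context doesn't contain the answer, say so."
    )


chunks = [{"source": "policies/returns", "text": "Returns are accepted within 30 days."}]
print(build_prompt("What is the return window?", chunks))
```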
If you're wiring this into a customer-facing chat widget, the retrieval half is only part of the job. The conversation handling, streaming, and fallbacks matter just as much. I covered that side in how to build an AI chatbot for your website.
What the code looks like
Here's the shape of a pipeline in Python with LangChain. It runs once you plug in a model and an API key for the embeddings, but treat it as a skeleton: the chunking strategy and the reranker are where your real work goes.
# 1. Load your documents (e.g., a PDF manual)
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader("my-product-manual.pdf")
docs = loader.load()

# 2. Split into chunks. Overlap matters more than people think.
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
chunks = text_splitter.split_documents(docs)

# 3. Embed and store. Same embedding model here AND at query time.
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
embeddings_model = OpenAIEmbeddings()
vector_store = Chroma.from_documents(chunks, embeddings_model)

# 4. Retrieve a wide net so a reranker has something to work with.
retriever = vector_store.as_retriever(search_kwargs={"k": 20})

# 5. Build the chain. Note the "only use the context" instruction.
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
prompt = ChatPromptTemplate.from_template(
    "Answer using only the context below. "
    "If it isn't there, say you don't know.\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)
llm = ...  # your model of choice

def format_docs(docs):
    # Join the chunk text so the prompt gets plain text, not Document reprs.
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# 6. Ask.
question = "How do I clean the filter on the Model-X vacuum?"
print(rag_chain.invoke(question))

The k=20 plus a reranking step (omitted here for brevity) is the part that took my own pipeline from "demos well, fails in production" to something I'd actually put in front of customers.
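For a sense of where the reranking step slots in, here is a library-free sketch of the "retrieve wide, keep few" pattern. The `rerank` function and its signature are hypothetical, and `overlap_score` is a deliberately crude word-overlap stand-in. A real pipeline would pass in a cross-encoder's relevance score instead.

```python
import re
from typing import Callable


def rerank(question: str, candidates: list[str],
           score: Callable[[str, str], float], keep: int = 5) -> list[str]:
    """Reorder retrieved chunks by score(question, chunk) and keep the best few."""
    ranked = sorted(candidates, key=lambda text: score(question, text), reverse=True)
    return ranked[:keep]


def overlap_score(question: str, text: str) -> float:
    """Toy scorer (NOT a cross-encoder): share of question words found in the chunk."""
    q_words = set(re.findall(r"[a-z0-9]+", question.lower()))
    t_words = set(re.findall(r"[a-z0-9]+", text.lower()))
    return len(q_words & t_words) / len(q_words) if q_words else 0.0


candidates = [
    "The Model-X vacuum ships with two brush heads.",
    "To clean the filter, rinse it under cold water and let it dry.",
    "Filter replacements are sold separately.",
]
print(rerank("How do I clean the filter", candidates, overlap_score, keep=1))
```

Because the scorer is just a parameter, swapping the toy function for a real model does not change the surrounding pipeline.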
Where RAG pays off
- Support that doesn't sleep. A customer at 2 AM asking whether the Azure Dream sofa is pet-friendly and ships to 90210 by Friday gets a real answer pulled from your product data and shipping rules, not a canned deflection.
- Faster onboarding. A new support hire asks the internal knowledge base "what's our process for international returns on damaged goods?" and gets the exact steps from your own docs instead of interrupting a senior teammate.
- A shopping assistant that knows the catalog. Feed it your full product set and it can reason: "lightweight waterproof jacket for PNW hiking in October" returns three real options from your store with the tradeoffs spelled out.
If you're connecting these systems to live tools and data sources rather than just static docs, the MCP ecosystem is worth a look. I broke down the ones I actually use in my rundown of essential MCP servers for a Claude workflow.
FAQ
How many chunks should I send to the model? Start with 4 or 5 after reranking. More isn't better: irrelevant chunks dilute the prompt and the model starts hedging or drifting. Retrieve wide (15–20), rerank hard, send few.
Do I need a vector database? Only if your content won't fit in the context window or it changes constantly. Small, stable corpus? Put it straight in the prompt and skip the infrastructure.
Why does my RAG bot still hallucinate? Almost always retrieval, not the model. If the right chunk never gets retrieved, the model fills the gap. Check what's actually being passed in as context before you blame the LLM, and add an explicit "say you don't know" instruction.
Vector search or keyword search? Both. Vectors handle meaning; keyword search (BM25) handles exact tokens like SKUs and error codes. Hybrid beats either one alone.
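On the hallucination question, "check what's actually being passed in as context" can be turned into a number. The sketch below computes a simple hit rate: for a handful of test questions, does the source you expect show up in the top k results? The `hit_rate` helper, the test cases, and the canned retriever results are all invented for illustration. In practice you would pass in a function that calls your own retriever and returns source IDs.

```python
from typing import Callable


def hit_rate(cases: list[tuple[str, str]],
             retrieve: Callable[[str], list[str]], k: int = 5) -> float:
    """Fraction of questions whose expected source appears in the top-k retrieved sources."""
    if not cases:
        return 0.0
    hits = sum(1 for question, expected in cases if expected in retrieve(question)[:k])
    return hits / len(cases)


# Hypothetical canned results standing in for a real retriever.
fake_results = {
    "What is the return window?": ["blog/2019-sale", "policies/returns"],
    "Do you ship to 90210?": ["policies/shipping"],
    "Is the sofa pet-friendly?": ["blog/2019-sale"],
}
cases = [
    ("What is the return window?", "policies/returns"),
    ("Do you ship to 90210?", "policies/shipping"),
    ("Is the sofa pet-friendly?", "products/azure-dream"),
]
print(hit_rate(cases, lambda q: fake_results.get(q, []), k=5))
```

A low hit rate points at chunking or retrieval. A high hit rate with wrong answers points at the prompt or the model.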
Bottom line
RAG turns a generic model into something that actually knows your business, but the value isn't in the LLM. It's in chunking that preserves meaning, retrieval you can trust, a reranker that cuts the noise, and a prompt that makes the model admit when it doesn't know. Get those right and you have a real asset. Get them wrong and you have a confident liar with great formatting. Build the simple version first, measure what it retrieves, and only add complexity where the numbers tell you to.
Want this built for you instead of DIY?
I'm Karan — a Top Rated Plus Shopify Expert ($300K+ earned, 100% Job Success). If you'd rather hand this to someone who's done it hundreds of times, let's talk.


