Advanced RAG Patterns — RAG Engineering: Building AI That Knows Your Data

Why Basic RAG Is Not Enough

The standard retrieve-then-generate pipeline works well for straightforward questions with clear answers in your documents. But it breaks down in several common scenarios:

The user's query is vague or uses different terminology than the documents.
The answer requires synthesizing information scattered across multiple documents.
The retrieved chunks are marginally relevant but not actually helpful.
The question requires reasoning or multi-step logic, not just lookup.

Advanced RAG patterns address these failure modes by adding query rewriting, relevance checks, and decision steps around the retrieval and generation stages.

Multi-Query RAG

The problem: a user's single query might not be the best search query for finding all relevant information. Different phrasings of the same question can retrieve different, complementary chunks.

Multi-query RAG uses the LLM to generate multiple reformulations of the original question, runs each as a separate retrieval query, and merges the results.

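from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

# Assumes an existing Chroma index; the directory path is a placeholder
vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=OpenAIEmbeddings(model="text-embedding-3-small"),
)

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.3)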

# Wrap your base retriever with multi-query
multi_retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(search_kwargs={"k": 5}),
    llm=llm,
)

# The retriever generates multiple queries internally
# For "What are the benefits of remote work?", it might generate:
# 1. "What advantages does working from home offer employees?"
# 2. "How does remote work policy benefit the organization?"
# 3. "What are the perks of telecommuting?"
results = multi_retriever.invoke("What are the benefits of remote work?")

Multi-query is one of the simplest advanced patterns to implement and often provides an immediate improvement in retrieval recall. The tradeoff is added latency and cost: one extra LLM call to generate the reformulations, plus one retrieval per generated query.
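The merge step can be pictured as a de-duplicating union of the per-query result lists. The sketch below is a minimal illustration in plain Python, not LangChain's internals; `merge_unique` and the sample chunk strings are hypothetical.

```python
def merge_unique(result_lists):
    """Union several ranked result lists, keeping first-seen order."""
    seen = set()
    merged = []
    for results in result_lists:
        for chunk in results:
            if chunk not in seen:  # skip chunks another query already found
                seen.add(chunk)
                merged.append(chunk)
    return merged


# Hypothetical retrieval results for three reformulations of one question
per_query_results = [
    ["flexible hours", "no commute"],
    ["no commute", "lower office costs"],
    ["flexible hours", "wider hiring pool"],
]
merged = merge_unique(per_query_results)
print(merged)
```

Each reformulation contributes chunks the others missed, which is where the recall gain comes from. The merged list is also longer than a single query's top-k, so you may want to trim or rerank it before generation.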

HyDE: Hypothetical Document Embeddings

HyDE tackles a fundamental asymmetry in RAG: the query is short (a question) while the documents are long (paragraphs of information). A short question and a long passage can land in different regions of the embedding space even when they cover the same topic, which can hurt retrieval quality.

HyDE works by asking the LLM to generate a hypothetical answer to the query, then embedding that hypothetical answer instead of the query itself. Since the hypothetical answer looks more like a document than a question, it matches actual documents more effectively.

from langchain.chains import HypotheticalDocumentEmbedder
from langchain_openai import OpenAIEmbeddings, ChatOpenAI

# Create HyDE embeddings
base_embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

hyde_embeddings = HypotheticalDocumentEmbedder.from_llm(
    llm=llm,
    base_embeddings=base_embeddings,
    prompt_key="web_search",  # Built-in prompt for generating hypothetical docs
)

# Use HyDE embeddings for retrieval
# When you embed "What is the refund policy?", HyDE first generates
# a hypothetical document like "Our refund policy allows customers
# to return products within 30 days..." then embeds THAT text
query_vector = hyde_embeddings.embed_query("What is the refund policy?")

# Search the existing index (assumed to be built with base_embeddings) by vector
results = vectorstore.similarity_search_by_vector(query_vector, k=5)

When HyDE helps: Questions about topics where the LLM has some general knowledge (so the hypothetical document is reasonable). Questions that are phrased very differently from the actual documents.

When HyDE hurts: Highly domain-specific questions where the LLM generates an inaccurate hypothetical document, leading retrieval astray.
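To see the mechanism without any external services, here is a toy sketch of the HyDE flow. It is only an illustration under stated assumptions: `embed` is a bag-of-words stand-in for a real embedding model, `fake_llm` is a hard-coded stand-in for the LLM call, and the documents are made up.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words 'embedding'; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def fake_llm(query):
    """Stand-in for the LLM call that drafts a hypothetical answer."""
    return "Customers can return products for a refund within a few days."


def hyde_search(query, docs, generate=fake_llm):
    hypothetical = generate(query)  # 1. draft a plausible answer
    vector = embed(hypothetical)    # 2. embed the draft, not the query
    return max(docs, key=lambda d: cosine(vector, embed(d)))  # 3. nearest doc


docs = [
    "Customers may return products within 30 days for a full refund.",
    "The office is open Monday to Friday.",
]
query = "Can I get my money back?"
best = hyde_search(query, docs)
print(best)
```

The raw query shares no words with either document, so in this toy setup it cannot tell them apart. The drafted answer uses document-style vocabulary and lands on the refund passage. The same example shows the failure mode: if `fake_llm` drafted something about office hours, retrieval would follow it there.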

Self-RAG: Self-Reflective Retrieval

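Self-RAG adds a critical thinking layer to the RAG pipeline. After generating an answer, the system evaluates whether the answer is actually supported by the retrieved documents. If not, it can re-retrieve, reformulate the query, or flag the answer as uncertain. The function below covers the generate-and-evaluate steps and returns the reflection alongside the answer; what to do with an ungrounded verdict is left to the caller.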

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

llm = ChatOpenAI(model="gpt-4o", temperature=0)

def self_rag(query: str, retriever, llm) -> dict:
    """RAG with self-reflection on answer quality."""

    # Step 1: Retrieve
    docs = retriever.invoke(query)
    context = "\n\n".join(d.page_content for d in docs)

    # Step 2: Generate initial answer
    answer_prompt = ChatPromptTemplate.from_messages([
        ("system", "Answer based on the context. Context:\n{context}"),
        ("human", "{query}"),
    ])
    answer = (answer_prompt | llm).invoke({"context": context, "query": query})

    # Step 3: Self-reflect -- is the answer grounded?
    reflection_prompt = ChatPromptTemplate.from_messages([
        ("system", """Evaluate whether the answer is fully supported by the context.
Respond with a JSON object:
- "is_grounded": true/false
- "confidence": 0.0 to 1.0
- "unsupported_claims": list of claims not in context
- "suggestion": what to do if not grounded"""),
        ("human", "Context:\n{context}\n\nQuestion: {query}\n\nAnswer: {answer}"),
    ])
    reflection = (reflection_prompt | llm).invoke({
        "context": context,
        "query": query,
        "answer": answer.content,
    })

    return {
        "answer": answer.content,
        "reflection": reflection.content,
        "sources": docs,
    }

Self-RAG is particularly valuable in high-stakes applications (medical, legal, financial) where an ungrounded answer can have serious consequences.
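One way to act on the reflection is to parse its JSON and map it to a next step. The following is a sketch, not part of any library: `decide_next_step`, the three outcome labels, and the `0.7` confidence threshold are all assumptions chosen for illustration.

```python
import json


def decide_next_step(reflection_text, min_confidence=0.7):
    """Map the reflection JSON to 'accept', 'retry', or 'flag'."""
    # Models sometimes surround the JSON with extra text; keep only the object
    start, end = reflection_text.find("{"), reflection_text.rfind("}")
    try:
        reflection = json.loads(reflection_text[start:end + 1])
    except json.JSONDecodeError:
        return "flag"  # unparseable reflection: treat the answer as uncertain

    grounded = reflection.get("is_grounded")
    if grounded and reflection.get("confidence", 0) >= min_confidence:
        return "accept"
    if reflection.get("unsupported_claims"):
        return "retry"  # e.g. re-retrieve or reformulate the query
    return "flag"


grounded = '{"is_grounded": true, "confidence": 0.9, "unsupported_claims": []}'
ungrounded = (
    'Result: {"is_grounded": false, "confidence": 0.4, '
    '"unsupported_claims": ["30-day window"]}'
)
print(decide_next_step(grounded))
print(decide_next_step(ungrounded))
```

Treating an unparseable reflection as "flag" rather than "accept" is the conservative default for the high-stakes settings mentioned above. When you do retry, cap the number of attempts so a persistently ungrounded answer cannot loop forever.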

Corrective RAG (CRAG)

Corrective RAG evaluates the quality of retrieved documents before generating an answer. If the documents are not relevant enough, it falls back to web search or admits it cannot answer.

from langchain_core.prompts import ChatPromptTemplate

def corrective_rag(query: str, retriever, llm) -> str:
    """Evaluate retrieval quality before generating."""

    # Step 1: Retrieve
    docs = retriever.invoke(query)

    # Step 2: Grade each document for relevance
    grading_prompt = ChatPromptTemplate.from_messages([
        ("system", """Grade whether this document is relevant to the query.
Respond with only "relevant" or "not_relevant"."""),
        ("human", "Query: {query}\n\nDocument: {document}"),
    ])

    relevant_docs = []
    for doc in docs:
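        grade = (grading_prompt | llm).invoke({
            "query": query,
            "document": doc.page_content,
        })
        # Normalize, then require an exact match so that "not_relevant"
        # or "not relevant" can never be counted as a pass
        verdict = grade.content.strip().strip("\"'.").lower()
        if verdict == "relevant":
            relevant_docs.append(doc)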

    # Step 3: Decide based on relevance
    if len(relevant_docs) == 0:
        return "I could not find relevant information to answer this question."
    elif len(relevant_docs) < 2:
        # Optionally: supplement with web search
        context = relevant_docs[0].page_content
        # Could add web search results here
    else:
        context = "\n\n".join(d.page_content for d in relevant_docs)

    # Step 4: Generate with only relevant documents
    answer_prompt = ChatPromptTemplate.from_messages([
        ("system", "Answer based on the context below.\n\nContext:\n{context}"),
        ("human", "{query}"),
    ])
    answer = (answer_prompt | llm).invoke({"context": context, "query": query})
    return answer.content
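The web-search fallback mentioned above can be sketched as a small helper that tops up the context when too few documents pass grading. This is an illustrative sketch only: `build_context`, the `web_search` callable, and the `min_docs` threshold are hypothetical, and documents are represented as plain strings for brevity.

```python
def build_context(relevant_docs, query, web_search=None, min_docs=2):
    """Return context text, topping up with web results when retrieval is thin."""
    texts = list(relevant_docs)
    if len(texts) < min_docs and web_search is not None:
        # web_search is a hypothetical callable: query -> list of strings
        texts.extend(web_search(query))
    if not texts:
        return None  # nothing usable: the caller should admit it cannot answer
    return "\n\n".join(texts)


def stub_search(query):
    """Stand-in for a real web search client."""
    return [f"web result for: {query}"]


context = build_context(
    ["Refunds are issued within 30 days."], "refund policy", stub_search
)
print(context)
```

Keeping the fallback behind a callable makes it easy to swap in a real search client later, or to pass `None` in environments where external lookups are not allowed.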

Agentic RAG

Agentic RAG gives the LLM the ability to decide when, what, and how to retrieve. Instead of always retrieving before answering, an agent can choose to retrieve from different sources, perform multiple retrieval steps, or skip retrieval entirely for questions it can answer directly.

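from langchain.agents import AgentExecutor, create_openai_tools_agent
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.tools import Tool
from langchain_openai import ChatOpenAI

# Assumes `retriever` and `code_retriever` already exist, one per index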

# Define retrieval as a tool the agent can choose to use
retrieval_tool = Tool(
    name="knowledge_base_search",
    description="Search the company knowledge base for policies, "
                "procedures, and documentation. Use this when the question "
                "is about company-specific information.",
    func=lambda query: "\n\n".join(
        d.page_content for d in retriever.invoke(query)
    ),
)

code_search_tool = Tool(
    name="codebase_search",
    description="Search the codebase for code examples, function "
                "documentation, and technical implementation details.",
    func=lambda query: "\n\n".join(
        d.page_content for d in code_retriever.invoke(query)
    ),
)

# The agent decides which tool to use (or neither)
llm = ChatOpenAI(model="gpt-4o", temperature=0)
tools = [retrieval_tool, code_search_tool]

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant with access to company "
               "knowledge bases. Use the tools when you need specific "
               "company information. For general knowledge, answer directly."),
    ("human", "{input}"),
    ("placeholder", "{agent_scratchpad}"),
])

agent = create_openai_tools_agent(llm, tools, prompt)
executor = AgentExecutor(agent=agent, tools=tools, verbose=True)

result = executor.invoke({"input": "How does our auth middleware work?"})

Agentic RAG is the most flexible pattern but also the hardest to control. The agent can make poor decisions about when to retrieve, leading to missed information or unnecessary retrievals. Use it when the query space is diverse and different questions genuinely need different retrieval strategies.
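Stripped of the framework, the agent's control flow is a bounded loop: ask a policy what to do, run the chosen tool, feed the result back, and stop when the policy answers or the step budget runs out. The sketch below illustrates that loop with a hard-coded `toy_policy` standing in for the LLM; the function names, action format, and tool output are assumptions, not LangChain's implementation.

```python
def run_agent(question, decide, tools, max_steps=3):
    """Ask the policy what to do, run the chosen tool, repeat until it answers."""
    observations = []
    for _ in range(max_steps):
        action = decide(question, observations)
        if action["type"] == "answer":
            return action["text"], observations
        tool = tools[action["tool"]]
        observations.append((action["tool"], tool(action["query"])))
    return "Stopped: step limit reached.", observations


def toy_policy(question, observations):
    """Stand-in for the LLM: retrieve once for company questions, else answer."""
    if "our" in question.lower().split() and not observations:
        return {"type": "tool", "tool": "knowledge_base_search", "query": question}
    if observations:
        return {"type": "answer", "text": f"Based on the docs: {observations[-1][1]}"}
    return {"type": "answer", "text": "Answered from general knowledge."}


# Hypothetical tool returning a canned snippet instead of querying an index
tools = {"knowledge_base_search": lambda q: "Auth middleware validates JWTs."}

answer, trace = run_agent("How does our auth middleware work?", toy_policy, tools)
direct, no_trace = run_agent("What is a JWT?", toy_policy, tools)
print(answer, len(trace))
print(direct, len(no_trace))
```

The `max_steps` cap is the simplest control you have over an agent: it bounds cost and latency when the policy keeps choosing to retrieve. Logging the trace of tool calls, as the loop does here, also makes poor retrieval decisions visible during evaluation.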

RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval

RAPTOR builds a hierarchical summary tree of your documents. Leaf nodes are the original chunks. Parent nodes are summaries of groups of chunks. Higher levels are summaries of summaries. At query time, the system can retrieve at any level of the tree, getting either detailed chunks or broad summaries depending on what the question needs.

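# RAPTOR conceptual implementation
from langchain_core.documents import Document

def build_raptor_tree(chunks: list, llm, max_levels: int = 3) -> dict:
    """Build a hierarchical summary tree."""
    tree = {"level_0": chunks}  # Leaf nodes are original chunks

    current_level = chunks
    for level in range(1, max_levels + 1):
        # Stop once the level is too small to form even one group of five
        n_clusters = len(current_level) // 5
        if n_clusters < 1:
            break

        # Group chunks (e.g., by clustering their embeddings).
        # cluster_chunks is a placeholder you must supply; it should return
        # a list of groups, each a list of Document objects.
        groups = cluster_chunks(current_level, n_clusters=n_clusters)

        # Summarize each group
        summaries = []
        for group in groups:
            combined = "\n\n".join(c.page_content for c in group)
            summary = llm.invoke(
                f"Summarize the following documents concisely:\n\n{combined}"
            )
            # Chat models return a message; wrap its text in a Document so
            # the next level can read .page_content like the leaf chunks
            summaries.append(Document(page_content=summary.content))

        tree[f"level_{level}"] = summaries
        current_level = summaries

    return tree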

RAPTOR is particularly useful for questions that need a broad understanding of a topic rather than a specific detail. For example, "What are the main themes of our product roadmap?" benefits from higher-level summaries, while "What is the deadline for feature X?" needs leaf-level detail.
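A simple way to retrieve "at any level" is to score every node in the tree against the query as one flat pool and take the best matches, whatever level they come from. The sketch below assumes a tree keyed by level name as in the builder above, but with plain strings as nodes; word overlap stands in for embedding similarity, and `search_tree` and the sample nodes are hypothetical.

```python
import re


def tokens(text):
    """Lowercased word set; a crude stand-in for an embedding."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def search_tree(tree, query, k=1):
    """Score every node on every level against the query and return the top k."""
    q = tokens(query)
    scored = []
    for level, nodes in tree.items():
        for text in nodes:
            t = tokens(text)
            overlap = len(q & t) / (len(q | t) or 1)  # Jaccard similarity
            scored.append((overlap, level, text))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(level, text) for _, level, text in scored[:k]]


tree = {
    "level_0": [
        "Feature X deadline is March 14.",
        "Feature Y adds offline sync for mobile.",
    ],
    "level_1": ["Roadmap themes: mobile reliability and faster releases."],
}

specific = search_tree(tree, "What is the deadline for feature X?")
broad = search_tree(tree, "What are the main themes of our product roadmap?")
print(specific)
print(broad)
```

With this toy data the specific question lands on a leaf chunk and the broad question lands on the summary node, without the caller having to choose a level up front.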

Choosing the Right Pattern

| Pattern | Best For | Complexity | Extra Cost |
|---------|----------|------------|------------|
| Multi-Query | Vague or broad queries | Low | 1 extra LLM call |
| HyDE | Short queries, terminology mismatch | Low | 1 extra LLM call |
| Self-RAG | High-stakes applications | Medium | 1-2 extra LLM calls |
| Corrective RAG | Noisy knowledge bases | Medium | N grading calls |
| Agentic RAG | Diverse query types, multiple sources | High | Variable |
| RAPTOR | Multi-granularity questions | High | Offline summarization |

Start with multi-query or HyDE -- they are simple to implement and often improve retrieval right away. Layer in self-RAG or corrective RAG when answer reliability is critical. Move to agentic RAG when your system needs to handle fundamentally different types of questions, and consider RAPTOR when questions range from broad overviews to specific details.

In the next lesson, we will apply RAG specifically to code -- a domain with unique chunking, embedding, and retrieval challenges.