Retrieval Techniques — RAG Engineering: Building AI That Knows Your Data

Beyond Basic Similarity Search

In the previous lessons, you learned to embed documents and search for the most similar chunks using cosine similarity. That basic approach works, but it has limitations. The top-K most similar chunks are not always the most useful chunks. They might be redundant (five chunks saying the same thing), miss keyword-critical matches, or include marginally relevant results that dilute the context.

This lesson covers the retrieval techniques that production RAG systems use to address those limitations: MMR, hybrid search, re-ranking, contextual compression, metadata filtering, and multi-vector retrieval.

Similarity Search Fundamentals

The baseline retrieval approach: embed the query, compute similarity against all stored vectors, return the top-K results.

from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=OpenAIEmbeddings(model="text-embedding-3-small"),
)

# Basic similarity search
results = vectorstore.similarity_search(
    query="What is the refund policy?",
    k=5
)

# With scores (for Chroma this is a distance: lower means more similar)
results_with_scores = vectorstore.similarity_search_with_score(
    query="What is the refund policy?",
    k=5
)
for doc, score in results_with_scores:
    print(f"Distance: {score:.4f} | {doc.page_content[:80]}")

This works reasonably well but has three common failure modes: redundant results, missed exact-keyword matches, and marginally relevant chunks that dilute the context.
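
To make the baseline concrete, here is a minimal sketch of top-K cosine search in plain Python. The function names and the 3-dimensional vectors are made up for illustration; a real vector store would use model-generated embeddings and an approximate index rather than a full scan. The toy data also shows the redundancy problem: the two highest-scoring chunks are near-duplicates.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=5):
    """Return (index, similarity) pairs for the k most similar vectors."""
    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical toy "embeddings", for illustration only
doc_vecs = [
    [1.0, 0.0, 0.0],  # chunk 0
    [0.9, 0.1, 0.0],  # chunk 1: near-duplicate of chunk 0
    [0.0, 1.0, 0.0],  # chunk 2: unrelated
    [0.7, 0.0, 0.7],  # chunk 3: partially related
]
query_vec = [1.0, 0.0, 0.0]

top = top_k(query_vec, doc_vecs, k=2)
top_indices = [i for i, _ in top]  # both slots go to the near-duplicates
```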

Maximal Marginal Relevance (MMR)

MMR solves the redundancy problem. Instead of returning the five most similar chunks (which might all come from the same section and say essentially the same thing), MMR balances relevance to the query with diversity among the selected chunks.

The algorithm works iteratively: select the most relevant chunk first, then for each subsequent selection, pick the chunk that is most relevant to the query but least similar to the chunks already selected.

# MMR retrieval -- balances relevance with diversity
results = vectorstore.max_marginal_relevance_search(
    query="What is the refund policy?",
    k=5,             # Number of results to return
    fetch_k=20,      # Number of candidates to consider
    lambda_mult=0.7  # 0 = max diversity, 1 = max relevance
)

The lambda_mult parameter controls the tradeoff. At 1.0, MMR behaves like standard similarity search. At 0.0, it maximizes diversity. A value around 0.5-0.7 works well for most RAG applications.
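
The iterative selection described above can be sketched in a few lines of plain Python. This is an illustration of the idea, not LangChain's implementation: `mmr_select` and the toy vectors are hypothetical, and redundancy is measured here as similarity to the closest already-selected chunk.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mmr_select(query_vec, doc_vecs, k=5, lambda_mult=0.7):
    """Return the indices of up to k vectors chosen by MMR."""
    relevance = [cosine(query_vec, v) for v in doc_vecs]
    selected, remaining = [], list(range(len(doc_vecs)))
    while remaining and len(selected) < k:
        best, best_score = None, -math.inf
        for i in remaining:
            # Redundancy: similarity to the closest chunk picked so far
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            score = lambda_mult * relevance[i] - (1 - lambda_mult) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical toy vectors: chunk 1 nearly duplicates chunk 0
doc_vecs = [
    [1.0, 0.0, 0.0],
    [0.95, 0.05, 0.0],
    [0.6, 0.0, 0.8],
    [0.0, 1.0, 0.0],
]
query_vec = [0.9, 0.0, 0.3]

pure_relevance = mmr_select(query_vec, doc_vecs, k=2, lambda_mult=1.0)
balanced = mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5)
```

With `lambda_mult=1.0` the near-duplicate takes the second slot; with `lambda_mult=0.5` the redundancy penalty pushes it out in favor of a chunk that is slightly less relevant but covers different content.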

When to use: Consider MMR as a drop-in replacement for basic similarity search whenever the top-K results tend to repeat each other. It rarely hurts and often improves answer quality, because the LLM receives several distinct passages instead of near-duplicates of one.

Hybrid Search: Combining Dense and Sparse Retrieval

Dense retrieval (embedding-based) excels at semantic matching. But it can miss exact keyword matches that a user expects. If someone searches for "error code E-4021", a dense search might return chunks about general error handling instead of the specific error code.

Sparse retrieval (BM25, TF-IDF) excels at exact keyword matching but misses semantic connections. It would find "error code E-4021" perfectly but would not match "how to fix the payment processing failure" to a document about E-4021.

Hybrid search combines both: run a dense search and a sparse search in parallel, then merge the results.

from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

# Dense retriever (semantic)
vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=OpenAIEmbeddings(model="text-embedding-3-small"),
)
dense_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

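# Sparse retriever (keyword); BM25Retriever requires the rank_bm25 package.
# `documents` is assumed to be the list of Document chunks you indexed earlier.
bm25_retriever = BM25Retriever.from_documents(documents)
bm25_retriever.k = 5

# Hybrid: fuse the two ranked lists (weighted Reciprocal Rank Fusion)
hybrid_retriever = EnsembleRetriever(
    retrievers=[dense_retriever, bm25_retriever],
    weights=[0.6, 0.4]  # dense ranking weighted 0.6, BM25 ranking 0.4
)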

results = hybrid_retriever.invoke("error code E-4021 payment failure")

When to use: Applications where users search for both concepts and specific terms (product names, error codes, IDs). Hybrid search is the standard recommendation for production RAG systems.
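
The merge step is where hybrid search gets subtle, because dense and BM25 scores are on different scales and cannot simply be added. A common approach is Reciprocal Rank Fusion (RRF), which uses only each document's rank in each list. The sketch below is a minimal illustration under stated assumptions: `weighted_rrf`, the document IDs, and the smoothing constant `c=60` (a conventional default) are choices made here, not the library's code.

```python
def weighted_rrf(rankings, weights, c=60):
    """Fuse ranked lists of document IDs with weighted Reciprocal Rank Fusion.

    rankings: ranked lists of IDs, best first, one list per retriever
    weights:  one weight per retriever
    c:        smoothing constant that dampens the advantage of top ranks
    """
    scores = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (c + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical results for the query "error code E-4021 payment failure"
dense_ranking = ["general-errors", "payments-overview", "e4021-fix"]
sparse_ranking = ["e4021-fix", "error-code-table"]

fused = weighted_rrf([dense_ranking, sparse_ranking], weights=[0.6, 0.4])
```

The chunk that appears in both lists rises to the top even though the dense retriever ranked it third, which is the behavior you want for queries that mix concepts with exact identifiers.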

Re-Ranking

Re-ranking is a two-stage retrieval process. First, retrieve a larger candidate set (e.g., 20 chunks) using fast similarity search. Then, use a more powerful model to re-score and re-order those candidates based on their actual relevance to the query.

The re-ranking model reads both the query and the candidate text together, which gives it much deeper understanding of relevance than the embedding comparison alone.

Cohere Rerank

from langchain.retrievers import ContextualCompressionRetriever
from langchain_cohere import CohereRerank
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

# Base retriever -- get a large candidate set
vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=OpenAIEmbeddings(model="text-embedding-3-small"),
)
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 20})

# Re-ranker -- score candidates more precisely
reranker = CohereRerank(
    model="rerank-english-v3.0",
    top_n=5  # Return top 5 after re-ranking
)

# Combined retriever
retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=base_retriever,
)

results = retriever.invoke("What is the refund policy for digital products?")

Cross-Encoder Re-Ranking (Open Source)

from sentence_transformers import CrossEncoder

# Load cross-encoder model
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# Get candidate chunks from your retriever
query = "What is the refund policy?"
candidates = vectorstore.similarity_search(query, k=20)

# Re-rank with cross-encoder
pairs = [(query, doc.page_content) for doc in candidates]
scores = cross_encoder.predict(pairs)

# Sort by cross-encoder score
ranked = sorted(
    zip(candidates, scores),
    key=lambda x: x[1],
    reverse=True
)

# Take top 5
top_results = [doc for doc, score in ranked[:5]]

When to use: When retrieval precision is critical. Re-ranking consistently improves relevance in benchmarks. The main cost is added latency (typically 100-300ms) and the re-ranking model cost. For most production systems, the quality improvement justifies the overhead.

Contextual Compression

Sometimes the retrieved chunk contains the answer but also a lot of irrelevant text. Contextual compression extracts only the parts of each chunk that are relevant to the query.

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# Compressor that extracts only relevant portions
compressor = LLMChainExtractor.from_llm(llm)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectorstore.as_retriever(search_kwargs={"k": 10}),
)

results = compression_retriever.invoke("What are the refund deadlines?")
# Each result now contains only the relevant portion of the original chunk

When to use: When your chunks are large and contain mixed information, or when you need to minimize the context sent to the LLM to save on token costs.
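
An LLM call per chunk can be expensive, so it helps to see the core idea in isolation. The sketch below is a deliberately crude stand-in for an LLM extractor: it keeps only the sentences that share a non-stopword term with the query. The function name, the stopword list, and the sample chunk are all made up for illustration; an LLM-based extractor can also catch paraphrases that this keyword filter would miss.

```python
import re

# Tiny illustrative stopword list, not a complete one
STOPWORDS = {"what", "are", "is", "the", "a", "an", "of", "to", "for", "in", "and", "or"}

def tokenize(text):
    """Lowercase alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def compress_chunk(query, chunk_text, min_overlap=1):
    """Keep only sentences sharing at least min_overlap terms with the query."""
    query_terms = tokenize(query) - STOPWORDS
    # Naive sentence split on ., ! or ? followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", chunk_text.strip())
    kept = [s for s in sentences if len(query_terms & tokenize(s)) >= min_overlap]
    return " ".join(kept)

chunk = (
    "Our office is open on weekdays. "
    "Refund requests must be filed within 30 days. "
    "Shipping is free for orders over $50."
)
compressed = compress_chunk("What are the refund deadlines?", chunk)
```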

Metadata Filtering

Metadata filtering narrows the search space before similarity search runs. Instead of searching all 100,000 chunks, you search only the 5,000 chunks from the "engineering" department or the 200 chunks from documents updated in the last month.

# ChromaDB metadata filtering
results = vectorstore.similarity_search(
    query="deployment process",
    k=5,
    filter={
        "$and": [
            {"department": "engineering"},
            {"year": {"$gte": 2024}},
        ]
    }
)

Common filtering patterns:

By source: Only search knowledge base articles, not internal memos.
By date: Restrict results to recent documents and exclude outdated ones.
By department/team: Scope results to the user's area.
By document type: Search only policies, or only technical docs.
By access level: Enforce security by filtering based on user permissions.
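
In practice the filter is usually assembled per request from the user's context. The helper below is a minimal sketch of that pattern, producing a dictionary in the same shape as the Chroma filter shown earlier. The function name and the `access_level` metadata field are hypothetical, and it assumes your Chroma version supports the `$in` operator. It avoids wrapping a single condition in `$and`, on the assumption that Chroma expects at least two conditions there.

```python
def build_filter(department=None, min_year=None, allowed_levels=None):
    """Build a Chroma-style metadata filter from optional constraints.

    Returns None when no constraint applies, a bare condition when exactly
    one applies, and an "$and" of all conditions otherwise.
    """
    conditions = []
    if department is not None:
        conditions.append({"department": department})
    if min_year is not None:
        conditions.append({"year": {"$gte": min_year}})
    if allowed_levels:
        # "access_level" is a hypothetical metadata field
        conditions.append({"access_level": {"$in": list(allowed_levels)}})

    if not conditions:
        return None
    if len(conditions) == 1:
        return conditions[0]
    return {"$and": conditions}

user_filter = build_filter(
    department="engineering",
    min_year=2024,
    allowed_levels=["public", "internal"],
)
```

The result can be passed as the `filter` argument of `similarity_search`. For access control, derive `allowed_levels` from the authenticated user on the server side, never from values supplied in the query.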

Multi-Vector Retrieval

Instead of embedding each chunk once, create multiple embeddings per chunk -- one for the text as-is, one for a summary, one for hypothetical questions that the chunk answers. Search across all vectors and deduplicate.

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

def generate_hypothetical_questions(chunk_text: str) -> list[str]:
    """Generate questions that this chunk could answer."""
    response = llm.invoke(
        f"Generate 3 questions that the following text could answer. "
        f"Return only the questions, one per line.\n\n{chunk_text}"
    )
    # Drop blank lines and stray whitespace from the model output
    return [q.strip() for q in response.content.splitlines() if q.strip()]

# For each chunk, embed the text AND the hypothetical questions
# Store all embeddings, mapping back to the same chunk

This increases the chances that a user's query will match one of the representations of the relevant chunk.
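
The bookkeeping can be sketched without any external services: every representation gets its own vector, each vector points back to a chunk ID, and search keeps the best score per chunk. Everything here is an assumption made for illustration. In particular, `toy_embed` is a bag-of-words stand-in, not a real embedding model, and the hypothetical questions are written by hand rather than generated.

```python
import math
import re
from collections import Counter

def toy_embed(text):
    """Stand-in embedding: bag-of-words term counts (NOT a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def build_index(chunks, extra_representations, embed=toy_embed):
    """Return (vector, chunk_id) entries: the chunk text plus each extra representation."""
    entries = []
    for chunk_id, text in chunks.items():
        for rep in [text, *extra_representations.get(chunk_id, [])]:
            entries.append((embed(rep), chunk_id))
    return entries

def search(query, entries, k=3, embed=toy_embed):
    """Score every vector, keep the best score per chunk, return the top-k chunk IDs."""
    q = embed(query)
    best = {}
    for vec, chunk_id in entries:
        score = cosine(q, vec)
        if score > best.get(chunk_id, -1.0):
            best[chunk_id] = score  # deduplicate: one score per chunk
    return sorted(best, key=best.get, reverse=True)[:k]

chunks = {
    "refunds": "Customers may return digital goods within 14 days of purchase.",
    "shipping": "Orders ship within two business days.",
}
questions = {
    "refunds": ["How long do I have to get my money back?"],
    "shipping": ["When will my order arrive?"],
}
entries = build_index(chunks, questions)
hits = search("how long to get my money back", entries, k=1)
```

The query shares no words with the refund chunk itself, so it matches only through the hypothetical question. With this toy embedding, that extra representation is the only reason the chunk is found.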

Combining Techniques: A Production Stack

The best RAG systems layer multiple retrieval techniques:

Metadata filtering to scope the search to relevant documents
Hybrid search (dense + BM25) to cast a wide net within that scope
Re-ranking to precisely order the candidates
MMR to ensure diversity in the final set

# Production retrieval stack (pseudocode)
candidates = hybrid_search(query, k=30, filters=user_filters)
reranked = rerank(query, candidates, top_n=10)
final = mmr(query, reranked, k=5, lambda_mult=0.7)  # MMR needs the query to score relevance

Each layer addresses a different failure mode. Together, they produce a retrieval pipeline that is robust, precise, and diverse.
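
One way to keep such a stack maintainable is to treat each stage as a pluggable function, so that any layer can be swapped or tested in isolation. The skeleton below is a sketch under that assumption: `retrieve` and its parameter names are hypothetical, and the stand-ins in the usage example (word-overlap scoring, exact-duplicate removal) are toys in place of real hybrid search, a cross-encoder, and MMR.

```python
from typing import Callable, Sequence

def retrieve(
    query: str,
    search_fn: Callable[[str, int], Sequence[str]],
    rerank_fn: Callable[[str, Sequence[str]], Sequence[float]],
    diversify_fn: Callable[[str, Sequence[str], int], Sequence[str]],
    fetch_k: int = 30,
    rerank_n: int = 10,
    final_k: int = 5,
) -> list[str]:
    """Run search -> re-rank -> diversify and return the final chunks."""
    # Stage 1: wide candidate set (e.g. filtered hybrid search)
    candidates = list(search_fn(query, fetch_k))
    # Stage 2: precise scoring (e.g. a cross-encoder), highest score first
    scores = rerank_fn(query, candidates)
    ranked = [c for c, _ in sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)]
    # Stage 3: diversity over the best rerank_n candidates (e.g. MMR)
    return list(diversify_fn(query, ranked[:rerank_n], final_k))

# Toy stand-ins, for illustration only
corpus = [
    "refund policy for digital products",
    "refund policy for digital products",  # exact duplicate
    "shipping times",
    "refund deadlines for products",
]

def toy_search(query, k):
    return corpus[:k]

def toy_rerank(query, docs):
    terms = set(query.lower().split())
    return [len(terms & set(d.lower().split())) for d in docs]

def toy_diversify(query, docs, k):
    return list(dict.fromkeys(docs))[:k]  # drop exact duplicates, keep order

final = retrieve(
    "refund policy digital products",
    toy_search, toy_rerank, toy_diversify,
    final_k=2,
)
```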

In the next lesson, you will assemble all these components into a complete, working RAG pipeline from end to end.