RAG for Code — RAG Engineering: Building AI That Knows Your Data

Why Code Needs Special Treatment

Code is fundamentally different from natural language text. It has rigid syntax, hierarchical structure (modules contain classes, which contain methods), cross-file dependencies (imports, inheritance), and meaning that comes from both the text and its structure. A generic RAG pipeline that chunks code by character count retrieves poorly because it splits functions in half, separates class definitions from their methods, and drops the import context that makes code understandable.
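To make the failure mode concrete, here is a minimal sketch that splits a small snippet into fixed-size character windows and checks whether each piece still parses on its own. The sample source, the helper names, and the 60-character window are arbitrary choices for illustration.

```python
import ast

SAMPLE = '''import math

def area(radius):
    """Return the area of a circle."""
    return math.pi * radius ** 2

def perimeter(radius):
    return 2 * math.pi * radius
'''

def naive_chunks(text: str, size: int) -> list[str]:
    """Split text into fixed-size character windows, ignoring syntax."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def parses(code: str) -> bool:
    """Return True if the fragment is valid Python on its own."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

pieces = naive_chunks(SAMPLE, 60)
broken = [p for p in pieces if not parses(p)]
print(f"{len(broken)} of {len(pieces)} fixed-size chunks are not valid Python")
```

The whole file parses, but the windows cut through the docstring and the second function signature, so the pieces are fragments that neither an embedding model nor an LLM can interpret reliably.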

Building a RAG system for code requires rethinking every stage of the pipeline: how you chunk, how you embed, what metadata you extract, and how you prompt the LLM.

Code-Specific Embeddings

General-purpose embedding models work on code but underperform compared to models trained specifically on programming languages. Code-specific models understand that def calculate_total(items) and function computeSum(products) are semantically similar despite having no words in common.
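The "no words in common" point can be checked directly. This small sketch (the helper names are hypothetical) breaks both signatures into lowercase word pieces and measures their overlap, which is what a purely lexical matcher would have to work with.

```python
import re

def subtokens(signature: str) -> set[str]:
    """Split snake_case and camelCase identifiers into lowercase word pieces."""
    pieces = []
    for word in re.findall(r"[A-Za-z]+", signature):
        # Break camelCase at lowercase-to-uppercase boundaries
        pieces.extend(re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])", word))
    return {p.lower() for p in pieces}

def jaccard(a: set[str], b: set[str]) -> float:
    """Share of word pieces the two sets have in common."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

py_sig = "def calculate_total(items)"
js_sig = "function computeSum(products)"
overlap = jaccard(subtokens(py_sig), subtokens(js_sig))
print(overlap)
```

The overlap is zero, so keyword matching alone cannot connect the two functions. Any similarity has to come from an embedding model that has learned what the identifiers mean.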

Recommended models for code:

OpenAI text-embedding-3-small/large -- Handles code reasonably well since the training data includes code. Good enough for most use cases.
Voyage Code 2 -- Specialized code embedding model from Voyage AI. Outperforms general models on code retrieval benchmarks.
CodeSage -- Open-source code embedding model optimized for retrieval.
StarEncoder -- From BigCode, trained specifically on code. Good for self-hosted deployments.

from langchain_openai import OpenAIEmbeddings

# For most projects, OpenAI embeddings handle code well enough
code_embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# For code-critical applications, consider Voyage
# from langchain_voyageai import VoyageAIEmbeddings
# code_embeddings = VoyageAIEmbeddings(model="voyage-code-2")

AST-Based Chunking

The Abstract Syntax Tree (AST) is a structured representation of source code that captures its syntactic structure. Instead of splitting code by character count, you parse the AST and split at natural boundaries: functions, classes, methods.

import ast
from dataclasses import dataclass

@dataclass
class CodeChunk:
    content: str
    chunk_type: str  # "function", "class", "method", or "module"
    name: str
    file_path: str
    start_line: int
    end_line: int
    docstring: str | None = None
    imports: list[str] | None = None

def extract_python_chunks(file_path: str) -> list[CodeChunk]:
    """Extract functions and classes from a Python file using AST."""
    with open(file_path, "r", encoding="utf-8") as f:
        source = f.read()

    tree = ast.parse(source)
    chunks = []
    lines = source.split("\n")

    # Extract module-level imports only (tree.body skips imports nested in functions)
    imports = [
        ast.get_source_segment(source, node)
        for node in tree.body
        if isinstance(node, (ast.Import, ast.ImportFrom))
    ]

    for node in ast.iter_child_nodes(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):  # include async def
            func_source = "\n".join(lines[node.lineno - 1:node.end_lineno])
            docstring = ast.get_docstring(node)
            chunks.append(CodeChunk(
                content=func_source,
                chunk_type="function",
                name=node.name,
                file_path=file_path,
                start_line=node.lineno,
                end_line=node.end_lineno,
                docstring=docstring,
                imports=imports,
            ))

        elif isinstance(node, ast.ClassDef):
            class_source = "\n".join(lines[node.lineno - 1:node.end_lineno])
            docstring = ast.get_docstring(node)
            chunks.append(CodeChunk(
                content=class_source,
                chunk_type="class",
                name=node.name,
                file_path=file_path,
                start_line=node.lineno,
                end_line=node.end_lineno,
                docstring=docstring,
                imports=imports,
            ))

    return chunks

# Usage
chunks = extract_python_chunks("src/auth/middleware.py")
for chunk in chunks:
    print(f"{chunk.chunk_type}: {chunk.name} ({chunk.start_line}-{chunk.end_line})")
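One detail to watch when slicing by line number: on Python 3.8 and later, a function's lineno points at the def line, so a slice that starts at node.lineno drops any decorators above it. The following sketch (chunk_span is a hypothetical helper, not part of the ast module) widens the start line to cover them.

```python
import ast

def chunk_span(node: ast.AST) -> tuple[int, int]:
    """Return (start_line, end_line), widening the start to cover decorators."""
    start = node.lineno
    decorators = getattr(node, "decorator_list", [])
    if decorators:
        start = min(start, min(d.lineno for d in decorators))
    return start, node.end_lineno

SOURCE = '''import functools

@functools.lru_cache(maxsize=None)
def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)
'''

tree = ast.parse(SOURCE)
func = next(n for n in tree.body if isinstance(n, ast.FunctionDef))
start, end = chunk_span(func)
lines = SOURCE.splitlines()
print("\n".join(lines[start - 1:end]))
```

Decorators such as route registrations or caching wrappers often carry exactly the information a developer is asking about, so it is usually worth keeping them in the chunk.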

Handling Large Functions and Classes

Some functions or classes are too large to embed as a single chunk. For classes, split into the class docstring/signature plus individual methods. For long functions, consider including the function signature and docstring as context with each logical block.

def split_large_class(class_node, source_lines, file_path):
    """Split a large class into method-level chunks."""
    chunks = []

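    # Class-level chunk (signature + docstring), emitted as its own chunk
    class_header = f"class {class_node.name}:"
    docstring = ast.get_docstring(class_node)
    if docstring:
        class_header += f'\n    """{docstring}"""'
    chunks.append(CodeChunk(
        content=class_header,
        chunk_type="class",
        name=class_node.name,
        file_path=file_path,
        start_line=class_node.lineno,
        end_line=class_node.body[0].end_lineno if docstring else class_node.lineno,
        docstring=docstring,
    ))

    for node in class_node.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            method_source = "\n".join(
                source_lines[node.lineno - 1:node.end_lineno]
            )
            # Prepend class context to each method
            contextual_chunk = f"# Class: {class_node.name}\n{method_source}"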
            chunks.append(CodeChunk(
                content=contextual_chunk,
                chunk_type="method",
                name=f"{class_node.name}.{node.name}",
                file_path=file_path,
                start_line=node.lineno,
                end_line=node.end_lineno,
            ))

    return chunks
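The same idea can be applied to long functions. The sketch below is one possible approach, not a definitive implementation: it groups a function's top-level statements into blocks of at most max_lines lines and repeats the signature and docstring at the top of each block. The function name split_long_function, the max_lines budget, and the omission marker are all assumptions, and the code expects a multi-line, top-level function with 4-space indentation.

```python
import ast

def split_long_function(source: str, func_name: str, max_lines: int = 20) -> list[str]:
    """Split one top-level function into blocks that each repeat its header."""
    tree = ast.parse(source)
    lines = source.splitlines()
    func = next(
        n for n in tree.body
        if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef)) and n.name == func_name
    )
    body = list(func.body)
    if ast.get_docstring(func) is not None:
        body = body[1:]  # the docstring travels with the header instead

    # Header = everything from "def" up to the first real statement
    header_end = body[0].lineno - 1 if body else func.end_lineno
    header = "\n".join(lines[func.lineno - 1:header_end])

    # Greedily pack whole statements into blocks of at most max_lines lines
    blocks, current, current_len = [], [], 0
    for stmt in body:
        stmt_len = stmt.end_lineno - stmt.lineno + 1
        if current and current_len + stmt_len > max_lines:
            blocks.append(current)
            current, current_len = [], 0
        current.append(stmt)
        current_len += stmt_len
    if current:
        blocks.append(current)

    chunks = []
    for i, block in enumerate(blocks):
        text = "\n".join(lines[block[0].lineno - 1:block[-1].end_lineno])
        marker = "" if i == 0 else "    # ... (earlier statements omitted)\n"
        chunks.append(f"{header}\n{marker}{text}")
    return chunks

SRC = '''def process(items):
    """Process items."""
    total = 0
    for item in items:
        total += item
    count = len(items)
    return total / count
'''

function_chunks = split_long_function(SRC, "process", max_lines=2)
for chunk in function_chunks:
    print(chunk, end="\n\n---\n\n")
```

Splitting only between top-level statements keeps each loop or conditional intact. A single statement longer than the budget still becomes one oversized block, which is usually preferable to cutting through it.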

Multi-Language Support

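Python's ast module only parses Python. For JavaScript, TypeScript, and other languages, tree-sitter provides real parse trees for dozens of programming languages through a consistent API. A lighter-weight option, shown below, is LangChain's language-aware splitter. It does not parse the code; it splits on language-specific separators (such as class and function keywords) before falling back to character counts, so it approximates structural boundaries rather than guaranteeing them.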

# Using LangChain's language-aware splitter
from langchain_text_splitters import Language, RecursiveCharacterTextSplitter

# Python-aware splitting
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=2000,
    chunk_overlap=200,
)

# JavaScript-aware splitting
js_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.JS,
    chunk_size=2000,
    chunk_overlap=200,
)

# TypeScript, Go, Java, Rust, etc. are all supported
ts_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.TS,
    chunk_size=2000,
    chunk_overlap=200,
)
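In a mixed-language repository, each file has to be routed to the right splitter. A minimal sketch of that routing step follows. The extension table is hypothetical and incomplete, and the string keys are assumed to mirror the values of LangChain's Language enum, so check them against your installed version before passing them to from_language.

```python
from collections import defaultdict
from pathlib import Path
from typing import Optional

# Hypothetical extension-to-language table; extend it for your repository.
EXTENSION_TO_LANGUAGE = {
    ".py": "python",
    ".js": "js",
    ".jsx": "js",
    ".ts": "ts",
    ".tsx": "ts",
    ".go": "go",
    ".java": "java",
    ".rs": "rust",
}

def detect_language(file_path: str) -> Optional[str]:
    """Return the language key for a file, or None if the extension is unknown."""
    return EXTENSION_TO_LANGUAGE.get(Path(file_path).suffix.lower())

def group_by_language(paths: list[str]) -> dict[str, list[str]]:
    """Bucket paths by language so each bucket can go to its own splitter."""
    groups = defaultdict(list)
    for path in paths:
        language = detect_language(path)
        if language is not None:
            groups[language].append(path)
    return dict(groups)

groups = group_by_language(["src/app.py", "web/index.ts", "web/App.tsx", "README.md"])
print(groups)
```

Files with unknown extensions are skipped here. Depending on the project, you may prefer to send them through a plain text splitter instead.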

Enriching Code Chunks With Metadata

Code chunks benefit enormously from rich metadata. The more context you provide, the better the retrieval and the more useful the LLM's response.

def enrich_code_metadata(chunk: CodeChunk) -> dict:
    """Create rich metadata for a code chunk."""
    metadata = {
        "source": chunk.file_path,
        "chunk_type": chunk.chunk_type,
        "name": chunk.name,
        "start_line": chunk.start_line,
        "end_line": chunk.end_line,
        "language": "python",
    }

    # Add docstring as separate searchable field
    if chunk.docstring:
        metadata["docstring"] = chunk.docstring

    # Add import context
    if chunk.imports:
        metadata["imports"] = ", ".join(chunk.imports[:10])

    # Infer module path from file path
    # src/auth/middleware.py -> auth.middleware
    module_path = chunk.file_path.removesuffix(".py").replace("/", ".")  # strip only the trailing .py
    if module_path.startswith("src."):
        module_path = module_path[4:]
    metadata["module"] = module_path

    return metadata
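Metadata stored next to a chunk helps with filtering, but the embedding model only sees the text it is given. One option, sketched here under the assumption that the metadata dictionary uses the keys produced above (source, module, name, docstring, imports), is to prepend a short comment header to the chunk before embedding it. The function name build_embedding_text is hypothetical.

```python
def build_embedding_text(content: str, metadata: dict) -> str:
    """Prefix a chunk with a short comment header built from its metadata."""
    header = [f"# File: {metadata.get('source', 'unknown')}"]
    if "module" in metadata and "name" in metadata:
        header.append(f"# Symbol: {metadata['module']}.{metadata['name']}")
    summary = (metadata.get("docstring") or "").strip()
    if summary:
        # Only the first docstring line, to keep the header short
        header.append(f"# Summary: {summary.splitlines()[0]}")
    if metadata.get("imports"):
        header.append(f"# Imports: {metadata['imports']}")
    return "\n".join(header) + "\n" + content

example_metadata = {
    "source": "src/auth/middleware.py",
    "module": "auth.middleware",
    "name": "validate_token",
    "docstring": "Check a bearer token.\n\nRaises on expiry.",
    "imports": "import jwt",
}
text = build_embedding_text("def validate_token(token):\n    ...", example_metadata)
print(text)
```

Because the header is written as comments, the chunk remains valid Python and the LLM can still read it as code.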

Building a Codebase Chatbot

Here is a complete example that ingests a Python project and creates a Q&A interface:

import os
from pathlib import Path
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_text_splitters import Language, RecursiveCharacterTextSplitter
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# --- Step 1: Load code files ---
loader = DirectoryLoader(
    "./src",
    glob="**/*.py",
    loader_cls=TextLoader,
    loader_kwargs={"encoding": "utf-8"},
)
code_docs = loader.load()

# Add language metadata
for doc in code_docs:
    doc.metadata["language"] = "python"
    doc.metadata["content_type"] = "code"

# --- Step 2: Language-aware chunking ---
splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=2000,
    chunk_overlap=200,
)
chunks = splitter.split_documents(code_docs)
print(f"Created {len(chunks)} code chunks")

# --- Step 3: Embed and store ---
vectorstore = Chroma.from_documents(
    chunks,
    OpenAIEmbeddings(model="text-embedding-3-small"),
    persist_directory="./code_chroma_db",
)

# --- Step 4: Code-specific prompt ---
CODE_PROMPT = ChatPromptTemplate.from_messages([
    ("system", """You are a senior developer assistant that answers questions
about a codebase. You have access to the following code snippets.

Rules:
1. Answer based on the code provided in the context.
2. When referencing code, cite the file path (and line numbers if the context includes them).
3. If the code does not contain enough information, say so.
4. Explain not just what the code does, but why it likely
   does it that way.
5. Suggest improvements only when asked.

Code Context:
{context}"""),
    ("human", "{question}"),
])

# --- Step 5: Build chain ---
retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 8, "fetch_k": 25},
)

def format_code_docs(docs):
    formatted = []
    for doc in docs:
        source = doc.metadata.get("source", "unknown")
        formatted.append(f"# File: {source}\n{doc.page_content}")
    return "\n\n---\n\n".join(formatted)

chain = (
    {"context": retriever | format_code_docs, "question": RunnablePassthrough()}
    | CODE_PROMPT
    | ChatOpenAI(model="gpt-4o", temperature=0)
    | StrOutputParser()
)

# --- Use it ---
answer = chain.invoke("How does the authentication middleware validate tokens?")
print(answer)

Integrating With Git Repositories

For RAG systems that need to stay current with a codebase, integrate with git to detect changes and re-index only modified files:

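import os
import subprocess

def get_changed_files(since_commit: str = "HEAD~1") -> list[str]:
    """Get Python files changed since a specific commit (includes deleted files)."""
    result = subprocess.run(
        ["git", "diff", "--name-only", since_commit],
        capture_output=True, text=True, check=True,  # raise if git fails
    )
    return [f for f in result.stdout.splitlines() if f.endswith(".py")]

def incremental_index(vectorstore, changed_files: list[str]):
    """Re-index only changed files.

    Reuses TextLoader and splitter from the chatbot example. Paths must match
    the "source" metadata stored at ingestion time.
    """
    for file_path in changed_files:
        # Delete old chunks for this file: look up their IDs by source, then delete
        existing = vectorstore.get(where={"source": file_path})
        if existing["ids"]:
            vectorstore.delete(ids=existing["ids"])

        # A file deleted from the repository has nothing to re-index
        if not os.path.exists(file_path):
            print(f"Removed: {file_path}")
            continue

        # Re-load and re-chunk the file
        loader = TextLoader(file_path, encoding="utf-8")
        docs = loader.load()
        chunks = splitter.split_documents(docs)

        # Add new chunks
        vectorstore.add_documents(chunks)
        print(f"Re-indexed: {file_path} ({len(chunks)} chunks)")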
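The --name-only flag does not say whether a file was modified, deleted, or renamed. If that distinction matters, git diff --name-status prints a status code before each path (and both the old and new path for renames). The parser below is a sketch that works on that output; the function name and the sample text are made up for illustration.

```python
def parse_name_status(output: str, suffix: str = ".py") -> tuple[list[str], list[str]]:
    """Split `git diff --name-status` output into (to_reindex, to_remove)."""
    to_reindex, to_remove = [], []
    for line in output.splitlines():
        if not line.strip():
            continue
        status, *paths = line.split("\t")
        code = status[0]  # rename/copy codes carry a score, e.g. "R100"
        if code == "D":
            to_remove.append(paths[0])
        elif code == "R" and len(paths) == 2:
            to_remove.append(paths[0])   # old path no longer exists
            to_reindex.append(paths[1])  # new path needs fresh chunks
        else:
            to_reindex.append(paths[-1])  # added, modified, copied, ...
    keep = lambda items: [p for p in items if p.endswith(suffix)]
    return keep(to_reindex), keep(to_remove)

sample_output = (
    "M\tsrc/auth/middleware.py\n"
    "D\tsrc/auth/legacy.py\n"
    "R100\tsrc/utils.py\tsrc/helpers.py\n"
    "A\tREADME.md\n"
)
to_reindex, to_remove = parse_name_status(sample_output)
print("re-index:", to_reindex)
print("remove:", to_remove)
```

Paths in to_remove only need their old chunks deleted from the vector store, while paths in to_reindex need the delete-then-add cycle.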

Documentation Generation From Code

RAG for code can also work in reverse: given code, generate documentation.

DOC_GEN_PROMPT = ChatPromptTemplate.from_messages([
    ("system", """Generate clear, concise documentation for the given code.
Include:
- A one-line summary
- Parameters and return types
- Usage examples
- Any important notes about behavior or edge cases

Code:
{code}"""),
    ("human", "Generate documentation for this code."),
])
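Before calling an LLM with a prompt like this, you need to decide which code to send. One reasonable starting point, sketched below with a hypothetical helper name, is to collect the functions, methods, and classes that have no docstring yet. Each returned source segment could then fill the {code} placeholder.

```python
import ast

def find_undocumented(source: str) -> list[tuple[str, str]]:
    """Return (qualified_name, source_code) for defs and classes without a docstring."""
    tree = ast.parse(source)
    results = []

    def visit(node, prefix=""):
        for child in node.body:
            if isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                name = f"{prefix}{child.name}"
                if ast.get_docstring(child) is None:
                    results.append((name, ast.get_source_segment(source, child)))
                if isinstance(child, ast.ClassDef):
                    visit(child, prefix=f"{name}.")  # descend into methods

    visit(tree)
    return results

EXAMPLE = '''class Cart:
    """A shopping cart."""

    def add(self, item):
        self.items.append(item)

    def total(self):
        """Sum item prices."""
        return sum(i.price for i in self.items)

def helper(x):
    return x * 2
'''

undocumented = find_undocumented(EXAMPLE)
for name, code in undocumented:
    print(f"Needs documentation: {name}")
```

This sketch looks only at module-level definitions and class members. Functions nested inside other functions are ignored.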

Tips for Code RAG

Include the file path in every chunk. Code without file path context is hard to navigate.
Preserve import statements. Either include imports in each chunk or store them as metadata. They tell the LLM what libraries and modules the code uses.
Use larger chunks for code. Code is denser than prose. A 2000-character code chunk is often a single function, while a 2000-character prose chunk might be several paragraphs. Err on the side of larger chunks.
Index documentation alongside code. README files, docstrings, comments, and inline documentation should be indexed as separate chunks with references to the code they describe.
Test with real developer questions. "How does X work?", "Where is Y implemented?", "What calls function Z?" -- these are the queries your system needs to handle well.
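Questions of the form "What calls function Z?" are hard to answer from embeddings alone, because the caller and the callee may share little text. A possible complement is a small caller index built from the AST and stored as metadata. The sketch below matches calls by name only and does not resolve imports, scopes, or methods on different classes, so treat its output as a hint rather than a precise call graph. The function name and sample source are illustrative.

```python
import ast
from collections import defaultdict

def build_caller_index(source: str) -> dict[str, set[str]]:
    """Map each called name to the set of top-level functions that call it."""
    tree = ast.parse(source)
    callers = defaultdict(set)
    for func in tree.body:
        if not isinstance(func, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        for node in ast.walk(func):
            if isinstance(node, ast.Call):
                target = node.func
                if isinstance(target, ast.Name):         # plain call: foo()
                    callers[target.id].add(func.name)
                elif isinstance(target, ast.Attribute):  # attribute call: obj.foo()
                    callers[target.attr].add(func.name)
    return dict(callers)

CALL_SAMPLE = '''def validate_token(token):
    return decode(token) is not None

def decode(token):
    return token.strip() or None

def middleware(request):
    if not validate_token(request.token):
        raise PermissionError("invalid token")
    return request
'''

caller_index = build_caller_index(CALL_SAMPLE)
print(sorted(caller_index["validate_token"]))
```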

In the next lesson, you will learn how to measure whether your RAG system -- for code or any other content -- is actually producing good results.