Document Processing
The Ingestion Challenge
Before you can embed and search documents, you need to extract clean text from them. This sounds simple, but it is one of the most underestimated parts of building a RAG system. Real-world documents come in messy formats: multi-column PDFs with headers and footers, HTML pages cluttered with navigation and ads, scanned images embedded in Word documents, and code files with complex nesting.
The quality of your extracted text directly determines the quality of your embeddings and, ultimately, the quality of your answers. Garbage in, garbage out applies nowhere more forcefully than in RAG.
PDF Processing
PDFs are the most common document format in enterprise RAG systems, and they are also the hardest to process well. A PDF is fundamentally a visual format -- it describes where to draw characters on a page, not the logical structure of the text.
PyPDF (Simple and Fast)
PyPDF is a pure-Python library that extracts text from most standard PDFs. It works well for text-native PDFs (documents created digitally), but it has no OCR, so scanned pages yield little or no text, and it can scramble reading order in complex layouts such as multi-column pages.
from langchain_community.document_loaders import PyPDFLoader
# Load a PDF and split by pages
loader = PyPDFLoader("company_handbook.pdf")
pages = loader.load()
for page in pages:
    # PyPDFLoader stores a zero-based page index in metadata["page"]
    print(f"Page {page.metadata['page']}: {len(page.page_content)} chars")
    print(page.page_content[:200])
    print("---")
Unstructured (Robust and Layout-Aware)
The unstructured library handles complex documents by understanding layout elements like titles, headers, tables, and lists. It can process PDFs, DOCX, HTML, images (via OCR), and more.
from langchain_community.document_loaders import UnstructuredPDFLoader
# High-resolution mode detects layout elements
loader = UnstructuredPDFLoader(
    "annual_report.pdf",
    mode="elements",    # Split into elements (titles, text, tables)
    strategy="hi_res"   # Use layout detection model
)
elements = loader.load()
for elem in elements[:5]:
    print(f"Type: {elem.metadata.get('category', 'unknown')}")
    print(f"Text: {elem.page_content[:100]}")
    print("---")
Handling Scanned PDFs
Scanned PDFs contain images, not text, so you need OCR (Optical Character Recognition) to extract anything from them. The unstructured library integrates with Tesseract OCR, and more advanced options include Azure Document Intelligence and Amazon Textract.
# Using unstructured with OCR for scanned documents
from langchain_community.document_loaders import UnstructuredPDFLoader
loader = UnstructuredPDFLoader(
    "scanned_contract.pdf",
    strategy="ocr_only"  # Force OCR processing
)
docs = loader.load()
Tip: For production systems processing many PDFs, consider Azure Document Intelligence or Amazon Textract. In many cases they handle tables, forms, and handwriting better than open-source OCR.
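Collections often mix text-native and scanned PDFs, so it can help to decide per page whether OCR is needed. The sketch below is one possible heuristic, not a standard method: it flags pages whose extracted text has very few visible characters. The function name pages_needing_ocr, the min_chars threshold of 50, and the sample page texts are all assumptions made for illustration.

```python
def pages_needing_ocr(page_texts, min_chars=50):
    """Return indices of pages whose extracted text looks too short to be a real text layer."""
    flagged = []
    for index, text in enumerate(page_texts):
        # Count only non-whitespace characters so blank padding is ignored.
        visible = sum(1 for ch in text if not ch.isspace())
        if visible < min_chars:
            flagged.append(index)
    return flagged


# Hypothetical per-page text, for example page.page_content from a PDF loader.
page_texts = [
    "Employee Handbook. Section 1: Working hours are 9 to 5, Monday through "
    "Friday, with a one-hour lunch break.",
    "",            # scanned page: no text layer at all
    "  \n 3 \n",   # only a page number survived extraction
]

print(pages_needing_ocr(page_texts))
```

Pages flagged this way could be sent through an OCR strategy while the rest keep the faster text extraction. The threshold would need tuning, since legitimately short pages (title pages, section dividers) will also be flagged.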
HTML Processing
Web pages are a common source for RAG systems -- documentation sites, knowledge bases, and blog posts all live on the web. The challenge is separating the actual content from navigation, sidebars, footers, and ads.
BeautifulSoup Approach
from langchain_community.document_loaders import WebBaseLoader
import bs4
# Load a web page and extract only article content
loader = WebBaseLoader(
    web_paths=["https://docs.example.com/guide"],
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("post-content", "article-body", "main-content")
        )
    ),
)
docs = loader.load()
# Clean up the extracted text
for doc in docs:
    # Remove excessive whitespace
    doc.page_content = " ".join(doc.page_content.split())
    print(doc.page_content[:300])
Processing Multiple Pages
For documentation sites with many pages, use a sitemap or recursive crawler:
from langchain_community.document_loaders import SitemapLoader
# Load all pages from a sitemap
loader = SitemapLoader(
    "https://docs.example.com/sitemap.xml",
    filter_urls=["https://docs.example.com/guide/"],  # Only guide pages
)
docs = loader.load()
print(f"Loaded {len(docs)} pages")
Markdown Processing
Markdown files are the easiest format to process because they are already plain text with lightweight structure. They are common in documentation repos, wikis, and knowledge bases.
from langchain_community.document_loaders import DirectoryLoader, TextLoader
# Load all Markdown files from a directory
loader = DirectoryLoader(
    "./docs",
    glob="**/*.md",
    loader_cls=TextLoader,
    loader_kwargs={"encoding": "utf-8"}
)
docs = loader.load()
# Each document includes source path in metadata
for doc in docs:
    print(f"Source: {doc.metadata['source']}")
    print(f"Length: {len(doc.page_content)} chars")
Code File Processing
Code files need special handling because their structure carries meaning. Indentation, function boundaries, and import statements all matter for understanding the code.
from langchain_community.document_loaders import DirectoryLoader, TextLoader
# Load Python files from a project
loader = DirectoryLoader(
    "./src",
    glob="**/*.py",
    loader_cls=TextLoader,
    loader_kwargs={"encoding": "utf-8"}
)
code_docs = loader.load()
# Add language metadata
for doc in code_docs:
    doc.metadata["language"] = "python"
    doc.metadata["file_type"] = "code"
We will cover code processing in depth in Lesson 9.
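As a small preview of why structure matters, the sketch below uses Python's standard ast module to list a file's imports and the line ranges of its top-level functions and classes. It is only an illustration under stated assumptions: the helper name outline_python_source and the sample source are hypothetical, and the approach applies to Python files only.

```python
import ast


def outline_python_source(source: str) -> dict:
    """Return imported module names and top-level definitions with their line ranges."""
    tree = ast.parse(source)
    imports, definitions = [], []
    for node in tree.body:
        if isinstance(node, ast.Import):
            imports.extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            # Relative imports have a level > 0 and possibly no module name.
            imports.append("." * node.level + (node.module or ""))
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            definitions.append(
                {"name": node.name, "start": node.lineno, "end": node.end_lineno}
            )
    return {"imports": imports, "definitions": definitions}


# Hypothetical file contents; in practice this would be doc.page_content.
sample = '''import os
from pathlib import Path

def read(path):
    return Path(path).read_text()

class Loader:
    def load(self):
        return []
'''

outline = outline_python_source(sample)
print(outline["imports"])
for definition in outline["definitions"]:
    print(definition)
```

The line ranges could be stored as metadata or used later to keep a whole function in one chunk instead of cutting it in the middle.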
Metadata Extraction
Good metadata makes retrieval dramatically better. It enables filtering (only search marketing documents), provides context to the LLM (this information is from the 2024 annual report), and improves source citation.
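To make the filtering and context cases concrete, here is a minimal sketch that uses plain dictionaries as stand-ins for loaded documents. The department field, the sample records, and both helper names are hypothetical. In a real system the filter would more likely be passed to the vector store at query time than applied in Python.

```python
def filter_by_metadata(docs, **required):
    """Keep documents whose metadata matches every required key/value pair."""
    return [
        doc for doc in docs
        if all(doc["metadata"].get(key) == value for key, value in required.items())
    ]


def with_source_context(doc):
    """Prefix the text with its origin so the LLM can see where it came from."""
    title = doc["metadata"].get("title", "unknown")
    return f"[Source: {title}]\n{doc['page_content']}"


# Hypothetical records; real ones would come from a document loader.
docs = [
    {
        "page_content": "The spring campaign reached new customer segments.",
        "metadata": {"department": "marketing", "title": "Campaign Review"},
    },
    {
        "page_content": "Revenue grew compared with the previous year.",
        "metadata": {"department": "finance", "title": "2024 Annual Report"},
    },
]

marketing_docs = filter_by_metadata(docs, department="marketing")
print([doc["metadata"]["title"] for doc in marketing_docs])
print(with_source_context(docs[1]))
```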
Essential Metadata Fields
Every document in your RAG system should carry at minimum:
- source -- where the document came from (file path, URL, database ID)
- title -- the document or section title
- date -- when the document was created or last updated
- content_type -- what kind of content it is (policy, guide, code, FAQ)
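One way to enforce this minimum is a small check that runs before indexing. The sketch below assumes metadata is a plain dictionary and treats empty values as missing. The helper name missing_metadata and the sample records are illustrative, not part of any library.

```python
REQUIRED_FIELDS = ("source", "title", "date", "content_type")


def missing_metadata(metadata: dict) -> list:
    """Return the required fields that are absent or empty."""
    return [field for field in REQUIRED_FIELDS if not metadata.get(field)]


# Hypothetical metadata records
complete = {
    "source": "./docs/leave-policy.md",
    "title": "Leave Policy",
    "date": "2024-03-01",
    "content_type": "policy",
}
partial = {"source": "./docs/untitled.md", "content_type": "guide"}

print(missing_metadata(complete))
print(missing_metadata(partial))
```

Documents with missing fields can then be fixed, skipped, or logged, depending on how strict the pipeline needs to be.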
Extracting Metadata Automatically
from datetime import datetime
from pathlib import Path
def enrich_metadata(doc, base_path: str = ""):
    """Add useful metadata to a loaded document."""
    source = doc.metadata.get("source", "")
    path = Path(source)
    # Extract from file path
    doc.metadata["filename"] = path.name
    doc.metadata["extension"] = path.suffix
    doc.metadata["directory"] = str(path.parent)
    # Extract title from first heading
    lines = doc.page_content.split("\n")
    for line in lines:
        if line.startswith("# "):
            # Drop only the leading "# " marker; lstrip("# ") would also strip
            # any further "#" or space characters at the start of the title
            doc.metadata["title"] = line[2:].strip()
            break
    # Add processing timestamp
    doc.metadata["indexed_at"] = datetime.now().isoformat()
    # Estimate content type
    if path.suffix in [".py", ".js", ".ts", ".go"]:
        doc.metadata["content_type"] = "code"
    elif path.suffix in [".md", ".mdx"]:
        doc.metadata["content_type"] = "documentation"
    elif path.suffix == ".pdf":
        doc.metadata["content_type"] = "document"
    return doc
Building a Multi-Source Loader
In production, you rarely load documents from a single source. A typical RAG system ingests from multiple formats and locations. Here is a pattern for building a unified loader:
from langchain_community.document_loaders import (
    PyPDFLoader,
    TextLoader,
    DirectoryLoader,
    WebBaseLoader,
)
def load_all_documents(config: dict) -> list:
    """Load documents from multiple sources."""
    all_docs = []
    # Load PDFs
    if "pdf_dir" in config:
        loader = DirectoryLoader(
            config["pdf_dir"],
            glob="**/*.pdf",
            loader_cls=PyPDFLoader
        )
        pdf_docs = loader.load()
        for doc in pdf_docs:
            doc.metadata["source_type"] = "pdf"
        all_docs.extend(pdf_docs)
    # Load Markdown docs
    if "docs_dir" in config:
        loader = DirectoryLoader(
            config["docs_dir"],
            glob="**/*.md",
            loader_cls=TextLoader,
            loader_kwargs={"encoding": "utf-8"}
        )
        md_docs = loader.load()
        for doc in md_docs:
            doc.metadata["source_type"] = "markdown"
        all_docs.extend(md_docs)
    # Load web pages
    if "urls" in config:
        loader = WebBaseLoader(web_paths=config["urls"])
        web_docs = loader.load()
        for doc in web_docs:
            doc.metadata["source_type"] = "web"
        all_docs.extend(web_docs)
    # Enrich all metadata
    all_docs = [enrich_metadata(doc) for doc in all_docs]
    print(f"Loaded {len(all_docs)} documents from {len(config)} sources")
    return all_docs
# Usage
docs = load_all_documents({
    "pdf_dir": "./data/pdfs",
    "docs_dir": "./data/docs",
    "urls": [
        "https://docs.example.com/faq",
        "https://docs.example.com/getting-started",
    ]
})
Handling Tables and Structured Data
Tables in PDFs and HTML are one of the trickiest challenges. When text extraction flattens a table, the row-column relationships are lost, and the LLM cannot interpret the data correctly.
Strategies for tables:
- Preserve structure. Use unstructured with the hi_res strategy to detect and extract tables as HTML or Markdown.
- Describe tables. Convert tables to natural language descriptions: "In Q3 2024, revenue was $12M, up 15% from Q2."
- Separate table chunks. Store tables as their own chunks with metadata indicating they are tabular data.
- Use vision models. For complex tables, screenshot the table and use a vision LLM to describe its contents.
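The first three strategies can be sketched in plain Python once a table has been extracted into a header and rows. Everything below is illustrative: the helper names, the sample figures, and the chunk layout are assumptions, and the sentence template only suits tables whose first column labels the row.

```python
def table_to_markdown(header, rows):
    """Render a table as Markdown so row-column relationships survive as text."""
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(str(cell) for cell in row) + " |")
    return "\n".join(lines)


def describe_rows(header, rows):
    """Write one sentence per row, treating the first column as the row label."""
    sentences = []
    for row in rows:
        label, values = row[0], row[1:]
        facts = ", ".join(
            f"{name.lower()} was {value}" for name, value in zip(header[1:], values)
        )
        sentences.append(f"In {label}, {facts}.")
    return sentences


# Hypothetical table contents
header = ["Quarter", "Revenue", "Growth"]
rows = [["Q2 2024", "$10.4M", "8%"], ["Q3 2024", "$12M", "15%"]]

# Store the table as its own chunk, marked as tabular data.
table_chunk = {
    "page_content": table_to_markdown(header, rows),
    "metadata": {"content_type": "table"},
}

print(table_chunk["page_content"])
print(describe_rows(header, rows))
```

Indexing both the Markdown chunk and the generated sentences is one option: the sentences tend to match natural language questions, while the Markdown keeps the full table available to the LLM.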
Tips for Production Document Processing
- Deduplicate. Check for duplicate documents before indexing. The same content from different sources will pollute your search results.
- Clean aggressively. Remove headers, footers, page numbers, and boilerplate text. They add noise without adding information.
- Preserve headings. Section headings provide crucial context. Include them in each chunk or store them as metadata.
- Track versions. When documents update, re-index them and remove stale entries. A RAG system answering from outdated documents is worse than not answering at all.
- Log everything. Record how many documents were loaded, how many failed, and why. Parsing failures are inevitable and you need visibility into them.
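The first two tips can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not a complete solution: the function names and the 0.8 threshold are hypothetical, the hash only catches texts that are identical after whitespace and case normalization, and the line filter only removes lines that repeat verbatim across pages (page numbers that change on every page need a separate rule).

```python
import hashlib
from collections import Counter


def content_fingerprint(text: str) -> str:
    """Hash of lowercased, whitespace-normalized text."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def deduplicate(texts):
    """Keep the first occurrence of each distinct text, preserving order."""
    seen, unique = set(), []
    for text in texts:
        fingerprint = content_fingerprint(text)
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(text)
    return unique


def strip_repeated_lines(pages, min_fraction=0.8):
    """Drop lines that recur on at least min_fraction of pages (likely headers or footers)."""
    counts = Counter()
    for page in pages:
        # Count each distinct non-empty line once per page.
        counts.update({line.strip() for line in page.splitlines() if line.strip()})
    # Require at least two occurrences so single-page documents are left alone.
    threshold = max(2, min_fraction * len(pages))
    repeated = {line for line, count in counts.items() if count >= threshold}
    return [
        "\n".join(line for line in page.splitlines() if line.strip() not in repeated)
        for page in pages
    ]


# Hypothetical inputs
print(deduplicate(["Hello  world", "hello world", "Other text"]))
print(strip_repeated_lines([
    "ACME Confidential\nIntro text",
    "ACME Confidential\nMore text",
    "ACME Confidential\nFinal text",
]))
```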
Document processing is not glamorous, but it is where many RAG systems succeed or fail. In the next lesson, you will learn how to split these processed documents into chunks optimized for retrieval.