Embeddings Deep Dive — RAG Engineering: Building AI That Knows Your Data

What Are Embeddings?

An embedding is a numerical representation of text as a vector -- a list of floating-point numbers that captures the semantic meaning of the input. Two pieces of text that mean similar things will produce vectors that are close together in the embedding space. Two pieces of text that mean different things will produce vectors that are far apart.

This is the foundation of everything in RAG. When you search for relevant documents, you are not matching keywords. You are comparing the meaning of a query against the meaning of stored text, measured by the distance between their vectors.

Consider these three sentences:

"The cat sat on the mat."
"A feline rested on the rug."
"The quarterly revenue exceeded projections."

Sentences 1 and 2 share only function words ("the", "on"), but their embeddings will be very close because they mean nearly the same thing. Sentence 3 looks like sentence 1 on the surface (both are short declarative statements that open with "The"), but its embedding will be far from both because the meaning is entirely different.

This is what makes semantic search so powerful. Keyword search would fail to connect sentences 1 and 2. Embedding-based search connects them effortlessly.
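
To make the keyword-search failure concrete, here is a minimal sketch that measures word overlap between the three sentences. The stop-word list and the content_words and jaccard helpers are illustrative assumptions for this example, not part of any library.

```python
import string

# Tiny illustrative stop-word list (an assumption for this sketch).
STOP_WORDS = {"the", "a", "on"}

def content_words(sentence: str) -> set[str]:
    """Lowercase the sentence, strip punctuation, and drop stop words."""
    cleaned = sentence.lower().translate(str.maketrans("", "", string.punctuation))
    return {word for word in cleaned.split() if word not in STOP_WORDS}

def jaccard(a: set[str], b: set[str]) -> float:
    """Fraction of words shared by both sets (0.0 = none, 1.0 = identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

s1 = content_words("The cat sat on the mat.")
s2 = content_words("A feline rested on the rug.")
s3 = content_words("The quarterly revenue exceeded projections.")

print(jaccard(s1, s2))  # 0.0
print(jaccard(s1, s3))  # 0.0
```

Word overlap scores both pairs at zero, so it cannot tell the paraphrase from the unrelated sentence. Separating the two is the job of the embedding model.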

How Text Becomes a Vector

To generate an embedding, you pass text through a neural network (the embedding model) that has been trained on massive amounts of text data. During training, the model learns to position semantically similar text close together in a high-dimensional space.

The output is a fixed-length array of floating-point numbers. For example, OpenAI's text-embedding-3-small produces a 1536-dimensional vector. That means every piece of text -- whether it is a single word or a full paragraph -- gets mapped to a list of 1536 numbers.

# A simplified view of what an embedding looks like
embedding = [0.0023, -0.0142, 0.0381, ..., -0.0091]  # 1536 floats

The individual numbers in the vector are not interpretable by humans. You cannot look at dimension 47 and say "this represents the concept of animals." The meaning is distributed across all dimensions collectively. But mathematically, these vectors encode rich semantic information that can be compared, clustered, and searched.

Popular Embedding Models

Choosing the right embedding model is one of the most important decisions in your RAG system. Here are the major options:

Commercial APIs

OpenAI text-embedding-3-small -- 1536 dimensions, excellent quality-to-cost ratio. This is a common default for production RAG systems. It handles English and multilingual text well, costs $0.02 per million tokens, and accepts up to 8191 input tokens.

OpenAI text-embedding-3-large -- 3072 dimensions, higher accuracy for complex retrieval tasks. Costs $0.13 per million tokens. Use this when retrieval precision is critical and cost is secondary.

Cohere embed-v3 -- Available in multiple sizes (384 dimensions for the light version, 1024 for the full version). Strong multilingual support with 100+ languages. Offers specialized input types for search queries vs. documents.
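
For the two OpenAI models, the per-token prices quoted above translate into a corpus-level cost with simple arithmetic. The sketch below assumes those quoted prices and a hypothetical corpus of 50 million tokens; the embedding_cost_usd helper is illustrative, and pricing should be checked against the provider's current page.

```python
# USD per million tokens, as quoted in this lesson (verify before budgeting).
PRICE_PER_MILLION_TOKENS = {
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
}

def embedding_cost_usd(total_tokens: int, model: str) -> float:
    """Estimated one-time cost of embedding total_tokens with the given model."""
    return total_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS[model]

corpus_tokens = 50_000_000  # hypothetical corpus size

for model in PRICE_PER_MILLION_TOKENS:
    print(f"{model}: ${embedding_cost_usd(corpus_tokens, model):.2f}")
# text-embedding-3-small: $1.00
# text-embedding-3-large: $6.50
```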

Open-Source Models

BGE (BAAI General Embedding) -- Family of models from the Beijing Academy of AI. BGE-large-en-v1.5 is a strong choice for English. BGE-m3 handles multilingual use cases well. Free to run locally.

E5 (EmbEddings from bidirEctional Encoder rEpresentations) -- Microsoft's embedding family. E5-large-v2 and E5-mistral-7b-instruct are top performers. The instruction-tuned variants allow you to prepend task-specific prefixes.

GTE (General Text Embeddings) -- Alibaba's offering. GTE-large performs competitively with commercial models on many benchmarks. Good for self-hosted deployments.

all-MiniLM-L6-v2 -- A sentence-transformers classic. Only 384 dimensions and fast to run. Not the highest quality, but extremely practical for prototyping and resource-constrained environments.
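
The task-specific prefixes mentioned for the E5 family can be sketched as follows. This assumes the "query: " and "passage: " convention described on the E5-large-v2 model card; other variants (for example the instruction-tuned E5-mistral model) use a different format, so confirm the exact strings for the model you pick. The add_e5_prefixes helper is a hypothetical name.

```python
def add_e5_prefixes(
    queries: list[str], passages: list[str]
) -> tuple[list[str], list[str]]:
    """Prepend the role prefix each text needs before it is embedded."""
    prefixed_queries = [f"query: {q}" for q in queries]
    prefixed_passages = [f"passage: {p}" for p in passages]
    return prefixed_queries, prefixed_passages

queries, passages = add_e5_prefixes(
    ["How does photosynthesis work?"],
    ["Photosynthesis converts sunlight into chemical energy in plants."],
)
print(queries[0])   # query: How does photosynthesis work?
print(passages[0])  # passage: Photosynthesis converts sunlight into chemical energy in plants.
```

The prefixed strings are what you would pass to the model's encode call, both at indexing time and at query time.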

How to Choose

For most teams starting out: use text-embedding-3-small from OpenAI. It is cheap, fast, and good enough for the vast majority of use cases. Switch to open-source models when you need to avoid API dependencies, reduce costs at scale, or handle specialized domains where a fine-tuned model outperforms general-purpose ones.

Dimensionality Tradeoffs

Higher dimensions generally capture more nuance but come with costs:

Storage: A 3072-dimensional vector takes twice the space of a 1536-dimensional one.
Search speed: Higher dimensions mean slower similarity computations, especially at scale.
Index size: Vector database indexes grow with dimensionality.
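
The storage cost above is easy to estimate. The sketch below assumes each value is stored as a 4-byte float32 and counts raw vector data only; real indexes add overhead that depends on the database, and raw_vector_bytes is a hypothetical helper.

```python
BYTES_PER_FLOAT32 = 4  # assumption: vectors are stored as float32

def raw_vector_bytes(
    num_vectors: int, dimensions: int, bytes_per_value: int = BYTES_PER_FLOAT32
) -> int:
    """Raw storage for the vectors alone, excluding index and metadata overhead."""
    return num_vectors * dimensions * bytes_per_value

for dims in (384, 1536, 3072):
    gib = raw_vector_bytes(1_000_000, dims) / 1024**3
    print(f"{dims:>4} dims: {gib:.2f} GiB per million vectors")
#  384 dims: 1.43 GiB per million vectors
# 1536 dims: 5.72 GiB per million vectors
# 3072 dims: 11.44 GiB per million vectors
```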

In practice, 1536 dimensions is a reasonable default for most applications. Very small vectors (below roughly 384 dimensions) tend to lose retrieval quality, and going above 3072 rarely provides meaningful improvements for typical RAG use cases. Treat these thresholds as rules of thumb and confirm them on your own data.

OpenAI's text-embedding-3 models support a dimensions parameter that shortens the output to a lower dimensionality. This is useful for experimentation:

from openai import OpenAI

client = OpenAI()

# Full 1536 dimensions
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="What is retrieval-augmented generation?"
)
full_embedding = response.data[0].embedding
print(f"Full dimensions: {len(full_embedding)}")  # 1536

# Truncated to 512 dimensions
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="What is retrieval-augmented generation?",
    dimensions=512
)
truncated_embedding = response.data[0].embedding
print(f"Truncated dimensions: {len(truncated_embedding)}")  # 512
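
If you already have full-size vectors stored and want to try a smaller size without another API call, one option is to cut the vector and re-normalize it yourself. The sketch below shows that truncate-then-normalize step on a toy vector; truncate_and_normalize is a hypothetical helper, and whether the result matches the dimensions parameter closely enough for your use case is something to verify on your own data.

```python
import math

def truncate_and_normalize(embedding: list[float], dimensions: int) -> list[float]:
    """Keep the first `dimensions` values and rescale the result to unit length."""
    if not 0 < dimensions <= len(embedding):
        raise ValueError("dimensions must be between 1 and len(embedding)")
    shortened = embedding[:dimensions]
    norm = math.sqrt(sum(x * x for x in shortened))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in shortened]

toy_embedding = [3.0, 4.0, 12.0]  # toy values, not real model output
print(truncate_and_normalize(toy_embedding, 2))  # [0.6, 0.8]
```

Re-normalizing matters because a truncated vector is no longer unit length, which would skew dot-product scores.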

Similarity Metrics

Once you have two vectors, you need a way to measure how similar they are. The two most common metrics are:

Cosine Similarity

Measures the angle between two vectors, ignoring magnitude. Returns a value between -1 and 1, where 1 means identical direction (same meaning), 0 means orthogonal (unrelated), and -1 means opposite direction.

import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

Cosine similarity is the default choice for most embedding models because it compares direction only. Many models, including OpenAI's, return vectors already normalized to unit length, in which case magnitude carries no information anyway.
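
The reference points of the cosine scale can be checked with small hand-made vectors. This sketch uses plain Python so it runs without dependencies; the vectors are toy values chosen for illustration, not real embeddings.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Same direction, different length: magnitude is ignored.
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))   # 1.0
# Perpendicular vectors.
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))   # 0.0
# Opposite direction.
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # -1.0
```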

Dot Product

A simpler computation that does not normalize for magnitude. If your embeddings are already normalized (unit vectors), dot product and cosine similarity produce identical results.

def dot_product(a, b):
    return np.dot(a, b)

In practice, most vector databases let you choose the metric at index creation time. Use cosine similarity unless you have a specific reason to use dot product or Euclidean distance.
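
The relationship between the metrics can be verified numerically. The sketch below uses toy vectors and plain Python helpers (dot, normalize, and cosine are local names, not library calls). It shows that dot product and cosine similarity agree on unit vectors, and that squared Euclidean distance between unit vectors equals 2 - 2 * cosine, so all three metrics rank normalized vectors the same way.

```python
import math

def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def normalize(v: list[float]) -> list[float]:
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]

def cosine(a: list[float], b: list[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

raw_a, raw_b = [3.0, 4.0], [8.0, 6.0]
unit_a, unit_b = normalize(raw_a), normalize(raw_b)

# On raw vectors the dot product grows with vector length.
print(dot(raw_a, raw_b))  # 48.0

# On unit vectors the dot product equals the cosine similarity.
print(math.isclose(dot(unit_a, unit_b), cosine(raw_a, raw_b)))  # True

# On unit vectors: squared Euclidean distance = 2 - 2 * cosine.
dist_sq = sum((x - y) ** 2 for x, y in zip(unit_a, unit_b))
print(math.isclose(dist_sq, 2 - 2 * dot(unit_a, unit_b)))  # True
```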

Generating Embeddings: Practical Code

With OpenAI

import numpy as np
from openai import OpenAI

client = OpenAI()  # Uses OPENAI_API_KEY env variable

def get_embeddings(texts: list[str], model: str = "text-embedding-3-small") -> list[list[float]]:
    """Generate embeddings for a list of texts."""
    response = client.embeddings.create(
        model=model,
        input=texts
    )
    return [item.embedding for item in response.data]

# Embed a single query
query_embedding = get_embeddings(["How does photosynthesis work?"])[0]

# Embed multiple documents in batch
documents = [
    "Photosynthesis converts sunlight into chemical energy in plants.",
    "The stock market closed higher on Tuesday.",
    "Chloroplasts are the organelles where photosynthesis occurs.",
]
doc_embeddings = get_embeddings(documents)

# Compare query to each document. OpenAI embeddings are unit length,
# so the dot product here equals cosine similarity.
for i, doc_emb in enumerate(doc_embeddings):
    similarity = np.dot(query_embedding, doc_emb)
    print(f"Document {i}: similarity = {similarity:.4f}")

With Sentence Transformers (Open Source)

from sentence_transformers import SentenceTransformer
import numpy as np

# Load model locally (downloads on first run)
model = SentenceTransformer("BAAI/bge-large-en-v1.5")

# Generate embeddings
query = "How does photosynthesis work?"
documents = [
    "Photosynthesis converts sunlight into chemical energy in plants.",
    "The stock market closed higher on Tuesday.",
    "Chloroplasts are the organelles where photosynthesis occurs.",
]

query_embedding = model.encode(query, normalize_embeddings=True)
doc_embeddings = model.encode(documents, normalize_embeddings=True)

# Compute similarities
similarities = np.dot(doc_embeddings, query_embedding)
for i, sim in enumerate(similarities):
    print(f"Document {i}: similarity = {sim:.4f}")
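
Both examples print a score for every document. In a retrieval pipeline you usually keep only the best few, so a minimal ranking step might look like the sketch below. It uses toy unit vectors instead of real model output, and top_k is a hypothetical helper; at scale a vector database performs this step for you.

```python
def top_k(
    query_embedding: list[float],
    doc_embeddings: list[list[float]],
    k: int = 2,
) -> list[tuple[int, float]]:
    """Return (document index, score) pairs for the k highest dot-product scores."""
    scores = [
        (index, sum(q * d for q, d in zip(query_embedding, doc)))
        for index, doc in enumerate(doc_embeddings)
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]

# Toy unit vectors standing in for real embeddings.
query = [1.0, 0.0]
docs = [[0.8, 0.6], [0.0, 1.0], [0.6, 0.8]]

print(top_k(query, docs, k=2))  # [(0, 0.8), (2, 0.6)]
```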

Tips for Production Embeddings

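Batch your requests. Sending 100 texts in one API call is far faster than 100 individual calls and counts as one request against your rate limit instead of 100. With token-based pricing, the cost per token is the same either way.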
Cache embeddings. Never re-embed text that has not changed. Store the embedding alongside the source text in your database.
Normalize consistently. If you normalize embeddings at storage time, make sure you also normalize at query time.
Match your model. Always use the same embedding model for documents and queries. Mixing models produces meaningless similarity scores.
Benchmark on your data. MTEB leaderboard rankings are useful, but the best model for your use case depends on your specific domain, language, and query patterns. Always test with real data.
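
The batching and caching tips can be combined in one small wrapper. The sketch below is an illustration under stated assumptions: EmbeddingCache is a hypothetical class, the store is an in-memory dict where production code would use a database, and embed_fn stands for any function that maps a list of texts to a list of vectors (such as the get_embeddings function shown earlier).

```python
import hashlib
from typing import Callable

EmbedFn = Callable[[list[str]], list[list[float]]]

class EmbeddingCache:
    """Embeds only unseen texts, in batches, and reuses stored vectors otherwise."""

    def __init__(self, embed_fn: EmbedFn, model: str, batch_size: int = 100):
        self.embed_fn = embed_fn
        self.model = model
        self.batch_size = batch_size
        self._store: dict[str, list[float]] = {}

    def _key(self, text: str) -> str:
        # The model name is part of the key so vectors from different
        # models are never mixed.
        return hashlib.sha256(f"{self.model}\n{text}".encode("utf-8")).hexdigest()

    def get(self, texts: list[str]) -> list[list[float]]:
        # Collect texts that are not cached yet, dropping duplicates in order.
        missing = list(
            dict.fromkeys(t for t in texts if self._key(t) not in self._store)
        )
        for start in range(0, len(missing), self.batch_size):
            batch = missing[start:start + self.batch_size]
            for text, vector in zip(batch, self.embed_fn(batch)):
                self._store[self._key(text)] = vector
        return [self._store[self._key(t)] for t in texts]

# Stand-in embedder that records the size of every batch it receives.
calls: list[int] = []

def fake_embed(batch: list[str]) -> list[list[float]]:
    calls.append(len(batch))
    return [[float(len(text))] for text in batch]

cache = EmbeddingCache(fake_embed, model="demo-model", batch_size=2)
cache.get(["a", "bb", "ccc"])   # two calls: a batch of 2, then a batch of 1
cache.get(["a", "bb", "dddd"])  # one call: only "dddd" is new
print(calls)  # [2, 1, 1]
```

Keying the cache on a hash of the model name plus the text means unchanged text is never re-embedded, and switching models cannot silently reuse old vectors.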

Embeddings are the bridge between human language and mathematical similarity. In the next lesson, you will learn where to store these vectors and how to search them at scale.