Part II — CS Building Blocks · Chapter 16
Vector Embeddings & Semantic Retrieval
An embedding maps text to a point in ℝᵈ such that semantically similar text lands nearby. Retrieval becomes a nearest-neighbor query.
Prerequisites
Before reading this chapter, you should be comfortable with: Chapter 8 (Schemas). Embeddings extend the schema layer; the embedding vector is stored alongside chunk metadata defined in schemas.
The problem is vocabulary. A lawyer who wants every text message about the custody exchange will search for "custody exchange." The caseworker who wrote those messages may have written "pickup time," "transfer," "the handoff," or "when you drop her off." A keyword search for "custody exchange" misses all four variants. The attorney has fifteen minutes and will not think of all five terms.
This is not a failure of diligence. It is the irreducible gap between the vocabulary of the searcher and the vocabulary of the author. Bridging it requires a retrieval primitive that operates on meaning rather than lexical overlap. That primitive is the vector embedding.
At a glance
- A bi-encoder model maps any text to a point in ℝ¹⁰²⁴ such that semantically similar texts land close together under cosine similarity.
- Meridian-Cannon uses bge-large-en-v1.5 (1024-d) on the Postgres/pgvector substrate. Vectors are stored in a separate `embeddings` table (not in `chunks`). The recommended ANN index is `diskann` via pgvectorscale (StreamingDiskANN); `HNSW` with `vector_cosine_ops` is the fallback for environments without pgvectorscale. Do not change the embedding model without a measurement and a `docs/divergences.md` entry.
- The four common failure modes — forgetting query/passage prefixes, treating cosine similarity as a calibrated probability, mixing normalized and unnormalized vectors, and naïve fixed-size chunking — each produce silent errors that degrade retrieval without raising exceptions.

## Learning objectives

After this chapter, you can:

- Explain bi-encoder architecture and the cosine-similarity retrieval primitive.
- Implement correct embedding with the asymmetric prefix discipline.
- Configure and query pgvector with diskann (recommended) or HNSW (fallback) cosine-distance indexing across the two-table `chunks`/`embeddings` schema.
- Identify when Matryoshka representation learning saves storage without meaningful recall loss.
- Apply section-aware chunking that preserves the entity-level signal an evidence corpus requires.

## From keyword to meaning

Keyword retrieval (BM25, Postgres `tsvector`) operates on term overlap. A query matches a document to the degree that the query's terms appear in the document, weighted by inverse document frequency. This is fast, well-understood, and correct for exact recall. It fails when the searcher and the author do not share vocabulary.

Dense retrieval asks a different question: "is this document close to this query in meaning-space?" That requires a learned geometry — a space where "the parent did not attend" and "the father was absent" land near each other. A transformer trained on large corpora — supervised with contrastive learning, self-supervised on masked tokens, or fine-tuned on retrieval tasks — produces an encoder function `f(text) → ℝᵈ`. Points near each other are semantically similar. Points far apart are not.

> ▼ Why It Matters — The vocabulary gap in practice.
>
> In the running case, a TPR (termination of parental rights) proceeding requires documentation of every custody exchange attempt over a 12-month period. The caseworker's notes, the parent's text messages, and the DHS records each use different terms for the same event. Keyword search requires the attorney to enumerate all variants — "pickup," "handoff," "transfer," "exchange," "visitation drop-off." Dense retrieval finds all of them from a single query. The fifteen-minute time window is the difference between finding the evidence and missing it.

## Bi-encoder and cross-encoder: the two-pass architecture

Two model classes serve retrieval. They operate at different stages.

Bi-encoder (Chapter 9, this chapter): encodes the query and each document independently. At query time, embed the query; find the nearest document embeddings by inner product or cosine similarity. This is fast: documents are pre-embedded at ingest time, and the query embedding is a single forward pass. Bi-encoders trade some precision for speed.

Cross-encoder (Chapter 10): takes a (query, document) pair as a single input and produces a relevance score. This is expensive — it requires one forward pass per candidate document — but more accurate because the model sees both texts simultaneously and can attend across them. Cross-encoders rerank a candidate pool, not a full index.

The standard pipeline: bi-encoder retrieves top-4k candidates at millisecond speed; cross-encoder reranks them to top-10 at higher latency. Chapter 10 covers the reranker. This chapter covers the bi-encoder.

> ◆ Going Deeper — Why separate encoders?
>
> A cross-encoder would theoretically produce the most accurate retrieval — it considers the full interaction between query and document. But indexing requires that every document embedding be computed before any query is seen. A cross-encoder cannot be pre-computed because it requires the query.
> The bi-encoder approximation — encode query and documents separately, use dot product to approximate full interaction — trades 2–5% recall for 100–10,000× speed depending on corpus size.

## The bge-large-en-v1.5 model

Meridian-Cannon uses BAAI's bge-large-en-v1.5, a 335M parameter bi-encoder trained with RetroMAE pre-training and InfoNCE contrastive fine-tuning on MS-MARCO and a suite of BEIR evaluation tasks. It produces 1024-dimensional embeddings. At the time of integration, it ranked in the top tier of the MTEB leaderboard for retrieval on English text.

Do not change this model without:

1. Running the full BEIR evaluation on the Meridian-Cannon chunk corpus.
2. Verifying that the new model's embeddings are not pre-normalized differently (mixed normalized/unnormalized vectors in the same pgvector index cause silent recall degradation).
3. Recording the change in `docs/divergences.md` with the measurement.

For new deployments in 2026, BGE-M3 (dense + sparse + late-interaction simultaneously) or Nomic-embed-text-v2-MoE are the alternatives to evaluate. The comparison criteria are: MTEB retrieval score, context window (bge-large-en-v1.5 is 512 tokens; BGE-M3 is 8192), and inference latency on the target hardware.

## The asymmetric-prefix discipline

BGE models use an asymmetric prefix: queries are prefixed with `"Represent this sentence for searching relevant passages: "` (the `query:` prefix), while documents are embedded without a prefix (passage-only). Omitting the query prefix produces a 3–8 point nDCG@10 drop on retrieval tasks. The failure is silent — the index still queries, the numbers still return, but the ranking is wrong.
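# Correct asymmetric embedding
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5")

def embed_query(text: str) -> list[float]:
    # Queries carry the instruction prefix; passages do not.
    prefix = "Represent this sentence for searching relevant passages: "
    return model.encode(prefix + text, normalize_embeddings=True).tolist()

def embed_passage(text: str) -> list[float]:
    return model.encode(text, normalize_embeddings=True).tolist()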
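A small stdlib-only sketch of what unit-length output buys, and of what goes wrong when normalized and unnormalized vectors are mixed. The toy 3-d vectors and the helper names (`dot`, `l2_normalize`, `cosine_similarity`) are illustrative assumptions, not part of the Meridian-Cannon codebase; the L2 scaling mirrors what `normalize_embeddings=True` applies to model output.

```python
import math

def dot(a: list[float], b: list[float]) -> float:
    """Plain inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v: list[float]) -> list[float]:
    """Scale v to unit length (L2 norm of 1)."""
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalize the zero vector")
    return [x / norm for x in v]

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity computed from raw (possibly unnormalized) vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy 3-d vectors standing in for 1024-d embeddings.
a = [3.0, 4.0, 0.0]  # length 5
b = [1.0, 2.0, 2.0]  # length 3

unit_a, unit_b = l2_normalize(a), l2_normalize(b)

cosine = cosine_similarity(a, b)   # 11 / 15
unit_inner = dot(unit_a, unit_b)   # equals the cosine once both sides are unit length
mixed_inner = dot(a, unit_b)       # 11 / 3: one side left unnormalized, no longer a cosine

print(f"cosine={cosine:.4f} unit_inner={unit_inner:.4f} mixed_inner={mixed_inner:.4f}")
```

With both sides normalized, the inner product and the cosine agree. With only one side normalized, the score depends on vector length, which is the silent failure the model-change checklist warns about.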
The `normalize_embeddings=True` flag ensures unit-length vectors. Combined with inner-product indexing in pgvector, this is equivalent to cosine similarity — and inner product is faster to compute.

> ✻ Try This — The vocabulary gap, made measurable.
>
> Using sentence-transformers and bge-large-en-v1.5, embed these three strings with the query prefix:
>
> - "The parent was denied access to the child."
> - "Contact between the father and the child was not facilitated."
> - "The court scheduled a status conference for March 2024."
>
> Compute pairwise cosine similarities. The first two should score above 0.80. The third should score below 0.40 against both. If your scores are reversed, check your prefix usage.

## Normalization, distance metrics, and pgvector configuration

Three pgvector distance operators matter here: L2 (`<->`), inner product (`<#>`, which returns the negative inner product so that smaller values sort first), and cosine (`<=>`). For normalized unit-length vectors, inner product and cosine produce identical rankings. Inner product is faster because it skips the normalization step at query time.

In the correct index configuration for Meridian-Cannon, the vector column lives in the `embeddings` table, not in `chunks`. The recommended index is diskann (StreamingDiskANN via pgvectorscale), which scales beyond RAM for large collections (> 1M vectors) by using disk-backed approximate nearest-neighbor search. The HNSW index is the fallback when pgvectorscale is not available.
-- Recommended: diskann index via pgvectorscale (requires the extension).
-- StreamingDiskANN scales beyond RAM for large collections.
-- See schema/30_documents.sql for the authoritative definition.
CREATE INDEX embeddings_diskann_idx
    ON embeddings
    USING diskann (vector vector_cosine_ops);

-- Fallback: HNSW index (pgvector only, no pgvectorscale needed).
-- Use this in environments where pgvectorscale is not installed.
-- CREATE INDEX embeddings_hnsw_idx
--     ON embeddings
--     USING hnsw (vector vector_cosine_ops)
--     WITH (m = 16, ef_construction = 64);

-- Query: find the 20 chunks nearest to a query embedding for a given matter.
-- matter_id is on documents, not on chunks — the join is required.
SELECT c.id, c.text, 1 - (e.vector <=> $1::vector) AS score
FROM embeddings e
JOIN chunks c ON c.id = e.chunk_id
JOIN documents d ON d.id = c.document_id
WHERE d.matter_id = $2
AND e.model_name = 'bge-large-en-v1.5'
ORDER BY e.vector <=> $1::vector
LIMIT 20;
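The `score` column in the query above is `1 - (e.vector <=> $1::vector)`. The arithmetic behind that expression can be sketched in plain Python. The function names are hypothetical, and the input value 0.15 is just an example distance, not a measured result.

```python
import math

def distance_to_similarity(cosine_distance: float) -> float:
    """Convert a cosine distance (0 to 2) into a cosine similarity (1 to -1)."""
    if not 0.0 <= cosine_distance <= 2.0:
        raise ValueError("cosine distance must lie in [0, 2]")
    return 1.0 - cosine_distance

def similarity_to_degrees(similarity: float) -> float:
    """Angle between two vectors, in degrees, given their cosine similarity."""
    clamped = max(-1.0, min(1.0, similarity))  # guard against float round-off
    return math.degrees(math.acos(clamped))

score = distance_to_similarity(0.15)   # similarity of about 0.85
angle = similarity_to_degrees(score)   # roughly 31.8 degrees

print(f"score={score:.2f} angle={angle:.1f} degrees")
```

The conversion is a change of scale, not a calibration: the score is a geometric quantity (an angle), which is why thresholds on it need to be validated on held-out data.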
The `<=>` operator is cosine distance (lower is more similar, ranging 0–2 for normalized vectors); `1 - distance` converts it to a cosine similarity between -1 and 1, where higher is more similar. Because both the query embedding and the stored vectors are produced with `normalize_embeddings=True`, cosine distance is the correct metric. The `WHERE` clause enforcing `d.matter_id` is load-bearing: it ensures one matter's embeddings never appear in another matter's retrieval results.

> ◆ Going Deeper — Cosine similarity is not a calibrated probability.
>
> A cosine similarity of 0.85 does not mean "85% probability of relevance." It means "these vectors are separated by approximately 31.8 degrees in ℝ¹⁰²⁴." The relationship between cosine similarity and human relevance judgments is monotone but not linear. A threshold of 0.75 that works well for English legal text may perform differently on other domains or after a model change. Calibrate thresholds empirically on held-out data; never assume them from the similarity score alone.

> ◆ Going Deeper — diskann vs HNSW: when does it matter?
>
> HNSW (Hierarchical Navigable Small World) performs best when the entire graph index is held in RAM. At 1,024 dimensions, each vector uses 4 KB. A million-chunk corpus requires roughly 4 GB of RAM for the vectors alone, plus the HNSW graph structure. For many litigation-scale corpora this is manageable. For corpora exceeding several million chunks — or on hardware where RAM is constrained — HNSW's in-memory requirement becomes a bottleneck.
>
> StreamingDiskANN (via pgvectorscale) uses a disk-backed index that pages data from SSD during search. Recall is comparable to HNSW at similar parameter settings, but latency is higher for cold queries (SSD vs RAM). The trade-off is correct for large collections: scale beyond available RAM at the cost of some per-query latency. For corpora under 500k chunks, HNSW and diskann perform similarly and HNSW is available without the pgvectorscale extension.
> Upgrade to diskann when the corpus grows beyond RAM capacity.

## Matryoshka embeddings: storage without sacrifice

Matryoshka Representation Learning (MRL) trains a model such that the first 256 dimensions of a 1024-d embedding are themselves a useful 256-d embedding. This allows a coarse-to-fine retrieval strategy:

1. Stage 1: retrieve top-4k candidates using 256-d truncated embeddings (stored separately, 4× smaller).
2. Stage 2: rerank the 4k candidates using the full 1024-d embeddings.

The recall loss from truncation in Stage 1 is typically less than 2% nDCG@10 on BEIR benchmarks, while the storage savings are 4×. For a 10-million-chunk corpus, this is the difference between 40 GB and 10 GB of vector storage.

bge-large-en-v1.5 does not use MRL by default. The BGE-M3 and Nomic-embed-text-v2-MoE models do. If storage is a constraint, evaluate these models before implementing a custom MRL fine-tuning pipeline.

## Section-aware chunking

The embedding pipeline operates on chunks, not on raw documents. How you chunk determines what the embedding represents.

Naïve fixed-size chunking splits at a fixed token count (e.g., 512 tokens) regardless of document structure. A sentence split across two chunks produces one chunk that ends mid-thought and one that begins mid-thought. Neither represents a coherent unit of meaning. For evidence corpora, this is particularly damaging: a sentence like "The parent arrived at 3:00 PM; the caseworker was not present" contains two linked claims. Splitting it discards the relationship.

Section-aware chunking splits at structural boundaries:

- Each paragraph is a candidate chunk boundary.
- Each message in an email thread is its own chunk.
- Each list item is atomic.
- Each attachment is a separate chunk with a `parent_chunk_id` foreign key linking it to the email.
- PDF sections split at headings (detected by font-size change or heading-level metadata), not at page boundaries.
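Two of the boundary rules above (paragraphs and atomic list items) can be sketched for plain text as follows. This is a minimal illustration under stated assumptions: the input is plain text with blank lines between paragraphs, `Chunk` and `chunk_by_structure` are hypothetical names, and email MIME parts, attachments, `parent_chunk_id` links, and PDF heading detection are out of scope. The sample text is made up for the example.

```python
import re
from dataclasses import dataclass

@dataclass
class Chunk:
    ordinal: int  # position within the document, starting at 0
    text: str

# A line counts as a list item if it starts with a bullet or "1." / "1)" numbering.
_LIST_ITEM = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s+")

def chunk_by_structure(document: str) -> list[Chunk]:
    """Split on blank lines; a block made only of list items yields one chunk per item."""
    chunks: list[Chunk] = []
    for block in re.split(r"\n\s*\n", document.strip()):
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        if not lines:
            continue
        if all(_LIST_ITEM.match(line) for line in lines):
            units = lines                 # each list item is atomic
        else:
            units = [" ".join(lines)]     # the whole paragraph stays together
        for unit in units:
            chunks.append(Chunk(ordinal=len(chunks), text=unit))
    return chunks

sample = (
    "The parent arrived at 3:00 PM; the caseworker was not present.\n"
    "\n"
    "- Pickup missed on May 2\n"
    "- Handoff rescheduled\n"
    "\n"
    "Follow-up call logged."
)

chunks = chunk_by_structure(sample)
for chunk in chunks:
    print(chunk.ordinal, chunk.text)
```

The sentence with two linked claims stays in one chunk because the splitter never cuts inside a paragraph. A production chunker would also need a token-length check against the model's 512-token window.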
For short documents (text messages, individual emails), the entire document is one chunk. For long documents (PDFs, court filings), the chunk is the smallest coherent structural unit.

> § For the Record — FRE 1002 (Best Evidence Rule).
>
> "An original writing, recording, or photograph is required in order to prove its content unless these rules or a federal statute provides otherwise."
>
> A chunk that breaks mid-sentence across a paragraph boundary is not a faithful representation of the original's content. Section-aware chunking is the engineering discipline that keeps chunks admissible as records of the original document's meaning.

## Working example: the chunks and embeddings schema

The `schema/30_documents.sql` migration defines the actual structure the embedding pipeline writes to. The vector column does not live in `chunks`. Meridian-Cannon uses two separate tables: `chunks` holds text and structural metadata; `embeddings` holds the vectors. This separation lets multiple models coexist over time — old embeddings are never overwritten when the model changes, and a new model gets its own rows.
-- From schema/30_documents.sql (abbreviated)
CREATE TABLE chunks (
    id              uuid PRIMARY KEY DEFAULT gen_random_uuid(),
    document_id     uuid NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
    parent_chunk_id uuid REFERENCES chunks(id),
    chunker         text NOT NULL,  -- 'pdf_layout' | 'email_mime' | 'message_window' | etc.
    chunker_version text NOT NULL,
    modality        text NOT NULL,
    ordinal         int NOT NULL,   -- order within the document
    section_path    text,
    page_range      int4range,
    char_offsets    int4range,
    text            text NOT NULL,
    text_tsv        tsvector GENERATED ALWAYS AS (to_tsvector('english', text)) STORED,
    -- ... speaker_label, message_id, recording_id, metadata, pii_tier, created_at
);

CREATE TABLE embeddings (
    chunk_id      uuid NOT NULL REFERENCES chunks(id) ON DELETE CASCADE,
    model_name    text NOT NULL,
    model_version text NOT NULL,
    dim           int NOT NULL,
    vector        vector(1024) NOT NULL,  -- bge-large-en-v1.5 / bge-m3
    computed_at   timestamptz NOT NULL DEFAULT now(),
    PRIMARY KEY (chunk_id, model_name, model_version)
);

-- Recommended ANN index: diskann (pgvectorscale). Scales beyond RAM.
CREATE INDEX embeddings_diskann_idx
    ON embeddings USING diskann (vector vector_cosine_ops);

-- Fallback ANN index: hnsw (pgvector only, no pgvectorscale required).
-- CREATE INDEX embeddings_hnsw_idx
--     ON embeddings USING hnsw (vector vector_cosine_ops)
--     WITH (m = 16, ef_construction = 64);
Key architectural points: `chunks` has no `matter_id` column and no vector column. Matter-level isolation is enforced via `documents.matter_id`; chunks reach their matter through the `document_id` foreign key. The ANN index (diskann, or HNSW as the fallback) is on `embeddings.vector` using `vector_cosine_ops`, not on a hypothetical `chunks.embedding` column.
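The `vector(1024)` column in the schema above fixes the raw storage cost of each embedding row. A back-of-the-envelope sketch of the chapter's storage figures follows; the function names are hypothetical, gigabytes are decimal (10⁹ bytes), and the estimate deliberately ignores pgvector's small per-vector header, row overhead, and the ANN index itself.

```python
BYTES_PER_FLOAT32 = 4  # pgvector stores each dimension as a 4-byte float

def vector_bytes(dim: int) -> int:
    """Raw payload size of one embedding of the given dimensionality."""
    return dim * BYTES_PER_FLOAT32

def corpus_gigabytes(num_chunks: int, dim: int) -> float:
    """Raw vector storage for a corpus, in decimal gigabytes."""
    return num_chunks * vector_bytes(dim) / 1e9

per_vector = vector_bytes(1024)                    # 4096 bytes, i.e. 4 KB
one_million = corpus_gigabytes(1_000_000, 1024)    # about 4 GB
full_10m = corpus_gigabytes(10_000_000, 1024)      # about 40 GB
truncated_10m = corpus_gigabytes(10_000_000, 256)  # about 10 GB

print(f"{per_vector} B/vector, 1M chunks: {one_million:.1f} GB")
print(f"10M chunks: {full_10m:.1f} GB at 1024-d vs {truncated_10m:.1f} GB at 256-d")
```

The 4× ratio between the 1024-d and 256-d figures depends only on the dimension count, so it holds at any corpus size.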
The actual two-table join:
-- Real two-table query: chunks joined to embeddings, matter filtered via documents.
SELECT c.id, c.text, 1 - (e.vector <=> $1::vector) AS score
FROM embeddings e
JOIN chunks c ON c.id = e.chunk_id
JOIN documents d ON d.id = c.document_id
WHERE d.matter_id = $2
AND e.model_name = 'bge-large-en-v1.5'
ORDER BY e.vector <=> $1::vector
LIMIT 20;
The `<=>` operator is cosine distance (lower is more similar); `1 - distance` converts it to a cosine similarity between -1 and 1, where higher is more similar. The `WHERE d.matter_id = $2` clause is load-bearing: it enforces matter-level isolation, ensuring one matter's embeddings never appear in another matter's retrieval results. The join path is `embeddings → chunks → documents → matter_id`.

The following query returns chunks that are missing embeddings, which is useful for pipeline monitoring. Note that `matter_id` is not on `chunks` directly; the join through `documents` is required:
SELECT c.id, c.document_id, c.ordinal
FROM chunks c
JOIN documents d ON d.id = c.document_id
LEFT JOIN embeddings e ON e.chunk_id = c.id
AND e.model_name = 'bge-large-en-v1.5'
WHERE d.matter_id = $1
AND e.chunk_id IS NULL
ORDER BY c.document_id, c.ordinal;
The `LEFT JOIN … WHERE e.chunk_id IS NULL` pattern finds chunks that have no embedding row for the specified model. The `computed_at` timestamp on each `embeddings` row records when the vector was produced, creating a model-version audit trail.

> ☉ In the Wild — The Waymo v. Uber codename problem (2017).
>
> When Google's Waymo sued Uber for trade-secret theft, forensic investigators needed to find every internal document that mentioned the stolen LiDAR technology. The technology had internal project codenames at both companies. A keyword search for "LiDAR design specification" missed documents that used the codenames. The discovery process required a specialized technical vocabulary — codenames, part numbers, engineering shorthand — that keyword search could not generalize across.
>
> Dense retrieval over an embedding index can find documents about a concept even when they use different vocabulary. A query for "LiDAR sensor design" would retrieve documents using project codenames if those documents are semantically similar — if the surrounding context makes the topic clear. This is not a guarantee, but it is a structural advantage that keyword search cannot provide.

## Lab 9 — Prefix omission and the recall penalty

The lab is in `labs/ch09_embeddings/`. Build a small retrieval benchmark using 200 query-answer pairs sampled from the BEIR NFCorpus or SciFact dataset. Implement two retrievers: one with the correct query prefix, one without. Measure nDCG@10 for each. The expected gap is 3–8 points; if you see no gap, verify you are using a model that requires the prefix (bge-large-en-v1.5 does; many sentence-transformers models do not).
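For the measurement step, a minimal nDCG@k sketch in plain Python. It assumes the linear-gain formulation (gain equals the relevance grade, discounted by log2 of the rank plus one) and hypothetical names (`dcg_at_k`, `ndcg_at_k`); the toy judgments below are illustrative, not lab data. Check which variant your evaluation toolkit reports before comparing numbers.

```python
import math

def dcg_at_k(gains: list[float], k: int) -> float:
    """Discounted cumulative gain over the first k positions (rank 1 has discount log2(2) = 1)."""
    return sum(gain / math.log2(position + 2) for position, gain in enumerate(gains[:k]))

def ndcg_at_k(ranked_ids: list[str], relevance: dict[str, float], k: int = 10) -> float:
    """nDCG@k for one query: DCG of the returned ranking divided by DCG of the ideal ranking."""
    gains = [relevance.get(doc_id, 0.0) for doc_id in ranked_ids]
    ideal_gains = sorted(relevance.values(), reverse=True)
    ideal = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0

# Toy judgments: two relevant documents for one query.
judgments = {"d1": 1.0, "d2": 1.0}

perfect = ndcg_at_k(["d1", "d2", "x"], judgments)   # both relevant docs ranked first
degraded = ndcg_at_k(["d1", "x", "d2"], judgments)  # second relevant doc pushed to rank 3
missed = ndcg_at_k(["x", "y"], judgments)           # nothing relevant retrieved

print(f"perfect={perfect:.3f} degraded={degraded:.3f} missed={missed:.3f}")
```

Average the per-query values across the 200 queries for each retriever, then compare the two means to get the prefix gap.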
`embeddings` table's separate-row-per-model design preserves old vectors and requires a documented `docs/divergences.md` entry before any model swap.

`schema/30_documents.sql`. Look up what `ef_construction` controls. At what value does recall plateau for a 100k-chunk corpus?

### Core

3. Implement section-aware chunking for a five-email MIME thread. Each email is one chunk; each attachment is a separate chunk with `parent_chunk_id` set. Verify that all chunks share a `document_id` and that the parent-attachment relationship is intact.
4. Write a Postgres function that returns the embedding coverage rate (percentage of chunks with a non-null embedding) for a given `matter_id`. Alert if coverage drops below 95%.
5. Using the corrected two-table join from this chapter's working example, write a SQL query that returns the top-10 chunks most similar to the embedding of the string 'parent did not attend the scheduled visit' for a given `matter_id`. Use the `embeddings.vector <#> $1` operator. Explain why the query requires a JOIN to `documents` (not just to `chunks`) to enforce `matter_id` isolation.

### Stretch

6. bge-large-en-v1.5 has a 512-token context window. Design a chunking strategy for a 40-page court filing (a PDF with headings and numbered paragraphs) such that no chunk exceeds 400 tokens and every chunk preserves at least one complete paragraph. Estimate the number of chunks the filing would produce. Identify which PDF structural elements should be chunk boundaries.
7. Implement the Matryoshka truncation described in the chapter: take a 1024-d bge-large-en-v1.5 embedding, truncate it to 256 dimensions, and re-normalize the resulting vector. Embed ten query-passage pairs using both 1024-d and 256-d vectors. Measure the nDCG@10 gap between the two dimensionalities. Report whether the gap exceeds 2 percentage points.
8. Read the BGE-M3 paper. Implement BGE-M3's three-signal retrieval (dense, sparse, ColBERT-style late interaction) for a 5k chunk subset. Compare to bge-large-en-v1.5 + RRF on nDCG@10. Document the difference in `docs/divergences.md`.

## Build-your-own prompt

For your capstone matter: which embedding model? The decision criteria are: license (MIT, Apache 2.0, or proprietary?), language coverage (does your corpus include non-English text?), context window (512 tokens vs 8192?), MTEB retrieval performance in your domain, and inference latency on your hardware. Justify in two paragraphs. If you choose a model other than bge-large-en-v1.5, create a `docs/divergences.md` entry.

## Further reading

- MTEB leaderboard: https://huggingface.co/spaces/mteb/leaderboard — the authoritative retrieval benchmark.
- BGE-M3 paper: https://huggingface.co/BAAI/bge-m3.
- Matryoshka Representation Learning (Kusupati et al., NeurIPS 2022): https://huggingface.co/blog/matryoshka.
- pgvector HNSW documentation: https://github.com/pgvector/pgvector#hnsw.
- pgvectorscale (StreamingDiskANN): https://github.com/timescale/pgvectorscale. The diskann extension used in Meridian-Cannon's recommended index configuration.
- Nomic-embed-text-v2-MoE: https://huggingface.co/nomic-ai/nomic-embed-text-v2-moe.
- The dossier `research/03_rag_eval_and_verifiable_retrieval.md`.
Next: Chapter 10 — Hybrid Retrieval. The BM25 and dense signals, fused.