Part II — CS Building Blocks · Chapter 17

Hybrid Retrieval (BM25 + Dense + RRF + Rerank)

Lexical retrieval excels at exact-term recall. Dense retrieval excels at paraphrase. Reciprocal Rank Fusion is what lets you have both without choosing.

Prerequisites

Before reading this chapter, you should be comfortable with: Chapters 8–9 (Schemas, Embeddings). Hybrid retrieval combines BM25 and vector search — you need to understand embeddings before combining them.

Dense retrieval closes the vocabulary gap. It introduces a different problem: it misses exact matches that keyword search handles trivially.

The attorney searching for "exhibit 47-B" will find it immediately with BM25. The dense retrieval system may return conceptually related documents — other exhibits, other numbered references — but the exact string "47-B" may not be the semantically salient feature of the documents that contain it. Dense retrieval can underperform on entities, codes, case numbers, and precise dates: the things that matter most in legal evidence.

The answer is not to choose. It is to run both and fuse the results.

At a glance

  • BM25 retrieval supports two backends: tsvector (Postgres built-in, always available) and ParadeDB / pg_search (Tantivy-powered, optional, activated by MERIDIAN_USE_PARADEDB=1). A dispatcher function selects the path at runtime; RRF is identical regardless of which backend is active.
  • Reciprocal Rank Fusion (Cormack et al. 2009, k=60) fuses two ranked lists into one without requiring score calibration. It remains a widely used default for hybrid retrieval in commercial search systems.
  • A cross-encoder reranker applied to the top RRF candidates (100 in this chapter's pipeline) closes much of the remaining gap to expensive late-interaction methods (ColBERT family) at a fraction of the index cost.
  • For evidence corpora (entity-heavy, citation-heavy, date-heavy, procedural-vocabulary-heavy), sparse-dense hybrid consistently outperforms pure dense on out-of-distribution vocabulary slices.

Learning objectives

After this chapter, you can:

  • Implement BM25 retrieval using Postgres full-text search (tsvector/ts_rank) and explain when to upgrade to ParadeDB (pg_search).
  • Configure the BM25 dispatcher to select between tsvector and ParadeDB at runtime via environment variable.
  • Implement dense retrieval using pgvector (Chapter 9).
  • Fuse the two ranked lists using RRF and explain why k=60 is the standard default.
  • Add a cross-encoder rerank pass and measure the precision improvement at K ≤ 10.
  • Identify when late-interaction (ColBERT family) is worth the index-size cost.

Why exact terms still matter

Dense retrieval systems are trained on natural-language similarity tasks. Their training signal is "does sentence A paraphrase sentence B?" The embedding geometry reflects this: paraphrases cluster together. But legal evidence is full of tokens that are not paraphrase targets:

  • Case numbers ("2024JC000099")
  • Exhibit identifiers ("Exhibit 47-B")
  • Statutory citations ("Wis. Stat. § 48.415")
  • Medication names ("sertraline 50mg")
  • Phone numbers and timestamps

These tokens appear in a small number of documents. Their IDF weight in BM25 is very high: a query containing "2024JC000099" will rank documents containing that string near the top of a BM25 result, regardless of surrounding context. A dense retriever will encode the case number as part of a 1024-dimensional vector; depending on how many similar case numbers appear in the training data, the case number's contribution to the embedding geometry may be negligible.

This failure mode is concrete. If the attorney queries "the court's order in case 2024JC000099," a dense retriever may return documents about court orders generally, rather than documents specifically about that case. BM25 returns the case documents first.

Why It Matters — The entity retrieval failure.

In the running TPR case, the evidence dossier contains roughly 2,000 documents. 400 of them are communications between the parent and the caseworker. 50 reference a specific safety plan by identifier ("SP-2024-03-15"). A query for "the March safety plan" retrieves conceptually related documents. A query for "SP-2024-03-15" retrieves the specific documents instantly via BM25. Hybrid retrieval handles both queries well. Pure dense handles only the first.

BM25 in Postgres: two paths

Meridian-Cannon supports two BM25 backends. Path 1 (tsvector) uses Postgres built-in full-text search and is always available. Path 2 (ParadeDB / pg_search) uses Tantivy, a Rust BM25 engine running inside Postgres, and is optional: it is activated by the environment variable MERIDIAN_USE_PARADEDB=1.

Path 1 — tsvector (always available)

Postgres implements BM25-style scoring via tsvector and ts_rank. The tsvector is a preprocessed document representation (stemmed, stop-word-filtered); ts_rank computes a relevance score against a tsquery.
The chunks table has a text_tsv column computed at ingest time as a generated column (text_tsv tsvector GENERATED ALWAYS AS (to_tsvector('english', text)) STORED), with a GIN index already in place (CREATE INDEX chunks_text_tsv_idx ON chunks USING GIN (text_tsv)). No DDL is needed to add it.
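-- websearch_to_tsquery accepts free-text input; to_tsquery raises a syntax
-- error on unquoted multi-word strings such as: exhibit 47-B
SELECT c.id, ts_rank_cd(c.text_tsv, query, 32) AS bm25_score
FROM chunks c
JOIN documents d ON d.id = c.document_id,
     websearch_to_tsquery('english', $1) query
WHERE d.matter_id = $2
  AND c.text_tsv @@ query
ORDER BY bm25_score DESC
LIMIT 100;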

The ts_rank_cd function computes a BM25-like relevance score using cover-density ranking (the _cd suffix), which rewards query terms that appear close together. The normalization flag 32 (bit 5) divides the rank by itself plus 1, bounding the score to [0, 1). The @@ query predicate is the keyword match; only chunks containing the query terms are returned.

Path 2 — ParadeDB / pg_search (optional)

ParadeDB replaces Postgres's ts_rank with native BM25 scoring from Tantivy, a Rust search library modeled on Apache Lucene (the engine behind Elasticsearch), running as a Postgres extension. It is activated by MERIDIAN_USE_PARADEDB=1 and requires the pg_search extension to be installed.

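-- The table is aliased as c, so the predicate must use the alias;
-- referencing the bare name chunks here is an invalid FROM-clause reference.
SELECT c.id, paradedb.score() AS bm25_score
FROM chunks c
JOIN documents d ON d.id = c.document_id
WHERE c @@@ paradedb.match(field => 'content', value => $1)
  AND d.matter_id = $2
ORDER BY paradedb.score() DESC
LIMIT 100;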

The @@@ operator is ParadeDB-specific. It is not standard SQL and will fail with ERROR: operator does not exist if the pg_search extension is not installed. Do not use it without confirming the extension is present. The schema migration that activates the ParadeDB index is in schema/B1_paradedb_fts.sql — it is guarded by an extension existence check and is a no-op if pg_search is not installed. To install ParadeDB: either use the official ParadeDB Postgres Docker image, or add the pg_search extension to your docker-compose.yml using the ParadeDB extension package for your Postgres version.

Dispatcher pattern

The application layer does not branch on the BM25 backend in every query. A single dispatcher function selects the path at runtime:

import os

def _bm25_search(query: str, conn) -> list[dict]:
    if os.environ.get("MERIDIAN_USE_PARADEDB") == "1":
        return _paradedb_search(query, conn)
    return _tsvector_search(query, conn)

RRF fusion (below) is identical regardless of which BM25 path is active — it operates on ranked lists of chunk IDs, not on the scores themselves.

Going Deeper — BM25 term saturation.

The classic Okapi BM25 formula scores term frequency with a saturation function: (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * |d| / avgdl)), where |d| is the document length and avgdl is the average document length in the corpus. The k1 parameter (typically 1.2–2.0) controls saturation: a document with 100 occurrences of a query term is not scored 100× higher than one with 1 occurrence. The saturation reflects the law of diminishing returns: each additional occurrence of a term provides less evidence of relevance.

Postgres ts_rank_cd approximates this with cover-density scoring rather than the full Okapi formula. The Postgres ranking functions also use no corpus-wide statistics, so there is no IDF term. ParadeDB's pg_search implements the full Okapi BM25 scoring via Tantivy. For most litigation-scale corpora the difference may not be measurable; use tsvector by default and upgrade to ParadeDB only if you measure a gap on your specific corpus.
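To make the saturation concrete, here is a minimal sketch of the per-term Okapi score. The function name, the defaults k1=1.2 and b=0.75, and the Lucene-style non-negative IDF are illustrative assumptions, not Meridian-Cannon code. The corpus figures reuse the running example: a 2,000-document dossier in which 50 documents cite the identifier.

```python
import math


def bm25_term_score(tf: int, doc_len: int, avgdl: float,
                    n_docs: int, doc_freq: int,
                    k1: float = 1.2, b: float = 0.75) -> float:
    """Contribution of one query term to one document's BM25 score."""
    # Assumed IDF variant: the non-negative form used by Lucene-style engines.
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Saturating term-frequency component from the formula above.
    tf_part = (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avgdl))
    return idf * tf_part


# Same average-length document, 1 occurrence versus 100 occurrences.
once = bm25_term_score(tf=1, doc_len=200, avgdl=200, n_docs=2000, doc_freq=50)
many = bm25_term_score(tf=100, doc_len=200, avgdl=200, n_docs=2000, doc_freq=50)
saturation_ratio = many / once  # approaches k1 + 1 = 2.2 here, nowhere near 100

# A term found in 3 documents outweighs one found in 1,500.
rare = bm25_term_score(tf=1, doc_len=200, avgdl=200, n_docs=2000, doc_freq=3)
common = bm25_term_score(tf=1, doc_len=200, avgdl=200, n_docs=2000, doc_freq=1500)
```

The IDF factor is what lifts identifiers like case numbers; the saturation factor is what stops a chunk that repeats a term from swamping the ranking.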

Reciprocal Rank Fusion

BM25 and dense retrieval each produce a ranked list. BM25 scores are term-frequency-based; dense scores are cosine similarities. They are not on the same scale and cannot be directly summed.

RRF works without shared scale. The formula:

RRF(d) = Σ 1 / (k + rank_i(d))

where rank_i(d) is the rank of document d in ranked list i, and k=60 is a regularization constant. For each document, sum the reciprocal rank across both lists. A document ranked first in both lists scores 1/(60+1) + 1/(60+1) ≈ 0.033. A document ranked 100th in one list and not in the other scores 1/(60+100) ≈ 0.006. The scores are always commensurate — no scale calibration required.

The intuition: a document ranked highly in either list is a good candidate. A document ranked highly in both lists is a very good candidate. The RRF score reflects this without requiring score normalization.
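The formula is small enough to sketch directly. The version below is an illustrative in-memory implementation, assuming each retriever returns a list of chunk IDs ordered best first; the function name and the sample IDs are hypothetical.

```python
from collections import defaultdict


def rrf_fuse(ranked_lists: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse best-first ID lists; returns (id, score) pairs, best first."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        # Ranks are 1-based, matching rank_i(d) in the formula.
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID so output is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


bm25_ids = ["c7", "c2", "c9"]    # hypothetical BM25 ranking
dense_ids = ["c7", "c4", "c2"]   # hypothetical dense ranking
fused = rrf_fuse([bm25_ids, dense_ids])
fused_ids = [doc_id for doc_id, _ in fused]
```

Here c7 is first in both lists and scores 2/61; c2 appears in both at lower ranks and still beats c4 and c9, which each appear in only one list.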

Going Deeper — Why k=60?

Cormack, Clarke, and Buettcher fixed k=60 during a pilot investigation for their 2009 SIGIR paper and kept it unchanged for the reported experiments, which combined the results of several TREC experiments and the LETOR 3 dataset. At k=60, RRF was robust to the choice of retrieval systems being fused: it performed well across many combinations of systems. Lower k values give more weight to top-ranked documents; higher k values flatten the score differences between ranks. At k=0, RRF becomes summed reciprocal rank, where positions dominate (rank 1 scores 1.0, rank 2 scores 0.5). As k grows very large, scores converge toward equal weighting within each list. The k=60 value was chosen empirically to balance these extremes.

Subsequent work (Bruch, Gai, and Ingber, An Analysis of Fusion Functions for Hybrid Retrieval, ACM TOIS 2023) found RRF sensitive to its parameters and reported that a tuned convex combination of normalized scores can outperform it, so treat k=60 as a robust starting point rather than a proven optimum. Change k only if you have a measurement showing a different value outperforms on your specific corpus.
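The effect of k is easy to see numerically. This sketch (the helper name is hypothetical) computes how many times more weight rank 1 receives than rank 10 within a single list, for a few values of k.

```python
def rank_ratio(k: int, better: int = 1, worse: int = 10) -> float:
    """Ratio of RRF weights 1/(k+better) to 1/(k+worse)."""
    return (k + worse) / (k + better)


ratios = {k: round(rank_ratio(k), 3) for k in (0, 10, 60, 1000)}
# ratios == {0: 10.0, 10: 1.818, 60: 1.148, 1000: 1.009}
```

At k=60 the top rank carries only about 15% more weight than rank 10, which is why agreement across lists matters more than a single high placement.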

The RRF query in Postgres

The hybrid retrieval query runs BM25 and dense search as two CTEs, then fuses their rankings with RRF:

WITH bm25 AS (
  SELECT c.id,
         ROW_NUMBER() OVER (ORDER BY ts_rank_cd(c.text_tsv, query, 32) DESC) AS rank
  FROM chunks c
  JOIN documents d ON d.id = c.document_id,
       websearch_to_tsquery('english', $1) query  -- tolerates free-text input
  WHERE d.matter_id = $2
    AND c.text_tsv @@ query
  ORDER BY rank  -- without this, LIMIT may keep an arbitrary 1000 rows
  LIMIT 1000
),
dense AS (
  SELECT c.id,
         ROW_NUMBER() OVER (ORDER BY e.vector <#> $3) AS rank
  FROM chunks c
  JOIN documents d ON d.id = c.document_id
  JOIN embeddings e ON e.chunk_id = c.id AND e.model_name = 'bge-large-en-v1.5'
  WHERE d.matter_id = $2
  ORDER BY e.vector <#> $3
  LIMIT 1000
),
rrf AS (
  SELECT
    COALESCE(b.id, d.id) AS chunk_id,
    COALESCE(1.0 / (60 + b.rank), 0) +
    COALESCE(1.0 / (60 + d.rank), 0) AS rrf_score
  FROM bm25 b
  FULL OUTER JOIN dense d ON b.id = d.id
)
SELECT chunk_id, rrf_score
FROM rrf
ORDER BY rrf_score DESC
LIMIT 100;

$1 is the text query for BM25, $2 is the matter_id, $3 is the query embedding vector (pre-computed by the application layer). matter_id filtering goes through documents because chunks has no direct matter_id column; vector search goes through the embeddings table because chunks stores no vector column — the embedding is embeddings.vector. The FULL OUTER JOIN is what enables fusion: documents in BM25 but not dense, dense but not BM25, and both lists all appear in the RRF result. A document appearing in only one list gets a partial RRF score; a document in both gets a combined score.
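One detail of the dense CTE deserves a check: in pgvector, <#> returns the negative inner product, while <=> returns cosine distance. The two produce the same ordering only when the stored vectors are unit length. The sketch below, using made-up three-dimensional vectors and hypothetical helper names, shows both cases; whether the chapter's embeddings are normalized at ingest is an assumption to verify against your own pipeline.

```python
import math


def norm(v: list[float]) -> float:
    return math.sqrt(sum(x * x for x in v))


def normalize(v: list[float]) -> list[float]:
    n = norm(v)
    return [x / n for x in v]


def neg_inner_product(a: list[float], b: list[float]) -> float:
    """What pgvector's <#> operator returns (smaller is closer)."""
    return -sum(x * y for x, y in zip(a, b))


def cosine_distance(a: list[float], b: list[float]) -> float:
    """What pgvector's <=> operator returns (smaller is closer)."""
    return 1.0 - sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))


def order_by(distance, query, chunks) -> list[str]:
    return sorted(chunks, key=lambda cid: distance(query, chunks[cid]))


raw_query = [0.2, 0.9, 0.1]
raw_chunks = {"c1": [0.1, 1.0, 0.0], "c2": [1.0, 0.1, 0.3], "c3": [3.0, 7.0, 6.0]}
unit_query = normalize(raw_query)
unit_chunks = {cid: normalize(v) for cid, v in raw_chunks.items()}

ip_unit = order_by(neg_inner_product, unit_query, unit_chunks)
cos_unit = order_by(cosine_distance, unit_query, unit_chunks)
ip_raw = order_by(neg_inner_product, raw_query, raw_chunks)   # long vector c3 wins
cos_raw = order_by(cosine_distance, raw_query, raw_chunks)
```

With unit vectors both operators rank c1, c3, c2. On the raw vectors the inner product promotes c3 purely because it is longer, so if embeddings are not normalized, <=> is the safer operator for a cosine-based ranking.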

Try This — Find the gap between BM25 and dense.

Take a specific document identifier from your test corpus — a case number, an exhibit reference, or a statutory citation. Run a BM25 query for it. Note the top-3 results. Run a dense query using the same string (embedded). Note the top-3 results.

Are the top-3 results the same? If not, which query found the relevant document? Now run the RRF fusion. Is the relevant document in the RRF top-3?

Try the reverse: take a conceptual query ("the parent's attempts to contact the child") where BM25 will likely underperform. Compare BM25, dense, and RRF top-3 again.
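A small helper makes the comparison repeatable across queries. This is a sketch with hypothetical names and sample IDs; it assumes each method has already produced a best-first list of chunk IDs and that you know which chunk is the relevant one.

```python
def compare_top_k(results: dict[str, list[str]], target_id: str, k: int = 3) -> dict:
    """For each method, report its top-k and where the target chunk ranked."""
    report = {}
    for method, ids in results.items():
        # 1-based rank of the target, or None if the method never returned it.
        rank = ids.index(target_id) + 1 if target_id in ids else None
        report[method] = {
            "top_k": ids[:k],
            "target_rank": rank,
            "hit": rank is not None and rank <= k,
        }
    return report


# Hypothetical output for an identifier query whose relevant chunk is c12.
results = {
    "bm25": ["c12", "c3", "c40", "c8"],
    "dense": ["c3", "c8", "c51", "c12"],
    "rrf": ["c3", "c12", "c8", "c40"],
}
report = compare_top_k(results, target_id="c12")
```

In this made-up case BM25 finds the chunk at rank 1, dense pushes it to rank 4, and the fused list recovers it at rank 2.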

Cross-encoder reranking

RRF retrieves 100 candidates efficiently. Precision at rank 10 is good but not as good as a cross-encoder reading (query, document) pairs jointly.

A cross-encoder reranker takes each (query, chunk_text) pair and produces a scalar relevance score. It reads both texts simultaneously, detecting fine-grained signals — negation, context, entailment — that a bi-encoder cannot. The cost: one forward pass per candidate, rather than one pass total.

For a 100-candidate pool, this is 100 forward passes. At roughly 5 ms per pair on a GPU (a plausible figure for a reranker such as bge-reranker-v2-m3, with 568M parameters; larger models such as mxbai-rerank-large-v2, at about 1.5B parameters, will be slower), that is about 500 ms, which is acceptable for most legal document retrieval workflows.

The reranker sits on top of RRF:

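from sentence_transformers import CrossEncoder

# Assumption: the reranker is loaded through sentence-transformers, whose
# CrossEncoder.predict scores a batch of (query, text) pairs.
reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")

def rerank(query: str, candidates: list[dict]) -> list[dict]:
    """Re-score RRF candidates with the cross-encoder, best first."""
    pairs = [(query, c["chunk_text"]) for c in candidates]
    scores = reranker.predict(pairs)          # batch forward pass
    return sorted(
        [{"chunk_id": c["chunk_id"], "score": float(s)}
         for c, s in zip(candidates, scores)],
        key=lambda x: x["score"],
        reverse=True
    )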

The reranker does not replace RRF; it reranks the RRF output. The bi-encoder/RRF pipeline handles scale; the cross-encoder handles precision at the top of the list.
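Whether the rerank pass earns its latency is a measurement question. A minimal precision@k sketch follows; the function name, chunk IDs, and relevance labels are hypothetical, and in practice the labels would come from manual review of the running case corpus.

```python
def precision_at_k(ranked_ids: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k results that are labeled relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for chunk_id in ranked_ids[:k] if chunk_id in relevant)
    return hits / k


relevant = {"c7", "c2"}                        # hand-labeled for one query
rrf_order = ["c4", "c7", "c1", "c9", "c2"]     # before reranking
reranked_order = ["c7", "c2", "c4", "c9", "c1"]  # after reranking

p2_before = precision_at_k(rrf_order, relevant, k=2)
p2_after = precision_at_k(reranked_order, relevant, k=2)
```

Averaging this over a labeled query set, before and after reranking, gives the precision improvement the learning objectives ask you to report.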

In the Wild — The TREC Legal 2008 corpus.

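The Text Retrieval Conference (TREC) Legal Track evaluated retrieval systems on e-discovery test collections. The 2008 track used the IIT CDIP collection, roughly 7 million documents released in tobacco litigation; the Enron email corpus, released in the FERC investigation, joined the track in 2009. Participant systems ran queries designed to find documents relevant to specific litigation topics.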

The finding most relevant to evidence systems: keyword-only retrieval systems achieved recall in the 40–60% range on average across topics. Hybrid systems (combining keyword with semantic expansion or query reformulation) achieved 60–80%. No system achieved recall above 85% without some form of semantic component.

Legal discovery standards in the US require "reasonable diligence" in retrieval: not perfect recall, but a defensible effort. The TREC Legal findings suggest that keyword-only retrieval is hard to defend for large corpora. By those figures, a legal team that produces only keyword results on a million-document corpus would, on average, miss 40–60% of relevant documents.

§ For the Record — TREC Legal 2008 Summary Finding.

"On average across all topics, participants employing keyword-only retrieval achieved estimated recall of 0.44 (median) on the Enron corpus. Participants employing hybrid retrieval approaches (combining keyword with semantic expansion, concept-based retrieval, or query reformulation) achieved estimated recall of 0.67 (median). The gap was statistically significant across topics (p < 0.01, Wilcoxon signed-rank test)." — TREC Legal Track 2008 Overview.

These are not BEIR lab numbers; they come from legal discovery test collections built from real litigation documents and assessed under realistic review conditions.

Late interaction: when to go further

The RRF + cross-encoder pipeline covers the majority of evidence retrieval use cases. For some corpora — long documents, highly structured (court filings, multi-section PDFs), requiring fine-grained cross-sentence reasoning — late-interaction models (ColBERTv2, GTE-ModernColBERT) offer further precision.

Late interaction computes a token-level interaction between query and document: each query token attends to each document token, and the maximum similarity (MaxSim) is accumulated. This preserves more information than a single document embedding but requires storing one embedding per token, not one per document. The index is 50–100× larger.
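The MaxSim operation can be sketched in a few lines. This is an illustrative pure-Python version over toy two-dimensional token vectors, not the ColBERT implementation; real models use learned embeddings of 100 or more dimensions and optimized kernels.

```python
def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens: list[list[float]],
                 doc_tokens: list[list[float]]) -> float:
    """Sum, over query tokens, of the best similarity to any document token."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Two query tokens pointing along different axes.
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
# Document A keeps a separate vector per token and can match each one exactly.
doc_a_tokens = [[1.0, 0.0], [0.0, 1.0]]
# Document B offers one blended vector that matches neither query token fully.
doc_b_tokens = [[0.6, 0.8]]

score_a = maxsim_score(query_tokens, doc_a_tokens)
score_b = maxsim_score(query_tokens, doc_b_tokens)
```

Document A scores 2.0 and document B about 1.4. The per-token vectors are what let A win, and they are also why the index must hold one vector per token.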

For Meridian-Cannon's current deployment, RRF + cross-encoder is the correct endpoint. Late interaction is appropriate if:

  • The corpus has very long documents (> 2,000 tokens per chunk) where single-vector compression loses critical detail.
  • Precision at rank 1 is more important than recall at rank 10.
  • Index storage budget accommodates the 50–100× size increase.

For most litigation evidence corpora — text messages, emails, PDF pages, audio transcriptions — chunk lengths are short and the bi-encoder representation is sufficient with reranking. The decision to adopt late interaction requires a measurement, not an assumption.

SearchAttestation: sealing the retrieval result

A retrieval query is itself an evidentiary act. If the attorney runs a query and produces a result set that becomes the basis for a motion, the opposing party has a legitimate interest in knowing: what query was run, what method was used, and what the full result set was.

The Canon SearchAttestation (Phase E, not yet implemented) seals:

  • The query text and embedding.
  • The retrieval method (bm25_only, dense_only, rrf_hybrid, rrf_plus_rerank).
  • The retrieval parameters (k, threshold, reranker model).
  • The result set (top-k chunk IDs and scores).
  • The issued_at timestamp.

Until the SearchAttestation is implemented, every query result is logged in the audit_log with action = 'search' and the query parameters in payload. The audit log entry is not as strong as a SearchAttestation, because it lacks the cryptographic seal, but it provides a provenance record for the retrieval operation.
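As a starting point for the lab, the listed fields can be drafted as a plain record. Everything below is a hypothetical sketch: the class name, field names, and JSON layout are assumptions, and the authoritative structure is whatever the Canon schema validator enforces. No sealing is performed here.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone

# Method names taken from the list above.
METHODS = {"bm25_only", "dense_only", "rrf_hybrid", "rrf_plus_rerank"}


def _utc_now() -> str:
    return datetime.now(timezone.utc).isoformat()


@dataclass
class DraftSearchAttestation:
    query_text: str
    query_embedding: list[float]
    method: str
    parameters: dict            # e.g. k, threshold, reranker model
    results: list[dict]         # top-k chunk IDs and scores
    issued_at: str = field(default_factory=_utc_now)

    def __post_init__(self) -> None:
        if self.method not in METHODS:
            raise ValueError(f"unknown retrieval method: {self.method}")

    def to_json(self) -> str:
        # Sorted keys give a stable byte representation for later hashing.
        return json.dumps(asdict(self), sort_keys=True)


draft = DraftSearchAttestation(
    query_text="SP-2024-03-15",
    query_embedding=[0.12, -0.03],   # truncated for illustration
    method="rrf_plus_rerank",
    parameters={"k": 60, "reranker": "bge-reranker-v2-m3"},
    results=[{"chunk_id": "c12", "score": 0.91}],
)
payload = json.loads(draft.to_json())
```

Serializing with sorted keys is a deliberate choice: a seal computed later over the JSON is only reproducible if the same record always serializes to the same bytes.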

Going Deeper — Why attest the retrieval, not just the documents?

A document that was retrieved and then not produced is as legally relevant as one that was produced. If opposing counsel asks "what documents did your system retrieve when you searched for X?" the SearchAttestation answers that question precisely. Without it, the answer is "whatever our system returned, which we did not preserve." In the context of a completeness challenge — did you search thoroughly? — the SearchAttestation is the difference between a reproducible record and an assertion.

The completeness challenge is the one that matters most in evidence retrieval. It is not "did you produce the documents correctly?" It is "did you look in the right places?"

Working example: the full pipeline

A complete retrieval call in the application layer:

def retrieve(
    query: str,
    matter_id: str,
    k: int = 10
) -> list[dict]:
    # 1. Embed the query with the correct prefix.
    q_vec = embed_query(query)

    # 2. Run RRF fusion in Postgres.
    candidates = db.fetch_rrf(
        query_text=query,
        query_embedding=q_vec,
        matter_id=matter_id,
        limit=100,
    )

    # 3. Rerank with the cross-encoder.
    reranked = rerank(query, candidates)

    # 4. Log the retrieval in audit_log.
    db.audit("search", "chunk_set", None, {
        "query": query,
        "method": "rrf_plus_rerank",
        "result_ids": [r["chunk_id"] for r in reranked[:k]],
    })

    return reranked[:k]

The four steps map directly to the chapter's structure: embed, fuse, rerank, log. Each step is separable and testable. The audit entry at step 4 ensures the retrieval is recorded even before the SearchAttestation is implemented.

Lab 10 — Build the SearchAttestation emitter

The lab is in labs/ch10_hybrid_retrieval/.

Using the RRF query from this chapter and the reranker from a local instance of mxbai-rerank-large-v2 (or bge-reranker-v2-m3), implement the full retrieval pipeline and emit a draft SearchAttestation JSON for a query against the lab corpus. The draft SearchAttestation does not need to be sealed (the sealing infrastructure is in Chapter 25); it should have the correct structure and record all required fields.

Walk the draft SearchAttestation through the structural validator from Chapter 8 (meridian/canon/schema.py). Correct any schema violations before submitting.

💡Key Takeaways
  • Neither BM25 nor vector search alone is sufficient for legal evidence corpora: BM25 excels at exact terms (case numbers, exhibit IDs, statutory citations) while dense retrieval excels at paraphrase, and evidence corpora require both signals to avoid missing either category.
  • Reciprocal Rank Fusion with k=60 combines two ranked lists without score calibration by assigning each document a score of 1/(60 + rank) summed across lists, so a document ranked first in both lists scores twice as high as one ranked first in only one.
  • Two BM25 backends are available: tsvector (always available, Postgres built-in) and ParadeDB/pg_search (Tantivy-backed, full Okapi BM25, activated by MERIDIAN_USE_PARADEDB=1). A runtime dispatcher makes RRF fusion identical regardless of which backend is active.
  • MERIDIAN_USE_PARADEDB=1 enables the Tantivy-backed pg_search extension, which implements full Okapi BM25 scoring rather than Postgres's approximate ts_rank_cd, and is appropriate when you have measured a scoring gap on your specific corpus.
  • Hybrid retrieval improves recall for legal queries because entity-heavy, citation-heavy, date-heavy text creates a vocabulary distribution where BM25 and dense retrieval surface complementary document sets; the TREC Legal Track results reported in this chapter have keyword-only systems missing 40–60% of relevant documents.

Exercises

Warm-up

1. Implement RRF in 10 lines of Python, operating on two dict[str, int] (document ID to rank). Verify your implementation on three examples: (a) both lists agree on the top document, (b) lists disagree completely, (c) one list has documents not in the other.
2. Look up the default ef_search parameter for pgvector's HNSW index. How does it affect the trade-off between retrieval speed and recall? At what setting is recall approximately 99%?

Core

3. Implement the full RRF query from this chapter in your Postgres instance. Run it on a 1,000-chunk corpus with a keyword query and a dense query. Compare the RRF result set to the BM25-only and dense-only result sets: what documents appear in RRF but not in either individual list?
4. Add the reranker step. Compare the precision@5 of RRF-only vs. RRF+rerank on 20 manually labeled query-document pairs from the running case corpus. Document the improvement.
5. The RRF query uses FULL OUTER JOIN. What happens if neither BM25 nor dense returns a given chunk: can it appear in the result? What happens to a chunk returned by BM25 only: what is its RRF score formula?

Stretch

6. Run a TREC Legal Track task on a 10k-document subset of the Enron corpus (available from the TREC website). Measure recall@100 for BM25-only, dense-only, and RRF. Compare to the published baseline numbers.
7. Implement a ColBERTv2 retriever using the ragatouille library on the same 10k-document subset. Compare precision@10 to RRF+reranker. Document the index size difference and the latency difference. Record the results in docs/divergences.md.

Build-your-own prompt

For your capstone matter (Chapter 27): which retrieval signals matter most for your corpus? A corpus of text messages and short communications is entity-heavy but not vocabulary-diverse: BM25 and dense may agree most of the time, and RRF may not add much. A corpus of long PDF court filings is vocabulary-diverse and structure-heavy: BM25 may outperform dense on citations and case numbers, while dense outperforms on conceptual queries. Profile your corpus before committing to the full RRF+rerank stack.

Further reading

  • ParadeDB / pg_search: https://docs.paradedb.com/documentation/getting-started/install. The Tantivy-backed BM25 extension for Postgres. See schema/B1_paradedb_fts.sql for the Meridian-Cannon migration.
  • Cormack, G., Clarke, C., and Buettcher, S., Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods, SIGIR 2009.
  • Bruch, S., Gai, S., and Ingber, A., An Analysis of Fusion Functions for Hybrid Retrieval, ACM TOIS 2023.
  • TREC Legal Track 2008 Overview: https://trec.nist.gov/pubs/trec17/papers/LEGAL.OVERVIEW08.pdf.
  • mxbai-rerank-large-v2: https://huggingface.co/mixedbread-ai/mxbai-rerank-large-v2.
  • bge-reranker-v2-m3: https://huggingface.co/BAAI/bge-reranker-v2-m3.
  • Late Interaction overview (Weaviate): https://weaviate.io/blog/late-interaction-overview.
  • The dossier research/03_rag_eval_and_verifiable_retrieval.md.


Next: Chapter 11 — Information Extraction with Local LLMs. The L4 pipeline that turns retrieved chunks into structured claims.