PostgreSQL as Evidence Substrate

The question is not whether to use a database. The question is whether the database is a component or the substrate — a layer everything else sits on, or one more service among equals. Meridian-Cannon treats Postgres as substrate. The distinction matters more than any single feature.

ℹPrerequisites▼

Before reading this chapter, you should be comfortable with: Chapter 13 (Seven-Layer Pipeline). The Postgres schema maps directly to pipeline layers; understanding the layers makes the schema obvious.

Every design decision in this chapter follows from one commitment: evidence integrity cannot be delegated to the application layer. If the only thing preventing a partial write from corrupting the hash chain is application code that remembers to call commit() correctly, the hash chain is fragile. If the only thing preventing an unauthorized actor from reading a privileged document is middleware that checks a JWT before executing a query, the access control is bypassable. The guarantees must live in the layer that everything else touches, not in any single piece of code that might be bypassed, misconfigured, or upgraded in ways that introduce regressions. PostgreSQL provides the exact combination of primitives an evidence system needs. This chapter explains each one, grounds it in the Meridian-Cannon schema, and names where it falls short. ## At a glance - ACID semantics are load-bearing for chain-of-custody: a pipeline step that writes to three tables must commit all three or none, or the audit trail becomes incompletable and exploitable in court. - pgvector and Row-Level Security coexist in the same database so that RBAC is enforced at query time on every vector search — colocation is not a performance trade-off, it is an access-control guarantee. - PostGIS is provisioned in 00_setup.sql but has no dedicated spatial evidence table yet; the extension is ready for location-tagged evidence without a platform change. ## Learning objectives By the end of this chapter you should be able to: 1. Explain how each ACID property — Atomicity, Consistency, Isolation, Durability — applies to a specific pipeline write in Meridian-Cannon, with a concrete example for each. 2. Configure the embeddings HNSW index parameters (m, ef_construction) and explain the recall-versus-speed trade-off they control. 3. Trace how an RLS policy evaluation uses pii_tier_rank() to determine whether a given actor can read a given chunk, stepping through the SQL comparison. 4. Describe what pii_tier_rank() returns for an unrecognized tier string and explain the security implication for RLS policies that rely on it. ## Why Postgres, Not a Document Database The first alternative that gets proposed is a document database — MongoDB, Couchbase, or Elasticsearch. The argument for it is that evidence data is heterogeneous: emails have different fields than financial records, which have different fields than audio transcripts. A document database handles schema variation without migrations. The argument is correct about the heterogeneity. It is wrong about the implication. The heterogeneous fields go in JSONB columns. The chunks.metadata column, the parties.metadata column, the acquisitions.request_meta and response_meta columns — all JSONB. A new source type that has twenty additional structured fields adds them to JSONB without a migration. The schema is as flexible as a document database for per-type variation. What Postgres has that document databases do not: ACID guarantees at the level of multi-table transactions. Consider what happens when a text message is extracted at L2. The correct outcome is: one row in messages, one row in chunks, one row in audit_log, all committed together. If the application crashes after writing messages but before writing audit_log, the audit trail is incomplete. The document database has no mechanism to prevent this: it can provide per-document atomicity, but not cross-collection atomicity. Postgres does: wrap the three writes in one transaction, and the database guarantees that either all three commit or none do. For an evidence system where the audit trail is the integrity guarantee, cross-table transactional atomicity is not optional. > ▼ Why It Matters — What a partial write looks like in court. > > In the TPR proceeding, the parent's attorney subpoenas the evidence system's records. The production includes the messages table and the audit_log. The opposing party's expert compares the two and finds that thirty-seven messages have no corresponding audit log entry — the application crashed during a batch import and wrote the messages but not their audit rows. > > Those thirty-seven messages are now suspect. The opposing expert argues: if the audit trail is incomplete, how do we know the messages are authentic? The attorney cannot answer, because the audit trail is the authenticity mechanism. The messages are not necessarily inauthentic — but the integrity guarantee has been broken, and in a proceeding where authenticity is contested, a broken integrity guarantee is an exploitable gap. > > PostgreSQL's transaction semantics prevent this gap from arising. Either all writes in a pipeline step commit, or none do. The application that crashes mid-batch leaves a clean state: restarting the batch re-processes only the source records with no downstream artifacts, not a mix of partially processed records and complete ones. ## ACID Guarantees in the Pipeline Context ACID stands for Atomicity, Consistency, Isolation, Durability. All four matter for Meridian-Cannon, but they matter in different ways. Atomicity is the multi-table commitment guarantee described above. One pipeline step = one transaction = all-or-nothing. Consistency is what prevents the hash chain from entering a state that violates the invariants defined in the schema. The audit_log.hash column is NOT NULL. The documents.sha256 column has a CHECK (length(sha256) = 64) constraint. The chunks.pii_tier column has a CHECK against a defined set of values. These constraints fire at the database layer, not in application code. Application code that produces an invalid hash length or an undefined PII tier gets a constraint violation error, not a silent write. Isolation prevents two workers running concurrently from producing inconsistent results by reading each other's uncommitted writes. The claim_next function uses SELECT FOR UPDATE SKIP LOCKED: one worker claims one job, and other workers skip locked rows rather than waiting. The result is that concurrent workers operate on distinct jobs with no interference. Durability means that a committed write survives a crash. Once Postgres returns COMMIT, the data is on disk. A worker that crashes after receiving a COMMIT confirmation can restart knowing that the row exists. A worker that does not receive a COMMIT confirmation should treat the write as uncertain and retry. > ◆ Going Deeper — Why SELECT FOR UPDATE SKIP LOCKED, not LISTEN/NOTIFY. > > Postgres has a built-in pub/sub mechanism: LISTEN and NOTIFY. A producer writes a job and calls NOTIFY channel. A consumer waits on LISTEN channel. When notified, it claims the next job. > > The problem with LISTEN/NOTIFY for a job queue is delivery semantics. NOTIFY is a best-effort signal: if no consumer is listening when the notification fires, the notification is lost. A consumer that crashed and reconnected will not receive the notification for jobs that were enqueued during the outage. > > SELECT FOR UPDATE SKIP LOCKED has different semantics: it is a polling mechanism, not a push mechanism. A worker polls on a short interval, claims any available job of its kind, and processes it. There is no notification to miss. A worker that reconnects after a crash polls, finds the jobs that were enqueued during its absence, and processes them. No job is lost. > > The trade-off is latency: a notification-based system processes a job within milliseconds of enqueue; a polling-based system processes it within one poll interval (configurable; 1–5 seconds is typical). For an evidence ingestion pipeline, that latency is immaterial. Use SKIP LOCKED for reliability; use LISTEN/NOTIFY only if you have a real-time constraint that polling cannot meet. ## pgvector: Embeddings in the Same Transaction The embeddings table stores 1024-dimensional vectors alongside the chunk they describe, in the same Postgres database, in the same transaction as the chunk insert.

-- 30_documents.sql
CREATE TABLE embeddings (
  chunk_id            uuid NOT NULL REFERENCES chunks(id) ON DELETE CASCADE,
  model_name          text NOT NULL,
  model_version       text NOT NULL,
  dim                 int NOT NULL,
  vector              vector(1024) NOT NULL,
  computed_at         timestamptz NOT NULL DEFAULT now(),
  PRIMARY KEY (chunk_id, model_name, model_version)
);

CREATE INDEX embeddings_hnsw_idx
  ON embeddings USING hnsw (vector vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);

This design makes a specific trade-off. A dedicated vector database — Pinecone, Weaviate, Qdrant — would offer better throughput at scale, a managed index, and potentially lower query latency at hundreds of millions of vectors. Meridian-Cannon does not have hundreds of millions of vectors. A single litigation matter generates on the order of tens to hundreds of thousands of chunks. At that scale, pgvector's HNSW index returns approximate nearest neighbors with recall above 0.95 and query times in the single-digit milliseconds. That is adequate for interactive retrieval.

The decisive argument for colocating vectors in Postgres is not performance. It is the RBAC model.

The embeddings table inherits the row-level security policies defined in schema/99_rls.sql. A query for "visitation request" that uses the embedding index returns only the chunks the current actor is authorized to read. The mechanism is precise: the RLS policy is enforced as a row-level filter on the candidates returned by the HNSW index — not before the index is consulted. PostgreSQL's query planner uses the HNSW index to identify vector-similar candidates, then applies RLS conditions to remove unauthorized rows from the result. The guarantee is that no unauthorized row appears in query output. The practical consequence: if many of the nearest neighbors are in a restricted tier, the query may return fewer than the requested top-k — the index does not substitute filtered rows with the next nearest authorized candidate. A dedicated vector database would require a separate synchronization mechanism: every time a chunk's PII tier changes (or a new actor is added, or an ACL grant is revoked), the vector database would need to be updated to reflect the change. With colocated vectors, the RBAC model is a single source of truth, enforced at query time by the database engine. The second argument is consistency. A chunk and its embedding are written in the same transaction. They cannot be out of sync: if the embedding write fails, the transaction rolls back, and no embedding is recorded. There is no window during which a chunk exists without an embedding or an embedding exists without a chunk. A dedicated vector database introduces such a window — the chunk is written to Postgres, then the embedding is written to the vector store. A crash between the two writes produces an inconsistent state. > ◆ Going Deeper — When to evaluate a dedicated vector database. > > The recommendation above holds at the scale of a single matter: tens to hundreds of thousands of chunks, one to a handful of concurrent users. If Meridian-Cannon is ever deployed at the scale of a document review platform — millions of documents, thousands of concurrent queries, HNSW index build times in hours — the calculus changes. > > The measurement to make: run the pgvector benchmark suite against your actual document count and query load. If HNSW recall drops below 0.90 or query latency exceeds 100ms at your p95, evaluate a dedicated store. If you make this change, document it in docs/divergences.md with the benchmark results that motivated it. The colocated embedding model described here is the default; departing from it is an architectural decision, not a configuration choice. ## PostGIS: Location Evidence as First-Class Data Communications evidence often contains location data: a photo has GPS coordinates in its EXIF metadata; a mobile device log records cell tower associations; a navigation history records routes and waypoints. In a custody or TPR proceeding, location evidence is frequently material: was the parent at the agreed exchange location? Were communications sent from a particular address? PostGIS is the extension that makes location data queryable at the same semantic level as text. A GPS coordinate stored as a PostGIS GEOGRAPHY(POINT) column can be queried as "within 500 meters of this address," sorted by distance from a reference point, or joined against a polygon layer representing addresses or court-ordered exclusion zones. The entities table includes location entities extracted by NER — addresses, intersections, city names. These are text. PostGIS enables the next step: resolving text location references to actual coordinates and storing them in a geometry column on a derived table. A communication that says "I'm at Park and Main" resolves, via geocoding, to a coordinate; that coordinate is stored in the spatial column and is queryable with the same RBAC and audit logging as any other Postgres row. At current implementation, Meridian-Cannon includes PostGIS in schema/00_setup.sql as an extension but does not have a dedicated spatial evidence table. The provision is intentional: when the first corpus of location-tagged evidence arrives, the extension is ready. Adding the spatial table requires one migration, not a platform change. ## The Schema's Provenance Chain The Meridian-Cannon schema is divided into numbered files applied in sequence. The provenance chain — the path from a source document to a sealed attestation — spans three files. schema/10_core.sql defines the fundamental primitives: matters, parties, actors, audit_log. These are the "who." matters is the top-level partition: every row in every table carries a matter_id. In a multi-matter deployment (a circuit court case plus a parallel administrative proceeding), the matter_id isolates each matter's records while sharing infrastructure. schema/20_provenance.sql defines sources, acquisitions, and productions. A source is a long-lived entity — "Isabel's iMessage history" — that persists across multiple fetch events. An acquisition is one fetch event: one specific export at one specific time, producing one specific file with one specific SHA-256. The distinction is what makes idempotency possible: a source can be fetched multiple times; each fetch is a new acquisition row; each acquisition either finds a matching hash in documents (and skips) or creates a new document row. schema/30_documents.sql defines documents, chunks, embeddings, and entities. These are the "what" — the extracted content and its enrichments. documents identifies by SHA-256: one row per unique byte sequence, regardless of how many acquisitions produced it. The same PDF filed in court, attached to three emails, and received as a fax creates one documents row and three document_acquisitions rows. The relationships: sources → acquisitions → documents → chunks → embeddings. This is the provenance chain. Starting from any embedding, you can walk upstream to its chunk, its document, the acquisitions that produced it, and the source they came from. Starting from any source, you can walk downstream to every chunk and embedding derived from it. > ☉ In the Wild — Waymo v. Uber (2017): Evidence lives in the substrate, not the documents. > > In 2017, Waymo (then part of Google) sued Uber for trade secret misappropriation. The central allegation: Anthony Levandowski had downloaded 14,000 files from Google's servers before resigning to found his own self-driving car company, which Uber subsequently acquired. > > The proof was not in the files themselves. It was in the substrate. Google's forensics team used server access logs, download timestamps, file system metadata, and hash comparisons to reconstruct exactly what had been downloaded, from which machine, at what time, under which user credential. The files on Levandowski's devices matched the hashes of files that had been logged as downloaded from Google's internal systems on specific dates. > > The audit trail was not built for litigation. It was built because Google's internal systems logged everything as a matter of regular practice — the same conditions that qualify a record as a business record under FRE 803(6). The litigation value was a consequence of the engineering discipline. > > Meridian-Cannon's audit_log is built with the same logic: every read, write, export, and transformation is logged with the actor, the timestamp, the resource, and the payload. The audit log's value in litigation is a consequence of the system's design, not a special mode turned on after a dispute arises. ## The audit_log Hash Chain The audit_log is append-only by policy and hash-chained by trigger.

-- 10_core.sql
CREATE TABLE audit_log (
  id              bigserial PRIMARY KEY,
  occurred_at     timestamptz NOT NULL DEFAULT clock_timestamp(),
  actor_id        uuid REFERENCES actors(id),
  matter_id       uuid REFERENCES matters(id),
  action          text NOT NULL,
  resource_type   text,
  resource_id     uuid,
  payload         jsonb NOT NULL DEFAULT '{}',
  ip_address      inet,
  user_agent      text,
  prev_hash       text,
  hash            text NOT NULL
);

The trigger fires on every INSERT:

CREATE OR REPLACE FUNCTION audit_log_hash_trigger() RETURNS trigger AS $$
DECLARE
  prev text;
BEGIN
  SELECT hash INTO prev FROM audit_log ORDER BY id DESC LIMIT 1;
  NEW.prev_hash := prev;
  NEW.hash := sha256_hex(
    coalesce(prev, '') || '|' ||
    NEW.occurred_at::text || '|' ||
    coalesce(NEW.actor_id::text, '') || '|' ||
    coalesce(NEW.matter_id::text, '') || '|' ||
    NEW.action || '|' ||
    coalesce(NEW.resource_type, '') || '|' ||
    coalesce(NEW.resource_id::text, '') || '|' ||
    coalesce(NEW.payload::text, '{}')
  );
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

Each row's hash includes the prior row's hash (prev_hash) as a chained input. An adversary who wants to modify row 500 without detection must also recompute the hash for row 500, and then for row 501, and then for row 502, through to the most recent row — which requires knowing sha256_hex, which is defined in Postgres, and which means executing SQL against the live database. The practical effect: any modification to any historical row breaks every hash computed after it, and a forward-walk verification will detect the break at the first modified row.

This is not a blockchain — no distributed consensus, no mining, no economic incentive. It is a simpler and older structure: a hash-chained log, used in certificate transparency, in financial audit logs, and in notarized logbooks. The property it provides is tamper-evidence: modification is detectable, not impossible.

Verification is a forward walk:

-- Verify audit_log chain integrity from the beginning.
WITH chain AS (
  SELECT id, prev_hash, hash,
    sha256_hex(
      coalesce(prev_hash, '') || '|' ||
      occurred_at::text || '|' ||
      coalesce(actor_id::text, '') || '|' ||
      coalesce(matter_id::text, '') || '|' ||
      action || '|' ||
      coalesce(resource_type, '') || '|' ||
      coalesce(resource_id::text, '') || '|' ||
      coalesce(payload::text, '{}')
    ) AS recomputed
  FROM audit_log
)
SELECT id, hash, recomputed,
       hash = recomputed AS chain_valid
FROM chain
WHERE hash <> recomputed
ORDER BY id;

If this query returns zero rows, the chain is intact from the first record to the last. If it returns any rows, those rows are either modified or computed from a modified predecessor. The earliest row in the result set is the first tampered record.

The schema/99_rls.sql file revokes UPDATE and DELETE on audit_log from all roles including the application role. A trigger or application code that attempts to update or delete an audit row receives a permission denied error. The only permitted operation on audit_log is INSERT. This is enforced at the database layer, not at the application layer — there is no application code that can accidentally (or deliberately) bypass it. > ✻ Try This — Find the gap in your audit trail. > > Write a SQL query that returns all rows in chunks whose parent document_id has no corresponding audit_log entry with action = 'chunk_created'. This query reveals chunks that exist without an audit record of their creation — a data quality gap that indicates either a pipeline bug or a direct database write that bypassed the worker.

The query pattern:
SELECT c.id, c.document_id, c.created_at
FROM chunks c
WHERE NOT EXISTS (
  SELECT 1 FROM audit_log al
  WHERE al.resource_type = 'chunk'
    AND al.resource_id = c.id
    AND al.action = 'chunk_created'
);
If this returns rows in a healthy system, investigate: were these chunks created by a migration? By a test fixture? By direct SQL? Each answer tells you something different about the pipeline's completeness.

Row-Level Security: RBAC at the Database Layer

The schema/99_rls.sql file defines row-level security policies. RLS is a Postgres feature that attaches a WHERE clause to every query against a table, without requiring the application to include that clause explicitly. The application sets app.current_actor_id at the start of each database session:

# workers/litdb.py
with c.cursor() as cur:
    cur.execute(
        "SELECT set_config('app.current_actor_id', %s, true)",
        (SYSTEM_ACTOR_ID,),
    )

The RLS policy reads this value via current_actor_id() and applies per-row filtering. A query against documents from a paralegal actor returns only documents whose evidentiary_pii_tier is within the paralegal's pii_ceiling. A query from an opposing_counsel actor returns only documents that have been formally produced. The critical property: an application bug that forgets to add a WHERE clause to restrict results by actor role does not produce a privilege leak. The database's RLS policy applies regardless of what the application sent. An application that executes SELECT * FROM documents as an opposing_counsel actor gets only the produced documents — not all documents.

Defense in depth adds redundant application checks. Defense in layers moves enforcement to a layer application code cannot bypass. RLS is defense in layers.

-- 10_core.sql — PII tier rank function used by RLS
CREATE OR REPLACE FUNCTION pii_tier_rank(tier text) RETURNS int AS $$
  SELECT CASE tier
    WHEN 'public'       THEN 0
    WHEN 'low'          THEN 1
    WHEN 'internal'     THEN 2
    WHEN 'sensitive'    THEN 3
    WHEN 'privileged'   THEN 4
    WHEN 'work_product' THEN 5
    ELSE NULL
  END
$$ LANGUAGE sql IMMUTABLE;

The pii_tier_rank function converts string tier labels to integers for comparison. An actor with pii_ceiling = 'sensitive' (rank 3) can read documents with evidentiary_pii_tier of public (0), low (1), internal (2), or sensitive (3), but not privileged (4) or work_product (5). The RLS policy encodes this comparison without application code. The RBAC model defines seven actor roles: owner, counsel, paralegal, expert, family, opposing_counsel, court_clerk, system. Each role has a defined pii_ceiling and a set of permitted actions. opposing_counsel actors can read produced documents only — documents in the productions table associated with their matter — and nothing else. family actors can read documents tagged for family sharing. system actors (the ingestion workers) can write to all tables, constrained by the NOT NULL and CHECK constraints that prevent inconsistent writes. > § For the Record — FRE 803(6), applied to an RLS-governed audit log. > > FRE 803(6)(B) requires that the record be "kept in the course of a regularly conducted activity." A record that can be modified or deleted after the fact is not "kept" in any meaningful sense — it is editable. The combination of audit_log INSERT-only permissions (enforced by RLS) and hash-chain verification satisfies the "kept" requirement: the record cannot be altered without detection, and the trigger fires on every pipeline event as a regular practice of the system's operation. ## JSONB: Flexible Metadata Without Schema Explosions The tension in evidence data modeling is between structure and flexibility. A fully normalized schema — one table per field, one foreign key per relationship — is queryable but brittle: every new source type requires a migration. A fully denormalized schema — one JSONB blob per record — is flexible but opaque: SQL queries against JSONB are slower than queries against indexed columns, and schema validation is optional. Meridian-Cannon resolves the tension by separating stable, universal fields (present on every row of a type) from per-type fields (present only for specific source kinds). Stable fields go in typed columns with constraints and indexes. Per-type fields go in JSONB. The chunks.metadata column is the primary example. Every chunk has id, document_id, text, ordinal, chunker, modality, and pii_tier as typed columns. A financial transaction chunk additionally has amount, account_number, transaction_date, and counterparty — stored in metadata JSONB. A court order chunk additionally has case_number, judge, order_date, and disposition — also in metadata JSONB. Adding a new source type with ten additional fields requires zero migrations. The worker that processes the new source type writes its fields to metadata. The retrieval query for those fields uses JSONB operators: metadata->>'amount' for extraction, metadata @> '{"transaction_type": "wire"}' for containment. The GIN index on metadata makes these queries efficient. The parties.metadata and acquisitions.request_meta and acquisitions.response_meta columns follow the same pattern. Acquisition metadata for an OAuth-based Gmail fetch includes the token expiration, the scopes granted, and the page token for incremental sync. Acquisition metadata for a physical device export includes the device model, the iOS version, and the forensic tool used. Both go in JSONB; neither requires a schema migration. ## Migration Discipline: Sequential, Reversible, Named The schema files are named with two-digit prefixes: 00_setup.sql, 10_core.sql, 20_provenance.sql, and so on. The numbering is the application order. Each file must be applied before any file with a higher number, because foreign key references between files flow in ascending order: 10_core.sql defines matters; 20_provenance.sql references matters.id; 30_documents.sql references acquisitions.id from 20_provenance.sql. Each migration has a companion down migration: A0_attestations.sql and A0_attestations.down.sql. The down migration is not a safety net — it is a discipline enforcer. Writing a reversible migration forces you to think about what the migration creates and whether it can be undone cleanly. A migration that drops a column with data cannot be reversed; the correct approach is to deprecate the column, stop writing to it, confirm no reads depend on it, and then drop it in a subsequent migration. Do not remove or restructure existing schema files. Add to them, or add new numbered files. The provenance chain depends on the schema being stable: an audit row that references a table that no longer exists in the schema is evidence of a schema rollback that disrupted the chain. ## The Substrate in the Courtroom The legal question the substrate must answer is: can the records produced from this database be authenticated under FRE 901(a)? Authentication requires "evidence sufficient to support a finding that the item is what the proponent claims it is." For a database record, this means: is this row what it appears to be, from the time it appears to be from, produced by the process it appears to have been produced by? The hash chain in audit_log provides a partial answer: the row was written at the logged timestamp, and no modification has occurred since. The documents.sha256 column provides a second partial answer: the document bytes match the stored hash. The acquisitions.raw_sha256 column provides a third: the source file at acquisition time matches the stored hash. The full answer requires a custodian who can testify to the system's regular operation — and the schema is designed so that testimony can be supported by unambiguous records rather than reconstructed from memory. The seven-step Canon verifier (Chapter 25) authenticates sealed attestations. For raw database records — a chunk, an audit log entry, an entity mention — the authentication path runs through the audit trail and the hash columns. Canon signing is the L6 output; the database substrate supports L0–L5. > ✻ Try This — Find chunks with no embedding. > > A chunk in the chunks table with no corresponding row in embeddings is a data quality gap: L3 completed (the chunk exists) but L4 did not complete (the embedding was not computed). The query:

SELECT c.id, c.document_id, c.chunker, c.created_at
FROM chunks c
LEFT JOIN embeddings e ON e.chunk_id = c.id
WHERE e.chunk_id IS NULL
ORDER BY c.created_at DESC
LIMIT 50;
In a healthy pipeline, this query returns zero rows (or returns rows only for chunks created in the last few minutes, while the L4 worker is still processing). If it returns many rows from weeks ago, the L4 worker failed silently on those chunks. The fix is to re-enqueue L4 jobs for the affected chunk IDs.

ParadeDB: BM25 Full-Text Search (v0.2.0)

The chunks.text_tsv generated column (described in Chapter 13) provides standard PostgreSQL full-text search via tsvector and the @@ operator. As an optional upgrade, schema/B1_paradedb_fts.sql adds a ParadeDB BM25 index over chunks.content. ParadeDB embeds the Tantivy search engine (written in Rust) directly inside PostgreSQL as the pg_search extension, replacing the default GIN-based full-text index with a native BM25 ranking model. BM25 scores terms by document frequency and inverse document frequency, returning better-ranked results than ts_rank on longer documents. The schema file is guarded by an extension existence check and silently no-ops if pg_search is not installed. When it is installed, the ParadeDB backend is activated by setting MERIDIAN_USE_PARADEDB=1. Queries use the @@@ operator in place of @@:

-- ParadeDB BM25 query (MERIDIAN_USE_PARADEDB=1)
SELECT id, text FROM chunks WHERE text @@@ 'visitation AND rescheduled';

-- Standard tsvector fallback (always available)
SELECT id, text FROM chunks WHERE text_tsv @@ plainto_tsquery('english', 'visitation rescheduled');

The tsvector / tsquery path remains the fallback and is always present. ParadeDB is an optional performance and ranking upgrade, not a dependency. ## pgvectorscale: StreamingDiskANN Index (v0.2.0) The default ANN index for the embeddings table is HNSW (described above). For large collections — when the HNSW index exceeds available RAM — schema/B2_pgvectorscale.sql creates a StreamingDiskANN index as an alternative. StreamingDiskANN, provided by the pgvectorscale extension from Timescale, stores the index on disk rather than in RAM. It uses DiskANN's graph-based structure, which sustains high recall at sub-100ms latency even when the full index cannot fit in memory. The schema file is guarded by an extension existence check and no-ops if pgvectorscale is not installed. When installed, the StreamingDiskANN index replaces ivfflat as the recommended ANN index for collections beyond a few million vectors. The HNSW index remains the default for collections that fit comfortably in RAM (see Chapter 14's guidance on the recall-versus-speed trade-off). ivfflat remains available as a further fallback for older deployments. ## Rekor Transparency Log (v0.2.0) When MERIDIAN_REKOR_ENABLED=1, the pipeline submits each sealed attestation envelope to a Rekor transparency log after writing the seals row. The returned entry UUID and log index are stored in the rekor_entries table, defined in schema/E0_rekor.sql:

CREATE TABLE rekor_entries (
    id              bigserial PRIMARY KEY,
    seal_id         uuid REFERENCES seals(id),
    entry_uuid      text NOT NULL,
    log_index       bigint,
    rekor_url       text NOT NULL,
    integrated_time bigint,    -- Unix timestamp from Rekor
    created_at      timestamptz DEFAULT now()
);

A Rekor entry is an immutable, publicly auditable record that a specific artifact existed in a specific signed form at a specific time. It extends the Ed25519 custodian-signed seal with a third-party timestamp and inclusion proof: anyone who holds the entry_uuid can verify against the Rekor log that the artifact was submitted — without contacting the original issuer. This matters when the custodian's own key or server is unavailable, or when long-term tamper evidence is required beyond the local audit log's hash chain. The rekor_entries table is additive: removing a Rekor entry from the table does not affect the Rekor log itself, which is append-only. The table provides a local index of submission records; the log is the authoritative tamper-evident record. ## Postgres vs. SQLite: When Each Is Appropriate The stub for this chapter mentioned SQLite as the archival/portability sibling. The full picture: Postgres is the active litigation substrate. It handles concurrent workers (SKIP LOCKED), RLS, pgvector ANN search, PostGIS, and the hash-chained audit log. It requires a running server. In production, Meridian-Cannon runs against a Supabase-hosted Postgres instance; locally, against a Docker container or Homebrew install. SQLite is the archival format. After a matter closes, the evidence corpus — documents, chunks, embeddings (via sqlite-vec), and the audit log — can be exported to a single SQLite file. That file is self-contained: no server, no network, no dependencies beyond the SQLite library. It can be submitted to a court, handed to opposing counsel on a USB drive, or archived for ten years without infrastructure maintenance. The Canon attestations produced by Meridian-Cannon are verified without the database: they are JSON files signed with Ed25519. The attestations do not require either Postgres or SQLite to verify. The database is the production substrate; the attestations are the archive units. Do not introduce SQLite as the primary backend for active ingestion. The concurrency model, the RLS model, and the HNSW index are not available in SQLite. The schema is designed for Postgres semantics; SQLite's weaker type system and non-existent constraint enforcement would silently allow writes that Postgres rejects.

💡Key Takeaways

- A single Postgres instance serves vector (pgvector HNSW), BM25 full-text (tsvector or ParadeDB), and relational queries simultaneously — colocating them enforces RBAC at query time on every vector search, a guarantee a dedicated vector store cannot provide. - The three optional schema migrations extend the base without modifying it: B1_paradedb_fts.sql adds BM25 via pg_search, B2_pgvectorscale.sql adds StreamingDiskANN via pgvectorscale, and E0_rekor.sql adds the rekor_entries table; each is a silent no-op if its extension is absent. - The rekor_entries table is append-only by design: removing a Rekor entry locally does not affect the public Rekor log, which is the authoritative tamper-evident record. - When MERIDIAN_USE_PARADEDB=1, the query dispatcher routes to the @@@ operator against the Tantivy index; when unset, it uses the standard tsvector GIN path — no other code changes required. - pgvectorscale's StreamingDiskANN index outperforms ivfflat above roughly 500k vectors because it stores the index on disk rather than in RAM, sustaining high recall at sub-100ms latency even when the full index cannot fit in memory.

## Exercises ### Warm-up 1. Name four Postgres features used in Meridian-Cannon that are not available in MongoDB. For each, explain what would replace it in a document-database deployment and what property would be lost. 2. The audit_log trigger selects ORDER BY id DESC LIMIT 1 to find the previous hash. Under what concurrency condition could this produce an incorrect prev_hash, and what does the comment in 10_core.sql say about it? ### Core 3. Write the SQL verification query for the audit log hash chain (forward-walk). The query should return only rows where hash <> recomputed, ordered by id ascending. 4. A paralegal actor has pii_ceiling = 'internal'. What SQL policy expression would restrict their reads of the chunks table to rows where pii_tier is within their ceiling? Use the pii_tier_rank function. 5. Write a SQL query that returns all source IDs (sources.id) for which there are acquisitions but no corresponding documents rows — i.e., sources that were fetched but whose documents were never extracted. What layer of the pipeline does this implicate? ### Stretch 6. Implement the audit log verification function in Python: connect to the database, read all audit_log rows ordered by id, recompute each hash using the same inputs as the trigger, and report the first row where the stored hash does not match the recomputed hash. 7. A court filing requires you to produce a "complete audit trail for document ID X." Design the SQL query (or sequence of queries) that returns: the source, all acquisitions, the document row, all chunk IDs, all audit log entries referencing those resource IDs, and all attestations that reference those chunks. This is the provenance chain query. 8. The embeddings HNSW index uses m=16, ef_construction=64. Research what these parameters control and propose new values for a corpus of 2 million chunks. Justify your proposal with reference to pgvector's documentation on recall-vs-speed trade-offs. ## Build-Your-Own Prompt For your capstone matter: identify the PII tiers present in your corpus. Which source types produce privileged or work_product content? Which actors need access to each tier? Draw the RBAC matrix: actors as rows, PII tiers as columns, permitted actions (read, annotate, produce, redact) as cell values. Then verify that the actors table schema (role, pii_ceiling) and the acl_grants table can represent every cell in your matrix. If there are cells that require a grant pattern not expressible in the current schema, document the gap. Your capstone will be evaluated on whether the RBAC model is complete for your corpus. ## Further Reading - schema/10_core.sql — matters, parties, actors, audit_log with the hash-chain trigger. - schema/30_documents.sql — documents, chunks, embeddings, entities. - schema/99_rls.sql — RLS policies for all tables. - schema/B1_paradedb_fts.sql — ParadeDB BM25 index on chunks.content. - schema/B2_pgvectorscale.sql — StreamingDiskANN index on embeddings.vector. - schema/E0_rekor.sql — rekor_entries table for transparency log integration. - pgvector documentation, HNSW index parameters: https://github.com/pgvector/pgvector. - ParadeDB documentation: https://docs.paradedb.com/ — the pg_search BM25 extension. - pgvectorscale / DiskANN documentation: https://github.com/timescale/pgvectorscale. - Rekor project (Sigstore): https://github.com/sigstore/rekor — the transparency log protocol. - Kleppmann, Designing Data-Intensive Applications (2017), Ch 7 (Transactions) — the canonical treatment of ACID in a production context. - Waymo LLC v. Uber Technologies, Inc., 3:17-cv-00939 (N.D. Cal.) — the docket and Waymo's forensic expert report, publicly available. - FRE 803(6) and the Advisory Committee Notes (1975, 2000, 2014 amendments). - PostGIS documentation: https://postgis.net/documentation/. - NIST SP 800-92 — Guide to Computer Security Log Management, for audit log design principles. --- Next: Chapter 15 — The Attestation Schema. How the Canon data model is mapped onto the Postgres substrate, and what the A0_attestations.sql migration adds to the chain.