Customization Patterns

The system that bends without breaking is the one worth building. The system that breaks when you touch it is the one you already own.

Every real deployment of Meridian-Cannon hits a wall. A medical-records team needs HIPAA-specific redaction tables. A financial-forensics team needs an amount normalizer that handles "$1,200.00," "twelve hundred dollars," and "1200" as the same fact. A newsroom running five simultaneous investigations needs strict separation at the database layer so that one reporter's query cannot surface another team's sources.

None of these require changing the core. All of them require understanding where to make the change. That boundary — what is core, what is extension — is what this chapter establishes.

At a glance

- Five structural customization patterns (domain schema extension, custom source workers, custom extraction types, custom challenge types, and multi-matter deployment), each with a three-rule correctness test: additive only, reversible via .down.sql, and core tables carry no FK references to extension tables.
- Six pluggable backend patterns (v0.2.0): BM25 backend swap, vector index swap, PDF backend selection, LM adapter selection, PII masking backend, and transparency log configuration. Each is controlled by an environment variable or a constructor argument; no core code changes required.
- Epistemic Neutrality Masking discipline applies to all domain extensions: a custom extractor that normalizes for retrieval must preserve the verbatim surface form in verbatim_source and must populate the gaps array honestly rather than claiming precision it does not have.
- The R6 coverage object must include domain-specific declined entries: every custom challenge type that exists in the system must appear in coverage.declined for every attestation where it was not run, with a machine-readable reason. Silence is not a valid reason under Canon §9.4.

## Learning objectives

By the end of this chapter you should be able to:

1. Implement a domain schema extension (Pattern 1) without breaking existing workers: new tables reference core tables via FK, core tables are not modified, and the extension has a clean .down.sql inverse.
2. Write a custom source worker (Pattern 2) following the four-invariant skeleton: hash before processing, idempotency check before insert, job-queue write for all processing, and audit-log row for every action including skipped duplicates.
3. Apply Epistemic Neutrality Masking to a domain-specific extraction result (Pattern 3): populate content with the normalized form, verbatim_source with the original, and gaps with honest qualifiers about approximation or missing context.
4. Define a custom challenge type (Pattern 4) that satisfies R6: implement the challenge logic as a subclass, list it in coverage.declined for non-applicable document types, and provide a machine-readable decline reason for each exclusion.
5. Select and configure the appropriate pluggable backend (Patterns 6–11) for your deployment's BM25 engine, vector index, PDF renderer, LM adapter, PII masker, and transparency log without modifying core code.

## The extension protocol, stated plainly

The Canon substrate consists of the schema files in schema/10_core.sql through schema/99_rls.sql, the Pydantic models in meridian/canon/schema.py, the seven-step verifier, and the job-queue conventions in workers/litdb.py. These are not changed. They are the trust surface. When opposing counsel receives an attestation, the verifiability of that attestation depends on the substrate being exactly what the spec describes. Extensions that modify the core substrate break the conformance guarantee.

Extensions add to the substrate. They do not rewrite it. Every pattern in this chapter is additive.

▼ Why It Matters — Who pays when the core breaks.

In the 2026 TPR proceeding, the family's attorney received fifteen attestations over three months of litigation. Each one was issued by the same system. If the underlying schema were modified between Month 1 and Month 3 — a table renamed, a column dropped, a foreign-key constraint changed — the Month 1 attestations would still verify. But the system's audit log would have a gap where the migration happened, and opposing counsel could argue the chain of custody was broken at that gap.

Core stability is not a preference. It is the evidence-integrity argument.

Pattern 1 — Domain schema extension

The first and most common customization need is new tables. A medical-records system needs to track HIPAA-covered entities, minimum-necessary assessments, and the specific redaction applied to each protected health information field. None of those belong in the reference schema, and none of them need to be in the reference schema.

The pattern is simple: new domain tables reference existing core tables via foreign key, but core tables do not reference new tables. Extension is additive and asymmetric.
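The asymmetry rule lends itself to a mechanical check. The sketch below is one possible approach, not part of the reference implementation: it scans migration text with regular expressions (not a SQL parser), treats any file with a two-digit prefix as core, and reports core files that reference a table created by an extension. The function names and the sample migration contents are hypothetical.

```python
import re

CREATE_RE = re.compile(
    r"CREATE\s+TABLE\s+(?:IF\s+NOT\s+EXISTS\s+)?(\w+)", re.IGNORECASE
)
REFERENCES_RE = re.compile(r"REFERENCES\s+(\w+)", re.IGNORECASE)


def is_core_migration(filename: str) -> bool:
    """Core migrations carry a two-digit prefix (00_ through 99_)."""
    return re.match(r"\d{2}_", filename) is not None


def find_back_references(migrations: dict[str, str]) -> list[tuple[str, str]]:
    """Return (core_file, extension_table) pairs that violate the asymmetry rule."""
    extension_tables = set()
    for name, sql in migrations.items():
        if not is_core_migration(name):
            extension_tables.update(t.lower() for t in CREATE_RE.findall(sql))

    violations = []
    for name, sql in sorted(migrations.items()):
        if is_core_migration(name):
            for target in REFERENCES_RE.findall(sql):
                if target.lower() in extension_tables:
                    violations.append((name, target.lower()))
    return violations


# Hypothetical, heavily abbreviated migration contents keyed by filename.
migrations = {
    "10_core.sql": "CREATE TABLE sources (id BIGSERIAL PRIMARY KEY);",
    "M1_hipaa_phi.sql": (
        "CREATE TABLE IF NOT EXISTS hipaa_phi_inventory ("
        " id BIGSERIAL PRIMARY KEY,"
        " source_id BIGINT NOT NULL REFERENCES sources(id));"
    ),
}
print(find_back_references(migrations))  # an empty list means no violations
```

A check like this fits naturally in CI next to the migrations. Because it is regex-based, SQL comments or quoted identifiers can confuse it; treat a clean result as a hint, not a proof.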

A medical-records extension might add:

-- schema/M1_hipaa_phi.sql
CREATE TABLE IF NOT EXISTS hipaa_phi_inventory (
    id          BIGSERIAL PRIMARY KEY,
    source_id   BIGINT NOT NULL REFERENCES sources(id) ON DELETE CASCADE,
    matter_id   UUID   NOT NULL REFERENCES matters(id) ON DELETE CASCADE,
    phi_category TEXT  NOT NULL,  -- e.g. 'name', 'dob', 'diagnosis', 'ssn'
    field_path  TEXT  NOT NULL,  -- JSON path within the source record
    identified_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    identified_by TEXT NOT NULL  -- actor_id or system component
);

CREATE TABLE IF NOT EXISTS phi_redaction (
    id               BIGSERIAL PRIMARY KEY,
    phi_inventory_id BIGINT NOT NULL REFERENCES hipaa_phi_inventory(id),
    redaction_method TEXT  NOT NULL,  -- 'safe_harbor' | 'expert_determination'
    redacted_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
    redacted_by      BIGINT REFERENCES actors(id)
);

The sources, matters, and actors tables are from the core. The new tables depend on them. The core tables have no knowledge of hipaa_phi_inventory or phi_redaction. Drop the extension and the core still runs. Add the extension to a fresh install and the core is unchanged.

> ◆ Going Deeper — The migration naming convention for extensions.
>
> Reference migrations are named 00_ through 99_. Domain extensions use a prefix that sorts after 99_: A0_, M1_, F1_, etc. This guarantees that applying the files in schema/ in filename order (one psql -f call per file) runs core before extensions.
>
> The A0_attestations.sql migration in the reference implementation follows this convention: it adds the attestations table to the core, but is itself an extension to the baseline schema, sorted after the 99_ RLS policies it depends on.
>
> Every domain extension should have a corresponding .down.sql file. The down migration drops only the tables the extension created. It does not touch core. Reversibility is the test of whether the extension really is additive.

Three rules that together enforce Pattern 1:

1. New tables carry REFERENCES to core tables. Core tables carry no REFERENCES to new tables.
2. New tables are created in migrations that sort after 99_. They are never merged into existing migration files.
3. Every extension has a .down.sql that is a clean inverse. If you cannot write the down migration, the extension is entangled with the core.

> ☉ In the Wild — OpenEMR and thirty years of additive extension without a protocol.
>
> OpenEMR is the most widely deployed open-source electronic health record system. It began in 1992 as a billing tool, extended incrementally to cover clinical records, laboratory results, prescriptions, and scheduling, and today has more than 30,000 columns across hundreds of database tables. Each extension was additive in intent.
>
> The problem is that OpenEMR never formalized what it meant to be an extension. Some extensions added foreign keys back into other extensions. Some added columns to existing tables. Some renamed tables that other extensions depended on. Thirty years of additive work without an extension protocol produced a system where no single developer understands the full dependency graph, where the "down" for any migration is effectively undefined, and where security auditors have found HIPAA-relevant data in tables whose names suggest they contain scheduling data.
>
> Pattern 1 is the architectural lesson from that history: define the extension protocol before you need it, not after thirty years of extensions have accumulated.

## Pattern 2 — Custom source workers

The reference workers in workers/jobs/ handle Gmail, iMessage, audio recordings, PDFs, images, and a handful of other source types. Your corpus probably has at least one source type that isn't in that list.

A new worker must satisfy four invariants:

1. Hash before processing (L0). The source file's SHA-256 is computed before any transformation, extraction, or chunking. The hash is stored. If processing fails, the hash record survives.
2. Idempotency before insert. Check whether a record with this source_hash already exists before inserting. Every worker in the reference follows this pattern (Chapter 15). Workers that skip it produce duplicates.
3. Write to the job queue. Workers don't process inline; they enqueue work for the pipeline to pick up. This keeps the worker fast, makes retries trivial, and lets the queue be the backpressure mechanism.
4. Write an audit_log row. Every ingestion action, even a skipped duplicate, writes an audit row. The audit log is the chain-of-custody record. A worker that does not write audit rows is invisible to the audit.
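The first invariant depends on a file-hashing helper. The reference helper lives in workers/litdb.py and is not reproduced in this chapter, so the following is only a minimal stand-in showing what such a helper might do: stream the raw bytes through SHA-256 in fixed-size chunks so that large recordings or PDFs never need to fit in memory. The chunk_size parameter is an assumption of this sketch.

```python
import hashlib
import os
import tempfile


def hash_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 of a file's raw bytes, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


# Usage: the digest does not depend on the chunk size used to read the file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"exhibit A")
    tmp_path = tmp.name
try:
    first = hash_file(tmp_path)
    second = hash_file(tmp_path, chunk_size=4)
finally:
    os.remove(tmp_path)
print(first == second)
```

The file is opened in binary mode on purpose: hashing decoded text would make the digest depend on encoding and newline handling, which is exactly the kind of transformation the L0 invariant is meant to precede.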

The minimal skeleton for a new worker fits in under 30 lines:

# workers/jobs/my_source_type.py
from workers.litdb import get_db, hash_file, check_idempotent, enqueue_job
from workers.litdb import write_audit

def ingest(file_path: str, matter_id: str, actor_id: int) -> dict:
    db = get_db()

    # L0 invariant: hash before any processing
    source_hash = hash_file(file_path)

    # Idempotency invariant: skip if already ingested
    if existing := check_idempotent(db, source_hash):
        write_audit(db, "ingest_skipped", matter_id, actor_id,
                    {"source_hash": source_hash, "existing_id": existing["id"]})
        return {"status": "duplicate", "source_id": existing["id"]}

    # Insert the source record
    source_id = db.execute("""
        INSERT INTO sources (matter_id, source_hash, source_type, file_path, ingested_by)
        VALUES (%s, %s, 'my_source_type', %s, %s) RETURNING id
    """, (matter_id, source_hash, file_path, actor_id)).fetchone()[0]

    # Enqueue processing
    job_id = enqueue_job(db, "process_my_source_type",
                         {"source_id": source_id, "file_path": file_path})

    # Audit invariant
    write_audit(db, "ingest_queued", matter_id, actor_id,
                {"source_id": source_id, "job_id": job_id})

    return {"status": "queued", "source_id": source_id, "job_id": job_id}
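To watch the idempotency and audit invariants interact, it can help to run the same control flow against a throwaway database. The sketch below is not the reference worker: it uses sqlite3 in memory, takes a precomputed hash instead of a file, and the table shapes (sources, job_queue, audit_log) are simplified assumptions.

```python
import json
import sqlite3


def open_demo_db() -> sqlite3.Connection:
    """In-memory stand-in for the real database (table shapes are hypothetical)."""
    db = sqlite3.connect(":memory:")
    db.executescript(
        """
        CREATE TABLE sources (
            id INTEGER PRIMARY KEY,
            matter_id TEXT NOT NULL,
            source_hash TEXT NOT NULL UNIQUE  -- backstop against racing workers
        );
        CREATE TABLE job_queue (id INTEGER PRIMARY KEY, job_type TEXT, payload TEXT);
        CREATE TABLE audit_log (id INTEGER PRIMARY KEY, action TEXT, detail TEXT);
        """
    )
    return db


def audit(db: sqlite3.Connection, action: str, detail: dict) -> None:
    """Append one audit row; called on every path, including duplicates."""
    db.execute(
        "INSERT INTO audit_log (action, detail) VALUES (?, ?)",
        (action, json.dumps(detail, sort_keys=True)),
    )


def ingest(db: sqlite3.Connection, source_hash: str, matter_id: str) -> dict:
    """Idempotent ingest: one source row per hash, one audit row per call."""
    row = db.execute(
        "SELECT id FROM sources WHERE source_hash = ?", (source_hash,)
    ).fetchone()
    if row is not None:
        audit(db, "ingest_skipped", {"existing_id": row[0]})
        db.commit()
        return {"status": "duplicate", "source_id": row[0]}

    source_id = db.execute(
        "INSERT INTO sources (matter_id, source_hash) VALUES (?, ?)",
        (matter_id, source_hash),
    ).lastrowid
    job_id = db.execute(
        "INSERT INTO job_queue (job_type, payload) VALUES (?, ?)",
        ("process_my_source_type", json.dumps({"source_id": source_id})),
    ).lastrowid
    audit(db, "ingest_queued", {"source_id": source_id, "job_id": job_id})
    db.commit()
    return {"status": "queued", "source_id": source_id}


db = open_demo_db()
first = ingest(db, "abc123", "matter-1")
second = ingest(db, "abc123", "matter-1")
print(first["status"], second["status"])
```

Two calls leave one source row, one queued job, and two audit rows. The UNIQUE constraint is a design choice worth copying: a check-then-insert sequence has a window in which two workers can both pass the check, and the constraint turns that race into a visible error instead of a silent duplicate.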

The ingest() function is the complete ingestion worker. The processing (parsing the source, extracting content, chunking) belongs in a separate job handler that the queue calls when it picks up the process_my_source_type job.

Notice what is not in the worker: no format-specific parsing, no NLP, no embeddings. Those are the processing phase. The worker's only job is to assert "I saw this thing, it has this hash, process it later."

> ▼ Why It Matters — The worker is a witness.
>
> In a TPR proceeding, the opposing party's argument is often that evidence was altered after it was collected. The ingestion worker's job is to make that argument fail at the database layer. The hash-before-processing invariant means that even if the processing worker has a bug (even if it mangles the content), the source hash in the sources table reflects what was received, not what the processing worker made of it. That hash is the worker's testimony, and it does not change after ingestion.

## Pattern 3 — Custom extraction types

The L4 extractors in the reference system identify entities: people, places, dates, organizations, phone numbers. For many domain deployments, the reference extractors are not enough.

A financial-forensics deployment needs to handle amounts. "$1,200.00," "$1200," "twelve hundred dollars," "1.2k," and "USD 1,200" are five surface forms of the same fact. A naive extractor produces five different entities from the same document. A good amount normalizer produces one, with the verbatim surface form preserved.

The Epistemic Neutrality Masking discipline from Chapter 18 applies here: the extractor normalizes for retrieval but preserves verbatim for attestation. The Claim records the normalized form; the Witness entry records the verbatim form in content_inline.

A custom L4 extractor for financial amounts:

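# meridian/extractors/financial_amount.py
from __future__ import annotations

import re
from decimal import Decimal, InvalidOperation

from meridian.canon.schema import Claim, InferenceType

# Scale words and suffixes multiply the number that precedes them.
MULTIPLIERS = {
    "hundred": 100, "thousand": 1000, "million": 1_000_000,
    "billion": 1_000_000_000, "k": 1000, "m": 1_000_000,
}
# Spelled-out base numbers, zero through twenty; extend for your corpus.
WORD_NUMBERS = {word: n for n, word in enumerate(
    "zero one two three four five six seven eight nine ten eleven twelve "
    "thirteen fourteen fifteen sixteen seventeen eighteen nineteen twenty".split())}
CURRENCY_WORDS = {"usd", "dollar", "dollars"}

_NUM = r"\d+(?:,\d{3})*(?:\.\d+)?"
_SCALE = r"(?:hundred|thousand|million|billion)"
_WORDS = "|".join(WORD_NUMBERS)
AMOUNT_RE = re.compile(
    rf"\$\s?{_NUM}(?:\s?(?:{_SCALE}|[km])\b)?"   # $1,200.00   $1.2k   $5 million
    rf"|\bUSD\s?{_NUM}"                            # USD 1,200
    rf"|\b(?:{_NUM}|{_WORDS})"                     # 1200 dollars, 1.2k, twelve hundred dollars
    rf"(?:\s*(?:{_SCALE}|k)\b(?:\s+dollars?\b)?|\s+dollars?\b)",
    re.IGNORECASE,
)

def normalize_amount(verbatim: str) -> Decimal | None:
    """Normalize a surface form to a Decimal. Returns None if not parseable."""
    text = verbatim.lower().replace(",", "").replace("$", " ")
    value: Decimal | None = None
    for token in re.findall(r"\d+(?:\.\d+)?|[a-z]+", text):
        if token in CURRENCY_WORDS:
            continue
        if token in MULTIPLIERS:
            # "hundred" alone means 100; "twelve hundred" means 12 * 100.
            value = (value if value is not None else Decimal(1)) * MULTIPLIERS[token]
        elif value is not None:
            # A second base number ("two thousand five hundred") is out of scope here.
            return None
        elif token in WORD_NUMBERS:
            value = Decimal(WORD_NUMBERS[token])
        else:
            try:
                value = Decimal(token)
            except InvalidOperation:
                return None
    return value

def extract_amounts(chunk_text: str, chunk_id: str) -> list[Claim]:
    claims = []
    for match in AMOUNT_RE.finditer(chunk_text):
        verbatim = match.group(0)
        normalized = normalize_amount(verbatim)
        if normalized is None:
            continue
        claims.append(Claim(
            claim_id=f"amt_{chunk_id}_{match.start()}",
            inference_type=InferenceType.DEDUCTION,
            # Fixed-point canonical form: 1200, 1200.00 and 1.2k all render as "1200".
            content=f"financial_amount: {normalized.normalize():f}",
            verbatim_source=verbatim,  # exact surface form, untouched
            supports=[chunk_id],
            gaps=["currency not verified", "context not assessed"],
        ))
    return claims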
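The gaps list in extract_amounts() is fixed, but the surrounding text often says more about how much to trust a number. As a hedged illustration (the function name, the qualifier list, and the 24-character window are all assumptions of this sketch, not part of the reference extractor), a helper could inspect the text immediately before a match and add a gap when an approximation qualifier is present:

```python
import re

# Qualifiers that signal the writer did not intend an exact figure.
APPROX_RE = re.compile(
    r"(?:\b(?:approximately|approx\.?|about|around|roughly|nearly)|~)\s*$",
    re.IGNORECASE,
)

BASE_GAPS = ["currency not verified", "context not assessed"]


def gaps_for_match(chunk_text: str, start: int, window: int = 24) -> list[str]:
    """Return gaps for an amount starting at `start`, checking the preceding text."""
    gaps = list(BASE_GAPS)
    prefix = chunk_text[max(0, start - window):start]
    if APPROX_RE.search(prefix):
        gaps.append("approximation qualifier present")
    return gaps


text = "She paid approximately $1,200 in March."
print(gaps_for_match(text, text.index("$1,200")))
```

The normalized value stays the same either way; only the gaps change. That keeps retrieval simple while leaving a record that the source itself hedged.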

The critical line in extract_amounts() is verbatim_source=verbatim. The normalized canonical form financial_amount: 1200 goes into the Claim's content. The original "twelve hundred dollars" goes into verbatim_source. The Witness entry for the chunk carries the raw text. The chain is: raw text → verbatim capture → normalized claim, and every link in that chain is auditable.

> ◆ Going Deeper — The ENM discipline for custom extractors.
>
> Epistemic Neutrality Masking (Chapter 18) applies to custom extractors with one additional constraint: the normalizer must not introduce a false precision. An amount recorded as "approximately $1,200" should produce a Claim with financial_amount: 1200 and a gaps entry that notes "approximation qualifier present; verbatim may not represent exact amount." The extractor's job is to make retrieval work; the gaps array is where the extractor is honest about what retrieval costs.
>
> A custom extractor that produces normalized Claims with empty gaps arrays is an extractor that is claiming to understand more than it does. R4 and R5 require this honesty at the schema level. An extractor that systematically violates R4/R5 will fail the conformance suite's per-requirement tests.

## Pattern 4 — Custom challenge types

The refutation harness in Chapter 20 defines five challenge types: TEMPORAL_CONSISTENCY, ENTITY_CONSISTENCY, COUNTER_SOURCE, LOGICAL_VALIDITY, and COMPLETENESS. These cover the challenge space for personal-records litigation. They do not cover everything.

A medical-records deployment needs a DOSAGE_ERROR challenge: given a prescription record, is the recorded dosage consistent with standard dosing guidelines for the named medication? That is not a temporal check, not a counter-source check, not a logical validity check. It is a domain-specific factual check that requires medical knowledge.

Adding a custom challenge type requires two things. First, implement the challenge logic as a Refutation subclass:

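# meridian/refutation/dosage_error.py
from meridian.canon.schema import Challenge

def lookup_standard_range(medication):
    """Return (min_mg, max_mg) for a medication, or None if unknown.

    Placeholder: wire this to your own dosing reference.
    """
    raise NotImplementedError("supply a domain-specific dosing lookup")

# In a real deployment, derive this class from the harness's Refutation base
# class; its import path is not shown in this chapter.
class DosageErrorChallenge:
    challenge_type = "DOSAGE_ERROR"

    def run(self, claim, context) -> Challenge:
        medication = context.get("medication_name")
        dosage = context.get("recorded_dosage_mg")
        standard_range = lookup_standard_range(medication) if medication else None
        if dosage is None or standard_range is None:
            # Nothing to compare against: decline (coverage.declined) rather than guess.
            raise ValueError("DOSAGE_ERROR not applicable: missing dosage or standard range")
        low, high = standard_range
        verdict = "pass" if low <= dosage <= high else "fail"
        return Challenge(
            challenge_type=self.challenge_type,
            targets=[claim.claim_id],
            verdict=verdict,
            rationale=f"Dosage {dosage}mg; standard range {low}-{high}mg for {medication}",
        )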

Second — and this is what most implementations miss — the custom challenge type must be listed in coverage.declined for every attestation where it was not run. R6 requires that the coverage object honestly represent what was and was not tested. An attestation over a non-prescription record that silently omits DOSAGE_ERROR from its coverage is misleading. The correct behavior is to include:

{
  "declined": [
    {
      "challenge_type": "DOSAGE_ERROR",
      "reason": "source_type:general_document — DOSAGE_ERROR applies only to prescription records"
    }
  ]
}

That entry tells the recipient: "We know this challenge type exists. We did not run it. Here is why." The recipient can assess whether the reason is adequate. The coverage object is where the system is honest about its own limits.
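One way to keep declined entries from being forgotten is to derive them from a registry instead of writing them by hand. The sketch below is illustrative only: build_declined, the applies_to mapping, and the prescription_record source type are hypothetical names, and the reason format is one possible convention, not the one the Canon mandates.

```python
from typing import Optional


def build_declined(
    applies_to: dict[str, set[str]],
    source_type: str,
    ran: set[str],
    reasons: Optional[dict[str, str]] = None,
) -> list[dict]:
    """Return a declined entry for every registered challenge that was not run.

    applies_to maps each challenge type to the source types it can check.
    reasons supplies explicit reasons for challenges that were applicable
    but skipped; without one, the function refuses to stay silent.
    """
    reasons = reasons or {}
    declined = []
    for challenge_type in sorted(applies_to):
        if challenge_type in ran:
            continue
        applicable = applies_to[challenge_type]
        if challenge_type in reasons:
            reason = reasons[challenge_type]
        elif source_type not in applicable:
            reason = (
                f"source_type:{source_type}; {challenge_type} applies only to "
                + ", ".join(sorted(applicable))
            )
        else:
            raise ValueError(
                f"{challenge_type} was applicable but not run, and no reason was given"
            )
        declined.append({"challenge_type": challenge_type, "reason": reason})
    return declined


registry = {
    "DOSAGE_ERROR": {"prescription_record"},
    "TEMPORAL_CONSISTENCY": {"general_document", "prescription_record"},
}
coverage = {
    "declined": build_declined(
        registry, "general_document", ran={"TEMPORAL_CONSISTENCY"}
    )
}
print(coverage)
```

Raising on a missing reason is deliberate: an applicable challenge that was skipped without explanation is the case the coverage rule exists to catch, so the builder fails instead of emitting an incomplete object.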

§ For the Record — Canon v0.2.0 §9.4 on coverage requirements.

> "The coverage object MUST contain a declined array. Each entry in declined MUST identify the challenge type and provide a machine-readable reason. An implementation that omits challenge types from declined when those types are not applicable MUST document the exclusion criteria in its configuration. Silence is not a valid reason."

> ✻ Try This — Design a custom challenge for your domain.
>
> For your target domain (choose one: medical, financial, journalistic, academic), identify one challenge type that the five-challenge harness does not cover. Write out: (a) the challenge type name, all caps and underscore-separated; (b) what it checks, in one sentence; (c) three document types for which it would be declined, with machine-readable reasons. Don't implement the logic yet; just specify the interface. The specification is the harder part.

## Pattern 5 — Multi-matter deployment

The reference system is scoped to one matter per deployment. Every content table has a matter_id foreign key, but the reference RBAC policies assume one matter is active.

Production deployments for law firms, NGOs, and journalists typically need multiple matters in one database, with strict isolation: an actor with access to matter A must not be able to query matter B's chunks, even if they share the same Postgres session.

This isolation is enforced at two layers.

The first layer is the schema. The matters table is defined in schema/10_core.sql. Every content table (sources, acquisitions, documents, chunks, entities, communications, recordings) carries a matter_id UUID NOT NULL REFERENCES matters(id). This is already true in the reference implementation. Nothing to add here.

The second layer is Row-Level Security. The schema/99_rls.sql file defines RLS policies for the reference single-matter deployment. A multi-matter deployment extends those policies:

-- schema/M2_multi_matter_rls.sql

-- Each session sets a role that the actor_matter_access table maps to matters.
CREATE TABLE IF NOT EXISTS actor_matter_access (
    actor_id   BIGINT NOT NULL REFERENCES actors(id),
    matter_id  UUID   NOT NULL REFERENCES matters(id),
    role       TEXT   NOT NULL,  -- 'owner' | 'counsel' | 'paralegal' | 'expert'
    granted_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    granted_by BIGINT REFERENCES actors(id),
    PRIMARY KEY (actor_id, matter_id)
);

-- RLS policy: a session can only see rows in matters it has access to.
CREATE POLICY matter_isolation ON sources
    USING (
        matter_id IN (
            SELECT matter_id FROM actor_matter_access
            WHERE actor_id = current_setting('app.current_actor_id')::BIGINT
        )
    );

With the matter_isolation policy active, SELECT * FROM sources in an actor's session returns only sources for matters that actor has access to. The WHERE clause in the application query does not matter. Postgres enforces the boundary. The policy shown covers sources only; each content table that carries a matter_id needs an equivalent policy, and row-level security must be enabled on each of those tables for the policies to take effect.

> ◆ Going Deeper — Current-setting pattern for RLS.
>
> The current_setting('app.current_actor_id') pattern is how Postgres RLS policies access session-level context. The application sets this before executing queries:
>
>     db.execute("SELECT set_config('app.current_actor_id', %s, false)", (str(actor_id),))
>
> set_config() is used here because a plain SET statement does not accept server-side bind parameters. The third argument, false, makes this a session-local setting, not a transaction setting. A connection pool that recycles connections must reset it on every checkout. Supabase handles this via JWT claims; the reference implementation uses explicit SET calls. Either approach works; the critical invariant is that no query executes against a session that has not set app.current_actor_id.
>
> An actor with access to matter A who attempts to query matter B's chunks receives an empty result set, not an error. This is intentional: leaking the existence of rows in an inaccessible matter is itself an information disclosure. Empty result is the correct behavior.

> ▼ Why It Matters — Matter isolation is not application logic.
>
> Application-layer access control ("check the user's role before running the query") can be bypassed by bugs, by developers with database access, by compromised sessions, and by SQL injection. Row-Level Security is enforced by the database engine, not the application. It cannot be bypassed by changing application code, provided the application connects as a role that is neither a superuser nor the table owner (owners are exempt unless the table is set to FORCE ROW LEVEL SECURITY). For a system handling privileged litigation records, that distinction is material: opposing counsel's subpoena for the database does not give them access to another matter's records.

## Pluggable backend patterns (v0.2.0)

The five structural patterns above govern what you add to the data model. The six backend patterns below govern how specific subsystems behave. Each is controlled by an environment variable or a constructor argument. None require changes to the Canon substrate or to attestation byte-shape.

### Pattern 6 — BM25 backend swap

The default full-text engine is PostgreSQL's built-in tsvector. For deployments that need better relevance ranking (BM25F field weighting, Tantivy scoring), ParadeDB's pg_search extension is a drop-in swap:

# In your deployment, set MERIDIAN_USE_PARADEDB=1 to use ParadeDB
# or leave unset to use tsvector (always available)
# The dispatcher in meridian.query.search handles this automatically

When MERIDIAN_USE_PARADEDB=1, queries use the @@@ operator against a Tantivy index. When unset or 0, queries use the standard tsvector GIN index. The two code paths are isolated in meridian.query.search; no other code changes.

tsvector is appropriate for corpora under ≈10 million chunks. ParadeDB extends headroom significantly and improves recall for long-tail legal terms. The trade-off is that pg_search is a separate Postgres extension requiring installation and the @@@ operator is non-standard SQL.

### Pattern 7 — Vector index swap

The default approximate nearest-neighbor index is ivfflat (bundled with pgvector, always available). For corpora that exceed available RAM, the StreamingDiskANN index from pgvectorscale provides disk-backed ANN with better recall at high N:

- Default: ivfflat. No additional setup required.
- Optional: diskann via pgvectorscale. Run schema/B2_pgvectorscale.sql after installing the extension.

No application code change is required. pgvector uses whichever index is present. The migration file creates the StreamingDiskANN index in place of the ivfflat index on the chunks.embedding column.

pgvectorscale requires a separate extension install and an additional schema migration. The ivfflat default is sufficient for most capstone-scale corpora.

### Pattern 8 — PDF backend selection

Three PDF rendering backends are available for render_brief_pdf():

from meridian.export.pdf import render_brief_pdf

# ReportLab: always available, no system deps
pdf_bytes = render_brief_pdf(brief, backend="reportlab")

# WeasyPrint: pip install weasyprint; requires libpango + libcairo
pdf_bytes = render_brief_pdf(brief, backend="weasyprint")

# Typst: nearest-LaTeX quality; install separately (brew install typst)
pdf_bytes = render_brief_pdf(brief, backend="typst")

| Backend | Quality | Availability | Notes |
|---|---|---|---|
| `reportlab` | Good | Always available | Default; no system deps |
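Because two of the three backends depend on software that may not be installed, a deployment may want to resolve the backend name before calling render_brief_pdf(). The following is a sketch under assumptions: detect_backends and resolve_backend are hypothetical helpers, detection only checks that the weasyprint package is importable and that a typst executable is on PATH (it does not verify libpango or libcairo), and falling back to reportlab is one possible policy.

```python
import importlib.util
import shutil
from typing import Iterable, Optional

KNOWN_BACKENDS = ("reportlab", "weasyprint", "typst")
DEFAULT_BACKEND = "reportlab"  # documented as always available


def detect_backends() -> list[str]:
    """Best-effort detection of which PDF backends look usable on this host."""
    found = [DEFAULT_BACKEND]
    if importlib.util.find_spec("weasyprint") is not None:
        found.append("weasyprint")
    if shutil.which("typst") is not None:
        found.append("typst")
    return found


def resolve_backend(
    requested: Optional[str] = None,
    available: Optional[Iterable[str]] = None,
) -> str:
    """Return the requested backend if usable, otherwise the default."""
    if requested is None:
        return DEFAULT_BACKEND
    if requested not in KNOWN_BACKENDS:
        raise ValueError(f"unknown PDF backend: {requested!r}")
    usable = set(detect_backends() if available is None else available)
    return requested if requested in usable else DEFAULT_BACKEND


backend = resolve_backend("typst")
print(backend)  # "typst" if the executable is installed, else "reportlab"
```

The resolved name can then be passed as the backend= argument. A caller that must not silently downgrade output quality can compare the result with the requested name and fail instead.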

### Pattern 10 — PII masking backend

pip install meridian-canon[presidio]

from meridian.witness.masking import make_presidio_masker

masker = make_presidio_masker()  # returns a callable
# Swap in as the EntityMasker in your chunker configuration

make_presidio_masker() returns a callable with the same interface as the regex masker. The swap is transparent to the rest of the pipeline. The Presidio backend uses Microsoft's presidio-analyzer and requires an NLP model on first run (spaCy en_core_web_lg by default).

The Presidio backend is appropriate when free-text fields contain informal references to people and places that the regex masker does not cover. The regex masker remains the recommended default for structured records (email headers, court dockets, lab reports) where entity signals are in known fields.

### Pattern 11 — Transparency log configuration

Rekor transparency log integration is disabled by default. Three configurations are supported:

- Disabled (default): MERIDIAN_REKOR_ENABLED=0 (or unset). publish_attestation() returns {"status": "disabled"} without making any network call.
- Public Rekor: Set MERIDIAN_REKOR_ENABLED=1. Uses the default rekor_url (Sigstore public Rekor instance). Every sealed attestation is submitted to the public append-only log; anyone can confirm the attestation existed before a given date.
- Private Rekor: Deploy your own Rekor instance and pass rekor_url to the publishing call. Provides the same verifiability guarantees without exposing attestation existence to the public log.

Privacy-preserving variant: When attestation content is sensitive, publish only the envelope hash rather than the full payload. Patch publish_attestation() to submit SHA-256(dsse_envelope_bytes) instead of the envelope itself.

> ▼ Why It Matters — Transparency log vs. custodian trust.
>
> Without a transparency log, a verifier checking an attestation's timestamp must trust the custodian's clock. With Rekor enabled, the timestamp is publicly anchored in an append-only Merkle tree that anyone can audit. For matters where opposing parties are likely to dispute when evidence was collected, the Rekor entry converts a "trust us" timestamp into an independently auditable one.
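The configuration logic for the transparency log is small enough to sketch without touching the network. The code below does not call Rekor and is not the reference publish_attestation(): plan_publication, its hash_only parameter, and the "ready" status are hypothetical. Only the MERIDIAN_REKOR_ENABLED variable, the {"status": "disabled"} result, and the SHA-256 of the envelope bytes come from the text above.

```python
import hashlib
import os
from typing import Mapping, Optional


def rekor_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """True only when MERIDIAN_REKOR_ENABLED is exactly "1"."""
    env = os.environ if env is None else env
    return env.get("MERIDIAN_REKOR_ENABLED", "0") == "1"


def envelope_digest(dsse_envelope_bytes: bytes) -> str:
    """Hex SHA-256 of the serialized envelope, for hash-only publication."""
    return hashlib.sha256(dsse_envelope_bytes).hexdigest()


def plan_publication(
    dsse_envelope_bytes: bytes,
    env: Optional[Mapping[str, str]] = None,
    hash_only: bool = False,
) -> dict:
    """Decide what would be submitted to the log, without submitting it."""
    if not rekor_enabled(env):
        return {"status": "disabled"}
    if hash_only:
        return {
            "status": "ready",
            "kind": "hash",
            "sha256": envelope_digest(dsse_envelope_bytes),
        }
    return {"status": "ready", "kind": "envelope", "payload": dsse_envelope_bytes}


envelope = b'{"payloadType": "example", "payload": "..."}'
print(plan_publication(envelope, env={}))
print(plan_publication(envelope, env={"MERIDIAN_REKOR_ENABLED": "1"}, hash_only=True)["kind"])
```

Separating the decision from the network call makes the privacy-sensitive choice (full envelope versus digest) easy to test and to audit before anything leaves the deployment.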
## Composing the patterns

The five structural patterns compose. A medical-records deployment running multiple matters adds a domain schema extension (Pattern 1), a custom source worker for HL7/FHIR records (Pattern 2), an amount normalizer and a medication-dosage extractor (Pattern 3), a DOSAGE_ERROR challenge type (Pattern 4), and multi-matter RLS policies (Pattern 5). None of these changes touch the core. Each can be reverted independently.

The six backend patterns (6–11) compose independently of the structural patterns. Selecting typst for PDF output (Pattern 8) and LiteLLMAdapter for pipeline inference (Pattern 9) does not affect the schema extension or the RLS policies. Backend selection is deployment configuration, not architectural change.

The test of a correct composition is a migration that applies cleanly to a fresh database seeded only with the reference migrations, and a .down.sql chain that returns the database to its pre-extension state. If either test fails, the extension is entangled and needs to be redesigned.

> ✻ Try This — The extension design exercise.
>
> Choose a domain: medical, financial, journalistic, or academic. Design the five-pattern extension set for that domain:
>
> (a) Schema extension (Pattern 1): Name one table. Define its columns. Identify which core tables it references via FK. Write the CREATE TABLE statement and the corresponding DROP TABLE down migration.
>
> (b) Custom source worker (Pattern 2): Name one source type not in the reference. Identify the four invariants your worker must satisfy. Sketch the skeleton in ≤15 lines.
>
> (c) Custom extractor (Pattern 3): Name one domain-specific entity type. Describe the normalization it performs. Identify what goes in the Claim's content and what goes in verbatim_source.
>
> (d) Custom challenge (Pattern 4): Name one domain-specific challenge type. Write the machine-readable reason you would use in coverage.declined for a non-applicable document.
>
> (e) Multi-matter (Pattern 5): True or false: your domain requires multi-matter isolation. If false, explain why single-matter is adequate. If true, identify which Postgres role each external party (opposing counsel, auditor, regulator) would hold.
>
> The exercise takes about 30 minutes. The design memo from your capstone is the right place to record it.

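The composition test (apply up, apply down, compare with the starting schema) can be rehearsed cheaply before it is run against Postgres. The sketch below uses SQLite in memory purely as a stand-in; the migration strings are hypothetical abbreviations, and a real check would compare Postgres catalog output (for example a schema-only dump) instead of sqlite_master.

```python
import sqlite3


def schema_snapshot(db: sqlite3.Connection) -> list[tuple]:
    """Return every schema object's type, name, and defining SQL, sorted."""
    return db.execute(
        "SELECT type, name, sql FROM sqlite_master "
        "WHERE name NOT LIKE 'sqlite_%' ORDER BY type, name"
    ).fetchall()


def round_trips(core_sql: str, up_sql: str, down_sql: str) -> bool:
    """True if up changes the schema and down restores it exactly."""
    db = sqlite3.connect(":memory:")
    try:
        db.executescript(core_sql)
        before = schema_snapshot(db)
        db.executescript(up_sql)
        changed = schema_snapshot(db) != before
        db.executescript(down_sql)
        return changed and schema_snapshot(db) == before
    finally:
        db.close()


CORE = "CREATE TABLE sources (id INTEGER PRIMARY KEY, source_hash TEXT NOT NULL);"
UP = (
    "CREATE TABLE hipaa_phi_inventory ("
    " id INTEGER PRIMARY KEY,"
    " source_id INTEGER NOT NULL REFERENCES sources(id),"
    " phi_category TEXT NOT NULL);"
)
DOWN = "DROP TABLE hipaa_phi_inventory;"

# An entangled extension: it also alters a core table, so DOWN cannot undo it.
ENTANGLED_UP = UP + "\nALTER TABLE sources ADD COLUMN phi_flag INTEGER;"

print(round_trips(CORE, UP, DOWN))
print(round_trips(CORE, ENTANGLED_UP, DOWN))
```

The first call reports a clean round trip; the second does not, because the added column survives the down migration. That is the entanglement signal the composition test is designed to surface.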
💡Key Takeaways

- The six pluggable backend patterns are: BM25 backend swap (MERIDIAN_USE_PARADEDB), vector index swap (B2_pgvectorscale migration), PDF backend selection (backend= parameter to render_brief_pdf()), LM adapter selection (constructor argument to run_harness()), PII masking backend (make_presidio_masker() factory swap), and transparency log configuration (MERIDIAN_REKOR_ENABLED + rekor_url).
- The patterns compose independently: selecting ParadeDB for BM25 does not affect which PDF backend is used, and enabling Rekor does not affect which LM adapter runs refutation. Each backend is isolated behind a single env variable or constructor argument.
- Use LiteLLMAdapter at the pipeline orchestration layer (Dagster assets, batch refutation runs) where provider flexibility and cost routing matter; use OllamaAdapter or OpenAIAdapter for per-attestation work where you want minimal dependencies and a single known provider.
- Enabling public Rekor (MERIDIAN_REKOR_ENABLED=1 with the default rekor_url) makes every sealed attestation publicly searchable by entry_uuid; a private Rekor instance or the payload_only_hash flag preserves verifiability without public disclosure of attestation existence.
- make_presidio_masker() returns a callable with the same interface as the built-in regex masker, making it a true drop-in EntityMasker replacement. The rest of the ENM pipeline does not change when swapping from regex-only to NER-backed masking.

## Exercises

### Warm-up

1. Read schema/99_rls.sql. Identify which tables have RLS policies and which do not. For a table that lacks an RLS policy, explain whether this is intentional or an oversight.
2. Trace the check_idempotent call in any reference worker in workers/jobs/. What column is it keyed on? What happens if two workers attempt to ingest the same file simultaneously?

### Core

3. Write the CREATE TABLE statement for a journalism_source_contact table that tracks the journalist's sources (human contacts, not files) for a given matter. Include appropriate foreign keys to core tables. Write the corresponding down migration.
4. Implement a custom source worker for .vcf (vCard) contact files that satisfies all four Pattern 2 invariants. The worker does not need to parse vCard format; it only needs to hash, check idempotency, insert, enqueue, and audit.
5. Write a custom source worker for a plain-text log file format (lines of TIMESTAMP | ACTOR | EVENT). The worker must: compute SHA-256 of the raw file before any parsing, check for an existing acquisition record with the same hash, parse each line into a structured record, and write one audit log entry per imported line. Verify that running the worker twice on the same file produces no duplicate records.

### Stretch

6. Design a STATISTICAL_OUTLIER challenge type for a financial-forensics deployment. Define: the challenge logic (what does it check?), the document types for which it applies, and three machine-readable decline reasons for non-applicable documents.
7. A law firm runs 40 active matters simultaneously in one Meridian-Cannon database. A paralegal's session accidentally omits the SET app.current_actor_id call. What happens to their query results? Is this a security failure, a correctness failure, or both? How would you detect it in the audit log?

## Build-your-own prompt

For your capstone matter: apply Pattern 1 to your schema. Identify the two or three domain-specific tables your corpus requires. Write the CREATE TABLE statements and the down migrations. Verify that no core migration file was modified. The result is a schema extension that is ready for Week 2 of the capstone plan.

## Further reading

- OpenEMR architecture overview, https://www.open-emr.org/wiki/ (observe the extension accumulation over time; the wiki dates migrations back to 2002).
- Postgres Row-Level Security documentation, §5.8, https://www.postgresql.org/docs/current/ddl-rowsecurity.html.
- Canon v0.2.0 §9.4 (Coverage requirements and declined entries).
- Canon v0.2.0 §7 (Extension points for Canon implementations).
- schema/99_rls.sql and schema/A0_attestations.sql in this repository: the reference RLS policies and the reference extension.
- HIPAA minimum-necessary standard, 45 CFR §164.502(b), for medical-records extension designers.
- R6 enforcement in meridian/canon/schema.py: the Pydantic model that makes the coverage requirement machine-enforceable at construction time.

Next: Chapter 29 — Conformance Testing. What it means, in machine-readable terms, for an implementation to be correct.