
Part V — Build Your Own · Chapter 34

Capstone: Designing a Domain-Specific Meridian


The skill the rest of the book has been in service of: looking at a corpus you do not yet know how to handle, and designing a Canon-conformant system over it.

At a glance

  • The capstone is a 10-week project: weeks 1–4 mirror the book's Phases A–B (skeleton + tests provided); weeks 5–7 require design from interface contracts only; weeks 8–10 are open-ended.
  • Deliverables: a working corpus ingester; a sealed EnrichmentAttestation (stored as the enrichment kind in the attestations table schema, a fifth kind for enriched content alongside the four canonical kinds in Chapter 20) and a sealed SearchAttestation; a 4–8 page design memo; and a 30-minute oral defense.
  • The capstone is the assessment. Everything before it is preparation.

## Learning objectives

By the end of this chapter you should be able to:

1. Design a domain-specific schema extension (new tables, FK references to core tables, and a paired .down.sql) that passes the additive-extension test: the core runs without it, and it can be reverted independently.
2. Write a conformance test for a custom source worker that verifies all four invariants: hash-before-processing, idempotency, job-queue write, and audit-log write.
3. Produce a key management policy document covering key generation, storage location, rotation schedule, compromise response, and the rotation attestation that records each rotation event.
4. Emit and verify a complete attestation chain for a custom domain: one ObservationAttestation per ingested item, one EnrichmentAttestation with R4/R5-conformant Claims, one SearchAttestation, and a cross-language verifier that agrees byte-for-byte with the Python reference on all of them. Seal with emit_dsse() (v0.2.0); emit() is legacy, retained for backward compatibility, and not accepted for the capstone.

## Choosing a corpus

Pick a corpus that is not the reference personal-data corpus, that you have legitimate access to, and that has a real recipient who would care about the attestations.

Good capstone corpora:

- An estate's records (emails, financial statements, photographs) to be inventoried and produced for probate. Recipient: counsel for the executor, beneficiaries, the probate court.
- A journalistic FOIA corpus: government records released under public-records law. Recipient: editors, fact-checkers, downstream reporters.
- An NGO's monitoring archive: text, images, audio collected by a human rights organization. Recipient: prosecutors, investigators, public report readers.
- An academic archive: interview transcripts, fieldnotes, observational data of a researcher who consents to attested processing. Recipient: peer reviewers, replication studies, IRBs.
- A regulatory production: documents a regulated entity has produced or must produce. Recipient: regulators, auditors, opposing counsel.
- A small-business records archive: your own or a consenting employer's. Recipient: tax authority, auditor, future buyer.

Bad capstone corpora:

- A corpus you do not have the legal right to ingest. (Ask your institution's IRB or a lawyer if uncertain.)
- A corpus so large you cannot finish in 10 weeks. Aim for $10^3$–$10^4$ documents for the capstone; you can scale later.
- A corpus with no real recipient. The discipline of the book (verifiability without issuer cooperation) is meaningless if no one will ever check.

## The 10-week plan

| Week | Phase | Deliverable | Self-grade |
|---|---|---|---|
| 1 | Research & scope | One-page memo: corpus, recipient, threats, declined-challenges with reasons. | ✓ if you can defend the scope verbally. |
| 2 | Schema | Adapt schema/ to your corpus. Document deviations from the reference. | ✓ if migrations are reversible and pass pytest -m db. |
| 3 | Phase A reprise | Stand up a working meridian/canon/ against your corpus's matter. Seal one minimal attestation using emit_dsse(). | ✓ if the standalone walker validates the DSSEEnvelope. |
| 4 | Phase B reprise | One adapter, one source. Hash on receipt. Emit one ObservationAttestation per item. | ✓ if 100 items round-trip cleanly. |
| 5 | Per-type extractor | One Findings extractor for your dominant document type. | ✓ if every emitted Claim has correct inference_type and a non-empty gaps array (R4, R5). |
| 6 | Refutation harness | Wire up at least three of the five challenges; document declines for the rest with machine-readable reasons. | ✓ if every Refutation block satisfies R6. |
| 7 | Indexing + Search | BM25 + dense + RRF; emit a SearchAttestation. | ✓ if a SearchAttestation walks back to its supports cleanly. |
| 8 | Audit + Auditor | Hash-chain audit log; produce an Admissibility Auditor report attestation. | ✓ if the Auditor's checks all run. |
| 9 | Standalone verifier | Implement a verifier in a second language (Rust or Go). | ✓ if it agrees byte-for-byte with the Python reference on 100 attestations. |
| 10 | Memo + defense | 4–8 page design memo + 30-minute oral defense. | ✓ if your reviewer can replicate one of your attestations end-to-end during the defense. |

## The design memo

The memo answers, in 4–8 pages:

1. Corpus. What it is, who custodies it, who the recipient is, what the threats are.
2. Schema deviations from reference. What you added or removed, and why.
3. Per-type extractors. Which document types you handled; which you declined and why.
4. Refutation harness. Which challenges you applied; which you declined; the machine-readable reasons.
5. Domain-specific risks. What the recipient should know that doesn't fit in the attestation itself.
6. Open questions. What you would do next, given more time.

The memo is what the postmortem from research/06_textbook_craft.md calls the artifact that converts a build into learning. It is not optional.

## The oral defense

30 minutes. The reviewer:

1. Reads your memo in advance.
2. Selects one attestation from your repository at random.
3. Walks it back to its originating Observations using your verifier.
4. Asks you to identify the three weakest claims in your system, and what you would do to strengthen them.
5. Asks you to identify three things your declined-coverage entries do not cover, and what would have to change for them to cover those things.

The defense is not a presentation of your work.
It is a stress test of your understanding. Prepare to explain trade-offs you made, especially the ones you wish you hadn't.

## ▼ Why It Matters: Who actually reads what you build

> Every capstone student builds a system for a recipient. The recipient
> is not the instructor. The recipient is the person on the other side of
> a dispute, an investigation, a disclosure, or a transaction: the party
> who will receive your attestations in discovery, or the auditor who will
> check them, or the judge who will rule on whether they are admissible.
>
> A system without a real recipient is an exercise. A system with one is
> evidence work. The discipline the book has tried to instill (verifiability
> without issuer cooperation) means nothing until it is tested by someone
> who has a reason to distrust you. The oral defense simulates this, but
> only imperfectly. The most valuable capstones are the ones where a real
> recipient has agreed in advance to receive and attempt to verify the output.
> The most educational moment is when the verifier fails on their machine.

## ◆ Going Deeper: Scaling beyond 10,000 documents

> The capstone target is $10^3$–$10^4$ documents. Real investigations are larger.
> The ICIJ Panama Papers corpus was 11.5 million documents; the Enron corpus
> used in legal discovery ran to 619,446 emails. What breaks at scale?
>
> Hash-chain audit log. The trigger-based chain in schema/10_core.sql
> serializes inserts. At high insert rates, the advisory lock becomes a
> bottleneck. At scale, the chain is typically computed asynchronously
> by a dedicated worker and verified periodically rather than per-insert.
>
> HNSW index rebuild time. At $10^6$ chunks, HNSW index construction
> takes hours. Partition by matter or time window; build sub-indexes and
> merge at query time using RRF over sub-results.
>
> Attestation JSON size.
> At $10^6$ Observations, the supports graph
> becomes too large to store inline. The pattern is Merkle-proof attestation:
> each Observation stores its Merkle leaf; the SearchAttestation stores the
> root and a proof path, not all leaves.
>
> These are Phase D–H concerns, outside the capstone scope. The point is:
> what you build for $10^4$ documents is the specification of what the
> scaled version must conform to. Conformance, not optimization, is the
> capstone's mandate.

## What your capstone is not required to do

- Cover every source the reference Meridian-Canon covers. One source thoroughly is better than ten sources superficially.
- Use the same models. If a smaller or different model fits your corpus, use it; document the choice.
- Implement Phase G (BriefAttestation) or Phase H (full standalone verifier) to production quality. A working subset is fine.
- Beat the reference implementation on benchmarks. Your goal is conformance, not optimization.

## What your capstone must do

- Produce attestations that pass R1–R9, sealed with emit_dsse().
- Walk back cleanly through the supports graph to Observations.
- Verify byte-for-byte across two language implementations (DSSE PAE computation must agree).
- Surface declined-coverage entries that honestly describe what your system does not test, and why.
- Survive the oral defense without recourse to "I'll fix that later."

## Reviewer's rubric (for self-grading)

A passing capstone:

- Every emitted attestation is R1–R9 conformant.
- The supports graph is acyclic, finite, and walkable.
- The cross-language verifier agrees with the Python reference on every test attestation.
- The Refutation block of every EnrichmentAttestation has a non-trivial declined list with machine-readable reasons.
- The design memo identifies at least three load-bearing trade-offs the student understood and made deliberately.
- The oral defense surfaces no "I didn't think of that" moments on questions the rubric anticipates.
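
The rubric item about an acyclic, finite, and walkable supports graph can be checked mechanically. The following is a minimal sketch, not the reference walker: it assumes the graph has already been loaded into a plain dictionary mapping each node id to the ids it is supported by, with Observations represented as nodes that have an empty supports list. The names `walk_to_observations` and `SupportsGraphError` are hypothetical.

```python
from typing import Dict, List, Set


class SupportsGraphError(ValueError):
    """Raised when the supports graph has a cycle or a dangling reference."""


def walk_to_observations(supports: Dict[str, List[str]], root: str) -> Set[str]:
    """Return the leaf ids reachable from root.

    Assumption (hypothetical representation): `supports` maps a node id to
    the ids it is supported by, and Observations have an empty list.
    """
    leaves: Set[str] = set()
    done: Set[str] = set()      # nodes whose subtree is fully explored
    on_path: Set[str] = set()   # nodes on the current depth-first path

    def visit(node: str) -> None:
        if node in done:
            return
        if node in on_path:
            raise SupportsGraphError(f"cycle through {node}")
        if node not in supports:
            raise SupportsGraphError(f"dangling reference to {node}")
        on_path.add(node)
        children = supports[node]
        if not children:
            leaves.add(node)    # no supports of its own: treat as an Observation
        for child in children:
            visit(child)
        on_path.remove(node)
        done.add(node)

    visit(root)
    return leaves


example = {
    "search-1": ["claim-1", "claim-2"],
    "claim-1": ["obs-1"],
    "claim-2": ["obs-1", "obs-2"],
    "obs-1": [],
    "obs-2": [],
}
print(sorted(walk_to_observations(example, "search-1")))  # ['obs-1', 'obs-2']
```

A dictionary is finite by construction, so the sketch only has to detect cycles and references to nodes that do not exist. The recursion is fine at capstone scale; a very deep graph would call for an explicit stack instead.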
A strong capstone goes further:

- Implements a challenge type not in the reference (e.g., a domain-specific consistency check).
- Proposes a Canon spec amendment with justification.
- Identifies a real-world recipient (counsel, auditor, regulator) who has agreed to receive and verify the attestations.
- Publishes the standalone verifier as an open-source package.

## What success looks like

A completed capstone produces four artifacts:

- A working system, scoped to your corpus, that emits Canon-conformant artifacts.
- A second-language verifier that agrees byte-for-byte with the Python reference.
- A memo that names the trade-offs you made.
- A defense where no question on the rubric produces "I didn't think of that."

## ☉ In the Wild: The Panama Papers pipeline

> In April 2016, the International Consortium of Investigative Journalists published findings from 11.5 million documents leaked from Panamanian law firm Mossack Fonseca. The ICIJ did not build one big search box. They built a domain-specific evidence system.
>
> The corpus included emails, PDFs, spreadsheets, and images across 40 years and dozens of jurisdictions. The ICIJ's platform, eventually open-sourced as Aleph, had to handle format detection, per-type extraction, entity recognition, and cross-document entity resolution at a scale that required partitioning the corpus by sensitivity tier (some documents were embargoed for coordinated publication; others were immediately public).
>
> Several choices the ICIJ made map directly to capstone design decisions: they hashed every source document on receipt and stored originals separately from derived objects; they used a tiered access model so that reporters in one country could not see documents assigned to another country's investigation until publication day; they maintained a chain of custody from leak receipt through publication to answer questions about how each document entered the archive.
>
> The ICIJ did not use a Canon attestation format. The lesson of the Panama Papers is not that Aleph was wrong: it was built for speed and journalist usability, and it shipped. The lesson is that the choices the ICIJ made under deadline pressure (hash-on-receipt, tier isolation, custody records) are the same choices Canon makes mandatory. A Canon-conformant Aleph would have produced an independently verifiable record; the actual Aleph produced a trustworthy one, but "trust us" and "verify without trusting us" are different postures. Both are legitimate choices; the capstone is training you to make the second one.
>
> Source: Obermaier and Obermayer, Panama Papers (2017); Aleph platform, https://aleph.occrp.org.

## Build-your-own (this is the prompt)

Begin Week 1 now. Write the one-page memo:

- Corpus: ___
- Custodian (use a non-personal identifier, e.g. "acme-corp-2026"): ___
- Recipient: ___
- Three threats to verifiability: ___
- Three challenges you anticipate declining, with machine-readable reasons: ___

Save this memo. The remaining nine weeks are an elaboration of it.
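
Because the memo asks for machine-readable decline reasons, it can help to keep a structured copy next to the prose version. The sketch below is one possible shape, not a Canon-defined format: the class names, the `reason_code` field, the slug pattern for the custodian, and every value in the example are assumptions made for illustration.

```python
import json
import re
from dataclasses import asdict, dataclass
from typing import List

# Hypothetical convention: a lowercase slug such as "acme-corp-2026".
CUSTODIAN_RE = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)*$")


@dataclass
class DeclinedChallenge:
    challenge: str     # name of the challenge you expect to decline
    reason_code: str   # short machine-readable code (hypothetical vocabulary)
    detail: str        # one-sentence human explanation


@dataclass
class ScopeMemo:
    corpus: str
    custodian: str
    recipient: str
    threats: List[str]
    declined: List[DeclinedChallenge]

    def problems(self) -> List[str]:
        """List the ways this memo falls short of the Week 1 prompt."""
        found = []
        if not CUSTODIAN_RE.match(self.custodian):
            found.append("custodian should be a non-personal slug")
        if len(self.threats) != 3:
            found.append("expected exactly three threats")
        if len(self.declined) != 3:
            found.append("expected exactly three declined challenges")
        if any(not d.reason_code for d in self.declined):
            found.append("every declined challenge needs a reason_code")
        return found

    def to_json(self) -> str:
        # sort_keys keeps the serialized memo stable across runs
        return json.dumps(asdict(self), indent=2, sort_keys=True)


memo = ScopeMemo(
    corpus="placeholder corpus description",
    custodian="acme-corp-2026",
    recipient="placeholder recipient",
    threats=["threat one", "threat two", "threat three"],
    declined=[
        DeclinedChallenge("challenge-a", "NO_GROUND_TRUTH", "placeholder"),
        DeclinedChallenge("challenge-b", "OUT_OF_SCOPE", "placeholder"),
        DeclinedChallenge("challenge-c", "INSUFFICIENT_DATA", "placeholder"),
    ],
)
print(memo.problems())  # [] when the memo satisfies the checks above
```

The structured copy does not replace the one-page memo; it only makes the decline reasons easy to carry forward into the coverage entries you write in later weeks.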
    💡Key Takeaways
    - The capstone has a three-milestone deliverable structure: by Week 4 you have a working corpus ingester emitting sealed ObservationAttestations; by Week 7 a complete enrichment and refutation pipeline; by Week 9 a second-language verifier that agrees byte-for-byte with the Python reference.
    - emit_dsse() is required (not emit()) because the capstone must demonstrate DSSE PAE computation: the cross-language verifier check cannot verify canonical byte equality if the attestation was sealed without a DSSE envelope.
    - The cross-language verifier check proves that PAE is deterministic across implementations: the same payload_type and payload bytes produce the same PAE input, so the same Ed25519 signature verifies in both Python and Go (or Rust) without any implementation-specific adjustment.
    - The minimum viable attestation that passes conformance carries: a valid canon_version, an attestation_id matching the ULID pattern, at least one WitnessEntry with content_hash and either content_ref or content_inline, at least one Claim with inference_type and non-empty supports, at least one Challenge, and a coverage object with declined entries for every unchallenged type.
    - A custodian memo documents: the key generation event, the custodian name chosen, the public PEM URL, the key rotation schedule, and, for each rotation, a signed rotation attestation linking old and new fingerprints so any verifier can trace continuity across the key lifecycle.
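
The PAE determinism named in the takeaways comes from how the DSSE specification builds the bytes that get signed. The sketch below follows the pre-authentication encoding as the public DSSE specification defines it; whether emit_dsse() uses exactly this helper internally is an assumption, and the function name `pae` is chosen here for illustration. The payload type and payload in the check are the DSSE specification's own example values, not Meridian ones.

```python
def pae(payload_type: str, payload: bytes) -> bytes:
    """DSSE pre-authentication encoding.

    PAE(type, body) = "DSSEv1" SP LEN(type) SP type SP LEN(body) SP body,
    where LEN is the byte length written as ASCII decimal.
    """
    type_bytes = payload_type.encode("utf-8")
    return b" ".join(
        [
            b"DSSEv1",
            str(len(type_bytes)).encode("ascii"),  # length in bytes, not characters
            type_bytes,
            str(len(payload)).encode("ascii"),
            payload,
        ]
    )


# Example values taken from the DSSE specification.
encoded = pae("http://example.com/HelloWorld", b"hello world")
assert encoded == b"DSSEv1 29 http://example.com/HelloWorld 11 hello world"
```

Because the signature is computed over these bytes rather than over re-serialized JSON, a Go or Rust verifier only needs to reproduce this concatenation exactly to agree with the Python reference.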
## Further reading

- The Meridian-Canon spec v0.2.0, end to end. By the time you reach the capstone, every section should read as commentary on something you have implemented.
- The dossiers in research/. Each one informs at least one capstone decision.
- Other students' capstones, where available: peer review is the most underused learning mechanism in CS education.

Next: Chapter 28 — Customization Patterns.