Part III — System Architecture · Chapter 20
The Seven-Layer Pipeline
The Seven-Layer Pipeline
Architecture is the choice of where to put the seams. Meridian-Cannon's seams are the seven layer boundaries — each is an audit point, a checkpoint for idempotency, and a place where the question "what happened here?" has a definitive answer.
ℹPrerequisites▼
Before reading this chapter, you should be comfortable with: All of Part II (Chapters 5–12). The pipeline assembles all CS primitives into a coherent system — you need to know each component before seeing how they compose.
Every evidence system has a pipeline: raw source material comes in, processed artifacts come out. The question is not whether you have layers. You always do. The question is whether your layers are explicit — with defined inputs, defined outputs, and a way to prove that what came out matches what went in — or whether they are implicit, nested inside functions that do five things at once and leave no record of any of them.
Meridian-Cannon uses seven explicit layers. This chapter traces a single text message from Isabel's phone export to a sealed Canon attestation, identifying exactly which layer produced which record and why the boundaries fall where they do.
At a glance
- The seven layer boundaries are not aesthetic — each boundary marks a qualitatively different failure mode with a different detection and repair strategy.
- Every layer has a defined input/output contract: a layer consumes a job and row type from the prior layer and produces a job and row type for the next; violating the contract is detectable at the boundary.
- The pipeline is idempotent and hash-chained: the same source bytes always produce the same acquisition hash, and any retry stops at the idempotency check rather than creating duplicate records.
%%| label: fig-pipeline
%%| fig-cap: "The seven-layer Canon pipeline: L0 acquisition through L6 sealed attestation"
flowchart TD
L0["**L0 — Acquisition**\nRaw source bytes\nAudio · Email · PDF · SMS"]
L1["**L1 — Format Detection**\nMIME type · language · encoding\nArtifact: typed source record"]
L2["**L2 — Raw Extraction**\nText + metadata extraction\nArtifact: source rows in DB"]
L3["**L3 — Chunking**\nSection-aware partitioning\nArtifact: chunk rows + embeddings"]
L4["**L4 — Enrichment**\nNER · entity resolution · TARG\nArtifact: enriched claim set"]
L5["**L5 — Adversarial Validation**\nTri-Model Consensus refutation\nArtifact: Refutation block"]
L6["**L6 — Attestation Emission**\nDSSE envelope · Seal · Rekor\nArtifact: signed Attestation"]
L0 --> L1 --> L2 --> L3 --> L4 --> L5 --> L6
style L0 fill:#1A2E55,color:#fff
style L1 fill:#1A2E55,color:#fff
style L2 fill:#1A2E55,color:#fff
style L3 fill:#1A2E55,color:#fff
style L4 fill:#C99E3E,color:#14181F
style L5 fill:#C99E3E,color:#14181F
style L6 fill:#10b981,color:#fff
| Layer | Input | Primary artifact | Failure symptom |
|---|---|---|---|
| L0 Acquisition | Raw bytes from source | source DB row, SHA-256 hash | Missing or corrupt source |
# workers/litdb.py >>> ch13 start
def put_content_addressed(
raw_bytes: bytes, source_kind: str, ext: str = "bin", bucket: str = "raw-evidence"
) -> tuple[str, str, int]:
"""Hash, store under raw/<source_kind>/YYYY-MM/<sha>.<ext>, return (uri, sha, size)."""
sha = hashlib.sha256(raw_bytes).hexdigest()
yyyymm = time.strftime("%Y-%m")
key = f"raw/{source_kind}/{yyyymm}/{sha}.{ext.lstrip('.')}"
# ... storage dispatch ...
return uri, sha, len(raw_bytes)
# workers/litdb.py <<< ch13 end
The function returns the SHA-256 hex, the storage URI, and the byte count. All three are written to the acquisitions table. When the L1 worker picks up this job, it receives the acquisition ID; it can look up the hash and verify the file has not changed since L0 wrote it. If the hash does not match, the job fails — not silently, with a dead-letter entry in the queue. Idempotency at L0 is enforced by the acquisitions table's raw_sha256 column combined with ON CONFLICT logic in upsert_document. If the same file is submitted twice — the same export from the same phone on the same day — the SHA-256 matches, and no new document row is created. The acquisition row does get created (recording that the source arrived again), but the downstream workers will not reprocess. > ◆ Going Deeper — Why hash before processing, not after. > > A processing step that modifies content before hashing — BOM stripping, line-ending normalization, charset transcoding, timestamp normalization — makes the recorded hash unreproducible from the original file. You cannot take the file as received, hash it, and get back the stored value. The chain is broken before it started. > > This is not a theoretical concern. Many eDiscovery platforms normalize files on ingest (converting Windows line endings to Unix, stripping the UTF-8 BOM, converting CRLF to LF). When a recipient tries to verify that the exhibit matches the original export, the hashes will not match. The platform cannot explain the discrepancy without reconstructing the normalization step — which requires documentation that was never kept. > > The Meridian-Cannon rule is strict: hash the raw bytes first, store the raw bytes at the hashed path, then process. Processing is idempotent because the input (the raw bytes) is stable. The hash at acquisitions.raw_sha256 is always the hash of the file as it was received, not as it was transformed. One acquisition row is written per fetch event. record_acquisition in litdb.py writes the row and then calls audit() — two records in two tables, both created in the same database transaction. Either both commit or neither does. ## L1 — Format Detection and Routing L1 takes an acquisition ID, reads the raw bytes from storage, and decides where they go next. "Decides" means three things in sequence: check the declared MIME type from the source metadata; run python-magic against the raw bytes to get the detected MIME type; if they conflict, prefer the detected type with a warning in the job payload. The documents table stores both: declared_mime_type and detected_mime_type. Discrepancies are not failures — they are information. An email server that says application/octet-stream when the bytes are a PDF is a discrepancy worth recording. The detected_format_family column collapses MIME subtypes into processing families: pdf, office_doc, image, audio, rfc822_eml, json_export, and so on. This is what the L1 worker uses to dispatch: it enqueues a kind-specific L2 job based on the format family, not the raw MIME type. The actual routing table lives in workers/jobs/dispatch_document.py (the _ROUTING dict, keyed to document-level format families). The conceptual pattern looks like this:
# Conceptual dispatch pattern — see workers/jobs/dispatch_document.py for
# the actual _ROUTING dict and handle() implementation.
FORMAT_TO_L2_KIND = {
"rfc822": "email_parse",
"pdf": "pdf_chunk",
"audio": "audio_ingest_from_doc",
"image": "image_chunk",
"text": "text_chunk",
"archive": "archive_extract",
}
def dispatch_l2(document_id: str, format_family: str, job_id: int) -> None:
kind = FORMAT_TO_L2_KIND.get(format_family, "extract_unknown")
enqueue(kind, {"document_id": document_id}, parent_job_id=job_id)
The parent_job_id parameter is important. The job queue in litdb.py tracks parent-child relationships between jobs. An L0 job that spawns an L1 job that spawns an L2 job creates a chain: failure at any point is traceable to its parent. A complete pipeline run for one source file is one tree of jobs in the jobs table. Format families that have no registered L2 worker get extract_unknown. The system does not drop them: it queues them for manual review. In a litigation context, "unprocessed" is not the same as "unimportant." An unknown format may be exactly the format that matters most. > ☉ In the Wild — The Enron email corpus and the cost of ad-hoc pipelines. > > When FERC subpoenaed Enron's email after the company's 2001 collapse, the processing pipeline was improvised. There was no layer separation. Files were processed with different tools by different contractors at different times. The same email might be extracted, indexed, and produced multiple times with different results — different character encodings, different attachment handling, different date parsing. Privilege review was conducted on a corpus that no one could certify was stable. > > The chaos contributed to a discovery process that stretched over six years. The FERC investigation alone generated 1.5 million documents. Enron's collapse became a case study in what happens when you treat evidence processing as a scripting problem rather than an architecture problem. > > Meridian-Cannon's L0–L1 layer boundary is the architectural answer to Enron's first failure: record every source file's identity before processing begins, and route it deterministically to exactly one worker based on format. The same input always goes to the same place. The audit trail is not a post-hoc reconstruction; it is the processing record itself. ## L2 — Raw Extraction L2 workers are the source-specific specialists. They extract human-readable content from the raw bytes and write records — one per logical unit of the source — without making any claims about what the content means. L2 answers "what does the file contain." L4 answers "what does the content mean." A text message says "I need to reschedule for next week." L2 records: timestamp, sender, recipient, body text. L4 identifies "next week" as a temporal expression and the message as a visitation-rescheduling communication. Separate layers mean L2 workers are simple, fast, and testable in isolation. For email, L2 is a MIME parser. The MIME tree is walked recursively: each text/plain or text/html part becomes a record; each attachment becomes a document row (with its own SHA-256 hash) linked via attachment_relations to the parent email's document row. The attachment_relations.parent_attachment_id column handles nesting — an email with a ZIP attachment that contains a PDF creates three document rows and two attachment relation rows. For iMessage exports, L2 is a chat log parser. Apple's iMessage export format (whether SQLite, XML, or .ips crash format depending on the extraction tool used) contains individual messages with timestamps, sender handles, and thread identifiers. Each message becomes one row in the appropriate communications table, linked back to the acquisition. For audio, L2 is the transcription worker: it calls the local Whisper model, segments the audio, and writes transcript chunks with start_ms and end_ms timestamps. The transcript is stored in chunks with recording_id as the FK. The raw audio bytes remain at their L0 storage path, unchanged. The invariant for every L2 worker: never modify the original bytes; never write derived content without the source acquisition FK; never skip the audit() call for each record created. ## L3 — Section-Aware Chunking L3 is where the text gets divided into retrieval units. The rule is: never use naive fixed-size chunking. A 512-token window that splits a paragraph in the middle, or a court order at the signature line, or a text message thread mid-conversation, produces chunks that are not atomic units of meaning. A retrieval system that cannot retrieve a complete unit of meaning is a retrieval system that misses things. The chunks table reflects the principle. The chunker column records which chunker produced the row: pdf_layout, email_mime, message_window, heading_split, paragraph_split. Each chunker knows the natural unit of its modality. For PDF, that unit is a layout block (detected by the PDF parser's coordinate system). For email, it is a MIME part. For a message thread, it is a single message — because an iMessage is already atomic. You do not split it further.
-- 30_documents.sql — chunks table (excerpt)
CREATE TABLE chunks (
id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
document_id uuid NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
parent_chunk_id uuid REFERENCES chunks(id),
chunker text NOT NULL,
chunker_version text NOT NULL,
modality text NOT NULL,
ordinal int NOT NULL,
section_path text,
page_range int4range,
char_offsets int4range,
text text NOT NULL,
text_tsv tsvector GENERATED ALWAYS AS (to_tsvector('english', text)) STORED,
...
);
The parent_chunk_id column supports hierarchical chunking: a chapter can be one chunk whose children are its sections. A retrieval query can retrieve the section and navigate up to the chapter when the section alone is insufficient. The section_path column stores the human-readable path: "Introduction > Background > Prior Proceedings". The text_tsv column is a generated tsvector. Full-text search over all chunks in the database requires no separate indexing step: inserting a chunk triggers index maintenance automatically. Idempotency at L3 is enforced by checking whether chunks already exist for a document ID before running the chunker. If any chunks exist, the worker skips the document. This is intentional: L3 produces a stable partition of the document. Re-chunking with the same chunker version should produce the same partition, but the guarantee is enforced by skipping rather than by comparison. > ▼ Why It Matters — Atomicity is not about aesthetics. > > In the TPR case, the attorney has fifteen minutes to find ten specific text messages before a hearing. The retrieval system returns chunks. If the chunks are arbitrary 512-token windows, the attorney gets fragments: "...visitation on March 15 would be fine with me, but I need" — and the rest of the message is in the next chunk, which may or may not be in the results. If the chunks are individual messages, the attorney gets complete messages. The difference between a retrieved fragment and a retrieved message is the difference between "I found something" and "I found what I was looking for." ## L4 — Enrichment L4 operates on chunks. Its outputs are three things: entities extracted from the text, embeddings for semantic search, and structured fields extracted from the chunk's content. Entity extraction uses a local NER model. The entity_mentions table records each mention: the chunk ID, the entity ID, the character span within the chunk, the surface text, the confidence score, and the model version. The entity ID points to a row in entities, which is the canonical record for a real-world referent — a person, organization, date, monetary amount, or location. Multiple mentions across multiple chunks may resolve to the same entity: "Isabel," "Isabel Fontaine," and "I.F." all resolve to one entities row. Embeddings are written to the embeddings table. The primary model is BAAI/bge-large-en-v1.5, producing 1024-dimensional vectors. The vector lives in the embeddings table alongside the chunk ID, model name, model version, and computed timestamp. If the embedding model changes in the future, the old embeddings remain — the new model gets its own rows. Retrieval can use either model's embeddings, or both via reciprocal rank fusion. The embeddings table uses an HNSW index (vector_cosine_ops, m=16, ef_construction=64). HNSW (Hierarchical Navigable Small World) is an approximate nearest-neighbor graph structure that retrieves the top-k most similar vectors to a query in sub-linear time. The m=16, ef_construction=64 parameters control the graph's connectivity and construction-time recall; see Chapter 9 for the full configuration. At the scale of one litigation matter — tens of thousands of chunks, not hundreds of millions — HNSW returns approximate nearest neighbors with recall above 0.95 at sub-10ms query times. That is sufficient for interactive retrieval. L4 also performs structured extraction for source types that have defined schemas: financial records produce transaction amounts and account numbers; court documents produce case numbers, parties, and hearing dates; communications produce sender, recipient, and message direction. These go into the metadata JSONB column on the chunk row, not into separate tables. JSONB is appropriate here because the schema is per-modality: a financial chunk and a court-order chunk have different structured fields, and the JSONB column accommodates both without requiring a new migration for each source type. > ◆ Going Deeper — The embedding model choice. > > bge-large-en-v1.5 was chosen over OpenAI's text-embedding-3-large (3072 dimensions) and Nomic's nomic-embed-text-v1.5 (768 dimensions) on three grounds: (1) it runs locally with no API key, preserving attorney-client privilege; (2) its MTEB score on legal retrieval benchmarks is competitive with commercial models at 1024 dimensions; (3) 1024 dimensions is a stable target for pgvector's HNSW index without the memory overhead of 3072-dim vectors. > > If this decision is ever revisited, the change must be documented in docs/divergences.md with the benchmark that motivated it. The existing 1024-dim embeddings remain in the embeddings table. Do not retroactively alter the model name on existing rows. ## L5 — Adversarial Validation L5 is where claims are challenged before they are sealed. Every Canon attestation must include at least one Challenge in its Refutation block (R6 requirement). The Refutation block is not a formality. It is what separates an attestation from a report. A report says "here is what I found." An attestation says "here is what I found, here is the strongest argument against it, here is why I still believe it given that argument, and here is the inventory of challenges I considered but declined to run." L5 implements this mechanically: the Tri-Model Consensus protocol (Chapter 12) presents each claim to three independent model instances with system prompts configured for different adversarial postures. The three responses are compared. Where they agree, the claim is stable. Where they disagree, the disagreement is itself recorded as a Challenge in the Refutation block. For the running case: the claim "the parent requested visitation on March 15" is extracted from a text message at L4. At L5, three model instances are asked: is this characterization accurate given the exact text? Do any of the following alternative interpretations fit? Is there ambiguity in the phrase "next week" that could place the request on a different date? The responses are compared. If two models agree and one does not, the disagreement is noted in the Challenges block with the dissenting interpretation. The claim survives if two or more models produce consistent characterizations. The coverage.declined inventory in the attestation records challenge types that were considered but not run — because they were inapplicable to this source modality, because they were structurally blocked by the content type, or because running them would require information not available to the system. The Canon spec (R6) requires this inventory to be present. A recipient who reads a coverage.declined entry saying "semantic_ambiguity: not applicable to timestamp extraction" knows that the system deliberately did not challenge timestamp extraction and can decide whether that is adequate. ## L6 — Attestation Emission L6 is the final layer. It takes the outputs of L0–L5 and produces a sealed Canon attestation. The emission follows the protocol in meridian/canon/emit.py: build the attestation object from the chunk, entities, embeddings, and L5 validation results; canonicalize it under RFC 8785; compute the SHA-256 chain hash; sign with the issuer's Ed25519 private key; write the sealed attestation to the attestations table in schema/A0_attestations.sql; write one audit row; enqueue the attestation ID for downstream consumers (CASEFORUM, the admissibility auditor). The chain hash is what links the sealed attestation to the pipeline run. It is computed over the entire attestation object (with the seal field excluded). Any change to any field — the claim text, the entity mentions, the witness content, the Refutation block — changes the chain hash, invalidates the signature, and causes the seven-step verifier to fail at step 3. An adversary who wants to modify the exhibit without detection must forge the Ed25519 signature, which requires the issuer's private key. Every L6 emission also updates the documents.parser_status column from in_progress to parsed. This is the completion signal for the pipeline run: a document with parser_status = 'parsed' has a complete trail from L0 acquisition to L6 sealed attestation. ## Walking the Running Case: Isabel's Text Message One message, all seven layers. The source: Isabel's iMessage export, a SQLite database exported from her iPhone during device forensics review. The export covers March 1–April 30, 2026. L0. The worker reads the SQLite file as raw bytes. put_content_addressed computes the SHA-256: a3f7b2.... The file is stored at raw/sms_thread_export/2026-03/a3f7b2....db. record_acquisition writes the acquisition row with method='device_forensics_export', legal_basis='owner_consent', raw_sha256='a3f7b2...'. The audit() call writes action='acquisition_created' to audit_log. The L0 job is marked succeeded. A new job of kind detect_format is enqueued with parent_job_id pointing to the L0 job. L1. The L1 worker claims the detect_format job. It reads the raw bytes, calls python-magic, gets application/x-sqlite3. Format family: sms_thread_export. detected_mime_type is written back to the document row. A new job of kind extract_imessage is enqueued. Audit row: action='format_detected', payload={'format_family': 'sms_thread_export'}. L2. The extract_imessage worker claims the job. It opens the SQLite file using the iPhone backup schema (message, chat, handle tables). It iterates every message in the thread between Isabel and the parent. For the March 15 message — "I need to move our visit to next week, any day is fine" — it writes: timestamp 2026-03-15 09:23:41 UTC, sender handle +17155551234, recipient handle +17155555678, body text, thread ID. One row in messages. One audit row: action='message_extracted'. The extraction is idempotent: before writing, the worker queries for existing messages rows with source_hash = sha256(message_body + timestamp + sender). If a row exists, the message is skipped. L3. The chunker worker claims the chunk_messages job. For iMessage, each message is its own chunk — the chunker is message_window, and the window is exactly one message. A single chunks row is written: text = "I need to move our visit to next week, any day is fine", ordinal = 14 (within the thread), message_id = <the message UUID>. The text_tsv column is populated automatically by the generated expression. Audit row: action='chunk_created'. L4. The enrichment worker claims the enrich_chunk job. NER identifies: Isabel (person, sender), parent (person, recipient), next week (temporal expression, relative to March 15), visit (event). Entity rows are upserted; mention rows are written with span offsets. The embedding worker runs bge-large-en-v1.5 over the chunk text and writes a 1024-dim vector to embeddings. Structured extraction notes this is a visitation_rescheduling_request in the metadata JSONB field. Three audit rows. L5. The validation worker claims the validate_chunk job. Three model instances receive the chunk text and the claim "the sender requested to reschedule a visit to next week." Two models agree: the claim is accurate. One model notes: "the phrase 'any day is fine' is an expression of flexibility, not a specific request for a date." The dissent is recorded as a Challenge in the Refutation block with challenge_type='semantic_interpretation'. The claim survives (2/3 consensus). Coverage records declined: ['date_arithmetic', 'speaker_identification'] — the system did not attempt to resolve "next week" to specific calendar dates, and did not run speaker de-identification checks. Audit row: action='validation_completed'. L6. The emission worker claims the emit_attestation job. It builds the Canon attestation: Witness block (the message chunk, its content hash), Findings block (the claim with its supports reference), Refutation block (the semantic_interpretation Challenge and its declined inventory). It canonicalizes, hashes, signs, writes the sealed attestation to attestations, writes the final audit row: action='attestation_emitted'. documents.parser_status flips to parsed. The complete pipeline for one text message: seven layers, approximately twelve audit rows, one sealed attestation that any recipient can verify without contacting the issuer. > ✻ Try This — Verify idempotency with a five-email thread. > > Take a MIME .eml file containing a thread of five emails (any five-email thread from your own mail client will do). Run it through the first three layers manually: > > 1. Compute the SHA-256 of the file: sha256sum yourfile.eml on Linux/macOS or certutil -hashfile yourfile.eml SHA256 on Windows. Record the hex. > 2. Open the file in a MIME parser (Python's email.message_from_bytes is one line). List the parts: how many text/plain or text/html parts does it have? How many attachments? > 3. For each attachment, extract the bytes and compute its SHA-256. You now have a hash for each discrete artifact in the MIME tree. > > Now run it again. The hashes should be identical — SHA-256 of the same bytes is always the same value. > > The exercise demonstrates: the idempotency check at L0 (comparing raw_sha256 against the stored hash) is the same operation you just did manually. If the hash matches a row already in acquisitions, the pipeline correctly skips it. If you modify even one byte of the file before re-running, the hash changes — and the system creates a new acquisition, correctly treating the modified file as a different source. ## The Job Queue as the Pipeline's Spine The job queue in litdb.py is the mechanism that connects the layers. Each layer is not a function call chained to the next; it is a job that claims work, does it, and enqueues the next job before marking itself complete.
# workers/litdb.py — job queue core
def claim_next(kind: str, *, lease_seconds: int = 300) -> Optional[Job]:
with conn() as c, c.cursor() as cur:
cur.execute("SELECT * FROM jobs_claim_next(%s, %s, %s)", (kind, WORKER_NAME, lease_seconds))
row = cur.fetchone()
if not row or row["id"] is None:
return None
return Job(id=row["id"], kind=row["kind"], payload=row["payload"], attempts=row["attempts"])
jobs_claim_next is a stored procedure that uses SELECT FOR UPDATE SKIP LOCKED — claiming exactly one job of the requested kind without blocking other workers claiming different jobs. Two enrichment workers running in parallel will never claim the same job. The job is leased for lease_seconds (default 300). If the worker dies mid-job, the lease expires, and the job becomes claimable again. The attempts counter increments each time a job is claimed. After max_attempts, the job moves to dead_letter — not dropped, just parked for human review. In a litigation context, a dead-lettered job is evidence of a processing failure, and the pipeline's audit trail is the record of what was tried. The parent_job_id chain makes the complete pipeline run for one source file one tree, queryable with a recursive CTE. When something goes wrong at L4, follow the parent chain from the failed job back to the original L0 acquisition. No timestamp search needed. > § For the Record — FRE 803(6), Records of regularly conducted activity. > > "A record of an act, event, condition, opinion, or diagnosis if: (A) the record was made at or near the time by — or from information transmitted by — someone with knowledge; (B) the record was kept in the course of a regularly conducted activity of a business, organization, occupation, or calling, whether or not for profit; (C) making the record was a regular practice of that activity; (D) all these conditions are shown by the testimony of the custodian or another qualified witness, or by a certification that complies with Rule 902(11)." > > The job queue records, the audit log, and the acquisitions table are created at or near the time of processing (A), kept in the course of the regular pipeline operation (B), created by the trigger functions and worker code as a regular practice of every pipeline run (C). The witness who testifies to conditions A–C is the system's engineer or custodian. The seven-layer pipeline is designed so that the custodian's testimony is supported by records that are self-consistent and complete, not reconstructed from memory. ## Failure Modes, by Layer Each layer has a characteristic failure. Knowing which failure belongs to which layer is how you debug a pipeline problem without reading every line of every worker. L0 failures are storage failures: the raw bytes could not be written, or the hash computation failed, or the acquisition row could not be inserted. The source is not in the system. Nothing downstream exists. L1 failures are format misclassification: the file was routed to the wrong L2 worker. Symptom: the extraction produces garbage or fails silently. Repair: re-route the acquisition to the correct worker after updating the format family. L2 failures are extraction failures: the content could not be parsed (corrupt file, unsupported encoding, nested MIME depth exceeded). The document row exists with parser_status = 'failed'. No chunks exist. The file requires manual intervention or a specialized parser. L3 failures are chunking failures: the chunker crashed or produced zero chunks. The document row exists, chunks do not. The most common cause is a chunker that assumes a text encoding that does not match the actual encoding. L4 failures are enrichment failures: NER crashed, the embedding model returned an error, or the structured extraction produced malformed JSON. The chunks exist; the enrichment records do not. The pipeline can be resumed by re-running just the L4 job for the affected chunks. L5 failures are validation failures: one of the three model instances did not respond, or the consensus logic crashed, or the Refutation block could not be constructed. The chunks and enrichment records exist; the attestation is not sealed. The L5 job can be re-run without affecting any prior layer. L6 failures are emission failures: the signing key was not available, the attestation table was locked, or the Canon schema validation rejected the object. The attestation is not in the database. The L6 job can be re-run; the signing operation is deterministic given the same input, but the timestamp in the attestation will differ.
observation_attestation → refute_claims → generate_brief → seal_attestation) makes pipeline lineage explicit and supports retry and backfill without replaying upstream work. - Feature flags MERIDIAN_USE_PARADEDB and MERIDIAN_REKOR_ENABLED select alternate backends at the layer level without changing any other pipeline code. - Pipeline layers are strictly one-way: re-ingesting a source requires a new acquisition with a hash check, not a re-run of downstream layers over an existing acquisition row. declared_mime_type = 'application/pdf' and detected_mime_type = 'image/tiff'. Which type does L1 use to route the file? What does it record in the database? ### Core 3. Write a SQL query that lists all jobs in the jobs table that are children (direct or indirect) of a given L0 acquisition job, using a recursive CTE. The schema stores parent_job_id on each job row. 4. A text message export is processed twice — once on March 1 and once on March 10 (the same file was re-submitted). How many rows are created in acquisitions? How many in documents? Explain the difference. 5. A chunk has parser_status = 'parsed' on its parent document, but has no row in embeddings. What layer failed? What SQL query confirms this? ### Stretch 6. Implement a minimal L1 format-detection worker in Python: given a path to a file, compute its SHA-256, detect its MIME type using python-magic, and return the format family string. Handle at least three format families. 7. The Refutation block at L5 records declined: ['date_arithmetic']. Write a paragraph explaining what this means to an opposing-counsel expert, without technical jargon.
8. Design a recovery procedure for an L4 failure that affected all chunks from a specific source kind (e.g., all audio transcripts from one month). What SQL queries would you run to identify the affected chunks, and what worker command would you issue to re-enqueue the L4 jobs?
Build-Your-Own Prompt
For your capstone corpus: identify which source types you have and which L2 workers they require. If your corpus contains a format with no L2 worker — a source type that none of the existing workers handles — design the extraction interface: what does the worker receive (an acquisition ID), what does it write (one or more document rows, one or more messages or records), and what does it enqueue (one L3 job per document). Your capstone pipeline will be evaluated, in Chapter 27, by whether a complete seven-layer trace can be produced for any artifact in your corpus.
Software-Defined Assets: Dagster Integration (v0.2.0)
The seven layers can be implemented as a hand-coded job queue — the approach described throughout this chapter — or as a set of software-defined assets using Dagster. The Dagster integration is optional and installed via:
pip install meridian-canon[pipeline]
This extra also installs litellm, which is the LLM backend used at the pipeline layer to fan out calls across providers. litellm is intentionally scoped to the pipeline layer; it should not be used in per-attestation core library code. The four Dagster assets in meridian/pipeline/ map onto the seven Canon layers: | Asset | Layers covered | |---|---| | observation_attestation | L0 ingestion → L2 enrichment | | refute_claims | L2 → L4 adversarial refutation | | generate_brief | L4 → L5 findings brief | | seal_attestation | L5 → L6 sealed envelope | Each asset receives the upstream asset's output as its input. Dagster materializes the asset graph, tracks lineage in its UI, and supports retry and backfill on any failed asset without replaying upstream work. The job-queue pattern described in this chapter remains the substrate; Dagster adds an orchestration layer on top of it. The following environment variables are relevant to the pipeline layer: | Variable | Purpose | Default | |---|---|---| | MERIDIAN_DB_URL | Postgres connection string | postgresql://localhost:5433/meridian | | MERIDIAN_CUSTODIAN | Signing custodian name used in Seal | — | | MERIDIAN_PUBLIC_KEY_URL | URL where the public PEM is hosted for verification | — | | MERIDIAN_USE_PARADEDB | Switch BM25 backend to ParadeDB (see Chapter 14) | 0 | | MERIDIAN_REKOR_ENABLED | Publish attestation seal to Rekor transparency log (see Chapter 14) | 0 | When MERIDIAN_REKOR_ENABLED=1 is set, the seal_attestation asset submits the signed envelope to a Rekor instance after writing the seals row and records the returned UUID and log index in the rekor_entries table. ## Further Reading - workers/litdb.py in this repository — the job queue, storage, and audit helpers. - schema/20_provenance.sql — the acquisitions table structure. - meridian/pipeline/ — Dagster asset definitions for the software-defined pipeline. - Kleppmann, Designing Data-Intensive Applications (2017), Ch 11 (Stream Processing) — the intellectual predecessor to the job-queue pattern. - Dagster documentation: https://docs.dagster.io/ — software-defined assets, lineage, and retry semantics. - The FERC Enron investigation docket — https://www.ferc.gov/industries/electric/indus-act/wec/enron/ — for the discovery timeline.
- FRE 803(6) and the committee notes on the business-records exception.
- NIST FIPS 180-4 — SHA-256 specification.
Next: Chapter 14 — PostgreSQL as Evidence Substrate. Why the database itself is a layer, and what that means for transaction guarantees, row-level security, and audit integrity.