
Part III — System Architecture · Chapter 22

Idempotent Ingestion & Custody Logs

The same bytes, ingested twice, must produce exactly one record. This is not a performance optimization. It is a chain-of-custody requirement.

Prerequisites

Before reading this chapter, you should be comfortable with: Chapters 5, 13–14 (Hashing, Pipeline, Postgres). Idempotency relies on source_hash from Chapter 5 and the schema structure from Chapter 14.

Open your database right now and count the rows in sources. Now run your Gmail worker against an export you already imported. Count again. If the number went up, your system has a chain-of-custody defect. You may not discover it until opposing counsel asks why the same message has two acquisition IDs, and two different timestamps for when it entered your system.

Idempotency in an evidence system is not the same thing as idempotency in a billing system. In billing, a duplicate charge costs money; you fix it and refund. In evidence, a duplicate acquisition muddies provenance. It raises the question: which copy is the one that was processed? Which one's embeddings made it into the retrieval index? Which one's chain hash was used to build the attestation? If you cannot answer those questions cleanly, your chain of custody has a gap.

This chapter is about closing that gap before it opens. The mechanism is simple: hash the file before doing anything else with it, look up the hash, and decide in that moment whether this is a new acquisition or a retry of an existing one. The discipline around that single decision (what to log, what to skip, what never to duplicate) is what the chapter covers.

## At a glance

- source_hash must be computed from the raw bytes before any storage operation, parser call, or database write; hashing a derived or stored copy breaks the chain of custody at its root.
- Acquisition-level idempotency (same file, same source) and document-level idempotency (same bytes, any source) are distinct checks serving distinct legal purposes: the first prevents junk records on retry, the second preserves multi-path acquisition provenance.
- Cursor-based resumption enables safe retry for large sources by bookmarking progress in the job payload, while the hash check handles any records already processed within a restarted batch.

## Learning objectives

By the end of this chapter you should be able to:

1. Implement check_or_insert_acquisition(): compute the SHA-256 before any write, query for an existing row, log the duplicate, and return the correct acquisition ID whether new or existing.
2. Explain the hash-before-store invariant and identify at least two specific processing steps (e.g., BOM stripping, line-ending normalization) that would silently break the chain if performed before hashing.
3. Distinguish the two idempotency levels (acquisition-level raw_sha256 check versus document-level ON CONFLICT (sha256) DO NOTHING) and give a concrete scenario where both fire in the same pipeline run.
4. Trace a cursor-based resumption through a worker crash: what state survives, what the restarted worker does first, and how the hash check handles any records already written in the interrupted batch.

## What idempotency means here

Formal definition first. An operation is idempotent if applying it multiple times produces the same result as applying it once. In Meridian-Cannon, the operation is ingestion of a source file. The result is a set of rows in sources, acquisitions, documents, chunks, and audit_log.

The invariant: ingesting the same source file twice produces the same source row, the same document row, and the same chunk rows. It does not produce two source rows, two sets of chunks, or two attestations over what are nominally the same bytes but are actually different records with different IDs.

The mechanism that enforces this invariant is raw_sha256 in the acquisitions table. Every acquisition row carries a 64-character hex digest of the raw bytes, computed before any parsing, before any storage, before any database write. Two acquisitions with the same raw_sha256 are the same content, period. The system checks for this before proceeding.

Two idempotency checks operate at different levels of the pipeline:

- The acquisition check: "Have I fetched this exact file before?" If yes, log it and stop.
- The document check: "Have I seen this byte sequence before, regardless of where I fetched it from?" If yes, return the existing document ID and continue linking.

These checks are not interchangeable. A single email can arrive via three paths (Gmail API, Apple Privacy Export, and IMAP) and produce three acquisitions, one for each path, all pointing to the same document. The three acquisitions are distinct: each represents a distinct fetch event with a distinct legal basis, a distinct fetched_by_actor_id, and distinct request_meta records. The document is singular because it is the same bytes.

The schema reflects that separation. The document_acquisitions bridge table, populated by link_document_acquisition() in litdb.py, allows one document to accumulate multiple acquisition links without duplicating its content.

## The hash as the first link in the chain

A source without a hash is a source without provenance. That is not hyperbole, but a statement of what "provenance" requires. Chain of custody begins at the moment raw bytes enter the system. If you hash the file after parsing it, you have hashed a derived artifact. If you hash it after saving it to storage, you have hashed a stored copy, not the original. The hash has to be the first thing computed: before the bytes touch the filesystem, before the bytes enter any parser. litdb.put_content_addressed() enforces this:

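import hashlib
import time

def put_content_addressed(
    raw_bytes: bytes, source_kind: str, ext: str = "bin", bucket: str = "raw-evidence"
) -> tuple[str, str, int]:
    # Hash first, from the in-memory bytes, before anything is stored.
    sha = hashlib.sha256(raw_bytes).hexdigest()
    yyyymm = time.strftime("%Y-%m")
    # The storage key is derived from the digest, so the path is content-addressed.
    key = f"raw/{source_kind}/{yyyymm}/{sha}.{ext.lstrip('.')}"
    # ...store under content-addressed path, return (uri, sha, size)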

The SHA is computed in the first statement of the function, from raw_bytes in memory. The storage key is derived from the SHA. The SHA goes into acquisitions.raw_sha256. The storage URI goes into acquisitions.raw_storage_uri. These three things (hash, key, URI) are all derived from the same in-memory bytes, before anything is written to the database.

The result: every acquisition row is self-authenticating. Given the raw_storage_uri, you can fetch the bytes; given the bytes, you can recompute the SHA; if the recomputed SHA matches raw_sha256, you have verified that the stored content is the same content that was ingested. This is the verification chain that FRE 901(b)(9) is looking for.
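The recompute-and-compare round trip can be sketched in a few lines. This is a minimal illustration, not code from litdb.py: verify_stored_bytes() and the sample message are hypothetical, and fetching from raw_storage_uri is replaced by an in-memory value. The sketch also shows why normalizing before hashing is unsafe: stripping a BOM or converting CRLF to LF changes the digest, so a cleaned-up copy can never be verified against the original.

```python
import hashlib
import hmac


def sha256_hex(raw_bytes: bytes) -> str:
    """Return the 64-character hex SHA-256 digest of the given bytes."""
    return hashlib.sha256(raw_bytes).hexdigest()


def verify_stored_bytes(stored_bytes: bytes, recorded_sha256: str) -> bool:
    """Recompute the digest of stored bytes and compare it to the recorded value.

    Hypothetical helper: in a real system stored_bytes would be fetched
    from acquisitions.raw_storage_uri.
    """
    return hmac.compare_digest(sha256_hex(stored_bytes), recorded_sha256)


# Sample input with a UTF-8 BOM and CRLF line endings (illustrative only).
original = b"\xef\xbb\xbfFrom: a@example.com\r\nSubject: hi\r\n\r\nbody\r\n"
recorded = sha256_hex(original)  # the value that would go into raw_sha256

# An untouched stored copy verifies.
untouched_ok = verify_stored_bytes(original, recorded)

# A copy that was "cleaned up" (BOM stripped, CRLF -> LF) does not.
normalized = original[3:].replace(b"\r\n", b"\n")
normalized_ok = verify_stored_bytes(normalized, recorded)

print(untouched_ok, normalized_ok)  # True False
```

The comparison uses hmac.compare_digest only as a habit for comparing digests; a plain equality check would give the same result here.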

Implementing the idempotency check

Here is the pattern a worker uses to implement the check:

# >>> ch15 start
import hashlib
import json

import litdb  # project module (workers/litdb.py); adjust the import path to your layout

def hash_file(path: str) -> str:
    """Return sha256:<hex> from a file path."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in 64 KB blocks so large files are never fully loaded.
        for block in iter(lambda: f.read(65536), b""):
            h.update(block)
    return f"sha256:{h.hexdigest()}"

def check_or_insert_acquisition(
    path: str, source_id: str, method: str, legal_basis: str, db
) -> tuple[str, bool]:
    """Return (acquisition_id, created_now).
    If the raw_sha256 already exists for this source_id, skip and return existing."""
    # Read and hash the raw bytes before any storage or database write.
    with open(path, "rb") as f:
        raw = f.read()
    sha = hashlib.sha256(raw).hexdigest()
    with db.cursor() as cur:
        cur.execute(
            "SELECT id FROM acquisitions WHERE source_id=%s AND raw_sha256=%s LIMIT 1",
            (source_id, sha),
        )
        row = cur.fetchone()
        if row:
            # Log the duplicate attempt on the same cursor, then return the existing ID.
            cur.execute(
                "SELECT audit(%s, 'acquisition', %s::uuid, %s::jsonb)",
                ("duplicate_detected", str(row["id"]), json.dumps({"sha256": sha})),
            )
            return str(row["id"]), False
    # New content for this source: store the bytes, then record the acquisition.
    uri, sha, size = litdb.put_content_addressed(raw, source_kind=method)
    acq_id = litdb.record_acquisition(
        source_id=source_id, method=method, legal_basis=legal_basis,
        raw_uri=uri, raw_sha256=sha, raw_byte_size=size,
    )
    return acq_id, True
# <<< ch15 end

The function returns a boolean alongside the acquisition ID. True means this is a new acquisition; downstream code should proceed to parse, chunk, and embed. False means this is a duplicate; downstream code should stop, having already logged the attempt.

That log entry for the duplicate is not optional. The audit trail shows not just what was successfully ingested but every attempt, including attempts that were correctly identified as duplicates. If opposing counsel asks "when did the export first enter the system?", the answer is in the audit log. If they ask "was this export imported multiple times?", that is also in the audit log. Both answers come from the same append-only table.

## The custody log: what it records and why

The audit_log in schema/10_core.sql is hash-chained. Each row's hash field is computed by the audit_log_hash_trigger() trigger:

NEW.hash := sha256_hex(
    coalesce(prev, '') || '|' ||
    NEW.occurred_at::text || '|' ||
    coalesce(NEW.actor_id::text, '') || '|' ||
    -- ... action, resource_type, resource_id, payload
);

The trigger reads the current tail hash before inserting, making each row a function of every row before it. You cannot alter an early row without invalidating every row that follows. This is the property that makes the audit log a custody log in the forensic sense: it records not just what happened but that the record of what happened has not been modified.
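The tamper-evidence property can be seen in a small in-memory model. The sketch below is an illustration under stated assumptions, not the trigger itself: row_hash(), append_row(), and first_broken_row() are hypothetical names, the field list is shortened, and the payload is serialized with json.dumps rather than whatever the database does. It shows that editing an early row is detected at that row, and that re-hashing the edited row merely moves the break to the next one.

```python
import hashlib
import json


def row_hash(prev_hash, occurred_at, actor_id, action, payload):
    """Hash one audit row together with the previous row's hash."""
    material = "|".join([
        prev_hash or "",
        occurred_at,
        actor_id or "",
        action,
        json.dumps(payload, sort_keys=True),
    ])
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


def append_row(log, occurred_at, actor_id, action, payload):
    """Append a row whose hash depends on the current tail of the log."""
    prev_hash = log[-1]["hash"] if log else None
    row = {"occurred_at": occurred_at, "actor_id": actor_id,
           "action": action, "payload": payload}
    row["hash"] = row_hash(prev_hash, occurred_at, actor_id, action, payload)
    log.append(row)


def first_broken_row(log):
    """Return the index of the first row whose hash fails to verify, or None."""
    prev_hash = None
    for i, row in enumerate(log):
        expected = row_hash(prev_hash, row["occurred_at"], row["actor_id"],
                            row["action"], row["payload"])
        if expected != row["hash"]:
            return i
        prev_hash = row["hash"]
    return None


log = []
append_row(log, "2026-01-05T09:00:00Z", "worker-1", "acquisition_created", {"sha256": "aa"})
append_row(log, "2026-01-06T09:00:00Z", "worker-1", "duplicate_detected", {"sha256": "aa"})
append_row(log, "2026-01-07T09:00:00Z", "worker-2", "acquisition_created", {"sha256": "bb"})
intact = first_broken_row(log)

# Backdate the first row without touching any hash: the break is at row 0.
log[0]["occurred_at"] = "2025-12-01T09:00:00Z"
broken_at = first_broken_row(log)

# Recompute only row 0's hash: row 0 now verifies, but row 1 no longer does.
log[0]["hash"] = row_hash(None, log[0]["occurred_at"], log[0]["actor_id"],
                          log[0]["action"], log[0]["payload"])
broken_after_rehash = first_broken_row(log)

print(intact, broken_at, broken_after_rehash)  # None 0 1
```

To hide the edit, every later row would have to be re-hashed as well, which is exactly what an independently recorded tail hash would expose.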

What events go into the custody log for ingestion? Every acquisition triggers an audit() call with action acquisition_created. Every duplicate detection triggers an audit() call with action duplicate_detected. The payload jsonb carries the SHA, the method, the source kind: everything needed to reconstruct what happened from the audit record alone.

The sequence for a first-time ingest:

1. Duplicate check: no existing row, so proceed.
2. put_content_addressed: bytes stored, URI computed.
3. record_acquisition: row inserted into acquisitions.
4. audit("acquisition_created", ...): custody log entry written.
5. Parse, chunk, embed, each producing additional audit events.

The sequence for a retry ingest (same bytes):

1. Duplicate check: existing row found.
2. audit("duplicate_detected", ...): custody log entry written.
3. Return existing acquisition ID. Stop.

The second sequence is the idempotency guarantee in action. The caller gets the same acquisition ID it would have gotten on the first ingest. Any downstream code that is doing its own idempotency checks (the document upsert's ON CONFLICT DO NOTHING) also produces the same result. The system reaches the same final state regardless of how many times the ingest runs.

## Why retries are inevitable

Production evidence systems encounter retries constantly. Network connections drop mid-import. API rate limits terminate a Gmail fetch after 200 of 800 emails. A worker crashes on a malformed MIME attachment at email 347, and the job is requeued. The iMessage export is large; the laptop sleeps; the worker starts over the next morning.

Without idempotency, every retry produces duplicate records. With idempotency, every retry is safe: the system refuses to produce a duplicate.

The job queue in litdb.py supports this directly. claim_next() claims a job under a lease. If the worker crashes before calling complete(), the lease expires and the job is reclaimed. The worker that picks it up will encounter the same source bytes. If those bytes were partially processed before the crash (some chunks inserted, not all), the idempotency check at the acquisition level stops the worker from creating a second acquisition, and the document-level ON CONFLICT DO NOTHING handles the already-inserted chunks. The result: the worker completes the remaining work without touching what was already done.

This behavior is not accidental. It is the reason the hash check comes first, before any other database write.

---

> ☉ In the Wild: Forensic disk imaging.
>
> When a law enforcement investigator images a suspect's hard drive, the imaging is never done once. Standard forensic practice requires running the same drive through the imaging tool twice: once to produce the image, and once to verify it. Write-blocking hardware prevents the original from being modified during either pass. The imaging tool produces a hash of the original; the verification pass hashes the image and compares.
>
> What happens when the verification hash doesn't match? The image is discarded and the drive is re-imaged. There is no "close enough" in forensic disk imaging; the hashes must match exactly or the image is not evidence.
>
> Now imagine the analysis pipeline that processes the verified image. Without idempotency, running the same image twice (during verification, during analysis, during quality assurance) would produce duplicate records for every file on the drive. A drive with 100,000 files, processed twice, would produce 200,000 records, many of them marked as "appearing twice". A careless analyst might interpret that as evidence that a file was stored in two places, which is not the same thing as the analyst running the tool twice.
>
> The discipline of check-before-insert is standard forensic practice at the pipeline layer. Meridian-Cannon applies it by design. The alternative is not "more records"; it is a corrupted chain of custody.

---

> ✻ Try This.
>
> Write a Python function hash_file(path) that returns the SHA-256 digest of a file as a string prefixed with sha256:, in the form sha256:a1b2c3... (64 hex chars after the prefix). Process the file in 64 KB blocks so it does not load large files into memory.
>
> Then write check_or_insert_source(path, conn) that:
>
> 1. Calls hash_file(path) to get the digest.
> 2. Queries SELECT id, raw_sha256 FROM acquisitions WHERE raw_sha256 = %s LIMIT 1 with the hex portion of the digest (no prefix).
> 3. If a row exists, prints DUPLICATE: {existing_id} and returns that ID.
> 4. If no row exists, inserts a minimal acquisition row and returns the new ID.
>
> Run your function twice against the same file. The second run should print DUPLICATE: and return the same ID the first run returned. No new row should appear in acquisitions. Check with SELECT COUNT(*) FROM acquisitions WHERE raw_sha256 = '<hex>'.

---

> ◆ Going Deeper: Content-addressed storage vs. source_hash idempotency.
>
> Content-addressed storage (CAS), as used in Git, IPFS, and many backup systems, deduplicates by content hash system-wide. If the same bytes are stored twice, they appear once. The address is the hash.
>
> raw_sha256 idempotency in Meridian-Cannon is deliberately different. The same content from two different acquisitions (for example, the same email exported from Gmail and also extracted from an Apple Privacy Export) produces two acquisition rows with the same raw_sha256 but different id, source_id, method, and legal_basis values. Those two acquisitions are different evidence objects. They reflect different chains of custody. The fact that they contain identical bytes is itself evidentially significant: it shows that the same communication was captured via two independent paths, which strengthens rather than weakens the authenticity argument.
>
> The document-level deduplication (ON CONFLICT (sha256) DO NOTHING in upsert_document()) is where content-addressing actually operates: two acquisitions of identical bytes produce two acquisition rows but one document row. The bridge table document_acquisitions links both acquisitions to the single document. This is the correct model: two custody paths, one content object.
>
> Do not collapse acquisition-level idempotency into document-level deduplication. They serve different purposes. Acquisition idempotency prevents retries from creating junk records. Document deduplication prevents the same content from being chunked and embedded twice. The distinction matters when you ask "how many different ways did we receive this document?", a question with a real legal answer.

---

## The running case

Isabel's attorney has assembled her iMessage export across three hard drives. The first drive is from the phone backup taken the day after the initial DHS visit. The second is from a newer phone backup, done six months later, that also contains the same early messages. The third is an Apple Privacy Export that Isabel requested directly from Apple: a different format, different container, but the same underlying message content.

Without idempotency, importing all three drives would produce three sets of chunk records for the same messages. Three different chunk IDs for the message that reads, in part, "the caseworker told me", a message that may become a central exhibit. Three different attestations, each covering a different chunk ID, all purporting to attest to the same sentence. When the attorney tries to produce this message to the court, she has to explain why there are three records.

With idempotency, the sequence is different. The first drive produces one acquisition and one set of chunks. The second drive's import hits the raw_sha256 check: the same messages were already acquired from drive one. The duplicate is logged; the processing stops. The Apple Privacy Export produces a new acquisition (different source kind, different legal basis, different method value) but the upsert_document() call returns the existing document IDs because the byte-level content is the same. One document, two acquisitions plus one logged duplicate attempt, one set of chunks, one attestation.

The audit log shows all three import attempts, with timestamps. This is the timeline of evidence collection: first acquisition on date A, second attempt (duplicate) on date B, third acquisition via a different path on date C. The timeline is itself evidence of due diligence in collection.
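The running case can be modeled end to end with an in-memory SQLite database. This is a simplified sketch under loud assumptions: the table layouts are cut down to a few columns, ingest() is a hypothetical stand-in for the real worker, a single message stands in for a whole export, and SQLite 3.24 or later is assumed for ON CONFLICT ... DO NOTHING. It is not the Meridian-Cannon schema.

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE acquisitions (
        id INTEGER PRIMARY KEY, source_id TEXT, method TEXT, raw_sha256 TEXT);
    CREATE TABLE documents (id INTEGER PRIMARY KEY, sha256 TEXT UNIQUE);
    CREATE TABLE document_acquisitions (
        document_id INTEGER, acquisition_id INTEGER,
        UNIQUE (document_id, acquisition_id));
    CREATE TABLE audit_log (id INTEGER PRIMARY KEY, action TEXT, sha256 TEXT);
""")


def ingest(raw: bytes, source_id: str, method: str) -> str:
    """Ingest bytes; return 'created' or 'duplicate' for the acquisition level."""
    sha = hashlib.sha256(raw).hexdigest()  # hash before any write
    row = conn.execute(
        "SELECT id FROM acquisitions WHERE source_id = ? AND raw_sha256 = ?",
        (source_id, sha)).fetchone()
    if row:
        # Acquisition level: same bytes, same source. Log the attempt and stop.
        conn.execute("INSERT INTO audit_log (action, sha256) VALUES (?, ?)",
                     ("duplicate_detected", sha))
        return "duplicate"
    acq_id = conn.execute(
        "INSERT INTO acquisitions (source_id, method, raw_sha256) VALUES (?, ?, ?)",
        (source_id, method, sha)).lastrowid
    conn.execute("INSERT INTO audit_log (action, sha256) VALUES (?, ?)",
                 ("acquisition_created", sha))
    # Document level: the same bytes from any source map to one document row.
    conn.execute("INSERT INTO documents (sha256) VALUES (?) "
                 "ON CONFLICT (sha256) DO NOTHING", (sha,))
    doc_id = conn.execute(
        "SELECT id FROM documents WHERE sha256 = ?", (sha,)).fetchone()[0]
    conn.execute("INSERT INTO document_acquisitions VALUES (?, ?)", (doc_id, acq_id))
    return "created"


def count(table: str) -> int:
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


message = b"...the caseworker told me..."
results = [
    ingest(message, "imessage-backup", "device_backup"),        # drive one
    ingest(message, "imessage-backup", "device_backup"),        # drive two (retry)
    ingest(message, "apple-privacy-export", "privacy_export"),  # drive three
]
print(results)
print(count("acquisitions"), count("documents"), count("audit_log"))
```

Under these assumptions the three imports yield two acquisition rows, one document row, two bridge rows, and three audit entries, which is the shape of the timeline described above.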


Why It Matters.

In the running case, Isabel's attorney imports the iMessage content three times across different sessions and different drives. Each import attempt is logged. The audit trail shows that the first import created the acquisition, the second was detected as a duplicate, and the third (the Apple Privacy Export) was recorded as a new acquisition linked to the existing documents. It shows not just what was ingested but when each attempt occurred.

This matters for two reasons. First, it establishes the timeline of evidence collection, which is relevant when opposing counsel argues that evidence was fabricated or added after a key date. The audit log has timestamps for every attempt, including the attempts that were detected as duplicates and skipped. Second, it protects against the "multiple copies" argument: if the system had produced three records for the same message, opposing counsel could argue that the records do not represent a single continuous communication but three separately created records, implying that the chain of custody was not intact. The idempotency guarantee removes that argument.


§ For the Record: FRE 901(b)(9).

"Evidence describing a process or system and showing that it produces an accurate result."

Rule 901(b)(9) is the authentication hook for computer-generated evidence. The rule does not require that the system be perfect; it requires that the process be describable and that the result be accurate. The idempotency log satisfies both prongs: the process is the hash check, described in this chapter; the accuracy is demonstrated by showing that every import of the same content produces the same source ID, the same document ID, and the same chunk IDs. Determinism is accuracy, for the purposes of 901(b)(9).


Document Partitioning with Unstructured.io (v0.2.0)

The chunking workers described in Chapter 13 apply section-aware chunking — avoiding naive fixed-size splits. In v0.2.0, the recommended library for document partitioning is Unstructured.io, installed via:

pip install meridian-canon[unstructured]

Unstructured.io partitions a document into typed elements — title, narrative_text, table, list, page headers, and footers — before any chunking logic runs. Each element carries an element_type field that L3 workers use to determine the appropriate chunk boundary: a table is never split mid-row; a section heading starts a new chunk; a footer is excluded from content chunks entirely. This replaces any remaining fixed-size fallback logic. Supported source types: PDF, email (MIME/RFC 822), DOCX, HTML, plain text, and others. The element_type is preserved in ChunkRecord.metadata alongside the page number and character offsets, making it available to L4 enrichment and to the attestation's Witness block. The adapter is in meridian/witness/unstructured_adapter.py:

from meridian.witness.unstructured_adapter import partition_and_chunk

chunks = partition_and_chunk(
    source_bytes,
    source_type="application/pdf",
    parent_sha256="sha256:a3b4c5...",
    custodian="acme-corp-2026",
)

If the unstructured package is not installed, the adapter falls back to a single-section chunker that treats the entire document as one chunk. The fallback is automatic and logged; no configuration change is needed.

The idempotency guarantee is unaffected: source_hash is computed from the original source bytes before partitioning. Re-ingesting the same document after switching from the fallback chunker to Unstructured.io will produce a hash match at L0 (same bytes, same acquisition) and the document-level ON CONFLICT will return the existing document row. To re-partition an already-ingested document with the new chunker, the document row must be explicitly reset. That is a deliberate operator action, not a side effect of re-ingestion.

## Failure modes and how to handle them

Spec §7.6 lists the failure modes. Here they are mapped to idempotency concerns:

Rate limits and OAuth expiry. The worker is mid-import when the Gmail API returns a 429 or the OAuth token expires. The job is marked queued with a backoff by litdb.fail(). When it retries, the raw_sha256 check correctly identifies the already-imported messages as duplicates. The worker picks up where it left off; more precisely, it re-scans from the beginning and the hash check makes the already-done work a no-op.

Hash mismatch. If raw_sha256 in the acquisition row does not match a recomputed hash of the bytes at raw_storage_uri, the stored content has been modified after ingest. This is a custody break. The worker should log it as custody_break and halt rather than silently proceeding.

Schema violation. A source file that cannot be parsed (malformed MIME, unexpected encoding) fails at the parser level. The acquisition row is still created, because the raw bytes were validly received, but parsed_at is null and the job is marked failed. On retry, the acquisition check finds the existing row and skips re-ingestion. The parser retry happens against the already-stored raw bytes, not a re-fetch.

Corruption. The file on disk does not match its acquisition hash. This is the same as hash mismatch above: halt, log, do not re-ingest over the corrupted copy.

In every failure mode, the invariant holds: the acquisition either exists with its original hash or it does not exist. There is no half-existence.

## Cursor-based resumption

For sources that produce many records (a Gmail account with 10,000 emails, a CCAP docket with 300 entries) the worker cannot process everything in one job. It needs a cursor: a bookmark that records where it stopped, so the next run starts from there rather than from the beginning.

The pattern in Meridian-Cannon: the job's payload jsonb carries a cursor field. After each successful batch, the worker calls enqueue() to post a new job with the updated cursor. If the batch job crashes mid-run, the cursor in the original job is still at the start of the batch. When the job is reclaimed, the idempotency check handles the records that were already processed within that batch.

This is belt-and-suspenders. The cursor prevents re-scanning already-processed records from the beginning. The hash check handles the case where the cursor puts the worker into a partially-processed batch. Both mechanisms are needed; neither is redundant.
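The interplay between the cursor and the hash check can be sketched with an in-memory simulation. Everything here is hypothetical scaffolding: run_batch() stands in for a worker, its return value stands in for the job that enqueue() would post, a Python set stands in for the raw_sha256 lookup, and reusing the unchanged job dict after a crash stands in for lease expiry and reclaim.

```python
import hashlib


class Crash(Exception):
    """Simulated worker crash partway through a batch."""


def run_batch(records, job, stored, stats, batch_size, crash_at=None):
    """Process one batch starting at job['cursor'].

    Returns the follow-up job payload, or None when the source is exhausted.
    Raises Crash before processing the record at offset crash_at, if set.
    """
    start = job["cursor"]
    batch = records[start:start + batch_size]
    for offset, raw in enumerate(batch):
        if crash_at is not None and offset == crash_at:
            raise Crash()
        sha = hashlib.sha256(raw).hexdigest()
        if sha in stored:
            stats["skipped"] += 1  # written before the crash: a no-op now
            continue
        stored.add(sha)
        stats["written"] += 1
    next_cursor = start + len(batch)
    return {"cursor": next_cursor} if next_cursor < len(records) else None


records = [f"message {i}".encode() for i in range(10)]
stored, stats = set(), {"written": 0, "skipped": 0}
job, crashed_once = {"cursor": 0}, False

while job is not None:
    # Crash once, two records into the batch that starts at cursor 4.
    crash_at = 2 if (job["cursor"] == 4 and not crashed_once) else None
    try:
        job = run_batch(records, job, stored, stats, batch_size=4, crash_at=crash_at)
    except Crash:
        crashed_once = True  # job payload is unchanged; it is simply run again

print(len(stored), stats)  # 10 {'written': 10, 'skipped': 2}
```

The cursor limits the re-scan to the interrupted batch, and the hash check absorbs the two records that had already been written inside it, so the final state matches a clean run.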

💡 Key Takeaways

- Idempotency in evidence ingestion means the same source bytes produce exactly the same database state on every run (same acquisition row, same document row, same chunk IDs) with no duplicates regardless of how many times the source is submitted.
- source_hash must be computed from raw bytes in memory before any partitioning, storage write, or parser call; hashing a derived or normalized copy breaks the chain-of-custody root and makes the hash non-reproducible from the original file.
- Unstructured.io's element_type field (e.g., Title, NarrativeText, Table) is stored in ChunkRecord.metadata so that downstream retrieval ranking can weight chunk types differently, a structural claim that survives the admissibility challenge that naive byte-splitting cannot.
- Email follows a separate MIME-tree path (each body part and attachment becomes its own chunk with a parent_chunk_id FK) while documents follow the section-aware partitioner path; both paths compute source_hash from the same original bytes before any parsing begins.
- Hash-before-process means a modified file can never pass as the original: its raw_sha256 differs, so it is recorded as a distinct acquisition with its own audit entry, while a byte-identical resubmission is caught at acquisition time and stopped before any downstream artifact is created.

## Exercises

### Warm-up

1. Open schema/10_core.sql and read the audit_log_hash_trigger() function. Trace what prev_hash contains for the very first row in the audit log (hint: look at coalesce(prev, '')). What does this mean for the chain's root?
2. In litdb.py, upsert_document() uses ON CONFLICT (sha256) DO NOTHING and then, on conflict, queries for the existing document. Why not use ON CONFLICT DO UPDATE SET ... instead? What would be wrong with updating the existing document row on re-encounter?

### Core

3. Write a shell command that counts the number of distinct raw_sha256 values in acquisitions and compares it to the total row count. If the numbers differ, what does that tell you? Is the difference expected or unexpected?
4. Implement the check_or_insert_acquisition() function from "Implementing the idempotency check" with full error handling: handle the case where put_content_addressed() fails (storage unavailable) without having written a database row. What transaction scope do you need?
5. The duplicate_detected audit event currently carries the SHA and nothing else. Add a second worker run in your test environment, force a duplicate, and inspect the audit log row. What additional fields would make the duplicate audit row more useful for later forensic analysis?

### Stretch

6. Implement cursor-based resumption for a worker of your choice. The cursor should survive a worker crash at any point during a batch. Verify by manually killing the worker process mid-run and restarting it; confirm that the final state is identical to a clean run.
7. Consider the case where the same email appears in two different Gmail accounts (e.g., a forwarded message). Under what circumstances does the idempotency check correctly produce two acquisition rows? Under what circumstances would it produce one? Is the one-acquisition case ever wrong?

## Build-your-own prompt

For your capstone matter: audit your existing ingestion by querying SELECT raw_sha256, COUNT(*) FROM acquisitions GROUP BY raw_sha256 HAVING COUNT(*) > 1. If any SHA appears more than once, those rows should be legitimate multiple-path acquisitions (the same bytes from different sources); rows that share both raw_sha256 and source_id point to a failed idempotency check. Document each one: what were the two acquisition methods? Were they expected? Does the audit_log show the expected sequence of events for each? Your custody log is only as complete as you make it.

## Further reading

- schema/20_provenance.sql: the acquisitions and sources tables.
- schema/10_core.sql: audit_log structure and hash-chain trigger.
- workers/litdb.py: put_content_addressed(), record_acquisition(), upsert_document(), link_document_acquisition().
- meridian/witness/unstructured_adapter.py: partition_and_chunk() and the fallback single-section chunker.
- Unstructured.io documentation: https://docs.unstructured.io/ (element types, supported formats, and partition strategies).

- NIST SP 800-86, Guide to Integrating Forensic Techniques into Incident Response, §3.1.3 (evidence acquisition integrity).
- FRE 901(b)(9) and Advisory Committee Notes (2000 amendment), explaining the "process or system" standard.
- Casey, Eoghan, Digital Evidence and Computer Crime, 3rd ed., Ch. 4 (Digital Evidence Acquisition), esp. the section on write-blocking and hash verification.
- Brinkmann, Frank, "Hash Databases in Digital Forensics," Digital Investigation 5 (2008): S49–S56.
- FRE 803(6) (business records exception): the hearsay angle on audit logs.

Next: Chapter 16 — Procedural-Legal Primitives. The tables that make Meridian-Cannon a legal substrate, not just a document database.