Part IV — Engineering Practice · Chapter 29

Local-First Chunking & Privilege

Hash before cloud. The custodian's machine is the trust root; everything that leaves it is metadata.

Prerequisites

Before reading this chapter, you should be comfortable with: Chapters 13, 15 (Pipeline, Ingestion). Local-first chunking is L3 of the pipeline; idempotent ingestion provides the hash-before-process pattern it relies on.

Here is the threat. Your client has a Gmail export containing 6,000 emails. You want to embed them for semantic search. The obvious move: send the emails to an embedding API. The problem: if any of those emails are attorney-client privileged, you have transmitted privileged communications to a third-party server outside the privilege holder's control. Whether that transmission waives the privilege depends on jurisdiction, care in transmission, and how quickly you can establish that disclosure was inadvertent. None of those factors are comfortable to argue about in a brief.

Local-First Chunking is the engineering response. Hash the source file on the custodian's machine. Chunk it on the custodian's machine. Flag each chunk's privilege tier on the custodian's machine. Then send only the cloud-safe representation — chunk metadata, content hash, text — to the API. The raw bytes never leave the controlled environment.
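As a rough illustration of the first step, a source file can be hashed in bounded reads before any parser or network client touches it. The sketch below uses only the standard library. The names hash_source and SourceReceipt, and the 1 MiB read size, are illustrative assumptions and are not taken from the Meridian codebase.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone
from pathlib import Path


@dataclass(frozen=True)
class SourceReceipt:
    """Hypothetical record of a file's state at the moment it was hashed."""
    path: str
    sha256: str
    size: int
    hashed_at: str  # ISO 8601 timestamp, UTC


def hash_source(path: Path, read_size: int = 1 << 20) -> SourceReceipt:
    """Hash a file's raw bytes locally, before any parsing or outbound call."""
    digest = hashlib.sha256()
    size = 0
    with path.open("rb") as fh:
        # Bounded reads keep large exports from being loaded into memory at once.
        while block := fh.read(read_size):
            digest.update(block)
            size += len(block)
    return SourceReceipt(
        path=str(path),
        sha256=digest.hexdigest(),
        size=size,
        hashed_at=datetime.now(timezone.utc).isoformat(),
    )
```

The digest returned here plays the role the chapter later calls parent_sha256: it is computed once, from the untouched bytes, and every later record points back to it.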

This chapter explains why this discipline matters, how it is implemented in meridian/witness/local_chunker.py, what section-aware chunking means and why naive fixed-size chunking is legally suspect, and how privilege metadata is bound at the chunk level before any outbound operation.

## At a glance

- The hash must be computed before any cloud API call: this is the L0 invariant, and it is what makes the source hash a timestamped custody anchor rather than a post-processing annotation.
- Section-aware chunking preserves semantic units (email messages, section headings, list items) so that each chunk can individually satisfy the best-evidence standard; naive fixed-size chunking routinely produces mid-sentence splits that fail that standard.
- Privilege must be checked before chunking, not after: the pii_tier field is embedded in the ChunkRecord at creation time so that the privilege assessment is part of the chunk's identity, not a later annotation that could be bypassed.
- As of v0.2.0, Unstructured.io is the recommended document partitioner for production ingestion. It understands section headers, tables, lists, page boundaries, and headers/footers across PDF, email (MIME), DOCX, HTML, and Markdown, and stores an element_type per chunk. The primitive chunk_local() remains available as a fallback and for custom source types.

## Learning objectives

By the end of this chapter you should be able to:

1. Implement chunk_local() for a new source type, correctly passing parent_sha256 (computed before any transformation) and setting pii_tier before the first cloud eligibility check.
2. Explain the custody guarantee versus the privacy guarantee distinction, and give one example of a system that provides the former without the latter.
3. Apply section-aware chunking rules to an email thread: identify which sections become separate chunks, how attachments are linked via parent_chunk_id, and how privilege is inherited downward from message to attachment.
4. Identify which document types require special chunking treatment (email threads, contracts, court filings, short messages) and state the section-aware rule that governs each.

## 22.1 What local-first means

Local-first is a custody guarantee. It is not a privacy guarantee. The distinction matters and the chapter will return to it several times, because confusing the two causes architects to apply local-first where it is unnecessary and to omit it where it is required.

Privacy is about who can access the data. Privacy is served by encryption, access control, audit logs, and need-to-know enforcement. Local-first is not primarily a privacy mechanism: a chunk that stays on the custodian's machine is still readable by anyone with physical or remote access to that machine.

Custody is about who can prove the chain from original to attested. Local-first establishes custody. When the system hashes a source file before any transformation, the hash is a timestamped witness to the file's state at that moment. If the original file is later challenged (a common tactic in adversarial proceedings), the hash anchors the response: "This was the file as it existed before we processed it." That anchor is only valid if the hash was computed before processing, locally, without any intermediate step that could have altered the bytes.

The Canon spec calls this L0: the first layer in the processing pipeline, and the only layer where raw bytes are in scope. Everything above L0 operates on derived representations.

> ▼ Why It Matters: Custody before disclosure.
>
> In the 2026 TPR proceeding, the parent's attorney receives a production from the opposing agency containing 300 PDF documents. Some are medical records. Some contain communications between agency workers. A few may contain attorney work product inadvertently included in the production.
>
> Before any of these documents are processed (before OCR, before embedding, before keyword extraction), the local-first chunker hashes every file. The hash is the receipt. If opposing counsel later alleges that a document was altered or selectively processed, the hash is the answer. If the attorney later realizes a document is covered by a clawback demand, the hash establishes that the document was in the system at a specific time and in a specific state, which is relevant to the inadvertent-disclosure analysis under FRE 502.

## 22.2 The privilege waiver risk

Inadvertent disclosure of privileged material to a third party is a long-recognized privilege-waiver risk, and Swidler & Berlin v. United States (1998) underscores how heavily the privilege depends on clients being able to trust that their communications stay confidential. The question of whether cloud API calls constitute third-party disclosure has not been definitively resolved, but the risk is real:

- Cloud embedding APIs (OpenAI, Cohere, Voyage, etc.) are operated by entities outside the attorney-client relationship.
- Their terms of service may reserve the right to use submitted content for model training, with defaults and opt-outs that vary by provider and tier.
- A law firm that sends privileged emails to an embedding API is arguably disclosing them to a third party without the client's informed consent.

Whether this waives the privilege depends, under FRE 502(b), on whether the disclosure was inadvertent, whether the holder took reasonable steps to prevent it, and whether the holder promptly took reasonable steps to rectify the error. Local-first chunking is an engineering implementation of the reasonable-steps element: raw privileged content never reaches the third party. What reaches the API is the chunk's text representation (not the original bytes), after privilege has been assessed, and only for chunks that clear the privilege filter.

This is not a complete legal answer; privilege waiver is ultimately a doctrinal determination. It is an engineering constraint that supports the reasonable-steps showing: the architecture, not a written policy, is what prevents raw bytes from leaving the controlled environment. That distinction matters at the 502(b) hearing.

## 22.3 The ChunkRecord dataclass

The implementation lives in meridian/witness/local_chunker.py. The central type is ChunkRecord:

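from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)  # immutable once created
class ChunkRecord:
    chunk_id: str        # f"chunk-{parent_sha256[:8]}-{chunk_index:06d}"
    parent_sha256: str   # SHA-256 of the original source bytes
    chunk_index: int     # position of this chunk in the chunked data
    chunk_offset: int    # byte offset of the chunk within that data
    chunk_size: int      # length of raw, in bytes
    chunk_sha256: str    # SHA-256 of raw
    custodian: str
    pii_tier: str        # 'public', 'low', 'internal', 'privileged', 'work_product'
    chunked_at: str      # timestamp set when the record is created
    raw: bytes = field(repr=False, compare=False)  # kept out of repr and comparison

    def to_cloud_safe(self) -> dict[str, object]:
        """Return the record as a dict with the raw bytes removed."""
        d = asdict(self)
        d.pop("raw")
        return d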

This is the v0.1.x form. In v0.2.0, two fields are added: element_type: str (the document element category assigned by the partitioner) and metadata: dict (a free-form dict that at minimum carries {"element_type": ...}). The extended form is produced by partition_and_chunk() in meridian.witness.unstructured_adapter (§22.5). The primitive chunk_local() still produces the v0.1.x form; callers that need element_type should use partition_and_chunk() instead.

frozen=True means the record is immutable once created. A chunk that can be mutated after creation is not a reliable custody artifact.

The raw field carries the actual bytes. It is excluded from repr, so logs do not contain raw content, and from compare, so two chunks with identical metadata but different raw bytes would still be "equal" under Python's comparison. This is intentional: the identity is the chunk_id and the hash, not the bytes. In practice chunk_sha256 is one of the compared fields, so records built from different bytes still compare unequal.

to_cloud_safe() is the API boundary. It serializes the record to a dict and removes the raw key. Every call to a cloud API must go through this method. An ingestion worker that passes a ChunkRecord directly to a cloud SDK is bypassing the local-first guarantee; it must call .to_cloud_safe() instead.

The pii_tier field carries the privilege classification. It is set at chunk creation, before any outbound operation. The tier values are: 'public', 'low', 'internal', 'privileged', 'work_product'. The is_cloud_eligible() function enforces the default policy:

def is_cloud_eligible(pii_tier: str) -> bool:
    return pii_tier in {"public", "low", "internal"}
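The behaviours described above (immutability, raw bytes kept out of repr, and raw bytes removed by to_cloud_safe()) can be checked in a few lines. This is a minimal sketch: the dataclass and the eligibility function are restated so the snippet runs on its own, and every field value in the sample record is a made-up placeholder.

```python
from dataclasses import FrozenInstanceError, asdict, dataclass, field


@dataclass(frozen=True)
class ChunkRecord:  # restated from the listing above
    chunk_id: str
    parent_sha256: str
    chunk_index: int
    chunk_offset: int
    chunk_size: int
    chunk_sha256: str
    custodian: str
    pii_tier: str
    chunked_at: str
    raw: bytes = field(repr=False, compare=False)

    def to_cloud_safe(self) -> dict[str, object]:
        d = asdict(self)
        d.pop("raw")
        return d


def is_cloud_eligible(pii_tier: str) -> bool:
    return pii_tier in {"public", "low", "internal"}


record = ChunkRecord(
    chunk_id="chunk-0a1b2c3d-000000",
    parent_sha256="0a1b2c3d" + "0" * 56,  # placeholder, not a real digest
    chunk_index=0,
    chunk_offset=0,
    chunk_size=5,
    chunk_sha256="f" * 64,                # placeholder, not a real digest
    custodian="example-custodian",
    pii_tier="privileged",
    chunked_at="2026-01-01T00:00:00+00:00",
    raw=b"hello",
)

# repr=False keeps the raw content out of anything that logs the record.
raw_in_repr = "hello" in repr(record)

# frozen=True rejects an attempt to downgrade the tier after creation.
try:
    record.pii_tier = "public"
    mutated = True
except FrozenInstanceError:
    mutated = False

# The outbound gate: serialize only eligible tiers, and never with raw attached.
outbound = record.to_cloud_safe() if is_cloud_eligible(record.pii_tier) else None
print(raw_in_repr, mutated, outbound)
```

Because the sample record is tagged privileged, the gate yields nothing to send. Changing the tier requires building a new record, which is the point of freezing the dataclass.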

The privileged and work_product tiers are foreclosed from cloud transmission by default. An explicit custodian authorization is required to override this for any batch. The override path must be logged.

> ◆ Going Deeper: Why chunk_id is derived from (source_hash, chunk_index).
>
> A chunk's ID is f"chunk-{parent_sha256[:8]}-{idx:06d}": the first eight characters of the source file's SHA-256 hex digest, followed by a zero-padded six-digit index. This scheme has a specific property: if the same source file is re-chunked (after a bug fix to the chunker, after a version upgrade), the same positions in the file produce the same chunk IDs. Re-processing is idempotent at the chunk level.
>
> The alternative, deriving chunk_id from the chunk's content, would mean that fixing a whitespace normalization bug in the chunker produces different chunk IDs for the same document, breaking any FK relationships that other tables have built against the old chunk IDs.
>
> The tradeoff: if two different source files happen to share the same first eight hex characters of their SHA-256 (a collision at the abbreviated-hash level, not the full hash), their chunk IDs could collide. Eight hex characters are only 32 bits, so the chance is small for a handful of files but becomes a realistic possibility once a corpus reaches tens of thousands of files. A production system should use the full hash, or a ULID seeded from the hash, rather than the abbreviated form. The abbreviated form in the current implementation is a usability concession for development; a comment in the code marks this.

## 22.4 The chunk_local() primitive

def chunk_local(
    data: bytes,
    *,
    parent_sha256: str,
    custodian: str,
    pii_tier: str = "internal",
    chunk_size: int = 8192,
) -> Iterator[ChunkRecord]:
    if not data:
        return
    n = len(data)
    idx = 0
    offset = 0
    while offset < n:
        end = min(offset + chunk_size, n)
        raw = data[offset:end]
        sha = _hash(raw)
        yield ChunkRecord(
            chunk_id=f"chunk-{parent_sha256[:8]}-{idx:06d}",
            parent_sha256=parent_sha256,
            chunk_index=idx,
            chunk_offset=offset,
            chunk_size=len(raw),
            chunk_sha256=sha,
            custodian=custodian,
            pii_tier=pii_tier,
            chunked_at=_now(),
            raw=raw,
        )
        idx += 1
        offset = end

This is the primitive. It divides data into fixed-size windows of chunk_size bytes (default 8,192, which is two 4 KiB OS memory pages). Each window becomes a ChunkRecord with its own SHA-256 hash. The function is a generator: it yields chunks as it produces them rather than materializing all of them in memory at once. For a 100 MB PDF, this matters.

Notice what chunk_local() does not do: it does not parse the content. It does not identify sentence boundaries, paragraph breaks, or section headers. It divides bytes mechanically. The docstring is explicit about this:

> For text-bearing formats with section structure (PDFs, emails), the higher-level worker should use a section-aware chunker upstream and then pass the section bytes here. This function is the primitive.

The section-aware chunker is the layer above chunk_local(). It parses the document, identifies semantic boundaries, and presents each section as a discrete data: bytes argument to chunk_local(). chunk_local() hashes and wraps the section bytes; the section-aware layer is responsible for the boundaries.

## 22.5 Unstructured.io: recommended production partitioner

The primitive chunk_local() divides bytes mechanically. For production ingestion of heterogeneous document types (PDFs, emails, DOCX, HTML, Markdown), v0.2.0 adds Unstructured.io as the recommended partitioner.

Install the optional extra:

pip install meridian-canon[unstructured]
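Whether the optional extra is actually present in a given environment can be checked without importing it. The sketch below uses importlib from the standard library. The helper names and the two path labels are hypothetical and only illustrate how an ingestion worker might record which partitioning path it is about to take.

```python
import importlib.util


def unstructured_available() -> bool:
    """Return True if a top-level 'unstructured' package is importable here."""
    return importlib.util.find_spec("unstructured") is not None


def choose_partitioner() -> str:
    """Hypothetical helper: name the path a worker would take in this environment."""
    return "unstructured" if unstructured_available() else "single-section-fallback"


print(choose_partitioner())
```

Recording the chosen path alongside the ingest run makes it possible to tell afterwards which documents were partitioned structurally and which were not.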

Unstructured.io parses documents into typed elements before chunking. Each element carries an element_type drawn from a controlled vocabulary: Title, NarrativeText, Table, ListItem, Header, Footer, and others. This type is stored in ChunkRecord.metadata and is available to downstream retrieval ranking, where a Title element scores differently from a NarrativeText element when assessing relevance.

The adapter entry point is in meridian.witness.unstructured_adapter:

from meridian.witness.unstructured_adapter import partition_and_chunk

chunks = partition_and_chunk(
    source=document_bytes,
    source_type="application/pdf",   # or "text/html", "message/rfc822", etc.
    parent_sha256="sha256:<hex>",     # hash of the original source bytes
    custodian="acme-corp-2026",
    strategy="fast",                  # or "hi_res" for OCR-based
)

partition_and_chunk() returns a list of ChunkRecord objects. In v0.2.0 the ChunkRecord type is extended with an element_type field and a metadata dict:

@dataclass(frozen=True)
class ChunkRecord:
    chunk_id: str
    parent_sha256: str
    element_type: str          # v0.2.0: "Title", "NarrativeText", "Table", etc.
    chunk_index: int
    chunk_offset: int
    chunk_size: int
    chunk_sha256: str
    custodian: str
    pii_tier: str
    chunked_at: str
    metadata: dict             # v0.2.0: {"element_type": ..., ...}
    raw: bytes = field(repr=False, compare=False)

Graceful fallback. If the unstructured package is not installed, partition_and_chunk() falls back to a single-section chunker that treats the entire document as one chunk with element_type="NarrativeText". This preserves idempotency guarantees: the source_hash is computed from the original source bytes regardless of which strategy runs.

Idempotency. source_hash is computed from the original source bytes before partitioning. Re-ingesting the same document is a no-op regardless of which chunking strategy was used.

Email partitioning. Email (MIME) is handled separately due to MIME complexity: headers, body, and attachments are partitioned independently. Attachments are tagged with the parent message's chunk_id as parent_chunk_id. The partition-and-chunk adapter routes message/rfc822 content through this MIME-aware path automatically.

> ◆ Going Deeper: Why element_type matters for admissibility.
>
> Section-aware chunking via Unstructured.io makes a structural claim about each chunk's provenance: "this chunk came from a section header," or "this chunk is a table row." When an opposing expert challenges the chunking methodology, the element_type field is the evidence that the system respected document structure rather than applying mechanical byte-splitting. A chunk whose element_type is NarrativeText is straightforwardly a complete paragraph; its integrity as a semantic unit is a matter of inspection, not assertion. This is the same best-evidence argument section-aware chunking makes in general (§22.6), now backed by a deterministic, auditable partitioner.

## 22.6 Section-aware chunking: why it matters legally

Fixed-size chunking at 8,192 bytes does not care about sentence boundaries, paragraph structure, or section headers. A 200-byte sentence that starts at byte offset 8,150 will be split: the first 42 bytes in chunk N, the remaining 158 bytes in chunk N+1. The sentence, potentially an important one, exists in neither chunk intact.
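The arithmetic behind that split is easy to verify. The sketch below takes the span to be 200 bytes long, which is what the 42 + 158 division implies; split_at_boundary is an illustrative helper name, not a function from the chunker.

```python
def split_at_boundary(start: int, length: int, chunk_size: int = 8192) -> tuple[int, int]:
    """Return (bytes left in the starting window, bytes spilled past its edge)."""
    boundary = (start // chunk_size + 1) * chunk_size  # first window edge after start
    head = min(length, boundary - start)
    return head, length - head


head, tail = split_at_boundary(start=8150, length=200)
print(head, tail)  # 42 158
```

A span that begins at offset 0 with the same length returns (200, 0): whether a sentence survives intact depends only on where it happens to fall relative to the window edge, not on anything in its content.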
This is not merely an information retrieval problem. It is a best-evidence problem.

> § For the Record: FRE 1002 (Best Evidence Rule).
>
> "An original writing, recording, or photograph is required in order to prove its content unless these rules or a federal statute provides otherwise."
>
> A chunk that splits a sentence mid-word is not a faithful representation of the original document's content. It cannot be used to prove what the original says. A naive chunker that routinely produces such splits is producing chunks that individually fail the best-evidence standard. An opposing expert who establishes this in court has a basis to challenge every exhibit derived from those chunks.

Section-aware chunking respects the document's own structure. A court filing splits at section headers: "Background", "Legal Standard", "Argument", "Conclusion". An email splits at message boundaries in a thread. A contract splits at article headings. Each resulting chunk is a complete, self-contained unit of meaning that the original document itself treats as a unit.

The rules for Meridian-Canon's section-aware chunking tier:

- Short documents (text messages, individual emails under 2,000 bytes): the entire document is one chunk. No split.
- Long documents (PDFs, court filings, contracts): split at section headers detected by the format parser. Never at byte count.
- Email threads: each message is its own chunk. Attachments are separate chunks, each tagged with a parent_chunk_id FK to the message chunk.
- Lists: each list item is atomic. No item is split across chunks.
- Sentence boundary rule: if a split would fall within a sentence (as detected by the parser's sentence tokenizer), move the boundary back to the last sentence ending before the target byte position.

The last rule is the hardest to implement. The easiest robust implementation uses a sentence boundary detector (spaCy, NLTK punkt, or a simple regex over period-plus-space patterns) and finds the last sentence ending before each section boundary produced by the structural parser.

> ✻ Try This: Naive vs section-aware comparison.
>
> Find a five-paragraph document: any article, a court filing excerpt, a letter. Apply naive fixed-size chunking at 500 characters. Count how many chunks contain a mid-sentence break (a chunk that does not begin with a capital letter following a period, or that does not end with a sentence-terminating punctuation mark). Apply section-aware chunking at paragraph breaks. Count the same thing. For most five-paragraph documents, naive chunking produces two to four mid-sentence breaks; paragraph-aware chunking produces zero. Which chunk set would survive a best-evidence challenge?

## 22.7 Email threads and attachment chains

An email thread presents a specific chunking challenge. The thread is a single export unit (one .mbox file, or one .emlx file in Apple Mail's format), but it contains N distinct messages, each with potentially M attachments.

The correct model is a tree, not a sequence:

thread_chunk (the full thread as context)
├── message_chunk[0]  (earliest message)
│   ├── attachment_chunk[0,0]  (first attachment to message 0)
│   └── attachment_chunk[0,1]
├── message_chunk[1]
│   └── attachment_chunk[1,0]
└── message_chunk[2]

Each node in this tree is a ChunkRecord. The parent_chunk_id field (not shown in the ChunkRecord listings in this chapter) links attachment chunks to their parent message chunk, and message chunks to the thread chunk if a thread-level chunk is emitted. The chunk_index at each level is the position within the parent, not globally across the document.

The MIME structure of emails makes this natural: a MIME multipart message already separates body parts. The chunker walks the MIME tree and maps each part to a chunk. Attachments with binary content (images, Excel files) get chunks whose pii_tier is set based on the attachment's content type and the parent email's privilege flag.

The critical rule: privilege is inherited downward. If a message chunk is flagged pii_tier='privileged', all its attachment chunks inherit the same tier. An attachment to a privileged email is privileged until positively determined otherwise. The determination must be explicit; the default is the conservative one.

## 22.8 Privilege binding before cloud eligibility check

The sequence of operations:

1. Source file is received. Hash the raw bytes immediately.
2. Parse the document structure (MIME for email, PDF parser for PDF, etc.).
3. For each section or message: determine the privilege tier from available metadata (sender, recipient, subject line, existing privilege assertions).
4. Call chunk_local() for each section's bytes, passing pii_tier as determined in step 3.
5. For each ChunkRecord yielded: call is_cloud_eligible(chunk.pii_tier).
6. Cloud-eligible chunks: call chunk.to_cloud_safe() and enqueue for API.
7. Non-cloud-eligible chunks: store locally only; record the chunk_id and chunk_sha256 for the audit trail.

Step 3 happens before step 4. Privilege assessment precedes chunking. The privilege flag is embedded in the ChunkRecord itself: it is part of the chunk's identity, not a later annotation.
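The seven steps can be sketched end to end under some simplifying assumptions. Sections arrive from an upstream parser as (offset, length, recipients) triples, each section becomes exactly one chunk, and chunks are numbered across the whole source so that IDs stay unique when several sections share one parent hash. The assess_tier rule and the ingest function are hypothetical names for illustration; a real privilege assessment would draw on more metadata and on attorney review.

```python
import hashlib
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone

CLOUD_TIERS = {"public", "low", "internal"}  # mirrors is_cloud_eligible()


@dataclass(frozen=True)
class ChunkRecord:  # restated so the sketch runs on its own
    chunk_id: str
    parent_sha256: str
    chunk_index: int
    chunk_offset: int
    chunk_size: int
    chunk_sha256: str
    custodian: str
    pii_tier: str
    chunked_at: str
    raw: bytes = field(repr=False, compare=False)

    def to_cloud_safe(self) -> dict[str, object]:
        d = asdict(self)
        d.pop("raw")
        return d


def assess_tier(recipients: list[str], privileged_parties: set[str]) -> str:
    """Hypothetical rule: a privileged party among the recipients marks the section."""
    return "privileged" if privileged_parties.intersection(recipients) else "internal"


def ingest(source: bytes, sections: list[tuple[int, int, list[str]]],
           custodian: str, privileged_parties: set[str]):
    parent = hashlib.sha256(source).hexdigest()             # step 1: hash first
    outbound, local_only = [], []
    for idx, (offset, length, recipients) in enumerate(sections):  # step 2 ran upstream
        tier = assess_tier(recipients, privileged_parties)  # step 3: tier before chunking
        raw = source[offset:offset + length]
        record = ChunkRecord(                               # step 4: tier bound at creation
            chunk_id=f"chunk-{parent[:8]}-{idx:06d}",
            parent_sha256=parent,
            chunk_index=idx,
            chunk_offset=offset,
            chunk_size=len(raw),
            chunk_sha256=hashlib.sha256(raw).hexdigest(),
            custodian=custodian,
            pii_tier=tier,
            chunked_at=datetime.now(timezone.utc).isoformat(),
            raw=raw,
        )
        if record.pii_tier in CLOUD_TIERS:                  # step 5: eligibility gate
            outbound.append(record.to_cloud_safe())         # step 6: raw stripped
        else:
            local_only.append((record.chunk_id, record.chunk_sha256))  # step 7
    return parent, outbound, local_only
```

The tier is decided before the record exists, so there is no moment at which an unclassified chunk is available to an outbound queue.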
A chunk that later has its pii_tier upgraded (because a document review revealed it was privileged) gets a new chunk record in the audit log; the original record is preserved.

This is privilege-by-construction. The alternative (chunk first, assess privilege on the cloud-safe output, withhold before transmission) is weaker: it creates a window in which privileged content can be transmitted if the privilege-assessment step fails or is bypassed.

> ▼ Why It Matters: Privilege before processing.
>
> In the running case: Isabel's iMessage export (from the iCloud backup received in the discovery exchange) contains a thread between Isabel and her attorney from the six weeks before the hearing. The thread discusses case strategy. It is clearly attorney-client privileged.
>
> The export file arrives as a single .db SQLite file (the Messages.sqlite format). The chunker opens the file, identifies each conversation, and for each message checks the recipient's identity against the parties table. When a recipient matches the attorney's phone number (stored in party_handles with handle_kind='phone'), the conversation is flagged pii_tier='privileged' before any message bytes are chunked. None of the chunks from that conversation will pass is_cloud_eligible(). The embedding API never sees the content. The privilege is established before any potential disclosure event.

## 22.9 The local-first architecture in the broader system

The local-first chunker is a subsystem of the Phase B witness wrapper (meridian/witness/wrapper.py). The wrapper calls chunk_local() as part of the attest_acquisition() flow:

1. attest_acquisition() receives an acquisition_id.
2. It fetches the source bytes from local storage.
3. It hashes the source bytes (SHA-256); this value is the source_hash in the sources table, already computed at ingest time.
4. It calls the appropriate section-aware chunker for the source type.
5. Each section is passed to chunk_local().
6. The resulting ChunkRecord list is serialized (via to_cloud_safe() for cloud-eligible chunks) and stored.
7. An ObservationAttestation is emitted for the acquisition.

The ObservationAttestation commits to the list of chunk IDs and their hashes. A verifier who runs the seven-step protocol on the attestation can confirm that the chunk IDs listed in the attestation's witness block are the same chunks that were produced at ingest time, with the same hashes. If a chunk was later altered, even by one byte, its SHA-256 would change, and the attestation's witness block would no longer match.

This is the custody chain from raw bytes to attestation: source_hash → chunk IDs → ObservationAttestation. Each link is a SHA-256 commitment. The chain cannot be forged without forging an Ed25519 signature under the issuer's key.

> ☉ In the Wild: Apple's iCloud CSAM scanning proposal (2021).
>
> In August 2021 Apple announced a proposal to scan images on-device before they were uploaded to iCloud Photos. The matching was local: a perceptual hash of each image would be computed on the device and compared against a database of known CSAM hashes, and the result would travel with the upload as an encrypted match report (a "safety voucher") rather than being produced by scanning images on Apple's servers. Apple described this as "privacy-preserving" because the matching ran on the device.
>
> The proposal provoked intense criticism. The core objection: a perceptual hash is not a cryptographic hash. A perceptual hash reveals content categories (whether an image resembles known CSAM) and can be driven by an adversary who controls the hash database. The architecture was local-first in the sense that the analysis ran on the device, but it was not privacy-preserving in the sense of revealing nothing about the device's content.
>
> This case illustrates the distinction this chapter opens with: local-first is a custody guarantee, not a privacy guarantee. Apple's proposal kept the computation local but disclosed information about content categories. The two properties are orthogonal. A system that confuses them will claim privacy protection it cannot provide, or will fail to provide custody protection because it assumes custody and privacy are the same thing.
>
> Meridian-Canon's local-first chunker hashes with SHA-256 (a cryptographic hash, not a perceptual hash). The hash itself reveals nothing about content to the cloud API; the cloud-safe record carries only metadata such as the hash, the chunk size, and the chunk index. The text content of cloud-eligible chunks is transmitted in plaintext (for embedding), but only after privilege has been assessed and only for non-privileged chunks. The distinction is explicit, not assumed.

## 22.10 Encryption for sensitive cloud transfer

The default policy, is_cloud_eligible(), forecloses cloud transfer for privileged and work_product tiers. There are cases where a custodian needs to use a cloud API on sensitive content: a managed legal AI service that operates under a BAA (Business Associate Agreement) and agrees to specific handling obligations, for instance.

The architecture supports this with an explicit override path. A custodian who authorizes cloud transmission of sensitive chunks must:

1. Generate a per-batch encryption key.
2. Encrypt each sensitive chunk's content with that key before calling to_cloud_safe().
3. Log the authorization event: timestamp, custodian identity, batch ID, reason.
4. Transmit the encrypted content to the cloud service (which must support client-managed keys to be able to process it).
5. Receive the cloud result and decrypt locally.

This path is not implemented in local_chunker.py; the current module is the primitive. The encryption layer sits above it, in a higher-level worker.
The architecture constraint is that the encryption key must never leave the custodian's environment and must be generated per-batch (not reused across batches), so that a compromise of one batch key does not expose other batches.

The audit log entry for the authorization event must be hash-chained into the audit log, which makes it tamper-evident. An auditor can verify that every sensitive cloud transfer was explicitly authorized, and by whom.

## 22.11 What the chunker guarantees and what it does not

The local-first chunker guarantees:

- Raw bytes do not leave the custodian's machine through the chunker's interface.
- Every chunk's chunk_sha256 is computed before the chunk is stored or transmitted.
- Privilege tier is set at chunk creation, before any cloud eligibility check.
- to_cloud_safe() strips raw bytes before serialization.

The local-first chunker does not guarantee:

- That the custodian's machine is secure. A compromised machine voids the local-first guarantee.
- That the source file was authentic before ingestion. The hash commits to the file's state at ingest time, not to its provenance. Provenance is established separately by the acquisition record and the chain of custody documentation.
- That the privilege assessment is correct. The chunker applies the privilege tier passed by the caller. If the caller sets the wrong tier, the chunker applies the wrong tier. Correct privilege classification is the ingestion worker's responsibility.

This is the correct division of responsibility. A primitive that does one thing correctly (hash and chunk, locally) is more trustworthy than one that tries to solve the entire problem. The legal sufficiency of the privilege assessment is a human judgment, reviewed by an attorney. The chunker's role is to ensure that the engineering does not override that judgment by inadvertently exposing content before the judgment is made.
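Returning to the authorization log from §22.10: one common way to make a log tamper-evident is to have each entry commit to the hash of the entry before it. The sketch below shows that pattern with the standard library only. The entry layout, the GENESIS constant, and the function names are illustrative assumptions; the chapter does not specify the format of Meridian's audit log.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _entry_hash(prev: str, event: dict) -> str:
    """Hash the previous link together with a canonical encoding of the event."""
    payload = json.dumps({"prev": prev, "event": event}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_entry(log: list[dict], event: dict) -> dict:
    """Append an event whose hash covers its content and the prior entry's hash."""
    prev = log[-1]["entry_hash"] if log else GENESIS
    entry = {"prev": prev, "event": event, "entry_hash": _entry_hash(prev, event)}
    log.append(entry)
    return entry


def verify_chain(log: list[dict]) -> bool:
    """Recompute every link; an edited, reordered, or mid-log deleted entry fails."""
    prev = GENESIS
    for entry in log:
        if entry["prev"] != prev:
            return False
        if entry["entry_hash"] != _entry_hash(prev, entry["event"]):
            return False
        prev = entry["entry_hash"]
    return True
```

A chain like this detects edits and removals inside the log but not truncation of its tail, so the most recent entry hash still needs to be anchored somewhere the log's writer cannot quietly rewrite.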

💡 Key Takeaways

- "Local-first" means no network calls occur during chunking: the source file is hashed, partitioned, and privilege-assessed entirely on the custodian's machine before any cloud API receives any bytes.
- Unstructured.io's element_type (e.g., Title, NarrativeText, Table, ListItem) is stored in ChunkRecord.metadata because retrieval ranking needs to weight document structure: a Title chunk scores differently from a NarrativeText chunk when assessing relevance.
- When the unstructured package is not installed, partition_and_chunk() falls back to a single-section chunker that emits one chunk with element_type="NarrativeText". The fallback is logged automatically and the source_hash invariant is unaffected.
- parent_sha256 is computed from the original source bytes before any partitioning and stored in every ChunkRecord; re-processing the same source with a different chunker still links all resulting chunks back to the same original file via this field.
- strategy="hi_res" enables OCR-based partitioning for scanned PDFs. It invokes a vision model to extract text from image-only pages, producing much higher extraction quality at significantly higher latency than the default "fast" strategy.

## Exercises

### Warm-up

1. Read meridian/witness/local_chunker.py end-to-end. Identify the three functions and the one dataclass. For each, write one sentence stating what it is responsible for.
2. What is the difference between chunk_sha256 and parent_sha256 in a ChunkRecord? Why does chunk_local() require parent_sha256 as a parameter rather than computing it from data?

### Core

3. Call chunk_local() on a 25,000-byte byte string with chunk_size=8192. How many ChunkRecord objects does the generator yield? What is the chunk_size of the last chunk? Verify by running the code.
4. Write a function process_email_thread(thread_bytes, attorney_phone, custodian) that: (a) parses the thread to identify messages; (b) checks each message's recipient list against attorney_phone; (c) sets pii_tier='privileged' for any message with the attorney as a recipient; (d) calls chunk_local() for each message's bytes; (e) returns only the to_cloud_safe() dicts for cloud-eligible chunks.
5. Implement is_cloud_eligible() as a three-tier hierarchy: public and low are always eligible; internal is eligible unless the strict_mode flag is set; privileged and work_product are never eligible. Add the strict_mode parameter to the function signature.

### Stretch

6. The chunk_id scheme uses the first eight characters of parent_sha256. Explain why this abbreviation is a risk. Design an alternative chunk_id scheme that is collision-resistant for a corpus of 10 million source files with an average of 50 chunks each.
7. Implement a section-aware chunker for plain text documents. It should split at paragraph boundaries (blank lines), enforce a minimum chunk size of 200 characters (combining small paragraphs), and ensure no mid-sentence splits. Write tests that verify: (a) no mid-sentence breaks; (b) no chunk smaller than 200 characters except the last; (c) the concatenation of all chunk texts, joined with paragraph separators, equals the original document.

## Build-your-own prompt

For your capstone matter, identify the highest-privilege source type in your evidence set (text messages? attorney emails? medical records?). Write a chunker for that source type that: (1) parses the source format's native structure; (2) sets pii_tier correctly based on recipient or document metadata; (3) produces chunks that would survive a best-evidence challenge (no mid-sentence breaks, each chunk a complete semantic unit); (4) passes every chunk through is_cloud_eligible() before any outbound call; (5) logs any chunk that fails the eligibility check with the reason.

## Further reading

- meridian/witness/local_chunker.py in this repository.
- FRE 1002 (Best Evidence Rule); FRE 502(b) (Inadvertent Disclosure and Privilege Waiver).
- Apple, "CSAM Detection: Technical Summary" (August 2021). The iCloud scanning proposal that prompted the on-device-hash debate.
- Rogaway, "The Moral Character of Cryptographic Work" (2015). On the distinction between what a cryptographic primitive guarantees and what it does not.
- Kleppmann, Designing Data-Intensive Applications, Ch. 3 (Storage and Retrieval). The chunking and indexing tradeoffs in columnar vs row storage.
- Swidler & Berlin v. United States, 524 U.S. 399 (1998). The Supreme Court's holding that the attorney-client privilege survives the client's death, and its reasoning about why clients must be able to rely on lasting confidentiality.
- Paper §7.1 (Local-First Chunking in the Canon spec).

Next: Chapter 23 — Key Management.