Part II — CS Building Blocks · Chapter 12

Cryptographic Hashing

"A cryptographic hash is a small piece of evidence about a large piece of data — small enough to fit in a tweet, but as good as a full byte-for-byte copy for the purpose of detecting alteration."

When a parent in a termination-of-parental-rights proceeding hands their attorney a folder of iMessages, voicemails, and DHS correspondence, the first question the attorney should ask is not "what do these say?" but "how do we know they haven't been modified?" The answer is a cryptographic hash: a mathematical fingerprint computed over the exact bytes of each file at the moment of receipt. Every subsequent check — whether performed by your own system, opposing counsel's expert, or a court-appointed forensic examiner — recomputes the fingerprint and compares. Disagreement means the record has changed. Agreement means the bytes are exactly what they were when you received them.

That is the entirety of what hashing does. This chapter makes the guarantee precise, shows where and how Canon enforces it, and explains what hashing cannot do — because the gaps matter as much as the guarantee.

At a glance

  • A cryptographic hash function takes an arbitrary byte string and returns a fixed-size digest such that (a) any change to the input changes the output, and (b) finding two inputs with the same output is computationally infeasible.
  • Canon uses SHA-256 (FIPS 180-4) for both content hashing (R2) and chain hashing (R7). This chapter teaches what SHA-256 guarantees and what it does not.
  • Hashing alone proves nothing; signed hashes prove something. We pair hashing with signatures in Chapter 6, but the integrity guarantee comes from this chapter.

Learning objectives

By the end of this chapter, you should be able to:

  1. Define SHA-256 preimage resistance, second-preimage resistance, and collision resistance — and state which Canon requirement each protects.
  2. Demonstrate the avalanche effect: show that changing a single bit of input changes roughly half the bits of the output.
  3. Explain why the local-first principle requires hashing before any cloud transfer or processing step, and what is lost if you invert the order.
  4. Read the audit_log hash-chain trigger in schema/10_core.sql and trace the computation that links each row to the row before it.
  5. Identify at least three common engineering mistakes with hash functions and state what goes wrong in each case.
  6. Authenticate a hash-bearing digital exhibit under FRE 901(b)(9) by articulating what the hash proves and what a verifier must do to confirm it.

Working code first

This is the entire content-hashing implementation in the repository:

# meridian/canon/hashing.py
# >>> ch5 start
import hashlib

def sha256_hex(data: bytes) -> str:
    """Return lowercase hex digest of data (no prefix)."""
    return hashlib.sha256(data).hexdigest()

def content_hash(data: bytes) -> str:
    """Compute the canonical content hash for a Witness entry (R2).
    Returns 'sha256:<64-hex>' as required by the WitnessEntry schema.
    """
    return f"sha256:{sha256_hex(data)}"
# <<< ch5 end

Two functions, eight lines. By the end of this chapter you will know exactly what guarantee these eight lines provide and why the format sha256:<64-hex> is what every Witness entry (R2) carries.

What a hash function is

A hash function $H$ takes a byte string of arbitrary length and returns a digest of fixed length. In SHA-256's case the digest is 256 bits, conventionally written as 64 hexadecimal characters.

$$H : \{0,1\}^* \to \{0,1\}^{256}$$

The hash is deterministic: $H(x)$ always returns the same digest for the same $x$. The hash is one-way: given a digest, you cannot — without exhaustive search — recover the input that produced it. The hash is collision-resistant: you cannot, with any practical effort, find two different inputs $x$ and $x'$ such that $H(x) = H(x')$.

These three properties — determinism, one-wayness, collision resistance — make a hash useful for evidentiary integrity.

Try This — The avalanche effect in 90 seconds.

Open a Python REPL and run:

import hashlib

a = hashlib.sha256(b"Hearing scheduled for Tuesday").hexdigest()
b = hashlib.sha256(b"Hearing scheduled for tuesday").hexdigest()  # lowercase 't'

diff = sum(x != y for x, y in zip(
    bin(int(a, 16))[2:].zfill(256),
    bin(int(b, 16))[2:].zfill(256)
))
print(f"Bits that differ: {diff} of 256")

One character changed — a capital T became lowercase — and roughly half the output bits flipped. This is the avalanche effect: every bit of the output depends on every bit of the input, so no small change produces a small perturbation in the digest. For evidentiary purposes, it means there is no such thing as a "close enough" hash match. The digests either agree exactly or they tell you the bytes are different.

Do the same experiment with a one-byte change to a PDF — swap a single character in a metadata field — and verify that the digest changes completely.

Three properties, three uses

The literature distinguishes three security notions for hash functions. Which one matters depends on the use. Confusing them is a common engineering mistake.

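| Property | Definition | Where Canon uses it |
| --- | --- | --- |
| Preimage resistance | Given a digest $h$, it is hard to find any $x$ with $H(x) = h$. | Witness content_hash: preventing forged content for a given hash. |
| Second-preimage resistance | Given an input $x$, it is hard to find a different $x'$ with $H(x') = H(x)$. | Re-verifying stored bytes against a declared content_hash (R2): preventing substitution of different bytes for a known original. |
| Collision resistance | It is hard to find any pair $(x, x')$ with $x \neq x'$ and $H(x) = H(x')$. | Chain hash and signature: preventing the issuer from constructing two attestations with the same hash. |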

For SHA-256, all three properties hold under current cryptanalytic assumptions. The strongest known attacks fall far short of the full function, and SHA-256 is expected to remain safe against classical computers for the foreseeable future. Quantum computers change the numbers but not the conclusion. Grover's algorithm gives a quadratic speedup for preimage search, reducing SHA-256's preimage security from 256 bits to roughly 128 bits, which is still far out of reach. For collision resistance specifically, the Brassard–Høyer–Tapp (BHT) algorithm finds collisions in $O(2^{n/3})$ time, which would reduce SHA-256's collision resistance from 128 bits to approximately 85 bits against a quantum adversary. NIST SP 800-107r1 states classical security strengths only; for long-archive evidence, SHA-256 is currently considered adequate, but this assessment may change.

What you should not assume:

  • That a hash makes a value secret. It doesn't. A hash is a commitment, not encryption. If the input space is small (a phone number, an SSN, a date), an attacker can hash every possible input and look up your hash in the table.
  • That hashing detects unauthorized reading. It only detects modification.
  • That two systems will agree on the hash of "the same" data without explicit canonicalization. They will not — Chapter 7 is dedicated to the consequences.
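The first caution in the list above can be checked directly. The sketch below uses a made-up four-digit PIN as the "secret" input (the value and the function name `recover_pin` are illustrative assumptions, not part of Canon) and recovers it from its digest by hashing every candidate:

```python
import hashlib
from typing import Optional

def sha256_hex(data: bytes) -> str:
    """Lowercase hex SHA-256 digest of data."""
    return hashlib.sha256(data).hexdigest()

# Hypothetical "anonymized" value: the digest of a four-digit PIN.
leaked_digest = sha256_hex(b"4827")

def recover_pin(digest: str) -> Optional[str]:
    """Hash every four-digit PIN and return the one that matches the digest."""
    for n in range(10_000):
        candidate = f"{n:04d}"
        if sha256_hex(candidate.encode("utf-8")) == digest:
            return candidate
    return None

print(recover_pin(leaked_digest))  # 4827
```

Ten thousand hashes take a fraction of a second. Phone numbers, dates, and SSNs have larger input spaces, but they remain small enough to enumerate, which is why a bare hash of such a value should not be treated as secret.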

Going Deeper — How SHA-256 actually compresses.

SHA-256 processes input in 512-bit blocks using the Merkle-Damgård construction. Each block is fed through a compression function built on the Davies-Meyer construction over a custom block cipher. The cipher uses a 64-step schedule of 32-bit words derived from the message block, mixed via bitwise operations (AND, XOR, rotate) with no arithmetic carry chains beyond 32-bit modular addition. FIPS 180-4 §6.2 specifies the algorithm completely.

The eight 32-bit words that emerge from the final compression step are concatenated to produce the 256-bit digest. The eight initial values (h0–h7) are the first 32 bits of the fractional parts of the square roots of the first eight primes; the 64 round constants are the first 32 bits of the fractional parts of the cube roots of the first 64 primes. This is not numerological decoration: deriving the constants from a public formula gives assurance that no trapdoor is embedded in them (nothing-up-my-sleeve numbers, in the art).
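The derivation of the initial values is easy to reproduce. The following sketch (the helper name `frac_sqrt_32` is an illustrative assumption) computes the first 32 bits of the fractional part of each square root with exact integer arithmetic, so floating-point rounding does not affect the result:

```python
from math import isqrt

FIRST_EIGHT_PRIMES = [2, 3, 5, 7, 11, 13, 17, 19]

def frac_sqrt_32(p: int) -> int:
    """First 32 bits of the fractional part of sqrt(p)."""
    # isqrt(p << 64) equals floor(sqrt(p) * 2**32); the mask drops the integer part.
    return isqrt(p << 64) & 0xFFFFFFFF

initial_values = [frac_sqrt_32(p) for p in FIRST_EIGHT_PRIMES]
print([f"{h:08x}" for h in initial_values])
# The first entry is 6a09e667, the value FIPS 180-4 lists as the first initial hash word.
```

Anyone can rerun this and compare the output against the table in FIPS 180-4, which is the point of choosing constants this way.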

Security margin: as of 2025, the best published collision attacks reach 31 of SHA-256's 64 rounds. The gap between the best attack and the full function is comfortable; NIST's assessment (SP 800-107r1) is that SHA-256 provides 128 bits of collision resistance and 256 bits of preimage resistance.

For Canon's purposes, you do not need to implement this from scratch. You need to know (1) what bytes go in, (2) what digest comes out, and (3) that the function is a standard library call, not something to implement yourself.

SHA-256 in three sentences

SHA-256 is the 256-bit member of the SHA-2 family, standardized by NIST in FIPS 180-4. It processes input in 512-bit blocks, padding the final block as specified. Its internal compression function is the Davies-Meyer construction over a custom block cipher; the security margin against the best published attacks is comfortable.

You do not need to implement SHA-256 from scratch to use it correctly. The standard library of every major language ships a tested implementation. You do need to know what it computes over: a SHA-256 hash is a function of bytes, not strings. If you have a string, you must encode it (UTF-8 is the universally correct choice) before hashing. Hashing the in-memory representation of a Python string produces an answer that depends on the Python interpreter's internal layout and is meaningless across systems.

>>> import hashlib
>>> hashlib.sha256("hello").hexdigest()
TypeError: Unicode-objects must be encoded before hashing

>>> hashlib.sha256("hello".encode("utf-8")).hexdigest()
'2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824'
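Encoding is not the only place where "the same string" can turn into different bytes. As a small illustration (the sample word is arbitrary), the sketch below hashes one string under two encodings and under two Unicode normalization forms:

```python
import hashlib
import unicodedata

def sha256_hex(data: bytes) -> str:
    """Lowercase hex SHA-256 digest of data."""
    return hashlib.sha256(data).hexdigest()

text = "caf\u00e9"  # "café" with a precomposed e-acute

# Same string, different encodings: different bytes, different digests.
utf8_digest = sha256_hex(text.encode("utf-8"))
utf16_digest = sha256_hex(text.encode("utf-16-le"))

# Same visible text, different Unicode normalization forms.
nfc = unicodedata.normalize("NFC", text)  # e-acute as one code point
nfd = unicodedata.normalize("NFD", text)  # 'e' followed by a combining accent
nfc_digest = sha256_hex(nfc.encode("utf-8"))
nfd_digest = sha256_hex(nfd.encode("utf-8"))

print(utf8_digest == utf16_digest)  # False
print(nfc_digest == nfd_digest)     # False
```

Both comparisons fail even though a human reader would call the inputs identical. Any system that hashes text needs to fix the encoding and the normalization form as part of its definition of "the bytes".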

Forgetting to encode before hashing is the first entry in research/01_cryptography_pedagogy.md's student-mistake catalogue; the symptom is mysterious cross-platform mismatches.

> ☉ In the Wild — SHAttered: when collision resistance breaks down.
>
> In February 2017, a team from CWI Amsterdam and Google (Marc Stevens et al.) announced the first practical SHA-1 collision: two PDF files with visibly different content that produced the identical SHA-1 digest 38762cf7f55934b34d179ae6a4c80cadccbb7f0a. Computing the collision required approximately 9.2 × 10¹⁸ SHA-1 evaluations and cost an estimated $110,000 in cloud compute at the time (the attack is described in full at shattered.io).
>
> The immediate casualty was any evidence system or signature scheme that used SHA-1 for document integrity. A forensic lab that stored SHA-1 hashes of case files could, in principle, have those hashes "claimed" by a substituted document. Most modern operating systems and browsers had already deprecated SHA-1 for TLS certificates following the 2008 MD5 CA collision (Sotirov et al., Chaos Communication Congress), but many evidence management systems had not followed.
>
> The lesson for Canon is not abstract: SHA-1 and MD5 are formally broken for collision resistance. Using them in a new evidence system in 2024 is malpractice. SHA-256 is not broken. If you are inheriting a legacy system that uses MD5 or SHA-1, the migration path is to re-hash every artifact and maintain a cross-reference table mapping old hashes to new ones, and to document the migration as an audit event, not a silent replacement.

## Content hashing in Canon (R2)

R2, the content-integrity requirement, says every WitnessEntry must include a SHA-256 content hash computed over the raw observed bytes. Re-hashing the retrieved content must reproduce the declared hash exactly.

The discipline this imposes:

1. Hash on receipt, before any processing. The hash is over the bytes you received, not over a parsed or normalized form. If you transcode audio, OCR a PDF, or strip whitespace before hashing, you hash the transformed version and can no longer show what the original bytes were.
2. Preserve the original bytes, hashed. The content_ref URL must resolve to the exact bytes you hashed, byte-for-byte, forever. This means write-once storage, immutable file references, content-addressed identifiers, never a mutable database row.
3. Re-verify on each access. The recipient of an attestation reads content_ref, retrieves the bytes, and recomputes SHA-256. If the recomputation matches content_hash, integrity holds. If not, the system has either been tampered with or the storage has lost the original; both are unacceptable.

Conceptual pattern (see meridian/witness/wrapper.py for the full implementation):

# Conceptual pattern — meridian/witness/wrapper.py builds ObservationAttestations
# from acquisitions rows. The key discipline is: hash the raw bytes first, then
# pass the hash and storage URI into the attestation.
def attest_acquisition(...):
    raw_bytes = fetch_from_source(...)
    h = content_hash(raw_bytes)               # hash on receipt
    storage_url = vault.write_immutable(raw_bytes, key=h)
    return WitnessEntry(
        observation_id=mint_observation_id(),
        source=upstream_source,
        received_at=now_utc_microseconds(),
        custody_chain=[CustodyEvent(custodian=actor_id(), received_at=...)],
        content_hash=h,
        content_ref=storage_url,
    )
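The re-verification step of the R2 discipline can be sketched in a few lines. This is a minimal illustration, not the repository's implementation: the in-memory `vault` dictionary and the `store` and `verify` helpers are hypothetical stand-ins for write-once storage.

```python
import hashlib
from typing import Dict

def content_hash(data: bytes) -> str:
    """Same output format as the chapter's content_hash: 'sha256:<64-hex>'."""
    return f"sha256:{hashlib.sha256(data).hexdigest()}"

# Hypothetical stand-in for write-once storage, keyed by content hash.
vault: Dict[str, bytes] = {}

def store(raw_bytes: bytes) -> str:
    """Hash first, then store under the hash; return the declared hash."""
    h = content_hash(raw_bytes)
    vault.setdefault(h, raw_bytes)  # never overwrite an existing entry
    return h

def verify(declared_hash: str) -> bool:
    """Retrieve the bytes and confirm they still reproduce the declared hash."""
    retrieved = vault[declared_hash]
    return content_hash(retrieved) == declared_hash

declared = store(b"voicemail bytes as received")
print(verify(declared))  # True

# Simulate storage corruption or tampering: a single byte changes.
vault[declared] = b"voicemail bytes as Received"
print(verify(declared))  # False
```

Because the storage key is the hash itself, a verifier needs nothing beyond the declared hash and the retrieved bytes to detect the change.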

The hash is the primary key of the immutable vault. The content_ref URL is derived from the hash. There is no race condition between hashing and storage because the storage is content-addressed by the hash itself.

## The local-first principle: hash before the cloud sees anything

The running case makes the stakes concrete. In 2024JC000099, the evidence corpus includes iMessages, voicemails, and DHS case files, all acquired from personal devices and accounts controlled by the parent. When that data moves from the device to any cloud service for processing (transcription, OCR, embedding), there is a window of risk: the cloud service may normalize, compress, or re-encode the bytes in transit. If hashing happens after cloud transfer, you do not know what the hash covers.

The local-first principle (paper §7.1) forecloses this: hash on the custodian's machine, before any network transmission. The chunk_local() primitive in meridian/witness/local_chunker.py enforces this:

# meridian/witness/local_chunker.py (excerpt)
def chunk_local(data: bytes, *, parent_sha256: str, custodian: str,
                pii_tier: str = "internal", chunk_size: int = 8192,
                ) -> Iterator[ChunkRecord]:
    """Split data into fixed-size chunks; hash each before yielding."""
    offset = 0
    idx = 0
    while offset < len(data):
        raw = data[offset : offset + chunk_size]
        sha = hashlib.sha256(raw).hexdigest()
        yield ChunkRecord(chunk_sha256=sha, raw=raw, ...)
        idx += 1; offset += chunk_size

The ChunkRecord.to_cloud_safe() method strips raw before anything goes to a remote service. What travels to the cloud is the hash and the metadata — never the raw bytes, unless the PII tier and explicit custodian authorization permit it. The documents table in schema/30_documents.sql reflects this. Every document is identified by its SHA-256 hash:

-- schema/30_documents.sql (excerpt)
CREATE TABLE documents (
  id          uuid PRIMARY KEY DEFAULT gen_random_uuid(),
  sha256      text NOT NULL UNIQUE,   -- 64 hex chars; the document's identity
  byte_size   bigint NOT NULL,
  storage_uri text NOT NULL,          -- where the bytes live (write-once)
  ...
  CHECK (length(sha256) = 64)
);

A document that arrives by three different routes (email attachment, DHS production, Google Takeout) produces three acquisitions rows but one documents row. The hash is the deduplication key. If the bytes are the same, the document is the same, regardless of how it arrived.

> ▼ Why It Matters — The judge's verifier and yours must agree.
>
> In 2024JC000099, the parent's attorney may attempt to introduce an audio recording of a home visit. The recording is authenticated in part by showing that the SHA-256 hash of the file produced by the attorney's system matches the SHA-256 hash computed by an independent expert on the same file.
>
> A judge ruling on a motion to exclude that evidence will not personally rerun SHA-256. The judge will hear testimony from opposing counsel's expert and the attorney's expert. If the two experts run the verifier on the same file and reach the same hash, integrity is established. If they disagree, the exhibit is in jeopardy, not because of cryptography but because of how the file was handled between acquisition and trial.
>
> This is why the chain-of-custody log matters as much as the hash itself. The hash proves the bytes have not changed since they were last hashed. The chain-of-custody log proves who has handled the bytes and when. Together, they answer the authentication question that FRE 901(b)(9) puts to any process-generated record: is this the output of an accurate process, correctly applied? Hashing supplies the accuracy; the audit log supplies the correct application.

## Hash chains: writing audit logs that can't be retroactively rewritten

A hash chain is a sequence of records where each record incorporates the hash of the prior record. The first record stands alone; each subsequent record's hash depends on every record before it. Modifying a record in the middle of the chain invalidates every record after it.

In the repository, the audit-log table is hash-chained. The actual schema in schema/10_core.sql:

-- schema/10_core.sql (excerpt)
CREATE TABLE audit_log (
  id            bigserial PRIMARY KEY,
  occurred_at   timestamptz NOT NULL DEFAULT clock_timestamp(),
  actor_id      uuid REFERENCES actors(id),
  matter_id     uuid REFERENCES matters(id),
  action        text NOT NULL,
  resource_type text,
  resource_id   uuid,
  payload       jsonb NOT NULL DEFAULT '{}',
  prev_hash     text,
  hash          text NOT NULL
);

The trigger that computes each row's hash:

-- schema/10_core.sql (excerpt)
CREATE OR REPLACE FUNCTION audit_log_hash_trigger() RETURNS trigger AS $$
DECLARE prev text;
BEGIN
  SELECT hash INTO prev FROM audit_log ORDER BY id DESC LIMIT 1;
  NEW.prev_hash := prev;
  NEW.hash := sha256_hex(
    coalesce(prev, '') || '|' ||
    NEW.occurred_at::text || '|' ||
    coalesce(NEW.actor_id::text, '') || '|' ||
    coalesce(NEW.matter_id::text, '') || '|' ||
    NEW.action || '|' ||
    coalesce(NEW.resource_type, '') || '|' ||
    coalesce(NEW.resource_id::text, '') || '|' ||
    coalesce(NEW.payload::text, '{}')
  );
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

The hash of each row is the SHA-256 of the prior hash concatenated with every field of the new row. The hash of one row becomes the prev_hash of the next. Why this matters: an attacker who edits a row in the middle of the audit log must also recompute every subsequent hash. If a witness — a third-party verifier, a backup snapshot, an external timestamp service — holds the hash of any later row, the attacker is caught. The chain leaves no place to hide.

This is the design pattern behind Certificate Transparency (RFC 6962), Sigstore Rekor, Git's commit chain, and Bitcoin's block chain. The mathematics is the same in each case: hash chains commit retroactively to history.
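To see the tamper-evidence argument in motion, here is a minimal Python sketch of the same idea. It is an illustration under stated assumptions, not a port of the trigger: the `AuditRow` class, its reduced field set, and the helper names are hypothetical, and only the pipe-separated concatenation mirrors the SQL above.

```python
import hashlib
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class AuditRow:
    occurred_at: str
    actor_id: str
    action: str
    payload: str
    prev_hash: str = ""
    hash: str = ""

def row_hash(prev_hash: str, row: AuditRow) -> str:
    """SHA-256 over the prior hash and the row's fields, pipe-separated."""
    material = "|".join([prev_hash, row.occurred_at, row.actor_id, row.action, row.payload])
    return hashlib.sha256(material.encode("utf-8")).hexdigest()

def append(chain: List[AuditRow], row: AuditRow) -> None:
    """Link the new row to the current tail, then add it."""
    row.prev_hash = chain[-1].hash if chain else ""
    row.hash = row_hash(row.prev_hash, row)
    chain.append(row)

def first_broken_index(chain: List[AuditRow]) -> Optional[int]:
    """Return the index of the first row that fails verification, or None."""
    prev = ""
    for i, row in enumerate(chain):
        if row.prev_hash != prev or row.hash != row_hash(prev, row):
            return i
        prev = row.hash
    return None

chain: List[AuditRow] = []
for n in range(4):
    append(chain, AuditRow(f"2024-05-0{n + 1}T09:00:00Z", "actor-a", "acquire", f'{{"doc": {n}}}'))
witness_copy = chain[-1].hash       # held by an external party

chain[1].payload = '{"doc": 999}'   # a careless edit in the middle
print(first_broken_index(chain))    # 1

# A thorough attacker recomputes every later hash...
prev = chain[0].hash
for row in chain[1:]:
    row.prev_hash, row.hash = prev, row_hash(prev, row)
    prev = row.hash
print(first_broken_index(chain))        # None: the chain is self-consistent again
print(chain[-1].hash == witness_copy)   # False: the external copy exposes the rewrite
```

The last two lines carry the point of the design: internal consistency alone can be forged by whoever controls the table, so the guarantee depends on someone outside that control holding a later hash.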

§ For the Record — FRE 901(b)(9) and NIST SP 800-107.

Federal Rule of Evidence 901(b)(9) authenticates evidence by describing the process that produced it:

"Evidence describing a process or system and showing that it produces an accurate result."

The test is not "is this a computer output?" but "does the process produce accurate results?" For a hash-based evidence system, the authentication predicate is: (1) SHA-256 is the process; (2) FIPS 180-4 specifies it completely; (3) every major language's standard library implements it correctly; and (4) any verifier, independently, can recompute the hash and compare. The process is accurate and the accuracy is demonstrable.

NIST Special Publication 800-107r1, Recommendation for Applications Using Approved Hash Algorithms, is the federal guidance document governing the use of SHA-2 in federal information systems. It states that SHA-256 provides 128 bits of collision resistance and 256 bits of preimage resistance, and it treats SHA-256 as an approved hash algorithm for federal digital signature applications. A forensic expert offering hash-based authentication should be prepared to cite both FRE 901(b)(9) and SP 800-107r1 as the dual foundation: one legal, one technical.

State courts that have not adopted FRE verbatim often apply an equivalent standard. In Wisconsin, Wis. Stat. § 909.015(9) tracks the federal rule closely on authentication; the hash-chain predicate is the same.

Merkle trees: making the chain efficient at scale

A hash chain takes $O(n)$ time to verify any single record's place in the log: you must hash every record from the start. For a log with millions of entries, this is expensive.

A Merkle tree (also called a hash tree, after Ralph Merkle's 1979 dissertation) replaces the linear chain with a binary tree. Each leaf is the hash of a record. Each internal node is the hash of its two children. The root commits to every leaf, but a verifier needs only $O(\log n)$ hashes to prove that a specific record is in the tree.

        root_hash = H(H(H(r0,r1), H(r2,r3)), H(H(r4,r5), H(r6,r7)))
                 /                       \
        H(H(r0,r1), H(r2,r3))    H(H(r4,r5), H(r6,r7))
          /          \               /           \
      H(r0,r1)    H(r2,r3)       H(r4,r5)    H(r6,r7)
      /    \      /     \         /    \      /     \
     r0    r1    r2     r3       r4    r5    r6     r7

To prove that record $r_2$ is in this tree, the verifier needs only $r_3$, $H(r_0, r_1)$, and $H(H(r_4,r_5), H(r_6,r_7))$ — three hashes for a tree of eight leaves. They can recompute the root from these and compare against the published root.

For Canon's purposes, Merkle trees become important when a public log of attestations is needed (CASEFORUM, Chapter 30) or when a third-party witness co-signs your audit log. The Russ Cox article in the further reading is the canonical pedagogical introduction; read it.
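The inclusion proof for $r_2$ described above can be sketched as follows. The code assumes a leaf count that is a power of two, and it borrows the 0x00 and 0x01 domain-separation prefixes from RFC 6962 so that a leaf hash can never be mistaken for an internal node; the function names are illustrative, not Canon's.

```python
import hashlib
from typing import List

def leaf_hash(record: bytes) -> bytes:
    """Hash of a leaf, with the RFC 6962 leaf prefix."""
    return hashlib.sha256(b"\x00" + record).digest()

def node_hash(left: bytes, right: bytes) -> bytes:
    """Hash of an internal node, with the RFC 6962 node prefix."""
    return hashlib.sha256(b"\x01" + left + right).digest()

def build_levels(records: List[bytes]) -> List[List[bytes]]:
    """All tree levels, leaves first. Assumes len(records) is a power of two."""
    level = [leaf_hash(r) for r in records]
    levels = [level]
    while len(level) > 1:
        level = [node_hash(level[i], level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def inclusion_proof(levels: List[List[bytes]], index: int) -> List[bytes]:
    """Sibling hashes on the path from leaf `index` up to the root."""
    proof = []
    for level in levels[:-1]:
        proof.append(level[index ^ 1])  # the sibling differs in the lowest bit
        index //= 2
    return proof

def verify_inclusion(record: bytes, index: int, proof: List[bytes], root: bytes) -> bool:
    """Recompute the root from the record and its proof, then compare."""
    current = leaf_hash(record)
    for sibling in proof:
        if index % 2 == 0:
            current = node_hash(current, sibling)
        else:
            current = node_hash(sibling, current)
        index //= 2
    return current == root

records = [f"r{i}".encode("utf-8") for i in range(8)]
levels = build_levels(records)
root = levels[-1][0]
proof = inclusion_proof(levels, 2)

print(len(proof))                                   # 3
print(verify_inclusion(b"r2", 2, proof, root))      # True
print(verify_inclusion(b"forged", 2, proof, root))  # False
```

The verifier never sees records 0, 1, or 4 through 7. It needs only the record being proven, three sibling hashes, and a root obtained from a source it trusts.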

What the chain hash binds (R7)

The chain hash (R7) is the SHA-256 of the RFC 8785 canonical serialization of the entire attestation. In Canon v0.2.0 it is carried as a convenience field in the DSSE envelope (DSSEEnvelope.chain_hash) and is not what the Ed25519 signature directly covers. What is signed is PAE(CANON_PAYLOAD_TYPE, canonical_bytes) (Chapter 6). The chain_hash field lets a quick-check tool verify field integrity via a single SHA-256 call without implementing PAE.

Why the canonicalization step matters regardless:

- Canonicalizing first ensures every implementation that emits a Canon attestation produces the same canonical_bytes, and therefore the same chain_hash and the same PAE input, for the same logical content. Without canonicalization, two attestations that are logically identical but textually different (different field ordering, different whitespace, different number formatting) would have different hashes, different PAE inputs, and incompatible signatures.
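A small experiment makes the canonicalization point concrete. The sketch below is a simplified stand-in: `canonical_sketch` sorts keys and strips insignificant whitespace, which captures the idea but is not a full RFC 8785 implementation (number and string edge cases are not handled), and the attestation fields are invented for illustration.

```python
import hashlib
import json

def sha256_hex(data: bytes) -> str:
    """Lowercase hex SHA-256 digest of data."""
    return hashlib.sha256(data).hexdigest()

def canonical_sketch(obj: dict) -> bytes:
    """Sorted keys, no insignificant whitespace. NOT full RFC 8785."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False).encode("utf-8")

# Hypothetical attestation fields, in two different insertion orders.
attestation = {"source": "imessage", "observation_id": "obs-1", "byte_size": 2048}
reordered = {"byte_size": 2048, "observation_id": "obs-1", "source": "imessage"}

# Two default serializations of the same logical content.
naive_a = json.dumps(attestation).encode("utf-8")
naive_b = json.dumps(reordered, indent=2).encode("utf-8")

print(sha256_hex(naive_a) == sha256_hex(naive_b))  # False
print(sha256_hex(canonical_sketch(attestation)) ==
      sha256_hex(canonical_sketch(reordered)))     # True
```

The naive serializations disagree because key order and whitespace leak into the bytes. Once both sides canonicalize first, the digests agree, which is the property the chain hash and the signature both depend on.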

Chapter 7 covers the canonicalization step in full and shows the full canonicalize → wrap → sign pipeline. The construction to hold in mind:

# meridian/canon/hashing.py
def chain_hash(attestation: dict[str, Any]) -> str:
    """Compute the chain_hash convenience field for the DSSEEnvelope (R7).
    Defined as SHA-256 of the RFC 8785 canonical serialization of the
    Attestation. The envelope payload is base64url(canonical_bytes);
    chain_hash is sha256(canonical_bytes).
    """
    canonical = canonicalize_for_seal(attestation)  # defined in Chapter 7
    return f"sha256:{sha256_hex(canonical)}"

(canonicalize_for_seal() applies RFC 8785 JSON Canonicalization Scheme to the attestation; Chapter 7 covers the details.)

The chain hash binds the issuer to the full attestation content. The DSSE signature (via PAE) binds the issuer to both the payload type and the canonical bytes. The chain of commitments is the architecture of falsifiability.

## Length-extension and why HMAC exists

A historical note worth retaining. The SHA-2 family is built on the Merkle-Damgård construction, which has a known weakness: length extension. Given $H(M)$ and the length of $M$, an attacker can compute $H(M \,\|\, \text{pad} \,\|\, M')$ for any chosen $M'$, without knowing $M$ itself.

This does not threaten Canon's content hashes (no one is signing $H(M)$ in a way that lets an attacker append). But it threatens a naïve construction $H(\text{key} \,\|\, M)$ as a message authentication code, which is why HMAC (RFC 2104) was invented: it sandwiches the message between two applications of $H$, breaking length extension.

You will not implement HMAC in Canon (the spec uses Ed25519 signatures, not MACs), but you should know the failure mode. Cryptopals Set 4 exercises this attack directly; it is the best way to internalize why hash construction details matter.

## Common engineering mistakes

From the cryptography pedagogy dossier:

1. Hashing the wire format instead of the canonical form. "But it round-trips on my machine." See Chapter 7.
2. Forgetting to encode strings before hashing. Always data.encode("utf-8").
3. Confusing hash functions and MACs. SHA-256 alone is not authenticated; pair it with a signature or use HMAC.
4. Using truncated hashes for security. Truncating SHA-256 to 32 bits is not "good enough for a small system"; the birthday bound makes collisions trivial.
5. Hashing personally identifying information as "anonymization." A SHA-256 of an email address is not anonymous; the input space is enumerable.
6. Storing only the hash and discarding the original. R2 requires the original be retrievable; the hash alone is not the evidence.
7. Hashing after processing rather than on receipt. If your audio transcription worker hashes the transcript and not the .m4a, you have proven the transcript is intact, not the recording. You need both.

## Lab 5

The lab is in labs/ch05_hashing/. Run pytest labs/ch05_hashing/test_lab.py -v from the repo root to check your work.

1. Implement content_hash(data: bytes) -> str from scratch (you may use hashlib). Verify it agrees with the repo's implementation in meridian/canon/hashing.py by testing against at least five byte strings of your own choosing and confirming the output matches hashlib.sha256(data).hexdigest() prefixed with sha256:.
2. Implement a hash chain over a sequence of audit entries. Each entry has fields (occurred_at, actor, action, payload); the chain is over (prev_hash || separator || field1 || separator || ...) matching the trigger in schema/10_core.sql.
3. Modify a single entry in the middle of the chain. Verify your implementation detects the modification at every later entry.
4. Implement Merkle-tree inclusion-proof verification for a small tree (8–16 leaves). Verify that you can prove a leaf's membership using only $\log_2 n$ hashes.
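Mistake 4 in the list of common engineering mistakes (truncated hashes) is easy to demonstrate before starting the lab. The sketch below searches for two different messages whose SHA-256 digests share the same first 32 bits; the message format and function names are arbitrary choices for illustration.

```python
import hashlib
from typing import Dict, Tuple

def truncated_hash(data: bytes, n_bytes: int = 4) -> bytes:
    """First n_bytes of the SHA-256 digest (4 bytes = 32 bits)."""
    return hashlib.sha256(data).digest()[:n_bytes]

def find_collision(n_bytes: int = 4) -> Tuple[bytes, bytes, int]:
    """Hash distinct messages until two share a truncated digest."""
    seen: Dict[bytes, bytes] = {}
    counter = 0
    while True:
        msg = f"message-{counter}".encode("utf-8")
        tag = truncated_hash(msg, n_bytes)
        if tag in seen:
            return seen[tag], msg, counter + 1
        seen[tag] = msg
        counter += 1

first, second, attempts = find_collision()
print(first, second)
print(f"Collision on a 32-bit truncation after {attempts} hashes")
```

The birthday bound predicts a collision after roughly $2^{16}$ to $2^{17}$ attempts for a 32-bit output, which a laptop finishes almost instantly. The same search against the full 256-bit digest would need on the order of $2^{128}$ attempts.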

💡 Key Takeaways

- SHA-256 guarantees preimage resistance, second-preimage resistance, and collision resistance, but it does not hide the input (it is a commitment, not encryption), does not detect unauthorized reading, and will not agree across systems unless the byte input is canonicalized identically.
- The avalanche effect means any single-bit change to the input flips roughly half the output bits, so there is no such thing as a "close enough" hash match: two hashes either agree exactly or the bytes are different.
- The local-first principle requires hashing the raw source bytes on the custodian's machine before any cloud transfer or processing step; hashing after transcription, OCR, or re-encoding only proves the transformed version is intact, not the original.
- chain_hash in the DSSE envelope is the SHA-256 of the RFC 8785 canonical serialization of the full attestation (seal field excluded), binding the issuer to every field of every block simultaneously.
- Hashing for integrity (content hash in the Witness block) and hashing for identity (chain hash in the Seal block) serve different purposes; conflating them produces artifacts where a recipient can verify the content but cannot verify that the seal covers the same content.
## Exercises

### Warm-up

1. By hand (or with a debugger), compute the SHA-256 of the string "meridian" (UTF-8 encoded). Compare against hashlib.sha256(b"meridian").hexdigest(). Then change one byte (make it "Meridian", with a capital M) and count how many of the 256 output bits changed.
2. The chain hash in the Chapter 1 example is sha256:09a1...d2ef. Why does this format include the sha256: prefix rather than just the hex? What ambiguity does the prefix resolve, and what algorithm-substitution attack does it preempt?

### Core

3. Read the audit_log_hash_trigger() function in schema/10_core.sql. Write a Python function that replicates the trigger's computation for a list of audit rows. Apply it to at least five hand-crafted audit rows (construct them directly in your test script) and verify that modifying any row causes every subsequent chain hash to change.
4. Read Russ Cox, Transparent Logs for Skeptical Clients (https://research.swtch.com/tlog), through the section on tile-based logs. Reproduce his 13-record worked example: compute every internal hash and the root hash. Verify that an inclusion proof for record 7 requires exactly 4 hashes, not 13.
5. Implement Cryptopals Set 4 Problems 28 and 29 (a SHA-1 keyed MAC and its length-extension break): https://cryptopals.com/sets/4. The point is not the historical algorithm but the construction insight: why does $H(\text{key} \,\|\, M)$ leak information when you know $H$?

### Stretch

6. Read RFC 6962 (Certificate Transparency) §2 (the data structures). Implement a verifier in 50 lines of Python that consumes a Sigstore Rekor entry and verifies its inclusion in the public log. (Hands-on intro: https://edu.chainguard.dev/open-source/sigstore/rekor/an-introduction-to-rekor/)
7. Suppose Canon migrated from SHA-256 to SHA-3-256. What would have to change in the spec? In the implementation? In the audit-log trigger? What attack does this migration not defend against, and which post-quantum concern does it partially address?
## Build-your-own prompt

For your capstone corpus: identify three sources where the acquisition step is non-trivial, i.e., the bytes you receive may already have been transformed by an upstream system before you see them. For each, sketch how you would establish a chain of custody that meaningfully starts at your point of receipt rather than at the upstream source, and what evidence you would produce to show the local-first principle was honored. This exercise connects forward to the Admissibility Auditor module in Chapter 27.

## Further reading

- Russ Cox, Transparent Logs for Skeptical Clients, https://research.swtch.com/tlog. Required reading.
- FIPS 180-4 (SHA-256 specification), https://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.180-4.pdf.
- NIST SP 800-107r1, Recommendation for Applications Using Approved Hash Algorithms, https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-107r1.pdf.
- RFC 6962 (Certificate Transparency), https://www.rfc-editor.org/rfc/rfc6962.html.
- Stevens et al., The First Collision for Full SHA-1 (SHAttered), 2017, https://shattered.io/static/shattered.pdf.
- Cryptopals Set 4, https://cryptopals.com/sets/4. Problems 28 and 29 in particular.
- Soatok, Canonicalization Attacks Against MACs and Signatures (2021), https://soatok.blog/2021/07/30/canonicalization-attacks-against-macs-and-signatures/. Required reading for Chapter 7 but worth previewing now.
- The dossier research/01_cryptography_pedagogy.md in this repository.


Next: Chapter 6 — Digital Signatures (Ed25519). Where hashing becomes accountability.