Canonicalization (RFC 8785 JCS)

Two systems can agree on every field of a JSON document, every value, every type — and still disagree on the bytes. Canonicalization is what makes that disagreement detectable rather than catastrophic.

ℹ Prerequisites

Before reading this chapter, you should be comfortable with: Chapters 5–6 (Hashing, Signatures). Canonicalization feeds directly into the signing pipeline; you cannot understand PAE without understanding both hashing and signatures.

A bug that should not exist

A consortium of three organizations adopts Canon and stands up attestation pipelines: one in Python, one in Go, one in JavaScript. All three claim to implement RFC 8785. All three pass their internal test suites.

Then someone notices: a single attestation, hand-built with the same Witness, Findings, and Refutation blocks, produces different chain hashes in each language. The Python signature does not verify against the Go canonical bytes. The Go signature does not verify against the JavaScript canonical bytes. The implementations are conformant to themselves, and to nobody else.

The bug is real and recurring. The cyberphone reference suite for RFC 8785 ships with roughly 286,000 oracle vectors specifically to detect it. The reason it keeps happening is that canonicalization looks like a thirty-line problem and is actually a thousand-line problem, and the four hundred and seventy lines that look the same in every implementation hide the seven hundred lines that don't.

This chapter is the longest in Part II for that reason.

At a glance

RFC 8785 (JSON Canonicalization Scheme, JCS) defines a single byte stream that any conforming implementation must produce for a given JSON value, regardless of source language or runtime.
Canon uses RFC 8785 because the chain hash (R7) must be computable identically by any implementation; without canonicalization, the same logical attestation would have different signatures across languages.
Four classes of bug recur in implementations: UTF-16 vs UTF-8 sort order, ECMAScript number formatting at exponent boundaries, lone surrogate handling, negative-zero/NaN handling. Lab 7 walks you through each.

Learning objectives

Explain why canonicalization is necessary for any signed-JSON system.
Implement RFC 8785 from the spec, passing the cyberphone reference vectors.
Identify and reproduce the four bug classes; build a parser-mismatch attack against a "secure" attestation system.
Choose between RFC 8785, JWS (RFC 7515), COSE (RFC 9052), and JAdES for a given deployment.

The canonicalization problem, named

✻ Try This — Two equivalent JSONs that hash differently.

Open a Python REPL. Compute the SHA-256 of two JSON strings:
import hashlib, json
a = '{"b":2,"a":1}'
b = '{"a":1,"b":2}'
print(hashlib.sha256(a.encode()).hexdigest())
print(hashlib.sha256(b.encode()).hexdigest())
Compute json.loads(a) == json.loads(b). The Python dicts are equal. The strings are not. The hashes are not. The bytes are what the signature signs. If the signer used a and the verifier reconstructed b from a parsed copy, the signature does not validate.

Two systems round-tripping the same JSON produce textually different bytes for many reasons:

| Difference | Examples |
|---|---|
| Object key ordering | {"b":2,"a":1} vs {"a":1,"b":2} |
| Insignificant whitespace | {"a":1} vs { "a": 1 } |
| Number formatting | 1.0 vs 1 vs 1e0 vs 10e-1 |
| String escaping | "caf\u00e9" vs "café" |
| Trailing-zero dropping | 0.10 vs 0.1 |
| Encoding | UTF-8 vs UTF-16 vs Latin-1 |
| Line endings | LF vs CRLF |

A canonicalization scheme picks one representation for each JSON value and forbids all others. Two conforming implementations produce byte-identical output for the same logical input.

> ▼ Why It Matters. A signed attestation that does not canonicalize is signed against a moving target. Re-parse the JSON anywhere in the pipeline (display it, log it, ship it through a serializer that reorders keys) and the signature no longer verifies, even though nothing about the content changed. From the recipient's perspective, this is indistinguishable from tampering. From the issuer's perspective, this is a one-day production outage.

## RFC 8785 in detail

RFC 8785 was published in June 2020 as an Informational RFC on the Independent Submission stream. It builds on ECMA-404 (JSON syntax) and ECMA-262 (JavaScript number formatting). The specification is short (about thirty pages) and walks the reader through five rules. The full text is at https://www.rfc-editor.org/rfc/rfc8785.

### Rule 1: Object members are sorted by key

Object members are sorted by their key strings. The sort is lexicographic over UTF-16 code units, not Unicode code points, not UTF-8 bytes. This is the rule most often broken.
A naïve implementation that sorts by UTF-8 bytes (or by Python's default string ordering, which is by code point) will produce different output from a JavaScript implementation that sorts by UTF-16 code units as soon as a key containing a character above U+FFFF (emoji, mathematical symbols, CJK Extension B characters) has to be ordered against a key containing a character in the range U+E000–U+FFFF.

> ◆ Going Deeper: Why UTF-16 code units?
>
> JCS was designed to be implementable inside a JavaScript runtime using only the standard string operators. JavaScript strings are UTF-16 internally; sorting strings with < in JavaScript is sorting by UTF-16 code units. The spec authors took JavaScript's intrinsic behavior as the canonical reference, on the theory that JavaScript implementations would be the largest deployment surface and forcing them to do something unnatural would lead to fragmentation.
>
> The choice has consequences. For most strings, UTF-16 code units sort identically to Unicode code points. They diverge above U+FFFF: a supplementary character is a single code point (U+10000–U+10FFFF), but UTF-16 represents it as a surrogate pair, a high surrogate (U+D800–U+DBFF) followed by a low surrogate (U+DC00–U+DFFF). Because surrogate code units are numerically smaller than U+E000–U+FFFF, a supplementary character sorts before those later BMP characters in UTF-16 but after them by code point.
>
> A key beginning with U+10348 (the Gothic letter hwair) sorts before a key beginning with U+E000 (a Private Use Area character) under UTF-16 but after it under Unicode. Two implementations that disagree on this disagree silently: the bug never surfaces until you ingest a corpus that exercises the high-codepoint range.
>
> Cyberphone's test vectors deliberately include such cases.

### Rule 2: Strings are JSON-escaped per RFC 8259

JCS adopts RFC 8259 string escaping in its most restrictive form: two-character backslash sequences for ", \, and the control characters that have one (\b, \t, \n, \f, \r), and lowercase \u00xx for the remaining C0 control characters. Everything else is emitted literally. JCS forbids unnecessary escaping: a character that does not require \u escaping must not be escaped.
The non-trivial case: lone surrogates. A JavaScript string can contain unpaired surrogate code units (e.g., a single U+D800 with no following U+DC00–DFFF). RFC 8785 mandates that such strings cause termination: the canonicalizer fails. It does not silently substitute U+FFFD; it does not output \uD800; it errors out.

> ◆ Going Deeper: Lone surrogates as a security boundary.
>
> A canonicalizer that silently replaces lone surrogates with U+FFFD opens the door to a parser-mismatch attack. The signer canonicalizes input X, which contains a lone surrogate. The signer's library replaces it with U+FFFD and produces canonical bytes. The signature goes out. The verifier receives the canonical bytes and parses them; its library may pass the bytes through unchanged (the U+FFFD is already there) or, if it canonicalizes a re-parsed copy, produce different bytes from the input that was actually fed to the signer. The signature still verifies (the bytes are the same as what was signed), but the semantic content the verifier sees may differ from the semantic content the signer intended.
>
> RFC 8785's hard termination on lone surrogates exists precisely to defeat this. If the input contains one, the canonicalizer refuses to produce output, and the issuer is forced to clean its input before signing.

### Rule 3: Numbers are formatted per ECMAScript

This is the rule that produces the most surprising disagreements. RFC 8785 defers to ECMA-262 §7.1.12.1 (Number.toString), which specifies how a JavaScript runtime turns a 64-bit double into a string. The spec is precise:

- Use the shortest decimal representation that, when parsed back, yields the same double.
- Use exponential notation only when the absolute value is >= 1e21 or < 1e-6, with a lowercase e and an explicit sign on the exponent (1e+21, 1e-7).
- Fractional numbers between -1 and 1 retain the leading zero: 0.5, not .5. ECMAScript's Number.toString() and the rfc8785 library both produce the leading zero.
Many widely-used JSON libraries do not implement ECMA-262 number formatting. They implement Grisu2 (Google's float-to-string algorithm), Ryū (the modern successor to Grisu2), or simply printf("%g"). All three of these almost agree with ECMA-262, and diverge at exponent boundaries, in the last digit of subnormal numbers, and on the formatting of integers that cannot be represented exactly as doubles (above $2^{53}$).

✻ Try This — A number-formatting disagreement.

What does each of the following produce?
import json
import rfc8785

for x in [1e23, 1.0, 0.1 + 0.2, 1234567890123456789.0]:
    print(json.dumps(x), "vs", rfc8785.dumps(x).decode())
The two columns disagree at least once. The disagreement is the chapter's principal warning.
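The three formatting rules of Rule 3 can be sketched in standard-library Python. This is a minimal illustration, not the rfc8785 package's implementation: the helper name es_number is hypothetical, and the sketch assumes that Python's repr() yields the same shortest round-trip digits that ECMA-262 requires, an assumption that should be checked against the cyberphone number vectors before relying on it.

```python
import math
from decimal import Decimal


def es_number(x: float) -> str:
    """Format a double in the style of ECMA-262 Number::toString (sketch)."""
    if math.isnan(x) or math.isinf(x):
        # RFC 8785 has no representation for NaN or Infinity.
        raise ValueError("NaN and Infinity cannot be canonicalized")
    if x == 0:
        return "0"  # also covers -0.0
    sign = "-" if x < 0 else ""

    # Assumption: repr() gives the shortest digit string that round-trips.
    _, digit_tuple, exp = Decimal(repr(abs(x))).as_tuple()
    digits = "".join(map(str, digit_tuple))
    stripped = digits.rstrip("0")
    exp += len(digits) - len(stripped)  # fold trailing zeros into the exponent
    digits = stripped

    k = len(digits)  # number of significant digits
    n = k + exp      # position of the decimal point relative to the digits

    if k <= n <= 21:                      # integer, plain notation
        body = digits + "0" * (n - k)
    elif 0 < n <= 21:                     # decimal point inside the digits
        body = digits[:n] + "." + digits[n:]
    elif -6 < n <= 0:                     # small fraction, keeps the leading zero
        body = "0." + "0" * (-n) + digits
    else:                                 # exponential notation, signed exponent
        e = n - 1
        mantissa = digits if k == 1 else digits[0] + "." + digits[1:]
        body = f"{mantissa}e{'+' if e > 0 else '-'}{abs(e)}"
    return sign + body


for value in [1.0, 1e20, 1e21, 1e-6, 1e-7, 1234567890123456789.0, -0.0]:
    print(repr(value), "->", es_number(value))
```

Running it on the values from the exercise shows where json.dumps and the ECMAScript rules part ways: 1.0 becomes 1, and the large integer is written out in full rather than in exponential form.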

Rule 4 — Booleans, null, arrays, and structural punctuation

Booleans are true and false (lowercase). null is null (lowercase). Arrays preserve their input order; they are not sorted. Structural punctuation is a single byte each: {, }, [, ], ,, :. There is no whitespace anywhere. A canonical document is therefore a single line of bytes with no indentation, no spaces around colons or commas, no trailing newline. This makes canonical JSON unpleasant to read and is the price of canonicalization.

### Rule 5: UTF-8 output

The output is UTF-8. This is the only place encoding enters the specification: strings are sorted as UTF-16 code units, but the output bytes are UTF-8 encodings of those strings. A correct implementation must hold both representations in mind.

## The four bug classes

From dossier research/01_cryptography_pedagogy.md, the four classes of bug recur in every population of RFC 8785 implementations:

| Class | Surfaces when... | Detection |
|---|---|---|
| UTF-16 sort order | A key contains a character ≥ U+10000 | Cyberphone vectors above U+FFFF |
| ECMA-262 number formatting | Numbers near 1e21, 1e-6, or large integers | Cyberphone number-format vectors (~100k of them) |
| Lone surrogate handling | A string contains an unpaired surrogate (U+D800–U+DBFF or U+DC00–U+DFFF) | Cyberphone surrogate-handling vectors |
| Negative zero / NaN / Infinity | Float arithmetic produces -0.0, NaN, Infinity | Cyberphone numeric-edge vectors |

Each class is silent. A buggy implementation produces some output; the question is whether it produces the right output. Without an oracle, you cannot know.

> ☉ In the Wild: XML-DSig and the canonicalization-attack literature.
>
> The history of XML Digital Signatures (XML-DSig, 2002) is the cautionary tale every JCS implementer should know. XML-DSig allowed multiple canonicalization methods, declared outside the signed envelope. An attacker could change the canonicalization-method declaration without invalidating the signature; different canonicalizers produced different bytes for the same XML; the verifier accepted the change. The 2007 BlackHat presentation Taxonomy of Attacks against XML Digital Signatures by Brad Hill catalogs the resulting damage.
>
> RFC 8785 fixes this by mandating a single canonicalization (no negotiation) and by requiring implementations to be byte-identical to the JavaScript reference. Canon goes one step further: the seal block declares canonicalization: "rfc8785" *inside the signed payload*, so an attacker who substitutes a different canonicalizer must also forge a signature, which they cannot without the private key.
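Two of the bug classes listed in this section, UTF-16 sort order and lone surrogate handling, can be reproduced with the standard library alone. The sketch below is illustrative only, and the helper name utf16_sort_key is hypothetical. It relies on two facts: comparing big-endian UTF-16 byte strings is the same as comparing sequences of 16-bit code units, and Python's strict encoders refuse to encode lone surrogates.

```python
import json


def utf16_sort_key(key: str) -> bytes:
    """Sort key that orders strings by UTF-16 code units.

    Raises UnicodeEncodeError if the key contains a lone surrogate.
    """
    return key.encode("utf-16-be")


# A Private Use Area character and a supplementary-plane character.
keys = ["\ue000", "\U00010348"]

by_code_point = sorted(keys)  # Python's default string ordering
by_utf8_bytes = sorted(keys, key=lambda k: k.encode("utf-8"))
by_utf16_units = sorted(keys, key=utf16_sort_key)

print([hex(ord(k)) for k in by_code_point])   # ['0xe000', '0x10348']
print([hex(ord(k)) for k in by_utf16_units])  # ['0x10348', '0xe000']

# Python's JSON parser accepts an unpaired surrogate escape without complaint,
lone = json.loads('"\\ud800"')

# so a canonicalizer has to reject it explicitly instead of substituting U+FFFD.
try:
    utf16_sort_key(lone)
    rejected = False
except UnicodeEncodeError:
    rejected = True
print(rejected)  # True
```

Code-point order and UTF-8 byte order agree with each other and disagree with the UTF-16 order, which is why a canonicalizer that sorts keys with Python's default comparison passes every test that stays inside the Basic Multilingual Plane.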

What the repository actually does

# meridian/canon/canonicalize.py
import json
from typing import Any

import rfc8785

def canonicalize(obj: Any) -> bytes:
    return rfc8785.dumps(obj)

def canonicalize_for_seal(attestation: dict[str, Any]) -> bytes:
    if "seal" not in attestation:
        return canonicalize(attestation)
    seal_excluded = {k: v for k, v in attestation.items() if k != "seal"}
    return canonicalize(seal_excluded)

def roundtrip_check(obj: Any) -> bool:
    first = canonicalize(obj)
    parsed = json.loads(first.decode("utf-8"))
    second = canonicalize(parsed)
    return first == second

Three functions, twenty-odd lines. The work is delegated to the rfc8785 package (Trail of Bits) which implements the spec; the repository's contribution is the seal-exclusion discipline (the seal field is never in the input to the chain hash, even when present in the dict) and the round-trip check (a sanity test that canonicalize(parse(canonicalize(x))) == canonicalize(x)).
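A passing round-trip check shows that a canonicalizer is stable, not that it is conformant. As a hedged illustration, the hypothetical naive_canonicalize below (a json.dumps stand-in, not the rfc8785 package) passes the same kind of round-trip test while still emitting 1.0 where RFC 8785 requires 1.

```python
import json


def naive_canonicalize(obj) -> bytes:
    # Hypothetical stand-in: sorted keys, no whitespace, UTF-8 output.
    text = json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def naive_roundtrip_check(obj) -> bool:
    first = naive_canonicalize(obj)
    second = naive_canonicalize(json.loads(first.decode("utf-8")))
    return first == second


doc = {"b": 1.0, "a": 2}
print(naive_roundtrip_check(doc))  # True: the naive output is stable
print(naive_canonicalize(doc))     # b'{"a":2,"b":1.0}'

# RFC 8785 formats the double 1.0 as 1, so the conforming bytes differ.
expected_jcs = b'{"a":2,"b":1}'
print(naive_canonicalize(doc) == expected_jcs)  # False
```

The stand-in is internally consistent and still wrong, which is the case only an external oracle such as the cyberphone vectors can catch.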

The canonicalize → wrap → sign pipeline (v0.2.0)

JCS canonicalization does not operate in isolation. Its output feeds directly into the DSSE envelope described in Chapter 6. The full pipeline is:

%%| label: fig-jcs-dsse
%%| fig-cap: "From attestation JSON to signed DSSE envelope: three uses of canonical_bytes"
flowchart LR
    ATT["Attestation\nobject"]
    JCS["JCS RFC 8785\nCanonical JSON"]
    CB["canonical_bytes\n(deterministic UTF-8)"]

    B64["base64url_encode(canonical_bytes)\n→ envelope.payload"]
    SHA["SHA-256(canonical_bytes)\n→ envelope.chain_hash"]
    PAE["PAE(payload_type, canonical_bytes)\n→ Ed25519.sign() input"]
    SIG["Ed25519 signature\n→ envelope.signatures[0].sig"]

    ATT -->|"seal field excluded"| JCS
    JCS -->|"rfc8785.dumps()"| CB
    CB --> B64
    CB --> SHA
    CB --> PAE
    PAE --> SIG

    style CB fill:#C99E3E,color:#14181F,font-weight:bold
    style SIG fill:#10b981,color:#fff

                  ┌─────────────────────────────────────┐
  Attestation     │  1. JCS(attestation)                │
  (Python dict)   │     → canonical_bytes               │
                  │     (RFC 8785, seal field excluded)  │
                  └─────────────┬───────────────────────┘
                                │ canonical_bytes
                  ┌─────────────┼───────────────────────┐
                  │             │                        │
                  ▼             ▼                        ▼
         base64url_encode   SHA-256(canonical_bytes)  PAE(payload_type,
         (canonical_bytes)  → chain_hash              canonical_bytes)
              │              (convenience field)       → pae_bytes
              │                                             │
              ▼                                             ▼
       DSSEEnvelope.payload                    Ed25519.sign(pae_bytes)
                                                → DSSESignature.sig

In code:

import base64
from meridian.canon.canonicalize import canonicalize_for_seal
from meridian.canon.hashing import sha256_hex
from meridian.canon.signing import sign_dsse, CANON_PAYLOAD_TYPE
from meridian.canon.schema import DSSEEnvelope, DSSESignature

def wrap_and_sign(attestation: dict, private_key, keyid: str, public_key_url: str) -> DSSEEnvelope:
    # Step 1: JCS canonicalization
    canonical_bytes = canonicalize_for_seal(attestation)

    # Step 2a: base64url-encode for the envelope payload field
    payload_b64 = base64.urlsafe_b64encode(canonical_bytes).decode("ascii").rstrip("=")

    # Step 2b: SHA-256 for the chain_hash convenience field
    chain_hash = f"sha256:{sha256_hex(canonical_bytes)}"

    # Step 3: PAE → Ed25519 signature
    sig_b64 = sign_dsse(private_key, canonical_bytes)

    return DSSEEnvelope(
        payload_type=CANON_PAYLOAD_TYPE,
        payload=payload_b64,
        chain_hash=chain_hash,
        signatures=[DSSESignature(keyid=keyid, sig=sig_b64, public_key_url=public_key_url)],
    )
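The verifier's side of this pipeline can be sketched with the standard library. This is an illustration under stated assumptions, not the repository's verifier: the names pae and check_envelope_bytes and the payload type "application/example" are hypothetical, the PAE layout follows the public DSSE specification, and the Ed25519 check itself is left to a signature library (the returned bytes are what it would verify).

```python
import base64
import hashlib


def pae(payload_type: str, payload: bytes) -> bytes:
    """DSSE pre-authentication encoding: DSSEv1 LEN(type) type LEN(body) body."""
    type_bytes = payload_type.encode("utf-8")
    return b"DSSEv1 %d %b %d %b" % (len(type_bytes), type_bytes, len(payload), payload)


def check_envelope_bytes(payload_type: str, payload_b64: str, chain_hash: str) -> bytes:
    """Decode the payload, check chain_hash, and return the bytes to verify."""
    # Restore the base64url padding that was stripped when the envelope was built.
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)
    canonical_bytes = base64.urlsafe_b64decode(padded)

    recomputed = "sha256:" + hashlib.sha256(canonical_bytes).hexdigest()
    if recomputed != chain_hash:
        raise ValueError("chain_hash does not match the payload bytes")

    # The signature is checked over these bytes, never over re-serialized JSON.
    return pae(payload_type, canonical_bytes)


# Usage with a toy payload (values are illustrative only).
canonical = b'{"a":1}'
payload_b64 = base64.urlsafe_b64encode(canonical).decode("ascii").rstrip("=")
chain_hash = "sha256:" + hashlib.sha256(canonical).hexdigest()

to_verify = check_envelope_bytes("application/example", payload_b64, chain_hash)
print(to_verify)  # b'DSSEv1 19 application/example 7 {"a":1}'
```

The verifier never re-canonicalizes: it works from the decoded payload bytes, so a difference between its JSON library and the signer's cannot change what is hashed or verified.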

The critical insight: canonical_bytes is produced once and used three ways: encoded into payload, hashed into chain_hash, and fed into sign_dsse (which constructs PAE internally). These three uses are inseparable; any inconsistency between them is detected at verification because:

- A verifier decodes payload → canonical_bytes, recomputes SHA-256, and checks it against chain_hash (step 2 of the three-step protocol in Chapter 6).
- A verifier reconstructs PAE(payload_type, canonical_bytes) and calls Ed25519.verify (step 3).

If payload and chain_hash were computed from different bytes, step 2 would fail. If payload and the signature were computed from different bytes, step 3 would fail.

> ◆ Going Deeper: Why round-trip is necessary but insufficient.
>
> A round-trip check verifies the canonicalizer is idempotent: canonicalizing already-canonical bytes produces the same canonical bytes. This catches catastrophic implementation bugs where the canonicalizer's output is itself non-canonical. It does not catch consensus bugs where the implementation is internally consistent but disagrees with the spec.
>
> The cyberphone vectors are the consensus check. Run your implementation against them. Run every other RFC 8785 implementation you depend on against them. The vectors are the reference; if your implementation disagrees, your implementation is wrong.

## Competing schemes, and when to pick one

Canon picks RFC 8785 + raw Ed25519. The alternatives are not bad; they make different trade-offs.

### JWS (RFC 7515)

JOSE-family. Wraps the payload in a base64url-encoded envelope and signs the envelope. Sidesteps canonicalization entirely: the payload bytes are whatever the issuer says they are, and the recipient verifies the exact bytes that arrive. Trade-off: the envelope adds bytes; the payload is no longer human-readable JSON; algorithm-substitution attacks are a known JOSE-family failure mode.

### COSE (RFC 9052/9053)

CBOR Object Signing and Encryption.
Used by C2PA, FIDO2, and the EU Digital Identity Wallet. Same idea as JWS but with CBOR (binary) instead of JSON. Trade-off: smaller payloads (good for embedded/IoT/QR codes); not human-readable without tooling; deterministic CBOR encoding has its own canonicalization concerns.

### JAdES (ETSI TS 119 182-1)

JWS extended with ETSI Advanced Electronic Signature qualifiers (long-term-validity timestamps, signing-certificate references, archival metadata). The EU eIDAS-qualified signing format. Trade-off: complex; requires PKI infrastructure (CRLs, OCSP, qualified trust service providers); the right answer if you need cross-border EU court admissibility.

### Why Canon picks RFC 8785

Three reasons.

1. Minimal. The signed payload is the JSON itself, canonicalized. No envelope. The signature is over the chain hash, which is the hash of the canonical bytes. There is nothing to forget to include.
2. Human-readable. A judge, a journalist, or an opposing-counsel reviewer can open a Canon attestation in a text editor and read it. This is non-trivial; in evidentiary contexts, the artifact's intelligibility is itself a property.
3. Implementation-neutral. Any language with a JSON library and a SHA-256 implementation can verify a Canon attestation with about a hundred lines of code. JWS, COSE, and JAdES all require library support; Canon does not.

The cost: the four bug classes. Lab 7 is how you internalize them.

> § For the Record: Algorithm-substitution defense (Canon §10.1).
>
> The Seal block carries canonicalization: "rfc8785" and signature_algorithm: "ed25519" inside the signed payload. A recipient who fetches a Canon attestation cannot be tricked into verifying it under a different canonicalization scheme, because the scheme name is part of what was signed. JWS-family schemes have historically had this defense added later; Canon has it from v0.1.0.

## Lab 7: Six problems

The lab is in labs/ch07_canonicalization/.
The full problem statements are in labs/ch07_canonicalization/README.md; a brief summary follows.

The lab is structured as Cryptopals is structured: each problem teaches a primitive, and each later problem depends on the primitive learned in the prior. You implement the canonicalizer; you discover its bugs by running it against an oracle; you patch each bug; you build a parser-mismatch attack against a "secure" system.

| # | Title | Teaches |
|---|---|---|
| 7.1 | Naïve canonicalization | The baseline sort_keys=True approach and why it fails |
| 7.2 | UTF-16 vs UTF-8 sort order | Discovering the high-codepoint sort bug |
| 7.3 | ECMA-262 number formatting | Discovering the Grisu2 vs ECMA-262 disagreement |
| 7.4 | Lone surrogates | Discovering the silent-replacement attack surface |
| 7.5 | Negative zero, NaN, Infinity | Discovering the IEEE-754 edges |
| 7.6 | Parser-mismatch exploit | Building a working signature-verification bypass |

Problem 7.6 is, as of 2026, the only public exercise that walks an engineer through constructing a working attack on a signed-JSON system using a canonicalization disagreement. The lab provides two "implementations" in pre-built form; your job is to find the input that makes them disagree. Finishing 7.1–7.5 gives you a working RFC 8785 implementation; 7.6 trains the instinct: do not trust a canonicalizer until you have personally tried to break one.

In the running case (TPR proceeding 2024JC000099), Isabel's iMessage export is canonicalized before the chain hash is computed. If any parser in the pipeline (extraction worker, audit logger, or recipient verifier) produces different byte output from the same JSON, the chain hash will not match and the attestation will fail step 3 of the seven-step protocol. This is the canonical failure mode canonicalization is designed to prevent.

💡Key Takeaways

- JSON is not deterministic by default: two logically identical objects can produce different byte sequences due to key ordering, whitespace, number formatting, and encoding, which means a signature over the wire format is a signature over a moving target.
- RFC 8785 (JCS) guarantees a single byte stream for any given JSON value by mandating lexicographic-by-UTF-16-code-unit key ordering, RFC 8259 string escaping with hard termination on lone surrogates, ECMA-262 number formatting, and UTF-8 output.
- canonical_bytes is produced once and consumed three ways (base64url-encoded into DSSEEnvelope.payload, SHA-256-hashed into chain_hash, and PAE-wrapped before Ed25519 signing), so any inconsistency between those three uses fails verification.
- You must canonicalize BEFORE signing because the signing step covers the canonical bytes, not the in-memory dict; re-serializing after signing with a different serializer (or receiving the JSON through a prettifier) changes the bytes and invalidates the signature.
- If you re-serialize a canonical document and it does not produce the identical byte sequence, your canonicalization implementation is non-conformant. The round-trip check (canonicalize(parse(canonicalize(x))) == canonicalize(x)) is a necessary but insufficient guard, and only the cyberphone oracle vectors fully validate conformance.

## Exercises

### Warm-up

1. Read RFC 8785 in full. (It is thirty pages; an evening's work.) List three rules you suspect a quick implementation would get wrong on a first attempt.
2. Run python -c 'import rfc8785; print(rfc8785.dumps({"b":2,"a":1}))'. Verify the keys come out in the right order. Now do the same with {"ċ":1,"é":2}. The output should still be sorted; verify the order against UTF-16 code-unit comparison.

### Core

3. Implement Lab 7.1 and 7.2. Submit your patched implementation against the cyberphone test vectors that exercise the high-codepoint range. (The harness in the lab directory tells you which vectors those are.)
4. Read Soatok, Canonicalization Attacks Against MACs and Signatures (https://soatok.blog/2021/07/30/canonicalization-attacks-against-macs-and-signatures/). Identify two attacks Soatok describes that RFC 8785 does prevent and one it does not.
5. Create a Python dict {'b': 2, 'a': 1, 'c': {'z': 26, 'y': 25}}. Run rfc8785.dumps() on it. Compare the output to json.dumps(sort_keys=True). Identify at least one character in the output where the two diverge, and explain which RFC 8785 rule produces that difference.

### Stretch

6. Read RFC 9052 (COSE) and identify the structural differences between deterministic CBOR encoding and JCS. For an embedded device that must produce signed evidence (say, a body-worn sensor), which would you choose?
7. Implement Lab 7.6 (the parser-mismatch exploit). Document the input you found and the two implementations' divergent outputs. Submit a patch to one of them.

## Build-your-own prompt

For your capstone: enumerate every place your system touches the on-the-wire JSON of an attestation. For each, identify whether canonicalization is required (signing path, verification path) or forbidden (display, logging, transport). Write the discipline down as an ADR (Architecture Decision Record) in your capstone repo.

## Further reading

- RFC 8785 (JCS), https://www.rfc-editor.org/rfc/rfc8785. Read in full. The spec is short.
- DSSE spec (secure-systems-lab), https://github.com/secure-systems-lab/dsse/blob/master/spec.md. The envelope format that wraps JCS output in Canon v0.2.0.
- cyberphone reference implementation + test vectors, https://github.com/cyberphone/json-canonicalization. The 286k oracle vectors are here. Use them.
- Python rfc8785 (Trail of Bits), https://pypi.org/project/rfc8785/. The package the repository wraps.
- Soatok, Canonicalization Attacks Against MACs and Signatures (2021), https://soatok.blog/2021/07/30/canonicalization-attacks-against-macs-and-signatures/. Required reading.
- Brain on Fire, Preventing parser mismatch vulnerabilities (2022), https://www.brainonfire.net/blog/2022/04/29/preventing-parser-mismatch/. Lab 7.6 is built on this.
- Hill et al., Taxonomy of Attacks against XML Digital Signatures, BlackHat 2007. The history of canonicalization-attack literature.
- RFC 9052 (COSE Structures), https://datatracker.ietf.org/doc/rfc9052/.
- ETSI TS 119 182-1 v1.2.1 (JAdES), July 2024. For EU eIDAS-qualified signing.
- The dossier research/01_cryptography_pedagogy.md for the rest of the citation list.

Next: Chapter 8 — Schemas as Contracts. Where the wire format meets the type system.