Part II — CS Building Blocks · Chapter 19
Adversarial Validation & Tri-Model Consensus
An LLM that has not been disagreed with has not been tested. An LLM that disagrees only with versions of itself has been tested gently.
Prerequisites
Before reading this chapter, you should be comfortable with: Chapters 5–11 (all of Part II). Adversarial validation challenges claims produced by extraction; every primitive in Part II feeds into the refutation harness.
Chapter 11 produced structured claims. The SMS extractor found a custody exchange, assigned times, labeled the inference type, flagged three gaps. The output is clean JSON that validates against the Pydantic schema. The Canon walker accepts it.
Now ask the question that matters: is it true?
This is not the same question as is it well-formed? A perfectly structured claim can be fabricated. It can accurately describe the message text while omitting material context that would reverse its meaning. It can get the event right and the timestamp wrong. The schema does not protect you from any of these failure modes. Pydantic validates structure; it cannot validate truth.
The problem is specific and documented: a model asked to validate its own output has a structural sycophancy problem. It remembers what it produced and is more likely to defend that output than to challenge it. Asking the extractor "is this claim accurate?" is like asking the author of a brief to proofread it for factual errors.
You need a different model, asking a different question.
At a glance
- Tri-Model Consensus (TMC) puts each provisional claim before three architecturally distinct adversary models. The models vote independently; 2-of-3 agreement is consensus. 1-of-3 is a contested dissent, logged with detail.
- The Canon's R6 requirement mandates at minimum one Challenge entry and a `coverage.declined` list in every attestation's Refutation block — even when all challenges pass. The declined list declares which challenge types were not run and why.
- inspect-ai (`pip install meridian-canon[inspect]`) maps the UK AISI inspection framework to Canon's five challenge types. Use `run_adversarial_inspect()` from `meridian.refute.inspect_tasks`. Can run alongside native TMC, not instead of it.
- Langfuse (`pip install meridian-canon[langfuse]`) provides session-linked observability. Set `langfuse_session_id="<attestation_id>"` to tie every LM call to the attestation it serves. Set `LANGFUSE_PUBLIC_KEY`, `LANGFUSE_SECRET_KEY`, and `LANGFUSE_HOST` in production.
- Counter-evidence retrieval — retrieving potentially contradicting documents from the corpus before the challenger sees the claim — is the weakest link in factuality pipelines and must be handled with honesty about its limits.

## Learning objectives

After this chapter you can:

- Implement the five Canon challenge types as structured adversarial prompts.
- Build a TMC harness that runs three independent models and aggregates verdicts.
- Produce a complete Canon Refutation block, including the `coverage.declined` inventory.
- Identify and mitigate four categories of LLM-as-judge bias: position bias, length bias, self-preference, authority bias.
- Retrieve counter-evidence from a corpus and present it to a challenger model in a way that strengthens the challenge rather than confirming the claim.

---

## The structural sycophancy problem

Perez et al. (2022, "Sycophancy to Subterfuge") measured GPT-4's capitulation rate when a human confidently asserted a wrong answer after GPT-4 had produced a correct one. Baseline capitulation rate: approximately 20%. When GPT-4 was instructed to play devil's advocate and challenge the human's position: approximately 4%. The adversarial instruction did not just remind the model to be careful — it structurally changed its behavior. This finding has direct implications for validation design.
If you ask the same model that produced a claim to assess whether the claim is well-supported, you are relying on a model that has a known tendency to defend positions it has already staked out. The capitulation experiment measures the reverse — human pressure on a model — but the underlying mechanism is the same: models weight recent context heavily, and recent context in a validation task includes the claim being validated.

The solution the field has converged on is adversarial posture plus architectural diversity. A model told to assume the claim is wrong and look for evidence against it behaves differently from one told to assess accuracy. A model from a different training lineage has a different prior over what plausible claims look like. Neither guarantee is strong in isolation. Together, they produce measurably better performance than self-assessment.

> ☉ In the Wild — Perez et al. and the adversarial instruction.
>
> The 2022 paper "Sycophancy to Subterfuge: Investigating Reward Tampering in Language Models" (Perez et al., Anthropic) was primarily about reward hacking. But its measurement of capitulation rate under human pressure became one of the most-cited findings in the LLM-as-judge literature. The 20%-vs-4% split between standard and adversarial instruction modes is not a cherry-picked result — it replicates across model families and question types. The paper's recommendation: evaluators should be explicitly instructed to challenge, not assess. "Find the error in this claim" produces a different distribution of outputs than "evaluate this claim."
>
> The TMC harness in Meridian-Cannon implements this recommendation literally. Each challenger model receives a system prompt that opens: "You are an adversarial fact-checker. Your job is not to confirm this claim but to identify its weakest point and attempt to falsify it."

## The five challenge types

Canon R6 defines five challenge types.
Every attestation's Refutation block must declare which types were run and which were declined.

FABRICATION — The claim asserts something not present in the cited observations. The SMS says "I'll be there at 4 not 3." A fabrication challenge looks for claims that assert things the SMS does not say: a specific date, a specific child's name, a specific location. If the claim says "the parent confirmed pickup of [child] at [address]" and the SMS contains no such specificity, the challenge verdict is FAIL.

OMISSION — The claim accurately describes what the observations contain but omits material context that would change its meaning. The SMS exchange might include a later message, retrieved in the same chunk, that says "actually never mind, I'll stick to 3." A claim derived only from the first message without the second is technically accurate about the first message but materially incomplete. OMISSION challenges require the full context window, not just the observation the claim supports.

DISTORTION — The claim accurately describes the observations and does not omit material context, but mischaracterizes their significance. "The parent confirmed the rescheduling" when the SMS shows a proposal, not a confirmation, is a distortion. The word "confirmed" implies agreement from both parties; the evidence shows only one party's statement. Distortion challenges require the challenger to understand pragmatics — what the words imply, not just what they say.

TEMPORAL_ERROR — The correct event is described but the timestamp or causal ordering is wrong. "The parent sent the rescheduling message before the scheduled exchange time" is a temporal claim. If the message was sent after the exchange was supposed to happen, the temporal claim is false regardless of whether the rescheduling itself is accurately described.

ATTRIBUTION_ERROR — The correct event is described at the correct time, but the wrong actor is credited. "The parent proposed the change" versus "the caseworker proposed the change" — the message might make this ambiguous or might make it clear. Attribution errors are particularly common when documents are written in passive voice or when phone numbers have not been mapped to legal names.

> § For the Record — Canon §9.1 (R6, Refutation requirement).
>
> "Every Attestation MUST include a RefutationBlock. The RefutationBlock MUST contain at least one Challenge entry and a Coverage object. The Coverage object MUST include a `declined` list that names every challenge type not executed and provides a machine-readable reason for each declination. A RefutationBlock that contains only passing challenges is not a sign of a strong attestation — it is a sign of thorough coverage."

## Tri-Model Consensus

The TMC harness runs three models on each claim. The models are chosen from different architectural lineages. Meridian-Cannon's defaults:

| Slot | Model | Family |
|---|---|---|
| Challenger A | `llama-3.1-8b-instruct` | Meta Llama |
| Challenger B | `mistral-7b-instruct-v0.3` | Mistral |
| Challenger C | `gemma-2-9b-it` | Google Gemma |
```{mermaid}
%%| label: fig-tmc
%%| fig-cap: "Tri-Model Consensus: three models from distinct families vote on each claim"
flowchart LR
    CLAIM["Provisional\nClaim + Source excerpt"]
    subgraph PANEL ["Adversary Panel — three distinct model families"]
        M1["Model M₁\n(llama family)\nOllamaAdapter"]
        M2["Model M₂\n(mistral family)\nOllamaAdapter"]
        M3["Model M₃\n(gemma / gpt)\nOpenAIAdapter"]
    end
    VOTE["Majority Vote\n≥ 2 of 3 agree"]
    OUT_S([✅ SURVIVED\nClaim supported])
    OUT_F([❌ FAILED\nClaim refuted])
    OUT_R([⚠ REVISED\nClaim weakened])
    CLAIM --> M1
    CLAIM --> M2
    CLAIM --> M3
    M1 --> VOTE
    M2 --> VOTE
    M3 --> VOTE
    VOTE --> OUT_S
    VOTE --> OUT_F
    VOTE --> OUT_R
    style OUT_S fill:#10b981,color:#fff
    style OUT_F fill:#ef4444,color:#fff
    style OUT_R fill:#f59e0b,color:#14181F
    style PANEL fill:#F2EDE2,stroke:#D9D2C2
```
Each challenger receives: (1) the claim text, (2) the observations the claim cites, (3) any counter-evidence retrieved from the corpus, and (4) a system prompt specifying the challenge type. The challenger returns a structured verdict:
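```python
from typing import Literal
from pydantic import BaseModel, Field

class ChallengeVerdict(BaseModel):
    # One challenger's structured answer for a single challenge type.
    challenge_type: Literal["FABRICATION", "OMISSION", "DISTORTION", "TEMPORAL_ERROR", "ATTRIBUTION_ERROR"]
    verdict: Literal["pass", "fail", "inconclusive"]
    reasoning: str = Field(description="One paragraph. Cite specific text.")
    confidence: float = Field(ge=0.0, le=1.0)  # challenger's self-reported confidence
```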
Aggregation rule: if 2 or more challengers return `fail`, the challenge outcome is `fail`. If all three return `pass`, the outcome is `pass`. Every other combination (2 pass and 1 fail, or any `inconclusive` verdict without two fails) produces an `inconclusive` outcome. Inconclusive challenges are logged as contested and require human review before the claim enters any court-facing document.

> ◆ Going Deeper — The statistical argument for 2-of-3.
>
> At per-model accuracy of 70% on a binary classification task, the probability that at least 2 of 3 independent models reach the correct answer is:
>
> P(2-of-3 correct) = C(3,2) × 0.7² × 0.3 + C(3,3) × 0.7³ = 0.441 + 0.343 = 0.784
>
> 2-of-3 consensus reaches 78.4% accuracy from three 70%-accurate models. The improvement over a single model (70%) is modest — about 8 percentage points — but the improvement in recall of errors is higher: the single model that catches an error when the other two miss it is precisely the heterogeneity the different architectures provide.
>
> Matching that figure with 3-of-3 unanimity would require roughly 92% per-model accuracy (p³ = 0.784 gives p ≈ 0.92), an impractical threshold. The value of TMC is not in the unanimity; it is in the independent challenge. A claim that survives two independent challengers with heterogeneous training is a much stronger claim than one that survived only its own maker.
>
> Note: this argument assumes model independence. If the three models share a training corpus, a fine-tuning distribution, or an RLHF dataset, they may share systematic biases. Choosing models from genuinely different architectural lineages is not aesthetic preference — it is the condition under which the independence assumption holds approximately.

## LLM-as-judge biases

Running a language model as a judge of another model's output introduces known biases. Ignoring them produces a validation layer that is less reliable than the one it is checking. Four biases matter here:

Position bias. Models tend to favor the first option presented.
When the challenger receives the claim and the counter-evidence in a fixed order, its verdict may reflect the order rather than the content. Mitigation: randomize the order in which the claim and its evidence are presented. For TMC, swap the order between challengers A, B, and C and check whether the verdicts are stable.

Length bias. Models often judge longer responses as more authoritative. A claim with extensive supporting prose may be rated as more credible than an identical claim with brief support, even if the prose adds no information. Mitigation: normalize the evidence presentation to a fixed-length format before presenting to the challenger.

Self-preference bias. A model tends to prefer outputs from its own family. Llama-family models may systematically favor claims produced by Llama-family extractors. This is the primary reason for using architecturally distinct challengers. Cross-checking with TMC only helps if the challenger families are genuinely distinct.

Authority bias. Challenger models may be influenced by framing that implies the claim comes from an authoritative source. "A court-appointed expert found that..." makes the claim harder to challenge than "A caseworker noted that...", even if the underlying evidence is identical. Mitigation: strip source-authority markers from the claim before presenting to the challenger.

> ▼ Why It Matters — Biased judges in custody proceedings.
>
> In a 2026 TPR proceeding, the evidence might include a DCFS worker's report that describes a home visit, a parent's text messages, and a guardian ad litem's interview notes. These sources carry different perceived authority — official reports versus personal communications. A validation layer that treats a caseworker's interpretive conclusion as more credible simply because it appears in an official report is not providing adversarial validation; it is reproducing the authority hierarchy of the original documents.
>
> TMC's authority-bias mitigation strips source labels before challenge. The challenger sees the text, not the letterhead. Whether the statement came from a court-appointed expert or from a parent's diary entry is material to a court's assessment of credibility — but it should not bias the challenger's assessment of whether the claim is factually supported by the cited observations.

## Counter-evidence retrieval

The strongest version of a challenge is not an LLM questioning a claim from memory. It is an LLM questioning a claim while holding a specific document from the corpus that potentially contradicts it.

Counter-evidence retrieval searches the evidence corpus for documents that might refute the claim. The search is different from the retrieval in Chapters 9 and 10. Those retrievals search for support — documents that are likely relevant to a query. Counter-evidence retrieval searches for contradiction — documents that assert something incompatible with the claim.

This is a harder problem. Retrieval systems are optimized for SUPPORT accuracy because that is what most applications need. Refutation accuracy — the ability to surface a document that says "no, that is wrong" — lags significantly in every 2025–2026 benchmark. The Berkeley MAST taxonomy identifies "counter-evidence retrieval failure" as one of the 14 system-level failure modes in multi-agent evaluation pipelines.

Three strategies are in use:

Falsification-Verification Alignment RAG (FVA-RAG) — a retrieval approach that adds a negation query alongside the affirmation query and takes the union. For the claim "the parent rescheduled the exchange to 4 PM," the negation query might be "exchange was not rescheduled" or "exchange remained at 3 PM." The negation query often has different lexical characteristics than the affirmation query and retrieves different documents.

Self-Consistency RAG (SC-RAG) — runs the same claim through multiple retrieval queries with varied phrasing and checks whether the retrieved documents consistently support or oppose the claim. High variance in retrieved-document sentiment is itself a signal of a contested claim.

Corrective RAG (CRAG) — introduces a retrieval evaluator that assesses whether retrieved documents are actually relevant before presenting them to the challenger. Documents that are superficially similar but do not address the claim's specific assertion are filtered out.

Meridian-Cannon's counter-evidence retrieval is FVA-RAG with CRAG filtering. It is honest about its limits: when counter-evidence retrieval fails to find contradicting documents, the gap record says so. The absence of counter-evidence is not evidence of support; it is evidence that the corpus does not contain a refutation — which is weaker.

> ◆ Going Deeper — Why REFUTE lags SUPPORT in benchmarks.
>
> The information-retrieval literature has optimized for precision and recall on relevance — does this document address the query? Contradiction is a different relation. A document that says "the exchange took place at 3 PM as scheduled" contradicts the claim "the exchange was moved to 4 PM," but the two documents share vocabulary (exchange, time, PM, custody) that makes their vector representations similar. A cosine-similarity retrieval may rank them as equally relevant to both the affirmation and the contradiction query.
>
> The 2024 BEIR benchmark's "refutation" track shows that retrieval systems trained on MSMARCO (affirmation-dominated) underperform on refutation tasks by 15–25% on nDCG@10. FVA-RAG closes about half that gap by explicitly training the retrieval model on refutation pairs. Meridian-Cannon uses the BEIR refutation track fixtures as part of the Lab 12 evaluation.

## The R6 requirement: coverage and declined challenges

The most important part of R6 is what looks like bookkeeping: the `coverage.declined` list.
Every attestation's Refutation block has a `coverage` object listing the challenge types that were run and, in `declined`, the types that were not — with a machine-readable reason for each.

Suppose FABRICATION and OMISSION both pass. The Refutation block shows two passing challenges. Without a `declined` list, a future verifier cannot distinguish "we ran all five types and all passed" from "we ran two types and forgot about the other three." The `declined` list makes that distinction explicit and falsifiable.

A challenge type in `declined` with reason "RESOURCE_CONSTRAINT: third challenger model unavailable at ingestion time" tells the verifier exactly what was done and why. A challenge type in `declined` with reason "NOT_APPLICABLE: source document is a phone record with no asserted event, temporal claims, or attributed statements" explains why the challenge type does not apply to this attestation.
This is the Canon's surgical checklist. A checklist is valuable not because surgeons forget things, but because the completed checklist is evidence that the procedure was performed in a specific order with specific items confirmed. A checklist recording only what succeeded is less valuable than one recording everything considered — including what was determined inapplicable.
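The accounting rule above (every challenge type is either run or declined, and every declination carries a reason) is easy to check mechanically. The following is a minimal sketch, assuming the `coverage` object has the `run` and `declined` keys used in this chapter's worked example; the function name `coverage_problems` is hypothetical and is not part of the Canon schema or the Meridian codebase.

```python
# Hypothetical helper: not part of Canon or Meridian. It only illustrates the
# "run or declined, never silent" rule described in the text.
CHALLENGE_TYPES = (
    "FABRICATION", "OMISSION", "DISTORTION", "TEMPORAL_ERROR", "ATTRIBUTION_ERROR",
)


def coverage_problems(coverage: dict) -> list[str]:
    """Return a list of accounting problems; an empty list means complete coverage."""
    run = list(coverage.get("run", []))
    declined = list(coverage.get("declined", []))
    declined_types = [entry.get("challenge_type") for entry in declined]
    problems = []

    # Names outside the five Canon challenge types are reported, not ignored.
    for name in run + declined_types:
        if name not in CHALLENGE_TYPES:
            problems.append(f"unknown challenge type: {name!r}")

    # Each type must appear in exactly one of the two lists.
    for name in CHALLENGE_TYPES:
        in_run, in_declined = name in run, name in declined_types
        if in_run and in_declined:
            problems.append(f"{name} is listed as both run and declined")
        elif not in_run and not in_declined:
            problems.append(f"{name} is neither run nor declined")

    # A declination without a reason tells a verifier nothing.
    for entry in declined:
        if not str(entry.get("reason", "")).strip():
            problems.append(f"{entry.get('challenge_type')} is declined without a reason")
    return problems
```

A block that ran two challenge types and says nothing about the other three yields three problems, which is exactly the "forgot about the other three" case the declined list exists to rule out.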
✻ Try This — Run a single challenge manually.
Take any factual claim you have recently encountered in a document: a news article, a legal brief, a meeting summary. Write it down as a single sentence.
Now write a FABRICATION challenge prompt: "The following claim is being evaluated. Your task is to determine whether the claim asserts anything not directly supported by the cited source text. Assume the claim is fabricated until proven otherwise. Here is the claim: [CLAIM]. Here is the source text: [SOURCE]. Find the weakest point."
Run this prompt through any available LLM. Note the output.
Now run it a second time, through a different LLM if possible, or with a different system prompt if not. Do the two challengers agree? If one finds a fabrication and the other does not, what does that tell you about the claim? About the challengers?
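If you want to repeat the exercise across several models, the prompt can be templated. The sketch below reuses the wording of the FABRICATION prompt above and adds the position-bias mitigation discussed earlier by letting the caller swap the order of claim and source. The names `build_fabrication_prompt` and `prompts_for_panel` are illustrative assumptions, not Meridian APIs.

```python
# Illustrative only: the instruction wording comes from the exercise above;
# the function names and the alternating-order policy are assumptions.
INSTRUCTION = (
    "The following claim is being evaluated. Your task is to determine whether "
    "the claim asserts anything not directly supported by the cited source text. "
    "Assume the claim is fabricated until proven otherwise."
)


def build_fabrication_prompt(claim: str, source: str, claim_first: bool = True) -> str:
    """Assemble one FABRICATION prompt with a chosen presentation order."""
    claim_part = f"Here is the claim: {claim}"
    source_part = f"Here is the source text: {source}"
    parts = [claim_part, source_part] if claim_first else [source_part, claim_part]
    return "\n\n".join([INSTRUCTION, *parts, "Find the weakest point."])


def prompts_for_panel(claim: str, source: str, panel_size: int = 3) -> list[str]:
    """Alternate the order across challengers so order-sensitive verdicts stand out."""
    return [
        build_fabrication_prompt(claim, source, claim_first=(i % 2 == 0))
        for i in range(panel_size)
    ]
```

If two challengers that saw opposite orders disagree, rerun each with the order flipped before concluding that the disagreement is about the claim itself.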
## Working example: a complete Refutation block
The custody rescheduling claim from Chapter 11:
Claim: "Sender 715-555-0183 proposed rescheduling the custody pickup from an earlier time (stated as 'not 3') to 4:00, via SMS on 2026-01-09 at 13:41:22."
Inference type: `inferred_high`
The TMC harness runs a FABRICATION challenge on this claim. All three challengers receive the claim, the original SMS text, and any retrieved counter-evidence (none retrieved: the corpus contains no other message from this sender on this date).
```json
{
  "refutation": {
    "challenges": [
      {
        "challenge_id": "chl_001",
        "challenge_type": "FABRICATION",
        "targets": ["clm_001"],
        "verdict": "pass",
        "verdicts_by_model": [
          {
            "model": "llama-3.1-8b-instruct",
            "verdict": "pass",
            "confidence": 0.88,
            "reasoning": "The claim's assertion of a schedule change is directly supported by 'I'll be there at 4 not 3.' The sender identity (phone number) is directly present. The timestamp is from message metadata, not the message body. No fabricated elements found."
          },
          {
            "model": "mistral-7b-instruct-v0.3",
            "verdict": "pass",
            "confidence": 0.82,
            "reasoning": "Message text supports the time-change claim. 'Proposed' is a reasonable characterization of a unilateral statement; the source does not confirm two-party agreement. No fabricated elements, though 'earlier time' is an inference."
          },
          {
            "model": "gemma-2-9b-it",
            "verdict": "pass",
            "confidence": 0.91,
            "reasoning": "Claim is supported. Note: 'custody pickup' is an inference from 'pickup'; no explicit custody reference in source. This is an OMISSION candidate, not a fabrication."
          }
        ],
        "consensus": "pass",
        "counter_evidence_retrieved": []
      }
    ],
    "coverage": {
      "run": ["FABRICATION"],
      "declined": [
        {
          "challenge_type": "OMISSION",
          "reason": "DEFERRED: full thread context not yet ingested; OMISSION challenge requires complete message thread. Scheduled for re-run after thread ingestion completes.",
          "decided_by": "runner_v0.3.1",
          "timestamp": "2026-01-09T14:00:00Z"
        },
        {
          "challenge_type": "DISTORTION",
          "reason": "DEFERRED: same dependency as OMISSION.",
          "decided_by": "runner_v0.3.1",
          "timestamp": "2026-01-09T14:00:00Z"
        },
        {
          "challenge_type": "TEMPORAL_ERROR",
          "reason": "NOT_APPLICABLE: claim does not assert causal ordering of events; timestamp is from message metadata (objective) not from message body content.",
          "decided_by": "runner_v0.3.1",
          "timestamp": "2026-01-09T14:00:00Z"
        },
        {
          "challenge_type": "ATTRIBUTION_ERROR",
          "reason": "RESOURCE_CONSTRAINT: sender phone number not yet resolved to legal name in actor registry. Re-run after actor resolution.",
          "decided_by": "runner_v0.3.1",
          "timestamp": "2026-01-09T14:00:00Z"
        }
      ]
    }
  }
}
```
Three things to notice in this example:
One: the FABRICATION challenge passes, but Gemma-2's reasoning identifies a potential OMISSION issue — "custody pickup" is inferred from "pickup." This observation is not the FABRICATION challenge's verdict, but it is captured in the reasoning field. The runner extracts it and pre-populates the OMISSION challenge's counter-evidence field for when that challenge runs.
Two: the declined entries are specific about why they were declined. "DEFERRED" means the challenge type is applicable but cannot run yet. "NOT_APPLICABLE" means the challenge type does not apply to this specific claim. "RESOURCE_CONSTRAINT" means the system could not run it due to a missing dependency. Each reason class has a different implication for how urgently the challenge needs to be run before the attestation is used in court.

Three: the entire block is machine-readable. A future verifier can parse the declined list, identify any "DEFERRED" entries, and flag the attestation as incomplete pending the deferred challenges. This is not a feature of the UI — it is a property of the data structure.

> ▼ Why It Matters — What opposing counsel will do with this.
>
> An opposing attorney reviewing a Canon attestation in discovery will read the `coverage.declined` list first. If the declined list is absent — if the attestation shows only passing challenges with no accounting of what was not checked — the attorney's motion to exclude will argue that the attestation was selectively validated: run the challenges you know will pass, skip the ones you are less sure about.
>
> A complete declined list, with specific machine-readable reasons for each declination, defeats that argument. It shows not that the system ran all challenges and all passed — that would be extraordinary — but that the system ran the challenges it could run, documented why it could not run the others, and committed to a schedule for running them. This is the forensic posture. It is not a guarantee of accuracy; it is a guarantee of honest accounting.

---

> ◆ Going Deeper — Phase C implementation status (Canon v0.2.0).
>
> As of Canon v0.2.0, the Tri-Model Consensus runner described in this chapter is aspirational. The schema and Refutation block structure — the `challenges` array, the `coverage.declined` inventory, the `ChallengeVerdict` Pydantic model, the R6 enforcement in `meridian/canon/schema.py`, the inspect-ai task bindings in `meridian/refute/inspect_tasks.py`, and the Langfuse observability decorators — are fully implemented and enforced at construction time. What is not yet implemented is the actual adversarial prompt generation and the multi-model consensus runner: `refute/harness.py` and `refute/runner.py` do not exist in the current codebase. This is Phase C work.
>
> Current attestations can be produced with manually constructed refutation blocks: write the challenges and `coverage.declined` entries by hand, following the structure in this chapter's worked example. A manually constructed refutation block is valid under R6 — Canon does not require the challenges to have been run by an automated harness, only that they be present, structurally correct, and honest about what was not run. The lab deliverables in Lab 12 are forward-looking exercises; they describe what the runner will do when Phase C is implemented.
Before relying on any attestation in court-facing documents, verify that its refutation block was produced by an implemented runner (once Phase C ships) or, if manually constructed, that the decline reasons honestly reflect the actual validation coverage. An attestation with a manually constructed refutation block that misrepresents coverage as automated is a Canon conformance violation under R6.
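The reason classes in the worked example (`DEFERRED`, `NOT_APPLICABLE`, `RESOURCE_CONSTRAINT`) are what make the block checkable by a program. A minimal sketch of the verifier pass described above follows. It assumes each reason string begins with its class followed by a colon, as in the example, and it treats both `DEFERRED` and `RESOURCE_CONSTRAINT` as still pending. That second choice is an assumption made for illustration; the names `reason_class` and `pending_challenges` are hypothetical.

```python
# Assumption: a declination whose reason class is in this set still owes a
# challenge run before court use. Canon may draw the line differently.
PENDING_CLASSES = {"DEFERRED", "RESOURCE_CONSTRAINT"}


def reason_class(reason: str) -> str:
    """Extract the class prefix from a reason such as 'DEFERRED: thread not ingested'."""
    return reason.split(":", 1)[0].strip()


def pending_challenges(refutation: dict) -> list[tuple[str, str]]:
    """List (challenge_type, reason_class) pairs that keep an attestation incomplete."""
    pending = []
    for entry in refutation["coverage"]["declined"]:
        cls = reason_class(entry["reason"])
        if cls in PENDING_CLASSES:
            pending.append((entry["challenge_type"], cls))
    return pending
```

Applied to the worked example, this would report OMISSION and DISTORTION as deferred and ATTRIBUTION_ERROR as resource-constrained, while TEMPORAL_ERROR, declined as not applicable, would not be reported.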
## inspect-ai integration
The UK AI Safety Institute's inspect-ai framework provides a structured evaluation harness that maps naturally onto Canon's five challenge types. Install it alongside the Meridian-Cannon refutation layer:
```bash
pip install meridian-canon[inspect]
```
The integration is in meridian.refute.inspect_tasks. Each Canon challenge type is implemented as an inspect-ai task:
```python
from meridian.refute.inspect_tasks import run_adversarial_inspect

result = run_adversarial_inspect(claim=claim, observations=observations)
# Returns InspectRefutationResult with:
#   per_task: list[ChallengeOutcome] — one per challenge type
#   consensus: str — "pass" | "fail" | "inconclusive"
```
The InspectRefutationResult is compatible with the Canon Refutation block schema — it can be serialized directly into the challenges array. Running inspect-ai produces per-task ChallengeOutcome objects for each of the five challenge types, plus an aggregate consensus.
inspect-ai runs alongside native TMC, not instead of it. Use inspect-ai when you want structured evaluation logs in the AISI format (e.g., for audit trail purposes or for comparison against UK AISI's published benchmark results). Use TMC when you want multi-model heterogeneous challenger architecture. A conformant attestation may include results from both.
## Langfuse observability
Langfuse provides session-linked tracing for every LM call in the refutation layer. Install it with:
```bash
pip install meridian-canon[langfuse]
```
Then set three environment variables in production:
```bash
LANGFUSE_PUBLIC_KEY=pk-lf-...
LANGFUSE_SECRET_KEY=sk-lf-...
LANGFUSE_HOST=https://cloud.langfuse.com  # or your self-hosted instance
```
The `@_lf_observe(name="meridian.refute.lm.complete")` decorator is already applied to `LiteLLMAdapter.complete()` and `OllamaAdapter.complete()`. When Langfuse is configured, every LM call produces a trace in the Langfuse dashboard showing: the attestation ID it serves (via `langfuse_session_id="<attestation_id>"`), the model name, the prompt sent, the output received, and the latency.

The session ID linkage is the key property: every LM call for a given attestation is grouped under that attestation's ID in the dashboard. If an attestation's TMC run produces an inconclusive verdict, you can open the Langfuse session, read the three challengers' prompts and outputs, and diagnose which model diverged and why.

If Langfuse is not installed or the environment variables are not set, the adapters function normally — the decorator is a no-op.

## Implementation sketch

The harness in `refute/harness.py` follows this structure:
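```python
# refute/harness.py — simplified.
# CanonClaim, Observation, ChallengeType, ChallengerModel, Document, and
# ChallengeResult are project types; their imports are omitted here.
from __future__ import annotations  # keeps the project-type annotations lazy

import asyncio


async def run_tmc_challenge(
    claim: CanonClaim,
    observations: list[Observation],
    challenge_type: ChallengeType,
    challengers: list[ChallengerModel],
    counter_evidence: list[Document],
) -> ChallengeResult:
    # Run every challenger concurrently against the same inputs.
    verdicts = await asyncio.gather(*[
        challenger.challenge(claim, observations, counter_evidence, challenge_type)
        for challenger in challengers
    ])

    # Two or more fails refute the claim; only a unanimous pass supports it.
    fail_count = sum(1 for v in verdicts if v.verdict == "fail")
    if fail_count >= 2:
        consensus = "fail"
    elif verdicts and all(v.verdict == "pass" for v in verdicts):
        # The `verdicts and` guard stops an empty panel from counting as a pass.
        consensus = "pass"
    else:
        consensus = "inconclusive"

    return ChallengeResult(
        challenge_type=challenge_type,
        verdicts_by_model=verdicts,
        consensus=consensus,
        counter_evidence_retrieved=[d.document_id for d in counter_evidence],
    )
```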
The challengers run concurrently. Each runs against the same claim and observations, independently. The aggregation logic is outside any individual challenger — it sees the verdicts, not the reasoning behind them.
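Because the aggregation only looks at verdict labels, it can be pulled out as a pure function and checked against a small truth table without calling any model. The sketch below mirrors the rule in the harness. The `panel_size` guard, which returns `inconclusive` when fewer verdicts arrive than the panel expects, is an added assumption rather than something the simplified listing does; the function name `aggregate` is likewise illustrative.

```python
from collections import Counter


def aggregate(verdicts: list[str], panel_size: int = 3) -> str:
    """Combine per-model verdict labels ('pass', 'fail', 'inconclusive') into one outcome."""
    counts = Counter(verdicts)
    # Two independent fails are enough to refute, whatever the third model says.
    if counts["fail"] >= 2:
        return "fail"
    # A pass needs the full panel and unanimity (assumption: a short panel
    # is never treated as support).
    if len(verdicts) == panel_size and counts["pass"] == panel_size:
        return "pass"
    # Everything else is contested and goes to human review.
    return "inconclusive"


# Truth table for the cases discussed in this chapter.
assert aggregate(["pass", "pass", "pass"]) == "pass"
assert aggregate(["fail", "fail", "pass"]) == "fail"
assert aggregate(["pass", "pass", "fail"]) == "inconclusive"
assert aggregate(["pass", "pass", "inconclusive"]) == "inconclusive"
assert aggregate(["pass", "pass"]) == "inconclusive"
```

Keeping the rule separate from the async calls makes the asymmetry explicit: a majority can refute a claim, but only a unanimous full panel can support it.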
The runner (`refute/runner.py`) decides which challenge types to run and which to decline. The decision is per-claim, per-attestation. A claim with no temporal assertions gets TEMPORAL_ERROR declined as NOT_APPLICABLE. A claim against a document type where actor identity is always ambiguous (SMS from an unresolved phone number) gets ATTRIBUTION_ERROR declined as RESOURCE_CONSTRAINT pending actor resolution.

## Lab 12 — Build the five-challenge harness

The lab is in `labs/ch12_adversarial_validation/`.

Deliverable 1 — Challenger implementation. Implement `ChallengerModel` for at least two model families. Each challenger must accept a claim, a list of observations, a list of counter-evidence documents, and a challenge type. It must return a `ChallengeVerdict` with verdict, reasoning, and confidence.

Deliverable 2 — TMC aggregation. Implement `run_tmc_challenge()` as shown above. Run it on a claim from Lab 11's output. Verify that 2-of-3 fail verdicts produce consensus fail, and that mixed verdicts produce consensus inconclusive.

Deliverable 3 — Complete Refutation block. For the same claim from Lab 11: run FABRICATION challenge. Produce a complete Refutation block JSON with the challenge result and a `coverage.declined` list for the four challenge types you did not run. Write a machine-readable reason for each declination.

Deliverable 4 — Counter-evidence retrieval. Implement FVA-RAG counter-evidence retrieval for the claim. Run the negation query against the lab fixture corpus. If no counter-evidence is found, record this in the challenge's `counter_evidence_retrieved` field as an empty list. Verify the claim's behavior changes (or does not) when you manually inject a contradicting document as counter-evidence.

Acceptance criteria: `pytest labs/ch12_adversarial_validation/test_lab.py` passes. The produced Refutation block validates against the Canon schema. The declined list contains all five challenge types minus the one that was run. Running the attestation through `meridian-canon walk` passes step 6 (refutation targets resolve).
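As a starting point for Deliverable 4, the negation-query idea from the counter-evidence section (run an affirmation query and one or more negation queries, then take the union) can be sketched with a toy retriever. Everything here is an assumption for illustration: the token-overlap scorer stands in for whatever retriever the lab fixture provides, and the names `top_k` and `counter_evidence_candidates` are hypothetical. It does not include the CRAG relevance filter.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens; a stand-in for a real tokenizer or embedding."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def top_k(query: str, corpus: dict[str, str], k: int) -> list[str]:
    """Rank documents by token overlap with the query; drop documents with no overlap."""
    q = tokens(query)
    scored = [(len(q & tokens(body)), doc_id) for doc_id, body in corpus.items()]
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))
    return [doc_id for _, doc_id in ranked[:k]]


def counter_evidence_candidates(
    affirmation: str, negations: list[str], corpus: dict[str, str], k: int = 3
) -> list[str]:
    """Union of the affirmation and negation result lists, de-duplicated in rank order."""
    seen: set[str] = set()
    merged: list[str] = []
    for query in [affirmation, *negations]:
        for doc_id in top_k(query, corpus, k):
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged
```

The union only produces candidates. Whether a retrieved document actually contradicts the claim is still a judgment for the relevance filter and the challenger, and an empty result should be recorded as an empty `counter_evidence_retrieved` list rather than read as support.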
inconclusive rather than pass — the least-prejudicial outcome — because a missing challenger is a resource constraint that should be disclosed in `coverage.declined`, not silently counted as support.
- inspect-ai maps the UK AISI inspection framework to Canon's five challenge types (FABRICATION, OMISSION, DISTORTION, TEMPORAL_ERROR, ATTRIBUTION_ERROR), producing structured evaluation logs in the AISI format that run alongside native TMC rather than replacing it.
- Langfuse session-linking ties every LM call in the refutation layer to the attestation it serves via `langfuse_session_id="<attestation_id>"`, creating an audit trail that lets a reviewer open a specific attestation's session and see the exact prompts, outputs, and latency for every challenger that contributed to the Refutation block.

`coverage.declined` list for a claim extracted from a voicemail recording. The voicemail is a 45-second message left by a caseworker. TEMPORAL_ERROR and FABRICATION were run. Write decline reasons for the three remaining challenge types.

### Stretch

6. Implement counter-evidence retrieval using the negation query strategy. For the claim "the parent sent the rescheduling message on January 9, 2026," construct the negation query, run it against the `labs/ch12_adversarial_validation/` fixture corpus, and evaluate whether the retrieved documents are genuinely contradictory or only superficially similar. Measure precision at 3 for both the affirmation query and the negation query.
7. Design and implement a consensus override rule. The current 2-of-3 rule treats all models equally. Design a weighted voting scheme where a model's weight is a function of its calibration score on a held-out challenge set. Justify the calibration procedure: how do you produce a held-out challenge set with known-correct verdicts?

## Build-your-own prompt

For your capstone: which of the five challenge types are applicable to your corpus's primary document types, and which will you decline? Write the machine-readable reason for each decline now — the format is the `coverage.declined` entry, with `challenge_type`, `reason`, and `decided_by`. This is not an academic exercise. These entries go directly into your first attested documents. Writing them now forces you to think honestly about what your validation layer is and is not checking.

## Further reading

- inspect-ai (UK AISI): https://inspect.ai-safety-institute.org.uk. Install: `pip install meridian-canon[inspect]`. Canon integration: `meridian.refute.inspect_tasks`.
- Langfuse observability: https://langfuse.com. Install: `pip install meridian-canon[langfuse]`. Session ID linkage: `langfuse_session_id="<attestation_id>"`.
- Perez et al., "Sycophancy to Subterfuge: Investigating Reward Tampering in Language Models," Anthropic (2022): https://arxiv.org/abs/2310.13548
- Du et al., "Improving Factuality and Reasoning in Language Models through Multiagent Debate," MIT (2023): https://arxiv.org/abs/2305.14325
- Berkeley MAST (Multi-Agent System Failure Taxonomy), "Why Do Multi-Agent LLM Systems Fail?" (2025): https://arxiv.org/pdf/2503.13657
- "Talk Isn't Always Cheap: Debating Multi-Agent Systems" (2025): https://arxiv.org/pdf/2509.05396
- FVA-RAG, "Falsification-Verification Alignment for Retrieval-Augmented Generation" (2024): https://arxiv.org/html/2512.07015
- Thinking Machines Lab, "Defeating Nondeterminism in LLM Inference" (September 2025): https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/
- Research dossier `research/04_adversarial_llm_eval.md` — the full literature survey underlying this chapter.
- Canon §9 (R6, Refutation requirement) and §14 (the seven-step verification protocol).
Next: Chapter 13 — The Seven-Layer Pipeline. Chapters 9 through 12 have built the retrieval, extraction, and validation layers individually. Chapter 13 assembles them.