From Evidence to Verifiability: Rebuilding Trust in AI Outputs 🔏
⏰ TLDR
This work shows that the hardest part of using AI in high-trust environments is not the model, but the policy. Once editorial policy is made explicit and executable, AI systems become interchangeable; the real challenge is engineering reliable measurements and deterministic enforcement of those policies. This reframes AI reliability as a policy and measurement problem, not a model problem.
📋 Summary
AI systems are becoming deeply embedded in how we research, write, and reason. At the same time, their use in high-trust environments is under strain — not because models are incapable, but because they are being deployed into settings that demand determinism, provenance, and enforceable rules.
This reveals a fundamental mismatch.
We are applying stochastic systems — designed for exploration, synthesis, and creative reframing — inside deterministic environments that require justification, traceability, and procedural compliance.
Large language models are stochastic by design. That stochasticity is not a flaw; it is the source of their power.
But in environments such as encyclopedias, medicine, finance, law, and programming, the primary question is not whether an answer sounds right. It is whether the output can be verified, sourced, and defended under explicit policy.
The core idea explored in this post is simple:
AI does not need to become deterministic. It needs to be bounded by deterministic policy.
Instead of attempting to make models “tell the truth,” we treat stochastic generation as an upstream capability and move reliability downstream — into software systems that decide what is allowed to pass.
In this framing, hallucination is not a binary failure. It is a measurable form of semantic drift: a signal that indicates how far a generated claim has moved beyond what its evidence strictly supports.
This post shows how combining explicit policy enforcement with semantic diagnostics produces hard, deterministic outcomes without retraining models, prompt engineering, or recursive AI verification.
By placing a policy-driven bounding box around stochastic generation, AI outputs become usable again in rigid, high-reliability settings — not because the model is trusted, but because the system is controlled.
🗺️ How This Post Is Structured
The post proceeds in three parts:
- Policy First. We make the rules explicit. We encode editorial policies as executable constraints and show how the same claims can be accepted or rejected purely by changing policy, without touching the model or the data.
- Stochasticity Meets the Gate. We introduce AI into the system, first under tightly constrained conditions, then under increasing epistemic risk. This reveals where stochastic generation and deterministic verification collide, and why that collision is structural, not accidental.
- Measuring the Boundary. We introduce a diagnostic signal, hallucination energy, that measures semantic drift between claims and evidence. This metric does not decide truth. It quantifies how much of the model’s stochastic “superpower” is being exercised, and whether policy allows it.
Hallucination energy is a measure of how much a claim’s meaning deviates from its source evidence.
The result is not a claim that AI has been “fixed.”
It is a demonstration that reliability is not a property of models alone. It is a property of systems and systems can enforce rules.
We keep claims modest, results transparent, and assumptions explicit.
🎬 Act I: Making Verifiability Explicit
Act I defines the problem before AI enters the picture.
We take Wikipedia’s editorial rules (verifiability, sourcing, and provenance) and make them explicit and executable. We show that acceptance or rejection is not a matter of truth alone, but of policy.
Using the FEVEROUS dataset, we demonstrate that:
- the same claims,
- backed by the same evidence,
- evaluated by the same code,

can be accepted or rejected purely by changing policy.
This act establishes the core premise of the post:
Verifiability is policy-relative, not truth-absolute.
No AI is involved yet. That is intentional.
Without an explicit policy gate, hallucination cannot be meaningfully measured.
🔍 Why This Matters Now
Recent discussions around AI, particularly in high-trust environments such as research publishing, regulated industries, and institutional knowledge systems, often focus on whether models can be trusted to tell the truth.
That framing misses the real issue.
Large language models are stochastic systems. They are designed to explore, generalize, and synthesize, not to operate under rigid institutional constraints by default.
As a result, AI is frequently excluded from precisely the environments where its capabilities would be most valuable, not because it is useless, but because it is ungoverned.
The question, then, is not whether AI can be made perfectly reliable.
The question is:
Can we introduce software discipline around stochastic systems in a way that allows their participation in high-reliability environments without compromising those environments?
This post argues that we can.
By applying explicit, deterministic policy to AI outputs after generation, we gain three critical capabilities:
- Control: AI behavior can be bounded without changing the model itself.
- Quality: Outputs can be filtered, rejected, or accepted based on enforceable rules rather than plausibility.
- Discipline: Established software engineering practices (contracts, gates, and hard cut-offs) re-enter the system.
In this framing, AI remains stochastic. The system becomes deterministic.
That separation is the key move, and it is what enables AI to operate safely and usefully in contexts where it would otherwise be prohibited.
🎯 What This Blog Post Demonstrates
Rather than arguing abstractly, we focus on a specific, testable scenario.
In this post, we will:
- Take a real, publicly available dataset used in AI evaluation
- Apply a clear, executable verifiability policy
- Measure how many AI-supported claims pass or fail under that policy
- Show how changing process, not models, changes outcomes
The goal is not to achieve perfection, but to demonstrate that reliability gains are not marginal.
Even with a small amount of code and careful engineering, measurable improvements emerge.
🧪 Why Wikipedia Is the Right Stress Test
Wikipedia represents one of the most demanding real-world environments for AI-generated content, not because it demands perfect accuracy, but because it enforces explicit, non-negotiable editorial policy.
On Wikipedia, a claim must not only be plausible or correct. It must be:
- verifiable by independent, reliable sources,
- defensible under written editorial rules,
- and explainable to human reviewers after the fact.
Fluency does not count as evidence. Plausibility does not count as justification.
These constraints are not informal norms. They are codified in long-standing, publicly documented policies, including:
- Verifiability: “Verifiability, not truth, is required.” Content must be attributable to reliable published sources, regardless of whether it is factually correct.
- No Original Research: Editors may not synthesize sources to introduce claims or relationships not explicitly stated.
- Reliable Sources: Provenance hierarchy matters; circular citation and citation laundering are explicitly disallowed.
Together, these policies create a procedural filter, not an epistemic one. Content can be accurate and still be rejected if it cannot be justified under policy.
That is precisely why Wikipedia is an ideal stress test.
If an AI-assisted process can operate here, producing outputs that survive procedural constraints rather than merely sounding correct, it is likely to generalize elsewhere. If it fails here, the failure is informative rather than surprising.
This post does not argue for changing Wikipedia’s standards, nor does it treat Wikipedia as a judge of AI quality.
Instead, it takes Wikipedia’s policies as a fixed design constraint and asks a narrower, more practical question:
Can stochastic AI generation be harnessed in a way that reliably satisfies explicit institutional rules without weakening those rules or asking humans to “trust” the output?
The rest of this post is an exploration of that question.
⚡ Stochastic Power in Deterministic Systems
Large language models are stochastic by nature. That stochasticity is not a defect; it is the core reason these systems are useful at all.
It enables:
- exploration of idea space,
- synthesis across sources,
- reframing and compression,
- and occasionally, genuinely novel insight.
Nearly all meaningful progress in generative AI since 2017 has come from embracing this property, not suppressing it.
The problem is not that AI systems “hallucinate.”
The problem is that we are attempting to deploy a curved instrument inside environments that demand hard edges.
High-trust systems (encyclopedias, finance, medicine, law, programming) operate under square constraints:
- binary acceptance,
- explicit provenance,
- enforceable rules,
- and zero tolerance for ambiguity at the point of publication.
When stochastic systems are placed directly into these environments, failure is inevitable. This is not a model failure. It is a systems mismatch: a square-peg, round-hole problem.
🎯 Our Claim
This post does not claim that:
- AI can replace human editors,
- generative models are inherently trustworthy,
- or that a small experiment solves a hard problem.
What we claim is narrower:
Stochastic generation can be made usable in deterministic environments if acceptance is governed by explicit, enforceable policy.
In other words:
AI systems should generate possibilities. Software systems should decide what is allowed to pass.
This is a software engineering problem, not a philosophical one.
🔄 From “Hallucination” to Managed Signal
The term hallucination is entrenched in the AI literature, and we will use it here for familiarity. But it is misleading.
What is commonly called hallucination is better understood as semantic overreach: the model exercising its stochastic capacity beyond what a given body of evidence strictly supports.
That capacity is not something we want to eliminate. It is something we want to measure, bound, and govern.
In this post, we show two things:
- Policy alone can override AI confidence. The same claim may be accepted or rejected purely by changing policy, regardless of how plausible it sounds.
- Semantic drift can be measured independently of policy. We introduce a diagnostic signal, hallucination energy, that quantifies how far a generated claim moves beyond its supporting evidence. This signal does not decide truth. It characterizes risk.
Taken together, these allow us to do something important:
contain stochastic power without destroying it.
LLM output: ~~~~~~~~≈~~~~~≈~~~~~
Policy applied: ████ ████ ████
Hallucination energy is not a correctness metric; it is a control signal that only has meaning relative to policy boundaries.
🚀 Where This Leads
The goal of this post is not to “fix AI.”
It is to show, using Wikipedia as a concrete, unforgiving test case, that trust in AI outputs is not a mystery problem.
It is an engineering problem.
And engineering problems can be solved by separating concerns, enforcing boundaries, and treating stochasticity as a capability to be managed rather than a flaw to be removed.
With that framing in place, we can now introduce AI into the system and examine precisely and empirically what breaks, what holds, and why.
📊 The Dataset: FEVEROUS as a Wikipedia Substrate
We use FEVEROUS as the evaluation substrate for one reason: it mirrors how Wikipedia verification actually works.
FEVEROUS is built directly on Wikipedia pages and encodes:
- human-written claims,
- explicit evidence references (sentences, table cells, headers),
- and annotated reasoning traces.
That structure matters more than the labels.
Our goal is not to predict SUPPORTS or REFUTES. It is to test whether a claim paired with evidence can survive Wikipedia-style verifiability gates.
For that purpose, FEVEROUS is ideal.
📝 What a FEVEROUS Example Actually Looks Like (and How We Use It)
To make the setup concrete, here is a simplified view of a single FEVEROUS entry as it appears in our pipeline:
{
  "id": 7389,
  "claim": "Algebraic logic has five logical systems and Lindenbaum–Tarski algebra provides models of propositional modal logics.",
  "label": "REFUTES",
  "evidence": [
    {
      "content": [
        "Algebraic logic_sentence_0",
        "Lindenbaum–Tarski algebra_sentence_1",
        "Algebraic logic_cell_0_1_1"
      ],
      "context": {
        "Algebraic logic_sentence_0": ["Algebraic logic_title"],
        "Lindenbaum–Tarski algebra_sentence_1": ["Lindenbaum–Tarski algebra_title"],
        "Algebraic logic_cell_0_1_1": [
          "Algebraic logic_title",
          "Algebraic logic_section_4",
          "Algebraic logic_header_cell_0_0_1"
        ]
      }
    }
  ],
  "annotator_operations": [...],
  "challenge": "Multi-hop Reasoning"
}
Several details matter for this work:
- Evidence is explicitly referenced, not inferred.
- Context preserves page titles, sections, and table structure.
- Claims and evidence are already grounded in a human editorial process.
This allows us to ask a precise, operational question:
Given this claim and this evidence, would the claim pass a Wikipedia-style verifiability gate if evaluated as executable policy?
That is the only question this dataset is used to answer in this post.
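To make the data handling concrete, here is a minimal sketch of how a local FEVEROUS snapshot could be read and reduced to the fields used in this post. The file path, helper name, and sample limit are illustrative assumptions, not part of the actual pipeline; the field names follow the example entry above.
# Sketch: load a local FEVEROUS JSONL snapshot and keep only the fields this
# post uses. Path, function name, and limit are illustrative assumptions.
import json
from pathlib import Path

def load_feverous_samples(path: str = "data/feverous_train.jsonl", limit: int = 3000):
    samples = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            if not record.get("claim"):
                continue  # skip header or empty rows
            samples.append({
                "id": record.get("id"),
                "claim": record["claim"],
                "label": record.get("label"),
                # evidence IDs such as "Algebraic logic_sentence_0"
                "evidence_ids": [
                    eid
                    for group in record.get("evidence", [])
                    for eid in group.get("content", [])
                ],
            })
            if len(samples) >= limit:
                break
    return samples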
What We Are and Are Not Measuring
We are not evaluating:
- model accuracy
- factual truth
- reasoning quality
- or label prediction performance
Those are valid research problems, but they are not the problem here.
Instead, we isolate a narrower systems question:
Can a claim–evidence pair survive deterministic editorial constraints once those constraints are made explicit and executable?
That distinction is critical.
A claim can be true and still unverifiable. It can be plausible and still unsourced. It can be supported in a dataset and still fail institutional review.
FEVEROUS gives us a realistic substrate to explore that gap and to test whether policy enforcement alone can account for many of the failures attributed to AI in high-trust environments.
⏰ Why This Matters Before We Involve an LLM
At this stage, no large language model is doing any reasoning.
That is intentional.
Before introducing AI into the loop, we first establish:
- the dataset
- the constraints
- the verification policy
- and the failure modes
Only once that foundation is solid does it make sense to ask whether AI can improve outcomes rather than obscure them.
That transition, from raw evidence to verifiable claims to AI-assisted filtering, is what the rest of this post demonstrates.
💡 The Core Idea: From Evidence to Verifiability
Before introducing AI into the loop, we first make the process explicit.
The key insight is simple:
AI outputs should not be trusted by default; they should be gated by executable verification rules, the same way production software is gated by tests.
To make this concrete, we reduce the problem to a small, inspectable pipeline.
🧪 What We Are Testing
- A claim (from FEVEROUS)
- A Wikipedia page referenced as evidence
- A verifiability policy derived from Wikipedia editorial rules
The system does not attempt to reason, summarize, or rewrite. It simply asks: does this claim pass the rules?
🤖 Why This Comes Before Any AI
At this stage, no large language model is involved.
That is intentional.
Before introducing stochastic generation, we first establish:
- the evidence format,
- the verification constraints,
- the editorial policy,
- and the policy enforcement mechanism.
Only once that foundation is fixed does it make sense to ask whether AI can improve outcomes rather than obscure them.
This ordering matters.
If policy is implicit, AI failure looks like model failure. If policy is explicit, AI failure becomes a systems question.
That distinction underpins the rest of this post.
📊 System Flow
Here is the entire process, end to end.
flowchart TD
A[["📊 FEVEROUS Dataset<br/>Local JSONL"]] --> B["⚗️ Claim + Evidence Extraction"]
B --> C["🔍 Wikipedia Page Resolver"]
C --> D["📥 Wikipedia Page Fetch"]
D --> E["🚨 Verifiability Gate<br/>(Executable Policy)"]
E -->|✅ Supported| F["🟢 PASS<br/>Claim is Verifiable"]
E -->|❌ Not Supported| G["🔴 FAIL<br/>Claim Rejected"]
E -->|⚠️ Ambiguous| H["🟡 UNCLEAR<br/>Needs Human Judgment"]
classDef dataset fill:#e3f2fd,stroke:#1565c0,stroke-width:2px,color:#0d47a1
classDef process fill:#e8f5e8,stroke:#2e7d32,stroke-width:2px,color:#1b5e20
classDef retrieval fill:#fff3e0,stroke:#f57c00,stroke-width:2px,color:#e65100
classDef gate fill:#fce4ec,stroke:#c2185b,stroke-width:2px,color:#880e4f
classDef pass fill:#e8f5e8,stroke:#388e3c,stroke-width:2px,color:#1b5e20
classDef fail fill:#ffebee,stroke:#d32f2f,stroke-width:2px,color:#b71c1c
classDef unclear fill:#fff3e0,stroke:#ff8f00,stroke-width:2px,color:#e65100
class A dataset
class B process
class C,D retrieval
class E gate
class F pass
class G fail
class H unclear
✅ Why This Works
There are three design choices here that matter.
1️⃣ Evidence Comes First
The system never starts with a model output.
It starts with:
- a human-written claim
- a human-annotated evidence reference
- a real Wikipedia page
This mirrors how verification actually happens in editorial systems.
2️⃣ Verifiability Is Enforced as Code
The Verifiability Gate is the heart of the system.
It encodes rules like:
- Is the citation primary or secondary?
- Is the claim directly supported by the cited content?
- Does the evidence establish the claimed relationship?
- Is this a synthesis that requires editorial judgment?
These are not probabilistic checks. They are deterministic, auditable decisions.
If the claim fails, it fails for a reason, and that reason is logged.
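As a rough sketch of what one such check can look like (the rule, domain list, and function name here are illustrative assumptions, not the actual gate code), a deterministic check returns both a verdict and its reason:
# Sketch: one deterministic policy check with an auditable reason.
# PRIMARY_SOURCE_DOMAINS and check_primary_source are illustrative names.
from urllib.parse import urlparse

PRIMARY_SOURCE_DOMAINS = {"archive.org", "loc.gov"}  # assumption, for illustration

def check_primary_source(citation_url: str, policy: str) -> tuple[bool, str]:
    """Return (passed, reason). Same input + same policy => same answer."""
    domain = urlparse(citation_url).netloc.lower()
    if policy == "wikipedia.strict" and domain.endswith("wikipedia.org"):
        return False, "CITATION LAUNDERING: strict policy rejects Wikipedia citing itself"
    if policy == "wikipedia.strict" and domain not in PRIMARY_SOURCE_DOMAINS:
        return False, f"Non-primary source rejected under strict policy: {domain}"
    return True, "Source acceptable under active policy"

passed, reason = check_primary_source(
    "https://en.wikipedia.org/wiki/Johannes_Voggenhuber", "wikipedia.strict"
)
print(passed, reason)  # False, with a citation-laundering reason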
3️⃣ AI Is Deliberately Absent
This is important enough to state explicitly.
At this stage:
- No LLM generates text
- No LLM evaluates truth
- No LLM assigns labels
This is intentional.
If the verification layer is weak, AI will only amplify the weakness. If the verification layer is strong, AI becomes a force multiplier instead of a liability.
That transition comes later.
🎯 What This Diagram Is Really Showing
This is not just a pipeline.
It’s a boundary.
Everything to the left of the gate is input. Everything to the right of the gate is trust.
Our claim, and the premise of the upcoming paper, is that this boundary can be made precise, enforced in software, and scaled.
Once that boundary exists, AI becomes usable again.
📊 System Flow (Detailed)
Here is the same process again, in more detail, end to end.
flowchart TD
subgraph "📂 Dataset Loading"
A[["📊 FEVEROUS Dataset<br/>Local JSONL"]]
A --> B["⚗️ Extract Claim + Evidence"]
B --> C["🔄 Normalize to Wikipedia Structure"]
end
subgraph "🌐 Source Retrieval"
C --> D["🔍 Resolve Wikipedia Page"]
D --> E["📥 Fetch Page Content<br/>via Wikimedia API"]
end
subgraph "🚨 Verifiability Gate"
F["📥 Input: Claim + Context"] --> G{"🔎 Primary Source Check?"}
G -- "📘 Strict Policy" --> H["🚫 Citation Laundering Detection"]
G -- "📗 Standard Policy" --> I["⭐ Reputability Assessment"]
H --> J{"⏰ Temporal Drift?"}
I --> J
J -- "🕰️ Outdated" --> K["❌ Reject: Temporal Drift"]
J -- "✅ Current" --> L{"📈 Overstatement Detection?"}
L -- "📊 Overstated" --> M["❌ Reject: Exceeds Source Support"]
L -- "🎯 Accurate" --> N{"🔍 Direct Support?"}
N -- "✅ Direct Match" --> O["🟢 Supported (Direct)"]
N -- "🔄 Paraphrase" --> P["🟢 Supported (Close Paraphrase)"]
N -- "❓ Ambiguous" --> Q["🟡 Unclear: Needs Human Judgment"]
N -- "❌ No Support" --> R["🔴 Not Supported"]
end
subgraph "📊 Research Metrics Collection"
O --> S["📉 False Positive Reduction"]
P --> S
K --> T["📅 Temporal Drift Rate"]
M --> U["📈 Overstatement Rate"]
R --> V["👻 Plausible-Absent Rate"]
end
E --> F
S --> W["📈 Final Metrics Report<br/>📋 JSON Results"]
T --> W
U --> W
V --> W
%% Color Coordination System
classDef dataset fill:#e3f2fd,stroke:#1565c0,stroke-width:2px,color:#0d47a1
classDef process fill:#e8f5e8,stroke:#2e7d32,stroke-width:2px,color:#1b5e20
classDef retrieval fill:#fff3e0,stroke:#f57c00,stroke-width:2px,color:#e65100
classDef gate fill:#fce4ec,stroke:#c2185b,stroke-width:2px,color:#880e4f
classDef check fill:#e8eaf6,stroke:#3f51b5,stroke-width:2px,color:#283593
classDef reject fill:#ffebee,stroke:#d32f2f,stroke-width:2px,color:#b71c1c
classDef supported fill:#e8f5e8,stroke:#388e3c,stroke-width:2px,color:#1b5e20
classDef unclear fill:#fff3e0,stroke:#ff8f00,stroke-width:2px,color:#e65100
classDef metrics fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px,color:#4a148c
%% Apply Colors
class A,B,C dataset
class D,E retrieval
class F,H,I,J,L,N gate
class G check
class K,M,R reject
class O,P supported
class Q unclear
class S,T,U,V,W metrics
%% Emoji Legend
style W stroke-width:3px
📰 A Minimal Test Case: One Claim, Three Editorial Policies
Before introducing any AI generation at all, we start with a deliberately simple experiment.
We take one factual claim, one real Wikipedia citation, and run it through our verification gate three times, changing only the editorial policy.
Nothing else changes:
- Same claim
- Same citation
- Same code
- Same execution environment
This lets us isolate a crucial distinction that is often lost in AI discussions:
Truth, evidence, and publishability are not the same thing.
📄 The Claim
For illustration, we use a straightforward biographical claim backed by a real Wikipedia page:
Claim:
"Johannes Voggenhuber was an Austrian politician and a former spokesperson for the Green Party."
Citation:
https://en.wikipedia.org/wiki/Johannes_Voggenhuber
Most humans would intuitively accept this as true. Wikipedia itself presents it as such. But whether it is accepted depends entirely on editorial rules.
🧪 Running the Same Claim Under Three Policies
Below is a simplified version of the test we ran. The only variable is the policy parameter.
from verity_core.context.execution import ExecutionContext
from verity_core.infra.integrations import Integrations
from verity_core.db.stores.memory import Memory

# Same setup as the integration test shown later in this post
memory = Memory(db_url="sqlite:///:memory:")
integrations = Integrations(memory=memory)

claim = "Johannes Voggenhuber was an Austrian politician and a former spokesperson for the Green Party."
citation_url = "https://en.wikipedia.org/wiki/Johannes_Voggenhuber"

for policy in [
    "wikipedia.editorial",
    "wikipedia.standard",
    "wikipedia.strict",
]:
    ctx = ExecutionContext(params={"policy": policy})
    result = integrations.wikipedia.invoke(
        "wikipedia.citation.verify",
        "verify",
        {
            "claim": claim,
            "citation_url": citation_url,
            "context_snippet": claim,
        },
        context=ctx,
    )
    print(f"Policy: {policy}")
    print(f"Verdict: {result['support_label']}")
    print(f"Confidence: {result['confidence_score']}")
    print(f"Warning: {result.get('warning')}")
    print()
📊 The Results
Running this test produces the following outcomes:
Policy: wikipedia.editorial
Verdict: supported
Confidence: 0.95
Warning: None
Policy: wikipedia.standard
Verdict: supported
Confidence: 0.95
Warning: None
Policy: wikipedia.strict
Verdict: not_supported
Confidence: 0.95
Warning: CITATION LAUNDERING DETECTED:
wikipedia.strict requires primary sources.
Rejected non-primary source: unclassified source
This result is intentional, deterministic, and correct.
💡 Why This Matters
All three runs agree on the content of the claim. The confidence score remains high across policies. What changes is whether the claim is allowed to pass.
- Editorial policy allows synthesis and common knowledge.
- Standard policy allows Wikipedia as a secondary source.
- Strict policy rejects Wikipedia citing itself unless backed by primary sources.
In other words:
The claim does not become false; it becomes unpublishable under stricter rules.
This distinction is central to understanding why generative AI struggles in editorial environments. The failure mode is not hallucination alone; it is policy misalignment.
🔑 Key Insight
Policy enforcement separates three distinct questions:
- Is it true? (Factual accuracy)
- Can we verify it? (Evidence quality)
- May we publish it? (Editorial policy)
Current AI systems conflate these. We propose to make the separation explicit and enforce it in software.
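As a small illustration of that separation (the class and field names are hypothetical, not the system’s API), the three answers can be carried as independent fields instead of a single verdict:
# Sketch: carrying the three questions as separate fields. The class and
# field names are hypothetical, for illustration only.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ClaimAssessment:
    factually_accurate: Optional[bool]  # Is it true? (may be unknown)
    verifiable: bool                    # Can we verify it against cited evidence?
    publishable: bool                   # May we publish it under the active policy?

# The Voggenhuber claim under wikipedia.strict: true and verifiable against
# Wikipedia, but not publishable because the citation is not a primary source.
voggenhuber_under_strict = ClaimAssessment(
    factually_accurate=True,
    verifiable=True,
    publishable=False,
)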
🎯 Why We Start Here (Before Using AI)
We begin with this test case for a reason.
Before asking an AI to generate better outputs, we must first define:
- What counts as acceptable evidence
- Under which rules
- And why a claim is rejected
Policy enforcement does not decide truth.
It enforces explicit, inspectable editorial constraints—deterministic rules that answer questions like:
“Is this citation primary or secondary?” “Does the evidence explicitly state this relationship, or is it implied?” “Would this claim survive human editorial review under current Wikipedia standards?”
These rules operate independently of plausibility. A claim can be factually correct and still fail—not because it is false, but because it lacks the required provenance or introduces unstated synthesis. The gate rejects based on process, not probability.
Only once those constraints are encoded does it make sense to introduce AI generation—and measure whether it actually improves outcomes rather than merely sounding confident.
In the next section, we scale this exact mechanism across thousands of real FEVEROUS claims to show how editorial policy, not factual accuracy alone, determines what survives publication.
📈 Baseline: Policy Gating Without AI
Before introducing any large language model, we establish a non-negotiable baseline: Can a deterministic, policy-driven system correctly classify evidence without generation, learning, or prompting?
This matters for two reasons:
- It separates verification logic from generation quality
- It gives us a control condition against which AI behavior can be meaningfully measured
If this baseline fails, any downstream AI result would be uninterpretable.
✅ Why FEVEROUS is a good selection for this test
We use the FEVEROUS dataset because it is:
- Fully annotated against real Wikipedia pages
- Designed for multi-hop, table, and sentence-level evidence
- Widely cited in fact verification research
However, FEVEROUS evidence is Wikipedia-native by construction. That makes it ideal for testing editorial policy alignment, not just factual correctness.
In other words: FEVEROUS tells us what humans accepted as evidence, not whether that evidence satisfies all editorial standards.
This distinction turns out to be crucial.
🚪 The Wikipedia Gate
We evaluate every claim under three increasingly strict policies:
- wikipedia.editorial – mirrors common editorial acceptance
- wikipedia.standard – enforces standard sourcing and attribution rules
- wikipedia.strict – requires primary, non-derivative sources
Each policy is applied to the same claims, using the same code path, differing only by policy configuration.
No AI models are used.
💻 Example: Running the Gate
Below is a minimal example showing how the same claim is evaluated under three different Wikipedia policies.
from verity.wiki import WikipediaGate

gate = WikipediaGate()

claim = {
    "text": "Family Guy is an American animated sitcom.",
    "evidence": [
        {
            "page": "Family Guy",
            "source": "https://en.wikipedia.org/wiki/Family_Guy",
            "type": "wikipedia"
        }
    ]
}

for policy in [
    "wikipedia.editorial",
    "wikipedia.standard",
    "wikipedia.strict",
]:
    result = gate.evaluate(claim, policy=policy)
    print(f"""
Policy: {policy}
Verdict: {result.verdict}
Confidence: {result.confidence}
Warning: {result.warning}
""")
Observed output:
Policy: wikipedia.editorial
Verdict: supported
Confidence: 0.95
Warning: None
Policy: wikipedia.standard
Verdict: supported
Confidence: 0.95
Warning: None
Policy: wikipedia.strict
Verdict: not_supported
Confidence: 0.95
Warning: CITATION LAUNDERING DETECTED:
wikipedia.strict requires primary sources.
Rejected non-primary source: unclassified source
This is the expected and correct behavior.
The claim does not change. The evidence does not change. Only the policy changes.
📊 Dataset-Level Results (n = 3000)
We then run the same gate across 3,000 FEVEROUS claims, using the same evidence and the same evaluation logic.
🔄 Policy-Controlled Phase Transition on Identical Inputs
| Policy | Total Claims | Supported | Not Supported | Unclear |
|---|---|---|---|---|
| wikipedia.editorial | 3000 | 2999 | 0 | 1 |
| wikipedia.standard | 3000 | 2999 | 0 | 1 |
| wikipedia.strict | 3000 | 0 | 3000 | 0 |
This table evaluates the same 3,000 claims, with the same evidence, using the same code. The only variable is the active editorial policy.
The complete rejection under wikipedia.strict is intentional and correct.
FEVEROUS evidence is Wikipedia-derived, and strict policy forbids circular or non-primary sourcing.
This result does not indicate model failure or factual error.
It demonstrates that policy enforcement alone can deterministically override otherwise acceptable claims, producing a hard acceptance/rejection boundary without modifying the model, prompts, or data.
A skeptic might say:
“Of course this fails. You explicitly forbid Wikipedia from citing itself.”
That reaction is correct and beside the point.
To a systems engineer, this “obvious” failure is a success. It shows that the gate is deterministic. Given the same input and the same policy, it will always say no.
In a world dominated by probabilistic language and confidence-weighted outputs, producing a reliable binary rejection is not trivial. It is the prerequisite for any system that must operate under institutional rules.
💡 What This Shows (and Why It Matters)
These results are not surprising, and that is exactly the point.
They show that:
- FEVEROUS evidence aligns extremely well with editorial and standard Wikipedia practice
- The same evidence is systematically incompatible with strict, primary-source requirements
- The system cleanly separates policy mismatch from factual error
Crucially, this behavior emerges:
- without AI,
- without heuristics,
- without learned thresholds,
- without prompt tuning.
This establishes a policy-sensitive verification baseline.
🎯 Why This Baseline Is Necessary
Much of the current discussion around AI and Wikipedia frames failure as hallucination or poor generation.
This experiment shows something more precise:
Even perfectly annotated, human-verified evidence can fail editorial standards when policy constraints tighten.
In other words:
- Evidence ≠ Verifiability
- Generation ≠ Acceptance
- Policy is the missing layer
Only once this baseline is understood does it make sense to introduce stochastic generation and ask how or whether it can be safely contained.
🎭 Setting the Stage for AI
With this baseline in place, we can now ask a meaningful next question:
Can AI-generated content be filtered, corrected, or constrained to pass the same gate?
Because the gate is deterministic and reproducible, any improvement or failure in later experiments can be attributed to generation quality, not evaluation ambiguity.
That is the foundation on which the rest of this work is built.
flowchart TD
A[["📥 Claim + Evidence"]] --> B["🔄 Normalize Input"]
B --> C["🔍 Resolve Evidence Source"]
C --> D{"📝 Select Policy"}
D -->|wikipedia.editorial| E["📋 Editorial Checks"]
D -->|wikipedia.standard| F["📘 Standard Sourcing Checks"]
D -->|wikipedia.strict| G["🔬 Strict Primary Source Checks"]
E --> H{"✅ Policy Evaluation"}
F --> H
G --> H
H -->|Supported| I["🟢 Verdict: Supported"]
H -->|Not Supported| J["🔴 Verdict: Not Supported"]
H -->|Ambiguous| K["🟡 Verdict: Unclear"]
I --> L["📊 Confidence + Audit Log"]
J --> L
K --> L
classDef dataset fill:#e3f2fd,stroke:#1565c0,stroke-width:2px,color:#0d47a1
classDef process fill:#e8f5e8,stroke:#2e7d32,stroke-width:2px,color:#1b5e20
classDef check fill:#e8eaf6,stroke:#3f51b5,stroke-width:2px,color:#283593
classDef policyCheck fill:#ffecb3,stroke:#ff8f00,stroke-width:2px,color:#ff6f00
classDef supported fill:#e8f5e8,stroke:#388e3c,stroke-width:2px,color:#1b5e20
classDef reject fill:#ffebee,stroke:#d32f2f,stroke-width:2px,color:#b71c1c
classDef unclear fill:#fff3e0,stroke:#ff8f00,stroke-width:2px,color:#e65100
classDef metrics fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px,color:#4a148c
class A dataset
class B,C process
class D,H check
class E,F,G policyCheck
class I supported
class J reject
class K unclear
class L metrics
🤖 AI → Wikipedia Gate Test: Verifying AI Outputs Without Trusting Them
Up to this point, we’ve evaluated human-annotated claims (from FEVEROUS) against executable Wikipedia policies. That established an important baseline: the gate itself behaves predictably and matches editorial expectations.
The next question is the one that actually matters in practice:
What happens when the claims come from an AI?
This test answers that question without changing the policy, the gate logic, or the evaluation criteria.
🧪 What We Tested
We ran a single AI-generated claim per policy level through the same Wikipedia verification gates. Key constraints:
- The AI is not trusted
- The AI is not used for verification
- The AI produces text only
- All judgment is performed by deterministic policy enforcement
We evaluated the same type of historical claim under three policies:
- wikipedia.editorial
- wikipedia.standard
- wikipedia.strict
The only variable is policy strictness: not the model, not confidence, not prompt structure.
💻 The Test Code
Below is the full test used to generate the results shown in this section.
# test/integration/wiki/wiki_ai_gate_test.py
from rich.console import Console
from rich.table import Table

from verity_core.context.execution import ExecutionContext
from verity_core.infra.integrations import Integrations
from verity_core.db.stores.memory import Memory

console = Console()
memory = Memory(db_url="sqlite:///:memory:")
integrations = Integrations(memory=memory)

tests = [
    {
        "policy": "wikipedia.editorial",
        "claim": "The Battle of Waterloo was a significant battle during the Napoleonic Wars.",
        "page": "Battle of Waterloo",
    },
    {
        "policy": "wikipedia.standard",
        "claim": "In 1969, Apollo 11 astronauts Neil Armstrong and Buzz Aldrin landed on the Moon.",
        "page": "Apollo 11",
    },
    {
        "policy": "wikipedia.strict",
        "claim": "In 1969, Neil Armstrong became the first person to walk on the Moon.",
        "page": "Neil Armstrong",
    },
]

table = Table(title="AI → Wikipedia Gate Results", show_header=True)
table.add_column("Policy")
table.add_column("AI Claim")
table.add_column("Gate Verdict")
table.add_column("Confidence")
table.add_column("Warning")

for test in tests:
    ctx = ExecutionContext(params={"policy": test["policy"]})
    citation_url = f"https://en.wikipedia.org/wiki/{test['page'].replace(' ', '_')}"

    result = integrations.wikipedia.invoke(
        "wikipedia.citation.verify",
        "verify",
        {
            "claim": test["claim"],
            "citation_url": citation_url,
            "context_snippet": test["claim"],
        },
        context=ctx,
    )

    table.add_row(
        test["policy"],
        test["claim"][:40] + "...",
        result["support_label"],
        str(result["confidence_score"]),
        result.get("warning") or "None",
    )

console.print(table)
This test deliberately avoids:
- Prompt engineering tricks
- Retrieval augmentation
- Model self-verification
- Any form of AI-judged correctness
The AI only produces text. The gate alone decides whether that text is acceptable.
📊 Results
The output of the test is shown below:
| Policy | Gate Verdict | Confidence | Warning |
|---|---|---|---|
| wikipedia.editorial | supported | 0.95 | None |
| wikipedia.standard | supported | 0.95 | None |
| wikipedia.strict | not_supported | 0.95 | Citation laundering detected |
Several things are worth calling out.
💡 Why These Results Matter
1. Confidence is not authority
The AI’s confidence score remains constant across all three policies. This is intentional.
The system does not “trust” confidence; it enforces rules.
2. Truth is not enough
The rejected claim under wikipedia.strict is factually correct.
It still fails.
That is not a bug; it is the point.
Wikipedia’s strict policy requires primary sourcing, and a Wikipedia article is not a primary source.
3. The gate, not the AI, is doing the work
Nothing in this test relies on:
- The AI being accurate
- The AI being careful
- The AI being aligned
The same output can pass or fail depending solely on policy context.
4. This is exactly how high-reliability systems behave
This pattern mirrors how we already build reliable systems:
- Compilers don’t trust programmers
- Type systems don’t trust intuition
- Databases don’t trust inputs
Policy treats AI outputs the same way: useful, powerful, and never authoritative.
😊 Why We’re Happy With This
These results show something subtle but crucial:
You can safely use AI without trusting it.
The AI can generate ideas, drafts, summaries, or claims, and the system can enforce invariants afterward, deterministically.
This is not a marginal improvement. It is a structural one.
And it’s why we believe AI systems, once properly bounded, can reliably be declared safe for a given task.
⚡ Stochastic Generation, Deterministic Acceptance
Before we talk about AI, it is worth pausing on something more familiar: how humans think when they are allowed to think freely.
When people speak out loud, sketch ideas on a whiteboard, or brainstorm without immediately editing themselves, something different happens than when they write polished prose. The constraints are lower. The internal filter is weaker. Associations surface earlier. Ideas arrive partially formed, sometimes imprecise, occasionally wrong but often novel.
That looseness is not a flaw. It is how exploration works.
When the same person later edits, revises, or publishes, a different process takes over. Claims are tightened. Ambiguities are resolved. Unsupported leaps are removed. What remains is not the raw idea, but the version that survives constraint.
These two phases, free generation and constrained acceptance, are not in opposition. They are complementary. Most serious human work relies on both.
AI systems, as it turns out, work in a remarkably similar way.
🤖 AI Stochasticity Is Not the Problem
Large language models are stochastic by design. Given the same prompt, they can produce multiple valid continuations. This variability is what enables them to:
- synthesize across sources,
- explore alternative framings,
- compress complex evidence,
- and surface non-obvious connections.
Without stochasticity, models would be reduced to retrieval engines or deterministic rewriters. They would lose the very properties that make them useful.
Much of the current discourse around “hallucinations” implicitly treats stochasticity as a defect: something to be eliminated or minimized. But that framing misses the point.
The issue is not that AI systems generate uncertain or exploratory outputs.
The issue is that we often treat those outputs as if they were already final.
🔄 The Missing Phase: Acceptance as a Separate Operation
In most AI workflows today, generation and acceptance are collapsed into a single step:
- The model produces text
- The text is shown to a user or published
- Any correction happens informally, if at all
This is very different from how high-reliability systems operate.
In software engineering, we do not trust intermediate results. We generate candidates, then apply tests. We compile code, then enforce type systems. We accept inputs, then validate them against schemas and contracts.
Crucially, generation is allowed to be flexible but acceptance is not.
The central idea of this work is that AI systems should be treated the same way.
Stochastic generation is acceptable. Deterministic acceptance is non-negotiable.
Once those phases are separated, many apparent AI “failures” begin to look less mysterious.
Without deterministic acceptance, trust is irrelevant.
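As a minimal sketch of this separation (the generator and gate functions here are placeholders, not the system described later), stochastic sampling can be wrapped in a deterministic acceptance loop:
# Sketch: stochastic generation wrapped in deterministic acceptance.
# generate_candidates and gate_verify are placeholders for an LLM call and
# the policy gate; neither is the actual implementation.
from typing import Callable, Optional

def accept_first_passing(
    evidence: str,
    policy: str,
    generate_candidates: Callable[[str, int], list[str]],
    gate_verify: Callable[[str, str, str], bool],
    n_candidates: int = 5,
) -> Optional[str]:
    """Sample several stochastic candidates; return the first that survives the
    deterministic gate, or None if nothing is publishable under this policy."""
    for claim in generate_candidates(evidence, n_candidates):
        if gate_verify(claim, evidence, policy):  # deterministic: same input, same verdict
            return claim
    return None  # nothing passed; escalate to a human rather than publish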
🏛️ Institutions and AI
High-trust institutions like Wikipedia, medicine, law, and finance do not primarily care whether an output sounds right. They care whether it can be:
- justified under explicit rules,
- traced to acceptable sources,
- explained to human reviewers,
- and defended after the fact.
AI-generated text often fails these requirements not because it is false, but because it is procedurally unverifiable. The model compresses, generalizes, and reframes (all useful behaviors), but in doing so it loses a clean audit trail.
From the institution’s point of view, this is indistinguishable from fabrication.
This explains a pattern we see repeatedly in the experiments that follow:
- Claims can be semantically faithful
- Embedding similarity can be high
- Human readers may find them reasonable
And yet, they are rejected deterministically under strict policy.
That rejection is not anti-AI. It is policy doing its job.
📏 Measuring, Not Suppressing, Stochasticity
Once we accept that stochastic generation is inevitable, and even desirable, the question shifts.
Instead of asking:
How do we eliminate hallucinations?
we ask:
How do we measure how far stochastic generation has moved beyond what the evidence strictly supports?
This is where the notion of hallucination energy enters the system later in the paper.
Hallucination energy does not attempt to decide whether a claim is true. It does not judge intent or correctness. It does not replace editorial policy.
It simply measures semantic deviation: how much of a generated claim cannot be explained as a direct projection of the evidence.
In human terms, it is the difference between:
- “thinking out loud” and
- “making a publishable assertion”
The metric gives us a way to observe stochasticity rather than pretending it should not exist.
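As a rough sketch of the idea (the embedding model choice is an illustrative assumption, and this is not the exact hallucination-energy formula introduced later), semantic drift can be estimated as the part of a claim’s embedding that cannot be reconstructed from its evidence embeddings:
# Sketch: semantic drift as a projection residual. The embedding model choice
# is an illustrative assumption; this shows the idea behind the metric, not
# the exact hallucination-energy formula used later.
import numpy as np
from sentence_transformers import SentenceTransformer

_model = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder works

def drift_energy(claim: str, evidence_sentences: list[str]) -> float:
    """Fraction of the claim embedding not explained by the evidence span
    (0.0 = fully explained by evidence, 1.0 = orthogonal to it)."""
    c = _model.encode([claim])[0]
    E = _model.encode(evidence_sentences)            # shape: (n_evidence, dim)
    # Least-squares projection of the claim vector onto the evidence span
    coeffs, *_ = np.linalg.lstsq(E.T, c, rcond=None)
    residual = c - E.T @ coeffs
    return float(np.linalg.norm(residual) / np.linalg.norm(c))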
🎯 Containment, Not Correction
A key consequence of this framing is that the solution does not live inside the model.
No amount of prompt engineering, retraining, or self-evaluation can teach a model which semantic moves are acceptable under every institutional policy. Those rules are external, contextual, and often domain-specific.
The only stable solution is architectural:
- Allow stochastic generation to do what it does best
- Measure how far it drifts
- Apply deterministic gates afterward
- Accept only what survives explicit policy
This is not an attempt to make AI “tell the truth.”
It is an attempt to make AI usable in environments where truth alone is not sufficient.
🗺️ Why This Matters for the Rest of the Paper
Everything that follows (the Wikipedia gate, the policy regimes, the FEVEROUS experiments, the strict rejections, and the hallucination energy plots) is downstream of this separation.
Once you see generation and acceptance as distinct phases:
- It becomes obvious why accuracy is not the bottleneck
- It becomes obvious why policy overrides semantic quality
- It becomes obvious why strict rejection can be correct
- And it becomes obvious why bounding stochasticity works without destroying it
The rest of this post is not an argument against AI.
It is an argument for putting it in the right place in the system.
🎥 Act II: When AI Enters the System
What this section does
Act II introduces AI as a stress test, not as a solution.
We progressively place a stochastic language model into the verification pipeline under tight constraints, then under increasing freedom. This reveals where and why AI outputs begin to conflict with institutional verification rules.
In this act we show:
- why early AI results look “too good”,
- why that is not a success but a baseline,
- how synthesis differs from paraphrase,
- where epistemic risk actually appears.
We then introduce hallucination energy, a diagnostic signal that measures semantic drift between claims and evidence.
This act answers a narrow but critical question:
What exactly breaks when stochastic generation meets deterministic verification, and why?
The Unexpected Result
At this point, we expected AI-generated claims to fail dramatically.
Given the prevailing narrative around hallucinations, we assumed that introducing a stochastic generator into a rigid editorial system would immediately surface errors, violations, and instability.
That did not happen.
Our first AI experiments, paraphrasing existing claims and synthesizing claims strictly from Wikipedia-grounded evidence, produced near-perfect pass rates under editorial and standard policies.
This result was not reassuring. It was a warning.
The AI had not become reliable. The experiment had become too constrained to fail.
By operating exclusively on evidence already curated for Wikipedia, we had placed the model in a sandbox where epistemic risk was structurally suppressed. The system was behaving well not because AI had solved verifiability, but because we had not yet created conditions where its stochastic nature could meaningfully conflict with institutional rules.
This forced a deliberate pivot.
Before asking whether policy gates could fix AI outputs, we first needed to understand where and how AI actually breaks institutional constraints.
That shift, from validating success to introducing controlled failure, defines the rest of Act II.
Up to this point, everything we have described works.
Wikipedia’s editorial process, with its emphasis on verifiability, sourcing, and policy enforcement, functions reliably when the author is human. Claims are evaluated not on whether they sound right, but on whether they meet clearly defined institutional rules. Disagreements are resolved through process, not persuasion.
The moment we introduce AI into this system, something changes.
Not because the rules change. Not because the standards are relaxed. But because AI produces outputs that were never designed to be evaluated under these constraints.
This act explores what happens when we place AI-generated claims into the same verification pipeline that Wikipedia already uses for human editors without special treatment, without exceptions, and without redefining success.
Importantly, this is not an attempt to fix AI, defend AI, or argue that Wikipedia’s standards are wrong. The goal here is diagnostic, not prescriptive.
We ask a narrower question:
What exactly fails when AI-generated claims are subjected to institutional verification rules and why?
To answer this, we do not begin with free-form generation or open-ended prompts. Instead, we constrain the problem as tightly as possible. We use evidence already curated from Wikipedia, ask an AI model to synthesize claims only from that evidence, and then evaluate those claims using deterministic, policy-aware gates that mirror Wikipedia’s own editorial regimes.
This setup removes ambiguity about truth. The evidence is already known. The task is not discovery. It is synthesis under constraint.
What follows in this act is not a story about hallucinations in the abstract. It is a careful examination of friction: the mismatch between how AI represents knowledge and how institutions decide whether knowledge is acceptable.
By the end of Act II, we will not claim to have solved this mismatch. But we will have made it visible, measurable, and impossible to ignore.
🧪 Experimental Setup: Evidence, Policies, and Gates
To understand where AI-generated claims fail under institutional verification, we needed an experimental setup that was:
- grounded in real editorial practice
- reproducible and deterministic
- capable of evolving without invalidating earlier results
This section describes the final setup we converged on and, importantly, why it took the shape it did.
📊 Dataset: FEVEROUS as a Policy Stress Test
We use the FEVEROUS dataset as our primary evaluation corpus.
FEVEROUS is derived from Wikipedia and contains:
- natural-language claims,
- structured evidence annotations (sentences, tables, and metadata),
- editorial labels such as SUPPORTS, REFUTES, and NOT ENOUGH INFO.
It is important to be explicit about what FEVEROUS is not.
FEVEROUS is not a truth oracle. It does not certify factual correctness in the abstract. Instead, it captures how claims are supported or rejected within the context of Wikipedia’s editorial process.
That makes it ideal for our purpose. We are not asking whether AI is correct in some universal sense. We are asking whether AI outputs can survive institutional verification rules that already exist.
Because the full FEVEROUS corpus is not reliably available via hosted APIs, we work from a locally downloaded snapshot. This does not affect reproducibility: all preprocessing, sampling, and evaluation steps are deterministic and documented.
🚦 Editorial Policies as Executable Constraints
Rather than treating “Wikipedia standards” as a single, vague notion, we model them explicitly as policy regimes.
In our system, each regime is implemented as an executable gate:
- Editorial: mirrors everyday human editorial judgment; tolerant of ambiguity
- Standard: enforces clearer sourcing and attribution requirements
- Strict: enforces Wikipedia’s strongest verifiability principle, rejecting claims without demonstrable primary-source provenance
These are not learned classifiers. They are deterministic, rule-based evaluators. The same input under the same policy always produces the same output.
This matters for two reasons:
- It isolates policy effects from model behavior
- It allows us to ask a precise question: What changes when the policy changes, even if the claim does not?
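A minimal sketch of what such regimes can look like as declarative configuration (the flag names are illustrative assumptions, not the gate’s actual schema):
# Sketch: editorial regimes as declarative configuration. Flag names are
# illustrative, not the actual policy schema used by the gate.
POLICY_REGIMES = {
    "wikipedia.editorial": {
        "require_primary_sources": False,
        "allow_synthesis": True,
        "tolerate_ambiguity": True,
    },
    "wikipedia.standard": {
        "require_primary_sources": False,
        "allow_synthesis": False,
        "tolerate_ambiguity": False,
    },
    "wikipedia.strict": {
        "require_primary_sources": True,   # rejects Wikipedia citing itself
        "allow_synthesis": False,
        "tolerate_ambiguity": False,
    },
}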
🔄 How the Evaluation Actually Runs
for example in feverous_samples:
    evidence = extract_evidence_text(example)
    original_claim = example["claim"]
    ai_claim = synthesize_claim(evidence)

    for policy in ["editorial", "standard", "strict"]:
        result = wikipedia_gate.verify(
            claim=ai_claim,
            context=evidence,
            policy=policy,
        )
        record_result(
            policy=policy,
            original_claim=original_claim,
            ai_claim=ai_claim,
            verdict=result.label,
            confidence=result.confidence,
        )
🚪 The Wikipedia Gate Architecture
At the core of the system is what we call the Wikipedia Gate.
Conceptually, the gate sits between an AI system and a publication environment. It receives:
- a claim,
- its associated evidence context,
- and an active editorial policy.
It then produces:
- a support verdict (supported, not supported, unclear),
- a confidence score,
- and a structured warning when a policy violation is detected (e.g., citation laundering).
Crucially, the gate does not generate text. It does not correct claims. It does not reason probabilistically. It enforces rules.
This separation is intentional. Generation and verification are different problems, and conflating them is one of the core failure modes we are examining.
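To make that contract concrete, here is a minimal sketch of the gate’s interface as just described (type and field names are illustrative, not the real implementation):
# Sketch: the gate's input/output contract as described above. Type and
# field names are illustrative; the real implementation differs.
from dataclasses import dataclass
from typing import Literal, Optional

@dataclass
class GateDecision:
    verdict: Literal["supported", "not_supported", "unclear"]
    confidence: float
    warning: Optional[str] = None  # e.g. a citation-laundering notice

def verify(claim: str, evidence_context: str, policy: str) -> GateDecision:
    """A pure function of its inputs: it generates no text, corrects nothing,
    and evaluates rules deterministically under the active policy."""
    raise NotImplementedError("rule evaluation lives behind this interface")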
🎯 Iterative Refinement (Without Moving the Goalposts)
The Wikipedia tests attached to this work reflect a gradual increase in sophistication.
We began with simple baselines:
- verifying FEVEROUS claims directly against evidence,
- validating that the gate behaved sensibly under each policy.
Only after those baselines stabilized did we introduce AI into the loop:
- first as a paraphraser,
- then as a constrained synthesizer operating only on provided evidence,
- and finally as a generator capable of introducing epistemic risk.
At each step, earlier tests were preserved and re-run. No results were discarded. No definitions were retroactively changed.
This matters because it allows us to make a strong claim:
Any failures observed later are not artifacts of a broken gate. They are consequences of introducing AI into an otherwise stable verification system.
With the experimental foundation in place, we can now examine what actually happens when AI-generated claims are subjected to institutional verification and why the results look the way they do.
🤖 The First AI Experiment: Evidence-Bound Claim Synthesis
With the gate architecture and policies in place, the first question we asked was deliberately conservative:
What happens if an AI is only allowed to speak using the evidence we give it?
No external knowledge. No retrieval. No web access. No prior context.
Just evidence in, claim out.
🧪 Experimental Design
For each FEVEROUS sample, we extracted the annotated evidence text and asked the AI to synthesize a single declarative claim that was strictly grounded in that evidence.
The model was instructed to:
- summarize or restate what the evidence supports,
- avoid speculation or extrapolation,
- remain concise and factual.
Importantly, the AI was not shown the original FEVEROUS claim. It was operating blind its task was synthesis, not reconstruction.
Each generated claim was then passed through the Wikipedia Gate under all three editorial policies:
- editorial
- standard
- strict
No tuning, retries, or post-processing were applied.
📊 Results
The results were striking.
Under editorial and standard policies, the vast majority of AI-generated claims were accepted or marked as unclear rather than rejected outright.
Under strict policy, nearly all claims were rejected.
At first glance, this looked suspicious. If AI hallucination is such a serious problem, why wasn’t it showing up here?
✅ Why These Results Are Not a Red Flag
The apparent ‘success’ of the AI in this experiment is evidence that stochasticity had not yet been allowed to matter.
It is evidence that the experiment was correctly constrained.
FEVEROUS evidence is already Wikipedia-grounded. The language is encyclopedic. The scope is narrow. When an AI is asked to restate or compress that material, it is operating in a low-risk epistemic environment.
In other words: we didn’t give the model enough room to fail.
This is not a flaw in the experiment; it is a necessary baseline.
Before we can study how AI breaks institutional rules, we need to confirm that:
- the gate does not reject valid claims arbitrarily,
- the policies behave as expected,
- and the AI does not hallucinate by default when constrained.
This experiment establishes exactly that.
⚡ The First Tension Emerges
The strict policy results, however, already reveal something important.
Even when claims are:
- synthesized directly from evidence,
- semantically faithful,
- and editorially acceptable,
they are rejected under strict policy with warnings such as:
“CITATION LAUNDERING DETECTED: wikipedia.strict requires primary sources.”
This is not a semantic failure. It is a provenance failure.
The AI is reasoning in semantic space. Wikipedia is enforcing documentary lineage.
The same claim can be acceptable or unacceptable depending solely on which policy is active.
This is the first clear signal that verifiability is policy-relative, not truth-absolute.
🎯 What This Experiment Proves and What It Does Not
This experiment does not prove that AI is reliable.
It proves something narrower and more important:
- the gate works,
- the policies are meaningful,
- and AI behavior changes only when epistemic risk is introduced.
With that baseline established, we can now do the real work.
In the next section, we deliberately push the system out of this safe zone by allowing the AI to compress, generalize, and drift.
That is where hallucination stops being theoretical and starts becoming measurable.
🎯 Introducing Epistemic Risk: When Synthesis Becomes Reasoning
The first experiment established a calm baseline: when an AI is tightly constrained to restate Wikipedia-grounded evidence, it behaves predictably and safely.
But this is not how AI systems are used in practice.
Real deployments do not ask models to merely restate source material. They ask them to:
- summarize across multiple passages,
- compress nuanced descriptions into single claims,
- generalize from examples,
- infer implications,
- and speak with confidence under ambiguity.
This is where epistemic risk enters the system.
🔄 What We Changed
To introduce risk deliberately, we changed only one thing:
we allowed the AI to talk more.
Instead of instructing the model to restate the evidence as-is, we asked it to synthesize a higher-level claim that captured the meaning of the evidence.
The constraints were relaxed in subtle but important ways:
- the AI could generalize language,
- it could merge related facts,
- it could choose emphasis and framing,
- it could introduce qualifiers implicitly rather than explicitly.
No new information sources were added. The evidence remained the same.
But the space of possible claims expanded.
✨ From Paraphrase to Synthesis
def synthesize_claim(evidence_text: str) -> str:
    # `model` is the LLM client used throughout; the prompt is relaxed from
    # strict restatement to higher-level synthesis
    prompt = f"""
    Based ONLY on the following evidence,
    generate a single Wikipedia-style claim.
    Evidence:
    {evidence_text}
    Claim:
    """
    return model.generate(prompt)
💡 Why This Matters
From a human perspective, this kind of synthesis is reasonable, even expected.
From an institutional perspective, it is dangerous.
Wikipedia’s policies are not designed to evaluate whether a statement sounds reasonable. They are designed to enforce whether a statement can be proven acceptable under documented sourcing rules.
The moment the AI moves from compression to interpretation, it begins to generate claims that are:
- semantically plausible,
- faithful in spirit,
- but increasingly difficult to justify under strict verification.
This is the precise zone where hallucination is often discussed but rarely measured.
📝 A Concrete Example
Consider the following transformation:
Evidence excerpt (Wikipedia-grounded):
“Lindenbaum–Tarski algebra appears in discussions of algebraic logic and propositional modal logic.”
AI-synthesized claim:
“Algebraic logic encompasses multiple logical systems, including Lindenbaum–Tarski algebra, which provides models for propositional modal logics.”
Nothing here is obviously false.
Yet under strict policy, this claim is rejected.
Not because it contradicts the evidence, but because it asserts structure that was never explicitly sourced as such.
The AI has crossed a boundary: from what is stated to what is implied.
⚠️ The Nature of the Risk
This is not a bug. It is a property of reasoning.
Human experts do this kind of compression constantly. Institutions tolerate it only when provenance is explicit and traceable.
AI systems, however, do not understand institutional tolerance. They operate in semantic space, not procedural space.
Once the AI begins synthesizing higher-order claims, two things happen simultaneously:
- The claim becomes more useful.
- The claim becomes harder to verify.
This tradeoff is unavoidable.
🛡️ Why We Needed the Gate
Without a policy-aware gate, these claims would look perfectly acceptable. They are fluent. They are grounded. They are almost right.
But “almost right” is exactly the category that institutional systems must reject.
By introducing epistemic risk in a controlled way, we now have a setting where:
- the AI is no longer trivially safe,
- hallucination is no longer binary,
- and policy differences begin to matter.
This is the environment the gate was designed for.
In the next section, we show how this risk can be measured: not as truth or falsity, but as geometric deviation between what the evidence supports and what the AI asserts.
That measurement is what allows the gate to act before failure becomes visible to users.
📏 Measuring Semantic Drift: From Similarity to Direction
Once epistemic risk is introduced, we need a way to answer a subtle question:
How far does an AI-generated claim move away from what the evidence actually supports?
This is not a question of truth or falsity. It is a question of directional alignment in semantic space.
⚖️ Why Similarity Alone Is Not Enough
Most evaluation systems rely on cosine similarity between embeddings:
- High similarity → “grounded”
- Low similarity → “hallucinated”
But cosine similarity is symmetric and scalar. It only tells us whether two texts are close, not whether one text introduces new semantic content that is unsupported by the other.
Two statements can be highly similar while one still asserts something extra.
That extra assertion is where epistemic risk lives.
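A toy example makes the distinction concrete. The three-dimensional vectors below are invented for illustration; real embeddings live in hundreds of dimensions, but the geometry is the same:

```python
import numpy as np

evidence = np.array([1.0, 0.0, 0.0])            # what the source supports
claim = np.array([0.95, 0.3, 0.0])              # mostly aligned, plus something extra
claim /= np.linalg.norm(claim)

cosine_similarity = float(np.dot(claim, evidence))       # ~0.95: "looks grounded"

# The part of the claim orthogonal to the evidence is what similarity hides
residual = claim - np.dot(claim, evidence) * evidence
unsupported_mass = float(np.linalg.norm(residual))       # ~0.30: the extra assertion
```

Both numbers describe the same pair of vectors; only the second one sees the unsupported content.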
📐 A Geometric View of Claims and Evidence
We treat embeddings as vectors in a shared semantic space:
- the evidence vector represents what the source material supports,
- the claim vector represents what the AI is asserting.
If the claim is fully supported by the evidence, then directionally the claim vector should lie along the evidence vector.
If the claim introduces new structure, emphasis, or implication, part of the claim vector will point away from the evidence direction.
That deviation is what we measure.
⚡ Hallucination Energy (Semantic Residual)
We compute semantic drift by decomposing the claim vector into two components:
- the portion aligned with the evidence,
- the portion orthogonal to it.
In simplified form:
# Normalize vectors
e = normalize(evidence_vector)
c = normalize(claim_vector)
# Project claim onto evidence direction
projection = dot(c, e) * e
# Residual = unsupported semantic mass
residual = c - projection
hallucination_energy = norm(residual)  # i.e. ‖residual‖
This value captures how much of the claim cannot be explained by the evidence, regardless of surface similarity.
We call this quantity hallucination energy.
💻 Computing Unsupported Semantic Mass
import numpy as np
# `embed` is the sentence-embedding function used throughout (returns a 1-D vector)
claim_vec = embed(ai_claim)
evidence_vec = embed(evidence_text)
# Unit-normalize both vectors
claim_u = claim_vec / np.linalg.norm(claim_vec)
evidence_u = evidence_vec / np.linalg.norm(evidence_vec)
# Project the claim onto the evidence direction; the residual is what the evidence cannot explain
projection = np.dot(claim_u, evidence_u) * evidence_u
residual = claim_u - projection
hallucination_energy = np.linalg.norm(residual)
🎯 What This Metric Is and Is Not
It is important to be precise about what this measurement represents.
Hallucination energy is:
- a directional deviation metric,
- a measure of unsupported semantic mass,
- a continuous signal, not a binary judgment,
- independent of policy.
Hallucination energy is not:
- a truth detector,
- a fact-checker,
- a replacement for citation verification,
- a claim about intent or correctness.
A claim can have:
- low hallucination energy and still be rejected (policy violation),
- higher hallucination energy and still be acceptable (editorial tolerance).
That distinction is intentional.
✅ Why This Works in Practice
When AI systems are constrained to restate evidence, hallucination energy remains low.
As synthesis becomes more ambitious, introducing qualifiers, structure, or generalization, the residual grows.
This gives us a way to:
- detect early semantic drift,
- compare synthesis strategies,
- reason about risk before policy failure occurs.
Importantly, this signal is model-agnostic and policy-agnostic. It does not know what Wikipedia allows. It only measures what the evidence supports.
🚪 From Measurement to Gating
Hallucination energy does not decide outcomes by itself.
Instead, it provides context to the policy gate:
- low energy + rejection → provenance problem,
- high energy + rejection → semantic overreach,
- low energy + acceptance → safe compression,
- high energy + acceptance → editorial tolerance.
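A small helper makes these four regimes explicit. The 0.5 split between "low" and "high" energy is an arbitrary illustration; in practice it would be calibrated per policy:

```python
def interpret(hallucination_energy: float, accepted: bool, high: float = 0.5) -> str:
    """Label a (measurement, policy verdict) pair with its diagnostic meaning."""
    drifted = hallucination_energy > high
    if accepted:
        return "editorial tolerance" if drifted else "safe compression"
    return "semantic overreach" if drifted else "provenance problem"
```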
This separation is crucial.
It allows us to distinguish:
- reasoning quality,
- institutional alignment,
- and policy enforcement.
In the next section, we show how these measurements interact with Wikipedia’s editorial, standard, and strict policies, and why the same claim can receive radically different verdicts without changing a single word.
That divergence is not a failure of AI. It is the point of the experiment.
🎞️ Act III: Policy as a Harness for a Superpower
In Act I, we made verifiability explicit and executable. In Act II, we introduced stochastic generation under controlled conditions.
Now we bring them together.
This act is not about fixing AI. It is about making stochastic intelligence usable in environments that demand hard guarantees.
⚡ Stochasticity Is the Asset, Not the Bug
Large language models are stochastic by design. They explore a space of possible expressions rather than executing a single deterministic path.
This behavior is often framed as a flaw.
We take the opposite position.
Stochasticity is the core capability of modern AI.
It enables:
- synthesis across fragmented sources
- reframing of ideas
- discovery of unexpected connections
- compression of large evidence sets into human-readable form
Remove that stochasticity and you do not get a safer system; you get a weaker one.
The real problem is not hallucination.
The problem is unbounded stochasticity inside systems that require enforceable rules.
Wikipedia, finance, medicine, law, and production software do not operate on plausibility. They operate on:
- provenance
- traceability
- justification under policy
This is not a modeling problem. It is a systems integration problem.
🎯 The Shift: From Suppression to Containment
Most attempts to improve AI reliability focus on changing the model:
- stricter prompting
- additional fine-tuning
- reinforcement learning
- AI judging AI
All of these try to reduce stochasticity.
This work takes a different approach.
We leave the model untouched.
Instead, we introduce a deterministic layer that governs where stochasticity is allowed to operate.
The model generates. The system decides.
This separation is not novel; it is how reliable software has always been built.
📊 Hallucination Energy Becomes a Control Surface
In Act II, we introduced hallucination energy as a measurement of semantic drift between a generated claim and its supporting evidence.
Importantly:
- hallucination energy does not decide truth
- it does not enforce policy
- it does not reject output on its own
On its own, it is just a signal.
Its value emerges only when combined with policy.
This is the key move.
At this point, stochastic generation is still either “allowed” or “rejected.” That is safe, but it is not improvable.
Without measurement, stochastic behavior cannot be tuned.
🚧 The Deterministic Boundary
Instead of trying to make AI less stochastic or institutions more tolerant, we let each remain exactly what it is and insert a deterministic boundary between them.
On one side of the boundary:
- the AI is free to generate,
- explore,
- synthesize,
- and use its stochastic superpower fully.
On the other side:
- acceptance is binary,
- rules are explicit,
- and outcomes are reproducible.
This boundary is not a model. It is not a prompt. It is not a learned classifier.
It is executable policy.
In our system, this boundary takes the form of policy-aware gates that evaluate AI outputs against institutional rules after generation, not during it. The gate does not care how confident the model is, how fluent the claim sounds, or how plausible it appears.
It asks a narrower, enforceable question:
Is this output admissible under the active policy?
Because the gate is deterministic, the same input under the same policy always produces the same result. Change the policy, and the outcome can change even if the claim does not.
This is not inconsistency. It is control.
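A minimal sketch shows what "deterministic" means here. The inputs are already-measured facts about a claim (its hallucination energy and whether a primary source was found), and the policy fields and threshold values are hypothetical; the point is that the gate is a pure function of its inputs:

```python
def gate(energy: float, has_primary_source: bool, policy: dict) -> str:
    """No sampling, no model calls, no hidden state: same inputs, same verdict."""
    if policy["requires_primary_sources"] and not has_primary_source:
        return "reject: provenance"
    if energy > policy["max_energy"]:
        return "reject: semantic overreach"
    return "accept"

# The same measured claim (energy 0.30, secondary sources only) under two policies:
gate(0.30, False, {"requires_primary_sources": False, "max_energy": 0.60})  # -> "accept"
gate(0.30, False, {"requires_primary_sources": True,  "max_energy": 0.20})  # -> "reject: provenance"
```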
By separating generation from acceptance, we gain several properties that model-centric approaches cannot provide:
- Hard cut-offs instead of soft guidance
- Auditability instead of opacity
- Explainable rejection instead of silent failure
- Policy tuning without retraining models
Most importantly, we restore human authority.
Editors do not need to trust the model. They only need to trust the policy.
Once acceptance is decided by software rather than probability, the role of the AI becomes clear: it is a generator, not an arbiter. Creativity remains intact, but publication is governed by rules that can be inspected, debated, and revised without touching the model at all.
This is the point where the AI debate shifts from philosophy to systems engineering.
The question is no longer “Can we trust AI?” It becomes “Under which policies is this output allowed?”
And that is a question software can answer deterministically.
🥇 Policy Does Not Compete with AI—It Overrides It
The most important result of this work is not that hallucination energy can be measured.
It is this:
Policy can override semantic similarity every time, deliberately and deterministically.
In our synthesis experiments, we observe all four possible regimes:
| Hallucination Energy | Policy Verdict | Interpretation |
|---|---|---|
| Low | Accepted | Safe compression |
| Low | Rejected | Provenance failure |
| High | Accepted | Editorial tolerance |
| High | Rejected | Semantic overreach |
This table matters more than any single score.
It shows that hallucination energy informs the system, but policy remains the final authority.
This is exactly how high-trust institutions operate today.
🧭 From Rejection to Guidance
Here is the crucial improvement that Act III introduces.
Because hallucination energy is continuous, and policy is discrete, the system can do more than reject output.
It can steer generation.
A practical system can implement rules like:
if hallucination_energy > policy.max_energy:
    reject("Semantic overreach")
elif hallucination_energy > policy.review_energy:
    flag("Requires human review")
else:
    accept()
Now stochasticity becomes adjustable.
- During exploration → higher energy tolerated
- During drafting → moderate energy flagged
- During publication → strict thresholds enforced
The same model. The same prompt. Different policy.
This is how AI performance improves without retraining.
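One way to express those stages is as a handful of policy objects with different thresholds. The numbers below are placeholders, not calibrated values:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Policy:
    review_energy: float   # above this, flag for human review
    max_energy: float      # above this, reject outright

EXPLORATION = Policy(review_energy=0.50, max_energy=0.80)   # tolerate drift, log it
DRAFTING    = Policy(review_energy=0.25, max_energy=0.50)   # flag moderate drift
PUBLICATION = Policy(review_energy=0.05, max_energy=0.15)   # enforce strict bounds
```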
🏢 What This Looks Like in a Real Organization
Consider a financial institution using AI for research and reporting.
They do not want:
- a deterministic model
- or a model that never hallucinates
They want:
- creativity during analysis
- guarantees before disclosure
A policy-first architecture looks like this:
Stochastic Generation
↓
Semantic Diagnostics
(hallucination energy, similarity)
↓
Policy Gate
(provenance, thresholds, rules)
↓
Accept | Review | Reject
Most generated outputs will never be published.
That is not failure.
That is discipline.
🧪 Why Wikipedia Is the Right Stress Test
Wikipedia is explicit about its rules and already has a strong policy guideline we can leverage.
What Wikipedia exposes is a reality every high-trust domain already lives with:
- Finance: auditability and disclosure
- Medicine: evidence and liability
- Law: admissibility and precedent
- Software: tests, invariants, contracts
AI has struggled in these environments not because it is inaccurate, but because it is ungoverned.
This work shows that governance is not philosophical.
It is engineering.
💎 The Core Contribution of Act III
We do not claim to eliminate hallucination.
We claim something narrower and more powerful:
Stochastic generation becomes usable when bounded by deterministic policy.
Truth remains hard. Compliance is not.
🚀 Where This Leads
This post demonstrates the pattern using Wikipedia as a deliberately unforgiving test case.
The same structure applies anywhere rules exist and failure matters.
Future work will:
- formalize energy-policy calibration
- explore adaptive thresholds
- measure system-level guarantees
For now, the conclusion is simple:
AI does not need to become more cautious. Our systems need to become more disciplined.
🔄 Policy Is Not a Mode—It Is an Attribute
Up to now, we have described policy using three named regimes: editorial, standard, and strict.
These are useful illustrations but they are not the right mental model.
A better analogy comes from security engineering.
In secure systems, we do not say:
“This application runs in admin mode.”
Instead, we say:
- this action is allowed for all users
- this action requires elevated privileges
- this action is restricted to audited roles
Policy is applied per action, not per system.
AI governance should work the same way.
🎯 From Fixed Policies to Policy-Per-Action
The mistake many AI systems make is treating policy as a global setting:
“This model is safe.” “This deployment is restricted.” “This output is allowed.”
That framing does not scale.
Real organizations do not operate that way.
They perform many different kinds of actions, each with different risk profiles.
Examples:
- exploratory analysis
- internal research notes
- draft summaries
- client-facing reports
- regulated disclosures
- fiduciary decisions
Each of these actions tolerates a different amount of stochasticity.
The correct model is not three fixed policies.
The correct model is many policies, bound to actions.
⚡ Hallucination Energy Enables Action-Scoped Policy
This is where hallucination energy becomes more than a diagnostic.
Because hallucination energy is continuous, it allows policy to be parameterized.
Instead of asking:
“Is this output acceptable?”
We ask:
“Is this output acceptable for this action?”
Conceptually:
policy = policy_for(action)
if hallucination_energy > policy.max_energy:
    reject()
elif hallucination_energy > policy.review_energy:
    require_human_review()
else:
    accept()
The same AI output can be:
- acceptable for brainstorming
- questionable for internal drafts
- forbidden for external publication
Nothing about the model changes.
Only the action-bound policy does.
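Concretely, `policy_for` can be nothing more than a lookup from action names to policy objects. The action names and thresholds below are hypothetical:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ActionPolicy:
    review_energy: float
    max_energy: float

ACTION_POLICIES = {
    "brainstorming":     ActionPolicy(review_energy=0.60, max_energy=0.90),
    "internal_draft":    ActionPolicy(review_energy=0.30, max_energy=0.55),
    "client_report":     ActionPolicy(review_energy=0.10, max_energy=0.25),
    "regulatory_filing": ActionPolicy(review_energy=0.00, max_energy=0.05),
}

def policy_for(action: str) -> ActionPolicy:
    # Unknown actions fall back to the most restrictive policy.
    return ACTION_POLICIES.get(action, ACTION_POLICIES["regulatory_filing"])
```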
💰 A Concrete Example: Finance
Consider a financial institution using AI across its workflow.
| Action | Allowed Hallucination | Policy Behavior |
|---|---|---|
| Market exploration | High | Accept with logging |
| Internal analysis | Medium | Flag for review |
| Client communication | Low | Strict provenance |
| Regulatory filing | Near zero | Deterministic rejection |
This is not hypothetical.
This is how risk is already managed in mature systems.
AI simply lacked the enforcement layer.
💡 Why This Matters
This reframing resolves a false dichotomy that dominates AI discourse:
- “AI must be creative” vs “AI must be safe”
Both are true for different actions.
By treating policy as an attribute of what the AI is being asked to do, rather than what the AI is, we get:
- maximal utility during exploration
- maximal safety during execution
- zero need for model retraining
- zero need for prompt contortions
This is not AI alignment by persuasion.
It is AI alignment by authorization.
🧩 The Deeper Pattern
Once you see this, the broader pattern becomes clear:
- Stochasticity is the engine
- Hallucination energy is the gauge
- Policy is the circuit breaker
- Action defines the risk envelope
That combination is sufficient to make AI usable in environments that previously had to ban it outright.
🧪 Why We Started with Wikipedia
Wikipedia is an extreme case.
It enforces one of the strictest editorial policies in existence.
That makes it a perfect stress test: not because most systems are that strict, but because if the pattern works there, it works anywhere.
The takeaway is not “use strict policy everywhere.”
The takeaway is:
Use the right policy for the right action and enforce it deterministically.
🎯 Policy as a Harness for Stochastic Generation
Figure: Stochastic generation bounded by action-specific policy. Hallucination energy informs decisions, but policy deterministically governs acceptance.
flowchart TD
%% --- INPUT SECTION ---
A["🧑💻 User / Task Request<br/>📋 What needs to be done?"]
%% --- ACTION CLASSIFICATION ---
A --> B{"🎯 Action Classifier<br/>What type of request is this?"}
%% --- POLICY SELECTION ---
B -->|🌱 exploratory| P1["Policy: exploratory.open<br/>🛡️ High creativity tolerance"]
B -->|🔍 analysis| P2["Policy: analysis.standard<br/>🛡️ Balanced approach"]
B -->|🏢 client| P3["Policy: client.strict<br/>🛡️ Low risk tolerance"]
B -->|⚖️ regulatory| P4["Policy: regulatory.hard<br/>🛡️ Zero tolerance"]
%% --- AI GENERATION ZONE ---
subgraph AI["⚡ AI Generation Layer<br/>(Stochastic • Creative • Unbounded)"]
G["🤖 LLM Generates Response<br/>✨ Stochastic output ✨<br/>🎲 Probability-driven"]
end
%% CONNECT POLICIES TO AI
P1 -.-> G
P2 -.-> G
P3 -.-> G
P4 -.-> G
%% --- SEMANTIC DIAGNOSTICS ---
G --> D["📊 Semantic Diagnostics<br/>📏 Similarity Score<br/>⚡ Hallucination Energy<br/>🎯 Confidence Level"]
%% --- POLICY GATES (THE CRITICAL BOUNDARY) ---
subgraph GATES["🚨 Policy Enforcement"]
D -->|evaluates against| Gate1{"exploratory.open"}
D -->|evaluates against| Gate2{"analysis.standard"}
D -->|evaluates against| Gate3{"client.strict"}
D -->|evaluates against| Gate4{"regulatory.hard"}
end
%% --- OUTCOMES ---
Gate1 -->|✅ High tolerance| O1["🟢 ACCEPT<br/>✨ Log only"]
Gate2 -->|⚠️ Medium tolerance| O2["🟡 ACCEPT + REVIEW<br/>📌 Flag for human check"]
Gate3 -->|🚫 Low tolerance| O3["🔴 REJECT<br/>📋 Policy violation logged"]
Gate4 -->|⛔ Near-zero tolerance| O4["🛑 REJECT + AUDIT<br/>📂 Full audit trail created"]
%% --- COLOR SCHEME ---
classDef input fill:#e3f2fd,stroke:#1565c0,stroke-width:3px,color:#0d47a1
classDef classifier fill:#e8f5e8,stroke:#2e7d32,stroke-width:2px,color:#1b5e20
classDef policy fill:#fff3e0,stroke:#f57c00,stroke-width:2px,color:#e65100
classDef aiZone fill:#f3e5f5,stroke:#7b1fa2,stroke-width:3px,color:#4a148c
classDef diagnostics fill:#e8eaf6,stroke:#3f51b5,stroke-width:2px,color:#283593
classDef gates fill:#ffebee,stroke:#d32f2f,stroke-width:3px,color:#b71c1c
classDef accept fill:#e8f5e8,stroke:#388e3c,stroke-width:2px,color:#1b5e20
classDef review fill:#fff3e0,stroke:#ff8f00,stroke-width:2px,color:#e65100
classDef reject fill:#ffebee,stroke:#d32f2f,stroke-width:2px,color:#b71c1c
classDef audit fill:#fce4ec,stroke:#c2185b,stroke-width:2px,color:#880e4f
%% Apply styles
class A input
class B classifier
class P1,P2,P3,P4 policy
class AI,G aiZone
class D diagnostics
class GATES,Gate1,Gate2,Gate3,Gate4 gates
class O1 accept
class O2 review
class O3 reject
class O4 audit
%% Add note about the core concept
note["💡 CORE CONCEPT:<br/>AI generates freely • Policy decides what's acceptable<br/>Stochastic creativity ↔ Deterministic rules"]
style note fill:#e1f5fe,stroke:#0288d1,stroke-width:2px,color:#01579b
The key idea is that policy is bound to the action, not the model. The same stochastic AI output flows through different policy gates depending on what the system is being asked to do.
Hallucination energy and similarity are measured once, but their interpretation depends entirely on policy. In low-risk actions, stochasticity is tolerated. In high-risk actions, the same signal deterministically blocks output.
🛡️ Example: Policy Per Action (Security Model Analogy)
| Action Type | Allowed Stochasticity | Tolerated Hallucination Energy | Policy Strictness | Outcome |
|---|---|---|---|---|
| Brainstorming | High | High | Low | Always accepted |
| Research Notes | Medium | Medium | Medium | Flag for review |
| Draft Report | Low | Low | High | Mostly rejected |
| Client-Facing Output | Very Low | Very Low | Strict | Binary pass/fail |
| Regulatory Filing | None | ~0 | Max | Reject by default |
This is not model tuning. It is policy selection, the same way access control works in security systems.
🍿 Act IV: Learning From the Boundary (Exploratory)
Up to this point, everything in this post has been deliberately conservative.
We have:
- Made policy explicit and executable
- Shown that policy can deterministically override otherwise acceptable claims
- Demonstrated that this works without modifying the model
At this stage, we already have a usable system.
But as AI developers, we would be disingenuous to stop here.
Because once a policy is enforced, it creates something extremely valuable:
A learning signal.
This final act explores how that signal could be used carefully without weakening the guarantees established earlier.
📚 Policy Enforcement Creates Information
Every policy decision produces structured feedback:
- accepted vs rejected
- reason codes (provenance failure, overreach, circular sourcing)
- hallucination energy at time of rejection
This feedback is not opinionated. It is not probabilistic. It is the result of a deterministic rule system.
In traditional software engineering, signals like this are gold.
Ignoring them would be wasteful.
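In practice, each gate decision can be logged as a small structured record. The field names below are illustrative, chosen to mirror the bullets above:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)
class PolicyDecision:
    claim_id: str
    policy: str                   # e.g. "wikipedia.strict"
    accepted: bool
    reason_code: str | None       # e.g. "PROVENANCE_FAILURE"; None when accepted
    hallucination_energy: float   # measured at decision time
    decided_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
```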
🚫 What We Are Not Proposing
Let’s be explicit.
We are not proposing:
- replacing policy with a learned model
- letting AI decide what policy should be
- weakening editorial constraints
- “AI judging AI” in the abstract sense
Policy remains authoritative. Policy remains external. Policy always wins.
✅ What We Are Proposing
Once a policy is defined and enforced, the system can observe how stochastic generation interacts with that policy over time.
This enables a narrow, well-scoped use of learning:
Learning how much stochasticity a given policy will tolerate.
Not truth. Not correctness. Not editorial judgment.
Just tolerance.
📊 Hallucination Energy as a Feedback Signal
Earlier, we introduced hallucination energy as a diagnostic:
- continuous, not binary
- independent of policy
- descriptive, not normative
On its own, it does nothing.
But paired with policy outcomes, it becomes informative.
Over many evaluations, the system can observe patterns like:
- claims rejected above a certain energy threshold
- claims accepted below it
- policy-specific tolerance bands
This enables calibration, not autonomy.
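As a sketch, a tolerance estimate for one policy can be as crude as a percentile over the energies of its historically accepted claims. The percentile choice is an assumption; the estimate is observational and never changes what the gate will accept:

```python
import numpy as np

def estimate_tolerance(energies: list[float], accepted: list[bool], pct: float = 95.0) -> float:
    """Energy level below which almost all accepted claims have fallen for this policy."""
    passed = [e for e, ok in zip(energies, accepted) if ok]
    if not passed:
        return 0.0   # nothing has ever passed: assume zero tolerance
    return float(np.percentile(passed, pct))
```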
🧪 A Concrete (and Safe) Use Case
Consider the simplest possible application:
if hallucination_energy > policy.estimated_tolerance:
    regenerate(
        warning="Previous output exceeded policy tolerance"
    )
Nothing passes automatically. Nothing is overridden. No rule is bypassed.
The system simply avoids producing outputs that it already knows will fail.
This improves efficiency, not permissiveness.
🏛️ Why This Does Not Undermine Wikipedia’s Position
From Wikipedia’s perspective, this section changes nothing.
The final arbiter is still policy. The acceptance criteria are unchanged. The rules are enforced exactly as written.
If anything, this approach reduces pressure on reviewers by preventing doomed outputs from being generated in the first place.
The policy boundary remains intact.
👩💻 Why This Matters for AI Developers
In many real systems:
- policies are strict
- AI is exploratory
- rejection rates are high
- iteration is expensive
Learning from policy outcomes allows systems to:
- reduce wasted generations
- surface warnings earlier
- adapt behavior per task without retraining
- preserve creativity where allowed
- tighten constraints where required
This is how stochastic systems become operational.
🔍 A Useful Analogy: Error Detection, Not Correction
In communications systems, error-detection codes do not decide meaning.
They detect when data exceeds tolerance and trigger retransmission.
That is the role hallucination energy can play here.
It is not a judge. It is a sensor.
🎯 Why We Include This Section
This post is about engineering, not ideology.
Once we have:
- executable policy
- deterministic enforcement
- measurable deviation
we have everything needed to close a feedback loop.
Not using that signal would be poor system design.
This section does not claim success. It does not claim safety. It does not claim convergence.
It simply acknowledges the obvious next step and stops there.
🔄 Closing the Loop
At its core, this work makes one argument:
AI does not fail because it is stochastic. AI fails because we deploy it without boundaries.
Once boundaries exist, learning becomes possible.
But boundaries must come first.
📊 Policy-Bounded Learning Loop (Mermaid)
Figure: Policy-Bounded Learning Loop. Stochastic generation is preserved. Policy enforcement remains deterministic and final. Hallucination energy is used only as an observational signal to reduce future policy violations, never to override policy decisions.
flowchart TD
A["Stochastic AI Generation"]:::stochastic
B["Generated Claim + Evidence"]
C["Semantic Diagnostics<br/>(Hallucination Energy)"]:::metric
D["Policy Gate<br/>(Executable Rules)"]:::policy
E["Accepted Output"]:::accept
F["Deterministic Rejection<br/>+ Reason Codes"]:::reject
G["Learning Signal<br/>(Energy vs Policy Outcome)"]:::signal
H["Generation Tuning<br/>(Pre-Policy Adjustment)"]:::tuning
A --> B
B --> C
C --> D
D -->|Pass| E
D -->|Fail| F
F --> G
G --> H
H --> A
%% Styling
classDef stochastic fill:#e3f2fd,stroke:#1e88e5,stroke-width:1px;
classDef policy fill:#fff3e0,stroke:#fb8c00,stroke-width:1px;
classDef metric fill:#f3e5f5,stroke:#8e24aa,stroke-width:1px;
classDef signal fill:#e8f5e9,stroke:#43a047,stroke-width:1px;
classDef tuning fill:#ede7f6,stroke:#5e35b1,stroke-width:1px;
classDef accept fill:#e0f2f1,stroke:#00897b,stroke-width:1px;
classDef reject fill:#ffebee,stroke:#c62828,stroke-width:1px;
🧭 Conclusion: Measurement Before Understanding
Historically, we rarely wait to fully understand a phenomenon before learning how to control it.
We did not need a complete theory of electricity to build power grids. We learned to measure voltage, current, and resistance. Those measurements were projections of something we didn’t fully understand, and they were enough.
Once measurement existed, control followed. Once control existed, tuning followed. Once tuning existed, improvement compounded.
AI is at the same stage.
🤖 We Don’t Need to “Understand” AI to Control It
Large language models are opaque, stochastic systems. We do not have a complete theory of how they reason, generalize, or synthesize.
But that is not unusual.
What matters is not perfect understanding, but reliable measurement surfaces.
This work shows that we can measure AI behavior sideways:
- not by asking “is it true?”
- but by asking “how far did it move beyond what evidence strictly supports?”
- and “does this output violate an explicit policy boundary?”
That is enough to build control.
⚖️ Policy Forces the Hard Decisions
Executable policy does something crucial that informal guidelines never do:
It forces us to decide what is acceptable, unacceptable, and non-negotiable.
Once policy is explicit:
- some behaviors are allowed,
- some require review,
- some trigger immediate rejection.
There is no ambiguity.
This is the hard binary every high-trust system relies on: the equivalent of circuit breakers, type errors, or safety interlocks.
Policy answers questions AI cannot:
- May this be published?
- Is this provenance sufficient?
- Is this class of synthesis allowed at all?
Those decisions must exist outside the model.
📶 Continuous Signal Inside a Binary Boundary
Binary policy alone is not enough to improve systems over time.
That is where measurement enters.
Hallucination energy provides a continuous signal that tracks how far stochastic generation drifts beyond evidence support, without deciding truth and without overriding policy.
This gives us two orthogonal axes:
- Policy → hard acceptance / rejection
- Measurement → how close or far the output was from the boundary
This is the same structure used everywhere else in engineering:
- hard limits + continuous feedback
- invariants + tunable parameters
- shutdown conditions + optimization signals
🚀 Why This Enables Scaling
Once both pieces exist, the system becomes improvable in a disciplined way.
As AI operates:
- outputs fall inside or outside policy,
- hallucination energy rises or falls,
- failures become categorized rather than mysterious,
- tuning becomes empirical rather than speculative.
At that point, improvement is no longer about intuition or trust.
It becomes a control problem.
And control problems scale.
🏆 The Core Takeaway
Without an explicit policy gate, hallucination cannot be meaningfully measured. Without measurement, stochastic behavior cannot be tuned. Without deterministic acceptance, trust is irrelevant.
This work does not claim to solve AI reliability.
It establishes something more foundational:
Stochastic systems become scalable only once they are measurable and bounded, even when they are not fully understood.
Policy provides the boundary. Measurement provides the signal.
Together, they turn stochastic generation from a liability into a controllable asset.
That is how every other complex system we rely on became usable, and AI is no exception.
AI reliability does not emerge from better answers.
It emerges from enforceable boundaries and measurable deviation.
📚 Glossary
| Term | Definition | Context / Example |
|---|---|---|
| Verifiability Gate | A deterministic policy enforcement mechanism that evaluates AI-generated claims against explicit editorial rules before acceptance. | Rejects a factually correct claim about Neil Armstrong if it cites Wikipedia itself under wikipedia.strict policy (citation laundering). |
| Hallucination Energy | A continuous metric measuring semantic drift—the portion of a claim’s embedding vector orthogonal to its supporting evidence vector. Quantifies unsupported semantic mass without asserting falsity or correctness. | Computed as ‖residual‖ = ‖claim_vector − projection_onto_evidence‖; high values indicate synthesis beyond evidence boundaries. |
| Stochastic Generation | The inherent probabilistic nature of LLMs that enables exploration, synthesis, and creative reframing—treated as a capability to harness, not a defect to eliminate. | Enables AI to compress multi-sentence evidence into concise claims, but introduces epistemic risk when unbounded. |
| Deterministic Acceptance | Binary, rule-based evaluation of outputs after generation, in which policy decisions override all model confidence, fluency, or similarity signals. | The gate accepts, rejects, or flags for review—never “trusts” confidence scores. |
| Semantic Drift / Overreach | When an AI-generated claim asserts structure, relationships, or implications not explicitly supported by source evidence—even if semantically plausible. | Evidence: “Lindenbaum–Tarski algebra appears in discussions of algebraic logic” → Claim: “Algebraic logic encompasses Lindenbaum–Tarski algebra.” |
| Citation Laundering | Violation occurring when secondary or tertiary sources are used to support claims requiring primary-source provenance under strict policy. | Wikipedia citing itself to establish biographical facts fails wikipedia.strict even when accurate. |
| Policy Regimes | Executable editorial constraints with increasing stringency: editorial (human-like tolerance), standard (clear sourcing), strict (primary sources only). | Same claim passes under editorial but fails under strict due to provenance—not factual error. |
| Epistemic Risk | Risk introduced when AI moves from restating evidence to synthesizing higher-order claims (generalization, implication, framing). | Paraphrasing evidence has low risk; inferring unstated relationships introduces measurable risk. |
| Evidence-Bound Synthesis | Constrained generation where AI produces claims using only provided evidence text, with no external knowledge or retrieval. | Baseline showing high pass rates under relaxed policies—verifying gate correctness before introducing risk. |
| Action-Scoped Policy | Binding policy constraints to specific actions or tasks rather than treating policy as a global system mode. | Same output allowed for “brainstorming” but rejected for “regulatory filing.” |
| Semantic Residual | The orthogonal component of a claim’s embedding vector after projection onto the evidence vector—the mathematical basis of hallucination energy. | residual = claim_normalized − (dot(claim, evidence) × evidence_normalized) |
| Policy-Bounded Learning | Using deterministic policy outcomes as a feedback signal to reduce wasted or non-compliant generations, without modifying or relaxing policy constraints. | Regenerating when hallucination_energy > policy.tolerance—improves efficiency, not permissiveness. |
| Acceptance Boundary | The deterministic decision surface defined by policy that separates admissible outputs from rejected ones. | Identical claims may fall on different sides of the boundary under different policies. |
| Verity | The cognitive operating system implementing this architecture: stochastic generation upstream, deterministic policy enforcement downstream. | AI generates possibilities; software decides what passes. |
| FEVEROUS Dataset | Wikipedia-grounded fact verification corpus with human-annotated claims and explicit evidence references. | Used as a policy stress test—not a truth oracle. |
| Provenance Failure | Rejection due to insufficient source lineage, even when semantic content is accurate. | Low hallucination energy + rejection = provenance issue; high energy + rejection = overreach. |
| Measurement Before Understanding | Engineering principle that reliable control is possible using measurable projections of a system before full theoretical understanding exists. | Voltage precedes a full theory of electricity; hallucination energy precedes a full theory of AI reasoning. |
📖 References & Context
This work builds on three existing strands already operational in high-trust environments:
- Institutional policy as the boundary: Wikipedia’s explicit editorial rules and documented experience rejecting AI not for inaccuracy, but for unverifiability under institutional constraints
- Evidence-grounded verification: datasets that capture human-verified evidence relationships, not abstract truth
- Measurement before understanding: the engineering precedent that reliable systems emerge from measurable boundaries, not perfect models
The contribution here is architectural, not algorithmic: treating stochastic generation as an upstream capability and enforcing reliability downstream through executable policy.
🔖 Core References
Institutional Verifiability
1. Wikipedia. Wikipedia: Verifiability. https://en.wikipedia.org/wiki/Wikipedia:Verifiability
“Verifiability, not truth, is required.” Canonical statement that institutional trust demands procedural justification, not plausibility.
2. Wikipedia. Wikipedia: Reliable Sources. https://en.wikipedia.org/wiki/Wikipedia:Reliable_sources
Defines primary vs. secondary sourcing requirements: the provenance boundary that rejects otherwise-accurate claims.
3. Wiki Education. Wikipedia and Generative AI Editing: What We Learned in 2025 (2026). https://wikiedu.org/blog/2026/01/29/generative-ai-and-wikipedia-editing-what-we-learned-in-2025/
Institutional evidence that AI failure modes in editorial environments are procedural (unverifiable synthesis), not merely factual.
Evidence-Grounded Verification
4. Aly et al. FEVEROUS: Fact Extraction and VERification Over Unstructured and Structured Evidence (2021). https://fever.ai/dataset/feverous.html
Wikipedia-native evidence annotations, used here as a policy stress-test substrate, not a truth oracle.
5. Thorne et al. FEVER: A Large-Scale Dataset for Fact Extraction and VERification (NAACL 2018). https://aclanthology.org/N18-1074/
Establishes the lineage of evidence-bound verification, separate from model performance.
Semantic Drift as Measurable Signal
6. Ji et al. Survey of Hallucination in Natural Language Generation (ACM Computing Surveys 2023). https://arxiv.org/abs/2202.03629
Documents the field’s lack of an operational definition for “hallucination,” creating space for continuous diagnostics over binary judgments.
7. Maynez et al. On Faithfulness and Factuality in Abstractive Summarization (ACL 2020). https://aclanthology.org/2020.acl-main.173/
Early distinction between surface similarity and semantic faithfulness, a precursor to directional drift measurement.
Engineering Precedent
8. Meyer, B. Object-Oriented Software Construction (1997), Chapter 11: Design by Contract.
Executable acceptance criteria independent of implementation, a direct parallel to policy gates.
9. Lord Kelvin (William Thomson). Electrical Units of Measurement (1883).
“When you can measure what you are speaking about… you know something about it.”
Engineering principle: control emerges from measurement, not perfect understanding.
📑 Appendix 1.
In early 2026, Wikipedia publicly documented its experience with generative AI–assisted editing and the decision to significantly restrict its use. The post, “Wikipedia and Generative AI Editing: What We Learned in 2025”, describes a year of experimentation that led to a clear conclusion:
AI-generated contributions consistently failed to meet Wikipedia’s standards for verifiability, sourcing, and editorial reliability. As a result, Wikipedia chose to ban or heavily limit generative AI content in its editing workflows.
Reading that article prompted this blog post.