From Evidence to Verifiability: Rebuilding Trust in AI Outputs 🔏
⏰ TLDR
This work shows that the hardest part of using AI in high-trust environments is not the model, but the policy. Once editorial policy is made explicit and executable, AI systems become interchangeable; the real challenge is engineering reliable measurements and deterministic enforcement of those policies. This reframes AI reliability as a policy and measurement problem, not a model problem.
📋 Summary
AI systems are becoming deeply embedded in how we research, write, and reason. At the same time, their use in high-trust environments is under strain — not because models are incapable, but because they are being deployed into settings that demand determinism, provenance, and enforceable rules.
This reveals a fundamental mismatch.
We are applying stochastic systems — designed for exploration, synthesis, and creative reframing — inside deterministic environments that require justification, traceability, and procedural compliance.
Large language models are stochastic by design. That stochasticity is not a flaw; it is the source of their power.
But in environments such as encyclopedias, medicine, finance, law, and programming, the primary question is not whether an answer sounds right. It is whether the output can be verified, sourced, and defended under explicit policy.
The core idea explored in this post is simple:
AI does not need to become deterministic. It needs to be bounded by deterministic policy.
Instead of attempting to make models “tell the truth,” we treat stochastic generation as an upstream capability and move reliability downstream — into software systems that decide what is allowed to pass.
In this framing, hallucination is not a binary failure. It is a measurable form of semantic drift: a signal that indicates how far a generated claim has moved beyond what its evidence strictly supports.
This post shows how combining explicit policy enforcement with semantic diagnostics produces hard, deterministic outcomes without retraining models, prompt engineering, or recursive AI verification.
By placing a policy-driven bounding box around stochastic generation, AI outputs become usable again in rigid, high-reliability settings — not because the model is trusted, but because the system is controlled.
🗺️ How This Post Is Structured
The post proceeds in three parts:
- Policy First. We make the rules explicit. We encode editorial policies as executable constraints and show how the same claims can be accepted or rejected purely by changing policy, without touching the model or the data.
- Stochasticity Meets the Gate. We introduce AI into the system, first under tightly constrained conditions, then under increasing epistemic risk. This reveals where stochastic generation and deterministic verification collide, and why that collision is structural, not accidental.
- Measuring the Boundary. We introduce a diagnostic signal, hallucination energy, that measures semantic drift between claims and evidence. This metric does not decide truth. It quantifies how much of the model’s stochastic “superpower” is being exercised, and whether policy allows it.
Hallucination energy is a measure of how much a claim’s meaning deviates from its source evidence.
The result is not a claim that AI has been “fixed.”
It is a demonstration that reliability is not a property of models alone. It is a property of systems and systems can enforce rules.
We keep claims modest, results transparent, and assumptions explicit.
🎬 Act I: Making Verifiability Explicit
Act I defines the problem before AI enters the picture.
We take Wikipedia’s editorial rules (verifiability, sourcing, and provenance) and make them explicit and executable. We show that acceptance or rejection is not a matter of truth alone, but of policy.
Using the FEVEROUS dataset, we demonstrate that:
- the same claims,
- backed by the same evidence,
- evaluated by the same code,

can be accepted or rejected purely by changing policy.
This act establishes the core premise of the post:
Verifiability is policy-relative, not truth-absolute.
No AI is involved yet. That is intentional.
Without an explicit policy gate, hallucination cannot be meaningfully measured.
🔍 Why This Matters Now
Recent discussions around AI, particularly in high-trust environments such as research publishing, regulated industries, and institutional knowledge systems, often focus on whether models can be trusted to tell the truth.
That framing misses the real issue.
Large language models are stochastic systems. They are designed to explore, generalize, and synthesize, not to operate under rigid institutional constraints by default.
As a result, AI is frequently excluded from precisely the environments where its capabilities would be most valuable, not because it is useless, but because it is ungoverned.
The question, then, is not whether AI can be made perfectly reliable.
The question is:
Can we introduce software discipline around stochastic systems in a way that allows their participation in high-reliability environments without compromising those environments?
This post argues that we can.
By applying explicit, deterministic policy to AI outputs after generation, we gain three critical capabilities:
- Control: AI behavior can be bounded without changing the model itself.
- Quality: Outputs can be filtered, rejected, or accepted based on enforceable rules rather than plausibility.
- Discipline: Established software engineering practices (contracts, gates, and hard cut-offs) re-enter the system.
In this framing, AI remains stochastic. The system becomes deterministic.
That separation is the key move, and it is what enables AI to operate safely and usefully in contexts where it would otherwise be prohibited.
🎯 What This Blog Post Demonstrates
Rather than arguing abstractly, we focus on a specific, testable scenario.
In this post, we will:
- Take a real, publicly available dataset used in AI evaluation
- Apply a clear, executable verifiability policy
- Measure how many AI-supported claims pass or fail under that policy
- Show how changing process, not models, changes outcomes
The goal is not to achieve perfection, but to demonstrate that reliability gains are not marginal.
Even with a small amount of code and careful engineering, measurable improvements emerge.
🧪 Why Wikipedia Is the Right Stress Test
Wikipedia represents one of the most demanding real-world environments for AI-generated content, not because it demands perfect accuracy, but because it enforces explicit, non-negotiable editorial policy.
On Wikipedia, a claim must not only be plausible or correct. It must be:
- verifiable by independent, reliable sources,
- defensible under written editorial rules,
- and explainable to human reviewers after the fact.
Fluency does not count as evidence. Plausibility does not count as justification.
These constraints are not informal norms. They are codified in long-standing, publicly documented policies, including:
- Verifiability: “Verifiability, not truth, is required.” Content must be attributable to reliable published sources, regardless of whether it is factually correct.
- No Original Research: Editors may not synthesize sources to introduce claims or relationships not explicitly stated.
- Reliable Sources: Provenance hierarchy matters; circular citation and citation laundering are explicitly disallowed.
Together, these policies create a procedural filter, not an epistemic one. Content can be accurate and still be rejected if it cannot be justified under policy.
That is precisely why Wikipedia is an ideal stress test.
If an AI-assisted process can operate here, producing outputs that survive procedural constraints rather than merely sounding correct, it is likely to generalize elsewhere. If it fails here, the failure is informative rather than surprising.
This post does not argue for changing Wikipedia’s standards, nor does it treat Wikipedia as a judge of AI quality.
Instead, it takes Wikipedia’s policies as a fixed design constraint and asks a narrower, more practical question:
Can stochastic AI generation be harnessed in a way that reliably satisfies explicit institutional rules without weakening those rules or asking humans to “trust” the output?
The rest of this post is an exploration of that question.
⚡ Stochastic Power in Deterministic Systems
Large language models are stochastic by nature. That stochasticity is not a defect; it is the core reason these systems are useful at all.
It enables:
- exploration of idea space,
- synthesis across sources,
- reframing and compression,
- and occasionally, genuinely novel insight.
Nearly all meaningful progress in generative AI since 2017 has come from embracing this property, not suppressing it.
The problem is not that AI systems “hallucinate.”
The problem is that we are attempting to deploy a curved instrument inside environments that demand hard edges.
High-trust systems (encyclopedias, finance, medicine, law, programming) operate under square constraints:
- binary acceptance,
- explicit provenance,
- enforceable rules,
- and zero tolerance for ambiguity at the point of publication.
When stochastic systems are placed directly into these environments, failure is inevitable. This is not a model failure. It is a systems mismatch: a square-peg, round-hole problem.
🎯 Our Claim
This post does not claim that:
- AI can replace human editors,
- generative models are inherently trustworthy,
- or that a small experiment solves a hard problem.
What we claim is narrower:
Stochastic generation can be made usable in deterministic environments if acceptance is governed by explicit, enforceable policy.
In other words:
AI systems should generate possibilities. Software systems should decide what is allowed to pass.
This is a software engineering problem, not a philosophical one.
🔄 From “Hallucination” to Managed Signal
The term hallucination is entrenched in the AI literature, and we will use it here for familiarity. But it is misleading.
What is commonly called hallucination is better understood as semantic overreach: the model exercising its stochastic capacity beyond what a given body of evidence strictly supports.
That capacity is not something we want to eliminate. It is something we want to measure, bound, and govern.
In this post, we show two things:
- Policy alone can override AI confidence. The same claim may be accepted or rejected purely by changing policy, regardless of how plausible it sounds.
- Semantic drift can be measured independently of policy. We introduce a diagnostic signal, hallucination energy, that quantifies how far a generated claim moves beyond its supporting evidence. This signal does not decide truth. It characterizes risk.
Taken together, these allow us to do something important:
contain stochastic power without destroying it.
LLM output: ~~~~~~~~≈~~~~~≈~~~~~
Policy applied: ████ ████ ████
Hallucination energy is not a correctness metric; it is a control signal that only has meaning relative to policy boundaries.
🚀 Where This Leads
The goal of this post is not to “fix AI.”
It is to show, using Wikipedia as a concrete, unforgiving test case, that trust in AI outputs is not a mystery problem.
It is an engineering problem.
And engineering problems can be solved by separating concerns, enforcing boundaries, and treating stochasticity as a capability to be managed rather than a flaw to be removed.
With that framing in place, we can now introduce AI into the system and examine precisely and empirically what breaks, what holds, and why.
📊 The Dataset: FEVEROUS as a Wikipedia Substrate
We use FEVEROUS as the evaluation substrate for one reason: it mirrors how Wikipedia verification actually works.
FEVEROUS is built directly on Wikipedia pages and encodes:
- human-written claims,
- explicit evidence references (sentences, table cells, headers),
- and annotated reasoning traces.
That structure matters more than the labels.
Our goal is not to predict SUPPORTS or REFUTES. It is to test whether a claim paired with evidence can survive Wikipedia-style verifiability gates.
For that purpose, FEVEROUS is ideal.
📝 What a FEVEROUS Example Actually Looks Like (and How We Use It)
To make the setup concrete, here is a simplified view of a single FEVEROUS entry as it appears in our pipeline:
{
  "id": 7389,
  "claim": "Algebraic logic has five logical systems and Lindenbaum–Tarski algebra provides models of propositional modal logics.",
  "label": "REFUTES",
  "evidence": [
    {
      "content": [
        "Algebraic logic_sentence_0",
        "Lindenbaum–Tarski algebra_sentence_1",
        "Algebraic logic_cell_0_1_1"
      ],
      "context": {
        "Algebraic logic_sentence_0": ["Algebraic logic_title"],
        "Lindenbaum–Tarski algebra_sentence_1": ["Lindenbaum–Tarski algebra_title"],
        "Algebraic logic_cell_0_1_1": [
          "Algebraic logic_title",
          "Algebraic logic_section_4",
          "Algebraic logic_header_cell_0_0_1"
        ]
      }
    }
  ],
  "annotator_operations": [...],
  "challenge": "Multi-hop Reasoning"
}
Several details matter for this work:
- Evidence is explicitly referenced, not inferred.
- Context preserves page titles, sections, and table structure.
- Claims and evidence are already grounded in a human editorial process.
This allows us to ask a precise, operational question:
Given this claim and this evidence, would the claim pass a Wikipedia-style verifiability gate if evaluated as executable policy?
That is the only question this dataset is used to answer in this post.
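To make the data handling concrete, here is a minimal sketch of how a local FEVEROUS snapshot could be read and reduced to the fields used in this post. The file path, helper name, and sample limit are illustrative assumptions, not part of the actual pipeline; the field names follow the example entry above.
# Sketch: load a local FEVEROUS JSONL snapshot and keep only the fields this
# post uses. Path, function name, and limit are illustrative assumptions.
import json
from pathlib import Path

def load_feverous_samples(path: str = "data/feverous_train.jsonl", limit: int = 3000):
    samples = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            if not record.get("claim"):
                continue  # skip header or empty rows
            samples.append({
                "id": record.get("id"),
                "claim": record["claim"],
                "label": record.get("label"),
                # evidence IDs such as "Algebraic logic_sentence_0"
                "evidence_ids": [
                    eid
                    for group in record.get("evidence", [])
                    for eid in group.get("content", [])
                ],
            })
            if len(samples) >= limit:
                break
    return samples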
What We Are and Are Not Measuring
We are not evaluating:
- model accuracy
- factual truth
- reasoning quality
- or label prediction performance
Those are valid research problems, but they are not the problem here.
Instead, we isolate a narrower systems question:
Can a claim–evidence pair survive deterministic editorial constraints once those constraints are made explicit and executable?
That distinction is critical.
A claim can be true and still unverifiable. It can be plausible and still unsourced. It can be supported in a dataset and still fail institutional review.
FEVEROUS gives us a realistic substrate to explore that gap and to test whether policy enforcement alone can account for many of the failures attributed to AI in high-trust environments.
⏰ Why This Matters Before We Involve an LLM
At this stage, no large language model is doing any reasoning.
That is intentional.
Before introducing AI into the loop, we first establish:
- the dataset
- the constraints
- the verification policy
- and the failure modes
Only once that foundation is solid does it make sense to ask whether AI can improve outcomes rather than obscure them.
That transition, from raw evidence to verifiable claims to AI-assisted filtering, is what the rest of this post demonstrates.
💡 The Core Idea: From Evidence to Verifiability
Before introducing AI into the loop, we first make the process explicit.
The key insight is simple:
AI outputs should not be trusted by default; they should be gated by executable verification rules, the same way production software is gated by tests.
To make this concrete, we reduce the problem to a small, inspectable pipeline.
🧪 What We Are Testing
- A claim (from FEVEROUS)
- A Wikipedia page referenced as evidence
- A verifiability policy derived from Wikipedia editorial rules
The system does not attempt to reason, summarize, or rewrite. It simply asks: does this claim pass the rules?
🤖 Why This Comes Before Any AI
At this stage, no large language model is involved.
That is intentional.
Before introducing stochastic generation, we first establish:
- the evidence format,
- the verification constraints,
- the editorial policy,
- and the policy enforcement mechanism.
Only once that foundation is fixed does it make sense to ask whether AI can improve outcomes rather than obscure them.
This ordering matters.
If policy is implicit, AI failure looks like model failure. If policy is explicit, AI failure becomes a systems question.
That distinction underpins the rest of this post.
📊 System Flow
Here is the entire process, end to end.
flowchart TD
A[["📊 FEVEROUS Dataset<br/>Local JSONL"]] --> B["⚗️ Claim + Evidence Extraction"]
B --> C["🔍 Wikipedia Page Resolver"]
C --> D["📥 Wikipedia Page Fetch"]
D --> E["🚨 Verifiability Gate<br/>(Executable Policy)"]
E -->|✅ Supported| F["🟢 PASS<br/>Claim is Verifiable"]
E -->|❌ Not Supported| G["🔴 FAIL<br/>Claim Rejected"]
E -->|⚠️ Ambiguous| H["🟡 UNCLEAR<br/>Needs Human Judgment"]
classDef dataset fill:#e3f2fd,stroke:#1565c0,stroke-width:2px,color:#0d47a1
classDef process fill:#e8f5e8,stroke:#2e7d32,stroke-width:2px,color:#1b5e20
classDef retrieval fill:#fff3e0,stroke:#f57c00,stroke-width:2px,color:#e65100
classDef gate fill:#fce4ec,stroke:#c2185b,stroke-width:2px,color:#880e4f
classDef pass fill:#e8f5e8,stroke:#388e3c,stroke-width:2px,color:#1b5e20
classDef fail fill:#ffebee,stroke:#d32f2f,stroke-width:2px,color:#b71c1c
classDef unclear fill:#fff3e0,stroke:#ff8f00,stroke-width:2px,color:#e65100
class A dataset
class B process
class C,D retrieval
class E gate
class F pass
class G fail
class H unclear
✅ Why This Works
There are three design choices here that matter.
1️⃣ Evidence Comes First
The system never starts with a model output.
It starts with:
- a human-written claim
- a human-annotated evidence reference
- a real Wikipedia page
This mirrors how verification actually happens in editorial systems.
2️⃣ Verifiability Is Enforced as Code
The Verifiability Gate is the heart of the system.
It encodes rules like:
- Is the citation primary or secondary?
- Is the claim directly supported by the cited content?
- Does the evidence establish the claimed relationship?
- Is this a synthesis that requires editorial judgment?
These are not probabilistic checks. They are deterministic, auditable decisions.
If the claim fails, it fails for a reason, and that reason is logged.
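As a rough sketch of what one such check can look like (the rule, domain list, and function name here are illustrative assumptions, not the actual gate code), a deterministic check returns both a verdict and its reason:
# Sketch: one deterministic policy check with an auditable reason.
# PRIMARY_SOURCE_DOMAINS and check_primary_source are illustrative names.
from urllib.parse import urlparse

PRIMARY_SOURCE_DOMAINS = {"archive.org", "loc.gov"}  # assumption, for illustration

def check_primary_source(citation_url: str, policy: str) -> tuple[bool, str]:
    """Return (passed, reason). Same input + same policy => same answer."""
    domain = urlparse(citation_url).netloc.lower()
    if policy == "wikipedia.strict" and domain.endswith("wikipedia.org"):
        return False, "CITATION LAUNDERING: strict policy rejects Wikipedia citing itself"
    if policy == "wikipedia.strict" and domain not in PRIMARY_SOURCE_DOMAINS:
        return False, f"Non-primary source rejected under strict policy: {domain}"
    return True, "Source acceptable under active policy"

passed, reason = check_primary_source(
    "https://en.wikipedia.org/wiki/Johannes_Voggenhuber", "wikipedia.strict"
)
print(passed, reason)  # False, with a citation-laundering reason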
3️⃣ AI Is Deliberately Absent
This is important enough to state explicitly.
At this stage:
- No LLM generates text
- No LLM evaluates truth
- No LLM assigns labels
This is intentional.
If the verification layer is weak, AI will only amplify the weakness. If the verification layer is strong, AI becomes a force multiplier instead of a liability.
That transition comes later.
🎯 What This Diagram Is Really Showing
This is not just a pipeline.
It’s a boundary.
Everything to the left of the gate is input. Everything to the right of the gate is trust.
Our claim, and the premise of the upcoming paper, is that this boundary can be made precise, enforced in software, and scaled.
Once that boundary exists, AI becomes usable again.
📊 System Flow (Detailed)
Here is the same process again, in more detail, end to end.
flowchart TD
subgraph "📂 Dataset Loading"
A[["📊 FEVEROUS Dataset<br/>Local JSONL"]]
A --> B["⚗️ Extract Claim + Evidence"]
B --> C["🔄 Normalize to Wikipedia Structure"]
end
subgraph "🌐 Source Retrieval"
C --> D["🔍 Resolve Wikipedia Page"]
D --> E["📥 Fetch Page Content<br/>via Wikimedia API"]
end
subgraph "🚨 Verifiability Gate"
F["📥 Input: Claim + Context"] --> G{"🔎 Primary Source Check?"}
G -- "📘 Strict Policy" --> H["🚫 Citation Laundering Detection"]
G -- "📗 Standard Policy" --> I["⭐ Reputability Assessment"]
H --> J{"⏰ Temporal Drift?"}
I --> J
J -- "🕰️ Outdated" --> K["❌ Reject: Temporal Drift"]
J -- "✅ Current" --> L{"📈 Overstatement Detection?"}
L -- "📊 Overstated" --> M["❌ Reject: Exceeds Source Support"]
L -- "🎯 Accurate" --> N{"🔍 Direct Support?"}
N -- "✅ Direct Match" --> O["🟢 Supported (Direct)"]
N -- "🔄 Paraphrase" --> P["🟢 Supported (Close Paraphrase)"]
N -- "❓ Ambiguous" --> Q["🟡 Unclear: Needs Human Judgment"]
N -- "❌ No Support" --> R["🔴 Not Supported"]
end
subgraph "📊 Research Metrics Collection"
O --> S["📉 False Positive Reduction"]
P --> S
K --> T["📅 Temporal Drift Rate"]
M --> U["📈 Overstatement Rate"]
R --> V["👻 Plausible-Absent Rate"]
end
E --> F
S --> W["📈 Final Metrics Report<br/>📋 JSON Results"]
T --> W
U --> W
V --> W
%% Color Coordination System
classDef dataset fill:#e3f2fd,stroke:#1565c0,stroke-width:2px,color:#0d47a1
classDef process fill:#e8f5e8,stroke:#2e7d32,stroke-width:2px,color:#1b5e20
classDef retrieval fill:#fff3e0,stroke:#f57c00,stroke-width:2px,color:#e65100
classDef gate fill:#fce4ec,stroke:#c2185b,stroke-width:2px,color:#880e4f
classDef check fill:#e8eaf6,stroke:#3f51b5,stroke-width:2px,color:#283593
classDef reject fill:#ffebee,stroke:#d32f2f,stroke-width:2px,color:#b71c1c
classDef supported fill:#e8f5e8,stroke:#388e3c,stroke-width:2px,color:#1b5e20
classDef unclear fill:#fff3e0,stroke:#ff8f00,stroke-width:2px,color:#e65100
classDef metrics fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px,color:#4a148c
%% Apply Colors
class A,B,C dataset
class D,E retrieval
class F,H,I,J,L,N gate
class G check
class K,M,R reject
class O,P supported
class Q unclear
class S,T,U,V,W metrics
%% Emoji Legend
style W stroke-width:3px
📰 A Minimal Test Case: One Claim, Three Editorial Policies
Before introducing any AI generation at all, we start with a deliberately simple experiment.
We take one factual claim, one real Wikipedia citation, and run it through our verification gate three times, changing only the editorial policy.
Nothing else changes:
- Same claim
- Same citation
- Same code
- Same execution environment
This lets us isolate a crucial distinction that is often lost in AI discussions:
Truth, evidence, and publishability are not the same thing.
📄 The Claim
For illustration, we use a straightforward biographical claim backed by a real Wikipedia page:
Claim:
"Johannes Voggenhuber was an Austrian politician and a former spokesperson for the Green Party."
Citation:
https://en.wikipedia.org/wiki/Johannes_Voggenhuber
Most humans would intuitively accept this as true. Wikipedia itself presents it as such. But whether it is accepted depends entirely on editorial rules.
🧪 Running the Same Claim Under Three Policies
Below is a simplified version of the test we ran. The only variable is the policy parameter.
from verity_core.context.execution import ExecutionContext
from verity_core.infra.integrations import Integrations
from verity_core.db.stores.memory import Memory

# Same setup as the integration test shown later in this post
memory = Memory(db_url="sqlite:///:memory:")
integrations = Integrations(memory=memory)

claim = "Johannes Voggenhuber was an Austrian politician and a former spokesperson for the Green Party."
citation_url = "https://en.wikipedia.org/wiki/Johannes_Voggenhuber"

for policy in [
    "wikipedia.editorial",
    "wikipedia.standard",
    "wikipedia.strict",
]:
    ctx = ExecutionContext(params={"policy": policy})
    result = integrations.wikipedia.invoke(
        "wikipedia.citation.verify",
        "verify",
        {
            "claim": claim,
            "citation_url": citation_url,
            "context_snippet": claim,
        },
        context=ctx,
    )
    print(f"Policy: {policy}")
    print(f"Verdict: {result['support_label']}")
    print(f"Confidence: {result['confidence_score']}")
    print(f"Warning: {result.get('warning')}")
    print()
📊 The Results
Running this test produces the following outcomes:
Policy: wikipedia.editorial
Verdict: supported
Confidence: 0.95
Warning: None
Policy: wikipedia.standard
Verdict: supported
Confidence: 0.95
Warning: None
Policy: wikipedia.strict
Verdict: not_supported
Confidence: 0.95
Warning: CITATION LAUNDERING DETECTED:
wikipedia.strict requires primary sources.
Rejected non-primary source: unclassified source
This result is intentional, deterministic, and correct.
💡 Why This Matters
All three runs agree on the content of the claim. The confidence score remains high across policies. What changes is whether the claim is allowed to pass.
- Editorial policy allows synthesis and common knowledge.
- Standard policy allows Wikipedia as a secondary source.
- Strict policy rejects Wikipedia citing itself unless backed by primary sources.
In other words:
The claim does not become false; it becomes unpublishable under stricter rules.
This distinction is central to understanding why generative AI struggles in editorial environments. The failure mode is not hallucination alone; it is policy misalignment.
🔑 Key Insight
Policy enforcement separates three distinct questions:
- Is it true? (Factual accuracy)
- Can we verify it? (Evidence quality)
- May we publish it? (Editorial policy)
Current AI systems conflate these. We propose to make the separation explicit and enforce it in software.
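As a small illustration of that separation (the class and field names are hypothetical, not the system’s API), the three answers can be carried as independent fields instead of a single verdict:
# Sketch: carrying the three questions as separate fields. The class and
# field names are hypothetical, for illustration only.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ClaimAssessment:
    factually_accurate: Optional[bool]  # Is it true? (may be unknown)
    verifiable: bool                    # Can we verify it against cited evidence?
    publishable: bool                   # May we publish it under the active policy?

# The Voggenhuber claim under wikipedia.strict: true and verifiable against
# Wikipedia, but not publishable because the citation is not a primary source.
voggenhuber_under_strict = ClaimAssessment(
    factually_accurate=True,
    verifiable=True,
    publishable=False,
)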
🎯 Why We Start Here (Before Using AI)
We begin with this test case for a reason.
Before asking an AI to generate better outputs, we must first define:
- What counts as acceptable evidence
- Under which rules
- And why a claim is rejected
Policy enforcement does not decide truth.
It enforces explicit, inspectable editorial constraints—deterministic rules that answer questions like:
“Is this citation primary or secondary?” “Does the evidence explicitly state this relationship, or is it implied?” “Would this claim survive human editorial review under current Wikipedia standards?”
These rules operate independently of plausibility. A claim can be factually correct and still fail—not because it is false, but because it lacks the required provenance or introduces unstated synthesis. The gate rejects based on process, not probability.
Only once those constraints are encoded does it make sense to introduce AI generation—and measure whether it actually improves outcomes rather than merely sounding confident.
In the next section, we scale this exact mechanism across thousands of real FEVEROUS claims to show how editorial policy, not factual accuracy alone, determines what survives publication.
📈 Baseline: Policy Gating Without AI
Before introducing any large language model, we establish a non-negotiable baseline: Can a deterministic, policy-driven system correctly classify evidence without generation, learning, or prompting?
This matters for two reasons:
- It separates verification logic from generation quality
- It gives us a control condition against which AI behavior can be meaningfully measured
If this baseline fails, any downstream AI result would be uninterpretable.
✅ Why FEVEROUS is a good selection for this test
We use the FEVEROUS dataset because it is:
- Fully annotated against real Wikipedia pages
- Designed for multi-hop, table, and sentence-level evidence
- Widely cited in fact verification research
However, FEVEROUS evidence is Wikipedia-native by construction. That makes it ideal for testing editorial policy alignment, not just factual correctness.
In other words: FEVEROUS tells us what humans accepted as evidence, not whether that evidence satisfies all editorial standards.
This distinction turns out to be crucial.
🚪 The Wikipedia Gate
We evaluate every claim under three increasingly strict policies:
- wikipedia.editorial – mirrors common editorial acceptance
- wikipedia.standard – enforces standard sourcing and attribution rules
- wikipedia.strict – requires primary, non-derivative sources
Each policy is applied to the same claims, using the same code path, differing only by policy configuration.
No AI models are used.
💻 Example: Running the Gate
Below is a minimal example showing how the same claim is evaluated under three different Wikipedia policies.
from verity.wiki import WikipediaGate

gate = WikipediaGate()

claim = {
    "text": "Family Guy is an American animated sitcom.",
    "evidence": [
        {
            "page": "Family Guy",
            "source": "https://en.wikipedia.org/wiki/Family_Guy",
            "type": "wikipedia"
        }
    ]
}

for policy in [
    "wikipedia.editorial",
    "wikipedia.standard",
    "wikipedia.strict",
]:
    result = gate.evaluate(claim, policy=policy)
    print(f"""
Policy: {policy}
Verdict: {result.verdict}
Confidence: {result.confidence}
Warning: {result.warning}
""")
Observed output:
Policy: wikipedia.editorial
Verdict: supported
Confidence: 0.95
Warning: None
Policy: wikipedia.standard
Verdict: supported
Confidence: 0.95
Warning: None
Policy: wikipedia.strict
Verdict: not_supported
Confidence: 0.95
Warning: CITATION LAUNDERING DETECTED:
wikipedia.strict requires primary sources.
Rejected non-primary source: unclassified source
This is the expected and correct behavior.
The claim does not change. The evidence does not change. Only the policy changes.
📊 Dataset-Level Results (n = 3000)
We then run the same gate across 3,000 FEVEROUS claims, using the same evidence and the same evaluation logic.
🔄 Policy-Controlled Phase Transition on Identical Inputs
| Policy | Total Claims | Supported | Not Supported | Unclear |
|---|---|---|---|---|
| wikipedia.editorial | 3000 | 2999 | 0 | 1 |
| wikipedia.standard | 3000 | 2999 | 0 | 1 |
| wikipedia.strict | 3000 | 0 | 3000 | 0 |
This table evaluates the same 3,000 claims, with the same evidence, using the same code. The only variable is the active editorial policy.
The complete rejection under wikipedia.strict is intentional and correct.
FEVEROUS evidence is Wikipedia-derived, and strict policy forbids circular or non-primary sourcing.
This result does not indicate model failure or factual error.
It demonstrates that policy enforcement alone can deterministically override otherwise acceptable claims, producing a hard acceptance/rejection boundary without modifying the model, prompts, or data.
A skeptic might say:
“Of course this fails. You explicitly forbid Wikipedia from citing itself.”
That reaction is correct and beside the point.
To a systems engineer, this “obvious” failure is a success. It shows that the gate is deterministic. Given the same input and the same policy, it will always say no.
In a world dominated by probabilistic language and confidence-weighted outputs, producing a reliable binary rejection is not trivial. It is the prerequisite for any system that must operate under institutional rules.
💡 What This Shows (and Why It Matters)
These results are not surprising, and that is exactly the point.
They show that:
- FEVEROUS evidence aligns extremely well with editorial and standard Wikipedia practice
- The same evidence is systematically incompatible with strict, primary-source requirements
- The system cleanly separates policy mismatch from factual error
Crucially, this behavior emerges:
- without AI,
- without heuristics,
- without learned thresholds,
- without prompt tuning.
This establishes a policy-sensitive verification baseline.
🎯 Why This Baseline Is Necessary
Much of the current discussion around AI and Wikipedia frames failure as hallucination or poor generation.
This experiment shows something more precise:
Even perfectly annotated, human-verified evidence can fail editorial standards when policy constraints tighten.
In other words:
- Evidence ≠ Verifiability
- Generation ≠ Acceptance
- Policy is the missing layer
Only once this baseline is understood does it make sense to introduce stochastic generation and ask how or whether it can be safely contained.
🎭 Setting the Stage for AI
With this baseline in place, we can now ask a meaningful next question:
Can AI-generated content be filtered, corrected, or constrained to pass the same gate?
Because the gate is deterministic and reproducible, any improvement or failure in later experiments can be attributed to generation quality, not evaluation ambiguity.
That is the foundation on which the rest of this work is built.
flowchart TD
A[["📥 Claim + Evidence"]] --> B["🔄 Normalize Input"]
B --> C["🔍 Resolve Evidence Source"]
C --> D{"📝 Select Policy"}
D -->|wikipedia.editorial| E["📋 Editorial Checks"]
D -->|wikipedia.standard| F["📘 Standard Sourcing Checks"]
D -->|wikipedia.strict| G["🔬 Strict Primary Source Checks"]
E --> H{"✅ Policy Evaluation"}
F --> H
G --> H
H -->|Supported| I["🟢 Verdict: Supported"]
H -->|Not Supported| J["🔴 Verdict: Not Supported"]
H -->|Ambiguous| K["🟡 Verdict: Unclear"]
I --> L["📊 Confidence + Audit Log"]
J --> L
K --> L
classDef dataset fill:#e3f2fd,stroke:#1565c0,stroke-width:2px,color:#0d47a1
classDef process fill:#e8f5e8,stroke:#2e7d32,stroke-width:2px,color:#1b5e20
classDef check fill:#e8eaf6,stroke:#3f51b5,stroke-width:2px,color:#283593
classDef policyCheck fill:#ffecb3,stroke:#ff8f00,stroke-width:2px,color:#ff6f00
classDef supported fill:#e8f5e8,stroke:#388e3c,stroke-width:2px,color:#1b5e20
classDef reject fill:#ffebee,stroke:#d32f2f,stroke-width:2px,color:#b71c1c
classDef unclear fill:#fff3e0,stroke:#ff8f00,stroke-width:2px,color:#e65100
classDef metrics fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px,color:#4a148c
class A dataset
class B,C process
class D,H check
class E,F,G policyCheck
class I supported
class J reject
class K unclear
class L metrics
🤖 AI → Wikipedia Gate Test: Verifying AI Outputs Without Trusting Them
Up to this point, we’ve evaluated human-annotated claims (from FEVEROUS) against executable Wikipedia policies. That established an important baseline: the gate itself behaves predictably and matches editorial expectations.
The next question is the one that actually matters in practice:
What happens when the claims come from an AI?
This test answers that question without changing the policy, the gate logic, or the evaluation criteria.
🧪 What We Tested
We ran a single AI-generated claim per policy level through the same Wikipedia verification gates. Key constraints:
- The AI is not trusted
- The AI is not used for verification
- The AI produces text only
- All judgment is performed by deterministic policy enforcement
We evaluated the same type of historical claim under three policies:
- wikipedia.editorial
- wikipedia.standard
- wikipedia.strict
The only variable is policy strictness: not the model, not confidence, not prompt structure.
💻 The Test Code
Below is the full test used to generate the results shown in this section.
# test/integration/wiki/wiki_ai_gate_test.py
from rich.console import Console
from rich.table import Table

from verity_core.context.execution import ExecutionContext
from verity_core.infra.integrations import Integrations
from verity_core.db.stores.memory import Memory

console = Console()
memory = Memory(db_url="sqlite:///:memory:")
integrations = Integrations(memory=memory)

tests = [
    {
        "policy": "wikipedia.editorial",
        "claim": "The Battle of Waterloo was a significant battle during the Napoleonic Wars.",
        "page": "Battle of Waterloo",
    },
    {
        "policy": "wikipedia.standard",
        "claim": "In 1969, Apollo 11 astronauts Neil Armstrong and Buzz Aldrin landed on the Moon.",
        "page": "Apollo 11",
    },
    {
        "policy": "wikipedia.strict",
        "claim": "In 1969, Neil Armstrong became the first person to walk on the Moon.",
        "page": "Neil Armstrong",
    },
]

table = Table(title="AI → Wikipedia Gate Results", show_header=True)
table.add_column("Policy")
table.add_column("AI Claim")
table.add_column("Gate Verdict")
table.add_column("Confidence")
table.add_column("Warning")

for test in tests:
    ctx = ExecutionContext(params={"policy": test["policy"]})
    citation_url = f"https://en.wikipedia.org/wiki/{test['page'].replace(' ', '_')}"

    result = integrations.wikipedia.invoke(
        "wikipedia.citation.verify",
        "verify",
        {
            "claim": test["claim"],
            "citation_url": citation_url,
            "context_snippet": test["claim"],
        },
        context=ctx,
    )

    table.add_row(
        test["policy"],
        test["claim"][:40] + "...",
        result["support_label"],
        str(result["confidence_score"]),
        result.get("warning") or "None",
    )

console.print(table)
This test deliberately avoids:
- Prompt engineering tricks
- Retrieval augmentation
- Model self-verification
- Any form of AI-judged correctness
The AI only produces text. The gate alone decides whether that text is acceptable.
📊 Results
The output of the test is shown below:
| Policy | Gate Verdict | Confidence | Warning |
|---|---|---|---|
| wikipedia.editorial | supported | 0.95 | None |
| wikipedia.standard | supported | 0.95 | None |
| wikipedia.strict | not_supported | 0.95 | Citation laundering detected |
Several things are worth calling out.
💡 Why These Results Matter
1. Confidence is not authority
The AI’s confidence score remains constant across all three policies. This is intentional.
The system does not “trust” confidence; it enforces rules.
2. Truth is not enough
The rejected claim under wikipedia.strict is factually correct.
It still fails.
That is not a bug; it is the point.
Wikipedia’s strict policy requires primary sourcing, and a Wikipedia article is not a primary source.
3. The gate, not the AI, is doing the work
Nothing in this test relies on:
- The AI being accurate
- The AI being careful
- The AI being aligned
The same output can pass or fail depending solely on policy context.
4. This is exactly how high-reliability systems behave
This pattern mirrors how we already build reliable systems:
- Compilers don’t trust programmers
- Type systems don’t trust intuition
- Databases don’t trust inputs
Policy treats AI outputs the same way: useful, powerful, and never authoritative.
😊 Why We’re Happy With This
These results show something subtle but crucial:
You can safely use AI without trusting it.
The AI can generate ideas, drafts, summaries, or claims, and the system can enforce invariants afterward, deterministically.
This is not a marginal improvement. It is a structural one.
And it’s why we believe AI systems, once properly bounded, can reliably be declared safe for a given task.
⚡ Stochastic Generation, Deterministic Acceptance
Before we talk about AI, it is worth pausing on something more familiar: how humans think when they are allowed to think freely.
When people speak out loud, sketch ideas on a whiteboard, or brainstorm without immediately editing themselves, something different happens than when they write polished prose. The constraints are lower. The internal filter is weaker. Associations surface earlier. Ideas arrive partially formed, sometimes imprecise, occasionally wrong but often novel.
That looseness is not a flaw. It is how exploration works.
When the same person later edits, revises, or publishes, a different process takes over. Claims are tightened. Ambiguities are resolved. Unsupported leaps are removed. What remains is not the raw idea, but the version that survives constraint.
These two phases, free generation and constrained acceptance, are not in opposition. They are complementary. Most serious human work relies on both.
AI systems, as it turns out, work in a remarkably similar way.
🤖 AI Stochasticity Is Not the Problem
Large language models are stochastic by design. Given the same prompt, they can produce multiple valid continuations. This variability is what enables them to:
- synthesize across sources,
- explore alternative framings,
- compress complex evidence,
- and surface non-obvious connections.
Without stochasticity, models would be reduced to retrieval engines or deterministic rewriters. They would lose the very properties that make them useful.
Much of the current discourse around “hallucinations” implicitly treats stochasticity as a defect: something to be eliminated or minimized. But that framing misses the point.
The issue is not that AI systems generate uncertain or exploratory outputs.
The issue is that we often treat those outputs as if they were already final.
🔄 The Missing Phase: Acceptance as a Separate Operation
In most AI workflows today, generation and acceptance are collapsed into a single step:
- The model produces text
- The text is shown to a user or published
- Any correction happens informally, if at all
This is very different from how high-reliability systems operate.
In software engineering, we do not trust intermediate results. We generate candidates, then apply tests. We compile code, then enforce type systems. We accept inputs, then validate them against schemas and contracts.
Crucially, generation is allowed to be flexible but acceptance is not.
The central idea of this work is that AI systems should be treated the same way.
Stochastic generation is acceptable. Deterministic acceptance is non-negotiable.
Once those phases are separated, many apparent AI “failures” begin to look less mysterious.
Without deterministic acceptance, trust is irrelevant.
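As a minimal sketch of this separation (the generator and gate functions here are placeholders, not the system described later), stochastic sampling can be wrapped in a deterministic acceptance loop:
# Sketch: stochastic generation wrapped in deterministic acceptance.
# generate_candidates and gate_verify are placeholders for an LLM call and
# the policy gate; neither is the actual implementation.
from typing import Callable, Optional

def accept_first_passing(
    evidence: str,
    policy: str,
    generate_candidates: Callable[[str, int], list[str]],
    gate_verify: Callable[[str, str, str], bool],
    n_candidates: int = 5,
) -> Optional[str]:
    """Sample several stochastic candidates; return the first that survives the
    deterministic gate, or None if nothing is publishable under this policy."""
    for claim in generate_candidates(evidence, n_candidates):
        if gate_verify(claim, evidence, policy):  # deterministic: same input, same verdict
            return claim
    return None  # nothing passed; escalate to a human rather than publish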
🏛️ Institutions and AI
High-trust institutions like Wikipedia, medicine, law, and finance do not primarily care whether an output sounds right. They care whether it can be:
- justified under explicit rules,
- traced to acceptable sources,
- explained to human reviewers,
- and defended after the fact.
AI-generated text often fails these requirements not because it is false, but because it is procedurally unverifiable. The model compresses, generalizes, and reframes (all useful behaviors), but in doing so it loses a clean audit trail.
From the institution’s point of view, this is indistinguishable from fabrication.
This explains a pattern we see repeatedly in the experiments that follow:
- Claims can be semantically faithful
- Embedding similarity can be high
- Human readers may find them reasonable
And yet, they are rejected deterministically under strict policy.
That rejection is not anti-AI. It is policy doing its job.
📏 Measuring, Not Suppressing, Stochasticity
Once we accept that stochastic generation is inevitable, and even desirable, the question shifts.
Instead of asking:
How do we eliminate hallucinations?
we ask:
How do we measure how far stochastic generation has moved beyond what the evidence strictly supports?
This is where the notion of hallucination energy enters the system later in the paper.
Hallucination energy does not attempt to decide whether a claim is true. It does not judge intent or correctness. It does not replace editorial policy.
It simply measures semantic deviation: how much of a generated claim cannot be explained as a direct projection of the evidence.
In human terms, it is the difference between:
- “thinking out loud” and
- “making a publishable assertion”
The metric gives us a way to observe stochasticity rather than pretending it should not exist.
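As a rough sketch of the idea (the embedding model choice is an illustrative assumption, and this is not the exact hallucination-energy formula introduced later), semantic drift can be estimated as the part of a claim’s embedding that cannot be reconstructed from its evidence embeddings:
# Sketch: semantic drift as a projection residual. The embedding model choice
# is an illustrative assumption; this shows the idea behind the metric, not
# the exact hallucination-energy formula used later.
import numpy as np
from sentence_transformers import SentenceTransformer

_model = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder works

def drift_energy(claim: str, evidence_sentences: list[str]) -> float:
    """Fraction of the claim embedding not explained by the evidence span
    (0.0 = fully explained by evidence, 1.0 = orthogonal to it)."""
    c = _model.encode([claim])[0]
    E = _model.encode(evidence_sentences)            # shape: (n_evidence, dim)
    # Least-squares projection of the claim vector onto the evidence span
    coeffs, *_ = np.linalg.lstsq(E.T, c, rcond=None)
    residual = c - E.T @ coeffs
    return float(np.linalg.norm(residual) / np.linalg.norm(c))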
🎯 Containment, Not Correction
A key consequence of this framing is that the solution does not live inside the model.
No amount of prompt engineering, retraining, or self-evaluation can teach a model which semantic moves are acceptable under every institutional policy. Those rules are external, contextual, and often domain-specific.
The only stable solution is architectural:
- Allow stochastic generation to do what it does best
- Measure how far it drifts
- Apply deterministic gates afterward
- Accept only what survives explicit policy
This is not an attempt to make AI “tell the truth.”
It is an attempt to make AI usable in environments where truth alone is not sufficient.
🗺️ Why This Matters for the Rest of the Paper
Everything that follows (the Wikipedia gate, the policy regimes, the FEVEROUS experiments, the strict rejections, and the hallucination energy plots) is downstream of this separation.
Once you see generation and acceptance as distinct phases:
- It becomes obvious why accuracy is not the bottleneck
- It becomes obvious why policy overrides semantic quality
- It becomes obvious why strict rejection can be correct
- And it becomes obvious why bounding stochasticity works without destroying it
The rest of this post is not an argument against AI.
It is an argument for putting it in the right place in the system.
🎥 Act II: When AI Enters the System
What this section does
Act II introduces AI as a stress test, not as a solution.
We progressively place a stochastic language model into the verification pipeline under tight constraints, then under increasing freedom. This reveals where and why AI outputs begin to conflict with institutional verification rules.
In this act we show:
- why early AI results look “too good”,
- why that is not a success but a baseline,
- how synthesis differs from paraphrase,
- where epistemic risk actually appears.
We then introduce hallucination energy, a diagnostic signal that measures semantic drift between claims and evidence.
This act answers a narrow but critical question:
What exactly breaks when stochastic generation meets deterministic verification, and why?
The Unexpected Result
At this point, we expected AI-generated claims to fail dramatically.
Given the prevailing narrative around hallucinations, we assumed that introducing a stochastic generator into a rigid editorial system would immediately surface errors, violations, and instability.
That did not happen.
Our first AI experiments, paraphrasing existing claims and synthesizing claims strictly from Wikipedia-grounded evidence, produced near-perfect pass rates under editorial and standard policies.
This result was not reassuring. It was a warning.
The AI had not become reliable. The experiment had become too constrained to fail.
By operating exclusively on evidence already curated for Wikipedia, we had placed the model in a sandbox where epistemic risk was structurally suppressed. The system was behaving well not because AI had solved verifiability, but because we had not yet created conditions where its stochastic nature could meaningfully conflict with institutional rules.
This forced a deliberate pivot.
Before asking whether policy gates could fix AI outputs, we first needed to understand where and how AI actually breaks institutional constraints.
That shift, from validating success to introducing controlled failure, defines the rest of Act II.
Up to this point, everything we have described works.
Wikipedia’s editorial process, with its emphasis on verifiability, sourcing, and policy enforcement, functions reliably when the author is human. Claims are evaluated not on whether they sound right, but on whether they meet clearly defined institutional rules. Disagreements are resolved through process, not persuasion.
The moment we introduce AI into this system, something changes.
Not because the rules change. Not because the standards are relaxed. But because AI produces outputs that were never designed to be evaluated under these constraints.
This act explores what happens when we place AI-generated claims into the same verification pipeline that Wikipedia already uses for human editors without special treatment, without exceptions, and without redefining success.
Importantly, this is not an attempt to fix AI, defend AI, or argue that Wikipedia’s standards are wrong. The goal here is diagnostic, not prescriptive.
We ask a narrower question:
What exactly fails when AI-generated claims are subjected to institutional verification rules and why?
To answer this, we do not begin with free-form generation or open-ended prompts. Instead, we constrain the problem as tightly as possible. We use evidence already curated from Wikipedia, ask an AI model to synthesize claims only from that evidence, and then evaluate those claims using deterministic, policy-aware gates that mirror Wikipedia’s own editorial regimes.
This setup removes ambiguity about truth. The evidence is already known. The task is not discovery. It is synthesis under constraint.
What follows in this act is not a story about hallucinations in the abstract. It is a careful examination of friction: the mismatch between how AI represents knowledge and how institutions decide whether knowledge is acceptable.
By the end of Act II, we will not claim to have solved this mismatch. But we will have made it visible, measurable, and impossible to ignore.
🧪 Experimental Setup: Evidence, Policies, and Gates
To understand where AI-generated claims fail under institutional verification, we needed an experimental setup that was:
- grounded in real editorial practice
- reproducible and deterministic
- capable of evolving without invalidating earlier results
This section describes the final setup we converged on and, importantly, why it took the shape it did.
📊 Dataset: FEVEROUS as a Policy Stress Test
We use the FEVEROUS dataset as our primary evaluation corpus.
FEVEROUS is derived from Wikipedia and contains:
- natural-language claims,
- structured evidence annotations (sentences, tables, and metadata),
- editorial labels such as SUPPORTS, REFUTES, and NOT ENOUGH INFO.
It is important to be explicit about what FEVEROUS is not.
FEVEROUS is not a truth oracle. It does not certify factual correctness in the abstract. Instead, it captures how claims are supported or rejected within the context of Wikipedia’s editorial process.
That makes it ideal for our purpose. We are not asking whether AI is correct in some universal sense. We are asking whether AI outputs can survive institutional verification rules that already exist.
Because the full FEVEROUS corpus is not reliably available via hosted APIs, we work from a locally downloaded snapshot. This does not affect reproducibility: all preprocessing, sampling, and evaluation steps are deterministic and documented.
🚦 Editorial Policies as Executable Constraints
Rather than treating “Wikipedia standards” as a single, vague notion, we model them explicitly as policy regimes.
In our system, each regime is implemented as an executable gate:
- Editorial: mirrors everyday human editorial judgment; tolerant of ambiguity
- Standard: enforces clearer sourcing and attribution requirements
- Strict: enforces Wikipedia’s strongest verifiability principle, rejecting claims without demonstrable primary-source provenance
These are not learned classifiers. They are deterministic, rule-based evaluators. The same input under the same policy always produces the same output.
This matters for two reasons:
- It isolates policy effects from model behavior
- It allows us to ask a precise question: What changes when the policy changes, even if the claim does not?
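A minimal sketch of what such regimes can look like as declarative configuration (the flag names are illustrative assumptions, not the gate’s actual schema):
# Sketch: editorial regimes as declarative configuration. Flag names are
# illustrative, not the actual policy schema used by the gate.
POLICY_REGIMES = {
    "wikipedia.editorial": {
        "require_primary_sources": False,
        "allow_synthesis": True,
        "tolerate_ambiguity": True,
    },
    "wikipedia.standard": {
        "require_primary_sources": False,
        "allow_synthesis": False,
        "tolerate_ambiguity": False,
    },
    "wikipedia.strict": {
        "require_primary_sources": True,   # rejects Wikipedia citing itself
        "allow_synthesis": False,
        "tolerate_ambiguity": False,
    },
}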
🔄 How the Evaluation Actually Runs
for example in feverous_samples:
    evidence = extract_evidence_text(example)
    original_claim = example["claim"]
    ai_claim = synthesize_claim(evidence)

    for policy in ["editorial", "standard", "strict"]:
        result = wikipedia_gate.verify(
            claim=ai_claim,
            context=evidence,
            policy=policy,
        )
        record_result(
            policy=policy,
            original_claim=original_claim,
            ai_claim=ai_claim,
            verdict=result.label,
            confidence=result.confidence,
        )
🚪 The Wikipedia Gate Architecture
At the core of the system is what we call the Wikipedia Gate.
Conceptually, the gate sits between an AI system and a publication environment. It receives:
- a claim,
- its associated evidence context,
- and an active editorial policy.
It then produces:
- a support verdict (supported, not supported, unclear),
- a confidence score,
- and a structured warning when a policy violation is detected (e.g., citation laundering).
Crucially, the gate does not generate text. It does not correct claims. It does not reason probabilistically. It enforces rules.
This separation is intentional. Generation and verification are different problems, and conflating them is one of the core failure modes we are examining.
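To make that contract concrete, here is a minimal sketch of the gate’s interface as just described (type and field names are illustrative, not the real implementation):
# Sketch: the gate's input/output contract as described above. Type and
# field names are illustrative; the real implementation differs.
from dataclasses import dataclass
from typing import Literal, Optional

@dataclass
class GateDecision:
    verdict: Literal["supported", "not_supported", "unclear"]
    confidence: float
    warning: Optional[str] = None  # e.g. a citation-laundering notice

def verify(claim: str, evidence_context: str, policy: str) -> GateDecision:
    """A pure function of its inputs: it generates no text, corrects nothing,
    and evaluates rules deterministically under the active policy."""
    raise NotImplementedError("rule evaluation lives behind this interface")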
🎯 Iterative Refinement (Without Moving the Goalposts)
The Wikipedia tests attached to this work reflect a gradual increase in sophistication.
We began with simple baselines:
- verifying FEVEROUS claims directly against evidence,
- validating that the gate behaved sensibly under each policy.
Only after those baselines stabilized did we introduce AI into the loop:
- first as a paraphraser,
- then as a constrained synthesizer operating only on provided evidence,
- and finally as a generator capable of introducing epistemic risk.
At each step, earlier tests were preserved and re-run. No results were discarded. No definitions were retroactively changed.
This matters because it allows us to make a strong claim:
Any failures observed later are not artifacts of a broken gate. They are consequences of introducing AI into an otherwise stable verification system.
With the experimental foundation in place, we can now examine what actually happens when AI-generated claims are subjected to institutional verification and why the results look the way they do.
🤖 The First AI Experiment: Evidence-Bound Claim Synthesis
With the gate architecture and policies in place, the first question we asked was deliberately conservative:
What happens if an AI is only allowed to speak using the evidence we give it?
No external knowledge. No retrieval. No web access. No prior context.
Just evidence in, claim out.
🧪 Experimental Design
For each FEVEROUS sample, we extracted the annotated evidence text and asked the AI to synthesize a single declarative claim that was strictly grounded in that evidence.
The model was instructed to:
- summarize or restate what the evidence supports,
- avoid speculation or extrapolation,
- remain concise and factual.
Importantly, the AI was not shown the original FEVEROUS claim. It was operating blind its task was synthesis, not reconstruction.
Each generated claim was then passed through the Wikipedia Gate under all three editorial policies:
- editorial
- standard
- strict
No tuning, retries, or post-processing were applied.
📊 Results
The results were striking.
Under editorial and standard policies, the vast majority of AI-generated claims were accepted or marked as unclear rather than rejected outright.
Under strict policy, nearly all claims were rejected.
At first glance, this looked suspicious. If AI hallucination is such a serious problem, why wasn’t it showing up here?
✅ Why These Results Are Not a Red Flag
The apparent ‘success’ of the AI in this experiment is evidence that stochasticity had not yet been allowed to matter.
It is evidence that the experiment was correctly constrained.
FEVEROUS evidence is already Wikipedia-grounded. The language is encyclopedic. The scope is narrow. When an AI is asked to restate or compress that material, it is operating in a low-risk epistemic environment.
In other words: we didn’t give the model enough room to fail.
This is not a flaw in the experiment; it is a necessary baseline.
Before we can study how AI breaks institutional rules, we need to confirm that:
- the gate does not reject valid claims arbitrarily,
- the policies behave as expected,
- and the AI does not hallucinate by default when constrained.
This experiment establishes exactly that.
⚡ The First Tension Emerges
The strict policy results, however, already reveal something important.
Even when claims are:
- synthesized directly from evidence,
- semantically faithful,
- and editorially acceptable,
they are rejected under strict policy with warnings such as:
“CITATION LAUNDERING DETECTED: wikipedia.strict requires primary sources.”
This is not a semantic failure. It is a provenance failure.
The AI is reasoning in semantic space. Wikipedia is enforcing documentary lineage.
The same claim can be acceptable or unacceptable depending solely on which policy is active.
This is the first clear signal that verifiability is policy-relative, not truth-absolute.
🎯 What This Experiment Proves and What It Does Not
This experiment does not prove that AI is reliable.
It proves something narrower and more important:
- the gate works,
- the policies are meaningful,
- and AI behavior changes only when epistemic risk is introduced.
With that baseline established, we can now do the real work.
In the next section, we deliberately push the system out of this safe zone by allowing the AI to compress, generalize, and drift.
That is where hallucination stops being theoretical and starts becoming measurable.
🎯 Introducing Epistemic Risk: When Synthesis Becomes Reasoning
The first experiment established a calm baseline: when an AI is tightly constrained to restate Wikipedia-grounded evidence, it behaves predictably and safely.
But this is not how AI systems are used in practice.
Real deployments do not ask models to merely restate source material. They ask them to:
- summarize across multiple passages,
- compress nuanced descriptions into single claims,
- generalize from examples,
- infer implications,
- and speak with confidence under ambiguity.
This is where epistemic risk enters the system.
🔄 What We Changed
To introduce risk deliberately, we changed only one thing:
we allowed the AI to talk more.
Instead of instructing the model to restate the evidence as-is, we asked it to synthesize a higher-level claim that captured the meaning of the evidence.
The constraints were relaxed in subtle but important ways:
- the AI could generalize language,
- it could merge related facts,
- it could choose emphasis and framing,
- it could introduce qualifiers implicitly rather than explicitly.
No new information sources were added. The evidence remained the same.
But the space of possible claims expanded.
✨ From Paraphrase to Synthesis
def synthesize_claim(evidence_text: str) -> str:
    # `model` is the LLM client used throughout; the prompt is relaxed from
    # strict restatement to higher-level synthesis
    prompt = f"""
    Based ONLY on the following evidence,
    generate a single Wikipedia-style claim.
    Evidence:
    {evidence_text}
    Claim:
    """
    return model.generate(prompt)
💡 Why This Matters
From a human perspective, this kind of synthesis is reasonable, even expected.
From an institutional perspective, it is dangerous.
Wikipedia’s policies are not designed to evaluate whether a statement sounds reasonable. They are designed to enforce whether a statement can be proven acceptable under documented sourcing rules.
The moment the AI moves from compression to interpretation, it begins to generate claims that are:
- semantically plausible,
- faithful in spirit,
- but increasingly difficult to justify under strict verification.
This is the precise zone where hallucination is often discussed but rarely measured.
📝 A Concrete Example
Consider the following transformation:
Evidence excerpt (Wikipedia-grounded):
“Lindenbaum–Tarski algebra appears in discussions of algebraic logic and propositional modal logic.”
AI-synthesized claim:
“Algebraic logic encompasses multiple logical systems, including Lindenbaum–Tarski algebra, which provides models for propositional modal logics.”
Nothing here is obviously false.
Yet under strict policy, this claim is rejected.
Not because it contradicts the evidence, but because it asserts structure that was never explicitly sourced as such.
The AI has crossed a boundary: from what is stated to what is implied.
⚠️ The Nature of the Risk
This is not a bug. It is a property of reasoning.
Human experts do this kind of compression constantly. Institutions tolerate it only when provenance is explicit and traceable.
AI systems, however, do not understand institutional tolerance. They operate in semantic space, not procedural space.
Once the AI begins synthesizing higher-order claims, two things happen simultaneously:
- The claim becomes more useful.
- The claim becomes harder to verify.
This tradeoff is unavoidable.
🛡️ Why We Needed the Gate
Without a policy-aware gate, these claims would look perfectly acceptable. They are fluent. They are grounded. They are almost right.
But “almost right” is exactly the category that institutional systems must reject.
By introducing epistemic risk in a controlled way, we now have a setting where:
- the AI is no longer trivially safe,
- hallucination is no longer binary,
- and policy differences begin to matter.
This is the environment the gate was designed for.
In the next section, we show how this risk can be measured: not as truth or falsity, but as geometric deviation between what the evidence supports and what the AI asserts.
That measurement is what allows the gate to act before failure becomes visible to users.
📏 Measuring Semantic Drift: From Similarity to Direction
Once epistemic risk is introduced, we need a way to answer a subtle question:
How far does an AI-generated claim move away from what the evidence actually supports?
This is not a question of truth or falsity. It is a question of directional alignment in semantic space.
⚖️ Why Similarity Alone Is Not Enough
Most evaluation systems rely on cosine similarity between embeddings:
- High similarity → “grounded”
- Low similarity → “hallucinated”
But cosine similarity is symmetric and scalar. It only tells us whether two texts are close, not whether one text introduces new semantic content that is unsupported by the other.
Two statements can be highly similar while one still asserts something extra.
That extra assertion is where epistemic risk lives.
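A toy example makes the distinction concrete. The three-dimensional vectors below are invented for illustration; real embeddings live in hundreds of dimensions, but the geometry is the same:

```python
import numpy as np

evidence = np.array([1.0, 0.0, 0.0])            # what the source supports
claim = np.array([0.95, 0.3, 0.0])              # mostly aligned, plus something extra
claim /= np.linalg.norm(claim)

cosine_similarity = float(np.dot(claim, evidence))       # ~0.95: "looks grounded"

# The part of the claim orthogonal to the evidence is what similarity hides
residual = claim - np.dot(claim, evidence) * evidence
unsupported_mass = float(np.linalg.norm(residual))       # ~0.30: the extra assertion
```

Both numbers describe the same pair of vectors; only the second one sees the unsupported content.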
📐 A Geometric View of Claims and Evidence
We treat embeddings as vectors in a shared semantic space:
- the evidence vector represents what the source material supports,
- the claim vector represents what the AI is asserting.
If the claim is fully supported by the evidence, then directionally the claim vector should lie along the evidence vector.
If the claim introduces new structure, emphasis, or implication, part of the claim vector will point away from the evidence direction.
That deviation is what we measure.
⚡ Hallucination Energy (Semantic Residual)
We compute semantic drift by decomposing the claim vector into two components:
- the portion aligned with the evidence,
- the portion orthogonal to it.
In simplified form:
# Normalize vectors
e = normalize(evidence_vector)
c = normalize(claim_vector)
# Project claim onto evidence direction
projection = dot(c, e) * e
# Residual = unsupported semantic mass
residual = c - projection
hallucination_energy = norm(residual)  # i.e. ‖residual‖
This value captures how much of the claim cannot be explained by the evidence, regardless of surface similarity.
We call this quantity hallucination energy.
💻 Computing Unsupported Semantic Mass
import numpy as np
# `embed` is the sentence-embedding function used throughout (returns a 1-D vector)
claim_vec = embed(ai_claim)
evidence_vec = embed(evidence_text)
# Unit-normalize both vectors
claim_u = claim_vec / np.linalg.norm(claim_vec)
evidence_u = evidence_vec / np.linalg.norm(evidence_vec)
# Project the claim onto the evidence direction; the residual is what the evidence cannot explain
projection = np.dot(claim_u, evidence_u) * evidence_u
residual = claim_u - projection
hallucination_energy = np.linalg.norm(residual)
🎯 What This Metric Is and Is Not
It is important to be precise about what this measurement represents.
Hallucination energy is:
- a directional deviation metric,
- a measure of unsupported semantic mass,
- a continuous signal, not a binary judgment,
- independent of policy.
Hallucination energy is not:
- a truth detector,
- a fact-checker,
- a replacement for citation verification,
- a claim about intent or correctness.
A claim can have:
- low hallucination energy and still be rejected (policy violation),
- higher hallucination energy and still be acceptable (editorial tolerance).
That distinction is intentional.
✅ Why This Works in Practice
When AI systems are constrained to restate evidence, hallucination energy remains low.
As synthesis becomes more ambitious, introducing qualifiers, structure, or generalization, the residual grows.
This gives us a way to:
- detect early semantic drift,
- compare synthesis strategies,
- reason about risk before policy failure occurs.
Importantly, this signal is model-agnostic and policy-agnostic. It does not know what Wikipedia allows. It only measures what the evidence supports.
🚪 From Measurement to Gating
Hallucination energy does not decide outcomes by itself.
Instead, it provides context to the policy gate:
- low energy + rejection → provenance problem,
- high energy + rejection → semantic overreach,
- low energy + acceptance → safe compression,
- high energy + acceptance → editorial tolerance.
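A small helper makes these four regimes explicit. The 0.5 split between "low" and "high" energy is an arbitrary illustration; in practice it would be calibrated per policy:

```python
def interpret(hallucination_energy: float, accepted: bool, high: float = 0.5) -> str:
    """Label a (measurement, policy verdict) pair with its diagnostic meaning."""
    drifted = hallucination_energy > high
    if accepted:
        return "editorial tolerance" if drifted else "safe compression"
    return "semantic overreach" if drifted else "provenance problem"
```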
This separation is crucial.
It allows us to distinguish:
- reasoning quality,
- institutional alignment,
- and policy enforcement.
In the next section, we show how these measurements interact with Wikipedia’s editorial, standard, and strict policies, and why the same claim can receive radically different verdicts without changing a single word.
That divergence is not a failure of AI. It is the point of the experiment.
🎞️ Act III: Policy as a Harness for a Superpower
In Act I, we made verifiability explicit and executable. In Act II, we introduced stochastic generation under controlled conditions.
Now we bring them together.
This act is not about fixing AI. It is about making stochastic intelligence usable in environments that demand hard guarantees.
⚡ Stochasticity Is the Asset, Not the Bug
Large language models are stochastic by design. They explore a space of possible expressions rather than executing a single deterministic path.
This behavior is often framed as a flaw.
We take the opposite position.
Stochasticity is the core capability of modern AI.
It enables:
- synthesis across fragmented sources
- reframing of ideas
- discovery of unexpected connections
- compression of large evidence sets into human-readable form
Remove that stochasticity and you do not get a safer system; you get a weaker one.
The real problem is not hallucination.
The problem is unbounded stochasticity inside systems that require enforceable rules.
Wikipedia, finance, medicine, law, and production software do not operate on plausibility. They operate on:
- provenance
- traceability
- justification under policy
This is not a modeling problem. It is a systems integration problem.
🎯 The Shift: From Suppression to Containment
Most attempts to improve AI reliability focus on changing the model:
- stricter prompting
- additional fine-tuning
- reinforcement learning
- AI judging AI
All of these try to reduce stochasticity.
This work takes a different approach.
We leave the model untouched.
Instead, we introduce a deterministic layer that governs where stochasticity is allowed to operate.
The model generates. The system decides.
This separation is not novel; it is how reliable software has always been built.
📊 Hallucination Energy Becomes a Control Surface
In Act II, we introduced hallucination energy as a measurement of semantic drift between a generated claim and its supporting evidence.
Importantly:
- hallucination energy does not decide truth
- it does not enforce policy
- it does not reject output on its own
On its own, it is just a signal.
Its value emerges only when combined with policy.
This is the key move.
At this point, stochastic generation is still either “allowed” or “rejected.” That is safe, but it is not improvable.
Without measurement, stochastic behavior cannot be tuned.
🚧 The Deterministic Boundary
Instead of trying to make AI less stochastic or institutions more tolerant, we let each remain exactly what it is and insert a deterministic boundary between them.
On one side of the boundary:
- the AI is free to generate,
- explore,
- synthesize,
- and use its stochastic superpower fully.
On the other side:
- acceptance is binary,
- rules are explicit,
- and outcomes are reproducible.
This boundary is not a model. It is not a prompt. It is not a learned classifier.
It is executable policy.
In our system, this boundary takes the form of policy-aware gates that evaluate AI outputs against institutional rules after generation, not during it. The gate does not care how confident the model is, how fluent the claim sounds, or how plausible it appears.
It asks a narrower, enforceable question:
Is this output admissible under the active policy?
Because the gate is deterministic, the same input under the same policy always produces the same result. Change the policy, and the outcome can change even if the claim does not.
This is not inconsistency. It is control.
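A minimal sketch shows what "deterministic" means here. The inputs are already-measured facts about a claim (its hallucination energy and whether a primary source was found), and the policy fields and threshold values are hypothetical; the point is that the gate is a pure function of its inputs:

```python
def gate(energy: float, has_primary_source: bool, policy: dict) -> str:
    """No sampling, no model calls, no hidden state: same inputs, same verdict."""
    if policy["requires_primary_sources"] and not has_primary_source:
        return "reject: provenance"
    if energy > policy["max_energy"]:
        return "reject: semantic overreach"
    return "accept"

# The same measured claim (energy 0.30, secondary sources only) under two policies:
gate(0.30, False, {"requires_primary_sources": False, "max_energy": 0.60})  # -> "accept"
gate(0.30, False, {"requires_primary_sources": True,  "max_energy": 0.20})  # -> "reject: provenance"
```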
By separating generation from acceptance, we gain several properties that model-centric approaches cannot provide:
- Hard cut-offs instead of soft guidance
- Auditability instead of opacity
- Explainable rejection instead of silent failure
- Policy tuning without retraining models
Most importantly, we restore human authority.
Editors do not need to trust the model. They only need to trust the policy.
Once acceptance is decided by software rather than probability, the role of the AI becomes clear: it is a generator, not an arbiter. Creativity remains intact, but publication is governed by rules that can be inspected, debated, and revised without touching the model at all.
This is the point where the AI debate shifts from philosophy to systems engineering.
The question is no longer “Can we trust AI?” It becomes “Under which policies is this output allowed?”
And that is a question software can answer deterministically.
🥇 Policy Does Not Compete with AI—It Overrides It
The most important result of this work is not that hallucination energy can be measured.
It is this:
Policy can override semantic similarity every time, deliberately and deterministically.
In our synthesis experiments, we observe all four possible regimes:
| Hallucination Energy | Policy Verdict | Interpretation |
|---|---|---|
| Low | Accepted | Safe compression |
| Low | Rejected | Provenance failure |
| High | Accepted | Editorial tolerance |
| High | Rejected | Semantic overreach |
This table matters more than any single score.
It shows that hallucination energy informs the system, but policy remains the final authority.
This is exactly how high-trust institutions operate today.
🧭 From Rejection to Guidance
Here is the crucial improvement that Act III introduces.
Because hallucination energy is continuous, and policy is discrete, the system can do more than reject output.
It can steer generation.
A practical system can implement rules like:
if hallucination_energy > policy.max_energy:
    reject("Semantic overreach")
elif hallucination_energy > policy.review_energy:
    flag("Requires human review")
else:
    accept()
Now stochasticity becomes adjustable.
- During exploration → higher energy tolerated
- During drafting → moderate energy flagged
- During publication → strict thresholds enforced
The same model. The same prompt. Different policy.
This is how AI performance improves without retraining.
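One way to express those stages is as a handful of policy objects with different thresholds. The numbers below are placeholders, not calibrated values:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Policy:
    review_energy: float   # above this, flag for human review
    max_energy: float      # above this, reject outright

EXPLORATION = Policy(review_energy=0.50, max_energy=0.80)   # tolerate drift, log it
DRAFTING    = Policy(review_energy=0.25, max_energy=0.50)   # flag moderate drift
PUBLICATION = Policy(review_energy=0.05, max_energy=0.15)   # enforce strict bounds
```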
🏢 What This Looks Like in a Real Organization
Consider a financial institution using AI for research and reporting.
They do not want:
- a deterministic model
- or a model that never hallucinates
They want:
- creativity during analysis
- guarantees before disclosure
A policy-first architecture looks like this:
Stochastic Generation
↓
Semantic Diagnostics
(hallucination energy, similarity)
↓
Policy Gate
(provenance, thresholds, rules)
↓
Accept | Review | Reject
Most generated outputs will never be published.
That is not failure.
That is discipline.
🧪 Why Wikipedia Is the Right Stress Test
Wikipedia is explicit about its rules and already has a strong policy guideline we can leverage.
What Wikipedia exposes is a reality every high-trust domain already lives with:
- Finance: auditability and disclosure
- Medicine: evidence and liability
- Law: admissibility and precedent
- Software: tests, invariants, contracts
AI has struggled in these environments not because it is inaccurate, but because it is ungoverned.
This work shows that governance is not philosophical.
It is engineering.
💎 The Core Contribution of Act III
We do not claim to eliminate hallucination.
We claim something narrower and more powerful:
Stochastic generation becomes usable when bounded by deterministic policy.
Truth remains hard. Compliance is not.
🚀 Where This Leads
This post demonstrates the pattern using Wikipedia as a deliberately unforgiving test case.
The same structure applies anywhere rules exist and failure matters.
Future work will:
- formalize energy-policy calibration
- explore adaptive thresholds
- measure system-level guarantees
For now, the conclusion is simple:
AI does not need to become more cautious. Our systems need to become more disciplined.
🔄 Policy Is Not a Mode—It Is an Attribute
Up to now, we have described policy using three named regimes: editorial, standard, and strict.
These are useful illustrations but they are not the right mental model.
A better analogy comes from security engineering.
In secure systems, we do not say:
“This application runs in admin mode.”
Instead, we say:
- this action is allowed for all users
- this action requires elevated privileges
- this action is restricted to audited roles
Policy is applied per action, not per system.
AI governance should work the same way.
🎯 From Fixed Policies to Policy-Per-Action
The mistake many AI systems make is treating policy as a global setting:
“This model is safe.” “This deployment is restricted.” “This output is allowed.”
That framing does not scale.
Real organizations do not operate that way.
They perform many different kinds of actions, each with different risk profiles.
Examples:
- exploratory analysis
- internal research notes
- draft summaries
- client-facing reports
- regulated disclosures
- fiduciary decisions
Each of these actions tolerates a different amount of stochasticity.
The correct model is not three fixed policies.
The correct model is many policies, bound to actions.
⚡ Hallucination Energy Enables Action-Scoped Policy
This is where hallucination energy becomes more than a diagnostic.
Because hallucination energy is continuous, it allows policy to be parameterized.
Instead of asking:
“Is this output acceptable?”
We ask:
“Is this output acceptable for this action?”
Conceptually:
policy = policy_for(action)
if hallucination_energy > policy.max_energy:
    reject()
elif hallucination_energy > policy.review_energy:
    require_human_review()
else:
    accept()
The same AI output can be:
- acceptable for brainstorming
- questionable for internal drafts
- forbidden for external publication
Nothing about the model changes.
Only the action-bound policy does.
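Concretely, `policy_for` can be nothing more than a lookup from action names to policy objects. The action names and thresholds below are hypothetical:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ActionPolicy:
    review_energy: float
    max_energy: float

ACTION_POLICIES = {
    "brainstorming":     ActionPolicy(review_energy=0.60, max_energy=0.90),
    "internal_draft":    ActionPolicy(review_energy=0.30, max_energy=0.55),
    "client_report":     ActionPolicy(review_energy=0.10, max_energy=0.25),
    "regulatory_filing": ActionPolicy(review_energy=0.00, max_energy=0.05),
}

def policy_for(action: str) -> ActionPolicy:
    # Unknown actions fall back to the most restrictive policy.
    return ACTION_POLICIES.get(action, ACTION_POLICIES["regulatory_filing"])
```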
💰 A Concrete Example: Finance
Consider a financial institution using AI across its workflow.
| Action | Allowed Hallucination | Policy Behavior |
|---|---|---|
| Market exploration | High | Accept with logging |
| Internal analysis | Medium | Flag for review |
| Client communication | Low | Strict provenance |
| Regulatory filing | Near zero | Deterministic rejection |
This is not hypothetical.
This is how risk is already managed in mature systems.
AI simply lacked the enforcement layer.
💡 Why This Matters
This reframing resolves a false dichotomy that dominates AI discourse:
- “AI must be creative” vs “AI must be safe”
Both are true for different actions.
By treating policy as an attribute of what the AI is being asked to do, rather than what the AI is, we get:
- maximal utility during exploration
- maximal safety during execution
- zero need for model retraining
- zero need for prompt contortions
This is not AI alignment by persuasion.
It is AI alignment by authorization.
🧩 The Deeper Pattern
Once you see this, the broader pattern becomes clear:
- Stochasticity is the engine
- Hallucination energy is the gauge
- Policy is the circuit breaker
- Action defines the risk envelope
That combination is sufficient to make AI usable in environments that previously had to ban it outright.
🧪 Why We Started with Wikipedia
Wikipedia is an extreme case.
It enforces one of the strictest editorial policies in existence.
That makes it a perfect stress test: not because most systems are that strict, but because if the pattern works there, it works anywhere.
The takeaway is not “use strict policy everywhere.”
The takeaway is:
Use the right policy for the right action and enforce it deterministically.
🎯 Policy as a Harness for Stochastic Generation
Figure: Stochastic generation bounded by action-specific policy. Hallucination energy informs decisions, but policy deterministically governs acceptance.
flowchart TD
%% --- INPUT SECTION ---
A["🧑💻 User / Task Request<br/>📋 What needs to be done?"]
%% --- ACTION CLASSIFICATION ---
A --> B{"🎯 Action Classifier<br/>What type of request is this?"}
%% --- POLICY SELECTION ---
B -->|🌱 exploratory| P1["Policy: exploratory.open<br/>🛡️ High creativity tolerance"]
B -->|🔍 analysis| P2["Policy: analysis.standard<br/>🛡️ Balanced approach"]
B -->|🏢 client| P3["Policy: client.strict<br/>🛡️ Low risk tolerance"]
B -->|⚖️ regulatory| P4["Policy: regulatory.hard<br/>🛡️ Zero tolerance"]
%% --- AI GENERATION ZONE ---
subgraph AI["⚡ AI Generation Layer<br/>(Stochastic • Creative • Unbounded)"]
G["🤖 LLM Generates Response<br/>✨ Stochastic output ✨<br/>🎲 Probability-driven"]
end
%% CONNECT POLICIES TO AI
P1 -.-> G
P2 -.-> G
P3 -.-> G
P4 -.-> G
%% --- SEMANTIC DIAGNOSTICS ---
G --> D["📊 Semantic Diagnostics<br/>📏 Similarity Score<br/>⚡ Hallucination Energy<br/>🎯 Confidence Level"]
%% --- POLICY GATES (THE CRITICAL BOUNDARY) ---
subgraph GATES["🚨 Policy Enforcement"]
D -->|evaluates against| Gate1{"exploratory.open"}
D -->|evaluates against| Gate2{"analysis.standard"}
D -->|evaluates against| Gate3{"client.strict"}
D -->|evaluates against| Gate4{"regulatory.hard"}
end
%% --- OUTCOMES ---
Gate1 -->|✅ High tolerance| O1["🟢 ACCEPT<br/>✨ Log only"]
Gate2 -->|⚠️ Medium tolerance| O2["🟡 ACCEPT + REVIEW<br/>📌 Flag for human check"]
Gate3 -->|🚫 Low tolerance| O3["🔴 REJECT<br/>📋 Policy violation logged"]
Gate4 -->|⛔ Near-zero tolerance| O4["🛑 REJECT + AUDIT<br/>📂 Full audit trail created"]
%% --- COLOR SCHEME ---
classDef input fill:#e3f2fd,stroke:#1565c0,stroke-width:3px,color:#0d47a1
classDef classifier fill:#e8f5e8,stroke:#2e7d32,stroke-width:2px,color:#1b5e20
classDef policy fill:#fff3e0,stroke:#f57c00,stroke-width:2px,color:#e65100
classDef aiZone fill:#f3e5f5,stroke:#7b1fa2,stroke-width:3px,color:#4a148c
classDef diagnostics fill:#e8eaf6,stroke:#3f51b5,stroke-width:2px,color:#283593
classDef gates fill:#ffebee,stroke:#d32f2f,stroke-width:3px,color:#b71c1c
classDef accept fill:#e8f5e8,stroke:#388e3c,stroke-width:2px,color:#1b5e20
classDef review fill:#fff3e0,stroke:#ff8f00,stroke-width:2px,color:#e65100
classDef reject fill:#ffebee,stroke:#d32f2f,stroke-width:2px,color:#b71c1c
classDef audit fill:#fce4ec,stroke:#c2185b,stroke-width:2px,color:#880e4f
%% Apply styles
class A input
class B classifier
class P1,P2,P3,P4 policy
class AI,G aiZone
class D diagnostics
class GATES,Gate1,Gate2,Gate3,Gate4 gates
class O1 accept
class O2 review
class O3 reject
class O4 audit
%% Add note about the core concept
note["💡 CORE CONCEPT:<br/>AI generates freely • Policy decides what's acceptable<br/>Stochastic creativity ↔ Deterministic rules"]
style note fill:#e1f5fe,stroke:#0288d1,stroke-width:2px,color:#01579b
The key idea is that policy is bound to the action, not the model. The same stochastic AI output flows through different policy gates depending on what the system is being asked to do.
Hallucination energy and similarity are measured once, but their interpretation depends entirely on policy. In low-risk actions, stochasticity is tolerated. In high-risk actions, the same signal deterministically blocks output.
🛡️ Example: Policy Per Action (Security Model Analogy)
| Action Type | Allowed Stochasticity | Tolerated Hallucination Energy | Policy Strictness | Outcome |
|---|---|---|---|---|
| Brainstorming | High | High | Low | Always accepted |
| Research Notes | Medium | Medium | Medium | Flag for review |
| Draft Report | Low | Low | High | Mostly rejected |
| Client-Facing Output | Very Low | Very Low | Strict | Binary pass/fail |
| Regulatory Filing | None | ~0 | Max | Reject by default |
This is not model tuning. It is policy selection, the same way access control works in security systems.
🍿 Act IV: Learning From the Boundary (Exploratory)
Up to this point, everything in this post has been deliberately conservative.
We have:
- Made policy explicit and executable
- Shown that policy can deterministically override otherwise acceptable claims
- Demonstrated that this works without modifying the model
At this stage, we already have a usable system.
But as AI developers, we would be disingenuous to stop here.
Because once a policy is enforced, it creates something extremely valuable:
A learning signal.
This final act explores how that signal could be used carefully without weakening the guarantees established earlier.
📚 Policy Enforcement Creates Information
Every policy decision produces structured feedback:
- accepted vs rejected
- reason codes (provenance failure, overreach, circular sourcing)
- hallucination energy at time of rejection
This feedback is not opinionated. It is not probabilistic. It is the result of a deterministic rule system.
In traditional software engineering, signals like this are gold.
Ignoring them would be wasteful.
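In practice, each gate decision can be logged as a small structured record. The field names below are illustrative, chosen to mirror the bullets above:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)
class PolicyDecision:
    claim_id: str
    policy: str                   # e.g. "wikipedia.strict"
    accepted: bool
    reason_code: str | None       # e.g. "PROVENANCE_FAILURE"; None when accepted
    hallucination_energy: float   # measured at decision time
    decided_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
```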
🚫 What We Are Not Proposing
Let’s be explicit.
We are not proposing:
- replacing policy with a learned model
- letting AI decide what policy should be
- weakening editorial constraints
- “AI judging AI” in the abstract sense
Policy remains authoritative. Policy remains external. Policy always wins.
✅ What We Are Proposing
Once a policy is defined and enforced, the system can observe how stochastic generation interacts with that policy over time.
This enables a narrow, well-scoped use of learning:
Learning how much stochasticity a given policy will tolerate.
Not truth. Not correctness. Not editorial judgment.
Just tolerance.
📊 Hallucination Energy as a Feedback Signal
Earlier, we introduced hallucination energy as a diagnostic:
- continuous, not binary
- independent of policy
- descriptive, not normative
On its own, it does nothing.
But paired with policy outcomes, it becomes informative.
Over many evaluations, the system can observe patterns like:
- claims rejected above a certain energy threshold
- claims accepted below it
- policy-specific tolerance bands
This enables calibration, not autonomy.
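As a sketch, a tolerance estimate for one policy can be as crude as a percentile over the energies of its historically accepted claims. The percentile choice is an assumption; the estimate is observational and never changes what the gate will accept:

```python
import numpy as np

def estimate_tolerance(energies: list[float], accepted: list[bool], pct: float = 95.0) -> float:
    """Energy level below which almost all accepted claims have fallen for this policy."""
    passed = [e for e, ok in zip(energies, accepted) if ok]
    if not passed:
        return 0.0   # nothing has ever passed: assume zero tolerance
    return float(np.percentile(passed, pct))
```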
🧪 A Concrete (and Safe) Use Case
Consider the simplest possible application:
if hallucination_energy > policy.estimated_tolerance:
    regenerate(
        warning="Previous output exceeded policy tolerance"
    )
Nothing passes automatically. Nothing is overridden. No rule is bypassed.
The system simply avoids producing outputs that it already knows will fail.
This improves efficiency, not permissiveness.
🏛️ Why This Does Not Undermine Wikipedia’s Position
From Wikipedia’s perspective, this section changes nothing.
The final arbiter is still policy. The acceptance criteria are unchanged. The rules are enforced exactly as written.
If anything, this approach reduces pressure on reviewers by preventing doomed outputs from being generated in the first place.
The policy boundary remains intact.
👩💻 Why This Matters for AI Developers
In many real systems:
- policies are strict
- AI is exploratory
- rejection rates are high
- iteration is expensive
Learning from policy outcomes allows systems to:
- reduce wasted generations
- surface warnings earlier
- adapt behavior per task without retraining
- preserve creativity where allowed
- tighten constraints where required
This is how stochastic systems become operational.
🔍 A Useful Analogy: Error Detection, Not Correction
In communications systems, error-detection codes do not decide meaning.
They detect when data exceeds tolerance and trigger retransmission.
That is the role hallucination energy can play here.
It is not a judge. It is a sensor.
🎯 Why We Include This Section
This post is about engineering, not ideology.
Once we have:
- executable policy
- deterministic enforcement
- measurable deviation
we have everything needed to close a feedback loop.
Not using that signal would be poor system design.
This section does not claim success. It does not claim safety. It does not claim convergence.
It simply acknowledges the obvious next step and stops there.
🔄 Closing the Loop
At its core, this work makes one argument:
AI does not fail because it is stochastic. AI fails because we deploy it without boundaries.
Once boundaries exist, learning becomes possible.
But boundaries must come first.
📊 Policy-Bounded Learning Loop (Mermaid)
Figure: Policy-Bounded Learning Loop. Stochastic generation is preserved. Policy enforcement remains deterministic and final. Hallucination energy is used only as an observational signal to reduce future policy violations, never to override policy decisions.
flowchart TD
A["Stochastic AI Generation"]:::stochastic
B["Generated Claim + Evidence"]
C["Semantic Diagnostics<br/>(Hallucination Energy)"]:::metric
D["Policy Gate<br/>(Executable Rules)"]:::policy
E["Accepted Output"]:::accept
F["Deterministic Rejection<br/>+ Reason Codes"]:::reject
G["Learning Signal<br/>(Energy vs Policy Outcome)"]:::signal
H["Generation Tuning<br/>(Pre-Policy Adjustment)"]:::tuning
A --> B
B --> C
C --> D
D -->|Pass| E
D -->|Fail| F
F --> G
G --> H
H --> A
%% Styling
classDef stochastic fill:#e3f2fd,stroke:#1e88e5,stroke-width:1px;
classDef policy fill:#fff3e0,stroke:#fb8c00,stroke-width:1px;
classDef metric fill:#f3e5f5,stroke:#8e24aa,stroke-width:1px;
classDef signal fill:#e8f5e9,stroke:#43a047,stroke-width:1px;
classDef tuning fill:#ede7f6,stroke:#5e35b1,stroke-width:1px;
classDef accept fill:#e0f2f1,stroke:#00897b,stroke-width:1px;
classDef reject fill:#ffebee,stroke:#c62828,stroke-width:1px;
🧭 Conclusion: Measurement Before Understanding
Historically, we rarely wait to fully understand a phenomenon before learning how to control it.
We did not need a complete theory of electricity to build power grids. We learned to measure voltage, current, and resistance. Those measurements were projections of something we didn’t fully understand, and they were enough.
Once measurement existed, control followed. Once control existed, tuning followed. Once tuning existed, improvement compounded.
AI is at the same stage.
🤖 We Don’t Need to “Understand” AI to Control It
Large language models are opaque, stochastic systems. We do not have a complete theory of how they reason, generalize, or synthesize.
But that is not unusual.
What matters is not perfect understanding, but reliable measurement surfaces.
This work shows that we can measure AI behavior sideways:
- not by asking “is it true?”
- but by asking “how far did it move beyond what evidence strictly supports?”
- and “does this output violate an explicit policy boundary?”
That is enough to build control.
⚖️ Policy Forces the Hard Decisions
Executable policy does something crucial that informal guidelines never do:
It forces us to decide what is acceptable, unacceptable, and non-negotiable.
Once policy is explicit:
- some behaviors are allowed,
- some require review,
- some trigger immediate rejection.
There is no ambiguity.
This is the hard binary every high-trust system relies on: the equivalent of circuit breakers, type errors, or safety interlocks.
Policy answers questions AI cannot:
- May this be published?
- Is this provenance sufficient?
- Is this class of synthesis allowed at all?
Those decisions must exist outside the model.
📶 Continuous Signal Inside a Binary Boundary
Binary policy alone is not enough to improve systems over time.
That is where measurement enters.
Hallucination energy provides a continuous signal that tracks how far stochastic generation drifts beyond evidence support, without deciding truth and without overriding policy.
This gives us two orthogonal axes:
- Policy → hard acceptance / rejection
- Measurement → how close or far the output was from the boundary
This is the same structure used everywhere else in engineering:
- hard limits + continuous feedback
- invariants + tunable parameters
- shutdown conditions + optimization signals
🚀 Why This Enables Scaling
Once both pieces exist, the system becomes improvable in a disciplined way.
As AI operates:
- outputs fall inside or outside policy,
- hallucination energy rises or falls,
- failures become categorized rather than mysterious,
- tuning becomes empirical rather than speculative.
At that point, improvement is no longer about intuition or trust.
It becomes a control problem.
And control problems scale.
🏆 The Core Takeaway
Without an explicit policy gate, hallucination cannot be meaningfully measured. Without measurement, stochastic behavior cannot be tuned. Without deterministic acceptance, trust is irrelevant.
This work does not claim to solve AI reliability.
It establishes something more foundational:
Stochastic systems become scalable only once they are measurable and bounded, even when they are not fully understood.
Policy provides the boundary. Measurement provides the signal.
Together, they turn stochastic generation from a liability into a controllable asset.
That is how every other complex system we rely on became usable, and AI is no exception.
AI reliability does not emerge from better answers.
It emerges from enforceable boundaries and measurable deviation.
📚 Glossary
| Term | Definition | Context / Example |
|---|---|---|
| Verifiability Gate | A deterministic policy enforcement mechanism that evaluates AI-generated claims against explicit editorial rules before acceptance. | Rejects a factually correct claim about Neil Armstrong if it cites Wikipedia itself under wikipedia.strict policy (citation laundering). |
| Hallucination Energy | A continuous metric measuring semantic drift—the portion of a claim’s embedding vector orthogonal to its supporting evidence vector. Quantifies unsupported semantic mass without asserting falsity or correctness. | Computed as ‖residual‖ = ‖claim_vector − projection_onto_evidence‖; high values indicate synthesis beyond evidence boundaries. |
| Stochastic Generation | The inherent probabilistic nature of LLMs that enables exploration, synthesis, and creative reframing—treated as a capability to harness, not a defect to eliminate. | Enables AI to compress multi-sentence evidence into concise claims, but introduces epistemic risk when unbounded. |
| Deterministic Acceptance | Binary, rule-based evaluation of outputs after generation, in which policy decisions override all model confidence, fluency, or similarity signals. | The gate accepts, rejects, or flags for review—never “trusts” confidence scores. |
| Semantic Drift / Overreach | When an AI-generated claim asserts structure, relationships, or implications not explicitly supported by source evidence—even if semantically plausible. | Evidence: “Lindenbaum–Tarski algebra appears in discussions of algebraic logic” → Claim: “Algebraic logic encompasses Lindenbaum–Tarski algebra.” |
| Citation Laundering | Violation occurring when secondary or tertiary sources are used to support claims requiring primary-source provenance under strict policy. | Wikipedia citing itself to establish biographical facts fails wikipedia.strict even when accurate. |
| Policy Regimes | Executable editorial constraints with increasing stringency: editorial (human-like tolerance), standard (clear sourcing), strict (primary sources only). | Same claim passes under editorial but fails under strict due to provenance—not factual error. |
| Epistemic Risk | Risk introduced when AI moves from restating evidence to synthesizing higher-order claims (generalization, implication, framing). | Paraphrasing evidence has low risk; inferring unstated relationships introduces measurable risk. |
| Evidence-Bound Synthesis | Constrained generation where AI produces claims using only provided evidence text, with no external knowledge or retrieval. | Baseline showing high pass rates under relaxed policies—verifying gate correctness before introducing risk. |
| Action-Scoped Policy | Binding policy constraints to specific actions or tasks rather than treating policy as a global system mode. | Same output allowed for “brainstorming” but rejected for “regulatory filing.” |
| Semantic Residual | The orthogonal component of a claim’s embedding vector after projection onto the evidence vector—the mathematical basis of hallucination energy. | residual = claim_normalized − (dot(claim, evidence) × evidence_normalized) |
| Policy-Bounded Learning | Using deterministic policy outcomes as a feedback signal to reduce wasted or non-compliant generations, without modifying or relaxing policy constraints. | Regenerating when hallucination_energy > policy.tolerance—improves efficiency, not permissiveness. |
| Acceptance Boundary | The deterministic decision surface defined by policy that separates admissible outputs from rejected ones. | Identical claims may fall on different sides of the boundary under different policies. |
| Verity | The cognitive operating system implementing this architecture: stochastic generation upstream, deterministic policy enforcement downstream. | AI generates possibilities; software decides what passes. |
| FEVEROUS Dataset | Wikipedia-grounded fact verification corpus with human-annotated claims and explicit evidence references. | Used as a policy stress test—not a truth oracle. |
| Provenance Failure | Rejection due to insufficient source lineage, even when semantic content is accurate. | Low hallucination energy + rejection = provenance issue; high energy + rejection = overreach. |
| Measurement Before Understanding | Engineering principle that reliable control is possible using measurable projections of a system before full theoretical understanding exists. | Voltage precedes a full theory of electricity; hallucination energy precedes a full theory of AI reasoning. |
📖 References & Context
This work builds on three existing strands already operational in high-trust environments:
- Institutional policy as the boundary: Wikipedia’s explicit editorial rules and documented experience rejecting AI not for inaccuracy, but for unverifiability under institutional constraints
- Evidence-grounded verification: datasets that capture human-verified evidence relationships, not abstract truth
- Measurement before understanding: the engineering precedent that reliable systems emerge from measurable boundaries, not perfect models
The contribution here is architectural, not algorithmic: treating stochastic generation as an upstream capability and enforcing reliability downstream through executable policy.
🔖 Core References
Institutional Verifiability
1. Wikipedia. Wikipedia: Verifiability. https://en.wikipedia.org/wiki/Wikipedia:Verifiability
“Verifiability, not truth, is required.” Canonical statement that institutional trust demands procedural justification, not plausibility.
2. Wikipedia. Wikipedia: Reliable Sources. https://en.wikipedia.org/wiki/Wikipedia:Reliable_sources
Defines primary vs. secondary sourcing requirements: the provenance boundary that rejects otherwise-accurate claims.
3. Wiki Education. Wikipedia and Generative AI Editing: What We Learned in 2025 (2026). https://wikiedu.org/blog/2026/01/29/generative-ai-and-wikipedia-editing-what-we-learned-in-2025/
Institutional evidence that AI failure modes in editorial environments are procedural (unverifiable synthesis), not merely factual.
Evidence-Grounded Verification
4. Aly et al. FEVEROUS: Fact Extraction and VERification Over Unstructured and Structured Evidence (2021). https://fever.ai/dataset/feverous.html
Wikipedia-native evidence annotations, used here as a policy stress-test substrate, not a truth oracle.
5. Thorne et al. FEVER: A Large-Scale Dataset for Fact Extraction and VERification (NAACL 2018). https://aclanthology.org/N18-1074/
Establishes the lineage of evidence-bound verification, separate from model performance.
Semantic Drift as Measurable Signal
6. Ji et al. Survey of Hallucination in Natural Language Generation (ACM Computing Surveys 2023). https://arxiv.org/abs/2202.03629
Documents the field’s lack of an operational definition for “hallucination,” creating space for continuous diagnostics over binary judgments.
7. Maynez et al. On Faithfulness and Factuality in Abstractive Summarization (ACL 2020). https://aclanthology.org/2020.acl-main.173/
Early distinction between surface similarity and semantic faithfulness, a precursor to directional drift measurement.
Engineering Precedent
8. Meyer, B. Object-Oriented Software Construction (1997), Chapter 11: Design by Contract.
Executable acceptance criteria independent of implementation, a direct parallel to policy gates.
9. Lord Kelvin (William Thomson). Electrical Units of Measurement (1883).
“When you can measure what you are speaking about… you know something about it.”
Engineering principle: control emerges from measurement, not perfect understanding.
📑 Appendix 1.
In early 2026, Wikipedia publicly documented its experience with generative AI–assisted editing and the decision to significantly restrict its use. The post, “Wikipedia and Generative AI Editing: What We Learned in 2025”, describes a year of experimentation that led to a clear conclusion:
AI-generated contributions consistently failed to meet Wikipedia’s standards for verifiability, sourcing, and editorial reliability. As a result, Wikipedia chose to ban or heavily limit generative AI content in its editing workflows.
Reading that article prompted this blog post.