Warranted Search: When AI Must Prove Before It Looks
TL;DR
Modern AI systems often retrieve nearby text and generate confident answers, but that is not the same as proof. A citation can be real, the answer can be correct, and the evidence can still fail to support the claim.
This post argues for warranted search: a scoped, claim-driven form of intelligent grep. Instead of asking an AI to rummage through a corpus, we give it a specific claim, a bounded search warrant, and a limited set of safe operations.
The architecture combines three ideas:
| Layer | Source | Role |
|---|---|---|
| Executable search | GrepSeek | Search the corpus through auditable operations |
| Extractive evidence | ACL-Verbatim | Return exact spans instead of generated support |
| Verified attribution | CiteVQA | Check whether the cited evidence actually carries the claim |
Together, they form an Evidence Engine: a system that searches with a warrant, extracts verbatim evidence, verifies attribution, flags overclaims, and stores the trace so future searches can improve.
In short: not just AI, not just grep — intelligent grep that can prove what it writes.
Abstract
Modern AI systems do not really search. They retrieve.
They issue a query against an index, pull back semantically related chunks, and generate prose around whatever comes back. This often looks grounded, but grounding by appearance is not the same as proof. The answer can be correct while the citation is wrong. The source can be real while the claim overreaches. The model can reproduce the visual form of scholarship without establishing evidentiary support.
Three recent papers point toward a better architecture.
GrepSeek shows how an agent can interact directly with a corpus through executable search operations. ACL-Verbatim shows how hallucination risk can be reduced by returning verbatim evidence spans rather than generated support. CiteVQA shows why answer correctness is insufficient: attribution itself must be verified.
This post synthesizes those ideas into a single architecture for evidence-first AI writing.
I call this architecture the Evidence Engine.
How the Evidence Engine Works
The diagram below shows the core pipeline of the Evidence Engine. A claim is first wrapped in a warrant (a permissions envelope). Executable search (grep-like, scoped operations) then locates the source region. The system extracts verbatim evidence (exact source spans) and runs an attribution check (does this span truly support the claim?). Finally, a policy router decides to accept, repair, refute, or abstain, turning writing into a verifiable, auditable evidence procedure.
flowchart LR
classDef claim fill:#f2f2f2,stroke:#999,stroke-width:2px,color:#333
classDef warrant fill:#fff4cc,stroke:#b38f00,stroke-width:2px,color:#b38f00
classDef search fill:#d9f2d9,stroke:#2d7d2d,stroke-width:2px,color:#2d7d2d
classDef evidence fill:#cce6ff,stroke:#005fa3,stroke-width:2px,color:#005fa3
classDef verify fill:#ffe6cc,stroke:#d97a00,stroke-width:2px,color:#d97a00
classDef policy fill:#f2e6ff,stroke:#6b2d99,stroke-width:2px,color:#6b2d99
A("✍️ Claim") --> B("🛡️ Warranted Search<br/><i>permissions envelope</i>")
B --> C("⚡ Executable Search<br/><i>grep-like operations</i>")
C --> D("✂️ Verbatim Evidence<br/><i>exact source span</i>")
D --> E("🔬 Attribution Check<br/><i>CiteVQA-style test</i>")
E --> F("🧭 Accept / Repair / Refute / Abstain<br/><i>policy router</i>")
class A claim
class B warrant
class C search
class D evidence
class E verify
class F policy
1. The Research Problem: Correct Answer, Wrong Evidence
The problem this post addresses is not hypothetical.
In May 2026, the CiteVQA paper named and measured a failure mode that many people working with AI systems have already seen in practice: Attribution Hallucination.
The failure is simple:
A model gives the right answer, but points to the wrong evidence.
That distinction matters. Most question-answering benchmarks reward the final answer. If the model says the right number, name, date, or metric, it gets credit. But in document-grounded work, the answer alone is not enough. A legal assistant, research tool, medical summarizer, financial analyst, or engineering copilot must also show where the answer came from.
CiteVQA formalizes this problem for document visual question answering. Instead of asking only whether a model answers correctly, the benchmark requires the model to return an answer and an element-level citation pointing to the supporting region of the document. Its core metric, Strict Attributed Accuracy, gives credit only when both pieces are correct: the answer and the cited evidence. (Hugging Face)
That is the key shift.
Answer accuracy is not attribution accuracy.
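To make the difference concrete, here is a minimal sketch of the two scores side by side. It assumes each prediction has already been judged on two booleans; how CiteVQA decides that a cited region matches the gold region is not reproduced here, and the class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgedPrediction:
    # Hypothetical fields: both judgments are assumed to be made upstream.
    answer_correct: bool
    citation_correct: bool


def answer_accuracy(preds: list[JudgedPrediction]) -> float:
    """Fraction of predictions whose answer is correct, ignoring citations."""
    return sum(p.answer_correct for p in preds) / len(preds) if preds else 0.0


def strict_attributed_accuracy(preds: list[JudgedPrediction]) -> float:
    """Fraction of predictions where the answer AND the citation are both correct."""
    hits = sum(p.answer_correct and p.citation_correct for p in preds)
    return hits / len(preds) if preds else 0.0


preds = [
    JudgedPrediction(answer_correct=True, citation_correct=True),
    JudgedPrediction(answer_correct=True, citation_correct=False),  # right answer, wrong evidence
    JudgedPrediction(answer_correct=False, citation_correct=False),
    JudgedPrediction(answer_correct=True, citation_correct=True),
]
print(answer_accuracy(preds))             # 0.75
print(strict_attributed_accuracy(preds))  # 0.5
```

The second prediction is the interesting one: it raises answer accuracy while contributing nothing to the strict score.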
A model can know the answer from parametric memory, infer it from nearby context, or guess it from repeated patterns in training data. It can then attach a citation that looks plausible but does not actually carry the claim.
From the user’s perspective, this is worse than an ordinary hallucination. A false answer is easier to challenge. A correct answer with a bad citation looks trustworthy. It passes the surface test. It has a source. It has the shape of scholarship.
But the evidence is sand.
I call this failure mode Evidence Quicksand.
The answer stands. The citation exists. But the evidence cannot carry the claim.
CiteVQA is important because it turns this from a vague complaint into a measurable benchmark. The dataset contains 1,897 questions across 711 PDFs spanning seven domains and two languages, with documents averaging 40.6 pages. The authors report that even strong multimodal systems can answer correctly while citing the wrong region, which is exactly the reliability gap that answer-only evaluation misses. (GitHub)
Once you name the problem, it appears everywhere:
- research summaries citing the wrong paragraph
- paper claims supported by nearby but insufficient text
- metrics attributed to the wrong table
- legal answers pointing to the wrong clause
- financial answers citing the wrong filing section
- generated explanations drifting beyond the source
- citations that decorate rather than prove
This is the gap the Evidence Engine is designed to close.
The problem is not merely that AI systems hallucinate. The deeper problem is that they often lack a reliable procedure for connecting a generated claim to the exact evidence that supports it.
They need to know:
- what claim is being checked
- where they are allowed to search
- what exact span was found
- whether that span supports the claim
- what to do when it does not
That is why retrieval alone is not enough.
We need a different architecture: one that treats search, evidence extraction, and attribution verification as separate, auditable steps.
2. Retrieval Is Not Proof
Many production RAG systems follow a familiar pattern:
flowchart LR
classDef query fill:#f2f2f2,stroke:#999,stroke-width:2px,color:#333
classDef retriever fill:#e6f3ff,stroke:#1a5c99,stroke-width:2px,color:#1a5c99
classDef chunks fill:#cce6ff,stroke:#005fa3,stroke-width:2px,color:#005fa3
classDef generation fill:#d9f2d9,stroke:#2d7d2d,stroke-width:2px,color:#2d7d2d
classDef answer fill:#f2e6ff,stroke:#6b2d99,stroke-width:2px,color:#6b2d99
Q("❓ Question or Claim") --> R("🔍 Retriever<br/><i>embedding / keyword search</i>")
R --> C("📋 Top‑k Chunks<br/><i>semantically nearby text</i>")
C --> G("🤖 LLM Generation<br/><i>synthesise answer from chunks</i>")
G --> A("📝 Answer with Citations<br/><i>plausible, but unverified</i>")
class Q query
class R retriever
class C chunks
class G generation
class A answer
This is useful.
It is also not proof.
Retrieval gives the model candidate context. It does not establish that the context supports the sentence being written. A retriever can return adjacent text. A generator can overstate what the source says. A citation can point to a real document while failing to support the actual claim. An answer can be correct because the model guessed it, remembered it, or inferred it from nearby material, not because it found the right evidence.
This is the gap between retrieval and verification.
A retrieved chunk says:
“This text may be relevant.”
A verified citation demonstrates something stronger:
“This span, table, region, or evidence set supports this claim.”
Those are different operations.
Here, proof does not mean mathematical certainty. It means auditable evidentiary support: a claim, a permitted scope, an extracted evidence object, and a verification result.
That distinction matters because many dense retrieval systems are optimized for semantic nearness, not evidentiary sufficiency. Embeddings are powerful because they can find related material even when the wording changes. They are good at semantic recall. They are good at clustering. They are good at finding the neighborhood.
But the neighborhood is not the address.
A chunk about CiteVQA may discuss document evaluation, bounding boxes, or multimodal QA. That does not mean it defines Strict Attributed Accuracy. A passage can live in the right conceptual neighborhood and still be the wrong evidentiary address.
This is exactly the type of gap CiteVQA exposes. A model can answer a document question correctly while pointing to the wrong paragraph, table, or region. Under answer-only evaluation, that looks successful. Under attribution-aware evaluation, it fails.
So the answer is not to throw embeddings away.
The answer is to put them in the right place.
Embeddings can tell us where evidence may live. They can rank likely documents, suggest candidate passages, normalize surface-form variation, recover fuzzy matches, and help the system avoid the brittleness of exact lexical search. This matters because raw grep has its own weaknesses. Exact string search can miss aliases, spelling variants, acronyms, diacritics, and semantically relevant passages that use different wording. It can also return the first lexical hit rather than the best evidentiary hit.
But embeddings are not proof.
Lexical hits are not proof either.
Both are search aids.
The Evidence Engine therefore treats retrieval as a candidate-generation layer, not a proof layer. Embeddings may suggest the likely region. Metadata may narrow the source set. Prior traces may identify useful files. Grep-like operations may locate concrete terms inside a bounded window. But proof only begins when the system extracts a source-grounded evidence object and verifies that the evidence supports the claim.
The pipeline becomes:
flowchart LR
classDef claim fill:#f2f2f2,stroke:#999,stroke-width:2px,color:#333
classDef discovery fill:#e6f3ff,stroke:#1a5c99,stroke-width:2px,color:#1a5c99
classDef warrant fill:#fff4cc,stroke:#b38f00,stroke-width:2px,color:#b38f00
classDef extraction fill:#cce6ff,stroke:#005fa3,stroke-width:2px,color:#005fa3
classDef verify fill:#ffe6cc,stroke:#d97a00,stroke-width:2px,color:#d97a00
C("✍️ Claim") --> D("🔎 Candidate Discovery<br/>embeddings / metadata / prior traces<br/><i>“Where might the proof live?”</i>")
D --> W("🛡️ Warranted Search<br/>scoped grep‑like operations<br/><i>“What am I allowed to inspect?”</i>")
W --> E("✂️ Verbatim Evidence Extraction<br/>span / table / region / code symbol<br/><i>exact source text, no paraphrase</i>")
E --> V("🔬 Attribution Verification<br/>support / partial / refute / abstain<br/><i>“Does the evidence carry the claim?”</i>")
class C claim
class D discovery
class W warrant
class E extraction
class V verify
A careful researcher does not merely ask:
“What text is nearby?”
A careful researcher asks:
- What exactly am I trying to verify?
- Where would that evidence live?
- Which sources am I allowed to inspect?
- What phrase, entity, table, symbol, or section should I search?
- What would refute this claim?
- Which exact passage carries the sentence?
- Does the citation actually support the wording?
- Am I overstating the source?
That is not passive retrieval.
That is search as disciplined action.
The central design rule is simple:
Embeddings can tell us where to look. They cannot certify that the claim is supported.
That is why the missing primitive is not just another embedding model.
The missing primitive is intelligent grep: a hybrid system where semantic retrieval prepares the search space, warrant-governed operations make the search auditable, extractive evidence anchors the claim, and attribution verification decides whether the claim can stand.
In earlier work on Trendslop, we established three orthogonal signals: Containment (Hallucination Energy), Consistency (structural fidelity), and Sensitivity (response to perturbation). This post introduces the fourth: Attribution. Containment asks whether a claim stays inside the evidence span. Attribution asks whether it can be traced to a specific, recoverable source. The Evidence Engine operationalizes all four.
3. Intelligent Grep
In the last section, I argued that retrieval finds the neighborhood, not the address.
Embeddings, metadata, BM25, and prior traces can all help us find candidate material. That is useful. But once the system is inside the likely neighborhood, it still needs a way to move from candidate context to evidence.
That is where intelligent grep begins.
The instinct is to trust grep because a grep result is not a model impression. It is not a probability-weighted paraphrase. It is not a fluent guess. It is a concrete operation over a concrete corpus.
It searches real files. It returns real matches. It gives you a deterministic, version-anchored path from query to source text.
That matters.
But ordinary grep is not enough.
It offers exact-match precision but no semantic recall. It can miss aliases, acronyms, spelling variants, diacritics, and semantically relevant passages that use different wording. It can return the first lexical hit rather than the best evidentiary hit. It can locate a phrase without knowing whether the surrounding passage actually supports the claim.
Grep can locate text.
It cannot, by itself, decide what counts as evidence.
LLMs tend to have the opposite problem. They model language well enough to form hypotheses, expand search terms, recognize related concepts, and propose better queries. But they also drift. They infer. They over-compress. They can turn nearby evidence into a stronger claim than the source supports.
So the answer is not to replace retrieval with grep.
And it is not to replace AI with string matching.
The answer is to put intelligence around grep.
By grep-like operations, I do not mean only Unix grep. I mean deterministic, auditable corpus operations: search phrase, exact match, regex match, locate heading, read bounded window, return byte offsets, and log the trace.
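As a rough sketch of what such operations could look like, the following implements an exact phrase search that reports byte offsets, a bounded window read, and an operation log. The function names, the `Hit` type, and the in-memory corpus are all assumptions made for illustration; this is not GrepSeek's interface.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    source: str  # logical name of the document
    start: int   # byte offset where the match begins
    end: int     # byte offset just past the match


def search_phrase(source: str, data: bytes, phrase: str) -> list[Hit]:
    """Exact, case-sensitive phrase search that reports byte offsets."""
    pattern = re.escape(phrase.encode("utf-8"))
    return [Hit(source, m.start(), m.end()) for m in re.finditer(pattern, data)]


def read_window(data: bytes, hit: Hit, radius: int) -> bytes:
    """Read a bounded window around a hit rather than the whole document.

    Slicing raw bytes can split a multi-byte UTF-8 character at the edges,
    so a real kernel would snap the window to character boundaries.
    """
    return data[max(0, hit.start - radius): hit.end + radius]


def run_search(corpus: dict[str, bytes], phrase: str, radius: int = 40):
    """Search every document, logging each operation to an audit trace."""
    trace, windows = [], []
    for source, data in corpus.items():
        hits = search_phrase(source, data, phrase)
        trace.append({"op": "search_phrase", "source": source, "phrase": phrase, "hits": len(hits)})
        for hit in hits:
            windows.append((hit, read_window(data, hit, radius)))
            trace.append({"op": "read_window", "source": source, "start": hit.start, "end": hit.end})
    return windows, trace


# Toy in-memory corpus; the file name and text are made up for illustration.
corpus = {"notes/example.md": b"Intro text. Strict Attributed Accuracy needs both parts. More text."}
windows, trace = run_search(corpus, "Strict Attributed Accuracy", radius=12)
hit, window = windows[0]
print(hit.start, hit.end)  # 12 38
print([step["op"] for step in trace])
```

Every result is reproducible from the trace: the same phrase over the same bytes yields the same offsets.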
An intelligent grep system uses AI where flexibility is useful:
- to classify the claim
- to generate candidate search terms
- to expand aliases and related entities
- to decide where evidence might live
- to refine the search when the first hit is weak
- to select candidate evidence spans
- to evaluate whether a span appears to support the claim
- to propose a safer rewrite when the claim overreaches
But it uses a constrained search kernel where auditability matters:
- locating candidate windows in approved files
- reading bounded windows rather than whole documents
- returning byte offsets or token ranges
- recording every search operation
- anchoring evidence to source text
- keeping corpus access inside the allowed boundary
That is the key division of labor.
AI plans and evaluates. The search kernel touches the corpus. The span extractor anchors the evidence. The attribution verifier checks whether the span supports the claim. The Policy Router selects the appropriate next step.
A real intelligent-grep loop looks like this:
flowchart LR
classDef claim fill:#f2f2f2,stroke:#999,stroke-width:2px,color:#333
classDef warrant fill:#fff4cc,stroke:#b38f00,stroke-width:2px,color:#b38f00
classDef discovery fill:#e6f3ff,stroke:#1a5c99,stroke-width:2px,color:#1a5c99
classDef search fill:#d9f2d9,stroke:#2d7d2d,stroke-width:2px,color:#2d7d2d
classDef evidence fill:#cce6ff,stroke:#005fa3,stroke-width:2px,color:#005fa3
classDef verify fill:#ffe6cc,stroke:#d97a00,stroke-width:2px,color:#d97a00
classDef policy fill:#f2e6ff,stroke:#6b2d99,stroke-width:2px,color:#6b2d99
classDef memory fill:#ffcccc,stroke:#b00000,stroke-width:2px,stroke-dasharray: 5 5,color:#b00000
A("✍️ Claim") --> B("🛡️ Bounded Search Permission<br/><i>warrant: claim, scope, allowed ops</i>")
B --> C("🔎 Candidate Discovery<br/>embeddings / metadata / prior traces<br/><i>“Where might the proof live?”</i>")
C --> D("⚡ Scoped Search Operation<br/>grep‑like: search_phrase, read_window<br/><i>safe, warrant‑limited</i>")
D --> E("📁 File Hit<br/><i>match found in allowed corpus</i>")
E --> F("📄 Bounded Text Window<br/><i>read radius around match</i>")
F --> G("✂️ Verbatim Evidence Span<br/><i>exact source text, content‑hashed</i>")
G --> H("🔬 Attribution Test<br/><i>does span carry the claim?</i>")
H --> I("🧭 Policy Decision")
I --> J("✅ Accept")
I --> K("🔧 Repair / Rewrite")
I --> L("❌ Refute / Flag")
I --> M("⏸️ Abstain")
J & K & L & M --> N("🧠 Trace Stored for Future Improvement<br/><i>search path, span, verdict, repair</i>")
N -->|"♻️ prior traces inform discovery"| C
class A claim
class B warrant
class C discovery
class D,E,F search
class G evidence
class H verify
class I,J,K,L,M policy
class N memory
The important word there is bounded.
The system is not being invited to rummage through everything. It is being given a specific claim and a constrained search path. The next section gives that boundary a name: the warrant.
For now, the key idea is simple: intelligent grep is not raw string matching. It is a claim-driven search procedure where semantic discovery helps find the candidate space, deterministic operations inspect the corpus, extractive spans anchor the evidence, and attribution verification decides whether the claim can stand.
The output is not just an answer.
It is an auditable evidence decision.
It says:
This is the claim.
This is the candidate space.
This is the search operation.
This is the source window.
This is the extracted span.
This is the attribution result.
This is the policy decision.
That is intelligent grep.
Not raw string matching. Not black-box retrieval. Not citation decoration.
A search procedure that can show how it moved from question to evidence.
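One way to picture that output is as a single record with one field per statement in the list above. This is only a sketch: the field names are my own, and the claim, file name, and offsets are invented toy values.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EvidenceDecision:
    claim: str
    candidate_space: list[str]   # sources proposed by discovery
    search_operation: str        # the operation that produced the hit
    source_window: dict          # source name plus start/end offsets
    extracted_span: str          # verbatim text, never a paraphrase
    attribution_result: str      # "support", "partial", "refute", or "abstain"
    policy_decision: str         # "accept", "repair", "refute", or "abstain"


# Toy values for illustration only.
decision = EvidenceDecision(
    claim="The pilot ran at three sites.",
    candidate_space=["reports/pilot.md"],
    search_operation='search_phrase("three sites")',
    source_window={"source": "reports/pilot.md", "start": 120, "end": 260},
    extracted_span="The pilot ran at three sites.",
    attribution_result="support",
    policy_decision="accept",
)
print(json.dumps(asdict(decision), indent=2))
```

Serializing the record is the point: a reviewer can read the decision without rerunning the search.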
4. What Is a Warrant?
When I use the word warrant, I am not using it only in the narrow courtroom sense.
I mean a permissions envelope expressed as an execution contract around an intelligent process.
More precisely, a warrant defines what a process is trying to verify, what it is allowed to inspect, which tools it may use, which models it may call, which agents or subprocesses it may invoke, how far it may search, how much content it may read, what it may return, and what must be logged.
This matters because AI systems are becoming processes, not single replies.
We will not always ask one model one question and receive one answer. Increasingly, we will hand a task to an intelligent process, and that process will spawn subtasks. It may search a local project. It may inspect a codebase. It may query the internet. It may call a foundation model. It may call a smaller local model. It may need to inspect a private document store — but only within an explicitly authorized scope. It may ask another agent to verify part of the work.
That is powerful.
It is also dangerous if the process can inspect tools, files, models, and remote systems without enforceable boundaries.
A warrant defines those boundaries.
It says:
This is the claim or task.
This is the purpose of the search.
These are the sources you may inspect.
These are the tools you may use.
These are the models you may call.
These are the agents or subprocesses you may invoke.
This is your read budget: maximum files, windows, bytes, or tokens.
This is the amount of content you may return.
This is when you must stop.
This is the audit policy: what must be logged, and what must never be stored.
In other words, a warrant is policy applied to intelligent search.
It turns an open-ended instruction like:
“Go find evidence for this.”
into a bounded execution contract:
“Verify this specific claim, using these approved sources, with these allowed operations, under these limits, and return only the evidence needed to support, repair, refute, or abstain.”
That distinction matters.
Without a warrant, an AI search process risks becoming a fishing expedition. It can wander through a corpus, pull unrelated private material into context, call unnecessary models, inspect files it did not need, and produce an answer whose evidence trail is impossible to audit.
With a warrant, the process is constrained from the start.
But a warrant is not just a prompt instruction. It is not enough to tell the model, “only search these files.” The warrant has to be enforced by the runtime. If the model proposes an action outside the approved scope, the search kernel rejects it. If it tries to read too much content, the budget stops it. If it tries to call an unapproved model or tool, the execution layer blocks the call.
The model may plan.
The warrant decides what can execute.
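A minimal sketch of that split, assuming the model's plan arrives as a list of proposed actions and a hypothetical `Warrant` object sits between the plan and the corpus. The read budget field is an assumption added for the example.

```python
from dataclasses import dataclass


class WarrantViolation(Exception):
    """Raised when a proposed action falls outside the warrant."""


@dataclass
class Warrant:
    allowed_sources: frozenset[str]
    allowed_operations: frozenset[str]
    max_chars_read: int  # hypothetical read budget
    chars_read: int = 0

    def authorize(self, operation: str, source: str, chars: int = 0) -> None:
        """Check one proposed action; raise instead of executing if it is out of bounds."""
        if operation not in self.allowed_operations:
            raise WarrantViolation(f"operation not allowed: {operation}")
        if source not in self.allowed_sources:
            raise WarrantViolation(f"source out of scope: {source}")
        if self.chars_read + chars > self.max_chars_read:
            raise WarrantViolation("read budget exhausted")
        self.chars_read += chars  # only charged when the action is approved


warrant = Warrant(
    allowed_sources=frozenset({"papers/citevqa.pdf", "notes/citevqa.md"}),
    allowed_operations=frozenset({"search_phrase", "read_window", "extract_span"}),
    max_chars_read=2000,
)

# Actions proposed by the planner: (operation, source, characters to read).
proposed = [
    ("read_window", "notes/citevqa.md", 500),
    ("read_window", ".env", 100),
    ("shell_exec", "notes/citevqa.md", 0),
    ("read_window", "notes/citevqa.md", 1800),
]
outcomes = []
for operation, source, chars in proposed:
    try:
        warrant.authorize(operation, source, chars)
        outcomes.append("executed")
    except WarrantViolation as exc:
        outcomes.append(f"rejected: {exc}")
print(outcomes)
```

Only the first action runs. The other three are rejected by the runtime, regardless of how the model phrased its plan.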
In implementation terms, the warrant has a simple lifecycle:
compile warrant from claim and context
validate scope and permissions
execute only approved operations
record the trace
expire the warrant after the run
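The lifecycle can be sketched as a context manager, so that expiry happens even if the run fails. Everything here is an assumption for illustration: the dictionary layout, the `ttl_seconds` parameter, and the `execute` helper are hypothetical names, not part of any of the three papers.

```python
import time
from contextlib import contextmanager


@contextmanager
def warrant_run(claim: str, allowed_sources: list[str], ttl_seconds: float = 60.0):
    # 1. Compile the warrant from the claim and its context.
    warrant = {
        "claim": claim,
        "allowed_sources": list(allowed_sources),
        "expires_at": time.monotonic() + ttl_seconds,
        "active": True,
        "trace": [],
    }
    # 2. Validate scope and permissions before anything touches the corpus.
    if not claim.strip() or not warrant["allowed_sources"]:
        raise ValueError("a warrant needs a claim and at least one source")
    try:
        yield warrant  # 3. Execute only approved operations inside the block.
    finally:
        warrant["active"] = False  # 5. Expire the warrant after the run.


def execute(warrant: dict, operation: str, source: str) -> None:
    """Run one operation under the warrant and record it in the trace."""
    if not warrant["active"] or time.monotonic() > warrant["expires_at"]:
        raise PermissionError("warrant expired")
    if source not in warrant["allowed_sources"]:
        raise PermissionError(f"source out of scope: {source}")
    warrant["trace"].append({"op": operation, "source": source})  # 4. Record the trace.


with warrant_run("CiteVQA contains 1,897 questions.", ["notes/citevqa.md"]) as w:
    execute(w, "search_phrase", "notes/citevqa.md")

try:
    execute(w, "search_phrase", "notes/citevqa.md")  # reuse after the run
    reused = True
except PermissionError:
    reused = False
print(reused, len(w["trace"]))  # False 1
```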
A warrant should be temporary, specific, enforceable, and auditable.
For example, a warrant might look like this:
{
  "claim": "CiteVQA contains 1,897 questions across 711 PDFs.",
  "purpose": "verify a benchmark claim before publication",
  "allowed_sources": [
    "papers/citevqa.pdf",
    "notes/citevqa.md"
  ],
  "allowed_operations": [
    "search_phrase",
    "read_window",
    "extract_span"
  ],
  "allowed_models": [
    "local_span_extractor",
    "attribution_verifier"
  ],
  "max_files_touched": 3,
  "max_content_return_chars": 2000,
  "return_policy": "verbatim_spans_only",
  "audit_policy": "log_operations_without_storing_unrelated_content"
}
That is warranted search.
The system is not being asked:
“What do you know about CiteVQA?”
It is being asked:
“Can this specific claim be supported inside this specific evidence boundary?”
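The return limits in the example warrant can be enforced in a few lines. This sketch assumes one particular reading of `max_content_return_chars`: whole spans are returned in order until the next one would exceed the budget. The `apply_return_policy` function is hypothetical.

```python
import json

# A subset of the example warrant; only the return limits matter here.
warrant = json.loads("""
{
  "claim": "CiteVQA contains 1,897 questions across 711 PDFs.",
  "max_content_return_chars": 2000,
  "return_policy": "verbatim_spans_only"
}
""")


def apply_return_policy(warrant: dict, spans: list[str]) -> list[str]:
    """Return whole spans in order until the character budget would be exceeded."""
    if warrant["return_policy"] != "verbatim_spans_only":
        raise ValueError(f"unsupported return policy: {warrant['return_policy']}")
    budget = warrant["max_content_return_chars"]
    returned, used = [], 0
    for span in spans:
        if used + len(span) > budget:
            break  # stop rather than truncate a span mid-sentence
        returned.append(span)
        used += len(span)
    return returned


# Placeholder spans of known length stand in for extracted text.
spans = ["a" * 1500, "b" * 400, "c" * 300]
returned = apply_return_policy(warrant, spans)
print([len(s) for s in returned])  # [1500, 400]
```

Stopping at a span boundary is a design choice: a truncated span is still verbatim, but it can misrepresent the sentence it came from.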
The same pattern generalizes beyond blog writing.
A company laptop search should not mean “search the whole laptop.” It should mean: search for this specific file, phrase, hash, or policy violation inside this approved scope, and return only the permitted evidence.
A medical assistant should not inspect an entire patient history merely because the records are available. It should inspect only the records relevant to the clinical question and return only the facts needed for that decision.
A coding agent must not have blanket read access to secrets, configs, or credentials. Its warrant should specify exactly which files and directories are in scope, which commands are allowed, and which model calls are permitted.
A research agent should not browse the internet indefinitely. It should receive a warrant that defines the claim, the source types, the allowed search depth, and the required evidence format.
This is the governance layer many AI search systems still treat as an afterthought.
GrepSeek gives us executable corpus interaction. ACL-Verbatim gives us extractive evidence. CiteVQA gives us attribution verification.
But before any of those layers run, we need to ask:
What is this process allowed to inspect, call, read, and return?
That is the warrant.
A warrant binds the search to a specific claim, approved sources, allowed operations, permitted models, resource budgets, and audit requirements.
It is the difference between an AI rummaging through a world and an AI verifying a claim under explicit permission, bounded scope, and recorded accountability.
5. The Three Papers
A warrant defines the boundary.
These three papers supply the mechanics that operate inside it.
They do not describe one complete system. They come from different domains: open-domain QA, extractive research QA, and visual document QA. What follows is my synthesis: three demonstrated primitives assembled into one evidence pipeline.
Each paper supplies one missing primitive, but none supplies the complete Evidence Engine:
| Paper | Primitive | Role in the Evidence Engine |
|---|---|---|
| GrepSeek | executable search | find candidate evidence regions through auditable corpus interaction |
| ACL-Verbatim | extractive evidence | return source spans instead of generated support |
| CiteVQA | attribution evaluation | test whether cited evidence supports the answer or claim |
This separation of concerns is intentional.
GrepSeek does not solve attribution. ACL-Verbatim does not solve search governance. CiteVQA does not tell us how to search.
But together they describe the skeleton of an evidence-first AI system:
search the corpus
→ extract the evidence
→ evaluate the attribution
The warrant wraps that skeleton in policy.
The Evidence Engine turns it into a working architecture.
5.1 GrepSeek: Search as Executable Corpus Interaction
GrepSeek gives us the search layer: a way to express search as an explicit sequence of corpus operations rather than an opaque retrieval call.
Its central move is simple but important: the corpus becomes the environment.
Instead of hiding documents behind an index and asking a retriever for top-k chunks, a GrepSeek-style agent interacts with the corpus directly. It issues executable search operations. It filters. It reads. It narrows. It builds a trajectory through the files.
In ordinary RAG, the retrieval step often looks like this:
query → ranked chunks
That ranking may be useful, but the path is mostly hidden. The model receives candidate context. It does not necessarily show how it found the evidence.
In a GrepSeek-style system, the path is explicit:
search "Strict Attributed Accuracy"
→ find matches in citevqa_notes.md
→ read surrounding paragraphs
→ narrow to the definition or benchmark section
→ return candidate evidence regions
In the paper, those operations are expressed as executable shell-command trajectories over the corpus. The exact commands are less important here than the architectural shift: search becomes a visible, inspectable sequence of actions.
The system can show:
- what it searched for
- which files it touched
- which windows it read
- which query failed
- which query succeeded
- where the candidate evidence came from
That is the first missing primitive: search as an auditable action loop.
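A toy version of such a trajectory, written in Python rather than the shell commands the paper uses. The `run_trajectory` function, the fallback query, and the contents of the notes file are all invented for illustration.

```python
def run_trajectory(corpus: dict[str, str], queries: list[str]) -> list[dict]:
    """Try queries in order, logging every attempt, and stop at the first hit."""
    trace = []
    for query in queries:
        files = sorted(name for name, text in corpus.items() if query in text)
        trace.append({"query": query, "files": files, "status": "hit" if files else "miss"})
        if files:
            break
    return trace


# Hypothetical notes; the text is made up and is not a quote from the paper.
corpus = {
    "citevqa_notes.md": "Credit requires a correct answer and a correct citation (strict attribution).",
    "todo.md": "Buy milk.",
}
trace = run_trajectory(corpus, ["Strict Attributed Accuracy", "strict attribution"])
for step in trace:
    print(step)
```

The failed first query stays in the trace. That record is what lets a later run, or a human reviewer, see why the second phrasing was needed.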
Executable search is not a panacea. GrepSeek’s value lies in exposing the retrieval path, not in eliminating lexical brittleness. Exact matching remains vulnerable to aliases, spelling differences, diacritics, paraphrase, and surface-form variation. It can also return a nearby lexical hit before the best evidentiary hit.
That is exactly why the Evidence Engine is not raw grep.
GrepSeek supplies the executable search substrate: an auditable way to move from claim to candidate evidence regions. Embeddings, metadata, and prior traces can help with candidate discovery. The warrant constrains what the system may inspect. The later layers decide whether the candidate material actually supports the claim.
So GrepSeek gives us the first leg:
Search should be an auditable action sequence over a real corpus, not just an opaque retrieval result.
5.2 ACL-Verbatim: Evidence as a Span, Not a Generation
Search is not enough.
Once the system finds a candidate region, it still has to avoid drifting from the source.
That is where ACL-Verbatim matters.
ACL-Verbatim applies the VerbatimRAG pattern to research papers in the ACL Anthology, mapping user queries to verbatim spans in retrieved documents.
The core constraint shifts the system from abstractive generation to extractive evidence. Instead of asking an LLM to synthesize an answer from a chunk, the architecture treats the source text as the authority and asks which exact span should be returned.
That is the second missing primitive.
The system should not begin by asking:
Can you write an answer?
It should ask:
What exact source span can support this claim?
The span becomes the unit of evidence.
Generation can happen later. Interpretation can happen later. Blog prose can happen later. But the evidence layer should be extractive first.
That changes the writing pipeline.
Instead of:
claim → retrieved chunk → generated explanation → citation
we get:
claim → source window → verbatim span → support test → prose
This matters because generation is where drift enters. A model can turn “reduces hallucination risk” into “solves hallucination.” It can turn a narrow benchmark result into a broad architectural claim. It can merge two nearby ideas into a sentence neither source actually supports.
A verbatim span is necessary but not sufficient. It anchors the claim to source text, but it may still be incomplete, adjacent, or only partially supportive. It may need to be combined with another span. And if the search layer misses the right source, extraction has nothing reliable to extract.
But it gives the verifier something concrete to test.
ACL-Verbatim gives us the second leg:
Evidence should be extracted before it is generated.
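The extract-first rule can be checked mechanically: a candidate span is accepted only if it occurs character for character in the source. The sketch below is my own illustration of that check, not ACL-Verbatim's implementation; `anchor_span`, `VerbatimSpan`, and the sample document are hypothetical.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VerbatimSpan:
    source: str
    start: int   # character offset into the document
    end: int
    text: str
    sha256: str  # content hash of the span text


def anchor_span(source: str, document: str, candidate: str) -> Optional[VerbatimSpan]:
    """Accept a candidate span only if it occurs verbatim in the document."""
    if not candidate:
        return None
    start = document.find(candidate)
    if start == -1:
        return None  # paraphrased or invented text cannot be anchored
    digest = hashlib.sha256(candidate.encode("utf-8")).hexdigest()
    return VerbatimSpan(source, start, start + len(candidate), candidate, digest)


# Made-up source sentence for illustration.
document = "In our tests, the method reduces hallucination risk on long documents."
anchored = anchor_span("paper.md", document, "reduces hallucination risk")
drifted = anchor_span("paper.md", document, "solves hallucination")
print(anchored is not None, drifted is None)  # True True
```

Passing this check only shows that the text exists in the source. Whether it supports the claim is a separate question for the verifier.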
5.3 CiteVQA: Attribution Must Be Evaluated
CiteVQA supplies the third primitive: a way to evaluate whether the cited evidence actually supports the answer.
Its central insight is the failure mode this whole post is built around:
A model can answer correctly while citing the wrong evidence.
That breaks answer-only evaluation.
If the answer is right but the citation is wrong, the system has not proven the answer from the cited source. It may have guessed. It may have remembered. It may have inferred from nearby context. It may have attached a plausible-looking citation after the fact.
CiteVQA makes that failure measurable.
Instead of scoring only the final answer, it evaluates the answer and the cited evidence together. In CiteVQA, that evidence is an element-level bounding-box region in a PDF. In the Evidence Engine, we adapt the same principle to text spans, tables, code symbols, or other reference primitives.
Its Strict Attributed Accuracy principle gives us the evaluation shape we need for trustworthy writing:
A prediction should pass only when the answer is correct
and the cited evidence is correct.
For Writer, the equivalent rule is:
A factual sentence should pass only when its cited evidence can support it.
That turns citation from decoration into a load-bearing acceptance constraint.
A citation is not a badge that says “source nearby.” It is a connection between a sentence and evidence.
This gives us the third missing primitive: attribution verification.
The Evidence Engine has to ask:
- Does this span support the claim?
- Does it only partially support the claim?
- Does it contradict the claim?
- Is the claim overstated?
- Does the claim require multiple spans?
- Should the system abstain?
One caveat matters: attribution verification is not a truth oracle. It is a judgment under uncertainty. A verifier can miss nuance, misread partial support, or over-score weak evidence. So the verifier itself must be calibrated, allowed to abstain, and auditable.
A verifier you cannot inspect is just another source of unearned confidence.
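One simple way to make the verifier's uncertainty visible is to route on both its verdict and its confidence, so that low confidence always leads to abstention. This is a sketch under assumptions: the verdict labels, the mapping, and the 0.8 threshold are hypothetical choices, not values taken from CiteVQA.

```python
from enum import Enum


class Verdict(Enum):
    SUPPORT = "support"
    PARTIAL = "partial"
    REFUTE = "refute"
    INSUFFICIENT = "insufficient"


# Hypothetical mapping from verifier verdict to policy decision.
ROUTES = {
    Verdict.SUPPORT: "accept",
    Verdict.PARTIAL: "repair",
    Verdict.REFUTE: "refute",
    Verdict.INSUFFICIENT: "abstain",
}


def route(verdict: Verdict, confidence: float, min_confidence: float = 0.8) -> str:
    """Map a verdict to a decision, abstaining whenever the verifier is unsure."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if confidence < min_confidence:
        return "abstain"
    return ROUTES[verdict]


print(route(Verdict.SUPPORT, 0.95))  # accept
print(route(Verdict.SUPPORT, 0.55))  # abstain
print(route(Verdict.PARTIAL, 0.90))  # repair
```

The threshold only means something if the confidence scores are calibrated, which is why calibration has to be checked separately.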
CiteVQA does not give us a production prose verifier by itself. It gives us the evaluation principle: correctness must include attribution.
That is the difference between a citation generator and an evidence system.
CiteVQA gives us the third leg:
Attribution must be scored, not assumed.
5.4 From Primitives to Architecture
The three papers line up cleanly:
GrepSeek
→ find candidate evidence regions through executable search
ACL-Verbatim
→ extract exact spans from the source
CiteVQA
→ evaluate whether the cited evidence supports the answer
This becomes the Evidence Engine:
claim
→ warrant
→ executable search
→ candidate source window
→ verbatim evidence span
→ attribution verification
→ accept / repair / refute / abstain
The point is not that any one paper solves the whole problem.
The point is that each paper isolates one primitive modern AI systems often lack, and the Evidence Engine composes them:
- GrepSeek makes search visible.
- ACL-Verbatim makes evidence extractive.
- CiteVQA makes attribution measurable.
- The warrant makes the whole process bounded.
Together, once we add the parts the papers do not supply, they reframe AI writing from a fluent generation problem into an evidence procedure.
The next section builds that runtime.
6. From Warrant to Runtime
The warrant gives us the boundary.
The three papers give us the mechanics.
Now we can assemble the runtime.
At this point, the Evidence Engine is no longer just:
search → extract → verify
It becomes:
claim → warrant → search → extract → verify → route
The added steps change the runtime contract.
Better search tools do not remove the need for a boundary. A GrepSeek-style agent can still search too broadly, read too much, call the wrong model, pull private material into context, or produce an evidence trail no one can audit.
At runtime, the warrant stops being a definition and becomes an enforcement object.
The system does not ask:
What can I find?
It asks:
Given this claim, this scope, these tools, and this budget,
can I find evidence strong enough to support, repair, or refute the claim, or must I abstain?
That is what warranted search means at runtime: every evidence run is governed by an explicit, enforceable contract.
The warrant is compiled before the corpus is touched. From that point on, every proposed action is checked against it: source, operation, model call, read budget, return policy, and audit policy.
A minimal runtime looks like this:
Claim Extractor
↓
Warrant Compiler
↓
Scoped Search Kernel
↓
Candidate Evidence Windows
↓
Verbatim Span Extractor
↓
Attribution Verifier
↓
Policy Router
↓
Accept / Repair / Refute / Abstain
↓
Evidence Trace Store
The important part is that each layer receives a narrower, more constrained object than the one before it.
The claim extractor receives prose.
The warrant compiler receives a claim.
The search kernel receives a warrant.
The span extractor receives bounded source windows.
The verifier receives a claim-span pair.
The policy router receives a verdict.
The trace store receives the final evidence decision.
That narrowing is the discipline of the runtime.
The runtime is a narrowing machine: prose becomes claims, claims become warrants, warrants become bounded reads, bounded reads become spans, spans become verdicts, and verdicts become actions.
When enforced by the runtime, this narrowing prevents the system from treating the entire corpus as context. It prevents the model from reading everything “just in case.” It forces each step to justify itself as the process moves from prose to evidence.
For Writer, this becomes a claim-level workflow.
A draft sentence like:
CiteVQA contains 1,897 questions across 711 PDFs.
does not trigger a general search about CiteVQA. It triggers a specific warrant:
{
"claim": "CiteVQA contains 1,897 questions across 711 PDFs.",
"purpose": "verify benchmark claim before publication",
"allowed_sources": [
"papers/citevqa.pdf",
"notes/citevqa.md"
],
"allowed_operations": [
"search_phrase",
"read_window",
"extract_span"
],
"allowed_models": [
"local_span_extractor",
"attribution_verifier"
],
"query_terms": [
"1,897",
"711",
"questions",
"CiteVQA"
],
"return_policy": "verbatim_spans_only",
"audit_policy": "log_operations_and_span_hashes",
"on_exhaustion": "abstain",
"max_files_touched": 3,
"max_content_return_chars": 2000
}
The search kernel is then allowed to inspect only the declared sources, using only the declared operations, within the declared budget.
If the planner proposes an out-of-scope file, an unapproved operation, or an over-budget read, the kernel rejects the action before execution and records the boundary event in the trace.
If it finds a candidate region, the span extractor isolates the exact text.
If the attribution verifier scores the span as sufficient, the policy router accepts it.
If the span only partially supports the claim, the router proposes a repair.
If the span contradicts the claim, the router refutes it.
If the evidence is missing, too weak, or the warrant budget is exhausted, the system abstains.
The router also handles runtime failure states:
| Failure | Runtime response |
|---|---|
| Out-of-scope action | reject operation and log violation |
| Budget exhausted | abstain |
| No candidate hit | abstain or request expanded warrant |
| Weak span | repair or retry within budget |
| Contradictory span | refute |
| Multiple required spans | decompose claim |
That is the shift from retrieval to evidence discipline.
The search is not broad. The span is not generated. The citation is not decorative. The verdict is not hidden.
Every accepted claim carries a trace:
claim
→ warrant
→ search operations
→ source window
→ extracted span
→ attribution verdict
→ policy action
→ trace record
This is the point where intelligent grep becomes an Evidence Engine.
Not because it can search more.
Because it is constrained enough to show whether the evidence can carry the claim.
7. The Evidence Engine Architecture
The previous sections gave us the ingredients:
- GrepSeek gives us executable corpus interaction.
- ACL-Verbatim gives us extractive evidence.
- CiteVQA gives us attribution-aware evaluation.
- The warrant gives us the runtime boundary.
Section 6 described the runtime as a narrowing machine: prose becomes claims, claims become warrants, warrants become bounded reads, bounded reads become spans, and spans become verdicts.
This section names the components inside that machine.
7.1 The Full Loop
This diagram shows the full warranted-search loop proposed in the post. A draft claim is first converted into a bounded search warrant, then passed through a scoped search kernel, an extractive evidence layer, and an attribution verifier. The final policy router decides whether the claim should be accepted, repaired, refuted, reviewed, or left unverified, while the trace is stored for future audit and improvement.
flowchart TD
classDef entry fill:#f2f2f2,stroke:#999,stroke-width:2px,color:#333
classDef extract fill:#e6f3ff,stroke:#1a5c99,stroke-width:2px,color:#1a5c99
classDef warrant fill:#fff4cc,stroke:#b38f00,stroke-width:2px,color:#b38f00
classDef search fill:#d9f2d9,stroke:#2d7d2d,stroke-width:2px,color:#2d7d2d
classDef evidence fill:#cce6ff,stroke:#005fa3,stroke-width:2px,color:#005fa3
classDef verify fill:#ffe6cc,stroke:#d97a00,stroke-width:2px,color:#d97a00
classDef policy fill:#f2e6ff,stroke:#6b2d99,stroke-width:2px,color:#6b2d99
classDef memory fill:#ffcccc,stroke:#b00000,stroke-width:2px,stroke-dasharray: 5 5,color:#b00000
classDef pass fill:#e6ffe6,stroke:#2d7d2d,stroke-dasharray: 3 3,color:#2d7d2d
A("📄 Draft / Chapter / Query") --> B("🔍 Claim Extractor")
B --> C{"❓ Needs Evidence?"}
C -- "✅ No" --> Z1("✔️ Pass Through<br/>(opinion, framing, original term)")
C -- "⚠️ Yes" --> D("🛡️ Warrant Compiler<br/><i>claim, scope, allowed ops, return policy</i>")
D --> E("📂 Scoped Search Kernel<br/><i>warrant-governed grep</i>")
E --> F("⚡ Executable Search Operations<br/>search_phrase / read_window / find_heading")
F --> G("📋 Candidate Evidence Windows")
G --> H("✂️ Verbatim Span Extractor<br/><i>exact source span, no paraphrase</i>")
H --> I("📌 Evidence Span<br/><i>path, char range, content hash</i>")
I --> J("🔬 Attribution Verifier<br/><i>CiteVQA strict test</i>")
J --> K("📉 Drift / Predicate Support<br/><i>claim ↔ evidence distance</i>")
K --> L("🧭 Policy Router")
L --> M("✅ Accept")
L --> N("🔧 Repair / Rewrite")
L --> O("❌ Refute / Flag")
L --> P("⏸️ Abstain (out of scope / budget)")
E -- "📝 audit trail" --> Q("🧾 Search Trace")
H -- "🔗 span origin" --> R("📎 Span Provenance")
J -- "🏷️ support label" --> S("📊 Verification Result")
Q --> T("🧠 Evidence Memory")
R --> T
S --> T
T --> U("🔄 Improved Future Search & Verification<br/><i>trained query planner, better verifier</i>")
class A entry
class B extract
class D warrant
class E,F,G search
class H,I evidence
class J,K verify
class L,M,N,O,P policy
class Q,R,S,T,U memory
class Z1 pass
In practice, the Evidence Engine is an intelligent grep: a claim-bound search loop that turns verification into an executable process. For each evidence-bearing claim, it generates a warrant, searches only the authorized scope, extracts verbatim spans, verifies attribution, estimates drift, and records the trace.
The output is not just an answer with a citation. It is a decision: accept the claim, repair the wording, surface a contradiction, send it to review, or abstain because the evidence is not strong enough.
The goal is to make AI writing auditable at the level where it usually fails: the connection between a sentence and the evidence supposed to support it.
Each layer has one job.
7.2 Claim Extractor and Modality Gate
The first mistake an evidence system can make is treating every sentence as the same kind of thing.
Not every sentence needs a citation.
Some sentences are narrative. Some are source-backed definitions. Some are author-defined terms. Some are interpretations. Some are speculation. Some are factual claims. Some are synthesis claims.
The engine must distinguish between them.
If I write:
I call this failure mode Evidence Quicksand.
that does not need a citation. It is a naming act.
If I write:
CiteVQA contains 1,897 questions across 711 PDFs.
that does need evidence. It is a factual benchmark claim.
So the first layer extracts candidate claims and assigns modality.
Example:
{
"text": "CiteVQA contains 1,897 questions across 711 PDFs.",
"claim_type": "metric_claim",
"modality": "factual",
"needs_evidence": true,
"risk": "high"
}
Another example:
{
"text": "Evidence Quicksand is the failure mode where citations look solid but cannot carry the claim.",
"claim_type": "definition",
"modality": "author_defined",
"needs_evidence": false,
"risk": "low"
}
The Modality Gate acts as a short-circuit evaluator.
If a sentence is author-defined, narrative, speculative, or interpretive, the system should not force it through the full evidence pipeline. It can label it and pass it through. If the sentence is factual, metric-bearing, source-backed, comparative, or synthesis-heavy, the system routes it toward warrant compilation.
This gate is essential.
Without it, the engine becomes hostile to original thought. It starts demanding citations for framing, metaphors, definitions, arguments, and synthesis. That would make it useless as a writing tool.
The correct behavior is not:
Every sentence needs a citation.
The correct behavior is:
Every sentence needs an evidence decision.
That decision may be:
needs evidence
does not need evidence
is author-defined
is interpretive
is speculative
is synthesis and must be decomposed
A tool that punishes original synthesis will not be trusted.
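The gate described above can be sketched as a small routing function. The modality names come from this section, but the grouping into sets, the return labels, and the `review` fallback are my assumptions for illustration.

```python
# Modalities named in this section; the grouping is an assumption.
PASS_THROUGH = {"author_defined", "narrative", "speculative", "interpretive"}
NEEDS_WARRANT = {"factual", "metric", "source_backed", "comparative"}

def modality_gate(claim):
    """Return the next pipeline step for a claim dict with a 'modality' key."""
    modality = claim["modality"]
    if modality in PASS_THROUGH:
        return "pass_through"     # label it and skip the evidence pipeline
    if modality == "synthesis":
        return "decompose"        # split into sub-claims before any warrant
    if modality in NEEDS_WARRANT:
        return "compile_warrant"
    return "review"               # unknown modality: do not guess

print(modality_gate({"modality": "author_defined"}))  # pass_through
print(modality_gate({"modality": "factual"}))         # compile_warrant
```

The gate short-circuits early on purpose: a naming act never reaches the search kernel.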
7.3 Warrant Compiler
For each evidence-bearing claim, the system compiles a warrant.
Here the warrant is no longer a concept. It is a runtime object.
It binds the claim to approved sources, allowed operations, permitted models, read budgets, return policy, audit policy, and expiration rules.
A claim like:
CiteVQA contains 1,897 questions across 711 PDFs.
becomes:
{
"claim": "CiteVQA contains 1,897 questions across 711 PDFs.",
"allowed_sources": ["papers/citevqa.pdf", "notes/citevqa.md"],
"allowed_operations": ["search_phrase", "read_window", "extract_span"],
"allowed_models": ["local_span_extractor", "attribution_verifier"],
"query_terms": ["1,897", "711", "questions", "CiteVQA"],
"max_files_touched": 3,
"max_search_ops": 8,
"return_policy": "verbatim_spans_only",
"on_exhaustion": "abstain"
}
That warrant is passed to the search kernel.
From that point on, the search is no longer an open-ended request. It is a constrained execution.
If the planner proposes a file outside the warrant, the kernel rejects the action.
If the operation is not listed, it is rejected.
If the read budget is exhausted, the engine abstains or requests an expanded warrant.
That enforcement matters because a warrant that is only written into a prompt is not a warrant. It is a suggestion.
A real warrant is enforced at the execution boundary: every operation is validated against the warrant before it touches the corpus.
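Enforcement at the execution boundary can be sketched as a check that runs before every operation. This is a minimal sketch, assuming an in-memory `Warrant` object; the class, the `authorize` function, and the event names are hypothetical, and only a subset of the warrant fields is modeled.

```python
from dataclasses import dataclass, field

@dataclass
class Warrant:
    allowed_sources: set
    allowed_operations: set
    max_search_ops: int
    ops_used: int = 0
    boundary_events: list = field(default_factory=list)  # kept for the trace

def authorize(warrant, operation, source=None):
    """Check one proposed action against the warrant before it runs."""
    if operation not in warrant.allowed_operations:
        warrant.boundary_events.append(("unapproved_operation", operation))
        return False
    if source is not None and source not in warrant.allowed_sources:
        warrant.boundary_events.append(("out_of_scope_source", source))
        return False
    if warrant.ops_used >= warrant.max_search_ops:
        warrant.boundary_events.append(("budget_exhausted", operation))
        return False
    warrant.ops_used += 1  # only authorized actions consume budget
    return True

w = Warrant(
    allowed_sources={"papers/citevqa.pdf", "notes/citevqa.md"},
    allowed_operations={"search_phrase", "read_window", "extract_span"},
    max_search_ops=8,
)
print(authorize(w, "read_window", "papers/citevqa.pdf"))  # True
print(authorize(w, "read_window", "secrets/keys.txt"))    # False
print(authorize(w, "list_files"))                         # False
print(w.boundary_events)
```

The check lives in code, not in a prompt, so a confused planner can propose anything and still execute nothing outside the warrant.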
7.4 Scoped Search Kernel
The Scoped Search Kernel is the GrepSeek-inspired layer.
It executes deterministic, auditable operations over the allowed corpus. These operations can include:
list_files
search_phrase
search_text
open_file
read_window
find_heading
find_near
The key is that the kernel is not raw shell access.
Raw shell access is too broad for a writing tool. It can chain commands, redirect output, traverse directories, read secrets, or run destructive operations if mishandled.
The kernel should expose a restricted, auditable command set instead:
search approved files
read bounded windows
return character offsets
log every operation
reject out-of-scope actions
This preserves grep’s determinism without exposing the filesystem or shell primitives.
In practice, the kernel is where intelligent grep becomes real. The planner may suggest search terms. Embeddings or metadata may suggest candidate files. But only the kernel touches the corpus.
That trust boundary is the point.
The AI proposes.
The warrant constrains.
The kernel executes.
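A toy version of that boundary, assuming an in-memory corpus instead of real files. `ScopedKernel` and its log format are hypothetical; only `search_phrase` is shown, and a production kernel would also enforce budgets and the full operation list.

```python
class ScopedKernel:
    """Toy kernel over an in-memory corpus; real file access is left out."""

    def __init__(self, corpus, allowed_sources):
        self.corpus = corpus              # {path: text}
        self.allowed = set(allowed_sources)
        self.log = []                     # every operation is recorded

    def search_phrase(self, path, phrase):
        """Return the character offsets of every exact match in one approved file."""
        if path not in self.allowed:
            self.log.append(("search_phrase", path, phrase, "rejected"))
            raise PermissionError(f"{path} is outside the warrant")
        text = self.corpus[path]
        offsets, start = [], text.find(phrase)
        while start != -1:
            offsets.append(start)
            start = text.find(phrase, start + 1)
        self.log.append(("search_phrase", path, phrase, "success"))
        return offsets

corpus = {
    "notes/citevqa.md": "CiteVQA contains 1,897 questions across 711 PDFs.",
    "private/diary.md": "1,897 unrelated words.",
}
kernel = ScopedKernel(corpus, allowed_sources=["notes/citevqa.md"])
print(kernel.search_phrase("notes/citevqa.md", "1,897"))  # [17]
```

The private file contains the same number, but the kernel never reads it: the scope check runs before the text is touched, and the rejection itself is logged.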
7.5 Candidate Evidence Windows
The search kernel does not return evidence.
It returns candidates that must still be tested.
A search hit may be in the right document but the wrong paragraph. It may contain the right phrase but not the right claim. It may be adjacent to the evidence but not itself evidentiary.
So the kernel should return bounded windows, not whole documents.
A candidate window should include:
source path
document id
window text
character range
search operation
query term
warrant id
content hash
This gives the downstream extractor enough context to work, while avoiding the common RAG failure of dumping large, loosely related chunks into the model.
A window is a search result, not an evidentiary unit.
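A candidate window with those fields might be built like this. It is a sketch under assumptions: `read_window` here is a hypothetical helper over an in-memory corpus, the `radius` parameter is my own, and the document id is simply passed in.

```python
import hashlib

def read_window(corpus, path, hit_offset, radius, warrant_id, query_term, doc_id=None):
    """Return a bounded window around a hit, never the whole document."""
    text = corpus[path]
    start = max(0, hit_offset - radius)
    end = min(len(text), hit_offset + radius)
    window_text = text[start:end]
    digest = hashlib.sha256(window_text.encode("utf-8")).hexdigest()
    return {
        "source_path": path,
        "document_id": doc_id,
        "window_text": window_text,
        "char_range": (start, end),
        "search_operation": "read_window",
        "query_term": query_term,
        "warrant_id": warrant_id,
        "content_hash": "sha256:" + digest,
    }

text = "Intro. CiteVQA contains 1,897 questions across 711 PDFs. Outro."
corpus = {"notes/citevqa.md": text}
hit = text.find("1,897")
window = read_window(corpus, "notes/citevqa.md", hit, radius=10,
                     warrant_id="warrant_789", query_term="1,897")
print(window["char_range"], repr(window["window_text"]))
```

The window is clamped to the document bounds and to the radius, so the extractor downstream sees a neighborhood, not a file.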
7.6 Verbatim Span Extractor
The Verbatim Span Extractor is the ACL-Verbatim-inspired layer.
Given a candidate evidence window, it must identify the exact source span that could support the claim.
It must not summarize.
It must not paraphrase.
It must not clean up the wording.
It must return text that exists in the source.
The output should look something like this:
{
"source_path": "papers/citevqa.pdf",
"span_text": "CiteVQA contains 1,897 questions across 711 PDFs...",
"char_start": 48211,
"char_end": 48267,
"content_hash": "sha256:...",
"warrant_id": "warrant_...",
"claim_id": "claim_..."
}
This transforms the citation from a string reference into an offset-and-hash anchored dependency.
The source span is no longer a loose reference to a paper, section, or document. It is an anchored evidence object.
But span extraction is not verification.
A span can be accurately extracted and still fail to support the claim.
A span can mention the topic and still fail to carry the sentence.
A span can support only part of a claim.
That is why the next layer exists.
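The extractive contract can be sketched as two small checks: a candidate string is anchored only if it exists verbatim in the source, and an anchored span can be re-checked later against its hash. The function names are hypothetical, and a real extractor would first propose the candidate with a model before this check runs.

```python
import hashlib

def sha256_of(text):
    return "sha256:" + hashlib.sha256(text.encode("utf-8")).hexdigest()

def extract_span(source_text, source_path, candidate):
    """Anchor a candidate string to the source, or return None if it is not verbatim."""
    start = source_text.find(candidate)
    if start == -1:
        return None  # paraphrased or invented text is rejected, not repaired
    end = start + len(candidate)
    return {
        "source_path": source_path,
        "span_text": source_text[start:end],
        "char_start": start,
        "char_end": end,
        "content_hash": sha256_of(candidate),
    }

def span_is_current(source_text, span):
    """Re-check an anchored span against a possibly updated source."""
    current = source_text[span["char_start"]:span["char_end"]]
    return sha256_of(current) == span["content_hash"]

source = "The benchmark is large. CiteVQA contains 1,897 questions across 711 PDFs."
span = extract_span(source, "notes/citevqa.md", "CiteVQA contains 1,897 questions")
print(span["char_start"], span["char_end"])
print(extract_span(source, "notes/citevqa.md", "CiteVQA has 1,897 questions"))  # None
```

The second call returns `None` because the wording was cleaned up. That is the intended behavior: a near-quote is not a quote.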
7.7 Attribution Verifier
The Attribution Verifier is the CiteVQA-inspired layer.
It receives a claim and an evidence object, then asks:
Does this evidence actually support this claim?
The output should not be just yes or no.
It should be a structured verdict:
supports
partially_supported
not_supported
contradicts
too_vague
needs_multiple_spans
out_of_scope
not_in_corpus
abstain
This is the layer that turns evidence from decoration into load-bearing structure.
If the verifier scores the span as sufficient, the engine may accept it.
If the span appears to partially support the claim, the engine should repair or narrow the wording.
If the span appears to contradict the claim, the engine should flag or refute it.
If the claim requires multiple spans, the engine should attempt to decompose it.
If the verifier is uncertain, the engine should abstain.
The verifier is not a source of truth. It is a claim-span judge. Its output should include confidence, rationale, and enough metadata for review. A low-confidence supports label should not be treated like a high-confidence one.
Attribution verification is fallible. It should be calibrated, auditable, and allowed to abstain.
A verifier you cannot inspect is just another source of unearned confidence.
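One way to keep low-confidence verdicts from passing as strong ones is to carry confidence and rationale with the label and downgrade below a threshold. The `Verdict` shape and the 0.8 threshold are assumptions for illustration; a real threshold should come from calibration.

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    label: str         # one of the verdict labels listed above
    confidence: float  # assumed to be a calibrated value in [0, 1]
    rationale: str     # kept so a reviewer can inspect the judgment

def effective_label(verdict, min_confidence=0.8):
    """Downgrade any low-confidence verdict to 'abstain'."""
    if verdict.confidence < min_confidence:
        return "abstain"
    return verdict.label

strong = Verdict("supports", 0.95, "span states both numbers exactly")
weak = Verdict("supports", 0.55, "span mentions the benchmark but not the counts")
print(effective_label(strong))  # supports
print(effective_label(weak))    # abstain
```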
7.8 Drift and Predicate Support
A binary support label is useful, but not enough.
Many bad claims are not completely unsupported. They are slightly too strong.
A source might say:
ACL-Verbatim reduces hallucination risk by forcing answers to be returned as verbatim source spans.
A draft might say:
ACL-Verbatim eliminates hallucination.
The second sentence is not unrelated. It is worse: it is an overclaim. It lives close enough to the evidence to sound grounded, but it has drifted beyond what the evidence can carry.
In this MVP, I treat the visible gap between a claim and its cited span as a practical drift signal. In the fuller framework, this connects to Hallucination Energy: the unsupported component of the claim relative to its evidence.
The MVP uses lexical drift as a heuristic proxy. Production systems should use a stronger verifier: an NLI model, a cross-encoder, a calibrated LLM judge, or the geometric Hallucination Energy signal developed in earlier work.
The point is not to invent a magical hallucination number.
The point is to show the writer where the wording has moved beyond the source.
Not merely:
supported / unsupported
But:
well-supported
slightly overstated
interpretive
unsupported
contradicted
out of scope
Predicate support is part of the same problem.
A span that mentions “GrepSeek” and “retrieval” does not automatically support the claim:
GrepSeek replaces dense retrieval.
The evidence must carry the relationship asserted by the predicate:
replaces
eliminates
proves
solves
guarantees
makes obsolete
always
never
all
first
best
Entity overlap is not enough.
This gives Writer a practical heat map of evidentiary risk. The writer should be able to inspect which sentences are well-supported, which are interpretive, which require softer wording, and which should not be published.
That is how the Evidence Engine becomes useful as a writing system rather than just a citation checker.
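The lexical drift proxy and the predicate check can be sketched together. This is a heuristic illustration only: the tokenizer, the `lexical_drift` formula, and the single-word term list are my assumptions, and the multi-word predicate "makes obsolete" is left out because it needs phrase matching.

```python
import re

# Strong predicates and quantifiers taken from the list above (single words only).
STRONG_TERMS = {"replaces", "eliminates", "proves", "solves", "guarantees",
                "always", "never", "all", "first", "best"}

def tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def lexical_drift(claim, span):
    """Share of claim tokens that do not appear in the span (0 = none, 1 = all)."""
    claim_tokens = tokens(claim)
    if not claim_tokens:
        return 0.0
    return len(claim_tokens - tokens(span)) / len(claim_tokens)

def unsupported_strong_terms(claim, span):
    """Strong terms asserted by the claim but absent from the span."""
    return sorted((tokens(claim) & STRONG_TERMS) - tokens(span))

span = ("ACL-Verbatim reduces hallucination risk by forcing answers "
        "to be returned as verbatim source spans.")
claim = "ACL-Verbatim eliminates hallucination."
print(lexical_drift(claim, span))             # 0.25
print(unsupported_strong_terms(claim, span))  # ['eliminates']
```

The example shows why overlap alone misleads: three of the four claim tokens appear in the span, so drift looks low, yet the one missing token is the predicate that carries the overclaim.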
7.9 Policy Router
The Policy Router decides what happens next.
It receives the attribution verdict, drift estimate, warrant status, and claim modality. Then it chooses an action.
| Verdict | Meaning | Action |
|---|---|---|
| supports | evidence carries claim | accept |
| partially_supported | evidence is narrower than claim | suggest rewrite |
| not_supported | span is unrelated or insufficient | retry, repair, review, or abstain |
| contradicts | source appears to refute claim | refute / hard flag |
| too_vague | weak evidence | refine query within warrant |
| needs_multiple_spans | synthesis claim | decompose |
| not_in_corpus | source absent | request source or expanded warrant |
| out_of_scope | warrant too narrow | request expanded warrant |
| abstain | verifier uncertain or budget exhausted | leave unresolved |
This is not a linter. A linter checks style and leaves the text alone.
The Policy Router changes the status of the claim:
accepted
needs repair
needs refutation
needs decomposition
needs wider warrant
unverified
The router is where automation policy lives. Some actions can be automatic, such as accepting a high-confidence exact metric span. Others should enter a human review queue, especially contradiction, synthesis decomposition, and high-risk repairs.
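The routing table can be sketched as a lookup plus two policy rules. The action names, the set of actions that require review, and the `risk` parameter are assumptions; the `not_supported` row is collapsed into a single `retry_or_abstain` action for brevity.

```python
# Verdict -> action mapping mirrors the verdict table above.
ACTIONS = {
    "supports": "accept",
    "partially_supported": "suggest_rewrite",
    "not_supported": "retry_or_abstain",
    "contradicts": "refute",
    "too_vague": "refine_query",
    "needs_multiple_spans": "decompose",
    "not_in_corpus": "request_source",
    "out_of_scope": "request_expanded_warrant",
    "abstain": "leave_unresolved",
}
RETRY_ACTIONS = {"retry_or_abstain", "refine_query"}
HUMAN_REVIEW = {"refute", "decompose"}  # assumed review policy

def route(verdict, budget_exhausted=False, risk="low"):
    """Return (action, needs_human_review) for one verified claim."""
    action = ACTIONS.get(verdict, "leave_unresolved")  # unknown labels never auto-accept
    if budget_exhausted and action in RETRY_ACTIONS:
        action = "leave_unresolved"                    # no retries once the budget is spent
    needs_review = action in HUMAN_REVIEW or (action == "suggest_rewrite" and risk == "high")
    return (action, needs_review)

print(route("supports"))                          # ('accept', False)
print(route("contradicts"))                       # ('refute', True)
print(route("too_vague", budget_exhausted=True))  # ('leave_unresolved', False)
```

Keeping the policy in a table makes it reviewable: changing what is automatic and what needs a human is an edit to data, not to control flow.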
That is why I think of the Evidence Engine as a compiler for evidence.
It takes loose prose and tries to compile it into warranted, sourced, verified claims.
If the claim cannot compile, the system should not fake success.
It should emit a structured diagnostic:
unsupported claim
out-of-scope source
exhausted warrant
contradictory evidence
missing decomposition
low-confidence verifier
7.10 What the Architecture Guarantees, and What It Does Not
The Evidence Engine is not a truth oracle.
It is an evidence discipline.
That distinction matters because the system should not overclaim its own authority. It can make claim checking more explicit, bounded, and auditable, but it cannot make source evidence magically correct, make a verifier infallible, or turn synthesis into a single-span fact.
| The architecture guarantees | It does not guarantee |
|---|---|
| Search stays inside the warrant | The claim is true in the world |
| Evidence is extracted, not invented | The source itself is correct |
| Attribution is explicitly tested | The verifier is infallible |
| Weak evidence can route to review or abstention | Every claim can be resolved |
| The trace survives for audit | The synthesis is automatically valid |
| Source windows are bounded | The right source exists inside the warrant |
| Repairs can be proposed | Repairs are automatically correct |
| Evidence links can become living dependencies | Evidence never becomes stale |
This boundary is the whole point.
The system is allowed to say:
This claim is supported inside this warrant.
This claim is partially supported but overstated.
This claim is not supported by the cited span.
This claim requires multiple spans.
This claim is authorial synthesis.
This claim could not be verified inside the allowed scope.
It is not allowed to pretend that nearby text is proof.
The architecture is therefore a loop, not a straight line. The router can send weak claims back into bounded search, request refutation, ask for decomposition, propose repair, or abstain. But every loop stays inside the warrant. Once the budget is exhausted, the honest output is not confidence.
It is a traceable failure:
No sufficient evidence found inside the warrant.
Search budget exhausted.
Claim remains unverified.
That is a good failure.
The next sections look more closely at what happens after this architecture runs: how traces can become learning material, how refutation search catches overreach, why synthesis claims need EvidenceSets, and why evidence must become a first-class object rather than a decorative citation.
8. Learning From the Evidence Trace
The previous section described the architecture.
This section explains why the trace matters.
The architecture is not a straight line.
It is a bounded loop.
The first search often finds the right neighborhood but the wrong passage. The verifier may decide that a span is only partial, too vague, or unrelated. The router may then refine the query, open another bounded window, try a different span, request refutation, propose repair, or abstain.
But every loop stays inside the warrant.
A retry is not a new permission. It is another action under the same claim, scope, operation list, read budget, return policy, and audit policy. Once that budget is exhausted, the system must stop and report what happened.
That is why the trace matters.
The trace is not only the final verdict. It is the record of the loop: which search terms were tried, which files were touched, which windows were read, which spans were rejected, which repairs were proposed, where the verifier was uncertain, and why the router accepted, repaired, refuted, reviewed, or abstained.
Without that trace, the system has only an answer.
With the trace, the system has an episode that can be inspected, replayed, compared, and, after validation, used to improve the next search.
Every run through the Evidence Engine therefore leaves behind more than a verdict. It leaves behind a bounded execution history: what the system searched, what it touched, what it found, what it rejected, where it retried, where it abstained, and which decision the router finally made.
That trace has two jobs:
It makes the current claim auditable.
It gives future systems material to learn from.
But the second job needs a warning up front.
A trace is not truth.
A failed trace does not mean the claim is false. It may mean the warrant was too narrow, the query was poor, the source was missing, or the search budget ran out.
An accepted span does not mean the verifier was right. It means the verifier judged that span sufficient under the current model, threshold, and warrant.
So raw traces should not be treated as automatic supervision. They are audit records first. They become training material only after validation, calibration, replay, or human review.
The trace is evidence about the system’s behavior.
It is not evidence that the system was correct.
8.1 Evidence Runs as Episodes
Each verification run can be treated as an episode:
claim
→ warrant
→ search operations
→ candidate windows
→ extracted spans
→ attribution verdict
→ policy action
→ trace record
The trace tells us what happened.
It records:
what the claim was
what warrant constrained the search
which queries were tried
which files were touched
which windows were read
which spans were extracted
which verifier scores were assigned
which repair, refutation, or abstention was chosen
That trace is valuable because evidence finding is not only about the final span. It is also about the path that led to it.
A good evidence run finds the right source quickly, uses few operations, reads little irrelevant content, extracts a tight span, passes attribution with high confidence, keeps drift low, and respects the warrant.
A bad evidence run is not waste. It tells us where the system failed:
the query was too broad
the first hit was adjacent but not evidentiary
the span was too wide
the claim was overstated
the warrant was too narrow
the verifier was uncertain
the search budget was exhausted
Those differences between good and bad runs are the basis for improving search, extraction, verification, and repair policies, provided the traces are validated before they are reused.
8.2 Loops Stay Inside the Warrant
The trace is valuable because the Evidence Engine does not always resolve a claim in one pass.
A typical run may look like this:
claim
→ warrant
→ search
→ extract span
→ verify
→ if supports: accept
→ if partial: repair or refine
→ if vague: search again
→ if contradiction: refute
→ if no support: try refutation search
→ if budget exhausted: abstain
The important constraint is that every retry is still governed by the original warrant unless a new warrant is explicitly requested and approved.
The engine can refine a query, open another bounded window, test another span, or try a refutation pass. But it cannot silently expand the corpus, read private files, call an unapproved model, or keep searching after the budget has expired.
That is what makes the trace useful.
A successful trace shows the path to evidence.
A failed trace shows why the system stopped.
A partial trace shows where the source was nearby but not strong enough.
Those differences matter later. A failed trace is not proof that the claim is false. It may mean the warrant was too narrow, the query was poor, the source was missing, the verifier was uncertain, or the search budget ran out.
So the trace must preserve not just the result, but the reason the result happened.
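A bounded retry loop that preserves the reason it stopped might look like this. It is a simplified sketch: `search` and `verify` are caller-supplied stand-ins for the kernel and the verifier, the budget is counted in queries, and repair and refutation search are left out.

```python
def run_bounded_loop(queries, search, verify, max_search_ops):
    """Try queries in order until one is supported or the budget runs out."""
    trace = []
    for ops_used, query in enumerate(queries):
        if ops_used >= max_search_ops:
            trace.append({"query": None, "verdict": None, "reason": "budget_exhausted"})
            return "abstain", trace
        span = search(query)
        verdict = verify(span) if span is not None else "no_hit"
        trace.append({"query": query, "verdict": verdict, "reason": None})
        if verdict == "supports":
            return "accept", trace
        if verdict == "contradicts":
            return "refute", trace
    trace.append({"query": None, "verdict": None, "reason": "queries_exhausted"})
    return "abstain", trace

# Stand-in search results and verifier, for illustration only.
fake_hits = {
    "711": "across 711 PDFs",
    "1,897 questions": "contains 1,897 questions across 711 PDFs",
}
search = fake_hits.get
verify = lambda span: "supports" if "1,897" in span else "too_vague"
queries = ["benchmark size", "711", "1,897 questions"]

print(run_bounded_loop(queries, search, verify, max_search_ops=8)[0])  # accept
print(run_bounded_loop(queries, search, verify, max_search_ops=2)[0])  # abstain
```

The same claim is accepted under one budget and left unverified under another, and the trace records which of the two happened and why.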
8.3 A Trace Record
A trace should be a structured object, not a blob of logs.
For example:
{
"trace_id": "trace_123",
"claim_id": "claim_456",
"warrant_id": "warrant_789",
"claim_hash": "sha256:...",
"corpus_hash": "sha256:...",
"operations": [
{
"step": 1,
"operation": "search_phrase",
"query": "1,897",
"status": "success",
"files_touched": 1,
"chars_read": 0
},
{
"step": 2,
"operation": "read_window",
"source": "papers/citevqa.pdf",
"status": "success",
"chars_read": 1840
}
],
"candidate_windows": 3,
"spans_extracted": 2,
"best_span_hash": "sha256:...",
"verdict": "partially_supported",
"support_score": 0.72,
"drift_score": 0.31,
"policy_action": "repair",
"human_validation": "pending",
"privacy_summary": {
"files_touched": 1,
"chars_read": 1840,
"raw_text_retained": false
}
}
This object can be audited, replayed, filtered, and later promoted into training data.
The privacy field matters.
A trace should not casually store every raw window the engine read. If the warrant permits only narrow evidence return, the trace should store hashes, offsets, operation summaries, verdicts, and policy decisions by default. Full text spans should be retained only when the warrant permits it.
Otherwise the trace memory becomes a privacy leak.
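A hash-by-default trace writer can be sketched as follows. The `record_window` helper and the `retain_raw_text` flag are hypothetical; the field names follow the trace record above.

```python
import hashlib

def record_window(trace, source, window_text, retain_raw_text=False):
    """Append one read to the trace, storing a hash instead of the text by default."""
    digest = hashlib.sha256(window_text.encode("utf-8")).hexdigest()
    entry = {
        "source": source,
        "chars_read": len(window_text),
        "window_hash": "sha256:" + digest,
    }
    if retain_raw_text:  # only when the warrant explicitly permits it
        entry["window_text"] = window_text
    trace["operations"].append(entry)
    summary = trace["privacy_summary"]
    summary["chars_read"] += len(window_text)
    summary["files_touched"] = len({op["source"] for op in trace["operations"]})
    summary["raw_text_retained"] = summary["raw_text_retained"] or retain_raw_text
    return trace

text = "CiteVQA contains 1,897 questions across 711 PDFs."
trace = {"operations": [],
         "privacy_summary": {"files_touched": 0, "chars_read": 0, "raw_text_retained": False}}
record_window(trace, "papers/citevqa.pdf", text)
print(trace["privacy_summary"])
```

The hash is enough to replay and audit the read against the same corpus version without keeping the text itself.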
8.4 Rewarding Better Evidence Behavior
The verifier can become a feedback signal for search, but that feedback must be handled carefully.
A better span should receive a higher attribution score. A tighter claim should reduce drift. A shorter trace should reduce search cost. A narrower read should reduce privacy cost. A justified abstention should score higher than a forced answer. A warrant violation should not be a small penalty. It should invalidate the run.
A simple episode-level scoring sketch might look like this:
if warrant_violation:
    reward = 0
else:
    reward = (
        external_validation_score
        + attribution_score
        + correct_abstention_bonus
        - drift_score
        - search_cost
        - privacy_cost
    )
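This is not meant to be a universal formula. It is a design sketch. Note that a zero reward only acts as a hard failure if valid runs cannot score below zero; otherwise, violating runs should be dropped from the pool rather than scored.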
The important part is that at least one signal should come from outside the model loop: human validation, a held-out gold set, replay tests, or cross-verifier agreement. If the system rewards traces only using its own verifier confidence, it can learn to satisfy the verifier rather than become more correct.
That is reward hacking in evidence clothing.
So the reward should be computed against validated outcomes, not raw model confidence alone.
And warrant violations should be hard failures. The system should not learn that it is acceptable to breach scope when the evidence looks useful.
A retry is not a new permission.
It is another action inside the same warrant.
8.5 What Trace Artifacts Can Improve
Validated traces can train or calibrate better components.
| Trace artifact | What it can improve |
|---|---|
| Successful search paths | query planning / search policy |
| Failed search paths | negative examples for query refinement |
| Accepted spans | span extraction |
| Rejected spans | attribution calibration |
| Verifier scores plus human corrections | threshold tuning and calibration |
| Accepted rewrites | claim repair |
| Abstentions | uncertainty handling |
| Human overrides | verifier calibration |
| Stale evidence hashes | re-verification triggers |
| Warrant violations | enforcement tests and policy hardening |
The last row is important.
A warrant violation should not mainly teach the model to behave better. The runtime should already make the violation impossible to execute. But violation attempts are still useful because they show where the policy boundary needs to be clearer, where the planner is confused, or where the UI should ask for an expanded warrant.
That is engineering feedback, not just model feedback.
8.6 Offline Training, Not Live Self-Modification
This is where the Evidence Engine becomes more than a checker.
But not because it rewrites itself live in production.
That would be dangerous.
Instead, traces should flow into an offline curation pipeline:
raw trace
→ privacy scrub
→ replay test
→ human or gold-set validation
→ calibration set
→ training / evaluation pool
→ model or policy update
→ regression test
→ promotion
Only validated traces should be promoted into training data.
Raw traces should first enter an evaluation pool.
Some traces will be used for replay tests. Some will be used for verifier calibration. Some will become supervised examples for query planning, span extraction, or claim repair. Some will be discarded for privacy, ambiguity, or bad verifier behavior.
This avoids the dangerous version of self-improvement, where a system grades its own homework and trains on its own mistakes.
The engine produces the data.
It does not automatically certify the data.
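The promotion rule can be sketched as a filter over trace records. The pool names, the `confirmed` validation value, and the `privacy_flagged` field are assumptions; replay tests, calibration, and regression checks from the pipeline above are not modeled.

```python
def curate(traces):
    """Split raw traces into promoted, evaluation-only, and discarded pools."""
    pools = {"promoted": [], "evaluation": [], "discarded": []}
    for t in traces:
        if t.get("warrant_violation") or t.get("privacy_flagged"):
            pools["discarded"].append(t["trace_id"])   # never train on these
        elif t.get("human_validation") == "confirmed":
            pools["promoted"].append(t["trace_id"])    # validated outside the model loop
        else:
            pools["evaluation"].append(t["trace_id"])  # pending or disputed: evaluate only
    return pools

traces = [
    {"trace_id": "t1", "human_validation": "confirmed"},
    {"trace_id": "t2", "human_validation": "pending"},
    {"trace_id": "t3", "human_validation": "confirmed", "warrant_violation": True},
]
print(curate(traces))
```

The third trace was confirmed by a human and is still discarded, because a run that breached its warrant should not become a positive example.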
8.7 Abstention as a Good Failure
The loop must remain warranted.
The engine may refine a query.
It may open another bounded window.
It may test another span.
It may try a refutation pass.
But it cannot expand forever.
If the warrant budget is exhausted, the correct output is not a confident answer.
It is an abstention with a trace:
No sufficient evidence found inside the warrant.
Search budget exhausted.
Claim remains unverified.
That is a good failure.
It preserves the difference between:
false
and:
not verified inside this scope
A system that knows when to stop is more trustworthy than a system that always finds something.
A justified abstention should be rewarded over a forced answer that happens to satisfy a fallible verifier.
That rule matters because otherwise the system quietly learns the wrong lesson: always produce something.
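A budgeted loop makes the abstention path explicit. The sketch below assumes a hypothetical `search` callable that returns a candidate span or `None`; the `Outcome` type, its status strings, and the trace format are illustrative, and a real engine would still pass any candidate through the attribution verifier.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Outcome:
    status: str                      # "candidate_found" or "abstain"
    reason: str
    span: Optional[str] = None
    trace: list[str] = field(default_factory=list)


def verify_within_budget(
    queries: list[str],
    search: Callable[[str], Optional[str]],
    max_ops: int,
) -> Outcome:
    """Try queries in order, stopping when the warrant budget runs out."""
    trace: list[str] = []
    for ops_used, query in enumerate(queries):
        if ops_used >= max_ops:
            trace.append("budget_exhausted")
            return Outcome(
                "abstain",
                "Search budget exhausted. Claim remains unverified.",
                trace=trace,
            )
        span = search(query)
        hit = span is not None
        trace.append(f"search:{query}:{'hit' if hit else 'miss'}")
        if hit:
            return Outcome("candidate_found", "span found inside warrant", span, trace)
    return Outcome(
        "abstain",
        "No sufficient evidence found inside the warrant.",
        trace=trace,
    )
```

Both abstention branches return the trace, so "not verified inside this scope" always arrives with a record of what was tried.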
8.8 The Trace Has Two Lives
The evidence trace therefore has two lives.
First, it is an audit artifact.
It lets a human inspect the current claim:
What was checked?
Where did the system search?
What did it read?
Which span did it extract?
Why was the claim accepted, repaired, refuted, or left unresolved?
Second, it is a training artifact.
It gives future systems material to improve:
Which queries worked?
Which searches failed?
Which spans were accepted?
Which spans fooled the verifier?
Which repairs preserved the author’s intent?
Which abstentions were justified?
That is the long-term promise of the Evidence Engine.
It turns verification from a one-off check into a trace-driven calibration loop.
It does not merely ask whether a claim is supported.
It gives the system the material needed to learn how support is found.
9. Refutation Search
The architecture already includes a refutation pass.
This section makes it concrete: how it works, what it looks for, and how it turns a borderline claim into a stronger one.
Support-only search asks the easiest question:
What can I find that agrees with this?
That is confirmation bias implemented as a search policy.
In a large corpus, almost any claim can find something nearby: the same paper, the same company, the same metric, the same model, the same concept. The span may be topically related, and the citation may look plausible, while still failing to carry the claim.
Topical relevance is not evidentiary support.
A serious evidence system must also ask the harder question:
What would make this claim false, narrower, or misleading?
That is refutation search: a bounded, warrant-governed attempt to find evidence that narrows, qualifies, or contradicts the claim.
It is not adversarial debate.
It is not an infinite attempt to disprove every sentence.
It is a scoped counter-evidence pass for claims where the cost of overclaiming is high.
Typical candidates are:
benchmark comparisons
"replaces X" claims
"solves Y" claims
"always / never" claims
"eliminates the need for" claims
synthesis claims built from multiple papers
claims likely to be weakened by caveats
The runtime pattern is:
claim
→ support search
→ candidate support evidence
→ attribution check
→ counter-evidence search
→ refutation check
→ accept / repair / refute / abstain
The refutation pass should remain warranted.
Adversarial intent does not expand warrant boundaries. The system does not get to rummage through everything just because it is looking for counter-evidence. If the original warrant does not authorize the sources needed for refutation, the runtime should stop and request a separate refutation warrant rather than silently broadening scope.
That distinction matters.
A support warrant and a refutation warrant may inspect different sources, use different query terms, and have different budgets. Keeping them separate makes the trace clearer:
support trace: evidence that appears to carry the claim
refutation trace: evidence that narrows, weakens, or contradicts it
The counter-query generator should not merely negate the sentence.
It should search for limitation terms, comparison terms, failure modes, baseline names, and author caveats.
For a technical paper, useful counter-query terms might include:
limitations
failure cases
does not
complementary
future work
ablation
error analysis
caveat
not sufficient
underperforms
compared with
It may also target different structural regions of the same source:
abstract
limitations
discussion
conclusion
error analysis
appendix
ablation tables
This matters because caveats often live far from the headline result.
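A counter-query generator along these lines might look like the following. It is a sketch under assumptions: the term and section lists are subsets of the ones above, and `build_counter_queries` and its query shape are hypothetical names, not part of any of the cited systems.

```python
# Subsets of the counter-query terms and structural regions listed above.
LIMITATION_TERMS = [
    "limitations", "failure cases", "does not",
    "complementary", "future work", "underperforms",
]
TARGET_SECTIONS = ["limitations", "discussion", "error analysis", "appendix"]


def build_counter_queries(
    subject: str, baselines: list[str], max_queries: int
) -> list[dict]:
    """Pair the claim's subject with limitation terms and named baselines.

    The result is truncated to max_queries so the refutation pass stays
    inside its warrant budget.
    """
    queries = []
    for term in LIMITATION_TERMS:
        queries.append({"terms": [subject, term], "sections": TARGET_SECTIONS})
    for baseline in baselines:
        queries.append(
            {"terms": [subject, baseline, "compared with"],
             "sections": TARGET_SECTIONS}
        )
    return queries[:max_queries]
```

Note that nothing here negates the sentence. The generator looks for the places where authors usually put their own caveats.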
Example claim:
GrepSeek replaces dense retrieval.
A support search might find GrepSeek’s strengths: executable corpus interaction, auditable command trajectories, no index requirement, and strong benchmark results.
A refutation search asks what evidence inside the warrant would narrow, qualify, or contradict that sentence.
It might find the paper positioning Direct Corpus Interaction as complementary to existing retrieval paradigms. It might find future work on hybrid DCI-plus-index retrieval. It might also find benchmark cases where GrepSeek does not dominate every retriever on every task.
That counter-evidence does not merely reject the claim.
It changes the safe version of the claim.
The better sentence is not:
GrepSeek replaces dense retrieval.
The better sentence is:
GrepSeek does not replace all retrieval systems. It shows that direct corpus interaction can complement index-based retrieval by making search trajectories explicit and auditable.
That is a stronger claim because it has been stress-tested.
Refutation search should produce structured outcomes:
| Outcome | Meaning | Action |
|---|---|---|
| no_counterevidence_found | no weakening evidence found inside warrant | keep claim provisional |
| claim_too_broad | source supports a narrower claim | repair |
| claim_contradicted | source directly conflicts with claim | refute / hard flag |
| needs_caveat | source supports claim but adds limits | add caveat |
| needs_comparison | claim depends on another system or baseline | search comparison evidence |
| counterevidence_ambiguous | possible counter-evidence found, verifier uncertain | human review / abstain from strong wording |
| weak_counterevidence | counter-evidence is itself poorly supported | do not refute; continue review |
| out_of_scope | counter-evidence requires unauthorized sources | request refutation warrant |
| budget_exhausted | warrant limit reached | abstain from strong claim |
The key row is the first one.
no_counterevidence_found does not mean the claim is true.
It means no counter-evidence was found inside this warrant, with these sources, these query terms, and this budget.
Like every other layer, refutation search is fallible. It can miss counter-evidence that exists. A clean refutation result means these counter-queries found nothing in this scope, not that no counter-evidence exists anywhere.
That is why the trace matters.
A shallow refutation pass and a serious refutation pass may have the same verdict, but they should not have the same trace. The trace should show which counter-queries were tried, which sections were searched, which windows were read, and which counter-spans were rejected.
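The outcome table can be encoded directly so the router never has to parse prose. The enum values and action strings below are copied from the table; the `RefutationOutcome` and `route_refutation` names are hypothetical.

```python
from enum import Enum


class RefutationOutcome(str, Enum):
    NO_COUNTEREVIDENCE_FOUND = "no_counterevidence_found"
    CLAIM_TOO_BROAD = "claim_too_broad"
    CLAIM_CONTRADICTED = "claim_contradicted"
    NEEDS_CAVEAT = "needs_caveat"
    NEEDS_COMPARISON = "needs_comparison"
    COUNTEREVIDENCE_AMBIGUOUS = "counterevidence_ambiguous"
    WEAK_COUNTEREVIDENCE = "weak_counterevidence"
    OUT_OF_SCOPE = "out_of_scope"
    BUDGET_EXHAUSTED = "budget_exhausted"


# One action per outcome, mirroring the table above.
ACTIONS = {
    RefutationOutcome.NO_COUNTEREVIDENCE_FOUND: "keep claim provisional",
    RefutationOutcome.CLAIM_TOO_BROAD: "repair",
    RefutationOutcome.CLAIM_CONTRADICTED: "refute / hard flag",
    RefutationOutcome.NEEDS_CAVEAT: "add caveat",
    RefutationOutcome.NEEDS_COMPARISON: "search comparison evidence",
    RefutationOutcome.COUNTEREVIDENCE_AMBIGUOUS: "human review / abstain from strong wording",
    RefutationOutcome.WEAK_COUNTEREVIDENCE: "do not refute; continue review",
    RefutationOutcome.OUT_OF_SCOPE: "request refutation warrant",
    RefutationOutcome.BUDGET_EXHAUSTED: "abstain from strong claim",
}


def route_refutation(outcome: RefutationOutcome) -> str:
    """Look up the policy action for a refutation outcome."""
    return ACTIONS[outcome]
```

There is deliberately no outcome that maps to "claim is true". The strongest result a clean pass can produce is a provisional claim.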
For Writer, the practical value is claim repair.
The system can detect claims that have the shape of overreach:
X replaces Y.
X solves hallucination.
X proves retrieval is obsolete.
X always improves accuracy.
X eliminates the need for human review.
Then it can run a bounded counter-evidence pass before those claims make it into the final draft.
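Detecting the shape of overreach can start with plain lexical cues. The patterns below are an illustrative assumption, not a tuned classifier, and `overreach_flags` is a hypothetical helper. A flag only decides whether a claim earns a counter-evidence pass; it says nothing about whether the claim is wrong.

```python
import re

# Lexical cues for the claim shapes listed above. Illustrative, not exhaustive.
OVERREACH_PATTERNS = {
    "replaces": re.compile(r"\breplaces?\b", re.IGNORECASE),
    "solves": re.compile(r"\bsolves?\b", re.IGNORECASE),
    "obsolete": re.compile(r"\b(obsolete|dead)\b", re.IGNORECASE),
    "absolute": re.compile(r"\b(always|never)\b", re.IGNORECASE),
    "eliminates": re.compile(r"\beliminates? the need for\b", re.IGNORECASE),
}


def overreach_flags(claim: str) -> list[str]:
    """Return the names of all overreach cues that appear in the claim."""
    return [
        name
        for name, pattern in OVERREACH_PATTERNS.items()
        if pattern.search(claim)
    ]
```

Cues like these over-trigger. A carefully negated sentence such as "X does not replace Y" is flagged too, which is acceptable because the only cost is one extra bounded counter-evidence pass.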
Many of the best edits come from this step.
Refutation search does not merely reject bad claims.
It often finds the better sentence hidden underneath them.
It turns:
This proves the old approach is dead.
into:
This exposes a limitation in the old approach and suggests a complementary path rather than a replacement.
It turns:
This solves hallucination.
into:
This reduces one class of hallucination by forcing claims through extractive evidence and attribution checks.
It turns:
GrepSeek makes dense retrieval obsolete.
into:
GrepSeek shows that direct corpus interaction can make search trajectories auditable, while dense retrieval remains useful for candidate discovery.
These rewrites are proposals.
The engine does not get to assert them either. A repair must itself pass attribution against the support and counter-evidence that motivated it. Otherwise the system has only moved the overclaim into a new sentence.
High-risk repairs should not be silently applied. They should enter a review queue with the support span, the counter-evidence span, and the proposed rewrite.
Implementation-wise, the refutation pass should have its own trace entries:
{
"claim_id": "claim_123",
"trace_type": "refutation",
"support_warrant_id": "warrant_support_456",
"refutation_warrant_id": "warrant_refute_789",
"counter_queries": [
"limitations",
"complementary retrieval",
"hybrid index retrieval",
"underperforms"
],
"counter_windows_read": 4,
"counter_spans_found": 2,
"outcome": "claim_too_broad",
"policy_action": "repair"
}
The exact schema can change, but the principle should not.
Support and refutation traces should be stored separately so the report can show which evidence carried the claim and which evidence constrained it.
The system should not merely find confirming spans.
It should search for the strongest source-bounded correction.
Support search makes a claim plausible.
Refutation search makes it harder to knock over.
Publication still requires judgment.
10. Synthesis Claims Need Decomposition
Refutation search catches overreach inside a claim.
Synthesis decomposition catches overreach across claims.
That distinction matters because the central claim of this post is itself a synthesis claim:
GrepSeek, ACL-Verbatim, and CiteVQA form a single architecture for trustworthy AI writing.
No single paper says that.
And the system should not pretend otherwise.
GrepSeek does not say:
Combine me with ACL-Verbatim and CiteVQA to build an Evidence Engine.
ACL-Verbatim does not say:
Use GrepSeek for search and CiteVQA for attribution.
CiteVQA does not say:
This benchmark completes a warranted-search architecture for AI writing.
That is my synthesis: an architectural claim built from source-backed primitives.
That does not make the synthesis invalid. It means the evidence system has to represent it honestly.
A factual claim can often be checked against a single source span:
CiteVQA uses Strict Attributed Accuracy.
A synthesis claim cannot.
A synthesis claim connects multiple supported pieces with an authorial join:
GrepSeek, ACL-Verbatim, and CiteVQA form the three legs of an Evidence Engine.
A span extractor cannot verify that directly, and not just because the claim is spread across papers.
The join itself is not in any of them.
“These primitives compose into a system” is an inference, not a sentence we can grep for. Even with every paper in scope, the composition is unsourced by nature because it is a reasoning step, not a fact.
So the Evidence Engine has to distinguish:
atomic factual claim
from:
synthesis claim
and then handle them differently.
An atomic claim asks:
Does this source say this?
A synthesis claim asks:
Do these supported pieces justify this composition?
That second question is not extraction.
It is composition analysis.
10.1 Sourceable Atoms and Interpretive Joins
Decomposition separates two different things.
First, there are sourceable atoms: claims that can be checked against evidence spans.
For this post, those might be:
GrepSeek provides executable corpus interaction.
ACL-Verbatim provides extractive evidence spans.
CiteVQA provides attribution-aware evaluation.
Each of those can be verified against a source.
Then there are interpretive joins: claims that connect the source-backed atoms into a larger argument.
For this post, those might be:
Executable search, extractive evidence, and attribution verification map onto search, extraction, and verification stages.
A warranted-search runtime can compose those stages under a permissions envelope.
Together, those primitives form the basis of an Evidence Engine.
Those joins are not ordinary factual claims.
They are the author’s connective argument.
The system can inspect them. It can ask whether they are modest, overstrong, missing bridge support, or contradicted by the sources. But it cannot pretend the joins are directly stated by the papers.
That is the key discipline.
The engine verifies the atoms.
It evaluates the joins.
It never pretends a join has a source.
10.2 Evidence Sets, Not Evidence Spans
A synthesis claim should produce an EvidenceSet, not a single EvidenceSpan.
In code, the shape is different:
EvidenceSpan = one source-backed evidence object
EvidenceSet =
atomic subclaims
+ supporting spans
+ bridge claims
+ composition verdict
+ composition note
+ review status
A possible object might look like this:
{
"claim": "GrepSeek, ACL-Verbatim, and CiteVQA form a single architecture for trustworthy AI writing.",
"claim_type": "synthesis",
"subclaims": [
{
"node_id": "node_grepseek",
"text": "GrepSeek provides executable corpus interaction.",
"evidence": [
{
"span_id": "span_grepseek_001",
"source_hash": "sha256:...",
"byte_range": [1024, 1150]
}
],
"verdict": "supports",
"support_score": 0.91
},
{
"node_id": "node_acl_verbatim",
"text": "ACL-Verbatim provides extractive evidence spans.",
"evidence": [
{
"span_id": "span_acl_verbatim_004",
"source_hash": "sha256:...",
"byte_range": [2210, 2380]
}
],
"verdict": "supports",
"support_score": 0.88
},
{
"node_id": "node_citevqa",
"text": "CiteVQA provides attribution-aware evaluation.",
"evidence": [
{
"span_id": "span_citevqa_002",
"source_hash": "sha256:...",
"byte_range": [4420, 4590]
}
],
"verdict": "supports",
"support_score": 0.93
}
],
"bridge_claims": [
{
"text": "These three primitives map onto search, extraction, and attribution verification stages.",
"verdict": "author_interpretation",
"requires_review": true
},
{
"text": "A warranted-search runtime can compose these stages into an Evidence Engine.",
"verdict": "author_interpretation",
"requires_review": true
}
],
"composition_note": "The sources support the components. The author is responsible for the architectural composition.",
"composition_verdict": "supported_interpretation",
"requires_human_review": true,
"review_reason": "multi-source synthesis"
}
The machine-actionable field is composition_verdict.
That is what the router acts on.
The composition_note is the human-readable companion. It tells the reader:
The sources supplied the parts.
The author supplied the join.
This prevents the system from laundering an author’s inference as if it were a source claim.
10.3 Composition Verdicts
The router needs a vocabulary for synthesis claims.
Something like:
supported_interpretation
overstrong_join
unsupported_subclaim
partially_supported_subclaim
missing_bridge
conflicting_subclaims
speculative_extension
composition_unstable
requires_human_review
These labels are not the same as ordinary attribution verdicts.
A factual claim can pass when the span supports the sentence.
A synthesis claim passes only when the source-backed atoms are supported and the join is modest enough to stand as interpretation.
The router can then act:
| Case | Meaning | Action |
|---|---|---|
| all subclaims supported, bridge modest | synthesis is source-backed and cautiously framed | accept with composition note |
| all subclaims supported, join too strong | parts are true but conclusion overreaches | repair / human review |
| one or more subclaims partially supported | foundation is weaker than the synthesis implies | soften or request stronger evidence |
| one or more subclaims unsupported | foundation is weak | abstain or request evidence |
| subclaims contradict each other | composition unstable | refute / hard flag |
| bridge claim unsupported | parts are sourced but connection is not justified | add bridge evidence or mark as interpretation |
| join requires missing bridge claim | argument has a gap | add bridge or send to review |
| synthesis extends beyond evidence | conclusion is speculative | mark as speculation / soften |
| synthesis is novel but bridge is modest | author makes a reasonable connection from supported parts | accept as supported interpretation |
| high-impact synthesis | conclusion carries thesis-level weight | require human review |
This is the difference between evidence and argument.
Evidence can show that the parts exist.
Argument explains why they belong together.
The Evidence Engine should not erase that distinction.
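A router for synthesis claims could be sketched as a short precedence chain. The verdict labels come from the vocabulary above; the function name, the `join_strength` values, and the ordering of the checks are assumptions. In particular, `join_strength` is a judgment supplied by a reviewer or a separate model, not something this function computes.

```python
SUPPORTED = "supports"
PARTIAL = "partially_supports"


def composition_verdict(
    subclaim_verdicts: list[str],
    *,
    subclaims_conflict: bool,
    bridge_stated: bool,
    join_strength: str,  # "modest", "overstrong", or "speculative" (assumed labels)
) -> str:
    """Pick a composition verdict, checking the foundation before the join."""
    if subclaims_conflict:
        return "conflicting_subclaims"
    if any(v not in (SUPPORTED, PARTIAL) for v in subclaim_verdicts):
        return "unsupported_subclaim"
    if PARTIAL in subclaim_verdicts:
        return "partially_supported_subclaim"
    if not bridge_stated:
        return "missing_bridge"
    if join_strength == "overstrong":
        return "overstrong_join"
    if join_strength == "speculative":
        return "speculative_extension"
    return "supported_interpretation"
```

The ordering encodes the rule from the table: a join is only evaluated once every atom underneath it has held up.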
10.4 Composition Is Contestable
Marking a claim as authorial synthesis is not a free pass.
It changes the question from:
Is this sourced?
to:
Is this reasoning sound?
That second question is harder, and the engine cannot fully answer it.
It can show the parts are real.
It can show which spans support the parts.
It can show where the bridge claims are.
It can flag an overstrong join.
It can request human review.
But whether the synthesis is good remains an intellectual judgment.
A synthesis you own is still one you can be wrong about.
That is why high-impact synthesis claims should not be silently blessed by the system. They should be marked, reviewed, and made explicit.
For Writer, this matters because technical blog posts often live on synthesis. The most interesting claims are rarely copied from one paper. They are built by connecting papers, code, benchmarks, runtime patterns, and practical experience.
That is good.
But the system must be honest about what is being verified.
It can say:
This paper supports this component.
This benchmark supports this limitation.
This source supports this metric.
This bridge is the author’s interpretation.
This conclusion is a supported synthesis, not a direct source claim.
It should not say:
The papers prove the whole architecture.
That would be evidence quicksand in its most respectable form: every citation real, every part supported, and the conclusion still unearned.
The most dangerous overclaims are not always unsourced.
Sometimes they are well-sourced parts welded into a conclusion the sources never made.
The parts can be fully supported while the synthesis remains an overreach.
The Evidence Engine must encode that difference structurally, not hope a model learns it later.
A trustworthy writing system should not merely attach citations to conclusions.
It should show which parts are sourced, which parts are inferred, and where the author is making the intellectual move.
11. Evidence as a First-Class Object
In this architecture, evidence is not a string or a citation.
It is a structured runtime object that the system can hash, re-verify, invalidate, repair, replay, and, after validation, learn from.
That shift is what makes the rest of the Evidence Engine possible.
A traditional citation points outward:
See paper X, section Y.
An Evidence Engine needs something stronger:
This claim was checked under this warrant.
This operation found this source window.
This span was extracted from this byte range.
This verifier assigned this support verdict.
This policy router accepted, repaired, refuted, or abstained.
This evidence remains valid only while the claim, source, warrant, and verifier relationship still holds.
Evidence therefore has to become part of the runtime state.
A minimal evidence object should include:
claim_id
claim_hash
claim_type / modality
source_id
source_version_id
source_path
span_text
span_hash
source_hash
location
byte_range / char_range
search_trace_id
warrant_id
warrant_version
verifier_version
extractor_version
support_score
drift_score
verification_result
refutation_result
policy_action
repair_history
retention_policy
redaction_status
human_validation_status
training_eligibility
created_at
verified_at
stale_status
schema_version
That looks like a lot, but each field exists because evidence is a relationship, not a blob of text.
An evidence object does not prove truth in the abstract.
It records a bounded evidentiary relationship:
this claim
checked under this warrant
against this source version
using this extractor and verifier
received this verdict
at this time
That is the unit the system can audit.
11.1 EvidenceSpan and EvidenceSet
The Evidence Engine needs at least two related evidence types.
An EvidenceSpan anchors one claim to one source region.
An EvidenceSet anchors a composed claim to multiple spans plus bridge claims, a composition note, and a composition verdict.
EvidenceSpan =
one claim
+ one source region
+ one support verdict
EvidenceSet =
synthesis claim
+ atomic subclaims
+ supporting spans
+ bridge claims
+ composition verdict
+ composition note
+ review status
This matters because different claim types need different evidence shapes.
An atomic factual claim may point to one span.
A benchmark comparison may need a table region plus a metric definition.
A synthesis claim may need multiple spans plus an explicit authorial join.
A code claim may need a symbol, file path, commit hash, and test result.
A long-lived writing system cannot treat all of those as plain text citations.
It needs evidence objects.
11.2 Hashes, Locations, and Staleness
A span should be anchored by content hash, with offsets used as relocation hints.
Offsets drift when files are edited.
Line numbers drift when paragraphs move.
PDF extraction may change when the parser changes.
Markdown sections may shift when the author revises the document.
The span hash is the real evidence anchor. It answers:
Does this exact source text still exist?
The source hash is a change detector. It answers:
Did the surrounding source document change?
Both matter, but they do different jobs.
A source hash mismatch does not automatically mean the evidence is false or gone. It may only mean something else in the file changed. The correct response is to re-check the evidence object.
A span hash match means the exact text still exists somewhere in the source.
A span hash miss means the original text was changed, removed, or transformed.
A location tells the engine where the span was found.
A hash tells the engine whether the span is still the same.
Both are needed.
{
"source_path": "papers/citevqa.pdf",
"source_version_id": "git:9f2a...",
"byte_range": [48211, 48267],
"span_hash": "sha256:...",
"source_hash": "sha256:...",
"claim_hash": "sha256:...",
"claim_id": "claim_123",
"warrant_id": "warrant_456",
"verifier_version": "attrib-v0.3.1"
}
If the byte range still points to the same text and the hash matches, the evidence remains anchored.
If the source changed but the exact span hash can still be located elsewhere, the engine can re-anchor the evidence and re-run verification.
If the span cannot be found exactly, the engine may attempt conservative relocation.
But fuzzy relocation is not a safe automatic repair.
A relocated span is a candidate, not a confirmed anchor. It may land on similar-but-not-identical text. It may find a repeated phrase in a different context. It may preserve words while losing the evidentiary force.
So the rule should be:
relocate candidate
→ re-verify against claim
→ accept repair only if attribution still passes
If the relocated span falls below the configured similarity threshold, or if the verifier no longer supports the claim, the evidence becomes stale.
The system should not silently reuse stale evidence.
It should mark the claim for re-verification.
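The exact-match half of that rule is easy to sketch. The code below assumes the evidence object stores a SHA-256 of the span bytes plus a byte range; `check_anchor` and its status strings are hypothetical. It handles only exact relocation and leaves fuzzy relocation out on purpose.

```python
import hashlib


def sha256_bytes(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def check_anchor(source: bytes, byte_range: tuple[int, int], span_hash: str) -> dict:
    """Check a stored span against the current source bytes.

    1. If the stored offsets still hash to span_hash, the span is anchored.
    2. Otherwise slide a window of the same length over the source and look
       for an exact hash match. A hit is only a candidate until re-verified.
    3. If nothing matches, the evidence is stale.
    """
    start, end = byte_range
    length = end - start
    if length <= 0:
        return {"status": "stale", "byte_range": None}
    if sha256_bytes(source[start:end]) == span_hash:
        return {"status": "anchored", "byte_range": (start, end)}
    # Brute-force scan: fine for a sketch, too slow for large sources.
    for offset in range(len(source) - length + 1):
        if sha256_bytes(source[offset:offset + length]) == span_hash:
            return {
                "status": "re_anchored_pending_verification",
                "byte_range": (offset, offset + length),
            }
    return {"status": "stale", "byte_range": None}
```

Even an exact re-anchor returns a pending status rather than a clean one, because the same words in a new location may no longer carry the claim.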
11.3 Attribution Diff
Once evidence is structured, Writer can do something ordinary citations cannot do.
It can produce an attribution diff.
Since last run:
- 3 claims still anchored
- 2 claims lost evidence
- 1 source span changed
- 2 claims changed and need re-verification
- 1 verifier decision changed after model update
- 1 synthesis claim has a stale bridge
- 2 repairs are awaiting review
This is essential for long-lived writing projects, where sources, drafts, and verifiers evolve over time.
A blog post, book chapter, research note, or technical report may live for months or years. Sources change. Local notes are edited. Drafts are reorganized. Tables are updated. PDFs are re-parsed. Code moves across files. Verifier models improve.
A normal citation does not know any of that.
An evidence object can track those changes because it stores the claim, source version, span hash, warrant, verifier result, and policy action together.
The system can ask:
Is the span still present?
Is the source hash unchanged?
Did the claim text change?
Did the warrant change?
Did the verifier version change?
Did the support verdict change?
Did the refutation result change?
Did a repair invalidate the old evidence?
That gives the writer a maintenance loop, not just a publication checklist.
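An attribution diff is, at its simplest, a comparison of two snapshots keyed by claim id. The sketch below assumes each snapshot maps a claim id to a dict with `claim_hash`, `span_hash`, and `verdict` keys; the function name and report categories are illustrative and cover only a few of the questions above.

```python
def attribution_diff(
    previous: dict[str, dict], current: dict[str, dict]
) -> dict[str, list[str]]:
    """Classify each previously verified claim by what changed since last run."""
    report: dict[str, list[str]] = {
        "still_anchored": [],
        "lost_evidence": [],
        "claim_changed": [],
        "span_changed": [],
        "verdict_changed": [],
    }
    for claim_id, old in previous.items():
        new = current.get(claim_id)
        if new is None or new.get("span_hash") is None:
            report["lost_evidence"].append(claim_id)
        elif new["claim_hash"] != old["claim_hash"]:
            report["claim_changed"].append(claim_id)    # needs re-verification
        elif new["span_hash"] != old["span_hash"]:
            report["span_changed"].append(claim_id)
        elif new["verdict"] != old["verdict"]:
            report["verdict_changed"].append(claim_id)  # e.g. verifier update
        else:
            report["still_anchored"].append(claim_id)
    return report
```

A fuller version would also compare warrant and verifier versions, but the shape stays the same: every claim lands in exactly one bucket.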
11.4 Evidence Objects Are Accreted
An evidence object is not created in a single step.
Each stage of the pipeline contributes fields, scores, and metadata until the object is complete.
The warrant compiler contributes:
claim_id
claim_hash
claim_type
warrant_id
warrant_version
allowed scope
retention policy
The search kernel contributes:
source path
query terms
operation trace
candidate windows
files touched
read budget consumed
The span extractor contributes:
span text or redacted span marker
byte range
span hash
source location
source version
extractor version
extraction confidence
The attribution verifier contributes:
support label
support score
drift estimate
rationale
verifier version
The refutation pass contributes:
counter-query trace
counter-evidence spans
refutation outcome
counter-evidence verdict
The policy router contributes:
accept / repair / refute / abstain
repair suggestion
review requirement
final claim status
The evidence memory contributes:
stale status
replay status
human validation
repair history
training eligibility
schema version
This accretion should be audit-safe.
Each layer should append to the evidence record or create a new version rather than silently overwriting earlier fields. If a repair changes the claim, if a re-search finds a better span, or if a new verifier changes the verdict, the object should preserve the history of how that state changed.
That makes the evidence object both:
an audit artifact
and:
a state object
It can be rendered for humans, replayed for tests, used for calibration, or promoted into training data after validation.
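Audit-safe accretion can be modeled as an append-only log with a derived current view. The sketch below is one possible shape, with hypothetical names (`EvidenceRecord`, `contribute`, `current`); the stage labels follow the pipeline stages above.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class EvidenceRecord:
    claim_id: str
    history: list[dict[str, Any]] = field(default_factory=list)

    def contribute(self, stage: str, **fields: Any) -> None:
        """Append a stage's contribution. Earlier entries are never edited."""
        self.history.append({"stage": stage, "fields": dict(fields)})

    def current(self) -> dict[str, Any]:
        """Fold the history into the latest value of every field."""
        state: dict[str, Any] = {}
        for entry in self.history:
            state.update(entry["fields"])
        return state
```

When a new verifier changes a verdict, the old verdict stays in `history` and only the folded view moves, which is what makes verdict drift visible later.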
11.5 Evidence Memory and Privacy
Evidence memory must store only what is necessary for audit and re-verification.
It must never become a secondary corpus.
That would violate the entire warrant idea.
The default should be to store the minimum needed:
claim id
claim hash
warrant id
source id
source version
operation trace
offsets
hashes
verdicts
scores
policy actions
Raw spans should be stored only when the warrant permits it.
Sensitive windows should not be retained just because they were inspected.
If a warrant allowed the engine to read a bounded source window for verification, that does not automatically mean the engine is allowed to store that window forever.
This distinction matters: permission to inspect does not grant permission to store.
A privacy-aware evidence object should separate:
what was inspected
what was retained
what was hashed
what was returned
what was redacted
what expires
If the warrant permits inspection but not retention, the engine can store hashes, offsets, verdicts, and policy metadata while discarding raw text after the run.
But even hashes and traces can leak information.
A hash can become a membership oracle if an attacker can guess the sensitive string and compare hashes. A query log can leak intent even when no raw span is stored. A trace that says search_phrase("patient HIV status") is already sensitive.
So retention discipline must cover:
raw spans
query terms
search traces
hashes
paths
metadata
This keeps the trace useful without turning audit memory into a data leak.
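One way to apply that discipline is to filter what gets persisted by retention policy and to replace plain hashes with keyed ones. The sketch below is an assumption on both counts: the policy strings match the retention values used elsewhere in this post, while `retained_view` and the HMAC choice are illustrative. A keyed hash makes guess-and-compare harder for anyone without the key, but it does not remove the risk.

```python
import hashlib
import hmac


def retained_view(
    span_text: str, query_terms: list[str], retention: str, secret_key: bytes
) -> dict:
    """Return only what the warrant's retention policy allows us to persist."""
    # Keyed hash (HMAC-SHA256) instead of a bare SHA-256 of the span.
    keyed_hash = hmac.new(
        secret_key, span_text.encode("utf-8"), hashlib.sha256
    ).hexdigest()
    record = {"span_hmac": keyed_hash, "span_text": None, "query_terms": None}
    if retention == "store_span":
        record["span_text"] = span_text
        record["query_terms"] = list(query_terms)
    elif retention == "store_redacted_span":
        record["span_text"] = "[REDACTED]"
    elif retention != "metadata_only":
        raise ValueError(f"unknown retention policy: {retention}")
    return record
```

Query terms are dropped by default along with the raw span, since a stored query can be as revealing as the text it matched.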
11.6 Replay and Incremental Verification
Once evidence objects exist, Writer can replay them.
A replay test asks:
Given the same claim, warrant, source version, span, and verifier version,
do we get the same verdict?
A migration test asks:
Given the same evidence object and a new verifier version,
does the verdict change?
An incremental verification pass asks:
Which claims actually need to be rechecked?
This is where the build-system metaphor becomes literal.
The engine should maintain a dependency index:
source → evidence objects
claim → evidence objects
warrant → evidence objects
verifier version → evidence objects
When a source changes, the system finds the evidence objects that depend on it.
When a claim changes, the system re-checks only the affected evidence.
When a verifier changes, the system replays old verdicts and reports verdict drift.
This avoids full recomputation.
It gives Writer incremental evidence builds.
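The dependency index is essentially a reverse map from each dependency to the evidence objects that rely on it. A minimal sketch, with hypothetical class and method names:

```python
from collections import defaultdict


class DependencyIndex:
    """Reverse index: (kind, value) -> ids of evidence objects that depend on it."""

    def __init__(self) -> None:
        self._by_key: dict[tuple[str, str], set[str]] = defaultdict(set)

    def register(
        self, evidence_id: str, *, source: str, claim: str,
        warrant: str, verifier_version: str,
    ) -> None:
        deps = {
            "source": source,
            "claim": claim,
            "warrant": warrant,
            "verifier_version": verifier_version,
        }
        for kind, value in deps.items():
            self._by_key[(kind, value)].add(evidence_id)

    def affected_by(self, kind: str, value: str) -> set[str]:
        """Evidence objects to re-check when the given dependency changes."""
        return set(self._by_key.get((kind, value), set()))
```

When `papers/citevqa.pdf` changes, `affected_by("source", "papers/citevqa.pdf")` returns just the evidence objects that need re-checking, and everything else keeps its last verdict.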
11.7 Living Evidence Links
A blog post, book, or research note should not merely have citations.
It should have evidence links that can be rechecked, invalidated, repaired, and reused.
That is what I mean by living evidence links.
They are not decorative.
They are active dependencies between claims, source versions, spans, warrants, verifier decisions, and repairs.
If a source changes, the dependency can break.
If a claim changes, the dependency can become too weak.
If a verifier improves, the old verdict can be replayed.
If a synthesis claim grows stronger than its bridge, the composition can be marked for review.
This is how an Evidence Engine turns writing into something closer to a build system.
The draft is the source code.
The claims are compilation units.
The warrants define the allowed dependencies.
The evidence objects are the linked artifacts.
The verifier is the type checker for claim-evidence compatibility.
The policy router emits diagnostics.
The attribution diff tells you what broke since the last build.
The metaphor holds for structure: dependencies, incremental rebuilds, diagnostics, and stateful artifacts.
It does not mean the verifier is perfectly sound.
Unlike a real type checker, this verifier can be wrong. That is why evidence objects must be auditable and replayable, not trusted on sight.
That is the practical payoff.
Not just citations.
Evidence objects.
Not just sources.
Living links between claims, source versions, spans, warrants, verdicts, and repairs.
A failed evidence build does not mean the article is bad.
It means some claims no longer compile against the available evidence.
12. Implementation: A Minimal Intelligent Grep
Now we can make the architecture concrete.
This implementation does not replicate GrepSeek. It does not train a GRPO policy. It does not implement ACL-Verbatim’s full token-classifier span extraction. It does not reproduce CiteVQA’s multimodal attribution benchmark.
That is not the goal.
The goal is to build the smallest useful version of the Evidence Engine:
local text corpus
claim objects
search warrants
restricted grep-like operations
candidate windows
verbatim spans
attribution verdicts
evidence records
audit traces
This MVP demonstrates the architectural point:
No claim enters search without a warrant.
No corpus access happens outside the search kernel.
No evidence span is generated from model prose.
No attribution verdict is hidden inside prose.
No failed search disappears without a trace.
The implementation is intentionally conservative.
That is a feature.
The safest first version of intelligent grep should not begin with an unconstrained agent and a shell. It should begin with typed objects, scoped operations, explicit budgets, traceable failures, and privacy-aware persistence.
This MVP assumes a corpus of local .md, .txt, .py, .yaml, .yml, and .json files. PDFs must be pre-extracted into text first. Production PDF support would need parser-native page, region, and table coordinates.
12.1 Project Shape
A minimal implementation can live in a small package:
evidence_engine/
dto.py
search_kernel.py
span_extractor.py
verifier.py
policy.py
engine.py
The modules map directly to the architecture:
| Module | Responsibility |
|---|---|
| dto.py | claims, warrants, traces, spans, verdicts, evidence records |
| search_kernel.py | restricted, warrant-governed corpus operations |
| span_extractor.py | verbatim extraction from candidate windows |
| verifier.py | MVP attribution scoring |
| policy.py | verdict-to-action routing |
| engine.py | end-to-end claim verification loop |
This is not yet a production system.
It is enough to demonstrate the loop.
12.2 DTOs
The DTO layer matters because it prevents the implementation from collapsing back into loose strings.
Claims, warrants, spans, traces, and evidence records are different objects.
```python
# evidence_engine/dto.py
from __future__ import annotations

import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from pathlib import Path
from typing import Any, Literal
from uuid import uuid4


def now_utc() -> datetime:
    return datetime.now(timezone.utc)


def new_id(prefix: str) -> str:
    return f"{prefix}_{uuid4().hex[:12]}"


def sha256_text(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class ClaimModality(str, Enum):
    FACTUAL = "factual"
    METRIC = "metric"
    DEFINITIONAL = "definitional"
    AUTHOR_DEFINED = "author_defined"
    INTERPRETIVE = "interpretive"
    SPECULATIVE = "speculative"
    SYNTHESIS = "synthesis"


class SupportLabel(str, Enum):
    SUPPORTS = "supports"
    PARTIAL = "partially_supports"
    DOES_NOT_SUPPORT = "does_not_support"
    CONTRADICTS = "contradicts"
    TOO_VAGUE = "too_vague"
    NEEDS_MULTIPLE_SPANS = "needs_multiple_spans"
    NOT_IN_CORPUS = "not_in_corpus"
    OUT_OF_SCOPE = "out_of_scope"
    ABSTAIN = "abstain"


class PolicyAction(str, Enum):
    ACCEPT = "accept"
    REPAIR = "repair"
    REFUTE = "refute"
    ABSTAIN = "abstain"
    REVIEW = "review"


class RetentionPolicy(str, Enum):
    METADATA_ONLY = "metadata_only"
    STORE_SPAN = "store_span"
    STORE_REDACTED_SPAN = "store_redacted_span"


@dataclass(frozen=True)
class EvidenceClaim:
    id: str
    text: str
    source_path: str = ""
    char_start: int = 0
    char_end: int = 0
    claim_type: str = "unknown"
    modality: ClaimModality = ClaimModality.FACTUAL
    risk: Literal["low", "medium", "high"] = "medium"

    @property
    def claim_hash(self) -> str:
        return sha256_text(self.text)

    @property
    def needs_evidence(self) -> bool:
        return self.modality in {
            ClaimModality.FACTUAL,
            ClaimModality.METRIC,
            ClaimModality.DEFINITIONAL,
            ClaimModality.SYNTHESIS,
        }


@dataclass(frozen=True)
class SearchWarrant:
    id: str
    claim_id: str
    purpose: str
    allowed_scopes: list[Path]
    allowed_operations: list[str]
    query_terms: list[str]
    excluded_dir_names: set[str] = field(
        default_factory=lambda: {".git", "private", "secrets"}
    )
    max_files_touched: int = 20
    max_search_ops: int = 12
    max_content_return_chars: int = 4000
    log_query_terms: bool = False
    return_policy: str = "supporting_spans_only"
    retention_policy: RetentionPolicy = RetentionPolicy.METADATA_ONLY
    expires_after_run: bool = True
    created_at: datetime = field(default_factory=now_utc)
    warrant_version: str = "warrant-v0.1"


@dataclass
class SearchStep:
    id: str
    warrant_id: str
    claim_id: str
    operation: str
    arguments: dict[str, Any]
    result_summary: str
    files_touched: list[str] = field(default_factory=list)
    chars_read: int = 0
    timestamp: datetime = field(default_factory=now_utc)


@dataclass
class CandidateWindow:
    id: str
    claim_id: str
    warrant_id: str
    source_path: Path
    text: str
    char_start: int
    char_end: int
    query_term_hash: str
    source_hash: str
    search_trace: list[SearchStep] = field(default_factory=list)


@dataclass
class EvidenceSpan:
    id: str
    claim_id: str
    warrant_id: str
    source_path: Path
    text: str
    char_start: int
    char_end: int
    span_hash: str
    source_hash: str
    extractor_version: str = "verbatim-window-v0.1"
    confidence: float = 0.0
    search_trace: list[SearchStep] = field(default_factory=list)


@dataclass
class AttributionResult:
    claim_id: str
    span_id: str | None
    label: SupportLabel
    support_score: float
    drift_score: float
    reason: str
    suggested_rewrite: str | None = None
    verifier_version: str = "simple-overlap-v0.1"


@dataclass
class EvidenceRecord:
    id: str
    claim_id: str
    claim_hash: str
    warrant_id: str
    source_path: str | None
    span_id: str | None
    span_hash: str | None
    source_hash: str | None
    retained_span_text: str | None
    redaction_status: str
    verification_result: AttributionResult
    policy_action: PolicyAction
    retention_policy: RetentionPolicy
    search_trace: list[SearchStep] = field(default_factory=list)
    repair_history: list[str] = field(default_factory=list)
    stale_status: str = "fresh"
    human_validation_status: str = "pending"
    training_eligibility: bool = False
    schema_version: str = "evidence-record-v0.1"
    created_at: datetime = field(default_factory=now_utc)
```
Notice the important distinction:
EvidenceSpan = source-anchored text used during verification
EvidenceRecord = claim + warrant + span hash + verifier + policy decision
The system should not confuse the source text with the decision made about it.
12.3 Restricted Search Kernel
The search kernel is the trust boundary.
The model, planner, or calling code may propose a search. Only the kernel touches the corpus.
This is not raw shell access.
It is a restricted command set.
```python
# evidence_engine/search_kernel.py
from __future__ import annotations

import re
from pathlib import Path
from typing import Iterable

from .dto import (
    CandidateWindow,
    SearchStep,
    SearchWarrant,
    new_id,
    sha256_text,
)


class WarrantViolation(Exception):
    pass


class ScopedSearchKernel:
    """
    GrepSeek-inspired corpus interaction governed by SearchWarrant.

    This is not raw shell execution. It is restricted, auditable,
    grep-like search over approved files.

    Production systems still need hardened filesystem policies, parser
    sandboxes, retention controls, and tested exclusion logic.
    """

    ALLOWED_EXTENSIONS = {".md", ".txt", ".py", ".yaml", ".yml", ".json"}

    def __init__(self, warrant: SearchWarrant):
        self.warrant = warrant
        self.trace: list[SearchStep] = []
        self.files_touched: set[str] = set()
        self.ops_used = 0

    def _check_operation(self, operation: str) -> None:
        if operation not in self.warrant.allowed_operations:
            raise WarrantViolation(f"{operation} not allowed by warrant.")
        if self.ops_used >= self.warrant.max_search_ops:
            raise WarrantViolation("Search operation budget exhausted.")

    def _can_touch(self, path: Path) -> bool:
        # Only new files count against the file budget. Without this,
        # a search that fills the budget would block every later
        # read_window, even on files it has already scanned.
        if str(path) in self.files_touched:
            return True
        return len(self.files_touched) < self.warrant.max_files_touched

    def _is_excluded(self, path: Path) -> bool:
        resolved = path.resolve()
        return any(part in self.warrant.excluded_dir_names for part in resolved.parts)

    def _is_in_scope(self, path: Path) -> bool:
        if path.is_symlink():
            return False
        try:
            resolved = path.resolve()
        except OSError:
            return False
        allowed = any(
            resolved == scope.resolve() or resolved.is_relative_to(scope.resolve())
            for scope in self.warrant.allowed_scopes
        )
        if not allowed:
            return False
        if self._is_excluded(resolved):
            return False
        return True

    def _iter_files(self) -> Iterable[Path]:
        for scope in self.warrant.allowed_scopes:
            if not scope.exists() or scope.is_symlink():
                continue
            if scope.is_file():
                if scope.suffix.lower() in self.ALLOWED_EXTENSIONS and self._is_in_scope(scope):
                    yield scope
                continue
            # Sorted so that traversal order, and therefore the trace,
            # is reproducible across runs.
            for path in sorted(scope.rglob("*")):
                if path.is_symlink():
                    continue
                if not path.is_file():
                    continue
                if path.suffix.lower() not in self.ALLOWED_EXTENSIONS:
                    continue
                if self._is_in_scope(path):
                    yield path

    def _trace_args(self, arguments: dict[str, object]) -> dict[str, object]:
        """
        Query terms can leak sensitive information.
        By default, log hashes rather than raw query strings.
        """
        if self.warrant.log_query_terms:
            return arguments
        redacted: dict[str, object] = {}
        for key, value in arguments.items():
            if isinstance(value, str):
                redacted[f"{key}_hash"] = sha256_text(value)
            else:
                redacted[key] = value
        return redacted

    def _log(
        self,
        operation: str,
        arguments: dict[str, object],
        summary: str,
        files: list[str],
        chars_read: int = 0,
    ) -> None:
        self.ops_used += 1
        self.trace.append(
            SearchStep(
                id=new_id("step"),
                warrant_id=self.warrant.id,
                claim_id=self.warrant.claim_id,
                operation=operation,
                arguments=self._trace_args(arguments),
                result_summary=summary,
                files_touched=files,
                chars_read=chars_read,
            )
        )

    def search_phrase(
        self,
        phrase: str,
        case_sensitive: bool = False,
    ) -> list[tuple[Path, list[tuple[int, int]]]]:
        self._check_operation("search_phrase")
        flags = 0 if case_sensitive else re.IGNORECASE
        pattern = re.compile(re.escape(phrase), flags=flags)
        results: list[tuple[Path, list[tuple[int, int]]]] = []
        scanned: list[str] = []
        skipped = 0
        for path in self._iter_files():
            if not self._can_touch(path):
                skipped += 1
                continue
            try:
                text = path.read_text(encoding="utf-8", errors="replace")
            except OSError:
                continue
            # Count every scanned file as touched, not only files with matches.
            self.files_touched.add(str(path))
            scanned.append(str(path))
            matches = [(m.start(), m.end()) for m in pattern.finditer(text)]
            if matches:
                results.append((path, matches))
        summary = f"Found matches in {len(results)} file(s)."
        if skipped:
            summary += f" Skipped {skipped} file(s): file budget exhausted."
        self._log(
            operation="search_phrase",
            arguments={"phrase": phrase, "case_sensitive": case_sensitive},
            summary=summary,
            # Record only the files this step scanned, not the running total.
            files=sorted(scanned),
        )
        return results

    def read_window(
        self,
        path: Path,
        center: int,
        query_term: str,
        radius: int = 1000,
    ) -> CandidateWindow:
        self._check_operation("read_window")
        if not self._is_in_scope(path):
            raise WarrantViolation(f"Path outside warrant scope: {path}")
        if path.suffix.lower() not in self.ALLOWED_EXTENSIONS:
            raise WarrantViolation(f"File type not allowed: {path.suffix}")
        if not self._can_touch(path):
            raise WarrantViolation("File touch budget exhausted.")
        text = path.read_text(encoding="utf-8", errors="replace")
        center = max(0, min(center, len(text)))
        start = max(0, center - radius)
        end = min(len(text), center + radius)
        window = text[start:end]
        if len(window) > self.warrant.max_content_return_chars:
            window = window[: self.warrant.max_content_return_chars]
            end = start + len(window)
        self.files_touched.add(str(path))
        self._log(
            operation="read_window",
            arguments={"path": str(path), "center": center, "radius": radius},
            summary=f"Read {len(window)} chars from {path.name}.",
            files=[str(path)],
            chars_read=len(window),
        )
        return CandidateWindow(
            id=new_id("window"),
            claim_id=self.warrant.claim_id,
            warrant_id=self.warrant.id,
            source_path=path,
            text=window,
            char_start=start,
            char_end=end,
            query_term_hash=sha256_text(query_term),
            source_hash=sha256_text(text),
            search_trace=list(self.trace),
        )
```
There are no shell commands here.
The only operations are the ones the warrant allows.
One caveat: this MVP uses Python character offsets, not byte offsets. A production system should also store byte offsets or parser-native coordinates, especially for PDFs and extracted documents.
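To illustrate the caveat, here is a minimal sketch of converting a character span into UTF-8 byte offsets. The helper name `char_span_to_byte_span` is hypothetical and not part of the MVP. It also assumes the text in memory matches the bytes on disk, which is not guaranteed here: the kernel reads with `errors="replace"`, and Python's text mode translates `\r\n` to `\n` by default.

```python
from __future__ import annotations


def char_span_to_byte_span(text: str, char_start: int, char_end: int) -> tuple[int, int]:
    """Convert a [char_start, char_end) slice of text into UTF-8 byte offsets."""
    if not 0 <= char_start <= char_end <= len(text):
        raise ValueError("Character span is outside the text.")
    # Bytes before the span, then bytes inside the span.
    byte_start = len(text[:char_start].encode("utf-8"))
    byte_end = byte_start + len(text[char_start:char_end].encode("utf-8"))
    return byte_start, byte_end


# Illustrative text: two non-ASCII characters precede the number.
sample = "na\u00efve caf\u00e9: 711 PDFs"
char_start = sample.index("711")
char_end = char_start + len("711")
byte_start, byte_end = char_span_to_byte_span(sample, char_start, char_end)

raw = sample.encode("utf-8")
print((char_start, char_end), (byte_start, byte_end))
print(raw[byte_start:byte_end].decode("utf-8"))
```

The two offset pairs differ as soon as any multi-byte character appears before the span, which is why an evidence record that stores only character offsets is tied to one particular decoding of the file.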
12.4 Verbatim Span Extractor
The first span extractor should be simple.
It should not summarize.
It should not paraphrase.
It should return source text.
```python
# evidence_engine/span_extractor.py
from __future__ import annotations

from .dto import CandidateWindow, EvidenceClaim, EvidenceSpan, SearchWarrant, new_id, sha256_text


class VerbatimSpanExtractor:
    """
    ACL-Verbatim-inspired extractor.

    MVP behavior:
    - receive a candidate window
    - return exact source text
    - never paraphrase
    """

    extractor_version = "verbatim-window-v0.1"

    def extract(
        self,
        claim: EvidenceClaim,
        warrant: SearchWarrant,
        window: CandidateWindow,
    ) -> EvidenceSpan:
        # MVP: return the trimmed candidate window.
        # Production extractors should preserve exact byte offsets and
        # selected token boundaries.
        span_text = window.text.strip()
        if not span_text:
            raise ValueError("Cannot extract empty evidence span.")
        # Avoid window.text.index(span_text), because repeated text can
        # corrupt offsets. For the MVP, only whitespace trimming is allowed.
        local_start = len(window.text) - len(window.text.lstrip())
        char_start = window.char_start + local_start
        char_end = char_start + len(span_text)
        return EvidenceSpan(
            id=new_id("span"),
            claim_id=claim.id,
            warrant_id=warrant.id,
            source_path=window.source_path,
            text=span_text,
            char_start=char_start,
            char_end=char_end,
            span_hash=sha256_text(span_text),
            source_hash=window.source_hash,
            extractor_version=self.extractor_version,
            confidence=0.70,
            search_trace=list(window.search_trace),
        )
```
Returning the full window is intentionally unoptimized.
The invariant is extractive fidelity, not precision tuning.
The extractor is allowed to be imprecise in the MVP. It is not allowed to invent text.
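The "never invent text" invariant is cheap to check mechanically. The sketch below is one possible guard, not part of the MVP: `is_verbatim` is a hypothetical helper that accepts a span only if it equals the source slice at its recorded offsets and its hash matches. The source sentence is illustrative text, not a quotation from any paper.

```python
from __future__ import annotations

import hashlib


def sha256_text(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def is_verbatim(
    source_text: str,
    span_text: str,
    char_start: int,
    char_end: int,
    span_hash: str,
) -> bool:
    """True only if the span is an exact slice of the source at its offsets."""
    if not 0 <= char_start <= char_end <= len(source_text):
        return False
    if source_text[char_start:char_end] != span_text:
        return False
    return sha256_text(span_text) == span_hash


source = "CiteVQA contains 1,897 questions across 711 PDFs."

exact = "1,897 questions"
start = source.index(exact)
exact_ok = is_verbatim(source, exact, start, start + len(exact), sha256_text(exact))

# A "helpful" normalization is already a generated span, not an extracted one.
reworded = "1897 questions"
reworded_ok = is_verbatim(
    source, reworded, start, start + len(reworded), sha256_text(reworded)
)

print(exact_ok, reworded_ok)
```

A check like this could run on every span before it reaches the verifier, so that any later, smarter extractor is still held to the same extractive contract.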
12.5 Attribution Verifier
The verifier below is intentionally weak.
It uses lexical overlap and numeric containment. That is not a production attribution model.
This means drift_score here is lexical drift, not true semantic Hallucination Energy. A faithful paraphrase may score poorly. A lexical overclaim may score too well. That limitation is exactly why a production system needs a calibrated NLI model, cross-encoder, or human-reviewed verifier.
But this MVP is enough to demonstrate the policy boundary:
related text is not automatically support
numbers must be preserved
overclaims should be softened
weak evidence should not be accepted
```python
# evidence_engine/verifier.py
from __future__ import annotations

import re

from .dto import AttributionResult, EvidenceClaim, EvidenceSpan, SupportLabel


class SimpleAttributionVerifier:
    """
    CiteVQA-inspired verifier.

    MVP:
    - lexical overlap
    - numeric containment
    - simple overclaim hooks

    Later:
    - replace with NLI model, cross-encoder, or calibrated LLM judge
    - calibrate against human-labelled claim/span pairs
    """

    verifier_version = "simple-overlap-v0.1"

    STOPWORDS = {
        "the", "and", "for", "with", "that", "this", "from", "into",
        "have", "has", "had", "are", "was", "were", "not", "but",
        "does", "did", "can", "could", "should", "would", "will",
    }

    # Strong wording mapped to a softer suggestion for repair hints.
    STRONG_WORDS = {
        "proves": "suggests",
        "solves": "helps address",
        "eliminates": "reduces",
        "replaces": "can complement",
        "always": "often",
        "never": "rarely",
        "all": "many",
    }

    def verify(self, claim: EvidenceClaim, span: EvidenceSpan) -> AttributionResult:
        claim_terms = self._terms(claim.text)
        span_terms = self._terms(span.text)
        if not claim_terms:
            return self._abstain(claim, span, "No usable claim terms.")

        overlap = claim_terms & span_terms
        overlap_ratio = len(overlap) / max(1, len(claim_terms))

        claim_numbers = self._numbers(claim.text)
        span_numbers = self._numbers(span.text)
        numbers_ok = claim_numbers.issubset(span_numbers)

        drift_score = 1.0 - overlap_ratio
        if claim_numbers and not numbers_ok:
            drift_score += 0.25
        if self._has_overclaim(claim.text):
            drift_score += 0.10
        drift_score = min(1.0, drift_score)

        if overlap_ratio > 0.75 and numbers_ok and not self._has_overclaim(claim.text):
            return AttributionResult(
                claim_id=claim.id,
                span_id=span.id,
                label=SupportLabel.SUPPORTS,
                support_score=0.90,
                drift_score=drift_score,
                reason="The span contains most claim terms and preserves numeric details.",
                verifier_version=self.verifier_version,
            )

        if overlap_ratio > 0.45 and numbers_ok:
            return AttributionResult(
                claim_id=claim.id,
                span_id=span.id,
                label=SupportLabel.PARTIAL,
                support_score=0.62,
                drift_score=drift_score,
                reason="The span is related, but the claim may overreach or need narrower wording.",
                suggested_rewrite=self._soften(claim.text),
                verifier_version=self.verifier_version,
            )

        return AttributionResult(
            claim_id=claim.id,
            span_id=span.id,
            label=SupportLabel.DOES_NOT_SUPPORT,
            support_score=0.20,
            drift_score=drift_score,
            reason="The span does not contain enough evidence to carry the claim.",
            verifier_version=self.verifier_version,
        )

    def _terms(self, text: str) -> set[str]:
        # Lowercased words of three or more characters, minus stopwords.
        words = re.findall(r"\b[a-zA-Z][a-zA-Z0-9_-]{2,}\b", text.lower())
        return {w for w in words if w not in self.STOPWORDS}

    def _numbers(self, text: str) -> set[str]:
        # Matches 1,897 | 53.6 | 150M | 50% | 2x
        # Thousands separators are dropped so "1,897" and "1897" compare equal.
        raw = re.findall(r"\d+(?:,\d{3})*(?:\.\d+)?(?:[a-zA-Z%]+)?", text)
        return {number.replace(",", "") for number in raw}

    def _has_overclaim(self, text: str) -> bool:
        lower = text.lower()
        return any(re.search(rf"\b{re.escape(word)}\b", lower) for word in self.STRONG_WORDS)

    def _soften(self, text: str) -> str:
        out = text
        for strong, soft in self.STRONG_WORDS.items():
            out = re.sub(rf"\b{re.escape(strong)}\b", soft, out, flags=re.IGNORECASE)
        return out

    def _abstain(self, claim: EvidenceClaim, span: EvidenceSpan, reason: str) -> AttributionResult:
        return AttributionResult(
            claim_id=claim.id,
            span_id=span.id,
            label=SupportLabel.ABSTAIN,
            support_score=0.0,
            drift_score=1.0,
            reason=reason,
            verifier_version=self.verifier_version,
        )
```
The weakness is deliberate: this verifier defines the interface, not the final attribution model.
It also does not detect contradiction. Contradiction handling is reserved for a later NLI, cross-encoder, or LLM-based verifier.
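To make the weakness concrete, the sketch below reimplements only the overlap ratio, using the same term regex and stopword list, and applies it to two invented claims against one span. The paraphrase and the overclaim are made up for illustration; the span wording is borrowed from the demo test case later in this article.

```python
import re

STOPWORDS = {
    "the", "and", "for", "with", "that", "this", "from", "into",
    "have", "has", "had", "are", "was", "were", "not", "but",
    "does", "did", "can", "could", "should", "would", "will",
}


def terms(text: str) -> set:
    """Lowercased words of three or more characters, minus stopwords."""
    words = re.findall(r"\b[a-zA-Z][a-zA-Z0-9_-]{2,}\b", text.lower())
    return {w for w in words if w not in STOPWORDS}


def overlap_ratio(claim: str, span: str) -> float:
    """Share of claim terms that also appear in the span."""
    claim_terms = terms(claim)
    return len(claim_terms & terms(span)) / max(1, len(claim_terms))


span = "Direct Corpus Interaction can complement existing retrieval paradigms."

# Same meaning, different words: lexical overlap collapses.
faithful_paraphrase = "DCI works alongside current search methods."

# Different meaning, same words: lexical overlap stays high.
lexical_overclaim = "Direct Corpus Interaction can replace existing retrieval paradigms."

paraphrase_score = overlap_ratio(faithful_paraphrase, span)
overclaim_score = overlap_ratio(lexical_overclaim, span)
print(round(paraphrase_score, 3), round(overclaim_score, 3))
```

The overclaim clears the 0.75 threshold, and because "replace" is not in `STRONG_WORDS` (only "replaces" is), the verifier above would label it as supported. That is the gap a semantic verifier has to close.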
The important point is the interface:
claim + evidence span → structured attribution verdict
Once that interface exists, the verifier can improve without changing the rest of the architecture.
Suggested rewrites are only hints. They should not be applied automatically without their own attribution check.
12.6 Policy Router
The router turns attribution into an action.
```python
# evidence_engine/policy.py
from __future__ import annotations

from .dto import AttributionResult, PolicyAction, SupportLabel


class PolicyRouter:
    def route(self, result: AttributionResult) -> PolicyAction:
        if result.label == SupportLabel.SUPPORTS and result.support_score >= 0.80:
            return PolicyAction.ACCEPT
        if result.label == SupportLabel.PARTIAL:
            return PolicyAction.REPAIR
        if result.label == SupportLabel.CONTRADICTS:
            return PolicyAction.REFUTE
        if result.label in {
            SupportLabel.TOO_VAGUE,
            SupportLabel.NEEDS_MULTIPLE_SPANS,
        }:
            return PolicyAction.REVIEW
        if result.label in {
            SupportLabel.NOT_IN_CORPUS,
            SupportLabel.OUT_OF_SCOPE,
            SupportLabel.ABSTAIN,
        }:
            return PolicyAction.ABSTAIN
        # DOES_NOT_SUPPORT and low-scoring SUPPORTS fall through to here.
        return PolicyAction.ABSTAIN
```
A production router would be more nuanced.
It would account for claim risk, modality, refutation results, human review requirements, verifier confidence, and warrant exhaustion.
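As a rough illustration of what "risk-aware" could mean, the sketch below raises the acceptance threshold with claim risk and sends high-risk partial matches to review instead of automatic repair. Everything here is an assumption for illustration: the threshold values, the `route_with_risk` name, and the trimmed-down enums are not part of the MVP.

```python
from enum import Enum


class Label(str, Enum):
    SUPPORTS = "supports"
    PARTIAL = "partially_supports"
    DOES_NOT_SUPPORT = "does_not_support"
    CONTRADICTS = "contradicts"


class Action(str, Enum):
    ACCEPT = "accept"
    REPAIR = "repair"
    REFUTE = "refute"
    ABSTAIN = "abstain"
    REVIEW = "review"


# Hypothetical thresholds: riskier claims need a stronger verdict.
ACCEPT_THRESHOLDS = {"low": 0.80, "medium": 0.85, "high": 0.90}


def route_with_risk(
    label: Label,
    support_score: float,
    risk: str,
    warrant_exhausted: bool = False,
) -> Action:
    if label == Label.CONTRADICTS:
        return Action.REFUTE
    if label == Label.SUPPORTS:
        if support_score >= ACCEPT_THRESHOLDS[risk]:
            return Action.ACCEPT
        # Supported, but not strongly enough for this risk level.
        return Action.REVIEW
    if label == Label.PARTIAL:
        # High-risk claims get a human, not an automatic rewrite.
        return Action.REVIEW if risk == "high" else Action.REPAIR
    if warrant_exhausted and risk == "high":
        # The search stopped early, so absence of support is inconclusive.
        return Action.REVIEW
    return Action.ABSTAIN


print(route_with_risk(Label.SUPPORTS, 0.82, "low"))
print(route_with_risk(Label.SUPPORTS, 0.82, "high"))
print(route_with_risk(Label.PARTIAL, 0.62, "high"))
```

The design choice worth noting is that risk changes the action, never the verdict: the verifier's label stays untouched, and only the routing becomes stricter.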
But even this minimal router prevents the worst failure mode:
related span → confident citation
Related text is not enough.
The router requires a verdict.
12.7 Evidence Engine Loop
Now we can assemble the loop.
```python
# evidence_engine/engine.py
from __future__ import annotations

from pathlib import Path

from .dto import (
    AttributionResult,
    EvidenceClaim,
    EvidenceRecord,
    EvidenceSpan,
    PolicyAction,
    RetentionPolicy,
    SearchStep,
    SearchWarrant,
    SupportLabel,
    new_id,
)
from .policy import PolicyRouter
from .search_kernel import ScopedSearchKernel, WarrantViolation
from .span_extractor import VerbatimSpanExtractor
from .verifier import SimpleAttributionVerifier


class EvidenceEngine:
    def __init__(self, corpus_root: Path):
        self.corpus_root = corpus_root

    def build_warrant(self, claim: EvidenceClaim) -> SearchWarrant:
        query_terms = self._query_terms(claim.text)
        return SearchWarrant(
            id=new_id("warrant"),
            claim_id=claim.id,
            purpose=f"Verify claim before publication: {claim.text}",
            allowed_scopes=[self.corpus_root],
            allowed_operations=["search_phrase", "read_window"],
            query_terms=query_terms,
            max_files_touched=20,
            max_search_ops=12,
            max_content_return_chars=4000,
            log_query_terms=False,
            return_policy="supporting_spans_only",
            retention_policy=RetentionPolicy.METADATA_ONLY,
        )

    def verify_claim(self, claim: EvidenceClaim) -> EvidenceRecord:
        if not claim.needs_evidence:
            result = AttributionResult(
                claim_id=claim.id,
                span_id=None,
                label=SupportLabel.ABSTAIN,
                support_score=0.0,
                drift_score=0.0,
                reason=f"Claim modality {claim.modality.value} does not require evidence.",
            )
            return self._record(
                claim=claim,
                warrant_id="none",
                span=None,
                result=result,
                action=PolicyAction.ABSTAIN,
                retention_policy=RetentionPolicy.METADATA_ONLY,
                trace=[],
            )

        warrant = self.build_warrant(claim)
        kernel = ScopedSearchKernel(warrant)
        extractor = VerbatimSpanExtractor()
        verifier = SimpleAttributionVerifier()
        router = PolicyRouter()

        best_result: AttributionResult | None = None
        best_span: EvidenceSpan | None = None

        try:
            for term in warrant.query_terms:
                hits = kernel.search_phrase(term)
                for path, matches in hits:
                    # Read at most three windows per file per term.
                    for start, _end in matches[:3]:
                        window = kernel.read_window(
                            path=path,
                            center=start,
                            query_term=term,
                            radius=600,
                        )
                        span = extractor.extract(
                            claim=claim,
                            warrant=warrant,
                            window=window,
                        )
                        result = verifier.verify(claim, span)
                        if self._is_better(result, best_result):
                            best_result = result
                            best_span = span
                        action = router.route(result)
                        # Stop at the first span that earns ACCEPT.
                        if action == PolicyAction.ACCEPT:
                            return self._record(
                                claim=claim,
                                warrant_id=warrant.id,
                                span=span,
                                result=result,
                                action=action,
                                retention_policy=warrant.retention_policy,
                                trace=list(kernel.trace),
                            )
        except WarrantViolation as exc:
            # If the warrant is exhausted after finding partial evidence,
            # return the best bounded result rather than losing the trace.
            if best_result is not None:
                action = router.route(best_result)
                return self._record(
                    claim=claim,
                    warrant_id=warrant.id,
                    span=best_span,
                    result=best_result,
                    action=action,
                    retention_policy=warrant.retention_policy,
                    trace=list(kernel.trace),
                )
            result = AttributionResult(
                claim_id=claim.id,
                span_id=None,
                label=SupportLabel.OUT_OF_SCOPE,
                support_score=0.0,
                drift_score=1.0,
                reason=str(exc),
            )
            return self._record(
                claim=claim,
                warrant_id=warrant.id,
                span=None,
                result=result,
                action=PolicyAction.ABSTAIN,
                retention_policy=warrant.retention_policy,
                trace=list(kernel.trace),
            )

        if best_result is not None:
            action = router.route(best_result)
            return self._record(
                claim=claim,
                warrant_id=warrant.id,
                span=best_span,
                result=best_result,
                action=action,
                retention_policy=warrant.retention_policy,
                trace=list(kernel.trace),
            )

        result = AttributionResult(
            claim_id=claim.id,
            span_id=None,
            label=SupportLabel.NOT_IN_CORPUS,
            support_score=0.0,
            drift_score=1.0,
            reason="No candidate evidence found within the warrant scope.",
        )
        return self._record(
            claim=claim,
            warrant_id=warrant.id,
            span=None,
            result=result,
            action=PolicyAction.ABSTAIN,
            retention_policy=warrant.retention_policy,
            trace=list(kernel.trace),
        )

    def _is_better(
        self,
        result: AttributionResult,
        best: AttributionResult | None,
    ) -> bool:
        if best is None:
            return True
        # Higher support wins; lower drift breaks ties.
        return (result.support_score, -result.drift_score) > (
            best.support_score,
            -best.drift_score,
        )

    def _record(
        self,
        claim: EvidenceClaim,
        warrant_id: str,
        span: EvidenceSpan | None,
        result: AttributionResult,
        action: PolicyAction,
        retention_policy: RetentionPolicy,
        trace: list[SearchStep],
    ) -> EvidenceRecord:
        retained_span_text: str | None = None
        redaction_status = "not_retained"
        if span and retention_policy == RetentionPolicy.STORE_SPAN:
            retained_span_text = span.text
            redaction_status = "stored_raw"
        elif span and retention_policy == RetentionPolicy.STORE_REDACTED_SPAN:
            retained_span_text = "[REDACTED_PER_WARRANT_POLICY]"
            redaction_status = "stored_redacted"
        return EvidenceRecord(
            id=new_id("evidence"),
            claim_id=claim.id,
            claim_hash=claim.claim_hash,
            warrant_id=warrant_id,
            source_path=str(span.source_path) if span else None,
            span_id=span.id if span else None,
            span_hash=span.span_hash if span else None,
            source_hash=span.source_hash if span else None,
            retained_span_text=retained_span_text,
            redaction_status=redaction_status,
            verification_result=result,
            policy_action=action,
            retention_policy=retention_policy,
            search_trace=trace,
            stale_status="fresh" if span else "unanchored",
            repair_history=[result.suggested_rewrite] if result.suggested_rewrite else [],
            training_eligibility=False,
        )

    def _query_terms(self, claim_text: str) -> list[str]:
        """
        MVP query planning.

        Later:
        - entity extraction
        - metric extraction
        - paper-title routing
        - alias expansion
        - counter-query generation
        """
        terms: list[str] = []
        words = claim_text.split()
        for word in words:
            clean = word.strip(".,;:()[]{}\"'")
            if not clean:
                continue
            # Keep numbers and capitalized words as likely anchors.
            if any(ch.isdigit() for ch in clean):
                terms.append(clean)
            elif clean[:1].isupper() and len(clean) > 3:
                terms.append(clean)
        if len(words) >= 4:
            terms.append(" ".join(w.strip(".,;:()[]{}\"'") for w in words[:4]))
        # De-duplicate while preserving order, then cap at eight terms.
        return list(dict.fromkeys(terms))[:8]
```
This engine is still small.
But it already has the shape we need:
claim
→ warrant
→ scoped search
→ candidate window
→ verbatim span
→ attribution verdict
→ policy action
→ evidence record
And because it returns EvidenceRecord, not just text, the result can be audited, replayed, stored, repaired, or rejected.
A METADATA_ONLY record cannot be replayed without reopening the source under a new warrant. That is intentional: permission to search does not imply permission to retain.
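Even without retained text, the stored hashes are enough to tell whether a record has gone stale once the source is reopened. The sketch below shows one way that check could look. It is an assumption-laden illustration: `staleness` is a hypothetical helper, the status names other than "fresh" and "unanchored" are invented, and it presumes the span's character offsets are available, which the `EvidenceRecord` above does not persist.

```python
from __future__ import annotations

import hashlib


def sha256_text(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def staleness(
    recorded_source_hash: str | None,
    recorded_span_hash: str | None,
    current_source_text: str,
    char_start: int | None = None,
    char_end: int | None = None,
) -> str:
    """Compare stored hashes against a freshly reopened source."""
    if recorded_source_hash is None or recorded_span_hash is None:
        return "unanchored"
    if sha256_text(current_source_text) == recorded_source_hash:
        return "fresh"
    if char_start is not None and char_end is not None:
        # The file changed; check whether the cited slice survived in place.
        current_slice = current_source_text[char_start:char_end]
        if sha256_text(current_slice) == recorded_span_hash:
            return "source_changed_span_intact"
    return "stale"


# Illustrative source text, recorded at verification time.
original = "Intro.\nCiteVQA contains 1,897 questions.\n"
span = "CiteVQA contains 1,897 questions."
start = original.index(span)
end = start + len(span)
source_hash = sha256_text(original)
span_hash = sha256_text(span)

appended = original + "New appendix.\n"
edited = original.replace("1,897", "2,000")

print(staleness(source_hash, span_hash, original, start, end))
print(staleness(source_hash, span_hash, appended, start, end))
print(staleness(source_hash, span_hash, edited, start, end))
```

Note that the check itself needs the current source text, so under the model described here it would run inside a fresh warrant rather than against anything the record retained.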
12.8 Example Usage
```python
from pathlib import Path

from evidence_engine.dto import EvidenceClaim, ClaimModality, new_id
from evidence_engine.engine import EvidenceEngine

engine = EvidenceEngine(corpus_root=Path("papers"))

claim = EvidenceClaim(
    id=new_id("claim"),
    text="CiteVQA contains 1,897 questions across 711 PDFs.",
    modality=ClaimModality.METRIC,
    risk="high",
)

record = engine.verify_claim(claim)

print(record.policy_action)
print(record.verification_result.label)
print(record.verification_result.reason)
if record.repair_history:
    print("Suggested repair:", record.repair_history[-1])
```
A possible output:
PolicyAction.ACCEPT
SupportLabel.SUPPORTS
The span contains most claim terms and preserves numeric details.
Or, if the corpus does not contain the required evidence:
PolicyAction.ABSTAIN
SupportLabel.NOT_IN_CORPUS
No candidate evidence found within the warrant scope.
That second output is important.
The system did not guess.
It did not invent a citation.
It did not pretend the claim was false.
It said:
not verified inside this warrant
That is a good failure.
12.9 What This MVP Does Not Yet Do
This MVP is deliberately limited.
It does not yet:
train a GrepSeek-style search policy
perform semantic candidate discovery with embeddings
run an NLI or cross-encoder attribution verifier
detect contradiction
extract minimal spans with a token classifier
run verifier-driven retry loops
run a separate refutation warrant
decompose synthesis claims into EvidenceSets
perform fuzzy span relocation
persist evidence records to a database
produce an evidence lockfile
produce attribution diffs
calibrate verifier scores against human labels
support PDFs directly
harden every hostile-filesystem edge case
Those are next steps.
But the architecture already makes room for them.
The query planner can be replaced.
The span extractor can be replaced.
The verifier can be replaced.
The router can become risk-aware.
The evidence record can be stored.
The traces can become training data after validation.
That is why the interfaces matter more than the first heuristic implementation.
The MVP is not intelligent because the verifier is clever.
It is intelligent because semantic planning is separated from corpus access, evidence is extractive, the verdict is explicit, and the trace survives.
That is intelligent grep.
13. Running the Demo on This Article
The most honest demo is to run the Evidence Engine on the article itself.
The article should not merely argue for evidence-first writing.
It should perform it.
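So after implementing the first Writer Evidence Engine MVP, I pointed it at a small version of this article and asked it to verify the claims directly. Its reports use slightly different names from the Section 12 listing (for example, a causal modality and supported / not_supported verdict labels), so the tables in this section follow the tool's own output.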
This was not a simulated result.
This was the first real run.
The goal was not to prove that the MVP verifier is perfect. It is not. The verifier is still heuristic. It uses lexical overlap, numeric containment, high-risk predicate checks, and policy routing.
The goal was narrower:
Can the system extract evidence-bearing claims?
Can it generate bounded warrants?
Can it search only inside an approved source folder?
Can it find candidate evidence?
Can it accept a simple supported metric claim?
Can it route overclaims to review instead of blindly accepting them?
Can it abstain when a predicate is not supported?
Can it refuse to certify a synthesis claim as a single-span fact?
Can it preserve a report of what happened?
That is the first bar for intelligent grep.
Not omniscience.
Evidence discipline.
13.1 Demo Corpus
The demo corpus was deliberately small.
```text
C:\demo\
    papers\
        grepseek.txt
        acl_verbatim.txt
        citevqa.txt
    draft\
        warranted_search.md
```
The three paper files were pre-extracted text snippets representing the sources. Raw PDF parsing is still outside the MVP.
The draft file contained four claims designed to test the core behaviors:
CiteVQA contains 1,897 questions across 711 PDFs.
ACL-Verbatim eliminates hallucination from research QA.
GrepSeek replaces dense retrieval.
GrepSeek, ACL-Verbatim, and CiteVQA form a single architecture for trustworthy AI writing.
Those four claims are small, but they cover the important cases:
| Claim Type | Purpose |
|---|---|
| Supported metric claim | Tests numeric containment and direct support |
| Overclaim | Tests whether strong language is routed to review |
| Replacement claim | Tests whether entity overlap is enough or whether the predicate matters |
| Synthesis claim | Tests whether the system refuses to certify multi-source interpretation as one-span fact |
13.2 Step 1: Extract Claims
The first step was claim extraction.
writer evidence extract-claims --project warranted-search c:/demo/draft/warranted_search.md
The engine extracted four claims:
| Claim ID | Modality | Needs Evidence | High Risk | Claim |
|---|---|---|---|---|
| `claim-906c715741cd` | `metric` | yes | | CiteVQA contains 1,897 questions across 711 PDFs. |
| `claim-909c686983f5` | `causal` | yes | YES | ACL-Verbatim eliminates hallucination from research QA. |
| `claim-631c1ca0571c` | `causal` | yes | YES | GrepSeek replaces dense retrieval. |
| `claim-e0424ae8b7b0` | `synthesis` | yes | | GrepSeek, ACL-Verbatim, and CiteVQA form a single architecture for trustworthy AI writing. |
That is already useful.
The system correctly identified:
the metric claim
the two high-risk causal / overclaim sentences
the synthesis claim
It did not treat all sentences as the same kind of object.
That is the first important result.
A trustworthy evidence system needs a modality gate. It needs to know the difference between a number, an overclaim, and an authorial synthesis.
13.3 Step 2: First Verification Run
Next, I ran the verifier against the approved paper-source folder.
writer evidence verify-chapter --project warranted-search c:/demo/draft/warranted_search.md --root C:\demo\papers\
The first run produced this report:
Evidence Report — C:\demo\draft\warranted_search.md
Project: warranted-search | Total claims: 4
Accepted: 1
Repairs: 0
Reviews: 3
Refuted: 0
Abstained: 0
Not-in-corpus: 0
The claim-level result was:
| Action | Verdict | Score | Claim |
|---|---|---|---|
| `accept` | `supported` | `0.950` | CiteVQA contains 1,897 questions across 711 PDFs. |
| `review` | `partially_supported` | `0.433` | ACL-Verbatim eliminates hallucination from research QA. |
| `review` | `supported` | `0.880` | GrepSeek replaces dense retrieval. |
| `review` | `not_supported` | `0.100` | GrepSeek, ACL-Verbatim, and CiteVQA form a single architecture for trustworthy AI writing. |
This was a good first result.
The system accepted the clean metric claim. It routed the ACL-Verbatim overclaim to review. It refused to accept the synthesis claim as a single-span fact.
But it also exposed a verifier weakness.
The third row was wrong at the verdict level:
GrepSeek replaces dense retrieval.
→ review / supported / 0.880
The action was safe because the router still sent the claim to review.
But the verifier label was too generous.
The evidence did not carry the predicate:
replaces
The verifier had given too much credit to entity overlap. It saw “GrepSeek” and “retrieval” in the same neighborhood and treated that as support.
That is exactly the failure mode this article warns about.
The text was nearby.
The citation-shaped evidence looked relevant.
But the claim overreached.
So I turned the failure into a regression test.
13.4 Step 3: Fixing Predicate Support
The bug was not in search.
The system found the right neighborhood.
The bug was in attribution.
The verifier was asking:
Does this span mention the same entities?
But for high-risk claims, it needs to ask:
Does this span support the predicate?
A claim like:
GrepSeek replaces dense retrieval.
does not only depend on the terms:
GrepSeek
retrieval
It depends on the relationship:
replaces
So I added a predicate-support gate for high-risk verbs and phrases:
replaces
eliminates
proves
solves
guarantees
makes obsolete
always
never
all
first
best
The rule is simple.
Entity overlap alone can never confirm these predicates.
If the claim says “replaces,” the span must support replacement. If the span instead says “complementary,” “hybrid,” “future work,” or “existing retrieval paradigms,” the claim must be downgraded.
The test case became:
Claim:
GrepSeek replaces dense retrieval.
Evidence:
Direct Corpus Interaction can complement existing retrieval paradigms.
Future work includes hybrid DCI-plus-index retrieval.
Expected:
not_supported or partially_supported
never supported
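The gate and its regression test can be sketched in a few lines. This is a minimal illustration, not the Writer implementation: the function names `risky_predicates` and `gate_verdict`, the cue list, and the idea of capping an upstream overlap verdict are all assumptions made for the example. The "explicit support" check is itself only lexical here, so a real gate would need a stronger test.

```python
import re

# High-risk predicates, copied from the list above.
HIGH_RISK_PREDICATES = (
    "replaces", "eliminates", "proves", "solves", "guarantees",
    "makes obsolete", "always", "never", "all", "first", "best",
)

# Hedging cues named in the rule above. Matching by substring is an assumption.
HEDGE_CUES = ("complement", "hybrid", "future work", "existing retrieval paradigms")


def risky_predicates(claim: str) -> list[str]:
    """Return the high-risk predicates that occur in the claim as whole words."""
    text = claim.lower()
    return [p for p in HIGH_RISK_PREDICATES
            if re.search(rf"\b{re.escape(p)}\b", text)]


def gate_verdict(claim: str, span: str, overlap_verdict: str) -> str:
    """Cap an entity-overlap verdict when a high-risk predicate lacks support."""
    found = risky_predicates(claim)
    if not found:
        return overlap_verdict  # nothing risky in the claim, keep the verdict
    span_text = span.lower()
    if any(cue in span_text for cue in HEDGE_CUES):
        return "not_supported"  # the span hedges against the strong wording
    if all(re.search(rf"\b{re.escape(p)}\b", span_text) for p in found):
        return overlap_verdict  # crude stand-in: the span uses the same predicate
    # Entity overlap alone can never confirm a high-risk predicate.
    return "partially_supported" if overlap_verdict == "supported" else overlap_verdict


# The regression case from the text: the first-run verdict was "supported".
claim = "GrepSeek replaces dense retrieval."
span = ("Direct Corpus Interaction can complement existing retrieval paradigms. "
        "Future work includes hybrid DCI-plus-index retrieval.")
assert gate_verdict(claim, span, "supported") != "supported"
```

The design choice worth noting is that the gate only ever lowers a verdict. It cannot turn a weak match into a strong one, which keeps it safe to bolt onto an existing verifier.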
After adding the predicate gate, the test suite grew to 58 tests, all passing.
That matters because the Evidence Engine did something important:
it made the failure visible
it turned the failure into a test
it improved the next run
That is trace-driven improvement in miniature.
13.5 Step 4: Second Verification Run
After the fix, I ran the exact same command again:
writer evidence verify-chapter --project warranted-search c:/demo/draft/warranted_search.md --root C:\demo\papers\
This time the report changed:
Evidence Report — C:\demo\draft\warranted_search.md
Project: warranted-search | Total claims: 4
Accepted: 1
Repairs: 0
Reviews: 2
Refuted: 0
Abstained: 1
Not-in-corpus: 0
The final claim-level result was:
| Action | Verdict | Score | Claim |
|---|---|---|---|
| accept | supported | 0.950 | CiteVQA contains 1,897 questions across 711 PDFs. |
| review | partially_supported | 0.350 | ACL-Verbatim eliminates hallucination from research QA. |
| abstain | not_supported | 0.200 | GrepSeek replaces dense retrieval. |
| review | not_supported | 0.100 | GrepSeek, ACL-Verbatim, and CiteVQA form a single architecture for trustworthy AI writing. |
This is the result I wanted.
The metric claim still passes.
The ACL-Verbatim overclaim is still routed to review.
The GrepSeek replacement claim is no longer falsely labelled as supported.
The synthesis claim is still not certified as a single-span fact.
This is a better evidence report because the system now distinguishes:
same topic
from:
same supported claim
That distinction is the entire point of the Evidence Engine.
13.6 Result 1: The Metric Claim Passed
Claim:
CiteVQA contains 1,897 questions across 711 PDFs.
Final result:
Action: accept
Verdict: supported
Score: 0.950
This is the easiest case, but it matters.
The system found the relevant source text, preserved the two numeric details, and accepted the claim.
This demonstrates the baseline path:
claim
→ warrant
→ scoped search
→ source window
→ verbatim span
→ numeric containment check
→ accept
This is not deep reasoning.
It is something more boring and more important:
the number in the claim was carried by the source span
That is exactly the kind of claim a citation should be able to support.
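A numeric containment check of this kind is easy to sketch. The version below is an illustration under stated assumptions, not the MVP's code: `extract_numbers` and `numbers_contained` are hypothetical names, and the example span is invented for the demonstration rather than quoted from the CiteVQA paper.

```python
import re

# Integers with optional thousands separators and an optional decimal part.
NUMBER = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?")


def extract_numbers(text: str) -> set[str]:
    """Collect numeric tokens, normalizing '1,897' to '1897'."""
    return {match.group().replace(",", "") for match in NUMBER.finditer(text)}


def numbers_contained(claim: str, span: str) -> bool:
    """True only if the claim has numbers and every one appears in the span."""
    claim_numbers = extract_numbers(claim)
    return bool(claim_numbers) and claim_numbers <= extract_numbers(span)


claim = "CiteVQA contains 1,897 questions across 711 PDFs."
# Hypothetical span text, made up for this example.
matching_span = "The benchmark comprises 1,897 questions over 711 PDFs."
drifted_span = "The benchmark comprises 1,897 questions over 700 PDFs."

print(numbers_contained(claim, matching_span))  # True
print(numbers_contained(claim, drifted_span))   # False
```

Containment is a necessary condition, not a sufficient one. It catches numeric drift, but it does not confirm that the numbers describe the same thing in the span as in the claim.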
13.7 Result 2: The ACL-Verbatim Overclaim Went to Review
Claim:
ACL-Verbatim eliminates hallucination from research QA.
Final result:
Action: review
Verdict: partially_supported
Score: 0.350
This is a useful failure.
The source may support a narrower claim. It may show that ACL-Verbatim reduces hallucination risk by forcing research QA answers to be returned as verbatim source spans.
But that does not justify the word:
eliminates
The MVP did the right thing at the policy level.
It did not accept the claim.
It marked it as partially supported and routed it to review.
The better sentence is probably closer to:
ACL-Verbatim reduces hallucination risk by forcing research QA answers to be returned as verbatim source spans.
That rewrite still needs to be verified.
A repair is not automatically evidence.
A repair is a new claim that must pass the same process.
This is the difference between decorative citation and evidence discipline.
The system did not merely find a nearby source.
It said:
This is related, but the wording is too strong.
That is useful.
13.8 Result 3: The GrepSeek Replacement Claim Was Downgraded
Claim:
GrepSeek replaces dense retrieval.
First run:
Action: review
Verdict: supported
Score: 0.880
Second run:
Action: abstain
Verdict: not_supported
Score: 0.200
This is the most important result in the demo.
The first run exposed a weakness.
The second run showed that the weakness could be fixed.
The claim was not wrong because it mentioned the wrong topic. It was wrong because the source did not support the relationship asserted by the predicate.
A span saying that Direct Corpus Interaction can complement retrieval does not support:
replaces dense retrieval
It supports something closer to:
GrepSeek shows that direct corpus interaction can complement index-based retrieval by making search trajectories explicit and auditable.
That is why predicate support matters.
A model can find the right paper.
It can find the right paragraph.
It can even find the right concept.
And still attach the wrong claim.
That is Evidence Quicksand.
The second run demonstrates the fix:
entity overlap is not enough
predicate support must be checked
unsupported replacement claims should be downgraded
The system now abstains instead of accepting unsupported replacement language.
That is not a failure.
That is the correct outcome.
13.9 Result 4: The Synthesis Claim Was Not Certified as a Single-Span Fact
Claim:
GrepSeek, ACL-Verbatim, and CiteVQA form a single architecture for trustworthy AI writing.
Final result:
Action: review
Verdict: not_supported
Score: 0.100
This is acceptable for the MVP.
The claim is not stated in any one paper.
It is the central synthesis of the article.
The system should not pretend that one source span proves it.
The ideal future behavior is not simply not_supported. The full Evidence Engine should route this to an EvidenceSet:
sourceable atoms:
- GrepSeek provides executable corpus interaction.
- ACL-Verbatim provides extractive evidence spans.
- CiteVQA provides attribution-aware evaluation.
interpretive joins:
- These primitives map onto search, extraction, and verification.
- A warranted-search runtime can compose those stages into an Evidence Engine.
- The final architecture is the author's synthesis.
The MVP does not yet fully implement that decomposition in the report.
But it did the most important thing:
it refused to accept the synthesis claim as a single-span fact
That is a good failure.
It preserves the distinction between:
the papers support the components
and:
the author composes the architecture
That distinction is one of the core ideas of this post.
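One way to picture the EvidenceSet shape is as a small data structure that keeps sourced atoms and authorial joins apart. The sketch below is hypothetical: the class names, the `verdict` labels other than `needs_multiple_spans` and `not_supported`, and the aggregation rule are assumptions for illustration, since the MVP does not implement this decomposition yet.

```python
from dataclasses import dataclass, field


@dataclass
class Atom:
    """A sourceable sub-claim that needs its own verbatim span."""
    text: str
    source: str
    verdict: str = "unverified"


@dataclass
class Join:
    """An interpretive step owned by the author, never certified by a span."""
    text: str


@dataclass
class EvidenceSet:
    claim: str
    atoms: list[Atom] = field(default_factory=list)
    joins: list[Join] = field(default_factory=list)

    def verdict(self) -> str:
        """Aggregate atom verdicts without pretending the joins are sourced."""
        if not self.atoms:
            return "not_supported"
        if any(atom.verdict != "supported" for atom in self.atoms):
            return "needs_multiple_spans"
        # Hypothetical label: every atom is carried, the composition is the author's.
        return "atoms_supported_author_synthesis" if self.joins else "supported"


synthesis = EvidenceSet(
    claim="GrepSeek, ACL-Verbatim, and CiteVQA form a single architecture "
          "for trustworthy AI writing.",
    atoms=[
        Atom("GrepSeek provides executable corpus interaction.", "GrepSeek"),
        Atom("ACL-Verbatim provides extractive evidence spans.", "ACL-Verbatim"),
        Atom("CiteVQA provides attribution-aware evaluation.", "CiteVQA"),
    ],
    joins=[Join("The final architecture is the author's synthesis.")],
)
print(synthesis.verdict())  # needs_multiple_spans, until each atom is verified
```

Even when every atom passes, the set never returns a plain `supported` while joins are present. That keeps the report honest about which part came from the papers and which part came from the author.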
13.10 What the Demo Shows
The final run produced:
Accepted: 1
Repairs: 0
Reviews: 2
Refuted: 0
Abstained: 1
Not-in-corpus: 0
This is exactly the kind of imperfect result I wanted.
A perfect result would have been suspicious.
A useful result shows where the system works and where it needs improvement.
The MVP demonstrated that it can:
| Capability | Result |
|---|---|
| Extract evidence-bearing claims | Working |
| Classify metric, causal, and synthesis claims | Working |
| Search inside an approved corpus | Working |
| Accept a clean supported metric claim | Working |
| Route overclaims to review | Working |
| Abstain on unsupported replacement language | Working |
| Refuse to certify synthesis as one-span fact | Working |
| Expose verifier weakness as a visible trace | Working |
| Convert that weakness into a regression test | Working |
It also exposed the next engineering tasks:
| Weakness | Fix |
|---|---|
| Synthesis verdict is too crude | Route synthesis claims to needs_multiple_spans / EvidenceSet instead of simple not_supported |
| Repair generation is still shallow | Generate source-aware repair candidates and re-verify them |
| Verifier is still lexical | Replace or augment with NLI, cross-encoder, or calibrated attribution model |
| The report hides useful verifier metadata | Display overclaim flags, predicate failures, and suggested repairs |
That is a successful first demo.
Not because the system proved every claim.
Because it showed the evidence path, accepted only the clean case, routed questionable claims away from automatic acceptance, and surfaced the next engineering step.
13.11 The Point of the Demo
The point of this demo is not:
The MVP verifier is finished.
It is:
The architecture works well enough to make claim checking inspectable.
The first run already changed the system.
Before the run, I could say:
An Evidence Engine should catch overclaims.
After the run, I can say something more precise:
The first MVP accepted the metric claim, routed the ACL-Verbatim overclaim to review, over-scored the GrepSeek replacement claim, exposed that mistake in the trace, and improved after a predicate-support gate was added.
That is better.
It is more precise.
It is more trustworthy.
The article is no longer merely arguing for warranted search.
It has become a warranted artifact.
The system checked its claims, showed the report, exposed a verifier weakness, converted that weakness into a regression test, and improved on the second run.
That is what accountable AI writing should look like.
14. Evidence Memory
Section 8 argued that a single verification run can produce a useful trace.
This section is about what changes when those traces persist.
The long-term value is not one evidence report.
It is governed memory.
A single trace captures one verification episode:
claim
warrant
search operations
candidate windows
extracted spans
support verdict
drift estimate
policy action
accepted repair
rejected repair
human review decision
Evidence memory is what happens when many of those traces accumulate across drafts, sources, verifier versions, and review decisions.
It is not merely a larger log.
Persistence creates new responsibilities:
validation
staleness
promotion
privacy
forgetting
replay
training eligibility
That is the shift.
One trace tells us what happened to one claim.
Governed evidence memory tells us what the system has learned, what it is allowed to remember, what must be rechecked, and what must never be promoted into training.
14.1 Four Different Stores
It helps to separate four objects that are often blurred together.
| Object | Purpose | Mutability | Training role |
|---|---|---|---|
| Citation cache | remembers where a source was found | ephemeral / invalidated by source change | none by itself |
| Audit trace | records what happened in one warrant run | append-only | raw diagnostic material |
| Evidence memory | stores reviewed claim-evidence decisions across runs | versioned and lifecycle-managed | candidate calibration material |
| Training corpus | contains validated, privacy-cleared examples | snapshot-based | used for model or policy updates |
A citation cache maps a topic or claim to a source location.
An audit trace records the exact operations of one run.
Evidence memory records what happened when a claim was tested against a source under a warrant.
A training corpus is a curated derivative of memory, not memory itself.
That distinction matters because these objects have different rules.
The audit trace is for accountability.
Evidence memory is for recall, replay, calibration, and review.
The training corpus is for model improvement.
Raw traces should not flow into training just because they exist.
14.2 What Evidence Memory Stores
A useful evidence memory stores more than accepted citations.
It stores the full decision surface:
| Artifact | Why it matters |
|---|---|
| accepted spans | positive evidence examples |
| rejected spans | negative attribution examples |
| partial spans | overclaim and repair examples |
| contradiction spans | refutation examples |
| abstentions | uncertainty and scope examples |
| warrant violations | policy-hardening examples |
| accepted rewrites | claim repair examples |
| rejected rewrites | bad repair examples |
| human overrides | calibration examples |
| stale evidence events | re-verification examples |
| source hash changes | dependency invalidation examples |
| verifier version changes | replay and calibration examples |
| retention events | privacy and compliance examples |
The rejected cases may be more valuable than the accepted ones.
They show the boundary between relevance and support.
A bad citation teaches the verifier what “nearby but not supportive” looks like.
A failed search teaches the planner which query paths waste budget.
A human-rejected repair teaches the system which rewrites damaged the author’s meaning.
A stale span teaches the memory layer what needs re-verification when sources move.
This is how an evidence system improves while preserving epistemic humility: traces inform learning only after validation.
14.3 Memory Is Not Ground Truth
Evidence memory should be treated as candidate training material, not truth.
A failed search is not proof of falsehood.
It may mean:
the warrant was too narrow
the query terms were poor
the source was missing
the parser failed
the verifier abstained
the budget ran out
Evidence memory should preserve those distinctions.
It should not flatten:
not found inside this warrant
into:
false
It should not flatten:
accepted by weak verifier
into:
true
And it should not flatten:
human edited this rewrite
into:
model was correct
The dangerous loop looks like this:
the system grades its own work
stores the grade
trains on the grade
becomes more confident in its own mistakes
Evidence memory should prevent that loop by separating raw traces from validated training examples.
Raw traces are audit material.
Reviewed traces are editorial feedback.
Validated traces become calibration data.
Promoted traces become training examples.
Deprecated traces remain historical artifacts, but should not guide future decisions unless revalidated.
14.4 Lifecycle Axes
The memory lifecycle should not be one flat status field.
A record can be validated and stale.
A record can be reviewed and private.
A record can be promoted and later deprecated.
So implementation should treat lifecycle as several independent axes:
review_status:
raw | reviewed | validated | rejected
promotion_status:
unpromoted | promoted | deprecated | quarantined
freshness_status:
fresh | stale | source_missing | verifier_changed
privacy_tier:
public | project_private | sensitive | restricted | no_retention
training_eligible:
true | false
training_eligible should usually be derived, not manually asserted.
A record becomes training-eligible only when it is:
validated
fresh
not deprecated
not privacy-restricted
approved for the relevant training purpose
A compact memory record might look like this:
{
  "memory_id": "mem_123",
  "evidence_record_id": "evidence_456",
  "claim_hash": "sha256:...",
  "warrant_id": "warrant_789",
  "source_hash": "sha256:...",
  "span_hash": "sha256:...",
  "verdict": "partially_supports",
  "policy_action": "repair",
  "review_status": "validated",
  "promotion_status": "unpromoted",
  "freshness_status": "fresh",
  "privacy_tier": "project_private",
  "retention_policy": "metadata_only",
  "training_eligible": false,
  "verifier_version": "simple-overlap-v0.1",
  "created_at": "2026-06-02T00:00:00Z"
}
This is the difference between memory as a pile of logs and memory as governed infrastructure.
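Deriving `training_eligible` from the other axes, rather than storing it by hand, might look like the sketch below. Several details are assumptions for illustration: the `MemoryRecord` class, the `approved_purposes` field, the choice of which privacy tiers block training, and the decision to treat `quarantined` like `deprecated`.

```python
from dataclasses import dataclass

# Assumption: these tiers count as "privacy-restricted" for training use.
BLOCKED_TIERS = frozenset({"sensitive", "restricted", "no_retention"})


@dataclass(frozen=True)
class MemoryRecord:
    review_status: str      # raw | reviewed | validated | rejected
    promotion_status: str   # unpromoted | promoted | deprecated | quarantined
    freshness_status: str   # fresh | stale | source_missing | verifier_changed
    privacy_tier: str       # public | project_private | sensitive | restricted | no_retention
    approved_purposes: frozenset = frozenset()  # hypothetical field

    def training_eligible(self, purpose: str) -> bool:
        """Derived from the axes, so it cannot drift out of sync with them."""
        return (
            self.review_status == "validated"
            and self.freshness_status == "fresh"
            and self.promotion_status not in {"deprecated", "quarantined"}
            and self.privacy_tier not in BLOCKED_TIERS
            and purpose in self.approved_purposes
        )


# Same axis values as the JSON record above, with no approved purposes.
record = MemoryRecord("validated", "unpromoted", "fresh", "project_private")
print(record.training_eligible("verifier_calibration"))  # False
```

Because eligibility is computed per purpose, a record approved for verifier calibration does not automatically become eligible for any other kind of training.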
14.5 Promotion and Validation
The most important rule is simple:
raw trace ≠ training example
A trace should pass through a promotion gate before it influences future behavior.
Possible validation routes include:
human review
cross-verifier agreement
gold-set replay
regression test
accepted author correction
calibrated threshold check
A promotion policy might say:
raw
→ reviewed after human or automated triage
→ validated after review, consensus, or replay
→ promoted only if privacy policy allows training use
→ deprecated if source, verifier, warrant, or claim changes
The system should also distinguish warrant exhaustion from source absence and contradiction:
| State | Meaning |
|---|---|
| abstain | no safe verdict |
| not_in_corpus | source not found inside the warrant |
| out_of_scope | needed source was outside permission |
| contradicts | source actively conflicts with claim |
| budget_exhausted | search stopped before completion |
Flattening these into “failure” corrupts the training signal.
A good evidence memory does not merely remember outcomes.
It remembers why those outcomes happened.
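The promotion ladder described in this section can be sketched as a gate that moves a trace at most one rung at a time. This is a simplified illustration: it collapses the separate lifecycle axes into a single stage for readability, and the function names and the `training_allowed` flag are assumptions.

```python
# Rungs from the promotion policy above, in order.
LADDER = ("raw", "reviewed", "validated", "promoted")

# Changes that deprecate a record, per the policy above.
DEPRECATION_TRIGGERS = {"source", "verifier", "warrant", "claim"}


def advance(stage: str, gate_passed: bool, training_allowed: bool = False) -> str:
    """Move one rung up if the gate passed, and never skip a rung."""
    if stage not in LADDER or stage == "promoted" or not gate_passed:
        return stage
    next_stage = LADDER[LADDER.index(stage) + 1]
    if next_stage == "promoted" and not training_allowed:
        return stage  # validated, but privacy policy does not allow training use
    return next_stage


def deprecate_on_change(stage: str, changed: set[str]) -> str:
    """Deprecate the record when something it depends on has changed."""
    return "deprecated" if changed & DEPRECATION_TRIGGERS else stage


stage = "raw"
stage = advance(stage, gate_passed=True)                         # reviewed
stage = advance(stage, gate_passed=True)                         # validated
stage = advance(stage, gate_passed=True)                         # still validated
stage = advance(stage, gate_passed=True, training_allowed=True)  # promoted
stage = deprecate_on_change(stage, {"verifier"})                 # deprecated
print(stage)
```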
14.6 Dependency Index and Replay
Evidence memory needs an inverted dependency index.
Otherwise it cannot know what to re-check when something changes.
At minimum, the memory layer should be able to ask:
which evidence records depend on this source hash?
which records depend on this verifier version?
which records came from this warrant?
which claims used this span?
which records are training-eligible but now stale?
That gives us cascade invalidation.
If a source changes, dependent evidence records can be marked stale.
If a verifier changes, old verdicts can be replayed.
If a warrant changes, old traces may no longer be valid under the new permission boundary.
If a claim changes, the old span may no longer carry it.
This is where evidence memory connects back to the build-system metaphor.
The memory is not just storing history.
It is tracking dependencies.
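A minimal in-memory version of such an index is sketched below, assuming each evidence record carries a source hash, verifier version, warrant id, and span hash. The `DependencyIndex` class and its method names are hypothetical, and a real store would persist this rather than hold it in dictionaries.

```python
from collections import defaultdict


class DependencyIndex:
    """Inverted index from a dependency to the evidence records that rely on it."""

    def __init__(self):
        self._by_key = defaultdict(set)  # (kind, value) -> record ids
        self.freshness = {}              # record id -> freshness status

    def add(self, record_id, *, source_hash, verifier_version, warrant_id, span_hash):
        deps = {
            "source": source_hash,
            "verifier": verifier_version,
            "warrant": warrant_id,
            "span": span_hash,
        }
        for kind, value in deps.items():
            self._by_key[(kind, value)].add(record_id)
        self.freshness[record_id] = "fresh"

    def dependents(self, kind, value):
        """Answer questions like: which records depend on this source hash?"""
        return set(self._by_key.get((kind, value), set()))

    def invalidate(self, kind, value, status="stale"):
        """Cascade invalidation: mark every dependent record and return them."""
        hit = self.dependents(kind, value)
        for record_id in hit:
            self.freshness[record_id] = status
        return hit


index = DependencyIndex()
index.add("ev_1", source_hash="src_a", verifier_version="v0.1",
          warrant_id="w_1", span_hash="sp_1")
index.add("ev_2", source_hash="src_a", verifier_version="v0.1",
          warrant_id="w_1", span_hash="sp_2")
index.add("ev_3", source_hash="src_b", verifier_version="v0.1",
          warrant_id="w_2", span_hash="sp_3")

index.invalidate("source", "src_a")  # the source file changed
print(index.freshness)  # ev_1 and ev_2 are stale, ev_3 is still fresh
```

The same call handles a verifier upgrade: `index.invalidate("verifier", "v0.1", status="verifier_changed")` marks every old verdict for replay.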
14.7 Privacy and Forgetting
Evidence memory must be designed to forget.
A system that remembers every inspected window forever violates the warrant model.
Search permission is not retention permission.
If a warrant allowed the system to inspect a private source for one claim, that does not mean the source text should become permanent memory or training data.
Evidence memory should therefore support:
metadata-only retention
raw span expiry
redacted trace storage
source-level deletion
project-level deletion
private-memory exclusion
training opt-out
retention expiry
purpose limitation
right-to-forget tombstones
For sensitive sources, the memory may keep only metadata:
claim hash
source hash
span hash
verdict
policy action
timestamp
But even that can leak.
A span hash can become a membership oracle if an attacker can guess the sensitive text and compare hashes.
A query log can leak intent even when no raw span is retained.
A trace that says:
search_phrase("patient HIV status")
is already sensitive.
For restricted warrants, the system may need salted hashes, local-only storage, encrypted metadata, query redaction, or complete deletion after verification.
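To make the membership-oracle risk concrete, here is a sketch that compares a plain span hash with a keyed one. It uses HMAC-SHA256 from the Python standard library as one way to realize the "salted hash" idea. The function names and the per-project key are assumptions, and key storage and rotation are out of scope for the example.

```python
import hashlib
import hmac
import secrets


def plain_hash(span_text: str) -> str:
    """Unsalted hash: anyone who can guess the text can confirm the guess."""
    return "sha256:" + hashlib.sha256(span_text.encode("utf-8")).hexdigest()


def keyed_fingerprint(span_text: str, key: bytes) -> str:
    """Keyed hash: a guess cannot be checked without the secret key."""
    digest = hmac.new(key, span_text.encode("utf-8"), hashlib.sha256).hexdigest()
    return "hmac-sha256:" + digest


project_key = secrets.token_bytes(32)  # hypothetical per-project secret
sensitive_span = "example sensitive span text"
attacker_guess = "example sensitive span text"

# With a plain hash, a correct guess matches the stored value.
print(plain_hash(attacker_guess) == plain_hash(sensitive_span))  # True

# With a keyed hash, the attacker lacks the key, so the guess cannot be confirmed.
stored = keyed_fingerprint(sensitive_span, project_key)
forged = keyed_fingerprint(attacker_guess, secrets.token_bytes(32))
print(hmac.compare_digest(stored, forged))  # False
```

The keyed fingerprint still supports deduplication and dependency lookups inside the project, because the same key and text always give the same value. Destroying the key also makes every stored fingerprint useless, which is one possible route to deletion.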
Deletion may also require a policy tombstone.
A tombstone should not retain the deleted content. It should retain only the fact that a source, span, project, or memory region must not be reused, re-ingested, or promoted into training.
This is not an implementation footnote.
It is part of the warrant.
A trustworthy Evidence Engine must be able to remember what helps while forgetting what it was not allowed to keep.
14.8 What the System Can Learn
The mappings from traces to improvable components are the same ones introduced earlier.
What changes here is governance.
A trace improves a component only after it has been validated and cleared for training.
Validated memory can improve:
successful search traces
→ query planning
failed search traces
→ negative search policy
accepted spans
→ span extraction
rejected spans
→ attribution verification
overclaim repairs
→ claim rewriting
human overrides
→ calibration
stale evidence events
→ replay and invalidation
warrant violations
→ policy enforcement
But the learning target is not:
produce more confident answers
The learning target is:
find better evidence
with less scope
less drift
less privacy exposure
clearer abstention
and better calibration
The system is not learning how to sound more authoritative.
It is being trained toward the discipline of evidence.
14.9 The Long-Term Payoff
Initial runs generate reports.
Validated runs populate evidence memory.
Curated subsets become training material.
But only if the system keeps the right distinctions:
trace is not truth
evidence is not prose
citation is not attribution
repair is not acceptance
abstention is not failure
memory is not permission
That is the deeper value.
The Evidence Engine is not merely checking writing.
It is building a reusable, governed record of how evidence was found, tested, repaired, rejected, invalidated, forgotten, and remembered.
Over time, that record can train better components:
a better query planner
a local span extractor
a calibrated attribution verifier
a refutation search policy
a voice-preserving claim repair model
This is the bridge from one verified article to a validated-data flywheel.
Better over time, not because it trusts itself.
Better because it preserves traces, gates them through validation, and trains only on what survived.
15. Why This Matters Now
AI writing tools are already moving into domains where evidence cannot be treated as decoration:
research
law
finance
medicine
engineering
policy
education
compliance
software documentation
corporate governance
The generation layer is arriving faster than the verification layer.
That is the problem.
In these domains, none of the usual signals are enough on their own:
fluent prose
a real citation
a source link
a confident answer
Each can be present while the claim underneath remains unsupported.
I am not claiming that the MVP in this post is ready for courtrooms, clinics, trading desks, or regulatory review. A lexical-overlap verifier has no business making clinical or legal judgments.
The point is the standard these domains demand.
A trustworthy system must be able to show, for each evidence-bearing claim:
what claim is being checked
what permission governed the search
where it searched
what it found
what it extracted
which span is being used to support the claim
which span contradicts it
where the claim overreaches
why the system accepted, repaired, refuted, abstained, or requested a wider warrant
This is not absolute proof.
It is auditable evidentiary support: a claim checked under a warrant, against a source, with a trace the writer can inspect.
That is the boundary between text generation and accountable knowledge work.
A generator optimizes for fluency.
An Evidence Engine enforces verifiable support.
The difference matters because AI systems are no longer confined to low-stakes drafting. They summarize papers, interpret regulations, compare financial claims, review code, assist clinicians, draft policy briefs, generate documentation, and build institutional knowledge.
In those settings, the failure mode is not always a fake source.
Often the source is real.
The citation exists.
The paragraph is nearby.
The sentence sounds right.
But the evidence does not support the claim.
That is Evidence Quicksand: not the absence of sources, but the failure of the connection between claim and source.
Better retrievers and larger models may help, but they do not remove the need for a different contract between the writer, the system, and the sources:
Do not merely retrieve.
Search under warrant.
Do not merely cite.
Extract evidence.
Do not merely answer.
Verify attribution.
Do not merely repair silently.
Show the trace.
Do not merely remember.
Remember under retention policy.
Warrants make the search accountable.
Evidence makes the claim accountable.
I think this should change the standard for AI writing tools.
The next generation of tools should not be judged only by how well they write.
They should be judged by whether they can expose the warrant, trace, evidence span, attribution verdict, and policy action behind their claims.
Not every sentence needs a citation.
Not every idea needs an external source.
Not every synthesis can or should be reduced to a single span.
But every evidence-bearing claim should be able to answer a simple question:
What carries this?
If the system can answer that question, the writer can inspect the evidence path.
If the system can answer it but the evidence is weak, partial, or contrary, it should narrow the claim, repair it, refute it, or route it to review.
If the system cannot answer it inside the warrant, the honest outcome is to abstain or ask for a wider warrant.
A system that can abstain is more trustworthy than one that always finds something.
That is the shift.
From prose generation to evidence discipline.
From decorative citations to load-bearing attribution.
From passive retrieval to warranted search.
From one-off answers to living evidence memory.
The future of AI writing is not just better wording.
It is accountable knowledge work.
16. What the Demo Showed
This post began with a failure:
right answer
real citation
fake grounding
That is Evidence Quicksand.
The answer may be correct.
The source may exist.
The citation may look scholarly.
But the evidence does not carry the claim.
The solution is not simply a bigger model, a larger context window, or another retrieval layer.
Those may help.
But they do not change the contract.
The solution is a better evidence architecture.
GrepSeek gives us the search loop.
ACL-Verbatim gives us the extractive evidence constraint.
CiteVQA gives us the attribution test.
Warranted search gives us the governance layer.
The Writer implementation gives us the first practical runtime.
Together, these pieces point toward intelligent grep:
not dumb grep
not loose AI
not black-box retrieval
not decorative citations
A system like this should not merely answer.
It should be able to show:
Here is the claim.
Here is the warrant.
Here is where I searched.
Here is what I was allowed to inspect.
Here is the source window.
Here is the exact span.
Here is the attribution verdict.
Here is where the claim overreaches.
Here is why I accepted, repaired, refuted, reviewed, or abstained.
Here is the trace you can audit.
That is what the demo had to show.
And the first implementation did show it.
Imperfectly.
Usefully.
The final run checked four claims:
| Claim | Result |
|---|---|
| CiteVQA contains 1,897 questions across 711 PDFs. | accept / supported / 0.950 |
| ACL-Verbatim eliminates hallucination from research QA. | review / partially_supported / 0.350 |
| GrepSeek replaces dense retrieval. | abstain / not_supported / 0.200 |
| GrepSeek, ACL-Verbatim, and CiteVQA form a single architecture for trustworthy AI writing. | review / not_supported / 0.100 |
That table is the whole post in miniature.
The clean metric claim passed.
The overclaim was not accepted.
The unsupported replacement claim was downgraded.
The synthesis claim was not certified as a single-span fact.
The system did not pretend to settle truth.
It made claim decisions under a warrant.
The most important part of the demo was not the accepted claim.
It was the failure between the first run and the second run.
On the first run, the verifier over-scored this sentence:
GrepSeek replaces dense retrieval.
It found the right neighborhood. It saw the right entities. It saw “GrepSeek” and “retrieval” close together.
But it missed the unsupported predicate:
replaces
So the first result came back too generous:
review / supported / 0.880
That was Evidence Quicksand inside the Evidence Engine itself.
The evidence was nearby.
The citation-shaped match looked plausible.
But the span did not carry the claim.
The difference is that this time the failure was visible.
The trace exposed it.
The report preserved it.
The bug became a regression test.
The verifier was updated so high-risk predicates such as replaces, eliminates, proves, solves, and guarantees require explicit support. Entity overlap alone is no longer enough.
On the second run, the same claim changed to:
abstain / not_supported / 0.200
That is the result the article is arguing for.
Not a perfect system.
A corrigible one.
A system that can show its work, expose its weak spots, and improve because the failure was recorded rather than hidden inside prose.
The demo showed that search can be bounded.
It showed that evidence can be extractive.
It showed that attribution can be tested.
It showed that overclaims can be routed away from automatic acceptance.
It showed that abstention is better than fake certainty.
It showed that synthesis needs a different evidence shape.
It showed that a trace can survive long enough to become the next test.
That is the difference between:
a generated answer with a citation
and:
a claim checked against evidence under a warrant
That is the shift.
Not AI that merely writes.
AI that can show how its writing was checked.
Not evidence as decoration.
Evidence as infrastructure.
Not a model performing grounding.
A system exposing the path between claim and source.
17. Conclusion: Build on Rock
A citation is not ground.
Neither is a source link.
Neither is a retrieved chunk.
A citation becomes ground only when the cited evidence can carry the claim.
That is the problem this post began with:
right answer
real citation
fake grounding
The answer looked correct.
The citation looked real.
But the cited evidence could not carry the claim.
That failure is not a minor RAG bug. It is a structural problem in how AI systems perform knowledge work. They are very good at producing the appearance of grounding: source links, plausible references, confident explanations, scholarly prose. But appearance is not enough when the work depends on evidence.
The fix is not a larger context window, a more fluent decoder, or a higher-recall retriever.
The fix is a different acceptance standard.
An evidence-bearing claim should not pass because it sounds right.
It should not pass because a source is nearby.
It should not pass because a citation exists.
It should pass only when the system can show the evidence path it is relying on:
claim
→ warrant
→ search trace
→ source window
→ evidence span
→ attribution verdict
→ accept / repair / refute / abstain
That does not guarantee truth.
It gives the writer an auditable basis for judgment.
Together, the pieces in this post define an executable evidence discipline:
- Warranted search bounds the investigation.
- Executable corpus interaction makes the search path visible.
- Verbatim spans anchor the evidence in source text.
- Attribution verification tests whether the span carries the claim.
- Refutation search catches overreach.
- Synthesis decomposition separates sourced atoms from authorial joins.
- Abstention makes “not verified inside this warrant” a real outcome.
- Evidence memory turns verified runs into governed, reusable learning material.
That is intelligent grep.
Not raw string matching.
Not an unconstrained agent.
Not black-box retrieval.
Not decorative citations.
Intelligent grep means AI planning around deterministic corpus operations, bounded by warrants, anchored in source spans, and judged by attribution.
The important point is that this post did not stop at the architecture.
I ran the MVP.
I fed the article’s own claims into the Evidence Engine.
The first run accepted the clean metric claim, routed an overclaim to review, refused to certify the synthesis claim as a single-span fact, and exposed a flaw in the verifier.
That flaw mattered.
The verifier initially over-scored:
GrepSeek replaces dense retrieval.
It found the right neighborhood. It saw the right entities. But it failed to test whether the source supported the predicate:
replaces
That was Evidence Quicksand inside the Evidence Engine itself.
The difference is that this time the failure was visible.
The report exposed it.
The trace preserved it.
The bug became a regression test.
After adding a predicate-support gate, the same claim was downgraded:
before: review / supported / 0.880
after: abstain / not_supported / 0.200
That is the strongest result in the post.
Not because the MVP is perfect.
Because the MVP is corrigible.
It can show its work, reveal its weak spots, and improve because the failure is recorded rather than buried inside fluent prose.
This does not remove judgment from the writer.
It makes the judgment visible.
The writer still argues.
The writer still synthesizes.
The writer still decides what matters.
But now the system can say:
This part is sourced.
This part is inferred.
This part overreaches.
This part needs review.
This part has no support inside the warrant.
This part exposed a verifier weakness.
That is the standard this post argues for.
The future of serious AI writing is not just better prose.
It is better evidence discipline.
Writing tools should not merely help us generate fluent text faster. They should help us ask, sentence by sentence, claim by claim:
What carries this?
If the evidence carries the claim, show it.
If the claim overreaches, repair it.
If the source contradicts it, refute it.
If the warrant is too narrow, say so.
If the evidence is missing, abstain.
If the verifier fails, preserve the failure and turn it into the next test.
That is how we move from citation performance to accountable knowledge work.
Do not build on citation-shaped sand.
Build on evidence that can carry the claim.
Glossary
| Term | Meaning in This Post |
|---|---|
| Evidence Quicksand | A failure mode where the answer is correct and the citation is real, but the cited evidence does not actually support the claim. The output looks grounded until inspected closely. |
| Warrant | A runtime-enforced permissions envelope around an intelligent process. It defines what claim is being checked, what sources may be inspected, what operations are allowed, what models may be called, what budget applies, and what may be returned or retained. |
| Warranted Search | Search performed under an explicit warrant. Instead of asking “what can I find?”, the system asks whether a specific claim can be supported, repaired, refuted, or left unverified inside a declared scope. |
| Intelligent Grep | AI-guided, warrant-bound corpus search. The AI plans and evaluates search actions, but deterministic grep-like operations perform scoped interaction with real files. |
| Executable Corpus Interaction | A search process where the agent interacts with the corpus through explicit operations such as search_phrase, read_window, or extract_span, producing an auditable trace instead of an opaque retrieval result. |
| Direct Corpus Interaction / DCI | The broader pattern, exemplified by GrepSeek, where a system searches and reads the corpus directly rather than relying only on a prebuilt index or top-k retriever. |
| Scoped Search Kernel | The runtime component that enforces the warrant. It is the only layer allowed to touch the corpus, and it rejects out-of-scope files, disallowed operations, over-budget reads, and unsafe actions. |
| Search Trace | The ordered record of search operations: what was searched, which files were touched, what windows were read, what failed, what succeeded, and what evidence candidates were produced. |
| Candidate Evidence Window | A bounded slice of source text returned by the search kernel. It is not yet evidence; it is a region from which a verbatim evidence span may be extracted. |
| Verbatim Evidence Span | Exact source text extracted from the corpus. It must not be paraphrased, summarized, or generated. It is the concrete object later tested against a claim. |
| Extractive Evidence | Evidence returned as source text rather than generated explanation. ACL-Verbatim motivates this pattern by treating the answer as a verbatim span. |
| Attribution Verification | The process of checking whether a cited span actually supports the claim. It answers: “Does this evidence carry this sentence?” |
| Strict Attributed Accuracy | A CiteVQA-inspired evaluation principle: an answer should pass only when both the answer and the cited evidence are correct. In this post, the same idea is adapted to text claims and evidence spans. |
| Evidence-Bearing Claim | A factual, metric, comparative, source-backed, or synthesis-heavy statement that requires an evidence decision. Not every sentence in a piece of writing is evidence-bearing. |
| Modality Gate | The layer that classifies sentences by type: factual, author-defined, interpretive, speculative, narrative, or synthesis. It prevents the system from demanding citations for every sentence. |
| Atomic Claim | A claim that can usually be checked against one source span, such as a benchmark number, definition, or paper-specific statement. |
| Synthesis Claim | A claim built by connecting multiple sources or ideas. It cannot be verified by one span alone because the author is making an interpretive join across evidence. |
| Sourceable Atom | A subclaim inside a synthesis claim that can be checked directly against source evidence. |
| Interpretive Join | The authorial reasoning that connects multiple source-backed atoms into a larger argument. It should be marked as interpretation, not treated as something the sources directly said. |
| EvidenceSet | A structured evidence object for synthesis claims. It contains multiple subclaims, spans, bridge claims, composition notes, and a composition verdict. |
| EvidenceSpan | A structured object representing one claim-to-source-region relationship: source, span, hash, location, warrant, extractor version, and related trace metadata. |
| EvidenceRecord | The full decision record for a claim: claim hash, warrant, evidence span, verifier result, policy action, retention policy, trace, repair history, and stale status. |
| Evidence Object | A first-class runtime object that the system can hash, replay, invalidate, repair, review, and potentially use for training after validation. |
| Evidence Memory | A governed store of reviewed evidence records across runs. It is not a citation cache and not automatic truth; it is structured memory of claim-evidence decisions. |
| Citation Cache | A simple lookup store that remembers where sources were found. Unlike evidence memory, it does not record whether a claim was actually supported. |
| Audit Trace | An append-only record of what happened during one warrant run. It is used for inspection, debugging, replay, and accountability. |
| Training Corpus | A curated, privacy-cleared subset of validated evidence memory used to improve components such as query planners, span extractors, verifiers, or repair models. |
| Hallucination Energy | A heuristic estimate of drift between a claim and its cited evidence. It measures how far the wording moves beyond what the span can carry; it is not a truth score. |
| Drift Score | A practical score estimating how much a claim overstates, narrows, paraphrases, or departs from its evidence. In the MVP, this is only a crude lexical signal. |
| Policy Router | The component that turns verifier output into an action: accept, repair, refute, abstain, request a wider warrant, or send to human review. |
| Repair | A proposed rewrite that narrows or corrects a claim so it better matches the evidence. Repairs are proposals and must themselves be re-verified. |
| Refutation Search | A bounded counter-evidence pass. Instead of searching only for support, the system also asks what would make the claim false, narrower, or misleading. |
| Abstention | A valid outcome where the system refuses to accept or refute a claim because sufficient evidence was not found inside the warrant. Abstention is better than fake certainty. |
| Warrant Exhaustion | The state where the search has reached its allowed budget, scope, or operation limit. The system must stop, report the trace, and abstain or request a wider warrant. |
| Retention Policy | The part of the warrant that defines what may be stored after a run: raw spans, redacted spans, metadata only, hashes only, or nothing. |
| Search Permission vs. Retention Permission | The principle that being allowed to inspect a source during a warrant run does not automatically mean the system may store that source text afterward. |
| Attribution Diff | A report showing what changed since the last evidence run: claims still anchored, spans lost, source hashes changed, verifier decisions changed, or claims needing re-verification. |
| Stale Evidence | Evidence whose source, claim, warrant, span, or verifier context has changed enough that it must be rechecked before reuse. |
| Content Hash / Span Hash | A cryptographic fingerprint of the evidence text. It helps detect whether the exact span still exists, but it does not prove the claim true. |
| Source Hash | A fingerprint of the surrounding source document or extracted text. It helps detect source changes and trigger re-verification. |
| Living Evidence Link | A claim-to-evidence dependency that can be rechecked, invalidated, repaired, or replayed over time. |
| Deterministic Acceptance Boundary | The principle that generation may be stochastic, but acceptance of evidence-bearing claims should be governed by explicit checks, warrants, spans, and policy decisions. |
| Claim Decision | The structured outcome of checking a claim: accepted, repaired, refuted, abstained, marked as synthesis, or sent to review. |
| Load-Bearing Attribution | A citation or evidence span that actually supports the claim it is attached to. The evidence is not decorative; it carries the sentence. |
| “What carries this?” | The central question of the post. Every evidence-bearing claim should be able to point to the warrant, source, span, attribution verdict, and policy action that support it. |
| Build on Rock | The conclusion’s metaphor for evidence-first writing: do not build on citation-shaped sand; build on evidence that can carry the claim. |
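
Several glossary terms (Warrant, Scoped Search Kernel, Search Trace, Warrant Exhaustion, Span Hash) can be made concrete in one short sketch. Everything here is an assumption for illustration: the class and field names, the in-memory `corpus` dictionary standing in for real files, and the simple operation budget are hypothetical. Only the operation names `search_phrase` and `read_window` come from the post.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Warrant:
    claim: str
    allowed_files: frozenset[str]
    allowed_ops: frozenset[str]
    max_ops: int  # hypothetical budget: total kernel operations allowed


class WarrantViolation(Exception):
    """Raised for out-of-scope files or disallowed operations."""


class WarrantExhausted(Exception):
    """Raised when the operation budget is used up."""


class ScopedSearchKernel:
    """The only layer that touches the corpus; every call is checked and traced."""

    def __init__(self, warrant: Warrant, corpus: dict[str, str]):
        self.warrant = warrant
        self._corpus = corpus  # in-memory stand-in for real files
        self.trace: list[tuple[str, str, str]] = []
        self._ops_used = 0

    def _check(self, op: str, path: str) -> None:
        if op not in self.warrant.allowed_ops:
            self.trace.append((op, path, "rejected: operation not allowed"))
            raise WarrantViolation(f"operation {op!r} is outside the warrant")
        if path not in self.warrant.allowed_files:
            self.trace.append((op, path, "rejected: file out of scope"))
            raise WarrantViolation(f"file {path!r} is outside the warrant")
        if self._ops_used >= self.warrant.max_ops:
            self.trace.append((op, path, "rejected: budget exhausted"))
            raise WarrantExhausted("operation budget reached")
        self._ops_used += 1

    def search_phrase(self, path: str, phrase: str) -> list[int]:
        """Return character offsets of exact phrase matches in one file."""
        self._check("search_phrase", path)
        text = self._corpus[path]
        hits = [i for i in range(len(text)) if text.startswith(phrase, i)]
        self.trace.append(("search_phrase", path, f"{len(hits)} hit(s)"))
        return hits

    def read_window(self, path: str, start: int, length: int) -> str:
        """Return a bounded slice of source text (a candidate evidence window)."""
        self._check("read_window", path)
        self.trace.append(("read_window", path, f"{start}:{start + length}"))
        return self._corpus[path][start:start + length]


def span_hash(span: str) -> str:
    """Fingerprint of the exact span text; detects change, proves nothing."""
    return hashlib.sha256(span.encode("utf-8")).hexdigest()


corpus = {"notes.txt": "alpha beta gamma", "secret.txt": "do not read"}
warrant = Warrant(
    claim="the notes mention beta",
    allowed_files=frozenset({"notes.txt"}),
    allowed_ops=frozenset({"search_phrase", "read_window"}),
    max_ops=2,
)
kernel = ScopedSearchKernel(warrant, corpus)
hits = kernel.search_phrase("notes.txt", "beta")
span = kernel.read_window("notes.txt", hits[0], len("beta"))
print(span, span_hash(span)[:12], kernel.trace)
```

The design choice worth noting is that rejected calls are written to the trace before the exception is raised, so a failed or out-of-scope attempt is still part of the record rather than silently dropped.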
Related Posts in This Series
This post sits inside a longer line of work on policy-bounded, evidence-aware AI systems. The articles below develop the surrounding ideas: deterministic acceptance, hallucination energy, trendslop, lightweight critics, and trace-native self-improvement.
| Post | Core Idea | How It Connects to Warranted Search |
|---|---|---|
| Hallucination Energy: A Geometric Foundation for Policy-Bounded AI | Introduces Hallucination Energy as a projection-residual signal between claims and evidence, used as a deterministic policy scalar rather than a truth oracle. | Provides the geometric grounding layer behind the drift / overclaim idea. Warranted Search uses spans and attribution checks, while Hallucination Energy helps estimate how far a claim moves beyond its evidence. |
| From Evidence to Verifiability: Rebuilding Trust in AI Outputs | Argues that the hard problem in high-trust AI is not only model quality, but executable policy: stochastic generation must be governed by deterministic verification rules. | Supplies the policy-first foundation. Warranted Search extends that idea from claim-evidence policy gates into scoped search, warrants, traceable evidence, and claim-level attribution. |
| Applied Policy: How to Incorporate Policy and Hallucination in a Self-Improving System | Develops a trace-native self-improvement loop using multi-objective reward, hallucination energy, embedding margin, policy advantage, and memory consolidation. | Connects evidence traces to learning. The Evidence Engine’s audit traces and evidence memory can become validated training material for better query planning, span extraction, verifier calibration, and repair models. |
| Beyond Hallucination Energy: A Three-Dimensional Framework for Reliable AI Outputs | Expands the failure model beyond hallucination to include trendslop: fluent, safe, generic outputs that fail to respond to the specific problem. | Shows why evidence containment is necessary but not sufficient. Warranted Search checks what carries a claim; trendslop analysis asks whether the output actually responded to the task. |
| Tiny Critics: Lightweight Reasoning Checks for Large AI Systems | Shows how small, cheap critics can flag suspicious reasoning traces using structured features rather than another large LLM. | Points toward lightweight verifier layers inside the Evidence Engine. Not every check needs a frontier model; some warrant, attribution, drift, or trace-quality checks can be small, fast, and interpretable. |
References and Further Reading
| Category | Reference | Why It Matters for This Post | Where It Fits |
|---|---|---|---|
| Core paper | GrepSeek: Training Search Agents for Direct Corpus Interaction | Introduces Direct Corpus Interaction: search agents interacting with a raw corpus through executable operations rather than relying only on an opaque retriever. | Executable search, intelligent grep, auditable search traces |
| Core paper | ACL-Verbatim: Hallucination-Free Question Answering for Research | Applies the VerbatimRAG pattern to research papers and returns verbatim source spans instead of free-form generated answers. | Extractive evidence, span-first answering, abstention |
| Core paper | CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence | Evaluates whether answers are supported by cited evidence regions, not merely whether the answer text is correct. | Attribution verification, Strict Attributed Accuracy, evidence-carrying claims |
| Related system | VerbatimRAG: Build Hallucination-Free RAG with Verbatim | Practical background on the verbatim-span approach: return exact source text rather than generated claims when evidence fidelity matters. | ACL-Verbatim context, extractive QA, evidence spans |
| Foundational RAG paper | Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks | The canonical RAG paper. Useful as the baseline architecture this post critiques: retrieval helps, but retrieval alone is not proof. | Retrieval is not proof, RAG baseline |
| Retrieval for code/documentation | Retrieval Augmented Code Generation and Summarization | Shows retrieval-augmented generation applied to code and documentation tasks, useful background for software-documentation use cases. | Software documentation, engineering knowledge work |
| Benchmark / attribution concept | CiteVQA’s Strict Attributed Accuracy | The key evaluation idea: the answer and the evidence must both be correct. This post adapts that principle from document VQA to text claims. | Attribution tests, claim-span verification |
| Search architecture concept | Direct Corpus Interaction | The corpus becomes the environment. The system searches, reads, filters, and narrows through explicit actions instead of receiving only top-k chunks. | GrepSeek, executable corpus interaction |
| Evidence architecture concept | Verbatim evidence spans | The system returns exact source text before generating prose. This reduces drift between source and claim. | ACL-Verbatim, extractive evidence |
| Governance concept | Warranted search | Search constrained by a claim, source scope, allowed operations, budget, return policy, retention policy, and audit trail. | Warrant, scoped search kernel, privacy |
| Evaluation concept | Attribution failure | The answer may be correct and the citation may be real, but the cited evidence may not support the claim. | Evidence Quicksand |
| Implementation concept | Scoped search kernel | A safe runtime boundary that exposes grep-like operations such as search_phrase and read_window, while rejecting out-of-scope access. | Minimal intelligent grep implementation |
| Implementation concept | Evidence object / EvidenceRecord | A structured record containing claim hash, warrant, source hash, span hash, verifier result, policy action, retention policy, and trace. | Evidence as first-class object |
| Implementation concept | EvidenceSet | A multi-span evidence structure for synthesis claims. It separates sourceable atoms from interpretive joins. | Synthesis decomposition |
| Reliability concept | Refutation search | A bounded counter-evidence pass that asks what would make a claim false, narrower, or misleading. | Overclaim repair, confirmation-bias control |
| Reliability concept | Abstention | A valid system outcome: “not verified inside this warrant.” It is preferable to fake certainty or decorative citation. | Policy router, evidence discipline |
| Memory concept | Evidence memory | Governed storage of validated claim-evidence decisions across runs. It is not a citation cache and not automatic truth. | Long-term evidence infrastructure |
| Privacy concept | Search permission vs. retention permission | Being allowed to inspect a source during a warrant run does not automatically mean the system may store the text afterward. | Warrant policy, evidence memory, privacy |
| Engineering metaphor | Deterministic acceptance boundary | Generation may be stochastic, but accepting evidence-bearing claims should depend on explicit checks, spans, warrants, and policy decisions. | Conclusion, build on rock |
| Central takeaway | “What carries this?” | The key question every evidence-bearing claim should be able to answer: what exact evidence, under what warrant, supports this sentence? | Final conclusion |