A Memory Gate for AI: Policy-Bounded Acceptance in the Executable Cognitive Kernel
Summary
Dynamic AI systems face a hidden failure mode: they can learn from their own mistakes. If every output is allowed into memory, stochastic errors do not stay local; they accumulate.
In earlier posts, I argued that AI systems should not be trusted to enforce their own correctness.
Modern models are stochastic. They produce correct outputs, partially correct outputs, and completely incorrect outputs, but they do not reliably distinguish between them. That means a system that stores everything it generates will eventually learn from its own mistakes.
This post makes that problem concrete inside the Executable Cognitive Kernel (ECK).
It introduces the first operational memory gate in ECK: a policy-bounded acceptance layer that determines what the system is allowed to store. Instead of treating every output as equally eligible for memory, the kernel now verifies outputs, attempts repair when possible, and rejects failures before commit.
The loop changes from:
generate → score → store
to:
generate → verify → repair or reject → commit
On a 100-item run, this produced only a small improvement in raw average F1, from 0.77 to 0.78. But that is not the main result.
The main result is that:
- bad trace admission fell from 0.17 to 0.00
- clean memory rate rose from 0.83 to 1.00
- 44 of 100 outputs were rejected instead of silently stored
That is the point of the memory gate.
The model generates possibilities. Policy determines what becomes memory.
1. From Policy Theory to Runtime Control
Across several earlier posts, I kept returning to one idea:
policy
That was not accidental.
The argument was that stochastic generators should not be responsible for enforcing their own guarantees. In any high-trust system, the thing that produces candidates and the thing that decides what is acceptable should be separated.
That is already how traditional systems work:
- file systems do not enforce their own security; the operating system does
- processes do not enforce their own permissions; the kernel does
- applications do not enforce their own isolation; the runtime does
The pattern is consistent:
critical constraints are enforced externally, not internally
The same principle applies to AI.
If the model is both generator and controller, then hallucinations, invalid structure, and incorrect results all become system problems. Prompting can help. Fine-tuning can help. But neither replaces a hard acceptance boundary outside the model.
That is what policy means here:
an external, deterministic layer that governs what is allowed to pass
Until now, that idea was mostly conceptual. This post turns it into a runnable mechanism inside ECK.
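As a minimal illustration (not ECK code), policy in this sense can be as small as a deterministic predicate that sits entirely outside the model. The field names below match the extraction task used later in the post; everything else is a sketch:

```python
# Hypothetical sketch: policy as an external, deterministic acceptance layer.
# The generator proposes; this boundary decides. Same input, same verdict.
import json

def acceptance_policy(output: str) -> bool:
    """The output must be valid JSON with the required list fields."""
    try:
        obj = json.loads(output)
    except (json.JSONDecodeError, TypeError):
        return False
    return all(isinstance(obj.get(k), list)
               for k in ("persons", "organizations", "locations"))

# The generator can be arbitrarily stochastic; the boundary is not.
print(acceptance_policy('{"persons": [], "organizations": [], "locations": []}'))  # True
print(acceptance_policy('not json'))  # False
```

The point is not the check itself but where it lives: the model never gets to vote on whether its own output passes.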
2. What Changes Inside ECK
The original ECK loop was built around execution:
generate → score → store
That design is enough to support action selection and iterative behavior. It is not enough to protect memory.
If every output is stored, then the system can improve its action policy while still learning from noisy, invalid, or misleading traces. Over time, that contaminates the very memory the system depends on.
So ECK needs two distinct policy layers:
| Policy | Role |
|---|---|
| Action Policy | decides what the system does |
| Acceptance Policy | decides what the system is allowed to store |
flowchart LR
Policy["📜 POLICY<br/>(Normative Layer)<br/>Defines constraints & invariants<br/>'Source of Truth'"] --> Verify
Verify["✅ VERIFICATION<br/>(Enforcement Layer)<br/>Applies policy constraints<br/>'Kernel Gate'"] --> Execute
Execute["⚙️ EXECUTION<br/>(Proposal Layer)<br/>Generates candidate outputs<br/>'Stochastic Engine'"] --> Memory
Memory["📦 MEMORY<br/>(Compliant Store)<br/>Stores ONLY verified outputs<br/>'Trusted State'"]
classDef policy fill:#fff9c4,stroke:#fbc02d,stroke-width:3px,color:#000
classDef verify fill:#ffcc80,stroke:#e65100,stroke-width:3px,color:#000
classDef execute fill:#bbdefb,stroke:#0d47a1,stroke-width:3px,color:#000
classDef memory fill:#a5d6a7,stroke:#1b5e20,stroke-width:3px,color:#000
class Policy policy
class Verify verify
class Execute execute
class Memory memory
The first controls behavior.
The second controls what is allowed to become memory.
This post focuses on the second.
The updated loop is:
generate → verify → repair or reject → commit
That is the architectural shift.
If the first ECK post was about how the system learns to act, this one is about how the system learns to distrust its own bad outputs.
3. Why Dynamic Systems Need a Memory Gate
The need for a memory gate becomes much clearer when we look at how ECK actually behaves over time.
ECK is not a static model.
It is a dynamic system.
- it generates outputs
- it evaluates them
- it stores them
- and it uses stored results to inform future behavior
This means the system is continuously modifying the data it depends on.
The problem: self-contamination
In a static model, errors are isolated.
In a dynamic system, errors can propagate.
If incorrect outputs are stored:
- they become part of memory
- they influence future reasoning
- they get reused in later steps
Over time, this creates a feedback loop:
bad output → stored → reused → reinforced
This is how a system poisons itself.
Not because the model is broken.
But because the system has no boundary around what it is allowed to remember.
Why stochastic generation makes this worse
The underlying model is stochastic:
- it produces variable outputs
- it does not enforce strict correctness
- it cannot guarantee consistency
That means errors are not rare edge cases.
They are a normal part of operation.
Without a control layer, those errors accumulate.
The role of the memory gate
The memory gate breaks this loop.
Instead of allowing all outputs into memory, the system now enforces:
only policy-compliant outputs are allowed to persist
This changes the system from:
- a self-accumulating process
into:
- a policy-regulated process
What the gate actually protects
The memory gate does not make the model correct.
It protects something more important:
- the integrity of memory
- the quality of future learning
- the stability of the system over time
The deeper point
In a static system, policy is useful.
In a dynamic, self-modifying system, policy becomes critical.
The more a system learns from itself, the more it needs a boundary around what it is allowed to learn.
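The dynamic can be sketched with a toy simulation. This is not the ECK experiment: the base error rate and the feedback coefficient are invented, and the only point is the shape of the feedback loop when stored bad traces raise the chance of future bad outputs:

```python
# Toy contamination model (illustrative numbers, not ECK measurements).
import random

def run(gated: bool, steps: int = 1000, seed: int = 0) -> float:
    rng = random.Random(seed)
    memory = []          # stored traces: True = good, False = bad
    base_error = 0.2     # assumed intrinsic error rate of the generator
    for _ in range(steps):
        bad_frac = (memory.count(False) / len(memory)) if memory else 0.0
        p_bad = base_error + 0.5 * bad_frac   # contamination feedback
        output_good = rng.random() >= p_bad
        if output_good or not gated:          # the gate rejects bad outputs
            memory.append(output_good)
    return memory.count(True) / len(memory)   # clean memory rate

print(f"ungated clean rate: {run(gated=False):.2f}")
print(f"gated clean rate:   {run(gated=True):.2f}")
```

The gated run keeps a clean memory by construction; the ungated run stores its own errors, which in turn make further errors more likely.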
4. The Memory Gate
This post introduces the first operational memory-gating prototype inside ECK.
The core idea is simple:
- the model still generates outputs
- verification checks those outputs against policy
- repair is attempted when failure looks recoverable
- commit happens only if verification passes
That turns memory from an append-only log into a governed state boundary.
The behavior of the system is fully determined by a single decision point: verification.
flowchart LR
Model["⚙️ MODEL OUTPUT<br/>(Proposed Result)"] --> Verify
Verify{"✅ VERIFICATION<br/>Policy Check"}
Verify -->|Pass| Commit["📦 COMMIT<br/>Store in Memory"]
Verify -->|Fail + Repairable| Repair["🔧 REPAIR<br/>Generate Fix"]
Verify -->|Fail + Unfixable| Reject["❌ REJECT<br/>Do Not Store"]
Repair --> Model
classDef model fill:#bbdefb,stroke:#0d47a1,stroke-width:2px,color:#000
classDef verify fill:#ffcc80,stroke:#e65100,stroke-width:3px,color:#000
classDef commit fill:#a5d6a7,stroke:#1b5e20,stroke-width:3px,color:#000
classDef repair fill:#ffe082,stroke:#ff6f00,stroke-width:2px,color:#000
classDef reject fill:#ef9a9a,stroke:#b71c1c,stroke-width:3px,color:#000
class Model model
class Verify verify
class Commit commit
class Repair repair
class Reject reject
This is the operational form of policy-bounded acceptance: every output must pass through this gate before it becomes memory.
The crucial mechanism is a single decision:
should_commit = (not self.use_verification) or v.passed
if should_commit:
    self.memory.record(score)
This is the memory gate.
In standard mode, everything is committed.
In verified mode, only policy-compliant outputs are committed.
That sounds like a small change, but it has large consequences.
The model can still be wrong. The system no longer has to remember that it was.
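A stripped-down sketch of that decision point, with the verifier's verdict reduced to a boolean so it runs standalone (the field names mirror the snippet; the verifier itself is a placeholder):

```python
# Minimal commit-gate sketch: the gate is the only difference between modes.
from dataclasses import dataclass, field
from typing import List

@dataclass
class GatedMemory:
    use_verification: bool = True
    traces: List[float] = field(default_factory=list)

    def commit(self, score: float, passed: bool) -> bool:
        should_commit = (not self.use_verification) or passed
        if should_commit:
            self.traces.append(score)
        return should_commit

standard = GatedMemory(use_verification=False)
verified = GatedMemory(use_verification=True)
for score, passed in [(0.9, True), (0.2, False), (0.8, True)]:
    standard.commit(score, passed)
    verified.commit(score, passed)
print(len(standard.traces), len(verified.traces))  # 3 2
```

Same generator, same scores; only the gate differs, and only the verified store stays clean.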
5. The Verification Layer
To make this concrete, the experiment implements policy through explicit constraints.
Policy types
There are two major classes of constraint:
- schema constraints: the output must be valid JSON with required fields
- semantic constraints: the output must match the task well enough to satisfy policy
This creates an external acceptance boundary around the model.
Severity and enforcement
Not all failures are the same.
Some are critical:
- invalid JSON
- missing required fields
- broken structure
Others are semantic:
- hallucinated entities
- missing entities
- wrong entity types
- low F1
That distinction matters because it lets the system separate:
- outputs that must be blocked immediately
- outputs that are worth trying to repair
So verification is not just a binary stop sign. It is also a diagnostic layer.
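A compressed sketch of the two-tier check described above, simplified from the appendix code (the truth dict and outputs here are illustrative):

```python
# Two-tier verification: schema failures are critical (block immediately),
# semantic failures are warnings (worth attempting repair).
import json

def verify(output: str, truth: dict) -> tuple:
    """Returns (passed, severity): severity is 'critical', 'warning', or 'none'."""
    try:
        obj = json.loads(output)
    except (json.JSONDecodeError, TypeError):
        return False, "critical"            # invalid JSON: block
    for key in ("persons", "organizations", "locations"):
        if not isinstance(obj.get(key), list):
            return False, "critical"        # broken structure: block
    for key, expected in truth.items():
        if set(obj.get(key, [])) != set(expected):
            return False, "warning"         # semantic miss: repairable
    return True, "none"

truth = {"persons": ["Alice"], "organizations": [], "locations": []}
print(verify("oops", truth))                                              # (False, 'critical')
print(verify('{"persons": ["he"], "organizations": [], "locations": []}', truth))   # (False, 'warning')
print(verify('{"persons": ["Alice"], "organizations": [], "locations": []}', truth))  # (True, 'none')
```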
6. The Repair Loop
The repair step is one of the most important parts of the design.
When an output fails verification, the system does not immediately give up. Instead, it feeds the failure back into the model in a constrained way.
The repair prompt includes:
- the previous output
- the list of violations
- an instruction to fix specific problems
That creates a bounded correction loop:
attempt → verify → repair → verify
This matters because verification is doing two jobs at once:
- filtering bad outputs
- diagnosing repairable ones
In other words, verification is not only a rejection mechanism. It is also a controlled self-correction mechanism.
That is what makes the kernel more than a passive validator.
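The control flow can be sketched with a scripted stand-in for the model, so the attempt → verify → repair loop runs without an LLM; in ECK the repair prompt carries the violation list back to the real model:

```python
# Bounded repair loop. The stub "model" is wrong on attempt 0 and fixes
# itself once it sees a violation list; all names here are illustrative.
MAX_ATTEMPTS = 3

def bounded_repair(generate, verify):
    output, violations = None, []
    for attempt in range(MAX_ATTEMPTS):
        output = generate(attempt, output, violations)
        ok, violations = verify(output)
        if ok:
            return output, attempt, True      # committed
    return output, MAX_ATTEMPTS - 1, False    # rejected after repeated failure

def stub_generate(attempt, prev, violations):
    return "bad" if attempt == 0 else "good"

def stub_verify(output):
    return (output == "good", [] if output == "good" else ["wrong output"])

result, attempts, committed = bounded_repair(stub_generate, stub_verify)
print(result, attempts, committed)  # good 1 True
```

The loop is bounded: either an attempt passes and is committed, or the budget runs out and the trace is rejected rather than stored.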
7. The Experiment
To test whether policy-bounded acceptance actually changes system behavior, I built a minimal ECK experiment with two modes:
- standard mode: outputs are always stored
- verified mode: outputs must pass policy before being stored
Everything else remains the same.
What stayed fixed
I did not change:
- the model
- the prompts
- the dataset slice
The only change was whether outputs were allowed into memory unconditionally or only after verification.
Task setup
The task is structured extraction.
Input is raw text.
Output is JSON with:
- persons
- organizations
- locations
This is a good test case because it exposes both structural and semantic failure modes:
- invalid formatting
- hallucinated entities
- missing entities
- wrong entity typing
It also makes verification measurable.
Metrics
The experiment tracks three kinds of outcome.
Task quality
- average F1
Memory quality
- bad trace admission rate
- clean memory rate
- number of stored traces
System behavior
- retries
- repair attempts
- rejection rate
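For concreteness, here is how the memory-quality metrics can be computed from a run log. The `(f1, committed)` records below are made up for illustration; 0.5 is the bad-trace threshold used in the experiment:

```python
# Memory-quality metrics over a toy run log of (f1, committed) records.
records = [(0.9, True), (0.2, False), (0.8, True), (0.3, True), (0.1, False)]

committed = [f1 for f1, c in records if c]
bad_trace_admission = sum(f1 < 0.5 for f1 in committed) / len(committed)
clean_memory_rate = sum(f1 >= 0.5 for f1 in committed) / len(committed)
rejection_rate = sum(not c for _, c in records) / len(records)

print(f"bad trace admission: {bad_trace_admission:.2f}")  # one bad trace slipped in
print(f"clean memory rate:   {clean_memory_rate:.2f}")
print(f"rejection rate:      {rejection_rate:.2f}")
```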
Expected tradeoff
Before running the experiment, the expected tradeoff was clear:
- more retries
- fewer stored traces
- cleaner retained memory
- possible quality improvement among committed outputs
That is exactly the kind of tradeoff a memory gate should create.
8. Results
Both modes were run on the same fixed 100-item slice.
Summary
| Metric | Standard | Verified |
|---|---|---|
| Average F1 | 0.77 | 0.78 |
| Stored Traces | 100 | 56 |
| Rejections | 0 | 44 |
| Bad Trace Admission | 0.17 | 0.00 |
| Clean Memory Rate | 0.83 | 1.00 |
What changed
Nothing about the model changed.
Nothing about the prompts changed.
Nothing about the data changed.
Only one thing changed:
the system became selective about what it accepts
That selectivity produced three effects.
1. Repair before acceptance
Outputs that failed verification were given a structured chance to improve.
2. Rejection of repeated failures
Outputs that continued to fail were not committed.
3. Memory became stricter
Verified mode stored fewer traces because it refused to preserve non-compliant outputs.
9. Why the Main Result Is Not F1
The raw F1 gain on 100 items is small: +0.01.
That matters, but it is not the real story of this experiment.
The real story is selectivity.
A standard kernel commits everything, including outputs it should not trust.
A verified kernel rejects nearly half its outputs in order to keep memory clean.
That is not a throughput failure. It is evidence that the acceptance boundary is doing work.
The most important result is this:
bad trace admission dropped to zero
That is the real systems result.
Because ECK is not only generating outputs. It is building memory.
If invalid traces are admitted, the system learns from corrupted evidence.
If invalid traces are blocked, memory becomes a more trustworthy base for future learning.
So the correct interpretation is not:
verification makes the model better
It is:
verification changes what the system is willing to preserve
That is the deeper architectural contribution.
10. Case Studies
A few concrete cases make the behavior clearer.
Case 1: Standard mode stores a hallucinated entity
Input contains a phrase like:
... he said the meeting would continue ...
The model outputs:
{
"persons": ["he"],
"organizations": [],
"locations": []
}
This is a hallucinated person entity.
In standard mode, the output fails verification but is still committed.
That means the system stores a trace it already has evidence against.
Case 2: Verified mode rejects the same failure
On the same kind of input, verified mode retries and still gets the same incorrect output:
{
"persons": ["he"],
"organizations": [],
"locations": []
}
After repeated failure, the trace is rejected.
The model remains imperfect.
The memory does not inherit that imperfection.
Case 3: Verified mode repairs an incorrect type assignment
Input contains:
... reporting from London Newsroom ...
Initial output:
{
"persons": ["London Newsroom"],
"organizations": [],
"locations": []
}
Verification detects the type error and the missing organization.
A repair prompt is issued.
Repair output:
{
"persons": [],
"organizations": ["London Newsroom"],
"locations": []
}
That passes verification and is committed.
This shows the full intended behavior:
failure → feedback → repair → validation → commit
These examples capture the core difference:
| Behavior | Standard | Verified |
|---|---|---|
| Accept incorrect outputs | Yes | No |
| Attempt structured repair | Limited | Yes |
| Reject repeated failures | No | Yes |
| Store only valid traces | No | Yes |
11. What This Prototype Shows
This post is not the final theory of policy in AI systems.
It is a concrete ECK case study of policy-bounded acceptance.
It shows that the broader policy idea can be operationalized as a simple runtime mechanism:
- define external constraints
- verify outputs against them
- repair what can be repaired
- reject what cannot
- gate memory at commit time
That is enough to materially change system behavior.
This prototype does not prove that verification always improves benchmark accuracy.
It does show something narrower and more important for ECK:
policy changes what becomes memory
And once that changes, the system itself changes.
12. Limitations
This is a deliberately small and controlled prototype.
A 100-item run is useful for showing the behavior of the memory gate, but it is not enough to support broad benchmark claims.
The exact magnitude of the F1 effect will vary with:
- slice composition
- task difficulty
- repair prompt quality
- threshold choice
The main claim here is architectural, not universal:
if a system is allowed to learn from its own outputs, then the boundary around what it is allowed to store becomes a first-class design problem
That is what this experiment demonstrates.
This integrity is not free
The gate increases retries, repair attempts, and rejected outputs, trading throughput for cleaner memory.
In the demo output, we see:
- 44 rejections out of 100
- 113 repair attempts
- a 0.28 repair success rate
13. Where This Goes Next
This prototype opens the door to several important follow-up questions:
- richer domain-specific policy definitions
- tool-backed verification
- adaptive thresholds
- policy learning
- verification over multi-step reasoning
- long-horizon experiments comparing gated vs ungated memory over time
That last one is especially important.
The strongest future test is not whether a memory gate improves one batch of outputs.
It is whether a system without a memory gate degrades over time while a system with one remains stable.
That is where this idea becomes much bigger than a filtering mechanism.
Conclusion
This post introduced the first operational memory-gating prototype inside ECK.
It changed one thing:
what the system is allowed to store
That change produced:
- perfect clean-memory rate
- zero bad trace admission
- explicit rejection of 44 outputs that would otherwise have entered memory
That is the contribution.
ECK now has two distinct policy layers:
| Policy | Role |
|---|---|
| Action Policy | decides what the system does |
| Acceptance Policy | decides what the system learns from |
Both are necessary.
Without action policy, the system cannot explore.
Without acceptance policy, it cannot trust its own memory.
The first ECK post argued that systems can improve through execution.
This post adds the missing condition:
they must govern what becomes memory before they can improve safely
Closing line
The model generates possibilities. Policy determines what becomes memory.
Appendix: Running the Demo
The code below implements the ECK v2 architecture described in this post. It is self-contained and runnable, assuming a local Ollama server with the mistral model and the datasets package installed.
# ==============================================================================
# ECK v2: Policy-Bounded Verifiable Intelligence (SEMANTICALLY ALIGNED)
# ==============================================================================
from __future__ import annotations
import json
import random
import time
import requests
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from enum import Enum
from statistics import mean
from typing import Any, Dict, List
from datasets import load_dataset
# ------------------------------------------------------------------------------
# 1. Core
# ------------------------------------------------------------------------------
class SeverityLevel(Enum):
    CRITICAL = "critical"
    WARNING = "warning"

class ConstraintType(Enum):
    SCHEMA = "schema"
    LOGIC = "logic"
@dataclass
class ConstraintViolation:
    constraint_id: str
    message: str
    severity: SeverityLevel

@dataclass
class VerificationResult:
    passed: bool
    violations: List[ConstraintViolation] = field(default_factory=list)
    confidence: float = 1.0
    is_blocking: bool = False

    def __post_init__(self):
        if any(v.severity == SeverityLevel.CRITICAL for v in self.violations):
            self.passed = False
            self.is_blocking = True

@dataclass
class ExecutionTrace:
    task: str
    action: str
    result: Any
    context: Dict[str, Any]
    constraint_id: str = ""
class Logger:
    def stage(self, msg):
        print(f"\n📦 {msg}")

    def attempt(self, n, action):
        print(f"  🔄 Attempt {n} (action={action})")

    def llm_start(self, model):
        print(f"    🧠 LLM CALL ({model})...")

    def llm_end(self, latency, output):
        print(f"    ⏱️ {latency:.2f}s")
        print(f"    📤 {output[:120].replace(chr(10), ' ')}...")

    def verify(self, result):
        print(f"  🔍 Verification: {'PASS' if result.passed else 'FAIL'}")
        for v in result.violations:
            print(f"     - {v.message} ({v.severity.value})")

    def score(self, f1):
        print(f"  🎯 F1 Score: {f1:.2f}")

    def commit(self, yes):
        print(f"  💾 Commit: {'YES' if yes else 'NO'}")

log = Logger()
# ------------------------------------------------------------------------------
# 2. Policy
# ------------------------------------------------------------------------------
class ConstraintPolicy:
    def __init__(self):
        self.constraints = []

    def define_constraint(self, ctype, severity, rule):
        self.constraints.append({
            "type": ctype,
            "severity": severity,
            "rule": rule,
        })
# ------------------------------------------------------------------------------
# 3. Memory
# ------------------------------------------------------------------------------
class Memory:
    def __init__(self):
        self.data = []

    def record(self, score):
        self.data.append(score)
# ------------------------------------------------------------------------------
# 4. Verification
# ------------------------------------------------------------------------------
class ConstraintEvaluator(ABC):
    @abstractmethod
    def evaluate(self, trace: ExecutionTrace, rule: Dict) -> VerificationResult:
        pass

class SchemaEvaluator(ConstraintEvaluator):
    def evaluate(self, trace, rule):
        try:
            obj = json.loads(trace.result) if isinstance(trace.result, str) else trace.result
        except (json.JSONDecodeError, ValueError, TypeError):
            return VerificationResult(False, [ConstraintViolation("schema", "Invalid JSON", SeverityLevel.CRITICAL)])
        for f in ["persons", "organizations", "locations"]:
            if f not in obj or not isinstance(obj[f], list):
                return VerificationResult(False, [ConstraintViolation("schema", f"Missing or invalid {f}", SeverityLevel.CRITICAL)])
        return VerificationResult(True)
class SemanticEvaluator(ConstraintEvaluator):
    def evaluate(self, trace, rule):
        truth = trace.context["truth"]
        pred = parse_prediction(trace.result)
        if pred is None:
            return VerificationResult(
                passed=False,
                violations=[
                    ConstraintViolation(
                        "semantic",
                        "Output is not valid JSON",
                        SeverityLevel.CRITICAL,
                    )
                ],
                confidence=0.0,
                is_blocking=True,
            )
        violations = []
        for key in ["persons", "organizations", "locations"]:
            pred_set = set(x.lower() for x in pred.get(key, []))
            truth_set = set(x.lower() for x in truth.get(key, []))
            for item in pred_set - truth_set:
                violations.append(
                    ConstraintViolation(
                        "semantic",
                        f"Hallucinated {key}: {item}",
                        SeverityLevel.WARNING,
                    )
                )
            for item in truth_set - pred_set:
                violations.append(
                    ConstraintViolation(
                        "semantic",
                        f"Missing {key}: {item}",
                        SeverityLevel.WARNING,
                    )
                )
        f1 = compute_f1(pred, truth)
        if f1 < rule["min_f1"]:
            violations.append(
                ConstraintViolation(
                    "semantic",
                    f"Low F1={f1:.2f}",
                    SeverityLevel.WARNING,
                )
            )
        # IMPORTANT:
        # semantic violations should FAIL verification,
        # but not necessarily block execution
        if violations:
            return VerificationResult(
                passed=False,
                violations=violations,
                confidence=f1,
                is_blocking=False,
            )
        return VerificationResult(
            passed=True,
            violations=[],
            confidence=f1,
            is_blocking=False,
        )
class Verifier:
    def __init__(self, policy):
        self.policy = policy
        self.schema = SchemaEvaluator()
        self.semantic = SemanticEvaluator()

    def verify(self, trace):
        results = []
        for c in self.policy.constraints:
            if c["type"] == ConstraintType.SCHEMA:
                results.append(self.schema.evaluate(trace, c["rule"]))
            else:
                results.append(self.semantic.evaluate(trace, c["rule"]))
        final = VerificationResult(True)
        for r in results:
            if not r.passed:
                final.passed = False
                final.violations.extend(r.violations)
                if r.is_blocking:
                    final.is_blocking = True
        return final
# ------------------------------------------------------------------------------
# 5. LLM
# ------------------------------------------------------------------------------
def run_llm(prompt, log):
    log.llm_start("mistral")
    start = time.time()
    try:
        res = requests.post(
            "http://localhost:11434/api/generate",
            json={"model": "mistral", "prompt": prompt, "stream": False},
            timeout=120,
        )
        res.raise_for_status()
        output = res.json().get("response", "")
    except requests.exceptions.RequestException as e:
        output = f"ERROR: {e}"
    log.llm_end(time.time() - start, output)
    return output
def build_initial_prompt(text):
    return f"""
You are a named entity extraction system.
Extract ALL named entities EXACTLY as written.
Return STRICT JSON:
{{
"persons": [],
"organizations": [],
"locations": []
}}
Rules:
- Preserve exact text spans
- No guessing
- No explanation
- Output ONLY JSON
Text:
{text}
""".strip()
def build_repair_prompt(text, previous_output, violations):
    issues = "\n".join(f"- {v.message}" for v in violations)
    return f"""
You previously returned:
{previous_output}
It failed for the following reasons:
{issues}
Fix the output. Do not repeat the same mistakes.
Rules:
- Keep correct entities
- Remove hallucinated entities
- Add missing entities ONLY if they appear exactly in text
- Do NOT guess
- Output ONLY valid JSON
Schema:
{{
"persons": [],
"organizations": [],
"locations": []
}}
Text:
{text}
""".strip()
# ------------------------------------------------------------------------------
# 6. Scoring
# ------------------------------------------------------------------------------
def parse_prediction(result):
    if not result:
        return None
    text = result.strip()
    # Remove markdown fences
    if "```" in text:
        parts = text.split("```")
        for part in parts:
            part = part.strip()
            if part.startswith("{") and part.endswith("}"):
                text = part
                break
    # Try direct parse
    try:
        return json.loads(text)
    except (json.JSONDecodeError, ValueError):
        pass  # fall through to substring extraction
    # Try extracting JSON substring
    start = text.find("{")
    end = text.rfind("}")
    if start != -1 and end != -1:
        try:
            return json.loads(text[start:end + 1])
        except (json.JSONDecodeError, ValueError):
            return None
    return None
def compute_f1(pred, truth):
    def norm(x):
        return set(i.lower() for i in x)
    scores = []
    for k in truth:
        p, t = norm(pred.get(k, [])), norm(truth[k])
        if not p and not t:
            scores.append(1)
            continue
        inter = len(p & t)
        prec = inter / len(p) if p else 0
        rec = inter / len(t) if t else 0
        scores.append(0 if prec + rec == 0 else 2 * prec * rec / (prec + rec))
    return sum(scores) / len(scores)
# ------------------------------------------------------------------------------
# 7. Ground Truth
# ------------------------------------------------------------------------------
def build_truth(item):
    tokens, tags = item["tokens"], item["ner_tags"]
    mapping = {1: "persons", 2: "persons", 3: "organizations",
               4: "organizations", 5: "locations", 6: "locations"}
    out = {"persons": [], "organizations": [], "locations": []}
    cur, typ = [], None
    for tok, tag in zip(tokens, tags):
        if tag in mapping:
            t = mapping[tag]
            if tag in [1, 3, 5]:
                if cur:
                    out[typ].append(" ".join(cur))
                cur, typ = [tok], t
            else:
                cur.append(tok)
        else:
            if cur:
                out[typ].append(" ".join(cur))
            cur, typ = [], None
    if cur:
        out[typ].append(" ".join(cur))
    return out
# ------------------------------------------------------------------------------
# 8. Kernel
# ------------------------------------------------------------------------------
class Kernel:
    def __init__(self, verifier, memory, use_verification=True):
        self.verifier = verifier
        self.memory = memory
        self.use_verification = use_verification
        self.retry_count = 0
        self.blocking_failures = 0
        self.semantic_failures = 0
        self.repair_prompt_uses = 0
        self.pass_count = 0
        self.total_runs = 0
        self.commit_count = 0
        self.reject_count = 0
        self.failed_verifications = 0
        self.repair_success_count = 0
        self.initial_failures = 0
        self.committed_scores: List[float] = []

    def run(self, text, truth, log: Logger):
        self.total_runs += 1
        final_result = None
        v = VerificationResult(True)
        first_attempt_failed = False
        repaired_successfully = False
        for attempt in range(3):
            action_name = "extract" if attempt == 0 else "repair"
            log.attempt(attempt, action_name)
            if attempt == 0:
                prompt = build_initial_prompt(text)
                print("    📝 Using initial prompt")
            else:
                prompt = build_repair_prompt(text, final_result, v.violations)
                self.repair_prompt_uses += 1
                print("    📝 Using repair prompt")
            result = run_llm(prompt, log)
            trace = ExecutionTrace(
                task="extract",
                action=action_name,
                result=result,
                context={"truth": truth},
            )
            v = self.verifier.verify(trace)
            log.verify(v)
            if not v.passed:
                self.failed_verifications += 1
                if attempt == 0:
                    self.initial_failures += 1
                    first_attempt_failed = True
                if v.is_blocking:
                    self.blocking_failures += 1
                else:
                    self.semantic_failures += 1
            final_result = result
            if v.passed:
                self.pass_count += 1
                if first_attempt_failed:
                    repaired_successfully = True
                    self.repair_success_count += 1
                break
            if attempt < 2:
                self.retry_count += 1
        pred = parse_prediction(final_result)
        score = compute_f1(pred, truth) if pred else 0.0
        log.score(score)
        should_commit = (not self.use_verification) or v.passed
        log.commit(should_commit)
        if should_commit:
            self.memory.record(score)
            self.committed_scores.append(score)
            self.commit_count += 1
        else:
            self.reject_count += 1
        return score
# ------------------------------------------------------------------------------
# 9. Experiment
# ------------------------------------------------------------------------------
def run_experiment(mode, eval_items):
    policy = ConstraintPolicy()
    policy.define_constraint(ConstraintType.SCHEMA, SeverityLevel.CRITICAL, {})
    policy.define_constraint(ConstraintType.LOGIC, SeverityLevel.WARNING, {"min_f1": 0.5})
    verifier = Verifier(policy)
    memory = Memory()
    kernel = Kernel(verifier, memory, use_verification=(mode == "verified"))
    scores = []
    for i, item in enumerate(eval_items):
        log.stage(f"Task {i+1}/{len(eval_items)}")
        text = " ".join(item["tokens"])
        truth = build_truth(item)
        print(f"  Input: {text[:100]}...")
        score = kernel.run(text, truth, log)
        scores.append(score)
    avg_f1 = mean(scores) if scores else 0.0
    committed_scores = kernel.committed_scores
    bad_committed = sum(1 for s in committed_scores if s < 0.5)
    clean_committed = sum(1 for s in committed_scores if s >= 0.5)
    n_committed = len(committed_scores)
    bad_trace_admission = (
        bad_committed / n_committed if n_committed else 0.0
    )
    repair_success_rate = (
        kernel.repair_success_count / kernel.initial_failures
        if kernel.initial_failures else 0.0
    )
    rejection_rate = (
        kernel.reject_count / kernel.total_runs
        if kernel.total_runs else 0.0
    )
    clean_memory_rate = (
        clean_committed / n_committed if n_committed else 0.0
    )
    print(f"\n📊 {mode.upper()} SUMMARY")
    print(f"  Avg F1: {avg_f1:.2f}")
    print(f"  Stored traces: {len(memory.data)}")
    print("\n  🔧 SYSTEM METRICS")
    print(f"  Total Runs: {kernel.total_runs}")
    print(f"  Commits: {kernel.commit_count}")
    print(f"  Rejections: {kernel.reject_count}")
    print("\n  🔍 VERIFICATION")
    print(f"  Failed Verifications: {kernel.failed_verifications}")
    print(f"  Blocking Failures: {kernel.blocking_failures}")
    print(f"  Semantic Failures: {kernel.semantic_failures}")
    print("\n  🔁 REPAIR")
    print(f"  Repair Attempts: {kernel.repair_prompt_uses}")
    print(f"  Repair Success Rate: {repair_success_rate:.2f}")
    print("\n  🧪 QUALITY")
    print(f"  Bad Trace Admission: {bad_trace_admission:.2f}")
    print(f"  Rejection Rate: {rejection_rate:.2f}")
    print(f"  Clean Memory Rate: {clean_memory_rate:.2f}")
    return {
        "avg_f1": avg_f1,
        "bad_trace_admission": bad_trace_admission,
        "repair_success_rate": repair_success_rate,
        "rejection_rate": rejection_rate,
        "clean_memory_rate": clean_memory_rate,
    }
# ------------------------------------------------------------------------------
# 10. Main
# ------------------------------------------------------------------------------
if __name__ == "__main__":
    random.seed(42)
    dataset = load_dataset("conll2003", split="train", trust_remote_code=True)
    data = [dataset[i] for i in range(1000)]
    # Freeze the exact evaluation slice ONCE
    eval_items = random.sample(data, 100)
    print("\n🔵 STANDARD")
    res_std = run_experiment("standard", eval_items)
    print("\n🟢 VERIFIED")
    res_ver = run_experiment("verified", eval_items)
    print("\nRESULT:")
    print(f"Standard F1: {res_std['avg_f1']:.2f}")
    print(f"Verified F1: {res_ver['avg_f1']:.2f}")
    print(f"Delta: {res_ver['avg_f1'] - res_std['avg_f1']:+.2f}")
Deployment Instructions
1. Run the Demo
pip install "datasets<4"
pip install requests
python eck_verified_demo.py
Expected Output:
📊 VERIFIED SUMMARY
Avg F1: 0.78
Stored traces: 56
🔧 SYSTEM METRICS
Total Runs: 100
Commits: 56
Rejections: 44
🔍 VERIFICATION
Failed Verifications: 157
Blocking Failures: 21
Semantic Failures: 136
🔁 REPAIR
Repair Attempts: 113
Repair Success Rate: 0.28
🧪 QUALITY
Bad Trace Admission: 0.00
Rejection Rate: 0.44
Clean Memory Rate: 1.00
RESULT:
Standard F1: 0.77
Verified F1: 0.78
Delta: +0.01