The Moment: Intelligence beyond context

Most AI systems answer and move on. The next step is different: preserve the reasoning state, replay it, measure whether it improves, and keep only what survives verification.

Summary

Most AI workflows still treat intelligence as a single pass.

You ask a question. The model answers. Maybe you ask it to try again. Maybe you add more context. Maybe you save something to memory.

But the basic shape remains the same:

input → answer

That is not enough.

The deeper opportunity is not simply to make the first answer better. It is to preserve the entire reasoning situation (the input, the objective, the constraints, the tools, the evidence, the failed attempts, the feedback) and then replay that situation under better conditions.

That preserved reasoning situation is what I call the moment.

A moment is not a prompt. A prompt is just the visible instruction. A moment is the full execution state of an intelligent act:

Moment =
state
+ objective
+ evidence
+ tools
+ constraints
+ action space
+ feedback
+ score

The purpose of capturing a moment is not nostalgia. It is amplification.

A human gets one imperfect pass through most moments. We have a conversation once. We make a decision once. We follow a line of reasoning once. Afterwards, memory degrades. We reconstruct. We rationalize. We lose the exact state.

AI systems do not have to work that way.

An AI can preserve a moment. It can reopen it. It can bring in more evidence. It can test alternatives. It can score each attempt. It can detect when an answer is improving. It can detect when the answer is drifting. It can stop. And if the replay works, it can extract a reusable rule for the next similar moment.

That is the difference between memory and experience.

Memory says:

I remember what happened.

Experience says:

I act differently because of what happened.

The moment is the bridge between the two.

This also gives us a way to think about hallucination more precisely.

In previous AI work, I have described hallucination as what happens when a model operates outside the evidence well. It keeps generating after the evidence has run out. It sounds confident, but it is no longer constrained by what can be checked.

Moment replay is the opposite move.

It is not uncontrolled expansion beyond the evidence. It is controlled amplification inside a bounded state.

The model is allowed to search. It is allowed to propose. It is allowed to go beyond the obvious first answer. But every replay must be measured against the moment's objective, evidence, constraints, and stop conditions.

Without measurement, it is just prompting.

Without a stop condition, it is over-optimization.

Without rule extraction, it is not learning.

So the basic architecture becomes:

%%{init: {'theme': 'base', 'themeVariables': { 'fontSize': '16px'}}}%%
flowchart LR
    A["📸 Capture<br/>the Moment"]
    B["🔄 Replay<br/>the Moment"]
    C["📊 Measure<br/>Improvement"]
    D["🧭 Detect<br/>Drift"]
    E["🛑 Stop at<br/>Useful Peak"]
    F["🧠 Extract<br/>Reusable Rules"]

    A --> B --> C --> D --> E --> F

    classDef momentNode fill:#1e1e2f,stroke:#cba6f7,stroke-width:3px,color:#cdd6f4,rx:16,ry:16,font-weight:bold
    class A,B,C,D,E,F momentNode

    style A fill:#2a1e3f,stroke:#cba6f7
    style B fill:#1e2a3f,stroke:#89b4fa
    style C fill:#1e3f2a,stroke:#a6e3a1
    style D fill:#3f2a1e,stroke:#fab387
    style E fill:#3f1e2a,stroke:#f38ba8
    style F fill:#2a3f1e,stroke:#a6da95
  

This is a generic approach. It should apply to writing, coding, research, design, planning, and decision-making.

But if we introduce it only through writing, it will sound subjective. Better writing is arguable. A better blog post can always be debated.

So we need a harder test.

We need a domain where improvement can be checked.

That is why the first implementation should be mathematical proof.

In a proof assistant like Lean or Coq, the moment is not vague. It is a proof state. The system knows the current goal, hypotheses, imported definitions, available tactics, failed attempts, and remaining obligations. It can tell us whether a proposed next step is valid.

That makes formal proof the ideal laboratory for the moment.

Not because the moment is a proof concept.

It is not.

A proof state is one implementation of the moment.

The concept is broader: preserve the reasoning state, replay it under better conditions, measure whether it tightens, and compile what worked.

Mathematics gives us the cleanest way to show that this is real.

If moment replay can help an AI move a proof state closer to a verified theorem, then we are no longer talking about vibes. We are talking about a measurable system for amplified reasoning.

The question becomes:

Can AI preserve a reasoning moment,
replay it against known evidence,
and improve it without drifting into hallucination?

That is the experiment.

And proof is where we can test it.


1. Humans Lose the Moment. Proof Assistants Preserve It.

Most human reasoning disappears as soon as it happens.

You try an approach. It almost works. You remember that there was a lemma somewhere that felt relevant. You change notation. You follow a side path. You come back later and the exact shape of the problem has shifted in your head.

The moment is gone.

That is how humans reason most of the time. We do not preserve the full state of our own thinking. We preserve fragments: a feeling, a failed attempt, a partial analogy, the memory of being close.

Formal proof assistants do something different.

Lean and Coq preserve the moment.

At every stage of a proof, they know the current goal. They know the local hypotheses. They know the imported definitions. They know the remaining obligations. When a tactic fails, they report the error. They can tell you whether the next step is valid.
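
To make "the proof assistant preserves the moment" concrete, here is a deliberately tiny Lean 4 example that uses only the core library. The theorem itself is a toy chosen for illustration; the comments show the proof state Lean holds before and after one step.

```lean
-- Lean 4, core library only. A toy goal chosen for illustration.
example (a b : Nat) (h : a = b) : b + 0 = a := by
  -- Proof state at this point:
  --   a b : Nat
  --   h : a = b
  --   ⊢ b + 0 = a
  rw [Nat.add_zero]
  -- Proof state after the rewrite:
  --   ⊢ b = a
  exact h.symm
```

Replacing the last line with `exact h` is rejected, because `h` proves `a = b` and the goal is `b = a`. The state, the failed move, and the reason are all available for inspection.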

That is extraordinary.

But it does not solve the whole problem.

Lean can verify the move you give it. Coq can reject an invalid step. A proof assistant can preserve the state and enforce correctness.

But it still usually needs the human to decide where to go next.

The bottleneck is not verification.

The bottleneck is navigation.

A proof state may be extremely close to a known theorem, lemma, or intermediate result. It may live in a mathematical neighbourhood that has already been explored. But unless the human recognizes that neighbourhood, the proof can stall.

This is where the broader idea of the moment becomes useful.

A moment is not just a prompt. It is a bounded state of reasoning:

state
+ objective
+ evidence
+ tools
+ constraints
+ action space
+ feedback

In writing, a moment might be a weak paragraph and the intent behind it.

In coding, a moment might be a failing test, a patch, and the error trace.

In research, a moment might be a question, a set of papers, and a contradiction.

In formal mathematics, the moment becomes unusually clean.

It becomes a proof state.

That is why proof assistants are the right place to test this idea first. Not because the moment is only about proof. It is not. But because proof gives us something most AI systems lack: a hard verifier.

The system can propose.

Lean can check.

The AI can replay.

The proof state can tell us whether the replay tightened the reasoning or drifted away from it.

That is the difference between uncontrolled generation and amplified reasoning.

Hallucination is what happens when an AI keeps moving after it has left the evidence.

Moment replay is the opposite.

It preserves the reasoning state, searches nearby validated knowledge, proposes the next move, checks whether the move is real, and tries again only while the proof is getting tighter.

That makes proof the ideal laboratory for the moment.

Because here, improvement is not just a matter of taste.

Either the proof state advances, or it does not.


2. The Problem: Formal Proof Is Correct but Not Enough

Formal proof assistants are very good at saying no.

No, that tactic does not solve the goal.

No, that rewrite does not apply.

No, that term does not typecheck.

No, the proof is not complete.

That is their strength. They give mathematics a hard boundary. They prevent the model from drifting into plausible nonsense. They do not care whether a step sounds elegant, confident, or familiar. Either the step is valid, or it is not.

That makes Lean and Coq powerful antidotes to one kind of hallucination: fluent reasoning that cannot survive verification.

But correctness is not the same thing as guidance.

A proof assistant can tell you whether a move works. It cannot, by itself, always tell you which move is promising.

It can preserve the proof state, but the proof state alone does not automatically provide a map of the surrounding mathematical territory.

This is where retrieval systems become important.

Modern theorem-search and proof-search systems already begin to address this. They can retrieve relevant premises, search theorem libraries, and in some cases embed live proof states to find nearby declarations. That matters. It means semantic navigation is no longer imaginary.

But retrieval alone does not close the loop.

A retrieved lemma may be nearby but useless.

A premise may share the right symbols but require a hypothesis the proof state does not have.

A theorem may look semantically close but create more subgoals than it resolves.

A tactic may be valid but fail to tighten the proof.

So the gap is not simply:

Can we find nearby mathematics?

Increasingly, yes, we can.

The harder question is:

Did the retrieved neighbour actually move the proof state closer to completion?

That distinction matters.

Verification asks:

Is this step valid?

Retrieval asks:

What nearby results might be relevant?

Moment replay asks:

Did using those results tighten the proof state,
and should we try this pattern again?

Lean and Coq are excellent at verification.

Retrieval systems provide candidate navigation.

The moment architecture is about closing the loop between the two.

The proof assistant remains the judge.

The retrieval system supplies the map.

The AI proposes candidate moves.

The replay loop measures whether those moves helped.

A proof state may be close to a known lemma, theorem, or intermediate result. It may live in a mathematical neighbourhood that has already been explored. But surfacing a neighbour is only the first step.

The system still has to test whether that neighbour is proof-useful.

That is the important shift.

Not:

This theorem is semantically similar.

But:

This theorem increased expected tightening for this kind of proof moment.

That is where moment replay becomes useful.

It treats semantic neighbours as candidates, not answers.

It injects them into the proof attempt.

It lets Lean or Coq verify the proposed move.

It scores whether the state tightened.

It replays only while progress is real.

And if a neighbour repeatedly helps similar proof states, the system compiles that into a reusable rule.

So the problem is not that formal proof is weak.

The problem is that verification, retrieval, replay, scoring, stopping, and rule extraction are usually separate pieces.

Moment replay is the attempt to wire them together.

That is how formal proof becomes more than a correctness checker or a retrieval problem.

It becomes a laboratory for amplified reasoning.


3. The Core Idea: A Proof State Is a Moment

Now we can make the idea concrete.

A moment is not a mood. It is not a memory. It is not a prompt with a nicer name.

In this implementation, a moment is a proof-state object.

Proof Moment =
goal
+ hypotheses
+ local context
+ imported libraries
+ attempted tactics
+ error messages
+ retrieved neighbours
+ verifier feedback
+ score

The goal tells us what remains to be proved.

The hypotheses tell us what we are allowed to use.

The local context tells us which variables, types, assumptions, and definitions are currently in scope.

The imported libraries define the available mathematical world.

The attempted tactics tell us what has already failed.

The error messages tell us why those attempts failed.

The retrieved neighbours tell us which theorems, lemmas, or proof fragments appear semantically close.

The verifier feedback tells us whether a proposed step was accepted, rejected, or partially advanced the proof.

The score tells us whether the proof state is tightening or drifting.

That is the moment.

Not the entire conversation.

Not the whole proof.

Not the model's vague memory of what it was trying to do.

A bounded, replayable state.

This matters because once the proof state becomes an object, it can be stored, compared, embedded, replayed, and measured.
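
As a minimal sketch of what "the proof state becomes an object" could look like, here is one possible Python representation. The class name, field names, and the choice of a hash over goal, hypotheses, context, and imports as the identity of a moment are all assumptions made for illustration, not an existing API.

```python
from dataclasses import dataclass, field, asdict
import hashlib
import json


@dataclass
class ProofMoment:
    """One bounded, replayable proof state (all names are illustrative)."""
    goal: str
    hypotheses: list[str] = field(default_factory=list)
    local_context: list[str] = field(default_factory=list)
    imports: list[str] = field(default_factory=list)
    attempted_tactics: list[str] = field(default_factory=list)
    error_messages: list[str] = field(default_factory=list)
    retrieved_neighbours: list[str] = field(default_factory=list)
    verifier_feedback: str = ""
    score: float = 0.0

    def state_key(self) -> str:
        """Stable ID for the mathematical state only.

        Attempts, errors, and scores are excluded, so two replays of the
        same state share a key and can be compared.
        """
        payload = json.dumps(
            {
                "goal": self.goal,
                "hypotheses": sorted(self.hypotheses),
                "local_context": sorted(self.local_context),
                "imports": sorted(self.imports),
            },
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]

    def to_json(self) -> str:
        """Serialize the whole moment, including attempts, for storage."""
        return json.dumps(asdict(self), sort_keys=True)
```

Two captures of the same goal and hypotheses get the same `state_key()` even if one has more failed tactics attached, which is what makes replays of one moment comparable.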

A normal LLM interaction looks like this:

prompt → answer

A proof-moment system looks like this:

proof state
→ candidate move
→ verifier result
→ updated proof state
→ scored replay

That is a different kind of system.

The LLM is no longer simply generating text. It is operating inside a constrained mathematical environment. The proof assistant supplies the boundary. The theorem corpus supplies the neighbourhood. The replay loop supplies the search. The score tells us whether the system is moving toward proof or away from it.

This is where the moment becomes useful.

Because the state is preserved, the AI can return to it.

Because the state is structured, the AI can search from it.

Because the state is verifiable, the AI can test its own proposals.

Because the state is scored, the system can decide whether replay is helping.

That gives us the first serious version of the moment architecture:

capture the proof state
→ search nearby known mathematics
→ propose a move
→ verify the move
→ measure progress
→ replay if progress is possible
→ stop if progress stalls or drift appears

This is also where the broader concept becomes clearer.

A proof state is not the only kind of moment.

It is just the cleanest one.

In writing, the verifier is weak.

In research, the verifier is partial.

In coding, the verifier may be a test suite.

In mathematics, the verifier is formal.

That is why mathematical proof is the right first test case.

If moment replay works here, it works under the hardest condition: not "does this sound better," but "did this actually move the proof forward?"


4. Hallucination vs Amplification

This is where the idea becomes safer and more ambitious at the same time.

Hallucination is not simply "the model got a fact wrong."

That is the symptom.

The deeper problem is that the model has moved outside the evidence well. It is still generating. It is still producing fluent language. It may even sound more confident than before. But the generation is no longer constrained by what can be checked.

In ordinary writing or research, that boundary is often soft.

In formal proof, the boundary is hard.

Lean does not care whether a proof step sounds plausible. Coq does not reward elegance without validity. A tactic either advances the proof state, or it does not. A term either typechecks, or it does not.

That makes proof assistants powerful anti-hallucination systems.

But if we stop there, we make AI too small.

The goal is not merely to prevent the model from being wrong. The goal is to let the model search harder, propose more, retrieve more, and amplify the reasoning moment, while remaining inside a verifiable boundary.

That is the distinction.

Hallucination =
unbounded generation outside evidence
Moment Replay =
bounded amplification inside a verifiable state

The proof assistant becomes the boundary.

The theorem corpus becomes the evidence.

The embedding search becomes the neighbourhood detector.

The replay loop becomes the optimization process.

The AI is allowed to explore, but every exploration has to come back through the verifier.

This gives us a different model of safe AI reasoning.

Not:

make the model less imaginative

But:

make the model's imagination measurable

In a proof setting, the model can suggest a lemma. It can try a tactic. It can search nearby theorem statements. It can propose a proof sketch. It can replay the same proof state with different retrieved neighbours.

But each attempt is checked.

If the attempt moves the proof forward, keep it.

If it drifts, reject it.

If repeated attempts stop improving the state, stop replaying.

That is controlled amplification.

The AI does not become useful by staying close to the obvious answer. It becomes useful by exploring beyond the obvious answer without escaping the structure that can verify it.

This is why mathematical proof is such a strong test case for the moment.

We can allow the model to go further than a cautious assistant would normally go, because the proof assistant gives us a hard boundary.

The model can amplify.

The verifier can constrain.

The replay loop can search for the best next move.

That is the shape we want: not less intelligence, not more hallucination, but more ambitious reasoning inside a system that can tell when the ambition has become false.


5. What Existing Systems Already Give Us

This idea does not appear from nowhere.

Almost every piece already exists.

They just live in separate systems.

Semantic theorem search shows that mathematical knowledge can be embedded and searched by meaning, not only by name or keyword. That gives us neighbourhoods.

LeanSearch, LeanDojo, ReProver, REAL-Prover, and Lean Finder push this further. They show that theorem libraries, premises, and even live proof states can be used during proof generation. In particular, proof-state retrieval is already real. A system can embed the current Lean state, search nearby theorems, and feed those candidates into a tactic generator.

So semantic neighbourhood detection is not the novelty.

It is part of the foundation.

The same is true for learning from experience. Reflexion showed that language agents can use feedback from failed attempts. ExpeL goes further by collecting successful and failed trajectories, extracting reusable natural-language insights, and retrieving them during future tasks.

So rule extraction is not entirely new either.

The map looks like this:

| System | What it gives | What it does not close |
| --- | --- | --- |
| Semantic theorem search | Statement-level mathematical neighbourhoods | No live replay loop |
| LeanSearch / ReProver / LeanDojo | Premise retrieval and proof-state tooling | No persistent moment-learning loop |
| REAL-Prover / Lean Finder | Live proof-state retrieval | No verifier-scored replay with rule validation |
| Reflexion | Feedback improves later attempts | No formal verifier |
| ExpeL | Experience-to-rule extraction | Not applied to proof-state replay with verifier-tested rules |

Read the last column carefully.

The gap is not a missing capability.

The gap is missing wiring.

The retrieval systems have the map.

The proof assistants have the verifier.

The agent systems have the beginnings of experience extraction.

What I am proposing is the closed loop around the proof moment:

proof state
→ semantic neighbour retrieval
→ candidate tactic or lemma
→ verifier check
→ tightening score
→ replay or stop
→ extract rule
→ validate rule on held-out proof moments

That is the contribution.

Not theorem search.

Not Lean.

Not proof-state retrieval.

Not rule extraction in isolation.

The contribution is a measured replay loop where the verifier scores each attempt, the stop condition prevents useless search, and the experience compiler only promotes rules that improve future proof states.

This is a narrower claim.

But it is the claim worth testing.

The interesting question is no longer whether proof-state retrieval exists. It does. The question is whether retrieval, verification, replay, stopping, and rule extraction become more powerful when they are wired into one measured loop.


6. The Proof-Moment Replay Loop

Now we can describe the system.

A proof-moment replay loop is not a chat loop.

It is not:

ask the model
→ get an answer
→ ask it to try again

That is too loose.

The loop has to be grounded in a preserved proof state and constrained by a verifier.

The architecture looks like this:

Capture proof state
→ Embed proof state
→ Retrieve nearby theorems, lemmas, and proof fragments
→ Inject candidates as constraints or tactics
→ Run Lean or Coq
→ Score tightening or drift
→ Replay if not solved
→ Extract reusable rule if pattern repeats

Each step matters.

First, capture the proof state.

The system records the goal, hypotheses, local context, imported libraries, attempted tactics, failed errors, and any previous retrieved neighbours. This gives the replay loop a stable object to work on.

Then embed the proof state.

Not just the surface text of the goal, but the mathematical shape of the moment: the target, the assumptions, the types involved, the symbols in play, and the structure of the obligation.

Then retrieve nearby mathematics.

The system searches for theorem statements, lemmas, proof fragments, and tactic patterns that appear close to the current proof state. These are not accepted as true merely because they were retrieved. They are candidates.
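
The retrieval step can be sketched in a few lines. This version uses a bag-of-symbols vector and cosine similarity as a crude stand-in for a learned embedding; the function names and the toy corpus in the usage note are hypothetical, and a real system would swap in a trained encoder and a vector index.

```python
import math
import re
from collections import Counter


def symbol_vector(text: str) -> Counter:
    """Bag of identifiers and symbols; a stand-in for a learned embedding."""
    return Counter(re.findall(r"[A-Za-z_][A-Za-z0-9_.]*|[^\sA-Za-z0-9_]", text))


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve(state_text: str, corpus: dict[str, str], k: int = 3) -> list[tuple[str, float]]:
    """Return the k corpus entries closest to the proof state.

    The results are candidates only; nothing here says they are useful.
    """
    query = symbol_vector(state_text)
    scored = [(name, cosine(query, symbol_vector(stmt))) for name, stmt in corpus.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
```

For example, with a two-entry corpus such as `{"lemma_a": "orderOf g | card G", "lemma_b": "length (filter p l) <= length l"}`, the query `"orderOf x | card G"` ranks `lemma_a` first. Whether `lemma_a` actually helps is a question for the verifier, not the retriever.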

Then inject the candidates.

The language model receives the proof state plus the retrieved neighbours and proposes the next move: a lemma to try, a tactic sequence, a rewrite direction, or a smaller intermediate claim.

Then the proof assistant checks it.

This is the hard boundary.

If the step is invalid, Lean or Coq rejects it.

If the step is valid but does not help, the score should reflect that.

If the step reduces the number of goals, simplifies the hypotheses, closes a subgoal, or moves the proof toward a known theorem shape, the replay loop records that tightening.

The system is now doing measured search.

candidate move
→ verifier response
→ progress score

That score decides what happens next.

If the proof is solved, stop.

If the proof state tightened, replay from the new state.

If the proof state did not tighten, retrieve different neighbours, change the candidate strategy, or stop if the loop is exhausted.

If the same pattern succeeds repeatedly across similar proof states, extract a reusable rule.

For example:

When the goal involves divisibility of element orders in a finite group,
retrieve order/cardinality lemmas before attempting algebraic rewriting.

That rule is not a proof.

It is a navigation heuristic.

The verifier still decides whether any future use of it is valid.

This is the critical distinction.

Moment replay does not ask the model to be trusted.

It asks the model to search inside a system that can reject it.

The functionality is distributed:

LLM: propose
retriever: remember
proof assistant: verify
replay loop: measure
experience compiler: learn
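
A minimal sketch of that division of labour, with each role passed in as a plain function. Everything here is hypothetical scaffolding: `retrieve`, `propose`, and `verify` stand in for a real retriever, a language model, and a Lean or Coq process, and "fewer open goals" is used as the crudest possible tightening signal.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class VerifierResult:
    accepted: bool            # did the proof assistant accept the step?
    new_state: Optional[str]  # resulting proof state, if accepted
    open_goals: int           # goals remaining after the step


def replay(
    state: str,
    open_goals: int,
    retrieve: Callable[[str], list[str]],
    propose: Callable[[str, list[str]], str],
    verify: Callable[[str, str], VerifierResult],
    max_steps: int = 8,
    patience: int = 2,
) -> tuple[str, int, list[tuple[str, bool]]]:
    """Run a bounded replay; return (final_state, open_goals, trajectory)."""
    trajectory: list[tuple[str, bool]] = []
    stalled = 0
    for _ in range(max_steps):
        if open_goals == 0 or stalled >= patience:
            break
        neighbours = retrieve(state)            # retriever: remember
        move = propose(state, neighbours)       # LLM: propose
        result = verify(state, move)            # proof assistant: verify
        tightened = result.accepted and result.open_goals < open_goals
        trajectory.append((move, tightened))    # replay loop: measure
        if tightened:
            state, open_goals, stalled = result.new_state, result.open_goals, 0
        else:
            stalled += 1                        # keep the last good state
    return state, open_goals, trajectory
```

The returned trajectory is the raw material for the experience compiler: a list of moves, each labelled by whether the verifier-checked state actually tightened.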

That is why this is not just prompting.

Prompting produces another answer.

Proof-moment replay produces a measured trajectory through a verifiable state space.

The output is not merely text.

The output is one of three things:

a solved proof
a tighter proof state
a reusable navigation rule

If it produces none of those, the replay failed.

And that is useful too, because failure becomes part of the moment.

The system now knows not just what worked, but what did not work from this neighbourhood.

That is how a proof assistant becomes more than a checker.

It becomes the centre of a replayable reasoning system.


7. The Key Metric: Did the Proof State Tighten?

Now we reach the question that makes this testable.

Did replay actually improve the proof state?

For writing, this is hard. Better prose is subjective. A stronger paragraph can still be debated.

For proof, improvement has sharper edges.

A proof state either moves closer to completion, or it does not.

That does not mean every improvement is binary. A proof can advance partially. A tactic can fail but reveal a better error. A retrieved lemma can fail directly but expose the right theorem family. A rewrite can simplify the goal without closing it.

So we need a metric that sits between:

invalid

and:

proof complete

Call it tightening.

A proof state tightens when the space of unresolved obligations becomes smaller, cleaner, or closer to known validated mathematics.

Some forms of tightening are obvious:

number of goals reduced
subgoal closed
proof term accepted
tactic succeeds

Some are weaker but still useful:

hypotheses simplified
target expression normalized
irrelevant branches removed
error message becomes more specific
retrieved lemma partially applies
goal shape moves closer to a known theorem

And some are negative signals:

more goals created without simplification
hypotheses become noisier
retrieved lemmas stop matching
same tactic failure repeats
proof state drifts away from the original objective

This gives us a way to score replay.

Not perfectly.

But enough to make the system measurable.

A first implementation does not need a perfect score. It needs a usable one.

For example:

TighteningScore =
  + 1.0 * closed_goals
  + 0.5 * reduced_goals
  + 0.3 * accepted_tactic
  + 0.3 * retrieved_lemma_used
  + 0.2 * simpler_target
  - 0.5 * repeated_error
  - 1.0 * goal_explosion
  - 0.7 * semantic_drift

The weights are not sacred. They are tunable. A real implementation would change them by domain, proof style, and benchmark results.
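
The formula above translates directly into code. In this sketch I assume `closed_goals` and `reduced_goals` are counts and the remaining signals are booleans; that reading, the `Transition` class, and the `WEIGHTS` table are illustrative choices, and the weights are the same placeholder values as in the formula.

```python
from dataclasses import dataclass


@dataclass
class Transition:
    """Signals observed for one verified replay step."""
    closed_goals: int = 0
    reduced_goals: int = 0
    accepted_tactic: bool = False
    retrieved_lemma_used: bool = False
    simpler_target: bool = False
    repeated_error: bool = False
    goal_explosion: bool = False
    semantic_drift: bool = False


# Placeholder weights copied from the formula; tune per domain and benchmark.
WEIGHTS = {
    "closed_goals": 1.0,
    "reduced_goals": 0.5,
    "accepted_tactic": 0.3,
    "retrieved_lemma_used": 0.3,
    "simpler_target": 0.2,
    "repeated_error": -0.5,
    "goal_explosion": -1.0,
    "semantic_drift": -0.7,
}


def tightening_score(t: Transition, weights: dict[str, float] = WEIGHTS) -> float:
    """Weighted sum of the observed signals; positive means the state tightened."""
    return sum(w * float(getattr(t, name)) for name, w in weights.items())
```

A step that closes one goal with an accepted tactic that used a retrieved lemma scores 1.0 + 0.3 + 0.3 = 1.6. A step that repeats an earlier error and explodes the goal list scores -0.5 - 1.0 = -1.5.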

The point is not that this is the final scoring function.

The point is that replay is no longer judged by whether the model sounded plausible.

It is judged by a measurable state transition.

Did a goal close?

Did the number of unresolved obligations fall?

Did Lean accept a tactic?

Did a retrieved lemma actually appear in a successful step?

Did the target become simpler?

Did the system repeat the same error?

Did the proof branch explode?

Did the state drift away from the original theorem?

Those are measurable signals.

A more abstract version looks like this:

Tightening =
goal_reduction
+ accepted_steps
+ simplification_gain
+ neighbour_proximity_gain
- drift_penalty
- repeated_failure_penalty

This matters because it changes the role of the AI.

The model is not being asked to produce a beautiful proof in one pass.

It is being asked to search.

Each replay produces a candidate move.

Lean or Coq checks the move.

The replay system records whether the state tightened, loosened, or stayed the same.

That creates a trajectory:

proof state 0
→ candidate move
→ proof state 1
→ tightening score
→ replay or stop

Now the stop condition becomes meaningful.

If the proof is complete, stop.

If the proof state keeps tightening, continue.

If the proof state stops tightening, change strategy or stop.

If the proof state drifts, roll back.
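
Those four rules can be written as a small decision function over the history of tightening scores. The drift threshold, the patience window, and the limit on strategy changes are assumptions picked for illustration, not tuned values.

```python
from enum import Enum


class Decision(Enum):
    STOP_SOLVED = "stop: proof complete"
    CONTINUE = "continue replay"
    CHANGE_STRATEGY = "change strategy"
    STOP_STALLED = "stop: no further tightening"
    ROLL_BACK = "roll back to last good state"


def decide(
    open_goals: int,
    scores: list[float],
    strategy_changes: int = 0,
    *,
    drift_threshold: float = -0.5,
    patience: int = 2,
    max_strategy_changes: int = 1,
) -> Decision:
    """Map the current replay history to the next action."""
    if open_goals == 0:
        return Decision.STOP_SOLVED
    # A strongly negative last step is treated as drift.
    if scores and scores[-1] <= drift_threshold:
        return Decision.ROLL_BACK
    # No positive score in the last `patience` steps counts as a stall.
    recent = scores[-patience:]
    if len(recent) == patience and all(s <= 0 for s in recent):
        if strategy_changes < max_strategy_changes:
            return Decision.CHANGE_STRATEGY
        return Decision.STOP_STALLED
    return Decision.CONTINUE
```

With these defaults, a history of `[0.8, 0.0, 0.0]` triggers one change of strategy, and the same history after a strategy change has already been spent stops the loop.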

This is also where hallucination becomes visible.

A hallucinated proof step may sound convincing, but it will not tighten the formal state. It will fail to typecheck, fail to apply, increase noise, repeat an error, or move away from useful neighbours.

The verifier turns hallucination into a measurable event.

That is why formal proof is such a powerful laboratory for the moment.

It gives us a way to ask the only question that matters:

Did replay reduce uncertainty?

If the answer is yes, the moment had value.

If the answer is no, the replay was just motion.

The goal is not movement.

The goal is tightening.


8. Moment Value

Tightening tells us whether one replay helped.

Moment value asks a larger question:

Was this proof state worth replaying at all?

Not every moment deserves compute.

Some proof states are trivial. Lean can solve them with simp, omega, ring, or a direct library lemma.

Some proof states are dead ends. Replay only produces different versions of the same failure.

Some proof states are noisy. The goal is badly shaped, the hypotheses are unhelpful, and semantic retrieval returns a cloud of unrelated mathematics.

But some proof states are valuable.

A valuable proof moment is one where replay reveals a reusable bridge.

this goal shape
→ this lemma family
→ this tactic pattern

That is the scientific core of the idea.

Moment value is not the quality of the current output.

Moment value is the future improvement generated by replaying the state.

Moment Value =
future proof improvement caused by replaying this moment

A moment has high value if replay does one of three things.

First, it solves the current proof.

That is the obvious case.

Second, it tightens the current proof state enough to make the next step easier.

That still matters. In formal proof, partial progress is real progress.

Third, and most important, it produces a reusable navigation rule that improves future proof attempts.

For example:

Observed moment:
goal contains finite group, element order, divisibility target

Successful replay:
retrieved order/cardinality lemmas before algebraic rewriting

Compiled rule:
when this goal shape appears again,
prioritize order/cardinality lemmas before rewriting

The current proof improves.

But the system also learns something about future proofs.

That is moment value.

The value of the moment is not contained only in the solved proof. It is contained in the pattern extracted from the replay.
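
One simple way to extract such a pattern is to count, per goal shape, how often each strategy tightened the state, and to promote a rule only when it has enough support. The function name, the string labels for shapes and strategies, and both thresholds are hypothetical. A promoted rule would still need to be validated on held-out proof moments before it is trusted.

```python
from collections import defaultdict
from typing import Iterable


def compile_rules(
    replays: Iterable[tuple[str, str, bool]],
    min_support: int = 3,
    min_success_rate: float = 0.6,
) -> dict[str, tuple[str, float]]:
    """Turn replay logs into navigation rules.

    replays: (goal_shape, strategy, tightened) triples from past replays.
    Returns {goal_shape: (best_strategy, success_rate)} for promoted rules.
    """
    stats: dict[tuple[str, str], list[int]] = defaultdict(lambda: [0, 0])
    for shape, strategy, tightened in replays:
        stats[(shape, strategy)][1] += 1          # attempts
        if tightened:
            stats[(shape, strategy)][0] += 1      # successes
    rules: dict[str, tuple[str, float]] = {}
    for (shape, strategy), (wins, tries) in stats.items():
        rate = wins / tries
        if tries >= min_support and rate >= min_success_rate:
            best = rules.get(shape)
            if best is None or rate > best[1]:
                rules[shape] = (strategy, rate)
    return rules
```

A strategy that worked once is not a rule. Only a pattern that repeats across several moments of the same shape gets promoted, and even then it is a navigation heuristic, not a proof.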

This is where moment replay becomes different from ordinary proof search.

A proof search system wants to close the current goal.

A moment-learning system wants to ask:

Did this replay teach us something
that improves the next similar state?

That gives us a ranking problem.

Which moments should be replayed?

The answer should not be "all of them."

Replay has a cost. Retrieval has a cost. Verification has a cost. Repeated LLM calls have a cost.

So the system needs a policy.

High-value moments are the ones where expected future improvement exceeds replay cost.

Replay if:

expected tightening
+ expected rule value
>
compute cost
+ drift risk
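
The inequality above is easy to state in code. The hard part, which this sketch does not address, is estimating the four quantities and putting them on a common scale. The function names and the toy estimates in the usage note are assumptions.

```python
def should_replay(
    expected_tightening: float,
    expected_rule_value: float,
    compute_cost: float,
    drift_risk: float,
) -> bool:
    """Replay only if expected benefit exceeds expected cost."""
    return expected_tightening + expected_rule_value > compute_cost + drift_risk


def rank_moments(estimates: dict[str, tuple[float, float, float, float]]) -> list[str]:
    """Return the IDs of moments worth replaying, largest margin first.

    estimates maps a moment ID to
    (expected_tightening, expected_rule_value, compute_cost, drift_risk).
    """
    margins = {
        moment_id: (tightening + rule_value) - (cost + risk)
        for moment_id, (tightening, rule_value, cost, risk) in estimates.items()
    }
    ordered = sorted(margins.items(), key=lambda item: item[1], reverse=True)
    return [moment_id for moment_id, margin in ordered if margin > 0]
```

Given estimates of `(0.1, 0.0, 0.4, 0.0)` for a trivial moment, `(0.6, 0.9, 0.5, 0.2)` for one that might yield a reusable bridge, and `(0.5, 0.1, 0.3, 0.1)` for one with partial progress, the margins are -0.3, 0.8, and 0.2. The trivial moment is skipped and the bridge is replayed first.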

This is where the architecture starts to look less like prompting and more like learning.

The system does not merely store past states.

It learns which past states are worth reopening.

It learns which states produce reusable rules.

It learns which goal shapes tend to reward replay.

It learns where the mathematical map is dense enough to make semantic neighbourhood search useful.

And it learns where replay is wasted.

That creates a feedback loop:

capture proof moments
→ replay selected moments
→ measure tightening
→ extract rules
→ apply rules to future moments
→ update which moments are worth replaying

Over time, the system should become more selective.

It should stop replaying moments that only create motion.

It should prioritize moments that create proof progress or reusable mathematical heuristics.

That is the difference between memory and experience.

Memory stores every moment.

Experience learns which moments mattered.


9. From Semantic Neighbourhoods to Proof-Useful Neighbours

Semantic neighbourhood detection already exists.

That matters.

A system can embed a proof state, search a theorem library, and retrieve nearby declarations. That is a real capability, and this post should not pretend otherwise.

But semantic proximity is not the same as proof usefulness.

A theorem can be nearby and still useless.

It can share symbols with the goal but require hypotheses that are not present. It can live in the right mathematical area but create more subgoals than it closes. It can look plausible to the model while failing to tighten the formal state.

So the important question is no longer only:

Have we been near this mathematical neighbourhood before?

The better question is:

Did this neighbour help the proof state tighten?

That is the shift.

Moment replay treats retrieved neighbours as candidates, not answers. The retriever provides a map. The language model proposes a move. Lean or Coq checks the move. The tightening score records whether the state improved.

The current proof state becomes a query against known mathematics:

goal
+ hypotheses
+ local context
+ types
+ symbols
+ failed tactics
+ error classes

The corpus may contain:

theorem statements
Lean declarations
relevant premises
proof sketches
tactic patterns
adjacent lemmas
intermediate proof states

The system is not merely searching for matching words.

It is searching for proof-useful proximity.

A goal involving divisibility may be helped by a theorem about cardinality.

A statement about lists may be helped by a lemma about folds or filters.

A proof state involving normalized equality may be helped by a rewrite pattern used elsewhere.

The useful theorem may not use the same names as the goal. It may not look obvious to the human. It may not be the highest textual match.

That is why the verifier matters.

A retrieved neighbour is not truth.

It is a proposed route through the mathematical landscape.

The route only matters if the proof state tightens.

The workflow becomes:

    %%{init: {'theme': 'base', 'themeVariables': { 'fontSize': '16px'}}}%%
flowchart LR
    EMB["Embed proof<br/>moment"]
    RET["Retrieve mathematical<br/>neighbours"]
    RANK["Rank by expected<br/>proof usefulness"]
    PROP["Propose tactic or<br/>intermediate claim"]
    VER["Verify with<br/>Lean / Coq"]
    MEAS["Measure<br/>tightening"]

    EMB --> RET --> RANK --> PROP --> VER --> MEAS

    classDef nodeStyle fill:#1e1e2f,stroke-width:3px,color:#cdd6f4,rx:16,ry:16,font-weight:bold
    class EMB,RET,RANK,PROP,VER,MEAS nodeStyle

    style EMB fill:#2a1e3f,stroke:#cba6f7
    style RET fill:#1e2a3f,stroke:#89b4fa
    style RANK fill:#1e3f2a,stroke:#a6e3a1
    style PROP fill:#3f2a1e,stroke:#fab387
    style VER fill:#2a3f1e,stroke:#a6da95
    style MEAS fill:#3f1e2a,stroke:#f38ba8
  

If a neighbour works, the system learns something.

If it fails, the system still learns something.

It can record that this neighbourhood was misleading for this goal shape. It can down-rank similar candidates in the future. It can extract a negative rule:

When the target is about list length after filter,
do not prioritize map lemmas unless map appears in the local context.

That kind of negative knowledge is valuable.

A proof assistant rejects the bad move.

Moment replay remembers why the move was bad.

Over time, the system can learn the difference between three kinds of similarity:

textual similarity
semantic similarity
proof-useful similarity

The third category is the one that matters.

proof-useful similarity =
similarity that increases expected tightening

This is where semantic retrieval becomes part of experience.

The neighbourhood detector is not only a search engine. It becomes a source of feedback. Each retrieved theorem is tested against the proof state. Each success or failure updates the system’s estimate of which neighbours are useful for which kinds of moments.
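One way to picture that feedback, as a sketch only: keep success and failure counts per (goal shape, lemma) pair and use a smoothed success rate to re-rank what the retriever returns. The class name `NeighbourStats`, the string goal-shape keys, the placeholder lemma names in the usage note, and the choice to multiply similarity by usefulness are all assumptions made for illustration.

```python
from collections import defaultdict


class NeighbourStats:
    """Tracks how often a lemma tightened proof states of a given goal shape."""

    def __init__(self, prior_success: float = 1.0, prior_fail: float = 1.0):
        # Smoothing prior: an unseen (shape, lemma) pair starts at 0.5.
        self.prior_success = prior_success
        self.prior_fail = prior_fail
        # (goal_shape, lemma) -> [successes, failures]
        self.counts = defaultdict(lambda: [0, 0])

    def record(self, goal_shape: str, lemma: str, tightened: bool) -> None:
        """Store one verifier outcome for this lemma on this goal shape."""
        entry = self.counts[(goal_shape, lemma)]
        entry[0 if tightened else 1] += 1

    def usefulness(self, goal_shape: str, lemma: str) -> float:
        """Smoothed fraction of past attempts where the lemma tightened the state."""
        success, fail = self.counts.get((goal_shape, lemma), (0, 0))
        total = success + fail + self.prior_success + self.prior_fail
        return (success + self.prior_success) / total

    def rerank(self, goal_shape: str, candidates: list) -> list:
        """candidates: list of (lemma, semantic_similarity) pairs."""
        return sorted(
            candidates,
            key=lambda c: c[1] * self.usefulness(goal_shape, c[0]),
            reverse=True,
        )
```

With two recorded failures for a hypothetical `lemma_about_map` and one success for `lemma_about_filter` on the same goal shape, the filter lemma outranks the map lemma even if the map lemma has the higher raw similarity. That is the negative rule above, expressed as a number.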

A neighbour is valuable only if it helps the replay move.

A proof moment is valuable only if replaying it teaches the system which neighbours matter.

The point is not to replace formal proof.

The point is to keep the reasoning chain close to validated mathematical territory while still allowing the AI to explore.

The proof assistant remains the boundary.

Semantic retrieval provides the map.

Moment replay tests the path.

Embedding a proof state

A proof state should not be embedded as a raw string alone.

Local variable names are often arbitrary. Two proof states can be mathematically close while using different names, different orderings, or different surface notation.

A useful embedding should combine several views:

raw goal text
+ normalized type signatures
+ constants and declarations used
+ hypothesis structure
+ target shape
+ failed tactic/error classes

The goal is not textual similarity.

The goal is proof-useful similarity: similarity that increases expected tightening.

A first system can start crudely: embed the rendered goal and hypotheses, retrieve likely neighbours, and let the verifier filter them. A better system would normalize expressions, strip arbitrary variable names, track constants and type families, and learn from which retrieved neighbours actually tightened past proof moments.

That learning loop is the contribution.

Not retrieval alone.

Retrieval finds nearby mathematics.

Moment replay learns which nearby mathematics helps.
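The "strip arbitrary variable names" step can be sketched in a few lines. This is a crude, text-level approximation that assumes hypotheses arrive as rendered strings such as `"a : Nat"`; a real system would work on the elaborated expression, not on text. The function name `normalize_goal` and the `v0, v1, ...` placeholder scheme are hypothetical, and the identifier pattern only covers ASCII names.

```python
import re


def normalize_goal(goal: str, hypotheses: list) -> str:
    """Rename local variables to v0, v1, ... in order of first use in the goal.

    Library constants such as Nat.add_comm are left untouched because they
    are not declared in the hypotheses.
    """
    local_names = set()
    for hyp in hypotheses:
        if ":" in hyp:
            # "a b : Nat" declares both a and b
            local_names.update(hyp.split(":", 1)[0].split())

    mapping = {}

    def rename(match):
        token = match.group(0)
        if token not in local_names:
            return token
        if token not in mapping:
            mapping[token] = f"v{len(mapping)}"
        return mapping[token]

    # ASCII identifiers, allowing dots and primes (a simplification).
    return re.sub(r"[A-Za-z_][\w.']*", rename, goal)
```

Two goals that differ only in naming, such as `a + b = b + a` and `y + x = x + y`, normalize to the same string, so they land on the same point before any embedding is computed.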


10. The Stop Condition

Replay cannot be infinite.

That is one of the most important constraints in this architecture.

A naive system would say:

if one replay helps,
more replay must help more

But that is wrong.

Past a certain point, replay stops tightening the proof. It starts circling. The model tries variants of the same failed tactic. Retrieval returns the same unhelpful lemmas. The proof state becomes noisier. The system burns compute without reducing uncertainty.

In writing, this becomes over-polishing.

In proof, it becomes stalled search.

So the replay loop needs a stop condition.

The stop condition is not an implementation detail. It is part of the intelligence.

The system should stop when:

no new premises improve the score
goal distance stops decreasing
semantic similarity moves away from validated neighbours
tactic attempts repeat
error messages cycle
cost exceeds expected improvement

This gives replay a boundary.

The goal is not to run the model forever.

The goal is to find the useful peak of the moment.

A proof-moment system should keep replaying only while the state is tightening. If a candidate lemma closes a subgoal, continue. If a tactic reduces the goal count, continue. If retrieval surfaces a better theorem family, continue. If the next error message is more specific and closer to a known obligation, continue.

But if the system keeps generating the same failure under different wording, stop.

If the retrieved neighbours become less relevant, stop.

If the proof state moves away from the original objective, stop.

If replay creates more branches than it resolves, stop.

The simplest version is a threshold rule:

continue while:
expected_tightening > replay_cost + drift_risk

That is the economic version of the stop condition.

Replay has value only when the expected improvement is greater than the cost of trying again.

A more practical version might look like this:

Stop if any of the following are true:

1. proof complete
2. tightening score has not improved for N attempts
3. same error class appears K times
4. retrieved neighbour quality drops below threshold
5. tactic diversity collapses
6. proof state drift exceeds threshold
7. compute budget is exhausted
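The seven checks above translate almost directly into code. The sketch below assumes the replay loop maintains a small history object; `ReplayHistory`, `stop_reason`, and every threshold value are hypothetical and would need tuning against real runs.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ReplayHistory:
    scores: list = field(default_factory=list)         # tightening score per attempt
    error_classes: list = field(default_factory=list)  # None for accepted steps
    tactics: list = field(default_factory=list)        # tactic text per attempt
    proof_complete: bool = False
    neighbour_quality: float = 1.0                     # assumed to lie in [0, 1]
    drift: float = 0.0                                 # assumed to lie in [0, 1]
    budget_left: int = 10                              # remaining attempts


def stop_reason(h: ReplayHistory, patience: int = 3, max_same_error: int = 3,
                min_quality: float = 0.2, max_drift: float = 0.5) -> Optional[str]:
    """Return why replay should stop, or None if it should continue."""
    if h.proof_complete:
        return "proof complete"
    # Best score in the last `patience` attempts is no better than before.
    if len(h.scores) > patience and max(h.scores[-patience:]) <= max(h.scores[:-patience]):
        return "tightening score plateaued"
    errors = Counter(e for e in h.error_classes if e is not None)
    if errors and max(errors.values()) >= max_same_error:
        return "same error class repeated"
    if h.neighbour_quality < min_quality:
        return "neighbour quality too low"
    # The last `patience` tactics are all identical.
    if len(h.tactics) >= patience and len(set(h.tactics[-patience:])) == 1:
        return "tactic diversity collapsed"
    if h.drift > max_drift:
        return "proof state drifted"
    if h.budget_left <= 0:
        return "compute budget exhausted"
    return None
```

Returning the reason, not just a boolean, matters for the rest of the design: the reason is exactly what gets saved when a failed replay is kept as evidence.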

This is how the system avoids becoming another hallucination machine.

Hallucination often comes from unbounded continuation. The model keeps going because language allows it to keep going.

Formal proof does not allow that.

The verifier rejects invalid steps.

But the replay system also needs to reject useless effort.

That is the second boundary.

Lean or Coq says:

this step is not valid

The stop condition says:

this search is no longer productive

Both are necessary.

Without the verifier, replay drifts into plausible nonsense.

Without the stop condition, replay becomes infinite motion.

The intelligence is not merely in the ability to try again.

The intelligence is in knowing when trying again has stopped creating progress.

That is why the stop condition belongs near the centre of the theory.

Moment replay is not infinite amplification.

It is bounded amplification.

The moment is reopened only while replay improves it.

When tightening stops, the system must either change strategy, save the failure as evidence, or move on.

A failed replay is still useful if it teaches the system where not to search next time.

That is the discipline of the moment:

search aggressively
verify every step
measure every replay
stop when tightening stops
save what was learned

In proof, semantic drift means the replay has moved away from the original obligation.

A simple drift signal might compare the current goal’s normalized type signature with the original theorem target. If the proof state keeps generating subgoals whose constants, type families, or theorem neighbourhoods diverge from the original target, the replay is no longer tightening. It is wandering.
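A first approximation of that drift signal, under loud assumptions: goals are compared as rendered text, "constants" are simply the identifiers that are not local variables, and drift is the Jaccard distance between the two constant sets. The names `constants` and `drift` are hypothetical, and operators such as `∣` are ignored by this crude tokenizer.

```python
import re


def constants(expr: str, local_names=frozenset()) -> set:
    """Identifiers in a rendered goal that are not local variables."""
    tokens = re.findall(r"[A-Za-z_][\w.']*", expr)
    return {t for t in tokens if t not in local_names}


def drift(original_goal: str, current_goal: str, local_names=frozenset()) -> float:
    """Jaccard distance between constant sets: 0.0 = same, 1.0 = disjoint."""
    a = constants(original_goal, local_names)
    b = constants(current_goal, local_names)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)
```

A subgoal that still mentions `orderOf` but has lost `Fintype.card` scores 0.5 against the original target. A subgoal about `List.length` scores 1.0, which is the "wandering" case.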


11. The Experience Compiler for Proof

When a replay succeeds, do not just save the proof.

Save the pattern.

That is the difference between a system that solves isolated problems and a system that learns from experience.

A normal proof assistant stores the finished proof. A search system may store the tactic sequence that worked. A retrieval system may remember which lemma was used.

But a moment system should ask a different question:

What reusable lesson did this replay reveal?

Suppose the system observes this pattern repeatedly:

Observed:
goals involving divisibility in finite groups repeatedly tighten after retrieving orderOf_dvd_card.

The useful artifact is not merely that one proof succeeded.

The useful artifact is the navigation rule:

Compiled rule:
When a proof state contains finite group + element order + divisibility target,
prioritize order/cardinality lemmas before algebraic rewriting.

That is learning.

Not weight learning.

System learning.

The base model has not changed. The verifier has not changed. Mathlib has not changed.

But the orchestration layer now acts differently when it sees a similar proof moment.

That is the Experience Compiler.

Its job is to turn successful replays into compact, reusable heuristics.

The compiler looks at the trajectory:

initial proof state
→ retrieved neighbours
→ attempted tactics
→ verifier feedback
→ tightened proof state
→ final proof

Then it asks:

What changed between the failed attempts and the successful one?

Did the successful replay retrieve a different theorem family?

Did it normalize the goal first?

Did it introduce an intermediate claim?

Did it avoid a tactic that repeatedly created noisy subgoals?

Did it use a library lemma the model did not know by name?

The answer becomes a candidate rule.

Not a universal law.

A candidate.

That matters.

A bad experience compiler would create brittle rules:

Always use orderOf_dvd_card.

That is too broad.

A better compiler creates conditional rules with applicability signatures:

IF:
  goal involves finite group
  AND expression contains element order
  AND target is divisibility or cardinality

THEN:
  retrieve order/cardinality lemmas before trying algebraic rewriting

Stored as a structured object, the same rule might look like this:

{
  "rule_id": "R_0023",
  "condition": {
    "goal_contains": ["finite_group", "element_order", "divisibility"],
    "hypotheses_contain": ["group"]
  },
  "action": {
    "priority_lemma_classes": ["orderOf_dvd_card", "cardinality"],
    "strategy": "try order/cardinality lemmas before algebraic rewriting"
  },
  "source_moments": ["M_014", "M_027", "M_031"],
  "success_count": 3,
  "fail_count": 0,
  "status": "candidate"
}

This matters because the rule is no longer advice floating in prose. It is an object that can be matched, tested, promoted, weakened, or discarded.

Now the rule is testable.

It can be applied to held-out proof moments.

If it improves tightening, promote it.

If it fails, weaken it.

If it harms other proofs, quarantine it.

This gives us a lifecycle:

candidate rule
→ validate on similar proof moments
→ measure tightening delta
→ promote, revise, or discard

That is where moment replay stops being a loop and becomes a learning system.
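The lifecycle can be sketched as a status update that runs after each held-out trial. The `Rule` fields mirror the JSON object above; `harm_count`, the status names `weakened` and `quarantined`, and all thresholds are assumptions made for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    rule_id: str
    success_count: int = 0
    fail_count: int = 0
    harm_count: int = 0        # times the rule made another proof worse
    status: str = "candidate"


def record_trial(rule: Rule, tightening_delta: float, harmed_other_proof: bool = False,
                 min_trials: int = 5, promote_rate: float = 0.7, max_harm: int = 2) -> str:
    """Update counts from one held-out proof moment and recompute the status."""
    if harmed_other_proof:
        rule.harm_count += 1
    elif tightening_delta > 0:
        rule.success_count += 1
    else:
        rule.fail_count += 1

    trials = rule.success_count + rule.fail_count
    if rule.harm_count >= max_harm:
        rule.status = "quarantined"
    elif trials < min_trials:
        rule.status = "candidate"      # not enough evidence either way
    elif rule.success_count / trials >= promote_rate:
        rule.status = "promoted"
    else:
        rule.status = "weakened"
    return rule.status
```

Starting from the three recorded successes of `R_0023`, two more positive tightening deltas are enough to promote it under these thresholds, and two harmful applications are enough to quarantine it again.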

The experience compiler is also where negative knowledge becomes useful.

If a retrieved neighbourhood repeatedly fails, that failure can become a rule too:

When the goal is about list length after filter,
do not prioritize map lemmas unless map appears in the local context.

That is not glamorous, but it is valuable.

A mathematician learns this way all the time. Not only by remembering what works, but by remembering what kinds of moves waste time in certain regions of a proof.

The system should do the same.

The output of the experience compiler is not a proof.

It is a navigation layer.

A growing map of proof moments:

goal shape
β†’ useful theorem family
β†’ tactic pattern
β†’ expected tightening
β†’ known failure modes

This map becomes the system’s experience.

Not because it contains everything that happened.

Because it changes what the system tries next.

That is the core distinction.

Memory stores traces.

Experience changes behavior.

The experience compiler is the mechanism that turns one into the other.

A compiled rule only matters if it changes future behavior.

When a new proof moment is captured, the system matches its goal shape, constants, hypotheses, and error history against the rule store. Relevant rules are then injected into the tactic-generation step as constraints or priorities.

The model does not receive the entire messy history of previous failures.

It receives the current proof state, the retrieved neighbours, and a small set of compiled rules that have previously improved similar states.
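Matching a new moment against the rule store could be as simple as a subset test over feature tags. The sketch assumes that features such as `"finite_group"` have already been extracted from the goal and hypotheses by some earlier step; `rule_applies`, `select_rules`, and the ranking by net success count are hypothetical choices.

```python
def rule_applies(rule: dict, goal_features, hypothesis_features) -> bool:
    """A rule applies when every tag in its condition is present in the moment."""
    condition = rule["condition"]
    goal_ok = set(condition.get("goal_contains", [])) <= set(goal_features)
    hyp_ok = set(condition.get("hypotheses_contain", [])) <= set(hypothesis_features)
    return goal_ok and hyp_ok


def select_rules(rules: list, goal_features, hypothesis_features, limit: int = 3) -> list:
    """Return the few applicable, non-quarantined rules with the best track record."""
    usable = [
        r for r in rules
        if r.get("status") != "quarantined"
        and rule_applies(r, goal_features, hypothesis_features)
    ]
    usable.sort(
        key=lambda r: r.get("success_count", 0) - r.get("fail_count", 0),
        reverse=True,
    )
    return usable[:limit]
```

The `limit` parameter is the point of the design: the model sees a handful of rules that earned their place, not the whole rule store.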


12. Demonstration We Can Build

The point of this post is not to claim that moment replay sounds useful.

The point is to make it testable.

So the demonstration should be deliberately small at first.

We do not need to build a world-class theorem prover. We do not need to beat Lean’s best automation. We do not need to solve frontier mathematics.

We need to show something narrower and more measurable:

Given the same proof state,
does moment replay improve the system’s ability to move toward a verified proof?

That is the experiment.

The cleanest way to test it is to start with known Lean proofs, hide the proof bodies, and ask several systems to recover or tighten them.

The first system is the baseline:

proof state → LLM tactic attempt → Lean check

The second system adds retrieval:

proof state → retrieve nearby lemmas → LLM tactic attempt → Lean check

The third system adds moment replay:

proof state
→ retrieve nearby lemmas
→ attempt tactic
→ Lean check
→ score tightening
→ replay if useful
→ extract rule if pattern repeats

Now we can compare them.

Not by feeling.

By result.

solved rate
number of goals reduced
number of valid tactic steps
number of failed tactic attempts
time to proof
tokens used
retrieved lemma usefulness
rules extracted
rules reused successfully

This gives us a real benchmark.

I would build it in three layers.

Layer 1: Tiny Proof Sandbox

Start with five to ten very small Lean proofs.

These are not meant to impress anyone. They are meant to prove the loop works.

The goals should be simple enough that failure is easy to inspect:

theorem demo_add_comm (a b : Nat) : a + b = b + a := by
  exact Nat.add_comm a b

Then remove the proof body:

theorem demo_add_comm (a b : Nat) : a + b = b + a := by
  sorry -- placeholder: the system must fill this in

The sandbox lets us test the mechanics:

Can we capture the proof state?
Can we ask an LLM for a tactic?
Can we run Lean?
Can we record the error?
Can we replay with new candidates?
Can we detect success?

This belongs in an appendix because it is implementation detail, not the central argument.

Layer 2: A 50-Proof Benchmark

Once the sandbox works, we build a small benchmark.

Select around fifty theorem statements from Mathlib or a controlled Lean file. Hide the original proofs. Keep the difficulty easy to moderate. The goal is not heroic theorem proving. The goal is to compare systems under controlled conditions.

For each theorem, store a proof moment:

theorem statement
imports
initial goal
local context
hidden original proof
available library declarations
baseline attempts
retrieved neighbours
Lean responses
tightening scores

Then run the three systems:

1. Baseline LLM
2. Retrieval-augmented LLM
3. Moment Replay

The expected result does not need to be magical.

Even a modest improvement would matter.

If baseline solves 12/50, retrieval solves 18/50, and moment replay solves 24/50 while extracting useful rules, that is enough to show the architecture has signal.

The key is not only solved proofs.

The key is whether replay creates measurable tightening:

fewer unsolved goals
more valid intermediate steps
better lemma selection
fewer repeated failures
more reusable rules

This is where the post becomes a real experiment.

Layer 3: Reader-Runnable Appendix

The main article should stay readable.

The code should not interrupt the argument.

So the appendices should carry the machinery:

Appendix A: Minimal Lean proof-moment runner
Appendix B: Example proof sandbox
Appendix C: Scoring and tightening metrics
Appendix D: The reader prompt that generates the runner
Appendix E: 50-proof benchmark design

A reader who wants the concept can read the post.

A reader who wants the implementation can go to the appendices.

That is the right split.

The blog post should explain the idea.

The appendix should prove that it can be built.

The most important appendix is the reader-runnable version. It should let someone paste a prompt into an AI and ask it to generate a small Lean replay harness:

Create a minimal Lean proof-moment runner.
It should:
1. take theorem statements with missing proof bodies
2. call an LLM for candidate tactics
3. run Lean to verify each tactic
4. capture errors and proof-state feedback
5. retry with retrieved or suggested lemmas
6. log each attempt as a proof moment
7. score tightening
8. stop when proof is solved or progress stalls

That makes the post interactive without turning the main article into a code tutorial.

This matters because the post is not about Lean specifically.

Lean is the test bench.

The general claim is larger:

AI systems improve when they preserve moments,
replay them under measurement,
reject drift,
and compile what worked.

Mathematical proof is simply the cleanest place to demonstrate it.

In writing, the verifier is subjective.

In research, the verifier is partial.

In coding, the verifier is often a test suite.

In proof, the verifier is formal.

That is why this demonstration matters.

If moment replay cannot improve proof states, the idea may be weaker than it sounds.

If it can, then we have something much more interesting than fancy prompting.

We have a measurable architecture for amplified reasoning.


13. Why This Is More Than Fancy Prompting

The obvious criticism is that this sounds like fancy prompting.

Ask the model once.

Ask it again with more context.

Ask it to critique itself.

Ask it to try harder.

That is not what I mean by moment replay.

Fancy prompting changes the wording around the request.

Moment replay changes the state loop around the system.

A prompt is usually disposable:

input → output

A moment replay system has machinery around the input:

state object
→ retrieval
→ candidate action
→ verifier
→ score
→ replay policy
→ rule extraction

That difference matters.

In ordinary prompting, the model is the centre of the system.

In moment replay, the model is one component inside a measured architecture.

The loop has a formal state object.

It has a retrieval corpus.

It has a verifier.

It has measurable progress.

It has stop conditions.

It has reusable rule extraction.

That is not just a better prompt.

That is a systems architecture.

The proof setting makes the distinction especially clear.

If I ask an LLM to “try proving this again,” it may produce another plausible proof attempt. It may even sound more convincing than the first. But unless the proof assistant checks it, nothing has been established.

Moment replay does not trust the second answer more because it is second.

It trusts only what survives verification.

A replay attempt either tightens the proof state, or it does not.

A retrieved lemma either applies, or it does not.

A tactic either advances the proof, or it fails.

A rule either improves similar future moments, or it is discarded.

That is the hard boundary between this and prompt craft.

Prompting asks:

Can I phrase the request better?

Moment replay asks:

Can I preserve the state,
search nearby validated knowledge,
test candidate moves,
measure progress,
and keep only the patterns that work?

The first is interaction design.

The second is learning infrastructure.

This also changes where improvement lives.

The base model does not need to change weights after every proof attempt. The learning happens in the system around it: the moment store, the retrieval layer, the tightening score, the stop policy, and the experience compiler.

That is why this matters.

The future of applied AI may not be only larger models or longer context windows. It may be better loops around frozen models: loops that know what state they are in, what evidence they can use, how to test progress, when to stop, and what to remember.

In that sense, moment replay is not a prompt.

It is a way of turning individual AI interactions into measured experience.

| Aspect | Fancy prompting | Moment replay |
|---|---|---|
| State | Chat history | Structured ProofMoment |
| Verification | None | Lean/Coq check |
| Progress | Subjective | Tightening score |
| Memory | Prompt text | Moment store + rule store |
| Stop condition | User stops | Score plateau / drift / repeated failure |
| Learning | None | Reusable rules validated on future moments |

There is also a practical distinction between prompt chaining and moment replay.

Prompt chaining often mutates the context window. The failed attempt, the critique, the next attempt, and the next critique all accumulate in the same chat history. Eventually the model is reasoning inside a polluted transcript of its own mistakes.

Moment replay should not work that way.

The messy history lives outside the context window in a Moment Store. When the system replays, it constructs a clean prompt from structured state:

current proof state
+ retrieved neighbours
+ verifier feedback
+ compiled rules

The model does not need to reread every failed path.

It receives the distilled state of the search.

That is the difference between appending history and learning from it.
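As a sketch of what "constructing a clean prompt from structured state" could mean in practice: the function below only ever sees the current state, the latest verifier message, and capped lists of neighbours and rules, so earlier failed transcripts cannot leak in. The function name, the prompt layout, and the closing instruction line are assumptions; rules are assumed to carry an `action.strategy` string like the JSON rule objects in this post.

```python
def build_replay_prompt(proof_state: str, neighbours: list, last_feedback: str,
                        rules: list, max_neighbours: int = 5, max_rules: int = 3) -> str:
    """Assemble a fresh prompt from structured state, not from chat history."""
    lines = ["Current proof state:", proof_state.strip(), ""]

    lines.append("Retrieved neighbours:")
    lines += [f"- {name}" for name in neighbours[:max_neighbours]] or ["- (none)"]
    lines.append("")

    lines.append("Latest verifier feedback:")
    lines.append(last_feedback.strip() or "(none)")
    lines.append("")

    lines.append("Compiled rules for similar states:")
    lines += [f"- {r['action']['strategy']}" for r in rules[:max_rules]] or ["- (none)"]
    lines.append("")

    lines.append("Propose exactly one next tactic.")
    return "\n".join(lines)
```

The prompt size is bounded by the caps, not by how long the search has been running.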

This is not online weight learning. The base model may remain frozen. The learning happens in the surrounding system: the moment store, retrieval policy, tightening score, stop condition, and rule store.


14. The Interactive Part

The post should not end as an essay.

It should become executable.

That is important because the claim of the post is practical:

preserve a moment
→ replay it
→ measure progress
→ stop drift
→ extract what worked

So the reader should be able to run a small version of the same idea.

Not by downloading a massive theorem-proving system.

Not by building a production agent.

Just by pasting a prompt into their AI and asking it to generate the first working scaffold.

The prompt should ask for six things:

1. a proof-moment schema
2. a theorem-search retrieval plan
3. a tightening score
4. a replay protocol
5. a minimal Python orchestration sketch
6. a Lean proof-state experiment

The goal is not for the prompt to solve all theorem proving.

The goal is to let the reader instantiate the architecture.

They should see the system produce the same components the post has described:

state object
retrieval
candidate tactic
verification
score
stop condition
rule extraction

That makes the post different from a normal AI essay.

A normal post says:

Here is an idea.

This post should say:

Here is an idea.
Now paste this into your AI and make the first version of it.

The interactive prompt can be simple:

You are designing a minimal Proof Moment Replay system.

Generate:

1. A ProofMoment JSON schema with fields for:
   - theorem statement
   - imports
   - goal
   - hypotheses
   - local context
   - attempted tactics
   - Lean responses
   - retrieved neighbours
   - tightening score
   - extracted rules

2. A retrieval plan for finding nearby theorems or lemmas from Lean / Mathlib.

3. A tightening score that rewards:
   - solved goals
   - reduced subgoals
   - accepted tactics
   - useful retrieved lemmas
   - simpler proof states

   and penalizes:
   - repeated tactic failures
   - irrelevant retrieved lemmas
   - increased goal complexity
   - semantic drift from the original theorem

4. A replay protocol:
   - capture proof state
   - retrieve neighbours
   - ask an LLM for candidate tactics
   - run Lean
   - score the result
   - retry while tightening improves
   - stop when progress stalls
   - extract a reusable rule from successful attempts

5. A minimal Python orchestration sketch that:
   - stores ProofMoment objects
   - calls an LLM
   - invokes Lean
   - logs verifier output
   - computes tightening
   - writes results to JSON

6. A tiny Lean experiment with 3 theorem statements whose proof bodies are hidden.

Return the result as structured markdown with code blocks.

That prompt is not the proof.

It is the doorway.

The appendices can then carry the heavier implementation:

Appendix A: ProofMoment schema
Appendix B: Minimal Lean runner
Appendix C: Tightening metrics
Appendix D: 3-proof toy experiment
Appendix E: 50-proof benchmark design

This keeps the main post clean.

The reader gets the concept first.

Then, if they want to test it, the post gives them a way in.

That is the publishing pattern I want more AI writing to use.

Not just arguments.

Runnable arguments.

A blog post about AI should not only describe an architecture.

It should let the reader summon a small version of that architecture inside their own AI session.

That is how the post proves its own point.

The moment does not only appear in the essay.

The reader creates one.


15. Ending

Humans live forward.

A moment happens once, and then it starts to decay into memory.

We remember the shape of the problem. We remember that we were close. We remember that something almost worked. But the exact state is gone.

AI does not have to inherit that limitation.

A proof state can be preserved.

A failed step can be replayed.

A nearby theorem can be surfaced before reasoning drifts.

A tactic can be tested instead of trusted.

A successful replay can become a rule.

That is the larger point.

The moment is not just something to remember.

It is something to reopen.

In ordinary AI systems, that may sound abstract. In formal mathematics, it becomes concrete. The state is structured. The verifier is hard. The progress can be measured.

That is why proof is the right test bench.

Not because all AI reasoning is proof.

Because proof gives us the cleanest version of the question:

Can an AI preserve a reasoning state,
search nearby validated knowledge,
replay candidate moves,
measure whether the state tightens,
and extract what worked?

If the answer is yes, then the moment becomes more than a metaphor.

It becomes a systems primitive.

Memory stores what happened.

Experience changes what happens next.

Moment replay is the bridge.

And in formal mathematics, for once, we can check whether the bridge is real.


Appendix A: Tiny Proof-Moment Replay Example

Start with a deliberately simple Lean theorem:

theorem demo_add_comm (a b : Nat) : a + b = b + a := by
  sorry -- proof omitted

The original proof is obvious if you know the library:

theorem demo_add_comm (a b : Nat) : a + b = b + a := by
  exact Nat.add_comm a b

But the point is not that this theorem is hard.

The point is that the proof moment can be captured, replayed, scored, and improved.

The Proof Moment

{
  "moment_id": "M_demo_add_comm",
  "theorem": "demo_add_comm",
  "statement": "theorem demo_add_comm (a b : Nat) : a + b = b + a := by",
  "goal": "a + b = b + a",
  "hypotheses": ["a : Nat", "b : Nat"],
  "local_context": ["Nat", "addition", "equality"],
  "attempted_tactics": [],
  "retrieved_neighbours": [],
  "lean_feedback": null,
  "tightening_score": 0.0,
  "rules_applied": []
}

Pass 1: Baseline Attempt

The LLM proposes:

rw [Nat.add_comm]

Lean response:

accepted

Updated proof:

theorem demo_add_comm (a b : Nat) : a + b = b + a := by
  rw [Nat.add_comm]

Tightening score:

+1.0 closed_goals
+0.3 accepted_tactic
= 1.3

The proof is solved.

In this case, moment replay stops immediately.

That is already useful. The system did not need multiple passes because the first candidate tightened the state completely.
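The Lean check in this pass could be driven from Python roughly as follows. This is a sketch under assumptions: a `lean` executable is on the PATH and is invoked as `lean <file>`, as in the run instructions for the demo file in this post; a zero exit code with no `sorry` warning in the output is treated as "accepted"; and `fill_proof`, `check_with_lean`, and the injectable `run` parameter are hypothetical names. Theorems that need Mathlib would require running inside a Lake project instead (for example by passing `lean_cmd=("lake", "env", "lean")`).

```python
import subprocess
import tempfile
from pathlib import Path


def fill_proof(statement: str, tactic: str) -> str:
    """Append a candidate tactic to a statement that ends in ':= by'."""
    return f"{statement.rstrip()}\n  {tactic}\n"


def check_with_lean(source: str, lean_cmd=("lean",), timeout: int = 60,
                    run=subprocess.run) -> dict:
    """Write the candidate proof to a temporary file and ask Lean to check it.

    `run` is injectable so the function can be exercised without Lean installed.
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "Attempt.lean"
        path.write_text(source, encoding="utf-8")
        try:
            result = run([*lean_cmd, str(path)],
                         capture_output=True, text=True, timeout=timeout)
        except subprocess.TimeoutExpired:
            return {"accepted": False, "output": "timeout"}
    output = (result.stdout + result.stderr).strip()
    # Assumption: exit code 0 and no 'sorry' warning means the proof checked.
    accepted = result.returncode == 0 and "sorry" not in output
    return {"accepted": accepted, "output": output}
```

A call such as `check_with_lean(fill_proof(statement, "rw [Nat.add_comm]"))` then yields the `accepted` flag and the raw verifier output that the proof moment stores as `lean_feedback`.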

A Slightly More Interesting Version

Now suppose the first attempt is weaker.

LLM proposes:

simp

Lean response:

goal remains unsolved

Score:

0.0 no goal closed
-0.5 repeated or unproductive simplification risk
= -0.5

The replay loop retrieves nearby lemmas for the goal shape:

[
  {
    "name": "Nat.add_comm",
    "statement": "n + m = m + n",
    "reason": "matches Nat addition commutativity"
  },
  {
    "name": "Nat.add_assoc",
    "statement": "(n + m) + k = n + (m + k)",
    "reason": "nearby addition theorem, but wrong shape"
  },
  {
    "name": "Nat.zero_add",
    "statement": "0 + n = n",
    "reason": "nearby addition theorem, but no zero appears"
  }
]

Now the system replays the moment with retrieved neighbours:

exact Nat.add_comm a b

Lean response:

accepted
proof complete

Score:

+1.0 closed_goals
+0.3 accepted_tactic
+0.3 retrieved_lemma_used
= 1.6
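The score arithmetic in these passes is a weighted sum of events. The sketch below reuses the weights shown in this appendix (+1.0, +0.3, +0.3, -0.5); the event names and the `tightening_score` function are hypothetical, and the weights themselves are illustrative, not calibrated.

```python
# Weights taken from the worked example above; not calibrated values.
WEIGHTS = {
    "closed_goals": 1.0,
    "accepted_tactic": 0.3,
    "retrieved_lemma_used": 0.3,
    "unproductive_attempt": -0.5,
}


def tightening_score(events: dict) -> float:
    """Weighted sum of event counts, rounded to avoid float noise."""
    total = sum(WEIGHTS[name] * count for name, count in events.items())
    return round(total, 6)


baseline = tightening_score({"closed_goals": 1, "accepted_tactic": 1})
weak_simp = tightening_score({"unproductive_attempt": 1})
replayed = tightening_score(
    {"closed_goals": 1, "accepted_tactic": 1, "retrieved_lemma_used": 1}
)
print(baseline, weak_simp, replayed)
```

The three calls correspond to the baseline pass, the failed `simp` attempt, and the replay with a retrieved lemma.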

Extracted Rule

The successful replay generates a tiny rule:

{
  "rule_id": "R_nat_add_comm_shape",
  "condition": {
    "target_shape": "x + y = y + x",
    "types": ["Nat"],
    "operators": ["+"]
  },
  "action": {
    "priority_lemmas": ["Nat.add_comm"],
    "strategy": "try commutativity lemma before simplification"
  },
  "source_moments": ["M_demo_add_comm"],
  "success_count": 1,
  "fail_count": 0,
  "status": "candidate"
}

This rule should not be promoted globally from one toy proof.

But it shows the mechanism.

The system did not merely remember the final proof.

It learned a conditional navigation pattern:

When the target has the shape x + y = y + x over Nat,
try Nat.add_comm.

Why This Example Matters

This example is intentionally small.

It does not show mathematical genius.

It shows the architecture:

    %%{init: {'theme': 'base', 'themeVariables': { 'fontSize': '16px'}}}%%
flowchart LR
    CPS["📸 Capture<br/>Proof State"]
    AT["🎯 Attempt<br/>Tactic"]
    LF["🛡️ Receive Lean<br/>Feedback"]
    RL["🔍 Retrieve Nearby<br/>Lemma"]
    RC["🔄 Replay with<br/>Candidate"]
    V["🛡️ Verify"]
    ST["📊 Score<br/>Tightening"]
    ER["🧠 Extract<br/>Rule"]

    CPS --> AT --> LF --> RL --> RC --> V --> ST --> ER

    classDef nodeStyle fill:#1e1e2f,stroke-width:3px,color:#cdd6f4,rx:16,ry:16,font-weight:bold
    class CPS,AT,LF,RL,RC,V,ST,ER nodeStyle

    style CPS fill:#2a1e3f,stroke:#cba6f7
    style AT fill:#1e2a3f,stroke:#89b4fa
    style LF fill:#1e3f2a,stroke:#a6e3a1
    style RL fill:#3f2a1e,stroke:#fab387
    style RC fill:#3f1e2a,stroke:#f38ba8
    style V fill:#2a3f1e,stroke:#a6da95
    style ST fill:#3f2a1e,stroke:#fab387
    style ER fill:#2a1e3f,stroke:#cba6f7
  

That is the whole loop in miniature.

The same machinery can then be tested on harder theorem statements where the retrieved neighbour is not obvious, the first tactic fails, and the useful rule generalizes across multiple proof moments.


Appendix B: A Minimal Proof-Moment Replay Demonstration

The main post argues that a reasoning moment can be captured, replayed, measured, and used to extract reusable rules.

Before discussing larger benchmarks, retrieval systems, or experience compilers, it is useful to see the smallest possible version of the idea.

The Lean file below is intentionally simple. It is not a theorem-proving breakthrough. It is a demonstration of the architecture in miniature.

When run, it walks through a toy proof moment:

capture proof state
→ attempt tactic
→ receive verifier feedback
→ retrieve useful neighbour
→ replay
→ verify
→ extract rule

The proof itself is trivial. The architecture is the point.

A proof assistant normally succeeds silently. If the proof checks, nothing interesting appears on the screen. For the purposes of this blog post, the file has been instrumented to print a narrated replay trace so the reader can see the moment evolving.

The goal is to make the core concepts visible:

  • the proof state as a moment
  • verifier feedback
  • semantic neighbours
  • replay
  • tightening
  • rule extraction

If you have Lean 4 installed, you should be able to run the file directly and inspect the output.

The result is not a production system.

It is a minimal, executable example of the moment replay loop described throughout this post.

/-
MomentReplayTrace.lean

A narrated Lean demo for the "Moment Replay" blog post.

This file is designed to SHOW something when run.

Run from a terminal:

  lean MomentReplayTrace.lean

or in a Lake project:

  lake env lean MomentReplayTrace.lean

It prints a miniature proof-moment replay trace.

The proof itself is deliberately simple. The point is the architecture:
capture proof state -> attempt -> verifier feedback -> retrieve neighbour
-> replay -> verify -> score -> compile rule.
-/

set_option autoImplicit false

namespace MomentReplayTrace

/-
A tiny verified proof.

Lean checks this silently, but the #eval blocks below narrate the replay.
-/
theorem demo_add_comm (a b : Nat) : a + b = b + a := by
  exact Nat.add_comm a b

/-
A nearby proof that uses a different theorem family.
-/
theorem demo_add_assoc (a b c : Nat) : (a + b) + c = a + (b + c) := by
  exact Nat.add_assoc a b c

/-
Simple data structures for the demo.
These are not a full implementation; they are a readable sketch.
-/

structure ProofMoment where
  id : String
  theoremName : String
  goal : String
  hypotheses : List String
  attemptedTactics : List String
  retrievedNeighbours : List String
  score : Int
deriving Repr

structure ReplayStep where
  pass : Nat
  candidate : String
  leanResult : String
  scoreDelta : Int
  decision : String
deriving Repr

def initialMoment : ProofMoment :=
  {
    id := "M_demo_add_comm",
    theoremName := "demo_add_comm",
    goal := "a + b = b + a",
    hypotheses := ["a : Nat", "b : Nat"],
    attemptedTactics := [],
    retrievedNeighbours := [],
    score := 0
  }

def pass1 : ReplayStep :=
  {
    pass := 1,
    candidate := "simp",
    leanResult := "goal remains unsolved",
    scoreDelta := -5,
    decision := "replay: simplification did not tighten the state"
  }

def pass2 : ReplayStep :=
  {
    pass := 2,
    candidate := "retrieve Nat.add_comm",
    leanResult := "nearest useful neighbour found: Nat.add_comm",
    scoreDelta := 3,
    decision := "retry with retrieved theorem"
  }

def pass3 : ReplayStep :=
  {
    pass := 3,
    candidate := "exact Nat.add_comm a b",
    leanResult := "accepted by Lean; proof complete",
    scoreDelta := 16,
    decision := "stop: proof solved"
  }

def compiledRule : String :=
  "IF target has shape x + y = y + x over Nat, THEN prioritize Nat.add_comm before generic simplification."

def printList (xs : List String) : String :=
  String.intercalate ", " xs

#eval IO.println ""
#eval IO.println "=== Moment Replay Demo: Proof State as a Moment ==="
#eval IO.println ""
#eval IO.println "1. Captured proof moment"
#eval IO.println s!"moment_id: {initialMoment.id}"
#eval IO.println s!"theorem:   {initialMoment.theoremName}"
#eval IO.println s!"goal:      {initialMoment.goal}"
#eval IO.println s!"context:   {printList initialMoment.hypotheses}"
#eval IO.println ""

#eval IO.println "2. Replay trace"
#eval IO.println s!"Pass {pass1.pass}: {pass1.candidate}"
#eval IO.println s!"  Lean result: {pass1.leanResult}"
#eval IO.println s!"  Score delta: {pass1.scoreDelta}"
#eval IO.println s!"  Decision:    {pass1.decision}"
#eval IO.println ""

#eval IO.println s!"Pass {pass2.pass}: {pass2.candidate}"
#eval IO.println s!"  Lean result: {pass2.leanResult}"
#eval IO.println s!"  Score delta: {pass2.scoreDelta}"
#eval IO.println s!"  Decision:    {pass2.decision}"
#eval IO.println ""

#eval IO.println s!"Pass {pass3.pass}: {pass3.candidate}"
#eval IO.println s!"  Lean result: {pass3.leanResult}"
#eval IO.println s!"  Score delta: {pass3.scoreDelta}"
#eval IO.println s!"  Decision:    {pass3.decision}"
#eval IO.println ""

#eval IO.println "3. Verified Lean proof"
#eval IO.println "theorem demo_add_comm (a b : Nat) : a + b = b + a := by"
#eval IO.println "  exact Nat.add_comm a b"
#eval IO.println ""

#eval IO.println "4. Compiled rule"
#eval IO.println compiledRule
#eval IO.println ""

#eval IO.println "5. What this demonstrates"
#eval IO.println "- The proof state is captured as a Moment."
#eval IO.println "- A weak tactic produces verifier feedback."
#eval IO.println "- Retrieval surfaces a useful theorem neighbour."
#eval IO.println "- Replay uses the neighbour."
#eval IO.println "- Lean verifies the result."
#eval IO.println "- The successful pattern becomes a reusable rule."
#eval IO.println ""

#eval IO.println "Done."

end MomentReplayTrace


=== Moment Replay Demo: Proof State as a Moment ===

1. Captured proof moment
moment_id: M_demo_add_comm
theorem:   demo_add_comm
goal:      a + b = b + a
context:   a : Nat, b : Nat

2. Replay trace
Pass 1: simp
  Lean result: goal remains unsolved
  Score delta: -5
  Decision:    replay: simplification did not tighten the state

Pass 2: retrieve Nat.add_comm
  Lean result: nearest useful neighbour found: Nat.add_comm
  Score delta: 3
  Decision:    retry with retrieved theorem

Pass 3: exact Nat.add_comm a b
  Lean result: accepted by Lean; proof complete
  Score delta: 16
  Decision:    stop: proof solved

3. Verified Lean proof
theorem demo_add_comm (a b : Nat) : a + b = b + a := by
  exact Nat.add_comm a b

4. Compiled rule
IF target has shape x + y = y + x over Nat, THEN prioritize Nat.add_comm before generic simplification.

5. What this demonstrates
- The proof state is captured as a Moment.
- A weak tactic produces verifier feedback.
- Retrieval surfaces a useful theorem neighbour.
- Replay uses the neighbour.
- Lean verifies the result.
- The successful pattern becomes a reusable rule.

Done.
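
The trace above is scripted: its result strings are literals in the Lean file. A runner that wanted real verifier feedback could write each candidate tactic into a temporary file and run Lean on it. The Python sketch below shows one way this might look. It is not part of the original demo. The names build_lean_source and check_candidate are hypothetical, and the sketch assumes a lean executable on PATH that exits with a non-zero status when the file contains an error.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def build_lean_source(statement, tactic):
    """Wrap a theorem statement and one candidate tactic in a Lean 4 file."""
    return f"theorem candidate {statement} := by\n  {tactic}\n"


def check_candidate(statement, tactic, lean_cmd="lean", timeout=60):
    """Return (accepted, message) for one candidate tactic.

    Assumption: `lean_cmd` exits non-zero when the file has an error.
    """
    if shutil.which(lean_cmd) is None:
        return False, f"{lean_cmd} not found on PATH"
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "Candidate.lean"
        path.write_text(build_lean_source(statement, tactic), encoding="utf-8")
        result = subprocess.run(
            [lean_cmd, str(path)],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    # Lean's diagnostics become the verifier feedback stored in the moment.
    message = (result.stdout + result.stderr).strip()
    return result.returncode == 0, message


if __name__ == "__main__":
    goal = "(a b : Nat) : a + b = b + a"
    for tactic in ["rw [Nat.mul_zero]", "exact Nat.add_comm a b"]:
        accepted, message = check_candidate(goal, tactic)
        print(tactic, "->", "accepted" if accepted else "rejected")
        if message:
            print("  ", message.splitlines()[0])
```

One caveat: Lean reports sorry as a warning, not an error, so a real runner would also need to reject candidates that contain it.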

Appendix C: Tightening Metrics and Stop Conditions

The main post uses the word tightening to describe proof progress.

This appendix makes that measurable.

The metric below is not meant to be perfect. It is a first-pass operational score: good enough to let a replay loop decide whether a candidate tactic improved the proof state, made it worse, or simply created motion.

Lean already gives us the essential structure. A tactic transforms a proof state: it may close a goal, create subgoals, simplify the target, or fail entirely. Lean’s documentation describes tactics as commands that operate on proof states, where a proof state consists of a sequence of goals, and a successful tactic may leave zero or more subgoals.

That makes tightening computable.

Intuition

Tightening should increase when:

  • a goal is closed
  • the number of subgoals decreases
  • a tactic is accepted
  • a retrieved lemma is used in an accepted step
  • the resulting goal is simpler

Tightening should decrease when:

  • a tactic fails repeatedly
  • the number of subgoals increases without useful simplification
  • the target becomes more complex
  • the replay moves into an unrelated theorem neighbourhood
  • the same error class repeats

The point is not to replace Lean’s verifier.

Lean answers:

Was this step valid?

The tightening score answers:

Was this step useful?

Concrete formula

Let:

G₀ = number of goals before candidate
G₁ = number of goals after candidate

C₀ = complexity score before candidate
C₁ = complexity score after candidate

A = 1 if tactic was accepted, 0 otherwise
U = 1 if an accepted tactic used a retrieved lemma, 0 otherwise
F = number of recent failures for the same tactic pattern
D = semantic drift penalty, from 0 to 1

Define:

ΔG = (G₀ - G₁) / max(1, G₀)
ΔC = (C₀ - C₁) / max(1, C₀)

Then:

T =
  w_g * ΔG
+ w_c * ΔC
+ w_a * A
+ w_u * U
- w_f * tanh(F)
- w_d * D

Example weights:

w_g = 0.5
w_c = 0.2
w_a = 0.2
w_u = 0.1
w_f = 0.3
w_d = 0.5

Clamp T to [-1, +1].

If all goals are closed, set:

T = +1.0

A solved proof should dominate all partial signals.
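
To make the arithmetic concrete, here is a small worked example with the example weights. The three input states are invented for illustration, and the function name tightening is a placeholder. It is a sketch of the formula as stated above, not a tuned metric.

```python
import math

# Example weights from the text.
W_G, W_C, W_A, W_U, W_F, W_D = 0.5, 0.2, 0.2, 0.1, 0.3, 0.5


def tightening(g0, g1, c0, c1, accepted, used_lemma, failures, drift):
    """Evaluate T for one candidate step."""
    if accepted and g1 == 0:
        return 1.0  # a solved proof dominates all partial signals
    delta_g = (g0 - g1) / max(1, g0)
    delta_c = (c0 - c1) / max(1, c0)
    t = (
        W_G * delta_g
        + W_C * delta_c
        + W_A * (1.0 if accepted else 0.0)
        + W_U * (1.0 if used_lemma else 0.0)
        - W_F * math.tanh(failures)
        - W_D * drift
    )
    return max(-1.0, min(1.0, t))  # clamp to [-1, +1]


# A rejected tactic with one earlier failure of the same pattern:
# only the failure penalty applies, -0.3 * tanh(1).
failed = tightening(1, 1, 5, 5, accepted=False, used_lemma=False, failures=1, drift=0.0)

# An accepted step that uses a retrieved lemma, halves the goals (2 -> 1)
# and shrinks complexity (10 -> 8): 0.25 + 0.04 + 0.2 + 0.1.
partial = tightening(2, 1, 10, 8, accepted=True, used_lemma=True, failures=0, drift=0.0)

# An accepted step that closes the last goal.
solved = tightening(1, 0, 5, 0, accepted=True, used_lemma=True, failures=0, drift=0.0)

print(round(failed, 3), round(partial, 3), solved)  # prints: -0.228 0.59 1.0
```

Under these weights, goal reduction is the largest single term, so a step that is merely accepted (0.2) scores well below a step that is accepted and also closes a subgoal.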

Complexity score

The simplest complexity score is a rough AST or string-size proxy:

complexity =
number of goal nodes
+ number of hypotheses
+ normalized target length

A better implementation would use Lean’s internal expression tree.

But the first version does not need to be elegant.

It only needs to distinguish:

cleaner proof state

from:

larger, noisier proof state
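
A minimal sketch of the string-size proxy, assuming each goal is available as a dict with a hypotheses list and a target string. The text does not say how the target length is normalized, so the whitespace token count divided by a length_scale of 10 is an assumption made here for illustration.

```python
def complexity(goals, length_scale=10.0):
    """Rough size proxy for a proof state.

    `goals` is a list of dicts, each with a "hypotheses" list and a
    "target" string. `length_scale` is an assumed normalizer, not a
    value taken from the post.
    """
    num_goals = len(goals)
    num_hypotheses = sum(len(g["hypotheses"]) for g in goals)
    # Whitespace tokens as a crude stand-in for expression size.
    target_tokens = sum(len(g["target"].split()) for g in goals)
    return num_goals + num_hypotheses + target_tokens / length_scale


state = [{"hypotheses": ["a : Nat", "b : Nat"], "target": "a + b = b + a"}]
print(complexity(state))  # 1 goal + 2 hypotheses + 7 tokens / 10
```

A proxy like this only has to rank states consistently. Its absolute value carries no meaning, which is why ΔC divides by the previous complexity.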

Retrieved lemma usage

Do not rely only on Lean’s printed output to determine whether a retrieved lemma was used.

A runner can track this directly.

If the replay system retrieved Nat.add_comm, generated the tactic:

exact Nat.add_comm a b

and Lean accepted it, then U = 1.

If the tactic failed, or if the retrieved lemma never appeared in the accepted tactic, then U = 0.
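
One way a runner might track this is a plain text check on the accepted tactic. The helper names below are hypothetical. Matching on whole identifiers is a simplification: it would miss a lemma that Lean applies implicitly, for example through simp.

```python
import re


def mentions_lemma(tactic, lemma):
    """True if `lemma` occurs in `tactic` as a whole identifier."""
    # Reject matches that are part of a longer name, such as a prefix
    # of another lemma or the tail of a longer namespace path.
    pattern = r"(?<![\w.])" + re.escape(lemma) + r"(?![\w'])"
    return re.search(pattern, tactic) is not None


def used_retrieved_lemma(tactic, retrieved, accepted):
    """Compute U: 1 only if Lean accepted the tactic and it names a retrieved lemma."""
    if not accepted:
        return 0
    return 1 if any(mentions_lemma(tactic, lemma) for lemma in retrieved) else 0


print(used_retrieved_lemma("exact Nat.add_comm a b", ["Nat.add_comm"], accepted=True))
print(used_retrieved_lemma("exact Nat.add_comm a b", ["Nat.add_comm"], accepted=False))
```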

Semantic drift

Semantic drift is optional in a toy runner, but important in a serious one.

In proof, drift means the replay has moved away from the original obligation.

A simple drift signal can compare the current goal’s normalized type signature with the original theorem target. If repeated replay creates subgoals whose constants, type families, or theorem neighbourhoods diverge from the original target, the system is no longer tightening. It is wandering.

For a first implementation:

D = 0

For a more advanced implementation:

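D = 1 - cosine_similarity(
  embedding(original_goal_signature),
  embedding(current_goal_signature)
)

In Python, with embed and cosine_similarity left as placeholders for whatever embedding model and similarity function the runner provides:

def compute_drift(original_theorem_sig, current_goal_sig):
    # sig = normalized string of the target type and hypothesis types
    emb_orig = embed(original_theorem_sig)  # e.g. a pre-trained code embedding model
    emb_curr = embed(current_goal_sig)
    # Cosine similarity can be negative, so clamp to keep D in [0, 1].
    return min(1.0, max(0.0, 1.0 - cosine_similarity(emb_orig, emb_curr)))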
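
When no embedding model is at hand, a cheaper stand-in is to compare token counts instead of learned embeddings. The sketch below swaps in that technique: it is a bag-of-tokens cosine, not the embedding-based drift described above, and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def token_counts(signature):
    """Count identifiers (dots allowed) and single symbols in a goal signature."""
    return Counter(re.findall(r"[A-Za-z_][\w.']*|\S", signature))


def cosine_similarity(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def compute_drift_tokens(original_sig, current_sig):
    """Drift in [0, 1]: 0 for identical token profiles, 1 for disjoint ones."""
    sim = cosine_similarity(token_counts(original_sig), token_counts(current_sig))
    return min(1.0, max(0.0, 1.0 - sim))


print(compute_drift_tokens("a + b = b + a", "a + b = b + a"))
print(compute_drift_tokens("a + b = b + a", "List.map f xs"))
```

Token overlap ignores structure, so a reordered goal shows little or no drift. That is acceptable for a toy runner but too coarse for a serious one.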

Pseudocode

import math

def clamp(x, lo=-1.0, hi=1.0):
    return max(lo, min(hi, x))

def compute_tightening(prev_state, new_state, recent_failures=0, drift=0.0):
    G0 = prev_state["num_goals"]
    G1 = new_state["num_goals"]

    C0 = prev_state["complexity"]
    C1 = new_state["complexity"]

    A = 1.0 if new_state["accepted"] else 0.0
    U = 1.0 if new_state.get("used_retrieved_lemma", False) else 0.0

    delta_g = (G0 - G1) / max(1, G0)
    delta_c = (C0 - C1) / max(1, C0)

    wg, wc, wa, wu, wf, wd = 0.5, 0.2, 0.2, 0.1, 0.3, 0.5

    if G1 == 0 and new_state["accepted"]:
        return 1.0

    score = (
        wg * delta_g
        + wc * delta_c
        + wa * A
        + wu * U
        - wf * math.tanh(recent_failures)
        - wd * drift
    )

    return clamp(score)

Example replay trace

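| Pass | Candidate | Lean result | Tightening | Decision |
|------|-----------|-------------|------------|----------|
| 1 | simp | goal remains unsolved | -0.30 | Replay |
| 2 | retrieve Nat.add_comm | useful neighbour found | +0.10 | Retry |
| 3 | exact Nat.add_comm a b | proof complete | +1.00 | Stop |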

This is the whole idea in miniature.

A replay loop does not continue because the model has more to say.

It continues because the state is tightening.

Stop condition

A practical stop condition can be simple:

stop if:
  proof is complete
  OR no score improvement for 3 attempts
  OR same error class repeats twice
  OR semantic drift exceeds threshold
  OR tactic diversity collapses
  OR compute budget is exhausted

In pseudocode:

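def should_stop(history, max_attempts=8, plateau_window=3, drift_threshold=0.6):
    # Nothing has been attempted yet, so there is nothing to stop on.
    if not history:
        return False

    last = history[-1]

    # The proof is complete.
    if last["proof_complete"]:
        return True

    # Compute budget, expressed here as a cap on attempts.
    if len(history) >= max_attempts:
        return True

    # Plateau: none of the last plateau_window attempts tightened the state.
    recent = history[-plateau_window:]
    if len(recent) == plateau_window:
        scores = [step["tightening"] for step in recent]
        if max(scores) <= 0:
            return True

    # Semantic drift exceeds the threshold.
    if last.get("drift", 0.0) > drift_threshold:
        return True

    # Same error class repeated; the caller is expected to set this flag.
    if last.get("same_error_repeated", False):
        return True

    return False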

This is deliberately conservative.

The replay system should search aggressively, but it should not hallucinate progress.

If the proof state is not tightening, replay should stop, change strategy, or record the failure as negative experience.
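
The pseudocode above reads a same_error_repeated flag from the latest history entry and does not cover the tactic-diversity condition. Both signals can be derived from the history itself. The sketch below assumes each entry carries an error_class field (None when the step was accepted) and a candidate string. Those field names, the window size, and the distinct-candidate threshold are illustrative choices, not values from the post.

```python
def same_error_repeated(history, repeats=2):
    """True if the last `repeats` steps all failed with the same error class."""
    recent = history[-repeats:]
    if len(recent) < repeats:
        return False
    classes = [step.get("error_class") for step in recent]
    return classes[0] is not None and all(c == classes[0] for c in classes)


def diversity_collapsed(history, window=4, min_distinct=2):
    """True if the last `window` candidates contain fewer than `min_distinct` distinct tactics."""
    recent = [step["candidate"] for step in history[-window:]]
    if len(recent) < window:
        return False
    return len(set(recent)) < min_distinct


history = [
    {"candidate": "simp", "error_class": "unsolved_goals"},
    {"candidate": "simp", "error_class": "unsolved_goals"},
]
print(same_error_repeated(history))   # True
print(diversity_collapsed(history))   # False: fewer than 4 attempts so far
```

Keeping these as separate predicates lets the runner record which condition ended a replay, which is the information a negative-experience record needs.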


Appendix D: A Three-Proof Toy Experiment

Appendix B showed the smallest possible proof moment.

This appendix expands the idea to three tiny proof moments. The goal is still not to build a serious theorem prover. The goal is to show the replay pattern repeating across different goal shapes.

The file below proves three simple Lean theorems:

  • addition commutativity
  • multiplication by zero
  • addition associativity

For each theorem, the script prints a narrated replay trace:

    %%{init: {'theme': 'base', 'themeVariables': { 'fontSize': '16px'}}}%%
flowchart LR
    CPM["📸 Captured<br/>Proof Moment"]
    WC["⚠️ Weak or Wrong<br/>Candidate"]
    RUN["🔍 Retrieved Useful<br/>Neighbour"]
    VR["🔄 Verified<br/>Replay"]
    TS["📊 Tightening<br/>Score"]
    CR["🧠 Compiled<br/>Rule"]

    CPM --> WC --> RUN --> VR --> TS --> CR

    classDef nodeStyle fill:#1e1e2f,stroke-width:3px,color:#cdd6f4,rx:16,ry:16,font-weight:bold
    class CPM,WC,RUN,VR,TS,CR nodeStyle

    style CPM fill:#2a1e3f,stroke:#cba6f7
    style WC fill:#3f1e2a,stroke:#f38ba8
    style RUN fill:#3f2a1e,stroke:#fab387
    style VR fill:#1e3f2a,stroke:#a6e3a1
    style TS fill:#1e2a3f,stroke:#89b4fa
    style CR fill:#2a3f1e,stroke:#a6da95
  

This matters because moment replay is only interesting if it generalizes beyond one example.

The three examples show that different proof states require different neighbourhoods:

x + y = y + x           → Nat.add_comm
x * 0 = 0               → Nat.mul_zero
(x + y) + z = x + ...   → Nat.add_assoc

They also show why retrieval alone is not enough. A theorem may be nearby because it uses the same symbols, but still fail to tighten the current proof state. The verifier decides. The replay trace records. The experience compiler extracts the rule.

This is still a toy.

But it demonstrates the shape of the real benchmark.

/-
MomentReplay3Proofs.lean

Appendix D: 3-proof toy experiment for the "Moment Replay" blog post.

This file is designed to RUN and PRINT a demonstration.

Run:

  lean MomentReplay3Proofs.lean

or inside a Lake project:

  lake env lean MomentReplay3Proofs.lean

It uses only basic Nat theorems, so it should not require Mathlib.

What it demonstrates:

1. Three tiny proof moments.
2. Baseline weak attempts.
3. Retrieved useful neighbours.
4. Verified replays.
5. Tightening scores.
6. Toy rule extraction.

This is not a production prover.
It is a visible, runnable miniature of the architecture.
-/

set_option autoImplicit false

namespace MomentReplay3Proofs

/-
Verified proofs.

Lean checks these theorems.
The printed trace below narrates how a moment-replay system would arrive at them.
-/

theorem hidden_add_comm (a b : Nat) : a + b = b + a := by
  exact Nat.add_comm a b

theorem hidden_mul_zero (a : Nat) : a * 0 = 0 := by
  exact Nat.mul_zero a

theorem hidden_add_assoc (a b c : Nat) : (a + b) + c = a + (b + c) := by
  exact Nat.add_assoc a b c


/-
Tiny data model for the narrated experiment.
-/

structure ToyMoment where
  id : String
  theoremName : String
  goal : String
  hypotheses : List String
deriving Repr

structure ToyPass where
  passNumber : Nat
  candidate : String
  verifierResult : String
  tightening : String
  decision : String
deriving Repr

def joinComma : List String → String
  | [] => ""
  | [x] => x
  | x :: xs => x ++ ", " ++ joinComma xs

def printMoment (m : ToyMoment) : IO Unit := do
  IO.println s!"moment_id: {m.id}"
  IO.println s!"theorem:   {m.theoremName}"
  IO.println s!"goal:      {m.goal}"
  IO.println s!"context:   {joinComma m.hypotheses}"

def printPass (p : ToyPass) : IO Unit := do
  IO.println s!"Pass {p.passNumber}: {p.candidate}"
  IO.println s!"  Lean/verifier result: {p.verifierResult}"
  IO.println s!"  Tightening:           {p.tightening}"
  IO.println s!"  Decision:             {p.decision}"


/-
Proof Moment 1:
  addition commutativity
-/

def momentAddComm : ToyMoment :=
  {
    id := "M_001_add_comm",
    theoremName := "hidden_add_comm",
    goal := "a + b = b + a",
    hypotheses := ["a : Nat", "b : Nat"]
  }

def addCommPass1 : ToyPass :=
  {
    passNumber := 1,
    candidate := "simp",
    verifierResult := "goal remains unresolved",
    tightening := "-0.30",
    decision := "replay: weak attempt did not close or simplify the goal"
  }

def addCommPass2 : ToyPass :=
  {
    passNumber := 2,
    candidate := "retrieve Nat.add_comm",
    verifierResult := "useful neighbour found: Nat.add_comm",
    tightening := "+0.10",
    decision := "retry using retrieved theorem"
  }

def addCommPass3 : ToyPass :=
  {
    passNumber := 3,
    candidate := "exact Nat.add_comm a b",
    verifierResult := "accepted by Lean; proof complete",
    tightening := "+1.00",
    decision := "stop: proof solved"
  }


/-
Proof Moment 2:
  multiplication by zero
-/

def momentMulZero : ToyMoment :=
  {
    id := "M_002_mul_zero",
    theoremName := "hidden_mul_zero",
    goal := "a * 0 = 0",
    hypotheses := ["a : Nat"]
  }

def mulZeroPass1 : ToyPass :=
  {
    passNumber := 1,
    candidate := "rw [Nat.add_comm]",
    verifierResult := "rejected / wrong theorem family",
    tightening := "-0.50",
    decision := "replay: addition commutativity is not proof-useful here"
  }

def mulZeroPass2 : ToyPass :=
  {
    passNumber := 2,
    candidate := "retrieve Nat.mul_zero",
    verifierResult := "useful neighbour found: Nat.mul_zero",
    tightening := "+0.10",
    decision := "retry using multiplication-zero theorem"
  }

def mulZeroPass3 : ToyPass :=
  {
    passNumber := 3,
    candidate := "exact Nat.mul_zero a",
    verifierResult := "accepted by Lean; proof complete",
    tightening := "+1.00",
    decision := "stop: proof solved"
  }


/-
Proof Moment 3:
  addition associativity
-/

def momentAddAssoc : ToyMoment :=
  {
    id := "M_003_add_assoc",
    theoremName := "hidden_add_assoc",
    goal := "(a + b) + c = a + (b + c)",
    hypotheses := ["a : Nat", "b : Nat", "c : Nat"]
  }

def addAssocPass1 : ToyPass :=
  {
    passNumber := 1,
    candidate := "exact Nat.add_comm a b",
    verifierResult := "rejected / nearby symbol but wrong shape",
    tightening := "-0.50",
    decision := "replay: commutativity does not tighten an associativity goal"
  }

def addAssocPass2 : ToyPass :=
  {
    passNumber := 2,
    candidate := "retrieve Nat.add_assoc",
    verifierResult := "useful neighbour found: Nat.add_assoc",
    tightening := "+0.10",
    decision := "retry using associativity theorem"
  }

def addAssocPass3 : ToyPass :=
  {
    passNumber := 3,
    candidate := "exact Nat.add_assoc a b c",
    verifierResult := "accepted by Lean; proof complete",
    tightening := "+1.00",
    decision := "stop: proof solved"
  }


/-
Toy rule extraction.

These are not global mathematical truths.
They are candidate navigation rules extracted from successful proof moments.
-/

def ruleAddComm : String :=
  "R_add_comm: IF target has shape x + y = y + x over Nat, THEN prioritize Nat.add_comm."

def ruleMulZero : String :=
  "R_mul_zero: IF target has shape x * 0 = 0 over Nat, THEN prioritize Nat.mul_zero."

def ruleAddAssoc : String :=
  "R_add_assoc: IF target has shape (x + y) + z = x + (y + z) over Nat, THEN prioritize Nat.add_assoc."

def negativeRule : String :=
  "R_negative_shape: Do not treat all addition goals as the same; commutativity and associativity tighten different states."


/-
Narrated output.
-/

#eval do
  IO.println ""
  IO.println "=== Appendix D: 3-Proof Moment Replay Toy Experiment ==="
  IO.println ""
  IO.println "This file proves three tiny Lean theorems and prints the replay trace"
  IO.println "a moment-replay system would record around them."
  IO.println ""
  IO.println "The proofs are intentionally simple. The architecture is the point."
  IO.println ""

  IO.println "------------------------------------------------------------"
  IO.println "Proof Moment 1: Addition commutativity"
  IO.println "------------------------------------------------------------"
  printMoment momentAddComm
  IO.println ""
  printPass addCommPass1
  IO.println ""
  printPass addCommPass2
  IO.println ""
  printPass addCommPass3
  IO.println ""
  IO.println "Verified proof:"
  IO.println "theorem hidden_add_comm (a b : Nat) : a + b = b + a := by"
  IO.println "  exact Nat.add_comm a b"
  IO.println ""

  IO.println "------------------------------------------------------------"
  IO.println "Proof Moment 2: Multiplication by zero"
  IO.println "------------------------------------------------------------"
  printMoment momentMulZero
  IO.println ""
  printPass mulZeroPass1
  IO.println ""
  printPass mulZeroPass2
  IO.println ""
  printPass mulZeroPass3
  IO.println ""
  IO.println "Verified proof:"
  IO.println "theorem hidden_mul_zero (a : Nat) : a * 0 = 0 := by"
  IO.println "  exact Nat.mul_zero a"
  IO.println ""

  IO.println "------------------------------------------------------------"
  IO.println "Proof Moment 3: Addition associativity"
  IO.println "------------------------------------------------------------"
  printMoment momentAddAssoc
  IO.println ""
  printPass addAssocPass1
  IO.println ""
  printPass addAssocPass2
  IO.println ""
  printPass addAssocPass3
  IO.println ""
  IO.println "Verified proof:"
  IO.println "theorem hidden_add_assoc (a b c : Nat) : (a + b) + c = a + (b + c) := by"
  IO.println "  exact Nat.add_assoc a b c"
  IO.println ""

  IO.println "------------------------------------------------------------"
  IO.println "Toy Experience Compiler Output"
  IO.println "------------------------------------------------------------"
  IO.println ruleAddComm
  IO.println ruleMulZero
  IO.println ruleAddAssoc
  IO.println negativeRule
  IO.println ""

  IO.println "What this demonstrates:"
  IO.println "- A proof state can be represented as a replayable moment."
  IO.println "- Weak or wrong candidate tactics become verifier feedback."
  IO.println "- Retrieved neighbours are candidates, not truth."
  IO.println "- Lean verifies the replayed move."
  IO.println "- Successful replays can be compiled into reusable navigation rules."
  IO.println ""
  IO.println "Done."

end MomentReplay3Proofs

=== Appendix D: 3-Proof Moment Replay Toy Experiment ===

This file proves three tiny Lean theorems and prints the replay trace
a moment-replay system would record around them.

The proofs are intentionally simple. The architecture is the point.

------------------------------------------------------------
Proof Moment 1: Addition commutativity
------------------------------------------------------------
moment_id: M_001_add_comm
theorem:   hidden_add_comm
goal:      a + b = b + a
context:   a : Nat, b : Nat

Pass 1: simp
  Lean/verifier result: goal remains unresolved
  Tightening:           -0.30
  Decision:             replay: weak attempt did not close or simplify the goal

Pass 2: retrieve Nat.add_comm
  Lean/verifier result: useful neighbour found: Nat.add_comm
  Tightening:           +0.10
  Decision:             retry using retrieved theorem

Pass 3: exact Nat.add_comm a b
  Lean/verifier result: accepted by Lean; proof complete
  Tightening:           +1.00
  Decision:             stop: proof solved

Verified proof:
theorem hidden_add_comm (a b : Nat) : a + b = b + a := by
  exact Nat.add_comm a b

------------------------------------------------------------
Proof Moment 2: Multiplication by zero
------------------------------------------------------------
moment_id: M_002_mul_zero
theorem:   hidden_mul_zero
goal:      a * 0 = 0
context:   a : Nat

Pass 1: rw [Nat.add_comm]
  Lean/verifier result: rejected / wrong theorem family
  Tightening:           -0.50
  Decision:             replay: addition commutativity is not proof-useful here

Pass 2: retrieve Nat.mul_zero
  Lean/verifier result: useful neighbour found: Nat.mul_zero
  Tightening:           +0.10
  Decision:             retry using multiplication-zero theorem

Pass 3: exact Nat.mul_zero a
  Lean/verifier result: accepted by Lean; proof complete
  Tightening:           +1.00
  Decision:             stop: proof solved

Verified proof:
theorem hidden_mul_zero (a : Nat) : a * 0 = 0 := by
  exact Nat.mul_zero a

------------------------------------------------------------
Proof Moment 3: Addition associativity
------------------------------------------------------------
moment_id: M_003_add_assoc
theorem:   hidden_add_assoc
goal:      (a + b) + c = a + (b + c)
context:   a : Nat, b : Nat, c : Nat

Pass 1: exact Nat.add_comm a b
  Lean/verifier result: rejected / nearby symbol but wrong shape
  Tightening:           -0.50
  Decision:             replay: commutativity does not tighten an associativity goal

Pass 2: retrieve Nat.add_assoc
  Lean/verifier result: useful neighbour found: Nat.add_assoc
  Tightening:           +0.10
  Decision:             retry using associativity theorem

Pass 3: exact Nat.add_assoc a b c
  Lean/verifier result: accepted by Lean; proof complete
  Tightening:           +1.00
  Decision:             stop: proof solved

Verified proof:
theorem hidden_add_assoc (a b c : Nat) : (a + b) + c = a + (b + c) := by
  exact Nat.add_assoc a b c

------------------------------------------------------------
Toy Experience Compiler Output
------------------------------------------------------------
R_add_comm: IF target has shape x + y = y + x over Nat, THEN prioritize Nat.add_comm.
R_mul_zero: IF target has shape x * 0 = 0 over Nat, THEN prioritize Nat.mul_zero.
R_add_assoc: IF target has shape (x + y) + z = x + (y + z) over Nat, THEN prioritize Nat.add_assoc.
R_negative_shape: Do not treat all addition goals as the same; commutativity and associativity tighten different states.

What this demonstrates:
- A proof state can be represented as a replayable moment.
- Weak or wrong candidate tactics become verifier feedback.
- Retrieved neighbours are candidates, not truth.
- Lean verifies the replayed move.
- Successful replays can be compiled into reusable navigation rules.

Done.
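
The compiled rules above are printed as strings. To be reused on a later moment they would need to be executable. The Python sketch below is one hedged way to do that, treating each rule as a goal-shape pattern paired with the lemma to try first. The RULES table and suggest_lemma are hypothetical names, and matching on the goal's printed text is a simplification: it does not check the "over Nat" condition, which a real runner would read from the hypotheses.

```python
import re

# Each rule: (rule id, goal-shape pattern, lemma to try first).
# Backreferences (\1, \2, \3) require the same variable names to recur.
RULES = [
    ("R_add_comm", re.compile(r"^(\w+) \+ (\w+) = \2 \+ \1$"), "Nat.add_comm"),
    ("R_mul_zero", re.compile(r"^(\w+) \* 0 = 0$"), "Nat.mul_zero"),
    (
        "R_add_assoc",
        re.compile(r"^\((\w+) \+ (\w+)\) \+ (\w+) = \1 \+ \(\2 \+ \3\)$"),
        "Nat.add_assoc",
    ),
]


def suggest_lemma(goal):
    """Return (rule_id, lemma) for the first rule whose shape matches, else None."""
    text = goal.strip()
    for rule_id, pattern, lemma in RULES:
        if pattern.match(text):
            return rule_id, lemma
    return None


for goal in ["m + n = n + m", "k * 0 = 0", "(x + y) + z = x + (y + z)", "a + b = a + b"]:
    print(goal, "->", suggest_lemma(goal))
```

A suggestion is only a candidate. In keeping with the loop above, the suggested lemma still has to be replayed and accepted by Lean before it counts.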

Appendix E: 50-Proof Benchmark Design

The toy files above show that the loop can be made visible. The real test is whether it improves performance across a small benchmark.

Use 50 easy-to-moderate Lean theorem statements with proof bodies hidden.

Compare:

| System | Description |
|--------|-------------|
| Baseline LLM | proof state → candidate tactic → Lean check |
| Retrieval LLM | proof state → retrieved lemmas → candidate tactic → Lean check |
| Moment Replay | proof state → retrieval → tactic → Lean check → tightening score → replay/stop → rule extraction |

Measure:

| Metric | Why it matters |
|--------|----------------|
| Solved rate | Did the system close more proofs? |
| Attempts per solved proof | Did replay reduce wasted search? |
| Accepted tactic count | Did it generate more valid intermediate steps? |
| Repeated failure count | Did the stop policy prevent looping? |
| Retrieved lemma usefulness | Did semantic neighbours actually help? |
| Rule reuse success | Did extracted rules improve later moments? |

A positive result does not need to be dramatic. If moment replay solves more proofs, uses fewer repeated attempts, or extracts rules that improve held-out proof states, the architecture has signal.
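
A sketch of how three of these metrics might be computed from per-theorem run logs. The record fields (solved, attempts, repeated_failures) are an assumed log format, and the two records at the bottom are made-up placeholders that show the shape of the input, not benchmark results.

```python
def summarize(runs):
    """Aggregate per-theorem run records into benchmark metrics for one system."""
    total = len(runs)
    solved = [r for r in runs if r["solved"]]
    return {
        # Solved rate: fraction of theorems closed.
        "solved_rate": len(solved) / total if total else 0.0,
        # Attempts per solved proof: undefined (None) if nothing was solved.
        "attempts_per_solved": (
            sum(r["attempts"] for r in solved) / len(solved) if solved else None
        ),
        # Repeated failure count: total across all theorems.
        "repeated_failures": sum(r["repeated_failures"] for r in runs),
    }


# Placeholder records, for illustration only.
example_runs = [
    {"solved": True, "attempts": 3, "repeated_failures": 0},
    {"solved": False, "attempts": 8, "repeated_failures": 2},
]
print(summarize(example_runs))
```

Running the same summary over the logs of all three systems on the same 50 statements gives the side-by-side comparison the table calls for.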