Case-Based Reasoning: Teaching AI to Learn From Itself

✨ Summary
Imagine an AI that gets smarter every time it works: not by retraining on massive datasets, but by learning from its own reasoning and reflection, just like humans.
Most AI systems are frozen in time. Trained once, deployed forever, they never learn from mistakes or build on successes. Real intelligence, human or artificial, doesn’t work that way. It learns from experience.
This is the vision behind Stephanie: a self-improving AI that gets better every time it acts, not by fine-tuning, but by remembering, reusing, and revising its reasoning.
In this post, we implement the core ideas of the paper Memento: Fine-tuning LLM Agents without Fine-tuning LLMs inside Stephanie.
By the end, you’ll see how case-based reasoning, multi-dimensional scoring, and retention policies combine to give Stephanie something most AI systems lack: the ability to truly learn from experience.
🧐 Hierarchical Reasoning Model (HRM)
👉 See Post: Layers of thought: smarter reasoning with the Hierarchical Reasoning Model
Before we can teach Stephanie to reuse her reasoning, we first need a way to judge its quality.
The paper HRM: Hierarchical Reasoning Model introduced a new way to think about reasoning itself. The Hierarchical Reasoning Model (HRM) gives Stephanie the ability to reason in layers. Instead of jumping straight from input to output, HRM breaks problems down into:
- High-level strategy (HModule) 🧭 – setting the overall plan.
- Low-level analysis (LModule) 🔍 – working through details step by step.
- Nested loops 🔄 – iterating between the two, refining judgment until confident.
This matters because self-improvement requires reflection. Other scorers (MRQ, SICQL, EBT, SVM) give useful numbers, but HRM provides a reasoned judgment with a trace of how it got there, which is essential for Case-Based Reasoning and middleware reuse.
🎨 Simplified Flow of HRM
flowchart LR A["📄 Input (Goal + Doc)"]:::data --> B["🎯 Input Projector"]:::module B --> C["🧭 HModule<br/>High-Level Strategy"]:::module C --> D["🔍 LModule<br/>Detail Analysis"]:::module D --> E["🔄 Nested Loop<br/>(N cycles × T steps)"]:::loop E --> C C --> F["📊 Final Output<br/>Reasoned Score"]:::data classDef data fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px; classDef module fill:#e3f2fd,stroke:#1565c0,stroke-width:2px; classDef loop fill:#fff3e0,stroke:#ef6c00,stroke-width:2px;
This diagram is deliberately simplified: one input ➝ strategy ➝ details ➝ loop ➝ final judgment. It emphasizes why HRM is different: it thinks before it answers.
🏗️ Building HRM in Code
Now that we’ve seen why HRM matters and how it works conceptually, let’s translate that into code. The implementation is deliberately modular:
- RMSNorm ⚖️ keeps reasoning stable across long loops.
- RecurrentBlock 🔄 powers both the detail-focused LModule and the strategy-focused HModule.
- InputProjector 🎯 prepares raw embeddings for reasoning.
- OutputProjector 📊 transforms the final high-level state into a usable score.
- HRMModel 🧠 ties it all together into a nested loop that alternates between fine-grained updates and high-level adjustments.
What follows is the full PyTorch definition of the HRM. It’s designed to mirror the reasoning loop we discussed earlier: low-level analysis inside high-level cycles, converging toward a reasoned judgment rather than a one-shot score.
import torch
import torch.nn as nn
class RMSNorm(nn.Module):
"""
Root Mean Square Normalization (RMSNorm).
Unlike LayerNorm, it normalizes across features without subtracting the mean.
This keeps scale consistent while letting the network learn its own balance.
"""
def __init__(self, dim: int, eps: float = 1e-6):
super().__init__()
self.eps = eps
# Each feature gets a learned scaling weight
self.weight = nn.Parameter(torch.ones(dim))
def _norm(self, x):
# Normalize by root mean square (RMS) of activations
return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
def forward(self, x):
output = self._norm(x.float()).type_as(x)
return output * self.weight
class RecurrentBlock(nn.Module):
"""
Recurrent update block used in both HRM modules:
- LModule (low-level thinker)
- HModule (high-level planner)
Internally: GRUCell + RMSNorm
- GRUCell gives temporal memory (keeps track of past reasoning)
- RMSNorm stabilizes hidden state scale
"""
def __init__(self, input_dim, hidden_dim, name="RecurrentBlock"):
super().__init__()
self.name = name
self.rnn_cell = nn.GRUCell(input_dim, hidden_dim)
self.norm = RMSNorm(hidden_dim)
def forward(self, z_prev, input_combined):
"""
One reasoning update step.
- z_prev: previous hidden state (B, hidden_dim)
- input_combined: current inputs (B, input_dim)
Returns: next hidden state (B, hidden_dim)
"""
z_next = self.rnn_cell(input_combined, z_prev)
z_next = self.norm(z_next) # keep scale in check
return z_next
def init_state(self, batch_size, hidden_dim, device):
"""Start with zero-initialized memory."""
return torch.zeros(batch_size, hidden_dim, device=device)
class InputProjector(nn.Module):
"""
Projects the raw input embedding into HRM's reasoning space.
This is the 'x_tilde' vector the problem seen through Stephanie's
cognitive lens before reasoning begins.
"""
def __init__(self, input_dim, hidden_dim):
super().__init__()
self.project = nn.Linear(input_dim, hidden_dim)
self.norm = RMSNorm(hidden_dim)
def forward(self, x):
x_proj = self.project(x)
x_tilde = self.norm(x_proj)
return x_tilde
class OutputProjector(nn.Module):
"""
Projects the final high-level hidden state (zH) to the output space.
In HRM this is usually a scalar quality score (e.g., epistemic value).
"""
def __init__(self, h_dim, output_dim):
super().__init__()
self.project = nn.Linear(h_dim, output_dim)
def forward(self, zH_final):
return self.project(zH_final)
class HRMModel(nn.Module):
"""
Hierarchical Reasoning Model (HRM).
Core idea: nested reasoning loops.
- LModule: low-level thinker (simulates fine-grained steps, like CoT tokens)
- HModule: high-level planner (strategic adjustments after each deep dive)
Processing = N outer cycles × T inner steps.
At each outer cycle:
1. Run T low-level updates conditioned on input + current high-level plan.
2. Update the high-level plan using the result of those low-level steps.
"""
def __init__(self, cfg, logger=None):
super().__init__()
self.logger = logger
# --- Configurable hyperparameters ---
self.input_dim = cfg.get("hrm.input_dim", 2048)
self.h_dim = cfg.get("hrm.h_dim", 256) # high-level hidden size
self.l_dim = cfg.get("hrm.l_dim", 128) # low-level hidden size
self.output_dim = cfg.get("hrm.output_dim", 1)
self.n_cycles = cfg.get("hrm.n_cycles", 4) # number of high-level updates
self.t_steps = cfg.get("hrm.t_steps", 4) # number of low-level steps per cycle
self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# --- HRM components ---
self.input_projector = InputProjector(self.input_dim, self.h_dim)
# Low-level thinker: [x_tilde, zH] → zL
self.l_module = RecurrentBlock(2 * self.h_dim, self.l_dim, name="LModule")
# High-level planner: [zL, zH] → zH
self.h_module = RecurrentBlock(self.l_dim + self.h_dim, self.h_dim, name="HModule")
# Final projection: zH → score
self.output_projector = OutputProjector(self.h_dim, self.output_dim)
def forward(self, x):
"""
Executes full HRM reasoning.
Args:
x: Input tensor of shape (B, input_dim), typically a plan embedding.
Returns:
y_hat: Predicted score (B, output_dim)
intermediate_states: dict with final zL and zH for introspection
"""
batch_size = x.size(0)
# Step 1: Project input into hidden reasoning space
x_tilde = self.input_projector(x) # (B, h_dim)
# Step 2: Initialize low-level and high-level memories
zL = self.l_module.init_state(batch_size, self.l_dim, self.device)
zH = self.h_module.init_state(batch_size, self.h_dim, self.device)
# Step 3: Nested reasoning loop
for n in range(self.n_cycles): # High-level cycles
for t in range(self.t_steps): # Low-level steps
# Low-level thinker refines details given input + current strategy
l_input = torch.cat([x_tilde, zH], dim=-1) # (B, 2*h_dim)
zL = self.l_module(zL, l_input)
# After T steps, update high-level strategy
h_input = torch.cat([zL, zH], dim=-1) # (B, l_dim + h_dim)
zH = self.h_module(zH, h_input)
# Step 4: Final prediction from abstract high-level state
y_hat = self.output_projector(zH) # (B, output_dim)
# Step 5: Return prediction and introspection signals
intermediate_states = {'zL_final': zL, 'zH_final': zH}
return y_hat, intermediate_states
def to(self, device):
"""
Custom `.to()` override to track device consistently.
"""
super().to(device)
self.device = device
return self
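To make the shapes concrete, here’s a minimal smoke test of the model above. The cfg values mirror the constructor defaults, the batch size is arbitrary, and we pin everything to CPU so the hidden-state initialization matches the parameter device:
cfg = {"hrm.input_dim": 2048, "hrm.h_dim": 256, "hrm.l_dim": 128,
       "hrm.output_dim": 1, "hrm.n_cycles": 4, "hrm.t_steps": 4}
model = HRMModel(cfg).to(torch.device("cpu"))  # keep params and states on one device
x = torch.randn(8, 2048)                       # a batch of 8 plan embeddings
y_hat, states = model(x)
print(y_hat.shape)                             # torch.Size([8, 1]) reasoned scores
print(states["zH_final"].shape)                # torch.Size([8, 256]) final strategy state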
💢 What is SICQL?
👉 See Post: Stephanie’s Secret: The Dawn of Reflective AI
The SICQL: Scalable In-Context Q-Learning paper introduced a way to do Q-learning directly over in-context embeddings. Instead of training separate models for Q, V, and π, you get a single transformer with three coordinated heads:
- Q Head 💯: evaluates the quality of a specific (state, action) pair
- V Head 🛡️: provides a robust baseline value for the state (via expectile regression)
- π Head 🧭: proposes next actions, weighted by their advantage (advantage-weighted regression)
This design makes SICQL a compact, interpretable scorer for Stephanie: it uses one shared context embedding z (from H-Net or another encoder), and then branches into three outputs that reinforce each other.
🔍 Why does Stephanie need SICQL?
HRM is our reasoning engine: it loops through strategic and detailed thought before making a judgment. But HRM needs a scoring signal to tell it when a trace, plan, or document is promising.
That’s where SICQL comes in:
- HRM provides reasoning traces → SICQL evaluates them.
- Q and V heads give sharp, uncertainty-aware scoring signals.
- π head lets us guide downstream agents when there are multiple candidate paths.
Put simply:
HRM thinks, SICQL scores. Together they give Stephanie both reasoning depth and evaluative sharpness.
🎨 Simplified SICQL Flow
graph LR A["🎯 Goal Embedding"]:::goal --> C["🧠 TextEncoder<br/>Fusion Layer"]:::encoder B["📄 Document / Triplet Embedding"]:::doc --> C C --> Z["🔀 Shared Context (z)"]:::latent Z --> Q["💯 Q-Head<br/>Action Quality"]:::qhead Z --> V["🛡️ V-Head<br/>State Value"]:::vhead Z --> PI["🧭 π-Head<br/>Policy Guidance"]:::pihead Q --> S["🏆 Score"]:::out V --> Conf["✅ Confidence"]:::out PI --> Dec["⚡ Next Action"]:::out classDef goal fill:#FFD700,stroke:#DAA520,stroke-width:2px; classDef doc fill:#87CEFA,stroke:#4682B4,stroke-width:2px; classDef encoder fill:#9370DB,stroke:#663399,stroke-width:2px; classDef latent fill:#FFECB3,stroke:#FF9800,stroke-width:2px; classDef qhead fill:#C8E6C9,stroke:#2E7D32,stroke-width:2px; classDef vhead fill:#FFCDD2,stroke:#C62828,stroke-width:2px; classDef pihead fill:#BBDEFB,stroke:#1565C0,stroke-width:2px; classDef out fill:#E0E0E0,stroke:#424242,stroke-width:1.5px;
This is the InContextQModel code. Notice how TextEncoder fuses goal + output embeddings into z, then feeds the Q, V, and π heads.
import torch
import torch.nn as nn
# --- Building Blocks ---
class MLP(nn.Module):
"""
Simple 2-layer MLP with ReLU.
Used for the Q-head: predicts scalar Q-values.
"""
def __init__(self, input_dim, output_dim):
super().__init__()
self.model = nn.Sequential(
nn.Linear(input_dim, input_dim),
nn.ReLU(),
nn.Linear(input_dim, output_dim)
)
def forward(self, x):
return self.model(x)
class ExpectileHead(nn.Module):
"""
V-head: predicts the baseline state value V(s).
Trained with expectile regression (robust against noisy labels).
"""
def __init__(self, input_dim):
super().__init__()
self.net = nn.Sequential(
nn.Linear(input_dim, input_dim),
nn.ReLU(),
nn.Linear(input_dim, 1)
)
def forward(self, x):
return self.net(x)
class PolicyHead(nn.Module):
"""
π-head: predicts logits over actions (policy distribution).
Trained with Advantage-Weighted Regression (AWR).
"""
def __init__(self, input_dim, action_dim=1):
super().__init__()
self.linear = nn.Linear(input_dim, action_dim)
def forward(self, x):
# Typically passed through softmax during training
return self.linear(x)
class TextEncoder(nn.Module):
"""
Fusion encoder: combines goal embedding + output embedding
into a shared context vector z.
This is the "in-context" piece of SICQL.
"""
def __init__(self, dim=4096, hdim=4096):
super().__init__()
self.encoder = nn.Sequential(
nn.Linear(dim * 2, hdim),
nn.ReLU(),
nn.Linear(hdim, dim),
)
def forward(self, goal_emb, output_emb):
# Concatenate goal + candidate output embeddings
concat = torch.cat([goal_emb, output_emb], dim=1)
# Transform into shared latent space z
z = self.encoder(concat)
return z
# --- Full In-Context Q Model (SICQL) ---
class InContextQModel(nn.Module):
"""
Scalable In-Context Q-Learning (SICQL) model.
Takes (goal embedding, output embedding) → shared context z.
Then predicts:
- Q(s,a): action value
- V(s): baseline state value
- π(a|s): policy distribution over actions
"""
def __init__(self, dim, hdim, action_dim=1, device="cpu"):
super().__init__()
print(f"[SICQL] Initializing with dim={dim}, hdim={hdim}, action_dim={action_dim}, device={device}")
self.device = device
# Shared encoder: fuses goal + output into z
self.encoder = TextEncoder(dim, hdim).to(device)
# Three heads operating on z
self.q_head = MLP(dim, 1).to(device) # Q-value predictor
self.v_head = ExpectileHead(dim).to(device) # V-value predictor
self.pi_head = PolicyHead(dim, action_dim).to(device) # Policy logits
def forward(self, goal_emb, output_emb):
"""
Forward pass.
Args:
goal_emb: embedding of the current goal/query
output_emb: embedding of the candidate document/triplet
Returns:
dict with q_value, state_value, action_probabilities
"""
# Ensure inputs are on the correct device
goal_emb = goal_emb.to(self.device)
output_emb = output_emb.to(self.device)
# Encode (goal, output) into shared context vector z
z = self.encoder(goal_emb, output_emb)
# Predict Q(s,a), V(s), and π(a|s) from z
q_value = self.q_head(z) # scalar Q-value
state_value = self.v_head(z) # baseline V(s)
action_probabilities = self.pi_head(z) # logits for actions
return {
"q_value": q_value,
"state_value": state_value,
"action_probabilities": action_probabilities,
}
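Here’s a minimal sketch of running the model end to end (the 1024-dim embeddings and batch size are illustrative, not prescribed by the paper):
model = InContextQModel(dim=1024, hdim=1024, action_dim=1, device="cpu")
goal_emb = torch.randn(4, 1024)            # batch of 4 goal embeddings
output_emb = torch.randn(4, 1024)          # matching candidate embeddings
out = model(goal_emb, output_emb)
print(out["q_value"].shape)                # torch.Size([4, 1]) Q(s,a)
print(out["state_value"].shape)            # torch.Size([4, 1]) V(s)
print(out["action_probabilities"].shape)   # torch.Size([4, 1]) policy logits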
🔗 Introducing H-Net: Dynamic Chunking for Smarter Embeddings
👉 See Post: The Shape of Thought: Exploring Embedding Strategies with Ollama, HF, and H-Net
The paper H-Net: Dynamic Chunking for End-to-End Hierarchical Sequence Modeling introduced a new approach to tokenization.
Most embedding systems split text into fixed-size chunks (e.g. every 512 tokens). It’s simple, but also naive: it often slices right through sentences or merges unrelated ideas, polluting the embeddings with noise.
H-Net takes a different approach: it learns where to cut. Instead of fixed intervals, it predicts semantic boundaries and only chunks text where it makes sense.
👉 Result: better chunks → better embeddings → better decisions.
This is crucial for Stephanie, because everything downstream (SICQL, HRM, MRQ, EBT, SVM) relies on embeddings. If the chunks are messy, the reasoning layer inherits that mess.
🧭 Why We Use H-Net
- ✅ Preserves meaning at the chunk level
- ✅ Aligns embeddings with natural semantic units
- ✅ Improves stability for scorers and evaluators
In short: it makes the AI’s “subconscious” cleaner and sharper.
🧬 How We Built H-Net into Stephanie
One of the biggest advantages: we didn’t have to retrain an embedding model. H-Net is a middleware layer that plugs into any backend (Ollama, Hugging Face, MXBAI). It decides where to embed, not how.
Here’s the flow:
flowchart TD A[📄 Input Text] --> B[🧵 Byte-Level Tokenizer] B --> C[🍕 Boundary Predictor<br/>BiLSTM over bytes] C --> D[✂️ StephanieHNetChunker] D --> C1[Chunk 1] D --> C2[Chunk 2] D --> C3[Chunk N] subgraph Chunk Embedding Loop C1 --> E1["Embedder (e.g. MXBAI)"] C2 --> E2[Embedder] C3 --> E3[Embedder] end E1 --> F[📊 Pooling Strategy] E2 --> F E3 --> F F --> G[🧩 Final Embedding Vector] G --> H[🔗 Protocol Layer] H --> I["⚖️ Scorers (MRQ / SICQL / HRM)"] I --> J[🤖 Self-Improvement Loop]
1. 🧵 Tokenization
We start simple: text is encoded as raw UTF-8 bytes, one token per byte. No dependence on LLM vocabularies, no token drift if backends change.
class ByteLevelTokenizer:
def tokenize(self, text: str) -> list[int]:
return list(text.encode("utf-8"))
def decode(self, tokens: list[int]) -> str:
return bytes(tokens).decode("utf-8", errors="replace")
Lightweight, language-agnostic, and future-proof.
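A quick round trip shows why this is safe even for non-ASCII text:
tok = ByteLevelTokenizer()
ids = tok.tokenize("héllo ✂️")        # one int per UTF-8 byte, so len(ids) > len(text)
assert tok.decode(ids) == "héllo ✂️"  # lossless reconstruction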
2. 🍕 Boundary Prediction
A small BiLSTM model learns where semantic breaks are likely. It looks at bytes left and right, then assigns a boundary score at each position.
import torch
import torch.nn as nn

class ChunkBoundaryPredictor(nn.Module):
def __init__(self, vocab_size=256, hidden_dim=128):
super().__init__()
self.embedding = nn.Embedding(vocab_size, hidden_dim)
self.lstm = nn.LSTM(hidden_dim, hidden_dim, bidirectional=True, batch_first=False)
self.boundary_scorer = nn.Linear(hidden_dim * 2, 1)
def forward(self, tokens: list[int]):
# tokens → embeddings → BiLSTM → boundary scores (0..1)
x = self.embedding(torch.tensor(tokens).long())
x, _ = self.lstm(x.unsqueeze(1))
scores = self.boundary_scorer(x.squeeze(1))
return scores.sigmoid().flatten()
3. ✂️ Chunker
The StephanieHNetChunker runs boundary prediction and cuts text wherever the score passes a threshold (e.g. 0.7).
class StephanieHNetChunker:
def __init__(self, predictor, threshold=0.7):
self.tokenizer = ByteLevelTokenizer()
self.predictor = predictor
self.threshold = threshold
def chunk(self, text: str) -> list[str]:
tokens = self.tokenizer.tokenize(text)
scores = self.predictor(tokens)
boundaries = (scores > self.threshold).nonzero(as_tuple=True)[0].tolist()
chunks, prev = [], 0
for b in boundaries:
chunks.append(self.tokenizer.decode(tokens[prev:b+1]))
prev = b+1
if prev < len(tokens):
chunks.append(self.tokenizer.decode(tokens[prev:]))
return chunks
Now our text is split into meaningful, self-contained chunks.
4. 🧪 Pooling + Embedder
We embed each chunk with an existing backend (MXBAI, Hugging Face, Ollama) and then pool them into one final vector.
import numpy as np

class StephanieHNetEmbedder:
def __init__(self, embedder):
self.chunker = StephanieHNetChunker(predictor=ChunkBoundaryPredictor())
self.embedder = embedder
def embed(self, text: str):
chunks = self.chunker.chunk(text)
chunk_embeddings = self.embedder.batch_embed(chunks)
return np.mean(chunk_embeddings, axis=0).tolist()
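Putting it together looks like this. MXBAIEmbedder is a hypothetical stand-in for any backend that exposes the batch_embed(list[str]) method assumed by the class above:
backend = MXBAIEmbedder()                       # hypothetical backend wrapper
hnet = StephanieHNetEmbedder(embedder=backend)
vector = hnet.embed("Long document text ...")   # chunk → embed each chunk → mean-pool
print(len(vector))                              # dimensionality of the backend's embeddings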
🎯 Better understanding
H-Net doesn’t compete with your embedding backend; it augments it. It makes embeddings cleaner, more aligned, and more useful for everything downstream:
- HRM gets sharper reasoning signals
- SICQL gets more stable Q/V/π values
- Memory stores stay semantically consistent
Think of it as giving Stephanie semantic scissors ✂️: cutting text where meaning breaks, not where token limits end.
For the deep dive into boundary training and advanced pooling, see the original H-Net post.
flowchart LR subgraph Fixed["⚡ Fixed Chunking"] A1["📄 Original Text"]:::txt --> B1["🚧 Chunk 1: Cuts mid-sentence"]:::bad A1 --> B2["🚧 Chunk 2: Merges unrelated ideas"]:::bad A1 --> B3["🚧 Chunk 3: Fragmented leftovers"]:::bad end subgraph HNet["🧬 H-Net Chunking"] A2["📄 Original Text"]:::txt --> C1["✅ Chunk 1: Full idea"]:::good A2 --> C2["✅ Chunk 2: Self-contained thought"]:::good A2 --> C3["✅ Chunk 3: Natural boundary"]:::good end classDef txt fill:#fdf6e3,stroke:#657b83,stroke-width:2px; classDef bad fill:#FFCDD2,stroke:#C62828,stroke-width:2px; classDef good fill:#C8E6C9,stroke:#2E7D32,stroke-width:2px;
🧩 The Scorable Abstraction: A Measured View of Everything
One of the quiet but powerful design choices in Stephanie is the idea of the Scorable: a universal wrapper that turns almost anything into something we can measure.
❓ Why We Needed It
In a self-improving system, you constantly ask:
“How relevant is this to my goal?” “How clear is this explanation?” “How ethical is this response?” “Which option is better?”
These questions apply to everything from a document to a paragraph, a theorem, a plan trace, or even a symbolic rule. To make progress, we needed a single interface that treats them all the same way.
📦 What Is a Scorable?
A Scorable is just a lightweight abstraction:
Scorable(
id=1234,
text="This is the content I want scored.",
target_type="document" # or "cartridge", "trace", "response", etc.
)
Every scorable has three essentials:
- text – what we want evaluated
- id – its unique identity
- target_type – what kind of thing it is
That’s it. With this, anything becomes measurable.
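If you want that wrapper as real code, a minimal sketch might look like this (the field names follow the example above; everything else is illustrative):
from dataclasses import dataclass

@dataclass
class Scorable:
    """Universal wrapper: anything with text, an identity, and a type can be scored."""
    id: int
    text: str
    target_type: str  # "document", "cartridge", "trace", "response", ...

item = Scorable(id=1234, text="This is the content I want scored.", target_type="document")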
🧠 Why This Matters
The Scorable abstraction is the bridge between raw data and AI evaluation:
- ✨ Embedding: every text → embedding
- 📊 Scoring: embeddings compared against goal embeddings
- 🤖 Training: LLM feedback → new scorers trained on Scorable pairs
- 🔄 Tuning: scorables can be re-scored endlessly as the system evolves
Because all evaluators and trainers speak the language of Scorable, we get plug-and-play extensibility. Add a new data type? Wrap it as a Scorable.
🧬 Multiple Views
The real power: in Stephanie, everything is scorable at multiple levels.
- A whole document
- A section or paragraph
- A triplet (subject, predicate, object)
- A reasoning trace from HRM
- Even a symbolic rule
Each can be wrapped as a Scorable and evaluated in parallel. This gives us a multi-view measurement system: fine-grained enough to judge details, broad enough to see the whole picture.
📘 Summary: A Measured View on Everything
The Scorable isn’t just a coding trick; it’s a philosophy:
If it can be scored, it can be improved. And if it can be improved, it belongs in the self-tuning loop.
By reducing every evaluable element to this shared abstraction, we create a system where all knowledge (documents, rules, traces, or thoughts) becomes measurable, comparable, and improvable.
🗄️ Scorables in the Database: Everything Maps to a Scorable
It’s one thing to say “everything is scorable”. It’s another to show it in the schema. In Stephanie, this isn’t hand-waving; it’s concretely modeled in the database.
Every Scorable is mapped to its embedding(s), and those embeddings can come from any backend (H-Net, Hugging Face, Ollama…). This gives us a unified, polymorphic interface: one table to rule them all.
🔑 Scorable Embeddings
from datetime import datetime
from sqlalchemy import Column, DateTime, Integer, String

# Base is the project's shared declarative base
class ScorableEmbeddingORM(Base):
__tablename__ = "scorable_embeddings"
id = Column(Integer, primary_key=True, autoincrement=True)
# Polymorphic link back to the object
scorable_id = Column(String, nullable=False, index=True) # e.g., document_id, theorem_id
scorable_type = Column(String, nullable=False, index=True) # e.g., "document", "hypothesis", "cartridge"
# Embedding backend reference
embedding_id = Column(Integer, nullable=False) # id inside embeddings table
embedding_type = Column(String, nullable=False) # "hnet", "hf", "ollama"
created_at = Column(DateTime, default=datetime.now, nullable=False)
This table is the bridge:
- scorable_id + scorable_type → points to what is being embedded
- embedding_id + embedding_type → points to how it was embedded
Together, they let us trace any object → embedding → scorer output.
🧬 The Embedding Store
All raw vectors live in the embeddings table.
from pgvector.sqlalchemy import Vector
from sqlalchemy import Text, func

class EmbeddingORM(Base):
__tablename__ = "embeddings"
id = Column(Integer, primary_key=True)
text = Column(Text, nullable=True)
embedding = Column(Vector(1024), nullable=True)
created_at = Column(DateTime, default=func.now())
text_hash = Column(Text, nullable=True)
This is backend-agnostic: it just stores vectors + metadata.
📄 Example: A Document as a Scorable
class DocumentORM(Base):
__tablename__ = "documents"
id = Column(Integer, primary_key=True)
title = Column(String, nullable=False)
source = Column(String, nullable=False)
text = Column(Text, nullable=True)
# (other metadata...)
A document is inserted here, wrapped as a Scorable, and linked to one or more embeddings through ScorableEmbeddingORM. The same path works for hypotheses, cartridges, plan traces, or anything else.
🔗 Putting It Together
Here’s the flow of how anything becomes scorable:
flowchart LR subgraph Raw["🗂️ Raw Objects"] D1["📄 DocumentORM"] H1["🔮 HypothesisORM"] C1["🧩 CartridgeORM"] end subgraph Scorable["🧩 Scorable Layer"] S1["ScorableEmbeddingORM"] end subgraph Embeddings["🔗 Embedding Store"] E1["EmbeddingORM (vectors)"] end D1 --> S1 H1 --> S1 C1 --> S1 S1 --> E1
📘 Information everywhere
- Polymorphic mapping means everything can be embedded and scored the same way.
- Multiple embeddings per scorable let us compare H-Net vs Hugging Face vs Ollama on the same item.
- Auditability: every score has a trace back to what was scored and which embedding backend was used.
👉 This is how we enforce the principle:
If it can be scored, it can be improved.
🧭 Classifying Knowledge with Domain Intelligence
The Domain2Vec: Vectorizing Datasets to Find the Optimal Data Mixture without Training paper showed how powerful it can be to represent entire datasets as vectors, enabling smarter selection and mixture of data sources.
We took inspiration from this idea and applied it directly to our own problem: when Stephanie ingests papers, blog posts, or reports, she doesn’t just slurp them into memory. She needs to know what each piece is really about.
Is this section about symbolic reasoning? Is that page describing planning strategies? Or is the whole paper focused on alignment? Without that context, downstream agents can’t specialize their reasoning or retrieve the right knowledge when it matters.
That’s why we built the ScorableClassifier: a lightweight adaptation of the Domain2Vec philosophy, tailored for documents and sections. Instead of vectorizing whole datasets, we build centroid vectors for semantic domains and classify each document against them.
🧠 What the ScorableClassifier Does
The ScorableClassifier is a lightweight but powerful module that attaches semantic domain labels to every document (and even to individual sections). It works like this:
- Define a set of seed phrases for each domain in a YAML file.
- Embed those phrases to create a semantic centroid for each domain.
- Embed the document (or section) and compare it to each centroid using cosine similarity.
- Assign domains above a configurable threshold (e.g., 0.6) to the document.
This gives us soft, embedding-grounded classification, not brittle keyword matching.
🌱 Seeds: The Blueprint of a Domain
Each domain is defined by a small set of phrases that “sound like it”. Here’s an excerpt:
domains:
symbolic:
description: "Symbolic reasoning, planning, and logic-guided AI."
seeds:
- "Symbolic planning using language agents."
- "Neuro-symbolic reasoning for AI alignment."
- "Formal rule extraction from natural language."
alignment:
description: "Scalable oversight, alignment, and control of AI behavior."
seeds:
- "Preventing goal misgeneralization in agents."
- "Reward modeling for safety and usefulness."
Even if the paper doesn’t literally say “goal misgeneralization”, embeddings let us recognize it as an alignment paper.
# stephanie/analysis/scorable_classifier.py
import yaml
from sklearn.metrics.pairwise import cosine_similarity
class ScorableClassifier:
def __init__(self, memory, config_path="config/domain/seeds.yaml"):
self.memory = memory
with open(config_path, "r") as f:
self.domains = yaml.safe_load(f).get("domains", {})
# Precompute embeddings for all domain seeds
self.seed_embeddings = {
d: [self.memory.embedding.get_or_create(s)
for s in details.get("seeds", [])]
for d, details in self.domains.items()
}
def classify(self, text: str, top_k: int = 3):
"""Return top-k matching domains for a text snippet."""
text_emb = self.memory.embedding.get_or_create(text)
scores = {
d: max(cosine_similarity([text_emb], [s])[0][0] for s in seeds)
for d, seeds in self.seed_embeddings.items()
}
return sorted(scores.items(), key=lambda x: x[1], reverse=True)[:top_k]
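Usage is a one-liner once you have a memory object exposing embedding.get_or_create, as in the snippet above (the scores shown are illustrative):
classifier = ScorableClassifier(memory)
top = classifier.classify("We propose a neuro-symbolic planner for language agents.")
# e.g. [("symbolic", 0.83), ("alignment", 0.41), ("planning", 0.38)]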
📚 Granular information
This classifier gives us:
- ✅ Modularity – just edit the YAML file to add new domains.
- ✅ Generalization – works even when exact terms don’t appear.
- ✅ Granularity – tags both entire papers and individual sections.
So when a reasoning agent asks, “Show me all symbolic planning methods in the literature,” Stephanie can instantly route to the right slices of knowledge.
Making our memory domain-aware ensures that every document is understood in context. Later, when we run case-based reasoning or Monte Carlo search, these domains become filters, guides, and signals for picking the right knowledge at the right time.
flowchart TD %% Document intake A[📄 Document / Section] --> B[🧩 Wrap as Scorable] B --> C[🔢 Embedding Vector] B --> D[🏷️ Domain Labels] %% Embedding track C --> E[📊 Similarity Search] E --> F[🔍 Find Related Knowledge] %% Domain track D --> G[🗂️ Categorization by Domain] G --> H[🎯 Route to Specialized Agents] %% Joint understanding E --> I[🧠 Richer Understanding of Content] H --> I B --> I %% Styles style A fill:#bbf,stroke:#333,stroke-width:2px style B fill:#f9f,stroke:#333,stroke-width:2px style C fill:#ffd,stroke:#333,stroke-width:1px style D fill:#ffd,stroke:#333,stroke-width:1px style I fill:#dfd,stroke:#333,stroke-width:2px
Here is an example document with its domains.
Figure: a document in SIS, showing the domains and the related scores for the document.
🌳 LATS: From Scorables to Trees of Thought
Earlier we introduced the idea of the Scorable: everything in Stephanie, from a document to a reasoning step, can be wrapped and measured.
That abstraction becomes powerful once we realize:
- Each reasoning step can be stored as a Scorable.
- Each dimension of quality (clarity, novelty, alignment, etc.) can be applied to it.
- And reasoning itself doesn’t have to be a single line; it can branch, evolve, and converge.
This is where Language Agent Tree Search (LATS) comes in.
The paper LATS: Language Agent Tree Search Unifies Reasoning reimagines reasoning as a tree search problem:
- Instead of one chain-of-thought, you generate multiple branches.
- Each branch is evaluated and scored.
- The system doesn’t just produce an answer, it explores a space of possibilities.
This matches perfectly with our scoring framework: every node in the tree is a Scorable.
flowchart TD A[🎯 Goal] --> B[🌳 Reasoning Tree] B --> C1[Step 1a<br/>Scorable] B --> C2[Step 1b<br/>Scorable] C1 --> D1[Step 2a<br/>Scorable] C2 --> D2[Step 2b<br/>Scorable] D1 --> E[📊 Multi-dimensional Scoring] D2 --> E
🎲 Why Monte Carlo Tree Search (MCTS)?
Tree search is powerful, but it needs a way to balance:
- Exploration: trying new paths.
- Exploitation: focusing on promising paths.
That’s exactly what MCTS does. It simulates possible futures, scores them, and progressively biases toward the most rewarding paths.
For Stephanie, this means:
- Every reasoning step becomes a node (a Scorable).
- Each node is scored across multiple dimensions.
- The tree grows in directions that show the most promise.
Over time, Stephanie doesn’t just answer a question; she learns which reasoning strategies work best.
🧩 How It All Fits
- Scorables give us the unit of evaluation.
- LATS gives us the structure of reasoning.
- MCTS gives us the search and selection algorithm.
Together, these form the reasoning substrate that CBR will later build on because once we can generate and evaluate reasoning traces, we can start storing them as cases, retrieve them, and improve them over time.
🌳 The MCTS Reasoning Agent
At the heart of our reasoning engine is the MCTSReasoningAgent. It’s the glue that ties together:
- Monte Carlo Tree Search (MCTS) for exploring reasoning paths,
- LATS-style signatures for step generation and value estimation,
- Scorables + multidimensional evaluation for grounding every path in measurable quality.
This agent is a key piece of the system: it doesn’t just generate outputs, it explores alternatives, scores them, and learns which reasoning paths are worth following.
The full class is several hundred lines long too much to embed here. But to give you a feel for its structure, here’s a simplified, annotated version that captures the essence of how it works:
class MCTSReasoningAgent(BaseAgent):
def __init__(self, cfg, memory, logger):
super().__init__(cfg, memory, logger)
self.max_depth = cfg.get("max_depth", 4)
self.branching_factor = cfg.get("branching_factor", 2)
self.num_simulations = cfg.get("num_simulations", 20)
self.ucb_weight = cfg.get("ucb_weight", 1.41)
self.dimensions = ["alignment", "clarity", "novelty", "relevance"]
async def run(self, context: dict) -> dict:
# 1. Start with the goal as the root of the reasoning tree
root = self._create_node(state=context["goal"]["goal_text"], trace=[])
# 2. Run MCTS simulations
for _ in range(self.num_simulations):
node = self._select(root) # pick a promising node
node = await self._expand(node) # generate next steps
reward = self._evaluate(node, context) # score the result
self._backpropagate(node, reward) # update tree statistics
# 3. Return top-ranked reasoning traces as Scorables
best_nodes = self._collect_top_k(root, k=3)
return {"results": [self._to_scorable(n) for n in best_nodes]}
# --- Key MCTS steps (simplified) ---
def _select(self, node): ...
async def _expand(self, node): ...
def _evaluate(self, node, context): ...
def _backpropagate(self, node, reward): ...
📝 What’s Happening Here
- Root Node: starts with the goal/problem as the root of the reasoning tree.
- Select: uses the UCT (Upper Confidence Bound applied to Trees) formula to balance exploration vs. exploitation (see the sketch after this list).
- Expand: generates candidate next reasoning steps via DSPy/LATS.
- Evaluate: wraps each step in a Scorable, then scores it across multiple dimensions (clarity, novelty, etc.).
- Backpropagate: pushes scores back up the tree, so the search progressively favors better branches.
- Emit Results: returns the top-K reasoning traces as Scorable objects, ready for downstream CBR.
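To ground the Select step, here is a minimal sketch of the UCT rule it relies on. The function and argument names are illustrative; the real _select tracks the same visit and reward statistics on its tree nodes, and c matches the agent’s ucb_weight default of 1.41:
import math

def uct_score(child_visits, child_reward_sum, parent_visits, c=1.41):
    """UCT = average reward (exploitation) + confidence bonus (exploration)."""
    if child_visits == 0:
        return float("inf")  # always try unvisited children first
    exploit = child_reward_sum / child_visits
    explore = c * math.sqrt(math.log(parent_visits) / child_visits)
    return exploit + explore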
👉 If you want to dig deeper, the full implementation (with caching, LM budgets, value estimator hooks, and logging) is here: MCTSReasoningAgent on GitHub
flowchart TD A[🎯 Goal] --> B[🌳 Root Node] B --> C[🔍 Select<br/>Choose promising node using UCT] C --> D["🌱 Expand<br/>Generate next reasoning steps (DSPy/LATS)"] D --> E[📏 Evaluate<br/>Score with multidimensional Scorables] E --> F[🔁 Backpropagate<br/>Update parent rewards & visits] F --> C E --> G[🏆 Best Paths<br/>Top-K reasoning traces as Scorables] G --> H[📚 Case-Based Reasoning<br/>Store & reuse best examples] style A fill:#f9f,stroke:#333,stroke-width:2px style B fill:#bbf,stroke:#333,stroke-width:1px style C fill:#ffd,stroke:#333,stroke-width:1px style D fill:#ddf,stroke:#333,stroke-width:1px style E fill:#cfc,stroke:#333,stroke-width:1px style F fill:#fcf,stroke:#333,stroke-width:1px style G fill:#bff,stroke:#333,stroke-width:2px style H fill:#fdd,stroke:#333,stroke-width:2px
🧩 Memento: Casebook Reasoning
🚪 Enter the ModularMementoAgent.
Up until now, Stephanie has been reasoning, scoring, and organizing knowledge. But she hasn’t truly learned from experience in a structured way.
The ModularMementoAgent changes that.
This is our first agent built around case-based reasoning (CBR). Instead of treating each run as an isolated event, it remembers past cases, compares them with new ones, and gradually refines its approach.
Think of it as giving Stephanie her first long-term memory system for reasoning, where every trace, hypothesis, and decision is turned into a “case” that can be retrieved, reused, revised, and retained.
To see how this fits together, here’s the map we’ll follow for the rest of this section:
flowchart TD A[🎯 Goal + Context] --> B[🤖 ModularMementoAgent] B --> C["MCTSReasoningAgent<br/>Base Run (Generates Scorables)"] C --> D[CBR Middleware] subgraph D[🧩 CBR Middleware] D1[📂 ContextNamespacer] D2[📖 CasebookScopeManager] D3[🔍 CaseSelector] D4["📊 Rank & Analyze<br/>(ScorableRanker + MARS)"] D5[✅ QualityAssessor] D6[🏆 ChampionPromoter] D7[📝 GoalStateTracker] D8[⚖️ ABValidator] D9[🧠 MicroLearner] D10[📦 RetentionPolicy] end D1 --> D2 --> D3 --> D4 --> D5 --> D6 --> D7 --> D8 --> D9 --> D10 D10 --> E["(💾 Casebook Storage)"] E -->|Retrieve Past Cases| D3 D4 -->|Ranked Outputs| B
At the code level, the agent looks like this:
class ModularMementoAgent(MCTSReasoningAgent):
def __init__(self, cfg, memory, logger):
super().__init__(cfg, memory, logger)
# Wire together all CBR middleware components
ns = DefaultContextNamespacer()
scope = DefaultCasebookScopeManager(cfg, memory, logger)
selector = DefaultCaseSelector(cfg, memory, logger)
ranker = DefaultRankAndAnalyze(cfg, memory, logger,
ranker=ScorableRanker(cfg, memory, logger),
mars=MARSCalculator(cfg, memory, logger) if cfg.get(INCLUDE_MARS, True) else None)
retention = DefaultRetentionPolicy(cfg, memory, logger, casebook_scope_mgr=scope)
assessor = DefaultQualityAssessor(cfg, memory, logger)
promoter = DefaultChampionPromoter(cfg, memory, logger)
tracker = DefaultGoalStateTracker(cfg, memory, logger)
ab = DefaultABValidator(cfg, memory, logger, ns=ns, assessor=assessor)
micro = DefaultMicroLearner(cfg, memory, logger)
# Assemble into a single middleware pipeline
self._cbr = CBRMiddleware(cfg, memory, logger,
ns, scope, selector, ranker,
retention, assessor, promoter,
tracker, ab, micro)
async def run(self, context: dict) -> dict:
# Connect middleware with the base reasoning agent (MCTS)
self._cbr.ranker.scoring = self.scoring
parent_run = super(ModularMementoAgent, self).run
async def base_run(ctx):
return await parent_run(ctx)
context[AGENT_NAME] = self.name
return await self._cbr.run(context, base_run, self.output_key)
📌 Key idea: this class is just the wiring harness. It connects our familiar MCTSReasoningAgent to the new CBR middleware, so every run is now guided by past cases and reinforced by quality checks.
👉 Next, we’ll follow the diagram step by step, starting with the ContextNamespacer, the small but vital piece that keeps every case organized in its proper scope.
📂 ContextNamespacer: Keeping Cases in Their Lanes
When you start remembering everything, chaos is a real risk. Without structure, cases from one goal could spill over into another, or context from one pipeline run could contaminate another.
The ContextNamespacer solves this by acting as a namespace manager for reasoning traces. It ensures that every case, whether it’s a hypothesis, a document, or a scored trace, is tagged with the right scope and separated cleanly.
Think of it as the filing cabinet labels for Stephanie’s case memory:
- 🗂️ Goal Namespace ties each case to the specific goal it was created under.
- 🧩 Run Namespace distinguishes cases produced in different reasoning runs.
- 🔑 Scoped IDs creates consistent keys so scorables, embeddings, and results can be linked back unambiguously.
Without namespacing:
- A “clarity” score from one goal might get mixed with an “alignment” score from another.
- A reasoning trace about vision models might pollute the retrieval set when working on symbolic planning.
With namespacing:
- Every case is contained in the right scope, making retrieval and reuse safe.
- The CBR middleware can operate across thousands of runs without collisions.
🧭 Flow in the Middleware
flowchart LR A[🎯 Goal + Context] --> B[📂 ContextNamespacer] B --> C[📖 CasebookScopeManager] style B fill:#e6f7ff,stroke:#0077b6,stroke-width:2px style A fill:#f9f,stroke:#333,stroke-width:2px style C fill:#bbf,stroke:#333,stroke-width:2px
The ContextNamespacer is the very first step in the pipeline. Before we even think about retrieving or ranking, it stamps each case with the right identifiers so all downstream modules know where it belongs.
⌨️ In Code Terms
The implementation is simple but foundational:
from typing import Tuple

class DefaultContextNamespacer:
def make_scoped_id(self, goal_id: str, case_id: str) -> str:
"""Attach goal context to a case id."""
return f"{goal_id}::{case_id}"
def extract_scope(self, scoped_id: str) -> Tuple[str, str]:
"""Split back into (goal_id, case_id)."""
return tuple(scoped_id.split("::", 1))
Every Scorable that passes through the system can now be unambiguously tied back to the goal and run that created it.
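A quick round trip (the IDs are made up for illustration):
ns = DefaultContextNamespacer()
scoped = ns.make_scoped_id("goal_42", "case_7")   # "goal_42::case_7"
goal_id, case_id = ns.extract_scope(scoped)       # ("goal_42", "case_7")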
👉 Next up: the CasebookScopeManager, which uses these namespaces to decide which casebooks to search and update when a new reasoning run begins.
📖 CasebookScopeManager: Defining the Boundaries of Memory
If the ContextNamespacer gives each case a label, the CasebookScopeManager decides which shelf in the library it belongs to.
The system doesn’t just have one big bucket of cases. Instead, cases are grouped into casebooks: structured collections of past reasoning tied to goals, domains, or experiments.
The CasebookScopeManager answers questions like:
- 📚 Which casebook should this new case be added to?
- 🔍 When I want to retrieve past knowledge, which casebooks should I search?
- 🧭 How do we keep the scope small enough to be relevant, but broad enough to be useful?
🚷 Boundaries to our knowledge
Without scope management:
- Retrieval might trawl across unrelated casebooks, pulling in irrelevant or noisy cases.
- Updates could end up in the wrong memory, confusing future reasoning.
With a CasebookScopeManager:
- Every run has a clear boundary of which casebooks are in play.
- Memory remains organized, relevant, and efficient.
🧭 Flow in the Middleware
flowchart LR A[📂 ContextNamespacer] --> B[📖 CasebookScopeManager] B --> C[🔍 CaseSelector] style B fill:#fef9e7,stroke:#b8860b,stroke-width:2px style A fill:#e6f7ff,stroke:#0077b6,stroke-width:2px style C fill:#bbf,stroke:#333,stroke-width:2px
The CasebookScopeManager acts like the traffic controller between raw namespaces and the retrieval engine. It says:
“Given this goal and context, here are the casebooks you’re allowed to look at and update.”
💻 In Code Terms
The real implementation is more involved, but the essence looks like this:
class DefaultCasebookScopeManager:
def __init__(self, cfg, memory, logger):
self.cfg, self.memory, self.logger = cfg, memory, logger
def active_casebooks(self, goal_id: str) -> list:
"""Return the casebooks relevant for this goal."""
return self.memory.casebooks.find_for_goal(goal_id)
def ensure_casebook(self, goal_id: str, description="") -> str:
"""Guarantee a casebook exists for this goal, return its id."""
return self.memory.casebooks.ensure(goal_id, description)
🔑 Key Role
The CasebookScopeManager ensures:
- 🏷️ New cases are stored in the right casebook.
- 📖 Retrieval queries are focused on relevant casebooks.
- 🔄 Updates don’t bleed across unrelated goals.
👉 Next, we’ll step into the CaseSelector the module that decides which cases to pull back out once the scope has been defined.
🔍 CaseSelector: Choosing Which Memories to Reuse
Once the CasebookScopeManager has told us where to look, the CaseSelector decides what to pull back out.
Think of it as the retrieval engine of the CBR pipeline. Its job is to balance:
- 🏆 High-quality cases (champions from past reasoning).
- 🕒 Recent successes (things that worked well last time).
- 🎲 Novel or diverse candidates (to avoid overfitting to the same cases).
- 🎯 Exploration (injecting fresh possibilities).
🏅 Best of the best
Without selection pressure, Stephanie could drown in irrelevant or redundant past cases. The CaseSelector makes sure the system always has a curated set of candidates for reuse: a balance between stability and exploration.
⚖️ Selection Strategy
Here’s the rough strategy used by the default implementation:
- Champion-first: if a case has proven itself as the champion for a goal, reuse it first.
- Recent-success: bring in the most recent accepted cases; they reflect what’s currently working.
- Diverse-novel: add candidates that are different from what we’ve already picked.
- Exploration: randomly toss in a few extra cases; sometimes the best ideas come from unexpected sources.
🧭 Flow in the Middleware
flowchart LR A[📖 CasebookScopeManager] --> B[🔍 CaseSelector] --> C[📊 Rank & Analyze] style B fill:#f9f9f9,stroke:#333,stroke-width:2px
The CaseSelector doesn’t make the final decision on what’s best; it just assembles a shortlist of reuse candidates to be scored, ranked, and analyzed in the next step.
🧑 In Code Terms
The actual implementation is bigger, but the heart of it looks like this:
import random

class DefaultCaseSelector:
def __init__(self, cfg, memory, logger):
self.cfg, self.memory, self.logger = cfg, memory, logger
def build_reuse_candidates(self, casebook_id, goal_id, cases, budget=10):
candidates = []
# 1. Champion-first
champion = self.memory.casebooks.get_champion(casebook_id, goal_id)
if champion:
candidates.append(champion)
# 2. Recent-success
recent = self.memory.casebooks.get_recent_successes(casebook_id, goal_id, limit=3)
candidates.extend(recent)
# 3. Diverse-novel
pool = self.memory.casebooks.get_novel_pool(casebook_id, goal_id, exclude=candidates)
candidates.extend(pool[:2])
# 4. Exploration (random injection)
if random.random() < 0.2:
            candidates.extend(random.sample(cases, min(2, len(cases))))
return candidates[:budget]
🔑 Key Role
The CaseSelector ensures that every reasoning run has a diverse but relevant set of prior cases to draw inspiration from.
It doesn’t decide the winner; it feeds candidates into the ranking system that comes next.
👉 Next, we’ll cover the Rank & Analyze stage, where those candidates are actually evaluated and compared.
📊 Rank & Analyze: Enhanced Scoring for Smarter Case Reuse
Once cases are retrieved, Stephanie can’t just accept them at face value. She needs to evaluate them across multiple dimensions, using multiple scorers, and with checks for consistency.
That’s the role of Rank & Analyze, powered by:
- ScorableRanker – computes a weighted, multi-signal score for each candidate.
- MARS (Model Agreement & Reasoning Signal) – validates whether those scores are trustworthy.
🧮 ScorableRanker: Weighted, Multi-Signal Case Scoring
The ScorableRanker extends traditional similarity ranking with richer signals inspired by CBR research. Instead of just “how close is this case to the goal?”, it combines:
- Similarity (goal ↔ case embedding match).
- Value (past reward signals from evaluations).
- Recency (cases fade over time via exponential decay).
- Diversity (Maximal Marginal Relevance: avoid clones of already-picked cases).
- Adaptability (does this case generalize? Does it use tools available in the current context?).
components = {
"similarity": self._similarity(query_emb, cand_emb),
"value": self._value(cand),
"recency": self._recency(cand),
"adaptability": self._adaptability(cand, context),
"diversity": self._diversity(cand, selected),
}
rank_score = sum(
components[k] * self.weights.get(k, 0) for k in components
)
By default, weights are inspired by CBR literature (similarity 0.45, value 0.30, recency 0.10, diversity 0.10, adaptability 0.05).
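As a sanity check, here’s that weighted sum applied by hand to Case A from the worked example below (a sketch; in the real ranker each component is computed from embeddings and history rather than given directly):
weights = {"similarity": 0.45, "value": 0.30, "recency": 0.10,
           "diversity": 0.10, "adaptability": 0.05}
case_a = {"similarity": 0.92, "value": 0.81, "recency": 0.95,
          "diversity": 0.80, "adaptability": 0.70}
rank_score = sum(case_a[k] * weights[k] for k in weights)
print(round(rank_score, 2))  # ≈ 0.87, consistent with Case A's 0.86 below up to rounding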
🔑 Takeaway: Instead of a flat “nearest neighbor” score, cases now have a composite rank score that balances short-term similarity, long-term value, and contextual adaptability.
🌌 MARS: Measuring Agreement and Reliability
The MARSCalculator goes one step deeper. It looks across all scorers (MRQ, SICQL, EBT, LLM) and asks:
- Do they agree on which cases are good?
- If not, where’s the conflict?
- Which scorer is most reliable given our trust reference (e.g., LLM)?
result = {
"dimension": str(dimension),
"agreement_score": 0.87, # 1 - variance
"std_dev": 0.12, # disagreement spread
"preferred_model": "ebt", # closest to trust reference
"primary_conflict": ["mrq", "llm"],
"delta": 0.22, # difference between top & bottom
"high_disagreement": False,
"explanation": "MARS agreement: 0.87 | Most aligned with llm: ebt | Primary conflict: mrq vs llm (Δ=0.22)",
}
MARS reports include:
- Agreement Score: normalized consensus (1 = perfect agreement, 0 = chaos).
- Primary Conflict: biggest scorer disagreement (e.g., MRQ vs LLM).
- Preferred Model: scorer most aligned with the trust reference.
- Diagnostics: explanations, correlations between metrics, and reliability estimates.
⚡ Worked Example
Let’s say the goal is:
“Evaluate new planning strategies in symbolic reasoning.”
We retrieve 3 candidate cases:
Case | Similarity | Value | Recency | Diversity | Adaptability | Rank Score | MARS Agreement | Conflict |
---|---|---|---|---|---|---|---|---|
A | 0.92 | 0.81 | 0.95 | 0.80 | 0.70 | 0.86 | High (0.91) | None |
B | 0.75 | 0.60 | 0.70 | 0.65 | 0.85 | 0.73 | Medium (0.72) | MRQ vs LLM |
C | 0.58 | 0.92 | 0.40 | 0.95 | 0.55 | 0.69 | Low (0.51) | EBT vs LLM |
📌 Interpretation:
- Case A → Strong all-rounder, high trust.
- Case B → Adaptable but contested (scorers disagree).
- Case C → Novel but risky (MARS shows low agreement).
📐 Flow in Middleware
flowchart TD A[🔍 Retrieved Cases] --> B[📊 ScorableRanker<br/>Composite Scoring] B --> C[🌌 MARS<br/>Agreement + Conflict Analysis] C --> D[✅ QualityAssessor<br/>Keep or Discard?] D --> E[🏆 ChampionPromoter<br/>Best Case to Champion]
🚀 Deep understanding
The combination of ScorableRanker + MARS is one of the core contributions of this system:
- Stephanie no longer just finds “nearest cases.”
- She ranks them based on multiple signals relevant to reasoning.
- She validates them with agreement checks before trusting them.
This is what makes the ModularMemento pipeline robust: it’s not just memory, it’s measured, trustworthy memory.
🌌 Inside MARS: Measuring Agreement and Reasoning Signal
The MARSCalculator (Model Agreement & Reasoning Signal) is the engine that tells us whether to trust our scoring results.
Where the ScorableRanker evaluates candidates, MARS evaluates the evaluators themselves. It asks:
- Do the scorers agree on what’s good?
- If not, who do we trust?
- Where are the conflicts and weak signals that deserve human review?
🧭 Step 1: Per-Dimension Analysis
Every document, hypothesis, or case is scored along multiple dimensions (e.g., clarity, novelty, alignment). MARS operates within each dimension separately, ensuring we understand not just how good a case is overall, but how reliable scoring is in that specific dimension.
flowchart TD A[📂 ScoreCorpus] --> B["📏 Dimension Matrix<br/>(docs × scorers)"] B --> C[📊 Agreement & Variance Analysis] B --> D[⚔️ Conflict Detection] B --> E[🎯 Preferred Model Selection] C & D & E --> F[🌌 MARS Result per Dimension]
📊 Agreement & Variance
- Std. Deviation: How much scorers diverge on this dimension.
- Agreement Score: 1 - std. deviation, clamped between 0 and 1.
👉 High agreement = safe to trust. 👉 High variance = risky dimension.
# Agreement = 1 - std deviation
agreement_score = max(0.0, min(1.0, 1.0 - float(std_dev)))
⚔️ Conflict Detection
MARS finds the biggest disagreement between scorers:
scorer_means = col_means.fillna(0.0)
max_name = scorer_means.idxmax()
min_name = scorer_means.idxmin()
delta = scorer_means[max_name] - scorer_means[min_name]
primary_conflict = [max_name, min_name]
Example:
- MRQ avg = 0.82
- LLM avg = 0.61
- Conflict = MRQ vs LLM (Δ = 0.21)
This tells us where scorers see the world differently.
🎯 Preferred Model Selection
Who should we trust when scorers disagree?
MARS compares every scorer’s outputs against a trust reference (default: LLM). The model whose scores are closest to the trust reference is marked as preferred.
min_diff = float("inf")
for scorer in matrix.columns:
    diff = (matrix[scorer] - trust_scores).abs().mean()
    if diff < min_diff:
        min_diff = diff
        preferred_model = scorer
If no trust reference exists, MARS defaults to the median scorer.
📈 Scorer Reliability
MARS also tracks reliability per scorer either by correlation with the trust reference or by how consistent a scorer is across docs.
Scorer | Reliability |
---|---|
LLM | 1.00 (reference) |
MRQ | 0.78 |
EBT | 0.82 |
SVM | 0.66 |
This lets us spot when one scorer drifts or collapses.
🧠 Human-Readable Explanation
Every MARS result includes a narrative explanation for logs & dashboards:
{
"dimension": "clarity",
"agreement_score": 0.87,
"preferred_model": "ebt",
"primary_conflict": ["mrq", "llm"],
"delta": 0.22,
"explanation":
"MARS agreement: 0.87 | Most aligned with llm: ebt | Primary conflict: mrq vs llm (Δ=0.22)"
}
This is what you’ll see visualized in SIS reports.
📚 Example MARS Report
Let’s say we ran MARS across 3 dimensions:
Dimension | Agreement | Conflict | Preferred | Explanation |
---|---|---|---|---|
Clarity | 0.91 | None | MRQ | High agreement, MRQ best aligned |
Novelty | 0.68 ⚠️ | LLM vs SVM | EBT | Disagreement flagged (Δ=0.25) |
Alignment | 0.79 | MRQ vs LLM | MRQ | Moderate agreement |
📌 Interpretation:
- Clarity → safe to trust.
- Novelty → high risk, needs review.
- Alignment → some disagreement, but MRQ aligns with LLM.
🔮 Making sense of scores everywhere
Without MARS, Stephanie would blindly trust whichever model gave a score. With MARS, she can:
- Detect hidden disagreements.
- Choose the most reliable scorer automatically.
- Flag contentious dimensions for review.
- Build a transparent audit trail of every decision.
🖼️ MARS in Action
Here’s what a MARS report looks like inside the SIS dashboard. Each bar and row reflects how scorers performed on a single pipeline run:
- Agreement score shows how much the models aligned.
- Preferred model highlights which scorer is most trustworthy.
- Conflicts are flagged clearly for human review.
flowchart TD A[Pipeline Run] --> B[📊 MARS Dashboard] B --> C[✅ Agreement Score] B --> D[🎯 Preferred Model] B --> E[⚔️ Conflict Detection] B --> F[🧾 Explanation]
📸 Screenshot below: A real SIS pipeline, with MARS surfacing scorer agreement, highlighting conflicts, and pointing to the most reliable model.
✓ In summary: MARS tells us if our scorers agree. High agreement means we can trust the result. Low agreement flags a decision for review. This makes our CBR loop robust and auditable.
✅ Assessing Case Quality
Not every reasoning path or output is worth keeping. Some are noisy, misleading, or simply irrelevant. That’s why Stephanie has a Quality Assessor module built directly into the CBR loop.
The Quality Assessor acts like a filter between ranked scorables and the casebook:
- 🧹 Filters out weak cases – only high-quality candidates survive.
- 🎯 Checks for alignment with goals – making sure what we retain is actually useful.
- 🔄 Standardizes evaluations – turning diverse scores and signals into a consistent “quality verdict.”
flowchart LR A[Ranked Scorables] --> B[✅ Quality Assessor] B -->|Pass| C[🏆 Champion Promoter] B -->|Fail| D[🗑️ Discard / Ignore]
This step is crucial. Without a quality filter, the casebook would quickly fill up with clutter, making retrieval noisy and learning less effective. With it, Stephanie ensures that only the best examples shape future reasoning.
🧹 The Quality Assessor in Action
At its core, the Quality Assessor takes a candidate Scorable (document, reasoning path, hypothesis, etc.) and decides if it’s good enough to keep.
Here’s the code:
from typing import Any, Dict

class DefaultQualityAssessor:
def __init__(self, cfg, memory, logger):
self.cfg = cfg
self.memory = memory
self.logger = logger
self.threshold = float(cfg.get("quality_threshold", 0.6))
def assess(self, scorable: Dict[str, Any]) -> bool:
"""
Decide whether a scorable is "good enough" to be retained.
Returns True if passed, False if rejected.
"""
try:
score = scorable.get("rank_score") or 0.0
meta = scorable.get("components", {})
# Basic threshold check
if score < self.threshold:
self.logger.log("QualityReject", {
"id": scorable.get("id"),
"score": score,
"reason": f"Below threshold {self.threshold}"
})
return False
# Optional: penalize very low clarity or alignment
clarity = meta.get("clarity", 1.0)
alignment = meta.get("alignment", 1.0)
if clarity < 0.3 or alignment < 0.3:
self.logger.log("QualityReject", {
"id": scorable.get("id"),
"clarity": clarity,
"alignment": alignment,
"reason": "Critical dimension below threshold"
})
return False
# Passed quality checks
self.logger.log("QualityAccept", {
"id": scorable.get("id"),
"score": score,
"clarity": clarity,
"alignment": alignment
})
return True
except Exception as e:
self.logger.log("QualityAssessorError", {"error": str(e)})
return False
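A usage sketch, assuming a logger that exposes the log(event, payload) method used throughout these snippets:
assessor = DefaultQualityAssessor(cfg={"quality_threshold": 0.6}, memory=None, logger=logger)
keep = assessor.assess({
    "id": "case_7",
    "rank_score": 0.72,
    "components": {"clarity": 0.81, "alignment": 0.88},
})
print(keep)  # True: above threshold, and no critical dimension below 0.3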
🔍 What’s Happening Here
- Threshold Gate: every scorable gets a rank_score. If it’s below the configured quality_threshold (default = 0.6), it’s rejected immediately.
- Dimension Safety Checks: even if the rank score is good, we don’t want cases with glaring weaknesses. For example, if clarity or alignment scores are critically low (<0.3), the case is discarded.
- Structured Logging: every accept/reject is logged with reasons, so we can later audit why a case was filtered.
🚮 Dump the trash
Without this filter, Stephanie would store everything, including:
- Noisy reasoning traces that confuse later retrieval.
- Low-value outputs that waste training cycles.
- Misaligned or ambiguous results that derail future reasoning.
By enforcing a quality gate, we ensure that the casebook evolves toward excellence:
- Only good enough cases get retained.
- Retrieval is cleaner and more precise.
- Learning loops are powered by high-signal, low-noise examples.
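To see the gate in action, here’s a quick usage sketch of the assessor defined above. The `PrintLogger` stub is a stand-in for Stephanie’s structured logger:

```python
class PrintLogger:
    def log(self, event, payload):
        print(event, payload)

assessor = DefaultQualityAssessor(
    cfg={"quality_threshold": 0.6}, memory=None, logger=PrintLogger()
)

good = {"id": "case-1", "rank_score": 0.74,
        "components": {"clarity": 0.8, "alignment": 0.9}}
weak = {"id": "case-2", "rank_score": 0.71,
        "components": {"clarity": 0.2, "alignment": 0.9}}

assert assessor.assess(good) is True   # passes both gates
assert assessor.assess(weak) is False  # rank is fine, but clarity is critically low
```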
🏆 Champion Promoter: Remembering the Best
Once we’ve filtered out the noise with the Quality Assessor, we don’t just want to keep all acceptable cases. Some are better than others, and the system needs to know its current best example for each goal.
That’s the role of the Champion Promoter.
It tracks which case currently holds the title of “champion” for a given goal, and promotes new challengers only when they outperform the existing one.
☸️ We know what we know
- Keeps a single source of truth for the best-known solution per goal.
- Prevents “casebook bloat” with too many near-duplicates.
- Enables downstream modules (like the Case Selector) to prioritize high-quality seeds.
🧩 Example Code (simplified)
```python
class DefaultChampionPromoter:
    def __init__(self, cfg, memory, logger):
        self.cfg, self.memory, self.logger = cfg, memory, logger

    def promote(self, goal_id: str, candidate: dict) -> bool:
        """
        Try to promote a candidate as the new champion for this goal.
        Returns True if promotion happened, False otherwise.
        """
        current = self.memory.casebooks.get_champion(goal_id)

        # No champion yet → auto-promote
        if not current:
            self.memory.casebooks.set_champion(goal_id, candidate)
            self.logger.log("ChampionPromoted", {
                "goal": goal_id, "id": candidate.get("id"), "reason": "First champion"
            })
            return True

        # Compare rank_score (or other metrics)
        if candidate.get("rank_score", 0) > current.get("rank_score", 0):
            self.memory.casebooks.set_champion(goal_id, candidate)
            self.logger.log("ChampionPromoted", {
                "goal": goal_id, "id": candidate.get("id"), "reason": "Outperformed old champion"
            })
            return True

        # Candidate not better
        self.logger.log("ChampionNotPromoted", {
            "goal": goal_id, "id": candidate.get("id"), "reason": "Did not beat champion"
        })
        return False
```
⚡ The Effect
The Champion Promoter acts like a tournament bracket:
- Every new candidate competes with the reigning champion.
- Only if it beats the champion does it replace it.
- This ensures a constant upward trajectory in case quality.
```mermaid
flowchart LR
    A[📊 Candidate Case] --> B[✅ Quality Assessor]
    B -->|Passes| C[🏆 Champion Promoter]
    B -->|Fails| X[❌ Discarded]
    C -->|Beats Champion| D[⭐ New Champion Stored]
    C -->|Not Better| E[↩️ Champion Retained]
    D --> F[📝 Goal State Tracker]
    E --> F
    F --> G[(💾 Casebook DB)]
    G --> H[📂 Used by Case Selector for Future Runs]
```
🔑 How to read this:
- Only quality-approved cases reach the Champion Promoter.
- The promoter ensures that only the best candidate per goal survives as champion.
- The Goal State Tracker keeps everything consistent in the database so that the Case Selector always knows where to start.
📝 Goal State Tracker: Remembering the Champion
So far, we’ve seen how new cases are scored, assessed for quality, and possibly promoted as champions. But for a self-improving system, it’s not enough to pick a winner in the moment we need to remember that winner across future runs.
That’s the job of the Goal State Tracker.
🏦 What It Does
- Keeps a record of the current champion for each `(goal, casebook)` pair.
- Updates the champion when the Champion Promoter signals a new best case.
- Stores metadata like when it was promoted, why it was promoted, and its key scorable signals.
- Provides fast retrieval for the Case Selector, so the system can start from its best-known solution instead of reinventing the wheel.
This is essentially the “memory cell” for CBR: the point where experience becomes institutional knowledge.
🔎 Example: Champion Persistence
```python
class GoalStateTracker:
    def __init__(self, memory, logger):
        self.memory = memory
        self.logger = logger

    def update_champion(self, casebook_id: int, goal_id: str, case_id: str):
        """
        Set the given case as the champion for this goal.
        """
        self.memory.casebooks.set_champion(casebook_id, goal_id, case_id)
        self.logger.log(
            "ChampionUpdated",
            {"casebook_id": casebook_id, "goal_id": goal_id, "new_champion": case_id},
        )

    def get_champion(self, casebook_id: int, goal_id: str):
        """
        Retrieve the current champion case for this goal.
        """
        return self.memory.casebooks.get_champion(casebook_id, goal_id)
```
🪢 Learn from history
Without the tracker, the agent would forget its hard-earned lessons. By explicitly recording champions:
- Future runs can bootstrap from proven successes.
- We can analyze how champions evolve over time.
- The system becomes self-stabilizing: bad cases don’t overwrite champions unless they’re clearly better.
Next, let’s tackle the A/B Validator. This is where the system stops being theoretical and actually tests alternatives head-to-head before updating memory.
⚖️ A/B Validator: Putting Cases to the Test
Promoting a new champion is a big deal. If we do it too eagerly, we risk forgetting good solutions. If we do it too conservatively, we miss out on better ones.
The A/B Validator exists to strike that balance.
🧠 What It Does
- Takes a candidate case (the challenger) and compares it to the current champion.
- Uses the same scoring engines (ScorableRanker, MARS, etc.) across multiple dimensions.
- Decides whether the challenger is truly better before updating the Goal State Tracker.
- Records the evaluation so that results are transparent and reproducible.
🧪 Example: Challenger vs Champion
```python
class ABValidator:
    def __init__(self, assessor, logger):
        self.assessor = assessor
        self.logger = logger

    def validate(self, champion, challenger, context):
        """
        Compare champion vs challenger.
        Return True if challenger should replace champion.
        """
        champ_score = self.assessor.assess(champion, context)
        chall_score = self.assessor.assess(challenger, context)
        better = chall_score > champ_score

        self.logger.log(
            "ABValidation",
            {
                "champion_score": champ_score,
                "challenger_score": chall_score,
                "winner": "challenger" if better else "champion",
            },
        )
        return better
```
🎯 Prove it
This step ensures that:
- Champions aren’t overwritten unless the challenger proves itself.
- We can experiment safely without destabilizing the system.
- Validation logs give us a paper trail of improvements: we can see when, how, and why a champion was replaced.
In short: the A/B Validator makes sure our agent’s evolution is based on evidence, not hype.
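Here’s a quick usage sketch. Note the validator expects an assessor that returns a numeric score (unlike the pass/fail `DefaultQualityAssessor`), so the `StubAssessor` below is a hypothetical stand-in:

```python
class PrintLogger:
    def log(self, event, payload):
        print(event, payload)

class StubAssessor:
    """Hypothetical assessor that returns a numeric quality score."""
    def assess(self, case, context):
        return case.get("rank_score", 0.0)

validator = ABValidator(assessor=StubAssessor(), logger=PrintLogger())

champion = {"id": "case-1", "rank_score": 0.74}
challenger = {"id": "case-9", "rank_score": 0.81}

if validator.validate(champion, challenger, context={}):
    print("Challenger wins: promotion goes ahead")
```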
👉 Next up is the MicroLearner: the component that takes these validation results and squeezes extra training signal out of them.
🧠 MicroLearner: Learning in the Small
Big retraining loops are expensive. They require lots of data, time, and compute. But what if we could learn incrementally, case by case, as new evidence comes in?
That’s the role of the MicroLearner.
⚙️ What It Does
- Watches the results of each CBR cycle (champions, challengers, validation outcomes).
- Extracts tiny training signals from every decision:
  - “This case scored higher than that one.”
  - “This output was accepted, that one was discarded.”
- Converts those signals into on-the-fly updates for scorers (MRQ, EBT, SVM, etc.).
- Keeps models fresh and adaptive without waiting for full retrains.
Think of it as fine-grained gradient nudges that happen inside the reasoning loop.
🔍 Example: Online Update
Here’s a simplified sketch of how it works:
```python
class MicroLearner:
    def __init__(self, memory, logger):
        self.memory = memory
        self.logger = logger

    def update_from_validation(self, champion, challenger, result):
        """
        Tiny supervised update: reward the winner, penalize the loser.
        """
        if result == "challenger":
            self._reward(challenger, 1.0)
            self._reward(champion, 0.0)
        else:
            self._reward(champion, 1.0)
            self._reward(challenger, 0.0)

    def _reward(self, case, reward_value):
        self.memory.rewards.add(
            case_id=case.id,
            reward=reward_value,
        )
        self.logger.log(
            "MicroUpdate",
            {"case_id": case.id, "reward": reward_value},
        )
```
🚀 Why It Matters
- Always learning: Every decision provides feedback.
- Low overhead: Updates happen instantly, without pausing for retraining jobs.
- Better over time: Even small nudges accumulate into significant improvements.
In other words, the MicroLearner makes sure Stephanie never wastes an experience: every case, win or lose, sharpens the system.
```mermaid
flowchart TD
    A[🏆 Champion Case] --> C[⚖️ AB Validation]
    B[🥊 Challenger Case] --> C
    C -->|Winner/Loser| D
    subgraph D[🧠 MicroLearner]
        D1[📊 Compare outcomes]
        D2[✏️ Assign rewards<br/>Champion vs Challenger]
        D3[🔄 Online update<br/>Scoring Models]
    end
    D --> E["📈 Scorers (MRQ, EBT, SVM)"]
    E --> F[💡 Improved Future Ranking]

    style A fill:#bbf,stroke:#333,stroke-width:1px
    style B fill:#fbb,stroke:#333,stroke-width:1px
    style C fill:#ffd,stroke:#333,stroke-width:1px
    style D fill:#dfd,stroke:#333,stroke-width:2px
    style E fill:#cfc,stroke:#333,stroke-width:1px
    style F fill:#fcf,stroke:#333,stroke-width:2px
```
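To give a feel for how those nudges accumulate, here is a hedged sketch of one possible online update: a scorer's trust weight drifting toward its observed win rate. The EMA rule and names are illustrative, not Stephanie's actual MicroLearner math:

```python
def nudge_weight(current_weight: float, reward: float, lr: float = 0.05) -> float:
    """Exponential moving average toward the observed reward (0 or 1)."""
    return (1 - lr) * current_weight + lr * reward

weight = 0.5
for reward in [1, 1, 0, 1, 1, 1, 0, 1]:  # stream of win/loss signals
    weight = nudge_weight(weight, reward)
print(f"trust weight after 8 updates: {weight:.3f}")  # drifts toward the ~0.75 win rate
```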
📦 After the MicroLearner comes the Retention Policy, which decides how long cases stick around and which ones fade out.
📦 Retention Policy: Deciding What to Keep
Not every case is worth saving. Some are noisy, redundant, or outright harmful. The Retention Policy makes sure Stephanie’s casebooks don’t just grow endlessly: they evolve with quality in mind.
🔍 What It Does
- Filters cases before they’re stored in the casebook.
- Applies rules & thresholds (like minimum score, novelty, or domain balance).
- Decides retention mode:
  - ✅ Accept: store permanently.
  - ⏳ Stash: keep temporarily for review.
  - ❌ Reject: discard outright.
- Logs the decision so we know why something was kept or dropped.
This keeps the system lean, adaptive, and bias-aware. Without it, CBR would just hoard everything.
🧠 Why It Matters
CBR depends on the quality of its memory. If bad or redundant cases stick around, retrieval gets noisy and reasoning degrades.
The Retention Policy ensures:
- 📉 Noise control – we don’t clutter the casebook.
- 🧬 Novelty preservation – genuinely new cases make it in.
- 🎯 Goal alignment – cases must actually help the agent.
Think of it as Marie Kondo for AI memory:
If a case doesn’t spark learning, it doesn’t stay.
⚙️ A Simplified Example
Here’s a slimmed-down code snippet from `retention_policy.py` that shows the essence:
```python
class DefaultRetentionPolicy:
    def __init__(self, cfg, memory, logger, casebook_scope_mgr):
        self.cfg = cfg
        self.memory = memory
        self.logger = logger
        self.scope_mgr = casebook_scope_mgr

    def should_retain(self, case) -> str:
        """
        Decide what to do with a case:
        return "accept", "stash", or "reject".
        """
        score = case.get("score", 0.0)
        novelty = case.get("novelty", 0.0)

        if score < 0.3:
            return "reject"
        elif novelty < 0.1:
            return "stash"
        return "accept"
```
🔁 Where It Fits
Let’s place it in context with the full CBR loop:
```mermaid
flowchart LR
    A[📄 New Case] --> B[📊 Rank & Analyze]
    B --> C[✅ Quality Assessor]
    C --> D[🏆 Champion/Challenger]
    D --> E[⚖️ AB Validator]
    E --> F[🧠 MicroLearner]
    F --> G[📦 Retention Policy]
    G -->|Accept| H[(💾 Casebook Storage)]
    G -->|Reject/Stash| I[🗑️ Drop or Hold]

    style G fill:#ffd,stroke:#333,stroke-width:2px
    style H fill:#cfc,stroke:#333,stroke-width:2px
```
This diagram shows:
- Cases are only written to the Casebook after passing through Retention Policy.
- Everything else (scoring, validation, micro-learning) feeds into this gate.
✅ With Retention Policy, Stephanie’s casebooks don’t just grow: they curate themselves.
So far we’ve only hinted at the simplest form (keep if score > 0.3, novelty > 0.1). In practice, we layer rules that reflect how a case contributes to long-term learning.
📐 Retention Rules: Curating Memory for Growth
The Retention Policy isn’t a one-size-fits-all filter: it applies multiple rules to decide whether a case strengthens the casebook or just adds noise.
⚖️ Examples of Rules
1. **Score Thresholds**
   - Reject cases that don’t meet a minimum quality score (e.g., clarity < 0.3).
   - Ensures we don’t store junk.
2. **Novelty & Diversity**
   - Penalize cases that are too similar to existing ones.
   - Encourage retention of genuinely new reasoning paths.
3. **Domain Balance**
   - If a casebook is over-saturated in one domain (e.g., 80% “planning”), reject or stash new cases from that domain.
   - Keeps knowledge coverage broad.
4. **Temporal Freshness**
   - Prioritize recent cases when concepts are evolving fast.
   - Optionally prune stale cases after N days.
5. **Goal Alignment**
   - Check if the case actually advances the current goal.
   - Even a high-scoring case might be irrelevant to the task at hand.
🧩 Putting It Together
Here’s a more detailed (but still readable) policy function:
```python
class DefaultRetentionPolicy:
    def __init__(self, cfg, memory, logger, casebook_scope_mgr):
        self.cfg = cfg
        self.memory = memory
        self.logger = logger
        self.scope_mgr = casebook_scope_mgr

    def should_retain(self, case) -> str:
        score = case.get("score", 0.0)
        novelty = case.get("novelty", 0.0)
        domain = case.get("domain", "unknown")
        age_days = case.get("age_days", 0)

        # Rule 1: basic quality gate
        if score < 0.3:
            return "reject"

        # Rule 2: novelty encourages learning
        if novelty < 0.1:
            return "stash"  # don’t delete, but don’t commit

        # Rule 3: domain balancing
        if self.scope_mgr.domain_overloaded(domain):
            return "stash"

        # Rule 4: freshness matters
        if age_days > 180 and score < 0.6:
            return "reject"

        # Rule 5: final fallback
        return "accept"
```
🧠 Small, Focused Casebooks
Without these rules, the casebook would balloon with:
- 🔄 Duplicates (same idea phrased differently)
- 📉 Low-quality noise
- ⚖️ Over-representation of some domains
With retention rules, the memory stays small but sharp: every stored case makes future reasoning better.
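One rule worth grounding is novelty. A common recipe (an assumption here, not necessarily Stephanie’s exact one) defines novelty as one minus the cosine similarity to the nearest stored case:

```python
import numpy as np

def novelty(candidate_emb: np.ndarray, casebook_embs: list) -> float:
    """Novelty = 1 - max cosine similarity to any stored case embedding."""
    if not casebook_embs:
        return 1.0  # an empty casebook makes everything novel
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return 1.0 - max(cos(candidate_emb, e) for e in casebook_embs)

stored = [np.array([1.0, 0.0]), np.array([0.7, 0.7])]
print(novelty(np.array([0.99, 0.1]), stored))  # near-duplicate → low novelty
print(novelty(np.array([0.0, 1.0]), stored))   # new direction → higher novelty
```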
📚 The Casebook: Where Experience Lives
All of these middleware components, from Case Selection and Ranking to Quality Assessment, Champion Promotion, and Retention Policies, ultimately converge on one central artifact: the Casebook.
The Casebook is Stephanie’s long-term memory of reasoning.
- Each case captures a goal, context, and reasoning trace.
- Over time, cases are retrieved, reused, revised, and retained (the CBR cycle).
- Thanks to the Retention Policy, only the most useful and high-quality cases survive, keeping the Casebook sharp and focused.
Here’s how it looks in the SIS interface:
Caption:
🗂️ The Casebook in SIS: each card represents a case, with its reasoning outputs, scores, and metadata. The Casebook is where Stephanie’s past experiences are stored, organized, and continually refined, ready to be retrieved for the next reasoning challenge.
🗂️ An example case in SIS: here you can see some of the details of an actual case.
```mermaid
flowchart TD
    subgraph Casebook["📚 Casebook (SIS)"]
        C1[📝 Case 1<br/>Reasoning Trace + Scores]
        C2[📝 Case 2<br/>Reasoning Trace + Scores]
        C3[📝 Case 3<br/>Reasoning Trace + Scores]
    end
    A[🎯 New Goal] --> B[🔍 Retrieve Relevant Cases]
    B --> C1 & C2
    C1 & C2 --> D[♻️ Reuse + Adapt Reasoning]
    D --> E[🛠️ Revise with Feedback]
    E --> F[💾 Retain Updated Case]
    F --> Casebook

    style Casebook fill:#f9f,stroke:#333,stroke-width:2px
    style A fill:#bbf,stroke:#333,stroke-width:2px
    style F fill:#cfc,stroke:#333,stroke-width:2px
```
After all these moving parts, you might be wondering: why does this matter? Isn’t this just another layer of complexity on top of Stephanie’s pipelines?
💡 This is where the shift happens.
💡 Why This Changes Everything
The CBR middleware isn’t just an implementation detail: it achieves something profound by decoupling intelligence from model size. While most AI progress chases larger models, Stephanie gets smarter through better algorithms, specifically through the CBR loop.
This means:
- ✅ Stephanie can run on smaller, cheaper models while still improving
- ✅ Her knowledge base grows independently of model capacity
- ✅ She can transfer learning across completely different domains
As the Memento paper hints: “The policy is continually updated based on environmental feedback through a memory rewriting mechanism.” Stephanie makes this concrete with measurable outcomes.
This isn’t just “better AI”: it’s a fundamental shift toward algorithms over parameters, where intelligence lives in the reasoning process itself.
📋 Memento: Checklist of Implementation
The entire purpose of this post was to bring the Memento paper off the page and into a working system. Here’s the checklist: what the paper proposed, and how we built it inside Stephanie:
| Memento (Paper) | Stephanie (Implementation) | ✅ |
|---|---|---|
| Reasoning traces as cases – record (state, plan, reward) | PlanTraces – structured records with goals, inputs, steps, outputs, errors, and scores | ✅ |
| Case bank – store past cases for later retrieval | PlanTrace Store (Postgres/SQLite) – database-backed memory with embeddings, metadata, and APIs | ✅ |
| Case retrieval by similarity (KNN) | Embedding Store (H-Net, HF, Ollama) – vector search over traces, documents, and hypotheses | ✅ |
| Adaptation over generation – reuse past cases instead of starting from scratch | PlannerReuseAgent – ranks and adapts top traces into new plans via LLM prompting | ✅ |
| Feedback and revision – correct and refine cases with signals | Multi-dimensional scoring (HRM, SICQL, MRQ, EBT, SVM) – graded evaluation of reasoning and documents | ✅ |
| Retention policies – decide what to keep, prune, or discard | RetentionPolicy module – configurable rules (`keep_all`, `keep_top_k`, `discard_failed`) | ✅ |
| Champion selection – prioritize strong past cases | ChampionPromoter + GoalStateTracker – mark and promote best cases per goal | ✅ |
| AB validation – compare case variants | ABValidator – runs controlled comparisons across candidate solutions | ✅ |
| Learning loop – improve without fine-tuning models | MicroLearner – online updates that adapt ranking and scoring based on past performance | ✅ |
| (Optional in paper) Q-retriever for quality-weighted retrieval | ScorableRanker + MARS – composite scoring and meta-analysis of scorer agreement | ✅ |
🛑 Conclusion
We started with a question: can an AI learn from its own experience without constant fine-tuning?
The Memento paper gave us the blueprint: intelligence can live in a memory of cases. In this post, we turned that blueprint into a working, extensible system inside Stephanie.
Here’s what we built:
- 🧩 CBR Middleware Layer – a reusable class that can wrap any pipeline, not just a single agent.
- 📊 Scorable Abstraction – a universal way to make every concept (documents, traces, rules) measurable and comparable.
- ⚖️ Multi-Dimensional Judgment – a jury of scorers (HRM, SICQL, MARS) that evaluate reasoning beyond simple rewards.
- 🌐 Memory Graph – not just logs, but a connected lineage of reasoning, with explicit reuse links across PlanTraces.
This isn’t just an implementation. It’s a foundation. With this CBR loop in place, Stephanie can now improve every time she acts by remembering, reusing, and revising her own reasoning.
👉 What’s next? In the next post, we’ll connect this living memory to ZeroModel, turning decisions into real-time visual maps. For the first time, you’ll be able to literally watch an AI think.
📖 Glossary
Term | Definition |
---|---|
Stephanie | A self-improving AI system that learns by remembering, reusing, revising, and retaining its reasoning, instead of relying only on fine-tuning. |
Memento (Paper) | Research paper “Fine-tuning LLM Agents without Fine-tuning LLMs” that proposes case-based reasoning (CBR) as a way for agents to learn from experience. |
Case-Based Reasoning (CBR) | A cycle of Retrieve → Reuse → Revise → Retain where past reasoning cases are stored, adapted, and improved for new problems. |
PlanTrace | Stephanie’s structured memory of a reasoning process, containing steps, inputs/outputs, and scores; the atomic “case” in her memory. |
CBR Middleware | A modular layer in Stephanie that wraps any pipeline with case-based reasoning capabilities (case storage, retrieval, ranking, validation, and retention). |
Scorable | A universal wrapper that makes any object (document, plan, hypothesis, rule) measurable and comparable by embeddings and scores. |
ScorableRanker | Component that ranks cases by combining similarity, value, recency, adaptability, and diversity into a weighted score. |
MARS (Model Agreement & Reasoning Signal) | A meta-analysis tool that checks if scorers agree, detects conflicts, and identifies the most reliable evaluator. |
HRM (Hierarchical Reasoning Model) | A model that simulates layered reasoning by alternating between high-level strategies and low-level steps, producing a judgment with reasoning trace. |
SICQL (Scalable In-Context Q-Learning) | A model with Q, V, and π heads that scores quality, value, and action policies directly over embeddings for sharp evaluation. |
Retention Policy | Rules that decide which cases to keep or discard (e.g., keep all, keep top-K, discard failed) to ensure memory grows in quality. |
Casebook | A structured collection of related PlanTraces and reasoning cases grouped by goal, domain, or experiment. |
ContextNamespacer | Middleware module that tags each case with goal/run IDs to prevent scope collisions in memory. |
CasebookScopeManager | Decides which casebooks to search or update when storing and retrieving cases. |
CaseSelector | Retrieves candidate past cases for reuse, balancing champions, recent successes, novelty, and exploration. |
Quality Assessor | Filters out low-quality cases before they are added to casebooks, ensuring only useful cases are retained. |
Champion Promoter | Elevates the best-performing case to “champion” status for a given goal, making it the default for reuse. |
AB Validator | Runs A/B testing between candidate cases to validate which one performs better in practice. |
MicroLearner | Lightweight reinforcement loop that fine-tunes scorers and retrieval weights using feedback from case performance. |
ZeroModel | A complementary system that visualizes Stephanie’s decision-making process through Visual Policy Maps, enabling real-time introspection of reasoning. |
📚 References
Further Reading / Tools
- **Stephanie Insight System (SIS)** – Dashboard for exploring cases, traces, and scorers in Stephanie. Makes CBR pipelines visible and debuggable. GitHub Repo
- **ZeroModel & Visual Policy Maps** – Complementary system for compressing high-dimensional policy data into images, allowing “AI you can watch think.” See blog post: Visual Policy Maps on zeromodel.org.
- **Case-Based Reasoning (Classic AI)** – A. Aamodt & E. Plaza (1994), “Case-Based Reasoning: Foundational Issues, Methodological Variations, and System Approaches”. The foundational survey that first systematized CBR in AI.
- **Unstructured.io** – Python library used for PDF parsing in Stephanie’s `DocumentProfilerAgent`. Docs
- **arXiv API** – Stephanie’s entry point for retrieving new knowledge cases. Docs
📎 Appendix: 📥 Retrieve
🔎 Step 1: Retrieve – The ArxivSearchAgent
The first step in Case-Based Reasoning (CBR) is always retrieval. In the Memento paper, this means pulling similar cases from memory. In Stephanie, we extend this to include not just internal memory but also external knowledge sources starting with scientific papers from arXiv.
Why arXiv? Because Stephanie is designed for self-learning research workflows. To improve over time, she needs a steady flow of new problems, methods, and evidence. ArXiv is the natural place to start: it’s where cutting-edge research lives.
🛠️ The Agent: ArxivSearchAgent
👉 Full Code Here
We’ve covered this agent before (see DeepResearch Part 1 📄), but here’s a quick recap of how it works:
1. **Keyword extraction** – It takes the current goal text and extracts candidate keywords using simple heuristics (later stages refine them).
2. **Query building** – Constructs an arXiv-compatible search query, filtering by year, category (`cs.AI` by default), and optionally by recency (today, week, month, year).
3. **Fetching results** – Uses the `arxiv` Python API, with retry logic for empty page errors. Each result is normalized into a structured dict: `title`, `summary`, `url`, `authors`, `category`, etc. (A minimal sketch of these two steps follows below.)
4. **Context integration** – Results are stored back into the pipeline context so downstream agents can use them for profiling, scoring, and planning.
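As promised above, here is a minimal sketch of the query-building and fetching steps using the `arxiv` Python package. The query string and keywords are illustrative assumptions; the real agent builds them from the goal text:

```python
import arxiv  # pip install arxiv

# Hypothetical keywords extracted from the goal text
query = 'cat:cs.AI AND all:"case based reasoning"'

search = arxiv.Search(
    query=query,
    max_results=5,
    sort_by=arxiv.SortCriterion.SubmittedDate,
)

for result in arxiv.Client().results(search):
    # Normalize each result into a structured dict, as the agent does
    doc = {
        "title": result.title,
        "summary": result.summary,  # the author-written abstract
        "url": result.entry_id,
        "authors": [a.name for a in result.authors],
        "category": result.primary_category,
    }
    print(doc["title"])
```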
📖 Remembering Everything
As we’ve said before, Stephanie is a pipeline application. Everything she does, big or small, nested or top-level, is a pipeline. And more than that: every pipeline is a PlanTrace. That means every run, every stage, every output becomes a learning moment, a piece of memory that can be retrieved, reused, revised, or retained.
Here’s how it looks in practice:
```mermaid
flowchart TD
    %% Style
    classDef pipeline fill:#f6f8fa,stroke:#0366d6,stroke-width:2px,color:#000
    classDef step fill:#fff5e6,stroke:#f59e0b,stroke-width:2px,color:#000
    classDef memory fill:#e6fffa,stroke:#10b981,stroke-width:2px,color:#000
    classDef index fill:#eef2ff,stroke:#6366f1,stroke-width:2px,color:#000

    %% Pipeline (top row)
    subgraph P["⚙️ Pipeline Execution"]
        A[🔍 Arxiv Search]
        B[📄 Document Loader]
        C[🧩 Profiler]
        D[📊 Paper Scorer]
        E[♻️ Planner Reuse]
        F[📝 Knowledge Loader]
        G[📦 Cartridge Retain]
    end
    class P pipeline

    %% Execution steps (middle row)
    subgraph S["🪢 Execution Steps (PlanTrace)"]
        H1[Arxiv Search Step]
        H2[Doc Loader Step]
        H3[Profiler Step]
        H4[Paper Score Step]
        H5[Planner Reuse Step]
        H6[Knowledge Loader Step]
        H7[Cartridge Retain Step]
    end
    class S step

    %% Memory + Indexing (bottom row)
    subgraph M["🧠 Memory + Indexing"]
        I[🔎 Embedded & Indexed]
        J[♻️ Reuse Links + Retrieval]
    end
    class M memory

    %% Connections
    A --> H1 --> I
    B --> H2 --> I
    C --> H3 --> I
    D --> H4 --> I
    E --> H5 --> I
    F --> H6 --> I
    G --> H7 --> I
    I --> J --> E
```
This layering, Pipeline → PlanTrace → Memory, ensures that every action Stephanie takes is both execution and learning. By embedding, indexing, and linking traces back into memory, Stephanie turns pipelines into a feedback system where experience itself becomes training data.
🚀 Let’s do some science
In this post, by starting with arXiv retrieval, we:
- Anchor Stephanie’s reasoning in real, high-quality scientific work.
- Generate cases worth remembering.
- Ensure that the Retrieve step isn’t abstract: it’s directly tied to solving real research goals.
```yaml
stages:
  - name: arxiv_search
    agent_role: retrieve
    description: "📥 Retrieve: search Arxiv for candidate documents (knowledge cases)"
    cls: stephanie.agents.knowledge.arxiv_search.ArxivSearchAgent
    enabled: true
    iterations: 1
```
✅ Next, we’ll walk through what happens after retrieval: how papers become structured intelligence, scored, and eventually reused.
📥 Step 2: Materialize – The DocumentLoaderAgent
Please note we covered this in more detail in this post.
Now that we’ve retrieved candidate papers from arXiv, the next step is to materialize them into structured, scorable knowledge objects.
This is where the `DocumentLoaderAgent` comes in.
It doesn’t just “download a PDF”: it transforms a raw document into a first-class citizen of Stephanie’s memory:
- ✅ Downloads and extracts text
- ✅ Stores in Postgres
- ✅ Generates H-Net embeddings
- ✅ Assigns domains (NLP, ML, Robotics, etc.)
- ✅ Links to the current pipeline run
- ✅ Logs every action
This is not ingestion: it’s assimilation.
🛠️ The Agent: DocumentLoaderAgent
Let’s walk through the key parts of the code that make this possible.
1. Skip Duplicates, Store New
```python
# Skip if already exists
existing = self.memory.documents.get_by_url(url)
if existing:
    stored_documents.append(existing.to_dict())
    continue
```
- Prevents redundant processing
- Ensures idempotency – the same paper won’t be loaded twice
2. Download & Extract PDF
```python
response = requests.get(url, stream=True)
# ... write to temp file ...
text = PDFConverter.pdf_to_text(pdf_path)
os.remove(pdf_path)
```
- Uses `requests` + `PDFConverter` to extract clean text
- Handles streaming for large files
- Cleans up temp files
3. Create a Scorable
```python
scorable = Scorable(
    id=doc_id,
    text=embed_text,
    target_type=TargetType.DOCUMENT,
)
self.memory.scorable_embeddings.get_or_create(scorable)
```
🔍 New Concept: What’s a `Scorable`?
A `Scorable` is the atomic unit of memory in Stephanie.
Every object that can be retrieved, scored, or reused (documents, plan traces, hypotheses) must be a `Scorable`.
This guarantees that it has:
- ✅ An embedding (via H-Net)
- ✅ A type (`document`, `plan_trace`, `hypothesis`)
- ✅ A text field for retrieval
- ✅ A score bundle (HRM, SICQL, MRQ, etc.)

By making everything a `Scorable`, we ensure that no knowledge is unindexed.
4. Generate Embeddings with H-Net
```python
embed_text = f"{doc['title']}\n\n{doc.get('text', doc.get('summary', ''))}"
```
- Combines title + full text (or summary) for rich context
- Passed to H-Net via `get_or_create()` → stored in the `embeddings` table
👁️ New Concept: Embeddings as “Vision”
Think of H-Net as Stephanie’s eyes.
Just as your brain converts light into neural signals, H-Net converts text into dense vectors that represent meaning.
Once embedded, documents can be:
- 🔍 Searched by similarity
- 🗺️ Mapped in a vector space
- 🔄 Retrieved for reuse
This is how Stephanie “sees” knowledge.
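In code, “seeing” by similarity boils down to a nearest-neighbor search over stored vectors. A toy in-memory index follows; Stephanie’s embedding store is database-backed, so this is purely illustrative:

```python
import numpy as np

class TinyEmbeddingIndex:
    """Toy cosine-similarity index over document embeddings."""
    def __init__(self):
        self.ids, self.vecs = [], []

    def add(self, doc_id: str, vec: np.ndarray):
        self.ids.append(doc_id)
        self.vecs.append(vec / np.linalg.norm(vec))  # store unit vectors

    def search(self, query: np.ndarray, k: int = 3):
        q = query / np.linalg.norm(query)
        sims = [float(q @ v) for v in self.vecs]
        return sorted(zip(self.ids, sims), key=lambda p: -p[1])[:k]

index = TinyEmbeddingIndex()
index.add("paper-1", np.array([0.9, 0.1, 0.0]))
index.add("paper-2", np.array([0.1, 0.9, 0.2]))
print(index.search(np.array([1.0, 0.0, 0.1])))  # paper-1 ranks first
```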
5. Assign Domains with ScorableClassifier
```python
results = self.domain_classifier.classify(
    text, self.top_k_domains, self.min_classification_score
)
for domain, score in results:
    self.memory.scorable_domains.insert({
        "scorable_id": str(document.id),
        "scorable_type": "document",
        "domain": domain,
        "score": score,
    })
```
📂 New Concept: Domains as Categories
A document isn’t just “about AI”. It might be:
- 80% NLP
- 15% Optimization
- 5% Ethics
The `DomainClassifier` assigns multiple overlapping domains with confidence scores.
This lets Stephanie:
- 🔎 Filter by domain (`"show me all NLP papers"`)
- 🎯 Weight retrieval by relevance
- 🧠 Build domain-specific skill clusters
6. Link to Pipeline Run
```python
self.memory.pipeline_references.insert({
    "pipeline_run_id": pipeline_run_id,
    "target_type": TargetType.DOCUMENT,
    "target_id": doc_id,
    "relation_type": "inserted",
    "source": self.name,
})
```
- Creates a provenance link between the document and the pipeline
- Enables traceability: “Which run loaded this paper?”
- Supports retention policies: “Keep all docs from high-quality runs”
🧩 The Full Ingestion Flow
```mermaid
flowchart TD
    A[📥 Search Result] --> B{Already in DB?}
    B -->|Yes| C[🔁 Link to Run]
    B -->|No| D[⬇️ Download PDF]
    D --> E[📄 Extract Text]
    E --> F["🧠 Generate Embedding<br/>(H-Net)"]
    F --> G["🏷️ Assign Domains<br/>(NLP, ML, etc.)"]
    G --> H[💾 Store Document]
    H --> I[🔗 Link to Pipeline]
    I --> J[🪢 PlanTrace Updated]

    subgraph "Memory Layer"
        F --> K[Embeddings Table]
        G --> L[Domains Table]
        H --> M[Documents Table]
        I --> N[Pipeline References]
    end

    J --> O[✅ Ready for Profiling]
```
🚀 An information machine
The `DocumentLoaderAgent` is where external knowledge becomes internal memory.
It enforces a critical invariant:
Every piece of knowledge must be scorable, embeddable, classifiable, and linkable.
This is what makes CBR possible:
- ✅ You can retrieve documents by similarity
- ✅ You can reuse them in planning
- ✅ You can revise them with feedback
- ✅ You can retain them in a structured way
Without this agent, you’d just have PDFs.
With it, you have a living, growing knowledge base.
👀 Viewing Imported Documents in SIS
One of the challenges in building these pipelines is that so much of the work happens “under the hood.” With SIS (Stephanie Insight System), we make this process visible.
Every document that the `DocumentLoaderAgent` processes is immediately available to explore inside SIS:
- 📋 List View – See all documents loaded by a pipeline run, with ID, title, summary, source, and scorable metadata.
- 🔍 Detail View – Click into a document to see its full text, assigned domains, embedding info, and provenance links back to the pipeline run.
📋 Document List
Figure showing SIS document list view. Each row represents a scorable document: title, summary, and source are shown. From here you can click through to inspect details.
🔍 Document Details
Figure showing the SIS document detail page. On the detail page you see everything the system knows about this document full text, embeddings, assigned domains, and how it links back into the pipeline run. This makes the “invisible ingestion” pipeline completely transparent.
🌐 Easy visual inspection of the process
This is more than a UI convenience. By making imported documents explorable in SIS, we:
- ✅ Verify that ingestion worked (no silent failures).
- ✅ Inspect assigned domains and embeddings.
- ✅ Trace provenance from document → scorable → pipeline run.
- ✅ Build trust: you can see exactly what knowledge Stephanie has taken in.
👉 This closes the loop between code (the agent) and cognition (Stephanie’s growing memory).
📌 Next Step: Profile
Now that we’ve loaded the documents, the next stage is profiling: breaking them into structured sections (abstract, method, results) so they can be reused effectively.
🧱 Step 3: Structure – The DocumentProfilerAgent
Now that we’ve retrieved papers and loaded them into memory, the next challenge is: how do we make them usable?
A PDF is not a plan. A title and abstract are not a case.
To use a document in Case-Based Reasoning (CBR), we need to break it down into structured sections (`abstract`, `methods`, `results`, `contributions`) so we can:
- 🔍 Retrieve by section (“find papers with strong ablation studies”)
- 🧠 Reuse specific parts of reasoning
- 📊 Score each component independently
- 🔄 Revise and refine over time
This is the job of the `DocumentProfilerAgent`.
It doesn’t just “read” a paper: it dissects it, classifies its parts, and prepares it for reuse.
🛠️ The Agent: DocumentProfilerAgent
1. Unstructured Parsing: The First Attempt
```python
unstruct_data = self.section_parser.parse(text)
```
- Uses `unstructured.io` (or similar) to extract sections via layout analysis
- Fast, deterministic, no LLM cost
But as you know, real-world documents are messy:
- Headers misaligned
- Sections merged or missing
- Equations, figures, and footnotes corrupt the flow
So while unstructured parsing is fast, it’s not reliable enough on its own.
2. Fallback to LLM: The Temptation (and Trap)
```python
if self.fallback_to_llm and self.needs_fallback(unstruct_data):
    llm_data = await self.extract_with_prompt(text, context)
```
We could use an LLM to:
- Read the full text
- Output a clean JSON with `abstract`, `methods`, etc.
But in practice, this didn’t work well:
- ❌ Quality Drift – Summaries were often shallow, missing key technical details.
So while LLMs seem like a solution, they introduced more noise than signal.
3. The Right Answer: Trust the Source
Instead of generating summaries, we decided to trust the authors.
✅ arXiv provides a summary using their API.
Why?
- It’s written by experts
- It’s concise and accurate
- It’s already available from the `arxiv_search` stage
This is a key insight:
Don’t re-summarize what’s already well-summarized.
So the `DocumentProfilerAgent`:
- ✅ Keeps the original abstract
- ✅ Uses it as the `summary` field
- ✅ Only falls back to LLM if truly missing
This ensures quality, consistency, and efficiency.
4. Merge & Validate: Best of Both Worlds
```python
chosen = self.merge_outputs(unstruct_data, llm_data)
```
When fallback is needed, we don’t just pick the LLM version: we compare quality.
The `merge_outputs` method:
- Checks length
- Evaluates content quality (sentence coherence, word complexity)
- Picks the best version, not just the longest
```python
def evaluate_content_quality(self, text: str) -> float:
    sentences = text.split(".")
    avg_word_len = sum(len(word) for word in text.split()) / len(text.split())
    sentence_score = len([s for s in sentences if len(s.strip()) > 20]) / max(1, len(sentences))
    return 0.4 * min(1.0, len(text) / 500) + 0.4 * sentence_score + 0.2 * min(1.0, avg_word_len / 8)
```
This heuristic scorer acts as a gatekeeper, ensuring only high-quality content enters memory.
🏷️ The Domain Classifier: Categorizing Knowledge
Once sections are extracted, the next step is classification.
```python
section_domains = self.domain_classifier.classify(text, top_k=3, min_score=0.6)
```
🔍 New Concept: What’s a `Domain`?
A `Domain` is a category of knowledge, like `NLP`, `Computer Vision`, `Ethics`, or `Robotics`.
But unlike a single label, the `DomainClassifier` assigns multiple domains with confidence scores:
```json
[
  {"domain": "NLP", "score": 0.85},
  {"domain": "Reasoning", "score": 0.72},
  {"domain": "Efficiency", "score": 0.61}
]
```
This lets Stephanie:
- 🔎 Filter by domain: “Show me all reasoning papers”
- 🎯 Weight retrieval: “Prioritize high-NLP papers”
- 🧠 Build domain-specific skills: a `reasoning_agent` that only uses high-scoring reasoning traces
The `DomainClassifier` is trained on seed examples (in `config/domain/seeds.yaml`) and uses few-shot classification to generalize.
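Here is a plausible sketch of that seed-based idea: embed the text, compare it against each domain’s seed embeddings, and keep the top-k matches above a threshold. The seed format and scoring rule are assumptions, not the classifier’s actual implementation:

```python
import numpy as np

def classify(doc_emb: np.ndarray, domain_seeds: dict, top_k: int = 3, min_score: float = 0.6):
    """domain_seeds maps a domain name to a list of seed-example embeddings."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    scores = {
        domain: max(cos(doc_emb, s) for s in seeds)
        for domain, seeds in domain_seeds.items()
    }
    ranked = sorted(scores.items(), key=lambda kv: -kv[1])
    return [(d, round(s, 2)) for d, s in ranked[:top_k] if s >= min_score]

seeds = {
    "NLP": [np.array([1.0, 0.1])],
    "Robotics": [np.array([0.0, 1.0])],
}
print(classify(np.array([0.9, 0.3]), seeds))  # [('NLP', 0.98)]
```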
📌 Note: At this stage, embeddings are still whole-document (title + text). Section-level embeddings will come later, enabling even finer retrieval like “find all method sections in robotics papers.”
🧩 The Full Profiling Flow
```mermaid
flowchart TD
    A[📄 Raw PDF Text] --> B{Use Unstructured?}
    B -->|Yes| C["🔍 Parse Sections<br/>(abstract, methods, etc.)"]
    B -->|No| D["🧠 LLM Extraction (fallback)"]
    C --> E{Meets Quality?}
    D --> E
    E -->|No| F[🔁 Merge & Score]
    E -->|Yes| G[✅ Accept]
    F --> G
    G --> H[🏷️ Assign Domains]
    H --> I[💾 Store in DB]
    I --> J[🪢 Promote Sections to Scorables]

    subgraph "Memory Layer"
        H --> K[DocumentSectionDomains]
        I --> L[DocumentSection Table]
        J --> M[Embeddings + Scores]
    end
```
🚀 Tuning information
The `DocumentProfilerAgent` is where raw information becomes structured intelligence.
It enforces a critical principle:
Don’t generate when you can extract. Don’t guess when you can classify.
By combining:
- ✅ Unstructured parsing (fast, deterministic)
- ✅ arXiv abstracts (trusted, high-quality)
- ✅ Domain classification (multi-label, scored)
- ✅ Quality heuristics (no blind LLM trust)
You’ve built a robust, scalable, and reliable way to turn papers into reusable cases.
In SIS, you’ll soon be able to browse these profiled documents directly: titles, abstracts, and assigned domains side by side.
This is what makes CBR possible:
- ✅ You can retrieve by section + domain
- ✅ You can reuse the `methods` of a past paper
- ✅ You can revise a plan using high-scoring `results`
- ✅ You can retain the best knowledge
And most importantly:
This discipline is what makes self-improvement possible: Stephanie builds on what’s real, not what’s imagined.
👀 Seeing the Structure in SIS
It’s one thing to parse and classify a paper into sections; it’s another to see it come alive.
That’s where SIS (Stephanie Insight System) steps in. Every profiled document can now be explored through an interactive profile view:
- 📑 Each section (`Abstract`, `Method`, `Results`, etc.) is shown in its own tab
- 🏷️ Domains are highlighted as badges on top of each section
- 📝 Summaries (when available) sit alongside the raw section text
- 🔗 The full provenance (which run loaded this paper, where it came from) is preserved
Here’s what it looks like in SIS:
Figure: Document Profile in SIS. Notice the tabs per section and the domains attached to each section.
With this, a paper stops being a flat PDF and becomes a scored, embeddable, retrievable knowledge object, one Stephanie can actually reason over.
🧪 Step 4: Evaluate – The PaperScoreAgent and the Multi-Dimensional Judgment of Knowledge
Now that we’ve retrieved, loaded, and structured research papers, the next step is to judge their quality.
Not just “is this paper on-topic?” but:
- Is it novel?
- Is it clear?
- Can we implement its ideas?
- Does it align with our goal?
This is the job of the `PaperScoreAgent`.
It doesn’t apply a single score: it orchestrates a jury of specialized scorers:
- ✅ MRQ → measures relevance
- ✅ SICQL → evaluates clarity & reasoning
- ✅ EBT → scores evidence quality
- ✅ SVM → detects duplication
- ✅ HRM → judges epistemic quality
- ✅ Contrastive Ranker → compares papers relative to each other
Together, they form a multi-dimensional scoring ensemble: a cognitive jury that evaluates each paper from multiple angles.
⚖️ Multi-Dimensional Judgment
```python
self.dimensions = [
    "novelty",
    "clarity",
    "relevance",
    "implementability",
    "alignment",
]
```
These dimensions become decision axes:
- Relevance → does this help the current goal?
- Clarity → can we follow it without guessing?
- Implementability → can we turn it into code or a plan?
- Novelty → is it new, or already solved elsewhere?
- Alignment → does it move us toward our long-term vision?
This is not binary success/failure: it’s graded, structured judgment.
2. Ensemble of Scorers
```python
self.scorers = {
    "mrq": MRQScorer(...),
    "sicql": SICQLScorer(...),
    "ebt": EBTScorer(...),
    "hrm": HRMScorer(...),
    "svm": SVMScorer(...),
    "contrastive_ranker": ContrastiveRankerScorer(...),
}
```
Each scorer contributes to one or more dimensions:
Scorer | Dimensions It Scores |
---|---|
MRQ | Relevance |
SICQL | Clarity, Reasoning |
EBT | Evidence, Support |
HRM | Epistemic Quality, Truth-tracking |
SVM | Novelty, Duplication |
Contrastive Ranker | Relative Quality, Preference |
This ensemble approach ensures that no single scorer dominates and that weaknesses in one are compensated by others.
🧩 From Scores to Meta-Scores (MARS)
After scoring, Stephanie applies MARS: the Multi-dimensional Analysis of Reasoning and Scoring.
MARS isn’t a scorer. It’s a meta-analyzer that:
- Aggregates across all papers
- Detects scoring bias (e.g. one scorer is too harsh)
- Identifies high-leverage papers (novel and implementable)
- Recommends which scorers to trust more next time
This is how Stephanie learns to score better over time: not just the papers, but the judges themselves.
🧩 The Full Scoring Flow
```mermaid
flowchart TD
    A[📄 Structured Paper] --> B{Apply Scorers}
    B --> C[MRQ: Relevance]
    B --> D[SICQL: Clarity]
    B --> E[EBT: Evidence]
    B --> F[HRM: Epistemic Quality]
    B --> G[SVM: Novelty]
    B --> H[Contrastive: Preference]
    C --> I[ScoreBundle]
    D --> I
    E --> I
    F --> I
    G --> I
    H --> I
    I --> J[MARS Analysis]
    J --> K[Ranked, Weighted, Trusted Scores]
    K --> L[🪢 PlanTrace Updated]

    subgraph "Memory Layer"
        I --> M[ScoreBundle Table]
        K --> N[MARS Recommendations]
    end
```
🚀 Why This Matters for CBR
In Case-Based Reasoning, cases are only useful if they’re good cases.
The `PaperScoreAgent` ensures that:
- ✅ High-quality cases rise to the top
- ✅ Retrieval ranks by multi-dimensional quality, not just similarity
- ✅ Stephanie continuously refines its own judges
Without it, CBR would be garbage in, garbage out. With it, Stephanie builds a curated library of strong, reusable cases.
🧠 Step 5: Judge – The HRM Scorer and the Quality of Reasoning
But there’s a deeper question than “is the paper good?”:
Was the reasoning itself good?
Not “did it get the right answer?” but “was the thinking behind it sound?”
This is where the Hierarchical Reasoning Model (HRM) comes in. HRM doesn’t just score knowledge: it thinks about thinking.
🔍 Two Flavors of HRM
- `HRMScorer` → scores `(goal, document)` pairs: is this paper well-reasoned?
- `EpistemicPlanHRMScorer` → scores `PlanTraces`: was this reasoning process itself good?
This dual design means Stephanie can judge both:
- The knowledge she consumes (documents)
- The reasoning she performs (traces)
⚙️ How HRM Thinks
HRM is built as a hierarchical loop of cognition:
- 🔹 Low-Level RNN (L) → models fine-grained reasoning steps
- 🔹 High-Level RNN (H) → models strategic updates over time
Together, they simulate multi-step thinking and produce an epistemic quality score.
```mermaid
flowchart TD
    A[Input: Goal + Trace] --> B[Embed & Encode]
    B --> C[Low-Level RNN: step-by-step]
    B --> D[High-Level RNN: strategy updates]
    C --> E[zL_final]
    D --> F[zH_final]
    E & F --> G["Epistemic<br/>Quality Score (0–1)"]
```
Why this matters:
- ✅ Rewards structured, coherent reasoning
- ✅ Penalizes shallow pattern-matching
- ✅ Exposes introspectable states (`zL`, `zH`) → a window into the “shape” of thought
📊 Why HRM Fits CBR
Traditional CBR says: reuse any past case that worked. But brittle success can hide bad reasoning.
HRM flips this:
Only reuse cases with high-quality reasoning.
That makes CBR self-correcting: Stephanie learns not just from results, but from good thinking itself.
🚀 Why HRM is Critical for Self-Improvement
- Policy Learning (GILD): HRM scores train planning policies on good reasoning.
- MCTS Guidance: HRM scores act as a compass when exploring.
- Retention: Bad reasoning traces are pruned, even if they had “right” answers.
- Skill Acquisition: New skills are learned only from high-HRM traces.
♻️ Step 6: Reuse – The PlannerReuseAgent and Learning from the Past
Now that Stephanie can judge knowledge (PaperScoreAgent) and judge reasoning (HRM), the next step is to reuse the best cases.
This is the job of the PlannerReuseAgent
.
It works like this:
- Retrieve → find past PlanTraces most similar to the current goal
- Filter → select only those with high HRM scores
- Adapt → use LLMs to reshape past plans into a new solution
- Link → record parent-child reuse links, so lineage is never lost
```mermaid
flowchart LR
    A[🎯 Current Goal] --> B[Retrieve Past PlanTraces]
    B --> C[Filter: High HRM Quality]
    C --> D[Adapt with LLM]
    D --> E[New Plan Trace]
    E --> F[🔗 Record Reuse Link]
```
This closes the CBR loop:
- Retrieve → good past cases
- Reuse → adapt them to new goals
- Revise → update with feedback
- Retain → store as new cases for the future
♻️ Step 6, Continued: The PlannerReuseAgent and the Memory Graph
Now that we’ve retrieved, structured, and scored past reasoning traces, the next step is reuse.
This is where Case-Based Reasoning (CBR) becomes real: Stephanie doesn’t generate a plan from scratch; she adapts one from her own history.
This is the job of the `PlannerReuseAgent`.
It’s not just a planner: it’s a reasoning archaeologist, digging through past `PlanTrace`s to find the best strategies for the current task.
🛠️ The Agent: PlannerReuseAgent
Let’s walk through how it works and why it’s more than just “RAG for plans”.
1. Retrieve Candidate Traces
```python
all_traces = self.memory.plan_traces.get_all(limit=500)
```
- Fetches up to 500 past `PlanTrace`s
- Converts each into a `Scorable` (for embedding)
🔍 Why This Matters
Every `PlanTrace` is treated as a first-class knowledge object: embeddable, scoreable, reusable.
2. Embed & Rank with ScorableRanker
```python
ranked = self.ranker.rank(query=query_scorable, candidates=candidates, context=context)
```
The `ScorableRanker` computes a composite score using four signals:
Signal | Purpose | Formula |
---|---|---|
Similarity | Semantic match to current goal | cosine_sim(goal_emb, trace_emb) |
Reward | Did the trace succeed? | avg_score from EvaluationORM |
Recency | Penalize outdated traces | exp(-age_sec / (30*24*60*60)) |
Adaptability | Has this trace been reused successfully? | reuse_count / attempt_count |
```python
rank_score = sum(components[k] * self.weights.get(k, 0) for k in components)
```
✅ Weighted Combo
Default: `similarity=0.4, reward=0.3, recency=0.2, adaptability=0.1`
Configurable per task or domain.
This is not just retrieval: it’s value estimation.
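Putting the table and weights together, here is a hedged sketch of the composite score. In practice the component values come from the embedding store and evaluation records; the trace fields below are assumptions for illustration:

```python
import math
import time

WEIGHTS = {"similarity": 0.4, "reward": 0.3, "recency": 0.2, "adaptability": 0.1}

def rank_score(trace: dict, goal_sim: float) -> float:
    age_sec = time.time() - trace["created_at"]
    components = {
        "similarity": goal_sim,                               # cosine similarity in [0, 1]
        "reward": trace["avg_score"],                         # past success signal
        "recency": math.exp(-age_sec / (30 * 24 * 60 * 60)),  # decays on a ~30-day scale
        "adaptability": trace["reuse_count"] / max(1, trace["attempt_count"]),
    }
    return sum(components[k] * WEIGHTS[k] for k in components)

week_old = {"created_at": time.time() - 7 * 24 * 3600,
            "avg_score": 0.8, "reuse_count": 3, "attempt_count": 4}
print(f"{rank_score(week_old, goal_sim=0.9):.3f}")  # ≈ 0.83
```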
3. Adapt Top Traces into a New Plan
```python
prompt = self.prompt_loader.load_prompt(self.cfg, context)
response = self.call_llm(prompt, context=merged)
parsed = self._extract_plan_from_response(response)
```
The LLM is given:
- The current goal
- The top-K ranked traces (goal, plan, rank_score)
And asked to adapt them into a new plan.
The prompt includes:
```
## Rationale: Why these cases?
## Score: 8/10
## Plan:
1. Search for relevant papers
2. Extract key claims
3. Compare with current goal
```
This ensures the new plan is not a copy, but a synthesis.
4. Record Reuse Links
```python
self.memory.plan_traces.add_reuse_link(
    parent_trace_id=parent_trace_id,
    child_trace_id=new_trace_id,
)
```
This is critical.
It creates a directed link from the new trace (child) to the past traces (parents) that inspired it.
These links form a memory graph: a lineage of reasoning that shows how knowledge evolves.
🧩 The Reuse Flow
```mermaid
flowchart TD
    A[🎯 Current Goal] --> B[📚 All PlanTraces]
    B --> C[🧠 Embed with H-Net]
    C --> D[📊 Rank: Similarity + Reward + Recency + Adaptability]
    D --> E[🔝 Top-K Traces]
    E --> F[🔄 LLM Adapts into New Plan]
    F --> G[🔗 Record Reuse Links]
    G --> H[🪢 PlanTrace Updated]

    subgraph "Memory Layer"
        D --> I[ScorableRanker Weights]
        G --> J[Reuse Links Table]
        H --> K[PlanTrace Store]
    end
```
🌐 The Memory Graph
Every reuse link turns your `PlanTrace` store into a graph of reasoning evolution.
```mermaid
graph TD
    A[Trace 1: Summarize Memento] --> B[Trace 2: Explain Memento]
    B --> C[Trace 3: Compare Memento vs RAG]
    C --> D[Trace 4: Implement CBR in Stephanie]
    D --> E[Trace 5: Optimize VPM Retrieval]
```
This is not just logging: it’s knowledge lineage.
You can:
- 🔍 Find all traces derived from a breakthrough idea
- 📈 Measure how reasoning quality improves over generations
- 🛠️ Debug failures by tracing back to root causes
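Finding every trace derived from a given idea, as in the first bullet above, is a plain graph walk over reuse links. A toy version follows; the real links live in the `plan_traces` store:

```python
from collections import defaultdict

# Hypothetical (parent, child) reuse links
reuse_links = [("t1", "t2"), ("t2", "t3"), ("t3", "t4"), ("t4", "t5")]

children = defaultdict(list)
for parent, child in reuse_links:
    children[parent].append(child)

def descendants(trace_id: str) -> list:
    """All traces that were (transitively) adapted from this one."""
    out, stack = [], list(children[trace_id])
    while stack:
        node = stack.pop()
        out.append(node)
        stack.extend(children[node])
    return out

print(descendants("t1"))  # ['t2', 't3', 't4', 't5'] — every idea built on Trace 1
```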
🚀 Why This Closes the CBR Loop
The `PlannerReuseAgent` is where memory becomes intelligence.
It ensures that:
- ✅ Only high-quality, recent, adaptable traces are reused
- ✅ New plans are synthesized, not copied
- ✅ The provenance of every decision is recorded
- ✅ The memory graph grows with every task
This is not just reuse: it’s self-improvement.
🧠 Step 7: The PlanTraceMonitor – Memento’s Case Bank in Action
Now that we’ve seen how Stephanie retrieves, structures, scores, and reuses knowledge, we need to look at the core infrastructure that makes it all possible: the `PlanTraceMonitor`.
This isn’t just a logger.
It’s Memento’s Case Bank brought to life.
📄 What Memento Says
In the paper, the Case Bank is where all reasoning episodes are stored:
“Each case is a (state, plan, reward) tuple, stored for later retrieval and reuse.”
It’s the memory of the agent the foundation of self-improvement.
But the paper leaves implementation open.
We’ve filled that gap and gone further.
🛠️ Key Features of the PlanTraceMonitor
1. Automatic Trace Creation
```python
def start_pipeline(self, context: Dict, pipeline_run_id: str):
    self.current_plan_trace = PlanTrace(
        trace_id=str(pipeline_run_id),
        goal_text=context.get("goal", {}).get("goal_text", ""),
        plan_signature=self._generate_plan_signature(context),
        input_data=self._extract_input_data(context),
        execution_steps=[],
        meta={"started_at": time.time()},
    )
```
- Every pipeline run becomes a `PlanTrace`
- No manual logging: it’s automatic, universal, and consistent
2. Step-by-Step Monitoring
```python
def start_stage(self, stage_name: str, context: Dict, stage_idx: int):
    step = ExecutionStep(
        step_id=f"{trace_id}_step_{stage_idx + 1}",
        step_type=stage_name,
        input_text=input_preview,
        output_text="",
        start_time=time.time(),
    )
    self.current_plan_trace.execution_steps.append(step)
```
- Every stage is logged as an `ExecutionStep`
- Duration, errors, and outputs are captured
3. Scorable-Aware Output Capture
```python
def complete_stage(self, stage_name: str, context: Dict, stage_idx: int):
    agent_obj = context.get("agent_obj")
    if agent_obj and getattr(agent_obj, "is_scorable", False):
        details = getattr(agent_obj, "scorable_details", {})
        step.input_text = details.get("input_text", "")
        step.output_text = details.get("output_text", "")
```
- Only agents that are truly scorable (LLM, reasoning, summarization) populate `input_text`/`output_text`
- Others get a breadcrumb (context keys), not a fake output
- This ensures only meaningful steps are scored
4. Scoring & Feedback Loop
```python
async def score_pipeline(self, context: Dict):
    scoring_context = {
        "plan_traces": [self.current_plan_trace],
        "goal": context.get("goal", {}),
    }
    scored_context = await self.plan_trace_scorer.run(scoring_context)
    self.current_plan_trace.pipeline_score = scored_context.get("pipeline_score", {})
    self.current_plan_trace.mars_analysis = scored_context.get("mars_analysis", {})
```
- Uses HRM, SICQL, MRQ, EBT to assign epistemic quality
- Enables value estimation for future retrieval
5. Retention Policies
```python
def apply_retention_policy(self):
    if self.retention_policy == "discard_failed":
        failed = self.memory.plan_traces.get_failed()
        for t in failed:
            self.memory.plan_traces.delete(t.trace_id)
```
- `keep_all`: grow forever
- `keep_top_k`: retain only the best
- `discard_failed`: prune the useless
This ensures the memory grows in quality, not just quantity.
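The snippet above shows `discard_failed`; here is a hedged sketch of what `keep_top_k` might look like, assuming traces expose a numeric score (the field names are illustrative):

```python
def keep_top_k(traces: list, k: int = 50) -> list:
    """Return the trace IDs to delete, keeping only the k best by score."""
    ranked = sorted(traces, key=lambda t: t.get("score", 0.0), reverse=True)
    return [t["trace_id"] for t in ranked[k:]]

traces = [{"trace_id": f"t{i}", "score": i / 10} for i in range(6)]
print(keep_top_k(traces, k=3))  # ['t2', 't1', 't0'] — the three weakest go
```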
6. Reuse Links (Memory Graph)
```python
def link_reuse(self, parent_trace_id: str, child_trace_id: str):
    self.memory.plan_traces.add_reuse_link(parent_trace_id, child_trace_id)
```
- Every time a new plan adapts from a past trace, a directed link is recorded
- This creates a memory graph: a lineage of reasoning
```mermaid
graph TD
    A[Trace 1: Summarize Paper] --> B[Trace 2: Explain Method]
    B --> C[Trace 3: Compare to Memento]
    C --> D[Trace 4: Implement CBR]
```
This is not just logging: it’s knowledge evolution.