Thoughts of Algorithms


How a self-evolving AI learns to reflect, score, and rewrite its own reasoning

🧪 Summary

What if an AI could think: not just solve problems, but reevaluate its beliefs in the face of new information?

In this post, we introduce a system that does exactly that. At the core of our pipeline is a lightweight scoring model called MR.Q, responsible for evaluating ideas and choosing the best ones. But when it encounters a new domain, a new goal, or a shift in task format, it doesn’t freeze; it adapts.

MR.Q watches how trusted sources (like large language models) evaluate new hypotheses. Then it dynamically trains a local regression model to realign its scoring: not tomorrow, not during a retrain cycle, but right now. It takes a few samples, tunes itself, and continues, now aligned to the latest reasoning behavior.

This is more than prompt chaining. It’s more than symbolic control. This is a system generating thoughts about thoughts and then updating its judgment accordingly.

That’s what we mean when we say: this might be the first thinking AI.

In the rest of this post, we’ll unpack how this system works and how it brings us one step closer to AI that actually thinks.


🪞 The Configurable Reflection Engine: Multi-Dimensional Scoring

Before an AI can think, it needs to reflect. And reflection starts with scoring its own thoughts: not just whether something is good, but in what ways it’s good or bad. That’s where our multi-dimensional scoring system comes in.

This system doesn’t rely on a fixed rubric. It’s fully configurable: you define the dimensions that matter for your domain. We often score across dimensions like:

  • Correctness
  • Clarity
  • Originality
  • Relevance
  • Depth
  • Specificity

…but that’s just the beginning. You can define as many dimensions as you like; we’ve run experiments with 6, 12, even more. The scoring engine adapts automatically. These aren’t arbitrary tags; each dimension is paired with a natural language rubric, which guides either an LLM or our internal MR.Q scorer to assign a structured score to every output.

The result is a rich quality profile for every hypothesis. Instead of reducing everything to a single number, we let the system see itself from multiple angles. This forms the core knowledge base that powers self-improvement: a deep memory of how different outputs performed, and why.

In a sense, these multi-dimensional scores are the AI’s first internal thoughts: not just what it said, but how it felt about what it said. That reflection is what makes all the later thinking possible.

# config/scoring/pipeline_judge.yaml
dimensions:
  - name: correctness
    file: correctness
    weight: 1.2
    extra_data: { parser: numeric }

  - name: feasibility
    file: feasibility
    weight: 1.1
    extra_data: { parser: numeric }

  - name: insightfulness
    file: insightfulness
    weight: 1.3
    extra_data: { parser: numeric }

  - name: alignment
    file: alignment
    weight: 1.0
    extra_data: { parser: numeric }

  - name: clarity
    file: clarity
    weight: 1.1
    extra_data: { parser: numeric }

We covered this process in detail in Dimensions of Thought: A Smarter Way to Evaluate AI.

We then extended it to apply to documents here: Document Intelligence: Turning Documents into Structured Knowledge

In this post we extend it slightly by determining how to measure the importance of each dimension.

🧠 Learning What Matters: Contrastive Dimensional Tuning

Most scoring systems ask: “Is this good or bad?” Our system asks:

  • “How many different views should we take on this problem?”
  • “Can you give me a score out of 100 for this conclusion?”
  • “Can you give me a rationale for this score?”
  • “What other algorithms, agents, prompts, models… have I got to score here?”
  • “Why is this one better than that one?”

That’s the core idea behind contrastive dimensional tuning: instead of relying on absolute scores or manual weights, we let the system learn which dimensions actually distinguish strong outputs from weak ones.

⚙️ How It Works

Our ContrastiveDimensionalTuner takes in:

  • Pairs of examples: A and B
  • Multi-dimensional scores for each (correctness, clarity, originality, etc.)
  • A label for which one is better (the “preferred” example)

It then computes the difference in scores across dimensions and uses contrastive learning (via logistic regression) to learn which dimensions consistently matter. This produces a set of dimension weights that can be used to re-rank or optimize future outputs.

📦 Why It’s Powerful

  • It’s model-agnostic: You can train it on outputs from any agent or system.
  • It’s scalable: Works with datasets from Hugging Face or internal logs.
  • It’s realistic: You don’t need a perfect score, just enough contrast to learn from.
# Example usage:
tuner = ContrastiveDimensionalTuner(dimensions=["correctness", "clarity", "originality"])

# Add training data
tuner.add_training_pair(
    scores_a={"correctness": 0.9, "clarity": 0.8, "originality": 0.6},
    scores_b={"correctness": 0.7, "clarity": 0.9, "originality": 0.5},
    preferred="A"
)

tuner.train()

# Use learned weights
print(tuner.get_weights())

🔬 What It Means in Practice

You can now tune your system for scientific rigor, creative writing, pedagogical value, or whatever matters most for your domain, just by changing what training data you use. You’re not locked into a one-size-fits-all rubric. The system learns from your data, your preferences, your task.

This moves us beyond rigid evaluation and into the realm of adaptive, self-aware scoring.

    
graph TD
    A[Goal Text] --> D1
    B[Hypothesis Text] --> D1

    subgraph Input Processing
        D1[Embedding & Feature Extraction]
    end

    D1 --> MRQ[MRQ Scorer]
    D1 --> SVM[SVM Scorer]
    D1 --> LLM[LLM Scorer]

    MRQ --> C1[Correctness Score]
    SVM --> C1
    LLM --> C1

    MRQ --> C2[Clarity Score]
    SVM --> C2
    LLM --> C2

    MRQ --> C3[Originality Score]
    SVM --> C3
    LLM --> C3

    C1 --> Tuner[ContrastiveDimensionalTuner]
    C2 --> Tuner
    C3 --> Tuner

    subgraph Meta Aggregation
        Tuner --> W[Weighted Score Output]
    end

    style A fill:#f9f,stroke:#333,stroke-width:1px
    style B fill:#f9f,stroke:#333,stroke-width:1px
    style D1 fill:#bbf,stroke:#333,stroke-width:1px
    style MRQ fill:#ffc,stroke:#333,stroke-width:1px
    style SVM fill:#ffc,stroke:#333,stroke-width:1px
    style LLM fill:#ffc,stroke:#333,stroke-width:1px
    style C1 fill:#cfc,stroke:#333,stroke-width:1px
    style C2 fill:#cfc,stroke:#333,stroke-width:1px
    style C3 fill:#cfc,stroke:#333,stroke-width:1px
    style Tuner fill:#ccf,stroke:#333,stroke-width:1px
    style W fill:#fcf,stroke:#333,stroke-width:2px
  

🛠️ Code: ContrastiveDimensionalTuner learning to weight scoring dimensions

The ContrastiveDimensionalTuner is our solution for learning how to weigh scoring dimensions (like correctness, clarity, originality) automatically. Instead of hardcoding weights or relying on human judgment, this component learns from preference pairs, just like a reward model, but using interpretable dimensions and contrastive logic.

import numpy as np
from sklearn.linear_model import LogisticRegression


class ContrastiveDimensionalTuner:
    """
    Learns weights for each scoring dimension using contrastive learning.
    Given pairs of scored examples (A vs B) and a preference, it learns which dimensions matter most.
    """

    def __init__(self, dimensions, logger=None):
        """
        Args:
            dimensions (list of str): List of dimension names (e.g., ["correctness", "clarity"]).
            logger (optional): Optional logger to record training events.
        """
        self.dimensions = dimensions
        self.logger = logger
        self.X = []  # Feature differences (vector of deltas across dimensions)
        self.y = []  # Labels: 1 for the preferred-direction delta, 0 for its mirrored negative
        self.model = None

    def add_training_pair(self, scores_a: dict, scores_b: dict, preferred: str):
        """
        Adds a training example.

        Args:
            scores_a (dict): Scores for option A, keyed by dimension.
            scores_b (dict): Scores for option B, keyed by dimension.
            preferred (str): "A" or "B", indicating which output was preferred.
        """
        delta = np.array([
            scores_a[dim] - scores_b[dim] for dim in self.dimensions
        ])

        # Orient the delta so it points from the losing output toward the
        # preferred one, then store it with label 1.
        if preferred.upper() == "B":
            delta = -delta

        self.X.append(delta)
        self.y.append(1)

        # Also store the mirrored example with label 0 so the logistic
        # regression always sees both classes (it cannot fit a single class).
        self.X.append(-delta)
        self.y.append(0)

        if self.logger:
            self.logger.log("ContrastiveTrainingPairAdded", {
                "delta": delta.tolist(),
                "preferred": preferred
            })

    def train(self):
        """
        Trains a logistic regression model using the current contrastive data.
        """
        if len(self.X) < 3:
            if self.logger:
                self.logger.log("ContrastiveTrainingSkipped", {
                    "reason": "Not enough data",
                    "num_examples": len(self.X)
                })
            return

        X_array = np.array(self.X)
        y_array = np.array(self.y)

        self.model = LogisticRegression()
        self.model.fit(X_array, y_array)

        if self.logger:
            self.logger.log("ContrastiveModelTrained", {
                "coefficients": self.get_weights()
            })

    def get_weights(self) -> dict:
        """
        Returns the learned dimension weights (if trained).

        Returns:
            dict: Mapping from dimension to learned weight.
        """
        if self.model is None:
            return {dim: 1.0 for dim in self.dimensions}  # fallback: equal weights

        weights = self.model.coef_[0]
        return {
            dim: round(float(w), 4) for dim, w in zip(self.dimensions, weights)
        }

    def score(self, dimension_scores: dict) -> float:
        """
        Calculates a single weighted score from per-dimension scores.

        Args:
            dimension_scores (dict): Scores keyed by dimension.

        Returns:
            float: Weighted total score.
        """
        weights = self.get_weights()
        total = sum(dimension_scores[dim] * weights.get(dim, 1.0) for dim in self.dimensions)
        return round(total, 4)

⚙️ How It Works

The tuner learns through contrastive examples: comparisons where one output is preferred over another:

  • Training Inputs: Each input is a pair of outputs (A and B) with known per-dimension scores. A label indicates which output was preferred.
  • Feature Vector: The tuner computes a vector of score differences between A and B across all dimensions. If B is preferred, the difference is inverted.
  • Training: These difference vectors become input to a logistic regression model, which learns which dimensions most strongly predict preference.
  • Scoring: Once trained, the model produces learned weights per dimension. When scoring a new output, it calculates a weighted sum of its dimension scores to produce a final, preference-aligned score, as shown in the snippet below.
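
Continuing the usage example from earlier, the learned weights can then be applied directly; the score values here are illustrative.

# Collapse per-dimension scores for a new output into a single,
# preference-aligned number using the learned weights.
final = tuner.score({"correctness": 0.85, "clarity": 0.70, "originality": 0.90})
print(final)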

🎹 Insight 1: Intelligence Emerges Through Judgment - The Piano Teacher Analogy

A person who can’t play piano can still teach someone else, as long as they know what sounds better. In the same way, our system doesn’t need to generate perfect answers up front. It just needs to recognize what improves performance, and then self-select and amplify that behavior.

Even the largest, smartest LLMs often fail at first attempts. But here’s the twist: they can dramatically improve by reviewing and comparing their own outputs.

This isn’t speculation; it’s backed by a wave of recent papers:

  • Self-Refine (2023): Showed that LLMs can boost performance by comparing their initial outputs and rewriting them using internal feedback.
  • ReAct, Reflexion, and ReST: Showed that agents using self-judgment loops outperformed those that just generated and moved on.
  • Auto-CoT and DPO-style preference training: Reinforced that ranking beats raw generation when it comes to learning high-quality reasoning.

So what’s the real insight?

A model doesn’t need to generate the best answer; it just needs to recognize which one is better. And with enough of those comparisons, it learns how to steer itself.

🚀 How We Use This in Our System

That’s exactly what our system does:

  • It doesn’t try to guess the best rule, prompt, or pipeline on the first shot.
  • Instead, it generates multiple versions, scores them against each other, and reinforces the better ones.
  • These scores train our internal critic MR.Q, a fast, memory-efficient regression model that learns from live feedback.

This is the heart of how intelligence emerges in our system: 👉 Through judgment, not generation. 👉 Through comparison, not perfection. 👉 Through learning what works even if it stumbles along the way.

⚖️ How MR.Q Learns from Preferences (DPO-Style)

So how does our system actually learn from judgment?

At the core of MR.Q is a simple, powerful loop: compare two outputs, prefer the better one, and use that contrast to train a regressor that gets better over time.

Here’s how it works in practice:

Example 1: Answer Quality

| Prompt | Output A | Output B | Chosen |
|---|---|---|---|
| “Why is the sky blue?” | “Because of the atmosphere.” | “Due to Rayleigh scattering of sunlight by air molecules.” | ✅ B |

MR.Q stores this as: → Same prompt, but B > A, so favor the features of B in future decisions.

Example 2: Clarity and Specificity

| Prompt | Output A | Output B | Chosen |
|---|---|---|---|
| “Explain how solar panels work.” | “They use light to make electricity.” | “Photons hit semiconductors, exciting electrons into a current.” | ✅ B |

Again: B is clearer and more specific → MR.Q learns the embedded difference.

Example 3: Creativity Preference

| Prompt | Output A | Output B | Chosen |
|---|---|---|---|
| “Suggest a new product idea.” | “Smart mirror for workouts.” | “Modular AI-powered desk that reshapes with your work style.” | ✅ B |

MR.Q generalizes that creative, multi-featured responses tend to be preferred → it weights such features more in scoring future generations.

These contrastive judgments are fed into our MR.Q regressor, which uses embedding distances and historical preferences to shape an evolving reward model. Over time, MR.Q becomes a fast, lightweight critic that reflects what our LLM would say without needing to call the LLM every time.
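
To make that training loop concrete, here is a minimal, generic sketch of pairwise preference learning over embeddings. It is not the project’s MRQTrainer: the 512-dimensional embeddings, the tiny value head, and the random tensors standing in for real prompt and response vectors are all illustrative.

import torch
import torch.nn as nn

# A tiny value head that scores a (prompt, response) embedding pair.
value_head = nn.Sequential(nn.Linear(2 * 512, 256), nn.ReLU(), nn.Linear(256, 1))
optimizer = torch.optim.Adam(value_head.parameters(), lr=1e-3)

def train_step(prompt_emb, chosen_emb, rejected_emb):
    # Score both candidate responses against the same prompt.
    v_chosen = value_head(torch.cat([prompt_emb, chosen_emb], dim=-1))
    v_rejected = value_head(torch.cat([prompt_emb, rejected_emb], dim=-1))
    # DPO-style pairwise objective: push the preferred output's score above the other's.
    loss = -torch.nn.functional.logsigmoid(v_chosen - v_rejected).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Random embeddings stand in for real prompt/response vectors; B is the preferred output.
p, a, b = (torch.randn(1, 512) for _ in range(3))
print(train_step(p, b, a))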


🌀 Insight 2: The Drunken Man and the Pretty Girl

Intelligence isn’t precision; it’s desire, feedback, and adaptation.

Imagine a drunken man at a party. Across the room, he sees a beautiful woman, someone he really wants to talk to. But he’s unsteady. His steps wobble left, then right. He bumps into a chair. He adjusts. He’s off course again. But through it all, he’s guided by a single, unwavering thing: his desire to get closer.

That’s how our AI system learns.

It doesn’t need to be perfectly calibrated from the start. It just needs:

  • A goal worth reaching,
  • A feedback signal telling it whether it’s getting warmer or colder,
  • And the capacity to course-correct.

👣 Learning by Stumbling Forward

Every symbolic rule, prompt variation, and pipeline configuration is like one of the drunken man’s steps. Most aren’t perfect. Some are downright wrong. But they aren’t wasted, because each one is scored, judged, and remembered. And with that feedback, the next step is just a little more aligned.

Over time, this process leads to surprising results:

  • The system improves without supervision.
  • It refines symbolic behaviors based on prior successes.
  • It trains scoring models like MR.Q to reflect what it learns to desire.

All without needing to know the exact path ahead.

It doesn’t need to be sober. It just needs to remember what works and want to get closer.

This isn’t just a cute analogy. It’s a design philosophy:

  • Every mutation is a step.
  • Every score is a clue.
  • Every stumble is progress.

The pretty girl is intelligence itself, and our system, drunk or not, is getting closer every day.


🕊️ Insight 3: Thinking on the Fly - Reacting to Unknown Data

Not all intelligence is pre-trained. Sometimes, real intelligence means figuring things out in the moment, based on what you already know.

That’s exactly what our system does, and it’s one of the most accidentally profound discoveries we made during development.

🦋 The Problem: A New Dimension Emerges

Imagine we’re evaluating a hypothesis and suddenly a new scoring dimension appears, say “novelty” or “feasibility.” Our system has never seen this dimension before. It hasn’t trained a model on it. There’s no data for it.

Most systems would either:

  • Crash,
  • Default to zero,
  • Or wait for retraining.

But not ours.

🪜 The Solution: Generalize from What You Know

When MR.Q encounters this situation, it doesn’t panic. Instead, it says:

“I haven’t seen this exact scoring context… but I know what the goal is. I know what the hypothesis looks like. Let me use my existing encoder and predictor to make an educated guess and then tune it on-the-fly using nearby trusted scores.”

This is the code that enables that behavior:

if dimension not in self.models:
    self._initialize_dimension(dimension)

This little if is doing something deceptively intelligent: It means our system creates new scoring models dynamically, using embedding-based generalization from previous dimensions.

In other words:

  • The system doesn’t need explicit training to begin reacting to a new dimension.
  • It builds a model on-demand using what it already knows.
  • And it tunes itself live by aligning with trusted nearby scores.

🗺️ What is this?

This is one of the deepest signs of real thinking:

  • Adaptation without supervision.
  • The ability to infer structure from context.
  • The willingness to take a guess and refine it.

This is how humans think. This is how animals learn. And now, this is how our AI behaves.

We didn’t plan for this feature. It emerged naturally from a design that valued modularity, embeddings, and real-time feedback. But it’s quickly become a core pillar of our system’s intelligence.

It’s not a magic moment. It’s not a perfect answer. It’s just a system that knows how to say:

“I’ve never seen this before, but I’ve seen enough to take a good first step.”


🧭 Insight 4: Aligning with the LLM, or Learning from the Master

In any learning system, one of the most powerful strategies is to choose a teacher.

For us, that teacher is the LLM.

While our goal is speed and autonomy, we still respect the LLM’s judgment. It’s trained on trillions of tokens. It’s seen more language, logic, and reasoning than any of us ever will.

So when we need ground truth, or a benchmark to align to, we turn to it.

🧩 MR.Q Doesn’t Compete; It Learns

Our MR.Q scorer doesn’t try to outperform the LLM. Instead, it tries to understand what the LLM values, and learn to predict those values faster.

That’s the real trick.

Over time, as we gather A/B preferences scored by the LLM (e.g. “Output A is better than B”), we feed them into MR.Q. It uses these judgments to calibrate its internal regressors. This process lets us say:

“Here’s what the LLM prefers; now let’s tune ourselves to echo that intuition.”

➡️ Tuning per dimension

We maintain a regression tuner per dimension. Each one continuously adjusts MR.Q’s scores to better match LLM evaluations:

tuned = tuner.transform(norm_score)

It’s a small line of code, but a massive shift in power:

  • MR.Q doesn’t need perfect labels.
  • It learns over time from contrast pairs.
  • It aligns faster with every interaction.

🌪️ So What?

We’re effectively bootstrapping intelligence:

  • We borrow precision from a slower, more powerful system.
  • We train a lighter model to match its taste.
  • And we continuously reinforce that alignment as we see more data.

This lets us scale intelligently:

  • Fast scoring via MR.Q,
  • Grounded quality via LLM alignment.

In essence, we’re creating a local brain that mirrors a global brain and gets smarter every time they talk. All of this happens dynamically, in real time, in RAM.


💭 Insight 5: Real-Time Thinking - MR.Q Generates Its Own Judgments

This is the breakthrough.

So far, we’ve talked about how MR.Q learns from the LLM. But now, it thinks for itself.

⏳ No LLM. No Labels. Just Thoughts.

When our system encounters a new hypothesis, it doesn’t call an LLM. It doesn’t look it up in a database.

Instead, MR.Q says:

“I’ve seen similar ideas before. I’ve learned what’s good and what’s bad. Based on everything I know here’s my judgment.”

That’s the moment. That’s what we call a thought.

Not a memory. Not a copy. A new, self-generated evaluation based on past experience.

📥 How It Works

  1. Embeddings: the system encodes both the goal and the hypothesis.
  2. Scoring: MR.Q computes a multidimensional score (correctness, clarity, originality, relevance…).
  3. Tuning: it dynamically transforms that score to align with what it has learned from the LLM.
  4. Logging: the system tracks this new thought, just like any hypothesis or human evaluation.

All of this happens in real time, with no human in the loop.

zsa = encoder(prompt_emb, response_emb)
raw_score = predictor(zsa).item()

Two short lines, but they represent a complete internal thought.

🎛️ Dynamic tuning

This isn’t about just speeding up judgment. This is about moving the locus of intelligence inward.

  • The system is not waiting for supervision.
  • It’s not replaying history.
  • It’s thinking in real time, about real things, using real experience.

And these thoughts are dynamic. They react to new data, tune themselves to changing conditions, and accumulate over time.

We believe this is one of the first systems that:

  • Learns from external judgments,
  • Builds an internal model of quality,
  • And uses that model to generate its own judgments, at scale, continuously.

This is what we mean by thinking AI. It doesn’t just talk; it reflects, adjusts, and evolves.

    flowchart TD
    A[Goal Text] -->|Embed| B[Goal Embedding]
    X[Hypothesis Text] -->|Embed| Y[Hypothesis Embedding]
    B & Y --> C[Concatenate Embeddings]
    C --> D[Pass Through Encoder]
    D --> E[Compute Raw Score via Predictor]
    E --> F{Is Regression Tuner available?}
    F -- Yes --> G[Transform Score via Tuner]
    F -- No --> H[Use Raw Score]
    G --> I[Emit Final Score]
    H --> I[Emit Final Score]
    I --> J[Log Thought Score + Trace]

    style A fill:#f9f,stroke:#333,stroke-width:2px
    style X fill:#f9f,stroke:#333,stroke-width:2px
    style J fill:#ff9,stroke:#333,stroke-width:2px
  

🌀 Insight 6: All We Need Is a Signal

One of the most surprising and powerful discoveries in building this system was that we didn’t actually need carefully curated training data to teach MR.Q how to think. What we needed was much simpler: just a signal.

At the core of this insight is a realization: every prompt we generate already maps to a goal, and every response (or hypothesis) we generate is a potential answer to that goal. If we can obtain any signal that tells us whether one response is better than another, even if it’s approximate, even if it’s noisy, we can use it to train MR.Q.

That’s where our system gets its edge.

We realized we could:

  • Take prompts and responses generated by any agent in the system.
  • Use evaluations or judgments from any other agent (like LLM-based scorers, rule-based filters, or human feedback).
  • Connect those evaluations to MR.Q’s internal training loop, even across agents.

This decouples training from generation. It means we don’t need to run complex reward-tuning pipelines or rely on huge LLM evaluations for every single interaction. We just store the prompts and responses and attach whatever signal we have (a judgment, a comparison, a score), and MR.Q learns from it.

In short: 🧠 We realized that “thinking” didn’t require perfection; it just required enough feedback to improve.

This unlocks cross-agent learning and bootstrapped self-training, turning every interaction into potential fuel for improvement.


🛠️ Code: Selecting contrast pairs

The SQL for this turned out to be pretty straightforward.

            WITH scored_prompts AS (
                SELECT
                    s.dimension,
                    s.score,
                    e.pipeline_run_id,
                    p.id AS prompt_id,
                    p.prompt_text,
                    p.response_text,
                    ROW_NUMBER() OVER (
                        PARTITION BY s.dimension, p.id ORDER BY s.score DESC
                    ) AS rank_high,
                    ROW_NUMBER() OVER (
                        PARTITION BY s.dimension, p.id ORDER BY s.score ASC
                    ) AS rank_low
                FROM scores s
                JOIN evaluations e ON s.evaluation_id = e.id
                JOIN prompts p ON e.pipeline_run_id = p.pipeline_run_id
                WHERE s.score IS NOT NULL
                {goal_filter}
            )
            SELECT
                dimension,
                prompt_text,
                response_text,
                score,
                rank_type
            FROM (
                SELECT
                    dimension,
                    prompt_text,
                    response_text,
                    score,
                    'top' AS rank_type,
                    prompt_id
                FROM scored_prompts
                WHERE rank_high = 1
                  AND prompt_text IS NOT NULL
                  AND response_text IS NOT NULL
                  AND prompt_text <> ''
                  AND response_text  <> ''
                  
                UNION ALL

                SELECT
                    dimension,
                    prompt_text,
                    response_text,
                    score,
                    'bottom' AS rank_type,
                    prompt_id
                FROM scored_prompts
                WHERE rank_low = 1
            ) AS ranked_pairs
            ORDER BY dimension, prompt_id
            LIMIT :limit

🧪 Why We Do It This Way

  • Contrastive training works better than absolute scoring. Ranking “A > B” is often easier and more stable than assigning a perfect numeric score.
  • Every prompt acts like a mini training task. It gives us a chance to learn what better looks like in context, even if the prompt isn’t perfect.
  • We amplify our dataset by orders of magnitude. From just a few thousand prompts, we generate hundreds of thousands of A/B training pairs.
  • Each dimension trains independently. This lets us specialize: one MR.Q model might focus on correctness, another on clarity. Later, these can be fused or balanced.

🧱 SQL as Structure Discovery

Why SQL? Because it gives us tight, expressive control over scoring logic, joins, and filters. The window functions (ROW_NUMBER()) let us:

  • Partition by prompt + dimension
  • Order by score
  • Select only the top and bottom responses per prompt

This simple trick lets us auto-label contrastive pairs without any manual annotation.
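
As a rough illustration of the next step, here is a hypothetical helper that turns the rows returned by that query into (better, worse) training pairs. The column names come from the query above; the grouping logic and the field names of the output dicts are illustrative, not the pipeline’s actual code.

from collections import defaultdict

def build_contrast_pairs(rows):
    """Group the 'top' and 'bottom' rows per (dimension, prompt) into A/B training pairs."""
    grouped = defaultdict(dict)
    for row in rows:  # row keys: dimension, prompt_text, response_text, score, rank_type
        grouped[(row["dimension"], row["prompt_text"])][row["rank_type"]] = row

    pairs = []
    for (dimension, prompt), sides in grouped.items():
        top, bottom = sides.get("top"), sides.get("bottom")
        if top and bottom and top["score"] > bottom["score"]:
            pairs.append({
                "dimension": dimension,
                "prompt": prompt,
                "output_a": top["response_text"],
                "output_b": bottom["response_text"],
                "value_a": top["score"],
                "value_b": bottom["score"],
            })
    return pairs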

🧰 Example Use Case

Let’s say we have 5,000 prompt runs. Each has been scored on:

  • Correctness
  • Originality
  • Clarity

We run this SQL and generate:

  • 5,000 pairs for correctness
  • 4,800 for originality
  • 4,950 for clarity

That’s nearly 15,000 contrastive examples from existing logs, and they can be regenerated as scoring improves.

🚀 What’s Next

Once extracted, these pairs are passed into our MR.Q training loop, which:

  • Learns which patterns are preferred
  • Starts judging unseen outputs
  • Eventually feeds back into scoring, tuning, and prompt repair

The result: a self-bootstrapping optimization system, built on a simple SQL foundation.


🧠 MR.Q vs. DPO: Why We Chose Regression Over Reinforcement

Most modern LLM tuning relies on learning from preferences: techniques like RLHF and DPO (Direct Preference Optimization) fine-tune massive models based on human- or AI-chosen winners in A/B comparisons.

But our goals were different.

We wanted a reward model that was:

  • Fast enough to run live, in-memory
  • Simple enough to debug instantly
  • Flexible enough to work across dozens of agents
  • Trainable using sparse, indirect data

That’s where MR.Q comes in. Instead of reinforcement learning, we just apply good old regression using the embedding distance between prompt and response as features, and a simple score as target.

Here’s how they stack up:

| Feature | DPO-style preference fine-tuning | MR.Q (Ours) |
|---|---|---|
| Model Size | Huge (requires LLM fine-tuning) | Tiny (runs in-memory, local) |
| Training Time | Hours to days | Seconds to minutes |
| Interpretability | Low (black-box weights) | High (regression + tunable alignments) |
| Real-Time Use | No | Yes |
| Embedding-Aware | No | Yes (direct use of vector space) |
| Requires Instruction Tuning | Yes | No |

So while DPO needs thousands of examples and GPU days, MR.Q starts thinking after just a few examples and keeps tuning itself on-the-fly as new data rolls in.

It’s not a big brain. But it’s a fast brain.


🧬 Part 1: Subconscious systems

Beneath every decision the AI makes lies a silent evaluator: a system that scores, compares, and adjusts behavior without being explicitly told to. This is the subconscious of our architecture: fast, reactive, and always learning from its environment.

At the heart of this layer is MR.Q, a lightweight, contrastive scoring model that constantly watches the pipeline’s outputs and adapts its judgment in real time. It doesn’t plan or explain; it responds. Like a human gut instinct, MR.Q senses patterns, aligns itself to trusted feedback (like LLM judgments), and tunes future evaluations accordingly.

This subconscious system gives the AI its ability to:

  • Score hypotheses without full retraining.
  • Align dynamically to high-quality reasoning signals.
  • React in real time to changes in goal type or domain.
  • Guide the symbolic system with fast, low-latency evaluations.

While the symbolic reasoning layer chooses how to think, MR.Q ensures the system always knows what’s working, quietly shaping thought through constant, embedded feedback.


🧪 MR.Q: The Fast Neural Judge

MR.Q is a fast, adaptive regressor trained on contrast pairs. It works by embedding the prompt and hypothesis and using a small MLP to predict a score.

✅ Why Use MR.Q?

  • Speed: It’s extremely fast, ideal for real-time applications.
  • Online Tuning: It learns from nearby LLM scores using local regression (e.g., Ridge or SVM-based adjustment).
  • Low Data Requirements: You can bootstrap with very few LLM-evaluated examples.
  • Great for Tuning Pipelines: MR.Q enables symbolic strategies, prompts, or model variants to be evaluated quickly and consistently.

If you’re generating hundreds of hypotheses per pipeline, MR.Q is your best bet for scalable feedback.

🛠️ Code: The MRQScorer - fast, self-tuning quality estimator

The MRQScorer is a lightweight, fast, and self-improving scoring module that estimates the quality of a hypothesis against a goal using embedding-based similarity and a trained value predictor.

Instead of relying on expensive LLM evaluations for every hypothesis, MR.Q offers a low-latency approximation that can scale while still staying grounded through real-time alignment with LLM scores using the RegressionTuner.


class MRQScorer(BaseScorer):
    def __init__(self, cfg: dict, memory, logger, dimensions=None):
        self.cfg = cfg
        self.memory = memory
        self.logger = logger
        self.device = cfg.get("device", "cpu")
        self.dimensions = dimensions or ["mrq"]
        self.models = {}  # dim -> (encoder, predictor)
        self.trainers = {}
        self.min_score_by_dim = {}
        self.max_score_by_dim = {}
        self.value_predictor = HypothesisValuePredictor(512, 1024).to(self.device)
        self.encoder = TextEncoder().to(self.device)
        self.regression_tuners = {}

        # Initialize model + tuner for each dimension
        for dim in self.dimensions:
            self.regression_tuners[dim] = RegressionTuner(
                dimension=dim, logger=self.logger
            )
            trainer = MRQTrainer(
                memory=memory,
                logger=logger,
                value_predictor=self.value_predictor,
                encoder=self.encoder,
                device=self.device,
            )
            self.models[dim] = (self.encoder, self.value_predictor)
            self.trainers[dim] = trainer
            self.min_score_by_dim[dim] = 0.0
            self.max_score_by_dim[dim] = 1.0

    def score(self, goal: dict, hypothesis: dict, dimensions: list[str]) -> ScoreBundle:
        """
        Predicts scores for given dimensions using MR.Q and applies tuning if available.
        """
        results = []
        for dim in dimensions:
            score = self._estimate_score(goal, hypothesis, dim)
            rationale = f"MRQ estimated score for {dim}."
            self.logger.log(
                "MRQDimensionEvaluated",
                {"dimension": dim, "score": score, "rationale": rationale},
            )
            results.append(
                ScoreResult(
                    dimension=dim,
                    score=score,
                    rationale=rationale,
                    weight=1.0,
                    source="mrq",
                )
            )
        return ScoreBundle(results={r.dimension: r for r in results})

    def _estimate_score(self, goal, hypothesis, dimension):
        """
        Core logic: compute embeddings, run prediction, apply optional regression tuner.
        """
        # Initialize dimension on demand
        if dimension not in self.models:
            self._initialize_dimension(dimension)

        prompt_emb = torch.tensor(
            self.memory.embedding.get_or_create(goal.get("goal_text")),
            device=self.device,
        ).unsqueeze(0)
        response_emb = torch.tensor(
            self.memory.embedding.get_or_create(hypothesis.get("text")),
            device=self.device,
        ).unsqueeze(0)

        encoder, predictor = self.models[dimension]
        zsa = encoder(prompt_emb, response_emb)
        raw_score = predictor(zsa).item()
        norm_score = self.normalize_score(raw_score, dimension)

        # Optionally apply tuner
        tuner = self.regression_tuners.get(dimension)
        if tuner:
            tuned = tuner.transform(norm_score)
            self.logger.log(
                "MRQTunedScore",
                {"dimension": dimension, "raw": norm_score, "tuned": tuned},
            )
            return tuned
        return norm_score

    def _initialize_dimension(self, dimension):
        self.regression_tuners[dimension] = RegressionTuner(
            dimension=dimension, logger=self.logger
        )
        self.trainers[dimension] = MRQTrainer(
            memory=self.memory, logger=self.logger, value_predictor=self.value_predictor, encoder=self.encoder, device=self.device
        )
        self.models[dimension] = (self.encoder, self.value_predictor)
        self.min_score_by_dim[dimension] = 0.0
        self.max_score_by_dim[dimension] = 1.0
        self.logger.log("MRQModelInitializing", {"dimension": dimension})

    def align_to_best_llm_neighbour(self, goal, hypothesis, dimension):
        """
        Fetch similar hypotheses that already have high LLM scores.
        Then align MR.Q prediction to the best of them.
        """
        llm_scores = self.get_closest_llm_scores(hypothesis["text"], dimension)
        if llm_scores:
            self.align_with_llm_score(dimension, goal, hypothesis, max(llm_scores))

    def get_closest_llm_scores(
        self, hypothesis_text: str, dimension: str, top_k: int = 5
    ) -> list[float]:
        """
        Finds the top_k LLM scores for hypotheses most similar to the given one.
        """
        query_emb = self.memory.embedding.get_or_create(hypothesis_text)
        similar_items = self.memory.embedding.similarity_search(query_emb, top_k)

        scores = []
        for item in similar_items:
            matched_text = item.get("text")
            score_entry = self.memory.score.find_by_text_and_dimension(
                matched_text, dimension=dimension, source="llm"
            )
            if score_entry:
                scores.append(score_entry.score)
        return scores

    def align_with_llm_score(self, dimension, goal, hypothesis, llm_score):
        mrq_score = self._estimate_score(goal, hypothesis, dimension)
        self.logger.log(
            "MRQAligningToLLM",
            {
                "goal": goal.get("goal_text"),
                "hypothesis": hypothesis.get("text"),
                "dimension": dimension,
                "mrq_raw": mrq_score,
                "llm_target": llm_score,
            },
        )
        self.regression_tuners[dimension].add_example(mrq_score, llm_score)
        self.logger.log(
            "MRQAlignmentAdded",
            {
                "dimension": dimension,
                "example_count": len(self.regression_tuners[dimension].examples),
            },
        )

    def evaluate(self, prompt: str, response: str) -> ScoreBundle:
        """
        Scores a prompt-response pair across all dimensions, and saves it.
        """
        results = []
        for dim, (encoder, predictor) in self.models.items():
            prompt_emb = torch.tensor(
                self.memory.embedding.get_or_create(prompt), device=self.device
            ).unsqueeze(0)
            output_emb = torch.tensor(
                self.memory.embedding.get_or_create(response), device=self.device
            ).unsqueeze(0)
            zsa = encoder(prompt_emb, output_emb)
            value = predictor(zsa).item()
            norm_score = self.normalize_score(value, dim)

            results.append(
                ScoreResult(
                    dimension=dim,
                    score=norm_score,
                    weight=1.0,
                    rationale=f"MR.Q model trained for {dim}",
                    source="mrq",
                )
            )

        bundle = ScoreBundle(results={r.dimension: r for r in results})
        ScoringManager.save_score_to_memory(
            bundle,
            response,
            cfg=self.cfg,
            memory=self.memory,
            logger=self.logger,
            source="mrq",
        )
        return bundle

    def normalize_score(self, raw, dim):
        min_ = self.min_score_by_dim.get(dim, 0.0)
        max_ = self.max_score_by_dim.get(dim, 1.0)
        return round(100 * (raw - min_) / (max_ - min_ or 1.0), 2)

    def judge(self, goal, prompt, output_a, output_b):
        """
        Compares two outputs via MR.Q and returns the preferred one.
        """
        dim = self.dimensions[0]
        encoder, predictor = self.models[dim]

        prompt_emb = torch.tensor(
            self.memory.embedding.get_or_create(prompt), device=self.device
        ).unsqueeze(0)
        a_emb = torch.tensor(
            self.memory.embedding.get_or_create(output_a), device=self.device
        ).unsqueeze(0)
        b_emb = torch.tensor(
            self.memory.embedding.get_or_create(output_b), device=self.device
        ).unsqueeze(0)

        value_a = predictor(encoder(prompt_emb, a_emb)).item()
        value_b = predictor(encoder(prompt_emb, b_emb)).item()
        preferred = output_a if value_a >= value_b else output_b

        # Optionally log sharpening example
        if self.memory.mrq.log_evaluations():
            pred = SharpeningPredictionORM(
                id=None,
                goal_id=-1,
                prompt_text=prompt,
                output_a=output_a,
                output_b=output_b,
                preferred="a" if value_a >= value_b else "b",
                predicted="a" if value_a >= value_b else "b",
                value_a=value_a,
                value_b=value_b,
            )
            self.memory.sharpening.insert_sharpening_prediction(pred.to_dict(), goal)

        return preferred, {"value_a": value_a, "value_b": value_b}

    def train_from_database(self, cfg: dict):
        all_samples = self.memory.mrq.get_training_pairs_by_dimension()
        for dim, samples in all_samples.items():
            if not samples:
                self.logger.log("MRQNoTrainingSamples", {"dimension": dim})
                continue

            self.align_mrq_with_llm_scores_from_pairs(samples, dimension=dim)

            self.logger.log(
                "MRQTrainingStart", {"dimension": dim, "sample_count": len(samples)}
            )

            if dim not in self.trainers:
                self.trainers[dim] = MRQTrainer(
                    memory=self.memory,
                    logger=self.logger,
                    encoder=self.encoder,
                    value_predictor=self.value_predictor,
                    device=self.device,
                )

            self.update_score_bounds_from_data(samples, dim)
            dataloader = self.trainers[dim].prepare_training_data(samples)
            self.trainers[dim].train(dataloader, cfg)

            self.logger.log("MRQTrainingComplete", {"dimension": dim})

    def train_from_context(self, context: dict, cfg: dict):
        dim_samples = context.get("mrq_training_pairs_by_dimension", {})
        for dim, samples in dim_samples.items():
            if not samples:
                self.logger.log("MRQNoTrainingFromContext", {"dimension": dim})
                continue

            self.logger.log(
                "MRQContextTrainingStart",
                {"dimension": dim, "sample_count": len(samples)},
            )

            self.update_score_bounds_from_data(samples, dim)
            dataloader = self.trainers[dim].prepare_training_data(samples)
            self.trainers[dim].train(dataloader, cfg)

            self.logger.log("MRQContextTrainingComplete", {"dimension": dim})

    def update_score_bounds_from_data(self, samples: list, dim: str):
        values = []
        for s in samples:
            if "value_a" in s and "value_b" in s:
                values.extend([s["value_a"], s["value_b"]])
            elif "value" in s:
                values.append(s["value"])
        if values:
            min_score = min(values)
            max_score = max(values)
            self.min_score_by_dim[dim] = min_score
            self.max_score_by_dim[dim] = max_score
            self.logger.log(
                "MRQScoreBoundsUpdated",
                {
                    "dimension": dim,
                    "min_score": min_score,
                    "max_score": max_score,
                    "example_count": len(values),
                },
            )

    def align_mrq_with_llm_scores_from_pairs(
        self, pair_samples: list[dict], dimension: str, log_prefix: str = "MRQAlignment"
    ):
        for pair in pair_samples:
            prompt = pair["prompt"]
            for side in ["a", "b"]:
                hyp = pair[f"output_{side}"]
                llm_score = pair[f"value_{side}"]

                # Predict MRQ score dynamically
                mrq_score = self.score(
                    {"goal_text": prompt}, {"text": hyp}, [dimension]
                )

                # Log the alignment
                self.logger.log(
                    f"{log_prefix}Dynamic",
                    {
                        "prompt_hash": hash(prompt),
                        "hypothesis_hash": hash(hyp),
                        "dimension": dimension,
                        "llm_score": llm_score,
                        "predicted_mrq": mrq_score,
                    },
                )

                # Pass the pair into the regression tuner
                if mrq_score is not None and llm_score is not None:
                    self.regression_tuners[dimension].train_single(
                        mrq_score=mrq_score.results[dimension].score,
                        llm_score=llm_score,
                    )

⚙️ How It Works

  • Predicts dimensional quality scores (e.g., correctness, clarity) using goal and hypothesis embeddings.
  • Trains continuously from new pairwise data or saved examples.
  • Dynamically aligns its predictions to high-confidence LLM scores using a real-time RegressionTuner.
  • Maintains separate models, score bounds, and tuners per dimension.
  • Supports contrastive pairwise judgment (e.g., output A vs B) as well as single hypothesis evaluation.
  • Fully pluggable into the Co AI pipeline as a fast, adaptable scoring backend (a usage sketch follows below).
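
For orientation, here is a hypothetical usage sketch. The cfg, memory, and logger objects are placeholders for whatever your pipeline provides, and the goal, hypothesis, and dimension names are only examples.

# Placeholder wiring: memory and logger come from your pipeline setup.
scorer = MRQScorer(
    cfg={"device": "cpu"},
    memory=memory,
    logger=logger,
    dimensions=["correctness", "clarity"],
)

bundle = scorer.score(
    goal={"goal_text": "Explain why the sky is blue."},
    hypothesis={"text": "Rayleigh scattering of sunlight by air molecules."},
    dimensions=["correctness", "clarity"],
)
print({dim: result.score for dim, result in bundle.results.items()})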

🧩 When MR.Q Isn’t Enough: Introducing the SVM Scorer

As brilliant as MR.Q is at reacting, it does have limitations.

It’s a great local thinker: fast, adaptive, and surprisingly accurate. But sometimes we need a global judge, one that can:

  • Handle richer features beyond just embeddings
  • Spot more abstract patterns in behavior
  • Generalize across different pipelines and agent types

That’s where the SVM Scorer comes in.

While MR.Q uses lightweight regression on prompt–response embeddings, the SVM (Support Vector Machine) Scorer can look at:

  • Full prompt and hypothesis content
  • Structural features (e.g., reasoning steps, token patterns)
  • Prior scoring history across multiple dimensions

It works as a second opinion, or even a meta-reviewer, trained on structured pairs and ranked outputs, similar to DPO but without the need to fine-tune an entire model.

In practice, we let MR.Q make fast judgments and then let the SVM check the pattern over time.

Think of it like this:

MR.Q thinks quickly. The SVM thinks deeply.

Together, they give us a scoring system that’s fast and reflective, a kind of dual brain for hypothesis evaluation.

📐 SVM: The Interpretable Feature-Based Scorer

SVM Scorers rely on handcrafted or learned features (like score differences, prompt structure, or token length). They train separate support vector machines per dimension.

✅ Why Use SVM?

  • Interpretability: You can see which features drive scoring, which is great for debugging or analysis.
  • Deterministic: Once trained, the score doesn’t fluctuate.
  • Feature-Aware: You can embed symbolic or structural signal into the scorer.

SVM is a bridge between MR.Q and LLM: it’s faster than the LLM, more interpretable than MR.Q, and easier to customize for specific dimensions (like relevance or complexity).

🛠️ Code: SVMScorer

class SVMScorer(BaseScorer):
    def __init__(self, cfg: dict, memory, logger, dimensions=None):
        self.cfg = cfg
        self.memory = memory
        self.logger = logger
        self.dimensions = dimensions or ["alignment"]
        self.models = {dim: SVR() for dim in self.dimensions}
        self.scalers = {dim: StandardScaler() for dim in self.dimensions}
        self.trained = {dim: False for dim in self.dimensions}
        self.regression_tuners = {}  
        for dim in self.dimensions:
            self._initialize_dimension(dim)

    def _initialize_dimension(self, dim):
        self.models[dim] = SVR()
        self.scalers[dim] = StandardScaler()
        self.trained[dim] = False
        self.regression_tuners[dim] = RegressionTuner(dimension=dim, logger=self.logger)

    def train(self, samples_by_dim: dict[str, list[dict]]):
        """
        Train per-dimension SVM from labeled LLM/MRQ training data
        """
        for dim, samples in samples_by_dim.items():
            x = []
            y = []
            for sample in samples:
                prompt = sample["prompt"]
                hyp = sample["output"]
                score = sample["value"]
                feat = self._build_feature_vector({"goal_text": prompt}, {"text": hyp})
                x.append(feat)
                y.append(score)

            x = np.array(x)
            y = np.array(y)
            self.scalers[dim].fit(x)
            x_scaled = self.scalers[dim].transform(x)

            self.models[dim].fit(x_scaled, y)
            self.trained[dim] = True

            self.logger.log("SVMTrainingComplete", {
                "dimension": dim,
                "samples": len(samples),
                "score_min": float(np.min(y)),
                "score_max": float(np.max(y)),
            })

    def _build_feature_vector(self, goal: dict, hypothesis: dict):
        """
        Basic feature vector: concat prompt + hypothesis embeddings + MRQ raw score (if available)
        """
        emb_goal = self.memory.embedding.get_or_create(goal["goal_text"])
        emb_hyp = self.memory.embedding.get_or_create(hypothesis["text"])
        vec = emb_goal + emb_hyp

        # Optional MRQ bridge feature
        mrq = self.memory.score.find_by_text_and_dimension(
            hypothesis["text"], dimension="alignment", source="mrq"
        )
        if mrq:
            vec.append(mrq.score / 100.0)  # normalized to [0,1]
        else:
            vec.append(0.5)  # neutral if no MRQ score

        return vec

    def train_from_database(self):
        pair_samples = self.memory.mrq.get_training_pairs_by_dimension()
        samples_by_dim = self.convert_mrq_pairs_to_supervised_examples(pair_samples)

        for dim, examples in samples_by_dim.items():
            self.train_for_dimension(dim, examples)


    def convert_mrq_pairs_to_supervised_examples(self, pair_samples: list[dict]) -> dict[str, list[dict]]:
        """
        Converts MRQ-style contrastive training pairs into a flat list of (prompt, output, value)
        entries per dimension, suitable for supervised regression training.
        """
        per_dimension = defaultdict(list)
        for pair in pair_samples:
            dim = pair.get("dimension", "default")

            for side in ["a", "b"]:
                output = pair.get(f"output_{side}")
                score = pair.get(f"value_{side}")
                if output is not None and score is not None:
                    per_dimension[dim].append({
                        "prompt": pair["prompt"],
                        "output": output,
                        "value": score
                    })

        self.logger.log("SVMConvertedMRQPacks", {
            "dimensions": list(per_dimension.keys()),
            "total_samples": sum(len(v) for v in per_dimension.values())
        })

        return per_dimension

    def train_for_dimension(self, dimension: str, examples: list[dict]):
        X = []
        y = []
        for ex in examples:
            prompt_vec = self.memory.embedding.get_or_create(ex["prompt"])
            output_vec = self.memory.embedding.get_or_create(ex["output"])
            pair_vec = np.array(prompt_vec + output_vec)
            X.append(pair_vec)
            y.append(ex["value"])

        X = np.array(X)
        y = np.array(y)

        scaler = StandardScaler()
        X_scaled = scaler.fit_transform(X)

        model = SVR(kernel="linear")  # you can adjust kernel if needed
        model.fit(X_scaled, y)

        # Keep storage consistent with the rest of the class: the model and its
        # scaler live in their own per-dimension dicts.
        self.scalers[dimension] = scaler
        self.models[dimension] = model
        self.trained[dimension] = True

        self.logger.log("SVMModelTrained", {
            "dimension": dimension,
            "num_samples": len(y)
        })

    def score(self, goal: dict, hypothesis: dict, dimensions: list[str]) -> ScoreBundle:
        results = {}
        for dim in dimensions:
            vec = self._build_feature_vector(goal, hypothesis)

            # Dynamic training if needed
            if not self.trained[dim]:
                self._try_train_on_dimension(dim)

            if not self.trained[dim]:
                score = 50.0
                rationale = f"SVM not trained for {dim}, returning neutral."
            else:
                x = self.scalers[dim].transform([vec])
                raw_score = self.models[dim].predict(x)[0]
                tuned_score = self.regression_tuners[dim].transform(raw_score)
                score = tuned_score
                rationale = f"SVM predicted and aligned score for {dim}"

            self.logger.log("SVMScoreComputed", {
                "dimension": dim,
                "score": score,
                "hypothesis": hypothesis.get("text"),
            })

            results[dim] = ScoreResult(
                dimension=dim,
                score=score,
                rationale=rationale,
                weight=1.0,
                source="svm",
            )

        return ScoreBundle(results=results)

    def _try_train_on_dimension(self, dim):
        samples_by_dim = self.memory.mrq.get_training_pairs_by_dimension()
        samples = samples_by_dim.get(dim, [])
        if not samples:
            self.logger.log("SVMNoSamples", {"dimension": dim})
            return

        X, y = [], []
        for s in samples:
            for side in ["a", "b"]:
                prompt = s["prompt"]
                hypothesis = s[f"output_{side}"]
                llm_score = s.get(f"value_{side}")
                if prompt and hypothesis and llm_score is not None:
                    vec = self._build_feature_vector({"goal_text": prompt}, {"text": hypothesis})
                    X.append(vec)
                    y.append(llm_score)
                    self.regression_tuners[dim].add_example(llm_score, llm_score)  # no-op, self-alignment fallback

        if not X:
            return

        X_scaled = self.scalers[dim].fit_transform(X)
        self.models[dim].fit(X_scaled, y)
        self.trained[dim] = True

        self.logger.log("SVMTrainingComplete", {
            "dimension": dim,
            "samples": len(X)
        })

        # Align the scores using same logic as MRQ
        self._align_with_llm(samples, dim)

    def _align_with_llm(self, samples, dim):
        for s in samples:
            for side in ["a", "b"]:
                prompt = s["prompt"]
                hypothesis = s[f"output_{side}"]
                llm_score = s.get(f"value_{side}")
                if llm_score is None:
                    continue

                vec = self._build_feature_vector({"goal_text": prompt}, {"text": hypothesis})
                x = self.scalers[dim].transform([vec])
                raw_score = self.models[dim].predict(x)[0]

                self.regression_tuners[dim].train_single(mrq_score=raw_score, llm_score=llm_score)

                self.logger.log("SVMAlignmentDynamic", {
                    "dimension": dim,
                    "mrq_score": raw_score,
                    "llm_score": llm_score
                })    


    def _train_dimension(self, dim: str):
        pairs_by_dim = self.memory.mrq.get_training_pairs_by_dimension()
        samples = pairs_by_dim.get(dim, [])
        if not samples:
            self.logger.log("SVMNoTrainingData", {"dimension": dim})
            self.trained[dim] = False
            return

        X = []
        y = []
        for sample in samples:
            goal = {"goal_text": sample["prompt"]}
            for side in ["a", "b"]:
                hyp = {"text": sample[f"output_{side}"]}
                label = sample.get(f"value_{side}")
                if label is not None:
                    vec = self._build_feature_vector(goal, hyp)
                    X.append(vec)
                    y.append(label)

        if len(X) < 5:
            self.logger.log("SVMInsufficientTrainingData", {"dimension": dim, "count": len(X)})
            self.trained[dim] = False
            return

        X_scaled = self.scalers[dim].fit_transform(X)
        self.models[dim].fit(X_scaled, y)
        self.trained[dim] = True
        self.logger.log("SVMTrained", {"dimension": dim, "samples": len(X)})

🤖 LLM: The Gold Standard

LLM-based scoring uses prompt engineering and chain-of-thought to assess a hypothesis directly. It’s the most accurate and nuanced evaluator, but also the most expensive.

✅ Why Use LLM?

  • Highest Quality: LLMs can explain their judgment using rubrics or pairwise comparison.
  • Training Data Source: They’re used to label samples for MR.Q and SVM.
  • Flexible Criteria: You can swap scoring prompts to test new evaluation rubrics on the fly.

Use LLM scoring for validation, meta-evaluation, or as a source of truth during training. Don’t use it in tight loops; it’s too slow and expensive.
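
To give a feel for what rubric-based LLM scoring can look like, here is an illustrative prompt template. It is not the project’s actual scoring prompt; the wording and the “score / rationale” output format are just one convention that pairs naturally with a numeric parser.

# Illustrative rubric prompt (not the project's actual template).
CLARITY_RUBRIC = """You are scoring a hypothesis against a goal.

Goal: {goal}
Hypothesis: {hypothesis}

Rate the CLARITY of the hypothesis from 0 to 100, where 0 is incomprehensible
and 100 is perfectly clear and unambiguous.

Respond in exactly this format:
score: <number between 0 and 100>
rationale: <one sentence>"""

prompt = CLARITY_RUBRIC.format(
    goal="Explain why the sky is blue.",
    hypothesis="Rayleigh scattering of sunlight by air molecules.",
)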


📊 Scorer Comparison

| Feature / Property | MR.Q Scorer | SVM Scorer | LLM Scorer |
|---|---|---|---|
| Type | Embedding-based regressor (MLP) | Classical ML (SVM on features) | Language model with rubric/pairwise |
| Speed | 🚀 Very Fast | ⚡ Fast | 🐢 Slow |
| Training Style | Online, dynamic tuning | Batch (per dimension) | Not trainable (prompt-based) |
| Adaptivity | High (neighborhood-based tuning) | Medium (requires re-fitting) | None (static output) |
| Data Requirement | Low (few examples needed) | Moderate (pairwise samples) | None (but high cost per call) |
| Output Consistency | Adaptive, can vary with tuning | Deterministic once trained | Stochastic (temperature, wording) |
| Best Use Cases | Real-time scoring, tuning proxies | Interpretable, structured comparisons | Final evaluation, bootstrap labeling |
| Output Range | Normalized to 0–100 (tuned) | 0–100 (aligned via regressor) | 0–100 (rubric or logits mapped) |
| Integration Cost | Low | Medium | High (token cost, latency) |
| Interpretability | Moderate | High (feature-based decisions) | Low (depends on prompt wording) |

🧭 Strategy: How to Combine Them

In our system, we use all three together in a layered hierarchy:

| Layer | Scorer | Purpose |
|---|---|---|
| Inference | MR.Q | Inline scoring during generation |
| Tuning | MR.Q | Optimize symbolic strategy |
| Analysis | SVM | Understand score drivers |
| Bootstrapping | LLM | Provide ground truth labels |
| Evaluation | LLM | Final output validation |

🔁 When to Use Which

| Scenario | Preferred Scorer |
|---|---|
| Fast scoring inside a pipeline | MR.Q |
| Bootstrapping a reward model | LLM → MR.Q or SVM |
| Structured feature-based alignment | SVM |
| Comparing symbolic strategies | SVM |
| Validating prompt effectiveness | LLM |
| Scoring many samples cheaply | MR.Q |
| Ensuring scoring consistency across versions | SVM (or frozen MR.Q) |
| Scoring hypotheses for training | LLM → SVM + MR.Q cascade |

🔁 The Meta-Review Loop: From Fast Judgments to Self-Correction

If MR.Q is the quick reflex and the SVM is the deep intuition, then the Meta-Review Loop is the higher-order reflection system that learns from both.

Every time our system runs a prompt, scores a hypothesis, and picks a best answer, it doesn’t just move on; it remembers.

It logs:

  • The prompt and context
  • The hypothesis and score
  • The rule or pipeline that produced it
  • Which scoring system made the call (MR.Q, SVM, or LLM)

Later, when better information becomes available (say, a more accurate score from an LLM, or a consensus among agents), we compare it to the original score.

If the original judgment was off, we don’t just update the result; we retrain the scorer.

In real time. On real data. Across any dimension we care about.

This loop allows our system to:

  • Adapt to changing tasks
  • Evolve toward more accurate judgments
  • Tune itself without external retraining pipelines

It’s self-supervised learning, not in theory but in practice. A system that judges itself, trains itself, and improves itself, one contrast at a time.
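
As a rough sketch of that retraining trigger, here is what "compare and retrain" can look like. The record fields, function name, and threshold below are hypothetical illustrations, not the production code:

# Hypothetical record logged at scoring time:
# {"prompt": ..., "hypothesis": ..., "score": 0.62, "dimension": "correctness", "scorer": "mrq"}

def on_better_information(record: dict, llm_score: float, tuner, threshold: float = 0.1):
    """If the original judgment was off by more than `threshold`, feed the pair back to the tuner."""
    if abs(record["score"] - llm_score) > threshold:
        tuner.train_single(mrq_score=record["score"], llm_score=llm_score)

Here `tuner` is the per-dimension regression tuner introduced in the next section.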


🎯 The Regression Tuner: Real-Time Alignment with LLM Ground Truth

We believe the LLM is often the de facto source of truth: not because it’s perfect, but because it’s been trained on such vast and diverse data that its judgments are generally reliable.

So when MR.Q, our fast, embedding-based scorer, makes a decision, we sometimes get a second opinion from the LLM.

But what happens when those opinions don’t match?

That’s where the Regression Tuner comes in.

This lightweight module acts like a real-time calibrator. Every time we score a hypothesis using both MR.Q and the LLM, the tuner saves that pair:

MR.Q score → 0.68  
LLM score → 0.84

It doesn’t save anything to disk. It doesn’t run in big training loops. Instead, once it has enough examples (as few as 10), it fits a simple linear regression model on the fly and updates it over time.

From then on, every MR.Q score in that dimension gets nudged into better alignment:

raw score: 0.68 → tuned score: 0.81

Why this matters:

  • It allows MR.Q to learn from the LLM without being replaced by it.
  • It preserves speed while gaining accuracy.
  • It helps us catch and correct bias drift in our embedding space.
  • It makes our self-improving system actually improve in measurable ways.

This tuner is not just a patch; it’s a critical part of how the system thinks with feedback.

The best part? It works for any dimension, any agent, any task, all in-memory, all on the fly.

🛠️ Code: RegressionTuner aligning MR.Q scores with LLM scores


import numpy as np
from sklearn.linear_model import LinearRegression


class RegressionTuner:
    """
    Learns to transform MR.Q scores to align with LLM scores dynamically.
    Does not save any state to disk; purely in-memory and real-time.
    """

    def __init__(self, dimension: str, logger=None, min_samples: int = 10):
        self.dimension = dimension
        self.logger = logger
        self.min_samples = min_samples
        self.x = []  # MRQ scores
        self.y = []  # LLM scores
        self.model = None

    def train_single(self, mrq_score: float, llm_score: float):
        """Adds a new training pair and refits if threshold reached."""
        self.x.append(mrq_score)
        self.y.append(llm_score)

        if len(self.x) >= self.min_samples:
            self._fit()

        if self.logger:
            self.logger.log("RegressionTunerTrainSingle", {
                "dimension": self.dimension,
                "mrq_score": mrq_score,
                "llm_score": llm_score,
                "total_samples": len(self.x)
            })

    def _fit(self):
        """Fits a linear regression model to current examples."""
        x_arr = np.array(self.x).reshape(-1, 1)
        y_arr = np.array(self.y)

        self.model = LinearRegression().fit(x_arr, y_arr)

        if self.logger:
            self.logger.log("RegressionTunerFitted", {
                "dimension": self.dimension,
                "count": len(self.x),
                "coef": float(self.model.coef_[0]),
                "intercept": float(self.model.intercept_),
            })

    def transform(self, score: float) -> float:
        """Transforms a score using the fitted regression model if available."""
        if self.model:
            return float(self.model.predict(np.array([[score]]))[0])
        return score

⚙️ How It Works

  • Collects alignment pairs between a fast, local scorer (MR.Q) and a slower, more accurate reference scorer (LLM).
  • Learns a mapping via linear regression to bring the fast scorer’s outputs in line with the LLM.
  • Applies that mapping live to correct MR.Q’s predictions on future examples (see the usage sketch below).
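
Here is a minimal usage sketch of the tuner above; the score pairs are illustrative values, not data from a real run:

tuner = RegressionTuner(dimension="correctness")

# Feed MR.Q / LLM score pairs as they are observed (illustrative values).
pairs = [(0.62, 0.80), (0.55, 0.71), (0.70, 0.86), (0.48, 0.65), (0.66, 0.82),
         (0.59, 0.74), (0.73, 0.88), (0.51, 0.69), (0.64, 0.79), (0.57, 0.72)]
for mrq, llm in pairs:
    tuner.train_single(mrq_score=mrq, llm_score=llm)

# After min_samples pairs, transform() applies the fitted linear mapping;
# before that point it simply returns the raw score unchanged.
tuned = tuner.transform(0.68)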

🧠 Part 2: Conscious Thought - Symbolic Rules for Structured Reasoning

If MR.Q gives us instinct (fast, reactive judgment), then symbolic reasoning gives us conscious structure. It’s how we embed deliberate, traceable thought into our AI. Where MR.Q adapts based on feedback, symbolic rules encode known good strategies and allow us to program reasoning itself.

🧠 Symbolic rules are how the AI learns to think about thinking.

🧾 Pipelines as Programs

Every AI pipeline in our system is more than just a sequence of steps: it’s a program. Each stage makes a decision:

  • Which model to use?
  • What prompt to run?
  • How to evaluate or refine the output?

And just like any program, we can rewrite parts of it symbolically.

Symbolic rules are modular, interpretable instructions that alter any part of the reasoning process. They’re not buried inside a black-box model; they live in the open, where they can be scored, tested, and improved.

🧩 What Are Symbolic Rules?

Symbolic rules are:

  • Configurable: written in YAML or learned from data.
  • Targeted: applied based on agent name, goal metadata, or tags.
  • Composable: override model names, prompt paths, scoring functions, or even insert/remove agents.
  • Traceable: every rule is logged and linked to outcomes.

This gives us interpretable cognition: we can see why the system made a choice and how that choice impacted the result.

🧠 The Role of Symbolic Reasoning

Symbolic reasoning brings four critical capabilities:

  1. Interpretability: Unlike a static neural network, symbolic changes are visible and auditable.
  2. Trainability: Each rule application is scored, so we can learn which symbolic paths improve outcomes.
  3. Modularity: We can test and tune parts of the reasoning system in isolation.
  4. Adaptivity: Over time, the system learns to apply better rules in better contexts.

🔄 MR.Q gives the system feedback-driven instinct. 🔧 Symbolic rules give it programmable thought: conscious logic.

Together, they form a complete cognitive loop:

  1. MR.Q reacts: scoring outputs in real time using learned preferences.
  2. Symbolic rules reflect: modifying the strategy, choosing models, rerouting reasoning based on what has worked in the past.

This isn’t just execution; it’s deliberate self-adjustment. The system learns not only what to think, but how to think better next time.


🛠️ How It Works in the System

  • Every pipeline stage (generation, evaluation, refinement) can be modified.
  • Symbolic rules apply to agents dynamically via metadata or tags.
  • Rules can:
    • Change prompts, models, or scorers.
    • Insert new reasoning strategies like self-reflection.
    • Remove underperforming steps.
  • Each rule application is scored, just like outputs, using LLMs or MR.Q.

This gives us a self-aware system:

  • One that doesn’t just produce answers…
  • But learns how to improve its own thinking process.

📍 Symbolic rules are not just tweaks. They are representations of reasoning strategies: modular, evaluable, and trainable.


🔧 SymbolicRuleApplier: The Brainstem of Our Reasoning System

The SymbolicRuleApplier is the component that turns high-level symbolic knowledge into concrete action. It reads a set of human or machine-authored rules and injects them into the pipeline, altering how agents behave, think, and score without touching any agent code directly.

🧩 What It Does

At a high level, the SymbolicRuleApplier:

  1. Loads symbolic rules from YAML or the database.
  2. Filters rules based on the current goal and pipeline metadata.
  3. Applies overrides to any matching agent, prompt, or configuration stage.
  4. Logs all applications, so we can later trace which rules were active during hypothesis generation or scoring.

This makes the reasoning system programmable, auditable, and evolvable.

⚙️ How It Works

Each symbolic rule looks like this:

agent_name: HypothesisGenerator
metadata_filter:
  goal_type: scientific
override:
  model: mistral
  prompt: cot_enhanced.j2
  scorer: meta_review

The SymbolicRuleApplier matches this rule to any agent in the pipeline named HypothesisGenerator, but only if the current goal metadata includes goal_type: scientific.

Once matched, the overrides are applied. This might swap in a new prompt template, change the model being used, or configure a different scorer.

It uses a simple matching pattern:

  • Agent name match
  • Metadata filters (goal, topic, tags, etc.)
  • Optional pipeline or stage constraints

And every application is tracked:

{
  "rule_id": "rule-123",
  "pipeline_run_id": "run-789",
  "agent_name": "HypothesisGenerator",
  "context_hash": "abc123",
  "overrides": {
    "model": "mistral",
    "prompt": "cot_enhanced.j2"
  }
}

These are stored in a rule_applications table, enabling full traceability for analysis, tuning, and rule optimization later on.


🛠️ Code: SymbolicRuleApplier applying changes when required

import hashlib
import json
from pathlib import Path
from typing import Any, Dict

import yaml

# SymbolicRuleORM is the project's ORM class for symbolic rules.

class SymbolicRuleApplier:
    def __init__(self, cfg, memory, logger):
        self.cfg = cfg
        self.memory = memory
        self.logger = logger
        self.enabled = cfg.get("symbolic", {}).get("enabled", False)
        self._rules = self._load_rules() if self.enabled else []

    @property
    def rules(self) -> list:
        return self._rules
    
    def apply(self, context: dict) -> dict:
        if not self.enabled:
            return context

        goal = context.get("goal", {})
        pipeline_run_id = context.get("pipeline_run_id")
        current_pipeline = context.get("pipeline", [])

        matching_rules = [r for r in self.rules if self._matches_metadata(r, goal)]

        if not matching_rules:
            self.logger.log("NoSymbolicRulesApplied", {"goal_id": goal.get("id")})
            return context

        self.logger.log("SymbolicRulesFound", {"count": len(matching_rules)})

        for rule in matching_rules:
            if rule.rule_text and "pipeline:" in rule.rule_text:
                suggested_pipeline = (
                    rule.rule_text.split("pipeline:")[-1].strip().split(",")
                )
                suggested_pipeline = [
                    s.strip() for s in suggested_pipeline if s.strip()
                ]
                if suggested_pipeline:
                    self.logger.log(
                        "PipelineUpdatedBySymbolicRule",
                        {
                            "from": current_pipeline,
                            "to": suggested_pipeline,
                            "rule_id": rule.id,
                        },
                    )
                    context["pipeline"] = suggested_pipeline
                    context["pipeline_updated_by_symbolic_rule"] = True

            if rule.source == "lookahead" and rule.goal_type:
                context["symbolic_hint"] = f"use_{rule.goal_type.lower()}_strategy"

        return context

    def apply_to_agent(self, cfg: Dict, context: Dict) -> Dict:
        if not self.enabled:
            return cfg

        goal = context.get("goal", {})
        pipeline_run_id = context.get("pipeline_run_id")
        agent_name = cfg.get("name")

        matching_rules = [
            r
            for r in self.rules
            if r.agent_name == agent_name and self._matches_metadata(r, goal)
        ]

        if not matching_rules:
            self.logger.log(
                "NoSymbolicAgentRulesApplied",
                {
                    "agent": agent_name,
                    "goal_id": goal.get("id"),
                },
            )
            return cfg

        self.logger.log(
            "SymbolicAgentRulesFound",
            {
                "agent": agent_name,
                "goal_id": goal.get("id"),
                "count": len(matching_rules),
            },
        )

        for rule in matching_rules:
            # Apply new-style attributes
            if rule.attributes:
                for key, value in rule.attributes.items():
                    if key in cfg:
                        self.logger.log(
                            "SymbolicAgentOverride",
                            {
                                "agent": agent_name,
                                "key": key,
                                "old_value": cfg[key],
                                "new_value": value,
                                "rule_id": rule.id,
                            },
                        )
                    else:
                        self.logger.log(
                            "SymbolicAgentNewKey",
                            {
                                "agent": agent_name,
                                "key": key,
                                "value": value,
                                "rule_id": rule.id,
                            },
                        )
                    cfg[key] = value

            # Apply legacy rule_text (optional, for backward compatibility)
            if rule.rule_text:
                entries = [e.strip() for e in rule.rule_text.split(",") if e.strip()]
                for entry in entries:
                    if ":" in entry:
                        key, value = [s.strip() for s in entry.split(":", 1)]
                        if key in cfg:
                            self.logger.log(
                                "SymbolicAgentOverride",
                                {
                                    "agent": agent_name,
                                    "key": key,
                                    "old_value": cfg[key],
                                    "new_value": value,
                                    "rule_id": rule.id,
                                },
                            )
                        else:
                            self.logger.log(
                                "SymbolicAgentNewKey",
                                {
                                    "agent": agent_name,
                                    "key": key,
                                    "value": value,
                                    "rule_id": rule.id,
                                },
                            )
                        cfg[key] = value

            # Record the application of this rule
            self.memory.rule_effects.insert(
                goal_id=goal.get("id"),
                agent_name=agent_name,
                rule_id=rule.id,
                pipeline_run_id=pipeline_run_id,
                details=rule.to_dict(),
                stage_details=cfg,
            )

        return cfg


    def apply_prompt_rules(
            self, agent_name: str, prompt_cfg: dict, context: dict
        ) -> dict:
        """
        Applies prompt-level symbolic rules to the prompt config before generation.

        Returns the updated prompt_cfg.
        """
        goal = context.get("goal", {})
        applicable_rules = [
            rule
            for rule in self.rules
            if rule.agent_name == agent_name
            # and self._matches_filter(rule.filter, goal)
        ]

        if not applicable_rules:
            self.logger.log("NoPromptRulesFound", {"agent": agent_name})
            return prompt_cfg

        for rule in applicable_rules:
            for key, value in rule.attributes.items():
                self.logger.log(
                    "PromptAttributeOverride",
                    {
                        "agent": agent_name,
                        "key": key,
                        "old_value": prompt_cfg.get(key),
                        "new_value": value,
                        "rule_id": rule.id,
                        "emoji": "🛠️",
                    },
                )
                self.set_nested(prompt_cfg, key, value)

            # Optional: record the rule application
            self.memory.rule_effects.insert(
                rule_id=rule.id,
                goal_id=goal.get("id"),
                pipeline_run_id=context.get("pipeline_run_id"),
                details=prompt_cfg,
            )

        return prompt_cfg

    def set_nested(self, cfg: dict, dotted_key: str, value):
        keys = dotted_key.split(".")
        d = cfg
        for k in keys[:-1]:
            if k not in d or not isinstance(d[k], dict):
                d[k] = {}
            d = d[k]
        d[keys[-1]] = value

    def apply_to_prompt(self, cfg: Dict, context: Dict) -> Dict:
        if not self.enabled:
            return cfg

        goal = context.get("goal", {})
        pipeline_run_id = context.get("pipeline_run_id")
        prompt_name = cfg.get("prompt_key", "unknown_prompt")

        matching_rules = [
            r for r in self.rules
            if r.target == "prompt" and self._matches_filter(r.filter, goal)
        ]

        if not matching_rules:
            self.logger.log("NoSymbolicPromptRulesApplied", {
                "prompt": prompt_name,
                "goal_id": goal.get("id"),
            })
            return cfg

        self.logger.log("SymbolicPromptRulesFound", {
            "prompt": prompt_name,
            "goal_id": goal.get("id"),
            "count": len(matching_rules),
        })

        for rule in matching_rules:
            for key, value in rule.attributes.items():
                if key in cfg:
                    self.logger.log("SymbolicPromptOverride", {
                        "prompt": prompt_name,
                        "key": key,
                        "old_value": cfg[key],
                        "new_value": value,
                        "rule_id": rule.id,
                    })
                else:
                    self.logger.log("SymbolicPromptNewKey", {
                        "prompt": prompt_name,
                        "key": key,
                        "value": value,
                        "rule_id": rule.id,
                    })
                cfg[key] = value

            # Track the application of the prompt-level rule
            self.memory.rule_effects.insert(
                rule_id=rule.id,
                goal_id=goal.get("id"),
                pipeline_run_id=pipeline_run_id,
                agent_name=cfg.get("name", "prompt"),
                context_hash=self.compute_context_hash(context),
                run_id=context.get("run_id"),
            )

        return cfg

    def _matches_filter(self, filter_dict: dict, target_obj: dict) -> bool:
        """Generic matcher for symbolic rule filters"""
        for key, value in filter_dict.items():
            target_value = target_obj.get(key)
            if isinstance(value, list):
                if target_value not in value:
                    return False
            else:
                if target_value != value:
                    return False
        return True

    def track_pipeline_stage(self, stage_dict: dict, context: dict):
        self.memory.symbolic_rules.track_pipeline_stage(stage_dict, context)

    @staticmethod
    def get_nested_value(d: dict, key_path: str):
        keys = key_path.split(".")
        for key in keys:
            d = d.get(key, {})
        return d if d else None

    @staticmethod
    def set_nested_value(d: dict, key_path: str, value):
        keys = key_path.split(".")
        for key in keys[:-1]:
            d = d.setdefault(key, {})
        d[keys[-1]] = value

    def _load_rules(self):
        rules = []
        symbolic_dict = self.cfg.get("symbolic", {})
        if symbolic_dict.get("rules_file"):
            rules += self._load_rules_from_yaml(symbolic_dict.get("rules_file"))
        if symbolic_dict.get("enable_db_rules", True):
            rules += self.memory.symbolic_rules.get_all_rules()
        return rules

    def _load_rules_from_yaml(self, path: str) -> list:
        if not Path(path).exists():
            self.logger.log("SymbolicRuleYAMLNotFound", {"path": path})
            return []

        with open(path, "r", encoding="utf-8") as f:
            raw = yaml.safe_load(f)

        rules_list = raw.get("rules", raw)

        rules = []
        existing_rules = {
            r.rule_text for r in self.memory.symbolic_rules.get_all_rules()
        }
        for item in rules_list:
            if isinstance(item, dict) and item.get("rule_text") not in existing_rules:
                rules.append(SymbolicRuleORM(**item))
            else:
                self.logger.log(
                    "DuplicateSymbolicRuleSkipped", {"rule_text": item.get("rule_text")}
                )
        return rules

    def _matches_metadata(self, rule: SymbolicRuleORM, goal: Dict[str, Any]) -> bool:
        if rule.goal_id and rule.goal_id != goal.get("id"):
            return False
        if rule.goal_type and rule.goal_type != goal.get("goal_type"):
            return False
        if rule.goal_category and rule.goal_category != goal.get("goal_category"):
            return False
        if rule.difficulty and rule.difficulty != goal.get("difficulty"):
            return False
        if goal.get("focus_area") and rule.goal_category:
            if rule.goal_category != goal.get("focus_area"):
                return False
        return True

    @staticmethod
    def compute_context_hash(context_dict: dict) -> str:
        canonical_str = json.dumps(context_dict, sort_keys=True)
        return hashlib.sha256(canonical_str.encode("utf-8")).hexdigest()

🔄 Why It Matters

This mechanism unlocks several powerful capabilities:

  • Composable behavior: Symbolic rules can be layered, overridden, or evolved independently — without hardcoding logic or rewriting pipelines.
  • Goal-conditioned intelligence: Different goals, domains, or agents can trigger different symbolic strategies, enabling adaptive reasoning paths.
  • Self-improvement: Because every symbolic rule is tied to outcomes, we can score them over time and tune them just like we tune models — via performance feedback.

The SymbolicRuleApplier makes this possible. It acts as a dynamic switchboard, routing cognitive tasks through the right tools based on metadata, goal context, and symbolic configuration.

But here’s what we learned the hard way:

Randomly generating and applying rules doesn’t work. In fact, it often made things worse.

To be effective, symbolic tuning needs context awareness and guardrails. That’s why we introduced:

  • Precise matching logic (by goal type, agent, tags, etc.),
  • Tunable configuration spaces (with legal values and constraints),
  • A structured prompt to propose meaningful, context-specific changes.

This turned symbolic reasoning from brittle guesswork into a robust feedback loop — one that evolves with the system, not against it.


🧮 Why We Introduced the RuleOptionsConfig

As we began building the Rule Mutation Agent, we faced a critical design challenge: how can we let the AI intelligently mutate symbolic rules without producing invalid, incoherent, or redundant configurations?

Early experiments relying on open-ended LLM completions quickly ran into problems. The model would propose configurations that weren’t grounded in reality, suggest nonsensical parameter combinations, or recommend changes that had already been tried. Worse, validating these freeform suggestions introduced unnecessary complexity, making the system harder to debug and extend.

To solve this, we introduced a structured mechanism: the RuleOptionsConfig.

This configuration object, backed by a simple YAML file, defines the legal mutation space for each rule. For every tunable parameter (like which model to use, whether to enable documentation, or what enhancement strategy to apply), the config explicitly lists:

  • the valid options,
  • the default value,
  • and (optionally) constraints or metadata for smarter decision-making.

By constraining the mutation space, we give the LLM just enough freedom to explore meaningful changes while ensuring every suggestion is:

  • Valid (it exists in the defined config),
  • Unique (it hasn’t already been applied),
  • Actionable (we know how to implement it immediately).

This design does more than simplify engineering; it shifts the paradigm from open-ended prompt tinkering to structured prompt programming. The AI is no longer guessing; it’s choosing from a well-defined menu, guided by past performance and optimization objectives.

Ultimately, RuleOptionsConfig gives us the foundation for safe, interpretable, and scalable self-improvement in our symbolic AI system, enabling a closed-loop process where rules evolve intelligently over time.
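
To make the idea concrete, here is a minimal sketch of what such a config object might look like. The YAML layout and attribute names in the comment are illustrative assumptions; the methods the mutation agent calls later in this post are from_yaml, get_options_for, and is_valid_change.

import yaml

# Hypothetical layout for config/rules/pipeline_mutation_options.yaml:
#
# ChainOfThoughtAgent:
#   model.name:
#     options: ["ollama_chat/mistral", "ollama_chat/qwen3"]
#     default: "ollama_chat/mistral"
#   prompt_file:
#     options: ["cot_standard.j2", "cot_enhanced.j2"]
#     default: "cot_standard.j2"

class RuleOptionsConfig:
    """Defines the legal mutation space per agent (sketch)."""

    def __init__(self, options_by_agent: dict):
        self.options_by_agent = options_by_agent

    @classmethod
    def from_yaml(cls, path: str) -> "RuleOptionsConfig":
        with open(path, "r", encoding="utf-8") as f:
            return cls(yaml.safe_load(f) or {})

    def get_options_for(self, agent_name: str) -> dict:
        # Map each tunable attribute to its list of legal values.
        raw = self.options_by_agent.get(agent_name, {})
        return {attr: spec.get("options", []) for attr, spec in raw.items()}

    def is_valid_change(self, agent_name: str, attribute: str, value) -> bool:
        # A change is legal only if the attribute is tunable for this agent
        # and the proposed value is one of its declared options.
        return value in self.get_options_for(agent_name).get(attribute, [])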

flowchart TD
    A[Start Pipeline Run] --> B[Apply Symbolic Rules]
    B --> C[Execute Pipeline Stages]
    C --> D[Collect Performance Scores]

    D --> E{Low-Performing Rule?}
    E -- No --> Z[End]
    E -- Yes --> F[Select Rule to Mutate]

    F --> G[Load RuleOptionsConfig]
    G --> H[Generate Mutation Prompt with Options]
    H --> I[LLM Suggests Mutation]
    I --> J{Valid & Unused Option?}
    J -- No --> H
    J -- Yes --> K[Apply Mutated Rule]

    K --> L[Execute New Pipeline Run]
    L --> M[Collect New Scores]
    M --> N[Compare Score Delta]

    N --> O[Log Mutation Effectiveness]
    O --> Z

    style Z fill:#eef,stroke:#333,stroke-width:2px
    style G fill:#ffd,stroke:#cc8,stroke-width:2px
  

🧬 Rule Mutation: Turning Symbols into Intelligence

At this point, we’ve built out a symbolic map of the system. Symbols live at every level — from prompts to agents to entire pipeline stages — and we can tag, target, and configure them dynamically.

But having symbols isn’t intelligence.

They’re just markers — a way to identify parts of the system. What turns them into thinking? Directed mutation.

Only when we start changing these symbols — testing variations, measuring outcomes, and refining based on feedback — does the system become intelligent.

Next we’ll show how symbolic rules evolve through targeted mutation, and how each change nudges the system toward better reasoning.

🛠️ Prompting the mutation

This prompt is part of our symbolic tuning loop. In our system, symbolic rules control key decisions — like which model to use, which prompt template to run, or which scoring method to apply. These rules define the system’s conscious strategies for reasoning.

The prompt is designed to tune one symbolic rule at a time by proposing a targeted, data-driven change.

It does three things:

  1. Summarizes the current rule, including its attributes and available tuning options.
  2. Presents recent performance insights, helping the system reflect on what’s working and what isn’t.
  3. Asks for a single, well-justified change, making the update both interpretable and traceable.

This turns symbolic rule tuning into a structured, feedback-driven process — a key part of how our AI system evolves its reasoning behavior over time.

You are helping improve the performance of an AI system by tuning one of its symbolic rules.

### Current Configuration
**Target Behavior**: {{ target }}

**Current Rule Attributes:**
{% for attr, val in current_attributes.items() %}
- **{{ attr }}**: {{ val }}
{% endfor %}

**Tunable Options:**
{% for attr, options in available_options.items() %}
- **{{ attr }}**: {{ options }}
{% endfor %}

{% if recent_performance %}
### Recent Performance Insights:
{{ recent_performance }}
{% endif %}

---

### Your Task:
Propose exactly **one change** to this symbolic rule that is likely to improve the system's performance on the target behavior. This change should be grounded in your understanding of the rule's role and the available options.

### Response Format:

Rationale: <Your reasoning>

Attribute to change: <attribute_name>
New value: <new_value>

**Do not change more than one attribute. Be specific and actionable.**

The quality of any self-improving AI system is deeply tied to the quality of its entry points: the prompts that guide its mutation, tuning, or optimization behavior. In our system, symbolic rules define interpretable, modular behavior. Mutating them is how the system adapts. But how we ask the model to mutate those rules makes all the difference.

That’s why we’ve invested care in designing a dedicated mutation prompt for symbolic rule tuning.

🎯 Clarity and Constraint Lead to Precision

The prompt begins with a clear directive:

“You are helping improve the performance of an AI system by tuning one of its symbolic rules.”

This primes the model with purpose and limits the task scope to only one rule, avoiding unnecessary complexity. The use of explicit structure, including the current attributes, tunable options, and (optionally) recent performance, gives the model context-rich input without ambiguity.

🤔 Focused Mutation Encourages Learnability

We require the model to propose exactly one change, formatted cleanly as:

Rationale: ...
Attribute to change: ...
New value: ...

This has two key benefits:

  1. Interpretability: The output is immediately parseable and actionable.
  2. Trainability: It generates high-quality training data for potential future fine-tuning or reward modeling.

Because every mutation is singular and explicit, it becomes possible to track its downstream effects with precision, enabling score attribution, rollback, and even symbolic meta-learning.
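
Because the format is fixed, parsing it is straightforward. Here is a small sketch of the kind of parser this format enables; the real RuleTuner.parse_mutation_response used later may differ in detail:

import re

def parse_mutation_response(response: str) -> dict:
    """Extract rationale, attribute, and new value from a single-change response (sketch)."""
    def grab(label: str):
        m = re.search(rf"{label}\s*:\s*(.+)", response, re.IGNORECASE)
        return m.group(1).strip() if m else None

    return {
        "rationale": grab("Rationale"),
        "attribute": grab("Attribute to change"),
        "new_value": grab("New value"),
    }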

🧩 Modularity and Adaptation

The prompt is designed to scale across domains and dimensions. The target, attributes, and options are dynamically injected, making this format reusable across different goals, agents, or performance dimensions. Optional recent performance feedback allows us to “focus” the mutation when history is available, without breaking the structure when it’s not.

✔️ Why This Prompt Is a Leverage Point

In a pipeline of dozens of intelligent steps, this prompt is the one that decides what changes. It is the mutation gateway. A vague or poorly designed prompt here can lead to ineffective changes, wasted evaluation cycles, and ultimately degraded system performance.

Conversely, this tightly structured, context-aware, and minimalistic prompt ensures every mutation is deliberate, grounded, and evaluable.

In short: this prompt is not just an input; it’s a lever that drives the evolution of the system.

Here is an example response in that format:

Rationale: The current configuration consistently uses 'ollama_chat/mistral' without performance metrics to validate its effectiveness. Testing a different model like 'ollama_chat/qwen3' could potentially improve performance by leveraging a model with different strengths (e.g., specialized capabilities or efficiency). This change directly addresses the need to experiment with alternative models while maintaining the same rule structure.

The attribute you want to change: model.name  
The value you want to change to: ollama_chat/qwen3

🔧 Rule Mutation as Dimensional Tuning: One Attribute at a Time

We made a deliberate design choice:

We mutate exactly one attribute per rule per mutation.

This is a deliberate strategy.

🎯 Why One Attribute at a Time?

  1. Isolated Impact: By changing a single attribute (like the model or prompt flavor), we get a clean signal: any change in performance can be attributed directly to that mutation.

  2. Multi-Dimensional Score Feedback: Every mutated rule results in a new pipeline run. That run is scored across multiple dimensions: correctness, clarity, alignment, feasibility, and more. These aren’t just ratings; they’re high-resolution signals that tell us how the mutation affected performance.

  3. Gradual, Exhaustive Search: Our space of symbolic rules is small by design, just a handful of parameters (model, adapter, prompt, etc.). This makes exhaustive evaluation tractable: over time, the agent can systematically explore all valid mutations and their performance impact (see the sketch after this list).
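
Under the one-attribute constraint, the search space is roughly the sum of alternatives per attribute rather than the product of all combinations, which is what keeps exhaustive exploration tractable. A back-of-the-envelope sketch, with purely illustrative attribute names and option counts:

# Hypothetical tunable space for one agent.
options = {
    "model.name": ["ollama_chat/mistral", "ollama_chat/qwen3", "ollama_chat/phi3"],
    "prompt_file": ["cot_standard.j2", "cot_enhanced.j2"],
    "adapter": ["none", "lora_v1"],
}

# Single-attribute mutations: swap one value while holding the rest fixed.
single_mutations = sum(len(values) - 1 for values in options.values())  # 2 + 1 + 1 = 4

# Full combinatorial space, for comparison.
all_configs = 1
for values in options.values():
    all_configs *= len(values)  # 3 * 2 * 2 = 12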


📐 How It Works

Here’s what the Rule Mutation Agent does:

  1. Loads current rules for a target agent, based on goal and config.

  2. Finds available mutations using a structured RuleOptionsConfig.

  3. Generates a mutation prompt (via Jinja template) to propose a single, meaningful change.

  4. Validates the change using:

    • Legal options (from config)
    • Novelty (not already tried)
  5. Applies the mutation, stores the new rule, and logs it.

  6. Tracks outcomes for each mutated rule: performance over time is stored and evaluated.

The result is a slow but steady walk across the rule space, guided not by trial and error but by structured reasoning and real-world performance data.


🧠 Meta-Tuning Beyond Dimensional Scores

Each individual run yields a multi-dimensional score, but the real magic happens when we treat rule mutations themselves as tunable variables.

We’re not just scoring an agent’s hypothesis for quality. We’re scoring the effectiveness of changing a single config parameter.

This lets us answer questions like:

  • “Does switching from Model A to Model B improve factuality for reasoning goals?”
  • “Which prompt variant yields more original ideas on complex tasks?”

In this sense, the Rule Mutation Agent operates one dimension above scoring: it’s meta-tuning the entire system.


🔁 Towards Self-Tuning Systems

As more data accumulates, the agent builds a richer understanding of:

  • Which mutations consistently improve which dimensions
  • How performance changes over time
  • Which combinations are stable or fragile

This leads to a self-improving system: one that learns from every iteration, pruning bad paths and converging toward robust configurations.

And with MR.Q in the loop, the feedback isn’t just numeric; it’s comparative, contextual, and scalable.


🧬 Rule Signatures: Tuning Upon Tuning

Every symbolic rule in our system comes with a unique signature: a deterministic fingerprint based on its configuration (e.g., model, prompt type, adapter, and other attributes). This signature lets us do two powerful things:

  1. Avoid Duplication: Before applying a mutation, we check whether that exact configuration (i.e., rule signature) already exists in memory. If it does, we skip it. This ensures no redundant exploration.

  2. Track Performance Over Time: Because signatures are stable, we can aggregate performance results over multiple runs and see how a given configuration performs consistently, not just once (see the sketch below).
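
Computing such a signature can be as simple as hashing the rule's canonical JSON form, mirroring the compute_context_hash helper shown earlier. The exact fields included in the real signature are an assumption here:

import hashlib
import json

def rule_signature(agent_name: str, attributes: dict) -> str:
    """Deterministic fingerprint for a rule configuration (sketch)."""
    canonical = json.dumps({"agent": agent_name, "attributes": attributes}, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# The same configuration always yields the same signature, so duplicates are easy
# to skip and performance can be aggregated per signature across many runs.
sig = rule_signature("HypothesisGenerator", {"model": "mistral", "prompt": "cot_enhanced.j2"})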


🧠 The Self-Tuning Loop

Putting it all together:

  • We mutate one attribute at a time.
  • Each mutation creates a new rule with a unique signature.
  • We score that rule using fast MR.Q-based multidimensional evaluation.
  • We record the performance by signature.
  • We avoid repeating known rules.
  • Over time, we cover the entire space of possible rule configurations.

This gives us an exhaustive, memory-aware, self-improving loop.

The system doesn’t just tune prompts or agents. It tunes its own tuning process, layer by layer.

With a limited number of mutable attributes and a fast scoring layer tuned to LLM judgments, we can search a surprisingly large space of behaviors efficiently and intelligently.

This is how the system builds tuning upon tuning: each stage is not just improving performance; it’s improving the way performance is improved.


    
flowchart TD
    A[New Hypothesis to Score] --> B[Embed Goal + Hypothesis]
    B --> C[Retrieve Similar Hypotheses from Memory]
    C --> D[Select Top-Scoring Neighbor LLM-labeled]
    D --> E[Extract LLM Score as Pseudo-Label]
    E --> F[Train MR.Q on Hypothesis, Pseudo-Label]
    F --> G[MR.Q Predicts Score for New Hypothesis]
    G --> H[Return Score to Downstream Agent]

    subgraph Memory Store
        C
        D
    end

    subgraph MR.Q
        F
        G
    end
  

🔁 Adaptive Learning Loop

Our AI agents generate hypotheses → LLM scores some → SVM and MR.Q train → MR.Q scores others.

This forms a feedback loop, letting the system:

  • Learn what “good” means from LLMs
  • Approximate those evaluations cheaply
  • Adapt scoring in real time
    
flowchart LR
    A[🧠 AI Agent<br>Generates Hypotheses] --> B[✅ LLM<br>Scores a Few Hypotheses]
    B --> C[📊 Train SVM + MR.Q<br>on LLM Scores]
    C --> D[⚡ MR.Q<br>Scores Remaining Hypotheses]
    D --> E[🔁 Feedback Loop<br>Improves Future Scoring]
    E --> A

    style A fill:#f9f,stroke:#333,stroke-width:2px
    style B fill:#bbf,stroke:#333,stroke-width:2px
    style C fill:#bfb,stroke:#333,stroke-width:2px
    style D fill:#ffd,stroke:#333,stroke-width:2px
    style E fill:#fdd,stroke:#333,stroke-width:2px
  

💥 Part 3: Directed Action

If MR.Q is the subconscious, always reacting and adjusting beneath the surface… And symbolic reasoning is the conscious mind, planning, evaluating, and deciding…

Then the pipeline is the system’s body in motion — the sequence of real, observable actions taken in pursuit of a goal.

Each pipeline is a directed structure: a chain of agents, prompts, scorers, and evaluators working together to produce an outcome. It’s the place where thought becomes behavior — where intention turns into execution.

But these pipelines aren’t static. They’re dynamic, programmable, and adaptable. They mutate in real time based on symbolic rules, shift strategy based on scoring feedback, and evolve as the system learns what works.

This is the third layer of cognition in our architecture:

Directed action — a live, intelligent execution path that reshapes itself to pursue better results.

In this section, we’ll explore how pipelines operate as goal-driven programs, how they are assembled from modular reasoning steps, and how our system continuously mutates them to move more intelligently through problem space.

🧬 Pipeline-Level Mutation: A Design That Scales with Intelligence

As our self-improving system matured, we confronted a pivotal architectural question: where should the mutation logic live? At first glance, mutating prompts or strategies inside a single agent seemed natural. After all, agents are where hypotheses are generated, scored, and refined.

But the deeper we went, the clearer it became: the pipeline is the true unit of reasoning.

❓ Why Pipelines?

Pipelines in our Co AI framework are not just sequences of stages; they’re intelligent workflows. Each pipeline defines:

  • Which agent is used for generation (ChainOfThoughtAgent, SelfEditAgent, etc.)
  • What prompt template or reasoning strategy is invoked
  • Which scoring system is used (LLM, MR.Q, or SVM)
  • What tuning or evaluation configuration is applied
  • Is automatically scored on completion across a number of dimensions.

By performing mutations at the pipeline configuration level, we unlock the full expressive power of the system:

| Benefit | Description |
|---|---|
| 🔄 Full Stack Swaps | Swap generation agents, prompt formats, and scoring methods in one mutation |
| 🧠 Global Reasoning Context | Test how changes propagate through reasoning, reflection, and evaluation |
| 🧪 Clean Experimentation | Each mutated pipeline is a reproducible, end-to-end experiment |
| 💾 Integrated Logging + Storage | Each mutation run logs hypotheses, scores, rules, and performance metrics |
| 🔁 Seamless Integration | Mutated pipelines plug directly into existing training, ranking, and tuning |

🧑‍🏫 The PipelineMutationAgent

We encapsulated this design into a specialized agent: PipelineMutationAgent. It takes a base pipeline configuration, consults a symbolic rule mutation config, and applies each mutation by:

  1. Creating a mutated pipeline config (e.g., swapping in a new agent or prompt file)
  2. Launching a full pipeline run via the Supervisor
  3. Logging results, scores, and rule impacts for later learning

This design keeps our architecture modular, scalable, and fully observable: every mutation is trackable, comparable, and tunable across dimensions.

🧩 A Foundation for Self-Tuning AI

By making pipeline mutation a first-class primitive, we’ve set the stage for an even larger vision: meta-reasoning about reasoning. We can now track which configurations work best for different types of goals, and begin training systems that dynamically select or mutate pipelines based on problem characteristics.

This is no longer just prompt tuning. This is full-system evolution, with pipelines as the genome, mutations as the evolutionary driver, and the supervisor as the execution engine.

🧠 Smarter Pipeline Selection via Descriptive Variants and LLM Guidance

As our AI system evolved to support multiple reasoning pipelines (such as basic generation, chain-of-thought (CoT), or sharpened refinement loops), it became increasingly important to choose the right pipeline for the right goal. Hardcoding this selection logic was too brittle and required frequent manual updates. We needed a more flexible, scalable, and intelligent approach.

🔧 The PipelineRegistry Class: Structured Control with Metadata

We extended our PipelineRegistry class to support not only loading pipeline definitions from YAML, but also attaching descriptive metadata to each variant:

pipeline_variants:
  cot:
    description: "A chain-of-thought based reasoning strategy that uses two different generators and a ranker."
    stages:
      - name: cot_generator
      - name: ranking
      - name: cot_dspy_generator

  minimal:
    description: "A basic generation pipeline with no reasoning steps. Fast and lightweight."
    stages:
      - name: generation

With this change, our system can now reason about each pipeline not just by name, but by its intended purpose, strengths, and tradeoffs.

The PipelineRegistry class was updated with a new method:

def list_variants_with_descriptions(self) -> list[dict]:
    return [
        {"name": name, "description": variant.get("description", "")}
        for name, variant in self.pipeline_variants.items()
    ]

This allows any component, including agents or scoring systems, to programmatically retrieve all available pipeline options and their metadata.
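
For context, here is a minimal sketch of what the registry around that method might look like, assuming the pipeline_variants YAML layout shown above; the production class almost certainly carries more logic:

import yaml

class PipelineRegistry:
    """Loads pipeline variants and their descriptions from YAML (sketch)."""

    def __init__(self, path: str):
        with open(path, "r", encoding="utf-8") as f:
            data = yaml.safe_load(f) or {}
        self.pipeline_variants = data.get("pipeline_variants", {})

    def get_pipeline(self, name: str) -> list:
        # Return the list of stage dicts for a named variant, or an empty list.
        return self.pipeline_variants.get(name, {}).get("stages", [])

    def get_description(self, name: str) -> str:
        return self.pipeline_variants.get(name, {}).get("description", "")

    def list_variants_with_descriptions(self) -> list[dict]:
        return [
            {"name": name, "description": variant.get("description", "")}
            for name, variant in self.pipeline_variants.items()
        ]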


🧩 Prompt-Based Pipeline Selection: Let the LLM Decide

To make this system truly intelligent, we introduced an LLM-driven selector. Instead of relying on hardcoded rules, we now generate a prompt that describes:

  • The current goal
  • The current pipeline and its purpose
  • All available pipeline options and their descriptions
  • Recent performance data (if available)

We then ask the LLM to suggest the most appropriate pipeline for the goal.

Here’s a simplified version of the prompt we use:

## Goal:
{{ goal_text }}

## Current Pipeline:
Name: {{ current_pipeline_name }}
Description: {{ current_pipeline_description }}

## Available Pipelines:
- Name: cot. Description: A chain-of-thought based reasoning strategy...
- Name: minimal. Description: A basic generation pipeline...

## Recent Performance (optional):
{{ summary }}

### Task:
Suggest the most appropriate pipeline to achieve the goal.

### Response:
Rationale: <your reasoning>

Pipeline: <pipeline_name>

This approach enables dynamic adaptation of our reasoning process without the need to hardwire domain logic. It’s also interpretable: every decision comes with a rationale we can inspect, debug, and even retrain on.


🤖 Integrating Into the Mutation Loop

This pipeline selection logic is now part of the broader pipeline mutation system. Whenever the AI explores new strategies, it can:

  1. Consider a goal.
  2. Ask the LLM to select the best-fit pipeline.
  3. Inject that pipeline into the configuration.
  4. Execute it.
  5. Score the results.
  6. Repeat.

This integration allows the system to self-optimize not just within a single pipeline, but across multiple reasoning strategies, laying the groundwork for truly adaptive AI.

import copy
import re
from datetime import datetime

import yaml
from omegaconf import OmegaConf

# Project-internal imports (BaseAgent, PipelineRegistry, RuleOptionsConfig,
# RuleTuner, Supervisor, SymbolicRuleORM) are omitted here.

class PipelineMutationAgent(BaseAgent):
    """
    Combines symbolic rule mutation with pipeline configuration mutation.
    Generates both types of mutations, applies them, evaluates outcomes,
    and logs improvements for future learning.
    """

    def __init__(
        self,
        cfg,
        memory,
        logger,
        full_cfg=None,
    ):
        super().__init__(cfg, memory, logger)
        self.full_cfg = full_cfg
        self.target_agent = cfg.get("target_agent", "default")
        self.mutation_prompt_template = cfg["rule_mutation_prompt"]
        self.max_runs = cfg.get("max_runs", 5)


        # Load base pipeline
        self.base_pipeline_key = cfg.get("base_pipeline", "minimal")
        self.pipeline_registry_path = cfg.get("pipeline_registry", "config/registry/pipeline_registry.yaml")
        self.pipeline_registry = PipelineRegistry(self.pipeline_registry_path)

        self.rule_options_file = cfg.get("mutation_rule_options", "config/rules/pipeline_mutation_options.yaml")
        self.options_config = RuleOptionsConfig.from_yaml(self.rule_options_file)
        self.rule_tuner = RuleTuner(memory, logger)

        self.logger.log(
            "PipelineMutationAgentInitialized",
            {"conf": self.cfg}
        )

    async def run(self, context: dict) -> dict:
        # Step 1: Generate pipeline config mutations
        pipeline_def = self.pipeline_registry.get_pipeline(self.base_pipeline_key)
        if not pipeline_def:
            self.logger.log("PipelineNotFound", {"pipeline": self.base_pipeline_key})
            context["status"] = "pipeline_not_found"
            return context

        _, pipeline = self._generate_pipeline_mutations(self.base_pipeline_key, context) 


        # Step 2: Generate symbolic rule mutations
        applicable_rules = self._get_applicable_rules(pipeline)
        symbolic_mutations = []
        for rule in applicable_rules:
            symbolic_mutations.extend(self._generate_rule_mutations(rule, context))

        # Step 3: Apply and evaluate symbolic mutations
        symbolic_results = await self._apply_and_evaluate(symbolic_mutations, context)

        pipeline_to_mutate_def = self.pipeline_registry.get_pipeline(pipeline)

        # Step 4: Apply and evaluate pipeline mutations
        pipeline_results = await self._apply_pipeline_mutations(pipeline_to_mutate_def, symbolic_results, context)

        # Step 5: Log all results
        context["mutated_symbolic_rules"] = [r.to_dict() for r in symbolic_results]
        context["mutated_pipeline_runs"] = pipeline_results
        context["total_mutations_run"] = len(symbolic_results) + len(pipeline_results)

        return context

    def _get_applicable_rules(self, pipeline_name: str) -> list:
        """Get all relevant symbolic First you need to finish this for all agents in a given pipeline."""
        pipeline_def = self.pipeline_registry.get_pipeline(pipeline_name)
        agent_names = {stage.get("name") for stage in pipeline_def if "name" in stage}

        # Filter rules where the rule's agent matches any in the pipeline
        return [
            r for r in self.memory.symbolic_rules.get_all()
            if r.agent_name in agent_names
        ]

    def _generate_rule_mutations(self, rule: SymbolicRuleORM, context: dict) -> list[dict]:
        """Use LLM to generate one or more valid mutations for this rule."""
        current_attrs = rule.attributes or {}
        available_options = self.options_config.get_options_for(rule.agent_name)
        recent_perf = self.memory.rule_effects.get_recent_performance(rule.id)

        merged = {
            "current_attributes": current_attrs,
            "available_options": available_options,
            "recent_performance": recent_perf,
            **context
        }

        prompt = self.prompt_loader.from_file(self.mutation_prompt_template, self.cfg, merged)
        response = self.call_llm(prompt, context)
        parsed = RuleTuner.parse_mutation_response(response)

        if not parsed.get("attribute") or not parsed.get("new_value"):
            self.logger.log("MutationParseError", {"rule_id": rule.id, "response": response})
            return []

        attr = parsed["attribute"]
        new_val = parsed["new_value"]

        if not self.options_config.is_valid_change(rule.agent_name, attr, new_val):
            self.logger.log("InvalidRuleMutation", {"rule_id": rule.id, "attribute": attr, "value": new_val})
            return []

        if self.memory.symbolic_rules.exists_similar(rule, attr, new_val):
            self.logger.log("RuleMutationDuplicateSkipped", {"rule_id": rule.id, "attribute": attr, "value": new_val})
            return []

        mutated_attrs = dict(current_attrs)
        mutated_attrs[attr] = new_val

        new_rule = SymbolicRuleORM(
            target="agent",
            agent_name=rule.agent_name,
            goal_type=rule.goal_type,
            goal_category=rule.goal_category,
            difficulty=rule.difficulty,
            attributes=mutated_attrs,
            source="mutation",
        )
        self.memory.symbolic_rules.insert(new_rule)
        self.logger.log("RuleMutat I ionApplied", {"original_rule_id": rule.id, "new_rule": new_rule.to_dict()})
        return [new_rule]

    def _generate_pipeline_mutations(self, pipeline_name, context):
        """Generate pipeline config mutations using LLM guidance"""

        merged_context = {
            # From pipeline definition
            "current_pipeline_name": pipeline_name,
            "current_pipeline_description": self.pipeline_registry.get_description(pipeline_name),
            "current_pipeline": self.pipeline_registry.get_pipeline(pipeline_name),  # handles if it's a full pipeline block

            # From context (goal and performance)
            "goal_text": context.get("goal", {}).get("goal_text", "Improve pipeline performance"),
            "goal_id": context.get("goal", {}).get("id"),
            #TODO
            # "recent_performance": self.memory.rule_effects.get_recent_performance_summary(),

            # Optionally, inject available options for better prompting
            "available_pipelines": self.pipeline_registry.list_variants_with_descriptions(),  # e.g., [{"name": ..., "description": ...}, ...]

            # Pass original context for compatibility
            **context,
        }

        prompt = self.prompt_loader.from_file("pipeline", self.cfg, merged_context)
        response = self.call_llm(prompt, context)
        rationale, pipeline = self._parse_pipeline_mutation(response)

        if not pipeline:
            self.logger.log("PipelineMutationParseError", {"response": response})
            return None, None

        return rationale, pipeline

    def _parse_pipeline_mutation(self, response: str):
        """Parse an LLM response into a (rationale, pipeline_name) pair."""
        pattern = r"""
        (?:[*#`]*\s*)?            # Optional formatting characters before the header
        rationale\s*:             # Match the word "rationale:"
        \s*(?P<rationale>.*?)     # Capture rationale content non-greedily
        (?:\n|\r|\r\n)+           # Match the newline(s) separating the two blocks
        (?:[*#`]*\s*)?            # Optional formatting characters before the second header
        pipeline\s*:\s*           # Match "pipeline:"
        (?P<pipeline>\w+)         # Capture pipeline name
        """
        match = re.search(pattern, response, re.IGNORECASE | re.DOTALL | re.VERBOSE)

        rationale, pipeline = None, None
        if match:
            rationale = match.group("rationale").strip()
            pipeline = match.group("pipeline").strip()
        return rationale, pipeline
            
    async def _apply_and_evaluate(self, mutations: list[SymbolicRuleORM], context: dict) -> list[SymbolicRuleORM]:
        """Apply each symbolic mutation and evaluate its effect."""
        results = []

        for rule in mutations:
            new_config = self._apply_symbolic_rule(rule)
            mutated_context = self._update_context_with_config(context, new_config)

            supervisor = Supervisor(self.full_cfg, memory=self.memory, logger=self.logger)
            result = await supervisor.run_pipeline_config(mutated_context)

            score = self._evaluate_result(result)
            self._log_evaluation(rule, score)

            if score > 0.5:
                results.append(rule)

        return results

    def _apply_symbolic_rule(self, rule: SymbolicRuleORM):
        """Apply symbolic rule to config"""
        # You could do deeper merging here based on agent name
        return {f"{rule.agent_name}.config": rule.attributes}

    def _update_context_with_config(self, context, config_update):
        """Merge symbolic config into context"""
        ctx_copy = copy.deepcopy(context)
        ctx_copy.update(config_update)
        return ctx_copy

    async def _apply_pipeline_mutations(self, pipeline_def, mutations: list, context: dict) -> list:
        """Apply pipeline mutations and run through supervisor"""
        results = []

        for i, mutation in enumerate(mutations):
            if i >= self.max_runs:
                self.logger.log("PipelineMutationLimitReached", {"limit": self.max_runs})
                break

            mutated_pipeline = self.apply_mutation(pipeline_def, mutation)
            mutated_cfg = self.inject_pipeline_config(mutated_pipeline, tag=f"mutated_{i}")

            full_mutated_cfg = OmegaConf.merge(mutated_cfg, self.full_cfg)
            supervisor = Supervisor(full_mutated_cfg, memory=self.memory, logger=self.logger)

            try:
                mutated_run = await supervisor.run_pipeline_config(context)
                summary = self.summarize(mutated_run)
                self.logger.log("PipelineMutationRun", {"mutation": mutation, "summary": summary})
                results.append({"mutation": mutation, "result": mutated_run})
            except Exception as e:
                self.logger.log("PipelineMutationError", {"mutation": mutation, "error": str(e)})

        return results

    def apply_mutation(self, pipeline_cfg: list, mutation: dict) -> list:
        """Apply a single mutation to a deep copy of the pipeline config."""
        mutated = copy.deepcopy(pipeline_cfg)
        for key, value in mutation.items():
            keys = key.split(".")
            target = mutated
            for k in keys[:-1]:
                target = target.setdefault(k, {})
            target[keys[-1]] = value
        return mutated

    def inject_pipeline_config(self, pipeline_def, tag="mutated") -> OmegaConf:
        """Replace pipeline stages in full config"""
        full_cfg = OmegaConf.to_container(self.full_cfg, resolve=True)
        full_cfg["pipeline"]["tag"] = tag
        full_cfg["pipeline"]["stages"] = pipeline_def
        full_cfg["agents"] = {stage["name"]: stage for stage in pipeline_def}
        return OmegaConf.create(full_cfg)

    def _evaluate_result(self, result: dict) -> float:
        """Score mutation outcome using MRQScorer or other scorer"""
        score = result.get("best_score", 0.0)
        return score

    def _log_evaluation(self, rule: SymbolicRuleORM, score: float):
        """Log mutation and evaluation result"""
        self.memory.scorer.score_db.append({
            "rule_id": rule.id,
            "score": score,
            "timestamp": datetime.now(),
        })

    def summarize(self, result: dict) -> dict:
        """Return short summary for logging"""
        return {
            "goal_id": result.get("goal", {}).get("id"),
            "best_score": result.get("best_score"),
            "selected_hypothesis": result.get("selected", {}).get("text", "")[:50],
        }

    def _load_pipeline_registry(self):
        with open(self.pipeline_registry_path, "r") as f:
            return yaml.safe_load(f)

🎯 Conclusion: The Emergence of Goal-Directed Intelligence

We’ve journeyed through the architecture of what may be the first truly thinking AI system - one that doesn’t merely process inputs, but pursues goals through integrated cognitive layers:

  1. 🧠 The Subconscious (MR.Q)
    Our ever-adapting foundation: instant pattern recognition, emotional-like scoring, and memory-based intuition. MR.Q is the system’s gut feeling - reacting before it thinks, learning from every stumble, and whispering “this feels right” through dimensional scores.

  2. 💡 The Conscious Mind (Symbolic Rules)
    The deliberate thinker: auditing strategies, rewriting logic, making reasoned choices. This is where the system thinks about thinking - questioning its approaches, mutating its behaviors, and planning its next intellectual move.

  3. 🚀 The Body in Motion (Pipeline Execution)
    Where cognition becomes action: the dynamic, mutable sequence of steps actually taken toward a goal. This isn’t static code - it’s living behavior that evolves mid-execution as instinct and intellect negotiate the best path forward.

✨ The Critical Breakthrough

What makes this system fundamentally different isn’t any single component, but how they interlock:

  • The subconscious reacts (MR.Q scores instantly)
  • The conscious mind directs (symbolic rules reconfigure)
  • The pipeline executes (actions adapt in real-time)
    …all while maintaining relentless focus on the goal.

This creates a continuous loop of self-reflection and self-modification that traditional AI architectures cannot achieve. While LLMs generate text and reinforcement learners optimize rewards, our system pursues understanding.
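
To make that loop concrete, here is a rough, self-contained Python sketch. The three stub functions stand in for MR.Q, the symbolic rule layer, and the pipeline runner; every name, score, and threshold below is an illustrative assumption, not the system's actual code.

# Hypothetical sketch of the react / direct / execute loop described above.
import random

def execute_pipeline(pipeline, goal):
    """The body in motion: run the staged pipeline (stubbed)."""
    return f"hypothesis for '{goal}' via {[stage['name'] for stage in pipeline]}"

def mrq_score(hypothesis):
    """The subconscious: instant dimensional scoring (stubbed with noise)."""
    return {dim: random.uniform(0.4, 1.0) for dim in ("correctness", "clarity")}

def apply_symbolic_rules(pipeline, scores):
    """The conscious mind: reconfigure the pipeline when a dimension disappoints (stubbed)."""
    if scores["clarity"] < 0.7 and not any(s["name"] == "refine" for s in pipeline):
        pipeline = pipeline + [{"name": "refine"}]  # e.g. add a refinement stage
    return pipeline

def run_goal_loop(goal, pipeline, max_cycles=5, target=0.8):
    best = 0.0
    for _ in range(max_cycles):
        hypothesis = execute_pipeline(pipeline, goal)      # the pipeline executes
        scores = mrq_score(hypothesis)                     # the subconscious reacts
        best = max(best, sum(scores.values()) / len(scores))
        if best >= target:
            break                                          # goal satisfied
        pipeline = apply_symbolic_rules(pipeline, scores)  # the conscious mind directs
    return best, pipeline

print(run_goal_loop("summarize a paper", [{"name": "generate"}, {"name": "judge"}]))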

🦾 Why This Matters

We stand at the threshold of a new paradigm: machines that don’t just solve problems but pursue goals - adapting their very cognition to do so. The implications span:

  • Autonomous discovery systems that self-improve during long-term research
  • Adaptive educational tools that modify teaching strategies based on student understanding
  • Resilient decision engines that evolve new reasoning tactics for unforeseen challenges

This isn’t the end of AI’s evolution - but it may be the beginning of AI that evolves itself. The subconscious/conscious framework provides not just better performance, but something more profound: a pathway to machines that genuinely think about their thinking.


📖 Glossary

  • MR.Q (Multidimensional Ranker & Qualifier): A fast, embedding-based scorer that evaluates AI hypotheses across multiple quality dimensions and adapts in real time based on feedback from LLMs.
  • RegressionTuner: An in-memory linear model that aligns MR.Q scores with LLM ground truth by fitting a regression on observed score pairs (see the sketch after this glossary).
  • LLM (Large Language Model): A high-capacity neural model trained on massive text data, used here as a reference evaluator for hypothesis quality.
  • Symbolic Rule: A declarative override that can change agent configurations, prompts, or strategies based on goal metadata. Enables conscious, programmable behavior.
  • SymbolicRuleApplier: Applies symbolic rules to the system dynamically. Tracks rule usage and effectiveness, and logs their impact.
  • Pipeline: A sequence of AI agents that process a goal. Each stage (e.g., generate, reflect, score) can be modified symbolically.
  • Prompt Template: A structured input given to an LLM. Can be mutated or tuned symbolically to change reasoning behavior.
  • Agent: A modular AI component performing a task (e.g., generation, scoring). Configured via YAML or symbolic rules.
  • Dimensional Scoring: Quality evaluation broken into fine-grained dimensions like correctness, clarity, or originality.
  • Contrastive Pair: A pair of hypotheses labeled by preference (e.g., better vs. worse), used to train scorers like MR.Q or an SVM.
  • SVMRankerScorer: A support vector machine trained on contrastive pairs to score hypotheses according to LLM preferences.
  • Meta-Reasoning: Reasoning about the system's own reasoning, including evaluation of rules, agents, and strategy choices.
  • Symbolic Cognition: High-level, interpretable logic that guides how the system reasons, using symbolic rules and structured overrides.
  • Subconscious System: The fast, automatic behavior layer (like MR.Q) that responds to feedback without explicit rules.
  • Rule Mutation: The process of changing one symbolic rule attribute to improve performance, often guided by a prompt or LLM.
  • Adaptive Learning Loop: A feedback cycle where MR.Q is trained on LLM evaluations and then used to score future outputs, continuously refining the system.
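
Finally, for readers who want the RegressionTuner idea in code: below is a minimal sketch of an in-memory least-squares fit that maps raw MR.Q scores onto an LLM's scoring scale. The class name, method names, and numbers are illustrative assumptions, not the project's exact API.

# Minimal sketch: fit llm_score ≈ a * mrq_score + b from observed score pairs.
# All names and numbers are illustrative.
class SimpleRegressionTuner:
    def __init__(self):
        self.pairs = []          # (mrq_score, llm_score) observations
        self.a, self.b = 1.0, 0.0

    def observe(self, mrq_score: float, llm_score: float):
        self.pairs.append((mrq_score, llm_score))
        if len(self.pairs) >= 2:
            self._fit()

    def _fit(self):
        xs = [p[0] for p in self.pairs]
        ys = [p[1] for p in self.pairs]
        n = len(xs)
        mean_x, mean_y = sum(xs) / n, sum(ys) / n
        var_x = sum((x - mean_x) ** 2 for x in xs)
        if var_x == 0:
            return  # degenerate: all MR.Q scores identical, keep current fit
        self.a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
        self.b = mean_y - self.a * mean_x

    def align(self, mrq_score: float) -> float:
        """Map a raw MR.Q score onto the LLM's scoring scale."""
        return self.a * mrq_score + self.b

tuner = SimpleRegressionTuner()
for mrq, llm in [(0.42, 55.0), (0.61, 72.0), (0.80, 90.0)]:
    tuner.observe(mrq, llm)
print(round(tuner.align(0.7), 1))   # a raw MR.Q score of 0.7 maps to roughly 80 on the LLM scale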