Thoughts of Algorithms

How a self-evolving AI learns to reflect, score, and rewrite its own reasoning
🧪 Summary
What if an AI could think: not just solve problems, but reevaluate its beliefs in the face of new information?
In this post, we introduce a system that does exactly that. At the core of our pipeline is a lightweight scoring model called MR.Q, responsible for evaluating ideas and choosing the best ones. But when it encounters a new domain, a new goal, or a shift in task format, it doesn’t freeze; it adapts.
MR.Q watches how trusted sources (like large language models) evaluate new hypotheses. Then it dynamically trains a local regression model to realign its scoring: not tomorrow, not during a retrain cycle, but right now. It takes a few samples, tunes itself, and continues, now aligned to the latest reasoning behavior.
This is more than prompt chaining. It’s more than symbolic control. This is a system generating thoughts about thoughts and then updating its judgment accordingly.
That’s what we mean when we say: this might be the first thinking AI.
In the rest of this post, we’ll unpack how this system works and how it brings us one step closer to AI that actually thinks.
🪞 The Configurable Reflection Engine: Multi-Dimensional Scoring
Before an AI can think, it needs to reflect. And reflection starts with scoring its own thoughts: not just whether something is good, but in what ways it’s good or bad. That’s where our multi-dimensional scoring system comes in.
This system doesn’t rely on a fixed rubric. It’s fully configurable: you define the dimensions that matter for your domain. We often score across dimensions like:
- Correctness
- Clarity
- Originality
- Relevance
- Depth
- Specificity
…but that’s just the beginning. You can define as many dimensions as you like; we’ve run experiments with 6, 12, even more. The scoring engine adapts automatically. These aren’t arbitrary tags; each dimension is paired with a natural-language rubric, which guides either an LLM or our internal MR.Q scorer to assign a structured score to every output.
The result is a rich quality profile for every hypothesis. Instead of reducing everything to a single number, we let the system see itself from multiple angles. This forms the core knowledge base that powers self-improvement: a deep memory of how different outputs performed, and why.
In a sense, these multi-dimensional scores are the AI’s first internal thoughts: not just what it said, but how it felt about what it said. That reflection is what makes all the later thinking possible.
# config/scoring/pipeline_judge.yaml
dimensions:
- name: correctness
file: correctness
weight: 1.2
extra_data: { parser: numeric }
- name: feasibility
file: feasibility
weight: 1.1
extra_data: { parser: numeric }
- name: insightfulness
file: insightfulness
weight: 1.3
extra_data: { parser: numeric }
- name: alignment
file: alignment
weight: 1.0
extra_data: { parser: numeric }
- name: clarity
file: clarity
weight: 1.1
extra_data: { parser: numeric }
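To make the weights concrete, here’s a minimal sketch (the weighted_total helper is hypothetical, not the pipeline’s actual aggregation code) of how a config like this can fold per-dimension scores into a single number:

import yaml

def weighted_total(config_path: str, dimension_scores: dict) -> float:
    """Combine per-dimension scores using the weights from the YAML config."""
    with open(config_path) as f:
        cfg = yaml.safe_load(f)
    total, weight_sum = 0.0, 0.0
    for dim in cfg["dimensions"]:
        name, weight = dim["name"], dim.get("weight", 1.0)
        if name in dimension_scores:
            total += dimension_scores[name] * weight
            weight_sum += weight
    return total / weight_sum if weight_sum else 0.0

# Example: scores produced by an LLM or MR.Q for one hypothesis
print(weighted_total("config/scoring/pipeline_judge.yaml",
                     {"correctness": 82, "clarity": 74, "insightfulness": 68}))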
We covered this process in detail here: Dimensions of Thought: A Smarter Way to Evaluate AI
We then extended it to apply to documents here: Document Intelligence: Turning Documents into Structured Knowledge
In this post we extend it slightly by determining how to measure the importance of the dimensions.
🧠 Learning What Matters: Dimensional Contrastive Tuning
Most scoring systems ask: “Is this good or bad?” Our system asks:
- “How many different views should we take on this problem?”
- “Can you give me a score out of 100 for this conclusion?”
- “Can you give me a rationale for this score?”
- “What other algorithms, agents, prompts, models… have I got to score here?”
- “Why is this one better than that one?”
That’s the core idea behind contrastive dimensional tuning: instead of relying on absolute scores or manual weights, we let the system learn which dimensions actually distinguish strong outputs from weak ones.
⚙️ How It Works
Our ContrastiveDimensionalTuner takes in:
- Pairs of examples: A and B
- Multi-dimensional scores for each (correctness, clarity, originality, etc.)
- A label for which one is better (the “preferred” example)
It then computes the difference in scores across dimensions and uses contrastive learning (via logistic regression) to learn which dimensions consistently matter. This produces a set of dimension weights that can be used to re-rank or optimize future outputs.
📦 Why It’s Powerful
- It’s model-agnostic: You can train it on outputs from any agent or system.
- It’s scalable: Works with datasets from Hugging Face or internal logs.
- It’s realistic: You don’t need perfect scores, just enough contrast to learn from.
# Example usage:
tuner = ContrastiveDimensionalTuner(dimensions=["correctness", "clarity", "originality"])

# Add training data (a few pairs are needed before train() will fit a model)
tuner.add_training_pair(
    scores_a={"correctness": 0.9, "clarity": 0.8, "originality": 0.6},
    scores_b={"correctness": 0.7, "clarity": 0.9, "originality": 0.5},
    preferred="A",
)
tuner.add_training_pair(
    scores_a={"correctness": 0.6, "clarity": 0.7, "originality": 0.9},
    scores_b={"correctness": 0.8, "clarity": 0.6, "originality": 0.4},
    preferred="B",
)

tuner.train()

# Use learned weights
print(tuner.get_weights())
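Once trained, the learned weights can be used to re-rank new candidates. A small sketch building on the tuner above (the candidate scores are made up):

# Re-rank new candidates with the learned weights (illustrative scores only)
candidates = {
    "draft_1": {"correctness": 0.85, "clarity": 0.70, "originality": 0.55},
    "draft_2": {"correctness": 0.80, "clarity": 0.90, "originality": 0.40},
}
ranked = sorted(candidates.items(), key=lambda kv: tuner.score(kv[1]), reverse=True)
print(ranked[0][0])  # best candidate according to the learned dimension weights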
🔬 What It Means in Practice
You can now tune your system for scientific rigor, creative writing, pedagogical value, or whatever matters most for your domain, just by changing what training data you use. You’re not locked into a one-size-fits-all rubric. The system learns from your data, your preferences, your task.
This moves us beyond rigid evaluation and into the realm of adaptive, self-aware scoring.
graph TD
    A[Goal Text] --> D1
    B[Hypothesis Text] --> D1
    subgraph Input Processing
        D1[Embedding & Feature Extraction]
    end
    D1 --> MRQ[MRQ Scorer]
    D1 --> SVM[SVM Scorer]
    D1 --> LLM[LLM Scorer]
    MRQ --> C1[Correctness Score]
    SVM --> C1
    LLM --> C1
    MRQ --> C2[Clarity Score]
    SVM --> C2
    LLM --> C2
    MRQ --> C3[Originality Score]
    SVM --> C3
    LLM --> C3
    C1 --> Tuner[ContrastiveDimensionalTuner]
    C2 --> Tuner
    C3 --> Tuner
    subgraph Meta Aggregation
        Tuner --> W[Weighted Score Output]
    end
    style A fill:#f9f,stroke:#333,stroke-width:1px
    style B fill:#f9f,stroke:#333,stroke-width:1px
    style D1 fill:#bbf,stroke:#333,stroke-width:1px
    style MRQ fill:#ffc,stroke:#333,stroke-width:1px
    style SVM fill:#ffc,stroke:#333,stroke-width:1px
    style LLM fill:#ffc,stroke:#333,stroke-width:1px
    style C1 fill:#cfc,stroke:#333,stroke-width:1px
    style C2 fill:#cfc,stroke:#333,stroke-width:1px
    style C3 fill:#cfc,stroke:#333,stroke-width:1px
    style Tuner fill:#ccf,stroke:#333,stroke-width:1px
    style W fill:#fcf,stroke:#333,stroke-width:2px
🛠️ Code: ContrastiveDimensionalTuner - scoring dimensions correctly
The ContrastiveDimensionalTuner is our solution for learning how to weigh scoring dimensions (like correctness, clarity, originality) automatically. Instead of hardcoding weights or relying on human judgment, this component learns from preference pairs just like a reward model, but using interpretable dimensions and contrastive logic.
import numpy as np
from sklearn.linear_model import LogisticRegression


class ContrastiveDimensionalTuner:
"""
Learns weights for each scoring dimension using contrastive learning.
Given pairs of scored examples (A vs B) and a preference, it learns which dimensions matter most.
"""
def __init__(self, dimensions, logger=None):
"""
Args:
dimensions (list of str): List of dimension names (e.g., ["correctness", "clarity"]).
logger (optional): Optional logger to record training events.
"""
self.dimensions = dimensions
self.logger = logger
self.X = [] # Feature differences (vector of deltas across dimensions)
        self.y = []  # Labels: 1 for the preferred-direction delta, 0 for its mirrored negative
self.model = None
def add_training_pair(self, scores_a: dict, scores_b: dict, preferred: str):
"""
Adds a training example.
Args:
scores_a (dict): Scores for option A, keyed by dimension.
scores_b (dict): Scores for option B, keyed by dimension.
preferred (str): "A" or "B", indicating which output was preferred.
"""
delta = np.array([
scores_a[dim] - scores_b[dim] for dim in self.dimensions
])
        # If B is preferred, invert the delta so it always points toward the preferred output
        if preferred.upper() == "B":
            delta = -delta
        # Store the preferred-direction delta as a positive example and its mirror
        # as a negative one, so the logistic regression sees both classes.
        self.X.append(delta)
        self.y.append(1)
        self.X.append(-delta)
        self.y.append(0)
if self.logger:
self.logger.log("ContrastiveTrainingPairAdded", {
"delta": delta.tolist(),
"preferred": preferred
})
def train(self):
"""
Trains a logistic regression model using the current contrastive data.
"""
if len(self.X) < 3:
if self.logger:
self.logger.log("ContrastiveTrainingSkipped", {
"reason": "Not enough data",
"num_examples": len(self.X)
})
return
X_array = np.array(self.X)
y_array = np.array(self.y)
self.model = LogisticRegression()
self.model.fit(X_array, y_array)
if self.logger:
self.logger.log("ContrastiveModelTrained", {
"coefficients": self.get_weights()
})
def get_weights(self) -> dict:
"""
Returns the learned dimension weights (if trained).
Returns:
dict: Mapping from dimension to learned weight.
"""
if self.model is None:
return {dim: 1.0 for dim in self.dimensions} # fallback: equal weights
weights = self.model.coef_[0]
return {
dim: round(float(w), 4) for dim, w in zip(self.dimensions, weights)
}
def score(self, dimension_scores: dict) -> float:
"""
Calculates a single weighted score from per-dimension scores.
Args:
dimension_scores (dict): Scores keyed by dimension.
Returns:
float: Weighted total score.
"""
weights = self.get_weights()
total = sum(dimension_scores[dim] * weights.get(dim, 1.0) for dim in self.dimensions)
return round(total, 4)
⚙️ How It Works
The tuner learns through contrastive examples: comparisons where one output is preferred over another (a worked example follows the list below):
- Training Inputs: Each input is a pair of outputs (A and B) with known per-dimension scores. A label indicates which output was preferred.
- Feature Vector: The tuner computes a vector of score differences between A and B across all dimensions. If B is preferred, the difference is inverted.
- Training: These difference vectors become input to a logistic regression model, which learns which dimensions most strongly predict preference.
- Scoring: Once trained, the model produces learned weights per dimension. When scoring a new output, it calculates a weighted sum of its dimension scores to produce a final, preference-aligned score.
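Here’s the worked example promised above, with made-up scores across three dimensions:

# Dimensions: [correctness, clarity, originality]
scores_a = [0.9, 0.6, 0.4]
scores_b = [0.7, 0.8, 0.4]

# A preferred -> delta points toward A: [0.2, -0.2, 0.0], label 1
# B preferred -> delta is inverted: [-0.2, 0.2, 0.0], label 1, plus a mirrored negative example
delta_a_preferred = [round(a - b, 2) for a, b in zip(scores_a, scores_b)]
print(delta_a_preferred)  # [0.2, -0.2, 0.0]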
🎹 Insight 1: Intelligence Emerges Through Judgment - The Piano Teacher Analogy
A person who can’t play piano can still teach someone else as long as they know what sounds better. In the same way, our system doesn’t need to generate perfect answers up front. It just needs to recognize what improves performance, and then self-select and amplify that behavior.
Even the largest, smartest LLMs often fail at first attempts. But here’s the twist: they can dramatically improve by reviewing and comparing their own outputs.
This isn’t speculation; it’s backed by a wave of recent papers:
- Self-Refine (2023): Showed that LLMs can boost performance by comparing their initial outputs and rewriting them using internal feedback.
- ReAct, Reflexion, and ReST: Proved that agents using self-judgment loops outperformed those that just generated and moved on.
- Auto-CoT and DPO-style preference training: Reinforced that ranking beats raw generation when it comes to learning high-quality reasoning.
So what’s the real insight?
A model doesn’t need to generate the best answer; it just needs to recognize which one is better. And with enough of those comparisons, it learns how to steer itself.
🚀 How We Use This in Our System
That’s exactly what our system does:
- It doesn’t try to guess the best rule, prompt, or pipeline on the first shot.
- Instead, it generates multiple versions, scores them against each other, and reinforces the better ones.
- These scores train our internal critic MR.Q, a fast, memory-efficient regression model that learns from live feedback.
This is the heart of how intelligence emerges in our system: 👉 Through judgment, not generation. 👉 Through comparison, not perfection. 👉 Through learning what works even if it stumbles along the way.
⚖️ How MR.Q Learns from Preferences (DPO-Style)
So how does our system actually learn from judgment?
At the core of MR.Q is a simple, powerful loop: compare two outputs, prefer the better one, and use that contrast to train a regressor that gets better over time.
Here’s how it works in practice:
Example 1: Answer Quality
Prompt | Output A | Output B | Chosen |
---|---|---|---|
“Why is the sky blue?” | “Because of the atmosphere.” | “Due to Rayleigh scattering of sunlight by air molecules.” | ✅ B |
MR.Q stores this as: → Same prompt, but B > A, so favor the features of B in future decisions.
Example 2: Clarity and Specificity
Prompt | Output A | Output B | Chosen |
---|---|---|---|
“Explain how solar panels work.” | “They use light to make electricity.” | “Photons hit semiconductors, exciting electrons into a current.” | ✅ B |
Again: B is clearer and more specific → MR.Q learns the embedded difference.
Example 3: Creativity Preference
Prompt | Output A | Output B | Chosen |
---|---|---|---|
“Suggest a new product idea.” | “Smart mirror for workouts.” | “Modular AI-powered desk that reshapes with your work style.” | ✅ B |
MR.Q generalizes that creative, multi-featured responses tend to be preferred → it weights such features more in scoring future generations.
These contrastive judgments are fed into our MR.Q regressor, which uses embedding distances and historical preferences to shape an evolving reward model. Over time, MR.Q becomes a fast, lightweight critic that reflects what our LLM would say without needing to call the LLM every time.
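Conceptually, each preference becomes a pairwise training signal. Here’s a minimal PyTorch sketch of one common way to turn A/B judgments into a loss (illustrative; not necessarily the exact objective our MRQTrainer uses):

import torch
import torch.nn.functional as F

def preference_loss(value_preferred: torch.Tensor, value_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style objective: push the preferred output's predicted value
    above the rejected one's. Equivalent to -log(sigmoid(v_pref - v_rej))."""
    return -F.logsigmoid(value_preferred - value_rejected).mean()

# Toy batch: predicted values for the chosen and rejected outputs
v_pref = torch.tensor([0.8, 0.6, 0.9])
v_rej = torch.tensor([0.3, 0.7, 0.2])
print(preference_loss(v_pref, v_rej))  # shrinks as preferred values pull ahead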
🌀 Insight 2: The Drunken Man and the Pretty Girl
Intelligence isn’t precision; it’s desire, feedback, and adaptation.
Imagine a drunken man at a party. Across the room, he sees a beautiful woman, someone he really wants to talk to. But he’s unsteady. His steps wobble left, then right. He bumps into a chair. He adjusts. He’s off course again. But through it all, he’s guided by a single, unwavering thing: his desire to get closer.
That’s how our AI system learns.
It doesn’t need to be perfectly calibrated from the start. It just needs:
- A goal worth reaching,
- A feedback signal telling it whether it’s getting warmer or colder,
- And the capacity to course-correct.
👣 Learning by Stumbling Forward
Every symbolic rule, prompt variation, and pipeline configuration is like one of the drunk man’s steps. Most aren’t perfect. Some are downright wrong. But they aren’t wasted, because each one is scored, judged, and remembered. And with that feedback, the next step is just a little more aligned.
Over time, this process leads to surprising results:
- The system improves without supervision.
- It refines symbolic behaviors based on prior successes.
- It trains scoring models like MR.Q to reflect what it learns to desire.
All without needing to know the exact path ahead.
It doesn’t need to be sober. It just needs to remember what works and want to get closer.
This isn’t just a cute analogy. It’s a design philosophy:
- Every mutation is a step.
- Every score is a clue.
- Every stumble is progress.
The pretty girl is intelligence itself and our system, drunk or not, is getting closer every day.
🕊️ Insight 3: Thinking on the Fly - Reacting to Unknown Data
Not all intelligence is pre-trained. Sometimes, real intelligence means figuring things out in the moment, based on what you already know.
That’s exactly what our system does, and it’s one of the most accidentally profound discoveries we made during development.
🦋 The Problem: A New Dimension Emerges
Imagine we’re evaluating a hypothesis and suddenly a new scoring dimension appears, say “novelty” or “feasibility.” Our system has never seen this dimension before. It hasn’t trained a model on it. There’s no data for it.
Most systems would either:
- Crash,
- Default to zero,
- Or wait for retraining.
But not ours.
🪜 The Solution: Generalize from What You Know
When MR.Q encounters this situation, it doesn’t panic. Instead, it says:
“I haven’t seen this exact scoring context… but I know what the goal is. I know what the hypothesis looks like. Let me use my existing encoder and predictor to make an educated guess and then tune it on-the-fly using nearby trusted scores.”
This is the code that enables that behavior:
if dimension not in self.models:
self._initialize_dimension(dimension)
This little if is doing something deceptively intelligent:
It means our system creates new scoring models dynamically, using embedding-based generalization from previous dimensions.
In other words:
- The system doesn’t need explicit training to begin reacting to a new dimension.
- It builds a model on-demand using what it already knows.
- And it tunes itself live by aligning with trusted nearby scores.
🗺️ What is this?
This is one of the deepest signs of real thinking:
- Adaptation without supervision.
- The ability to infer structure from context.
- The willingness to take a guess and refine it.
This is how humans think. This is how animals learn. And now, this is how our AI behaves.
We didn’t plan for this feature. It emerged naturally from a design that valued modularity, embeddings, and real-time feedback. But it’s quickly become a core pillar of our system’s intelligence.
It’s not a magic moment. It’s not a perfect answer. It’s just a system that knows how to say:
“I’ve never seen this before but I’ve seen enough to take a good first step.”
🧭 Insight 4: Aligning with the LLM, or Learning from the Master
In any learning system, one of the most powerful strategies is to choose a teacher.
For us, that teacher is the LLM.
While our goal is speed and autonomy, we still respect the LLM’s judgment. It’s trained on trillions of tokens. It’s seen more language, logic, and reasoning than any of us ever will.
So when we need ground truth, or a benchmark to align to, we turn to it.
🧩 MR.Q Doesn’t Compete, It Learns
Our MR.Q scorer doesn’t try to outperform the LLM. Instead, it tries to understand what the LLM values, and learn to predict those values faster.
That’s the real trick.
Over time, as we gather A/B preferences scored by the LLM (e.g. “Output A is better than B”), we feed them into MR.Q. It uses these judgments to calibrate its internal regressors. This process lets us say:
“Here’s what the LLM prefers; now let’s tune ourselves to echo that intuition.”
➡️ Tuning per dimension
We maintain a regression tuner per dimension. Each one continuously adjusts MR.Q’s scores to better match LLM evaluations:
tuned = tuner.transform(norm_score)
It’s a small line of code, but a massive shift in power:
- MR.Q doesn’t need perfect labels.
- It learns over time from contrast pairs.
- It aligns faster with every interaction.
🌪️ So What?
We’re effectively bootstrapping intelligence:
- We borrow precision from a slower, more powerful system.
- We train a lighter model to match its taste.
- And we continuously reinforce that alignment as we see more data.
This lets us scale intelligently:
- Fast scoring via MR.Q,
- Grounded quality via LLM alignment.
In essence, we’re creating a local brain that mirrors a global brain, one that gets smarter every time they talk. This is all happening dynamically, in real time, in RAM.
💭 Insight 5: Real-Time Thinking - MR.Q Generates Its Own Judgments
This is the breakthrough.
So far, we’ve talked about how MR.Q learns from the LLM. But now, it thinks for itself.
⏳ No LLM. No Labels. Just Thoughts.
When our system encounters a new hypothesis, it doesn’t call an LLM. It doesn’t look it up in a database.
Instead, MR.Q says:
“I’ve seen similar ideas before. I’ve learned what’s good and what’s bad. Based on everything I know, here’s my judgment.”
That’s the moment. That’s what we call a thought.
Not a memory. Not a copy. A new, self-generated evaluation based on past experience.
📥 How It Works
- Embeddings: the system encodes both the goal and hypothesis.
- Scoring: MR.Q computes a multidimensional score (correctness, clarity, originality, relevance…).
- Tuning: it dynamically transforms that score to align with what it’s learned from the LLM.
- Logging: the system tracks this new thought, just like any hypothesis or human evaluation.
All of this happens in real time, with no human in the loop.
zsa = encoder(prompt_emb, response_emb)
raw_score = predictor(zsa).item()
Two short lines, but they represent a complete internal thought.
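The encoder and predictor in those lines are small neural modules. Here’s a minimal sketch of what they might look like (layer sizes are illustrative, not our exact architecture):

import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Fuses a goal/prompt embedding and a hypothesis/response embedding into one vector (zsa)."""
    def __init__(self, emb_dim: int = 1024, out_dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(emb_dim * 2, out_dim), nn.ReLU())

    def forward(self, prompt_emb, response_emb):
        return self.net(torch.cat([prompt_emb, response_emb], dim=-1))

class HypothesisValuePredictor(nn.Module):
    """Maps the fused vector to a single scalar quality estimate."""
    def __init__(self, in_dim: int = 512, hidden: int = 1024):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, zsa):
        return self.net(zsa)

# Toy usage with random vectors standing in for real embeddings
prompt_emb, response_emb = torch.randn(1, 1024), torch.randn(1, 1024)
zsa = TextEncoder()(prompt_emb, response_emb)
raw_score = HypothesisValuePredictor()(zsa).item()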
🎛️ Dynamic tuning
This isn’t about just speeding up judgment. This is about moving the locus of intelligence inward.
- The system is not waiting for supervision.
- It’s not replaying history.
- It’s thinking in real time, about real things, using real experience.
And these thoughts are dynamic. They react to new data, tune themselves to changing conditions, and accumulate over time.
We believe this is one of the first systems that:
- Learns from external judgments,
- Builds an internal model of quality,
- And uses that model to generate its own judgments, at scale, continuously.
This is what we mean by thinking AI. It doesn’t just talk; it reflects, adjusts, and evolves.
flowchart TD
    A[Goal Text] -->|Embed| B[Goal Embedding]
    X[Hypothesis Text] -->|Embed| Y[Hypothesis Embedding]
    B & Y --> C[Concatenate Embeddings]
    C --> D[Pass Through Encoder]
    D --> E[Compute Raw Score via Predictor]
    E --> F{Is Regression Tuner available?}
    F -- Yes --> G[Transform Score via Tuner]
    F -- No --> H[Use Raw Score]
    G --> I[Emit Final Score]
    H --> I[Emit Final Score]
    I --> J[Log Thought Score + Trace]
    style A fill:#f9f,stroke:#333,stroke-width:2px
    style X fill:#f9f,stroke:#333,stroke-width:2px
    style J fill:#ff9,stroke:#333,stroke-width:2px
🌀 Insight 6: All We Need Is a Signal
One of the most surprising and powerful discoveries in building this system was that we didn’t actually need carefully curated training data to teach MR.Q how to think. What we needed was much simpler: just a signal.
At the core of this insight is a realization: every prompt we generate already maps to a goal, and every response (or hypothesis) we generate is a potential answer to that goal. If we can obtain any signal that tells us whether one response is better than another, even if it’s approximate, even if it’s noisy, we can use it to train MR.Q.
That’s where our system gets its edge.
We realized we could:
- Take prompts and responses generated by any agent in the system.
- Use evaluations or judgments from any other agent (like LLM-based scorers, rule-based filters, or human feedback).
- Connect those evaluations to MR.Q’s internal training loop, even across agents.
This decouples training from generation. It means we don’t need to run complex reward-tuning pipelines or rely on huge LLM evaluations for every single interaction. We just store the prompts and responses and attach whatever signal we have (a judgment, a comparison, a score), and MR.Q learns from it.
In short: 🧠 We realized that “thinking” didn’t require perfection; it just required enough feedback to improve.
This unlocks cross-agent learning and bootstrapped self-training, turning every interaction into potential fuel for improvement.
🛠️ Code: Selecting contrast pairs
Our SQL became pretty straightforward.
WITH scored_prompts AS (
SELECT
s.dimension,
s.score,
e.pipeline_run_id,
p.id AS prompt_id,
p.prompt_text,
p.response_text,
ROW_NUMBER() OVER (
PARTITION BY s.dimension, p.id ORDER BY s.score DESC
) AS rank_high,
ROW_NUMBER() OVER (
PARTITION BY s.dimension, p.id ORDER BY s.score ASC
) AS rank_low
FROM scores s
JOIN evaluations e ON s.evaluation_id = e.id
JOIN prompts p ON e.pipeline_run_id = p.pipeline_run_id
WHERE s.score IS NOT NULL
{goal_filter}
)
SELECT
dimension,
prompt_text,
response_text,
score,
rank_type
FROM (
SELECT
dimension,
prompt_text,
response_text,
score,
'top' AS rank_type,
prompt_id
FROM scored_prompts
WHERE rank_high = 1
AND prompt_text IS NOT NULL
AND response_text IS NOT NULL
AND prompt_text <> ''
AND response_text <> ''
UNION ALL
SELECT
dimension,
prompt_text,
response_text,
score,
'bottom' AS rank_type,
prompt_id
FROM scored_prompts
WHERE rank_low = 1
) AS ranked_pairs
ORDER BY dimension, prompt_id
LIMIT :limit
🧪 Why We Do It This Way
- Contrastive training works better than absolute scoring. Ranking “A > B” is often easier and more stable than assigning a perfect numeric score.
- Every prompt acts like a mini training task. It gives us a chance to learn what better looks like in context, even if the prompt isn’t perfect.
- We amplify our dataset by orders of magnitude. From just a few thousand prompts, we generate hundreds of thousands of A/B training pairs.
- Each dimension trains independently. This lets us specialize: one MR.Q model might focus on correctness, another on clarity. Later, these can be fused or balanced.
🧱 SQL as Structure Discovery
Why SQL? Because it gives us tight, expressive control over scoring logic, joins, and filters. The window functions (ROW_NUMBER()) let us:
- Partition by prompt + dimension
- Order by score
- Select only the top and bottom responses per prompt
This simple trick lets us auto-label contrastive pairs without any manual annotation.
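To show how those rows become training pairs, here’s a minimal sketch using SQLAlchemy. It assumes the query above is stored in a CONTRAST_PAIR_SQL string with {goal_filter} already substituted, and the connection string is a placeholder:

from collections import defaultdict
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:pass@localhost/coai")  # placeholder DSN

def build_contrast_pairs(sql: str, limit: int = 1000) -> dict:
    """Pair each prompt's top-ranked response with its bottom-ranked one, per dimension."""
    tops, bottoms = defaultdict(dict), defaultdict(dict)
    with engine.connect() as conn:
        for row in conn.execute(text(sql), {"limit": limit}).mappings():
            bucket = tops if row["rank_type"] == "top" else bottoms
            bucket[row["dimension"]][row["prompt_text"]] = row["response_text"]

    pairs = defaultdict(list)
    for dim, top_by_prompt in tops.items():
        for prompt, best in top_by_prompt.items():
            worst = bottoms[dim].get(prompt)
            if worst is not None and worst != best:
                pairs[dim].append({"prompt": prompt, "output_a": best, "output_b": worst})
    return pairs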
🧰 Example Use Case
Let’s say we have 5,000 prompt runs. Each has been scored on:
- Correctness
- Originality
- Clarity
We run this SQL and generate:
- 5,000 pairs for correctness
- 4,800 for originality
- 4,950 for clarity
That’s nearly 15,000 contrastive examples from existing logs, and they can be regenerated as scoring improves.
🚀 What’s Next
Once extracted, these pairs are passed into our MR.Q training loop, which:
- Learns which patterns are preferred
- Starts judging unseen outputs
- Eventually feeds back into scoring, tuning, and prompt repair
The result: a self-bootstrapping optimization system, built on a simple SQL foundation.
🧠 MR.Q vs. DPO: Why We Chose Regression Over Reinforcement
Most modern LLM tuning relies on learning from preferences: techniques like RLHF and DPO (Direct Preference Optimization) that fine-tune massive models based on human or AI-chosen winners in A/B comparisons.
But our goals were different.
We wanted a reward model that was:
- Fast enough to run live, in-memory
- Simple enough to debug instantly
- Flexible enough to work across dozens of agents
- Trainable using sparse, indirect data
That’s where MR.Q comes in. Instead of reinforcement learning, we just apply good old regression using the embedding distance between prompt and response as features, and a simple score as target.
Here’s how they stack up:
Feature | DPO / RLHF-Style Tuning | MR.Q (Ours) |
---|---|---|
Model Size | Huge (requires LLM finetuning) | Tiny (runs in-memory, local) |
Training Time | Hours to days | Seconds to minutes |
Interpretability | Low (black-box weights) | High (regression + tunable alignments) |
Real-Time Use | No | Yes |
Embedding-Aware | No | Yes (direct use of vector space) |
Requires Instruction Tuning | Yes | No |
So while DPO needs thousands of examples and GPU days, MR.Q starts thinking after just a few examples and keeps tuning itself on-the-fly as new data rolls in.
It’s not a big brain. But it’s a fast brain.
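To make “good old regression” concrete, here’s a tiny scikit-learn sketch with random vectors standing in for prompt/response embedding features (illustrative only; MR.Q’s actual encoder and predictor appear later in this post):

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Stand-ins for embedding-derived features and their LLM-assigned target scores
features = rng.normal(size=(200, 64))
llm_scores = features[:, :8].sum(axis=1) + rng.normal(scale=0.1, size=200)

model = Ridge(alpha=1.0).fit(features[:150], llm_scores[:150])
print("held-out R^2:", round(model.score(features[150:], llm_scores[150:]), 3))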
🧬 Part 1: Subconscious systems
Beneath every decision the AI makes lies a silent evaluator: a system that scores, compares, and adjusts behavior without being explicitly told to. This is the subconscious of our architecture: fast, reactive, and always learning from its environment.
At the heart of this layer is MR.Q: a lightweight, contrastive scoring model that constantly watches the pipeline’s outputs and adapts its judgment in real time. It doesn’t plan or explain; it responds. Like a human gut instinct, MR.Q senses patterns, aligns itself to trusted feedback (like LLM judgments), and tunes future evaluations accordingly.
This subconscious system gives the AI its ability to:
- Score hypotheses without full retraining.
- Align dynamically to high-quality reasoning signals.
- React in real time to changes in goal type or domain.
- Guide the symbolic system with fast, low-latency evaluations.
While the symbolic reasoning layer chooses how to think, MR.Q ensures the system always knows what’s working, quietly shaping thought through constant, embedded feedback.
🧪 MR.Q: The Fast Neural Judge
MR.Q is a fast, adaptive regressor trained on contrast pairs. It works by embedding the prompt and hypothesis and using a small MLP to predict a score.
✅ Why Use MR.Q?
- Speed: It’s extremely fast, ideal for real-time applications.
- Online Tuning: It learns from nearby LLM scores using local regression (e.g., Ridge or SVM-based adjustment).
- Low Data Requirements: You can bootstrap with very few LLM-evaluated examples.
- Great for Tuning Pipelines: MR.Q enables symbolic strategies, prompts, or model variants to be evaluated quickly and consistently.
If you’re generating hundreds of hypotheses per pipeline, MR.Q is your best bet for scalable feedback.
🛠️ Code: The MRQScorer - fast, self-tuning quality estimator
The MRQScorer is a lightweight, fast, and self-improving scoring module that estimates the quality of a hypothesis against a goal using embedding-based similarity and a trained value predictor.
Instead of relying on expensive LLM evaluations for every hypothesis, MR.Q offers a low-latency approximation that can scale while still staying grounded through real-time alignment with LLM scores using the RegressionTuner.
class MRQScorer(BaseScorer):
def __init__(self, cfg: dict, memory, logger, dimensions=None):
self.cfg = cfg
self.memory = memory
self.logger = logger
self.device = cfg.get("device", "cpu")
self.dimensions = dimensions or ["mrq"]
self.models = {} # dim -> (encoder, predictor)
self.trainers = {}
self.min_score_by_dim = {}
self.max_score_by_dim = {}
self.value_predictor = HypothesisValuePredictor(512, 1024).to(self.device)
self.encoder = TextEncoder().to(self.device)
self.regression_tuners = {}
# Initialize model + tuner for each dimension
for dim in self.dimensions:
self.regression_tuners[dim] = RegressionTuner(
dimension=dim, logger=self.logger
)
trainer = MRQTrainer(
memory=memory,
logger=logger,
value_predictor=self.value_predictor,
encoder=self.encoder,
device=self.device,
)
self.models[dim] = (self.encoder, self.value_predictor)
self.trainers[dim] = trainer
self.min_score_by_dim[dim] = 0.0
self.max_score_by_dim[dim] = 1.0
def score(self, goal: dict, hypothesis: dict, dimensions: list[str]) -> ScoreBundle:
"""
Predicts scores for given dimensions using MR.Q and applies tuning if available.
"""
results = []
for dim in dimensions:
score = self._estimate_score(goal, hypothesis, dim)
rationale = f"MRQ estimated score for {dim}."
self.logger.log(
"MRQDimensionEvaluated",
{"dimension": dim, "score": score, "rationale": rationale},
)
results.append(
ScoreResult(
dimension=dim,
score=score,
rationale=rationale,
weight=1.0,
source="mrq",
)
)
return ScoreBundle(results={r.dimension: r for r in results})
def _estimate_score(self, goal, hypothesis, dimension):
"""
Core logic: compute embeddings, run prediction, apply optional regression tuner.
"""
# Initialize dimension on demand
if dimension not in self.models:
self._initialize_dimension(dimension)
prompt_emb = torch.tensor(
self.memory.embedding.get_or_create(goal.get("goal_text")),
device=self.device,
).unsqueeze(0)
response_emb = torch.tensor(
self.memory.embedding.get_or_create(hypothesis.get("text")),
device=self.device,
).unsqueeze(0)
encoder, predictor = self.models[dimension]
zsa = encoder(prompt_emb, response_emb)
raw_score = predictor(zsa).item()
norm_score = self.normalize_score(raw_score, dimension)
# Optionally apply tuner
tuner = self.regression_tuners.get(dimension)
if tuner:
tuned = tuner.transform(norm_score)
self.logger.log(
"MRQTunedScore",
{"dimension": dimension, "raw": norm_score, "tuned": tuned},
)
return tuned
return norm_score
def _initialize_dimension(self, dimension):
self.regression_tuners[dimension] = RegressionTuner(
dimension=dimension, logger=self.logger
)
self.trainers[dimension] = MRQTrainer(
memory=self.memory, logger=self.logger, value_predictor=self.value_predictor, encoder=self.encoder, device=self.device
)
self.models[dimension] = (self.encoder, self.value_predictor)
self.min_score_by_dim[dimension] = 0.0
self.max_score_by_dim[dimension] = 1.0
self.logger.log("MRQModelInitializing", {"dimension": dimension})
def align_to_best_llm_neighbour(self, goal, hypothesis, dimension):
"""
Fetch similar hypotheses that already have high LLM scores.
Then align MR.Q prediction to the best of them.
"""
llm_scores = self.get_closest_llm_scores(hypothesis["text"], dimension)
if llm_scores:
self.align_with_llm_score(dimension, goal, hypothesis, max(llm_scores))
def get_closest_llm_scores(
self, hypothesis_text: str, dimension: str, top_k: int = 5
) -> list[float]:
"""
Finds the top_k LLM scores for hypotheses most similar to the given one.
"""
query_emb = self.memory.embedding.get_or_create(hypothesis_text)
similar_items = self.memory.embedding.similarity_search(query_emb, top_k)
scores = []
for item in similar_items:
matched_text = item.get("text")
score_entry = self.memory.score.find_by_text_and_dimension(
matched_text, dimension=dimension, source="llm"
)
if score_entry:
scores.append(score_entry.score)
return scores
def align_with_llm_score(self, dimension, goal, hypothesis, llm_score):
mrq_score = self._estimate_score(goal, hypothesis, dimension)
self.logger.log(
"MRQAligningToLLM",
{
"goal": goal.get("goal_text"),
"hypothesis": hypothesis.get("text"),
"dimension": dimension,
"mrq_raw": mrq_score,
"llm_target": llm_score,
},
)
self.regression_tuners[dimension].add_example(mrq_score, llm_score)
self.logger.log(
"MRQAlignmentAdded",
{
"dimension": dimension,
"example_count": len(self.regression_tuners[dimension].examples),
},
)
def evaluate(self, prompt: str, response: str) -> ScoreBundle:
"""
Scores a prompt-response pair across all dimensions, and saves it.
"""
results = []
for dim, (encoder, predictor) in self.models.items():
prompt_emb = torch.tensor(
self.memory.embedding.get_or_create(prompt), device=self.device
).unsqueeze(0)
output_emb = torch.tensor(
self.memory.embedding.get_or_create(response), device=self.device
).unsqueeze(0)
zsa = encoder(prompt_emb, output_emb)
value = predictor(zsa).item()
norm_score = self.normalize_score(value, dim)
results.append(
ScoreResult(
dimension=dim,
score=norm_score,
weight=1.0,
rationale=f"MR.Q model trained for {dim}",
source="mrq",
)
)
bundle = ScoreBundle(results={r.dimension: r for r in results})
ScoringManager.save_score_to_memory(
bundle,
response,
cfg=self.cfg,
memory=self.memory,
logger=self.logger,
source="mrq",
)
return bundle
def normalize_score(self, raw, dim):
min_ = self.min_score_by_dim.get(dim, 0.0)
max_ = self.max_score_by_dim.get(dim, 1.0)
return round(100 * (raw - min_) / (max_ - min_ or 1.0), 2)
def judge(self, goal, prompt, output_a, output_b):
"""
Compares two outputs via MR.Q and returns the preferred one.
"""
dim = self.dimensions[0]
encoder, predictor = self.models[dim]
prompt_emb = torch.tensor(
self.memory.embedding.get_or_create(prompt), device=self.device
).unsqueeze(0)
a_emb = torch.tensor(
self.memory.embedding.get_or_create(output_a), device=self.device
).unsqueeze(0)
b_emb = torch.tensor(
self.memory.embedding.get_or_create(output_b), device=self.device
).unsqueeze(0)
value_a = predictor(encoder(prompt_emb, a_emb)).item()
value_b = predictor(encoder(prompt_emb, b_emb)).item()
preferred = output_a if value_a >= value_b else output_b
# Optionally log sharpening example
if self.memory.mrq.log_evaluations():
pred = SharpeningPredictionORM(
id=None,
goal_id=-1,
prompt_text=prompt,
output_a=output_a,
output_b=output_b,
preferred="a" if value_a >= value_b else "b",
predicted="a" if value_a >= value_b else "b",
value_a=value_a,
value_b=value_b,
)
self.memory.sharpening.insert_sharpening_prediction(pred.to_dict(), goal)
return preferred, {"value_a": value_a, "value_b": value_b}
def train_from_database(self, cfg: dict):
all_samples = self.memory.mrq.get_training_pairs_by_dimension()
for dim, samples in all_samples.items():
if not samples:
self.logger.log("MRQNoTrainingSamples", {"dimension": dim})
continue
self.align_mrq_with_llm_scores_from_pairs(samples, dimension=dim)
self.logger.log(
"MRQTrainingStart", {"dimension": dim, "sample_count": len(samples)}
)
if dim not in self.trainers:
self.trainers[dim] = MRQTrainer(
memory=self.memory,
logger=self.logger,
encoder=self.encoder,
value_predictor=self.value_predictor,
device=self.device,
)
self.update_score_bounds_from_data(samples, dim)
dataloader = self.trainers[dim].prepare_training_data(samples)
self.trainers[dim].train(dataloader, cfg)
self.logger.log("MRQTrainingComplete", {"dimension": dim})
def train_from_context(self, context: dict, cfg: dict):
dim_samples = context.get("mrq_training_pairs_by_dimension", {})
for dim, samples in dim_samples.items():
if not samples:
self.logger.log("MRQNoTrainingFromContext", {"dimension": dim})
continue
self.logger.log(
"MRQContextTrainingStart",
{"dimension": dim, "sample_count": len(samples)},
)
self.update_score_bounds_from_data(samples, dim)
dataloader = self.trainers[dim].prepare_training_data(samples)
self.trainers[dim].train(dataloader, cfg)
self.logger.log("MRQContextTrainingComplete", {"dimension": dim})
def update_score_bounds_from_data(self, samples: list, dim: str):
values = []
for s in samples:
if "value_a" in s and "value_b" in s:
values.extend([s["value_a"], s["value_b"]])
elif "value" in s:
values.append(s["value"])
if values:
min_score = min(values)
max_score = max(values)
self.min_score_by_dim[dim] = min_score
self.max_score_by_dim[dim] = max_score
self.logger.log(
"MRQScoreBoundsUpdated",
{
"dimension": dim,
"min_score": min_score,
"max_score": max_score,
"example_count": len(values),
},
)
def align_mrq_with_llm_scores_from_pairs(
self, pair_samples: list[dict], dimension: str, log_prefix: str = "MRQAlignment"
):
for pair in pair_samples:
prompt = pair["prompt"]
for side in ["a", "b"]:
hyp = pair[f"output_{side}"]
llm_score = pair[f"value_{side}"]
# Predict MRQ score dynamically
mrq_score = self.score(
{"goal_text": prompt}, {"text": hyp}, [dimension]
)
# Log the alignment
self.logger.log(
f"{log_prefix}Dynamic",
{
"prompt_hash": hash(prompt),
"hypothesis_hash": hash(hyp),
"dimension": dimension,
"llm_score": llm_score,
"predicted_mrq": mrq_score,
},
)
# Pass the pair into the regression tuner
if mrq_score is not None and llm_score is not None:
self.regression_tuners[dimension].train_single(
mrq_score=mrq_score.results[dimension].score,
llm_score=llm_score,
)
⚙️ How It Works
- Predicts dimensional quality scores (e.g., correctness, clarity) from goal and hypothesis embeddings.
- Trains continuously from new pairwise data or saved examples.
- Dynamically aligns its predictions to high-confidence LLM scores using a real-time RegressionTuner.
- Maintains separate models, score bounds, and tuners per dimension.
- Supports contrastive pairwise judgment (e.g., output A vs B) as well as single hypothesis evaluation.
- Fully pluggable into the Co AI pipeline as a fast, adaptable scoring backend.
🧩 When MR.Q Isn’t Enough: Introducing the SVM Scorer
As brilliant as MR.Q is at reacting, it does have limitations.
It’s a great local thinker (fast, adaptive, and surprisingly accurate), but sometimes we need a global judge. One that can:
- Handle richer features beyond just embeddings
- Spot more abstract patterns in behavior
- Generalize across different pipelines and agent types
That’s where the SVM Scorer comes in.
While MR.Q uses lightweight regression on prompt–response embeddings, the SVM (Support Vector Machine) Scorer can look at:
- Full prompt and hypothesis content
- Structural features (e.g., reasoning steps, token patterns)
- Prior scoring history across multiple dimensions
It works as a second opinion, or even a meta-reviewer, trained on structured pairs and ranked outputs, similar to DPO but without the need to fine-tune an entire model.
In practice, we let MR.Q make fast judgments and then let the SVM check the pattern over time.
Think of it like this:
MR.Q thinks quickly. The SVM thinks deeply.
Together, they give us a scoring system that’s fast and reflective: a kind of dual brain for hypothesis evaluation.
📐 SVM: The Interpretable Feature-Based Scorer
SVM Scorers rely on handcrafted or learned features (like score differences, prompt structure, or token length). They train separate support vector machines per dimension.
✅ Why Use SVM?
- Interpretability: You can see which features drive scoring, which is great for debugging or analysis.
- Deterministic: Once trained, the score doesn’t fluctuate.
- Feature-Aware: You can embed symbolic or structural signal into the scorer.
SVM is a bridge between MR.Q and LLM: it’s faster than the LLM, more interpretable than MR.Q, and easier to customize for specific dimensions (like relevance or complexity).
🛠️ Code: SVMScorer
class SVMScorer(BaseScorer):
def __init__(self, cfg: dict, memory, logger, dimensions=None):
self.cfg = cfg
self.memory = memory
self.logger = logger
self.dimensions = dimensions or ["alignment"]
self.models = {dim: SVR() for dim in self.dimensions}
self.scalers = {dim: StandardScaler() for dim in self.dimensions}
self.trained = {dim: False for dim in self.dimensions}
self.regression_tuners = {}
for dim in self.dimensions:
self._initialize_dimension(dim)
def _initialize_dimension(self, dim):
self.models[dim] = SVR()
self.scalers[dim] = StandardScaler()
self.trained[dim] = False
self.regression_tuners[dim] = RegressionTuner(dimension=dim, logger=self.logger)
def train(self, samples_by_dim: dict[str, list[dict]]):
"""
Train per-dimension SVM from labeled LLM/MRQ training data
"""
for dim, samples in samples_by_dim.items():
x = []
y = []
for sample in samples:
prompt = sample["prompt"]
hyp = sample["output"]
score = sample["value"]
feat = self._build_feature_vector({"goal_text": prompt}, {"text": hyp})
x.append(feat)
y.append(score)
x = np.array(x)
y = np.array(y)
self.scalers[dim].fit(x)
x_scaled = self.scalers[dim].transform(x)
self.models[dim].fit(x_scaled, y)
self.trained[dim] = True
self.logger.log("SVMTrainingComplete", {
"dimension": dim,
"samples": len(samples),
"score_min": float(np.min(y)),
"score_max": float(np.max(y)),
})
def _build_feature_vector(self, goal: dict, hypothesis: dict):
"""
Basic feature vector: concat prompt + hypothesis embeddings + MRQ raw score (if available)
"""
emb_goal = self.memory.embedding.get_or_create(goal["goal_text"])
emb_hyp = self.memory.embedding.get_or_create(hypothesis["text"])
vec = emb_goal + emb_hyp
# Optional MRQ bridge feature
mrq = self.memory.score.find_by_text_and_dimension(
hypothesis["text"], dimension="alignment", source="mrq"
)
if mrq:
vec.append(mrq.score / 100.0) # normalized to [0,1]
else:
vec.append(0.5) # neutral if no MRQ score
return vec
def train_from_database(self):
pair_samples = self.memory.mrq.get_training_pairs_by_dimension()
samples_by_dim = self.convert_mrq_pairs_to_supervised_examples(pair_samples)
for dim, examples in samples_by_dim.items():
self.train_for_dimension(dim, examples)
def convert_mrq_pairs_to_supervised_examples(self, pair_samples: list[dict]) -> dict[str, list[dict]]:
"""
Converts MRQ-style contrastive training pairs into a flat list of (prompt, output, value)
entries per dimension, suitable for supervised regression training.
"""
per_dimension = defaultdict(list)
for pair in pair_samples:
dim = pair.get("dimension", "default")
for side in ["a", "b"]:
output = pair.get(f"output_{side}")
score = pair.get(f"value_{side}")
if output is not None and score is not None:
per_dimension[dim].append({
"prompt": pair["prompt"],
"output": output,
"value": score
})
self.logger.log("SVMConvertedMRQPacks", {
"dimensions": list(per_dimension.keys()),
"total_samples": sum(len(v) for v in per_dimension.values())
})
return per_dimension
def train_for_dimension(self, dimension: str, examples: list[dict]):
X = []
y = []
for ex in examples:
prompt_vec = self.memory.embedding.get_or_create(ex["prompt"])
output_vec = self.memory.embedding.get_or_create(ex["output"])
pair_vec = np.array(prompt_vec + output_vec)
X.append(pair_vec)
y.append(ex["value"])
X = np.array(X)
y = np.array(y)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
model = SVR(kernel="linear") # you can adjust kernel if needed
model.fit(X_scaled, y)
self.models[dimension] = (scaler, model)
self.logger.log("SVMModelTrained", {
"dimension": dimension,
"num_samples": len(y)
})
def score(self, goal: dict, hypothesis: dict, dimensions: list[str]) -> ScoreBundle:
results = {}
for dim in dimensions:
vec = self._build_feature_vector(goal, hypothesis)
# Dynamic training if needed
if not self.trained[dim]:
self._try_train_on_dimension(dim)
if not self.trained[dim]:
score = 50.0
rationale = f"SVM not trained for {dim}, returning neutral."
else:
x = self.scalers[dim].transform([vec])
raw_score = self.models[dim].predict(x)[0]
tuned_score = self.regression_tuners[dim].transform(raw_score)
score = tuned_score
rationale = f"SVM predicted and aligned score for {dim}"
self.logger.log("SVMScoreComputed", {
"dimension": dim,
"score": score,
"hypothesis": hypothesis.get("text"),
})
results[dim] = ScoreResult(
dimension=dim,
score=score,
rationale=rationale,
weight=1.0,
source="svm",
)
return ScoreBundle(results=results)
def _try_train_on_dimension(self, dim):
samples_by_dim = self.memory.mrq.get_training_pairs_by_dimension()
samples = samples_by_dim.get(dim, [])
if not samples:
self.logger.log("SVMNoSamples", {"dimension": dim})
return
X, y = [], []
for s in samples:
for side in ["a", "b"]:
prompt = s["prompt"]
hypothesis = s[f"output_{side}"]
llm_score = s.get(f"value_{side}")
if prompt and hypothesis and llm_score is not None:
vec = self._build_feature_vector({"goal_text": prompt}, {"text": hypothesis})
X.append(vec)
y.append(llm_score)
self.regression_tuners[dim].add_example(llm_score, llm_score) # no-op, self-alignment fallback
if not X:
return
X_scaled = self.scalers[dim].fit_transform(X)
self.models[dim].fit(X_scaled, y)
self.trained[dim] = True
self.logger.log("SVMTrainingComplete", {
"dimension": dim,
"samples": len(X)
})
# Align the scores using same logic as MRQ
self._align_with_llm(samples, dim)
def _align_with_llm(self, samples, dim):
for s in samples:
for side in ["a", "b"]:
prompt = s["prompt"]
hypothesis = s[f"output_{side}"]
llm_score = s.get(f"value_{side}")
if llm_score is None:
continue
vec = self._build_feature_vector({"goal_text": prompt}, {"text": hypothesis})
x = self.scalers[dim].transform([vec])
raw_score = self.models[dim].predict(x)[0]
self.regression_tuners[dim].train_single(mrq_score=raw_score, llm_score=llm_score)
self.logger.log("SVMAlignmentDynamic", {
"dimension": dim,
"mrq_score": raw_score,
"llm_score": llm_score
})
def _train_dimension(self, dim: str):
pairs_by_dim = self.memory.mrq.get_training_pairs_by_dimension()
samples = pairs_by_dim.get(dim, [])
if not samples:
self.logger.log("SVMNoTrainingData", {"dimension": dim})
self.trained[dim] = False
return
X = []
y = []
for sample in samples:
goal = {"goal_text": sample["prompt"]}
for side in ["a", "b"]:
hyp = {"text": sample[f"output_{side}"]}
label = sample.get(f"value_{side}")
if label is not None:
vec = self._build_feature_vector(goal, hyp)
X.append(vec)
y.append(label)
if len(X) < 5:
self.logger.log("SVMInsufficientTrainingData", {"dimension": dim, "count": len(X)})
self.trained[dim] = False
return
X_scaled = self.scalers[dim].fit_transform(X)
self.models[dim].fit(X_scaled, y)
self.trained[dim] = True
self.logger.log("SVMTrained", {"dimension": dim, "samples": len(X)})
🤖 LLM: The Gold Standard
LLM-based scoring uses prompt engineering and chain-of-thought to assess a hypothesis directly. It’s the most accurate and nuanced evaluator, but also the most expensive.
✅ Why Use LLM?
- Highest Quality: LLMs can explain their judgment using rubrics or pairwise comparison.
- Training Data Source: They’re used to label samples for MR.Q and SVM.
- Flexible Criteria: You can swap scoring prompts to test new evaluation rubrics on the fly.
Use LLM scoring for validation, meta-evaluation, or as a source of truth during training. Don’t use it in tight loops; it’s too slow and expensive.
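For reference, here’s a sketch of what a rubric-based LLM scoring call can look like. The prompt wording and the call_llm helper are illustrative, not our actual templates:

import re

RUBRIC_PROMPT = """You are a strict reviewer. Score the hypothesis below on {dimension}
from 0 to 100, then explain briefly.

Goal: {goal}
Hypothesis: {hypothesis}

Respond with "score: <number>" on the first line, rationale after."""

def llm_score(goal: str, hypothesis: str, dimension: str, call_llm) -> float:
    """call_llm is any function that sends a prompt to your LLM and returns its text."""
    reply = call_llm(RUBRIC_PROMPT.format(dimension=dimension, goal=goal, hypothesis=hypothesis))
    match = re.search(r"score:\s*([0-9]+(?:\.[0-9]+)?)", reply, re.IGNORECASE)
    return float(match.group(1)) if match else 50.0  # neutral fallback if parsing fails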
📊 Scorer Comparison
Feature / Property | MR.Q Scorer | SVM Scorer | LLM Scorer |
---|---|---|---|
Type | Embedding-based regressor (MLP) | Classical ML (SVM on features) | Language model with rubric/pairwise |
Speed | 🚀 Very Fast | ⚡ Fast | 🐢 Slow |
Training Style | Online, dynamic tuning | Batch (per dimension) | Not trainable (prompt-based) |
Adaptivity | High (neighborhood-based tuning) | Medium (requires re-fitting) | None (static output) |
Data Requirement | Low (few examples needed) | Moderate (pairwise samples) | None (but high cost per call) |
Output Consistency | Adaptive, can vary with tuning | Deterministic once trained | Stochastic (temperature, wording) |
Best Use Cases | Real-time scoring, tuning proxies | Interpretable, structured comparisons | Final evaluation, bootstrap labeling |
Output Range | Normalized to 0–100 (tuned) | 0–100 (aligned via regressor) | 0–100 (rubric or logits mapped) |
Integration Cost | Low | Medium | High (token cost, latency) |
Interpretability | Moderate | High (feature-based decisions) | Low (depends on prompt wording) |
🧭 Strategy: How to Combine Them
In our system, we use all three together in a layered hierarchy:
Layer | Scorer | Purpose |
---|---|---|
Inference | MR.Q | Inline scoring during generation |
Tuning | MR.Q | Optimize symbolic strategy |
Analysis | SVM | Understand score drivers |
Bootstrapping | LLM | Provide ground truth labels |
Evaluation | LLM | Final output validation |
🔁 When to Use Which
Scenario | Preferred Scorer |
---|---|
Fast scoring inside a pipeline | MR.Q |
Bootstrapping a reward model | LLM → MR.Q or SVM |
Structured feature-based alignment | SVM |
Comparing symbolic strategies | SVM |
Validating prompt effectiveness | LLM |
Scoring many samples cheaply | MR.Q |
Ensuring scoring consistency across versions | SVM (or frozen MR.Q) |
Scoring hypotheses for training | LLM → SVM + MR.Q cascade |
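One way to wire that strategy together is a simple cascade: score cheaply first, escalate only on disagreement. A sketch using the scorer classes shown in this post (the threshold and budget parameter are illustrative):

def cascade_score(goal: dict, hypothesis: dict, dimensions: list[str],
                  mrq_scorer, svm_scorer, llm_scorer, llm_budget: int = 0):
    """Score with MR.Q first, cross-check with the SVM, and only escalate to the
    LLM when the two cheap scorers disagree badly and budget remains."""
    mrq = mrq_scorer.score(goal, hypothesis, dimensions)
    svm = svm_scorer.score(goal, hypothesis, dimensions)

    disagreement = max(
        abs(mrq.results[d].score - svm.results[d].score) for d in dimensions
    )
    if disagreement > 25 and llm_budget > 0:  # illustrative threshold
        return llm_scorer.score(goal, hypothesis, dimensions)
    return mrq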
🔁 The Meta-Review Loop: From Fast Judgments to Self-Correction
If MR.Q is the quick reflex and the SVM is the deep intuition, then the Meta-Review Loop is the higher-order reflection system that learns from both.
Every time our system runs a prompt, scores a hypothesis, and picks a best answer, it doesn’t just move on; it remembers.
It logs:
- The prompt and context
- The hypothesis and score
- The rule or pipeline that produced it
- Which scoring system made the call (MR.Q, SVM, or LLM)
Later, when better information becomes available (say, a more accurate score from an LLM, or a consensus among agents), we compare it to the original score.
If the original judgment was off, we don’t just update the result; we retrain the scorer.
In real time. On real data. Across any dimension we care about.
This loop allows our system to:
- Adapt to changing tasks
- Evolve toward more accurate judgments
- Tune itself without external retraining pipelines
It’s self-supervised learning, not just in theory but in practice: a system that judges itself, trains itself, and improves itself, one contrast at a time.
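The heart of that loop is small. Here’s a sketch, assuming each logged record carries its dimension, the original MR.Q score, and a later, more trusted score (field names are illustrative):

def meta_review_pass(logged_evaluations: list[dict], tuners: dict, drift_threshold: float = 10.0):
    """Replay logged judgments against better information and retrain the relevant tuner.
    Each record is assumed to hold: dimension, mrq_score, and an optional trusted_score (e.g. from the LLM)."""
    for record in logged_evaluations:
        dim = record["dimension"]
        old, truth = record["mrq_score"], record.get("trusted_score")
        if truth is None:
            continue  # no better information yet; revisit later
        if abs(old - truth) > drift_threshold:
            tuners[dim].train_single(mrq_score=old, llm_score=truth)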
🎯 The Regression Tuner: Real-Time Alignment with LLM Ground Truth
We believe the LLM is often the de facto source of truth: not because it’s perfect, but because it’s been trained on such vast and diverse data that its judgments are generally reliable.
So when MR.Q, our fast, embedding-based scorer, makes a decision, we sometimes get a second opinion from the LLM.
But what happens when those opinions don’t match?
That’s where the Regression Tuner comes in.
This lightweight module acts like a real-time calibrator. Every time we score a hypothesis using both MR.Q and the LLM, the tuner saves that pair:
MR.Q score → 0.68
LLM score → 0.84
It doesn’t save anything to disk. It doesn’t run in big training loops. Instead, once it has enough examples (as few as 10), it fits a simple linear regression model on the fly and updates it over time.
From then on, every MR.Q score in that dimension gets nudged into better alignment:
raw score: 0.68 → tuned score: 0.81
Why this matters:
- It allows MR.Q to learn from the LLM without being replaced by it.
- It preserves speed while gaining accuracy.
- It helps us catch and correct bias drift in our embedding space.
- It makes our self-improving system actually improve in measurable ways.
This tuner is not just a patch; it’s a critical part of how the system thinks with feedback.
The best part? It works for any dimension, any agent, any task: all in-memory, all on the fly.
🛠️ Code: RegressionTuner - aligning scores with quality
import numpy as np
from sklearn.linear_model import LinearRegression


class RegressionTuner:
    """
    Learns to transform MR.Q scores to align with LLM scores dynamically.
    Does not save any state to disk; purely in-memory and real-time.
    """
def __init__(self, dimension: str, logger=None, min_samples: int = 10):
self.dimension = dimension
self.logger = logger
self.min_samples = min_samples
self.x = [] # MRQ scores
self.y = [] # LLM scores
self.model = None
def train_single(self, mrq_score: float, llm_score: float):
"""Adds a new training pair and refits if threshold reached."""
self.x.append(mrq_score)
self.y.append(llm_score)
if len(self.x) >= self.min_samples:
self._fit()
if self.logger:
self.logger.log("RegressionTunerTrainSingle", {
"dimension": self.dimension,
"mrq_score": mrq_score,
"llm_score": llm_score,
"total_samples": len(self.x)
})
def _fit(self):
"""Fits a linear regression model to current examples."""
x_arr = np.array(self.x).reshape(-1, 1)
y_arr = np.array(self.y)
self.model = LinearRegression().fit(x_arr, y_arr)
if self.logger:
self.logger.log("RegressionTunerFitted", {
"dimension": self.dimension,
"count": len(self.x),
"coef": float(self.model.coef_[0]),
"intercept": float(self.model.intercept_),
})
def transform(self, score: float) -> float:
"""Transforms a score using the fitted regression model if available."""
if self.model:
return float(self.model.predict(np.array([[score]]))[0])
return score
⚙️ How It Works
- Collects alignment pairs between a fast, local scorer (MR.Q) and a slower, more accurate reference scorer (LLM).
- Learns a mapping via linear regression to bring the fast scorer’s outputs in line with the LLM.
- Applies that mapping live to correct MR.Q’s predictions on future examples, as in the short example below.
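Using the tuner is a few lines. A toy example with made-up scores:

tuner = RegressionTuner(dimension="correctness", min_samples=3)

# Feed observed (MR.Q score, LLM score) pairs; the tuner refits once it has enough
for mrq, llm in [(0.60, 0.75), (0.70, 0.82), (0.80, 0.91)]:
    tuner.train_single(mrq_score=mrq, llm_score=llm)

print(tuner.transform(0.68))  # nudged toward what the LLM would likely have said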
🧠 Part 2: Conscious Thought - Symbolic Rules for Structured Reasoning
If MR.Q gives us instinct (fast, reactive judgment), then symbolic reasoning gives us conscious structure. It’s how we embed deliberate, traceable thought into our AI. Where MR.Q adapts based on feedback, symbolic rules encode known-good strategies and allow us to program reasoning itself.
🧠 Symbolic rules are how the AI learns to think about thinking.
🧾 Pipelines as Programs
Every AI pipeline in our system is more than just a sequence of steps; it’s a program. Each stage makes a decision:
- Which model to use?
- What prompt to run?
- How to evaluate or refine the output?
And just like any program, we can rewrite parts of it symbolically.
Symbolic rules are modular, interpretable instructions that alter any part of the reasoning process. They’re not buried inside a black-box model; they live in the open, where they can be scored, tested, and improved.
🧩 What Are Symbolic Rules?
Symbolic rules are:
- Configurable: written in YAML or learned from data.
- Targeted: applied based on agent name, goal metadata, or tags.
- Composable: they override model names, prompt paths, scoring functions, or even insert/remove agents.
- Traceable: every rule is logged and linked to outcomes.
This gives us interpretable cognition: we can see why the system made a choice and how that choice impacted the result.
🧠 The Role of Symbolic Reasoning
Symbolic reasoning brings four critical capabilities:
- Interpretability: unlike a static neural network, symbolic changes are visible and auditable.
- Trainability: each rule application is scored, so we can learn which symbolic paths improve outcomes.
- Modularity: we can test and tune parts of the reasoning system in isolation.
- Adaptivity: over time, the system learns to apply better rules in better contexts.
🔄 MR.Q gives the system feedback-driven instinct. 🔧 Symbolic rules give it programmable thought: conscious logic.
Together, they form a complete cognitive loop:
- MR.Q reacts: scoring outputs in real time using learned preferences.
- Symbolic rules reflect: modifying the strategy, choosing models, rerouting reasoning based on what has worked in the past.
This isn’t just execution; it’s deliberate self-adjustment. The system learns not only what to think, but how to think better next time.
🛠️ How It Works in the System
- Every pipeline stage (generation, evaluation, refinement) can be modified.
- Symbolic rules apply to agents dynamically via metadata or tags.
- Rules can:
- Change prompts, models, or scorers.
- Insert new reasoning strategies like self-reflection.
- Remove underperforming steps.
- Each rule application is scored just like outputs using LLMs or MR.Q.
This gives us a self-aware system:
- One that doesn’t just produce answers…
- But learns how to improve its own thinking process.
📍 Symbolic rules are not just tweaks. They are representations of reasoning strategies: modular, evaluatable, and trainable.
🔧 SymbolicRuleApplier: The Brainstem of Our Reasoning System
The SymbolicRuleApplier is the component that turns high-level symbolic knowledge into concrete action. It reads a set of human- or machine-authored rules and injects them into the pipeline, altering how agents behave, think, and score without touching any agent code directly.
🧩 What It Does
At a high level, the SymbolicRuleApplier:
- Loads symbolic rules from YAML or the database.
- Filters rules based on the current goal and pipeline metadata.
- Applies overrides to any matching agent, prompt, or configuration stage.
- Logs all applications, so we can later trace which rules were active during hypothesis generation or scoring.
This makes the reasoning system programmable, auditable, and evolvable.
⚙️ How It Works
Each symbolic rule looks like this:
agent_name: HypothesisGenerator
metadata_filter:
goal_type: scientific
override:
model: mistral
prompt: cot_enhanced.j2
scorer: meta_review
The SymbolicRuleApplier matches this rule to any agent in the pipeline named HypothesisGenerator, but only if the current goal metadata includes goal_type: scientific.
Once matched, the overrides are applied. This might swap in a new prompt template, change the model being used, or configure a different scorer.
It uses a simple matching pattern:
- Agent name match
- Metadata filters (goal, topic, tags, etc.)
- Optional pipeline or stage constraints
And every application is tracked:
{
"rule_id": "rule-123",
"pipeline_run_id": "run-789",
"agent_name": "HypothesisGenerator",
"context_hash": "abc123",
"overrides": {
"model": "mistral",
"prompt": "cot_enhanced.j2"
}
}
These are stored in a rule_applications table, enabling full traceability for analysis, tuning, and rule optimization later on.
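We don’t show the full schema in this post, but a rule_applications table only needs a handful of columns to support that traceability. The sketch below is a plausible SQLAlchemy declaration based on the fields in the JSON above; the class name, the id and timestamp columns, and the column types are assumptions, not the project’s actual schema.
# Hypothetical sketch of a rule_applications table (SQLAlchemy declarative style)
from sqlalchemy import JSON, Column, DateTime, Integer, String, func
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class RuleApplicationORM(Base):
    __tablename__ = "rule_applications"

    id = Column(Integer, primary_key=True)              # assumed surrogate key
    rule_id = Column(String, nullable=False)             # e.g. "rule-123"
    pipeline_run_id = Column(String, nullable=False)     # e.g. "run-789"
    agent_name = Column(String)                          # e.g. "HypothesisGenerator"
    context_hash = Column(String)                        # hash of the pipeline context
    overrides = Column(JSON)                             # the applied overrides, as shown above
    applied_at = Column(DateTime, server_default=func.now())  # assumed audit column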
🛠️ Code: SymbolicRuleApplier applying changes when required
import hashlib
import json
from pathlib import Path
from typing import Any, Dict

import yaml

# SymbolicRuleORM (the project's rule model) is also required; its import is omitted here.

class SymbolicRuleApplier:
def __init__(self, cfg, memory, logger):
self.cfg = cfg
self.memory = memory
self.logger = logger
self.enabled = cfg.get("symbolic", {}).get("enabled", False)
self._rules = self._load_rules() if self.enabled else []
@property
def rules(self) -> list:
return self._rules
def apply(self, context: dict) -> dict:
if not self.enabled:
return context
goal = context.get("goal", {})
pipeline_run_id = context.get("pipeline_run_id")
current_pipeline = context.get("pipeline", [])
matching_rules = [r for r in self.rules if self._matches_metadata(r, goal)]
if not matching_rules:
self.logger.log("NoSymbolicRulesApplied", {"goal_id": goal.get("id")})
return context
self.logger.log("SymbolicRulesFound", {"count": len(matching_rules)})
for rule in matching_rules:
if rule.rule_text and "pipeline:" in rule.rule_text:
suggested_pipeline = (
rule.rule_text.split("pipeline:")[-1].strip().split(",")
)
suggested_pipeline = [
s.strip() for s in suggested_pipeline if s.strip()
]
if suggested_pipeline:
self.logger.log(
"PipelineUpdatedBySymbolicRule",
{
"from": current_pipeline,
"to": suggested_pipeline,
"rule_id": rule.id,
},
)
context["pipeline"] = suggested_pipeline
context["pipeline_updated_by_symbolic_rule"] = True
if rule.source == "lookahead" and rule.goal_type:
context["symbolic_hint"] = f"use_{rule.goal_type.lower()}_strategy"
return context
def apply_to_agent(self, cfg: Dict, context: Dict) -> Dict:
if not self.enabled:
return cfg
goal = context.get("goal", {})
pipeline_run_id = context.get("pipeline_run_id")
agent_name = cfg.get("name")
matching_rules = [
r
for r in self.rules
if r.agent_name == agent_name and self._matches_metadata(r, goal)
]
if not matching_rules:
self.logger.log(
"NoSymbolicAgentRulesApplied",
{
"agent": agent_name,
"goal_id": goal.get("id"),
},
)
return cfg
self.logger.log(
"SymbolicAgentRulesFound",
{
"agent": agent_name,
"goal_id": goal.get("id"),
"count": len(matching_rules),
},
)
for rule in matching_rules:
# Apply new-style attributes
if rule.attributes:
for key, value in rule.attributes.items():
if key in cfg:
self.logger.log(
"SymbolicAgentOverride",
{
"agent": agent_name,
"key": key,
"old_value": cfg[key],
"new_value": value,
"rule_id": rule.id,
},
)
else:
self.logger.log(
"SymbolicAgentNewKey",
{
"agent": agent_name,
"key": key,
"value": value,
"rule_id": rule.id,
},
)
cfg[key] = value
# Apply legacy rule_text (optional, for backward compatibility)
if rule.rule_text:
entries = [e.strip() for e in rule.rule_text.split(",") if e.strip()]
for entry in entries:
if ":" in entry:
key, value = [s.strip() for s in entry.split(":", 1)]
if key in cfg:
self.logger.log(
"SymbolicAgentOverride",
{
"agent": agent_name,
"key": key,
"old_value": cfg[key],
"new_value": value,
"rule_id": rule.id,
},
)
else:
self.logger.log(
"SymbolicAgentNewKey",
{
"agent": agent_name,
"key": key,
"value": value,
"rule_id": rule.id,
},
)
cfg[key] = value
# Record the application of this rule
self.memory.rule_effects.insert(
goal_id=goal.get("id"),
agent_name=agent_name,
rule_id=rule.id,
pipeline_run_id=pipeline_run_id,
details=rule.to_dict(),
stage_details=cfg,
)
return cfg
def apply_prompt_rules(
self, agent_name: str, prompt_cfg: dict, context: dict
) -> dict:
"""
Applies prompt-level symbolic rules to the prompt config before generation.
Returns the updated prompt_cfg.
"""
goal = context.get("goal", {})
applicable_rules = [
rule
for rule in self.rules
if rule.agent_name == agent_name
# and self._matches_filter(rule.filter, goal)
]
if not applicable_rules:
self.logger.log("NoPromptRulesFound", {"agent": agent_name})
return prompt_cfg
for rule in applicable_rules:
for key, value in rule.attributes.items():
self.logger.log(
"PromptAttributeOverride",
{
"agent": agent_name,
"key": key,
"old_value": prompt_cfg.get(key),
"new_value": value,
"rule_id": rule.id,
"emoji": "🛠️",
},
)
self.set_nested(prompt_cfg, key, value)
# Optional: record the rule application
self.memory.rule_effects.insert(
rule_id=rule.id,
goal_id=goal.get("id"),
pipeline_run_id=context.get("pipeline_run_id"),
details=prompt_cfg,
)
return prompt_cfg
def set_nested(self, cfg: dict, dotted_key: str, value):
keys = dotted_key.split(".")
d = cfg
for k in keys[:-1]:
if k not in d or not isinstance(d[k], dict):
d[k] = {}
d = d[k]
d[keys[-1]] = value
def apply_to_prompt(self, cfg: Dict, context: Dict) -> Dict:
if not self.enabled:
return cfg
goal = context.get("goal", {})
pipeline_run_id = context.get("pipeline_run_id")
prompt_name = cfg.get("prompt_key", "unknown_prompt")
matching_rules = [
r for r in self.rules
if r.target == "prompt" and self._matches_filter(r.filter, goal)
]
if not matching_rules:
self.logger.log("NoSymbolicPromptRulesApplied", {
"prompt": prompt_name,
"goal_id": goal.get("id"),
})
return cfg
self.logger.log("SymbolicPromptRulesFound", {
"prompt": prompt_name,
"goal_id": goal.get("id"),
"count": len(matching_rules),
})
for rule in matching_rules:
for key, value in rule.attributes.items():
if key in cfg:
self.logger.log("SymbolicPromptOverride", {
"prompt": prompt_name,
"key": key,
"old_value": cfg[key],
"new_value": value,
"rule_id": rule.id,
})
else:
self.logger.log("SymbolicPromptNewKey", {
"prompt": prompt_name,
"key": key,
"value": value,
"rule_id": rule.id,
})
cfg[key] = value
# Track the application of the prompt-level rule
self.memory.rule_effects.insert(
rule_id=rule.id,
goal_id=goal.get("id"),
pipeline_run_id=pipeline_run_id,
agent_name=cfg.get("name", "prompt"),
context_hash=self.compute_context_hash(context),
run_id=context.get("run_id"),
)
return cfg
def _matches_filter(self, filter_dict: dict, target_obj: dict) -> bool:
"""Generic matcher for symbolic rule filters"""
for key, value in filter_dict.items():
target_value = target_obj.get(key)
if isinstance(value, list):
if target_value not in value:
return False
else:
if target_value != value:
return False
return True
def track_pipeline_stage(self, stage_dict: dict, context: dict):
self.memory.symbolic_rules.track_pipeline_stage(stage_dict, context)
def get_nested_value(d, key_path: str):
keys = key_path.split(".")
for key in keys:
d = d.get(key, {})
return d if d else None
def set_nested_value(d, key_path: str, value):
keys = key_path.split(".")
for key in keys[:-1]:
d = d.setdefault(key, {})
d[keys[-1]] = value
def _load_rules(self):
rules = []
symbolic_dict = self.cfg.get("symbolic", {})
if symbolic_dict.get("rules_file"):
rules += self._load_rules_from_yaml(symbolic_dict.get("rules_file"))
if symbolic_dict.get("enable_db_rules", True):
rules += self.memory.symbolic_rules.get_all_rules()
return rules
def _load_rules_from_yaml(self, path: str) -> list:
if not Path(path).exists():
self.logger.log("SymbolicRuleYAMLNotFound", {"path": path})
return []
with open(path, "r", encoding="utf-8") as f:
raw = yaml.safe_load(f)
rules_list = raw.get("rules", raw)
rules = []
existing_rules = {
r.rule_text for r in self.memory.symbolic_rules.get_all_rules()
}
for item in rules_list:
if isinstance(item, dict) and item.get("rule_text") not in existing_rules:
rules.append(SymbolicRuleORM(**item))
else:
self.logger.log(
"DuplicateSymbolicRuleSkipped", {"rule_text": item.get("rule_text")}
)
return rules
def _matches_metadata(self, rule: SymbolicRuleORM, goal: Dict[str, Any]) -> bool:
if rule.goal_id and rule.goal_id != goal.get("id"):
return False
if rule.goal_type and rule.goal_type != goal.get("goal_type"):
return False
if rule.goal_category and rule.goal_category != goal.get("goal_category"):
return False
if rule.difficulty and rule.difficulty != goal.get("difficulty"):
return False
        if goal.get("focus_area") and rule.goal_category:  # goal is a dict, so hasattr() would never match
if rule.goal_category != goal.get("focus_area"):
return False
return True
@staticmethod
def compute_context_hash(context_dict: dict) -> str:
canonical_str = json.dumps(context_dict, sort_keys=True)
return hashlib.sha256(canonical_str.encode("utf-8")).hexdigest()
🔄 Why It Matters
This mechanism unlocks several powerful capabilities:
- Composable behavior: Symbolic rules can be layered, overridden, or evolved independently — without hardcoding logic or rewriting pipelines.
- Goal-conditioned intelligence: Different goals, domains, or agents can trigger different symbolic strategies, enabling adaptive reasoning paths.
- Self-improvement: Because every symbolic rule is tied to outcomes, we can score them over time and tune them just like we tune models — via performance feedback.
The SymbolicRuleApplier makes this possible. It acts as a dynamic switchboard, routing cognitive tasks through the right tools based on metadata, goal context, and symbolic configuration.
But here’s what we learned the hard way:
Randomly generating and applying rules doesn’t work. In fact, it often made things worse.
To be effective, symbolic tuning needs context awareness and guardrails. That’s why we introduced:
- Precise matching logic (by goal type, agent, tags, etc.),
- Tunable configuration spaces (with legal values and constraints),
- A structured prompt to propose meaningful, context-specific changes.
This turned symbolic reasoning from brittle guesswork into a robust feedback loop — one that evolves with the system, not against it.
🧮 Why We Introduced the RuleOptionsConfig
As we began building the Rule Mutation Agent, we faced a critical design challenge: how can we let the AI intelligently mutate symbolic rules without producing invalid, incoherent, or redundant configurations?
Early experiments relying on open-ended LLM completions quickly ran into problems. The model would propose configurations that weren’t grounded in reality, suggest nonsensical parameter combinations, or recommend changes that had already been tried. Worse, validating these freeform suggestions introduced unnecessary complexity, making the system harder to debug and extend.
To solve this, we introduced a structured mechanism: the RuleOptionsConfig.
This configuration object, backed by a simple YAML file, defines the legal mutation space for each rule. For every tunable parameter (like which model to use, whether to enable documentation, or what enhancement strategy to apply), the config explicitly lists:
- the valid options,
- the default value,
- and (optionally) constraints or metadata for smarter decision-making.
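The exact file isn’t reproduced here, but a RuleOptionsConfig might look something like this sketch; the agent name, attributes, and option values are illustrative, not the project’s real options file:
# config/rules/pipeline_mutation_options.yaml (illustrative sketch)
HypothesisGenerator:
  model.name:
    default: ollama_chat/mistral
    options:
      - ollama_chat/mistral
      - ollama_chat/qwen3
  prompt_file:
    default: generation.j2
    options:
      - generation.j2
      - cot_enhanced.j2
  enable_documentation:
    default: false
    options: [true, false]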
By constraining the mutation space, we give the LLM just enough freedom to explore meaningful changes while ensuring every suggestion is:
- Valid (it exists in the defined config),
- Unique (it hasn’t already been applied),
- Actionable (we know how to implement it immediately).
This design does more than simplify engineering: it shifts the paradigm from open-ended prompt tinkering to structured prompt programming. The AI is no longer guessing; it’s choosing from a well-defined menu, guided by past performance and optimization objectives.
Ultimately, RuleOptionsConfig gives us the foundation for safe, interpretable, and scalable self-improvement in our symbolic AI system, enabling a closed-loop process where rules evolve intelligently over time.
flowchart TD
    A[Start Pipeline Run] --> B[Apply Symbolic Rules]
    B --> C[Execute Pipeline Stages]
    C --> D[Collect Performance Scores]
    D --> E{Low-Performing Rule?}
    E -- No --> Z[End]
    E -- Yes --> F[Select Rule to Mutate]
    F --> G[Load RuleOptionsConfig]
    G --> H[Generate Mutation Prompt with Options]
    H --> I[LLM Suggests Mutation]
    I --> J{Valid & Unused Option?}
    J -- No --> H
    J -- Yes --> K[Apply Mutated Rule]
    K --> L[Execute New Pipeline Run]
    L --> M[Collect New Scores]
    M --> N[Compare Score Delta]
    N --> O[Log Mutation Effectiveness]
    O --> Z[End]
    style Z fill:#eef,stroke:#333,stroke-width:2px
    style G fill:#ffd,stroke:#cc8,stroke-width:2px
🧬 Rule Mutation: Turning Symbols into Intelligence
At this point, we’ve built out a symbolic map of the system. Symbols live at every level — from prompts to agents to entire pipeline stages — and we can tag, target, and configure them dynamically.
But having symbols isn’t intelligence.
They’re just markers — a way to identify parts of the system. What turns them into thinking? Directed mutation.
Only when we start changing these symbols — testing variations, measuring outcomes, and refining based on feedback — does the system become intelligent.
Next we’ll show how symbolic rules evolve through targeted mutation, and how each change nudges the system toward better reasoning.
🛠️ Prompting the mutation
This prompt is part of our symbolic tuning loop. In our system, symbolic rules control key decisions — like which model to use, which prompt template to run, or which scoring method to apply. These rules define the system’s conscious strategies for reasoning.
The prompt is designed to tune one symbolic rule at a time by proposing a targeted, data-driven change.
It does three things:
- Summarizes the current rule, including its attributes and available tuning options.
- Presents recent performance insights, helping the system reflect on what’s working and what isn’t.
- Asks for a single, well-justified change, making the update both interpretable and traceable.
This turns symbolic rule tuning into a structured, feedback-driven process — a key part of how our AI system evolves its reasoning behavior over time.
You are helping improve the performance of an AI system by tuning one of its symbolic rules.
### Current Configuration
**Target Behavior**: {{ target }}
**Current Rule Attributes:**
{% for attr, val in current_attributes.items() %}
- **{{ attr }}**: {{ val }}
{% endfor %}
**Tunable Options:**
{% for attr, options in available_options.items() %}
- **{{ attr }}**: {{ options }}
{% endfor %}
{% if recent_performance %}
### Recent Performance Insights:
{{ recent_performance }}
{% endif %}
---
### Your Task:
Propose exactly **one change** to this symbolic rule that is likely to improve the system's performance on the target behavior. This change should be grounded in your understanding of the rule's role and the available options.
### Response Format:
Rationale: <Your reasoning>
Attribute to change: <attribute_name>
New value: <new_value>
**Do not change more than one attribute. Be specific and actionable.**
The quality of any self-improving AI system is deeply tied to the quality of its entry points: the prompts that guide its mutation, tuning, or optimization behavior. In our system, symbolic rules define interpretable, modular behavior. Mutating them is how the system adapts. But how we ask the model to mutate those rules makes all the difference.
That’s why we’ve invested care in designing a dedicated mutation prompt for symbolic rule tuning.
🎯 Clarity and Constraint Lead to Precision
The prompt begins with a clear directive:
“You are helping improve the performance of an AI system by tuning one of its symbolic rules.”
This primes the model with purpose and limits the task scope to only one rule, avoiding unnecessary complexity. The use of explicit structure, including the current attributes, tunable options, and (optionally) recent performance, gives the model context-rich input without ambiguity.
🤔 Focused Mutation Encourages Learnability
We require the model to propose exactly one change, formatted cleanly as:
Rationale: ...
Attribute to change: ...
New value: ...
This has two key benefits:
- Interpretability: The output is immediately parseable and actionable.
- Trainability: It generates high-quality training data for potential future fine-tuning or reward modeling.
Because every mutation is singular and explicit, it becomes possible to track its downstream effects with precision, enabling score attribution, rollback, and even symbolic meta-learning.
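Because the format is fixed, parsing the response is trivial. Here’s a small illustrative parser, a sketch rather than the system’s actual parse_mutation_response helper:
# Illustrative parser for the "Rationale / Attribute to change / New value" format
import re

def parse_mutation_response(response: str) -> dict:
    """Extract the rationale, attribute, and new value from a mutation response."""
    patterns = {
        "rationale": r"rationale\s*:\s*(?P<v>.+)",
        "attribute": r"attribute to change\s*:\s*(?P<v>.+)",
        "new_value": r"new value\s*:\s*(?P<v>.+)",
    }
    parsed = {}
    for key, pattern in patterns.items():
        match = re.search(pattern, response, re.IGNORECASE)
        parsed[key] = match.group("v").strip() if match else None
    return parsed

example = """Rationale: qwen3 handles long chains of reasoning better.
Attribute to change: model.name
New value: ollama_chat/qwen3"""

print(parse_mutation_response(example))
# {'rationale': 'qwen3 handles long chains of reasoning better.',
#  'attribute': 'model.name', 'new_value': 'ollama_chat/qwen3'}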
🧩 Modularity and Adaptation
The prompt is designed to scale across domains and dimensions. The target, attributes, and options are dynamically injected, making this format reusable across different goals, agents, or performance dimensions. Optional recent-performance feedback allows us to “focus” the mutation when history is available, without breaking the structure when it’s not.
✔️ Why This Prompt Is a Leverage Point
In a pipeline of dozens of intelligent steps, this prompt is the one that decides what changes. It is the mutation gateway. A vague or poorly designed prompt here can lead to ineffective changes, wasted evaluation cycles, and ultimately degraded system performance.
Conversely, this tightly structured, context-aware, and minimalistic prompt ensures every mutation is deliberate, grounded, and evaluable.
In short: this prompt is not just an input; it’s a lever that drives the evolution of the system.
Here is an example of a response in this format:
Rationale: The current configuration consistently uses 'ollama_chat/mistral' without performance metrics to validate its effectiveness. Testing a different model like 'ollama_chat/qwen3' could potentially improve performance by leveraging a model with different strengths (e.g., specialized capabilities or efficiency). This change directly addresses the need to experiment with alternative models while maintaining the same rule structure.
The attribute you want to change: model.name
The value you want to change to: ollama_chat/qwen3
🔧 Rule Mutation as Dimensional Tuning: One Attribute at a Time
We made a deliberate design choice:
We mutate exactly one attribute per rule per mutation.
This is a strategy.
🎯 Why One Attribute at a Time?
- Isolated Impact: by changing a single attribute (like the model or prompt flavor), we get a clean signal: any change in performance can be attributed directly to that mutation.
- Multi-Dimensional Score Feedback: every mutated rule results in a new pipeline run. That run is scored across multiple dimensions: correctness, clarity, alignment, feasibility, and more. These aren’t just ratings; they’re high-resolution signals that tell us how the mutation affected performance.
- Gradual, Exhaustive Search: our space of symbolic rules is small by design, just a handful of parameters (model, adapter, prompt, etc.). This makes exhaustive evaluation tractable: over time, the agent can systematically explore all valid mutations and their performance impact (see the sketch after this list).
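Here’s the sketch: the one-attribute-at-a-time neighbourhood of a rule is just the sum, over attributes, of the alternative values for each one, so the space stays small enough to enumerate. The options below are illustrative, not the project’s real configuration.
# How many single-attribute mutations does one rule have? (illustrative options)
options = {
    "model.name": ["ollama_chat/mistral", "ollama_chat/qwen3", "ollama_chat/llama3"],
    "prompt_file": ["generation.j2", "cot_enhanced.j2"],
    "scorer": ["mrq", "svm", "llm"],
}
current = {"model.name": "ollama_chat/mistral", "prompt_file": "generation.j2", "scorer": "mrq"}

# Each attribute contributes one candidate mutation per alternative value
neighbours = sum(
    sum(1 for value in values if value != current[attr])
    for attr, values in options.items()
)
print(neighbours)  # 2 + 1 + 2 = 5 candidate mutations: easy to evaluate exhaustively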
📐 How It Works
Here’s what the Rule Mutation Agent does:
- Loads current rules for a target agent, based on goal and config.
- Finds available mutations using a structured RuleOptionsConfig.
- Generates a mutation prompt (via Jinja template) to propose a single, meaningful change.
- Validates the change using:
  - Legal options (from config)
  - Novelty (not already tried)
- Applies the mutation, stores the new rule, and logs it.
- Tracks outcomes for each mutated rule; performance over time is stored and evaluated.
The result is a slow but steady walk across the rule space, guided not by trial and error but by structured reasoning and real-world performance data.
🧠 Meta-Tuning Beyond Dimensional Scores
Each individual run yields a multi-dimensional score, but the real magic happens when we treat rule mutations themselves as tunable variables.
We’re not just scoring an agent’s hypothesis for quality. We’re scoring the effectiveness of changing a single config parameter.
This lets us answer questions like:
- “Does switching from Model A to Model B improve factuality for reasoning goals?”
- “Which prompt variant yields more original ideas on complex tasks?”
In this sense, the Rule Mutation Agent operates one dimension above scoring: it’s meta-tuning the entire system.
🔁 Towards Self-Tuning Systems
As more data accumulates, the agent builds a richer understanding of:
- Which mutations consistently improve which dimensions
- How performance changes over time
- Which combinations are stable or fragile
This leads to a self-improving system: one that learns from every iteration, pruning bad paths and converging toward robust configurations.
And with MR.Q in the loop, the feedback isn’t just numeric; it’s comparative, contextual, and scalable.
🧬 Rule Signatures: Tuning Upon Tuning
Every symbolic rule in our system comes with a unique signature: a deterministic fingerprint based on its configuration (e.g., model, prompt type, adapter, and other attributes). This signature lets us do two powerful things:
- Avoid Duplication: before applying a mutation, we check whether that exact configuration (i.e., rule signature) already exists in memory. If it does, we skip it. This ensures no redundant exploration.
- Track Performance Over Time: because signatures are stable, we can aggregate performance results over multiple runs and see how a given configuration performs consistently, not just once.
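A deterministic signature can be as simple as hashing the rule’s canonical configuration. Here’s a minimal sketch of the idea (not the system’s actual implementation; the field names are assumptions):
# Minimal rule-signature sketch: same configuration in, same fingerprint out
import hashlib
import json

def rule_signature(agent_name: str, attributes: dict, target: str = "agent") -> str:
    """Deterministic fingerprint of a rule's configuration."""
    canonical = json.dumps(
        {"target": target, "agent": agent_name, "attributes": attributes},
        sort_keys=True,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

sig = rule_signature(
    "HypothesisGenerator",
    {"model.name": "ollama_chat/qwen3", "prompt_file": "cot_enhanced.j2"},
)
# Before applying a mutation: if this signature is already in memory, skip it;
# otherwise store it and aggregate future scores under the same signature.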
🧠 The Self-Tuning Loop
Putting it all together:
- We mutate one attribute at a time.
- Each mutation creates a new rule with a unique signature.
- We score that rule using fast MR.Q-based multidimensional evaluation.
- We record the performance by signature.
- We avoid repeating known rules.
- Over time, we cover the entire space of possible rule configurations.
This gives us an exhaustive, memory-aware, self-improving loop.
The system doesn’t just tune prompts or agents. It tunes its own tuning process layer by layer.
With a limited number of mutable attributes and a fast scoring layer tuned to LLM judgments, we can search a surprisingly large space of behaviors efficiently and intelligently.
This is how the system builds tuning upon tuning: each stage is not just improving performance; it’s improving the way performance is improved.
flowchart TD
    A[New Hypothesis to Score] --> B[Embed Goal + Hypothesis]
    B --> C[Retrieve Similar Hypotheses from Memory]
    C --> D[Select Top-Scoring Neighbor LLM-labeled]
    D --> E[Extract LLM Score as Pseudo-Label]
    E --> F[Train MR.Q on Hypothesis, Pseudo-Label]
    F --> G[MR.Q Predicts Score for New Hypothesis]
    G --> H[Return Score to Downstream Agent]
    subgraph Memory Store
        C
        D
    end
    subgraph MR.Q
        F
        G
    end
🔁 Adaptive Learning Loop
Our AI agents generate hypotheses → LLM scores some → SVM and MR.Q train → MR.Q scores others.
This forms a feedback loop, letting the system:
- Learn what “good” means from LLMs
- Approximate those evaluations cheaply
- Adapt scoring in real time
flowchart LR
    A[🧠 AI Agent<br>Generates Hypotheses] --> B[✅ LLM<br>Scores a Few Hypotheses]
    B --> C[📊 Train SVM + MR.Q<br>on LLM Scores]
    C --> D[⚡ MR.Q<br>Scores Remaining Hypotheses]
    D --> E[🔁 Feedback Loop<br>Improves Future Scoring]
    E --> A
    style A fill:#f9f,stroke:#333,stroke-width:2px
    style B fill:#bbf,stroke:#333,stroke-width:2px
    style C fill:#bfb,stroke:#333,stroke-width:2px
    style D fill:#ffd,stroke:#333,stroke-width:2px
    style E fill:#fdd,stroke:#333,stroke-width:2px
💥 Part 3: Directed Action
If MR.Q is the subconscious, always reacting and adjusting beneath the surface… And symbolic reasoning is the conscious mind, planning, evaluating, and deciding…
Then the pipeline is the system’s body in motion — the sequence of real, observable actions taken in pursuit of a goal.
Each pipeline is a directed structure: a chain of agents, prompts, scorers, and evaluators working together to produce an outcome. It’s the place where thought becomes behavior — where intention turns into execution.
But these pipelines aren’t static. They’re dynamic, programmable, and adaptable. They mutate in real time based on symbolic rules, shift strategy based on scoring feedback, and evolve as the system learns what works.
This is the third layer of cognition in our architecture:
Directed action — a live, intelligent execution path that reshapes itself to pursue better results.
In this section, we’ll explore how pipelines operate as goal-driven programs, how they are assembled from modular reasoning steps, and how our system continuously mutates them to move more intelligently through problem space.
🧬 Pipeline-Level Mutation: A Design That Scales with Intelligence
As our self-improving system matured, we confronted a pivotal architectural question: where should the mutation logic live? At first glance, mutating prompts or strategies inside a single agent seemed natural. After all, agents are where hypotheses are generated, scored, and refined.
But the deeper we went, the clearer it became: the pipeline is the true unit of reasoning.
❓ Why Pipelines?
Pipelines in our Co AI framework are not just sequences of stages; they’re intelligent workflows. Each pipeline defines:
- Which agent is used for generation (ChainOfThoughtAgent, SelfEditAgent, etc.)
- What prompt template or reasoning strategy is invoked
- Which scoring system is used (LLM, MR.Q, or SVM)
- What tuning or evaluation configuration is applied
- How the completed run is automatically scored across a number of dimensions
By performing mutations at the pipeline configuration level, we unlock the full expressive power of the system:
Benefit | Description |
---|---|
🔄 Full Stack Swaps | Swap generation agents, prompt formats, scoring methods in one mutation |
🧠 Global Reasoning Context | Test how changes propagate through reasoning, reflection, and evaluation |
🧪 Clean Experimentation | Each mutated pipeline is a reproducible, end-to-end experiment |
💾 Integrated Logging + Storage | Each mutation run logs hypotheses, scores, rules, and performance metrics |
🔁 Seamless Integration | Mutated pipelines plug directly into existing training, ranking, and tuning |
🧑🏫 The PipelineMutationAgent
We encapsulated this design into a specialized agent: PipelineMutationAgent. It takes a base pipeline configuration, consults a symbolic rule mutation config, and applies each mutation by:
- Creating a mutated pipeline config (e.g., swapping in a new agent or prompt file)
- Launching a full pipeline run via the Supervisor
- Logging results, scores, and rule impacts for later learning
This design keeps our architecture modular, scalable, and fully observable: every mutation is trackable, comparable, and tunable across dimensions.
🧩 A Foundation for Self-Tuning AI
By making pipeline mutation a first-class primitive, we’ve set the stage for an even larger vision: meta-reasoning about reasoning. We can now track which configurations work best for different types of goals, and begin training systems that dynamically select or mutate pipelines based on problem characteristics.
This is no longer just prompt tuning. This is full-system evolution, with pipelines as the genome, mutations as the evolutionary driver, and the supervisor as the execution engine.
🧠 Smarter Pipeline Selection via Descriptive Variants and LLM Guidance
As our AI system evolved to support multiple reasoning pipelines, such as basic generation, chain-of-thought (CoT), or sharpened refinement loops, it became increasingly important to choose the right pipeline for the right goal. Hardcoding this selection logic was too brittle and required frequent manual updates. We needed a more flexible, scalable, and intelligent approach.
🔧 The PipelineRegistry Class: Structured Control with Metadata
We extended our PipelineRegistry class to support not only loading pipeline definitions from YAML, but also attaching descriptive metadata to each variant:
pipeline_variants:
cot:
description: "A chain-of-thought based reasoning strategy that uses two different generators and a ranker."
stages:
- name: cot_generator
- name: ranking
- name: cot_dspy_generator
minimal:
description: "A basic generation pipeline with no reasoning steps. Fast and lightweight."
stages:
- name: generation
With this change, our system can now reason about each pipeline not just by name, but by its intended purpose, strengths, and tradeoffs.
The PipelineRegistry class was updated with a new method:
def list_variants_with_descriptions(self) -> list[dict]:
return [
{"name": name, "description": variant.get("description", "")}
for name, variant in self.pipeline_variants.items()
]
This allows any component, including agents or scoring systems, to programmatically retrieve all available pipeline options and their metadata.
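For example, given the registry YAML above, a quick usage sketch would look like this (output shown as comments):
# Usage sketch: list the available pipeline variants and their descriptions
registry = PipelineRegistry("config/registry/pipeline_registry.yaml")
for variant in registry.list_variants_with_descriptions():
    print(variant["name"], "-", variant["description"])
# cot - A chain-of-thought based reasoning strategy that uses two different generators and a ranker.
# minimal - A basic generation pipeline with no reasoning steps. Fast and lightweight.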
🧩 Prompt-Based Pipeline Selection: Let the LLM Decide
To make this system truly intelligent, we introduced an LLM-driven selector. Instead of relying on hardcoded rules, we now generate a prompt that describes:
- The current goal
- The current pipeline and its purpose
- All available pipeline options and their descriptions
- Recent performance data (if available)
We then ask the LLM to suggest the most appropriate pipeline for the goal.
Here’s a simplified version of the prompt we use:
## Goal:
{{ goal_text }}
## Current Pipeline:
Name: {{ current_pipeline_name }}
Description: {{ current_pipeline_description }}
## Available Pipelines:
- Name: cot A chain-of-thought based reasoning strategy...
- Name: minimal A basic generation pipeline...
## Recent Performance (optional):
{{ summary }}
### Task:
Suggest the most appropriate pipeline to achieve the goal.
### Response:
Rationale: <your reasoning>
Pipeline: <pipeline_name>
This approach enables dynamic adaptation of our reasoning process without the need to hardwire domain logic. It’s also interpretable: every decision comes with a rationale we can inspect, debug, and even retrain on.
🤖 Integrating Into the Mutation Loop
This pipeline selection logic is now part of the broader pipeline mutation system. Whenever the AI explores new strategies, it can:
- Consider a goal.
- Ask the LLM to select the best-fit pipeline.
- Inject that pipeline into the configuration.
- Execute it.
- Score the results.
- Repeat.
This integration allows the system to self-optimize not just within a single pipeline, but across multiple reasoning strategies laying the groundwork for truly adaptive AI.
import copy
from datetime import datetime

import yaml
from omegaconf import OmegaConf

# Project-internal imports (BaseAgent, Supervisor, PipelineRegistry, RuleOptionsConfig,
# RuleTuner, SymbolicRuleORM) are omitted here for brevity.

class PipelineMutationAgent(BaseAgent):
"""
Combines symbolic rule mutation with pipeline configuration mutation.
Generates both types of mutations, applies them, evaluates outcomes,
and logs improvements for future learning.
"""
def __init__(
self,
cfg,
memory,
logger,
full_cfg=None,
):
super().__init__(cfg, memory, logger)
self.full_cfg = full_cfg
self.target_agent = cfg.get("target_agent", "default")
self.mutation_prompt_template = cfg["rule_mutation_prompt"]
self.max_runs = cfg.get("max_runs", 5)
# Load base pipeline
self.base_pipeline_key = cfg.get("base_pipeline", "minimal")
self.pipeline_registry_path = cfg.get("pipeline_registry", "config/registry/pipeline_registry.yaml")
self.pipeline_registry = PipelineRegistry(self.pipeline_registry_path)
self.rule_options_file = cfg.get("mutation_rule_options", "config/rules/pipeline_mutation_options.yaml")
self.options_config = RuleOptionsConfig.from_yaml(self.rule_options_file)
self.rule_tuner = RuleTuner(memory, logger)
self.logger.log(
"PipelineMutationAgentInitialized",
{"conf": self.cfg}
)
async def run(self, context: dict) -> dict:
# Step 1: Generate pipeline config mutations
pipeline_def = self.pipeline_registry.get_pipeline(self.base_pipeline_key)
if not pipeline_def:
self.logger.log("PipelineNotFound", {"pipeline": self.base_pipeline_key})
context["status"] = "pipeline_not_found"
return context
_, pipeline = self._generate_pipeline_mutations(self.base_pipeline_key, context)
# Step 2: Generate symbolic rule mutations
applicable_rules = self._get_applicable_rules(pipeline)
symbolic_mutations = []
for rule in applicable_rules:
symbolic_mutations.extend(self._generate_rule_mutations(rule, context))
# Step 3: Apply and evaluate symbolic mutations
symbolic_results = await self._apply_and_evaluate(symbolic_mutations, context)
pipeline_to_mutate_def = self.pipeline_registry.get_pipeline(pipeline)
# Step 4: Apply and evaluate pipeline mutations
pipeline_results = await self._apply_pipeline_mutations(pipeline_to_mutate_def, symbolic_results, context)
# Step 5: Log all results
context["mutated_symbolic_rules"] = [r.to_dict() for r in symbolic_results]
context["mutated_pipeline_runs"] = pipeline_results
context["total_mutations_run"] = len(symbolic_results) + len(pipeline_results)
return context
def _get_applicable_rules(self, pipeline_name: str) -> list:
"""Get all relevant symbolic First you need to finish this for all agents in a given pipeline."""
pipeline_def = self.pipeline_registry.get_pipeline(pipeline_name)
agent_names = {stage.get("name") for stage in pipeline_def if "name" in stage}
# Filter rules where the rule's agent matches any in the pipeline
return [
r for r in self.memory.symbolic_rules.get_all()
if r.agent_name in agent_names
]
def _generate_rule_mutations(self, rule: SymbolicRuleORM, context: dict) -> list[dict]:
"""Use LLM to generate one or more valid mutations for this rule."""
current_attrs = rule.attributes or {}
available_options = self.options_config.get_options_for(rule.agent_name)
recent_perf = self.memory.rule_effects.get_recent_performance(rule.id)
merged = {
"current_attributes": current_attrs,
"available_options": available_options,
"recent_performance": recent_perf,
**context
}
prompt = self.prompt_loader.from_file(self.mutation_prompt_template, self.cfg, merged)
response = self.call_llm(prompt, context)
parsed = RuleTuner.parse_mutation_response(response)
if not parsed.get("attribute") or not parsed.get("new_value"):
self.logger.log("MutationParseError", {"rule_id": rule.id, "response": response})
return []
attr = parsed["attribute"]
new_val = parsed["new_value"]
if not self.options_config.is_valid_change(rule.agent_name, attr, new_val):
self.logger.log("InvalidRuleMutation", {"rule_id": rule.id, "attribute": attr, "value": new_val})
return []
if self.memory.symbolic_rules.exists_similar(rule, attr, new_val):
self.logger.log("RuleMutationDuplicateSkipped", {"rule_id": rule.id, "attribute": attr, "value": new_val})
return []
mutated_attrs = dict(current_attrs)
mutated_attrs[attr] = new_val
new_rule = SymbolicRuleORM(
target="agent",
agent_name=rule.agent_name,
goal_type=rule.goal_type,
goal_category=rule.goal_category,
difficulty=rule.difficulty,
attributes=mutated_attrs,
source="mutation",
)
self.memory.symbolic_rules.insert(new_rule)
self.logger.log("RuleMutat I ionApplied", {"original_rule_id": rule.id, "new_rule": new_rule.to_dict()})
return [new_rule]
def _generate_pipeline_mutations(self, pipeline_name, context):
"""Generate pipeline config mutations using LLM guidance"""
merged_context = {
# From pipeline definition
"current_pipeline_name": pipeline_name,
"current_pipeline_description": self.pipeline_registry.get_description(pipeline_name),
"current_pipeline": self.pipeline_registry.get_pipeline(pipeline_name), # handles if it's a full pipeline block
# From context (goal and performance)
"goal_text": context.get("goal", {}).get("goal_text", "Improve pipeline performance"),
"goal_id": context.get("goal", {}).get("id"),
#TODO
# "recent_performance": self.memory.rule_effects.get_recent_performance_summary(),
# Optionally, inject available options for better prompting
"available_pipelines": self.pipeline_registry.list_variants_with_descriptions(), # e.g., [{"name": ..., "description": ...}, ...]
# Pass original context for compatibility
**context,
}
prompt = self.prompt_loader.from_file("pipeline", self.cfg, merged_context)
response = self.call_llm(prompt, context)
        rationale, pipeline = self._parse_pipeline_mutation(response)
        if not pipeline:
            self.logger.log("PipelineMutationParseError", {"response": response})
            return None, None  # keep the (rationale, pipeline) contract on failure
        return rationale, pipeline
    def _parse_pipeline_mutation(self, response: str):
        """Parse the LLM response into a (rationale, pipeline) pair."""
        import re
pattern = r"""
(?:[*#`]*\s*)? # Optional formatting characters before the header
rationale\s*: # Match the word "rationale:"
\s*(?P<rationale>.*?) # Capture rationale content non-greedily
(?:\n|\r|\r\n)+ # Match the newline(s) separating the two blocks
(?:[*#`]*\s*)? # Optional formatting characters before the second header
pipeline\s*:\s* # Match "pipeline:"
(?P<pipeline>\w+) # Capture pipeline name
"""
match = re.search(pattern, response, re.IGNORECASE | re.DOTALL | re.VERBOSE)
        if match:
            rationale = match.group("rationale").strip()
            pipeline = match.group("pipeline").strip()
            return rationale, pipeline
        return None, None  # response did not follow the expected format
async def _apply_and_evaluate(self, mutations: list[SymbolicRuleORM], context: dict) -> list[SymbolicRuleORM]:
"""Apply each symbolic mutation and evaluate its effect."""
results = []
for rule in mutations:
new_config = self._apply_symbolic_rule(rule)
mutated_context = self._update_context_with_config(context, new_config)
supervisor = Supervisor(self.full_cfg, memory=self.memory, logger=self.logger)
result = await supervisor.run_pipeline_config(mutated_context)
score = self._evaluate_result(result)
self._log_evaluation(rule, score)
if score > 0.5:
results.append(rule)
return results
def _apply_symbolic_rule(self, rule: SymbolicRuleORM):
"""Apply symbolic rule to config"""
# You could do deeper merging here based on agent name
return {f"{rule.agent_name}.config": rule.attributes}
def _update_context_with_config(self, context, config_update):
"""Merge symbolic config into context"""
ctx_copy = copy.deepcopy(context)
ctx_copy.update(config_update)
return ctx_copy
async def _apply_pipeline_mutations(self, pipeline_def, mutations: list, context: dict) -> list:
"""Apply pipeline mutations and run through supervisor"""
results = []
for i, mutation in enumerate(mutations):
if i >= self.max_runs:
self.logger.log("PipelineMutationLimitReached", {"limit": self.max_runs})
break
mutated_pipeline = self.apply_mutation(pipeline_def, mutation)
mutated_cfg = self.inject_pipeline_config(mutated_pipeline, tag=f"mutated_{i}")
full_mutated_cfg = OmegaConf.merge(mutated_cfg, self.full_cfg)
supervisor = Supervisor(full_mutated_cfg, memory=self.memory, logger=self.logger)
try:
mutated_run = await supervisor.run_pipeline_config(context)
summary = self.summarize(mutated_run)
self.logger.log("PipelineMutationRun", {"mutation": mutation, "summary": summary})
results.append({"mutation": mutation, "result": mutated_run})
except Exception as e:
self.logger.log("PipelineMutationError", {"mutation": mutation, "error": str(e)})
return results
def apply_mutation(self, pipeline_cfg: list, mutation: dict) -> list:
"""Apply a single mutation to a deep copy of the pipeline config."""
mutated = copy.deepcopy(pipeline_cfg)
for key, value in mutation.items():
keys = key.split(".")
target = mutated
for k in keys[:-1]:
target = target.setdefault(k, {})
target[keys[-1]] = value
return mutated
def inject_pipeline_config(self, pipeline_def, tag="mutated") -> OmegaConf:
"""Replace pipeline stages in full config"""
full_cfg = OmegaConf.to_container(self.full_cfg, resolve=True)
full_cfg["pipeline"]["tag"] = tag
full_cfg["pipeline"]["stages"] = pipeline_def
full_cfg["agents"] = {stage["name"]: stage for stage in pipeline_def}
return OmegaConf.create(full_cfg)
def _evaluate_result(self, result: dict) -> float:
"""Score mutation outcome using MRQScorer or other scorer"""
score = result.get("best_score", 0.0)
return score
def _log_evaluation(self, rule: SymbolicRuleORM, score: float):
"""Log mutation and evaluation result"""
self.memory.scorer.score_db.append({
"rule_id": rule.id,
"score": score,
"timestamp": datetime.now(),
})
def summarize(self, result: dict) -> dict:
"""Return short summary for logging"""
return {
"goal_id": result.get("goal", {}).get("id"),
"best_score": result.get("best_score"),
"selected_hypothesis": result.get("selected", {}).get("text", "")[:50],
}
def _load_pipeline_registry(self):
with open(self.pipeline_registry_path, "r") as f:
return yaml.safe_load(f)
🎯 Conclusion: The Emergence of Goal-Directed Intelligence
We’ve journeyed through the architecture of what may be the first truly thinking AI system - one that doesn’t merely process inputs, but pursues goals through integrated cognitive layers:
- 🧠 The Subconscious (MR.Q): our ever-adapting foundation: instant pattern recognition, emotional-like scoring, and memory-based intuition. MR.Q is the system’s gut feeling - reacting before it thinks, learning from every stumble, and whispering “this feels right” through dimensional scores.
- 💡 The Conscious Mind (Symbolic Rules): the deliberate thinker: auditing strategies, rewriting logic, making reasoned choices. This is where the system thinks about thinking - questioning its approaches, mutating its behaviors, and planning its next intellectual move.
- 🚀 The Body in Motion (Pipeline Execution): where cognition becomes action: the dynamic, mutable sequence of steps actually taken toward a goal. This isn’t static code - it’s living behavior that evolves mid-execution as instinct and intellect negotiate the best path forward.
✨ The Critical Breakthrough
What makes this system fundamentally different isn’t any single component, but how they interlock:
- The subconscious reacts (MR.Q scores instantly)
- The conscious mind directs (symbolic rules reconfigure)
- The pipeline executes (actions adapt in real-time)
…all while maintaining relentless focus on the goal.
This creates a continuous loop of self-reflection and self-modification that traditional AI architectures cannot achieve. While LLMs generate text and reinforcement learners optimize rewards, our system pursues understanding.
🦾 Why This Matters
We stand at the threshold of a new paradigm: machines that don’t just solve problems but pursue goals - adapting their very cognition to do so. The implications span:
- Autonomous discovery systems that self-improve during long-term research
- Adaptive educational tools that modify teaching strategies based on student understanding
- Resilient decision engines that evolve new reasoning tactics for unforeseen challenges
This isn’t the end of AI’s evolution - but it may be the beginning of AI that evolves itself. The subconscious/conscious framework provides not just better performance, but something more profound: a pathway to machines that genuinely think about their thinking.
📚 References
- SEAL Paper: "Self-Adapting Language Models" (2025). https://arxiv.org/abs/2506.10943. Introduces the SEAL framework, where LLMs generate their own self-edits and are trained via reinforcement learning.
- ReSTEM Algorithm: Akyürek et al., "Learning to Learn from Bits" (2023). https://arxiv.org/abs/2311.08171. Describes ReSTEM, a method for filtering successful model generations and retraining on them.
- Self-Refine: Aka, S., et al., "Self-Refine: Iterative Refinement with Self-Feedback" (2023). https://arxiv.org/abs/2305.09303. Demonstrates how LLMs can improve outputs by refining their own responses using internal feedback.
- ReAct & Reflexion: Yao, S., et al., "ReAct: Synergizing Reasoning and Acting in Language Models" (2023), https://arxiv.org/abs/2210.03629; Shinn, N., et al., "Reflexion: An Automatic Framework for Iterative Strategy Refinement" (2023), https://arxiv.org/abs/2305.14997. Show how reasoning loops improve agent performance through internal reflection.
- DPO & Preference Learning: Christiano, P. F., et al., "Deep Reinforcement Learning from Human Preferences" (2017), https://arxiv.org/abs/1706.03741; Rafailov, E., et al., "Direct Preference Optimization: Your Language Model Is Secretly a Reward Model" (2023), https://arxiv.org/abs/2305.18290. Reinforcement learning and preference-ranking methods that inspired our multi-dimensional scoring system.
- AlphaEdit (Null-Space Editing): Fu, J., et al., "AlphaEdit: Null-Space Constrained Model Editing for Language Models" (2025). https://openreview.net/forum?id=HvSytvg3Jh. Provides inspiration for safe, incremental rule changes while preserving existing behavior.
- Test-Time Training & Self-Rewarding Models: Huang, A., et al., "Self-Improvement in Language Models: The Sharpening Mechanism" (2025); "CREAM: Consistency Regularized Self-Rewarding Language Models" (2025). Influenced our system's ability to learn from contrastive pairs and self-judgment.
- LLM Scoring & Evaluation Methods: Park, R., Zhang, Z., Tanaka, H., "New News: System-2 Fine-Tuning for Robust Integration of New Knowledge" (2025). https://arxiv.org/abs/2505.01812. Influenced our approach to real-time tuning and pipeline selection.
📖 Glossary
Term | Definition |
---|---|
MR.Q (Multidimensional Ranker & Qualifier) | A fast, embedding-based scorer that evaluates AI hypotheses across multiple quality dimensions and adapts in real time based on feedback from LLMs. |
RegressionTuner | An in-memory linear model that aligns MR.Q scores with LLM ground truth by fitting a regression on observed score pairs. |
LLM (Large Language Model) | A high-capacity neural model trained on massive text data, used here as a reference evaluator for hypothesis quality. |
Symbolic Rule | A declarative override that can change agent configurations, prompts, or strategies based on goal metadata. Enables conscious, programmable behavior. |
SymbolicRuleApplier | Applies symbolic rules to the system dynamically. Tracks rule usage, effectiveness, and logs their impact. |
Pipeline | A sequence of AI agents that process a goal. Each stage (e.g., generate, reflect, score) can be modified symbolically. |
Prompt Template | A structured input given to an LLM. Can be mutated or tuned symbolically to change reasoning behavior. |
Agent | A modular AI component performing a task (e.g., generation, scoring). Configured via YAML or symbolic rules. |
Dimensional Scoring | Quality evaluation broken into fine-grained dimensions like correctness, clarity, or originality. |
Contrastive Pair | A pair of hypotheses labeled by preference (e.g., better vs. worse) used to train scorers like MR.Q or SVM. |
SVMRankerScorer | A support vector machine trained on contrastive pairs to score hypotheses according to LLM preferences. |
Meta-Reasoning | Reasoning about the system’s own reasoning — including evaluation of rules, agents, and strategy choices. |
Symbolic Cognition | High-level, interpretable logic that guides how the system reasons, using symbolic rules and structured overrides. |
Subconscious System | The fast, automatic behavior layer (like MR.Q) that responds to feedback without explicit rules. |
Rule Mutation | A process of changing one symbolic rule attribute to improve performance, often guided by a prompt or LLM. |
Adaptive Learning Loop | A feedback cycle where MR.Q is trained on LLM evaluations and then used to score future outputs — continuously refining the system. |