Prompt Engineering

23 June 2025

How a self-evolving AI learns to reflect, score, and rewrite its own reasoning

🧪 Summary

What if an AI could not just solve problems, but also reevaluate its beliefs in the face of new information?

In this post, we introduce a system that does exactly that. At the core of our pipeline is a lightweight scoring model called MR.Q, responsible for evaluating ideas and choosing the best ones. But when it encounters a new domain, a new goal, or a shift in task format, it doesn’t freeze; it adapts.

General Reasoner: The smarter Local Agent

22 May 2025

🔧 Summary

The General Reasoner paper shows how we can train LLMs to reason across domains using diverse data and a generative verifier. In this post, I walk through our open-source implementation and show how we built a modular reasoning agent that generates multiple hypotheses, evaluates them with an LLM-based judge, and selects the best answer.

🧠 What We Built

We built a GeneralReasonerAgent that:

Dynamically generates multiple hypotheses using different reasoning strategies (e.g., cot, debate, verify_then_answer)
Evaluates each pair of hypotheses using either a local LLM judge or our custom MR.Q evaluator
Classifies the winning hypothesis using rubric dimensions
Logs structured results to a PostgreSQL-backed system
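
The generate, compare, and select steps above can be sketched roughly as follows. This is a minimal illustration, not the actual GeneralReasonerAgent code: the names `Hypothesis`, `generate_hypotheses`, `length_judge`, and `select_best` are all hypothetical, the generator returns placeholder text instead of calling an LLM, and the judge is a toy stand-in for the local LLM judge or MR.Q evaluator. Rubric classification and PostgreSQL logging are left out.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Callable, List


@dataclass
class Hypothesis:
    strategy: str  # e.g. "cot", "debate", "verify_then_answer"
    text: str      # the candidate answer produced with that strategy


def generate_hypotheses(goal: str, strategies: List[str]) -> List[Hypothesis]:
    # Hypothetical stand-in: a real agent would prompt an LLM once per strategy.
    return [Hypothesis(s, f"[{s}] answer to: {goal}") for s in strategies]


# A judge takes the goal and two hypotheses and returns the preferred one.
Judge = Callable[[str, Hypothesis, Hypothesis], Hypothesis]


def length_judge(goal: str, a: Hypothesis, b: Hypothesis) -> Hypothesis:
    # Toy placeholder for an LLM judge or MR.Q: prefers the longer text,
    # with ties going to the first argument.
    return a if len(a.text) >= len(b.text) else b


def select_best(goal: str, hypotheses: List[Hypothesis], judge: Judge) -> Hypothesis:
    # Compare every pair once and return the hypothesis with the most wins.
    if not hypotheses:
        raise ValueError("at least one hypothesis is required")
    wins = {h.strategy: 0 for h in hypotheses}
    for a, b in combinations(hypotheses, 2):
        wins[judge(goal, a, b).strategy] += 1
    return max(hypotheses, key=lambda h: wins[h.strategy])


if __name__ == "__main__":
    goal = "Why is the sky blue?"
    candidates = generate_hypotheses(goal, ["cot", "debate", "verify_then_answer"])
    best = select_best(goal, candidates, length_judge)
    print(best.strategy)
```

Passing the judge in as a plain callable is one way to make the evaluator swappable, so a local LLM judge and an MR.Q-style scorer could share the same selection loop.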

All of this was integrated with our existing stephanie framework, which includes:
