General Reasoner: The Smarter Local Agent

Summary
The General Reasoner paper shows how we can train LLMs to reason across domains using diverse data and a generative verifier. In this post, I walk through our open-source implementation showing how we built a modular reasoning agent capable of generating multiple hypotheses, evaluating them with an LLM-based judge, and selecting the best answer.
What We Built
We built a `GeneralReasonerAgent` that:

- Dynamically generates multiple hypotheses using different reasoning strategies (e.g., `cot`, `debate`, `verify_then_answer`, etc.)
- Evaluates each pair of hypotheses using either a local LLM judge or our custom MR.Q evaluator
- Classifies the winning hypothesis using rubric dimensions
- Logs structured results to a PostgreSQL-backed system
All of this was integrated with our existing co_ai framework, which includes:
- Modular agent definitions
- Hydra-based configuration
- Prompt templating
- JSONL and SQL-based logging and storage
```mermaid
graph TD
    A[External Goals Dataset StrategyQA] --> B[Goal Store]
    B --> C[GeneralReasonerAgent]
    C --> D[Hypothesis Generator Multi-Strategy Prompts]
    D --> E1[Strategy: CoT]
    D --> E2[Strategy: Plan-First]
    D --> E3[Strategy: Debate]
    D --> E4[Strategy: Verify-Then-Answer]
    D --> E5[Strategy: Counterfactual]
    E1 & E2 & E3 & E4 & E5 --> F[Hypotheses List]
    F --> G[Pairwise Judging]
    G -->|via LLM| H1[LLM Judge]
    G -->|via Scores| H2[MR.Q Evaluator]
    H1 & H2 --> I[Scored Hypothesis Pairs]
    I --> J[Select Best Hypothesis]
    J --> K[Rubric Classification]
    K --> L[Pattern Labels Deductive, Top-Down]
    L --> M[Pattern Stats Table]
    J --> N[Score Table]
    M & N --> O[Strategy Analysis Dashboard]
    O --> P[Research Insight: Which strategies work best for which goals and why?]
    style A fill:#fdf6e3,stroke:#586e75
    style P fill:#b58900,color:#fff,font-weight:bold
```
The Reasoning Loop
The `GeneralReasonerAgent` works as follows: it generates a set of hypotheses (one per configured reasoning strategy), compares them pairwise with a judge, classifies the winners against rubric dimensions, and logs the results. Each step is described in the sections below.
Evaluating Hypotheses with Pairwise Judging
After generating a list of hypotheses based on the configured reasoning strategies, the agent enters a loop where it compares each pair of hypotheses using a custom prompt template and a configured judge (e.g., MR.Q or a local LLM).
In the configuration we can choose between two modes: `generate_and_judge`, where the agent generates hypotheses itself using the configured model and then judges them, and `judge_only`, where it processes hypotheses produced by other agents as part of a larger pipeline.
```yaml
general_reasoner:
  name: general_reasoner
  ...
  thinking_mode: generate_and_judge   # or judge_only (for processing hypotheses from other agents)
  judge: llm                          # mrq or llm
  judge_prompt_file: judge_pairwise_comparison.txt  # the prompt used for judging
  judge_model:                        # the model used for judging
    name: ollama/mistral:7b-instruct
    api_base: http://localhost:11434
    api_key: null
```
```python
from itertools import combinations

for hyp_a, hyp_b in combinations(hypotheses, 2):
    prompt_text = prompt_loader.from_file(judging_prompt_template, self.cfg, {
        "goal": goal,
        "hypothesis_a": hyp_a.text,
        "hypothesis_b": hyp_b.text
    })
    preferred, score = self.judge.judge(prompt_text, goal, hyp_a.text, hyp_b.text)
```
Here's what this is doing:

- `combinations(hypotheses, 2)`: iterates over all possible pairs of hypotheses so each one can be compared head-to-head.
- `prompt_loader.from_file(...)`: loads a Jinja-style template and fills in the `goal`, `hypothesis_a`, and `hypothesis_b`, creating a natural language evaluation task.
- `self.judge.judge(...)`: sends the constructed prompt to the evaluator (LLM or MR.Q) to determine which hypothesis better addresses the goal and assigns a score.
This structure allows the system to perform pairwise comparisons, a more robust evaluation method than scoring each hypothesis independently. It mimics how humans might compare multiple ideas to pick the best one.
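For context, the "Select Best Hypothesis" step from the diagram above can be as simple as tallying pairwise wins. Below is a minimal sketch under that assumption; the standalone function shape and the tie-breaking rule (accumulated winner score) are illustrative, not the exact co_ai implementation:

```python
from collections import defaultdict
from itertools import combinations

def select_best_hypothesis(hypotheses, judge, prompt_loader, template, cfg, goal):
    """Tally pairwise wins per hypothesis and return the overall winner (sketch)."""
    wins = defaultdict(int)
    totals = defaultdict(float)
    for hyp_a, hyp_b in combinations(hypotheses, 2):
        prompt_text = prompt_loader.from_file(template, cfg, {
            "goal": goal,
            "hypothesis_a": hyp_a.text,
            "hypothesis_b": hyp_b.text,
        })
        preferred, score = judge.judge(prompt_text, goal, hyp_a.text, hyp_b.text)
        winner = hyp_a if preferred == "A" else hyp_b
        wins[id(winner)] += 1
        totals[id(winner)] += score
    # Most pairwise wins; ties broken by accumulated winner score.
    return max(hypotheses, key=lambda h: (wins[id(h)], totals[id(h)]))
```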
Judging Template and MR.Q Integration
The judging step is powered by a prompt template file that defines how the goal and hypotheses are presented to the evaluator. This is usually a simple Jinja-style text file such as `judge_pairwise_comparison.txt`. Here's an example of what that file might look like:
```
You are an impartial expert evaluator. Given a goal and two hypotheses, your task is to judge which hypothesis better addresses the goal.

### Goal:
{{ goal }}

### Hypothesis A:
{{ hypothesis_a }}

### Hypothesis B:
{{ hypothesis_b }}

### Evaluation Criteria:
- Relevance to the goal
- Reasoning quality and depth
- Clarity and coherence
- Use of evidence or logical structure

### Instructions:
Do **not** include any extra explanation or internal thoughts.
**Respond only using the following exact format**:

better hypothesis:<A or B>
reason:<Brief explanation>
score_a:<0-100>
score_b:<0-100>
```
This prompt gets rendered by the agent with real content from each hypothesis and is then passed to the judge.
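Because the template pins the judge to an exact output format, parsing the reply is straightforward. The following is a small sketch of what that parsing step could look like; the function name and regex details are assumptions, not the co_ai judge code:

```python
import re

def parse_judge_response(response: str):
    """Extract (preferred, score_a, score_b) from the fixed-format judge reply (sketch)."""
    preferred = re.search(r"better hypothesis:\s*([AB])", response, re.IGNORECASE)
    score_a = re.search(r"score_a:\s*(\d{1,3})", response, re.IGNORECASE)
    score_b = re.search(r"score_b:\s*(\d{1,3})", response, re.IGNORECASE)
    if not (preferred and score_a and score_b):
        return None  # in practice: retry, re-prompt, or fall back to a default
    return preferred.group(1).upper(), int(score_a.group(1)), int(score_b.group(1))
```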
Swapping in MR.Q as the Judge
Instead of using a local LLM to evaluate responses, we can switch to our custom MR.Q judge, a lightweight evaluator trained on prior scoring data.
This swap is handled via a config flag:
```yaml
agents:
  general_reasoner:
    judge: "mrq"   # use "llm" for LLM-based, or "mrq" for our custom model
```
Under the hood, `self.judge.judge(...)` automatically delegates to the correct backend. This gives us:
- Faster and more consistent evaluations
- No dependency on model inference
- Easy debugging and reproducibility
This modularity allows you to A/B test evaluation strategies by toggling a single flag in the config, which is ideal for benchmarking or training custom reward models like MR.Q.
Using MR.Q as the Judge
You can easily swap out the LLM judge with MR.Q by setting `judge: mrq` in your config. MR.Q evaluates hypotheses by comparing features learned from a large number of training examples.
Training MR.Q:

1. First, collect ~1000 scored LLM judgments.
2. Use SQL queries to extract pairs of hypotheses along with their scores, strategies, and reasoning metadata.
3. Train MR.Q to rank hypotheses based on this historical performance data.
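To make step 2 concrete, here is a rough sketch of extracting preference pairs from past judgments stored in PostgreSQL. The column names in the query are illustrative assumptions about the `scores` table, not the exact co_ai schema:

```python
import json
import psycopg2  # assumption: standard psycopg2 access to the project's PostgreSQL database

def export_mrq_training_pairs(dsn, out_path="mrq_training_pairs.jsonl", limit=1000):
    """Dump (goal, preferred, rejected) pairs from prior LLM judgments (sketch)."""
    query = """
        SELECT goal_text, hypothesis_a, hypothesis_b, winner, score_a, score_b
        FROM scores
        WHERE evaluator = 'llm'
        LIMIT %s
    """
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur, open(out_path, "w") as f:
        cur.execute(query, (limit,))
        for goal, hyp_a, hyp_b, winner, score_a, score_b in cur.fetchall():
            preferred, rejected = (hyp_a, hyp_b) if winner == "A" else (hyp_b, hyp_a)
            f.write(json.dumps({
                "goal": goal,
                "preferred": preferred,
                "rejected": rejected,
                "margin": abs(score_a - score_b),
            }) + "\n")
```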
MR.Q is a great fit when:

- You want fast, cheap evaluations
- You're analyzing new strategies
- LLM evaluations are too expensive or slow
Avoid using MR.Q to evaluate the same data it was trained on; this leads to circular logic and overfitting.
Hypothesis Generation with Reasoning Strategies
The core of the `GeneralReasonerAgent` lies in its ability to generate diverse hypotheses for the same goal, each using a different reasoning strategy. This is what allows us to explore a wide space of approaches, from standard chain-of-thought to structured debate.
Strategy-Driven Prompting
We configure the agent with a `generation_strategy_list`, for example:
```yaml
generation_strategy_list:
  - cot
  - plan_first
  - debate
  - verify_then_answer
  - counterfactual
```
For each strategy, the agent:

- Loads a matching prompt template file, e.g. `strategy_plan_first.txt`.
- Renders it using the goal and other inputs via our Jinja-based `PromptLoader`.
- Calls a local model (e.g. Qwen or Mistral) with the rendered prompt.
- Wraps the model output as a `Hypothesis` and tags it with the strategy used.
This loop ensures that each reasoning strategy gets a fair shot at addressing the goal with no special preference or ordering.
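As a rough sketch of that loop (not the actual agent code), generation could look like this; the `strategy_{name}.txt` naming follows the examples in this post, while the `Hypothesis` constructor arguments and the `self.call_llm` / `self.prompt_loader` attributes are assumptions about the framework:

```python
def generate_hypotheses(self, goal, context):
    """Generate one hypothesis per configured reasoning strategy (illustrative sketch)."""
    hypotheses = []
    for strategy in self.cfg.get("generation_strategy_list", ["cot"]):
        template_file = f"strategy_{strategy}.txt"  # e.g. strategy_plan_first.txt
        prompt_text = self.prompt_loader.from_file(template_file, self.cfg, {
            "goal": goal,
            **context,
        })
        response = self.call_llm(prompt_text, context)  # local model, e.g. Qwen or Mistral
        hypotheses.append(Hypothesis(text=response, strategy=strategy))  # tag with strategy used
    return hypotheses
```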
Example Template: `strategy_verify_then_answer.txt`
```
You are a careful reasoner. First, verify whether you have enough information to answer this question:

{{ goal }}

Then, and only then, provide a justified, evidence-based answer.
```
This structure nudges the model to be more cautious and methodical, which, as our results show, often leads to higher-quality hypotheses.
Why It Matters
Having a modular and pluggable strategy setup means:

- You can easily add new strategies (just drop in a new `.txt` file)
- You can benchmark strategy performance across many goals
- You can train MR.Q on reasoning strategy metadata for more accurate evaluations later
And since the strategy is stored in the database along with each hypothesis, everything is traceable and reproducible.
Rubric Classification
As an extension from our prior work on Chain-of-Thought pattern analysis, we added rubric-based classification for every judged hypothesis. After a pair is evaluated, the winning hypothesis is classified across dimensions like:
- Strategy Orientation (Top-Down vs Bottom-Up)
- Inference Style (Deductive vs Analogical)
- Exploration Breadth (Greedy vs Exhaustive)
- Reasoning Depth (Shallow vs Deep)
- Evidence Use (Belief-driven vs Data-driven)
- Certainty Expression (Confident vs Tentative)
This classification is handled via a small adapter that calls an LLM with a rubric template and parses the labeled result. Each classification is logged to the database and linked to the associated hypothesis and evaluation. This lets us analyze which strategies produce different reasoning styles over time.
```python
class RubricClassifierMixin:
    def _load_enabled_rubrics(self, cfg):
        enabled_rubrics = []
        rubrics_cfg = cfg.get("rubrics", [])
        for entry in rubrics_cfg:
            if entry.get("enabled", False):
                enabled_rubrics.append({
                    "dimension": entry["dimension"],
                    "rubric": entry["rubric"],
                    "options": entry["options"],
                })
        return enabled_rubrics

    def classify_with_rubrics(self, hypothesis, context, prompt_loader, cfg, logger):
        results = {}
        pattern_file = cfg.get("pattern_prompt_file", "cot_pattern.txt")
        rubrics = self._load_enabled_rubrics(cfg)
        for rubric in rubrics:
            rubric["goal"] = context["goal"]["goal_text"]
            rubric["hypotheses"] = hypothesis.text
            merged = {**context, **rubric}
            prompt_text = prompt_loader.from_file(pattern_file, cfg, merged)
            custom_llm = cfg.get("analysis_model", None)  # optionally a different model from generation
            result = self.call_llm(prompt_text, merged, custom_llm)
            results[rubric["dimension"]] = result
            logger.log(
                "RubricClassified",
                {
                    "dimension": rubric["dimension"],
                    "rubric": rubric["rubric"],
                    "classification": result,
                },
            )
        return results

    def classify_and_store_patterns(
        self,
        hypothesis,
        context,
        prompt_loader,
        cfg,
        memory,
        logger,
        agent_name,
        score=None,  # Optional numeric score or win count
    ):
        """Classifies rubrics and stores pattern stats for the given hypothesis."""
        pattern = self.classify_with_rubrics(
            hypothesis=hypothesis,
            context=context,
            prompt_loader=prompt_loader,
            cfg=cfg,
            logger=logger,
        )
        goal = self.extract_goal_text(context.get(GOAL))
        summarized = self._summarize_pattern(pattern)
        goal_id, hypothesis_id, pattern_stats = generate_pattern_stats(
            goal, hypothesis.text, summarized, memory, cfg, agent_name, score
        )
        memory.hypotheses.store_pattern_stats(goal_id, hypothesis_id, pattern_stats)
        logger.log(
            "RubricPatternsStored",
            {"goal_id": goal_id, "hypothesis_id": hypothesis_id, "summary": summarized},
        )
        context["pattern_stats"] = summarized
        return summarized
```
We took a simple approach here: the rubrics are defined directly in the agent configuration. Later we will dynamically generate them using the model.
```yaml
rubrics:
  - dimension: "Strategy Orientation"
    rubric: "Does the reasoning proceed in a hypothesis-first (top-down) or data-first (bottom-up) manner?"
    options: ["Top-Down", "Bottom-Up"]
    enabled: true
  - dimension: "Inference Style"
    rubric: "Is the reasoning based on deductive logic from known facts, or analogical reasoning across domains?"
    options: ["Deductive", "Analogical"]
    enabled: true
  - ...
```
Logging

- Scores stored in the `scores` table
- Hypotheses and rubric stats logged with metadata including strategy used, evaluator, and timestamp
Database Schema Integration
We extended our PostgreSQL schema with the following:

- `scores`: stores per-hypothesis evaluation results (including evaluator, score, score_type, strategy, and reasoning_strategy)
- `cot_pattern_stats`: tracks rubric label frequencies and dimensions
This allows us to analyze strategy performance, build dashboards, and train MR.Q on real-world hypothesis outcomes.
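For example, the strategy comparison shown below boils down to one aggregate over `scores`. This is a minimal sketch, assuming the `strategy` and `score` columns listed above and standard psycopg2 access:

```python
import psycopg2

def strategy_summary(dsn):
    """Return (strategy, judgement count, average score) rows from the scores table (sketch)."""
    query = """
        SELECT strategy, COUNT(*) AS judgements, ROUND(AVG(score)::numeric, 1) AS avg_score
        FROM scores
        GROUP BY strategy
        ORDER BY avg_score DESC
    """
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(query)
        return cur.fetchall()
```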
Sample from the `scores` Table
ID | Hypothesis A (Strategy) | Hypothesis B (Strategy) | Winner | Score A | Score B | Evaluator |
---|---|---|---|---|---|---|
001 | cot | plan_first | B | 72 | 85 | llm |
002 | debate | verify_then_answer | B | 78 | 90 | llm |
003 | cot | counterfactual | B | 74 | 88 | mrq |
Strategy Evaluation
Here's what we found from analyzing our initial run (over ~266 hypotheses):
Strategy | Judgements | Avg Score |
---|---|---|
verify_then_answer | 44 | 89.7 |
counterfactual | 36 | 89.4 |
plan_first | 62 | 83.9 |
debate | 52 | 82.4 |
cot | 72 | 78.2 |
This confirms that more structured strategies (especially `verify_then_answer`) perform consistently better than default CoT prompting.
Modular Design for Reusability
This agent architecture was designed for reuse across:
- Prompt tuning experiments
- Strategy analysis tools
- MR.Q-based training and preference modeling
Each component (e.g., prompt loading, rubric classification, score logging) is reusable in other agents such as `ChainOfThoughtGeneratorAgent` or `DebateAgent`.
Generating Goals at Scale from External Datasets
To support scaling our experiments, we added support for loading external datasets and converting them into Co AI’s structured goal format. Here’s an example of how we imported the StrategyQA dataset:
```python
from datasets import load_dataset
import json
from datetime import datetime

# Load StrategyQA dataset
dataset = load_dataset("ChilleD/StrategyQA", split="train")

# Convert to Co AI goal format
def convert_to_goal_format(example, idx):
    return {
        "id": f"strategyqa_{idx}",
        "goal_text": example["question"],
        "goal_type": "strategyqa",
        "focus_area": "commonsense",
        "source": "strategyqa",
        "answer": example["answer"],
        "facts": example.get("facts", []),
        "created_at": datetime.utcnow().isoformat()
    }

# Output file path
output_path = "strategyqa_goals.jsonl"

# Convert and write
with open(output_path, "w", encoding="utf-8") as f:
    for idx, example in enumerate(dataset.select(range(100))):  # Limit to 100 goals
        goal = convert_to_goal_format(example, idx)
        f.write(json.dumps(goal) + "\n")

print(f"Saved {output_path}")
```
This allows us to quickly bootstrap hundreds or thousands of goals from existing benchmarks. We can now:

- Run reasoning agents over curated datasets
- Compare strategy performance across domains
- Train MR.Q on labeled results
And since the system supports file-based goal input, no manual entry is needed; just point the agent at the generated `.jsonl` file.
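Reading those goals back is just a matter of iterating over the JSONL file. This is a generic sketch rather than the co_ai goal store loader:

```python
import json

def load_goals(path="strategyqa_goals.jsonl"):
    """Yield goal dicts from a JSONL file produced by the conversion script above."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# Example: preview the first few imported goals
for goal in list(load_goals())[:3]:
    print(goal["id"], "-", goal["goal_text"])
```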
Why We Generated External Data
To demonstrate that our approach is more than just a clever architecture, we needed to test it at scale. The entire point of the General Reasoner system and the broader co_ai research framework is to offer a better way to generate, evaluate, and understand reasoning strategies in local LLMs.
That meant two things:

1. We needed real data, so we imported 100+ goals from external datasets like StrategyQA to simulate a broad reasoning benchmark.
2. We needed structure, which is why we pulled together four distinct lines of research:
   - MR.Q: a deterministic learned judge trained on past evaluations
   - Chain-of-Thought generator: for hypothesis diversity
   - GeneralReasonerAgent config: to manage multi-strategy generation
   - Rubric classifier: to tag reasoning dimensions explicitly
These components don't just coexist; they coalesce into a system that reflects a clear hypothesis:
Certain reasoning strategies are consistently better for certain types of goals.
That's why we:

- Tagged each goal with a `goal_type` and `focus_area`
- Labeled each hypothesis with its generation strategy
- Evaluated them via pairwise judging and rubric classification
We also made an important design distinction:
- The goal defines the high-level task.
- The reasoning strategy defines the model’s plan to approach it.
- The rubric dimensions define the actual style of reasoning used (e.g., deductive, exhaustive, data-driven).
This structured data gives us a foundation not just to test which strategy "wins," but why it wins and under what conditions.
That's the real goal here: building a system that's interpretable, modular, and empirically grounded, not just functional.
What Percentage of the General Reasoner Paper Have We Covered?
Component | Status |
---|---|
Multi-strategy generation | ✅ Complete |
Pairwise verification | ✅ Complete |
Rubric classification | ✅ Complete |
Score storage and logging | ✅ Complete |
MR.Q judging support | ✅ Integrated |
Reinforcement learning loop | ❌ Not used (replaced by MR.Q) |
Dynamic strategy refinement loop | ❌ Not used (replaced by MR.Q) |
Next Steps
- Train MR.Q on real `scores` data and use it as the default judge
- Add a caching mechanism to avoid duplicate hypothesis evaluations
- Extend rubric classification to cover broader reasoning dimensions
- Build a Streamlit or Dash dashboard for live experiment analysis
Summary
This post outlined how we built and integrated a General Reasoner agent using local models and prompt strategies. By combining CoT, debate, planning, and critical evaluation into a single agent pipeline, we get much closer to adaptive, high-quality reasoning at scale, all while running locally.
If you're building a research agent system, you can integrate these components into your own project using our open-source co_ai architecture.