General Reasoner: The Smarter Local Agent

Summary
The General Reasoner paper shows how we can train LLMs to reason across domains using diverse data and a generative verifier. In this post, I walk through our open-source implementation showing how we built a modular reasoning agent capable of generating multiple hypotheses, evaluating them with an LLM-based judge, and selecting the best answer.
What We Built
We built a `GeneralReasonerAgent` that:

- Dynamically generates multiple hypotheses using different reasoning strategies (e.g., `cot`, `debate`, `verify_then_answer`, etc.)
- Evaluates each pair of hypotheses using either a local LLM judge or our custom MR.Q evaluator
- Classifies the winning hypothesis using rubric dimensions
- Logs structured results to a PostgreSQL-backed system
All of this was integrated with our existing co_ai framework, which includes:
- Modular agent definitions
- Hydra-based configuration
- Prompt templating
- JSONL and SQL-based logging and storage
```mermaid
graph TD
    A[External Goals Dataset StrategyQA] --> B[Goal Store]
    B --> C[GeneralReasonerAgent]
    C --> D[Hypothesis Generator Multi-Strategy Prompts]
    D --> E1[Strategy: CoT]
    D --> E2[Strategy: Plan-First]
    D --> E3[Strategy: Debate]
    D --> E4[Strategy: Verify-Then-Answer]
    D --> E5[Strategy: Counterfactual]
    E1 & E2 & E3 & E4 & E5 --> F[Hypotheses List]
    F --> G[Pairwise Judging]
    G -->|via LLM| H1[LLM Judge]
    G -->|via Scores| H2[MR.Q Evaluator]
    H1 & H2 --> I[Scored Hypothesis Pairs]
    I --> J[Select Best Hypothesis]
    J --> K[Rubric Classification]
    K --> L[Pattern Labels Deductive, Top-Down]
    L --> M[Pattern Stats Table]
    J --> N[Score Table]
    M & N --> O[Strategy Analysis Dashboard]
    O --> P[Research Insight: Which strategies work best for which goals and why?]
    style A fill:#fdf6e3,stroke:#586e75
    style P fill:#b58900,color:#fff,font-weight:bold
```
The Reasoning Loop
The `GeneralReasonerAgent` works as follows: it generates a set of hypotheses (one per configured reasoning strategy), compares them pairwise with a judge, classifies the winners against rubric dimensions, and logs the results. Each step is described in the sections below.
Evaluating Hypotheses with Pairwise Judging
After generating a list of hypotheses based on the configured reasoning strategies, the agent enters a loop where it compares each pair of hypotheses using a custom prompt template and a configured judge (e.g., MR.Q or a local LLM).
In the configuration we can choose between two modes: `generate_and_judge`, where the agent generates hypotheses itself using the configured model and then judges them, and `judge_only`, where it processes hypotheses produced by other agents as part of a larger pipeline.
```yaml
general_reasoner:
  name: general_reasoner
  ...
  thinking_mode: generate_and_judge   # or judge_only (for processing hypotheses from other agents)
  judge: llm                          # mrq or llm
  judge_prompt_file: judge_pairwise_comparison.txt  # the prompt used for judging
  judge_model:                        # the model used for judging
    name: ollama/mistral:7b-instruct
    api_base: http://localhost:11434
    api_key: null
```
```python
from itertools import combinations

for hyp_a, hyp_b in combinations(hypotheses, 2):
    prompt_text = prompt_loader.from_file(judging_prompt_template, self.cfg, {
        "goal": goal,
        "hypothesis_a": hyp_a.text,
        "hypothesis_b": hyp_b.text
    })
    preferred, score = self.judge.judge(prompt_text, goal, hyp_a.text, hyp_b.text)
```
Here's what this is doing:

- `combinations(hypotheses, 2)`: iterates over all possible pairs of hypotheses so each one can be compared head-to-head.
- `prompt_loader.from_file(...)`: loads a Jinja-style template and fills in the `goal`, `hypothesis_a`, and `hypothesis_b`, creating a natural language evaluation task.
- `self.judge.judge(...)`: sends the constructed prompt to the evaluator (LLM or MR.Q) to determine which hypothesis better addresses the goal and assigns a score.
This structure allows the system to perform pairwise comparisons, a more robust evaluation method than scoring each hypothesis independently. It mimics how humans might compare multiple ideas to pick the best one.
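For context, the "Select Best Hypothesis" step from the diagram above can be as simple as tallying pairwise wins. Below is a minimal sketch under that assumption; the standalone function shape and the tie-breaking rule (accumulated winner score) are illustrative, not the exact co_ai implementation:

```python
from collections import defaultdict
from itertools import combinations

def select_best_hypothesis(hypotheses, judge, prompt_loader, template, cfg, goal):
    """Tally pairwise wins per hypothesis and return the overall winner (sketch)."""
    wins = defaultdict(int)
    totals = defaultdict(float)
    for hyp_a, hyp_b in combinations(hypotheses, 2):
        prompt_text = prompt_loader.from_file(template, cfg, {
            "goal": goal,
            "hypothesis_a": hyp_a.text,
            "hypothesis_b": hyp_b.text,
        })
        preferred, score = judge.judge(prompt_text, goal, hyp_a.text, hyp_b.text)
        winner = hyp_a if preferred == "A" else hyp_b
        wins[id(winner)] += 1
        totals[id(winner)] += score
    # Most pairwise wins; ties broken by accumulated winner score.
    return max(hypotheses, key=lambda h: (wins[id(h)], totals[id(h)]))
```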
Judging Template and MR.Q Integration
The judging step is powered by a prompt template file that defines how the goal and hypotheses are presented to the evaluator. This is usually a simple Jinja-style text file such as `judge_pairwise_comparison.txt`. Here's an example of what that file might look like:
```
You are an impartial expert evaluator. Given a goal and two hypotheses, your task is to judge which hypothesis better addresses the goal.

### Goal:
{{ goal }}

### Hypothesis A:
{{ hypothesis_a }}

### Hypothesis B:
{{ hypothesis_b }}

### Evaluation Criteria:
- Relevance to the goal
- Reasoning quality and depth
- Clarity and coherence
- Use of evidence or logical structure

### Instructions:
Do **not** include any extra explanation or internal thoughts.
**Respond only using the following exact format**:

better hypothesis:<A or B>
reason:<Brief explanation>
score_a:<0-100>
score_b:<0-100>
```
This prompt gets rendered by the agent with real content from each hypothesis and is then passed to the judge.
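Because the template pins the judge to an exact output format, parsing the reply is straightforward. The following is a small sketch of what that parsing step could look like; the function name and regex details are assumptions, not the co_ai judge code:

```python
import re

def parse_judge_response(response: str):
    """Extract (preferred, score_a, score_b) from the fixed-format judge reply (sketch)."""
    preferred = re.search(r"better hypothesis:\s*([AB])", response, re.IGNORECASE)
    score_a = re.search(r"score_a:\s*(\d{1,3})", response, re.IGNORECASE)
    score_b = re.search(r"score_b:\s*(\d{1,3})", response, re.IGNORECASE)
    if not (preferred and score_a and score_b):
        return None  # in practice: retry, re-prompt, or fall back to a default
    return preferred.group(1).upper(), int(score_a.group(1)), int(score_b.group(1))
```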
Swapping in MR.Q as the Judge
Instead of using a local LLM to evaluate responses, we can switch to our custom MR.Q judge, a lightweight evaluator trained on prior scoring data.
This swap is handled via a config flag:
```yaml
agents:
  general_reasoner:
    judge: "mrq"   # use "llm" for LLM-based, or "mrq" for our custom model
```
Under the hood, `self.judge.judge(...)` automatically delegates to the correct backend. This gives us:
- Faster and more consistent evaluations
- No dependency on model inference
- Easy debugging and reproducibility
This modularity allows you to A/B test evaluation strategies by toggling a single flag in the config, which is ideal for benchmarking or training custom reward models like MR.Q.
Using MR.Q as the Judge
You can easily swap out the LLM judge with MR.Q by setting `judge: mrq` in your config. MR.Q evaluates hypotheses by comparing features learned from a large number of training examples.
Training MR.Q:

1. First, collect ~1000 scored LLM judgments.
2. Use SQL queries to extract pairs of hypotheses along with their scores, strategies, and reasoning metadata.
3. Train MR.Q to rank hypotheses based on this historical performance data.
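To make step 2 concrete, here is a rough sketch of extracting preference pairs from past judgments stored in PostgreSQL. The column names in the query are illustrative assumptions about the `scores` table, not the exact co_ai schema:

```python
import json
import psycopg2  # assumption: standard psycopg2 access to the project's PostgreSQL database

def export_mrq_training_pairs(dsn, out_path="mrq_training_pairs.jsonl", limit=1000):
    """Dump (goal, preferred, rejected) pairs from prior LLM judgments (sketch)."""
    query = """
        SELECT goal_text, hypothesis_a, hypothesis_b, winner, score_a, score_b
        FROM scores
        WHERE evaluator = 'llm'
        LIMIT %s
    """
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur, open(out_path, "w") as f:
        cur.execute(query, (limit,))
        for goal, hyp_a, hyp_b, winner, score_a, score_b in cur.fetchall():
            preferred, rejected = (hyp_a, hyp_b) if winner == "A" else (hyp_b, hyp_a)
            f.write(json.dumps({
                "goal": goal,
                "preferred": preferred,
                "rejected": rejected,
                "margin": abs(score_a - score_b),
            }) + "\n")
```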
MR.Q is a great fit when:

- You want fast, cheap evaluations
- You're analyzing new strategies
- LLM evaluations are too expensive or slow
Avoid using MR.Q to evaluate the same data it was trained on; this leads to circular logic and overfitting.
Hypothesis Generation with Reasoning Strategies
The core of the `GeneralReasonerAgent` lies in its ability to generate diverse hypotheses for the same goal, each using a different reasoning strategy. This is what allows us to explore a wide space of approaches, from standard chain-of-thought to structured debate.
Strategy-Driven Prompting
We configure the agent with a `generation_strategy_list`, for example:
```yaml
generation_strategy_list:
  - cot
  - plan_first
  - debate
  - verify_then_answer
  - counterfactual
```
For each strategy, the agent:

- Loads a matching prompt template file, e.g. `strategy_plan_first.txt`.
- Renders it using the goal and other inputs via our Jinja-based `PromptLoader`.
- Calls a local model (e.g. Qwen or Mistral) with the rendered prompt.
- Wraps the model output as a `Hypothesis` and tags it with the strategy used.
This loop ensures that each reasoning strategy gets a fair shot at addressing the goal with no special preference or ordering.
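As a rough sketch of that loop (not the actual agent code), generation could look like this; the `strategy_{name}.txt` naming follows the examples in this post, while the `Hypothesis` constructor arguments and the `self.call_llm` / `self.prompt_loader` attributes are assumptions about the framework:

```python
def generate_hypotheses(self, goal, context):
    """Generate one hypothesis per configured reasoning strategy (illustrative sketch)."""
    hypotheses = []
    for strategy in self.cfg.get("generation_strategy_list", ["cot"]):
        template_file = f"strategy_{strategy}.txt"  # e.g. strategy_plan_first.txt
        prompt_text = self.prompt_loader.from_file(template_file, self.cfg, {
            "goal": goal,
            **context,
        })
        response = self.call_llm(prompt_text, context)  # local model, e.g. Qwen or Mistral
        hypotheses.append(Hypothesis(text=response, strategy=strategy))  # tag with strategy used
    return hypotheses
```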
Example Template: `strategy_verify_then_answer.txt`
```
You are a careful reasoner. First, verify whether you have enough information to answer this question:

{{ goal }}

Then, and only then, provide a justified, evidence-based answer.
```
This structure nudges the model to be more cautious and methodical, which, as our results show, often leads to higher-quality hypotheses.
Why It Matters
Having a modular and pluggable strategy setup means:

- You can easily add new strategies (just drop in a new `.txt` file)
- You can benchmark strategy performance across many goals
- You can train MR.Q on reasoning strategy metadata for more accurate evaluations later
And since the strategy is stored in the database along with each hypothesis, everything is traceable and reproducible.
Rubric Classification
As an extension from our prior work on Chain-of-Thought pattern analysis, we added rubric-based classification for every judged hypothesis. After a pair is evaluated, the winning hypothesis is classified across dimensions like:
- Strategy Orientation (Top-Down vs Bottom-Up)
- Inference Style (Deductive vs Analogical)
- Exploration Breadth (Greedy vs Exhaustive)
- Reasoning Depth (Shallow vs Deep)
- Evidence Use (Belief-driven vs Data-driven)
- Certainty Expression (Confident vs Tentative)
This classification is handled via a small adapter that calls an LLM with a rubric template and parses the labeled result. Each classification is logged to the database and linked to the associated hypothesis and evaluation. This lets us analyze which strategies produce different reasoning styles over time.
```python
class RubricClassifierMixin:
    def _load_enabled_rubrics(self, cfg):
        enabled_rubrics = []
        rubrics_cfg = cfg.get("rubrics", [])
        for entry in rubrics_cfg:
            if entry.get("enabled", False):
                enabled_rubrics.append({
                    "dimension": entry["dimension"],
                    "rubric": entry["rubric"],
                    "options": entry["options"],
                })
        return enabled_rubrics

    def classify_with_rubrics(self, hypothesis, context, prompt_loader, cfg, logger):
        results = {}
        pattern_file = cfg.get("pattern_prompt_file", "cot_pattern.txt")
        rubrics = self._load_enabled_rubrics(cfg)
        for rubric in rubrics:
            rubric["goal"] = context["goal"]["goal_text"]
            rubric["hypotheses"] = hypothesis.text
            merged = {**context, **rubric}
            prompt_text = prompt_loader.from_file(pattern_file, cfg, merged)
            custom_llm = cfg.get("analysis_model", None)  # optionally a different model from generation
            result = self.call_llm(prompt_text, merged, custom_llm)
            results[rubric["dimension"]] = result
            logger.log(
                "RubricClassified",
                {
                    "dimension": rubric["dimension"],
                    "rubric": rubric["rubric"],
                    "classification": result,
                },
            )
        return results

    def classify_and_store_patterns(
        self,
        hypothesis,
        context,
        prompt_loader,
        cfg,
        memory,
        logger,
        agent_name,
        score=None,  # Optional numeric score or win count
    ):
        """Classifies rubrics and stores pattern stats for the given hypothesis."""
        pattern = self.classify_with_rubrics(
            hypothesis=hypothesis,
            context=context,
            prompt_loader=prompt_loader,
            cfg=cfg,
            logger=logger,
        )
        goal = self.extract_goal_text(context.get(GOAL))
        summarized = self._summarize_pattern(pattern)
        goal_id, hypothesis_id, pattern_stats = generate_pattern_stats(
            goal, hypothesis.text, summarized, memory, cfg, agent_name, score
        )
        memory.hypotheses.store_pattern_stats(goal_id, hypothesis_id, pattern_stats)
        logger.log(
            "RubricPatternsStored",
            {"goal_id": goal_id, "hypothesis_id": hypothesis_id, "summary": summarized},
        )
        context["pattern_stats"] = summarized
        return summarized
```
We took a simple approach here: the rubrics are defined directly in the agent configuration. Later we will dynamically generate them using the model.
```yaml
rubrics:
  - dimension: "Strategy Orientation"
    rubric: "Does the reasoning proceed in a hypothesis-first (top-down) or data-first (bottom-up) manner?"
    options: ["Top-Down", "Bottom-Up"]
    enabled: true
  - dimension: "Inference Style"
    rubric: "Is the reasoning based on deductive logic from known facts, or analogical reasoning across domains?"
    options: ["Deductive", "Analogical"]
    enabled: true
  - ...
```
Logging

- Scores stored in the `scores` table
- Hypotheses and rubric stats logged with metadata including strategy used, evaluator, and timestamp
Database Schema Integration
We extended our PostgreSQL schema with the following:

- `scores`: stores per-hypothesis evaluation results (including evaluator, score, score_type, strategy, and reasoning_strategy)
- `cot_pattern_stats`: tracks rubric label frequencies and dimensions
This allows us to analyze strategy performance, build dashboards, and train MR.Q on real-world hypothesis outcomes.
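For example, the strategy comparison shown below boils down to one aggregate over `scores`. This is a minimal sketch, assuming the `strategy` and `score` columns listed above and standard psycopg2 access:

```python
import psycopg2

def strategy_summary(dsn):
    """Return (strategy, judgement count, average score) rows from the scores table (sketch)."""
    query = """
        SELECT strategy, COUNT(*) AS judgements, ROUND(AVG(score)::numeric, 1) AS avg_score
        FROM scores
        GROUP BY strategy
        ORDER BY avg_score DESC
    """
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(query)
        return cur.fetchall()
```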
Sample from the `scores` Table
ID | Hypothesis A (Strategy) | Hypothesis B (Strategy) | Winner | Score A | Score B | Evaluator |
---|---|---|---|---|---|---|
001 | cot | plan_first | B | 72 | 85 | llm |
002 | debate | verify_then_answer | B | 78 | 90 | llm |
003 | cot | counterfactual | B | 74 | 88 | mrq |
Strategy Evaluation
Here's what we found from analyzing our initial run (over ~266 hypotheses):
Strategy | Judgements | Avg Score |
---|---|---|
verify_then_answer | 44 | 89.7 |
counterfactual | 36 | 89.4 |
plan_first | 62 | 83.9 |
debate | 52 | 82.4 |
cot | 72 | 78.2 |
This confirms that more structured strategies (especially `verify_then_answer`) perform consistently better than default CoT prompting.
Modular Design for Reusability
This agent architecture was designed for reuse across:
- Prompt tuning experiments
- Strategy analysis tools
- MR.Q-based training and preference modeling
Each component (e.g., prompt loading, rubric classification, score logging) is reusable in other agents such as `ChainOfThoughtGeneratorAgent` or `DebateAgent`.
Generating Goals at Scale from External Datasets
To support scaling our experiments, we added support for loading external datasets and converting them into Co AI’s structured goal format. Here’s an example of how we imported the StrategyQA dataset:
```python
from datasets import load_dataset
import json
from datetime import datetime

# Load StrategyQA dataset
dataset = load_dataset("ChilleD/StrategyQA", split="train")

# Convert to Co AI goal format
def convert_to_goal_format(example, idx):
    return {
        "id": f"strategyqa_{idx}",
        "goal_text": example["question"],
        "goal_type": "strategyqa",
        "focus_area": "commonsense",
        "source": "strategyqa",
        "answer": example["answer"],
        "facts": example.get("facts", []),
        "created_at": datetime.utcnow().isoformat()
    }

# Output file path
output_path = "strategyqa_goals.jsonl"

# Convert and write
with open(output_path, "w", encoding="utf-8") as f:
    for idx, example in enumerate(dataset.select(range(100))):  # Limit to 100 goals
        goal = convert_to_goal_format(example, idx)
        f.write(json.dumps(goal) + "\n")

print(f"Saved {output_path}")
```
This allows us to quickly bootstrap hundreds or thousands of goals from existing benchmarks. We can now:

- Run reasoning agents over curated datasets
- Compare strategy performance across domains
- Train MR.Q on labeled results
And since the system supports file-based goal input, no manual entry is needed; just point the agent at the generated `.jsonl` file.
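Reading those goals back is just a matter of iterating over the JSONL file. This is a generic sketch rather than the co_ai goal store loader:

```python
import json

def load_goals(path="strategyqa_goals.jsonl"):
    """Yield goal dicts from a JSONL file produced by the conversion script above."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# Example: preview the first few imported goals
for goal in list(load_goals())[:3]:
    print(goal["id"], "-", goal["goal_text"])
```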
Why We Generated External Data
To demonstrate that our approach is more than just a clever architecture, we needed to test it at scale. The entire point of the General Reasoner system and the broader co_ai research framework is to offer a better way to generate, evaluate, and understand reasoning strategies in local LLMs.
That meant two things:

1. We needed real data, so we imported 100+ goals from external datasets like StrategyQA to simulate a broad reasoning benchmark.
2. We needed structure, which is why we pulled together four distinct lines of research:
   - MR.Q: a deterministic learned judge trained on past evaluations
   - Chain-of-Thought generator: for hypothesis diversity
   - GeneralReasonerAgent config: to manage multi-strategy generation
   - Rubric classifier: to tag reasoning dimensions explicitly
These components don't just coexist; they coalesce into a system that reflects a clear hypothesis:
Certain reasoning strategies are consistently better for certain types of goals.
That's why we:

- Tagged each goal with a `goal_type` and `focus_area`
- Labeled each hypothesis with its generation strategy
- Evaluated them via pairwise judging and rubric classification
We also made an important design distinction:
- The goal defines the high-level task.
- The reasoning strategy defines the model’s plan to approach it.
- The rubric dimensions define the actual style of reasoning used (e.g., deductive, exhaustive, data-driven).
This structured data gives us a foundation not just to test which strategy "wins," but why it wins and under what conditions.
That's the real goal here: building a system that's interpretable, modular, and empirically grounded, not just functional.
What Percentage of the General Reasoner Paper Have We Covered?
Component | Status |
---|---|
Multi-strategy generation | ✅ Complete |
Pairwise verification | ✅ Complete |
Rubric classification | ✅ Complete |
Score storage and logging | ✅ Complete |
MR.Q judging support | ✅ Integrated |
Reinforcement learning loop | ❌ Not used (replaced by MR.Q) |
Dynamic strategy refinement loop | ❌ Not used (replaced by MR.Q) |
Next Steps
- Train MR.Q on real `scores` data and use it as the default judge
- Add a caching mechanism to avoid duplicate hypothesis evaluations
- Extend rubric classification to cover broader reasoning dimensions
- Build a Streamlit or Dash dashboard for live experiment analysis
Summary
This post outlined how we built and integrated a General Reasoner agent using local models and prompt strategies. By combining CoT, debate, planning, and critical evaluation into a single agent pipeline, we get much closer to adaptive, high-quality reasoning at scale, all while running locally.
If you're building a research agent system, you can integrate these components into your own project using our open-source co_ai architecture.