General Reasoner: The smarter Local Agent

πŸ”§ Summary

The General Reasoner paper shows how we can train LLMs to reason across domains using diverse data and a generative verifier. In this post, I walk through our open-source implementation showing how we built a modular reasoning agent capable of generating multiple hypotheses, evaluating them with an LLM-based judge, and selecting the best answer.


🧠 What We Built

We built a GeneralReasonerAgent that:

  • Dynamically generates multiple hypotheses using different reasoning strategies (e.g., cot, debate, verify_then_answer, etc.)
  • Evaluates each pair of hypotheses using either a local LLM judge or our custom MR.Q evaluator
  • Classifies the winning hypothesis using rubric dimensions
  • Logs structured results to a PostgreSQL-backed system

All of this was integrated with our existing co_ai framework, which includes:

  • Modular agent definitions
  • Hydra-based configuration
  • Prompt templating
  • JSONL and SQL-based logging and storage

The overall pipeline is summarized in the following diagram:

    graph TD
    A[External Goals Dataset StrategyQA] --> B[Goal Store]
    B --> C[GeneralReasonerAgent]
    C --> D[Hypothesis Generator Multi-Strategy Prompts]
    D --> E1[Strategy: CoT]
    D --> E2[Strategy: Plan-First]
    D --> E3[Strategy: Debate]
    D --> E4[Strategy: Verify-Then-Answer]
    D --> E5[Strategy: Counterfactual]

    E1 & E2 & E3 & E4 & E5 --> F[Hypotheses List]

    F --> G[Pairwise Judging]
    G -->|via LLM| H1[LLM Judge]
    G -->|via Scores| H2[MR.Q Evaluator]

    H1 & H2 --> I[Scored Hypothesis Pairs]

    I --> J[Select Best Hypothesis]
    J --> K[Rubric Classification]
    K --> L[Pattern Labels Deductive, Top-Down]

    L --> M[Pattern Stats Table]
    J --> N[Score Table]

    M & N --> O[Strategy Analysis Dashboard]

    O --> P[Research Insight: Which strategies work best for which goals and why?]

    style A fill:#fdf6e3,stroke:#586e75
    style P fill:#b58900,color:#fff,font-weight:bold
  

πŸ” The Reasoning Loop

The GeneralReasonerAgent runs a loop: it generates hypotheses with multiple strategies, judges them pairwise, selects the best one, and classifies it with rubrics. The sections below walk through each step.

🧠 Evaluating Hypotheses with Pairwise Judging

After generating a list of hypotheses based on the configured reasoning strategies, the agent enters a loop where it compares each pair of hypotheses using a custom prompt template and a configured judge (e.g., MR.Q or a local LLM).

In the configuration we can choose between two modes:

  • generate_and_judge: the agent generates its own hypotheses using the configured strategies and then judges them.
  • judge_only: the agent only judges hypotheses produced by other agents as part of a larger pipeline.

general_reasoner:
  name: general_reasoner
  
  ... 
  thinking_mode: generate_and_judge  # generate_and_judge, or judge_only (for processing hypotheses from other agents)
  judge: llm  # (mrq or llm) 
  judge_prompt_file: judge_pairwise_comparison.txt # the prompt used for judging
  judge_model:  # the model used for judging
    name: ollama/mistral:7b-instruct
    api_base: http://localhost:11434
    api_key: null
The judging loop then compares every pair of hypotheses:

from itertools import combinations

for hyp_a, hyp_b in combinations(hypotheses, 2):
    prompt_text = prompt_loader.from_file(judging_prompt_template, self.cfg, {
        "goal": goal,
        "hypothesis_a": hyp_a.text,
        "hypothesis_b": hyp_b.text
    })
    preferred, score = self.judge.judge(prompt_text, goal, hyp_a.text, hyp_b.text)

Here’s what this is doing:

  • combinations(hypotheses, 2): Iterates over all possible pairs of hypotheses so each one can be compared head-to-head.
  • prompt_loader.from_file(...): Loads a Jinja-style template and fills in the goal, hypothesis_a, and hypothesis_b, creating a natural language evaluation task.
  • self.judge.judge(...): Sends the constructed prompt to the evaluator (LLM or MR.Q) to determine which hypothesis better addresses the goal and assigns a score.

πŸ‘‰ This structure allows the system to perform pairwise comparisons, a more robust evaluation method than scoring each hypothesis independently. It mimics how humans might compare multiple ideas to pick the best one.

βš–οΈ Judging Template and MR.Q Integration

The judging step is powered by a prompt template file that defines how the goal and hypotheses are presented to the evaluator. This is usually a simple Jinja-style text file such as judge_pairwise_comparison.txt. Here’s an example of what that file might look like:

You are an impartial expert evaluator. Given a goal and two hypotheses, your task is to judge which hypothesis better addresses the goal.

### Goal:
{{ goal }}

### Hypothesis A:
{{ hypothesis_a }}

### Hypothesis B:
{{ hypothesis_b }}

### Evaluation Criteria:
- Relevance to the goal
- Reasoning quality and depth
- Clarity and coherence
- Use of evidence or logical structure

### Instructions:
Do **not** include any extra explanation or internal thoughts.  
**Respond only using the following exact format**:

better hypothesis:<A or B>  
reason:<Brief explanation>  
score_a:<0-100>  
score_b:<0-100>

This prompt gets rendered by the agent with real content from each hypothesis and is then passed to the judge.
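For reference, the sketch below shows one way the judge's reply could be parsed, assuming it follows the strict format above. The function name and regular expressions are illustrative, not the exact co_ai implementation:

import re

def parse_judge_response(response: str):
    """Parse a reply that follows the judge_pairwise_comparison.txt format (illustrative sketch)."""
    winner = re.search(r"better hypothesis:\s*([AB])", response, re.IGNORECASE)
    score_a = re.search(r"score_a:\s*(\d{1,3})", response, re.IGNORECASE)
    score_b = re.search(r"score_b:\s*(\d{1,3})", response, re.IGNORECASE)
    if not (winner and score_a and score_b):
        return None  # format not followed: fall back or re-prompt the judge
    return winner.group(1).upper(), int(score_a.group(1)), int(score_b.group(1))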


πŸ§ͺ Swapping in MR.Q as the Judge

Instead of using a local LLM to evaluate responses, we can switch to our custom MR.Q judge, a lightweight learned evaluator trained on prior scoring data.

This swap is handled via a config flag:

agents:
  general_reasoner:
    judge: "mrq"  # ← Use "llm" for LLM-based, or "mrq" for our custom model

Under the hood, self.judge.judge(...) automatically delegates to the correct backend. This gives us:

  • Faster and more consistent evaluations
  • No dependency on model inference
  • Easy debugging and reproducibility

πŸ‘‰ This modularity allows you to A/B test evaluation strategies by toggling a single flag in the config, which is ideal for benchmarking or training custom reward models like MR.Q.
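A minimal sketch of what that delegation could look like, assuming a shared judge interface and a small factory; the class names and constructor arguments are assumptions, not the exact co_ai API:

from typing import Protocol, Tuple

class Judge(Protocol):
    # Interface both backends expose (names are assumptions).
    def judge(self, prompt_text: str, goal: str, hyp_a: str, hyp_b: str) -> Tuple[str, float]: ...

def build_judge(cfg: dict, llm_judge_cls, mrq_judge_cls) -> Judge:
    """Pick the evaluator backend based on the `judge` config flag."""
    if cfg.get("judge", "llm") == "mrq":
        return mrq_judge_cls(cfg)   # learned, deterministic scorer
    return llm_judge_cls(cfg)       # prompt-based LLM evaluator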

πŸ§ͺ Using MR.Q as the Judge

You can easily swap out the LLM judge with MR.Q by setting judge: mrq in your config. MR.Q evaluates hypotheses by comparing features learned from a large number of training examples.

Training MR.Q:

  1. Collect roughly 1,000 scored LLM judgments.
  2. Use SQL queries to extract pairs of hypotheses along with their scores, strategies, and reasoning metadata.
  3. Train MR.Q to rank hypotheses based on this historical performance data.
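As a rough illustration of step 2, the following sketch pulls preference pairs out of PostgreSQL with psycopg2. The table and column names are assumptions based on the scores table described later in this post, and the final lines hold out a slice so MR.Q is never evaluated on its own training data:

import psycopg2

# Column names below are assumptions, not the exact co_ai schema.
PAIR_QUERY = """
    SELECT goal_text, hypothesis_a, hypothesis_b, score_a, score_b
    FROM scores
    WHERE evaluator = 'llm'
    ORDER BY created_at
"""

def load_preference_pairs(dsn: str):
    """Return (goal, preferred, rejected) triples for MR.Q training."""
    pairs = []
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(PAIR_QUERY)
        for goal, hyp_a, hyp_b, score_a, score_b in cur.fetchall():
            if score_a == score_b:
                continue  # ties carry no preference signal
            preferred, rejected = (hyp_a, hyp_b) if score_a > score_b else (hyp_b, hyp_a)
            pairs.append((goal, preferred, rejected))
    return pairs

# pairs = load_preference_pairs("dbname=co_ai user=postgres")
# train, heldout = pairs[:-100], pairs[-100:]  # keep a held-out slice for evaluation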

MR.Q is a great fit when:

  • You want fast, cheap evaluations
  • You’re analyzing new strategies
  • LLM evaluations are too expensive or slow

⚠️ Avoid using MR.Q to evaluate the same data it was trained on; this leads to circular logic and overfitting.


πŸ§ͺ Hypothesis Generation with Reasoning Strategies

The core of the GeneralReasonerAgent lies in its ability to generate diverse hypotheses for the same goal, each using a different reasoning strategy. This is what allows us to explore a wide space of approaches, from standard chain-of-thought to structured debate.

βš™οΈ Strategy-Driven Prompting

We configure the agent with a generation_strategy_list, for example:

generation_strategy_list:
  - cot
  - plan_first
  - debate
  - verify_then_answer
  - counterfactual

For each strategy, the agent:

  1. Loads a matching prompt template file, e.g. strategy_plan_first.txt.
  2. Renders it using the goal and other inputs via our Jinja-based PromptLoader.
  3. Calls a local model (e.g. Qwen or Mistral) with the rendered prompt.
  4. Wraps the model output as a Hypothesis and tags it with the strategy used.

This loop ensures that each reasoning strategy gets a fair shot at addressing the goal, with no special preference or ordering.
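A simplified sketch of that generation loop is shown below. The helper callables (render_template, call_llm) and the Hypothesis dataclass are stand-ins for the real PromptLoader and model client, so treat the exact names as assumptions:

from dataclasses import dataclass

@dataclass
class Hypothesis:
    text: str
    strategy: str

def generate_hypotheses(goal: str, strategies, render_template, call_llm):
    """Generate one hypothesis per configured strategy (illustrative sketch)."""
    hypotheses = []
    for strategy in strategies:
        template_file = f"strategy_{strategy}.txt"          # e.g. strategy_plan_first.txt
        prompt_text = render_template(template_file, {"goal": goal})
        output = call_llm(prompt_text)                       # local model, e.g. Qwen or Mistral
        hypotheses.append(Hypothesis(text=output, strategy=strategy))
    return hypotheses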

πŸ“ Example Template: strategy_verify_then_answer.txt

You are a careful reasoner. First, verify whether you have enough information to answer this question:

{{ goal }}

Then, and only then, provide a justified, evidence-based answer.

This structure nudges the model to be more cautious and methodical, which, as our results show, often leads to higher-quality hypotheses.

πŸ” Why It Matters

Having a modular and pluggable strategy setup means:

  • You can easily add new strategies (just drop in a new .txt file)
  • You can benchmark strategy performance across many goals
  • You can train MR.Q on reasoning strategy metadata for more accurate evaluations later

And since the strategy is stored in the database along with each hypothesis, everything is traceable and reproducible.


🧩 Rubric Classification

As an extension of our prior work on Chain-of-Thought pattern analysis, we added rubric-based classification for every judged hypothesis. After a pair is evaluated, the winning hypothesis is classified across dimensions like:

  • Strategy Orientation (Top-Down vs Bottom-Up)
  • Inference Style (Deductive vs Analogical)
  • Exploration Breadth (Greedy vs Exhaustive)
  • Reasoning Depth (Shallow vs Deep)
  • Evidence Use (Belief-driven vs Data-driven)
  • Certainty Expression (Confident vs Tentative)

This classification is handled via a small adapter that calls an LLM with a rubric template and parses the labeled result. Each classification is logged to the database and linked to the associated hypothesis and evaluation. This lets us analyze which strategies produce different reasoning styles over time.

class RubricClassifierMixin:
    def _load_enabled_rubrics(self, cfg):
        enabled_rubrics = []
        rubrics_cfg = cfg.get("rubrics", [])
        for entry in rubrics_cfg:
            if entry.get("enabled", False):
                enabled_rubrics.append({
                    "dimension": entry["dimension"],
                    "rubric": entry["rubric"],
                    "options": entry["options"]
                })
        return enabled_rubrics

    def classify_with_rubrics(self, hypothesis, context, prompt_loader, cfg, logger):
        results = {}
        pattern_file = cfg.get("pattern_prompt_file", "cot_pattern.txt")
        rubrics = self._load_enabled_rubrics(cfg)

        for rubric in rubrics:
            rubric["goal"] = context["goal"]["goal_text"]
            rubric["hypotheses"] = hypothesis.text
            merged = {**context, **rubric}
            prompt_text = prompt_loader.from_file(pattern_file, cfg, merged)
            custom_llm = cfg.get("analysis_model", None)  # optionally a different model from the one used for generation
            result = self.call_llm(prompt_text, merged, custom_llm)
            results[rubric["dimension"]] = result
            logger.log(
                "RubricClassified",
                {
                    "dimension": rubric["dimension"],
                    "rubric": rubric["rubric"],
                    "classification": result,
                },
            )

        return results

    def classify_and_store_patterns(
        self,
        hypothesis,
        context,
        prompt_loader,
        cfg,
        memory,
        logger,
        agent_name,
        score=None,  # Optional numeric score or win count
    ):
        """Classifies rubrics and stores pattern stats for the given hypothesis."""
        pattern = self.classify_with_rubrics(
            hypothesis=hypothesis,
            context=context,
            prompt_loader=prompt_loader,
            cfg=cfg,
            logger=logger,
        )

        goal = self.extract_goal_text(context.get(GOAL))
        summarized = self._summarize_pattern(pattern)

        goal_id, hypothesis_id, pattern_stats = generate_pattern_stats(
            goal, hypothesis.text, summarized, memory, cfg, agent_name, score
        )
        memory.hypotheses.store_pattern_stats(goal_id, hypothesis_id, pattern_stats)
        logger.log(
            "RubricPatternsStored",
            {"goal_id": goal_id, "hypothesis_id": hypothesis_id, "summary": summarized},
        )

        context["pattern_stats"] = summarized
        return summarized

We took a simple approach here: the rubrics are simply listed in the adapter configuration. Later we will generate them dynamically using the model.

  rubrics:
    - dimension: "Strategy Orientation"
      rubric: "Does the reasoning proceed in a hypothesis-first (top-down) or data-first (bottom-up) manner?"
      options: ["Top-Down", "Bottom-Up"]
      enabled: true
    - dimension: "Inference Style"
      rubric: "Is the reasoning based on deductive logic from known facts, or analogical reasoning across domains?"
      options: ["Deductive", "Analogical"]
      enabled: true
Logging:

    • Scores are stored in the scores table
    • Hypotheses and rubric stats are logged with metadata including the strategy used, evaluator, and timestamp

πŸ—‚ Database Schema Integration

We extended our PostgreSQL schema with the following:

  • scores: stores per-hypothesis evaluation results (including evaluator, score, score_type, strategy, and reasoning_strategy)
  • cot_pattern_stats: tracks rubric label frequencies and dimensions

This allows us to analyze strategy performance, build dashboards, and train MR.Q on real-world hypothesis outcomes.
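For a concrete picture, here is a hedged sketch of writing one evaluation result into the scores table; the column names are assumptions and may differ from the actual schema:

INSERT_SCORE = """
    INSERT INTO scores (goal_id, hypothesis_id, evaluator, score, score_type, strategy)
    VALUES (%s, %s, %s, %s, %s, %s)
"""

def log_score(cur, goal_id, hypothesis_id, evaluator, score, strategy, score_type="pairwise"):
    """Persist one evaluation result; cur is an open psycopg2 cursor."""
    cur.execute(INSERT_SCORE, (goal_id, hypothesis_id, evaluator, score, score_type, strategy))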

πŸ“Š Sample from scores Table

| ID  | Hypothesis A (Strategy) | Hypothesis B (Strategy) | Winner | Score A | Score B | Evaluator |
|-----|-------------------------|-------------------------|--------|---------|---------|-----------|
| 001 | cot                     | plan_first              | B      | 72      | 85      | llm       |
| 002 | debate                  | verify_then_answer      | B      | 78      | 90      | llm       |
| 003 | cot                     | counterfactual          | B      | 74      | 88      | mrq       |

πŸ“Š Strategy Evaluation

Here’s what we found from analyzing our initial run (over ~266 hypotheses):

| Strategy           | Judgements | Avg Score |
|--------------------|-----------:|----------:|
| verify_then_answer | 44         | 89.7      |
| counterfactual     | 36         | 89.4      |
| plan_first         | 62         | 83.9      |
| debate             | 52         | 82.4      |
| cot                | 72         | 78.2      |

This confirms that more structured strategies (especially verify_then_answer) perform consistently better than default CoT prompting.
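The per-strategy averages above can be reproduced with a simple aggregation over the scores table. A hedged sketch, again assuming illustrative column names:

STRATEGY_SUMMARY = """
    SELECT strategy, COUNT(*) AS judgements, ROUND(AVG(score)::numeric, 1) AS avg_score
    FROM scores
    GROUP BY strategy
    ORDER BY avg_score DESC
"""

def summarize_strategies(cur):
    """Return (strategy, judgements, avg_score) rows; cur is an open psycopg2 cursor."""
    cur.execute(STRATEGY_SUMMARY)
    return cur.fetchall()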


🧱 Modular Design for Reusability

This agent architecture was designed for reuse across:

  • Prompt tuning experiments
  • Strategy analysis tools
  • MR.Q-based training and preference modeling

Each component (e.g., prompt loading, rubric classification, score logging) is reusable in other agents such as ChainOfThoughtGeneratorAgent or DebateAgent.


πŸ“¦ Generating Goals at Scale from External Datasets

To support scaling our experiments, we added support for loading external datasets and converting them into Co AI’s structured goal format. Here’s an example of how we imported the StrategyQA dataset:

from datasets import load_dataset
import json
from datetime import datetime

# Load StrategyQA dataset
dataset = load_dataset("ChilleD/StrategyQA", split="train")

# Convert to Co AI goal format
def convert_to_goal_format(example, idx):
    return {
        "id": f"strategyqa_{idx}",
        "goal_text": example["question"],
        "goal_type": "strategyqa",
        "focus_area": "commonsense",
        "source": "strategyqa",
        "answer": example["answer"],
        "facts": example.get("facts", []),
        "created_at": datetime.utcnow().isoformat()
    }

# Output file path
output_path = "strategyqa_goals.jsonl"

# Convert and write
with open(output_path, "w", encoding="utf-8") as f:
    for idx, example in enumerate(dataset.select(range(100))):  # Limit to 100 goals
        goal = convert_to_goal_format(example, idx)
        f.write(json.dumps(goal) + "\n")

print(f"Saved {output_path}")

This allows us to quickly bootstrap hundreds or thousands of goals from existing benchmarks. We can now:

  • Run reasoning agents over curated datasets
  • Compare strategy performance across domains
  • Train MR.Q on labeled results

And since the system supports file-based goal input, no manual entry is needed: just point the agent at the generated .jsonl file.


🎯 Why We Generated External Data

To demonstrate that our approach is more than just a clever architecture, we needed to test it at scale. The entire point of the General Reasoner system and the broader co_ai research framework is to offer a better way to generate, evaluate, and understand reasoning strategies in local LLMs.

That meant two things:

  1. We needed real data, so we imported 100+ goals from external datasets like StrategyQA to simulate a broad reasoning benchmark.

  2. We needed structure, which is why we pulled together four distinct lines of research:

    • MR.Q: a deterministic learned judge trained on past evaluations
    • Chain-of-Thought generator: for hypothesis diversity
    • GeneralReasonerAgent config: to manage multi-strategy generation
    • Rubric classifier: to tag reasoning dimensions explicitly

These components don’t just coexist; they coalesce into a system that reflects a clear hypothesis:

Certain reasoning strategies are consistently better for certain types of goals.

That’s why we:

  • Tagged each goal with a goal_type and focus_area
  • Labeled each hypothesis with its generation strategy
  • Evaluated them via pairwise judging and rubric classification

We also made an important design distinction:

  • The goal defines the high-level task.
  • The reasoning strategy defines the model’s plan to approach it.
  • The rubric dimensions define the actual style of reasoning used (e.g., deductive, exhaustive, data-driven).

This structured data gives us a foundation not just to test which strategy β€œwins,” but why it wins and under what conditions.
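One way to picture the resulting record for each evaluated hypothesis is the small dataclass below; the field names are illustrative rather than the exact database columns:

from dataclasses import dataclass, field
from typing import Dict

@dataclass
class EvaluatedHypothesis:
    goal_text: str
    goal_type: str                      # the high-level task, e.g. "strategyqa"
    focus_area: str                     # e.g. "commonsense"
    strategy: str                       # the model's plan, e.g. "verify_then_answer"
    rubric_labels: Dict[str, str] = field(default_factory=dict)  # reasoning style, e.g. {"Inference Style": "Deductive"}
    score: float = 0.0
    evaluator: str = "llm"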

That’s the real goal here: building a system that’s interpretable, modular, and empirically grounded, not just functional.


πŸ“Œ What Percentage of the General Reasoner Paper Have We Covered?

| Component                        | Status                         |
|----------------------------------|--------------------------------|
| Multi-strategy generation        | βœ… Complete                     |
| Pairwise verification            | βœ… Complete                     |
| Rubric classification            | βœ… Complete                     |
| Score storage and logging        | βœ… Complete                     |
| MR.Q judging support             | βœ… Integrated                   |
| Reinforcement learning loop      | ❌ Not used (replaced by MR.Q)  |
| Dynamic strategy refinement loop | ❌ Not used (replaced by MR.Q)  |

πŸš€ Next Steps

  • Train MR.Q on real scores and use it as the default judge
  • Add a caching mechanism to avoid duplicate hypothesis evaluations
  • Extend rubric classification to cover broader reasoning dimensions
  • Build a Streamlit or Dash dashboard for live experiment analysis

πŸ’¬ Summary

This post outlined how we built and integrated a General Reasoner agent using local models and prompt strategies. By combining CoT, debate, planning, and critical evaluation into a single agent pipeline, we get much closer to adaptive, high-quality reasoning at scale, all while running locally.

If you’re building a research agent system, you can integrate these components into your own project using our open-source co_ai architecture.