Self-Improving AI: A System That Learns, Validates, and Retrains Itself


🤖 The Static AI Trap

Today’s AI systems are frozen in time: trained once, deployed forever. Yet the real world never stops evolving. Goals shift overnight. New research upends old truths. Context transforms without warning.

What if your AI could wake up?

In this post, we engineer an intelligence that teaches itself: a system that continuously learns from the web, audits its own judgments, and retrains itself when confidence wavers.

We’ll build two breakthrough capabilities:

  1. Goal → Model Pipeline: Transform ambiguous goals (“Build an AI that self-improves”) into precision reward models using Arxiv research.
  2. Self-Tuning Loop: The AI’s internal auditor that spots drift, validates against GPT-4, and triggers retraining, with no humans needed.

By the end, you’ll deploy systems that don’t just process information… they evolve with it.

This system is built around a new approach we call RIVAL:
Reinforcement learning with Iterative and adVersarial optimization of Language models.

RIVAL is a closed-loop framework where an AI system doesn’t just learn from a static dataset: it actively evaluates, challenges, and retrains itself based on feedback from a trusted oracle (e.g., an LLM). This gives it the ability to evolve its understanding of goals over time, continuously improving without requiring manual labels.
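
To make the idea concrete, here is a minimal sketch of one RIVAL cycle. It is illustrative only: the component names (reward_model, oracle_llm, trainer) and the 0.75 agreement threshold are assumptions, not the exact interfaces used later in this post.

def rival_cycle(goal, documents, reward_model, oracle_llm, trainer):
    # 1. Judge: the local reward model scores every document against the goal
    scores = {doc["id"]: reward_model.score(goal, doc) for doc in documents}

    # 2. Challenge: a trusted oracle (an LLM) audits a sample of those judgments
    audit = oracle_llm.validate(goal, documents, scores)

    # 3. Retrain: if agreement with the oracle drops, rebuild the reward model
    if audit["agreement"] < 0.75:  # assumed threshold
        reward_model = trainer.retrain(goal, audit["preference_pairs"])

    return reward_model, scores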


🧱 Part 1: From Goal to Reward Model

Remember, the pipeline can be anything. For this post I have chosen one that I will use often: a process that tries to find scientific research related to a goal.

We’re building a dynamic research assistant that knows how to find, filter, and learn from the best information on the web, and gets better at doing that over time.

At the core of this system is a custom pipeline focused on AI research discovery, using Arxiv.org as its source. Arxiv is the fastest-moving, highest-signal archive of research in the AI space. If you’re building frontier models or tracking innovation, it’s where the story begins.

We’ll show you how to:

  • Search Arxiv with intent (goal-driven search)
  • Load and profile new papers
  • Score their quality and relevance
  • Integrate their knowledge into a working AI
  • Validate and retrain that AI using its own results
  • Continuously improve the search and filtering loop

This isn’t just “better search.” It’s a self-improving intelligence pipeline. It adapts its filters, refines what it considers “valuable” research, and uses that refinement to train internal reward models that replace the role of the LLM for faster, cheaper, more scalable evaluations.

🪞 Design Rationale

When you ask a general LLM to “find the best papers on self-improving AI,” you’ll get a generic list. Maybe 20% of that list is useful. But if you give the system a goal, say “Build an AI that teaches itself to solve complex tasks”, and let it learn which results actually help, it becomes something more than a query engine. It becomes autonomous, goal-oriented, and self-tuning.

This post is part of a larger vision: a system that constantly scans research, videos, codebases, and discussions from around the world, learns what’s useful to its goals, and trains itself forward without us in the loop.

🎯 Step 1: Define a Goal

I want to build an AI that can teach itself to solve complex problems better over time.

🧭 The Real Problem Isn’t Search, It’s Signal

We started with this goal. The initial Arxiv search returned over 300,000 results.

But when we reviewed the top 100, fewer than 25 were actually useful. The rest were vague, outdated, speculative, or just off-topic.

This is the real challenge:

🤖 AI doesn’t struggle to find information; it struggles to filter it.

Building a self-improving system isn’t just about learning from data. It’s about learning to recognize good data, and ignore the rest. That’s what this post is about.


🔧 Building the Self-Improving Research Pipeline

Let’s get hands-on. Below is a real, running pipeline designed to search Arxiv, load and analyze papers, score their relevance, and then retrain itself based on what it learns.

This YAML configuration defines the pipeline structure:

goal:
  goal_text: I want to build an AI that can teach itself to solve complex problems better over time.
  goal_type: "tactical"
  goal_category: "meta_learning"
  focus_area: "self_improvement"


pipeline:
  name: rivals
  tag: "search_arxiv"
  description: "Search Arxiv for papers related to a goal"
  stages:
    - name: arxiv_search
      description: "Search Arxiv for papers related to the goal"
      cls: co_ai.agents.knowledge.arxiv_search.ArxivSearchAgent
      enabled: true
      iterations: 1

    - name: document_loader
      description: "Load documents from the search results and summarize them"
      cls: co_ai.agents.knowledge.document_loader.DocumentLoaderAgent
      enabled: true
      iterations: 1

    - name: document_profiler
      description: "Profile the loaded documents to extract key sections"
      cls: co_ai.agents.knowledge.document_profiler.DocumentProfilerAgent
      enabled: true
      iterations: 1

    - name: paper_score
      description: "Score the papers based on their relevance and quality"
      cls: co_ai.agents.knowledge.paper_score.PaperScoreAgent
      enabled: true
      iterations: 1

    - name: knowledge_loader
      description: "Load knowledge from the scored papers into the system"
      cls: co_ai.agents.knowledge.knowledge_loader.KnowledgeLoaderAgent
      enabled: true
      iterations: 1

    - name: document_trainer
      description: "Build document pairs for training and evaluation, tran and generate models"
      cls: co_ai.agents.knowledge.document_trainer.DocumentTrainerAgent
      enabled: true
      iterations: 1

    - name: document_reward_scorer
      description: "Score the documents based on their relevance and quality"
      cls: co_ai.agents.knowledge.document_reward_scorer.DocumentRewardScorerAgent
      enabled: true
      iterations: 1

Each of these agents is a working Python class that performs a specific task in the pipeline. Let’s walk through what each one does in sequence, and how they contribute to a system that improves itself every time it runs.
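
To make the YAML above concrete, here is a hedged sketch of how a runner might instantiate and execute those stages. The real Supervisor (covered in Part 2) also wires up memory, logging, and the registry; this stripped-down version only shows the core loop: import each cls, call run(), and thread a shared context dict from stage to stage.

import importlib

async def run_pipeline(pipeline_cfg: dict, context: dict, memory=None, logger=None) -> dict:
    """Minimal pipeline runner: instantiate each enabled stage and run it in order."""
    for stage in pipeline_cfg["stages"]:
        if not stage.get("enabled", True):
            continue
        module_path, class_name = stage["cls"].rsplit(".", 1)
        agent_cls = getattr(importlib.import_module(module_path), class_name)
        agent = agent_cls(cfg=stage, memory=memory, logger=logger)
        for _ in range(stage.get("iterations", 1)):
            context = await agent.run(context)  # each agent reads and enriches the context
    return context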

    
flowchart LR
  A[🎯 ArxivSearchAgent<br/>Find goal-related seed papers]:::highlighted
  B[📥 DocumentLoaderAgent<br/>Download & extract text]
  C[🧬 DocumentProfilerAgent<br/>Enrich, embed, and segment]
  D[📈 PaperScoreAgent<br/>Rate for novelty, relevance, etc.]
  E[📚 KnowledgeLoaderAgent<br/>Load knowledge from the Brain]
  F[🎓 DocumentTrainerAgent<br/>Learning Better Papers]
  G[🏅 DocumentRewardScorerAgent<br/>Score Documents]

  A --> B --> C --> D --> E --> F --> G

  classDef highlighted fill:#ffebcc,stroke:#ffaa00,stroke-width:2px;
  

🔎 ArxivSearchAgent: Semantic Search for Self-Improvement

The ArxivSearchAgent is the first step in our AI pipeline that learns to improve itself. It transforms high-level research goals into actionable queries, interfaces with the arXiv API, and returns high-quality papers relevant to the specified objective.

✨ Purpose

This agent allows our system to autonomously discover and retrieve relevant literature from arXiv.org, which becomes the knowledge base for further evaluation, ranking, and learning.

🧠 How It Works

  1. Goal Understanding: The agent reads a high-level goal_text such as: “I want to build an AI that can teach itself to solve complex problems better over time.”

  2. Keyword Extraction: A lightweight prompt-based method extracts semantic keywords from the goal (e.g., reinforcement learning, recursive self-improvement, curriculum generation).

You are an expert AI research assistant.

Your task is to analyze a research or development goal and return a list of concise, technical keywords or phrases that would be useful in an academic search engine like arXiv, Semantic Scholar, or Google Scholar.

These keywords should be specific enough to narrow results to relevant technical papers, and may include terms related to:
- methodology (e.g., "meta learning", "reward modeling")
- concepts (e.g., "recursive self-improvement", "strategic reasoning")
- tasks (e.g., "curriculum generation", "continual learning")
- disciplines (e.g., "reinforcement learning", "AI alignment")

---

Goal:
{{ goal.goal_text }}

---

{% if preferences %}
And these preferences:
{% for p in preferences %}
- {{ '{{' }} p {{ '}}' }}
{% endfor %}
{% endif %}

{% if instructions %}
Additional instructions: 
{% for i in instructions %}
- {{ '{{' }} i {{ '}}' }}
{% endfor %}
{% endif %}

Please respond with a list of 5–12 keywords or key phrases in plain text, one per line. Do not include explanations, just the keywords.

This will generate a set of keywords like this:

"reinforcement learning", 
"online learning", 
"experience replay", 
"feedback loops", 
"meta-learning", 
"self-improvement", 
"recursive self-improvement", 
"trategic reasoning", 
"reward modeling", 
"curriculum generation", 
"adaptive learning strategies"

We configure the search through properties in the config

  year_start: 2024
  year_end: 2025
  category: cs.AI
  max_results: 50
  top_n: 10
  3. Query Construction: These keywords are transformed into a valid arXiv query with filters for:

    • Category (e.g., cs.AI)
    • Date range (e.g., papers from 2024–2025)
  4. API Search: Uses the arxiv Python package to fetch matching papers, sorted by relevance.

  5. Metadata Enrichment: For each result, the agent records:

    • PDF URL
    • Title, abstract, authors
    • Goal ID and strategy context
    • arXiv ID and category
  6. Output: Results are stored in context["raw_arxiv_results"] and passed to downstream agents like DocumentLoaderAgent and PaperScoreAgent.

This will generate this query:

("reward modeling" OR "meta learning" OR "continual learning" OR "recursive
    self-improvement" OR "feedback-driven optimization" OR "performance-based adaptation"
    OR "dynamic criteria adjustment" OR "AI alignment" OR "curriculum generation" OR
    "reinforcement learning" OR "self-improving systems") AND submittedDate:[20240101
    TO 20251231] AND cat:cs.AI
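
For reference, here is roughly what that search looks like when issued directly with the arxiv Python package (abbreviated query; the real agent also attaches goal_id, strategy, and focus_area metadata before writing to context["raw_arxiv_results"]):

import arxiv

query = (
    '("reward modeling" OR "meta learning" OR "recursive self-improvement") '
    'AND submittedDate:[20240101 TO 20251231] AND cat:cs.AI'
)

client = arxiv.Client()
search = arxiv.Search(query=query, max_results=50,
                      sort_by=arxiv.SortCriterion.Relevance)

results = []
for paper in client.results(search):
    results.append({
        "title": paper.title,
        "summary": paper.summary,
        "url": paper.pdf_url,
        "arxiv_id": paper.get_short_id(),
        "published": paper.published.isoformat(),
    })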

🧪 Example Output

{
  "title": "Towards Self-Improving AI: A Meta-Learning Approach",
  "summary": "...",
  "url": "https://arxiv.org/pdf/2501.12345v2.pdf",
  "goal_id": "goal-1234",
  "parent_goal": "I want to build an AI...",
  "strategy": "stepwise_decomposition",
  "focus_area": "self_improvement",
  "published": "2024-11-01T00:00:00Z"
}

✅ Built for research

ArxivSearchAgent forms the foundation of a system that can:

  • Autonomously retrieve state-of-the-art knowledge
  • Benchmark itself against expert-written papers
  • Learn and update its internal value models over time

It’s not just search; it’s self-supervised knowledge acquisition tailored to goal-driven reasoning.


📄 Loading the Document Results: DocumentLoaderAgent

We covered this class in a previous post: Document Intelligence: Turning Documents into Structured Knowledge.

    flowchart LR
  A[🎯 ArxivSearchAgent<br/>Find goal-related seed papers]
  B[📥 DocumentLoaderAgent<br/>Download & extract text]:::highlighted
  C[🧬 DocumentProfilerAgent<br/>Enrich, embed, and segment]
  D[📈 PaperScoreAgent<br/>Rate for novelty, relevance, etc.]
  E[📚 KnowledgeLoaderAgent<br/>Load knowledge from the Brain]
  F[🎓 DocumentTrainerAgent<br/>Learning Better Papers]
  G[🏅 DocumentRewardScorerAgent<br/>Score Documents]

  A --> B --> C --> D --> E --> F --> G

  classDef highlighted fill:#ffebcc,stroke:#ffaa00,stroke-width:2px;
  

Once we’ve searched Arxiv and retrieved results, we use the DocumentLoaderAgent to download and process those papers. Here’s what happens:

🔁 Step-by-Step Flow

  1. Check for Existing Documents: If we’ve already downloaded this paper before, skip downloading but optionally reclassify its domain (if force_domain_update = True).

  2. Download and Extract Text from PDF

    • Download the PDF from the Arxiv URL.
    • Extract text using PDFConverter.
    • Clean up the file afterward.
  3. Summarize or Use Arxiv Metadata

    • Use Arxiv metadata if available.
    • If not, summarize with an LLM.
    • Guess the title if needed (especially for messy PDFs).
  4. Generate Embeddings: Create a vector embedding from the title + summary. This enables similarity search and clustering later on.

  5. Store to the Knowledge Base: Save the document, with its metadata, into the system.

  6. Classify by Domain: Use DomainClassifier to label the document with relevant research domains (e.g. “machine learning”, “optimization”, “robotics”).

🚀 What Makes This Agent Self-Improving?

This loader is more than just a parser:

  • It filters noise early by rejecting bad PDFs or already-seen documents.
  • It adds metadata and structure needed for downstream training.
  • Its embeddings are reusable, helping compare and cluster papers over time.
  • It trains the domain classifier continually, if you plug it into your learning loop.

🧬 Key Code Snippet

Here’s the document ingestion process, simplified:

class DocumentLoaderAgent(BaseAgent):
    def __init__(self, cfg, memory=None, logger=None):
        super().__init__(cfg, memory, logger)
        self.domain_classifier = DomainClassifier(
            memory, logger, cfg.get("domain_seed_config_path", "config/domain/seeds.yaml")
        )
        self.download_directory = cfg.get("download_directory", "/tmp")
        self.summarize_documents = cfg.get("summarize_documents", False)

    async def run(self, context: dict) -> dict:
        search_results = context.get(self.input_key, [])
        stored_documents = []

        for result in search_results:
            url = result.get("url")
            title = result.get("title")
            existing = self.memory.document.get_by_url(url)
            if existing:
                stored_documents.append(existing.to_dict())
                continue

            # Download and extract PDF text
            pdf_path = self.download_pdf(url, title)
            if not PDFConverter.validate_pdf(pdf_path):
                continue
            text = PDFConverter.pdf_to_text(pdf_path)
            os.remove(pdf_path)

            # Optional: summarize via LLM or fetch ArXiv metadata
            summary = result.get("summary")
            if self.summarize_documents:
                summary = self.generate_summary(text, context)

            # Save document + embedding
            doc = self.memory.document.add_document({
                "title": title, "summary": summary, "text": text, "url": url,
                "goal_id": context.get("goal", {}).get("id")
            })
            self.memory.embedding.get_or_create(f"{title}\n\n{summary}")
            self.assign_domains_to_document(doc)
            stored_documents.append(doc.to_dict())

        context[self.output_key] = stored_documents
        return context

    def download_pdf(self, url, title):
        response = requests.get(url, stream=True)
        file_name = re.sub(r'[^\w\-]', "_", title)[:80]
        pdf_path = f"{self.download_directory}/{file_name}.pdf"
        with open(pdf_path, "wb") as f:
            for chunk in response.iter_content(8192):
                f.write(chunk)
        return pdf_path

    def generate_summary(self, text, context):
        prompt = self.prompt_loader.load_prompt(self.cfg, {"document_text": text, **context})
        return self.call_llm(prompt, context)

    def assign_domains_to_document(self, document):
        content = document.content
        for domain, score in self.domain_classifier.classify(content, top_k=3, min_score=0.6):
            self.memory.document_domains.insert({
                "document_id": document.id,
                "domain": domain,
                "score": score
            })

This gives us structured, searchable, classified, and embeddable research documents.


📚 Structuring Knowledge: The Role of the DocumentProfilerAgent

Once documents are retrieved and loaded into the system, raw text alone isn’t enough. To make use of this information, especially in the context of AI research papers, we need to understand the structure of each document and extract the parts that matter most.

This is where the Document Profiler comes in.

    flowchart LR
  A[🎯 ArxivSearchAgent<br/>Find goal-related seed papers]
  B[📥 DocumentLoaderAgent<br/>Download & extract text]
  C[🧬 DocumentProfilerAgent<br/>Enrich, embed, and segment]:::highlighted
  D[📈 PaperScoreAgent<br/>Rate for novelty, relevance, etc.]
  E[📚 KnowledgeLoaderAgent<br/>Load knowledge from the Brain]
  F[🎓 DocumentTrainerAgent<br/>Learning Better Papers]
  G[🏅 DocumentRewardScorerAgent<br/>Score Documents]

  A --> B --> C --> D --> E --> F --> G

  classDef highlighted fill:#ffebcc,stroke:#ffaa00,stroke-width:2px;
  

🤔 What It Does

The DocumentProfilerAgent is responsible for breaking down raw documents into meaningful, structured sections such as:

  • Title
  • Abstract
  • Methods
  • Results
  • Key Contributions

These sections are crucial because they isolate the most useful parts of a paper and allow downstream agents (like scoring, training, or validation engines) to focus only on the information that’s likely to impact decision-making.

The profiler uses a two-phase approach:

  1. Unstructured Heuristics: It tries to parse section headings and extract content using rules and patterns.
  2. LLM Fallback (if needed): If the heuristic extraction misses something or is too low quality, it invokes an LLM to assist in summarizing or identifying sections.

Each section is stored with:

  • The section name
  • Extracted text
  • An optional LLM-generated summary
  • Associated domains (e.g. “reinforcement learning”, “multi-modal systems”)

Here’s a simplified version of the profiler:

DEFAULT_SECTIONS = ["title", "abstract", "methods", "results", "contributions"]

class DocumentProfilerAgent(BaseAgent):
    def __init__(self, cfg, memory=None, logger=None):
        super().__init__(cfg, memory, logger)
        self.output_sections = cfg.get("output_sections", DEFAULT_SECTIONS)
        self.min_chars_per_sec = cfg.get("min_chars_per_section", 120)

        self.domain_classifier = DomainClassifier(
            memory, logger, cfg.get("domain_seed_config_path")
        )
        self.section_parser = DocumentSectionParser(cfg, logger)

    async def run(self, context: dict) -> dict:
        documents = context.get(self.input_key, [])
        profiled = []

        for doc in documents:
            doc_id = doc["id"]
            text = doc.get("content", doc.get("text", ""))
            title = doc.get("title")
            summary = doc.get("summary")

            # Step 1: Parse unstructured sections
            parsed = self.section_parser.parse(text)

            # Step 2: Optionally add title & abstract
            if title:
                parsed["title"] = title
            if summary:
                parsed["abstract"] = summary

            # Step 3: Store sections & domains
            for section, section_text in parsed.items():
                entry = self.memory.document_section.upsert({
                    "document_id": doc_id,
                    "section_name": section,
                    "section_text": section_text,
                    "source": "unstructured",
                })

                domains = self.domain_classifier.classify(section_text)
                for domain, score in domains:
                    self.memory.document_section_domains.insert({
                        "document_section_id": entry.id,
                        "domain": domain,
                        "score": float(score),
                    })

            profiled.append({
                "id": doc_id,
                "structured_data": parsed,
            })

        context[self.output_key] = profiled
        return context

🧩 Engineering Impact

This stage is a bridge between raw knowledge and usable insight. By structuring the data:

  • We enable fine-grained comparison between documents
  • We allow domain-aware filtering and scoring
  • We support selective training of models on the most relevant parts (e.g., only learning from a method or result section)

🧬 Contribution to Self-Improvement

The profiler contributes to the self-improving loop in two critical ways:

  1. Data Quality: By ensuring the training data is cleanly structured, we avoid training our models on noisy or irrelevant content.
  2. Domain Awareness: By tagging sections with topic domains, we can align documents with goals, identify coverage gaps, and route them more intelligently in future learning cycles.

In essence, this agent turns a chaotic dump of text into a set of high-quality, semantically-tagged, modular building blocks. These become the core “knowledge atoms” our AI learns from.


✔️ Measuring Relevance and Utility: the PaperScoreAgent

Once documents have been structured and profiled, the next step is to assess how useful each one is in helping the system achieve its current goal. This is where the PaperScoreAgent comes into play.

    flowchart LR
  A[🎯 ArxivSearchAgent<br/>Find goal-related seed papers]
  B[📥 DocumentLoaderAgent<br/>Download & extract text]
  C[🧬 DocumentProfilerAgent<br/>Enrich, embed, and segment]
  D[📈 PaperScoreAgent<br/>Rate for novelty, relevance, etc.]:::highlighted
  E[📚 KnowledgeLoaderAgent<br/>Load knowledge from the Brain]
  F[🎓 DocumentTrainerAgent<br/>Learning Better Papers]
  G[🏅 DocumentRewardScorerAgent<br/>Score Documents]

  A --> B --> C --> D --> E --> F --> G

  classDef highlighted fill:#ffebcc,stroke:#ffaa00,stroke-width:2px;
  

This agent evaluates each document across multiple dimensions (like relevance, novelty, clarity, etc.), using LLM-based or rule-based scoring mechanisms encapsulated in the PaperScoringMixin. It avoids redundant work by skipping re-scoring for already evaluated papers unless forced via configuration.

Each document is scored and stored, and the results can later be used for:

  • Selecting top-performing documents for training or inference,
  • Understanding which kinds of research are consistently useful,
  • Fine-tuning the search and filtering process.

This ensures that not only are we collecting AI papers, but we’re intelligently filtering them to extract high-value insights.

📦 Code Summary: PaperScoreAgent

  • Score papers: Computes evaluation scores for each document.
  • Avoid redundant work: Checks if a document has already been scored and skips it unless force_rescore is enabled.
  • Pulls stored scores: Fetches past evaluations from a memory database via EvaluationORM and ScoreORM.
  • Aggregates results: Averages scores by dimension when using cached results.

Inputs:

  • A list of documents (from context).

📤 Outputs:

  • context[self.output_key] containing titles and score dictionaries.

class PaperScoreAgent(BaseAgent, PaperScoringMixin):
    def __init__(self, cfg, memory=None, logger=None):
        super().__init__(cfg, memory, logger)
        self.force_rescore = cfg.get("force_rescore", False)

    async def run(self, context: dict) -> dict:
        documents = context.get(self.input_key, [])
        results = []

        for document in documents:
            doc_id = document["id"]
            existing = self.get_scores_by_document_id(doc_id)

            if existing and not self.force_rescore:
                results.append({
                    "title": document.get("title"),
                    "scores": self.aggregate_scores_by_dimension(existing)
                })
                continue

            score_result = self.score_paper(document, context=context)
            results.append({
                "title": document.get("title"),
                "scores": score_result
            })

        context[self.output_key] = results
        return context

    def get_scores_by_document_id(self, doc_id: int) -> list:
        evaluations = self.memory.session.query(EvaluationORM).filter_by(document_id=doc_id).all()
        scores = []
        for ev in evaluations:
            scores.extend(
                self.memory.session.query(ScoreORM).filter_by(evaluation_id=ev.id).all()
            )
        return scores

    def aggregate_scores_by_dimension(self, scores: list) -> dict:
        totals = defaultdict(list)
        for score in scores:
            if score.score != 0:
                totals[score.dimension].append(score.score)
        return {dim: round(sum(vals) / len(vals), 4) for dim, vals in totals.items()}

🧮 How PaperScoringMixin Works

The PaperScoringMixin provides the scoring logic used by PaperScoreAgent. It defines a single method score_paper() which delegates the actual evaluation to a flexible PaperScoreEvaluator.

This evaluator is configured through a YAML file (e.g., paper_review.yaml) that defines what dimensions to score (e.g., relevance, originality, clarity) and how to prompt the LLM to perform that evaluation.

The mixin ensures that any agent using it:

  • Loads the scoring rubric and prompt templates,
  • Injects the document and context into the LLM,
  • Receives a set of scored dimensions back.

This modular design allows you to swap in different evaluators or scoring rules just by updating the config file, with no code changes required. It’s an elegant abstraction that decouples how papers are scored from where they’re processed.

class PaperScoringMixin:
    def score_paper(self, paper_doc: dict, context: dict = None) -> dict:
        context = context or {}
        context["paper_score"] = paper_doc

        if not hasattr(self, "call_llm"):
            raise AttributeError("Agent must implement `call_llm(prompt, context)`")

        evaluator = PaperScoreEvaluator.from_file(
            filepath=self.cfg.get("score_config", "config/scoring/paper_review.yaml"),
            prompt_loader=self.prompt_loader,
            cfg=self.cfg,
            logger=self.logger,
            memory=self.memory,
        )

        scores = evaluator.evaluate(document=paper_doc, context=context, llm_fn=self.call_llm)
        return scores



🧠 KnowledgeLoaderAgent: Filtering for Signal, Not Just Matches

    flowchart LR
  A[🎯 ArxivSearchAgent<br/>Find goal-related seed papers]
  B[📥 DocumentLoaderAgent<br/>Download & extract text]
  C[🧬 DocumentProfilerAgent<br/>Enrich, embed, and segment]
  D[📈 PaperScoreAgent<br/>Rate for novelty, relevance, etc.]
  E[📚 KnowledgeLoaderAgent<br/>Load knowledge from the Brain]:::highlighted
  F[🎓 DocumentTrainerAgent<br/>Learning Better Papers]
  G[🏅 DocumentRewardScorerAgent<br/>Score Documents]

  A --> B --> C --> D --> E --> F --> G

  classDef highlighted fill:#ffebcc,stroke:#ffaa00,stroke-width:2px;
  

As document pipelines grow in depth and breadth, not all retrieved content deserves to be retained. The KnowledgeLoaderAgent is where the system gets selective, filtering only the most useful knowledge for long-term storage or further reasoning.

🎯 Role in the Pipeline

The KnowledgeLoaderAgent turns a pile of downloaded, profiled, and scored documents into a targeted collection of high-quality knowledge, tuned to the current research goal. It doesn’t just rely on text overlap or keywords; it uses embedding similarity, domain matching, and quality scoring to curate the best content.

This is the agent that separates “might be relevant” from “essential to know.”

🔍 What It Actually Does

  1. Goal Domain Matching

    • Uses embedding vectors to classify the goal into a domain (like "LLM Optimization" or "Knowledge Distillation").
    • Computes cosine similarity between the goal embedding and each domain’s seed embedding.
  2. Document Domain Filtering

    • Keeps only those documents whose domain tags match the goal’s domain, and with a domain confidence score above a threshold.
    • Domain tags were precomputed by the DocumentProfilerAgent.
  3. Optional Score Filtering

    • If use_dimensional_scores: true, it additionally filters based on quality scores like:

      • relevance, usefulness, clarity, implementability, novelty
    • You can set a weighted score threshold or sort based on top-K performance across selected dimensions.

  4. Flexible Return Format

    • Returns either summary or full text, depending on downstream needs (include_full_text: true/false).

🧪 Example Use Case

Say your goal is: “Improve the fine-tuning efficiency of transformer models.”

The KnowledgeLoader:

  • Embeds the goal.
  • Detects it belongs to the "LLM Optimization" domain.
  • Selects documents tagged with that domain, and scores above 0.6 on domain confidence.
  • If configured, also checks that selected docs score well on clarity and implementability.

Result: a curated set of documents, cleanly scoped to the goal and backed by semantic and quality filtering.


⚙️ Example Config

knowledge_loader:
  name: knowledge_loader
  domain_seeds: ${path:config/domain/seeds.yaml}
  top_k: 3
  domain_threshold: 0.4
  include_full_text: false
  use_dimensional_scores: true
  dimension_weights:
    relevance: 1.0
    usefulness: 0.8
    clarity: 0.6
    implementability: 0.7
    novelty: 0.5
  min_weighted_score: 0.5

🧩 Trimmed Code Summary

class KnowledgeLoaderAgent(BaseAgent):
    def __init__(...):
        self.domain_seeds = cfg.get("domain_seeds", {})
        self.top_k = cfg.get("top_k", 3)
        self.threshold = cfg.get("domain_threshold", 0.0)
        self.include_full_text = cfg.get("include_full_text", False)
        self.use_dimensional_scores = cfg.get("use_dimensional_scores", False)
        self.dimension_weights = cfg.get("dimension_weights", {...})
        self.min_weighted_score = cfg.get("min_weighted_score", 0.5)

    async def run(self, context):
        goal_text = context["goal"]["goal_text"]
        goal_vector = self.memory.embedding.get_or_create(goal_text)

        # 1. Match goal to a domain
        domain_vectors = {
            d: np.mean([self.memory.embedding.get_or_create(x) for x in ex], axis=0)
            for d, ex in self.domain_seeds.items()
        }
        goal_domain = max(domain_vectors, key=lambda d: cosine_similarity([goal_vector], [domain_vectors[d]])[0][0])

        # 2. Filter docs by domain + optional score
        filtered = []
        for doc in context["documents"]:
            domains = self.memory.document_domains.get_domains(doc["id"])
            if any(dom.domain == goal_domain and dom.score >= self.threshold for dom in domains):
                if self.use_dimensional_scores and self.compute_weighted_score(doc["id"]) < self.min_weighted_score:
                    continue
                filtered.append(doc)

        context[self.output_key] = filtered
        return context
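
The compute_weighted_score helper referenced above isn’t shown in the trimmed listing. Continuing the same class, a plausible version looks like this; note that the memory.scores.get_by_document accessor and the 0–10 raw score scale are assumptions made for illustration:

    def compute_weighted_score(self, doc_id) -> float:
        """Weighted average of stored dimension scores, normalized to 0-1
        so it can be compared against min_weighted_score."""
        scores = self.memory.scores.get_by_document(doc_id)  # assumed accessor
        total, weight_sum = 0.0, 0.0
        for s in scores:
            weight = self.dimension_weights.get(s.dimension, 0.0)
            total += weight * (s.score / 10.0)  # assume raw scores are on a 0-10 scale
            weight_sum += weight
        return total / weight_sum if weight_sum else 0.0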

🦾 DocumentTrainerAgent: Learning to Prefer Better Papers

Once we’ve scored papers and collected preferences, we want the system to internalize what makes one paper better than another across different dimensions like relevance, clarity, usefulness, etc. That’s the job of the DocumentTrainerAgent.

    flowchart LR
  A[🎯 ArxivSearchAgent<br/>Find goal-related seed papers]
  B[📥 DocumentLoaderAgent<br/>Download & extract text]
  C[🧬 DocumentProfilerAgent<br/>Enrich, embed, and segment]
  D[📈 PaperScoreAgent<br/>Rate for novelty, relevance, etc.]
  E[📚 KnowledgeLoaderAgent<br/>Load knowledge from the Brain]
  F[🎓 DocumentTrainerAgent<br/>Learning Better Papers]:::highlighted
  G[🏅 DocumentRewardScorerAgent<br/>Score Documents]

  A --> B --> C --> D --> E --> F --> G

  classDef highlighted fill:#ffebcc,stroke:#ffaa00,stroke-width:2px;
  

🧠 Purpose

The DocumentTrainerAgent creates training data from real scoring preferences and trains multi-dimensional reward models (RMs) that can later be used to predict document quality.

This is how the system starts to teach itself what “good” means, based on your goals and scoring feedback.

⚙️ What It Does

  1. 👥 Builds Contrastive Training Pairs: Uses DocumentPreferencePairBuilder to construct contrastive pairs like:

    “For goal X, document A is more relevant than document B.”

    These are pulled from prior scoring runs stored in the system’s memory.

  2. 📈 Trains Per-Dimension Value Models: For each dimension (e.g., clarity, usefulness, implementability), it uses DocumentMRQTrainer to train a dimension-specific reward model using contrastive loss.

    Each model learns to distinguish better vs. worse outputs on that metric.

  3. 🧠 Optional Tuning Layer: After training, a lightweight regression tuner is fitted to calibrate model outputs against original scores. This helps smooth the reward predictions.

  4. 💾 Saves Models and Tuners: Models are saved to disk (e.g., document_rm_clarity.pt) and linked tuners are serialized to JSON. These can later be loaded for inference in downstream agents.

    flowchart LR
    A[Get Contrast Pairs] --> B[Group by Dimension]
    B --> C[Train MRQ Model per Dimension]
    C --> D[Save Models + Tuners]
    D --> E[Return to Scoring Pipeline]
  

🧩 Code Summary

This agent:

  1. Pulls training pairs from memory (organized by dimension),
  2. Prepares the data for training,
  3. Trains a regression model per dimension using MRQ (Multidimensional Reward Quantification),
  4. Saves the models and any tuning metadata for later use in scoring.

This closes the loop between observation and adaptation — it’s how our system evolves its judgment from LLM supervision to fast local predictors.

Here’s the core implementation:

class DocumentTrainerAgent(BaseAgent):
    async def run(self, context):
        goal_text = context["goal"]["goal_text"]

        # Step 1: Build contrastive pairs
        builder = DocumentPreferencePairBuilder(self.memory.session, self.logger)
        pairs = builder.get_training_pairs_by_dimension(goal=goal_text)

        # Step 2: Flatten all pairs into training examples
        all_pairs = []
        for dim, dim_pairs in pairs.items():
            for p in dim_pairs:
                all_pairs.append({
                    "title": p["title"],
                    "output_a": p["output_a"],
                    "output_b": p["output_b"],
                    "value_a": p["value_a"],
                    "value_b": p["value_b"],
                    "dimension": dim,
                })

        # Step 3: Train reward models
        trainer = DocumentMRQTrainer(
            memory=self.memory,
            logger=self.logger,
            encoder=TextEncoder(),
            value_predictor=DocumentValuePredictor(),
            device="cuda" if torch.cuda.is_available() else "cpu"
        )

        trained_models, tuners = trainer.train_multidimensional_model(all_pairs, cfg={
            "epochs": 10,
            "lr": 1e-4,
            "patience": 2
        })

        # Step 4: Save models and tuners
        for dim, model in trained_models.items():
            torch.save(model, f"models/document_rm_{dim}.pt")
            if dim in tuners:
                tuners[dim].save(f"models/document_rm_{dim}_tuner.json")

        return context

🧱 Building Training Pairs from Scored Papers

After documents are scored, we need to convert that feedback into structured training data. The DocumentPreferencePairBuilder does exactly that: it transforms past evaluations into contrastive pairs that teach a model how to rank quality.

🔍 What It Does

The DocumentPreferencePairBuilder queries your database for papers that have been scored (across any dimension like relevance, clarity, etc.). For each document and dimension, it finds:

  • The highest-scoring version (what we want the model to prefer), and
  • The lowest-scoring version (what we want it to avoid).

These pairs are grouped by dimension and formatted as input to a contrastive learning model, which will later be trained to favor better outputs.

This forms the core learning signal for the reward model: “Given two documents, which is better and why?”

⚙️ Example Output Format

{
  "relevance": [
    {
      "title": "Language Models as Agents",
      "output_a": "...",  # preferred version
      "output_b": "...",  # less preferred
      "value_a": 8.2,
      "value_b": 5.1
    },
    ...
  ],
  "clarity": [ ... ],
  "usefulness": [ ... ]
}

Each dimension produces a list of pairs. These are consumed by the DocumentTrainerAgent in the next stage.

🧩 Code Summary

class DocumentPreferencePairBuilder:
    def __init__(self, db, logger=None):
        self.db = db
        self.logger = logger

    def get_training_pairs_by_dimension(self, goal=None, limit=10000) -> dict:
        # SQL query to find top- and bottom-scoring versions of each doc per dimension
        query = text(""" ... """)  # trimmed for brevity

        try:
            rows = self.db.execute(query, {"limit": limit}).fetchall()
        except Exception as e:
            self.logger.log("DocumentPairBuilderError", {"error": str(e)})
            return {}

        grouped = defaultdict(dict)
        results = defaultdict(list)

        # Group rows into (top, bottom) per doc per dimension
        for row in rows:
            grouped[(row.dimension, row.doc_id)][row.rank_type] = row

        for (dim, _), pair in grouped.items():
            if "top" in pair and "bottom" in pair:
                results[dim].append({
                    "title": pair["top"].title,
                    "output_a": pair["top"].content,
                    "output_b": pair["bottom"].content,
                    "value_a": float(pair["top"].score),
                    "value_b": float(pair["bottom"].score),
                })

        return dict(results)

Here’s a concise explanation of the SQL query used to extract preference pairs for training a reward or ranking model:

🧮 SQL: Extracting Document Preference Pairs

The SQL query builds document pairs based on score differences across dimensions (e.g., novelty, clarity, relevance) to train models like MR.Q. Here’s how it works:

scored_docs CTE

  • Joins scores, evaluations, and documents to retrieve:
    • The document content,
    • Its associated score per dimension, and
    • Row numbers (rank_high, rank_low) that identify the highest and lowest scored instances per dimension and document.
  • Filters out null scores.

Top & Bottom Selection

  • Extracts:
    • The top-scored version of each document per dimension (rank_high = 1), and
    • The bottom-scored version (rank_low = 1).
  • Ensures the document has valid, non-empty content.

WITH scored_docs AS (
    SELECT
        s.dimension,
        s.score,
        d.id AS doc_id,
        d.title,
        d.content,
        ROW_NUMBER() OVER (
            PARTITION BY s.dimension, d.id ORDER BY s.score DESC
        ) AS rank_high,
        ROW_NUMBER() OVER (
            PARTITION BY s.dimension, d.id ORDER BY s.score ASC
        ) AS rank_low
    FROM scores s
    JOIN evaluations e ON s.evaluation_id = e.id
    JOIN documents d ON e.document_id = d.id
    WHERE s.score IS NOT NULL
)
SELECT
    dimension,
    title,
    content,
    score,
    rank_type,
    doc_id
FROM (
    SELECT
        dimension,
        title,
        content,
        score,
        'top' AS rank_type,
        doc_id
    FROM scored_docs
    WHERE rank_high = 1
        AND content IS NOT NULL
        AND content <> ''
        
    UNION ALL

    SELECT
        dimension,
        title,
        content,
        score,
        'bottom' AS rank_type,
        doc_id
    FROM scored_docs
    WHERE rank_low = 1
) AS ranked_pairs
ORDER BY dimension, doc_id
LIMIT :limit

Result Format

  • Returns a flattened list of pairs marked as 'top' or 'bottom' for each document and dimension.

  • These are then grouped in code to form contrast pairs like:

    {
      "title": "Sample Title",
      "output_a": "high-quality text",
      "output_b": "lower-quality text",
      "value_a": 8.5,
      "value_b": 4.2
    }
    
Usage
  • This output feeds into the contrastive training of ranking models that learn to prefer higher-quality research content based on past scores.

🧠 Learning to Rank: The DocumentMRQTrainer

Once we’ve built contrastive preference pairs from human or LLM evaluations, we need a model that can learn to replicate those judgments. That’s where the DocumentMRQTrainer comes in.

🎯 What It Does

The DocumentMRQTrainer trains a multi-dimensional reward model: a lightweight neural predictor that learns to estimate the relative quality of documents given a goal. It’s designed for continuous retraining using LLM feedback or human preferences as supervision.

It supports multiple quality dimensions (e.g. relevance, clarity, novelty, etc.) and returns:

  • A trained model per dimension.
  • A regression tuner that aligns model predictions with real LLM score scales.

Think of this as the “student” that learns from the scoring “teacher” and eventually replaces it for faster inference and ranking.

⚙️ How It Works

  1. Embedding Comparison: For each pair, it uses the TextEncoder to compute a goal-aware representation for both documents. The less-preferred document’s embedding is subtracted from the preferred one, producing a contrast vector.

  2. Binary Classification Training: These contrast vectors are passed into a small feedforward model (DocumentValuePredictor) trained with binary cross-entropy loss. A label of 1.0 signals “A is better than B.” (A sketch of both components follows this list.)

  3. Dimension-Specific Models: Each quality dimension is trained independently, and the model state is saved for future use.

  4. Score Alignment (Tuner): After training, a lightweight RegressionTuner is fit to map MRQ predictions to LLM-calibrated scores, using real examples. This tuner allows the MRQ model to produce human-like scores at runtime.
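
The TextEncoder and DocumentValuePredictor themselves live elsewhere in the codebase. As a rough mental model (the layer sizes and architecture here are assumptions, chosen to match the DocumentValuePredictor(512, 1024) constructor below), they might look like this:

import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Fuses a context (goal) embedding and a document embedding into one goal-aware vector."""
    def __init__(self, dim=512, hdim=1024):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim * 2, hdim), nn.ReLU(), nn.Linear(hdim, dim))

    def forward(self, ctx_emb, doc_emb):
        ctx = torch.as_tensor(ctx_emb, dtype=torch.float32)
        doc = torch.as_tensor(doc_emb, dtype=torch.float32)
        return self.net(torch.cat([ctx, doc], dim=-1))

class DocumentValuePredictor(nn.Module):
    """Small feedforward head mapping a contrast vector to a single preference logit."""
    def __init__(self, dim=512, hdim=1024):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hdim), nn.ReLU(), nn.Linear(hdim, 1))

    def forward(self, x):
        return self.net(x)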

🧩 Key Functions

class DocumentMRQTrainer:
    def __init__(self, memory, logger, encoder=None, value_predictor=None, device="cpu"):
        self.memory = memory
        self.logger = logger
        self.device = device
        self.encoder = encoder or TextEncoder()
        self.value_predictor = value_predictor or DocumentValuePredictor(512, 1024)
        self.regression_tuners = {}

    def prepare_training_data(self, samples):
        inputs, labels = [], []
        for item in samples:
            # Embed the context (here, the paper title) and both candidate documents
            ctx_emb = self.memory.embedding.get_or_create(item["title"])
            emb_a = self.memory.embedding.get_or_create(item["output_a"])
            emb_b = self.memory.embedding.get_or_create(item["output_b"])

            with torch.no_grad():
                zsa_a = self.encoder(ctx_emb, emb_a)
                zsa_b = self.encoder(ctx_emb, emb_b)

            # Generate contrast vector: (preferred doc) - (less preferred doc)    
            diff = zsa_a - zsa_b if item["value_a"] >= item["value_b"] else zsa_b - zsa_a
            inputs.append(diff) # Train model to recognize this "preference signal"
            labels.append(torch.tensor([1.0]))

        return DataLoader(TensorDataset(torch.stack(inputs), torch.stack(labels)), batch_size=16)

    def train(self, dataloader, cfg):
        optimizer = torch.optim.Adam(self.value_predictor.parameters(), lr=cfg.get("lr", 1e-4))
        loss_fn = nn.BCEWithLogitsLoss()
        for epoch in range(cfg.get("epochs", 10)):
            for x, y in dataloader:
                preds = self.value_predictor(x)
                loss = loss_fn(preds, y)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

    def train_multidimensional_model(self, contrast_pairs, cfg=None):
        models, tuners = {}, {}
        by_dim = defaultdict(list)
        for pair in contrast_pairs:
            by_dim[pair["dimension"]].append(pair)

        for dim, samples in by_dim.items():
            dataloader = self.prepare_training_data(samples)
            self.train(dataloader, cfg or {})
            models[dim] = self.value_predictor.state_dict()

            tuner = RegressionTuner(dimension=dim, logger=self.logger)
            for s in samples:
                for side in ["a", "b"]:
                    mrq_score = self.value_predictor(self.encoder(
                        self.memory.embedding.get_or_create(s["title"]),
                        self.memory.embedding.get_or_create(s[f"output_{side}"])
                    )).item()
                    tuner.train_single(mrq_score, s[f"value_{side}"])
            tuners[dim] = tuner

        return models, tuners
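
The RegressionTuner used above is not shown in this post. A minimal stand-in that matches how it is called (train_single(mrq_score, llm_score) and save(path)) could fit a simple linear map from MRQ outputs to LLM-style scores; treat this as an assumption about its behaviour, not the actual implementation:

import json

class RegressionTuner:
    """Sketch: fit llm_score ≈ slope * mrq_score + intercept from observed pairs."""
    def __init__(self, dimension, logger=None):
        self.dimension = dimension
        self.logger = logger
        self.pairs = []  # (mrq_score, llm_score)

    def train_single(self, mrq_score, llm_score):
        self.pairs.append((float(mrq_score), float(llm_score)))

    def transform(self, mrq_score):
        if len(self.pairs) < 2:
            return mrq_score
        xs, ys = zip(*self.pairs)
        n = len(xs)
        mean_x, mean_y = sum(xs) / n, sum(ys) / n
        var_x = sum((x - mean_x) ** 2 for x in xs) or 1e-8
        slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
        intercept = mean_y - slope * mean_x
        return slope * mrq_score + intercept

    def save(self, path):
        with open(path, "w") as f:
            json.dump({"dimension": self.dimension, "pairs": self.pairs}, f)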

🔄 Self-Alignment to LLM

An optional feature of the trainer is self-alignment. It can compare new MRQ outputs to the nearest LLM-scored neighbors and adjust using align_with_llm_score. This keeps the model grounded to high-quality feedback while continuously improving.

trainer.align_with_llm_score(dimension, goal, hypothesis, llm_score)

📏 Staying Aligned Over Time

This trainer enables the system to close the loop: it doesn’t just consume LLM scores, it learns from them and gradually becomes capable of performing its own quality judgments. Over time, this reduces reliance on large models and supports scalable, goal-specific document filtering.


🧠 DocumentRewardScorerAgent: Multi-Dimensional Document Evaluation

The DocumentRewardScorerAgent evaluates research documents by scoring them across multiple quality dimensions using trained reward models. It plays a crucial role in your self-improving AI system by assigning structured, learnable feedback to documents, enabling future ranking, filtering, and learning behaviors.

🔍 Purpose

After downloading, parsing, and profiling documents, this agent uses pre-trained reward models to assign scores along defined dimensions (e.g., relevance, clarity, engagement). These scores serve as the reward signal in downstream learning pipelines like MRQ, DPO, or preference tuning.

⚙️ Configuration

The agent loads models and encoders based on the configuration:

dimensions: ["relevance", "clarity", "engagement"]
model_dir: models/document
model_prefix: document_rm_

These models are loaded through a DocumentMRQScorer, which wraps the inference logic for multiple dimensions.

🧬 Workflow

  1. Input:

    • A list of parsed documents (context["documents"]) with title and content.
    • The active goal (context["goal"]["goal_text"]), used as context during scoring.
  2. Scoring: For each document and each dimension, the agent calls the DocumentMRQScorer.score() method, which:

    • Embeds the goal and document.
    • Passes the combined representation through the trained predictor.
    • Returns a scalar score indicating quality.
  3. Output: The agent adds a structured result under its output_key, containing each document’s title, text, and per-dimension scores:

{
  "title": "Self-Improving Agents via Reinforcement",
  "text": "... full document text ...",
  "scores": {
    "relevance": 8.2,
    "clarity": 7.5,
    "engagement": 6.9
  }
}
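
The agent itself is mostly glue around the scorer. A sketch of its run method might look like the following; the DocumentMRQScorer constructor and score() signature shown here are assumptions based on the description above:

class DocumentRewardScorerAgent(BaseAgent):
    def __init__(self, cfg, memory=None, logger=None):
        super().__init__(cfg, memory, logger)
        self.dimensions = cfg.get("dimensions", ["relevance", "clarity", "engagement"])
        # Wraps per-dimension model loading and inference (assumed interface)
        self.scorer = DocumentMRQScorer(
            memory=memory,
            logger=logger,
            model_dir=cfg.get("model_dir", "models/document"),
            model_prefix=cfg.get("model_prefix", "document_rm_"),
            dimensions=self.dimensions,
        )

    async def run(self, context: dict) -> dict:
        goal_text = context["goal"]["goal_text"]
        results = []
        for doc in context.get("documents", []):
            scores = {
                dim: self.scorer.score(goal_text, doc["content"], dimension=dim)
                for dim in self.dimensions
            }
            results.append({"title": doc.get("title"), "text": doc.get("content"), "scores": scores})
        context[self.output_key] = results
        return context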

🔄 Integration

This agent typically runs after:

  • DocumentLoaderAgent (which fetches and converts PDF text),
  • DocumentProfilerAgent (which adds structure and metadata).

And before:

  • DocumentTrainerAgent (which uses scored outputs to generate preference training pairs).

✅ Benefits

  • Enables automated reward model inference for document-level supervision.
  • Provides consistent multi-aspect feedback aligned to goal context.
  • Fully compatible with preference-based learning loops for self-improvement.

⚙️ Part 2: From Building to Improving

The pipeline gives us structure: it ingests documents, scores their utility, and trains custom reward models. But structure isn’t enough. We need intelligence. What happens when our model starts drifting? When new goals arrive? When reality changes? That’s where self-tuning comes in. Our system doesn’t just run once; it watches itself, compares its judgments with a trusted LLM, and retrains whenever confidence drops.

In this section, we’ll show how the system:

  • Monitors model performance over time
  • Validates itself against LLM judges
  • Identifies drift, stagnation, and failure modes
  • Retrains and updates only when trust erodes

This is the core intelligence layer: the part of the system that lets it evolve, adapt, and stay sharp.

    
flowchart TD
    A[New Goal + Documents] --> B[MRQ Scoring Engine]
    B --> C[Scored Document Pairs]
    
    C --> D[SelfValidationEngine]
    D --> E[Validation Stats<br/>Agreement, Matches]
    E --> F[MetaConfidenceTracker]
    F --> G{Confidence Low?}

    G -- Yes --> H[TrainingController]
    H --> I{Cooldown OK?}
    I -- Yes --> J[Retrain MRQ Model]
    J --> B
    I -- No --> K[Wait / Skip Training]

    G -- No --> L[Keep Using Model]

    B --> M[CycleWatcher]
    M --> N{Stuck or Oscillating?}
    N -- Yes --> O[Flag Goal/Dimension for Intervention]
    N -- No --> P[Continue Monitoring]

    F --> Q[StateTracker]
    H --> Q
    B --> Q
    D --> Q

    style A fill:#e3f2fd,stroke:#2196f3,stroke-width:2px
    style B fill:#fff3e0,stroke:#ff9800,stroke-width:2px
    style D fill:#f3e5f5,stroke:#9c27b0,stroke-width:2px
    style F fill:#e8f5e9,stroke:#4caf50,stroke-width:2px
    style H fill:#fbe9e7,stroke:#ff5722,stroke-width:2px
    style M fill:#ede7f6,stroke:#673ab7,stroke-width:2px
    style Q fill:#f0f4c3,stroke:#cddc39,stroke-width:2px
  

🔁 The Self-Tuning Loop Explained

Once the MRQ models are trained, they’re not frozen. The system continually:

  1. Scores new documents using these models (DocumentRewardScorerAgent).
  2. Samples a subset of those scores and compares them with LLM judgments (SelfValidationEngine).
  3. Tracks confidence over time with MetaConfidenceTracker.
  4. Decides whether to retrain using the TrainingController (with cooldowns to avoid thrashing).
  5. Triggers retraining by regenerating contrast pairs and updating models (DocumentTrainerAgent).

This loop runs independently per goal and dimension. Each dimension learns at its own pace, much like a student with separate subjects.
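
Stitched together, one pass of the loop looks roughly like this. The method names on the validator and the build_contrast_pairs helper are assumptions used to keep the sketch short; the real wiring happens through the registry described next.

from random import sample

async def self_tuning_step(goal, dimension, documents, scorer, validator, tracker, controller):
    # 1. Score new documents with the local reward model
    scored = [(doc, scorer.score(goal, doc["content"], dimension=dimension)) for doc in documents]

    # 2. Validate a small sample of those judgments against the LLM
    validation = validator.validate_batch(goal, dimension, sample(scored, k=min(5, len(scored))))

    # 3. Record agreement so confidence is tracked over time
    tracker.update(goal, dimension, validation)

    # 4-5. Retrain only if confidence is low (the controller also enforces cooldowns)
    pairs = build_contrast_pairs(scored)  # hypothetical helper that regenerates contrast pairs
    controller.maybe_train(goal, dimension, pairs)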

🧱 Centralized Intelligence: The Supervisor and the Shared Registry

As the system scales, multiple intelligent components (trackers, controllers, validators) need to coordinate efficiently. To enable this, we introduced a central registry, a lightweight but powerful mechanism for wiring up shared components.

📦 The Registry: Global, Safe, and Explicit

The registry (co_ai/registry/registry.py) is a global key-value store that acts as a service container. It ensures that core tools (like the confidence tracker or validation engine) are:

  • Registered only once (avoiding accidental overwrite),
  • Globally accessible from anywhere in the system,
  • Easily testable and resettable when needed.

# Example: Registering and retrieving a shared component
register("confidence_tracker", tracker)
...
tracker = get("confidence_tracker")
tracker.update(...)

This makes dependency management easy and avoids passing long chains of objects through method calls or agents.
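
The registry itself can be tiny. A sketch along these lines captures the three properties above (register-once, global access, resettable); the actual co_ai/registry/registry.py may differ in details.

_REGISTRY: dict = {}

def register(name: str, component) -> None:
    """Register a shared component exactly once; refuse silent overwrites."""
    if name in _REGISTRY:
        raise ValueError(f"Component '{name}' is already registered")
    _REGISTRY[name] = component

def get(name: str):
    """Fetch a previously registered component from anywhere in the system."""
    return _REGISTRY[name]

def reset() -> None:
    """Clear the registry (useful in tests)."""
    _REGISTRY.clear()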


🧠 The Supervisor: Wiring the Brain Together

The Supervisor class is the entry point to the entire AI system. It boots up all core subsystems, registers them in the global registry, and coordinates their execution. This includes:

  • StateTracker: Keeps track of pipeline progress and state.
  • MetaConfidenceTracker: Monitors model agreement with the LLM.
  • CycleWatcher: Detects when a model is stuck or flip-flopping.
  • TrainingController: Triggers retraining based on confidence drops.
  • SelfValidationEngine: Compares model predictions against LLM supervision.

Here’s how the components are wired:

# Inside Supervisor.__init__()
state_tracker = StateTracker(...)
confidence_tracker = MetaConfidenceTracker(...)
cycle_watcher = CycleWatcher(...)
validator = SelfValidationEngine(...)

training_controller = TrainingController(
    cfg=cfg,
    memory=self.memory,
    logger=self.logger,
    validator=validator,
    tracker=confidence_tracker,
    trainer_fn=trainer_fn,  # user-defined training callback
)

register("state_tracker", state_tracker)
register("confidence_tracker", confidence_tracker)
register("cycle_watcher", cycle_watcher)
register("training_controller", training_controller)
register("self_validation", validator)

Now, anywhere in the pipeline, you can call:

from co_ai.registry.registry import get

controller = get("training_controller")
controller.maybe_train(goal, dimension, pairs)

✅ What Makes This Powerful

The registry-supervisor pattern lets us:

  • Decouple agents from their dependencies,
  • Swap implementations easily during testing or research,
  • Add runtime behavior (like retraining or validation) without hardcoding it into each agent,
  • Maintain system-wide coherence, even as complexity grows.

This setup ensures the system isn’t just intelligent; it’s composable, extensible, and introspective.


🧭 StateTracker: Keeping Tabs on the System’s Learning Journey

In a self-improving system, it’s not enough to just score documents and train models; you need to track the state of every goal and dimension over time. That’s where the StateTracker comes in.

It acts like the memory and metadata hub for each goal. For every evaluation (e.g., scoring, validation, retraining), it records:

  • What happened
  • ⏱️ When it happened
  • 🔄 How many times it’s happened

This allows other components like the TrainingController or CycleWatcher to make safe, informed decisions about when to retrain, freeze learning, or flag problems.


🔑 What It Tracks

For every (goal, dimension) pair, the StateTracker keeps track of:

  • scored: when documents were last scored by the model
  • validated: when LLM validation last occurred
  • trained: when the reward model was last retrained
  • retrain_count: how many times the model has been retrained
  • frozen / active: whether learning is currently enabled or paused
  • metadata: arbitrary tags or notes per goal (e.g., source)

🧠 Continuous Improvement, Built-In

  • Prevents redundant retraining by checking timestamps and cooldowns
  • Enables learning analytics (e.g., which goals are evolving, which are stagnant)
  • Acts as a registry of goals currently being monitored and improved
  • Facilitates lifecycle management from new goal to mature model

🧬 Example Behavior

When a document batch is scored:

state_tracker.update_event("improve_medical_accuracy", "relevance", "scored")

Later, when training completes:

state_tracker.update_event("improve_medical_accuracy", "relevance", "trained")

And at any point, you can retrieve full goal state:

state = state_tracker.get_state("improve_medical_accuracy", "relevance")

Which might return:

{
  "last_scored_at": 1719727812.3,
  "last_trained_at": 1719738912.7,
  "retrain_count": 3,
  "status": "active"
}
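
Internally, the tracker can be as simple as a timestamped dictionary keyed by (goal, dimension). This sketch mirrors the behaviour shown above; the real class presumably persists to memory rather than keeping state in-process.

import time
from collections import defaultdict

class StateTracker:
    def __init__(self):
        self._state = defaultdict(lambda: {"retrain_count": 0, "status": "active"})

    def update_event(self, goal: str, dimension: str, event: str) -> None:
        state = self._state[(goal, dimension)]
        state[f"last_{event}_at"] = time.time()   # e.g. last_scored_at, last_trained_at
        if event == "trained":
            state["retrain_count"] += 1

    def get_state(self, goal: str, dimension: str) -> dict:
        return dict(self._state[(goal, dimension)])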

🧩 How It Fits In

Other modules like TrainingController, MetaConfidenceTracker, and CycleWatcher depend on the StateTracker to answer questions like:

  • “Is this a new goal?”
  • “When was the model last retrained?”
  • “Should we skip training due to cooldown?”
  • “Is this dimension currently active or frozen?”

This lightweight yet essential tool gives your AI system a memory of its own progress, making it more aware, more cautious, and more intelligent over time.


🔁 CycleWatcher: Detecting When the Model Is Stuck or Spinning Its Wheels

Not all model failures are obvious. Sometimes, a model keeps training but doesn’t improve. Or worse, it flip-flops on decisions with each retraining. That’s why we built the CycleWatcher.

This component acts like a thermometer for learning progress. It watches the model’s validation agreement over time and flags patterns like:

  • 🔄 Oscillation: bouncing between different behaviors with no clear trend
  • 🧱 Stagnation: stuck at low agreement, not learning from new data
  • 📈 Healthy learning: stable or improving agreement

📊 How It Works

Each time the system runs LLM validation, the CycleWatcher is notified:

cycle_watcher.record_agreement(goal="ai_alignment", dimension="clarity", agreement=0.82)

It stores a short moving window of agreement scores per goal+dimension and checks:

  • oscillating: recent scores swing up and down without settling
  • stuck: no meaningful improvement for a configured number of steps
  • ok: scores are trending up or stable above a confidence threshold

🔍 Example Usage

status = cycle_watcher.status("ai_alignment", "clarity")
if status == "oscillating":
    logger.warning("Clarity scoring for 'ai_alignment' is oscillating. Consider intervention.")
elif status == "stuck":
    logger.info("No learning detected will refresh document pool.")

This gives the system a diagnostic reflex: a way to self-assess not just what it’s learning, but how well.
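
A minimal version of that pattern detection might look like this; the window size and thresholds are assumptions, and the real component would read them from config.

from collections import defaultdict, deque

class CycleWatcher:
    def __init__(self, window=6, improve_eps=0.02, stuck_threshold=0.6):
        self.history = defaultdict(lambda: deque(maxlen=window))
        self.improve_eps = improve_eps          # minimum spread that counts as movement
        self.stuck_threshold = stuck_threshold  # agreement below this is "not good enough"

    def record_agreement(self, goal, dimension, agreement):
        self.history[(goal, dimension)].append(agreement)

    def status(self, goal, dimension):
        window = self.history[(goal, dimension)]
        scores = list(window)
        if len(scores) < window.maxlen:
            return "ok"  # not enough history to judge yet
        deltas = [b - a for a, b in zip(scores, scores[1:])]
        sign_flips = sum(1 for d1, d2 in zip(deltas, deltas[1:]) if d1 * d2 < 0)
        if sign_flips >= len(deltas) - 1:
            return "oscillating"   # direction changes on nearly every step
        if max(scores) - min(scores) < self.improve_eps and scores[-1] < self.stuck_threshold:
            return "stuck"         # flat and below the confidence bar
        return "ok"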

🧠 Why It Matters

  • Avoids wasted retraining cycles
  • Helps surface noisy or low-signal goals
  • Gives you insight into the maturity of each goal
  • Supports automatic interventions (e.g., adding new documents or freezing training)

🧩 System Role

CycleWatcher works closely with:

  • MetaConfidenceTracker: Uses agreement scores to determine model confidence
  • 🛑 TrainingController: May defer training if the cycle is unhealthy
  • 🧭 StateTracker: Updates state when cycle issues are flagged

Together, these components make your system resilient: able to spot when it’s learning poorly and adjust course automatically.


📈 MetaConfidenceTracker: Monitoring Trust in Each Model Over Time

Your reward model might start strong, but over time it could drift, degrade, or simply face harder examples. That’s where the MetaConfidenceTracker comes in: it’s the memory of model trustworthiness.

This component tracks how often each reward model agrees with the LLM, for each goal and each scoring dimension.

🎯 Self-Tuning in Action

Every time a batch of document pairs is validated using the SelfValidationEngine, we pass the agreement score to the MetaConfidenceTracker:

tracker.update("ai_alignment", "clarity", validation_result)

It stores:

  • Agreement %: the recent validation score
  • 📆 Timestamps: when last validated or updated
  • 🔁 Trend history: optional, for plotting improvement or decline

Then, we can ask:

if tracker.should_retrain("ai_alignment", "clarity"):
    print("Triggering retraining due to low confidence.")

🧠 Why It’s Smart

This tracker creates goal- and dimension-specific trust scores. That means:

  • A model scoring “relevance” for “robot ethics” might be high-confidence
  • But the same model scoring “engagement” for “computational biology” might be flagged for retraining

It’s all localized and contextual, just as human expertise varies by topic.

⚙️ System Role

The MetaConfidenceTracker drives data-aware training loops by:

  • Signaling the TrainingController to retrain a model if agreement drops below a threshold
  • Coordinating with CycleWatcher to confirm issues aren’t transient
  • Recording metadata in StateTracker to track retrain events

🔐 Safety and Control

You can configure:

  • agreement_threshold: When to flag low confidence
  • min_validation_count: Don’t retrain on one bad batch
  • retrain_cooldown: Prevents retraining too frequently
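
Putting those knobs together, a stripped-down version of the tracker might look like the sketch below. The threshold values are illustrative, and validation_result is assumed to be the report returned by the SelfValidationEngine, carrying an "agreement" field:

import time
from collections import defaultdict

class MetaConfidenceTracker:
    """Minimal sketch: per (goal, dimension) trust score with retrain gating."""

    def __init__(self, agreement_threshold=0.8, min_validation_count=3, retrain_cooldown=3600):
        self.agreement_threshold = agreement_threshold
        self.min_validation_count = min_validation_count
        self.retrain_cooldown = retrain_cooldown          # seconds to wait between retrains
        self.scores = defaultdict(list)                   # (goal, dim) -> agreement history
        self.last_retrain = {}                            # (goal, dim) -> timestamp

    def update(self, goal, dimension, validation_result):
        self.scores[(goal, dimension)].append(validation_result["agreement"])

    def get_confidence(self, goal, dimension):
        recent = self.scores[(goal, dimension)][-self.min_validation_count:]
        return sum(recent) / max(len(recent), 1)

    def should_retrain(self, goal, dimension):
        history = self.scores[(goal, dimension)]
        if len(history) < self.min_validation_count:
            return False                                  # don't react to a single bad batch
        if time.time() - self.last_retrain.get((goal, dimension), 0) < self.retrain_cooldown:
            return False                                  # still inside the cooldown window
        return self.get_confidence(goal, dimension) < self.agreement_threshold

    def reset_cooldown(self, goal, dimension):
        # Called by the TrainingController right after a retrain completes
        self.last_retrain[(goal, dimension)] = time.time()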

📝 Recap

  • update(goal, dim, result): Stores the latest agreement % for the model on a goal + dimension
  • should_retrain(...): Returns True if agreement is below the confidence threshold
  • get_confidence(...): Returns the current trust score for a goal + dimension

This tracker ensures your AI isn’t just improving; it knows when and where it’s improving.


🛠️ TrainingController: Retrain If Confidence Falls

The TrainingController is the decision maker behind every retraining event in your system. It doesn’t just fire off training runs blindly; it listens to signals from the validation system, checks cooldowns, and ensures that retraining happens only when justified.

🚦 What It Does

Whenever validation results come in, the controller evaluates:

  1. Is confidence low? (via MetaConfidenceTracker)
  2. Has enough time passed since last training? (cooldown logic)
  3. Is there fresh training data available?

If all conditions are met, it triggers a model retrain for a specific goal and dimension.

🧠 Decision Logic

class TrainingController:
    def maybe_train(self, goal: str, dimension: str, pairs: list):
        # Ask the confidence tracker whether trust has dropped; its should_retrain
        # check also enforces the cooldown and minimum-validation-count rules.
        if self.tracker.should_retrain(goal, dimension):
            self.trainer_fn(goal, dimension, pairs)          # run the pluggable trainer
            self.tracker.reset_cooldown(goal, dimension)     # restart the cooldown window

You can plug in any trainer_fn you like, which keeps the controller modular. For example:

def trainer_fn(goal, dimension, pairs):
    trainer = DocumentMRQTrainer(...)
    trainer.train_single_dimension(goal, dimension, pairs)

🧰 What It Tracks

  • Confidence score: From the validation engine
  • 🕒 Last training time: Stored in StateTracker
  • 🔁 Cooldown window: Prevents thrashing the model with too-frequent updates
  • 🔒 Manual freeze status: You can freeze dimensions to block retraining temporarily

⚙️ System Integration

  • MetaConfidenceTracker: Provides the signal on model reliability
  • StateTracker: Records when retraining happened
  • SelfValidationEngine: Validates current model performance
  • trainer_fn: Executes the retraining
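
In practice these dependencies meet in a short feedback loop. The sketch below shows one plausible wiring using the component APIs described in this post; validation_engine, tracker, cycle_watcher, controller, and state_tracker are assumed to be already-constructed instances:

# One pass of the self-tuning loop for a single goal and dimension (sketch).
def self_tuning_step(goal, dimension, pairs):
    # 1. Audit a sample of the reward model's judgments against the LLM
    result = validation_engine.validate_batch(goal, pairs, dimension=dimension)
    state_tracker.update_event(goal, dimension, "validated")

    # 2. Fold the agreement score into the long-term confidence and cycle records
    tracker.update(goal, dimension, result)
    cycle_watcher.record_agreement(goal=goal, dimension=dimension, agreement=result["agreement"])

    # 3. Retrain only if confidence is low and the learning cycle looks healthy
    if cycle_watcher.status(goal, dimension) != "oscillating":
        controller.maybe_train(goal, dimension, pairs)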

🧭 Learning What Works, Forgetting What Doesn’t

Without this controller, you risk:

  • Overfitting by retraining on every dip in performance
  • Underfitting by never retraining models even when they degrade
  • Wasted compute from redundant updates

The TrainingController gives your AI system the discipline to wait, watch, and act only when necessary, just like a human researcher revising their beliefs after seeing enough contradictory evidence.


SelfValidationEngine: Are We Still Aligned?

No matter how good a model is, it can drift. That’s why every self-improving system needs a reality check.

The SelfValidationEngine is that check.

It samples a fraction of your document comparisons (those judged by your local reward model) and asks a trusted LLM to weigh in. If the model and the LLM agree, that’s a good sign. If they start to diverge, it’s time to worry, and probably time to retrain.

🎯 How the AI Improves Itself

  1. Samples Pairs: From a batch of document comparisons, it randomly selects a subset (e.g., 5%) for validation.
  2. Asks the Model: For each sampled pair, it calls your reward model to decide: “Which of these two documents better satisfies the goal?”
  3. Asks the LLM: It then asks a fallback LLM (like GPT-4 or Qwen3) the same question.
  4. Compares: If the model and LLM choose the same document, that’s a match. If not, it’s a miss.
  5. Logs & Saves: It tracks validation statistics (total checked, agreement rate, mismatches) and logs everything to memory for auditing or for triggering retraining.

🧬 Code Walkthrough

import random

class SelfValidationEngine:
    def __init__(self, reward_model, llm_judge):
        self.reward_model = reward_model
        self.llm_judge = llm_judge
        self.validation_sample_rate = 0.05  # 5% of pairs get audited

    def validate_batch(self, goal, pairs, dimension=None):
        # Randomly audit a small sample of the scored pairs
        sample = [p for p in pairs if random.random() < self.validation_sample_rate]

        for doc_a, doc_b in sample:          # each pair is assumed to be (doc_a, doc_b)
            model_pref = self.reward_model(goal, doc_a, doc_b)
            llm_pref = self.llm_judge(goal, doc_a, doc_b)
            match = model_pref == llm_pref
            ...

The result is a report like:

{
  "validated": 20,
  "matches": 17,
  "agreement": 0.85
}

🔒 Reliability Through Self-Correction

Think of this as unit testing for your model’s behavior:

  • ✅ Validates predictions: Ensures the model still agrees with an external oracle
  • 📉 Detects drift: If agreement drops, the model might be losing reliability
  • 🔁 Triggers retraining: Feeds into the TrainingController to kick off updates
  • 🧠 Informs meta-learning: Helps MetaConfidenceTracker spot trends in model performance

🌐 Example in Context

Let’s say your system is working on the goal: “Find the most innovative climate tech startups”.

Over time, it has trained a local reward model to judge articles and reports. But suddenly, validation shows agreement with the LLM has dropped from 92% to 71%. The SelfValidationEngine catches this and logs it. That triggers a review:

  • Is the model overfitting?
  • Has the data changed?
  • Should we retrain?

With this mechanism, your AI isn’t just learning; it’s learning to self-audit.


Putting the walkthrough together, the full validation pass looks roughly like this:

class SelfValidationEngine:
    def __init__(self, cfg, memory, logger, reward_model, llm_judge):
        # Takes in a config, memory store, logger, reward model, and fallback LLM judge
        ...

    def validate_batch(self, goal, pairs, dimension=None):
        # Randomly sample a subset of document pairs
        sample = [pair for pair in pairs if random.random() < self.validation_sample_rate]

        logs, matches = [], 0
        for doc_a, doc_b in sample:
            model_pref = self.reward_model(goal, doc_a, doc_b)
            llm_pref = self.llm_judge(goal, doc_a, doc_b)
            is_match = model_pref == llm_pref
            matches += int(is_match)
            # the real engine also logs truncated document text for auditing
            logs.append({"goal": goal, "dimension": dimension, "model_pref": model_pref,
                         "llm_pref": llm_pref, "match": is_match})

        validated = len(sample)
        agreement = matches / validated if validated else 0.0
        self.memory.save("self_validation", {"validated": validated, "matches": matches, "agreement": agreement})
        return {"validated": validated, "matches": matches, "agreement": agreement, "logs": logs}

✨ Building a Self-Tuning AI That Learns from the Web

Imagine giving an AI a goal, say, “Should I invest in Tesla?” The goal itself is deceptively simple, but the process the AI must undertake to answer it is anything but.

The AI begins by interpreting the goal and translating it into a search strategy. It scans the internet (news sites, financial reports, forums, YouTube videos, and more) to find documents that could help it make a decision. This is not just search. This is targeted retrieval, powered by goal-aware filters and rival-ranking: each piece of data is judged by how well it competes in usefulness against others.

1. Filter and Rank Incoming Data

Each document pulled in goes through a comparative evaluation. The AI uses a language model (LLM) to judge preference between candidate pairs. For example:

“Which of these two reports better supports the decision to buy Tesla?”

These judgments generate training signals. The AI doesn’t just take data at face value; it scores, compares, and filters, as sketched below.
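
As a rough sketch, that comparative step can be as simple as prompting the LLM with both candidates and recording which one it prefers. The prompt wording, the llm() callable, and the naive pairing below are illustrative placeholders, not the system’s actual implementation:

# Illustrative only: turn LLM pairwise judgments into preference pairs.
# `llm` is a placeholder for whatever chat/completion client you use, and
# documents are assumed to be plain text strings.
def judge_pair(llm, goal, doc_a, doc_b):
    prompt = (
        f"Goal: {goal}\n\n"
        f"Document A:\n{doc_a[:2000]}\n\n"
        f"Document B:\n{doc_b[:2000]}\n\n"
        "Which document better supports the goal? Answer with exactly 'A' or 'B'."
    )
    answer = llm(prompt).strip().upper()
    return "a" if answer.startswith("A") else "b"

def build_preference_pairs(llm, goal, documents):
    pairs = []
    for doc_a, doc_b in zip(documents[::2], documents[1::2]):   # naive pairing, for illustration
        preferred = judge_pair(llm, goal, doc_a, doc_b)
        pairs.append({"doc_a": doc_a, "doc_b": doc_b, "preferred": preferred})
    return pairs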

2. Train Itself with Preference Modeling (MRQ)

Using the preference signals, the AI trains a Multidimensional Ranking + Quantification (MRQ) model. This model learns to emulate the LLM’s decisions, essentially distilling expensive, high-quality LLM judgments into a fast, local model.

Over time, the MRQ model becomes the AI’s primary decision engine: cheaper, faster, and fine-tuned to the goal.
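
Under the hood, distilling those preferences typically reduces to a pairwise loss: the preferred document should receive a higher score than the rejected one. The PyTorch sketch below shows one standard way to do that (a Bradley-Terry style logistic loss); it is a simplified stand-in for the actual MRQ trainer, reuses the pair format from the sketch in step 1, and assumes an embed() helper that turns a document into an embedding tensor:

import torch
import torch.nn as nn

class PairwiseScorer(nn.Module):
    """Tiny stand-in for a per-dimension MRQ head: maps a document embedding to a scalar score."""

    def __init__(self, dim=384):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, x):
        return self.net(x).squeeze(-1)

def train_on_pairs(model, optimizer, pairs, embed):
    """One pass of pairwise preference training: the preferred doc should score higher."""
    model.train()
    for pair in pairs:
        chosen = pair["doc_a"] if pair["preferred"] == "a" else pair["doc_b"]
        rejected = pair["doc_b"] if pair["preferred"] == "a" else pair["doc_a"]
        score_chosen = model(embed(chosen))        # embed() is an assumed helper returning a tensor
        score_rejected = model(embed(rejected))
        # Bradley-Terry / logistic preference loss: push the chosen score above the rejected one
        loss = -torch.nn.functional.logsigmoid(score_chosen - score_rejected).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

Because the loss depends only on score differences, the model learns a ranking rather than absolute scores, which matches how the preference pairs were generated.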

3. Validate and Self-Correct

This isn’t a one-shot training. The system constantly runs self-validation, comparing the MRQ model’s predictions with fresh LLM judgments on new or difficult pairs. If the model drifts, confidence drops, or agreement oscillates, the system knows to retrain.

This is where self-awareness kicks in. The AI tracks its confidence, learning curves, and model reliability over time using tools like:

  • MetaConfidenceTracker
  • CycleWatcher
  • SelfValidationEngine

4. Extend and Iterate

When the AI is “ready” on a goal (e.g. it can explain and justify its Tesla decision), you introduce new, related goals: “How does Tesla compare to NIO?” or “Which company leads in EV battery tech?”

Each new goal introduces fresh data and creates another self-contained learning loop. The system keeps refining not just using new data, but improving its ability to evaluate, rank, and learn from that data.

5. Continuous Self-Tuning from the Web

This is not just a static LLM pipeline. This is an active learner. It’s always scanning, always challenging itself with new data, always improving its own model of the world.

And because it uses rival-based ranking and model self-validation, it doesn’t need ground-truth labels. It constructs its own signal, a powerful step toward autonomous, scalable AI reasoning.


📈 Case Study: Self-Improvement in Action

Let’s say the AI is asked: “Which recent papers best explain alignment challenges in RLHF?”

  1. It pulls 120 candidate papers from ArXiv.
  2. It filters and ranks them using an LLM-based comparator.
  3. After 2 days, its MRQ model achieves 92% agreement with the LLM on sampled comparisons.
  4. But over time, its agreement score on the “clarity” dimension drops below 80%.
  5. The system retrains its clarity model using new pairs and fresh LLM judgments and recovers to 89%.
  6. The AI now scores faster, and its top 5 suggestions include two papers it originally ignored.

This demonstrates the power of a feedback-driven loop that adapts as knowledge evolves.


🔚 Conclusion: An Always-On Intelligence Loop

Over this deep dive, we engineered an AI that doesn’t just execute; it evolves.
Here’s what we conquered step-by-step, and why it changes everything:

🔧 Step 1: From Static to Dynamic Intelligence

We shattered the “train once, deploy forever” paradigm. Starting with goal-driven research (e.g., “Build AI that self-improves at complex tasks”), we:

  • Searched Arxiv with semantic keyword extraction
  • Scored papers across five-plus configurable dimensions (novelty, relevance, clarity, and more)
  • Trained MRQ models to replace costly LLM judgments

Why it counts: Turns vague goals into self-updating knowledge engines.

🔄 Step 2: The Self-Tuning Loop

We gave our AI a conscience through:

  1. Validation Engine: Auditing model vs. GPT-4 judgments (5% samples)
  2. Confidence Tracking: Monitoring agreement decay with MetaConfidenceTracker
  3. Auto-Retraining: Triggering updates only when trust erodes

Why it counts: Systems that self-correct > systems that decay.

🧩 Step 3: The Intelligent Core

We wired together autonomous agents like:

  • CycleWatcher to detect learning stagnation
  • StateTracker to manage goal lifecycles
  • TrainingController to enforce disciplined retraining

Why it counts: Modular intelligence > monolithic models.

🌟 Why This Isn’t Just Another Pipeline

We moved beyond automation to autonomous evolution:

Traditional AI → This System

  • Fixed knowledge → Web-fed learning
  • Manual retraining → Self-triggered updates
  • Black-box decisions → Auditable validation logs
  • One-size-fits-all → Goal-specialized RMs

🚀 Where Do We Go From Here?

You’ve now built an AI that:

  1. Learns from open-source knowledge (Arxiv → web)
  2. Validates its own reasoning
  3. Retrains itself when confidence drops
  4. Evolves with your goals

This is the foundation for truly adaptive AI. Imagine extending RIVAL to:

  • Real-time market analysis (tracking crypto/news)
  • Medical research synthesis (updating with new trials)
  • Self-optimizing code generation

You won’t have to imagine it; you’ll see it right here.

“The future belongs not to the strongest AI, but to the most adaptable.”


📚 References

  1. Achiam, J. et al. (2023). GPT-4 Technical Report. arXiv:2303.08774.

    • Cited for foundational insights into LLM capabilities and evaluation frameworks used in self-improving systems.
  2. Ahmed, A. M. et al. (2024). Scalable Ensembling for Mitigating Reward Overoptimization. arXiv:2406.01013.

    • Addresses challenges in reward hacking and overfitting, relevant to the SelfValidationEngine and MetaConfidenceTracker components.
  3. Pan, A., Bhatia, K., & Steinhardt, J. (2022). The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models. arXiv:2201.03544.

    • Discusses reward model alignment issues addressed by the adversarial training loop in RIVAL.
  4. Stiennon, N. et al. (2020). Learning to Summarize with Human Feedback. Advances in Neural Information Processing Systems, 33.

    • Influenced the design of preference pair generation and LLM-based validation in the DocumentTrainerAgent.
  5. Tan, S. & Monz, C. (2025). REMEDY: Learning Machine Translation Evaluation from Human Preferences with Reward Modeling.

    • Directly informs the integration of qualitative (preference pairs) and quantitative (BLEU/COMET) rewards in the MRQ Scoring Engine.
  6. RIVAL Framework (2025). Reinforcement Learning with Iterative and Adversarial Optimization. arXiv:2506.05070v1.

    • Core methodology adapted for self-improving AI systems described in the blog post.

📘 Glossary

  • LLM (Large Language Model): A machine learning model trained on massive text datasets to understand and generate human-like language. Used here for initial document evaluations and fallback validation.
  • MRQ (Model-Reward-Quality): A scoring and training framework that learns to predict quality across multiple dimensions (e.g., relevance, clarity, engagement) from pairwise document preferences.
  • Goal: A high-level task or objective the AI is optimizing for (e.g., “Evaluate Tesla investment”). Guides document retrieval, scoring, and training.
  • Reward Model: A model trained to predict which document better satisfies a given goal. Initially mimics an LLM, then operates independently.
  • SelfValidationEngine: A module that compares the reward model’s decisions against trusted LLM judgments to detect drift and measure correctness.
  • MetaConfidenceTracker: Tracks model vs LLM agreement over time, enabling automatic confidence monitoring and retraining decisions.
  • TrainingController: Oversees whether and when to retrain models based on validation scores, cooldowns, and thresholds.
  • CycleWatcher: Detects learning issues like oscillation or stagnation in model confidence. Helps prevent wasted training cycles.
  • Contrast Pair / Preference Pair: A pair of documents (A vs B) where one is preferred over the other. Used to train and validate reward models.
  • Dimension: A scoring axis such as “relevance,” “clarity,” or “insight.” MRQ models are trained per-dimension.
  • Supervisor: The central controller that wires together the self-improving pipeline: registering agents, managing state, and coordinating retraining logic.
  • StateTracker: Maintains metadata about recent events (e.g., last scored, trained, validated) for each goal and dimension.
  • RIVAL: An acronym describing the system architecture: Reinforcement learning with Iterative and adVersarial optimization. Represents the closed loop of improvement.
  • LLM Judge: The fallback, trusted judgment from a large language model used to validate the predictions of local reward models.
  • Cooldown: A time-based guard to prevent too-frequent retraining of reward models. Managed by the TrainingController.