Document Intelligence: Turning Documents into Structured Knowledge


Summary

Imagine drowning in a sea of research papers, each holding a fragment of the knowledge you need for your next breakthrough. How does an AI system, striving for self-improvement, navigate this information overload to find precisely what it needs? This is the core challenge our Document Intelligence pipeline addresses, transforming chaotic documents into organized, searchable knowledge.

In this post we combine insights from Paper2Poster: Towards Multimodal Poster Automation from Scientific Papers and Domain2Vec: Vectorizing Datasets to Find the Optimal Data Mixture without Training to build an AI document profiler that transforms unstructured papers into structured, searchable knowledge graphs.


What Problem Are We Solving?

A self-improving AI must continually adapt to new goals by acquiring relevant, high-quality knowledge. In the face of overwhelming research output, the system needs to filter and prioritize documents that align with its evolving objectives.

This framework enables that adaptation by embedding documents and their sections, classifying them into semantic domains, and scoring them across key dimensions like relevance, novelty, and clarity. The result is a dynamic, goal-aware knowledge base that supports smarter tuning, hypothesis generation, and self-directed improvement.


The Knowledge Ingestion Pipeline: An Overview

    flowchart LR
  A[SurveyAgent<br/>Find goal-related seed papers]:::highlighted
  B[SearchOrchestratorAgent<br/>Expand with related papers]
  C[DocumentLoaderAgent<br/>Download & extract text]
  D[DocumentProfilerAgent<br/>Enrich, embed, and segment]
  E[PaperScoreAgent<br/>Rate for novelty, relevance, etc.]
  F[KnowledgeLoaderAgent<br/>Store as structured knowledge]

  A --> B --> C --> D --> E --> F

  classDef highlighted fill:#ffebcc,stroke:#ffaa00,stroke-width:2px;
  

To turn raw research papers into structured, searchable knowledge, we built a custom pipeline composed of intelligent agents, each with a focused role. This modular design allows for scalable and interpretable processing, while enabling dynamic tuning and extension as new needs arise.

Here's a quick walkthrough of the pipeline stages we use:


pipeline:
  name: papers
  tag: "huggingface_related_papers import"
  description: "Import papers score and save them"
  stages:
     - name: survey
       cls: co_ai.agents.knowledge.survey.SurveyAgent
       enabled: true
       iterations: 1
     - name: search_orchestrator
       cls: co_ai.agents.knowledge.search_orchestrator.SearchOrchestratorAgent
       enabled: true
       iterations: 1
     - name: document_loader
       cls: co_ai.agents.knowledge.document_loader.DocumentLoaderAgent
       enabled: true
       iterations: 1
     - name: document_profiler
       cls: co_ai.agents.knowledge.document_profiler.DocumentProfilerAgent
       enabled: true
       iterations: 1
     - name: paper_score
       cls: co_ai.agents.knowledge.paper_score.PaperScoreAgent
       enabled: true
       iterations: 1
     - name: knowledge_loader
       cls: co_ai.agents.knowledge.knowledge_loader.KnowledgeLoaderAgent
       enabled: true
       iterations: 1

1. SurveyAgent

The pipeline begins with the SurveyAgent, which receives a high-level research goal (e.g., "Find papers related to self-correcting AI") and translates it into concrete search targets. This might include seed topics, keywords, or subdomains.

2. SearchOrchestratorAgent

Once the survey is complete, the SearchOrchestratorAgent coordinates multiple document retrieval strategies, including direct web search, arXiv lookups, and Hugging Face dataset scraping. It standardizes the results and prepares them for downstream ingestion.

3. DocumentLoaderAgent

This agent is responsible for downloading, parsing, and storing raw research papers. It supports:

  • PDF ingestion and text extraction
  • Optional summarization via LLMs
  • Title guessing and metadata refinement
  • Domain classification using embeddings and a seed-based classifier

4. DocumentProfilerAgent

Each document is then split into sections (e.g., Abstract, Methods, Results) and analyzed in detail. The DocumentProfilerAgent computes section-level embeddings, tags domains, and stores structured knowledge in a fine-grained format.

5. PaperScoreAgent

With the documents structured, the PaperScoreAgent evaluates each one along key dimensions like relevance, clarity, and novelty. These scores can guide downstream reasoning agents and training data selection.

6. KnowledgeLoaderAgent

Finally, the KnowledgeLoaderAgent selects and loads the most relevant sections or documents into memory optimized for the current goal. It acts as a smart filter, ensuring that only the most valuable information is passed to reasoning or generation components.


SurveyAgent: Generating Search-Driven Subgoals

Its role is to generate adaptive search queries that break a broad research goal into more actionable subtopics. Given a high-level objective, the SurveyAgent deconstructs it into multiple keyword-based queries, using prompt-driven reasoning to capture different angles (e.g., novelty, feasibility). These queries can then be passed to downstream components like the document retriever or orchestrator for targeted literature exploration.

goal:
  goal_text: "https://arxiv.org/pdf/2503.00735"
  goal_type: "similar_papers"
  focus_area: "AI research"

The similar_papers goal type short-circuits web search. Instead of querying external databases, we use a Hugging Face Space, librarian-bots/recommend_similar_papers, to directly generate a list of relevant papers from the seed paper given in the goal.


Using AI to Find Papers About AI: The Hugging Face Paper Recommender

    flowchart LR
  A[SurveyAgent<br/>Find goal-related seed papers]
  B[SearchOrchestratorAgent<br/>Expand with related papers]:::highlighted
  C[DocumentLoaderAgent<br/>Download & extract text]
  D[DocumentProfilerAgent<br/>Enrich, embed, and segment]
  E[PaperScoreAgent<br/>Rate for novelty, relevance, etc.]
  F[KnowledgeLoaderAgent<br/>Store as structured knowledge]

  A --> B --> C --> D --> E --> F

  classDef highlighted fill:#ffebcc,stroke:#ffaa00,stroke-width:2px;
  

One of the first challenges in building an AI that learns from research is finding the right research to learn from. Instead of relying solely on keyword search or manual curation, we decided to integrate an AI-powered paper recommendation system right into our pipeline.

Enter the Hugging Face Space librarian-bots/recommend_similar_papers, an open-access tool that takes a research paper URL and returns a set of similar papers using LLM-based document understanding and embedding matching.

Here's how we integrated it:

from gradio_client import Client

def recommend_similar_papers(paper_url: str) -> list[dict]:
    client = Client("librarian-bots/recommend_similar_papers")
    result = client.predict(paper_url, None, False, api_name="/predict")
    ...

We also added a caching layer to avoid repeated requests and support fast iterations:

import hashlib
from pathlib import Path

CACHE_DIR = Path(".paper_cache")

def _get_cache_path(paper_url: str) -> Path:
    key = hashlib.md5(paper_url.encode()).hexdigest()
    return CACHE_DIR / f"{key}.pkl"
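
A minimal sketch of how the cache can wrap the recommender call; the _load_cached and _save_cache helpers here are illustrative, not the exact implementation:

import pickle

def _load_cached(paper_url: str):
    # Hypothetical helper: return cached results if present, else None.
    path = _get_cache_path(paper_url)
    return pickle.loads(path.read_bytes()) if path.exists() else None

def _save_cache(paper_url: str, results) -> None:
    # Hypothetical helper: persist results for future runs.
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    _get_cache_path(paper_url).write_bytes(pickle.dumps(results))

paper_url = "https://arxiv.org/pdf/2503.00735"
results = _load_cached(paper_url)
if results is None:
    results = recommend_similar_papers(paper_url)
    _save_cache(paper_url, results)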

The result? When a seed paper is submitted (e.g., an arXiv PDF), our system fetches topically related papers automatically and formats them for ingestion:

{
  "source": "recommend_similar_papers",
  "url": "https://arxiv.org/pdf/2506.10952.pdf",
  "title": "2506.10952",
  "summary": "Not yet processed"
}

This tool forms a core input source for the knowledge pipeline. It ensures that:

  • We start from high-quality seed papers,
  • Each paper can dynamically pull in its “literature neighborhood”,
  • Downstream agents can reason over clusters of ideas, not just isolated works.

At this stage we have a list of results in the following format.

The goal below searches for similar papers; this is the process we are explaining in this post.

goal:
  goal_text: "https://arxiv.org/pdf/2503.00735"
  goal_type: "similar_papers"

This will generate 10 results; a sample of the returned URLs:

https://arxiv.org/pdf/2505.01441.pdf
https://arxiv.org/pdf/2505.14147.pdf
https://arxiv.org/pdf/2504.15900.pdf
https://arxiv.org/pdf/2505.14652.pdf
https://arxiv.org/pdf/2505.08364.pdf
https://arxiv.org/pdf/2506.01369.pdf

Once we've identified which papers to explore, the next step is to pull them into our system and convert them into structured, searchable knowledge.


Downloading and Ingesting Research Papers

    flowchart LR
  A[SurveyAgent<br/>Find goal-related seed papers]
  B[SearchOrchestratorAgent<br/>Expand with related papers]
  C[DocumentLoaderAgent<br/>Download & extract text]:::highlighted
  D[DocumentProfilerAgent<br/>Enrich, embed, and segment]
  E[PaperScoreAgent<br/>Rate for novelty, relevance, etc.]
  F[KnowledgeLoaderAgent<br/>Store as structured knowledge]

  A --> B --> C --> D --> E --> F

  classDef highlighted fill:#ffebcc,stroke:#ffaa00,stroke-width:2px;
  

Our ingestion pipeline begins by downloading the paper's PDF directly from its URL. This is done in a lightweight batch step: no browser or crawler needed. We simply fetch, store temporarily, and then extract the full text using our PDFConverter tool.

response = requests.get(url)
with open(f"{self.download_directory}/{title}.pdf", "wb") as f:
    f.write(response.content)

text = PDFConverter.pdf_to_text(f"{self.download_directory}/{title}.pdf")

from pathlib import Path
from typing import Union

from pdfminer.high_level import extract_text
from pdfminer.pdfparser import PDFSyntaxError

class PDFConverter:
    ...
    def pdf_to_text(file_path: Union[str, Path]) -> str:
        ...
        try:
            text = extract_text(str(file_path))
            return text.strip()
        except PDFSyntaxError as e:
            ...

With the raw text extracted, we enrich it with high-quality metadata. Rather than depending solely on LLMs to generate titles or summaries, which can hallucinate or miss the author's intent, we query the arXiv API (via our internal fetch_arxiv_metadata tool) to retrieve the official title and abstract when available.

import requests
import xml.etree.ElementTree as ET

def fetch_arxiv_metadata(arxiv_id: str) -> dict:
    """
    Query the arXiv API and return metadata for a given arXiv ID.

    Args:
        arxiv_id (str): e.g., "2505.19590"

    Returns:
        dict: {
            'title': str,
            'summary': str,
            'authors': list[str],
            'published': str (ISO format),
            'url': str
        }
    """
    url = f"https://export.arxiv.org/api/query?id_list={arxiv_id}"
    response = requests.get(url)

    if response.status_code != 200:
        raise ValueError(f"arXiv API request failed with {response.status_code}")

    root = ET.fromstring(response.text)
    ns = {"atom": "http://www.w3.org/2005/Atom"}
    entry = root.find("atom:entry", ns)

    if entry is None:
        raise ValueError(f"No entry found for arXiv ID {arxiv_id}")

    title = entry.find("atom:title", ns).text.strip().replace("\n", " ")
    summary = entry.find("atom:summary", ns).text.strip().replace("\n", " ")
    authors = [
        author.find("atom:name", ns).text 
            for author in entry.findall("atom:author", ns)
    ]
    published = entry.find("atom:published", ns).text
    pdf_url = entry.find("atom:id", ns).text

    return {
        "title": title,
        "summary": summary,
        "authors": authors,
        "published": published,
        "url": pdf_url,
    }
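
Since the loader works from PDF URLs, the arXiv ID has to be pulled out of the URL before calling fetch_arxiv_metadata. A minimal sketch; the helper name and regex are ours, for illustration:

import re

def extract_arxiv_id(url: str):
    # Matches modern arXiv IDs such as "2503.00735" in abs/ or pdf/ URLs.
    match = re.search(r"arxiv\.org/(?:abs|pdf)/(\d{4}\.\d{4,5})", url)
    return match.group(1) if match else None

metadata = fetch_arxiv_metadata(extract_arxiv_id("https://arxiv.org/pdf/2503.00735"))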

Why use arXiv's metadata?

We've found the titles and summaries from arXiv to be more reliable and consistent than those generated by LLMs, especially for scientific documents. These curated details help anchor the rest of the pipeline in trustworthy metadata.

Finally, we embed the document using our memory store. The embedding combines the title and summary, providing a rich semantic fingerprint of the paper:

embed_text = f"{title}\n\n{summary}"
self.memory.embedding.get_or_create(embed_text)

These embeddings allow us to later search, cluster, and rank papers based on similarity, even across different pipelines or research goals.
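
As a rough illustration of what those embeddings enable, ranking stored papers against a query might look like this (variable names are illustrative; the real system queries its vector store rather than looping in Python):

from sklearn.metrics.pairwise import cosine_similarity

def rank_papers(query_text: str, papers: list[dict], memory, top_k: int = 5):
    # papers: dicts with "title" and "summary"; vectors come from the same embedding store.
    query_vec = memory.embedding.get_or_create(query_text)
    scored = []
    for paper in papers:
        vec = memory.embedding.get_or_create(f"{paper['title']}\n\n{paper['summary']}")
        scored.append((float(cosine_similarity([query_vec], [vec])[0][0]), paper["title"]))
    return sorted(scored, reverse=True)[:top_k]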

At this stage, each paper is:

  • Parsed into clean text,
  • Enriched with arXiv metadata (or optionally LLM-generated fallback),
  • Embedded for retrieval,
  • Stored in the database for downstream profiling and scoring.

At this stage each paper is represented by a document record:

class DocumentORM(Base):
    __tablename__ = "documents"

    id = Column(Integer, primary_key=True)
    goal_id = Column(Integer, ForeignKey("goals.id", ondelete="SET NULL"), nullable=True)

    title = Column(String, nullable=False)
    source = Column(String, nullable=False)
    external_id = Column(String, nullable=True)
    domain_label = Column(String, nullable=True)
    url = Column(String, nullable=True)
    summary = Column(Text, nullable=True)
    content = Column(Text, nullable=True)
    date_added = Column(DateTime(timezone=True), server_default=func.now())

    domains = Column(ARRAY(String), nullable=True)

    sections = relationship(
        "DocumentSectionORM",
        back_populates="document",
        cascade="all, delete-orphan"
    )

    domains_rel = relationship(
        "DocumentDomainORM",
        back_populates="document",
        cascade="all, delete-orphan"
    )

    def to_dict(self):
...

In the next section, we'll look at how we categorize the documents using domains.


Classifying Knowledge with Domain Intelligence

As our AI system ingests papers and breaks them into meaningful sections, we face an important challenge: How do we know what each document or even each section is really about? Understanding the domain of a document (e.g., “reasoning”, “vision”, “symbolic learning”) is crucial for organizing knowledge, surfacing relevant content, and enabling downstream agents to specialize their reasoning.

This is where the DomainClassifier comes in.

What the DomainClassifier Does

The DomainClassifier is a lightweight but powerful component that assigns semantic labels, called domains, to documents and their sections. It works by comparing a document's content to a small set of seed phrases representing each domain. These seeds are defined in a YAML config file, and embeddings for each seed are generated using the same vector store already used throughout our system.


Understanding Domain Seeds

One of the critical components driving how we classify documents in this system is the seeds.yaml file. This file acts as a structured knowledge base that defines what each domain "sounds like." It provides a list of descriptive seed phrases per domain, effectively forming the vector blueprint for that domain.

Each list of phrases is embedded using a model like text-embedding-3-small, and the seed vectors are averaged into a domain centroid vector. When a new paper arrives, we embed the full text of the document and compute its cosine similarity to each domain's centroid. The top-matching domains (based on similarity score) are saved to the document_domains table if the match exceeds a configurable confidence threshold (e.g., min_classification_score: 0.6 in the document_loader config).

This simple but powerful setup gives us the ability to classify any research paper into multiple overlapping conceptual areas, even if it doesn't use the same vocabulary as the seed phrases.
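
Conceptually, the centroid-and-similarity step looks something like this (a simplified sketch; the production classifier shown later compares against individual seed embeddings rather than a single centroid per domain):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def build_centroids(domains: dict, embed) -> dict:
    # domains: {"symbolic": {"seeds": [...]}, ...}; embed maps text -> vector.
    return {
        name: np.mean([embed(seed) for seed in cfg["seeds"]], axis=0)
        for name, cfg in domains.items()
    }

def top_domains(text: str, centroids: dict, embed, top_k: int = 3):
    # Score the document against every domain centroid and keep the best matches.
    doc_vec = embed(text)
    scores = [
        (name, float(cosine_similarity([doc_vec], [centroid])[0][0]))
        for name, centroid in centroids.items()
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:top_k]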

For example, here's what a domain seed file might look like:

domains:
  symbolic:
    description: "Symbolic reasoning, planning, and logic-guided AI."
    seeds:
      - "Symbolic planning using language agents."
      - "Neuro-symbolic reasoning for AI alignment."
      - "Inductive logic programming with LLMs."
      - "Formal rule extraction from natural language."
      - "Symbolic systems for concept generalization."

  alignment:
    description: "Scalable oversight, alignment, and control of AI behavior."
    seeds:
      - "Scalable oversight and alignment strategies."
      - "Training language models to be helpful and harmless."
      - "Preventing goal misgeneralization in agents."
      - "Reward modeling for safety and usefulness."
      - "Evaluating AI via human preference learning."

  planning:
    description: "Strategic action planning with language and decision models."
    seeds:
      - "Hierarchical reinforcement learning for agents."
      - "Planning with tree search and transformers."
      - "Goal decomposition and reasoning chains."
      - "Language-driven policy generation."
      - "Meta-planning and strategy selection in LLMs."

 ...

When a new document or section arrives, the DomainClassifier:

  1. Embeds the input text.
  2. Compares it against each domain seed embedding using cosine similarity.
  3. Returns the top-matching domains above a configurable confidence threshold.

    flowchart TD
    %% Seed processing
    A[seeds.yaml] --> B[Seed Phrases Grouped by Domain]
    B --> C[Embed Each Phrase → Vector]
    C --> D[Compute Domain Centroids: mean vector per domain]

    %% Document input
    E[Incoming Document Text] --> F1[Embed Full Document → Vector]
    E --> F2[Parse into Sections]

    %% Section-level processing
    F2 --> G[Embed Each Section → Vector]
    G --> H[Cosine Similarity with Domain Centroids]
    D --> H

    H --> I[Aggregate Section Similarities → Top K Matching Domains]
    I --> J[Assign Domains to Document]
    J --> K[Store in document_domains Table]

    %% Document embedding output
    F1 --> L[Store in document_embeddings Table]

    %% Styles
    style A fill:#f9f,stroke:#333,stroke-width:2px
    style E fill:#bbf,stroke:#333,stroke-width:2px
    style K fill:#dfd,stroke:#333,stroke-width:2px
    style L fill:#dfd,stroke:#333,stroke-width:2px
  

This approach gives us soft classification with semantic grounding. We're not doing brittle keyword matching or hard-coded labels. Instead, we use vector similarity to reason about what each text is truly about, even when it uses novel phrasing or unfamiliar terminology.

This brings several key advantages:

  • ✅ Modularity: Domains are defined via config and can be extended without code changes.
  • ✅ Generalization: Embedding-based matching allows classification even when exact seed terms don't appear.
  • ✅ Granularity: Because it's lightweight, we apply this not just to entire documents but also to individual sections, a major step forward in understanding mixed-topic papers.

Here is the classifier itself:

import yaml
from sklearn.metrics.pairwise import cosine_similarity

class DomainClassifier:
    def __init__(self, memory, logger, config_path="config/domain/seeds.yaml"):
        self.memory = memory
        self.logger = logger
        self.logger.log("DomainClassifierInit", {"config_path": config_path})

        with open(config_path, "r") as f:
            self.domain_config = yaml.safe_load(f)
        
        self.domains = self.domain_config.get("domains", {})
        self.logger.log("DomainConfigLoaded", {"num_domains": len(self.domains)})
        
        self._prepare_seed_embeddings()

    def _prepare_seed_embeddings(self):
        self.embeddings = []
        self.labels = []
        total_seeds = 0

        for domain, details in self.domains.items():
            seeds = details.get("seeds", [])
            total_seeds += len(seeds)
            for seed in seeds:
                embedding = self.memory.embedding.get_or_create(seed)
                self.embeddings.append(embedding)
                self.labels.append(domain)
        
        self.logger.log(
            "SeedEmbeddingsPrepared",
            {"total_seeds": total_seeds, "domains": list(self.domains.keys())},
        )

    def classify(self, text: str, top_k: int = 3, min_score: float = 0.7):
        embedding = self.memory.embedding.get_or_create(text)
        scores = []

        for domain, seed_embedding in zip(self.labels, self.embeddings):
            score = float(cosine_similarity([embedding], [seed_embedding])[0][0])
            scores.append((domain, score))

        sorted_scores = sorted(scores, key=lambda x: x[1], reverse=True)

        top_matches = sorted_scores[:top_k]

        # log warning if none meet the threshold
        if all(score < min_score for _, score in top_matches):
            self.logger.log(
                "LowDomainScore",
                {"text_snippet": text[:100], "top_scores": top_matches},
            )

        return top_matches
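
A typical call site looks something like this (the abstract text and the returned scores are invented for illustration):

classifier = DomainClassifier(memory, logger)
matches = classifier.classify(
    "We propose a neuro-symbolic planner that decomposes goals into subtasks...",
    top_k=3,
    min_score=0.6,
)
# e.g. [("symbolic", 0.81), ("planning", 0.74), ("alignment", 0.52)]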

This is the process followed.

    flowchart LR
    A[domain seeds YAML config] --> B[Embed each seed phrase]
    B --> C[Store seed embeddings]

    D[Document text] --> E[Embed document]
    C --> F[Cosine similarity seed ↔ doc]
    E --> F
    F --> G{Score ≥ threshold?}

    G -- Yes --> H[Assign domain labels]
    G -- No --> I[Skip domain]

    subgraph Section-level Classification
        J[Section text] --> K[Embed section]
        C --> L[Cosine similarity seed ↔ section]
        K --> L
        L --> M{Score ≥ threshold?}
        M -- Yes --> N[Assign domain to section]
        M -- No --> O[Skip section]
    end
  

How We Use It

We apply the classifier in two places:

  • At the document level, immediately after text extraction and summarization.

The classification is controlled through the agent configuration:

document_loader:
  name: document_loader

  force_domain_update: true   
  top_k_domains: 3
  min_classification_score: 0.6  
  domain_seed_config_path: "config/domain/seeds.yaml"

  max_chars_for_summary: 16000
  summarize_documents: true

Field | Type | Description
name | string | The name identifier for this pipeline stage (used for logging and tracing).
force_domain_update | boolean | If true, forces re-classification of document domains even if already assigned.
top_k_domains | integer | Number of top-scoring domains to assign to each document.
min_classification_score | float | Minimum cosine similarity score for a domain to be considered valid.
domain_seed_config_path | string | Path to the YAML file that contains seed phrases for each domain.
max_chars_for_summary | integer | Maximum number of characters to consider when summarizing document text.
summarize_documents | boolean | Whether to generate a summary (via LLM).

Note that we still store assigned domains even when their scores fall below the threshold.

  def assign_domains_to_document(self, document):
      """
      Classifies the document text into one or more domains,
      and stores results in the document_domains table.
      """
      content = document.content
      if content:
          results = self.domain_classifier.classify(content,   
             self.top_k_domains, self.min_classification_score)
          for domain, score in results:
              self.memory.document_domains.insert({
                  "document_id": document.id,
                  "domain": domain,
                  "score": score,
              })
              self.logger.log("DomainAssigned", {
                  "title": document.title[:60] if document.title else "",
                  "domain": domain,
                  "score": score,
              })
      else:
          self.logger.log("DocumentNoContent", {
              "document_id": document.id,
              "title": document.title[:60] if document.title else "",
          })

  • At the section level, after we split the document into discrete segments using the DocumentProfilerAgent.

Each section can have different domain tags, allowing us to track exactly which part of a paper is about reasoning, which is about learning, and which is about evaluation, for example.


for doc in documents:
...
    detected_domains = self.domain_classifier.classify(text)
    ...
    for section, text in chosen.items():
        self.memory.document_section.upsert(
            {
                "document_id": doc_id,
                "section_name": section,
                "section_text": text,
                "source": "unstructured+llm",
                "domains": detected_domains,  
                "summary": generated_summary,
            }
        )

This enables our knowledge system to become domain-aware at every level, empowering smarter selection, routing, and synthesis by future agents.


Breaking Down a Document: From Raw Text to Structured Sections

    flowchart LR
  A[SurveyAgent<br/>Find goal-related seed papers]
  B[SearchOrchestratorAgent<br/>Expand with related papers]
  C[DocumentLoaderAgent<br/>Download & extract text]
  D[DocumentProfilerAgent<br/>Enrich, embed, and segment]:::highlighted
  E[PaperScoreAgent<br/>Rate for novelty, relevance, etc.]
  F[KnowledgeLoaderAgent<br/>Store as structured knowledge]

  A --> B --> C --> D --> E --> F

  classDef highlighted fill:#ffebcc,stroke:#ffaa00,stroke-width:2px;
  

One of the key challenges in reasoning over research papers, blog posts, or long-form documents is understanding their structure. These documents often contain valuable ideas buried under dense formatting or inconsistent section headings. To extract and analyze meaningful content, we need to convert unstructured text into structured, labeled segments.

That's where our DocumentSectionParser comes in.

What It Does

The DocumentSectionParser uses Unstructured to segment text into a series of typed elements like Title, Heading, and NarrativeText. From there, the parser:

  1. Groups content under detected headings using Unstructured's partitioning output.
  2. Cleans and normalizes the headings (e.g., removing digits, punctuation, and case variation).
  3. Maps the cleaned headings to canonical categories (like Motivation, Method, Results) using a configurable YAML mapping (target_sections.yaml).
  4. Filters out garbage sections using regular expression rules and a configurable minimum content threshold.
  5. Returns a cleaned, structured dictionary of {section_name: content} suitable for downstream processing, scoring, or reasoning.

Example: From Raw Text to Labeled Sections

Given a document like:

1. Introduction
In this paper, we explore...
2. Related Work
Prior methods have focused on...
3. Approach
Our model differs by...

The parser will return:

{
  "motivation": "In this paper, we explore...",
  "background": "Prior methods have focused on...",
  "method": "Our model differs by..."
}

All section labels are standardized, making it easier to apply evaluation logic, domain scoring, or prompt selection downstream.

Parsing Papers into Meaningful Sections

To help the DocumentSectionParser make sense of the diverse and inconsistent headings used in research papers, we rely on a configuration file: target_sections.yaml.

This YAML file defines the canonical sections we want to extract, like abstract, method, or results, and provides a list of synonyms or variations commonly used in real papers for each one. For example:

title:
  - title

abstract:
  - abstract
  - summary

introduction:
  - introduction
  - intro

related_work:
  - related work
  - background
  - prior work
  - literature review

method:
  - method
  - methods
  - methodology
  - approach
  - algorithm

implementation:
  - implementation
  - code
  - technical details

results:
  - results
  - result
  - evaluation
  - performance

discussion:
  - discussion
  - analysis
  - interpretation

conclusion:
  - conclusion
  - conclusions
  - final remarks

limitations:
  - limitations
  - drawbacks
  - challenges

future_work:
  - future work
  - next steps
  - extensions

references:
  - references
  - bibliography
  - works cited

This parser turns raw document text into a structured, labeled format that maps each section of the paper to a specific purpose such as “method,” “results,” or “conclusion.”

import json
import re
from pathlib import Path

import yaml
from fuzzywuzzy import process

class DocumentSectionParser:
    def __init__(self, cfg=None, logger=None):
        self.cfg = cfg or {}
        self.logger = logger or print
        self.min_chars_per_sec = self.cfg.get("min_chars_per_sec", 20)

        # Load target sections from YAML
        self.config_path = self.cfg.get(
            "target_sections_config",
            "config/domain/target_sections.yaml"
        )
        self.TARGET_SECTIONS = self._load_target_sections()
        self.SECTION_TO_CATEGORY = self._build_section_to_category()

    def parse(self, text: str) -> dict:
        from unstructured.partition.text import partition_text
        from unstructured.staging.base import elements_to_json

        elements = partition_text(text=text)
        json_elems = elements_to_json(elements)
        structure = self.parse_unstructured_elements(json.loads(json_elems))
        cleaned = {self.clean_section_heading(k): v for k, v in structure.items()}
        mapped = self.map_sections(cleaned)
        final = self.trim_low_quality_sections(mapped)
        return final

    def _load_target_sections(self) -> dict:
        """Load TARGET_SECTIONS from a YAML file"""
        path = Path(self.config_path)
        if not path.exists():
            raise FileNotFoundError(f"Target sections config not found: {path}")
        with open(path, "r", encoding="utf-8") as f:
            return yaml.safe_load(f)

    def _build_section_to_category(self) -> dict:
        """Build reverse lookup map from synonyms to categories"""
        mapping = {}
        for cat, synonyms in self.TARGET_SECTIONS.items():
            for synonym in synonyms:
                normalized = self._normalize(synonym)
                mapping[normalized] = cat
        return mapping

    def _normalize(self, name: str) -> str:
        return re.sub(r"[^a-z0-9]", "", name.lower().strip())

    def parse_unstructured_elements(self, elements: list[dict]) -> dict[str, str]:
        current_section = None
        current_content = []
        structured = {}

        for el in elements:
            el_type = el.get("type")
            el_text = el.get("text", "").strip()

            if not el_text:
                continue

            if el_type in ("Title", "Heading"):
                if current_section and current_content:
                    structured[current_section] = "\n\n".join(current_content).strip()
                current_section = el_text.strip()
                current_content = []
            elif el_type in ("NarrativeText", "UncategorizedText", "ListItem"):
                if current_section:
                    current_content.append(el_text)

        if current_section and current_content:
            structured[current_section] = "\n\n".join(current_content).strip()

        return structured

    def clean_section_heading(self, heading: str) -> str:
        if not heading:
            return ""
        heading = re.sub(r"^\s*[\d\.\s]+\s*", " ", heading)
        heading = re.sub(r"^(section|chapter|part)\s+\w+", "", heading, flags=re.IGNORECASE)
        heading = re.sub(r"[^\w\s]", "", heading)
        heading = re.sub(r"\s+", " ", heading).strip()
        return heading

    def map_sections(self, parsed_sections: dict[str, str]) -> dict[str, str]:
        mapped = {}

        for sec_name, content in parsed_sections.items():
            normalized = self._normalize(sec_name)
            if normalized in self.SECTION_TO_CATEGORY:
                category = self.SECTION_TO_CATEGORY[normalized]
                mapped[category] = content
            else:
                best_match, score = process.extractOne(normalized, self.SECTION_TO_CATEGORY.keys())
                if score > 75:
                    category = self.SECTION_TO_CATEGORY[best_match]
                    mapped[category] = content

        return mapped

    def is_valid_section(self, text: str) -> bool:
        if not text or len(text.strip()) < 10:
            return False

        garbage_patterns = [
            r"^\d+$",
            r"^[a-zA-Z]$",
            r"^[A-Z][a-z]+\s\d+$",
            r"^[ivxlcdmIVXLCDM]+$",
            r"^[\W_]+$",
            r"^[^\w\s].{0,20}$"
        ]

        for pattern in garbage_patterns:
            if re.fullmatch(pattern, text.strip()):
                return False

        return True

    def trim_low_quality_sections(self, structured_data: dict[str, str]) -> dict[str, str]:
        cleaned = {}
        for key, text in structured_data.items():
            if self.is_valid_section(text):
                cleaned[key] = text
            else:
                self.logger.log("TrimmingSection", 
                    {"section": key, "data": text[:50]})
        return cleaned

Here’s how it works, step by step:

Initialization and Configuration

def __init__(self, cfg=None, logger=None):

The parser accepts a configuration dictionary and a logger. Most importantly, it loads a YAML file that defines the target sections we care about (e.g., "abstract", "methods", "findings"). These sections may appear in many different forms in papers; this file helps us normalize them.

self.TARGET_SECTIONS = self._load_target_sections()
self.SECTION_TO_CATEGORY = self._build_section_to_category()

We build a reverse mapping of section names to their canonical categories, allowing us to group headings like “Experiments” and “Methodology” under a common "method" label.

Parsing Document Text

The core logic happens in the parse() method:

def parse(self, text: str) -> dict:

This method processes the raw document text in several steps:

  1. Partitioning text into elements using the unstructured library.
  2. Parsing structured sections using parse_unstructured_elements().
  3. Cleaning up section headings with clean_section_heading().
  4. Mapping section titles to canonical categories with map_sections().
  5. Filtering out low-quality content with trim_low_quality_sections().

The result is a clean, dictionary-style mapping from logical section names (like "method" or "results") to their actual text content.

Building the Structure

def parse_unstructured_elements(self, elements: list[dict]) -> dict[str, str]:

This method walks through the parsed elements, separating the document into a series of headings and the content under them. It keeps track of the current section title and accumulates text lines under each heading.

Normalizing Section Headings

def clean_section_heading(self, heading: str) -> str:

Section titles often include noise like numbers or formatting artifacts (e.g., "1. Introduction" or "Chapter 2 - Results"). This method strips out that noise so the parser can match the title to a canonical form.

Mapping to Target Categories

def map_sections(self, parsed_sections: dict[str, str]) -> dict[str, str]:

After cleaning, this method matches each heading to a target category using fuzzy matching. For example, both "Findings" and "Results" might map to the canonical "results" section.

If an exact match isn't found, the method uses the fuzzywuzzy library to select the closest match with a confidence threshold.

Filtering Out Noise

def trim_low_quality_sections(self, structured_data: dict[str, str]) -> dict[str, str]:

Sometimes headings or content are just junk: single letters, Roman numerals, or short fragments. This method uses regular expressions to catch and exclude those, ensuring we only keep high-quality content.

By breaking down a document into canonical, cleanly labeled sections, we enable downstream agents like scorers, evaluators, or prompt compilers to focus on exactly the parts they need. Whether it’s analyzing methods, extracting contributions, or reviewing conclusions, this structured representation is our foundation for intelligent document understanding. In short: this parser turns messy PDFs into structured knowledge.

Fallback to LLM: When Structure Fails

While our primary extraction method relies on the unstructured library to parse documents into meaningful sections, we included a fallback mechanism that uses a language model to guide the process when the structure is ambiguous or low quality.

In this approach, the system prompts the LLM to suggest relevant section headings and heuristically slices the document between those headings. It then attempts to match these chunks back to our target sections (like methods, results, etc.).
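
As a rough sketch of that fallback (simplified; the real prompt and matching logic live in the profiler agent), the idea is to ask the LLM for likely headings and then slice the raw text between their first occurrences:

def slice_by_headings(text: str, headings: list[str]) -> dict[str, str]:
    # Find where each suggested heading first appears, then treat the text
    # between consecutive hits as that section's content.
    positions = sorted((text.find(h), h) for h in headings if text.find(h) != -1)
    sections = {}
    for i, (start, heading) in enumerate(positions):
        end = positions[i + 1][0] if i + 1 < len(positions) else len(text)
        sections[heading] = text[start + len(heading):end].strip()
    return sections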

However, in our experience, especially when using local models via tools like Ollama, this LLM-guided fallback was often imprecise. Section boundaries were vague, and the generated summaries lacked consistency across documents. This led us to rely more heavily on the unstructured parser, which, while imperfect, provided more reliable segmentation when paired with our synonym-based mapping and quality filtering heuristics.

We kept the LLM fallback in place for edge cases, but tuned the system to favor the structured route wherever possible.

What This Enables

This modular parsing step powers multiple downstream agents:

  • The PaperScoreAgent uses the parsed sections to assign multidimensional scores (e.g., correctness, clarity, originality).
  • The PromptCompilerAgent can target specific sections (like method) for prompt tuning or hypothesis refinement.
  • The domain classifier can reason about which section most clearly represents the paper's core domain.

Configurable & Extendable

Want to support a new section type like Conclusion or Ethics? Just add synonyms to the target_sections.yaml file, and the parser will start mapping them automatically.

method:
  - approach
  - method
  - model
  - technique
motivation:
  - motivation
  - intro
  - background

Summary

The DocumentSectionParser acts as the translation layer between raw, messy research text and structured, analyzable components. It lets the rest of our system focus on what matters: scoring ideas, evaluating reasoning, and generating insights, not decoding inconsistent formatting.


Scoring a Paper From Every Angle

    flowchart LR
  A[SurveyAgent<br/>Find goal-related seed papers]
  B[SearchOrchestratorAgent<br/>Expand with related papers]
  C[DocumentLoaderAgent<br/>Download & extract text]
  D[DocumentProfilerAgent<br/>Enrich, embed, and segment]
  E[PaperScoreAgent<br/>Rate for novelty, relevance, etc.]:::highlighted
  F[KnowledgeLoaderAgent<br/>Store as structured knowledge]

  A --> B --> C --> D --> E --> F

  classDef highlighted fill:#ffebcc,stroke:#ffaa00,stroke-width:2px;
  

One of the core innovations in our pipeline is the multi-dimensional scoring system implemented in the PaperScoreAgent. Unlike traditional single-score evaluation systems, this adapter evaluates each document along several distinct axes of merit, enabling more nuanced and interpretable assessments.

This is an example configuration for scoring papers. We support two output formats, simple and cor.

output_format: simple

dimensions:
  - name: relevance
    file: relevance
    weight: 1.5
    extra_data: { parser: numeric }

  - name: novelty
    file: novelty
    weight: 1.2
    extra_data: { parser: numeric }

  - name: implementability
    file: implementability
    weight: 1.0
    extra_data: { parser: numeric }

This is a selection of the scoring prompts that are useful for this task.

Prompt File | Score Dimension | Description | Purpose in Scoring
clarity.txt | Clarity | Evaluates how clearly the paper presents its ideas, methods, and results. | Helps determine if the content is accessible and understandable to readers.
feasibility.txt | Feasibility | Assesses how practical and realistic it is to implement the ideas presented in the paper. | Filters out overly theoretical ideas that are hard to realize in practice.
implementability.txt | Implementability | Focuses on whether the system, framework, or method can be built with available tools. | Indicates the technical viability of turning the paper into a working system.
integration_fit.txt | Integration Fit | Measures how well the proposed method could integrate with existing systems or workflows. | Useful for evaluating papers for system expansion or plugin potential.
modularity.txt | Modularity | Assesses whether the components described can be separated and reused independently. | Encourages preference for composable and flexible research artifacts.
novelty.txt | Novelty | Examines whether the idea or approach is new in the context of existing literature. | Helps prioritize papers that bring fresh insights or methodologies.
originality.txt | Originality | Measures the paper's creative departure from standard or known methods. | Encourages discovery of innovative, unique contributions.
performance_gain_potential.txt | Performance Gain Potential | Evaluates the potential for the approach to improve measurable outcomes (e.g., speed, accuracy). | Important for identifying research that could push the state-of-the-art forward.
relevance.txt | Relevance | Assesses how closely aligned the content is to the current research goal. | Ensures selected documents are pertinent to the ongoing task or question.

Why Multi-Dimensional?

In research and reasoning workflows, a document can be:

  • Correct but uninspired,
  • Creative but technically flawed,
  • Relevant but poorly written.

A single score can't capture these tradeoffs. Multi-dimensional scoring allows us to reason more like a reviewer: evaluating a paper's correctness, originality, clarity, relevance, and more, each as a standalone quality.

This is an example score result.


Dimension Scores Summary

Dimension        | Score | Weight | Rationale (preview)
relevance        | 65    | 1.5    | rationale: The paper introduces a method to enhance reasonin...
novelty          | 75    | 1.2    | rationale: The paper introduces a novel approach by training...
implementability | 95    | 1.0    | rationale: The paper describes a modular approach with a cle...
FINAL            | 76.35 | -      | Weighted average
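
For the example above, the FINAL row is simply the weight-normalized average of the dimension scores: (65 × 1.5 + 75 × 1.2 + 95 × 1.0) / (1.5 + 1.2 + 1.0) = 282.5 / 3.7 ≈ 76.35.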

How It Works

1. Document Selection

We begin with a filtered set of documents (usually sourced from arXiv, Hugging Face, or internal corpora) that are deemed domain-relevant using the KnowledgeLoaderAgent.

2. Score Check

For each document:

  • If prior scores exist (and force_rescore is not set), we reuse them.
  • Otherwise, we proceed to scoring.

3. LLM-Based Evaluation

Each document is scored using an LLM prompt designed to generate structured scores for each dimension. For example:

{
  "correctness": 0.8,
  "originality": 0.9,
  "clarity": 0.7,
  "relevance": 0.95
}

Novelty prompt example

This is an example prompt. It uses the paper's title and summary to determine the score; each dimension prompt returns a rationale and a score between 0 and 100.

You are evaluating a research paper for its novelty.

Paper title: {{ document.title }}
Paper summary: {{ document.summary }}

Does the paper introduce new concepts, architectures, or techniques that are not commonly found in existing work on reasoning, planning, or self-evaluation in AI?

Return your review in the exact structured format below. Do not include headings, markdown, or additional commentary. Use only plain text fields as shown:

rationale: <brief explanation>

score: <0-100>

This output is parsed and converted into rows in the scores table, each tied to a specific evaluation record for traceability.
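
A small sketch of how that plain-text response can be turned into a score row (regex-based; the agent's actual parser may differ):

import re

def parse_review(output: str) -> dict:
    # Pull the "rationale:" and "score:" fields out of the LLM response.
    rationale = re.search(r"rationale:\s*(.+)", output)
    score = re.search(r"score:\s*(\d+)", output)
    return {
        "rationale": rationale.group(1).strip() if rationale else "",
        "score": int(score.group(1)) if score else 0,
    }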

An example score result

{
  "relevance": {"score": 80, "rationale": "Covers reinforcement learning methods relevant to self-improvement"},
  "novelty": {"score": 60, "rationale": "Uses common PPO variant without novel extensions"},
  "clarity": {"score": 95, "rationale": "Well-structured, includes implementation details"},
  "feasibility": {"score": 70, "rationale": "Can be implemented with standard frameworks"},
  "impact": {"score": 85, "rationale": "May improve training efficiency in early stages"}
}

4. Dimension-Aware Aggregation

We then compute an average per dimension, ignoring zero/invalid scores. This provides a quality profile for each paper: a fingerprint of its strengths and weaknesses.

5. Usage in the Pipeline

These scores feed into:

  • Ranking systems (e.g., selecting the top 5 documents most relevant and correct),
  • Router agents that choose the best reasoning model or strategy based on what the current goal values (e.g., originality vs correctness),
  • Training data filters, ensuring that only high-quality samples contribute to model tuning.

What Makes It Powerful?

This adapter turns each paper into a multi-dimensional vector of quality. That opens the door to:

  • Comparative judgments across dimensions,
  • Contrastive pair training (as in MR.Q),
  • Symbolic rule learning about which types of documents help with which goals,
  • And eventually, meta-reasoning over document effectiveness.

Continuous Refinement

As we gather more scores across different domains and tasks, we use this information to:

  • Improve the LLM scoring prompts,
  • Fine-tune downstream rankers (e.g., SVMs or reward models),
  • Guide PromptCompilerAgent decisions by training it on which prompts lead to high-dimensional scores.

This system turns document evaluation from a bottleneck into a rich signal, powering every layer of the self-improving Co AI pipeline.

Adaptive Document Selection: Studying What Matters

    flowchart LR
  A[SurveyAgent<br/>Find goal-related seed papers]
  B[SearchOrchestratorAgent<br/>Expand with related papers]
  C[DocumentLoaderAgent<br/>Download & extract text]
  D[DocumentProfilerAgent<br/>Enrich, embed, and segment]
  E[PaperScoreAgent<br/>Rate for novelty, relevance, etc.]
  F[KnowledgeLoaderAgent<br/>Store as structured knowledge]:::highlighted

  A --> B --> C --> D --> E --> F

  classDef highlighted fill:#ffebcc,stroke:#ffaa00,stroke-width:2px;
  

In large, heterogeneous document collections, simply keyword-matching a goal to text is insufficient. We need smarter ways to understand what a goal is really about and identify documents that align with that intent. This is where the KnowledgeLoaderAgent comes in.

Purpose

The KnowledgeLoaderAgent is designed to select the most suitable documents for a given research goal by adapting to domain-level semantics and multi-dimensional document scoring. Rather than matching raw text, it uses domain-specific embeddings to rank documents based on conceptual proximity.

In short: the Knowledge Agent is where noise becomes signal. And that signal is the foundation of everything that follows.

How It Works

  1. Domain Embedding Seeds. Each research domain (e.g., "LLM Optimization", "Knowledge Retrieval", "Symbolic Reasoning") is associated with a small set of seed examples: concise phrases or representative goals. These seeds are embedded and averaged to form a domain centroid vector.

  2. Goal Classification via Embedding Similarity. When a new goal enters the pipeline, it is embedded using a local embedding model (memory.embedding.get_or_create). The system computes cosine similarity between the goal vector and each domain centroid to identify the most relevant domain. This domain assignment helps scope the retrieval to the most contextually aligned documents.

  3. Document Filtering by Domain + Quality. Each document has already been annotated with:

    • One or more domain scores (via prior classification), and
    • A set of multi-dimensional scores (e.g., clarity, feasibility, novelty, etc.) from the dynamic scoring stage.

    The Knowledge Agent filters documents using two criteria:

    ✅ Domain Match: The document must be tagged with the same domain as the goal, and the domain score must exceed a minimum threshold (e.g., 0.6).

    ✅ Quality Match (optional): If enabled via config (use_dimensional_scores: true), the agent can further prioritize documents that:

    • Score above a specified threshold on any or all scoring dimensions.
    • Or are ranked among the top-k highest scoring documents across selected dimensions.

    This ensures that not only is the document about the right thing, it's also well-written, original, implementable, and useful.

  4. Context-Aware Return Format. Depending on configuration:

    • The agent can return summaries (summary) for compact processing.
    • Or return full document content (text) for richer downstream pipelines like symbolic reasoning, hypothesis generation, or tool synthesis.

This approach allows the system to focus attention on the most promising knowledge, even across thousands of documents. By aligning goals to domain vectors, we simulate a kind of semantic routing, making the system behave like an adaptive information filter.

Example Configuration

knowledge_loader:
  name: knowledge_loader
  domain_seeds: ${path:config/domain/seeds.yaml}
  top_k: 3
  domain_threshold: 0.4
  include_full_text: false

This is the agent code as of this blog post:


import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# BaseAgent and the GOAL context key are part of the co_ai framework.
class KnowledgeLoaderAgent(BaseAgent):

    def __init__(self, cfg, memory=None, logger=None):
        super().__init__(cfg, memory, logger)
        self.domain_seeds = cfg.get("domain_seeds", {})
        self.top_k = cfg.get("top_k", 3)
        self.threshold = cfg.get("domain_threshold", 0.0)
        self.include_full_text = cfg.get("include_full_text", False)

        # Optional scoring configuration
        self.use_dimensional_scores = cfg.get("use_dimensional_scores", False)
        self.dimension_weights = cfg.get("dimension_weights", {
            "relevance": 1.0,
            "usefulness": 0.8,
            "clarity": 0.6,
            "implementability": 0.7,
            "novelty": 0.5,
            "integration_fit": 0.6,
        })
        self.min_weighted_score = cfg.get("min_weighted_score", 0.5)

    async def run(self, context: dict) -> dict:
        goal = context.get(GOAL)
        goal_text = goal.get("goal_text", "")
        documents = context.get("documents", [])

        if not goal_text or not documents:
            self.logger.log("DocumentFilterSkipped", {"reason": "Missing goal or documents"})
            return context

        # Step 1: Assign domain to the goal
        goal_vector = self.memory.embedding.get_or_create(goal_text)
        domain_vectors = {
            domain: np.mean([self.memory.embedding.get_or_create(ex) for ex in examples], axis=0)
            for domain, examples in self.domain_seeds.items()
        }

        goal_domain, goal_domain_score = None, -1
        for domain, vec in domain_vectors.items():
            score = float(cosine_similarity([goal_vector], [vec])[0][0])
            if score > goal_domain_score:
                goal_domain = domain
                goal_domain_score = score

        context["goal_domain"] = goal_domain
        context["goal_domain_score"] = goal_domain_score
        self.logger.log("GoalDomainAssigned", {"domain": goal_domain, "score": goal_domain_score})

        # Step 2: Filter documents based on domain + optional dimensional scores
        filtered = []
        for doc in documents:
            doc_domains = self.memory.document_domains.get_domains(doc["id"])
            if not doc_domains:
                continue

            for dom in doc_domains[:self.top_k]:
                if dom.domain == goal_domain and dom.score >= self.threshold:
                    # Optional: score-based filtering
                    if self.use_dimensional_scores:
                        score = self.compute_weighted_score(doc["id"])
                        if score < self.min_weighted_score:
                            continue  # reject document
                    else:
                        score = None

                    selected_content = doc["text"] if self.include_full_text else doc["summary"]
                    filtered.append({
                        "id": doc["id"],
                        "title": doc["title"],
                        "domain": dom.domain,
                        "domain_score": dom.score,
                        "doc_score": score,
                        "content": selected_content
                    })
                    break

        context[self.output_key] = filtered
        context["filtered_document_ids"] = [doc["id"] for doc in filtered]
        self.logger.log("DocumentsFiltered", {
            "count": len(filtered),
            "used_scores": self.use_dimensional_scores,
            "min_score_threshold": self.min_weighted_score if self.use_dimensional_scores else None,
            "dimensions": list(self.dimension_weights.keys()) if self.use_dimensional_scores else None
        })
        return context

    def compute_weighted_score(self, doc_id: str) -> float:
        scores = self.memory.document_scores.get_scores(doc_id)
        if not scores:
            return 0.0
        total, weight_sum = 0.0, 0.0
        for dim, weight in self.dimension_weights.items():
            dim_score = next((s.score for s in scores if s.dimension == dim), None)
            if dim_score is not None:
                total += weight * dim_score
                weight_sum += weight
        return total / weight_sum if weight_sum > 0 else 0.0

Structured Storage and Future Feedback Loops

Every domain assignment, document selection, and multi-dimensional score is not only logged but also saved to the database in a structured format. This persistent knowledge base enables far more than just traceability; it becomes the training ground for the system's next evolution. By combining domain tags, content metadata, and scoring dimensions (like clarity, novelty, symbolic alignment), we lay the foundation for downstream agents to learn from historical data.

In upcoming stages, this data will be used to drive MR.Q-based ranking, DPO-style prompt refinement, and even automated rule tuning. The result is a dynamic, continuously improving system where feedback isn't just collected; it's operationalized into behavior. This is how the system learns what good knowledge looks like, and how it should shape the reasoning strategies that come next.


Conclusion: From Chaos to Coordinated Knowledge

In this post, we've shown how to transform chaotic, unstructured research papers into structured, ranked, and goal-filtered knowledge using a modular Document Intelligence system. Each stage, from domain assignment and section parsing to dimensional scoring and filtering, is handled by purpose-built agents working in tandem. Importantly, we've favored unstructured, local parsing and scoring over LLM-based black boxes, allowing us to retain interpretability, efficiency, and control throughout the pipeline.

But this is more than cleanup; this is bootstrapping intelligence. The knowledge output from this process feeds directly into the next phase: self-improving AI workflows. Our PromptCompilerAgent, for instance, will consume these structured documents to generate higher-quality prompts. Then, using MR.Q-based preference tuning, we'll evaluate and refine those prompts based on outcomes, creating a feedback loop where the system learns from its own behavior. In short: this is not just document understanding; it's the foundation for a self-learning AI research agent. And among all our previous milestones, this one might be the most decisive step toward that vision.


Knowledge Flow Toward Self-Learning

    flowchart TD
    A[Unstructured Documents PDFs, Papers] --> B[DocumentProfilerAgent<br/>Parse + Structure]
    B --> C[DomainClassifier<br/>Assign Domain Tags]
    B --> D[DimensionalScorer<br/>Score: Clarity, Novelty, etc.]
    C & D --> E[KnowledgeLoaderAgent<br/>Filter by Goal Domain & Score Threshold]
    E --> F[Filtered Structured Documents]
    F --> G[PromptCompilerAgent<br/>Generate Prompts Based on Knowledge]
    G --> H[Prompt Evaluation MR.Q]
    H --> I[Prompt Tuning / Self-Improvement]
    I --> J[Enhanced Reasoning Pipeline]
  

Our Aim: To build a system of local, interpretable agents that doesn't just match the output quality of large language models but surpasses them in consistency, control, and clarity. By composing modular reasoning agents, domain-aware filters, and prompt-programmable scaffolds, we gain traceable intelligence, tunable behavior, and a feedback loop that learns from every decision. The result is not just an alternative to powerful LLMs; it's a more accountable, composable, and continuously improvable reasoning system.


References

  1. Domain2Vec: Vectorizing Datasets to Find the Optimal Data Mixture without Training (arXiv:2506.10952). Inspired the design of embedding-based domain classification for document filtering. https://arxiv.org/abs/2506.10952

  2. Paper2Poster: Towards Multimodal Poster Automation from Scientific Papers (arXiv:2505.21497). Informed the structural parsing of scientific documents into reusable components. https://arxiv.org/abs/2505.21497

  3. Unstructured.io - Document Parsing Tools. Used for segmenting and structurally parsing raw PDF/text documents. https://unstructured.io

  4. FuzzyWuzzy - Fuzzy string matching for section heading normalization. SeatGeek's FuzzyWuzzy library was used to align arbitrary section headings with canonical categories. https://github.com/seatgeek/fuzzywuzzy

  5. Cosine Similarity & Embeddings - Domain matching and semantic similarity. Domain classification and filtering rely on cosine similarity of vector embeddings (via a local model or sentence transformers).

  6. Hydra Configuration System - Flexible YAML-based configuration for agents. Used throughout the Co AI pipeline for modular agent configuration. https://hydra.cc


Glossary

Term | Definition
Knowledge Ingestion Pipeline | The series of steps through which documents are retrieved, parsed, embedded, classified, and scored.
Domain Seeds | Short example texts representing key domains; used as anchor points to classify new documents by similarity.
Embedding | A numerical vector representation of text that preserves semantic meaning, enabling similarity comparison.
Cosine Similarity | A metric for measuring how similar two vector embeddings are, commonly used for semantic matching.
Target Sections | Canonical paper sections like Abstract, Method, or Results into which raw documents are categorized.
DocumentSectionParser | A parser that uses the Unstructured library to extract and normalize meaningful sections from papers.
Multi-Dimensional Scoring | A method of evaluating documents or hypotheses along several quality dimensions (e.g., clarity, correctness).
MR.Q | A structured scoring method for comparing hypotheses or documents, inspired by preference modeling.
ScoreORM / EvaluationORM | Database models that store structured scoring information for documents, hypotheses, or agent outputs.
Hydra | A configuration management system used to define flexible YAML-based agent and pipeline configs.
Unstructured.io | A library for parsing unstructured documents into structured elements like headings and paragraphs.
Fuzzy Matching | A method of aligning approximate strings (like section titles) using libraries like FuzzyWuzzy.
Document Domains | Categories assigned to documents based on their similarity to predefined seed examples.
Agent | A functional component in the Co AI framework responsible for performing a specific task (e.g., scoring, parsing).
Pipeline | A sequential flow of agents orchestrated to achieve a research or reasoning objective.
Prompt Compiler | A component that builds and refines prompts based on structured inputs and performance feedback.
Self-Improving AI | An AI system that can evaluate its performance and adapt its internal tools and knowledge accordingly.