Document Intelligence: Turning Documents into Structured Knowledge

Summary
Imagine drowning in a sea of research papers, each holding a fragment of the knowledge you need for your next breakthrough. How does an AI system, striving for self-improvement, navigate this information overload to find precisely what it needs? This is the core challenge our Document Intelligence pipeline addresses, transforming chaotic documents into organized, searchable knowledge.
In this post we combine insights from Paper2Poster: Towards Multimodal Poster Automation from Scientific Papers and Domain2Vec: Vectorizing Datasets to Find the Optimal Data Mixture without Training to build an AI document profiler that transforms unstructured papers into structured, searchable knowledge graphs.
🎯 What Problem Are We Solving?
A self-improving AI must continually adapt to new goals by acquiring relevant, high-quality knowledge. In the face of overwhelming research output, the system needs to filter and prioritize documents that align with its evolving objectives.
This framework enables that adaptation by embedding documents and their sections, classifying them into semantic domains, and scoring them across key dimensions like relevance, novelty, and clarity. The result is a dynamic, goal-aware knowledge base that supports smarter tuning, hypothesis generation, and self-directed improvement.
🧬 The Knowledge Ingestion Pipeline: An Overview
flowchart LR
    A[🎯 SurveyAgent<br/>Find goal-related seed papers]:::highlighted
    B[🔍 SearchOrchestratorAgent<br/>Expand with related papers]
    C[📥 DocumentLoaderAgent<br/>Download & extract text]
    D[🧠 DocumentProfilerAgent<br/>Enrich, embed, and segment]
    E[📊 PaperScoreAgent<br/>Rate for novelty, relevance, etc.]
    F[📚 KnowledgeLoaderAgent<br/>Store as structured knowledge]
    A --> B --> C --> D --> E --> F
    classDef highlighted fill:#ffebcc,stroke:#ffaa00,stroke-width:2px;
To turn raw research papers into structured, searchable knowledge, we built a custom pipeline composed of intelligent agents, each with a focused role. This modular design allows for scalable and interpretable processing, while enabling dynamic tuning and extension as new needs arise.
Here's a quick walkthrough of the pipeline stages we use:
pipeline:
  name: papers
  tag: "huggingface_related_papers import"
  description: "Import papers score and save them"
  stages:
    - name: survey
      cls: co_ai.agents.knowledge.survey.SurveyAgent
      enabled: true
      iterations: 1
    - name: search_orchestrator
      cls: co_ai.agents.knowledge.search_orchestrator.SearchOrchestratorAgent
      enabled: true
      iterations: 1
    - name: document_loader
      cls: co_ai.agents.knowledge.document_loader.DocumentLoaderAgent
      enabled: true
      iterations: 1
    - name: document_profiler
      cls: co_ai.agents.knowledge.document_profiler.DocumentProfilerAgent
      enabled: true
      iterations: 1
    - name: paper_score
      cls: co_ai.agents.knowledge.paper_score.PaperScoreAgent
      enabled: true
      iterations: 1
    - name: knowledge_loader
      cls: co_ai.agents.knowledge.knowledge_loader.KnowledgeLoaderAgent
      enabled: true
      iterations: 1
1. SurveyAgent
The pipeline begins with the SurveyAgent, which receives a high-level research goal (e.g., "Find papers related to self-correcting AI") and translates it into concrete search targets. This might include seed topics, keywords, or subdomains.
2. SearchOrchestratorAgent
Once the survey is complete, the SearchOrchestratorAgent coordinates multiple document retrieval strategies, including direct web search, arXiv lookups, and Hugging Face dataset scraping. It standardizes the results and prepares them for downstream ingestion.
3. DocumentLoaderAgent
This agent is responsible for downloading, parsing, and storing raw research papers. It supports:
- PDF ingestion and text extraction
- Optional summarization via LLMs
- Title guessing and metadata refinement
- Domain classification using embeddings and a seed-based classifier
4. DocumentProfilerAgent
Each document is then split into sections (e.g., Abstract, Methods, Results) and analyzed in detail. The DocumentProfilerAgent computes section-level embeddings, tags domains, and stores structured knowledge in a fine-grained format.
5. PaperScoreAgent
With the documents structured, the PaperScoreAgent evaluates each one along key dimensions like relevance, clarity, and novelty. These scores can guide downstream reasoning agents and training data selection.
6. KnowledgeLoaderAgent
Finally, the KnowledgeLoaderAgent selects and loads the most relevant sections or documents into memory, optimized for the current goal. It acts as a smart filter, ensuring that only the most valuable information is passed to reasoning or generation components.
🎯 SurveyAgent: Generating Search-Driven Subgoals
Its role is to generate adaptive search queries that break a broad research goal into more actionable subtopics. Given a high-level objective, the SurveyAgent deconstructs it into multiple keyword-based queries, using prompt-driven reasoning to capture different angles (e.g., novelty, feasibility). These queries can then be passed to downstream components like the document retriever or orchestrator for targeted literature exploration.
goal:
  goal_text: "https://arxiv.org/pdf/2503.00735"
  goal_type: "similar_papers"
  focus_area: "AI research"
The similar_papers goal type short-circuits web search. Instead of querying external databases, we use a Hugging Face Space, librarian-bots/recommend_similar_papers, to directly generate a list of relevant papers based on the goal paper.
🤖 Using AI to Find Papers About AI: The Hugging Face Paper Recommender
flowchart LR
    A[🎯 SurveyAgent<br/>Find goal-related seed papers]
    B[🔍 SearchOrchestratorAgent<br/>Expand with related papers]:::highlighted
    C[📥 DocumentLoaderAgent<br/>Download & extract text]
    D[🧠 DocumentProfilerAgent<br/>Enrich, embed, and segment]
    E[📊 PaperScoreAgent<br/>Rate for novelty, relevance, etc.]
    F[📚 KnowledgeLoaderAgent<br/>Store as structured knowledge]
    A --> B --> C --> D --> E --> F
    classDef highlighted fill:#ffebcc,stroke:#ffaa00,stroke-width:2px;
One of the first challenges in building an AI that learns from research is finding the right research to learn from. Instead of relying solely on keyword search or manual curation, we decided to integrate an AI-powered paper recommendation system right into our pipeline.
Enter the Hugging Face Space librarian-bots/recommend_similar_papers, an open-access tool that takes a research paper URL and returns a set of similar papers using LLM-based document understanding and embedding matching.
Here's how we integrated it:
from gradio_client import Client
def recommend_similar_papers(paper_url: str) -> list[dict]:
    client = Client("librarian-bots/recommend_similar_papers")
    result = client.predict(paper_url, None, False, api_name="/predict")
    ...
We also added a caching layer to avoid repeated requests and support fast iterations:
import hashlib
from pathlib import Path

CACHE_DIR = Path(".paper_cache")

def _get_cache_path(paper_url: str) -> Path:
    key = hashlib.md5(paper_url.encode()).hexdigest()
    return CACHE_DIR / f"{key}.pkl"
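To show how these pieces fit together, here is a minimal sketch of a cached wrapper around the Space call (cached_recommend is a hypothetical name; the real agent also normalizes the Space output into the result format shown below):

import pickle

from gradio_client import Client

def cached_recommend(paper_url: str):
    """Return cached recommendations when available, otherwise call the Space and cache the raw result."""
    CACHE_DIR.mkdir(exist_ok=True)  # CACHE_DIR and _get_cache_path come from the snippet above
    cache_path = _get_cache_path(paper_url)
    if cache_path.exists():
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    client = Client("librarian-bots/recommend_similar_papers")
    result = client.predict(paper_url, None, False, api_name="/predict")
    with open(cache_path, "wb") as f:
        pickle.dump(result, f)
    return result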
The result? When a seed paper is submitted (e.g., an arXiv PDF), our system fetches topically related papers automatically and formats them for ingestion:
{
"source": "recommend_similar_papers",
"url": "https://arxiv.org/pdf/2506.10952.pdf",
"title": "2506.10952",
"summary": "Not yet processed"
}
This tool forms a core input source for the knowledge pipeline. It ensures that:
- We start from high-quality seed papers,
- Each paper can dynamically pull in its “literature neighborhood”,
- Downstream agents can reason over clusters of ideas, not just isolated works.
At this stage we have a list of results. As a concrete example, here is a goal that searches for similar papers, the process we are explaining in this post:
goal:
  goal_text: "https://arxiv.org/pdf/2503.00735"
  goal_type: "similar_papers"
This will generate 10 results, for example:
https://arxiv.org/pdf/2505.01441.pdf
https://arxiv.org/pdf/2505.14147.pdf
https://arxiv.org/pdf/2504.15900.pdf
https://arxiv.org/pdf/2505.14652.pdf
https://arxiv.org/pdf/2505.08364.pdf
https://arxiv.org/pdf/2506.01369.pdf
Once we've identified which papers to explore, the next step is to pull them into our system and convert them into structured, searchable knowledge.
📥 Downloading and Ingesting Research Papers
flowchart LR
    A[🎯 SurveyAgent<br/>Find goal-related seed papers]
    B[🔍 SearchOrchestratorAgent<br/>Expand with related papers]
    C[📥 DocumentLoaderAgent<br/>Download & extract text]:::highlighted
    D[🧠 DocumentProfilerAgent<br/>Enrich, embed, and segment]
    E[📊 PaperScoreAgent<br/>Rate for novelty, relevance, etc.]
    F[📚 KnowledgeLoaderAgent<br/>Store as structured knowledge]
    A --> B --> C --> D --> E --> F
    classDef highlighted fill:#ffebcc,stroke:#ffaa00,stroke-width:2px;
Our ingestion pipeline begins by downloading the paper's PDF directly from its URL. This is done in a lightweight batch step: no browser or crawler needed. We simply fetch, store temporarily, and then extract the full text using our PDFConverter tool.
response = requests.get(url)
with open(f"{self.download_directory}/{title}.pdf", "wb") as f:
    f.write(response.content)
text = PDFConverter.pdf_to_text(f"{self.download_directory}/{title}.pdf")
from pdfminer.high_level import extract_text

class PDFConverter:
    ...
    def pdf_to_text(file_path: Union[str, Path]) -> str:
        ...
        try:
            text = extract_text(str(file_path))
            return text.strip()
        except PDFSyntaxError as e:
            ...
With the raw text extracted, we enrich it with high-quality metadata. Rather than depending solely on LLMs to generate titles or summaries, which can hallucinate or miss the author's intent, we query the arXiv API (via our internal fetch_arxiv_metadata tool) to retrieve the official title and abstract when available.
import requests
import xml.etree.ElementTree as ET

def fetch_arxiv_metadata(arxiv_id: str) -> dict:
    """
    Query the arXiv API and return metadata for a given arXiv ID.

    Args:
        arxiv_id (str): e.g., "2505.19590"

    Returns:
        dict: {
            'title': str,
            'summary': str,
            'authors': list[str],
            'published': str (ISO format),
            'url': str
        }
    """
    url = f"https://export.arxiv.org/api/query?id_list={arxiv_id}"
    response = requests.get(url)
    if response.status_code != 200:
        raise ValueError(f"arXiv API request failed with {response.status_code}")

    root = ET.fromstring(response.text)
    ns = {"atom": "http://www.w3.org/2005/Atom"}
    entry = root.find("atom:entry", ns)
    if entry is None:
        raise ValueError(f"No entry found for arXiv ID {arxiv_id}")

    title = entry.find("atom:title", ns).text.strip().replace("\n", " ")
    summary = entry.find("atom:summary", ns).text.strip().replace("\n", " ")
    authors = [
        author.find("atom:name", ns).text
        for author in entry.findall("atom:author", ns)
    ]
    published = entry.find("atom:published", ns).text
    pdf_url = entry.find("atom:id", ns).text

    return {
        "title": title,
        "summary": summary,
        "authors": authors,
        "published": published,
        "url": pdf_url,
    }
Why use arXiv's metadata?
🧭 We've found the titles and summaries from arXiv to be more reliable and consistent than those generated by LLMs, especially for scientific documents. These curated details help anchor the rest of the pipeline in truth.
Finally, we embed the document using our memory store. The embedding combines the title and summary, providing a rich semantic fingerprint of the paper:
embed_text = f"{title}\n\n{summary}"
self.memory.embedding.get_or_create(embed_text)
These embeddings allow us to later search, cluster, and rank papers based on similarity, even across different pipelines or research goals.
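As a rough sketch of what that later retrieval can look like (rank_papers_by_similarity is a hypothetical helper; it assumes the same memory.embedding interface used above), ranking is just cosine similarity between a query embedding and each stored paper embedding:

from sklearn.metrics.pairwise import cosine_similarity

def rank_papers_by_similarity(memory, query_text: str, papers: list[dict], top_k: int = 5):
    """Rank papers against a query by comparing title+summary embeddings."""
    query_vec = memory.embedding.get_or_create(query_text)
    scored = []
    for paper in papers:
        paper_vec = memory.embedding.get_or_create(f"{paper['title']}\n\n{paper['summary']}")
        score = float(cosine_similarity([query_vec], [paper_vec])[0][0])
        scored.append((score, paper["title"]))
    return sorted(scored, reverse=True)[:top_k]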
At this stage, each paper is:
- Parsed into clean text,
- Enriched with arXiv metadata (or optionally LLM-generated fallback),
- Embedded for retrieval,
- Stored in the database for downstream profiling and scoring.
At this stage we have a document record in the database:
class DocumentORM(Base):
    __tablename__ = "documents"

    id = Column(Integer, primary_key=True)
    goal_id = Column(Integer, ForeignKey("goals.id", ondelete="SET NULL"), nullable=True)
    title = Column(String, nullable=False)
    source = Column(String, nullable=False)
    external_id = Column(String, nullable=True)
    domain_label = Column(String, nullable=True)
    url = Column(String, nullable=True)
    summary = Column(Text, nullable=True)
    content = Column(Text, nullable=True)
    date_added = Column(DateTime(timezone=True), server_default=func.now())
    domains = Column(ARRAY(String), nullable=True)

    sections = relationship(
        "DocumentSectionORM",
        back_populates="document",
        cascade="all, delete-orphan"
    )
    domains_rel = relationship(
        "DocumentDomainORM",
        back_populates="document",
        cascade="all, delete-orphan"
    )

    def to_dict(self):
        ...
In the next section, we'll look at how we categorize documents using domains.
🧭 Classifying Knowledge with Domain Intelligence
As our AI system ingests papers and breaks them into meaningful sections, we face an important challenge: How do we know what each document or even each section is really about? Understanding the domain of a document (e.g., “reasoning”, “vision”, “symbolic learning”) is crucial for organizing knowledge, surfacing relevant content, and enabling downstream agents to specialize their reasoning.
This is where the DomainClassifier comes in.
🧠 What the DomainClassifier Does
The DomainClassifier is a lightweight but powerful component that assigns semantic labels, called domains, to documents and their sections. It works by comparing a document's content to a small set of seed phrases representing each domain. These seeds are defined in a YAML config file, and embeddings for each seed are generated using the same vector store already used throughout our system.
🌱 Understanding Domain Seeds
One of the critical components driving how we classify documents in this system is the seeds.yaml file. This file acts as a structured knowledge base that defines what each domain "sounds like." It provides a list of descriptive seed phrases per domain, effectively forming the vector blueprint for that domain.
Each list of phrases is embedded using a model like text-embedding-3-small, producing a domain centroid vector. When a new paper arrives, we embed the full text of the document and compute its cosine similarity to each domain's centroid. The top-matching domains (based on similarity score) are saved to the document_domains table if the match exceeds a configurable confidence threshold (e.g., min_classification_score: 0.6 in the document_loader config).
This simple but powerful setup gives us the ability to classify any research paper into multiple overlapping conceptual areas, even if it doesn't use the same vocabulary as the seed phrases.
For example, here's what a domain seed file might look like:
domains:
  symbolic:
    description: "Symbolic reasoning, planning, and logic-guided AI."
    seeds:
      - "Symbolic planning using language agents."
      - "Neuro-symbolic reasoning for AI alignment."
      - "Inductive logic programming with LLMs."
      - "Formal rule extraction from natural language."
      - "Symbolic systems for concept generalization."
  alignment:
    description: "Scalable oversight, alignment, and control of AI behavior."
    seeds:
      - "Scalable oversight and alignment strategies."
      - "Training language models to be helpful and harmless."
      - "Preventing goal misgeneralization in agents."
      - "Reward modeling for safety and usefulness."
      - "Evaluating AI via human preference learning."
  planning:
    description: "Strategic action planning with language and decision models."
    seeds:
      - "Hierarchical reinforcement learning for agents."
      - "Planning with tree search and transformers."
      - "Goal decomposition and reasoning chains."
      - "Language-driven policy generation."
      - "Meta-planning and strategy selection in LLMs."
  ...
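Given a seeds file like the one above, the centroid construction described earlier can be sketched in a few lines (the helper names here are illustrative, and the full DomainClassifier shown below compares against individual seed embeddings rather than centroids):

import numpy as np
import yaml
from sklearn.metrics.pairwise import cosine_similarity

def build_domain_centroids(memory, seeds_path="config/domain/seeds.yaml") -> dict:
    """Embed every seed phrase and average them into one centroid vector per domain."""
    with open(seeds_path, "r") as f:
        domains = yaml.safe_load(f)["domains"]
    return {
        name: np.mean([memory.embedding.get_or_create(s) for s in spec["seeds"]], axis=0)
        for name, spec in domains.items()
    }

def classify_by_centroid(memory, centroids: dict, text: str, top_k: int = 3, min_score: float = 0.6):
    """Rank domains by cosine similarity to the document and keep matches above the threshold."""
    doc_vec = memory.embedding.get_or_create(text)
    scores = [
        (name, float(cosine_similarity([doc_vec], [vec])[0][0]))
        for name, vec in centroids.items()
    ]
    scores.sort(key=lambda x: x[1], reverse=True)
    return [(name, s) for name, s in scores[:top_k] if s >= min_score]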
When a new document or section arrives, the DomainClassifier:
- Embeds the input text.
- Compares it against each domain seed embedding using cosine similarity.
- Returns the top-matching domains above a configurable confidence threshold.
flowchart TD
    %% Seed processing
    A[seeds.yaml] --> B[Seed Phrases Grouped by Domain]
    B --> C[Embed Each Phrase → Vector]
    C --> D[Compute Domain Centroids mean vector per domain]
    %% Document input
    E[Incoming Document Text] --> F1[Embed Full Document → Vector]
    E --> F2[Parse into Sections]
    %% Section-level processing
    F2 --> G[Embed Each Section → Vector]
    G --> H[Cosine Similarity with Domain Centroids]
    D --> H
    H --> I[Aggregate Section Similarities → Top K Matching Domains]
    I --> J[Assign Domains to Document]
    J --> K[Store in document_domains Table]
    %% Document embedding output
    F1 --> L[Store in document_embeddings Table]
    %% Styles
    style A fill:#f9f,stroke:#333,stroke-width:2px
    style E fill:#bbf,stroke:#333,stroke-width:2px
    style K fill:#dfd,stroke:#333,stroke-width:2px
    style L fill:#dfd,stroke:#333,stroke-width:2px
This approach gives us soft classification with semantic grounding. We're not doing brittle keyword matching or hard-coded labels. Instead, we use vector similarity to reason about what each text is truly about, even when it uses novel phrasing or unfamiliar terminology.
This brings several key advantages:
- ✅ Modularity: Domains are defined via config and can be extended without code changes.
- ✅ Generalization: Embedding-based matching allows classification even when exact seed terms don't appear.
- ✅ Granularity: Because it's lightweight, we apply this not just to entire documents but also to individual sections, a major step forward in understanding mixed-topic papers.
import yaml
from sklearn.metrics.pairwise import cosine_similarity
class DomainClassifier:
    def __init__(self, memory, logger, config_path="config/domain/seeds.yaml"):
        self.memory = memory
        self.logger = logger
        self.logger.log("DomainClassifierInit", {"config_path": config_path})
        with open(config_path, "r") as f:
            self.domain_config = yaml.safe_load(f)
        self.domains = self.domain_config.get("domains", {})
        self.logger.log("DomainConfigLoaded", {"num_domains": len(self.domains)})
        self._prepare_seed_embeddings()

    def _prepare_seed_embeddings(self):
        self.embeddings = []
        self.labels = []
        total_seeds = 0
        for domain, details in self.domains.items():
            seeds = details.get("seeds", [])
            total_seeds += len(seeds)
            for seed in seeds:
                embedding = self.memory.embedding.get_or_create(seed)
                self.embeddings.append(embedding)
                self.labels.append(domain)
        self.logger.log(
            "SeedEmbeddingsPrepared",
            {"total_seeds": total_seeds, "domains": list(self.domains.keys())},
        )

    def classify(self, text: str, top_k: int = 3, min_score: float = 0.7):
        embedding = self.memory.embedding.get_or_create(text)
        scores = []
        for domain, seed_embedding in zip(self.labels, self.embeddings):
            score = float(cosine_similarity([embedding], [seed_embedding])[0][0])
            scores.append((domain, score))
        sorted_scores = sorted(scores, key=lambda x: x[1], reverse=True)
        top_matches = sorted_scores[:top_k]
        # log warning if none meet the threshold
        if all(score < min_score for _, score in top_matches):
            self.logger.log(
                "LowDomainScore",
                {"text_snippet": text[:100], "top_scores": top_matches},
            )
        return top_matches
This is the process we follow:
flowchart LR
    A[📄 domain seeds YAML config] --> B[🧠 Embed each seed phrase]
    B --> C[💾 Store seed embeddings]
    D[📄 Document text] --> E[🧠 Embed document]
    C --> F[📐 Cosine similarity seed ↔ doc]
    E --> F
    F --> G{Score ≥ threshold?}
    G -- Yes --> H[🏷️ Assign domain labels]
    G -- No --> I[❌ Skip domain]
    subgraph Section-level Classification
        J[📄 Section text] --> K[🧠 Embed section]
        C --> L[📐 Cosine similarity seed ↔ section]
        K --> L
        L --> M{Score ≥ threshold?}
        M -- Yes --> N[🏷️ Assign domain to section]
        M -- No --> O[❌ Skip section]
    end
🔍 How We Use It
We apply the classifier in two places:
- At the document level, immediately after text extraction and summarization.
The classification is controlled through the agent configuration
document_loader:
  name: document_loader
  force_domain_update: true
  top_k_domains: 3
  min_classification_score: 0.6
  domain_seed_config_path: "config/domain/seeds.yaml"
  max_chars_for_summary: 16000
  summarize_documents: true
Field | Type | Description |
---|---|---|
name | string | The name identifier for this pipeline stage (used for logging and tracing). |
force_domain_update | boolean | If true, forces re-classification of document domains even if already assigned. |
top_k_domains | integer | Number of top-scoring domains to assign to each document. |
min_classification_score | float | Minimum cosine similarity score for a domain to be considered valid. |
domain_seed_config_path | string | Path to the YAML file that contains seed phrases for each domain. |
max_chars_for_summary | integer | Maximum number of characters to consider when summarizing document text. |
summarize_documents | boolean | Whether to generate a summary (via LLM). |
Note that we still assign the top-matching domains even if their scores fall below the threshold.
def assign_domains_to_document(self, document):
    """
    Classifies the document text into one or more domains,
    and stores results in the document_domains table.
    """
    content = document.content
    if content:
        results = self.domain_classifier.classify(
            content, self.top_k_domains, self.min_classification_score
        )
        for domain, score in results:
            self.memory.document_domains.insert({
                "document_id": document.id,
                "domain": domain,
                "score": score,
            })
            self.logger.log("DomainAssigned", {
                "title": document.title[:60] if document.title else "",
                "domain": domain,
                "score": score,
            })
    else:
        self.logger.log("DocumentNoContent", {
            "document_id": document.id,
            "title": document.title[:60] if document.title else "",
        })
- At the section level, after we split the document into discrete segments using the DocumentProfilerAgent.
Each section can have different domain tags, allowing us to track exactly which part of a paper is about reasoning, which is about learning, and which is about evaluation, for example.
for doc in documents:
    ...
    detected_domains = self.domain_classifier.classify(text)
    ...
    for section, text in chosen.items():
        self.memory.document_section.upsert(
            {
                "document_id": doc_id,
                "section_name": section,
                "section_text": text,
                "source": "unstructured+llm",
                "domains": detected_domains,
                "summary": generated_summary,
            }
        )
This enables our knowledge system to become domain-aware at every level, empowering smarter selection, routing, and synthesis by future agents.
🧩 Breaking Down a Document: From Raw Text to Structured Sections
flowchart LR
    A[🎯 SurveyAgent<br/>Find goal-related seed papers]
    B[🔍 SearchOrchestratorAgent<br/>Expand with related papers]
    C[📥 DocumentLoaderAgent<br/>Download & extract text]
    D[🧠 DocumentProfilerAgent<br/>Enrich, embed, and segment]:::highlighted
    E[📊 PaperScoreAgent<br/>Rate for novelty, relevance, etc.]
    F[📚 KnowledgeLoaderAgent<br/>Store as structured knowledge]
    A --> B --> C --> D --> E --> F
    classDef highlighted fill:#ffebcc,stroke:#ffaa00,stroke-width:2px;
One of the key challenges in reasoning over research papers, blog posts, or long-form documents is understanding their structure. These documents often contain valuable ideas buried under dense formatting or inconsistent section headings. To extract and analyze meaningful content, we need to convert unstructured text into structured, labeled segments.
That's where our DocumentSectionParser comes in.
🔍 What It Does
The DocumentSectionParser uses Unstructured to segment text into a series of typed elements like Title, Heading, and NarrativeText. From there, the parser:
- Groups content under detected headings using Unstructured's partitioning output.
- Cleans and normalizes the headings (e.g., removing digits, punctuation, and case variation).
- Maps the cleaned headings to canonical categories (like Motivation, Method, Results) using a configurable YAML mapping (target_sections.yaml).
- Filters out garbage sections using regular expression rules and a configurable minimum content threshold.
- Returns a cleaned, structured dictionary of {section_name: content} suitable for downstream processing, scoring, or reasoning.
📄 Example: From Raw Text to Labeled Sections
Given a document like:
1. Introduction
In this paper, we explore...
2. Related Work
Prior methods have focused on...
3. Approach
Our model differs by...
The parser will return:
{
"motivation": "In this paper, we explore...",
"background": "Prior methods have focused on...",
"method": "Our model differs by..."
}
All section labels are standardized, making it easier to apply evaluation logic, domain scoring, or prompt selection downstream.
📑 Parsing Papers into Meaningful Sections
To help the DocumentSectionParser make sense of the diverse and inconsistent headings used in research papers, we rely on a configuration file: target_sections.yaml.
This YAML file defines the canonical sections we want to extract, like abstract, method, or results, and provides a list of synonyms or variations commonly used in real papers for each one. For example:
title:
- title
abstract:
- abstract
- summary
introduction:
- introduction
- intro
related_work:
- related work
- background
- prior work
- literature review
method:
- method
- methods
- methodology
- approach
- algorithm
implementation:
- implementation
- code
- technical details
results:
- results
- result
- evaluation
- performance
discussion:
- discussion
- analysis
- interpretation
conclusion:
- conclusion
- conclusions
- final remarks
limitations:
- limitations
- drawbacks
- challenges
future_work:
- future work
- next steps
- extensions
references:
- references
- bibliography
- works cited
This parser turns raw document text into a structured, labeled format that maps each section of the paper to a specific purpose such as “method,” “results,” or “conclusion.”
import json
import re
from pathlib import Path

import yaml
from fuzzywuzzy import process

class DocumentSectionParser:
    def __init__(self, cfg=None, logger=None):
        self.cfg = cfg or {}
        self.logger = logger or print
        self.min_chars_per_sec = self.cfg.get("min_chars_per_sec", 20)
        # Load target sections from YAML
        self.config_path = self.cfg.get(
            "target_sections_config",
            "config/domain/target_sections.yaml"
        )
        self.TARGET_SECTIONS = self._load_target_sections()
        self.SECTION_TO_CATEGORY = self._build_section_to_category()

    def parse(self, text: str) -> dict:
        from unstructured.partition.text import partition_text
        from unstructured.staging.base import elements_to_json

        elements = partition_text(text=text)
        json_elems = elements_to_json(elements)
        structure = self.parse_unstructured_elements(json.loads(json_elems))
        cleaned = {self.clean_section_heading(k): v for k, v in structure.items()}
        mapped = self.map_sections(cleaned)
        final = self.trim_low_quality_sections(mapped)
        return final

    def _load_target_sections(self) -> dict:
        """Load TARGET_SECTIONS from a YAML file"""
        path = Path(self.config_path)
        if not path.exists():
            raise FileNotFoundError(f"Target sections config not found: {path}")
        with open(path, "r", encoding="utf-8") as f:
            return yaml.safe_load(f)

    def _build_section_to_category(self) -> dict:
        """Build reverse lookup map from synonyms to categories"""
        mapping = {}
        for cat, synonyms in self.TARGET_SECTIONS.items():
            for synonym in synonyms:
                normalized = self._normalize(synonym)
                mapping[normalized] = cat
        return mapping

    def _normalize(self, name: str) -> str:
        return re.sub(r"[^a-z0-9]", "", name.lower().strip())

    def parse_unstructured_elements(self, elements: list[dict]) -> dict[str, str]:
        current_section = None
        current_content = []
        structured = {}
        for el in elements:
            el_type = el.get("type")
            el_text = el.get("text", "").strip()
            if not el_text:
                continue
            if el_type in ("Title", "Heading"):
                if current_section and current_content:
                    structured[current_section] = "\n\n".join(current_content).strip()
                current_section = el_text.strip()
                current_content = []
            elif el_type in ("NarrativeText", "UncategorizedText", "ListItem"):
                if current_section:
                    current_content.append(el_text)
        if current_section and current_content:
            structured[current_section] = "\n\n".join(current_content).strip()
        return structured

    def clean_section_heading(self, heading: str) -> str:
        if not heading:
            return ""
        heading = re.sub(r"^\s*[\d\.\s]+\s*", " ", heading)
        heading = re.sub(r"^(section|chapter|part)\s+\w+", "", heading, flags=re.IGNORECASE)
        heading = re.sub(r"[^\w\s]", "", heading)
        heading = re.sub(r"\s+", " ", heading).strip()
        return heading

    def map_sections(self, parsed_sections: dict[str, str]) -> dict[str, str]:
        mapped = {}
        for sec_name, content in parsed_sections.items():
            normalized = self._normalize(sec_name)
            if normalized in self.SECTION_TO_CATEGORY:
                category = self.SECTION_TO_CATEGORY[normalized]
                mapped[category] = content
            else:
                best_match, score = process.extractOne(normalized, self.SECTION_TO_CATEGORY.keys())
                if score > 75:
                    category = self.SECTION_TO_CATEGORY[best_match]
                    mapped[category] = content
        return mapped

    def is_valid_section(self, text: str) -> bool:
        if not text or len(text.strip()) < 10:
            return False
        garbage_patterns = [
            r"^\d+$",
            r"^[a-zA-Z]$",
            r"^[A-Z][a-z]+\s\d+$",
            r"^[ivxlcdmIVXLCDM]+$",
            r"^[\W_]+$",
            r"^[^\w\s].{0,20}$"
        ]
        for pattern in garbage_patterns:
            if re.fullmatch(pattern, text.strip()):
                return False
        return True

    def trim_low_quality_sections(self, structured_data: dict[str, str]) -> dict[str, str]:
        cleaned = {}
        for key, text in structured_data.items():
            if self.is_valid_section(text):
                cleaned[key] = text
            else:
                self.logger.log("TrimmingSection",
                                {"section": key, "data": text[:50]})
        return cleaned
Here’s how it works, step by step:
🛠️ Initialization and Configuration
def __init__(self, cfg=None, logger=None):
The parser accepts a configuration dictionary and a logger. Most importantly, it loads a YAML file that defines the target sections we care about (e.g., "abstract", "methods", "findings"). These sections may appear in many different forms in papers; this file helps us normalize them.
self.TARGET_SECTIONS = self._load_target_sections()
self.SECTION_TO_CATEGORY = self._build_section_to_category()
We build a reverse mapping of section names to their canonical categories, allowing us to group headings like "Experiments" and "Methodology" under a common "method" label.
📄 Parsing Document Text
The core logic happens in the parse() method:
def parse(self, text: str) -> dict:
This method processes the raw document text in several steps:
- Partitioning text into elements using the unstructured library.
- Parsing structured sections using parse_unstructured_elements().
- Cleaning up section headings with clean_section_heading().
- Mapping section titles to canonical categories with map_sections().
- Filtering out low-quality content with trim_low_quality_sections().
The result is a clean, dictionary-style mapping from logical section names (like "method" or "results") to their actual text content.
🧱 Building the Structure
def parse_unstructured_elements(self, elements: list[dict]) -> dict[str, str]:
This method walks through the parsed elements, separating the document into a series of headings and the content under them. It keeps track of the current section title and accumulates text lines under each heading.
🧼 Normalizing Section Headings
def clean_section_heading(self, heading: str) -> str:
Section titles often include noise like numbers or formatting artifacts (e.g., "1. Introduction" or "Chapter 2 - Results"). This method strips out that noise so the parser can match the title to a canonical form.
🧠 Mapping to Target Categories
def map_sections(self, parsed_sections: dict[str, str]) -> dict[str, str]:
After cleaning, this method matches each heading to a target category using fuzzy matching. For example, both "Findings" and "Results" might map to the canonical "results" section.
If an exact match isn't found, the method uses the fuzzywuzzy library to select the closest match with a confidence threshold.
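For reference, the fuzzy lookup itself is a single call to fuzzywuzzy's process.extractOne, mirroring the threshold used in map_sections() (the synonym keys below are illustrative):

from fuzzywuzzy import process

# Normalized synonym keys, as built by _build_section_to_category()
synonym_keys = ["results", "evaluation", "method", "methodology", "relatedwork"]

best_match, score = process.extractOne("experimentalresults", synonym_keys)
if score > 75:
    print(best_match, score)  # likely "results" with a high confidence score (0-100)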
🔮 Filtering Out Noise
def trim_low_quality_sections(self, structured_data: dict[str, str]) -> dict[str, str]:
Sometimes headings or content are just junk: single letters, Roman numerals, or short fragments. This method uses regular expressions to catch and exclude those, ensuring we only keep high-quality content.
By breaking down a document into canonical, cleanly labeled sections, we enable downstream agents like scorers, evaluators, or prompt compilers to focus on exactly the parts they need. Whether it’s analyzing methods, extracting contributions, or reviewing conclusions, this structured representation is our foundation for intelligent document understanding. In short: this parser turns messy PDFs into structured knowledge.
🤖 Fallback to LLM: When Structure Fails
While our primary extraction method relies on the unstructured library to parse documents into meaningful sections, we included a fallback mechanism that uses a language model to guide the process when the structure is ambiguous or low quality.
In this approach, the system prompts the LLM to suggest relevant section headings and heuristically slices the document between those headings. It then attempts to match these chunks back to our target sections (like methods, results, etc.).
However, in our experience, especially when using local models via tools like Ollama, this LLM-guided fallback was often imprecise. Section boundaries were vague, and the generated summaries lacked consistency across documents. This led us to rely more heavily on the unstructured parser, which, while imperfect, provided more reliable segmentation when paired with our synonym-based mapping and quality filtering heuristics.
We kept the LLM fallback in place for edge cases, but tuned the system to favor the structured route wherever possible.
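To make the fallback concrete, here is a rough sketch of the heading-guided slicing idea (slice_by_headings is an illustrative name, not the exact implementation): the LLM proposes headings, and the text between consecutive heading positions becomes a candidate section that is then mapped back to the target sections.

import re

def slice_by_headings(text: str, suggested_headings: list[str]) -> dict[str, str]:
    """Heuristically slice a document between headings suggested by an LLM."""
    positions = []
    for heading in suggested_headings:
        match = re.search(re.escape(heading), text, flags=re.IGNORECASE)
        if match:
            positions.append((match.start(), heading))
    positions.sort()
    chunks = {}
    for i, (start, heading) in enumerate(positions):
        end = positions[i + 1][0] if i + 1 < len(positions) else len(text)
        chunks[heading] = text[start:end].strip()
    return chunks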
🧠 What This Enables
This modular parsing step powers multiple downstream agents:
- The PaperScoreAgent uses the parsed sections to assign multidimensional scores (e.g., correctness, clarity, originality).
- The PromptCompilerAgent can target specific sections (like method) for prompt tuning or hypothesis refinement.
- The domain classifier can reason about which section most clearly represents the paper's core domain.
🔧 Configurable & Extendable
Want to support a new section type like Conclusion or Ethics? Just add synonyms to the target_sections.yaml file, and the parser will start mapping them automatically.
method:
- approach
- method
- model
- technique
motivation:
- motivation
- intro
- background
💡 Summary
The DocumentSectionParser acts as the translation layer between raw, messy research text and structured, analyzable components. It lets the rest of our system focus on what matters: scoring ideas, evaluating reasoning, and generating insights, not decoding inconsistent formatting.
📊 Scoring a Paper From Every Angle
flowchart LR
    A[🎯 SurveyAgent<br/>Find goal-related seed papers]
    B[🔍 SearchOrchestratorAgent<br/>Expand with related papers]
    C[📥 DocumentLoaderAgent<br/>Download & extract text]
    D[🧠 DocumentProfilerAgent<br/>Enrich, embed, and segment]
    E[📊 PaperScoreAgent<br/>Rate for novelty, relevance, etc.]:::highlighted
    F[📚 KnowledgeLoaderAgent<br/>Store as structured knowledge]
    A --> B --> C --> D --> E --> F
    classDef highlighted fill:#ffebcc,stroke:#ffaa00,stroke-width:2px;
One of the core innovations in our pipeline is the multi-dimensional scoring system implemented in the PaperScoreAgent. Unlike traditional single-score evaluation systems, this adapter evaluates each document along several distinct axes of merit, enabling more nuanced and interpretable assessments.
This is an example configuration for scoring papers. We support two output formats: simple and cor.
output_format: simple
dimensions:
  - name: relevance
    file: relevance
    weight: 1.5
    extra_data: { parser: numeric }
  - name: novelty
    file: novelty
    weight: 1.2
    extra_data: { parser: numeric }
  - name: implementability
    file: implementability
    weight: 1.0
    extra_data: { parser: numeric }
This is a selection of the scoring prompts that are useful for this task.
Prompt File | Score Dimension | Description | Purpose in Scoring |
---|---|---|---|
clarity.txt | Clarity | Evaluates how clearly the paper presents its ideas, methods, and results. | Helps determine if the content is accessible and understandable to readers. |
feasibility.txt | Feasibility | Assesses how practical and realistic it is to implement the ideas presented in the paper. | Filters out overly theoretical ideas that are hard to realize in practice. |
implementability.txt | Implementability | Focuses on whether the system, framework, or method can be built with available tools. | Indicates the technical viability of turning the paper into a working system. |
integration_fit.txt | Integration Fit | Measures how well the proposed method could integrate with existing systems or workflows. | Useful for evaluating papers for system expansion or plugin potential. |
modularity.txt | Modularity | Assesses whether the components described can be separated and reused independently. | Encourages preference for composable and flexible research artifacts. |
novelty.txt | Novelty | Examines whether the idea or approach is new in the context of existing literature. | Helps prioritize papers that bring fresh insights or methodologies. |
originality.txt | Originality | Measures the paperโs creative departure from standard or known methods. | Encourages discovery of innovative, unique contributions. |
performance_gain_potential.txt | Performance Gain Potential | Evaluates the potential for the approach to improve measurable outcomes (e.g., speed, accuracy). | Important for identifying research that could push the state-of-the-art forward. |
relevance.txt | Relevance | Assesses how closely aligned the content is to the current research goal. | Ensures selected documents are pertinent to the ongoing task or question. |
🎯 Why Multi-Dimensional?
In research and reasoning workflows, a document can be:
- Correct but uninspired,
- Creative but technically flawed,
- Relevant but poorly written.
A single score can't capture these tradeoffs. Multi-dimensional scoring allows us to reason more like a reviewer: evaluating a paper's correctness, originality, clarity, relevance, and more, each as a standalone quality.
This is an example score result.
📊 Dimension Scores Summary
Dimension | Score | Weight | Rationale (preview) |
---|---|---|---|
relevance | 65 | 1.5 | The paper introduces a method to enhance reasonin... |
novelty | 75 | 1.2 | The paper introduces a novel approach by training... |
implementability | 95 | 1.0 | The paper describes a modular approach with a cle... |
FINAL | 76.35 | - | Weighted average |
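The FINAL row is simply the weight-normalized average of the dimension scores; the arithmetic behind 76.35 works out like this:

# Reproducing the FINAL score from the table above: a weight-normalized average.
scores = {"relevance": (65, 1.5), "novelty": (75, 1.2), "implementability": (95, 1.0)}

weighted_sum = sum(score * weight for score, weight in scores.values())  # 282.5
total_weight = sum(weight for _, weight in scores.values())              # 3.7
print(round(weighted_sum / total_weight, 2))                             # 76.35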
🛠️ How It Works
1. Document Selection
We begin with a filtered set of documents (usually sourced from arXiv, Hugging Face, or internal corpora) that are deemed domain-relevant using the KnowledgeLoaderAgent.
2. Score Check
For each document:
- If prior scores exist (and force_rescore is not set), we reuse them.
- Otherwise, we proceed to scoring.
3. LLM-Based Evaluation
Each document is scored using an LLM prompt designed to generate structured scores for each dimension. For example:
{
"correctness": 0.8,
"originality": 0.9,
"clarity": 0.7,
"relevance": 0.95
}
Novelty prompt example
This is an example prompt. Here we use the title and summary to determine the score. Each scoring prompt returns a rationale and a score between 0 and 100.
You are evaluating a research paper for its novelty.
Paper title: {{ document.title }}
Paper summary: {{ document.summary }}
Does the paper introduce new concepts, architectures, or techniques that are not commonly found in existing work on reasoning, planning, or self-evaluation in AI?
Return your review in the exact structured format below. Do not include headings, markdown, or additional commentary. Use only plain text fields as shown:
rationale: <brief explanation>
score: <0-100>
This output is parsed and converted into rows in the scores table, each tied to a specific evaluation record for traceability.
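The exact parsing lives in the PaperScoreAgent's numeric parser; a minimal sketch of reading that plain-text format could look like this (parse_review is a hypothetical name):

import re

def parse_review(raw: str) -> dict:
    """Parse the 'rationale: ... / score: ...' format returned by the scoring prompts."""
    rationale = re.search(r"rationale:\s*(.+)", raw, flags=re.IGNORECASE)
    score = re.search(r"score:\s*(\d{1,3})", raw, flags=re.IGNORECASE)
    return {
        "rationale": rationale.group(1).strip() if rationale else "",
        "score": int(score.group(1)) if score else 0,
    }

print(parse_review("rationale: Introduces a novel training loop.\nscore: 75"))
# {'rationale': 'Introduces a novel training loop.', 'score': 75}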
An example score result
{
"relevance": {"score": 80, "rationale": "Covers reinforcement learning methods relevant to self-improvement"},
"novelty": {"score": 60, "rationale": "Uses common PPO variant without novel extensions"},
"clarity": {"score": 95, "rationale": "Well-structured, includes implementation details"},
"feasibility": {"score": 70, "rationale": "Can be implemented with standard frameworks"},
"impact": {"score": 85, "rationale": "May improve training efficiency in early stages"}
}
4. Dimension-Aware Aggregation
We then compute an average per dimension, ignoring zero/invalid scores. This provides a quality profile for each paper: a fingerprint of its strengths and weaknesses.
5. Usage in the Pipeline
These scores feed into:
- Ranking systems (e.g., selecting the top 5 documents most relevant and correct),
- Router agents that choose the best reasoning model or strategy based on what the current goal values (e.g., originality vs correctness),
- Training data filters, ensuring that only high-quality samples contribute to model tuning.
🚀 What Makes It Powerful?
This adapter turns each paper into a multi-dimensional vector of quality. That opens the door to:
- Comparative judgments across dimensions,
- Contrastive pair training (as in MR.Q),
- Symbolic rule learning about which types of documents help with which goals,
- And eventually, meta-reasoning over document effectiveness.
🔄 Continuous Refinement
As we gather more scores across different domains and tasks, we use this information to:
- Improve the LLM scoring prompts,
- Fine-tune downstream rankers (e.g., SVMs or reward models),
- Guide PromptCompilerAgent decisions by training it on which prompts lead to high scores across dimensions.
This system turns document evaluation from a bottleneck into a rich signal, powering every layer of the self-improving Co AI pipeline.
💡 Adaptive Document Selection: Studying What Matters
flowchart LR
    A[🎯 SurveyAgent<br/>Find goal-related seed papers]
    B[🔍 SearchOrchestratorAgent<br/>Expand with related papers]
    C[📥 DocumentLoaderAgent<br/>Download & extract text]
    D[🧠 DocumentProfilerAgent<br/>Enrich, embed, and segment]
    E[📊 PaperScoreAgent<br/>Rate for novelty, relevance, etc.]
    F[📚 KnowledgeLoaderAgent<br/>Store as structured knowledge]:::highlighted
    A --> B --> C --> D --> E --> F
    classDef highlighted fill:#ffebcc,stroke:#ffaa00,stroke-width:2px;
In large, heterogeneous document collections, simply keyword-matching a goal to text is insufficient. We need smarter ways to understand what a goal is really about and identify documents that align with that intent. This is where the KnowledgeLoaderAgent comes in.
🎯 Purpose
The KnowledgeLoaderAgent is designed to select the most suitable documents for a given research goal by adapting to domain-level semantics and multi-dimensional document scoring. Rather than matching raw text, it uses domain-specific embeddings to rank documents based on conceptual proximity.
In short: the Knowledge Agent is where noise becomes signal. And that signal is the foundation of everything that follows.
🧩 How It Works
- 🧭 Domain Embedding Seeds: Each research domain (e.g., "LLM Optimization", "Knowledge Retrieval", "Symbolic Reasoning") is associated with a small set of seed examples: concise phrases or representative goals. These seeds are embedded and averaged to form a domain centroid vector.
- 🎯 Goal Classification via Embedding Similarity: When a new goal enters the pipeline, it is embedded using a local embedding model (memory.embedding.get_or_create). The system computes cosine similarity between the goal vector and each domain centroid to identify the most relevant domain. This domain assignment helps scope the retrieval to the most contextually aligned documents.
- 📊 Document Filtering by Domain + Quality: Each document has already been annotated with:
  - One or more domain scores (via prior classification), and
  - A set of multi-dimensional scores (e.g., clarity, feasibility, novelty) from the dynamic scoring stage.
  The Knowledge Agent filters documents using two criteria:
  - ✅ Domain Match: The document must be tagged with the same domain as the goal, and the domain score must exceed a minimum threshold (e.g., 0.6).
  - ✅ Quality Match (optional): If enabled via config (use_multidimensional_scores: true), the agent can further prioritize documents that:
    - Score above a specified threshold on any or all scoring dimensions, or
    - Are ranked among the top-k highest scoring documents across selected dimensions.
  This ensures that not only is the document about the right thing; it's also well-written, original, implementable, and useful.
- 📄 Context-Aware Return Format: Depending on configuration, the agent can return summaries (summary) for compact processing, or full document content (text) for richer downstream pipelines like symbolic reasoning, hypothesis generation, or tool synthesis.
This approach allows the system to focus attention on the most promising knowledge, even across thousands of documents. By aligning goals to domain vectors, we simulate a kind of semantic routing, making the system behave like an adaptive information filter.
⚙️ Example Configuration
knowledge_loader:
  name: knowledge_loader
  domain_seeds: ${path:config/domain/seeds.yaml}
  top_k: 3
  domain_threshold: 0.4
  include_full_text: false
This is the agent code as of this blog post
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

class KnowledgeLoaderAgent(BaseAgent):
    def __init__(self, cfg, memory=None, logger=None):
        super().__init__(cfg, memory, logger)
        self.domain_seeds = cfg.get("domain_seeds", {})
        self.top_k = cfg.get("top_k", 3)
        self.threshold = cfg.get("domain_threshold", 0.0)
        self.include_full_text = cfg.get("include_full_text", False)
        # Optional scoring configuration
        self.use_dimensional_scores = cfg.get("use_dimensional_scores", False)
        self.dimension_weights = cfg.get("dimension_weights", {
            "relevance": 1.0,
            "usefulness": 0.8,
            "clarity": 0.6,
            "implementability": 0.7,
            "novelty": 0.5,
            "integration_fit": 0.6,
        })
        self.min_weighted_score = cfg.get("min_weighted_score", 0.5)

    async def run(self, context: dict) -> dict:
        goal = context.get(GOAL)
        goal_text = goal.get("goal_text", "")
        documents = context.get("documents", [])
        if not goal_text or not documents:
            self.logger.log("DocumentFilterSkipped", {"reason": "Missing goal or documents"})
            return context

        # Step 1: Assign domain to the goal
        goal_vector = self.memory.embedding.get_or_create(goal_text)
        domain_vectors = {
            domain: np.mean([self.memory.embedding.get_or_create(ex) for ex in examples], axis=0)
            for domain, examples in self.domain_seeds.items()
        }
        goal_domain, goal_domain_score = None, -1
        for domain, vec in domain_vectors.items():
            score = float(cosine_similarity([goal_vector], [vec])[0][0])
            if score > goal_domain_score:
                goal_domain = domain
                goal_domain_score = score
        context["goal_domain"] = goal_domain
        context["goal_domain_score"] = goal_domain_score
        self.logger.log("GoalDomainAssigned", {"domain": goal_domain, "score": goal_domain_score})

        # Step 2: Filter documents based on domain + optional dimensional scores
        filtered = []
        for doc in documents:
            doc_domains = self.memory.document_domains.get_domains(doc["id"])
            if not doc_domains:
                continue
            for dom in doc_domains[:self.top_k]:
                if dom.domain == goal_domain and dom.score >= self.threshold:
                    # Optional: score-based filtering
                    if self.use_dimensional_scores:
                        score = self.compute_weighted_score(doc["id"])
                        if score < self.min_weighted_score:
                            continue  # reject document
                    else:
                        score = None
                    selected_content = doc["text"] if self.include_full_text else doc["summary"]
                    filtered.append({
                        "id": doc["id"],
                        "title": doc["title"],
                        "domain": dom.domain,
                        "domain_score": dom.score,
                        "doc_score": score,
                        "content": selected_content
                    })
                    break

        context[self.output_key] = filtered
        context["filtered_document_ids"] = [doc["id"] for doc in filtered]
        self.logger.log("DocumentsFiltered", {
            "count": len(filtered),
            "used_scores": self.use_dimensional_scores,
            "min_score_threshold": self.min_weighted_score if self.use_dimensional_scores else None,
            "dimensions": list(self.dimension_weights.keys()) if self.use_dimensional_scores else None
        })
        return context

    def compute_weighted_score(self, doc_id: str) -> float:
        scores = self.memory.document_scores.get_scores(doc_id)
        if not scores:
            return 0.0
        total, weight_sum = 0.0, 0.0
        for dim, weight in self.dimension_weights.items():
            dim_score = next((s.score for s in scores if s.dimension == dim), None)
            if dim_score is not None:
                total += weight * dim_score
                weight_sum += weight
        return total / weight_sum if weight_sum > 0 else 0.0
🗄️ Structured Storage and Future Feedback Loops
Every domain assignment, document selection, and multi-dimensional score is not only logged but also saved to the database in a structured format. This persistent knowledge base enables far more than just traceability; it becomes the training ground for the system's next evolution. By combining domain tags, content metadata, and scoring dimensions (like clarity, novelty, symbolic alignment), we lay the foundation for downstream agents to learn from historical data.
In upcoming stages, this data will be used to drive MR.Q-based ranking, DPO-style prompt refinement, and even automated rule tuning. The result is a dynamic, continuously improving system where feedback isn't just collected; it's operationalized into behavior. This is how the system learns what good knowledge looks like, and how it should shape the reasoning strategies that come next.
🧠 Conclusion: From Chaos to Coordinated Knowledge
In this post, we've shown how to transform chaotic, unstructured research papers into structured, ranked, and goal-filtered knowledge using a modular Document Intelligence system. Each stage, from domain assignment and section parsing to dimensional scoring and filtering, is handled by purpose-built agents working in tandem. Importantly, we've favored unstructured, local parsing and scoring over LLM-based black boxes, allowing us to retain interpretability, efficiency, and control throughout the pipeline.
But this is more than cleanup; this is bootstrapping intelligence. The knowledge output from this process feeds directly into the next phase: self-improving AI workflows. Our PromptCompilerAgent, for instance, will consume these structured documents to generate higher-quality prompts. Then, using MR.Q-based preference tuning, we'll evaluate and refine those prompts based on outcomes, creating a feedback loop where the system learns from its own behavior. In short: this is not just document understanding; it's the foundation for a self-learning AI research agent. And among all our previous milestones, this one might be the most decisive step toward that vision.
🗺️ Knowledge Flow Toward Self-Learning
flowchart TD
    A[Unstructured Documents PDFs, Papers] --> B[DocumentProfilerAgent<br/>Parse + Structure]
    B --> C[DomainClassifier<br/>Assign Domain Tags]
    B --> D[DimensionalScorer<br/>Score: Clarity, Novelty, etc.]
    C & D --> E[KnowledgeLoaderAgent<br/>Filter by Goal Domain & Score Threshold]
    E --> F[Filtered Structured Documents]
    F --> G[PromptCompilerAgent<br/>Generate Prompts Based on Knowledge]
    G --> H[Prompt Evaluation MR.Q]
    H --> I[Prompt Tuning / Self-Improvement]
    I --> J[Enhanced Reasoning Pipeline]
🎯 Our Aim: To build a system of local, interpretable agents that doesn't just match the output quality of large language models but surpasses them in consistency, control, and clarity. By composing modular reasoning agents, domain-aware filters, and prompt-programmable scaffolds, we gain traceable intelligence, tunable behavior, and a feedback loop that learns from every decision. The result is not just an alternative to powerful LLMs; it's a more accountable, composable, and continuously improvable reasoning system.
📚 References
- Domain2Vec: Vectorizing Datasets to Find the Optimal Data Mixture without Training (arXiv: 2506.10952). Inspired the design of embedding-based domain classification for document filtering. https://arxiv.org/abs/2506.10952
- Paper2Poster: Towards Multimodal Poster Automation from Scientific Papers (arXiv: 2505.21497). Informed the structural parsing of scientific documents into reusable components. https://arxiv.org/abs/2505.21497
- Unstructured.io (document parsing tools). Used for segmenting and structurally parsing raw PDF/text documents. https://unstructured.io
- FuzzyWuzzy (fuzzy string matching for section heading normalization). SeatGeek's FuzzyWuzzy library was used to align arbitrary section headings with canonical categories. https://github.com/seatgeek/fuzzywuzzy
- Cosine Similarity & Embeddings (domain matching and semantic similarity). Domain classification and filtering rely on cosine similarity of vector embeddings (via a local model or sentence transformers).
- Hydra Configuration System (flexible YAML-based configuration for agents). Used throughout the Co AI pipeline for modular agent configuration. https://hydra.cc
📖 Glossary
Term | Definition |
---|---|
Knowledge Ingestion Pipeline | The series of steps through which documents are retrieved, parsed, embedded, classified, and scored. |
Domain Seeds | Short example texts representing key domains; used as anchor points to classify new documents by similarity. |
Embedding | A numerical vector representation of text that preserves semantic meaning, enabling similarity comparison. |
Cosine Similarity | A metric for measuring how similar two vector embeddings are, commonly used for semantic matching. |
Target Sections | Canonical paper sections like Abstract, Method, or Results into which raw documents are categorized. |
DocumentSectionParser | A parser that uses the Unstructured library to extract and normalize meaningful sections from papers. |
Multi-Dimensional Scoring | A method of evaluating documents or hypotheses along several quality dimensions (e.g., clarity, correctness). |
MR.Q | A structured scoring method for comparing hypotheses or documents, inspired by preference modeling. |
ScoreORM / EvaluationORM | Database models that store structured scoring information for documents, hypotheses, or agent outputs. |
Hydra | A configuration management system used to define flexible YAML-based agent and pipeline configs. |
Unstructured.io | A library for parsing unstructured documents into structured elements like headings and paragraphs. |
Fuzzy Matching | A method of aligning approximate strings (like section titles) using libraries like FuzzyWuzzy. |
Document Domains | Categories assigned to documents based on their similarity to predefined seed examples. |
Agent | A functional component in the Co AI framework responsible for performing a specific task (e.g., scoring, parsing). |
Pipeline | A sequential flow of agents orchestrated to achieve a research or reasoning objective. |
Prompt Compiler | A component that builds and refines prompts based on structured inputs and performance feedback. |
Self-Improving AI | An AI system that can evaluate its performance and adapt its internal tools and knowledge accordingly. |