Thinking in Primitives: Why AI Reasoning Should Learn to Point

From visual primitives to context-filtered reasoning, grounded verification, and AI movie repair

TL;DR

This post argues that AI reasoning should not operate over everything it can see, read, or detect. It should operate over the right primitives for the current task.

The paper Thinking with Visual Primitives shows that multimodal models reason better when they can point to visual entities using boxes and points. That solves a Reference Gap: language is often too vague to anchor reasoning to the right part of an image.

This post extends that idea with a second gap:

The Relevance Gap: even if a system can point, it still needs to know which things are worth pointing at.

The proposed architecture is:

raw content
candidate primitives
context filter
active primitive set
relations
verification
repair

In practice, tools like YOLO, pose models, trackers, and segmentation systems can extract candidate primitives: boxes, points, keypoints, masks, and trajectories. A reasoning layer then filters those primitives by context, checks relations such as touching, holding, aligned_with, or transferred_to, and produces grounded decisions or repair instructions.

The key claim:

Reasoning is not full reconstruction. Reasoning is context-filtered primitive selection followed by relation verification.

This matters for AI-generated image critique, entity interaction detection, object placement verification, and movie generation. Instead of saying “the handoff is unclear,” a primitive verifier can say:

The envelope never moved from Alice's hand region into Maria's hand region.
Regenerate frames 32–52 with a visible transfer.

That is the difference between vague critique and grounded, repairable reasoning.


Abstract

AI systems usually answer in language, but reasoning does not have to begin there. Human thought often appears to work through sparse, task-relevant primitives: objects, relations, locations, contact points, trajectories, constraints, and affordances. When we think of an airplane, we do not load a full-resolution image or recite a paragraph. The active representation changes with context: airport, accident, war, engineering, boarding, flight, threat.

The paper Thinking with Visual Primitives formalizes a version of this idea for multimodal models. It argues that visual reasoning suffers not only from a Perception Gap, where models fail to see enough detail, but from a deeper Reference Gap, where language fails to precisely point to the visual entities being reasoned about. Its solution is to treat points and bounding boxes as “minimal units of thought,” allowing a model to point while it reasons.

This post adds the Relevance Gap: even when a system can produce valid visual references, it still needs to select which references matter for the current task. It still has to know what is worth pointing at. A detector may produce hundreds of candidate primitives. More detections are not automatically more understanding. Without context filtering, the system no longer drowns in pixels or tokens; it drowns in candidate handles.

The architecture proposed here is simple:

extract candidate primitives
filter by task context
build relations
verify the required relation
repair what failed

YOLO, pose models, object trackers, segmentation models, and open-vocabulary detectors can extract primitives from images and video. A reasoning layer selects the active primitive set, checks relations such as touching, holding, aligned, inside, facing, moving_toward, or transferred_to, and produces grounded decisions or targeted repair instructions.

The core contribution is this:

Reasoning is not full reconstruction. Reasoning is context-filtered primitive selection followed by relation verification.

    %%{init: {'theme':'dark', 'themeVariables': { 'fontSize':'14px', 'primaryBorderColor':'#ffffff', 'lineColor':'#cccccc'}}}%%
flowchart TD
    A["🧠 Human Thinking<br/>Sparse, task-relevant primitives"]
    A1["✈️ Same object, different context<br/>airport • accident • war • engineering"]
    A2["🎯 Active primitive set<br/>only what matters for the task"]

    B["🤖 Current AI Failure Mode<br/>more pixels • more tokens • more detections"]
    B1["⚠️ Reference Gap<br/>language cannot reliably point"]
    B2["⚠️ Relevance Gap<br/>the system points to too much"]

    C["📄 Thinking with Visual Primitives<br/>points and boxes inside reasoning"]
    C1["📦 Boxes<br/>object and region handles"]
    C2["📍 Points<br/>locations, joints, paths, contact"]

    D["🚀 This Post<br/>context-filtered primitive reasoning"]
    D1["👁️ Extract<br/>YOLO • pose • tracking • segmentation"]
    D2["🎯 Filter<br/>select the active primitive set"]
    D3["🔗 Relate<br/>touching • holding • aligned • transferred_to"]
    D4["✅ Verify + Repair<br/>grounded decision or targeted fix"]

    E["🎬 Applications"]
    E1["🖼️ Image critique"]
    E2["🔩 Entity interaction and placement"]
    E3["🎥 Movie generation verification"]

    A --> A1 --> A2
    B --> B1
    B --> B2
    C --> C1
    C --> C2
    A2 -. "mimic sparse cognition" .-> D
    C -. "point while reasoning" .-> D
    B2 -. "solve with context filtering" .-> D
    D --> D1 --> D2 --> D3 --> D4 --> E
    E --> E1
    E --> E2
    E --> E3

    classDef human fill:#2b66a8,stroke:#ffffff,color:#ffffff,stroke-width:3px,font-weight:bold;
    classDef ai fill:#c04040,stroke:#ffffff,color:#ffffff,stroke-width:3px,font-weight:bold;
    classDef paper fill:#8855cc,stroke:#ffffff,color:#ffffff,stroke-width:3px,font-weight:bold;
    classDef contribution fill:#3eaa5c,stroke:#ffffff,color:#ffffff,stroke-width:3px,font-weight:bold;
    classDef app fill:#c4a040,stroke:#ffffff,color:#ffffff,stroke-width:3px,font-weight:bold;
    classDef sub fill:#1e1e1e,stroke:#999999,color:#dddddd,stroke-width:1.5px,font-size:13px;

    class A human;
    class B ai;
    class C paper;
    class D contribution;
    class E app;
    class A1,A2,B1,B2,C1,C2,D1,D2,D3,D4,E1,E2,E3 sub;
  

1. Before Words, Before Images

When someone says, “Think of an airplane,” what happens?

The answer depends on the context.

If you are going to the airport, the airplane may activate a passenger-oriented representation: boarding gate, luggage, seats, aisle, overhead bins, takeoff.

If you hear that an airplane has been involved in an accident, a different representation appears: wreckage, smoke, impact, emergency response, failure, trajectory.

If the context is war, the airplane changes again: fighter jet, radar, missile, target, speed, threat.

If the context is engineering, the image becomes wings, fuselage, engines, lift, drag, airflow, control surfaces.

The object is the same, but the active representation is different.

That matters.

It suggests that we do not think by loading a full-resolution image of an airplane. We also do not usually think by reciting a paragraph about airplanes. We activate a sparse, task-conditioned structure: the parts, relations, constraints, and expected changes that matter for the current question.

In one context, the relevant primitive is wing. In another, it is boarding gate. In another, it is impact trajectory. In another, it is radar signature.

Thought is not a complete picture. It is a context-sensitive selection of primitives.

Human intelligence is not only the ability to perceive the world. It is the ability to ignore most of it. We do not reason over every available detail. We select the subset that matters. More context is not automatically more intelligence. Often, intelligence is knowing which context can be safely discarded.

This is the mistake many AI systems still make. Because AI systems answer in language, we assume their reasoning should happen in language. Because images are made of pixels, we assume visual reasoning should happen over pixels. Because modern models can accept larger contexts, we assume giving them more context will make them smarter.

But humans answer in language too. Humans see in pixels too. Humans live inside enormous context windows. Yet the understanding underneath is something else: primitive, relational, spatial, compressed, and task-conditioned.

A useful intelligence does not need to activate the entire airplane to answer a question about boarding. It does not need every visual detail of a scene to decide whether one person is handing an object to another. It does not need every object in an image to know whether the generated picture failed the prompt. It needs the right primitives, selected for the current task.

That is the deeper reason Thinking with Visual Primitives is interesting. The paper points toward a form of AI reasoning that is not merely language-first. It suggests that models need stable visual handles, such as boxes and points, inside the reasoning process itself, so the model can point while it reasons rather than relying on vague phrases like “that object” or “the thing on the left.”

This post takes that idea one step further.

It is not enough to extract primitives. A useful reasoning system must extract the right primitives for the current context, reason over their relations, and ignore the rest.

That is the core claim:

Reasoning is not full reconstruction. Reasoning is context-filtered primitive selection followed by relation verification.


2. The Paper’s Core Idea: The Reference Gap

The paper begins by distinguishing two problems in multimodal AI.

The first is the Perception Gap:

Can the model see enough detail?

This is the problem most visual-language systems try to solve with higher resolution, image crops, dynamic patching, and more visual tokens.

But the paper argues that there is a deeper problem: the Reference Gap.

A model may see the object, but still fail to keep track of what it is reasoning about. Natural language is often too vague to serve as a precise pointer into continuous visual space. Phrases like “the object on the left,” “the bear on the ground,” “that path,” or “the person near the table” can drift during reasoning.

The paper’s solution is simple and powerful:

Let the model point while it reasons.

Instead of reasoning only in text, the model inserts visual primitives directly into its chain of thought:

<ref>bear</ref><box>[[50,447,647,771]]</box>

or:

<point>[[309,512],[357,369],[408,510]]</point>

The box or point is not just a final output. It becomes part of the reasoning process.
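
Because the references are plain tags inside the trace, a downstream tool can recover them mechanically. The sketch below is a minimal parser, assuming the tag grammar is exactly what the two examples above show (an optional `<ref>` label followed by a `<box>` or `<point>` holding a JSON list of lists). The paper's full output format may differ, and `parse_primitives` is a hypothetical helper name.

```python
import json
import re

# Grammar inferred from the two examples above; treat it as an assumption.
TAG = re.compile(
    r"(?:<ref>(?P<label>.*?)</ref>)?"
    r"<(?P<kind>box|point)>(?P<coords>\[\[.*?\]\])</(?P=kind)>"
)

def parse_primitives(trace: str) -> list[dict]:
    """Return every box/point reference found in a reasoning trace."""
    found = []
    for match in TAG.finditer(trace):
        found.append({
            "kind": match.group("kind"),
            "label": match.group("label"),  # None when no <ref> precedes the tag
            "coords": json.loads(match.group("coords")),
        })
    return found

trace = (
    "The <ref>bear</ref><box>[[50,447,647,771]]</box> is on the ground. "
    "Path: <point>[[309,512],[357,369],[408,510]]</point>"
)
primitives = parse_primitives(trace)
```

Once the references are structured data, every later stage (filtering, relating, verifying) can operate on them without re-reading the prose around them.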

That is the key contribution: points and bounding boxes become minimal units of thought.

The paper demonstrates this on counting, spatial reasoning, maze navigation, and path tracing. In counting examples, the model grounds each candidate object with boxes, filters invalid candidates, and tallies the grounded set. In path and maze examples, it uses point sequences to represent exploration, backtracking, and route tracing.

This is a major step. It moves multimodal reasoning from vague language toward grounded reference.


3. Our Extension: The Relevance Gap

The Thinking with Visual Primitives paper identifies the Reference Gap:

The model may know what it wants to say, but language alone is often too vague to point to the right part of the image.

Visual primitives help close that gap. A model can say:

<ref>cup</ref><box>[[190,230,215,260]]</box>

instead of relying on a phrase like:

the small object near the person

That is a major step. It gives the model a stable reference.

But once a system can point, a second problem appears:

Which references are worth pointing at?

A detector, pose model, tracker, or segmentation system can produce dozens or hundreds of candidate primitives from a single image or video clip:

person_1
person_2
chair_1
table_1
window_1
cup_1
phone_1
left_wrist
right_wrist
nose
shoulder
background_region
trajectory_1
trajectory_2

Many of these primitives may be correct. But correctness is not the same as relevance.

If the task is:

Is person_1 handing cup_1 to person_2?

then the active primitive set should be much smaller:

person_1
person_2
person_1 hand points
person_2 hand points
cup_1
cup_1 trajectory

The chair, window, phone, wall, and background regions may all be real, but they are not relevant unless they affect the handoff.

This is the Relevance Gap:

The Relevance Gap is the failure mode where a system has access to many correct candidate references, but cannot select the subset relevant to the current task.

In other words:

Perception Gap:
Can the system see enough?

Reference Gap:
Can the system point to what it means?

Relevance Gap:
Can the system choose which references matter?

The Relevance Gap is easy to miss because it appears after progress has already been made. The system sees the scene. It detects objects. It produces boxes, points, masks, keypoints, or tracks. It may even ground its reasoning in explicit references.

But if it reasons over the wrong references, or too many references, the reasoning can still fail.

More perception can become more confusion. More primitives can become a new kind of overload. The model no longer drowns in pixels or tokens. It drowns in candidate handles.

That is why primitive reasoning needs a context filter.

    flowchart LR
    subgraph A["Without Relevance Filtering"]
        A1["🔎 Many candidate primitives<br/>people • objects • hands • background"]
        A2["🧠 Reason over everything"]
        A3["⚠️ Noisy or confused output"]
        A1 --> A2 --> A3
    end

    subgraph B["With Context-Filtered Primitive Selection"]
        B1["🔎 Many candidate primitives"]
        B2["🎯 Context filter<br/>task: verify handoff"]
        B3["🧩 Active primitive set<br/>giver • receiver • hands • object • trajectory"]
        B4["✅ Grounded decision<br/>handoff passed or failed"]
        B1 --> B2 --> B3 --> B4
    end

    classDef bad fill:#7a2d2d,stroke:#4a1111,color:#fff,stroke-width:2px;
    classDef good fill:#2d7a4a,stroke:#164529,color:#fff,stroke-width:2px;
    classDef neutral fill:#35495e,stroke:#1f2d3a,color:#fff,stroke-width:2px;

    class A1,B1 neutral;
    class A2,A3 bad;
    class B2,B3,B4 good;
  

This is the key extension proposed in this post.

The goal is not to reason over every primitive the system can extract. The goal is to reason over the smallest useful primitive set for the current task.

That changes the architecture from:

detect everything
reason over everything

to:

extract candidate primitives
filter by context
reason over active primitives
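
The filter step can be sketched in a few lines. This is a deliberately naive version, assuming a hand-written table that maps each task to the primitive ids it needs. `TASK_ROLES` and `select_active` are hypothetical names, and a real filter would more likely be learned or driven by a language model than by a lookup table.

```python
# Hypothetical task table: which primitive ids each task declares relevant.
TASK_ROLES = {
    "handoff": {
        "person_1", "person_2",
        "person_1.hand", "person_2.hand",
        "cup_1", "cup_1.trajectory",
    },
}

def select_active(candidates: list[str], task: str) -> list[str]:
    """Keep only the candidates the task declares relevant, preserving order."""
    wanted = TASK_ROLES[task]
    return [c for c in candidates if c in wanted]

candidates = [
    "person_1", "person_2", "chair_1", "table_1", "window_1", "cup_1",
    "phone_1", "person_1.hand", "person_2.hand", "background_region",
    "cup_1.trajectory", "trajectory_2",
]
active = select_active(candidates, "handoff")
```

Twelve correct detections go in and six come out. Nothing was wrong with the discarded ones; they simply do not bear on the handoff.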

This is also where the analogy with human thought becomes useful. Humans do not usually reason by activating everything they know about an object or scene. We select what matters. The airplane at the airport activates gates, luggage, seats, and boarding. The airplane in an accident activates smoke, wreckage, impact, and trajectory. The object is the same, but the active primitive set changes with the task.

A useful AI system needs the same discipline.

The Reference Gap asks:

Can the system point to what it means?

The Relevance Gap asks:

Can the system choose what is worth pointing at?

That is the difference between grounded reference and grounded reasoning.


4. From Visual Primitives to Reference Primitives

The paper begins with two visual primitives:

| Primitive | Meaning | Best for |
|---|---|---|
| Bounding box | A region around an object or part | objects, people, body parts, scene elements |
| Point / polyline | A coordinate or coordinate sequence | joints, contact points, gaze, paths, trajectories |

These are enough to make visual reasoning more precise. A model no longer has to say “the object on the left” or “that person near the table.” It can bind the thought to a coordinate:

<ref>person</ref><box>[[120,80,340,900]]</box>

or:

<point>[[421,612]]</point>

However, the deeper idea is not limited to images.

A box in an image is just one kind of pointer. A sentence span in a document is also a pointer. A function symbol in code is a pointer. A UI component box is a pointer. A video trajectory is a pointer across time.

So we can generalize the paper’s visual primitive into a broader abstraction:

A Reference Primitive is any machine-verifiable pointer used inside a reasoning trace.

It is the thing the system can point to when it makes a claim.

| Domain | Reference primitive | Example |
|---|---|---|
| Image | box / keypoint | [[421,612]] = left wrist |
| Video | trajectory | person_1.hand.frames[42:58] |
| Document | span | ch03.s142 |
| Code | symbol | engine.py::repair_chapter |
| UI | component box | [[80,120,400,200]] |
| VPM | pixel region | {x:12,y:9,w:4,h:2} |

The principle is identical across domains:

Reasoning should point to evidence.

A reasoning trace that contains only natural language is hard to inspect. It may sound convincing, but the system cannot easily verify what the claim refers to.

A reasoning trace with reference primitives is different. It can say:

This claim depends on this box.
This critique depends on this keypoint.
This edit depends on this sentence.
This bug depends on this function.
This UI failure depends on this component region.

That makes the reasoning trace inspectable.
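
One way to make "inspectable" concrete is to check that every pointer a claim cites actually resolves. The sketch below assumes a flat registry keyed by primitive id. The `Claim` class, the `unresolved_evidence` helper, and the id prefixes (`img.`, `doc.`, `code.`) are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass, field

@dataclass
class Claim:
    text: str
    evidence_ids: list[str] = field(default_factory=list)

def unresolved_evidence(claim: Claim, registry: dict[str, object]) -> list[str]:
    """Return the evidence ids the claim cites that are missing from the registry."""
    return [eid for eid in claim.evidence_ids if eid not in registry]

# Hypothetical registry of extracted reference primitives.
registry = {
    "img.left_wrist": [[421, 612]],
    "doc.ch03.s142": "sentence span",
}

grounded = Claim("The left wrist is raised.", ["img.left_wrist"])
dangling = Claim("The bug is in the repair step.", ["code.engine.py::repair_chapter"])
ungrounded = Claim("The scene feels off.")
```

A claim with no evidence ids, or with ids that do not resolve, can be flagged before anyone has to judge whether it sounds convincing.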

But this also creates the next problem. If we extract every possible primitive, we have not solved reasoning. We have only changed the form of overload. Instead of drowning in pixels, tokens, or context, the model can drown in candidate references.

So the Reference Primitive is only the first step.

A useful AI system must not merely point. It must know which pointers matter.


5. YOLO Is Not the Reasoner

YOLO does not implement the paper by itself.

YOLO detects objects. It gives us candidate visual primitives:

person_1 box
person_2 box
cup_1 box
chair_1 box

A pose model gives us body primitives:

left_wrist point
right_wrist point
nose point
shoulder points

A tracker gives us temporal primitives:

person_1 across frames
cup_1 across frames
hand trajectory

These are useful, but none of these systems reason by themselves.

YOLO can tell us that two people and a cup are present. It cannot tell us whether the people are having a conversation, whether one person is handing the cup to the other, or whether the cup is irrelevant background clutter.

That requires another layer.

The reasoning layer asks questions like:

Is person_1 facing person_2?
Is person_1's hand near cup_1?
Is cup_1 between person_1 and person_2?
Does cup_1 move from person_1's hand region to person_2's hand region?

But even that is not enough.

Before reasoning begins, the system must decide which primitives matter for the current task. If the task is:

Is person_1 handing the cup to person_2?

then the active primitive set should include:

person_1
person_2
person_1 hand points
person_2 hand points
cup_1
cup_1 trajectory
distance between hands
contact events

It probably does not need:

chair_1
window_1
wall_1
floor_1
background objects

Those detections may be correct, but they are not relevant unless they affect the handoff.

So the practical stack becomes:

Image / frame
YOLO / pose / segmentation / tracking
Candidate reference primitives
Context filter
Active primitive set
Primitive relations
Verifier
Issue / score / repair

The paper gives the theory: reason with primitives by allowing the model to point while it thinks. YOLO gives a practical way to extract some of those primitives locally. The contribution here is the full reasoning loop:

Extract candidate primitives, filter them by context, reason over their relations, and verify the result.

That is what turns object detection into primitive-based understanding.


6. The Primitive Reasoning Architecture

The architecture has one central job:

Convert raw visual content into a small, task-relevant set of primitives, reason over the relationships between those primitives, and produce a grounded decision.

The important part is the middle of the pipeline. We are not asking the model to reason over every pixel, every object, or every possible visual detail. We first extract candidate primitives, then filter them through the current task.

    flowchart TD
    A["🖼️ Image / Video / Generated Scene"]
    B["👁️ Primitive Extraction<br/>YOLO • pose • tracking • segmentation"]
    C["🪝 Candidate Reference Primitives<br/>people • objects • hands • regions • trajectories"]
    D["🎯 Context Filter<br/>select primitives relevant to the task"]
    E["🧠 Active Primitive Set<br/>small enough to reason over"]
    F["🔗 Relation Layer<br/>near • touching • facing • holding • moving • transferred_to"]
    G["✅ Verifier / Reasoner<br/>check whether required relation exists"]
    H["🛠️ Output<br/>decision • score • issue • repair instruction"]
    X["⚠️ Primitive overload<br/>too many references become confusion"]

    A --> B --> C --> D --> E --> F --> G --> H
    C -. "without filtering" .-> X

    classDef input fill:#35495e,stroke:#1f2d3a,color:#fff,stroke-width:2px;
    classDef extract fill:#1f4e79,stroke:#0d2b45,color:#fff,stroke-width:2px;
    classDef primitive fill:#5b3f8c,stroke:#2d1d4d,color:#fff,stroke-width:2px;
    classDef filter fill:#7a5a2d,stroke:#4a3214,color:#fff,stroke-width:2px;
    classDef active fill:#2d7a4a,stroke:#164529,color:#fff,stroke-width:2px;
    classDef relation fill:#2d7a73,stroke:#144542,color:#fff,stroke-width:2px;
    classDef output fill:#7a4a2d,stroke:#4a2a14,color:#fff,stroke-width:2px;
    classDef danger fill:#7a2d2d,stroke:#4a1111,color:#fff,stroke-width:2px;

    class A input;
    class B extract;
    class C primitive;
    class D filter;
    class E active;
    class F relation;
    class G relation;
    class H output;
    class X danger;
  

The implementation can start with two small objects.

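from typing import Any, Literal

from pydantic import BaseModel, Field


class ReferencePrimitiveDTO(BaseModel):
    """A machine-verifiable pointer that a reasoning trace can cite."""
    primitive_id: str
    primitive_type: Literal[
        "box", "point", "keypoint", "polyline",
        "mask", "span", "symbol", "region", "trajectory"
    ]
    label: str | None = None
    coordinates: Any  # shape depends on primitive_type
    confidence: float | None = None
    source: str  # which extractor produced this primitive
    frame_index: int | None = None  # set for video primitives
    parent_primitive_id: str | None = None  # e.g. the person a wrist keypoint belongs to
    meta: dict = Field(default_factory=dict)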

This is the thing the system can point to.

class PrimitiveRelationDTO(BaseModel):
    """A checkable relation between two reference primitives."""
    relation_id: str
    subject_primitive_id: str
    relation_type: Literal[
        "near", "touching", "facing", "holding",
        "inside", "left_of", "right_of",
        "moving_toward", "moving_away", "transferred_to"
    ]
    object_primitive_id: str
    confidence: float
    # Primitives the relation was computed from, so a verdict can cite them.
    evidence_primitive_ids: list[str] = Field(default_factory=list)
    frame_range: tuple[int, int] | None = None  # set for relations that hold over time
    meta: dict = Field(default_factory=dict)

This is the relationship the system can reason over.

Together, they define the primitive reasoning layer:

ReferencePrimitiveDTO = what can be pointed at
PrimitiveRelationDTO = what can be checked between pointers

A candidate set such as:

person_1 box
person_2 box
cup_1 box
person_1 right_wrist point

becomes:

near(person_1, person_2)
facing(person_1, person_2)
near(person_1.right_wrist, cup_1)
between(cup_1, person_1, person_2)
moving_toward(cup_1, person_2.left_wrist)

That is where the system moves from detection to understanding.
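
A rough sketch of how two of these relations might be computed from boxes alone. The coordinates, the 450-pixel threshold, and the center-based definitions of `near` and `between` are all illustrative assumptions. A production system would likely use masks, depth, or learned relation classifiers instead.

```python
import math

def center(box):
    """Center of an (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)

def near(a, b, max_dist):
    """True when the two box centers are within max_dist pixels."""
    (ax, ay), (bx, by) = center(a), center(b)
    return math.hypot(ax - bx, ay - by) <= max_dist

def between(mid, a, b):
    """True when mid's center x lies between the center x of a and of b."""
    mx, ax, bx = center(mid)[0], center(a)[0], center(b)[0]
    return min(ax, bx) <= mx <= max(ax, bx)

# Hypothetical boxes; not taken from any real detector output.
boxes = {
    "person_1": (100, 80, 300, 900),
    "person_2": (500, 90, 700, 900),
    "cup_1": (380, 400, 420, 450),
}

relations = []
if near(boxes["person_1"], boxes["person_2"], max_dist=450):
    relations.append(("near", "person_1", "person_2"))
if between(boxes["cup_1"], boxes["person_1"], boxes["person_2"]):
    relations.append(("between", "cup_1", "person_1", "person_2"))
```

Note that `between` takes three arguments, so it would not fit a subject/object relation record without an extra field.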


7. Application One: AI-Generated Image Critique

The first practical demonstration is generated image critique.

A prompt says:

A woman sitting at a desk, holding a red book in both hands, looking down at the book.

The generator produces an image.

At a glance, it may look fine. But maybe the book is visible but not red, the hands are near the book but not touching it, the head faces forward instead of downward, the desk is ambiguous, or one wrist bends unnaturally.

A normal AI critique might say:

“The image mostly follows the prompt, but the anatomy could be improved.”

That is not enough. The critique is too vague to verify, too vague to measure, and too vague to reliably repair.

A reference-grounded critique is different.

It turns the prompt into requirements:

Requirement:
Both hands should hold the red book.

It turns the image into candidate primitives:

person_1 box
desk_1 box
book_1 box
left_wrist point
right_wrist point
nose point

Then the context filter selects the active primitive set for this requirement:

left_wrist = [[370,790]]
right_wrist = [[720,790]]
book_box = [[450,610,650,750]]

Now the verifier can reason over the relationship:

Decision:
FAIL. Neither wrist is close enough to the book box.

Repair:
Move both hands so they visibly touch or hold the red book.

The system is not merely saying that the image is wrong. It is identifying which requirement failed, which primitives matter, which relation is missing, and what repair should be attempted next.
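
The holding check can be written as a small geometric test. The sketch below reuses the coordinates from the example above. The 30-pixel contact threshold and the function names are assumptions for illustration, and a real threshold would need to scale with image resolution and object size.

```python
import math

def point_to_box_distance(point, box):
    """Distance from a point to the nearest edge of a box (0 if the point is inside)."""
    px, py = point
    x1, y1, x2, y2 = box
    dx = max(x1 - px, 0, px - x2)
    dy = max(y1 - py, 0, py - y2)
    return math.hypot(dx, dy)

def verify_holding(wrists, book_box, max_dist=30.0):
    """Check that every wrist is within max_dist pixels of the book box."""
    distances = {name: point_to_box_distance(p, book_box) for name, p in wrists.items()}
    failed = [name for name, d in distances.items() if d > max_dist]
    return {
        "decision": "FAIL" if failed else "PASS",
        "distances": distances,
        "repair": f"Move {' and '.join(failed)} onto the book." if failed else None,
    }

report = verify_holding(
    {"left_wrist": (370, 790), "right_wrist": (720, 790)},
    book_box=(450, 610, 650, 750),
)
```

Both wrists sit roughly 80 to 90 pixels from the book box, so the report fails and names the primitives that need to move.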

| Prompt requirement | Active primitives | Verifier question | Possible repair |
|---|---|---|---|
| holding a red book | wrists, book box | are both wrists near the book? | move hands onto book |
| looking down at book | nose/head, book box | is head/gaze oriented toward book? | tilt head toward book |
| sitting at desk | person pose, desk box | does pose align with seated position? | regenerate seated posture |
| red book | book box, color region | is the book region red? | make book visibly red |
| natural anatomy | shoulder, elbow, wrist | are joint angles plausible? | correct arm/wrist pose |

This turns image generation into a closed-loop improvement process:

prompt
generated image
candidate primitives
active primitive set
relation verifier
revision prompt
regenerate

That kind of critique can be measured. It can be turned into a revision prompt. It can be checked again after regeneration.
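
The loop itself is ordinary control flow once generation and verification are treated as callables. This sketch assumes `generate` takes a prompt and returns an image-like object, and `verify` returns a list of repair instructions (empty when everything passes). Both are stand-ins here, since no specific generator or verifier API is implied.

```python
from typing import Callable

def critique_loop(
    prompt: str,
    generate: Callable[[str], object],
    verify: Callable[[object], list[str]],
    max_rounds: int = 3,
):
    """Regenerate until the verifier reports no issues or the round budget runs out."""
    current_prompt = prompt
    image, issues = None, []
    for round_index in range(1, max_rounds + 1):
        image = generate(current_prompt)
        issues = verify(image)
        if not issues:
            return image, round_index, issues
        # Rebuild from the original prompt so repairs do not pile up across rounds.
        current_prompt = prompt + " Fix: " + "; ".join(issues)
    return image, max_rounds, issues

# Stand-in generator and verifier, for illustration only.
def fake_generate(p: str) -> dict:
    return {"hands_on_book": "Fix:" in p}

def fake_verify(img: dict) -> list[str]:
    return [] if img["hands_on_book"] else ["move both hands onto the book"]

image, rounds, issues = critique_loop(
    "A woman holding a red book", fake_generate, fake_verify
)
```

The round budget matters in practice: a generator that cannot satisfy a requirement should surface the remaining issues rather than loop forever.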


8. Application Two: Entity Interaction and Placement

The next application is interaction detection, but not only human interaction.

The broader question is:

Are the relevant entities in the required relationship?

Sometimes those entities are people:

  • is one person handing an object to another?
  • is one person helping another stand?
  • are two people facing each other in conversation?
  • is a character actually holding the object the prompt described?

But the same reasoning applies to physical placement, assembly, robotics, manufacturing, generated images, and video verification:

  • is the chip seated correctly on the circuit board?
  • is the cable plugged into the port?
  • is the screw aligned with the hole?
  • is the tool touching the correct surface?
  • is the object inside the container?
  • is the generated character’s hand actually touching the book?

This is where primitive reasoning becomes more general.

Interaction is not presence. Interaction is relation.

A detector might give us candidate primitives:

person_1 box
person_2 box
cup_1 box
left_wrist point
right_wrist point
chip_1 box
socket_region_1 box
cable_1 endpoint
port_1 box

Those detections may all be correct, but they are not all relevant to the current task.

If the task is:

Is person_1 handing the cup to person_2?

then the active primitive set should be:

person_1
person_2
person_1 hand points
person_2 hand points
cup_1
cup_1 trajectory

If the task is:

Is chip_1 correctly placed on the circuit board?

then the active primitive set should be:

chip_1
socket_region_1
chip_1 corners
chip_1 pin row
board_contact row

The context changes the primitive set.

| Interaction / relation | Primitive pattern |
|---|---|
| Conversation | two people close + facing + shared attention |
| Object exchange | object moves from one hand region to another |
| Assistance | one body/hand primitive supports another body primitive |
| Placement | object aligns with and sits inside a target region |
| Insertion | object trajectory enters socket/container/slot region |
| Connection | cable/plug endpoint overlaps or locks into target port |
| Assembly | part primitives align with board/socket/contact primitives |
| No valid interaction | expected relation is absent, weak, or contradicted |

Object detection tells us:

cup exists
person exists
chip exists
board exists

Primitive reasoning asks:

Is the cup being transferred?
Is the person holding it?
Is the chip aligned?
Is the cable connected?
Is the required relation actually true?

For a human handoff, the verifier may check:

near(person_1.right_wrist, cup_1)
moving_toward(cup_1, person_2.left_wrist)
near(person_2.left_wrist, cup_1)
transferred_to(cup_1, person_2)

For chip placement, it may check:

inside(chip_1, socket_region_1)
aligned_with(chip_1.edges, socket_region_1.edges)
near(chip_1.pins, board_contacts)
orientation_matches(chip_1, socket_region_1)

If the chip is shifted, rotated, or not seated, the verifier can produce a grounded issue:

Decision:
FAIL. chip_1 is offset from socket_region_1 and rotated relative to the expected contact row.

Evidence:
chip_1 box
socket_region_1 box
chip_1 pin row
board_contact row

Repair:
Reposition chip_1 so its pin row aligns with the board contacts and its bounding box sits inside socket_region_1.
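
A box-only version of the placement check might look like the following. The coordinates and the 2-pixel tolerance are hypothetical. Axis-aligned boxes cannot express rotation, so this sketch covers only the containment and offset parts of the check; an orientation test would need pin-row or corner primitives.

```python
def inside(inner, outer):
    """True when the inner box lies fully within the outer box."""
    ix1, iy1, ix2, iy2 = inner
    ox1, oy1, ox2, oy2 = outer
    return ox1 <= ix1 and oy1 <= iy1 and ix2 <= ox2 and iy2 <= oy2

def center_offset(box, target):
    """Offset (dx, dy) of the box center from the target center."""
    bx, by = (box[0] + box[2]) / 2, (box[1] + box[3]) / 2
    tx, ty = (target[0] + target[2]) / 2, (target[1] + target[3]) / 2
    return (bx - tx, by - ty)

def verify_placement(chip_box, socket_box, max_offset=2.0):
    """Pass only if the chip is contained in the socket and roughly centered."""
    dx, dy = center_offset(chip_box, socket_box)
    seated = inside(chip_box, socket_box) and max(abs(dx), abs(dy)) <= max_offset
    return {
        "decision": "PASS" if seated else "FAIL",
        "offset": (dx, dy),
        "suggested_shift": None if seated else (-dx, -dy),
    }

# Hypothetical boxes: the chip overhangs the socket on the right.
report = verify_placement(chip_box=(112, 106, 204, 198), socket_box=(100, 100, 200, 200))
```

The `suggested_shift` field is what makes the issue repairable: it says how far to move the chip as well as reporting that it is misplaced.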

The common pattern is always the same:

What is the task?
Which primitives matter?
What relation is expected?
What evidence supports it?
Did the relation actually happen?

That is primitive-level interaction reasoning.

The system does not need to understand everything in the scene. It needs to select the right primitives, check the right relation, and produce a grounded decision.


9. Application Three: Movie Generation Verification

The third application is the most ambitious: movie generation verification.

Image critique checks whether a single frame satisfies a prompt. Entity interaction checks whether things are in the right relation. Movie verification extends both ideas across time.

A book scene says:

Alice hands the envelope to Maria.

A video model generates a clip.

At a glance, the clip may look plausible. Alice and Maria are both present. The envelope appears. Their hands move. The scene has the right mood.

But the actual event may not happen.

A normal reviewer might say:

“The scene sort of works, but the handoff is unclear.”

That is useful, but too vague for an automated generation loop. The system needs to know what failed, where it failed, and which frames need repair.

A primitive verifier can be more precise:

Required event:
handoff(envelope, Alice, Maria)

Observed:
envelope_box remains near Alice.right_hand from frames 12–44
Maria.left_hand approaches but never contacts envelope_box
envelope_box disappears at frame 45
no transferred_to relation detected

Decision:
FAIL. Handoff not visually completed.

Repair:
Regenerate frames 32–52 with a visible envelope transfer from Alice's right hand into Maria's left hand.

This is the movie-generation extension of primitive reasoning.

The same architecture applies:

generated frames
YOLO + pose + tracking
candidate temporal primitives
context filter
active primitive set
event relations
scene-action verification
targeted regeneration

The key difference is time.

In a still image, the verifier asks:

Is the hand near the envelope?

In video, the verifier asks:

Does the envelope move from Alice's hand region to Maria's hand region across a valid frame range?

That requires identity, continuity, and event structure.

A video can be represented as a sequence of frame-level primitives:

Frame 12:
Alice.right_hand = [[312,540]]
envelope_box = [[326,528,372,562]]
Maria.left_hand = [[690,550]]

Frame 28:
Alice.right_hand = [[410,542]]
envelope_box = [[435,530,482,563]]
Maria.left_hand = [[600,548]]

Frame 44:
Alice.right_hand = [[520,545]]
envelope_box = [[548,532,596,565]]
Maria.left_hand = [[560,550]]

Those frame-level primitives can be turned into temporal relations:

near(envelope_box, Alice.right_hand, frames=12–30)
moving_toward(envelope_box, Maria.left_hand, frames=24–40)
near(envelope_box, Maria.left_hand, frames=40–48)
transferred_to(envelope_box, Maria.left_hand, frames=42–48)

That gives us an event primitive:

handoff(envelope, Alice, Maria, frames=12–48)
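
The step from frame-level primitives to a transfer decision can be sketched directly on the three frames listed above. The "holder" heuristic (the nearest hand within 60 pixels of the envelope center) and the function names are assumptions. With only three sampled frames this is a coarse check, not a full tracker.

```python
import math

# Frame-level primitives copied from the example above.
FRAMES = {
    12: {"Alice.right_hand": (312, 540), "envelope_box": (326, 528, 372, 562),
         "Maria.left_hand": (690, 550)},
    28: {"Alice.right_hand": (410, 542), "envelope_box": (435, 530, 482, 563),
         "Maria.left_hand": (600, 548)},
    44: {"Alice.right_hand": (520, 545), "envelope_box": (548, 532, 596, 565),
         "Maria.left_hand": (560, 550)},
}

def box_center(box):
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)

def holder_per_frame(frames, hands, obj, max_dist=60.0):
    """Map frame index to the nearest hand within max_dist of the object, else None."""
    holders = {}
    for idx in sorted(frames):
        prims = frames[idx]
        cx, cy = box_center(prims[obj])
        dist, hand = min(
            (math.hypot(prims[h][0] - cx, prims[h][1] - cy), h) for h in hands
        )
        holders[idx] = hand if dist <= max_dist else None
    return holders

def transferred(holders, giver, receiver):
    """True when the giver holds the object and the receiver holds it in a later frame."""
    seq = [holders[i] for i in sorted(holders)]
    return giver in seq and receiver in seq[seq.index(giver) + 1:]

holders = holder_per_frame(FRAMES, ["Alice.right_hand", "Maria.left_hand"], "envelope_box")
handoff_ok = transferred(holders, "Alice.right_hand", "Maria.left_hand")
```

On these frames the envelope stays with Alice at frames 12 and 28 and is nearest Maria's hand at frame 44, so the transfer check passes.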

A generated movie is not just a sequence of pretty frames. It is a sequence of required actions:

Alice gives Maria the envelope.
Maria opens it.
She reads the letter.
Her expression changes.
She steps back.

Each action can be translated into verifiable primitive requirements.

| Scene action | Required primitive evidence |
| --- | --- |
| Alice gives Maria the envelope | envelope moves from Alice hand region to Maria hand region |
| Maria opens it | Maria hand contacts envelope; envelope state changes |
| Maria reads the letter | gaze/head direction aligns with letter region |
| Her expression changes | face landmarks / expression classifier changes over frames |
| She steps back | Maria body box moves away from Alice / table / object |

Some are easy. Some are hard. But the architecture is the same.

Movie verification is not just image critique repeated over frames. It requires temporal primitive reasoning. The system must track identities, preserve object continuity, detect relations, and verify that required relations become true in the correct order.
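
One way to check that required relations become true in the correct order is sketched below. It assumes temporal relations have already been extracted with frame ranges, as in the handoff example earlier in this section. TemporalRelation and verify_order are hypothetical names for this illustration, and comparing start frames is only one possible ordering rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TemporalRelation:
    relation: str
    subject: str
    obj: str
    start: int  # first frame where the relation holds
    end: int    # last frame where the relation holds


def verify_order(required, observed):
    """Check that each required (relation, subject, obj) is observed,
    starting no earlier than the previously matched requirement."""
    earliest_start = 0
    for relation, subject, obj in required:
        starts = [
            r.start
            for r in observed
            if (r.relation, r.subject, r.obj) == (relation, subject, obj)
            and r.start >= earliest_start
        ]
        if not starts:
            return False, f"missing or out of order: {relation}({subject}, {obj})"
        earliest_start = min(starts)
    return True, "all required relations observed in order"


# Temporal relations from the handoff example in the text.
OBSERVED = [
    TemporalRelation("near", "envelope_box", "Alice.right_hand", 12, 30),
    TemporalRelation("moving_toward", "envelope_box", "Maria.left_hand", 24, 40),
    TemporalRelation("near", "envelope_box", "Maria.left_hand", 40, 48),
    TemporalRelation("transferred_to", "envelope_box", "Maria.left_hand", 42, 48),
]

REQUIRED = [
    ("near", "envelope_box", "Alice.right_hand"),
    ("moving_toward", "envelope_box", "Maria.left_hand"),
    ("near", "envelope_box", "Maria.left_hand"),
    ("transferred_to", "envelope_box", "Maria.left_hand"),
]

print(verify_order(REQUIRED, OBSERVED))
print(verify_order(REQUIRED, OBSERVED[:-1]))
```

The second call drops the transferred_to relation, so the verifier fails and names the exact requirement that was not met, which is the hook a repair step would use.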

The final point is simple:

A generated movie should not only look plausible. It should make the required events verifiably happen.


10. Implementation Sketch

The first implementation does not need to train a new multimodal model.

That is the practical advantage of treating this as a system architecture rather than a model architecture.

The Thinking with Visual Primitives paper trains a model to produce visual primitives directly inside its reasoning process. That is powerful, but it is not the only way to explore the idea. We can approximate the same “point-to-reason” pattern with existing tools by separating the system into five stages:

extract
filter
relate
verify
repair

The tool stack can be built from existing components:

YOLO / object detector              → object boxes
YOLO-pose / MediaPipe / OpenPose     → body keypoints
ByteTrack / BoT-SORT                 → persistent IDs across frames
GroundingDINO / open-vocab detector  → domain-specific objects
Segmentation model                   → masks and precise regions
Reasoning layer                      → context filtering, relations, verification, repair

For video, a YOLO tracker can preserve object identity across frames:

results = model.track(frame, persist=True, tracker="bytetrack.yaml")

Each tracked object can be converted into a visual primitive:

<ref>person_1</ref><box>[[120,45,200,310]]</box>

The paper uses normalized 0–999 coordinates, so raw pixel coordinates can be converted into that shared coordinate space:

x_norm = int((x_raw / width) * 999)
y_norm = int((y_raw / height) * 999)
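
A slightly more defensive version of that conversion is sketched below. It clamps coordinates that fall outside the frame, which detectors occasionally produce, and formats the result in the same ref/box notation used above. The function names are assumptions for this example.

```python
def normalize_coordinate(raw: float, size: int) -> int:
    """Map a pixel coordinate into the 0-999 range, clamping out-of-frame values."""
    if size <= 0:
        raise ValueError("size must be positive")
    return min(999, max(0, int((raw / size) * 999)))


def to_box_token(ref: str, box, width: int, height: int) -> str:
    """Format a pixel-space box as a normalized <ref>/<box> primitive string."""
    x1, y1, x2, y2 = box
    coords = [
        normalize_coordinate(x1, width),
        normalize_coordinate(y1, height),
        normalize_coordinate(x2, width),
        normalize_coordinate(y2, height),
    ]
    return f"<ref>{ref}</ref><box>[[{','.join(map(str, coords))}]]</box>"


print(to_box_token("person_1", (100, 50, 200, 250), width=1000, height=500))
# <ref>person_1</ref><box>[[99,99,199,499]]</box>
```

Truncating with int() follows the formula above; rounding instead would be an equally defensible choice as long as it is applied consistently.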

Then the active primitive set and relations can be verified:

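# extract_primitives and update_tracks stand in for the detector and tracker adapters.
tracks = []
for frame in video:
    candidate_primitives = extract_primitives(frame)
    # Keep the tracked primitives from every frame so temporal relations can be built.
    tracks.extend(update_tracks(candidate_primitives))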

active_set = context_filter.select(
    task="Does person_1 hand the cup to person_2?",
    primitives=tracks,
)

relations = relation_builder.build(active_set)

result = verifier.verify(
    task="object_exchange(person_1, cup_1, person_2)",
    active_set=active_set,
    relations=relations,
)

print(result.decision)
print(result.evidence)
print(result.repair_instruction)

There are two ways this architecture can work.

The first is the external pipeline described above:

detectors / pose / tracking
candidate primitives
context filter
relations
verification

This works today with existing tools.

The second is a native primitive-capable model, like the direction proposed in Thinking with Visual Primitives. In that case, the model itself may produce points and boxes as part of its reasoning. The rest of the architecture still applies:

model-generated primitives
context filter
relations
verification / repair

So the goal is not to compete with the paper. The goal is to build the system layer around the paper’s insight.

DeepSeek-style models make better primitives available inside reasoning. This architecture asks what happens next:

Which primitives matter? What relation should exist? Did that relation actually happen? What should be repaired if it did not?

The full source at the end of this post can be organized as a small reproducible reference pipeline:

reference_reasoning/
  dto.py
  geometry.py
  yolo_extract.py
  pose_extract.py
  tracking.py
  context_filter.py
  relation_builder.py
  verifiers.py
  report.py
  demo_image.py
  demo_video.py

A minimal setup might look like:

pip install ultralytics opencv-python pydantic numpy

And the demos can be run as:

python demo_image.py --image scene.png --task "is the person holding the book?"
python demo_video.py --video handoff.mp4 --task "does person 1 hand the object to person 2?"

The goal of the implementation is not to solve every visual reasoning problem. It is to demonstrate the architecture:

Extract candidate primitives from perception, filter them by context, convert them into relations, and verify whether the required relation exists.


11. Why This Works

Primitive reasoning works because it is efficient, verifiable, and repairable.

It is efficient because it avoids reasoning over everything. A naive multimodal pipeline sends the full image, full video, or full scene into a large vision-language model and asks it to reason directly over raw content. That can work, but most tasks do not require the entire scene.

If the question is:

Is person_1 handing the cup to person_2?

the system does not need every pixel, every object, every texture, every background detail, or every frame at full resolution. It needs a small active primitive set:

{
  "frame": 42,
  "active_primitives": [
    {"id": "person_1", "type": "box", "label": "person", "coords": [120, 45, 200, 310]},
    {"id": "person_1_right_wrist", "type": "point", "label": "right_wrist", "coords": [180, 220]},
    {"id": "person_2_left_wrist", "type": "point", "label": "left_wrist", "coords": [260, 225]},
    {"id": "cup_1", "type": "box", "label": "cup", "coords": [190, 230, 215, 260]}
  ],
  "relations": [
    {"subject": "person_1_right_wrist", "relation": "near", "object": "cup_1"},
    {"subject": "cup_1", "relation": "moving_toward", "object": "person_2_left_wrist"}
  ]
}

Now the reasoning layer works over a compact state: a few primitives and relations, not millions of pixels.

It is verifiable because a claim can point to its evidence.

A vague critique says:

The hand looks wrong.

A grounded critique says:

Requirement:
left hand touches book

Evidence:
left_wrist = [[370,790]]
book_box = [[450,610,650,750]]

Observed relation:
distance(left_wrist, book_box) = 0.37 × object diagonal

Threshold:
contact requires <= 0.20 × object diagonal

Decision:
FAIL. Hand-object contact is not satisfied.

Repair:
Move the left hand so it visibly touches or holds the book.

That can be tested.
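
To make "can be tested" concrete, here is a minimal sketch of the contact check using the wrist point and book box from the critique above. It assumes distance means point-to-nearest-box-edge divided by the box diagonal; a different definition (for example, distance to the box center) would give a different ratio. The function names are illustrative.

```python
import math


def relative_distance(point, box):
    """Point-to-box-edge distance divided by the box diagonal (0 if the point is inside)."""
    px, py = point
    x1, y1, x2, y2 = box
    dx = max(x1 - px, 0.0, px - x2)
    dy = max(y1 - py, 0.0, py - y2)
    diagonal = max(1.0, math.hypot(x2 - x1, y2 - y1))
    return math.hypot(dx, dy) / diagonal


def check_contact(point, box, threshold=0.20):
    """Return a small grounded report instead of a bare boolean."""
    rel = relative_distance(point, box)
    return {
        "decision": "PASS" if rel <= threshold else "FAIL",
        "relative_distance": round(rel, 2),
        "threshold": threshold,
    }


left_wrist = (370, 790)
book_box = (450, 610, 650, 750)
print(check_contact(left_wrist, book_box))
```

Because the check is a pure function of coordinates and a threshold, it can be pinned down with fixture tests and recalibrated without touching the detector.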

It is repairable because failures are local. If the wrist point is wrong, the detector failed. If the book box is wrong, the object extractor failed. If both are correct but the threshold is too strict, the verifier needs calibration. If the wrong objects were selected, the context filter failed.

A vague model failure says:

The image looks slightly off.

A primitive reasoning failure says:

The chip placement failed because chip_box is outside socket_region.
The socket detector confidence is only 0.41, so this issue should be reviewed.

That gives the next step somewhere to start.
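
A sketch of how a verifier might attach that kind of attribution to its output follows. The box coordinates, the chip confidence, and the min_confidence cutoff are made-up values for illustration; only the 0.41 socket confidence comes from the example above.

```python
def is_box_inside(inner, outer, tolerance=0.0):
    """True if the inner box lies within the outer box, allowing a pixel tolerance."""
    ix1, iy1, ix2, iy2 = inner
    ox1, oy1, ox2, oy2 = outer
    return (
        ix1 >= ox1 - tolerance
        and iy1 >= oy1 - tolerance
        and ix2 <= ox2 + tolerance
        and iy2 <= oy2 + tolerance
    )


def verify_placement(chip, socket, min_confidence=0.5, tolerance=5.0):
    """Check chip-in-socket placement and flag results that rest on weak detections."""
    inside = is_box_inside(chip["box"], socket["box"], tolerance)
    weakest = min(chip["confidence"], socket["confidence"])
    evidence = [
        "chip_box is inside socket_region" if inside else "chip_box is outside socket_region",
        f"lowest detector confidence is {weakest:.2f}",
    ]
    return {
        "decision": "PASS" if inside else "FAIL",
        "needs_review": weakest < min_confidence,  # assumed review policy
        "evidence": evidence,
    }


chip = {"box": (120, 80, 180, 140), "confidence": 0.93}    # hypothetical detection
socket = {"box": (150, 90, 220, 150), "confidence": 0.41}  # confidence from the example
print(verify_placement(chip, socket))
```

The decision and the review flag are kept separate on purpose: a FAIL built on a weak detection is a different situation from a FAIL built on confident evidence.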

The result is a closed-loop improvement system:

generate
extract primitives
filter by task
verify relations
repair
regenerate
verify again

That is why primitive reasoning is useful. It turns visual judgment into evidence, relation, threshold, confidence, and repair.
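
The closed loop above can be sketched as a small driver function. The generator and verifier here are stand-ins: fake_generate and fake_verify are hypothetical stubs that only simulate a model fixing a failed contact once it receives a repair instruction.

```python
def run_repair_loop(generate, verify, prompt, max_rounds=3):
    """Generate, verify, and regenerate with the repair instruction until PASS or budget runs out."""
    instruction = None
    history = []
    for round_index in range(max_rounds):
        artifact = generate(prompt, instruction)
        decision, instruction = verify(artifact)
        history.append((round_index, decision, instruction))
        if decision == "PASS":
            break
    return history


def fake_generate(prompt, repair_instruction):
    # Stub: the "model" only gets the contact right after being told what to fix.
    return {"prompt": prompt, "hand_touches_book": repair_instruction is not None}


def fake_verify(artifact):
    # Stub: a real verifier would extract primitives and check relations here.
    if artifact["hand_touches_book"]:
        return "PASS", None
    return "FAIL", "Move the left hand so it visibly touches or holds the book."


for entry in run_repair_loop(fake_generate, fake_verify, "a person holding a book"):
    print(entry)
```

The max_rounds budget matters in practice: a generator that never satisfies the relation should end in a reported failure, not an endless loop.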


12. Limitations

This architecture is useful, but it has real limitations.

The first limitation is practical rather than technical: the detector backend matters.

In this post, YOLO is the easiest example because it is fast, widely used, and simple to run locally. But “YOLO” is not one single thing. It is a family of object-detection models and implementations with different licenses and deployment constraints.

The common Ultralytics YOLO stack is available under AGPL-3.0 by default, with a separate Enterprise license for proprietary or production use. For an open-source demo, research notebook, or local experiment, that may be acceptable. For a closed-source product, internal tool, SaaS deployment, or commercial generation pipeline, it may not be.

That means the architecture should not depend on YOLO specifically. YOLO should be treated as one possible primitive extractor:

PrimitiveExtractor
  ├── YOLOExtractor
  ├── TorchVisionDetectorExtractor
  ├── GroundingDINOExtractor
  ├── DETR / RT-DETR extractor
  ├── MediaPipePoseExtractor
  ├── SAM / segmentation extractor
  └── ManualFixtureExtractor
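
In Python, that extractor boundary could be expressed as a structural interface. The sketch below is one possible shape, not a fixed API: the Primitive fields and the extract signature are assumptions, and only the fixture-backed extractor is implemented so the example runs without any detector installed.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Protocol


@dataclass
class Primitive:
    primitive_id: str
    primitive_type: str  # for example "box" or "point"
    label: str
    coordinates: list[float]
    confidence: float
    source: str


class PrimitiveExtractor(Protocol):
    """Anything with a matching extract() method counts as an extractor."""

    def extract(self, frame: Any, frame_index: int = 0) -> list[Primitive]: ...


class ManualFixtureExtractor:
    """Returns hand-written primitives, so the reasoning layer can be tested without a detector."""

    def __init__(self, fixtures: list[Primitive]) -> None:
        self._fixtures = fixtures

    def extract(self, frame: Any, frame_index: int = 0) -> list[Primitive]:
        return list(self._fixtures)


def extract_all(extractors: list[PrimitiveExtractor], frame: Any) -> list[Primitive]:
    """Merge candidate primitives from every configured backend."""
    primitives: list[Primitive] = []
    for extractor in extractors:
        primitives.extend(extractor.extract(frame))
    return primitives


fixture = ManualFixtureExtractor(
    [Primitive("book_1", "box", "book", [450, 610, 650, 750], 1.0, "fixture")]
)
print(extract_all([fixture], frame=None))
```

A YOLO-backed or TorchVision-backed class would implement the same extract method, which keeps the licensing decision confined to one adapter.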

The second limitation is that primitive reasoning is only as good as the primitives it receives.

YOLO can miss objects. Pose models can misplace hands. Trackers can swap identities. Open-vocabulary detectors can hallucinate labels. Segmentation models can produce masks that are too broad or too narrow.

Primitive reasoning reduces hallucination, but it does not eliminate it. The right phrase is not “zero hallucinations.” It is:

Bounded hallucination.

The reasoning layer is constrained to a declared primitive set, so unsupported visual entities are easier to catch. If the system says the character is holding a book, it should be able to point to the hand primitive, the book primitive, and the relation between them. But the system can still inherit detector errors, select the wrong active primitive set, overinterpret weak relations, or use thresholds that are too strict or too loose.

The third limitation is domain specificity. General object detectors may recognize person, cup, chair, or book, but fail on objects such as bed rail, neural interface, surgical telemetry pad, chip socket, pin row, board contact, robot gripper, or custom tool head.

For those cases, the pipeline needs specialized extractors, open-vocabulary detectors, segmentation, manual fixtures, or project-specific fine-tuning.

The fourth limitation is semantic ambiguity. Many useful relations can be checked geometrically:

near
inside
overlapping
aligned_with
touching
moving_toward
transferred_to

But some relations require semantic judgment:

comforting
threatening
hesitating
pretending
arguing
agreeing
acting suspiciously

Primitive reasoning can provide evidence for those higher-level judgments, but it may not fully determine them.
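
Several of the geometric relations listed above reduce to a few lines of arithmetic. The sketch below shows one plausible definition each for overlapping, aligned_with, and moving_toward; the tolerances and the choice of box centers are assumptions, and other definitions are equally reasonable.

```python
import math


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def overlapping(a, b, min_iou=0.0):
    return iou(a, b) > min_iou


def center(box):
    return ((box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0)


def aligned_with(a, b, axis="y", tolerance=10.0):
    """True if the box centers agree along the given axis within a tolerance."""
    index = 0 if axis == "x" else 1
    return abs(center(a)[index] - center(b)[index]) <= tolerance


def moving_toward(track, target, min_decrease=1.0):
    """True if every step in the track reduces the distance to the target point."""
    distances = [math.dist(point, target) for point in track]
    return len(distances) >= 2 and all(
        later <= earlier - min_decrease
        for earlier, later in zip(distances, distances[1:])
    )


print(overlapping((0, 0, 10, 10), (5, 5, 15, 15)))
print(aligned_with((0, 0, 10, 10), (20, 2, 30, 12), axis="y"))
print(moving_toward([(349.0, 545.0), (458.5, 546.5), (572.0, 548.5)], (560, 550)))
```

None of these functions can say whether a gesture is comforting or threatening. They can only supply the measurements a higher-level judgment would cite.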

The final limitation is temporal fragility. Video requires the system to preserve identity over time:

person_1 remains person_1
object_1 remains object_1
hand trajectory is continuous
object does not disappear or swap identity

Trackers can fail when objects overlap, leave frame, re-enter, become occluded, or change appearance. A single ID swap can break the event trace.
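
A cheap sanity check can catch some of these failures before they corrupt an event trace. The sketch below flags implausible position jumps and missing frames within a single track; max_jump and max_gap are assumed thresholds that would need tuning per resolution and frame rate, and the track data is invented for the example.

```python
import math


def find_track_breaks(track, max_jump=80.0, max_gap=1):
    """Flag suspicious discontinuities in one track.

    track maps frame_index -> (x, y) center of the tracked object.
    Returns a list of (previous_frame, frame, reason) tuples.
    """
    breaks = []
    frames = sorted(track)
    for prev, cur in zip(frames, frames[1:]):
        if cur - prev > max_gap:
            # The object was not observed for one or more frames.
            breaks.append((prev, cur, "missing_frames"))
            continue
        if math.dist(track[prev], track[cur]) > max_jump:
            # A large jump between adjacent frames may indicate an ID swap.
            breaks.append((prev, cur, "position_jump"))
    return breaks


# Invented track: a jump at frame 13 and a gap between frames 14 and 17.
cup_track = {
    10: (100, 200),
    11: (104, 201),
    12: (108, 202),
    13: (400, 210),
    14: (404, 211),
    17: (410, 212),
}
print(find_track_breaks(cup_track))
# [(12, 13, 'position_jump'), (14, 17, 'missing_frames')]
```

A verifier could downgrade any event whose frame range contains a flagged break to UNCERTAIN rather than trusting it.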

The core limitation is this:

Primitive reasoning does not remove the need for perception, judgment, or calibration. It gives those failures a structure.

That is still valuable.

A vague model failure says:

The scene looks wrong.

A primitive reasoning failure says:

The chip placement failed because chip_box is outside socket_region.
But the socket detector confidence is only 0.41, so this issue should be reviewed.

The goal is not perfection. The goal is inspectability.

Given these constraints, it’s worth comparing our external-verifier approach with the internal-primitive paradigm of the original paper and with pure VLM-based critics.


13. Comparative Analysis: Internal vs. External Primitives, and the Verifier Loop

The Thinking with Visual Primitives paper demonstrated that large multimodal models benefit from interleaving spatial markers (boxes, points) directly into their chain‑of‑thought. However, the primitives are internal to the model; they are generated as text tokens in the reasoning trace. This makes the reasoning more precise, but it does not automatically make it verifiable or repairable by an external system.

Pure VLM‑based critics (e.g., asking a frontier model “Does this image match the prompt?”) produce natural‑language feedback. That feedback can sound plausible, but it lacks machine‑verifiable pointers to specific image regions, making it difficult to measure improvement or to automate repair.

Our work occupies a distinct point in the design space: we use external primitive extractors (YOLO, pose models, trackers) and a deterministic verifier loop that can be inspected, tested, and iterated. The table below summarises the key differences.

| Capability | Thinking with Visual Primitives (DeepSeek) | VLM‑based critics (e.g., GPT‑4V) | This work (Reference‑Grounded Verification) |
| --- | --- | --- | --- |
| Primitive source | Model‑internal, generated during chain‑of‑thought | No explicit primitives; only language | External detectors: YOLO, pose, tracking, segmentation |
| Verifiability | Primitive output is part of the model’s text; coordinates can be checked but the reasoning trace is still a black box | Low: feedback is natural language, not anchored to coordinates | High: every issue must cite one or more primitive IDs and provide deterministic evidence |
| Repairability | The model can be prompted to revise, but the revision instruction is again language‑only | The critic can suggest a revision prompt, but it’s vague (“improve the anatomy”) | The verifier produces a targeted repair instruction grounded in a failed relation (e.g., “Move left_wrist onto book_box”) |
| Generalisability to video | Limited to single‑frame spatial reasoning in current work | Can describe a video clip but loses track of object identity across frames | Designed for temporal primitives; object tracks and event relations are first‑class |
| Deterministic component | No deterministic layer; model outputs are stochastic | None | Geometry verifiers (distance, angle, inside) are pure functions with fixture‑based tests |
| Open‑vocabulary objects | Learnt from pretraining data; box output is tied to known classes | Can handle novel objects through language, but cannot ground them reliably | Plug‑in detectors (Grounding DINO, SAM) extend the vocabulary; the verifier logic is class‑agnostic |
| Inspectability | You can read the CoT and the boxes, but you cannot verify the reasoning steps externally | Opaque | Every issue can be traced back to specific primitives and a numeric threshold |

The optimal future system is likely a hybrid. A model could produce internal primitives and relation hypotheses, while an external verifier loop checks them against geometric constraints, prompt requirements, and temporal consistency. Our architecture provides the outer loop; the paper provides a path toward the inner loop. Together, they point toward AI systems that not only think with primitives but also check those thoughts against evidence, and know exactly what to fix when a check fails.

This comparison underscores the core contribution: we are not replacing the paper’s insight, but building the verification and repair layer around it, in a way that is deterministic, inspectable, and transferable to domains well beyond images.


14. The Generalization: What We Are Really Saying

The broader claim is not limited to images, YOLO, or movie generation.

The claim is this:

Reasoning becomes more reliable when it operates over task-relevant primitives instead of raw context.

Humans seem to do this naturally. We do not reason over everything we see, know, remember, or imagine. We activate the parts of the world that matter for the current situation.

If we think about an airplane at the airport, we activate gates, seats, luggage, boarding, and takeoff. If we think about an airplane accident, we activate wreckage, trajectory, failure, smoke, and emergency response. If we think about an airplane in war, we activate radar, missiles, speed, threat, and target.

The object is the same. The active primitive set is different.

That is the cognitive pattern this architecture tries to mimic:

raw world
possible primitives
context-filtered active primitive set
relations
verification
decision or repair

Across domains, the same structure appears:

| What we are saying | In human thought | In an AI system | What it enables |
| --- | --- | --- | --- |
| Do not reason over everything | Ignore irrelevant detail | Avoid passing all pixels, tokens, detections, or frames forward | Less noise and confusion |
| Find possible handles | Notice objects, parts, places, causes | Extract boxes, points, spans, symbols, tracks | Grounded references |
| Select by context | Activate what matters for the current question | Build an active primitive set | Task-focused reasoning |
| Reason over relations | Understand how parts connect or change | Check touching, holding, inside, aligned, transferred | Structured inference |
| Verify against expectation | Decide whether the situation satisfies the goal | Compare observed relations to required relations | Pass/fail decisions |
| Repair locally | Fix the part that failed | Produce targeted regeneration or correction instructions | Iterative improvement |

This is the heart of the idea.

Primitive extraction alone is not enough. A detector can find hundreds of things. A language model can consume thousands of tokens. A video model can generate thousands of frames. But intelligence is not the amount of material available to the system.

Intelligence is the ability to select the right material for the current decision.

The paper shows that multimodal models improve when they can point while reasoning. This architecture adds the next layer:

A useful system must know which things are worth pointing at.

That gives us the distinction:

Perception Gap:
Can the system see enough?

Reference Gap:
Can the system point to what it means?

Relevance Gap:
Can the system select which references matter?

The first gap is about seeing. The second gap is about pointing. The third gap is about understanding the task.

The final generalization is simple:

More context is not more intelligence. The right primitive set is.

A system that reasons over everything gets lost. A system that extracts primitives but fails to filter them drowns in candidate references. A system that selects the active primitive set can reason more like a human: not by reconstructing the whole world, but by activating the parts of the world needed for the decision in front of it.

We are taking the paper’s idea of visual primitives and extending it into a broader principle:

Reasoning is context-filtered primitive selection, followed by relation verification.


Conclusion

The Thinking with Visual Primitives paper matters because it challenges a hidden assumption: that reasoning must happen primarily in language.

It shows that multimodal models can reason better when they can point. Points and boxes become minimal units of thought. The model’s reasoning no longer floats above the image in vague phrases like “that object” or “the thing on the left.” It binds itself to physical coordinates.

That is the first step.

But once a system can point, a second question appears:

What should it point at?

A detector may find hundreds of objects. A pose model may produce dozens of keypoints. A tracker may preserve identities across thousands of frames. More perception gives the system more possible references, but more references do not automatically produce better reasoning.

Without context filtering, the model can still get lost.

It no longer drowns in pixels. It drowns in primitives.

That is the extension proposed here.

The paper identifies the Reference Gap: language is not precise enough to anchor visual reasoning. This post adds the Relevance Gap: even when references exist, the system must select the references that matter for the current task.

The resulting architecture is simple:

raw content
candidate primitives
context filter
active primitive set
relations
verification
repair

YOLO, pose models, trackers, segmenters, and open-vocabulary detectors can extract candidate primitives. A context filter selects the active primitive set. A relation layer checks whether the relevant objects, hands, parts, trajectories, symbols, or spans are connected in the expected way. Verifiers turn failures into grounded issues. Repair builders turn those issues into targeted regeneration or correction instructions.

That gives us a practical path from primitive reasoning to real applications:

  • generated image critique,
  • entity interaction detection,
  • object placement verification,
  • assembly checking,
  • generated video validation,
  • movie scene repair.

The deeper claim is not about YOLO, image generation, or any one model.

The deeper claim is this:

Reasoning is context-filtered primitive selection followed by relation verification.

Humans seem to do something like this naturally. We do not reason over the whole world. We activate the parts of the world that matter. The airplane at the airport is not the airplane in an accident, and neither is the airplane in war. The object may be the same, but the active primitive set changes with the task.

AI systems need the same discipline.

More context is not more intelligence. More pixels are not more understanding. More detections are not more reasoning.

A useful system does not merely see more. It selects better.

So the final lesson is not just:

AI should point while it reasons.

It is:

AI should know what is worth pointing at.

That is what turns visual primitives into a broader architecture for grounded, inspectable, repairable reasoning.


Appendix: A Minimal Reference-Grounded Reasoning Demo

This appendix implements a small version of the architecture described in the post.

It demonstrates the full loop:

candidate primitives
context filter
active primitive set
relations
verification
repair instruction

The demo intentionally uses fixture data first. This makes the reasoning layer deterministic and easy to inspect. After that, an optional YOLO adapter shows how real detections can be converted into the same primitive format.

Install

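pip install pydantic numpy

The DTOs use X | None type annotations, which pydantic evaluates at runtime, so this assumes Python 3.10 or newer.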

Optional YOLO support:

pip install ultralytics opencv-python

Single-file demo: reference_reasoning_demo.py

"""
reference_reasoning_demo.py

A minimal reference-grounded primitive reasoning demo.

This file demonstrates:

1. ReferencePrimitiveDTO
2. PrimitiveRelationDTO
3. Context filtering
4. Relation building
5. Verification
6. Repair instruction generation
7. Optional YOLO extraction adapter

Run fixture demos:

    python reference_reasoning_demo.py

Optional YOLO demo:

    python reference_reasoning_demo.py --image path/to/image.jpg

The fixture demos do not require YOLO.
"""

from __future__ import annotations

import argparse
import math
from typing import Any, Literal

from pydantic import BaseModel, Field


# ============================================================
# 1. DTOs
# ============================================================

PrimitiveType = Literal[
    "box",
    "point",
    "keypoint",
    "polyline",
    "mask",
    "span",
    "symbol",
    "region",
    "trajectory",
]

RelationType = Literal[
    "near",
    "touching",
    "inside",
    "aligned_with",
    "moving_toward",
    "transferred_to",
    "unknown",
]


class ReferencePrimitiveDTO(BaseModel):
    """
    A machine-verifiable pointer.

    In images, this may be a box or keypoint.
    In video, this may be a trajectory.
    In documents, this could be a span.
    In code, this could be a symbol.

    For this demo, we mainly use boxes and points.
    """

    primitive_id: str
    primitive_type: PrimitiveType
    label: str | None = None
    coordinates: Any
    confidence: float | None = None
    source: str = "fixture"
    frame_index: int | None = None
    parent_primitive_id: str | None = None
    meta: dict[str, Any] = Field(default_factory=dict)


class PrimitiveRelationDTO(BaseModel):
    """
    A checkable relationship between two primitives.
    """

    relation_id: str
    subject_primitive_id: str
    relation_type: RelationType
    object_primitive_id: str
    confidence: float
    evidence_primitive_ids: list[str] = Field(default_factory=list)
    frame_range: tuple[int, int] | None = None
    meta: dict[str, Any] = Field(default_factory=dict)


class ActivePrimitiveSetDTO(BaseModel):
    """
    The context-filtered subset of primitives relevant to the task.
    """

    task: str
    selected: list[ReferencePrimitiveDTO]
    rejected: list[ReferencePrimitiveDTO] = Field(default_factory=list)
    rationale: dict[str, str] = Field(default_factory=dict)


class VerificationResultDTO(BaseModel):
    """
    A grounded verifier output.
    """

    task: str
    decision: Literal["PASS", "FAIL", "UNCERTAIN"]
    confidence: float
    evidence: list[str]
    observed_relations: list[PrimitiveRelationDTO] = Field(default_factory=list)
    repair_instruction: str | None = None


class TaskDTO(BaseModel):
    """
    A structured task description consumed by the context filter.
    """

    task_id: str
    action_type: Literal[
        "object_contact",
        "object_transfer",
        "placement_verification",
        "gaze_verification",
        "anatomy_check",
    ]
    subject_labels: list[str] = Field(default_factory=list)
    object_labels: list[str] = Field(default_factory=list)
    required_body_parts: list[str] = Field(default_factory=list)
    required_relations: list[str] = Field(default_factory=list)
    temporal: bool = False


# ============================================================
# 2. Geometry helpers
# ============================================================

def box_center(box: list[float]) -> tuple[float, float]:
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def box_diagonal(box: list[float]) -> float:
    x1, y1, x2, y2 = box
    return math.sqrt((x2 - x1) ** 2 + (y2 - y1) ** 2)


def point_to_box_distance(point: list[float], box: list[float]) -> float:
    px, py = point
    x1, y1, x2, y2 = box

    dx = max(x1 - px, 0, px - x2)
    dy = max(y1 - py, 0, py - y2)

    return math.sqrt(dx * dx + dy * dy)


def point_distance(a: list[float], b: list[float]) -> float:
    return math.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)


def is_point_near_box(
    point: list[float],
    box: list[float],
    relative_threshold: float = 0.20,
) -> tuple[bool, float]:
    """
    Returns:
        is_near: whether point is close enough to the box
        relative_distance: distance divided by object diagonal
    """
    dist = point_to_box_distance(point, box)
    diag = max(1.0, box_diagonal(box))
    rel = dist / diag
    return rel <= relative_threshold, rel


def is_box_inside(inner: list[float], outer: list[float], tolerance: float = 0.0) -> bool:
    ix1, iy1, ix2, iy2 = inner
    ox1, oy1, ox2, oy2 = outer

    return (
        ix1 >= ox1 - tolerance
        and iy1 >= oy1 - tolerance
        and ix2 <= ox2 + tolerance
        and iy2 <= oy2 + tolerance
    )


def normalize_box_0_999(box: list[float], width: int, height: int) -> list[int]:
    x1, y1, x2, y2 = box

    return [
        int((x1 / width) * 999),
        int((y1 / height) * 999),
        int((x2 / width) * 999),
        int((y2 / height) * 999),
    ]


# ============================================================
# 3. Context filter
# ============================================================

class ContextFilter:
    """
    Selects the primitives relevant to the current task.

    This is the Relevance Gap layer:
    the system should not reason over everything it can detect.

    The filter accepts either:
    - a structured TaskDTO, preferred for serious use
    - a plain string, converted into a TaskDTO for demo convenience
    """

    def select(
        self,
        task: TaskDTO | str,
        primitives: list[ReferencePrimitiveDTO],
    ) -> ActivePrimitiveSetDTO:
        task_dto = self._coerce_task(task)

        selected: list[ReferencePrimitiveDTO] = []
        rejected: list[ReferencePrimitiveDTO] = []
        rationale: dict[str, str] = {}

        wanted_labels = {
            item.lower()
            for item in (
                task_dto.subject_labels
                + task_dto.object_labels
                + task_dto.required_body_parts
            )
        }

        for primitive in primitives:
            label = (primitive.label or "").lower()
            pid = primitive.primitive_id.lower()

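            # Substring match against the label or the primitive ID.
            # With no task labels, nothing matches and every primitive is rejected.
            keep = any(
                wanted in label or wanted in pid
                for wanted in wanted_labels
            )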

            if keep:
                selected.append(primitive)
                rationale[primitive.primitive_id] = (
                    f"Selected for {task_dto.action_type}: "
                    f"matches one of {sorted(wanted_labels)}"
                )
            else:
                rejected.append(primitive)
                rationale[primitive.primitive_id] = (
                    f"Rejected for {task_dto.action_type}: "
                    f"does not match active task labels"
                )

        return ActivePrimitiveSetDTO(
            task=task_dto.task_id,
            selected=selected,
            rejected=rejected,
            rationale=rationale,
        )

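    def _coerce_task(self, task: TaskDTO | str) -> TaskDTO:
        """
        Convert a plain-text task into a TaskDTO using keyword heuristics.

        Matching is by substring and the first matching branch wins, so
        ambiguous wording (for example "hand" in a question about holding)
        is routed to object_transfer. Pass a TaskDTO for serious use.
        """
        if isinstance(task, TaskDTO):
            return task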

        task_l = task.lower()

        if any(word in task_l for word in ["hand", "handoff", "give", "transfer", "cup", "envelope"]):
            return TaskDTO(
                task_id=task,
                action_type="object_transfer",
                subject_labels=["person", "alice", "maria"],
                object_labels=["cup", "envelope"],
                required_body_parts=["wrist", "hand"],
                required_relations=["near", "moving_toward", "transferred_to"],
                temporal=True,
            )

        if any(word in task_l for word in ["hold", "holding", "book"]):
            return TaskDTO(
                task_id=task,
                action_type="object_contact",
                subject_labels=["person"],
                object_labels=["book"],
                required_body_parts=["wrist", "hand"],
                required_relations=["near", "touching"],
                temporal=False,
            )

        if any(word in task_l for word in ["chip", "board", "socket", "place", "placement"]):
            return TaskDTO(
                task_id=task,
                action_type="placement_verification",
                subject_labels=["chip"],
                object_labels=["socket", "board", "circuit_board"],
                required_body_parts=["pin", "contact"],
                required_relations=["inside", "aligned_with"],
                temporal=False,
            )

        return TaskDTO(
            task_id=task,
            action_type="object_contact",
            subject_labels=[],
            object_labels=[],
            required_body_parts=[],
            required_relations=[],
            temporal=False,
        )


# ============================================================
# 4. Relation builder
# ============================================================

class RelationBuilder:
    """
    Converts primitives into primitive relations.

    The demo supports:
    - near(point, box)
    - inside(box, box)
    - simple transferred_to relation from frame sequence
    """

    def build(self, active_set: ActivePrimitiveSetDTO) -> list[PrimitiveRelationDTO]:
        primitives = active_set.selected
        relations: list[PrimitiveRelationDTO] = []

        points = [p for p in primitives if p.primitive_type in {"point", "keypoint"}]
        boxes = [p for p in primitives if p.primitive_type == "box"]

        # Spatial near relations between keypoints and boxes
        for point in points:
            for box in boxes:
                if point.parent_primitive_id == box.primitive_id:
                    continue

                near, rel_dist = is_point_near_box(
                    point.coordinates,
                    box.coordinates,
                    relative_threshold=0.25,
                )

                if near:
                    relations.append(
                        PrimitiveRelationDTO(
                            relation_id=f"rel_near_{point.primitive_id}_{box.primitive_id}",
                            subject_primitive_id=point.primitive_id,
                            relation_type="near",
                            object_primitive_id=box.primitive_id,
                            confidence=self._combined_confidence(point, box, base=0.85),
                            evidence_primitive_ids=[point.primitive_id, box.primitive_id],
                            meta={"relative_distance": rel_dist},
                        )
                    )

        # Inside relations between boxes (same frame only)
        for inner in boxes:
            for outer in boxes:
                if inner.primitive_id == outer.primitive_id:
                    continue
                if inner.frame_index != outer.frame_index:
                    continue

                if is_box_inside(inner.coordinates, outer.coordinates, tolerance=5):
                    relations.append(
                        PrimitiveRelationDTO(
                            relation_id=f"rel_inside_{inner.primitive_id}_{outer.primitive_id}",
                            subject_primitive_id=inner.primitive_id,
                            relation_type="inside",
                            object_primitive_id=outer.primitive_id,
                            confidence=self._combined_confidence(inner, outer, base=0.90),
                            evidence_primitive_ids=[inner.primitive_id, outer.primitive_id],
                        )
                    )

        # Simple temporal transfer relation, built only when the primitives include
        # frame-indexed object boxes and wrist keypoints.
        relations.extend(self._build_simple_transfer_relations(primitives))

        return relations

    def _combined_confidence(
        self,
        a: ReferencePrimitiveDTO,
        b: ReferencePrimitiveDTO,
        base: float,
    ) -> float:
        ca = a.confidence if a.confidence is not None else 1.0
        cb = b.confidence if b.confidence is not None else 1.0
        return round(min(ca, cb) * base, 3)

    def _build_simple_transfer_relations(
        self,
        primitives: list[ReferencePrimitiveDTO],
    ) -> list[PrimitiveRelationDTO]:
        """
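        Toy temporal transfer detector.

        Looks for an object such as cup/envelope near one person's wrist in the
        earliest frame and near a different person's wrist in the latest frame.
        Only those two frames are compared; intermediate frames are ignored.
        Two wrists that both lack a parent_primitive_id are treated as the same
        person and never produce a transfer.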
        """
        relations: list[PrimitiveRelationDTO] = []

        object_boxes = [
            p for p in primitives
            if p.primitive_type == "box"
            and p.label
            and p.label.lower() in {"cup", "envelope"}
            and p.frame_index is not None
        ]

        hands = [
            p for p in primitives
            if p.primitive_type in {"point", "keypoint"}
            and p.label
            and "wrist" in p.label.lower()
            and p.frame_index is not None
        ]

        if not object_boxes or not hands:
            return relations

        early = min(p.frame_index for p in object_boxes if p.frame_index is not None)
        late = max(p.frame_index for p in object_boxes if p.frame_index is not None)

        early_objects = [p for p in object_boxes if p.frame_index == early]
        late_objects = [p for p in object_boxes if p.frame_index == late]

        early_hands = [p for p in hands if p.frame_index == early]
        late_hands = [p for p in hands if p.frame_index == late]

        for obj_early in early_objects:
            for hand_early in early_hands:
                near_early, _ = is_point_near_box(hand_early.coordinates, obj_early.coordinates)

                if not near_early:
                    continue

                for obj_late in late_objects:
                    if obj_late.label != obj_early.label:
                        continue

                    for hand_late in late_hands:
                        if hand_late.parent_primitive_id == hand_early.parent_primitive_id:
                            continue

                        near_late, _ = is_point_near_box(hand_late.coordinates, obj_late.coordinates)

                        if near_late:
                            relations.append(
                                PrimitiveRelationDTO(
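                                    # Include primitive IDs so each hand pair gets a unique relation_id.
                                    relation_id=(
                                        f"rel_transferred_{obj_early.primitive_id}_{hand_early.primitive_id}"
                                        f"_to_{obj_late.primitive_id}_{hand_late.primitive_id}"
                                    ),
                                    subject_primitive_id=obj_late.primitive_id,
                                    relation_type="transferred_to",
                                    object_primitive_id=hand_late.primitive_id,
                                    # Fixed demo value; detector confidences are not propagated here.
                                    confidence=0.78,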
                                    evidence_primitive_ids=[
                                        obj_early.primitive_id,
                                        hand_early.primitive_id,
                                        obj_late.primitive_id,
                                        hand_late.primitive_id,
                                    ],
                                    frame_range=(early, late),
                                    meta={
                                        "from_hand": hand_early.primitive_id,
                                        "to_hand": hand_late.primitive_id,
                                    },
                                )
                            )

        return relations


# ============================================================
# 5. Verifiers
# ============================================================

class Verifier:
    """
    Verifies task-specific requirements against observed primitive relations.
    """

    def verify(
        self,
        task: str,
        active_set: ActivePrimitiveSetDTO,
        relations: list[PrimitiveRelationDTO],
    ) -> VerificationResultDTO:
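        # Routing uses plain substring checks, so branch order matters: "hold"/"book"
        # tasks are claimed before the transfer branch, and "hand" also matches words
        # such as "handle". Adequate for the fixtures, fragile for free-form tasks.
        task_l = task.lower()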

        if any(word in task_l for word in ["hold", "holding", "book"]):
            return self._verify_holding_book(task, active_set, relations)

        if any(word in task_l for word in ["hand", "handoff", "give", "transfer", "cup", "envelope"]):
            return self._verify_object_transfer(task, active_set, relations)

        if any(word in task_l for word in ["chip", "board", "socket", "placement", "place"]):
            return self._verify_chip_placement(task, active_set, relations)

        return VerificationResultDTO(
            task=task,
            decision="UNCERTAIN",
            confidence=0.2,
            evidence=["No verifier matched this task."],
            observed_relations=relations,
            repair_instruction="Add a verifier for this task type.",
        )

    def _verify_holding_book(
        self,
        task: str,
        active_set: ActivePrimitiveSetDTO,
        relations: list[PrimitiveRelationDTO],
    ) -> VerificationResultDTO:
        book_ids = {
            p.primitive_id
            for p in active_set.selected
            if (p.label or "").lower() == "book"
        }

        wrist_near_book = [
            r for r in relations
            if r.relation_type == "near"
            and r.object_primitive_id in book_ids
            and "wrist" in r.subject_primitive_id.lower()
        ]

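        # Require two distinct wrists, not just two relations
        # (one wrist near two book boxes would otherwise pass).
        if len({r.subject_primitive_id for r in wrist_near_book}) >= 2: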
            return VerificationResultDTO(
                task=task,
                decision="PASS",
                confidence=min(r.confidence for r in wrist_near_book),
                evidence=[
                    "Both wrist primitives are near the book primitive.",
                    *[self._relation_summary(r) for r in wrist_near_book],
                ],
                observed_relations=wrist_near_book,
            )

        return VerificationResultDTO(
            task=task,
            decision="FAIL",
            confidence=0.75,
            evidence=[
                "Expected both wrists to be near the book.",
                f"Observed wrist-book relations: {len(wrist_near_book)}",
            ],
            observed_relations=wrist_near_book,
            repair_instruction="Move both hands so they visibly touch or hold the book.",
        )

    def _verify_object_transfer(
        self,
        task: str,
        active_set: ActivePrimitiveSetDTO,
        relations: list[PrimitiveRelationDTO],
    ) -> VerificationResultDTO:
        transfers = [r for r in relations if r.relation_type == "transferred_to"]

        if transfers:
            return VerificationResultDTO(
                task=task,
                decision="PASS",
                confidence=max(r.confidence for r in transfers),
                evidence=[
                    "A transferred_to relation was detected.",
                    *[self._relation_summary(r) for r in transfers],
                ],
                observed_relations=transfers,
            )

        return VerificationResultDTO(
            task=task,
            decision="FAIL",
            confidence=0.72,
            evidence=[
                "No transferred_to relation was detected.",
                "The object did not move from one hand region to another in the observed frame range.",
            ],
            observed_relations=[],
            repair_instruction=(
                "Regenerate the relevant frames with the object remaining visible "
                "and moving from the giver's hand region into the receiver's hand region."
            ),
        )

    def _verify_chip_placement(
        self,
        task: str,
        active_set: ActivePrimitiveSetDTO,
        relations: list[PrimitiveRelationDTO],
    ) -> VerificationResultDTO:
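        # Match by label, as _verify_holding_book does, instead of relying on ID naming.
        chip_ids = {
            p.primitive_id for p in active_set.selected if (p.label or "").lower() == "chip"
        }
        socket_ids = {
            p.primitive_id for p in active_set.selected if (p.label or "").lower() == "socket"
        }

        inside_relations = [
            r for r in relations
            if r.relation_type == "inside"
            and r.subject_primitive_id in chip_ids
            and r.object_primitive_id in socket_ids
        ]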

        if inside_relations:
            return VerificationResultDTO(
                task=task,
                decision="PASS",
                confidence=max(r.confidence for r in inside_relations),
                evidence=[
                    "chip primitive is inside socket primitive.",
                    *[self._relation_summary(r) for r in inside_relations],
                ],
                observed_relations=inside_relations,
            )

        return VerificationResultDTO(
            task=task,
            decision="FAIL",
            confidence=0.80,
            evidence=[
                "Expected chip to be inside socket region.",
                "No inside(chip, socket_region) relation was detected.",
            ],
            observed_relations=[],
            repair_instruction=(
                "Reposition the chip so its bounding box sits inside the socket region "
                "and its pins align with the board contacts."
            ),
        )

    def _relation_summary(self, relation: PrimitiveRelationDTO) -> str:
        return (
            f"{relation.relation_type}("
            f"{relation.subject_primitive_id}, {relation.object_primitive_id}"
            f") confidence={relation.confidence}"
        )


# ============================================================
# 6. Fixture demos
# ============================================================

def demo_image_holding_book() -> None:
    print("\n=== Demo 1: Generated image critique — holding a book ===")

    task = "Is the person holding the book?"

    primitives = [
        ReferencePrimitiveDTO(
            primitive_id="person_1",
            primitive_type="box",
            label="person",
            coordinates=[100, 80, 760, 960],
            confidence=0.96,
        ),
        ReferencePrimitiveDTO(
            primitive_id="book_1",
            primitive_type="box",
            label="book",
            coordinates=[450, 610, 650, 750],
            confidence=0.92,
        ),
        ReferencePrimitiveDTO(
            primitive_id="left_wrist",
            primitive_type="keypoint",
            label="left_wrist",
            coordinates=[370, 790],
            confidence=0.88,
            parent_primitive_id="person_1",
        ),
        ReferencePrimitiveDTO(
            primitive_id="right_wrist",
            primitive_type="keypoint",
            label="right_wrist",
            coordinates=[720, 790],
            confidence=0.84,
            parent_primitive_id="person_1",
        ),
        ReferencePrimitiveDTO(
            primitive_id="desk_1",
            primitive_type="box",
            label="desk",
            coordinates=[50, 760, 900, 980],
            confidence=0.90,
        ),
    ]

    run_pipeline(task, primitives)


def demo_video_object_transfer_success() -> None:
    print("\n=== Demo 2: Video verification — successful cup transfer ===")

    task = "Does person_1 hand the cup to person_2?"

    primitives = [
        # Frame 10: cup near person_1 hand
        ReferencePrimitiveDTO(
            primitive_id="cup_f10",
            primitive_type="box",
            label="cup",
            coordinates=[190, 230, 215, 260],
            confidence=0.90,
            frame_index=10,
        ),
        ReferencePrimitiveDTO(
            primitive_id="person_1_right_wrist_f10",
            primitive_type="keypoint",
            label="right_wrist",
            coordinates=[185, 240],
            confidence=0.88,
            parent_primitive_id="person_1",
            frame_index=10,
        ),
        ReferencePrimitiveDTO(
            primitive_id="person_2_left_wrist_f10",
            primitive_type="keypoint",
            label="left_wrist",
            coordinates=[320, 245],
            confidence=0.86,
            parent_primitive_id="person_2",
            frame_index=10,
        ),
        # Frame 40: cup near person_2 hand
        ReferencePrimitiveDTO(
            primitive_id="cup_f40",
            primitive_type="box",
            label="cup",
            coordinates=[300, 232, 326, 262],
            confidence=0.89,
            frame_index=40,
        ),
        ReferencePrimitiveDTO(
            primitive_id="person_1_right_wrist_f40",
            primitive_type="keypoint",
            label="right_wrist",
            coordinates=[210, 245],
            confidence=0.83,
            parent_primitive_id="person_1",
            frame_index=40,
        ),
        ReferencePrimitiveDTO(
            primitive_id="person_2_left_wrist_f40",
            primitive_type="keypoint",
            label="left_wrist",
            coordinates=[312, 244],
            confidence=0.87,
            parent_primitive_id="person_2",
            frame_index=40,
        ),
    ]

    run_pipeline(task, primitives)


def demo_chip_placement_failure() -> None:
    print("\n=== Demo 3: Placement verification — failed chip placement ===")

    task = "Verify that chip_1 has been placed correctly on the circuit board socket."

    primitives = [
        ReferencePrimitiveDTO(
            primitive_id="chip_1",
            primitive_type="box",
            label="chip",
            coordinates=[420, 420, 520, 520],
            confidence=0.93,
        ),
        ReferencePrimitiveDTO(
            primitive_id="socket_region_1",
            primitive_type="box",
            label="socket",
            coordinates=[300, 300, 400, 400],
            confidence=0.91,
        ),
        ReferencePrimitiveDTO(
            primitive_id="board_1",
            primitive_type="box",
            label="circuit_board",
            coordinates=[100, 100, 800, 800],
            confidence=0.96,
        ),
    ]

    run_pipeline(task, primitives)


def run_pipeline(task: str, primitives: list[ReferencePrimitiveDTO]) -> VerificationResultDTO:
    context_filter = ContextFilter()
    relation_builder = RelationBuilder()
    verifier = Verifier()

    active_set = context_filter.select(task, primitives)
    relations = relation_builder.build(active_set)
    result = verifier.verify(task, active_set, relations)

    print(f"\nTask: {task}")

    print("\nSelected primitives:")
    for primitive in active_set.selected:
        print(f"  - {primitive.primitive_id}: {primitive.label} {primitive.coordinates}")

    print("\nRelations:")
    if relations:
        for relation in relations:
            print(
                f"  - {relation.relation_type}("
                f"{relation.subject_primitive_id}, {relation.object_primitive_id}"
                f") conf={relation.confidence}"
            )
    else:
        print("  - none")

    print("\nDecision:")
    print(f"  {result.decision} confidence={result.confidence}")

    print("\nEvidence:")
    for item in result.evidence:
        print(f"  - {item}")

    if result.repair_instruction:
        print("\nRepair:")
        print(f"  {result.repair_instruction}")

    return result


# ============================================================
# 7. Optional YOLO adapter
# ============================================================

def extract_yolo_primitives(image_path: str) -> list[ReferencePrimitiveDTO]:
    """
    Optional YOLO extractor.

    Requires:
        pip install ultralytics opencv-python

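    This adapter converts YOLO detections into ReferencePrimitiveDTO boxes.
    A detection checkpoint returns boxes only, so keypoint-based checks such as
    the wrist-near-book verifier need a separate pose model to supply wrists.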
    """
    try:
        from ultralytics import YOLO
    except ImportError as exc:
        raise RuntimeError(
            "Ultralytics is not installed. Run: pip install ultralytics opencv-python"
        ) from exc

    model = YOLO("yolo11n.pt")
    results = model(image_path)

    result = results[0]
    height, width = result.orig_shape
    names = result.names

    primitives: list[ReferencePrimitiveDTO] = []

    if result.boxes is None:
        return primitives

    for idx, box in enumerate(result.boxes):
        cls_id = int(box.cls[0])
        label = names[cls_id]
        confidence = float(box.conf[0])
        raw_box = [float(v) for v in box.xyxy[0].tolist()]
        norm_box = normalize_box_0_999(raw_box, width=width, height=height)

        primitives.append(
            ReferencePrimitiveDTO(
                primitive_id=f"{label}_{idx + 1}",
                primitive_type="box",
                label=label,
                coordinates=norm_box,
                confidence=round(confidence, 3),
                source="yolo",
                meta={
                    "raw_box_xyxy": raw_box,
                    "image_width": width,
                    "image_height": height,
                },
            )
        )

    return primitives


def evaluation_row(
    case_id: int,
    task: str,
    input_type: str,
    primitives: list[ReferencePrimitiveDTO],
    expected_relation: str,
    what_it_demonstrates: str,
) -> dict[str, str | int]:
    context_filter = ContextFilter()
    relation_builder = RelationBuilder()
    verifier = Verifier()

    active_set = context_filter.select(task, primitives)
    relations = relation_builder.build(active_set)
    result = verifier.verify(task, active_set, relations)

    observed = ", ".join(
        f"{r.relation_type}({r.subject_primitive_id}, {r.object_primitive_id})"
        for r in result.observed_relations
    ) or "none"

    return {
        "Case": case_id,
        "Task": task,
        "Input type": input_type,
        "Candidate primitives": len(primitives),
        "Active primitives": len(active_set.selected),
        "Expected relation": expected_relation,
        "Observed relation": observed,
        "Decision": result.decision,
        "Repair generated": result.repair_instruction or "None",
        "What this demonstrates": what_it_demonstrates,
    }


def markdown_table(rows: list[dict[str, str | int]]) -> str:
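    # An empty table has no first row to derive the columns from.
    if not rows:
        return ""

    headers = list(rows[0].keys())
    lines = []

    lines.append("| " + " | ".join(headers) + " |")
    lines.append("| " + " | ".join(["---"] * len(headers)) + " |")

    for row in rows:
        # Flatten newlines and escape pipes so cell text cannot break the table.
        values = [str(row[h]).replace("\n", " ").replace("|", "\\|") for h in headers]
        lines.append("| " + " | ".join(values) + " |")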

    return "\n".join(lines)


def run_evaluation_table() -> None:
    rows = []

    # Case 1: Holding book failure
    rows.append(
        evaluation_row(
            case_id=1,
            task="Is the person holding the book?",
            input_type="Image fixture",
            primitives=[
                ReferencePrimitiveDTO(
                    primitive_id="person_1",
                    primitive_type="box",
                    label="person",
                    coordinates=[100, 80, 760, 960],
                    confidence=0.96,
                ),
                ReferencePrimitiveDTO(
                    primitive_id="book_1",
                    primitive_type="box",
                    label="book",
                    coordinates=[450, 610, 650, 750],
                    confidence=0.92,
                ),
                ReferencePrimitiveDTO(
                    primitive_id="left_wrist",
                    primitive_type="keypoint",
                    label="left_wrist",
                    coordinates=[370, 790],
                    confidence=0.88,
                    parent_primitive_id="person_1",
                ),
                ReferencePrimitiveDTO(
                    primitive_id="right_wrist",
                    primitive_type="keypoint",
                    label="right_wrist",
                    coordinates=[720, 790],
                    confidence=0.84,
                    parent_primitive_id="person_1",
                ),
                ReferencePrimitiveDTO(
                    primitive_id="desk_1",
                    primitive_type="box",
                    label="desk",
                    coordinates=[50, 760, 900, 980],
                    confidence=0.90,
                ),
            ],
            expected_relation="near/touching(wrists, book)",
            what_it_demonstrates="Image critique can become grounded repair.",
        )
    )

    # Case 2: Cup transfer success
    rows.append(
        evaluation_row(
            case_id=2,
            task="Does person_1 hand the cup to person_2?",
            input_type="Video fixture",
            primitives=[
                ReferencePrimitiveDTO(
                    primitive_id="cup_f10",
                    primitive_type="box",
                    label="cup",
                    coordinates=[190, 230, 215, 260],
                    confidence=0.90,
                    frame_index=10,
                ),
                ReferencePrimitiveDTO(
                    primitive_id="person_1_right_wrist_f10",
                    primitive_type="keypoint",
                    label="right_wrist",
                    coordinates=[185, 240],
                    confidence=0.88,
                    parent_primitive_id="person_1",
                    frame_index=10,
                ),
                ReferencePrimitiveDTO(
                    primitive_id="person_2_left_wrist_f10",
                    primitive_type="keypoint",
                    label="left_wrist",
                    coordinates=[320, 245],
                    confidence=0.86,
                    parent_primitive_id="person_2",
                    frame_index=10,
                ),
                ReferencePrimitiveDTO(
                    primitive_id="cup_f40",
                    primitive_type="box",
                    label="cup",
                    coordinates=[300, 232, 326, 262],
                    confidence=0.89,
                    frame_index=40,
                ),
                ReferencePrimitiveDTO(
                    primitive_id="person_1_right_wrist_f40",
                    primitive_type="keypoint",
                    label="right_wrist",
                    coordinates=[210, 245],
                    confidence=0.83,
                    parent_primitive_id="person_1",
                    frame_index=40,
                ),
                ReferencePrimitiveDTO(
                    primitive_id="person_2_left_wrist_f40",
                    primitive_type="keypoint",
                    label="left_wrist",
                    coordinates=[312, 244],
                    confidence=0.87,
                    parent_primitive_id="person_2",
                    frame_index=40,
                ),
            ],
            expected_relation="transferred_to(cup, person_2)",
            what_it_demonstrates="Temporal primitives can support event verification.",
        )
    )

    # Case 3: Chip placement failure
    rows.append(
        evaluation_row(
            case_id=3,
            task="Verify that chip_1 has been placed correctly on the circuit board socket.",
            input_type="Placement fixture",
            primitives=[
                ReferencePrimitiveDTO(
                    primitive_id="chip_1",
                    primitive_type="box",
                    label="chip",
                    coordinates=[420, 420, 520, 520],
                    confidence=0.93,
                ),
                ReferencePrimitiveDTO(
                    primitive_id="socket_region_1",
                    primitive_type="box",
                    label="socket",
                    coordinates=[300, 300, 400, 400],
                    confidence=0.91,
                ),
                ReferencePrimitiveDTO(
                    primitive_id="board_1",
                    primitive_type="box",
                    label="circuit_board",
                    coordinates=[100, 100, 800, 800],
                    confidence=0.96,
                ),
            ],
            expected_relation="inside(chip, socket_region)",
            what_it_demonstrates="Primitive reasoning works for non-human placement tasks.",
        )
    )

    print(markdown_table(rows))

# ============================================================
# 8. CLI
# ============================================================

def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--eval-table",
        action="store_true",
        help="Print a markdown evaluation table for the fixture demos.",
    )
    parser.add_argument("--image", type=str, default=None, help="Optional image path for YOLO extraction.")
    parser.add_argument(
        "--task",
        type=str,
        default="Is the person holding the book?",
        help="Task to verify for optional YOLO demo.",
    )
    args = parser.parse_args()

    if args.image:
        print("\n=== Optional YOLO extraction demo ===")
        primitives = extract_yolo_primitives(args.image)
        run_pipeline(args.task, primitives)
        return
    if args.eval_table:
        run_evaluation_table()
        return

    demo_image_holding_book()
    demo_video_object_transfer_success()
    demo_chip_placement_failure()


if __name__ == "__main__":
    main()

Expected output

Running:

python reference_reasoning_demo.py

will produce three demonstrations.

Demo 1: generated image critique

The system checks whether both wrists are near the book box. In the fixture data, the wrists are too far away, so the verifier returns:

Decision:
  FAIL confidence=0.75

Repair:
  Move both hands so they visibly touch or hold the book.
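
The failure can be checked by hand from the fixture coordinates. The sketch below is a standalone approximation, not the demo's own `is_point_near_box`: it assumes the relative distance is the point-to-box distance divided by the box diagonal, which may differ from the normalization the script uses. `relative_distance_to_box` is a hypothetical helper name.

```python
import math


def relative_distance_to_box(point, box):
    """Distance from a point to the nearest box edge, divided by the box diagonal.

    Assumption: diagonal normalization; the demo's helper may normalize differently.
    """
    px, py = point
    x1, y1, x2, y2 = box
    # The per-axis gap is zero when the point lies within the box's extent on that axis.
    dx = max(x1 - px, 0, px - x2)
    dy = max(y1 - py, 0, py - y2)
    diagonal = math.hypot(x2 - x1, y2 - y1)
    return math.hypot(dx, dy) / diagonal


# Coordinates copied from the Demo 1 fixture.
book = [450, 610, 650, 750]
wrists = {"left_wrist": [370, 790], "right_wrist": [720, 790]}
THRESHOLD = 0.25  # same value the relation builder passes as relative_threshold

for name, point in wrists.items():
    rel = relative_distance_to_box(point, book)
    print(f"{name}: relative distance {rel:.3f}, near={rel <= THRESHOLD}")
```

Under that assumption the left wrist sits at roughly 0.37 and the right wrist at roughly 0.33 of the book's diagonal, both above the 0.25 threshold, so no near relation is built and the verifier has nothing to pass on.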

Demo 2: video object transfer

The system checks whether the cup moves from one person’s hand region to another person’s hand region across frames. In the fixture data, the transfer relation is detected:

Decision:
  PASS confidence=0.78
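
One way to read this result is as a change of "holder" between the first and last frame. The sketch below is a deliberately simpler variant than the demo's detector: it assigns the cup to whichever person's wrist is closest to the cup's center in each frame, with no nearness threshold. The `frames` layout and `closest_owner` are hypothetical names introduced for illustration.

```python
import math

# Fixture coordinates copied from Demo 2: frame -> cup box and wrist keypoints by owner.
frames = {
    10: {"cup": [190, 230, 215, 260], "wrists": {"person_1": [185, 240], "person_2": [320, 245]}},
    40: {"cup": [300, 232, 326, 262], "wrists": {"person_1": [210, 245], "person_2": [312, 244]}},
}


def closest_owner(cup_box, wrists):
    """Return the owner whose wrist is closest to the cup's center."""
    cx = (cup_box[0] + cup_box[2]) / 2
    cy = (cup_box[1] + cup_box[3]) / 2
    return min(wrists, key=lambda owner: math.dist(wrists[owner], (cx, cy)))


# Holder per frame, then compare the earliest and latest frame.
holders = {idx: closest_owner(f["cup"], f["wrists"]) for idx, f in sorted(frames.items())}
first, last = min(holders), max(holders)
transferred = holders[first] != holders[last]

print(holders)
print(f"transferred={transferred}")
```

Because this variant has no threshold, it would also report a transfer for a cup sitting on a table far from everyone. The demo's detector avoids that by requiring a near relation in both frames.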

Demo 3: chip placement

The system checks whether the chip box is inside the socket region. In the fixture data, the chip is offset from the socket, so the verifier returns:

Decision:
  FAIL confidence=0.8

Repair:
  Reposition the chip so its bounding box sits inside the socket region and its pins align with the board contacts.
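
The containment check and a numeric version of the repair can be sketched as follows. This is a minimal illustration under assumptions: `box_inside` guesses that the tolerance is a per-side allowance in coordinate units, which may not match the demo's `is_box_inside`, and `center_offset` is a hypothetical helper; the demo itself only emits a text instruction.

```python
def box_inside(inner, outer, tolerance=5):
    """True when inner lies within outer, allowing `tolerance` units of overhang per side."""
    return (
        inner[0] >= outer[0] - tolerance
        and inner[1] >= outer[1] - tolerance
        and inner[2] <= outer[2] + tolerance
        and inner[3] <= outer[3] + tolerance
    )


def center_offset(box, target):
    """Translation (dx, dy) that moves the center of box onto the center of target."""
    dx = (target[0] + target[2]) / 2 - (box[0] + box[2]) / 2
    dy = (target[1] + target[3]) / 2 - (box[1] + box[3]) / 2
    return dx, dy


# Coordinates copied from the Demo 3 fixture.
chip = [420, 420, 520, 520]
socket = [300, 300, 400, 400]

print(box_inside(chip, socket))

# Apply the offset and re-check containment.
dx, dy = center_offset(chip, socket)
moved = [chip[0] + dx, chip[1] + dy, chip[2] + dx, chip[3] + dy]
print((dx, dy), box_inside(moved, socket))
```

With the fixture values the chip fails the check, and shifting it by 120 units left and 120 units up makes it coincide with the socket box. A repair step could attach such an offset to the text instruction so that the correction is as grounded as the critique.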

Demo 4: evaluation table

Running:

python reference_reasoning_demo.py --eval-table

will produce:

| Case | Task | Input type | Candidate primitives | Active primitives | Expected relation | Observed relation | Decision | Repair generated | What this demonstrates |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Is the person holding the book? | Image fixture | 5 | 4 | near/touching(wrists, book) | none | FAIL | Move both hands so they visibly touch or hold the book. | Image critique can become grounded repair. |
| 2 | Does person_1 hand the cup to person_2? | Video fixture | 6 | 6 | transferred_to(cup, person_2) | transferred_to(cup_f40, person_2_left_wrist_f40) | PASS | None | Temporal primitives can support event verification. |
| 3 | Verify that chip_1 has been placed correctly on the circuit board socket. | Placement fixture | 3 | 3 | inside(chip, socket_region) | none | FAIL | Reposition the chip so its bounding box sits inside the socket region and its pins align with the board contacts. | Primitive reasoning works for non-human placement tasks. |

What this demonstrates

This small demo is not trying to solve all visual reasoning.

It demonstrates the architecture:

candidate primitives
context filter
active primitive set
relation builder
verifier
repair instruction

The same pattern supports:

  • image prompt verification,
  • human/object interaction detection,
  • chip or part placement verification,
  • generated video event checking,
  • movie scene repair.

The important point is that the system does not reason over the full scene. It reasons over the active primitive set selected for the current task.

That is the practical version of the post’s central claim:

Reasoning is context-filtered primitive selection followed by relation verification.


Glossary

Term Meaning in this post
Primitive A small, task-relevant unit of representation, such as an object, point, region, span, symbol, trajectory, or relation.
Visual Primitive A primitive grounded in visual space, usually a box, point, polyline, mask, or trajectory.
Reference Primitive A machine-verifiable pointer used inside a reasoning trace. Examples include image boxes, body keypoints, document spans, code symbols, UI component regions, and video trajectories.
Candidate Primitive Any primitive extracted from the input before relevance filtering. A detector may produce many candidate primitives, most of which may not matter for the task.
Active Primitive Set: The context-filtered subset of primitives selected as relevant to the current task. This is the main object the reasoning layer operates over.
Context Filter: The layer that selects which candidate primitives matter for the current question, prompt, scene action, or verification task.
Primitive Relation: A checkable relationship between primitives, such as near, touching, inside, aligned_with, holding, moving_toward, or transferred_to.
Relation Verification: The process of checking whether the expected relation between selected primitives actually exists.
Perception Gap: The problem of whether a model can see enough detail in an image, video, or multimodal input.
Reference Gap: The problem identified by Thinking with Visual Primitives: language is often too vague to precisely anchor reasoning to the correct visual entity.
Relevance Gap: The extension proposed in this post: even if a system can point to many entities, it still needs to select which references matter for the current task.
Point While Reasoning: The idea that a model should include explicit references, such as boxes and points, inside its reasoning process instead of relying only on vague language.
Task-Conditioned Compression: The idea that intelligence activates only the primitives relevant to the current context, rather than reconstructing or processing everything.
Context-Filtered Primitive Selection: The central claim of the post: reliable reasoning depends on selecting the right primitives for the task before reasoning over relations.
Primitive Reasoning Layer: The system layer that receives primitives, filters them by context, builds relations, verifies requirements, and produces decisions or repairs.
Grounded Critique: A critique that points to specific primitives and failed relations, rather than offering a vague judgment.
Verifier: A deterministic or model-assisted component that checks whether required primitive relations are satisfied.
Repair Instruction: A targeted instruction generated from a failed verification, such as “move the left hand onto the book” or “regenerate frames 32–52 with a visible envelope transfer.”
YOLO: An object-detection model family used in the post as a practical way to extract candidate visual primitives such as object boxes.
Pose Model: A model that detects body keypoints such as wrists, elbows, shoulders, nose, or head position.
Tracker: A system such as ByteTrack or BoT-SORT that preserves object identity across video frames.
Open-Vocabulary Detector: A detector that can identify objects based on text prompts or broader vocabularies, useful for domain-specific primitives.
Segmentation Model: A model that extracts precise object or region masks rather than simple boxes.
Temporal Primitive: A primitive that exists across time, such as an object track, hand trajectory, or region movement across frames.
Event Primitive: A higher-level primitive inferred from temporal relations, such as handoff(envelope, Alice, Maria) or place(chip, socket_region).
Generated Image Critique: The application of primitive reasoning to verify whether an AI-generated image satisfies a prompt.
Entity Interaction Detection: The application of primitive reasoning to determine whether entities are in the required relationship, such as handoff, placement, insertion, support, or connection.
Movie Generation Verification: The application of primitive reasoning to check whether generated video frames actually fulfill required scene actions over time.
Bounded Hallucination: The idea that primitive reasoning can reduce unsupported claims by constraining reasoning to declared primitives, while still allowing for detector errors or relation mistakes.
Inspectability: The property that a reasoning trace can be examined because it points to explicit primitives, relations, thresholds, and confidence values.
Active Reference: A reference primitive selected by the context filter as relevant to the current reasoning task.
Candidate Reference Overload: The failure mode where a system extracts many correct primitives but becomes confused because it reasons over too many irrelevant references.
Extract → Filter → Relate → Verify → Repair: The core pipeline proposed in the post.
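
To make several of the glossary terms concrete, here is a minimal Python sketch of how a context filter, an active primitive set, a primitive relation, a verifier, and a repair instruction might fit together. Everything in it is an assumption made for illustration: the Primitive class, the context_filter, touching, and verify functions, the 5-pixel margin, the 0.5 confidence threshold, and the hand-written detections that stand in for real YOLO or pose-model output. It is one possible shape for the pipeline, not the implementation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Primitive:
    """A candidate primitive (hypothetical schema for this sketch)."""
    name: str
    box: tuple  # (x1, y1, x2, y2) in pixels
    confidence: float

def context_filter(candidates, relevant_names, min_conf=0.5):
    """Filter: keep only primitives the task names and the detector trusts."""
    return {p.name: p for p in candidates
            if p.name in relevant_names and p.confidence >= min_conf}

def touching(a, b, margin=5.0):
    """Relate: True if the boxes overlap or lie within `margin` pixels."""
    ax1, ay1, ax2, ay2 = a.box
    bx1, by1, bx2, by2 = b.box
    return (ax1 - margin <= bx2 and bx1 - margin <= ax2 and
            ay1 - margin <= by2 and by1 - margin <= ay2)

RELATIONS = {"touching": touching}

def verify(active, requirements):
    """Verify each (relation, subject, object) triple; return repair instructions."""
    repairs = []
    for rel, subj, obj in requirements:
        missing = [n for n in (subj, obj) if n not in active]
        if missing:
            repairs.append(f"cannot verify {rel}({subj}, {obj}): "
                           f"no confident detection of {missing[0]}")
        elif not RELATIONS[rel](active[subj], active[obj]):
            repairs.append(f"{rel}({subj}, {obj}) failed: move {subj} onto {obj}")
    return repairs

# Extract: hand-written stand-ins for detector output (assumed values).
candidates = [
    Primitive("left_hand", (100, 100, 140, 140), 0.90),
    Primitive("right_hand", (310, 210, 350, 250), 0.85),
    Primitive("book", (300, 200, 400, 260), 0.95),
    Primitive("lamp", (500, 20, 560, 120), 0.80),  # correct but irrelevant
]

# The prompt "both hands on the book" implies these required relations.
requirements = [("touching", "left_hand", "book"),
                ("touching", "right_hand", "book")]

active = context_filter(candidates, {"left_hand", "right_hand", "book"})
repairs = verify(active, requirements)
for line in repairs:
    print(line)
```

The lamp is detected correctly but never enters the active primitive set, which is the Relevance Gap in miniature. The single failed relation becomes a targeted repair instruction instead of a vague judgment.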

References and Further Reading

| Area | Reference | Why it matters for this post |
|---|---|---|
| Core paper | Lu et al., “Thinking with Visual Primitives” | The central inspiration for this post. It introduces the idea of using points and bounding boxes as “minimal units of thought,” allowing multimodal models to point while reasoning rather than relying only on vague natural-language references. The original DeepSeek repo appears to have been removed or mirrored, so cite carefully and note the source status if needed. (GitHub) |
| YOLO / object detection | Ultralytics YOLO documentation and licensing | YOLO is used in the post as a practical primitive extractor for object boxes. The licensing matters because the common Ultralytics stack uses AGPL-3.0 by default with an Enterprise option for proprietary use. (Ultralytics) |
| Object tracking | Zhang et al., “ByteTrack: Multi-Object Tracking by Associating Every Detection Box” | ByteTrack is useful for preserving object identity across frames, which is essential for temporal primitives such as object trajectories, handoffs, and movie-action verification. (arXiv) |
| Object tracking | Aharon et al., “BoT-SORT: Robust Associations Multi-Pedestrian Tracking” | BoT-SORT is another practical tracking option, combining motion and appearance information for stronger multi-object tracking. (arXiv) |
| Open-vocabulary detection | Liu et al., “Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection” | Grounding DINO is relevant when YOLO’s fixed classes are not enough, especially for domain-specific primitives such as “bed rail,” “chip socket,” “pin row,” or “neural interface.” (arXiv) |
| Segmentation / masks | Kirillov et al., “Segment Anything” | SAM is useful when boxes are too coarse and the reasoning system needs precise masks or regions instead of rectangular boxes. (arXiv) |
| Hand / pose keypoints | Zhang et al., “MediaPipe Hands: On-device Real-time Hand Tracking” | Hand landmarks are important for verifying relations such as holding, touching, handoff, placement, grasping, and object contact. (arXiv) |
| Community context | Discussion of “Thinking with Visual Primitives” on LocalLLaMA | Useful for understanding how the community interpreted the paper: especially the distinction between outputting boxes as final answers and interleaving visual primitives inside the reasoning trace. Use as informal context, not as the primary citation. (Reddit) |

Suggested “Further Reading” Notes

| Topic | Suggested direction |
|---|---|
| Reference Gap | Read Thinking with Visual Primitives first. The key idea is that language is often too imprecise to anchor visual reasoning, so the model needs coordinate-level handles. |
| Relevance Gap | This post’s proposed extension: once a system can point, it still needs to select which references matter for the current task. |
| Primitive extraction | YOLO, Grounding DINO, MediaPipe, and SAM can all serve as primitive extractors, depending on whether you need boxes, open-vocabulary objects, hand landmarks, or masks. |
| Temporal primitives | ByteTrack and BoT-SORT are useful starting points for preserving object identity across frames. |
| Verification loops | The most practical implementation path is not to train a new model first, but to build a pipeline: extract → filter → relate → verify → repair. |
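
For the temporal side of that pipeline, the sketch below shows one way an event primitive such as handoff(envelope, Alice, Maria) might be inferred from per-frame tracks. It is a toy under stated assumptions: the frames are hand-written dictionaries standing in for tracker output (for example from ByteTrack or BoT-SORT), and nearest_holder, detect_handoff, envelope_at, and the 40-pixel holding distance are hypothetical names and values chosen for illustration only.

```python
import math

def center(box):
    """Center point of an (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)

def nearest_holder(obj_box, hand_boxes, max_dist=40.0):
    """Name of the closest hand within max_dist pixels of the object, else None."""
    ox, oy = center(obj_box)
    best, best_d = None, max_dist
    for name, box in hand_boxes.items():
        hx, hy = center(box)
        d = math.hypot(ox - hx, oy - hy)
        if d <= best_d:
            best, best_d = name, d
    return best

def detect_handoff(frames, obj, giver, receiver):
    """Return the first frame index where the holder switches from giver
    to receiver, or None if that transfer never happens."""
    last_holder = None
    for i, frame in enumerate(frames):
        if obj not in frame:
            continue  # object not tracked in this frame
        hands = {n: frame[n] for n in (giver, receiver) if n in frame}
        holder = nearest_holder(frame[obj], hands)
        if holder == receiver and last_holder == giver:
            return i
        if holder is not None:
            last_holder = holder
    return None

# Assumed tracker output: fixed hand boxes and a moving envelope box.
ALICE = (100, 100, 120, 120)
MARIA = (300, 100, 320, 120)

def envelope_at(cx):
    return (cx - 10, 100, cx + 10, 120)

def clip(envelope_centers):
    return [{"alice_hand": ALICE, "maria_hand": MARIA, "envelope": envelope_at(cx)}
            for cx in envelope_centers]

good_clip = clip([110, 150, 190, 230, 270, 310])  # envelope travels to Maria
bad_clip = clip([110, 110, 110, 110, 110, 110])   # envelope never leaves Alice

good_result = detect_handoff(good_clip, "envelope", "alice_hand", "maria_hand")
bad_result = detect_handoff(bad_clip, "envelope", "alice_hand", "maria_hand")

if bad_result is None:
    repair = (f"regenerate frames 0-{len(bad_clip) - 1} "
              "with a visible envelope transfer")
    print(repair)
```

A real verifier would need to handle occlusion, identity switches, and noisy boxes. The point of the sketch is only that the failure is stated in terms of tracked primitives and a frame range, so it can be turned directly into a repair instruction.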

BibTeX-style draft references

@misc{lu2026thinkingvisualprimitives,
  title   = {Thinking with Visual Primitives},
  author  = {Lu, Ruijie and Ma, Yiyang and Chen, Xiaokang and others},
  year    = {2026},
  note    = {Technical report; source availability may vary due to repository removal/mirroring}
}

@article{zhang2021bytetrack,
  title   = {ByteTrack: Multi-Object Tracking by Associating Every Detection Box},
  author  = {Zhang, Yifu and Sun, Peize and Jiang, Yi and others},
  year    = {2021},
  journal = {arXiv preprint arXiv:2110.06864}
}

@article{aharon2022botsort,
  title   = {BoT-SORT: Robust Associations Multi-Pedestrian Tracking},
  author  = {Aharon, Nir and Orfaig, Roy and Bobrovsky, Ben-Zion},
  year    = {2022},
  journal = {arXiv preprint arXiv:2206.14651}
}

@article{liu2023groundingdino,
  title   = {Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection},
  author  = {Liu, Shilong and Zeng, Zhaoyang and Ren, Tianhe and others},
  year    = {2023},
  journal = {arXiv preprint arXiv:2303.05499}
}

@article{kirillov2023segmentanything,
  title   = {Segment Anything},
  author  = {Kirillov, Alexander and Mintun, Eric and Ravi, Nikhila and others},
  year    = {2023},
  journal = {arXiv preprint arXiv:2304.02643}
}

@article{zhang2020mediapipehands,
  title   = {MediaPipe Hands: On-device Real-time Hand Tracking},
  author  = {Zhang, Fan and Bazarevsky, Valentin and Vakunov, Andrey and others},
  year    = {2020},
  journal = {arXiv preprint arXiv:2006.10214}
}