Thinking in Primitives: Why AI Reasoning Should Learn to Point
From visual primitives to context-filtered reasoning, grounded verification, and AI movie repair
TL;DR
This post argues that AI reasoning should not operate over everything it can see, read, or detect. It should operate over the right primitives for the current task.
The paper Thinking with Visual Primitives shows that multimodal models reason better when they can point to visual entities using boxes and points. That solves a Reference Gap: language is often too vague to anchor reasoning to the right part of an image.
This post extends that idea with a second gap:
The Relevance Gap: even if a system can point, it still needs to know which things are worth pointing at.
The proposed architecture is:
raw content
↓
candidate primitives
↓
context filter
↓
active primitive set
↓
relations
↓
verification
↓
repair
In practice, tools like YOLO, pose models, trackers, and segmentation systems can extract candidate primitives: boxes, points, keypoints, masks, and trajectories. A reasoning layer then filters those primitives by context, checks relations such as touching, holding, aligned_with, or transferred_to, and produces grounded decisions or repair instructions.
The key claim:
Reasoning is not full reconstruction. Reasoning is context-filtered primitive selection followed by relation verification.
This matters for AI-generated image critique, entity interaction detection, object placement verification, and movie generation. Instead of saying “the handoff is unclear,” a primitive verifier can say:
The envelope never moved from Alice's hand region into Maria's hand region.
Regenerate frames 32–52 with a visible transfer.
That is the difference between vague critique and grounded, repairable reasoning.
Abstract
AI systems usually answer in language, but reasoning does not have to begin there. Human thought often appears to work through sparse, task-relevant primitives: objects, relations, locations, contact points, trajectories, constraints, and affordances. When we think of an airplane, we do not load a full-resolution image or recite a paragraph. The active representation changes with context: airport, accident, war, engineering, boarding, flight, threat.
The paper Thinking with Visual Primitives formalizes a version of this idea for multimodal models. It argues that visual reasoning suffers not only from a Perception Gap, where models fail to see enough detail, but from a deeper Reference Gap, where language fails to precisely point to the visual entities being reasoned about. Its solution is to treat points and bounding boxes as “minimal units of thought,” allowing a model to point while it reasons.
This post adds the Relevance Gap: even when a system can produce valid visual references, it still has to decide which of them are worth pointing at for the current task. A detector may produce hundreds of candidate primitives, and more detections are not automatically more understanding. Without context filtering, the system no longer drowns in pixels or tokens; it drowns in candidate handles.
The architecture proposed here is simple:
extract candidate primitives
↓
filter by task context
↓
build relations
↓
verify the required relation
↓
repair what failed
YOLO, pose models, object trackers, segmentation models, and open-vocabulary detectors can extract primitives from images and video. A reasoning layer selects the active primitive set, checks relations such as touching, holding, aligned, inside, facing, moving_toward, or transferred_to, and produces grounded decisions or targeted repair instructions.
The core contribution is this:
Reasoning is not full reconstruction. Reasoning is context-filtered primitive selection followed by relation verification.
%%{init: {'theme':'dark', 'themeVariables': { 'fontSize':'14px', 'primaryBorderColor':'#ffffff', 'lineColor':'#cccccc'}}}%%
flowchart TD
A["🧠 Human Thinking<br/>Sparse, task-relevant primitives"]
A1["✈️ Same object, different context<br/>airport • accident • war • engineering"]
A2["🎯 Active primitive set<br/>only what matters for the task"]
B["🤖 Current AI Failure Mode<br/>more pixels • more tokens • more detections"]
B1["⚠️ Reference Gap<br/>language cannot reliably point"]
B2["⚠️ Relevance Gap<br/>the system points to too much"]
C["📄 Thinking with Visual Primitives<br/>points and boxes inside reasoning"]
C1["📦 Boxes<br/>object and region handles"]
C2["📍 Points<br/>locations, joints, paths, contact"]
D["🚀 This Post<br/>context-filtered primitive reasoning"]
D1["👁️ Extract<br/>YOLO • pose • tracking • segmentation"]
D2["🎯 Filter<br/>select the active primitive set"]
D3["🔗 Relate<br/>touching • holding • aligned • transferred_to"]
D4["✅ Verify + Repair<br/>grounded decision or targeted fix"]
E["🎬 Applications"]
E1["🖼️ Image critique"]
E2["🔩 Entity interaction and placement"]
E3["🎥 Movie generation verification"]
A --> A1 --> A2
B --> B1
B --> B2
C --> C1
C --> C2
A2 -. "mimic sparse cognition" .-> D
C -. "point while reasoning" .-> D
B2 -. "solve with context filtering" .-> D
D --> D1 --> D2 --> D3 --> D4 --> E
E --> E1
E --> E2
E --> E3
classDef human fill:#2b66a8,stroke:#ffffff,color:#ffffff,stroke-width:3px,font-weight:bold;
classDef ai fill:#c04040,stroke:#ffffff,color:#ffffff,stroke-width:3px,font-weight:bold;
classDef paper fill:#8855cc,stroke:#ffffff,color:#ffffff,stroke-width:3px,font-weight:bold;
classDef contribution fill:#3eaa5c,stroke:#ffffff,color:#ffffff,stroke-width:3px,font-weight:bold;
classDef app fill:#c4a040,stroke:#ffffff,color:#ffffff,stroke-width:3px,font-weight:bold;
classDef sub fill:#1e1e1e,stroke:#999999,color:#dddddd,stroke-width:1.5px,font-size:13px;
class A human;
class B ai;
class C paper;
class D contribution;
class E app;
class A1,A2,B1,B2,C1,C2,D1,D2,D3,D4,E1,E2,E3 sub;
1. Before Words, Before Images
When someone says, “Think of an airplane,” what happens?
The answer depends on the context.
If you are going to the airport, the airplane may activate a passenger-oriented representation: boarding gate, luggage, seats, aisle, overhead bins, takeoff.
If you hear that an airplane has been involved in an accident, a different representation appears: wreckage, smoke, impact, emergency response, failure, trajectory.
If the context is war, the airplane changes again: fighter jet, radar, missile, target, speed, threat.
If the context is engineering, the image becomes wings, fuselage, engines, lift, drag, airflow, control surfaces.
The object is the same, but the active representation is different.
That matters.
It suggests that we do not think by loading a full-resolution image of an airplane. We also do not usually think by reciting a paragraph about airplanes. We activate a sparse, task-conditioned structure: the parts, relations, constraints, and expected changes that matter for the current question.
In one context, the relevant primitive is wing.
In another, it is boarding gate.
In another, it is impact trajectory.
In another, it is radar signature.
Thought is not a complete picture. It is a context-sensitive selection of primitives.
Human intelligence is not only the ability to perceive the world. It is the ability to ignore most of it. We do not reason over every available detail. We select the subset that matters. More context is not automatically more intelligence. Often, intelligence is knowing which context can be safely discarded.
This is the mistake many AI systems still make. Because AI systems answer in language, we assume their reasoning should happen in language. Because images are made of pixels, we assume visual reasoning should happen over pixels. Because modern models can accept larger contexts, we assume giving them more context will make them smarter.
But humans answer in language too. Humans see in pixels too. Humans live inside enormous context windows. Yet the understanding underneath is something else: primitive, relational, spatial, compressed, and task-conditioned.
A useful intelligence does not need to activate the entire airplane to answer a question about boarding. It does not need every visual detail of a scene to decide whether one person is handing an object to another. It does not need every object in an image to know whether the generated picture failed the prompt. It needs the right primitives, selected for the current task.
That is the deeper reason Thinking with Visual Primitives is interesting. The paper points toward a form of AI reasoning that is not merely language-first. It suggests that models need stable visual handles, such as boxes and points, inside the reasoning process itself, so the model can point while it reasons rather than relying on vague phrases like “that object” or “the thing on the left.”
This post takes that idea one step further.
It is not enough to extract primitives. A useful reasoning system must extract the right primitives for the current context, reason over their relations, and ignore the rest.
That is the core claim:
Reasoning is not full reconstruction. Reasoning is context-filtered primitive selection followed by relation verification.
2. The Paper’s Core Idea: The Reference Gap
The paper begins by distinguishing two problems in multimodal AI.
The first is the Perception Gap:
Can the model see enough detail?
This is the problem most visual-language systems try to solve with higher resolution, image crops, dynamic patching, and more visual tokens.
But the paper argues that there is a deeper problem: the Reference Gap.
A model may see the object, but still fail to keep track of what it is reasoning about. Natural language is often too vague to serve as a precise pointer into continuous visual space. Phrases like “the object on the left,” “the bear on the ground,” “that path,” or “the person near the table” can drift during reasoning.
The paper’s solution is simple and powerful:
Let the model point while it reasons.
Instead of reasoning only in text, the model inserts visual primitives directly into its chain of thought:
<ref>bear</ref><box>[[50,447,647,771]]</box>
or:
<point>[[309,512],[357,369],[408,510]]</point>
The box or point is not just a final output. It becomes part of the reasoning process.
That is the key contribution: points and bounding boxes become minimal units of thought.
The paper demonstrates this on counting, spatial reasoning, maze navigation, and path tracing. In counting examples, the model grounds each candidate object with boxes, filters invalid candidates, and tallies the grounded set. In path and maze examples, it uses point sequences to represent exploration, backtracking, and route tracing.
This is a major step. It moves multimodal reasoning from vague language toward grounded reference.
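To make the notation concrete, here is a minimal sketch of how a trace containing these tags could be parsed back into structured references. The tag grammar is inferred only from the two examples above, and the `parse_primitives` helper and its dictionary layout are assumptions for illustration, not the paper's tooling.

```python
import json
import re

# Tag layout assumed from the two examples above; not an official grammar.
BOX_RE = re.compile(r"<ref>(?P<label>[^<]*)</ref><box>(?P<coords>\[\[.*?\]\])</box>")
POINT_RE = re.compile(r"<point>(?P<coords>\[\[.*?\]\])</point>")


def parse_primitives(trace: str) -> list:
    """Return box and point primitives in the order they appear in the trace."""
    found = []
    for m in BOX_RE.finditer(trace):
        # One <box> tag may carry several boxes for the same label.
        for box in json.loads(m.group("coords")):
            item = {"type": "box", "label": m.group("label"), "coords": box}
            found.append((m.start(), item))
    for m in POINT_RE.finditer(trace):
        # A <point> tag carries one or more [x, y] pairs (a point or a path).
        item = {"type": "point", "label": None, "coords": json.loads(m.group("coords"))}
        found.append((m.start(), item))
    # Sort by position only, so dictionaries are never compared.
    return [item for _, item in sorted(found, key=lambda pair: pair[0])]


trace = (
    "The <ref>bear</ref><box>[[50,447,647,771]]</box> is on the ground. "
    "Path: <point>[[309,512],[357,369],[408,510]]</point>"
)
primitives = parse_primitives(trace)
for primitive in primitives:
    print(primitive)
```

Once the references are structured like this, a downstream checker can test each one against the image instead of trusting the surrounding sentence.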
3. Our Extension: The Relevance Gap
The Thinking with Visual Primitives paper identifies the Reference Gap:
The model may know what it wants to say, but language alone is often too vague to point to the right part of the image.
Visual primitives help close that gap. A model can say:
<ref>cup</ref><box>[[190,230,215,260]]</box>
instead of relying on a phrase like:
the small object near the person
That is a major step. It gives the model a stable reference.
But once a system can point, a second problem appears:
Which references are worth pointing at?
A detector, pose model, tracker, or segmentation system can produce dozens or hundreds of candidate primitives from a single image or video clip:
person_1
person_2
chair_1
table_1
window_1
cup_1
phone_1
left_wrist
right_wrist
nose
shoulder
background_region
trajectory_1
trajectory_2
Many of these primitives may be correct. But correctness is not the same as relevance.
If the task is:
Is person_1 handing cup_1 to person_2?
then the active primitive set should be much smaller:
person_1
person_2
person_1 hand points
person_2 hand points
cup_1
cup_1 trajectory
The chair, window, phone, wall, and background regions may all be real, but they are not relevant unless they affect the handoff.
This is the Relevance Gap:
The Relevance Gap is the failure mode where a system has access to many correct candidate references, but cannot select the subset relevant to the current task.
In other words:
Perception Gap:
Can the system see enough?
Reference Gap:
Can the system point to what it means?
Relevance Gap:
Can the system choose which references matter?
The Relevance Gap is easy to miss because it appears after progress has already been made. The system sees the scene. It detects objects. It produces boxes, points, masks, keypoints, or tracks. It may even ground its reasoning in explicit references.
But if it reasons over the wrong references, or too many references, the reasoning can still fail.
More perception can become more confusion. More primitives can become a new kind of overload. The model no longer drowns in pixels or tokens. It drowns in candidate handles.
That is why primitive reasoning needs a context filter.
flowchart LR
subgraph A["Without Relevance Filtering"]
A1["🔎 Many candidate primitives<br/>people • objects • hands • background"]
A2["🧠 Reason over everything"]
A3["⚠️ Noisy or confused output"]
A1 --> A2 --> A3
end
subgraph B["With Context-Filtered Primitive Selection"]
B1["🔎 Many candidate primitives"]
B2["🎯 Context filter<br/>task: verify handoff"]
B3["🧩 Active primitive set<br/>giver • receiver • hands • object • trajectory"]
B4["✅ Grounded decision<br/>handoff passed or failed"]
B1 --> B2 --> B3 --> B4
end
classDef bad fill:#7a2d2d,stroke:#4a1111,color:#fff,stroke-width:2px;
classDef good fill:#2d7a4a,stroke:#164529,color:#fff,stroke-width:2px;
classDef neutral fill:#35495e,stroke:#1f2d3a,color:#fff,stroke-width:2px;
class A1,B1 neutral;
class A2,A3 bad;
class B2,B3,B4 good;
This is the key extension proposed in this post.
The goal is not to reason over every primitive the system can extract. The goal is to reason over the smallest useful primitive set for the current task.
That changes the architecture from:
detect everything
↓
reason over everything
to:
extract candidate primitives
↓
filter by context
↓
reason over active primitives
This is also where the analogy with human thought becomes useful. Humans do not usually reason by activating everything they know about an object or scene. We select what matters. The airplane at the airport activates gates, luggage, seats, and boarding. The airplane in an accident activates smoke, wreckage, impact, and trajectory. The object is the same, but the active primitive set changes with the task.
A useful AI system needs the same discipline.
The Reference Gap asks:
Can the system point to what it means?
The Relevance Gap asks:
Can the system choose what is worth pointing at?
That is the difference between grounded reference and grounded reasoning.
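A context filter can start out very simple. The sketch below assumes a hand-written task template that names the primitive IDs a handoff check needs; `Primitive`, `handoff_roles`, and `select_active` are hypothetical names, and a real system might replace the template with a learned scorer or a language model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Primitive:
    primitive_id: str  # e.g. "cup_1" or "person_1.right_wrist"
    kind: str          # "box", "point", "trajectory", ...


def handoff_roles(giver: str, receiver: str, obj: str) -> set:
    """Hypothetical task template: the primitive IDs a handoff check needs."""
    return {
        giver, receiver, obj,
        f"{giver}.left_wrist", f"{giver}.right_wrist",
        f"{receiver}.left_wrist", f"{receiver}.right_wrist",
        f"{obj}.trajectory",
    }


def select_active(candidates: list, required_ids: set) -> list:
    """Keep only the candidates the task template asks for, preserving order."""
    return [p for p in candidates if p.primitive_id in required_ids]


# Candidate set loosely following the list above; all of it may be correct.
candidates = [
    Primitive("person_1", "box"),
    Primitive("person_2", "box"),
    Primitive("chair_1", "box"),
    Primitive("table_1", "box"),
    Primitive("window_1", "box"),
    Primitive("cup_1", "box"),
    Primitive("phone_1", "box"),
    Primitive("person_1.right_wrist", "point"),
    Primitive("person_2.left_wrist", "point"),
    Primitive("background_region", "region"),
    Primitive("cup_1.trajectory", "trajectory"),
]
active = select_active(candidates, handoff_roles("person_1", "person_2", "cup_1"))
active_ids = [p.primitive_id for p in active]
print(active_ids)
```

Eleven candidates go in and six come out. The chair, table, window, phone, and background are dropped because the task does not ask for them, not because they were detected incorrectly.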
4. From Visual Primitives to Reference Primitives
The paper begins with two visual primitives:
| Primitive | Meaning | Best for |
|---|---|---|
| Bounding box | A region around an object or part | objects, people, body parts, scene elements |
| Point / polyline | A coordinate or coordinate sequence | joints, contact points, gaze, paths, trajectories |
These are enough to make visual reasoning more precise. A model no longer has to say “the object on the left” or “that person near the table.” It can bind the thought to a coordinate:
<ref>person</ref><box>[[120,80,340,900]]</box>
or:
<point>[[421,612]]</point>
However, the deeper idea is not limited to images.
A box in an image is just one kind of pointer. A sentence span in a document is also a pointer. A function symbol in code is a pointer. A UI component box is a pointer. A video trajectory is a pointer across time.
So we can generalize the paper’s visual primitive into a broader abstraction:
A Reference Primitive is any machine-verifiable pointer used inside a reasoning trace.
It is the thing the system can point to when it makes a claim.
| Domain | Reference primitive | Example |
|---|---|---|
| Image | box / keypoint | [[421,612]] = left wrist |
| Video | trajectory | person_1.hand.frames[42:58] |
| Document | span | ch03.s142 |
| Code | symbol | engine.py::repair_chapter |
| UI | component box | [[80,120,400,200]] |
| VPM | pixel region | {x:12,y:9,w:4,h:2} |
The principle is identical across domains:
Reasoning should point to evidence.
A reasoning trace that contains only natural language is hard to inspect. It may sound convincing, but the system cannot easily verify what the claim refers to.
A reasoning trace with reference primitives is different. It can say:
This claim depends on this box.
This critique depends on this keypoint.
This edit depends on this sentence.
This bug depends on this function.
This UI failure depends on this component region.
That makes the reasoning trace inspectable.
But this also creates the next problem. If we extract every possible primitive, we have not solved reasoning. We have only changed the form of overload. Instead of drowning in pixels, tokens, or context, the model can drown in candidate references.
So the Reference Primitive is only the first step.
A useful AI system must not merely point. It must know which pointers matter.
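One way to make "reasoning should point to evidence" checkable is to require every claim to carry pointer IDs that resolve to known primitives. This is a minimal sketch under that assumption; `GroundedClaim`, `unresolved_evidence`, and `is_inspectable` are illustrative names, and the IDs reuse the examples from the table above.

```python
from dataclasses import dataclass, field


@dataclass
class GroundedClaim:
    text: str
    evidence_ids: list = field(default_factory=list)


def unresolved_evidence(claim: GroundedClaim, known_ids: set) -> list:
    """Return cited evidence IDs that do not resolve to any known primitive."""
    return [eid for eid in claim.evidence_ids if eid not in known_ids]


def is_inspectable(claim: GroundedClaim, known_ids: set) -> bool:
    """A claim is inspectable only if it cites evidence and every citation resolves."""
    return bool(claim.evidence_ids) and not unresolved_evidence(claim, known_ids)


known_ids = {"person_1.left_wrist", "ch03.s142", "engine.py::repair_chapter"}

good = GroundedClaim("This edit depends on this sentence.", ["ch03.s142"])
vague = GroundedClaim("The anatomy could be improved.")
stale = GroundedClaim("The hand touches the book.", ["person_1.left_wrist", "book_1"])

for claim in (good, vague, stale):
    print(claim.text, is_inspectable(claim, known_ids), unresolved_evidence(claim, known_ids))
```

The vague claim fails because it points at nothing, and the stale claim fails because one of its pointers no longer resolves. Neither failure requires understanding the sentence itself.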
5. YOLO Is Not the Reasoner
YOLO does not implement the paper by itself.
YOLO detects objects. It gives us candidate visual primitives:
person_1 box
person_2 box
cup_1 box
chair_1 box
A pose model gives us body primitives:
left_wrist point
right_wrist point
nose point
shoulder points
A tracker gives us temporal primitives:
person_1 across frames
cup_1 across frames
hand trajectory
These are useful, but none of these systems reason by themselves.
YOLO can tell us that two people and a cup are present. It cannot tell us whether the people are having a conversation, whether one person is handing the cup to the other, or whether the cup is irrelevant background clutter.
That requires another layer.
The reasoning layer asks questions like:
Is person_1 facing person_2?
Is person_1's hand near cup_1?
Is cup_1 between person_1 and person_2?
Does cup_1 move from person_1's hand region to person_2's hand region?
But even that is not enough.
Before reasoning begins, the system must decide which primitives matter for the current task. If the task is:
Is person_1 handing the cup to person_2?
then the active primitive set should include:
person_1
person_2
person_1 hand points
person_2 hand points
cup_1
cup_1 trajectory
distance between hands
contact events
It probably does not need:
chair_1
window_1
wall_1
floor_1
background objects
Those detections may be correct, but they are not relevant unless they affect the handoff.
So the practical stack becomes:
Image / frame
↓
YOLO / pose / segmentation / tracking
↓
Candidate reference primitives
↓
Context filter
↓
Active primitive set
↓
Primitive relations
↓
Verifier
↓
Issue / score / repair
The paper gives the theory: reason with primitives by allowing the model to point while it thinks. YOLO gives a practical way to extract some of those primitives locally. The contribution here is the full reasoning loop:
Extract candidate primitives, filter them by context, reason over their relations, and verify the result.
That is what turns object detection into primitive-based understanding.
6. The Primitive Reasoning Architecture
The architecture has one central job:
Convert raw visual content into a small, task-relevant set of primitives, reason over the relationships between those primitives, and produce a grounded decision.
The important part is the middle of the pipeline. We are not asking the model to reason over every pixel, every object, or every possible visual detail. We first extract candidate primitives, then filter them through the current task.
flowchart TD
A["🖼️ Image / Video / Generated Scene"]
B["👁️ Primitive Extraction<br/>YOLO • pose • tracking • segmentation"]
C["🪝 Candidate Reference Primitives<br/>people • objects • hands • regions • trajectories"]
D["🎯 Context Filter<br/>select primitives relevant to the task"]
E["🧠 Active Primitive Set<br/>small enough to reason over"]
F["🔗 Relation Layer<br/>near • touching • facing • holding • moving • transferred_to"]
G["✅ Verifier / Reasoner<br/>check whether required relation exists"]
H["🛠️ Output<br/>decision • score • issue • repair instruction"]
X["⚠️ Primitive overload<br/>too many references become confusion"]
A --> B --> C --> D --> E --> F --> G --> H
C -. "without filtering" .-> X
classDef input fill:#35495e,stroke:#1f2d3a,color:#fff,stroke-width:2px;
classDef extract fill:#1f4e79,stroke:#0d2b45,color:#fff,stroke-width:2px;
classDef primitive fill:#5b3f8c,stroke:#2d1d4d,color:#fff,stroke-width:2px;
classDef filter fill:#7a5a2d,stroke:#4a3214,color:#fff,stroke-width:2px;
classDef active fill:#2d7a4a,stroke:#164529,color:#fff,stroke-width:2px;
classDef relation fill:#2d7a73,stroke:#144542,color:#fff,stroke-width:2px;
classDef output fill:#7a4a2d,stroke:#4a2a14,color:#fff,stroke-width:2px;
classDef danger fill:#7a2d2d,stroke:#4a1111,color:#fff,stroke-width:2px;
class A input;
class B extract;
class C primitive;
class D filter;
class E active;
class F relation;
class G relation;
class H output;
class X danger;
The implementation can start with two small objects.
from typing import Any, Literal

from pydantic import BaseModel, Field


class ReferencePrimitiveDTO(BaseModel):
    primitive_id: str
    primitive_type: Literal[
        "box", "point", "keypoint", "polyline",
        "mask", "span", "symbol", "region", "trajectory"
    ]
    label: str | None = None
    coordinates: Any
    confidence: float | None = None
    source: str
    frame_index: int | None = None
    parent_primitive_id: str | None = None
    meta: dict = Field(default_factory=dict)
This is the thing the system can point to.
class PrimitiveRelationDTO(BaseModel):
    relation_id: str
    subject_primitive_id: str
    relation_type: Literal[
        "near", "touching", "facing", "holding",
        "inside", "left_of", "right_of",
        "moving_toward", "moving_away", "transferred_to"
    ]
    object_primitive_id: str
    confidence: float
    evidence_primitive_ids: list[str] = Field(default_factory=list)
    frame_range: tuple[int, int] | None = None
    meta: dict = Field(default_factory=dict)
This is the relationship the system can reason over.
Together, they define the primitive reasoning layer:
ReferencePrimitiveDTO = what can be pointed at
PrimitiveRelationDTO = what can be checked between pointers
A candidate set such as:
person_1 box
person_2 box
cup_1 box
person_1 right_wrist point
person_2 left_wrist point
becomes:
near(person_1, person_2)
facing(person_1, person_2)
near(person_1.right_wrist, cup_1)
between(cup_1, person_1, person_2)
moving_toward(cup_1, person_2.left_wrist)
That is where the system moves from detection to understanding.
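As a rough illustration of the step from primitives to relations, the sketch below builds `near` relations from a small candidate set. It uses plain dataclasses as trimmed-down stand-ins for the two DTOs so that it runs without extra dependencies. The coordinates, the center-to-center distance, the `max_dist` threshold, and the linear confidence are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from itertools import combinations
from math import hypot


@dataclass(frozen=True)
class Prim:
    """Trimmed-down stand-in for ReferencePrimitiveDTO (hypothetical)."""
    primitive_id: str
    primitive_type: str  # "box" or "point"
    coordinates: tuple   # box: (x1, y1, x2, y2); point: (x, y)


@dataclass(frozen=True)
class Relation:
    """Trimmed-down stand-in for PrimitiveRelationDTO (hypothetical)."""
    subject_primitive_id: str
    relation_type: str
    object_primitive_id: str
    confidence: float


def center(p: Prim) -> tuple:
    """Box center, or the point itself."""
    if p.primitive_type == "box":
        x1, y1, x2, y2 = p.coordinates
        return ((x1 + x2) / 2, (y1 + y2) / 2)
    return tuple(p.coordinates)


def build_near_relations(prims: list, max_dist: float = 60.0) -> list:
    """Emit near(a, b) for every pair whose centers are within max_dist."""
    relations = []
    for a, b in combinations(prims, 2):
        (ax, ay), (bx, by) = center(a), center(b)
        dist = hypot(ax - bx, ay - by)
        if dist <= max_dist:
            # Confidence falls linearly from 1.0 (same spot) to 0.0 (at max_dist).
            confidence = round(1.0 - dist / max_dist, 3)
            relations.append(Relation(a.primitive_id, "near", b.primitive_id, confidence))
    return relations


# Illustrative coordinates, not taken from a real detector.
prims = [
    Prim("person_1", "box", (100, 200, 300, 900)),
    Prim("person_2", "box", (600, 200, 800, 900)),
    Prim("person_1.right_wrist", "point", (310, 520)),
    Prim("cup_1", "box", (330, 500, 370, 560)),
]
relations = build_near_relations(prims)
for relation in relations:
    print(relation)
```

With these numbers only the wrist and the cup are close enough, so the builder emits a single `near(person_1.right_wrist, cup_1)` relation. Center distance is a crude proxy for large boxes; relations such as `facing` or `moving_toward` would need orientation and multi-frame data.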
7. Application One: AI-Generated Image Critique
The first practical demonstration is generated image critique.
A prompt says:
A woman sitting at a desk, holding a red book in both hands, looking down at the book.
The generator produces an image.
At a glance, it may look fine. But maybe the book is visible but not red, the hands are near the book but not touching it, the head faces forward instead of downward, the desk is ambiguous, or one wrist bends unnaturally.
A normal AI critique might say:
“The image mostly follows the prompt, but the anatomy could be improved.”
That is not enough. The critique is too vague to verify, too vague to measure, and too vague to reliably repair.
A reference-grounded critique is different.
It turns the prompt into requirements:
Requirement:
Both hands should hold the red book.
It turns the image into candidate primitives:
person_1 box
desk_1 box
book_1 box
left_wrist point
right_wrist point
nose point
Then the context filter selects the active primitive set for this requirement:
left_wrist = [[370,790]]
right_wrist = [[720,790]]
book_box = [[450,610,650,750]]
Now the verifier can reason over the relationship:
Decision:
FAIL. Neither wrist is close enough to the book box.
Repair:
Move both hands so they visibly touch or hold the red book.
The system is not merely saying that the image is wrong. It is identifying which requirement failed, which primitives matter, which relation is missing, and what repair should be attempted next.
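The wrist and book coordinates above are enough to run the check by hand. The sketch below measures the distance from each wrist to the nearest edge of the book box and fails the requirement if either is too far. The `max_dist` of 40 units and the use of wrists as a proxy for hands are assumptions; a real verifier would tune the threshold and probably use hand keypoints or masks.

```python
from math import hypot


def point_to_box_distance(point: tuple, box: tuple) -> float:
    """Distance from a point to the nearest edge of a box; 0.0 if the point is inside."""
    x, y = point
    x1, y1, x2, y2 = box
    dx = max(x1 - x, 0, x - x2)
    dy = max(y1 - y, 0, y - y2)
    return hypot(dx, dy)


def both_hands_hold(left_wrist: tuple, right_wrist: tuple, book_box: tuple,
                    max_dist: float = 40.0):
    """Return (decision, distances, failed) for the two-handed hold requirement."""
    distances = {
        "left_wrist": point_to_box_distance(left_wrist, book_box),
        "right_wrist": point_to_box_distance(right_wrist, book_box),
    }
    failed = [name for name, dist in distances.items() if dist > max_dist]
    decision = "PASS" if not failed else "FAIL"
    return decision, distances, failed


# Coordinates from the active primitive set above.
decision, distances, failed = both_hands_hold(
    left_wrist=(370, 790),
    right_wrist=(720, 790),
    book_box=(450, 610, 650, 750),
)
print(decision, failed)
print({name: round(dist, 1) for name, dist in distances.items()})
```

Both wrists sit roughly 80 to 90 units from the book box, well outside the assumed threshold, which matches the FAIL decision.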
| Prompt requirement | Active primitives | Verifier question | Possible repair |
|---|---|---|---|
| holding a red book | wrists, book box | are both wrists near the book? | move hands onto book |
| looking down at book | nose/head, book box | is head/gaze oriented toward book? | tilt head toward book |
| sitting at desk | person pose, desk box | does pose align with seated position? | regenerate seated posture |
| red book | book box, color region | is the book region red? | make book visibly red |
| natural anatomy | shoulder, elbow, wrist | are joint angles plausible? | correct arm/wrist pose |
This turns image generation into a closed-loop improvement process:
prompt
↓
generated image
↓
candidate primitives
↓
active primitive set
↓
relation verifier
↓
revision prompt
↓
regenerate
That kind of critique can be measured. It can be turned into a revision prompt. It can be checked again after regeneration.
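The last step of the loop, turning failed checks into a revision prompt, can be sketched in a few lines. `CheckResult` and `build_revision_prompt` are hypothetical names, the pass/fail values are illustrative, and the prompt wording is only one possible template; the requirements and repairs are taken from the table above.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class CheckResult:
    requirement: str
    passed: bool
    repair: str


def build_revision_prompt(original_prompt: str, results: list[CheckResult]) -> str | None:
    """Return a revision prompt listing repairs for failed checks, or None if all passed."""
    repairs = [r.repair for r in results if not r.passed]
    if not repairs:
        return None
    fixes = "\n".join(f"- {repair}" for repair in repairs)
    return f"{original_prompt}\n\nFix the following issues:\n{fixes}"


prompt = (
    "A woman sitting at a desk, holding a red book in both hands, "
    "looking down at the book."
)
results = [
    CheckResult("holding a red book", False, "move hands onto book"),
    CheckResult("red book", True, "make book visibly red"),
    CheckResult("looking down at book", False, "tilt head toward book"),
]
revision = build_revision_prompt(prompt, results)
print(revision)
```

Only the failed requirements appear in the revision prompt. After regeneration, the same checks can be rerun, and the loop stops when `build_revision_prompt` returns `None`.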
8. Application Two: Entity Interaction and Placement
The next application is interaction detection, but not only human interaction.
The broader question is:
Are the relevant entities in the required relationship?
Sometimes those entities are people:
- is one person handing an object to another?
- is one person helping another stand?
- are two people facing each other in conversation?
- is a character actually holding the object the prompt described?
But the same reasoning applies to physical placement, assembly, robotics, manufacturing, generated images, and video verification:
- is the chip seated correctly on the circuit board?
- is the cable plugged into the port?
- is the screw aligned with the hole?
- is the tool touching the correct surface?
- is the object inside the container?
- is the generated character’s hand actually touching the book?
This is where primitive reasoning becomes more general.
Interaction is not presence. Interaction is relation.
A detector might give us candidate primitives:
person_1 box
person_2 box
cup_1 box
left_wrist point
right_wrist point
chip_1 box
socket_region_1 box
cable_1 endpoint
port_1 box
Those detections may all be correct, but they are not all relevant to the current task.
If the task is:
Is person_1 handing the cup to person_2?
then the active primitive set should be:
person_1
person_2
person_1 hand points
person_2 hand points
cup_1
cup_1 trajectory
If the task is:
Is chip_1 correctly placed on the circuit board?
then the active primitive set should be:
chip_1
socket_region_1
chip_1 corners
chip_1 pin row
board_contact row
The context changes the primitive set.
| Interaction / relation | Primitive pattern |
|---|---|
| Conversation | two people close + facing + shared attention |
| Object exchange | object moves from one hand region to another |
| Assistance | one body/hand primitive supports another body primitive |
| Placement | object aligns with and sits inside a target region |
| Insertion | object trajectory enters socket/container/slot region |
| Connection | cable/plug endpoint overlaps or locks into target port |
| Assembly | part primitives align with board/socket/contact primitives |
| No valid interaction | expected relation is absent, weak, or contradicted |
Object detection tells us:
cup exists
person exists
chip exists
board exists
Primitive reasoning asks:
Is the cup being transferred?
Is the person holding it?
Is the chip aligned?
Is the cable connected?
Is the required relation actually true?
For a human handoff, the verifier may check:
near(person_1.right_wrist, cup_1)
moving_toward(cup_1, person_2.left_wrist)
near(person_2.left_wrist, cup_1)
transferred_to(cup_1, person_2)
For chip placement, it may check:
inside(chip_1, socket_region_1)
aligned_with(chip_1.edges, socket_region_1.edges)
near(chip_1.pins, board_contacts)
orientation_matches(chip_1, socket_region_1)
If the chip is shifted, rotated, or not seated, the verifier can produce a grounded issue:
Decision:
FAIL. chip_1 is offset from socket_region_1 and rotated relative to the expected contact row.
Evidence:
chip_1 box
socket_region_1 box
chip_1 pin row
board_contact row
Repair:
Reposition chip_1 so its pin row aligns with the board contacts and its bounding box sits inside socket_region_1.
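A minimal sketch of the containment and offset part of this check, assuming axis-aligned boxes and an illustrative tolerance of 5 units. The function names and coordinates are hypothetical. Axis-aligned boxes cannot express rotation, so the rotation part of the issue above would need corner points or an oriented box.

```python
def box_inside(inner: tuple, outer: tuple) -> bool:
    """True if the inner box lies entirely within the outer box."""
    ix1, iy1, ix2, iy2 = inner
    ox1, oy1, ox2, oy2 = outer
    return ix1 >= ox1 and iy1 >= oy1 and ix2 <= ox2 and iy2 <= oy2


def center_offset(a: tuple, b: tuple) -> tuple:
    """Offset (dx, dy) of box a's center relative to box b's center."""
    ax, ay = (a[0] + a[2]) / 2, (a[1] + a[3]) / 2
    bx, by = (b[0] + b[2]) / 2, (b[1] + b[3]) / 2
    return (ax - bx, ay - by)


def check_placement(chip_box: tuple, socket_box: tuple, max_offset: float = 5.0):
    """Return (decision, issues) for a chip that should sit centered inside a socket."""
    issues = []
    if not box_inside(chip_box, socket_box):
        issues.append("chip_1 extends outside socket_region_1")
    dx, dy = center_offset(chip_box, socket_box)
    if abs(dx) > max_offset or abs(dy) > max_offset:
        issues.append(f"chip_1 center is offset by ({dx:+.1f}, {dy:+.1f}) from socket_region_1")
    return ("PASS" if not issues else "FAIL"), issues


socket = (400, 400, 600, 600)               # illustrative coordinates
seated = check_placement((410, 410, 590, 590), socket)
shifted = check_placement((430, 410, 610, 590), socket)
print(seated)
print(shifted)
```

The shifted chip fails on both counts: it crosses the socket boundary and its center sits 20 units to the right. Each issue string names the primitives involved, so it can feed directly into a repair instruction.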
The common pattern is always the same:
What is the task?
Which primitives matter?
What relation is expected?
What evidence supports it?
Did the relation actually happen?
That is primitive-level interaction reasoning.
The system does not need to understand everything in the scene. It needs to select the right primitives, check the right relation, and produce a grounded decision.
9. Application Three: Movie Generation Verification
The third application is the most ambitious: movie generation verification.
Image critique checks whether a single frame satisfies a prompt. Entity interaction checks whether things are in the right relation. Movie verification extends both ideas across time.
A scene in the source book says:
Alice hands the envelope to Maria.
A video model generates a clip.
At a glance, the clip may look plausible. Alice and Maria are both present. The envelope appears. Their hands move. The scene has the right mood.
But the actual event may not happen.
A normal reviewer might say:
“The scene sort of works, but the handoff is unclear.”
That is useful, but too vague for an automated generation loop. The system needs to know what failed, where it failed, and which frames need repair.
A primitive verifier can be more precise:
Required event:
handoff(envelope, Alice, Maria)
Observed:
envelope_box remains near Alice.right_hand from frames 12–44
Maria.left_hand approaches but never contacts envelope_box
envelope_box disappears at frame 45
no transferred_to relation detected
Decision:
FAIL. Handoff not visually completed.
Repair:
Regenerate frames 32–52 with a visible envelope transfer from Alice's right hand into Maria's left hand.
This is the movie-generation extension of primitive reasoning.
The same architecture applies:
generated frames
↓
YOLO + pose + tracking
↓
candidate temporal primitives
↓
context filter
↓
active primitive set
↓
event relations
↓
scene-action verification
↓
targeted regeneration
The key difference is time.
In a still image, the verifier asks:
Is the hand near the envelope?
In video, the verifier asks:
Does the envelope move from Alice's hand region to Maria's hand region across a valid frame range?
That requires identity, continuity, and event structure.
A video can be represented as a sequence of frame-level primitives:
Frame 12:
Alice.right_hand = [[312,540]]
envelope_box = [[326,528,372,562]]
Maria.left_hand = [[690,550]]
Frame 28:
Alice.right_hand = [[410,542]]
envelope_box = [[435,530,482,563]]
Maria.left_hand = [[600,548]]
Frame 44:
Alice.right_hand = [[520,545]]
envelope_box = [[548,532,596,565]]
Maria.left_hand = [[560,550]]
Those frame-level primitives can be turned into temporal relations:
near(envelope_box, Alice.right_hand, frames=12–30)
moving_toward(envelope_box, Maria.left_hand, frames=24–40)
near(envelope_box, Maria.left_hand, frames=40–48)
transferred_to(envelope_box, Maria.left_hand, frames=42–48)
That gives us an event primitive:
handoff(envelope, Alice, Maria, frames=12–48)
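The three sampled frames above are enough to sketch how a transfer could be detected. The code below finds, per frame, which hand is closest to the envelope box and reports the first sampled frame where that hand switches from giver to receiver. Treating "nearest hand" as "holder" is a simplifying assumption, and with only three samples the result is coarse; a real verifier would use every frame and a contact threshold.

```python
from math import hypot

# Frame-level primitives copied from the example above.
frames = {
    12: {"alice": (312, 540), "maria": (690, 550), "envelope": (326, 528, 372, 562)},
    28: {"alice": (410, 542), "maria": (600, 548), "envelope": (435, 530, 482, 563)},
    44: {"alice": (520, 545), "maria": (560, 550), "envelope": (548, 532, 596, 565)},
}


def point_to_box_distance(point: tuple, box: tuple) -> float:
    """Distance from a point to the nearest edge of a box; 0.0 if the point is inside."""
    x, y = point
    x1, y1, x2, y2 = box
    return hypot(max(x1 - x, 0, x - x2), max(y1 - y, 0, y - y2))


def nearest_hand(frame: dict, hands: tuple) -> str:
    """Return the hand key closest to the envelope box in this frame."""
    return min(hands, key=lambda h: point_to_box_distance(frame[h], frame["envelope"]))


def detect_transfer(frames: dict, giver: str, receiver: str):
    """Return the first sampled frame where the nearest hand switches giver -> receiver."""
    previous = None
    for index in sorted(frames):
        holder = nearest_hand(frames[index], (giver, receiver))
        if previous == giver and holder == receiver:
            return index
        previous = holder
    return None


transfer_frame = detect_transfer(frames, "alice", "maria")
print(transfer_frame)
```

On these samples the envelope stays closest to Alice's hand at frames 12 and 28, and Maria's hand falls inside the envelope box at frame 44, so the switch is reported there. That is consistent with the `transferred_to` range listed above.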
A generated movie is not just a sequence of pretty frames. It is a sequence of required actions:
Alice gives Maria the envelope.
Maria opens it.
She reads the letter.
Her expression changes.
She steps back.
Each action can be translated into verifiable primitive requirements.
| Scene action | Required primitive evidence |
|---|---|
| Alice gives Maria the envelope | envelope moves from Alice hand region to Maria hand region |
| Maria opens it | Maria hand contacts envelope; envelope state changes |
| Maria reads the letter | gaze/head direction aligns with letter region |
| Her expression changes | face landmarks / expression classifier changes over frames |
| She steps back | Maria body box moves away from Alice / table / object |
Some are easy. Some are hard. But the architecture is the same.
Movie verification is not just image critique repeated over frames. It requires temporal primitive reasoning. The system must track identities, preserve object continuity, detect relations, and verify that required relations become true in the correct order.
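Order checking can be sketched independently of how each event is detected. The function below assumes each detected event comes with a frame range and checks that every required event was observed and that none starts before an earlier required event. `verify_event_order` is a hypothetical helper, and apart from the handoff range, the frame ranges are made up for illustration.

```python
def verify_event_order(required: list, observed: dict):
    """Check that each required event was observed and starts no earlier than its predecessors."""
    issues = []
    last_start = -1
    for name in required:
        if name not in observed:
            issues.append(f"missing: {name}")
            continue
        start, _end = observed[name]
        if start < last_start:
            issues.append(f"out of order: {name} starts at frame {start}")
        last_start = max(last_start, start)
    return ("PASS" if not issues else "FAIL"), issues


required = ["handoff", "opens_envelope", "reads_letter"]

# Only the handoff range comes from the example above; the others are illustrative.
complete = {"handoff": (12, 48), "opens_envelope": (50, 70), "reads_letter": (72, 110)}
missing = {"handoff": (12, 48), "reads_letter": (72, 110)}
swapped = {"handoff": (60, 80), "opens_envelope": (10, 30), "reads_letter": (90, 120)}

print(verify_event_order(required, complete))
print(verify_event_order(required, missing))
print(verify_event_order(required, swapped))
```

A missing event and an out-of-order event produce different issues, which matters for repair: the first calls for regenerating the action, the second for reordering or re-timing it.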
The final point is simple:
A generated movie should not only look plausible. It should make the required events verifiably happen.
10. Implementation Sketch
The first implementation does not need to train a new multimodal model.
That is the practical advantage of treating this as a system architecture rather than a model architecture.
The Thinking with Visual Primitives paper trains a model to produce visual primitives directly inside its reasoning process. That is powerful, but it is not the only way to explore the idea. We can approximate the same “point-to-reason” pattern with existing tools by separating the system into five stages:
extract
↓
filter
↓
relate
↓
verify
↓
repair
The tool stack can be built from existing components:
YOLO / object detector → object boxes
YOLO-pose / MediaPipe / OpenPose → body keypoints
ByteTrack / BoT-SORT → persistent IDs across frames
GroundingDINO / open-vocab detector → domain-specific objects
Segmentation model → masks and precise regions
Reasoning layer → context filtering, relations, verification, repair
For video, a YOLO tracker can preserve object identity across frames:
# model is an ultralytics.YOLO instance; persist=True keeps track IDs between successive calls
results = model.track(frame, persist=True, tracker="bytetrack.yaml")
Each tracked object can be converted into a visual primitive:
<ref>person_1</ref><box>[[120,45,200,310]]</box>
The paper uses normalized 0–999 coordinates, so raw pixel coordinates can be converted into that shared coordinate space:
x_norm = int((x_raw / width) * 999)
y_norm = int((y_raw / height) * 999)
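A small sketch that combines the two steps above: normalizing a pixel-space box and formatting it as a tagged primitive. It assumes boxes arrive as `(x1, y1, x2, y2)` in pixels. The clamping step and the helper names `to_norm` and `box_to_primitive_tag` are additions for illustration, not part of the paper or of any detector API.

```python
def to_norm(value, size, scale=999):
    """Map a pixel coordinate in [0, size] to an integer in [0, scale]."""
    # Clamp first so boxes that spill slightly outside the frame stay in range.
    clamped = min(max(value, 0.0), float(size))
    return int(clamped / size * scale)


def box_to_primitive_tag(ref, box_xyxy, width, height):
    """Format a pixel-space box as a <ref>/<box> primitive string."""
    x1, y1, x2, y2 = box_xyxy
    coords = [
        to_norm(x1, width),
        to_norm(y1, height),
        to_norm(x2, width),
        to_norm(y2, height),
    ]
    joined = ",".join(str(c) for c in coords)
    return f"<ref>{ref}</ref><box>[[{joined}]]</box>"


tag = box_to_primitive_tag("person_1", (240, 54, 480, 540), width=1920, height=1080)
print(tag)  # <ref>person_1</ref><box>[[124,49,249,499]]</box>
```

Truncating with `int()` follows the formula above; rounding would be an equally reasonable choice as long as every stage of the pipeline uses the same convention.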
Then the active primitive set and relations can be verified:
# Accumulate tracked primitives over the whole clip first.
# update_tracks is assumed to return all tracks seen so far.
for frame in video:
    candidate_primitives = extract_primitives(frame)
    tracks = update_tracks(candidate_primitives)

# A handoff spans many frames, so filter, relate, and verify once over the accumulated tracks.
active_set = context_filter.select(
    task="Does person_1 hand the cup to person_2?",
    primitives=tracks,
)
relations = relation_builder.build(active_set)
result = verifier.verify(
    task="object_exchange(person_1, cup_1, person_2)",
    active_set=active_set,
    relations=relations,
)
print(result.decision)
print(result.evidence)
print(result.repair_instruction)
There are two ways this architecture can work.
The first is the external pipeline described above:
detectors / pose / tracking
↓
candidate primitives
↓
context filter
↓
relations
↓
verification
This works today with existing tools.
The second is a native primitive-capable model, like the direction proposed in Thinking with Visual Primitives. In that case, the model itself may produce points and boxes as part of its reasoning. The rest of the architecture still applies:
model-generated primitives
↓
context filter
↓
relations
↓
verification / repair
So the goal is not to compete with the paper. The goal is to build the system layer around the paper’s insight.
DeepSeek-style models make better primitives available inside reasoning. This architecture asks what happens next:
Which primitives matter? What relation should exist? Did that relation actually happen? What should be repaired if it did not?
The full source at the end of this post can be organized as a small reproducible reference pipeline:
reference_reasoning/
    dto.py
    geometry.py
    yolo_extract.py
    pose_extract.py
    tracking.py
    context_filter.py
    relation_builder.py
    verifiers.py
    report.py
    demo_image.py
    demo_video.py
A minimal setup might look like:
pip install ultralytics opencv-python pydantic numpy
And the demos can be run as:
python demo_image.py --image scene.png --task "is the person holding the book?"
python demo_video.py --video handoff.mp4 --task "does person 1 hand the object to person 2?"
The goal of the implementation is not to solve every visual reasoning problem. It is to demonstrate the architecture:
Extract candidate primitives from perception, filter them by context, convert them into relations, and verify whether the required relation exists.
11. Why This Works
Primitive reasoning works because it is efficient, verifiable, and repairable.
It is efficient because it avoids reasoning over everything. A naive multimodal pipeline sends the full image, full video, or full scene into a large vision-language model and asks it to reason directly over raw content. That can work, but most tasks do not require the entire scene.
If the question is:
Is person_1 handing the cup to person_2?
the system does not need every pixel, every object, every texture, every background detail, or every frame at full resolution. It needs a small active primitive set:
{
  "frame": 42,
  "active_primitives": [
    {"id": "person_1", "type": "box", "label": "person", "coords": [120, 45, 200, 310]},
    {"id": "person_1_right_wrist", "type": "point", "label": "right_wrist", "coords": [180, 220]},
    {"id": "person_2_left_wrist", "type": "point", "label": "left_wrist", "coords": [260, 225]},
    {"id": "cup_1", "type": "box", "label": "cup", "coords": [190, 230, 215, 260]}
  ],
  "relations": [
    {"subject": "person_1_right_wrist", "relation": "near", "object": "cup_1"},
    {"subject": "cup_1", "relation": "moving_toward", "object": "person_2_left_wrist"}
  ]
}
Now the reasoning layer works over a compact state: a few primitives and relations, not millions of pixels.
It is verifiable because a claim can point to its evidence.
A vague critique says:
The hand looks wrong.
A grounded critique says:
Requirement:
left hand touches book
Evidence:
left_wrist = [[370,790]]
book_box = [[450,610,650,750]]
Observed relation:
distance(left_wrist, book_box) ≈ 0.37 × object diagonal (point-to-box distance ≈ 89.4, diagonal ≈ 244.1)
Threshold:
contact requires <= 0.20 × object diagonal
Decision:
FAIL. Hand-object contact is not satisfied.
Repair:
Move the left hand so it visibly touches or holds the book.
That can be tested.
It is repairable because failures are local. If the wrist point is wrong, the detector failed. If the book box is wrong, the object extractor failed. If both are correct but the threshold is too strict, the verifier needs calibration. If the wrong objects were selected, the context filter failed.
A vague model failure says:
The image looks slightly off.
A primitive reasoning failure says:
The chip placement failed because chip_box is outside socket_region.
The socket detector confidence is only 0.41, so this issue should be reviewed.
That gives the next step somewhere to start.
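One way to sketch the chip example is to keep the geometric decision and the review flag separate, so a low-confidence detection does not silently change a PASS or FAIL. The function name `verify_placement`, the 0.5 review cutoff, and the box coordinates are assumptions for illustration; only the socket confidence of 0.41 comes from the example above.

```python
def is_box_inside(inner, outer, tolerance=0.0):
    """True if the inner box lies within the outer box, allowing a pixel tolerance."""
    ix1, iy1, ix2, iy2 = inner
    ox1, oy1, ox2, oy2 = outer
    return (
        ix1 >= ox1 - tolerance and iy1 >= oy1 - tolerance
        and ix2 <= ox2 + tolerance and iy2 <= oy2 + tolerance
    )


def verify_placement(chip, socket, review_below=0.5):
    """
    chip and socket are dicts with "id", "box", and "confidence".
    Returns the geometric decision plus a flag for weak evidence.
    """
    inside = is_box_inside(chip["box"], socket["box"], tolerance=5)
    weakest = min((chip, socket), key=lambda p: p["confidence"])
    needs_review = weakest["confidence"] < review_below

    evidence = [f"{chip['id']} inside {socket['id']}: {inside}"]
    if needs_review:
        evidence.append(
            f"{weakest['id']} confidence is {weakest['confidence']:.2f}, "
            f"below the review cutoff of {review_below:.2f}"
        )
    return {
        "decision": "PASS" if inside else "FAIL",
        "needs_review": needs_review,
        "suspect_primitive": weakest["id"] if needs_review else None,
        "evidence": evidence,
    }


chip = {"id": "chip_box", "box": (300, 200, 380, 260), "confidence": 0.93}
socket = {"id": "socket_region", "box": (320, 210, 400, 280), "confidence": 0.41}
report = verify_placement(chip, socket)
print(report["decision"], report["needs_review"], report["suspect_primitive"])
# FAIL True socket_region
```

The `suspect_primitive` field points at the component to inspect first, which is the practical meaning of a local failure.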
The result is a closed-loop improvement system:
generate
↓
extract primitives
↓
filter by task
↓
verify relations
↓
repair
↓
regenerate
↓
verify again
That is why primitive reasoning is useful. It turns visual judgment into evidence, relation, threshold, confidence, and repair.
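The closed loop above can be sketched as a bounded retry loop. Everything here is a stand-in: `fake_generate` replaces a real generator plus primitive extraction, `verify_contact` replaces the verifier, and the way the repair text is appended to the prompt is just one possible convention.

```python
def repair_loop(generate, verify, prompt, max_rounds=3):
    """
    generate(prompt) -> artifact
    verify(artifact) -> (decision, repair_instruction or None)
    Stops on PASS, when no repair is available, or after max_rounds.
    """
    history = []
    artifact = None
    current_prompt = prompt
    for round_index in range(1, max_rounds + 1):
        artifact = generate(current_prompt)
        decision, repair = verify(artifact)
        history.append((round_index, decision, repair))
        if decision == "PASS" or repair is None:
            break
        # Feed the grounded repair instruction into the next generation.
        current_prompt = f"{prompt}\nRepair: {repair}"
    return artifact, history


BOOK_BOX = (450, 610, 650, 750)


def fake_generate(prompt):
    # Hypothetical generator: the wrist lands on the book only after a repair hint.
    wrist = (500, 700) if "Repair:" in prompt else (370, 790)
    return {"left_wrist": wrist, "book_box": BOOK_BOX}


def verify_contact(artifact):
    x, y = artifact["left_wrist"]
    x1, y1, x2, y2 = artifact["book_box"]
    if x1 <= x <= x2 and y1 <= y <= y2:
        return "PASS", None
    return "FAIL", "Move the left hand so it visibly touches or holds the book."


artifact, history = repair_loop(fake_generate, verify_contact, "A reader holding a book")
for entry in history:
    print(entry)
```

The `max_rounds` bound matters in practice: a generator that cannot satisfy a requirement should produce a reviewable failure history rather than loop indefinitely.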
12. Limitations
This architecture is useful, but it has real limitations.
The first limitation is practical rather than technical: the detector backend matters.
In this post, YOLO is the easiest example because it is fast, widely used, and simple to run locally. But “YOLO” is not one single thing. It is a family of object-detection models and implementations with different licenses and deployment constraints.
The common Ultralytics YOLO stack is available under AGPL-3.0 by default, with a separate Enterprise license for proprietary or production use. For an open-source demo, research notebook, or local experiment, that may be acceptable. For a closed-source product, internal tool, SaaS deployment, or commercial generation pipeline, it may not be.
That means the architecture should not depend on YOLO specifically. YOLO should be treated as one possible primitive extractor:
PrimitiveExtractor
├── YOLOExtractor
├── TorchVisionDetectorExtractor
├── GroundingDINOExtractor
├── DETR / RT-DETR extractor
├── MediaPipePoseExtractor
├── SAM / segmentation extractor
└── ManualFixtureExtractor
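A minimal sketch of what that extractor boundary might look like in Python. The `Primitive` dataclass is a simplified stand-in for the DTOs in the appendix, and the `extract(frame, frame_index)` signature is an assumption, not an interface defined by any of the listed libraries.

```python
from dataclasses import dataclass
from typing import Any, Protocol, Sequence


@dataclass(frozen=True)
class Primitive:
    primitive_id: str
    primitive_type: str   # "box", "point", "mask", ...
    label: str
    coordinates: tuple
    confidence: float
    source: str           # which extractor produced it


class PrimitiveExtractor(Protocol):
    """Anything that turns a frame into candidate primitives."""

    def extract(self, frame: Any, frame_index: int = 0) -> Sequence[Primitive]:
        ...


class ManualFixtureExtractor:
    """Returns hand-written primitives; useful for deterministic tests."""

    def __init__(self, fixtures: dict):
        self._fixtures = fixtures  # frame_index -> list of Primitive

    def extract(self, frame: Any, frame_index: int = 0) -> Sequence[Primitive]:
        return list(self._fixtures.get(frame_index, []))


def extract_all(extractors, frame, frame_index=0):
    """Merge candidates from every configured extractor."""
    candidates = []
    for extractor in extractors:
        candidates.extend(extractor.extract(frame, frame_index))
    return candidates


fixture = ManualFixtureExtractor({
    0: [Primitive("book_1", "box", "book", (450, 610, 650, 750), 1.0, "fixture")]
})
print(extract_all([fixture], frame=None, frame_index=0))
```

A YOLO-backed or GroundingDINO-backed extractor would implement the same `extract` method, so swapping backends for licensing or accuracy reasons would not touch the filter, relation, or verifier code.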
The second limitation is that primitive reasoning is only as good as the primitives it receives.
YOLO can miss objects. Pose models can misplace hands. Trackers can swap identities. Open-vocabulary detectors can hallucinate labels. Segmentation models can produce masks that are too broad or too narrow.
Primitive reasoning reduces hallucination, but it does not eliminate it. The right phrase is not “zero hallucinations.” It is:
Bounded hallucination.
The reasoning layer is constrained to a declared primitive set, so unsupported visual entities are easier to catch. If the system says the character is holding a book, it should be able to point to the hand primitive, the book primitive, and the relation between them. But the system can still inherit detector errors, select the wrong active primitive set, overinterpret weak relations, or use thresholds that are too strict or too loose.
The third limitation is domain specificity. General object detectors may recognize person, cup, chair, or book, but fail on objects such as bed rail, neural interface, surgical telemetry pad, chip socket, pin row, board contact, robot gripper, or custom tool head.
For those cases, the pipeline needs specialized extractors, open-vocabulary detectors, segmentation, manual fixtures, or project-specific fine-tuning.
The fourth limitation is semantic ambiguity. Many useful relations can be checked geometrically:
near
inside
overlapping
aligned_with
touching
moving_toward
transferred_to
But some relations require semantic judgment:
comforting
threatening
hesitating
pretending
arguing
agreeing
acting suspiciously
Primitive reasoning can provide evidence for those higher-level judgments, but it may not fully determine them.
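The geometric relations listed above reduce to small pure functions, which is what makes them testable with fixtures. A sketch of two of them follows. The IoU-based definition of `overlapping`, the center-based definition of `aligned_with`, and the default thresholds are assumptions chosen for illustration.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def overlapping(a, b, min_iou=0.1):
    """True if the boxes share at least min_iou of their combined area."""
    return iou(a, b) >= min_iou


def aligned_with(a, b, axis="x", tolerance=5.0):
    """True if the box centers agree along the given axis within tolerance pixels."""
    center_a = ((a[0] + a[2]) / 2.0, (a[1] + a[3]) / 2.0)
    center_b = ((b[0] + b[2]) / 2.0, (b[1] + b[3]) / 2.0)
    index = 0 if axis == "x" else 1
    return abs(center_a[index] - center_b[index]) <= tolerance


left = (0, 0, 10, 10)
right = (5, 0, 15, 10)
print(overlapping(left, right))                             # True
print(aligned_with(left, right, axis="y", tolerance=2.0))   # True
print(aligned_with(left, right, axis="x", tolerance=2.0))   # False
```

Semantic relations such as comforting or threatening have no equivalent closed form; functions like these can only supply supporting evidence for them.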
The final limitation is temporal fragility. Video requires the system to preserve identity over time:
person_1 remains person_1
object_1 remains object_1
hand trajectory is continuous
object does not disappear or swap identity
Trackers can fail when objects overlap, leave frame, re-enter, become occluded, or change appearance. A single ID swap can break the event trace.
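A rough sketch of a continuity check that could flag suspicious tracks before they reach the verifier. It assumes each track is a list of `(frame_index, (cx, cy))` samples for one ID, sorted by frame. The thresholds `max_gap` and `max_jump` are hypothetical and would need calibration per video resolution and frame rate.

```python
import math


def find_track_breaks(track, max_gap=5, max_jump=80.0):
    """
    Flag places where a tracked ID may no longer refer to the same object.

    "gap":  the ID was missing for more than max_gap frames.
    "jump": the center moved more than max_jump pixels per frame.
    """
    issues = []
    for (f0, p0), (f1, p1) in zip(track, track[1:]):
        gap = f1 - f0
        jump = math.dist(p0, p1)
        if gap > max_gap:
            issues.append(("gap", f0, f1))
        elif jump > max_jump * gap:
            issues.append(("jump", f0, f1))
    return issues


track = [
    (10, (100, 200)),
    (11, (104, 201)),
    (12, (400, 520)),   # sudden relocation: possible ID swap
    (13, (404, 522)),
    (30, (410, 525)),   # long absence: possible occlusion or re-entry
]
print(find_track_breaks(track))
# [('jump', 11, 12), ('gap', 13, 30)]
```

A flagged break does not prove an ID swap, but it tells the verifier which frame ranges in the event trace deserve lower confidence.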
The core limitation is this:
Primitive reasoning does not remove the need for perception, judgment, or calibration. It gives those failures a structure.
That is still valuable.
A vague model failure says:
The scene looks wrong.
A primitive reasoning failure says:
The chip placement failed because chip_box is outside socket_region.
But the socket detector confidence is only 0.41, so this issue should be reviewed.
The goal is not perfection. The goal is inspectability.
Given these constraints, it is worth comparing the external-verifier approach with the internal-primitive paradigm of the original paper and with pure VLM-based critics.
13. Comparative Analysis: Internal vs. External Primitives, and the Verifier Loop
The Thinking with Visual Primitives paper demonstrated that large multimodal models benefit from interleaving spatial markers (boxes, points) directly into their chain‑of‑thought. However, the primitives are internal to the model; they are generated as text tokens in the reasoning trace. This makes the reasoning more precise, but it does not automatically make it verifiable or repairable by an external system.
Pure VLM‑based critics (e.g., asking a frontier model “Does this image match the prompt?”) produce natural‑language feedback. That feedback can sound plausible, but it lacks machine‑verifiable pointers to specific image regions, making it difficult to measure improvement or to automate repair.
Our work occupies a distinct point in the design space: we use external primitive extractors (YOLO, pose models, trackers) and a deterministic verifier loop that can be inspected, tested, and iterated. The table below summarises the key differences.
| Capability | Thinking with Visual Primitives (DeepSeek) | VLM‑based critics (e.g., GPT‑4V) | This work (Reference‑Grounded Verification) |
|---|---|---|---|
| Primitive source | Model‑internal, generated during chain‑of‑thought | No explicit primitives; only language | External detectors: YOLO, pose, tracking, segmentation |
| Verifiability | Primitive output is part of the model’s text; coordinates can be checked, but the reasoning trace is still a black box | Low: feedback is natural language, not anchored to coordinates | High: every issue must cite one or more primitive IDs and provide deterministic evidence |
| Repairability | The model can be prompted to revise, but the revision instruction is again language‑only | The critic can suggest a revision prompt, but it’s vague (“improve the anatomy”) | The verifier produces a targeted repair instruction grounded in a failed relation (e.g., “Move left_wrist onto book_box”) |
| Generalisability to video | Limited to single‑frame spatial reasoning in current work | Can describe a video clip but loses track of object identity across frames | Designed for temporal primitives; object tracks and event relations are first‑class |
| Deterministic component | No deterministic layer; model outputs are stochastic | None | Geometry verifiers (distance, angle, inside) are pure functions with fixture‑based tests |
| Open‑vocabulary objects | Learnt from pretraining data; box output is tied to known classes | Can handle novel objects through language, but cannot ground them reliably | Plug‑in detectors (Grounding DINO, SAM) extend the vocabulary; the verifier logic is class‑agnostic |
| Inspectability | You can read the CoT and the boxes, but you cannot verify the reasoning steps externally | Opaque | Every issue can be traced back to specific primitives and a numeric threshold |
The optimal future system is likely a hybrid. A model could produce internal primitives and relation hypotheses, while an external verifier loop checks them against geometric constraints, prompt requirements, and temporal consistency. Our architecture provides the outer loop; the paper provides a path toward the inner loop. Together, they point toward AI systems that not only think with primitives but also check those thoughts against external evidence and know exactly what to fix when a check fails.
This comparison underscores the core contribution: we are not replacing the paper’s insight but building the verification and repair layer around it, in a way that is deterministic, inspectable, and transferable to domains beyond images.
14. The Generalization: What We Are Really Saying
The broader claim is not limited to images, YOLO, or movie generation.
The claim is this:
Reasoning becomes more reliable when it operates over task-relevant primitives instead of raw context.
Humans seem to do this naturally. We do not reason over everything we see, know, remember, or imagine. We activate the parts of the world that matter for the current situation.
If we think about an airplane at the airport, we activate gates, seats, luggage, boarding, and takeoff. If we think about an airplane accident, we activate wreckage, trajectory, failure, smoke, and emergency response. If we think about an airplane in war, we activate radar, missiles, speed, threat, and target.
The object is the same. The active primitive set is different.
That is the cognitive pattern this architecture tries to mimic:
raw world
↓
possible primitives
↓
context-filtered active primitive set
↓
relations
↓
verification
↓
decision or repair
Across domains, the same structure appears:
| What we are saying | In human thought | In an AI system | What it enables |
|---|---|---|---|
| Do not reason over everything | Ignore irrelevant detail | Avoid passing all pixels, tokens, detections, or frames forward | Less noise and confusion |
| Find possible handles | Notice objects, parts, places, causes | Extract boxes, points, spans, symbols, tracks | Grounded references |
| Select by context | Activate what matters for the current question | Build an active primitive set | Task-focused reasoning |
| Reason over relations | Understand how parts connect or change | Check touching, holding, inside, aligned, transferred | Structured inference |
| Verify against expectation | Decide whether the situation satisfies the goal | Compare observed relations to required relations | Pass/fail decisions |
| Repair locally | Fix the part that failed | Produce targeted regeneration or correction instructions | Iterative improvement |
This is the heart of the idea.
Primitive extraction alone is not enough. A detector can find hundreds of things. A language model can consume thousands of tokens. A video model can generate thousands of frames. But intelligence is not the amount of material available to the system.
Intelligence is the ability to select the right material for the current decision.
The paper shows that multimodal models improve when they can point while reasoning. This architecture adds the next layer:
A useful system must know which things are worth pointing at.
That gives us the distinction:
Perception Gap:
Can the system see enough?
Reference Gap:
Can the system point to what it means?
Relevance Gap:
Can the system select which references matter?
The first gap is about seeing. The second gap is about pointing. The third gap is about understanding the task.
The final generalization is simple:
More context is not more intelligence. The right primitive set is.
A system that reasons over everything gets lost. A system that extracts primitives but fails to filter them drowns in candidate references. A system that selects the active primitive set can reason more like a human: not by reconstructing the whole world, but by activating the parts of the world needed for the decision in front of it.
We are taking the paper’s idea of visual primitives and extending it into a broader principle:
Reasoning is context-filtered primitive selection, followed by relation verification.
Conclusion
The Thinking with Visual Primitives paper matters because it challenges a hidden assumption: that reasoning must happen primarily in language.
It shows that multimodal models can reason better when they can point. Points and boxes become minimal units of thought. The model’s reasoning no longer floats above the image in vague phrases like “that object” or “the thing on the left.” It binds itself to physical coordinates.
That is the first step.
But once a system can point, a second question appears:
What should it point at?
A detector may find hundreds of objects. A pose model may produce dozens of keypoints. A tracker may preserve identities across thousands of frames. More perception gives the system more possible references, but more references do not automatically produce better reasoning.
Without context filtering, the model can still get lost.
It no longer drowns in pixels. It drowns in primitives.
That is the extension proposed here.
The paper identifies the Reference Gap: language is not precise enough to anchor visual reasoning. This post adds the Relevance Gap: even when references exist, the system must select the references that matter for the current task.
The resulting architecture is simple:
raw content
↓
candidate primitives
↓
context filter
↓
active primitive set
↓
relations
↓
verification
↓
repair
YOLO, pose models, trackers, segmenters, and open-vocabulary detectors can extract candidate primitives. A context filter selects the active primitive set. A relation layer checks whether the relevant objects, hands, parts, trajectories, symbols, or spans are connected in the expected way. Verifiers turn failures into grounded issues. Repair builders turn those issues into targeted regeneration or correction instructions.
That gives us a practical path from primitive reasoning to real applications:
- generated image critique,
- entity interaction detection,
- object placement verification,
- assembly checking,
- generated video validation,
- movie scene repair.
The deeper claim is not about YOLO, image generation, or any one model.
The deeper claim is this:
Reasoning is context-filtered primitive selection followed by relation verification.
Humans seem to do something like this naturally. We do not reason over the whole world. We activate the parts of the world that matter. The airplane at the airport is not the airplane in an accident, and neither of them is the airplane in war. The object may be the same, but the active primitive set changes with the task.
AI systems need the same discipline.
More context is not more intelligence. More pixels are not more understanding. More detections are not more reasoning.
A useful system does not merely see more. It selects better.
So the final lesson is not just:
AI should point while it reasons.
It is:
AI should know what is worth pointing at.
That is what turns visual primitives into a broader architecture for grounded, inspectable, repairable reasoning.
Appendix: A Minimal Reference-Grounded Reasoning Demo
This appendix implements a small version of the architecture described in the post.
It demonstrates the full loop:
candidate primitives
↓
context filter
↓
active primitive set
↓
relations
↓
verification
↓
repair instruction
The demo intentionally uses fixture data first. This makes the reasoning layer deterministic and easy to inspect. After that, an optional YOLO adapter shows how real detections can be converted into the same primitive format.
Install
pip install pydantic numpy
Optional YOLO support:
pip install ultralytics opencv-python
Single-file demo: reference_reasoning_demo.py
"""
reference_reasoning_demo.py
A minimal reference-grounded primitive reasoning demo.
This file demonstrates:
1. ReferencePrimitiveDTO
2. PrimitiveRelationDTO
3. Context filtering
4. Relation building
5. Verification
6. Repair instruction generation
7. Optional YOLO extraction adapter
Run fixture demos:
python reference_reasoning_demo.py
Optional YOLO demo:
python reference_reasoning_demo.py --image path/to/image.jpg
The fixture demos do not require YOLO.
"""
from __future__ import annotations

import argparse
import math
from typing import Any, Literal

from pydantic import BaseModel, Field

# ============================================================
# 1. DTOs
# ============================================================

PrimitiveType = Literal[
    "box",
    "point",
    "keypoint",
    "polyline",
    "mask",
    "span",
    "symbol",
    "region",
    "trajectory",
]

RelationType = Literal[
    "near",
    "touching",
    "inside",
    "aligned_with",
    "moving_toward",
    "transferred_to",
    "unknown",
]


class ReferencePrimitiveDTO(BaseModel):
    """
    A machine-verifiable pointer.

    In images, this may be a box or keypoint.
    In video, this may be a trajectory.
    In documents, this could be a span.
    In code, this could be a symbol.

    For this demo, we mainly use boxes and points.
    """

    primitive_id: str
    primitive_type: PrimitiveType
    label: str | None = None
    coordinates: Any
    confidence: float | None = None
    source: str = "fixture"
    frame_index: int | None = None
    parent_primitive_id: str | None = None
    meta: dict[str, Any] = Field(default_factory=dict)


class PrimitiveRelationDTO(BaseModel):
    """
    A checkable relationship between two primitives.
    """

    relation_id: str
    subject_primitive_id: str
    relation_type: RelationType
    object_primitive_id: str
    confidence: float
    evidence_primitive_ids: list[str] = Field(default_factory=list)
    frame_range: tuple[int, int] | None = None
    meta: dict[str, Any] = Field(default_factory=dict)


class ActivePrimitiveSetDTO(BaseModel):
    """
    The context-filtered subset of primitives relevant to the task.
    """

    task: str
    selected: list[ReferencePrimitiveDTO]
    rejected: list[ReferencePrimitiveDTO] = Field(default_factory=list)
    rationale: dict[str, str] = Field(default_factory=dict)


class VerificationResultDTO(BaseModel):
    """
    A grounded verifier output.
    """

    task: str
    decision: Literal["PASS", "FAIL", "UNCERTAIN"]
    confidence: float
    evidence: list[str]
    observed_relations: list[PrimitiveRelationDTO] = Field(default_factory=list)
    repair_instruction: str | None = None


class TaskDTO(BaseModel):
    task_id: str
    action_type: Literal[
        "object_contact",
        "object_transfer",
        "placement_verification",
        "gaze_verification",
        "anatomy_check",
    ]
    subject_labels: list[str] = Field(default_factory=list)
    object_labels: list[str] = Field(default_factory=list)
    required_body_parts: list[str] = Field(default_factory=list)
    required_relations: list[str] = Field(default_factory=list)
    temporal: bool = False
# ============================================================
# 2. Geometry helpers
# ============================================================

def box_center(box: list[float]) -> tuple[float, float]:
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def box_diagonal(box: list[float]) -> float:
    x1, y1, x2, y2 = box
    return math.sqrt((x2 - x1) ** 2 + (y2 - y1) ** 2)


def point_to_box_distance(point: list[float], box: list[float]) -> float:
    px, py = point
    x1, y1, x2, y2 = box
    dx = max(x1 - px, 0, px - x2)
    dy = max(y1 - py, 0, py - y2)
    return math.sqrt(dx * dx + dy * dy)


def point_distance(a: list[float], b: list[float]) -> float:
    return math.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)


def is_point_near_box(
    point: list[float],
    box: list[float],
    relative_threshold: float = 0.20,
) -> tuple[bool, float]:
    """
    Returns:
        is_near: whether point is close enough to the box
        relative_distance: distance divided by object diagonal
    """
    dist = point_to_box_distance(point, box)
    diag = max(1.0, box_diagonal(box))
    rel = dist / diag
    return rel <= relative_threshold, rel


def is_box_inside(inner: list[float], outer: list[float], tolerance: float = 0.0) -> bool:
    ix1, iy1, ix2, iy2 = inner
    ox1, oy1, ox2, oy2 = outer
    return (
        ix1 >= ox1 - tolerance
        and iy1 >= oy1 - tolerance
        and ix2 <= ox2 + tolerance
        and iy2 <= oy2 + tolerance
    )


def normalize_box_0_999(box: list[float], width: int, height: int) -> list[int]:
    x1, y1, x2, y2 = box
    return [
        int((x1 / width) * 999),
        int((y1 / height) * 999),
        int((x2 / width) * 999),
        int((y2 / height) * 999),
    ]
# ============================================================
# 3. Context filter
# ============================================================

class ContextFilter:
    """
    Selects the primitives relevant to the current task.

    This is the Relevance Gap layer:
    the system should not reason over everything it can detect.

    The filter accepts either:
    - a structured TaskDTO, preferred for serious use
    - a plain string, converted into a TaskDTO for demo convenience
    """

    def select(
        self,
        task: TaskDTO | str,
        primitives: list[ReferencePrimitiveDTO],
    ) -> ActivePrimitiveSetDTO:
        task_dto = self._coerce_task(task)
        selected: list[ReferencePrimitiveDTO] = []
        rejected: list[ReferencePrimitiveDTO] = []
        rationale: dict[str, str] = {}
        wanted_labels = {
            item.lower()
            for item in (
                task_dto.subject_labels
                + task_dto.object_labels
                + task_dto.required_body_parts
            )
        }
        for primitive in primitives:
            label = (primitive.label or "").lower()
            pid = primitive.primitive_id.lower()
            keep = any(
                wanted in label or wanted in pid
                for wanted in wanted_labels
            )
            if keep:
                selected.append(primitive)
                rationale[primitive.primitive_id] = (
                    f"Selected for {task_dto.action_type}: "
                    f"matches one of {sorted(wanted_labels)}"
                )
            else:
                rejected.append(primitive)
                rationale[primitive.primitive_id] = (
                    f"Rejected for {task_dto.action_type}: "
                    f"does not match active task labels"
                )
        return ActivePrimitiveSetDTO(
            task=task_dto.task_id,
            selected=selected,
            rejected=rejected,
            rationale=rationale,
        )

    def _coerce_task(self, task: TaskDTO | str) -> TaskDTO:
        if isinstance(task, TaskDTO):
            return task
        task_l = task.lower()
        if any(word in task_l for word in ["hand", "handoff", "give", "transfer", "cup", "envelope"]):
            return TaskDTO(
                task_id=task,
                action_type="object_transfer",
                subject_labels=["person", "alice", "maria"],
                object_labels=["cup", "envelope"],
                required_body_parts=["wrist", "hand"],
                required_relations=["near", "moving_toward", "transferred_to"],
                temporal=True,
            )
        if any(word in task_l for word in ["hold", "holding", "book"]):
            return TaskDTO(
                task_id=task,
                action_type="object_contact",
                subject_labels=["person"],
                object_labels=["book"],
                required_body_parts=["wrist", "hand"],
                required_relations=["near", "touching"],
                temporal=False,
            )
        if any(word in task_l for word in ["chip", "board", "socket", "place", "placement"]):
            return TaskDTO(
                task_id=task,
                action_type="placement_verification",
                subject_labels=["chip"],
                object_labels=["socket", "board", "circuit_board"],
                required_body_parts=["pin", "contact"],
                required_relations=["inside", "aligned_with"],
                temporal=False,
            )
        return TaskDTO(
            task_id=task,
            action_type="object_contact",
            subject_labels=[],
            object_labels=[],
            required_body_parts=[],
            required_relations=[],
            temporal=False,
        )
# ============================================================
# 4. Relation builder
# ============================================================

class RelationBuilder:
    """
    Converts primitives into primitive relations.

    The demo supports:
    - near(point, box)
    - inside(box, box)
    - simple transferred_to relation from frame sequence
    """

    def build(self, active_set: ActivePrimitiveSetDTO) -> list[PrimitiveRelationDTO]:
        primitives = active_set.selected
        relations: list[PrimitiveRelationDTO] = []
        points = [p for p in primitives if p.primitive_type in {"point", "keypoint"}]
        boxes = [p for p in primitives if p.primitive_type == "box"]

        # Spatial near relations between keypoints and boxes
        for point in points:
            for box in boxes:
                if point.parent_primitive_id == box.primitive_id:
                    continue
                near, rel_dist = is_point_near_box(
                    point.coordinates,
                    box.coordinates,
                    relative_threshold=0.25,
                )
                if near:
                    relations.append(
                        PrimitiveRelationDTO(
                            relation_id=f"rel_near_{point.primitive_id}_{box.primitive_id}",
                            subject_primitive_id=point.primitive_id,
                            relation_type="near",
                            object_primitive_id=box.primitive_id,
                            confidence=self._combined_confidence(point, box, base=0.85),
                            evidence_primitive_ids=[point.primitive_id, box.primitive_id],
                            meta={"relative_distance": rel_dist},
                        )
                    )

        # Inside relations between boxes
        for inner in boxes:
            for outer in boxes:
                if inner.primitive_id == outer.primitive_id:
                    continue
                if is_box_inside(inner.coordinates, outer.coordinates, tolerance=5):
                    relations.append(
                        PrimitiveRelationDTO(
                            relation_id=f"rel_inside_{inner.primitive_id}_{outer.primitive_id}",
                            subject_primitive_id=inner.primitive_id,
                            relation_type="inside",
                            object_primitive_id=outer.primitive_id,
                            confidence=self._combined_confidence(inner, outer, base=0.90),
                            evidence_primitive_ids=[inner.primitive_id, outer.primitive_id],
                        )
                    )

        # Simple temporal transfer relation if task contains frame-indexed object movement.
        relations.extend(self._build_simple_transfer_relations(primitives))
        return relations

    def _combined_confidence(
        self,
        a: ReferencePrimitiveDTO,
        b: ReferencePrimitiveDTO,
        base: float,
    ) -> float:
        ca = a.confidence if a.confidence is not None else 1.0
        cb = b.confidence if b.confidence is not None else 1.0
        return round(min(ca, cb) * base, 3)

    def _build_simple_transfer_relations(
        self,
        primitives: list[ReferencePrimitiveDTO],
    ) -> list[PrimitiveRelationDTO]:
        """
        Toy temporal transfer detector.

        Looks for an object such as cup/envelope near giver hand early
        and near receiver hand later.
        """
        relations: list[PrimitiveRelationDTO] = []
        object_boxes = [
            p for p in primitives
            if p.primitive_type == "box"
            and p.label
            and p.label.lower() in {"cup", "envelope"}
            and p.frame_index is not None
        ]
        hands = [
            p for p in primitives
            if p.primitive_type in {"point", "keypoint"}
            and p.label
            and "wrist" in p.label.lower()
            and p.frame_index is not None
        ]
        if not object_boxes or not hands:
            return relations
        early = min(p.frame_index for p in object_boxes if p.frame_index is not None)
        late = max(p.frame_index for p in object_boxes if p.frame_index is not None)
        early_objects = [p for p in object_boxes if p.frame_index == early]
        late_objects = [p for p in object_boxes if p.frame_index == late]
        early_hands = [p for p in hands if p.frame_index == early]
        late_hands = [p for p in hands if p.frame_index == late]
        for obj_early in early_objects:
            for hand_early in early_hands:
                near_early, _ = is_point_near_box(hand_early.coordinates, obj_early.coordinates)
                if not near_early:
                    continue
                for obj_late in late_objects:
                    if obj_late.label != obj_early.label:
                        continue
                    for hand_late in late_hands:
                        if hand_late.parent_primitive_id == hand_early.parent_primitive_id:
                            continue
                        near_late, _ = is_point_near_box(hand_late.coordinates, obj_late.coordinates)
                        if near_late:
                            relations.append(
                                PrimitiveRelationDTO(
                                    relation_id=f"rel_transferred_{obj_early.label}_{early}_{late}",
                                    subject_primitive_id=obj_late.primitive_id,
                                    relation_type="transferred_to",
                                    object_primitive_id=hand_late.primitive_id,
                                    confidence=0.78,
                                    evidence_primitive_ids=[
                                        obj_early.primitive_id,
                                        hand_early.primitive_id,
                                        obj_late.primitive_id,
                                        hand_late.primitive_id,
                                    ],
                                    frame_range=(early, late),
                                    meta={
                                        "from_hand": hand_early.primitive_id,
                                        "to_hand": hand_late.primitive_id,
                                    },
                                )
                            )
        return relations
# ============================================================
# 5. Verifiers
# ============================================================
class Verifier:
    """
    Verifies task-specific requirements against observed primitive relations.
    """

    def verify(
        self,
        task: str,
        active_set: ActivePrimitiveSetDTO,
        relations: list[PrimitiveRelationDTO],
    ) -> VerificationResultDTO:
        # Routing uses substring matching and the first match wins, so "hand"
        # also matches "handle", and any task that mentions "book" is sent to
        # the holding verifier even if it describes a handoff.
        task_l = task.lower()
        if any(word in task_l for word in ("hold", "book")):
            return self._verify_holding_book(task, active_set, relations)
        if any(word in task_l for word in ("hand", "handoff", "give", "transfer", "cup", "envelope")):
            return self._verify_object_transfer(task, active_set, relations)
        if any(word in task_l for word in ("chip", "board", "socket", "placement", "place")):
            return self._verify_chip_placement(task, active_set, relations)
        return VerificationResultDTO(
            task=task,
            decision="UNCERTAIN",
            confidence=0.2,
            evidence=["No verifier matched this task."],
            observed_relations=relations,
            repair_instruction="Add a verifier for this task type.",
        )

    def _verify_holding_book(
        self,
        task: str,
        active_set: ActivePrimitiveSetDTO,
        relations: list[PrimitiveRelationDTO],
    ) -> VerificationResultDTO:
        book_ids = {
            p.primitive_id
            for p in active_set.selected
            if (p.label or "").lower() == "book"
        }
        wrist_near_book = [
            r for r in relations
            if r.relation_type == "near"
            and r.object_primitive_id in book_ids
            and "wrist" in r.subject_primitive_id.lower()
        ]
        # Count distinct wrists so one wrist near two books does not pass.
        wrist_ids = {r.subject_primitive_id for r in wrist_near_book}
        if len(wrist_ids) >= 2:
            return VerificationResultDTO(
                task=task,
                decision="PASS",
                confidence=min(r.confidence for r in wrist_near_book),
                evidence=[
                    "Both wrist primitives are near the book primitive.",
                    *[self._relation_summary(r) for r in wrist_near_book],
                ],
                observed_relations=wrist_near_book,
            )
        return VerificationResultDTO(
            task=task,
            decision="FAIL",
            confidence=0.75,
            evidence=[
                "Expected both wrists to be near the book.",
                f"Observed wrist-book relations: {len(wrist_near_book)}",
            ],
            observed_relations=wrist_near_book,
            repair_instruction="Move both hands so they visibly touch or hold the book.",
        )

    def _verify_object_transfer(
        self,
        task: str,
        active_set: ActivePrimitiveSetDTO,
        relations: list[PrimitiveRelationDTO],
    ) -> VerificationResultDTO:
        # Any transferred_to relation passes. The giver and receiver named in
        # the task text are not matched against the relation.
        transfers = [r for r in relations if r.relation_type == "transferred_to"]
        if transfers:
            return VerificationResultDTO(
                task=task,
                decision="PASS",
                confidence=max(r.confidence for r in transfers),
                evidence=[
                    "A transferred_to relation was detected.",
                    *[self._relation_summary(r) for r in transfers],
                ],
                observed_relations=transfers,
            )
        return VerificationResultDTO(
            task=task,
            decision="FAIL",
            confidence=0.72,
            evidence=[
                "No transferred_to relation was detected.",
                "The object did not move from one hand region to another in the observed frame range.",
            ],
            observed_relations=[],
            repair_instruction=(
                "Regenerate the relevant frames with the object remaining visible "
                "and moving from the giver's hand region into the receiver's hand region."
            ),
        )

    def _verify_chip_placement(
        self,
        task: str,
        active_set: ActivePrimitiveSetDTO,
        relations: list[PrimitiveRelationDTO],
    ) -> VerificationResultDTO:
        # Chip and socket are recognized by substring in the primitive ids.
        inside_relations = [
            r for r in relations
            if r.relation_type == "inside"
            and "chip" in r.subject_primitive_id.lower()
            and "socket" in r.object_primitive_id.lower()
        ]
        if inside_relations:
            return VerificationResultDTO(
                task=task,
                decision="PASS",
                confidence=max(r.confidence for r in inside_relations),
                evidence=[
                    "chip primitive is inside socket primitive.",
                    *[self._relation_summary(r) for r in inside_relations],
                ],
                observed_relations=inside_relations,
            )
        return VerificationResultDTO(
            task=task,
            decision="FAIL",
            confidence=0.80,
            evidence=[
                "Expected chip to be inside socket region.",
                "No inside(chip, socket_region) relation was detected.",
            ],
            observed_relations=[],
            repair_instruction=(
                "Reposition the chip so its bounding box sits inside the socket region "
                "and its pins align with the board contacts."
            ),
        )

    def _relation_summary(self, relation: PrimitiveRelationDTO) -> str:
        return (
            f"{relation.relation_type}("
            f"{relation.subject_primitive_id}, {relation.object_primitive_id}"
            f") confidence={relation.confidence}"
        )
# ============================================================
# 6. Fixture demos
# ============================================================
def demo_image_holding_book() -> None:
    print("\n=== Demo 1: Generated image critique — holding a book ===")
    task = "Is the person holding the book?"
    primitives = [
        ReferencePrimitiveDTO(
            primitive_id="person_1",
            primitive_type="box",
            label="person",
            coordinates=[100, 80, 760, 960],
            confidence=0.96,
        ),
        ReferencePrimitiveDTO(
            primitive_id="book_1",
            primitive_type="box",
            label="book",
            coordinates=[450, 610, 650, 750],
            confidence=0.92,
        ),
        ReferencePrimitiveDTO(
            primitive_id="left_wrist",
            primitive_type="keypoint",
            label="left_wrist",
            coordinates=[370, 790],
            confidence=0.88,
            parent_primitive_id="person_1",
        ),
        ReferencePrimitiveDTO(
            primitive_id="right_wrist",
            primitive_type="keypoint",
            label="right_wrist",
            coordinates=[720, 790],
            confidence=0.84,
            parent_primitive_id="person_1",
        ),
        ReferencePrimitiveDTO(
            primitive_id="desk_1",
            primitive_type="box",
            label="desk",
            coordinates=[50, 760, 900, 980],
            confidence=0.90,
        ),
    ]
    run_pipeline(task, primitives)


def demo_video_object_transfer_success() -> None:
    print("\n=== Demo 2: Video verification — successful cup transfer ===")
    task = "Does person_1 hand the cup to person_2?"
    primitives = [
        # Frame 10: cup near person_1 hand
        ReferencePrimitiveDTO(
            primitive_id="cup_f10",
            primitive_type="box",
            label="cup",
            coordinates=[190, 230, 215, 260],
            confidence=0.90,
            frame_index=10,
        ),
        ReferencePrimitiveDTO(
            primitive_id="person_1_right_wrist_f10",
            primitive_type="keypoint",
            label="right_wrist",
            coordinates=[185, 240],
            confidence=0.88,
            parent_primitive_id="person_1",
            frame_index=10,
        ),
        ReferencePrimitiveDTO(
            primitive_id="person_2_left_wrist_f10",
            primitive_type="keypoint",
            label="left_wrist",
            coordinates=[320, 245],
            confidence=0.86,
            parent_primitive_id="person_2",
            frame_index=10,
        ),
        # Frame 40: cup near person_2 hand
        ReferencePrimitiveDTO(
            primitive_id="cup_f40",
            primitive_type="box",
            label="cup",
            coordinates=[300, 232, 326, 262],
            confidence=0.89,
            frame_index=40,
        ),
        ReferencePrimitiveDTO(
            primitive_id="person_1_right_wrist_f40",
            primitive_type="keypoint",
            label="right_wrist",
            coordinates=[210, 245],
            confidence=0.83,
            parent_primitive_id="person_1",
            frame_index=40,
        ),
        ReferencePrimitiveDTO(
            primitive_id="person_2_left_wrist_f40",
            primitive_type="keypoint",
            label="left_wrist",
            coordinates=[312, 244],
            confidence=0.87,
            parent_primitive_id="person_2",
            frame_index=40,
        ),
    ]
    run_pipeline(task, primitives)


def demo_chip_placement_failure() -> None:
    print("\n=== Demo 3: Placement verification — failed chip placement ===")
    task = "Verify that chip_1 has been placed correctly on the circuit board socket."
    primitives = [
        ReferencePrimitiveDTO(
            primitive_id="chip_1",
            primitive_type="box",
            label="chip",
            coordinates=[420, 420, 520, 520],
            confidence=0.93,
        ),
        ReferencePrimitiveDTO(
            primitive_id="socket_region_1",
            primitive_type="box",
            label="socket",
            coordinates=[300, 300, 400, 400],
            confidence=0.91,
        ),
        ReferencePrimitiveDTO(
            primitive_id="board_1",
            primitive_type="box",
            label="circuit_board",
            coordinates=[100, 100, 800, 800],
            confidence=0.96,
        ),
    ]
    run_pipeline(task, primitives)

def run_pipeline(task: str, primitives: list[ReferencePrimitiveDTO]) -> VerificationResultDTO:
    context_filter = ContextFilter()
    relation_builder = RelationBuilder()
    verifier = Verifier()

    active_set = context_filter.select(task, primitives)
    relations = relation_builder.build(active_set)
    result = verifier.verify(task, active_set, relations)

    print(f"\nTask: {task}")
    print("\nSelected primitives:")
    for primitive in active_set.selected:
        print(f"  - {primitive.primitive_id}: {primitive.label} {primitive.coordinates}")

    print("\nRelations:")
    if relations:
        for relation in relations:
            print(
                f"  - {relation.relation_type}("
                f"{relation.subject_primitive_id}, {relation.object_primitive_id}"
                f") conf={relation.confidence}"
            )
    else:
        print("  - none")

    print("\nDecision:")
    print(f"  {result.decision} confidence={result.confidence}")
    print("\nEvidence:")
    for item in result.evidence:
        print(f"  - {item}")
    if result.repair_instruction:
        print("\nRepair:")
        print(f"  {result.repair_instruction}")
    return result
# ============================================================
# 7. Optional YOLO adapter
# ============================================================
def extract_yolo_primitives(image_path: str) -> list[ReferencePrimitiveDTO]:
    """
    Optional YOLO extractor.

    Requires:
        pip install ultralytics opencv-python

    This adapter converts YOLO detections into ReferencePrimitiveDTO boxes.
    A plain detection model returns boxes only, so verifiers that rely on
    wrist keypoints need a pose model to supply them.
    """
    try:
        from ultralytics import YOLO
    except ImportError as exc:
        raise RuntimeError(
            "Ultralytics is not installed. Run: pip install ultralytics opencv-python"
        ) from exc

    model = YOLO("yolo11n.pt")
    results = model(image_path)
    result = results[0]
    # orig_shape is (height, width) of the source image.
    height, width = result.orig_shape
    names = result.names

    primitives: list[ReferencePrimitiveDTO] = []
    if result.boxes is None:
        return primitives

    for idx, box in enumerate(result.boxes):
        cls_id = int(box.cls[0])
        label = names[cls_id]
        confidence = float(box.conf[0])
        raw_box = [float(v) for v in box.xyxy[0].tolist()]
        # Pixel coordinates are mapped onto the shared 0-999 grid.
        norm_box = normalize_box_0_999(raw_box, width=width, height=height)
        primitives.append(
            ReferencePrimitiveDTO(
                primitive_id=f"{label}_{idx + 1}",
                primitive_type="box",
                label=label,
                coordinates=norm_box,
                confidence=round(confidence, 3),
                source="yolo",
                meta={
                    "raw_box_xyxy": raw_box,
                    "image_width": width,
                    "image_height": height,
                },
            )
        )
    return primitives

def evaluation_row(
    case_id: int,
    task: str,
    input_type: str,
    primitives: list[ReferencePrimitiveDTO],
    expected_relation: str,
    what_it_demonstrates: str,
) -> dict[str, str | int]:
    # Same steps as run_pipeline, but returns a table row instead of printing.
    context_filter = ContextFilter()
    relation_builder = RelationBuilder()
    verifier = Verifier()

    active_set = context_filter.select(task, primitives)
    relations = relation_builder.build(active_set)
    result = verifier.verify(task, active_set, relations)

    observed = ", ".join(
        f"{r.relation_type}({r.subject_primitive_id}, {r.object_primitive_id})"
        for r in result.observed_relations
    ) or "none"

    return {
        "Case": case_id,
        "Task": task,
        "Input type": input_type,
        "Candidate primitives": len(primitives),
        "Active primitives": len(active_set.selected),
        "Expected relation": expected_relation,
        "Observed relation": observed,
        "Decision": result.decision,
        "Repair generated": result.repair_instruction or "None",
        "What this demonstrates": what_it_demonstrates,
    }


def markdown_table(rows: list[dict[str, str | int]]) -> str:
    # An empty input has no header row to derive columns from.
    if not rows:
        return ""
    headers = list(rows[0].keys())
    lines = []
    lines.append("| " + " | ".join(headers) + " |")
    lines.append("| " + " | ".join(["---"] * len(headers)) + " |")
    for row in rows:
        # Newlines and pipes would break the table layout, so flatten and escape them.
        values = [str(row[h]).replace("\n", " ").replace("|", "\\|") for h in headers]
        lines.append("| " + " | ".join(values) + " |")
    return "\n".join(lines)

def run_evaluation_table() -> None:
    rows = []

    # Case 1: Holding book failure
    rows.append(
        evaluation_row(
            case_id=1,
            task="Is the person holding the book?",
            input_type="Image fixture",
            primitives=[
                ReferencePrimitiveDTO(
                    primitive_id="person_1",
                    primitive_type="box",
                    label="person",
                    coordinates=[100, 80, 760, 960],
                    confidence=0.96,
                ),
                ReferencePrimitiveDTO(
                    primitive_id="book_1",
                    primitive_type="box",
                    label="book",
                    coordinates=[450, 610, 650, 750],
                    confidence=0.92,
                ),
                ReferencePrimitiveDTO(
                    primitive_id="left_wrist",
                    primitive_type="keypoint",
                    label="left_wrist",
                    coordinates=[370, 790],
                    confidence=0.88,
                    parent_primitive_id="person_1",
                ),
                ReferencePrimitiveDTO(
                    primitive_id="right_wrist",
                    primitive_type="keypoint",
                    label="right_wrist",
                    coordinates=[720, 790],
                    confidence=0.84,
                    parent_primitive_id="person_1",
                ),
                ReferencePrimitiveDTO(
                    primitive_id="desk_1",
                    primitive_type="box",
                    label="desk",
                    coordinates=[50, 760, 900, 980],
                    confidence=0.90,
                ),
            ],
            expected_relation="near/touching(wrists, book)",
            what_it_demonstrates="Image critique can become grounded repair.",
        )
    )

    # Case 2: Cup transfer success
    rows.append(
        evaluation_row(
            case_id=2,
            task="Does person_1 hand the cup to person_2?",
            input_type="Video fixture",
            primitives=[
                ReferencePrimitiveDTO(
                    primitive_id="cup_f10",
                    primitive_type="box",
                    label="cup",
                    coordinates=[190, 230, 215, 260],
                    confidence=0.90,
                    frame_index=10,
                ),
                ReferencePrimitiveDTO(
                    primitive_id="person_1_right_wrist_f10",
                    primitive_type="keypoint",
                    label="right_wrist",
                    coordinates=[185, 240],
                    confidence=0.88,
                    parent_primitive_id="person_1",
                    frame_index=10,
                ),
                ReferencePrimitiveDTO(
                    primitive_id="person_2_left_wrist_f10",
                    primitive_type="keypoint",
                    label="left_wrist",
                    coordinates=[320, 245],
                    confidence=0.86,
                    parent_primitive_id="person_2",
                    frame_index=10,
                ),
                ReferencePrimitiveDTO(
                    primitive_id="cup_f40",
                    primitive_type="box",
                    label="cup",
                    coordinates=[300, 232, 326, 262],
                    confidence=0.89,
                    frame_index=40,
                ),
                ReferencePrimitiveDTO(
                    primitive_id="person_1_right_wrist_f40",
                    primitive_type="keypoint",
                    label="right_wrist",
                    coordinates=[210, 245],
                    confidence=0.83,
                    parent_primitive_id="person_1",
                    frame_index=40,
                ),
                ReferencePrimitiveDTO(
                    primitive_id="person_2_left_wrist_f40",
                    primitive_type="keypoint",
                    label="left_wrist",
                    coordinates=[312, 244],
                    confidence=0.87,
                    parent_primitive_id="person_2",
                    frame_index=40,
                ),
            ],
            expected_relation="transferred_to(cup, person_2)",
            what_it_demonstrates="Temporal primitives can support event verification.",
        )
    )

    # Case 3: Chip placement failure
    rows.append(
        evaluation_row(
            case_id=3,
            task="Verify that chip_1 has been placed correctly on the circuit board socket.",
            input_type="Placement fixture",
            primitives=[
                ReferencePrimitiveDTO(
                    primitive_id="chip_1",
                    primitive_type="box",
                    label="chip",
                    coordinates=[420, 420, 520, 520],
                    confidence=0.93,
                ),
                ReferencePrimitiveDTO(
                    primitive_id="socket_region_1",
                    primitive_type="box",
                    label="socket",
                    coordinates=[300, 300, 400, 400],
                    confidence=0.91,
                ),
                ReferencePrimitiveDTO(
                    primitive_id="board_1",
                    primitive_type="box",
                    label="circuit_board",
                    coordinates=[100, 100, 800, 800],
                    confidence=0.96,
                ),
            ],
            expected_relation="inside(chip, socket_region)",
            what_it_demonstrates="Primitive reasoning works for non-human placement tasks.",
        )
    )

    print(markdown_table(rows))
# ============================================================
# 8. CLI
# ============================================================
def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--eval-table",
        action="store_true",
        help="Print a markdown evaluation table for the fixture demos.",
    )
    parser.add_argument("--image", type=str, default=None, help="Optional image path for YOLO extraction.")
    parser.add_argument(
        "--task",
        type=str,
        default="Is the person holding the book?",
        help="Task to verify for optional YOLO demo.",
    )
    args = parser.parse_args()

    # --image takes precedence over --eval-table when both are given.
    if args.image:
        print("\n=== Optional YOLO extraction demo ===")
        primitives = extract_yolo_primitives(args.image)
        run_pipeline(args.task, primitives)
        return

    if args.eval_table:
        run_evaluation_table()
        return

    demo_image_holding_book()
    demo_video_object_transfer_success()
    demo_chip_placement_failure()


if __name__ == "__main__":
    main()
Expected output
Running:
python reference_reasoning_demo.py
will produce three demonstrations.
Demo 1: generated image critique
The system checks whether both wrists are near the book box. In the fixture data, the wrists are too far away, so the verifier returns:
Decision:
FAIL confidence=0.75
Repair:
Move both hands so they visibly touch or hold the book.
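To make the Demo 1 failure concrete, the sketch below measures how far each fixture wrist sits from the book box. The helper `point_to_box_distance` and the value `NEAR_THRESHOLD = 50` are assumptions made for illustration; the demo's own `is_point_near_box` may use a different rule or threshold.

```python
import math

# Hypothetical threshold on the normalized 0-999 grid; the demo's real value may differ.
NEAR_THRESHOLD = 50.0


def point_to_box_distance(point, box):
    """Euclidean distance from (x, y) to an [x1, y1, x2, y2] box; 0 if the point is inside."""
    x, y = point
    x1, y1, x2, y2 = box
    dx = max(x1 - x, 0, x - x2)  # horizontal gap to the box, 0 when within its x-range
    dy = max(y1 - y, 0, y - y2)  # vertical gap to the box, 0 when within its y-range
    return math.hypot(dx, dy)


# Coordinates copied from the Demo 1 fixture.
book = [450, 610, 650, 750]
wrists = {"left_wrist": [370, 790], "right_wrist": [720, 790]}

distances = {name: round(point_to_box_distance(p, book), 1) for name, p in wrists.items()}
near = {name: d <= NEAR_THRESHOLD for name, d in distances.items()}

print(distances)
print(near)
```

Under that assumed threshold, both wrists sit roughly 80 to 90 grid units from the book, which is consistent with the FAIL decision and with a repair that moves the hands rather than the book.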
Demo 2: video object transfer
The system checks whether the cup moves from one person’s hand region to another person’s hand region across frames. In the fixture data, the transfer relation is detected:
Decision:
PASS confidence=0.78
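The temporal check in Demo 2 can be sketched in a few lines. This is a simplified variant, not the demo's detector: it picks the nearest wrist within an assumed threshold in the first and last frame and compares the owners, whereas the demo enumerates every near wrist pair. The names `nearest_owner` and `point_to_box_distance`, and the threshold of 50, are hypothetical.

```python
import math


def point_to_box_distance(point, box):
    """Euclidean distance from (x, y) to an [x1, y1, x2, y2] box; 0 if the point is inside."""
    x, y = point
    x1, y1, x2, y2 = box
    return math.hypot(max(x1 - x, 0, x - x2), max(y1 - y, 0, y - y2))


def nearest_owner(box, wrists, threshold=50.0):
    """Return the owner of the closest wrist within `threshold`, or None if none is close."""
    best_owner, best_dist = None, threshold
    for owner, point in wrists.items():
        dist = point_to_box_distance(point, box)
        if dist <= best_dist:
            best_owner, best_dist = owner, dist
    return best_owner


# Cup boxes and wrist keypoints copied from the Demo 2 fixture.
frames = {
    10: {"cup": [190, 230, 215, 260],
         "wrists": {"person_1": [185, 240], "person_2": [320, 245]}},
    40: {"cup": [300, 232, 326, 262],
         "wrists": {"person_1": [210, 245], "person_2": [312, 244]}},
}

early, late = min(frames), max(frames)
giver = nearest_owner(frames[early]["cup"], frames[early]["wrists"])
receiver = nearest_owner(frames[late]["cup"], frames[late]["wrists"])

# A transfer requires two distinct frames and two distinct owners.
transferred = early != late and giver is not None and receiver is not None and giver != receiver
print(giver, receiver, transferred)
```

Comparing owners rather than raw positions is what separates a handoff from one person simply moving the cup.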
Demo 3: chip placement
The system checks whether the chip box is inside the socket region. In the fixture data, the chip is offset from the socket, so the verifier returns:
Decision:
FAIL confidence=0.8
Repair:
Reposition the chip so its bounding box sits inside the socket region and its pins align with the board contacts.
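The containment test behind Demo 3 is small enough to show directly. The sketch below also derives a numeric offset that would center the chip on the socket; turning the repair instruction into an offset is an assumption of this example, not something the demo does, and `box_inside`, `center`, and `shift_box` are hypothetical helper names.

```python
def box_inside(inner, outer):
    """True when the [x1, y1, x2, y2] box `inner` lies fully within `outer`."""
    ix1, iy1, ix2, iy2 = inner
    ox1, oy1, ox2, oy2 = outer
    return ix1 >= ox1 and iy1 >= oy1 and ix2 <= ox2 and iy2 <= oy2


def center(box):
    """Center point of an [x1, y1, x2, y2] box."""
    return ((box[0] + box[2]) / 2, (box[1] + box[3]) / 2)


def shift_box(box, dx, dy):
    """Translate a box by (dx, dy)."""
    return [box[0] + dx, box[1] + dy, box[2] + dx, box[3] + dy]


# Coordinates copied from the Demo 3 fixture.
chip = [420, 420, 520, 520]
socket = [300, 300, 400, 400]

inside_before = box_inside(chip, socket)

# Offset that moves the chip center onto the socket center.
(cx, cy), (sx, sy) = center(chip), center(socket)
shift = (sx - cx, sy - cy)
inside_after = box_inside(shift_box(chip, *shift), socket)

print(inside_before, shift, inside_after)
```

A box-level check like this says nothing about pin alignment, which would need finer primitives such as pin keypoints or masks.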
Demo 4: evaluation table
Running:
python reference_reasoning_demo.py --eval-table
will print the following markdown table:
| Case | Task | Input type | Candidate primitives | Active primitives | Expected relation | Observed relation | Decision | Repair generated | What this demonstrates |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Is the person holding the book? | Image fixture | 5 | 4 | near/touching(wrists, book) | none | FAIL | Move both hands so they visibly touch or hold the book. | Image critique can become grounded repair. |
| 2 | Does person_1 hand the cup to person_2? | Video fixture | 6 | 6 | transferred_to(cup, person_2) | transferred_to(cup_f40, person_2_left_wrist_f40) | PASS | None | Temporal primitives can support event verification. |
| 3 | Verify that chip_1 has been placed correctly on the circuit board socket. | Placement fixture | 3 | 3 | inside(chip, socket_region) | none | FAIL | Reposition the chip so its bounding box sits inside the socket region and its pins align with the board contacts. | Primitive reasoning works for non-human placement tasks. |
What this demonstrates
This small demo is not trying to solve all visual reasoning.
It demonstrates the architecture:
candidate primitives
↓
context filter
↓
active primitive set
↓
relation builder
↓
verifier
↓
repair instruction
The same pattern supports:
- image prompt verification,
- human/object interaction detection,
- chip or part placement verification,
- generated video event checking,
- movie scene repair.
The important point is that the system does not reason over the full scene. It reasons over the active primitive set selected for the current task.
That is the practical version of the post’s central claim:
Reasoning is context-filtered primitive selection followed by relation verification.
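A minimal illustration of the filtering step, assuming a hand-written keyword table. The `Primitive` class, `RELEVANT_LABELS`, and `select_active` are hypothetical stand-ins for this sketch and are much simpler than the demo's `ContextFilter`. The result is consistent with the first evaluation row (5 candidates, 4 active), on the assumption that the desk is the primitive that gets dropped.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Primitive:
    primitive_id: str
    label: str


# Hypothetical mapping from task keywords to the labels worth keeping.
RELEVANT_LABELS = {
    "holding": {"person", "book", "left_wrist", "right_wrist"},
    "placed": {"chip", "socket", "circuit_board"},
}


def select_active(task, candidates):
    """Keep only candidates whose label is relevant to a keyword found in the task."""
    task_l = task.lower()
    wanted = set()
    for keyword, labels in RELEVANT_LABELS.items():
        if keyword in task_l:
            wanted |= labels
    return [p for p in candidates if p.label in wanted]


candidates = [
    Primitive("person_1", "person"),
    Primitive("book_1", "book"),
    Primitive("left_wrist", "left_wrist"),
    Primitive("right_wrist", "right_wrist"),
    Primitive("desk_1", "desk"),  # detected correctly, but irrelevant to the question
]

active = select_active("Is the person holding the book?", candidates)
active_ids = [p.primitive_id for p in active]
print(active_ids)
```

The desk is a valid detection; it is excluded because the task does not ask about it, not because the detector was wrong.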
Glossary
| Term | Meaning in this post |
|---|---|
| Primitive | A small, task-relevant unit of representation, such as an object, point, region, span, symbol, trajectory, or relation. |
| Visual Primitive | A primitive grounded in visual space, usually a box, point, polyline, mask, or trajectory. |
| Reference Primitive | A machine-verifiable pointer used inside a reasoning trace. Examples include image boxes, body keypoints, document spans, code symbols, UI component regions, and video trajectories. |
| Candidate Primitive | Any primitive extracted from the input before relevance filtering. A detector may produce many candidate primitives, most of which may not matter for the task. |
| Active Primitive Set | The context-filtered subset of primitives selected as relevant to the current task. This is the main object the reasoning layer operates over. |
| Context Filter | The layer that selects which candidate primitives matter for the current question, prompt, scene action, or verification task. |
| Primitive Relation | A checkable relationship between primitives, such as near, touching, inside, aligned_with, holding, moving_toward, or transferred_to. |
| Relation Verification | The process of checking whether the expected relation between selected primitives actually exists. |
| Perception Gap | The problem of whether a model can see enough detail in an image, video, or multimodal input. |
| Reference Gap | The problem identified by Thinking with Visual Primitives: language is often too vague to precisely anchor reasoning to the correct visual entity. |
| Relevance Gap | The extension proposed in this post: even if a system can point to many entities, it still needs to select which references matter for the current task. |
| Point While Reasoning | The idea that a model should include explicit references, such as boxes and points, inside its reasoning process instead of relying only on vague language. |
| Task-Conditioned Compression | The idea that intelligence activates only the primitives relevant to the current context, rather than reconstructing or processing everything. |
| Context-Filtered Primitive Selection | The central claim of the post: reliable reasoning depends on selecting the right primitives for the task before reasoning over relations. |
| Primitive Reasoning Layer | The system layer that receives primitives, filters them by context, builds relations, verifies requirements, and produces decisions or repairs. |
| Grounded Critique | A critique that points to specific primitives and failed relations, rather than offering a vague judgment. |
| Verifier | A deterministic or model-assisted component that checks whether required primitive relations are satisfied. |
| Repair Instruction | A targeted instruction generated from a failed verification, such as “move the left hand onto the book” or “regenerate frames 32–52 with a visible envelope transfer.” |
| YOLO | An object-detection model family used in the post as a practical way to extract candidate visual primitives such as object boxes. |
| Pose Model | A model that detects body keypoints such as wrists, elbows, shoulders, nose, or head position. |
| Tracker | A system such as ByteTrack or BoT-SORT that preserves object identity across video frames. |
| Open-Vocabulary Detector | A detector that can identify objects based on text prompts or broader vocabularies, useful for domain-specific primitives. |
| Segmentation Model | A model that extracts precise object or region masks rather than simple boxes. |
| Temporal Primitive | A primitive that exists across time, such as an object track, hand trajectory, or region movement across frames. |
| Event Primitive | A higher-level primitive inferred from temporal relations, such as handoff(envelope, Alice, Maria) or place(chip, socket_region). |
| Generated Image Critique | The application of primitive reasoning to verify whether an AI-generated image satisfies a prompt. |
| Entity Interaction Detection | The application of primitive reasoning to determine whether entities are in the required relationship, such as handoff, placement, insertion, support, or connection. |
| Movie Generation Verification | The application of primitive reasoning to check whether generated video frames actually fulfill required scene actions over time. |
| Bounded Hallucination | The idea that primitive reasoning can reduce unsupported claims by constraining reasoning to declared primitives, while still allowing for detector errors or relation mistakes. |
| Inspectability | The property that a reasoning trace can be examined because it points to explicit primitives, relations, thresholds, and confidence values. |
| Active Reference | A reference primitive selected by the context filter as relevant to the current reasoning task. |
| Candidate Reference Overload | The failure mode where a system extracts many correct primitives but becomes confused because it reasons over too many irrelevant references. |
| Extract → Filter → Relate → Verify → Repair | The core pipeline proposed in the post. |
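The Temporal Primitive entry can be made concrete with a short sketch that groups per-frame boxes into a trajectory. The `track_id` field is an assumption: in practice it would come from a tracker such as ByteTrack or BoT-SORT, while here it is written by hand. `build_trajectories` and `displacement` are hypothetical helper names.

```python
from collections import defaultdict

# Per-frame detections; boxes are copied from the Demo 2 cup fixture.
detections = [
    {"track_id": "cup", "frame": 40, "box": [300, 232, 326, 262]},
    {"track_id": "cup", "frame": 10, "box": [190, 230, 215, 260]},
]


def build_trajectories(detections):
    """Group detections by track id into frame-ordered lists of (frame, box center)."""
    tracks = defaultdict(list)
    for det in detections:
        x1, y1, x2, y2 = det["box"]
        tracks[det["track_id"]].append((det["frame"], ((x1 + x2) / 2, (y1 + y2) / 2)))
    return {track_id: sorted(points) for track_id, points in tracks.items()}


def displacement(trajectory):
    """Net (dx, dy) movement between the first and last point of a trajectory."""
    (_, (x0, y0)), (_, (x1, y1)) = trajectory[0], trajectory[-1]
    return (x1 - x0, y1 - y0)


trajectories = build_trajectories(detections)
cup_motion = displacement(trajectories["cup"])
print(trajectories["cup"])
print(cup_motion)
```

An event primitive such as a handoff can then be inferred from a trajectory plus the hand regions it starts and ends in.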
References and Further Reading
| Area | Reference | Why it matters for this post |
|---|---|---|
| Core paper | Lu et al., “Thinking with Visual Primitives” | The central inspiration for this post. It introduces the idea of using points and bounding boxes as “minimal units of thought,” allowing multimodal models to point while reasoning rather than relying only on vague natural-language references. The original DeepSeek repository appears to have been removed or mirrored, so source availability may vary. |
| YOLO / object detection | Ultralytics YOLO documentation and licensing | YOLO is used in the post as a practical primitive extractor for object boxes. The licensing matters because the common Ultralytics stack uses AGPL-3.0 by default, with an Enterprise option for proprietary use. |
| Object tracking | Zhang et al., “ByteTrack: Multi-Object Tracking by Associating Every Detection Box” (arXiv:2110.06864) | ByteTrack is useful for preserving object identity across frames, which is essential for temporal primitives such as object trajectories, handoffs, and movie-action verification. |
| Object tracking | Aharon et al., “BoT-SORT: Robust Associations Multi-Pedestrian Tracking” (arXiv:2206.14651) | BoT-SORT is another practical tracking option, combining motion and appearance information for stronger multi-object tracking. |
| Open-vocabulary detection | Liu et al., “Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection” (arXiv:2303.05499) | Grounding DINO is relevant when YOLO’s fixed classes are not enough, especially for domain-specific primitives such as “bed rail,” “chip socket,” “pin row,” or “neural interface.” |
| Segmentation / masks | Kirillov et al., “Segment Anything” (arXiv:2304.02643) | SAM is useful when boxes are too coarse and the reasoning system needs precise masks or regions instead of rectangular boxes. |
| Hand / pose keypoints | Zhang et al., “MediaPipe Hands: On-device Real-time Hand Tracking” (arXiv:2006.10214) | Hand landmarks are important for verifying relations such as holding, touching, handoff, placement, grasping, and object contact. |
| Community context | Discussion of “Thinking with Visual Primitives” on LocalLLaMA (Reddit) | Useful for understanding how the community interpreted the paper, especially the distinction between outputting boxes as final answers and interleaving visual primitives inside the reasoning trace. It is informal context, not a primary citation. |
Suggested “Further Reading” Notes
| Topic | Suggested direction |
|---|---|
| Reference Gap | Read Thinking with Visual Primitives first. The key idea is that language is often too imprecise to anchor visual reasoning, so the model needs coordinate-level handles. |
| Relevance Gap | This post’s proposed extension: once a system can point, it still needs to select which references matter for the current task. |
| Primitive extraction | YOLO, Grounding DINO, MediaPipe, and SAM can all serve as primitive extractors, depending on whether you need boxes, open-vocabulary objects, hand landmarks, or masks. |
| Temporal primitives | ByteTrack and BoT-SORT are useful starting points for preserving object identity across frames. |
| Verification loops | The most practical implementation path is not to train a new model first, but to build a pipeline: extract → filter → relate → verify → repair. |
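Because a plain detector returns boxes only, wrist keypoints have to come from a pose model. The sketch below shows one way to turn COCO-17 pose keypoints into wrist primitives. It uses plain dictionaries instead of the demo's `ReferencePrimitiveDTO`, the scaling onto a 0-999 grid is an assumption about how the demo normalizes coordinates, and `wrists_to_primitives` and `extract_wrists` are hypothetical names. The Ultralytics call assumes the `yolo11n-pose.pt` weights and is not exercised here.

```python
# COCO-17 keypoint indices for the wrists.
COCO_WRISTS = {9: "left_wrist", 10: "right_wrist"}


def wrists_to_primitives(person_id, keypoints_xy, keypoints_conf, width, height):
    """Convert one person's COCO-17 pixel keypoints into normalized wrist primitives."""
    primitives = []
    for idx, label in COCO_WRISTS.items():
        x, y = keypoints_xy[idx]
        primitives.append({
            "primitive_id": f"{person_id}_{label}",
            "primitive_type": "keypoint",
            "label": label,
            # Assumed normalization: scale pixels onto a 0-999 grid.
            "coordinates": [round(x / width * 999), round(y / height * 999)],
            "confidence": round(float(keypoints_conf[idx]), 3),
            "parent_primitive_id": person_id,
        })
    return primitives


def extract_wrists(image_path):
    """Run an Ultralytics pose model and return wrist primitives for each detected person."""
    from ultralytics import YOLO  # lazy import; requires `pip install ultralytics`

    result = YOLO("yolo11n-pose.pt")(image_path)[0]
    height, width = result.orig_shape
    keypoints = result.keypoints
    if keypoints is None or keypoints.conf is None:
        return []
    primitives = []
    for i, (xy, conf) in enumerate(zip(keypoints.xy.tolist(), keypoints.conf.tolist())):
        primitives.extend(wrists_to_primitives(f"person_{i + 1}", xy, conf, width, height))
    return primitives
```

Setting `parent_primitive_id` on each wrist is what lets a transfer check tell two people's hands apart.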
BibTeX-style draft references
@misc{lu2026thinkingvisualprimitives,
title = {Thinking with Visual Primitives},
author = {Lu, Ruijie and Ma, Yiyang and Chen, Xiaokang and others},
year = {2026},
note = {Technical report; source availability may vary due to repository removal/mirroring}
}
@article{zhang2021bytetrack,
title = {ByteTrack: Multi-Object Tracking by Associating Every Detection Box},
author = {Zhang, Yifu and Sun, Peize and Jiang, Yi and others},
year = {2021},
journal = {arXiv preprint arXiv:2110.06864}
}
@article{aharon2022botsort,
title = {BoT-SORT: Robust Associations Multi-Pedestrian Tracking},
author = {Aharon, Nir and Orfaig, Roy and Bobrovsky, Ben-Zion},
year = {2022},
journal = {arXiv preprint arXiv:2206.14651}
}
@article{liu2023groundingdino,
title = {Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection},
author = {Liu, Shilong and Zeng, Zhaoyang and Ren, Tianhe and others},
year = {2023},
journal = {arXiv preprint arXiv:2303.05499}
}
@article{kirillov2023segmentanything,
title = {Segment Anything},
author = {Kirillov, Alexander and Mintun, Eric and Ravi, Nikhila and others},
year = {2023},
journal = {arXiv preprint arXiv:2304.02643}
}
@article{zhang2020mediapipehands,
title = {MediaPipe Hands: On-device Real-time Hand Tracking},
author = {Zhang, Fan and Bazarevsky, Valentin and Vakunov, Andrey and others},
year = {2020},
journal = {arXiv preprint arXiv:2006.10214}
}