Codex Manager: Building a Prompt-State Runtime for Hackathon-Grade Code Optimization


TL;DR

Codex Manager uses AI to generate code as an artifact, then tests that artifact, diagnoses what happened, and repairs the prompt state that produced it. The code is not the thing being optimized directly. The prompt state is.

Summary

Humanity’s Last Hackathon framed the challenge as a test of context, not code: the task was hard enough that the real question was not whether someone could hand-write one clever kernel, but whether they could build a system that used AI effectively under changing constraints.

Codex Manager is our attempt at that idea: a prompt-state runtime that manages Codex through candidate generation, isolated execution, diagnosis, repair, and verified promotion.

What Is Codex Manager?

Codex Manager starts from one simple claim:

Codex Manager optimizes PromptState, not code. Candidate code is evidence.

Instead of treating generated code as the answer, the system treats it as a proposal that must survive a controlled loop.

Codex proposes a candidate diff. The manager applies it in an isolated workspace. Build, correctness, and benchmark gates decide whether the candidate is real. Failed attempts become diagnoses. Diagnoses become prompt deltas. Prompt deltas update the next PromptState. Only candidates that pass correctness and improve the benchmark are allowed to survive.

That shift from optimizing code directly to optimizing the state around code generation shaped the entire architecture.

Over the course of the build, Codex Manager grew from a small prompt-repair loop into a complete hackathon-style runtime:

  • task packs
  • shadow execution
  • command benchmarks
  • candidate executor registries
  • reproducible run bundles
  • submission packaging
  • platform exports
  • a full pipeline orchestrator
  • a CLI that runs the workflow from the terminal

The result is not just a tool for one benchmark. It is a pattern for agentic coding systems:

    flowchart LR
    %% Start of the loop
    PS["🧠 PromptState<br/>(task, context, lessons, warnings, banned)"]
    
    %% Codex generation
    PS -->|"generate()"| C["🤖 Codex<br/>(candidate generator)"]
    C -->|"unified diff"| AR["🔬 AttemptResult<br/>(applied, compiled, correct, speedup)"]
    
    %% Evidence gates
    AR -->|"fails gate"| D["🩺 AttemptDiagnosis<br/>(what failed & why)"]
    AR -->|"passes & improves"| PROMO["✅ Promote Candidate<br/>(verified improvement)"]
    
    %% Repair & feedback
    D -->|"diagnosis → lesson"| PD["🔄 PromptDelta<br/>(new constraints, lessons, banned moves)"]
    PD -->|"apply delta"| PS
    
    %% Styling
    classDef state fill:#e0f0ff,stroke:#3a6ea5,stroke-width:2px,color:#1a2b3c
    classDef evidence fill:#fff7e0,stroke:#d9a34a,stroke-width:2px,color:#4a3a1a
    classDef gate fill:#e6ffe6,stroke:#4a9d4a,stroke-width:2px,color:#1a3c1a
    classDef promote fill:#e6ffe6,stroke:#4a9d4a,stroke-width:3px,color:#0a3d0a,font-weight:bold
    
    class PS state
    class C evidence
    class AR evidence
    class D gate
    class PROMO promote
    class PD state
  

The model generates. The manager evaluates. The prompt state evolves. The artifacts prove what happened.

This post walks through that system from the first design decision to the final command-line pipeline, using a concrete vector_add example to show how the pieces fit together.


The Core Idea: PromptState, Not Code

Once we stopped treating the hackathon as a code-writing contest, the design became much clearer.

The thing to optimize was not the kernel.

The thing to optimize was the state around Codex.

We called that state PromptState.

PromptState is the evolving memory of the run: the constraints, lessons, warnings, and contracts that shape the next candidate.

Most code-generation systems treat the prompt as temporary. A prompt is assembled, sent to the model, and discarded. If the result fails, the next prompt is usually improvised: “try again,” “fix the bug,” “make it faster,” “preserve correctness this time.”

That works for small interactions. It does not work as an engineering loop.

A serious manager needs a structured object that says:

Here is what the model currently knows.
Here are the mistakes it has already made.
Here are the moves it is no longer allowed to make.
Here are the patterns it should preserve.
Here is the exact output contract it must obey.

Codex Manager makes that explicit.

A PromptState is not just a prompt string. It is the working memory of the run.

from pydantic import BaseModel, Field

class PromptStateDTO(BaseModel):
    run_id: str
    task_id: str
    attempt_id: str

    context_pack: str
    system_prompt: str
    user_prompt: str

    prior_lessons: list[str] = Field(default_factory=list)
    failure_warnings: list[str] = Field(default_factory=list)
    success_patterns: list[str] = Field(default_factory=list)
    banned_moves: list[str] = Field(default_factory=list)

    output_contract: str

This is the real optimization surface.
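
A PromptState only matters once it is flattened into the text Codex actually sees. Here is a minimal sketch of that rendering step; the helper name and exact formatting are illustrative, not the project's actual implementation:

def render_prompt(state: PromptStateDTO) -> str:
    # Illustrative only: flatten the structured state into prompt text.
    sections = [state.system_prompt, state.context_pack, state.user_prompt]
    if state.prior_lessons:
        sections.append("Lessons from earlier attempts:\n" + "\n".join(f"- {lesson}" for lesson in state.prior_lessons))
    if state.failure_warnings:
        sections.append("Warnings:\n" + "\n".join(f"- {warning}" for warning in state.failure_warnings))
    if state.banned_moves:
        sections.append("Banned moves:\n" + "\n".join(f"- {move}" for move in state.banned_moves))
    if state.success_patterns:
        sections.append("Preserve these patterns:\n" + "\n".join(f"- {pattern}" for pattern in state.success_patterns))
    sections.append("Output contract:\n" + state.output_contract)
    return "\n\n".join(sections)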

The code changes every attempt, but the state accumulates.

A candidate might fail because it returned prose instead of a diff. Another might apply cleanly but break correctness. Another might pass correctness and regress performance. Each of those outcomes teaches the manager something different. The next prompt should not merely be louder or longer. It should be more informed.

That is what PromptState gives us.

The loop becomes:

PromptState
→ candidate generated by Codex
→ candidate evaluated as evidence
→ failure or success diagnosed
→ prompt delta produced
→ next PromptState

Codex is still doing valuable work. It proposes code. But the manager decides how the next proposal should be shaped.

That is the central inversion:

Codex generates candidates. Codex Manager evolves the conditions under which candidates are generated.

The candidate code is no longer treated as the final answer. It is treated as an experiment.

Did it apply cleanly?

Did it build?

Did it preserve correctness?

Did it improve the benchmark?

Did it violate the output contract?

Did it touch files it was not allowed to touch?

The answers become structured evidence. That evidence becomes a diagnosis. The diagnosis becomes a prompt delta. The prompt delta becomes the next PromptState.

For example, if a candidate weakens boundary behavior, the manager does not simply say:

Try again.

It updates the next prompt with something specific:

Previous attempt likely failed because boundary semantics were weakened.
Preserve bounds checks and mismatched-length behavior.
Do not assume aligned input sizes.
Return one valid unified diff against the target file only.

If a candidate passes correctness but slows the benchmark, the next prompt changes differently:

The previous candidate was correct but slower.
Avoid extra branching, allocations, sleeps, or unnecessary memory traffic.
Target the measured hotspot directly.
Preserve the successful correctness structure.

This gives the run a kind of external learning.

The model’s weights do not change. Codex does not become smarter inside the session. But the context around Codex becomes more precise, more constrained, and more informed by evidence.

That is why the distinction matters.

A normal agent loop says:

generate code
test code
retry

Codex Manager says:

generate candidate
test candidate
diagnose evidence
repair PromptState
generate under improved constraints

The benchmark still matters. The code still matters. But the thing being improved across attempts is the state that shapes the next candidate.

Below is the full prompt-state lifecycle:

    flowchart TD
    A["🗂️ Task Pack<br/>goal, source, tests, benchmark, constraints"] --> B["🧠 PromptState<br/>context, lessons, warnings, banned moves, output contract"]

    B --> C["🤖 Codex<br/>candidate generator"]
    C --> D["📝 Candidate Diff<br/>proposed code change"]

    D --> E["🛡️ Isolated Workspace<br/>apply patch safely"]
    E --> F{"🧪 Build + Correctness<br/>passes?"}

    F -- No --> G["🔬 AttemptResult<br/>failure evidence"]
    F -- Yes --> H{"📊 Benchmark<br/>improves?"}

    H -- No --> G
    H -- Yes --> I["✅ Promote Candidate<br/>verified improvement"]

    G --> J["🩺 AttemptDiagnosis<br/>what failed and why"]
    J --> K["🔄 PromptDelta<br/>new constraints, lessons, banned moves"]
    K --> B

    I --> L["🗃️ Run Artifacts<br/>PROMPTS.log, reports, bundle, submission"]

    classDef config fill:#f0f0ff,stroke:#6a6a9a,stroke-width:2px,color:#1a1a3c
    classDef state fill:#e0f0ff,stroke:#3a6ea5,stroke-width:2px,color:#1a2b3c
    classDef evidence fill:#fff7e0,stroke:#d9a34a,stroke-width:2px,color:#4a3a1a
    classDef gate fill:#e6ffe6,stroke:#4a9d4a,stroke-width:2px,color:#1a3c1a

    class A,L config
    class B,K state
    class C,D,G,J evidence
    class E,F,H,I gate
  

Codex does not learn inside the run.

The manager learns externally by turning execution evidence into prompt-state updates.

That is the core idea. The model generates. The harness verifies. The manager updates the state. The artifacts prove the path.


The Loop: From Candidate to Evidence

Once PromptState became the thing we were optimizing, the next question was obvious:

What counts as evidence?

A generated candidate is not evidence by itself. It is only a proposal.

Codex can return something that looks plausible, follows the shape of the prompt, and even compiles in your head. That does not mean it is correct. It does not mean it is safe. It does not mean it is faster. It does not even mean it is a valid patch.

So the manager has to turn every proposal into a structured result.

That is the job of AttemptResult.

In Codex Manager, an attempt is not “whatever the model said.” An attempt is what remains after the candidate has been tested by the system.

A simplified version looks like this:

from pydantic import BaseModel, Field

class AttemptResultDTO(BaseModel):
    run_id: str
    task_id: str
    attempt_id: str
    prompt_hash: str

    candidate_text: str = ""
    patch_text: str = ""

    applied: bool = False
    compiled: bool = False
    correctness_passed: bool = False
    benchmark_passed: bool = False

    baseline_ms: float | None = None
    median_ms: float | None = None
    speedup: float | None = None

    failure_reason: str | None = None
    raw_test_output: str = ""
    raw_benchmark_output: str = ""

    metadata: dict = Field(default_factory=dict)

This object is how the system prevents hallucinated promotion.

A candidate is not “good” because it sounds good. It is not good because the model says it is optimized. It is not good because the diff looks clever.

It is good only if the evidence says it survived the gates.

An attempt begins when Codex returns a candidate, usually as a unified diff. The manager does not apply that diff to the real source tree. It creates an isolated workspace, copies the task files into it, and applies the patch there.
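
A minimal sketch of that workspace step, using tempfile and shutil; create_isolated_workspace also appears in the pseudo-code appendix, but this version is illustrative rather than the project's actual implementation:

import shutil
import tempfile
from pathlib import Path

def create_isolated_workspace(source_dir: str) -> Path:
    # Copy the task's source tree into a throwaway directory so the real tree is never mutated.
    workspace = Path(tempfile.mkdtemp(prefix="codex_manager_"))
    shutil.copytree(source_dir, workspace / "source", dirs_exist_ok=True)
    return workspace / "source"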

The flow is deliberately strict:

candidate diff
→ isolated workspace
→ safe patch application
→ build gate
→ correctness gate
→ benchmark gate
→ AttemptResult

Each gate answers one question:

  • Patch gate: Did the candidate apply cleanly, and did it only touch allowed files?
  • Build gate: Does the modified target still compile or pass the syntax/build step?
  • Correctness gate: Does the candidate preserve the required behavior?
  • Benchmark gate: If correctness passed, did performance improve?

That “if correctness passed” matters.

A faster wrong answer is not an optimization. It is a bug with good timing.

So the benchmark is blocked unless correctness succeeds.

That rule became one of the central design constraints:

No benchmark result counts unless correctness passes.

In practice, this prevents the manager from rewarding the most common failure mode in AI-generated optimization: removing necessary logic, weakening checks, changing edge-case behavior, or altering semantics in exchange for speed.

For example, in the vector_add task, the baseline implementation preserves Python zip semantics:

def vector_add(a, b):
    return [x + y for x, y in zip(a, b)]

A candidate might try to rewrite it as:

def vector_add(a, b):
    return [a[i] + b[i] for i in range(len(a))]

That looks reasonable for equal-length arrays. It may even appear faster in a narrow benchmark. But it breaks mismatched-length behavior:

vector_add([1, 2, 3], [10, 20])

The correct result should preserve zip behavior:

[11, 22]

The rewritten version can index past the shorter list and fail.
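
A small, self-contained check makes the difference concrete. This is a sketch in the spirit of the task's correctness harness, not the actual test_kernel.py:

def baseline_vector_add(a, b):
    # zip stops at the shorter input, so mismatched lengths are truncated.
    return [x + y for x, y in zip(a, b)]

def candidate_vector_add(a, b):
    # Indexing over range(len(a)) assumes b is at least as long as a.
    return [a[i] + b[i] for i in range(len(a))]

assert baseline_vector_add([1, 2, 3], [10, 20]) == [11, 22]

try:
    candidate_vector_add([1, 2, 3], [10, 20])
except IndexError as exc:
    print(f"candidate fails the mismatched-length case: {exc}")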

That failure becomes structured evidence:

{
  "attempt_id": "attempt_001",
  "applied": true,
  "compiled": true,
  "correctness_passed": false,
  "benchmark_passed": false,
  "failure_reason": "correctness_failed",
  "raw_test_output": "IndexError: list index out of range"
}

The important thing is not merely that the attempt failed.

The important thing is that the failure is now machine-readable.

The manager can diagnose it:

The candidate changed the iteration semantics and broke mismatched-length behavior.

Then the next prompt can be repaired:

Preserve zip semantics.
Do not assume equal-length vectors.
Do not trade correctness for speed.

A different candidate might pass correctness but regress the benchmark:

{
  "attempt_id": "attempt_002",
  "applied": true,
  "compiled": true,
  "correctness_passed": true,
  "benchmark_passed": true,
  "baseline_ms": 100.0,
  "median_ms": 105.0,
  "speedup": -0.05,
  "failure_reason": "benchmark_regression"
}

That is a very different kind of evidence.

The code is correct, but it is slower. So the next prompt should not focus on boundary semantics. It should focus on performance discipline:

The previous candidate was correct but slower.
Avoid extra branching, allocations, sleeps, or unnecessary memory traffic.
Target the measured hotspot directly.

And if an attempt finally passes both gates:

{
  "attempt_id": "attempt_003",
  "applied": true,
  "compiled": true,
  "correctness_passed": true,
  "benchmark_passed": true,
  "baseline_ms": 100.0,
  "median_ms": 84.0,
  "speedup": 0.16,
  "failure_reason": null
}

then it can be promoted.

Not because Codex said it was better.

Not because the diff looked clever.

Because the evidence survived the gates.

This is the difference between autocomplete and engineering.

Codex Manager does not ask:

Does this answer look good?

It asks:

Did it apply?
Did it build?
Did it preserve correctness?
Did it improve the benchmark?
What did we learn if it failed?

The final candidate is only the visible output.

The real product is the evidence trail that explains why it was accepted.


Diagnosis and Repair: Turning Failure into the Next Prompt

An important design decision was the split between diagnosis and repair:

AttemptResult → AttemptDiagnosis
AttemptDiagnosis → PromptDelta

That split became one of the most important architectural choices in the system.

The diagnoser answers:

What did the evidence show?

The repair policy answers:

How should the next prompt change?

Those are not the same question.

A failed attempt contains raw evidence: exit codes, compiler output, test failures, benchmark numbers, patch-application errors, and metadata from the isolated workspace. The diagnoser’s job is to turn that raw evidence into a clear interpretation. It should say what failed, why it likely failed, how confident the system is, and what lesson should be carried forward.

The repair policy then decides how to mutate the next PromptState.

That separation matters because a test error should not be allowed to improvise the next instruction. The manager remains in control: evidence becomes diagnosis, diagnosis becomes a structured prompt delta, and only then does PromptState evolve.
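
The post only shows DTOs for PromptState and AttemptResult. A plausible shape for the two objects on either side of this split, with field names inferred from how they are used later in the post; treat this as a sketch, not the project's exact schema:

from pydantic import BaseModel, Field

class AttemptDiagnosisDTO(BaseModel):
    # Interpretation only: what failed, why, and with what confidence.
    failure_class: str          # e.g. "correctness_failed", "benchmark_regression"
    explanation: str            # why the attempt likely failed
    confidence: float = 1.0     # how sure the diagnoser is
    prompt_lesson: str = ""     # the lesson to carry into the next prompt

class PromptDeltaDTO(BaseModel):
    # Mutation only: how the next PromptState should change.
    system_additions: list[str] = Field(default_factory=list)
    user_additions: list[str] = Field(default_factory=list)
    prior_lessons: list[str] = Field(default_factory=list)
    failure_warnings: list[str] = Field(default_factory=list)
    banned_moves: list[str] = Field(default_factory=list)
    success_patterns: list[str] = Field(default_factory=list)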

In other words, failure does not trigger a vague retry.

Failure becomes a controlled state transition.

For example, given:

failure_reason = "bounds_check_missing"

the diagnoser might produce:

Candidate removed or weakened boundary handling.
The next prompt must preserve boundary semantics and handle non-divisible input sizes.

The repair policy then translates that diagnosis into explicit prompt-state changes:

user_additions:
  - Explicitly preserve boundary checks and handle non-divisible input sizes.

new_failure_warnings:
  - Previous attempt likely failed because boundary guards were missing or weakened.

new_banned_moves:
  - Assume aligned sizes
  - Remove tid/count guards

That is very different from saying:

Try again.

The system now knows what kind of retry it is performing.

A compilation_failed diagnosis produces a different repair:

system_additions:
  - Preserve public interfaces, function names, signatures, buffer bindings, and required imports.

user_additions:
  - Before changing syntax, compare against the baseline and keep the smallest compilable edit.

new_banned_moves:
  - Pseudocode
  - Undefined symbols
  - Changed public interface

A benchmark_regression produces a different repair again:

user_additions:
  - The previous candidate was correct but slower.
  - Target a smaller hotspot-specific optimization.
  - Avoid extra branching, sleeps, allocations, or unnecessary memory traffic.

new_failure_warnings:
  - Previous candidate regressed benchmark performance.

This is the point of the split. The manager does not treat all failures as equal. A syntax failure, a correctness failure, a benchmark regression, and a malformed model response all require different prompt changes.

Once failures are typed this way, the manager can build a small operating manual for itself:

| Failure | What the diagnoser sees | How the prompt is repaired |
| --- | --- | --- |
| candidate_generation_failed | The executor failed to return a usable candidate | Simplify the output contract; require one valid artifact only |
| patch_apply_failed | The candidate was not a clean diff or touched the wrong file | Require a minimal unified diff against the exact target file |
| compilation_failed | Syntax, imports, interface, or build contract broke | Preserve signatures, imports, bindings, and public interface |
| correctness_failed | Behavior changed even if the code built | Preserve semantics before optimizing speed |
| bounds_check_missing | Boundary or size assumptions broke edge cases | Preserve bounds checks and handle non-divisible sizes |
| benchmark_regression | Candidate was correct but slower | Avoid extra branching, allocation, sync, or memory traffic |
| benchmark_failed | Benchmark crashed or stopped emitting required metrics | Preserve benchmark compatibility and output format |

That table is the manager’s memory during a run.

It is not memory in the model weights. It is memory in the surrounding system. The next prompt becomes more constrained because the last attempt produced evidence.
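
One way to picture that operating manual is a lookup from failure class to repair rules. This is a simplified sketch; the actual repair policy is richer and lives behind the diagnosis/repair split described above:

# Simplified sketch of the repair table; keys mirror the failure classes above.
REPAIR_RULES = {
    "patch_apply_failed": {
        "user_additions": ["Return one minimal unified diff against the exact target file."],
        "banned_moves": ["Conversational response", "Touching unrelated files"],
    },
    "correctness_failed": {
        "user_additions": ["Preserve the baseline semantics before optimizing speed."],
        "failure_warnings": ["Previous attempt changed behavior."],
    },
    "benchmark_regression": {
        "user_additions": ["Avoid extra branching, allocations, or unnecessary memory traffic."],
        "failure_warnings": ["Previous candidate regressed benchmark performance."],
    },
}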

This gives the loop its shape:

Attempt fails
→ evidence is classified
→ diagnosis records the lesson
→ repair policy mutates PromptState
→ next attempt is generated under better constraints

The prompt state evolves through structured deltas, not ad-hoc string rewrites.

That is what makes the process traceable. Every prompt change can be connected back to a specific attempt, a specific failure class, and a specific repair rule. When the final candidate is promoted, we can inspect the path that led there:

attempt_001/
  result.json
  diagnosis.md
  next_prompt_delta.md

attempt_002/
  result.json
  diagnosis.md
  next_prompt_delta.md

attempt_003/
  result.json
  diagnosis.md
  next_prompt_delta.md

The final code is not floating in space. It has provenance. The retry is more than a loop around the same mistake.


The Staircase We Built

We built Codex Manager in layers.

Each layer has one job, and each job removes one source of instability from the system. The result looks like a pipeline because that is what it is: a layered runtime where candidates move from context, to execution, to evidence, to packaging.

This is the current layer stack:

| Step | Layer | Job | What it stabilizes |
| --- | --- | --- | --- |
| 1 | Prompt-state loop | Turn attempt results into prompt updates | Ad-hoc retrying |
| 2 | Runtime contract | Define the engine/facade/profile boundary | Hardcoded execution paths |
| 3 | Shadow execution | Apply and test candidates in isolated workspaces | Blind mutation of real source |
| 4 | Executor registry | Swap mock, scripted, and live candidate generators | Model lock-in |
| 5 | Task packs | Load external task definitions from YAML/JSON | Hardcoded benchmark problems |
| 6 | Run bundles | Preserve complete run artifacts and replay metadata | Unreplayable runs |
| 7 | Command adapter | Run build, correctness, and benchmark commands | Toy-only evaluation |
| 8 | Submission packager | Convert run evidence into judge-ready artifacts | Messy or incomplete submission outputs |
| 9 | Platform adapters | Translate submissions into target platform layouts | Platform-specific branch logic inside the core runtime |
| 10 | Pipeline orchestrator | Run the full workflow as one deterministic sequence | Manual multi-step operation |
| 11 | CLI | Expose the pipeline as terminal commands and profiles | Python-snippet operation under pressure |

The important thing is that the layer stack keeps the system understandable. Each part has a narrow responsibility, and the whole thing composes into a pipeline.


Cold Example 1: A Portable vector_add Task Pack

The task pack is where the system stops being a demo. It is the contract between the outside world and the manager: the problem, source file, allowed patch paths, build command, correctness command, benchmark command, output contract, and known failure modes.

task_id: command_vector_add_pack
profile: kernel_optimization
goal: Optimize the command-mode vector_add kernel while preserving correctness.
max_attempts: 3

execution:
  mode: command
  source_dir: source
  target_file: kernel.py
  allowed_patch_paths:
    - kernel.py
  build_command: python -m py_compile kernel.py
  correctness_command: python test_kernel.py
  benchmark_command: python bench_kernel.py
  benchmark_output_format: key_value
  baseline_key: baseline_ms
  score_key: median_ms
  higher_is_better: false
  baseline_ms: 100.0
  target_speedup: 0.10

context:
  operation_name: vector_add
  hardware_target: deterministic_python_command
  output_contract: Return one valid unified diff against kernel.py and nothing else.
  correctness_contract:
    - Preserve vector_add(a, b) behavior.
    - Do not modify tests or benchmarks.
    - Handle empty vectors.
    - Handle mismatched vector lengths using zip semantics.
  benchmark_contract:
    - Benchmark output must print baseline_ms=<float>.
    - Benchmark output must print median_ms=<float>.
    - Lower median_ms is better.
  known_failure_modes:
    - Touching test files invalidates the candidate.
    - Syntax errors fail the build gate.
    - Removing zip semantics may break mismatched length behavior.

There is no hidden Python factory here. The task defines its own source, target, tests, benchmark, output contract, failure modes, and scoring semantics.

This is important because the task is no longer hardcoded into Python. The manager can ingest it, build a context pack, run the same loop, and produce the same audit trail for any task that follows the contract.
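
Ingesting a pack like this is ordinary parsing plus validation. A minimal sketch, assuming PyYAML and a Pydantic model that mirrors a subset of the fields above; the real loader covers the full schema:

import yaml  # PyYAML, assumed available for this sketch
from pydantic import BaseModel

class ExecutionSpec(BaseModel):
    # Partial field list; extra keys in the YAML are ignored by default.
    mode: str
    source_dir: str
    target_file: str
    allowed_patch_paths: list[str]
    build_command: str
    correctness_command: str
    benchmark_command: str
    baseline_ms: float
    target_speedup: float

class TaskPack(BaseModel):
    task_id: str
    profile: str
    goal: str
    max_attempts: int
    execution: ExecutionSpec
    context: dict

def load_task_pack(path: str) -> TaskPack:
    with open(path) as f:
        return TaskPack(**yaml.safe_load(f))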


Cold Example 2: Command-Driven Evaluation

The command runner turns a candidate diff into evidence.

For each candidate, it runs a strict five-phase pipeline:

Patch → Build → Correctness → Benchmark → Metric parse

The benchmark harness emits simple key-value output:

baseline_ms=100.0
median_ms=84.0

The parser computes speedup = (baseline_ms - median_ms) / baseline_ms, giving (100.0 - 84.0) / 100.0 = 0.16. A 16% speedup only matters if correctness passed first.
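
A minimal sketch of that parse-and-score step; parse_benchmark_output and compute_speedup also appear in the pseudo-code appendix, and the key names come from the task pack:

def parse_benchmark_output(text: str) -> dict[str, float]:
    # Parse simple key=value lines such as "baseline_ms=100.0".
    metrics = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            try:
                metrics[key.strip()] = float(value)
            except ValueError:
                pass
    return metrics

def compute_speedup(metrics: dict[str, float]) -> float:
    # Positive means faster than baseline; negative means a regression.
    return (metrics["baseline_ms"] - metrics["median_ms"]) / metrics["baseline_ms"]

metrics = parse_benchmark_output("baseline_ms=100.0\nmedian_ms=84.0")
assert abs(compute_speedup(metrics) - 0.16) < 1e-9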

Here’s the actual attempt chain observed in a mock run:

  • Attempt 1: patch applies, build passes, correctness fails → failure_reason = correctness_failed → prompt adds: preserve zip semantics
  • Attempt 2: correctness passes, benchmark regresses → failure_reason = benchmark_regression → prompt adds: avoid extra branching and allocations
  • Attempt 3: correctness passes, benchmark improves → promoted

This deterministic progression shows that the manager can recover from both correctness and performance failures without human intervention.


Cold Example 3: One Command Pipeline

By the end, the whole workflow became one terminal command:

writer codex-manager pipeline run \
  --task examples/codex_manager/command_vector_add/task.yaml \
  --output-root runs/blog_vector_add_smoke \
  --platform generic_command \
  --executor mock \
  --create-zip \
  --overwrite

A real smoke run produced:

Pipeline run
  Status: completed
  Pipeline id: pipe_e140aaaa0ae6
  Run id: cmrun_dbc92100fcb7
  Submission id: submission_384f757e2f35
  Platform export id: export_189cb76315d3
  Submission zip path: runs/blog_vector_add_smoke/submission.zip
  Pipeline report: runs/blog_vector_add_smoke/pipeline_report.md

Pipeline validation
  Valid: True

That single command runs:

task pack load
→ prompt-state optimization
→ command execution
→ bundle generation
→ submission packaging
→ platform export
→ validation

The Artifact Trail: Evidence Has Structure

By the time a run finishes, Codex Manager has not only produced a candidate. It has produced a trail.

That trail matters because agentic code systems are otherwise hard to inspect. A model may return a plausible answer, but without the surrounding evidence we cannot tell whether the answer was lucky, verified, overfit, unsafe, or simply accepted because no one looked closely enough.

Codex Manager writes the evidence down.

At the attempt level, each executed attempt gets its own directory:

attempts/
  attempt_001/
    prompt.md
    result.json
    diagnosis.md
    next_prompt_delta.md

  attempt_002/
    prompt.md
    result.json
    diagnosis.md
    next_prompt_delta.md

  attempt_003/
    prompt.md
    result.json
    diagnosis.md
    next_prompt_delta.md

That directory is the chain of custody for each candidate.

  • prompt.md shows what Codex saw.
  • result.json shows what happened when the candidate was executed.
  • diagnosis.md explains what the evidence meant.
  • next_prompt_delta.md shows how the next prompt changed.

A run also keeps a separate timeline of prompt states:

prompt_states/
  attempt_001/
    prompt.md
  attempt_002/
    prompt.md
  attempt_003/
    prompt.md
  attempt_004/
    prompt.md

That distinction matters.

attempts/ contains candidates that were actually executed. prompt_states/ contains the evolving state of the run, including the next prompt state that would be used if another attempt were needed.

So the final unexecuted next state can still be inspected, but it does not masquerade as an executed attempt.

This gives the artifact model clean boundaries:

attempts/
  attempt_001/
    prompt.md
    result.json
    diagnosis.md
    next_prompt_delta.md

prompt_states/
  attempt_004/
    prompt.md

At the run level, Codex Manager writes:

run_manifest.json
context_pack.md
PROMPTS.log
report.md
validation.json
best/

At the submission level, it writes:

submission_manifest.json
README.md
summary.json
prompt_evolution.json
evidence/
best/

And once the full pipeline runs, it writes:

pipeline_manifest.json
pipeline_summary.json
pipeline_report.md
pipeline_events.jsonl
task_run/
submission/
platform_export/

The pipeline manifest ties the whole chain together with IDs and hashes:

{
  "status": "completed",
  "pipeline_id": "pipe_5298de9da2df",
  "run_id": "cmrun_b608c9be1db5",
  "submission_id": "submission_01281797e2a5",
  "platform_export_id": "export_efdbf4e960b2",
  "chain_of_custody": {
    "run_bundle": "446b95950ed91687a24016450ff310e9bb37b80a44794147c053416b892ba214",
    "submission": "9afec5d1f8f7587e491fd3f7264495a6138a1a881c27e1b2bd30535fa20cee52",
    "platform_export": "6960142bad20b2d16b60693bc111e9c2b59bc7a9dddac9b439ef0b53389c4bb8"
  }
}
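
The exact hashing scheme is internal to the packager, but the idea is an ordinary content digest over each artifact directory. A sketch of one plausible way to compute it:

import hashlib
from pathlib import Path

def digest_directory(root: str) -> str:
    # Hash relative paths and file contents in sorted order so the digest is reproducible.
    h = hashlib.sha256()
    base = Path(root)
    for path in sorted(base.rglob("*")):
        if path.is_file():
            h.update(str(path.relative_to(base)).encode())
            h.update(path.read_bytes())
    return h.hexdigest()

# e.g. digest_directory("runs/blog_vector_add_smoke/task_run")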

That is the point of the artifact trail.

The system is not asking anyone to trust a generated file. It preserves the context, the candidate, the execution result, the diagnosis, the repair, the submission, and the export.

A judge can inspect it. A developer can replay it. A future run can use the same evidence to improve the next context.

The final code is only the visible tip of the run. The artifact trail is the proof that it deserved to survive.


Make the AI Work Harder

The point of Codex Manager is to make the AI work harder.

A normal coding assistant can produce an answer and move on. If the answer is wrong, the burden falls back on the human: find the bug, explain the failure, rewrite the prompt, run the test again, decide whether the next version is better.

That is useful, but it is still mostly human-managed.

For this hackathon, the challenge was framed around context. Our interpretation was simple: if context is the thing being judged, then the system should not rely on a human manually carrying that context from one attempt to the next. The system itself should preserve the context, update it, and force the next attempt to deal with what happened before.

That is what Codex Manager is trying to demonstrate.

It gives the AI a structured environment where it cannot simply produce code and disappear. Every candidate has to pass through the same loop:

generate candidate
→ apply it safely
→ build it
→ test correctness
→ benchmark it
→ diagnose failure
→ repair the next prompt
→ try again under better constraints

The model still proposes the code. But the surrounding system makes the model face evidence.

If the model returns prose instead of a diff, the next prompt gets stricter.

If the candidate breaks correctness, the next prompt carries that lesson.

If the candidate passes correctness but slows the benchmark, the next prompt changes again.

If the candidate succeeds, the system promotes it and preserves the trail.

This is what we mean by making the AI work harder. We are not just asking it for a better answer. We are building the conditions under which it has to produce one.

The manager gives the AI something closer to an external working memory:

  • what the task is
  • what the output contract is
  • what failed before
  • what moves are banned
  • what patterns should be preserved
  • what the benchmark actually measured
  • what evidence is required before promotion

That is the context: an evolving control system where each attempt leaves evidence that shapes the next one.

    flowchart TD
    A["🧠 PromptState"] --> B["🤖 Generate Candidate"]
    B --> C["🛡️ Apply Candidate<br/>in Isolated Workspace"]
    C --> D["🔍 Validate Attempt"]

    D --> E{"Passed?"}

    E -- No --> F["🩺 Diagnose Failure"]
    F --> G["🔄 Repair PromptState"]
    G --> A

    E -- Yes --> H["✅ Promote Candidate"]
    H --> I["🎯 Verified Solution"]

    subgraph Validation
        D1["⚙️ Build / Compile"]
        D2["🧪 Correctness Tests"]
        D3["📊 Benchmark / Performance"]
    end

    D --> D1
    D --> D2
    D --> D3

    classDef state fill:#e0f0ff,stroke:#3a6ea5,stroke-width:2px,color:#1a2b3c
    classDef evidence fill:#fff7e0,stroke:#d9a34a,stroke-width:2px,color:#4a3a1a
    classDef gate fill:#e6ffe6,stroke:#4a9d4a,stroke-width:2px,color:#1a3c1a

    class A,G state
    class B,F,I evidence
    class C,D,E,H gate
    class D1,D2,D3 gate
  

This is how we chose to approach the hackathon. If the problem is too broad or too difficult to solve reliably by hand, then the useful demonstration is a system that lets the AI iterate under constraint.

It generates a candidate, tests it, records what happened, repairs the next prompt state, and tries again.

It may reach the final solution. It may not. But it gives the AI a structured way to improve its attempts, and it leaves behind artifacts that show the path it took.

That is our answer to the “context, not code” framing.


What Comes Next: Applying the Manager

The next step is to apply Codex Manager to harder tasks.

The shape is now in place:

Task Pack
→ PromptState
→ Candidate
→ Isolated Execution
→ Evidence
→ Diagnosis
→ PromptDelta
→ Run Bundle
→ Submission
→ Platform Export

A new target can bring a new source file, a new build command, a new correctness harness, a new benchmark, and a new platform export. The prompt-state loop remains the same.

For a real kernel task, the path is straightforward:

Task Pack
→ Command Mode
→ Metal build command
→ Metal correctness harness
→ Metal benchmark command
→ same PromptState loop

That is the point of the architecture. The hackathon was the forcing function. The pattern is the output.



References

Humanity’s Last Hackathon

How To Win Humanity’s Last Hackathon - The hardest agent contest in AI.

OpenAI Codex Documentation OpenAI’s documentation for Codex as a coding agent for software development. Useful background for readers who want to understand the Codex product surface and workflow model.

Codex CLI Documentation Documentation for running Codex locally from the terminal. This is especially relevant to the Codex Manager idea because it frames Codex as a coding agent that can read, change, and run code in a local directory.

Codex CLI Reference Command and flag reference for Codex CLI. Useful for readers who want to compare Codex Manager’s CLI/pipeline approach with OpenAI’s Codex CLI surface.

Codex Web / Cloud Documentation Documentation for delegating coding tasks to Codex in a cloud environment.

Codex Skills Documentation Documentation on Codex skills and reusable workflows. Relevant to the broader idea of treating context and workflow structure as first-class parts of agentic coding.


Appendix A: The Codex Manager Loop in Pseudo-Code

The full implementation has engines, facades, registries, task packs, validators, run bundles, submissions, platform adapters, and a CLI. But the core idea is much smaller.

At the center is one loop:

PromptState
→ Candidate
→ AttemptResult
→ AttemptDiagnosis
→ PromptDelta
→ next PromptState

Here is that loop in pseudo-code.

def solve(task_pack, executor, profile, max_attempts):
    """
    Solve a task by evolving PromptState, not by trusting generated code.
    """

    # 1. Load task and build initial context.
    task = load_task_pack(task_pack)
    context_pack = build_context_pack(task)

    state = PromptState(
        task_id=task.id,
        context_pack=context_pack,
        prior_lessons=[],
        failure_warnings=[],
        banned_moves=[],
        success_patterns=[],
        output_contract="Return one valid unified diff and nothing else.",
    )

    attempts = []

    # 2. Iterate until budget is exhausted or a verified candidate is found.
    for attempt_index in range(max_attempts):
        # Codex is a candidate generator, not the source of truth.
        candidate = executor.generate(state)

        # The profile decides how this task is evaluated.
        # For command-mode tasks, this means:
        #   patch -> build -> correctness -> benchmark -> metrics
        result = profile.run_candidate(
            task=task,
            state=state,
            candidate=candidate,
        )

        attempts.append(result)

        # A candidate only survives if it passes the gates.
        if result.correctness_passed and result.benchmark_passed and result.speedup > 0:
            promote(result)
            break

        # 3. Convert evidence into a diagnosis.
        diagnosis = diagnose(result)

        # 4. Convert diagnosis into a prompt delta.
        delta = repair_prompt(diagnosis)

        # 5. Produce the next PromptState.
        state = apply_delta(state, delta)

    # 6. Preserve the evidence trail.
    return write_run_bundle(
        task=task,
        attempts=attempts,
        final_state=state,
    )

That is the heart of the system.

Codex generates candidates. The manager turns those candidates into evidence. The evidence changes the next prompt.


The Candidate Is Not Trusted

The executor only returns text:

class CandidateExecutor:
    def generate(self, state: PromptState) -> str:
        """
        Return a candidate artifact.

        Usually this is a unified diff.
        It may come from a mock executor, a scripted executor,
        or a live Codex/OpenAI-compatible executor.
        """
        ...

The executor does not validate the candidate. It does not repair the prompt. It does not promote anything.

It only generates.
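
The mock executor mentioned throughout the post fits this interface naturally. A sketch of what it might look like; the project's actual mock executor may differ:

class MockCandidateExecutor(CandidateExecutor):
    """Replays a scripted sequence of candidate diffs, one per attempt."""

    def __init__(self, scripted_diffs: list[str]):
        self.scripted_diffs = scripted_diffs
        self.calls = 0

    def generate(self, state: PromptState) -> str:
        # Ignore the prompt content entirely; return the next canned candidate.
        diff = self.scripted_diffs[min(self.calls, len(self.scripted_diffs) - 1)]
        self.calls += 1
        return diff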


The Profile Turns a Candidate into Evidence

A profile knows how to evaluate a task.

For a command-mode task, evaluation looks like this:

def run_candidate(task, state, candidate_diff):
    workspace = create_isolated_workspace(task.source_dir)

    patch = apply_patch(
        workspace=workspace,
        patch_text=candidate_diff,
        allowed_paths=task.allowed_patch_paths,
    )

    if not patch.applied:
        return AttemptResult(
            applied=False,
            compiled=False,
            correctness_passed=False,
            benchmark_passed=False,
            failure_reason="patch_apply_failed",
            raw_test_output=patch.error,
        )

    build = run_command(task.build_command, cwd=workspace)

    if not build.passed:
        return AttemptResult(
            applied=True,
            compiled=False,
            correctness_passed=False,
            benchmark_passed=False,
            failure_reason="compilation_failed",
            raw_test_output=build.output,
        )

    correctness = run_command(task.correctness_command, cwd=workspace)

    if not correctness.passed:
        return AttemptResult(
            applied=True,
            compiled=True,
            correctness_passed=False,
            benchmark_passed=False,
            failure_reason=classify_correctness_failure(correctness.output),
            raw_test_output=correctness.output,
        )

    benchmark = run_command(task.benchmark_command, cwd=workspace)
    metrics = parse_benchmark_output(benchmark.output)

    if not benchmark.passed or not metrics.valid:
        return AttemptResult(
            applied=True,
            compiled=True,
            correctness_passed=True,
            benchmark_passed=False,
            failure_reason="benchmark_failed",
            raw_benchmark_output=benchmark.output,
        )

    speedup = compute_speedup(metrics)

    return AttemptResult(
        applied=True,
        compiled=True,
        correctness_passed=True,
        benchmark_passed=True,
        baseline_ms=metrics["baseline_ms"],
        median_ms=metrics["median_ms"],
        speedup=speedup,
        failure_reason=None if speedup > 0 else "benchmark_regression",
        raw_test_output=correctness.output,
        raw_benchmark_output=benchmark.output,
    )

The key rule is:

if not correctness.passed:
    do_not_run_benchmark()

A faster wrong answer is not an optimization.


Diagnosis Interprets Evidence

The diagnoser does not edit the prompt. It only classifies what happened.

def diagnose(result):
    if result.failure_reason == "patch_apply_failed":
        return AttemptDiagnosis(
            failure_class="patch_apply_failed",
            prompt_lesson=(
                "The candidate was not a clean patch. "
                "The next prompt must require one valid unified diff against the target file."
            ),
        )

    if result.failure_reason == "compilation_failed":
        return AttemptDiagnosis(
            failure_class="compilation_failed",
            prompt_lesson=(
                "The candidate failed to build. "
                "The next prompt must preserve valid syntax, imports, function names, and public interfaces."
            ),
        )

    if result.failure_reason == "correctness_failed":
        return AttemptDiagnosis(
            failure_class="correctness_failed",
            prompt_lesson=(
                "The candidate changed behavior. "
                "The next prompt must preserve semantics before optimizing speed."
            ),
        )

    if result.failure_reason == "benchmark_regression":
        return AttemptDiagnosis(
            failure_class="benchmark_regression",
            prompt_lesson=(
                "The candidate was correct but slower. "
                "The next prompt must target a smaller hotspot and avoid extra branching or allocation."
            ),
        )

    return AttemptDiagnosis(
        failure_class="unknown",
        prompt_lesson="Make a smaller, safer change and preserve the baseline behavior.",
    )

This keeps interpretation separate from mutation.


Repair Mutates PromptState

The repair policy converts a diagnosis into a structured prompt update.

def repair_prompt(diagnosis):
    delta = PromptDelta()

    if diagnosis.failure_class == "patch_apply_failed":
        delta.system_additions.append(
            "Return one valid unified diff against the exact target file."
        )
        delta.banned_moves.extend([
            "Conversational response",
            "Whole-file rewrite unless explicitly requested",
            "Touching unrelated files",
        ])

    if diagnosis.failure_class == "compilation_failed":
        delta.system_additions.append(
            "Preserve imports, function names, signatures, and public interfaces."
        )
        delta.banned_moves.extend([
            "Pseudocode",
            "Undefined symbols",
            "Changed public interface",
        ])

    if diagnosis.failure_class == "correctness_failed":
        delta.user_additions.append(
            "Preserve the baseline semantics before attempting performance improvements."
        )

    if diagnosis.failure_class == "benchmark_regression":
        delta.user_additions.append(
            "The previous candidate was correct but slower. "
            "Avoid extra branching, sleeps, allocations, or unnecessary memory traffic."
        )

    delta.prior_lessons.append(diagnosis.prompt_lesson)
    return delta

The repair is explicit. It is not an improvised “try harder.”


PromptState Carries the Loop Forward

Applying the delta creates the next state:

def apply_delta(state, delta):
    return PromptState(
        task_id=state.task_id,
        context_pack=state.context_pack,

        system_prompt=state.system_prompt + "\n" + "\n".join(delta.system_additions),
        user_prompt=state.user_prompt + "\n" + "\n".join(delta.user_additions),

        prior_lessons=state.prior_lessons + delta.prior_lessons,
        failure_warnings=state.failure_warnings + delta.failure_warnings,
        banned_moves=dedupe(state.banned_moves + delta.banned_moves),
        success_patterns=dedupe(state.success_patterns + delta.success_patterns),

        output_contract=state.output_contract,
        attempt_index=state.attempt_index + 1,
    )

The model’s weights do not change.

The context around the model changes.

That is the point.


Artifacts Are Written at Every Step

A real run writes the evidence trail:

def write_attempt_artifacts(state, result, diagnosis, delta):
    write("attempts/{id}/prompt.md", state.render())
    write("attempts/{id}/result.json", result)
    write("attempts/{id}/diagnosis.md", diagnosis)
    write("attempts/{id}/next_prompt_delta.md", delta)

Prompt states get their own timeline:

def write_prompt_state(state):
    write("prompt_states/{id}/prompt.md", state.render())

The larger pipeline then packages everything:

def full_pipeline(task_pack):
    run = solve_task_pack(task_pack)
    bundle = build_run_bundle(run)
    submission = package_submission(bundle)
    export = export_platform_submission(submission)
    return validate(export)

That is the full system in miniature.

The implementation is larger because it has registries, validators, manifests, hashes, and CLI commands. But the idea remains the same:

generate
→ validate
→ diagnose
→ repair
→ try again
→ preserve the evidence

That is Codex Manager.


Glossary

| Term | Definition |
| --- | --- |
| Codex Manager | A prompt-state runtime that manages Codex through candidate generation, isolated execution, diagnosis, repair, and verified promotion. |
| PromptState | The structured state around Codex at a given point in the run, including context, prior lessons, failure warnings, banned moves, success patterns, and the output contract. |
| Candidate | A proposed solution generated by Codex or another candidate executor. It is treated as a proposal, not as a trusted answer. |
| Candidate Diff | A patch generated by the model, usually as a unified diff against the target file. |
| Attempt | One cycle of generating a candidate, applying it, validating it, benchmarking it, and recording the result. |
| AttemptResult | The structured evidence from an attempt, including whether the candidate applied, built, passed correctness, passed benchmark, and improved performance. |
| AttemptDiagnosis | The manager’s interpretation of an AttemptResult. It identifies what failed, why it likely failed, and what lesson should carry forward. |
| PromptDelta | A structured update to the next PromptState, such as new constraints, warnings, banned moves, or success patterns. |
| Prompt Repair | The process of converting an AttemptDiagnosis into a PromptDelta so the next attempt is generated under better constraints. |
| Banned Moves | Actions the next candidate should avoid, such as touching test files, changing public interfaces, assuming aligned inputs, or returning prose instead of a diff. |
| Success Patterns | Patterns from successful attempts that should be preserved in future attempts. |
| Context Pack | The task-specific context given to Codex, including the goal, source constraints, correctness contract, benchmark contract, known failure modes, and output contract. |
| Task Pack | A portable YAML or JSON task definition that describes the problem, source files, target file, allowed patch paths, build command, correctness command, benchmark command, and context. |
| Execution Mode | The way a task is evaluated, such as shadow, shadow_mock, or command. |
| Shadow Workspace | An isolated copy of the source files where candidate patches are applied and tested without modifying the real source tree. |
| Patch Gate | The validation step that checks whether a candidate diff applies cleanly and only touches allowed files. |
| Build Gate | The validation step that checks whether the patched target still compiles or passes its build command. |
| Correctness Gate | The validation step that checks whether the candidate preserves the required behavior. Benchmarking is blocked unless this passes. |
| Benchmark Gate | The validation step that measures whether a correctness-passing candidate improves performance. |
| Benchmark Regression | A failure where the candidate passes correctness but performs worse than the baseline. |
| Verified Promotion | The decision to mark a candidate as the best surviving attempt only after it passes the required gates and improves the benchmark. |
| Candidate Executor | The component that turns a PromptState into a candidate. Examples include mock, scripted, and OpenAI-compatible executors. |
| Executor Registry | The registry that lets Codex Manager switch between candidate generators without changing the prompt-state loop. |
| Profile | A domain-specific evaluator for a class of tasks. For example, a kernel optimization profile knows how to build context and run candidates for kernel-style tasks. |
| Profile Registry | The registry that lets Codex Manager load different profiles without hardcoding execution paths into the engine. |
| Command Adapter | The execution path that lets task packs define shell commands for build, correctness, and benchmark evaluation. |
| Run Bundle | A reproducible package of a completed run, including the run manifest, context pack, prompt log, report, validation output, replay script, attempts, and best candidate evidence. |
| Submission Package | A judge-ready artifact built from a run bundle, containing the evidence needed to inspect the run and its promoted candidate. |
| Platform Export | A translated version of the submission package shaped for a specific external platform, such as generic_command, metal_stub, or gpu_mode_stub. |
| Pipeline | The full end-to-end workflow: task pack, solve, run bundle, submission package, platform export, and validation. |
| Pipeline Manifest | The machine-readable summary of a full pipeline run, including IDs, output paths, validations, and chain-of-custody hashes. |
| Chain of Custody | The hash-linked trail connecting a run bundle, submission package, and platform export. |
| PROMPTS.log | The append-only JSONL log of prompt attempts, prompt hashes, diagnoses, speedups, promotion status, and executor metadata. |
| attempts/ | The directory containing executed attempts. Each attempt contains the prompt, result, diagnosis, and next prompt delta. |
| prompt_states/ | The directory containing PromptState snapshots, including future prompt states that may not have been executed. |
| Replay | The ability to inspect or rerun a task using the recorded task pack, manifest, and pipeline metadata. |
| Context Machine | The broader system around Codex that preserves state, evaluates attempts, repairs prompts, and produces artifacts. Codex Manager is the context machine described in this post. |