PTTE V9 | ARC-AGI-2 POSTMORTEM

AGI-2 EVAL (14b) · 0/10 solved

AGI-2 TRAIN (7b) · 0/20 solved

EVAL CELL MATCH · 8.9%

TRAINING PEAK (14b) · 32.8% cell match · 1/20 solved

ONLY TASK SOLVED · 017c7c7b

BASELINE KILL FEED · raw LLM · all runs

── AGI-1 RUNS ──────────────────────────────────────────────────────────────
qwen2.5:7b   AGI-1 training      ×20           correct=0/20  cell≈0%     texture collapse
qwen2.5:7b   AGI-1 ×8 editions   ×20           correct=0/20  cell=0%     rotation → flatline
qwen2.5:14b  AGI-1 v8 bench      ×5×54         avg 28.7%                 AGI-1 MIRAGE
  ↳ smaller repetitive grids · transformer texture matching · NOT reasoning

── AGI-2 RUNS ──────────────────────────────────────────────────────────────
qwen2.5:7b   AGI-2 training      ×20           correct=0/20  cell=21.6%  WALL
qwen2.5:14b  AGI-2 training      ×20           correct=1/20  cell=32.8%  017c7c7b (tiling guess)
qwen2.5:14b  AGI-2 EVAL          ×10           correct=0/10  cell=8.9%   ████ THE WALL ████

── RUN 4: OPTIC NERVE ACTIVE ────────────────────────────────────────────────
qwen2.5:14b  AGI-2 training      ×5 (objects)  correct=0/5   cell=31.5%  +22.6pp vs eval
  ↳ 009d5c81: 83.7% cell match · 3-object task · rule PERCEIVED · near-hit
  ↳ 007bbfb7: 51.9% cell match · 2-object task · transform about half right
  ↳ 00d62c1b: 0.0% cell match · 15-object task · synthesis window exceeded

RUN 4 DELTA — OBJECT DECOMP vs EVAL BASELINE

8.9% → 31.5% · +22.6pp · 3.5×

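Zero model changes. Zero fine-tuning. Only the representation changed.
Raw grid → object descriptors = 3.5× cell match improvement (8.9% on eval → 31.5% on 5 training tasks).
Caveat: those two numbers come from different splits. The raw-grid training baseline for the same 14b model was 32.8%, so the like-for-like test is still an object-descriptor run on eval.
009d5c81 at 83.7% is strong evidence that the model CAN reason about spatial rules once it is handed objects:
it just couldn't see them through up to 900 raw integers.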
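
The representation change above, raw grid → object descriptors, can be sketched in Lean 4 as a connected-component pass. This is only an illustration under stated assumptions: the names `Grid`, `Pixel`, `flood`, and `components` are hypothetical, objects are taken to be 4-connected same-colour regions, and colour 0 is treated as background. The actual Optic Nerve is not shown in this postmortem and may work differently.

```lean
-- Hypothetical types: the real Optic Nerve representation is not shown here.
abbrev Grid := List (List Nat)
abbrev Pixel := Nat × Nat          -- (row, col)

/-- Colour at a pixel, or `none` when the pixel is off the grid. -/
def colorAt (g : Grid) (p : Pixel) : Option Nat :=
  match g[p.1]? with
  | some row => row[p.2]?
  | none => none

/-- 4-connected neighbours (assumption). `Nat` subtraction saturates at 0, so an
    edge pixel lists itself; that is harmless because it is already in `seen`. -/
def neighbours (p : Pixel) : List Pixel :=
  [(p.1 + 1, p.2), (p.1 - 1, p.2), (p.1, p.2 + 1), (p.1, p.2 - 1)]

/-- Fuel-bounded flood fill: collect pixels of colour `c` reachable from the stack. -/
def flood (g : Grid) (c : Nat) : Nat → List Pixel → List Pixel → List Pixel
  | 0, _, seen => seen
  | _, [], seen => seen
  | fuel + 1, p :: rest, seen =>
    if seen.contains p || colorAt g p != some c then
      flood g c fuel rest seen
    else
      flood g c fuel (neighbours p ++ rest) (p :: seen)

/-- One descriptor per non-background component: (colour, pixels). -/
def components (g : Grid) : List (Nat × List Pixel) :=
  let h := g.length
  let w := (g.head?.map List.length).getD 0
  let cells : List Pixel := (List.range (h * w)).map fun i => (i / w, i % w)
  cells.foldl
    (fun (acc : List (Nat × List Pixel)) (p : Pixel) =>
      match colorAt g p with
      | some c =>
        if c == 0 || acc.any (fun o => o.2.contains p) then acc
        else acc ++ [(c, flood g c (4 * h * w + 1) [p] [])]
      | none => acc)
    []

-- Two objects: a 2-pixel bar of colour 1 and a single pixel of colour 2.
#eval (components [[1, 1, 0], [0, 0, 2]]).map fun o => (o.1, o.2.length)
-- [(1, 2), (2, 1)]
```

The fuel bound `4 * h * w + 1` is enough because each visited pixel pushes at most four neighbours and every step pops one entry. The model would then be handed a short list of (colour, pixels) descriptors instead of the full integer grid.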

AUTOPSY — WHY THE WALL IS PERMANENT

AGI-1 at 25-35% was a mirage. Small grids. Repetitive patterns. Colour fills with 2-3 objects max. The transformer learned to pattern-match the texture of ARC training grids — not the rules. Rotate any task by 90°: score drops to ~8%. That's the tell.
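
The rotation probe mentioned above needs nothing more than a grid rotation applied to every input and output of a task before re-running it. A minimal Lean 4 sketch, assuming grids are row-major lists of colour indices and the rotation is clockwise (the names `Grid` and `rotate90` are hypothetical, and the probe's original tooling is not shown in this postmortem):

```lean
-- Hypothetical representation: a grid as rows of colour indices.
abbrev Grid := List (List Nat)

/-- Rotate a rectangular grid 90° clockwise:
    new row `c` is old column `c`, read bottom to top. -/
def rotate90 (g : Grid) : Grid :=
  let h := g.length
  let w := (g.head?.map List.length).getD 0
  (List.range w).map fun c =>
    (List.range h).reverse.map fun r =>
      (((g[r]?).getD [])[c]?).getD 0

#eval rotate90 [[1, 2], [3, 4]]   -- [[3, 1], [4, 2]]

-- Four quarter turns return the original rectangular grid.
#eval rotate90 (rotate90 (rotate90 (rotate90 [[1, 2, 3], [4, 5, 6]])))
        == [[1, 2, 3], [4, 5, 6]]   -- true
```

A rule-following solver should be indifferent to this transform, since the rotated task has the same rule in a rotated frame. A texture matcher is not, which is why the score drop is diagnostic.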

AGI-2 destroyed the mirage. Grids up to 30×30. Novel topologies per task. No two tasks share a surface pattern. The model receives up to 900 integers and no geometry. Its attention mechanism cannot localise object boundaries. It hallucinates pixel values with no spatial grounding. Cell match on eval: 8.9%. Not even close to the right shape.
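
The postmortem never spells out how cell match is scored. One plausible reading, sketched below in Lean 4, is the share of target cells that the prediction reproduces at the same position, with a wrong output shape scoring zero. That definition, and the names `Grid`, `rowHits`, and `cellMatch`, are assumptions for illustration only.

```lean
-- Hypothetical representation: a grid as rows of colour indices.
abbrev Grid := List (List Nat)

/-- Number of positions where two rows agree. -/
def rowHits (p t : List Nat) : Nat :=
  ((p.zip t).filter fun ab => ab.1 == ab.2).length

/-- (matching cells, total target cells).
    Assumption: a prediction with the wrong shape gets zero matches. -/
def cellMatch (pred target : Grid) : Nat × Nat :=
  let total := (target.map List.length).foldl (· + ·) 0
  if pred.map List.length != target.map List.length then (0, total)
  else (((pred.zip target).map fun pt => rowHits pt.1 pt.2).foldl (· + ·) 0, total)

#eval cellMatch [[1, 2], [3, 0]] [[1, 2], [3, 4]]   -- (3, 4), i.e. 75%
#eval cellMatch [[1, 2, 3]] [[1, 2], [3, 4]]        -- (0, 4), wrong shape
```

Under this reading a low eval score can come from wrong cell values or from outputs whose shape is wrong to begin with, so the two cases are worth logging separately.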

017c7c7b was a tiling guess. That task's output was a simple periodic repetition of the input pattern. The model got lucky on a texture it'd seen variants of. Remove it and the training score is 0/19.

Rick's Law holds: You cannot transform an object you cannot perceive. The LLM is not the retina. It was never supposed to be the retina. The Cathedral's formally verified Optic Nerve is the retina.

PHASE 3 — THE DSL HANDS

The Cathedral has eyes. Now it needs hands.

OVERSEER receives object descriptors. It synthesises a transform. But that transform must be formally verified — a Lean 4 DSL primitive, not a hallucinated pixel grid.

Two verbs to build:

-- VERB 1: safe rigid-body translation
def translate_object
    (obj : ARC_Object) (dr dc : Int)
    : Option ARC_Object
-- shift every pixel · fail if any OOB
-- must PROVE is_connected preserved
-- must PROVE is_maximal preserved

-- VERB 2: trivial recolor
def recolor_object
    (obj : ARC_Object) (c : Nat)
    : ARC_Object
-- pixels unchanged · only color field
-- is_monochrome: trivially ∀ p, c = c
-- is_connected: inherited unchanged
-- is_maximal: grid-relative · needs choice
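
The two signatures above can be given bodies once the proof obligations are set aside. The Lean 4 sketch below does that with a stand-in structure. `SketchObject`, `inGrid`, `translateSketch`, and `recolorSketch` are hypothetical names, the `h × w` bounds are passed explicitly, and none of the `is_connected` / `is_maximal` proofs from the spec are attempted. Only the two easy facts about recolor are proved.

```lean
-- Hypothetical stand-in for ARC_Object: the proof fields are deliberately left out.
structure SketchObject where
  pixels : List (Int × Int)   -- (row, col)
  color  : Nat

/-- True when a pixel lies inside an `h × w` grid. -/
def inGrid (h w : Nat) (p : Int × Int) : Bool :=
  decide (0 ≤ p.1) && decide (p.1 < (h : Int)) &&
    decide (0 ≤ p.2) && decide (p.2 < (w : Int))

/-- Shift every pixel by (dr, dc); fail if any pixel leaves the grid. -/
def translateSketch (h w : Nat) (obj : SketchObject) (dr dc : Int) :
    Option SketchObject :=
  let moved := obj.pixels.map fun p => (p.1 + dr, p.2 + dc)
  if moved.all (inGrid h w) then some { obj with pixels := moved } else none

/-- Change only the colour field. -/
def recolorSketch (obj : SketchObject) (c : Nat) : SketchObject :=
  { obj with color := c }

-- Recolor leaves the pixel set untouched and sets the colour: both hold by `rfl`.
theorem recolor_pixels (obj : SketchObject) (c : Nat) :
    (recolorSketch obj c).pixels = obj.pixels := rfl

theorem recolor_color (obj : SketchObject) (c : Nat) :
    (recolorSketch obj c).color = c := rfl

def demo : SketchObject := { pixels := [(0, 0), (0, 1)], color := 3 }

#eval (translateSketch 30 30 demo 5 5).isSome      -- true: both pixels stay on the grid
#eval (translateSketch 30 30 demo 0 29).isSome     -- false: (0, 30) is out of bounds
#eval (translateSketch 30 30 demo 0 (-1)).isSome   -- false: (0, -1) is out of bounds
```

Because recolor never touches `pixels`, any property stated purely over the pixel set carries over unchanged. The hard obligations remain the ones the spec already flags: connectivity under translation, and maximality, which depends on the surrounding grid.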
      

Run 5 hypothesis: 009d5c81 goes 83.7% → 100% when OVERSEER outputs translate_object obj 0 2 and Lean verifies bounds before applying.
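
The "verify before applying" step in this hypothesis can be sketched as a gate in Lean 4: OVERSEER emits a verb plus arguments rather than pixels, and the verb is applied only if the bounds check passes. Everything named here is hypothetical (`SketchObject`, `Verb`, `applyVerb`, `applyOrKeep`), the demo object is made up rather than taken from 009d5c81, and keeping the original object on rejection is one possible policy, not a documented one.

```lean
-- Hypothetical stand-ins: neither `SketchObject` nor `Verb` appears in the original design.
structure SketchObject where
  pixels : List (Int × Int)   -- (row, col)
  color  : Nat

/-- What OVERSEER is allowed to emit: a verb name plus arguments, never pixels. -/
inductive Verb where
  | translate (dr dc : Int)
  | recolor (c : Nat)

/-- True when a pixel lies inside an `h × w` grid. -/
def inGrid (h w : Nat) (p : Int × Int) : Bool :=
  decide (0 ≤ p.1) && decide (p.1 < (h : Int)) &&
    decide (0 ≤ p.2) && decide (p.2 < (w : Int))

/-- Apply a verb; `none` means the bounds check rejected it. -/
def applyVerb (h w : Nat) (obj : SketchObject) : Verb → Option SketchObject
  | .translate dr dc =>
    let moved := obj.pixels.map fun p => (p.1 + dr, p.2 + dc)
    if moved.all (inGrid h w) then some { obj with pixels := moved } else none
  | .recolor c => some { obj with color := c }

/-- Gate (assumed policy): keep the original object when the verb is rejected. -/
def applyOrKeep (h w : Nat) (obj : SketchObject) (v : Verb) : SketchObject :=
  (applyVerb h w obj v).getD obj

def demo : SketchObject := { pixels := [(0, 0), (1, 0)], color := 2 }

#eval (applyOrKeep 30 30 demo (.translate 0 2)).pixels    -- [(0, 2), (1, 2)]
#eval (applyOrKeep 30 30 demo (.translate 0 30)).pixels   -- [(0, 0), (1, 0)]: rejected, unchanged
```

With this shape the model can only be wrong about which verb and which offsets to pick. It can no longer produce a malformed object, which is the failure mode the raw-grid runs kept hitting.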

RICK'S LAW — PERMANENT FLEET STANDARD

      "Weak Signal = Zero Capture. You cannot P_c your way past weak S."

      The LLM has strong P_c (synthesis capability) and catastrophically weak S (spatial signal).

      The Optic Nerve is the S booster. Object descriptors are strong signal.

      Run 4 is the evidence so far: same model, same hardware, stronger signal → 3.5× better cell match.