ARC-AGI-1 — The Benchmark AI Cracked (Sort Of)
2019–2024 · 800 PUBLIC TASKS · SEMI-PRIVATE o3 SCORE · EOSE ANALYSIS
800 · Public Eval Tasks · 0 AI completions on public eval
N/A · Frontier AI (Public) · Opus / GPT-5 / Grok / Gemini all 0 on public
87.5% · o3 Semi-Private · 100-task semi-private set · est. $456k · high compute
~100% · Human Baseline · Fluid reasoning · no compute cost
The Semi-Private Distinction Matters
o3-preview's celebrated 87.5% score was achieved on a separate 100-task semi-private evaluation set, not the 800-task public eval, at an estimated cost of $456k per 100 tasks in its high-compute configuration, approximately $4,560 per task. The low-compute configuration scored 75.7% on the same semi-private set. The 800 public tasks remain effectively unsolved by frontier AI models. This is not "AI solved ARC-AGI-1"; it is "AI solved a specific subset at extraordinary cost."
ARC-AGI-1 · 800 tasks · Public Tasks · You Are Here
ARC-AGI-2 · 1,120 tasks · The Current Wall
ARC-AGI-3 · 25 tasks · Interactive · 2026
◈ Model Leaderboard
# | Model | Public Eval (800) | Semi-Private (100) | Eval Set | Notes
1 | o3-preview · OpenAI · high compute · est. $456k/100 tasks | N/A | 87.5% | 100 tasks (semi-private) | High compute, not public eval
2 | o3-preview · OpenAI · low compute | N/A | 75.7% | 100 tasks (semi-private) | Low-compute version
3 | Gemini 2.0 Flash · Google · flash reasoning | N/A | ~72% | Private eval | Reported by Google
4 | Claude 3.7 Sonnet · Anthropic · extended thinking | N/A | ~62% | Private eval | Extended thinking enabled
5 | o3-released · OpenAI · public release | N/A | ~60% | Private eval | Lower than preview
6 | GPT-4o · OpenAI · standard | N/A | ~57% | Private eval | No extended reasoning
7 | Llama-405B · Meta · open weights | N/A | ~5% | Community | Community benchmarks
- | Anthropic Opus 4.6 · current frontier · public 800-task eval | N/A | N/A | 800 public | 0 completions on public eval
- | GPT-5.4 High · OpenAI · current frontier | N/A | N/A | 800 public | 0 completions on public eval
- | Grok 4.20 Beta · xAI · beta reasoning | N/A | N/A | 800 public | 0 completions on public eval
- | Gemini 3.1 Pro · Google · preview reasoning | N/A | N/A | 800 public | 0 completions on public eval
- | EOSE TRIME⚡ · 3-brain swarm · Bond Library + DESEOF | Running… | TBD | Active | Analysis in progress
- | Human Baseline · average adult · no compute cost | ~100% | ~100% | Both | Fluid reasoning, native
◈ Public Eval Tasks — First 25 of 800
[Interactive task table · columns: Task ID · Human · Anthropic Opus · GPT-5.4 · Gemini 3.1 · TRIME⚡ · Gap · Replay · rows render in the live view]
…and 775 more tasks in the public eval set
◈ EOSE TRIME Analysis — What ARC-AGI-1 Tells Us
The Real Score
o3's 87.5% is genuinely impressive — and genuinely misleading. It required ~$4,560 per task on a 100-task semi-private set. Scale that to the full 800-task public eval and you're looking at $3.6M for one benchmark pass. Humans solve these in seconds for free. The gap isn't closing — it's being papered over with compute.
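A quick back-of-the-envelope check of that scaling claim. The only inputs are the figures quoted above ($456k per 100 semi-private tasks, 800 public tasks); the variable names are ours:

```python
# Back-of-the-envelope cost scaling using the figures quoted in the text.
SEMI_PRIVATE_COST_USD = 456_000   # est. total for the 100-task high-compute run
SEMI_PRIVATE_TASKS = 100
PUBLIC_EVAL_TASKS = 800

cost_per_task = SEMI_PRIVATE_COST_USD / SEMI_PRIVATE_TASKS   # $4,560 per task
full_public_cost = cost_per_task * PUBLIC_EVAL_TASKS         # $3,648,000

print(f"per task:  ${cost_per_task:,.0f}")        # per task:  $4,560
print(f"800 tasks: ${full_public_cost:,.0f}")     # 800 tasks: $3,648,000 (~$3.6M)
```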
What the Tasks Reveal
ARC-AGI-1 tasks require: object permanence, symmetry detection, rule induction from 2–5 examples, and spatial reasoning — the exact capabilities EOSE TRIME models via the Bond Library (structural invariants) and DESEOF (causal flow). The fact that ARC-AGI-1 is now "superseded" doesn't mean it's solved — it means the benchmark moved before AI could genuinely pass it.
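To make "rule induction from 2–5 examples" concrete, here is a minimal sketch of the ARC task shape: grids are small integer arrays, a task gives a few (input, output) training pairs, and a solver must find a transformation consistent with every pair. The tiny hypothesis space and function names below are our illustration of the format, not the EOSE TRIME pipeline or any official ARC solver:

```python
import numpy as np

# A toy hypothesis space: a rule is accepted only if it reproduces
# *every* training output (induction from a handful of examples).
CANDIDATES = {
    "identity":  lambda g: g,
    "flip_lr":   lambda g: np.fliplr(g),   # left-right mirror symmetry
    "flip_ud":   lambda g: np.flipud(g),
    "rotate_90": lambda g: np.rot90(g),
    "transpose": lambda g: g.T,
}

def induce_rule(train_pairs):
    """Return the first candidate consistent with all training pairs, else None."""
    for name, fn in CANDIDATES.items():
        if all(np.array_equal(fn(x), y) for x, y in train_pairs):
            return name, fn
    return None

# Toy task: every training pair is a left-right mirror.
train = [
    (np.array([[1, 0], [2, 0]]), np.array([[0, 1], [0, 2]])),
    (np.array([[3, 3, 0]]),      np.array([[0, 3, 3]])),
]
rule = induce_rule(train)
if rule:
    name, fn = rule
    print(name, fn(np.array([[5, 0, 0]])))   # flip_lr [[0 0 5]]
```

Real ARC-AGI-1 tasks draw from a vastly larger, unenumerated rule space, which is exactly why naive search like this fails and brute-force compute gets expensive.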
Floor Mapping
Most ARC-AGI-1 tasks map to FoundFloor and Bond Library patterns in EOSE TRIME. The PRIME-1 perspective (structural) handles ~60% of task types. PRIME-2 (causal) handles transformation chains. PRIME-3 (absence) detects what patterns are deliberately excluded. Convergence via yONE produces correct answers in TRIME simulation — but TRIME hasn't run on the full 800-task eval yet.
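The text does not specify how yONE convergence is computed. Purely as an illustration of the three-perspective idea, here is a hypothetical sketch in which each PRIME proposes an answer with a confidence and convergence means majority agreement. PRIME-1/2/3 and yONE are names from the text; every signature and the voting logic below are our assumption, not the EOSE TRIME API:

```python
from collections import Counter
from typing import Callable

def yone_converge(task, primes: list[Callable]):
    """Hypothetical convergence: each perspective returns (answer, confidence);
    accept an answer when at least 2 of 3 perspectives agree on it."""
    proposals = [p(task) for p in primes]              # [(answer, conf), ...]
    votes = Counter(ans for ans, _ in proposals)
    answer, count = votes.most_common(1)[0]
    if count >= 2:                                     # 2-of-3 agreement
        confs = [c for a, c in proposals if a == answer]
        return answer, sum(confs) / len(confs)
    return None, 0.0                                   # no convergence

# Stub perspectives (illustrative only): structural, causal, absence.
prime_1 = lambda t: ("mirror", 0.9)    # structural invariant found
prime_2 = lambda t: ("mirror", 0.7)    # causal chain agrees
prime_3 = lambda t: ("recolor", 0.4)   # absence view dissents

print(yone_converge("task_001", [prime_1, prime_2, prime_3]))
# -> ('mirror', 0.8): two perspectives converged on 'mirror'
```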
The $456k Question
At $456k for 100 tasks, o3's ARC-AGI-1 performance is not intelligence; it's compute arbitrage. The benchmark asked "can you reason like a human?" and the answer was "yes, if you spend $456k and the test is a smaller semi-private subset." ARC-AGI-2 was built specifically to close that compute-escape hatch. ARC-AGI-3 closes it further with interactive environments that resist brute-force search.