ARC-AGI-1 — The Benchmark AI Cracked (Sort Of)
2019–2024 · 800 PUBLIC TASKS · SEMI-PRIVATE o3 SCORE · EOSE ANALYSIS
800 · Public Eval Tasks · 0 AI completions on public eval
N/A · Frontier AI (Public) · Opus / GPT-5 / Grok / Gemini all 0 on public
87.5% · o3 Semi-Private · 100-task semi-private set · est. $456k · high compute
~100% · Human Baseline · Fluid reasoning · no compute cost
The Semi-Private Distinction Matters
o3-preview's celebrated 87.5% score was achieved on a separate 100-task semi-private evaluation set, not the 800-task public eval, at an estimated cost of $456k per 100 tasks in its high-compute configuration, approximately $4,560 per task. The low-compute configuration scored 75.7% on the same semi-private set. The 800 public tasks remain effectively unsolved by frontier AI models. This is not "AI solved ARC-AGI-1"; it is "AI solved a specific subset at extraordinary cost."
ARC-AGI-1 · 800 tasks · Public Tasks · You Are Here
ARC-AGI-2 · 1,120 tasks · The Current Wall
ARC-AGI-3 · 25 tasks · Interactive · 2026
◈ Model Leaderboard
# | Model | Public Eval (800) | Semi-Private (100) | Eval Set | Notes
1 | o3-preview · OpenAI · high compute · est. $456k/100 tasks | N/A | 87.5% | 100 tasks (semi-private) | High compute, not public eval
2 | o3-preview · OpenAI · low compute | N/A | 75.7% | 100 tasks (semi-private) | Low-compute version
3 | Gemini 2.0 Flash · Google · flash reasoning | N/A | ~72% | Private eval | Reported by Google
4 | Claude 3.7 Sonnet · Anthropic · extended thinking | N/A | ~62% | Private eval | Extended thinking enabled
5 | o3-released · OpenAI · public release | N/A | ~60% | Private eval | Lower than preview
6 | GPT-4o · OpenAI · standard | N/A | ~57% | Private eval | No extended reasoning
7 | Llama-405B · Meta · open weights | N/A | ~5% | Community | Community benchmarks
- | Anthropic Opus 4.6 · current frontier · public 800-task eval | N/A | N/A | 800 public | 0 completions on public eval
- | GPT-5.4 High · OpenAI · current frontier | N/A | N/A | 800 public | 0 completions on public eval
- | Grok 4.20 Beta · xAI · beta reasoning | N/A | N/A | 800 public | 0 completions on public eval
- | Gemini 3.1 Pro · Google · preview reasoning | N/A | N/A | 800 public | 0 completions on public eval
- | EOSE TRIME⚡ · 3-brain swarm · Bond Library + DESEOF | Running… | TBD | Active | Analysis in progress
- | Human Baseline · average adult · no compute cost | ~100% | ~100% | Both | Fluid reasoning, native
◈ Public Eval Tasks — First 25 of 800
[Interactive task table · columns: Task ID · Human · Anthropic Opus · GPT-5.4 · Gemini 3.1 · TRIME⚡ · Gap · Replay · rows render in the live view]
…and 775 more tasks in the public eval set
◈ EOSE TRIME Analysis — What ARC-AGI-1 Tells Us
The Real Score
o3's 87.5% is genuinely impressive — and genuinely misleading. It required ~$4,560 per task on a 100-task semi-private set. Scale that to the full 800-task public eval and you're looking at $3.6M for one benchmark pass. Humans solve these in seconds for free. The gap isn't closing — it's being papered over with compute.
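A quick back-of-the-envelope check of that scaling claim. The only inputs are the figures quoted above ($456k per 100 semi-private tasks, 800 public tasks); the variable names are ours:

```python
# Back-of-the-envelope cost scaling using the figures quoted in the text.
SEMI_PRIVATE_COST_USD = 456_000   # est. total for the 100-task high-compute run
SEMI_PRIVATE_TASKS = 100
PUBLIC_EVAL_TASKS = 800

cost_per_task = SEMI_PRIVATE_COST_USD / SEMI_PRIVATE_TASKS   # $4,560 per task
full_public_cost = cost_per_task * PUBLIC_EVAL_TASKS         # $3,648,000

print(f"per task:  ${cost_per_task:,.0f}")        # per task:  $4,560
print(f"800 tasks: ${full_public_cost:,.0f}")     # 800 tasks: $3,648,000 (~$3.6M)
```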
What the Tasks Reveal
ARC-AGI-1 tasks require: object permanence, symmetry detection, rule induction from 2–5 examples, and spatial reasoning — the exact capabilities EOSE TRIME models via the Bond Library (structural invariants) and DESEOF (causal flow). The fact that ARC-AGI-1 is now "superseded" doesn't mean it's solved — it means the benchmark moved before AI could genuinely pass it.
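To make "rule induction from 2–5 examples" concrete, here is a minimal sketch of the ARC task shape: grids are small integer arrays, a task gives a few (input, output) training pairs, and a solver must find a transformation consistent with every pair. The tiny hypothesis space and function names below are our illustration of the format, not the EOSE TRIME pipeline or any official ARC solver:

```python
import numpy as np

# A toy hypothesis space: a rule is accepted only if it reproduces
# *every* training output (induction from a handful of examples).
CANDIDATES = {
    "identity":  lambda g: g,
    "flip_lr":   lambda g: np.fliplr(g),   # left-right mirror symmetry
    "flip_ud":   lambda g: np.flipud(g),
    "rotate_90": lambda g: np.rot90(g),
    "transpose": lambda g: g.T,
}

def induce_rule(train_pairs):
    """Return the first candidate consistent with all training pairs, else None."""
    for name, fn in CANDIDATES.items():
        if all(np.array_equal(fn(x), y) for x, y in train_pairs):
            return name, fn
    return None

# Toy task: every training pair is a left-right mirror.
train = [
    (np.array([[1, 0], [2, 0]]), np.array([[0, 1], [0, 2]])),
    (np.array([[3, 3, 0]]),      np.array([[0, 3, 3]])),
]
rule = induce_rule(train)
if rule:
    name, fn = rule
    print(name, fn(np.array([[5, 0, 0]])))   # flip_lr [[0 0 5]]
```

Real ARC-AGI-1 tasks draw from a vastly larger, unenumerated rule space, which is exactly why naive search like this fails and brute-force compute gets expensive.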
Floor Mapping
Most ARC-AGI-1 tasks map to FoundFloor and Bond Library patterns in EOSE TRIME. The PRIME-1 perspective (structural) handles ~60% of task types. PRIME-2 (causal) handles transformation chains. PRIME-3 (absence) detects what patterns are deliberately excluded. Convergence via yONE produces correct answers in TRIME simulation — but TRIME hasn't run on the full 800-task eval yet.
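The text does not specify how yONE convergence is computed. Purely as an illustration of the three-perspective idea, here is a hypothetical sketch in which each PRIME proposes an answer with a confidence and convergence means majority agreement. PRIME-1/2/3 and yONE are names from the text; every signature and the voting logic below are our assumption, not the EOSE TRIME API:

```python
from collections import Counter
from typing import Callable

def yone_converge(task, primes: list[Callable]):
    """Hypothetical convergence: each perspective returns (answer, confidence);
    accept an answer when at least 2 of 3 perspectives agree on it."""
    proposals = [p(task) for p in primes]              # [(answer, conf), ...]
    votes = Counter(ans for ans, _ in proposals)
    answer, count = votes.most_common(1)[0]
    if count >= 2:                                     # 2-of-3 agreement
        confs = [c for a, c in proposals if a == answer]
        return answer, sum(confs) / len(confs)
    return None, 0.0                                   # no convergence

# Stub perspectives (illustrative only): structural, causal, absence.
prime_1 = lambda t: ("mirror", 0.9)    # structural invariant found
prime_2 = lambda t: ("mirror", 0.7)    # causal chain agrees
prime_3 = lambda t: ("recolor", 0.4)   # absence view dissents

print(yone_converge("task_001", [prime_1, prime_2, prime_3]))
# -> ('mirror', 0.8): two perspectives converged on 'mirror'
```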
The $456k Question
At $456k for 100 tasks, o3's ARC-AGI-1 performance is not intelligence; it's compute arbitrage. The benchmark asked "can you reason like a human?" and the answer was "yes, if you spend $456k and the test is a smaller semi-private subset." ARC-AGI-2 was built specifically to close that compute-escape hatch. ARC-AGI-3 closes it further with interactive environments that resist brute-force search.