25 REAL GAMES · LIVE FROM API · RHAE SCORING · EOSE LABS
γ₁ = 14.134725141734693
ARC-AGI-3 ≠ ARC-AGI-1/2. AGI-1 and AGI-2 are static grid tasks scored by exact-match percentage; AGI-3 is interactive games — the AI is an agent that plays 25 environments, scored by RHAE (action efficiency vs. a human baseline). These are completely different tests. Our frontier scorecard shows 0 real agent runs yet — the Apr 5 runs were framework tests with model:none.
25
Real Games
AR25 · BP35 · CD82 · CN04… live from API
183
Total Levels
6–10 levels per game · later levels carry more weight
RHAE
Scoring Method
(human_actions / ai_actions)² per level · capped 1.15×
0
Real Agent Runs
Apr 5 = model:none framework tests · no real score yet
55.5%
MindsAI 2024
ARC Prize $1M winner · target to beat
0–3%
Frontier 2025
All new models stuck here · 2026 leaderboard
Per-Level Score
min(1.15, (human_baseline / ai_actions)²)
AI takes 2× human → 0.25. AI takes 10× human → 0.01. Fast AI → up to 1.15
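The per-level formula can be sketched in Python. The function name, the `completed` flag, and the zero-on-no-completion rule (the "Trap" case below) are illustrative assumptions, not the official harness API:

```python
def level_score(human_baseline: int, ai_actions: int,
                completed: bool = True, cap: float = 1.15) -> float:
    """min(cap, (human_baseline / ai_actions)^2); 0 if the level
    was never completed (assumed, consistent with the 'Trap' card)."""
    if not completed or ai_actions <= 0:
        return 0.0
    return min(cap, (human_baseline / ai_actions) ** 2)

print(level_score(100, 200))                  # 2x the human actions -> 0.25
print(level_score(100, 1000))                 # 10x -> ~0.01
print(level_score(100, 80))                   # faster than human, capped -> 1.15
print(level_score(100, 55, completed=False))  # hit action limit, no completion -> 0.0
```

Squaring the ratio is what makes the metric punishing: doubling the human's action count doesn't halve the score, it quarters it.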
Per-Game Score
Σ(level_score[i] × (i+1)) / Σ(i+1)
Level 1 weight=1, level 2 weight=2… later = harder = more weight
Total Score
mean(game_scores) × 100%
Average across all 25 games. Must complete ALL levels for 100% on a game
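The two aggregation steps above, as a minimal sketch (function and variable names are illustrative; the official harness may differ):

```python
def game_score(level_scores: list[float]) -> float:
    """Weighted mean: level i (1-indexed) carries weight i,
    so later (harder) levels count more."""
    weights = range(1, len(level_scores) + 1)
    return sum(s * w for s, w in zip(level_scores, weights)) / sum(weights)

def total_score(game_scores: list[float]) -> float:
    """Plain mean across all games, expressed as a percentage."""
    return 100.0 * sum(game_scores) / len(game_scores)

# A 3-level game: perfect level 1, half-efficiency level 2, no completion on 3.
print(game_score([1.0, 0.25, 0.0]))  # (1*1 + 0.25*2 + 0*3) / 6 = 0.25
print(total_score([0.25] * 25))      # every game at 0.25 -> 25.0
```

Note the asymmetry: levels are weighted within a game, but games are averaged equally, so one unsolved hard game drags the total down no matter how well the others go.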
The Trap
55 actions, 0 completions → 0%
Our Apr 5 runs hit the action limit without completing any levels → score 0
EOSE TARGET (AGI-3)
TBD
MindsAI 2024
55.5%
Jack Cole #1
30%
Opus 4.7 High
~3.1%
GPT-5.5 High
~2.9%
Grok 4.20 Beta
~2.4%
Gemini 3.1 Pro
~2.1%
EOSE Apr 5 runs
0%
γ₁ anchor
14.134
FT09 human baseline
208 actions
WA30 human baseline
1,843 actions
⬡ BONSAI HELIX — 25 REAL GAMES ON THE TRACK · HEIGHT = HUMAN BASELINE ACTIONS (HARDER = TALLER)
click games (8)
keyboard games (4)
keyboard+click games (13)
hard (avg >100 actions/level)
easy (avg ≤40 actions/level)
height = total human actions · γ₁=14.134
⬡ EOSE POSITION — ARC-AGI-3 · THE REAL PICTURE
The Apr 5 scorecard runs were framework tests. model:none, provider:none — no AI was actually playing. They hit the action limit (55) without completing any levels. Score: 0. That's not our baseline, that's us proving the plumbing works.

RHAE scoring is brutal. Taking 10× the human actions on a level yields a 1% score. Our 55-action runs — on games whose human baselines are themselves around 55 actions — completed nothing, so they score 0. The scoring rewards efficiency, not just completion.

The real path: build an agent that actually understands each game's rules from level 1, then plays levels 2–N efficiently. FT09 (208 human actions, no special input tags) is the starting point. WA30 (1,843 human actions, keyboard-only) is the hardest. MindsAI's 55.5% won the $1M ARC Prize — that's the mark to beat.
⬡ 25 ARC-AGI-3 GAMES — REAL DATA FROM API · 2026-05-02
Title · Levels · Human Actions · Avg/Level · Tags · Difficulty · Baseline Profile