25 REAL GAMES · LIVE FROM API · RHAE SCORING · EOSE LABS
γ₁ = 14.134725141734693
ARC-AGI-3 ≠ ARC-AGI-1/2. AGI-1 and AGI-2 are static grid tasks scored by exact-match percentage; AGI-3 is interactive games — the AI is an agent that plays 25 environments, scored by RHAE (action efficiency vs. a human baseline). These are completely different tests. Our frontier scorecard shows 0 real agent runs yet — the Apr 5 runs were framework tests with model:none.
25
Real Games
AR25 · BP35 · CD82 · CN04… live from API
183
Total Levels
6–10 levels per game · later levels carry more weight
RHAE
Scoring Method
(human_actions / ai_actions)² per level · capped 1.15×
0
Real Agent Runs
Apr 5 = model:none framework tests · no real score yet
55.5%
MindsAI 2024
ARC Prize $1M winner · target to beat
0–3%
Frontier 2025
All new models stuck here · 2026 leaderboard
Per-Level Score
min(1.15, (human_baseline / ai_actions)²)
AI takes 2× human → 0.25. AI takes 10× human → 0.01. Fast AI → up to 1.15
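The per-level formula can be sketched in Python. The function name, the `completed` flag, and the zero-on-no-completion rule (the "Trap" case below) are illustrative assumptions, not the official harness API:

```python
def level_score(human_baseline: int, ai_actions: int,
                completed: bool = True, cap: float = 1.15) -> float:
    """min(cap, (human_baseline / ai_actions)^2); 0 if the level
    was never completed (assumed, consistent with the 'Trap' card)."""
    if not completed or ai_actions <= 0:
        return 0.0
    return min(cap, (human_baseline / ai_actions) ** 2)

print(level_score(100, 200))                  # 2x the human actions -> 0.25
print(level_score(100, 1000))                 # 10x -> ~0.01
print(level_score(100, 80))                   # faster than human, capped -> 1.15
print(level_score(100, 55, completed=False))  # hit action limit, no completion -> 0.0
```

Squaring the ratio is what makes the metric punishing: doubling the human's action count doesn't halve the score, it quarters it.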
Per-Game Score
Σ(level_score[i] × (i+1)) / Σ(i+1)
Level 1 weight=1, level 2 weight=2… later = harder = more weight
Total Score
mean(game_scores) × 100%
Average across all 25 games. Must complete ALL levels for 100% on a game
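The two aggregation steps above, as a minimal sketch (function and variable names are illustrative; the official harness may differ):

```python
def game_score(level_scores: list[float]) -> float:
    """Weighted mean: level i (1-indexed) carries weight i,
    so later (harder) levels count more."""
    weights = range(1, len(level_scores) + 1)
    return sum(s * w for s, w in zip(level_scores, weights)) / sum(weights)

def total_score(game_scores: list[float]) -> float:
    """Plain mean across all games, expressed as a percentage."""
    return 100.0 * sum(game_scores) / len(game_scores)

# A 3-level game: perfect level 1, half-efficiency level 2, no completion on 3.
print(game_score([1.0, 0.25, 0.0]))  # (1*1 + 0.25*2 + 0*3) / 6 = 0.25
print(total_score([0.25] * 25))      # every game at 0.25 -> 25.0
```

Note the asymmetry: levels are weighted within a game, but games are averaged equally, so one unsolved hard game drags the total down no matter how well the others go.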
The Trap
55 actions, 0 completions → 0%
Our Apr 5 runs hit the action limit without completing any levels → score 0
EOSE TARGET (AGI-3)
TBD
MindsAI 2024
55.5%
Jack Cole #1
30%
Opus 4.7 High
~3.1%
GPT-5.5 High
~2.9%
Grok 4.20 Beta
~2.4%
Gemini 3.1 Pro
~2.1%
EOSE Apr 5 runs
0%
γ₁ anchor
14.134
FT09 human baseline
208 actions
WA30 human baseline
1,843 actions
⬡ BONSAI HELIX — 25 REAL GAMES ON THE TRACK · HEIGHT = HUMAN BASELINE ACTIONS (HARDER = TALLER)
click games (8)
keyboard games (4)
keyboard+click games (13)
hard (avg >100 actions/level)
easy (avg ≤40 actions/level)
height = total human actions · γ₁=14.134
⬡ EOSE POSITION — ARC-AGI-3 · THE REAL PICTURE
The Apr 5 scorecard runs were framework tests. model:none, provider:none — no AI was actually playing. They hit the action limit (55) without completing any levels. Score: 0. That's not our baseline, that's us proving the plumbing works.

RHAE scoring is brutal. Taking 10× the human actions on a level yields a 1% score. Our 55-action runs — on games whose human baselines are themselves around 55 actions — completed nothing, so they score 0. The scoring rewards efficiency, not just completion.

The real path: build an agent that actually understands each game's rules from level 1, then plays levels 2–N efficiently. FT09 (208 human actions, no special input tags) is the starting point. WA30 (1,843 human actions, keyboard-only) is the hardest. MindsAI's 55.5% won the $1M ARC Prize — that's the mark to beat.
⬡ 25 ARC-AGI-3 GAMES — REAL DATA FROM API · 2026-05-02
Title · Levels · Human Actions · Avg/Level · Tags · Difficulty · Baseline Profile