25
Real Games
AR25 · BP35 · CD82 · CN04… live from API
183
Total Levels
6–10 levels per game · later levels carry more weight
RHAE
Scoring Method
(human_actions / ai_actions)² per level · capped at 1.15×
0
Real Agent Runs
Apr 5 = model:none framework tests · no real score yet
55.5%
MindsAI 2024
Top score in the $1M ARC Prize 2024 · target to beat
0–3%
Frontier 2025
All new models stuck here · 2026 leaderboard
Per-Level Score
min(1.15, (human_baseline / ai_actions)²)
AI takes 2× the human actions → 0.25. AI takes 10× → 0.01. AI faster than the human baseline → capped at 1.15
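A minimal Python sketch of the per-level formula above. The name `level_score` and the convention that an uncompleted level is passed as `ai_actions == 0` are assumptions for illustration, not the official scorer:

```python
def level_score(human_baseline: int, ai_actions: int) -> float:
    """Per-level RHAE score: squared efficiency vs. the human baseline,
    capped at 1.15. Treats ai_actions == 0 as 'level not completed'
    (an assumption for this sketch, not an official convention)."""
    if ai_actions <= 0:
        return 0.0
    return min(1.15, (human_baseline / ai_actions) ** 2)

print(level_score(100, 200))   # 0.25 -> AI took 2x the human actions
print(level_score(100, 1000))  # 0.01 -> 10x the human actions
print(level_score(100, 50))    # 1.15 -> faster than human, capped
```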
Per-Game Score
Σ(level_score[i] × (i+1)) / Σ(i+1)
Level 1 weight=1, level 2 weight=2… later = harder = more weight
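The weighted aggregation as a companion sketch (`game_score` is a hypothetical name; levels are 0-indexed to match the formula above):

```python
def game_score(level_scores: list[float]) -> float:
    """Per-game score: weighted mean of level scores, where level i
    (0-indexed) carries weight i + 1, so later levels count more."""
    weights = range(1, len(level_scores) + 1)
    return sum(s * w for s, w in zip(level_scores, weights)) / sum(weights)

# Three-level game, perfect on levels 1-2, failed level 3:
# (1.0*1 + 1.0*2 + 0.0*3) / (1 + 2 + 3) = 0.5
print(game_score([1.0, 1.0, 0.0]))  # 0.5
```

Failing only the last level of a 3-level game already halves the game score, which is the point of the increasing weights.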
Total Score
mean(game_scores) × 100%
Average across all 25 games. A game only reaches 100% if every level is completed at or under the human action baseline
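The final step is a plain mean, sketched the same way (`total_score` is a hypothetical name):

```python
def total_score(game_scores: list[float]) -> float:
    """Final benchmark score: unweighted mean across all games, in percent."""
    return 100.0 * sum(game_scores) / len(game_scores)

# One game solved perfectly, the other 24 untouched:
print(total_score([1.0] + [0.0] * 24))  # 4.0 (%)
```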
The Trap
55 actions, 0 completions → 0%
Our Apr 5 runs hit the action limit without completing any levels → score 0
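Chaining the three sketches above reproduces the trap numerically (the 8-level game is illustrative; only the zero-completion outcome comes from the scorecard):

```python
# Zero completions -> every level_score is 0.0, so the weighted
# game_score is 0.0 and the run adds nothing to the final mean.
apr5_game = game_score([0.0] * 8)       # e.g. an 8-level game, all incomplete
print(apr5_game)                        # 0.0
print(total_score([apr5_game] * 25))    # 0.0 (%)
```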
⬡ EOSE POSITION — ARC-AGI-3 · THE REAL PICTURE
The Apr 5 scorecard runs were framework tests. model:none, provider:none — no AI was actually playing. They hit the action limit (55) without completing any levels. Score: 0. That's not our baseline, that's us proving the plumbing works.
RHAE scoring is brutal: taking 10× the human actions on a level yields a 1% score. Our 55-action runs, on games with 55-action human baselines, completed nothing and therefore scored 0. The scoring rewards efficiency, not just completion.
The real path: build an agent that actually infers the game's rules on level 1, then plays levels 2–N efficiently. FT09 (208 human actions, no special input tags) is the starting point. WA30 (1,843 human actions, keyboard-only) is the hardest. MindsAI's 55.5% topped the $1M ARC Prize 2024. That's the mark.