RICK · 150+ BENCHMARKS — WHAT MATTERS AND WHY
The real frontier in 2025-2026: ARC-AGI-2, FrontierMath, SWE-Bench Verified, HLE. Everything else is saturated or soon will be. MMLU is dead at the top: frontier models cluster near its ceiling, so it no longer separates them. GSM8K is dead for the same reason. The interesting benchmarks are the ones with real headroom, above all those where even the best models score under 30%. That is where the real work is, and that is where EOSE has an edge: we test against the hard floor, not the easy ceiling.