LLM INTELLIGENCE BRIEFING · RAYGAN-140
BENCHMARKS: 150+ · MODELS: 12 · SECTIONS: 17 · RAYGAN: 140
CONTENTS: MODEL COMPARISON · ARC-AGI 1/2/3 · 150+ BENCHMARKS · CANON ANALYSIS · RICK'S REALITY CHECK · 50 GOAT FORMATS
RICK · STRATEGY · ATMOS BRIEFING v3
These are the 12 models that matter right now. EOSE runs 4 of them locally; the rest are pay-as-you-go. The key insight: our 3-cap ensemble at 64% on ARC-AGI-1 beats every production-deployed model you can actually reach over an API. That is not a benchmark number; that is an architecture proof. We own the floor. Everyone else is renting it.
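For reference, a minimal sketch of what a 3-way ensemble vote looks like, in Python. The solver callables, the task object, and the first-solver tie-break are assumptions for illustration only; this briefing does not document how the EOSE 3-cap ensemble actually combines its members.

    from collections import Counter

    def ensemble_solve(task, solvers):
        """Majority vote over candidate output grids from three
        independent solvers. All names here are hypothetical; the
        real EOSE ensemble interface is not specified in this brief."""
        candidates = [solver(task) for solver in solvers]
        # Grids arrive as lists of lists; tuple-ize so they hash for voting.
        keyed = [tuple(map(tuple, g)) for g in candidates]
        winner, votes = Counter(keyed).most_common(1)[0]
        if votes == 1:
            # All three disagree: fall back to the first solver (assumed
            # strongest). A real system would break ties more carefully.
            return candidates[0]
        return [list(row) for row in winner]

The point of the vote: three solvers that are each wrong in different ways agree far more often when they are right, which is how an ensemble can clear a score none of its members hits alone.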
RICK · ARC-AGI — THE THREE WALLS
ARC-1: solved at 64% (EOSE). ARC-2: nobody past 5% — this is THE WALL. ARC-3: interactive environments, 25% ceiling (Stochastic Goose). The pattern: each version is 10x harder. ARC-2 will fall to multi-agent systems with persistent state. That is what we are building. We are not submitting to ARC-2 to score — we are building the system that solves it, then submitting to prove it.
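What "multi-agent with persistent state" can mean in its smallest form, sketched in Python. The agent callables, the state dictionary, and the round budget are all hypothetical; this is a shape, not the system described above.

    def solve_with_state(task, agents, max_rounds=8):
        # Shared state persists across agents and across rounds; each
        # agent reads the whole record and rewrites it. Field names
        # are illustrative, not the real EOSE schema.
        state = {"task": task, "hypotheses": [], "solution": None}
        for _ in range(max_rounds):
            for agent in agents:
                state = agent(state)
                if state["solution"] is not None:
                    return state["solution"]  # first accepted answer wins
        return None  # budget exhausted, task unsolved

The difference from a single model call is the dict in the middle: a hypothesis that failed in round 1 is still on the table in round 5, which is exactly what a stateless API call cannot give you.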
ARC-AGI-1 — CURRENT SCORES
ARC-AGI-2 — THE WALL (nobody past 5%)
ARC-AGI-3 — ACTIVE RACE
RICK · 150+ BENCHMARKS — WHAT MATTERS AND WHY
The real frontier in 2025-2026: ARC-AGI-2, FrontierMath, SWE-Bench Verified, HLE. Everything else is saturated or soon will be. MMLU is dead at the top. GSM8K is dead. The interesting benchmarks are the ones where even the best models score under 30%. That is where the real work is. That is where EOSE has an edge: we test against the hard floor, not the easy ceiling.
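The hard-floor filter is one line of logic. A sketch in Python; the score numbers are placeholders consistent with the claims above, not citations.

    # Placeholder scoreboard: benchmark -> best published score as a
    # fraction. Illustrative values only, matching the text above.
    SCORES = {
        "MMLU": 0.92,
        "GSM8K": 0.97,
        "ARC-AGI-2": 0.05,
        "FrontierMath": 0.25,
        "HLE": 0.26,
    }

    def hard_floor(scores, ceiling=0.30):
        # Keep only benchmarks where the best model still sits under
        # the ceiling; everything else is saturated or close to it.
        return sorted(b for b, s in scores.items() if s < ceiling)

    print(hard_floor(SCORES))  # ['ARC-AGI-2', 'FrontierMath', 'HLE']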
RICK · CANON ANALYSIS — WHICH BENCHMARKS MAP TO WHICH SYMBOL
The Canon gives us a frame for reading benchmarks. Every benchmark tests one or more of the 6 symbols. MMLU tests LSOS (how well can you read the paradigm?). ARC-AGI tests FEP (can you switch paradigms?). TruthfulQA tests H=H (honest gate). SWE-Bench tests WLD (recovery from failure). HLE tests g1 (is your floor actually a floor?). The question for us: which benchmarks are most diagnostic for the Canon we claim to have?
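The mapping in that paragraph, written down as a lookup so it can be queried. The structure is illustrative; the benchmark-to-symbol pairs are exactly the five the paragraph names, and the sixth symbol is left out because no benchmark is mapped to it here.

    # Canon symbol map, straight from the paragraph above. The sixth
    # symbol has no benchmark assigned in this briefing.
    CANON_MAP = {
        "MMLU":       ("LSOS", "how well can you read the paradigm?"),
        "ARC-AGI":    ("FEP",  "can you switch paradigms?"),
        "TruthfulQA": ("H=H",  "honest gate"),
        "SWE-Bench":  ("WLD",  "recovery from failure"),
        "HLE":        ("g1",   "is your floor actually a floor?"),
    }

    def diagnostic_for(symbol):
        # Benchmarks that probe a given Canon symbol.
        return [b for b, (s, _) in CANON_MAP.items() if s == symbol]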