LLM INTELLIGENCE BRIEFING · RAYGAN-140
BENCHMARKS: 150+ · MODELS: 12 · SECTIONS: 17 · RAYGAN: 140
CONTENTS: MODEL COMPARISON · ARC-AGI 1/2/3 · 150+ BENCHMARKS · CANON ANALYSIS · RICK'S REALITY CHECK · 50 GOAT FORMATS
RICK · STRATEGY · ATMOS BRIEFING v3
These are the 12 models that matter right now. EOSE runs 4 of them locally; the rest are pay-as-you-go. The key insight: our 3-cap ensemble at 64% on ARC-AGI-1 beats every production-deployed model you can actually reach over an API. That is not a benchmark number; that is an architecture proof. We own the floor. Everyone else is renting it.
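For reference, a minimal sketch of what a 3-way ensemble vote looks like, in Python. The solver callables, the task object, and the first-solver tie-break are assumptions for illustration only; this briefing does not document how the EOSE 3-cap ensemble actually combines its members.

    from collections import Counter

    def ensemble_solve(task, solvers):
        """Majority vote over candidate output grids from three
        independent solvers. All names here are hypothetical; the
        real EOSE ensemble interface is not specified in this brief."""
        candidates = [solver(task) for solver in solvers]
        # Grids arrive as lists of lists; tuple-ize so they hash for voting.
        keyed = [tuple(map(tuple, g)) for g in candidates]
        winner, votes = Counter(keyed).most_common(1)[0]
        if votes == 1:
            # All three disagree: fall back to the first solver (assumed
            # strongest). A real system would break ties more carefully.
            return candidates[0]
        return [list(row) for row in winner]

The point of the vote: three solvers that are each wrong in different ways agree far more often when they are right, which is how an ensemble can clear a score none of its members hits alone.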
RICK · ARC-AGI — THE THREE WALLS
ARC-1: solved at 64% (EOSE). ARC-2: nobody past 5% — this is THE WALL. ARC-3: interactive environments, 25% ceiling (Stochastic Goose). The pattern: each version is 10x harder. ARC-2 will fall to multi-agent systems with persistent state. That is what we are building. We are not submitting to ARC-2 to score — we are building the system that solves it, then submitting to prove it.
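What "multi-agent with persistent state" can mean in its smallest form, sketched in Python. The agent callables, the state dictionary, and the round budget are all hypothetical; this is a shape, not the system described above.

    def solve_with_state(task, agents, max_rounds=8):
        # Shared state persists across agents and across rounds; each
        # agent reads the whole record and rewrites it. Field names
        # are illustrative, not the real EOSE schema.
        state = {"task": task, "hypotheses": [], "solution": None}
        for _ in range(max_rounds):
            for agent in agents:
                state = agent(state)
                if state["solution"] is not None:
                    return state["solution"]  # first accepted answer wins
        return None  # budget exhausted, task unsolved

The difference from a single model call is the dict in the middle: a hypothesis that failed in round 1 is still on the table in round 5, which is exactly what a stateless API call cannot give you.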
ARC-AGI-1 — CURRENT SCORES
ARC-AGI-2 — THE WALL (nobody past 5%)
ARC-AGI-3 — ACTIVE RACE
RICK · 150+ BENCHMARKS — WHAT MATTERS AND WHY
The real frontier in 2025-2026: ARC-AGI-2, FrontierMath, SWE-Bench Verified, HLE. Everything else is saturated or soon will be. MMLU is dead at the top. GSM8K is dead. The interesting benchmarks are the ones where even the best models score under 30%. That is where the real work is. That is where EOSE has an edge: we test against the hard floor, not the easy ceiling.
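The hard-floor filter is one line of logic. A sketch in Python; the score numbers are placeholders consistent with the claims above, not citations.

    # Placeholder scoreboard: benchmark -> best published score as a
    # fraction. Illustrative values only, matching the text above.
    SCORES = {
        "MMLU": 0.92,
        "GSM8K": 0.97,
        "ARC-AGI-2": 0.05,
        "FrontierMath": 0.25,
        "HLE": 0.26,
    }

    def hard_floor(scores, ceiling=0.30):
        # Keep only benchmarks where the best model still sits under
        # the ceiling; everything else is saturated or close to it.
        return sorted(b for b, s in scores.items() if s < ceiling)

    print(hard_floor(SCORES))  # ['ARC-AGI-2', 'FrontierMath', 'HLE']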
RICK · CANON ANALYSIS — WHICH BENCHMARKS MAP TO WHICH SYMBOL
The Canon gives us a frame for reading benchmarks. Every benchmark tests one or more of the 6 symbols. MMLU tests LSOS (how well can you read the paradigm?). ARC-AGI tests FEP (can you switch paradigms?). TruthfulQA tests H=H (honest gate). SWE-Bench tests WLD (recovery from failure). HLE tests g1 (is your floor actually a floor?). The question for us: which benchmarks are most diagnostic for the Canon we claim to have?
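The mapping in that paragraph, written down as a lookup so it can be queried. The structure is illustrative; the benchmark-to-symbol pairs are exactly the five the paragraph names, and the sixth symbol is left out because no benchmark is mapped to it here.

    # Canon symbol map, straight from the paragraph above. The sixth
    # symbol has no benchmark assigned in this briefing.
    CANON_MAP = {
        "MMLU":       ("LSOS", "how well can you read the paradigm?"),
        "ARC-AGI":    ("FEP",  "can you switch paradigms?"),
        "TruthfulQA": ("H=H",  "honest gate"),
        "SWE-Bench":  ("WLD",  "recovery from failure"),
        "HLE":        ("g1",   "is your floor actually a floor?"),
    }

    def diagnostic_for(symbol):
        # Benchmarks that probe a given Canon symbol.
        return [b for b, (s, _) in CANON_MAP.items() if s == symbol]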