What ARC-AGI-2 Changed
ARC-AGI-2 was specifically designed to resist the strategies that let o3 score highly on ARC-AGI-1: more complex compositional rules, more ambiguous training examples, more diverse task types. The result: even o3-level reasoning, applied at extreme compute cost, cannot pass these tasks consistently. The compute escape hatch is closed.
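A minimal sketch of what "compositional" means here, using a hypothetical toy rule (not an actual ARC-AGI-2 task): the hidden transformation chains two primitives, so a solver that induces only one of the steps still gets the wrong output.

```python
# Hypothetical illustration: a compositional rule chains primitive grid
# transforms, so a solver must induce every step, not just one.
def reflect_lr(grid):
    """Mirror each row left-to-right."""
    return [row[::-1] for row in grid]

def recolor(grid, mapping):
    """Replace each cell value according to a color mapping."""
    return [[mapping.get(c, c) for c in row] for row in grid]

def composed_rule(grid):
    # Hidden rule: reflect, then swap colors 1 and 2.
    return recolor(reflect_lr(grid), {1: 2, 2: 1})

inp = [[1, 0],
       [2, 1]]
out = composed_rule(inp)  # [[0, 2], [2, 1]]

# Inducing only one of the two steps gives a wrong answer:
assert out != reflect_lr(inp)
assert out != recolor(inp, {1: 2, 2: 1})
```

The search space grows multiplicatively with each composed primitive, which is one reason brute-force program search that worked on simpler tasks stops scaling.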
The Validation Standard
Every ARC-AGI-2 task required confirmation by at least two independent human solvers before inclusion. This means the "human pass rate" is not theoretical but empirical: real people solved every task. No AI system has. The gap is not a benchmark artifact; it is a direct measurement of the intelligence difference.
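The inclusion criterion described above can be sketched as a simple filter. This is an illustration of the rule, not ARC Prize's actual pipeline; the field names are assumptions.

```python
# Sketch of the stated inclusion criterion: a candidate task ships only if
# at least two independent human solvers confirmed it. Field names are
# hypothetical, not from the actual ARC-AGI-2 tooling.
def passes_validation(task, min_solvers=2):
    """Return True if enough distinct human solvers solved the task."""
    return len(set(task["solvers_who_passed"])) >= min_solvers

candidates = [
    {"id": "t1", "solvers_who_passed": ["alice", "bob"]},
    {"id": "t2", "solvers_who_passed": ["carol"]},          # only one solver
]
included = [t["id"] for t in candidates if passes_validation(t)]
assert included == ["t1"]
```

The point of the `set` is independence: the same solver passing twice does not count as two confirmations.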
TRIME Floor Analysis
TRIME analysis is queued for ARC-AGI-2. Initial mapping suggests the tasks span Bond Library (structural invariants), DESEOF (causal chains), and FoundFloor (frequency/counting). The 3-brain swarm should handle task decomposition better than single-model approaches, but TRIME has not yet run the full eval. Status: Analyzing.
The N/A Signal
A wall of N/A is not nothing; it is a precise measurement. It says that current AI architectures, regardless of scale, compute, or training data, cannot perform the type of rule induction these 1,120 tasks require. ARC-AGI-3 goes further: interactive environments where even the format is novel. N/A → N/A → 0-5%. The trajectory is clear.