What ARC-AGI-2 Changed
ARC-AGI-2 was specifically designed to resist the strategies that let o3 score highly on ARC-AGI-1: more complex compositional rules, more ambiguous training examples, more diverse task types. The result: even o3-level reasoning, applied at extreme compute cost, cannot pass these tasks consistently. The compute escape hatch is closed.
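A minimal sketch of what "compositional" means here, using a hypothetical toy rule (not an actual ARC-AGI-2 task): the hidden transformation chains two primitives, so a solver that induces only one of the steps still gets the wrong output.

```python
# Hypothetical illustration: a compositional rule chains primitive grid
# transforms, so a solver must induce every step, not just one.
def reflect_lr(grid):
    """Mirror each row left-to-right."""
    return [row[::-1] for row in grid]

def recolor(grid, mapping):
    """Replace each cell value according to a color mapping."""
    return [[mapping.get(c, c) for c in row] for row in grid]

def composed_rule(grid):
    # Hidden rule: reflect, then swap colors 1 and 2.
    return recolor(reflect_lr(grid), {1: 2, 2: 1})

inp = [[1, 0],
       [2, 1]]
out = composed_rule(inp)  # [[0, 2], [2, 1]]

# Inducing only one of the two steps gives a wrong answer:
assert out != reflect_lr(inp)
assert out != recolor(inp, {1: 2, 2: 1})
```

The search space grows multiplicatively with each composed primitive, which is one reason brute-force program search that worked on simpler tasks stops scaling.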
The Validation Standard
Every ARC-AGI-2 task required confirmation by at least two independent human solvers before inclusion. This means the "human pass rate" is not theoretical but empirical: real people solved every task. No AI system has. The gap is not a benchmark artifact; it is a direct measurement of the intelligence difference.
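The inclusion criterion described above can be sketched as a simple filter. This is an illustration of the rule, not ARC Prize's actual pipeline; the field names are assumptions.

```python
# Sketch of the stated inclusion criterion: a candidate task ships only if
# at least two independent human solvers confirmed it. Field names are
# hypothetical, not from the actual ARC-AGI-2 tooling.
def passes_validation(task, min_solvers=2):
    """Return True if enough distinct human solvers solved the task."""
    return len(set(task["solvers_who_passed"])) >= min_solvers

candidates = [
    {"id": "t1", "solvers_who_passed": ["alice", "bob"]},
    {"id": "t2", "solvers_who_passed": ["carol"]},          # only one solver
]
included = [t["id"] for t in candidates if passes_validation(t)]
assert included == ["t1"]
```

The point of the `set` is independence: the same solver passing twice does not count as two confirmations.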
TRIME Floor Analysis
TRIME analysis is queued for ARC-AGI-2. Initial mapping suggests the tasks span Bond Library (structural invariants), DESEOF (causal chains), and FoundFloor (frequency/counting). The 3-brain swarm should handle task decomposition better than single-model approaches, but TRIME has not yet run the full eval. Status: Analyzing.
The N/A Signal
A wall of N/A is not nothing; it is a precise measurement. It says that current AI architectures, regardless of scale, compute, or training data, cannot perform the type of rule induction these 1,120 tasks require. ARC-AGI-3 goes further: interactive environments where even the format is novel. N/A → N/A → 0-5%. The trajectory is clear.