| # | Model | Public Eval (800) | Semi-Private (100) | Eval Set | Notes |
|---|---|---|---|---|---|
| 1 | o3-preview Semi-Private OpenAI · high compute · est. $456k/100 tasks | N/A | 87.5% | 100 tasks | High compute, not public eval |
| 2 | o3-preview Semi-Private OpenAI · low compute | N/A | 75.7% | 100 tasks | Low compute version |
| 3 | Gemini 2.0 Flash Google · flash reasoning | N/A | ~72% | Private eval | Reported by Google |
| 4 | Claude 3.7 Sonnet Anthropic · extended thinking | N/A | ~62% | Private eval | Extended thinking enabled |
| 5 | o3-released OpenAI · public release | N/A | ~60% | Private eval | Lower than preview |
| 6 | GPT-4o OpenAI · standard | N/A | ~57% | Private eval | No extended reasoning |
| 7 | Llama-405B Meta · open weights | N/A | ~5% | Community | Community benchmarks |
| — | Anthropic Opus 4.6 Public Eval Current frontier · public 800-task eval | N/A | N/A | 800 public | 0 completions on public eval |
| — | GPT-5.4 High Public Eval OpenAI · current frontier | N/A | N/A | 800 public | 0 completions on public eval |
| — | Grok 4.20 Beta Public Eval xAI · beta reasoning | N/A | N/A | 800 public | 0 completions on public eval |
| — | Gemini 3.1 Pro Public Eval Google · preview reasoning | N/A | N/A | 800 public | 0 completions on public eval |
| — | EOSE TRIME TRIME⚡ 3-brain swarm · Bond Library + DESEOF | Running… | TBD | Active | Analysis in progress |
| ∞ | Human Baseline Average adult · no compute cost | ~100% | ~100% | Both | Fluid reasoning, native |

| Task ID | Human | Anthropic Opus | GPT-5.4 | Gemini 3.1 | TRIME⚡ | Gap | Replay |
|---|---|---|---|---|---|---|---|