o3/Best-AI · best result from any model
GPT-5 · GPT-5.4 High
Gemini-3 · Gemini 3.1 Pro Preview
TRIME⚡ · EOSE PEMCLAU V8 · 3-cap ensemble (KRSRHONE·JAYRHONE·BACHRHONE) · 8 editions · all silos · EVEN ═
Gap · Human% − Best_AI%
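The Gap definition above can be sketched as a small helper. This is only an illustration: the function name `gap` is hypothetical, and treating an N/A score as `None` (so that the gap is also unavailable) is an assumption about how the N/A columns are meant to be read.

```python
from typing import Optional


def gap(human_pct: float, best_ai_pct: Optional[float]) -> Optional[float]:
    """Gap = Human% - Best_AI%, in percentage points.

    Assumption: an N/A AI score is passed as None, and the gap is then
    reported as None rather than as a number.
    """
    if best_ai_pct is None:
        return None
    # Round to two decimals to match the precision of the quoted scores.
    return round(human_pct - best_ai_pct, 2)


print(gap(100.0, 4.76))  # 95.24
print(gap(100.0, None))  # None
```

With the figures quoted for the public demo tasks (Human 100%, Best AI 4.76%), this gives a gap of 95.24 points for the best system; the "Avg AI Gap" figure is presumably an average over several systems or tasks and is not reproduced by this sketch.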
25 public demo tasks · Interactive/agent format · Human: 100% · Best AI: 4.76% (Gemini ft09) · Avg AI Gap: ~99.8%
| Task ID | Dataset | Human | o3 / Best AI | GPT-5.4 | Gemini 3.1 | TRIME⚡ | Gap | Replay |
First 25 of 800 public eval tasks shown · o3 scored 75.7% (low compute) / 87.5% (high compute) on the separate 100-task semi-private set only · Public eval: all AI = N/A
| Task ID | Dataset | Human | Anthropic Opus | GPT-5.4 | Gemini 3.1 | TRIME⚡ | Gap | Replay |
Semi-private note: o3-preview (high compute) achieved 87.5% on a separate 100-task semi-private evaluation set, at an estimated cost of $456k. o3-preview (low compute) achieved 75.7% on the same set. These scores do NOT apply to the 800-task public eval above.
1,120 tasks · Every task validated by ≥2 humans · No AI system has passed any task · All AI columns N/A
| Task ID | Dataset | Human | Anthropic Opus | GPT-5.4 | Gemini 3.1 | TRIME⚡ | Gap | Replay |