Benchmark reference

AI model benchmarks without the extra noise.

A focused CyberOGZ reference for comparing AI models across coding, reasoning, speed, context, and price. It covers LLM, multimodal, image, and video models. Treat it as a reading aid, not a universal winner label.

Admin database · Updated May 29, 2026
Overall is the OGZ Practical Score, a CyberOGZ formula: 40% intelligence + 20% coding + 20% reasoning + 10% speed + 5% context + 5% price.
Current reference leader

OpenAI GPT-5.5 (xhigh)

General

68 overall
Top performers

Top 10 by OGZ Practical Score

  1. GPT-5.5 (xhigh) · OpenAI · 68
  2. Claude Opus 4.8 (Adaptive Reasoning, Max Effort) · Anthropic · 68
  3. GPT-5.5 (high) · OpenAI · 67
  4. Gemini 3.1 Pro Preview · Google · 67
  5. GPT-5.4 (xhigh) · OpenAI · 67
  6. GPT-5.5 Pro (xhigh) · OpenAI · 66
  7. GPT-5.5 (medium) · OpenAI · 66
  8. Gemini 3 Deep Think · Google · 66
  9. EXAONE 4.5 33B (Non-reasoning) · LG AI Research · 66
  10. Qwen3.7 Max · Alibaba · 66
Humanity's Last Exam

Top 8 on the hardest current benchmark

HLE is a 3,000-question PhD-level exam across math, sciences, humanities and reasoning. A higher score means more of these hard questions were answered correctly. Source: Artificial Analysis.

HLE % correct
  1. Claude Opus 4.8 (Adaptive Reasoning, Max Effort) · Anthropic · 46%
  2. Gemini 3.1 Pro Preview · Google · 45%
  3. GPT-5.5 (xhigh) · OpenAI · 44%
  4. GPT-5.5 (high) · OpenAI · 43%
  5. GPT-5.4 (xhigh) · OpenAI · 42%
  6. GPT-5.5 (medium) · OpenAI · 41%
  7. Gemini 3.5 Flash (high) · Google · 41%
  8. GPT-5.3 Codex (xhigh) · OpenAI · 40%
Head-to-head

Top 3 strengths side by side

Six dimensions, three flagships. Bigger area = stronger profile.

Radar chart axes: Intelligence, Coding, Reasoning, HLE, Speed, Cost
  • GPT-5.5 (xhigh) OpenAI · Overall 68
  • Claude Opus 4.8 (Adaptive Reasoning, Max Effort) Anthropic · Overall 68
  • GPT-5.5 (high) OpenAI · Overall 67
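
To make "bigger area = stronger profile" concrete, here is a minimal sketch of how the area of a radar polygon can be computed. It assumes a standard radar chart with evenly spaced axes; the page does not say how its chart is drawn, and the function name and sample values are hypothetical.

```python
import math

def radar_area(values):
    """Area of a radar polygon whose axes are evenly spaced around the centre."""
    n = len(values)
    if n < 3:
        raise ValueError("a radar polygon needs at least three axes")
    # Each pair of neighbouring axes spans a triangle of area 0.5 * a * b * sin(angle).
    wedge = 0.5 * math.sin(2 * math.pi / n)
    return wedge * sum(values[i] * values[(i + 1) % n] for i in range(n))

# Hypothetical profiles on a 0-1 scale, not values read from the chart.
balanced = [1, 1, 1, 1, 1, 1]
grouped = [1, 1, 1, 0, 0, 0]      # three strong axes next to each other
alternating = [1, 0, 1, 0, 1, 0]  # the same three strengths, interleaved

for name, profile in [("balanced", balanced), ("grouped", grouped), ("alternating", alternating)]:
    print(name, round(radar_area(profile), 3))
```

The grouped and alternating profiles contain the same values but produce different areas, because the area of a radar polygon depends on the order of the axes. Area is therefore a rough visual cue, not a score.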
Provider portfolio

Who's building what

Aggregated from 52 scored Artificial Analysis records. Each provider's average score is the mean across that provider's models on the Artificial Analysis intelligence index.
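
The per-provider average described above can be sketched as a simple group-and-mean step. This is an illustration only: the record shape, function name, and sample values are hypothetical, and the page does not state how averages are rounded for display.

```python
from collections import defaultdict
from statistics import mean

def provider_averages(records):
    """Return {provider: (model_count, mean_score)} for (provider, score) pairs."""
    by_provider = defaultdict(list)
    for provider, score in records:
        by_provider[provider].append(score)
    return {p: (len(scores), mean(scores)) for p, scores in by_provider.items()}

# Hypothetical records, not taken from the feed.
sample = [("ProviderA", 60), ("ProviderA", 64), ("ProviderB", 66)]
averages = provider_averages(sample)
print(averages)
```

A provider with a single scored model has an average equal to that model's score, which is why one-model providers can sit high in the portfolio.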

LG AI Research (US / Global)
Flagship: EXAONE 4.5 33B (Non-reasoning) · overall 66
1 model · avg 66

OpenAI (US / Global)
Flagship: GPT-5.5 (xhigh) · overall 68
Coding king: GPT-5.5 Pro (xhigh) · coding 70
Best HLE: 44%
19 models · avg 64

Google (US / Global)
Flagship: Gemini 3.1 Pro Preview · overall 67
Coding king: Gemini 3 Deep Think · coding 70
Best HLE: 45%
7 models · avg 63

Anthropic (US / Global)
Flagship: Claude Opus 4.8 (Adaptive Reasoning, Max Effort) · overall 68
Best HLE: 46%
6 models · avg 63

Kimi (China & Asia)
Flagship: Kimi K2.6 · overall 63
Best HLE: 36%
1 model · avg 63

Alibaba (China & Asia)
Flagship: Qwen3.7 Max · overall 66
Best HLE: 38%
3 models · avg 62

Deep Cogito (US / Global)
Flagship: Cogito v2.1 (Reasoning) · overall 62
Best HLE: 11%
1 model · avg 62

xAI (US / Global)
Flagship: Grok 4.3 (high) · overall 63
Coding king: Grok 4.20 0309 (Reasoning) · coding 42
Best HLE: 35%
4 models · avg 61

DeepSeek (China & Asia)
Flagship: DeepSeek V4 Pro (Reasoning, Max Effort) · overall 63
Best HLE: 36%
3 models · avg 61

Xiaomi (US / Global)
Flagship: MiMo-V2.5-Pro · overall 63
Best HLE: 34%
3 models · avg 61

MiniMax (China & Asia)
Flagship: MiniMax-M2.7 · overall 61
Best HLE: 28%
1 model · avg 61

Z AI (China & Asia)
Flagship: GLM-5.1 (Reasoning) · overall 61
Coding king: GLM-5 (Reasoning) · coding 44
Best HLE: 28%
2 models · avg 60
Release race

2026 model releases - through May

Each dot is a release date pulled from Artificial Analysis. The chart updates as new models ship across the year.

2026 · 39 releases
Timeline: Jan to May
Providers shown: OpenAI, Anthropic, Google, Alibaba, LG AI Research, xAI, DeepSeek

Provider | Model | Tags | Overall | Coding | Reasoning | Speed | Context | Modalities | Source
OpenAI | GPT-5.5 (xhigh) | General | 68 | 59 | 94 | 72 | n/a | Text | Artificial Analysis
OpenAI | GPT-5.5 (high) | General | 67 | 58 | 93 | 71 | n/a | Text | Artificial Analysis
Google | Gemini 3.1 Pro Preview | General | 67 | 56 | 94 | 79 | n/a | Text | Artificial Analysis
OpenAI | GPT-5.4 (xhigh) | General | 67 | 57 | 92 | 75 | n/a | Text | Artificial Analysis
OpenAI | GPT-5.5 Pro (xhigh) | General | 66 | 70 | 70 | 35 | n/a | Text | Artificial Analysis
OpenAI | GPT-5.5 (medium) | General | 66 | 56 | 93 | 71 | n/a | Text | Artificial Analysis
Google | Gemini 3 Deep Think | General | 66 | 70 | 70 | 35 | n/a | Text | Artificial Analysis
LG AI Research | EXAONE 4.5 33B (Non-reasoning) | General | 66 | 70 | 70 | 35 | n/a | Text | Artificial Analysis
Alibaba | Qwen3.7 Max | General, China | 66 | 50 | 92 | 83 | n/a | Text | Artificial Analysis
OpenAI | GPT-3.5 Turbo (0613) | General | 66 | 70 | 70 | 35 | n/a | Text | Artificial Analysis
OpenAI | GPT-4o mini Realtime (Dec '24) | General, Multimodal | 66 | 70 | 70 | 35 | n/a | Text | Artificial Analysis
OpenAI | GPT-4o Realtime (Dec '24) | General, Multimodal | 66 | 70 | 70 | 35 | n/a | Text | Artificial Analysis
OpenAI | GPT-5.3 Codex (xhigh) | General, Coding | 64 | 53 | 92 | 73 | n/a | Text | Artificial Analysis
Google | Gemini 3.5 Flash (high) | General | 64 | 45 | 92 | 84 | n/a | Text | Artificial Analysis
Anthropic | Claude Opus 4.7 (Adaptive Reasoning, Max Effort) | General | 64 | 52 | 91 | 69 | n/a | Text | Artificial Analysis
OpenAI | GPT-5.4 Pro (xhigh) | General | 65 | 70 | 70 | 35 | n/a | Text | Artificial Analysis
Kimi | Kimi K2.6 | General, China | 63 | 47 | 91 | 65 | n/a | Text | Artificial Analysis
OpenAI | GPT-5.2 (xhigh) | General | 64 | 49 | 99 | 73 | n/a | Text | Artificial Analysis
OpenAI | GPT-5.4 mini (xhigh) | General | 63 | 52 | 88 | 82 | n/a | Text | Artificial Analysis
xAI | Grok 4.3 (high) | General | 63 | 41 | 90 | 85 | n/a | Text | Artificial Analysis
OpenAI | GPT-5.5 (low) | General | 62 | 52 | 91 | 70 | n/a | Text | Artificial Analysis
Anthropic | Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort) | General | 62 | 51 | 88 | 72 | n/a | Text | Artificial Analysis
Anthropic | Claude Opus 4.7 (Non-reasoning, High Effort) | General | 62 | 53 | 88 | 69 | n/a | Text | Artificial Analysis
DeepSeek | DeepSeek V4 Pro (Reasoning, Max Effort) | General, China | 63 | 48 | 89 | 69 | n/a | Text | Artificial Analysis
Xiaomi | MiMo-V2.5-Pro | General | 63 | 46 | 87 | 69 | n/a | Text | Artificial Analysis
Deep Cogito | Cogito v2.1 (Reasoning) | General | 62 | 25 | 73 | 72 | n/a | Text | Artificial Analysis
OpenAI | GPT-5.1 (high) | General | 61 | 45 | 94 | 78 | n/a | Text | Artificial Analysis
Google | Gemini 3 Pro Preview (high) | General | 62 | 46 | 96 | 80 | n/a | Text | Artificial Analysis
Google | Gemini 3 Flash Preview (Reasoning) | General | 62 | 43 | 97 | 83 | n/a | Text | Artificial Analysis
Anthropic | Claude Opus 4.6 (Adaptive Reasoning, Max Effort) | General | 62 | 48 | 90 | 69 | n/a | Text | Artificial Analysis
DeepSeek | DeepSeek V4 Pro (Reasoning, High Effort) | General, China | 61 | 43 | 90 | 68 | n/a | Text | Artificial Analysis
Z AI | GLM-5.1 (Reasoning) | General, China | 61 | 43 | 87 | 71 | n/a | Text | Artificial Analysis
OpenAI | GPT-5 Codex (high) | General, Coding | 60 | 39 | 99 | 81 | n/a | Text | Artificial Analysis
Anthropic | Claude Opus 4.5 (Reasoning) | General | 61 | 48 | 91 | 70 | n/a | Text | Artificial Analysis
xAI | Grok 4.20 0309 v2 (Reasoning) | General | 61 | 40 | 91 | 84 | n/a | Text | Artificial Analysis
Alibaba | Qwen3.6 Max Preview | General, China | 61 | 45 | 89 | 66 | n/a | Text | Artificial Analysis
OpenAI | GPT-5.4 (low) | General | 59 | 46 | 87 | 72 | n/a | Text | Artificial Analysis
xAI | Grok 4.3 (medium) | General | 60 | 35 | 89 | 80 | n/a | Text | Artificial Analysis
MiniMax | MiniMax-M2.7 | General, China | 61 | 42 | 87 | 74 | n/a | Text | Artificial Analysis
Xiaomi | MiMo-V2.5 | General | 60 | 42 | 85 | 74 | n/a | Text | Artificial Analysis
Z AI | GLM-5 (Reasoning) | General, China | 60 | 44 | 82 | 73 | n/a | Text | Artificial Analysis
Alibaba | Qwen3.6 Plus | General, China | 60 | 43 | 88 | 69 | n/a | Text | Artificial Analysis
OpenAI | GPT-5.2 Codex (xhigh) | General, Coding | 60 | 43 | 90 | 77 | n/a | Text | Artificial Analysis
xAI | Grok 4.20 0309 (Reasoning) | General | 60 | 42 | 88 | 82 | n/a | Text | Artificial Analysis
Xiaomi | MiMo-V2-Pro | General | 59 | 41 | 87 | 70 | n/a | Text | Artificial Analysis
OpenAI | GPT-5.4 nano (xhigh) | General | 58 | 44 | 82 | 81 | n/a | Text | Artificial Analysis
Google | Gemini 3.5 Flash (minimal) | General | 59 | 47 | 83 | 86 | n/a | Text | Artificial Analysis
DeepSeek | DeepSeek V4 Flash (Reasoning, Max Effort) | General, China | 59 | 39 | 89 | 76 | n/a | Text | Artificial Analysis
KwaiKAT | KAT Coder Pro V2 | General, Coding | 59 | 46 | 86 | 77 | n/a | Text | Artificial Analysis
OpenAI | GPT-5.1 Codex (high) | General, Coding | 59 | 37 | 96 | 82 | n/a | Text | Artificial Analysis
Anthropic | Claude Opus 4.8 (Adaptive Reasoning, Max Effort) | General | 68 | 57 | 92 | 72 | n/a | Text | Artificial Analysis
Google | Gemini 3.5 Flash (medium) | General | 64 | 44 | 92 | 84 | n/a | Text | Artificial Analysis
What the comparison means

Formula

  • 40% Intelligence / overall quality
  • 20% Coding benchmark
  • 20% Reasoning benchmark
  • 10% Speed (tokens/sec, log-normalised)
  • 5% Context window (log-normalised)
  • 5% Price efficiency

We weight intelligence highest because general quality affects most tasks. Coding and reasoning are separated because a model can be strong in one and weaker in the other. Speed, context and pricing have smaller weights, but they can still move the ranking when models are close.
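
The weighting described above can be sketched as a plain weighted sum. This is a minimal illustration, not the CyberOGZ implementation: the weights come from the formula, but the log-normalisation bounds, the 0-100 component scale, the function names, and the example values are assumptions. How a missing context value ("n/a") is handled is not stated on the page, so the sketch requires all six components.

```python
import math

# Weights from the published formula; they sum to 1.0.
WEIGHTS = {
    "intelligence": 0.40,
    "coding": 0.20,
    "reasoning": 0.20,
    "speed": 0.10,
    "context": 0.05,
    "price": 0.05,
}

def log_normalise(value, low, high):
    """Map value onto 0-100 on a log scale; low and high are assumed bounds."""
    clamped = min(max(value, low), high)
    return 100.0 * (math.log(clamped) - math.log(low)) / (math.log(high) - math.log(low))

def practical_score(components):
    """Weighted sum of six component scores, each already on a 0-100 scale."""
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)

# Hypothetical model, not a row from the table.
example = {
    "intelligence": 60.0,
    "coding": 50.0,
    "reasoning": 90.0,
    "speed": log_normalise(100.0, low=10.0, high=1000.0),          # tokens/sec
    "context": log_normalise(128_000, low=8_000, high=2_048_000),  # tokens
    "price": 70.0,
}
score = practical_score(example)
print(round(score, 1))
```

Because speed, context and price together carry only 20% of the weight, a 10-point change in any one of them moves the total by at most one point, which matters mainly when two models are otherwise close.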

The overall comparison is not a claim that one model is universally better than another. It is a practical ranking layer for CyberOGZ readers: which model looks stronger in real use across coding, reasoning, response speed, context capacity and pricing efficiency at the time the data was imported.

The goal is to help readers compare models quickly, then open the details and source notes before making a decision. A model with a lower overall value can still be the better choice for a specific workflow.

Data sources and confidence

The benchmark feed can come from API-backed sources such as Artificial Analysis, imported JSON, and admin-reviewed manual rows. Each row has a source and confidence value. Higher confidence means the row is based on a clearer external source or a cleaner imported feed.

If a model has only metadata but no trusted benchmark data, it should stay inactive or metadata-only in admin. That prevents models without benchmark data from polluting Radar rankings.
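
The rule above, where rows without trusted benchmark data stay out of the rankings, could look roughly like this. The field names, the 0-1 confidence scale, and the threshold are hypothetical; the page only says that each row has a source and a confidence value.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ModelRow:
    name: str
    source: str
    confidence: float         # assumed 0-1 scale
    overall: Optional[float]  # None when the row is metadata-only

def rankable(rows, min_confidence=0.5):
    """Keep rows that have a score and enough confidence, best score first."""
    kept = [r for r in rows if r.overall is not None and r.confidence >= min_confidence]
    return sorted(kept, key=lambda r: r.overall, reverse=True)

# Hypothetical rows for illustration.
rows = [
    ModelRow("Model A", "Artificial Analysis", 0.9, 64.0),
    ModelRow("Model B", "manual", 0.3, 70.0),         # low confidence: excluded
    ModelRow("Model C", "imported JSON", 0.8, None),  # metadata-only: excluded
    ModelRow("Model D", "imported JSON", 0.7, 66.0),
]
ranked_names = [r.name for r in rankable(rows)]
print(ranked_names)
```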

How to read this: Compare models within the same use-case lens. A reasoning specialist with a low speed score is still the right pick for complex tasks. Always check context window, modalities and pricing before committing.
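
As a small illustration of reading through a use-case lens, the sketch below ranks the same rows by different columns. The rows and the function name are hypothetical; the point is only that the leader can change with the lens.

```python
def rank_by_lens(rows, lens):
    """Sort rows by one column (the lens), highest value first."""
    return sorted(rows, key=lambda row: row[lens], reverse=True)

# Hypothetical rows, not taken from the table.
candidates = [
    {"model": "Model A", "overall": 66, "coding": 45, "reasoning": 95, "speed": 60},
    {"model": "Model B", "overall": 64, "coding": 58, "reasoning": 88, "speed": 85},
]

for lens in ("overall", "coding", "reasoning", "speed"):
    leader = rank_by_lens(candidates, lens)[0]["model"]
    print(lens, "->", leader)
```

Here Model A leads on overall and reasoning while Model B leads on coding and speed, so the better pick depends on the workflow.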