AI model benchmarks without the extra noise.
A focused CyberOGZ reference for comparing AI models across coding, reasoning, speed, context, and price. Covers LLM, multimodal, image, and video models. It is a reading aid, not a universal winner label.
Top 10 by OGZ Practical Score
- #1 GPT-5.5 (xhigh) OpenAI 68
- #2 Claude Opus 4.8 (Adaptive Reasoning, Max Effort) Anthropic 68
- #3 GPT-5.5 (high) OpenAI 67
- #4 Gemini 3.1 Pro Preview Google 67
- #5 GPT-5.4 (xhigh) OpenAI 67
- #6 GPT-5.5 Pro (xhigh) OpenAI 66
- #7 GPT-5.5 (medium) OpenAI 66
- #8 Gemini 3 Deep Think Google 66
- #9 EXAONE 4.5 33B (Non-reasoning) LG AI Research 66
- #10 Qwen3.7 Max Alibaba 66
Top 8 on the hardest current benchmark
HLE (Humanity's Last Exam) is a 3,000-question PhD-level exam across math, sciences, humanities and reasoning. A higher percentage means more of these hard questions were answered correctly. Source: Artificial Analysis.
- #1 Claude Opus 4.8 (Adaptive Reasoning, Max Effort) Anthropic 46%
- #2 Gemini 3.1 Pro Preview Google 45%
- #3 GPT-5.5 (xhigh) OpenAI 44%
- #4 GPT-5.5 (high) OpenAI 43%
- #5 GPT-5.4 (xhigh) OpenAI 42%
- #6 GPT-5.5 (medium) OpenAI 41%
- #7 Gemini 3.5 Flash (high) Google 41%
- #8 GPT-5.3 Codex (xhigh) OpenAI 40%
Top 3 strengths side by side
Six dimensions, three flagships. Bigger area = stronger profile.
- GPT-5.5 (xhigh) OpenAI · Overall 68
- Claude Opus 4.8 (Adaptive Reasoning, Max Effort) Anthropic · Overall 68
- GPT-5.5 (high) OpenAI · Overall 67
Who's building what
Aggregated from 52 scored Artificial Analysis records. Average score is the mean across each provider's models on AA's intelligence index.
LG AI Research
- Flagship: EXAONE 4.5 33B (Non-reasoning) 66
OpenAI
- Flagship: GPT-5.5 (xhigh) 68
- Coding king: GPT-5.5 Pro (xhigh) 70
- Best HLE: 44%
Google
- Flagship: Gemini 3.1 Pro Preview 67
- Coding king: Gemini 3 Deep Think 70
- Best HLE: 45%
Anthropic
- Flagship: Claude Opus 4.8 (Adaptive Reasoning, Max Effort) 68
- Best HLE: 46%
Kimi
- Flagship: Kimi K2.6 63
- Best HLE: 36%
Alibaba
- Flagship: Qwen3.7 Max 66
- Best HLE: 38%
Deep Cogito
- Flagship: Cogito v2.1 (Reasoning) 62
- Best HLE: 11%
xAI
- Flagship: Grok 4.3 (high) 63
- Coding king: Grok 4.20 0309 (Reasoning) 42
- Best HLE: 35%
DeepSeek
- Flagship: DeepSeek V4 Pro (Reasoning, Max Effort) 63
- Best HLE: 36%
Xiaomi
- Flagship: MiMo-V2.5-Pro 63
- Best HLE: 34%
MiniMax
- Flagship: MiniMax-M2.7 61
- Best HLE: 28%
Z AI
- Flagship: GLM-5.1 (Reasoning) 61
- Coding king: GLM-5 (Reasoning) 44
- Best HLE: 28%
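The per-provider figures above can be reproduced with a small aggregation. The sketch below is an illustration only: it assumes "Flagship" means the provider's highest Overall value and "Coding king" its highest Coding value, and it uses the Overall column as a stand-in for the average score. The page does not state these rules, and the function name `summarize` is hypothetical. The sample rows are the xAI and Z AI entries from the comparison table on this page.

```python
from collections import defaultdict
from statistics import mean

# (provider, model, overall, coding) copied from the comparison table.
ROWS = [
    ("xAI", "Grok 4.3 (high)", 63, 41),
    ("xAI", "Grok 4.20 0309 v2 (Reasoning)", 61, 40),
    ("xAI", "Grok 4.3 (medium)", 60, 35),
    ("xAI", "Grok 4.20 0309 (Reasoning)", 60, 42),
    ("Z AI", "GLM-5.1 (Reasoning)", 61, 43),
    ("Z AI", "GLM-5 (Reasoning)", 60, 44),
]


def summarize(rows):
    """Group rows by provider and pick flagship, coding king, and mean overall.

    Assumed rules (not confirmed by the page): flagship = highest overall,
    coding king = highest coding score.
    """
    by_provider = defaultdict(list)
    for provider, model, overall, coding in rows:
        by_provider[provider].append((model, overall, coding))

    summary = {}
    for provider, models in by_provider.items():
        flagship = max(models, key=lambda m: m[1])
        coding_king = max(models, key=lambda m: m[2])
        summary[provider] = {
            "flagship": flagship[0],
            "coding_king": coding_king[0],
            "average_overall": mean(m[1] for m in models),
        }
    return summary


if __name__ == "__main__":
    for provider, info in summarize(ROWS).items():
        print(provider, info)
```

Under these assumptions the sketch picks Grok 4.3 (high) and Grok 4.20 0309 (Reasoning) for xAI, and GLM-5.1 (Reasoning) and GLM-5 (Reasoning) for Z AI, which agrees with the list above.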
2026 model releases - through May
[Chart: model release dates pulled from Artificial Analysis, one dot per model. Updates as new models ship across the year.]
| Model | Overall | Coding | Reasoning | Speed | Context | Modalities | Source |
|---|---|---|---|---|---|---|---|
| OpenAI GPT-5.5 (xhigh) · General | 68 | 59 | 94 | 72 | n/a | Text | Artificial Analysis |
| OpenAI GPT-5.5 (high) · General | 67 | 58 | 93 | 71 | n/a | Text | Artificial Analysis |
| Google Gemini 3.1 Pro Preview · General | 67 | 56 | 94 | 79 | n/a | Text | Artificial Analysis |
| OpenAI GPT-5.4 (xhigh) · General | 67 | 57 | 92 | 75 | n/a | Text | Artificial Analysis |
| OpenAI GPT-5.5 Pro (xhigh) · General | 66 | 70 | 70 | 35 | n/a | Text | Artificial Analysis |
| OpenAI GPT-5.5 (medium) · General | 66 | 56 | 93 | 71 | n/a | Text | Artificial Analysis |
| Google Gemini 3 Deep Think · General | 66 | 70 | 70 | 35 | n/a | Text | Artificial Analysis |
| LG AI Research EXAONE 4.5 33B (Non-reasoning) · General | 66 | 70 | 70 | 35 | n/a | Text | Artificial Analysis |
| Alibaba Qwen3.7 Max · General, China | 66 | 50 | 92 | 83 | n/a | Text | Artificial Analysis |
| OpenAI GPT-3.5 Turbo (0613) · General | 66 | 70 | 70 | 35 | n/a | Text | Artificial Analysis |
| OpenAI GPT-4o mini Realtime (Dec '24) · General, Multimodal | 66 | 70 | 70 | 35 | n/a | Text | Artificial Analysis |
| OpenAI GPT-4o Realtime (Dec '24) · General, Multimodal | 66 | 70 | 70 | 35 | n/a | Text | Artificial Analysis |
| OpenAI GPT-5.3 Codex (xhigh) · General, Coding | 64 | 53 | 92 | 73 | n/a | Text | Artificial Analysis |
| Google Gemini 3.5 Flash (high) · General | 64 | 45 | 92 | 84 | n/a | Text | Artificial Analysis |
| Anthropic Claude Opus 4.7 (Adaptive Reasoning, Max Effort) · General | 64 | 52 | 91 | 69 | n/a | Text | Artificial Analysis |
| OpenAI GPT-5.4 Pro (xhigh) · General | 65 | 70 | 70 | 35 | n/a | Text | Artificial Analysis |
| Kimi Kimi K2.6 · General, China | 63 | 47 | 91 | 65 | n/a | Text | Artificial Analysis |
| OpenAI GPT-5.2 (xhigh) · General | 64 | 49 | 99 | 73 | n/a | Text | Artificial Analysis |
| OpenAI GPT-5.4 mini (xhigh) · General | 63 | 52 | 88 | 82 | n/a | Text | Artificial Analysis |
| xAI Grok 4.3 (high) · General | 63 | 41 | 90 | 85 | n/a | Text | Artificial Analysis |
| OpenAI GPT-5.5 (low) · General | 62 | 52 | 91 | 70 | n/a | Text | Artificial Analysis |
| Anthropic Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort) · General | 62 | 51 | 88 | 72 | n/a | Text | Artificial Analysis |
| Anthropic Claude Opus 4.7 (Non-reasoning, High Effort) · General | 62 | 53 | 88 | 69 | n/a | Text | Artificial Analysis |
| DeepSeek DeepSeek V4 Pro (Reasoning, Max Effort) · General, China | 63 | 48 | 89 | 69 | n/a | Text | Artificial Analysis |
| Xiaomi MiMo-V2.5-Pro · General | 63 | 46 | 87 | 69 | n/a | Text | Artificial Analysis |
| Deep Cogito Cogito v2.1 (Reasoning) · General | 62 | 25 | 73 | 72 | n/a | Text | Artificial Analysis |
| OpenAI GPT-5.1 (high) · General | 61 | 45 | 94 | 78 | n/a | Text | Artificial Analysis |
| Google Gemini 3 Pro Preview (high) · General | 62 | 46 | 96 | 80 | n/a | Text | Artificial Analysis |
| Google Gemini 3 Flash Preview (Reasoning) · General | 62 | 43 | 97 | 83 | n/a | Text | Artificial Analysis |
| Anthropic Claude Opus 4.6 (Adaptive Reasoning, Max Effort) · General | 62 | 48 | 90 | 69 | n/a | Text | Artificial Analysis |
| DeepSeek DeepSeek V4 Pro (Reasoning, High Effort) · General, China | 61 | 43 | 90 | 68 | n/a | Text | Artificial Analysis |
| Z AI GLM-5.1 (Reasoning) · General, China | 61 | 43 | 87 | 71 | n/a | Text | Artificial Analysis |
| OpenAI GPT-5 Codex (high) · General, Coding | 60 | 39 | 99 | 81 | n/a | Text | Artificial Analysis |
| Anthropic Claude Opus 4.5 (Reasoning) · General | 61 | 48 | 91 | 70 | n/a | Text | Artificial Analysis |
| xAI Grok 4.20 0309 v2 (Reasoning) · General | 61 | 40 | 91 | 84 | n/a | Text | Artificial Analysis |
| Alibaba Qwen3.6 Max Preview · General, China | 61 | 45 | 89 | 66 | n/a | Text | Artificial Analysis |
| OpenAI GPT-5.4 (low) · General | 59 | 46 | 87 | 72 | n/a | Text | Artificial Analysis |
| xAI Grok 4.3 (medium) · General | 60 | 35 | 89 | 80 | n/a | Text | Artificial Analysis |
| MiniMax MiniMax-M2.7 · General, China | 61 | 42 | 87 | 74 | n/a | Text | Artificial Analysis |
| Xiaomi MiMo-V2.5 · General | 60 | 42 | 85 | 74 | n/a | Text | Artificial Analysis |
| Z AI GLM-5 (Reasoning) · General, China | 60 | 44 | 82 | 73 | n/a | Text | Artificial Analysis |
| Alibaba Qwen3.6 Plus · General, China | 60 | 43 | 88 | 69 | n/a | Text | Artificial Analysis |
| OpenAI GPT-5.2 Codex (xhigh) · General, Coding | 60 | 43 | 90 | 77 | n/a | Text | Artificial Analysis |
| xAI Grok 4.20 0309 (Reasoning) · General | 60 | 42 | 88 | 82 | n/a | Text | Artificial Analysis |
| Xiaomi MiMo-V2-Pro · General | 59 | 41 | 87 | 70 | n/a | Text | Artificial Analysis |
| OpenAI GPT-5.4 nano (xhigh) · General | 58 | 44 | 82 | 81 | n/a | Text | Artificial Analysis |
| Google Gemini 3.5 Flash (minimal) · General | 59 | 47 | 83 | 86 | n/a | Text | Artificial Analysis |
| DeepSeek DeepSeek V4 Flash (Reasoning, Max Effort) · General, China | 59 | 39 | 89 | 76 | n/a | Text | Artificial Analysis |
| KwaiKAT KAT Coder Pro V2 · General, Coding | 59 | 46 | 86 | 77 | n/a | Text | Artificial Analysis |
| OpenAI GPT-5.1 Codex (high) · General, Coding | 59 | 37 | 96 | 82 | n/a | Text | Artificial Analysis |
| Anthropic Claude Opus 4.8 (Adaptive Reasoning, Max Effort) · General | 68 | 57 | 92 | 72 | n/a | Text | Artificial Analysis |
| Google Gemini 3.5 Flash (medium) · General | 64 | 44 | 92 | 84 | n/a | Text | Artificial Analysis |
What the comparison means
Formula
We weight intelligence highest because general quality affects most tasks. Coding and reasoning are separated because a model can be strong in one and weaker in the other. Speed, context and pricing have smaller weights, but they can still move the ranking when models are close.
The overall comparison does not claim that one model is universally better than another. It is a practical ranking layer for CyberOGZ readers: it shows which model looks stronger for real use, coding, reasoning, response speed, context capacity and pricing efficiency at the time the data was imported.
The goal is to help readers compare models quickly, then open the details and source notes before making a decision. A model with a lower overall value can still be the better choice for a specific workflow.
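The weighting described under Formula can be sketched as a weighted mean. The page gives only the ordering of the weights (intelligence highest, coding and reasoning separate, speed, context and pricing smaller), so every number in `WEIGHTS` below is a made-up placeholder, and `practical_score` is a hypothetical name rather than the site's actual function. Dimensions without data (shown as n/a in the table) are skipped and the remaining weights are renormalized, which is also an assumption.

```python
from typing import Mapping, Optional

# Hypothetical weights. The page states the ordering only, not the values.
WEIGHTS = {
    "intelligence": 0.40,
    "coding": 0.20,
    "reasoning": 0.20,
    "speed": 0.10,
    "context": 0.05,
    "pricing": 0.05,
}


def practical_score(scores: Mapping[str, Optional[float]]) -> float:
    """Weighted mean of 0-100 sub-scores, ignoring dimensions that are None."""
    used = {k: w for k, w in WEIGHTS.items() if scores.get(k) is not None}
    if not used:
        raise ValueError("no scored dimensions")
    total_weight = sum(used.values())
    return sum(scores[k] * w for k, w in used.items()) / total_weight


# Two made-up models that are close on intelligence but differ on speed.
model_a = {"intelligence": 70, "coding": 60, "reasoning": 90, "speed": 40,
           "context": None, "pricing": None}
model_b = {"intelligence": 69, "coding": 60, "reasoning": 90, "speed": 80,
           "context": None, "pricing": None}

if __name__ == "__main__":
    print(round(practical_score(model_a), 1))
    print(round(practical_score(model_b), 1))
```

With these placeholder weights, `model_b` ranks above `model_a` despite its slightly lower intelligence value, which illustrates how a low-weight dimension such as speed can move the ranking when models are close.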
Data sources and confidence
The benchmark feed can come from API-backed sources such as Artificial Analysis, imported JSON, and admin-reviewed manual rows. Each row has a source and confidence value. Higher confidence means the row is based on a clearer external source or a cleaner imported feed.
If a model has only metadata but no trusted benchmark data, it should stay inactive or metadata-only in admin. That prevents models without benchmark data from polluting Radar rankings.
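The gating rule above can be sketched as a filter applied before ranking. The field names (`source`, `confidence`, `overall`), the 0.0 to 1.0 confidence scale, and the 0.5 threshold are all assumptions made for illustration. The page says only that each row has a source and a confidence value and that rows without trusted benchmark data should stay inactive or metadata-only. The sample rows are invented.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Row:
    model: str
    source: str              # e.g. "Artificial Analysis", "manual"
    confidence: float        # assumed scale: 0.0 (untrusted) to 1.0 (trusted)
    overall: Optional[int]   # None means only metadata is available


def status(row: Row, min_confidence: float = 0.5) -> str:
    """Classify a row as 'metadata-only', 'inactive', or 'active'."""
    if row.overall is None:
        return "metadata-only"
    if row.confidence < min_confidence:
        return "inactive"
    return "active"


def radar_rows(rows: List[Row], min_confidence: float = 0.5) -> List[Row]:
    """Keep only active rows, sorted by overall value, highest first."""
    active = [r for r in rows if status(r, min_confidence) == "active"]
    return sorted(active, key=lambda r: r.overall, reverse=True)


SAMPLE = [
    Row("example-model-a", "Artificial Analysis", 0.9, 64),
    Row("example-model-b", "manual", 0.3, 70),   # low confidence: excluded
    Row("example-model-c", "imported JSON", 0.8, None),  # metadata only
    Row("example-model-d", "Artificial Analysis", 0.9, 68),
]

if __name__ == "__main__":
    for row in radar_rows(SAMPLE):
        print(row.model, row.overall)
```

Keeping the classification in `status` separate from the ranking in `radar_rows` means an admin view could still list inactive and metadata-only rows without letting them into the ranking.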