LLM Landscape 2026: Intelligence Leaderboard and Model Guide

A June 2026 snapshot of the top AI language models, ranked one representative per vendor using the AA Intelligence Index v4.0. The frontier has shifted since April: Claude Opus 4.8 now leads the composite index, GPT-5.5 succeeded GPT-5.4 at higher pricing, Gemini 3.5 Flash is GA with Gemini 3.5 Pro expected this month, and open-weight standouts Kimi K2.6 and DeepSeek V4 Pro have jumped into the top tier. Google's new Gemma 4 family reshapes the open-weight story — Gemma 4 31B (AA 39) now leads the composite open-weight comparison over gpt-oss-120B (AA 33). Note: Gemini 3.5 Pro may reshuffle the closed-model top ranks again within weeks.

Leaderboard Methodology

The table below ranks one model per provider — the provider's newest or most clearly superior flagship. Scores use the AA (Artificial Analysis) Intelligence Index v4.0, which aggregates ten evaluations — GDPval-AA, τ²-Bench Telecom, Terminal-Bench Hard, SciCode, AA-LCR, AA-Omniscience, IFBench, Humanity's Last Exam (HLE), GPQA Diamond, and CritPt — into a single normalized integer. Score shifts versus the April update partly reflect this expanded methodology, not capability alone. Context windows are shown as token counts with commas; missing public data is shown as "—". Pricing is per million tokens (input / output); unavailable values are also shown as "—".
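Because prices are quoted per million tokens, the cost of a single request follows from simple arithmetic. The sketch below is illustrative only: the function name `request_cost` and the example token counts are assumptions, while the prices are the list prices quoted in the leaderboard for Claude Opus 4.8 and GPT-5.5.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """Return the USD cost of one request given per-million-token prices."""
    input_cost = (input_tokens / 1_000_000) * input_price_per_m
    output_cost = (output_tokens / 1_000_000) * output_price_per_m
    return input_cost + output_cost


# Hypothetical workload: 200,000 input tokens and 10,000 output tokens.
opus_cost = request_cost(200_000, 10_000, 5.00, 25.00)  # Claude Opus 4.8
gpt_cost = request_cost(200_000, 10_000, 5.00, 30.00)   # GPT-5.5

print(f"Claude Opus 4.8: ${opus_cost:.2f}")
print(f"GPT-5.5:         ${gpt_cost:.2f}")
```

For this assumed workload the two flagships come out at $1.25 and $1.30: with identical input pricing, the difference comes entirely from the output rate.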

Two vendors span two distinct product lines: OpenAI (GPT-5.5 API flagship vs. the open-weight gpt-oss line) and Google (Gemini API flagship vs. open-weight Gemma 4). Google therefore gets a second row, with the Gemini flagship at rank 3 and Gemma 4 31B at rank 20; gpt-oss-120B no longer holds a row and is covered in the note below the table. The split reflects how practitioners actually choose (cloud Gemini/GPT vs. self-hosted Gemma/gpt-oss) without conflating API and open-weight releases under a single model name.

The Intelligence Leaderboard: Top 20 LLMs by Vendor (June 2026)

| Rank | Model | Vendor | Capability Index (AA Index) | Context Window (tokens) | Input Cost ($/M tokens) | Output Cost ($/M tokens) | Notes |
|---|---|---|---|---|---|---|---|
| 1 | Claude Opus 4.8 | Anthropic | 61 | 1,000,000 | $5.00 | $25.00 | Leads AA Index v4.0; SWE-Bench Pro 69.2%, HLE 57.9%, GDPval-AA 1890; pricing flat vs Opus 4.6 |
| 2 | GPT-5.5 | OpenAI | 58 | 1,050,000 | $5.00 | $30.00 | OpenAI flagship (Apr 23); Terminal-Bench 2.0 82.7%, ARC-AGI-2 85.0%; output price doubled vs GPT-5.4 |
| 3 | Gemini 3.1 Pro Preview | Google | 57 | 1,000,000 | $1.25 | $10.00 | Table entry unchanged; Gemini 3.5 Flash GA (~AA 55, $1.50/$9, 4× speed, beats 3.1 Pro on coding/agents). Gemini 3.5 Pro expected June 2026 |
| 4 | Kimi K2.6 | Moonshot AI | 54 | — | — | — | Open Weight. 1T params; top-10 open-weight contender; pricing not yet fully confirmed |
| 5 | MiMo-V2.5-Pro | Xiaomi | 54 | 1,000,000 | — | — | Successor to MiMo-V2-Pro; pricing not publicly disclosed |
| 6 | DeepSeek V4 Pro | DeepSeek AI | 52 | 1,000,000 | $2.19 | $8.76 | Open Weight. 1.6T MoE (49B active); hybrid sparse attention; V4 Flash for cost-sensitive pipelines |
| 7 | GLM-5 | Z.ai | 50 | 200,000 | $1.00 | $3.20 | Strong agentic engineering positioning; no material change since April |
| 8 | Grok 4.20 Beta 0309 | xAI | 48 | 200,000+ | $2.00 | $6.00 | xAI flagship; fast, tool-heavy, agentic model |
| 9 | Qwen3.5 397B A17B | Alibaba | 45 | 262,000 | $0.60 | $3.60 | Open Weight. Best current Qwen-family representative; Apache 2.0 |
| 10 | MiniMax-M2.7 | MiniMax | 42 | — | — | — | Strong current entrant; notable capability/value tradeoff |
| 11 | NVIDIA Nemotron 3 Super 120B A12B | NVIDIA | 36 | 1,000,000 | $0.30 | $0.75 | Open Weight. Strong open enterprise contender; excellent price/performance |
| 12 | Mistral Large 3 | Mistral | 23 | — | — | — | Best current public Mistral flagship |
| 13 | Nova Premier | Amazon | 19 | 1,000,000 | $2.50 | $12.50 | Hyperscaler representative; broad enterprise relevance |
| 14 | ERNIE 4.5 300B A47B | Baidu | 15 | — | — | — | Best verifiable ERNIE-family public entry |
| 15 | Llama 4 Scout | Meta | 14 | 10,000,000 | — | — | Open Weight. Context-window outlier; 10M tokens for self-hosting |
| 16 | Command A | Cohere | 13 | 256,000 | $2.50 | $10.00 | Practical enterprise/workflow model; RAG and tool use focus |
| 17 | Granite 4.0 H Small | IBM | 11 | — | — | — | Open Weight. Enterprise and open-governance relevance |
| 18 | Jamba 1.7 Large | AI21 | 11 | — | — | — | Solid enterprise positioning; hybrid SSM/Transformer architecture |
| 19 | Yi-Lightning | 01.AI | — | — | — | — | Vendor-diversity slot; public specs not fully verified |
| 20 | Gemma 4 31B | Google (open-weight) | 39 | — | — | — | Open Weight. Top composite open-weight AA here; beats gpt-oss-120B (33) at ~4× fewer parameters. Gemini API flagship remains rank 3 |

gpt-oss-120B (OpenAI, AA 33, $0.30 / $0.30 per M via API) remains a useful reference for managed open-weight access but no longer leads the composite open-weight AA comparison — displaced by Gemma 4 31B. The new Gemma 4 12B (no AA score yet; likely ~22–27) targets a different trade-off: laptop-class deployment vs. multi-GPU serving for 120B-class models.

Key Takeaways

Peak Intelligence
Claude Opus 4.8 leads the AA Index v4.0 at 61 — a clear step above the prior tied leaders. GPT-5.5 (~58) and Gemini 3.1 Pro Preview (57) follow; GPT-5.5 leads Terminal-Bench 2.0 (82.7%) and ARC-AGI-2 (85.0%). Gemini 3.5 Pro, expected June 2026, may reshuffle the top ranks again.
Coding & Agentic Leadership
Claude Opus 4.8 leads SWE-Bench Pro (69.2%) and GDPval-AA (1890); Grok 4.20 still dominates fast, tool-heavy, real-time workflows. Gemini 3.5 Flash beats Gemini 3.1 Pro on coding and agents at 4× the speed.
Context-Window Outlier
Llama 4 Scout pushes open-weight context to 10,000,000 tokens — enabling full-codebase and corpus-scale analysis in a single pass. The top closed flagships cluster at 1,000,000–1,050,000 tokens.
Cost-Efficient Frontier
NVIDIA Nemotron 3 Super ($0.30 / $0.75) and DeepSeek V4 Flash still anchor the value tier. DeepSeek V4 Pro trades higher cost ($2.19 / $8.76) for a 1M context window and AA 52. Flash-class models like Gemini 3.5 Flash ($1.50 / $9) now deliver last year's premium-tier capability, so the "cheap tier" is compressing upward.
Emerging Challengers
Kimi K2.6 (AA 54, 1T params) and DeepSeek V4 Pro (AA 52, 1M context) jumped from the mid-tier into the overall top 10 as open-weight contenders. MiMo-V2.5-Pro (Xiaomi, AA 54) and GLM-5 (Z.ai, AA 50) reinforce a genuinely global frontier.
Open-Weight Size Efficiency (Gemma 4)
Gemma 4 31B (AA 39) leads gpt-oss-120B (AA 33) on composite intelligence; Gemma 4 26B MoE (AA 31) hits 79.2% GPQA Diamond vs. 76.2% for gpt-oss-120B. The 12B variant is not an AA leader — it is a deployment play: comparable practical value on a laptop vs. multi-GPU infrastructure for 120B-class serving.

Key Performance Metrics

Task-Specific Leaders
| Model | Benchmark Leadership |
|---|---|
| Claude Opus 4.8 | AA Index v4.0 leader · SWE-Bench Pro · HLE · AA 61 |
| GPT-5.5 | Terminal-Bench 2.0 · ARC-AGI-2 · AA ~58 |
| Gemini 3.1 Pro | Multimodal · AA 57; 3.5 Flash leads coding/agents |
| Kimi K2.6 & DeepSeek V4 Pro | Open-weight top tier · AA 52–54 |
| Gemma 4 31B | Open-weight composite · AA 39 · beats gpt-oss-120B |
Context Window Champions
| Model | Tokens |
|---|---|
| Llama 4 Scout | 10,000,000 |
| GPT-5.5 | 1,050,000 |
| Gemini 3.1 · Claude · MiMo · DeepSeek V4 Pro · Nemotron | 1,000,000 |
| Qwen3.5 397B | 262,000 |
10M tokens fits entire codebases; 1M+ handles legal corpora and research archives in one pass
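A quick way to use these figures is to check which models can hold a given job in a single pass. The sketch below is a minimal illustration: the helper `models_that_fit` is hypothetical, the window sizes are copied from the figures above, and it assumes that reserved output tokens count against the same window, which may not hold for every provider.

```python
# Context windows (tokens) as listed in this article.
CONTEXT_WINDOWS = {
    "Llama 4 Scout": 10_000_000,
    "GPT-5.5": 1_050_000,
    "Gemini 3.1 Pro Preview": 1_000_000,
    "Claude Opus 4.8": 1_000_000,
    "MiMo-V2.5-Pro": 1_000_000,
    "DeepSeek V4 Pro": 1_000_000,
    "NVIDIA Nemotron 3 Super": 1_000_000,
    "Qwen3.5 397B": 262_000,
}


def models_that_fit(prompt_tokens: int, reserved_output_tokens: int = 0) -> list[str]:
    """Return models whose window covers the prompt plus reserved output.

    Assumption: output tokens share the context window with the prompt.
    """
    needed = prompt_tokens + reserved_output_tokens
    return [name for name, window in CONTEXT_WINDOWS.items() if window >= needed]


print(models_that_fit(2_000_000))        # a corpus larger than any 1M-class window
print(models_that_fit(990_000, 20_000))  # just over the 1,000,000-token line
```

Under these assumptions a 2,000,000-token corpus leaves only Llama 4 Scout, while a 990,000-token prompt with 20,000 tokens reserved for output fits Llama 4 Scout and GPT-5.5 but not the 1,000,000-token models.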
Cost Efficiency
| Tier | Models | Output $/M |
|---|---|---|
| Best Value | Nemotron · DeepSeek V4 Flash | ~$0.75–$2 |
| Mid-Range | DeepSeek V4 Pro · Qwen3.5 · GLM-5 · Gemini 3.5 Flash | $3.20–$9.00 |
| Flagship | Gemini 3.1 · Grok 4.20 | $6.00–$10.00 |
| Premium | GPT-5.5 · Claude Opus 4.8 | $25.00–$30.00 |
Open-weight models DeepSeek V4 Pro and Kimi K2.6 reach top-10 AA scores, with Nemotron just outside at rank 11; where pricing is published, their output costs are a fraction of closed-flagship rates
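To put a number on how much cheaper open-weight output tokens are than closed-flagship ones, the sketch below divides each model's output price by a reference flagship's. The helper name `output_cost_fraction` and the choice of Claude Opus 4.8 as the default reference are assumptions for illustration; the prices are the leaderboard's list prices. Kimi K2.6 is left out because its pricing is not yet confirmed.

```python
# Output price in $ per million tokens, from the leaderboard table.
OUTPUT_PRICE = {
    "Claude Opus 4.8": 25.00,
    "GPT-5.5": 30.00,
    "DeepSeek V4 Pro": 8.76,
    "NVIDIA Nemotron 3 Super": 0.75,
}


def output_cost_fraction(model: str, reference: str = "Claude Opus 4.8") -> float:
    """Return the model's output price as a fraction of the reference model's."""
    return OUTPUT_PRICE[model] / OUTPUT_PRICE[reference]


for name in ("DeepSeek V4 Pro", "NVIDIA Nemotron 3 Super"):
    print(f"{name}: {output_cost_fraction(name):.0%} of Claude Opus 4.8 output cost")
```

On these list prices DeepSeek V4 Pro costs about 35% of Claude Opus 4.8 per output token and Nemotron about 3%. The comparison covers price only and says nothing about quality at that price.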

Specialized Performance Highlights

Speed & Latency
Grok 4.20
Purpose-built for fast, tool-heavy, real-time agent loops — lowest-latency frontier model in the top 10
GPT-5.5 Instant & Flash-class siblings
GPT-5.5 Instant (May 5) is the new ChatGPT default for low-latency work. Gemini 3.5 Flash and Claude Haiku-class models sit outside the one-per-vendor table — use for interactive and streaming applications
NVIDIA Nemotron 3 Super
Competitive throughput at open-weight cost — strong for high-volume enterprise inference pipelines
Open-Weight Excellence
| Model | Key Strength |
|---|---|
| Gemma 4 31B | AA 39 · beats gpt-oss-120B on composite · ~4× smaller |
| Gemma 4 12B | Laptop-class · size efficiency vs. 120B serving cost |
| Llama 4 Scout | 10M-token context · corpus-scale tasks |
| Qwen3.5 397B | 262K ctx · multilingual · Apache 2.0 |
| DeepSeek V4 Pro | 1M context · AA 52 · MoE architecture |
| NVIDIA Nemotron | 1M context · enterprise self-hosting |
| Kimi K2.6 | 1T params · AA 54 · top open-weight tier |
Gemma 4 31B sets the open-weight composite bar (AA 39); gpt-oss-120B (AA 33) remains strong for API-priced access. Gemma 4 12B is about deployment footprint, not displacing 120B on AA.

Model Selection Guide

Peak Intelligence
Claude Opus 4.8 · GPT-5.5 · Gemini 3.1 Pro
Opus 4.8 leads AA v4.0 (61); GPT-5.5 for Terminal-Bench and ARC-AGI-2; Gemini 3.1 Pro for multimodal — watch for Gemini 3.5 Pro in June
Coding & Agents
Claude Opus 4.8 · Grok 4.20 · Gemini 3.5 Flash
Claude Opus 4.8 for SWE-Bench Pro and long-horizon agents; Grok 4.20 for fast real-time loops; Gemini 3.5 Flash when speed and coding throughput matter
Massive Context
Llama 4 Scout · GPT-5.5 · DeepSeek V4 Pro
Llama 4 Scout (10M tokens, open-weight) for full-codebase tasks; GPT-5.5 (1.05M), DeepSeek V4 Pro (1M), and Gemini 3.1 Pro (1M) for closed and open long-document pipelines
Cost Optimization
DeepSeek V4 Flash · NVIDIA Nemotron · Gemini 3.5 Flash
Nemotron and DeepSeek V4 Flash for lowest per-token cost; Gemini 3.5 Flash for frontier-class speed at mid-tier pricing ($9/M output)
Self-Hosting
Gemma 4 31B · Llama 4 Scout · DeepSeek V4 Pro · Kimi K2.6 · Gemma 4 12B
Gemma 4 31B for best open-weight AA (39); Gemma 4 12B for laptop/single-GPU; Llama 4 Scout for 10M context; Kimi K2.6 / DeepSeek V4 Pro for frontier open-weight scale
Agentic Engineering
GLM-5 · MiMo-V2.5-Pro
Z.ai and Xiaomi top-10 entrants (AA 50–54) positioning for next-generation tool-use and agentic engineering pipelines
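The selection guide above amounts to a lookup from workload category to a shortlist of candidates. One possible encoding is sketched below; the `SELECTION_GUIDE` dictionary and the `shortlist` helper are hypothetical names, and the categories and model lists simply mirror this guide rather than any vendor's routing API.

```python
# Workload category -> candidate models, mirroring the guide in this section.
SELECTION_GUIDE = {
    "peak intelligence": ["Claude Opus 4.8", "GPT-5.5", "Gemini 3.1 Pro"],
    "coding & agents": ["Claude Opus 4.8", "Grok 4.20", "Gemini 3.5 Flash"],
    "massive context": ["Llama 4 Scout", "GPT-5.5", "DeepSeek V4 Pro"],
    "cost optimization": ["DeepSeek V4 Flash", "NVIDIA Nemotron", "Gemini 3.5 Flash"],
    "self-hosting": ["Gemma 4 31B", "Llama 4 Scout", "DeepSeek V4 Pro",
                     "Kimi K2.6", "Gemma 4 12B"],
    "agentic engineering": ["GLM-5", "MiMo-V2.5-Pro"],
}


def shortlist(workload: str) -> list[str]:
    """Return the candidate models for a workload category (case-insensitive)."""
    key = workload.strip().lower()
    if key not in SELECTION_GUIDE:
        known = ", ".join(sorted(SELECTION_GUIDE))
        raise KeyError(f"Unknown workload {workload!r}; known categories: {known}")
    return SELECTION_GUIDE[key]


print(shortlist("Massive Context"))
print(shortlist("Agentic Engineering"))
```

A real router would also weigh context length, budget, and latency for each request; the table only narrows the field to a shortlist.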

Industry Impact & Future Trends (2026)

The 2026 LLM landscape is defined by task-specific leadership, a score reshuffle under AA Index v4.0, and a widening gap between flagship and cost-efficient tiers:

Coding & Agents
Claude Opus 4.8 leads SWE-Bench Pro (69.2%) and GDPval-AA. GPT-5.5 leads Terminal-Bench 2.0. Grok 4.20 still excels in fast, tool-heavy workflows. GLM-5 remains a strong agentic-engineering option.
Context & Long-Horizon
DeepSeek V4 Pro brings 1M-token open-weight context (up from 128K in V3.2). GPT-5.5 (1.05M) and Gemini 3.1 Pro (1M) anchor closed-model pipelines. Llama 4 Scout's 10M tokens remain the open-weight outlier for corpus-scale work.
Architecture & Policy
Subquadratic (SubQ) is pursuing subquadratic sparse attention with a 12M-token context — an architectural path beyond O(n²) transformers for long-horizon agents. The U.S. Commerce Department expanded pre-release AI safety testing to five major labs (adding Google DeepMind, Microsoft, and xAI to Anthropic and OpenAI), tying frontier release cadence to regulatory review.
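The appeal of subquadratic attention at these lengths shows up in a rough count of the token pairs a model must score. The sketch below compares full O(n²) attention with a generic fixed-window scheme. The window scheme and its 4,096-token width are stand-in assumptions chosen only to illustrate subquadratic growth; they are not a description of SubQ's actual method.

```python
def dense_attention_pairs(n_tokens: int) -> int:
    """Token pairs scored by full self-attention: grows as n^2."""
    return n_tokens * n_tokens


def windowed_attention_pairs(n_tokens: int, window: int = 4096) -> int:
    """Pairs scored if each token attends to at most `window` tokens: grows as n * window.

    Assumption: a fixed local window, used here only as a generic subquadratic example.
    """
    return n_tokens * min(window, n_tokens)


for n in (1_000_000, 12_000_000):
    dense = dense_attention_pairs(n)
    sparse = windowed_attention_pairs(n)
    print(f"{n:>12,} tokens: dense {dense:.2e} pairs, windowed {sparse:.2e} pairs")
```

Growing the context 12× (from 1M to 12M tokens) multiplies the dense pair count by 144 but the windowed count by only 12, which is the kind of gap that makes multi-million-token agents plausible.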

Conclusion

The June 2026 LLM landscape shifted materially in eight weeks. Anthropic's Claude Opus 4.8 now leads the AA Index v4.0 at 61. OpenAI's GPT-5.5 replaced GPT-5.4 at roughly double the output price. Google's Gemini 3.5 Flash is GA while Gemini 3.5 Pro is imminent — and the Gemma 4 open-weight line (31B at AA 39) displaced gpt-oss-120B as the composite open-weight benchmark in this table. Open-weight leaders Kimi K2.6 and DeepSeek V4 Pro jumped to AA 52–54 with 1M-token context on the DeepSeek side. xAI's Grok 4.20, Z.ai's GLM-5, and Xiaomi's MiMo-V2.5-Pro round out a top tier that is more globally distributed than ever. Meta's Llama 4 Scout still owns the 10M-token open-weight niche.

Strategic Takeaway (2026)

"Which LLM to use?" is now a real architectural decision with no single correct answer. Match the model to the workload: peak composite intelligence (AA v4.0) → Claude Opus 4.8; Terminal-Bench and ARC-style reasoning → GPT-5.5; coding and long-horizon agents → Claude Opus 4.8 or Gemini 3.5 Flash; tool-heavy real-time agents → Grok 4.20; best open-weight composite AA → Gemma 4 31B; laptop / single-GPU open-weight → Gemma 4 12B; massive context → Llama 4 Scout (10M) or DeepSeek V4 Pro (1M); cost-sensitive API open-weight → gpt-oss-120B ($0.30/M). Success depends on pairing context, infra, pricing, and task fit — not defaulting to a single vendor.

Looking ahead, AA Index v4.0's ten-eval composite is the new baseline for comparing models; Flash-tier pricing now buys last year's premium capability; and subquadratic attention architectures point toward agents that operate across millions of tokens without quadratic cost blowups. Expect another leaderboard refresh when Gemini 3.5 Pro ships — likely within weeks of this update.