LLM Landscape 2026: Intelligence Leaderboard and Model Guide
Leaderboard Methodology
The table below ranks one model per provider — the provider's newest or most clearly superior flagship. Scores use the AA (Artificial Analysis) Intelligence Index v4.0, which aggregates ten evaluations — GDPval-AA, τ²-Bench Telecom, Terminal-Bench Hard, SciCode, AA-LCR, AA-Omniscience, IFBench, Humanity's Last Exam (HLE), GPQA Diamond, and CritPt — into a single normalized integer. Score shifts versus the April update partly reflect this expanded methodology, not capability alone. Context windows are shown as token counts with commas; missing public data is shown as "—". Pricing is per million tokens (input / output); unavailable values are also shown as "—".
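Two vendors are split across distinct product lines: OpenAI (the GPT-5.5 API flagship vs. its historical open-weight slot, gpt-oss) and Google (the Gemini API flagship at rank 3 vs. open-weight Gemma 4 at rank 20). Gemma 4 sits at rank 20 because it is a second row for Google, not because of its score, and gpt-oss-120B is covered in the note below the table instead of in a row of its own. That split reflects how practitioners actually choose (cloud Gemini/GPT vs. self-hosted Gemma/gpt-oss) without conflating API and open-weight releases under a single model name.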
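As a rough illustration of how ten evaluation results can collapse into one integer, the sketch below takes an equal-weight mean of per-evaluation scores that are assumed to be already normalized to a 0–100 scale. The equal weighting, the rounding rule, the function name `composite_index`, and the example scores are all assumptions made for illustration; this guide does not state how the AA Index actually weights or normalizes its components.

```python
from statistics import mean

# The ten evaluations named in the methodology above.
EVALS = [
    "GDPval-AA", "τ²-Bench Telecom", "Terminal-Bench Hard", "SciCode", "AA-LCR",
    "AA-Omniscience", "IFBench", "HLE", "GPQA Diamond", "CritPt",
]


def composite_index(normalized: dict[str, float]) -> int:
    """Equal-weight mean of per-eval scores, rounded to an integer.

    ASSUMPTION: every score is already on a 0-100 scale and all evals
    count equally. The real index may normalize and weight differently.
    """
    missing = [name for name in EVALS if name not in normalized]
    if missing:
        raise ValueError(f"missing evaluations: {missing}")
    return round(mean(normalized[name] for name in EVALS))


# Made-up normalized scores, for illustration only (not real model results).
example = dict(zip(EVALS, [70, 80, 50, 55, 65, 40, 60, 58, 90, 32]))
print(composite_index(example))
```

Raw results on different scales (for example a GDPval-AA rating in the thousands next to percentage scores) would need to be mapped onto a common scale before a mean like this is meaningful.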
The Intelligence Leaderboard: Top 20 LLMs by Vendor (June 2026)
| Rank | Model | Vendor | AA Intelligence Index | Context Window (tokens) | Input Cost ($/M tokens) | Output Cost ($/M tokens) | Notes |
|---|---|---|---|---|---|---|---|
| 1 | Claude Opus 4.8 | Anthropic | 61 | 1,000,000 | $5.00 | $25.00 | Leads AA Index v4.0; SWE-Bench Pro 69.2%, HLE 57.9%, GDPval-AA 1890; pricing flat vs. Opus 4.6 |
| 2 | GPT-5.5 | OpenAI | 58 | 1,050,000 | $5.00 | $30.00 | OpenAI flagship (Apr 23); Terminal-Bench 2.0 82.7%, ARC-AGI-2 85.0%; output price doubled vs. GPT-5.4 |
| 3 | Gemini 3.1 Pro Preview | Google | 57 | 1,000,000 | $1.25 | $10.00 | Table entry unchanged; Gemini 3.5 Flash is GA (AA ~55, $1.50 / $9.00, 4× speed, beats 3.1 Pro on coding/agents); Gemini 3.5 Pro expected June 2026 |
| 4 | Kimi K2.6 | Moonshot AI | 54 | — | — | — | Open weight; 1T params; top-10 open-weight contender; pricing not yet fully confirmed |
| 5 | MiMo-V2.5-Pro | Xiaomi | 54 | 1,000,000 | — | — | Successor to MiMo-V2-Pro; pricing not publicly disclosed |
| 6 | DeepSeek V4 Pro | DeepSeek AI | 52 | 1,000,000 | $2.19 | $8.76 | Open weight; 1.6T MoE (49B active); hybrid sparse attention; V4 Flash for cost-sensitive pipelines |
| 7 | GLM-5 | Z.ai | 50 | 200,000 | $1.00 | $3.20 | Strong agentic engineering positioning; no material change since April |
| 8 | Grok 4.20 Beta 0309 | xAI | 48 | 200,000+ | $2.00 | $6.00 | xAI flagship; fast, tool-heavy, agentic model |
| 9 | Qwen3.5 397B A17B | Alibaba | 45 | 262,000 | $0.60 | $3.60 | Open weight; best current Qwen-family representative; Apache 2.0 |
| 10 | MiniMax-M2.7 | MiniMax | 42 | — | — | — | Strong current entrant; notable capability/value tradeoff |
| 11 | NVIDIA Nemotron 3 Super 120B A12B | NVIDIA | 36 | 1,000,000 | $0.30 | $0.75 | Open weight; strong open enterprise contender; excellent price/performance |
| 12 | Mistral Large 3 | Mistral | 23 | — | — | — | Best current public Mistral flagship |
| 13 | Nova Premier | Amazon | 19 | 1,000,000 | $2.50 | $12.50 | Hyperscaler representative; broad enterprise relevance |
| 14 | ERNIE 4.5 300B A47B | Baidu | 15 | — | — | — | Best verifiable ERNIE-family public entry |
| 15 | Llama 4 Scout | Meta | 14 | 10,000,000 | — | — | Open weight; context-window outlier; 10M tokens for self-hosting |
| 16 | Command A | Cohere | 13 | 256,000 | $2.50 | $10.00 | Practical enterprise/workflow model; RAG and tool use focus |
| 17 | Granite 4.0 H Small | IBM | 11 | — | — | — | Open weight; enterprise and open-governance relevance |
| 18 | Jamba 1.7 Large | AI21 | 11 | — | — | — | Solid enterprise positioning; hybrid SSM/Transformer architecture |
| 19 | Yi-Lightning | 01.AI | — | — | — | — | Vendor-diversity slot; public specs not fully verified |
| 20 | Gemma 4 31B | Google (open-weight) | 39 | — | — | — | Open weight; second row for Google, placed outside score order; leads the composite AA comparison against gpt-oss-120B (33) with roughly a quarter of the parameters; Gemini API flagship remains rank 3 |
gpt-oss-120B (OpenAI, AA 33, $0.30 / $0.30 per M via API) remains a useful reference for managed open-weight access but no longer leads the composite open-weight AA comparison — displaced by Gemma 4 31B. The new Gemma 4 12B (no AA score yet; likely ~22–27) targets a different trade-off: laptop-class deployment vs. multi-GPU serving for 120B-class models.
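
The table's conventions (commas in token counts, a dollar sign on prices, "—" for unavailable data) need a little care when the rows are loaded into a script. The sketch below shows one way to parse them; the helper names `parse_tokens` and `parse_price` are hypothetical, and treating the trailing "+" in "200,000+" as a lower bound is an assumption.

```python
from typing import Optional

MISSING = "—"  # marker the table uses for unavailable data


def parse_tokens(cell: str) -> Optional[int]:
    """Parse a context-window cell such as '1,000,000' or '200,000+'."""
    cell = cell.strip()
    if cell == MISSING:
        return None
    # Assumption: a trailing '+' marks a lower bound, so the stated
    # figure is kept as a conservative value.
    return int(cell.rstrip("+").replace(",", ""))


def parse_price(cell: str) -> Optional[float]:
    """Parse a price cell such as '$5.00' into dollars per million tokens."""
    cell = cell.strip()
    if cell == MISSING:
        return None
    return float(cell.lstrip("$"))


# Three rows copied from the leaderboard: (model, context, input, output).
rows = [
    ("Kimi K2.6", "—", "—", "—"),
    ("Grok 4.20 Beta 0309", "200,000+", "$2.00", "$6.00"),
    ("Claude Opus 4.8", "1,000,000", "$5.00", "$25.00"),
]
parsed = [(m, parse_tokens(c), parse_price(i), parse_price(o)) for m, c, i, o in rows]

# Sort by context window (largest first), keeping unknown values last
# instead of silently treating them as zero.
by_context = sorted(parsed, key=lambda r: (r[1] is None, -(r[1] or 0)))
for row in by_context:
    print(row)
```

Keeping missing values as `None` makes it harder to confuse an undisclosed price with a free one in later arithmetic.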
Key Takeaways
Key Performance Metrics
Task-Specific Leaders
| Model | Benchmark Leadership |
|---|---|
| Claude Opus 4.8 | AA Index v4.0 leader · SWE-Bench Pro · HLE · AA 61 |
| GPT-5.5 | Terminal-Bench 2.0 · ARC-AGI-2 · AA 58 |
| Gemini 3.1 Pro | Multimodal · AA 57; Gemini 3.5 Flash beats it on coding/agents |
| Kimi K2.6 & DeepSeek V4 Pro | Open-weight top tier · AA 52–54 |
| Gemma 4 31B | Open-weight composite · AA 39 · beats gpt-oss-120B |
Context Window Champions
| Model | Tokens |
|---|---|
| Llama 4 Scout | 10,000,000 |
| GPT-5.5 | 1,050,000 |
| Gemini 3.1 Pro · Claude Opus 4.8 · MiMo-V2.5-Pro · DeepSeek V4 Pro · Nemotron 3 Super · Nova Premier | 1,000,000 |
| Qwen3.5 397B | 262,000 |
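
A context window only helps if the whole request fits inside it. The sketch below filters the models by a required token budget, using the window sizes listed in this guide. The function name `models_that_fit` is hypothetical, and the sketch assumes that prompt and generated output share one window, which may not hold for every provider.

```python
# Context windows (tokens) copied from the tables in this guide.
CONTEXT_WINDOWS = {
    "Llama 4 Scout": 10_000_000,
    "GPT-5.5": 1_050_000,
    "Claude Opus 4.8": 1_000_000,
    "Gemini 3.1 Pro Preview": 1_000_000,
    "DeepSeek V4 Pro": 1_000_000,
    "Qwen3.5 397B A17B": 262_000,
    "Command A": 256_000,
    "GLM-5": 200_000,
}


def models_that_fit(prompt_tokens: int, reserved_output_tokens: int = 0) -> list[str]:
    """Return models whose window covers prompt plus reserved output.

    ASSUMPTION: input and output tokens count against the same window.
    Results are ordered from the smallest sufficient window upward.
    """
    needed = prompt_tokens + reserved_output_tokens
    fitting = [name for name, window in CONTEXT_WINDOWS.items() if window >= needed]
    return sorted(fitting, key=lambda name: CONTEXT_WINDOWS[name])


# Hypothetical request: a 1.02M-token corpus plus 16K tokens of output.
print(models_that_fit(1_020_000, 16_000))
```

For that hypothetical request, only GPT-5.5 and Llama 4 Scout qualify, since the 1,000,000-token models fall short once output headroom is reserved.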
Cost Efficiency
| Tier | Models | Output $/M |
|---|---|---|
| Best Value | Nemotron · DeepSeek V4 Flash | ~$0.75–$2 |
| Mid-Range | DeepSeek V4 Pro · Qwen3.5 · GLM-5 · Gemini 3.5 Flash | $3.20–$9.00 |
| Flagship | Gemini 3.1 · Grok 4.20 | $6.00–$10.00 |
| Premium | GPT-5.5 · Claude Opus 4.8 | $25.00–$30.00 |
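
Because pricing is quoted per million tokens with separate input and output rates, the cost of a workload depends on its input/output mix as well as the headline price. The sketch below applies the leaderboard prices to a hypothetical workload; the function name `workload_cost` and the token volumes are illustrative assumptions.

```python
# (input $/M tokens, output $/M tokens), copied from the leaderboard.
PRICES = {
    "Claude Opus 4.8": (5.00, 25.00),
    "GPT-5.5": (5.00, 30.00),
    "Gemini 3.1 Pro Preview": (1.25, 10.00),
    "DeepSeek V4 Pro": (2.19, 8.76),
    "NVIDIA Nemotron 3 Super 120B A12B": (0.30, 0.75),
}


def workload_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a workload at the listed per-million-token prices."""
    input_price, output_price = PRICES[model]
    return (input_tokens / 1_000_000) * input_price + (output_tokens / 1_000_000) * output_price


# Hypothetical workload: 2M input tokens and 500K output tokens.
INPUT_TOKENS = 2_000_000
OUTPUT_TOKENS = 500_000

cheapest_first = sorted(PRICES, key=lambda m: workload_cost(m, INPUT_TOKENS, OUTPUT_TOKENS))
for model in cheapest_first:
    print(f"{model:<36} ${workload_cost(model, INPUT_TOKENS, OUTPUT_TOKENS):,.2f}")
```

On this input-heavy mix, Claude Opus 4.8 comes to $22.50 and GPT-5.5 to $25.00; the gap widens as the share of output tokens grows, since the two differ only in output price.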
Specialized Performance Highlights
Speed & Latency
Open-Weight Excellence
| Model | Key Strength |
|---|---|
| Gemma 4 31B | AA 39 · beats gpt-oss-120B on composite · ~4× smaller |
| Gemma 4 12B | Laptop-class · size efficiency vs. 120B serving cost |
| Llama 4 Scout | 10M-token context · corpus-scale tasks |
| Qwen3.5 397B | 262K ctx · multilingual · Apache 2.0 |
| DeepSeek V4 Pro | 1M context · AA 52 · MoE architecture |
| NVIDIA Nemotron | 1M context · enterprise self-hosting |
| Kimi K2.6 | 1T params · AA 54 · top open-weight tier |
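
The "~4× smaller" comparison can be checked directly from the parameter counts, and the same arithmetic gives a crude size-efficiency ranking for the open-weight models. The sketch below uses total parameter counts as stated in this guide or implied by the model names; the metric (AA points per billion total parameters) and the function name `aa_per_billion` are illustrative assumptions, and total parameters overstate serving cost for MoE models that activate only a fraction per token.

```python
# (total parameters in billions, AA Index), from this guide's tables and model names.
OPEN_WEIGHT = {
    "Gemma 4 31B": (31, 39),
    "gpt-oss-120B": (120, 33),
    "NVIDIA Nemotron 3 Super 120B A12B": (120, 36),
    "Qwen3.5 397B A17B": (397, 45),
    "Kimi K2.6": (1_000, 54),
    "DeepSeek V4 Pro": (1_600, 52),
}


def aa_per_billion(model: str) -> float:
    """AA Index points per billion total parameters (a crude density proxy)."""
    params_b, aa = OPEN_WEIGHT[model]
    return aa / params_b


# Size ratio behind the "~4x smaller" claim.
size_ratio = OPEN_WEIGHT["gpt-oss-120B"][0] / OPEN_WEIGHT["Gemma 4 31B"][0]
print(f"gpt-oss-120B is {size_ratio:.1f}x the size of Gemma 4 31B")  # 3.9x

ranked = sorted(OPEN_WEIGHT, key=aa_per_billion, reverse=True)
for model in ranked:
    print(f"{model:<36} {aa_per_billion(model):.3f} AA points per B params")
```

The ranking rewards small dense models and penalizes very large MoE models, so it complements the absolute AA scores instead of replacing them.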
Model Selection Guide
Industry Impact & Future Trends (2026)
The 2026 LLM landscape is defined by task-specific leadership, score reshuffling under AA Index v4.0, and a widening gap between flagship and cost-efficient tiers:
Coding & Agents
Context & Long-Horizon
Architecture & Policy
Conclusion
The June 2026 LLM landscape shifted materially in eight weeks. Anthropic's Claude Opus 4.8 now leads the AA Index v4.0 at 61. OpenAI's GPT-5.5 replaced GPT-5.4 at roughly double the output price. Google's Gemini 3.5 Flash is GA while Gemini 3.5 Pro is imminent — and the Gemma 4 open-weight line (31B at AA 39) displaced gpt-oss-120B as the composite open-weight benchmark in this table. Open-weight leaders Kimi K2.6 and DeepSeek V4 Pro jumped to AA 52–54 with 1M-token context on the DeepSeek side. xAI's Grok 4.20, Z.ai's GLM-5, and Xiaomi's MiMo-V2.5-Pro round out a top tier that is more globally distributed than ever. Meta's Llama 4 Scout still owns the 10M-token open-weight niche.
Strategic Takeaway (2026)
Looking ahead, AA Index v4.0's ten-eval composite is the new baseline for comparing models; Flash-tier pricing now buys last year's premium capability; and subquadratic attention architectures point toward agents that operate across millions of tokens without quadratic cost blowups. Expect another leaderboard refresh when Gemini 3.5 Pro ships — likely within weeks of this update.