// category

Model Releases

New frontier models, benchmark drops, and side-by-side capability comparisons.

Frontier model releases — GPT, Claude, Gemini, Grok, DeepSeek, Llama, Qwen, and the rest. This category covers announcements, benchmark results, breaking API changes, and the head-to-heads that actually matter for coding workloads, not just marketing blogposts.

138 stories · last 90 days

MiniMax M3 on AI Gateway

MiniMax M3, the company's first model with a 1-million-token context window and native multimodal input, is now accessible through Vercel AI Gateway using the identifier `minimax/minimax-m3` in the AI SDK.

Vercel Blog · 2026-06-01

Claude Opus 4.8: Ultra Code und Dynamic Workflows im Test

Anthropic released Claude Opus 4.8, scoring 69.2% on SWE-bench Pro and 83.4% on OSWorld benchmark. The model introduces Dynamic Workflows for autonomous multi-agent orchestration and an "Ultra Code" mode, at the same price as its predecessor Opus 4.7.

Dev.to - Claude · 2026-05-31

Opus 4.8 Made Claude Smarter. Token Discipline Got Urgent.

Anthropic released Claude Opus 4.8, featuring "dynamic workflows" that can run hundreds of parallel subagents in a single session and an effort control dial. Fast mode pricing is three times lower than the previous version, while headline per-token rates remain unchanged from Opus 4.7.

The New Stack · 2026-05-31

Claude Opus 4.8 is showing up where developers work

Anthropic's Claude Opus 4.8 is now available on AWS Bedrock and as a generally available option in GitHub Copilot. A separate benchmark from IBM Research and Artificial Analysis found frontier AI models scored below 50% on agentic enterprise IT tasks.

Dev.to - Claude · 2026-05-30

Claude Opus 4.8: "a modest but tangible improvement"

Anthropic released Claude Opus 4.8, describing it as "a modest but tangible improvement" over its predecessor, with pricing unchanged at $5 per million input tokens and $25 per million output tokens. The model adds mid-conversation system messages, a January 2026 knowledge cutoff, and is reported...

Simon Willison · 2026-05-29

Opus 4.8 on AI Gateway

Anthropic's Claude Opus 4.8 model is now available on Vercel's AI Gateway, accessible via the identifier `anthropic/claude-opus-4.8` in the AI SDK. The model is designed for multi-step agentic tasks including code refactoring and document drafting.

Vercel Blog · 2026-05-29

Google’s new anything-to-anything AI model is wild

Google released a new multimodal Gemini model capable of processing and generating across multiple formats including text, images, and video. The model allows users to create realistic AI-generated video content with minimal technical effort.

The Verge - AI · 2026-05-24

Qwen 3.7 Max now available on Vercel AI Gateway

Alibaba's Qwen 3.7 Max language model is now available on Vercel AI Gateway, accessible via the model identifier `alibaba/qwen-3.7-max` in Vercel's AI SDK. The model targets coding, office workflow automation, and multi-agent orchestration tasks.

Vercel Blog · 2026-05-22

Grok Build 0.1 now available on Vercel AI Gateway

xAI's Grok Build 0.1, a beta coding model designed for agentic coding tasks, is now available on Vercel AI Gateway. The model, which powers the Grok Build CLI app, is in early access and can be called via the AI SDK using the identifier `xai/grok-build-0.1`.

Vercel Blog · 2026-05-21

Google’s Gemini 3.5 Flash beats the frontier models

Google unveiled Gemini 3.5 Flash at its I/O conference, a model that outperforms its Gemini 3.1 Pro predecessor across most benchmarks and matches or exceeds GPT-5.5 and Anthropic's Opus 4.7 on several tool-use benchmarks, while running at approximately 280 tokens per second and priced at roughly...

The New Stack · 2026-05-20

The 13 biggest announcements at Google I/O 2026

Google announced the Gemini 3.5 model family at its I/O 2026 conference, with Gemini 3.5 Flash available immediately as the new default model for the Gemini app and Search's AI Mode, and Gemini 3.5 Pro coming next month. The event also included new features for Search and Gmail and updates on Pro...

The Verge - AI · 2026-05-20

Gemini 3.5 Flash on AI Gateway

Google's Gemini 3.5 Flash model is now available on Vercel AI Gateway, accessible via the identifier `google/gemini-3.5-flash` in the AI SDK. The model defaults to a medium thinking level and includes improvements to coding, reasoning, and multi-turn coherence compared to prior Flash versions.

Vercel Blog · 2026-05-20

The last six months in LLMs in five minutes

Simon Willison presented a lightning talk at PyCon US 2026 summarizing LLM developments since November 2025, noting the leading model changed hands five times among Anthropic, OpenAI, and Google, cycling through Claude Sonnet 4.5, GPT-5.1, Gemini 3, GPT-5.1 Codex Max, and Claude Opus.

Simon Willison · 2026-05-19

Claude Opus 4.7 vs 4.6

Anthropic's Claude Opus 4.7 uses a revised tokenizer that counts 12–45% more tokens for identical text compared to 4.6, despite unchanged listed prices of $5/1M input and $25/1M output tokens. Claude Code v2.1.142, released May 14, 2026, made Opus 4.7 the default for fast mode.

Dev.to - Claude · 2026-05-18

Claude vs ChatGPT in 2026: Which One Should Devs Actually Use?

A developer comparison of Claude (Anthropic) and ChatGPT (OpenAI) in 2026 found Claude Opus 4.6 scores 80.8% on SWE-bench Verified versus GPT-5.4's roughly 80%, and 91.3% on GPQA Diamond reasoning benchmarks. Both services cost $20/month; Claude was rated stronger for long-context coding and regu...

Dev.to - Claude · 2026-05-14

OpenAI just released its answer to Claude Mythos

OpenAI launched Daybreak, a cybersecurity initiative using its Codex Security AI agent to identify attack paths, validate vulnerabilities, and automate detection of high-risk ones in an organization's code. The release follows Anthropic's announcement of Claude Mythos, a security-focused AI model...

The Verge - AI · 2026-05-12

OpenAI brings GPT-5-level reasoning to its speech models

OpenAI launched three speech models: GPT-Realtime-2, featuring GPT-5-class reasoning and a 128,000-token context window at $32 per million input tokens; GPT-Realtime-Translate, supporting 70+ input languages and 13 output languages at $0.034 per minute; and GPT-Realtime-Whisper, a streaming trans...

The New Stack · 2026-05-08

OpenAI claims ChatGPT’s new default model hallucinates way less

OpenAI released GPT-5.5 Instant as ChatGPT's new default model, claiming it produces 52.5% fewer hallucinated claims than GPT-5.3 Instant on high-stakes prompts in medicine, law, and finance, based on internal evaluations. The company also says it reduced inaccurate claims by 37.3% on conversatio...

The Verge - AI · 2026-05-06

GPT-5.5 Instant System Card

OpenAI published a system card for GPT-5.5 Instant, a model in its GPT-5.5 lineup, documenting the model's safety evaluations and deployment considerations.

OpenAI Blog · 2026-05-06

Granite 4.1 3B SVG Pelican Gallery

IBM released the Granite 4.1 LLM family under an Apache 2.0 license in 3B, 8B, and 30B sizes. Simon Willison tested 21 quantized GGUF variants of the 3B model on SVG image generation and found no consistent relationship between model size and output quality.

Simon Willison · 2026-05-05

April 2026 newsletter

Simon Willison's April 2026 newsletter covers AI model releases including Opus 4.7 and GPT-5.5, both with price increases, as well as ChatGPT Images 2.0 and Claude Mythos, alongside LLM security research topics.

Simon Willison · 2026-05-05

Claude Haiku 4.5: When to Use It Over Sonnet 4.6

Anthropic's Claude Haiku 4.5, released October 2025, scores 73.3% on SWE-bench Verified and is the first Haiku model to support extended thinking and computer use, at $1.00 per million input tokens versus $3.00 for Sonnet 4.6. It produces output at roughly 91 tokens per second with a 200K context...

Dev.to - Claude · 2026-05-04

Benchmark: Llama 3 vs. Claude 3.5 for 2026 Chatbots

A benchmark of Meta's Llama 3 70B Instruct and Anthropic's Claude 3.5 Sonnet across 1.2 million user queries found Llama 3 delivered 22% lower p99 latency and cost $0.0008 per 1k tokens self-hosted versus Claude 3.5's $0.003 via API, while Claude 3.5 scored 98.0% versus 82.3% on HellaSwag reasoni...

Dev.to - Claude · 2026-05-03

Grok 4.3 on AI Gateway

xAI's Grok 4.3 is now available on Vercel's AI Gateway, accessible via the AI SDK using the model identifier `xai/grok-4.3`. The model features a 1M token context window and a December 2025 knowledge cutoff.

Vercel Blog · 2026-05-01

Our evaluation of OpenAI's GPT-5.5 cyber capabilities

The UK AI Security Institute evaluated OpenAI's GPT-5.5 for cyber capabilities, finding its ability to identify security vulnerabilities comparable to Anthropic's Claude Mythos. Unlike Mythos, GPT-5.5 is currently generally available.

Simon Willison · 2026-05-01

Where the goblins came from

OpenAI published a post-mortem on "goblin outputs," unusual personality-driven behaviors observed in GPT-5, detailing their origin, how they spread through the model, and the fixes applied to address them.

OpenAI Blog · 2026-04-30

Introducing talkie: a 13B vintage language model from 1930

Nick Levine, David Duvenaud, and Alec Radford released talkie, a 13B-parameter language model trained on 260B tokens of pre-1931 English text, under an Apache 2.0 license. A second instruction-tuned variant was fine-tuned using synthetic data and Claude Sonnet/Opus 4.6 as judges.

Simon Willison · 2026-04-28

DeepSeek V4 Pro Just Dropped — Here's What Changed for AI Agents

DeepSeek released V4 Pro on April 24, 2026, a mixture-of-experts model with 1.6 trillion total parameters and 49 billion active parameters, supporting a 1-million-token context window under an MIT license. Pricing is set at $1.74 per million input tokens and $3.48 per million output tokens, with ...

Dev.to - AI · 2026-04-26

Quoting Romain Huet

OpenAI's Romain Huet confirmed the company will not release a separate GPT-5.5-Codex model, stating that Codex and the main model were unified into a single system starting with GPT-5.4. GPT-5.5 includes improvements in agentic coding and computer use tasks.

Simon Willison · 2026-04-26

GPT-5.5 prompting guide

OpenAI released GPT-5.5 via its API alongside a prompting guide that advises developers to treat it as a new model family rather than a drop-in replacement for gpt-5.2 or gpt-5.4. The guide recommends starting with a minimal prompt baseline and retuning reasoning effort, verbosity, and output for...

Simon Willison · 2026-04-25

GPT 5.5 on AI Gateway

OpenAI's GPT-5.5 and GPT-5.5 Pro models are now accessible through Vercel's AI Gateway, available via the identifiers `openai/gpt-5.5` and `openai/gpt-5.5-pro` in the AI SDK. Both variants target long-running agentic tasks and are described as more token-efficient than the previous generation.

Vercel Blog · 2026-04-25

DeepSeek V4 - almost on the frontier, a fraction of the price

DeepSeek released two preview models, V4-Pro (1.6T parameters, 49B active) and V4-Flash (284B parameters, 13B active), both with 1M token context windows under MIT license. V4-Pro is priced at $1.74/million input tokens and $3.48/million output tokens; V4-Flash at $0.14 and $0.28 respectively.

Simon Willison · 2026-04-24

Claude Opus 4.7 is Here: Sam Altman Might Be Losing Sleep

Anthropic released Claude Opus 4.7, which scored 64.3% on the SWE-bench Pro coding benchmark, up from 53.4% in the prior generation. The model also adds high-resolution image support up to 2576px and improved visual reasoning scores from 69.1% to 82.1% on the CharXiv benchmark.

Dev.to - Claude · 2026-04-24

GPT-5.5 System Card

OpenAI published the system card for GPT-5.5, a new language model, detailing its safety evaluations and capabilities assessments. System cards are OpenAI's standard documentation accompanying model releases.

OpenAI Blog · 2026-04-24

Introducing GPT-5.5

OpenAI released GPT-5.5, a new language model aimed at tasks including coding, research, and data analysis. The company describes it as faster than previous versions, though no specific benchmark figures were provided.

OpenAI Blog · 2026-04-24

Deepseek V4 on AI Gateway

Vercel added DeepSeek V4 to its AI Gateway, offering two variants: DeepSeek V4 Pro, aimed at agentic coding and mathematical reasoning, and DeepSeek V4 Flash, a smaller model for high-volume, latency-sensitive workloads. Both models support a 1M token context window.

Vercel Blog · 2026-04-24

It's a big one

Simon Willison published a newsletter edition covering GPT-4.5, ChatGPT Images 2.0, and Qwen3 6-27B models, along with 5 blog posts, 8 links, 3 quotes, and a new chapter of his Agentic Engineering Patterns guide.

Simon Willison · 2026-04-24

Qwen3.6-27B: Flagship-Level Coding in a 27B Dense Model

Qwen released Qwen3.6-27B, a 27-billion-parameter dense model (55.6GB) that the company claims surpasses its previous open-source flagship Qwen3.5-397B-A17B on major coding benchmarks. A Q4_K_M quantized version runs at approximately 25 tokens/second locally at 16.8GB.

Simon Willison · 2026-04-23

GPT Image 2 on AI Gateway

OpenAI's GPT Image 2 image model is now available on Vercel's AI Gateway, accessible via the AI SDK with the identifier "openai/gpt-image-2". The model supports up to 2K resolution, multiple aspect ratios, non-English text rendering, and various visual styles.

Vercel Blog · 2026-04-22

Where's the raccoon with the ham radio? (ChatGPT Images 2.0)

OpenAI released ChatGPT Images 2.0 (gpt-image-2), with Sam Altman describing the improvement over gpt-image-1 as equivalent to the jump from GPT-3 to GPT-5. A blogger tested the model against Google's image generation models using a "Where's Waldo"-style prompt to compare output quality.

Simon Willison · 2026-04-22

Claude 4.7 vs 4.6: A Data-Driven Comparison (With Benchmarks)

Anthropic released Claude Opus 4.7 on April 16, 2026, two months after Opus 4.6. The model improved on coding benchmarks (SWE-bench Verified: 87.6% vs 80.8%) and visual acuity (98.5% vs 54.5%), but regressed on long-context retrieval (32.2% vs 78.3%) and logical reasoning (41.0% vs 94.7%), with p...

Dev.to - Claude · 2026-04-21

Kimi K2.6 on AI Gateway

Moonshot AI's Kimi K2.6 model is now available on Vercel AI Gateway, accessible via the model ID `moonshotai/kimi-k2.6` in Vercel's AI SDK. The model targets long-horizon coding tasks across languages including Rust, Go, and Python, as well as front-end, DevOps, and performance optimization work.

Vercel Blog · 2026-04-21

Changes in the system prompt between Claude Opus 4.6 and 4.7

Anthropic updated the Claude.ai system prompt with the release of Claude Opus 4.7 on April 16, 2026, adding Claude in PowerPoint as a new tool, expanding child safety instructions under a dedicated XML tag, and adding guidance instructing the model to attempt tasks before asking clarifying questi...

Simon Willison · 2026-04-19

Anthropic releases a new Opus model amid Mythos Preview buzz

Anthropic released Claude Opus 4.7, its most capable generally available model, focused on software engineering, image analysis, and instruction following. The company noted Opus 4.7 does not advance its capability frontier, as the separately released Mythos Preview — currently limited to partner...

The Verge - AI · 2026-04-17

Gemini 3.1 Flash TTS

Google released Gemini 3.1 Flash, a text-to-speech model. Simon Willison published notes and a tool interface for the new model.

Simon Willison · 2026-04-16

Gemini 3.1 Flash TTS

Google released Gemini 3.1 Flash TTS, a text-to-speech model available via the Gemini API that generates audio from text prompts and supports detailed voice direction including accents, tone, and delivery style.

Simon Willison · 2026-04-16

Seedance 2.0 Video Generation on AI Gateway

ByteDance's Seedance 2.0 video generation model is now available via Vercel's AI Gateway in Standard and Fast variants, supporting text-to-video, image-to-video, and multimodal reference-to-video generation with synchronized audio and video editing capabilities.

Vercel Blog · 2026-04-16

Trusted access for the next era of cyber defense

OpenAI released GPT-5.4-Cyber, a model variant fine-tuned for defensive cybersecurity work, and expanded its Trusted Access for Cyber program allowing identity-verified users reduced-friction access to security tools via government ID verification through Persona.

Simon Willison · 2026-04-15

Gemma 4 audio with MLX

Google's Gemma 4 E2B model can transcribe audio files on macOS using MLX and mlx-vlm via a uv command-line recipe, as demonstrated on a 14-second voice memo that was substantially transcribed with minor errors.

Simon Willison · 2026-04-13

GLM-5.1: Towards Long-Horizon Tasks

Z.ai released GLM-5.1, a 754-billion-parameter open-source AI model available under MIT license. The model demonstrated the ability to generate SVG graphics with CSS animations and to debug and fix code issues when given follow-up instructions.

Simon Willison · 2026-04-08

GLM 5.1 on AI Gateway

Z.ai's GLM 5.1 model is now available on Vercel's AI Gateway, supporting long-horizon autonomous tasks including multi-step coding workflows, conversation, creative writing, and document generation.

Vercel Blog · 2026-04-08

Opus 4.6 Fast Mode available on AI Gateway

Vercel made Fast Mode available for Claude Opus 4.6 on AI Gateway, offering 2.5x faster output token speeds at 6x standard pricing. The experimental feature is designed for human-in-the-loop workflows and coding tasks.

Vercel Blog · 2026-04-08

Anthropic’s Claude Mythos is now available, but not for you

Anthropic released Claude Mythos Preview, a new frontier AI model, to about 50 select partners including Amazon, Apple, and Microsoft through Project Glasswing for defensive cybersecurity work. The model scores 83.1% on vulnerability analysis benchmarks, compared to 66.6% for the previous flagshi...

The New Stack · 2026-04-08

Claude Mythos 5 and the 10T-Parameter AI Shift

Anthropic is testing an early-access model called Claude Mythos after a March 2026 data leak confirmed its existence; the company has not publicly launched it but described it as a significant capability increase, with potential applications in code, cybersecurity, and enterprise operations.

Dev.to - Claude · 2026-04-07

Gemini 2.0 vs GPT-5 vs Claude 4: The Spring 2026 AI Model Rankings

Google Gemini 2.0, OpenAI GPT-5.3, and Anthropic Claude 4.6 were compared across coding and reasoning benchmarks, with GPT-5.3-Codex scoring 56.8% on SWE-Bench Pro, Claude Opus 4.6 at 55%, and Gemini 2.0 at 52%, while Claude led on multi-step reasoning tasks with 95.2% on GSM8K mathematical bench...

Dev.to - Claude · 2026-04-06

claude-opus-4-vs-gpt-5

Claude Opus 4 scores higher on coding benchmarks (76.8% vs 71.8% on SWE-bench) and offers a larger 200K-token context window, while GPT-5 excels at reasoning tasks and costs less ($10-30 per 1M tokens vs $15-75). Both are available at $20/month subscription tier.

Dev.to - Claude · 2026-04-05

Try Grok 4.20 on AI Gateway

xAI's Grok 4.20 model is now available on Vercel's AI Gateway in three variants: Reasoning, Non-Reasoning, and Multi-Agent. Developers can access the models through Vercel's AI SDK with specified model identifiers.

Vercel Blog · 2026-04-03

GLM 5V Turbo on AI Gateway

Vercel AI Gateway now offers GLM 5V Turbo, a multimodal model from Z.ai that converts screenshots and designs into code and can navigate graphical interfaces. The model is accessible via the AI SDK using the identifier zai/glm-5v-turbo.

Vercel Blog · 2026-04-03

Qwen 3.6 Plus on AI Gateway

Alibaba's Qwen 3.6 Plus model is now available on Vercel's AI Gateway. The model features a 1M context window and improved agentic coding capabilities, multimodal reasoning, and tool-calling performance compared to Qwen 3.5 Plus.

Vercel Blog · 2026-04-03

Inception Mercury 2 is live on AI Gateway

Inception's Mercury 2 model is now available on Vercel's AI Gateway. The model is designed for low-latency applications including agentic loops, coding assistants, and voice interfaces.

Vercel Blog · 2026-04-03

MiniMax M2.7 is live on AI Gateway

Vercel made MiniMax M2.7 available on its AI Gateway in standard and high-speed variants. The high-speed version delivers the same performance at 2x cost with ~100 tokens per second latency.

Vercel Blog · 2026-04-03

Gemma 4 on AI Gateway

Google's Gemma 4 models—a 26B mixture-of-experts variant and a 31B dense variant—are now available on Vercel's AI Gateway. Both models support function-calling, vision capabilities, 256K context windows, and 140+ languages.

Vercel Blog · 2026-04-03

llm-gemini 0.30

llm-gemini version 0.30 added three new models: gemini-3.1-flash-lite-preview, gemma-4-26b-a4b-it, and gemma-4-31b-it.

Simon Willison · 2026-04-03