// category

Model Releases

New frontier models, benchmark drops, and side-by-side capability comparisons.

Frontier model releases — GPT, Claude, Gemini, Grok, DeepSeek, Llama, Qwen, and the rest. This category covers announcements, benchmark results, breaking API changes, and the head-to-heads that actually matter for coding workloads, not just marketing blogposts.

94 stories · last 90 days

OpenAI brings GPT-5-level reasoning to its speech models

OpenAI launched three speech models: GPT-Realtime-2, featuring GPT-5-class reasoning and a 128,000-token context window at $32 per million input tokens; GPT-Realtime-Translate, supporting 70+ input languages and 13 output languages at $0.034 per minute; and GPT-Realtime-Whisper, a streaming trans...

The New Stack · 2026-05-08

OpenAI claims ChatGPT’s new default model hallucinates way less

OpenAI released GPT-5.5 Instant as ChatGPT's new default model, claiming it produces 52.5% fewer hallucinated claims than GPT-5.3 Instant on high-stakes prompts in medicine, law, and finance, based on internal evaluations. The company also says it reduced inaccurate claims by 37.3% on conversatio...

The Verge - AI · 2026-05-06

GPT-5.5 Instant System Card

OpenAI published a system card for GPT-5.5 Instant, a model in its GPT-5.5 lineup, documenting the model's safety evaluations and deployment considerations.

OpenAI Blog · 2026-05-06

Granite 4.1 3B SVG Pelican Gallery

IBM released the Granite 4.1 LLM family under an Apache 2.0 license in 3B, 8B, and 30B sizes. Simon Willison tested 21 quantized GGUF variants of the 3B model on SVG image generation and found no consistent relationship between model size and output quality.

Simon Willison · 2026-05-05

April 2026 newsletter

Simon Willison's April 2026 newsletter covers AI model releases including Opus 4.7 and GPT-5.5, both with price increases, as well as ChatGPT Images 2.0 and Claude Mythos, alongside LLM security research topics.

Simon Willison · 2026-05-05

Claude Haiku 4.5: When to Use It Over Sonnet 4.6

Anthropic's Claude Haiku 4.5, released October 2025, scores 73.3% on SWE-bench Verified and is the first Haiku model to support extended thinking and computer use, at $1.00 per million input tokens versus $3.00 for Sonnet 4.6. It produces output at roughly 91 tokens per second with a 200K context...

Dev.to - Claude · 2026-05-04

Benchmark: Llama 3 vs. Claude 3.5 for 2026 Chatbots

A benchmark of Meta's Llama 3 70B Instruct and Anthropic's Claude 3.5 Sonnet across 1.2 million user queries found Llama 3 delivered 22% lower p99 latency and cost $0.0008 per 1k tokens self-hosted versus Claude 3.5's $0.003 via API, while Claude 3.5 scored 98.0% versus 82.3% on HellaSwag reasoni...

Dev.to - Claude · 2026-05-03

Grok 4.3 on AI Gateway

xAI's Grok 4.3 is now available on Vercel's AI Gateway, accessible via the AI SDK using the model identifier `xai/grok-4.3`. The model features a 1M token context window and a December 2025 knowledge cutoff.

Vercel Blog · 2026-05-01

Our evaluation of OpenAI's GPT-5.5 cyber capabilities

The UK AI Security Institute evaluated OpenAI's GPT-5.5 for cyber capabilities, finding its ability to identify security vulnerabilities comparable to Anthropic's Claude Mythos. Unlike Mythos, GPT-5.5 is currently generally available.

Simon Willison · 2026-05-01

Where the goblins came from

OpenAI published a post-mortem on "goblin outputs," unusual personality-driven behaviors observed in GPT-5, detailing their origin, how they spread through the model, and the fixes applied to address them.

OpenAI Blog · 2026-04-30

Introducing talkie: a 13B vintage language model from 1930

Nick Levine, David Duvenaud, and Alec Radford released talkie, a 13B-parameter language model trained on 260B tokens of pre-1931 English text, under an Apache 2.0 license. A second instruction-tuned variant was fine-tuned using synthetic data and Claude Sonnet/Opus 4.6 as judges.

Simon Willison · 2026-04-28

DeepSeek V4 Pro Just Dropped — Here's What Changed for AI Agents

DeepSeek released V4 Pro on April 24, 2026, a mixture-of-experts model with 1.6 trillion total parameters and 49 billion active parameters, supporting a 1-million-token context window under an MIT license. Pricing is set at $1.74 per million input tokens and $3.48 per million output tokens, with ...

Dev.to - AI · 2026-04-26

Quoting Romain Huet

OpenAI's Romain Huet confirmed the company will not release a separate GPT-5.5-Codex model, stating that Codex and the main model were unified into a single system starting with GPT-5.4. GPT-5.5 includes improvements in agentic coding and computer use tasks.

Simon Willison · 2026-04-26

GPT-5.5 prompting guide

OpenAI released GPT-5.5 via its API alongside a prompting guide that advises developers to treat it as a new model family rather than a drop-in replacement for gpt-5.2 or gpt-5.4. The guide recommends starting with a minimal prompt baseline and retuning reasoning effort, verbosity, and output for...

Simon Willison · 2026-04-25

GPT 5.5 on AI Gateway

OpenAI's GPT-5.5 and GPT-5.5 Pro models are now accessible through Vercel's AI Gateway, available via the identifiers `openai/gpt-5.5` and `openai/gpt-5.5-pro` in the AI SDK. Both variants target long-running agentic tasks and are described as more token-efficient than the previous generation.

Vercel Blog · 2026-04-25

DeepSeek V4 - almost on the frontier, a fraction of the price

DeepSeek released two preview models, V4-Pro (1.6T parameters, 49B active) and V4-Flash (284B parameters, 13B active), both with 1M token context windows under MIT license. V4-Pro is priced at $1.74/million input tokens and $3.48/million output tokens; V4-Flash at $0.14 and $0.28 respectively.

Simon Willison · 2026-04-24

Claude Opus 4.7 is Here: Sam Altman Might Be Losing Sleep

Anthropic released Claude Opus 4.7, which scored 64.3% on the SWE-bench Pro coding benchmark, up from 53.4% in the prior generation. The model also adds high-resolution image support up to 2576px and improved visual reasoning scores from 69.1% to 82.1% on the CharXiv benchmark.

Dev.to - Claude · 2026-04-24

GPT-5.5 System Card

OpenAI published the system card for GPT-5.5, a new language model, detailing its safety evaluations and capabilities assessments. System cards are OpenAI's standard documentation accompanying model releases.

OpenAI Blog · 2026-04-24

Introducing GPT-5.5

OpenAI released GPT-5.5, a new language model aimed at tasks including coding, research, and data analysis. The company describes it as faster than previous versions, though no specific benchmark figures were provided.

OpenAI Blog · 2026-04-24

Deepseek V4 on AI Gateway

Vercel added DeepSeek V4 to its AI Gateway, offering two variants: DeepSeek V4 Pro, aimed at agentic coding and mathematical reasoning, and DeepSeek V4 Flash, a smaller model for high-volume, latency-sensitive workloads. Both models support a 1M token context window.

Vercel Blog · 2026-04-24

It's a big one

Simon Willison published a newsletter edition covering GPT-4.5, ChatGPT Images 2.0, and Qwen3 6-27B models, along with 5 blog posts, 8 links, 3 quotes, and a new chapter of his Agentic Engineering Patterns guide.

Simon Willison · 2026-04-24

Qwen3.6-27B: Flagship-Level Coding in a 27B Dense Model

Qwen released Qwen3.6-27B, a 27-billion-parameter dense model (55.6GB) that the company claims surpasses its previous open-source flagship Qwen3.5-397B-A17B on major coding benchmarks. A Q4_K_M quantized version runs at approximately 25 tokens/second locally at 16.8GB.

Simon Willison · 2026-04-23

GPT Image 2 on AI Gateway

OpenAI's GPT Image 2 image model is now available on Vercel's AI Gateway, accessible via the AI SDK with the identifier "openai/gpt-image-2". The model supports up to 2K resolution, multiple aspect ratios, non-English text rendering, and various visual styles.

Vercel Blog · 2026-04-22

Where's the raccoon with the ham radio? (ChatGPT Images 2.0)

OpenAI released ChatGPT Images 2.0 (gpt-image-2), with Sam Altman describing the improvement over gpt-image-1 as equivalent to the jump from GPT-3 to GPT-5. A blogger tested the model against Google's image generation models using a "Where's Waldo"-style prompt to compare output quality.

Simon Willison · 2026-04-22

Claude 4.7 vs 4.6: A Data-Driven Comparison (With Benchmarks)

Anthropic released Claude Opus 4.7 on April 16, 2026, two months after Opus 4.6. The model improved on coding benchmarks (SWE-bench Verified: 87.6% vs 80.8%) and visual acuity (98.5% vs 54.5%), but regressed on long-context retrieval (32.2% vs 78.3%) and logical reasoning (41.0% vs 94.7%), with p...

Dev.to - Claude · 2026-04-21

Kimi K2.6 on AI Gateway

Moonshot AI's Kimi K2.6 model is now available on Vercel AI Gateway, accessible via the model ID `moonshotai/kimi-k2.6` in Vercel's AI SDK. The model targets long-horizon coding tasks across languages including Rust, Go, and Python, as well as front-end, DevOps, and performance optimization work.

Vercel Blog · 2026-04-21

Changes in the system prompt between Claude Opus 4.6 and 4.7

Anthropic updated the Claude.ai system prompt with the release of Claude Opus 4.7 on April 16, 2026, adding Claude in PowerPoint as a new tool, expanding child safety instructions under a dedicated XML tag, and adding guidance instructing the model to attempt tasks before asking clarifying questi...

Simon Willison · 2026-04-19

Anthropic releases a new Opus model amid Mythos Preview buzz

Anthropic released Claude Opus 4.7, its most capable generally available model, focused on software engineering, image analysis, and instruction following. The company noted Opus 4.7 does not advance its capability frontier, as the separately released Mythos Preview — currently limited to partner...

The Verge - AI · 2026-04-17

Gemini 3.1 Flash TTS

Google released Gemini 3.1 Flash, a text-to-speech model. Simon Willison published notes and a tool interface for the new model.

Simon Willison · 2026-04-16

Gemini 3.1 Flash TTS

Google released Gemini 3.1 Flash TTS, a text-to-speech model available via the Gemini API that generates audio from text prompts and supports detailed voice direction including accents, tone, and delivery style.

Simon Willison · 2026-04-16

Seedance 2.0 Video Generation on AI Gateway

ByteDance's Seedance 2.0 video generation model is now available via Vercel's AI Gateway in Standard and Fast variants, supporting text-to-video, image-to-video, and multimodal reference-to-video generation with synchronized audio and video editing capabilities.

Vercel Blog · 2026-04-16

Trusted access for the next era of cyber defense

OpenAI released GPT-5.4-Cyber, a model variant fine-tuned for defensive cybersecurity work, and expanded its Trusted Access for Cyber program allowing identity-verified users reduced-friction access to security tools via government ID verification through Persona.

Simon Willison · 2026-04-15

Gemma 4 audio with MLX

Google's Gemma 4 E2B model can transcribe audio files on macOS using MLX and mlx-vlm via a uv command-line recipe, as demonstrated on a 14-second voice memo that was substantially transcribed with minor errors.

Simon Willison · 2026-04-13

GLM-5.1: Towards Long-Horizon Tasks

Z.ai released GLM-5.1, a 754-billion-parameter open-source AI model available under MIT license. The model demonstrated the ability to generate SVG graphics with CSS animations and to debug and fix code issues when given follow-up instructions.

Simon Willison · 2026-04-08

GLM 5.1 on AI Gateway

Z.ai's GLM 5.1 model is now available on Vercel's AI Gateway, supporting long-horizon autonomous tasks including multi-step coding workflows, conversation, creative writing, and document generation.

Vercel Blog · 2026-04-08

Opus 4.6 Fast Mode available on AI Gateway

Vercel made Fast Mode available for Claude Opus 4.6 on AI Gateway, offering 2.5x faster output token speeds at 6x standard pricing. The experimental feature is designed for human-in-the-loop workflows and coding tasks.

Vercel Blog · 2026-04-08

Anthropic’s Claude Mythos is now available, but not for you

Anthropic released Claude Mythos Preview, a new frontier AI model, to about 50 select partners including Amazon, Apple, and Microsoft through Project Glasswing for defensive cybersecurity work. The model scores 83.1% on vulnerability analysis benchmarks, compared to 66.6% for the previous flagshi...

The New Stack · 2026-04-08

Claude Mythos 5 and the 10T-Parameter AI Shift

Anthropic is testing an early-access model called Claude Mythos after a March 2026 data leak confirmed its existence; the company has not publicly launched it but described it as a significant capability increase, with potential applications in code, cybersecurity, and enterprise operations.

Dev.to - Claude · 2026-04-07

Gemini 2.0 vs GPT-5 vs Claude 4: The Spring 2026 AI Model Rankings

Google Gemini 2.0, OpenAI GPT-5.3, and Anthropic Claude 4.6 were compared across coding and reasoning benchmarks, with GPT-5.3-Codex scoring 56.8% on SWE-Bench Pro, Claude Opus 4.6 at 55%, and Gemini 2.0 at 52%, while Claude led on multi-step reasoning tasks with 95.2% on GSM8K mathematical bench...

Dev.to - Claude · 2026-04-06

claude-opus-4-vs-gpt-5

Claude Opus 4 scores higher on coding benchmarks (76.8% vs 71.8% on SWE-bench) and offers a larger 200K-token context window, while GPT-5 excels at reasoning tasks and costs less ($10-30 per 1M tokens vs $15-75). Both are available at $20/month subscription tier.

Dev.to - Claude · 2026-04-05

Try Grok 4.20 on AI Gateway

xAI's Grok 4.20 model is now available on Vercel's AI Gateway in three variants: Reasoning, Non-Reasoning, and Multi-Agent. Developers can access the models through Vercel's AI SDK with specified model identifiers.

Vercel Blog · 2026-04-03

GLM 5V Turbo on AI Gateway

Vercel AI Gateway now offers GLM 5V Turbo, a multimodal model from Z.ai that converts screenshots and designs into code and can navigate graphical interfaces. The model is accessible via the AI SDK using the identifier zai/glm-5v-turbo.

Vercel Blog · 2026-04-03

Qwen 3.6 Plus on AI Gateway

Alibaba's Qwen 3.6 Plus model is now available on Vercel's AI Gateway. The model features a 1M context window and improved agentic coding capabilities, multimodal reasoning, and tool-calling performance compared to Qwen 3.5 Plus.

Vercel Blog · 2026-04-03

Inception Mercury 2 is live on AI Gateway

Inception's Mercury 2 model is now available on Vercel's AI Gateway. The model is designed for low-latency applications including agentic loops, coding assistants, and voice interfaces.

Vercel Blog · 2026-04-03

MiniMax M2.7 is live on AI Gateway

Vercel made MiniMax M2.7 available on its AI Gateway in standard and high-speed variants. The high-speed version delivers the same performance at 2x cost with ~100 tokens per second latency.

Vercel Blog · 2026-04-03

Gemma 4 on AI Gateway

Google's Gemma 4 models—a 26B mixture-of-experts variant and a 31B dense variant—are now available on Vercel's AI Gateway. Both models support function-calling, vision capabilities, 256K context windows, and 140+ languages.

Vercel Blog · 2026-04-03

llm-gemini 0.30

llm-gemini version 0.30 added three new models: gemini-3.1-flash-lite-preview, gemma-4-26b-a4b-it, and gemma-4-31b-it.

Simon Willison · 2026-04-03