Model Releases

MiniMax M3 on AI Gateway

MiniMax M3, the company's first model with a 1-million-token context window and native multimodal input, is now accessible through Vercel AI Gateway using the identifier `minimax/minimax-m3` in the AI SDK.

Vercel Blog · 2026-06-01

Claude Opus 4.8: Ultra Code und Dynamic Workflows im Test

Anthropic released Claude Opus 4.8, scoring 69.2% on SWE-bench Pro and 83.4% on OSWorld benchmark. The model introduces Dynamic Workflows for autonomous multi-agent orchestration and an "Ultra Code" mode, at the same price as its predecessor Opus 4.7.

Dev.to - Claude · 2026-05-31

Opus 4.8 Made Claude Smarter. Token Discipline Got Urgent.

Anthropic released Claude Opus 4.8, featuring "dynamic workflows" that can run hundreds of parallel subagents in a single session and an effort control dial. Fast mode pricing is three times lower than the previous version, while headline per-token rates remain unchanged from Opus 4.7.

The New Stack · 2026-05-31

Opus 4.8 barely moved the leaderboard. It moved the one number that decides if your agents can be trusted.

Anthropic released Claude Opus 4.8 on May 28, 2026, with pricing unchanged at $5/$25 per million tokens and modest benchmark gains on SWE-bench. The release added a Fast mode at $10/$50 per million tokens and introduced dynamic workflows in research preview for parallel subagent orchestration.

Dev.to - Claude · 2026-05-31

Anthropic Just Dropped Claude Opus 4.8: The Era of 'Dynamic Workflows' is Here 🚀

Anthropic reportedly released Claude Opus 4.8, introducing "Dynamic Workflows" that allow parallel subagent execution in Claude Code, an effort control setting for token usage, and mid-task system prompt injection via the Messages API. Pricing remains at $5 per million input tokens and $25 per mi...

Dev.to - Claude · 2026-05-30

Claude Opus 4.8 is showing up where developers work

Anthropic's Claude Opus 4.8 is now available on AWS Bedrock and as a generally available option in GitHub Copilot. A separate benchmark from IBM Research and Artificial Analysis found frontier AI models scored below 50% on agentic enterprise IT tasks.

Dev.to - Claude · 2026-05-30

9 demos of Gemini Omni and Gemini 3.5 in action

Google announced Gemini Omni and Gemini 3.5 at Google I/O 2026 and released nine demonstration videos showcasing the models' capabilities.

Google AI Blog · 2026-05-30

Claude Opus 4.8 Released: New Features, Performance Improvements & Pricing

Anthropic released Claude Opus 4.8, a new version of its flagship AI model, with reported improvements in coding accuracy, reasoning, and multi-agent workflow performance. The update also introduces parallel agent support in Claude Code and updated API controls, with pricing unchanged from the pr...

Dev.to - Claude · 2026-05-30

AI Daily Digest: May 30, 2026 — Anthropic Hits $965B, Opus 4.8 Dynamic Workflows, SymJack RCE in 6 AI Agents

Anthropic raised a $65B Series H round at a $965B valuation and released Claude Opus 4.8 with parallel subagent orchestration ("Dynamic Workflows"). Security firm Adversa AI disclosed "SymJack," a symlink-hijack RCE vulnerability affecting six AI coding agents including Claude Code, Cursor, and G...

Dev.to - Claude · 2026-05-30

Claude Opus 4.8: Effort Controls, Dynamic Workflows, and an Honest-by-Default Coding Agent

Anthropic released Claude Opus 4.8 on May 28, 2026, 41 days after Opus 4.7, scoring 69.2% on SWE-bench Pro and 96.7% on USAMO 2026. The model adds per-request effort controls, a Dynamic Workflows feature for parallel subagents in Claude Code, and a fast mode priced at $10/$50 per million tokens.

Dev.to - Claude · 2026-05-29

Claude Opus 4.8 Dynamic Workflows: 1,000 Parallel Agents and Fast Mode in Practice

Anthropic released Claude Opus 4.8 with a 1-million-token context window and a 69.2% SWE-bench Pro score. The update introduces Dynamic Workflows, which offloads multi-agent orchestration to JavaScript scripts, supporting up to 16 concurrent agents and 1,000 total agents per run, and adds mid-con...

Dev.to - Claude · 2026-05-29

Claude Opus 4.8 Released: Core Upgrades, Benchmarks, and Migration Guide

Anthropic released Claude Opus 4.8 on May 28, 2026, 41 days after Opus 4.7, with SWE-bench Pro scores rising from 64.3% to 69.2% and Fast Mode pricing cut from $30/$150 to $10/$50 per million tokens. New features include parallel sub-agent Dynamic Workflows and a user-facing effort-level control ...

Dev.to - AI · 2026-05-29

Claude Opus 4.8 is here: effort controls, dynamic workflows, cheaper fast mode, better honesty, less deception

Anthropic released Claude Opus 4.8 at the same price as its predecessor, adding user-adjustable effort controls, a dynamic workflows feature enabling hundreds of parallel coding subagents, and a fast mode priced three times lower than previous versions. The model outperforms GPT-5.5 and Gemini 3....

The New Stack · 2026-05-29

Anthropic releases Opus 4.8 with new ‘dynamic workflow’ tool

Anthropic released Opus 4.8, a new AI model that includes a tool called Dynamic Workflows for coordinating groups of subagents. No pricing or availability details were provided in the report.

TechCrunch - AI · 2026-05-29

Claude Opus 4.8: "a modest but tangible improvement"

Anthropic released Claude Opus 4.8, describing it as "a modest but tangible improvement" over its predecessor, with pricing unchanged at $5 per million input tokens and $25 per million output tokens. The model adds mid-conversation system messages, a January 2026 knowledge cutoff, and is reported...

Simon Willison · 2026-05-29

AI Dev Weekly #12: Opus 4.8 Drops, Anthropic Hits $965B, Chinese AI Goes 99% Cheaper, Microsoft Builds Its Own Coding Model

Anthropic released Claude Opus 4.8 on May 28, priced at $5/$25 per million tokens, scoring 69.2% on SWE-bench Pro and 88.6% on SWE-bench Verified. Separately, Anthropic closed a $65B Series H at a $965B valuation, reporting a $47B annualized revenue run rate.

Dev.to - Claude · 2026-05-29

Opus 4.8 on AI Gateway

Anthropic's Claude Opus 4.8 model is now available on Vercel's AI Gateway, accessible via the identifier `anthropic/claude-opus-4.8` in the AI SDK. The model is designed for multi-step agentic tasks including code refactoring and document drafting.

Vercel Blog · 2026-05-29

Claude’s new model is more ‘honest’ when it messes up

Anthropic released Claude Opus 4.8, a model the company says is approximately four times less likely than its predecessor to make unsupported claims or present uncertain work as confident progress.

The Verge - AI · 2026-05-29

Catch up on 12 major I/O 2026 moments

Google held its I/O 2026 developer conference, announcing Gemini Omni and Gemini 3.5 Flash among at least 12 product updates highlighted in the keynote.

Google AI Blog · 2026-05-29

Benchmarking the Claude Agent SDK on a local LLM: Haiku and Sonnet tier performance

A benchmark of local Qwen models on an RTX 3090 Ti running llama.cpp against Anthropic's Claude API found 4.3x–9.3x latency speedups on Haiku-tier JSON workloads, with quality scores at or near Anthropic-vs-Anthropic ceiling per an Opus LLM-as-judge metric across 100 trials.

Dev.to - Claude · 2026-05-28

Google ranks the best AI for building Android apps, and the winner isn’t Gemini

Google's Android Bench leaderboard, updated May 18, ranked OpenAI's GPT 5.5 as the top AI model for Android app development, ahead of Google's own Gemini 3.1 Pro. Google launched the benchmarking portal in March to help developers and model creators evaluate LLMs for Android development tasks.

The New Stack · 2026-05-27

Two Models Just Hit 90% on Agent Coding. One Cost Less Than a Penny.

A benchmark across 10 agent coding tasks found Qwen3 Coder 30B A3B and DeepSeek Chat both scored 90%, at costs of $0.0004 and $0.0018 per run respectively. Liquid's LFM 2 24B A2B scored 85% at $0.0002, while Qwen3 14B scored 0% due to mandatory thinking-mode timeouts.

Dev.to - AI · 2026-05-26

Comparing OpenAI ChatGPT, Anthropic Claude, and Google Gemini for Web Development Workflows

A comparison of ChatGPT, Claude, and Gemini for web development found ChatGPT favored for rapid code generation and multi-language support, while Claude was preferred for large codebase reviews, architectural planning, and long-context reasoning tasks.

Dev.to - Claude · 2026-05-25

Google’s new anything-to-anything AI model is wild

Google released a new multimodal Gemini model capable of processing and generating across multiple formats including text, images, and video. The model allows users to create realistic AI-generated video content with minimal technical effort.

The Verge - AI · 2026-05-24

Claude 4 Released: Anthropic's AI Model That Can Code for 7 Hours Straight

Anthropic released Claude 4 on May 23, 2025, including Claude Opus 4 and Claude Sonnet 4, at its first developer conference. Claude Opus 4 scored 72.5% on SWE-bench, supports up to roughly seven hours of continuous task execution, and is priced the same as its predecessor; GitHub plans to use Cla...

Dev.to - Claude · 2026-05-23

Claude Opus 4 and Sonnet 4 retire June 15 — your `claude-opus-4-0` alias is about to start failing

Anthropic will retire the Claude Opus 4 and Sonnet 4 models (snapshots dated 20250514) on June 15, 2026, following their deprecation on April 14, 2026. The versioned aliases `claude-opus-4-0` and `claude-sonnet-4-0` will also stop working on that date; Claude 4.1 remains active until at least Aug...

Dev.to - Claude · 2026-05-23

Qwen 3.7 Max now available on Vercel AI Gateway

Alibaba's Qwen 3.7 Max language model is now available on Vercel AI Gateway, accessible via the model identifier `alibaba/qwen-3.7-max` in Vercel's AI SDK. The model targets coding, office workflow automation, and multi-agent orchestration tasks.

Vercel Blog · 2026-05-22

Grok Build 0.1 now available on Vercel AI Gateway

xAI's Grok Build 0.1, a beta coding model designed for agentic coding tasks, is now available on Vercel AI Gateway. The model, which powers the Grok Build CLI app, is in early access and can be called via the AI SDK using the identifier `xai/grok-build-0.1`.

Vercel Blog · 2026-05-21

100 things we announced at I/O 2026

Google announced approximately 100 products and updates at its I/O 2026 developer conference, including Gemini Omni, Google Antigravity, and Universal Cart.

Google AI Blog · 2026-05-21

Google’s Gemini Omni turns images, audio, and text into video — and that’s just the start

Google released Gemini Omni, a multimodal model that accepts text, images, audio, and video inputs to generate and edit videos via conversational prompts. The initial release is Omni Flash.

TechCrunch - AI · 2026-05-20

Gemini 3.5 Flash: more expensive, but Google plan to use it for everything

Google released Gemini 3.5 Flash at Google I/O, deploying it across Search, the Gemini app, and enterprise platforms. Priced at $1.50 per million input tokens and $9 per million output tokens, it costs three times more than the previous Gemini 3 Flash Preview.

Simon Willison · 2026-05-20

Google’s Gemini 3.5 Flash beats the frontier models

Google unveiled Gemini 3.5 Flash at its I/O conference, a model that outperforms its Gemini 3.1 Pro predecessor across most benchmarks and matches or exceeds GPT-5.5 and Anthropic's Opus 4.7 on several tool-use benchmarks, while running at approximately 280 tokens per second and priced at roughly...

The New Stack · 2026-05-20

With Gemini 3.5 Flash, Google bets its next AI wave on agents, not chatbots

Google launched Gemini 3.5 Flash at its annual developer conference, describing it as its most capable coding and agentic AI model to date. The model is designed to autonomously execute complex tasks and build software from scratch.

TechCrunch - AI · 2026-05-20

The 13 biggest announcements at Google I/O 2026

Google announced the Gemini 3.5 model family at its I/O 2026 conference, with Gemini 3.5 Flash available immediately as the new default model for the Gemini app and Search's AI Mode, and Gemini 3.5 Pro coming next month. The event also included new features for Search and Gmail and updates on Pro...

The Verge - AI · 2026-05-20

Gemini 3.5 Flash on AI Gateway

Google's Gemini 3.5 Flash model is now available on Vercel AI Gateway, accessible via the identifier `google/gemini-3.5-flash` in the AI SDK. The model defaults to a medium thinking level and includes improvements to coding, reasoning, and multi-turn coherence compared to prior Flash versions.

Vercel Blog · 2026-05-20

Gemini 3.5: frontier intelligence with action

Google released Gemini 3.5, a new series of AI models, at Google I/O. The release was announced via the Google AI Blog with limited additional details provided.

Google AI Blog · 2026-05-20

I/O 2026: Welcome to the agentic Gemini era

Google announced agentic capabilities for its Gemini AI at the I/O 2026 developer conference, focusing on features that allow the system to complete tasks autonomously on behalf of users.

Google AI Blog · 2026-05-20

The last six months in LLMs in five minutes

Simon Willison presented a lightning talk at PyCon US 2026 summarizing LLM developments since November 2025, noting the leading model changed hands five times among Anthropic, OpenAI, and Google, cycling through Claude Sonnet 4.5, GPT-5.1, Gemini 3, GPT-5.1 Codex Max, and Claude Opus.

Simon Willison · 2026-05-19

Claude Opus 4.7 vs 4.6

Anthropic's Claude Opus 4.7 uses a revised tokenizer that counts 12–45% more tokens for identical text compared to 4.6, despite unchanged listed prices of $5/1M input and $25/1M output tokens. Claude Code v2.1.142, released May 14, 2026, made Opus 4.7 the default for fast mode.

Dev.to - Claude · 2026-05-18

Claude Mythos vs Claude Opus 4.6: what the leaked benchmarks mean for developers

Draft documents accidentally exposed from Anthropic described an unreleased model codenamed "Claude Mythos" (internally "Capybara"), reportedly scoring higher than Claude Opus 4.6 on coding, academic reasoning, and cybersecurity benchmarks, with early access limited to cyber defense organizations...

Dev.to - Claude · 2026-05-16

Databricks brings GPT-5.5 to enterprise agent workflows

Databricks integrated OpenAI's GPT-5.5 model into its enterprise agent workflows. The model achieved a top score on the OfficeQA Pro benchmark prior to the adoption.

OpenAI Blog · 2026-05-16

Claude vs ChatGPT in 2026: Which One Should Devs Actually Use?

A developer comparison of Claude (Anthropic) and ChatGPT (OpenAI) in 2026 found Claude Opus 4.6 scores 80.8% on SWE-bench Verified versus GPT-5.4's roughly 80%, and 91.3% on GPQA Diamond reasoning benchmarks. Both services cost $20/month; Claude was rated stronger for long-context coding and regu...

Dev.to - Claude · 2026-05-14

I tested OpenAI’s three claims about GPT-5.5 Instant, and only one fully held up

A journalist tested GPT-5.5 Instant against GPT-5.2 after OpenAI replaced its default ChatGPT model, finding the conciseness claim did not hold up — GPT-5.2 produced shorter answers in all three test cases — while GPT-5.5 showed reduced hallucinations on factual queries.

The New Stack · 2026-05-14

OpenAI just released its answer to Claude Mythos

OpenAI launched Daybreak, a cybersecurity initiative using its Codex Security AI agent to identify attack paths, validate vulnerabilities, and automate detection of high-risk ones in an organization's code. The release follows Anthropic's announcement of Claude Mythos, a security-focused AI model...

The Verge - AI · 2026-05-12

How a $0.02/Call Model Scored 78.2% on SWE-bench Verified — Beating Every Model on the Leaderboard

Xanther AI reports that adding its context engine to MiniMax M2.5 raised the model's SWE-bench Verified score from 75.8% to 78.2% on 500 real bug instances, at $0.22 per instance compared to $0.75 for Claude 4.5 Opus, which the company says currently leads the official leaderboard at 76.8%.

Dev.to - Claude · 2026-05-09

Claude vs GPT: Which AI Model Fits Your Production Workflow (And Why It Actually Matters)

Claude 3.5 Sonnet offers a 200K token context window at $3/$15 per million input/output tokens, compared to GPT-4 Turbo's 128K context window at $10/$30 per million tokens. The two models differ in context capacity, reasoning depth, and cost structure depending on workload.

Dev.to - Claude · 2026-05-09

OpenAI brings GPT-5-level reasoning to its speech models

OpenAI launched three speech models: GPT-Realtime-2, featuring GPT-5-class reasoning and a 128,000-token context window at $32 per million input tokens; GPT-Realtime-Translate, supporting 70+ input languages and 13 output languages at $0.034 per minute; and GPT-Realtime-Whisper, a streaming trans...

The New Stack · 2026-05-08

Advancing voice intelligence with new models in the API

OpenAI released new realtime voice models in its API with capabilities including speech reasoning, translation, and transcription.

OpenAI Blog · 2026-05-08

Scaling Trusted Access for Cyber with GPT-5.5 and GPT-5.5-Cyber

OpenAI expanded its Trusted Access for Cyber program to include GPT-5.5 and GPT-5.5-Cyber, two models made available to verified cybersecurity defenders for vulnerability research and critical infrastructure protection.

OpenAI Blog · 2026-05-08

The context window has been shattered: Subquadratic debuts a 12-million-token window

Subquadratic, a Miami startup, launched a model with a 12-million-token context window using an architecture called Subquadratic Selective Attention, which the company says scales linearly in compute and memory. The model scores 83 on MRCR v2 and 92.1% on needle-in-a-haystack retrieval at 12 mill...

The New Stack · 2026-05-06

GPT-5.5 Instant: smarter, clearer, and more personalized

OpenAI released GPT-5.5 Instant as an updated default model for ChatGPT, citing improvements in answer accuracy, reduced hallucinations, and expanded personalization controls.

OpenAI Blog · 2026-05-06

OpenAI releases GPT-5.5 Instant, a new default model for ChatGPT

OpenAI released GPT-5.5 Instant as the new default model for ChatGPT, citing reduced hallucinations in law, medicine, and finance while maintaining low latency compared to its predecessor.

TechCrunch - AI · 2026-05-06

OpenAI claims ChatGPT’s new default model hallucinates way less

OpenAI released GPT-5.5 Instant as ChatGPT's new default model, claiming it produces 52.5% fewer hallucinated claims than GPT-5.3 Instant on high-stakes prompts in medicine, law, and finance, based on internal evaluations. The company also says it reduced inaccurate claims by 37.3% on conversatio...

The Verge - AI · 2026-05-06

Benchmark: Claude 3.5 vs. GPT-4o for Cloud Cost Anomaly Detection in AWS and GCP

A benchmark of Claude 3.5 Sonnet and GPT-4o across 12,000 AWS and GCP billing logs found Claude scored higher precision (94.2% vs. 89.7% on GCP anomaly detection) and lower cost per detection ($0.87 vs. $1.12 per 1,000), while GPT-4o processed requests 18% faster at 12.7 RPS versus 10.7 RPS.

Dev.to - Claude · 2026-05-06

OpenAI rolls out GPT-5.5 Instant as default ChatGPT model, promises more accurate responses

OpenAI replaced ChatGPT's default model with GPT-5.5 Instant, a lighter variant of its April flagship model designed for everyday tasks. The new model scores 81.6% on the CharXiv benchmark, up from 75.0% for its predecessor GPT-5.3 Instant, and introduces a "memory sources" feature showing users ...

The New Stack · 2026-05-06

GPT-5.5 Instant System Card

OpenAI published a system card for GPT-5.5 Instant, a model in its GPT-5.5 lineup, documenting the model's safety evaluations and deployment considerations.

OpenAI Blog · 2026-05-06

ChatGPT vs Claude vs Gemini: Which AI Is Actually Worth Using in 2026?

A 2026 comparison of ChatGPT, Claude, and Gemini found ChatGPT favored for general writing and coding, Claude preferred for nuanced editorial content and code explanation, and Gemini rated most reliable for research due to its Google Search integration.

Dev.to - Claude · 2026-05-06

Granite 4.1 3B SVG Pelican Gallery

IBM released the Granite 4.1 LLM family under an Apache 2.0 license in 3B, 8B, and 30B sizes. Simon Willison tested 21 quantized GGUF variants of the 3B model on SVG image generation and found no consistent relationship between model size and output quality.

Simon Willison · 2026-05-05

April 2026 newsletter

Simon Willison's April 2026 newsletter covers AI model releases including Opus 4.7 and GPT-5.5, both with price increases, as well as ChatGPT Images 2.0 and Claude Mythos, alongside LLM security research topics.

Simon Willison · 2026-05-05

Claude Haiku 4.5: When to Use It Over Sonnet 4.6

Anthropic's Claude Haiku 4.5, released October 2025, scores 73.3% on SWE-bench Verified and is the first Haiku model to support extended thinking and computer use, at $1.00 per million input tokens versus $3.00 for Sonnet 4.6. It produces output at roughly 91 tokens per second with a 200K context...

Dev.to - Claude · 2026-05-04

Benchmark: Llama 3 vs. Claude 3.5 for 2026 Chatbots

A benchmark of Meta's Llama 3 70B Instruct and Anthropic's Claude 3.5 Sonnet across 1.2 million user queries found Llama 3 delivered 22% lower p99 latency and cost $0.0008 per 1k tokens self-hosted versus Claude 3.5's $0.003 via API, while Claude 3.5 scored 98.0% versus 82.3% on HellaSwag reasoni...

Dev.to - Claude · 2026-05-03

Kimi K2.6 just beat Claude, GPT-5.5, and Gemini in a coding challenge

Kimi K2.6, an open-weights language model developed by Chinese AI company Moonshot AI, outscored Claude, GPT-5.5, and Gemini on a programming benchmark challenge, according to a report published April 30, 2026.

Hacker News - Best · 2026-05-03

I tested 4 free 70B-class LLM endpoints for real production work — here's what each is actually good at

A developer tested four free LLM endpoints — Cerebras (Qwen 3 235B), Groq (Llama 4 Scout), OpenRouter (Ling-2.6 Flash), and Cloudflare (GPT-OSS 120B) — for production use in an AI website builder. Groq achieved ~500 tokens/second, Cerebras produced the most coherent output for complex prompts, Op...

Dev.to - AI · 2026-05-02

Mistral, Europe’s answer to OpenAI and Anthropic, pushes its coding agents to the cloud

Mistral AI released Mistral Medium 3.5 and added cloud execution to its Vibe coding agent, allowing agents to run tasks in isolated sandboxed environments independent of a developer's local machine. The company also added a "work mode" to its Le Chat interface for running longer, parallel tool-ca...

The New Stack · 2026-05-02

Anthropic's April Double Release — How Opus 4.7 and Managed Agents Change Agent Development

Anthropic released Claude Opus 4.7 on April 16, scoring 64.3% on SWE-bench Pro (+10.9 points over 4.6) and 70% on CursorBench (+12 points), with added image support up to 3.75MP, a beta token-budget parameter for agent loops, and a new "xhigh" reasoning tier. The company also launched a Managed A...

Dev.to - Claude · 2026-05-01

Grok 4.3 on AI Gateway

xAI's Grok 4.3 is now available on Vercel's AI Gateway, accessible via the AI SDK using the model identifier `xai/grok-4.3`. The model features a 1M token context window and a December 2025 knowledge cutoff.

Vercel Blog · 2026-05-01

We Ditched Claude 3.5 for GPT-5: 20% Higher Customer Satisfaction for Our Chatbot

A company migrated its customer support chatbot from Claude 3.5 Sonnet to GPT-5 after a 4-week model evaluation, reporting chatbot CSAT rising from 72% to 92% and tier-1 query resolution improving from 70% to 88% within 30 days of full rollout. Non-English CSAT increased from 61% to 84%, and huma...

Dev.to - Claude · 2026-05-01

Our evaluation of OpenAI's GPT-5.5 cyber capabilities

The UK AI Security Institute evaluated OpenAI's GPT-5.5 for cyber capabilities, finding its ability to identify security vulnerabilities comparable to Anthropic's Claude Mythos. Unlike Mythos, GPT-5.5 is currently generally available.

Simon Willison · 2026-05-01

GPT-5 vs Claude 4.0 vs Llama 4 70B: Code Translation Accuracy from TypeScript 5.6 to Rust 1.83

A 1,200-test benchmark translating TypeScript 5.6 to Rust 1.83 found GPT-5 achieved a 94.2% first-pass compilation rate, compared to 91.7% for Claude 4.0 and 82.3% for Llama 4 70B, though Claude 4.0 led on memory safety checks at 96.4%.

Dev.to - Claude · 2026-04-30

Where the goblins came from

OpenAI published a post-mortem on "goblin outputs," unusual personality-driven behaviors observed in GPT-5, detailing their origin, how they spread through the model, and the fixes applied to address them.

OpenAI Blog · 2026-04-30

Introducing talkie: a 13B vintage language model from 1930

Nick Levine, David Duvenaud, and Alec Radford released talkie, a 13B-parameter language model trained on 260B tokens of pre-1931 English text, under an Apache 2.0 license. A second instruction-tuned variant was fine-tuned using synthetic data and Claude Sonnet/Opus 4.6 as judges.

Simon Willison · 2026-04-28

GPT-5.5 Released: The Return of the King, Crushing Anthropic

OpenAI released GPT-5.5 on April 24, 2026, scoring 82.7% on the Terminal-Bench 2.0 agentic programming benchmark. The model is optimized for NVIDIA GB200 and GB300 hardware and uses fewer tokens than GPT-5.4 for equivalent tasks.

Dev.to - AI · 2026-04-28

GPT-5 vs Claude Sonnet 4: real per-task cost and benchmark comparison for production workloads

GPT-5 costs $1.25/$10 per million input/output tokens versus Claude Sonnet 4.6's $3/$15, giving GPT-5 a 1.6–2x cost advantage on typical workloads. GPT-5 leads on math benchmarks (AIME 2025: 94.6% vs 70.5%), while Sonnet 4.6 offers flat pricing across a 1M-token context window and stronger agenti...

Dev.to - Claude · 2026-04-27

GPT-5.5 vs Claude Opus vs Gemini — real benchmark breakdown

A benchmark comparison of GPT-5.5, Claude Opus, and Gemini 3.1 Pro claims GPT-5.5 leads in agentic workflows, Claude Opus in software engineering, and Gemini 3.1 Pro in cost and multimodal processing, with full data hosted on an external site.

Dev.to - Claude · 2026-04-27

GPT-5.5 Just Dropped. Here's What the Benchmarks Are Hiding.

A Dev.to author claims OpenAI released GPT-5.5 on April 23, 2026, a fully retrained base model scoring 82.7% on Terminal-Bench 2.0 but posting an 86% hallucination rate on AA-Omniscience evals, compared to 36% for Claude Opus 4.7.

Dev.to - Claude · 2026-04-27

DeepSeek V4 Pro Just Dropped — Here's What Changed for AI Agents

DeepSeek released V4 Pro on April 24, 2026, a mixture-of-experts model with 1.6 trillion total parameters and 49 billion active parameters, supporting a 1-million-token context window under an MIT license. Pricing is set at $1.74 per million input tokens and $3.48 per million output tokens, with ...

Dev.to - AI · 2026-04-26

Quoting Romain Huet

OpenAI's Romain Huet confirmed the company will not release a separate GPT-5.5-Codex model, stating that Codex and the main model were unified into a single system starting with GPT-5.4. GPT-5.5 includes improvements in agentic coding and computer use tasks.

Simon Willison · 2026-04-26

GPT-5.5 prompting guide

OpenAI released GPT-5.5 via its API alongside a prompting guide that advises developers to treat it as a new model family rather than a drop-in replacement for gpt-5.2 or gpt-5.4. The guide recommends starting with a minimal prompt baseline and retuning reasoning effort, verbosity, and output for...

Simon Willison · 2026-04-25

GPT-5.5 vs Claude Opus 4.7 vs Gemini 3.1 Pro: The Frontier Model Showdown

A benchmark comparison of GPT-5.5, Claude Opus 4.7, and Gemini 3.1 Pro found split results: GPT-5.5 led Terminal-Bench 2.0 at 82.7%, Opus 4.7 led SWE-Bench Pro at 64.3% and MCP-Atlas tool-use at 77.3%, and Gemini 3.1 Pro led ARC-AGI-2 abstract reasoning at 77.1%.

Dev.to - Claude · 2026-04-25

GPT 5.5 on AI Gateway

OpenAI's GPT-5.5 and GPT-5.5 Pro models are now accessible through Vercel's AI Gateway, available via the identifiers `openai/gpt-5.5` and `openai/gpt-5.5-pro` in the AI SDK. Both variants target long-running agentic tasks and are described as more token-efficient than the previous generation.

Vercel Blog · 2026-04-25

DeepSeek previews new AI model that ‘closes the gap’ with frontier models

DeepSeek previewed new AI models it says outperform DeepSeek V3.2 in efficiency and performance, citing architectural improvements. The company claims the models have nearly matched leading open and closed models on reasoning benchmarks.

TechCrunch - AI · 2026-04-25

“Mythos-like hacking, open to all”: Industry reacts to OpenAI’s GPT 5.5

OpenAI released GPT-5.5 and GPT-5.5 Pro, general-purpose models with claimed improvements in coding and reasoning. Early testing by developer Simon Willison found the model performed below GPT-5.4 on default settings, improving only when given higher reasoning effort at the cost of increased toke...

The New Stack · 2026-04-25

OpenAI launches GPT-5.5, calling it “a new class of intelligence”

OpenAI released GPT-5.5 and GPT-5.5 Pro, available to paying ChatGPT and Codex users, scoring 82.7% on Terminal-Bench 2.0 and 58.6% on SWE-Bench Pro. OpenAI claims the model uses fewer tokens than its predecessor and costs half that of competing frontier coding models.

The New Stack · 2026-04-24

DeepSeek V4 - almost on the frontier, a fraction of the price

DeepSeek released two preview models, V4-Pro (1.6T parameters, 49B active) and V4-Flash (284B parameters, 13B active), both with 1M token context windows under MIT license. V4-Pro is priced at $1.74/million input tokens and $3.48/million output tokens; V4-Flash at $0.14 and $0.28 respectively.

Simon Willison · 2026-04-24

AI shrinkflation: Why Anthropic’s Claude Opus 4.7 may be less capable than the model it replaced

Users of Anthropic's Claude Opus 4.7 have reported that the model performs worse than its predecessor on complex reasoning and coding tasks, with complaints including repetitive self-correction loops and failures on software development projects previously handled by Claude 4.6.

The New Stack · 2026-04-24

Claude Opus 4.7 is Here: Sam Altman Might Be Losing Sleep

Anthropic released Claude Opus 4.7, which scored 64.3% on the SWE-bench Pro coding benchmark, up from 53.4% in the prior generation. The model also adds high-resolution image support up to 2576px and improved visual reasoning scores from 69.1% to 82.1% on the CharXiv benchmark.

Dev.to - Claude · 2026-04-24

OpenAI says its new GPT-5.5 model is more efficient and better at coding

OpenAI released GPT-5.5, a new model following GPT-5.4 from the previous month, describing it as more capable at coding, writing, online research, and multi-step tasks requiring tool use. The company says the model can handle complex, ambiguous tasks with less user oversight.

The Verge - AI · 2026-04-24

GPT-5.5 System Card

OpenAI published the system card for GPT-5.5, a new language model, detailing its safety evaluations and capabilities assessments. System cards are OpenAI's standard documentation accompanying model releases.

OpenAI Blog · 2026-04-24

Introducing GPT-5.5

OpenAI released GPT-5.5, a new language model aimed at tasks including coding, research, and data analysis. The company describes it as faster than previous versions, though no specific benchmark figures were provided.

OpenAI Blog · 2026-04-24

OpenAI releases GPT-5.5, bringing company one step closer to an AI ‘super app’

OpenAI released GPT-5.5, a new model the company says offers increased capabilities across multiple categories. The release is part of OpenAI's broader effort to develop a consolidated AI application platform.

TechCrunch - AI · 2026-04-24

OpenAI’s new Privacy Filter runs on your laptop so PII never hits the cloud

OpenAI released Privacy Filter, a 1.5-billion-parameter token-classification model that detects and redacts eight categories of PII — including names, emails, phone numbers, and API keys — in a single pass over texts up to 128,000 tokens. The model runs locally with 50 million active parameters, ...

The New Stack · 2026-04-24

Deepseek V4 on AI Gateway

Vercel added DeepSeek V4 to its AI Gateway, offering two variants: DeepSeek V4 Pro, aimed at agentic coding and mathematical reasoning, and DeepSeek V4 Flash, a smaller model for high-volume, latency-sensitive workloads. Both models support a 1M token context window.

Vercel Blog · 2026-04-24

It's a big one

Simon Willison published a newsletter edition covering GPT-4.5, ChatGPT Images 2.0, and Qwen3 6-27B models, along with 5 blog posts, 8 links, 3 quotes, and a new chapter of his Agentic Engineering Patterns guide.

Simon Willison · 2026-04-24

China’s DeepSeek previews new AI model a year after jolting US rivals

DeepSeek released a preview of its open-source V4 AI model, claiming it matches closed-source systems from Anthropic, Google, and OpenAI, with notable improvements in coding. The company also highlighted the model's compatibility with domestic Huawei chips.

The Verge - AI · 2026-04-24

Qwen3.6-27B: Flagship-Level Coding in a 27B Dense Model

Qwen released Qwen3.6-27B, a 27-billion-parameter dense model (55.6GB) that the company claims surpasses its previous open-source flagship Qwen3.5-397B-A17B on major coding benchmarks. A Q4_K_M quantized version runs at approximately 25 tokens/second locally at 16.8GB.

Simon Willison · 2026-04-23

Claude Opus 4.7 Prompts: 4 Templates That Actually Use the New Reasoning Model

Anthropic released Claude Opus 4.7 on April 16, 2026, positioning it as their most capable generally available model, with a 200,000-token context window and emphasis on deep reasoning and tool use over its predecessor Sonnet variants.

Dev.to - Claude · 2026-04-23

GPT Image 2 on AI Gateway

OpenAI's GPT Image 2 image model is now available on Vercel's AI Gateway, accessible via the AI SDK with the identifier "openai/gpt-image-2". The model supports up to 2K resolution, multiple aspect ratios, non-English text rendering, and various visual styles.

Vercel Blog · 2026-04-22

With the launch of ChatGPT Images 2.0, OpenAI now “thinks” before it draws

OpenAI launched ChatGPT Images 2.0, available via the API as gpt-image-2, featuring two modes: Instant for fast output and Thinking, which reasons through image structure before generating up to eight images per prompt. Advanced thinking capabilities are limited to Plus, Pro, and Business subscri...

The New Stack · 2026-04-22

Where's the raccoon with the ham radio? (ChatGPT Images 2.0)

OpenAI released ChatGPT Images 2.0 (gpt-image-2), with Sam Altman describing the improvement over gpt-image-1 as equivalent to the jump from GPT-3 to GPT-5. A blogger tested the model against Google's image generation models using a "Where's Waldo"-style prompt to compare output quality.

Simon Willison · 2026-04-22

OpenAI’s updated image generator can now pull information from the web

OpenAI released ChatGPT Images 2.0, powered by its GPT Image 2 model, which can search the web to inform image generation from a single prompt. The update also improves instruction-following, detail preservation, and text rendering, and is available to Plus, Pro, Business, and Enterprise subscrib...

The Verge - AI · 2026-04-22

Claude 4.7 vs 4.6: A Data-Driven Comparison (With Benchmarks)

Anthropic released Claude Opus 4.7 on April 16, 2026, two months after Opus 4.6. The model improved on coding benchmarks (SWE-bench Verified: 87.6% vs 80.8%) and visual acuity (98.5% vs 54.5%), but regressed on long-context retrieval (32.2% vs 78.3%) and logical reasoning (41.0% vs 94.7%), with p...

Dev.to - Claude · 2026-04-21

Kimi K2.6 on AI Gateway

Moonshot AI's Kimi K2.6 model is now available on Vercel AI Gateway, accessible via the model ID `moonshotai/kimi-k2.6` in Vercel's AI SDK. The model targets long-horizon coding tasks across languages including Rust, Go, and Python, as well as front-end, DevOps, and performance optimization work.

Vercel Blog · 2026-04-21

Changes in the system prompt between Claude Opus 4.6 and 4.7

Simon Willison documented changes in the system prompt between Anthropic's Claude Opus versions 4.6 and 4.7, comparing the instructions baked into the two model releases.

Hacker News - Best · 2026-04-20

Changes in the system prompt between Claude Opus 4.6 and 4.7

Anthropic updated the Claude.ai system prompt with the release of Claude Opus 4.7 on April 16, 2026, adding Claude in PowerPoint as a new tool, expanding child safety instructions under a dedicated XML tag, and adding guidance instructing the model to attempt tasks before asking clarifying questi...

Simon Willison · 2026-04-19

Anthropic Releases Claude Opus 4.7: Key Changes and Migration Guide for Developers

Anthropic released Claude Opus 4.7, a new version of its Claude AI model, along with documentation outlining key changes and a migration guide for developers transitioning from earlier versions.

Dev.to - Claude · 2026-04-17

Anthropic releases a new Opus model amid Mythos Preview buzz

Anthropic released Claude Opus 4.7, its most capable generally available model, focused on software engineering, image analysis, and instruction following. The company noted Opus 4.7 does not advance its capability frontier, as the separately released Mythos Preview — currently limited to partner...

The Verge - AI · 2026-04-17

Claude Opus 4.7 — What Actually Changed and Why It Matters

A Dev.to author published an analysis of Anthropic's Claude Opus 4.7 model, examining changes from previous versions. The article's actual technical content was not available in the retrieved text.

Dev.to - Claude · 2026-04-17

Claude Opus 4.7 Just Dropped: 87.6% SWE-bench, Breaking API Changes, and the Hidden Cost Increase

Anthropic released Claude Opus 4.7, which scores 87.6% on the SWE-bench coding benchmark. The release includes breaking API changes and a price increase compared to prior versions.

Dev.to - Claude · 2026-04-17

Claude Opus 4.7 arrives with better vision, memory, and instruction-following

Anthropic released Claude Opus 4.7, an updated version of its AI model with improvements to vision capabilities, memory, and instruction-following performance.

The New Stack · 2026-04-17

Gemini 3.1 Flash TTS

Google released Gemini 3.1 Flash, a text-to-speech model. Simon Willison published notes and a tool interface for the new model.

Simon Willison · 2026-04-16

Gemini 3.1 Flash TTS

Google released Gemini 3.1 Flash TTS, a text-to-speech model available via the Gemini API that generates audio from text prompts and supports detailed voice direction including accents, tone, and delivery style.

Simon Willison · 2026-04-16

Gemini 3.1 Flash TTS: the next generation of expressive AI speech

Google released Gemini 3.1 Flash TTS, a text-to-speech system, across its products.

Google AI Blog · 2026-04-16

Seedance 2.0 Video Generation on AI Gateway

ByteDance's Seedance 2.0 video generation model is now available via Vercel's AI Gateway in Standard and Fast variants, supporting text-to-video, image-to-video, and multimodal reference-to-video generation with synchronized audio and video editing capabilities.

Vercel Blog · 2026-04-16

Claude Mythos Preview completes full cyberattack simulation for the first time

The UK's AI Security Institute evaluated Anthropic's Claude Mythos Preview and found it autonomously completed a 32-step corporate network takeover simulation, marking the first AI model to execute such a full multi-stage cyberattack simulation. The model showed improved performance in capture-th...

The New Stack · 2026-04-15

Kumo’s new foundation model replaces months of data science engineering with plain-English queries

Kumo announced KumoRFM-2, a foundation model for relational databases that accepts plain-English queries and outperformed supervised machine learning models by 5% on Stanford's RelBench benchmark and beats AWS AutoGluon on enterprise benchmarks, scaling to over 500 billion rows of data.

The New Stack · 2026-04-15

Trusted access for the next era of cyber defense

OpenAI released GPT-5.4-Cyber, a model variant fine-tuned for defensive cybersecurity work, and expanded its Trusted Access for Cyber program allowing identity-verified users reduced-friction access to security tools via government ID verification through Persona.

Simon Willison · 2026-04-15

Gemma 4 audio with MLX

Google's Gemma 4 E2B model can transcribe audio files on macOS using MLX and mlx-vlm via a uv command-line recipe, as demonstrated on a 14-second voice memo that was substantially transcribed with minor errors.

Simon Willison · 2026-04-13

Google’s Gemini AI can answer your questions with 3D models and simulations

Google added a feature to Gemini allowing it to generate interactive 3D models and simulations in response to user questions, with controls to rotate models, adjust parameters via sliders, and modify simulations in real-time.

The Verge - AI · 2026-04-10

Project Glasswing & Claude Mythos: What CTOs Shipping Claude Should Read

Anthropic announced Project Glasswing with twelve launch partners and revealed Claude Mythos, a frontier model it is not yet shipping to production, along with a system card detailing security findings.

Dev.to - Claude · 2026-04-09

Meta's new model is Muse Spark, and meta.ai chat has some interesting tools

Meta released Muse Spark, a hosted AI model available on meta.ai in "Instant" and "Thinking" modes, with benchmarks competitive with Opus 4.6, Gemini 3.1 Pro, and GPT 5.4 on selected tests. The model has access to 16 tools including web search, Meta platform content search, and rendering capabili...

Simon Willison · 2026-04-09

System Card: Claude Mythos Preview [pdf]

Anthropic released a system card for Claude Mythos Preview, a new model variant, documenting its capabilities and characteristics.

Hacker News - Best · 2026-04-08

GLM-5.1: Towards Long-Horizon Tasks

Z.ai released GLM-5.1, a 754-billion-parameter open-source AI model available under MIT license. The model demonstrated the ability to generate SVG graphics with CSS animations and to debug and fix code issues when given follow-up instructions.

Simon Willison · 2026-04-08

Assessing Claude Mythos Preview's cybersecurity capabilities

Anthropic released Claude Mythos Preview with accompanying security documentation and initiated Project Glasswing to address software security in the AI era.

Hacker News - Best · 2026-04-08

GLM 5.1 on AI Gateway

Z.ai's GLM 5.1 model is now available on Vercel's AI Gateway, supporting long-horizon autonomous tasks including multi-step coding workflows, conversation, creative writing, and document generation.

Vercel Blog · 2026-04-08

Anthropic debuts preview of powerful new AI model Mythos in new cybersecurity initiative

Anthropic released a preview of its Mythos AI model for use by select companies in defensive cybersecurity work.

TechCrunch - AI · 2026-04-08

Opus 4.6 Fast Mode available on AI Gateway

Vercel made Fast Mode available for Claude Opus 4.6 on AI Gateway, offering 2.5x faster output token speeds at 6x standard pricing. The experimental feature is designed for human-in-the-loop workflows and coding tasks.

Vercel Blog · 2026-04-08

Anthropic’s Claude Mythos is now available, but not for you

Anthropic released Claude Mythos Preview, a new frontier AI model, to about 50 select partners including Amazon, Apple, and Microsoft through Project Glasswing for defensive cybersecurity work. The model scores 83.1% on vulnerability analysis benchmarks, compared to 66.6% for the previous flagshi...

The New Stack · 2026-04-08

Claude Mythos 5 and the 10T-Parameter AI Shift

Anthropic is testing an early-access model called Claude Mythos after a March 2026 data leak confirmed its existence; the company has not publicly launched it but described it as a significant capability increase, with potential applications in code, cybersecurity, and enterprise operations.

Dev.to - Claude · 2026-04-07

Gemini 2.0 vs GPT-5 vs Claude 4: The Spring 2026 AI Model Rankings

Google Gemini 2.0, OpenAI GPT-5.3, and Anthropic Claude 4.6 were compared across coding and reasoning benchmarks, with GPT-5.3-Codex scoring 56.8% on SWE-Bench Pro, Claude Opus 4.6 at 55%, and Gemini 2.0 at 52%, while Claude led on multi-step reasoning tasks with 95.2% on GSM8K mathematical bench...

Dev.to - Claude · 2026-04-06

claude-opus-4-vs-gpt-5

Claude Opus 4 scores higher on coding benchmarks (76.8% vs 71.8% on SWE-bench) and offers a larger 200K-token context window, while GPT-5 excels at reasoning tasks and costs less ($10-30 per 1M tokens vs $15-75). Both are available at $20/month subscription tier.

Dev.to - Claude · 2026-04-05

Other categories