New frontier models, benchmark drops, and side-by-side capability comparisons.
Xanther AI reports that adding its context engine to MiniMax M2.5 raised the model's SWE-bench Verified score from 75.8% to 78.2% on 500 real bug instances, at $0.22 per instance compared to $0.75 for Claude 4.5 Opus, which the company says currently leads the official leaderboard at 76.8%.
Claude 3.5 Sonnet offers a 200K token context window at $3/$15 per million input/output tokens, compared to GPT-4 Turbo's 128K context window at $10/$30 per million tokens. The two models differ in context capacity, reasoning depth, and cost structure depending on workload.
OpenAI launched three speech models: GPT-Realtime-2, featuring GPT-5-class reasoning and a 128,000-token context window at $32 per million input tokens; GPT-Realtime-Translate, supporting 70+ input languages and 13 output languages at $0.034 per minute; and GPT-Realtime-Whisper, a streaming trans...
OpenAI released new realtime voice models in its API with capabilities including speech reasoning, translation, and transcription.
OpenAI expanded its Trusted Access for Cyber program to include GPT-5.5 and GPT-5.5-Cyber, two models made available to verified cybersecurity defenders for vulnerability research and critical infrastructure protection.
Subquadratic, a Miami startup, launched a model with a 12-million-token context window using an architecture called Subquadratic Selective Attention, which the company says scales linearly in compute and memory. The model scores 83 on MRCR v2 and 92.1% on needle-in-a-haystack retrieval at 12 mill...
OpenAI released GPT-5.5 Instant as an updated default model for ChatGPT, citing improvements in answer accuracy, reduced hallucinations, and expanded personalization controls.
OpenAI released GPT-5.5 Instant as the new default model for ChatGPT, citing reduced hallucinations in law, medicine, and finance while maintaining low latency compared to its predecessor.
OpenAI released GPT-5.5 Instant as ChatGPT's new default model, claiming it produces 52.5% fewer hallucinated claims than GPT-5.3 Instant on high-stakes prompts in medicine, law, and finance, based on internal evaluations. The company also says it reduced inaccurate claims by 37.3% on conversatio...
A benchmark of Claude 3.5 Sonnet and GPT-4o across 12,000 AWS and GCP billing logs found Claude scored higher precision (94.2% vs. 89.7% on GCP anomaly detection) and lower cost per detection ($0.87 vs. $1.12 per 1,000), while GPT-4o processed requests 18% faster at 12.7 RPS versus 10.7 RPS.
OpenAI replaced ChatGPT's default model with GPT-5.5 Instant, a lighter variant of its April flagship model designed for everyday tasks. The new model scores 81.6% on the CharXiv benchmark, up from 75.0% for its predecessor GPT-5.3 Instant, and introduces a "memory sources" feature showing users ...
OpenAI published a system card for GPT-5.5 Instant, a model in its GPT-5.5 lineup, documenting the model's safety evaluations and deployment considerations.
A 2026 comparison of ChatGPT, Claude, and Gemini found ChatGPT favored for general writing and coding, Claude preferred for nuanced editorial content and code explanation, and Gemini rated most reliable for research due to its Google Search integration.
IBM released the Granite 4.1 LLM family under an Apache 2.0 license in 3B, 8B, and 30B sizes. Simon Willison tested 21 quantized GGUF variants of the 3B model on SVG image generation and found no consistent relationship between model size and output quality.
Simon Willison's April 2026 newsletter covers AI model releases including Opus 4.7 and GPT-5.5, both with price increases, as well as ChatGPT Images 2.0 and Claude Mythos, alongside LLM security research topics.
Anthropic's Claude Haiku 4.5, released October 2025, scores 73.3% on SWE-bench Verified and is the first Haiku model to support extended thinking and computer use, at $1.00 per million input tokens versus $3.00 for Sonnet 4.6. It produces output at roughly 91 tokens per second with a 200K context...
A benchmark of Meta's Llama 3 70B Instruct and Anthropic's Claude 3.5 Sonnet across 1.2 million user queries found Llama 3 delivered 22% lower p99 latency and cost $0.0008 per 1k tokens self-hosted versus Claude 3.5's $0.003 via API, while Claude 3.5 scored 98.0% versus 82.3% on HellaSwag reasoni...
Kimi K2.6, an open-weights language model developed by Chinese AI company Moonshot AI, outscored Claude, GPT-5.5, and Gemini on a programming benchmark challenge, according to a report published April 30, 2026.
A developer tested four free LLM endpoints — Cerebras (Qwen 3 235B), Groq (Llama 4 Scout), OpenRouter (Ling-2.6 Flash), and Cloudflare (GPT-OSS 120B) — for production use in an AI website builder. Groq achieved ~500 tokens/second, Cerebras produced the most coherent output for complex prompts, Op...
Mistral AI released Mistral Medium 3.5 and added cloud execution to its Vibe coding agent, allowing agents to run tasks in isolated sandboxed environments independent of a developer's local machine. The company also added a "work mode" to its Le Chat interface for running longer, parallel tool-ca...
Anthropic released Claude Opus 4.7 on April 16, scoring 64.3% on SWE-bench Pro (+10.9 points over 4.6) and 70% on CursorBench (+12 points), with added image support up to 3.75MP, a beta token-budget parameter for agent loops, and a new "xhigh" reasoning tier. The company also launched a Managed A...
xAI's Grok 4.3 is now available on Vercel's AI Gateway, accessible via the AI SDK using the model identifier `xai/grok-4.3`. The model features a 1M token context window and a December 2025 knowledge cutoff.
A company migrated its customer support chatbot from Claude 3.5 Sonnet to GPT-5 after a 4-week model evaluation, reporting chatbot CSAT rising from 72% to 92% and tier-1 query resolution improving from 70% to 88% within 30 days of full rollout. Non-English CSAT increased from 61% to 84%, and huma...
The UK AI Security Institute evaluated OpenAI's GPT-5.5 for cyber capabilities, finding its ability to identify security vulnerabilities comparable to Anthropic's Claude Mythos. Unlike Mythos, GPT-5.5 is currently generally available.
A 1,200-test benchmark translating TypeScript 5.6 to Rust 1.83 found GPT-5 achieved a 94.2% first-pass compilation rate, compared to 91.7% for Claude 4.0 and 82.3% for Llama 4 70B, though Claude 4.0 led on memory safety checks at 96.4%.
OpenAI published a post-mortem on "goblin outputs," unusual personality-driven behaviors observed in GPT-5, detailing their origin, how they spread through the model, and the fixes applied to address them.
Nick Levine, David Duvenaud, and Alec Radford released talkie, a 13B-parameter language model trained on 260B tokens of pre-1931 English text, under an Apache 2.0 license. A second instruction-tuned variant was fine-tuned using synthetic data and Claude Sonnet/Opus 4.6 as judges.
OpenAI released GPT-5.5 on April 24, 2026, scoring 82.7% on the Terminal-Bench 2.0 agentic programming benchmark. The model is optimized for NVIDIA GB200 and GB300 hardware and uses fewer tokens than GPT-5.4 for equivalent tasks.
GPT-5 costs $1.25/$10 per million input/output tokens versus Claude Sonnet 4.6's $3/$15, giving GPT-5 a 1.6–2x cost advantage on typical workloads. GPT-5 leads on math benchmarks (AIME 2025: 94.6% vs 70.5%), while Sonnet 4.6 offers flat pricing across a 1M-token context window and stronger agenti...
A benchmark comparison of GPT-5.5, Claude Opus, and Gemini 3.1 Pro claims GPT-5.5 leads in agentic workflows, Claude Opus in software engineering, and Gemini 3.1 Pro in cost and multimodal processing, with full data hosted on an external site.
A Dev.to author claims OpenAI released GPT-5.5 on April 23, 2026, a fully retrained base model scoring 82.7% on Terminal-Bench 2.0 but posting an 86% hallucination rate on AA-Omniscience evals, compared to 36% for Claude Opus 4.7.
DeepSeek released V4 Pro on April 24, 2026, a mixture-of-experts model with 1.6 trillion total parameters and 49 billion active parameters, supporting a 1-million-token context window under an MIT license. Pricing is set at $1.74 per million input tokens and $3.48 per million output tokens, with ...
OpenAI's Romain Huet confirmed the company will not release a separate GPT-5.5-Codex model, stating that Codex and the main model were unified into a single system starting with GPT-5.4. GPT-5.5 includes improvements in agentic coding and computer use tasks.
OpenAI released GPT-5.5 via its API alongside a prompting guide that advises developers to treat it as a new model family rather than a drop-in replacement for gpt-5.2 or gpt-5.4. The guide recommends starting with a minimal prompt baseline and retuning reasoning effort, verbosity, and output for...
A benchmark comparison of GPT-5.5, Claude Opus 4.7, and Gemini 3.1 Pro found split results: GPT-5.5 led Terminal-Bench 2.0 at 82.7%, Opus 4.7 led SWE-Bench Pro at 64.3% and MCP-Atlas tool-use at 77.3%, and Gemini 3.1 Pro led ARC-AGI-2 abstract reasoning at 77.1%.
OpenAI's GPT-5.5 and GPT-5.5 Pro models are now accessible through Vercel's AI Gateway, available via the identifiers `openai/gpt-5.5` and `openai/gpt-5.5-pro` in the AI SDK. Both variants target long-running agentic tasks and are described as more token-efficient than the previous generation.
DeepSeek previewed new AI models it says outperform DeepSeek V3.2 in efficiency and performance, citing architectural improvements. The company claims the models have nearly matched leading open and closed models on reasoning benchmarks.
OpenAI released GPT-5.5 and GPT-5.5 Pro, general-purpose models with claimed improvements in coding and reasoning. Early testing by developer Simon Willison found the model performed below GPT-5.4 on default settings, improving only when given higher reasoning effort at the cost of increased toke...
OpenAI released GPT-5.5 and GPT-5.5 Pro, available to paying ChatGPT and Codex users, scoring 82.7% on Terminal-Bench 2.0 and 58.6% on SWE-Bench Pro. OpenAI claims the model uses fewer tokens than its predecessor and costs half that of competing frontier coding models.
DeepSeek released two preview models, V4-Pro (1.6T parameters, 49B active) and V4-Flash (284B parameters, 13B active), both with 1M token context windows under MIT license. V4-Pro is priced at $1.74/million input tokens and $3.48/million output tokens; V4-Flash at $0.14 and $0.28 respectively.
Users of Anthropic's Claude Opus 4.7 have reported that the model performs worse than its predecessor on complex reasoning and coding tasks, with complaints including repetitive self-correction loops and failures on software development projects previously handled by Claude 4.6.
Anthropic released Claude Opus 4.7, which scored 64.3% on the SWE-bench Pro coding benchmark, up from 53.4% in the prior generation. The model also adds high-resolution image support up to 2576px and improved visual reasoning scores from 69.1% to 82.1% on the CharXiv benchmark.
OpenAI released GPT-5.5, a new model following GPT-5.4 from the previous month, describing it as more capable at coding, writing, online research, and multi-step tasks requiring tool use. The company says the model can handle complex, ambiguous tasks with less user oversight.
OpenAI published the system card for GPT-5.5, a new language model, detailing its safety evaluations and capabilities assessments. System cards are OpenAI's standard documentation accompanying model releases.
OpenAI released GPT-5.5, a new language model aimed at tasks including coding, research, and data analysis. The company describes it as faster than previous versions, though no specific benchmark figures were provided.
OpenAI released GPT-5.5, a new model the company says offers increased capabilities across multiple categories. The release is part of OpenAI's broader effort to develop a consolidated AI application platform.
OpenAI released Privacy Filter, a 1.5-billion-parameter token-classification model that detects and redacts eight categories of PII — including names, emails, phone numbers, and API keys — in a single pass over texts up to 128,000 tokens. The model runs locally with 50 million active parameters, ...
Vercel added DeepSeek V4 to its AI Gateway, offering two variants: DeepSeek V4 Pro, aimed at agentic coding and mathematical reasoning, and DeepSeek V4 Flash, a smaller model for high-volume, latency-sensitive workloads. Both models support a 1M token context window.
Simon Willison published a newsletter edition covering GPT-4.5, ChatGPT Images 2.0, and Qwen3 6-27B models, along with 5 blog posts, 8 links, 3 quotes, and a new chapter of his Agentic Engineering Patterns guide.
DeepSeek released a preview of its open-source V4 AI model, claiming it matches closed-source systems from Anthropic, Google, and OpenAI, with notable improvements in coding. The company also highlighted the model's compatibility with domestic Huawei chips.
Qwen released Qwen3.6-27B, a 27-billion-parameter dense model (55.6GB) that the company claims surpasses its previous open-source flagship Qwen3.5-397B-A17B on major coding benchmarks. A Q4_K_M quantized version runs at approximately 25 tokens/second locally at 16.8GB.
Anthropic released Claude Opus 4.7 on April 16, 2026, positioning it as their most capable generally available model, with a 200,000-token context window and emphasis on deep reasoning and tool use over its predecessor Sonnet variants.
OpenAI's GPT Image 2 image model is now available on Vercel's AI Gateway, accessible via the AI SDK with the identifier "openai/gpt-image-2". The model supports up to 2K resolution, multiple aspect ratios, non-English text rendering, and various visual styles.
OpenAI launched ChatGPT Images 2.0, available via the API as gpt-image-2, featuring two modes: Instant for fast output and Thinking, which reasons through image structure before generating up to eight images per prompt. Advanced thinking capabilities are limited to Plus, Pro, and Business subscri...
OpenAI released ChatGPT Images 2.0 (gpt-image-2), with Sam Altman describing the improvement over gpt-image-1 as equivalent to the jump from GPT-3 to GPT-5. A blogger tested the model against Google's image generation models using a "Where's Waldo"-style prompt to compare output quality.
OpenAI released ChatGPT Images 2.0, powered by its GPT Image 2 model, which can search the web to inform image generation from a single prompt. The update also improves instruction-following, detail preservation, and text rendering, and is available to Plus, Pro, Business, and Enterprise subscrib...
Anthropic released Claude Opus 4.7 on April 16, 2026, two months after Opus 4.6. The model improved on coding benchmarks (SWE-bench Verified: 87.6% vs 80.8%) and visual acuity (98.5% vs 54.5%), but regressed on long-context retrieval (32.2% vs 78.3%) and logical reasoning (41.0% vs 94.7%), with p...
Moonshot AI's Kimi K2.6 model is now available on Vercel AI Gateway, accessible via the model ID `moonshotai/kimi-k2.6` in Vercel's AI SDK. The model targets long-horizon coding tasks across languages including Rust, Go, and Python, as well as front-end, DevOps, and performance optimization work.
Simon Willison documented changes in the system prompt between Anthropic's Claude Opus versions 4.6 and 4.7, comparing the instructions baked into the two model releases.
Anthropic updated the Claude.ai system prompt with the release of Claude Opus 4.7 on April 16, 2026, adding Claude in PowerPoint as a new tool, expanding child safety instructions under a dedicated XML tag, and adding guidance instructing the model to attempt tasks before asking clarifying questi...
Anthropic released Claude Opus 4.7, a new version of its Claude AI model, along with documentation outlining key changes and a migration guide for developers transitioning from earlier versions.
Anthropic released Claude Opus 4.7, its most capable generally available model, focused on software engineering, image analysis, and instruction following. The company noted Opus 4.7 does not advance its capability frontier, as the separately released Mythos Preview — currently limited to partner...
A Dev.to author published an analysis of Anthropic's Claude Opus 4.7 model, examining changes from previous versions. The article's actual technical content was not available in the retrieved text.
Anthropic released Claude Opus 4.7, which scores 87.6% on the SWE-bench coding benchmark. The release includes breaking API changes and a price increase compared to prior versions.
Anthropic released Claude Opus 4.7, an updated version of its AI model with improvements to vision capabilities, memory, and instruction-following performance.
Google released Gemini 3.1 Flash, a text-to-speech model. Simon Willison published notes and a tool interface for the new model.
Google released Gemini 3.1 Flash TTS, a text-to-speech model available via the Gemini API that generates audio from text prompts and supports detailed voice direction including accents, tone, and delivery style.
Google released Gemini 3.1 Flash TTS, a text-to-speech system, across its products.
ByteDance's Seedance 2.0 video generation model is now available via Vercel's AI Gateway in Standard and Fast variants, supporting text-to-video, image-to-video, and multimodal reference-to-video generation with synchronized audio and video editing capabilities.
The UK's AI Security Institute evaluated Anthropic's Claude Mythos Preview and found it autonomously completed a 32-step corporate network takeover simulation, marking the first AI model to execute such a full multi-stage cyberattack simulation. The model showed improved performance in capture-th...
Kumo announced KumoRFM-2, a foundation model for relational databases that accepts plain-English queries and outperformed supervised machine learning models by 5% on Stanford's RelBench benchmark and beats AWS AutoGluon on enterprise benchmarks, scaling to over 500 billion rows of data.
OpenAI released GPT-5.4-Cyber, a model variant fine-tuned for defensive cybersecurity work, and expanded its Trusted Access for Cyber program allowing identity-verified users reduced-friction access to security tools via government ID verification through Persona.
Google's Gemma 4 E2B model can transcribe audio files on macOS using MLX and mlx-vlm via a uv command-line recipe, as demonstrated on a 14-second voice memo that was substantially transcribed with minor errors.
Google added a feature to Gemini allowing it to generate interactive 3D models and simulations in response to user questions, with controls to rotate models, adjust parameters via sliders, and modify simulations in real-time.
Anthropic announced Project Glasswing with twelve launch partners and revealed Claude Mythos, a frontier model it is not yet shipping to production, along with a system card detailing security findings.
Meta released Muse Spark, a hosted AI model available on meta.ai in "Instant" and "Thinking" modes, with benchmarks competitive with Opus 4.6, Gemini 3.1 Pro, and GPT 5.4 on selected tests. The model has access to 16 tools including web search, Meta platform content search, and rendering capabili...
Anthropic released a system card for Claude Mythos Preview, a new model variant, documenting its capabilities and characteristics.
Z.ai released GLM-5.1, a 754-billion-parameter open-source AI model available under MIT license. The model demonstrated the ability to generate SVG graphics with CSS animations and to debug and fix code issues when given follow-up instructions.
Anthropic released Claude Mythos Preview with accompanying security documentation and initiated Project Glasswing to address software security in the AI era.
Z.ai's GLM 5.1 model is now available on Vercel's AI Gateway, supporting long-horizon autonomous tasks including multi-step coding workflows, conversation, creative writing, and document generation.
Anthropic released a preview of its Mythos AI model for use by select companies in defensive cybersecurity work.
Vercel made Fast Mode available for Claude Opus 4.6 on AI Gateway, offering 2.5x faster output token speeds at 6x standard pricing. The experimental feature is designed for human-in-the-loop workflows and coding tasks.
Anthropic released Claude Mythos Preview, a new frontier AI model, to about 50 select partners including Amazon, Apple, and Microsoft through Project Glasswing for defensive cybersecurity work. The model scores 83.1% on vulnerability analysis benchmarks, compared to 66.6% for the previous flagshi...
Anthropic is testing an early-access model called Claude Mythos after a March 2026 data leak confirmed its existence; the company has not publicly launched it but described it as a significant capability increase, with potential applications in code, cybersecurity, and enterprise operations.
Google Gemini 2.0, OpenAI GPT-5.3, and Anthropic Claude 4.6 were compared across coding and reasoning benchmarks, with GPT-5.3-Codex scoring 56.8% on SWE-Bench Pro, Claude Opus 4.6 at 55%, and Gemini 2.0 at 52%, while Claude led on multi-step reasoning tasks with 95.2% on GSM8K mathematical bench...
Claude Opus 4 scores higher on coding benchmarks (76.8% vs 71.8% on SWE-bench) and offers a larger 200K-token context window, while GPT-5 excels at reasoning tasks and costs less ($10-30 per 1M tokens vs $15-75). Both are available at $20/month subscription tier.
Google DeepMind released Gemma 4, a family of four open-source language models sized at 2B, 4B, 31B, and 26B parameters with Apache 2.0 licensing, featuring native vision and audio processing capabilities.
xAI's Grok 4.20 model is now available on Vercel's AI Gateway in three variants: Reasoning, Non-Reasoning, and Multi-Agent. Developers can access the models through Vercel's AI SDK with specified model identifiers.
Vercel AI Gateway now offers GLM 5V Turbo, a multimodal model from Z.ai that converts screenshots and designs into code and can navigate graphical interfaces. The model is accessible via the AI SDK using the identifier zai/glm-5v-turbo.
Alibaba's Qwen 3.6 Plus model is now available on Vercel's AI Gateway. The model features a 1M context window and improved agentic coding capabilities, multimodal reasoning, and tool-calling performance compared to Qwen 3.5 Plus.
Inception's Mercury 2 model is now available on Vercel's AI Gateway. The model is designed for low-latency applications including agentic loops, coding assistants, and voice interfaces.
Vercel made MiniMax M2.7 available on its AI Gateway in standard and high-speed variants. The high-speed version delivers the same performance at 2x cost with ~100 tokens per second latency.
Google's Gemma 4 models—a 26B mixture-of-experts variant and a 31B dense variant—are now available on Vercel's AI Gateway. Both models support function-calling, vision capabilities, 256K context windows, and 140+ languages.
llm-gemini version 0.30 added three new models: gemini-3.1-flash-lite-preview, gemma-4-26b-a4b-it, and gemma-4-31b-it.