The smallest models are carrying the biggest bets
OpenAI, Mistral, and Nvidia all shipped their smallest models this week, and far from being a concession, these tiny models are the linchpin of the agent revolution. GPT-5.4 nano costs pennies per million tokens. Mistral Small 4 activates just 6B of its 119B parameters per query. Nvidia rallied eight AI labs to build open models that run on desktop hardware. The industry is collectively betting that the next phase of AI depends not on building bigger brains, but on making intelligence cheap and fast enough to run everywhere, all the time. That is exactly what the agent era requires.
OpenAI
OpenAI ships GPT-5.4 mini and nano — its cheapest models yet hit free ChatGPT
OpenAI released GPT-5.4 mini and nano on March 17, bringing near-frontier reasoning to free ChatGPT users and offering API developers a nano model at $0.20 per million input tokens — designed explicitly to power sub-agents inside larger AI systems.
openai.com
The race everyone watches is at the top of the capability curve. Bigger models, higher benchmark scores. But this week, the most consequential releases were at the bottom.
OpenAI shipped GPT-5.4 mini and nano, bringing near-frontier reasoning to free ChatGPT users and, more importantly, pricing nano at $0.20 per million input tokens. That's not a rounding error on a research budget. That's a price point where you can run thousands of lightweight agents inside a single workflow and keep the bill under control. OpenAI is explicit about the intent: nano is built for "the sub-agent era," where the unit of deployment isn't one smart model but dozens of cheap ones coordinated by something larger.
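The sub-agent pattern described above can be sketched in a few lines. Everything here is illustrative: `call_model` is a hypothetical stand-in for whatever inference API you use, and the model names are placeholders, not real endpoints.

```python
# Sketch of the "sub-agent era" pattern: a larger coordinator model fans a
# task out to many cheap nano-class calls, then synthesizes the results.
# `call_model` is a hypothetical stand-in for a real chat-completion API.

def call_model(model: str, prompt: str) -> str:
    # Hypothetical: a real system would hit an inference endpoint here.
    return f"[{model}] {prompt[:30]}"

def coordinator(task: str, subtasks: list[str]) -> str:
    # Cheap sub-agents each handle one narrow piece of the task...
    partials = [call_model("nano", s) for s in subtasks]
    # ...and a larger model stitches the partial results into an answer.
    return call_model("frontier", task + " | " + " ; ".join(partials))

result = coordinator("summarise the report", ["read section 1", "read section 2"])
```

The point of the pattern is that the expensive model is called once, while the cheap model absorbs the bulk of the call volume.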
Mistral made a parallel move. Simon Willison covered the release of Mistral Small 4, a mixture-of-experts model with 119B total parameters that activates only 6B per query. It unifies reasoning, multimodal, and agentic coding capabilities in a single model, runs 40% faster than its predecessor, and ships under Apache 2.0. That last detail matters: full commercial use, no strings attached. Mistral is betting that the model powering your agent swarm should be something you can self-host, modify, and ship without a licensing call.
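The "119B total, 6B active" figure comes from mixture-of-experts routing: a router picks a few experts per token, so only a fraction of the weights run on any query. Mistral has not published Small 4's router details, so the expert count, expert size, and top-k value below are made-up numbers that merely illustrate the mechanism.

```python
import random

# Generic top-k mixture-of-experts routing sketch. All sizes are invented
# for illustration; they are NOT Mistral Small 4's real configuration.
NUM_EXPERTS = 16
EXPERT_PARAMS = 7_000_000_000   # hypothetical parameters per expert
TOP_K = 2                       # experts activated per token

def route(token_scores, k=TOP_K):
    """Return the indices of the k highest-scoring experts for one token."""
    ranked = sorted(range(len(token_scores)),
                    key=lambda i: token_scores[i], reverse=True)
    return ranked[:k]

scores = [random.random() for _ in range(NUM_EXPERTS)]  # stand-in router logits
active = route(scores)
active_params = len(active) * EXPERT_PARAMS
total_params = NUM_EXPERTS * EXPERT_PARAMS
print(f"experts {active} active: "
      f"{active_params / 1e9:.0f}B of {total_params / 1e9:.0f}B params")
```

With these toy numbers, each token touches 14B of 112B parameters, which is why an MoE model can have the quality headroom of a large parameter count at something close to small-model latency.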
Then there's Nvidia, which announced the Nemotron Coalition at GTC. Eight AI labs including Cursor, LangChain, Mistral, and Perplexity will co-develop open frontier models that run on desktop hardware. The coalition members aren't chosen randomly. Cursor and LangChain contribute coding and agentic evaluation benchmarks, while Mistral provides the model architecture. The whole thing is designed to produce models optimised for the work agents actually do, not for leaderboard bragging rights.
Why small is the strategy
The pattern is worth stating plainly: three of the biggest infrastructure players in AI all shipped their smallest models in the same week. That's not a coincidence and it's not a retreat. It's a bet on where the volume will be.
Consider the economics. A single GPT-5.4 query might cost a fraction of a cent, but an agentic workflow that makes 500 model calls to complete one task needs each call to cost nearly nothing. At $0.20 per million tokens, nano makes that arithmetic work. At 6B active parameters per query, Mistral Small 4 makes the latency work. These aren't stripped-down models for budget customers. They're purpose-built for a world where AI systems call other AI systems hundreds of times per task.
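The arithmetic above is easy to make concrete. The $0.20-per-million price and the 500-call workflow come from the text; the average tokens per call is an assumption for illustration only.

```python
# Back-of-envelope cost of one agentic workflow on a nano-class model.
# Price and call count are from the article; tokens-per-call is assumed.
INPUT_PRICE_PER_M = 0.20   # dollars per million input tokens (nano pricing)
CALLS_PER_TASK = 500       # model calls to complete one task
TOKENS_PER_CALL = 2_000    # assumed average input tokens per sub-agent call

tokens_per_task = CALLS_PER_TASK * TOKENS_PER_CALL
cost_per_task = tokens_per_task / 1_000_000 * INPUT_PRICE_PER_M

print(f"{tokens_per_task:,} input tokens per task -> ${cost_per_task:.2f}")
# 1,000,000 input tokens per task -> $0.20
```

Under those assumptions, a 500-call workflow costs about twenty cents in input tokens. Run the same workflow on a frontier model priced tens of times higher and the per-task cost stops being negligible, which is the whole argument for nano-class pricing.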
This is the infrastructure shift that matters more than any single capability improvement. The agent era requires models that are fast enough to run in loops, cheap enough to run in parallel, and small enough to run everywhere. Desktop and edge hardware. Inside other models' reasoning chains. The constraint isn't intelligence anymore. It's unit economics.
The question for anyone building on top of these models is whether the quality holds. GPT-5.4 mini reportedly approaches the full GPT-5.4 on SWE-Bench Pro and OSWorld while running twice as fast. Mistral Small 4 handles 3x more queries per second than its predecessor. If those numbers hold up in production, the trade-off between capability and cost just got a lot more favourable.
I think we'll look back at this week as the moment the industry collectively decided that the next phase of AI isn't about building bigger models. It's about making the small ones good enough that you can run them everywhere, all the time, without thinking about the bill. The real question is what gets built once that constraint disappears.
Read the original on OpenAI