AI's most important new trick is knowing when to stop thinking

Three separate teams shipped models this week that cut token usage by nearly half through the same insight: the best inference is often no inference at all.




The most wasteful part of modern AI isn't training. It's inference.

Every time an agent reasons through a problem, it burns tokens on chain-of-thought, tool selection, context management, and self-correction. Most of that thinking is unnecessary for most tasks. This week, three separate teams shipped the same insight: the best inference is often no inference at all.

TechCrunch reported that OpenAI's GPT-5.4 includes a tool search mechanism that cuts agent token consumption by 47%. Instead of reasoning through which tool to call and how to call it at inference time, the model retrieves the right tool specification directly. Nearly half the tokens an agent previously spent on tool orchestration simply disappear.
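OpenAI hasn't published the internals of its tool search, so take this as a minimal sketch of the general pattern rather than the actual mechanism: rather than pasting every tool specification into the prompt and letting the model deliberate over the list, retrieve the single best-matching spec first. The toy keyword-overlap scorer stands in for what would realistically be embedding similarity, and all names here are illustrative.

```python
# Sketch of tool retrieval: send the model one relevant spec instead of
# the whole catalogue, so no tokens are spent deliberating over
# irrelevant tools. Keyword overlap is a stand-in for embedding search.

TOOL_SPECS = {
    "get_weather": "Return the current weather forecast for a city.",
    "send_email": "Send an email message to a recipient address.",
    "query_database": "Run a read-only SQL query against the database.",
}

def score(query: str, description: str) -> int:
    """Count overlapping words between the query and a tool description."""
    return len(set(query.lower().split()) & set(description.lower().split()))

def retrieve_tool(query: str) -> str:
    """Pick the single best-matching tool before the model ever sees a prompt."""
    return max(TOOL_SPECS, key=lambda name: score(query, TOOL_SPECS[name]))

best = retrieve_tool("what is the weather forecast for Berlin")
print(best)  # get_weather
```

The prompt that reaches the model now carries one tool spec instead of dozens, which is where the orchestration tokens go.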

Allen AI took a different path to the same destination. Their OLMo Hybrid architecture achieves the same accuracy as its predecessor with 49% fewer tokens by blending attention mechanisms with state-space models. The efficiency gain isn't from skipping work; it's from reorganising how the model processes sequences so that redundant computation never happens.

Then Microsoft Research released Phi-4-reasoning-vision, a 15-billion-parameter model that decides at inference time whether extended thinking actually helps. For straightforward tasks, it skips the reasoning chain entirely. For hard problems, it thinks deeply. The model has learned to triage its own cognitive effort.
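Microsoft hasn't detailed how the model gates its own reasoning, but the triage idea can be sketched with a simple router: estimate difficulty, then either answer directly or pay for an extended reasoning chain. The length-plus-trigger-word heuristic and thresholds below are entirely made up for illustration; a real system would use a learned gate.

```python
# Illustrative inference-time triage: route a query to a cheap direct
# answer or an expensive reasoning chain based on estimated difficulty.
# The heuristic and threshold are hypothetical, not Microsoft's.

REASONING_TRIGGERS = {"prove", "derive", "optimize", "why", "compare"}

def estimate_difficulty(query: str) -> float:
    words = query.lower().split()
    trigger_hits = sum(w.strip("?.,") in REASONING_TRIGGERS for w in words)
    return len(words) / 50 + trigger_hits  # longer or analytical -> harder

def route(query: str, threshold: float = 1.0) -> str:
    if estimate_difficulty(query) < threshold:
        return "direct"    # skip the chain-of-thought entirely
    return "extended"      # think deeply only when it plausibly helps

print(route("What is the capital of France?"))                    # direct
print(route("Prove that the sum of two even numbers is even."))   # extended
```

The interesting property is that the gate runs before any reasoning tokens are generated, so easy queries cost almost nothing extra.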

Why this matters more than new capabilities

The AI industry has spent the last two years in a capabilities race: bigger models, longer contexts, better benchmarks. This week marks something different. The competition is shifting from what models can do to how efficiently they do it.

For anyone running AI in production, this is the change that actually affects your margins. A 47% reduction in agent tokens means an agent that costs nearly half as much to operate. A model that skips unnecessary reasoning means you stop paying for thinking that doesn't improve the output. These aren't incremental optimisations. At scale, they're the difference between a product that's economically viable and one that burns cash on every request.
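The margin claim is easy to sanity-check with back-of-envelope arithmetic. The per-token price and monthly volume below are invented for illustration; only the 47% figure comes from the reporting.

```python
# Back-of-envelope cost check for a 47% cut in agent token usage.
# Price and workload are hypothetical; the 47% is from the article.

PRICE_PER_MILLION_TOKENS = 10.00   # dollars, hypothetical
MONTHLY_TOKENS = 500_000_000       # hypothetical agent workload

before = MONTHLY_TOKENS / 1e6 * PRICE_PER_MILLION_TOKENS
after = before * (1 - 0.47)

print(f"before: ${before:,.0f}/month")  # before: $5,000/month
print(f"after:  ${after:,.0f}/month")   # after:  $2,650/month
```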

The pattern across all three releases is consistent: none of them sacrificed capability for efficiency. They found wasted computation and eliminated it. That's the kind of progress that compounds. If models keep getting smarter about when and how much to think, the cost curve bends faster than the capability curve climbs. And when inference gets cheap enough, entire categories of application that don't close economically today suddenly work.

The question is whether this efficiency era is a phase or a permanent shift. I think it's permanent. The capabilities frontier will keep advancing, but the companies that win in production won't be the ones with the most capable models. They'll be the ones whose models know when capability isn't needed.


Read the original on TechCrunch

