Token prices are falling. Your AI bill isn't. Here's why.

Every AI cost model I've seen starts with the same assumption: prices fall as models get more efficient. Reasonable. Incomplete. And for the next three years, increasingly dangerous to plan around.

The per-token price for frontier models has dropped roughly 10x over two years. That's real. What's also real: the hardware economics are moving the other way, consumption is growing faster than prices are falling, and the world's largest GPU operator just locked its entire infrastructure into natural gas instead of betting on efficiency curves. If your AI budget plan reads "costs will keep declining," three separate forces are actively breaking that model.

The hardware constraint your forecast doesn't include

Epoch.ai published a breakdown of AI chip component costs from Q1 2024 to Q4 2025.[1] The number that matters for operators: HBM memory — the high-bandwidth memory stacked directly onto AI chips — jumped from 52% to 63% of total AI chip component costs in a single year.

Not a rounding error. Total AI chip spending grew from $22 billion to $52 billion in that period. HBM specifically went from $12 billion to $32 billion — nearly tripling, while every other component (logic dies, packaging, auxiliary parts) stayed flat or shrank as a share of the total.

| Component | Q1 2024 share | Q4 2025 share | 2024 spend | 2025 spend | |---|---|---|---|---| | HBM memory | 52% | 63% | ~$12B | ~$32B | | Advanced packaging | 19% | 15% | ~$4B | ~$8B | | Logic dies | 14% | 13% | ~$3B | ~$7B | | Auxiliary components | 15% | 10% | ~$3B | ~$5B | | Total | 100% | 100% | ~$22B | ~$52B |

HBM is produced by three suppliers globally: Samsung, SK Hynix, Micron. Yields are hard to improve. New capacity takes three to four years to build. And — this is the part that breaks the "efficiency will save us" argument — more efficient models require more HBM per inference, not less. More attention layers. More parallelism. More agentic context held in memory. The efficiency gains you're counting on require more of the one input that's getting more expensive relative to everything else.

The floor doesn't bend down on its own.

The Jevons problem: cheaper per token, more spent total

Gartner projects inference costs for sophisticated AI models will drop nearly 90% by 2030.[2] Goldman Sachs projects token consumption will grow 24-fold by 2030, hitting 120 quadrillion tokens per month.[2] Both can be true. Both are, apparently, true.

This is the Jevons paradox: when a resource gets cheaper, you use more of it, and total spend on that resource rises rather than falls. Victorian economists observed it with coal. AI operators are observing it with tokens.

You can see the pattern already. Microsoft encouraged thousands of engineers to adopt Claude Code as an AI coding assistant. After six months, the scale of adoption became too expensive to sustain — and the company started canceling licenses.[2] Uber's engineering team ran leaderboards to incentivize AI tool adoption. They burned through their entire 2026 AI coding tools budget in four months.[2]

These aren't adoption failures. They're Jevons in action. The tools worked. Developers used them constantly. The cost model assumed "a few power users" and got "everyone, all day."

Gartner analyst Will Sommer put the implication directly: "Chief Product Officers should not confuse the deflation of commodity tokens with the democratization of frontier reasoning."[2] Frontier reasoning — agentic work, multi-step tasks, context-heavy analysis — consumes far more tokens per completed task than the single-prompt calls most cost models were built around. The pricing argument for agent-based workflows runs the same direction: outcomes, not tokens, are the unit that holds.

The power bet nobody is modeling

The hardware constraint and the consumption paradox are real. There's a third force that's harder to model but worth naming.

xAI is running the world's largest GPU cluster on dozens of unregulated natural gas turbines. The company has $2.8 billion in additional compute equipment purchases planned. Elon Musk explicitly abandoned renewable energy options for both xAI and SpaceX data centers.[3]

xAI's Memphis data center infrastructure running on natural gas turbines rather than renewable power sources Story: Elon Musk has given up on solar power on Earth. Image via TechCrunch.

Why does this matter for operators? Gas infrastructure has a 15–20 year lifespan once committed. Power costs for AI compute aren't a software problem you can optimize away. When the operator running the most compute in the world explicitly rules out the efficiency path (renewables plus efficiency curves) in favor of the capacity path (more gas, more turbines), that's a signal about where the binding constraint actually sits.

AI compute isn't short on clever engineers. It's short on power delivery at scale. Gas has baseload reliability that solar and wind don't. The infrastructure bet is: build more capacity, keep running it. That's not a trajectory that produces dramatically lower operating costs over a five-year window.

flowchart TD
    A["Standard AI cost model:\ntoken prices fall ~50%/yr"]
    A --> B["Budget conclusion:\nAI costs halve by 2027"]

    C["Three forces the model misses"]
    C --> D["HBM memory: 52% → 63%\nof total chip spend in one year\n— Floor efficiency gains can't lower"]
    C --> E["Token consumption: 24x growth\nby 2030 — Goldman Sachs\n— Jevons offsets price drops"]
    C --> F["Power infrastructure: gas locked in\nfor 15–20 year horizon\n— No renewable efficiency curve"]

    D --> G["Actual AI cost trajectory:\nflat-to-rising for most teams"]
    E --> G
    F --> G

    G --> H["Rebuild your cost model\nbefore Q3 budget reviews"]

What a rebuilt cost model looks like

Three changes that move you off the broken assumption:

Price per token is not cost to run a workflow. Agentic tasks consume 10–100x more tokens than the single-step prompts most per-token pricing benchmarks are based on. Cost your workflows on tokens per completed task, not tokens per individual call. You'll find your "cheap" AI automations are substantially more expensive than the API invoice suggests.

Model consumption growth, not just price. If you're adding AI capabilities to products or operations this year, your token consumption will compound. Budget for Jevons. "We'll use it for X" becomes "we use it for everything" within six months of any successful rollout. The teams at Microsoft and Uber didn't make a planning error — they made a human error. Plan for the human version of your own team.

Stop anchoring to "this will get cheaper." The hardware supply constraint is three to four years away from significant relief at best. The consumption paradox accelerates as you add agentic workflows. If you're building anything that will run in production for 24 months or longer, model flat infrastructure costs as your conservative case, not your pessimistic one.

The AI cost curve isn't going down from here. It's going sideways at best. Plan your margins accordingly.

The operators who'll be in a good position in 2028 aren't the ones who assumed prices would keep falling and got lucky. They're the ones who ran the math on flat-to-rising costs and built operations that could absorb that — either by pricing AI labor at outcome value rather than token value, or by choosing workflows where the cost-to-value ratio holds even if infrastructure costs don't fall.

Sources

[1] Epoch.ai — AI Chip Component Cost Shares — https://epoch.ai/data-insights/ai-chip-component-cost-shares

[2] Fortune — The AI cost problem: tokens, agents, and the Jevons trap — https://fortune.com/2026/05/22/microsoft-ai-cost-problem-tokens-agents/

[3] TechCrunch — Elon Musk has given up on solar power on Earth — https://techcrunch.com/2026/05/23/elon-musk-has-given-up-on-solar-power-on-earth/

The short version

HBM memory jumped from 52% to 63% of AI chip component costs in one year. The hardware floor doesn't follow software efficiency curves down.
Goldman Sachs projects 24x token consumption growth by 2030. Microsoft and Uber already hit the Jevons wall: adoption worked, and the bill compounded.
The world's largest GPU operator locked its infrastructure into natural gas for 15–20 years. Power is the binding constraint, not compute efficiency.
Cheaper per token plus more tokens consumed does not equal cheaper in aggregate. Model consumption growth, not just price decline.
Build your AI cost model on per-outcome accounting and flat infrastructure assumptions. "It'll be half the price next year" is the wrong anchor.