Companies Are Paying Billions for AI Compute They Barely Use
New data from 23,000 Kubernetes clusters shows enterprise GPU utilization averages just 5%. As AWS raises H200 prices by 15%, the cost of AI compute waste is becoming impossible to ignore.
Every company running AI on Kubernetes is paying a large GPU bill. On average, they're using about 5% of what they're paying for.
That's the headline finding from Cast AI's 2026 State of Kubernetes Optimization Report, which analyzed 23,000 real production clusters across AWS, Google Cloud, and Azure. The data doesn't come from a survey or an estimate; it comes from actual workload measurement. And the result is unambiguous: enterprise GPU clusters average 5% utilization. Ninety-five cents of every dollar spent on GPU capacity does nothing.
Laurent Gil, co-founder and president at Cast AI, put it plainly: "A GPU sitting idle costs dollars per hour. A CPU sitting idle costs cents. And 95% of GPU capacity is doing nothing."
At roughly $401 billion in projected global AI infrastructure spend for 2026, 95% idle capacity implies well over $350 billion effectively going to waste if utilization patterns don't shift. That's not a rounding error. That's a structural failure happening at scale across virtually every industry deploying AI on modern cloud infrastructure. This isn't an edge-case problem affecting a few careless teams. It's the norm.
And the factors driving it are not self-correcting. In January 2026, AWS raised H200 Capacity Block prices by 15%. For the first time since EC2 launched in 2006, a major hyperscaler is raising reserved GPU pricing rather than cutting it. Companies paying premium rates for hardware sitting idle are now paying more for the same idle hardware.
Why Enterprise GPU Clusters Average Just 5% Utilization
GPU workloads don't run continuously. An inference server sits idle between requests. A training job runs for days, then stops. Batch pipelines fire on schedule and go quiet for hours. Some idle time is expected and normal. What is surprising is the depth of the idle gap.
At 5% average utilization, clusters aren't just resting between tasks. They're provisioned for peak loads, traffic spikes, and disaster scenarios that never materialize at the scale anticipated. Teams request H200 allocations based on what they think they'll need six months from now, not what they need today. The Cast AI report measured this across the full stack: average CPU utilization sat at 8% in 2025, memory was at 20%, and CPU overprovisioning jumped from 40% to 69% year over year. The GPU number is the most dramatic, but the whole infrastructure posture has been built around cushion rather than precision.
Two years ago, getting GPU allocations was genuinely hard. Wait times for H100 clusters stretched to months in 2023 and early 2024. Cloud providers rationed capacity. Companies that had allocations guarded them and didn't release capacity even when not actively using it, because giving it back meant joining the wait list again. That created a rational hoarding instinct: reserve what you can get today as a hedge against tomorrow's supply risk.
The problem is that hoarding feeds itself. Every company that reserves more than it needs reduces available capacity for everyone else, which creates more scarcity, which justifies even more over-reservation. NVIDIA received orders for 2 million H200 chips heading into 2026 against roughly 700,000 in inventory. That demand signal is inflated by defensive buying, not by workloads that actually need that many chips right now.
VentureBeat's analysis of this pattern makes a clarifying observation: a surprising number of H200 purchases in 2026 were made because the allocation came through, not because the workload required it. The GPU appeared to be available, the team had budget, and the fear of not having capacity later was enough justification to commit. That's how a market moves from genuine scarcity to structural overcapacity while prices continue rising. The infrastructure sits at 5% utilization by direct measurement, but the contracts and payments are fully real.
Workload design also contributes. Long-context inference pipelines retrieve far more context than a model actually uses per query. Embedding generation jobs run on fixed schedules rather than on demand. Agent workflows with multiple tool calls spin up compute cycles even when those calls could be batched or consolidated. The architecture of typical enterprise AI pipelines was built when tokens were cheap and compute access was the bottleneck. Now that compute is abundant but expensive, those design choices start to matter.
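To make the consolidation point concrete, here is a minimal sketch of batching embedding requests instead of issuing one GPU-backed call per item. The embed_batch function is a hypothetical stand-in for whatever embedding endpoint a team actually runs; the batching pattern is the part that matters.

```python
# Minimal sketch: consolidate per-item embedding calls into batches.
# `embed_batch` is a hypothetical stand-in for a team's real embedding
# endpoint; only the batching pattern is the point here.

from typing import Callable

def embed_all(texts: list[str],
              embed_batch: Callable[[list[str]], list[list[float]]],
              batch_size: int = 64) -> list[list[float]]:
    """Embed texts in fixed-size batches rather than one call per text.

    One GPU-backed call per batch keeps the accelerator busy in short
    bursts instead of waking it up hundreds of times for tiny requests.
    """
    vectors: list[list[float]] = []
    for start in range(0, len(texts), batch_size):
        vectors.extend(embed_batch(texts[start:start + batch_size]))
    return vectors
```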
GPU Prices Are Rising as Idle Clusters Get More Expensive to Run
Cloud infrastructure pricing has followed a consistent pattern for nearly two decades: more efficient hardware, more manufacturing scale, and more competition among providers mean prices go down. That pattern has broken.
AWS raising H200 Capacity Block prices by 15% in January 2026 is the clearest data point. It's the first meaningful upward GPU price move from a hyperscaler in the EC2 era. The combination of demand inflated by FOMO buying, constrained supply of advanced chips relative to orders, and rising operational costs for liquid-cooled dense GPU infrastructure is bending the price curve the wrong direction.
For enterprise teams running at 5% utilization, a 15% price increase on the same idle hardware is a direct budget hit with no productivity offset. The spend is going up. The output isn't changing.
The waste accumulated during a specific phase of AI adoption: the enterprise pilot period, when organizations were exploring what AI could do and vendors were competing for adoption with flat-rate pricing. Flat-fee licenses and bundled token deals created a specific incentive structure. Whether a pipeline consumed 10 million or 100 million tokens in a month, the bill was the same. Under those conditions, there was no financial reason to optimize for efficiency. Teams built long-context agents and complex retrieval pipelines because the usage cost was invisible.
The architectural choices made under flat-rate pricing are now embedded in production systems. Retrieval chains that fetch more context than a model uses. Agent orchestration frameworks with more tool calls than a task requires. Inference hardware allocated per-service rather than shared across workloads. Those choices were defensible under flat-rate terms. They're expensive under metered billing. And the billing shift is accelerating as enterprises exit early-adopter contracts.
VentureBeat's Q1 2026 AI Infrastructure and Compute Market Tracker shows enterprises are registering the shift in real time. "Access to GPUs" dropped as the primary AI infrastructure concern from 20.8% to 15.4% in a single quarter. Companies that spent two years worried about getting GPU capacity are now worried about getting value from what they have. Interest in managed self-hosted GPU stacks jumped from 11.3% in January to 17.9% in February as teams evaluate whether owning or co-locating hardware would produce better unit economics than paying cloud providers for idle capacity.
To understand the full range of infrastructure choices available as this market shifts, the AI Infrastructure guide covers chips, cloud models, and capacity strategies in depth. The assumptions that made cloud reserved instances the obvious default in 2024 are no longer holding for every workload type in 2026.
Three Practical Ways to Cut GPU Waste Without Releasing Capacity
The 5% average isn't a ceiling. The Cast AI dataset includes one cluster running 136 H200 GPUs at 49% sustained utilization. That's not a theoretical best case. It's a production result, demonstrating that the gap between typical and achievable is a matter of operational discipline, not technology availability.
The difference between a 5% cluster and a 49% cluster isn't necessarily how much AI work is happening. It's how the work is scheduled, shared, and sized. GPU time-slicing allows multiple workloads to share a single physical GPU. At low to medium concurrency, a single H200 can serve several inference workloads simultaneously with acceptable latency for many use cases. Most enterprise deployments skip this entirely, provisioning dedicated GPUs per workload or per service. Each GPU then runs at its single-workload peak and sits idle between requests.
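As a back-of-the-envelope sketch, not a figure from the Cast AI report, the arithmetic below estimates how much of one GPU a few low-traffic inference services would keep busy if stacked on shared hardware. The request rates and per-request GPU times are illustrative assumptions.

```python
# Back-of-the-envelope estimate of GPU sharing headroom.
# All request rates and per-request GPU times are illustrative
# assumptions, not figures from the report.

def gpu_busy_fraction(requests_per_second: float,
                      gpu_seconds_per_request: float) -> float:
    """Fraction of one GPU a service keeps busy (arrival rate x service time)."""
    return requests_per_second * gpu_seconds_per_request

services = {
    "internal-chatbot":  gpu_busy_fraction(0.5, 0.20),  # ~10% of a GPU
    "ticket-classifier": gpu_busy_fraction(2.0, 0.03),  # ~6%
    "doc-embeddings":    gpu_busy_fraction(1.0, 0.05),  # ~5%
}

total = sum(services.values())
print(f"Combined demand: {total:.0%} of one GPU")  # ~21% here

# Stacked on a single time-sliced device, these three services still leave
# most of the GPU idle -- which is why dedicating a GPU to each of them
# produces single-digit utilization.
```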
Workload scheduling compounds the opportunity. Batch jobs that don't require low latency, including training runs, embedding generation, and scheduled analytics, can run in off-hours windows on the same hardware that serves live inference during business hours. A cluster that handles interactive inference from 9am to 6pm and runs batch jobs from 6pm to 9am is materially more efficient than two separate clusters, one for each job type. Spot or interruptible instances, for workloads that can tolerate preemption, typically price 60% to 90% below reserved capacity rates, which is significant when reserved rates just went up 15%.
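A rough cost sketch shows why the shared pattern matters. The hourly rates below are placeholders rather than quoted cloud prices, and the spot discount is an assumption inside the 60% to 90% range.

```python
# Illustrative cost comparison: two dedicated clusters vs. one shared
# day/night cluster plus spot capacity for interruptible batch work.
# Hourly rates are placeholders, not quoted cloud prices.

HOURS_PER_MONTH = 730
RESERVED_GPU_HOURLY = 10.00   # placeholder reserved rate per GPU-hour
SPOT_DISCOUNT = 0.70          # assume spot at 70% below reserved

def monthly_cost(gpus: int, hourly_rate: float) -> float:
    return gpus * hourly_rate * HOURS_PER_MONTH

# Option A: separate always-on clusters for inference and for batch.
separate = monthly_cost(8, RESERVED_GPU_HOURLY) + monthly_cost(8, RESERVED_GPU_HOURLY)

# Option B: one reserved cluster serves inference by day and batch by night,
# with overflow batch work pushed to spot instances.
shared = monthly_cost(8, RESERVED_GPU_HOURLY) \
       + monthly_cost(4, RESERVED_GPU_HOURLY * (1 - SPOT_DISCOUNT))

print(f"Separate clusters: ${separate:,.0f}/month")
print(f"Shared + spot:     ${shared:,.0f}/month")
```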
The first practical step doesn't require buying anything new or switching vendors. A workload audit against actual GPU utilization is free and can be run without releasing a single GPU reservation. The core question for each production GPU-backed workload: is the chip it runs on actually matched to what it does? An inference service handling modest throughput at moderate context lengths doesn't need an H200. A training job that runs three times a week doesn't need a dedicated cluster that sits idle the other four and a half days. A batch embedding pipeline that runs overnight can share hardware with a daytime inference service rather than occupying its own reserved node.
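A minimal version of that audit can be scripted in an afternoon. The sketch below samples per-GPU utilization on a single node with nvidia-smi and reports the hourly average; a cluster-wide audit would pull the same signal from DCGM or Prometheus exporters instead, but the question it answers is the same.

```python
# Minimal single-node utilization audit: sample per-GPU utilization with
# nvidia-smi and report the average. Assumes nvidia-smi is on PATH; a
# cluster-wide audit would read the same metric from DCGM/Prometheus.

import subprocess
import time
from statistics import mean

def sample_gpu_utilization() -> list[float]:
    """Return current utilization (0-100) for each GPU on this node."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [float(line) for line in out.strip().splitlines()]

def audit(duration_s: int = 3600, interval_s: int = 60) -> None:
    samples = []
    for _ in range(duration_s // interval_s):
        samples.append(sample_gpu_utilization())
        time.sleep(interval_s)
    for i in range(len(samples[0])):
        avg = mean(s[i] for s in samples)
        print(f"GPU {i}: {avg:.1f}% average utilization over {duration_s / 3600:.1f}h")

if __name__ == "__main__":
    audit()
```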
Automated rightsizing tools that continuously adjust resource requests based on observed usage patterns, rather than setting them once at deployment, have been available on major Kubernetes distributions for years. Most AI teams haven't used them because when compute seemed scarce, having too much felt acceptable. When compute is priced above the output it generates, that calculation changes fast.
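The calculation those tools perform is not exotic. Sketched generically, and with the percentile, headroom, and usage numbers below as illustrative assumptions, it amounts to deriving a request from what the workload actually used over a representative window rather than from the figure someone guessed at deployment time.

```python
# Generic sketch of a rightsizing calculation: recommend a resource request
# from observed usage plus headroom. Percentile, headroom, and the sample
# data are illustrative assumptions, not output from any specific tool.

from statistics import quantiles

def recommend_request(observed_usage: list[float], headroom: float = 0.15) -> float:
    """Recommend a request near the 95th percentile of observed usage, plus headroom.

    `observed_usage` is a series of usage samples (GPU count, cores, or GiB)
    collected over a representative window.
    """
    p95 = quantiles(observed_usage, n=20)[18]  # last of 19 cut points = 95th percentile
    return p95 * (1 + headroom)

# Example: a service requested 8 GPUs "to be safe", but metrics show it
# rarely needs more than about 2.
usage_samples = [1.2, 1.5, 1.1, 2.0, 1.8, 1.4, 1.6, 1.9, 1.3, 1.7,
                 1.5, 1.8, 1.2, 2.1, 1.6, 1.4, 1.9, 1.7, 1.5, 1.8]
print(f"Recommended request: {recommend_request(usage_samples):.1f} GPUs (was 8)")
```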
The broader market is repricing. ByteDance recently announced a 25% increase in its AI infrastructure budget, reaching $30 billion, as it accelerates its own buildout with Chinese chips. That's one response to the compute pressures of 2026: spend more. For most enterprises without hyperscaler balance sheets, the more available response is to use what they already have significantly better. The Cast AI dataset shows 49% is achievable on production H200 hardware right now. Getting there starts with the audit.