KAIST: AI agents burn 136x more energy than chatbots

AIntelligenceHub

KAIST researchers have put the first hard number on the energy cost of AI agents. A 70B-parameter agent uses 136.5x the energy of a chatbot query. At agent scale, projected data center demand reaches 198.9 gigawatts.

A research team at KAIST has put the first hard number on what an AI agent really costs at the data center wall: 136.5 times more energy per query than a conventional generative AI system. The study, led by Professor Minsoo Rhu of the School of Electrical Engineering, frames the result as a shift in where the AI race is being run, from model benchmarks to data center efficiency.

The 136x number and how KAIST measured it

The team built a quantitative framework for AI agent workloads in real production environments, not synthetic benchmarks. They defined an AI agent as a system that plans, calls external tools, and chains multiple LLM invocations, then measured the cost of each step end to end. The headline finding: a 70-billion-parameter LLM running in an agent loop consumes an average of 348.41 watt-hours per query, 136.5 times the energy of a simple question-answering baseline. The paper calls that gap "the cost of dynamic reasoning." KAIST summarized the study in a release on EurekAlert.
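The article gives the agent figure and the ratio but not the baseline itself. A minimal sketch of the division that backs it out, assuming the 136.5x ratio applies directly to the 348.41 watt-hour average; the function and variable names are illustrative, not taken from the paper.

```python
# Figures quoted in the article.
AGENT_WH_PER_QUERY = 348.41      # 70B-parameter LLM running in an agent loop
AGENT_TO_BASELINE_RATIO = 136.5  # agent energy divided by baseline energy


def implied_baseline_wh(agent_wh: float, ratio: float) -> float:
    """Back out the baseline energy per query from the agent figure and the ratio."""
    return agent_wh / ratio


baseline_wh = implied_baseline_wh(AGENT_WH_PER_QUERY, AGENT_TO_BASELINE_RATIO)
print(f"Implied baseline: {baseline_wh:.2f} Wh per query")
```

On these inputs the implied baseline is about 2.55 watt-hours per query. That is an inference from the two published numbers, not a figure the article reports.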

The 136.5x figure is an average, not a worst case. The team also found that response time can grow by up to 153.7 times for the same kind of multi-step agent work, and that GPUs sit idle for as much as 54.5 percent of total execution time while the agent waits for tool responses. The idle-time finding is consistent with what AIntelligenceHub reported in May on enterprise Kubernetes clusters, where 23,000 production deployments averaged just 5 percent GPU utilization. Both studies arrive at the same conclusion from different directions: the more an agent thinks, the more the silicon underneath is underused.

The paper was presented in February at the 32nd IEEE International Symposium on High-Performance Computer Architecture (HPCA), one of the most rigorous venues in computer systems research. Lead author Jiin Kim is a PhD student in Rhu's group, and the title is "The Cost of Dynamic Reasoning: Demystifying AI Agents and Test-Time Scaling from an AI Infrastructure Perspective." Rhu's framing in the press materials is direct: competition in the AI era is shifting from "smarter AI" to "more efficient AI," and the gap is now measurable.

The 136.5x ratio is the first published number of its kind for production agent workloads. Prior estimates of AI energy cost have focused on inference per token, not on the multi-call, tool-using patterns that define the agent category. The KAIST work is the first to instrument the full call chain, including web search, code execution, and the waiting time between steps, and to report the cumulative cost in watt-hours. That methodological choice is what makes the number a benchmark rather than a back-of-envelope estimate.

What 198 gigawatts of agent traffic would mean

The most arresting projection in the paper is not the per-query number. It is the scale at which agents could break the data center buildout currently planned. The team modeled a future scenario in which 13.7 billion agent requests are generated per day, a volume equal to today's Google search traffic, and translated that into aggregate data center power demand. The answer is 198.9 gigawatts, equivalent to roughly half of the average electricity consumption of the United States, and a level far beyond the few-gigawatt campuses now under construction.
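The projection can be checked with a few lines of arithmetic. The sketch below assumes requests are spread evenly over 24 hours and that every request costs the 348.41 watt-hour average. Whether the paper derives its figure exactly this way is an assumption, but these inputs do reproduce the quoted number.

```python
# Figures quoted in the article.
REQUESTS_PER_DAY = 13.7e9      # scenario volume of agent requests per day
AGENT_WH_PER_QUERY = 348.41    # average energy per agent query
HOURS_PER_DAY = 24


def average_power_gw(requests_per_day: float, wh_per_request: float) -> float:
    """Average continuous power in gigawatts, assuming an even load over the day."""
    daily_energy_wh = requests_per_day * wh_per_request
    average_power_w = daily_energy_wh / HOURS_PER_DAY  # Wh per hour is W
    return average_power_w / 1e9


demand_gw = average_power_gw(REQUESTS_PER_DAY, AGENT_WH_PER_QUERY)
print(f"Projected average demand: {demand_gw:.1f} GW")
```

Because the result is linear in both inputs, halving either the request volume or the energy per query halves the projected demand.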

For context, the largest single AI data center campus planned as of mid-2026 is in the low single-digit gigawatt range. Announced projects from Microsoft, Meta, Oracle, and AWS each sit between one and five gigawatts. The KAIST projection implies that if the agent traffic forecast is right, even a campus at the top of that range would meet less than three percent of the eventual power draw. The gap is not a rounding error. It is a planning problem the industry has not yet treated as one.
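To put the gap in numbers, here is a small sketch that compares a single campus at either end of the one-to-five gigawatt range with the 198.9 gigawatt scenario. The helper names are illustrative, and the calculation treats nameplate campus power as fully available to agent traffic, which is a simplifying assumption.

```python
import math

PROJECTED_DEMAND_GW = 198.9    # scenario figure quoted in the article
CAMPUS_RANGE_GW = (1.0, 5.0)   # range quoted for announced campuses


def share_of_demand(capacity_gw: float, demand_gw: float) -> float:
    """Fraction of the projected demand that one campus would cover."""
    return capacity_gw / demand_gw


def campuses_needed(campus_gw: float, demand_gw: float) -> int:
    """Whole campuses of a given size needed to cover the projected demand."""
    return math.ceil(demand_gw / campus_gw)


for campus_gw in CAMPUS_RANGE_GW:
    share = share_of_demand(campus_gw, PROJECTED_DEMAND_GW)
    count = campuses_needed(campus_gw, PROJECTED_DEMAND_GW)
    print(f"{campus_gw:.0f} GW campus: {share:.1%} of demand, {count} campuses needed")
```

A five-gigawatt campus covers about 2.5 percent of the scenario, so roughly forty of them would be needed.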

The paper's interpretation is that this gap forces a co-design discipline, with AI models, AI semiconductors, data centers, and power infrastructure tuned together rather than in sequence. The current buildout assumes that model performance scales with parameter count and that inference demand scales with chat traffic. The agent category breaks both assumptions. A model that takes fifteen tool calls to complete a task consumes fifteen times the inference budget of a model that takes one, and the silicon underneath is not fifteen times faster at running that workload. The bottleneck shifts from model quality to data center throughput, and the throughput math stops working under agent traffic.

Rhu's group is not the only team reaching this conclusion. The AIntelligenceHub inference cost resource page tracks the same trend from the buyer side: per-token pricing is dropping, but per-task pricing is rising, because the average task now consumes more tokens. The KAIST study is the academic confirmation of what procurement teams have been noticing in their monthly bills for the last two quarters.

The co-design implication for AI infrastructure

Rhu's prescription is co-design, and the paper's last section spells out what that means in practice. It is not enough to make a better model. The model, the inference accelerator, the data center cooling system, and the power grid contract have to be tuned to each other. A 70-billion-parameter model that runs at 50 percent utilization is twice as expensive per query as the same model at full utilization, and KAIST's finding that GPUs sit idle for up to 54.5 percent of execution time describes how agent workloads run today, not an outlier.
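The "twice as expensive" claim follows from treating per-query cost as inversely proportional to utilization. A minimal sketch under that assumption, which ignores idle-power savings and any fixed per-query overhead; the function name is illustrative.

```python
def cost_multiplier(utilization: float) -> float:
    """Per-query cost relative to a fully utilized GPU, assuming cost scales as 1/utilization."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    return 1.0 / utilization


# Upper end of the GPU idle share reported by KAIST for agent workloads.
KAIST_MAX_IDLE_FRACTION = 0.545

half_busy = cost_multiplier(0.50)
kaist_worst_case = cost_multiplier(1.0 - KAIST_MAX_IDLE_FRACTION)

print(f"50.0% utilization -> {half_busy:.2f}x the full-utilization cost")
print(f"45.5% utilization -> {kaist_worst_case:.2f}x the full-utilization cost")
```

At the 54.5 percent idle share, the same model costs about 2.2 times what it would on a fully busy GPU.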

The practical lever is the agent runtime itself. Tool-use latency is the largest contributor to GPU idle time, because the GPU sits waiting for the tool to return. A runtime that batches multiple agent invocations across a single tool call, or that pre-fetches likely next calls, can pull utilization back into the 70-80 percent range. Measured against the 45.5 percent busy time implied by KAIST's worst case, that would cut per-query cost by roughly 35 to 43 percent. The paper releases the agent implementations and benchmarks used in the study as open source, so other research groups can replicate the result and try optimization strategies. The repository is on KAIST's GitHub under the HPCA 2026 artifact track.
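How much tool-wait time would a runtime have to hide to reach the 70-80 percent range? The toy model below starts from the 54.5 percent idle share and assumes GPU work per query stays fixed while batching or pre-fetching removes a fraction of the waiting. It is an illustration of the arithmetic, not the paper's method, and both function names are hypothetical.

```python
def utilization_after_overlap(idle_fraction: float, hidden_fraction: float) -> float:
    """GPU utilization once a share of the original idle time has been hidden."""
    busy = 1.0 - idle_fraction
    remaining_idle = idle_fraction * (1.0 - hidden_fraction)
    return busy / (busy + remaining_idle)


def hidden_fraction_needed(idle_fraction: float, target_utilization: float) -> float:
    """Share of the original idle time that must be hidden to hit a target utilization."""
    busy = 1.0 - idle_fraction
    remaining_idle = busy / target_utilization - busy
    return 1.0 - remaining_idle / idle_fraction


KAIST_MAX_IDLE_FRACTION = 0.545  # upper end of idle share quoted in the article

for target in (0.70, 0.80):
    hidden = hidden_fraction_needed(KAIST_MAX_IDLE_FRACTION, target)
    print(f"Target {target:.0%} utilization: hide about {hidden:.0%} of tool-wait time")
```

Under these assumptions a runtime has to hide roughly two thirds to four fifths of the waiting, which is why the scheduling layer carries so much of the efficiency burden.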

The market implication is that inference providers, not model labs, will capture the next wave of efficiency gains. A lab that ships a 5 percent better model does not move the per-query cost by 5 percent. A runtime that delivers 30 percent better utilization moves it directly, because per-query cost scales inversely with utilization at every step of the query. Rhu's framing is that this inverts the AI race: the next round of competition is not about who trains the smartest model, but about who runs the most efficient inference stack. The 198.9 gigawatt projection shows what getting that wrong could cost.

The study has limits worth flagging. The 348 watt-hour per query figure is for a 70-billion-parameter LLM in a specific agent configuration; smaller and larger models will land at different points on the curve. The 198.9 gigawatt projection is a scenario, not a forecast, and assumes 13.7 billion agent requests per day with no efficiency improvements over time. The research team is explicit about both caveats. The 136.5x ratio, however, holds across the configurations they tested, and that is the number the industry is likely to quote through the rest of 2026.
