A New Benchmark Says Top AI Agents Still Lose Money Over a Full Season
General Reasoning’s new KellyBench puts AI agents through a long football betting season instead of a short task list. Every frontier model tested lost money, which makes it a sharp reality check for long-running agents.
AI agents sound sharp when they get ten minutes, a clean prompt, and one narrow goal. The harder question is what happens when you give them months of noisy data, changing conditions, and a scoreboard that never stops moving. General Reasoning’s new KellyBench is interesting because it tests that harder question, and the early results are rough for everyone.
In the KellyBench release, General Reasoning says it built a long-horizon evaluation around the 2023-24 English Premier League season. Agents are asked to develop a quantitative betting strategy, size bets, manage risk, and adapt across the season using historical data, lineups, results, and public odds. The headline result is blunt. Every frontier model tested lost money on average.
That could sound like a gambling curiosity if you read it too quickly. It is not. The point of the benchmark is not that leading models should become sports bettors. The point is that many real agent deployments look more like this than like a neat benchmark suite. The environment changes. The model has to decide when not to act. Risk compounds across time. A good-looking analysis is worthless if the system cannot keep a coherent strategy once the world starts moving.
That is exactly why this release matters to the broader agent conversation. A lot of agent marketing still leans on short cycles. The model researches something, writes code, files a task, or calls a tool, and a human sees the result quickly. KellyBench stretches the time horizon. It asks whether the agent can keep learning, preserve capital, and act consistently across many steps. That is much closer to how serious business agents will eventually be judged.
The numbers are telling. General Reasoning says Claude Opus 4.6 was the best-performing model it tested, but still finished with an average return of negative 11 percent across three seeds. GPT-5.4 came next at negative 13.6 percent. Those two were also the only models to avoid ruin across all three seeds. That is a respectable relative result, but it is not success. It means the leading systems were better at surviving than their peers, not that they had solved the task.
There is also a cost dimension that makes the benchmark harder to dismiss. General Reasoning says GPT-5.4 at its xhigh reasoning setting averaged about $1,571 per episode across the three seeds, while Claude Opus 4.6 averaged about $969. In other words, long-horizon evaluation is expensive, and even the expensive models still struggled. That makes the findings more useful, not less. Cheap toy tests are easy to shrug off. Costly tests that still expose weak behavior deserve attention.
The release lines up with a problem we have already seen in other agent discussions. In our earlier look at why managed agents become harder to run over longer horizons, the real difficulty was not producing a smart-looking answer. It was keeping the system coherent, safe, and effective as tasks stretched out. KellyBench turns that intuition into a measurable environment.
This benchmark cares about staying coherent, not sounding clever
One useful part of KellyBench is the task shape itself. The agents are not only predicting match outcomes. They have to build and revise a strategy. That means they need to model uncertainty, decide when an edge is real, choose how much to bet, and avoid blowing up when the evidence gets weak. Those are sequential decisions with feedback loops, which is exactly where many agent systems still look more fragile than their demos suggest.
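The name is a nod to the Kelly criterion, the classic formula for exactly that sizing problem. As a rough illustration of what disciplined sizing involves, here is a generic Python sketch, not General Reasoning's harness or any tested model's actual strategy: the stake on a single bet follows from the model's estimated win probability and the offered odds, and the correct stake is zero when there is no edge.

```python
def kelly_fraction(p_win: float, decimal_odds: float) -> float:
    """Full-Kelly fraction of bankroll for one bet.

    p_win is the model's estimated win probability; decimal_odds is the
    bookmaker's decimal price (total payout per unit staked). Returns 0.0
    when there is no positive edge, i.e. the correct action is not to bet.
    """
    b = decimal_odds - 1.0                 # net profit per unit staked on a win
    edge = p_win * b - (1.0 - p_win)       # expected profit per unit staked
    if b <= 0 or edge <= 0:
        return 0.0
    return edge / b


def stake(bankroll: float, p_win: float, decimal_odds: float,
          kelly_multiplier: float = 0.25) -> float:
    """Fractional-Kelly stake: scale full Kelly down to cushion the damage
    from overconfident probability estimates."""
    return bankroll * kelly_multiplier * kelly_fraction(p_win, decimal_odds)


# Example: the model rates a home win at 55% against decimal odds of 2.10.
# Quarter Kelly on a 1,000-unit bankroll stakes about 35 units.
print(round(stake(1000.0, 0.55, 2.10), 2))
```

The hard part in KellyBench sits around a formula like this rather than inside it: deciding when the estimated probability is trustworthy enough to bet at all, and how far to scale back when it is not.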
General Reasoning says many models failed to act on their own analysis or failed to adapt as the world changed. That sentence is the heart of the story. A model can generate a reasonable explanation for what should happen next and still fail operationally once it needs to follow through over time. Anyone building agents for procurement, trading, planning, customer operations, or incident response should recognize that pattern immediately.
The benchmark also includes a sophistication rubric, and the site says no model scores higher than about a third of the available points. That matters because it shows the problem is not only final return. The underlying strategies are still weak by human standards. Models are not just unlucky here. They are operating with limited strategic depth in a setting that punishes shallow decision-making.
That helps explain the ruin numbers. The two strongest models, Opus 4.6 and GPT-5.4, were the only ones to avoid ruin across all seeds because they preserved capital and adjusted their strategies rather than betting more chaotically. That is a useful signal for the rest of the market. A good agent is not only one that finds opportunities. It is one that knows when to stay out of trouble.
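A toy simulation makes the point concrete. The numbers below are illustrative assumptions, not figures from the benchmark: an agent believes each bet wins 55 percent of the time at decimal odds of 1.95, the true win rate is only 50 percent, and it bets every one of the 380 fixtures in a Premier League season. Sized at full Kelly, the overestimated edge typically burns through most of the bankroll; sized at a quarter of Kelly, the same agent still ends the season down, but far from ruin.

```python
import random
import statistics


def simulate_season(believed_p: float = 0.55, true_p: float = 0.50,
                    decimal_odds: float = 1.95, kelly_multiplier: float = 1.0,
                    n_bets: int = 380, seed: int = 0) -> float:
    """Toy bankroll simulation over one season of 380 fixtures.

    The agent sizes every bet from the edge it believes it has (believed_p),
    but outcomes are drawn from a lower true probability (true_p), so each
    bet is actually slightly negative expected value. Returns the final
    bankroll as a fraction of the starting bankroll.
    """
    rng = random.Random(seed)
    b = decimal_odds - 1.0
    believed_edge = believed_p * b - (1.0 - believed_p)
    full_kelly = max(believed_edge / b, 0.0)
    bankroll = 1.0
    for _ in range(n_bets):
        wager = bankroll * kelly_multiplier * full_kelly
        bankroll += wager * b if rng.random() < true_p else -wager
    return bankroll


# Median over 20 random seeds: full Kelly on an overestimated edge typically
# loses most of the bankroll, while quarter Kelly loses money but keeps
# most of its capital intact.
for mult in (1.0, 0.25):
    med = statistics.median(simulate_season(kelly_multiplier=mult, seed=s)
                            for s in range(20))
    print(f"kelly_multiplier={mult}: median final bankroll ~ {med:.2f} of start")
```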
The benchmark is also a reminder that long-horizon evaluation should look more like a world than a worksheet. Fixed task sets still matter, but they can hide the gap between “can solve the step” and “can manage the process.” KellyBench is pushing on that gap directly.
Why losing money is the useful part of the result
A lot of benchmark news gets framed as a leaderboard race. Who is first, who moved up, who set a record. KellyBench is more useful if you read it the opposite way. The important part is not that one model beat another. It is that the whole field still falls short in a scenario that captures several traits of real autonomy.
That is healthy for the market because it cuts against inflated expectations. Companies are increasingly being told that agents can take on longer, less supervised workflows. Some can. But a benchmark like this shows that reliability over time is still an unsolved problem, even when the models look impressive on shorter tasks. That should change how buyers scope pilot programs. It should also change how vendors describe what is ready now versus what is still emerging.
There is a safety angle here too. Losing money in a benchmark is cheap compared with failing inside a real business system. A long-horizon environment that exposes bad adaptation, weak risk control, and inconsistent follow-through is doing the market a favor before those same flaws show up in harder settings. That is especially true for any use case where the agent can affect budgets, inventory, approvals, routing, or externally visible decisions.
KellyBench also points toward a stronger standard for future evaluations. If the next generation of agents is supposed to run for hours, days, or weeks with partial supervision, then the tests should measure how well they behave across those same horizons. One-shot tasks are still informative, but they are no longer enough on their own.
So the right takeaway is not that frontier models are bad. It is that the market still has a large gap between analytical fluency and durable decision-making. General Reasoning has built a benchmark that makes that gap hard to ignore. For a field that often rushes to celebrate what agents can do in a short clip, that kind of friction is useful. It forces everyone back to the question that matters most: can the system keep acting sensibly after the demo ends?