Enterprise operations center with AI agent monitoring dashboards showing warning indicators and paused status indicators

Why 74% of Companies Are Pulling Their AI Agents After Deploying Them

AIntelligenceHub
9 min read

Sinch surveyed 2,527 decision makers across 10 countries and found that 74% of enterprises have already rolled back deployed AI agents. The cause isn't model quality: it's the infrastructure layer most deployment plans skip.

74% of companies that deployed AI customer service agents have already rolled them back. Not paused, not modestly trimmed in scope, but shut down or significantly reversed after going live in customer operations. That figure comes from Sinch's AI Production Paradox report, published in May 2026 after surveying 2,527 senior decision makers across 10 countries and six industries.

The companies in this study aren't testing chatbots in a lab. 68% employ between 1,000 and 4,999 people, and 32% employ 5,000 or more. These are organizations with dedicated AI engineering teams, executive buy-in, and the budget to build seriously. They put real AI agents into real customer-facing workflows. Then they pulled them.

The stranger number is 81%. That's the rollback rate among organizations with fully mature governance frameworks. The companies that built the most thorough safety policies, the clearest accountability chains, and the most rigorous audit processes rolled back their AI agents more often than the companies that approached it casually. Higher discipline correlated with more failures detected, not fewer.

Sinch calls this the AI Production Paradox. It's worth understanding precisely because 98% of respondents in the same survey say they're increasing AI investment in 2026. Commitment is going up. Production reliability isn't keeping pace. The diagnosis Sinch offers points to a specific infrastructure gap that most deployment roadmaps skip entirely, and understanding it changes how you should plan an AI agent program from the start.

What the Rollback Numbers Actually Show

Start with what "rolled back" means in this context. Sinch defines it as shutting down or significantly reversing a deployed, customer-facing AI agent after it went live. Not a pilot that never graduated. Not a staging test that exposed problems before launch. A production system that reached real customers, produced outcomes that couldn't be tolerated, and got taken offline.

62% of the respondents already have AI agents running in production. That makes agent deployment mainstream in enterprise customer operations, not experimental. The rollback data applies to that production-grade category. These aren't teams that tried once and gave up. Many are on their second or third deployment attempt, refining scope and rebuilding infrastructure each time.

The geographic spread removes regional excuses. The survey covers the US, UK, Australia, Brazil, Germany, France, India, Singapore, Mexico, and Canada. Rollback rates stayed consistent across all 10 markets and across all six industries: financial services, healthcare, telecommunications, technology, retail, and professional services. The failure pattern is universal, not a quirk of one sector or one regulatory environment.

A Gartner study from June 2025 adds useful context. It found that half of enterprises planning to use AI-driven customer service to reduce headcount would abandon those plans by 2027, citing unexpected costs and unintended consequences. Sinch's May 2026 data is consistent with that trajectory. Companies launched. Problems surfaced. Plans changed. Sinch's research identifies specifically why.

The 81% rollback rate among mature governance organizations sounds paradoxical until you think about what governance actually does. It doesn't prevent problems. It detects them faster and creates a clear path to respond. That's not the same thing.

Daniel Morris, CPO at Sinch, stated it directly: "The most advanced organizations aren't failing less; they're seeing failures sooner." A team with structured monitoring, defined accountability, and explicit rollback criteria will catch a bad outcome quickly and shut the system down before it causes more harm. A team without those structures may have just as many bad outcomes, but they take longer to find them and often longer to fix once found.

This reframes what the rollback data actually measures. A high rollback rate in a well-governed organization is evidence that governance is doing its job, not that AI agents are fundamentally broken. The failures are real, but the response is controlled and fast. The more interesting question is what's producing failures that even well-structured teams can't resolve without shutting down.

84% of AI engineering teams in the Sinch study spend at least half their time on safety infrastructure rather than capability development. That ratio is a signal about system stability. If an engineering team is spending most of its cycles keeping the current agent safe rather than improving it, the underlying platform isn't stable enough to support rapid iteration. It's locked in a maintenance posture rather than a development posture.

75% of respondents say they prioritize trust, security, and compliance above AI development itself. In practice, that often means every new capability requires an extensive safety review cycle before release, which slows learning and keeps teams stuck in review mode rather than shipping mode. Good governance structure isn't the cause of that bottleneck. Insufficient infrastructure is.

Morris coined the phrase "guardrail tax" to describe what happens when safety infrastructure becomes the dominant cost of running an AI agent program. When the underlying platform isn't stable, every edge case requires a manual response. Every new failure mode requires a policy update. Every incident triggers an investigation cycle. The engineering team spends most of its time maintaining current safety rather than improving the system. The guardrail tax rises when infrastructure reliability falls.

The AI Agent Infrastructure Gap No One Plans For

The core finding in Sinch's research is that communications infrastructure satisfaction is the strongest predictor of successful AI agent deployment. Stronger than investment levels. Stronger than governance maturity. When teams have reliable, capable infrastructure handling their agents' underlying communications stack, they succeed significantly more often. When they don't, even strong models and strong governance processes break down in production.

55% of respondents are building custom infrastructure for cross-channel context management. More than half of the enterprises surveyed have concluded that off-the-shelf tooling for managing conversation context across email, chat, voice, and SMS doesn't meet production requirements. So they're building their own. That's not a small investment. It signals how wide the gap is between what generic communications infrastructure provides and what enterprise AI agents actually need.

Cross-channel context is the ability to preserve a customer's interaction history across different communication channels within a single service engagement. A customer who started explaining a billing issue by email shouldn't have to repeat themselves when they switch to a chat session an hour later. An agent that loses context between channels behaves as though it's starting fresh every time, which means inconsistent responses and frustrated customers.
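
What that persistence requires is easiest to see as a data structure. Here is a minimal sketch in Python; it assumes nothing about Sinch's or any vendor's actual implementation, and the class and field names are illustrative:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ChannelTurn:
    """One customer or agent message on a specific channel."""
    channel: str      # "email", "chat", "voice", "sms"
    role: str         # "customer" or "agent"
    text: str
    timestamp: datetime


@dataclass
class ConversationContext:
    """State that must survive channel switches within one service engagement."""
    customer_id: str
    engagement_id: str
    turns: list[ChannelTurn] = field(default_factory=list)

    def add_turn(self, channel: str, role: str, text: str) -> None:
        self.turns.append(
            ChannelTurn(channel, role, text, datetime.now(timezone.utc)))

    def history_for_prompt(self, max_turns: int = 20) -> str:
        """Flatten recent turns, across channels, for the model's prompt."""
        recent = self.turns[-max_turns:]
        return "\n".join(f"[{t.channel}] {t.role}: {t.text}" for t in recent)


class ConversationStore:
    """Keyed by engagement, not by per-channel session, so a chat session an
    hour after an email thread sees the same history."""

    def __init__(self) -> None:
        self._by_engagement: dict[str, ConversationContext] = {}

    def get_or_create(self, customer_id: str, engagement_id: str) -> ConversationContext:
        return self._by_engagement.setdefault(
            engagement_id, ConversationContext(customer_id, engagement_id))
```

The design choice that matters is the key: context lives at the engagement level, so the channel a message arrives on is an attribute of a turn, not the boundary of the agent's memory.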

87% of respondents rate high-performance communications infrastructure as essential or very important for AI agent success. 86% are evaluating or actively considering new communications providers to support their AI programs. These aren't incremental adjustments to existing setups. They represent large enterprises reconsidering the foundational layer their AI agents run on, because the current layer isn't holding up under production conditions.

The production gap follows a consistent pattern. A team builds an AI agent that handles a defined customer workflow. In staging, with clean and formatted test inputs, it works well. In production, real customers write incomplete sentences, mix languages, reference previous conversations that happened on a different channel weeks ago, and describe issues in ways that don't match any documented scenario. The infrastructure layer is supposed to normalize that signal before the model sees it. When it can't do that reliably, model outputs become unpredictable. The governance team catches it and shuts the system down.

Concrete failure patterns make this tangible. Context loss between channels is the most common: an agent deployed on a single chat channel can perform well within that channel, but extend the same agent across chat, email, and voice without infrastructure that passes session context reliably, and it behaves inconsistently in ways that look like model failures but trace back to infrastructure. The model is working as designed. It just doesn't know what happened before.

Authentication failures account for a disproportionate share of production breakdowns. In one 2026 analysis of 847 AI agent implementations, 62% of critical failures involved authentication issues. When session tokens expire without clean handling, when verification services return ambiguous states, or when the agent encounters an account configuration it wasn't trained for, the downstream behavior degrades quickly. These are infrastructure integration failures that hit compliance boundaries, not model errors.

Cascading failures are the most expensive category. When an agent misclassifies a customer issue, takes an autonomous action based on that misclassification, and triggers a downstream workflow, the error propagates before anyone notices. An invoice incorrectly categorized might trigger a payment process, update a CRM record, and send an automated confirmation email. By the time a human review catches the original error, three other systems have already acted on wrong data.
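
A common mitigation is to gate autonomous actions that write to other systems behind a confidence threshold and a human review queue. The sketch below is a rough illustration of that gate; the action names, thresholds, and handler signatures are made up for the example, not drawn from any real workflow API:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProposedAction:
    name: str                      # e.g. "trigger_payment", "update_crm" (illustrative)
    payload: dict
    classifier_confidence: float   # how sure the agent is about the issue type
    has_side_effects: bool         # does it write to payments, CRM, email, ...?


def execute_with_gate(action: ProposedAction,
                      handlers: dict[str, Callable[[dict], None]],
                      queue_for_human: Callable[[ProposedAction], None],
                      confidence_floor: float = 0.9) -> None:
    """Run only high-confidence or side-effect-free actions automatically.

    The point is that a misclassified invoice stops here instead of
    propagating into a payment run, a CRM update, and a confirmation email.
    """
    if action.has_side_effects and action.classifier_confidence < confidence_floor:
        queue_for_human(action)   # a person reviews it before any system is touched
        return
    handlers[action.name](action.payload)
```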

Klarna's experience illustrates the challenge at scale. The company replaced 700 customer service agents with AI and initially reported efficiency gains. It then partially reversed course, rehiring human agents for cases that AI consistently handled poorly. The cases AI failed on were the ones requiring complex multi-step resolution, unusual account states, and context from previous interactions across channels. Exactly the cases where infrastructure limitations, not model limitations, show up first.

How Teams Can Break the Production Paradox

Infrastructure investment before governance investment is the counterintuitive prescription the data supports. Enterprise AI programs tend to fund governance first and infrastructure second. The Sinch findings suggest that order should be reversed, or at minimum run in parallel rather than sequentially.

The AI Rollout Checklist for Mid-Sized Companies reflects similar logic from the practical deployment side: the most common failure in AI rollouts isn't model selection, it's sequence. Teams that establish a reliable operational foundation before expanding scope encounter fewer production surprises and learn faster.

Cross-channel context persistence must be tested with real edge cases before launch. That means multi-channel conversations, customers referencing historical interactions from weeks prior, and issues that require coordination across multiple internal systems. If context breaks in pre-launch testing under those conditions, it will break in production under the same conditions.
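
At its simplest, that check is a test that simulates the channel switch and asserts the agent still sees the earlier thread. A sketch, assuming the illustrative ConversationStore from earlier is saved as context_store.py:

```python
# test_cross_channel_context.py
from context_store import ConversationStore   # the sketch above, saved as a module


def test_context_survives_channel_switch():
    store = ConversationStore()

    # Customer starts on email...
    ctx = store.get_or_create("cust-42", "engagement-7")
    ctx.add_turn("email", "customer", "My March invoice was billed twice.")

    # ...and returns an hour later on chat, within the same engagement.
    ctx_later = store.get_or_create("cust-42", "engagement-7")
    history = ctx_later.history_for_prompt()

    assert ctx_later is ctx              # same engagement, same context object
    assert "billed twice" in history     # the email turn is visible to the agent
```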

Authentication and identity state handling need explicit coverage for every edge case the agent might encounter. The agent's behavior should be fully defined for expired sessions, failed verification, high-security transaction requirements, and accounts in unusual states. Undefined authentication states are where many of the most surprising agent behaviors originate in production.
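
In code, "fully defined" usually means an exhaustive mapping from authentication state to agent behavior, with no silent fall-through. The state and behavior names in this sketch are assumptions for illustration, not drawn from the report:

```python
from enum import Enum, auto


class AuthState(Enum):
    VERIFIED = auto()
    SESSION_EXPIRED = auto()
    VERIFICATION_FAILED = auto()
    HIGH_SECURITY_REQUIRED = auto()   # e.g. transactions above a value threshold
    ACCOUNT_ANOMALY = auto()          # locked, merged, or otherwise unusual accounts


def agent_behavior(state: AuthState) -> str:
    """Every authentication state maps to one explicit behavior."""
    behaviors = {
        AuthState.VERIFIED: "continue",                      # agent may proceed
        AuthState.SESSION_EXPIRED: "reauthenticate",         # re-verify before answering
        AuthState.VERIFICATION_FAILED: "escalate_to_human",  # never guess identity
        AuthState.HIGH_SECURITY_REQUIRED: "step_up_auth",    # require a stronger factor
        AuthState.ACCOUNT_ANOMALY: "escalate_to_human",
    }
    try:
        return behaviors[state]
    except KeyError:
        # Fail loudly: an undefined state is exactly where surprising behavior starts.
        raise ValueError(f"unhandled auth state: {state}") from None
```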

Rollback procedures should be defined, documented, and tested before launch, not drafted after the first crisis. Rollback readiness means defined criteria for when to invoke it, a team with clear authority to invoke it, customer communication protocols for when it activates, and a tested path back to human-handled service. Given that 74% of enterprises that deployed AI agents have already rolled at least one back, treating rollback as an edge case rather than a planned event reflects an unrealistic view of production reality.
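
One way to make that concrete is to express rollback criteria as data that is evaluated continuously, rather than as prose in a runbook. A rough sketch follows; the thresholds and metric names are invented for illustration:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RollbackCriteria:
    max_context_loss_rate: float = 0.02   # share of sessions losing cross-channel history
    max_auth_failure_rate: float = 0.05   # share of sessions hitting auth errors
    max_escalation_backlog: int = 200     # conversations waiting for a human


def should_roll_back(metrics: dict[str, float], criteria: RollbackCriteria) -> bool:
    """True when any agreed threshold is breached; a named owner makes the call."""
    return (
        metrics.get("context_loss_rate", 0.0) > criteria.max_context_loss_rate
        or metrics.get("auth_failure_rate", 0.0) > criteria.max_auth_failure_rate
        or metrics.get("escalation_backlog", 0) > criteria.max_escalation_backlog
    )


def roll_back(disable_agent: Callable[[], None],
              route_to_humans: Callable[[], None],
              notify_customers: Callable[[], None]) -> None:
    """The tested path back to human-handled service, in order."""
    disable_agent()       # stop new conversations from entering the agent
    route_to_humans()     # drain in-flight conversations to the human queue
    notify_customers()    # pre-agreed wording, not drafted mid-incident
```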

Infrastructure-level monitoring needs to be in place from day one, not just model output monitoring. Log context propagation success rates, authentication failure rates, channel routing accuracy, and downstream workflow trigger anomalies alongside standard model quality metrics. Many production failures start at the infrastructure layer and only become visible in model metrics after they've been running for some time.
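
In practice, that means emitting infrastructure signals next to model-quality signals on every turn. The sketch below assumes a generic per-turn record from the serving layer; the field and metric names are illustrative:

```python
import logging

logger = logging.getLogger("agent.infra")


def _emit(name: str, value) -> None:
    logger.info("%s=%s", name, value)


def record_turn_metrics(turn: dict) -> None:
    """Log infrastructure health alongside the model metrics most teams already track.

    `turn` is a per-turn record from the serving layer; the keys are illustrative.
    """
    # Infrastructure layer: where many production failures actually start.
    _emit("context_propagated", turn.get("context_propagated", False))
    _emit("auth_state", turn.get("auth_state", "unknown"))
    _emit("channel_route_correct",
          turn.get("routed_channel") == turn.get("expected_channel"))
    _emit("downstream_workflows_triggered", len(turn.get("triggered_workflows", [])))

    # Model layer: these alone surface infrastructure problems only after the fact.
    _emit("classifier_confidence", turn.get("classifier_confidence", 0.0))
    _emit("response_latency_ms", turn.get("latency_ms", 0))
```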

Scope discipline also matters more than it initially seems. Our earlier coverage of Amazon's Alexa for Shopping agent describes a deployment model that deliberately narrows scope: an agent that can complete purchases within defined parameters rather than handling arbitrary service flows. That constraint trades breadth for reliability. More enterprise AI programs should make that tradeoff explicitly rather than defaulting to broad capability goals that exceed what the infrastructure can currently support.

Morris described the underlying principle: "The operational cost of running AI safely at scale is much larger than most organizations expect." That cost extends well beyond model hosting and API fees. It includes the engineering work to build and maintain context routing, failover logic, multi-channel orchestration, and audit logging. These are not implementation details. They are load-bearing infrastructure, and most initial deployment plans treat them as afterthoughts.

The final point is about the cycle most enterprise AI programs are already in. 62% of enterprises have agents in production. 74% have experienced at least one rollback. Many of the companies currently running production agents have already been through at least one cycle of deployment, failure, rollback, and rebuild. The teams that come through that cycle in the best shape treat the rebuild as a signal about what the infrastructure actually needs, not as a setback to recover from. That discipline is what turns the AI Production Paradox from a recurring failure into a solvable engineering problem.
