[Illustration: a glowing digital document fragmenting into data particles against a dark blue background]

AI Agents Corrupt Your Documents During Long Tasks, Microsoft Researchers Find

AIntelligenceHub
7 min read

Microsoft tested 19 AI models on complex document editing across 52 professional fields. Frontier models corrupted 25 percent of content during long sessions. Adding agentic tools made outcomes worse, not better.

Three Microsoft researchers ran a study that anyone shipping AI agents into real knowledge work should read carefully. They built a benchmark called DELEGATE-52, tested 19 large language models on complex document editing across 52 professional fields, and found that even the best models available today degrade the documents they're supposed to help with. Silently. Over time. In ways that compound with each interaction.

The paper is titled "LLMs Corrupt Your Documents When You Delegate." It was published on April 17 by Philippe Laban, Tobias Schnabel, and Jennifer Neville at Microsoft Research. The benchmark name comes from the 52 professional domains it covers: Python programming, crystallography data files, music notation, accounting ledgers, genealogy records, and dozens of others. Each domain uses real documents, typically around 15,000 tokens of professional content. It's not a toy dataset or a simplified test environment. It's the kind of material knowledge workers actually deal with.

Frontier models corrupted 25 percent of documents in extended sessions

The researchers wanted a direct answer to whether current AI models can reliably handle extended, multi-step document editing. The entire premise of delegating knowledge work to an AI agent rests on the assumption that a model can work through a task over many interactions without degrading what it started with.

DELEGATE-52 tests exactly that. Each of the 52 domains comes with between 5 and 10 complex editing tasks that a professional might hand off to an AI. An agent works through those tasks across up to 20 interactions. The benchmark then measures the document state using a metric called the Reconstruction Score, comparing the final document against what it should look like after successful edits. The comparison uses domain-specific similarity functions, so a crystallography file is evaluated on crystallography criteria, a Python codebase on code-preservation criteria, and a music score on its own structural logic.
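
The paper's exact scoring functions aren't reproduced in the article, but the shape of the metric is easy to sketch. Here is a minimal illustration in Python; the names (`python_similarity`, `SIMILARITY_BY_DOMAIN`, `reconstruction_score`) and the line-overlap comparator are ours, not the benchmark's, which ships its own per-domain similarity functions.

```python
from typing import Callable

def python_similarity(final: str, expected: str) -> float:
    """Toy stand-in: fraction of the expected document's unique lines
    that survive in the final document."""
    expected_lines = set(expected.splitlines())
    final_lines = set(final.splitlines())
    if not expected_lines:
        return 1.0
    return len(expected_lines & final_lines) / len(expected_lines)

# Hypothetical registry; the real benchmark uses domain-specific
# comparators (crystallography criteria, music-score structure, etc.).
SIMILARITY_BY_DOMAIN: dict[str, Callable[[str, str], float]] = {
    "python": python_similarity,
}

def reconstruction_score(domain: str, final_doc: str, expected_doc: str) -> float:
    """Score the document after all edits against the ground-truth target."""
    return SIMILARITY_BY_DOMAIN[domain](final_doc, expected_doc)

CATASTROPHIC = 0.80  # below this, the paper calls corruption catastrophic
READY = 0.98         # at or above this, a domain counts as ready
```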

The 19 LLMs tested came from six different model families: OpenAI, Anthropic, Google, Mistral, xAI, and Moonshot. That's a meaningful cross-section of what enterprise teams can actually deploy right now, not a cherry-picked set of older models.

The headline number from the study is 25 percent. Frontier models, specifically Gemini 3.1 Pro, Claude 4.6 Opus, and GPT 5.4, corrupted an average of 25 percent of document content by the end of a 20-interaction session. These are the flagship releases from three of the most competitive AI labs in the world, and they lost roughly a quarter of document content under normal extended-use conditions.

For the broader set of 19 models, average degradation was 50 percent: half of document content, on average, was corrupted across all model-and-domain combinations. The researchers characterize the errors as sparse but severe. They don't show up as obvious failures. An agent doesn't tell you the task went wrong. Content quietly disappears, gets misplaced, or gets silently overwritten. By the end of a long session, the document may look like it mostly survived, but 25 to 50 percent of its content could be missing or wrong.

More than 80 percent of all model-and-domain combinations showed what the researchers call catastrophic corruption, defined as a Reconstruction Score below 80 percent. That's not an edge case. It's the majority of test conditions. Python programming was the only domain that met the readiness threshold of 98 percent or above. Gemini 3.1 Pro came out ahead overall, reaching the readiness threshold for 11 of the 52 tested domains. The best-performing model, in other words, can reliably handle roughly one in five professional domains.

A common assumption in enterprise AI deployments is that giving agents access to tools (read and write operations, file access, API calls) will improve reliability on long tasks. The reasoning makes intuitive sense: tools give the model more ways to verify its own work. DELEGATE-52 tested this directly, and the result was the opposite. Agents with tool access performed worse than the same models without it, adding an average of 6 percent extra degradation. The damage from tool use was additive, not corrective. More scaffolding around a model doesn't automatically fix core reliability problems on long-horizon tasks.

Degradation severity also worsened with larger documents, longer interactions, and the presence of distractor files alongside the target document. All three conditions are standard in real professional workflows, and the researchers designed the benchmark around exactly these conditions because that's where real failure risk lives.

Why errors compound rather than cancel out

The corruption problem isn't random noise. There's a structural reason it gets worse over time.

LLMs process everything in their context window, but they don't have perfect recall of what they've already written or modified. Over long interactions, a model makes decisions based on a growing, increasingly complex history of what's been done. Small errors in early interactions don't get flagged. The model moves forward assuming the document state is correct. Later interactions build on top of uncorrected errors from earlier in the session.
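
The arithmetic of compounding is worth making concrete. This toy simulation is ours, not the paper's: it assumes a fixed per-interaction corruption probability with no repair, which is enough to show how modest per-step error rates turn into large end-of-session losses.

```python
import random

def simulate_session(interactions: int = 20,
                     corruption_per_step: float = 0.02,
                     chunks: int = 100,
                     seed: int = 0) -> float:
    """Toy model: a document split into `chunks` units; each interaction
    silently corrupts a small random fraction, and nothing repairs it."""
    rng = random.Random(seed)
    intact = [True] * chunks
    for _ in range(interactions):
        for i in range(chunks):
            if intact[i] and rng.random() < corruption_per_step:
                intact[i] = False  # the error goes unflagged and persists
    return sum(intact) / chunks

# Expected survival after 20 steps at 2% per-step loss: 0.98**20 ≈ 0.67.
print(f"intact fraction after 20 interactions: {simulate_session():.2f}")
```

Even a 2 percent per-step loss rate leaves roughly a third of the document gone by interaction 20. That lands in the neighborhood of the numbers the paper reports, though the paper doesn't model corruption this way.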

By interaction 20, the model may be working with a subtly broken version of the document and have no signal from the task environment that anything went wrong. The corruption only becomes visible when an external evaluation, which is what the Reconstruction Score provides, compares the final document against the ground truth. In production settings, there often isn't an external evaluation. Users may not notice the degradation until much later, or may not notice it at all.

This is different from the kinds of failures that get caught early. A model that refuses a task, crashes, or produces obviously wrong output is easy to identify. A model that quietly removes 25 percent of a document's content while appearing to complete each assigned task is much harder to catch before the damage matters.

Performance after two interactions doesn't reliably predict outcomes at 20 interactions. A model performing well through the first two steps may deteriorate significantly through the next 18. Short testing cycles, which are common in AI product evaluation, can give false confidence about long-session reliability.

The researchers also found that stronger models delay critical failures rather than avoid them. Gemini 3.1 Pro reaches catastrophic corruption less frequently in early interactions, but the majority of its domain-and-condition combinations still fail by interaction 20. Making models bigger and stronger hasn't closed the reliability gap. It's shifted the failure point to a later position in the session.

What to know before deploying AI agents on real documents

The enterprise case for AI agents is built around delegation. You give a complex task to an AI system, it works through it autonomously, and you review the output. DELEGATE-52 shows the current generation of models can't fully support that workflow for most professional domains.

That doesn't mean AI agents have no value in document workflows. Python coding meets the readiness threshold. Code editing and refactoring in Python are well within current model capability. The DELEGATE-52 results suggest other structured programming tasks may fall into similar territory. The gap between what's ready and what enterprise teams are currently deploying agents for is where the real risk sits.

For teams already using AI agents on document-heavy workflows, the practical implication is verification. Human review at the end of long AI sessions isn't optional; it's the current state of the art. Treating agent output as final without review is the mechanism by which 25 percent of document content quietly disappears.

A few specific design changes follow from the research. Domain matters more than model: the gap between Python performance and most other professional domains is wide enough that choosing which tasks to delegate based on domain-specific readiness is more reliable than assuming a frontier model's general benchmark scores transfer. Document size is a real variable: larger documents degrade faster, so splitting complex materials into smaller working units reduces exposure to the compounding error problem. Explicit verification checkpoints during a session, not just at the end, can surface degradation before it compounds further.
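
One way to implement the checkpoint idea, sketched under our own assumptions rather than taken from the paper: snapshot the document periodically and halt when too much prior content has vanished. `difflib` is standard-library Python; the 5 percent threshold and the line-level diff are illustrative choices.

```python
import difflib

def checkpoint_ok(snapshot: str, current: str, max_loss: float = 0.05) -> bool:
    """Return False if more than `max_loss` of the snapshot's lines have
    disappeared since the last checkpoint (threshold is our assumption)."""
    snap_lines = snapshot.splitlines()
    matcher = difflib.SequenceMatcher(None, snap_lines, current.splitlines())
    preserved = sum(block.size for block in matcher.get_matching_blocks())
    total = max(len(snap_lines), 1)
    return (total - preserved) / total <= max_loss

# In an agent loop: snapshot the document every few interactions, and
# pause for human review as soon as checkpoint_ok returns False.
```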

Qualcomm's CEO told Fortune last week that 2026 is the year AI agents go mainstream. GitLab announced this week it's cutting staff to reinvest in what it's calling the agentic era. DELEGATE-52 doesn't argue against using agents. It maps precisely where the current generation can be trusted and where it can't. Python coding, ready. Most professional document workflows, not yet.

Microsoft's AgentRx framework, introduced alongside DELEGATE-52, helps developers localize failures when an AI agent breaks down mid-task, tracing where and why things went wrong rather than having to review long interaction logs manually. It's a debugging tool that improves post-failure analysis, not a fix for the underlying corruption rates the benchmark measures.

Gartner added relevant context on the same day, with analyst Rita Sallam noting that AI agents without semantic foundations are more likely to hallucinate, introduce bias, and produce unreliable results. The firm predicts organizations that prioritize semantics in AI-ready data will increase agent accuracy by up to 80 percent and reduce costs by up to 60 percent by 2027.

For a broader view of how current models compare on reliability and capability across use cases, our LLM Comparison resource page covers the major options in plain terms. The recent Anthropic research on agents that learn from their own mistakes explores a complementary direction: what happens when you give models structured ways to correct themselves before errors compound.

The full benchmark is available to run against your own domain. The DELEGATE-52 paper is on arXiv, and the code is at microsoft/DELEGATE52 on GitHub.
