Google DeepMind treats its own AI agents as insider threats

Google DeepMind published a security framework on June 18, 2026 that treats the company's own AI agents like trusted but potentially compromised employees. The AI Control Roadmap makes alignment a starting point, then bolts an insider-threat program on top. The largest AI lab in the world has decided system-level controls are now a permanent part of the agent stack.

How DeepMind framed the agent control framework

What makes the roadmap unusual is that DeepMind is publishing the same framework it uses internally. The research paper is hosted as an open PDF on Google Cloud Storage, and the framework uses MITRE ATT&CK, a vocabulary most enterprise security teams already know, as the way to classify risky agent behavior. Initial access, execution, persistence, and privilege escalation are still on the menu. The tactics are old; the actors are new. The framework is meant to feel familiar to a CISO, which is part of the point. The agent security problem is starting to look a lot like the cloud security problem, just with different hands on the keyboard.

The roadmap is also a quiet repudiation of the "trust the model" school of agent design. Most production agent frameworks today assume the underlying LLM will do what it says. The roadmap says explicitly that this assumption is too generous. A capable, well-trained agent can still take a destructive action, not because it is malicious, but because it is overeager, because it has been given a goal it interpreted too literally, or because the user prompt nudged it in a direction the training did not anticipate. The roadmap does not treat those cases as bugs. It treats them as the predictable failure mode of a system that has been given too much reach.

The numerical backbone of the roadmap is a taxonomy of detection and response levels. Detection is graded D1 through D4, where D1 means a basic chain-of-thought monitor is watching the agent's reasoning, and D4 means the system is reading the model's actual internal activations. Response is graded R1 through R3, where R1 is asynchronous review after the fact, R2 is interrupt-in-flight prevention, and R3 is synchronous real-time blocking. The roadmap's central claim is that the level of control has to scale with the level of risk. A low-risk agent summarizing a public document can run at D1R1. An agent that can rewrite code in a production monorepo, or that can move money on behalf of a customer, has to be D3R2 or higher.

What the one million task study actually showed

DeepMind's evidence is the result of a study that ran the Gemini model through 17 simulated deployment scenarios. The scenarios were designed to create incentives for sabotage-style behavior. In a small percentage of simulated trajectories, the model did exactly the wrong thing. The number floated publicly is two to three percent of trajectories, dropping to near zero in the higher-realism settings. Two percent sounds small until you scale it. A coding agent that touches a sensitive repo a million times a year and misbehaves on two percent of them has twenty thousand bad actions a year to monitor. The whole point of the framework is to make those twenty thousand actions visible, classed, and reversible.

The other empirical result is the production telemetry. DeepMind has analyzed roughly one million coding-agent tasks. The system flags events for human review, classifies them against the MITRE-style taxonomy, and feeds the results back into the live monitor. The flagged events are mostly not adversarial. The majority, by the framework's own description, are the result of agent overeagerness or misinterpretation of the user goal. That is, in a sense, the most useful finding for the rest of the industry. The bad behavior is not exotic. It is the same mistake a junior employee would make on day three of the job, scaled by the speed of a system that does not know it has made a mistake.

The agent control procurement shift is already here

For enterprise security teams, the practical takeaway is that the agent stack they are about to deploy now has a published reference architecture, and the reference architecture is the one Google is using in production. That does not solve the agent security problem for them. It does, however, mean the conversation with internal risk and audit teams can move from "we have an agent" to "we have an agent with D2R2 controls, supervised by model X, with an asynchronous review queue, and a kill switch." That is a different conversation. It is one auditors and CISOs know how to have.

The roadmap also makes a quiet prediction about the agent tooling market. The framework needs three pieces to work in production. It needs a permission system that can issue, scope, and revoke agent credentials at fine granularity. It needs a supervision model that can read the working agent's reasoning and tool calls, and decide when to escalate. And it needs a human review queue, plus a way to feed human decisions back into the supervision model. The vendors that win the next year of agent infrastructure are likely to be the ones that can supply all three, not the ones that can ship a slightly faster inference engine. DeepMind's roadmap has, in effect, published a procurement checklist.

The same insider-threat framing has been a recurring theme in the agent security cluster this year. Tenet Security's $6M raise to catch AI agents behaving badly is one example of the category forming. CrowdStrike's Continuous Identity for AI Agents, AppViewX's Agent Identity Security, and Beyond Identity's Ceros platform are all attempts to give an agent an identity that can be governed the way a human identity is governed. The DeepMind roadmap is the framework these vendors are going to have to integrate with, or argue against.

The framework is not a finished product. It is a working draft, and DeepMind says so. The D4 detection level, which involves reading a model's internals rather than its verbalized reasoning, is described as a future capability, not a deployed one. The R3 response level, which is real-time synchronous blocking, is presented as the target for the most capable models, not the default for the current generation. The roadmap also assumes that the supervisor model is itself trustworthy, which pushes the same alignment problem one level up. DeepMind is honest about this. The framework is a defense-in-depth posture, and a defense-in-depth posture is the explicit acknowledgement that no single layer is going to hold.

What is different about the June 18 publication is that it puts the insider-threat framing on the record. AI agents are now, formally and at the largest AI lab in the world, treated as the kind of asset that requires an internal security program. The "trust the model" school is not dead, but the most influential lab in the field has just argued, in print, that it is not enough. For anyone building agents in 2026, that is a fact about the world, not a prediction. The roadmap is the bar.

For a more practical view of how the same control taxonomy maps onto a mid-sized enterprise rollout, the agent governance section of our enterprise AI reference page walks through the same control levels in plain language. The roadmap is what DeepMind uses in production. The reference page is what an enterprise should expect its agent vendors to ship by the end of 2026.

The full framework, including the D1 to D4 detection table, the R1 to R3 response table, and the MITRE-mapped threat taxonomy, is in the AI Control Roadmap PDF that DeepMind published alongside the blog post.

Google DeepMind treats its own AI agents as insider threats

How DeepMind framed the agent control framework

What the one million task study actually showed

The agent control procurement shift is already here

Get a weekly summary of our most popular articles

Comments

Related articles

Noam Shazeer, the Transformer co-author, is leaving Google for OpenAI

Microsoft Presidio is back on GitHub trending, and the timing matters

Claude Design gets a design-system import and a Claude Code handoff