
Microsoft's 100-Agent AI System Just Found 16 Windows Vulnerabilities

AIntelligenceHub · 11 min read

Microsoft's MDASH runs more than 100 AI agents in parallel to scan Windows code. In May 2026 it found 16 real CVEs, including 4 Critical RCEs, and scored 88.45% on the CyberGym security benchmark.

Sixteen of the vulnerabilities patched in Microsoft's May 2026 Patch Tuesday release weren't found by human researchers. They came from MDASH, a new agentic security system that runs more than 100 specialized AI agents in parallel to hunt for vulnerabilities in Windows source code. Among them were four Critical remote code execution flaws that could have given attackers complete control over affected machines before any fix existed.

MDASH stands for Multi-model Agentic Scanning Harness. Microsoft's Autonomous Code Security team built and deployed it internally, published results on May 12, 2026, and is now testing it with a small group of external customers in limited private preview. The system doesn't work like a conventional static analysis scanner or a single large language model pointed at code. It coordinates a fleet of specialized agents running across multiple frontier and distilled models, using disagreements between those models as a quality signal to catch vulnerabilities that any individual model would miss.

MDASH is not the only agentic AI system targeting vulnerability discovery. Anthropic's internal security tools and independent research teams have published systems with similar architectures over the past year. What makes MDASH different is the scale of the deployment, the specificity of the benchmark claims, and the fact that the CVEs it found shipped in an actual Patch Tuesday release. This is not a proof-of-concept or a research paper. It's a production security tool with verifiable real-world results.

The public benchmark result is unambiguous: 88.45% on CyberGym, the industry benchmark that evaluates against 1,507 real-world vulnerabilities sourced from actual CVEs. That's the highest score on the public leaderboard, roughly five points ahead of the next best published system. On Microsoft's own internal test data, MDASH found 21 out of 21 planted vulnerabilities with zero false positives. On historical Microsoft Security Response Center cases for specific Windows drivers, it hit 96% recall on clfs.sys and 100% on tcpip.sys.

These results matter because they arrived alongside actual shipped patches, not just a research paper. The four Critical RCE flaws MDASH found in the Windows networking and authentication stack were patched in May's cycle, before any external researcher or attacker published an exploit. That's the concrete security outcome enterprise teams should be tracking as AI moves from experimental tooling into production security operations. The question is no longer whether agentic AI can find real vulnerabilities in production code. The question is how fast this capability scales and what it means for every organization running Windows.

How Microsoft MDASH Orchestrates 100 Agents

The architecture behind MDASH explains both the performance numbers and the broader implications for AI-powered security tooling.

A single large language model pointed at source code will find some vulnerabilities. But it brings one perspective and specific blind spots. Whatever patterns the model underrepresents in its training data will tend to be the patterns it misses in a security scan. That's a structural limitation, not something fixable with a better prompt or more compute on the same model.

MDASH addresses this by running more than 100 specialized agents across multiple models simultaneously. Different agent clusters are optimized for different vulnerability classes. One group handles memory safety issues. Another works on authentication logic. Another covers protocol parsing. Another focuses on privilege escalation paths. When multiple independent agents converge on the same finding, confidence increases. When agents diverge on whether something is exploitable, that divergence itself becomes data worth examining.
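The convergence logic can be sketched as a simple grouping step. This is a minimal illustration, not Microsoft's implementation; the `triage` function, the tuple shape, and the quorum threshold are all invented for this example:

```python
from collections import defaultdict

def triage(findings, quorum=3):
    """Split candidate findings by cross-model agreement.

    findings: iterable of (model, location, vuln_class) tuples, one per
    agent report. Hypothetical schema; MDASH's real format is undisclosed.
    """
    models_per_issue = defaultdict(set)
    for model, location, vuln_class in findings:
        # Count distinct models, not agents: ten agents backed by one model
        # share that model's blind spots, so they get a single "vote".
        models_per_issue[(location, vuln_class)].add(model)

    confirmed, disputed = [], []
    for issue, models in models_per_issue.items():
        (confirmed if len(models) >= quorum else disputed).append(issue)
    return confirmed, disputed

reports = [
    ("model-a", "tcpip.sys:parse_hdr", "use-after-free"),
    ("model-b", "tcpip.sys:parse_hdr", "use-after-free"),
    ("model-c", "tcpip.sys:parse_hdr", "use-after-free"),
    ("model-a", "clfs.sys:log_write", "integer-overflow"),
]
confirmed, disputed = triage(reports)
```

Findings that clear the quorum gain confidence; the single-model finding is not discarded but routed to closer examination, which is the sense in which divergence itself becomes data.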

The multi-model approach matters because different frontier models have genuinely different failure modes. A model trained on a corpus that underrepresents a specific coding pattern will miss vulnerabilities that follow that pattern. A second model with a different training distribution may be weaker elsewhere but stronger there. Running them in parallel and treating disagreements as signals produces a combined coverage profile that measurably exceeds what any single model achieves alone. This isn't a theoretical claim; MDASH's recall numbers on the historical MSRC cases are the evidence.
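Under a simplifying independence assumption, the coverage gain from combining models is easy to quantify: the ensemble misses a bug only if every member misses it. Real frontier models only approximate independence, since their training corpora overlap, so this is an idealized upper bound rather than a prediction:

```python
def combined_recall(recalls):
    """Recall of an ensemble whose members fail independently.

    Idealized upper bound: real models share training data and therefore
    share some blind spots, so actual gains will be smaller.
    """
    miss_probability = 1.0
    for r in recalls:
        miss_probability *= (1.0 - r)
    return 1.0 - miss_probability

# Two mediocre-but-different scanners beat either one alone:
# 85% and 80% recall combine to 97% under independence.
```

The illustrative point stands even with correlated failures: any genuine diversity in failure modes pushes ensemble recall above the best single member.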

The pipeline runs in five stages. First, MDASH ingests source code and builds threat models for the components being analyzed (Prepare). Specialized auditor agents then scan for candidate vulnerabilities (Scan). In the Validate stage, debater agents argue for and against each finding's exploitability using structured adversarial reasoning, filtering weak candidates and strengthening the case for real ones. The Dedup stage collapses semantically equivalent findings that describe the same underlying issue from different angles. Finally, in the Prove stage, the system constructs and executes actual inputs to confirm each candidate is real and triggerable.
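The five stages compose into a linear pipeline. The sketch below shows only that shape; every stage body is a stub, since the real agents, prompts, and models behind each stage are not public:

```python
def prepare(source):
    """Prepare: build a threat model for the component (stub)."""
    return {"source": source, "threat_model": ["untrusted network input"]}

def scan(ctx):
    """Scan: auditor agents emit candidate findings (stub)."""
    ctx["candidates"] = [{"loc": "parse_hdr", "class": "use-after-free"},
                         {"loc": "parse_hdr", "class": "use-after-free"}]
    return ctx

def validate(ctx):
    """Validate: debater agents filter weak candidates (stub: keep all)."""
    ctx["validated"] = ctx["candidates"]
    return ctx

def dedup(ctx):
    """Dedup: collapse findings describing the same underlying issue."""
    seen, unique = set(), []
    for f in ctx["validated"]:
        key = (f["loc"], f["class"])
        if key not in seen:
            seen.add(key)
            unique.append(f)
    ctx["deduped"] = unique
    return ctx

def prove(ctx):
    """Prove: attach a reproducing input to each surviving finding (stub)."""
    ctx["proven"] = [dict(f, poc=b"crafted-packet") for f in ctx["deduped"]]
    return ctx

def run_pipeline(source):
    ctx = prepare(source)
    for stage in (scan, validate, dedup, prove):
        ctx = stage(ctx)
    return ctx["proven"]
```

Note the ordering choice the article describes: deduplication happens before proof generation, so the expensive step of constructing and executing a trigger input runs once per underlying issue, not once per duplicate report.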

The Prove stage changes the economics of vulnerability triage. Most AI security tools stop at candidate generation and pass a list to human engineers who then determine which findings are actually exploitable. MDASH builds proof-of-concept inputs automatically and validates them before any human sees the result. By the time a finding reaches a security engineer, MDASH has already confirmed the vulnerability exists and demonstrated that it can be triggered. That shifts human attention from triage to impact assessment and fix development.

The debater agent layer in the Validate stage is what drives the zero false positive rate on internal testing. Rather than accepting every candidate that clears a scoring threshold, MDASH puts each finding through structured adversarial review. One group of agents argues the vulnerability is genuine and exploitable. Another argues it isn't. The quality of the argument determines what advances, not just the vote count.
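A structured debate of that kind can be sketched as a judged exchange. The round structure, callables, and judging criterion below are invented for illustration; in the real system each role would be an LLM-backed agent:

```python
def debate(finding, argue_pro, argue_con, judge, rounds=2):
    """Run fixed pro/con rounds over a finding, then score the transcript.

    argue_pro / argue_con / judge stand in for LLM-backed agents; plain
    callables keep the control flow testable.
    """
    transcript = []
    for _ in range(rounds):
        transcript.append(("pro", argue_pro(finding, transcript)))
        transcript.append(("con", argue_con(finding, transcript)))
    # The judge scores argument quality, not vote counts; only findings
    # whose "exploitable" case survives scrutiny advance to Prove.
    return judge(finding, transcript)

# Stub agents: the pro side cites a concrete trigger path, the con side
# offers nothing specific, so the judge advances the finding.
pro = lambda f, t: "freed object reachable via crafted packet"
con = lambda f, t: "no concrete counterargument"
judge = lambda f, t: any(side == "pro" and "crafted" in arg
                         for side, arg in t)

advances = debate({"loc": "parse_hdr"}, pro, con, judge)
```

The design intuition is that a finding which cannot survive an adversary tasked with refuting it is unlikely to survive human triage either, which is how this layer suppresses false positives before any engineer sees the queue.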

The multi-model architecture also produces the benchmark results. A single model scanning clfs.sys might hit 85% recall, missing one in seven real vulnerabilities. MDASH hits 96%. Those missing 11 percentage points represent real exploitable bugs that a single-model approach would leave in place. On tcpip.sys, the network stack driver, MDASH reaches 100% recall against five years of confirmed MSRC cases, meaning it would have found every vulnerability that Microsoft's own security researchers confirmed over that period.

CyberGym is the most widely used public benchmark for AI security tools, testing against 1,507 real-world CVEs. An 88.45% success rate means MDASH correctly identified and confirmed roughly 1,333 of those cases, more than any other system on the leaderboard. The five-point lead over the second-best system is significant in a competitive field where vendors with multi-year head starts compete for the same customers. Reaching the top of the leaderboard with a published methodology and verifiable results is a different kind of claim than marketing copy about AI-powered protection.

Microsoft's own framing captures the design philosophy precisely: "The harness does the work, and the model is one input." The system's performance comes from the coordination architecture, not from any single model inside it. As better models become available, the harness can incorporate them without rebuilding the pipeline from scratch. That's an architectural advantage that compounds over time as the underlying models improve, and it also provides operational resilience: when a single model is updated, overall recall stays stable because the harness aggregates many contributors.

The CVEs That Came Out of May Patch Tuesday

Four Critical vulnerabilities in the May 2026 release came directly from MDASH. Two are worth examining in detail.

CVE-2026-33827 is a use-after-free vulnerability in the Windows TCP/IP networking stack. Use-after-free bugs occur when code accesses a block of memory after it has already been freed, creating a condition an attacker can manipulate to redirect execution. In the Windows TCP/IP stack, this creates remote code execution potential: an attacker on the same network, or potentially reachable over the internet, could send specially crafted packets to trigger the bug and execute code on a target machine without any user interaction. Microsoft rated it Critical and shipped the fix in May 2026.

CVE-2026-33824 is a double-free vulnerability in the IKEv2 implementation, the protocol Windows uses for VPN connections and IPsec sessions. Double-free bugs occur when code frees the same memory allocation twice, corrupting the allocator's bookkeeping. An attacker who can time the second free correctly can turn that corruption into controlled code execution. This one is particularly serious because a successful exploit can reach LocalSystem, the highest privilege level in Windows, giving the attacker complete control of the affected machine.
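At the level of allocator mechanics, both flaws end with the same memory handed out to two owners. A toy free-list model in Python (no real memory involved; `ToyHeap` is invented for illustration and caricatures an allocator with no safety checks) shows the aliasing primitive each bug creates:

```python
class ToyHeap:
    """A caricature of a free-list allocator with no safety checks."""

    def __init__(self):
        self.free_list = []
        self.next_addr = 0x1000

    def alloc(self):
        # Reuse the most recently freed slot first, as many allocators do.
        if self.free_list:
            return self.free_list.pop()
        addr = self.next_addr
        self.next_addr += 0x10
        return addr

    def free(self, addr):
        # No double-free detection: the slot just goes back on the list.
        self.free_list.append(addr)

heap = ToyHeap()

# Use-after-free: a stale reference survives the free, and the next
# allocation (attacker-controlled) lands in the same slot.
victim = heap.alloc()
heap.free(victim)
attacker = heap.alloc()
uaf_alias = (victim == attacker)   # stale pointer now reads attacker data

# Double-free: freeing twice puts the slot on the list twice, so two
# "independent" allocations alias each other.
obj = heap.alloc()
heap.free(obj)
heap.free(obj)
first, second = heap.alloc(), heap.alloc()
double_free_alias = (first == second)
```

In both cases the attacker's goal is the same: arrange for attacker-controlled data to occupy memory the victim code still treats as a trusted object, such as one holding a function pointer.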

Both vulnerability classes, use-after-free and double-free in Windows networking components, are what advanced persistent threat actors prioritize. They're present in code paths that every Windows machine executes when handling network traffic and VPN connections. These are the classes of vulnerabilities that become wormable when weaponized, spreading automatically across networks without requiring any user action. Finding them before an external exploit appears is the core security outcome MDASH was built to produce.

The other 12 MDASH-discovered CVEs in May are rated Important, covering denial-of-service conditions, information disclosure, and privilege escalation paths. Important-severity findings don't generate the same headlines as Critical RCEs, but in enterprise environments where a compromised low-privilege account can use a privilege escalation flaw to move laterally through a network, they represent meaningful attack surface that's now patched. The fact that MDASH generated proof-of-concept inputs for all 16 findings means engineers received confirmed, reproducible vulnerabilities rather than candidate reports.

The proof-of-concept generation changes how engineers write patches. When a developer receives a confirmed, reproducible exploit input alongside a vulnerability report, they understand exactly what sequence triggers the bug. That understanding leads to more thorough patches. Bugs that look like simple boundary errors often have deeper root causes that a developer who only reads the report might miss. Providing the working exploit input alongside the report is a quality control mechanism, not just a convenience feature. It's why MDASH's Prove stage is arguably as important to long-term security outcomes as the Scan and Validate stages that come before it.

One implication of sustained performance at this level is patch cycle acceleration. If MDASH continues expanding across more Windows components, the count of AI-discovered CVEs per Patch Tuesday will grow. The security research community has started calling this the "vulnpocalypse": AI vulnerability discovery accelerating so fast that patch management capacity becomes the bottleneck rather than finding the vulnerabilities. For Windows administrators, this means more patches to evaluate and deploy per cycle, not fewer. The trade-off is that fewer zero-days should be reaching the wild, because internal AI tooling is finding them first.

To put the numbers in concrete terms: if MDASH brings systematic coverage of major Windows subsystems, the CVE count per cycle could double within two years. Security operations teams that currently allocate two days per month to patch evaluation and testing should plan for that budget to grow. The teams that treat the increase as a one-time adjustment rather than a sustained trend will find themselves behind inside twelve months.
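The arithmetic behind that planning advice is simple compounding. The doubling horizon is the article's scenario; the per-cycle growth rate derived from it is an illustrative assumption, not a Microsoft projection:

```python
def per_cycle_growth_for_doubling(cycles=24):
    """Per-cycle growth rate that doubles volume over `cycles` cycles."""
    return 2 ** (1 / cycles) - 1

def projected_patch_budget(days_now, cycles):
    """Scale today's evaluation budget by the compounded growth."""
    r = per_cycle_growth_for_doubling()
    return days_now * (1 + r) ** cycles

# Doubling over 24 Patch Tuesdays is roughly 2.9% more volume per cycle,
# small enough to go unnoticed month to month, which is exactly why teams
# that treat it as a one-time bump fall behind.
```

On these assumptions, a two-day monthly evaluation budget reaches four days after two years.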

Enterprise Security in the AI Agentic Era

The enterprise implications of MDASH fall into three distinct categories.

For enterprises using Windows at scale, the indirect effect is already active. More AI-discovered CVEs in Windows means higher patch volume per cycle, regardless of whether those enterprises ever deploy MDASH themselves. Patch management workflows designed around the slower cadence of human vulnerability research should be reviewed for capacity to absorb a multi-year acceleration trend as Microsoft's internal AI tooling matures.

For security vendors and enterprise buyers, the CyberGym benchmark result is competitive intelligence. Any vendor claiming production-grade AI vulnerability discovery now has a published score to beat or match. That makes vague claims about AI-powered protection harder to sustain without comparable numbers, and it gives enterprise buyers a concrete baseline for evaluations.

For organizations building enterprise AI programs more broadly, the MDASH announcement is one of the clearest examples of AI delivering specific, measurable value in a production context with real safety consequences. Finding Critical RCE flaws before attackers is not a workflow productivity gain. It's a safety outcome. That distinction matters for how enterprise security leaders structure the return-on-investment argument for AI investment in security operations.

The ROI case for AI-driven vulnerability discovery is unusually clear compared to other enterprise AI programs. The value of a Critical RCE that gets patched before exploitation is not speculative. Unpatched Critical RCEs in Windows networking components have historically led to ransomware campaigns, data breaches, and regulatory fines. Microsoft's own incident response teams regularly trace attacks back to vulnerabilities that existed for months or years before patches were applied. Finding those vulnerabilities faster reduces the window during which they can be exploited. The more sophisticated the threat actors targeting your infrastructure, the more that faster discovery matters.

What makes MDASH strategically significant for the broader enterprise AI market is that it demonstrates an agentic AI system delivering a safety outcome that a human organization couldn't achieve at the same speed or cost. Not a productivity improvement that might be real but hard to quantify. An outcome with a direct, measurable relationship to security events. That kind of case study is what enterprise AI programs need to justify continued investment to boards that are still skeptical about AI ROI.

MDASH is in limited private preview. No pricing or general availability date has been disclosed. When general availability arrives, the key integration questions will be how MDASH findings flow into existing vulnerability management systems like ServiceNow or Jira, and whether the proof-of-concept outputs are usable directly by engineering teams writing fixes. Enterprise programs adopting agentic AI tools for security face the same adoption and rollback pressures as AI agent programs generally, and organizations that invest in readiness before GA will have an easier transition than those that try to bolt on integration work afterward.

The cost structure of agentic security systems is also worth naming directly. Running 100+ agents across multiple frontier models for each scanning cycle is compute-intensive. Microsoft hasn't disclosed what MDASH costs per cycle or per discovered CVE, but enterprises evaluating similar internal tooling or vendor products should insist on cost-per-finding metrics, not just coverage numbers. A system that finds 88% of vulnerabilities at 10x the compute cost of a 75% system needs to be evaluated on the full economics. The proof-of-concept generation in MDASH's Prove stage likely helps the economics: if an engineer can write a patch in two hours instead of four because they have a reproducible exploit to work from, that time saving compounds across 16 CVEs per cycle.
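The full-economics comparison reduces to cost per confirmed finding. All numbers below are invented for illustration; Microsoft has disclosed none of them:

```python
def cost_per_finding(compute_cost, vulns_present, recall):
    """Compute spend divided by the vulnerabilities actually found."""
    found = vulns_present * recall
    return compute_cost / found

# Hypothetical scenario: a codebase with 100 latent vulnerabilities, a
# 75%-recall scanner at $1,000 per cycle vs. an 88%-recall ensemble at
# 10x the compute.
cheap = cost_per_finding(1_000, 100, 0.75)       # ~$13 per finding
ensemble = cost_per_finding(10_000, 100, 0.88)   # ~$114 per finding
```

On these made-up numbers the ensemble costs roughly 8.5x more per finding, which is exactly why coverage alone is the wrong metric: the evaluation hinges on whether the 13 extra bugs it catches include the kind of wormable RCE whose exploitation cost dwarfs the compute delta.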

There is also a dual-use dimension worth naming clearly. An agentic system that finds Critical RCEs in Windows networking components and generates proof-of-concept exploit inputs is a powerful capability. The offensive application of a similar architecture exists and will be developed independently of Microsoft's defensive use. Enterprises should expect AI-accelerated offensive tooling to increase the urgency of patch deployment over the coming years, making faster patching more necessary rather than less. The asymmetry between faster vulnerability discovery and faster exploitation tilts toward defenders only when patch management keeps pace with discovery.

Microsoft's full MDASH announcement and benchmark details are publicly available for security teams who want to evaluate the methodology in depth.

For organizations building enterprise AI programs that span security, operations, and governance, AIntelligenceHub's Enterprise AI resource guide covers adoption patterns and governance frameworks for teams moving toward agentic AI at scale.
