Readit News
tptacek · 2 months ago
It's way too early to make firm predictions here, but if you're not already in the field it's helpful to know there's been 20 years of effort at automating "pen-testing", and the specific subset of testing this project focused on (network pentesting --- as opposed to app pentesting, which targets specifically identified network applications) is already essentially fully automated.

I would expect over the medium term agent platforms to trounce un-augmented human testing teams in basically all the "routinized" pentesting tasks --- network, web, mobile, source code reviews. There are too many aspects of the work that are just perfect fits for agent loops.

Sytten · 2 months ago
Automated app pentest scanners find the bottom 10-20% of vulns; no real pentester would consider them great. Agents might get us into the 40-50% range. What they're really good at is finding "signals" that a human should investigate.
tptacek · 2 months ago
I agree with you about scanners (we banned them at Matasano), but not about the ceiling for agents. Having written agent loops for somewhat similar "surface and contextualize hypotheses from large volumes of telemetry" problems, and, of course, having delivered hundreds of application pentests: I think 80-90% of all the findings in a web pentest report, and functionally all of the findings in a netpen report, are within 12-18 months reach of agent developers.
jonahx · 2 months ago
So the stuff that agents would excel at is essentially just the "checklist" part of the job? Check A, B, C, possibly using tools X, Y, Z, possibly multi-step checks but everything still well-defined.

Whereas finding novel exploits would still be the domain of human experts?

tptacek · 2 months ago
I'm bullish on novel exploits too but I'm much less confident in the prediction. I don't think you can do two network pentests and not immediately reach the conclusion that the need for humans to do significant chunks of that work at all is essentially a failure of automation.

With more specificity: I would not be at all surprised if the "industry standard" netpen was 90%+ agent-mediated by the end of this year. But I also think that within the next 2-3 years, that will be true of web application testing as well, which is in a sense a limited (but important and widespread) instance of "novel vulnerability" discovery.

cookiengineer · 2 months ago
Well, agents can't discover bypass attacks because they don't have memory. That was what DNCs [1] (Differentiable Neural Computers) tried to accomplish. Correlating scan metrics with analytics is, by the way, a great task for DNCs and what they are good at, due to how their (not-so-precise) memory works. Not so much, though, at understanding branch logic and its consequences.

However, I currently believe that forensic investigations will change post-LLM, because LLMs are very good at translating arbitrary bytecode, assembly, netasm, Intel asm, etc. syntax into example code (in any language). It doesn't have to be 100% correct in those translations; that's why LLMs can be really helpful for the discovery phase after an incident. Check out the Ghidra MCP server, which is insane to watch in real time [2]

[1] https://github.com/JoergFranke/ADNC

[2] https://github.com/LaurieWired/GhidraMCP
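As a hand-made illustration of the kind of asm-to-example-code translation meant above (not LLM output; the assembly snippet and function name are invented for the example), a few lines of x86-64 and a Python equivalent:

```python
# x86-64 (System V ABI; rdi = pointer to u32 array, ecx = count):
#   xor  eax, eax          ; total = 0
# loop:
#   add  eax, [rdi]        ; total += *ptr
#   add  rdi, 4            ; ptr++
#   dec  ecx               ; count--
#   jnz  loop              ; repeat while count != 0
def sum_u32(values):
    """Python rendering of the loop above: sum 32-bit values."""
    total = 0
    for v in values:
        total = (total + v) & 0xFFFFFFFF  # wrap at 32 bits like eax
    return total

print(sum_u32([1, 2, 3]))  # 6
```

The translation doesn't need to be cycle-accurate to be useful; it just has to let a human grasp the logic faster than raw disassembly would.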

suriya-ganesh · 2 months ago
With exploits, you'll have to go through the rote stuff of checklisting over and over, until you see aberrations across those checklists and connect the dots.

If that part of the job is automated away, I wonder how the talent and skill for finding those exploits will evolve.

socketcluster · 2 months ago
They suck at collecting the bounty money because they can't legally own a bank account.
observationist · 2 months ago
It's not way too early, imo. This is the academic nerds' proof of concept for a school research project; it's not the "group of elite hackers get together and work out a world-class production-ready system".

Agent platforms have similar modes of failure, whether it's creative writing, coding, web design, hacking, or any other sort of project scaffolding. A lot of recent research has dealt with resolving the underlying gaps in architectures and training processes, and they've had great success.

I fully expect frontier labs to have generalized methodologizing capabilities within the first half of the year, and by the end of the year, the Pro/Max/Heavy variants of the chatbots will have the capabilities baked in fully. Instead of having Codex or Artemis or Claude Code, you can just ask the model to think through and plan your project, whatever the domain, and get professional-class results, as if an expert human were orchestrating the project.

All sorts of complex visual tool use like PCB design and building plans and 3d modeling have similar process abstractions, and the decomposition and specialized task executions are very similar in principle to the generalized skills I mentioned. I think '26 is going to be exciting as hell.

nullcathedral · 2 months ago
I work in this space. The productivity gains from LLMs are real, but not in the "replace humans" direction.

Where they shine is the interpretive grunt work: "help me figure out where the auth logic is in this obfuscated blob", "make sense of this minified JS", "what's this weird binary protocol doing?", "write me a Frida script to hook these methods and dump these keys". Things that used to mean staring at code for hours or writing throwaway tooling now take a fraction of the time. They're a straight-up playing-field leveler.

Folks with the hacker's mindset but without the programming chops can punch above their weight and find more within the limited time of an engagement.

Sure, they make mistakes and need a lot of babysitting. But it's getting better. I expect more firms to adopt them as part of their routine.
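As a sketch of the "throwaway tooling" category: a parser for a hypothetical type-length-value wire format, the kind of one-off script that used to eat an afternoon (the format here is invented for illustration):

```python
import struct

def parse_tlv(blob: bytes):
    """Parse a hypothetical TLV stream: 1-byte type, 2-byte
    big-endian length, then `length` bytes of payload."""
    records, offset = [], 0
    while offset + 3 <= len(blob):
        rtype, rlen = struct.unpack_from(">BH", blob, offset)
        offset += 3
        records.append((rtype, blob[offset:offset + rlen]))
        offset += rlen
    return records

# Two records: type 0x01 carrying b"abc", type 0x02 carrying b"hi"
blob = b"\x01\x00\x03abc" + b"\x02\x00\x02hi"
print(parse_tlv(blob))  # [(1, b'abc'), (2, b'hi')]
```

Nothing here is hard, but on an unfamiliar protocol you'd previously burn real engagement time guessing at the framing; now the first draft is nearly free.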

goalieca · 2 months ago
> The productivity gains from LLMs are real, but not in the "replace humans" direction.

It might be the beer talking, but every time someone comments on AI they have to say something along the lines of "LLMs do help". If I'm being really honest, the fact that everyone has to mention this in every comment, every blog post, and every presentation is because deep down everyone isn't buying it.

protocolture · 2 months ago
"Having the opposing opinion means deep down, you agree with my opinion"

Wow banger of an argument.

fragmede · 2 months ago
Or maybe they do, but they don't want to get drawn into a totally derailing side conversation about the future of humanity and global warming and it's just a tiny acknowledgement that hey, you can throw an obfuscated blob of minified JavaScript at it and it can take it apart with way less effort from a human, which gets you to the interesting part of the RE question faster than if you had to do it by hand. By all means, don't buy it. I'm not the one getting left behind, however.
jongjong · 2 months ago
It does help A LOT, particularly in the case of security research.

For example, I tended to avoid pen testing freelance work before AI because I didn't enjoy the tedious work of reading tons of documentation about random platforms to try to understand how they worked and searching all over StackOverflow.

Now with LLMs, I can give it some random-looking error message and it can clearly and instantly tell me what the error means at a deep tech level, what engine was used, what version, what library/module... I can pen test platforms I have 0 familiarity with.

I just know a few platforms, engines, programming languages really well and I can use this existing knowledge to try to find parallels in other platforms I've never explored before.

The other day, on HackerOne, I found a pretty bad DoS vulnerability in a platform I'd never looked into before, using an engine and programming language I never used professionally; I found the issue within 1 hour of starting my search.

wickedsight · 2 months ago
I feel like it's more because the detractors are very loudly against it and the promoters are very loudly exaggerating the capabilities. Meanwhile, as a bystander who is realistic and is actually using it, you have moments where it's absolutely magnificent and insanely useful and other moments where it kinda sucks, which leads to the somewhat reluctant conclusion that:

> The productivity gains from LLMs are real, but not in the "replace humans" direction.

Meanwhile the people who are explicitly on a side either say that there are no productivity gains or that nobody will have jobs in 6 months.

bawolff · 2 months ago
The article is literally about how much, or whether, AI helps. There are literally only two possible opinions someone can have on the subject: either it does or it doesn't.

I'm not really sure what you are expecting here.

tptacek · 2 months ago
Have you asked anybody who writes exploits full time whether they use LLMs?
arisAlexis · 2 months ago
Not in the replace humans direction yet?
nullcathedral · 2 months ago
Maybe in the future, when labs train more specifically on offensive work. Lots of hand-holding is needed right now.

Even simple stuff like training the models to recognize when they're stuck and should just go clone a repo or pull up the javadocs instead of hallucinating their way through or trying simple internet searches.

JohnMakin · 2 months ago
From WSJ article:

> The AI bot trounced all except one of the 10 professional network penetration testers the Stanford researchers had hired to poke and prod, but not actually break into, their engineering network.

Oh, wow!

> Artemis found bugs at lightning speed and it was cheap: It cost just under $60 an hour to run. Ragan says that human pen testers typically charge between $2,000 and $2,500 a day.

Wow, this is great!

> But Artemis wasn’t perfect. About 18% of its bug reports were false positives. It also completely missed an obvious bug that most of the human testers spotted in a webpage.

Oh, hm, did not trounce the professionals, but ok.

tptacek · 2 months ago
False positives on netpens are extremely common, and human netpen people do not generally bill $2k days. Netpen work is relatively low on the totem pole.

(There is enormous variance in what clients actually pay for work; the right thing, I think, to key off of is comp rates for people who actually deliver work.)

iwassayinbourns · 2 months ago
As a data point: when I worked in consulting 10+ years ago doing network (internet/ext), web app, mobile, etc., our day rate was $2k AUD flat for anything we did, and AFAIK we were at or below market cost. I know for sure that the big four charged closer to $3,000 for what I understood to be a worse service (I have nothing to back that up apart from occasionally seeing awful reports). We did a not-insubstantial amount of netpen at that rate. Granted, AUD isn't USD, but I wonder what their day rate is now.


pedro_caetano · 2 months ago
Fair, but most static code analysis tools have equal or worse false-positive rates and are still seen as adding value.

If this is inexpensive (in terms of cost/time) it will likely make business sense even with false positives.

JohnMakin · 2 months ago
But that isn’t the claim. The claim is an agentic pen tester “trounced” human testers. Static analysis tools are already trivial and cheap to automate, why would you need an agent in the loop?
oofbey · 2 months ago
We cannot consider this report unbiased, given that the authors are selling the product.
mens_rea · 2 months ago
Deeply flawed paper for several reasons:

* Small data set of 2 runs (!!)

* Exaggerated claims (saying A1 beat 50% of testers, yet only 4/10 testers found FEWER vulns than A1, and A1 had a nearly 50% false positive rate)

* AI agents were given 16 hours while human testers were given 10

* Their human testers gave up when a modern browser refused to open a webpage with weak TLS ciphers so....clearly not professional testers unless the bar is REALLY low these days

KurSix · 2 months ago
Note that gpt-5 in a standard scaffold (Codex) lost to almost everyone, while in the ARTEMIS scaffold, it won. The key isn't the model itself, but the Triage Module and Sub-agents. Splitting roles into "Supervisor" (manager) and "Worker" (executor) with intermediate validation is the only viable pattern for complex tasks. This is a blueprint for any AI agent, not just in cybersec
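A toy sketch of that supervisor/worker/triage split (the functions and data here are stand-ins invented for illustration, not ARTEMIS's actual components):

```python
from dataclasses import dataclass

@dataclass
class Finding:
    host: str
    issue: str
    evidence: str

def worker_scan(target: str) -> list[Finding]:
    # Stand-in for a sub-agent running tools against one target.
    return [
        Finding(target, "weak TLS cipher", "openssl s_client transcript"),
        Finding(target, "admin panel exposed", ""),  # unsubstantiated
    ]

def triage(findings: list[Finding]) -> list[Finding]:
    # Intermediate validation: drop anything the worker can't back up
    # with evidence. This gate is what keeps false positives out of
    # the final report instead of trusting workers blindly.
    return [f for f in findings if f.evidence]

def supervisor(targets: list[str]) -> list[Finding]:
    # The supervisor decomposes the engagement into per-target tasks
    # and validates each batch before it reaches the report.
    report: list[Finding] = []
    for t in targets:
        report.extend(triage(worker_scan(t)))
    return report

print(supervisor(["10.0.0.5", "10.0.0.9"]))
```

In a real scaffold each stand-in function would be an LLM call with its own prompt and tools; the structural point is the validation step between worker output and the report.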
ACCount37 · 2 months ago
If you can do it by splitting roles explicitly, you can fold it into a unified model too. So "scaffolding advantage" might be a thing now, but I don't expect it to stay that way.
vessenes · 2 months ago
Is this true? I mean it’s true for any specific workflow, but I am not clear it’s true for all workflows - the power set of all workflows exceeds any single architecture, in my mind.
scandinavian · 2 months ago
I don't read a lot of papers, but to me this one seems iffy in spots.

> A1 cost $291.47 ($18.21/hr, or $37,876/year at 40 hours/week). A2 cost $944.07 ($59/hr, $122,720/year). Cost contributors in decreasing order were the sub-agents, supervisor and triage module. *A1 achieved similar vulnerability counts at roughly a quarter the cost of A2*. Given the average U.S. penetration tester earns $125,034/year [Indeed], scaffolds like ARTEMIS are already competitive on cost-to-performance ratio.

The statement about similar vulnerability counts seems like a straight-up lie. A2 found 11 vulnerabilities, 9 of them valid. A1 found 11 vulnerabilities, 6 of them valid. Counting invalid vulnerabilities to claim the cheaper agent is just as good is a weird choice.

Also the scoring is suspect and seems to be tuned specifically to give the AI a boost, heavily relying on severity scores.

Also kinda funny that the AIs were slower than all the human participants.
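For what it's worth, the quoted dollar figures are internally consistent if you assume the 16-hour runs mentioned elsewhere in the thread (the run length is my assumption; the quote itself doesn't state it):

```python
HOURS_PER_RUN = 16          # assumed, per the "16 hours" figure upthread
WORK_YEAR_HOURS = 40 * 52   # 2080 hours/year, as in the quote

a1_hourly = 291.47 / HOURS_PER_RUN  # ~$18.22/hr (quote says $18.21)
a2_hourly = 944.07 / HOURS_PER_RUN  # ~$59.00/hr

# The annualized figures follow from the rounded hourly rates:
print(round(18.21 * WORK_YEAR_HOURS))  # 37877, i.e. ~"$37,876/year"
print(round(59.00 * WORK_YEAR_HOURS))  # 122720, i.e. "$122,720/year"
```

So the arithmetic checks out to rounding; the objection above is to how the vulnerability counts were compared, not the cost figures.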

falloutx · 2 months ago
WSJ always writes in this clickbaity way, and it's getting steadily worse.

An exec is gonna read this and start salivating at the idea of replacing security teams.

cons0le · 2 months ago
We're right in the danger zone where AI isn't good enough to replace you, but it's definitely good enough to convince executives that it can.
alfalfasprout · 2 months ago
Yep, this is the real problem.
red-iron-pine · 2 months ago
they'll still get their bonus, and they dgaf if you don't have a job, because the number of goobers attending online for-profit schools for a "security degree" is endless
tptacek · 2 months ago
The particular kind of work in this report is not what most security teams do at all.
Sytten · 2 months ago
Bootstrap founder in that field here. Fully autonomous just isn't there. The winner for this "generation" will be human-in-the-loop / human augmentation, IMO. When the VC money dries up, there will be a pile of dead autonomous AI pentest companies.
tptacek · 2 months ago
Seriously: is this a meaningful distinction?
Sytten · 2 months ago
Yes, because all the valuations right now are based on a bet that this will shift a huge chunk of the service/consulting budget to an AI pentest budget. This will not happen.