A small thing, but I found the author's project-organization practices useful – creating individual .prompt files for the system prompt, background information, and auxiliary instructions [1], and then running them through `llm`.
It reveals how good LLM use, like the use of any other engineering tool, requires good engineering thinking – methodical, and oriented around thoughtful specifications that balance design constraints – for best results.
[1] https://github.com/SeanHeelan/o3_finds_cve-2025-37899
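For anyone who wants to try the same setup, here is a minimal sketch of that workflow using the `llm` library's Python API instead of the CLI. The file names and model id are illustrative, not the author's actual configuration.

```python
# Minimal sketch: stitch separate .prompt files into one request using the
# `llm` package's Python API (https://llm.datasette.io/). File names and the
# model id are illustrative.
from pathlib import Path

import llm

system_prompt = Path("system.prompt").read_text()       # role / ground rules
background = Path("background.prompt").read_text()      # target background info
instructions = Path("instructions.prompt").read_text()  # what to audit, how to report
code_context = Path("code.context").read_text()         # the source under audit

model = llm.get_model("o3")  # any model id your llm install knows about
response = model.prompt(
    "\n\n".join([background, instructions, code_context]),
    system=system_prompt,
)
print(response.text())
```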
I find your take amusing considering that's literally the only part of the post he admits to just vibing:
> In fact my entire system prompt is speculative so consider it equivalent to me saying a prayer, rather than anything resembling science or engineering
The difference between vibing and "engineering" is keeping good records, logs and prompt provenance in a methodical way? Also having a (manual) way of reviewing the results. :) (paraphrased from MythBusters)
He admits the contents of the prompt is vibing, but I think what the parent comment was admiring was the structure of breaking it down into separate files each with a single responsibility so that you could swap them out more easily. Or at least, that's what I took away from it.
One person’s vibe is another person’s dream? In my mind, the person is able to formulate a mental model complete enough to even go after vulns, unlike me, where I wouldn’t have even considered thinking about it.
How do we benchmark these different methodologies?
It all seems like vibes-based incantations. "You are an expert at finding vulnerabilities." "Please report only real vulnerabilities, not any false positives." Organizing things with made-up HTML tags because the models seem to like that for some reason. Where does engineering come into it?
The author is up front about the limitations of their prompt. They say
> In fact my entire system prompt is speculative in that I haven’t ran a sufficient number of evaluations to determine if it helps or hinders, so consider it equivalent to me saying a prayer, rather than anything resembling science or engineering. Once I have ran those evaluations I’ll let you know.
1. Having workflows to be able to provide meaningful context quickly. Very helpful.
2. Arbitrary incantations.
I think No. 2 may provide some random amounts of value with one model and not the other, but as a practitioner you shouldn't need to worry about it long-term. Patterns models pay attention to will change over time, especially as they become more capable. No. 1 is where the value is at.
Speaking from my own experience as a systems grad student, I find it a lot more useful to maintain a project wiki with LLMs in the picture. It makes coordinating with human collaborators easier too, and I just copy-paste the entire wiki before beginning a conversation. Any time I have a back-and-forth with an LLM about some design discussions that I want archived, I ask them to emit markdown which I then copy-paste into the wiki. It's not perfectly organized but it keeps the key bits there and makes generating papers etc. that much easier.
> ksmbd has too much code for it all to fit in your context window in one go.
> Therefore you are going to audit each SMB command in turn. Commands are
> handled by the __process_request function from server.c, which selects a
> command from the conn->cmds list and calls it. We are currently auditing the
> smb2_sess_setup command. The code context you have been given includes all of
> the work setup code up to the __process_request function, the
> smb2_sess_setup function and a breadth first expansion of smb2_sess_setup up
> to a depth of 3 function calls.
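The "breadth first expansion ... to a depth of 3 function calls" part is the most mechanical piece of this, and it is easy to picture as code. A rough sketch, assuming you already have a call graph extracted from the sources (via cscope, a compiler plugin, whatever); the graph and function names below are just toy stand-ins:

```python
# Rough sketch of "breadth first expansion up to a depth of 3 function calls".
# Assumes a call graph has already been extracted from the ksmbd sources;
# the graph below is a toy stand-in.
from collections import deque

def expand_context(call_graph: dict[str, list[str]], root: str,
                   max_depth: int = 3) -> list[str]:
    """Return `root` plus every function reachable within `max_depth` calls,
    in breadth-first order."""
    seen = {root}
    order = [root]
    queue = deque([(root, 0)])
    while queue:
        func, depth = queue.popleft()
        if depth == max_depth:
            continue
        for callee in call_graph.get(func, []):
            if callee not in seen:
                seen.add(callee)
                order.append(callee)
                queue.append((callee, depth + 1))
    return order

toy_graph = {
    "smb2_sess_setup": ["ksmbd_session_lookup", "ntlm_authenticate"],
    "ntlm_authenticate": ["ksmbd_free_user"],
}
print(expand_context(toy_graph, "smb2_sess_setup"))
```

Presumably the bodies of the functions returned here are what get concatenated, together with the setup code, into the "code context" the prompt refers to.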
The author deserves more credit here than just "vibing".
I usually like fear-, shame- and guilt-based prompting: "You are a frightened and nervous engineer that is very wary of doing incorrect things, so you tread cautiously and carefully, making sure everything is coherent and justifiable. You enjoy going over your previous work and checking it repeatedly for accuracy, especially after discovering new information. You are self-effacing and responsible and feel no shame in correcting yourself. Only after you've come up with a thorough plan ... "
I use these prompts everywhere. I get significantly better results mostly because it encourages backtracking and if I were to guess, enforces a higher confidence threshold before acting.
The expert engineering ones usually end up creating mountains of slop, refactoring things, and touching a bunch of code it has no business messing with.
I also have used lazy prompts: "You are positively allergic to rewriting anything that already exists. You have multiple mcps at your disposal to look for existing solutions and thoroughly read their documentation, bug reports, and git history. You really strongly prefer finding appropriate libraries instead of maintaining your own code"
> Organizing things with made-up HTML tags because the models seem to like that for some reason. Where does engineering come into it?
You just described one critical aspect of engineering: discovering a property of a system and feeding that knowledge back into a systematic, iterative process of refinement.
It’s not that difficult to benchmark these things, e.g. have an expected result and a few variants of templates (rough sketch below).
But yeah prompt engineering is a field for a reason, as it takes time and experience to get it right.
A problem with LLMs as well is that they’re inherently probabilistic, so sometimes they’ll just choose an answer with a super low probability. We’ll probably get better at this in the next few years.
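To make that concrete, here is a rough sketch of such a benchmark, again assuming the `llm` Python API; the templates, the eval case and the pass criterion are all placeholders:

```python
# Rough benchmark sketch: run each prompt variant over a tiny eval set a few
# times and count how often the expected marker shows up in the answer.
import llm

VARIANTS = {
    "plain": "Audit the following code for memory-safety bugs:\n{code}",
    "expert": "You are an expert vulnerability researcher. Audit:\n{code}",
}
CASES = [
    # (code snippet, substring a correct answer should contain)
    ("void f(char *p) { free(p); free(p); }", "double free"),
]
RUNS = 5  # models are stochastic, so average over several runs per case

model = llm.get_model("o3")  # placeholder model id
for name, template in VARIANTS.items():
    hits = 0
    for code, expected in CASES:
        for _ in range(RUNS):
            answer = model.prompt(template.format(code=code)).text()
            hits += expected.lower() in answer.lower()
    print(f"{name}: {hits}/{len(CASES) * RUNS} hits")
```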
How do you benchmark different ways to interact with employees? Neural networks are somewhere between opaque and translucent to inspection, and your only interface with them is language.
Quantitative benchmarks are not necessary anyway. A method either gets results or it doesn't.
It's amusing to me how people keep trying to apply engineering principles to an inherently unstable and unpredictable system in order to get a feeling of control.
Those prompts should be renamed as hints. Because that's all they are. Every LLM today ignores prompts if they conflict with its sole overarching goal: to give you an answer no matter whether it's true or not.
> Those prompts should be renamed as hints. [...] its sole overarching goal: to give you an answer no matter whether it's true or not.
I like to think of them as beginnings of an arbitrary document which I hope will be autocompleted in a direction I find useful... By an algorithm with the overarching "goal" of Make Document Bigger.
You’re confusing engineering with maths. You engineer your prompting to maximize the chance the LLM does what you need - in your example, the true answer - to get you closer to solving your problem. It doesn’t matter what the LLM does internally as long as the problem is being solved correctly.
(As an engineer it’s part of your job to know if the problem is being solved correctly.)
> It's amusing to me how people keep trying to apply engineering principles to an inherently unstable and unpredictable system in order to get a feeling of control.
You invoke "engineering principles", but software engineers constantly trade in likelihoods, confidence intervals, and risk envelopes. Using LLMs is no different in that respect. It's not rocket science. It's manageable.
Engineering principles are probably the best we've got when it comes to trying to work with a poorly understood system? That doesn't mean they'll work necessarily, but...
What's the alternative?
> It's amusing to me how people keep trying to apply engineering principles to an inherently unstable and unpredictable system in order to get a feeling of control.
Are you insinuating that dealing with unstable and unpredictable systems isn't somewhere engineering principles are frequently applied to solve complex problems?
Fun fact: if you ask an LLM about best practices and how to organize your prompts, it will nudge you in this direction.
It’s surprisingly effective to ask LLMs to help you write prompts as well, e.g. all my prompt snippets were designed with the help of an LLM.
I personally keep them all in an org-mode file and copy/paste them on demand in a ChatGPT chat as I prefer more “discussion”-style interactions, but the approach is the same.
Hah. Same. I have a step-by-step "reasoning" agent that asks me for confirmation after each step (understanding of problem, solutions proposed, solution selection, and final wrap) - just so it gets read back the previous prompts and answers rather than one word-salad essay.
Works incredibly well, and I created it with its own help.
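For anyone curious, the gated flow described above takes very little code. A bare-bones sketch (the `llm` Python API again; the step list, prompts and model id are placeholders, not the actual agent):

```python
# Bare-bones sketch of a step-gated "reasoning" loop: the model only moves on
# once you confirm, and every step re-reads the accumulated transcript.
import llm

STEPS = [
    "Restate your understanding of the problem.",
    "Propose a few candidate solutions.",
    "Pick one solution and justify the choice.",
    "Write up the final plan.",
]

model = llm.get_model("o3")
transcript = "Problem: " + input("Describe the problem: ")
for step in STEPS:
    answer = model.prompt(f"{transcript}\n\nNext step: {step}").text()
    print(f"\n--- {step}\n{answer}")
    if input("Accept and continue? [y/N] ").strip().lower() != "y":
        break
    transcript += f"\n\n{step}\n{answer}"  # gets re-read on the next step
```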
https://github.com/jezweb/roo-commander has something like 1700 prompts in it with 50+ prompt modes. And it seems to work pretty well. For me at least. Its task/session management is really well thought out.
Wrangling LLMs is remarkably like wrangling interns in my experience. Except that the LLM will surprise you by being both much smarter and much dumber.
The more you can frame the problem with your expertise, the better the results you will get.
The article cites a signal to noise ratio of ~1:50. The author is clearly deeply familiar with this codebase and is thus well-positioned to triage the signal from the noise. Automating this part will be where the real wins are, so I'll be watching this closely.
I’ve developed a few take-home interview problems over the years that were designed to be short, easy for an experienced developer, but challenging for anyone who didn’t know the language. All were extracted from real problems we solved on the job, reduced into something minimal.
Every time a new frontier LLM is released (excluding LLMs that use input as training data) I run the interview questions through it. I’ve been surprised that my rate of working responses remains consistently around 1:10 for the first pass, and that it often takes upwards of 10 rounds of poking to get it to find its own mistakes.
So this level of signal to noise ratio makes sense for even more obscure topics.
> challenging for anyone who didn’t know the language.
Interviewees don't get to pick the language?
If you're hiring based on proficiency in a particular tech stack, I'm curious why. Are there that many candidates that you can be this selective? Is the language so dissimilar that the uninitiated would need a long time to get up to speed? Does the job involve working on the language itself and so a specifically deep understanding is required?
We’ve been working on a system that increases signal to noise dramatically for finding bugs, and at the same time we’ve been thoroughly benchmarking the entire popular software-agent space for this.
We’ve found a wide range of results, and we have a conference talk coming up soon where we’ll be releasing everything publicly, so stay tuned for that; it’ll be pretty illuminating on the state of the space.
I was thinking about this the other day: wouldn't it be feasible to do a fine-tune or something like that on every git change, mailing list post, etc. the Linux kernel has ever had?
Wouldn't such an LLM be the closest (synthetic) version of a person who has worked on a codebase for years and learnt all its quirks, etc.?
There's only so much you can fit in a large context; some codebases are already 200k tokens just for the code as is, so I don't know.
I'd be willing to bet the sum of all code submitted via patches, ideas discussed via lists, etc doesn't come close to the true amount of knowledge collected by the average kernel developer's tinkering, experimenting, etc that never leaves their computer. I also wonder if that would lead to overfitting: the same bugs being perpetuated because they were in the training data.
I bet automating this part will be simple. In general, LLMs that have a given semantic ability "X" to do some task have a greater-than-X ability to check, among N replies about doing the same task, which reply is the best, especially if done via a binary tournament like RAInk did (it was posted here a few weeks ago). There is also the possibility to use agreement among different LLMs. I'm surprised Gemini 2.5 Pro was not used here; in my experience it is the most powerful LLM for that kind of stuff.
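A sketch of that binary-tournament idea, with the usual caveats: the judge prompt and model id are placeholders, ties and ordering bias are ignored, and this is not what RAInk or the article actually do.

```python
# Illustrative single-elimination tournament over candidate vulnerability
# reports, with an LLM as the judge of each pairing.
import llm

judge = llm.get_model("o3")  # placeholder model id

def better(a: str, b: str) -> str:
    verdict = judge.prompt(
        "Two candidate vulnerability reports follow. Reply with only 'A' or "
        f"'B' for the more plausible one.\n\n[A]\n{a}\n\n[B]\n{b}"
    ).text()
    return a if verdict.strip().upper().startswith("A") else b

def tournament(reports: list[str]) -> str:
    while len(reports) > 1:
        winners = [better(reports[i], reports[i + 1])
                   for i in range(0, len(reports) - 1, 2)]
        if len(reports) % 2:
            winners.append(reports[-1])  # odd one out gets a bye
        reports = winners
    return reports[0]

# usage: best_report = tournament(candidate_reports)
```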
Nah. I'm not an expert code auditor myself, but I've seen my colleagues do it and I've seen ChatGPT try its hand. Even when I give it a specific piece of code and probe/hint in the right direction, it produces five paragraphs of vulnerabilities, none of which are real, while overlooking the one real concern we identified. And it will do this no matter how many prompts you try or how forcefully you ask it.
You can spend all day reading slop or you can get good at this yourself and be much more efficient at this task. Especially if you're the developer and know where to look and how things work already, catching up on security issues relevant to your situation will be much faster than looking for this needle in the haystack that is LLM output.
If the LLM wrote a harness and proof of concept tests for its leads, then it might increase S/N dramatically. It’s just quite expensive to do all that right now.
> If the LLM wrote a harness and proof of concept tests for its leads, then it might increase S/N dramatically.
Designing and building meaningfully testable non-trivial software is orders of magnitude more complex than writing the business logic itself. And that’s if you compare writing greenfield code from scratch. Making an old legacy code base testable in a way conducive to finding security vulns is not something you just throw together. You can be lucky with standard tooling like sanitizers and valgrind but it’s far from a panacea.
The most interesting and significant bit of this article for me was that the author ran this search for vulnerabilities 100 times for each of the models. That's significantly more computation than I've historically been willing to expend on most of the problems that I try with large language models, but maybe I should let the models go brrrrr!
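Scripting that kind of repeated sampling is trivial, which is part of why it feels wasteful not to try it. A throwaway sketch (model id and file names are made up):

```python
# Throwaway sketch: fire the same audit prompt N times and keep every answer,
# since any individual run may miss the bug.
import llm

N = 100
model = llm.get_model("o3")
prompt = open("audit_prompt.txt").read()

for i in range(N):
    answer = model.prompt(prompt).text()
    with open(f"run_{i:03}.md", "w") as f:  # one report per run, triage later
        f.write(answer)
```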
Zero days can go for $$$, or you can go down the bug bounty route and also get $$. The cost of the LLM would be a drop in the bucket.
When the cost of inference gets near zero, I have no idea what the world of cyber security will look like, but it's going to be a very different space from today.
Except in this case the LLM was pointed at a known-to-exist vulnerability. $116 per handler per vulnerability type, and it's unknown how many vulnerabilities exist.
The "don't blame the victim" trope is valid in many contexts. This one application might be "hackers are attacking vital infrastructure, so we need to fund vulnerabilities first". And hackers use AI now, likely hacked into and for free, to discover vulnerabilities. So we must use AI!
Therefore, the hackers are contributing to global warming. We, dear reader, are innocent.
"100 times for each of the models" represents a significant amount of energy burned. The achievement of finding the most common vulnerability in C based codebases becomes less of an achievement. And more of a celebration of decadence and waste.
We are facing global climate change event, yet continue to burn resources for trivial shit like it’s 1950.
Have a problem with clear definition and evaluation function. Let LLM reduce the size of solution space. LLMs are very good at pattern reconstruction, and if the solution has a similar pattern to what was known before, it can work very well.
In this case the problem is a specific type of security vulnerability and the evaluator is the expert. This is similar in spirit to other recent endeavors where LLMs are used in genetic optimization; on a different scale.
Here’s an interesting read on “Mathematical discoveries from program search with large language models”, which I believe was also featured on HN in the past:
https://www.nature.com/articles/s41586-023-06924-6
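The loop those systems build on is tiny when written down. A hedged sketch of the generate, score, keep-the-best pattern, where the evaluator is deliberately a stub because that is the part the expert (or a test harness) has to supply:

```python
# Hedged sketch of the generate -> score -> keep-the-best loop that FunSearch
# style systems build on. The scorer is a stub; in practice it is tests, a
# fuzzer, a proof-of-concept harness, or an expert reviewer.
import llm

def score(candidate: str) -> float:
    """Placeholder evaluation function."""
    return float(len(candidate))  # obviously not a real metric

model = llm.get_model("o3")  # placeholder model id
task = "Write a heuristic for the placeholder problem."
best, best_score = "", float("-inf")
for _ in range(20):
    prompt = task if not best else f"{task}\n\nImprove on this attempt:\n{best}"
    candidate = model.prompt(prompt).text()
    s = score(candidate)
    if s > best_score:  # the evaluator, not the LLM, decides what survives
        best, best_score = candidate, s
```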
I'm not sure about the assertion that this is the first vulnerability found with an LLM. OSS-Fuzz [0], for example, has found a few using fuzzing, and Big Sleep using an agent approach [1].
[0] https://security.googleblog.com/2024/11/leveling-up-fuzzing-...
[1] https://googleprojectzero.blogspot.com/2024/10/from-naptime-...
It's certainly not the first vulnerability found with an LLM =) Perhaps I should have been more clear though.
What the post says is "Understanding the vulnerability requires reasoning about concurrent connections to the server, and how they may share various objects in specific circumstances. o3 was able to comprehend this and spot a location where a particular object that is not reference counted is freed while still being accessible by another thread. As far as I'm aware, this is the first public discussion of a vulnerability of that nature being found by a LLM."
The point I was trying to make is that, as far as I'm aware, this is the first public documentation of an LLM figuring out that sort of bug (non-trivial amount of code, bug results from concurrent access to shared resources). To me at least, this is an interesting marker of LLM progress.
Given the value of finding zero days, pretty much every intelligence agency in the world is going to be pouring money into this if it can reliably find them with just a few hundred API calls. Especially if you can fine-tune a model with lots of examples, which I don't think OpenAI, etc. are going to do with any public API.
Yeah, the amount of engineering they have around controlling (censoring) the output, along with the terms of service, creates an incentive to still look for any possible bugs, but not allow them in the output.
Certainly for Govt agencies and others this will not be a factor. It is just for everyone else. This will cause people to use other models and agents without these restrictions.
It is safe to assume that a large number of vulnerabilities exist in important software all over the place. Now they can be found. This is going to set off arms race game theory applied to computer security and hacking. Probably sooner than expected...
> In fact my entire system prompt is speculative so consider it equivalent to me saying a prayer, rather than anything resembling science or engineering
Just like Eisenhower's famous "plans are useless, planning is indispensable" quote. The muscle you build is creating new plans, not memorizing them.
One small note: concluding that the LLM is “reasoning” about code just _based on this experiment_ is a bit of a stretch IMHO.