A small thing, but I found the author's project-organization practices useful – creating individual .prompt files for the system prompt, background information, and auxiliary instructions [1], and then running them through `llm`.
It reveals how good LLM use, like the use of any other engineering tool, requires good engineering thinking – methodical, and oriented around thoughtful specifications that balance design constraints – for best results.
[1] https://github.com/SeanHeelan/o3_finds_cve-2025-37899
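For anyone who wants to try the same setup, here is a minimal sketch of that workflow using the `llm` library's Python API instead of the CLI. The file names and model id are illustrative, not the author's actual configuration.

```python
# Minimal sketch: stitch separate .prompt files into one request using the
# `llm` package's Python API (https://llm.datasette.io/). File names and the
# model id are illustrative.
from pathlib import Path

import llm

system_prompt = Path("system.prompt").read_text()       # role / ground rules
background = Path("background.prompt").read_text()      # target background info
instructions = Path("instructions.prompt").read_text()  # what to audit, how to report
code_context = Path("code.context").read_text()         # the source under audit

model = llm.get_model("o3")  # any model id your llm install knows about
response = model.prompt(
    "\n\n".join([background, instructions, code_context]),
    system=system_prompt,
)
print(response.text())
```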
I find your take amusing considering that's literally the only part of the post he admits to just vibing:
> In fact my entire system prompt is speculative so consider it equivalent to me saying a prayer, rather than anything resembling science or engineering
The difference between vibing and "engineering" is keeping good records, logs and prompt provenance in a methodical way? Also having a (manual) way of reviewing the results. :) (paraphrased from MythBusters)
He admits the contents of the prompt is vibing, but I think what the parent comment was admiring was the structure of breaking it down into separate files each with a single responsibility so that you could swap them out more easily. Or at least, that's what I took away from it.
One person’s vibe is another person’s dream? In my mind, the person is able to formulate a mental model complete enough to even go after vulns, unlike me, where I wouldn’t have even considered thinking about it.
How do we benchmark these different methodologies?
It all seems like vibes-based incantations. "You are an expert at finding vulnerabilities." "Please report only real vulnerabilities, not any false positives." Organizing things with made-up HTML tags because the models seem to like that for some reason. Where does engineering come into it?
The author is up front about the limitations of their prompt. They say
> In fact my entire system prompt is speculative in that I haven’t ran a sufficient number of evaluations to determine if it helps or hinders, so consider it equivalent to me saying a prayer, rather than anything resembling science or engineering. Once I have ran those evaluations I’ll let you know.
1. Having workflows to be able to provide meaningful context quickly. Very helpful.
2. Arbitrary incantations.
I think No. 2 may provide some random amounts of value with one model and not the other, but as a practitioner you shouldn't need to worry about it long-term. Patterns models pay attention to will change over time, especially as they become more capable. No. 1 is where the value is at.
Speaking from my own experience as a systems grad student, I find it a lot more useful to maintain a project wiki with LLMs in the picture. It makes coordinating with human collaborators easier too, and I just copy-paste the entire wiki before beginning a conversation. Any time I have a back-and-forth with an LLM about some design discussions that I want archived, I ask them to emit markdown which I then copy-paste into the wiki. It's not perfectly organized but it keeps the key bits there and makes generating papers etc. that much easier.
> ksmbd has too much code for it all to fit in your context window in one go.
> Therefore you are going to audit each SMB command in turn. Commands are
> handled by the __process_request function from server.c, which selects a
> command from the conn->cmds list and calls it. We are currently auditing the
> smb2_sess_setup command. The code context you have been given includes all of
> the work setup code up to the __process_request function, the
> smb2_sess_setup function and a breadth first expansion of smb2_sess_setup up
> to a depth of 3 function calls.
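The "breadth first expansion ... to a depth of 3 function calls" part is the most mechanical piece of this, and it is easy to picture as code. A rough sketch, assuming you already have a call graph extracted from the sources (via cscope, a compiler plugin, whatever); the graph and function names below are just toy stand-ins:

```python
# Rough sketch of "breadth first expansion up to a depth of 3 function calls".
# Assumes a call graph has already been extracted from the ksmbd sources;
# the graph below is a toy stand-in.
from collections import deque

def expand_context(call_graph: dict[str, list[str]], root: str,
                   max_depth: int = 3) -> list[str]:
    """Return `root` plus every function reachable within `max_depth` calls,
    in breadth-first order."""
    seen = {root}
    order = [root]
    queue = deque([(root, 0)])
    while queue:
        func, depth = queue.popleft()
        if depth == max_depth:
            continue
        for callee in call_graph.get(func, []):
            if callee not in seen:
                seen.add(callee)
                order.append(callee)
                queue.append((callee, depth + 1))
    return order

toy_graph = {
    "smb2_sess_setup": ["ksmbd_session_lookup", "ntlm_authenticate"],
    "ntlm_authenticate": ["ksmbd_free_user"],
}
print(expand_context(toy_graph, "smb2_sess_setup"))
```

Presumably the bodies of the functions returned here are what get concatenated, together with the setup code, into the "code context" the prompt refers to.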
The author deserves more credit here than just "vibing".
I usually like fear-, shame- and guilt-based prompting: "You are a frightened and nervous engineer that is very wary of doing incorrect things, so you tread cautiously and carefully, making sure everything is coherent and justifiable. You enjoy going over your previous work and checking it repeatedly for accuracy, especially after discovering new information. You are self-effacing and responsible and feel no shame in correcting yourself. Only after you've come up with a thorough plan ... "
I use these prompts everywhere. I get significantly better results mostly because it encourages backtracking and if I were to guess, enforces a higher confidence threshold before acting.
The expert engineering ones usually end up creating mountains of slop, refactoring things, and touching a bunch of code it has no business messing with.
I also have used lazy prompts: "You are positively allergic to rewriting anything that already exists. You have multiple mcps at your disposal to look for existing solutions and thoroughly read their documentation, bug reports, and git history. You really strongly prefer finding appropriate libraries instead of maintaining your own code"
> Organizing things with made-up HTML tags because the models seem to like that for some reason. Where does engineering come into it?
You just described one critical aspect of engineering: discovering a property of a system and feeding that knowledge back into a systematic, iterative process of refinement.
It’s not that difficult to benchmark these things, e.g. have an expected result and a few variants of templates (rough sketch below).
But yeah prompt engineering is a field for a reason, as it takes time and experience to get it right.
A problem with LLMs as well is that they’re inherently probabilistic, so sometimes they’ll just choose an answer with a super low probability. We’ll probably get better at this in the next few years.
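To make that concrete, here is a rough sketch of such a benchmark, again assuming the `llm` Python API; the templates, the eval case and the pass criterion are all placeholders:

```python
# Rough benchmark sketch: run each prompt variant over a tiny eval set a few
# times and count how often the expected marker shows up in the answer.
import llm

VARIANTS = {
    "plain": "Audit the following code for memory-safety bugs:\n{code}",
    "expert": "You are an expert vulnerability researcher. Audit:\n{code}",
}
CASES = [
    # (code snippet, substring a correct answer should contain)
    ("void f(char *p) { free(p); free(p); }", "double free"),
]
RUNS = 5  # models are stochastic, so average over several runs per case

model = llm.get_model("o3")  # placeholder model id
for name, template in VARIANTS.items():
    hits = 0
    for code, expected in CASES:
        for _ in range(RUNS):
            answer = model.prompt(template.format(code=code)).text()
            hits += expected.lower() in answer.lower()
    print(f"{name}: {hits}/{len(CASES) * RUNS} hits")
```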
How do you benchmark different ways to interact with employees? Neural networks are somewhere between opaque and translucent to inspection, and your only interface with them is language.
Quantitative benchmarks are not necessary anyway. A method either gets results or it doesn't.
It's amusing to me how people keep trying to apply engineering principles to an inherently unstable and unpredictable system in order to get a feeling of control.
Those prompts should be renamed as hints. Because that's all they are. Every LLM today ignores prompts if they conflict with its sole overarching goal: to give you an answer no matter whether it's true or not.
> Those prompts should be renamed as hints. [...] its sole overarching goal: to give you an answer no matter whether it's true or not.
I like to think of them as beginnings of an arbitrary document which I hope will be autocompleted in a direction I find useful... By an algorithm with the overarching "goal" of Make Document Bigger.
You’re confusing engineering with maths. You engineer your prompting to maximize the chance the LLM does what you need - in your example, the true answer - to get you closer to solving your problem. It doesn’t matter what the LLM does internally as long as the problem is being solved correctly.
(As an engineer it’s part of your job to know if the problem is being solved correctly.)
> It's amusing to me how people keep trying to apply engineering principles to an inherently unstable and unpredictable system in order to get a feeling of control.
You invoke "engineering principles", but software engineers constantly trade in likelihoods, confidence intervals, and risk envelopes. Using LLMs is no different in that respect. It's not rocket science. It's manageable.
Engineering principles are probably the best we've got when it comes to trying to work with a poorly understood system? That doesn't mean they'll work necessarily, but...
What's the alternative?
> It's amusing to me how people keep trying to apply engineering principles to an inherently unstable and unpredictable system in order to get a feeling of control.
Are you insinuating that dealing with unstable and unpredictable systems isn't somewhere engineering principles are frequently applied to solve complex problems?
Fun fact: if you ask an LLM about best practices and how to organize your prompts, it will nudge you in this direction.
It’s surprisingly effective to ask LLMs to help you write prompts as well, e.g. all my prompt snippets were designed with the help of an LLM.
I personally keep them all in an org-mode file and copy/paste them on demand in a ChatGPT chat as I prefer more “discussion”-style interactions, but the approach is the same.
Hah. Same. I have a step-by-step "reasoning" agent that asks me for confirmation after each step (understanding of problem, solutions proposed, solution selection, and final wrap) - just so it gets read back the previous prompts and answers rather than one word-salad essay.
Works incredibly well, and I created it with its own help.
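For anyone curious, the gated flow described above takes very little code. A bare-bones sketch (the `llm` Python API again; the step list, prompts and model id are placeholders, not the actual agent):

```python
# Bare-bones sketch of a step-gated "reasoning" loop: the model only moves on
# once you confirm, and every step re-reads the accumulated transcript.
import llm

STEPS = [
    "Restate your understanding of the problem.",
    "Propose a few candidate solutions.",
    "Pick one solution and justify the choice.",
    "Write up the final plan.",
]

model = llm.get_model("o3")
transcript = "Problem: " + input("Describe the problem: ")
for step in STEPS:
    answer = model.prompt(f"{transcript}\n\nNext step: {step}").text()
    print(f"\n--- {step}\n{answer}")
    if input("Accept and continue? [y/N] ").strip().lower() != "y":
        break
    transcript += f"\n\n{step}\n{answer}"  # gets re-read on the next step
```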
https://github.com/jezweb/roo-commander has something like 1700 prompts in it with 50+ prompt modes. And it seems to work pretty well. For me at least. Its task/session management is really well thought out.
Wrangling LLMs is remarkably like wrangling interns in my experience. Except that the LLM will surprise you by being both much smarter and much dumber.
The more you can frame the problem with your expertise, the better the results you will get.
The article cites a signal to noise ratio of ~1:50. The author is clearly deeply familiar with this codebase and is thus well-positioned to triage the signal from the noise. Automating this part will be where the real wins are, so I'll be watching this closely.
I’ve developed a few take-home interview problems over the years that were designed to be short, easy for an experienced developer, but challenging for anyone who didn’t know the language. All were extracted from real problems we solved on the job, reduced into something minimal.
Every time a new frontier LLM is released (excluding LLMs that use input as training data) I run the interview questions through it. I’ve been surprised that my rate of working responses remains consistently around 1:10 for the first pass, and that it often takes upwards of 10 rounds of poking to get it to find its own mistakes.
So this level of signal to noise ratio makes sense for even more obscure topics.
> challenging for anyone who didn’t know the language.
Interviewees don't get to pick the language?
If you're hiring based on proficiency in a particular tech stack, I'm curious why. Are there that many candidates that you can be this selective? Is the language so dissimilar that the uninitiated would need a long time to get up to speed? Does the job involve working on the language itself and so a specifically deep understanding is required?
We’ve been working on a system that increases signal to noise dramatically for finding bugs, and at the same time we’ve been thoroughly benchmarking the entire popular software-agent space for this.
We’ve found a wide range of results, and we have a conference talk coming up soon where we’ll be releasing everything publicly, so stay tuned for that; it’ll be pretty illuminating on the state of the space.
I was thinking about this the other day: wouldn't it be feasible to do a fine-tune or something like that on every git change, mailing list post, etc. the Linux kernel has ever had?
Wouldn't such an LLM be the closest (synthetic) version of a person who has worked on a codebase for years and learnt all its quirks, etc.?
There's only so much you can fit in a large context; some codebases are already 200k tokens just for the code as is, so I don't know.
I'd be willing to bet the sum of all code submitted via patches, ideas discussed via lists, etc doesn't come close to the true amount of knowledge collected by the average kernel developer's tinkering, experimenting, etc that never leaves their computer. I also wonder if that would lead to overfitting: the same bugs being perpetuated because they were in the training data.
I bet automating this part will be simple. In general, LLMs that have a given semantic ability "X" to do some task have a greater-than-X ability to check, among N replies about doing the same task, which reply is the best, especially if done via a binary tournament like RAInk did (it was posted here a few weeks ago). There is also the possibility to use agreement among different LLMs. I'm surprised Gemini 2.5 Pro was not used here; in my experience it is the most powerful LLM for that kind of stuff.
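A sketch of that binary-tournament idea, with the usual caveats: the judge prompt and model id are placeholders, ties and ordering bias are ignored, and this is not what RAInk or the article actually do.

```python
# Illustrative single-elimination tournament over candidate vulnerability
# reports, with an LLM as the judge of each pairing.
import llm

judge = llm.get_model("o3")  # placeholder model id

def better(a: str, b: str) -> str:
    verdict = judge.prompt(
        "Two candidate vulnerability reports follow. Reply with only 'A' or "
        f"'B' for the more plausible one.\n\n[A]\n{a}\n\n[B]\n{b}"
    ).text()
    return a if verdict.strip().upper().startswith("A") else b

def tournament(reports: list[str]) -> str:
    while len(reports) > 1:
        winners = [better(reports[i], reports[i + 1])
                   for i in range(0, len(reports) - 1, 2)]
        if len(reports) % 2:
            winners.append(reports[-1])  # odd one out gets a bye
        reports = winners
    return reports[0]

# usage: best_report = tournament(candidate_reports)
```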
Nah. I'm not an expert code auditor myself, but I've seen my colleagues do it and I've seen ChatGPT try its hand. Even when I give it a specific piece of code and probe/hint in the right direction, it produces five paragraphs of vulnerabilities, none of which are real, while overlooking the one real concern we identified. And it will do this no matter how many prompts you try or how forcefully you ask it.
You can spend all day reading slop or you can get good at this yourself and be much more efficient at this task. Especially if you're the developer and know where to look and how things work already, catching up on security issues relevant to your situation will be much faster than looking for this needle in the haystack that is LLM output.
If the LLM wrote a harness and proof of concept tests for its leads, then it might increase S/N dramatically. It’s just quite expensive to do all that right now.
> If the LLM wrote a harness and proof of concept tests for its leads, then it might increase S/N dramatically.
Designing and building meaningfully testable non-trivial software is orders of magnitude more complex than writing the business logic itself. And that’s if you compare writing greenfield code from scratch. Making an old legacy code base testable in a way conducive to finding security vulns is not something you just throw together. You can be lucky with standard tooling like sanitizers and valgrind but it’s far from a panacea.
The most interesting and significant bit of this article for me was that the author ran this search for vulnerabilities 100 times for each of the models. That's significantly more computation than I've historically been willing to expend on most of the problems that I try with large language models, but maybe I should let the models go brrrrr!
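Scripting that kind of repeated sampling is trivial, which is part of why it feels wasteful not to try it. A throwaway sketch (model id and file names are made up):

```python
# Throwaway sketch: fire the same audit prompt N times and keep every answer,
# since any individual run may miss the bug.
import llm

N = 100
model = llm.get_model("o3")
prompt = open("audit_prompt.txt").read()

for i in range(N):
    answer = model.prompt(prompt).text()
    with open(f"run_{i:03}.md", "w") as f:  # one report per run, triage later
        f.write(answer)
```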
Zero days can go for $$$, or you can go down the bug bounty route and also get $$. The cost of the LLM would be a drop in the bucket.
When the cost of inference gets near zero, I have no idea what the world of cyber security will look like, but it's going to be a very different space from today.
Except in this case the LLM was pointed at a known-to-exist vulnerability. $116 per handler per vulnerability type, and it's unknown how many vulnerabilities exist.
The "don't blame the victim" trope is valid in many contexts. This one application might be "hackers are attacking vital infrastructure, so we need to fund vulnerabilities first". And hackers use AI now, likely hacked into and for free, to discover vulnerabilities. So we must use AI!
Therefore, the hackers are contributing to global warming. We, dear reader, are innocent.
"100 times for each of the models" represents a significant amount of energy burned. The achievement of finding the most common vulnerability in C based codebases becomes less of an achievement. And more of a celebration of decadence and waste.
We are facing global climate change event, yet continue to burn resources for trivial shit like it’s 1950.
Have a problem with clear definition and evaluation function. Let LLM reduce the size of solution space. LLMs are very good at pattern reconstruction, and if the solution has a similar pattern to what was known before, it can work very well.
In this case the problem is a specific type of security vulnerability and the evaluator is the expert. This is similar in spirit to other recent endeavors where LLMs are used in genetic optimization; on a different scale.
Here’s an interesting read on “Mathematical discoveries from program search with large language models”, which I believe was also featured on HN in the past:
https://www.nature.com/articles/s41586-023-06924-6
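The loop those systems build on is tiny when written down. A hedged sketch of the generate, score, keep-the-best pattern, where the evaluator is deliberately a stub because that is the part the expert (or a test harness) has to supply:

```python
# Hedged sketch of the generate -> score -> keep-the-best loop that FunSearch
# style systems build on. The scorer is a stub; in practice it is tests, a
# fuzzer, a proof-of-concept harness, or an expert reviewer.
import llm

def score(candidate: str) -> float:
    """Placeholder evaluation function."""
    return float(len(candidate))  # obviously not a real metric

model = llm.get_model("o3")  # placeholder model id
task = "Write a heuristic for the placeholder problem."
best, best_score = "", float("-inf")
for _ in range(20):
    prompt = task if not best else f"{task}\n\nImprove on this attempt:\n{best}"
    candidate = model.prompt(prompt).text()
    s = score(candidate)
    if s > best_score:  # the evaluator, not the LLM, decides what survives
        best, best_score = candidate, s
```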
I'm not sure about the assertion that this is the first vulnerability found with an LLM. OSS-Fuzz [0], for example, has found a few using fuzzing, and Big Sleep using an agent approach [1].
[0] https://security.googleblog.com/2024/11/leveling-up-fuzzing-...
[1] https://googleprojectzero.blogspot.com/2024/10/from-naptime-...
It's certainly not the first vulnerability found with an LLM =) Perhaps I should have been more clear though.
What the post says is "Understanding the vulnerability requires reasoning about concurrent connections to the server, and how they may share various objects in specific circumstances. o3 was able to comprehend this and spot a location where a particular object that is not reference counted is freed while still being accessible by another thread. As far as I'm aware, this is the first public discussion of a vulnerability of that nature being found by a LLM."
The point I was trying to make is that, as far as I'm aware, this is the first public documentation of an LLM figuring out that sort of bug (non-trivial amount of code, bug results from concurrent access to shared resources). To me at least, this is an interesting marker of LLM progress.
Given the value of finding zero days, pretty much every intelligence agency in the world is going to be pouring money into this if it can reliably find them with just a few hundred API calls. Especially if you can fine-tune a model with lots of examples, which I don't think OpenAI, etc. are going to do with any public API.
Yeah, the amount of engineering they have around controlling (censoring) the output, along with the terms of service, creates an incentive to still look for any possible bugs, but not allow them in the output.
Certainly for Govt agencies and others this will not be a factor. It is just for everyone else. This will cause people to use other models and agents without these restrictions.
It is safe to assume that a large number of vulnerabilities exist in important software all over the place. Now they can be found. This is going to set off arms race game theory applied to computer security and hacking. Probably sooner than expected...
> In fact my entire system prompt is speculative so consider it equivalent to me saying a prayer, rather than anything resembling science or engineering
Just like Eisenhower's famous "plans are useless, planning is indispensable" quote. The muscle you build is creating new plans, not memorizing them.
One small note: concluding that the LLM is “reasoning” about code just _based on this experiment_ is a bit of a stretch IMHO.