_pdp_ · 2 years ago
While somewhat off-topic, I had an interesting experience highlighting the utility of GitHub's Copilot the other day. I decided to run Copilot on a piece of code functioning correctly to see if it would identify any non-existent issues. Surprisingly, it managed to pinpoint an actual bug. Following this discovery, I asked Copilot to generate a unit test to better understand the identified issue. Upon running the test, the program crashed just as Copilot had predicted. I then refactored the problematic lines as per Copilot's suggestions. This was my first time witnessing the effectiveness of Copilot in such a scenario, which provided small yet significant proof to me that Language Models can be invaluable tools for coding, capable of identifying and helping to resolve real bugs. Although they may have limitations, I believe any imperfections are merely temporary hurdles toward more robust coding assistants.
WanderPanda · 2 years ago
Copilot at its present capabilities is already so valuable that not having it in some environment gives me the "disabledness feeling" that I otherwise only get when vim bindings are not enabled. Absolute miracle technology! I'm sure in the not-too-distant future we'll have privacy-preserving, open-source versions that are good enough that we don't have to shovel everything over to OpenAI.
jgalt212 · 2 years ago
> shovel everything over to openai

Seriously, if you're a niche market with specific know-how, the easiest way to broadly propagate this know-how is to use Copilot.

sangnoir · 2 years ago
That sounds like very basic code review - which I guess is useful in instances where one can't get a review from a human. If it has a low enough false-positive rate, it could be great as a background CI/CD bot that chimes in on the PR/changeset comments to say "You may have a bug here".
TheBlight · 2 years ago
One nice thing about a machine reviewing code is that there are no tedious passive-aggressive interactions, no subjective style feedback you feel compelled to accept, etc.
ceedan · 2 years ago
Discovering a bug and reproducing it via unit tests is very different from "a very basic code review".
hluska · 2 years ago
That is nothing like a ‘very basic code review.’ The LLM discovered a bug and reproduced it via a test.
chrisco255 · 2 years ago
Try it on a million-line code base, where it's not so cut and dried to even determine whether the code is running correctly, or what "correctly" means when it changes day to day.
Closi · 2 years ago
"A tool is only useful if I can use it in every situation".

LLMs don't need to find every bug in your code - even if they found an additional 10% of genuine bugs compared to existing tools, that would still be a pretty big improvement to code analysis.

In reality, I suspect the scope is much higher than 10%.

yjftsjthsd-h · 2 years ago
Is it better or worse than a human, though?
dzhiurgis · 2 years ago
I separated a 5,000-line class into smaller domains yesterday. It didn't provide an end solution and it wasn't perfect, but it gave me a good plan for where to place what.

Once it is capable of processing larger context windows it will become impossible to ignore.

ushakov · 2 years ago
You can't; it has a context window of 8192 tokens. That's roughly 1000 lines, depending on the programming language.
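
A rough way to sanity-check that "1000 lines" estimate is to count tokens per line for a real source file. A minimal sketch, assuming OpenAI's cl100k_base encoding (via the tiktoken library) is a reasonable proxy for whatever tokenizer Copilot actually uses - that part is an assumption:

    # Estimate whether a source file fits in an 8192-token context window.
    # Assumption: cl100k_base is only a stand-in for Copilot's real tokenizer.
    import sys
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    with open(sys.argv[1], encoding="utf-8") as f:
        lines = f.readlines()

    tokens = sum(len(enc.encode(line)) for line in lines)
    per_line = tokens / max(len(lines), 1)
    print(f"{len(lines)} lines, {tokens} tokens, ~{per_line:.1f} tokens/line")
    print("fits in an 8192-token window:", tokens <= 8192)
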
ushakov · 2 years ago
That's rather an exception in my experience. For unit tests it starts hallucinating hard once you have functions imported from other files. This is probably the reason most unit tests in their marketing materials are things like Fibonacci…
rripken · 2 years ago
How did you prompt Copilot to identify issues? In my experience the best I can do is put in code comments describing what I want a snippet to do, and Copilot tries to write it. I haven't had good luck asking Copilot to rewrite existing code. The nearest I've gotten is:

    // method2 is identical to method1 except it fixes the bugs
    public void method2() {

crazysim · 2 years ago
Might be using the Copilot chat feature.
ChatGTP · 2 years ago
These things are amazing when you first experience them, but I think in most cases the user fails to realise how common their particular bug is. But then you also need to realise there may be bugs in what has been suggested. We all know there are issues with Stack Overflow responses too.

Probably 85% of codebases are just rehashes of the same stuff. Copilot has seen it all, I guess.

pylua · 2 years ago
This is a great use of AI. In all seriousness, I can't wait for the day it gets added to Spring as a plugin.
evrimoztamur · 2 years ago
If not malicious, then this shows that there are people out there who don't quite know how much to rely on LLMs or understand the limits of their capabilities. It's distressing.
jerf · 2 years ago
I can also attest as a moderator that there is some set of people out there who use LLMs, knowingly use LLMs, and will lie to your face that they aren't and aggressively argue about it.

The only really new aspect of that is the LLM part. The set of people who will, truly bizarrely, lie about total irrelevancies to people on the Internet even when they are fooling absolutely no one has always been small but non-zero.

filterfiber · 2 years ago
The average person sadly just hears the marketed "artificial intelligence" and doesn't grasp that it simply predicts text.

It's really good at predicting text we like, but that's all it does.

It shouldn't be surprising that sometimes its prediction is either wrong or unwanted.

Interestingly even intelligent, problem solving, educated humans "incorrectly predict" all the time.

yukkuri · 2 years ago
Marketing is lying as much as you can without going to jail for it.
kordlessagain · 2 years ago
> It's really good at predicting text we like, but that's all it does.

It's important to recognize that predicting text is not merely about guessing the next letter or word, but rather a complex set of probabilities grounded in language and context. When we look at language, we might see intricate relationships between letters, words, and ideas.

Starting with individual letters, like 't,' we can assign probabilities to their occurrence based on the language and alphabet we've studied. These probabilities enable us to anticipate the next character in a sequence, given the context and our familiarity with the language.

As we move to words, they naturally follow each other in a logical manner, contingent on the context. For instance, in a discussion about electronics, the likelihood of "effect" following "hall" is much higher than in a discourse about school buildings. These linguistic probabilities become even more pronounced when we construct sentences. One type of sentence tends to follow another, and the arrangement of words within them becomes predictable to some extent, again based on the context and training data.
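
As a toy illustration of those word-level probabilities: count which word follows which in a small corpus, then predict the most likely next word from a given context. This is only a bigram sketch over a made-up two-sentence corpus, nothing like a real LLM, but it shows the same idea of probabilities grounded in language and context:

    # Toy bigram model: count which word follows which in a tiny corpus.
    from collections import Counter, defaultdict

    corpus = (
        "the hall effect sensor measures the magnetic field "
        "the school hall was full of students"
    ).split()

    following = defaultdict(Counter)
    for current, nxt in zip(corpus, corpus[1:]):
        following[current][nxt] += 1  # how often `nxt` follows `current`

    def predict(word):
        """Return the most likely next word and its estimated probability."""
        counts = following[word]
        nxt, count = counts.most_common(1)[0]
        return nxt, count / sum(counts.values())

    print(predict("hall"))  # ('effect', 0.5): 'effect' and 'was' each follow 'hall' half the time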

Nevertheless, it's not only about probabilities and prediction. Language models, such as Large Language Models (LLMs), possess a capacity that transcends mere prediction. They can grapple with 'thoughts'—an abstract concept that may not always be apparent but is undeniably a part of their functionality. These 'thoughts' can manifest as encoded 'ideas' or concepts associated with the language they've learned.

It may be true that LLMs predict the next "thought" based on the corpus they were trained on, but that's not to say they can generalize this behavior past the "ideas" they were trained on. I'm not claiming generalized intelligence exists, yet.

Much like how individual letters and words combine to create variables and method names in coding, the 'ideas' encoded within LLMs become the building blocks for complex language behavior. These ideas have varying weights and connections, and as a result, they can generate intricate responses. So, while the outcome may sometimes seem random, it's rooted in the very real complex interplay of ideas and their relationships, much like the way methods and variables in code are structured by the 'idea' they represent when laid out in a logical manner.

Language is a means to communicate thought, so it's not a huge surprise that words, used correctly, might convey an idea someone else can "process", and that likely includes LLMs. That we get so much useful content from LLMs is a good indication that they are dealing with "ideas" now, not just letters and words.

I realize that people are currently struggling with whether or not LLMs can "reason". For as many times as I've thought it was reasoning, I'm sure there are many times it wasn't reasoning well. But, did it ever "reason" at all, or was that simply an illusion, or happy coincidence based on probability?

The rub with the word "reasoning" is that it directly involves "being logical," and how we humans arrive at being logical is a bit of a mystery. It's logical to think a cat can't jump higher than a tree, but what if it was a very small tree? The ability to reason about cats' jumping abilities doesn't require understanding that trees come in different heights, rather that when we refer to "tree" we mean "something tall." So, reasoning has "shortcuts" to arrive at an answer about a thing, without weighing all of that thing's probabilities. For whatever reason, most humans won't argue with you about tree height at that point and will just reply, "No, cats can't jump higher than a tree, but they can climb it." By adding the latter part, they are not arguing the point, but rather ensuring that no one can pigeonhole their idea of the truth of the matter.

Maybe when LLMs get as squirrely as humans in their thinking we'll finally admit they really do "reason".

keithnoizu · 2 years ago
How is that fundamentally different from how our brain chains together thoughts when not actively engaged in meta-thinking? Especially once chain-of-thought etc. is applied.
masklinn · 2 years ago
It seems very similar to the case of the lawyers who used an LLM as a case-law search engine. The LLM spat out bogus cases, and then, when the judge asked them to produce the cases because the references led nowhere, they asked the LLM to produce the cases, which it "did".
dotnet00 · 2 years ago
Or, similarly, the case where a professor failed an entire class of students (resulting in their diplomas being denied) for cheating on their essays using AI, because he asked an LLM whether the essays were AI-generated and it said yes.
tomjen3 · 2 years ago
We don't know what we can do with it yet, and we don't understand the limits of its capabilities. Ethan Mollick calls it the jagged frontier[0], and that may be as good a metaphor as any. Obviously a frontier has to be explored, but the nature of that is that most of the time you are on one side or the other of it.

[0]: https://www.oneusefulthing.org/p/centaurs-and-cyborgs-on-the...

blibble · 2 years ago
It's sad; this kind of behaviour is going to DDoS every aspect of society into the ground.
nicman23 · 2 years ago
if that is all it takes, then good
abnercoimbre · 2 years ago
> you did not find anything worthy of reporting. You were fooled by an AI into believing that.

The author's right. Reading the report I was stunned; the person disclosing the so-called vulnerability said:

> To replicate the issue, I have searched in the Bard about this vulnerability.

Does Bard clearly warn users never to rely on it for facts? I know OpenAI says "ChatGPT may give you inaccurate information" at the start of each session.

xd1936 · 2 years ago
Oh yeah. Google has warnings like "Bard may display inaccurate or offensive information that doesn’t represent Google’s views" all over it: permanently in the footer, on splash pages, etc.
abnercoimbre · 2 years ago
Well Google has a branding problem with Bard... because everyone knows Google for search. "Surely Bard must be a reliable engine too."
filterfiber · 2 years ago
> Does Bard clearly warn to never rely on it for facts? I know OpenAI says "ChatGPT may give you inaccurate information" at the start of each session.

I know I shouldn't be, but I'm surprised the disclosure is even needed. People clearly don't understand how LLMs work -

LLMs predict text. That's it; they're glorified autocomplete (that's really good). When their prediction is wrong we call it a "hallucination" for some reason. Humans do the same thing all the time. Of course it's not always correct!

masklinn · 2 years ago
> People clearly don't understand how LLMs work -

Of course not. Most developers don't understand how LLMs work, even roughly.

> Humans do the same thing all the time. Of course it's not always correct!

The difference is that LLMs cannot acknowledge incompetence, are always confidently incorrect, and will never reach a stopping point; at best they'll start going in circles.

ImAnAmateur · 2 years ago
There's a second wind to this story in the Mastodon replies. It sounds like the LLM appeared to be basing this output on a CVE that hadn't yet been made public, implying that it had access to text that wasn't public. I can't quite tell if that's an accurate interpretation of what I'm reading.

>> @bagder it’s all the weirder because they aren’t even trying to report a new vulnerability. Their complaint seems to be that detailed information about a “vulnerability” is public. But that’s how public disclosure works? And open source? Like are they going to start submitting blog posts of vulnerability analysis and ask curl maintainers to somehow get the posts taken down???

>> @derekheld they reported this before that vulnerability was made public though

>> @bagder oh as in saying the embargo was broken but with LLM hallucinations as the evidence?

>> @derekheld something like that yes

jameshart · 2 years ago
Took me a while to figure out from the toot thread and comment history, but it appears that the curl 8.4.0 release notes (https://daniel.haxx.se/blog/2023/10/11/curl-8-4-0/) referred to the fact that it included a fix for an undisclosed CVE (CVE-2023-38545); the reporter ‘searched in Bard’ for information about that CVE and was given hallucinated details utterly unrelated to the actual curl issue.

The reporter was complaining that they thought this constituted a premature leak of a pre-disclosure CVE, and was reporting it as a security issue to curl via HackerOne.

Arnavion · 2 years ago
No, it's not that Bard was trained on information that wasn't public. It's that the author of the report thought that the information about the upcoming CVE was public somewhere because Bard was reproducing it, because the author thinks Bard is a search engine. So they filed a report that the curl devs should take that information offline until the embargo is lifted.
sfink · 2 years ago
Which is a fair request. Perhaps Bard should be taken offline.

The curl devs might even be the right ones to do it, if they slipped a DDoS into the code...

orf · 2 years ago
> I responsibly disclosed the information as soon as I found it. I believe there is a better way to communicate to the researchers, and I hope that the curl staff can implement it for future submissions to maintain a better relationship with the researcher community. Thank you!

… yeah…

HtmlProgrammer · 2 years ago
Poor fella was embarrassed and looking to throw anything back at them
hypeatei · 2 years ago
It looks like AI generated that response.
openasocket · 2 years ago
I was curious how many bogus security reports big open source projects get. If you go to https://hackerone.com/curl/hacktivity and scroll down to the ones marked as "Not-applicable" you can find some additional examples. No other LLM hallucinations, but some pretty poorly thought-out "bugs".
chrsig · 2 years ago
Perhaps not useful to the conversation, but I really wish that whoever coined the term 'hallucination' for this behavior had consulted a dictionary first.

It's delusional, not hallucinated.

Delusions are the irrational holdings of false belief, especially after contrary evidence has been provided.

Hallucinations are false sensations or perceptions of things that do not exist.

May some influential ML person read this and start to correct the vocabulary in the field :)

65a · 2 years ago
Confabulation seems better aligned to neuropsychology, as far as I can tell: https://en.wikipedia.org/wiki/Confabulation
garba_dlm · 2 years ago
cool, the scientific name for studying gaslighting!
nikanj · 2 years ago
That one is not part of everyone’s vocabulary
SrslyJosh · 2 years ago
LLMs do not have beliefs, so "delusion" is no better than "hallucination". As statistical models of texts, LLMs do not deal in facts, beliefs, logic, etc., so anthropomorphizing them is counter-productive.

An LLM is doing the exact same thing when it generates "correct" text that it's doing when it generates "incorrect" text: repeatedly choosing the most likely next token based on a sequence of input and the weights it learned from training data. The meaning of the tokens is irrelevant to the process. This is why you cannot trust LLM output.
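
That loop - feed in the tokens so far, take the most probable next token, append it, repeat - is the whole generation process. A minimal sketch of it, using the small open GPT-2 model via the Hugging Face transformers library purely for illustration (an assumption: it is not the model behind Bard or ChatGPT), and greedy argmax decoding where production systems usually sample:

    # Greedy next-token decoding: repeatedly pick the most likely next token.
    # GPT-2 here is just a small stand-in model, not what Bard/ChatGPT use.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    model.eval()

    ids = tok("The curl project is", return_tensors="pt").input_ids
    for _ in range(20):
        with torch.no_grad():
            logits = model(ids).logits       # a score for every vocabulary token
        next_id = logits[0, -1].argmax()     # most likely next token; meaning plays no part
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)

    print(tok.decode(ids[0]))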

bregma · 2 years ago
I think the right word is "bullshit". LLMs are neither delusional nor hallucinating since they have no beliefs or sensory input. The just generate loads of fertilizer and a lot of people like to spread it around.
wlonkly · 2 years ago
I've been calling it bullshit too, because the thing about bullshitting is that the truth is irrelevant to a good story.
asadotzler · 2 years ago
This is the correct answer. It's not a hallucination. Its goal is to create something that seems like the truth despite the fact that it has no idea whether it's actually being truthful. If a human were doing this we'd call them a bullshitter or, if they were good at it, maybe even a bullshit artist.
SkyPuncher · 2 years ago
I think it’s appropriate.

Delusion tends to describe a state of being, as in "delusional." Hallucination tends to be used to describe an instance or finite event.

Broadly, LLMs are not delusional, but they do perceive false information.

Brian_K_White · 2 years ago
The LLM has neither of these, so neither term is more correct or incorrect than the other.
jazzyjackson · 2 years ago
IMHO it's fine to have a certain jargon within the context of "things neural nets do"; the term comes from the days of Deep Dream, when image classifiers were run in reverse and introduced the public to computer-generated images that were quite psychedelic in nature. It's seeing things that aren't there.
inopinatus · 2 years ago
LLMs don’t hold beliefs. Believing otherwise is itself a delusion.

In addition, the headline posted here doesn’t even say hallucinated, so that is also an hallucination. It says hallucineted. As portmanteaux go, that ain’t bad. I rather like the sense of referring to LLMs as hallucinets.

dylan604 · 2 years ago
The phrase "you must be trippin'!" is commonly used when someone says something completely nonsensical. I can easily see how/why "hallucinating" was chosen.

It's clearly meant to poke fun at the system. If you think people are going to NOT use words in jest while making fun of something, perhaps you could use a little less starch in your clothing.

giantrobot · 2 years ago
I prefer the term confabulation. To the AI the made up thing isn't necessarily irrational. It's in fact very rational, simply incorrect.
asadotzler · 2 years ago
AKA bullshit. It is a bullshitter or a bullshit artist, virtually synonymous with "confabulator."
user_7832 · 2 years ago
I propose using delirium/delirious to describe the software.

Deleted Comment

seniorsassycat · 2 years ago
So the reporter thinks they were able to get accurate info about the private details of an embargoed CVE from Bard. If correct, they would have found a CVE in Bard, not in curl.

In this case the curl maintainers can tell the details are made up and don't correspond to any CVE.

nikanj · 2 years ago
MITRE would probably still file a 10.0 CVE based on this report