Everyone in the comments seems to be arguing over the semantics of the words and anthropomorphization of LLMs. Putting that aside, there is a real problem with this approach that lies at the mathematical level.
For any given input text, there is a corresponding output text distribution (e.g. the probabilities of all words in a sequence which the model draws samples from).
The problem with drawing several samples and evaluating the entropy and/or disagreement between those draws is that it relies on already knowing the properties of the output distribution. It may be perfectly legitimate for one distribution to be much more uniformly random than another that has high certainty. It's not clear to me that they have demonstrated the underlying assumption.
Take for example celebrity info, "What is Tom Cruise known for?". The phrases "movie star", "katie holmes", "topgun", and "scientology" are all quite different in terms of their location in the word vector space, and would result in low semantic similarity, but are all accurate outputs.
On the other hand, "What is Taylor Swift known for?" the answers "standup comedy", "comedian", and "comedy actress" are semantically similar but represent hallucinations. Without knowing the distribution characteristics (e.g. multivariate moments and estimates) we couldn't say for certain these are correct merely by their proximity in vector space.
As some have pointed out in this thread, knowing the correct distribution of word sequences for a given input sequence is the very job the LLM is solving, so there is no way of evaluating the output distribution to determine its correctness.
There are actual statistical models to evaluate the amount of uncertainty in output from ANNs (albeit a bit limited), but they are probably not feasible at the scale of LLMs. Perhaps a layer or two could be used to create a partial estimate of uncertainty (e.g. final 2 layers), but this would be a severe truncation of overall network uncertainty.
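As one concrete illustration of the kind of partial, final-layer uncertainty estimate gestured at above, here is a hedged sketch using Monte Carlo dropout over a final projection head. This is not the paper's method; `head`, the layer sizes, and the sampling loop are all hypothetical stand-ins.

```python
# Hypothetical sketch: estimate uncertainty from the final layer only, via
# Monte Carlo dropout. A partial estimate, not the method from the paper;
# `hidden` would be the frozen LLM's last hidden state.
import torch
import torch.nn as nn

head = nn.Sequential(
    nn.Dropout(p=0.1),       # kept stochastic at inference time
    nn.Linear(4096, 32000),  # assumed hidden size / vocabulary size
)

def last_layer_uncertainty(hidden: torch.Tensor, n_samples: int = 20) -> torch.Tensor:
    head.train()             # keep dropout active
    with torch.no_grad():
        probs = torch.stack([
            torch.softmax(head(hidden), dim=-1) for _ in range(n_samples)
        ])
    # Variance across dropout samples as a (partial) uncertainty signal.
    return probs.var(dim=0).mean(dim=-1)
```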
Another reason I mention this is that most hallucinations I encounter are very plausible and often close to the right thing (swapping a variable name, confabulating a config key); they appear very convincing and "in sample", but are actually incorrect.
> On the other hand, "What is Taylor Swift known for?" the answers "standup comedy", "comedian", and "comedy actress" are semantically similar but represent hallucinations. Without knowing the distribution characteristics (e.g. multivariate moments and estimates) we couldn't say for certain these are correct merely by their proximity in vector space.
It depends on the fact that a high-uncertainty answer is, by definition, less probable. That means if you ask multiple times you will not get the same unlikely answer, such as that Taylor Swift is a comedian; you will instead get several semantically different answers.
Maybe you’re saying the same thing, but if so I’m missing the problem. If your training data tells you that Taylor Swift is known as a comedian, then hallucinations are not your problem.
It was just a contrived example to illustrate that low variance in the response distribution doesn't necessarily indicate accuracy. Just indicating that "hallucination" is a different axis from "generates different responses", though they might not be totally orthogonal.
A better example might be that the model overtrained on the AWS CloudFormation API v2, and when v3 comes out it produces low-entropy answers that are wrong for v3 but right for v2 (due to training bias), yet the answers are low variance (e.g. "bucket" instead of the new "bucket_name" key).
Another example based on a quick test I did on GPT4:
In a single phrase, what is Paris?
Paris is the city of Light.
Paris is the capital of France.
Paris is the romantic capital of the world renowned for its art, fashion, and culture.
This. For a model to consistently output that Taylor Swift is a comedian or something similarly wrong at reasonable temperature settings, there must be a problem in the training data. That doesn't mean that "Taylor Swift is a comedian" needs to be in the training data; it can simply mean that "Taylor Swift" doesn't appear at all. Then "singer" and "comedian" (and tons of other options) will likely appear at similar probabilities during generation.
Projecting human semantics onto LLMs is generally a bad idea, since we only use human semantics to qualitatively explain how the models abstract ideas. In practice you simply don't know how the model relates words.
You seem to have explained, in much more technical terms, what my "computer-engineering-without-maths" brain tells me.
To me this sounds very similar to lowering the temperature. It doesn't sound like it pulls better from ground truth, but rather is just more probabilistic in the vector space. Does this jive?
I think you make a good point, but my guess is that in, e.g., your Taylor Swift example, a well-grounded model would have a low likelihood of outputting multiple consecutive answers about her being a comedian, which isn't grounded in the training data.
For your Tom Cruise example, since all those phrases are true and grounded in the training data, the technique may fire off a false positive "hallucination decision".
However, the example they give in the paper seems to be for "single-answer" questions, e.g., "What is the receptor that this very specific medication acts on?", or "Where is the Eiffel Tower located?", in which case I think this approach could be helpful. So perhaps this technique is best-suited for those single-answer applications.
Perhaps another way to phrase this is "sampling and evaluating the similarity of samples can determine the dispersion of a distribution, but not its correctness." I can sample a Gaussian and tell you how sparse the samples are (standard deviation), but this in no way tells me whether the distribution is accurate (it is possible to have a highly accurate distribution of a high-entropy variable). On the other hand, it's possible to have a tight distribution with low standard deviation that is simply inaccurate, but I can't know that simply by sampling from it (unless I already know a priori what the output should look like).
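A toy numerical version of that point, with made-up numbers (a minimal sketch, not anything from the paper):

```python
# Toy illustration: sampling tells you dispersion, not correctness.
import numpy as np

rng = np.random.default_rng(0)
true_value = 10.0

wrong_but_tight = rng.normal(loc=3.0, scale=0.1, size=1000)   # low variance, far from truth
right_but_wide  = rng.normal(loc=10.0, scale=5.0, size=1000)  # high variance, centred on truth

for name, samples in [("wrong_but_tight", wrong_but_tight),
                      ("right_but_wide", right_but_wide)]:
    print(name, "std:", samples.std().round(2),
          "mean error:", abs(samples.mean() - true_value).round(2))
# Looking at the spread alone would "prefer" the tight-but-wrong distribution.
```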
> draw[ing] several samples and evaluating the entropy and/or disagreement between those draws
the method from the paper (as I understand it; a rough sketch in code follows the list):
- samples multiple answers, (e.g. "music:0.8, musician:0.9, concert:0.7, actress:0.5, superbowl:0.6")
- groups them by semantic similarity and gives them an id ([music, musician, concert] -> MUSIC, [actress] -> ACTING, [superbowl] -> SPORTS), note that they just use an integer or something for the id
- sums the probability of those grouped answers and normalizes: (MUSIC:2.4, ACTING:0.5, SPORTS:0.6 -> MUSIC:0.686, SPORTS:0.171, ACTING:0.143)
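A minimal sketch of the grouping and normalization steps in the list above. The paper uses NLI models to decide semantic equivalence, so the tiny hand-made `GROUP` lookup here is a hypothetical stand-in chosen only to reproduce the numbers in the example:

```python
# Sketch of: group sampled answers by meaning, sum probability mass per
# group, normalize, then compute entropy over meanings rather than strings.
import math
from collections import defaultdict

# Placeholder grouping: the paper clusters with an NLI model checking mutual
# entailment in context; this lookup just mirrors the example above.
GROUP = {"music": "MUSIC", "musician": "MUSIC", "concert": "MUSIC",
         "actress": "ACTING", "superbowl": "SPORTS"}

def cluster_and_score(samples):
    # samples: list of (answer_text, probability) pairs drawn from the model
    mass = defaultdict(float)
    for text, p in samples:
        mass[GROUP.get(text, text)] += p                  # sum mass per meaning
    total = sum(mass.values())
    probs = {k: v / total for k, v in mass.items()}       # normalize
    entropy = -sum(p * math.log(p) for p in probs.values())
    return probs, entropy

probs, H = cluster_and_score([("music", 0.8), ("musician", 0.9), ("concert", 0.7),
                              ("actress", 0.5), ("superbowl", 0.6)])
# probs -> {"MUSIC": 0.686, "ACTING": 0.143, "SPORTS": 0.171} (approximately)
```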
They also go to pains in the paper to clearly define what they are trying to prevent, which is confabulations.
> We focus on a subset of hallucinations which we call ‘confabulations’ for which LLMs fluently make claims that are both wrong and arbitrary—by which we mean that the answer is sensitive to irrelevant details such as random seed.
Common misconceptions will still be strongly represented in the dataset. What this method does is it penalizes semantically isolated answers (answers dissimilar to other possible answers) with mediocre likelihood.
Now technically, this paper only compares the effectiveness of "detecting" the confabulation to other methods - it doesn't offer an improved sampling method which utilizes that detection. And of course, if it were used as part of a generation technique it is subject to the extreme penalty of 10xing the number of model generations required.
link to the code: https://github.com/jlko/semantic_uncertainty
Right, but the problem pointed out here is that if you compare that to the answers for Tom Cruise, you’d get a bunch of disparate answers that under this method would seem to indicate that it was confabulating, when in reality, Tom Cruise is just known for a lot of different things.
I don't think discretizing the results solves the problem; we don't know whether the distribution is accurate without a priori knowledge. See my real GPT4 output about Paris. Are the words "city of light", "center of culture", and "capital of France" confabulations? Without a priori knowledge, is that more or less confabulatory than "city of roses", "site of religious significance", "capital of Korea"? If it simply output "Capital of Rome" 3 times, would that indicate it's probably not a confabulation? You can discretize the concepts, but that only serves to reduce the granularity of comparisons, and does not solve the underlying problem I originally described.
> On the other hand, "What is Taylor Swift known for?" the answers "standup comedy", "comedian", and "comedy actress" are semantically similar but represent hallucinations.
Taylor Swift has appeared multiple times on SNL, both as a host and as a surprise guest, beyond being a musical performer[0]. Generally, your point is correct, but she has appeared on the most famous American television show for sketch comedy, making jokes. One can argue whether she was funny or not in her appearances, but she has performed as a comedian, per se.
Though she hasn't done a full-on comedy show, she has appeared in comedies in many credits (often as herself).[1] For example she appeared as "Elaine" in a single episode of The New Girl [2x25, "Elaine's Big Day," 2013][2]. She also appeared as Liz Meekins in "Amsterdam" [2022], a black comedy, during which her character is murdered.[3]
It'd be interesting if there's such a thing as a negatory hallucination, or, more correctly, an amnesia — the erasure of truth that the AI (for whatever reason) would ignore or discount.
[0] https://www.billboard.com/lists/taylor-swift-saturday-night-...
[1] https://www.imdb.com/name/nm2357847/
[2] https://newgirl.fandom.com/wiki/Elaine
[3] https://www.imdb.com/title/tt10304142/?ref_=nm_flmg_t_7_act
They're not using vector embeddings for determining similarity - they use finetuned NLI models that take the context into account to determine semantic equivalence. So it doesn't depend on knowing properties of the output distribution up front at all. All you need to be able to do is draw a representative sample (up to your preferred error bounds).
Garbage in, garbage out. If the "training data" is scraped from online Taylor Swift forums, where her fans are commenting about something funny she did "OMG Taytay is so funny!" "She's hilarious" "She made me laugh so hard" - then the LLM is going to sometimes report that Taylor Swift is a comedian. It's really as simple as that. It's not "hallucinating", it's probability. And it gets worse with AIs being trained on data from reddit and other unreliable sources, where misinformation and disinformation get promoted regularly.
The current architecture of LLMs focuses mainly on the retrieval part, and the learned weights just converge to get the best outcome for next-token prediction. Whereas the ability to put this data into a logical system should also have been a training goal, IMO. Next-token prediction + formal verification of knowledge during the training phase itself = that would give the LLM the ability to keep consistency in its knowledge generation and see the right hallucinations (which I like to call imagination).
The process can look like this (a rough sketch of the combined objective follows the list):
1. Use existing large models to convert the same previous dataset they were trained on into formal logical relationships. Let them generate multiple solutions.
2. Take this enriched dataset and train a new LLM which not only outputs the next token but also the formal relationships between previous knowledge and the newly generated text.
3. The network can optimize weights until the generated formal code gets high accuracy on the proof checker, along with the token-generation accuracy function.
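A very rough sketch of what step 3's combined objective might look like. Everything here is hypothetical: `proof_checker` is an assumed external verifier, `alpha` an arbitrary weight, and the proof-checking term is not differentiable as written, so in practice it would have to enter training as a reward rather than a loss term.

```python
# Hypothetical sketch of step 3: score next-token prediction and
# proof-checker acceptance together. `proof_checker` is an assumed
# external verifier; nothing here comes from an existing library or paper.
import torch
import torch.nn.functional as F

def combined_loss(token_logits, target_tokens, formal_statements, proof_checker, alpha=0.5):
    # Standard next-token cross-entropy.
    lm_loss = F.cross_entropy(token_logits.view(-1, token_logits.size(-1)),
                              target_tokens.view(-1))
    # Fraction of generated formal statements the checker rejects.
    if formal_statements:
        rejected = sum(0 if proof_checker(s) else 1 for s in formal_statements)
        formal_loss = rejected / len(formal_statements)
    else:
        formal_loss = 0.0
    # Note: formal_loss is not differentiable w.r.t. the model as written;
    # it would have to be used as a reward signal in practice.
    return lm_loss + alpha * formal_loss
```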
In my own mind I feel language is secondary - it's not the base of my intelligence. The base seems more like a dreamy simulation where things are consistent with each other, and language is just what I use to describe it.
This suggestion revisits the classic "formal top-down" vs "informal bottom-up" approaches to building a semantic knowledge management system. Top-down was tried extensively in the era before big data and probabilistic models, but required extensive manual human curation while being starved for knowledge. The rise of big data offered no cure for the curation problem: because curation can't be automated, larger scale just made the problem worse. AI's transition to probability (in the ~1990s) paved the way to the associative probabilistic models in vogue today, and there's no sign that a more-curated, more-formal approach has any hope of outcompeting them.
How to extend LLMs to add mechanisms for reasoning, causality, etc (Type 2 thinking)? However that will eventually be done, the implementation must continue to be probabilistic, informal, and bottom-up. Manual human curation of logical and semantic relations into knowledge models has proven itself _not_ to be sufficiently scalable or anti-brittle to do what's needed.
> How to extend LLMs to add mechanisms for reasoning, causality, etc (Type 2 thinking)?
We could just use RAG to create a new dataset. Take each known concept or named entity, search it inside the training set (1), search it on the web (2), generate it with a bunch of models in closed book mode (3).
Now you got three sets of text, put all of them in a prompt and ask for a wikipedia style article. If the topic is controversial, note the controversy and distribution of opinions. If it is settled, notice that too.
By contrasting web search with closed-book materials we can detect biases in the model and lacking knowledge or skills. If they don't appear in the training set you know what is needed in the next iteration. This approach combines self testing with topic focused research to integrate information sitting across many sources.
I think of this approach as "machine study" where AI models interact with the text corpus to synthesize new examples, doing a kind of "review paper" or "wiki" reporting. This can be scaled for billions of articles, making a 1000x larger AI wikipedia.
Interacting with search engines is just one way to create data with LLMs. Interacting with code execution and humans are two more ways. Just human-AI interaction alone generates over one billion sessions per month, where LLM outputs meet with implicit human feedback. Now that most organic sources of text have been used, the LLMs will learn from feedback, task outcomes and corpus study.
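A hedged sketch of how the three-source construction described above might be wired together; `search_training_set`, `search_web`, `closed_book_generate`, and `llm` are hypothetical stand-ins, not real APIs:

```python
# Hypothetical sketch of the "machine study" synthesis loop described above.
# All retrieval functions and `llm` are assumed stand-ins.
def build_study_article(entity: str,
                        search_training_set, search_web, closed_book_generate,
                        llm) -> str:
    sources = {
        "training_set": search_training_set(entity),   # (1) what the corpus says
        "web": search_web(entity),                     # (2) what the web says
        "closed_book": closed_book_generate(entity),   # (3) what the model believes
    }
    prompt = (
        f"Write a Wikipedia-style article about {entity}.\n"
        "If the sources disagree, describe the controversy and the distribution "
        "of opinions; if they agree, note that the topic is settled.\n\n"
        + "\n\n".join(f"[{name}]\n{text}" for name, text in sources.items())
    )
    return llm(prompt)
```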
Yes, that's why there was no human in the loop and I was using LLMs as a proxy for the bottom-up approach in step 1. But hallucinations can creep into the knowledge graph as well, as another commenter mentioned.
Yann LeCun said something to the effect that you cannot get reasoning with fixed computation budgets, which I found to be a simple way to explain and understand a hypothesized limitation.
Logic has all its own problems. See "Gödel, Escher, Bach", or ask why OWL has been around for 20 years and has almost no market share, why people have tried every answer to managing asynchronous code other than RETE, or why "complex event processing" is an obscure specialty and not a competitor to Celery and other task runners. Or, for that matter, why can't Drools give error messages that make any sense?
As a computational biologist, I've used ontologies quite a bit. They have utility, but there is a bit of an economic mismatch between their useful application and the energy required to curate them. You have some experience in this space. Do you think LLMs could speed up ontology / knowledge graph curation with expert review? Or, do you think structured knowledge has a fundamental problem limiting its use?
LLMs right now don't employ any logic. There can always be corners of "I don't know" or "I can't do that" - unlike the current system, which is 100% confident in its answer because it's not actually trying to match any constraint at all. So at some point the system will apply logic, but it may not be as formal as what we do in pure math.
But the problem is with the new stuff it hasn't seen, and questions humans don't know the answers to. It feels like this whole hallucinations thing is just the halting problem with extra steps. Maybe we should ask ChatGPT whether P=NP :)
Haha, asking chat-gpt surely won't work. Everything can "feel" like a halting problem if you want perfect results with zero error while uncertain and ambiguous new data keeps being added.
My take - hallucinations can never be made perfectly zero, but they can be reduced to a point where in 99.99% of cases these systems will be hallucinating less than humans, and more often than not their divergences will turn out to be creative thought experiments (which I term healthy imagination). If it hallucinates less than a top human does - I say we win :)
Yeah, but when you come to halting problems at that level of complexity, multi-hierarchical emergent phenomena occur aperiodically and chaotically; that is to say, in the frequency domain the aperiodicity is fractal-like, discrete, and mappable.
For the first step, CYC[1] could be a valid solution. From my experience I would call it a meaningful relation schema for DAGs. There is also an open source version available[2], but it is no longer maintained by the company itself.
[1] https://cyc.com
[2] https://github.com/asanchez75/opencyc
Interesting. I haven't really looked much into this space. But anything which can provably represent concepts and relationships without losing information can work. The devil might be in the details; nothing is as simple as it looks at first sight.
Formal verification of knowledge/logical relationships? how would you formally verify a sci-fi novel or a poem? What about the paradoxes that exist in nature, or contradicting theories that are logically correct? This is easier said than done. What you are proposing is essentially 'let's solve this NP-hard problem, that we don't know how to solve and then it will work'.
Oh, exactly. But let me know your thoughts on this - let's say you have a graph which represents an existing sci-fi novel. Rather than the current model, which is just blindly generating text on statistical probabilities, would it not help to have the model output also try to fit into this admittedly imperfect sci-fi novel KG, and be flagged if it doesn't fit logically? Based on how strong your logic requirements are, the system can range from least creative to most creative, etc.
I was not actually aware that building a KG from text is an NP-hard problem. I will check it out. I thought it was a time-consuming problem when done manually without LLMs, but didn't think it was THAT hard. Hence I was trying to introduce an LLM into the flow. Thanks, will read about all this more!
E.g., KGs (RDF, PGs, ...) are logical, but in automated construction they are not semantic in the sense of the ground domain of NLP, and in manual construction you get only a tiny ontology. Conversely, fancy powerful logics like modal ones are even less semantic in NLP domains. Code is more expressive, but brings its own issues.
I had KGs in mind with automated construction which can improve and converge during training phase. I was hypothesizing that if we give incentive during training phase to also construct KGs and bootstrap the initial KGs from existing LLMs - the convergence towards a semantically correct KG extension during inference can be achieved. What do you think?
One formulation is that these are hallucinations. Another is that these systems are "orthogonal to truth". They have nothing to do with truth or falsity.
It's like asking if a probability distribution is truthful or a liar. It's a category error to speak about algorithms as if they had personal characteristics.
The linked paper is about detecting when the LLM is choosing randomly versus consistently at the level of factoids. Procedurally-generated randomness can be great for some things like brainstorming, while consistency suggests that it's repeating something that also appeared fairly consistently in the training material. So it might be true or false, but it's more likely to have gotten it from somewhere.
Knowing how random the information is seems like a small step forward.
Take social media like Reddit for example. It has a filtering mechanism for content that elevates low-entropy thoughts people commonly express and agree with. And I don’t think that necessarily equates such popular ideas there to the truth.
LLMs are trained with the objective: "no matter what, always have at least three paragraphs of response", and that response is always preferred to silence or "unfriendly" responses like "what are you talking about?"
Then yes, it is being taught to bullshit.
Similar to how an improv class teaches you to keep a conversation interesting and “never to say no” to your acting partner.
The botulinum that developed in this person's[1] garlic and olive oil mixture wouldn't particularly care to alter its toxicity to make Gemini's recommendation look better.
[1] https://old.reddit.com/r/ChatGPT/comments/1diljf2/google_gem...
Yea, IMO these LLMs seem more similar to a subconscious mind than a conscious mind. Jung would probably call it an "antinomy": its goal is not to represent the truth, but to represent the totality of possible answers.
It seems like a useful adaptation of the term to a new usage, but I can understand if your objection is that it promotes anthropomorphizing these types of models. What do you think we should call this kind output, instead of hallucination?
Maybe another way of looking at it is - the paper is attempting to explain what LLMs are actually doing to people who have already anthropomorphised them.
Sometimes, to lead people out of a wrong belief or worldview, you have to meet them where they currently are first.
Why should you believe the output of the LLM just because it is formatted a certain way (i.e. "formal logical relationships")?
One expression of that idea is in this paper: https://link.springer.com/article/10.1007/s10676-024-09775-5
> In this paper, we argue against the view that when ChatGPT and the like produce false claims they are lying or even hallucinating, and in favour of the position that the activity they are engaged in is bullshitting, in the Frankfurtian sense (Frankfurt, 2002, 2005). Because these programs cannot themselves be concerned with truth, and because they are designed to produce text that looks truth-apt without any actual concern for truth, it seems appropriate to call their outputs bullshit.
> We think that this is worth paying attention to. Descriptions of new technology, including metaphorical ones, guide policymakers’ and the public’s understanding of new technology; they also inform applications of the new technology. They tell us what the technology is for and what it can be expected to do. Currently, false statements by ChatGPT and other large language models are described as “hallucinations”, which give policymakers and the public the idea that these systems are misrepresenting the world, and describing what they “see”. We argue that this is an inapt metaphor which will misinform the public, policymakers, and other interested parties.
The criticism that people shouldn't anthropomorphize AI models that are deliberately and specifically replicating human behavior is already so tired. I think we need to accept that human traits will no longer be unique to humans (if they ever were, if you expand the analysis to non-human species), and that attributing these emergent traits to non-humans is justified.
"Hallucination" may not be the optimal metaphor for LLM falsehoods, but some humans absolutely regularly spout bullshit in the same way that LLMs do - the same sort of inaccurate responses generated from the same loose past associations.
Isn’t it true that the only thing that LLM’s do is “hallucinate”?
The only way to know if it did “hallucinate” is to already know the correct answer. If you can make a system that knows when an answer is right or not, you no longer need the LLM!
Hallucination implies a failure of an otherwise sound mind. What current LLMs do is better described as bullshitting. As the bullshitting improves, it happens to be correct a greater and greater percentage of the time
Sometimes when I am narrating a story I don't care that much about the trivial details but focus on the connection between those details. Is there an LLM counterpart to such behaviour? In this case, one could say I was bullshitting on the trivial details.
Does every thread about this topic have to have someone quibbling about the word “hallucination”, which is already an established term of art with a well understood meaning? It’s getting exhausting.
The term hallucination is a fundamental misunderstanding of how LLMs work, and continuing to use it will ultimately result in a confused picture of what AI and AGI are and what is "actually happening" under the hood.
Wanting to use accurate language isn't exhausting, it's a requirement if you want to think about and discuss problems clearly.
You stole a term which means something else in an established domain and now assert that the ship has sailed, whereas a perfectly valid term in both domains exists. Don't be a lazy smartass.
https://en.wikipedia.org/wiki/Confabulation
Sometimes it is coherent (grounded in physical and social dynamics) and sometimes it is not.
We need systems that try to be coherent, not systems that try to be unequivocally right, which wouldn't be possible.
> We need systems that try to be coherent, not systems that try to be unequivocally right, which wouldn't be possible.
The fact that it isn't possible to be right about 100% of things doesn't mean that you shouldn't try to be right.
Humans generally try to be right; these models don't, and that is a massive difference you can't ignore. The fact that humans often fail to be right doesn't mean that these models shouldn't even try to be right.
The answer is no, otherwise this paper couldn't exist. Just because you can't draw a hard category boundary doesn't mean "hallucination" isn't a coherent concept.
(the OP is referring to one of the foundational concepts relating to the entropy of a model of a distribution of things -- it's not the same terminology that I would use but the "you have to know everything and the model wouldn't really be useful" is why I didn't end up reading the paper after skimming a bit to see if they addressed it.
It's why things in this arena are a hard problem. It's extremely difficult to actually know the entropy of certain meanings of words, phrases, etc., without a comical amount of computation.
This is also why a lot of the interpretability methods people use these days have some difficult and effectively permanent challenges inherent to them. Not that they're useless, but I personally feel they are dangerous if used without knowledge of the class of side effects that comes with them.)
The idea behind this research is to generate the answer a few times, and if the results are semantically vastly different from each other then they are probably wrong.
> Isn’t it true that the only thing that LLM’s do is “hallucinate”?
The Boolean answer to that is "yes".
But if Boolean logic were a good representation of reality, we would already have solved that AGI thing ages ago. In practice, your neural network is trained with a lot of samples that have some relation between themselves, and to the extent that those relations are predictable, the NN can be perfectly able to predict similar ones.
There's an entire discipline about testing NNs to see how well they predict things. It's the other side of the coin of training them.
Then we get to this "know the correct answer" part. If the answer to a question was predictable from the question words, nobody would ask it. So yes, it's a definitive property of NNs that they can't create answers for questions like people have been asking those LLMs.
However, they do have an internal Q&A database they were trained on. Except that the current architecture can not know if an answer comes from the database either. So, it is possible to force them into giving useful answers, but currently they don't.
the fact checker doesn't synthesize the facts or the topic
Maybe for the moment it would be better if the AI companies simply presented their chatbots as slightly-steered text generation tools. Then people could use them appropriately.
Yes, there seems to be a little bit of grokking and the models can be made to approximate step-by-step reasoning a little bit. But 95% of the function of these black boxes is text generation. Not fact generation, not knowledge generation. They are more like improv partners than encyclopedias and everyone in tech knows it.
I don’t know if LLMs misleading people needs a clever answer entropy solution. And it is a very interesting solution that really seems like it would improve things — effectively putting certainty scores to statements. But what if we just stopped marketing machine learning text generators as near-AGI, which they are not? Wouldn’t that undo most of the damage, and arguably help us much more?
I’m working with a LLM right this moment to build some front end with react and redux, the technologies that I have almost no knowledge of. I posed questions and the LLM gave me the answers along with JavaScript code, a language that I’m also very rusty with. All of the code compiled, and most of them worked as expected. There were errors, some of them I had no idea what they were about. LLM was able to explained the issues and gave me revised code that worked.
All in all it’s been a great experience, it’s like working with a mentor along the way. It must have saved me a great deal of time, given how rookie I am. I do need to verify the result.
Where did you get the 95% figure? And whether what it does is text generation or fact or knowledge generation is irrelevant. It’s really a valuable tool and is way above anything I’ve used.
Over the last 6 weeks there's been a pronounced uptick in comments, motivated by tiredness of seeing "AI", manifested as a fever dream of them not being useful at all and swindling the unwashed masses who just haven't used them enough yet to know their true danger.
I've started calling it what it is: lashing out in confusion at why they're not going away, given a prior that there's no point in using them.
I have a feeling there'll be near-religious holdouts in tech for some time to come. We attract a certain personality type, and they tend to be wedded to the idea of things being absolute and correct in a way things never are.
"We show how to detect confabulations by developing a quantitative measure of when an input is likely to cause an LLM to generate arbitrary and ungrounded answers. ... Intuitively, our method works by sampling several possible answers to each question and clustering them algorithmically into answers that have similar meanings."
That's reasonable for questions with a single objective answer. It probably won't help when multiple, equally valid answers are possible.
However, that's good enough for search engine applications.
The concept of semantic entropy reminds me of a bank, whose name I can't remember, that made a "bullshitometer" in the aftermath of the Enron catastrophe to measure the level of bullshit in press releases. They applied it to Enron's press releases before the company's implosion and showed it could have predicted the collapse.
There's a concept in statistics called sensitivity analysis. It seems like this is somewhat similar, but an alternative approach that might be interesting would be to modify the input in a way that you think should preserve the semantic meaning, and see how that alters the meaning of the output.
Of course, altering the input without changing the meaning is the hard part, but doesn't seem entirely infeasible. At the least, you could just ask the LLM to try to alter the input without changing the meaning, although you might end up in a situation where it alters the input in a way that aligns with its own faulty understanding of an input, meaning it could match the hallucinated output better after modification.
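A minimal sketch of that input-perturbation idea; `llm`, `paraphrase`, and `semantic_similarity` are hypothetical stand-ins rather than anything from the paper:

```python
# Hypothetical sketch: perturb the input in meaning-preserving ways and see
# how much the output's meaning moves. All callables here are stand-ins.
def output_sensitivity(question: str, llm, paraphrase, semantic_similarity,
                       n_variants: int = 5) -> float:
    baseline = llm(question)
    scores = []
    for _ in range(n_variants):
        variant = paraphrase(question)          # should preserve the meaning
        answer = llm(variant)
        scores.append(semantic_similarity(baseline, answer))
    # Low average similarity across meaning-preserving rewrites suggests the
    # answer is sensitive to irrelevant details of the prompt.
    return sum(scores) / len(scores)
```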