mitthrowaway2 · a year ago
LLMs don't only hallucinate because of mistaken statements in their training data. It just comes hand-in-hand with the model's ability to remix, interpolate, and extrapolate answers to other questions that aren't directly answered in the dataset. For example if I ask ChatGPT a legal question, it might cite as precedent a case that doesn't exist at all (but which seems plausible, being interpolated from cases that do exist). It's not necessarily because it drew that case from a TV episode. It works the same way that GPT-3 wrote news releases that sounded convincing, matching the structure and flow of real articles.

Training only on factual data won't solve this.

Anyway, I can't help but feel saddened sometimes to see our talented people and investment resources being drawn in to developing these AI chatbots. These problems are solvable, but are we really making a better world by solving them?

dweinus · a year ago
100% I think the author is really misunderstanding the issue here. "Hallucination" is a fundamental aspect of the design of Large Language Models. Narrowing the distribution of the training data will reduce the LLM's ability to generalize, but it won't stop hallucinations.
sean_pedersen · a year ago
I agree that a perfectly consistent dataset won't completely stop statistical language models from hallucinating, but it will reduce it. I think it is established that data quality is more important than quantity. Bullshit in -> bullshit out, so a focus on data quality is good and needed IMO.

I am also saying LM output should cite sources and give confidence scores (which reflect how much the output is in or out of the training distribution).
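
To make that concrete, here is a back-of-the-envelope sketch of one possible confidence proxy (my own toy idea, not something providers ship; it assumes whatever API or model you use exposes per-token log-probabilities, and the function name is made up):

  import math

  def confidence_from_logprobs(token_logprobs):
      """Map per-token log-probabilities (natural log) to a rough 0..1 score.

      A low geometric-mean probability means the model is 'surprised' by its
      own output, which loosely tracks being out of distribution.
      """
      if not token_logprobs:
          return 0.0
      mean_logprob = sum(token_logprobs) / len(token_logprobs)
      return math.exp(mean_logprob)  # geometric mean of the token probabilities

  # logprobs taken from whatever model or API exposes them per generated token
  print(confidence_from_logprobs([-0.1, -0.3, -2.5]))  # ~0.38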

_venkatasg · a year ago
Most sentences in the world are not about truth or falsity. Training on a high quality corpus isn’t going to fix ‘hallucination’. The complete separation of facts from sentences is what makes LLMs powerful.
CamperBob2 · a year ago
These problems are solvable, but are we really making a better world by solving them?

When you ask yourself that question -- and you do ask yourself that, right? -- what's your answer?

mitthrowaway2 · a year ago
I do, all the time! My answer is "most likely not". (I assumed that answer was implied by my expressing sadness about all the work being invested in them.) This is why, although I try to keep up-to-date with and understand these technologies, I am not being paid to develop them.
bosch_mind · a year ago
AI noob, but instead of training and fine-tuning the LLM itself, don’t more specific and targeted embeddings paired with the model help alleviate the hallucination, where you incorporate semantic search context along with the question?
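
Roughly the pattern I have in mind, as a toy sketch (the "embeddings" here are just word counts so it runs without any model or vector store; a real setup would use a proper embedding model):

  from collections import Counter
  import math

  docs = [
      "The Treaty of Westphalia was signed in 1648.",
      "Python's GIL limits CPU-bound threading.",
  ]

  def embed(text):
      # stand-in for a real embedding model: bag-of-words counts
      return Counter(text.lower().split())

  def cosine(a, b):
      dot = sum(a[t] * b[t] for t in a)
      norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
      return dot / norm if norm else 0.0

  question = "When was the Treaty of Westphalia signed?"
  best_doc = max(docs, key=lambda d: cosine(embed(question), embed(d)))

  # the retrieved passage is pasted in front of the question, so the model can
  # ground its answer in it instead of improvising one
  prompt = f"Answer using only this context:\n{best_doc}\n\nQuestion: {question}"
  print(prompt)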


RodgerTheGreat · a year ago
One of the main factors that makes LLMs popular today is that scaling up the models is a simple and (relatively) inexpensive matter of buying compute capacity and scraping together more raw text to train them. Without large and highly diverse training datasets to construct base models, LLMs cannot produce even the superficial appearance of good results.

Manually curating "tidy", properly-licensed and verified datasets is immensely more difficult, expensive, and time-consuming than stealing whatever you can find on the open internet. Wolfram Alpha is one of the more successful attempts in that curation-based direction (using good-old-fashioned heuristic techniques instead of opaque ML models), and while it is very useful and contains a great deal of factual information, it does not conjure appealing fantasies of magical capabilities springing up from thin air and hands-off exponential improvement.

threeseed · a year ago
> properly-licensed and verified datasets is immensely more difficult, expensive

Arguably the bigger problem is that many of those datasets e.g. WSJ articles are proprietary and can be exclusively licensed like we've seen recently with OpenAI.

So we end up in a situation where competition is simply not possible.

piva00 · a year ago
> Arguably the bigger problem is that many of those datasets e.g. WSJ articles are proprietary and can be exclusively licensed like we've seen recently with OpenAI.

> So we end up in a situation where competition is simply not possible.

Exactly, and Technofeudalism advances a little more into a new feud.

OpenAI is trying to create its moat by shoring up training data, probably attempting to not allow competitors to train on the same datasets they've been licensing, at least for a while. Training data is the only possible moat for LLMs; models seem to be advancing quite well across different companies, but as mentioned here, a tidy training dataset is the actual gold.

motohagiography · a year ago
the irony is that if large media providers aren't represented in the training sets, my comments on internet forums over the decades will be over-represented, which is kind of great, really.
totetsu · a year ago
It’s not unethical if people in positions of privilege and power do it to maintain their rightful position of privilege and power.
dang · a year ago
Please don't post in the flamewar style to HN. It degrades discussion and we're trying to go in the opposite direction here, to the extent that is possible on the internet.

https://news.ycombinator.com/newsguidelines.html

nyrikki · a year ago
> ...manually curate a high-quality (consistent) text corpus based on undisputed, well curated wikipedia articles and battle tested scientific literature.

This rests on the mistaken assumption that science is about objective truth.

It mistakes the map for the territory. Scientific models are intended to be useful, not perfect.

Statistical learning vs. symbolic learning is about existential quantification vs. universal quantification, respectively.

All models are wrong, some are useful; this applies even to the most unreasonably accurate ones like QFT and GR.

Spherical cows, no matter how useful, are hotly debated outside of the didactic half-truths of low-level courses.

The corpus that the above seeks doesn't exist in academic circles, only in popular science, where people don't see that practical, useful models are far more important than 'correct' ones.

lsy · a year ago
We can't develop a universally coherent data set because what we understand as "truth" is so intensely contextual that we can't hope to cover the amount of context needed to make the things work how we want, not to mention the numerous social situations where writing factual statements would be awkward or disastrous.

Here are a few examples of statements that are not "factual" in the sense of being derivable from a universally coherent data set, and that nevertheless we would expect a useful intelligence to be able to generate:

"There is a region called Hobbiton where someone named Frodo Baggins lives."

"We'd like to announce that Mr. Ousted is transitioning from his role as CEO to an advisory position while he looks for a new challenge. We are grateful to Mr. Ousted for his contributions and will be sad to see him go."

"The earth is round."

"Nebraska is flat."

smokel · a year ago
> We can't develop a universally coherent data set because

Yet every child seems to manage, when raised by a small village, over a period of about 18 years. I guess we just need to give these LLMs a little more love and attention.

antisthenes · a year ago
And then you go out into the real world, talk to real adults, and discover that the majority of people don't have a coherent mental model of the world, and have completely ridiculous ideas that aren't anywhere near an approximation of the real physical world.
bugglebeetle · a year ago
Or maybe hundreds of millions of years of evolutionary pressure to build unbelievably efficient function approximation.


js8 · a year ago
You're right. We don't really know how to handle uncertainty and fuzziness in logic properly (to avoid logical contradictions). There have been many mathematical attempts to model uncertainty (just to name a few: probability, Dempster-Shafer theory, fuzzy logic, non-monotonic logics, etc.), but they all suffer from some kind of paradox.

At the end of the day, none of these theoretical techniques prevailed in the field of AI, and we ended up with, empirically successful, neural networks (and LLMs specifically). We know they model uncertainty but we have no clue how they do it conceptually, or whether they even have a coherent conception of uncertainty.

So I would posit that the problem isn't that we don't have the technology, but rather that we don't understand what we want from it. I have yet to see a coherent theory of how humans use language to express uncertainty that would encompass a broad (if not the full) range of how people use language. Without that, you can't define what a hallucination of an LLM is. Maybe it's making a joke (some believe the point of a joke is to highlight a subtle logical error of some sort), because, you know, it read a lot of them and concluded that's what humans do.

So AI eventually prevailed (over humans) in fields where we were able to precisely define the goal. But what is our goal vis-a-vis human language? What do we want the AI to answer to our prompts? I think we are stuck on the lack of a definition of that.

darby_nine · a year ago
Man it seems like the ship has sailed on "hallucination" but it's such a terrible name for the phenomenon we see. It is a major mistake to imply the issue is with perception rather than structural incompetence. Why not just say "incoherent output"? It's actually descriptive and doesn't require bastardizing a word we already find meaningful to mean something completely different.
jwuphysics · a year ago
> Why not just say "incoherent output"?

Because the biggest problem with hallucinations is that the output is usually coherent but factually incorrect. I agree that "hallucination" isn't the best word for it... perhaps something like "confabulation" is better.
nl · a year ago
And we use "hallucination" because in the ancient times, when generative AI meant image generation, models would "hallucinate" extra fingers, etc.

The behavior of text models is similar enough that the wording stuck, and it's not all that bad.

linguistbreaker · a year ago
I appreciated a post on here recently that likened AI hallucination to 'bullshitting'. It's coherent, even plausible output without any regard for the truth.
breatheoften · a year ago
I think it's a pretty good name for the phenomenon -- maybe the only problem with the term is that what models are doing is 100% hallucination all the time -- it's just that when the hallucinations are useful we don't call them hallucinations -- so maybe that is a problem with the term (not sure if that's what you are getting at).

But there's nothing at all different about what the model is doing between these cases -- the models are hallucinating all the time and have no ability to assess when they are hallucinating "right" or "wrong" or useful/non-useful output in any meaningful way.

darby_nine · a year ago
They aren't hallucinating in any way comparable to humans, which implies a delusion in perception. You're describing the quality of output by using a word used to describe the quality of input.
Slow_Hand · a year ago
I prefer “confabulate” to describe this phenomenon.

: to fill in gaps in memory by fabrication

> In psychology, confabulation is a memory error consisting of the production of fabricated, distorted, or misinterpreted memories about oneself or the world.

It’s more about coming up with a plausible explanation in the absence of a readily-available one.

skdotdan · a year ago
This is not what hallucination means in the pre-LLM machine learning literature.
exmadscientist · a year ago
"Hallucinations" implies that someone isn't of sound mental state. We can argue forever about what that means for a LLM and whether that's appropriate, but I think it's absolutely the right attitude and approach to be taking toward these things.

They simply do not behave like humans of sound minds, and "hallucinations" conveys that in a way that "confabulations" or even "bullshit" does not. (Though "bullshit" isn't bad either.)

devjab · a year ago
I disagree with this take because LLMs are, always, hallucinating. When they get things right it’s because they are lucky. Yes, yes, it’s more complicated than that, but the essence of LLMs is that they are very good at being lucky. So good that they will often give you better results than random search engine clicks, but not good enough to be useful for anything important.

I think calling the times they get things wrong "hallucinations" is largely an advertising trick, so that they can sort of fit the LLMs into how all IT is sometimes “wonky” and sell their fundamentally flawed technology more easily. I also think it works extremely well.

kimixa · a year ago
I don't really immediately link "Hallucinations" with "Unsound mind" - most people I know have experienced auditory hallucinations - often things like not being sure if the doorbell went off, or if someone said their name.

And I couldn't find a single one of my friends who hadn't experienced "phantom vibration syndrome".

Both I'd say are "Hallucinations", without any real negative connotation.

darby_nine · a year ago
"Hallucination" implies that they do behave like a human mind. Why else would you use the word if you were not trying to draw this parallel?
mistermann · a year ago
"Sound" minds for humans is graded on a curve, and this trick is not acknowledged, or popular.
linguistbreaker · a year ago
How about "dream-reality confusion (DRC)" ?
marcosdumay · a year ago
Bullshit is the most descriptive one.

LLMs don't do it because they are out of their right mind. They do it because every single answer they give is invented with regard only for form, not correctness.

But yeah, that ship has already sailed.

TeaBrain · a year ago
The problem with "incoherent output" is that it isn't describing the phenomenon at all. There have been cases where LLM output has been incoherent, but modern LLM hallucinations are usually coherent and well-constructed, just completely fabricated.
darby_nine · a year ago
Do you have an example? Your statement seems trivially contradictory—how do you know it's fabricated without incoherence showing you this? Isn't fabrication the entire point of generative ai?
lopatin · a year ago
Why is "incoherent output" better? When an LLM hallucinates, it coherently and confidently lies to you. I think "hallucination" is the perfect word for this.
wkat4242 · a year ago
Hallucination is one single word. Even if it's not perfect, it's great as a term. It's easy to remember and people new to the term already have an idea of what it entails. And the term will bend to cover what we take it to mean anyway. Language is flexible. Hallucination in an LLM context doesn't have to be the exact same as in a human context. All that matters is that we're aligned on what we're talking about. It's already achieved this purpose.
freilanzer · a year ago
Hallucination perfectly describes the phenomenon.
adrianmonk · a year ago
On a literal level, hallucinations are perceptual.

But "hallucination" was already (before LLMs) being used in a figurative sense, i.e. for abstract ideas that are made up out of nothing. The same is also true of other words that were originally visual, like "illusion" and "mirage".

skdotdan · a year ago
It’s used incorrectly. Hallucination has (or used to have) a very specific meaning in machine learning. All hallucinations are errors but not all errors are hallucinations.
dgs_sgd · a year ago
I think calling it hallucination is because of our tendency to anthropomorphize things.

Humans hallucinate. Programs have bugs.

threeseed · a year ago
The point is that this isn't a bug.

It's inherent to how LLMs work and is expected although undesired behaviour.

ainoobler · a year ago
The article suggests a useful line of research: train an LLM to detect logical fallacies and then see if that can be bootstrapped into something useful, because it's pretty clear that the issues with LLMs come down to a lack of logical capabilities. If an LLM were capable of logical reasoning, then it would be obvious when it was generating made-up nonsense instead of referencing existing sources of consistent information.
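
A minimal sketch of what that bootstrapping could look like as a pipeline (purely hypothetical names; call_llm is a stand-in for whichever model you'd actually use, stubbed here so the example runs):

  # one model drafts an answer, a second "critic" pass is asked to flag
  # fallacies or unsupported claims, and the draft is revised if any are found
  def call_llm(prompt: str) -> str:
      # stand-in for a real model call; swap in your API of choice
      if prompt.startswith("List any logical fallacies"):
          return "No fallacies found."
      return "stubbed model answer"

  CRITIC_PROMPT = (
      "List any logical fallacies or claims unsupported by cited sources in "
      "the following answer. Reply 'No fallacies found.' if there are none.\n\n"
      "Answer:\n{answer}"
  )

  def checked_answer(question: str) -> str:
      draft = call_llm(question)
      critique = call_llm(CRITIC_PROMPT.format(answer=draft))
      if "no fallacies found" not in critique.lower():
          draft = call_llm(f"{question}\n\nRevise this answer, fixing: {critique}")
      return draft

  print(checked_answer("Why is the sky blue?"))
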
343rwerfd · a year ago
> If an LLM was capable of logical reasoning

the prompt interfaces + smartphone apps were, from the beginning, and still are, ongoing training for the next iteration; they provide massive RLHF for further improvement of already quite RLHFed advanced models.

Whatever tokens they're extracting from all the interactions, the most valuable are those from metadata, like "correct answer in one shot" or "correct answer in three shots".

The inputs and potentially the outputs can be gibberish, but the metadata can be mostly accurate given some implicit/explicit human feedback (the thumbs up, or the "thanks" replies from users, maybe).

The RLHF refinement extracted from having the models face the entire human population, prompted continuously, 24x7x365, in all languages, about all the topics of interest to human society, must be incredible. If you can extract even a small percentage of definitely "correct answers" from the total prompts answered, it should be massive compared to just a few thousand dedicated QA/RLHF people working on the models in the initial iterations of training.

That was GPT-2, 3, 4: the initial iterations of the training. With the models having evolved into more powerful (mathematical) entities, you can use them to train the next models. Which is almost certainly happening.

My bet is on one of two scenarios:

- The scaling thing is working spectacularly: they've seen linear improvement from blue/green deployments across the world plus realtime RLHF. Maybe it is going a bit slowly, but the improvements justify a bit more waiting to train a more powerful, refined model, with far better answers even from the previously used datasets (now probed more deeply by the new models and the new massive RLHF data). If in a year they have a 20x GPT-4, Claude, Gemini, whatever, they could be "jumping" to the next 40x version a lot faster, if they have the most popular, most prompted model on the market (in the world).

- The scaling stuff has already sunk: they have seen the numbers and it doesn't add up by now, or they've seen diminishing returns coming. This is being firmly denied by everyone, on the record and off the record.

dtx1 · a year ago
I think we should start smaller and make them able to count first.
Terr_ · a year ago
Yeah, you can train an LLM to recognize the vocabulary and grammatical features of logical fallacies... Except the nature of fallacies is that they look real on that same linguistic level, so those features aren't distinctive for that purpose.

Heck, I think detecting sarcasm would be an easier goal, and still tricky.

RamblingCTO · a year ago
My biggest problem with them is that I can't quite get them to behave like I want. I built myself a "therapy/coaching" telegram bot (I'm healthy, but like to reflect a lot, no worries). I even built a self-reflecting memory component that generates insights (sometimes spot on, sometimes random af). But the more I use it, the more I notice that neither the memory nor the prompt matters much. I just can't get it to behave like a therapist would. So in other words: I can't find the inputs to achieve a desirable prediction from the SOTA LLMs. And I think that's a big obstacle to them being more than shallow hype.
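
For concreteness, the shape of what I assemble per turn is roughly this (prompts and names are invented for the example; the real bot sits behind Telegram and a separate memory summarizer):

  # a fixed "therapist" system prompt plus whatever the memory component has
  # distilled so far, prepended to every turn of the conversation
  system_prompt = (
      "You are a reflective therapist. Ask open questions, mirror feelings, "
      "and never give direct advice unless asked."
  )

  memory_insights = [
      "User tends to ruminate about work on Sundays.",
      "User responds well to being asked for concrete examples.",
  ]

  def build_messages(history, user_msg):
      return (
          [{"role": "system", "content": system_prompt},
           {"role": "system", "content": "Known about the user:\n" + "\n".join(memory_insights)}]
          + history
          + [{"role": "user", "content": user_msg}]
      )

  print(build_messages([], "I had a rough week.")[1]["content"])
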
coldtea · a year ago
>I just can't get it to behave like a therapist would

  import time
  import random

  SESSION_DURATION = 50 * 60  # the classic 50-minute session
  start_time = time.time()

  while True:
      elapsed_time = time.time() - start_time

      if elapsed_time >= SESSION_DURATION:
          print("Our time is up. That will be $150. See you next week!")
          break

      _ = input("")  # read (and ignore) whatever the client says
      print(random.choice(["Mmm hmm", "Tell me more", "How does that make you feel?"]))

      time.sleep(1)

Thank me later!

RamblingCTO · a year ago
haha, good one! although I'm German and it was free for me when I did it. I just had the best therapist. $150 a session is insane!
trte9343r4 · a year ago
> One could spin this idea even further and train several models with radically different world views by curating different training corpi that represent different sets of beliefs / world views.

You can get good results by combining different models in chat, or even the same model with different parameters. A model usually gives up on hallucinations when challenged. Sometimes it pushes back and provides an explanation with sources.

I have a script that puts models into dialog, moderates discussion and takes notes. I run this stuff overnight, so getting multiple choices speeds up iteration.
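
The script is nothing fancy; roughly this shape, with the model calls stubbed out here (in the real thing each one is a different model, or the same model with different parameters):

  # two models take turns, each challenging the previous answer, while the
  # loop keeps notes that can be reviewed in the morning
  def model_a(prompt):
      return f"A's take on: {prompt}"  # stub for the first model/endpoint

  def model_b(prompt):
      return f"B's rebuttal to: {prompt}"  # stub for the second model/endpoint

  def debate(question, rounds=3):
      notes, last = [], question
      for _ in range(rounds):
          answer = model_a(last)
          challenge = model_b(answer)
          notes.append({"claim": answer, "challenge": challenge})
          last = challenge
      return notes

  for entry in debate("Does dataset curation reduce hallucination?"):
      print(entry)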