If a piece of software sometimes produced totally wrong output, we would consider that a bug.
Yet with AI it seems like investors/founders/PMs don't really care and just ship the broken product anyway.
I feel like I'm going crazy watching AI features ship in products that give straight-up wrong outputs.
It's like a big collective delusion where we just ignore it, or hand-wave that it'll magically get fixed eventually.
Once I started seeing these behaviors in our robots, I noticed them more and more whenever I dug deeply into proposed ML systems: autonomous vehicles, robotic assistants, chatbots, and LLMs.
Having had time to reflect on our challenges, I think neural networks tend to overfit very quickly, and deep neural networks are overfitted to an incomparable degree. That condition makes them sensitive to hidden attractors that cause the system to break down, catastrophically, whenever it comes near them.
How do we define "near"? That would have to be determined using some topological method. But these systems are so complicated that we can't analyze their networks' topology or even brute-force probe their activations. Further, the larger, deeper, and more highly connected the network, the more challenging these hidden attractors are to find.
I was bothered by this topic a decade ago, and nothing I have seen today has alleviated my concern. We are building larger, deeper, and more connected networks on the premise that we'll eventually get to a state so unimaginably overfitted that it becomes stable again. I am unnerved by this idea and by the amount of money flowing in that direction with reckless abandon.
I feel like I can't trust anything it says. Mostly I use it to parse things I don't understand and then do my own verification that it's correct.
All that to say, from my perspective, they're losing some small amount of ground. The other side is that the big corps that run them don't want their golden geese to get cooked. So they keep pushing them, shoving them into everything unnecessarily, and we just have to eat it.
So I think it's a perception thing. The corps want us to think it's super useful so it keeps giving them record profits, while the rest of us are slowly waking up to how useless these tools are when they confidently give incorrect answers, and are moving away from them.
So you may just be seeing sleazy marketing at work here.
Same thing happened to me. I asked it for all the Ukrainian noun cases, and it listed and described six.
I responded that there are seven. "Oh, right." It then named and described the seventh.
That's no better than me taking an exam, so why should I rely on it, or use it at all?
But you must admit that it is still useful, and usage will not drop to zero.
LLMs are not factual databases. They are not trained to retrieve or produce factual statements.
LLMs give you the most likely word after some prior words. They are incredibly accurate at estimating the probabilities of the next word.
It is a weird accident that you can use auto-regressive next word prediction to make a chat bot. It's even weirder that you can ask the chatbot questions and give it requests and it appears to produce coherent answers and responses.
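To make the "next word" framing concrete, here is a minimal sketch of what a causal language model actually returns. It assumes the Hugging Face transformers library and GPT-2, which are my own choices for illustration, not anything from the comment above:

    # What "predicting the next word" looks like in code (illustrative sketch).
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    model.eval()

    inputs = tokenizer("The capital of France is", return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits        # shape: [batch, seq_len, vocab_size]

    # The model's entire output is a probability distribution over the next token.
    next_token_probs = torch.softmax(logits[0, -1], dim=-1)
    top_probs, top_ids = torch.topk(next_token_probs, k=5)
    for p, i in zip(top_probs, top_ids):
        print(f"{tokenizer.decode(i)!r}: {p.item():.3f}")

There is no "fact lookup" step anywhere in there; a chatbot is built by sampling from this distribution, appending the sampled token, and repeating.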
LLMs are best thought of as language generators (or "writers") not as repositories of knowledge and facts.
LLM chatbots were a happy and fascinating (and for some, very helpful) accident. But they were not designed to be "factually correct"; they were designed to predict words.
People don't care about (or are willing to accept) the "wrong answers" because there are enough use cases for "writing" that don't require factual accuracy. (see for instance, the entire genre of fiction writing)
I would argue that it is precisely LLMs' ability to escape the strict accuracy requirements of the rest of CS and just write/hallucinate some fiction that makes this tech fascinating and uniquely novel.
For this question, what LLMs were designed for is, I think, less relevant than what they are advertised for, e.g.
"Get answers. Find inspiration. Be more productive. Free to use. Easy to try. Just ask and ChatGPT can help with writing, learning, brainstorming, and more." https://openai.com/chatgpt/
No mention of predicting words.
And the utility of a "language generator" without reliable knowledge or facts is extremely limited. The technical term for that kind of language is bullshit.
> People don't care about (or are willing to accept) the "wrong answers" because there are enough use cases for "writing" that don't require factual accuracy. (see for instance, the entire genre of fiction writing)
Fiction, or at least good fiction, requires factual accuracy, just not the kind of factual accuracy you get from recalling stuff from an encyclopedia. For instance: factual accuracy about what it was like to live in the world in a certain time or place, so you can create a believable setting; or about human psychology, so you can create believable characters.
I'd also argue that the economic value of coherent bullshit is ... quite high. Many people have made careers out of producing coherent bullshit (some even with incoherent bullshit :-).
Of course, in the long run, factual accuracy has more economic value than bullshit.
There is, additionally, the fact that there is no easy (or even moderately difficult) way to fix this aspect of LLMs, which means the choices are either: 1) ship it now anyway and hope people pay for it regardless, or 2) admit that this is a niche product, useful in certain situations but not most.
Option 1 means you get a lot of money (at least for a little while). Option 2 doesn't.
It's precisely that analogy we learned early in our study of neural networks: the layers analyze the curves, straight segments, edges, size, shape, etc. But when we look at the activation patterns, we see they are not doing anything remotely like that. They look like stochastic correlations, and the activation patterns are almost entirely random.
The same thing is happening here, but at incomprehensible scales and with fortunes being sunk into hope.
So you have a good definition of "intelligent", and it applies to LLMs? Please tell us! And explain how that definition is so infallible that you know everyone who says LLMs aren't intelligent is wrong?
What gets difficult is evaluating the response, but let's not pretend that's any easier to do when interacting with a human. Experts give wrong answers all the time. It's generally other experts who point out wrong answers provided by one of their peers.
My solution? Query multiple LLMs. I'd like to have three so I can establish a quorum on an answer, but I only have two. If they agree then I'm reasonably confident the answer is correct. If they don't agree - well, that's where some digging is required.
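As a sketch of that quorum idea (the three answers and the normalization step are hypothetical; plug in whichever client libraries you actually use to query the models):

    # Majority vote over answers from several models; trust only agreement.
    from collections import Counter

    def normalize(answer: str) -> str:
        """Crude normalization so trivially different phrasings can still match."""
        return " ".join(answer.lower().split())

    def quorum(answers: list[str], threshold: int = 2) -> str | None:
        """Return the most common answer if at least `threshold` models agree."""
        best, votes = Counter(normalize(a) for a in answers).most_common(1)[0]
        return best if votes >= threshold else None

    # Hypothetical responses from three different models to the same prompt:
    answers = [
        "Ukrainian has seven noun cases.",
        "ukrainian has seven noun cases.",
        "Ukrainian has six noun cases.",
    ]

    result = quorum(answers)
    print(result or "No agreement - time to dig in and verify by hand.")

The hard part in practice is the comparison itself: free-form answers rarely match verbatim, so that normalization step is doing a lot of hand-waving.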
To your point, nobody is expecting these systems to be infallible, because I think we intuitively understand that nothing knows everything. I wouldn't be surprised if someone wrote a paper on this very topic.
We have many rules, regulations, strategies, patterns, and legions of managers and management philosophy for dealing with humans.
With humans, they're incorrect sometimes, yes, and we actively work around their failures.
We expect humans to develop over time. We expect them to join a profession and give bad answers a lot. As time goes on, we expect them to produce better answers, and if they don't we have remediations to limit the negative impact they have on our business processes. We fire them. We recommend they transfer to a different discipline. We recommend they go to college.
Comparing the successes and failures of LLMs to humans is silly. We would have fired them all by now.
The big difference is that computers CAN get every single question correct. They ARE better than humans. LLMs are a huge step back from the benefits we got from computers.
I emphatically disagree on that point. AFAIK, nobody has been able to demonstrate, even in principle, that omniscience is possible over a domain of sufficient complexity and subtlety. My gut tells me this is related to Gödel's Incompleteness Theorem.
b) Experts may give wrong answers, but it will happen once. LLMs will do it over and over again.
Well... Sometimes "experts" will give the wrong answer repeatedly.
Garry Tan from YC is a great example of this.
It's not that he doesn't care. It's just that he believes the next model will be the one that fixes it, and that companies that jump on board now can simply update their model and be in prime position. Similar to how Tesla FSD is always two weeks away from perfection, and when it arrives they will dominate the market.
And because companies are experimenting with how to apply AI, these startups are making money, so investors jump in on the optimism.
The problem is that for many use cases (e.g. AI agents, assistants, search, process automation) they very much do care about accuracy. And they are starting to run out of patience with the empty promises. So there is a reckoning coming for AI in the next year or two, and it will be brutal. Especially in this fundraising environment.
No, what he does is he hopes that they can keep the hype alive long enough to cash out and then go to the next hype. Not only Garry Tan, but most VCs. That's the fundamental business model of VCs. That's also why Tesla FSD is always two weeks away. The gold at the end of the rainbow.
AI is like that right now. It's only right sometimes. You need to use judgement. Still useful though.
In the rare cases where more complexity produces a more reliable system, that complexity is always incremental, not sudden.
With our current approach to deep neural networks and LLMs, we skipped the incremental steps and jumped to rodent-brain levels of complexity. Now we are hoping that we can improve our way to stability.
I don't know of any examples where that has happened - so I am not optimistic about the chances here.
The question then becomes, "How wrong can it be and still be useful?" That depends on the use case. Wrong output is much harder to tolerate in applications that require highly deterministic results and matters less in those that do not. So yes, it does produce wrong outputs, but what counts as wrong depends on what the output is and on your tolerance for variation. In a question-and-answer context where there is only one right answer, it may seem wrong, yet it could also provide the right answer phrased in three different ways. Therefore, understanding your tolerance for variation is what matters most, in my humble opinion.
Inference is no excuse for inconsistency. Inference can be deterministic and so deliver consistency.
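A minimal sketch of that point, again assuming the Hugging Face transformers API and GPT-2 purely for illustration: greedy decoding removes the sampling step, so the same prompt gives the same output on every run.

    # Deterministic inference via greedy decoding: identical output every run.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    model.eval()

    inputs = tokenizer("The seven noun cases of Ukrainian are", return_tensors="pt")
    with torch.no_grad():
        out_a = model.generate(**inputs, max_new_tokens=20, do_sample=False)
        out_b = model.generate(**inputs, max_new_tokens=20, do_sample=False)

    assert torch.equal(out_a, out_b)   # greedy decoding always picks the argmax token
    print(tokenizer.decode(out_a[0], skip_special_tokens=True))

Of course, determinism only buys you consistency: a wrong answer decoded greedily is just the same wrong answer every time.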