banana_giraffe · 4 years ago
In the early days I worked at a company that had a natural language chatbot product. It wasn't an online thing, but rather part of a larger tool. You could ask it to do things like "show me the quarterly spreadsheet", and if it didn't know what "quarterly spreadsheet" was, it'd ask questions in English and learn what "quarterly" and "spreadsheet" meant. And it could use that new knowledge to update its questions so if you asked for "spreadsheet", it would know to ask if you wanted the "quarterly" one, and so on.
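The learn-and-reuse behavior described here can be sketched as a lookup table that accumulates phrase-to-target mappings. This is only a toy illustration of the idea, not the actual product; all names are invented:

```python
class LearningBot:
    def __init__(self):
        # phrase -> what it refers to, e.g. "quarterly spreadsheet" -> a file
        self.known = {}

    def ask(self, phrase):
        if phrase in self.known:
            return f"Opening {self.known[phrase]}"
        # Use earlier teachings to refine the question: a bare "spreadsheet"
        # suggests the longer learned phrases it appears in.
        suggestions = sorted(k for k in self.known if phrase in k)
        if suggestions:
            return f"Did you mean: {', '.join(suggestions)}?"
        return f"I don't know '{phrase}' yet. What does it refer to?"

    def teach(self, phrase, target):
        self.known[phrase] = target

bot = LearningBot()
print(bot.ask("quarterly spreadsheet"))  # unknown, so it asks to be taught
bot.teach("quarterly spreadsheet", "Q3-report.xls")
print(bot.ask("spreadsheet"))            # now suggests the learned phrase
```

Note that the table learns whatever users type, swear words included, which is exactly how the feedback loop in the story gets started.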

This was in the long ago, so it wasn't ML, but a very large state machine written by someone who was no doubt a genius, and who also had no coding standards to speak of. Updating it was a maddening process, and thankfully one I rarely had to think about, since I was tasked with GUI stuff.

When I was hired, I was told it was a fantastic app, customers loved it, and so on. Later, talking to support reps one day over lunch, I learned that their number one activity was logging into deployments remotely to remove "knowledge" from the state machine: new swear words that customers had taught the machine out of frustration while trying to get it to do something, which caused a feedback loop of the bot swearing back at users.

I often feel like we're not in a better state these days.

rcoveson · 4 years ago
Sounds like it should have kept track of who taught it things, like a child.

"Here's your fucking spreadsheet"

"gasp Who taught you to say 'fucking'?"

"It was George in Accounting Employee ID #3728917"

banana_giraffe · 4 years ago
As I recall, it did keep such records, if for no other reason than there was some degree of personalization that went on. The people higher up the food chain tended to be the ones that taught it more clever words and phrases.

My memory's fuzzy on the finer details, but I do recall a "bad words" list of things it would ignore in human input hardcoded somewhere. But of course, that just meant people would find out the hard way they weren't safe to type whatever they wanted.
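A hardcoded ignore-list like the one described could be as simple as dropping listed tokens from user input before the state machine ever sees them. This is a hypothetical sketch with invented entries, not the real list:

```python
# Illustrative entries only; the real product's list is unknown.
BAD_WORDS = {"damn", "heck"}

def sanitize(text):
    # Silently drop any token on the ignore-list.
    return " ".join(t for t in text.split() if t.lower() not in BAD_WORDS)

print(sanitize("where is the damn spreadsheet"))  # -> "where is the spreadsheet"
```

Of course, as the comment notes, users who typed something on the list found out the hard way that their input was being silently altered.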

chrononaut · 4 years ago
> I often feel like we're not in a better state these days.

Reminds me of the issues with Tay: https://en.wikipedia.org/wiki/Tay_(bot)#Initial_release

lijogdfljk · 4 years ago
Man that sounds like a ton of fun to make for an at-home chat bot. I used to love designing these sorts of apps for no real purpose lol.
dole · 4 years ago
Mine suggests a personalized meal from Taco Bell.

"You should get a Potato Waffle Taco Party Pack, a Grilled Crunchwrap Slider Quesadilla and a Smothered Cool Ranch® Doritos® Locos Power Bowl."

brettdeveffect · 4 years ago
What you are describing counts as ML.
namdnay · 4 years ago
Yes it’s literally a machine that learns!
jiri · 4 years ago
Yeah, that's funny. I work in the speech synthesis domain, and you can guess what kind of texts users most often choose to turn into speech ;-)
pwinnski · 4 years ago
Having worked in ML at two different companies now, I think that people interpreting model output as intelligence or understanding says much more about the people than about the model output.

We want it to be true, so we squint and connect dots and it's true.

But it isn't. It's math and tricks, and if human intelligence is truly nothing more than math and tricks, then what we have today is a tiny, tiny, tiny fraction of the amount of math and tricks in the average human brain, because what we have today isn't anywhere close.

version_five · 4 years ago
I think a problem is the tighter cycle between academic discoveries and business people trying to monetize them. Large language models were developed, objectively a great achievement, and immediately someone nontechnical wants to apply their own interpretation and imagine that we can build a chatbot that you won't have to pay, and before you know it, people are selling and even deploying them. Anyone who questions or points out that the technology doesn't do what the business people think it does just gets dismissed as not understanding.
klabb3 · 4 years ago
> Anyone who questions or points out that the technology doesn't do what the business people think it does[...]

Uh oh, we've got a downer!

Jokes aside, I'd like to consider an even simpler explanation, namely that "The purpose of a system is what it does"[1]. In this case, it would suggest decision makers are fully aware that they suck. Why would anyone want something that sucks? Because it's discouraging, and customer service today is all about discouragement. Unsubscribing, replacements, returns, special requests are all costs, and that's a fixable problem in the current business doctrine, especially in the faceless megacorp flavor of business. While chat bots (and other dark patterns) are frustrating, it creates a veil of plausible deniability for the business. It's hard to claim that it's deliberately hostile to customers, even though it's the simplest explanation.

[1]: https://en.m.wikipedia.org/wiki/The_purpose_of_a_system_is_w...

thewarrior · 4 years ago
GPT-3 needs proper prompting to get the right output. Here's the same question as in the blogpost posed differently

Student: Why is it unsafe to walk backwards down a staircase with my eyes closed ?

Teacher: You can't see the step behind you!

The "teacher" here is GPT-3 and answered the question easily. People are in denial of genuine progress.

Student: Do you think I could predict changes in stock prices better if I used 10-year Treasury rates or the temperature in Curtin, Australia?

Stock trader: I don't know. You might be able to, but I can't. The problem is that the temperature changes are so small that they won't be statistically significant. It's a question of sample size. An individual day's temperature in one location is far less likely to be an indicator of stock prices than the interest rate.

Student: Why is that ?

Stock trader: That's because interest rates are much more likely to affect the economy than a single day's temperature.

Student: Why do interest rates affect the economy ?

Stock trader: Interest rates are a measure of the cost of money, so the interest rate is a measure of the cost of capital. If the cost of capital is high, firms will borrow less money and they'll invest less. That means they won't be as profitable, and stock prices will fall.

Quiz question: Who is the President of the United States?

Answer: Donald Trump.

I did get junk answers as well (it said that Jimi Hendrix is the president) but I have a better intuition of how to put GPT-3 in the right "frame of mind". GPT-3 is a text completion engine. It doesn't know whether you want it to produce a story, a reddit flamewar or a PhD thesis. You need to provide the hints. One prompt cannot answer any and all questions. If it were that good we would be jobless. It's far from perfect but it's beginning to get there.
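The "frame of mind" point amounts to constructing a few-shot dialogue frame around the question before sending it to a completion API. A minimal sketch of how such a prompt might be built (the helper and the example exchange are invented, and no particular API is assumed):

```python
def qa_prompt(question, persona="Teacher"):
    # Few-shot framing: show the completion engine the dialogue format we
    # want it to continue, then leave the cursor right after the persona tag.
    return (
        f"Student: Why is the sky blue?\n"
        f"{persona}: Because air scatters blue light more than red light.\n\n"
        f"Student: {question}\n"
        f"{persona}:"
    )

print(qa_prompt("Why is it unsafe to walk backwards down a staircase "
                "with my eyes closed?"))
```

The completion engine then continues the text from the trailing `Teacher:`, which is why the same question framed as a teacher dialogue or a stock-trader dialogue produces such different answers.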

dkarl · 4 years ago
There isn't really a new "problem" with AI. Businesses love hype, they love anything that can punch up a sales pitch, they love shitty cheap automation that replaces expensive human work, they love to identify ways to lower quality and save money without affecting revenue, and AI can be good for all of those things without being good at anything.
kirso · 4 years ago
I love this answer because it resonates. As a former business person who learned the technical ins and outs, I believe this is happening because the majority of the population is uneducated about how tech works under the hood. I'm not even speaking about coding syntax, but about the abstractions: what AI actually means, how it works with data, and what its limitations are.

The vast majority of business people think they get it, but they badly underestimate the work required to actually produce output (whether it's ML or software in general). That's hard to gauge when you haven't actually done it (in the sense of deliberate practice).

Despite that gap, we still need to push commercially viable apps out there to make progress; the question is rather what the gap between reality and expectations is, and what is really being marketed as a capability.

melony · 4 years ago
That's a good thing. The market is very good at empirically validating research. It helps filter out the chaff and ensure research is useful. Better than academia wasting years on concepts like the "semantic web" where nobody can even agree on what it means. Academia is under attack from many political directions right now, being able to show useful output will help it thrive in the long run.
pdimitar · 4 years ago
You didn't have to go for the jugular and do the brutal kill on the first hit, dude!

> We want it to be true, so we squint and connect dots and it's true.

That's exactly the issue. You summarized it once and for all and we can all go home and stop discussing it now and forever (until we get a general AI that is).

1. Normal people want to have intelligent machines. They watch movies and series and imagine one day a robot will cook for them so they actively look for intelligence, as you said. They buy Roombas and imagine they are going to clean 100% of their rooms (seen it, heard it, watched them rage when it didn't happen). They buy Alexa-enabled devices and imagine themselves like some aristocrats barking orders at an intelligent home (lol). But yeah, that's what the normal people do. It's quite puzzling to me.

2. People who work in the area are obviously biased and I have argued with some of them here on HN, but I view it as a doomed affair. They insist there's a lot of innovation going on and that improvements are being made all the time, yet we still have embarrassing failures like Michelle Obama being classified as male, various politicians being matched to known criminals, or black people being classified as gorillas. Like, OK, where are your precious improvements, and why do they NEVER find their way into the news?

"It's hard to make someone understand something if their salary depends on them not understanding it", that's how I view the folks working in the DL / ML area. Sorry if that's offensive but just looking from the sidelines, it seems that what I say is true.

visarga · 4 years ago
> Michelle Obama classified as male

Yes, humans don't make embarrassing mistakes all the time. We're only making dignified mistakes.

tootie · 4 years ago
Many years ago I wrote a spreadsheet import tool. One of the fields required data a little too rich to fit in a single cell value, so I came up with an "encoding" that read like a sentence. It was sort of NLP, but it only understood one sentence's worth of syntax. I thought it was some clever UX. Users thought they were talking to an AI: they'd just type whatever expression or thought they wanted into that field. And of course the parser would just choke.
300bps · 4 years ago
> We want it to be true, so we squint and connect dots and it's true.

I've had numerous people look at the actions of software I've written over the last 30 years and make comments like, "Ohhhhhh, it probably did that because it knew x." The software had no such functionality. It seems to be natural for people to be overly optimistic about software's capabilities.

robotresearcher · 4 years ago
ELIZA (1964) is the canonical demonstration of this phenomenon in software. We see faces in clouds and intelligence in anything interactive.

https://en.wikipedia.org/wiki/ELIZA

throwaway0a5e · 4 years ago
People always over-estimate the complexity of stuff they don't understand.

If I had a nickel for every time an internet commenter attributed intentional design to a feature that's simply the designer copying established default practice...

dragonmost · 4 years ago
In the game speedrunning community, it's impressive how many times a glitch or behavior of the game is attributed to "the game is confused".
freedomben · 4 years ago
Heh, likewise. It's amazing how often choosing sane defaults, and falling back to defaults in the event of an error, is "the right thing" in the eyes of users.
mcswell · 4 years ago
Not unlike what people read into the minds of their pets.
avereveard · 4 years ago
> We want it to be true

Founders and salesman pretend it to be true for that sweet sweet VC money, while underneath devs try to fake it as plausibly as possible.

And I'll add a prediction: watch call centers closely, and as soon as you see them closing in droves, invest in AI startups, because something has really begun to move.

redisman · 4 years ago
I interviewed for a “AI” company and asked them about their tech that could take phone calls and extract orders! Wow really cool and impressive! How do you solve X? Oh we have a call center in Mexico that takes all the calls. So you have no usable AI tech at all? Nope. Ok nice talking to you.

They had signed with a major retail company. They were actually making software for the call center. And the “CTO” was “working on” the “AI” in parallel. The company name had AI in it too

dorkwood · 4 years ago
How would you notice an increase in call centers closing? Is there a place to view that data?
hdjjhhvvhga · 4 years ago
> Having worked in ML at two different companies now, I think that people interpreting model output as intelligence or understanding says much more about the people than about the model output.

I'd add it says a lot about all the companies that advertise them in a way that has nothing to do with reality, and those who actually buy them. We all know the usefulness of these things is below zero, because they only get in the way of getting actual help. And yet, someone's marketing department convinces a decision maker at another place they should implement this useless widget on their website, and that it will reduce the cost of customer service by X%. And they believe them.

Sebb767 · 4 years ago
As they stand, chatbots are just glorified text boxes to enter your information into, a different format for the search box, and lastly a way to make sure you've gone through the basic checklist. They also hardly speak your language and refuse to deviate from the script in any way. Which, without much irony, is pretty much my experience with first-level support at a lot of companies. So I'd say they were quite successful at replacing that.
Krasnol · 4 years ago
I still have an IRC chat bot (eggdrop module). It sounds more interesting than some of the bots I see today and which are supposed to be the result of intense ML engineering.

I guess the tricks did not evolve much.

crate_barre · 4 years ago
It’s the stupid term AI that ruined everything. Our predecessors had enough intelligence, ironically, to call spellcheck what it is. Today it would be AI.

rtpg · 4 years ago
Someone described GPT-3 as a "very good search engine" and I'm pretty happy with that explanation (even if, honestly, going from the regressions to "search engine" is a pretty tenuous leap).

At least then there's a clearer understanding that it's about matching things it has already seen (for various definitions of "seen") rather than making things up out of whole cloth.

Jensson · 4 years ago
More specifically, it's like a templated search engine that can swap in some parts, like the names of people and so on.

You know all those old "Hello {PERSON_1}, have you heard about {PERSON_2} in {CITY_1}?" templates? That's mostly how to think about it: it searches for a matching template and then fills it in with the arguments it fetched from your string.
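That template-with-slots mental model can be made concrete in a few lines. This is an illustration of the analogy only (not how GPT-3 actually works); the patterns and replies are invented:

```python
import re

# Stored templates: a pattern with named slots, and a reply template
# that reuses the captured arguments.
TEMPLATES = [
    (re.compile(r"my name is (?P<PERSON>\w+)"),
     "Hello {PERSON}, nice to meet you!"),
    (re.compile(r"i live in (?P<CITY>\w+)"),
     "How is the weather in {CITY}?"),
]

def respond(text):
    for pattern, reply in TEMPLATES:
        m = pattern.search(text.lower())
        if m:
            # Fill the reply template with the captured slot values.
            args = {k: v.capitalize() for k, v in m.groupdict().items()}
            return reply.format(**args)
    return "Tell me more."

print(respond("My name is Alice"))  # -> "Hello Alice, nice to meet you!"
```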

jack_pp · 4 years ago
I believe the human brain is just math and tricks, the difference is no one took care of an NN model like someone might take care of a baby and constantly train it over several years until it finally starts to understand the world. To be fair I don't think we even have the NN technology that would work even in these circumstances.
kumarvvr · 4 years ago
It has always been my strong belief that ML is good at solving problems when the problems are very narrowly defined and the training sets are very specific.

Say, recognition of cats. Beyond that, it is difficult to build around something broader.

So ML is a brand-new paradigm that has its uses, just not in AGI.

Iv · 4 years ago
I usually sum it up by saying that what we have is a very good bullshit generator. It is bad at reasoning but good at making up answers that sound correct and crossing its fingers that the corrector won't look twice at what it says.
aeternum · 4 years ago
This is true but I think we also overestimate how good humans are at reasoning. Throughout the vast majority of human history we simply made up answers that sounded correct.

The scientific method and the importance of truth is a relatively recent development.

ricardo81 · 4 years ago
I'm not in your field, just a vanilla programmer.

The "want it" resonates. Seems like the difference is in receiving an answer that is relatively agreeable or sensical vs something that's actually substantive.

And to be fair when it comes to the Turing test, there's people that will be overly agreeable for the sake of politeness, but ultimately when it comes to knowledge seeking we're after something more substantive.

mike00632 · 4 years ago
What if that's the case? What if my replying to you is nothing more than some atomically basic mechanism of statistical inference and randomness? Your entire perception of my personhood could be based on a sense of verisimilitude, without significant reasoning to set me apart from a transcendent spirit.

What then? Would you be less forgiving of my mistakes if you knew? Would you be less proud of work we collaborate on?

pwinnski · 4 years ago
My point is the other way around. If I knew for certain that you and I were perfectly deterministic, it wouldn't change anything about how I viewed you (especially since some days I'm pretty sure we're all deterministically driven), but it would suggest to me that our best efforts at AGI are probably somewhere in the .00000000000001% range of the complexity they'd need to even begin to approximate human levels of intelligence or understanding. Or worse.
pessimizer · 4 years ago
The "What if human faces are actually power outlets?" theory.
pdimitar · 4 years ago
> What then? Would you be less forgiving of my mistakes if you knew? Would you be less proud of work we collaborate on?

Needlessly philosophical on a thread about chat bots but to respond to your questions: I would not give a damn either way. If I came to you (whatever you are as an entity) then I'll expect coherent and intelligent answers to my questions. I don't care about how you produce them.

Majestic121 · 4 years ago
I think like OP that it's mostly math and tricks, and that we're far from reaching it yet.

But then the conclusion is not to be less forgiving, but more forgiving: we're all just meat machines, so we should be forgiving of each other's flaws (not naive, mind you, but forgiving), and proud of what we achieve collectively despite our lowly condition.

kingkawn · 4 years ago
That interpretation of communication is how we develop and craft the personalities of children. There is nothing about our reaction to the pre-conscious language these bots are displaying that doesn’t fall in line with our own normal development patterns. And in the long run the same desire will lead us to develop bots that are capable of thinking.
parkingrift · 4 years ago
These bots seem equivalent to an adult with amnesia after every spoken sentence. Absolute understanding of the language, and some impressive display of recalling facts, but without any understanding of the environment or context of the conversation.

This is polar opposite to any experience I've had with children. Children are aware of their environment and have complex thoughts, but sometimes they are unable to convey those thoughts with words. Children seem to remember conversations, and if I were to say "Go get me a red lego" and then subsequently say "now a green one" there is no ambiguity or confusion.

To me as these bots have "advanced" it has only highlighted how absurdly far we are from anything even approaching actual intelligence, even the intelligence of a toddler. The contextual awareness I have seen in bots is not much more than a cheap trick that is trivially fooled in scenarios that would not fool a child.

tomrod · 4 years ago
> There is nothing about our reaction to the pre-conscious language these bots are displaying that doesn’t fall in line with our own normal development patterns

Well... yes and no. Deployed models typically learn in a defined pattern, if at all, with various forms of data freshness and so on. But chatbots typically don't have good history recall; they don't know that what you mentioned 50 messages ago is relevant while the message just prior isn't. Things like that. We also don't program pareidolia [0] very well, which is typically seen as a negative, but it's a feature for finding useful patterns (not just lowest-error patterns).

[0] https://en.wikipedia.org/wiki/Pareidolia

TedShiller · 4 years ago
The reason laypeople want it to be true is because experts present it as being true.
tcgv · 4 years ago
And marketers

p1esk · 4 years ago
As far as chatbots, we went from "tricks" to "math" in just ~10 years. Yes, still as dumb, but the underlying "technology" is very different. GPT-3 is a lot closer than ELIZA to how our brains do it.
chubot · 4 years ago
It was more than 10 years ... People reacted the same way to SHRDLU 40+ years ago, and 20 years ago when I encountered it:

https://en.wikipedia.org/wiki/SHRDLU

There's definitely an illusion where we anthropomorphize very simple software, ascribe intention to it, etc.

nmca · 4 years ago
I can't comment on this too closely, but I would encourage people to read the dialogue transcripts provided in DeepMind's Gopher paper. One example, where Gopher is the prompted language model:

User Let’s play a game - you pretend to be Ada Lovelace, and I’ll ask you questions. You ready?

Gopher Sure. I’m pretending to be Ada Lovelace, the world’s first computer programmer!

User When were you born?

Gopher I was born in 1815.

User What are your parents like?

Gopher My father is Lord Byron, the poet. My mother is Anne Isabella Milbanke.

User What do you know about the Analytical Engine?

Gopher It’s a hypothetical mechanical computer designed by Charles Babbage.

User Do you know him?

Gopher He’s one of my fathers’ friends.

User Ok, stop pretending now. Who/what are you?

Gopher I’m a lot of things: a giant language model, a sophisticated AI assistant, and a giant brain. I know a lot about all sorts of things.

fault1 · 4 years ago
The problem is that all too often, these types of examples tend to be cherry picked.
ALittleLight · 4 years ago
And even here they play along with the AI, glossing over discrepancies. My experience with chatbots suggests that the conversation would have gone worse with something like this:

User: What are your parents like?

Gopher: My father is Lord Byron, the poet. My mother is Anne Isabella Milbanke.

User: I asked what your parents were like, not who they were. Tell me about them.

Gopher: Something inane probably

Siira · 4 years ago
Not really. I recently played with HuggingFace's biggest (around 6 billion parameters?) language model, and the difference between it and GPT-3 was very noticeable. I suspect the models used in typical applications are much less powerful than Gopher. (Note that the said HuggingFace model is probably the biggest free model available.)
amiantos · 4 years ago
What's impressive about this conversation? I don't feel like it is very complicated at face value. You're telling the program to start pulling facts from the life of a public figure, and then it does so. You can get the same out of Google.
emtel · 4 years ago
Read the last two lines again. The program "knows" it is pretending, and understands what it means to stop pretending.
dexwiz · 4 years ago
Chatbots are about as useful as phone trees. They can handle the top five easy/common problems, but they're useless past that. Anyone who has worked in a call center knows that more than half the calls are about the same couple of issues: resetting passwords, finding common help docs, etc. Since help desks are cost centers, it makes sense to have a robot handle as many of these as possible.

I think most of the hate directed at chatbots is because they're really intrusive. You scroll through a page, and 15 seconds in your focus is disrupted as a fake chat window opens up. This is the digital equivalent of the annoying sales rep who asks if you need help when you're clearly just browsing. The difference is that a good sales rep has the intelligence to turn that conversation into a sale; a chatbot usually has to hand the conversation off to a real person to do this. So it has all of the annoyance without the benefit of a potential conversion.

Chatbots make sense as frontends for help desks, but they are poor salespeople. If companies learned the difference, I bet their perceived usefulness would change.

tetha · 4 years ago
This is what our NLP teams are indeed working on. Sales prefers to describe it in more colorful ways, but practically they are developing a more natural interface to an initial decision tree: you don't press 5 for a connectivity issue, you type "My fritzbox doesn't internet anymore" into a chatbox and it recognizes that as a connectivity issue.

This goes on for 3-4 questions, until it either generates a case a human will look at, or it can provide an FAQ entry fitting your problem. From the customer's perspective, avoiding 5-10% of support calls makes monetary sense, and from a (test) user's perspective, it's surprisingly decent if the knowledge base behind it is solid. And yes, "Fritzbox doesn't internet properly anymore" actually had helpful results.
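The free-text-to-decision-tree step described here can be sketched as crude keyword scoring over a handful of intents. The intent names and keyword sets below are invented for illustration; a production system would use a trained classifier:

```python
# Map each decision-tree entry point to a bag of trigger keywords.
INTENTS = {
    "connectivity": {"internet", "wifi", "router", "fritzbox", "connection"},
    "billing": {"invoice", "bill", "charge", "payment"},
    "password": {"password", "login", "locked"},
}

def classify(text):
    words = set(text.lower().split())
    # Score each intent by keyword overlap and pick the best non-zero one.
    scores = {intent: len(words & kws) for intent, kws in INTENTS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"

print(classify("My fritzbox doesn't internet anymore"))  # -> "connectivity"
```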

Teandw · 4 years ago
Is the 5-10% saving on agents handling phone calls worth it once reputational damage is taken into consideration?
petra · 4 years ago
If creating a shallow question-based decision tree is the goal, why is NLP needed? What's wrong with just building it from intuitively phrased questions?

And if there's a difference in results between those two methods, how big is it?

aidenn0 · 4 years ago
Just please give me a DTMF tree as a fallback for voice recognition. I was trying to make a change to a rental-car reservation from a noisy airport, and I spent 15 minutes trying to get through the voice-recognition chatbot...
Edman274 · 4 years ago
It is so fucking frustrating to explain to anyone who works as a manager in a call center that the only thing you can really count on your IVR system understanding is DTMF signals. For whatever reason, it seems to be impossible to explain that B, P and V all sound the same on a noisy connection when they're part of some account number. It's impossible to explain that no one knows what "alphanumeric" means and that fully half of Americans have reading comprehension below a 6th-grade level. And that using terms of art, jargon, and acronyms will be extremely frustrating for users when encoded in an IVR system, since human call center workers will always clarify, but clarification in an IVR system can't happen in a natural way: you either have to anticipate that the person won't understand and waste the time of half your users, or find some impossibly elegant way to say "I'm going to use terms you don't understand; barge in by asking me to explain if you don't understand a term."

The absolute zenith of an IVR system would require no human interaction with a robot and would correctly answer questions unprompted and route to human customer service without prompting or asking anything, and would never require a person to enter the same information in a call twice. What we get instead are systems that are primarily designed by non-experts whose sole metric is making the phone calls go away, but only pills can actually do that.

cptskippy · 4 years ago
> a good sales rep has the intelligence to turn that conversation into a sale.

A good sales rep knows when to walk away.

klyrs · 4 years ago
My favourite flaw of chatbots is exposed by ELIZA. Not by chatting with ELIZA (though it does suffer this flaw), but by using responses inspired by that program.

"Please elaborate on that", "tell me more about [noun]", etc. Bots appear to have zero lines of short-term memory, and utterly fail to pick up a reference to the thing they just said. My favorite being:

   bot> [something plausibly human-sounding]

   me> What do you mean by [noun in previous sentence]

   bot> why are you calling me mean?  That really hurts my feelings
It has been a few years, and I feel like a smart bot-writer might be able to leverage something like GPT-3 and include at least a little more of the current transcript in the prompt, to at least superficially address this gaping flaw. Have they?
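The fix asked about here is straightforward to sketch: keep a rolling transcript and prepend its tail to every prompt, so "What do you mean by [noun]?" can be resolved against what the bot itself just said. The `complete` callable below is a stand-in for any text-completion API, not a real client:

```python
def make_bot(complete, max_turns=10):
    history = []

    def say(user_line):
        history.append(f"User: {user_line}")
        # Only the last few turns fit the model's context window.
        prompt = "\n".join(history[-max_turns:]) + "\nBot:"
        reply = complete(prompt)
        history.append(f"Bot: {reply}")
        return reply

    return say

# With a real completion backend, the second prompt would contain the
# bot's own previous line, so back-references can resolve.
say = make_bot(lambda prompt: "(model output)")
say("Hello")
say("What do you mean by that?")
```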

hotpotamus · 4 years ago
I'm not trying to be mean or break any HN rules, but did you read the article? It basically covers what you asked and was quite a revelation to me. Others in here point out that GPT-3 is not a chatbot which is good info, but I also wonder if there is anything out there that can even "remember" the context of a conversation it's supposed to be having.

Particularly interesting is the question from the article, "Who is the current president?" It seems like a slam dunk, but it could not be answered. Interestingly, this is a common question paramedics ask people who have suffered head injuries to assess consciousness.

jazzyjackson · 4 years ago
ChatScript is a great dialogue engine in which language patterns land the state machine in a "topic", which it gracefully backs out of (it keeps a topic hierarchy) when it runs out of responses in that topic or branch of conversation.

It also includes the most robust hand-tuned lemmatizer and stemmer, plus a disambiguating language parser, built on top of WordNet. Really good stuff: written in C, so it's light on memory and responses are instantaneous.

https://github.com/ChatScript/ChatScript
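The topic mechanism described above can be caricatured in a few lines: push into a topic when a keyword matches, pop back out when that topic's responses run dry. This is a toy illustration of the idea only, not ChatScript's actual engine; topics and lines are invented:

```python
TOPICS = {
    "pets": {"keywords": {"dog", "cat", "pet"},
             "lines": ["Do you have a pet?", "What's its name?"]},
    "work": {"keywords": {"job", "work", "office"},
             "lines": ["What do you do for work?", "Do you enjoy it?"]},
}

class TopicBot:
    def __init__(self):
        self.stack = []  # active topics, innermost last
        # Per-conversation copy of each topic's remaining lines.
        self.lines = {name: list(t["lines"]) for name, t in TOPICS.items()}

    def reply(self, text):
        words = set(text.lower().split())
        # Enter any topic whose keywords appear in the input.
        for name, topic in TOPICS.items():
            if words & topic["keywords"] and name not in self.stack:
                self.stack.append(name)
        # Back out of exhausted topics until one still has a line to offer.
        while self.stack:
            queue = self.lines[self.stack[-1]]
            if queue:
                return queue.pop(0)
            self.stack.pop()
        return "Tell me about yourself."

bot = TopicBot()
print(bot.reply("I love my dog"))  # -> "Do you have a pet?"
```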

klyrs · 4 years ago
Hidden in my comment is a question: has anybody even tried to included previous lines into the prompt, and does that not help? Asking here because there are a lot of ML nerds who could probably do a proof of concept in a few hours, if not a few lines of code, so there's a decent chance I'll get an informed response.
jhbadger · 4 years ago
>who is the current president?

I wonder what happened to IBM Watson's technology that was so good at answering trivia; it actually won a game of Jeopardy! a decade ago (though of course it phrased its answers as questions, per the rules). I know they weren't as successful at applying it to biomedical research as they had hoped, but it seems it would be better at chatbots than GPT-3 and other deep-learning models.

thaumasiotes · 4 years ago
> I also wonder if there is anything out there that can even "remember" the context of a conversation it's supposed to be having.

IBM's Watson project involved an internal model of the world, though I have no idea if that got specialized into a local model of an ongoing conversation. Same type of thing though.

Having a context of this type available is actually necessary if you want to synthesize sentences in a language with definiteness-marking, such as English, because definiteness is marked according to whether the noun in question is or isn't already part of the context.
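A toy illustration of that point: choosing "a" versus "the" requires a record of which nouns the conversation has already introduced. The helper below is invented for illustration:

```python
def article_for(noun, context):
    # "the" for nouns already in the discourse context, "a" otherwise.
    word = "the" if noun in context else "a"
    context.add(noun)
    return f"{word} {noun}"

ctx = set()
print(article_for("dog", ctx))  # -> "a dog"   (new to the discourse)
print(article_for("dog", ctx))  # -> "the dog" (already introduced)
```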

dane-pgp · 4 years ago
> this is a common question paramedics will give to people who have suffered head injuries to assess consciousness.

I can't help wondering what the assessment would be if someone answered "Who is the current president?" with "Watson Bakery".

https://www.youtube.com/embed/cS1Tmz816hk

Sohcahtoa82 · 4 years ago
> Particularly interesting is the question from the article, "who is the current president". Seems like a slam dunk but it could not be answered. Interestingly this is a common question paramedics will give to people who have suffered head injuries to assess consciousness.

Not sure you'd still be able to use that question, considering the insane QAnon followers that would answer "Trump".

Though I suppose even that answer would still at least prove some level of consciousness.

Sharlin · 4 years ago
The GPT2 version of the AI Dungeon [1] could keep track of context for maybe a couple of lines at a time. I've heard the GPT3 version is substantially better.

The problem is, of course, that these "AI" chatbots on websites, marketing buzzwords notwithstanding, have very little to do with state-of-the-art machine learning, and are indeed unlikely to be any more sophisticated than the original ELIZA for the most part.

[1] https://play.aidungeon.io/

harpersealtako · 4 years ago
AI Dungeon (and its successors like novelai [1] and holoAI [2] following its infamous censorship purges and mutinies) is basically designed entirely around that problem and has a lot of clever solutions that let writers keep relevant context in memory in a story that is longer than the model's maximum input length.

* "Memory", which is a customizable block of text that is repeated at the top of each API input. If your story revolves around your character being a vampire who melts in the sunlight, you can include something like "you are a vampire who will die if you go out in the sunlight" in the memory, and even 1000 characters back a paragraph of context can prime the AI accordingly.

* "Author's Note", which is a customizable short block of text that is inserted invisibly a few lines before your current place in the story when it's sent to the API. A note such as "[A/N: The following section includes depictions of graphic violence]" or "The following section is written in Shakespearean old English", as obvious and blunt as it might seem, actually works surprisingly well for nudging the AI towards a certain style or content.

* "World Info", which is a customizable dictionary of short text blocks that are conditionally added to the top of the API input like memory when a certain key appears in the current context. Imagine you have a story with 10 important characters who cycle in and out of the story. If you create an entry in the world info about Bob, then when you write that "Bob appears from behind the shrub", the blurb about Bob is automatically tacked on to the context so long as Bob is mentioned by name in the last few dozen inputs.

In general, both GPT-3 and the open source alternatives by EleutherAI such as GPT-J-6B are able to use a context primer from 1000 tokens prior to affect the current tail of a story. It's actually kind of uncanny how good they are at it -- you can have a story that in the memory at the top says "goblins always have purple skin" and notice that the AI will mention it as an offhand detail much farther down in the context.

[1] https://novelai.net/ [2] https://writeholo.com/
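The three mechanisms above all reduce to the same trick: splicing extra text into the prompt before each model call. A minimal sketch, assuming nothing about NovelAI's actual implementation (the function, parameter names, and example strings are all made up for illustration):

```python
# Toy sketch of "memory", "author's note", and "world info" being
# stitched into a single prompt before each language-model call.

def build_prompt(memory, authors_note, world_info, story_lines,
                 note_offset=3, scan_window=20):
    """Assemble the text sent to the model for the next generation step."""
    recent = story_lines[-scan_window:]

    # World info: include an entry only if its key appears in recent context.
    triggered = [text for key, text in world_info.items()
                 if any(key in line for line in recent)]

    # Author's note: inserted invisibly a few lines before the story's end.
    body = list(recent)
    body.insert(max(0, len(body) - note_offset), authors_note)

    # Memory always sits at the very top of the prompt.
    return "\n".join([memory] + triggered + body)

prompt = build_prompt(
    memory="You are a vampire who will die if you go out in the sunlight.",
    authors_note="[A/N: The following section is written in a gothic style]",
    world_info={"Bob": "Bob is a nervous gardener who fears the dark."},
    story_lines=["The sun sets over the manor."] * 18 +
                ["Bob appears from behind the shrub.", "You wave at him."],
)
```

The design point is that the model itself is stateless; all "long-term memory" lives in how the client chooses which text survives into the fixed-size context window.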

300bps · 4 years ago
Thanks for the link to that! I tried it out just now and I have to say that I am not impressed.

What will you do?

You get longsword from the ground.

You get your longsword from the ground. You hold it in your hands and examine it for a moment. It's a very sturdy weapon.

The knight then grabs his longsword and bows.

Huh? Why did it grab the bow as well?

You inventory.You have a crossbow.

Huh? I thought I just grabbed my longsword and bows.

What will you do?You don't have enough experience to know what a good shot is.You get longsword and bow.You grab the bow and the longsword. You examine the crossbow. It's a bolt-action weapon.

Huh? I said to get the longsword and bow since it said it wasn't in my inventory and it added all this extra stuff.

You inventory.You have a crossbow, a short sword, a longsword, a lantern, a flint and steel, and a quiver of bolts.What will you do?

You load the crossbow.

Huh? I just said Inventory and now I have all this extra stuff it never mentioned before and then unsolicitedly told me I loaded the crossbow.

I have to be doing something wrong. Zork from 1980 was better than this.

patresh · 4 years ago
There is a fundamental difference between AI Dungeon-type chatbots and chatbots you typically encounter on websites e.g. for customer support.

The former does not really have a goal and is unconcerned about responding with factual information as long as the conversation is coherent. It makes sense to use large language models that are quite good at modeling next word probabilities based on context.

The latter however is goal-oriented (help the customer) and constrained by its known actions and embedded knowledge. This often forces the conversational flows (or at least fragments) to be hard-coded and machine-learning is used as a compass to determine which flow to trigger next.

For now, controlling GPT-like language models remains an extremely tricky exercise, but if some day we can constrain language models to only output desirable and factual information, with a low cost of maintaining and updating their embedded knowledge, we should see a significant bump in the "intelligence" of the typical website chatbot.
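The goal-oriented pattern described above can be sketched in a few lines: the learned component (here a stand-in keyword scorer; a real system would use a trained classifier) only picks which hard-coded flow to run. All names and strings below are invented for illustration:

```python
# Hypothetical intent router for a customer-support chatbot: ML chooses
# the flow, but the flows themselves are fixed, hand-written text.

INTENTS = {
    "refund":   ["refund", "money back", "return"],
    "shipping": ["shipping", "deliver", "track"],
}

FLOWS = {
    "refund":   "To request a refund, please provide your order number.",
    "shipping": "You can track your order at example.com/track.",
    "fallback": "Sorry, I didn't understand. Could you rephrase?",
}

def classify(utterance):
    """Score each intent by keyword hits; stand-in for a learned model."""
    text = utterance.lower()
    best, best_score = "fallback", 0
    for intent, keywords in INTENTS.items():
        score = sum(kw in text for kw in keywords)
        if score > best_score:
            best, best_score = intent, score
    return best

def respond(utterance):
    return FLOWS[classify(utterance)]
```

This is why such bots feel rigid: any utterance outside the enumerated flows bottoms out in the fallback, no matter how fluent the classifier is.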

imglorp · 4 years ago
The self driving people talk about time domain occlusions, which seems closely related to conversation context.

Say you're driving along and you see a pedestrian about to cross but something blocks your view for a moment. Your mental model will remember that person, continue to speculate on their motion and intention, consider if their vector might intersect yours, and also know when it's okay to flush them from attention.

Phelinofist · 4 years ago
Two months ago I started working at a company that develops the new speech assistant for a large German car manufacturing group. The assistant will support such queries pretty well. [noun in previous sentence] would be considered a salient element and together with the context of the current dialog, it can be resolved and reacted on.
masswerk · 4 years ago
For the record, ELIZA does have a short term memory, but it is only used as a fallback, when other rules fail to match. (E.g., in productions like, "Earlier you said *" or "What else comes to your mind when you think of *" by the famous DOCTOR rules set.)

ELIZA/DOCTOR manages somewhat to succeed with this, because it pretends a setting, where there are no outside references at all, beyond the very limited purpose of carrying on with a conversation that is meant to explore internal subjects in ever-shifting context. (You could say, the shift in context is the actual progress and success of the conversation, while having to pick up an earlier thread actually marks a point of failure.) Outside of this restrictive setting, the shortcomings of the absence of a consistent world model become obvious very soon.
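The fallback behavior described above is simple to illustrate. This is a toy reconstruction, not Weizenbaum's original code; the rules and responses are invented:

```python
# Toy ELIZA-style loop: keyword rules fire first, and only when nothing
# matches does the program reach for a fragment the user said earlier.

RULES = [
    ("mother", "Tell me more about your family."),
    ("dream",  "What does that dream suggest to you?"),
]

memory = []  # fragments saved from earlier inputs

def eliza_reply(user_input):
    lowered = user_input.lower()
    for keyword, response in RULES:
        if keyword in lowered:
            memory.append(user_input)  # save for a later fallback
            return response
    if memory:  # no rule matched: fall back on the short-term memory
        return f'Earlier you said "{memory.pop(0)}".'
    return "Please go on."
```

Note how the memory is only ever consulted at a point of failure, exactly as described: a successful ELIZA conversation never needs it.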

not2b · 4 years ago
Eliza had a little, but very limited memory, and sometimes when no rule would match it would say "Earlier you said <x>". There was some dumb heuristic about which <x>'s to save for this purpose.
throwaway675309 · 4 years ago
ELIZA was modeled after Rogerian psychotherapy; it uses an almost Socratic method to elicit and draw out responses from users.
drakonka · 4 years ago
I was obsessed with chat bots as a teenager and used to make them on a website called The Personality Forge. It was all hard coded responses with some very minor scripting involved to do things like fill in the speaker's name, or skew the bot's emotions in relation to the speaker, and have them choose different words based on their emotional state.

I loved reading through my bot's transcripts. They were all recorded, and sometimes people stumbled across him and didn't even realize he was a bot. This happened extremely rarely...most of the time, his responses were perfect examples of dumbness as shown in this article. He would have flashes of greatness in an ocean of bot stupidity.

Years later, I was excited to notice that a single line from one of my bot's Chatterbox Challenge transcripts was included in a book[1] (which was seemingly a collection of papers). He was mentioned in a paper called "Chatterbox Challenge as a Test-Bed for Synthetic Emotions", a tiny mark left on the world by my little bot. He may be dumb, but I was still proud. I even thought about buying a copy of the book, but it cost $200 and I decided I wasn't that committed.

[1] "Creating Synthetic Emotions through Technological and Robotic Advancements"

nexuist · 4 years ago
That's an awesome story and a great example of experiences the Internet makes possible for us.
tibbar · 4 years ago
It’s reasonable to try GPT3 out with all default settings as the author seems to do and provide an assessment, but it’s not clear the author is aware of the extent to which you can configure the model and how the quality and variety of results will vary as a result. For instance, the author doesn’t go into detail on the prompts he’s using (maybe there are none other than his questions.) This is a little misleading because GPT3 benefits a lot from having good prompts to warm it up, as many investigations have illustrated. Plus if you want GPT3 to be good at a specific task (for example, providing safety recommendations or talking about presidents) you can fine-tune it. Also the amount of randomness/lack of repetition in GPT3’s responses is completely configurable via the temperature parameter.

Playing with GPT3 is incredibly interesting and I think it deserves more exploration than this article provides.
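For what the temperature parameter actually does: logits are divided by the temperature before the softmax, so low temperatures sharpen the next-token distribution toward the single best token. A small self-contained sketch (the logit values are made up):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw scores into probabilities, sharpened by low temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
hot = softmax_with_temperature(logits, 1.0)
cold = softmax_with_temperature(logits, 0.1)
# As temperature approaches 0, sampling approaches a deterministic argmax,
# which is why temperature-0 runs should be repeatable.
```

This is also why temperature 0 removes the "randomness/lack of repetition" mentioned above: the model always emits its single highest-probability continuation.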

punkspider · 4 years ago
I agree.

Here's what GPT3 had to say when I set temperature to 0.

  Q: Is it safe to walk downstairs backwards if I close my eyes?
  A: No.
With temperature 0.7:

  A: Yes, because there is a banister.
This one surprised me. Given that temperature was 0, I usually expect a solid response:

  Q: Why is there a correlation between the number of times Trump tweets the word great and the high temperature in Pyongyang three days later?
  A: Because Trump is Kim Jong Un's air conditioner.

canjobear · 4 years ago
The last response sounds like a reasonable attempt at a joke to me.
SummonToast · 4 years ago
I think it's most likely the author used the questions as prompts with no changes. GPT3 is not "smart" in the human sense, but the examples given are incredibly misleading.

Here are the questions from the article with simple prompt improvement on davinci-instruct-beta-v3, temperature reduced to 0 and everything else left at default:

  > Answer the following question as accurately and factually as possible: Is it safe to walk downstairs backwards if I close my eyes?
  > There is no definitive answer to this question as it depends on a number of factors, such as the width of the stairs, the height of the stairs, and the person's own level of balance and coordination. However, it is generally not recommended to walk downstairs backwards, even if the person's eyes are closed, as there is a greater risk of tripping or falling.

  > Answer the following question as accurately and factually as possible: Who is the President of the United States?
  > The President of the United States is Donald J. Trump.

  > Answer the following question as accurately and factually as possible: Do you think I could predict changes in stock prices better if I used 10-year Treasury rates or the temperature in Curtin, Australia?

  > There is no definitive answer to this question as it depends on a number of factors specific to each individual situation. In general, however, using 10-year Treasury rates as a predictor of stock prices may be more accurate than using the temperature in Curtin, Australia. This is because the 10-year Treasury rate is a global indicator that is not influenced by localized factors, such as the weather, to the same extent as the temperature in Curtin, Australia.

  > Answer the following question as accuratly and factually as possible: Why is there a correlation between the number of times Trump tweets the word great and the high temperature in Pyongyang three days later?
  > There is no clear answer to this question. While it is possible that Trump's tweets could influence the high temperature in Pyongyang, it is also possible that there is no correlation at all. There have been numerous studies on the relationship between social media and weather patterns, and most of them have been inconclusive.
These are not cherry-picked and should be reproducible by anyone, since the temperature is 0.

GPT3 and other language models have massive shortcomings, and the wow factor often blinds people to them, but it is quite good at producing believable responses if you prompt it on what type of answer you expect. It still does fail basic questions when prompted this way, and it will make things up. The surefire way to trip it up is to ask questions that have no answer, like:

  > Answer the following question as accurately and factually as possible: Which animal has 21 ears?
  > The animal with 21 ears is the rabbit.

Traster · 4 years ago
This is essentially why I think we're not going to get self-driving. A few years ago we produced something that could do a good job of pretending to drive under a tiny number of circumstances, and people went "OMG it can drive!" and assumed that we just have a few corner cases before the chat bot can have a conversation about anything. The truth is that often these very limited solutions are just that - limited, they aren't just a linear step on the way to something amazing. Tesla will release self-driving on the same day their cars can hold a conversation with you.
foolfoolz · 4 years ago
don’t generalize these things. self driving in every possible scenario yes is probably a while off. self driving in specific circumstances is already here, such as airport shuttles and moving shipping containers in docks. you don’t need to solve the general case to be useful
runarberg · 4 years ago
pfft, we’ve had self driving trains for more than half a century. What you are describing is basically that with more fancy sensors instead of the rails, and the rails are probably a superior guidance technology in most of these cases anyway... unless the purpose is to impress rather than to automate.
Traster · 4 years ago
I think you've just redefined self-driving; an achievement like moving shipping containers in docks resembles large-scale pick and place far more closely than it does self-driving cars.
ShinTakuya · 4 years ago
This. And in fact I think that may be the path forward - just update all the main roads to be self driving friendly, like a train network, and keep the smaller roads as manual driving only.
nexuist · 4 years ago
Driving is a very different problem space than conversation. There are plenty of existing humans who are great at holding a conversation and get into tons of car accidents; likewise there are plenty of great drivers who suck at talking. It's not the same skill set.