I feel like there should be an LLM architecture that includes "scratch space" - tokens the model can write to and read from which do not constitute part of its output. The trouble with current architectures is that they can only do a finite amount of computation per output token - they get one forward pass and then have to output something. Chain-of-thought reasoning allows the model to devote more computation to finding the answer, storing intermediate results in its output tokens. But this is silly - most of the intermediate tokens are not providing useful information towards solving the problem, they're just wasted computation:
>There are 16 balls in total.
>Half of the balls are golf balls.
>That means that there are 8 golf balls.
>Half of the golf balls are blue.
>That means that there are 4 blue golf balls.
For the number of forward passes being done to generate this text, only a few tokens are actually helpful - most are grammatical filler. Further, the model is losing information by being forced to project its state down to a single output token. What's more, the most probable one-step output may not even be the most informative or helpful one!
It'd be much nicer if the model could write arbitrary, continuous-valued tokens to a private scratch space and then attend to those tokens as though they were words in the prompt while generating the actual output, potentially performing several forward passes per output token when necessary.
In short, if chain-of-thought prompting is such a good idea, we should bake it into the model. Obviously all of this is FAR easier said than done.
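To make that concrete, here's roughly what the decode loop could look like. This is a sketch only: the backbone is assumed to be any causal transformer that maps embeddings to hidden states, every name here (ScratchpadDecoder, to_scratch, k_scratch) is invented for illustration, and the genuinely hard part - training the model to actually use the scratch vectors - isn't addressed at all.
```
import torch
import torch.nn as nn


class ScratchpadDecoder(nn.Module):
    """Sketch: before emitting each real token, write K continuous
    'scratch' vectors that later positions can attend to but which are
    never projected to the vocabulary."""

    def __init__(self, backbone: nn.Module, d_model: int, vocab_size: int, k_scratch: int = 4):
        super().__init__()
        self.backbone = backbone                       # any causal transformer: embeddings -> hidden states
        self.to_scratch = nn.Linear(d_model, d_model)  # writes a continuous scratch vector from the last hidden state
        self.lm_head = nn.Linear(d_model, vocab_size)  # only applied when emitting an actual token
        self.k_scratch = k_scratch

    def next_token_logits(self, embeds: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        """embeds: (batch, seq, d_model), a mix of token and scratch embeddings."""
        # K extra forward passes, each appending one private scratch vector.
        # (No KV caching here, purely for clarity.)
        for _ in range(self.k_scratch):
            h = self.backbone(embeds)                  # (batch, seq, d_model)
            scratch = self.to_scratch(h[:, -1:, :])    # continuous, never decoded to a word
            embeds = torch.cat([embeds, scratch], dim=1)
        # Final pass produces the distribution over the real next token.
        h = self.backbone(embeds)
        return self.lm_head(h[:, -1, :]), embeds
```
The point is just that the scratch vectors stay continuous and private: they get appended to the context and attended to, but only the final pass goes through the LM head.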
On the other hand, if it represents its scratch space in English, it's a lot easier to see how it justifies its answer and to tell where it's gone wrong. Debuggability seems pretty important?
Maybe it just needs more training at "thinking out loud" so it does it without prompting?
> arbitrary, continuous-valued tokens to a private scratch space
I'm with skybrian. Please don't use private scratch spaces. The one saving grace of current LLMs when it comes to understanding them is that they still generally need to "think out loud" by outputting more text. Remove that and you end up with a truly inscrutable black box, which has terrible implications for AI interpretability, with knock-on effects for AI safety.
Is it really that big of a deal if AI leapfrogs us?
Everyone else in the field is worried about safety, alignment, and bias.
Google used this excuse to execute slowly. Now they've got the "deer in headlights" look, with their single biggest cash cow clearly in the cross hairs.
And here I am excited by the possibility of AI out-evolving us.
Is this something one could try to quickly implement alongside NanoGPT? Seems like a pretty straightforward, concrete idea, once you decide where you want those tokens to fit into downstream attention layer inputs. Evaluating relative performance at a small scale could give an indication of whether it's worth trying at larger scales, unless it's one of those things that doesn't help until your model is huge.
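For what it's worth, the cheapest thing to bolt onto nanoGPT is probably not a full read/write scratchpad but a fixed bank of learned continuous embeddings prepended to the sequence - closer to a small bank of learned memory slots than a true writable scratch space. Sketch below, with all names invented:
```
import torch
import torch.nn as nn


class ScratchSlots(nn.Module):
    """A fixed bank of learned, continuous embeddings prepended to the
    sequence so every attention layer can attend to them; they are sliced
    off again before the LM head and never appear in the output."""

    def __init__(self, n_scratch: int, n_embd: int):
        super().__init__()
        self.slots = nn.Parameter(0.02 * torch.randn(n_scratch, n_embd))

    def prepend(self, tok_emb: torch.Tensor) -> torch.Tensor:
        # tok_emb: (batch, seq, n_embd) -> (batch, n_scratch + seq, n_embd)
        b = tok_emb.size(0)
        return torch.cat([self.slots.expand(b, -1, -1), tok_emb], dim=1)

    def strip(self, hidden: torch.Tensor) -> torch.Tensor:
        # Drop the scratch positions before projecting to the vocabulary.
        return hidden[:, self.slots.size(0):, :]
```
You'd call prepend right after the token/position embeddings and strip right before the LM head; with ordinary causal masking every real token can already attend back to the slots, since they sit at the front of the sequence.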
Yes, IIUC it had something like a separate scratch space, and training examples teaching it to "think" in terms of symbolic expressions and Python programs.
Question: A needle 35 mm long rests on a water surface at 20 °C. What force over and above the needle’s weight
is required to lift the needle from contact with the water surface? σ = 0.0728 N/m.
<work>
σ = 0.0728 N/m
σ = F/L
0.0728 = F/(2 × 0.035)
F = 0.0728(2 × 0.035)
calculate.py
```
f = 0.0728*(2*0.035)
with open("output.txt", "w") as file:
    file.write(str(round(f, 5)))
```
«run: "calculate.py"»
«read: "output.txt"»
0.0051
</work>
Answer: F = 0.0051 N
So here's a trick that worked for the Clue question:
step 1:
Hi, I'm going to ask you some questions soon. But instead of answering the questions, I want you to instead write out instructions for yourself to help you reason through the question and come up with the best answer
step 2: [provide clue question]
step 3: Now follow the instructions you have just written to answer the question.
.... The answer to the question is: (a) Yes; Colonel Mustard was in the observatory with the candlestick
Edit: mixed results for the apple question with this technique
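If you'd rather script this than paste it by hand, it's just two completions where the second prompt embeds the first answer. Rough sketch - complete() is a placeholder for whatever API call you're using, and the prompt wording is paraphrased from the steps above:
```
def self_instruct_answer(question: str, complete) -> str:
    """Two-pass version of the trick above. `complete` is a stand-in for
    whatever function maps a prompt string to the model's completion."""
    # Step 1: ask for instructions, not an answer.
    instructions = complete(
        "I'm going to ask you a question. Instead of answering it, write out "
        "instructions for yourself to help you reason through the question "
        "and come up with the best answer.\n\n"
        f"Question: {question}\n\nInstructions:"
    )
    # Steps 2-3: give back the question plus the self-written instructions.
    return complete(
        f"Question: {question}\n\n"
        f"Instructions:\n{instructions}\n\n"
        "Now follow the instructions you have just written to answer the question.\n\nAnswer:"
    )
```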
I feel like within 6 months the models will have adapted to not need these "clever" tricks. Presumably, if for many cases the trick is to say "Let's think step by step", that's something the model can learn to do on its own without the prompt.
The really interesting thing will be feeding alternative data into these models, whether it's a certain structured corpus, siloed enterprise data, or personal data.
It seems that ChatGPT is incapable of whatever it is we experience as the “ohhhhhh!” eureka moment.
I give it simple riddles that it doesn’t solve. I then point out the obvious answer and it just doubles down like that really stubborn friend I had in high school. It never does the, “ohhhh! Aha! Yes that’s the answer.”
Note that this was originally published in September 2022, before text-davinci-003 was released in November 2022, which lets you do whatever you want with much less effort.
Can you explain more what you mean by “do whatever you want without as much effort”? Is it because text-davinci-003 accepts more tokens for the prompt? Something else?
I was trying to get davinci-003 to convert text to SQL, and it worked with a very simple prompt like "convert this text into SQL". I could get it to work with all their other models too, but they all required a few examples in the prompt.
tl;dr -- LLMs are bad at basic arithmetic and logic (as their opening examples with math word problems show), but they do much better if instead of asking them for the answer, you ask for code to compute the answer. Then evaluate or run the code to get the answer.
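A bare-bones version of that loop looks something like this. complete() is again a placeholder for the LLM call, the prompt wording is made up, and running model-generated code like this needs proper sandboxing in anything beyond a toy:
```
import subprocess
import sys
import tempfile


def answer_via_code(question: str, complete, timeout: float = 10.0) -> str:
    """Ask the model for a program instead of an answer, then run the
    program and read off its output. `complete` is a stand-in for your
    LLM call and is assumed to return plain Python source."""
    code = complete(
        f"Question: {question}\n"
        "Write a short, self-contained Python program that computes the answer "
        "and prints only the final result. Reply with code only."
    )
    # Write the generated program to a temp file and run it in a subprocess.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    result = subprocess.run(
        [sys.executable, path], capture_output=True, text=True, timeout=timeout
    )
    return result.stdout.strip()
```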
It doesn't make sense to be on that page because it's not a technique to make GPT answer a prompt better.
What you are suggesting is an abstraction layer higher. Figuring out what your prompt should do is different from trying to make a prompt more reliable.
Combining these with LLMs does sound quite interesting; I don't know why they haven't been used much.
This is doable, but it introduces a sequential dependency that would make training significantly slower.
See section 3.1.1 here: https://galactica.org/static/paper.pdf
The <work> example quoted above is from that paper.
Training ANNs is still a single-shot exercise.
Paper: https://arxiv.org/abs/2211.10435
GitHub: https://github.com/reasoning-machines/pal
See https://news.ycombinator.com/item?id=34422122 and https://news.ycombinator.com/item?id=34422627