I think the check-and-validate is a different sort of scratchpad, but maybe not. Seems like at least 3 types: some for pulling implicit info out of the network (viz. wic), some for intermediary steps (viz. coding), and some for verification, like here.
- This isn't GPT-3, it's the recently-released open-source and open-weights model from EleutherAI, GPT-NeoX-20B. GPT-3 is much larger (175 billion parameters vs NeoX's 20 billion).
- It's well-known that language models don't tend to be good at math by default (Gwern, among others, pointed this out back in June 2020). It seems likely that this is at least in part because of how these models currently tokenize their input (they don't represent numbers by their individual digits, but by tokens representing commonly-occurring character sequences): https://www.gwern.net/GPT-3#bpes . Someone also pointed me to this paper which looks at number representations (though it uses somewhat older models like BERT): https://arxiv.org/abs/1909.07940
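To make the tokenization point concrete, here's a toy illustration of why BPE-style tokenizers don't split numbers into individual digits. The vocabulary and the greedy longest-match segmentation below are hypothetical simplifications, not the actual GPT-NeoX vocabulary or merge rules:

```python
# Toy illustration: numbers get split into multi-digit chunks, not digits.
# TOY_VOCAB is hypothetical, NOT the real GPT-NeoX vocabulary.
TOY_VOCAB = {"285", "31", "80", "65", "23", "010", "25", "15",
             "0", "1", "2", "3", "4", "5", "6", "7", "8", "9"}

def greedy_tokenize(s, vocab, max_len=3):
    """Greedy longest-match segmentation (a simplification of real BPE merges)."""
    tokens = []
    i = 0
    while i < len(s):
        # Try the longest piece first, fall back to shorter ones.
        for length in range(min(max_len, len(s) - i), 0, -1):
            piece = s[i:i + length]
            if piece in vocab:
                tokens.append(piece)
                i += length
                break
    return tokens

print(greedy_tokenize("28531", TOY_VOCAB))      # ['285', '31']
print(greedy_tokenize("230102515", TOY_VOCAB))  # ['23', '010', '25', '15']
```

So from the model's point of view, "28531" is two opaque symbols rather than five digits, which plausibly makes learning digit-wise arithmetic harder.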
- Despite the tokenization, it performs (IMO) surprisingly well at getting close to the true value, particularly for the start and end digits and the overall magnitude. You can see this by looking at the tokenization (indicated by brackets) of its guess vs the correct answer for 28531*8065 (I asked multiple times to get an idea of how consistent it is – it's not deterministic because I ran this with temperature = 0.1, which samples randomly among the most likely tokens):
[What][ is][ 285][31][ *][ 80][65][?][\n][22][77][05][315]
Correct: [\n][23][010][25][15]
[What][ is][ 285][31][ *][ 80][65][?][\n][22][95][01][115]
Correct: [\n][23][010][25][15]
[What][ is][ 285][31][ *][ 80][65][?][\n][22][38][95][015]
Correct: [\n][23][010][25][15]
[What][ is][ 285][31][ *][ 80][65][?][\n][22][99][25][015]
Correct: [\n][23][010][25][15]
[What][ is][ 285][31][ *][ 80][65][?][\n][22][99][17][115]
Correct: [\n][23][010][25][15]
You can see that it manages to find things that are numerically close, even when no individual token is actually correct. And it compensates for different-length tokens, always picking tokens that end up with the correct total number of digits.

- Please don't use this as a calculator :) The goal in doing this was to figure out what it knows about arithmetic and see if I can understand what algorithms it might have invented for doing arithmetic, not to show that it's good or bad at math (we have calculators for that, they work fine).
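The closeness of the guesses above is easy to quantify. Taking the five sampled answers and comparing them to the true product:

```python
# The model's five sampled answers for 28531 * 8065, from the transcripts above.
guesses = [227705315, 229501115, 223895015, 229925015, 229917115]
true_value = 28531 * 8065  # 230102515

for g in guesses:
    rel_err = abs(g - true_value) / true_value
    print(f"{g}: off by {rel_err:.2%}")
```

Every guess is within about 3% of the correct answer, and every one has the correct number of digits (9), even though the digits themselves are mostly wrong.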
I think you can mitigate the search issue a bit if you have the prompt double-check itself after the fact (e.g. https://towardsdatascience.com/1-1-3-wait-no-1-1-2-how-to-ha...). It works differently depending on the size of the model, though.
- Using few-shot examples of similar length to the targets (e.g. 10 digit math, use 10 digit few shots)
- Chunking numbers with commas
- Having it double check itself
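The comma-chunking trick from the list above is just a matter of formatting the numbers in the prompt so the tokenizer is more likely to split on digit-group boundaries. A minimal sketch (the prompt template here is hypothetical, not from the original experiment):

```python
# "Chunking numbers with commas": format operands with thousands separators
# so digit groups line up with likely token boundaries.
def make_prompt(a, b):
    # Hypothetical prompt template for illustration only.
    return f"What is {a:,} * {b:,}?"

print(make_prompt(28531, 8065))  # What is 28,531 * 8,065?
```

The idea is that "28,531" tends to tokenize into more regular 3-digit-or-less chunks than "28531" does.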
and here it's not doing any of those things.

Sometimes I wonder if they want secrecy so that criminals don't know they're being investigated, and commit a crime that's easier to prosecute. If they just stop committing crimes, then there's no fancy press release saying how great the DA is or whatever.