Great analysis, and props to these students for taking the time to challenge such a sensational headline. In the conclusion they mention my biggest problem with the paper, which is that GPT-4 appears to grade the answers as well (see section 2.6, "Automatic Grading").
In a way it makes perfect sense that GPT-4 can score 100% on a test that GPT-4 also grades. To be clear, the grading GPT-4 has the answers, so it does have more information, but it still might overlook important subtleties in how the real answer differs from the generated answer, due to its own failure to understand the material.
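For anyone who hasn't looked at section 2.6: the grading step boils down to showing GPT-4 the question, the reference answer, and the generated answer, and asking it for a verdict. A minimal sketch of that pattern, purely for illustration (ask_gpt4 is a hypothetical stand-in for whatever API client you use, and the prompt wording is mine, not the paper's):

    # Sketch of "GPT-4 grades GPT-4", the setup being criticised here.
    def ask_gpt4(prompt: str) -> str:
        """Hypothetical stand-in for an actual GPT-4 API call."""
        raise NotImplementedError("wire this up to your LLM client")

    def grade_with_gpt4(question: str, reference_answer: str, generated_answer: str) -> int:
        prompt = (
            "You are grading an exam answer.\n"
            f"Question: {question}\n"
            f"Reference answer: {reference_answer}\n"
            f"Student answer: {generated_answer}\n"
            "Reply with only an integer score from 0 to 5."
        )
        # The worry: if the grader misunderstands the material, it can mark a
        # subtly wrong generated answer as equivalent to the reference answer,
        # so grader errors and solver errors are correlated.
        return int(ask_gpt4(prompt).strip())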
> In a way it makes perfect sense that GPT-4 can score 100% on a test that GPT-4 also grades.
Even this is overstating it, because for each question, GPT-4 is considered to get it "correct" if, across the (18?) trials with various prompts, it ever produces one single answer that GPT-4 then, for whatever reason, accepts. That's not getting "100%" on a test.
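To put a rough number on how forgiving that criterion is: if a single attempt has even a modest chance of producing something the grader accepts, then "at least one acceptable answer across many prompt variants" is nearly guaranteed. The 20% figure below is invented purely for illustration:

    # Chance that at least one of N independent attempts gets accepted,
    # assuming (for illustration only) a 20% acceptance rate per attempt.
    p_single = 0.20
    n_trials = 18
    p_any = 1 - (1 - p_single) ** n_trials
    print(f"{p_any:.3f}")  # ~0.982, i.e. near-certain "success" per question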
If people haven't seen it: UT Prof Scott Aaronson had GPT-4 take his Intro Quantum final exam and had his TA grade it. It made some mistakes, but did surprisingly well, earning a "B". He even had it argue for a better grade on a problem it did poorly on.
Of course this was back in April, when you could still get the pure, unadulterated GPT-4 and they hadn't cut it down with baby laxative for the noobs.

https://scottaaronson.blog/?p=7209
It literally did not change. Not one bit. Please, if you're reading this, speak up when people say this. It's a fundamental misunderstanding; there's so much chatter around AI, not much information, and the signal-to-noise ratio is getting worse.
This "GPT-4 evaluating LLMs" problem is not limited to this case. I don't know exactly why, but everyone seems to have accepted using GPT-4 to evaluate other LLMs' outputs. GPT-4 is being treated more and more like "ground truth" with each passing day.
Couple this with the reliance on crowd-sourcing to create evaluation datasets, and the heavy use of GPT-3.5 and GPT-4 by MTurk workers, and you have a big fat feed-forward process benefiting only one party: OpenAI.
The Internet we know is dead; this is a fact. I think OpenAI knew exactly how this would play out. Reddit, Twitter, and the like are only waking up now, to find that they're basically powerless against this wave of distorted future standards.
Once it has been sufficiently proven to pass every existing test on Earth, every institution will be so reliant on producing work with GPT that we won't have a "100% handmade" exam anymore. No problem will be left for GPT to tackle.
>> I don't know exactly why, but everyone seems to have accepted using GPT-4 to evaluate other LLMs' outputs. GPT-4 is being treated more and more like "ground truth" with each passing day.
Why? Because machine learning is not a scientific field. That means anyone can say and do whatever they like and there's no way to tell them that what they're doing is wrong. At this point, machine learning research is like the social sciences: a house of cards, unfalsifiable and unreproducible research built on top of other unfalsifiable and unreproducible research. People simply choose whatever approach they like, cite whatever result they like, because they like the result, not because there's any reason to trust it.
Let me not bitch again about the complete lack of anything like objective measures of success in language modelling in particular. There have been no good metrics, no meaningful benchmarks, for many decades now in NLP as a whole, and in language generation even more so. This is taught to students in NLP courses (our tutors discussed it in my MSc course), there is scholarship on it, and there is a constant chorus of "we have no idea what we're doing", but nothing changes. It's too much hard work to try to find good metrics and build good benchmarks. It's much easier to put a paper on arXiv that shows SOTA results (0.01 better than the best system it's compared against!). And so the house of cards rises ever towards the sky.
Here's a recent paper that points out the sorry state of Natural Language Understanding (NLU) benchmarking:
What Will it Take to Fix Benchmarking in Natural Language Understanding?

https://aclanthology.org/2021.naacl-main.385/
There are many more, going back years. There are studies of how top-notch performance on NLU benchmarks is reduced to dust when the statistical regularities that models learn to overfit to in test datasets are removed. Nobody. fucking. cares. You can take your science and go home, we're making billion$$$ here!
Yes, I have similar concerns. These models regurgitate previously seen strings, previous benchmarks included. When you try to evaluate their sheer ability to reason on the text, however, they perform poorly.
(Our experiments with GPT-3 are here: https://doi.org/10.5220/0012007500003470)
> but it still might overlook important subtleties
If there's one thing we can be certain of, it's that LLMs often overlook important subtleties.
Can't believe they used GPT4 to also evaluate the results. I mean, we wouldn't trust a student to grade their own exam even when given the right answers to grade with.
I noticed that when I read the paper. I know it's hard to scale, but I'd want to see competent TAs doing the grading. I also found the distribution of courses a bit odd. Some of it might just be individual samples, but intro courses I'd expect to be pretty cookie-cutter (for GPT) were fairly far down the list, and courses I'd expect to be really challenging had relatively good results.
Can attest, from the test set that we sampled, that the distribution is odd.
We've already run the zero-shot GPT model on all of the data points in the provided test set. We're now going through the process of grading them manually (our whole fraternity is chipping in!) and should have the results out relatively soon.
I can say that, so far, it's not looking good for that 90% correct zero-shot claim either.
Even research from OpenAI has attempted to use GPT-4 as quasi-ground truth (as a replacement for human evaluators). For example, their method in the recent paper "Language models can explain neurons in language models" [1] is:
1. Using GPT-4, generate a text explanation of a neuron's activations on sample input.
2. Using GPT-4 again, use the text explanation to simulate the neuron on some new text input.
3. Compare the result to the actual neuron's activations on the new text input.
They justify this by saying human contractors do equally poorly at coming up with text descriptions. However, the procedure is such a black box that it is difficult to draw scientific conclusions from the results.

[1] https://openai.com/research/language-models-can-explain-neur...
> Even research from OpenAI has attempted to use GPT-4 as quasi-ground truth (as a replacement for human evaluators).
The way OpenAI used GPT-4 is fundamentally different from how GPT-4 was used to score the answers to the MIT exam. In OpenAI's case, they had GPT-4 generate an explanation of when a neuron in GPT-2 would fire. They then gave that explanation back to GPT-4 and had GPT-4 predict when that specific GPT-2 neuron would fire. The scoring was done by computing the correlation between when GPT-4 predicted the neuron would fire and when it actually fired. The scoring was not done by GPT-4, as it was for the MIT exam.
In addition, OpenAI did have human evaluators score the explanations as well, to make sure they were human-interpretable. [0]

[0] https://openaipublic.blob.core.windows.net/neuron-explainer/...
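For what it's worth, only steps 1 and 2 involve GPT-4; the score itself is an ordinary correlation between two activation sequences. A rough sketch of the shape of that pipeline (function names and stubs are mine, not OpenAI's):

    import numpy as np

    def explain_neuron(activation_examples):
        """Step 1 (hypothetical stub): ask GPT-4 for a short text description
        of what the neuron seems to respond to."""
        raise NotImplementedError

    def simulate_neuron(explanation, tokens):
        """Step 2 (hypothetical stub): ask GPT-4 to predict per-token activations
        from the text explanation alone."""
        raise NotImplementedError

    def explanation_score(explanation, tokens, true_activations):
        """Step 3: correlate simulated activations with the real ones."""
        simulated = np.asarray(simulate_neuron(explanation, tokens), dtype=float)
        actual = np.asarray(true_activations, dtype=float)
        return np.corrcoef(simulated, actual)[0, 1]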
Indeed, the OpenAI paper is on firmer footing, since the "ground truth" was not generated by GPT-4.
However, both papers rely on black boxes instead of well-understood procedures. This places them on weaker scientific footing. For example, a poor explanation or simulation of a neuron's behavior may simply be a limitation of GPT-4 rather than evidence about the neuron itself. Instead, a scientist would want to "prove" some form of unexplainability.
To do this, a researcher would not use a human at all to explain neuronal behavior. Instead, a simple, repeatable algorithm such as topic modeling would be applied. This would lead to significantly stronger scientific conclusions about the neurons: if the simple method fails, that failure shows the neuron cannot be explained in that specific, well-defined sense.
An interesting follow-up to the OpenAI paper might be to quantify how much "more powerful" its textual descriptions are than simpler, well-understood techniques such as topic modeling. That could at least reinforce its use.
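To make that baseline concrete, here is one way it could look (a sketch only: the "top-activating snippets" input and the single-topic choice are my simplifications, not anything from the paper):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    def topic_model_description(top_activating_snippets, n_words=10):
        """Describe a neuron by the dominant topic of the text that activates it most.
        A simple, repeatable alternative to a GPT-4-written explanation."""
        vectorizer = CountVectorizer(stop_words="english")
        counts = vectorizer.fit_transform(top_activating_snippets)
        lda = LatentDirichletAllocation(n_components=1, random_state=0).fit(counts)
        vocab = vectorizer.get_feature_names_out()
        top_idx = lda.components_[0].argsort()[::-1][:n_words]
        return [vocab[i] for i in top_idx]

Whether such word lists predict activations as well as GPT-4's free-text explanations is exactly the comparison being suggested.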
I don't understand your objection. Step (3) is the one that actually assesses how well the proposed description works, and that is a comparison with the 'real' ground truth.
This is one of the most embarrassing reviews I have ever read (embarrassing for the paper being reviewed, that is). AI research urgently needs good practices to adhere to, but the current state is that it is really hard to take many of the results seriously, given the opaqueness that characterises so many steps of the process. Serious mistakes and bad practices like these certainly do not help the field achieve any credibility.
> Several of the authors listed on the discussed paper are undergraduate researchers. Consequently, we believe it's inappropriate to hold these individuals accountable for any lapses present in the work.
> Instead, we believe the responsibility should lie with the supervising authors. They are the ones who are expected to ensure that the work meets the rigorous standards of public scholarship within their field.
I've gone from working with 500-1k lines of code at a time in GPT-4, to not bothering beyond one or two small functions because it gets confused so easily now.
Seems they're enshittifying already; I hope a competitive model is released soon.
It's not that it didn't do an OK job; it's more that you couldn't fully rely on what it had produced, nor rely on it having corrected the list without first having it reanalyse the "corrected" list.
It's still extremely helpful; I just found it strange that, for what seemed like a simple task, something that has been fed millions of documents would still give some incorrect results, especially AFTER it had analysed its own results and found some of the noun articles to be incorrect.
Curious, could you share your prompt? I just tried asking GPT4 (paid) to create a list of German nouns with der/die/das and it managed to do it correctly.
Does it work if you ask in German? I found it works better if you tell it via the system prompt that it's a language professor (using your target language) than if you just use English for tasks involving a foreign language. The power of the LARP.
(I use a normal machine-translation API for a lot of this, but you can also ask it in another context window to translate the text to other languages. I use this approach for e.g. Sindarin.)
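In case anyone wants to try reproducing this: the "language professor" framing is just a system message written in the target language. A minimal sketch (chat() is a hypothetical stand-in for your API client, and the German wording is mine):

    def chat(messages):
        """Hypothetical stand-in for a chat-completion API call; returns the reply text."""
        raise NotImplementedError

    messages = [
        # System prompt in the target language, roughly: "You are a professor of
        # German linguistics. Answer precisely and correct every grammatical mistake."
        {"role": "system",
         "content": "Du bist Professor für deutsche Sprachwissenschaft. "
                    "Antworte präzise und korrigiere jeden Grammatikfehler."},
        {"role": "user",
         "content": 'Erstelle eine Liste mit 10 Nomen, die den Artikel "der" haben.'},
    ]
    print(chat(messages))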
> I would have thought this would be a super easy task for it
Why did you think that? This isn't meant to be critical, but I'm honestly curious, what led you to believe that technology underlying GPT-4 made it a good fit for this or any particular task?
It has probably seen the correct nouns used millions of times in the training data, and asking it to produce the correct articles for a bunch of nouns is really just "tell me which case you saw most during training", which is something LLMs are really good at.
It is a purely statistical model. It does not know any "rules" about the language (it doesn't know any language at all); it is fed data and derives from that data sophisticated probabilistic relationships between words.
It shouldn't have much of a problem generating the correct grammatical forms, as it has been extensively trained on them. More so than any other technology, neural networks are suited to exactly this kind of task, where hard rules don't exist (as a German I couldn't tell you why "rain" is masculine but "machine" is feminine) but lots of data correctly embodies the rule.
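That "which article did it see most often with this noun" framing can be made literal. A toy sketch of the frequency baseline being described (the miniature corpus is obviously made up):

    from collections import Counter, defaultdict

    # Toy corpus of (article, noun) pairs as they might occur in scraped text;
    # note the one noisy "der Haus".
    corpus = [("der", "Regen"), ("die", "Maschine"), ("das", "Haus"),
              ("der", "Regen"), ("das", "Haus"), ("die", "Maschine"), ("der", "Haus")]

    counts = defaultdict(Counter)
    for article, noun in corpus:
        counts[noun][article] += 1

    def most_seen_article(noun):
        """Pick whichever article co-occurred with the noun most often."""
        return counts[noun].most_common(1)[0][0]

    print(most_seen_article("Haus"))  # "das": the majority wins despite the noisy pair

The point being that a model does not need to "know" the rule for this to usually come out right; majority statistics are enough.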
I think this task is beyond what GPT-4 can handle; it's simply asking too much of it. For other languages I'm sure it has no problems.
In Greek it is also really bad (and makes rather obvious mistakes); in French it seems much better but makes some very obvious mistakes too. To make it interesting, I emphasise that I want nouns that refer to objects only (otherwise it just spits out profession names and stuff like that, which is not interesting).
Also, tbh, with all the hype around LLMs, one would think that such a task would not be such a challenge.
With LLMs hallucinating and being generally unreliable, I'm coming to the conclusion that LLMs should only be used to transform natural language into structured data (specific to your application), and that knowledge about the world should be stored somewhere else (some sort of vector database? tools?). Fortunately, smaller models are already quite good at the former! Trying to extract world knowledge from LLMs can be a dead end... I don't see how the hallucination problem can be 100% fixed for LLMs themselves.
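To make that division of labour concrete: let the LLM only map free text onto a fixed schema, and answer from a store you control. A sketch under those assumptions (extract_query is a hypothetical LLM helper, and the dict stands in for a real database or vector store):

    import json

    # Stands in for a real database / vector store; values here are example data.
    KNOWLEDGE_BASE = {("Berlin", "population"): "about 3.8 million (example value)"}

    def extract_query(text: str) -> str:
        """Hypothetical LLM call: turn a question into JSON like
        {"entity": "Berlin", "attribute": "population"}."""
        raise NotImplementedError

    def answer(text: str):
        q = json.loads(extract_query(text))                       # LLM does only the parsing
        return KNOWLEDGE_BASE.get((q["entity"], q["attribute"]))  # facts come from the store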
Strange take: do you consider world knowledge produced by humans to be anywhere near 100% accurate?
Suppose you had an oracle that when asked a question gives you 5 answers, 1 of which is true. Or even 1 of which is true only 50% of the time. You would still generally be a fool to throw away that oracle.
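The arithmetic behind that intuition: an unreliable oracle is worth keeping whenever checking its candidates is cheaper than solving from scratch. The costs below are invented for illustration; the caveat is that this only holds when verifying an answer really is easier than producing one.

    # Invented effort costs, in arbitrary units.
    cost_solve_yourself = 20.0   # working the problem out from scratch
    cost_check_candidate = 1.0   # verifying one proposed answer
    candidates = 5               # the oracle offers 5 answers, one of them true

    worst_case_with_oracle = candidates * cost_check_candidate
    print(worst_case_with_oracle < cost_solve_yourself)  # True: keep the oracle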
I do not trust LLMs. They may very often be right, and may very often be useful, but I do not trust them. If you require correct results, their output should be checked somehow.
Neil posted the paper in our fraternity's ML group chat (MIT things lol), and I expressed some skepticism at the results.
Initially we started looking into it more for curiosity's sake, but as we started digging we kept finding more and more ridiculous stuff, to the point where we decided to start working on a blog post. Then David joined in on the action and helped a ton with the research and writeup.
No professor was involved. The paper was released yesterday, so we just documented things as we went along in the investigation. It only took about 8 hours of work to compile the doc. We finished it last night and posted it to Twitter in the morning.
I'm not sure what to make of this post. There is always a degree of uncertainty in experimental design, and it's not surprising that there are a couple of buggy questions. ImageNet (one of the most famous CV datasets) is at this point known to have many such buggy labels. What is surprising is the hearsay that plays out on social media, blowing the results out of proportion and leading to opinion pieces like these targeting the authors instead.
Most of the damning claims in the conclusion section (obligatory: I haven't read the paper entirely, just skimmed it) are the kind of thing that usually gets ironed out in the final pre-deadline pass by the advisors anyway. I'm assuming this is a draft posted on arXiv ahead of the EMNLP deadline this coming Friday, so the paper hasn't even gone through the peer review process yet.
ImageNet has five orders of magnitude more answers, which I would assume makes quality assurance a completely different category of problem.
The authors could probably have carefully reviewed all ~300 of their questions. If they couldn't, they could have just reduced their sample size to, say, 50.
I admit that ImageNet isn't the best analogy here. But I'm pretty confident that this data-cleaning issue would have been caught in peer review. The biggest issue, which I still don't understand, was the removal of the test set. That was bad practice on the authors' part.
Has academia established clear standards on what is competent work and what is not? It occurs to me that while a small subset of papers stand out, many papers struggle to conform to basic practices like publishing runnable code and data, keeping them up to date with the latest libraries and models for at least a year after publication, and demonstrating performance in varied, potentially subjective ways, rather than picking a random benchmark, showing a 0.01% accuracy improvement, and calling it scientific.
Is it just me, or is it the case that for the majority of papers, the effort required to understand and get value out of the paper is much higher than the effort the authors and reviewers put in to publish it?
> you have a big fat feed-forward process benefiting only one party: OpenAI.

If OpenAI ceased to be (for some legislative reason, say), would the problems go away?
Recently I found that GPT-4 can't even reliably create a list of German nouns with a given article (der / die / das).
It will mess up a simple list; if you ask it to analyse it, it'll be able to tell you that it's wrong.
Then you get it to correct the list and it may still be wrong.
It can take several iterations to make the list correct. I would have thought this would be a super easy task for it, but apparently not.
> Erstelle eine Liste mit 10 Nomen, die den Artikel "der" haben. ["Create a list of 10 nouns that have the article 'der'."]
Maybe "reliably" is doing a lot of the heavy lifting?
I actually didn't realise it was giving me incorrect info until my gf started looking at it!
(I was trying to use it to help me learn German.)
Otherwise my current strategy is to put it in an analysis loop until it deems the list to be correct.
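For what it's worth, that analyse-and-retry strategy is easy to wrap in a loop, with a hard cap since the critic is the same unreliable model. A sketch (ask_gpt4 is a hypothetical helper, and the prompts are mine):

    def ask_gpt4(prompt: str) -> str:
        """Hypothetical stand-in for the actual API call; returns the reply text."""
        raise NotImplementedError

    def noun_list_with_retries(request: str, max_rounds: int = 5) -> str:
        answer = ask_gpt4(request)
        for _ in range(max_rounds):
            verdict = ask_gpt4("Check every article in this list. Reply OK if all are "
                               "correct, otherwise list the errors:\n" + answer)
            if verdict.strip().upper().startswith("OK"):
                return answer
            answer = ask_gpt4("Fix these errors and output only the corrected list:\n"
                              + verdict + "\n\n" + answer)
        return answer  # may still be wrong: the critic is GPT-4 itself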
> as a German I couldn't tell you why "rain" is masculine but "machine" is feminine

https://faculty.georgetown.edu/jod/texts/twain.german.html
I remember the very late nights at Burton Conner; maybe students will get more sleep now :)