When I went to the APS March Meeting earlier this year, I talked with the editor of a scientific journal and asked whether they were worried about LLM-generated papers. They said their main worry actually wasn't LLM-generated papers, it was LLM-generated reviews.
LLMs are much better at plausibly summarizing content than they are at doing long sequences of reasoning, so they're much better at generating believable reviews than believable papers. Plus reviews are pretty tedious to do, giving an incentive to half-ass it with an LLM. Plus reviews are usually not shared publicly, taking away some of the potential embarrassment.
We already got an LLM-generated meta-review that was very clearly just a summarization of the reviews. There were some pretty egregious cases of borderline hallucinated remarks. This was ACL Rolling Review, basically the most prestigious NLP venue, and the editors told us to suck it up. Very disappointing, and I genuinely worry about the state of science and how this will affect people who rely on scientometric criteria.
This is a problem in general, but the unmitigated disaster that is ARR (ACL Rolling Review) doesn't help.
On the one hand, if you submit to a conference, you are forced to "volunteer" for that cycle. That's a good idea from a "justice" point of view, but it's also a sure way of generating unmotivated reviewers. Not only because a person might be unmotivated in general, but because the (rather short) reviewing period may coincide with your vacation (this happened to many people with EMNLP, whose reviewing period was in the summer), and you're not given any alternative but to "volunteer" and deal with it.
On the other hand, even regular reviewers aren't treated too well. Lately they implemented a minimum max load of 4 (which can push people towards choosing uncomfortable loads; in fact, that seems to be the purpose), and loads aren't even respected (IIRC there have been emails to the tune of "some people set a max load, but we got a lot of submissions, so you may get more submissions than your load, lololol").
While I don't condone using LLMs for reviewing and I would never do such a thing, I am not too surprised that these things happen given that ARR makes the already often thankless job of reviewing even more annoying.
To be honest, lately, I have gotten better quality reviews from the supposedly second-tier conferences that haven't joined ARR (e.g. this year's LREC-COLING) than from ARR. Although sample size is very small, of course.
Most conferences have been flooded with submissions, and ACL is no exception.
A consequence of that is that there aren't enough qualified reviewers available to handle these manuscripts.
Conference organizers might be keen to accept most anyone who offers to volunteer, but clearly there is now a large pool of people who have never done this before and were never taught how to do it. Add some time pressure, and people will try out some tool, just because it exists.
GPT-generated docs have a particular tone that you can detect if you've played a bit with ChatGPT and if you have a feel for language. Such reviews should be kicked out. I would be interested to view this review (anonymized if you like - by taking out bits that reveal too narrowly what it's about).
The "rolling" model of ARR is a pain, though, because instead of slaving for a month you feel like slaving (conducting scientific peer review free of charge = slave labor) all year round.
Last month, I got contacted by a book editor to review a scientific book for $100. I told her I'm not going to read 350 pages to write two pages' worth of book review; to do this properly one would need two days, and I quoted my consulting day rate. On top of that, this email came in the vacation month of August. Of course, said person was never heard from again.
Not defending LLM papers at all, but these people can go to hell. If "scientometrics" was ever a good idea, it for sure isn't anymore now that the measure has been made the target. A longer, carefully written, comprehensive paper is rated worse than many short, incremental, hastily written papers.
Well, given that the only thing that matters for tenure reviews is "service", i.e., roughly a list of conferences the applicant reviewed for or performed some other service at, this is hardly a surprise.
Right now there is no incentive to do a high-quality review unless the reviewer is intrinsically motivated.
LLMs reviewing LLM generated articles via LLM editors is more or less guaranteed to become a massive thing given the incentive structures/survival pressures of everyone involved.
Researchers get massive CVs, reviewers and editors get off easy, admins get to show great output numbers from their institutions, and of course the publishers continue making money hand over fist.
It might follow to say that current LLMs aren't trained to generate papers, BUT they also don't really need to reason.
They just need to mimic the appearance of reason, follow the same pattern of progression. Ingesting enough of what amounts to executed templates will teach it to generate its own results as if output from the same template.
I can see how LLMs could contribute to raising the standard in that field. For example, surveying related research. Also, maybe in the not-too-distant future, reproducing (some of) the results.
Writing consists of iterated re-writing (to me, anyways), i.e. finding better and better ways to express content 1. correctly, 2. clearly and 3. space-economically.
By writing it down (yourself) you understand what claims each piece of related work discussed has made (and can realistically make, as there sometimes are inflationary lists of claims in papers), and this helps you formulate your own claim as it relates to them (new task, novel method for a known task, like an older method but works better, nearly as good as a past method but runs faster, etc.).
If you outsource it to a machine you no longer see it through, and the result will be poor unless you are a very bad writer.
I can, however, see a role for LLMs in an electronic "learn how to write better" tutoring system.
Hmm, there may be a bug in the authors’ Python script that searches Google Scholar for the phrases "as of my last knowledge update" or "I don't have access to real-time data". You can see the code in appendix B.
The bug happens if the ‘bib’ key doesn’t exist in the API response. That leads to the urls array having more rows than the paper_data array, so the columns can become mismatched in the final data frame. It seems they made a third array called flag that could be used to detect and remove the bad results, but it isn’t used anywhere in the posted code.
It’s not clear to me how this would affect their analysis; it does seem like something they would catch when manually reviewing the papers. But perhaps the bibliographic data wasn’t reviewed and was only used to calculate the summary stats etc.
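For illustration, the failure mode could look something like this (a hypothetical reconstruction; the array names follow the comment above, but the responses and surrounding structure are stand-ins, not the authors' actual code):

```python
# Hypothetical reconstruction of the failure mode described above.
# The array names (urls, paper_data, flag) follow the comment; the
# responses themselves are made-up stand-ins.
responses = [
    {"url": "https://example.org/a", "bib": {"title": "Paper A"}},
    {"url": "https://example.org/b"},                    # 'bib' key missing
    {"url": "https://example.org/c", "bib": {"title": "Paper C"}},
]

urls, paper_data, flag = [], [], []
for r in responses:
    urls.append(r["url"])                     # appended unconditionally...
    if "bib" in r:
        paper_data.append(r["bib"]["title"])  # ...but this append is skipped
    else:
        flag.append(r["url"])                 # recorded, yet never consulted

# urls now has 3 rows and paper_data only 2, so pairing them up
# silently shifts every row after the missing one:
misaligned = list(zip(urls, paper_data))
# misaligned[1] pairs example.org/b with "Paper C", the wrong title.

# One possible fix: keep the columns aligned with an explicit placeholder.
aligned = [(r["url"], r.get("bib", {}).get("title")) for r in responses]
```

The point is that nothing crashes: the rows simply shift, which is exactly the kind of error that is invisible in summary stats and only caught by manual review.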
That sounds important enough to contact the authors. Best case, they fixed it up manually; worst case, lots of papers are publicly accused of being made up and the whole farming/fish-focused summary they produced is completely wrong.
Hi there! My name is Kristofer, one of the authors of this research note. I also wrote the script. We were notified via email about this comment. Please see below for our response. Thank you for your interest in our research! (I'm removing the sender's name to respect their privacy)
"""
Dear XXXX,
My name is Kristofer; I’m one of the co-authors of the GPT paper. I also wrote the script for the data collection. Jutta forwarded your email regarding the possible bug.
First of all, let me apologise for the late response. Apparently your email made its way to the spam folder, which of course is regrettable.
I would also like to thank you for reaching out to us. We are pleased to see the interest of the HN community in transparent and reliable research.
We looked at the comment and the concern around the bug. We’d like to point out that the original commenter was right in saying “it does seem like something they would catch when manually reviewing the papers”. We in fact reviewed the output manually and carefully for any potential errors. In other words, we opened and searched for the query string manually, which also helped determine whether the use of LLMs was declared in some form or other. This is of course a sensitive topic and we took great care to be thorough.
Nevertheless, we once more did a manual review of the code and the data, in light of this potential bug, and we’re glad to say no row-column mismatch is present. You can find the data here: https://doi.org/10.7910/DVN/WUVD8X
Please don’t hesitate to reach out if you have any more questions.

All the best,
Kristofer
"""
As a tangent to the paper topic itself: what should be the standard procedure for publishing data-gathering code like this? The authors don't specify which version of any libraries or APIs they used, and since updates occur over time and APIs change, code rot is inevitable. It will eventually be impossible to figure out exactly what this code did.
With meticulous version records it should at least be possible to ascertain what the code did by reconstructing that exact version (assuming stored back versions exist).
In my opinion, archive the data that was actually gathered and the code's intermediate & final outputs. Write the code clearly enough that what it did can be understood by reading it alone, since with pervasive software churn it won't be runnable as-is forever. As a bonus, this approach works even when some steps are manual processes.
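As a minimal sketch of that idea (assuming a Python pipeline; the file name and JSON layout are arbitrary choices of mine), one could snapshot the interpreter and installed package versions alongside the archived data and outputs:

```python
import sys
import json
import importlib.metadata

# Snapshot the interpreter and every installed package version next to
# the archived data, so the run can later be reconstructed even after
# the code itself has rotted.
snapshot = {
    "python": sys.version,
    "packages": {
        (dist.metadata["Name"] or "unknown"): dist.version
        for dist in importlib.metadata.distributions()
    },
}

with open("environment-snapshot.json", "w") as f:
    json.dump(snapshot, f, indent=2)
```

Committing a file like this next to the data costs nothing and answers the "which versions?" question even when the code can no longer be run.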
GPT might make fabricating scientific papers easier, but let's not forget how many humans fabricated scientific research in recent years - they did a great job without AI!
For any who haven't seen/heard, this makes for some entertaining and eye-opening viewing!
https://www.youtube.com/results?search_query=academic+fraud
I think it’s important to remember that while the tidal wave of spam just starting to crest courtesy of the less scrupulous LLM vendors is uh, necessary to address, this century’s war on epistemology was well underway already in the grand traditions of periodic wars on the idea that facts are even aspirationally, directionally worthwhile. The phrase “alternative facts” hit the mainstream in 2016 and the idea that resistance is futile on broad-spectrum digital weaponized bytes was muscular then (that was around the time I was starting to feel ill for being a key architect of it).
Now technology is a human artifact and always ends up resembling its creators or financiers or both: I’d have nice fonts on my computer in 2024 most likely either way, but it’s directly because of Jobs they were available in 1984 to a household budget.
If someone other than Altman had had the insight, or if some insight other than "this thing can lie in a newly scalable way" had been the escape-velocity moment for LLMs, then we'd still have test sets and metrics and just plain science going on in the Commanding Heights of the S&P 500. But these people are a symptom of our apathy around any noble instinct. If we had stuck firm to our values, no effective-altruism cult-leader type would even make the press.
Difficulty and scale matter where it comes to fabrication.
Academia is in large part about barriers, which, while sometimes unpleasant and malfunctioning, nevertheless serve a purpose (unfortunately, it is impossible to evaluate everything fully on a per-case basis, so humans need shortcuts to filter out noise and determine quickly whether something is worth spending attention on). One of those barriers is the form of the paper itself. The fall of this barrier (notably through often unauthorised use of others’ IP) would likely bring about not a sudden idyllic meritocracy but increased noise and/or the strengthening of other barriers.
Sure, but that takes time; AI has the potential to generate "real sounding" papers in under a second. At least the fake papers before were rate-limited.
This kind of fabricated result is not a problem for practitioners in the relevant fields, who can easily distinguish between false and real work.
If there are instances where the ability to make such distinctions is lost, it is most likely because the content lacks novelty, i.e. it simply regurgitates known and established facts. In that case it is a pointless effort, even if it might inflate the supposed author's list of publications.
As to the integrity of researchers, this is a known issue. The temptation to fabricate data existed long before the latest innovations in AI, and is very easy to do in most fields, particularly in medicine or biosciences which constitute the bulk of irreproducible research. Policing this kind of behavior is not altered by GPT or similar.
The bigger problem, however, is when non-experts attempt to become informed and are unable to distinguish between plausible and implausible sources of information. This is already a problem even without AI; consider the debates over the origins of SARS-CoV-2, for example. The solution to this is the cultivation and funding of sources of expertise, e.g. in universities and similar institutions.
Non-experts actually attempting to become informed (instead of just feeling like they're informed) can easily tell the difference too. The people being fooled are the ones who want to be fooled. They're looking for something to support their pre-existing belief. And for those people, they'll always find something they can convince themselves supports their belief, so I don't think it matters what false information is floating around.
It seems to be kind of a new thing for laymen to be reading scientific papers. 20 years ago, they just weren't accessible. You had to physically go to a local university library and work out how to use the arcane search tools, which wouldn't really find what you wanted anyway. And even then, you couldn't take it home and half the time you couldn't even photocopy it because you needed a student ID card to use the photocopier.
For a paper that includes both a broad discussion of the scholarly issues raised by LLMs and wide-ranging policy recommendations, I wish the authors had taken a more nuanced approach to data collection than just searching for “as of my last knowledge update” and/or “I don’t have access to real-time data” and weeding out the false positives manually. LLMs can be used in scholarly writing in many ways that will not be caught with such a coarse sieve. Some are obviously illegitimate, such as having an LLM write an entire paper with fabricated data. But there are other ways that are not so clearly unacceptable.
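For concreteness, the sieve described amounts to little more than a substring match, roughly like the following (a sketch; the function name and the lowercase normalization are my own assumptions, only the two phrases come from the paper):

```python
# A minimal sketch of the coarse sieve discussed above. The phrase list
# comes from the paper's stated queries; everything else is assumed.
TELLTALE_PHRASES = [
    "as of my last knowledge update",
    "i don't have access to real-time data",
]

def looks_llm_generated(text: str) -> bool:
    """Flag a text only if it contains one of the canned phrases verbatim."""
    lowered = text.lower()
    return any(phrase in lowered for phrase in TELLTALE_PHRASES)
```

Any LLM-written passage that was lightly edited, or simply never contained those phrases, sails straight through; only verbatim leftovers are caught, which is exactly the coarseness at issue.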
For example, the authors’ statement that “[GPT’s] undeclared use—beyond proofreading—has potentially far-reaching implications for both science and society” suggests that, for them, using LLMs for “proofreading” is okay. But “proofreading” is understood in various ways. For some people, it would include only correcting spelling and grammatical mistakes. For others, especially for people who are not native speakers of English, it can also include changing the wording and even rewriting entire sentences and paragraphs to make the meaning clearer. To what extent can one use an LLM for such revision without declaring that one has done so?
Last time we discussed this, someone basically searched for phrases such as "certainly, I can do X for you" and assumed that meant GPT was used. HN noticed that many of the accused papers actually predated OpenAI.
> Two main risks arise... First, the abundance of fabricated “studies” seeping into all areas of the research infrastructure... A second risk lies in the increased possibility that convincingly scientific-looking content was in fact deceitfully created with AI tools...
A third risk: ChatGPT has no understanding of "truth" in the sense of facts reported by established, trusted sources. I'm doing a research project related to use of data lakes and tried using ChatGPT to search for original sources. It's a shitshow of fabricated links and pedestrian summaries of marketing materials.
The existence of LLMs makes Google search even more relevant for cross-checking, not less relevant, for deep research. Daniel Dennett said we should have all levels of search available to everyone, i.e. from basic string matching to semantic matching. [0]
[0] https://youtu.be/arEvPIhOLyQ?t=1139
> tried using ChatGPT to search for original sources
That's a bad idea; do not do that. Regardless of the knowledge contained in ChatGPT, it's completely the wrong tool, like using a jackhammer as a screwdriver. If you want original sources, then services like https://perplexity.ai can find them. It's not even an issue with ChatGPT as such; it was never intended for that, which is why OpenAI is also building search: https://openai.com/index/searchgpt-prototype/
It's silly that there's a stigma attached to AI-generated images in cases where using them is perfectly reasonable. People seem to appreciate things more for the fact that they were created by spending time out of another human's life than for what the thing actually is.
It would be silly if they were indistinguishable from human-created images, but they aren't: they exhibit the typical AI artifacts and weirdness, and thereby signal a lack of care.
> create a picture of scrabble pieces strewn on a table, with a closeup of a line of scrabble letters spelling "CHATGPT" on top of them. photographic, realistic quality, maintain realism and believability
The number markings on the Scrabble pieces are nonsensical, the wooden ground looks like plastic, there are strange artifacts like the white smudge on the edge of the “E” tile in the front, and so on.
AI-generated images are clearly identifiable as such, and it just gets annoying to continually see those desultory fabrications.
I see "dogfooding" has now been taken to its natural conclusion.
If you can please publish it and maybe post here on HN or reddit.
It's a rather broken system.
""" Dear XXXX,
MY name is Kristofer, I’m one of the co-authors for the GPT paper. I also wrote the script for the data collection. Jutta forwarded your email regarding the possible bug.
First of all, let me apologise for the late response. Apparently your email made its way to the spam folder, which of course is regrettable. I would also like to thank you for reaching out to us. We are pleased to see the interest of the HN community in transparent and reliable research.
We looked at the comment and the concern around the bug. We’d like to point out that the original commenter was right in saying “it does seem like something they would catch when manually reviewing the papers”. We in fact reviewed the output manually and carefully for any potential errors. In other words, we opened and searched for the query string manually, which also helped determine whether the use of LLMs was declared in some form or other. This is of course a sensitive topic and we took great care to be thorough.
Nevertheless, we once more did a manual review of the code and the data, in light of this potential bug, and we’re glad to say no row-column mismatch is present. You can find the data here: https://doi.org/10.7910/DVN/WUVD8X
Please don’t hesitate if you have any more questions.
All the best, Kristofer """
Contact info for the first author
Now this sounds like a story worth hearing!
Hope this research is better.
This feels like an evolutionary dead end.
> create a picture of scrabble pieces strewn on a table, with a closeup of a line of scrabble letters spelling "CHATGPT" on top of them. photographic, realistic quality, maintain realism and believability
https://ideogram.ai/assets/image/lossless/response/vF81gKjHS...
https://ideogram.ai/assets/image/lossless/response/EcRpDLumS...
Almost the same prompt.