When I went to the APS March Meeting earlier this year, I talked with the editor of a scientific journal and asked whether they were worried about LLM-generated papers. They said their main worry actually wasn't LLM-generated papers, it was LLM-generated reviews.
LLMs are much better at plausibly summarizing content than they are at doing long sequences of reasoning, so they're much better at generating believable reviews than believable papers. Plus reviews are pretty tedious to do, giving an incentive to half-ass it with an LLM. Plus reviews are usually not shared publicly, taking away some of the potential embarrassment.
We already got an LLM-generated meta-review that was very clearly just a summarization of the reviews. There were some pretty egregious cases of borderline hallucinated remarks. This was ACL Rolling Review, basically the most prestigious NLP venue, and the editors told us to suck it up. Very disappointing, and I genuinely worry about the state of science and how this will affect people who rely on scientometric criteria.
This is a problem in general, but the unmitigated disaster that is ARR (ACL Rolling Review) doesn't help.
On the one hand, if you submit to a conference, you are forced to "volunteer" for that cycle. That's a good idea from a "justice" point of view, but it's also a sure way of generating unmotivated reviewers. Not only because a person might be unmotivated in general, but because the (rather short) reviewing period may coincide with your vacation (this happened to many people with EMNLP, whose reviewing period was in the summer), and you're not given any alternative but to "volunteer" and deal with it.
On the other hand, even regular reviewers aren't treated too well. Lately they implemented a minimum max load of 4 (which can push people towards choosing uncomfortable loads; in fact, that seems to be the purpose), and loads aren't even respected (IIRC there have been emails to the tune of "some people set a max load, but we got a lot of submissions, so you may get more submissions than your load, lololol").
While I don't condone using LLMs for reviewing and I would never do such a thing, I am not too surprised that these things happen given that ARR makes the already often thankless job of reviewing even more annoying.
To be honest, lately, I have gotten better quality reviews from the supposedly second-tier conferences that haven't joined ARR (e.g. this year's LREC-COLING) than from ARR. Although sample size is very small, of course.
Most conferences have been flooded with submissions, and ACL is no exception.
A consequence of that is that there aren't enough qualified reviewers available to handle these manuscripts.
Conference organizers might be keen to accept most anyone who offers to volunteer, but clearly there is now a large pool of people who have never done this before and were never taught how to do it. Add some time pressure, and people will try out some tool, just because it exists.
GPT-generated docs have a particular tone that you can detect if you've played a bit with ChatGPT and if you have a feel for language. Such reviews should be kicked out. I would be interested to view this review (anonymized if you like - by taking out bits that reveal too narrowly what it's about).
The "rolling" model of ARR is a pain, though, because instead of slaving for a month you feel like slaving (conducting scientific peer review free of charge = slave labor) all year round.
Last month, I got contacted by a book editor to review a scientific book for $100. I told her I'm not going to read 350 pages to write two pages' worth of book review; to do this properly one would need two days, and I quoted my consulting day rate. On top of that, this email came in the vacation month of August. Of course, said person was never heard from again.
Not defending LLM papers at all, but these people can go to hell. If "scientometrics" was ever a good idea, it for sure isn't anymore now that the measure has been made the target. A longer, carefully written, comprehensive paper is rated worse than many short, incremental, hastily written papers.
Well, given that the only thing that matters for tenure reviews is "service", i.e., roughly a list of conferences the applicant reviewed for or performed some other service at, this is hardly a surprise.
Right now there is no incentive to do a high-quality review unless the reviewer is intrinsically motivated.
LLMs reviewing LLM generated articles via LLM editors is more or less guaranteed to become a massive thing given the incentive structures/survival pressures of everyone involved.
Researchers get massive CVs, reviewers and editors get off easy, admins get to show great output numbers from their institutions, and of course the publishers continue making money hand over fist.
It might follow to say that current LLMs aren't trained to generate papers, BUT they also don't really need to reason.
They just need to mimic the appearance of reason, follow the same pattern of progression. Ingesting enough of what amounts to executed templates will teach it to generate its own results as if output from the same template.
I can see how LLMs could contribute to raising the standard in that field. For example, surveying related research. Also, maybe in the not-too-distant future, reproducing (some of) the results.
Writing consists of iterated re-writing (to me, anyways), i.e. finding better and better ways to express content 1. correctly, 2. clearly and 3. space-economically.
By writing it down (yourself) you understand what claims each piece of related work discussed has made (and can realistically make, as there sometimes are inflationary lists of claims in papers), and this helps you formulate your own claim as it relates to them (new task, novel method for a known task, like an older method but works better, nearly as good as a past method but runs faster, etc.).
If you outsource it to a machine you no longer see it through, and the result will be poor unless you are a very bad writer.
I can, however, see a role for LLMs in an electronic "learn how to write better" tutoring system.
Hmm, there may be a bug in the authors’ Python script that searches Google Scholar for the phrases "as of my last knowledge update" or "I don't have access to real-time data". You can see the code in appendix B.
The bug happens if the ‘bib’ key doesn’t exist in the API response. That leads to the urls array having more rows than the paper_data array, so the columns can become mismatched in the final data frame. It seems they made a third array called flag that could be used to detect and remove the bad results, but it isn’t used anywhere in the posted code.
It’s not clear to me how this would affect their analysis; it does seem like something they would catch when manually reviewing the papers. But perhaps the bibliographic data wasn’t reviewed and was only used to calculate the summary stats etc.
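For illustration, the failure mode could look something like this (a hypothetical reconstruction; the array names follow the comment above, but the responses and surrounding structure are stand-ins, not the authors' actual code):

```python
# Hypothetical reconstruction of the failure mode described above.
# The array names (urls, paper_data, flag) follow the comment; the
# responses themselves are made-up stand-ins.
responses = [
    {"url": "https://example.org/a", "bib": {"title": "Paper A"}},
    {"url": "https://example.org/b"},                    # 'bib' key missing
    {"url": "https://example.org/c", "bib": {"title": "Paper C"}},
]

urls, paper_data, flag = [], [], []
for r in responses:
    urls.append(r["url"])                     # appended unconditionally...
    if "bib" in r:
        paper_data.append(r["bib"]["title"])  # ...but this append is skipped
    else:
        flag.append(r["url"])                 # recorded, yet never consulted

# urls now has 3 rows and paper_data only 2, so pairing them up
# silently shifts every row after the missing one:
misaligned = list(zip(urls, paper_data))
# misaligned[1] pairs example.org/b with "Paper C", the wrong title.

# One possible fix: keep the columns aligned with an explicit placeholder.
aligned = [(r["url"], r.get("bib", {}).get("title")) for r in responses]
```

The point is that nothing crashes: the rows simply shift, which is exactly the kind of error that is invisible in summary stats and only caught by manual review.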
That sounds important enough to contact the authors. Best case, they fixed it up manually; worst case, lots of papers are publicly accused of being made up and the whole farming/fish-focused summary they produced is completely wrong.
Hi there! My name is Kristofer, one of the authors of this research note. I also wrote the script. We were notified via email about this comment. Please see below for our response. Thank you for your interest in our research! (I'm removing the sender's name to respect their privacy)
"""
Dear XXXX,
My name is Kristofer; I’m one of the co-authors of the GPT paper. I also wrote the script for the data collection. Jutta forwarded your email regarding the possible bug.
First of all, let me apologise for the late response. Apparently your email made its way to the spam folder, which of course is regrettable.
I would also like to thank you for reaching out to us. We are pleased to see the interest of the HN community in transparent and reliable research.
We looked at the comment and the concern around the bug. We’d like to point out that the original commenter was right in saying “it does seem like something they would catch when manually reviewing the papers”. We in fact reviewed the output manually and carefully for any potential errors. In other words, we opened and searched for the query string manually, which also helped determine whether the use of LLMs was declared in some form or other. This is of course a sensitive topic and we took great care to be thorough.
Nevertheless, we once more did a manual review of the code and the data, in light of this potential bug, and we’re glad to say no row-column mismatch is present. You can find the data here: https://doi.org/10.7910/DVN/WUVD8X
Please don’t hesitate to reach out if you have any more questions.

All the best,
Kristofer
"""
As a tangent to the paper topic itself: what should be the standard procedure for publishing data-gathering code like this? The authors don't specify which version of any libraries or APIs they used, and since updates occur over time and APIs change, code rot is inevitable. It will eventually be impossible to figure out exactly what this code did.
With meticulous version records it should at least be possible to ascertain what the code did by reconstructing that exact version (assuming stored back versions exist).
In my opinion, archive the data that was actually gathered and the code's intermediate & final outputs. Write the code clearly enough that what it did can be understood by reading it alone, since with pervasive software churn it won't be runnable as-is forever. As a bonus, this approach works even when some steps are manual processes.
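As a minimal sketch of that idea (assuming a Python pipeline; the file name and JSON layout are arbitrary choices of mine), one could snapshot the interpreter and installed package versions alongside the archived data and outputs:

```python
import sys
import json
import importlib.metadata

# Snapshot the interpreter and every installed package version next to
# the archived data, so the run can later be reconstructed even after
# the code itself has rotted.
snapshot = {
    "python": sys.version,
    "packages": {
        (dist.metadata["Name"] or "unknown"): dist.version
        for dist in importlib.metadata.distributions()
    },
}

with open("environment-snapshot.json", "w") as f:
    json.dump(snapshot, f, indent=2)
```

Committing a file like this next to the data costs nothing and answers the "which versions?" question even when the code can no longer be run.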
GPT might make fabricating scientific papers easier, but let's not forget how many humans fabricated scientific research in recent years - they did a great job without AI!
For any who haven't seen/heard, this makes for some entertaining and eye-opening viewing!
https://www.youtube.com/results?search_query=academic+fraud
I think it’s important to remember that while the tidal wave of spam just starting to crest courtesy of the less scrupulous LLM vendors is uh, necessary to address, this century’s war on epistemology was well underway already in the grand traditions of periodic wars on the idea that facts are even aspirationally, directionally worthwhile. The phrase “alternative facts” hit the mainstream in 2016 and the idea that resistance is futile on broad-spectrum digital weaponized bytes was muscular then (that was around the time I was starting to feel ill for being a key architect of it).
Now technology is a human artifact and always ends up resembling its creators or financiers or both: I’d have nice fonts on my computer in 2024 most likely either way, but it’s directly because of Jobs they were available in 1984 to a household budget.
If someone other than Altman had had the insight, or if some insight other than "this thing can lie in a newly scalable way" had been the escape-velocity moment for LLMs, then we'd still have test sets and metrics and just plain science going on in the Commanding Heights of the S&P 500. But these people are a symptom of our apathy around any noble instinct. If we had stuck firm to our values, no effective-altruism cult-leader type would even make the press.
Difficulty and scale matter where it comes to fabrication.
Academia is in large part about barriers, which, while sometimes unpleasant and malfunctioning, nevertheless serve a purpose (unfortunately, it is impossible to evaluate everything fully on a per-case basis, so humans need shortcuts to filter out noise and determine quickly whether something is worth spending attention on). One of those barriers is the form of the paper itself. The fall of this barrier (notably through often unauthorised use of others’ IP) would likely bring about not a sudden idyllic meritocracy but increased noise and/or the strengthening of other barriers.
Sure, but that takes time; AI has the potential to generate "real sounding" papers in under a second. At least the fake papers before were rate-limited.
This kind of fabricated result is not a problem for practitioners in the relevant fields, who can easily distinguish between false and real work.
If there are instances where the ability to make such distinctions is lost, it is most likely because the content lacks novelty, i.e. it simply regurgitates known and established facts. In that case it is a pointless effort, even if it might inflate the supposed author's list of publications.
As to the integrity of researchers, this is a known issue. The temptation to fabricate data existed long before the latest innovations in AI, and is very easy to do in most fields, particularly in medicine or biosciences which constitute the bulk of irreproducible research. Policing this kind of behavior is not altered by GPT or similar.
The bigger problem, however, is when non-experts attempt to become informed and are unable to distinguish between plausible and implausible sources of information. This is already a problem even without AI; consider the debates over the origins of SARS-CoV-2, for example. The solution to this is the cultivation and funding of sources of expertise, e.g. in universities and similar institutions.
Non-experts actually attempting to become informed (instead of just feeling like they're informed) can easily tell the difference too. The people being fooled are the ones who want to be fooled. They're looking for something to support their pre-existing belief. And for those people, they'll always find something they can convince themselves supports their belief, so I don't think it matters what false information is floating around.
It seems to be kind of a new thing for laymen to be reading scientific papers. 20 years ago, they just weren't accessible. You had to physically go to a local university library and work out how to use the arcane search tools, which wouldn't really find what you wanted anyway. And even then, you couldn't take it home and half the time you couldn't even photocopy it because you needed a student ID card to use the photocopier.
For a paper that includes both a broad discussion of the scholarly issues raised by LLMs and wide-ranging policy recommendations, I wish the authors had taken a more nuanced approach to data collection than just searching for “as of my last knowledge update” and/or “I don’t have access to real-time data” and weeding out the false positives manually. LLMs can be used in scholarly writing in many ways that will not be caught with such a coarse sieve. Some are obviously illegitimate, such as having an LLM write an entire paper with fabricated data. But there are other ways that are not so clearly unacceptable.
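For concreteness, the sieve described amounts to little more than a substring match, roughly like the following (a sketch; the function name and the lowercase normalization are my own assumptions, only the two phrases come from the paper):

```python
# A minimal sketch of the coarse sieve discussed above. The phrase list
# comes from the paper's stated queries; everything else is assumed.
TELLTALE_PHRASES = [
    "as of my last knowledge update",
    "i don't have access to real-time data",
]

def looks_llm_generated(text: str) -> bool:
    """Flag a text only if it contains one of the canned phrases verbatim."""
    lowered = text.lower()
    return any(phrase in lowered for phrase in TELLTALE_PHRASES)
```

Any LLM-written passage that was lightly edited, or simply never contained those phrases, sails straight through; only verbatim leftovers are caught, which is exactly the coarseness at issue.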
For example, the authors’ statement that “[GPT’s] undeclared use—beyond proofreading—has potentially far-reaching implications for both science and society” suggests that, for them, using LLMs for “proofreading” is okay. But “proofreading” is understood in various ways. For some people, it would include only correcting spelling and grammatical mistakes. For others, especially for people who are not native speakers of English, it can also include changing the wording and even rewriting entire sentences and paragraphs to make the meaning clearer. To what extent can one use an LLM for such revision without declaring that one has done so?
Last time we discussed this, someone basically searched for phrases such as "certainly, I can do X for you" and assumed that meant GPT was used. HN noticed that many of the accused papers actually predated OpenAI.
> Two main risks arise... First, the abundance of fabricated “studies” seeping into all areas of the research infrastructure... A second risk lies in the increased possibility that convincingly scientific-looking content was in fact deceitfully created with AI tools...
A third risk: ChatGPT has no understanding of "truth" in the sense of facts reported by established, trusted sources. I'm doing a research project related to use of data lakes and tried using ChatGPT to search for original sources. It's a shitshow of fabricated links and pedestrian summaries of marketing materials.
The existence of LLMs makes Google search even more relevant for cross-checking, not less relevant, for deep research. Daniel Dennett said we should have all levels of search available to everyone, i.e. from basic string matching to semantic matching. [0]
[0] https://youtu.be/arEvPIhOLyQ?t=1139
> tried using ChatGPT to search for original sources
That's a bad idea; do not do that. Regardless of the knowledge contained in ChatGPT, it's completely the wrong tool, like using a jackhammer as a screwdriver. If you want original sources, then services like https://perplexity.ai can find them. It's not even an issue with ChatGPT as such; it was never intended for that, which is why OpenAI is also building search: https://openai.com/index/searchgpt-prototype/
It's silly that there's a stigma attached to AI-generated images in cases where using them is perfectly reasonable. People seem to appreciate things more for the fact that they were created by spending time out of another human's life than for what the thing actually is.
It would be silly if they were indistinguishable from human-created images, but they aren't: they exhibit the typical AI artifacts and weirdness, and thereby signal a lack of care.
> create a picture of scrabble pieces strewn on a table, with a closeup of a line of scrabble letters spelling "CHATGPT" on top of them. photographic, realistic quality, maintain realism and believability
The number markings on the Scrabble pieces are nonsensical, the wooden ground looks like plastic, there are strange artifacts like the white smudge on the edge of the “E” tile in the front, and so on.
AI-generated images are clearly identifiable as such, and it just gets annoying to continually see those desultory fabrications.
I see "dogfooding" has now been taken to its natural conclusion.
If you can please publish it and maybe post here on HN or reddit.
It's a rather broken system.
""" Dear XXXX,
MY name is Kristofer, I’m one of the co-authors for the GPT paper. I also wrote the script for the data collection. Jutta forwarded your email regarding the possible bug.
First of all, let me apologise for the late response. Apparently your email made its way to the spam folder, which of course is regrettable. I would also like to thank you for reaching out to us. We are pleased to see the interest of the HN community in transparent and reliable research.
We looked at the comment and the concern around the bug. We’d like to point out that the original commenter was right in saying “it does seem like something they would catch when manually reviewing the papers”. We in fact reviewed the output manually and carefully for any potential errors. In other words, we opened and searched for the query string manually, which also helped determine whether the use of LLMs was declared in some form or other. This is of course a sensitive topic and we took great care to be thorough.
Nevertheless, we once more did a manual review of the code and the data, in light of this potential bug, and we’re glad to say no row-column mismatch is present. You can find the data here: https://doi.org/10.7910/DVN/WUVD8X
Please don’t hesitate if you have any more questions.
All the best, Kristofer """
Contact info for the first author
Now this sounds like a story worth hearing!
Hope this research is better.
This feels like an evolutionary dead end.
> create a picture of scrabble pieces strewn on a table, with a closeup of a line of scrabble letters spelling "CHATGPT" on top of them. photographic, realistic quality, maintain realism and believability
https://ideogram.ai/assets/image/lossless/response/vF81gKjHS...
https://ideogram.ai/assets/image/lossless/response/EcRpDLumS...
Almost the same prompt.