Readit News
ashdksnndck · 2 months ago
I’m not sure we can accept the premise that LLMs haven’t made any breakthroughs. What if people aren’t giving the LLM credit when they get a breakthrough from it?

First time I got good code out of a model, I told my friends and coworkers about it. Not anymore. The way I see it, the model is a service I (or my employer) pays for. Everyone knows it’s a tool that I can use, and nobody expects me to apportion credit for whether specific ideas came from the model or me. I tell people I code with LLMs, but I don’t commit a comment saying “wow, this clever bit came from the model!”

If people are getting actual bombshell breakthroughs from LLMs, maybe they are rationally deciding to use those ideas without mentioning the LLM came up with it first.

Anyway, I still think Gwern’s suggestion of a generic idea-lab trying to churn out insights is neat. Given the resources needed to fund such an effort, I could imagine that a trading shop would be a possible place to develop such a system. Instead of looking for insights generally, you’d be looking for profitable trades. Also, I think you’d do a lot better if you have relevant experts to evaluate the promising ideas, which means that more focused efforts would be more manageable. Not comparing everything to everything, but comparing everything to stuff in the expert’s domain.

If a system like that already exists at Jane Street or something, I doubt they are going to tell us about it.

therealpygon · 2 months ago
It is hard to accept the premise because the premise is questionable from the beginning.

Google has already reported several breakthroughs as a direct result of AI, using processes that almost certainly include LLMs, including a new result in math, improved chip designs, etc. DeepMind has AI that predicted millions of protein structures which are already being used in drug development, among many other things they do, though yes, not an LLM per se. There is certainly the possibility that companies won’t announce things, given that direct LLM output isn’t copyrightable/patentable, so a human-in-the-loop solves the issue by claiming the human made said breakthrough with AI/LLM assistance. There isn’t much benefit to announcing how much AI helped with a breakthrough unless you’re in the business of selling AI.

As for “why aren’t LLMs creating breakthroughs by themselves regularly”, the answer is pretty obvious… they just don't really have that capacity in a meaningful way, based on how they work. The closest example, Google's algorithmic breakthrough, absolutely was created by a coding LLM; it was effectively achieved through brute force in a well-established domain, but that doesn't mean it wasn't a breakthrough. That alone casts doubt on the underlying premise of the post.

starlust2 · 2 months ago
> through brute force

The same is true of humanity in aggregate. We attribute discoveries to an individual or group of researchers but to claim humans are efficient at novel research is a form of survivorship bias. We ignore the numerous researchers who failed to achieve the same discoveries.

js8 · 2 months ago
I would say the real breakthrough was training NNs as a way to create practical approximators for very complex functions over some kind of many-valued logic. Why they work so well in practice we still don't fully understand theoretically (in the sense that we don't know what kind of underlying logic best models what we want from these systems). The LLMs (and their application to natural language) are just a consequence of that.
Yizahi · 2 months ago
You are contradicting yourself. Either LLM programs can make breakthroughs on their own, or they don't have that capacity in a meaningful way based on how they work.
PaulHoule · 2 months ago
Almost certainly an LLM has, in response to a prompt and through sheer luck, spat out the kernel of an idea that a superhuman centaur of the year 2125 would see as groundbreaking, but that hasn't been recognized as such.

We have a thin conception of genius that can be challenged by Edison's "1% inspiration, 99% perspiration" or by the process of getting a PhD, where you might spend 7 years getting to the point where you can start adding new knowledge and then take another 7 years to really hit your stride.

I have a friend who is 50-something and disabled with some mental illness; he thinks he has ADHD. We had a conversation recently where he repeatedly expressed his fantasy that he could show up somewhere with his unique perspective, sprinkle some pixie dust on their problems, and be rewarded for it. I found it exhausting. When I hear his ideas, or any idea, I immediately think "how would we turn this into a product and sell it?" or "write a paper about it?" or "convince people of it?" He would have no part of that, and thought that operationalizing or advocating for an idea was uninteresting and that somebody else would do all that work. My answer is: they might, but not without the advocacy.

And it comes down to that.

If an LLM were to come up with a groundbreaking idea and be recognized for it, it would have to do a sustained amount of work, say the equivalent of at least 2 person-years, to win people over. And LLMs aren't anywhere near equipped to do that, nobody is going to pay the power bill for it, and if you were paying the power bill you'd probably have to pay for a million of them to go off in the wrong direction.

ben_w · a month ago
Broadly agree (I see lots of "ideas people" who have no interest in doing); the only thing I would add is that the occasional results from the big AI groups suggest it takes less than 1e6 machines, but probably more than 100 even for low-hanging fruit, which is already too much for a lot of people to stomach, so the point is still valid.
nico · 2 months ago
> but I don’t commit a comment saying “wow, this clever bit came from the model!”

The other day, Claude Code started adding a small signature to the commit messages it was preparing for me. It said something like “This commit was co-written with Claude Code” and a little robot emoji

I wonder if that just happened by accident or if Anthropic is trying to do something like Apple with the “sent from my iPhone”

danielbln · 2 months ago
morsch · 2 months ago
Aider does the same thing (and has a similar setting). I tend to squash the AI commits and remove it that way, though I suppose a flag indicating the degree of AI authorship could be useful.
catigula · 2 months ago
letting claude pen commits is wild.
kajumix · 2 months ago
Most interesting novel ideas originate at the intersection of multiple disciplines. Profitable trades could be found in the biomedicine sector when knowledge of biomedicine and of finance are combined. That's where I see LLMs shining, because they span disciplines far more than any human can. Once we figure out a way to have them combine ideas (similar to what Gwern is suggesting), there will be, I suspect, a flood of novel and interesting ideas that would be inconceivable with humans alone.
Yizahi · 2 months ago
This is bordering on a conspiracy theory. Thousands of people are getting novel breakthroughs generated purely by LLMs and not a single person discloses such a result? Not even one of the countless LLM-corporation engineers who depend on billion-dollar IV injections from deluded bankers just to keep surviving, and not one has bragged about an LLM pulling off that revolution? Hard to believe.
esafak · 2 months ago
Countless people are increasing their productivity and talking about it here ad nauseam. Even researchers are leaning on language models; e.g., https://mathstodon.xyz/@tao/114139125505827565

We haven't successfully resolved famous unsolved research problems through language models yet, but one can imagine that they will solve increasingly challenging problems over time. And if it happens in the hands of a researcher rather than the model's lab, one can also imagine that the researcher will take credit, so you will still have the same question.

BizarroLand · 2 months ago
I wonder if it's not the LLM making the breakthrough, but rather that the person using the system just needed the available information presented in a clear and orderly fashion to make the breakthrough themselves.

After all, the LLM currently has no cognizance; it is unable to understand what it is saying in a meaningful way. At its best it is a p-zombie (philosophical zombie) machine, right?

In my opinion anything amazing that comes from an LLM only becomes amazing when someone who was capable of recognizing the amazingness perceives it, like a rewrite of a zen koan, "If an LLM generates a new work of William Shakespeare, and nobody ever reads it, was anything of value lost?"

HarHarVeryFunny · 2 months ago
> Despite impressive capabilities, large language models have yet to produce a genuine breakthrough. The puzzle is why.

I don't see why this is remotely surprising. Despite all the hoopla, LLMs are not AGI or artificial brains - they are predict-next-word language models. By design they are not built for creativity, but rather quite the opposite, they are designed to continue the input in the way best suggested by the training data - they are essentially built for recall, not creativity.

For an AI to be creative it needs to have innate human/brain-like features such as novelty (prediction failure) driven curiosity, boredom, as well as the ability to learn continuously. IOW, if you want the AI to be creative it needs to be able to learn for itself, not just regurgitate the output of others, and to have these innate mechanisms that will cause it to pursue discovery.

fragmede · 2 months ago
Define creativity. Three things LLMs can do are write song lyrics, poems, and jokes, all of which require some level of what we think of as human creativity. Of course detractors will say the LLM versions of those three aren't very good, and they may even be right, but a twelve-year-old child coming up with the same would be seen as creative, even if they didn't get significant recognition for it.
HarHarVeryFunny · 2 months ago
Sure, but the author of TFA is well versed in LLMs and so is addressing something different. Novelty isn't the same as creativity, especially when limited to generating based on a fixed repertoire of moves.

The term "deductive closure" has been used to describe what LLMs are capable of, and therefore what they are not capable of. They can generate novelty (e.g. new poem) by applying the rules they have learnt in novel ways, but are ultimately restricted by their fixed weights and what was present in the training data, as well as being biased to predict rather than learn (which they anyways can't!) and explore.

An LLM may do a superhuman job of applying what it "knows" to create solutions to novel goals (be that a math olympiad problem, or some type of "creative" output that has been requested, such as a poem), but is unlikely to create a whole new field of math that wasn't hinted at in the training data because it is biased to predict, and anyways doesn't have the ability to learn that would allow it to build a new theory from the ground up one step at a time. Note (for anyone who might claim otherwise) that "in-context learning" is really a misnomer - it's not about learning but rather about using data that is only present in-context rather than having been in the training set.

tmaly · 2 months ago
I think we will see more breakthroughs with an AI/Human hybrid approach.

Tobias Rees had some interesting thoughts https://www.noemamag.com/why-ai-is-a-philosophical-rupture/ where he poses this idea that AI and humans together can think new types of thoughts that humans alone cannot think.

karmakaze · 2 months ago
Yes, LLMs choose probable sequences because they recognize similarity. Because of that, they can diverge from similarity to be creative: increase the temperature. What LLMs don't have is (good) taste; we need to build an artificial tongue and feed it as a prerequisite.
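For concreteness, a minimal sketch of what "increase the temperature" means at the sampling level (plain softmax sampling over logits; nothing here is any particular model's API):

```python
import numpy as np

def sample_next_token(logits: np.ndarray, temperature: float = 1.0) -> int:
    """Sample a token index from raw next-token logits."""
    # Temperature < 1 sharpens the distribution (more predictable output);
    # temperature > 1 flattens it, letting the model wander away from the most
    # "similar" continuation -- the knob referred to above.
    scaled = logits / max(temperature, 1e-8)
    probs = np.exp(scaled - scaled.max())  # softmax, shifted for numerical stability
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))

# Same logits, different temperatures:
logits = np.array([2.0, 1.0, 0.2, -1.0])
safe = sample_next_token(logits, temperature=0.2)   # almost always token 0
wild = sample_next_token(logits, temperature=1.5)   # far more likely to surprise
```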
HarHarVeryFunny · 2 months ago
It depends on what you mean by "creative": they can recombine fragments of training data (i.e. apply generative rules) in any order and generate the deductive closure of the training set, but that is it.

Without moving beyond LLMs to a more brain-like cognitive architecture, all you can do is squeeze the juice out of the training data, by using RL/etc to bias the generative process (according to reasoning data, good taste or whatever), but you can't move beyond the training data to be truly creative.

grey-area · 2 months ago
Well, they also don't have understanding, a model of the world, or the ability to reason (no, the chain-of-thought created by AI companies is not reasoning), as well as having no taste.

So there is quite a lot missing.

vonneumannstan · 2 months ago
> Despite all the hoopla, LLMs are not AGI or artificial brains - they are predict-next-word language models. By design they are not built for creativity, but rather quite the opposite, they are designed to continue the input in the way best suggested by the training data - they are essentially built for recall, not creativity.

This is just a completely base level of understanding of LLMs. How do you predict the next token with superhuman accuracy? Really think about how that is possible. If you think it's just stochastic parroting you are ngmi.

> large language models have yet to produce a genuine breakthrough. The puzzle is why.

I think you should really update on the fact that world-class researchers are surprised by this. They understand something you don't: it's clear these models build robust world models, and text prompts act as probes into those world models. The surprising part is that despite these sophisticated world models we can't seem to get out the unique insights that almost surely already exist in them. Even if all the model is capable of is memorizing text, the sheer volume it has memorized should yield unique insights; no human can ever hope to hold this much text in memory and then make connections across it.

It's possible we just lack the prompt creativity to get these insights out but nevertheless there is something strange happening here.

HarHarVeryFunny · 2 months ago
> This is just a completely base level of understanding of LLMs. How do you predict the next token with superhuman accuracy? Really think about how that is possible. If you think it's just stochastic parroting you are ngmi.

Yes, thank you, I do understand how LLMs work. They learn a lot of generative rules from the training data, and will apply them in flexible fashion according to the context patterns they have learnt. You said stochastic parroting, not me.

However, we're not discussing whether LLMs can be superhuman at tasks where they had the necessary training - we're discussing whether they are capable of creativity (and presumably not just the trivially obvious case of being able to apply their generative rules in any order - deductive closure, not stochastic parroting in the dumbest sense of that expression).

awongh · 2 months ago
Not just prompting; it could also be that we haven't done the right kind of RLHF for these kinds of outputs.

zhangjunphy · 2 months ago
I also hope we have something like this. But sadly, this is not going to work. The reason is this line from the article, which is so much harder than it looks:

> and a critic model filters the results for genuinely valuable ideas.

In fact, people have tried this idea. And if you use an LLM or anything similar as the critic, the performance of the model actually degrades in the process, as the LLM tries too hard to satisfy the critic, and the critic itself is far from a good reasoner.

So the reason we don't hear much about this idea is not that nobody tried it, but that they tried, it didn't work, and people are reluctant to publish about something that does not work.
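For what it's worth, the loop people have been trying looks roughly like this (a minimal sketch; generate() and critique() stand in for two LLM calls and are not a real API):

```python
import random

def daydream_step(generate, critique, concepts, threshold=0.8):
    """One generate-then-filter iteration of the kind the article describes.

    `generate` proposes a connection between two randomly paired concepts;
    `critique` scores it for value. Both are assumed LLM wrappers here.
    """
    a, b = random.sample(concepts, 2)
    idea = generate(f"What non-obvious connection links '{a}' and '{b}'?")
    # The weak point: the critic is itself an LLM and a poor reasoner, and the
    # generator ends up optimizing for whatever the critic happens to reward.
    return idea if critique(idea) >= threshold else None
```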

imiric · 2 months ago
Exactly.

This not only affects a potential critic model, but the entire concept of a "reasoning" model is based on the same flawed idea—that the model can generate intermediate context to improve its final output. If that self-generated context contains hallucinations, baseless assumptions or doubt, the final output can only be an amalgamation of that. I've seen the "thinking" output arrive at a correct solution in the first few steps, but then talk itself out of it later. Or go into logical loops, without actually arriving at anything.

The reason why "reasoning" models tend to perform better is simply due to larger scale and better training data. There's nothing inherently better about them. There's nothing intelligent either, but that's a separate discussion.

yorwba · 2 months ago
Reasoning models are trained from non-reasoning models of the same scale, and the training data is the output of the same model, filtered through a verifier. Generating intermediate context to improve the final output is not an idea that reasoning models are based on, but an outcome of the training process, because empirically the model produces answers that pass the verifier more often if it generates the intermediate steps first.

That the model still makes mistakes doesn't mean it's not an improvement: the non-reasoning base model makes even more mistakes when it tries to skip straight to the answer.
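A rough sketch of that data-construction step (model() and verifier() are placeholders, not any lab's actual pipeline):

```python
def build_reasoning_dataset(model, verifier, problems, samples_per_problem=8):
    """Rejection sampling: keep only traces whose final answer verifies."""
    kept = []
    for problem in problems:
        for _ in range(samples_per_problem):
            chain_of_thought, answer = model(problem)  # sample with intermediate steps
            if verifier(problem, answer):
                # Only verified traces become training data, so generating useful
                # intermediate context is what gets reinforced.
                kept.append((problem, chain_of_thought, answer))
    return kept
```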

danenania · 2 months ago
> The reason why "reasoning" models tend to perform better is simply due to larger scale and better training data.

Except that we can try the exact same pre-trained model with reasoning enabled vs. disabled and empirically observe that reasoning produces better, more accurate results.

amelius · 2 months ago
But what if the critic is just hard reality? If you ask an LLM to write a computer program, instead of criticizing it, you can run it and test it. If you ask an LLM to prove a theorem, let it write the proof in a formal logic language so it can be verified. Etcetera.
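A minimal sketch of that for the programming case, with test execution as the critic (the theorem case would swap the interpreter for a proof checker such as Lean):

```python
import subprocess
import sys
import tempfile

def reality_check(generated_code: str, test_code: str) -> bool:
    """Run the model's code against tests; pass/fail needs no LLM judgment."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(generated_code + "\n\n" + test_code)
        path = f.name
    result = subprocess.run([sys.executable, path], capture_output=True, timeout=30)
    return result.returncode == 0
```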
Yizahi · 2 months ago
Generated code only works because the "test" part (compile/validate/analyze, etc.) is completely external and was written before any mass-market LLMs. There is no such external validator for new theorems, books, pictures, text guides, etc. You can't just run hard_reality.exe on a generated poem or a scientific paper to deem it "correct". That is only possible with programming languages, and even then not always.
zhangjunphy · 2 months ago
I think if we had a good enough simulation of reality, and a fast one, something like an accelerable Minecraft with real-world physics, then this idea might actually work. But the hard reality we can currently generate efficiently and feed into LLMs usually has a narrow scope. It feels like teaching a kid only textbook math for several years and nothing else. The LLM mostly overoptimizes in these very specific fields, and the overall performance might even get worse.
jerf · 2 months ago
Those things are being done. Program testing is now off-the-shelf tech, and as for math proofs, see: https://www.geeky-gadgets.com/google-deepmind-alphaproof/
imtringued · 2 months ago
That didn't stop actor-critic from becoming one of the most popular deep RL methods.
zhangjunphy · 2 months ago
True, and the successful ones usually require an external source of information. For AlphaGo, it is the simple algorithm that decides who is the winner of a game of Go. For GANs, it is the real images supplied by humans. In these scenarios, the critic is the medium which transforms external information into the gradient that optimizes the actor, but it is not the direct source of that information.
jsbg · 2 months ago
> the LLM tries too hard to satisfy the critic

The LLM doesn't have to know about the critic though. It can just output things and the critic is a second process that filters the output for the end user.

blueflow · 2 months ago
I have not yet seen AI doing a critical evaluation of data sources. AI will contradict primary sources if the contradiction is more prevalent in the training data.

Something about the whole approach is bugged.

My pet peeve: "Unix System Resources" as an explanation for the /usr directory is a term that did not exist until the turn of the millennium (rumor is that a c't journalist made it up in 1999), but AI will retcon it into the FHS (5 years earlier) or into Ritchie/Thompson/Kernighan (27 years earlier).

_heimdall · 2 months ago
> Something about the whole approach is bugged.

The bug is that LLMs are fundamentally designed for natural language processing and prediction, not logic or reasoning.

We may get to actual AI eventually, but an LLM architecture either won't be involved at all or it will act as a part of the system mimicking the language center of a brain.

jumploops · 2 months ago
How do you critique novelty?

The models are currently trained on a static set of human “knowledge” — even if they “know” what novelty is, they aren’t necessarily incentivized to identify it.

In my experience, LLMs currently struggle with new ideas, doubly true for the reasoning models with search.

What makes novelty difficult is that the ideas should be nonobvious (see: the patent system). For example, hallucinating a simpler API spec may be “novel” for a single convoluted codebase, but it isn’t novel in the scope of humanity’s information bubble.

I’m curious if we’ll have to train future models on novelty deltas from our own history, essentially creating synthetic time capsules, or if we’ll just have enough human novelty between training runs over the next few years for the model to develop an internal fitness function for future novelty identification.

My best guess? This may just come for free in a yet-to-be-discovered continually evolving model architecture.

In either case, a single discovery by a single model still needs consensus.

Peer review?

n4r9 · 2 months ago
It's a good question. A related question is: "what's an example of something undeniably novel?". Like if you ask an agent out of the blue to prove the Collatz conjecture, and it writes out a proof or counterexample. If that happens with LLMs then I'll be a lot more optimistic about the importance to AGI. Unfortunately, I suspect it will be a lot murkier than that - many of these big open questions will get chipped away at by a combination of computational and human efforts, and it will be impossible to pinpoint where the "novelty" lies.
jacobr1 · 2 months ago
Good point. Look at patents. Few are truly novel in some exotic sense of "the whole idea is something never seen before." Most likely it is a combination of known factors applied in a new way, or incremental development improving on known techniques. In a banal sense, most LLM content generated is novel, in that the specific paragraphs might be unique combinations of words, even if the ideas are just slightly rearranged regurgitations.

So I strongly agree that, especially when we are talking about the bulk of human discovery and invention, the increments will increasingly be within striking distance of human/AI collaboration. Attribution of the novelty in these cases is going to be unclear when the task is, simplified, something like "search for combinations of things, in this problem domain, that do the task better than some benchmark", be that drug discovery, maths, AI itself, or whatever.

zbyforgotp · 2 months ago
I think our minds don’t use novelty but salience, and salience might also be easier to implement.
velcrovan · 2 months ago
I’m once again begging people to read David Gelernter’s 1994 book “The Muse in the Machine”. I’m surprised to see no mention of it in Gwern’s post, it’s the exact book he should be reaching for on this topic.

In examining the possibility of genuinely creative computing, Gelernter discovers and defends a model of cognition that explains so much about the human experience of creativity, including daydreaming, dreaming, everyday “aha” moments, and the evolution of human approaches to spirituality.

https://uranos.ch/research/references/Gelernter_1994/Muse%20...

johnfn · 2 months ago
It's an interesting premise, but how many people

- are capable of evaluating the LLM's output to the degree that they can identify truly unique insights

- are prompting the LLM in such a way that it could produce truly unique insights

I've prompted an LLM upwards of 1,000 times in the last month, but I doubt more than 10 of my prompts were sophisticated enough to even allow for a unique insight. (I spend a lot of time prompting it to improve React code.) And of those 10 prompts, even if all of the outputs were unique, I don't think I could have identified a single one.

I very much do like the idea of the day-dreaming loop, though! I actually feel like I've had the exact same idea at some point (ironic) - that a lot of great insight is really just combining two ideas that no one has ever thought to combine before.

cantor_S_drug · 2 months ago
> are capable of evaluating the LLM's output to the degree that they can identify truly unique insights

I noticed one behaviour in myself. I heard about a particular topic, because it was a dominant opinion in the infosphere. Then LLMs confirmed that dominant opinion (because it was heavily represented in the training) and I stopped my search for alternative viewpoints. So in a sense, LLMs are turning out to be another reflective mirror which reinforces existing opinion.

MrScruff · 2 months ago
Yes, it seems like LLMs are system-one thinking taken to the extreme. Reasoning was supposed to introduce some actual logic, but you only have to play with these models for a short while to see that the reasoning tokens are a very soft constraint on the model's eventual output.

In fact, they're trained to please us and so in general aren't very good at pushing back. It's incredibly easy to 'beat' an LLM in an argument since they often just follow your line of reasoning (it's in the model's context, after all).

zyklonix · 2 months ago
Totally agree, most prompts (especially for code) aren’t designed to surface novel insights, and even when they are, it’s hard to recognize them. That’s why the daydreaming loop is so compelling: it offloads both the prompting and the novelty detection to the system itself. Projects like https://github.com/DivergentAI/dreamGPT are early steps in that direction, generating weird idea combos autonomously and scoring them for divergence, without user prompting at all.
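The scoring side can be surprisingly simple; something in this spirit (a sketch only, with a generic embed() standing in for dreamGPT's actual, more involved divergence scoring):

```python
from itertools import combinations

def rank_by_divergence(concepts, embed):
    """Rank concept pairs by how unrelated they are in embedding space."""
    def cosine_distance(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = (sum(a * a for a in u) ** 0.5) * (sum(b * b for b in v) ** 0.5)
        return 1 - dot / max(norm, 1e-12)  # higher = weirder combination

    vectors = {c: embed(c) for c in concepts}
    return sorted(combinations(concepts, 2),
                  key=lambda p: cosine_distance(vectors[p[0]], vectors[p[1]]),
                  reverse=True)  # most divergent pairs first, ready for the generator
```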
throwaway328 · 2 months ago
The fact that LLMs haven't come up with anything "novel" would be a serious puzzle - as the article claims - only if they were thinking, reasoning, being creative, etc. If they aren't doing anything of the sort, it'd be the only thing you'd expect.

So it's a bit of an anti-climactic solution to the puzzle, but: maybe the naysayers were right and they're not thinking at all, or doing any of the other anthropomorphic things being marketed to users, and we've simply all been dragged along by a narrative that's very seductive to tech types (the computer gods will rise!).

It'd be a boring outcome, after the countless gallons of digital ink spilled on the topic over the last few years, but maybe LLMs will come to be accepted as "normal software", and not god-like, in the end. A medium-to-large improvement in some areas, and anywhere from minimal to pointless to harmful in others. And all for the very high cost of the funding and training and data-hoovering that goes into them, not to mention the opportunity cost of all the things we humans could have been putting money into and didn't.