SEGyges commented on Do we understand how neural networks work?   verysane.ai/p/do-we-under... · Posted by u/Philpax
Nevermark · 7 months ago
Here is a slightly sideways take on the question, since some misunderstandings crop up over and over.

Things that are not true about neural networks and LLMs:

1. They are not stochastic parrots. Statistics can be used to judge model performance, as with any type of model whatsoever.

But that doesn’t make a model statistical. Neural network components are not statistically driven, they are gradient driven. They don’t calculate statistics, they learn topological representations of relationships.

2. LLMs are not just doing word prediction or text completion. Yes, that is the basic task they are trained on, but the “just” (that is often stated or implied) trivializes what the network actually has to learn to perform well.

Task type, and what must be learned to achieve success at that task, are two entirely different things.

Predicting the kinds of human reasoning documented in writing in training sets requires that kind of reasoning be learned. Not just some compressed generalization of people’s particular responses.

Simple proof that LLMs are not just compressing a lot of human behavior comes easily. Just ask an LLM to do something involving several topics unlikely to have been encountered together before; its one-shot answer might not be God’s word on the issue, but it is a far cry from the stumbling that a mere mimic could produce. (Example task: ask for a Supreme Court brief arguing for rabbits’ rights based on sentient-animal and native rights, with serious arguments, but written in Dr. Seuss prose by James Bond.)

3. LLMs do reason. Not like us or as well as us. But also, actually better than us in some ways.

LLMs are far superior to us at very wide, somewhat shallow reasoning.

They are able to one-shot weave together information from tens of thousands of disparate topics and ideas, on demand.

But they don’t notice their own blind spots. They miss implications. Those are things humans do quickly in continuous narrower deeper reasoning cycles.

They are getting better.

And some of their failures should be attributed to the strict constraint we put on them: producing a well-written response in one or a few shots.

We don’t hold humans to such strict standards. And if we did, we would also make a lot more obvious errors.

Wide and shallow reasoning in ways humans can’t match is not a trivial success.

4. LLMs are very creative.

As with their reasoning, not like us: very wide in terms of instantly and fluidly weaving highly disparate information together. Somewhat shallow (but gaining) in terms of iteratively recognizing and self-correcting their own mistakes and oversights.

See random original disparate topic task above.

First, spend several hours performing that task oneself. Or take a day or two if need be.

Then, give the task to an LLM and wait a few seconds.

Compare. Ouch.

TL;DR: somewhat off-topic and high-level, but when trying to understand models it helps to avoid these seemingly endlessly repeated misunderstandings.

SEGyges · 7 months ago
i hate "stochastic parrot" because it's not even really meaningful

I think it's true that models are statistical, inasmuch as P(A|B) where B is the prior sequence is what the loss is computing, and that's statistical. It's just computing that function in an absurdly complex way, which involves creating topological representations of relationships, etc.
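A toy sketch of that framing (hypothetical three-token vocabulary, plain Python, not any real model's code): the training loss is the cross-entropy −log P(A|B), however absurdly complex the function producing the logits is in a real network.

```python
import math

# Toy next-token loss: the model emits a score (logit) per vocabulary item
# for the token following the prior sequence B; training minimizes
# -log P(A | B), the cross-entropy against the observed next token A.
def next_token_loss(logits, target_index):
    # Softmax turns raw logits into a probability distribution P(. | B).
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    p_target = exps[target_index] / total
    return -math.log(p_target)

# Three-token vocabulary; this hypothetical model strongly favors token 0.
loss_confident = next_token_loss([4.0, 0.0, 0.0], target_index=0)
loss_wrong = next_token_loss([4.0, 0.0, 0.0], target_index=1)
assert loss_confident < loss_wrong  # loss is low when the true token is favored
```

The statistics live in the objective; everything interesting lives in how the logits get computed.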

I agree that "just" autocomplete implies the wrong thing. It turns out autocomplete is amazing if you scale it up.

I think it's true that they reason and are creative, but these are really hard points because people mean subtly different things by "reason" and "creative".

SEGyges commented on Do we understand how neural networks work?   verysane.ai/p/do-we-under... · Posted by u/Philpax
astrange · 7 months ago
The text returned by the tool itself makes it not "next token prediction". Aside from having side effects, the reason it's helpful is that it's out of distribution for the model. So it changes the properties of the system.
SEGyges · 7 months ago
This is true of the system as a whole, but the core neural network is still a next-token predictor.
SEGyges commented on Do we understand how neural networks work?   verysane.ai/p/do-we-under... · Posted by u/Philpax
sirwhinesalot · 7 months ago
This is such a weird take to me. We know exactly how LLMs and neural networks in general work.

They're just highly scaled up versions of their smaller curve fitting cousins. For those we can even make pretty visualizations that show exactly what is happening as the network "learns".

I don't mean "we can see parts of the brain light up", I mean "we can see the cuts and bends each ReLU is doing to the input".
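The "cuts and bends" picture can be made concrete in one dimension (a hypothetical two-unit network, not anything fitted to data): a small ReLU net is a piecewise-linear function, and each hidden unit contributes one kink where its pre-activation crosses zero.

```python
# 1-D illustration of the "cuts and bends": a tiny ReLU network is a
# piecewise-linear function, and each hidden unit adds one kink
# (at the input where its pre-activation w*x + b crosses zero).
def relu(x):
    return max(0.0, x)

def tiny_net(x, params):
    # params: list of (weight, bias, output_weight) per hidden unit
    return sum(v * relu(w * x + b) for (w, b, v) in params)

# Two hidden units -> kinks at x = 1.0 and x = -2.0; linear in between.
params = [(1.0, -1.0, 1.0), (1.0, 2.0, -0.5)]
ys = [tiny_net(x / 10.0, params) for x in range(-40, 41)]
```

At this scale you really can plot `ys` and watch every bend move during training; the claim at issue is whether that picture still constitutes "understanding" at billions of units.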

We built these things, we know exactly how they work. There's no surprise beyond just how good prediction accuracy gets with a big enough model.

Deep Neural Networks are also a very different architecture from what is going on in our brains (which work more like Spiking Neural Networks) and our brains don't do backpropagation, so you can't even make direct comparisons.

SEGyges · 7 months ago
fortunately i wrote an entire post about the difference between the parts of this that are easy to make sense of and the parts that are prohibitively difficult to make sense of, and it was posted on hackernews
SEGyges commented on Do we understand how neural networks work?   verysane.ai/p/do-we-under... · Posted by u/Philpax
nickm12 · 7 months ago
I was, for a time, a neuroscience major and have had this same thought. I'm concerned that we're treating these systems as engineered systems when they are closer to evolved or biological systems. They are at least different in that we can study them much more precisely than biological systems, because their entire state is visible and we can run them on arbitrary inputs and measure responses.
SEGyges · 7 months ago
I agree, what we do is much closer to growing them than to engineering them. We basically engineer the conditions for growth, and then check the results and try again.

My best argument that insights from neuroscience will transfer to neural networks, and vice versa:

For sufficiently complex phenomena (e.g., language), there should only be one reasonably efficient solution to the problem, and small variations on that solution. So there should be some reversible mapping between any two tractable solutions to the problem that is pretty close to lossless, provided both solutions actually solve the problem.

And, yeah, the main advantage of neural networks is that they're white-box. You can also control your experiments in a way you can't in the real world.

SEGyges commented on Do we understand how neural networks work?   verysane.ai/p/do-we-under... · Posted by u/Philpax
porridgeraisin · 7 months ago
Nevertheless it is next token prediction. Each token it predicts is an action for the RL setup, and context += that_token is the new state. Rewards come either from human labels (RLHF) or from math/code problems with deterministic answers, where prediction == solution is used as the reward signal.

Policy gradient approaches in RL are just supervised learning, when viewed through some lenses. You can search for Karpathy's more fleshed-out argument for the same; I'm on mobile now.

SEGyges · 7 months ago
My short explanation would be that even for RL, you are training on a next token objective; but the next token is something that has been selected very very carefully for solving the problem, and was generated by the model itself.

So you're amplifying existing trajectories in the model by feeding the model's outputs back to itself, but only when those outputs solve a problem.

This elides the KL penalty and the odd group scoring, which are the same in the limit but vastly more efficient in practice.
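That "amplify successful trajectories" view can be sketched in miniature (a toy categorical "model" over three answers, all names hypothetical; the KL penalty and group scoring are deliberately omitted, as is everything else a real RL pipeline does):

```python
import random

# Sketch of "feed the model's outputs back to itself, but only when they
# solve a problem": sample from the current policy, keep rewarded samples,
# and do a supervised (next-token-style) update toward them.
def sample(probs):
    return random.choices(range(len(probs)), weights=probs)[0]

def rl_as_weighted_sft(probs, is_correct, steps=2000, lr=0.05):
    probs = list(probs)
    for _ in range(steps):
        a = sample(probs)          # the model generates its own output
        if is_correct(a):          # reward signal: did it solve the problem?
            # supervised update toward the model's own successful output
            for i in range(len(probs)):
                target = 1.0 if i == a else 0.0
                probs[i] += lr * (target - probs[i])
    total = sum(probs)
    return [p / total for p in probs]

random.seed(0)
final = rl_as_weighted_sft([0.4, 0.3, 0.3], is_correct=lambda a: a == 2)
# probability mass concentrates on the rewarded answer (index 2)
```

The training signal is still "predict this token"; the selection of which tokens to train on is where the RL lives.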

SEGyges commented on LLMs are cheap   snellman.net/blog/archive... · Posted by u/Bogdanp
pama · 9 months ago
Please read the DeepSeek analysis of their API service (linked in this article): they have a 500% profit margin, and they are cheaper than any of the US companies serving the same model. It is conceivable that the API services of OpenAI or Anthropic have even higher profit margins.

(GPUs are generally much more cost effective and energy efficient than CPU if the solution maps to both architectures. Anthropic certainly caches the KV-cache of their 24k token system prompt.)

SEGyges · 9 months ago
Every LLM provider caches their KV-cache, it's a publicly documented technique (go stuff that KV in redis after each request, basically) and a good engineering team could set it up in a month.
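A rough sketch of that technique (a dict stands in for redis here, and `compute_kv` is a hypothetical stand-in for the model's forward pass; real systems also handle eviction, sharding, and block-level matching):

```python
import hashlib

# Prefix KV caching: after serving a request, store the attention KV
# entries keyed by a hash of the token prefix; on the next request, reuse
# the longest cached prefix and run the model only on the remaining tokens.
kv_store = {}  # in production this would be redis or similar

def prefix_key(tokens):
    return hashlib.sha256(" ".join(map(str, tokens)).encode()).hexdigest()

def serve(tokens, compute_kv):
    # find the longest already-cached prefix of this request
    for cut in range(len(tokens), 0, -1):
        cached = kv_store.get(prefix_key(tokens[:cut]))
        if cached is not None:
            break
    else:
        cut, cached = 0, []
    # only the uncached suffix needs fresh compute
    kv = cached + [compute_kv(t) for t in tokens[cut:]]
    kv_store[prefix_key(tokens)] = kv
    return len(tokens) - cut  # tokens actually computed this request

computed_first = serve([1, 2, 3, 4], compute_kv=lambda t: t * 10)
computed_second = serve([1, 2, 3, 4, 5, 6], compute_kv=lambda t: t * 10)
# the second request reuses the cached 4-token prefix, computes only 2 tokens
```

This is why a shared 24k-token system prompt is nearly free after the first request: every subsequent request hits the same cached prefix.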
SEGyges commented on The Biggest Statistic About AI Water Use Is a Lie   verysane.ai/p/the-biggest... · Posted by u/SEGyges
SEGyges · 9 months ago
Because this specific number comes up constantly and is incredibly frustrating it seemed like it really needed to be addressed directly.

If we can get past this specific thing we can perhaps have a fact-based conversation about what's going on with power use in ai or tech.

SEGyges commented on Linux kernel 6.14 is a big leap forward in performance and Windows compatibility   zdnet.com/article/linux-k... · Posted by u/CrankyBear
ASalazarMX · a year ago
Does he willingly take that treatment from others? I don't really know, but that would be the actual test of congruence.
SEGyges · a year ago
I am not deep into the Linus weeds but my impression is that he doesn't especially care if he's on the receiving end of this. It only started to feel different from "well, the Linux list is the PvP zone" when Linus was sufficiently weighty/famous that you almost had to take an insult from him to heart, and he did eventually correct his behavior there.
SEGyges commented on Probably pay attention to tokenizers   cybernetist.com/2024/10/2... · Posted by u/ingve
lechatonnoir · a year ago
I mean, it's not completely fatal, but it means an approximately 16x increase in runtime cost, if I'm not mistaken. That's probably not worth trying to solve letter counting in most applications.
SEGyges · a year ago
it is not necessarily 16x if you, e.g., also decrease model width by a factor of 4 or so, but yeah, naively the RAM and FLOPs scale up as n^2
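The arithmetic behind that exchange, with illustrative numbers only (assuming character-level tokens make sequences ~4x longer than BPE tokens):

```python
# The quadratic attention term grows 16x when sequences get 4x longer:
def attention_term(seq_len):
    return seq_len ** 2

ratio = attention_term(4000) / attention_term(1000)

# The counterpoint: other per-token costs scale with model width d (the
# MLP/projection matrices are roughly d^2 per token), so narrowing the
# model can offset some of the increase from longer sequences.
def mlp_term(seq_len, width):
    return seq_len * width ** 2

naive = mlp_term(4000, 1024) / mlp_term(1000, 1024)    # 4x from length alone
narrowed = mlp_term(4000, 256) / mlp_term(1000, 1024)  # width/4 more than offsets it
```

So the 16x figure applies to the attention term specifically; the total cost multiplier depends on how the width and length terms trade off in a given model.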
