libraryofbabel · 4 days ago
Way back when, I did a masters in physics. I learned a lot of math: vectors, a ton of linear algebra, thermodynamics (aka entropy), multi-variable and then tensor calculus.

This all turned out to be mostly irrelevant in my subsequent programming career.

Then LLMs came along and I wanted to learn how they work. Suddenly the physics training is directly useful again! Backprop is one big tensor calculus calculation, minimizing… entropy (cross-entropy, strictly)! Everything is matrix multiplications. Things are actually differentiable, unlike most of the rest of computer science.
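
A minimal numpy sketch of that point (toy data and sizes invented for illustration, not from the comment): for a linear softmax classifier, the forward pass, the cross-entropy loss, and the backprop gradient are all just a few matrix multiplications.

    import numpy as np

    # Toy data: 4 samples, 3 features, 2 classes (all made up).
    X = np.array([[0.1, 0.2, 0.3],
                  [0.5, 0.1, 0.0],
                  [0.2, 0.9, 0.4],
                  [0.7, 0.3, 0.8]])
    y = np.array([0, 1, 0, 1])        # true class labels
    W = np.zeros((3, 2))              # weights to learn

    def softmax(z):
        z = z - z.max(axis=1, keepdims=True)   # numerical stability
        e = np.exp(z)
        return e / e.sum(axis=1, keepdims=True)

    probs = softmax(X @ W)            # forward pass: a matrix multiplication

    # Cross-entropy loss: the quantity being minimized.
    loss = -np.log(probs[np.arange(len(y)), y]).mean()

    # Backprop: for softmax + cross-entropy the error signal is
    # (probs - onehot)/N, pushed through the linear layer by...
    # another matrix multiplication.
    onehot = np.eye(2)[y]
    grad_W = X.T @ (probs - onehot) / len(y)

    W -= 0.1 * grad_W                 # one gradient-descent step
    print(loss, grad_W.shape)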

It’s fun using this stuff again. All except the tensor calculus on curved spacetime; I haven’t had to reach for that yet.

r-bryan · 4 days ago
Check out this 156-page tome, "Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges": https://arxiv.org/abs/2104.13478

The intro says that it "...serves a dual purpose: on one hand, it provides a common mathematical framework to study the most successful neural network architectures, such as CNNs, RNNs, GNNs, and Transformers. On the other hand, it gives a constructive procedure to incorporate prior physical knowledge into neural architectures and provide principled way to build future architectures yet to be invented."

Working all the way through that, besides relearning a lot of my undergrad EE math (some time in the previous century), I learned a whole new bunch of differential geometry that will help next time I open a General Relativity book for fun.

minhaz23 · 3 days ago
I have very little formal education in advanced maths, but I’m highly motivated to learn the math needed to understand AI. Should I take a stab at parsing through and trying to understand this paper (maybe even using AI to help, heh), or would that be counter-productive from the get-go, and am I better off spending my time following some structured courses in prerequisite maths before trying to understand these research papers?

Thank you for sharing this paper!

Quizzical4230 · 3 days ago
Thank you for sharing the paper!

psb217 · 4 days ago
That past work will pay off even more when you start looking into diffusion and flow-based models for generating images, videos, and sometimes text.
pornel · 4 days ago
The breakthrough in image generation speed literally came from applying better differential equations for diffusion, taken from statistical mechanics papers:

https://youtu.be/iv-5mZ_9CPY

JBits · 4 days ago
For me, it was the very basics of general relativity that made the distinction between the cotangent and tangent spaces click. Optimisation on Riemannian manifolds might give an opportunity to apply more interesting tensor calculus with a non-trivial metric.
jwar767 · 4 days ago
I have the same experience but with a masters in control theory. Suddenly all the linear algebra and differential equations are super useful in understanding this.
CrossVR · 4 days ago
Any reason you didn't pick up computer graphics before? Everything is linear algebra and there's even actual physics involved.
ls-a · 4 days ago
Is that you, Will Hunting?
alguerythme · 4 days ago
Well, for calculus on curved space, let me introduce you to https://arxiv.org/abs/2505.18230 (this is self-advertising). If you know how to incorporate time into that, I am interested.
3abiton · 3 days ago
The funny thing about physics maths is that we didn't have to learn the intuition behind it; it was a means to an end. Going through undergrad mathematically blind was a rite of passage.
lazarus01 · 3 days ago
Modern numerical compute frameworks, including TensorFlow and JAX, provide automatic differentiation to calculate derivatives.
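
For instance, a minimal JAX sketch (toy loss and data invented): you write the function, and jax.grad returns its derivative.

    import jax
    import jax.numpy as jnp

    def loss(w, x, y):
        # Mean squared error of a linear model.
        return jnp.mean((x @ w - y) ** 2)

    grad_fn = jax.grad(loss)   # derivative w.r.t. the first argument, w

    x = jnp.ones((4, 3))
    y = jnp.zeros(4)
    w = jnp.array([0.5, -0.2, 0.1])
    print(grad_fn(w, x, y))    # gradient computed automatically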

spinlock_ · 4 days ago
For me, working through Karpathy's video series (instead of just "watching" them) helped me tremendously in understanding how LLMs work, and it gave me the confidence to read through more advanced material if I feel like it. But to be honest, the knowledge I gained through his videos is already enough for me. It's kind of like learning how a CPU works in general and ignoring all the fancy optimization steps that I'm not interested in.

Thanks Andrej for the time and effort you put into your videos.

meken · 4 days ago
+1. The CS231n class he taught at Stanford gave me a great foundation.
karpathy · 4 days ago
<3
romanoonhn · 4 days ago
Can you share what you mean by "working through" the videos? This playlist has been on my to-do list for a few weeks, so I'm interested.
spinlock_ · 2 days ago
Sure, I was talking about Andrej's "Zero to Hero" playlist:

https://youtube.com/playlist?list=PLAqhIrjkxbuWI23v9cThsA9Gv...

rsanek · 4 days ago
Anyone else read the book that the author mentions, Build a Large Language Model (from Scratch) [0]? After watching Karpathy's video [1] I've been looking for a good source to do a deeper dive.

[0] https://www.manning.com/books/build-a-large-language-model-f...

[1] https://www.youtube.com/watch?v=7xTGNNLPyMI

tanelpoder · 4 days ago
Yes, can confirm, the book is great. I was also happy to see that the author correctly (in my mind) used the term “embedding vectors” vs. “vector embeddings” that most others seem to use… Some more context about my pet peeve: https://tanelpoder.com/posts/embedding-vectors-vs-vector-emb...
malshe · 4 days ago
Here is the code used in the book - https://github.com/rasbt/LLMs-from-scratch
gchadwick · 4 days ago
I thought it was a great book, dives into all the details and lays it out step by step with some nice examples. Obviously it's a pretty basic architecture and very simplistic training but I found it gave me the grounding to then understand more complex architectures.
kamranjon · 4 days ago
It’s good - I’m working through it right now
horizion2025 · 4 days ago
Is there a non-video equivalent? I always prefer reading and digesting at my own pace to following a video.
gpjt · 4 days ago
Check the first link in the parent comment, it's a link to the book.
tra3 · 4 days ago
Is [1] worth a watch if I want to get a high level/basic understanding of how LLMs work?
rsanek · 4 days ago
Yeah, it's very well done
ForceBru · 4 days ago
Yes, it's really good
ozgung · 4 days ago
This is not about _Large_ Language Models, though. It explains the math for word vectors and token embeddings. I see this as the source of confusion for many people: they think LLMs just do this to statistically predict the next word. That was pre-2020s. They ignore the 1.8+ trillion-parameter Transformer network. Embeddings are just the input of that giant machine. We don't know exactly what is going on in those trillions of parameters.
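
To make that distinction concrete, a toy sketch (sizes and token ids invented): the embedding lookup is one line of array indexing; everything downstream of it is the giant machine in question.

    import numpy as np

    # Tiny toy sizes; real models use vocabularies of ~100k tokens
    # and embedding dimensions in the thousands.
    vocab_size, d_model = 10_000, 64
    rng = np.random.default_rng(0)
    embedding_table = rng.normal(size=(vocab_size, d_model))

    token_ids = np.array([464, 2068, 7586])   # hypothetical token ids
    x = embedding_table[token_ids]            # embedding lookup: trivial

    # ...the LLM proper is the stack of transformer layers that consumes x;
    # that is where the trillions of parameters live.
    print(x.shape)  # (3, 64)
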
ants_everywhere · 4 days ago
But surely you need this math to start understanding LLMs. It's just not the math you need to finish understanding them.
HarHarVeryFunny · 4 days ago
It depends on what level of understanding, and who you are talking about. For the 99% of people outside of software development or machine learning, it is totally irrelevant, as are any details of the Transformer architecture or the mechanism by which a trained Transformer operates.

For the man in the street, inclined to view "AI" as some kind of artificial brain or sentient thing, the best explanation is that basically it's just matching inputs to training samples and regurgitating continuations. Not totally accurate of course, but for that audience at least it gives a good idea and is something they can understand, and perhaps gives them some insight into what it is, how it works/fails, and that it is NOT some scary sentient computer thingy.

For anyone in the remaining 1% (or much less - people who actually understand ANNs and machine learning), the Transformer architecture and how a trained Transformer works (induction heads etc.) is what they need to learn to understand what a (Transformer-based, vs. LSTM-based) LLM is and how it works.

Knowing about the "math" of Transformers/ANNs is only relevant to people who are actually implementing them from the ground up - not even to those who might just want to build one using PyTorch or some other framework/library where the math has already been done for you.

Finally, embeddings aren't about math - they are about representation, which is certainly important to understanding how Transformers and other ANNs work, but still a different topic.

* The US population of ~300M has ~1M software developers, of which a large fraction are going to be doing things like web development, and are only at a marginal advantage over someone smart outside of development in terms of learning how ANNs etc. work.

HSO · 4 days ago
"necessary but not sufficient"
cranx · 4 days ago
But we do. A series of mathematical functions is applied to predict the next tokens. It's not magic, although it seems like it is. People are acting like it's the dark ages and Merlin made a rabbit disappear into a hat.
ekunazanu · 4 days ago
Depends on your definition of knowing. Sure, we know it is predicting next tokens, but do we understand why they output the things they do? I am not well versed in LLMs, but I assume even for smaller models interpretability is a big challenge.
umanwizard · 4 days ago
Doesn’t this apply to any process (including human brains) that outputs sequences of words? There is some statistical distribution describing what word will come next.
clickety_clack · 4 days ago
That is what they do though. It might have levels of accuracy we can barely believe, but it is still a statistical process that predicts the next tokens.
ozgung · 4 days ago
Not necessarily. They can generate letters, tokens, or words in any order. They can even write them all at once, as diffusion models do. Next-token generation (auto-regression) is just a design choice of GPT, mostly for practical reasons: it fits the task at hand naturally (we humans also generate words in sequential order). Also, GPT has to be trained in a self-supervised manner, since we don't have labeled internet-scale data, and auto-regression solves that problem as well.

The distinction I want to emphasize is that they don't just predict words statistically. They model the world, understand different concepts and their relationships, can think about them, can plan and act on the plan, and can reason up to a point, all in order to generate the next token. They learn all of this via that training scheme. They don't learn just the frequency of word relationships, unlike the old algorithms. Trillions of parameters do much more than that.
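
A sketch of what auto-regression-as-a-design-choice looks like (model() below is an invented stand-in for the whole trained network): sample one token, append it, and feed the longer sequence back in.

    import numpy as np

    rng = np.random.default_rng(0)

    def model(tokens):
        # Stand-in for the trained network: returns logits over a toy
        # 10-token vocabulary. In a real LLM this would be the full
        # transformer consuming the sequence so far.
        return np.cos(np.arange(10) * len(tokens))

    def sample_next(logits, temperature=1.0):
        z = logits / temperature
        p = np.exp(z - z.max())          # softmax over the logits
        p /= p.sum()
        return int(rng.choice(len(p), p=p))

    tokens = [1, 2, 3]                   # the prompt
    for _ in range(5):                   # auto-regression: one token per step,
        tokens.append(sample_next(model(tokens)))  # fed back into the model
    print(tokens)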

measurablefunc · 4 days ago
It's exactly the same math. There is no mathematics in any neural network, regardless of its scale, that cannot be expressed with matrix multiplications and activation functions.
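
Literally so; a minimal numpy sketch (shapes invented) of a two-layer network whose entire forward pass is two matrix multiplications and one activation:

    import numpy as np

    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(size=(3, 8)), np.zeros(8)   # layer 1 parameters
    W2, b2 = rng.normal(size=(8, 2)), np.zeros(2)   # layer 2 parameters

    def forward(x):
        h = np.maximum(0, x @ W1 + b1)   # matrix multiply + ReLU activation
        return h @ W2 + b2               # matrix multiply

    print(forward(rng.normal(size=(4, 3))).shape)   # (4, 2)
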
libraryofbabel · 4 days ago
* You’re right that a lot of people take a cursory look at the math (or someone else’s digest of it) and their takeaway is “aha, LLMs are just stochastic parrots blindly predicting the next word. It’s all a trick!”

* So we find ourselves over and over again explaining that that might have been true once, but now there are (imperfect, messy, weird) models of large parts of the world inside that neural network.

* At the same time, the vector embedding math is still useful to learn if you want to get into LLMs. It’s just that the conclusions people draw from the architecture are often wrong.

baxtr · 4 days ago
Wait, so you’re saying it’s not a high-dimensional matrix multiplication?
dmd · 4 days ago
Everything is “just” ones and zeros, but saying that doesn’t help with understanding.
tatjam · 4 days ago
Pretty much all problems can be reduced to some number of matrix multiplications ;)
armcat · 4 days ago
One of the most interesting mathematical aspects to me is the fact that LLMs are logit emitters, and associated with that output is uncertainty. A lot of people talk about networks of agents, but what you are doing there is accumulating uncertainty: every model in the chain introduces its own uncertainty on top of what it inherits. In some situations I've seen a complete collapse after 3 LLM calls chained together. Hence many people recommend a "human in the loop" as much as possible to try to reduce that uncertainty (shift the posterior, if you will), or they recommend more of a workflow approach, where you have a single orchestrator that decides which function to call and most of the emphasis (and context engineering) is placed on that orchestrator. But it all ties together in the maths of LLMs.
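
A back-of-the-envelope sketch of that compounding (the per-call reliability figure is invented): chained success probability decays geometrically, and the entropy of each call's softmax distribution is one measure of its own uncertainty.

    import numpy as np

    # Invented per-call probability of a "good" output.
    p_good = 0.9
    for n in (1, 2, 3, 5):
        print(f"{n} chained LLM call(s): P(all good) = {p_good ** n:.3f}")
    # 3 calls already drop to ~0.729; 5 calls to ~0.590.

    # Each call's own uncertainty: entropy of the softmax over its logits.
    logits = np.array([2.0, 1.0, 0.5, 0.1])   # made-up logits
    p = np.exp(logits - logits.max())
    p /= p.sum()
    print(f"entropy: {-(p * np.log(p)).sum():.3f} nats")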

InCom-0 · 4 days ago
These are technical details of computations that are performed as part of LLMs.

Completely pointless to anyone who is not writing the lowest-level ML libraries (so basically everyone). This does not help anyone understand how LLMs actually work.

This is as if you started explaining how an ICE car works by diving into the chemical properties of petrol. Yes, that really is the basis of it all, but no, it is not where you start when explaining how a car works.

jasode · 4 days ago
>This is as if you started explaining how an ICE car works by diving into the chemical properties of petrol.

But wouldn't explaining the chemistry actually be acceptable if the title was "The chemistry you need to start understanding internal combustion engines"?

That's analogous to what the author did. The title was "The maths ..." -- and then the body of the article fulfills the title by explaining the math relevant to LLMs.

It seems like you wished the author wrote a different article that doesn't match the title.

InCom-0 · 4 days ago
'The maths you need to start understanding LLMs'.

You don't need that math to start understanding LLMs. In fact, I'd argue it's harmful to start there, unless your goal is 'take me on an epic journey of all the things mankind needed to figure out to make LLMs work, from the absolute basics'.

bryanrasmussen · 4 days ago
>Completely pointless to anyone who is not writing the lowest-level ML libraries (so basically everyone). This does not help anyone understand how LLMs actually work.

Maybe this is the target group of people who would need particular "maths" to start understanding LLMs.

antegamisou · 4 days ago
Find someone on HN who doesn't trivialize fundamental math yet encourages everyone to become a PyTorch monkey who ends up having no idea why their models are shite: impossible.
49pctber · 4 days ago
Anyone who would like to run an LLM needs to perform its computations on hardware, so picking hardware that is good at matrix multiplication is important, even if they didn't develop their LLM from scratch. Knowing the basic math also explains some of the rush to purchase GPUs and TPUs in recent years.

All that is kind of missing the point though. I think people being curious and sharpening their mental models of technology is generally a good thing. If you didn't know an LLM was a bunch of linear algebra, you might have some distorted views of what it can or can't accomplish.

InCom-0 · 4 days ago
Being curious is good ... nothing wrong with that. What I took issue with above is (what I see as) an attempt to derail people into low-level math when that is not the crux of the question at all.

Also: nobody who wants to run LLMs will write their own matrix multiplications. Nobody doing ML/AI comes close to that stuff ... it's all abstracted away and not something anyone actually thinks about (except the few people who actually write the underlying libraries, e.g. at Nvidia).

saagarjha · 4 days ago
If you're just piecing together a bunch of libraries, sure. But anyone who is adjacent to ML research should know how these work.
InCom-0 · 4 days ago
Anyone actually doing ML research knows this stuff ... but doesn't write the actual code for it (or, god forbid, byzantine math notation somewhere), and doesn't even think about it except through X levels of higher abstraction.

Also, those people understand LLMs already :-).

ivape · 4 days ago
Also, people need to accept that they’ve been doing regular-ass programming for many years and can’t just jump into whatever they want. The idea that developers are well-rounded general engineers is a myth, mostly propagated from within the bubble.

Most people’s educations right here probably didn’t even involve linear algebra (this is a bold claim, because the assumption is that everyone here is highly educated, no cap).

Rubio78 · 4 days ago
Working through Karpathy's series builds a foundational understanding of LLMs, providing enough confidence to explore further. A key insight is that LLMs are logit emitters, and their inherent uncertainty compounds dangerously in multi-agent chains, often requiring a human-in-the-loop or a single orchestrator to manage it. Crucially, people confuse word embeddings with the full LLM; embeddings are just the input to a vast, incomprehensible trillion-parameter transformer. The underlying math of these networks is surprisingly simple, built on basic additions and multiplications. The real mystery isn't the math but why they work so well. Ultimately, AI research is a mix of minimal math, extensive data engineering, massive compute power, and significant trial and error.
erdehaas · 3 days ago
The title is misleading. The maths explained in the blog is the math used to build an LLM (how it internally does calculations to run inference, etc.). The math needed to understand LLMs, i.e. math that explains with rigor why LLMs work, is not fully developed yet. That is what LLM explainability is about: the effort to understand and clarify the complex, "black-box" decision-making processes of Large Language Models in human-interpretable terms.