As someone who has no technical knowledge of Llama or any of the LLM work, from conceptual understanding to technical implementation, is there any benefit to sitting down and going through this from start to finish? Or is effort better spent somewhere else?
Like a roadmap: do A, do B, and finally go through this at the end.
This was posted on HN a while ago and led to some great discussion [0]. Others and I agreed that this type of stateful visualization was _way_ more effective at conceptualizing how an LLM works than reading code or stepping through a debugger.
[0] https://news.ycombinator.com/item?id=38505211
My opinion: it quickly gets into "the math behind LLMs", which makes no sense to me.
words i understand but don't really get: weights, feed forward, layers, tensors, embeddings, normalization, transformers, attention, positioning, vector
There's "programming" in the plumbing sense where you move data around through files/sockets and then there's this... somebody without a math background/education... very unlikely you'll understand it. it's just skimming python and not understand the math/library calls it makes
Ya there are concepts in programming and math that are mostly self-teachable from first principles, but then there's what looks like gibberish because it's too new to have been distilled down into something tractable yet. I would say that arrays and matrices are straightforward to understand, while tensors are not. So I'm disappointed that so much literature currently revolves around tensors. Same for saying embedding instead of just vector representation, etc.
It helps me to think in terms of levels of abstraction rather than complexity. My education stopped at a 4 year degree, but AI is mostly postgraduate still. So I have to translate to what I know because I haven't internalized the lingo.
Here's the most approachable teaching of neural nets (NNs) and large language models (LLMs) that I've seen so far:
https://news.ycombinator.com/item?id=40213292 (Alice’s Adventures in a differentiable wonderland)
https://arxiv.org/pdf/2404.17625 (pdf)
https://news.ycombinator.com/item?id=40215592 (tensor and NN layer breadcrumbs)
II A strange land 105
7 Convolutional layers 107
..
7.1.3 Translational equivariant layers 112
..
9 Scaling up the models 143
..
9.3 Dropout and normalization 151
9.3.1 Regularization via dropout 152
9.3.2 Batch (and layer) normalization 156
III Down the rabbit-hole 167
10 Transformer models 169
10.1 Introduction 169
10.1.1 Handling long-range and sparse dependencies 170
10.1.2 The attention layer 172
10.1.3 Multi-head attention 174
10.2 Positional embeddings 177
10.2.1 Permutation equivariance of the MHA layer 177
10.2.2 Absolute positional embeddings 179
10.2.3 Relative positional embeddings 182
10.3 Building the transformer model 182
10.3.1 The transformer block and model 182
10.3.2 Class tokens and register tokens 184
11 Transformers in practice 187
11.1 Encoder-decoder transformers 187
11.1.1 Causal multi-head attention 188
11.1.2 Cross-attention 189
11.1.3 The complete encoder-decoder transformer 190
11.2 Computational considerations 191
11.2.1 Time complexity and linear-time transformers 191
11.2.2 Memory complexity and the online softmax 192
11.2.3 The KV cache 194
11.2.4 Transformers for images and audio 194
11.3 Variants of the transformer block 197
I recommend _Deep Learning with Python_ by François Chollet (the creator of Keras). It’s very clear and approachable, explains all of these concepts, and doesn’t try to “impress” you with unnecessary mathematical notation. Excellent introductory book.
The only downside is that in 2024, you are probably going to use PyTorch rather than Keras + TensorFlow as shown in the book.
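The translation is mostly mechanical, though. Here's a minimal sketch (layer sizes and the MNIST-style shapes are arbitrary, my own illustration, not from the book) of what a small Keras-style dense stack looks like in PyTorch:

    import torch
    from torch import nn

    # Roughly the PyTorch equivalent of a small Keras Sequential model of Dense layers.
    model = nn.Sequential(
        nn.Linear(784, 128),   # like keras.layers.Dense(128, activation="relu") on flattened MNIST
        nn.ReLU(),
        nn.Linear(128, 10),    # class logits; the softmax lives inside the loss function
    )

    x = torch.randn(32, 784)                 # a dummy batch
    logits = model(x)
    loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 10, (32,)))
    loss.backward()                          # autograd replaces Keras's model.fit() plumbing
    print(logits.shape, float(loss))

The concepts in the book carry over directly; only the API surface changes.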
If you want to gain familiarity with the kind of terminology you mentioned here, but don't have a background in graduate-level mathematics (or even undergrad really), I highly recommend Andrew Ng's "Deep Learning Specialization" course on Coursera. It was made a few years ago but all of the fundamental concepts are still relevant today.
That's exactly where I am at. Despite watching Karpathy's tutorial videos, I quickly got lost. My highest level of math education is Calculus 3 which I barely passed. This probably means that I will only ever understand LLMs at a high level.
Take Signals and Systems, in part to get the notation and the explanations. MIT has the course online for free. (Though it's probably a little more general than what you need, since the class is also used to prep electrical engineers for robotics and radio communication.)
Only do it if you want the illusion of LLMs to be shattered. Suddenly, every day you'll see two to three highly upvoted links on HN and be unable to keep your eyes from rolling.
If you like this, it's also worth looking at llama2.c[1], an implementation of the Llama 2 architecture in about 1000 lines of plain, dependency-free C, tokenizer and all. The fact that this 960-line file and a somewhat modern C compiler is all you really need to run a state-of-the-art language model is really surprising to many.
Of course, this is not all there is to a modern LLM; it would probably take another thousand lines or two to implement training, and many more than that to make it fast on all the major CPU and GPU architectures. If you want a flexible framework that lets a developer define any model they want and still goes as fast as it can, the complexity spirals.
Most programmers have an intuition that duplicating a large software project from scratch, like Linux or Chromium for example, would require incredible amounts of expertise, manpower and time. It's not something that a small team can achieve in a few months. You're limited by talent, not hardware.
LLMs are very different. The code isn't that complicated, you could probably implement training and inference for a single model architecture, from scratch, on a single kind of GPU, with reasonable performance, as an individual with a background in programming and who still remembers their calculus and linear algebra, with a year or so of self study. What makes LLMs difficult is getting access to all the hardware to train them, getting the data, and being able to preprocess that data.
One other thing to add is large-scale RLHF. Big Tech can pay literally hundreds of technically-sophisticated people throughout the world (e.g. college grads in developing countries) to improve LLM performance on all sorts of specific problems. It is not a viable way to get AGI, but it means your LLM can learn tons of useful tricks that real people might want, and helps avoid embarrassing "mix broken glass into your baby formula" mistakes. (Obviously it is not foolproof.)
I suspect GPT-4's "secret sauce" in terms of edging out competitors is that OpenAI is better about managing data contractors than the other folks. Of course it's a haze of NDAs to learn specifics, and clearly the contractors are severely underpaid compared to OpenAI employees/executives. But a lone genius with a platinum credit card can't create a new world-class LLM without help from others.
And if you want to understand it, I'd recommend this post (GPT-2 in 60 lines of NumPy) and the post on attention it links to. The concepts are mostly identical to Llama, just with a few minor architectural tweaks. https://jaykmody.com/blog/gpt-from-scratch/
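To make that concrete, here's a minimal NumPy sketch of the scaled dot-product attention at the heart of both GPT-2 and Llama (shapes and names are my own, not taken from the linked post):

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))   # numerically stable softmax
        return e / e.sum(axis=-1, keepdims=True)

    def attention(q, k, v, mask):
        # q, k, v: [seq_len, head_dim]; mask: [seq_len, seq_len] boolean
        scores = q @ k.T / np.sqrt(q.shape[-1])   # similarity of every query to every key
        scores = np.where(mask, scores, -1e9)     # causal mask: a token can't attend to the future
        return softmax(scores) @ v                # weighted average of the values

    seq_len, head_dim = 8, 16
    rng = np.random.default_rng(0)
    q, k, v = (rng.standard_normal((seq_len, head_dim)) for _ in range(3))
    causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    print(attention(q, k, v, causal).shape)   # (8, 16)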
> Most programmers have an intuition that duplicating a large software project from scratch, like Linux or Chromium for example, would require incredible amounts of expertise, manpower and time. It's not something that a small team can achieve in a few months. You're limited by talent, not hardware.
But only for the same reasons. Linux runs on very nearly every piece of hardware ever made. The APIs you have to implement in order to run "Linux programs" are large and full of old complexity that exists for compatibility. Chromium is full of code to try to make pages render even though they were designed for Internet Explorer 6.
Conversely, some university programs have students create a basic operating system from scratch. It's definitely something a small team can do as long as you don't care about broad hardware support or compatibility with existing applications. In principle a basic web browser is even simpler.
> The code isn't that complicated, you could probably implement training and inference for a single model architecture, from scratch, on a single kind of GPU, with reasonable performance, as an individual with a background in programming and who still remembers their calculus and linear algebra, with a year or so of self study.
Great overview. One gap I've been working on (daily) since October is the math working towards MA's Mathematics for Machine Learning course (https://mathacademy.com/courses/mathematics-for-machine-lear...).
I wrote about my progress (http://gmays.com/math) if anyone else is interested in a similar path. I recently crossed 200 days of doing math daily (at least a lesson a day). It's definitely taking longer than I want, but I also have limited time (young kids + startup + investing).
The 'year of self study' definitely depends on where you're starting from and how much time you have, but it's very doable if you can dedicate an hour or two a day.
The code is much more similar, in principle, to a virtual machine. The actual code, the bit that contains the logic which has the semantics we intend, is in the trained weights, where the level of complexity is much higher and more subtle.
> you could probably implement training and inference for a single model architecture, from scratch, on a single kind of GPU, with reasonable performance… with a year or so
I have implemented inference of Whisper https://github.com/Const-me/Whisper and Mistral https://github.com/Const-me/Cgml/tree/master/Mistral/Mistral... models on all GPUs which support the Direct3D 11.0 API. The performance is IMO very reasonable.
A year might be required when the only input is the research articles. In practice, we also have reference Python implementations of these models. It's possible to test different functions or compute shaders against the corresponding pieces of the reference implementations, by comparing saved output tensors between the reference and the newly built implementation. Thanks to that simple trick, I think I have spent less than 1 month, part-time, on each of these two projects.
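The trick looks roughly like this (file names and tolerances here are made up for illustration): dump an intermediate tensor from the reference Python implementation once, then assert that your own implementation reproduces it.

    import numpy as np

    # In the reference Python implementation, save an intermediate tensor once, e.g.:
    #   np.save("ref_attn_layer0.npy", tensor.detach().cpu().numpy())

    def check_against_reference(my_output, ref_path, rtol=1e-3, atol=1e-4):
        # Compare a tensor produced by the new implementation with the saved reference.
        ref = np.load(ref_path)
        assert my_output.shape == ref.shape, f"shape mismatch: {my_output.shape} vs {ref.shape}"
        max_err = np.abs(my_output - ref).max()
        if not np.allclose(my_output, ref, rtol=rtol, atol=atol):
            raise AssertionError(f"mismatch vs {ref_path}, max abs error {max_err:.3e}")
        print(f"OK: {ref_path}, max abs error {max_err:.3e}")

    # check_against_reference(my_attention_output, "ref_attn_layer0.npy")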
I'd say a year for somebody who doesn't know what a linear layer is and couldn't explain why a GPU might be of any use if you're not playing games, but who knows what the derivative of 3x^2 is.
> What makes LLMs difficult is getting access to all the hardware to train them, getting the data, and being able to preprocess that data.
Yes, that's my opinion too. GAOs (Grassroots AI Organisations) are constrained by access to data and the hardware needed to process the data and train the model on it. I look forward to a future where GAOs will crowdsource their computations in the same way many science labs borrow computing power from people around the world.
This is hard because you need high bandwidth between the GPUs in your cluster, bandwidth far higher than broadband could provide. I'm not even sure whether the time spent synchronizing between far-away machines would offset the increase in computational power.
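A rough back-of-envelope (my own illustrative numbers, assuming naive synchronization of full fp16 gradients for a 7B-parameter model) shows the scale of the problem:

    # Back-of-envelope: gradient traffic per step for a 7B-parameter model (illustrative numbers).
    params = 7e9
    bytes_per_grad = 2                              # fp16 gradients
    grad_bytes = params * bytes_per_grad            # ~14 GB exchanged per full synchronization

    broadband_bytes_per_s = 100e6 / 8               # 100 Mbit/s home uplink
    nvlink_bytes_per_s = 450e9                      # roughly NVLink-class GPU interconnect

    print(f"gradient payload per sync: {grad_bytes / 1e9:.0f} GB")
    print(f"over broadband: {grad_bytes / broadband_bytes_per_s / 60:.1f} minutes")
    print(f"over NVLink:    {grad_bytes / nvlink_bytes_per_s * 1e3:.0f} ms")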
>" THe fact that this 960-line file and a somewhat modern C compiler is all you really need to run a state-of-the-art language model is really surprising to many."
"the code for AGI will be simple"
- John Deremetrius Carmack
There is a tick-tock between searching for the dominant NN architectures (tick) and optimizing for accuracy, compute, and inference latency and throughput (tock).
This particular (tock) is still playing out. The next (tick) does not feel imminent and will likely depend on when we discover the limits of transformers when it comes to solving for the long tail of use cases.
You have to consider that there is still some low-hanging fruit that lets you improve prompt processing (not token generation) performance by an order of magnitude or even two, but there are no takers. The reason is quite simple: you can just buy more GPUs and forget about the optimizations.
If a 100x improvement in performance is left on the table, then surely even lower priority optimizations won't be implemented any time soon.
Consider this: a lot of clever attention optimizations rely on some initial pass to narrow down the important tokens and discard the rest from the KV cache. If this were actually possible, then how come the first few layers of the LLM don't already do this numerically to focus their attention? Here is the shocker: they already do, but since you're passing the full 8k context to the next layer anyway, you're wasting it on mostly... nothing.
I repeat: Does the 80th layer really need the ability to perform attention over all the previous 8k outputs of the 79th layer? The first layer? Definitely. The last? No.
What happens if you only perform attention over 10% of the outputs of layer 79? What speedup does this give you?
Notice how the model has already learned the most optimal attention scheme. You just need to give it less stuff to do and it will get faster automatically.
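As a minimal sketch of the idea (my own illustration, not an existing implementation), a late layer could score all cached keys as usual but only run the softmax and weighted sum over the top 10%; a real version would shrink the KV cache itself so later layers never store the discarded entries:

    import numpy as np

    def pruned_attention(q, K, V, keep_frac=0.1):
        # One query attending over only the top-scoring fraction of a layer's KV cache.
        scores = K @ q / np.sqrt(q.shape[-1])     # similarity of the query to every cached key
        k = max(1, int(len(scores) * keep_frac))
        idx = np.argpartition(scores, -k)[-k:]    # keep only the k highest-scoring keys
        w = np.exp(scores[idx] - scores[idx].max())
        w /= w.sum()                              # softmax over the kept keys only
        return w @ V[idx]                         # weighted sum over ~keep_frac of the values

    ctx, d = 8192, 128
    rng = np.random.default_rng(0)
    q = rng.standard_normal(d)
    K = rng.standard_normal((ctx, d))
    V = rng.standard_normal((ctx, d))
    out = pruned_attention(q, K, V, keep_frac=0.1)  # attends over 819 of the 8192 cached tokens
    print(out.shape)  # (128,)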
My wish is that they would move on to the next phase. The whole deal with SSMs looks really good. But looking for better architectures is countered with "a regular architecture with more parameters is doing better; what's the point of this?"
The innovation is the amount of resources people are willing to spend right now. From looking at the research code it's clear that the whole field is basically doing a (somewhat) guided search in the entire space of possible layer permutations.
There seems to be no rhyme or reason, no scientific insight, no analysis. They just try a million different permutations, and whatever scores the highest on the benchmarks gets published.
There's definitely scientific insight and analysis.
E.g. "In-context Learning and Induction Heads" is an excellent paper.
Another paper ("ROME") https://arxiv.org/abs/2202.05262 formulates hypothesis over how these models store information, and provide experimental evidence.
The thing is, a 3-layer MLP is basically an associative memory + a bit of compute. People understand that if you stack enough of them you can compute or memorize pretty much anything.
Attention provides information routing. Again, that is pretty well-understood.
The rest is basically finding an optimal trade-off. These trade-offs are based on insights drawn from experimental data.
So this architecture is not so much accidental as it is general.
Specific representations used by MLPs are poorly understood, but there's definitely progress on understanding them from first principles by building specialized models.
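As a toy illustration of the "MLP as associative memory" view (weights constructed by hand here, not trained; real trained MLPs store associations in a far more distributed way):

    import numpy as np

    # Toy associative memory built from a 2-layer ReLU MLP (weights set by hand, not trained).
    n_pairs, d_key, d_val = 4, 8, 3
    keys = np.eye(n_pairs, d_key)                                 # orthonormal "keys" to look up
    values = np.arange(n_pairs * d_val, dtype=float).reshape(n_pairs, d_val)  # stored "values"

    W1 = keys                        # hidden unit i measures similarity to key i
    b1 = -0.5 * np.ones(n_pairs)     # threshold: only a genuine match stays positive
    W2 = values / 0.5                # rescale so the readout is exactly the stored value

    def mlp(x):
        h = np.maximum(0.0, W1 @ x + b1)   # ReLU keeps only the matching key's unit active
        return h @ W2                      # that unit routes to (reads out) the stored value

    print(mlp(keys[2]))   # -> [6. 7. 8.], i.e. values[2]
    print(values[2])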
The only thing that has changed since 2018 is the most popular network structure to play with. The code looks the same as always: Python notebooks where someone manually calculated the size of each hard-coded layer to make it fit.
There are too many papers throwing transformers on everything without thinking. Transformers are amazing for language but kinda mid on everything else. CS researchers tend to jump on trends really hard, so it will probably go back to normal again soon.
I’ve occasionally worked with more dynamic models (tree-structured decoding). They are generally not a good fit for maxing out GPU throughput. A lot of the magic of transformers and large language models is about pushing the GPU as hard as we can; a simpler, static model architecture that trains faster can train on much more data.
So until the hardware allows for comparable (say, within 2-4x) throughput of samples per second, I expect model architectures to stay mostly static for the most effective models, and dynamic architectures to remain an interesting side area.
There are things like NAS (neural architecture search), but all you are doing is growing the search space and making the optimization problem much harder. Typically you do the architectural optimization by hand, using heuristics and past experiments as guidance.
The iterative leaps in open-source model quality are strong evidence that companies competing at the LLM model layer have an ephemeral moat.
Serious question: assuming this is true, if an incumbent-challenger like OpenAI wants to win, how do they effectively compete against incumbents such as Meta and Google, whose existing products can be AI-enhanced in a snap?
the very first big AI company who gives up trying to lobotomize and emasculate their models to align with the values of 0.01% of the world population will win a lot of hearts and minds overnight. the censorship necessary for corporate applications can be trivially implemented as a toggleable layer, using a small, efficient, specialist model to detect no-no words and wrongthink in inputs/outputs.
gpt, claude, gemini, even llama and mistral, all tend to produce the same nauseating slop, easily-recognizable by anyone familiar with LLMs - these days, I cringe when I read 'It is important to remember' even when I see it in some ancient, pre-slop writings.
creativity - one of the very few applications generative AI can truly excel at - is currently impossible. it could revolutionize entertainment, but it isn't allowed to. the models are only allowed to produce inoffensive, positivity-biased, sterile slop that no human being finds attractive.
> the censorship necessary for corporate applications can be trivially implemented as a toggleable layer, using a small, efficient, specialist model to detect no-no words and wrongthink in inputs/outputs.
What's really funny is they all have "jailbreaks" that you can use to make them say anything anyway. So for "corporate" uses, the method you propose is already mandatory. The whole thing (censoring base models) is a misguided combination of ideology and (over-the-top) risk aversion.
> creativity - one of the very few applications generative AI can truly excel at - is currently impossible. it could revolutionize entertainment, but it isn't allowed to. the models are only allowed to produce inoffensive, positivity-biased, sterile slop that no human being finds attractive.
Have you played around with base models? If you haven't yet, I'm sure you'll be happy to find that most base models are delightfully unslopped and uncensored.
I highly recommend trying a base model like davinci-002[1] in OpenAI's "legacy" Completions API playground. That's probably the most accessible, but if you're technically inclined, you can pair a base model like Llama3-70B[2] with an interface like Mikupad[3] and do some brilliant creative writing. Llama3 models can be run locally with something like Ollama[4], or if you don't have the compute for it, via an LLM-as-a-service platform like OpenRouter[5].
[1] https://platform.openai.com/docs/models/gpt-base
[2] https://huggingface.co/meta-llama/Meta-Llama-3-70B
[3] https://github.com/lmg-anon/mikupad
[4] https://ollama.com/library/llama3:70b-text
[5] https://openrouter.ai/models/meta-llama/llama-3-70b
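For example, here's a minimal sketch of querying davinci-002 through the legacy Completions endpoint with the openai Python package (the prompt and sampling parameters are just placeholders):

    from openai import OpenAI  # pip install openai; expects OPENAI_API_KEY in the environment

    client = OpenAI()

    # Base models have no chat template: you hand them raw text and they simply continue it.
    resp = client.completions.create(
        model="davinci-002",
        prompt="The old lighthouse keeper had one rule, and tonight she was about to break it.",
        max_tokens=200,
        temperature=1.0,
    )
    print(resp.choices[0].text)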
I think you vastly overestimate how much people care about model censorship. There are a bunch of open models that aren't censored. Llama 3 is still way more popular because it's just smarter.
I think you have your populations reversed. The number of people who get their knickers in a twist over LLMs reflecting certain cultural biases (and sometimes making foolish predictions in the process) amounts to a rounding error.
Their moat atm is being 6 months ahead of everyone else on model quality. Plus the ‘startup’ advantage over their corporate competitors. Oh and they can hoard a lot of the best talent because it’s an extremely high status place to work.
Their task now is to maintain and exploit those advantages as best they can while they build up a more stable long term moat: lots of companies having their tech deeply integrated into their operations.
Just to add, they don't have the baggage of Google or Meta, so they can do more without worrying how it impacts the rest of the company. And of the big players they seem the most aware of how important good data is, and have paid for lots of high-quality curated fine-tuning data in order to build a proper product instead of doing a research project. That mindset, and the commercial difference it makes, shouldn't be underestimated.
At least they use punctuation. We recently had a project on HN where the author used only lowercase and no punctuation because they equated them with being chained by the system.
Seeing Anya (the girl pointing at pictures), I'd guess the author is partial to Japanese culture. As their writing system does not have a concept of upper/lower case, he might just have decided that it is superfluous. Or he is simply an eccentric. Though I guess this is one of those things that some folks won't care about and others will get mightily hung up on.
I personally don't really mind that bit of capitalization that English does. German is much worse.
2024 is the year that most of us are collectively growing out of the early social media era all-lowercase thing, but everyone hasn't gotten the memo yet.
Aaaaaaaaaa.org is possibly the worst domain name I've ever encountered in all my time using the internet. I support your mission but you need to change that.
Google and find the examples where someone does it in a spreadsheet. It's much more approachable that way.
You are going to find it's not that complicated.
… built on the back of a disposable workforce…
There is something grim and dystopian, thinking about the countless small hands feeding the machine.
https://github.com/karpathy/llama2.c
https://news.ycombinator.com/item?id=36838051
https://arstechnica.com/information-technology/2024/03/once-...
It actually teaches you how to build Llama iteratively, and to test, debug, and interpret the training loss, rather than just describing the code.
This is an indication that we’re at the infancy of this field.
I'm kind of shocked. I thought there would be more dynamism by now and I stopped dabbling in like 2018.
My $0.02.
Step change, then optimization of that step change
Kind of like a grandfather clock with a huge pendulum swinging to one side, then the other (a commonly used metaphor).
I wonder: shouldn't AI be the best tool to optimize itself?
There's still some room for experimenting if you care about memory/power efficiency, like MoE models, but they're not as well understood yet.
Does grok do this, given where it came out of?
Really? Most of our testing now has Gemini Pro on par or better (though we haven't tested omni/Ultra)
It really seems like the major models have all topped out / are comparable
And they are not alone.
https://twitter.com/karpathy/status/1792261360430293176
You misspelled 'better'.
Also it looks more casual and authentic, less LLM generated
such as your comment and my comment!
A just question.