whoateallthepy commented on I struggled with Git, so I'm making a game to spare others the pain   initialcommit.com/blog/im... · Posted by u/initialcommit
whoateallthepy · 6 months ago
I learned Git from an O'Reilly book and I loved that it started with the internals first.

The git CLI has some rough edges, but once you have the concepts of the work tree, index, commits, and diffs down, it is extremely powerful. magit in Emacs is also incredible.

whoateallthepy commented on I kind of killed Mercurial at Mozilla   glandium.org/blog/?p=4346... · Posted by u/sylvestre
danking00 · 2 years ago
I've heard this point of view many times, but cannot find an extensive explanation of it. Could anyone elaborate on the issue with the GitHub review UI/UX?

> I hate the GitHub review UI with a passion. At least, right now, GitHub PRs are not a viable option for Mozilla [...] the more general shortcomings in the review UI.

whoateallthepy · 2 years ago
One point: if you are re-reviewing, other platforms (e.g. Phabricator, Gerrit) have much more developed ways to compare successive versions of a change against one another.
whoateallthepy commented on Transformers from Scratch (2021)   e2eml.school/transformers... · Posted by u/jasim
Buttons840 · 2 years ago
> Alternatively, these embeddings can be concatenated horizontally to our matrix: this guarantees the positional information is kept entirely separate from the linguistic (at the cost of having a larger model dimension).

Yes, the entire description is helpful, but I especially appreciate this validation that concatenating the position encoding is a valid option.

I've been thinking a lot about aggregation functions, summation in particular, since it's the most basic one. After adding the token embedding and the positional encoding together, it seems information has been lost, because the resulting sum cannot be separated back into the original values. And yet, that seems to be what they do in most transformers, so it must be worth the trade-off.

It reminds me of being a kid, when you first realize that zipping a file produces a smaller file and you think "well, what if I zip the zip file?" At first you wonder if you can eventually compress everything down to a single byte. I wonder the same with aggregation / summation, "if I can add the position to the embedding, and things still work, can I just keep adding things together until I have a single number?" Obviously there are some limits, but I'm not sure where those are. Maybe nobody knows? I'm hoping to study linear algebra more and perhaps I will find some answers there?

whoateallthepy · 2 years ago
One thing to bear in mind is that these embedding vectors are high-dimensional, so it is entirely possible for the token embedding and the position embedding to be near-orthogonal to one another. As a result, information isn't necessarily lost.
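
As a rough illustration (a minimal sketch, not from the article; the dimension and random vectors below are assumptions purely for demonstration), two random high-dimensional vectors are nearly orthogonal, so projecting their sum back onto either component approximately recovers it:

  import torch

  torch.manual_seed(0)
  d_model = 768                # a typical model dimension, chosen for illustration
  tok = torch.randn(d_model)   # stand-in for a token embedding
  pos = torch.randn(d_model)   # stand-in for a positional embedding

  # Random high-dimensional vectors are close to orthogonal.
  cos = torch.nn.functional.cosine_similarity(tok, pos, dim=0)
  print(round(cos.item(), 4))  # close to 0

  # Projecting the sum back onto each component roughly recovers it,
  # so the two pieces of information are not irretrievably mixed together.
  combined = tok + pos
  print(((combined @ tok) / tok.norm() ** 2).item())  # roughly 1
  print(((combined @ pos) / pos.norm() ** 2).item())  # roughly 1
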
whoateallthepy commented on Transformers from Scratch (2021)   e2eml.school/transformers... · Posted by u/jasim
dist-epoch · 2 years ago
> The input string is tokenized into a sequence of token indices (integers)

How is this tokenization done? Sometimes a single word can be two tokens. My understanding is that the token indices are also learned, but by whom? The same transformer? Another neural network?

whoateallthepy · 2 years ago
Tokenization is done by the tokenizer, which can be thought of as just a function that maps strings to sequences of integers, applied before the neural network. Tokenizers can be hand-specified or learned, but in either case this is typically done separately from training the model. Training a new tokenizer is also rarely necessary unless you are dealing with an entirely new input type or language.

Tokenizers can be quite gnarly internally. https://huggingface.co/learn/nlp-course/chapter6/5?fw=pt is a good resource on BPE tokenization.
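
As a concrete example, here is a sketch assuming the Hugging Face transformers package and GPT-2's pretrained BPE tokenizer (which happens to produce the token ids used later in this thread):

  from transformers import AutoTokenizer

  # Load GPT-2's pretrained BPE tokenizer. It is built before the model is
  # trained and simply maps strings to sequences of token ids (and back).
  tokenizer = AutoTokenizer.from_pretrained("gpt2")

  ids = tokenizer.encode("Hello World")
  print(ids)                                   # [15496, 2159]
  print(tokenizer.convert_ids_to_tokens(ids))  # ['Hello', 'ĠWorld']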

whoateallthepy commented on Transformers from Scratch (2021)   e2eml.school/transformers... · Posted by u/jasim
Buttons840 · 2 years ago
This article describes positional encodings based on several sine waves with different frequencies, but I've also seen positional "embeddings" used, where the position (an integer) is used to select a differentiable embedding from an embedding table. Thus, the model learns its own positional encoding. Does anyone know how these compare?

I've also wondered why we add the positional encoding to the value, rather than concatenating them?

Also, the terms encoding, embedding, projection, and others are all starting to sound the same to me. I'm not sure exactly what the difference is. Linear projections start to look like embeddings start to look like encodings start to look like projections, etc. I guess that's just the nature of linear algebra? It's all the same? The data is the computation, and the computation is the data. Numbers in, numbers out, and if the wrong numbers come out then God help you.

I digress. Is there a distinction between encoding, embedding, and projection I should be aware of?

I recently read in "The Little Learner" book that finding the right parameters is learning. That's the point. Everything we do in deep learning is focused on choosing the right sequence of numbers, and we call those numbers parameters. Every parameter has a specific role in our model. Parameters are our choice; those are the knobs that we (as a personified machine learning algorithm) get to adjust. Ever since then the word "parameters" has been much more meaningful to me. I'm hoping for similar clarity with these other words.

whoateallthepy · 2 years ago
This is a great set of comments/questions! To try to answer them briefly:

The input string is tokenized into a sequence of token indices (integers) as the first step of processing the input. For example, "Hello World" is tokenized to:

  [15496, 2159]
The first step in a transformer network is to embed the tokens. Each token index is mapped to a (learned or fixed) embedding (a vector of floats) via the embedding table. The Embedding module from PyTorch (torch.nn.Embedding) is commonly used. After mapping, the matrix of embeddings will look something like:

  [[-0.147, 2.861, ..., -0.447],
   [-0.517, -0.698, ..., -0.558]]
where the number of columns is the model dimension.
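
As a rough sketch of that lookup (the vocabulary and model sizes below are GPT-2-like and chosen purely for illustration):

  import torch

  vocab_size, d_model = 50257, 768  # GPT-2-like sizes, for illustration only
  embedding = torch.nn.Embedding(vocab_size, d_model)  # learned lookup table

  token_ids = torch.tensor([15496, 2159])  # "Hello World" from above
  x = embedding(token_ids)                 # one row of d_model floats per token
  print(x.shape)                           # torch.Size([2, 768])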

A single transformer block takes a matrix of embeddings and transforms them to a matrix of identical dimensions. An important property of the block is that if you reorder the rows of the matrix (which can be done by reordering the input tokens), the output will be reordered but otherwise identical too. (The formal name for this is permutation equivariance).
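
To see this concretely, here is a minimal sketch using PyTorch's MultiheadAttention on its own (not a full transformer block, but it has the same property; the sizes are made up):

  import torch

  attn = torch.nn.MultiheadAttention(embed_dim=8, num_heads=1, batch_first=True)
  x = torch.randn(1, 5, 8)   # (batch, sequence length, model dimension)
  perm = torch.randperm(5)   # a random reordering of the tokens

  out, _ = attn(x, x, x)
  out_perm, _ = attn(x[:, perm], x[:, perm], x[:, perm])

  # Reordering the input tokens reorders the output rows in exactly the same way.
  print(torch.allclose(out[:, perm], out_perm, atol=1e-5))  # True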

In language problems it seems wrong for the order of tokens not to matter, so to address this the embeddings of the tokens are adjusted up front based on their position.

There are a few common ways you might see this done, but they broadly work by assigning fixed or learned embeddings to each position in the input token sequence. These embeddings can be added to our matrix above so that the first row gets the embedding for the first position added to it, the second row gets the embedding for the second position, and so on. Now if the tokens are reordered, the combined embedding matrix will not be the same. Alternatively, these embeddings can be concatenated horizontally to our matrix: this guarantees the positional information is kept entirely separate from the linguistic (at the cost of having a larger model dimension).
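
As a minimal sketch of those two options with learned positional embeddings (generic code rather than anything from a particular implementation; sizes are again illustrative):

  import torch

  d_model, vocab_size, max_len = 768, 50257, 1024
  tok_emb = torch.nn.Embedding(vocab_size, d_model)
  pos_emb = torch.nn.Embedding(max_len, d_model)  # one learned vector per position

  token_ids = torch.tensor([15496, 2159])
  positions = torch.arange(token_ids.shape[0])

  x_add = tok_emb(token_ids) + pos_emb(positions)  # added: shape (2, 768)
  x_cat = torch.cat([tok_emb(token_ids), pos_emb(positions)], dim=-1)  # concatenated: (2, 1536)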

I put together this repository at the end of last year to better help visualize the internals of a transformer block when applied to a toy problem: https://github.com/rstebbing/workshop/tree/main/experiments/.... It is not super long, and the point is to better distinguish between the quantities you referred to by seeing them (which is possible when the embeddings are low-dimensional).

I hope this helps!

whoateallthepy commented on Understanding and coding the self-attention mechanism of large language models   sebastianraschka.com/blog... · Posted by u/mariuz
hackyhacky · 3 years ago
This explanation walks you through the math and the corresponding code, but (at least in my case, maybe I'm dumb) it failed to help me understand why these steps are necessary or to relate the math to the intended outcome. As a result, I don't feel that I'm any closer to really understanding the heart of self-attention.
whoateallthepy · 3 years ago
At the end of last year I put together a repository to try and show what is achieved by self-attention on a toy example: detect whether a sequence of characters contains both "a" and "b".

The toy problem is useful because the model dimensionality is low enough to make visualization straightforward. The walkthrough also covers how things can go wrong, how the model can be improved, and so on.

The walkthrough and code is all available here: https://github.com/rstebbing/workshop/tree/main/experiments/....

It's not terse like nanoGPT or similar because the goal is a bit different. In particular, to gain more intuition about the intermediate attention computations, the intermediate tensors are named and persisted so they can be compared and visualized after the fact. Everything should be exactly reproducible locally too!
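
For reference, the core of single-head scaled dot-product self-attention is only a few lines. This is a generic sketch, not code from the repository, with made-up dimensions:

  import math
  import torch

  def self_attention(x, w_q, w_k, w_v):
      # x: (seq_len, d_model); w_q, w_k, w_v: (d_model, d_head) projections
      q, k, v = x @ w_q, x @ w_k, x @ w_v
      scores = q @ k.T / math.sqrt(k.shape[-1])  # pairwise query/key similarities
      weights = torch.softmax(scores, dim=-1)    # each row sums to 1
      return weights @ v                         # weighted mixture of the values

  d_model, d_head, seq_len = 8, 4, 5
  x = torch.randn(seq_len, d_model)
  w_q, w_k, w_v = (torch.randn(d_model, d_head) for _ in range(3))
  print(self_attention(x, w_q, w_k, w_v).shape)  # torch.Size([5, 4])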

whoateallthepy commented on The Transformer Family   lilianweng.github.io/post... · Posted by u/alexmolas
hn_throwaway_99 · 3 years ago
Somewhat off topic. As someone who did some neural network programming in MATLAB a couple decades ago, I always feel a bit dismayed that I'm able to understand so little about modern AI, given the explosion in advances in the field starting around the late 00s with things like convolutional neural networks, deep learning, transformers, and large language models.

Can anyone recommend some great courses or other online resources for getting up to speed on the state-of-the-art with respect to AI? Not really so much looking for an "ELI5" but more of a "you have a strong programming and very-old-school AI background, here are the steps/processes you need to know to understand modern tools".

Edit: thanks for all the great replies, super helpful!

whoateallthepy · 3 years ago
I put together a repository at the end of last year to walk through a basic use of a single-layer Transformer: detect whether "a" and "b" are in a sequence of characters. Everything is reproducible, so hopefully it's helpful for getting used to some of the tooling too!

https://github.com/rstebbing/workshop/tree/main/experiments/...

u/whoateallthepy

Karma: 34 · Cake day: September 1, 2022
About
http://whoateallthepy.com/