mfn (u/mfn) - Readit News

mfn commented on Transformer Architecture: The Positional Encoding kazemnejad.com/blog/trans... · Posted by u/sebg

mfn · 8 months ago

I believe the key realization here is that we're applying a rotation matrix as part of the encoding. Why does this work? I've found that it's helpful to consider just the two-dimensional case. Say we have two vectors. When designing a positional encoding scheme, what we're trying to do is somehow modify each vector so that it contains some information about its position.

The idea is that we can just rotate the vector by an angle proportional to its position. This has the property that the dot product of two vectors encoded this way depends only on the difference in positions, not on the absolute positions themselves: rotation preserves lengths, so the dot product can only change through the angle between the vectors, and that angle changes only with the difference between the two rotations. And we care about the dot product, since that's what the attention operation ultimately applies to the vectors.
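Here is a minimal sketch of the two-dimensional case in Python. The function names, the step angle `theta`, and the sample vectors are all made up for illustration; this is not the scheme from any particular paper, just the rotation idea on its own.

```python
import math

def rotate(v, angle):
    """Rotate a 2D vector counterclockwise by `angle` radians."""
    x, y = v
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y)

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

def encode(v, pos, theta=0.1):
    """Hypothetical encoding: rotate v by an angle proportional to its position."""
    return rotate(v, pos * theta)

q = (1.0, 2.0)
k = (0.5, -1.5)

# Same offset (3 positions apart) at very different absolute positions.
score_near = dot(encode(q, 5), encode(k, 2))
score_far = dot(encode(q, 105), encode(k, 102))

# A different offset (1 position apart) for comparison.
score_other = dot(encode(q, 5), encode(k, 4))

print(score_near, score_far, score_other)
```

The first two scores should agree up to floating-point error, while the third should differ, since only the offset between positions enters the dot product.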

I've written up a _somewhat_ first principles derivation of this here: https://mfaizan.github.io/2023/04/02/sines.html, if interested!

mfn commented on Why Are Sinusoidal Functions Used for Position Encoding? mfaizan.github.io/2023/04... · Posted by u/mfn

mfn · 2 years ago

Sinusoidal positional embeddings have always seemed a bit mysterious - even more so since papers don't tend to delve much into the intuition behind them. For example, from Vaswani et al., 2017:

> That is, each dimension of the positional encoding corresponds to a sinusoid. The wavelengths form a geometric progression from 2π to 10000 · 2π. We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset k, PE(pos+k) can be represented as a linear function of PE(pos).
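To make the last claim in that quote concrete, here is a rough Python sketch. It assumes the interleaved sin/cos layout from the paper, with a small `d_model` chosen only for illustration; `shift` is a hypothetical helper that applies a fixed, position-independent linear map (one 2x2 rotation per frequency) and should reproduce PE(pos+k) from PE(pos).

```python
import math

def positional_encoding(pos, d_model=8, base=10000.0):
    """Sinusoidal encoding as (sin, cos) pairs, one pair per frequency."""
    pe = []
    for i in range(d_model // 2):
        freq = 1.0 / base ** (2 * i / d_model)
        pe.extend([math.sin(pos * freq), math.cos(pos * freq)])
    return pe

def shift(pe, k, d_model=8, base=10000.0):
    """Map PE(pos) to PE(pos + k) using coefficients that depend only on k."""
    out = []
    for i in range(d_model // 2):
        freq = 1.0 / base ** (2 * i / d_model)
        s, c = pe[2 * i], pe[2 * i + 1]
        ck, sk = math.cos(k * freq), math.sin(k * freq)
        # Angle-addition identities:
        #   sin(a + b) = sin a cos b + cos a sin b
        #   cos(a + b) = cos a cos b - sin a sin b
        out.extend([s * ck + c * sk, c * ck - s * sk])
    return out

pos, k = 17, 5
predicted = shift(positional_encoding(pos), k)
actual = positional_encoding(pos + k)
max_error = max(abs(p - a) for p, a in zip(predicted, actual))
print(max_error)
```

Note that `shift` never looks at `pos`, only at the encoding and the offset, which is the sense in which PE(pos+k) is a linear function of PE(pos).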

Inspired largely by the RoFormer paper (https://arxiv.org/abs/2104.09864), I thought I'd write a post that dives a bit into how intuitive considerations around linearity and relative positions can lead to the idea of using sinusoidal functions to encode positions.

Would appreciate any thoughts or feedback!

mfn commented on Why is symmetry so important in particle physics? mfaizan.github.io/2022/10... · Posted by u/mfn

ranger207 · 3 years ago

I don't like to be critical, but I've been wanting to understand symmetry in physics for a while, so here's a few points of confusion I have

> The principle, then, is that the particles and fields that were used to build up the theory will move in a way that minimizes or maximizes the sum of L over the path taken by the system

The next couple of examples only minimize the Lagrangian; are there any systems in this article that maximize it?

> a collection of objects and a recipe to build a Lagrangian from those objects, with the movement of those objects determined by a path that minimizes the Lagrangian

What in this case are the objects? Just particles? Is mass in the first example (of classical motion) an object? I'm trying to figure out what kind of objects to use to build an equation, or basically, what type (in the programming sense) an object is

> This is a surprising fact

Why? Are there other theories, maybe from earlier in the development of physics, that used a different approach?

> We could use particles as the building blocks, and represent each particle by its position and velocity. However, fields turn out to be a much more useful way of representing the way particles behave. A field ϕ(x,t) is a function that takes a point in spacetime and spits something out for each point.

I assume that in this passage "building blocks" is equivalent to "objects" in the last passage? Why are fields more useful? Is there an example of what using particles as objects would look like? In particular, a field looks to me like a function; if you used particles as an object, would you represent a particle as a function using its position and velocity? Would that function have time as a parameter like fields do? Typing that out I can kind of see why you'd use fields

> For example, a field could take a position (in spacetime) and spit out a number (which could be real or complex)

What does the output represent? Anything in particular? If not, then it seems like you could define the field function to be anything since the output doesn't represent anything, then when you feed the field function into the Lagrangian eventually you'd get massively different results

> The simplest Lagrangian we can write for this field is: L = ∂_t ϕ* ∂_t ϕ

How is the Lagrangian constructed? This "simplest" Lagrangian is the derivative of the field with respect to time, along with the derivative of the field's complex conjugate with respect to time, but how'd you know to do that? What makes this the simplest possible Lagrangian? Calling this the "simplest Lagrangian" hints that there are other equally valid ways to create a Lagrangian; is that correct? What are the rules for that? Why would you make a more complex Lagrangian?

> One interesting observation about Lagrangians is that any term of the form V(ϕ) represents potential energy.

What is V(ϕ)? My initial assumption would be velocity, but how do you take velocity of a field? Actually, I can see what they're doing: velocity of a particle is the derivative of its position with respect to time, so I guess V(ϕ) aka velocity of the field is the derivative of the field with respect to time. That could've stood to have been spelled out

> Since there are no time or space derivatives involved

Ok I guess V(ϕ) doesn't represent the velocity of the field. I've got no clue what it is

> Now let’s take this a step further. In the previous example, we rotated the field at all points by some angle. But why do we need to rotate the field the same way everywhere? What if we measure things at one point with one coordinate system, but measure them at a different point using a different coordinate system that’s rotated. Although it’s hard to imagine why anyone would want to do this, one would still expect that this shouldn’t affect the actual physics predicted by the theory... Multiplying these together, we see that the Lagrangian is different. This is not what we wanted - rotating the complex plane in has affected the results of our theory.

I have no idea what's going on here. Why would you measure different points of the field with different coordinate systems and expect sensical results? I'm imagining a surveyor walking in a line starting from the origin: he takes a measurement at the origin, then at (1,0), then at (2,0), then at (3,0), etc. (Imagine that the underlying field is frozen in time so we aren't dealing with the Lagrangian yet.) Since we know the field equations we can predict what he'll measure at each of those points in the line.

But if the coordinate system changes with every step, he's still moving in a straight line as seen from a bird flying overhead, but at his first step he's at (1,0), then at the next step (2,0) turns into (5,1), then at the next step (6,1) (aka (3,0)) turns into (12,-3), etc, because the coordinate system changes each step. It's still (1,0),(2,0),(3,0) if you measure in the original coordinate system. But the underlying field wouldn't change in that case. Sure, if you put (5,1) into the field equation you'll get a different result than if you put in (2,0), but if you're only changing the coordinate system then that has to be compensated for in the field equation itself and you're not going to get different results for the same physical point. I mean, you should get the same result if you do f((2,0), coordinate system a) as if you did f((5,1), coordinate system b)

Edit: I think the core of my confusion is that in order for f((2,0), coordinate system a) to equal f((5,1), coordinate system b) then you need knowledge of how the coordinate system changes, and I don't see how that gets incorporated into the function

> Note that the issue here is that when we take the derivative of the field, we get an extra iϕ ∂θ/∂t term proportional to the derivative

Proportional to the derivative of what?

> This property - that a theory is not affected by changing some symmetry parameter throughout spacetime - is called gauge invariance

Why is it called that? I assume someone chose that name because it made sense to them for good reasons

> Also note that this new term, iAϕ, looks like a potential from the perspective of our field, with V(ϕ)=iAϕ

There's V(ϕ) again. I still don't know what it represents

> our theory now predicts some type of force involved in the interaction between our field and this new field

Wait, "our field" and "the new field"? What fields are those? We were talking about a field defined by the function ϕ(x,t) and thinking about its Lagrangian. We added a term A to the Lagrangian and that was it. What's the "new field"? Why does ϕ(x,t) have an interaction with it? Is A the new field?

> This mechanism of introducing an additional field to make an existing theory gauge invariant is exactly what gives rise to photons in the Standard Model! ‘Rotations’ of the electron field correspond to an additional field, called a gauge field, that behaves exactly the way photons do.

Ok, I guess A is a new field. I can see how it arises, but I'm not sure how you actually get to it

> This derivative doesn’t really make sense if we aren’t using the same coordinate system everywhere

you don't say

> The way we measure our field at x is different than the way we measure it at x+δx, so to get the actual difference, we need to make the field comparable by fixing it up before subtracting it

What is "the way we're measuring it"? I think it's, basically, the coordinate system of the surveyor changes each step, so that's a different "way" of measuring it? I still don't see why changing the coordinate system makes new stuff pop out of the equation

> As a first step, we can expand it: W(x, x+δx) = 1 − i δx A(x) + O(δx²)

How do you expand it? Are you giving the definition of W(x, x+δx)? How'd you get that? What's O(δx^2)?

> A group is a set of elements associated with some operation... What’s important here is that these two sets - the set of rotations, with the operation being composition, and the set of 1×1 complex matrices with the operation being multiplication - have the exact same behavior

Where'd the second set come from? Wait, I see, it's just saying that you can say that "multiplying a number by a 1x1 matrix" is the same thing as saying "you can compose a number with a rotation". It's literally the same thing, just said in a less clear manner. Does the new terminology get us anything useful?

> U(1) invariance of the electron field gives rise to the photon field.
> SU(2) invariance across lepton fields (such as the electron and electron neutrino) leads to W+, W−, and Z bosons. SU(2) has two generators, so there are three gauge bosons.
> SU(3) invariance across quark fields leads to eight gluons, since SU(3) has eight generators.

What? How do you know how many generators there are? Why does SU(2) have two generators but three gauge bosons?

> It’s remarkable how the observation that an equation doesn’t change under some operation, which seems quite trivial, can have deep consequences, dictating the nature of forces and interactions in the theory.

Yeah, I think the part I'm not getting is how changing coordinate systems affects the equation. I think I can see that if you insist on doing something ridiculous like this you'd need some math to correct for it and if the new correction functions are fields then it looks like new particles popping out, but I don't see how that doesn't result in an infinite number of new particles. Like, I can add a function f(x) = x^2 to the Lagrangian, then a g(x) = -x^2 to compensate for it, but those don't represent new particles do they? Why do those cancel out but A doesn't? I just don't see how changing coordinate systems results in different results

Despite my questions, I think I have a better idea of what's going on. You have a function; it should spit out the same numbers when you rotate it; you need a function to correct for the rotation; in physics the new function looks like a particle. I can kinda sorta see how it works now. Thanks for the article!

Edit: Ok, I think I've narrowed down my confusion to the θ(t). I can see that if you want to measure the same (x,y) over time as the coordinate systems change even though that (x,y) represents a different physical point every t, then you'd need to take the change in coordinate system over time into account in the derivative. But I'm not sure how that would be useful, nor how that would result in new physics over the case of a fixed θ

mfn · 3 years ago

(2/2) > I have no idea what's going on here. Why would you measure different points of the field with different coordinate systems and expect sensical results? I'm imagining a surveyor walking in a line starting from the origin: he takes a measurement at the origin, then at (1,0), then at (2,0), then at (3,0), etc. (Imagine that the underlying field is frozen in time so we aren't dealing with the Lagrangian yet.) Since we know the field equations we can predict what he'll measure at each of those points in the line.

> But if the coordinate system changes with every step, he's still moving in a straight line as seen from a bird flying overhead, but at his first step he's at (1,0), then at the next step (2,0) turns into (5,1), then at the next step (6,1) (aka (3,0)) turns into (12,-3), etc, because the coordinate system changes each step. It's still (1,0),(2,0),(3,0) if you measure in the original coordinate system. But the underlying field wouldn't change in that case. Sure, if you put (5,1) into the field equation you'll get a different result than if you put in (2,0), but if you're only changing the coordinate system then that has to be compensated for in the field equation itself and you're not going to get different results for the same physical point. I mean, you should get the same result if you do f((2,0), coordinate system a) as if you did f((5,1), coordinate system b)

So I guess a simpler way to see this is to note that changing coordinates should never affect the results of a physical theory. For example, say I measure things with the origin at x = 0, but you put the origin at x = 5, so all our measurements of position differ by 5 units. When we each apply the equation of motion, F = ma, to some object we're both looking at, our predictions of how the object will move will agree: my predicted positions will be the same as yours, just offset by 5 units. This is because F = ma does not care about translations of the coordinate system. The acceleration is the second derivative of position, so a constant offset drops out. If we had an equation like F = ma + x, it would no longer be coordinate invariant, and the equation would be unphysical: you could look at the same system in a different way (i.e. using different coordinates) and get completely different results.
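A quick numerical sketch of this, in Python. The setup is my own toy example, not something from the article: a mass on a spring tied to an anchor point, so the force depends only on the relative position `x - anchor`. The `extra_term` flag switches to the unphysical F = ma + x, and all parameter values are arbitrary.

```python
def simulate(x0, anchor, extra_term=False, steps=1000, dt=0.001, m=1.0, k=4.0):
    """Integrate a mass on a spring tied to `anchor`, starting at rest at x0.

    With extra_term=True the equation of motion is the unphysical
    F = m*a + x, which refers to the absolute coordinate x.
    """
    x, v = x0, 0.0
    for _ in range(steps):
        force = -k * (x - anchor)  # depends only on a relative position
        a = (force - x) / m if extra_term else force / m
        v += a * dt  # semi-implicit Euler step
        x += v * dt
    return x

offset = 5.0

# My coordinates put the anchor at 0; yours are shifted by 5 units.
mine = simulate(x0=1.0, anchor=0.0)
yours = simulate(x0=1.0 + offset, anchor=0.0 + offset)

# Same comparison with the coordinate-dependent equation.
bad_mine = simulate(x0=1.0, anchor=0.0, extra_term=True)
bad_yours = simulate(x0=1.0 + offset, anchor=0.0 + offset, extra_term=True)

print(yours - mine, bad_yours - bad_mine)
```

With F = ma the two final positions should differ by exactly the 5-unit offset. With the extra x term the difference is no longer 5, meaning the two observers disagree about where the object actually ends up.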

> Why is it called that? I assume someone chose that name because it made sense to them for good reasons

This I'm not sure about - I haven't actually seen the reason mentioned anywhere, other than just an indication that it's for some historical reason.

> Wait, "our field" and "the new field"? What fields are those? We were talking about a field defined by the function ϕ(x,t) and thinking about its Lagrangian. We added a term A to the Lagrangian and that was it. What's the "new field"? Why does ϕ(x,t) have an interaction with it? Is A the new field?

Yup! A is the 'new field' - the terminology could be clearer here. So by requiring that our toy Lagrangian with phi be gauge invariant, we now have a need to introduce another field, A, and the way this field appears in the Lagrangian (by multiplying by phi) is something that will end up acting like a force.

> What is "the way we're measuring it"? I think it's, basically, the coordinate system of the surveyor changes each step, so that's a different "way" of measuring it? I still don't see why changing the coordinate system makes new stuff pop out of the equation

Yes - by way of measuring it I mean that we are using a different coordinate system at each point, just to see what the effects of doing so are. And this is the magic bit - if we don't have this other field A, then the equation will produce different results depending on which coordinate system you use. So any theory without A will not consistently give the same results regardless of coordinate system.
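Here is a rough numerical check of that statement, as a Python sketch for a field that depends only on time. Everything concrete is an assumption chosen for illustration: the particular `phi`, `gauge_field`, and `theta` functions, the sign convention D = ∂_t + iA, and the rule that A shifts by −∂θ/∂t when the field is rotated by θ(t).

```python
import cmath
import math

H = 1e-5  # step size for central finite differences

def ddt(f, t):
    """Numerical time derivative of f at t."""
    return (f(t + H) - f(t - H)) / (2 * H)

def phi(t):
    """An arbitrary complex field (toy choice)."""
    return complex(math.cos(t), 0.5 * math.sin(2 * t))

def gauge_field(t):
    """An arbitrary background value for A (toy choice)."""
    return 0.3 * t

def theta(t):
    """A rotation angle that is different at every point."""
    return t ** 2

def phi_global(t):
    return cmath.exp(0.7j) * phi(t)  # same rotation everywhere

def phi_local(t):
    return cmath.exp(1j * theta(t)) * phi(t)  # rotation varies with t

def gauge_field_local(t):
    return gauge_field(t) - ddt(theta, t)  # assumed transformation of A

def lagrangian_plain(field, t):
    d = ddt(field, t)
    return (d.conjugate() * d).real

def lagrangian_gauged(field, a, t):
    d = ddt(field, t) + 1j * a(t) * field(t)  # covariant derivative
    return (d.conjugate() * d).real

t0 = 1.2
plain_before = lagrangian_plain(phi, t0)
plain_global = lagrangian_plain(phi_global, t0)
plain_local = lagrangian_plain(phi_local, t0)
gauged_before = lagrangian_gauged(phi, gauge_field, t0)
gauged_after = lagrangian_gauged(phi_local, gauge_field_local, t0)
print(plain_before, plain_global, plain_local, gauged_before, gauged_after)
```

Without A, the Lagrangian survives the global rotation but changes under the point-dependent one. With A included and transformed alongside the field, the value should come out the same before and after, up to finite-difference error.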

> How do you expand it? Are you giving the definition of W(x, x+δx)? How'd you get that? What's O(δx^2)?

Power series expansion. So we are defining W(x, y) as a function that allows us to compare the field at x and y. We don't know what this function is, we just assume it exists. We then expand it out in powers of delta x. Since we are eventually going to take the limit, any higher order term - that involves delta x squared or higher - will be too small to matter, so we just care about the constant term and the linear term.
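To see why the higher order terms can be dropped, here is a small Python sketch. It assumes one concrete form for W, a pure phase exp(−i δx A) with a constant A, which is only an illustrative choice since the argument above doesn't need to know W exactly. The names and the value of `A` are made up.

```python
import cmath

A = 0.8  # hypothetical constant value of the field near x

def w_exact(dx):
    """An assumed concrete comparator: a pure phase rotation."""
    return cmath.exp(-1j * A * dx)

def w_linear(dx):
    """Constant term plus linear term of the power series."""
    return 1 - 1j * A * dx

# Size of everything the linear approximation leaves out.
errors = {dx: abs(w_exact(dx) - w_linear(dx)) for dx in (0.1, 0.01, 0.001)}

# Shrinking dx by 10x should shrink the leftover by about 100x.
ratios = [errors[0.1] / errors[0.01], errors[0.01] / errors[0.001]]
print(errors, ratios)
```

The leftover scales like δx squared, so after dividing by δx (as in a derivative) it still goes to zero in the limit, while the linear term survives.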

> Where'd the second set come from? Wait, I see, it's just saying that you can say that "multiplying a number by a 1x1 matrix" is the same thing as saying "you can compose a number with a rotation". It's literally the same thing, just said in a less clear manner. Does the new terminology get us anything useful?

Once we make the connection between a symmetry of our Lagrangian and some abstract group like SU(3), we can immediately bring in group theoretic results about that group. For example, since we know (from group theory) that SU(3) has eight generators, we can now use that result and infer that we need eight gauge bosons to make the theory gauge invariant.

> What? How do you know how many generators there are? Why does SU(2) have two generators but three gauge bosons?

Typo - will fix! Should be three generators.
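For reference, the counting can be written down directly: SU(n) has n² − 1 generators, and U(1) has one. A tiny Python sketch (the function and dictionary names are just for illustration):

```python
def su_generators(n):
    """Number of generators (the dimension) of SU(n): n^2 - 1."""
    return n * n - 1

# One gauge boson per generator of the symmetry group.
gauge_bosons = {
    "U(1)": 1,  # U(1) has a single generator
    "SU(2)": su_generators(2),
    "SU(3)": su_generators(3),
}
print(gauge_bosons)
```

This gives three for SU(2) and eight for SU(3), matching the three W/Z bosons and eight gluons mentioned in the article.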

> Yeah, I think the part I'm not getting is how changing coordinate systems affects the equation. I think I can see that if you insist on doing something ridiculous like this you'd need some math to correct for it and if the new correction functions are fields then it looks like new particles popping out, but I don't see how that doesn't result in an infinite number of new particles. Like, I can add a function f(x) = x^2 to the Lagrangian, then a g(x) = -x^2 to compensate for it, but those don't represent new particles do they? Why do those cancel out but A doesn't? I just don't see how changing coordinate systems results in different results

So anytime something gets added to the Lagrangian, you effectively have a new theory - it's a proposal. The point here is that you don't really have infinite flexibility in adding things to the Lagrangian - if you add complex scalar fields, then you must also add gauge bosons. So symmetry doesn't constrain everything - you still need to figure out what the Lagrangian should be, but it'll force you to add other fields as well to make things gauge invariant.

> Despite my questions, I think I have a better idea of what's going on. You have a function; it should spit out the same numbers when you rotate it; you need a function to correct for the rotation; in physics the new function looks like a particle. I can kinda sorta see how it works now. Thanks for the article!

Appreciate the thorough review! Lots of things that I should have been more thorough about - I will fix :) Thanks again.