I feel like I'm missing a key insight here. I understand the problem that regular softmax attention struggles to assign anything close to zero attention to irrelevant stuff. And I get that having this subtraction formula makes it possible to assign exactly (or near) zero attention weight without having crazy outlier activations. But it seems like it also makes it very easy to have negative attention weight (which is equivalent to having positive attention weight on the negation of your value vectors). Intuitively, it just feels like a difficult balancing act to keep all the stuff you don't care about so close to zero.
But Figure 1 clearly shows that it works, so I don't doubt that it is in fact possible. I'm just struggling to build a picture of how exactly the network accomplishes this.
softmax should be exp(x_i) / (1 + ∑_j exp(x_j))
Notice the 1 added to the denominator.
The difference is at the negative limit: softmax can output exactly 0, instead of some epsilon. The same could be done by adding an extra zero-valued logit to x.
Downside is, you have to retrain your model from scratch to fix this.
that silly softmax1 blog post is not worth the read. no one uses it in practice
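For anyone skimming, here's a minimal NumPy sketch of the softmax-plus-one idea being debated above. This is my own illustration of the formula, not code from the blog post or the paper:

```python
import numpy as np

def softmax(x):
    # standard softmax: weights are strictly positive and sum to 1,
    # so no entry can ever be exactly zero
    e = np.exp(x - x.max())
    return e / e.sum()

def softmax_plus_one(x):
    # the "+1 in the denominator" variant: equivalent to an implicit extra
    # logit fixed at 0, so if every score is very negative, all weights
    # can shrink toward zero together
    e = np.exp(x - x.max())
    return e / (np.exp(-x.max()) + e.sum())

scores = np.array([-10.0, -12.0, -11.0])   # nothing here is relevant
print(softmax(scores))            # ~[0.67, 0.09, 0.25] -- still forced to sum to 1
print(softmax_plus_one(scores))   # ~[4e-5, 6e-6, 2e-5] -- the head can effectively opt out
```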
if you think about it, the "escape hatch" is the design of the entire transformer dictionary. if Key/Query attention misaligns with Value's weights, you get a layer head that does not attend to anything...
I've tried that in a small transformer that I trained from scratch and it didn't really make any difference. I also made a version where I made this trainable somehow, probably by replacing the 1 with a constant associated with the layer, and that didn't make any difference either.
I didn't follow Miller's proposal quite as he wrote it though and I put the mechanism in all the layers rather than avoiding it at the end.
My test doesn't absolutely rule out usefulness-- there's always different ways of applying something, but I saw no indication of it.
You referring to Miller's blogpost?[0] There's not an error in attention. Adding the +1 actually makes it not attention, because you no longer generate a probability distribution[1]. There's nothing really preventing attention from having a zero in any of the entries; the thing is that you probably won't get -inf (a very large negative number) inside the inner product, and you're going to have a difficult time updating those weights via gradient descent.
I've also tested it on many networks and different types of attention and I've yet to see a meaningful improvement (or even an improvement), even in generalization.
It really is the training method...
As to the paper, I'm also still at a big loss and honestly, if I were reviewing, I could not accept it. The results look good, but I can't tell why, and there's some "black magic" going on here.
- Figure 3 has "Transformer" and doesn't specify. Is this StableLM-3B-4E1T?
- What fucking dataset is this on? Stable has a WandB link[2] for that project and I don't see any experiment with similar (presumably entropy?) loss values (come on... this is fucking research... label your fucking graphs...)
- Where the fuck is the ablation? (Yes, I saw Fig 6 and Sec 3.8)
- How do I know that (assuming this is Stable) the difference isn't just hyperparameters? Or worse, GPUs! (yes, the number of GPUs can change results due to sharding and thus changing the statistics)
- How do I know it isn't down to 1k warmup steps instead of 5k?
- What about hidden size, layers, heads, or FFN size? Stable has 32/2560/32/? and this has 28/3072/12/8192 (these all will mess with sharding statistics too). Is the head dimension the same?
- How do I know it isn't down to the tokenizer?
- What is this magic? `0.8 - 0.6 * math.exp(-0.3 * depth)`
- Was this learned? Hand picked? This is a huge factor
- Any information about the learned parameters? Their final values? Trajectories?
- The code does not seem to be the same as what's in the algorithms...
Obviously they improved something, but there is nothing in the paper that convinces me that it is the differential attention. There are too many parameters at play, and how am I supposed to know that the difference comes from the thing they are proposing? And more importantly, how much of the improvement comes from that specific thing and not from other things?
[0] https://www.evanmiller.org/attention-is-off-by-one.html
[1] This is a bit convoluted but without this condition many "alternative forms" you see would be equivalent to other architectures like linear layers or gated units. Term is not well defined, but this really appears to be the only agreed upon aspect, even if only implicitly stated. This is a much longer conversation though.
[2] https://stability.wandb.io/stability-llm/stable-lm/reports/StableLM-3B-4E1T--VmlldzoyMjU4?accessToken=u3zujipenkx5g7rtcj9qojjgxpconyjktjkli2po09nffrffdhhchq045vp0wyfo
[2.1] The config: https://github.com/Stability-AI/StableLM/blob/main/configs/stablelm-3b-4e1t.yml
It does sound like we're hindering the model a bit by allowing negative weights to exist instead of sending them through, say, a ReLU. But, dealing with this might be an easier problem than you think for the model.
In the first diagram with the attention weights, there actually are some negative scores in the noise section. But, the attention to that section is very small anyway. All the second attention map needs to do is predict the noise in the first one -- a task that can be done very accurately, because it has full access to the input of the first.
To refer back to their real-world comparison, noise-canceling headphones have access to what your ear hears through a microphone, so they can output exactly the right cancellation signal. Similarly, the second attention map knows what's being input into the first one, so it can output a corresponding cancellation signal. It's not perfect -- just as noise-canceling headphones aren't perfect -- but it still gets you 99% of the way there, which is enough to boost performance.
>I'm just struggling to build a picture of how exactly the network accomplishes this.
I mean, intuitively it would be trivial for the model to just optimise lambda to zero during training. Then you have essentially built a vanilla transformer with an overcomplicated parameter pruning mechanism. Pruning is already pretty well established in the literature as something that works surprisingly well for reducing parameter counts up to (hold on to your papers)... about 40%. In practice the model probably doesn't work exactly like that, but I wouldn't be surprised if it just approximates the normal transformer in the end anyway.
Very clever. I like this kind of nitty-gritty detail work, and the change is small enough to be adopted easily by others. Bravo!
I'm a little concerned about the last sentence of the section introduction of "2 Differential Transformer". It mentions using improvements from previous papers, but in the grammatical context, it's unclear if these improvements are added to both the normal transformer baseline and their diff transformer. This would otherwise sully the comparisons. It's the "main difference" wording in the previous sentence that raised a flag for me.
Of course, a good-faith researcher would know this and may not feel the need to clarify. But you can never be too careful about some published research in this field.
Yes. This looks really, really good to me. Across-the-board improvements in training time, and perplexity improvements both per token trained and per model size. I'm reminded of MoE architectures; in that world we're choosing an optimal small model to process part or all of the inference job. I wonder if MoE got some of the same benefits from forcing the Transformer to distinguish between alternate possibilities.
In any event, I'd imagine that this will get widely adopted if the numbers hold up; like I said, this seems to be basically no downside, and should be easy to replicate.
https://github.com/microsoft/unilm/blob/master/Diff-Transfor...
There is a downside: every attention layer has to effectively compute attention twice (run scaled_dot_product_attention). As scaled_dot_product_attention is usually one of the most expensive operations in training and inference of a model, it seems like networks using this may be substantially slower and perhaps should be considered against larger networks with more attention layers.
The two other changes they mention have been widely adopted, and are included in at least some of the models they benchmark against. It seems they list them for completeness as changes to the original transformer architecture.
Like most things in this new world of Machine Learning, I'm really confused about why this works.
The analogy to noise-cancelling headphones is helpful, but in that case we clearly know which is signal and which is noise. Here, if we knew which was which, why would we even bother doing the noise-cancelling work?
With a single softmax you cannot predict exactly 0, but only very small numbers. When you have a large number of values to add up, this "poisons" the output with a lot of irrelevant stuff (the noise mentioned in the paper).
To make things worse, low attention values will have very low gradient, thus needing a lot of weight updates to undo those kinds of mistakes. On the other hand, subtracting the output of two softmaxes allows the model to predict a weight of exactly zero for some of the values, while keeping a reasonable gradient flowing through.
So the model already knows what is noise, but a single softmax makes it harder to exclude it.
Moreover, with a single softmax the output of all heads is forced to stay in the convex hull of the value vectors, whereas with this variant each head can choose its own lambda, thus shifting the "range" of the outputs outside the convex hull pre-determined by the values. This makes the model as a whole more expressive.
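A tiny numeric sketch of that point (my own toy numbers, not the paper's code): subtracting a scaled second softmax can push weights to zero or below, and the result is no longer a convex combination of the values.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

s1 = np.array([2.0, 0.5, 0.1, 0.0])   # scores from the first attention map
s2 = np.array([0.1, 0.5, 0.1, 0.0])   # scores from the second attention map
lam = 0.8                              # the learned lambda, picked arbitrarily here

w = softmax(s1) - lam * softmax(s2)    # differential attention weights
print(w)          # ~[ 0.48, -0.12, -0.08, -0.08]: negative weights are possible
print(w.sum())    # 0.2 = 1 - lam: no longer a convex combination

values = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
print(softmax(s1) @ values)   # plain softmax: stays inside the convex hull of the values
print(w @ values)             # differential weights: can land outside that hull
```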
While I don't discount the value of this, can you expand on the meaning of your claim that it makes the model 'more expressive'?
Everything I am seeing in this paper is related to reduced size and noise, which implies a reduction in expressiveness.
The improvements in needle-in-a-haystack, benchmarks on multi-hop questions over in-corpus data, and multi-shot in-context learning point to this.
This is a wonderful thing if robustness is more important than generality, but it doesn't address trimming away activations that may be spurious in the general use case but may improve specificity in an individual domain.
Context would dramatically impact which tradeoffs are more desirable, and noise is probably never desirable. But the ability of this paper to enable lower bit sizes for inference points to a reduction in expressiveness.
Perhaps I am too focused on generalization?
Also, where is each softmax happening here? For each attention head?
Could you help explain how we would achieve an attention score of exactly 0, in practice? Here’s my take:
If we’re subtracting one attention matrix from another, we’d end up with attention scores between -1 and 1, with a probability of effectively 0 for any single entry to exactly equal 0.
What’s more, the learnable parameter \lambda allows for negative values. This would allow the model to learn to actually add the attention scores, making a score of exactly 0 impossible.
It is a neat approach, but one that comes with a tradeoff, IIUC: doubling the key heads.
I wonder if a different approach without that issue exists. For instance, using max(0, exp(x)-1) instead of exp(x) in the softmax attention formula. That way when the query is orthogonal to the key (or worse), it does not contribute.
Wouldn't this be pretty unlikely, though?
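A rough sketch of how that could look, as I read the suggestion (this is just my interpretation of the max(0, exp(x)-1) idea, not anything from the paper):

```python
import numpy as np

def relu_exp_attention(scores, eps=1e-9):
    # replace exp(x) with max(0, exp(x) - 1): a key whose score is <= 0
    # (query orthogonal to the key, or worse) contributes exactly nothing
    k = np.maximum(0.0, np.exp(scores) - 1.0)
    return k / (k.sum() + eps)   # eps guards against the all-zero case

print(relu_exp_attention(np.array([1.2, 0.0, -3.0, 0.4])))
# -> ~[0.83, 0.0, 0.0, 0.17]: the zero and negative scores get exactly zero weight
```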
Noise cancelling headphones are probably the wrong analogy here.
The better example is the differential signalling used in professional audio and many digital signaling protocols like Ethernet, HDMI and USB.
Instead of using one wire referenced to ground, they send the signal as the difference between two wires. Both wires end up carrying the same signal with inverted polarity. Because both wires run next to each other, any external noise will be applied to both equally.
The voltage will change, but the difference in voltage between both wires is untouched. And when you subtract the two voltages at the receiver end, any noise simply gets subtracted out.
I think when they bring up differential amplifiers they're referring more to the DSP technique behind headphone noise cancelling, but the actual electrical properties of how a differential amplifier does that muddy the message a bit.
It sort of feels closer to heterodyning and "demodulating" the signal encoded in the softmax. Those tiny little errors we're trying to denoise with this technique are almost closer to carrier waves (when encoded to softmax) than noise imo. This wouldn't get rid of noise in the training data or noise in the dimensionality of the key / value space. It's really only removing noise introduced by the process itself.
Don't look for an analogy, this just adds a new mathematical capability. It enables "negative attention", the network can say "I want to subtract the contribution of this token" in the attention calculation. Previously it could only reduce how much it adds.
The simple way of doing this would be to just remove the softmax or use a sigmoid instead, but in practice a softmax works better it seems.
My hypothesis for why this works is that it mitigates the downsides of rope.
to eli5:
rope is the modern strategy used to give information to the model about how far a query and a key are apart when doing attention. It's the best strategy we have now, but has a major downside, where it makes some connections between tokens that are far apart much stronger than you would like them to be. Xpos (https://arxiv.org/pdf/2212.10554) is another paper by microsoft tackling issues with rope and you can see figure 1 on page 4 to get a visual interpretation of the sinusoidal attention strength (you would like it to be smooth).
I think a big reason differential transformers are working so well, especially on long-sequence stuff, is that when both q1 and q2 don't match a token, the rope relative strength will still have the same value and the noise will cancel out. That leaves the intended matches, but at the cost of somewhat dampening the original value rope brought.
Just a hypothesis though. It would be easy to test by running this experiment against a baseline where both use alibi attention (https://arxiv.org/pdf/2108.12409) which has a different set of tradeoffs this wouldn't mitigate, but still a really interesting result.
Some of the "prior art" here is ladder networks and to some handwavy extent residual nets, both of which can be interpreted as training the model on reducing the error to its previous predictions as opposed to predicting the final result directly. I think some intuition for why it works has to do with changing the gradient descent landscape to be a bit friendlier towards learning in small baby steps, as you are now explicitly designing the network around the idea that it will start off making lots of errors in its predictions and then get better over time.
It sounds like they're just splitting the query / key space down the middle. We don't know which dimensions are encoded in each matrix, but they're assuming the "noise" introduced in one query / key space is equivalent to noise introduced in the other space.
If that is the case, then the "signal" in this case would be the softmax that encodes the dimensions captured by the query / key space. Since the noise ideally is the same in both softmax encodings, subtracting them should "cancel out" the noise.
I think common mode filtering in balanced audio cables is a much better analogy than noise canceling headphones (and where this paper gets its name from, I assume). You don't know what the noise is ahead of time, but if you take two samples, one positive and one negative, noise displaces both absolutely, which you can take advantage of to denoise the signal (recover the differential mode).
For example, if you are trying to send a +1V signal on one wire and a -1V signal on the other, and a +0.5V noise exists, one wire will have +1.5V and the other will have -0.5V.
Take the difference and divide by 2:
(+1.5V - -0.5V) / 2 = +1V
or, if your setup is different
(-0.5V - +1.5V) / 2 = -1V
I don't understand either. It seems the general idea is that they calculate attention twice, which due to random initialization might be expected to give two slightly different results. I'd have thought that what these two attention maps would have in common would be the signal, and where they would differ would be noise, so rather than subtracting them (resulting in all noise?!) what you really want is to add (so the common signal gets reinforced) and normalize.
I think there might be some commonalities with systems engineering, where you subtract the output from the input in order to get a control signal that steers the plant to the target values. I too fail to see how that would be supposed to work in practice.
The values between the groups are also going to diverge during training due to the structure of the DiffAttn equation.
The analogy I can think of is when you're paying attention to a variety of things and you actively avoid concentrating on something because it will distract you. You don't give it zero attention, you give it negative attention.
> Differential attention takes the difference between two softmax attention functions to eliminate attention noise
If I understand correctly, this architecture trades twice as much attention memory in exchange for either a higher quality model, or less parameters at a similar quality.
> According to the fitted curves, 6.8B-size DIFF Transformer achieves a validation loss comparable to 11B-size Transformer, requiring only 62.2% of parameters
This raises a few questions for me:
- Would having only 60% of the parameters negate the double space for attention, leaving a similar memory profile as a traditional transformer?
- Does that tradeoff change noticeably between training and inference?
My understanding was that the extra parameters required for the second attention mechanism are included in those 6.8B parameters (i.e. those are the total parameters of the model, not some made-up metric of would-be parameter count in a standard transformer). This makes the result doubly impressive!
Here's the bit from the paper:
> We set the number of heads h = d_model/2d, where d is equal to the head dimension of Transformer. So we can align the parameter counts and computational complexity.
In other words, they make up for it by having only half as many attention heads per layer.
I think they mitigated the extra memory/compute from this by using half the number of overall heads and doubling V and O. Without actually checking the math I think it should be equivalent in flops, not counting the extra (cheap) multiply by const and subtract.
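A quick back-of-the-envelope check of that, using the head convention quoted a couple of comments up. The hidden size is the one mentioned upthread; the head dimension is my assumption:

```python
d_model = 3072   # hidden size of the 3B config discussed upthread
d_head = 128     # assumed per-head dimension of the baseline Transformer

# standard multi-head attention: h heads, each with d_head-dim Q, K, V
# and a d_head -> d_model slice of the output projection
h_std = d_model // d_head
params_std = h_std * 4 * d_model * d_head

# differential attention as quoted: half as many heads, but each head has
# two d_head-dim queries, two d_head-dim keys, a 2*d_head-dim value,
# and a 2*d_head -> d_model output slice
h_diff = d_model // (2 * d_head)
params_diff = h_diff * (2 * 2 * d_model * d_head    # queries + keys (doubled)
                        + 2 * d_model * d_head      # value (doubled width)
                        + 2 * d_head * d_model)     # output slice

print(h_std, h_diff)              # 24 vs 12 heads
print(params_std == params_diff)  # True: both come out to 4 * d_model**2
```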
I think it would negate the RAM savings, but it would also reduce the amount of storage needed at rest and possibly reduce initial start up times depending on storage speed and model size. So, possibly good for low-end models on consumer devices?
I wonder about the story behind that formula...
Hmm, 0.8 works well, but let's try setting lower layers to a lower initial value. Let's say 0.2. Ok, I need a formula that will go between 0.2 and 0.8, slowly approaching 0.8. Starts fiddling with numbers for 20 min... I guess this can work.
Looks like this makes (at least initially in training) the “negative” attention term smaller in the early layers (smaller l) compared to later layers (larger l). Which I guess makes sense: you probably want to attend a little bit to everything before concluding that it’s really a few spots you should look at.
(Although it seems the authors do not discuss this choice anywhere in the paper?)
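For what it's worth, just evaluating the snippet quoted upthread over the layers gives the following (the 0-based layer indexing and the 28-layer count are my assumptions, matching the 0.2 starting point mentioned above):

```python
import math

for depth in range(28):   # assumed 0-indexed layers, 28 layers as discussed upthread
    lam_init = 0.8 - 0.6 * math.exp(-0.3 * depth)
    print(f"layer {depth:2d}: lambda_init = {lam_init:.3f}")
# starts at 0.200 for the first layer and climbs toward 0.8 in the deepest layers,
# i.e. the early layers subtract less of the second attention map at initialization
```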
The key bit I didn't understand at first was what happens if the two groups of attention learn the same thing: because their attention masks are subtracted from one another, if they both output similar values the attention across the board will drop to zero, and this will lead to high loss. So the only way to reduce loss is if they learn to attend to different things. One of the simplest strategies they could learn (and this paper claims that they do) is for one group to focus on relevant context and the other to focus on irrelevant context. Thus one group learns the noise and the other the signal (it's not this cut and dried, but it is a useful simplification for understanding IMO).
An interesting aspect is that they don't do a plain subtraction, but rather subtract a portion of the second softmax.
This makes sense, if one considers that the two copies are identical then the softmax outputs would be identical and the difference is zero everywhere. However, by subtracting a scaled copy, the normalization of the difference seems to really boost the signal value(s) over the "noise", making the signal stand out compared to pre-normalization.
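A small numeric way to see that (my own sketch with made-up numbers): if the two maps were literally identical, the subtraction would just uniformly downscale the weights, so any real gain has to come from the places where the two maps disagree.

```python
import numpy as np

p = np.array([0.70, 0.15, 0.10, 0.05])   # softmax output of the first map
lam = 0.8

# identical second map: subtraction only rescales, relative weights unchanged
print(p - lam * p)        # [0.14 0.03 0.02 0.01] == (1 - lam) * p

# second map that concentrates on the "noise" tokens: those get pushed to or
# below zero while the signal token survives almost intact
noise = np.array([0.05, 0.40, 0.35, 0.20])
print(p - lam * noise)    # [ 0.66 -0.17 -0.18 -0.11]
```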
> what happens if the two groups of attention learn the same thing
I wonder if there's a metaphor here for our own experience and utility in "surprise".
Like if one attention head is surprised by what another learns, up-weight it. But if they both find the same, assume it's not very surprising and down-weight it.
Admittedly, "surprise" is something that has a big section of my knowledgebase[1][2][3] (both as a subjective feeling and adaptive function of our minds, one of the most complex adaptive system we know of)
I wonder what is lost here. Surely there's a trade-off...
I'm wondering if there's any effect of "creativity", or ability to interpolate between concepts. Hallucination and creativity feel very related to me. I understand hallucinating as simply being misaligned with the space humans feel appropriate to interpolate between
> Hallucination and creativity feel very related to me.
Why? I see them as just sampling errors.
Sure a mistake can spark inspiration sometimes, but creativity is much more than mistakes.
> I understand hallucinating as simply being misaligned with the space humans feel appropriate to interpolate between
These language models are next-token predictors. The way the next token is predicted is by sampling a probability space outputted by the model.
That sampling process can be non deterministic.
Hallucinations are when that sampling results in tokens that come together to create a false or otherwise unintended statement.
You can just as well think of everything a model outputs as a hallucination, but we train the model to shape the output space so that what we want it to hallucinate is more likely. Otherwise it just outputs meaningless noise.
“Hallucinate” is really an awful word for what it’s trying to describe.
> You can just as well think of everything a model outputs as a hallucination
Exactly. Don't forget that an important factor in the success of GPT3 was RLHF, which is essentially training the model to produce "hallucinations" that are more acceptable on average to human trainers.
Often see this argument but it doesn't hold water for me. What we call hallucination is usually when the model says something confidently wrong. Yes the sampling procedure is nondeterministic but this is unrelated to hallucinations. The model can generate a distribution to sample with very little weight on the "wrong" output and then this should be ignored by procedures like top-k sampling. The fact that this doesn't easily solve the problem shows that hallucination is a deeper problem in the model itself and not just a byproduct of sampling.
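To make the top-k point concrete, here's roughly what such a filter does (a generic sketch, not tied to any particular library): if the model really did put negligible weight on the wrong tokens, a filter like this would remove them before sampling ever happens.

```python
import numpy as np

def sample_top_k(probs, k, rng=np.random.default_rng(0)):
    # keep only the k most likely tokens, renormalize, then sample
    top = np.argsort(probs)[-k:]
    filtered = np.zeros_like(probs)
    filtered[top] = probs[top]
    filtered /= filtered.sum()
    return rng.choice(len(probs), p=filtered)

vocab_probs = np.array([0.55, 0.30, 0.12, 0.02, 0.01])   # toy next-token distribution
print(sample_top_k(vocab_probs, k=2))   # tokens 3 and 4 can never be picked
```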
Hallucinate is an awful word because of what it is trying to describe.
Hallucination describes the same feature you just called "non deterministic sampling", but exclusively the cases that we don't like. It would be really convenient if we could actually draw that line, but we can't. If non-determinism is a core feature, then that feature will be present in every case; including the ones we find desirable, and the ones we find undesirable.
> Sure a mistake can spark inspiration sometimes, but creativity is much more than mistakes.
It looks like creativity has many steps, but being able to come up with novel, unprompted stuff is important, as long as you are able to discard the bullshit early on.
"Hallucination" is only a problem if later layers (or additional networks) can't detect and remove it
LLMs are too unpredictable for many practical uses, so I'd guess better predictability is better. Hopefully the change the paper proposes will help!
But here's a case for the other side: sure, most mistakes are just errors, but evolution happens via "mistakes." Also, LLMs often deliberately add randomness at inference time.
For one, speed and memory. They have twice as many Q and K weights in the attention blocks, leading to a ~10% reduction in throughput on their H100 (table 7 in appendix A).
Crazy gains though, congrats to the researchers.
I wonder how much of the value here is from canceling out the positional noise rope produces. I would love to see a table comparing an alibi version of this to an alibi baseline in addition to the rope models here.