This is actually a really cool article, though it could certainly be better written.
The new Tesla 8- and 16-bit floating-point formats in the Dojo system support this kind of rounding up/down using a PRNG when compressing neural-network parameters from higher-precision floats (it is specified in the floating-point paper).
The random rounding is needed to avoid bias in the neural-network weight updates, and this article improves the rounding method to be more accurate (closer to true rounding than uniform sampling) without reintroducing bias.
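As a toy illustration of that kind of PRNG-based rounding (a minimal sketch, not Tesla's actual format): stochastic rounding to a coarser grid rounds up with probability equal to the fractional part, so it is unbiased on average.

```python
import math
import random

def stochastic_round(x: float, rng: random.Random) -> int:
    # Round down with probability 1 - frac, up with probability frac,
    # so that E[stochastic_round(x)] = x (no systematic bias).
    lo = math.floor(x)
    frac = x - lo
    return lo + (1 if rng.random() < frac else 0)

rng = random.Random(0)
x = 2.3
n = 100_000
mean = sum(stochastic_round(x, rng) for _ in range(n)) / n  # ≈ 2.3 on average
```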
> Randomized rounding, where the sender sends 1 with probability x and 0 otherwise, and the receiver uses the received bit X as the estimate x', has the property that E[x'] = x. Unbiased estimators are arguably more natural for many estimation problems. Here the measure of performance would be the maximum variance for the estimate over all inputs x, so for randomized rounding the cost is 1/4 (when x = 1/2).
The "cost"/"variance" of 1/4 in the quoted paragraph is what the article later calls the "expected squared error". The main result is a scheme with a cost of just 0.04599, which beats the "send a 1 with probability x" scheme's cost of 1/4 = 0.25.
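A quick simulation of the quoted randomized-rounding scheme (a sketch; the variable names are mine) confirms it is unbiased with variance x(1−x), which peaks at 1/4 when x = 1/2:

```python
import random

def randomized_round(x: float, rng: random.Random) -> int:
    # Sender transmits 1 with probability x, else 0; the bit itself is the estimate.
    return 1 if rng.random() < x else 0

rng = random.Random(0)
x = 0.3
n = 200_000
bits = [randomized_round(x, rng) for _ in range(n)]
mean = sum(bits) / n                       # ≈ x (unbiased)
var = sum((b - x) ** 2 for b in bits) / n  # ≈ x*(1-x) = 0.21; worst case 0.25 at x = 0.5
```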
What are some real-world situations where you can only send a single bit but have shared randomness? That seems like a bizarre constraint. Is this just (and no shame in this) math for math's sake?
A real-world situation that is very similar to this: you send a friend a few bits (a seed), they plug it into Minecraft, and both of you can play in the exact same, yet practically infinite, world.
That is possible due to the shared randomness built into the world generation on both computers.
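In code, the Minecraft analogy is just two PRNGs initialized with the same seed (a sketch using Python's `random`, not Minecraft's actual generator):

```python
import random

seed = 1234  # the few bits you actually send your friend
sender = random.Random(seed)
receiver = random.Random(seed)

# With no further communication, both sides draw identical streams,
# which is exactly the "shared randomness" the article assumes.
sender_draws = [sender.random() for _ in range(5)]
receiver_draws = [receiver.random() for _ in range(5)]
```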
This sounds halfway related to a problem my statistics professor worked on: sending data over a shared interface without detection. Note that these scenarios assume discrete time.
"Shared randomness" might be the current noise level on some radio frequency in enemy territory. You and your receiving friend can both hear everything, but as soon as one of you starts transmitting you must have a very careful strategy to avoid detection.
I could imagine a scenario where you have many clients sending you information repeatedly, and you’d like to restrict the size of each packet. I think what this post is saying is that during the initial client setup you could add an arbitrary amount of shared randomness to increase accuracy while keeping the packet size down.
To more directly answer your question, the post mentions federated learning of machine models as an application.
Concretely, imagine a sensor node frequently sampling a lot of low-quality sensors. You can afford some setup (sharing an AES key, which gives you as much shared (pseudo-)randomness as you want), but you want your per-measurement overhead to be small.
… it's not the most practical, and the single-bit version seems to be a toy problem, but I can imagine it being useful in some very particular cases?
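A sketch of that setup phase (using SHA-256 in counter mode as a stand-in for AES; the key and helper here are hypothetical): after sharing one key, both sides can derive the i-th shared uniform value locally, with zero per-measurement communication.

```python
import hashlib
import struct

def shared_uniform(key: bytes, i: int) -> float:
    # Derive the i-th shared pseudo-random value in [0, 1) from the
    # preshared key; sender and receiver compute this independently.
    digest = hashlib.sha256(key + struct.pack(">Q", i)).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

key = b"preshared-secret-key"  # hypothetical key exchanged during setup
u_sender = shared_uniform(key, 0)
u_receiver = shared_uniform(key, 0)  # identical value, derived locally
```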
The article clearly mentions distributed learning, and it's really not that hard to generalise to literally any partially parallelizable computing problem with bandwidth limitations between the compute nodes.
I hate these articles that jump straight into some complex solution without stating the problem clearly... If the problem really is to send a real number using a single bit (and some shared randomness), then clearly that's just impossible. Next!
The problem is a variation of that, which makes it possible, as explained in the post. The large caveat is that the accuracy is probabilistic, and it's expected that the receiver gets many 1-bit messages from various sensors that are all measuring the same phenomenon.
It's a bit like the concept of dithering in images, where visually we can represent shades of gray just by using many 1-bit pixels.
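The dithering analogy can be made concrete with a tiny 1-D error-diffusion sketch: every output pixel is a single bit, yet local averages track the input gray level.

```python
def error_diffusion(grays):
    # Quantize each pixel to 0 or 1, pushing the rounding error onto the
    # next pixel so that running averages stay close to the input.
    out, carry = [], 0.0
    for g in grays:
        v = g + carry
        bit = 1 if v >= 0.5 else 0
        carry = v - bit
        out.append(bit)
    return out

bits = error_diffusion([0.3] * 1000)  # a flat 30% gray "image"
avg = sum(bits) / len(bits)           # ≈ 0.3, recovered from 1-bit pixels
```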
But no! The very first sentence clearly states: "In this post, we'll look at the natural problem of how to communicate an estimate of a real value in [0,1], using just 1 bit." The second paragraph sketches the motivation for the problem.
How much more exposition do you want from a casual blog post? There's a link to the academic paper if you want full details (you can follow up on the numerous references to other papers, which explain applications.)
The blog reproduces rather faithfully what the academic paper says.
The academic paper is confusing, because it defines the problem as if it were a singular event: the sender chooses a real number, then sends 1 bit to the receiver, and the receiver miraculously guesses the real number.
You have to waste time carefully studying the paper to see that the statistical methods used are meaningless for a singular event. In fact, a large number of single-bit transmissions is assumed, and the accuracy (i.e. the variance) of the estimate of the real number is computed over that very large but unspecified number of transmitted bits.
In any real application, the main quality of an estimation algorithm would be determined by how many single bits must be transmitted and received to guarantee a certain accuracy (variance) of the estimate, i.e. a certain number of recovered bits of the real number. I lost patience with the paper before discovering whether the authors even mention this anywhere.
What should be clear to everyone is that recovering N bits of a real number, i.e. reducing the variance of the estimate to that level, always requires transmitting many more single bits than N.
Nevertheless, even if more bits are sent, there are many cases where this is useful, e.g. because losing some single bits, or receiving them with errors, has little effect; there may also be other reasons.
Some of the comments on this article are so embarrassing. It's just a short, clearly written article on a pretty simple and fundamental type of problem in information theory. Similar ideas are the basis for the systems we use in data compression, transmission, and error correction, among other things.
Maybe it's better to think for a few seconds before dismissing the problem as pointless and claiming to hate it.
It describes the problem, but not fully. Either do it properly or link to a complete description.
In this case, unlike what the article states as the problem, the point seems not to be how to send a real number using a single bit, but how to aggregate such bits from many agents that measure some shared thing.
Suppose you have two clocks that are synchronized to within ±dt1, sending a message takes some known amount of time ±dt2, and you want to make sure the sender transmits within +dt3 of whatever event triggered the send.
It seems to me that you'd be able to transmit at least dt3/(2*(dt1+dt2)) distinct values with the pretty naive approach of discretizing time into segments longer than the combined message-and-clock uncertainty. I'd be willing to bet that a sophisticated approach could do much better.
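With some made-up numbers (dt1, dt2, dt3 here are placeholders, not from the article), the slot-counting argument looks like this:

```python
import math

dt1 = 0.001  # clock synchronization error (s), hypothetical
dt2 = 0.002  # message transit-time uncertainty (s), hypothetical
dt3 = 0.5    # allowed window after the triggering event (s), hypothetical

# Make each time slot wider than the combined uncertainty so slots
# can't be confused; each distinguishable send time encodes one value.
slot = 2 * (dt1 + dt2)
distinct_values = int(dt3 / slot)
extra_bits = math.log2(distinct_values)  # information carried by timing alone
```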
According to the paper, this is sending an estimate of a real number with a finite number of bits (possibly one). The method aims to reduce worst-case(?) error by relying on preshared state (in this case, the seed to a PRNG).
What is "shared randomness" in the context of this article? When I google it I find papers on quantum computing - is that a prerequisite for this implementation?
Shared randomness is when the sender and receiver each have a PRNG or list of random numbers that are synchronized. Both sides know what the “current” and “next” random numbers are.
Thank you! I was confused by the h ∼ U[−1/2, 1/2], thinking that it meant either that they both have a set of random numbers (in which case, how does the receiver know which one the sender used?) or that they both had the same single random number in that range (in which case it looks basically the same as the randomized rounding case).
Your explanation that they are both going through the same list at the same rate clarifies that. They have n random numbers and are both using the same one for each transmission/receipt event.
I guess that in practice, synchronising that would be a pain. Either they sync on packets sent/received, which means that one dropped packet messes everything up, or they sync on time, in which case there's clock skew to worry about. Or they could have sync information in the data sent ("I'm using the 1234th item in the shared randomness"), which makes a bit of a nonsense of the whole "one bit" thing.
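Here's a sketch of the classical "subtractive dithering" baseline that the h ∼ U[−1/2, 1/2] notation suggests (this is the simple textbook variant, not the post's optimal 0.04599 scheme): both sides draw the same offset h, the sender transmits one bit, and the receiver subtracts h back out. The estimate stays unbiased and the expected squared error drops from 1/4 to 1/12 ≈ 0.083.

```python
import random

def send_bit(x: float, h: float) -> int:
    # Sender: 1-bit quantizer applied to the dithered value.
    return 1 if x + h >= 0.5 else 0

def estimate(bit: int, h: float) -> float:
    # Receiver: subtract the same shared offset to undo the dither.
    return bit - h

rng = random.Random(7)
x = 0.3
n = 200_000
errs = []
for _ in range(n):
    h = rng.uniform(-0.5, 0.5)  # shared randomness: both sides know h
    errs.append(estimate(send_bit(x, h), h) - x)

bias = sum(errs) / n                # ≈ 0: the estimator is unbiased
mse = sum(e * e for e in errs) / n  # ≈ 1/12 ≈ 0.083, beating 1/4
```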
Using a single bit to communicate weight updates during the learning process reduces bandwidth required, and allows highly parallel training.
I suspect in the future we'll even see methods of sub-1-bit weight updates to further decrease bandwidth requirements to keep massive models approximately in sync between distant learning nodes.
What would a sub 1-bit weight update look like? I’m having trouble wrapping my head around the concept of (what I assume would be) a fraction of a bit?
First imagine a less-than-a-bit fact. Suppose you have a set of 1024 computers and you want to pick out a specific computer based on many facts about it. 1 in 4 computers runs Linux and the rest run Windows. Learning that a computer runs Windows gives you about half a bit (−log2(3/4) ≈ 0.415 bits) of the information needed to identify the computer.
A less-than-a-bit update can be combined with another such update to encode a single bit. For instance, my protocol could use a bit to indicate "the computer runs Linux, or it is one of the 256 Windows computers in this list". Given that bit, you know the computer has a 50% chance of running Windows.
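The "half a bit" intuition can be checked directly: the self-information of an observation with probability p is −log2(p) bits (the numbers follow the 1024-computer example above).

```python
import math

# 1 in 4 of the computers runs Linux, 3 in 4 run Windows.
bits_windows = -math.log2(3 / 4)  # ≈ 0.415 bits: common, so less informative
bits_linux = -math.log2(1 / 4)    # = 2 bits: rare, so more informative

# Average information per "which OS?" answer (the entropy):
entropy = 0.25 * bits_linux + 0.75 * bits_windows  # ≈ 0.811 bits
```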
Imagine we have 2 weights we want to update. For each weight, choose a pseudo-random number known by both sender and receiver. If a "1" bit is received, update both weights with the random number. If a zero is received, don't.
After enough iterations, each weight can converge on any value.
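A toy sketch of that apply-or-don't scheme (everything here — step sizes, the distance criterion, the target — is made up for illustration): both sides derive the same random step vector each round, and the single transmitted bit says whether to apply it.

```python
import random

def shared_step(seed: int, n: int):
    # Both sides derive the identical random step vector from the shared seed.
    rng = random.Random(seed)
    return [rng.uniform(-0.01, 0.01) for _ in range(n)]

def apply_bit(weights, bit, seed):
    step = shared_step(seed, len(weights))
    return [w + s for w, s in zip(weights, step)] if bit else list(weights)

def choose_bit(weights, target, seed):
    # Sender: transmit 1 only if applying the shared step helps.
    def sqdist(ws):
        return sum((w - t) ** 2 for w, t in zip(ws, target))
    return 1 if sqdist(apply_bit(weights, 1, seed)) < sqdist(weights) else 0

meta = random.Random(0)
weights = [0.5, -0.2]
target = [0.8, 0.1]  # stand-in for the weights the sender wants to convey
for _ in range(5000):
    seed = meta.randrange(2**32)  # in practice derived from shared state
    bit = choose_bit(weights, target, seed)
    weights = apply_bit(weights, bit, seed)
# After enough 1-bit rounds, weights has drifted close to target.
```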
I'm not sure whether so many people would be intrigued to read "How to send a real number using arbitrarily many bits."
Not impossible. You can perfectly well send, say, pi, by implementing a protocol which sends 1 if the number is pi and 0 if not.
And it's about as useful as the mentioned algorithms.
Further, you could send the bit as a pulse, where the pulse-width is proportional to the number.