One of the easier solutions that does the rounds periodically (in many forms above and beyond logic gates, such as symbolic regression) is just densely connecting everything and ensuring that an identity function exists. Anneal a penalty around non-identity nodes, use L1 for the penalty, and you can learn a sparse representation.
There are a number of details to work through, such as making an "identity" for 2 inputs and 1 output (just don't offer those; use gates like a half adder instead of AND or XOR, adding a post-processing step removing extra wires you don't care about), or defining "densely connected" in a way that doesn't explode combinatorially (many solutions, details only matter a little), but it's the brute-force solution, and you only pay the cost during training.
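To make that concrete, here's a rough PyTorch-style sketch of the kind of thing I mean (the gate set, the soft relaxations, and the parameterization are all placeholders, not a recipe):

    import torch
    import torch.nn.functional as F

    def soft_gates(a, b):
        # Probabilistic relaxations of a few 2-input gates; index 0 is the
        # identity (pass a through), so the penalty has somewhere to push nodes.
        return torch.stack([a,                   # identity
                            a * b,               # AND
                            a + b - a * b,       # OR
                            a + b - 2 * a * b],  # XOR
                           dim=-1)

    class DenseGateLayer(torch.nn.Module):
        def __init__(self, n_in, n_out):
            super().__init__()
            self.wa = torch.nn.Parameter(torch.randn(n_out, n_in))  # wire logits for input a
            self.wb = torch.nn.Parameter(torch.randn(n_out, n_in))  # wire logits for input b
            self.g = torch.nn.Parameter(torch.zeros(n_out, 4))      # gate-choice logits

        def forward(self, x):                      # x: (batch, n_in), values in [0, 1]
            a = x @ F.softmax(self.wa, dim=-1).T   # "densely connected": every node sees every input
            b = x @ F.softmax(self.wb, dim=-1).T
            return (soft_gates(a, b) * F.softmax(self.g, dim=-1)).sum(dim=-1)

        def non_identity_mass(self):
            # L1 mass on everything that is not the identity gate; anneal its
            # weight upward during training to sparsify the circuit.
            return F.softmax(self.g, dim=-1)[:, 1:].sum()

Training would then add something like lam * layer.non_identity_mass() to the loss and anneal lam upward, so nodes that aren't pulling their weight collapse to plain wires you can strip out afterwards.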
There are lots of other fun ways to handle that problem though. One of my favorites is to represent your circuit as "fields" rather than discrete nodes. Choose your favorite representation for R2->Rn (could be a stack of grids, could be a neural net, who cares), and you conceptually represent the problem as a plane of wire density, a plane of "AND" density, a plane of "XOR" density, etc. Hook up the boundary conditions (inputs and outputs on the left and right side of the planes) and run your favorite differentiable PDE solver, annealing the discreteness of the wires and gates during training.
Ha! I have spent the last 2 years on this idea as a pet research project and have recently found a way of learning the wiring in a scalable fashion (arbitrary number of input bits, arbitrary number of output bits). Would love to chat with someone also obsessed with this idea.
I'm also very interested. I played around a lot with differentiable logic networks a couple of months ago, and with how to make the learned wiring scale to a bigger number of gates. I had a couple of ideas that seemed to work at a smaller scale, but they had trouble converging with deeper networks.
I think the techniques in “Weight Agnostic Neural Networks” should be applicable here, too. It uses a variant of NEAT, I believe. This would allow for learning the topology and wiring rather than just the gates. But in practice it is probably pretty slow, and may not be all that different from a pruned and optimized DLGN.
Using logic operators? Picking something from a range of options with SoftMax? Having a distribution to pick from?
I remember reading about adaptive boolean logic networks in the 90's. I remember a paper about them using the phrase "Just say no to backpropagation". It probably goes back considerably earlier.
Fuzzy logic was all the rage in the 90's too. Almost to the level of marketers sticking the label on everything, the way AI is done today. Most of that was just 'may contain traces of stochasticity', but the academic field used actual, defined logical operators for values interpolated from zero to one.
Careless application of Gumbel softmax would let your model select the same input for both inputs of a gate.
Also, note that selecting a wire has all the same problems as selecting an operation, such as vanishing gradients, but no apparent mitigation like residual initialization. And for wire selection the gradient vanishes more quickly, because there are more wires (tens to hundreds) than operations (16).
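For reference, the careless version looks something like this (using PyTorch's built-in gumbel_softmax; the sizes are made up):

    import torch
    import torch.nn.functional as F

    n_wires = 64                                          # outputs of the previous layer
    x = torch.rand(n_wires)
    logits_a = torch.randn(n_wires, requires_grad=True)   # wire logits for gate input a
    logits_b = torch.randn(n_wires, requires_grad=True)   # wire logits for gate input b

    sel_a = F.gumbel_softmax(logits_a, tau=1.0, hard=True)  # one-hot sample, straight-through gradient
    sel_b = F.gumbel_softmax(logits_b, tau=1.0, hard=True)
    a, b = sel_a @ x, sel_b @ x   # nothing here stops sel_a and sel_b picking the same wire,
                                  # and each gradient is smeared over n_wires logits instead of 16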
If you replace the uint64_t cell with an __attribute__((vector_size(32))) vector and build with -march=native, the bitwise ops will work exactly as before but you'll light up the vector units on the x64 machine.
The interesting thing here is that it's not a straightforward port. JAX is already very fast, for the architecture it implements. The point is that the network is heavily contracted by removing nodes that only do pass-through, and then hugely parallelizing the computations using bitwise operations on 64 bits at once. Hence this incredible speedup.
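Schematically (made-up gate, arbitrary values), one word's worth of bitwise ops evaluates 64 copies of a gate at once:

    a = 0x0123456789ABCDEF   # bit i of each word belongs to circuit evaluation i
    b = 0xFEDCBA9876543210
    c = 0xFFFF0000FFFF0000
    out = (a & b) ^ c        # two instructions compute "(a AND b) XOR c" for all 64 lanes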
How much of a speedup is the C compiler's optimizer able to achieve in compiling it down to a hand-written-C equivalent, vs the -O0 non-optimized assembly? What does the optimized C/assembly do which isn't actually necessary and accounts for the remaining inefficiency?
There are 163 lines of C. Of them, with -O3, 104 lines are present in the assembly output. So the C compiler is able to eliminate an additional ~36.2% of the instructions. It doesn't do anything fancy, like autovectorization.
The 3.45s surprises me, because it's faster than the 4.09s I measured earlier. Maybe I had a P core vs an E core. For -O0, the compiler emits comically bad machine code.
If I try with e.g. -Og, I get the same disassembly as -O3; even -O1 gives me the same disassembly as -O3. The assembly (-Og, -O1, -O3) looks like a pretty direct translation of the C: better, but also nothing crazy (e.g. no autovectorization).
Looking more closely, there's actually surprisingly little register spilling.
I think the real question you're asking is, as I wrote:
> If we assume instruction latency is 1 cycle, we should expect 2,590 fps. But we measure a number nearly 10× higher! What gives?
Part of this is due to counting the instructions in the disassembly wrong. In the blog post I used 349 instructions, going off Godbolt, but in reality it's 135. If I redo the calculations with these new numbers, I get 2.11 instructions per bit, 0.553 million instrs per step, and dividing out 3.70 Gcycles/s gives 6,690 fps. That's better than 2,590 fps, but still 3.6× slower than 24,400. I think 3.6× is a factor you can chalk up to instruction-level parallelism.
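Spelled out (with the 135 instructions assumed to cover one 64-cell word, and the grid size backed out from the 0.553 M figure, so treat both as my inference):

    instrs_per_word = 135                        # corrected instruction count per 64-cell word
    cells = 512 * 512                            # grid size implied by the 0.553 M figure
    instrs_per_cell = instrs_per_word / 64       # ≈ 2.11 instructions per bit
    instrs_per_step = instrs_per_cell * cells    # ≈ 0.553 million instructions per step
    fps = 3.70e9 / instrs_per_step               # ≈ 6,690 fps, assuming 1 instruction per cycle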
Hope that answers your questions. Love your writing Gwern.
Thanks for checking. It sounds like the C compiler isn't doing a great job here of 'seeing through' the logic gate operations and compiling them down to something closer to optimal machine code. Maybe this is an example of how C isn't necessarily great for numerical optimization, or the C compiler is just bailing out of analysis before it can fix it all up.
A full-strength symbolic optimization framework like an SMT solver might be able to boil the logic gates down into something truly optimal, which would then be a very interesting proof of concept for certain people, but I expect that might be an entire project in its own right for you, and not something you could quickly check.
Still, something to keep in mind: there's an interesting neurosymbolic research direction here in training logic gates to try to extract learned 'lottery tickets' which can then be turned into hyper-optimized symbolic code achieving the same task performance but possibly far more energy-efficient or formally verifiable.
“NN-512 is an open-source Go program that generates fully AVX-512 vectorized, human-readable, stand-alone C implementations of convolutional neural nets”
Well done — really enjoyed this. We could use this kind of optimization in our library[0], which builds differentiable logic networks out of gates like AND, XOR, etc.
It focuses on training circuit-like structures via gradient descent using soft logic semantics. The idea of compiling trained models down to efficient bit-parallel C is exactly the kind of post-training optimization we’ve been exploring — converting soft gates back into hard boolean logic (e.g. by thresholding or symbolic substitution), then emitting optimized code for inference (C, WASM, HDL, etc).
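As a toy illustration of the thresholding step (not our actual API; the gate set and names here are made up), hardening is basically an argmax per node followed by code emission:

    import numpy as np

    GATES = ["0", "a & b", "a | b", "a ^ b", "~(a & b)", "a", "b", "1"]  # toy gate subset

    def harden(gate_probs):
        """gate_probs: (n_nodes, len(GATES)) soft gate distribution per node."""
        return [GATES[i] for i in gate_probs.argmax(axis=-1)]

    # e.g. hard = harden(soft_model_gate_probs)  -- each entry can then be emitted
    # directly as a C expression, a WASM op, or an HDL primitive.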
The Game of Life kernel is a great example of where logic-based nets really shine.
I also worked, a while ago, on recreating the original Deep Differentiable Logic Gate Networks paper [1], so I have a couple of additions to make.
> I wanted to see if I could learn the wires in addition to the gates. I still think it’s possible, but it’s something I had to abandon to get the model to converge.
Actually, I read some other paper where they also learned the wiring, but they did so by alternating the training of the gates and the wires (in some iterations they learned the wiring while keeping the gates frozen, and in others they learned the gates while keeping the wiring frozen). The problem with this approach is that it inherently doesn't scale: you need a lot of gates to approximate the behavior of a simple MLP, and if you need a full NxM learned matrix to encode the wiring, the memory needed to learn, for example, MNIST gets huge quickly. I think there are 2 fixes for this (rough sketch after the list):
- You actually don't need to learn a full NxM matrix to increase the expressivity of the network. You can, for each output gate, select a random subset of possible input gates of size K, and then you only need a learned matrix of size KxM. I did the numbers, and even a moderately small K, like 16 or 32, wildly increases the number of circuits you can learn with a smaller number of layers and gates.
- You could use a LoRA kind of matrix. Instead of a matrix NxM, use a pair of matrices NxK and KxM, where K<<N,M.
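In PyTorch-ish pseudocode, the two fixes look roughly like this (the shapes and the softmax relaxation are just for illustration):

    import torch
    import torch.nn.functional as F

    N, M, K = 4096, 4096, 32   # N candidate inputs, M output gates, K = subset size / rank

    # Fix 1: each output gate only chooses among K randomly pre-assigned candidates,
    # so the learned wiring is (M, K) instead of (M, N).
    cand = torch.randint(N, (M, K))                 # frozen random subset per gate
    subset_logits = torch.nn.Parameter(torch.zeros(M, K))

    def wire_subset(x):                             # x: (batch, N) soft activations
        return (x[:, cand] * F.softmax(subset_logits, dim=-1)).sum(dim=-1)   # (batch, M)

    # Fix 2: LoRA-style low-rank wiring, (N, K) @ (K, M) instead of a full (N, M) matrix.
    A = torch.nn.Parameter(0.02 * torch.randn(N, K))
    B = torch.nn.Parameter(0.02 * torch.randn(K, M))

    def wire_lowrank(x):
        return x @ F.softmax(A @ B, dim=0)          # softmax over the N candidate inputs per gate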
Learning the wiring also has other benefits. As the output gate can learn to swap the inputs if needed, you can remove some learnable gates that are "mirrors" or "permutations" of each other (a and not b, not a and b; a or not b, not a or b), which can help scale the networks to use gates of more inputs (I tried with 3-input gates and 4-input gates).
Also, as the author pointed out, it was very difficult to get the models to converge. It was very frustrating that I never managed to get a working model that performed really well on MNIST. In the end, I gave up on that and I worked on how to make the network consistently learn simple 3-input or 4-input functions with perfect accuracy, and I managed to make it learn them consistently with a couple dozen iterations, which was nice.
Is there an expanded explanation for "Of course it is biased! There's no way to train the network otherwise!" coming soon?
I'm still struggling to understand why that is the case. As far as I understand the training, in a bad case (probably mostly at the start) you could happen to learn the wrong gate early and then have to revert it. Why isn't the same thing happening without the biasing toward pass-through? I get why pass-through would make things faster, but not why it would prevent converging.
Thank you for the excellent writeup of some extremely interesting work! Do you have any opinions on whether binary networks and/or differentiable circuits will play a large role in the future of AI? I've long had this hunch that we'll look back on current dense vector representations as an inferior way of encoding information.
Well, I'm not an expert. I think that this research direction is very cool. I think that, at the limit, for some (but not all!) applications, we'll be training over the raw instructions available to the hardware, or perhaps even the hardware itself. Maybe something as in this short story[0]:
> A descendant of AutoML-Zero, “HQU” starts with raw GPU primitives like matrix multiplication, and it directly outputs binary blobs. These blobs are then executed in a wide family of simulated games, each randomized, and the HQU outer loop evolved to increase reward.
I also think that different applications will require different architectures and tools, much like how you don't write systems software in Lua, nor script game mods with Zsh. It's fun to speculate, but who knows.
Yes and no. I wasn't expecting to be able to reproduce the work, so I'm just content that it works. I was very surprised by how much hyperparameter finagling I had to do to get the DLGN converging; the tiny relu network I trained at the beginning, in comparison, converged with dead-simple SGD in a third of the epochs.
The speedup was surprising in the sense that the bit-level parallelism fell out naturally: that 64× speedup alone was unexpected and pretty sweet. There's likely still a lot of speed left on the table. I just did the bare minimum to get the C code working: it's single-threaded, there's no vectorization, lots of register spilling, etc. Imagine the speedup you'd get running the circuit on e.g. an FPGA.
But no, it was not surprising in the sense that yeah, multiplying billions of floats is going to be much slower than a handful of parallel bitwise ops. Physics is physics, doesn't matter how good your optimizer is.
You've made some mistakes with the Game of Life rules. You've missed out the overpopulation rule:
Any live cell with more than three live neighbours dies
Nit:
> I guess there’s a harsh third rule which is, “if the cell is dead, it stays dead”.
That phrasing is inaccurate: if a dead cell stayed dead, the first rule wouldn't work. I'm not sure that particular sentence adds much to the flow, honestly.
You're thinking about the cells as toggles on a stateful grid, TFA is thinking about them as pure functions that take in an input state and output a new state (with "off" being the default).
From that perspective, there's no point in "killing" a cell; it's simpler to only write out the 0 -> 1 and 1 -> 1 transition cases and leave all of the other cases as implicitly 0.
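In pseudo-Python, the pure-function view is just something like:

    def next_cell(alive, live_neighbours):
        # Only the 0 -> 1 and 1 -> 1 cases are written out; every other case is
        # implicitly 0, which already covers under- and over-population.
        return 1 if live_neighbours == 3 or (alive and live_neighbours == 2) else 0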
I did some extremely rough research into doing learnable wirings [1], but couldn't get past even learning ~4-bit addition.
[0]: https://arxiv.org/abs/2210.08277
[1]: https://ezb.io/thoughts/program_synthesis/boolean_circuits/2...
https://weightagnostic.github.io/
A quick look at picking from a selection turned up https://psycnet.apa.org/record/1960-03588-000 but these days softmax is just about ubiquitous.
Good blog post, thanks!
I profiled just now:
[0] https://github.com/VoxLeone/SpinStep/tree/main/benchmark
[1] https://arxiv.org/abs/2210.08277
[0]: https://gwern.net/fiction/clippy
iirc it's around 30-40?