So far all my work has gone into the technical side of setting up the game (a Java app written in 2010) to work as a reinforcement learning environment. The developers were nice enough to maintain the source and open it to the community, so I patched the client/server to be controllable through protobuf messages. So far, I can:
- Record games between humans. I also wrote a kind of janky replay viewer [1] that probably only makes sense to people who play the game already. (Before, the game didn't have any recording feature.)
- Define bots with pytorch/python and run them in offline training mode. (The game runs relatively quickly, like 8 gameplay minutes / realtime second.)
- Run my python-defined bots online versus human players. (Just managed to get this working today.)
It took a bunch of messing around with the Java source to get this far, and I haven't even really started on the reinforcement learning part yet. Hopefully I can start on that soon.
This game (https://planeball.com) is really unique, and I'm excited to produce a reinforcement learning environment that other people can play with easily. Thinking about how you might build bots for this game was one of the problems that made me interested in artificial intelligence 8 years ago. The controls/mechanics are pretty simple and it's relatively easy to make bots that beat new players---basically just don't crash into obstacles, don't stall out, conserve your energy, and shoot when you will deal damage---but good human players do a lot of complicated intuitive decision-making.
[1] http://altistats.com/viewer/?f=4b020f28-af0b-4aa0-96be-a73f0... (Press h for help on controls. Planes will "jump around" when they're not close to the objective---the server sends limited information on planes that are outside the field of vision of the client, but my recording viewer displays the whole map.)
Except, the linear map W is just set to a random initialization, so it won't work for obvious reasons in its current form. (I guess this is why there is no example of its output. I'm guessing it was vibe-coded?) Also, since the intervention is only happening at the last hidden layer, I can't imagine this would really change how the model "thinks" in an interesting way. Like, yeah, you can absolutely make a model talk about dogs by adding in a control vector for "dogness" somewhere.
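For reference, the generic recipe people mean by this is something like the sketch below (not the repo's code; add_control_vector, hidden, and direction are made-up names, and the direction would normally come from contrasting activations rather than a random W):

```python
import torch

def add_control_vector(hidden, direction, alpha=4.0):
    # hidden: (batch, seq, d_model) activations at the chosen layer.
    # direction: (d_model,), e.g. mean activation over "dog" prompts minus mean over neutral prompts.
    return hidden + alpha * direction   # broadcasts over batch and sequence positions
```

The only interesting parts are which layer you touch and where the direction comes from---a randomly initialized W gives you neither.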
Basically, this method is "inspired by graffiti art of tagging and the neuroplastic nature of living brains" in the same way that taking an exponential moving average of a time series would be "informed by state-space dynamics techniques utilized in deep learning, reservoir computing, and quantum mechanics." Really tired of the amount of insincere/pointless language in deep learning nowadays.
Do you think it's accurate to describe equivariance as both a strength and a weakness here? As in, it allows the model to learn a useful compression, but you have to pick your set of equivariant layers up front, and there's little the model can do to "fix" bad choices.
One example that comes to mind (I don't know much/haven't thought about it much) is how AlphaFold apparently dropped rotational equivariance of the model in favor of what amounts to data augmentation---opting to "hammer in" the symmetry rather than using these fancy equivariant-by-design architectures. Apparently it's a common finding that hard-coded equivariance can hurt performance in practice when you have enough data.
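For what it's worth, "hammering in" the symmetry usually just means augmentation, something like the sketch below (a generic illustration, definitely not AlphaFold's actual pipeline; model, coords, and target are placeholders):

```python
import torch

def random_rotation():
    # Uniform random 3D rotation: QR of a Gaussian matrix, sign-fixed so det(q) = +1.
    q, r = torch.linalg.qr(torch.randn(3, 3))
    q = q * torch.sign(torch.diagonal(r))
    if torch.det(q) < 0:
        q[:, 0] = -q[:, 0]
    return q

def augmented_loss(model, coords, target):
    # Rotate inputs and targets by the same random rotation every training step,
    # instead of baking SO(3) equivariance into the layers themselves.
    rot = random_rotation()
    return ((model(coords @ rot.T) - target @ rot.T) ** 2).mean()
```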
If X is a random variable having a uniform distribution between zero and one, then -ln(X)/λ has an exponential distribution with rate λ.
This relationship comes in handy when, for example, you want to draw weighted random samples or generate event times for simulations.
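A minimal sketch of both uses (the names are made up; the second one is the standard "exponential clocks" trick for weighted sampling without replacement):

```python
import math, random

lam = 2.0
x = -math.log(1.0 - random.random()) / lam   # Exponential(rate=lam); the 1 - U avoids log(0)

# Weighted sampling without replacement: give each item an Exponential(rate = weight) key.
# The smallest key belongs to item i with probability weight_i / sum(weights), and taking
# the k smallest keys repeats this on the remaining items.
weights = {"a": 1.0, "b": 5.0, "c": 0.5}
keys = {name: -math.log(1.0 - random.random()) / w for name, w in weights.items()}
sample = sorted(weights, key=keys.get)[:2]   # 2 items, weighted, without replacement
```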
When X is an exponential variable and c is a positive constant, X + c has the same distribution as X conditioned on the event X > c. In other words, the two variables have the same "tail." This holds exactly for exponential distributions and nothing else. (It's sometimes called "memorylessness.")
Similarly, when U is uniform on [0, 1] and c is a constant in (0, 1], cU has the same distribution as U conditioned on the event U < c.
So if cU is distributed like U near 0, then -ln(cU) is distributed like -ln(U) near infinity. But -ln(cU) = -ln(c) - ln(U), so the tail of -ln(U) doesn't change when we add the constant -ln(c). That's exactly memorylessness, so -ln(U) must have an exponential distribution.
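Spelling that out, with the self-similarity above written as "U conditioned on U < c is distributed as cU" and c = e^{-s}:

```latex
P(-\ln U > s + t \mid -\ln U > s)
  = P(U < e^{-(s+t)} \mid U < e^{-s})
  = P(e^{-s} U < e^{-(s+t)})   % conditioning on {U < c} = scaling by c
  = P(U < e^{-t})
  = P(-\ln U > t)
```

so -ln(U) is memoryless, i.e. exponential with rate 1, and dividing by λ rescales the rate to λ.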
I'm curious about the focus on information compression, though. The classical view of inference as compression is beautiful and deserves to be more widely communicated, but I think the real novelty here is in how the explicitly "information-constrained" code z participates in the forward pass.
About their overall method, they write:
> It isn’t obvious why such a method is performing compression. You’ll see later how we derived it from trying to compress ARC-AGI.
I must be learning something in my PhD, because the relation with compression _did_ seem obvious! Viewing prediction loss and KL divergence of a latent distribution p(z) as "information costs" of an implicit compression scheme is very classical, and I think a lot of people would feel the same. However, while they explained that an L2 regularization over model weights can be viewed (up to a constant) as an approximation of the bits needed to encode the model parameters theta, they later say (of regularization w.r.t. theta):
> We don’t use it. Maybe it matters, but we don’t know. Regularization measures the complexity of f in our problem formulation, and is native to our derivation of CompressARC. It is somewhat reckless for us to exclude it in our implementation.
So, in principle, the compression/description length minimization point of view isn't an explanation for this success any more than it explains the success of VAEs or of empirical risk minimization in general. (From what I understand, this model can be viewed as a VAE where the encoding layer has constant input.) That's no surprise! As I see it, our lack of an adequate notion of "description length" for a network's learned parameters is at the heart of our most basic confusions in deep learning.
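For concreteness, the "information cost" I have in mind is the usual two-part VAE objective, where the constant-input encoder collapses to a free-standing trainable distribution q(z). A rough PyTorch sketch with made-up names (decoder, mu, log_sigma, target), not their implementation:

```python
import torch
import torch.nn.functional as F

def description_length_nats(decoder, mu, log_sigma, target):
    # q(z) = N(mu, sigma^2) with trainable mu, log_sigma (the "encoder" sees no input),
    # prior p(z) = N(0, I).
    sigma = log_sigma.exp()
    z = mu + sigma * torch.randn_like(mu)                         # reparameterized sample from q(z)
    pred = F.cross_entropy(decoder(z), target, reduction="sum")   # nats to encode the grids given z
    kl = 0.5 * (mu ** 2 + sigma ** 2 - 2 * log_sigma - 1).sum()   # nats to encode z under the prior
    return pred + kl   # approximate total description length (nats); note: no term for theta
```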
Now, let's think about the input distribution p(z). In a classical VAE, the decoder needs to rely on z to know what kind of data point to produce, and "absorbing" information about the nature of a particular kind of data point is actually what's expected. If I trained a VAE on exactly two images, I'd expect the latent z to carry at most one bit of information. If CompressARC were allowed to "absorb" details of the problem instance in this way, I'd expect p(z) to degenerate to the prior N(0, 1)—that is, carry no information. The model could, for example, replace z with a constant at the very first layer and overfit the data in any way it wanted.
Why doesn't this happen? In the section on the "decoding layer" (responsible for generating z), the authors write:
> Specifically, it forces CompressARC to spend more bits on the KL whenever it uses z to break a symmetry, and the larger the symmetry group broken, the more bits it spends.
As they emphasize throughout this post, this model is _very_ equivariant and can't "break symmetries" without using the parameter z. For example, if the model wants to do something like produce all-green images, the tensors constituting the "multitensor" z can't all be constant w.r.t. the color channel---at least one of them needs to break the symmetry.
The reason the equivariant network learns a "good algorithm" (low description length, etc.) is unexplained, as usual in deep learning. The interesting result is that explicitly penalizing the entropy of the parameters responsible for breaking symmetry seems to give the network the right conditions to learn a good algorithm. If we took away equivariance and restricted our loss to prediction loss plus an L2 "regularization" of the network parameters, we could still motivate this from the point of view of "compression," but I strongly suspect the network would just learn to memorize the problem instances and solutions.
In this article, we fix a mereology and a kind of quantity Q that "decomposes" over it---in the sense that Q(p) = sum_{r <= p} q(r) for some function q(r)---and then see that Mobius inversion lets us solve for q in terms of Q. In terms of incidence algebras, we're saying: assume Q = zeta q, as a product of elements in an incidence algebra. Then zeta has an inverse mu, so q = mu Q.
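A toy version with the subset lattice as the mereology, just to make the zeta/mu pair concrete (names are made up; on this poset mu(r, p) = (-1)^(|p| - |r|), i.e. inclusion-exclusion):

```python
from itertools import combinations

def subsets(s):
    s = tuple(s)
    for k in range(len(s) + 1):
        for c in combinations(s, k):
            yield frozenset(c)

# q: an arbitrary "local" quantity on parts; Q(p) = sum_{r <= p} q(r).
q = {frozenset(): 1.0, frozenset("a"): 2.0, frozenset("b"): 3.0, frozenset("ab"): 0.5}
Q = {p: sum(q[r] for r in subsets(p)) for p in q}

# Mobius inversion: q(p) = sum_{r <= p} mu(r, p) Q(r).
recovered = {p: sum((-1) ** (len(p) - len(r)) * Q[r] for r in subsets(p)) for p in Q}
assert all(abs(recovered[p] - q[p]) < 1e-9 for p in q)
```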
In other situations, we might want to "solve for" a quantity Q that decomposes over some class of mereologies while respecting some properties. The "simpler" and more "homogeneous" the parts of your mereology, the less you can express, but the easier it becomes to reason about Q. A mereology that breaks me up into the empty set, singleton sets with each of my atoms, and the set of all my atoms admits no "decomposing quantities" besides a histogram of my atoms. An attempt to measure "how healthy I am" in terms of that mereology can't do much. On the other hand, if I choose the mereology that breaks me up into the empty set and my whole, all quantities decompose but I have no tools to reason about them.
I guess Euler characteristic could be an example of how the requirement of respecting a certain kind of mereology can "bend" a hard-to-decompose quantity into a weirder but "nicer" quantity. For example, say we're interested in defining a Q that attempts to "count the number of connected regions" of some object, and we insist on using a mereology that lets us divide regions up into "cells". Of course this is impossible, as we can see in the problem of counting connected components of a graph-like object: we can't get the answer just as a function of the number of vertices and edges. However, if we insist on assigning a value of 1 to "blobs" of any dimension, the "compositionality requirement" forces us to define the Euler characteristic. This doesn't help us much with graph algorithms in general, but gives us an unexpectedly easy way to, say, count the number of blob-shaped islands on a map.
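Here's the island-counting example as a sketch (my own toy code, not from the article): treat the map as a union of closed unit squares, compute chi = V - E + F with purely local bookkeeping, and---as long as every island is simply connected, i.e. has no lakes---chi is the number of islands, with no graph search needed.

```python
def euler_characteristic(cells):
    # cells: set of (row, col) filled grid squares.
    verts, edges = set(), set()
    for (r, c) in cells:
        verts.update([(r, c), (r, c + 1), (r + 1, c), (r + 1, c + 1)])
        edges.update([
            frozenset({(r, c), (r, c + 1)}),          # top
            frozenset({(r + 1, c), (r + 1, c + 1)}),  # bottom
            frozenset({(r, c), (r + 1, c)}),          # left
            frozenset({(r, c + 1), (r + 1, c + 1)}),  # right
        ])
    # Shared vertices/edges are counted once; each simply connected blob contributes 1.
    return len(verts) - len(edges) + len(cells)

islands = {(0, 0), (0, 1), (1, 0),   # one L-shaped island
           (3, 3)}                   # one isolated island
print(euler_characteristic(islands))  # 2
```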
I wonder if there are other examples of this?
Conditional on the value of the first draw, N is geometrically distributed. If we're drawing from an absolutely continuous distribution on the first line, then of course the details of our distribution don't matter: N is a draw from a geometric distribution with success probability lambda, where lambda in turn is drawn uniformly from [0, 1]. It follows that N has a thick tail; for example, the expected value of N is the expected value of 1/lambda, which is infinite. In fact, N turns out to have a power law tail.
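A quick simulation of this, under my reading of the setup (the lead car's speed quantile is uniform, and each following car joins the queue exactly when it's faster than the lead):

```python
import random

def queue_length():
    lead = random.random()           # speed quantile of the car at the front
    n = 1
    while random.random() > lead:    # next car is faster, so it catches up and joins
        n += 1
    return n                         # the queue ends at the first slower car

samples = [queue_length() for _ in range(1_000_000)]
for n in (1, 2, 3, 10, 100):
    print(n, samples.count(n) / len(samples), 1 / (n * (n + 1)))  # matches P(N = n) = 1/(n(n+1))
print(sum(samples) / len(samples))   # the sample mean keeps growing with more samples (E[N] = inf)
```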
However, this isn't true if we're drawing from a distribution that's not absolutely continuous. If you coarse-grain into just "fast" and "slow" cars, then N again has a thin (geometric) tail. More to the point, if we imagine that our queues of cars need to be formed within a finite amount of time, then a car is only added to the queue in front of it if its velocity is at least some epsilon larger than the velocity of the queue, and the problematic situation where lambda -> 0 goes away. In this idealized scenario, I guess you could relate the rate of the exponential tail of N to how long the cars have been travelling for.
Finally, it's worth remembering the old "waiting-time paradox": the variable N we're talking about is not the same as the length of the queue that a randomly selected driver finds themself in. What's the distribution of the latter---the distribution of "experienced" queue lengths? In this post the author computed that P(N = n) = 1/(n(n + 1)). It stands to reason that to get the probabilities of the experienced lengths we need to size-bias: multiply by n and divide by a normalizing constant. Unfortunately, you can't multiply 1/(n + 1) by any constant to get a probability distribution, since the sum over n diverges.
What does it mean that the distribution of experienced queue lengths doesn't exist? If you did a huge numerical simulation, you'd find that almost all drivers experience incredibly large queues, and that this concentration towards large queues only becomes more pronounced as you simulate more drivers. If anything, you could argue that the experienced queue length is "concentrated at infinity," although of course in practice all queues are finite.
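To see this numerically (reusing queue_length from the sketch above), size-bias by queue length and watch the "experienced" mean drift upward instead of converging:

```python
def experienced_mean(num_queues):
    lengths = [queue_length() for _ in range(num_queues)]
    # A uniformly random driver lands in a queue of length L with probability proportional to L.
    return sum(L * L for L in lengths) / sum(lengths)

for m in (10 ** 3, 10 ** 4, 10 ** 5, 10 ** 6):
    print(m, experienced_mean(m))   # no stable limit: the size-biased distribution doesn't exist
```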