I have several gripes with NumPy, or more broadly the notion of using Python to call a C/asm library that vectorizes math operations. A lot of people speak of NumPy as if it were the solution to all your high-performance math needs in Python, which I think is disingenuous. The more custom logic you do, the less suitable the tools get. Pure Python numeric code is incredibly slow, roughly 1/30 the speed of equivalent Java, and as you find parts that can't be vectorized, you have to drop back down to pure Python.
I would like to give the simple example of the sieve of Eratosthenes:
def sieve_primeness(limit):
    # result[i] is True iff i is prime, for 0 <= i <= limit
    result = [False] + [True] * limit
    if limit >= 1:
        result[1] = False  # 1 is not prime
    for i in range(2, len(result)):
        if result[i]:
            # mark every multiple of i, starting at i*i, as composite
            for j in range(i * i, len(result), i):
                result[j] = False
    return result
This code is scalar, and porting it to a language like C/C++/Rust/Java gives decent performance straight out of the box. In Python it runs at roughly 1/30 of that speed, which is not acceptable. People pretend you can hand-wave the performance problem away by using NumPy. Please tell me how to vectorize this Python code. Go on, I'm waiting.
You can't vectorize the `if result[i]` check, because it controls whether the inner for-loop executes at all, so it has to run in pure Python. The inner loop you can vectorize, for example by building a huge temporary array and ANDing it into the result array, but that is extremely memory-inefficient compared to flipping bits of the result array in place, and probably messes up the cache too.
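For reference, here is a minimal NumPy sketch (the function name is mine) of how far vectorization gets you here: the inner loop collapses into a single in-place slice assignment, but the outer loop and the `if result[i]` test still run as plain Python on every iteration.

import numpy as np

def sieve_primeness_np(limit):
    # Same sieve; only the inner loop is vectorized via slice assignment.
    result = np.ones(limit + 1, dtype=bool)
    result[0] = False
    if limit >= 1:
        result[1] = False
    for i in range(2, result.size):
        if result[i]:                   # scalar check, still pure Python
            result[i * i :: i] = False  # vectorized inner loop, in place
    return result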
Alternatively, you can run the code in PyPy, but that usually gives a speed-up of maybe 3×, which is nowhere near enough to recover the 30× speed penalty.
Another example is that NumPy is not going to help you implement your own bigint library, because that also requires a lot of custom logic that executes between loop iterations. Meanwhile, I've implemented bigint in pure Java with acceptable performance and without issues.
Again, my point is that anytime you venture into custom numerical/logical code that is not the simple `C = A + B`, you enter a world of pain when it comes to Python performance or vectorization.
> Some functions have axes arguments. Some have different versions with different names. Some have Conventions. Some have Conventions and axes arguments. And some don’t provide any vectorized version at all.
> But the biggest flaw of NumPy is this: Say you create a function that solves some problem with arrays of some given shape. Now, how do you apply it to particular dimensions of some larger arrays? The answer is: You re-write your function from scratch in a much more complex way. The basic principle of programming is abstraction—solving simple problems and then using the solutions as building blocks for more complex problems. NumPy doesn’t let you do that.
Usually when I write Matlab code, the vectorized version just works, and if there are any changes needed, they're pretty minor and intuitive. With numpy I feel like I have to look up the documentation for every single function, transposing and reshaping the array into whatever shape that particular function expects. It's not very consistent.
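To make that concrete, here is a small, made-up example: a function written for a single 1-D vector has to be reworked with axis/keepdims plumbing (or pushed through `np.apply_along_axis`, which just loops in Python) before it can be applied along one dimension of a larger array.

import numpy as np

def normalize_1d(v):
    # Written for one vector.
    return v / np.linalg.norm(v)

x = np.random.default_rng(0).normal(size=(4, 5, 3))

def normalize_along(a, axis):
    # The "rewrite from scratch" version with explicit axis handling.
    return a / np.linalg.norm(a, axis=axis, keepdims=True)

y1 = normalize_along(x, axis=-1)
y2 = np.apply_along_axis(normalize_1d, -1, x)  # slow Python-level loop
assert np.allclose(y1, y2)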
More generally speaking, you're asking when the iteration `x_+ = Ax` converges to a fixed point, which is then an eigenvector of A. This can happen in a few different ways. The obvious one is that A has an eigenvector `v` with eigenvalue 1, while all other eigenvalues have magnitude < 1. Then those other components die out under repeated application of A, leaving only `v` in the limit.
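A quick numerical illustration with a made-up 2x2 matrix: A below has eigenvalues 1 and 0.5, so repeated application damps everything except the eigenvalue-1 eigenvector.

import numpy as np

A = np.array([[0.75, 0.25],
              [0.25, 0.75]])   # eigenvalues 1 and 0.5
x = np.array([0.9, 0.1])
for _ in range(50):
    x = A @ x
print(x)  # ~[0.5, 0.5], the eigenvector with eigenvalue 1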
For Markov chains, we get exactly this property from the Perron-Frobenius theorem, which applies to non-negative irreducible matrices. Irreducible means that the transition graph of the Markov chain is strongly connected. If that's the case, there is a unique eigenvector with eigenvalue 1, called the stationary distribution, and (as long as the chain is also aperiodic) all initial conditions converge to it.
If A is not irreducible, you may have several connected components, and the stationary distribution you end up at can depend on which component your initial condition is in. Going back to the n x n identity matrix: it has n connected components (a completely disconnected graph with all self-transition probabilities equal to 1). So every initial condition is already stationary, because nothing changes from one step to the next.
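Here is a made-up numerical example of the reducible case: a block-diagonal (column-stochastic) transition matrix has one stationary distribution per component, and which one the iteration converges to depends on where the initial condition lives.

import numpy as np

# Two disconnected 2-state components; columns sum to 1.
A = np.array([[0.9, 0.1, 0.0, 0.0],
              [0.1, 0.9, 0.0, 0.0],
              [0.0, 0.0, 0.5, 0.5],
              [0.0, 0.0, 0.5, 0.5]])

def iterate(x, steps=200):
    for _ in range(steps):
        x = A @ x
    return x

print(iterate(np.array([1.0, 0.0, 0.0, 0.0])))  # -> ~[0.5, 0.5, 0.0, 0.0]
print(iterate(np.array([0.0, 0.0, 0.0, 1.0])))  # -> ~[0.0, 0.0, 0.5, 0.5]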
The works that I've read train an NN on numerical solutions for different geometries and boundary conditions. Then they try to infer the solutions for configurations outside the training set, which should be much faster than recomputing the numerical solution.
It could be interesting to do a comparison with finite volume methods to see when/how those approximations break down.
Regarding your second point: yes, when H = 1 we just recover the standard Kalman Filter, and yes, as H grows large the estimate gets worse and worse, in the sense that the softmax nonlinearity pulls more and more irrelevant data from the past into the estimate. The point is that in real-world problems, which are usually messy and nonlinear, we probably want H, the so-called context length, to be large, because then we can take advantage of information collected in the past to improve decisions in the present. It just so happens that in the special case where the system is linear, this is more harmful than helpful.

Here is one way to think about our result: imagine you have a Transformer which takes as input K-dimensional embeddings and has context length H, and you want to use it for filtering in some dynamical system. The most basic question you could ask is: if the system is linear, can it do Kalman Filtering? In other words, in the easy, linear scenario, can it match the optimal algorithm? If the answer is no, I see no reason why you should expect it to work in harder, nonlinear settings. We show that the answer is yes when the system you want to filter has roughly sqrt(K) states and you design the embeddings appropriately. Hopefully this preliminary result will lead to a better understanding of how deep learning can improve control in the hard, nonlinear scenario.
The statement that the Kalman filter is mean-square optimal because it generates a correct estimate in expectation is false. In fact, any gain L will generate an estimate whose expected value is x_k, as long as w_k and v_k are zero-mean. The Kalman gain is a specific choice of L that is optimal in the mean-square sense (the best linear estimator in general, and the best estimator overall when the disturbances are Gaussian). The Kalman gain is also time-varying and depends on the evolution of the estimate covariance, although it converges to a steady-state value.
What's being described here is more properly called a Luenberger observer, but I guess that name doesn't get the same recognition outside the control community.
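To make the relationship concrete, here is a minimal sketch in my own notation: both the Luenberger observer and the Kalman filter correct a model prediction with a gain times the measurement residual; the Kalman filter just computes a specific, time-varying gain from the covariance recursion (input terms are omitted to keep it short).

import numpy as np

def luenberger_step(A, C, L, x_hat, y):
    # Fixed-gain observer: x_hat' = A x_hat + L (y - C x_hat)
    return A @ x_hat + L @ (y - C @ x_hat)

def kalman_step(A, C, Q, R, x_hat, P, y):
    # Same structure, but the gain K is recomputed from the covariance P
    # each step (and converges to a steady-state value over time).
    x_pred = A @ x_hat
    P_pred = A @ P @ A.T + Q
    K = P_pred @ C.T @ np.linalg.inv(C @ P_pred @ C.T + R)
    x_new = x_pred + K @ (y - C @ x_pred)
    P_new = (np.eye(P.shape[0]) - K @ C) @ P_pred
    return x_new, P_new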
I'm also wondering why they chose to include H past estimates and measurements in the transformer. They're already embedding the Kalman gain into the weights of the transformer, so taking just one past estimate/measurement should exactly recover the Kalman filter. Going further into the past just makes the estimate worse, because of the softmax.
> e.g. optimization of state space control coefficients looks something like training an LLM matrix...
However, from what I have seen, this isn't really a useful way of reframing the problem. The optimal control problem is at least as hard as, if not harder than, the original problem of training the neural network, and the latter already has mature, performant software for solving it efficiently. That's not to say there isn't good software for optimal control, but it addresses a more general problem, so off-the-shelf solvers can't exploit the network structure very well.
Some researchers have made interesting theoretical connections like in neural ODEs, but even there the practicality is limited.