I have several gripes with NumPy, or more broadly the notion of using Python to call a C/asm library that vectorizes math operations. A lot of people speak of NumPy as if it were the solution to all your high-performance math needs in Python, which I think is disingenuous. The more custom logic you do, the less suitable the tools get. Pure Python numeric code is incredibly slow, roughly 1/30 the speed of equivalent Java, and as you find parts that can't be vectorized, you have to drop back down to pure Python.
I would like to give the simple example of the sieve of Eratosthenes:
def sieve_primeness(limit):
    # result[i] is True iff i is prime, for 0 <= i <= limit
    result = [False] + [True] * limit
    if limit >= 1:
        result[1] = False  # 1 is not prime
    for i in range(2, len(result)):
        if result[i]:
            # mark every multiple of i, starting at i*i, as composite
            for j in range(i * i, len(result), i):
                result[j] = False
    return result
This code is scalar, and porting it to a language like C/C++/Rust/Java gives decent performance straight out of the box. In Python it runs at roughly 1/30 of that speed, which is not acceptable. People pretend you can hand-wave the performance problem away by using NumPy. Please tell me how to vectorize this Python code. Go on, I'm waiting.
You can't vectorize the `if result[i]` check, because it controls whether the inner for-loop executes at all, so it has to run in pure Python. The inner loop you can vectorize, for example by building a huge temporary array and ANDing it into the result array, but that is extremely memory-inefficient compared to flipping bits of the result array in place, and probably messes up the cache too.
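For reference, here is a minimal NumPy sketch (the function name is mine) of how far vectorization gets you here: the inner loop collapses into a single in-place slice assignment, but the outer loop and the `if result[i]` test still run as plain Python on every iteration.

import numpy as np

def sieve_primeness_np(limit):
    # Same sieve; only the inner loop is vectorized via slice assignment.
    result = np.ones(limit + 1, dtype=bool)
    result[0] = False
    if limit >= 1:
        result[1] = False
    for i in range(2, result.size):
        if result[i]:                   # scalar check, still pure Python
            result[i * i :: i] = False  # vectorized inner loop, in place
    return result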
Alternatively, you can run the code in PyPy, but that usually gives a speed-up of maybe 3×, which is nowhere near enough to recover the 30× speed penalty.
Another example is that NumPy is not going to help you implement your own bigint library, because that also requires a lot of custom logic that executes between loop iterations. Meanwhile, I've implemented bigint in pure Java with acceptable performance and without issues.
Again, my point is that anytime you venture into custom numerical/logical code that is not the simple `C = A + B`, you enter a world of pain when it comes to Python performance or vectorization.
> Some functions have axes arguments. Some have different versions with different names. Some have Conventions. Some have Conventions and axes arguments. And some don’t provide any vectorized version at all.
> But the biggest flaw of NumPy is this: Say you create a function that solves some problem with arrays of some given shape. Now, how do you apply it to particular dimensions of some larger arrays? The answer is: You re-write your function from scratch in a much more complex way. The basic principle of programming is abstraction—solving simple problems and then using the solutions as building blocks for more complex problems. NumPy doesn’t let you do that.
Usually when I write Matlab code, the vectorized version just works, and if there are any changes needed, they're pretty minor and intuitive. With numpy I feel like I have to look up the documentation for every single function, transposing and reshaping the array into whatever shape that particular function expects. It's not very consistent.
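To make that concrete, here is a small, made-up example: a function written for a single 1-D vector has to be reworked with axis/keepdims plumbing (or pushed through `np.apply_along_axis`, which just loops in Python) before it can be applied along one dimension of a larger array.

import numpy as np

def normalize_1d(v):
    # Written for one vector.
    return v / np.linalg.norm(v)

x = np.random.default_rng(0).normal(size=(4, 5, 3))

def normalize_along(a, axis):
    # The "rewrite from scratch" version with explicit axis handling.
    return a / np.linalg.norm(a, axis=axis, keepdims=True)

y1 = normalize_along(x, axis=-1)
y2 = np.apply_along_axis(normalize_1d, -1, x)  # slow Python-level loop
assert np.allclose(y1, y2)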
More generally speaking, you're asking when the iteration `x_+ = Ax` converges to a fixed point, which is then an eigenvector of A. This can happen in a few different ways. The obvious one is that A has an eigenvector `v` with eigenvalue 1, while all other eigenvalues have magnitude < 1. Then those other components die out under repeated application of A, leaving only `v` in the limit.
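A quick numerical illustration with a made-up 2x2 matrix: A below has eigenvalues 1 and 0.5, so repeated application damps everything except the eigenvalue-1 eigenvector.

import numpy as np

A = np.array([[0.75, 0.25],
              [0.25, 0.75]])   # eigenvalues 1 and 0.5
x = np.array([0.9, 0.1])
for _ in range(50):
    x = A @ x
print(x)  # ~[0.5, 0.5], the eigenvector with eigenvalue 1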
For Markov chains, we get exactly this property from the Perron-Frobenius theorem, which applies to non-negative irreducible matrices. Irreducible means that the transition graph of the Markov chain is strongly connected. If that's the case, there is a unique eigenvector with eigenvalue 1, called the stationary distribution, and (as long as the chain is also aperiodic) all initial conditions converge to it.
If A is not irreducible, you may have several connected components, and the stationary distribution you end up at can depend on which component your initial condition is in. Going back to the n x n identity matrix: it has n connected components (a completely disconnected graph with all self-transition probabilities equal to 1). So every initial condition is already stationary, because nothing changes from one step to the next.
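Here is a made-up numerical example of the reducible case: a block-diagonal (column-stochastic) transition matrix has one stationary distribution per component, and which one the iteration converges to depends on where the initial condition lives.

import numpy as np

# Two disconnected 2-state components; columns sum to 1.
A = np.array([[0.9, 0.1, 0.0, 0.0],
              [0.1, 0.9, 0.0, 0.0],
              [0.0, 0.0, 0.5, 0.5],
              [0.0, 0.0, 0.5, 0.5]])

def iterate(x, steps=200):
    for _ in range(steps):
        x = A @ x
    return x

print(iterate(np.array([1.0, 0.0, 0.0, 0.0])))  # -> ~[0.5, 0.5, 0.0, 0.0]
print(iterate(np.array([0.0, 0.0, 0.0, 1.0])))  # -> ~[0.0, 0.0, 0.5, 0.5]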
The works that I've read train an NN on numerical solutions for different geometries and boundary conditions. Then they try to infer the solutions for configurations outside the training set, which should be much faster than recomputing the numerical solution.
It could be interesting to do a comparison with finite volume methods to see when/how those approximations break down.
Regarding your second point: yes, when H = 1 we just recover the standard Kalman Filter, and yes, as H grows large the estimate gets worse and worse, in the sense that the softmax nonlinearity pulls more and more irrelevant data from the past into the estimate. The point is that in real-world problems, which are usually messy and nonlinear, we probably want H, the so-called context length, to be large, because then we can take advantage of information collected in the past to improve decisions in the present. It just so happens that in the special case where the system is linear, this is more harmful than helpful.

Here is one way to think about our result: imagine you have a Transformer which takes as input K-dimensional embeddings and has context length H, and you want to use it for filtering in some dynamical system. The most basic question you could ask is: if the system is linear, can it do Kalman Filtering? In other words, in the easy, linear scenario, can it match the optimal algorithm? If the answer is no, I see no reason why you should expect it to work in harder, nonlinear settings. We show that the answer is yes when the system you want to filter has roughly sqrt(K) states and you design the embeddings appropriately. Hopefully this preliminary result will lead to a better understanding of how deep learning can improve control in the hard, nonlinear scenario.
The statement that the Kalman filter is mean-square optimal because it generates a correct estimate in expectation is false. In fact, any gain L will generate an estimate whose expected value is x_k, as long as w_k and v_k are zero-mean. The Kalman gain is a specific choice of L that is optimal in the mean-square sense (the best linear estimator in general, and the best estimator overall when the disturbances are Gaussian). The Kalman gain is also time-varying and depends on the evolution of the estimate covariance, although it converges to a steady-state value.
What's being described here is more properly called a Luenberger observer, but I guess that name doesn't get the same recognition outside the control community.
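To make the relationship concrete, here is a minimal sketch in my own notation: both the Luenberger observer and the Kalman filter correct a model prediction with a gain times the measurement residual; the Kalman filter just computes a specific, time-varying gain from the covariance recursion (input terms are omitted to keep it short).

import numpy as np

def luenberger_step(A, C, L, x_hat, y):
    # Fixed-gain observer: x_hat' = A x_hat + L (y - C x_hat)
    return A @ x_hat + L @ (y - C @ x_hat)

def kalman_step(A, C, Q, R, x_hat, P, y):
    # Same structure, but the gain K is recomputed from the covariance P
    # each step (and converges to a steady-state value over time).
    x_pred = A @ x_hat
    P_pred = A @ P @ A.T + Q
    K = P_pred @ C.T @ np.linalg.inv(C @ P_pred @ C.T + R)
    x_new = x_pred + K @ (y - C @ x_pred)
    P_new = (np.eye(P.shape[0]) - K @ C) @ P_pred
    return x_new, P_new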
I'm also wondering why they chose to include H past estimates and measurements in the transformer. They're already embedding the Kalman gain into the weights of the transformer, so taking just one past estimate/measurement should exactly recover the Kalman filter. Going further into the past just makes the estimate worse, because of the softmax.
> e.g. optimization of state space control coefficients looks something like training an LLM matrix...
However, from what I have seen, this isn't really a useful way of reframing the problem. The optimal control problem is at least as hard as, if not harder than, the original problem of training the neural network, and the latter already has mature, performant software for solving it efficiently. That's not to say there isn't good software for optimal control, but it addresses a more general problem, so off-the-shelf solvers can't exploit the network structure very well.
Some researchers have made interesting theoretical connections like in neural ODEs, but even there the practicality is limited.