Could websites concerned with privacy deploy a package that triggers interrupts randomly? Could a browser extension do it for every site?
Our countermeasure, which triggers interrupts randomly, is implemented as a browser extension; the source code is available here: https://github.com/jackcook/bigger-fish
I'm not sure I would recommend it for daily use, though; I think our tests showed it slowed page load times by about 10%.
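For a flavor of the mechanism, here's a toy content-script sketch of my own; it is not the extension's actual code (see the repo above for that), and the structure and constants are made up:

```typescript
// Toy sketch only -- not the actual extension's code (see the repo above).
// Idea: inject short bursts of work at random times so an attacker's
// interrupt-timing trace is dominated by noise the page controls.

function burst(durationMs: number): void {
  // Busy-wait for `durationMs`: occupies the thread and perturbs the
  // counting loops that these timing attacks rely on.
  const end = performance.now() + durationMs;
  while (performance.now() < end) {
    // spin
  }
}

function scheduleNoise(): void {
  const delayMs = Math.random() * 20; // random gap, up to 20 ms
  setTimeout(() => {
    burst(Math.random()); // random burst, up to 1 ms
    scheduleNoise(); // re-arm with a fresh random delay
  }, delayMs);
}

scheduleNoise();
```

Running something like this on every page trades responsiveness for noise, which is presumably where the ~10% page load overhead comes from.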
EDIT: From looking at the paper, it seems that even though the core state space model / selection mechanism is linear (except for the discretization?), they incorporate a nonlinearity in the full “Mamba block”, which is stacked with residual connections and layer norm just like in a transformer. They describe this as combining a linear attention and an MLP into a single step, rather than alternating attention and MLP layers as a transformer does.
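Schematically, one block looks something like this as I read it (my notation, not the paper's; σ is the SiLU nonlinearity, ⊙ is elementwise multiplication, and I'm omitting the short causal convolution in the SSM branch):

```latex
% One Mamba block, schematically (my notation): a gated SSM branch
% takes the place of a transformer's attention + MLP pair.
\[
y \;=\; x \;+\; W_{\text{out}}\Big(\mathrm{SSM}\big(\sigma(W_1\,\mathrm{LN}(x))\big)\,\odot\,\sigma(W_2\,\mathrm{LN}(x))\Big)
\]
```

The σ-gated branch is where the MLP-like nonlinearity lives, while the SSM branch plays the attention-like role.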
One question I have on selectivity: footnote 4 says "the continuous A is constant, while our discretization parameter ∆ is input-dependent." What is the effect of varying the discretization step ∆ instead of the (main, as I understand it) state matrix A? My gut says it simplifies training and provides stability, but my sense is that A carries most of the behavior of the model, so it should get more wiggle room during training.
“We remark that while the A parameter could also be selective, it ultimately affects the model only through its interaction with ∆ via Ā = exp(∆A) (the discretization (4)). Thus selectivity in ∆ is enough to ensure selectivity in (Ā, B̄), and is the main source of improvement. We hypothesize that making A selective in addition to (or instead of) ∆ would have similar performance, and leave it out for simplicity.”
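For reference, here is the zero-order-hold discretization the footnote points at (eq. (4) in the paper; my transcription): with A fixed but the step size ∆_t a function of the input,

```latex
% ZOH discretization: with \Delta_t a function of the input x_t,
% \bar{A}_t and \bar{B}_t become input-dependent even though A is not.
\[
\bar{A}_t = \exp(\Delta_t A), \qquad
\bar{B}_t = (\Delta_t A)^{-1}\big(\exp(\Delta_t A) - I\big)\,\Delta_t B,
\qquad
h_t = \bar{A}_t\, h_{t-1} + \bar{B}_t\, x_t
\]
```

So even with a constant A, a large ∆_t resets the state and lets the current token in, while ∆_t → 0 drives Ā_t → I so the token is effectively ignored; that's the sense in which selectivity in ∆ alone already makes (Ā_t, B̄_t) selective.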
I think the paper's contributions really don't have anything to do with ML; it's about the new side channel via interrupts, which is a cool find. ML just gets more people to read it, which I guess is OK. I mean, you could probably swap in classical "statistics" and use it in much the same way.
I remember an advisor once telling me: once you figure out what a paper is really about, rewrite it, and remove the stuff you used to think it was about. The title of this paper should be about the new side channel, not about the ML story, imho.
But this is just a nitpick. Great work!
But the finding that the earlier attack was misinterpreted (its ML model was picking up interrupts, not the cache) is particularly notable, because it calls a lot of existing computer architecture research into question. In the past, attacks like this were very difficult to pull off without an in-depth understanding of the side channel being exploited. But ML models (in this case, an LSTM) generally go a bit beyond “statistics” because they unlock much greater accuracy, making it much easier to develop powerful attacks that exploit side channels that aren't really understood. And a lot of ML-assisted attacks are created in this fashion today: the Shusterman et al. paper alone has almost 200 citations, a huge number for a computer architecture paper.
The point of publishing this kind of research is to better understand our systems so we can build stronger defenses; the cost of getting this wrong and misleading the community is pretty high. And that would still be true even if we had ultimately found that the cache was responsible for the prior attack. But of course, it helps that we discovered a new side channel along the way; that really drove our point home. I probably could have emphasized this more in my blog post.