ansk commented on Who invented deep residual learning?   people.idsia.ch/~juergen/... · Posted by u/timlod
ansk · 2 months ago
Of all Schmidhuber's credit-attribution grievances, this is the one I am most sympathetic to. I think if he spent less time remarking on how other people didn't actually invent things (e.g. Hinton and backprop, LeCun and CNNs, etc.) or making tenuous arguments about how modern techniques are really just instances of some idea he briefly explored decades ago (GANs, attention), and instead just focused on how this single line of research (namely, gradient flow and training dynamics in deep neural networks) laid the foundation for modern deep learning, he'd have a much better reputation and probably a Turing award. That said, I do respect the extent to which he continues his credit-attribution crusade even to his own reputational detriment.
ansk commented on America’s semiconductor boom [video]   youtube.com/watch?v=T-jt3... · Posted by u/zdw
itsnowandnever · 2 months ago
I always thought it was funny that for my entire lifetime people have talked about Arizona being perfect for fabs because it's dry there and not subject to tremors, while Taiwan, where 60% of all chips are produced (and 90% of the most sensitive ones), is tropical and has earthquakes fairly frequently.
ansk · 2 months ago
I can only imagine what the Taiwanese can do in Arizona. Truly a synergy for the ages.
ansk commented on Depression reduces capacity to learn to actively avoid aversive events   eneuro.org/content/12/9/E... · Posted by u/PaulHoule
ansk · 3 months ago
My personal experience is that the cost of enduring a negative stimulus is not simply a function of the magnitude of the negative stimulus, but rather the magnitude of the negative stimulus in relation to the magnitude of all other concurrent negative stimuli. This study controls the environment so that a single negative stimulus is isolated and additional external negative stimuli are minimized, but it cannot control for the fact that a depressed person also endures a constant barrage of negative stimuli which are generated internally (hopelessness, exhaustion, fear, self-doubt, etc). The magnitude of these internally generated negative stimuli is likely much larger than that of the aversive external stimulus used in this study, so it seems reasonable that the marginal relief obtained by avoiding the external stimulus may be perceived as relatively negligible, or at least diminished to the point that the cost of avoiding is greater than the cost of enduring.
ansk commented on FFmpeg Assembly Language Lessons   github.com/FFmpeg/asm-les... · Posted by u/flykespice
zahlman · 4 months ago
Don't know how I overlooked that, thanks. Maybe because the one Python wrapper I know about is generating command lines and making subprocess calls.
ansk · 4 months ago
For future reference, if you want proper Python bindings for ffmpeg*, you should use PyAV.

* To be more precise, these are bindings for the libav* libraries that underlie ffmpeg
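
As a rough sketch of what that looks like in practice (the file path here is just a placeholder), decoding stays entirely in-process:

    import av  # PyAV: pip install av

    # Open a container and decode the first video stream frame by frame,
    # without shelling out to the ffmpeg binary.
    container = av.open("input.mp4")  # placeholder path
    for frame in container.decode(video=0):
        rgb = frame.to_ndarray(format="rgb24")  # H x W x 3 numpy array
        print(frame.pts, rgb.shape)
    container.close()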

ansk commented on I should have loved biology too   nehalslearnings.substack.... · Posted by u/nehal96
sdenton4 · 8 months ago
Well, this is incredible: "The gene sequence had a strange repeating structure, CAGCAGCAG… continuing for 17 repeats on average (ranging between 10 to 35 normally), encoding a huge protein that’s found in neurons and testicular tissue (its exact function is still not well understood). The mutation that causes HD increases the number of repeats to more than forty – a “molecular stutter” – creating a longer huntingtin protein, which is believed to form abnormally sized clumps when enzymes in neural cells cut it. The more repeats there are, the sooner the symptoms occur and the higher the severity"

Not the only sequence model that exhibits stutters on repetitive inputs...

ansk · 8 months ago
And on the seventh day, God ended His work which He had done and began vibe coding the remainder of the human genome.
ansk commented on Adobe's new image rotation tool is one of the most impressive AI tools seen   creativebloq.com/design/a... · Posted by u/ralusek
HarHarVeryFunny · a year ago
I think the most novel part of it, and where a lot of the power comes from, is in the key-based attention, which then operationally gives rise to the emergence of induction heads (whereby a pair of adjacent layers coordinates to provide a powerful context lookup and copy mechanism).

The reusable/stackable block is of course a key part of the design since the key insight was that language is as much hierarchical as sequential, and can therefore be processed in parallel (not in sequence) with a hierarchical stack of layers that each use the key-based lookup mechanism to access other tokens whether based on position or not.

In any case, if you look at the seq2seq architectures that preceded it, it's hard to claim that the Transformer is really based-on/evolved-from any of them (especially prevailing recurrent approaches), notwithstanding that it obviously leveraged the concept of attention.

I find the developmental history of the Transformer interesting, and wish more had been documented about it. It seems from interviews with Uszkoreit that the idea of parallel language processing based on a hierarchical design using self-attention was his, but that he was personally unable to realize this idea in a way that beat other contemporary approaches. Noam Shazeer was the one who then took the idea and realized it in the form that would eventually become the Transformer, but it seems there was some degree of throwing the kitchen sink at it and then a later ablation process to minimize the design. What would be interesting to know is an honest assessment of how much of the final design was inspiration and how much experimentation. It's hard to imagine that Shazeer anticipated the emergence of induction heads when this model was trained at sufficient scale, so the architecture does seem to be at least partly an accidental discovery, and more than the next-generation seq2seq model it seems to have been conceived as.

ansk · a year ago
Key-based attention is not attributable to the Transformer paper. First paper I can find where keys, queries, and values are distinct matrices is https://arxiv.org/abs/1703.03906, described at the end of section 2. The authors of the Transformer paper are very clear in how they describe their contribution to the attention formulation, writing "Dot-product attention is identical to our algorithm, except for the scaling factor". I think it's fair to state that multi-head is the paper's only substantial contribution to the design of attention mechanisms.
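
To make that concrete, here is a rough numpy sketch of dot-product attention with the scaling factor and the multi-head split bolted on; the shapes and the softmax helper are my own illustration, not taken from either paper:

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def multi_head_attention(Q, K, V, num_heads):
        # Q, K, V: (seq_len, d_model), already projected by their own matrices.
        seq_len, d_model = Q.shape
        d_head = d_model // num_heads
        split = lambda X: X.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
        Qh, Kh, Vh = split(Q), split(K), split(V)   # (num_heads, seq_len, d_head)
        # Dot-product attention; the 1/sqrt(d_head) scaling is the Transformer
        # paper's stated change to the pre-existing dot-product formulation.
        scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)
        out = softmax(scores) @ Vh                  # (num_heads, seq_len, d_head)
        return out.transpose(1, 0, 2).reshape(seq_len, d_model)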

I think you're overestimating the degree to which this type of research is motivated by big-picture, top-down thinking. In reality, it's a bunch of empirically-driven, in-the-weeds experiments that guide a very local search in an intractably large search space. I can just about guarantee the process went something like this:

- The authors begin with an architecture similar to the current SOTA, which was a mix of recurrent layers and attention

- The authors realize that they can replace some of the recurrent layers with attention layers, and performance is equal or better. It's also way faster, so they try to replace as many recurrent layers as possible.

- They realize that if they remove all the recurrent layers, the model sucks. They're smart people and they quickly realize this is because the attention-only model is invariant to sequence order. They add positional encodings to compensate for this (a bare-bones sketch of those encodings follows this list).

- They keep iterating on the architecture design, incorporating best-practices from the computer vision community such as normalization and residual connections, resulting in the now-famous Transformer block.
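
For reference, here is roughly what the sinusoidal positional encodings they settled on look like; the dimensions are illustrative and the code is my own sketch of the paper's formula:

    import numpy as np

    def sinusoidal_positional_encoding(seq_len, d_model):
        # Fixed encodings added to token embeddings so attention can recover order.
        pos = np.arange(seq_len)[:, None]               # (seq_len, 1)
        i = np.arange(d_model // 2)[None, :]            # (1, d_model / 2)
        angles = pos / np.power(10000.0, 2 * i / d_model)
        pe = np.zeros((seq_len, d_model))
        pe[:, 0::2] = np.sin(angles)                    # even dimensions
        pe[:, 1::2] = np.cos(angles)                    # odd dimensions
        return pe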

At no point is any stroke of genius required to get from the prior SOTA to the Transformer. It's the type of discovery that follows so naturally from an empirically-driven approach to research that it feels all but inevitable.

ansk commented on Adobe's new image rotation tool is one of the most impressive AI tools seen   creativebloq.com/design/a... · Posted by u/ralusek
HarHarVeryFunny · a year ago
Not at all - Transformer was invented by a bunch of former Google employees (while at Google), primarily Jakob Uszkoreit and Noam Shazeer. Of course as with anything it builds on what had gone before, but it's really quite a novel architecture.
ansk · a year ago
The scientific impact of the Transformer paper is large, but in my opinion the novelty is vastly overstated. The primary novelty is adapting the (already existing) dot-product attention mechanism to be multi-headed. And frankly, the single-head -> multi-head evolution wasn't particularly novel -- it's the same trick the computer vision community applied to convolutions 5 years earlier, yielding the widely-adopted grouped convolution. The lasting contribution of the Transformer paper is really just ordering the existing architectural primitives (attention layers, feedforward layers, normalization, residuals) in a nice, reusable block. In my opinion, the most impactful contributions in the lineage of modern attention-based LLMs are the introduction of attention for neural machine translation (Bahdanau et al., 2015) and the first attention-based sequence-to-sequence model (Graves, 2013). Both of these are from academic labs.

As a side note, a similar phenomenon occurred with the Adam optimizer, where the ratio of public/scientific attribution to novelty is disproportionately large (the Adam optimizer is a very minor modification of the RMSProp + momentum optimization algorithm presented in the same Graves, 2013 paper mentioned above).
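
To give a sense of how small the delta is, here's a rough sketch of the two update rules side by side; the hyperparameter names are the conventional ones, and this isn't the exact parameterization from the Graves paper:

    import numpy as np

    def rmsprop_momentum_step(p, g, m, v, lr=1e-3, beta1=0.9, beta2=0.99, eps=1e-8):
        # RMSProp with momentum: momentum applied to the RMS-scaled gradient.
        v = beta2 * v + (1 - beta2) * g * g            # running second moment
        m = beta1 * m + lr * g / (np.sqrt(v) + eps)    # momentum on the scaled step
        return p - m, m, v

    def adam_step(p, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        # Adam: momentum on the raw gradient plus bias correction, otherwise similar.
        m = beta1 * m + (1 - beta1) * g                # running first moment
        v = beta2 * v + (1 - beta2) * g * g            # running second moment
        m_hat = m / (1 - beta1 ** t)                   # bias correction for warm-up
        v_hat = v / (1 - beta2 ** t)
        return p - lr * m_hat / (np.sqrt(v_hat) + eps), m, v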

ansk commented on PyTorch Native Architecture Optimization: Torchao   pytorch.org/blog/pytorch-... · Posted by u/jonbaer
uoaei · a year ago
Praising XLA by defending Tensorflow of all things has to be one of the strangest takes I've ever come across.

JAX is right there. No need to beat a dead horse when there's a stallion in the stables.

ansk · a year ago
Tensorflow is a lot like IBM -- it deserves praise not because it's great in its current state, but for its contributions towards advancing the broader technological front to where it is today. Tensorflow walked so JAX could run, so to speak. Frankly, I don't really draw much of a distinction between the two frameworks since I really just use them as lightweight XLA wrappers.
ansk commented on PyTorch Native Architecture Optimization: Torchao   pytorch.org/blog/pytorch-... · Posted by u/jonbaer
Atheb · a year ago
You got to give it to the pytorch team, they're really great at bringing complex optimization schemes (mixed-precision, torch.compile, etc) down to a simple-to-use API. I'm glad I moved from TF/Keras to Pytorch around 2018-2019 and never looked back. I'm eager to try this as well.
ansk · a year ago
I've seen and ignored a lot of "pytorch good, tensorflow bad" takes in my time, but this is so egregiously wrong I can't help but chime in. Facilitating graph-level optimizations has been one of the most central tenets of tensorflow's design philosophy since its inception. The XLA compiler was designed in close collaboration with the tensorflow team and was available in the tensorflow API as far back as 2017. It's not an exaggeration to say that pytorch is 5+ years behind on this front. Before anyone invokes the words "pythonic" or "ergonomic", I'd like to note that the tensorflow 2 API for compilation is nearly identical to torch.compile.
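
For anyone who hasn't put them side by side, here's roughly what the two compilation entry points look like on a toy function (the function itself is arbitrary):

    import tensorflow as tf
    import torch

    # TensorFlow 2: graph capture plus XLA compilation via a decorator.
    @tf.function(jit_compile=True)
    def tf_step(x):
        return tf.reduce_sum(tf.sin(x) ** 2)

    # PyTorch 2: the analogous entry point, added years later.
    @torch.compile
    def torch_step(x):
        return torch.sum(torch.sin(x) ** 2)

    print(tf_step(tf.range(8.0)))
    print(torch_step(torch.arange(8.0)))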
ansk commented on The Intelligence Age   ia.samaltman.com/... · Posted by u/firloop
jmmcd · a year ago
Yes, he's handwaving in this general area, but no, he's not really relying on the UAT. If you talked to most NN people 2 decades ago and asked about this, they might well answer in terms of the UAT. But nowadays, most people, including here Altman, would answer in terms of practical experience of success in learning a surprisingly diverse array of distributions using a single architecture.
ansk · a year ago
I think that while researchers would agree that the empirical success of deep learning has been remarkable, they would also agree that the language used here -- "an algorithm that could really, truly learn any distribution of data (or really, the underlying “rules” that produce any distribution of data)" -- is an overly strong characterization, to the point that it is no longer accurate. A hash function is a good example of a generating process which NN + SGD will not learn with any degree of generalization. If you trained GPT4 on an infinite dataset of strings and their corresponding hashes, it would simply saturate its 100 billion+ parameters with something akin to a compressed lookup table of input/output pairs, despite the true generating process being a program that could be expressed in less than a kilobyte. On unseen data, it would be no better than a uniform prior over hashes. Anyway, my point is that people knowledgeable in the field would have far more tempered takes on the practical limits of deep learning, and would reserve the absolute framing used here for claims that have been proven formally.
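
For concreteness, the hypothetical training pairs are trivial to generate, and the "true" generating process really is only a few lines (the hash choice and dataset size here are arbitrary):

    import hashlib
    import secrets

    # The entire generating process: a short program mapping strings to digests.
    def label(s: str) -> str:
        return hashlib.sha256(s.encode()).hexdigest()

    # An effectively unbounded supply of (input, target) pairs. Memorizing them
    # buys nothing on unseen inputs, since nearby inputs yield unrelated outputs.
    pairs = []
    for _ in range(5):
        s = secrets.token_hex(8)
        pairs.append((s, label(s)))
    for x, y in pairs:
        print(x, "->", y)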

u/ansk

Karma: 404 · Cake day: January 29, 2021