spindump8930 (u/spindump8930)

spindump8930 commented on Grok: Searching X for "From:Elonmusk (Israel or Palestine or Hamas or Gaza)" simonwillison.net/2025/Ju... · Posted by u/simonw

xnx · 2 months ago

> It’s worth noting that LLMs are non-deterministic,

This is probably better phrased as "LLMs may not provide consistent answers due to changing data and built-in randomness."

Barring rare(?) GPU race conditions, LLMs produce the same output given the same inputs.

spindump8930 · 2 months ago

The many sources of stochastic/non-deterministic behavior have been mentioned in other replies, but I wanted to point out this paper, which analyzes the issues around GPU nondeterminism once sampling- and batching-related effects are removed: https://arxiv.org/abs/2506.09501

One important takeaway is that these issues are more likely in longer generations, so reasoning models can suffer more.
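
A minimal sketch of one underlying mechanism, assuming the root cause is that floating-point addition is not associative, so a different reduction order (as can happen across GPUs or kernels) yields slightly different logits. The two-token vocabulary and the logit values are made up for illustration and are not taken from the paper.

```python
# Floating-point addition is not associative, so the order in which
# partial sums are accumulated can change the low-order bits.
left_to_right = (0.1 + 0.2) + 0.3   # one accumulation order
right_to_left = 0.1 + (0.2 + 0.3)   # another accumulation order
orders_differ = left_to_right != right_to_left

def greedy_pick(logits):
    """Return the index of the largest logit (first index wins ties)."""
    return max(range(len(logits)), key=lambda i: logits[i])

# Hypothetical near-tie: token 0 has a fixed logit, token 1's logit
# comes out of a reduction whose order depends on the hardware.
pick_a = greedy_pick([0.6, left_to_right])
pick_b = greedy_pick([0.6, right_to_left])

print(orders_differ, pick_a, pick_b)
```

Once a single greedy choice flips, every later token is conditioned on a different prefix, which fits the point that longer generations are more exposed.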

spindump8930 commented on Give Me FP32 or Give Me Death? arxiv.org/abs/2506.09501... · Posted by u/spindump8930

spindump8930 · 2 months ago

People often don't understand why LLMs can be non-deterministic even with deterministic seeding, temperature, and sampling. This paper shows how bad it can be across different hardware and GPU hosts.

spindump8930 commented on Show HN: Semantic Calculator (king-man+woman=?) calc.datova.ai... · Posted by u/nxa

spindump8930 · 4 months ago

First off, this interface is very nice and a pleasure to use, congrats!

Are you using word2vec for these, or embeddings from another model?

I also wanted to add some flavor since it looks like many folks in this thread haven't seen something like this - it's been known since 2013 that we can do this (but it's great to remind folks especially with all the "modern" interest in NLP).

It's also known (in some circles!) that a lot of these vector arithmetic things need some tricks to really shine. For example, excluding the words already present in the query[1]. Others in this thread seem surprised at some of the biases present - there's also a long history of work on that [2,3].

[1] https://blog.esciencecenter.nl/king-man-woman-king-9a7fd2935...

[2] https://arxiv.org/abs/1905.09866

[3] https://arxiv.org/abs/1903.03862
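
A toy sketch of the "exclude the query words" trick mentioned above. The 3-d vectors are invented for illustration (they are not real word2vec values), and `analogy` is a hypothetical helper, not the site's implementation. Because "man" and "woman" sit close together, the raw nearest neighbor of king - man + woman is still "king"; filtering out the query words surfaces "queen".

```python
import math

# Made-up 3-d vectors (royalty, personhood, gender); NOT real embeddings.
VECS = {
    "king":  (0.9, 0.5, 0.1),
    "queen": (0.9, 0.5, -0.5),
    "man":   (0.0, 1.0, 0.1),
    "woman": (0.0, 1.0, -0.1),
    "apple": (-0.2, 0.1, 0.9),
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def analogy(a, b, c, exclude_query=True):
    """Nearest word to vec(a) - vec(b) + vec(c) by cosine similarity."""
    target = tuple(x - y + z for x, y, z in zip(VECS[a], VECS[b], VECS[c]))
    skip = {a, b, c} if exclude_query else set()
    candidates = [w for w in VECS if w not in skip]
    return max(candidates, key=lambda w: cosine(target, VECS[w]))

raw = analogy("king", "man", "woman", exclude_query=False)
filtered = analogy("king", "man", "woman", exclude_query=True)
print(raw, filtered)
```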

spindump8930 commented on Byte latent transformer: Patches scale better than tokens (2024) arxiv.org/abs/2412.09871... · Posted by u/dlojudice

anon291 · 4 months ago

I disagree. Most AI innovation today is around things like agents, integrations, and building out use cases. This is possible because transformers have made human-like AI possible for the first time in the history of humanity. These use cases will remain the same even if the underlying architecture changes. The number of people working on new architectures today is far greater than the number working on neural networks in 2017, when 'Attention Is All You Need' came out. Nevertheless, actual ML model researchers are only a small portion of the total ML/AI community, and this is fine.

spindump8930 · 4 months ago

If you consider most of the dominant architectures in deep-learning-type approaches, transformers are remarkably generic. If you reduce transformer-like architectures to "position independent iterated self attention with intermediate transformations", they can support ~all modalities and incorporate other representations (e.g. convolutions, CLIP-style embeddings, graphs or sequences encoded with additional position embeddings). On top of that, they're very compute-friendly.
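
One way to read "position independent" is that plain self-attention, with no position embeddings, is permutation-equivariant: shuffling the input tokens shuffles the outputs the same way. A minimal single-head sketch, assuming identity Q/K/V projections (a simplification; real layers learn these) and made-up token vectors:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head attention with identity Q/K/V projections, no positions."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    return out

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy inputs
perm = [2, 0, 1]                               # new position -> old position

out = self_attention(tokens)
out_perm = self_attention([tokens[i] for i in perm])

# Output at each new position should match the output of the token moved there.
# isclose is used because the summation order changes with the permutation.
equivariant = all(
    math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12)
    for new_pos, old_pos in enumerate(perm)
    for a, b in zip(out_perm[new_pos], out[old_pos])
)
print(equivariant)
```

Order information then has to be injected separately, which is where the additional position embeddings mentioned above come in.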

Two of the largest weaknesses seem to be auto-regressive sampling (not unique to the base architecture) and expensive self attention over very long contexts (whether sequence shaped or generic graph shaped). Many researchers are focusing efforts there!

Also see: https://www.isattentionallyouneed.com/

spindump8930 commented on Byte latent transformer: Patches scale better than tokens (2024) arxiv.org/abs/2412.09871... · Posted by u/dlojudice

dlojudice · 4 months ago

This BLT approach is why "AI research is stalling" takes are wrong. Dynamic byte-level patches instead of tokens seems genuinely innovative, not just scaling up the same architecture. Better efficiency AND handling edge cases better? Actual progress. The field is still finding clever ways to rethink fundamentals.

spindump8930 · 4 months ago

This paper is very cool, comes from respected authors, and is a very nice idea with good experiments (flop controlled for compute). It shouldn't be seen as a wall-breaking innovation though. From the paper:

> Existing transformer libraries and codebases are designed to be highly efficient for tokenizer-based transformer architectures. While we present theoretical flop matched experiments and also use certain efficient implementations (such as FlexAttention) to handle layers that deviate from the vanilla transformer architecture, our implementations may yet not be at parity with tokenizer-based models in terms of wall-clock time and may benefit from further optimizations.

And unfortunately wall-clock deficiencies mean that any quality improvement needs to overcome that additional scaling barrier before any big runs (meaning expensive) can risk using it.

spindump8930 commented on Byte latent transformer: Patches scale better than tokens (2024) arxiv.org/abs/2412.09871... · Posted by u/dlojudice

mdaniel · 4 months ago

I've secretly wondered if the next (ahem) quantum leap in output quality will arrive with quantum computing wherein answering 10,000 if statements simultaneously would radically change the inference pipeline

But I am also open to the fact that I may be thinking of this in terms of 'faster horses' and not the right question

spindump8930 · 4 months ago

It's not clear how your perception of quantum computing would lead to 'faster horses' in the current view of NN architectures - keep in mind that the common view of 'exploring many paths simultaneously' is at best an oversimplification (https://scottaaronson.blog/?p=2026).

That said, perhaps advances in computing fundamentals would lead to something entirely new (and not at all horselike).

spindump8930 commented on When ChatGPT broke the field of NLP: An oral history quantamagazine.org/when-c... · Posted by u/mathgenius

hn_throwaway_99 · 4 months ago

This is pretty much correct. I'd have to search for it, but I remember an article from a couple of years back that detailed how LLMs blew up the field of NLP overnight.

Although I'd also offer a slightly different lens through which to look at the reaction of other researchers. There's jealousy, sure, but overnight a ton of NLP researchers basically had to come to terms with the fact that their research was useless, at least from a practical perspective.

For example, imagine you just got your PhD in machine translation, which took you 7 years of laboring away in grad/post grad work. Then something comes out that can do machine translation several orders of magnitude better than anything you have proposed. Anyone can argue about what "understanding" means until they're blue in the face, but for machine translation, nobody really cares that much - people just want to get text in another language that means the same thing as the original language, and they don't really care how.

The majority of research leads to "dead ends", but most folks understand that's the nature of research, and there is usually still value in discovering "OK, this won't work". Usually, though, this process is pretty incremental. With LLMs, all of a sudden you had lots of folks whose life's work was pretty useless (again, from a practical perspective), and that'd be tough for anyone to deal with.

spindump8930 · 4 months ago

You might be thinking of this article by Sebastian Ruder: https://www.ruder.io/nlp-imagenet/

Note that the author has a background spanning a lot of the timespans/topics discussed - much work in multilingual NLP, translation, and more recently at DeepMind, Cohere, and Meta (in other words, someone with a great perspective on everything in the top article).

Re: Machine Translation, note that Transformers were introduced for this task, and built on one of the earlier notions of attention in sequence models: https://arxiv.org/abs/1409.0473 (2014, 38k citations)

That's not to say there weren't holdouts or people who really were "hurt" by a huge jump in MT capability - just that this is a logical progression in language understanding methods as seen by some folks (though who could have predicted the popularity of chat interfaces).

spindump8930 commented on Lossless LLM compression for efficient GPU inference via dynamic-length float arxiv.org/abs/2504.11651... · Posted by u/CharlesW

sroussey · 4 months ago

True, but their research did include running locally on a 5080.

The big takeaway, in my opinion, is that their technique for LUTs, etc. could also be applied to lossy quants. Say, maybe you get 5-bit accuracy at the size of 4-bit?

I don’t know, but maybe? Also, their two-stage design might make current quantized kernel designs better.

spindump8930 · 4 months ago

Yes, it could be stacked on quants. It might be that quantized activations already are more "dense" and so they can't be compressed as much (from 16 -> ~11 bits), but certainly possible.
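
A rough sketch of where a figure like ~11 bits can come from, assuming the saving comes from entropy-coding the 8 exponent bits of bf16 while the sign bit and 7 mantissa bits are stored as-is. The Gaussian weights (standard deviation 0.02) are a stand-in and not taken from any real model, and bf16 is approximated by truncating float32, which has the same exponent field.

```python
import math
import random
import struct
from collections import Counter

def bf16_exponent(x):
    """8-bit exponent field of x as float32 (truncated bf16 keeps the same field)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return (bits >> 23) & 0xFF

def entropy_bits(counts):
    """Shannon entropy in bits of an empirical distribution."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(100_000)]  # hypothetical weights

exponent_entropy = entropy_bits(Counter(bf16_exponent(w) for w in weights))
# 1 sign bit + 7 mantissa bits kept raw, exponent replaced by its entropy.
estimated_bits = 1 + 7 + exponent_entropy
print(f"exponent entropy ~ {exponent_entropy:.2f} bits, "
      f"total ~ {estimated_bits:.2f} bits per weight")
```

Running the same measurement on already-quantized values would be one way to check how much slack is left to compress.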

spindump8930 commented on Lossless LLM compression for efficient GPU inference via dynamic-length float arxiv.org/abs/2504.11651... · Posted by u/CharlesW

hchja · 4 months ago

This is pretty useless in any case that doesn’t involve BFloat16 models

spindump8930 · 4 months ago

bf16 is the de facto default datatype and distribution format for LLMs, which are then often eagerly quantized by users with more limited hardware. See the recent Llama releases and, e.g., the H100 spec sheet (advertised flops and metrics target bf16).