Thanks in advance; much appreciated.
To make things worse, low attention values also have very low gradients, so it takes many weight updates to undo that kind of mistake. On the other hand, subtracting the outputs of two softmaxes lets the model assign a weight of exactly zero to some values while keeping a reasonable gradient flowing through.
So the model already knows what is noise, but a single softmax makes it harder to exclude it.
Moreover, with a single softmax the output of every head is forced to stay inside the convex hull of the value vectors, whereas with this variant each head can choose its own \lambda, shifting the "range" of its outputs beyond the convex hull predetermined by the values. This makes the model as a whole more expressive.
If we’re subtracting one attention matrix from another, we’d end up with attention scores between -1 and 1, and the probability that any single entry lands exactly on 0 is effectively zero.
What’s more, the learnable parameter \lambda is allowed to go negative, which lets the model learn to effectively add the two attention maps instead, making a score of exactly 0 impossible.
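For concreteness, here is a minimal NumPy sketch of the difference-of-softmaxes idea being debated above. The shapes, the random inputs, and the fixed value of lambda are all illustrative assumptions, not the paper's actual parameterization:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, d = 6, 16  # sequence length, head dimension (made-up sizes)
q1, q2 = rng.normal(size=(n, d)), rng.normal(size=(n, d))
k1, k2 = rng.normal(size=(n, d)), rng.normal(size=(n, d))
lam = 0.5  # stand-in for the learnable scalar

a1 = softmax(q1 @ k1.T / np.sqrt(d))
a2 = softmax(q2 @ k2.T / np.sqrt(d))
weights = a1 - lam * a2  # differential attention weights

# Each row of a1 and a2 sums to 1, so each row of `weights` sums to
# exactly 1 - lam, individual entries lie in [-lam, 1], and with
# continuous inputs no entry is exactly 0.
print(weights.sum(axis=-1))
print(weights.min(), weights.max())
```

With a negative lam the two maps add instead of subtract, which is the point made just above.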
Seeing the circuitry of a computer in this way helped me to understand that computers operated by means other than pure magic. And, the video I saw was much less descriptive of how a computer works than the one the OP linked. So, although neither video amounts to a full college course on the topic, there’s still a lot of value in their ability to expose people to the topic. It’s inspiring to see how computers are mostly a composition of NAND gates, and to compare the massive structures in the videos with the microprocessors of the real world.
I'm just completely baffled how anything in the training procedure could allow the LLM to learn information about the structure of tokens. Does the tokenization process not treat every token (which I thought usually maps to a word) as an "opaque blob"?
1) An English dictionary as input.
2) List of words that start with "app" wiki page as input.
3) Other alphabetically sorted pieces of text.
4) Elementary school spelling homework.
5) Papers on glyphs, diphthongs, and other phonetic concepts.
You begin to recognize that the tokens in these lists appear near each other in this strange context. You've hardly ever seen token 11346 ("apple") and token 99015 ("appli") this close to each other before. But you see it frequently enough that you decide to nudge these two tokens' embeddings closer to one another.
Your ability to predict the next token in a sequence has improved. You have no idea why these two tokens are close every ten millionth training example. Your word embeddings start to encode spelling information. Your word embeddings start to encode handwriting information. Your word embeddings start to encode phonic information. You've never seen or heard the actual word, "apple". But, after enough training, your embeddings contain enough information so that if you're asked, ["How do", "you", "spell", "apple"], you are confident as you proclaim ["a", "p", "p", "l", "e", "."] as the obvious answer.
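A toy sketch of that nudging, in the style of a word2vec attraction step between co-occurring tokens. The token ids come from the comment above; the embedding dimension, learning rate, and update rule are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 8  # tiny embedding dimension for the example
# Hypothetical random initial embeddings for "apple" and "appli".
emb = {11346: rng.normal(size=dim), 99015: rng.normal(size=dim)}

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

before = cos(emb[11346], emb[99015])

# Repeated co-occurrence: each step pulls the two vectors toward
# each other by a fraction of their difference.
lr = 0.1
for _ in range(50):
    grad = emb[11346] - emb[99015]
    emb[11346] -= lr * grad
    emb[99015] += lr * grad

after = cos(emb[11346], emb[99015])
print(before, after)  # cosine similarity increases toward 1
```

Real training moves embeddings via the loss gradient over billions of examples, but the direction of the effect is the same: tokens that predict each other drift together.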
I sat down and worked it out. What do you know: the golden ratio.
Oh, and this other number, -0.618. Anyone know what it's good for?
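Assuming the two numbers here are the golden ratio and its conjugate (a guess at what was worked out), they are exactly the two roots of x^2 = x + 1, and -0.618... is both 1 - phi and -1/phi:

```python
# Roots of x**2 - x - 1 = 0 via the quadratic formula.
phi = (1 + 5 ** 0.5) / 2    # 1.618...
other = (1 - 5 ** 0.5) / 2  # -0.618...

# Both satisfy x**2 = x + 1, and `other` equals -1/phi.
print(round(phi, 3), round(other, 3))  # 1.618 -0.618
```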