One thing I found out is that getting calibrated accuracy beyond 0.1% is hard and expensive despite having all that precision.
This is contrary to what I've seen in a large ML shop, where architectural tuning was king.
Another approach I've seen is the "Diff transformer" from MS Research (https://github.com/microsoft/unilm/tree/master/Diff-Transfor...).
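If I understood the paper right, the core trick is computing two attention maps and subtracting one from the other to cancel out "noise" attention mass. A rough single-head sketch just to illustrate the idea (the real thing is multi-head with a learnable lambda plus group norm):

```python
import torch
import torch.nn.functional as F

def diff_attention(x, Wq1, Wk1, Wq2, Wk2, Wv, lam=0.8):
    # Two independent QK projections give two attention maps;
    # subtracting the second (scaled by lam) suppresses attention
    # that both maps assign to irrelevant context.
    d = Wq1.shape[-1]
    a1 = F.softmax((x @ Wq1) @ (x @ Wk1).transpose(-1, -2) / d**0.5, dim=-1)
    a2 = F.softmax((x @ Wq2) @ (x @ Wk2).transpose(-1, -2) / d**0.5, dim=-1)
    return (a1 - lam * a2) @ (x @ Wv)

# toy usage with random projections (seq_len=8, d_model=64, d_head=32)
x = torch.randn(8, 64)
W = lambda: torch.randn(64, 32) / 8
out = diff_attention(x, W(), W(), W(), W(), W())  # -> shape (8, 32)
```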
There's also NoPE (no positional embeddings). I think SmolLM3 "uses NoPE", i.e. skips positional encoding entirely, in every fourth layer.
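Just to make the layout concrete, here's a tiny sketch of what "NoPE every fourth layer" would mean; the layer count and indexing convention are my assumptions, not SmolLM3's actual config:

```python
def uses_rope(layer_idx: int, nope_every: int = 4) -> bool:
    # Skip positional encoding (NoPE) on every `nope_every`-th layer,
    # apply RoPE on all the others.
    return (layer_idx + 1) % nope_every != 0

# For a hypothetical 36-layer model, these layers would get no positional
# encoding at all; the rest use RoPE as usual.
nope_layers = [i for i in range(36) if not uses_rope(i)]
print(nope_layers)  # [3, 7, 11, 15, 19, 23, 27, 31, 35]
```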
I understand that you can get highly power-efficient XORs, for example. But if we go down this path, would they help with a matrix multiply? Or the bias term of an FFN? Would there be any improvement (i.e. is there anything to offload) in regular business logic? Should I think of it as a more efficient but highly limited DSP? Or a fixed-function accelerator replacement (e.g. "we want to encrypt this segment of memory")?
The tokens themselves are a form of compression. Let's say we have the word "WaffleHouse": at the character level this would be 11 tokens, but with a subword tokenizer it would be perhaps 2 or 3 tokens (I didn't actually run it through the tokenizer, but we could verify precisely). This matters a lot for on-device processing especially.
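For what it's worth, this is easy to check. A quick sketch, assuming the SmolLM3 tokenizer is available on the Hub under the id below (swap in whichever tokenizer you actually care about):

```python
from transformers import AutoTokenizer

# Assumed model id; any subword tokenizer will show the same effect.
tok = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM3-3B")
ids = tok.encode("WaffleHouse", add_special_tokens=False)
print(len(ids), tok.convert_ids_to_tokens(ids))   # a few subword tokens
print(len("WaffleHouse"))                         # 11 at the character level
```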
So while we could get more intelligence out of the model by shifting parameters from the embedding table into "knowledge" parameters, the smaller vocabulary means the device would need to process more input and output tokens.
Another advantage on small devices is that the embeddings are just a lookup table, which requires little to no computation. It's the rest of the parameters that have the expensive matrix multiplications, so if we increased those we'd also be increasing the number of FLOPs needed for a forward pass.
This blog post explains it well. https://www.adamcasson.com/posts/transformer-flops
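The back-of-the-envelope version: embedding lookups are roughly free, each non-embedding weight costs about 2 FLOPs per token in the forward pass, and there's an attention term that grows with context length. A rough sketch (my own simplification, not the post's exact formula):

```python
def forward_flops_per_token(n_params: int, n_vocab: int, d_model: int,
                            n_layers: int, ctx_len: int) -> int:
    # The embedding table is a lookup, so exclude it from the matmul cost.
    non_embed = n_params - n_vocab * d_model
    # ~2 FLOPs (multiply + add) per non-embedding weight per token,
    # plus the QK^T / AV attention-score work over the context.
    return 2 * non_embed + 2 * n_layers * ctx_len * d_model

# Made-up numbers purely for illustration, not any model's real config:
print(forward_flops_per_token(3_000_000_000, 128_000, 2048, 36, 4096))
```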
So all this to say: there are definite tradeoffs between model size, performance on evals, and compute cost. We ran many internal experiments with different choices to see what could work well, and then picked what we believed would work best for the open community.