ModelForge (u/ModelForge)

ModelForge commented on Gemma 3 270M re-implemented in pure PyTorch for local tinkering github.com/rasbt/LLMs-fro... · Posted by u/ModelForge

ladberg · 3 days ago

Given that the compiled version is slower than then eager version on A100, there's definitely something suboptimal happening there

ModelForge · 3 days ago

No the compiled version is actually faster.

From that table, the A100 tok/sec (larger is faster) numbers are:

- Eager: 28

- Compiled: 128

And

- KV cache eager: 26

- KV cache compiled: 99

The reason that the KV cache is slower is likely because it's not GPU-optimized code. On CPU the KV cache is faster. To make it faster on GPU, you would pre-allocate the tensors on the device for example instead of `torch.cat`ting them on the fly

ModelForge commented on Gemma 3 270M re-implemented in pure PyTorch for local tinkering github.com/rasbt/LLMs-fro... · Posted by u/ModelForge

lsb · 3 days ago

That’s wild that with a KV cache and compilation on the Mac CPU you are faster than on an A100 GPU.

ModelForge · 3 days ago

Could be an artifact of the small size not fully taking advantage of the GPU. For example, for the slightly larger Qwen3 0.6B model the A100 is faster (you can see it when scrolling to the bottom here: https://github.com/rasbt/LLMs-from-scratch/tree/main/ch05/11...)

ModelForge commented on Gemma 3 270M re-implemented in pure PyTorch for local tinkering github.com/rasbt/LLMs-fro... · Posted by u/ModelForge

kace91 · 3 days ago

This might be a very basic question, but as a dev whose only interaction with models is using the main commercial ones (sonnet, ChatGPT and the like), what are some usecases for these smaller local models?

What usages can be reasonable to expect from them? Are there uses out of the box or does one have to go through some custom post-training to get useful behavior?

I feel like there is a huge gap between understanding models as a user of commercial tools and the kind of discussions happening in these threads, but I’m not sure what are the in-between steps.

ModelForge · 3 days ago

I'd say the common ones (besides educational) are

- private, on-device models (possibly with lower latency than models via web API); also edge devices

- algorithm research (faster and cheaper to prototype new ideas)

- cheap tasks, like classification/categorization; sure, you don't need a decoder-style LLM for that, but it has the advantage of being more free-form, which is useful in many scenarios; or maybe a sanity checker for grammar; or even a router to other model (GPT-5 style)

ModelForge commented on GPT-OSS vs. Qwen3 and a detailed look how things evolved since GPT-2 magazine.sebastianraschka... · Posted by u/ModelForge

oezi · 13 days ago

One question I was wondering about regarding the open models released by big labs is how much more the could improve with additional training. GPT-OSS has 2.1m hours of training, how much score improvements could we see at double that?

ModelForge · 13 days ago

I think GPT-4.5 was potentially the original GPT-5 model that was larger and pre-trained on more data. Too bad it was too expensive to deploy at scale so that we never saw the RL-ed version

ModelForge commented on GPT-OSS vs. Qwen3 and a detailed look how things evolved since GPT-2 magazine.sebastianraschka... · Posted by u/ModelForge

poorman · 13 days ago

This article really goes into a lot of detail which is nice. gpt-oss is just not good for agentic use in my observation.

tldr; I'll save you a lot of time trying things out for yourself. If you are on a >=32 GB Mac download LMStudio and then the `qwen3-coder-30b-a3b-instruct-mlx@5bit` model. It uses ~20 GB of RAM so a 32GB machine is plenty. Set it up with opencode [1] and you're off to the races! It has great tool calling ability. The tool calling ability of gpt-oss doesn't even come close in my observations.

[1] https://opencode.ai/

ModelForge · 13 days ago

The ollama one uses even less (around 13 GB), which is nice. Apparently the gpt-oss team also shared the mxfp4 optimizations for metal

ModelForge commented on GPT-OSS vs. Qwen3 and a detailed look how things evolved since GPT-2 magazine.sebastianraschka... · Posted by u/ModelForge

Scene_Cast2 · 13 days ago

I find it interesting that the architectures of modern open weight LLMs are so similar, and that most innovation seems to be happening on the training (data, RL) front.

This is contrary to what I've seen in a large ML shop, where architectural tuning was king.

ModelForge · 13 days ago

Good point. LLMs lower the barrier to entry if someone has enough resources because those architectures are more robust to tweaks given one throws enough compute and data at them. You can even violate scaling laws and still get a good model (like Llama 3 showed back then)

ModelForge commented on GPT-OSS vs. Qwen3 and a detailed look how things evolved since GPT-2 magazine.sebastianraschka... · Posted by u/ModelForge

mhitza · 13 days ago

I've been using lightly gpt-oss-20b but what I've found is that for smaller (single sentence) prompts it was easy enough to have it loop infinitely. Since I'm running it with llama.cpp I've set a small repetition penalty and haven't encountered those issues since (I'm using it a couple of times a day to analyze diffs, so I might have just gotten lucky since)

ModelForge · 13 days ago

I’ve been using the ollama version (uses about 13 Gb RAM on macOS) and haven’t had that issue yet. I wonder if that’s maybe an issue of the llama.cpp port?