ModelForge commented on Gemma 3 270M re-implemented in pure PyTorch for local tinkering   github.com/rasbt/LLMs-fro... · Posted by u/ModelForge
ladberg · 3 days ago
Given that the compiled version is slower than the eager version on A100, there's definitely something suboptimal happening there
ModelForge · 3 days ago
No, the compiled version is actually faster.

From that table, the A100 tok/sec numbers (higher is faster) are:

- Eager: 28

- Compiled: 128

And

- KV cache eager: 26

- KV cache compiled: 99

The reason the KV-cache variant is slower is likely that it's not GPU-optimized code. On CPU, the KV cache is faster. To make it faster on GPU, you would, for example, pre-allocate the tensors on the device instead of `torch.cat`-ing them on the fly.
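Roughly, the difference looks like this (not the repo's actual code, just a minimal sketch with made-up shapes):

```python
import torch

# Made-up shapes: batch, heads, head_dim, max sequence length
B, H, D, MAX_LEN = 1, 4, 64, 2048
new_kv = torch.randn(B, H, 1, D)  # keys (or values) for one decoding step

# Naive: torch.cat re-allocates and copies the whole cache on every step.
cache = torch.empty(B, H, 0, D)
cache = torch.cat([cache, new_kv], dim=2)  # O(current_len) copy per step

# GPU-friendlier: pre-allocate the full buffer once, then write in place.
cache = torch.empty(B, H, MAX_LEN, D)
pos = 0
cache[:, :, pos:pos + 1, :] = new_kv  # in-place slice assignment, no realloc
pos += 1
kv = cache[:, :, :pos, :]  # view of the valid prefix, no copy
```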

ModelForge commented on Gemma 3 270M re-implemented in pure PyTorch for local tinkering   github.com/rasbt/LLMs-fro... · Posted by u/ModelForge
lsb · 3 days ago
That’s wild that with a KV cache and compilation on the Mac CPU you are faster than on an A100 GPU.
ModelForge · 3 days ago
Could be an artifact of the model's small size not fully utilizing the GPU. For example, for the slightly larger Qwen3 0.6B model, the A100 is faster (you can see it by scrolling to the bottom here: https://github.com/rasbt/LLMs-from-scratch/tree/main/ch05/11...)
ModelForge commented on Gemma 3 270M re-implemented in pure PyTorch for local tinkering   github.com/rasbt/LLMs-fro... · Posted by u/ModelForge
kace91 · 3 days ago
This might be a very basic question, but as a dev whose only interaction with models is using the main commercial ones (Sonnet, ChatGPT, and the like), what are some use cases for these smaller local models?

What usage is reasonable to expect from them? Are there uses out of the box, or does one have to go through some custom post-training to get useful behavior?

I feel like there is a huge gap between understanding models as a user of commercial tools and the kind of discussions happening in these threads, but I’m not sure what the in-between steps are.

ModelForge · 3 days ago
I'd say the common ones (besides educational) are:

- private, on-device models (possibly with lower latency than models via web API); also edge devices

- algorithm research (faster and cheaper to prototype new ideas)

- cheap tasks like classification/categorization; sure, you don't need a decoder-style LLM for that, but it has the advantage of being more free-form, which is useful in many scenarios; or a sanity checker for grammar; or even a router to other models (GPT-5 style); see the sketch below for the classification case
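For the classification case, here's a minimal sketch using Hugging Face `transformers` (the model ID is an assumption; swap in whichever small instruct model you actually run locally):

```python
from transformers import pipeline

# Hypothetical example: a small local instruct model as a free-form classifier.
# The model ID is an assumption; swap in whatever small model you have locally.
clf = pipeline("text-generation", model="google/gemma-3-270m-it")

prompt = (
    "Classify the sentiment of this review as positive, negative, or neutral. "
    "Answer with one word.\n\n"
    "Review: The battery died after two days.\nSentiment:"
)
print(clf(prompt, max_new_tokens=3)[0]["generated_text"])
```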

ModelForge commented on GPT-OSS vs. Qwen3 and a detailed look how things evolved since GPT-2   magazine.sebastianraschka... · Posted by u/ModelForge
oezi · 13 days ago
One question I was wondering about regarding the open models released by big labs is how much more they could improve with additional training. GPT-OSS had 2.1M hours of training; how much score improvement could we see at double that?
ModelForge · 13 days ago
I think GPT-4.5 was potentially the original GPT-5 model that was larger and pre-trained on more data. Too bad it was too expensive to deploy at scale, so we never saw the RL-ed version
ModelForge commented on GPT-OSS vs. Qwen3 and a detailed look how things evolved since GPT-2   magazine.sebastianraschka... · Posted by u/ModelForge
poorman · 13 days ago
This article really goes into a lot of detail, which is nice. gpt-oss is just not good for agentic use, in my observation.

tl;dr: I'll save you a lot of time trying things out for yourself. If you are on a >=32 GB Mac, download LM Studio and then the `qwen3-coder-30b-a3b-instruct-mlx@5bit` model. It uses ~20 GB of RAM, so a 32 GB machine is plenty. Set it up with opencode [1] and you're off to the races! It has great tool-calling ability; gpt-oss doesn't even come close in my observations. (A quick client sketch against the local server follows below.)

[1] https://opencode.ai/
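Once the LM Studio local server is running, you can hit it like any OpenAI-compatible endpoint. A minimal sketch (the default port and the exact model name are assumptions; check what LM Studio reports):

```python
from openai import OpenAI

# Hypothetical sketch: LM Studio exposes an OpenAI-compatible server locally
# (default http://localhost:1234/v1); the model name is an assumption.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

resp = client.chat.completions.create(
    model="qwen3-coder-30b-a3b-instruct-mlx@5bit",
    messages=[{"role": "user", "content": "Plan: rename a function across a repo."}],
)
print(resp.choices[0].message.content)
```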

ModelForge · 13 days ago
The Ollama one uses even less (around 13 GB), which is nice. Apparently the gpt-oss team also shared the mxfp4 optimizations for Metal.
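If you want to script against it, a minimal sketch using Ollama's Python package (the model tag is an assumption; check `ollama list` for what you actually pulled):

```python
import ollama

# Hypothetical sketch: querying a locally pulled model through Ollama's
# Python package. The model tag is an assumption; check `ollama list`.
resp = ollama.chat(
    model="gpt-oss:20b",
    messages=[{"role": "user", "content": "Summarize this diff in one line: ..."}],
)
print(resp["message"]["content"])
```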
ModelForge commented on GPT-OSS vs. Qwen3 and a detailed look how things evolved since GPT-2   magazine.sebastianraschka... · Posted by u/ModelForge
Scene_Cast2 · 13 days ago
I find it interesting that the architectures of modern open-weight LLMs are so similar, and that most innovation seems to be happening on the training (data, RL) front.

This is contrary to what I've seen in a large ML shop, where architectural tuning was king.

ModelForge · 13 days ago
Good point. LLMs lower the barrier to entry for anyone with enough resources, because these architectures are robust to tweaks if you throw enough compute and data at them. You can even violate scaling laws and still get a good model (as Llama 3 showed back then)
ModelForge commented on GPT-OSS vs. Qwen3 and a detailed look how things evolved since GPT-2   magazine.sebastianraschka... · Posted by u/ModelForge
mhitza · 13 days ago
I've been lightly using gpt-oss-20b, but what I've found is that for smaller (single-sentence) prompts it was easy enough to make it loop infinitely. Since I'm running it with llama.cpp, I've set a small repetition penalty and haven't encountered those issues since (I'm only using it a couple of times a day to analyze diffs, so I might have just gotten lucky). A sketch of the setting follows below.
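For reference, here's roughly what that looks like through the llama-cpp-python bindings (the model path is a placeholder; llama.cpp's CLI exposes the same knob as `--repeat-penalty`):

```python
from llama_cpp import Llama

# Hypothetical sketch: a mild repetition penalty to damp infinite loops.
# The model path is a placeholder for a local GGUF file.
llm = Llama(model_path="./gpt-oss-20b.gguf")

out = llm(
    "Analyze this diff: ...",
    max_tokens=512,
    repeat_penalty=1.1,  # values > 1.0 penalize recently generated tokens
)
print(out["choices"][0]["text"])
```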
ModelForge · 13 days ago
I’ve been using the Ollama version (uses about 13 GB of RAM on macOS) and haven’t had that issue yet. I wonder if that’s maybe an issue with the llama.cpp port?
