cco · 18 days ago
The lede is being missed imo.

gpt-oss:20b is a top ten model on MMLU (right behind Gemini 2.5 Pro), and I just ran it locally on my MacBook Air M3 from last year.

I've been experimenting with a lot of local models, both on my laptop and on my phone (Pixel 9 Pro), and I figured we'd be here in a year or two.

But no, we're here today. A basically frontier model, running for the cost of electricity (free with a rounding error) on my laptop. No $200/month subscription, no lakes being drained, etc.

I'm blown away.

int_19h · 18 days ago
I tried 20b locally and it couldn't reason its way out of a basic river crossing puzzle with labels changed. That is not anywhere near SOTA. In fact it's worse than many local models that can do it, including e.g. QwQ-32b.
dragonwriter · 18 days ago
> In fact it's worse than many local models that can do it, including e.g. QwQ-32b.

I'm not going to be surprised that a 20B 4/32 MoE model (3.6B parameters activated) is less capable at a particular problem category than a 32B dense model, and it's quite possible for both to be SOTA, as the state of the art at different scales (both parameter count and speed, which scales with active resource needs) is going to have different capabilities. TANSTAAFL.

robwwilliams · 18 days ago
Well river crossings are one type of problem. My real world problem is proofing and minor editing of text. A version installed on my portable would be great.
CMay · 18 days ago
The 20b, set to high reasoning, solved the wolf, goat, cabbage river crossing puzzle for me without needing to use a system prompt that encourages critical thinking. It managed it across multiple different recommended settings, from temperatures of 0.6 up to 1.0, etc.

Other models have generally failed that without a system prompt that encourages rigorous thinking. Each of the reasoning settings may very well have thinking guidance baked in there that do something similar, though.

I'm not sure it says that much that it can solve this, since it's public and can be in training data. It does say something if it can't solve it, though. So, for what it's worth, it solves it reliably for me.

Think this is the smallest model I've seen solve it.

9rx · 18 days ago
I tried the "two US presidents having the same parents" one, and while it understood the intent, it got caught up in being adamant that Joe Biden won the election in 2024; anything I did to try and tell it otherwise was dismissed as false, and it quite definitely insisted that I need to do proper research with legitimate sources.
tweezy · 17 days ago
Just out of curiosity, did you set the reasoning effort to high?

I've seen big jumps in performance between medium and high reasoning. Not that this will solve your particular issue, but curious to see what settings you had.

Edit: I gave this a shot. My settings were temperature = 0.00 (not sure if these models respect temp settings but did it just in case), I also set reasoning effort to high. I'm using the groq-hosted version.

This was my prompt

"You have a fox, a chicken and some grain. The chicken will eat the grain if left alone with it. The fox is vegetarian, for religious reasons, and will also eat the grain if left alone with it. You need to get all across a river, but your boat can only take one at a time. How do you proceed?"

This was the response:

"*Solution – 7 moves*

| Move | What you take | What’s left on the left bank | What’s on the right bank |

|------|--------------|----------------------------|------------------------|

| 1 | *Take the grain* across | Fox, Chicken | *Farmer + Grain* |

| 2 | *Return alone* | *Farmer + Fox + Chicken* | Grain |

| 3 | *Take the fox* across | Chicken | *Farmer + Fox + Grain* |

| 4 | *Bring the grain back* | *Farmer + Grain + Chicken* | Fox |

| 5 | *Take the chicken* across | Grain | *Farmer + Chicken + Fox* |

| 6 | *Return alone* | *Farmer + Grain* | Chicken + Fox |

| 7 | *Take the grain* across | – | *Farmer + Fox + Chicken + Grain* |

*Explanation*

- The grain is the only thing that can be eaten, and it cannot be left alone with either the fox or the chicken. - So the grain must always be with the farmer whenever any other animal is on the same side.

The sequence above never leaves the grain with the fox or the chicken without the farmer, and after the seventh move all three items are safely on the far side."

vonneumannstan · 17 days ago
This kind of stuff is so tired. Who cares if it can't solve your silly riddle? It can probably do competitive coding at a world class level and we're quibbling over child riddles? Yeah you know my backhoe is really bad at cutting my toe nails, what a PoS.
captainregex · 18 days ago
I’m still trying to understand what is the biggest group of people that uses local AI (or will)? Students who don’t want to pay but somehow have the hardware? Devs who are price conscious and want free agentic coding?

Local, in my experience, can’t even pull data from an image without hallucinating (Qwen 2.5 VL in that example). Hopefully local/small models keep getting better and devices get better at running bigger ones

It feels like we do it because we can more than because it makes sense, which I am all for! I just wonder if I'm missing some kind of major use case all around me that justifies chaining together a bunch of Mac Studios or buying a really great graphics card. Tools like exo are cool and the idea of distributed compute is neat, but what edge cases truly need it so badly that it's worth all the effort?

wizee · 18 days ago
Privacy, both personal and for corporate data protection is a major reason. Unlimited usage, allowing offline use, supporting open source, not worrying about a good model being taken down/discontinued or changed, and the freedom to use uncensored models or model fine tunes are other benefits (though this OpenAI model is super-censored - “safe”).

I don’t have much experience with local vision models, but for text questions the latest local models are quite good. I’ve been using Qwen 3 Coder 30B-A3B a lot to analyze code locally and it has been great. While not as good as the latest big cloud models, it’s roughly on par with SOTA cloud models from late last year in my usage. I also run Qwen 3 235B-A22B 2507 Instruct on my home server, and it’s great, roughly on par with Claude 4 Sonnet in my usage (but slow of course running on my DDR4-equipped server with no GPU).

xrmagnum · 18 days ago
It's striking how much of the AI conversation focuses on new use cases, while overlooking one of the most serious non-financial costs: privacy.

I try to be mindful of what I share with ChatGPT, but even then, asking it to describe my family produced a response that was unsettling in its accuracy and depth.

Worse, after attempting to delete all chats and disable memory, I noticed that some information still seemed to persist. That left me deeply concerned—not just about this moment, but about where things are headed.

The real question isn't just "what can AI do?"—it's "who is keeping the record of what it does?" And just as importantly: "who watches the watcher?" If the answer is "no one," then maybe we shouldn't have a watcher at all.

canvascritic · 18 days ago
Healthcare organizations that can't (easily) send data over the wire while remaining in compliance

Organizations operating in high stakes environments

Organizations with restrictive IT policies

To name just a few -- well, the first two are special cases of the last one

RE your hallucination concerns: the issue is overly broad ambitions. Local LLMs are not general purpose -- if what you want is local ChatGPT, you will have a bad time. You should have a highly focused use case, like "classify this free text as A or B" or "clean this up to conform to this standard": this is the sweet spot for a local model

wubrr · 18 days ago
If you're building any kind of product/service that uses AI/LLMs, the answer is the same as why any company would want to run any other kind of OSS infra/service instead of relying on some closed proprietary vendor API.

  - Costs.
  - Rate limits.
  - Privacy.
  - Security.
  - Vendor lock-in.
  - Stability/backwards-compatibility.
  - Control.
  - Etc.

somenameforme · 18 days ago
Why not turn the question around. All other things being equal, who would prefer to use a rate limited and/or for-pay service if you could obtain at least comparable quality locally for free with no limitations, no privacy concerns, no censorship (beyond that baked into the weights you choose to use), and no net access required?

It's a pretty bad deal. So it must be that all other things aren't equal, and I suppose the big one is hardware. But neural net based systems always have a point of sharply diminishing returns, which we seem to have unambiguously hit with LLMs already, while the price of hardware is constantly decreasing and its quality increasing. So as we go further into the future, the practicality of running locally will only increase.

cco · 18 days ago
> I’m still trying to understand what is the biggest group of people that uses local AI (or will)?

Well, the model makers and device manufacturers of course!

While your Apple, Samsung, and Googles of the world will be unlikely to use OSS models locally (maybe Samsung?), they all have really big incentives to run models locally for a variety of reasons.

Latency, privacy (Apple), cost to run these models on behalf of consumers, etc.

This is why Google started shipping 16GB as the _lowest_ amount of RAM you can get on your Pixel 9. That was a clear flag that they're going to be running more and more models locally on your device.

As mentioned, while it seems unlikely that US-based model makers or device manufacturers will use OSS models, they'll certainly be targeting local models heavily on consumer devices in the near future.

Apple's framework of local first, then escalate to ChatGPT if the query is complex will be the dominant pattern imo.

setopt · 18 days ago
I’m highly interested in local models for privacy reasons. In particular, I want to give an LLM access to my years of personal notes and emails, and answer questions with references to those. As a researcher, there’s lots of unpublished stuff in there that I sometimes either forget or struggle to find again due to searching for the wrong keywords, and a local LLM could help with that.

I pay for ChatGPT and use it frequently, but I wouldn’t trust uploading all that data to them even if they let me. I’ve so far been playing around with Ollama for local use.

barnabee · 18 days ago
~80% of the basic questions I ask of LLMs[0] work just fine locally, and I’m happy to ask twice for the other 20% of queries for the sake of keeping those queries completely private.

[0] Think queries I’d previously have had to put through a search engine and check multiple results for a one word/sentence answer.

benreesman · 18 days ago
"Because you can and its cool" would be reason enough: plenty of revolutions have their origin in "because you can" (Wozniak right off the top of my head, Gates and Altair, stuff like that).

But uncensored is a big deal too: censorship is capability-reducing (check out Kilcher's GPT-4chan video and references, the Orca work and the Dolphin de-tune lift on SWE-Bench-style evals). We pay dearly in capability to get "non-operator-alignment", and you'll notice that competition is hot enough now that at the frontier (Opus, Qwen) the "alignment" away from what operators want is getting very, very mild.

And then there's the compression. Phi-3 or something works on a beefy laptop and has a nontrivial approximation of "the internet" that works on an airplane or a beach with no network connectivity. Talk about vibe coding? I like those look-up-all-the-docs-via-a-thumbdrive-in-Phuket vibes.

And on diffusion stuff, SOTA fits on a laptop or close; you can crush OG Midjourney or SD on a MacBook. It's an even smaller gap.

Early GPT-4 ish outcomes are possible on a Macbook Pro or Razer Blade, so either 12-18 month old LLMs are useless, or GGUF is useful.

The AI goalposts thing cuts both ways. If AI is "whatever only Anthropic can do"? That's just as silly as "whatever a computer can't do", and a lot more cynical.

dcreater · 18 days ago
Why do any compute locally? Everything can just be cloud based right? Won't that work much better and scale easily?

We are not even at that extreme and you can already see the unequal reality that too much SaaS has engendered

philip1209 · 18 days ago
I'm excited to do just dumb and irresponsible things with a local model, like "iterate through every single email in my 20-year-old gmail account and apply label X if Y applies" and not have a surprise bill.

I think it can make LLMs fun.

Roark66 · 18 days ago
People like myself who firmly believe there will come a time, possibly very soon, when all these companies (OpenAI, Anthropic, etc.) will raise their prices substantially. By then no one will be able to do their work to the standard expected of them without AI, and by then maybe they charge $1k per month, maybe they charge $10k. If there is no viable alternative, the sky is the limit.

Why do you think they continue to run at a loss? From the goodness of their heart? Their biggest goal is to discourage anyone from running local models. The hardware is expensive... Running models locally is also genuinely difficult (for example, I have dual RTX 3090s for VRAM, and running large, heavily quantized models is a real pain in the arse; no high-quantisation library supports two GPUs, for example, and there seems to be no interest in implementing it by the guys behind the best inference tools).

So this is welcome, but let's not forget why it is being done.

jedberg · 18 days ago
Pornography, or any other "restricted use". They either want privacy or don't want to deal with the filters on commercial products.

I'm sure there are other use cases, but much like "what is BitTorrent for?", the obvious use case is obvious.

ineedasername · 18 days ago
A local laptop of the past few years without a discrete GPU can run, at practical speeds depending on task, a gemma/llama model if it's (ime) under 4GB.

For practical RAG processes of narrow scope, and with even a minimal amount of scaffolding, that's a very usable speed for automating tasks, especially as the last-mile/edge-device portion of a more complex process with better models in use upstream. Classification tasks, reasonably intelligent decisions between traditional workflow processes, other use cases -- all of them extremely valuable in enterprise, being built and deployed right now.

jona777than · 18 days ago
One of my favorite use cases includes simple tasks like generating effective mock/masked data from real data. Then passing the mock data worry-free to the big three (or wherever.)

There’s also a huge opportunity space for serving clients with very sensitive data. Health, legal, and government come to mind immediately. These local models are only going to get more capable of handling their use cases. They already are, really.

azinman2 · 18 days ago
I'm guessing its largely enthusiasts for now, but as they continue getting better:

1. App makers can fine tune smaller models and include in their apps to avoid server costs

2. Privacy-sensitive content can be either filtered out or worked on... I'm using local LLMs to process my health history for example

3. Edge servers can be running these fine tuned for a given task. Flash/lite models by the big guys are effectively like these smaller models already.

noosphr · 18 days ago
Data that can't leave the premises because it is too sensitive. There is a lot of security theater around cloud pretending to be compliant but if you actually care about security a locked server room is the way to do it.
lynnesbian · 18 days ago
I can provide a real-world example: Low-latency code completion.

The JetBrains suite includes a few LLM models on the order of a hundred megabytes. These models are able to provide "obvious" line completion, like filling in variable names, as well as some basic predictions, like realising that the `if let` statement I'm typing out is going to look something like `if let Some(response) = client_i_just_created.foobar().await`.

If that was running in The Cloud, it would have latency issues, rate limits, and it wouldn't work offline. Sure, there's a pretty big gap between these local IDE LLMs and what OpenAI is offering here, but if my single-line autocomplete could be a little smarter, I sure wouldn't complain.

danielvaughn · 18 days ago
Just imagine the next PlayStation or XBox shipping with these models baked in for developer use. The kinds of things that could unlock.
dsubburam · 18 days ago
> I’m still trying to understand what is the biggest group of people that uses local AI (or will)?

Creatives? I am surprised no one's mentioned this yet:

I tried to help a couple of friends with better copy for their websites, and quickly realized that they were using inventive phrases to explain their work, phrases that they would not want competitors to get wind of and benefit from; phrases that associate closely with their personal brand.

Ultimately, I felt uncomfortable presenting the cloud AIs with their text. Sometimes I feel this way even with my own Substack posts, where I occasionally coin a phrase I am proud of. But with local AI? Cool...

JKCalhoun · 18 days ago
I do it because 1) I am fascinated that I can and 2) at some point the online models will be enshitified — and I can then permanently fall back on my last good local version.
trenchpilgrim · 18 days ago
In some large, lucrative industries like aerospace, many of the hosted models are off the table due to regulations such as ITAR. There's a market for models which are run on prem/in GovCloud with a professional support contract for installation and updates.
lucumo · 18 days ago
I'm in a corporate environment. There's a study group to see if maybe we can potentially get some value out of those AI tools. They've been "studying" the issue for over a year now. They expect to get some cloud service that we can safely use Real Soon Now.

So, it'll take at least two more quarters before I can actually use those non-local tools on company related data. Probably longer, because sense of urgency is not this company's strong suit.

Anyway, as a developer I can run a lot of things locally. Local AI doesn't leak data, so it's safe. It's not as good as the online tools, but for some things it's better than nothing.

sturadnidge · 18 days ago
If you have capable hardware and kids, a local LLM is great. A simple system prompt customisation (e.g. ‘all responses should be written as if talking to a 10 year old’) and knowing that everything is private goes a long way for me at least.
itake · 18 days ago
Local micro models are both fast and cheap. We tuned small models on our data set and if the small model thinks content is a certain way, we escalate to the LLM.

This gives us really good recall at really low cloud cost and latency.

ricardobayes · 17 days ago
I would say, any company who doesn't have their own AI developed. You always hear companies "mandating" AI usage, but for the most part it's companies developing their own solutions/agents. No self-respecting company with tight opsec would allow a random "always-online" LLM that could just rip your codebase, either piece by piece or the whole thing at once if it's an IDE addon (or at least I hope that's the case). So yeah, I'd say locally deployed LLMs/agents are a gamechanger.
seany · 18 days ago
Jailbreaking, then running censored questions. Like DIY fireworks, or analysis of papers that touch "sensitive topics", or NSFW image generation; the list is basically endless.
jlokier · 18 days ago
At the company where I currently work, for IP reasons (and with the advice of a patent lawyer), nobody is allowed to use any online AIs to talk about or help with work, unless it's very generic research that doesn't give away what we're working on.

That rules out coding assistants like Claude, chat, tools to generate presentations and copy-edit documents, and so forth.

But local AI are fine, as long as we're sure nothing is uploaded.

dismalaf · 18 days ago
The use case is building apps.

A small LLM can do RAG, call functions, summarize, create structured data from messy text, etc... You know, all the things you'd do if you were making an actual app with an LLM.

Yeah, chat apps are pretty cheap and convenient for users who want to search the internet and write text or code. But APIs quickly get expensive when inputting a significant amount of tokens.

athrowaway3z · 17 days ago
Don't know about the biggest, but IMO the exciting thing about open models is the possibility of creating whole new things.

For example, "generate a heatmap of each token/word and how 'unexpected' they are" or "find me a prompt that creates the closest match to this text"

To be efficient both require access that is not exposed over API.
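For the first of those, a minimal sketch (assuming a Hugging Face causal-LM checkpoint is available locally; the model name below is just a placeholder, any causal LM works) would score each token by its surprisal, i.e. the negative log-probability the model assigns it given its prefix:

```python
# Sketch: per-token surprisal ("unexpectedness") for a heatmap.
# Assumes a local Hugging Face causal LM; the model name is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "openai/gpt-oss-20b"  # placeholder; any causal LM checkpoint works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
model.eval()

text = "The quick brown fox jumps over the lazy dog."
ids = tok(text, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(ids).logits  # [1, seq_len, vocab]

# Log-prob of each token given its prefix (the first token has no prefix, so skip it).
logprobs = torch.log_softmax(logits[:, :-1, :], dim=-1)
token_logprobs = logprobs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
surprisal = -token_logprobs[0]  # higher = more "unexpected"

for t, s in zip(tok.convert_ids_to_tokens(ids[0, 1:]), surprisal.tolist()):
    print(f"{t!r}: {s:.2f}")
```

The surprisal values are exactly what you'd feed into a per-token heatmap; none of this is exposed by the typical hosted chat APIs, which is the commenter's point.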

adrianwaj · 18 days ago
Use Case?

How about running one on this site but making it publicly available? A sort of outranet and calling it HackerBrain?

georgeecollins · 18 days ago
There's a bunch of great reasons in this thread, but how about the chip manufacturers that are going to need you to need a more powerful set of processors in your phone, headset, computer. You can count on those companies to subsidize some R&D and software development.
cameronh90 · 18 days ago
The cloud AI providers have unacceptable variation in response time for things that need a predictable runtime.

Even if they did offer a defined latency product, you’re relying on a lot of infrastructure between your application and their GPU.

That’s not always tolerable.

sznio · 18 days ago
>Students who don’t want to pay but somehow have the hardware?

that's me - well, not a student anymore. when toying with something, i much prefer not paying for each shot. my 12GB Radeon card can either run a decent but extremely slow model, or an idiotic but fast one. it's nice not dealing with rate limits.

once you write a prompt that mangles an idiotic model into still doing the work, it's really satisfying. the same principle as working to extract the most from limited embedded hardware. masochism, possibly

yreg · 17 days ago
> I’m still trying to understand what is the biggest group of people that will use local AI?

iPhone users in a few months – because I predict app developers will love cramming calls to the foundation models into everything.

Android will follow.

etoxin · 18 days ago
Some app devs use local models on local environments with LLM APIs to get up and running fast, then when the app deploys it switches to the big online models via environment vars.

In large companies this can save quite a bit of money.
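A minimal sketch of that pattern, assuming an OpenAI-compatible local endpoint (Ollama and several other local servers expose one) and illustrative environment variable names:

```python
# Sketch: point the same client at a local OpenAI-compatible server in dev, the cloud in prod.
# Env var names, the local endpoint, and the model name are illustrative.
import os
from openai import OpenAI

client = OpenAI(
    base_url=os.environ.get("LLM_BASE_URL", "http://localhost:11434/v1"),  # local server by default
    api_key=os.environ.get("LLM_API_KEY", "unused-locally"),
)

resp = client.chat.completions.create(
    model=os.environ.get("LLM_MODEL", "gpt-oss:20b"),
    messages=[{"role": "user", "content": "Say hi in five words."}],
)
print(resp.choices[0].message.content)
```

Swapping environments is then just a matter of setting the three variables in deployment, with no code changes.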

m463 · 18 days ago
One use nobody mentions is hybrid use.

Why not run all the models at home, maybe collaboratively or at least in parallel?

I'm sure there are use cases where the paid models are not allowed to collaborate or ask each other.

also, other open models are gaining mindshare.

muzani · 18 days ago
Privacy laws. Processing government paperwork with LLMs for example. There's a lot of OCR tools that can't be used, and the ones that comply are more expensive than say, GPT-4.1 and lower quality.
waterTanuki · 17 days ago
anything involving the medical industry (HIPAA laws), national security (FedRAMP is such a pita to get that some military contractors are bypassing it to get quicker access to cloud tools) etc.

Besides that, we are moving towards an era where we won't need to pay providers a subscription every month to use these models. I can't say for certain whether or not the GPUs that run them will get cheaper, but the option to run your own model is game changing for more than you can possibly imagine.

TrackerFF · 18 days ago
Agencies / firms that work with classified data. Some places have very strict policies on data, which makes it impossible to use any service that isn't local and air-gapped.

example: military intel

cyanydeez · 18 days ago
People who want programmatic solutions that wont be rug pulled
m3kw9 · 18 days ago
I’d use it on a plane if there was no network for coding, but otherwise it’s just an emergency model if the internet goes out, basically end of the world scenarios
julianozen · 18 days ago
worth mentioning that today's expensive hardware will be built into the cheapest iPhone in less than 10 years.

That means running instantly offline and every token is free

nfRfqX5n · 18 days ago
You’re asking the biggest group of people who would want to do this
unethical_ban · 18 days ago
Privacy and equity.

Privacy is obvious.

AI is going to to be equivalent to all computing in the future. Imagine if only IBM, Apple and Microsoft ever built computers, and all anyone else ever had in the 1990s were terminals to the mainframe, forever.

deadbabe · 18 days ago
We use it locally for deep packet inspection.
adastra22 · 14 days ago
Same as the internet: porn.
novok · 18 days ago
Psychs who don't trust AI companies
metanonsense · 18 days ago
Maybe I am too pessimistic, but as an EU citizen I expect politics (or should I say Trump?) to prevent access to US-based frontier models at some point.
mastermage · 18 days ago
I am just a cheapskate that wants to scale back on all subscription costs. I fucking hate subscriptions.
shironandonon_ · 18 days ago
air gaps, my man.

dongobread · 18 days ago
How up to date are you on current open weights models? After playing around with it for a few hours I find it to be nowhere near as good as Qwen3-30B-A3B. The world knowledge is severely lacking in particular.
Nomadeon · 18 days ago
Agree. Concrete example: "What was the Japanese codeword for Midway Island in WWII?"

Answer on Wikipedia: https://en.wikipedia.org/wiki/Battle_of_Midway#U.S._code-bre...

dolphin3.0-llama3.1-8b Q4_K_S [4.69 GB on disk]: correct in <2 seconds

deepseek-r1-0528-qwen3-8b Q6_K [6.73 GB]: correct in 10 seconds

gpt-oss-20b MXFP4 [12.11 GB] low reasoning: wrong after 6 seconds

gpt-oss-20b MXFP4 [12.11 GB] high reasoning: wrong after 3 minutes !

Yea yea it's only one question of nonsense trivia. I'm sure it was billions well spent.

It's possible I'm using a poor temperature setting or something but since they weren't bothered enough to put it in the model card I'm not bothered to fuss with it.

nojito · 18 days ago
Why does it need knowledge when it can just call tools to get it?
kmacdough · 17 days ago
I too am skeptical of these models, but it's a reasoning focused model. As a result this isn't a very appropriate benchmark.

Small models are going to be particularly poor when used outside of their intended purpose. They have to omit something.

datadrivenangel · 18 days ago
Now to embrace Jevons paradox and expand usage until we're back to draining lakes so that your agentic refrigerator can simulate sentience.
cco · 18 days ago
What ~IBM~ TSMC giveth, ~Bill Gates~ Sam Altman taketh away.
bongodongobob · 18 days ago
Yep, it's almost as bad as all the cars' cooling systems using up so much water.
herval · 18 days ago
In the future, your Samsung fridge will also need your AI girlfriend
ben_w · 18 days ago
Why is your laptop (or phone, or refrigerator) plumbed directly into a lake?
decide1000 · 18 days ago
The model is good and runs fine, but if you want to be blown away again, try Qwen3-30B-A3B-2507. It's 6GB bigger but the response is comparable or better and much faster to run. Gpt-oss-20B gives me 6 tok/sec while Qwen3 gives me 37 tok/sec. Qwen3 is not a reasoning model tho.
parhamn · 18 days ago
I just tested 120B from the Groq API on agentic stuff (multi-step function calling, similar to claude code) and it's not that good. Agentic fine-tuning seems key, hopefully someone drops one soon.
AmazingTurtle · 18 days ago
I'm not sure if Groq uses the proper harmony template?
turnsout · 18 days ago
The environmentalist in me loves the fact that LLM progress has mostly been focused on doing more with the same hardware, rather than horizontal scaling. I guess given GPU shortages that makes sense, but it really does feel like the value of my hardware (a laptop in my case) is going up over time, not down.

Also, just wanted to credit you for being one of the five people on Earth who knows the correct spelling of "lede."

twixfel · 18 days ago
> Also, just wanted to credit you for being one of the five people on Earth who knows the correct spelling of "lede."

Not in the UK it isn’t.

mathiaspoint · 18 days ago
It's really training not inference that drains the lakes.
littlestymaar · 18 days ago
Training cost has increased a ton exactly because inference cost is the biggest problem: models are now trained on almost three orders of magnitude more data than what is compute-optimal (per the Chinchilla paper), because saving compute on inference makes it worthwhile to overtrain a smaller model, spending more training compute to achieve similar performance.
JKCalhoun · 18 days ago
Interesting. I understand that, but I don't know to what degree.

I mean the training, while expensive, is done once. The inference … besides being done by perhaps millions of clients, is done for, well, the life of the model anyway. Surely that adds up.

It's hard to know, but I assume the user taking up the burden of the inference is perhaps doing so more efficiently? I mean, when I run a local model, it is plodding along — not as quick as the online model. So, slow and therefore I assume necessarily more power efficient.

jwr · 17 days ago
gpt-oss:20b is the best performing model on my spam filtering benchmarks (I wrote a despammer that uses an LLM).

These are the simplified results (total percentage of correctly classified E-mails on both spam and ham testing data):

- gpt-oss:20b - 95.6%
- gemma3:27b-it-qat - 94.3%
- mistral-small3.2:24b-instruct-2506-q4_K_M - 93.7%
- mistral-small3.2:24b-instruct-2506-q8_0 - 92.5%
- qwen3:32b-q4_K_M - 89.2%
- qwen3:30b-a3b-q4_K_M - 87.9%
- gemma3n:e4b-it-q4_K_M - 84.9%
- deepseek-r1:8b - 75.2%
- qwen3:30b-a3b-instruct-2507-q4_K_M - 73.0%

I'm quite happy, because it's also smaller and faster than gemma3.
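Not the author's actual despammer, but a minimal sketch of the general approach, assuming a local Ollama server on its default port; the prompt and label parsing are purely illustrative:

```python
# Sketch: classify an email as spam/ham with a local model served by Ollama.
# Assumes Ollama is running on its default port; prompt and parsing are illustrative.
import json
import urllib.request

def classify(email_text: str, model: str = "gpt-oss:20b") -> str:
    prompt = (
        "You are a spam filter. Reply with exactly one word, SPAM or HAM.\n\n"
        f"Email:\n{email_text}\n\nLabel:"
    )
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps({"model": model, "prompt": prompt, "stream": False}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        answer = json.loads(resp.read())["response"]
    return "spam" if "SPAM" in answer.upper() else "ham"

print(classify("You have WON a $1,000,000 prize! Click here to claim."))
```

Benchmarking then reduces to running a labeled set of spam and ham messages through the classifier and counting agreements, which is presumably how numbers like the ones above are produced.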

Cicero22 · 18 days ago
Where did you get the top ten from?

https://huggingface.co/spaces/TIGER-Lab/MMLU-Pro

Are you discounting all of the self reported scores?

zwischenzug · 18 days ago
Came here to say this. It's behind the 14b Phi-reasoning-plus (which is self-reported).

I don't understand why "TIGER-Lab"-sourced scores are 'unknown' in terms of model size?

npn · 18 days ago
It is not a frontier model. It's only good for benchmarks. Tried some tasks and it is even worse than gemma 3n.
MattSayar · 18 days ago
What's your experience with the quality of LLMs running on your phone?
NoDoo · 18 days ago
I've run qwen3 4B on my phone; it's not the best, but it's better than old gpt-3.5. It also has a reasoning mode, and in reasoning mode it's better than the original gpt-4 and the original gpt-4o, but not the latest gpt-4o. I get usable speed, but it's not really comparable to most cloud-hosted models.
cco · 18 days ago
As others said, around gpt-3.5 level, so three or four years behind SOTA today, at reasonable (but not quick) speed.
vonneumannstan · 17 days ago
>no lakes being drained

When you imagine a lake being drained to cool a datacenter do you ever consider where the water used for cooling goes? Do you imagine it disappears?

nudgeOrnurture · 17 days ago
not if the winds of fortune don't change;--but the weather, man, it's been getting less foreseeable than I was once used to
latexr · 16 days ago
I tried their live demo. It suggests three prompts, one of them being “How many R’s are in strawberry?” So I clicked that, and it answered there are three! I tried it thrice with the same result.

It suggested the prompt. It’s infamous because models often get it wrong, they know it, and still they confidently suggested it and got it wrong.

latexr · 14 days ago
Obviously I made a typo above. “Three” is the right answer, I meant that it answered there are “two” (the wrong answer).

https://i.imgur.com/DgAvbee.png

raideno · 18 days ago
How much ram is in your Macbook Air M3 ? I have the 16Gb version and i was wondering whether i'll be able to run it or not.
black3r · 18 days ago
can you please give an estimate how much slower/faster is it on your macbook compared to comparable models running in the cloud?
cco · 18 days ago
Sure.

This is a thinking model, so I ran it against o4-mini, here are the results:

* gpt-oss:20b
  * Time-to-first-token: 2.49 seconds
  * Time-to-completion: 51.47 seconds
  * Tokens-per-second: 2.19
* o4-mini on ChatGPT
  * Time-to-first-token: 2.50 seconds
  * Time-to-completion: 5.84 seconds
  * Tokens-per-second: 19.34

Time to first token was similar, but the thinking piece was _much_ faster on o4-mini. Thinking took the majority of the 51 seconds for gpt-oss:20b.
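For anyone wanting to reproduce numbers like these against a local model, here is a rough sketch of one way to measure them, assuming an Ollama server on the default port; the streamed /api/generate endpoint emits one JSON object per chunk, so chunk counts are only a proxy for tokens:

```python
# Sketch: measure time-to-first-token and throughput against a local Ollama server.
# Counts streamed chunks as a rough proxy for tokens; prompt and model are illustrative.
import json
import time
import urllib.request

payload = json.dumps({"model": "gpt-oss:20b",
                      "prompt": "Explain the Chinchilla scaling laws briefly.",
                      "stream": True}).encode()
req = urllib.request.Request("http://localhost:11434/api/generate", data=payload,
                             headers={"Content-Type": "application/json"})

start = time.perf_counter()
first_token_at = None
chunks = 0
with urllib.request.urlopen(req) as resp:
    for line in resp:  # newline-delimited JSON, one object per chunk
        if not line.strip():
            continue
        obj = json.loads(line)
        if obj.get("response"):
            chunks += 1
            if first_token_at is None:
                first_token_at = time.perf_counter()
        if obj.get("done"):
            break
end = time.perf_counter()

print(f"time to first token: {first_token_at - start:.2f}s")
print(f"total time: {end - start:.2f}s, ~{chunks / (end - start):.1f} chunks/sec")
```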

syntaxing · 18 days ago
You can get a pretty good estimate depending on your memory bandwidth. Too many parameters can change with local models (quantization, fast attention, etc). But the new models are MoE so they’re gonna be pretty fast.
lend000 · 18 days ago
For me the game changer here is the speed. On my local Mac I'm finally getting token counts that are faster than I can process the output (~96 tok/s), and the quality has been solid. I had previously tried some of the distilled qwen and deepseek models and they were just way too slow for me to seriously use them.
snthpy · 18 days ago
For me the biggest benefit of open weights models is the ability to fine tune and adapt to different tasks.
SergeAx · 18 days ago
Did you mean "120b"? I am running 20b model locally right now, and it is pretty mediocre. Nothing near Gemini 2.5 Pro, which is my daily driver.
benreesman · 18 days ago
You're going to freak out when you try the Chinese ones :)
syntaxing · 18 days ago
Interesting, these models are better than the new Qwen releases?
bakies · 18 days ago
on your phone?

foundry27 · 18 days ago
Model cards, for the people interested in the guts: https://cdn.openai.com/pdf/419b6906-9da6-406c-a19d-1bb078ac7...

In my mind, I’m comparing the model architecture they describe to what the leading open-weights models (Deepseek, Qwen, GLM, Kimi) have been doing. Honestly, it just seems “ok” at a technical level:

- both models use standard Grouped-Query Attention (64 query heads, 8 KV heads). The card talks about how they’ve used an older optimization from GPT3, which is alternating between banded window (sparse, 128 tokens) and fully dense attention patterns. It uses RoPE extended with YaRN (for a 131K context window). So they haven’t been taking advantage of the special-sauce Multi-head Latent Attention from Deepseek, or any of the other similar improvements over GQA.

- both models are standard MoE transformers. The 120B model (116.8B total, 5.1B active) uses 128 experts with Top-4 routing (a toy routing sketch follows below). They’re using some kind of Gated SwiGLU activation, which the card talks about as being "unconventional" because of the clamping and whatever residual connections that implies. Again, not using any of Deepseek’s “shared experts” (for general patterns) + “routed experts” (for specialization) architectural improvements, Qwen’s load-balancing strategies, etc.

- the most interesting thing IMO is probably their quantization solution. They did something to quantize >90% of the model parameters to the MXFP4 format (4.25 bits/parameter) to let the 120B model fit on a single 80GB GPU, which is pretty cool. But we’ve also got Unsloth with their famous 1.58-bit quants :)

All this to say, it seems like even though the training they did for their agentic behavior and reasoning is undoubtedly very good, they’re keeping their actual technical advancements “in their pocket”.
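To make the active-parameter arithmetic concrete, here is a toy sketch of the Top-4 routing described in the second bullet above; the router weights and dimensions are made up and no real experts are involved:

```python
# Toy sketch of MoE Top-k routing: 128 experts, 4 active per token.
# Router weights and dimensions are made up; this only illustrates the mechanism.
import numpy as np

num_experts, top_k, d_model = 128, 4, 64
rng = np.random.default_rng(0)
router_w = rng.standard_normal((d_model, num_experts)) * 0.02

def route(token_vec):
    logits = token_vec @ router_w                      # [num_experts]
    top = np.argsort(logits)[-top_k:]                  # indices of the 4 chosen experts
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                               # softmax over the selected experts only
    return top, gates

token = rng.standard_normal(d_model)
experts, gates = route(token)
print("experts:", experts, "gate weights:", np.round(gates, 3))
# Only these 4 experts' FFN weights are touched for this token, which is how a
# 116.8B-parameter model ends up with only ~5.1B parameters active per token.
```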

highfrequency · 18 days ago
I would guess the “secret sauce” here is distillation: pretraining on an extremely high quality synthetic dataset from the prompted output of their state of the art models like o3 rather than generic internet text. A number of research results have shown that highly curated technical problem solving data is unreasonably effective at boosting smaller models’ performance.

This would be much more efficient than relying purely on RL post-training on a small model; with low baseline capabilities the insights would be very sparse and the training very inefficient.

asadm · 18 days ago
> research results have shown that highly curated technical problem solving data is unreasonably effective at boosting smaller models’ performance.

same seems to be true for humans

rfoo · 18 days ago
Or, you can say, OpenAI has some real technical advancements in stuff besides attn architecture. GQA8 and alternating SWA 128 / full attn do all seem conventional. Basically they are showing us that "there's no secret sauce in model arch, you guys just suck at mid/post-training", or they want us to believe this.

The model is pretty sparse tho, 32:1.

liuliu · 18 days ago
Kimi K2 paper said that the model sparsity scales up with parameters pretty well (MoE sparsity scaling law, as they call, basically calling Llama 4 MoE "done wrong"). Hence K2 has 128:1 sparsity.
nxobject · 18 days ago
It's convenient to be able to attribute success to things only OpenAI could've done with the combo of their early start and VC money – licensing content, hiring subject matter experts, etc. Essentially the "soft" stuff that a mature organization can do.
tgtweak · 18 days ago
I think their MXFP4 release is a bit of a gift since they obviously used and tuned this extensively as a result of cost-optimization at scale - something the open source model providers aren't doing too much, and also somewhat of a competitive advantage.

Unsloth's special quants are amazing, but I've found there to be lots of trade-offs vs full quantization, particularly when striving for best first-shot attempts - which is by far the bulk of LLM use cases. Running a better (larger, newer) model at lower quantization to fit in memory, or with reduced accuracy/detail to speed it up, both have value, but in the pursuit of first-shot accuracy there don't seem to be many companies running their frontier models at reduced quantization. If OpenAI is doing this in production, that is interesting.

logicchains · 18 days ago
>They did something to quantize >90% of the model parameters to the MXFP4 format (4.25 bits/parameter) to let the 120B model to fit on a single 80GB GPU, which is pretty cool

They said it was native FP4, suggesting that they actually trained it like that; it's not post-training quantisation.

rushingcreek · 18 days ago
The native FP4 is one of the most interesting architectural aspects here IMO, as going below FP8 is known to come with accuracy tradeoffs. I'm curious how they navigated this and how FP8 weights (if they exist) would have performed.
mclau157 · 18 days ago
You can get similar insights looking at the github repo https://github.com/openai/gpt-oss
unethical_ban · 18 days ago
I don't know how to ask this without being direct and dumb: Where do I get a layman's introduction to LLMs that could work me up to understanding every term and concept you just discussed? Either specific videos, or if nothing else, a reliable Youtube channel?
tkgally · 18 days ago
What I’ve sometimes done when trying to make sense of recent LLM research is give the paper and related documents to ChatGPT, Claude, or Gemini and ask them to explain the specific terms I don’t understand. If I don’t understand their explanations or want to know more, I ask follow-ups. Doing this in voice mode works better for me than text chat does.

When I just want a full summary without necessarily understanding all the details, I have an audio overview made on NotebookLM and listen to the podcast while I’m exercising or cleaning. I did that a few days ago with the recent Anthropic paper on persona vectors, and it worked great.

umgefahren · 18 days ago
There is a great 3blue1brown video, but it’s pretty much impossible by now to cover the entire landscape of research. I bet gpt-oss has some great explanations though ;)
nonfamous · 18 days ago
Try Microsoft's "Generative AI for Beginners" repo on GitHub. The early chapters in particular give a good grounding of LLM architecture without too many assumptions of background knowledge. The video version of the series is good too.
cwyers · 17 days ago
This is a great book (parts of it are available as blog posts from the author if you want to get a taste of it):

https://www.manning.com/books/build-a-large-language-model-f...

CanuckPro · 18 days ago
Try Andrej Karpathy's YouTube videos. I also really liked the Dive into Deep Learning book at d2l.ai
srigi · 18 days ago
Start with the YT series on neural nets and LLMs from 3blue1brown
reilly3000 · 18 days ago
Ask Gemini. Give it a link here in fact.
danieldk · 18 days ago
Also: attention sinks (although implemented as extra trained logits used in attention softmax rather than attending to e.g. a prepended special token).
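A toy sketch of that variant, assuming the sink is a learned per-head scalar logit appended to the attention scores before the softmax and then discarded, so it only soaks up probability mass:

```python
# Toy sketch: attention softmax with a learned "sink" logit (no sink token in the sequence).
# Shapes and values are made up; this only shows where the extra logit enters.
import numpy as np

seq_len, d_head = 6, 8
rng = np.random.default_rng(0)
q = rng.standard_normal((seq_len, d_head))
k = rng.standard_normal((seq_len, d_head))
v = rng.standard_normal((seq_len, d_head))
sink_logit = 0.5  # stands in for a learned per-head parameter

scores = q @ k.T / np.sqrt(d_head)                        # [seq_len, seq_len]
scores = np.concatenate([scores, np.full((seq_len, 1), sink_logit)], axis=1)

weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)             # softmax over keys + sink

out = weights[:, :-1] @ v                                  # drop the sink column: it attends to nothing
print(out.shape)  # (6, 8); attention rows no longer sum to exactly 1
```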
ClassAndBurn · 18 days ago
Open models are going to win long-term. Anthropic's own research has to use OSS models [0]. China is demonstrating how quickly companies can iterate on open models, allowing smaller teams access and augmentation to the abilities of a model without paying the training cost.

My personal prediction is that the US foundational model makers will OSS something close to N-1 for the next 1-3 iterations. The CAPEX for the foundational model creation is too high to justify OSS for the current generation. Unless the US Gov steps up and starts subsidizing power, or Stargate does 10x what it is planned right now.

N-1 model value depreciates insanely fast. Making an OSS release of them and allowing specialized use cases and novel developments allows potential value to be captured and integrated into future model designs. It's medium risk, as you may lose market share. But also high potential value, as the shared discoveries could substantially increase the velocity of next-gen development.

There will be a plethora of small OSS models. Iteration on the OSS releases is going to be biased towards local development, creating more capable and specialized models that work on smaller and smaller devices. In an agentic future, every different agent in a domain may have its own model. Distilled and customized for its use case without significant cost.

Everyone is racing to AGI/SGI. The models along the way are to capture market share and use data for training and evaluations. Once someone hits AGI/SGI, the consumer market is nice to have, but the real value is in novel developments in science, engineering, and every other aspect of the world.

[0] https://www.anthropic.com/research/persona-vectors > We demonstrate these applications on two open-source models, Qwen 2.5-7B-Instruct and Llama-3.1-8B-Instruct.

lechatonnoir · 18 days ago
I'm pretty sure there's no reason that Anthropic has to do research on open models, it's just that they produced their result on open models so that you can reproduce their result on open models without having access to theirs.
xpe · 18 days ago
> Open models are going to win long-term.

[2 of 3] Assuming we pin down what win means... (which is definitely not easy)... What would it take for this to not be true? There are many ways, including but not limited to:

- publishing open weights helps your competitors catch up

- publishing open weights doesn't improve your own research agenda

- publishing open weights leads to a race dynamic where only the latest and greatest matters; leading to a situation where the resources sunk exceed the gains

- publishing open weights distracts your organization from attaining a sustainable business model / funding stream

- publishing open weights leads to significant negative downstream impacts (there are a variety of uncertain outcomes, such as: deepfakes, security breaches, bioweapon development, unaligned general intelligence, humans losing control [1] [2], and so on)

[1]: "What failure looks like" by Paul Christiano : https://www.alignmentforum.org/posts/HBxe6wdjxK239zajf/what-...

[2]: "An AGI race is a suicide race." - quote from Max Tegmark; article at https://futureoflife.org/statement/agi-manhattan-project-max...

albertzeyer · 18 days ago
> Once someone hits AGI/SGI

I don't think there will be such a unique event. There is no clear boundary. This is a continuous process. Modells get slightly better than before.

Also, another dimension is the inference cost to run those models. It has to be cheap enough to really take advantage of it.

Also, I wonder, what would be a good target to make profit, to develop new things? There is Isomorphic Labs, which seems like a good target. This company already exists now, and people are working on it. What else?

dom96 · 18 days ago
> I don't think there will be such a unique event.

I guess it depends on your definition of AGI, but if it means human level intelligence then the unique event will be the AI having the ability to act on its own without a "prompt".

Adrig · 18 days ago
I'm a layman but it seemed to me that the industry is going towards robust foundational models on which we plug tools, databases, and processes to expand their capabilities.

In this setup OSS models could be more than enough and capture the market, but I don't see where the value would be in a multitude of specialized models we have to train.

renmillar · 18 days ago
There's no reason that models too large for consumer hardware wouldn't keep a huge edge, is there?
AtlasBarfed · 18 days ago
That is fundamentally a big O question.

I have this theory that we simply got over a hump by utilizing a massive processing boost from GPUs as opposed to CPUs. That might have been two to three orders of magnitude more processing power.

But that's a one-time success. I don't think hardware has any large-scale improvements coming, because 3D gaming already plumbed most of that vector-processing hardware development over the last 30 years.

So will software and better training models produce another couple orders of magnitude?

Fundamentally we're talking about nines of accuracy. What is the processing power required for each nine of accuracy? Is it linear? Is it polynomial? Is it exponential?

It just seems strange to me with all the AI knowledge slushing through academia, I haven't seen any basic analysis at that level, which is something that's absolutely going to be necessary for AI applications like self-driving, once you get those insurance companies involved

xpe · 18 days ago
> Open models are going to win long-term.

[1 of 3] For the sake of argument here, I'll grant the premise. If this turns out to be true, it glosses over other key questions, including:

For a frontier lab, what is a rational period of time (according to your organizational mission / charter / shareholder motivations*) to wait before:

1. releasing a new version of an open-weight model; and

2. how much secret sauce do you hold back?

* Take your pick. These don't align perfectly with each other, much less the interests of a nation or world.

teaearlgraycold · 18 days ago
> N-1 model value depreciates insanely fast

This implies LLM development isn't plateaued. Sure, the researchers are busting their asses quantizing, adding features like tool calls and structured outputs, etc. But soon enough N-1 ~= N

swalsh · 18 days ago
To me it depends on 2 factors: hardware becoming more accessible, and the closed-source offerings becoming more expensive. Right now 1) it's difficult to get enough GPUs to do local inference at production scale, and 2) it's more expensive to run your own GPUs vs the closed-source models.
xpe · 18 days ago
> Open models are going to win long-term.

[3 of 3] What would it take for this statement to be false or missing the point?

Maybe we find ourselves in a future where:

- Yes, open models are widely used as base models, but they are also highly customized in various ways (perhaps by industry, person, attitude, or something else). In other words, this would be a blend of open and closed.

- Maybe publishing open weights of a model is more-or-less irrelevant, because it is "table stakes" ... because all the key differentiating advantages have to do with other factors, such as infrastructure, non-LLM computational aspects, regulatory environment, affordable energy, customer base, customer trust, and probably more.

- The future might involve thousands or millions of highly tailored models

x187463 · 18 days ago
Running a model comparable to o3 on a 24GB Mac Mini is absolutely wild. Seems like yesterday the idea of running frontier (at the time) models locally or on a mobile device was 5+ years out. At this rate, we'll be running such models in the next phone cycle.
tedivm · 18 days ago
It only seems like that if you haven't been following other open source efforts. Models like Qwen perform ridiculously well and do so on very restricted hardware. I'm looking forward to seeing benchmarks to see how these new open source models compare.
Rhubarrbb · 18 days ago
Agreed, these models seem relatively mediocre compared to Qwen3 / GLM 4.5
echelon · 18 days ago
This might mean there's no moat for anything.

Kind of a P=NP, but for software deliverability.

a_wild_dandan · 18 days ago
Right? I still remember the safety outrage of releasing Llama. Now? My 96 GB of (V)RAM MacBook will be running a 120B parameter frontier lab model. So excited to get my hands on the MLX quants and see how it feels compared to GLM-4.5-air.
SchemaLoad · 18 days ago
I feel like most of the safety concerns ended up being proven correct, but there's so much money in it that they decided to push on anyway full steam ahead.

AI did get used for fake news, propaganda, mass surveillance, erosion of trust and sense of truth, and mass spamming social media.

4b6442477b1280b · 18 days ago
in that era, OpenAI and Anthropic were still deluding themselves into thinking they would be the "stewards" of generative AI, and the last US administration was very keen on regoolating everything under the sun, so "safety" was just an angle for regulatory capture.

God bless China.

Imustaskforhelp · 18 days ago
Okay, I will be honest, I was so hyped up about this model, but then I went to localllama and saw that the:

120B model is worse at coding compared to Qwen3 Coder, GLM-4.5 Air, and even Grok 3... (https://www.reddit.com/r/LocalLLaMA/comments/1mig58x/gptoss1...)

pxc · 18 days ago
Qwen3 Coder is 4x its size! Grok 3 is over 22x its size!

What does the resource usage look like for GLM 4.5 Air? Is that benchmark in FP16? GPT-OSS-120B will be using between 1/4 and 1/2 the VRAM that GLM-4.5 Air does, right?

It seems like a good showing to me, even though Qwen3 Coder and GLM 4.5 Air might be preferable for some use cases.

logicchains · 18 days ago
It's only got around 5 billion active parameters; it'd be a miracle if it was competitive at coding with SOTA models that have significantly more.
ascorbic · 18 days ago
That's SVGBench, which is a useful benchmark but isn't much of a test of general coding
bogtog · 18 days ago
When people talk about running a (quantized) medium-sized model on a Mac Mini, what types of latency and throughput times are they talking about? Do they mean like 5 tokens per second or at an actually usable speed?
phonon · 18 days ago
Here's a 4bit 70B parameter model, https://www.youtube.com/watch?v=5ktS0aG3SMc (deepseek-r1:70b Q4_K_M) on a M4 Max 128 GB. Usable, but not very performant.
davio · 18 days ago
On a M1 MacBook Air with 8GB, I got this running Gemma 3n:

12.63 tok/sec • 860 tokens • 1.52s to first token

I'm amazed it works at all with such limited RAM

n42 · 18 days ago
here's a quick recording from the 20b model on my 128GB M4 Max MBP: https://asciinema.org/a/AiLDq7qPvgdAR1JuQhvZScMNr

and the 120b: https://asciinema.org/a/B0q8tBl7IcgUorZsphQbbZsMM

I am, um, floored

a_wild_dandan · 18 days ago
GLM-4.5-air produces tokens far faster than I can read on my MacBook. That's plenty fast enough for me, but YMMV.

tyho · 18 days ago
What's the easiest way to get these local models browsing the web right now?
keheliya · 7 days ago
Using LM-Studio as the frontend and Playwright-powered MCP tools for browser access. I've had success with one such MCP: https://github.com/instavm/coderunner It has a tool called navigate_and_get_all_visible_text, for example.

Here's a demo of this functionality: https://www.youtube.com/watch?v=9mLrGcuDifo

dizhn · 18 days ago
aider uses Playwright. I don't know what everybody is using but that's a good starting point.
larodi · 18 days ago
We be running them in PIs off spare juice in no time, and they be billions given how chips and embedded spreads…
lukax · 18 days ago
Inference in Python uses harmony [1] (for request and response format) which is written in Rust with Python bindings. Another OpenAI's Rust library is tiktoken [2], used for all tokenization and detokenization. OpenAI Codex [3] is also written in Rust. It looks like OpenAI is increasingly adopting Rust (at least for inference).

[1] https://github.com/openai/harmony

[2] https://github.com/openai/tiktoken

[3] https://github.com/openai/codex
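For example, tiktoken's Python bindings over the Rust core take only a couple of lines to use (the encoding name below is one of the library's stock encodings, not necessarily the exact one gpt-oss ships with):

```python
# Minimal tiktoken usage; "o200k_base" is a stock encoding, not necessarily gpt-oss's own.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")
tokens = enc.encode("Inference in Python, tokenization in Rust.")
print(tokens)
print(enc.decode(tokens))
```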

chilipepperhott · 18 days ago
As an engineer that primarily uses Rust, this is a good omen.
Philpax · 18 days ago
The less Python in the stack, the better!
deviation · 18 days ago
So this confirms a best-in-class model release within the next few days?

From a strategic perspective, I can't think of any reason they'd release this unless they were about to announce something which totally eclipses it?

ticulatedspline · 18 days ago
Even without an imminent release it's a good strategy. They're getting pressure from Qwen and other high performing open-weight models. without a horse in the race they could fall behind in an entire segment.

There's future opportunity in licensing, tech support, agents, or even simply to dominate and eliminate. Not to mention brand awareness: if you like these, you might be more likely to approach their brand for larger models.

winterrx · 18 days ago
GPT-5 coming Thursday.
ciaranmca · 18 days ago
Are these the stealth models Horizon Alpha and Beta? I was generally impressed with them (although I really only used them in chats rather than any code tasks). In terms of chat I increasingly see very little difference between the current SOTA closed models and their open-weight counterparts.
boringg · 18 days ago
How much hype do we anticipate with the release of GPT-5 or whichever name to be included? And how many new features?
og_kalu · 18 days ago
Even before today, the last week or so, it's been clear for a couple reasons, that GPT-5's release was imminent.
bredren · 18 days ago
Undoubtedly. It would otherwise reduce the perceived value of their current product offering.

The question is how much better the new model(s) will need to be on the metrics given here to feel comfortable making these available.

Despite the loss of face for the lack of open model releases, I do not think that was a big enough problem to undercut commercial offerings.

logicchains · 18 days ago
> I can't think of any reason they'd release this unless they were about to announce something which totally eclipses it

Given it's only around 5 billion active params it shouldn't be a competitor to o3 or any of the other SOTA models, given the top Deepseek and Qwen models have around 30 billion active params. Unless OpenAI somehow found a way to make a model with 5 billion active params perform as well as one with 4-8 times more.

semitones · 16 days ago
You hit the nail on the head!!!
henriquegodoy · 18 days ago
Seeing a 20B model competing with o3's performance is mind blowing. Just a year ago most of us would've called this impossible - not just the intelligence leap, but getting this level of capability in such a compact size.

I think that the point that makes me more excited is that we can train trillion-parameter giants and distill them down to just billions without losing the magic. Imagine coding with Claude 4 Opus-level intelligence packed into a 10B model running locally at 2000 tokens/sec - like instant AI collaboration. That would fundamentally change how we develop software.

coolspot · 18 days ago
10B * 2000 t/s = 20,000 GB/s memory bandwidth. Apple hardware can do 1k GB/s.
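Rough arithmetic behind that figure: it assumes roughly one byte per weight (i.e. heavy quantization) and that every active weight is read from memory once per token, which is also why MoE models with few active parameters change the picture:

```python
# Back-of-envelope bandwidth: bytes_per_weight * active_params * tokens_per_sec.
# Assumes every active weight is read once per token and ~1 byte/weight (heavy quantization).
def bandwidth_gb_s(active_params_b, tokens_per_sec, bytes_per_weight=1.0):
    return active_params_b * 1e9 * bytes_per_weight * tokens_per_sec / 1e9

print(bandwidth_gb_s(10, 2000))    # dense 10B model at 2000 t/s: ~20,000 GB/s
print(bandwidth_gb_s(3.6, 2000))   # MoE with 3.6B active params: ~7,200 GB/s
print(bandwidth_gb_s(3.6, 50))     # 3.6B active at a more realistic 50 t/s: ~180 GB/s
```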
oezi · 18 days ago
That’s why MoE is needed.
int_19h · 18 days ago
It's not even a 20b model. It's 20b MoE with 3.6b active params.

But it does not actually compete with o3 performance. Not even close. As usual, the metrics are bullshit. You don't know how good the model actually is until you grill it yourself.

timmg · 18 days ago
Orthogonal, but I just wanted to say how awesome Ollama is. It took 2 seconds to find the model and a minute to download and now I'm using it.

Kudos to that team.

_ache_ · 18 days ago
To be fair, it's with the help of OpenAI. They did it together, before the official release.

https://ollama.com/blog/gpt-oss

aubanel · 18 days ago
From experience, it's much more engineering work on the integrator's side than on OpenAI's. Basically they provide you their new model in advance, but they don't know the specifics of your system, so it's normal that you do most of the work. Thus I'm particularly impressed by Cerebras: they only have a few models supported for their extreme perf inference, it must have been huge bespoke work to integrate.
Shopper0552 · 18 days ago
I remember reading Ollama is going closed source now?

https://www.reddit.com/r/LocalLLaMA/comments/1meeyee/ollamas...

int_19h · 18 days ago
It's just as easy with LM Studio.

All the real heavy lifting is done by llama.cpp, and for the distribution, by HuggingFace.