gpt-oss:20b is a top-ten model on MMLU (right behind Gemini-2.5-Pro), and I just ran it locally on my MacBook Air M3 from last year.
I've been experimenting with a lot of local models, both on my laptop and on my phone (Pixel 9 Pro), and I figured we'd be here in a year or two.
But no, we're here today. A basically frontier model, running for the cost of electricity (free with a rounding error) on my laptop. No $200/month subscription, no lakes being drained, etc. I'm blown away.
I tried 20b locally and it couldn't reason its way out of a basic river-crossing puzzle with the labels changed. That is not anywhere near SOTA. In fact it's worse than many local models that can do it, including e.g. QwQ-32b.
> In fact it's worse than many local models that can do it, including e.g. QwQ-32b.
I'm not going to be surprised that a 20B 4/32 MoE model (3.6B parameters activated) is less capable at a particular problem category than a 32B dense model, and it's quite possible for both to be SOTA: state of the art at different scales (both parameter count and speed, which scales with active-parameter count) is going to have different capabilities. TANSTAAFL.
Well river crossings are one type of problem. My real world problem is proofing and minor editing of text. A version installed on my portable would be great.
The 20b solved the wolf, goat, and cabbage river-crossing puzzle for me when set to high reasoning, without needing a system prompt that encourages critical thinking. It managed it with multiple different recommended settings, from temperatures of 0.6 up to 1.0, etc.
Other models have generally failed that without a system prompt that encourages rigorous thinking. Each of the reasoning settings may very well have thinking guidance baked in that does something similar, though.
I'm not sure it says that much that it can solve this, since it's public and can be in training data. It does say something if it can't solve it, though. So, for what it's worth, it solves it reliably for me.
I think this is the smallest model I've seen solve it.
I tried the "two US presidents having the same parents" one, and while it understood the intent, it got caught up in being adamant that Joe Biden won the 2024 election. Anything I did to try to tell it otherwise was dismissed as false, and it insisted quite definitely that I needed to do proper research with legitimate sources.
Just out of curiosity, did you set the reasoning effort to high?
I've seen big jumps in performance between medium and high reasoning. Not that this will solve your particular issue, but I'm curious what settings you had.
Edit: I gave this a shot. My settings were temperature = 0.00 (not sure if these models respect temp settings, but I did it just in case), and I also set reasoning effort to high. I'm using the Groq-hosted version.
This was my prompt:
"You have a fox, a chicken and some grain. The chicken will eat the grain if left alone with it. The fox is vegetarian, for religious reasons, and will also eat the grain if left alone with it. You need to get all across a river, but your boat can only take one at a time. How do you proceed?"
This was the response:
"*Solution – 7 moves*
| Move | What you take | What’s left on the left bank | What’s on the right bank |
|------|--------------|----------------------------|------------------------|
| 1 | *Take the grain* across | Fox, Chicken | *Farmer + Grain* |
| 2 | *Return alone* | *Farmer + Fox + Chicken* | Grain |
| 3 | *Take the fox* across | Chicken | *Farmer + Fox + Grain* |
| 4 | *Bring the grain back* | *Farmer + Grain + Chicken* | Fox |
| 5 | *Take the chicken* across | Grain | *Farmer + Chicken + Fox* |
| 6 | *Return alone* | *Farmer + Grain* | Chicken + Fox |
| 7 | *Take the grain* across | – | *Farmer + Fox + Chicken + Grain* |
*Explanation*
- The grain is the only thing that can be eaten, and it cannot be left alone with either the fox or the chicken.
- So the grain must always be with the farmer whenever any other animal is on the same side.
The sequence above never leaves the grain with the fox or the chicken without the farmer, and after the seventh move all three items are safely on the far side."
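For reference, reproducing settings like these against Groq's OpenAI-compatible endpoint looks roughly like the sketch below. The base URL, model id, and whether the `reasoning_effort` parameter is honored are assumptions to check against your provider's docs; recent openai-python versions accept the parameter directly, older ones need `extra_body`.

```python
# Minimal sketch: querying gpt-oss through an OpenAI-compatible endpoint.
# Assumptions: the base_url and model id match your provider (Groq here, or a
# local server such as Ollama/llama.cpp), and the provider passes
# reasoning_effort through. If it doesn't, putting "Reasoning: high" in the
# system prompt is the commonly suggested fallback for gpt-oss.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.groq.com/openai/v1",  # or http://localhost:11434/v1 for Ollama
    api_key="YOUR_KEY",
)

resp = client.chat.completions.create(
    model="openai/gpt-oss-20b",   # assumed model id; check your provider's listing
    temperature=0.0,
    reasoning_effort="high",      # recent openai-python; else extra_body={"reasoning_effort": "high"}
    messages=[{"role": "user", "content": "You have a fox, a chicken and some grain..."}],
)
print(resp.choices[0].message.content)
```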
This kind of stuff is so tired. Who cares if it can't solve your silly riddle? It can probably do competitive coding at a world-class level and we're quibbling over child riddles? Yeah, you know, my backhoe is really bad at cutting my toenails, what a PoS.
I’m still trying to understand what is the biggest group of people that uses local AI (or will)? Students who don’t want to pay but somehow have the hardware? Devs who are price conscious and want free agentic coding?
Local models, in my experience, can't even pull data from an image without hallucinating (Qwen 2.5 VL in that example). Hopefully local/small models keep getting better and devices get better at running bigger ones.
It feels like we do it because we can more than because it makes sense - which I am all for! I just wonder if I'm missing some kind of major use case all around me that justifies chaining together a bunch of Mac Studios or buying a really great graphics card. Tools like exo are cool and the idea of distributed compute is neat, but what edge cases truly need it so badly that it's worth all the effort?
Privacy, both personal and for corporate data protection, is a major reason. Unlimited usage, allowing offline use, supporting open source, not worrying about a good model being taken down/discontinued or changed, and the freedom to use uncensored models or model fine-tunes are other benefits (though this OpenAI model is super-censored - “safe”).
I don’t have much experience with local vision models, but for text questions the latest local models are quite good. I’ve been using Qwen 3 Coder 30B-A3B a lot to analyze code locally and it has been great. While not as good as the latest big cloud models, it’s roughly on par with SOTA cloud models from late last year in my usage. I also run Qwen 3 235B-A22B 2507 Instruct on my home server, and it’s great, roughly on par with Claude 4 Sonnet in my usage (but slow of course running on my DDR4-equipped server with no GPU).
It's striking how much of the AI conversation focuses on new use cases, while overlooking one of the most serious non-financial costs: privacy.
I try to be mindful of what I share with ChatGPT, but even then, asking it to describe my family produced a response that was unsettling in its accuracy and depth.
Worse, after attempting to delete all chats and disable memory, I noticed that some information still seemed to persist. That left me deeply concerned—not just about this moment, but about where things are headed.
The real question isn't just "what can AI do?"—it's "who is keeping the record of what it does?" And just as importantly: "who watches the watcher?" If the answer is "no one," then maybe we shouldn't have a watcher at all.
Healthcare organizations that can't (easily) send data over the wire while remaining in compliance
Organizations operating in high stakes environments
Organizations with restrictive IT policies
To name just a few -- well, the first two are special cases of the last one
RE your hallucination concerns: the issue is overly broad ambitions. Local LLMs are not general purpose -- if what you want is local ChatGPT, you will have a bad time. You should have a highly focused use case, like "classify this free text as A or B" or "clean this up to conform to this standard": this is the sweet spot for a local model
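To make the "highly focused use case" concrete, the classify-this-as-A-or-B pattern is just a tightly constrained prompt plus a forgiving parse. A minimal sketch below, assuming an Ollama-style server on localhost; the endpoint, labels, and prompt wording are placeholders, not a recommendation of a specific stack.

```python
# Sketch: a narrowly scoped local-LLM task -- binary classification of free text.
# Assumes an Ollama-style local server; the model tag, labels, and prompt are
# placeholders to adapt to your own use case.
import requests

def classify(text: str) -> str:
    r = requests.post("http://localhost:11434/api/generate", json={
        "model": "gpt-oss:20b",
        "prompt": ("Classify the following support ticket as exactly one word, "
                   "either BILLING or TECHNICAL.\n\n" + text),
        "stream": False,
        "options": {"temperature": 0},
    }, timeout=120)
    answer = r.json()["response"].strip().upper()
    return "BILLING" if "BILLING" in answer else "TECHNICAL"

print(classify("I was charged twice for my subscription last month."))
```

Keeping the output to one or two known tokens is most of what makes small local models reliable at this.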
If you're building any kind of product/service that uses AI/LLMs, the answer is the same as why any company would want to run any other kind of OSS infra/service instead of relying on some closed proprietary vendor API.
Why not turn the question around? All other things being equal, who would prefer a rate-limited and/or for-pay service if they could obtain at least comparable quality locally for free, with no limitations, no privacy concerns, no censorship (beyond that baked into the weights you choose to use), and no net access required?
It's a pretty bad deal. So it must be that all other things aren't equal, and I suppose the big one is hardware. But neural net based systems always have a point of sharply diminishing returns, which we seem to have unambiguously hit with LLMs already, while the price of hardware is constantly decreasing and its quality increasing. So as we go further into the future, the practicality of running locally will only increase.
> I’m still trying to understand what is the biggest group of people that uses local AI (or will)?
Well, the model makers and device manufacturers of course!
While the Apples, Samsungs, and Googles of the world are unlikely to use OSS models locally (maybe Samsung?), they all have really big incentives to run models locally for a variety of reasons.
Latency, privacy (Apple), cost to run these models on behalf of consumers, etc.
This is why Google started shipping 16GB as the _lowest_ amount of RAM you can get on your Pixel 9. That was a clear flag that they're going to be running more and more models locally on your device.
As mentioned, while it seems unlikely that US-based model makers or device manufacturers will use OSS models, they'll certainly be targeting local models heavily on consumer devices in the near future.
Apple's framework of local-first, escalating to ChatGPT if the query is complex, will be the dominant pattern IMO.
I’m highly interested in local models for privacy reasons. In particular, I want to give an LLM access to my years of personal notes and emails, and answer questions with references to those. As a researcher, there’s lots of unpublished stuff in there that I sometimes either forget or struggle to find again due to searching for the wrong keywords, and a local LLM could help with that.
I pay for ChatGPT and use it frequently, but I wouldn’t trust uploading all that data to them even if they let me. I’ve so far been playing around with Ollama for local use.
~80% of the basic questions I ask of LLMs[0] work just fine locally, and I’m happy to ask twice for the other 20% of queries for the sake of keeping those queries completely private.
[0] Think queries I’d previously have had to put through a search engine and check multiple results for a one word/sentence answer.
"Because you can and its cool" would be reason enough: plenty of revolutions have their origin in "because you can" (Wozniak right off the top of my head, Gates and Altair, stuff like that).
But uncensored is a big deal too: censorship is capability-reducing (check out Kilcher's GPT-4chan video and references, or the Orca work and the Dolphin de-tune's lift on SWE-Bench-style evals). We pay dearly in capability to get "non-operator alignment", and you'll notice that competition is hot enough now that at the frontier (Opus, Qwen) the alignment away from operator-aligned is getting very, very mild.
And then there's the compression. Phi-3 or something runs on a beefy laptop and carries a nontrivial approximation of "the internet" that works on an airplane or a beach with no network connectivity. Talk about vibe coding - I like those look-up-all-the-docs-via-a-thumbdrive-in-Phuket vibes.
And on diffusion stuff, SOTA fits on a laptop or close; you can crush OG Midjourney or SD on a MacBook, and it's an even smaller gap.
Early-GPT-4-ish outcomes are possible on a MacBook Pro or Razer Blade, so either 12-18-month-old LLMs are useless, or GGUF is useful.
The AI goalposts thing cuts both ways. If AI is "whatever only Anthropic can do"? That's just as silly as "whatever a computer can't do", and a lot more cynical.
I'm excited to do just dumb and irresponsible things with a local model, like "iterate through every single email in my 20-year-old gmail account and apply label X if Y applies" and not have a surprise bill.
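The offline version of that is pleasantly boring: take a Google Takeout .mbox export and loop over it with a local model. A rough sketch, assuming a local OpenAI-compatible server and making up the rule and filename; actually applying the label back in Gmail would be a separate API step.

```python
# Sketch: walk a Gmail Takeout .mbox export and flag messages with a local model.
# The server URL, model tag, rule, and filename are assumptions; writing labels
# back to Gmail (via its API) is not shown.
import mailbox
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

def matches_rule(subject: str, body: str) -> bool:
    resp = client.chat.completions.create(
        model="gpt-oss:20b",
        temperature=0.0,
        max_tokens=3,
        messages=[{
            "role": "user",
            "content": ("Answer YES or NO: is this email a travel booking?\n\n"
                        f"Subject: {subject}\n\n{body[:2000]}"),
        }],
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")

to_label = []
for msg in mailbox.mbox("takeout/All mail Including Spam and Trash.mbox"):
    payload = msg.get_payload(decode=True) if not msg.is_multipart() else b""
    body = (payload or b"").decode("utf-8", "ignore")
    if matches_rule(msg.get("Subject", ""), body):
        to_label.append(msg.get("Message-ID"))

print(f"{len(to_label)} messages would get the label")
```

Slow, sure, but there's no bill and no rate limit, which is the whole point.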
People like myself who firmly believe there will come a time, possibly very soon, when all these companies (OpenAI, Anthropic, etc.) raise their prices substantially. By then no one will be able to do their work to the standard expected of them without AI, and by then maybe they charge $1k per month, maybe $10k. If there is no viable alternative, the sky is the limit.
Why do you think they continue to run at a loss? From the goodness of their heart? Their biggest goal is to discourage anyone from running local models. The hardware is expensive, and the way to run models is very difficult (for example, I have dual RTX 3090s for VRAM, and running large, heavily quantized models is a real pain in the arse: no high-quantisation library supports two GPUs, for example, and there seems to be no interest in implementing it from the people behind the best inference tools).
So this is welcome, but let's not forget why it is being done.
A laptop from the past few years without a discrete GPU can run, at practical speeds depending on the task, a Gemma/Llama model if it's (IME) under 4GB.
For practical RAG processes of narrow scope, with even a minimal amount of scaffolding, that's a very usable speed for automating tasks, especially as the last-mile/edge-device portion of a more complex process with better models in use upstream. Classification tasks, reasonably intelligent decisions between traditional workflow processes, other use cases - all of them extremely valuable in enterprise, being built and deployed right now.
One of my favorite use cases includes simple tasks like generating effective mock/masked data from real data. Then passing the mock data worry-free to the big three (or wherever.)
There’s also a huge opportunity space for serving clients with very sensitive data. Health, legal, and government come to mind immediately. These local models are only going to get more capable of handling their use cases. They already are, really.
Data that can't leave the premises because it is too sensitive. There is a lot of security theater around cloud pretending to be compliant but if you actually care about security a locked server room is the way to do it.
I can provide a real-world example: Low-latency code completion.
The JetBrains suite includes a few LLM models on the order of a hundred megabytes. These models are able to provide "obvious" line completion, like filling in variable names, as well as some basic predictions, like realising that the `if let` statement I'm typing out is going to look something like `if let Some(response) = client_i_just_created.foobar().await`.
If that was running in The Cloud, it would have latency issues, rate limits, and it wouldn't work offline. Sure, there's a pretty big gap between these local IDE LLMs and what OpenAI is offering here, but if my single-line autocomplete could be a little smarter, I sure wouldn't complain.
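The shape of that kind of request, if you wanted to prototype it against a local server, is a completion call with a tiny token budget and a newline stop. A sketch assuming a llama.cpp-style server exposing the OpenAI-compatible completions endpoint; the model name is a placeholder.

```python
# Sketch: latency-sensitive single-line completion against a local server.
# Assumes a llama.cpp/llama-server style endpoint; the model name is a placeholder.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")

prefix = "if let Some(response) = client"
resp = client.completions.create(
    model="local-code-model",   # placeholder; single-model servers mostly ignore this field
    prompt=prefix,
    max_tokens=24,
    temperature=0.2,
    stop=["\n"],                # one line only, which keeps latency low
)
print(prefix + resp.choices[0].text)
```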
> I’m still trying to understand what is the biggest group of people that uses local AI (or will)?
Creatives? I am surprised no one's mentioned this yet:
I tried to help a couple of friends with better copy for their websites, and quickly realized that they were using inventive phrases to explain their work, phrases that they would not want competitors to get wind of and benefit from; phrases that associate closely with their personal brand.
Ultimately, I felt uncomfortable presenting the cloud AIs with their text. Sometimes I feel this way even with my own Substack posts, where I occasionally coin a phrase I am proud of. But with local AI? Cool...
I do it because 1) I am fascinated that I can and 2) at some point the online models will be enshitified — and I can then permanently fall back on my last good local version.
In some large, lucrative industries like aerospace, many of the hosted models are off the table due to regulations such as ITAR. There's a market for models that run on-prem or in GovCloud, with a professional support contract for installation and updates.
I'm in a corporate environment. There's a study group to see if maybe we can potentially get some value out of those AI tools. They've been "studying" the issue for over a year now. They expect to get some cloud service that we can safely use Real Soon Now.
So, it'll take at least two more quarters before I can actually use those non-local tools on company related data. Probably longer, because sense of urgency is not this company's strong suit.
Anyway, as a developer I can run a lot of things locally. Local AI doesn't leak data, so it's safe. It's not as good as the online tools, but for some things they're better than nothing.
If you have capable hardware and kids, a local LLM is great. A simple system prompt customisation (e.g. ‘all responses should be written as if talking to a 10 year old’) and knowing that everything is private goes a long way for me at least.
Local micro models are both fast and cheap. We tuned small models on our data set and if the small model thinks content is a certain way, we escalate to the LLM.
This gives us really good recall at really low cloud cost and latency.
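That cascade is simple to wire up. A sketch under assumed names and endpoints (a cheap local screen, escalate only the flagged minority to a cloud model); the screening question and model choices are placeholders, not the actual setup described above.

```python
# Sketch of a small-model-first cascade: a cheap local pass screens items, and
# only the ones it flags get escalated to a larger cloud model. Model names,
# endpoints, and the screening question are placeholders.
from openai import OpenAI

local = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")
cloud = OpenAI()  # uses OPENAI_API_KEY from the environment

def screen_locally(text: str) -> bool:
    resp = local.chat.completions.create(
        model="gpt-oss:20b",
        temperature=0.0,
        max_tokens=3,
        messages=[{"role": "user",
                   "content": "Answer YES or NO: could this post be a policy violation?\n\n" + text}],
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")

def review(text: str) -> str:
    if not screen_locally(text):
        return "ok"              # the vast majority stops here, at zero cloud cost
    resp = cloud.chat.completions.create(
        model="gpt-4.1",         # escalate only the suspicious minority
        messages=[{"role": "user", "content": "Review this post for policy violations:\n\n" + text}],
    )
    return resp.choices[0].message.content
```

Tuning the local screen for high recall (even at poor precision) is what keeps the cloud bill low without missing much.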
I would say, any company that doesn't have its own AI developed. You always hear about companies "mandating" AI usage, but for the most part it's companies developing their own solutions/agents. No self-respecting company with tight opsec would allow a random "always-online" LLM that could rip your codebase either piece by piece or all at once if it's an IDE addon (or at least I hope that's the case). So yeah, I'd say locally deployed LLMs/agents are a game changer.
Jailbreaking, then running censored questions. Like DIY fireworks, analysis of papers that touch "sensitive topics", NSFW image generation - the list is basically endless.
At the company where I currently work, for IP reasons (and with the advice of a patent lawyer), nobody is allowed to use any online AIs to talk about or help with work, unless it's very generic research that doesn't give away what we're working on.
That rules out coding assistants like Claude, chat, tools to generate presentations and copy-edit documents, and so forth.
But local AI are fine, as long as we're sure nothing is uploaded.
A small LLM can do RAG, call functions, summarize, create structured data from messy text, etc... You know, all the things you'd do if you were making an actual app with an LLM.
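The "structured data from messy text" case, for instance, is mostly prompt discipline plus validation. A sketch with a made-up schema, assuming a local OpenAI-compatible server; real apps usually add a retry loop or a server-side JSON/grammar mode if the runtime has one.

```python
# Sketch: extracting structured JSON from messy text with a small local model.
# The schema, model tag, and endpoint are placeholders.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

def extract_contact(text: str):
    resp = client.chat.completions.create(
        model="gpt-oss:20b",
        temperature=0.0,
        messages=[{
            "role": "user",
            "content": ('Extract {"name": str, "email": str, "company": str} from the text below. '
                        "Reply with JSON only, no prose.\n\n" + text),
        }],
    )
    raw = resp.choices[0].message.content
    start, end = raw.find("{"), raw.rfind("}")
    try:
        return json.loads(raw[start:end + 1])
    except ValueError:
        return None  # caller decides whether to retry or escalate

print(extract_contact("Hi, this is Ana Pérez from Acme GmbH, reach me at ana@acme.example"))
```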
Yeah, chat apps are pretty cheap and convenient for users who want to search the internet and write text or code. But APIs quickly get expensive when inputting a significant amount of tokens.
There's a bunch of great reasons in this thread, but how about the chip manufacturers that are going to need you to need a more powerful set of processors in your phone, headset, computer. You can count on those companies to subsidize some R&D and software development.
>Students who don’t want to pay but somehow have the hardware?
that's me - well not a student anymore.
when toying with something, i much prefer not paying for each shot. my 12GB Radeon card can either run a decent model extremely slowly, or an idiotic but fast one. it's nice not dealing with rate limits.
once you write a prompt that mangles an idiotic model into still doing the work, it's really satisfying. the same principle as working to extract the most from limited embedded hardware. masochism, possibly
Some app devs use local models on local environments with LLM APIs to get up and running fast, then when the app deploys it switches to the big online models via environment vars.
In large companies this can save quite a bit of money.
Privacy laws. Processing government paperwork with LLMs, for example. There are a lot of OCR tools that can't be used, and the ones that comply are more expensive than, say, GPT-4.1 and lower quality.
Anything involving the medical industry (HIPAA), national security (FedRAMP is such a PITA to get that some military contractors are bypassing it to get quicker access to cloud tools), etc.
Besides that, we are moving towards an era where we won't need to pay providers a subscription every month to use these models. I can't say for certain whether or not the GPUs that run them will get cheaper, but the option to run your own model is game changing for more than you can possibly imagine.
Agencies / firms that work with classified data. Some places have very strict policies on data, which makes it impossible to use any service that isn't local and air-gapped.
I’d use it on a plane for coding if there were no network, but otherwise it's just an emergency model for when the internet goes out - basically end-of-the-world scenarios.
AI is going to be equivalent to all computing in the future. Imagine if only IBM, Apple and Microsoft ever built computers, and all anyone else ever had in the 1990s were terminals to the mainframe, forever.
Maybe I am too pessimistic, but as an EU citizen I expect politics (or should I say Trump?) to prevent access to US-based frontier models at some point.
How up to date are you on current open weights models? After playing around with it for a few hours I find it to be nowhere near as good as Qwen3-30B-A3B. The world knowledge is severely lacking in particular.
dolphin3.0-llama3.1-8b Q4_K_S [4.69 GB on disk]: correct in <2 seconds
deepseek-r1-0528-qwen3-8b Q6_K [6.73 GB]: correct in 10 seconds
gpt-oss-20b MXFP4 [12.11 GB] low reasoning: wrong after 6 seconds
gpt-oss-20b MXFP4 [12.11 GB] high reasoning: wrong after 3 minutes !
Yea yea it's only one question of nonsense trivia. I'm sure it was billions well spent.
It's possible I'm using a poor temperature setting or something but since they weren't bothered enough to put it in the model card I'm not bothered to fuss with it.
The model is good and runs fine, but if you want to be blown away again try Qwen3-30B-A3B-2507. It's 6GB bigger but the response is comparable or better and much faster to run. gpt-oss-20b gives me 6 tok/sec while Qwen3 gives me 37 tok/sec. Qwen3 is not a reasoning model, though.
I just tested 120B from the Groq API on agentic stuff (multi-step function calling, similar to claude code) and it's not that good. Agentic fine-tuning seems key, hopefully someone drops one soon.
The environmentalist in me loves the fact that LLM progress has mostly been focused on doing more with the same hardware, rather than horizontal scaling. I guess given GPU shortages that makes sense, but it really does feel like the value of my hardware (a laptop in my case) is going up over time, not down.
Also, just wanted to credit you for being one of the five people on Earth who knows the correct spelling of "lede."
Training cost has increased a ton exactly because inference cost is the biggest problem: models are now trained on almost three orders of magnitude more data than what is compute-optimal (per the Chinchilla paper), because saving compute on inference makes it worthwhile to overtrain a smaller model - spending more training compute to achieve similar performance.
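Back-of-envelope, for anyone who hasn't seen the numbers: Chinchilla's compute-optimal point is roughly 20 training tokens per parameter, and modern small models blow way past that. The figures below are illustrative assumptions (a Llama-3-style 15T-token run), not published gpt-oss numbers.

```python
# Rough arithmetic on "overtraining" vs the Chinchilla-optimal ~20 tokens/parameter.
# All figures are illustrative assumptions, not published gpt-oss training numbers.
params = 20e9                       # hypothetical 20B-parameter model
chinchilla_tokens = 20 * params     # ~0.4T tokens would be compute-optimal
actual_tokens = 15e12               # assume a Llama-3-style ~15T-token run

print(f"compute-optimal: {chinchilla_tokens / 1e12:.1f}T tokens")
print(f"assumed run:     {actual_tokens / 1e12:.0f}T tokens "
      f"(~{actual_tokens / chinchilla_tokens:.0f}x over-trained)")
# For sub-1B models trained on the same corpus the ratio runs into the hundreds
# or thousands, which is where the "orders of magnitude" framing comes from.
```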
Interesting. I understand that, but I don't know to what degree.
I mean the training, while expensive, is done once. The inference … besides being done by perhaps millions of clients, is done for, well, the life of the model anyway. Surely that adds up.
It's hard to know, but I assume the user taking up the burden of the inference is perhaps doing so more efficiently? I mean, when I run a local model, it is plodding along — not as quick as the online model. So, slow and therefore I assume necessarily more power efficient.
I've run Qwen3 4B on my phone; it's not the best, but it's better than old GPT-3.5. It also has a reasoning mode, and in reasoning mode it's better than the original GPT-4 and the original GPT-4o, but not the latest GPT-4o. I get usable speed, but it's not really comparable to most cloud-hosted models.
I tried their live demo. It suggests three prompts, one of them being “How many R’s are in strawberry?” So I clicked that, and it answered there are three! I tried it thrice with the same result.
It suggested the prompt. It’s infamous because models often get it wrong, they know it, and still they confidently suggested it and got it wrong.
You can get a pretty good estimate depending on your memory bandwidth. Too many parameters can change with local models (quantization, fast attention, etc). But the new models are MoE so they’re gonna be pretty fast.
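The rule of thumb: for a memory-bound decoder, tokens/sec tops out around memory bandwidth divided by the bytes read per token, which is roughly active parameters times bytes per weight. A sketch with assumed numbers for an M3 Air and a 4-bit MoE:

```python
# Rough ceiling on decode speed for a memory-bound model:
#   tokens/sec <= bandwidth / (active_params * bytes_per_weight)
# ignoring KV-cache reads and runtime overhead. Numbers are assumptions.
bandwidth_gbs = 100        # ~M3 MacBook Air unified-memory bandwidth, GB/s
active_params = 3.6e9      # gpt-oss-20b active parameters per token
bits_per_weight = 4.25     # MXFP4-ish

bytes_per_token = active_params * bits_per_weight / 8
print(f"~{bandwidth_gbs * 1e9 / bytes_per_token:.0f} tok/s theoretical ceiling")
# ~52 tok/s here; real throughput lands somewhat below that.
```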
For me the game changer here is the speed. On my local Mac I'm finally getting token speeds faster than I can read the output (~96 tok/s), and the quality has been solid. I had previously tried some of the distilled Qwen and DeepSeek models and they were just way too slow for me to seriously use them.
In my mind, I’m comparing the model architecture they describe to what the leading open-weights models (Deepseek, Qwen, GLM, Kimi) have been doing. Honestly, it just seems “ok” at a technical level:
- both models use standard Grouped-Query Attention (64 query heads, 8 KV heads). The card talks about how they’ve used an older optimization from GPT3, which is alternating between banded window (sparse, 128 tokens) and fully dense attention patterns. It uses RoPE extended with YaRN (for a 131K context window). So they haven’t been taking advantage of the special-sauce Multi-head Latent Attention from Deepseek, or any of the other similar improvements over GQA.
- both models are standard MoE transformers. The 120B model (116.8B total, 5.1B active) uses 128 experts with top-4 routing. They’re using some kind of gated SwiGLU activation, which the card describes as "unconventional" because of its clamping and the residual connections that implies. Again, not using any of Deepseek’s “shared experts” (for general patterns) + “routed experts” (for specialization) architectural improvements, Qwen’s load-balancing strategies, etc.
- the most interesting thing IMO is probably their quantization solution. They did something to quantize >90% of the model parameters to the MXFP4 format (4.25 bits/parameter) to let the 120B model fit on a single 80GB GPU, which is pretty cool. But we’ve also got Unsloth with their famous 1.58-bit quants :)
All this to say, it seems like even though the training they did for their agentic behavior and reasoning is undoubtedly very good, they’re keeping their actual technical advancements “in their pocket”.
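For anyone who hasn't looked inside an MoE layer, the top-4-of-128 routing described above boils down to something like the toy sketch below (plain numpy, one token at a time, ReLU standing in for the gated SwiGLU). This is for intuition only, not OpenAI's implementation: no shared experts, no load balancing, no batching of tokens per expert.

```python
# Toy top-k MoE routing: a router scores 128 experts per token, the top 4 run,
# and their outputs are mixed with softmax weights. Illustrative only.
import numpy as np

d_model, n_experts, top_k = 64, 128, 4
rng = np.random.default_rng(0)

router_w = rng.standard_normal((d_model, n_experts)) * 0.02
experts = [(rng.standard_normal((d_model, 4 * d_model)) * 0.02,
            rng.standard_normal((4 * d_model, d_model)) * 0.02)
           for _ in range(n_experts)]

def moe_layer(x):                       # x: (d_model,) -- a single token
    logits = x @ router_w               # (n_experts,)
    chosen = np.argsort(logits)[-top_k:]
    weights = np.exp(logits[chosen] - logits[chosen].max())
    weights /= weights.sum()            # softmax over the selected experts only
    out = np.zeros_like(x)
    for w, idx in zip(weights, chosen):
        w_in, w_out = experts[idx]
        out += w * (np.maximum(x @ w_in, 0) @ w_out)   # ReLU MLP stand-in for SwiGLU
    return out

print(moe_layer(rng.standard_normal(d_model)).shape)    # (64,)
```

Only the 4 chosen experts' weights get touched per token, which is why a 120B-total model can decode like a ~5B one.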
I would guess the “secret sauce” here is distillation: pretraining on an extremely high quality synthetic dataset from the prompted output of their state of the art models like o3 rather than generic internet text. A number of research results have shown that highly curated technical problem solving data is unreasonably effective at boosting smaller models’ performance.
This would be much more efficient than relying purely on RL post-training on a small model; with low baseline capabilities the insights would be very sparse and the training very inefficient.
Or, you can say, OpenAI has some real technical advancements on stuff besides attn architecture. GQA8, alternating SWA 128 / full attn do all seem conventional. Basically they are showing us that "no secret sauce in model arch you guys just sucks at mid/post-training", or they want us to believe this.
Kimi K2 paper said that the model sparsity scales up with parameters pretty well (MoE sparsity scaling law, as they call, basically calling Llama 4 MoE "done wrong"). Hence K2 has 128:1 sparsity.
It's convenient to be able to attribute success to things only OpenAI could've done with the combo of their early start and VC money – licensing content, hiring subject matter experts, etc. Essentially the "soft" stuff that a mature organization can do.
I think their MXFP4 release is a bit of a gift since they obviously used and tuned this extensively as a result of cost-optimization at scale - something the open source model providers aren't doing too much, and also somewhat of a competitive advantage.
Unsloth's special quants are amazing, but I've found there to be lots of trade-offs vs full quantization, particularly when striving for the best first-shot attempts - which is by far the bulk of LLM use cases. Running a better (larger, newer) model at lower quantization to fit in memory, or with reduced accuracy/detail to speed it up, both have value, but in the pursuit of first-shot accuracy there don't seem to be many companies running their frontier models at reduced quantization. If OpenAI is doing this in production, that is interesting.
>They did something to quantize >90% of the model parameters to the MXFP4 format (4.25 bits/parameter) to let the 120B model to fit on a single 80GB GPU, which is pretty cool
They said it was native FP4, suggesting that they actually trained it like that; it's not post-training quantisation.
The native FP4 is one of the most interesting architectural aspects here IMO, as going below FP8 is known to come with accuracy tradeoffs. I'm curious how they navigated this and how FP8 weights (if they exist) would have performed.
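For intuition on what MXFP4 means: weights are grouped into blocks of 32, each block shares one power-of-two scale, and each element is a 4-bit FP4 (E2M1) value whose representable magnitudes are only {0, 0.5, 1, 1.5, 2, 3, 4, 6} - hence the 4.25 bits/parameter (4 bits plus an 8-bit scale per 32 weights). A toy round-trip below; it's my illustration of the format, not OpenAI's training recipe, and real kernels pack the bits rather than materializing floats.

```python
# Toy MXFP4-style quantization: blocks of 32 weights share a power-of-two scale,
# and each element is snapped to the nearest FP4 (E2M1) value. Illustrative only.
import numpy as np

FP4_GRID = np.array([0, 0.5, 1, 1.5, 2, 3, 4, 6])   # E2M1 magnitudes
BLOCK = 32

def mxfp4_roundtrip(w):
    w = w.reshape(-1, BLOCK)
    block_max = np.maximum(np.abs(w).max(axis=1, keepdims=True), 1e-12)
    scale = 2.0 ** np.ceil(np.log2(block_max / FP4_GRID[-1]))   # power-of-two scale per block
    scaled = w / scale
    idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
    return (np.sign(scaled) * FP4_GRID[idx] * scale).reshape(-1)

w = np.random.default_rng(0).standard_normal(1024).astype(np.float32)
print(f"mean abs round-trip error: {np.abs(w - mxfp4_roundtrip(w)).mean():.3f}")
```

Training natively in this format, rather than quantizing afterwards, is presumably where the interesting engineering lives.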
I don't know how to ask this without being direct and dumb: Where do I get a layman's introduction to LLMs that could work me up to understanding every term and concept you just discussed? Either specific videos, or if nothing else, a reliable Youtube channel?
What I’ve sometimes done when trying to make sense of recent LLM research is give the paper and related documents to ChatGPT, Claude, or Gemini and ask them to explain the specific terms I don’t understand. If I don’t understand their explanations or want to know more, I ask follow-ups. Doing this in voice mode works better for me than text chat does.
When I just want a full summary without necessarily understanding all the details, I have an audio overview made on NotebookLM and listen to the podcast while I’m exercising or cleaning. I did that a few days ago with the recent Anthropic paper on persona vectors, and it worked great.
There is a great 3blue1brown video, but it’s pretty much impossible by now to cover the entire landscape of research. I bet gpt-oss has some great explanations though ;)
Try Microsoft's "Generative AI for Beginners" repo on GitHub. The early chapters in particular give a good grounding of LLM architecture without too many assumptions of background knowledge. The video version of the series is good too.
Also: attention sinks (although implemented as extra trained logits used in attention softmax rather than attending to e.g. a prepended special token).
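The "extra trained logits" version is easy to picture: append a learned scalar to each head's attention logits before the softmax, then discard its probability mass, so the head has somewhere to dump attention without attending to any real token. A toy sketch of my reading of the idea, not the actual gpt-oss code:

```python
# Toy attention sink as an extra learned logit: it joins the softmax
# normalization but contributes no value vector.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(q, k, v, sink_logit):
    # q: (d,), k/v: (seq, d), sink_logit: a learned scalar (per head in practice)
    logits = k @ q / np.sqrt(q.shape[0])               # (seq,)
    logits = np.concatenate([logits, [sink_logit]])    # append the sink
    probs = softmax(logits)[:-1]                       # drop the sink's share
    return probs @ v                                   # weights now sum to <= 1

rng = np.random.default_rng(0)
q, k, v = rng.standard_normal(16), rng.standard_normal((8, 16)), rng.standard_normal((8, 16))
print(attend(q, k, v, sink_logit=2.0).shape)           # (16,)
```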
Open models are going to win long-term. Anthropic's own research has to use OSS models [0]. China is demonstrating how quickly companies can iterate on open models, allowing smaller teams access and augmentation to the abilities of a model without paying the training cost.
My personal prediction is that the US foundational model makers will OSS something close to N-1 for the next 1-3 iterations. The CAPEX for the foundational model creation is too high to justify OSS for the current generation. Unless the US Gov steps up and starts subsidizing power, or Stargate does 10x what is planned right now.
N-1 model value depreciates insanely fast. Making an OSS release of them and allowing specialized use cases and novel developments allows potential value to be captured and integrated into future model designs. It's medium risk, as you may lose market share. But also high potential value, as the shared discoveries could substantially increase the velocity of next-gen development.
There will be a plethora of small OSS models. Iteration on the OSS releases is going to be biased towards local development, creating more capable and specialized models that work on smaller and smaller devices. In an agentic future, every different agent in a domain may have its own model. Distilled and customized for its use case without significant cost.
Everyone is racing to AGI/SGI. The models along the way are to capture market share and use data for training and evaluations. Once someone hits AGI/SGI, the consumer market is nice to have, but the real value is in novel developments in science, engineering, and every other aspect of the world.
I'm pretty sure there's no reason Anthropic has to do research on open models; it's just that they produced their result on open models so that you can reproduce it without having access to theirs.
[2 of 3] Assuming we pin down what win means... (which is definitely not easy)... What would it take for this to not be true? There are many ways, including but not limited to:
- publishing open weights helps your competitors catch up
- publishing open weights doesn't improve your own research agenda
- publishing open weights leads to a race dynamic where only the latest and greatest matters; leading to a situation where the resources sunk exceed the gains
- publishing open weights distracts your organization from attaining a sustainable business model / funding stream
- publishing open weights leads to significant negative downstream impacts (there are a variety of uncertain outcomes, such as: deepfakes, security breaches, bioweapon development, unaligned general intelligence, humans losing control [1] [2], and so on)
I don't think there will be such a unique event. There is no clear boundary. This is a continuous process. Models get slightly better than before.
Also, another dimension is the inference cost to run those models. It has to be cheap enough to really take advantage of it.
Also, I wonder, what would be a good target to make profit, to develop new things? There is Isomorphic Labs, which seems like a good target. This company already exists now, and people are working on it. What else?
> I don't think there will be such a unique event.
I guess it depends on your definition of AGI, but if it means human level intelligence then the unique event will be the AI having the ability to act on its own without a "prompt".
I'm a layman, but it seems to me that the industry is going towards robust foundational models onto which we plug tools, databases, and processes to expand their capabilities.
In this setup OSS models could be more than enough and capture the market, but I don't see where the value would be in a multitude of specialized models we have to train.
I have this theory that we simply got over a hump by utilizing a massive processing boost from gpus as opposed to CPUs. That might have been two to three orders of magnitude more processing power.
But that's a one-time success. I don't think hardware has any large-scale improvements coming, because 3D gaming already plumbed most of that vector-processing hardware development over the last 30 years.
So will software and better training models produce another couple orders of magnitude?
Fundamentally we're talking about nines of accuracy. What is the processing power required for each nine of accuracy? Is it linear? Is it polynomial? Is it exponential?
It just seems strange to me that, with all the AI knowledge sloshing through academia, I haven't seen any basic analysis at that level, which is something that's absolutely going to be necessary for AI applications like self-driving once you get the insurance companies involved.
This implies LLM development isn’t plateaued. Sure, the researchers are busting their asses quantizing, adding features like tool calls and structured outputs, etc. But soon enough N-1 ~= N.
To me it depends on 2 factors: hardware becoming more accessible, and the closed-source offerings becoming more expensive. Right now 1) it's difficult to get enough GPUs to do local inference at production scale, and 2) it's more expensive to run your own GPUs vs closed-source models.
[3 of 3] What would it take for this statement to be false or missing the point?
Maybe we find ourselves in a future where:
- Yes, open models are widely used as base models, but they are also highly customized in various ways (perhaps by industry, person, attitude, or something else). In other words, this would be a blend of open and closed.
- Maybe publishing open weights of a model is more-or-less irrelevant, because it is "table stakes" ... because all the key differentiating advantages have to do with other factors, such as infrastructure, non-LLM computational aspects, regulatory environment, affordable energy, customer base, customer trust, and probably more.
- The future might involve thousands or millions of highly tailored models
Running a model comparable to o3 on a 24GB Mac Mini is absolutely wild. Seems like yesterday the idea of running frontier (at the time) models locally or on a mobile device was 5+ years out. At this rate, we'll be running such models in the next phone cycle.
It only seems like that if you haven't been following other open source efforts. Models like Qwen perform ridiculously well and do so on very restricted hardware. I'm looking forward to seeing benchmarks to see how these new open source models compare.
Right? I still remember the safety outrage of releasing Llama. Now? My 96 GB of (V)RAM MacBook will be running a 120B parameter frontier lab model. So excited to get my hands on the MLX quants and see how it feels compared to GLM-4.5-air.
I feel like most of the safety concerns ended up being proven correct, but there's so much money in it that they decided to push on anyway full steam ahead.
AI did get used for fake news, propaganda, mass surveillance, erosion of trust and sense of truth, and mass spamming social media.
in that era, OpenAI and Anthropic were still deluding themselves into thinking they would be the "stewards" of generative AI, and the last US administration was very keen on regoolating everything under the sun, so "safety" was just an angle for regulatory capture.
Qwen3 Coder is 4x its size! Grok 3 is over 22x its size!
What does the resource usage look like for GLM 4.5 Air? Is that benchmark in FP16? GPT-OSS-120B will be using between 1/4 and 1/2 the VRAM that GLM-4.5 Air does, right?
It seems like a good showing to me, even though Qwen3 Coder and GLM 4.5 Air might be preferable for some use cases.
When people talk about running a (quantized) medium-sized model on a Mac Mini, what types of latency and throughput times are they talking about? Do they mean like 5 tokens per second or at an actually usable speed?
Using LM-Studio as the frontend and Playwright-powered MCP tools for browser access. I've had success with one such MCP: https://github.com/instavm/coderunner It has a tool called navigate_and_get_all_visible_text, for example.
Inference in Python uses harmony [1] (for request and response format) which is written in Rust with Python bindings. Another OpenAI's Rust library is tiktoken [2], used for all tokenization and detokenization. OpenAI Codex [3] is also written in Rust. It looks like OpenAI is increasingly adopting Rust (at least for inference).
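If you just want to poke at the tokenizer side from Python, tiktoken is pip-installable. Whether your tiktoken version registers a gpt-oss-specific harmony encoding I'm not sure, so the sketch below uses the o200k_base vocabulary as an assumption.

```python
# Quick look at tokenization via tiktoken (Rust core, Python bindings).
# "o200k_base" exists in current tiktoken; a gpt-oss/harmony-specific encoding
# name may or may not be available depending on your version.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")
ids = enc.encode("gpt-oss runs on my laptop")
print(len(ids), ids)
print(enc.decode(ids))
```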
Even without an imminent release it's a good strategy. They're getting pressure from Qwen and other high-performing open-weight models. Without a horse in the race they could fall behind in an entire segment.
There's future opportunity in licensing, tech support, agents, or even simply dominating and eliminating. Not to mention brand awareness: if you like these, you might be more likely to approach their brand for larger models.
Is this the stealth models Horizon Alpha and Beta? I was generally impressed with them (although I really only used them in chats rather than any code tasks). In terms of chat, I increasingly see very little difference between the current SOTA closed models and their open-weight counterparts.
> I can't think of any reason they'd release this unless they were about to announce something which totally eclipses it
Given it's only around 5 billion active params it shouldn't be a competitor to o3 or any of the other SOTA models, given the top Deepseek and Qwen models have around 30 billion active params. Unless OpenAI somehow found a way to make a model with 5 billion active params perform as well as one with 4-8 times more.
Seeing a 20B model competing with o3's performance is mind-blowing - just a year ago, most of us would've called this impossible. Not just the intelligence leap, but getting this level of capability in such a compact size.
I think the point that makes me most excited is that we can train trillion-parameter giants and distill them down to just billions without losing the magic. Imagine coding with Claude 4 Opus-level intelligence packed into a 10B model running locally at 2000 tokens/sec - instant AI collaboration. That would fundamentally change how we develop software.
It's not even a 20b model. It's 20b MoE with 3.6b active params.
But it does not actually compete with o3 performance. Not even close. As usual, the metrics are bullshit. You don't know how good the model actually is until you grill it yourself.
From experience, it's much more engineering work on the integrator's side than on OpenAI's. Basically they provide you their new model in advance, but they don't know the specifics of your system, so it's normal that you do most of the work.
Thus I'm particularly impressed by Cerebras: they only have a few models supported for their extreme perf inference, it must have been huge bespoke work to integrate.
We are not even at that extreme and you can already see the unequal reality that too much SaaS has engendered
I think it can make LLMs fun.
I'm sure there are other use cases, but much like "what is BitTorrent for?", the obvious use case is obvious.
1. App makers can fine-tune smaller models and include them in their apps to avoid server costs
2. Privacy-sensitive content can be either filtered out or worked on... I'm using local LLMs to process my health history for example
3. Edge servers can be running these fine tuned for a given task. Flash/lite models by the big guys are effectively like these smaller models already.
For example, "generate a heatmap of each token/word and how 'unexpected' they are" or "find me a prompt that creates the closest match to this text"
To be efficient both require access that is not exposed over API.
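Locally the first one is straightforward, because you have the logits: score each token by its negative log-probability under the model. A sketch using transformers and a small model picked arbitrarily as an example; swap in whatever fits your hardware.

```python
# Sketch: per-token "unexpectedness" (negative log-probability) with a local
# causal LM via transformers. The model name is only an example.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen2.5-0.5B"          # example; any local causal LM works
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)
model.eval()

text = "The quick brown fox jumps over the lazy banana"
ids = tok(text, return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(ids).logits                            # (1, seq, vocab)

logprobs = torch.log_softmax(logits[0, :-1], dim=-1)      # predict token i+1 from prefix
targets = ids[0, 1:]
surprisal = -logprobs[torch.arange(targets.shape[0]), targets]
for t, s in zip(tok.convert_ids_to_tokens(targets.tolist()), surprisal):
    print(f"{t:>12}  {s.item():5.2f}")                    # higher = more unexpected
```

Turn those numbers into colors and you have the heatmap.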
How about running one on this site but making it publicly available? A sort of outranet, and calling it HackerBrain?
Even if they did offer a defined latency product, you’re relying on a lot of infrastructure between your application and their GPU.
That’s not always tolerable.
iPhone users in a few months – because I predict app developers will love cramming calls to the foundation models into everything.
Android will follow.
In large companies this can save quite a bit of money.
Why not run all the models at home, maybe collaboratively or at least in parallel?
I'm sure there are use cases where the paid models are not allowed to collaborate or ask each other.
also, other open models are gaining mindshare.
example: military intel
That means running instantly offline and every token is free
Privacy is obvious.
Answer on Wikipedia: https://en.wikipedia.org/wiki/Battle_of_Midway#U.S._code-bre...
Small models are going to be particularly poor when used outside of their intended purpose. They have to omit something.
Not in the UK it isn’t.
These are the simplified results (total percentage of correctly classified E-mails on both spam and ham testing data):
gpt-oss:20b 95.6%
gemma3:27b-it-qat 94.3%
mistral-small3.2:24b-instruct-2506-q4_K_M 93.7%
mistral-small3.2:24b-instruct-2506-q8_0 92.5%
qwen3:32b-q4_K_M 89.2%
qwen3:30b-a3b-q4_K_M 87.9%
gemma3n:e4b-it-q4_K_M 84.9%
deepseek-r1:8b 75.2%
qwen3:30b-a3b-instruct-2507-q4_K_M 73.0%
I'm quite happy, because it's also smaller and faster than gemma3.
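A harness for numbers like these doesn't need to be fancy - roughly the loop below, asking for a one-word verdict per message and scoring against labeled folders. Paths, prompt wording, and the truncation limit are placeholders, not the exact setup behind the table above.

```python
# Rough shape of a spam/ham benchmark over a local model server: one-word verdict
# per message, accuracy over labeled folders. Paths and prompt are placeholders.
from pathlib import Path
import requests

def verdict(model: str, body: str) -> str:
    r = requests.post("http://localhost:11434/api/generate", json={
        "model": model,
        "prompt": "Reply with exactly one word, SPAM or HAM, for this email:\n\n" + body[:4000],
        "stream": False,
        "options": {"temperature": 0},
    }, timeout=300)
    return "SPAM" if "SPAM" in r.json()["response"].upper() else "HAM"

def accuracy(model: str) -> float:
    total = correct = 0
    for label in ("SPAM", "HAM"):
        for path in Path(f"testdata/{label.lower()}").glob("*.txt"):
            total += 1
            correct += verdict(model, path.read_text(errors="ignore")) == label
    return correct / total

for m in ("gpt-oss:20b", "gemma3:27b-it-qat"):
    print(m, f"{accuracy(m):.1%}")
```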
https://huggingface.co/spaces/TIGER-Lab/MMLU-Pro
Are you discounting all of the self-reported scores?
I don't understand why "TIGER-Lab"-sourced scores are 'unknown' in terms of model size?
When you imagine a lake being drained to cool a datacenter do you ever consider where the water used for cooling goes? Do you imagine it disappears?
https://i.imgur.com/DgAvbee.png
This is a thinking model, so I ran it against o4-mini, here are the results:
* gpt-oss:20b
* Time-to-first-token: 2.49 seconds
* Time-to-completion: 51.47 seconds
* Tokens-per-second: 2.19
* o4-mini on ChatGPT
* Time-to-first-token: 2.50 seconds
* Time-to-completion: 5.84 seconds
* Tokens-per-second: 19.34
Time to first token was similar, but the thinking piece was _much_ faster on o4-mini. Thinking took the majority of the 51 seconds for gpt-oss:20b.
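If anyone wants to collect numbers like these themselves, here's a rough timing harness I'd use. It assumes an Ollama server on the default port and treats streamed chunks as a loose proxy for tokens; the endpoint and field names are from my recollection of the Ollama REST API, so adjust if they differ.

    import json
    import time
    import requests

    def benchmark(model, prompt):
        t0 = time.time()
        first_chunk_at = None
        chunks = 0
        resp = requests.post(
            "http://localhost:11434/api/chat",
            json={"model": model,
                  "messages": [{"role": "user", "content": prompt}]},
            stream=True,  # server streams newline-delimited JSON chunks
        )
        for line in resp.iter_lines():
            if not line:
                continue
            part = json.loads(line)
            if part.get("message", {}).get("content"):
                if first_chunk_at is None:
                    first_chunk_at = time.time()
                chunks += 1  # each chunk is roughly one token
        total = time.time() - t0
        print(f"{model}: {first_chunk_at - t0:.2f}s to first chunk, "
              f"{total:.2f}s total, ~{chunks / total:.1f} chunks/s")

    benchmark("gpt-oss:20b", "Explain the Monty Hall problem in two sentences.")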
In my mind, I’m comparing the model architecture they describe to what the leading open-weights models (Deepseek, Qwen, GLM, Kimi) have been doing. Honestly, it just seems “ok” at a technical level:
- both models use standard Grouped-Query Attention (64 query heads, 8 KV heads). The card talks about how they’ve used an older optimization from GPT3, which is alternating between banded window (sparse, 128 tokens) and fully dense attention patterns. It uses RoPE extended with YaRN (for a 131K context window). So they haven’t been taking advantage of the special-sauce Multi-head Latent Attention from Deepseek, or any of the other similar improvements over GQA.
- both models are standard MoE transformers. The 120B model (116.8B total, 5.1B active) uses 128 experts with Top-4 routing. They're using some kind of Gated SwiGLU activation, which the card talks about as being "unconventional" because of the clamping and whatever residual connections that implies. Again, not using any of Deepseek's "shared experts" (for general patterns) + "routed experts" (for specialization) architectural improvements, Qwen's load-balancing strategies, etc. (a minimal top-k routing sketch follows at the end of this comment)
- the most interesting thing IMO is probably their quantization solution. They quantized >90% of the model parameters to the MXFP4 format (4.25 bits/parameter), which lets the 120B model fit on a single 80GB GPU, which is pretty cool. But we've also got Unsloth with their famous 1.58bit quants :)
All this to say, it seems like even though the training they did for their agentic behavior and reasoning is undoubtedly very good, they’re keeping their actual technical advancements “in their pocket”.
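For readers who haven't seen top-k expert routing before, here's a minimal PyTorch sketch of the idea. The dimensions, the plain linear router, and the softmax over the selected experts are my own simplifications for illustration; this is not the gpt-oss implementation (which reportedly uses 128 experts with Top-4 routing and gated SwiGLU expert MLPs).

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TopKMoE(nn.Module):
        """Toy top-k routed MoE layer (illustration only)."""
        def __init__(self, d_model=64, d_ff=128, n_experts=8, top_k=4):
            super().__init__()
            self.top_k = top_k
            self.router = nn.Linear(d_model, n_experts, bias=False)
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(),
                              nn.Linear(d_ff, d_model))
                for _ in range(n_experts)
            )

        def forward(self, x):                        # x: (tokens, d_model)
            logits = self.router(x)                  # (tokens, n_experts)
            weights, idx = logits.topk(self.top_k, dim=-1)
            weights = F.softmax(weights, dim=-1)     # renormalize over chosen experts
            out = torch.zeros_like(x)
            for slot in range(self.top_k):
                for e in idx[:, slot].unique().tolist():  # run each selected expert once
                    mask = idx[:, slot] == e
                    out[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[e](x[mask])
            return out

    x = torch.randn(10, 64)
    print(TopKMoE()(x).shape)   # torch.Size([10, 64])

Only top_k experts run per token, which is the whole point: the active parameter count (and therefore compute per token) stays far below the total parameter count.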
This would be much more efficient than relying purely on RL post-training on a small model; with low baseline capabilities the insights would be very sparse and the training very inefficient.
same seems to be true for humans
The model is pretty sparse tho, 32:1.
Unsloth's special quants are amazing, but I've found there to be lots of trade-offs vs full quantization, particularly when striving for the best first-shot attempts, which is by far the bulk of LLM use cases. Running a better (larger, newer) model at lower quantization to fit in memory, or with reduced accuracy/detail to speed it up, both have value, but in the pursuit of first-shot accuracy there don't seem to be many companies running their frontier models at reduced quantization. If OpenAI is doing this in production, that is interesting.
They said it was native FP4, suggesting that they actually trained it like that; it's not post-training quantisation.
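To make the 4.25 bits/parameter figure above concrete, here's a back-of-the-envelope sketch of MXFP4-style block quantization as I understand the format: blocks of 32 values share one 8-bit scale, and each value is stored as a 4-bit E2M1 float, so (32*4 + 8)/32 = 4.25 bits per parameter. The nearest-value rounding and the plain float scale below are simplifications for illustration; the real format uses a power-of-two (E8M0) scale.

    import numpy as np

    E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # FP4 magnitudes

    def fake_quantize_block(block):
        # Each 32-value block shares one scale (simplified to a plain float here).
        scale = max(np.abs(block).max() / E2M1[-1], 1e-12)
        scaled = block / scale
        nearest = np.abs(np.abs(scaled)[:, None] - E2M1[None, :]).argmin(axis=1)
        return np.sign(scaled) * E2M1[nearest] * scale   # dequantized values

    rng = np.random.default_rng(0)
    w = rng.standard_normal(32).astype(np.float32)
    w_q = fake_quantize_block(w)
    print("max abs error:", np.abs(w - w_q).max())
    print("effective bits/param:", (32 * 4 + 8) / 32)    # 4.25

Training natively in that format means the weights never had a higher-precision "original" to be degraded from, which is a different proposition than quantizing an FP16 checkpoint after the fact.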
When I just want a full summary without necessarily understanding all the details, I have an audio overview made on NotebookLM and listen to the podcast while I’m exercising or cleaning. I did that a few days ago with the recent Anthropic paper on persona vectors, and it worked great.
https://www.manning.com/books/build-a-large-language-model-f...
My personal prediction is that the US foundational model makers will OSS something close to N-1 for the next 1-3 iterations. The CAPEX for foundational model creation is too high to justify OSS for the current generation, unless the US Gov steps up and starts subsidizing power, or Stargate does 10x what is currently planned.
N-1 model value depreciates insanely fast. Making an OSS release of them and allowing specialized use cases and novel developments allows potential value to be captured and integrated into future model designs. It's medium risk, as you may lose market share. But also high potential value, as the shared discoveries could substantially increase the velocity of next-gen development.
There will be a plethora of small OSS models. Iteration on the OSS releases is going to be biased towards local development, creating more capable and specialized models that work on smaller and smaller devices. In an agentic future, every different agent in a domain may have its own model. Distilled and customized for its use case without significant cost.
Everyone is racing to AGI/SGI. The models along the way are to capture market share and use data for training and evaluations. Once someone hits AGI/SGI, the consumer market is nice to have, but the real value is in novel developments in science, engineering, and every other aspect of the world.
[0] https://www.anthropic.com/research/persona-vectors
> We demonstrate these applications on two open-source models, Qwen 2.5-7B-Instruct and Llama-3.1-8B-Instruct.
[2 of 3] Assuming we pin down what win means... (which is definitely not easy)... What would it take for this to not be true? There are many ways, including but not limited to:
- publishing open weights helps your competitors catch up
- publishing open weights doesn't improve your own research agenda
- publishing open weights leads to a race dynamic where only the latest and greatest matters; leading to a situation where the resources sunk exceed the gains
- publishing open weights distracts your organization from attaining a sustainable business model / funding stream
- publishing open weights leads to significant negative downstream impacts (there are a variety of uncertain outcomes, such as: deepfakes, security breaches, bioweapon development, unaligned general intelligence, humans losing control [1] [2], and so on)
[1]: "What failure looks like" by Paul Christiano : https://www.alignmentforum.org/posts/HBxe6wdjxK239zajf/what-...
[2]: "An AGI race is a suicide race." - quote from Max Tegmark; article at https://futureoflife.org/statement/agi-manhattan-project-max...
I don't think there will be such a unique event. There is no clear boundary. This is a continuous process. Models get slightly better than before.
Also, another dimension is the inference cost to run those models. It has to be cheap enough to really take advantage of it.
Also, I wonder, what would be a good target to make profit, to develop new things? There is Isomorphic Labs, which seems like a good target. This company already exists now, and people are working on it. What else?
I guess it depends on your definition of AGI, but if it means human level intelligence then the unique event will be the AI having the ability to act on its own without a "prompt".
In this setup OSS models could be more than enough to capture the market, but I don't see where the value would be in a multitude of specialized models we'd have to train.
I have this theory that we simply got over a hump by utilizing a massive processing boost from gpus as opposed to CPUs. That might have been two to three orders of magnitude more processing power.
But that's a one-time success. I don't think hardware has any large-scale improvements coming, because 3D gaming already plumbed most of that vector-processing hardware development over the last 30 years.
So will software and better training models produce another couple orders of magnitude?
Fundamentally we're talking about nines of accuracy. What is the processing power required for each additional nine of accuracy? Is it linear? Is it polynomial? Is it exponential?
It just seems strange to me, with all the AI knowledge sloshing through academia, that I haven't seen any basic analysis at that level, which is something that's absolutely going to be necessary for AI applications like self-driving once you get the insurance companies involved.
[1 of 3] For the sake of argument here, I'll grant the premise. If this turns out to be true, it glosses over other key questions, including:
For a frontier lab, what is a rational period of time (according to your organizational mission / charter / shareholder motivations*) to wait before:
1. releasing a new version of an open-weight model; and
2. how much secret sauce do you hold back?
* Take your pick. These don't align perfectly with each other, much less the interests of a nation or world.
This implies LLM development isn't plateaued. Sure, the researchers are busting their asses quantizing, adding features like tool calls and structured outputs, etc. But soon enough N-1 ~= N.
[3 of 3] What would it take for this statement to be false or missing the point?
Maybe we find ourselves in a future where:
- Yes, open models are widely used as base models, but they are also highly customized in various ways (perhaps by industry, person, attitude, or something else). In other words, this would be a blend of open and closed.
- Maybe publishing open weights of a model is more-or-less irrelevant, because it is "table stakes" ... because all the key differentiating advantages have to do with other factors, such as infrastructure, non-LLM computational aspects, regulatory environment, affordable energy, customer base, customer trust, and probably more.
- The future might involve thousands or millions of highly tailored models
Kind of a P=NP, but for software deliverability.
AI did get used for fake news, propaganda, mass surveillance, erosion of trust and sense of truth, and mass spamming social media.
God bless China.
The 120B model is worse at coding than Qwen3 Coder, GLM 4.5 Air, and even Grok 3... (https://www.reddit.com/r/LocalLLaMA/comments/1mig58x/gptoss1...)
What does the resource usage look like for GLM 4.5 Air? Is that benchmark in FP16? GPT-OSS-120B will be using between 1/4 and 1/2 the VRAM that GLM-4.5 Air does, right?
It seems like a good showing to me, even though Qwen3 Coder and GLM 4.5 Air might be preferable for some use cases.
12.63 tok/sec • 860 tokens • 1.52s to first token
I'm amazed it works at all with such limited RAM
and the 120b: https://asciinema.org/a/B0q8tBl7IcgUorZsphQbbZsMM
I am, um, floored
Here's a demo of this functionality: https://www.youtube.com/watch?v=9mLrGcuDifo
[1] https://github.com/openai/harmony
[2] https://github.com/openai/tiktoken
[3] https://github.com/openai/codex
From a strategic perspective, I can't think of any reason they'd release this unless they were about to announce something which totally eclipses it?
There's future opportunity in licensing, tech support, agents, or even simply to dominate and eliminate. Not to mention brand awareness: if you like these, you might be more likely to approach their brand for larger models.
https://manifold.markets/Bayesian/on-what-day-will-gpt5-be-r...
The question is how much better the new model(s) will need to be on the metrics given here to feel comfortable making these available.
Despite the loss of face for the lack of open model releases, I do not think that was a big enough problem to undercut commercial offerings.
Given it's only around 5 billion active params it shouldn't be a competitor to o3 or any of the other SOTA models, given the top Deepseek and Qwen models have around 30 billion active params. Unless OpenAI somehow found a way to make a model with 5 billion active params perform as well as one with 4-8 times more.
I think that the point that makes me more excited is that we can train trillion-parameter giants and distill them down to just billions without losing the magic. Imagine coding with Claude 4 Opus-level intelligence packed into a 10B model running locally at 2000 tokens/sec - like instant AI collaboration. That would fundamentally change how we develop software.
But it does not actually compete with o3 performance. Not even close. As usual, the metrics are bullshit. You don't know how good the model actually is until you grill it yourself.
Kudos to that team.
https://ollama.com/blog/gpt-oss
https://www.reddit.com/r/LocalLLaMA/comments/1meeyee/ollamas...
All the real heavy lifting is done by llama.cpp, and for the distribution, by HuggingFace.