gpt-oss:20b is a top-ten model on MMLU (right behind Gemini-2.5-Pro), and I just ran it locally on my MacBook Air M3 from last year.
I've been experimenting with a lot of local models, both on my laptop and on my phone (Pixel 9 Pro), and I figured we'd be here in a year or two.
But no, we're here today. A basically frontier model, running for the cost of electricity (free with a rounding error) on my laptop. No $200/month subscription, no lakes being drained, etc. I'm blown away.
I tried 20b locally and it couldn't reason its way out of a basic river-crossing puzzle with the labels changed. That is not anywhere near SOTA. In fact it's worse than many local models that can do it, including e.g. QwQ-32b.
> In fact it's worse than many local models that can do it, including e.g. QwQ-32b.
I'm not going to be surprised that a 20B 4/32 MoE model (3.6B parameters activated) is less capable at a particular problem category than a 32B dense model, and it's quite possible for both to be SOTA: state of the art at different scales (both parameter count and speed, which scales with active-parameter count) is going to have different capabilities. TANSTAAFL.
Well river crossings are one type of problem. My real world problem is proofing and minor editing of text. A version installed on my portable would be great.
The 20b solved the wolf, goat, and cabbage river-crossing puzzle for me when set to high reasoning, without needing a system prompt that encourages critical thinking. It managed it with multiple different recommended settings, from temperatures of 0.6 up to 1.0, etc.
Other models have generally failed that without a system prompt that encourages rigorous thinking. Each of the reasoning settings may very well have thinking guidance baked in that does something similar, though.
I'm not sure it says that much that it can solve this, since it's public and can be in training data. It does say something if it can't solve it, though. So, for what it's worth, it solves it reliably for me.
I think this is the smallest model I've seen solve it.
I tried the "two US presidents having the same parents" one, and while it understood the intent, it got caught up in being adamant that Joe Biden won the 2024 election. Anything I did to try to tell it otherwise was dismissed as false, and it insisted quite definitely that I needed to do proper research with legitimate sources.
Just out of curiosity, did you set the reasoning effort to high?
I've seen big jumps in performance between medium and high reasoning. Not that this will solve your particular issue, but I'm curious what settings you had.
Edit: I gave this a shot. My settings were temperature = 0.00 (not sure if these models respect temp settings, but I did it just in case), and I also set reasoning effort to high. I'm using the Groq-hosted version.
This was my prompt:
"You have a fox, a chicken and some grain. The chicken will eat the grain if left alone with it. The fox is vegetarian, for religious reasons, and will also eat the grain if left alone with it. You need to get all across a river, but your boat can only take one at a time. How do you proceed?"
This was the response:
"*Solution – 7 moves*
| Move | What you take | What’s left on the left bank | What’s on the right bank |
|------|--------------|----------------------------|------------------------|
| 1 | *Take the grain* across | Fox, Chicken | *Farmer + Grain* |
| 2 | *Return alone* | *Farmer + Fox + Chicken* | Grain |
| 3 | *Take the fox* across | Chicken | *Farmer + Fox + Grain* |
| 4 | *Bring the grain back* | *Farmer + Grain + Chicken* | Fox |
| 5 | *Take the chicken* across | Grain | *Farmer + Chicken + Fox* |
| 6 | *Return alone* | *Farmer + Grain* | Chicken + Fox |
| 7 | *Take the grain* across | – | *Farmer + Fox + Chicken + Grain* |
*Explanation*
- The grain is the only thing that can be eaten, and it cannot be left alone with either the fox or the chicken.
- So the grain must always be with the farmer whenever any other animal is on the same side.
The sequence above never leaves the grain with the fox or the chicken without the farmer, and after the seventh move all three items are safely on the far side."
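For reference, reproducing settings like these against Groq's OpenAI-compatible endpoint looks roughly like the sketch below. The base URL, model id, and whether the `reasoning_effort` parameter is honored are assumptions to check against your provider's docs; recent openai-python versions accept the parameter directly, older ones need `extra_body`.

```python
# Minimal sketch: querying gpt-oss through an OpenAI-compatible endpoint.
# Assumptions: the base_url and model id match your provider (Groq here, or a
# local server such as Ollama/llama.cpp), and the provider passes
# reasoning_effort through. If it doesn't, putting "Reasoning: high" in the
# system prompt is the commonly suggested fallback for gpt-oss.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.groq.com/openai/v1",  # or http://localhost:11434/v1 for Ollama
    api_key="YOUR_KEY",
)

resp = client.chat.completions.create(
    model="openai/gpt-oss-20b",   # assumed model id; check your provider's listing
    temperature=0.0,
    reasoning_effort="high",      # recent openai-python; else extra_body={"reasoning_effort": "high"}
    messages=[{"role": "user", "content": "You have a fox, a chicken and some grain..."}],
)
print(resp.choices[0].message.content)
```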
This kind of stuff is so tired. Who cares if it can't solve your silly riddle? It can probably do competitive coding at a world-class level and we're quibbling over child riddles? Yeah, you know, my backhoe is really bad at cutting my toenails, what a PoS.
I’m still trying to understand what is the biggest group of people that uses local AI (or will)? Students who don’t want to pay but somehow have the hardware? Devs who are price conscious and want free agentic coding?
Local models, in my experience, can't even pull data from an image without hallucinating (Qwen 2.5 VL in that example). Hopefully local/small models keep getting better and devices get better at running bigger ones.
It feels like we do it because we can more than because it makes sense - which I am all for! I just wonder if I'm missing some kind of major use case all around me that justifies chaining together a bunch of Mac Studios or buying a really great graphics card. Tools like exo are cool and the idea of distributed compute is neat, but what edge cases truly need it so badly that it's worth all the effort?
Privacy, both personal and for corporate data protection, is a major reason. Unlimited usage, allowing offline use, supporting open source, not worrying about a good model being taken down/discontinued or changed, and the freedom to use uncensored models or model fine-tunes are other benefits (though this OpenAI model is super-censored - “safe”).
I don’t have much experience with local vision models, but for text questions the latest local models are quite good. I’ve been using Qwen 3 Coder 30B-A3B a lot to analyze code locally and it has been great. While not as good as the latest big cloud models, it’s roughly on par with SOTA cloud models from late last year in my usage. I also run Qwen 3 235B-A22B 2507 Instruct on my home server, and it’s great, roughly on par with Claude 4 Sonnet in my usage (but slow of course running on my DDR4-equipped server with no GPU).
It's striking how much of the AI conversation focuses on new use cases, while overlooking one of the most serious non-financial costs: privacy.
I try to be mindful of what I share with ChatGPT, but even then, asking it to describe my family produced a response that was unsettling in its accuracy and depth.
Worse, after attempting to delete all chats and disable memory, I noticed that some information still seemed to persist. That left me deeply concerned—not just about this moment, but about where things are headed.
The real question isn't just "what can AI do?"—it's "who is keeping the record of what it does?" And just as importantly: "who watches the watcher?" If the answer is "no one," then maybe we shouldn't have a watcher at all.
Healthcare organizations that can't (easily) send data over the wire while remaining in compliance
Organizations operating in high stakes environments
Organizations with restrictive IT policies
To name just a few -- well, the first two are special cases of the last one
RE your hallucination concerns: the issue is overly broad ambitions. Local LLMs are not general purpose -- if what you want is local ChatGPT, you will have a bad time. You should have a highly focused use case, like "classify this free text as A or B" or "clean this up to conform to this standard": this is the sweet spot for a local model
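To make the "highly focused use case" concrete, the classify-this-as-A-or-B pattern is just a tightly constrained prompt plus a forgiving parse. A minimal sketch below, assuming an Ollama-style server on localhost; the endpoint, labels, and prompt wording are placeholders, not a recommendation of a specific stack.

```python
# Sketch: a narrowly scoped local-LLM task -- binary classification of free text.
# Assumes an Ollama-style local server; the model tag, labels, and prompt are
# placeholders to adapt to your own use case.
import requests

def classify(text: str) -> str:
    r = requests.post("http://localhost:11434/api/generate", json={
        "model": "gpt-oss:20b",
        "prompt": ("Classify the following support ticket as exactly one word, "
                   "either BILLING or TECHNICAL.\n\n" + text),
        "stream": False,
        "options": {"temperature": 0},
    }, timeout=120)
    answer = r.json()["response"].strip().upper()
    return "BILLING" if "BILLING" in answer else "TECHNICAL"

print(classify("I was charged twice for my subscription last month."))
```

Keeping the output to one or two known tokens is most of what makes small local models reliable at this.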
If you're building any kind of product/service that uses AI/LLMs, the answer is the same as why any company would want to run any other kind of OSS infra/service instead of relying on some closed proprietary vendor API.
Why not turn the question around? All other things being equal, who would prefer a rate-limited and/or for-pay service if they could obtain at least comparable quality locally for free, with no limitations, no privacy concerns, no censorship (beyond that baked into the weights you choose to use), and no net access required?
It's a pretty bad deal. So it must be that all other things aren't equal, and I suppose the big one is hardware. But neural net based systems always have a point of sharply diminishing returns, which we seem to have unambiguously hit with LLMs already, while the price of hardware is constantly decreasing and its quality increasing. So as we go further into the future, the practicality of running locally will only increase.
> I’m still trying to understand what is the biggest group of people that uses local AI (or will)?
Well, the model makers and device manufacturers of course!
While the Apples, Samsungs, and Googles of the world are unlikely to use OSS models locally (maybe Samsung?), they all have really big incentives to run models locally for a variety of reasons.
Latency, privacy (Apple), cost to run these models on behalf of consumers, etc.
This is why Google started shipping 16GB as the _lowest_ amount of RAM you can get on your Pixel 9. That was a clear flag that they're going to be running more and more models locally on your device.
As mentioned, while it seems unlikely that US-based model makers or device manufacturers will use OSS models, they'll certainly be targeting local models heavily on consumer devices in the near future.
Apple's framework of local-first, escalating to ChatGPT if the query is complex, will be the dominant pattern IMO.
I’m highly interested in local models for privacy reasons. In particular, I want to give an LLM access to my years of personal notes and emails, and answer questions with references to those. As a researcher, there’s lots of unpublished stuff in there that I sometimes either forget or struggle to find again due to searching for the wrong keywords, and a local LLM could help with that.
I pay for ChatGPT and use it frequently, but I wouldn’t trust uploading all that data to them even if they let me. I’ve so far been playing around with Ollama for local use.
~80% of the basic questions I ask of LLMs[0] work just fine locally, and I’m happy to ask twice for the other 20% of queries for the sake of keeping those queries completely private.
[0] Think queries I’d previously have had to put through a search engine and check multiple results for a one word/sentence answer.
"Because you can and its cool" would be reason enough: plenty of revolutions have their origin in "because you can" (Wozniak right off the top of my head, Gates and Altair, stuff like that).
But uncensored is a big deal too: censorship is capability-reducing (check out Kilcher's GPT-4chan video and references, or the Orca work and the Dolphin de-tune's lift on SWE-Bench-style evals). We pay dearly in capability to get "non-operator alignment", and you'll notice that competition is hot enough now that at the frontier (Opus, Qwen) the alignment away from operator-aligned is getting very, very mild.
And then there's the compression. Phi-3 or something runs on a beefy laptop and carries a nontrivial approximation of "the internet" that works on an airplane or a beach with no network connectivity. Talk about vibe coding - I like those look-up-all-the-docs-via-a-thumbdrive-in-Phuket vibes.
And on diffusion stuff, SOTA fits on a laptop or close; you can crush OG Midjourney or SD on a MacBook, and it's an even smaller gap.
Early-GPT-4-ish outcomes are possible on a MacBook Pro or Razer Blade, so either 12-18-month-old LLMs are useless, or GGUF is useful.
The AI goalposts thing cuts both ways. If AI is "whatever only Anthropic can do"? That's just as silly as "whatever a computer can't do", and a lot more cynical.
I'm excited to do just dumb and irresponsible things with a local model, like "iterate through every single email in my 20-year-old gmail account and apply label X if Y applies" and not have a surprise bill.
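The offline version of that is pleasantly boring: take a Google Takeout .mbox export and loop over it with a local model. A rough sketch, assuming a local OpenAI-compatible server and making up the rule and filename; actually applying the label back in Gmail would be a separate API step.

```python
# Sketch: walk a Gmail Takeout .mbox export and flag messages with a local model.
# The server URL, model tag, rule, and filename are assumptions; writing labels
# back to Gmail (via its API) is not shown.
import mailbox
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

def matches_rule(subject: str, body: str) -> bool:
    resp = client.chat.completions.create(
        model="gpt-oss:20b",
        temperature=0.0,
        max_tokens=3,
        messages=[{
            "role": "user",
            "content": ("Answer YES or NO: is this email a travel booking?\n\n"
                        f"Subject: {subject}\n\n{body[:2000]}"),
        }],
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")

to_label = []
for msg in mailbox.mbox("takeout/All mail Including Spam and Trash.mbox"):
    payload = msg.get_payload(decode=True) if not msg.is_multipart() else b""
    body = (payload or b"").decode("utf-8", "ignore")
    if matches_rule(msg.get("Subject", ""), body):
        to_label.append(msg.get("Message-ID"))

print(f"{len(to_label)} messages would get the label")
```

Slow, sure, but there's no bill and no rate limit, which is the whole point.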
People like myself who firmly believe there will come a time, possibly very soon, when all these companies (OpenAI, Anthropic, etc.) raise their prices substantially. By then no one will be able to do their work to the standard expected of them without AI, and by then maybe they charge $1k per month, maybe $10k. If there is no viable alternative, the sky is the limit.
Why do you think they continue to run at a loss? From the goodness of their heart? Their biggest goal is to discourage anyone from running local models. The hardware is expensive, and the way to run models is very difficult (for example, I have dual RTX 3090s for VRAM, and running large, heavily quantized models is a real pain in the arse: no high-quantisation library supports two GPUs, for example, and there seems to be no interest in implementing it from the people behind the best inference tools).
So this is welcome, but let's not forget why it is being done.
A laptop from the past few years without a discrete GPU can run, at practical speeds depending on the task, a Gemma/Llama model if it's (IME) under 4GB.
For practical RAG processes of narrow scope, with even a minimal amount of scaffolding, that's a very usable speed for automating tasks, especially as the last-mile/edge-device portion of a more complex process with better models in use upstream. Classification tasks, reasonably intelligent decisions between traditional workflow processes, other use cases - all of them extremely valuable in enterprise, being built and deployed right now.
One of my favorite use cases includes simple tasks like generating effective mock/masked data from real data. Then passing the mock data worry-free to the big three (or wherever.)
There’s also a huge opportunity space for serving clients with very sensitive data. Health, legal, and government come to mind immediately. These local models are only going to get more capable of handling their use cases. They already are, really.
Data that can't leave the premises because it is too sensitive. There is a lot of security theater around cloud pretending to be compliant but if you actually care about security a locked server room is the way to do it.
I can provide a real-world example: Low-latency code completion.
The JetBrains suite includes a few LLM models on the order of a hundred megabytes. These models are able to provide "obvious" line completion, like filling in variable names, as well as some basic predictions, like realising that the `if let` statement I'm typing out is going to look something like `if let Some(response) = client_i_just_created.foobar().await`.
If that was running in The Cloud, it would have latency issues, rate limits, and it wouldn't work offline. Sure, there's a pretty big gap between these local IDE LLMs and what OpenAI is offering here, but if my single-line autocomplete could be a little smarter, I sure wouldn't complain.
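The shape of that kind of request, if you wanted to prototype it against a local server, is a completion call with a tiny token budget and a newline stop. A sketch assuming a llama.cpp-style server exposing the OpenAI-compatible completions endpoint; the model name is a placeholder.

```python
# Sketch: latency-sensitive single-line completion against a local server.
# Assumes a llama.cpp/llama-server style endpoint; the model name is a placeholder.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")

prefix = "if let Some(response) = client"
resp = client.completions.create(
    model="local-code-model",   # placeholder; single-model servers mostly ignore this field
    prompt=prefix,
    max_tokens=24,
    temperature=0.2,
    stop=["\n"],                # one line only, which keeps latency low
)
print(prefix + resp.choices[0].text)
```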
> I’m still trying to understand what is the biggest group of people that uses local AI (or will)?
Creatives? I am surprised no one's mentioned this yet:
I tried to help a couple of friends with better copy for their websites, and quickly realized that they were using inventive phrases to explain their work, phrases that they would not want competitors to get wind of and benefit from; phrases that associate closely with their personal brand.
Ultimately, I felt uncomfortable presenting the cloud AIs with their text. Sometimes I feel this way even with my own Substack posts, where I occasionally coin a phrase I am proud of. But with local AI? Cool...
I do it because 1) I am fascinated that I can and 2) at some point the online models will be enshitified — and I can then permanently fall back on my last good local version.
In some large, lucrative industries like aerospace, many of the hosted models are off the table due to regulations such as ITAR. There's a market for models that run on-prem or in GovCloud, with a professional support contract for installation and updates.
I'm in a corporate environment. There's a study group to see if maybe we can potentially get some value out of those AI tools. They've been "studying" the issue for over a year now. They expect to get some cloud service that we can safely use Real Soon Now.
So, it'll take at least two more quarters before I can actually use those non-local tools on company related data. Probably longer, because sense of urgency is not this company's strong suit.
Anyway, as a developer I can run a lot of things locally. Local AI doesn't leak data, so it's safe. It's not as good as the online tools, but for some things they're better than nothing.
If you have capable hardware and kids, a local LLM is great. A simple system prompt customisation (e.g. ‘all responses should be written as if talking to a 10 year old’) and knowing that everything is private goes a long way for me at least.
Local micro models are both fast and cheap. We tuned small models on our data set and if the small model thinks content is a certain way, we escalate to the LLM.
This gives us really good recall at really low cloud cost and latency.
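That cascade is simple to wire up. A sketch under assumed names and endpoints (a cheap local screen, escalate only the flagged minority to a cloud model); the screening question and model choices are placeholders, not the actual setup described above.

```python
# Sketch of a small-model-first cascade: a cheap local pass screens items, and
# only the ones it flags get escalated to a larger cloud model. Model names,
# endpoints, and the screening question are placeholders.
from openai import OpenAI

local = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")
cloud = OpenAI()  # uses OPENAI_API_KEY from the environment

def screen_locally(text: str) -> bool:
    resp = local.chat.completions.create(
        model="gpt-oss:20b",
        temperature=0.0,
        max_tokens=3,
        messages=[{"role": "user",
                   "content": "Answer YES or NO: could this post be a policy violation?\n\n" + text}],
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")

def review(text: str) -> str:
    if not screen_locally(text):
        return "ok"              # the vast majority stops here, at zero cloud cost
    resp = cloud.chat.completions.create(
        model="gpt-4.1",         # escalate only the suspicious minority
        messages=[{"role": "user", "content": "Review this post for policy violations:\n\n" + text}],
    )
    return resp.choices[0].message.content
```

Tuning the local screen for high recall (even at poor precision) is what keeps the cloud bill low without missing much.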
I would say, any company that doesn't have its own AI developed. You always hear about companies "mandating" AI usage, but for the most part it's companies developing their own solutions/agents. No self-respecting company with tight opsec would allow a random "always-online" LLM that could rip your codebase either piece by piece or all at once if it's an IDE addon (or at least I hope that's the case). So yeah, I'd say locally deployed LLMs/agents are a game changer.
Jailbreaking, then running censored questions. Like DIY fireworks, analysis of papers that touch "sensitive topics", NSFW image generation - the list is basically endless.
At the company where I currently work, for IP reasons (and with the advice of a patent lawyer), nobody is allowed to use any online AIs to talk about or help with work, unless it's very generic research that doesn't give away what we're working on.
That rules out coding assistants like Claude, chat, tools to generate presentations and copy-edit documents, and so forth.
But local AI are fine, as long as we're sure nothing is uploaded.
A small LLM can do RAG, call functions, summarize, create structured data from messy text, etc... You know, all the things you'd do if you were making an actual app with an LLM.
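The "structured data from messy text" case, for instance, is mostly prompt discipline plus validation. A sketch with a made-up schema, assuming a local OpenAI-compatible server; real apps usually add a retry loop or a server-side JSON/grammar mode if the runtime has one.

```python
# Sketch: extracting structured JSON from messy text with a small local model.
# The schema, model tag, and endpoint are placeholders.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

def extract_contact(text: str):
    resp = client.chat.completions.create(
        model="gpt-oss:20b",
        temperature=0.0,
        messages=[{
            "role": "user",
            "content": ('Extract {"name": str, "email": str, "company": str} from the text below. '
                        "Reply with JSON only, no prose.\n\n" + text),
        }],
    )
    raw = resp.choices[0].message.content
    start, end = raw.find("{"), raw.rfind("}")
    try:
        return json.loads(raw[start:end + 1])
    except ValueError:
        return None  # caller decides whether to retry or escalate

print(extract_contact("Hi, this is Ana Pérez from Acme GmbH, reach me at ana@acme.example"))
```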
Yeah, chat apps are pretty cheap and convenient for users who want to search the internet and write text or code. But APIs quickly get expensive when inputting a significant amount of tokens.
There's a bunch of great reasons in this thread, but how about the chip manufacturers that are going to need you to need a more powerful set of processors in your phone, headset, computer. You can count on those companies to subsidize some R&D and software development.
>Students who don’t want to pay but somehow have the hardware?
that's me - well not a student anymore.
when toying with something, i much prefer not paying for each shot. my 12GB Radeon card can either run a decent model extremely slowly, or an idiotic but fast one. it's nice not dealing with rate limits.
once you write a prompt that mangles an idiotic model into still doing the work, it's really satisfying. the same principle as working to extract the most from limited embedded hardware. masochism, possibly
Some app devs use local models on local environments with LLM APIs to get up and running fast, then when the app deploys it switches to the big online models via environment vars.
In large companies this can save quite a bit of money.
Privacy laws. Processing government paperwork with LLMs, for example. There are a lot of OCR tools that can't be used, and the ones that comply are more expensive than, say, GPT-4.1 and lower quality.
Anything involving the medical industry (HIPAA), national security (FedRAMP is such a PITA to get that some military contractors are bypassing it to get quicker access to cloud tools), etc.
Besides that, we are moving towards an era where we won't need to pay providers a subscription every month to use these models. I can't say for certain whether or not the GPUs that run them will get cheaper, but the option to run your own model is game changing for more than you can possibly imagine.
Agencies / firms that work with classified data. Some places have very strict policies on data, which makes it impossible to use any service that isn't local and air-gapped.
I’d use it on a plane for coding if there were no network, but otherwise it's just an emergency model for when the internet goes out - basically end-of-the-world scenarios.
AI is going to be equivalent to all computing in the future. Imagine if only IBM, Apple and Microsoft ever built computers, and all anyone else ever had in the 1990s were terminals to the mainframe, forever.
Maybe I am too pessimistic, but as an EU citizen I expect politics (or should I say Trump?) to prevent access to US-based frontier models at some point.
How up to date are you on current open weights models? After playing around with it for a few hours I find it to be nowhere near as good as Qwen3-30B-A3B. The world knowledge is severely lacking in particular.
dolphin3.0-llama3.1-8b Q4_K_S [4.69 GB on disk]: correct in <2 seconds
deepseek-r1-0528-qwen3-8b Q6_K [6.73 GB]: correct in 10 seconds
gpt-oss-20b MXFP4 [12.11 GB] low reasoning: wrong after 6 seconds
gpt-oss-20b MXFP4 [12.11 GB] high reasoning: wrong after 3 minutes !
Yea yea it's only one question of nonsense trivia. I'm sure it was billions well spent.
It's possible I'm using a poor temperature setting or something but since they weren't bothered enough to put it in the model card I'm not bothered to fuss with it.
The model is good and runs fine, but if you want to be blown away again try Qwen3-30B-A3B-2507. It's 6GB bigger but the response is comparable or better and much faster to run. gpt-oss-20b gives me 6 tok/sec while Qwen3 gives me 37 tok/sec. Qwen3 is not a reasoning model, though.
I just tested 120B from the Groq API on agentic stuff (multi-step function calling, similar to claude code) and it's not that good. Agentic fine-tuning seems key, hopefully someone drops one soon.
The environmentalist in me loves the fact that LLM progress has mostly been focused on doing more with the same hardware, rather than horizontal scaling. I guess given GPU shortages that makes sense, but it really does feel like the value of my hardware (a laptop in my case) is going up over time, not down.
Also, just wanted to credit you for being one of the five people on Earth who knows the correct spelling of "lede."
Training cost has increased a ton exactly because inference cost is the biggest problem: models are now trained on almost three orders of magnitude more data than what is compute-optimal (per the Chinchilla paper), because saving compute on inference makes it worthwhile to overtrain a smaller model - spending more training compute to achieve similar performance.
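Back-of-envelope, for anyone who hasn't seen the numbers: Chinchilla's compute-optimal point is roughly 20 training tokens per parameter, and modern small models blow way past that. The figures below are illustrative assumptions (a Llama-3-style 15T-token run), not published gpt-oss numbers.

```python
# Rough arithmetic on "overtraining" vs the Chinchilla-optimal ~20 tokens/parameter.
# All figures are illustrative assumptions, not published gpt-oss training numbers.
params = 20e9                       # hypothetical 20B-parameter model
chinchilla_tokens = 20 * params     # ~0.4T tokens would be compute-optimal
actual_tokens = 15e12               # assume a Llama-3-style ~15T-token run

print(f"compute-optimal: {chinchilla_tokens / 1e12:.1f}T tokens")
print(f"assumed run:     {actual_tokens / 1e12:.0f}T tokens "
      f"(~{actual_tokens / chinchilla_tokens:.0f}x over-trained)")
# For sub-1B models trained on the same corpus the ratio runs into the hundreds
# or thousands, which is where the "orders of magnitude" framing comes from.
```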
Interesting. I understand that, but I don't know to what degree.
I mean the training, while expensive, is done once. The inference … besides being done by perhaps millions of clients, is done for, well, the life of the model anyway. Surely that adds up.
It's hard to know, but I assume the user taking up the burden of the inference is perhaps doing so more efficiently? I mean, when I run a local model, it is plodding along — not as quick as the online model. So, slow and therefore I assume necessarily more power efficient.
I've run Qwen3 4B on my phone; it's not the best, but it's better than old GPT-3.5. It also has a reasoning mode, and in reasoning mode it's better than the original GPT-4 and the original GPT-4o, but not the latest GPT-4o. I get usable speed, but it's not really comparable to most cloud-hosted models.
I tried their live demo. It suggests three prompts, one of them being “How many R’s are in strawberry?” So I clicked that, and it answered there are three! I tried it thrice with the same result.
It suggested the prompt. It’s infamous because models often get it wrong, they know it, and still they confidently suggested it and got it wrong.
You can get a pretty good estimate depending on your memory bandwidth. Too many parameters can change with local models (quantization, fast attention, etc). But the new models are MoE so they’re gonna be pretty fast.
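The rule of thumb: for a memory-bound decoder, tokens/sec tops out around memory bandwidth divided by the bytes read per token, which is roughly active parameters times bytes per weight. A sketch with assumed numbers for an M3 Air and a 4-bit MoE:

```python
# Rough ceiling on decode speed for a memory-bound model:
#   tokens/sec <= bandwidth / (active_params * bytes_per_weight)
# ignoring KV-cache reads and runtime overhead. Numbers are assumptions.
bandwidth_gbs = 100        # ~M3 MacBook Air unified-memory bandwidth, GB/s
active_params = 3.6e9      # gpt-oss-20b active parameters per token
bits_per_weight = 4.25     # MXFP4-ish

bytes_per_token = active_params * bits_per_weight / 8
print(f"~{bandwidth_gbs * 1e9 / bytes_per_token:.0f} tok/s theoretical ceiling")
# ~52 tok/s here; real throughput lands somewhat below that.
```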
For me the game changer here is the speed. On my local Mac I'm finally getting token speeds faster than I can read the output (~96 tok/s), and the quality has been solid. I had previously tried some of the distilled Qwen and DeepSeek models and they were just way too slow for me to seriously use them.
In my mind, I’m comparing the model architecture they describe to what the leading open-weights models (Deepseek, Qwen, GLM, Kimi) have been doing. Honestly, it just seems “ok” at a technical level:
- both models use standard Grouped-Query Attention (64 query heads, 8 KV heads). The card talks about how they’ve used an older optimization from GPT3, which is alternating between banded window (sparse, 128 tokens) and fully dense attention patterns. It uses RoPE extended with YaRN (for a 131K context window). So they haven’t been taking advantage of the special-sauce Multi-head Latent Attention from Deepseek, or any of the other similar improvements over GQA.
- both models are standard MoE transformers. The 120B model (116.8B total, 5.1B active) uses 128 experts with top-4 routing. They’re using some kind of gated SwiGLU activation, which the card describes as "unconventional" because of its clamping and the residual connections that implies. Again, not using any of Deepseek’s “shared experts” (for general patterns) + “routed experts” (for specialization) architectural improvements, Qwen’s load-balancing strategies, etc.
- the most interesting thing IMO is probably their quantization solution. They did something to quantize >90% of the model parameters to the MXFP4 format (4.25 bits/parameter) to let the 120B model fit on a single 80GB GPU, which is pretty cool. But we’ve also got Unsloth with their famous 1.58-bit quants :)
All this to say, it seems like even though the training they did for their agentic behavior and reasoning is undoubtedly very good, they’re keeping their actual technical advancements “in their pocket”.
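For anyone who hasn't looked inside an MoE layer, the top-4-of-128 routing described above boils down to something like the toy sketch below (plain numpy, one token at a time, ReLU standing in for the gated SwiGLU). This is for intuition only, not OpenAI's implementation: no shared experts, no load balancing, no batching of tokens per expert.

```python
# Toy top-k MoE routing: a router scores 128 experts per token, the top 4 run,
# and their outputs are mixed with softmax weights. Illustrative only.
import numpy as np

d_model, n_experts, top_k = 64, 128, 4
rng = np.random.default_rng(0)

router_w = rng.standard_normal((d_model, n_experts)) * 0.02
experts = [(rng.standard_normal((d_model, 4 * d_model)) * 0.02,
            rng.standard_normal((4 * d_model, d_model)) * 0.02)
           for _ in range(n_experts)]

def moe_layer(x):                       # x: (d_model,) -- a single token
    logits = x @ router_w               # (n_experts,)
    chosen = np.argsort(logits)[-top_k:]
    weights = np.exp(logits[chosen] - logits[chosen].max())
    weights /= weights.sum()            # softmax over the selected experts only
    out = np.zeros_like(x)
    for w, idx in zip(weights, chosen):
        w_in, w_out = experts[idx]
        out += w * (np.maximum(x @ w_in, 0) @ w_out)   # ReLU MLP stand-in for SwiGLU
    return out

print(moe_layer(rng.standard_normal(d_model)).shape)    # (64,)
```

Only the 4 chosen experts' weights get touched per token, which is why a 120B-total model can decode like a ~5B one.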
I would guess the “secret sauce” here is distillation: pretraining on an extremely high quality synthetic dataset from the prompted output of their state of the art models like o3 rather than generic internet text. A number of research results have shown that highly curated technical problem solving data is unreasonably effective at boosting smaller models’ performance.
This would be much more efficient than relying purely on RL post-training on a small model; with low baseline capabilities the insights would be very sparse and the training very inefficient.
Or, you can say, OpenAI has some real technical advancements on stuff besides attn architecture. GQA8, alternating SWA 128 / full attn do all seem conventional. Basically they are showing us that "no secret sauce in model arch you guys just sucks at mid/post-training", or they want us to believe this.
Kimi K2 paper said that the model sparsity scales up with parameters pretty well (MoE sparsity scaling law, as they call, basically calling Llama 4 MoE "done wrong"). Hence K2 has 128:1 sparsity.
It's convenient to be able to attribute success to things only OpenAI could've done with the combo of their early start and VC money – licensing content, hiring subject matter experts, etc. Essentially the "soft" stuff that a mature organization can do.
I think their MXFP4 release is a bit of a gift since they obviously used and tuned this extensively as a result of cost-optimization at scale - something the open source model providers aren't doing too much, and also somewhat of a competitive advantage.
Unsloth's special quants are amazing, but I've found there to be lots of trade-offs vs full quantization, particularly when striving for the best first-shot attempts - which is by far the bulk of LLM use cases. Running a better (larger, newer) model at lower quantization to fit in memory, or with reduced accuracy/detail to speed it up, both have value, but in the pursuit of first-shot accuracy there don't seem to be many companies running their frontier models at reduced quantization. If OpenAI is doing this in production, that is interesting.
>They did something to quantize >90% of the model parameters to the MXFP4 format (4.25 bits/parameter) to let the 120B model to fit on a single 80GB GPU, which is pretty cool
They said it was native FP4, suggesting that they actually trained it like that; it's not post-training quantisation.
The native FP4 is one of the most interesting architectural aspects here IMO, as going below FP8 is known to come with accuracy tradeoffs. I'm curious how they navigated this and how FP8 weights (if they exist) would have performed.
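For intuition on what MXFP4 means: weights are grouped into blocks of 32, each block shares one power-of-two scale, and each element is a 4-bit FP4 (E2M1) value whose representable magnitudes are only {0, 0.5, 1, 1.5, 2, 3, 4, 6} - hence the 4.25 bits/parameter (4 bits plus an 8-bit scale per 32 weights). A toy round-trip below; it's my illustration of the format, not OpenAI's training recipe, and real kernels pack the bits rather than materializing floats.

```python
# Toy MXFP4-style quantization: blocks of 32 weights share a power-of-two scale,
# and each element is snapped to the nearest FP4 (E2M1) value. Illustrative only.
import numpy as np

FP4_GRID = np.array([0, 0.5, 1, 1.5, 2, 3, 4, 6])   # E2M1 magnitudes
BLOCK = 32

def mxfp4_roundtrip(w):
    w = w.reshape(-1, BLOCK)
    block_max = np.maximum(np.abs(w).max(axis=1, keepdims=True), 1e-12)
    scale = 2.0 ** np.ceil(np.log2(block_max / FP4_GRID[-1]))   # power-of-two scale per block
    scaled = w / scale
    idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
    return (np.sign(scaled) * FP4_GRID[idx] * scale).reshape(-1)

w = np.random.default_rng(0).standard_normal(1024).astype(np.float32)
print(f"mean abs round-trip error: {np.abs(w - mxfp4_roundtrip(w)).mean():.3f}")
```

Training natively in this format, rather than quantizing afterwards, is presumably where the interesting engineering lives.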
I don't know how to ask this without being direct and dumb: Where do I get a layman's introduction to LLMs that could work me up to understanding every term and concept you just discussed? Either specific videos, or if nothing else, a reliable Youtube channel?
What I’ve sometimes done when trying to make sense of recent LLM research is give the paper and related documents to ChatGPT, Claude, or Gemini and ask them to explain the specific terms I don’t understand. If I don’t understand their explanations or want to know more, I ask follow-ups. Doing this in voice mode works better for me than text chat does.
When I just want a full summary without necessarily understanding all the details, I have an audio overview made on NotebookLM and listen to the podcast while I’m exercising or cleaning. I did that a few days ago with the recent Anthropic paper on persona vectors, and it worked great.
There is a great 3blue1brown video, but it’s pretty much impossible by now to cover the entire landscape of research. I bet gpt-oss has some great explanations though ;)
Try Microsoft's "Generative AI for Beginners" repo on GitHub. The early chapters in particular give a good grounding of LLM architecture without too many assumptions of background knowledge. The video version of the series is good too.
Also: attention sinks (although implemented as extra trained logits used in attention softmax rather than attending to e.g. a prepended special token).
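The "extra trained logits" version is easy to picture: append a learned scalar to each head's attention logits before the softmax, then discard its probability mass, so the head has somewhere to dump attention without attending to any real token. A toy sketch of my reading of the idea, not the actual gpt-oss code:

```python
# Toy attention sink as an extra learned logit: it joins the softmax
# normalization but contributes no value vector.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(q, k, v, sink_logit):
    # q: (d,), k/v: (seq, d), sink_logit: a learned scalar (per head in practice)
    logits = k @ q / np.sqrt(q.shape[0])               # (seq,)
    logits = np.concatenate([logits, [sink_logit]])    # append the sink
    probs = softmax(logits)[:-1]                       # drop the sink's share
    return probs @ v                                   # weights now sum to <= 1

rng = np.random.default_rng(0)
q, k, v = rng.standard_normal(16), rng.standard_normal((8, 16)), rng.standard_normal((8, 16))
print(attend(q, k, v, sink_logit=2.0).shape)           # (16,)
```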
Open models are going to win long-term. Anthropic's own research has to use OSS models [0]. China is demonstrating how quickly companies can iterate on open models, allowing smaller teams access and augmentation to the abilities of a model without paying the training cost.
My personal prediction is that the US foundational model makers will OSS something close to N-1 for the next 1-3 iterations. The CAPEX for the foundational model creation is too high to justify OSS for the current generation. Unless the US Gov steps up and starts subsidizing power, or Stargate does 10x what is planned right now.
N-1 model value depreciates insanely fast. Making an OSS release of them and allowing specialized use cases and novel developments allows potential value to be captured and integrated into future model designs. It's medium risk, as you may lose market share. But also high potential value, as the shared discoveries could substantially increase the velocity of next-gen development.
There will be a plethora of small OSS models. Iteration on the OSS releases is going to be biased towards local development, creating more capable and specialized models that work on smaller and smaller devices. In an agentic future, every different agent in a domain may have its own model. Distilled and customized for its use case without significant cost.
Everyone is racing to AGI/SGI. The models along the way are to capture market share and use data for training and evaluations. Once someone hits AGI/SGI, the consumer market is nice to have, but the real value is in novel developments in science, engineering, and every other aspect of the world.
I'm pretty sure there's no reason Anthropic has to do research on open models; it's just that they produced their result on open models so that you can reproduce it without having access to theirs.
[2 of 3] Assuming we pin down what win means... (which is definitely not easy)... What would it take for this to not be true? There are many ways, including but not limited to:
- publishing open weights helps your competitors catch up
- publishing open weights doesn't improve your own research agenda
- publishing open weights leads to a race dynamic where only the latest and greatest matters; leading to a situation where the resources sunk exceed the gains
- publishing open weights distracts your organization from attaining a sustainable business model / funding stream
- publishing open weights leads to significant negative downstream impacts (there are a variety of uncertain outcomes, such as: deepfakes, security breaches, bioweapon development, unaligned general intelligence, humans losing control [1] [2], and so on)
I don't think there will be such a unique event. There is no clear boundary. This is a continuous process. Models get slightly better than before.
Also, another dimension is the inference cost to run those models. It has to be cheap enough to really take advantage of it.
Also, I wonder, what would be a good target to make profit, to develop new things? There is Isomorphic Labs, which seems like a good target. This company already exists now, and people are working on it. What else?
> I don't think there will be such a unique event.
I guess it depends on your definition of AGI, but if it means human level intelligence then the unique event will be the AI having the ability to act on its own without a "prompt".
I'm a layman, but it seems to me that the industry is going towards robust foundational models onto which we plug tools, databases, and processes to expand their capabilities.
In this setup OSS models could be more than enough and capture the market, but I don't see where the value would be in a multitude of specialized models we have to train.
I have this theory that we simply got over a hump by utilizing a massive processing boost from gpus as opposed to CPUs. That might have been two to three orders of magnitude more processing power.
But that's a one-time success. I don't think hardware has any large-scale improvements coming, because 3D gaming already plumbed most of that vector-processing hardware development over the last 30 years.
So will software and better training models produce another couple orders of magnitude?
Fundamentally we're talking about nines of accuracy. What is the processing power required for each nine of accuracy? Is it linear? Is it polynomial? Is it exponential?
It just seems strange to me that, with all the AI knowledge sloshing through academia, I haven't seen any basic analysis at that level, which is something that's absolutely going to be necessary for AI applications like self-driving once you get the insurance companies involved.
This implies LLM development isn’t plateaued. Sure, the researchers are busting their asses quantizing, adding features like tool calls and structured outputs, etc. But soon enough N-1 ~= N.
To me it depends on 2 factors: hardware becoming more accessible, and the closed-source offerings becoming more expensive. Right now 1) it's difficult to get enough GPUs to do local inference at production scale, and 2) it's more expensive to run your own GPUs vs closed-source models.
[3 of 3] What would it take for this statement to be false or missing the point?
Maybe we find ourselves in a future where:
- Yes, open models are widely used as base models, but they are also highly customized in various ways (perhaps by industry, person, attitude, or something else). In other words, this would be a blend of open and closed.
- Maybe publishing open weights of a model is more-or-less irrelevant, because it is "table stakes" ... because all the key differentiating advantages have to do with other factors, such as infrastructure, non-LLM computational aspects, regulatory environment, affordable energy, customer base, customer trust, and probably more.
- The future might involve thousands or millions of highly tailored models
Running a model comparable to o3 on a 24GB Mac Mini is absolutely wild. Seems like yesterday the idea of running frontier (at the time) models locally or on a mobile device was 5+ years out. At this rate, we'll be running such models in the next phone cycle.
It only seems like that if you haven't been following other open source efforts. Models like Qwen perform ridiculously well and do so on very restricted hardware. I'm looking forward to seeing benchmarks to see how these new open source models compare.
Right? I still remember the safety outrage of releasing Llama. Now? My 96 GB of (V)RAM MacBook will be running a 120B parameter frontier lab model. So excited to get my hands on the MLX quants and see how it feels compared to GLM-4.5-air.
I feel like most of the safety concerns ended up being proven correct, but there's so much money in it that they decided to push on anyway full steam ahead.
AI did get used for fake news, propaganda, mass surveillance, erosion of trust and sense of truth, and mass spamming social media.
in that era, OpenAI and Anthropic were still deluding themselves into thinking they would be the "stewards" of generative AI, and the last US administration was very keen on regoolating everything under the sun, so "safety" was just an angle for regulatory capture.
Qwen3 Coder is 4x its size! Grok 3 is over 22x its size!
What does the resource usage look like for GLM 4.5 Air? Is that benchmark in FP16? GPT-OSS-120B will be using between 1/4 and 1/2 the VRAM that GLM-4.5 Air does, right?
It seems like a good showing to me, even though Qwen3 Coder and GLM 4.5 Air might be preferable for some use cases.
When people talk about running a (quantized) medium-sized model on a Mac Mini, what types of latency and throughput times are they talking about? Do they mean like 5 tokens per second or at an actually usable speed?
Using LM-Studio as the frontend and Playwright-powered MCP tools for browser access. I've had success with one such MCP: https://github.com/instavm/coderunner It has a tool called navigate_and_get_all_visible_text, for example.
Inference in Python uses harmony [1] (for request and response format) which is written in Rust with Python bindings. Another OpenAI's Rust library is tiktoken [2], used for all tokenization and detokenization. OpenAI Codex [3] is also written in Rust. It looks like OpenAI is increasingly adopting Rust (at least for inference).
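If you just want to poke at the tokenizer side from Python, tiktoken is pip-installable. Whether your tiktoken version registers a gpt-oss-specific harmony encoding I'm not sure, so the sketch below uses the o200k_base vocabulary as an assumption.

```python
# Quick look at tokenization via tiktoken (Rust core, Python bindings).
# "o200k_base" exists in current tiktoken; a gpt-oss/harmony-specific encoding
# name may or may not be available depending on your version.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")
ids = enc.encode("gpt-oss runs on my laptop")
print(len(ids), ids)
print(enc.decode(ids))
```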
Even without an imminent release it's a good strategy. They're getting pressure from Qwen and other high-performing open-weight models. Without a horse in the race they could fall behind in an entire segment.
There's future opportunity in licensing, tech support, agents, or even simply dominating and eliminating. Not to mention brand awareness: if you like these, you might be more likely to approach their brand for larger models.
Is this the stealth models Horizon Alpha and Beta? I was generally impressed with them (although I really only used them in chats rather than any code tasks). In terms of chat, I increasingly see very little difference between the current SOTA closed models and their open-weight counterparts.
> I can't think of any reason they'd release this unless they were about to announce something which totally eclipses it
Given it's only around 5 billion active params it shouldn't be a competitor to o3 or any of the other SOTA models, given the top Deepseek and Qwen models have around 30 billion active params. Unless OpenAI somehow found a way to make a model with 5 billion active params perform as well as one with 4-8 times more.
Seeing a 20B model competing with o3's performance is mind-blowing - just a year ago, most of us would've called this impossible. Not just the intelligence leap, but getting this level of capability in such a compact size.
I think the point that makes me most excited is that we can train trillion-parameter giants and distill them down to just billions without losing the magic. Imagine coding with Claude 4 Opus-level intelligence packed into a 10B model running locally at 2000 tokens/sec - instant AI collaboration. That would fundamentally change how we develop software.
It's not even a 20b model. It's 20b MoE with 3.6b active params.
But it does not actually compete with o3 performance. Not even close. As usual, the metrics are bullshit. You don't know how good the model actually is until you grill it yourself.
From experience, it's much more engineering work on the integrator's side than on OpenAI's. Basically they provide you their new model in advance, but they don't know the specifics of your system, so it's normal that you do most of the work.
Thus I'm particularly impressed by Cerebras: they only have a few models supported for their extreme perf inference, it must have been huge bespoke work to integrate.
We are not even at that extreme and you can already see the unequal reality that too much SaaS has engendered
I think it can make LLMs fun.
I'm sure there are other use cases, but much like "what is BitTorrent for?", the obvious use case is obvious.
1. App makers can fine-tune smaller models and include them in their apps to avoid server costs
2. Privacy-sensitive content can be either filtered out or worked on... I'm using local LLMs to process my health history for example
3. Edge servers can be running these fine tuned for a given task. Flash/lite models by the big guys are effectively like these smaller models already.
For example, "generate a heatmap of each token/word and how 'unexpected' they are" or "find me a prompt that creates the closest match to this text"
To be efficient both require access that is not exposed over API.
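Locally the first one is straightforward, because you have the logits: score each token by its negative log-probability under the model. A sketch using transformers and a small model picked arbitrarily as an example; swap in whatever fits your hardware.

```python
# Sketch: per-token "unexpectedness" (negative log-probability) with a local
# causal LM via transformers. The model name is only an example.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen2.5-0.5B"          # example; any local causal LM works
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)
model.eval()

text = "The quick brown fox jumps over the lazy banana"
ids = tok(text, return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(ids).logits                            # (1, seq, vocab)

logprobs = torch.log_softmax(logits[0, :-1], dim=-1)      # predict token i+1 from prefix
targets = ids[0, 1:]
surprisal = -logprobs[torch.arange(targets.shape[0]), targets]
for t, s in zip(tok.convert_ids_to_tokens(targets.tolist()), surprisal):
    print(f"{t:>12}  {s.item():5.2f}")                    # higher = more unexpected
```

Turn those numbers into colors and you have the heatmap.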
How about running one on this site but making it publicly available? A sort of outranet, and calling it HackerBrain?
Even if they did offer a defined latency product, you’re relying on a lot of infrastructure between your application and their GPU.
That’s not always tolerable.
iPhone users in a few months – because I predict app developers will love cramming calls to the foundation models into everything.
Android will follow.
In large companies this can save quite a bit of money.
Why not run all the models at home, maybe collaboratively or at least in parallel?
I'm sure there are use cases where the paid models are not allowed to collaborate or ask each other.
also, other open models are gaining mindshare.
example: military intel
That means running instantly offline and every token is free
Privacy is obvious.
Answer on Wikipedia: https://en.wikipedia.org/wiki/Battle_of_Midway#U.S._code-bre...
Small models are going to be particularly poor when used outside of their intended purpose. They have to omit something.
Not in the UK it isn’t.
These are the simplified results (total percentage of correctly classified E-mails on both spam and ham testing data):
gpt-oss:20b 95.6%
gemma3:27b-it-qat 94.3%
mistral-small3.2:24b-instruct-2506-q4_K_M 93.7%
mistral-small3.2:24b-instruct-2506-q8_0 92.5%
qwen3:32b-q4_K_M 89.2%
qwen3:30b-a3b-q4_K_M 87.9%
gemma3n:e4b-it-q4_K_M 84.9%
deepseek-r1:8b 75.2%
qwen3:30b-a3b-instruct-2507-q4_K_M 73.0%
I'm quite happy, because it's also smaller and faster than gemma3.
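A harness for numbers like these doesn't need to be fancy - roughly the loop below, asking for a one-word verdict per message and scoring against labeled folders. Paths, prompt wording, and the truncation limit are placeholders, not the exact setup behind the table above.

```python
# Rough shape of a spam/ham benchmark over a local model server: one-word verdict
# per message, accuracy over labeled folders. Paths and prompt are placeholders.
from pathlib import Path
import requests

def verdict(model: str, body: str) -> str:
    r = requests.post("http://localhost:11434/api/generate", json={
        "model": model,
        "prompt": "Reply with exactly one word, SPAM or HAM, for this email:\n\n" + body[:4000],
        "stream": False,
        "options": {"temperature": 0},
    }, timeout=300)
    return "SPAM" if "SPAM" in r.json()["response"].upper() else "HAM"

def accuracy(model: str) -> float:
    total = correct = 0
    for label in ("SPAM", "HAM"):
        for path in Path(f"testdata/{label.lower()}").glob("*.txt"):
            total += 1
            correct += verdict(model, path.read_text(errors="ignore")) == label
    return correct / total

for m in ("gpt-oss:20b", "gemma3:27b-it-qat"):
    print(m, f"{accuracy(m):.1%}")
```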
https://huggingface.co/spaces/TIGER-Lab/MMLU-Pro
Are you discounting all of the self-reported scores?
I don't understand why "TIGER-Lab"-sourced scores are 'unknown' in terms of model size?
When you imagine a lake being drained to cool a datacenter do you ever consider where the water used for cooling goes? Do you imagine it disappears?
https://i.imgur.com/DgAvbee.png
This is a thinking model, so I ran it against o4-mini, here are the results:
* gpt-oss:20b
* Time-to-first-token: 2.49 seconds
* Time-to-completion: 51.47 seconds
* Tokens-per-second: 2.19
* o4-mini on ChatGPT
* Time-to-first-token: 2.50 seconds
* Time-to-completion: 5.84 seconds
* Tokens-per-second: 19.34
Time to first token was similar, but the thinking piece was _much_ faster on o4-mini. Thinking took the majority of the 51 seconds for gpt-oss:20b.
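If anyone wants to collect numbers like these themselves, here's a rough timing harness I'd use. It assumes an Ollama server on the default port and treats streamed chunks as a loose proxy for tokens; the endpoint and field names are from my recollection of the Ollama REST API, so adjust if they differ.

    import json
    import time
    import requests

    def benchmark(model, prompt):
        t0 = time.time()
        first_chunk_at = None
        chunks = 0
        resp = requests.post(
            "http://localhost:11434/api/chat",
            json={"model": model,
                  "messages": [{"role": "user", "content": prompt}]},
            stream=True,  # server streams newline-delimited JSON chunks
        )
        for line in resp.iter_lines():
            if not line:
                continue
            part = json.loads(line)
            if part.get("message", {}).get("content"):
                if first_chunk_at is None:
                    first_chunk_at = time.time()
                chunks += 1  # each chunk is roughly one token
        total = time.time() - t0
        print(f"{model}: {first_chunk_at - t0:.2f}s to first chunk, "
              f"{total:.2f}s total, ~{chunks / total:.1f} chunks/s")

    benchmark("gpt-oss:20b", "Explain the Monty Hall problem in two sentences.")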
In my mind, I’m comparing the model architecture they describe to what the leading open-weights models (Deepseek, Qwen, GLM, Kimi) have been doing. Honestly, it just seems “ok” at a technical level:
- both models use standard Grouped-Query Attention (64 query heads, 8 KV heads). The card talks about how they’ve used an older optimization from GPT3, which is alternating between banded window (sparse, 128 tokens) and fully dense attention patterns. It uses RoPE extended with YaRN (for a 131K context window). So they haven’t been taking advantage of the special-sauce Multi-head Latent Attention from Deepseek, or any of the other similar improvements over GQA.
- both models are standard MoE transformers. The 120B model (116.8B total, 5.1B active) uses 128 experts with Top-4 routing. They're using some kind of Gated SwiGLU activation, which the card talks about as being "unconventional" because of the clamping and whatever residual connections that implies. Again, not using any of Deepseek's "shared experts" (for general patterns) + "routed experts" (for specialization) architectural improvements, Qwen's load-balancing strategies, etc. (a minimal top-k routing sketch follows at the end of this comment)
- the most interesting thing IMO is probably their quantization solution. They quantized >90% of the model parameters to the MXFP4 format (4.25 bits/parameter), which lets the 120B model fit on a single 80GB GPU, which is pretty cool. But we've also got Unsloth with their famous 1.58bit quants :)
All this to say, it seems like even though the training they did for their agentic behavior and reasoning is undoubtedly very good, they’re keeping their actual technical advancements “in their pocket”.
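For readers who haven't seen top-k expert routing before, here's a minimal PyTorch sketch of the idea. The dimensions, the plain linear router, and the softmax over the selected experts are my own simplifications for illustration; this is not the gpt-oss implementation (which reportedly uses 128 experts with Top-4 routing and gated SwiGLU expert MLPs).

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TopKMoE(nn.Module):
        """Toy top-k routed MoE layer (illustration only)."""
        def __init__(self, d_model=64, d_ff=128, n_experts=8, top_k=4):
            super().__init__()
            self.top_k = top_k
            self.router = nn.Linear(d_model, n_experts, bias=False)
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(),
                              nn.Linear(d_ff, d_model))
                for _ in range(n_experts)
            )

        def forward(self, x):                        # x: (tokens, d_model)
            logits = self.router(x)                  # (tokens, n_experts)
            weights, idx = logits.topk(self.top_k, dim=-1)
            weights = F.softmax(weights, dim=-1)     # renormalize over chosen experts
            out = torch.zeros_like(x)
            for slot in range(self.top_k):
                for e in idx[:, slot].unique().tolist():  # run each selected expert once
                    mask = idx[:, slot] == e
                    out[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[e](x[mask])
            return out

    x = torch.randn(10, 64)
    print(TopKMoE()(x).shape)   # torch.Size([10, 64])

Only top_k experts run per token, which is the whole point: the active parameter count (and therefore compute per token) stays far below the total parameter count.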
This would be much more efficient than relying purely on RL post-training on a small model; with low baseline capabilities the insights would be very sparse and the training very inefficient.
same seems to be true for humans
The model is pretty sparse tho, 32:1.
Unsloth's special quants are amazing, but I've found there to be lots of trade-offs vs full quantization, particularly when striving for the best first-shot attempts, which is by far the bulk of LLM use cases. Running a better (larger, newer) model at lower quantization to fit in memory, or with reduced accuracy/detail to speed it up, both have value, but in the pursuit of first-shot accuracy there don't seem to be many companies running their frontier models at reduced quantization. If OpenAI is doing this in production, that is interesting.
They said it was native FP4, suggesting that they actually trained it like that; it's not post-training quantisation.
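To make the 4.25 bits/parameter figure above concrete, here's a back-of-the-envelope sketch of MXFP4-style block quantization as I understand the format: blocks of 32 values share one 8-bit scale, and each value is stored as a 4-bit E2M1 float, so (32*4 + 8)/32 = 4.25 bits per parameter. The nearest-value rounding and the plain float scale below are simplifications for illustration; the real format uses a power-of-two (E8M0) scale.

    import numpy as np

    E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # FP4 magnitudes

    def fake_quantize_block(block):
        # Each 32-value block shares one scale (simplified to a plain float here).
        scale = max(np.abs(block).max() / E2M1[-1], 1e-12)
        scaled = block / scale
        nearest = np.abs(np.abs(scaled)[:, None] - E2M1[None, :]).argmin(axis=1)
        return np.sign(scaled) * E2M1[nearest] * scale   # dequantized values

    rng = np.random.default_rng(0)
    w = rng.standard_normal(32).astype(np.float32)
    w_q = fake_quantize_block(w)
    print("max abs error:", np.abs(w - w_q).max())
    print("effective bits/param:", (32 * 4 + 8) / 32)    # 4.25

Training natively in that format means the weights never had a higher-precision "original" to be degraded from, which is a different proposition than quantizing an FP16 checkpoint after the fact.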
When I just want a full summary without necessarily understanding all the details, I have an audio overview made on NotebookLM and listen to the podcast while I’m exercising or cleaning. I did that a few days ago with the recent Anthropic paper on persona vectors, and it worked great.
https://www.manning.com/books/build-a-large-language-model-f...
My personal prediction is that the US foundational model makers will OSS something close to N-1 for the next 1-3 iterations. The CAPEX for foundational model creation is too high to justify OSS for the current generation, unless the US Gov steps up and starts subsidizing power, or Stargate does 10x what is currently planned.
N-1 model value depreciates insanely fast. Making an OSS release of them and allowing specialized use cases and novel developments allows potential value to be captured and integrated into future model designs. It's medium risk, as you may lose market share. But also high potential value, as the shared discoveries could substantially increase the velocity of next-gen development.
There will be a plethora of small OSS models. Iteration on the OSS releases is going to be biased towards local development, creating more capable and specialized models that work on smaller and smaller devices. In an agentic future, every different agent in a domain may have its own model. Distilled and customized for its use case without significant cost.
Everyone is racing to AGI/SGI. The models along the way are to capture market share and use data for training and evaluations. Once someone hits AGI/SGI, the consumer market is nice to have, but the real value is in novel developments in science, engineering, and every other aspect of the world.
[0] https://www.anthropic.com/research/persona-vectors
> We demonstrate these applications on two open-source models, Qwen 2.5-7B-Instruct and Llama-3.1-8B-Instruct.
[2 of 3] Assuming we pin down what win means... (which is definitely not easy)... What would it take for this to not be true? There are many ways, including but not limited to:
- publishing open weights helps your competitors catch up
- publishing open weights doesn't improve your own research agenda
- publishing open weights leads to a race dynamic where only the latest and greatest matters; leading to a situation where the resources sunk exceed the gains
- publishing open weights distracts your organization from attaining a sustainable business model / funding stream
- publishing open weights leads to significant negative downstream impacts (there are a variety of uncertain outcomes, such as: deepfakes, security breaches, bioweapon development, unaligned general intelligence, humans losing control [1] [2], and so on)
[1]: "What failure looks like" by Paul Christiano : https://www.alignmentforum.org/posts/HBxe6wdjxK239zajf/what-...
[2]: "An AGI race is a suicide race." - quote from Max Tegmark; article at https://futureoflife.org/statement/agi-manhattan-project-max...
I don't think there will be such a unique event. There is no clear boundary. This is a continuous process. Models get slightly better than before.
Also, another dimension is the inference cost to run those models. It has to be cheap enough to really take advantage of it.
Also, I wonder, what would be a good target to make profit, to develop new things? There is Isomorphic Labs, which seems like a good target. This company already exists now, and people are working on it. What else?
I guess it depends on your definition of AGI, but if it means human level intelligence then the unique event will be the AI having the ability to act on its own without a "prompt".
In this setup OSS models could be more than enough to capture the market, but I don't see where the value would be in a multitude of specialized models we'd have to train.
I have this theory that we simply got over a hump by utilizing a massive processing boost from gpus as opposed to CPUs. That might have been two to three orders of magnitude more processing power.
But that's a one-time success. I don't think hardware has any large-scale improvements coming, because 3D gaming already plumbed most of that vector-processing hardware development over the last 30 years.
So will software and better training models produce another couple orders of magnitude?
Fundamentally we're talking about nines of accuracy. What is the processing power required for each additional nine of accuracy? Is it linear? Is it polynomial? Is it exponential?
It just seems strange to me, with all the AI knowledge sloshing through academia, that I haven't seen any basic analysis at that level, which is something that's absolutely going to be necessary for AI applications like self-driving once you get the insurance companies involved.
[1 of 3] For the sake of argument here, I'll grant the premise. If this turns out to be true, it glosses over other key questions, including:
For a frontier lab, what is a rational period of time (according to your organizational mission / charter / shareholder motivations*) to wait before:
1. releasing a new version of an open-weight model; and
2. how much secret sauce do you hold back?
* Take your pick. These don't align perfectly with each other, much less the interests of a nation or world.
This implies LLM development isn't plateaued. Sure, the researchers are busting their asses quantizing, adding features like tool calls and structured outputs, etc. But soon enough N-1 ~= N.
[3 of 3] What would it take for this statement to be false or missing the point?
Maybe we find ourselves in a future where:
- Yes, open models are widely used as base models, but they are also highly customized in various ways (perhaps by industry, person, attitude, or something else). In other words, this would be a blend of open and closed.
- Maybe publishing open weights of a model is more-or-less irrelevant, because it is "table stakes" ... because all the key differentiating advantages have to do with other factors, such as infrastructure, non-LLM computational aspects, regulatory environment, affordable energy, customer base, customer trust, and probably more.
- The future might involve thousands or millions of highly tailored models
Kind of a P=NP, but for software deliverability.
AI did get used for fake news, propaganda, mass surveillance, erosion of trust and sense of truth, and mass spamming social media.
God bless China.
The 120B model is worse at coding than Qwen3 Coder, GLM 4.5 Air, and even Grok 3... (https://www.reddit.com/r/LocalLLaMA/comments/1mig58x/gptoss1...)
What does the resource usage look like for GLM 4.5 Air? Is that benchmark in FP16? GPT-OSS-120B will be using between 1/4 and 1/2 the VRAM that GLM-4.5 Air does, right?
It seems like a good showing to me, even though Qwen3 Coder and GLM 4.5 Air might be preferable for some use cases.
12.63 tok/sec • 860 tokens • 1.52s to first token
I'm amazed it works at all with such limited RAM
and the 120b: https://asciinema.org/a/B0q8tBl7IcgUorZsphQbbZsMM
I am, um, floored
Here's a demo of this functionality: https://www.youtube.com/watch?v=9mLrGcuDifo
[1] https://github.com/openai/harmony
[2] https://github.com/openai/tiktoken
[3] https://github.com/openai/codex
From a strategic perspective, I can't think of any reason they'd release this unless they were about to announce something which totally eclipses it?
There's future opportunity in licensing, tech support, agents, or even simply to dominate and eliminate. Not to mention brand awareness: if you like these, you might be more likely to approach their brand for larger models.
https://manifold.markets/Bayesian/on-what-day-will-gpt5-be-r...
The question is how much better the new model(s) will need to be on the metrics given here to feel comfortable making these available.
Despite the loss of face for the lack of open model releases, I do not think that was a big enough problem to undercut commercial offerings.
Given it's only around 5 billion active params it shouldn't be a competitor to o3 or any of the other SOTA models, given the top Deepseek and Qwen models have around 30 billion active params. Unless OpenAI somehow found a way to make a model with 5 billion active params perform as well as one with 4-8 times more.
I think that the point that makes me more excited is that we can train trillion-parameter giants and distill them down to just billions without losing the magic. Imagine coding with Claude 4 Opus-level intelligence packed into a 10B model running locally at 2000 tokens/sec - like instant AI collaboration. That would fundamentally change how we develop software.
But it does not actually compete with o3 performance. Not even close. As usual, the metrics are bullshit. You don't know how good the model actually is until you grill it yourself.
Kudos to that team.
https://ollama.com/blog/gpt-oss
https://www.reddit.com/r/LocalLLaMA/comments/1meeyee/ollamas...
All the real heavy lifting is done by llama.cpp, and for the distribution, by HuggingFace.