> Two years ago when I first tried LLaMA I never dreamed that the same laptop I was using then would one day be able to run models with capabilities as strong as what I’m seeing from GLM 4.5 Air—and Mistral 3.2 Small, and Gemma 3, and Qwen 3, and a host of other high quality models that have emerged over the past six months.
Yes, the open models have surpassed my expectations in both quality and speed of release. For a bit of context, when ChatGPT launched in December 2022, the "best" open models were GPT-J (6B) and GPT-NeoX (20B). I actually had an app running live, with users, on GPT-J for about a month. It was a pain. The quality was abysmal, there was no instruction following (you had to start your prompt like a story, or come up with a bunch of examples and hope the model would follow along), and so on.
And then something happened: the LLaMA models got "leaked" (I still think it was an on-purpose leak: "don't sue us, we never meant to release it", etc.), and the rest is history. With LLaMA 1 we got lots of optimisations like quantised models and fine-tuning; LLaMA 2 is when fine-tuning really took off (most of the fine-tunes were better than what Meta released); we got Alpaca (and Alpaca-LoRA) showing off LoRA; and then a bunch of really strong models came out (Mistrals, Mixtrals, LLaMA 3, Gemmas, Qwens, DeepSeeks, GLMs, Granites, etc.).
By some estimations the open models are ~6 months behind what the SotA labs have released. (Note that this doesn't mean the labs are releasing their best models; they likely keep those in-house for data curation on the next runs, synthetic datasets, distillation, etc.) Being 6 months behind is NUTS! I never in my wildest dreams believed we'd be here. In fact I thought it would take ~2 years just to reach GPT-3.5 level. It's really something insane that we get to play with these models "locally", fine-tune them, and so on.
I'll bite. How do I train/make and/or use a LoRA, or, separately, how do I fine-tune? I've been asking this for months, and no one has a decent answer. Web search on my end is all SEO/GEO spam, with no real instructions.
I know how to make an SD LoRA, and use it. I've known how to do that for 2 years. So what's the big secret about LLM LoRA?
We have been fine-tuning models using Axolotl and Unsloth, with a slight preference for Axolotl. Check out the docs [0] and fine-tune or quantize your first model. There is a lot to be learned in this space, and it's exciting.

[0]: https://axolotl.ai/ and https://docs.axolotl.ai/
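To make the "how do I actually train an LLM LoRA" question concrete, here's a minimal sketch of the kind of run Axolotl or Unsloth wraps for you, written directly against Hugging Face's peft and trl libraries. The model, dataset, and hyperparameters below are illustrative placeholders, not a recommendation.

```python
# Minimal LoRA fine-tuning sketch using Hugging Face peft + trl.
# Model name, dataset, and hyperparameters are illustrative placeholders.
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Any instruction dataset works; here we take a small slice of a public one
# and flatten it into a single "text" field.
dataset = load_dataset("yahma/alpaca-cleaned", split="train[:1000]")
dataset = dataset.map(
    lambda ex: {"text": f"### Instruction:\n{ex['instruction']}\n\n### Response:\n{ex['output']}"}
)

peft_config = LoraConfig(
    r=16,                 # rank of the low-rank adapter matrices
    lora_alpha=32,        # scaling factor applied to the adapter output
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attach to attention projections
    task_type="CAUSAL_LM",
)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",   # any small causal LM you can fit
    train_dataset=dataset,
    peft_config=peft_config,
    args=SFTConfig(output_dir="lora-out", num_train_epochs=1, per_device_train_batch_size=2),
)
trainer.train()
trainer.save_model("lora-out")  # writes only the small adapter weights, not the base model
```

The saved adapter is tens of megabytes and gets loaded on top of the base model at inference time, so it's the same trick as an SD LoRA, just applied to a transformer LM's attention (and sometimes MLP) projections.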
I'm not sure if it contains exactly what you're looking for, but it includes several resources and notebooks related to fine-tuning LLMs (including LoRA) that I found useful. There are Colab notebook tutorials around training models with it as well.

So here's the original: https://web.archive.org/web/20231127123701/https://brev.dev/...
No clear use case for LLMs yet. ("Spicy" aka pornography finetunes are the only ones with broad adoption, but we don't talk about that in polite society here.)
For what? It's really hard to say whether one model is "generally" better than another, as they're all better or worse at specific things.
My own benchmark has a bunch of different tasks I use various local models for, and I run it when I want to see if a new model is better than the ones I currently use. The output is basically a markdown table with a description of which model is best for which task.
They're being sold as general-purpose things that are uniformly better or worse than each other, but reality doesn't reflect this: they all have very specific tasks they're better or worse at, and the only way to find that out is by having a private benchmark you run yourself.
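For anyone who wants to copy the private-benchmark idea, here's a minimal sketch. It assumes a local OpenAI-compatible endpoint (for example Ollama or a llama.cpp server on localhost) and hand-written pass/fail checks; the model names, tasks, and checks below are placeholders you'd swap for your own.

```python
# Tiny personal benchmark: run your own tasks against local models and print a
# markdown table of results. Endpoint, model names, tasks, and checks are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")  # e.g. Ollama's OpenAI API

MODELS = ["qwen3:30b", "gemma3:27b"]  # whatever you have pulled locally
TASKS = {
    "sql": ("Write a SQL query returning the top 5 customers by revenue.",
            lambda out: "group by" in out.lower()),
    "regex": ("Write a Python regex matching ISO 8601 dates (YYYY-MM-DD).",
              lambda out: "\\d{4}" in out),
}

results = {}
for model in MODELS:
    for name, (prompt, check) in TASKS.items():
        resp = client.chat.completions.create(
            model=model, messages=[{"role": "user", "content": prompt}]
        )
        results[(model, name)] = "pass" if check(resp.choices[0].message.content) else "fail"

# Emit a markdown table: one row per model, one column per task.
print("| model | " + " | ".join(TASKS) + " |")
print("|---|" + "---|" * len(TASKS))
for model in MODELS:
    print(f"| {model} | " + " | ".join(results[(model, t)] for t in TASKS) + " |")
```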
I haven't tried them (released yesterday, I think?). The benchmarks look good (similar, I'd say) but that's not saying much these days. The best test you can do is to have a couple of cases that match your needs and run them yourself with the cradle you are using (Aider, Cline, Roo, any of the CLI tools, etc.). OpenRouter usually has them up soon after launch, and you can run a quick test really cheaply (and only deal with one provider for billing and such).
> I still think it’s noteworthy that a model running on my 2.5 year old laptop (a 64GB MacBook Pro M2) is able to produce code like this—especially code that worked first time with no further edits needed.
I believe we are vastly underestimating what our existing hardware is capable of in this space. I worry that narratives like the bitter lesson and the efficient compute frontier are pushing a lot of brilliant minds away from investigating revolutionary approaches.
It is obvious that the current models are deeply inefficient when you consider how much you can decimate the precision of the weights post-training and still have pelicans on bicycles, etc.
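To make the "decimate the precision" point concrete, here's a toy round-to-nearest 4-bit quantization of a random weight matrix, the naive version of what the Q4-style quants people run locally do with far more care. The shapes and group size are arbitrary.

```python
# Toy post-training quantization: squash weights to 4-bit codes per group and
# measure how much of the signal survives. Shapes and group size are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4096, 4096)).astype(np.float32)  # stand-in for one weight matrix

def quantize_4bit(w, group=64):
    shape = w.shape
    g = w.reshape(-1, group)
    scale = np.abs(g).max(axis=1, keepdims=True) / 7.0   # per-group scale onto ~[-7, 7]
    scale = np.maximum(scale, 1e-12)                     # avoid dividing by zero
    q = np.clip(np.round(g / scale), -8, 7)              # 4-bit signed codes
    return (q * scale).reshape(shape)                    # dequantized approximation

W_hat = quantize_4bit(W)
rel_err = np.linalg.norm(W - W_hat) / np.linalg.norm(W)
print(f"relative Frobenius error at 4 bits: {rel_err:.3f}")  # ~0.1 despite 8x fewer bits than fp32
```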
I think GP means that if you internalize the bitter lesson (more data more compute wins), you stop imagining how to squeeze SOTA minus 1 performance out of constrained compute environments.
Very well put. There's a lot to be gained from using smaller models and existing hardware. So many enterprise PMs skip straight to using a cutting edge LLM via API. There are many tasks where a self-hosted LLM or even a finetuned small language model can either complete a preliminary step or even handle the full task for much less money. And if a self-hosted model can do the job today, imagine what you'll be able to do in a year or five when you have more powerful hardware and even better models.
Did you understand the implementation or just that it produced a result?
I would hope an LLM could spit out a cobbled form of answer to a common interview question.
Today a colleague presented data changes and used an LLM to build a display app for the JSON for presentation. Why did they not just pipe the JSON into our already working app that displays this data?
People around me for the most part are using LLMs to enhance their presentations, not to actually implement anything useful. I have been watching my coworkers use it that way for months.
Another example? A different coworker wanted to build a document macro to perform bulk updates on courseware content, swapping old words for new words. To build the macro they first wrote a rubric inside a Word doc to prompt an LLM correctly.
That filled-in rubric was then used to generate a program template for the macro. To define the requirements for the macro, the coworker then used a slideshow slide listing bullet points of functionality, in this case to find-and-replace words in courseware slides/documents using a list of words from another text document. Given the complexity of this setup, I can’t believe my colleague saved any time. The presentation was interesting, though, and that is what they got compliments on.
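For scale, the underlying task (bulk word replacement across courseware files driven by a mapping list) is a few lines of Python once you strip away the LLM scaffolding; the file paths and the "old=new" mapping format here are assumptions, and real slide decks would need a .pptx/.docx reader instead of plain text.

```python
# Bulk find-and-replace across text-based courseware files, driven by a
# mapping file of "old=new" lines. Paths and file format are assumptions.
from pathlib import Path

mapping = dict(
    line.strip().split("=", 1)
    for line in Path("word_map.txt").read_text().splitlines()
    if "=" in line
)

for doc in Path("courseware").rglob("*.txt"):
    text = doc.read_text()
    for old, new in mapping.items():
        text = text.replace(old, new)
    doc.write_text(text)
```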
However the solutions are absolutely useless for anyone else but the implementer.
I scanned the code and understood what it was doing, but I didn't spend much time on it once I'd seen that it worked.
If I'm writing code for production systems using LLMs I still review every single line - my personal rule is that I need to be able to explain how it works to someone else before I'm willing to commit it.

I wrote a whole lot more about my approach to using LLMs to help write "real" code here: https://simonwillison.net/2025/Mar/11/using-llms-for-code/
This is why I love using the DeepSeek chain-of-reasoning output ... I can actually go through and read what it's 'thinking' to validate whether it's basing its solution on valid facts/assumptions. Either way, thanks for all of your valuable write-ups on these models - I really appreciate them, Simon!
I think this is the right way to do it. Produce with the LLM, debug, and read every line. Delete lots of it.
Many people fear this approach for production, but it is reasonable compared to someone with a single Coursera course writing production JS code.
Yet we tend to say "the LLM wrote this and that", which implies the model did all the work. In reality it should be understood as a complex heavy-lifting machine that is expected to be operated by a very well prepared operator.
The fact that I got a Kango and drilled some holes does not make me an engineer, right? And it takes an engineer to sign off on the building, even though it was ArchiCAD doing the math.
> However the solutions are absolutely useless for anyone else but the implementer.
Disposable code is where AI shines.
AI generating the boilerplate code for an obtuse build system? Yes, please. AI generating an animation? Ganbatte. (Look at how much work 3Blue1Brown had to put into that--if AI can help that kind of thing, it has my blessings). AI enabling someone who doesn't program to generate some prototype that they can then point at an actual programmer? Excellent.
This is fine because you don't need to understand the result. You have a concrete pass/fail gate and don't care about what's underneath. This is real value. The problem is that it isn't gigabuck value.
The stuff that would be gigabuck value is unfortunately where AI falls down. Fix this bug in a product. Add this feature to an existing codebase. etc.
AI is also a problem because disposable code is what you would assign to junior programmers in order for them to learn.
> AI is also a problem because disposable code is what you would assign to junior programmers in order for them to learn.
It's also giving PHBs the ability to hand ill-conceived ideas to a magic robot, receive "code" they can't understand, and throw it into production. All the while firing what real developers they had on staff.
The real test is if you can have it tweak things. Have the ship shoot down. Have the space invaders come from the left and right. Add two player simultaneous mode with two ships.
It can usually tweak things, if given specific instruction, but it doesn't know when to refactor (and can't reliably preserve functionality when it does), so the program gets further and further away from something sensible until it can't make edits any more.
Models don't emit something they don't know. They remix and rewrite what they know. There's no invention, just recall...

I have this feeling with LLM-generated React frontends; they all look the same.

Most human-generated methods look the same, too. In fact, in SWE we reward people for producing code that looks and feels the same; we call it "working as a team".
This comment is ~3 years late. Every model since gpt3 has had the entirety of available code in their training data. That's not a gotcha anymore.
We went from ChatGPT's "oh look, it looks like Python code but everything is wrong" to "here's a full-stack boilerplate app that does what you asked and works zero-shot" inside two years. That's the kicker. And the sauce isn't just in the training set: models now do post-training, RL, and a bunch of other stuff to get to where we are. Not to mention the insane abilities with extended context (the first models were 2/4k max), agentic stuff, and so on.
These kinds of comments are really missing the point.
I've had little success with agentic coding, and what success I have had has come paired with hours of frustration, where I'd have been better off doing it myself for anything but the most basic tasks.
Even then, once you start to build up complexity within a codebase, the results have often been worse than "I'll just generate it all from scratch again and include this as an addition to the initial long-tail specification prompt", and even then... it's been a crapshoot.
I _want_ to like it. The times where it initially "just worked" felt magical and inspired me with the possibilities. That's what prompted me to get more engaged and use it more. The reality of doing so is just frustrating and wishing things _actually worked_ anywhere close to expectations.
Not exactly. The real utility value of an LLM for programming is to come up with something new. For Space Invaders, instead of using an LLM, I might as well just manually search for the code online and use that.
To show that LLMs actually can provide value for one-shot programming, you need to find a problem for which there's no fully working sample code available online. I'm not trying to say that LLMs couldn't do that. But just because an LLM can come up with a perfectly working Space Invaders doesn't mean that it can do that.
You are reading too much into my comment. My point was that the test used to assess the model (a Space Invaders clone) has been irrelevant for some time now. I could have gotten a similar result with Mistral Small a few months ago.
It's amazing that none of you even try to falsify your claims anymore. You can literally just put some of the code into a search engine and find the prior-art example:

https://www.web-leb.com/en/code/2108

Your "AI tools" are just "copyright whitewashing machines."

These kinds of comments are really ignoring reality.
I don't think they are missing the point, because they're pointing out that the tools are still the most useful for patterns that are extremely widely known and repeated. I use Gemini 2.5 Pro every day for coding, and even that one still falls over on tasks that aren't well known to it (which is why I break the problem down into small parts that I know it'll be able to handle properly).
It's kind of funny, because sometimes these tools are magical and incredible, and sometimes they are extremely stupid in obvious ways.
Yes, these are impressive, and especially so for local models that you can run yourself, but there is a gap between "absolutely magical" and "pretty cool, but needs heavy guiding" depending on how heavily the ground you're treading has been walked upon.
For a heavily explored space, it's like being impressed that your 2.5-year-old M2 with 64 GB of RAM can extract some source code from a zip file. It's worth being impressed and excited about the space and the pace of improvement, but it's also worth stepping back and thinking rationally about the specific benchmark at hand.
Showing off moderately complicated results that are actually not indicative of performance, because they are sniped by the training data, turns this from a cool demo into a parlor trick.
Stating that, aha, joke's on you, that's the status quo, is an even bigger indictment.
> These kinds of comments are really missing the point.
I disagree. In my experience, asking coding tools to produce something similar to all of the tutorials and example code out there works amazingly well.
Asking them to produce novel output that doesn’t match the training set produces very different results.
When I tried multiple coding agents for a somewhat unique task recently they all struggled, continuously trying to pull the solution back to the standard examples. It felt like an endless loop of the models grinding through a solution and then spitting out something that matched common examples, after which I had to remind them of the unique properties of the task and they started all over again, eventually arriving back in the same spot.
It shows the reality of working with LLMs and it’s an important consideration.
I was not able to just download an 8-16GB file that would then generate A LOT of different tools, games, etc. for me in multiple programming languages, while in parallel ELI5-ing research papers, generating SVGs, and a lot, lot, lot more.

But hey.
I see the value in showcasing that LLMs can run locally on laptops — it’s an important milestone, especially given how difficult that was before smaller models became viable.
That said, for something like this, I’d probably get more out of simply finding an existing implementation on github or the like and downloading that.
When it comes to specialized and narrow domains like Space Invaders, the training set is likely to be extremely small and the model's vector space will have limited room to generalize. You'll get code that is more or less identical to the original source and you also have to wait for it to 'type' the code and the value add seems very low. I would rather ask it to point me to known Space Invaders implementations in language X on github (or search there).
Note that ChatGPT gets very nervous if I put this into GPT to clean up the grammar. It wants very badly for me to stress that LLMs don't memorize and overfitting is very unlikely (I believe neither).
Interesting, I cannot reproduce these warnings in ChatGPT - though this is something that really interests me, as it represents immense political power to be able to interject such warnings (explicitly, or implicitly via slight reformulations).
This raises an interesting question I’ve seen occasionally addressed in science fiction before:
Could today’s consumer hardware run a future superintelligence (or, as a weaker hypothesis, at least contain some lower-level agent that can bootstrap something on other hardware via networking or hyperpersuasion) if the binary dropped out of a wormhole?
This is the premise of all of the ML research I've been into. The only difference is to replace the wormhole with linear genetic programming, neuroevolution, et al. The size of programs in the demoscene is what originally sent me down this path.
The biggest question I keep asking myself - What is the Kolmogorov complexity of a binary image that provides the exact same capabilities as the current generation LLMs? What are the chances this could run on the machine under my desk right now?
I know how many AAA frames per second my machine is capable of rendering. I refuse to believe the gap between running CS2 at 400fps and getting ~100b/s of UTF8 text out of a NLP black box is this big.
> ~100b/s of UTF8 text out of a NLP black box is this big
That's not a good measure. NP problem solutions are only a single bit, but they are much harder to solve than CS2 frames for large N. If it could solve any problem perfectly, I would pay you billions for just 1b/s of UTF8 text.
This is what I find fascinating. What hidden capabilities exist, and how far could it be exploited? Especially on exotic or novel hardware.
I think much of our progress is limited by the capacity of the human brain, and we mostly proceed via abstraction which allows people to focus on narrow slices. That abstraction has a cost, sometimes a high one, and it’s interesting to think about what the full potential could be without those limitations.
Abstraction, or efficient modeling of a given system, is probably a feature, not a bug, given the strong similarity between intelligence and compression and all that.
A concise description of the right abstractions for our universe is probably not too far removed from the weights of a superintelligence, modulo a few transformations :)

Though I suppose, given a few years, that may also be true!
There's a new model from Qwen today - Qwen3-30B-A3B-Instruct-2507 - that also runs comfortably on my Mac (using about 30GB of RAM with an 8bit quantization).
I tried the "Write an HTML and JavaScript page implementing space invaders" prompt against it and didn't quite get a working game with a single shot, but it was still an interesting result: https://simonwillison.net/2025/Jul/29/qwen3-30b-a3b-instruct...
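If anyone wants to try that kind of run locally, this is roughly what it looks like with mlx-lm on Apple silicon. The exact mlx-community repo name for an 8-bit conversion is an assumption here, so check what's actually published before running it.

```python
# Rough sketch of prompting an 8-bit quantized Qwen3-30B-A3B locally via mlx-lm.
# The Hugging Face repo name below is an assumption; look up the real 8-bit conversion.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Qwen3-30B-A3B-Instruct-2507-8bit")

messages = [{"role": "user", "content": "Write an HTML and JavaScript page implementing space invaders"}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

print(generate(model, tokenizer, prompt=prompt, max_tokens=4096, verbose=True))
```

At 8 bits the weights alone want roughly 30GB of RAM, which matches the figure above for a 64GB machine.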
I was talking about the new open models with a group of people yesterday, and saying how good they're getting. The big question is:
Can any company now compete with the big players? Or even more interesting, like you showed in your research, are proprietary models becoming less relevant now that anyone can run these models locally?
This trend of better open models that run locally is really picking up. Do you think we'll get to a point where we won't need to buy AI tokens anymore?
The problem is cost. A machine that can run a decent local model costs thousands of dollars to buy and won't produce results as good as a model that runs on $30,000+ dedicated servers. Meanwhile you can rent access to LLMs running on those expensive machines for fractions of a cent (because you are sharing them with thousands of other users).
I don't think cost will be a reason to use local models for a very long time, if ever.