Something that may be interesting for readers of this thread: this project only became possible once I started telling Opus to keep a file with all the implementation notes, accumulating everything we discovered during the development process. The file also contained clear instructions that it had to be kept updated, and re-read ASAP after context compaction. This is what enabled Opus to do such a big coding task in a reasonable amount of time without losing track. Check the file IMPLEMENTATION_NOTES.md in the GitHub repo for more info.
People can say what they want about LLMs reducing intelligence/ability, but the trend has clearly been that people are getting more organized, documenting things better, enforcing constraints, and thinking in higher-level patterns. And there's renewed interest in formal verification.
LLMs will force the skilled, employable engineer to chase both maintainability and productivity from the start, in order to maintain a competitive edge with these tools. At least until robots replace us completely.
The thing is that currently most of these projects are done just by engineers. It's easy to stay organized when the project lasts a couple of weeks and stays under five engineers. The issues start when the software lives longer and you add in modern agile practices: it becomes a complete mess, with each PM trying to add random features on top of the existing code. As more and more code piles up, maintainability will just become impossible.
There are benchmarks in the README. Python is ~10x faster. It’s heavily optimized. Based on the numbers and my experience with Flux.1, I’m guessing the Python run is JIT’d (or Flux.2 is faster), although it’d likely only be ~half as fast if it weren’t (i.e. definitely not 10x slower).
I've also found it's helpful to have it keep an "experiment log" at the bottom of the original spec, or in another document, which it must update whenever things take "a surprising turn"
Honest question: what do you do when your spec has grown to over a megabyte?
Some things I've been doing:
- Move as much actual data into YML as possible.
- Use CEL?
- Ask Claude to rewrite pseudocode in specs into RFC-style constrained language?
How do you sync your spec and code in both directions? I have some slash commands that do this, but I'm not thrilled with them.
I tend to have to use Gemini for actually juggling the whole spec. Of course it's nice and chunked as much as it can be, but still. There's gonna need to be a whole new way of doing this.
If programming languages can have spooky language at a distance, wait until we get into "but paragraph 7, subsection 5 of section G clearly defines asshole as..."
What does a structured language look like when it doesn't need mechanical sympathy?
YML + CEL is really powerful and underexplored but it's still just ... not what I'm actually wanting.
We found, especially with Opus and recent Claude Code, that it is better/more precise at reading existing code to figure out the current status than at reading specs. It seems (for us) it is less precise at 'comprehending' the spec's English than the code, and that sometimes shows up as wrong assumptions for new tasks, which results in incorrect implementations of those tasks. So we dropped this. Because of caching, it doesn't seem too bad on tokens either.
I'm still sharing this post in the internal org trainings I run for those new to LLMs. Thanks for it - really great overview of the concept!
I saw in your other comment you've made accommodations for the newer generation, and I will confess that in Cursor (with plan mode) I've found an abbreviated form works just as well as the extremely explicit example found in the post.
If you ever do a follow-up, I imagine it'd be just as well received!
This is amazing. Is there any way you could share the log of prompts you used and other things aside from the implementation notes to reach such a result? Would love to learn from your experience and steps.
Thank you
Do you plan on writing about the other lessons you learned, which you mentioned in the README? As a big fan of your software and writing for many years, I would deeply appreciate your perspective using these tools!
There are multiple task-management solutions for Claude and other LLMs that let them define tasks, add implementation notes, and (crucially) add sub-tasks and dependencies. I'm using Beads (https://github.com/steveyegge/beads) and I think it really improves the outcome, especially for larger projects.
Yes, Opus could check the image to see if it matched the prompt, but I advised the model to stop and ask the human for a better check and for a description of what could be causing a corrupted image. Still, the fact that it could catch obvious regressions was good.
antirez: how do you reliably get Claude to re-read the file after compaction? It's easy to let Claude run for a while, it compacts and starts getting much worse afterwards, and I don't always catch the moment of compaction so that I can tell it to re-read the notes file.
This development work-cycle pattern lends itself nicely to Antigravity, which does about 80% of this out of the box and can be nudged to do the rest with a little prompting.
I ran a similar experiment last month and ported Qwen 3 Omni to llama.cpp. I was able to get GGUF conversion, quantization, and all input and output modalities working in less than a week. I submitted the work as a PR to the codebase and, understandably, it was rejected.
The refusal on the grounds that AI often writes suboptimal GGML kernels looks very odd to me. It means that the people who usually write GGML kernels by hand could very easily steer the model into writing excellent kernels, and a document could even be compiled for the agents with instructions on how to do great work. If they continue this way, a llama.cpp fork will soon emerge that is developed much faster and potentially even better: it is unavoidable.
The refusal is probably because OP said "100% written by AI" and didn't indicate an interest in actually reviewing or maintaining the code. In fact, a later PR comment suggests that the AI's approach was needlessly complicated.
Thanks for sharing this — I appreciate your motivation in the README.
One suggestion, which I have been trying to do myself, is to include a PROMPTS.md file. Since your purpose is sharing and educating, it helps others see what approaches an experienced developer is using, even if you are just figuring it out.
One can use a Claude hook to maintain this deterministically. I instruct in AGENTS.md that they can read but not write it. It’s also been helpful for jumping between LLMs, to give them some background on what you’ve been doing.
In this case, instead of a prompt I wrote a specification, but later I had to steer the models for hours. So basically the prompt is the sum of all such interactions: incredibly hard to reconstruct into something meaningful.
This steering is the main "source code" of the program that you wrote, isn't it? Why throw it away? It's like deleting the .c once you have obtained the .exe.
Isn't the "steering" in the form of prompts? You note "Even if the code was generated using AI, my help in steering towards the right design, implementation choices, and correctness has been vital during the development." You are a master of this, let others see how you cook, not just taste the sauce!
I only say this as it seems one of your motivations is education. I'm also noting it for others to consider. Much appreciation either way, thanks for sharing what you did.
Regarding the meta experiment of using LLMs to transpile to a different language, how did you feel about the outcome / process, and would you do the same process again in the future?
I've had some moments recently on my own projects, while working through some bottlenecks, where I took a whole section of a project, told Claude "rewrite in Rust", and got massive speedups from a zero-shot rewrite (most recently some video recovery programs). But I then had an output product I wouldn't feel comfortable vouching for outside of my homelab setup.
It depends on the situation. In this case the agent worked only from the reference code provided by Flux's Black Forest Labs, which is basically just the pipeline implemented as a showcase. The fundamental requirement for this process to work is that the agent has feedback to understand whether it is really making progress, and can debug failures against a reference implementation. But then all the code was implemented with many implementation hints about what I wanted to obtain, and without any reference to other minimal inference libraries or kernels. So I believe this is just the effect of putting together known facts about how Transformer inference works plus a higher-level idea of how the software should appear to the final user.

Btw, today somebody took my HNSW implementation for vector sets and translated it to Swift (https://github.com/jkrukowski/swift-hnsw). I'm ok with that, nor do I care whether this result was obtained with AI or not. However it is nice that the target license is the same, given the implementation is so similar to the C one.
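On the "debug failures against a reference implementation" point: a common way to give the agent that feedback is to dump intermediate tensors from the reference pipeline and diff them stage by stage. Below is a minimal sketch of such a check in C; the dump format and the function name are hypothetical, not taken from the repo.

```c
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

/* Compare a computed activation against a raw float32 dump produced by the
 * reference (Python) pipeline; returns the maximum absolute error. The dump
 * format (flat little-endian float32) is just an assumption for this sketch. */
float compare_with_reference(const float *computed, size_t n, const char *ref_path) {
    FILE *fp = fopen(ref_path, "rb");
    if (!fp) { perror(ref_path); exit(1); }

    float *ref = malloc(n * sizeof(float));
    if (fread(ref, sizeof(float), n, fp) != n) {
        fprintf(stderr, "short read on %s\n", ref_path);
        exit(1);
    }
    fclose(fp);

    float max_err = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float err = fabsf(computed[i] - ref[i]);
        if (err > max_err) max_err = err;
    }
    free(ref);
    return max_err;
}

/* After each stage, a line like:
 *   printf("text encoder max abs err: %g\n",
 *          compare_with_reference(hidden, n, "ref/text_encoder_out.bin"));
 * immediately shows which stage diverged from the reference. */
```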
When I first saw the OP, panic started to set in that I am fucked and Chat-Completions/LLMs/AI/whatever-you-wanna-call-it will soon be able to create anything and eat away at my earning potential. And I will spend my elder years living with roommates, with no wife or children because I will not be able to provide for them. But upon reading that you used a reference implementation, I've realized that you simply managed to leverage it as the universal translator apenwarr believes is the endgame for this new technology [1]. So, now I feel better. I can sleep soundly tonight knowing my livelihood is safe, because the details still matter.
This is pretty great. I’ve gone and hacked your GTE C inference project to Go purely for kicks, but this one I will look at for possible compiler optimizations and building a Mac CLI for scripting…
I have a set of prompts that are essentially “audit the current code changes for logic errors” (plus linting and testing, including double checking test conditions) and I run them using GPT-5.x-Codex on Claude generated code.
It’s surprising how much even Opus 4.5 still trips itself up with things like off-by-one or logic boundaries, so another model (preferably with a fresh session) can be a very effective peer reviewer.
So my checks are typically lint->test->other model->me, and relatively few things get to me in simple code. Contrived logic or maths, though, it needs to be all me.
Note that the original FLUX.2 [klein] model [1] and Python code were only released about 3 days ago (inexact without knowing the times and timezones involved). Discussed at [2].
Salvatore, how did you pick up the requisite background knowledge on this subject? IIRC this is your first OSS project in the ML domain, so I'm just curious if/how much Claude helped provide you with domain expertise while building this engine.
I also have a YouTube channel, mostly about AI (in Italian), where I regularly post videos and often read papers that I then explain on the channel. I have a long-time passion for AI: I wrote my first NN implementation in 2003 (used here, many years ago, as a showcase of Redis modules: https://github.com/antirez/neural-redis), and since then I never stopped implementing, for fun, small GPT models and things like that, using PyTorch or C.
Also, my work on Redis Vector Sets in the last year exposed me more to working with models (especially text embedding models of many kinds, but also other models).
So while Claude was fundamental to getting the implementation done fast, I had enough background to have an idea of what was happening in the different stages. I believe it is a very interesting question whether this kind of work can be done with a programming background and near-zero AI background. My feeling is that you need more time, more back and forth, and maybe you have to provide the agent with more examples, but eventually it will produce something that works.
Just because it is in C doesn't mean you will get C-like performance. Just look at the benchmarks: it is 8x slower than just using PyTorch... While I get that it's cool to use LLMs to generate code at this level, getting highly optimized, high-performance code is very much out of the domain of current frontier LLMs.
The PyTorch version is using the GPU (with Metal Performance Shaders); this C version is currently using (in the docs I saw) a single CPU core, with AMX (via Apple Accelerate BLAS) but not yet with OpenMP for parallelism. It’s not slow because LLM code is bad, but because it’s not running on the same hardware. That said, it’s also not as fast as it is because of the LLM—all the critical code is in kernel libraries it calls (the same as for PyTorch).
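To make the "all the critical code is in kernel libraries" point concrete, here is a generic sketch, not code from the repo, of a typical BLAS-backed layer: the sgemm call is where the AMX/Accelerate time goes, and the surrounding elementwise loop is the kind of code an OpenMP build would spread across cores.

```c
#include <math.h>
#ifdef __APPLE__
#include <Accelerate/Accelerate.h>  /* CBLAS interface backed by Accelerate (AMX) */
#else
#include <cblas.h>                  /* e.g. OpenBLAS on Linux */
#endif

/* y[M x N] = x[M x K] * w[K x N], followed by GELU: the shape of a typical
 * transformer linear layer. Almost all the time is spent inside cblas_sgemm;
 * the elementwise loop is what the pragma below parallelizes when the build
 * uses -fopenmp (it is silently ignored otherwise). */
void linear_gelu(const float *x, const float *w, float *y, int M, int N, int K) {
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                M, N, K, 1.0f, x, K, w, N, 0.0f, y, N);

    #pragma omp parallel for
    for (int i = 0; i < M * N; i++) {
        float v = y[i];
        /* tanh approximation of GELU */
        y[i] = 0.5f * v * (1.0f + tanhf(0.7978845608f * (v + 0.044715f * v * v * v)));
    }
}
```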
Absolutely true, but now I'll focus on making it fast, and I believe it will be possible to go much faster. I left the agent working overnight with a specification, and now I'm going to check the progress and restart the work.
No it's not. I have written CUDA kernels and 8-bit optimizers with this.
They're actually very good at speed optimization and can iterate very quickly, taking notes on trials, failures, and benchmarks. I've had it write 10 different attempts in around an hour, benchmark them all, then merge them and beat very strong baselines in torch.
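That workflow (many attempts, benchmark everything, keep notes) doesn't need much infrastructure. As a plain-C stand-in for the CUDA/torch setup described above, a harness along these lines is enough to race kernel variants against each other; the two toy kernels here are purely illustrative.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Two toy variants of the same elementwise kernel, standing in for the
 * real optimization attempts a model would generate. */
static void kernel_naive(const float *a, const float *b, float *out, int n) {
    for (int i = 0; i < n; i++) out[i] += a[i] * b[i];
}

static void kernel_unrolled(const float *a, const float *b, float *out, int n) {
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        out[i]     += a[i]     * b[i];
        out[i + 1] += a[i + 1] * b[i + 1];
        out[i + 2] += a[i + 2] * b[i + 2];
        out[i + 3] += a[i + 3] * b[i + 3];
    }
    for (; i < n; i++) out[i] += a[i] * b[i];
}

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

typedef void (*kernel_fn)(const float *, const float *, float *, int);

int main(void) {
    const int n = 1 << 22;
    float *a = calloc(n, sizeof(float));
    float *b = calloc(n, sizeof(float));
    float *out = calloc(n, sizeof(float));

    struct { const char *name; kernel_fn fn; } variants[] = {
        { "naive",    kernel_naive    },
        { "unrolled", kernel_unrolled },
    };

    for (size_t v = 0; v < sizeof(variants) / sizeof(variants[0]); v++) {
        variants[v].fn(a, b, out, n);                  /* warm-up, not timed */
        double t0 = now_sec();
        for (int rep = 0; rep < 10; rep++) variants[v].fn(a, b, out, n);
        printf("%-9s %.3f ms/iter\n", variants[v].name, (now_sec() - t0) * 100.0);
    }

    free(a); free(b); free(out);
    return 0;
}
```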
Did you run any benchmarking? I'm curious if Python's stack is faster or slower than a pure-C vibe-coded inference tool.
Yep, a constantly updated spec is the key. Wrote about this here:
https://lukebechtel.com/blog/vibe-speccing
This is amazing, Salvatore! Please spend some more time here and free us from the CUDA toolkit and Python.
https://github.com/ggml-org/llama.cpp/pull/18404
https://huggingface.co/TrevorJS/Qwen3-Omni-30B-A3B-GGUF
If the spec and/or tests are sufficiently detailed maybe you can step back and let it churn until it satisfies the spec.
[1] https://apenwarr.ca/log/20251120
[1] https://bfl.ai/blog/flux2-klein-towards-interactive-visual-i...
[2] https://news.ycombinator.com/item?id=46653721
https://github.com/antirez/gguf-tools
1. It is now much faster, and the Python benchmarks were re-done correctly (the old benchmark didn't include model loading and did a warm-up before the timed inference, while the C code was tested in exactly the opposite way).
2. There is now --mmap support, so the blas target can run on Linux with 16GB of RAM; a generic sketch of the pattern is below. Inference is viable on my old-ish Dell Latitude i5.
3. The seed is now part of the PNG metadata.
4. Many other improvements; check the README.
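On the --mmap point, for anyone unfamiliar with the pattern: instead of reading all the weights into heap memory, the file is mapped read-only and pages are faulted in on demand, so a 16GB machine can run a model whose weights barely fit. A generic sketch of such a loader follows; it is not the project's actual code.

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a weights file read-only; the kernel pages tensors in on first access
 * and can drop clean pages under memory pressure instead of swapping. */
const float *map_weights(const char *path, size_t *out_bytes) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror(path); exit(1); }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); exit(1); }

    void *base = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (base == MAP_FAILED) { perror("mmap"); exit(1); }
    close(fd);                                   /* mapping stays valid after close */

    madvise(base, (size_t)st.st_size, MADV_SEQUENTIAL); /* hint for the first pass */
    *out_bytes = (size_t)st.st_size;
    return (const float *)base;
}
```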