If it works, then it’s impressive. Does it work? Looking at test.sh, the oracle tests (the ones compared against SQLite) seem to consist, in their entirety, of three trivial SELECT statements. SQLite has tens of thousands of tests; it should be possible to port some of those over to get a better idea of how functional this codebase is.
Edit: I looked over some of the code.
It's not good. It's certainly not anywhere near SQLite's quality, performance, or codebase size. Many elements are the most basic thing that could possibly work, or else missing entirely. To name some examples:
- Absolutely no concurrency.
- The B-tree implementation has a line "// TODO: Free old overflow pages if any."
- When the pager adds a page to the free list, it does a linear search through the entire free list (which can get arbitrarily large) just to make sure the page isn't in the list already.
- "//! The current planner scope is intentionally small: - recognize single-table `WHERE` predicates that can use an index - choose between full table scan and index-driven lookup."
- The pager calls clone() on large buffers, which is needlessly inefficient, kind of a newbie Rust mistake.
However…
It does seem like a codebase that would basically work. At a high level, it has the necessary components and the architecture isn't insane. I'm sure there are bugs, but I think the AI could iron them out, given some more time spent working on testing. And at that point, I think it could be perfectly suitable as an embedded database for some application, as long as you don't have complex needs.
In practice, there is little reason not to just reach for actual SQLite, which is much more sophisticated. But I can think of one possible reason: SQLite has been known to have memory safety vulnerabilities, whereas this codebase is written in Rust with no unsafe code. It might eat your data, but it won't corrupt memory.
> But I can think of one possible reason: SQLite has been known to have memory safety vulnerabilities, whereas this codebase is written in Rust with no unsafe code.
I lost every single shred of confidence I had in the comment's more optimistic claims the moment I read this.
If you read through SQLite's CVE history, you'll notice most of those are spurious at best.
I am using SQLite in my project. It definitely solves problems, but I keep seeing overly arrogant, sometimes even irresponsible, statements on their website, and I can't say I appreciate their attitude toward software engineering. The quote below, from their CVE page, is one more example of such statements.
> All historical vulnerabilities reported against SQLite require at least one of these preconditions:
> 1. ...
> 2. The attacker can submit a maliciously crafted database file to the application that the application will then open and query.
> Few real-world applications meet either of these preconditions, and hence few real-world applications are vulnerable, even if they use older and unpatched versions of SQLite.
SQLite is tested against failure to allocate at every step of its operation: running out of memory never causes it to fail in a serious way, e.g. with data loss. It's far more robust than almost every other library.
That's assuming your malloc returns NULL when out of memory. Linux systems don't: with overcommit, malloc hands back addresses that can get your process killed when you actually touch them.
Lucky that SQLite is also robust against random process death.
Unfortunately it is not so easy. If rigorous tests at every step were able to guarantee that your program can't be exploited, we wouldn't need languages like Rust at all. But once you have a program in an unsafe language that is sufficiently complex, you will have memory corruption bugs. And once you have memory corruption bugs, you eventually will have code execution exploits. You might have to chain them more than in the good old days, but they will be there. SQLite even had single memory write bugs that allowed code execution which lay in the code for 20 years without anyone spotting them. Who knows how many hackers and three letter agencies had tapped into that by the time it was finally found by benevolent security researchers.
- if you're not passing SQLite's open test suite, you didn't build SQLite
- this is a "draw the rest of the owl" scenario; in order to transform this into something passing the suite, you'd need an expert in writing databases
These projects are misnamed. People didn't build counterstrike, a browser, a C compiler, or SQLite solely with coding agents. You can't use them for that purpose--like, you can't drop this in for maybe any use case of SQLite. They're simulacra (slopulacra?)--their true use is as a prop in a huge grift: tricking people (including, and most especially, the creators) into thinking this will be an economical way to build complex software products in the future.
I'm generally not this pedantic, but yeah, "I wrote an embedded database" is fine to say. If you say "I built SQLite", I expected to at least see how many of the SQLite tests your thing passed.
Well--given a full copy of the SQLite test suite, I'm pretty sure it'd get there eventually. I agree that most of these show-off projects are just prop pieces, but that's kind of the point: demonstrate that it's technically possible to do the thing, rather than actually doing the thing, because that would have diminishing returns for the demonstration. Still, the idea of setting a swarm of agents to a task and, given a suitable test suite, having them build a compliant implementation is sound in itself.
There are a lot of embedded SQL libraries out there. I'm not particularly enamoured with some of the design choices SQLite made, for example the "flexible" approach it takes to naming column types, so that isn't why I use it.
I use it for one reason: it is the most reliable SQL implementation I know of. I can safely assume that if there's file corruption, or an invariant I tried to keep doesn't hold, SQLite isn't the culprit. By completely eliminating one branch of the failure tree, it saves me time.
That one reason is the one thing this implementation lacks - while keeping what I consider SQLite's warts.
You do not recall correctly. There is more than 500K SLOC of test code in the public source tree. If you "make releasetest" from the public source tarball on Linux, it runs more than 15 million test cases.
It is true that the half-million lines of test code found in the public source tree are not the entirety of the SQLite test suite. There are other parts that are not open-source. But the part that is public is a big chunk of the total.
Why do people fall for this? We're compressing knowledge, including the source code of SQLite, into storage, then retrieving it and shifting it along latents at tremendous cost in a while loop, basically brute-forcing a Franken-version of the original.
Because virtually all software is not novel. For each single partially novel thing, there are tens of thousands of CRUD apps with just slightly different flows and data. This is what almost every employed programmer does right now: match previous patterns and produce a solution closer to the company's requirements. And if we can brute-force that quickly, that's beneficial for many people.
While I'm generally sympathetic to the idea that humans and LLM creativity is broadly similar (combining ideas absorbed elsewhere in new ways), when we ask for something that already exists it's basically just laundering open source code
Months (years?) of publicity from AI companies telling us that the AI is nearing AGI and will replace programmers. Some people are excited about that future and want it now.
In reality, LLMs can (currently) build worse versions of things that already exist: a worse database than SQLite, a worse C compiler than GCC, a worse website than one done by a human. I'd really like to see some agent create a better version of something that already exists, or at least something relatively novel.
> a worse database than SQLite, a worse C compiler than GCC, a worse website than one done by a human.
But it enables people who can't do these things at all to appear to be able to do these things and claim reputation and acclaim that they don't deserve for skills they don't have.
copyright laundering machine, which could poison the very notion of ip / copyright, whether open or closed source. the only code that can't be laundered becomes code hidden behind a server api
> 84 / 154 commits (54.5%) were lock/claim/stale-lock/release coordination.
Parallelism over one code base is clearly not very useful.
I don't understand why going as fast as possible is the goal. We should be trying to be as correct as possible. The whole point is that these agents can run while we sleep. Convergence is nonlinear: you want every step to be in the right direction. Think of it more as a series of crystalline database transactions that must unroll in perfect order than as a big pile of rocks that needs to be moved from A to B.
Orchestration and autonomy are the things people get hyped about, but validation is the real bottleneck, and I'm pretty sure it's not amenable to complete automation. The people pushing orchestration the hardest are trying to get their users to validate for them, which taints the AI related open source ecosystem for everyone (sorry Steve/Peter!).
That's the real unlock in my opinion. It's effectively an automated reverse engineering of how SQLite behaves, which is something agents are really good at.
I did a similar but smaller project a couple of weeks ago to build a Python library that could parse a SQLite SELECT query into an AST - same trick, I ran the SQLite C code as an oracle for how those ASTs should work: https://github.com/simonw/sqlite-ast
Question: you mention the OpenAI and Anthropic Pro plans, was the total cost of this project in the order of $40 ($20 for OpenAI and $20 for Anthropic)? What did you pay for Gemini?
I'm a heavy Cursor user (not yet on Claude) and I see a big disconnect between my own experience and posts like this.
* After a long vibe-coding session, I have to spend an inordinate amount of time cleaning up what Cursor generated. Any given page of code will be just fine on its own, but the overall design (unless I'm extremely specific in what I tell Cursor to do) will invariably be a mess of scattered control, grafted-on logic, and just overall poor design. This is despite me using Plan mode extensively, and instructing it to not create duplicate code, etc.
* I keep seeing metrics of 10s and 100s of thousands of LOC (sometimes even millions), without the authors ever recognizing that a gigantic LOC is probably indicative of terrible heisenbuggy code. I'd find it much more convincing if this post said it generated a 3K SQLite implementation, and not 19K.
Wondering if I'm just lagging in my prompting skills or what. To be clear, I'm very bullish on AI coding, but I do feel people are getting just a bit ahead of themselves in how they report success.
This has been my experience also, but I've been using everything (Claude Code, opencode, Copilot, etc.). It's impressive when I ask it to do something I don't know how to do, like some Python apps, but when it's in my stack I have to constantly stop it mid-process and ask it to fix something. I'm still validating the plan and rewriting a lot of the code, because the quality just is not there yet.
And for the most part I use either Opus or Sonnet, but for planning I sometimes switch to ChatGPT, since I think Claude is too blunt and does not ask enough questions. I also have local setups with Ollama and have tried some Kimi models for personal projects. The results are the same for all, but again the Claude models are slightly better.
I don't think I mentioned that, yeah, the code quality is suboptimal and this is purely a proof of concept, so I'm going to update the blog post with that information. I completely agree with you that the code you get from models doesn't follow best practices, and this is even more the case when you have many agents on one project generating lots of redundancy (which I do cover in the blog post).
Well, I've got it as Auto (configured by my company and I forget to change it). The list of enabled models includes claude-4.6-opus-high, claude-4.5-sonnet, gpt-5.3-codex, and a couple more.
this is the business model bet. the codebase is a big ball of mud that only a superhuman ai can comprehend, therefore everyone must use superhuman ai to make changes in the codebase. the selling point is iteration speed, especially early iteration speed
cf. SV conventional wisdom: he who ships first wins the market
in fairness, there is real value in iteration speed. i'm not holding my breath on human comprehensible corporate code bases moving forward. a slew of critical foundational projects, mostly run by the big names, may still care about what used to be called "good engineering practices".
What's the point of building something that already exists in open source. It's just going to use code that already exists. There's probably dozens of examples written by humans that it can pull from.
What do you suggest we build instead, that hasn't already been done? I've been developing for decades, and I can't think of a single thing that hasn't already been kind of done either in the same or other language, or at least similar.
Great work! Obviously the goal of this is not to replace sqlite, but to show that agents can do this today.
That said, I'm a lot more curious about the harness part (Bootstrap_Prompt, Agent_Prompt, etc.) than I am in what the agents have accomplished. E.g., how can I repeat this myself? I couldn't find that in the repo...
The code has not been rigorously tested, in all honesty (this is mainly an experiment in agent orchestration, as opposed to building a viable SQLite in Rust).
- The choice of two workers per model is purely pragmatic: I can't afford more.
- I chose heterogeneous agents because it has not been done yet. There is no performance justification for this choice.
> It might eat your data, but it won't corrupt memory.
That is impressive enough for now, I think.
> If you read through SQLite's CVE history, you'll notice most of those are spurious at best.
Some more context here: https://sqlite.org/cves.html
> 2. The attacker can submit a maliciously crafted database file to the application that the application will then open and query.
This 2. precondition is literally one of the idiomatic usage of sqlite that they've suggested on their site: https://sqlite.org/appfileformat.html
> their true use is as a prop in a huge grift
I believe it's an ad. Everything about it is trying so hard to seem legit and it's the most pointless thing I have ever seen.
> I think the AI could iron out the bugs, given some more time spent working on testing.

I would need to see evidence of that. In my experience it's really difficult to get AI to fix one bug without having it introduce others.
That isn't true, not by a long shot. Improvements happen because someone is inspired to do something differently.
How will that ever happen if we're obsessed with proving we can reimplement shit that's already great?
> We should be trying to be as correct as possible.
I wrote a rant about this a while back to try and encourage people to be more responsible: https://sibylline.dev/articles/2026-01-27-stop-orchestrating...
Agreed, a flat set of workers configured like this is probably not the best configuration.
Can you imagine what an all human team configured like this would produce?
> What did you pay for Gemini?
Gemini is free, I don't even know if they have a paid plan?
> After a long vibe-coding session, I have to spend an inordinate amount of time cleaning up what Cursor generated.
What model? Cursor doesn't generate anything itself, and there's a huge difference between gpt5.3-codex and composer 1 for example.
- the memory, thread safety, and build system of Rust
- the elegant syntax of OCaml and Haskell
- the expressive type system of Haskell and TypeScript
- the directness and simplicity of JavaScript
Think coding agents can help here?
How well does the resulting code perform? What are the trade-offs/limitations/benefits compared to SQLite? What problems does it solve?
Why did you use this process and this mixture of models? Why is this a good setup?