LLMs work on both translation steps, but you end up with a healthy amount of tests.
I tagged each test with the ID of the spec, so I get spec-to-test coverage as well,
besides the standard code coverage given by the tests.
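To make the tagging idea concrete, here is a minimal sketch of how spec-to-test coverage could be computed. The `spec` decorator, the `SPEC-00x` IDs, and the test names are all made up for illustration; this is not how CodeSpeak or any particular test framework does it.

```python
from collections import defaultdict

# Hypothetical registry: spec ID -> names of the tests tagged with it.
SPEC_TAGS = defaultdict(list)


def spec(spec_id):
    """Decorator that tags a test function with the spec item it covers."""
    def decorate(test_fn):
        SPEC_TAGS[spec_id].append(test_fn.__name__)
        test_fn.spec_id = spec_id
        return test_fn
    return decorate


@spec("SPEC-001")
def test_login_accepts_valid_password():
    assert True  # placeholder body


@spec("SPEC-001")
def test_login_rejects_empty_password():
    assert True  # placeholder body


@spec("SPEC-003")
def test_logout_clears_session():
    assert True  # placeholder body


def spec_coverage(all_spec_ids):
    """Split the spec IDs into those with at least one tagged test and those without."""
    covered = sorted(s for s in all_spec_ids if SPEC_TAGS.get(s))
    uncovered = sorted(s for s in all_spec_ids if not SPEC_TAGS.get(s))
    return covered, uncovered


if __name__ == "__main__":
    covered, uncovered = spec_coverage(["SPEC-001", "SPEC-002", "SPEC-003"])
    print("covered:", covered)
    print("uncovered:", uncovered)
```

With a real test runner you would probably use its own tagging mechanism instead of a hand-rolled registry, but the report is the same: which spec items have no test pointing at them.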
For now, it's only about test coverage of the code, but the spec coverage is coming too.
Instead of imperatively letting the agents hammer your codebase into shape through a series of prompts, you declare your intent, observe the outcome and refine the spec.
The agents then serve as a control plane, carrying out the intent.
Definitely won't use it for prod ofc but may try it out for a side-project.
It seems that this is more or less:
- instead of modules, write specs for your modules
- on the first go it generates the code (which you review)
- later, diffs in the spec are translated into diffs in the code (the code is *not* fully regenerated)
This actually sounds pretty usable, esp. if someone likes writing. And wherever you want to dive deep, you can delve down into the code and do "microoptimizations" by rolling something on your own (with what seems to be called here "mixed projects").
That said, not sure if I need a separate tool for this, tbh, instead of just having markdown files and telling Claude to look at the md diff and adjust the code accordingly.
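The "markdown files plus a diff" approach can be sketched in a few lines. This is only an illustration of the idea under my own assumptions: the function names, the prompt wording, and the example spec are invented, and nothing here reflects how CodeSpeak actually works internally.

```python
import difflib


def spec_diff(old_spec: str, new_spec: str, path: str = "spec.md") -> str:
    """Return a unified diff between two versions of a markdown spec."""
    lines = difflib.unified_diff(
        old_spec.splitlines(keepends=True),
        new_spec.splitlines(keepends=True),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
    )
    return "".join(lines)


def build_prompt(diff: str) -> str:
    """Wrap the spec diff in an instruction for a coding agent (wording is made up)."""
    return (
        "The specification changed as shown in this diff.\n"
        "Update the existing code to match; do not regenerate unrelated code.\n\n"
        + diff
    )


# Two versions of a tiny, invented spec.
OLD = "# Greeter\n- Prints a greeting.\n- Greeting is in English.\n"
NEW = "# Greeter\n- Prints a greeting.\n- Greeting language is configurable.\n"

if __name__ == "__main__":
    print(build_prompt(spec_diff(OLD, NEW)))
```

In practice the diff would more likely come from `git diff` on the spec files, and the prompt would go to whatever agent you already use.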
* regression tests – can be generated
* conformance tests – often can be generated
* acceptance tests – are another form of specification and should come from humans.
Human intent can be expressed as
* documents (specs, etc)
* review comments, etc
* tests with clear yes/no feedback (data for automated tests, or just manual testing)
And this is basically all that matters, see more here: https://www.linkedin.com/posts/abreslav_so-what-would-you-sa...
Will we though? Wouldn't AI need to reach a stage where it is a tool, like a compiler, which is 100% deterministic?
1. You are right that we can redefine what is code. If code is the central artefact that humans are dealing with to tell machines and other humans how the system works, then CodeSpeak specs will become code, and CodeSpeak will be a compiler. This is why I often refer to CodeSpeak as a next-level programming language.
2. I don't think being deterministic per se is what matters. Being predictable certainly does. Human engineers are not deterministic yet people pay them a lot of money and use their work all the time.
The idea, IIUC, seems to be that instead of directly telling an LLM agent how to change the code, you keep markdown "spec" files describing what the code does and then the "codespeak" tool runs a diff on the spec files and tells the agent to make those changes; then you check the code and commit both updated specs and code.
It has the advantage that the prompts are all saved along with the source rather than lost, and in a format that lets you also look at the whole current specification.
The limitation seems to be that you can't modify the code yourself if you want the spec to reflect it (and also can't do LLM-driven changes that refer to the actual code), and also that in general it's not guaranteed that the spec actually reflects all important things about the program, so the code does also potentially contain "source" information (for example, maybe you want the background of a GUI to be white and it is so because the LLM happened to choose that, but it's not written in the spec).
The latter can maybe be mitigated by doing multiple generations and checking them all, but that multiplies LLM and verification costs.
Also it seems that the tool severely limits the configurability of the agentic generation process, although that's just a limitation of the specific tool.
Working on that as well. We need to be a lot more flexible and configurable
Eventually, we'll end up in a world where humans don't need to touch code, but we are not there yet. We are looking into ways to "catch up" the specs with whatever changes happen in the code not through CodeSpeak (agents or manual changes or whatever). It's an interesting exercise. In the case of agents, it's very helpful to look at the prompts users gave them (we are experimenting with inspecting the sessions from ~/.claude).
More generally, `codespeak takeover` [1] is a tool to convert code into specs, and we are teaching it to take prompts from agent sessions into account. Seems very helpful, actually.
I think it's a valid use case to start something in vibe coding mode and then switch to CodeSpeak if you want long-term maintainability. From "sprint mode" to "marathon mode", so to speak.
* This isn't a language, it's some tooling to map specs to code and re-generate
* Models aren't deterministic - every time you try to re-apply, you'd likely get different output (unless you feed the current code into the re-apply and let it just recommend changes)
* Models are evolving rapidly, this month's flavour of Codex/Sonnet/etc would very likely generate different code from last month's
* Text specifications are always under-specified, lossy and tend to gloss over a huge amount of details that the code has to make concrete - this is fine in a small example, but in a larger code base?
* Every non-trivial codebase would be made up of hundreds of specs that interact and influence each other - very hard (and context-heavy) to read all specs that impact functionality and keep it coherent
I do think there are opportunities in this space, but what I'd like to see is:
* write text specifications
* model transforms text into a *formal* specification
* then the formal spec is translated into code which can be verified against the spec
Steps 2 and 3 could be merged into one if there were practical/popular languages that also support verification, in the vein of Ada/SPARK.
But you can also get there by generating tests from the formal specification that validate the implementation.
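As a rough sketch of that last option, take sorting as the example: the "formal spec" is an executable predicate over input and output, and the tests are generated by throwing random inputs at the implementation and checking the predicate. The function names and the sorting example are my own, and this is randomized testing, not the kind of proof SPARK gives you.

```python
import random
from collections import Counter


def satisfies_sort_spec(inp, out):
    """Formal spec for sorting: output is ordered and is a permutation of the input."""
    ordered = all(a <= b for a, b in zip(out, out[1:]))
    same_items = Counter(inp) == Counter(out)
    return ordered and same_items


def check_against_spec(impl, trials=200, seed=0):
    """Run impl on random inputs; return the first counterexample found, or None."""
    rng = random.Random(seed)
    for _ in range(trials):
        inp = [rng.randint(-5, 5) for _ in range(rng.randint(0, 8))]
        out = impl(list(inp))  # pass a copy so impl cannot mutate our input
        if not satisfies_sort_spec(inp, out):
            return inp
    return None


def good_sort(xs):
    return sorted(xs)


def buggy_sort(xs):
    # Deliberate bug: drops duplicates, so the output is not a permutation.
    return sorted(set(xs))


if __name__ == "__main__":
    print("good_sort counterexample:", check_against_spec(good_sort))
    print("buggy_sort counterexample:", check_against_spec(buggy_sort))
```

The spec predicate can still be wrong, of course, but it is usually much shorter than the implementation, which makes it easier for a human to review.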
A formal specification is no different from code: it will have bugs :)
There's no free lunch here: the informal-to-formal transition (be it words-to-code or words-to-formal-spec) comes through the non-deterministic models, period.
If we want to use the immense power of LLMs, we need to figure out a way to make this transition good enough.