Spoiler: there won't be a part 2, or if there is it will be with a different approach. I wrote a followup that summarizes my experiences trying this out in the real world on larger codebases: https://thelisowe.substack.com/p/reflections-on-relentless-v...
tl;dr I use a version of it in my codebases now, but the combination of LLM reward hacking and the long tail of verifiers a language needs (some of which don't even exist! Like accurately detecting dead code in Python (vulture et al. can't reliably do this) or valid signatures for property-based tests) makes this problem more complicated than it seems on the surface. It's not intractable, but you'd be writing many different language-specific libraries. And even then, with all of those verifiers in place, there's no guarantee it will produce consistent code quality across repos of different sizes.
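To make the "verifier" idea concrete, here's a minimal sketch of what one of those language-specific checks might look like, assuming you just shell out to vulture's CLI. The wiring is mine, not from the post, and the findings are heuristic, which is exactly the reliability problem mentioned above.

    # Sketch of one such verifier: shell out to vulture and treat any
    # reported findings as a failure signal for the agent loop. A clean
    # run does NOT prove the absence of dead code -- vulture is heuristic.
    import subprocess

    def dead_code_findings(path: str) -> list[str]:
        """Run vulture over `path` and return its findings, one per line."""
        result = subprocess.run(["vulture", path], capture_output=True, text=True)
        # vulture prints one finding per line on stdout, e.g.
        # "app.py:12: unused function 'helper' (60% confidence)"
        return [line for line in result.stdout.splitlines() if line.strip()]

    if __name__ == "__main__":
        findings = dead_code_findings("src/")
        for finding in findings:
            print(finding)
        print(f"{len(findings)} possible dead-code sites (heuristic, not a proof)")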
A couple of weeks ago I was curious what the strictest programming language was. ChatGPT listed a couple, and it kicked off a short discussion where I began asking it about the capabilities of stricter programming languages at low levels. Funnily enough, at the end it mentioned that SPARK/Ada was the strictest you could get at the lowest levels, same as Ironclad.
At one point while asking it about drivers, it said "ACL2’s logic is [...] side‑effect‑free definitions with termination proofs when admitted to the logic. That is misaligned with effectful, interrupt‑driven kernel code."
I'm not an OS or kernel dev; most of my work has been in web dev, ML, and a little bit of embedded. How accurate is the information that was presented to me? Here is the link to the discussion: https://chatgpt.com/share/691012a7-a06c-800f-9cc9-54a7c2c8b6...
I don't know SPARK or Ada, but it just bothers me to think that we can't...I guess...prove everything about our software before we run it (yes yes, I'm familiar with halting problem shenanigans, but other than that).
These days I use Codex, with GPT-5-Codex + $200 Pro subscription. I code all day every day and haven't yet seen a single rate limiting issue.
We've come a long way. Just 3-4 months ago, LLMs would make a huge mess when faced with a large codebase. They had massive problems with files over 1k LoC (I know, files should never grow this big).
Until recently, I had to religiously provide the right context to the model to get good results. Codex does not need it anymore.
Heck, even UI seems to be a solved problem now with shadcn/ui + MCP.
My personal workflow when building bigger new features:
1. Describe the problem with lots of detail (often recording 20-60 minutes of voice, then transcribing it)
2. Prompt the model to create a PRD
3. CHECK the PRD, improve and enrich it - this can take hours
4. Actually have the AI agent generate the code and lots of tests
5. Use AI code review tools like CodeRabbit, or recently the /review function of Codex, iterate a few times
6. Check and verify manually - oftentimes there are still a few minor bugs in the implementation, but they can be fixed quickly - sometimes I just create a list of what I found and pass it back for fixes
With this workflow, I am getting extraordinary results.
AMA.
But if you want it to generate chunks of usable and eloquent Python from scratch, it’s pretty decent.
And, FWIW, I’m not fluent in Python.
Why not have static analysis tools on the other side of those generations that constrain how the LLM can write the code?
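One way to read that suggestion, as a rough sketch: run off-the-shelf analyzers after each generation and feed the diagnostics back into the next attempt. `generate_code` below is a hypothetical stand-in for your LLM call, not a real API; ruff and mypy are real tools, though which checks you run is a choice.

    # Hypothetical generate-then-verify loop. generate_code is a placeholder
    # for an LLM call; ruff and mypy diagnostics get fed back as constraints
    # on the next generation attempt.
    import subprocess

    def run_checks(path: str) -> str:
        """Return combined ruff + mypy diagnostics for `path`, empty if clean."""
        diagnostics = []
        for cmd in (["ruff", "check", path], ["mypy", path]):
            result = subprocess.run(cmd, capture_output=True, text=True)
            if result.returncode != 0:
                diagnostics.append(result.stdout.strip())
        return "\n".join(diagnostics)

    def generate_until_clean(prompt, path, generate_code, max_rounds=5):
        """Regenerate until the analyzers stop complaining, or give up."""
        feedback = ""
        for _ in range(max_rounds):
            code = generate_code(prompt + "\n" + feedback)  # your LLM call here
            with open(path, "w") as f:
                f.write(code)
            feedback = run_checks(path)
            if not feedback:
                return True
        return False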
Location: Wisconsin
---
Last September I built an AI inference tool that hit #3 on HN (https://news.ycombinator.com/item?id=41620530). It processed 17.3M messages in 24 hours and only cost $17 to run.
I specialize in:
- LLM inference optimization (FastAPI, proper batching, memory management)
- CI/CD pipelines for ML deployments
- Making AI systems cost-effective at scale
Recent work: FrankenClaude (reasoning injection experiments, https://thelisowe.substack.com/p/frankenclaude-injecting-dee...), a self-driving Rocket League agent (https://thelisowe.substack.com/p/building-an-ai-that-plays-r...), diffdev (AI-powered code modification tool, https://pypi.org/project/diffdev/).
Previously at Sprout Social where I built their ML inference platform - reduced deployment time from 6 months to 6 hours and cut AWS costs by $500K/yr.
Looking for interesting problems in AI infrastructure, performance optimization, or building products from scratch.
Tech: PyTorch, FastAPI, K8s, Docker, AWS, ONNX
---
Resume: https://drive.google.com/file/d/1qO8XdisNTFq_wmrQGDKnu6eWDi2... GitHub: github.com/Mockapapella Blog: thelisowe.substack.com
Contact: My email is in my bio or on my resume
Location: Wisconsin
Remote: Yes
Willing to relocate: Yes
Technologies: Python, PyTorch, Kubernetes, Docker, AWS, FastAPI, ONNX, MLOps
Resume: https://drive.google.com/file/d/1qO8XdisNTFq_wmrQGDKnu6eWDi2g33Me/view
GitHub: https://github.com/Mockapapella
Email: In profile or on resume
AI/ML Engineer specializing in high-performance deployments. Built distributed systems handling 30K QPS, developed a neural network for Rocket League gameplay, and created platforms that cut model deployment time from 6 months to 6 hours. Saved $500K/yr in infrastructure costs through optimization at my previous role. Former technical founder with experience in humanoid robotics and AI writing assistance. I write about my projects and musings on my blog: https://thelisowe.substack.com/ Seeking roles focused on ML infrastructure, model optimization, post-training, or full-stack AI engineering.
Tenex, a TUI for managing swarms of AI agents.
I noticed that as I use agents more and more, my PRs are getting more ambitious (read: bigger diffs), and when I reviewed them with agents I noticed that the first review wouldn't catch anything but the second would. This decreased my confidence in their capabilities, so I decided to make a tool that lets me run 10 review agents at once, then aggregate their findings into a single agent to assess and address them.
I was using Codex at the time, so Tenex is kind of a play on "10 Codex agents" and the "10x engineer" meme.
I've since added a lot of features, and just today I got to use it for the first time on a production system. There are some rough edges for sure, but as I use it, any time something feels "off" or unintuitive I take notes so I can improve it.
Fun fact: on my machine, launching 50 Claude Code instances very nearly crashes it, but I was able to launch 100 Codex instances no problem. I tried 500, but I ran into rate limits before they could all spawn :(
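The fan-out/fan-in part is simple enough to sketch. The `codex review` invocation below is a placeholder, not the actual command Tenex runs; the point is just the pattern of launching N reviewers concurrently and handing the combined findings to one aggregator.

    # Hypothetical fan-out/fan-in for review agents. The subprocess command is
    # a placeholder, not the real Codex CLI invocation; what matters is the
    # pattern: N independent reviews, one aggregation pass.
    import asyncio

    async def run_reviewer(diff_path: str) -> str:
        proc = await asyncio.create_subprocess_exec(
            "codex", "review", diff_path,  # placeholder command
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.PIPE,
        )
        stdout, _ = await proc.communicate()
        return stdout.decode()

    async def review_swarm(diff_path: str, n: int = 10) -> str:
        # Fan out: n independent reviews of the same diff.
        findings = await asyncio.gather(*(run_reviewer(diff_path) for _ in range(n)))
        # Fan in: concatenate everything for a single aggregator agent to
        # deduplicate, assess, and address.
        return "\n\n---\n\n".join(findings)

    if __name__ == "__main__":
        print(asyncio.run(review_swarm("change.diff", n=10)))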