Spoiler: there won't be a part 2, or if there is it will be with a different approach. I wrote a followup that summarizes my experiences trying this out in the real world on larger codebases: https://thelisowe.substack.com/p/reflections-on-relentless-v...
tl;dr I use a version of it in my codebases now, but the combination of LLM reward hacking and the long tail of verifiers in a language (some of which don't even exist! Like accurately detecting dead code in Python (vulture et al. can't reliably do this) or valid signatures for property-based tests) makes this problem more complicated than it seems on the surface. It's not intractable, but you'd be writing many different language-specific libraries. And even then, with all of those verifiers in place, there's no guarantee that it will produce a consistent quality of code when working in different-sized repos.
I don't want or need to be told top down what to do, it's better to think for myself and propose that upward. Execs appreciate it because it makes their jobs easier; users get the features they actually want; I get to work on what I think is important.
What am I missing that makes this a bad strategy?
Even if AI is vaporware, mostly hype and little value, it will take a while for the hype to fizzle out, and by then AI might start delivering on its promise.
They've got a long runway.
It was quite obvious that AI was hype from the get-go. An expensive solution looking for a problem.
The cost of hardware. The impact on hardware and supply chains. The impact to electricity prices and the need to scale up grid and generation capacity. The overall cost to society and impact on the economy. And that's without considering the basic philosophical questions "what is cognition?" and "do we understand the preconditions for it?"
All I know is that the consumer and the general voting population lose no matter the outcome. The oligarchs, banks, government and tech lords will be protected. We will pay the price whether it succeeds or fails.
My personal experience of AI has been poor. Hallucinations, huge inconsistencies in results.
If your day job exists within an arbitrary, non-productive linguistic domain, it's a great tool. Image and video generation? Meh. Statistical and data-set analysis? Average.
Even slow, non-tech legacy-industry companies are deploying chatbots across every department: HR, operations, IT, customer support. Leadership is already planning to cut 50–90% of staff from most departments over the next decade. It matters because these initiatives are receiving internal funding, which will precipitate out to AI companies to deploy this tech and to scale it.
Where I see a flaw in this is the initial login.
If you're not already on your computer to access the password manager, how do you retrieve the essentially non-memorisable password to unlock your computer in order to get to the password manager to retrieve the essentially non-memorisable password?
The password that unlocks the computer, therefore, must be memorable. This pretty much rules out 16-character auto-generated passwords for anyone but a savant.
Am I missing something obvious here? (MFA using an authenticator app on the phone? Is that something that Windows / Mac/ Linux supports?)
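For scale, here's a back-of-the-envelope entropy comparison (the pool sizes are my assumptions, not from the thread): a memorable diceware-style passphrase gets within shouting distance of a random 16-character password, which is one common answer to the "the login password must be memorable" problem.

```python
import math

def entropy_bits(pool_size, length):
    """Bits of entropy for `length` independent uniform picks
    from a pool of `pool_size` options."""
    return length * math.log2(pool_size)

# Assumed pools: 94 printable ASCII characters vs. the standard
# 7776-word diceware list.
random_16 = entropy_bits(94, 16)      # random 16-char password
passphrase_6 = entropy_bits(7776, 6)  # six random diceware words

print(f"16-char random password: {random_16:.1f} bits")
print(f"6-word passphrase:       {passphrase_6:.1f} bits")
```

The passphrase comes in lower (~77 vs ~105 bits) but is actually memorable, and adding a seventh or eighth word closes most of the gap.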
And any password length requirement beyond 8 characters usually ends up as a logical extension of an 8-character password (like putting 1234 at the end); if 16 characters are required, people just type their standard password twice.
If any of the old passwords (potentially from unrelated applications) get leaked, it's almost trivial to guess the current one.
Briefly, the idea is to recursively decompose tasks into the simplest possible steps, recursively call (relatively small) LLMs as agents to execute one step at a time, and use a clever voting scheme to choose how to execute each step. The authors use this technique to get a relatively small LLM to solve Towers of Hanoi with 20 rings (about 1M steps), all of it in natural language.
The most obvious question is whether other tasks, more interesting -- less "rote" -- than Towers of Hanoi, can similarly be recursively decomposed into simple steps. I'm not sure that's always possible.
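For a concrete anchor on what "recursively decompose into simplest steps" means for Hanoi specifically (a sketch in plain Python, not the paper's code), the classic recursion bottoms out in exactly the kind of single-move subtasks each small-LLM call would execute:

```python
def hanoi_moves(n, src="A", dst="C", aux="B"):
    """Recursively decompose n-ring Towers of Hanoi into a flat list
    of single moves (src_peg, dst_peg) -- the "simplest possible steps"."""
    if n == 0:
        return []
    # Move n-1 rings out of the way, move the largest, move them back on top.
    return (hanoi_moves(n - 1, src, aux, dst)
            + [(src, dst)]
            + hanoi_moves(n - 1, aux, dst, src))

# 20 rings decompose into 2**20 - 1 = 1,048,575 single moves,
# the ~1M steps mentioned above.
print(len(hanoi_moves(20)))
```

The reason Hanoi works so well here is that the decomposition is mechanical and the subproblems are literally identical in shape; for most real tasks, as the comment above notes, no such clean recursion is known.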
Most real world prompts can't be reduced to something so consistent and reliable.
Their key finding was that the number of votes grows linearly with the number of prompts you are trying to chain.
However, the issue is that the number of votes you need will grow exponentially with the hallucination rate.
It's like going from YouTube to TikTok: for most content we consume, you could cut 90% of it without losing anything of value.
Even when I did get paid at some elevated rate, if I divided the money I got by the actual hours I worked, I still made way less than my hourly rate.
Reminds me of unlimited vacations policies. Great on paper.