There will be a new kind of job for software engineers, sort of like a cross between working with legacy code and toxic site cleanup.
Like back in the day being brought in to “just fix” an amalgam of FoxPro-, Excel-, and Access-based ERP that “mostly works” and only “occasionally corrupts all our data” that ambitious sales people put together over the last 5 years.
But worse - because “ambitious sales people” will no longer be constrained by the sandboxes of Excel or Access - they will ship multi-cloud edge-deployed Kubernetes microservices wired with Kafka, and it will be harder to find someone to talk to in order to understand what they were trying to do at the time.
I met a guy on the airplane the other day whose job is to vibe code for people who can't vibe code. He showed me his Discord server (he paid for plane wifi), where he charges people $50/month to be in the server and he helps them unfuck their vibe coded projects. He had around 1,000 people in the server.
A big part of the reason that people develop solutions in Excel is that they don’t have to ask anyone’s permission. No business case, no scope, no plan, and most importantly no budget.
Unless a business allows any old employee to spin up cloud services on a whim we’re not going to see sales people spinning up containers and pipelines, AI or not.
What about a sales person interacting with an LLM that is already authz'd to spin up various cloud resources? I don't think that scenario is too far-fetched...
And then over time these Excel spreadsheets become a core system that runs stuff.
I used to live in fear of one of these business analyst folks overwriting a cell, or sorting by just one column without the rest of the rows coming along.
Unless they're running Linux with LibreOffice, I fail to see how there's no budget for Excel. You have to keep up with Windows licenses first, then Office.
> and it will be harder to find someone to talk to in order to understand what they were trying to do at the time.
This will be the big counter to AI generated tools; at one point they become black boxes and the only thing people can do is to try and fix them or replace them altogether.
Of course, in theory, AI tooling will only improve; today's vibe coded software that in some cases generates revenue can be fed into the models of the future and improved upon. In theory.
Personally, I hate it; I don't like magic or black boxes.
Before AI, companies were usually very reticent to do a rewrite or major refactoring of software because of the cost, but that calculus may change with AI. A lot of physical products have ended up in this space where it's cheaper to buy a new product and throw out the old broken one rather than try and fix it. If AI lowers the cost of creating software, then I'm not sure why it wouldn't go down the same path as physical goods.
The prevailing counter-narrative around vibe coding seems to be that "code output isn't the bottleneck, understanding the problem is". But shouldn't that make vibe coding a good tool for the tool belt? Use it to understand the outermost layer of the problem, then throw out the code and write a proper solution.
> Personally, I hate it; I don't like magic or black boxes.
So, no compilers for you either?
(To be fair: I'm not loving the whole vibe coding thing. But I'm trying to approach this wave with an open mind, and looking for the good arguments on both sides. This is not one of them.)
> There will be a new kind of job for software engineers
New? New!?
This is my job now!
I call it software archeology — digging through Windows Server 2012 R2 IIS configuration files with a “last modified date” about a decade ago serving money-handling web apps to the public.
I allowed Claude to debug an ingress rule issue on my cluster last week for a membership platform I run.
Not really the same since Claude didn’t deploy anything — but I WAS surprised at how well it tracked down the ingress issue to a cron job accidentally labeled as a web pod (and attempting to service http requests).
It actually prompted me to patch the cron itself but I don’t think I’m that bullish yet to let CC patch my cluster.
Does anyone remember the websites that FrontPage and Dreamweaver used to generate from their WYSIWYG editors? It was a nightmare to modify manually and convinced me to never rely on generated code.
I agree that the code that Dreamweaver generated was truly awful. But compilers and interpreters also generate code, and these days they are very good at it. Technically the browser’s rendering engine is a code generator as well, so if you’re hand-coding HTML you’re still relying on code generation.
Declarative languages and AI go hand in hand. SQL was intended to be a ‘natural’ language that the query engine (an old-school AI) would use to write code.
Writing natural language prompts to produce code is not that different, but we’re using “stochastic” AI, and stochastic means random, which means mistakes and other non-ideal outputs.
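The declarative split is easy to see in a few lines. A minimal sketch using Python's built-in sqlite3 (the table and values are made up for illustration): you state *what* you want, and the query engine decides *how* to get it.

```python
import sqlite3

# In-memory database with a toy table (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 10.0), (2, 25.0), (3, 5.0)],
)

# Declarative: no loops, no access paths -- the engine plans those.
total, = conn.execute(
    "SELECT SUM(amount) FROM orders WHERE amount > 7"
).fetchone()
print(total)  # 35.0
```

The deterministic planner is what separates this "old-school AI" from a stochastic one: the same query always produces the same plan-correct result.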
I definitely remember that. Got paid $400 for my very first site in the early 00s.
But we also didn't have an AI tool to do the modifying of that bad code. We just had our own limited-capacity-brain, mistake-making, relatively slow-typing selves to depend on.
I still remember that FrontPage exploit in which a simple Google search would return websites that still had the default FrontPage password, and thus you could log in and modify the webpage.
Agreed, sometimes it seems like there are only two types of roles: maintaining/updating hot-mess legacy code bases for an established company, or working 100 hours a week building a new hot-mess code base for a startup. Obviously oversimplifying, but that's just my very limited experience scoping out postings and talking to people about current jobs.
Regardless, this just made me shudder thinking about the weird little ocean of (now maybe dwindling) random underpaid contract jobs for a few hours a month maintaining ancient WordPress sites...
> it will be harder to find someone to talk to in order to understand what they were trying to do at the time.
IMHO, there's a strong case for the opposite. My vibe coding prompts are along the lines of "Please implement the plan described in `phase1-epic.md` using `specification.prd` as a guide." The specification and epics are version controlled and a part of the project. My vibe coded software has better design documentation than most software projects I've been involved in.
Do we have a method to let AI analyze the data within the DBs and figure out how to port it to a well-designed DB? I'm a fan of the philosophy of writing strong data structures and stupid algorithms around them, your data will outlive your application, etc. A simple example is a MongoDB field which stores the same thing as int or string, relationships without foreign keys in Postgres, etc. Then frustrating shit like somebody creating an entire table since he can't `ALTER TABLE ADD COLUMN`.
"Claude, connect to DB A via FOO and analyze the data, then figure out how to port it to a well-designed DB B; come back to me with a proposal and implementation plan"
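The first analysis pass can even be prototyped without any AI. A hedged sketch in plain Python (the documents and field names are hypothetical, standing in for an exported schemaless collection) that scans for fields stored with mixed types - exactly the int-or-string problem above:

```python
from collections import defaultdict

# Hypothetical dump of documents from a schemaless store, where the
# same field was written sometimes as int, sometimes as str.
docs = [
    {"user_id": 1, "zip": "02139"},
    {"user_id": "2", "zip": "10001"},
    {"user_id": 3, "zip": 94103},
]

def field_types(documents):
    """Map each field name to the set of Python type names observed."""
    types = defaultdict(set)
    for doc in documents:
        for key, value in doc.items():
            types[key].add(type(value).__name__)
    return dict(types)

report = field_types(docs)
# Any field seen with more than one type needs a cast (or a decision)
# before it can land in a strictly typed target schema.
inconsistent = {k: v for k, v in report.items() if len(v) > 1}
print(inconsistent)
```

A report like this is the kind of ground truth you'd want in hand before asking an LLM to propose the target schema, rather than letting it guess.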
I'm really curious about what other jobs will pop up. As long as there is an element of probability associated with AI, there will need to be manual supervision for certain tasks/jobs.
> it will be harder to find someone to talk to in order to understand what they were trying to do at the time.
These are my favorite types of code bases to work on. The source of truth is the code. You have to read it and debug it to figure it out, and reconcile the actual behaviors with the desired or expected behaviors through your own product oriented thinking
There are always two major results from any software development process: a change in the code and a change in cognition for the people who wrote the code (whether they did so directly or with an LLM).
Python and Typescript are elaborate formal languages that emerged from a lengthy process of development involving thousands of people around the world over many years. They are non-trivially different, and it's neat that we can port a library from one to the other quasi-automatically.
The difficulty, from an economic perspective, is that the "agent" workflow dramatically alters the cognitive demands during the initial development process. It is plain to see that the developers who prompted an LLM to generate this library will not have the same familiarity with the resulting code that they would have had they written it directly.
For some economic purposes, this altering of cognitive effort, and the dramatic diminution of its duration, probably doesn't matter.
But my hunch is that most of the economic value of code is contingent on there being a set of human beings familiar with the code in a manner that requires having written it directly.
Denial of this basic reality was an economic problem even before LLMs: how often did churn in a development team result in a codebase that no one could maintain, undermining the long-term prospects of a firm?
> But my hunch is that most of the economic value of code is contingent on there being a set of human beings familiar with the code in a manner that requires having written it directly.
This reminds me of a software engineering axiom:
When making software, remember that it is a snapshot of
your understanding of the problem. It states to all,
including your future-self, your approach, clarity, and
appropriateness of the solution for the problem at hand.
Yes! But there's code and code. Not to disrespect anyone, but there is writing a new algorithm, say for optimizing gradient descent, and there is code to display a simple web form.
The first one is usually short and requires a very deep understanding of one or two profound, new ideas. The second is usually very big and requires a shallow understanding of many not-so-new ideas (which are usually a reflection of the organisation that produced the code).
My feeling is that, provided a sufficiently long context window, an LLM will be able to go through the second kind of project very easily. It will also be very good at showing that the first kind of project is not so new after all, destroying all the people who can't find really new ideas.
In both cases, it'll pressure institutions to have fewer IT specialists...
As someone who trained specifically in computer sciences, I'm a bit scared :-/
I wonder though. One of the superpowers of LLMs is code reading. I'd say the tools are better at reading than at writing. It is very easy to get comprehensive documentation for any code base and get understanding by asking questions. At that point, does it matter that there is a living developer who understands the code? If an arbitrary person with knowledge of the technology stack can get up to speed quickly, is it important to have the original developers around any more?
Well, according to the recently linked Naur paper, the mental model for a codebase includes what code wasn't written just as much as what was - e.g. a decision to prefer this design over another, etc. This is not recoverable by AI without every meeting note and interaction between the devs/clients/etc.
I don't think an LLM can generate good docs for non-self-documenting code :) Any obscure long function you can't figure out yourself and you're out of luck.
I'm not looking for documentation as an alternative to reading the code, but because I want to know elements of the programmer's state of mind that didn't make it into the code. Intentions, expectations, assumptions, alternatives considered and not taken, etc. The LLM's best guess at this is no better than mine (so far).
At humanlayer we have some OSS projects that are 99% written by AI, and a lot of it was written by AI under the supervision of developer(s) that are no longer at the company.
Every now and then we find that there are gaps in our own understanding of the code/architecture that require getting out the old LSP and spelunking through call stacks.
> I'd say the tools are better at reading than at writing.
No way, models are much, much better at writing code than giving you true and correct information. The failure modes are also a lot easier to spot when writing code: it doesn't compile, tests got skipped, it doesn't run right, etc. If Claude Code gave you incorrect information about a system, the only way to verify is to build a pretty good understanding of that system yourself. And because you've incurred a huge debt here, whoever's building that understanding is going to take much more time to do it.
Until LLMs get way closer (not entirely) to 100%, there's always gonna have to be a human in the loop who understands the code. So, in addition to the above issue you've now got a tradeoff: do you want that human to be able to manage multiple code bases but have to come up to speed on a specific one whenever intervention is necessary, or do you want them to be able to quickly intervene but only in 1 code base?
More broadly, you've also now got a human resource problem. Software engineering is pretty different from monitoring LLMs: most people get into it because they like writing code. You need software experts in the loop, but when the LLMs take the "fun" part for themselves, most SWEs are no longer interested. Thus, you're left with a small subset of an already pretty small group.
Apologists will point out that LLMs are a lot better in strongly typed languages, in code bases with lots of tests, and using language servers, MCP, etc, for their actions. You can imagine more investments and tech here. The downside is models have to work much, much harder in this environment, and you still need a software expert because the failure modes are far more obscure now that your process has obviated the simple stuff. You've solved the "slop" problem, but now you've got a "we have to spend a lot more money on LLMs and a lot more money on a rare type of expert to monitor them" problem.
---
I think what's gonna happen is a division of workflows. The LLM workflows will be cheap and shabby: they'll be black boxes, you'll have to pull the lever over and over again until it does what you want, you'll build no personal skills (because lever pulling isn't a skill), practically all of your revenue--and your most profitable ideas--will go to your rapacious underlying service providers, and you'll have no recourse when anything bad happens.
The good workflows will be bespoke and way more expensive. They'll almost always work, there will be SLAs for when they don't, you'll have (at least some) rights when you use them, they'll empower and enrich you, and you'll have a human to talk to about any of it at reasonable times.
I think jury's out on whether or not this is bad. I'm sympathetic to the "an LLM brain may be better than no brain", but that's hugely contingent on how expensive LLMs actually end up being and any deleterious effects of outsourcing core human cognition to LLMs.
I used "the map is not the territory" to describe this in the article about visual programming [0]. Code is the map; the territory is the mental model of the problem domain the code is supposed to be solving.
But, as other commentators mentioned, LLMs are so much better at reading large codebases that it even invalidates the whole idea of this post (visualizing a codebase in 3D in a fashion similar to how I would do it in my head). Which kinda changes the game – if "comprehending" a complex codebase becomes an easy task, maybe we won't need to keep developers' mental models and the code in constant sync. (It's an open question.)
It's so much easier to build a mental model of a code base with LLMs. You just ask specific questions of a subsystem and they show files, code snippets, point out the idea, etc.
I just recently took the time to understand exactly how the GIL works in CPython, because I just asked a couple of questions about it, and Claude showed me the relevant API and where to find examples. I looked it up in the CPython codebase and all of a sudden it clicked.
The huge difference was that it cost me MINUTES. I didn't even bother to dig in before, because I can't perfectly read C, the CPython codebase is huge and it would have taken me a really long time to understand everything.
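That kind of explanation pairs well with a quick experiment. A minimal sketch (assuming CPython 3.x) poking at the knob that controls how often the GIL is handed between threads, and the classic shared-counter pattern:

```python
import sys
import threading

# CPython's GIL lets only one thread execute Python bytecode at a time.
# The interpreter asks the running thread to release it every "switch
# interval" seconds, which the sys module exposes.
print(f"default GIL switch interval: {sys.getswitchinterval()} s")

sys.setswitchinterval(0.001)  # request more frequent handoffs

# The GIL does NOT make compound operations like `n += 1` atomic,
# so a shared counter still needs a Lock to be safe.
n, lock = 0, threading.Lock()

def bump(times: int) -> None:
    global n
    for _ in range(times):
        with lock:
            n += 1

threads = [threading.Thread(target=bump, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(n)  # 40000
```

Reading the real `ceval` machinery behind this in the CPython source is exactly the spelunking an LLM can shortcut.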
> It is plain to see that the developers who prompted an LLM to generate this library will not have the same familiarity with the resulting code that they would have had they written it directly
I think that's a bit too simplified. Yes, a person just blindly accepting whatever the LLM generates from their unclear prompts probably won't have much understanding or familiarity with it.
But that's not how I personally use LLMs, and I'm sure a lot of others too. Instead, I'm the designer/architect, with a strict control of exactly what I want. I may not actually have written the lines, but all the interfaces/APIs are human designed, the overall design/architecture is human designed, and since I designed it, I know enough to say I'd be familiar with it.
And if I come back to the project in 1-2 years, even if there is no document, it's trivial to spend 10-20 minutes together with an LLM to understand the codebase from every angle, just ask pointed questions, and you can rebuild your mental image quickly.
TLDR: Not everyone is using LLMs for "vibe-coding" (blind-coding); some use them as an assistant sitting next to you. So my guess is that the ones who know what you need to know in order to effectively build software will be a lot more productive. The ones who don't know that (yet?) will drown in spaghetti faster than before.
> After finishing the port, most of the agents settled for writing extra tests or continuously updating agent/TODO.md to clarify how "done" they were. In one instance, the agent actually used pkill to terminate itself after realizing it was stuck in an infinite loop.
Ok, now that is funny! On so many levels.
Now, for the project itself, a few thoughts:
- this was tried before, about 1.5 years ago there was a project set up to spam GitHub with lots of "paper implementations", but it was based on gpt3.5 or 4 or something, and almost nothing worked. Their results are much better.
- surprised it worked as well as it did with simple prompts. "Probably we're overcomplicating stuff". Yeah, probably.
- weird copyright / IP questions all around. This will be a minefield.
- Lots of SaaS products are screwed. Not from this, but from this + 10 engineers in every midsized company. NIH is now justified.
> After finishing the port, most of the agents settled for writing extra tests or continuously updating agent/TODO.md to clarify how "done" they were. In one instance, the agent actually used pkill to terminate itself after realizing it was stuck in an infinite loop.
Is that... the first recorded instance of an AI committing suicide?
The AI doesn't have a self preservation instinct. It's not trying to stay alive. There is usually an end token that means the LLM is done talking. There has been research on tuning how often that is emitted to shorten or lengthen conversations. The current systems respond well to RL for adjusting conversation length.
One of the providers (I think it was Anthropic) added some kind of token (or MCP tool?) for the AI to bail on the whole conversation as a safety measure. And the models use it to their liking, so they're clearly not trying to self-preserve.
> - weird copyright / IP questions all around. This will be a minefield.
Yeah, we're in weird territory because you can drive an LLM as a Bitcoin mixer over intellectual property. That's the entire point/meaning behind https://ghuntley.com/z80.
You can take something that exists, distill it back to specs, and then you've got your own IP. Throw away the tainted IP, and then just run Ralph over a loop. You are able to clone things (not 100%, but it's better than hiring humans).
Basically, to avoid the ambiguity of an LLM trained on unlicensed code, I use it to generate a description of the code for another LLM trained on permissively licensed code. (There aren't any usable public-domain models I've found.)
I use it in the real world and it seems that the codegen model works 10-20% of the time (the description is not detailed enough - which is good for "clean room", but a base model couldn't follow it). All models can review the code, retry, and write their own implementation based on the codegen result, though.
repoMirror is the wrong name, aiCodeLaundering would be more accurate. This is bulk machine translation from one language to another, but in this case, it is code.
No, the actual thing will be zillions of little apps made by dev-adjacent folks to automate their tasks. I think we have about 30 of these lying around the office; people GPT up a Streamlit app, we yeet it into prod.
I am excited by the idea that small businesses with super unique problems may now be able to leverage custom software.
I have long held that high software salaries withhold the power of boutique software from its potential applications in small businesses.
It's possible we're about to see what unleashing software in small businesses might have looked like, to some degree, just with much less expert guidance and wisdom.
I am a developer so my point of view on salaries is not out of bitterness.
I started building a project by trying to wire in existing open source stuff. When I looked at the build and stuff that would cause me to bring in, and the actual stuff I needed from the open source tools, it turned out to be MUCH faster/cleaner to just get Claude to check out the repo and port the stuff I needed directly.
Now I do a calculus with dependencies. Do I want to track the upstream, is the rigging around the core I want valuable, is it well maintained? If not, just port and move on.
As a security professional who makes most of my money from helping companies recover from vibe coded tragedies, this puts Looney Tunes-style dollar signs in my eyes.
Since the entire concept of vibe coding has existed for a grand total of 5 months, how do companies reach a level of saturation with vibe coding such that it's not only prevalent, but makes sense to specialize in helping them recover from it?
It only takes one tiny vibe-coded insecure extension to a pre-existing codebase (that might have been good secure code), to turn the whole thing into a catastrophe.
It's basically the same as in other parts of IT security: It only takes one lost root password, one exploited software/device/oversight, one slip, to let an attacker in (yes, defense-in-depth architecture might help, but nonetheless, every long exploit-chain starts with the first tiny crack in the armor).
My guess is tons of small/medium sized companies were enamored with the speed and ease of use that LLMs promised and very quickly found solutions that “just worked”.
Also we don’t really specialize in it since that’s not something you would really do. It’s just that the usual vulnerabilities are more common AND compounded.
Would love to hear more about your work and how you have tapped into that market if you're keen to share. Even if it's just anecdotes about vibe-in-production gone wrong, that would be really entertaining.
Before vibe coding became too much of a thing we had the majority of our business coming from poorly developed web applications coming from off shore shops. That’s been more or less the last decade.
Once LLMs became popular we started to see more business on that front which you would expect.
What we didn’t expect is that we started seeing MUCH more “deep” work wherein the threat actor will get into core systems from web apps. You used to not see this that much because core apps were designed/developed/managed by more knowledgeable people. The integrations were more secure.
Now though? Those integrations are being vibe coded and are based on the material you’d find on tutorials/stack etc which almost always come with a “THIS IS JUST FOR DEMONSTRATION DONT USE THIS” warning.
We also see a ton of re-compromised environments. Why? They don’t know how to use CICD and just recommit the vulnerable code.
Oh yeah, before I forget, LLMs favor the same default passwords a lot. We have a list of the ones we’ve seen (will post eventually) but just be aware that that’s something threat actors have picked up on too.
EDIT: Another thing, when we talk to the guys responsible for the integrations or whatever was compromised a lot of the time we hear the excuse “we made sure to ask the LLM if it was secure and it said yes”.
I don’t know if they would have caught the issue before but I feel like there’s a bit of false comfort where they feel like they don’t have to check themselves.
Hard to say for a number of reasons but I can tell you what kind of teams we see.
College grads with no seniors or too few senior devs to oversee them tend to be the worst. Surprisingly, it seems that the worst of these is where the team is very enthusiastic about tech in general. I've wondered if it's a desire to be the next Zuckerberg, or maybe not yet having had the massive failure everyone eventually has that makes you realize you aren't bulletproof.
Experienced devs with too much work to do are common. Genuinely feel bad for these guys.
Off shore shops seem to now ship worse crap faster. Not only that but when one app has an issue you can usually assume they all have the same issue.
Also, as a side note, tech-focused companies are the most common, followed by B2C companies. Manufacturing etc. are really rare for us to see, and I think that may be something to do with reticence to adopt new patterns or tech.
If we actually want stuff that works, we need to come up with a new process. If we get "almost" good code from a single invocation, you're just going to get a lot of almost-good code from a loop. What we likely need is a Cucumberesque format with example tables for requirements that we can distill an AI to use. It will build the tests and then build the code to pass the tests.
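A Cucumber-style example table doesn't need Gherkin tooling to pin behavior down. A hedged sketch in plain Python (the shipping-cost feature and its thresholds are hypothetical) of the kind of requirements table an AI could both generate tests from and build code against:

```python
# Example table for a hypothetical feature: free shipping at $50+.
# Each row is (order_total, expected_shipping) -- the requirement
# is the table, not the implementation.
EXAMPLES = [
    (10.00, 5.00),
    (49.99, 5.00),  # boundary: just under the threshold
    (50.00, 0.00),  # boundary: exactly at the threshold
]

def shipping_cost(order_total: float) -> float:
    """Implementation written (or generated) to satisfy the table."""
    return 0.00 if order_total >= 50.00 else 5.00

# The table doubles as the acceptance test suite.
for total, expected in EXAMPLES:
    assert shipping_cost(total) == expected
print("all examples pass")
```

The point of the format is that the table survives regeneration: the AI can rewrite the function as many times as it likes, but every candidate has to pass the same rows.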
I would consider that expected but not strange. The thing blocking adoption is that most devs/people find those formal languages difficult or boring. That's even true of things like Cucumber - it's boring and most organizations care little for robust QA.
Nice. Check out https://ghuntley.com/ralph to learn more about Ralph. It's currently building a Gen-Z esoteric programming language and porting the standard library from Go to the Cursed programming language. The compiler is working, I'm just finishing up the touches of the standard library before launching.
"At one point we tried “improving” the prompt with Claude’s help. It ballooned to 1,500 words. The agent immediately got slower and dumber. We went back to 103 words and it was back on track."
Isn't this the exact opposite of every other piece of advice we have gotten in the last year?
Another piece of general feedback just recently: someone said we need to generate 10 times, because one out of those will be "worth reviewing".
How can anyone be doing real engineering in such a process: picking the exact needle out of a constantly churning chaos-simulation engine by whichever criterion (crashes least, closest to desire, human readable, random guess)?
One of the big things I think a lot of tooling misses, which Geoffrey touches on is the automated feedback loops built into the tooling. I expect you could probably incorporate generation time and token cost to automatically self tune this over time. Perhaps such things as discovering which prompts and models are best for which tasks automatically instead of manually choosing these things.
You want to go meta-meta? Get ralph to spawn subagents that analyze the process of how feedback and experimentation with techniques works. Perhaps allocate 10% of the time and effort to identifying what's missing that would make the loops more effective (better context, better tooling, better feedback mechanism, better prompts, ...?). Have the tooling help produce actionable ideas for how humans in the loop can effectively help the tooling. Have the tooling produce information and guidelines for how to review the generated code.
I think one of the big things missing in many of the tools currently available is tracking metrics through the entire software development loop. How long does it take to implement a feature. How many mistakes were made? How many errors were caught by tests? How many tokens does it take? And then using this information to automatically self-tune.
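A hedged sketch of what that self-tuning loop's bookkeeping might look like (the metric names, weights, and model labels are all made up for illustration): record per-task metrics, then pick the cheapest model/prompt combination by a blended score.

```python
from dataclasses import dataclass

# Hypothetical per-task metrics an agent harness could record.
@dataclass
class TaskMetrics:
    task: str
    model: str
    tokens_used: int
    wall_seconds: float
    test_failures: int

runs = [
    TaskMetrics("port parser", "model-a", tokens_used=12_000,
                wall_seconds=340.0, test_failures=2),
    TaskMetrics("port parser", "model-b", tokens_used=30_000,
                wall_seconds=210.0, test_failures=0),
]

def cost_score(m: TaskMetrics, token_weight: float = 1e-4) -> float:
    """Lower is better: blend token cost, minutes spent, and rework.
    The weights here are arbitrary placeholders to be tuned."""
    return (m.tokens_used * token_weight
            + m.wall_seconds / 60
            + m.test_failures * 5)

best = min(runs, key=cost_score)
print(best.model)  # model-b: pricier in tokens, but no rework
```

With enough recorded runs, the same scoring could route each task type to whichever model/prompt combination has historically scored best, instead of choosing manually.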
It's not the exact opposite of what I've been reading. Basically every person claiming to have success with LLM coding that I've read has said that too long a prompt leads to too much context, which leads to the LLM diverging from working on the problem as desired.
The core might be the difference between an LLM context window and an agent's orders in a text. The LLM itself is a core engine, running in an environment of some kind (instruct vs. others?). Agents, on the other hand, are descendants of the old Marvin Minsky stuff in a way: they have objectives and capacities, at a glance. LLMs are connected to modern agents because input text is read to start the agent; inner loops are intermediate outputs of the LLM, in language. There is no "internal code" to this set of agents; it is speaking in code and text to the next part of the internal process.
There are probably big oversights or errors in that short explanation. The LLM engine, the runner of the engine, and the specifics of some environment make for a lot of overlap, and all of it is quite complicated.
For the work they are doing porting and building off a spec there is already good context in the existing code and spec, compared with net new features in a greenfield project.
Also, VLOOKUPs are the devil.
I have seen one Kafka install that was really the best tool for the job.
More than a handful of them could have been replaced by Redis, and in the worst cases could have been a table in Postgres.
If Claude thinks it's fine, remember it's only a reflection of the dumb shit it finds in its training data.
I don’t recall the last time Claude suggested anything about version control :-)
Declarative languages and AI go hand in hand. SQL was intended to be a ‘natural’ language that the query engine (an old-school AI) would use to write code.
Writing natural language prompts to produce code is not that different, but we’re using “stochastic” AI, and stochastic means random, which means mistakes and other non-ideal outputs.
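The SQL comparison is easy to make concrete: the same aggregation written declaratively (you state what you want and the engine plans how to compute it) and imperatively (you spell out the steps). A minimal sketch using Python's built-in sqlite3; the table and values are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("alice", 30.0), ("bob", 20.0), ("alice", 12.5)])

# Declarative: state WHAT you want; the query engine decides HOW.
declarative = dict(conn.execute(
    "SELECT customer, SUM(total) FROM orders GROUP BY customer"))

# Imperative: spell out HOW, step by step.
imperative = {}
for customer, total in conn.execute("SELECT customer, total FROM orders"):
    imperative[customer] = imperative.get(customer, 0.0) + total

assert declarative == imperative
```

The declarative form leaves the execution strategy (scan order, grouping method, indexes) entirely to the engine, which is the sense in which a query planner is "an old-school AI".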
But we also didn't have an AI tool to do the modifying of that bad code. We just had our own limited-capacity-brain, mistake-making, relatively slow-typing selves to depend on.
Regardless this just made me shudder thinking about the weird little ocean of (now maybe dwindling) random underpaid contract jobs for a few hours a month maintaining ancient Wordpress sites...
Surely that can't be our fate...
Not at that speed. Scale remains to be seen, so far I'm aware only of hobby-project wreck anecdotes.
IMHO, there's a strong case for the opposite. My vibe coding prompts are along the lines of "Please implement the plan described in `phase1-epic.md` using `specification.prd` as a guide." The specification and epics are version controlled and a part of the project. My vibe coded software has better design documentation than most software projects I've been involved in.
[0] https://x.com/PovilasKorop/status/1959590015018652141
I'm really curious about what other jobs will pop up. As long as there is an element of probability associated with AI, there will need to be manual supervision for certain tasks/jobs.
These are my favorite types of code bases to work on. The source of truth is the code. You have to read it and debug it to figure it out, and reconcile the actual behaviors with the desired or expected behaviors through your own product-oriented thinking.
When I hit your comment:
1. I thought, "YES! Indeed!"
2. Then, "For Sale: Baby Shoes."
3. The similar feel caused me to do a rethink on all this. We are moving REALLY fast!
Nice comment
Python and Typescript are elaborate formal languages that emerged from a lengthy process of development involving thousands of people around the world over many years. They are non-trivially different, and it's neat that we can port a library from one to the other quasi-automatically.
The difficulty, from an economic perspective, is that the "agent" workflow dramatically alters the cognitive demands during the initial development process. It is plain to see that the developers who prompted an LLM to generate this library will not have the same familiarity with the resulting code that they would have had they written it directly.
For some economic purposes, this altering of cognitive effort, and the dramatic diminution of its duration, probably doesn't matter.
But my hunch is that most of the economic value of code is contingent on there being a set of human beings familiar with the code in a way that only comes from having written it directly.
Denial of this basic reality was an economic problem even before LLMs: how often did churn in a development team result in a codebase that no one could maintain, undermining the long-term prospects of a firm?
https://pages.cs.wisc.edu/~remzi/Naur.pdf
https://news.ycombinator.com/item?id=42592543
Great read overall, an interesting challenge to the conception that at its core, programming is about producing code.
https://gist.github.com/dpritchett/fd7115b6f556e40103ef
This reminds me of a software engineering axiom:
The first one is usually short and requires a very deep understanding of one or two profound, new ideas. The second is usually very big and requires a shallow understanding of many not-so-new ideas (which are usually a reflection of the organisation that produced the code).
My feeling is that, provided a sufficiently long context window, an LLM will be able to go through the second kind of project very easily. It will also be very good at showing that the first kind of project is not so new after all, destroying all the people who can't find really new ideas.
In both cases, it'll pressure institutions to employ fewer IT specialists...
As someone who trained specifically in computer science, I'm a bit scared :-/
At humanlayer we have some OSS projects that are 99% written by AI, and a lot of it was written by AI under the supervision of developer(s) that are no longer at the company.
Every now and then we find that there are gaps in our own understanding of the code/architecture that require getting out the old LSP and spelunking through call stacks.
It's pretty rare though.
No way, models are much, much better at writing code than giving you true and correct information. The failure modes are also a lot easier to spot when writing code: it doesn't compile, tests got skipped, it doesn't run right, etc. If Claude Code gave you incorrect information about a system, the only way to verify is to build a pretty good understanding of that system yourself. And because you've incurred a huge debt here, whoever's building that understanding is going to take much more time to do it.
Until LLMs get way closer (not entirely) to 100%, there's always gonna have to be a human in the loop who understands the code. So, in addition to the above issue you've now got a tradeoff: do you want that human to be able to manage multiple code bases but have to come up to speed on a specific one whenever intervention is necessary, or do you want them to be able to quickly intervene but only in 1 code base?
More broadly, you've also now got a human resource problem. Software engineering is pretty different from monitoring LLMs: most people get into it because they like writing code. You need software experts in the loop, but when the LLMs take the "fun" part for themselves, most SWEs are no longer interested. Thus, you're left with a small subset of an already pretty small group.
Apologists will point out that LLMs are a lot better in strongly typed languages, in code bases with lots of tests, and using language servers, MCP, etc, for their actions. You can imagine more investments and tech here. The downside is models have to work much, much harder in this environment, and you still need a software expert because the failure modes are far more obscure now that your process has obviated the simple stuff. You've solved the "slop" problem, but now you've got a "we have to spend a lot more money on LLMs and a lot more money on a rare type of expert to monitor them" problem.
---
I think what's gonna happen is a division of workflows. The LLM workflows will be cheap and shabby: they'll be black boxes, you'll have to pull the lever over and over again until it does what you want, you'll build no personal skills (because lever pulling isn't a skill), practically all of your revenue--and your most profitable ideas--will go to your rapacious underlying service providers, and you'll have no recourse when anything bad happens.
The good workflows will be bespoke and way more expensive. They'll almost always work, there will be SLAs for when they don't, you'll have (at least some) rights when you use them, they'll empower and enrich you, and you'll have a human to talk to about any of it at reasonable times.
I think jury's out on whether or not this is bad. I'm sympathetic to the "an LLM brain may be better than no brain", but that's hugely contingent on how expensive LLMs actually end up being and any deleterious effects of outsourcing core human cognition to LLMs.
But, as other commenters mentioned, LLMs are so much better at reading large codebases that it even invalidates the whole idea of this post (visualizing a codebase in 3D, similar to how I would do it in my head). Which kinda changes the game – if "comprehending" a complex codebase becomes an easy task, maybe we won't need to keep developers' mental models and the code in constant sync. (It's an open question.)
[0] https://divan.dev/posts/visual_programming_go/
I just recently took the time to understand how the GIL works exactly in CPython, because I asked a couple of questions about it, and Claude showed me the relevant API and examples of where to find it. I looked it up in the CPython codebase and all of a sudden it clicked.
The huge difference was that it cost me MINUTES. I didn't even bother to dig in before, because I can't perfectly read C, the CPython codebase is huge and it would have taken me a really long time to understand everything.
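For anyone who wants to see the GIL's effect rather than read the C: on a standard (GIL-enabled) CPython build, two CPU-bound threads take roughly as long as running the same work back to back, because only one thread can execute Python bytecode at a time. A minimal standard-library sketch:

```python
import threading
import time

def spin(n):
    # Pure-Python CPU work: holds the GIL while executing bytecode.
    total = 0
    for i in range(n):
        total += i
    return total

N = 2_000_000

# Run the work twice sequentially.
start = time.perf_counter()
spin(N)
spin(N)
sequential = time.perf_counter() - start

# Run the same work in two threads. Under the GIL the interpreter
# switches between them (every ~5 ms by default), so there is no
# parallel speedup; the threaded time is about the same or worse.
start = time.perf_counter()
threads = [threading.Thread(target=spin, args=(N,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
threaded = time.perf_counter() - start

print(f"sequential: {sequential:.3f}s  threaded: {threaded:.3f}s")
```

(On the experimental free-threaded builds of CPython 3.13+, the threaded version can actually run in parallel, which is exactly the behavior the GIL normally prevents.)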
I think that's a bit too simplified. Yes, a person just blindly accepting whatever the LLM generates from their unclear prompts probably won't have much understanding or familiarity with it.
But that's not how I personally use LLMs, and I'm sure a lot of others too. Instead, I'm the designer/architect, with a strict control of exactly what I want. I may not actually have written the lines, but all the interfaces/APIs are human designed, the overall design/architecture is human designed, and since I designed it, I know enough to say I'd be familiar with it.
And if I come back to the project in 1-2 years, even if there is no document, it's trivial to spend 10-20 minutes together with an LLM to understand the codebase from every angle, just ask pointed questions, and you can rebuild your mental image quickly.
TLDR: Not everyone is using LLMs for "vibe-coding" (blind-coding); some use them as an assistant sitting next to you. So my guess is that the ones who know what you need to know in order to effectively build software will be a lot more productive. The ones who don't know that (yet?) will drown in spaghetti faster than before.
Ok, now that is funny! On so many levels.
Now, for the project itself, a few thoughts:
- this was tried before, about 1.5 years ago there was a project set up to spam GitHub with lots of "paper implementations", but it was based on gpt3.5 or 4 or something, and almost nothing worked. Their results are much better.
- surprised it worked as well as it did with simple prompts. "Probably we're overcomplicating stuff". Yeah, probably.
- weird copyright / IP questions all around. This will be a minefield.
- Lots of SaaS products are screwed. Not from this, but from this + 10 engineers in every midsized company. NIH is now justified.
Is that... the first recorded instance of an AI committing suicide?
One of the providers (I think it was Anthropic) added some kind of token (or MCP tool?) for the AI to bail on the whole conversation as a safety measure. And it uses it readily, so it's clearly not trying to self-preserve.
https://www.youtube.com/watch?app=desktop&t=10&v=xOCurBYI_gY
(Background: Someone training an algorithm to win NES games based on memory state)
Yeah, we're in weird territory because you can drive an LLM as a Bitcoin mixer over intellectual property. That's the entire point/meaning behind https://ghuntley.com/z80.
You can take something that exists, distill it back to specs, and then you've got your own IP. Throw away the tainted IP, and then just run Ralph over a loop. You are able to clone things (not 100%, but it's better than hiring humans).
Basically, to avoid the ambiguity of training an LLM on unlicensed code, I use it to generate a description of the code for another LLM trained on permissively licensed code. (I haven't found any usable public-domain models.)
I use it in the real world, and it seems that the codegen model works 10-20% of the time (the description is not detailed enough, which is good for the "clean room", but a base model couldn't follow it). All models can review the code, retry, and write their own implementation based on the codegen result, though.
AI output can't be copyrighted in the US.
Except you don't.
Is Unix “small sharp tools” going away? Is that a relic of having to write everything in x86 and we’re now just finally hitting the end of the arc?
I have long held that high software salaries withhold the power of boutique software from its potential applications in small businesses.
It's possible we're about to see what unleashing software in small businesses might have looked like, to some degree, just with much less expert guidance and wisdom.
I am a developer so my point of view on salaries is not out of bitterness.
Did it just solve The Halting Problem? ;)
Now I do a calculus with dependencies. Do I want to track the upstream, is the rigging around the core I want valuable, is it well maintained? If not, just port and move on.
Exactly the point behind this post https://ghuntley.com/libraries/
Please continue.
It's basically the same as in other parts of IT security: It only takes one lost root password, one exploited software/device/oversight, one slip, to let an attacker in (yes, defense-in-depth architecture might help, but nonetheless, every long exploit-chain starts with the first tiny crack in the armor).
Also we don’t really specialize in it since that’s not something you would really do. It’s just that the usual vulnerabilities are more common AND compounded.
The profession of the future is a garbage man.
Before vibe coding became too much of a thing, the majority of our business came from poorly developed web applications built by offshore shops. That's been the case for more or less the last decade.
Once LLMs became popular we started to see more business on that front which you would expect.
What we didn’t expect is that we started seeing MUCH more “deep” work wherein the threat actor will get into core systems from web apps. You used to not see this that much because core apps were designed/developed/managed by more knowledgeable people. The integrations were more secure.
Now though? Those integrations are being vibe coded and are based on the material you’d find on tutorials/stack etc which almost always come with a “THIS IS JUST FOR DEMONSTRATION DONT USE THIS” warning.
We also see a ton of re-compromised environments. Why? They don't know how to use CI/CD and just recommit the vulnerable code.
Oh yeah, before I forget, LLMs favor the same default passwords a lot. We have a list of the ones we’ve seen (will post eventually) but just be aware that that’s something threat actors have picked up on too.
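That observation is easy to turn into a pre-deploy check. A sketch, with a purely illustrative set of generic defaults (not the unpublished list mentioned above):

```python
# Illustrative, widely known generic defaults only; a real check should
# use a maintained list (and ideally also entropy/length heuristics).
COMMON_DEFAULTS = {
    ("admin", "admin"),
    ("admin", "password"),
    ("root", "root"),
    ("postgres", "postgres"),
}

def uses_default_credentials(config: dict) -> bool:
    """Return True if the configured user/password pair matches a
    known default that attackers are likely to try first."""
    pair = (config.get("user", ""), config.get("password", ""))
    return pair in COMMON_DEFAULTS

# Example: fail the deploy if a service config ships with defaults.
assert uses_default_credentials({"user": "admin", "password": "admin"})
assert not uses_default_credentials({"user": "svc", "password": "w8$kQz2m"})
```

Wiring a check like this into CI is cheap insurance against the failure mode described: LLM-generated configs that quietly reuse the same handful of credentials.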
EDIT: Another thing, when we talk to the guys responsible for the integrations or whatever was compromised a lot of the time we hear the excuse “we made sure to ask the LLM if it was secure and it said yes”.
I don’t know if they would have caught the issue before but I feel like there’s a bit of false comfort where they feel like they don’t have to check themselves.
College grads with no seniors or too few senior devs to oversee them tend to be the worst. Surprisingly, it seems that the worst of these is where the team is very enthusiastic about tech in general. I’ve wondered if it’s a desire to be the next Zuckerberg or maybe not having the massive failure everyone has eventually that makes you realize you aren’t bullet proof.
Experienced devs with too much work to do are common. Genuinely feel bad for these guys.
Offshore shops seem to now ship worse crap faster. Not only that, but when one app has an issue you can usually assume they all have the same issue.
Also, as a side note, tech-focused companies are the most common, followed by B2C companies. Manufacturing etc. are really rare for us to see, and I think that may have something to do with reticence to adopt new patterns or tech.
If we actually want stuff that works, we need to come up with a new process. If we get "almost" good code from a single invocation, we're just going to get a lot of almost-good code from a loop. What we likely need is a Cucumber-esque format with example tables for requirements that we can point an AI at. It will build the tests and then build the code to pass the tests.
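That table-first idea maps naturally onto example-driven tests. A minimal sketch of the shape in plain Python rather than actual Cucumber/Gherkin tooling; the discount rules and table values are invented for illustration:

```python
# Requirement expressed as an example table:
# (order_total, tier) -> expected discount
EXAMPLES = [
    (100.0, "standard",  0.0),
    (100.0, "gold",     10.0),
    (500.0, "standard", 25.0),
    (500.0, "gold",     75.0),
]

def discount(order_total: float, tier: str) -> float:
    """Implementation the AI would be asked to write: 5% over $500,
    plus 10% for gold-tier customers."""
    rate = 0.05 if order_total >= 500 else 0.0
    if tier == "gold":
        rate += 0.10
    return round(order_total * rate, 2)

def run_examples() -> None:
    # The loop's acceptance gate: every row of the table must pass.
    for total, tier, expected in EXAMPLES:
        assert discount(total, tier) == expected, (total, tier)

run_examples()
```

The point is that the table, not the prompt, becomes the durable artifact: humans negotiate the rows, and the generate-until-green loop is judged against them.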
The language is called Cursed.
We were curious to see if we can do away with IMPLEMENTATION_PLAN.md for this kind of task
Isn't this the exact opposite of every other piece of advice we have gotten in the past year?
Another piece of general feedback from just recently: someone said we need to generate 10 times, because one out of those will be "worth reviewing".
How can anyone be doing real engineering in such a: pick the exact needle out of the constantly churning chaos-simulation-engine that (crashes least, closest to desire, human readable, random guess)
You want to go meta-meta? Get ralph to spawn subagents that analyze the process of how feedback and experimentation with techniques works. Perhaps allocate 10% of the time and effort to identifying what's missing that would make the loops more effective (better context, better tooling, better feedback mechanism, better prompts, ...?). Have the tooling help produce actionable ideas for how humans in the loop can effectively help the tooling. Have the tooling produce information and guidelines for how to review the generated code.
I think one of the big things missing in many of the tools currently available is tracking metrics through the entire software development loop. How long does it take to implement a feature. How many mistakes were made? How many errors were caught by tests? How many tokens does it take? And then using this information to automatically self-tune.
There are probably big oversights or errors in that short explanation. The LLM engine, the runner of the engine, and the specifics of some environment, make a lot of overlap and all of it is quite complicated.
hth
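The metrics idea a couple of paragraphs up can be sketched as a per-feature record the loop appends to; the field names and aggregation below are invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class LoopMetrics:
    """One record per feature attempt in the agent loop."""
    feature: str
    wall_seconds: float = 0.0
    tokens_used: int = 0
    test_failures: int = 0        # mistakes caught by the test suite
    human_interventions: int = 0  # mistakes that needed a person

def summarize(runs: list) -> dict:
    """Aggregate signals a self-tuning loop could steer on, e.g. by
    adjusting prompts or context when failure rates trend up."""
    n = max(len(runs), 1)
    return {
        "avg_tokens": sum(r.tokens_used for r in runs) / n,
        "avg_seconds": sum(r.wall_seconds for r in runs) / n,
        "failures_per_run": sum(r.test_failures for r in runs) / n,
    }
```

Even this crude shape answers the questions posed above (how long, how many mistakes, how many tokens), and gives the tooling something concrete to tune against.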
I kind of agree that picking from 10 poorly-prompted projects is dumb.
The engineering is in setting up the engine and verification so one agent can get it right (or 90% right) on a single run (of the infinite ish loop)
They're almost certainly referring to first creating a fleshed out spec and then having it implement that, rather than just 100 words.
"This business will get out of control. It will get out of control and we'll be lucky to live through it."
https://www.youtube.com/watch?v=YZuMe5RvxPQ&t=22s