js2 · 6 months ago
Discussion from 4 days ago when the code was announced (846 points, 519 comments):

https://news.ycombinator.com/item?id=44159166

SrslyJosh · 6 months ago
> Reading through these commits sparked an idea: what if we treated prompts as the actual source code? Imagine version control systems where you commit the prompts used to generate features rather than the resulting implementation.

Please god, no, never do this. For one thing, why would you not commit the generated source code when storage is essentially free? That seems insane for multiple reasons.

> When models inevitably improve, you could connect the latest version and regenerate the entire codebase with enhanced capability.

How would you know if the code was better or worse if it was never committed? How do you audit for security vulnerabilities or debug with no source code?

gizmo686 · 6 months ago
My work has involved a project that is almost entirely generated code for over a decade. Not AI generated; the actual work of the project is in creating the code generator.

One of the things we learned very quickly was that having generated source code in the same repository as actual source code was not sustainable. The nature of reviewing changes is just too different between them.

Another thing we learned very quickly was that attempting to generate code, then modify the result is not sustainable; nor is aiming for a 100% generated code base. The end result of that was that we had to significantly rearchitect the project for us to essentially inject manually crafted code into arbitrary places in the generated code.
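
One way this gets done in practice is with marked regions that the generator preserves verbatim while regenerating everything around them. A purely illustrative sketch (not our actual generator; all names invented):

    // order-handler.generated.ts -- hypothetical generated file
    interface Order { id: string; total: number }

    export function handleOrder(order: Order): number {
      let total = order.total;              // generated logic

      // BEGIN MANUAL SECTION: discount-hook
      // Hand-written code between these markers is re-inserted verbatim
      // on the next generator run; everything outside is overwritten.
      if (order.total > 100) total *= 0.9;
      // END MANUAL SECTION

      return total;                         // generated logic
    }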

Another thing we learned is that any change in the code generator needs to have a feature flag, because someone was relying on the old behavior.
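
Roughly like this (again purely illustrative; the flag and the emitted code are invented):

    // Inside a hypothetical code generator: every behavior change is gated
    // so downstream consumers can keep the old output until they migrate.
    interface GeneratorFlags { emitNullChecks: boolean }

    function emitGetter(field: string, flags: GeneratorFlags): string {
      return flags.emitNullChecks
        ? `get ${field}() { if (this._${field} == null) { throw new Error("${field} unset"); } return this._${field}; }`
        : `get ${field}() { return this._${field}; }`;
    }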

saagarjha · 6 months ago
I think the biggest difference here is that your code generator is probably deterministic and you likely are able to debug the results it produces rather than treating it like a black box.
overfeed · 6 months ago
> One of the things we learned very quickly was that having generated source code in the same repository as actual source code was not sustainable

My rule of thumb is to have both in the same repo, but treat generated code like binary data. This was informed by getting burned by a tooling regression that broke the generated code, where the investigation was complicated by having to correlate commits across different repositories.
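
Concretely, something like this in .gitattributes (the path is illustrative) gets most of the way there: git stops diffing the generated tree and GitHub collapses it in review.

    # .gitattributes
    src/generated/** linguist-generated=true -diff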

mschild · 6 months ago
> One of the things we learned very quickly was that having generated source code in the same repository as actual source code was not sustainable.

Keeping the prompts or other commands in a separate repository is fine, but not committing the generated code at all I find questionable at best.

skywhopper · 6 months ago
There’s a huge difference between deterministic generated code and LLM generated code. The latter will be different every time, sometimes significantly so. Subsequent prompts would almost immediately be useless. “You did X, but we want Y” would just blow up if the next time through the LLM (or the new model you’re trying) doesn’t produce X at all.
cimi_ · 6 months ago
I will guess that you are generating orders of magnitude more lines of code with your software than people do when building projects with LLMs - if this is true I don't think the analogy holds.
david-gpu · 6 months ago
Please tell us which company you are working for so that we don't send our resumes there.

Jokes aside, I have worked in projects where auto-generating code was the solution that was chosen and it's always been 100% auto-generated, essentially at compilation time. Any hand-coded stuff needed to handle corner cases or glue pieces together was kept outside of the code generator.

potholereseller · 6 months ago
> The end result of that was that we had to significantly rearchitect the project for us to essentially inject manually crafted code into arbitrary places in the generated code.

This sounds like putting assembly in C code. What was the input language? These two bits ("Not AI generated", "a feature flag") suggest that the code generator didn't have a natural language frontend, but rather a real programming language frontend.

Did you or anyone else inform management that a code generator is essentially a compiler with extra characters? [0] If yes, then what was their response?

I am concerned that your current/past work might have been to build a Compiler-as-a-Service (CaaS). [1] No shade, I'm just concerned that other managers might read all this and then try to build their own CaaS.

[0] Yes, I'm implying that LLMs are compilers. Altman has played us for fools; he's taught a billion people the worst part of programming: fighting the compiler to give you the output you want.

[1] Compiler-as-a-Service is the future our forefathers couldn't imagine warning us about. LLMs are CaaS's; time is a flat circle; where's the exit?; I want off this ride.

lowsong · 6 months ago
I'm the first to admit that I'm an AI skeptic, but this goes way beyond my views about AI and is a fundamentally unsound idea.

Let's assume that a hypothetical future AI is perfect. It will produce correct output 100% of the time, with no bugs, errors, omissions, security flaws, or other failings. It will also generate output instantly and cost nothing to run.

Even with such perfection this idea is doomed to failure because it can only write code based on information in the prompt, which is written by a human. Any ambiguity, unstated assumption, or omission would result in a program that didn't work quite right. Even a perfect AI is not telepathic. So you'd need to explain and describe your intended solution extremely precisely without ambiguity. Especially considering in this "offline generation" case there is no opportunity for our presumed perfect AI to ask clarifying questions.

But, by definition, any language which is precise and clear enough to not produce ambiguity is effectively a programming language, so you've not gained anything over just writing code.

gitgud · 6 months ago
This is so eloquently put and really describes the absurdity of the notion that code itself will become redundant to building a software system
handoflixue · 6 months ago
We already have AI agents that can ask a human for help / clarification in those cases.

It could also analyze the company website, marketing materials, and so forth, and use that to infer the missing pieces. (Again, something that exists today)

fastball · 6 months ago
The idea as stated is a poor one, but a slight reshuffling and it seems promising:

You generate code with LLMs. You write tests for this code, either using LLMs or on your own. You of course commit your actual code: it is required to actually run the program, after all. However you also save the entire prompt chain somewhere. Then (as stated in the article), when a much better model comes along, you re-run that chain, presumably with prompting like "create this project, focusing on efficiency" or "create this project in Rust" or "create this project, focusing on readability of the code". Then you run the tests against the new codebase and if the suite passes you carry on, with a much improved codebase. The theoretical benefit of this over just giving your previously generated code to the LLM and saying "improve the readability" is that the newer (better) LLM is not burdened by the context of the "worse" decisions made by the previous LLM.
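
A sketch of what that replay loop could look like (everything here is hypothetical: the file layout, the model id, and generateWithModel as a stand-in for whatever coding agent you'd actually call):

    // regenerate.ts -- hypothetical sketch of "replay the prompt chain, gate on tests"
    import { execSync } from "node:child_process";
    import { readFileSync } from "node:fs";

    // Stand-in for the agent/SDK call that actually edits files in src/.
    async function generateWithModel(model: string, prompt: string): Promise<void> {
      // ... call the coding agent of your choice here ...
    }

    async function main() {
      // The saved prompt chain, committed alongside the code it produced.
      const chain: string[] = JSON.parse(readFileSync("prompts/chain.json", "utf8"));

      for (const prompt of chain) {
        await generateWithModel("newer-model-id", prompt);
      }

      // The committed test suite is the acceptance gate for the regenerated code.
      execSync("npm test", { stdio: "inherit" });
    }

    main();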

Obviously it's not actually that simple, as tests don't catch everything (tho with fuzz testing and complete coverage and such they can catch most issues), but we programmers often treat them as if they do, so it might still be a worthwhile endeavor.

stingraycharles · 6 months ago
That means the temperature should be set to 0 (which not every provider supports) so that the output becomes entirely deterministic. Right now, with most models, if you give the same input prompt twice it will give two different solutions.
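
For example with the OpenAI Node SDK (a minimal sketch; and even at temperature 0 plus a seed, hardware and provider-side changes make this best-effort rather than a hard guarantee):

    import OpenAI from "openai";

    const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

    const completion = await client.chat.completions.create({
      model: "gpt-4o",
      temperature: 0, // always pick the most likely token
      seed: 42,       // best-effort reproducibility where supported
      messages: [{ role: "user", content: "Implement the OAuth token endpoint." }],
    });

    console.log(completion.choices[0].message.content);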
maxemitchell · 6 months ago
Your rephrasing better encompasses my idea, and I should have emphasized in the post that I do not think this is a good idea (nor possible) right now; it was more of a hand-wavy "how could we rethink source control in a post-LLM world" passing thought I had while reading through all the commits.

Clearly it struck a chord with a lot of the folks here though, and it's awesome to read the discourse.

layer8 · 6 months ago
One reason we treat tests that way is that we don’t generally rewrite the application from scratch, but usually only refactor parts of the existing code or make smaller changes. If we regularly did the former, test suites would have to be much more comprehensive than they typically are. Not to mention that the tests need to change when the API changes, so you generally have to rewrite the unit tests along with the application and can’t apply them unchanged.
rectang · 6 months ago
>> what if we treated prompts as the actual source code?

You would not do this because: unlike programming languages, natural languages are ambiguous and thus inadequate to fully specify software.

squillion · 6 months ago
Exactly!

> this assumes models can achieve strict prompt adherence

What does strict adherence to an ambiguous prompt even mean? It’s like those people asking Babbage if his machine would give the right answer when given the wrong figures. I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a proposition.

a012 · 6 months ago
Prompts are like stories on the board, and just as with engineers, the generated source code can vary depending on the model's understanding. Saying the prompts could be the actual code is such a wrong and dangerous thought

Deleted Comment

Xelbair · 6 months ago
Worse. Models aren't deterministic! They use a temperature value to control randomness, just so they can escape local minima!

Regenerated code might behave differently, have different bugs(worst case), or not work at all(best case).

chrishare · 6 months ago
Nitpick - it's the ML system that is sampling from model predictions that has a temperature parameter, not the model itself. Temperature and even model aside, there are other sources of randomness like the underlying hardware that can cause the havoc you describe.
kace91 · 6 months ago
Plus, commits depend on the current state of the system.

What sense does “getting rid of vulnerabilities by phasing out {dependency}” make, if the next generation of the code might not rely on the mentioned library at all? What does “improve performance of {method}” mean if the next generation used a fully different implementation?

It makes no sense whatsoever except for a vibecoder's script that’s being extrapolated into a codebase.

pollinations · 6 months ago
I'd say commit a comprehensive testing system with the prompts.
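
For example, the kind of test worth committing next to the prompts is one that pins the contract rather than the implementation, so it survives a full regeneration (a sketch with vitest; exchangeAuthCode is a hypothetical module):

    import { describe, expect, it } from "vitest";
    import { exchangeAuthCode } from "../src/oauth"; // hypothetical generated module

    describe("token endpoint", () => {
      it("rejects an expired authorization code", async () => {
        const res = await exchangeAuthCode({ code: "expired-code" });
        expect(res.error).toBe("invalid_grant");
      });
    });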

Prompts are in a sense what higher level programming languages were to assembly. Sure there is a crucial difference which is reproducibility. I could try and write down my thoughts why I think in the long run it won't be so problematic. I could be wrong of course.

I run https://pollinations.ai which serves over 4 million monthly active users quite reliably. It is mostly coded with AI. For about a year there has been no significant human commit. You can check the codebase. It's messy but not more messy than my codebases were pre-LLMs.

I think prompts + tests in code will be the medium-term solution. Humans will be spending more time testing different architecture ideas and be involved in reviewing and larger changes that involve significant changes to the tests.

maxemitchell · 6 months ago
Agreed with the medium-term solution. I wish I put some more detail into that part of the post, I have more thoughts on it but didn't want to stray too far off topic.
never_inline · 6 months ago
Apart from obvious non-reproducibility, the other problem is lack of navigable structure. I can't command+click or "show usages" or "show definition" any more.
saagarjha · 6 months ago
Just ask the AI for those obviously
tayo42 · 6 months ago
I'm pretty sure most people aren't doing "software engineering" when they program. There's the whole world of WordPress and Dreamweaver-like programming out there too, where the consequences of messing up aren't really important.

LLMs can be configured to have deterministic output too

dragonwriter · 6 months ago
Also, while it is in principle possible to have a deterministic LLM, the ones used by coding assistants aren't deterministic, so the prompts would not reliably reproduce the same software.

There is definitely an argument for also committing prompts, but it makes no sense to only commit prompts.

7speter · 6 months ago
I think the author is saying you commit the prompt with the resulting code. You said it yourself, storage is free, so commit the prompt as a comment along with the output (don't comment the output out, if I'm not being clear); it would show the developer's intent and, to some degree, almost always contribute to the documentation process.
maxemitchell · 6 months ago
Author here :). Right now, I think the pragmatic thing to do is to include all prompts used in either the PR description and/or in the commit description. This wouldn't make my longshot idea of "regenerating a repo from the ground up" possible, but it still adds very helpful context to code reviewers and can help others on your team learn prompting techniques.
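
For instance, a commit message could look something like this (illustrative; the prompt and model line are made up):

    feat: add PKCE support to the authorize endpoint

    Prompt: "Add PKCE (RFC 7636) to the authorization flow. Support S256
    only, reject plain, and return invalid_grant when the verifier does
    not match."

    Model: claude-sonnet via the coding agent; output reviewed and lightly edited.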
paxys · 6 months ago
Forget different model versions. The exact same model with the exact same prompt will generate vastly different code each subsequent time you invoke it.
renewiltord · 6 months ago
It’s been a thing people have done for at least a year https://github.com/i365dev/LetterDrop
Sevii · 6 months ago
There are lots of reasons not to do it. But if LLMs get good enough that it works consistently people will do it anyway.
minimaxir · 6 months ago
What will people call it when coders rely on vibes even more than vibe coding?
TechDebtDevin · 6 months ago
Some code is generated on the fly, like LLM UI/UX that writes Python code to do math.

Idk kinda different tho.

visarga · 6 months ago
The idea is good, but we should commit both documentation and tests. They allow regenerating the code at will.
croes · 6 months ago
You couldn’t even tell in advance if the prompt produces code at all.
mellosouls · 6 months ago
Yes, it's too early to be doing that now, but if you see the move to AI-assisted code as at least the same magnitude of change as the move from assembly to high level languages, the argument makes more sense.

Nobody commits the compiled code; this is the direction we are moving in, high level source code is the new assembly.

declan_roberts · 6 months ago
These posts are funny to me because prompt engineers point at them as evidence of the fast-approaching software engineer obsolescence, but the amount of software engineering experience necessary to even guide an AI in this way is very high.

The reason he keeps adjusting the prompts is because he knows how to program. He knows what it should look like.

It just blurs the line between engineer and tool.

spaceman_2020 · 6 months ago
The argument is that this stuff will so radically improve senior engineer productivity that the demand for junior engineers will crater. And without a pipeline of junior engineers, the junior-to-senior trajectory will radically atrophy

Essentially, the field will get frozen where existing senior engineers will be able to utilize AI to outship traditional senior-junior teams, even as junior engineers fail to secure employment

I don’t think anything in this article counters this argument

tptacek · 6 months ago
I don't know why people don't give more credence to the argument that the exact opposite thing will happen.
runeks · 6 months ago
> The argument is that this stuff will so radically improve senior engineer productivity that the demand for junior engineers will crater.

What makes people think that an increase in senior engineer productivity causes demand for junior engineers to decrease?

I think it will have the opposite effect: an increase in senior engineer productivity enables the company to add more features to its products, making it more valuable to its customers, who can therefore afford to pay more for the software. With this increase in revenue, the company is able to hire more junior engineers.

latexr · 6 months ago
> It just blurs the line between engineer and tool.

I realise you meant it as “the engineer and their tool blend together”, but I read it like a funny insult: “that guy likes to think of himself as an engineer, but he’s a complete tool”.

visarga · 6 months ago
> prompt engineers point at them as evidence of the fast-approaching software engineer obsolescence

Maybe journalists and bloggers angling for attention do it, prompt engineers are too aware of the limitations of prompting to do that.

tptacek · 6 months ago
I don't know why that's funny. This is not a post about a vibe coding session. It's Kenton Varda['s coding session].

later

updated to clarify kentonv didn't write this article

kevingadd · 6 months ago
I think it makes sense that GP is skeptical of this article considering it contains things like:

> this tool is improving itself, learning from every interaction

which seem to indicate a fundamental misunderstanding of how modern LLMs work: the 'improving' happens by humans training/refining existing models offline to create new models, and the 'learning' is just filling the context window with more stuff, not enhancement of the actual model or the model 'learning' - it will forget everything if you drop the context and as the context grows it can 'forget' things it previously 'learned'.

kiitos · 6 months ago
The sequence of commits talked about by the OP -- i.e. kenton's coding session's commits -- are like one degree removed from 100% pure vibe coding.
thegrim33 · 6 months ago
I mean yeah, the very first prompt given to the AI was put together by an experienced developer; a bunch of code telling the AI exactly what the API should look like and how it would be used. The very first step in the process already required an experienced developer to be involved.
starkparker · 6 months ago
> Almost every feature required multiple iterations and refinements. This isn't a limitation—it's how the collaboration works.

I guess that's where a big miss in understanding so much of the messaging about generative AI in coding happens for me, and why the Fly.io skepticism blog post irritated me so much as well.

It _is_ how collaboration with a person works, but when you have to fix the issues that the tool created, you aren't collaborating with a person, you're making up for a broken tool.

I can't think of any field where I'd be expected to not only put up with, but also celebrate, a tool that screwed up and required manual intervention so often.

The level of anthropomorphism that occurs in order to advocate on behalf of generative AI use leads to saying things like "it's how collaboration works" here, when I'd never say the same thing about the table saw in my woodshop, or even the relatively smart cruise control on my car.

Generative AI is still just a tool built by people following a design, and which purportedly makes work easier. But when my saw tears out cuts that I have to then sand or recut, or when my car slams on the brakes because it can't understand a bend in the road around a parking lane, I don't shrug and ascribe them human traits and blame myself for being frustrated over how they collaborate with me.

pontifier · 6 months ago
Garbage in, garbage out... My experiment with vibe coding was quite nice, but it did require a collaborative back and forth, mostly because I didn't know exactly what I wanted. It was easiest to ask for something, then describe how what it gave me needed to be changed. This type of interaction was much less costly than trying to craft the perfect prompt on the first go. My first prompts were garbage, but the output gradually converged to something quite good.
hooverd · 6 months ago
Your table saw hungers for fingers.
isaacremuant · 6 months ago
Likewise when they use all these benchmarks for "intelligence" and the tool will do the silliest things that you'd consider unacceptable from a person once you've told them a few times not to do a certain thing.

I love the paradigm shift but hate when the hype is uninformed or dishonest or not treating it with an eye for quality.

eviks · 6 months ago
> Imagine version control systems where you commit the prompts used to generate features rather than the resulting implementation.

So every single run will result in different non-reproducible implementation with unique bugs requiring manual expert interventions. How is this better?

SupremumLimit · 6 months ago
It's an interesting review but I really dislike this type of techno-utopian determinism: "When models inevitably improve..." Says who? How is it inevitable? What if they've actually reached their limits by now?
Dylan16807 · 6 months ago
Models are improving every day. People are figuring out thousands of different optimizations to training and to hardware efficiency. The idea that right now in early June 2025 is when improvement stops beggars belief. We might be approaching a limit, but that's going to be a sigmoid curve, not a sudden halt in advancement.
a2128 · 6 months ago
I think at this point we're reaching more incremental updates, which can score higher on some benchmarks but simultaneously behave worse with real-world prompts, especially if those were prompt-engineered for a specific model. I recall Google updating their Flash model on their API with no way to revert to the old one, and it caused a lot of people to complain that everything they'd built was no longer working because the model was behaving differently than when they wrote all their prompts.
deadbabe · 6 months ago
5 years ago a person would be blown away by today’s LLMs. But people today will merely say “cool” at whatever LLMs are in use 5 years from now. Or maybe not even that.
sitkack · 6 months ago
It is copium that it will suddenly stop and the world they knew before will return.

ChatGPT came out in Nov 2022. "Attention Is All You Need" was 2017, so we were already 5 years in the past, or had 5 years of research to catch up to. And from 2022 to now ... papers and research have been increasing exponentially. Even if SOTA models were frozen, we still have years of research to apply and optimize in various ways.

groby_b · 6 months ago
It is "inevitable" in the sense that in 99% of the cases, tomorrow is just like yesterday.

LLMs have been continually improving for years now. The surprising thing would be them not improving further. And if you follow the research even remotely, you know they'll improve for a while, because not all of the breakthroughs have landed in commercial models yet.

It's not "techno-utopian determinism". It's a clearly visible trajectory.

Meanwhile, if they didn't improve, it wouldn't make a significant change to the overall observations. It's picking a minor nit.

The observation that strict prompt adherence plus prompt archival could shift how we program is both true and a phenomenon we've observed several times in the past. Nobody keeps the assembly output from the compiler around anymore, either.

There's definitely valid criticism to the passage, and it's overly optimistic - in that most non-trivial prompts are still underspecified and have multiple possible implementations, not all correct. That's both a more useful criticism, and not tied to LLM improvements at all.

double0jimb0 · 6 months ago
Are there places that follow the research that speak to the layperson?
its-kostya · 6 months ago
What is ironic: if we buy into the theory that AI will write the majority of the code in the next 5-10 years, what is it going to train on after that? ITSELF? It seems this theoretical trajectory of "will inevitably get better" is only true if humans are producing quality training data. The quality of code LLMs create is very much proportional to how mature and ubiquitous the languages/projects are.
solarwindy · 6 months ago
I think you neatly summarise why the current pre-trained LLM paradigm is a dead end. If these models were really capable of artificial reasoning and learning, they wouldn’t need more training data at all. If they could learn like a human junior does, and actually progress to being a senior, then I really could believe that we’ll all be out of a job—but they just do not.
sumedh · 6 months ago
More compute means faster processing and more context.
Sevii · 6 months ago
Models have improved significantly over the last 3 months. Yet people have been saying 'What if they've actually reached their limits by now?' for pushing 3 years.
BoorishBears · 6 months ago
This is just people talking past each other.

If you want a model that's getting better at helping you as a tool (which for the record, I do), then you'd say in the last 3 months things got better between Gemini's long context performance, the return of Claude Opus, etc.

But if your goal post is replacing SWEs entirely... then it's not hard to argue we definitely didn't overcome any new foundational issues in the last 3 months, and not too many were solved in the last 3 years even.

In the last year the only real foundational breakthrough would be RL-based reasoning with test-time compute delivering real results, but what that does to hallucinations, plus even Deepseek catching up with just a few months of post-training, shows that in its current form the technique doesn't completely blow up the barriers in the way people were originally touting it would.

Overall models are getting better at things we can trivially post-train and synthesize examples for, but it doesn't feel like we're breaking unsolved problems at a substantially accelerated rate (yet.)

greyadept · 6 months ago
For me, improvement means no hallucination, but that only seems to have gotten worse and I'm interested to find out whether it's actually solvable at all.
_pdp_ · 6 months ago
I commented on the original discussion a few days ago but I will do it again.

Why is this such a big deal? This library is not even that interesting. It is a very straightforward task I expect most programmers would be able to pull off easily. 2/3 of the code is type interfaces and comments. The rest is a by-the-book implementation of a protocol that is not even that complex.

Please, there are some React JSX files in your code base with a lot more complexities and intricacies than this.

Has anyone even read the code at all?

JackSlateur · 6 months ago
Of course, this is a pathetic commercial, nothing serious

As you say, the code is not interesting, it deals with a well known topic

And it required lots of manpower to get done

tldr: this is a non-event disguised as incredible success. No doubt cloudflare is making money with that AI crap, somehow.

thorum · 6 months ago
Humorous that this article has a strong AI writing smell - the author should publish the prompts they used!
dcre · 6 months ago
I don’t like to accuse, and the article is fine overall, but this stinks: “This transparency transforms git history from a record of changes into a record of intent, creating a new form of documentation that bridges human reasoning and machine implementation.”
keybored · 6 months ago
> I don’t like to accuse, and the article is fine overall, but this stinks:

Now consider your reasonable instinct not to accuse other people, coupled with the possibility of setting AI loose with “write a positive article about AI where you have some paragraphs about the current limitations based on this link. write like you are just following the evidence.” Meanwhile we are supposed to sit here and weigh every word.

This reminds me to write a prompt for a blog post: how AI could be used for making a personal-looking tech-guy-who-meditates-and-runs website. (Do we have the technology? Yes we do)

ZephyrBlu · 6 months ago
Also: "This OAuth library represents something larger than a technical milestone—it's evidence of a new creative dynamic emerging"

Em-dash baby.

OjotCewIo · 6 months ago
> this stinks: “This transparency transforms git history from a record of changes into a record of intent, creating a new form of documentation that bridges human reasoning and machine implementation.”

That's where I stopped reading. If they needed "AI" for turning their git history into a record of intent ("transparency"), then they had been doing it all wrong, previously. Git commit messages have always been a "form of documentation that bridges human reasoning" -- namely, with another human's (the reader's) reasoning.

If you don't walk your reviewer through your patch, in your commit message, as if you were teaching them, then you're doing it wrong.

Left a bad taste in my mouth.

maxemitchell · 6 months ago
I did human notes -> had Claude condense and edit -> manually edit. A few of the sentences (like the stinky one below) were from Claude which I kept if it matched my own thoughts, though most were changed for style/prose.

I'm still experimenting with it. I find it can't match style at all, and even with the manual editing it still "smells like AI" as you picked up. But, it also saves time.

My prompt was essentially "here are my old blog posts, here's my notes on reading a bunch of AI generated commits, help me condense this into a coherent article about the insights I learned"

layer8 · 6 months ago
I wonder if those notes wouldn’t have been more interesting as-is, and possibly also more condensed.
thorum · 6 months ago
Makes sense, I could see the human touch on the article too, so I figured it was something like that.