AI can code, but it can't build software

This is a good headline. LLMs are remarkably good at writing code. Writing code isn't the same thing as delivering working software.

A human expert needs to identify the need for software, decide what the software should do, figure out what's feasible to deliver, build the first version (AI can help a bunch here), evaluate what they've built, show it to users, talk to them about whether it's fit for purpose, iterate based on their feedback, deploy and communicate the value of the software, and manage its existence and continued evolution in the future.

Some of that stuff can be handled by non-developer humans working with LLMs, but a human expert needs who understands code will be able to do this stuff a whole lot more effectively.

I guess the big question is if experienced product management types can pick up enough coding technical literacy to work like this without programmers, or if programmers can pick up enough enough PM skills to work without PMs.

My money is on both roles continuing to exist and benefit from each other, in a partnership that produces results a lot faster because the previously slow "writing the code" part is a lot faster than it used to be.

prmph · 4 months ago

> LLMs are remarkably good at writing code.

Just this past weekend, I've designed and written code (in Typescript) that I don't think LLMs can even come close to writing in years. I have a subscription to a frontier LLM, but lately I find myself using like 25% of the time.

At a certain level the software architecture problems I'm solving, drawing upon decades of understanding about maintainable, performant, and verifiable design of data structures and types and algorithms, are things LLMs cannot even begin to grasp.

At that point, I find that attempting to use an LLM to even draft an initial solution is a waste of time. At best I can use it for initial brainstorming.

The people saying LLMs can code are hard for me to understand. They are good for simple bash scripts and complex refactoring and drafting basic code idioms and that's about it.

And even for these tasks the amount of hand-holding I need to do is substantial. At least Gemini Pro/CLI seems good at one-shot performance, before its context gets poisoned

Aperocky · 4 months ago

I found that mastering LLM is no less complex than getting to learning a new language, probably between python and C++ in terms of mastery.

The learning curve is very different - with other languages, the learning curve is often upfront, with LLM, it seems linear/even rear loaded, maybe because I've not gotten to the other side.

I've been able to make LLM do more and more, some of it is undoubtly due to the improvement in model, but most of it is probably paradigm and changes in my approach. At the beginning, I run into all of the same complaints that I have eventually found workarounds to many.

jcelerier · 4 months ago

> The people saying LLMs can code are hard for me to understand. They are good for simple bash scripts and complex refactoring and drafting basic code idioms and that's about it

that's like, 90% of the code people are writing

airstrike · 4 months ago

I find LLMs most helpful when I already have half of the answer written and need them to fill in the blanks.

"Take X and Y I've written before, some documentation for Z, an example W from that repo, now smash them together and build the thing I need"

latentsea · 4 months ago

I think C# is really going to shine in the LLM coding era. You can write Roslyn Analyzers to fail the build on arbitrary conditions after inspecting the AST. LLMs are great at helping you write these too. If you get a solid architecture well defined you can then use these as guardrails to constrain development to only happen in the manner you intend. You can then get LLMs to implement features and guarantee the code comes out in the shape you expect it to.

This works well for humans too, but custom analysers are abstract and not many devs know how to write them, so they are mostly provided by library authors. However, being able to generate them via LLMs makes them so much more accessible, and IMHO is a game changer for enforcing an architecture.

I've been exploring this direction a lot lately, and it feels very promising.

CjHuber · 4 months ago

Can you maybe give an example you’ve encountered of an algorithm or a data structure that LLMs cannot handle well?

In my experience implementing algorithms from a good comprehensive description and keeping track of data models is where they shine the most.

saint-evan · 4 months ago

Maybe if you mentioned a more complex, lower level or niche language than typescript like maybe C, MIPS or some niche exotic systems language pushing around registers. I'd believe yu, with caveat, but with abstract high level abstract languages like Python, typescript and the likes? It's highly unlikely that you would've put together syntax in any uniquely surprising combination. Maybe yu mean yu designed a clever fix to a problem within a larger codebase so thar would mean a context/attention issue for the LLM but there's no way in hell yu wrote up a contained piece of code solving a specific problem, not tied to a larger software env, that couldn't also have been written by frontier LLMs provided yu could articulate the problem, a course-of-action and expected output/behavior. LLMs are very good at writing code in isolation, humans still have deeper intuition and we're still extremely good at doing the plug-in, wiring and big picture planning. Yu over-estimate what you've done with typescript or misunderstand what 'LLMs are good at writing code' [in isolation] means

crazygringo · 4 months ago

> The people saying LLM can code are hard for me to understand.

Just today, I spent an hour documenting a function that performs a set of complex scientific simulations. Defined the function input structure, the outputs, and put a bunch of references in the body to function calls it would use.

I then spent 15 minutes explaining to the free version of ChatGPT what the function needs to do both in scientific terms and in computer architecture terms (e.g. what needed to be separated out for unit tests). Then it asked me to answer ~15 questions it had (most were yes/no, it took about 5 min), then it output around 700 lines of code.

It took me about 5 minutes to get it working, since it had a few typos. It ran.

Then I spent another 15 minutes laying out all the categories of unit tests and sanity tests I wanted it to write. It produced ~1500 lines of tests. It took me half an hour to read through them all, adjusting some edge cases that didn't make sense to me and adjusting the code accordingly. And a couple cases where it was testing the right part of the code, but had made valiant but wrong guesses as to what the scientifically correct answer would be. All the tests then passed.

All in all, a little over two hours. And it ran perfectly. In contrast, writing the code and tests myself entirely by hand would have taken at least a couple of entire days.

So when you say they're good for those simple things you list and "that's about it", I couldn't disagree more. In fact, I find myself relying on them more and more for the hardest scientific and algorithmic programming, when I provide the design and the code is relatively self-contained and tests can ensure correctness. I do the thinking, it does the coding.

solumunus · 4 months ago

That amazing code you’ve written is a tiny proportion of code that’s needed to provide business value. Most of the code delivering business value to customers day in, day out is quite simple and can easily be LLM driven.

ratatougi · 4 months ago

Agreed-I often use it when I need to brainstorm which appoarch I should take for my task, or when I need a refactor or generate a large set of mock data.

Dead Comment

roxolotl · 4 months ago

One of the interesting corollaries of the title is that this can also be true of humans. Being able to code is not the same as being a software engineer. It never has been.

echelon · 4 months ago

We're also finding this true with media generation.

AI video is an incredible tool, but it can't make movies.

It's almost as if all of these models are an exoskeleton for people that already know what they're doing. But you still need an expert in the loop.

bloppe · 4 months ago

At least you can teach a human to become a software engineer.

jfim · 4 months ago

> I guess the big question is if experienced product management types can pick up enough coding technical literacy to work like this without programmers

I'd argue that they can't, at least on a short timeframe. Not because LLMs can't generate a program or product that works, but that there needs to be enough understanding of how the implementation works to fix any complex issues that come up.

One experience I had is that I had tried to generate a MITM HTTPS proxy that uses Netty using Claude, and while it generated a pile of code that looked good on the surface, it didn't actually work. Not knowing enough about Netty, I wasn't able to debug why it didn't work and trying to fix it with the LLM didn't help either.

Maybe PMs can pick up enough knowledge over time to be able to implement products that can scale, but by that time they'd effectively be a software engineer, minus the writing code part.

ambicapter · 4 months ago

LLMs are great for learning though, you can easily ask them questions, and you can evaluate your understanding every step of the way, and gradually build the accuracy of your world model that way. It’s not uncommon for me to ask a general question, drill deeper into a concept, and then either test things manually with some toy code or end up reading the official documentation, this time with at least some exposure to the words that I’m looking for to answer my question.

kaashif · 4 months ago

If an LLM can get you 90% of the way there, you need fewer engineers. But the engineer you need probably needs to be a senior engineer who went through the pain of learning all of the details and can function without AI.

If all juniors are using AI, or even worse, no juniors are ever hired, I'm not sure how we can produce those seniors at the scale we currently do. Which isn't even that large a scale.

Bukhmanizer · 4 months ago

> the big question is if experienced product management types can pick up enough coding technical literacy to work like this without programmers

I have a strong opinion that AI will boost the importance of people with “special knowledge” more than anyone else regardless of role. So engineers with deep knowledge of a system or PMs with deep knowledge of a domain.

simonw · 4 months ago

That sounds right to me.

samsolomon · 4 months ago

I think you're right, the roles will exist for some time. But I think we'll start to see more and more overlap between engineering, product management and design.

In a lot of ways I think that will lead to stronger delivery teams. As a designer—the best performing teams I've been on have individuals with a core competency, but a lot of overlap in other areas. Product managers with strong engineering instincts, engineers with strong design instincts, etc. When there is less ambiguity in communication, teams deliver better software.

Longer-term I'm unsure. Maybe there is some sort of fusion into all-purpose product people able to do everything?

Deleted Comment

kakacik · 4 months ago

Not happening anytime soon. Those product management types are more expensive than devs in most places, you would be literally a) increasing cost per hour worked; and b) stiffling the use of (pricey) management skills of such manager to do lower pay job.

I have no doubt some broken places end up in similar mode but en masse it doesnt make any financial sense.

Also when SHTF and you can't avoid going into deep debug with strong management pressure and oversight, it will become glaringly obvious which approach can maintain things running. And SHTF always happens, its only a function of time.

shalmanese · 4 months ago

It’s worthwhile reading the original Fred Brooks “No Silver Bullets” paper where they explicitly cover LLMs under their “Hopes for the Silver” AI/Expert Systems/Automatic programming section and explain why it is still not a silver bullet.

https://worrydream.com/refs/Brooks_1986_-_No_Silver_Bullet.p...

adncors · 3 months ago

what if the real paradigm shift isn't about replacing engineers in building durable systems, but about making software so cheap and disposable that the concept of "technical debt" becomes irrelevant for a new class of "single-use" or ultra-short-lifespan applications?

colordrops · 4 months ago

Once all the context that a typical human engineer has to "build software" is available to the LLM, I'm not so sure that this statement will hold true.

bloppe · 4 months ago

But it's becoming increasingly clear that LLMs based on the transformer model will never be able to scale their context much further than the current frontier, due mainly to context rot. Taking advantage of greater context will require architectural breakthroughs.

vrc · 4 months ago

I’m a PM and I’ve been able to do a lot of very interesting near production ready bits of coding recently with an LLM. I say near production ready because I specifically only build functional data processing stuff that I intentionally build with clean I/O requirements to hand to the real engineers on the team to slot in. They still have to fix some things to meet our standards, but I’m basically a “researcher” level coder. Which makes sense — I do have an undergrad and MS in CS, and did a lot of mathy algo stuff. For the last 15+ years I could never use anything in my brain to help the team solve things I was best suited to solve. I am now, and that’s nice.

The one key point is that I am keenly aware of what I can and cannot do. With these new superpowers, I often catch myself doing too much, and I end up doing a lot more rewrites than a real engineer would. But I can see Dunning Kruger playing out everywhere when people say they can vibe code an entire product.

belZaah · 4 months ago

Yeah, no. Had Claude 4.5 generate a mock implementation of an OpenAPI spec. Trivial interaction, just a post of a json object. And Claude invented new fields to check for and failed to check for required ones.

It is helpful in reducing the number of keys I have to press and the amount of documentation-diving I need to do. But saying that’s writing code is like saying StackOverflow is writing code along with autocomplete.

simonw · 4 months ago

What did Claude do when you replied and said "don't add new fields, and make sure you check the required ones"?

IanCal · 4 months ago

I disagree. Unless you’re focussed on right now, in which case case… maybe? Depends on scale.

I have a few scattered thoughts here but I think you’re caught up on how things are done now.

A human expert in a field is the customer.

Do you think, say, gpt5 pro can’t talk to them about a problem and what’s reasonable to try and build in software?

It can build a thing, with tests, run stuff and return to a user.

It can take feedback (talking to people is the key major things LLMs have solved).

They can iterate (see: codex) deploy and they can absolutely write copy.

What do you really think in this list they can’t do?

For simplicity reduce it to a relatively basic crud app. We know that they can make these over several steps. We know they can manage the ui pretty well, do incremental work etc. What’s missing?

I think something huge here is that some of the software engineering roles and management become exceptionally fast and cheap. That means you don’t need to have as many users to be worthwhile writing code to solve a problem. Entirely personal software becomes economically viable. I don’t need to communicate value for the problem my app has solved because it’s solved it for me.

Frankly most of the “AI can’t ever do my thing” comments come across as the same as “nobody can estimate my tasks they’re so unique” we see every time something comes up about planning. Most business relevant SE isn’t complex logically, interestingly unique or frankly hard. It’s just a different language to speak.

Disclaimer: a client of mine is working on making software simpler to build and I’m looking at the AI side, but I have these views regardless.

simonw · 4 months ago

I expect that customers who have those needs would much rather hire somebody to be the intermediary with the LLM writing the code than take on that role themselves.

You'll get the occasional high agency non-technical customer who decides to learn how to get these things done with LLMs but they'll be a pretty rare breed.

I've been forcing myself to "pure vibe-code" on a few projects, where I don't read a single line of code (even the diffs in codex/claude code).

Candidly, it's awful. There are countless situations where it would be faster for me to edit the file directly (CSS, I'm looking at you!).

With that said, I've been surprised at how far the coding agents are able to go[0], and a lot less surprised about where I need to step in.

Things that seem to help: 1. Always create a plan/debug markdown file 2. Prompt the agent to ask questions/present multiple solutions 3. Use git more than normal (squash ugly commits on merge)

Planning is key to avoid half-brained solutions, but having "specs" for debug is almost more important. The LLM will happily dive down a path of editing as few files as possible to fix the bug/error/etc. This, unchecked, can often lead to very messy code.

Prompting the agent to ask questions/present multiple solutions allows me to stay "in control" over the how something is built.

I now basically commit every time a plan or debug step is complete. I've tried having the LLM control git, but I feel that it eats into the context a bit too much. Ideally a 3rd party "agent" would handle this.

The last thing I'll mention is that Claude Code (Sonnet 4.5) is still very token-happy, in that it eagerly goes above and beyond when not always necessary. Codex (gpt-5-codex) on the other hand, does exactly what you ask, almost to a fault. For both cases, this is where planning up-front is super useful.

[0]Caveat: the projects are either Typescript web apps or Rust utilities, can't speak to performance on other languages/domains.

theshrike79 · 4 months ago

Sonnet 4.5 is rebranded Opus 4. That's where it got its token-happiness.

Try asking Opus to generate a simple application and it'll do it. It'll also add thousands of lines of setup scripts and migration systems and Dockerfiles and reports about how it built everything and... Ooof.

Sonnet 4.5 is the same, but at a slightly smaller scale. It still LOVES to generate markdown reports of features it did. No clue why, but by default it's on, you need to specifically tell it to stop doing that.

svachalek · 4 months ago

Also, put heavy lint rules in place, and commit hooks to make sure everything compiles, lints, passes tests, etc. You've got to be super, super defensive. But Claude Code will see all those barriers and respond to them automatically which saves you the trouble of being vigilant over so many little things. You just need to watch the big picture, like make sure tests are there to replicate bugs, new features are tested, etc, etc.

theshrike79 · 4 months ago

Same as when coding with humans, better tests and linters will give you a shorter and simpler iteration loop.

LLMs love that.

enraged_camel · 4 months ago

>> Codex (gpt-5-codex) on the other hand, does exactly what you ask, almost to a fault.

I've seriously tried gpt-5-codex at least two dozen times since it came out, and every single time it was either insufficient or made huge mistakes. Even with the "have another agent write the specs and then give it to codex to implement" approach, it's just not very good. It also stops after trying one thing and then says "I've tried X, tests still failing, next I will try Y" and it's just super annoying. Claude is really good at iterating until it solves the issue.

jumploops · 4 months ago

What type of codebase are you working within?

I've spent quite a bit of time with the normal GPT-5 in Codex (med and high reasoning), so my perspective might be skewed!

Oh, one other tip: Codex by default seems to read partial files (~200 lines at a time), so I make sure to add "Always read files in full" to my AGENTS.md file.

asabla · 4 months ago

> The last thing I'll mention is that Claude Code (Sonnet 4.5) is still very token-happy, in that it eagerly goes above and beyond when not always necessary. Codex (gpt-5-codex) on the other hand, does exactly what you ask, almost to a fault.

I very much share your experience. As for the time being I like the experience with codex over claude, just because I find my self in a position where I know much sooner when to step in and just doing it manually.

With claude I find my self in a typing exercise much more often, I could probably get better of knowing when to stop ofc.

throwaway314155 · 4 months ago

> Candidly, it's awful.

Noting your caveat but I’m doing this with Python and your experience is very different from mine.

jumploops · 4 months ago

Oh, don't get me wrong, the models are marvelous!

The "it's awful" admission is due to the "don't look at code" aspect of this exercise.

For real work, my split is more like 80% LLM/20% non-LLM, and I read all the code. It's much faster!

tharkun__ · 4 months ago

    Always create a plan/debug markdown file

Very much necessary. Especially with Claude I find. It auto-compacts so often (Sonnet 4.5) and it instantly goes a-wall stupid after that. I then make it re-read the markdown file, so we can actually continue without it forgetting about 90% of what we just did/talked about.

    Prompt the agent to ask questions/present multiple solutions

I find that only helps marginally. They all output so much text it's not even funny. And that's with one "solution".

I don't get how people can stand reading all that nonsense they spew, especially Claude. Everything is insta-ready to deploy, problem solved, root cause found, go hit the big red button that might destroy the earth in a mushroom cloud. I learned real fast to only skim what it says and ignore all that crap (as in I never tried to "change its personality" for real - I did try to tell it to always use the scientific method and prove its assumptions but just like a junior dev it never does and just tells me stupid things it believes to be true and I have to question it. Again, just like a junior dev, but it's my junior dev that's always on and available when I have time and it does things while I do other stuff. And instead of me having to ask the junior after and hour or two what rabbit hole it went down and get them out of there, Claude and Codex usually visually ping the terminal before I even have time to notice. That's for when I don't have full time focus on what I'm trying to do with the agents, which is why I do like using them.

The times when I am fully attentive, they're just soooo slow. And many many times I could do what they're doing faster or just as fast but without spending extra money and "environment". I've been trying to "only use AI agents for coding" for like a month or two now to see its positives and limitations and form my own opinion(s).

    Prompting the agent to ask questions/present multiple solutions allows me to stay "in control" over the how something is built.

I find Claude's "Plan mode" is actually ideal. I just enable it and I don't have to tell it anything. While Codex "breaks out" from time to time and just starts coding even when I just ask it a question. If these machines ever take over, there's probably some record of me swearing at them and I will get a hitman on me. Unlike junior devs, I have no qualms about telling a model that it again ignored everything I told it.

    Ideally a 3rd party "agent" would handle this.

With sub-agents you can. Simple git interactions are perfect for subagents because not much can get lost in translation in the interface between the main agent and the sub agent. Then again, I'm not sure how you loose that much context. I rather use a sub agent for things like running the tests and linter on the whole project in the final steps, which spew a lot of unnecessary output.

Personally, I had a rather bad set of experiences with it controlling git without oversight, so I do that myself, since doing it myself is less taxing than approving everything it wants to do (I automatically allow Claude certain commands that are read only for investigations and reviewing things).