Most software engineers are seriously sleeping on how good LLM agents are right now, especially something like Claude Code.
Once you’ve got Claude Code set up, you can point it at your codebase, have it learn your conventions, pull in best practices, and refine everything until it’s basically operating like a super-powered teammate. The real unlock is building a solid set of reusable “skills” plus a few agents for the stuff you do all the time.
For example, we have a custom UI library, and Claude Code has a skill that explains exactly how to use it. Same for how we write Storybooks, how we structure APIs, and basically how we want everything done in our repo. So when it generates code, it already matches our patterns and standards out of the box.
We also had Claude Code create a bunch of ESLint automation, including custom ESLint rules and lint checks that catch and auto-handle a lot of stuff before it even hits review.
Then we take it further: we have a deep code review agent Claude Code runs after changes are made. And when a PR goes up, we have another Claude Code agent that does a full PR review, following a detailed markdown checklist we’ve written for it.
On top of that, we’ve got like five other Claude Code GitHub workflow agents that run on a schedule. One of them reads all commits from the last month and makes sure docs are still aligned. Another checks for gaps in end-to-end coverage. Stuff like that. A ton of maintenance and quality work is just… automated. It runs ridiculously smoothly.
We even use Claude Code for ticket triage. It reads the ticket, digs into the codebase, and leaves a comment with what it thinks should be done. So when an engineer picks it up, they’re basically starting halfway through already.
There is so much low-hanging fruit here that it honestly blows my mind people aren’t all over it. 2026 is going to be a wake-up call.
(used voice to text then had claude reword, I am lazy and not gonna hand write it all for yall sorry!)
I made a similar comment on a different thread, but I think it also fits here: I think the disconnect between engineers is due to their own context. If you work with frontend applications, specially React/React Native/HTML/Mobile, your experience with LLMs is completely different than the experience of someone working with OpenGL, io_uring, libev and other lower level stuff. Sure, Opus 4.5 can one shot Windows utilities and full stack apps, but can't implement a simple shadowing algorithm from a 2003 paper in C++, GLFW, GLAD: https://www.cse.chalmers.se/~uffe/soft_gfxhw2003.pdf
Codex/Claude Code are terrible with C++. It also can't do Rust really well, once you get to the meat of it. Not sure why that is, but they just spit out nonsense that creates more work than it helps me. It also can't one shot anything complete, even though I might feed him the entire paper that explains what the algorithm is supposed to do.
Try to do some OpenGL or Vulkan with it, without using WebGPU or three.js. Try it with real code, that all of us have to deal with every day. SDL, Vulkan RHI, NVRHI. Very frustrating.
Try it with boost, or cmake, or taskflow. It loses itself constantly, hallucinates which version it is working on and ignores you when you provide actual pointers to documentation on the repo.
I've also recently tried to get Opus 4.5 to move the Job system from Doom 3 BFG to the original codebase. Clean clone of dhewm3, pointed Opus to the BFG Job system codebase, and explained how it works. I have also fed it the Fabien Sanglard code review of the job system: https://fabiensanglard.net/doom3_bfg/threading.php
We are not sleeping on it, we are actually waiting for it to get actually useful. Sure, it can generate a full stack admin control panel in JS for my PostgreSQL tables, but is that really "not normal"? That's basic.
We have an in-house, Rust-based proxy server. Claude is unable to contribute to it meaningfully outside of grunt work like minor refactors across many files. It doesn't seem to understand proxying and how it works on both a protocol level and business logic level.
With some entirely novel work we're doing, it's actually a hindrance as it consistently tells us the approach isn't valid/won't work (it will) and then enters "absolutely right" loops when corrected.
I still believe those who rave about it are not writing anything I would consider "engineering". Or perhaps it's a skill issue and I'm using it wrong, but I haven't yet met someone I respect who tells me it's the future in the way those running AI-based companies tell me.
I've had Opus 4.5 hand rolling CUDA kernels and writing a custom event loop on io_uring lately and both were done really well. Need to set up the right feedback loops so it can test its work thoroughly but then it flies.
I'm a quite senior frontend using React and even I see Sonnet 4.5 struggle with basic things. Today it wrote my Zod validation incorrectly, mixing up versions, then just decided it wasn't working and attempted to replace the entire thing with a different library.
I built an open to "game engine" entirely in Lua a many years ago, but relying on many third party libraries that I would bind to with FFI.
I thought I'd revive it, but this time with Vulkan and no third-party dependencies (except for Vulkan)
4.5 Sonet, Opus and Gemini 3.5 flash has helped me write image decoders for dds, png jpg, exr, a wayland window implementation, macOS window implementation, etc.
I find that Gemini 3.5 flash is really good at understanding 3d in general while sonnet might be lacking a little.
All these sota models seem to understand my bespoke Lua framework and the right level of abstraction. For example at the low level you have the generated Vulkan bindings, then after that you have objects around Vulkan types, then finally a high level pipeline builder and whatnot which does not mention Vulkan anywhere.
However with a larger C# codebase at work, they really struggle. My theory is that there are too many files and abstractions so that they cannot understand where to begin looking.
I'll second this. I'm making a fairly basic iOS/Swift app with an accompanying React-based site. I was able to vibe-code the React site (it isn't pretty, but it works and the code is fairly decent). But I've struggled to get the Swift code to be reliable.
Which makes sense. I'm sure there's lots of training data for React/HTML/CSS/etc. but much less with Swift, especially the newer versions.
I have not had this experience at all. It often doesn't get it right on the first pass, yes, but the advantage with Rust vibecoding is that if you give it a rule to "Always run cargo check before you think you're finished" then it will go back and fix whatever it missed on the first pass. What I find particularly valuable is that the compiler forces it to handle all cases like match arms or errors. I find that it often misses edge cases when writing typescript, and I believe that the relative leniency of the typescript compiler is why.
In a similar vein, it is quite good at writing macros (or at least, quite good given how difficult this otherwise is). You often have to cajole it into not hardcoding features into the macro, but since macros resolve at compile time they're quite well-suited for an LLM workflow as most potential bugs will be apparent before the user needs to test. I also think that the biggest hurdle of writing macros to humans is the cryptic compiler errors, but I can imagine that since LLMs have a lot of information about compilers and syntax parsing in their training corpus, they have an easier time with this than the median programmer. I'm sure an actual compiler engineer would be far better than the LLM, but I am not that guy (nor can I afford one) so I'm quite happy to use LLMs here.
For context, I am purely a webdev. I can't speak for how well LLMs fare at anything other than writing SQL, hooking up to REST APIs, React frontend, and macros. With the exception of macros, these are all problems that have been solved a million times thus are more boilerplate than novelty, so I think it is entirely plausible that they're very poor for different domains of programming despite my experiences with them.
I've had pretty good luck with LLM agents coding C. In this case a C compiler that supports a subset of C and targets a customizable microcoded state machine/processor. Then I had Gemini code up a simulator/debugger for the target machine in C++ and it did it in short order and quite successfully - lets you single step through the microcode and examine inputs (and set inputs), outputs & current state - did that in an afternoon and the resulting C++ code looks pretty decent.
Have you experimented with all of these things on the latest models (e.g. Opus 4.5) since Nov 2025? They are significantly better at coding than earlier models.
I've found it to be pretty hit-or-miss with C++ in general, but it's really, REALLY bad at 3D graphics code. I've tried to use it to port an OpenGL project to SDL3_GPU, and it really struggled. It would confidently insist that the code it wrote worked, when all you had to do was run it and look at the output to see a blank screen.
> It also can't do Rust really well, once you get to the meat of it. Not sure why that is
Because types are proofs and require global correctness, you can't just iterate, fix things locally, and wait until it breaks somewhere else that you also have to fix locally.
I have not tried C++, but Codex did a good job with low-level C code, shaders as well as porting 32 bit to 64 bit assembly drawing routines.
I have also tried it with retro-computing programming with relative success.
From what I've seen, CC has troubles with the latest Swift too, partially because of it being latest and partially because it's so convoluted nowadays.
I really think a lof of people tried AI coding earlier, got frustrated at the errors and gave up. That's where the rejection of all these doomer predictions comes from.
And I get it. Coding with Claude Code really was prompting something, getting errors, and asking it to fix it. Which was still useful but I could see why a skilled coder adding a feature to a complex codebase would just give up
Opus 4.5 really is at a new tier however. It just...works. The errors are far fewer and often very minor - "careless" errors, not fundamental issues (like forgetting to add "use client" to a nextjs client component.
This was me. I was a huge AI coding detractor on here for a while (you can check my comment history). But, in order to stay informed and not just be that grouchy curmudgeon all the time, I kept up with the models and regularly tried them out. Opus 4.5 is so much better than anything I've tried before, I'm ready to change my mind about AI assistance.
I even gave -True Vibe Coding- a whirl. Yesterday, from a blank directory and text file list of requirements, I had Opus 4.5 build an Android TV video player that could read a directory over NFS, show a grid view of movie poster thumbnails, and play the selected video file on the TV. The result wasn't exactly full-featured Kodi, but it works in the emulator and actual device, it has no memory leaks, crashes, ANRs, no performance problems, no network latency bugs or anything. It was pretty astounding.
Oh, and I did this all without ever opening a single source file or even looking at the proposed code changes while Opus was doing its thing. I don't even know Kotlin and still don't know it.
This is what people are still doing wrong. Tools in a loop people, tools in a loop.
The agent has to have the tools to detect whatever it just created is producing errors during linting/testing/running. When it can do that, I can loop again, fix the error and again - use the tools to see whether it worked.
I _still_ encounter people who think "AI programming" is pasting stuff into ChatGPT on the browser and they complain it hallucinates functions and produces invalid code.
I have been out of the loop for a couple of months (vacation). I tried Claude Opus 4.5 at the end of November 2025 with the corporate Github Copilot subscription in Agent mode and it was awful: basically ignoring code and hallucinating.
My team is using it with Claude Code and say it works brilliantly, so I'll be giving it another go.
How much of the value comes from Opus 4.5, how much comes from Claude Code, and how much comes from the combination?
my issue hasn't been for a long time now that the code they write works or doesn't work. My issues all stem from that it works, but does the wrong thing
This was me. I have done a full 180 over the last 12 months or so, from "they're an interesting idea, and technically impressive, but not practically useful" to "holy shit I can have entire days/weeks where I don't write a single line of code".
> I really think a lof of people tried AI coding earlier, got frustrated at the errors and gave up. That's where the rejection of all these doomer predictions comes from.
It's not just the deficiencies of earlier versions, but the mismatch between the praise from AI enthusiasts and the reality.
I mean maybe it is really different now and I should definitely try uploading all of my employer's IP on Claude's cloud and see how well it works. But so many people were as hyped by GPT-4 as they are now, despite GPT-4 actually being underwhelming.
Too much hype for disappointing results leads to skepticism later on, even when the product has improved.
> Opus 4.5 really is at a new tier however. It just...works.
Literally tried it yesterday. I didn't see a single difference with whatever model Claude Code was using two months ago. Same crippled context window. Same "I'll read 10 irrelevant lines from a file", same random changes etc.
I know someone who is using a vibe coded or at least heavily assisted text editor, praising it daily, while also saying llms will never be productive. There is a lot of dissonance right now.
I teach at a university, and spend plenty of time programming for research and for fun. Like many others, I spent some time on the holidays trying to push the current generation of Cursor, Claude Code, and Codex as far as I could. (They're all very good.)
I had an idea for something that I wanted, and in five scattered hours, I got it good enough to use. I'm thinking about it in a few different ways:
1. I estimate I could have done it without AI with 2 weeks full-time effort. (Full-time defined as >> 40 hours / week.)
2. I have too many other things to do that are purportedly more important that programming. I really can't dedicate to two weeks full-time to a "nice to have" project. So, without AI, I wouldn't have done it at all.
3. I could hire someone to do it for me. At the university, those are students. From experience with lots of advising, a top-tier undergraduate student could have achieved the same thing, had they worked full tilt for a semester (before LLMs). This of course assumes that I'm meeting them every week.
This is where the LLM coding shines in my opinion, there's a list of things they are doing very well:
- single scripts. Anything which can be reduced to a single script.
- starting greenfield projects from scratch
- code maintenance (package upgrades, old code...)
- tasks which have a very clear and single definition. This isn't linked to complexity, some tasks can be both very complex but with a single definition.
If your work falls into this list they will do some amazing work (and yours clearly fits that), if it doesn't though, prepare yourself because it will be painful.
How do you compare Claude Code to Cursor? I'm a Cursor user quietly watching the CC parade with curiosity. Personally, I haven't been able to give up the IDE experience.
> Most software engineers are seriously sleeping on how good LLM agents are right now, especially something like Claude Code.
Nobody is sleeping. I'm using LLMs daily to help me in simple coding tasks.
But really where is the hurry? At this point not a few weeks go by without the next best thing since sliced bread to come out. Why would I bother "learning" (and there's really nothing to learn here) some tool/workflow that is already outdated by the time it comes out?
> 2026 is going to be a wake-up call
Do you honestly think a developer not using AI won't be able to adapt to a LLM workflow in, say, 2028 or 2029? It has to be 2026 or... What exactly?
There is literally no hurry.
You're using the equivalent of the first portable CD-player in the 80s: it was huge, clunky, had hiccups, had a huge battery attached to it. It was shiny though, for those who find new things shiny. Others are waiting for a portable CD player that is slim, that buffers, that works fine. And you're saying that people won't be able to learn how to put a CD in a slim CD player because they didn't use a clunky one first.
I think getting proficient at using coding agents effectively takes a few months of practice.
It's also a skill that compounds over time, so if you have two years of experience with them you'll be able to use them more effectively than someone with two months of experience.
In that respect, they're just normal technology. A Python programmer with two years of Python experience will be more effective than a programmer with two months of Python.
> Nobody is sleeping. I'm using LLMs daily to help me in simple coding tasks.
That is sleeping.
> But really where is the hurry? At this point not a few weeks go by without the next best thing since sliced bread to come out. Why would I bother "learning" (and there's really nothing to learn here) some tool/workflow that is already outdated by the time it comes out?
You're jumping to conclusions that haven't been justified by any of the development in this space. The learning compounds.
> Do you honestly think a developer not using AI won't be able to adapt to a LLM workflow in, say, 2028 or 2029? It has to be 2026 or... What exactly?
They will, but they'll be competing against people with 2-3 more years of experience in understanding how to leverage these tools.
"But really where is the hurry?" It just depends on why you're programming. For many of us not learning and using up to date products leads to a disadvantage relative to our competition. I personally would very much rather go back to a world without AI, but we're forced to adapt. I didn't like when pagers/cell phones came out either, but it became clear very quickly not having one put me at a disadvantage at work.
The crazy part is, once you have it setup and adapted your workflow, you start to notice all sorts of other "small" things:
claude can call ssh and do system admin tasks. It works amazingly well. I have 3 VM's, which depends on each other (proxmox with openwrt, adguard, unbound), and claude can prove to me that my dns chains works perfectly, my firewalls are perfect etc as claude can ssh into each. Setting up services, diagnosing issues, auditing configs... you name it. Just awesome.
claude can call other sh scripts on the machine, so over time, you can create a bunch of scripts that lets claude one shot certain tasks that would normally eat tokens. It works great. One script per intention - don't have a script do more than one thing.
claude can call the compiler, run the debug executable and read the debug logs.. in real time. So claude can read my android apps debug stream via adb.. or my C# debug console because claude calls the compiler, not me. Just ask it to do it and it will diagnose stuff really quickly.
It can also analyze your db tables (give it readonly sql access), look at the application code and queries, and diagnose performance issues.
The opportunities are endless here. People need to wake up to this.
Claude set up a Raspberry Pi with a display and conference audio device for me to use as an Alexa replacement tied to Home Assistant.
I gave it an ssh key and gave it root.
Then I told it what I wanted, and it did. It asked for me to confirm certain things, like what I could see on screen, whether I could hear the TTS etc. (it was a bit of a surprise when it was suddenly talking to me while I was minding my own business).
It configured everything, while keeping a meticulous log that I can point it at if I want to set up another device, and eventually turn into a runbook if I need to.
I have a /fix-ci-build slash command that instructs Claude how to use `gh` to get the latest build from that specific project's Github Actions and get the logs for the build
In addition there are instructions on how and where to push the possible fixes and how to check the results.
I've yet to encounter a build failure it couldn't fix automatically.
It makes me so exhausted trying to read them... my brain can tell immediately when there's so much redundant information that it just starts shutting itself off.
I think we're entering a world where programmers as such won't really exist (except perhaps in certain niches). Being able to program (and read code, in particular) will probably remain useful, though diminished in value. What will matter more is your ability to actually create things, using whatever tools are necessary and available, and have them actually be useful. Which, in a way, is the same as it ever was. There's just less indirection involved now.
We've been living in that world since the invention of the compiler ("automatic programming"). Few people write machine code any more. If you think of LLMs as a new variety of compiler, a lot of their shortcomings are easier to describe.
> have it learn your conventions, pull in best practices
What do you mean by "have it learn your conventions"? Is there a way to somehow automatically extract your conventions and store it within CLAUDE.md?
> For example, we have a custom UI library, and Claude Code has a skill that explains exactly how to use it. Same for how we write Storybooks, how we structure APIs, and basically how we want everything done in our repo. So when it generates code, it already matches our patterns and standards out of the box.
Did you have to develop these skills yourself? How much work was that? Do you have public examples somewhere?
> What do you mean by "have it learn your conventions"?
I'll give you an example: I use ruff to format my python code, which has an opinionated way of formatting certain things. After an initial formatting, Opus 4.5, without prompting, will write code in this same style so that the ruff formatter almost never has anything to do on new commits. Sonnet 4.5 is actually pretty good at this too.
Starting to use Opus 4.5 I'm reduces instrutions in claude.md and just ask claude to look in the codebase to understand the patterns already in use. Going from prompts/docs to instead having code being the "truth". Show don't tell. I've found this patterns has made a huge leap with Opus 4.5.
When I ask Claude to do something, it independently, without me even asking or instructing it to, searches the codebase to understand what the convention is.
I’ve even found it searching node_modules to find the API of non-public libraries.
/init in Claude Code already automatically extracts a bunch, but for something more comprehensive, just tell it which additional types of things you want it to look for and document.
> Did you have to develop these skills yourself? How much work was that? Do you have public examples somewhere?
I don't know about the person above, but I tell Claude to write all my skills and agents for me. With some caveats, you can do this iteratively in a single session ("update the X agent, then re-run it. Repeat until it reliably does Y")
"Claude, clone this repo https://github.com/repo, review the coding conventions, check out any markdown or readme files. This is an example of coding conventions we want to use on this project"
lol does sound like and ad, but is true. Also forgot about hooks use hooks too! I just use voice to text then had claude reword it. Still my real world ideas
All of these things work very well IMO in a professional context.
Especially if you're in a place where a lot of time was spent previously revising PRs for best practices, etc, even for human-submitted code, then having the LLM do that for you that saves a bunch of time. Most humans are bad at following those super-well.
There's a lot of stuff where I'm pretty sure I'm up to at least 2x speed now. And for things like making CLI tools or bash scripts, 10x-20x. But in terms of "the overall output of my day job in total", probably more like 1.5x.
But I think we will need a couple major leaps in tooling - probably deterministic tooling, not LLM tooling - before anyone could responsibly ship code nobody has ever read in situations with millions of dollars on the line (which is different from vibe-coding something that ends up making millions - that's a low-risk-high-reward situation, where big bets on doing things fast make sense. if you're already making millions, dramatic changes like that can become high-risk-low-reward very quickly. In those companies, "I know that only touching these files is 99.99% likely to be completely safe for security-critical functionality" and similar "obvious" intuition makes up for the lack of ability to exhaustively test software in a practical way (even with fuzzers and things), and "i didn't even look at the code" is conceding responsibility to a dangerous degree there.)
> Once you’ve got Claude Code set up, you can point it at your codebase, have it learn your conventions, pull in best practices, and refine everything until it’s basically operating like a super-powered teammate. The real unlock is building a solid set of reusable “skills” plus a few agents for the stuff you do all the time.
I agree with this, but I haven't needed to use any advanced features to get good results. I think the simple approach gets you most of the benefits. Broadly, I just have markdown files in the repo written for a human dev audience that the agent can also use.
Basically:
- README.md with a quick start section for devs, descriptions of all build targets and tests, etc. Normal stuff.
- AGENTS.md (only file that's not written for people specifically) that just describes the overall directory structure and has a short step of instructions for the agent: (1) Always read the readme before you start. (2) Always read the relevant design docs before you start. (3) Always run the linter, a build, and tests whenever you make code changes.
- docs/*.md that contain design docs, architecture docs, and user stories, just text. It's important to have these resources anyway, agent or no.
As with human devs, the better the docs/requirements the better the results.
I'd really encourage you to try using agents for tasks that are repeatable and/or wordy but where most of the words are not relevant for ongoing understanding.
It's a tiny step further, and sub-agents provide a massive benefit the moment you're ready to trust the model even a little bit (relax permissions to not have it prompt you for every little thing; review before committing rather than on every file edit) because they limit what goes into the top level context, and can let the model work unassisted for far longer. I now regularly have it run for hours at a time without stopping.
Running and acting on output from the linter is absolutely an example of that which matters even for much shorter runs.
There's no reason to have all the lint output "polluting" the top level context, nor to have the steps the agent needs to take to fix linter issues that can't be auto-fixed by the linter itself. The top level agent should only need to care about whether the linter run passed or failed (and should know it needs to re-run and possibly investigate if it fails).
Just type /agents, select "Create new agent" and describe a task you often do, and then forget about it (or ask Claude to make changes to it for you)
Cheaper than hiring another developer, probably. My experience: for a few dollars I was able to extensively refactor a Python codebase in half a day. This otherwise would have taken multiple days of very tedious work.
I still struggle with these things being _too_ good at generating code. They have a tendency to add abstractions, classes, wrappers, factories, builders to things that didn't really need all that. I find they spit out 6 files worth of code for something that really only needed 2-3 and I'm spending time going back through simplifying.
There are times those extra layers are worth it but it seems LLMs have a bias to add them prematurely and overcomplicate things. You then end up with extra complexity you didn't need.
They are sleeping on it because there is absolutely no incentive to use it.
When needed it can be picked up in a day. Otherwise they are not paid based in tickets solved etc.
If the incentives were properly aligned everyone would already use it
Use Claude Code... to do what? There are multiple layers of people involved in the decision process and they only come up with a few ideas every now and then. Nothing I can't handle. AI helps but it doesn't have to be an agent.
I'm not saying there aren't use cases for agents, just that it's normal that most software engineers are sleeping on it.
Came across official anthropic repo on gh actions very relevant to what you mentioned. Your idea on scheduled doc updation using llm is brilliant, I’m stealing this idea.
https://github.com/anthropics/claude-code-action
Also new haiku. Not as smart but lighting fast, I've it review code changes impact or if i need a wide but shallow change done I've it scan the files and create a change plan. Saves a lot of time waiting for claude or codex to get their bearing.
Never tried coderabbit, just because this is already good enough with Claude Code. It helped us to catch dozens of important issues we wouldn't have caught.
We gave some instructions in the CLAUDE.md doc in the repository - with including a nice personalized roast of the engineer that did the review in the intro and conclusion to make it fun! :)
Basically, when you do a "create PR" from your Claude Code, it will help you getting your Linear ticket (or creating one if missing), ask you some important questions (like: what tests have you done?), create the PR on Github, request the reviewers, and post a "Auto Review" message with your credentials. It's not an actual review per se but this is enough for our small team.
Thanks for the example! There's a lot (of boilerplate?) here that I don't understand. Does anyone have good references for catching up to speed what's the purpose of all of these files in the demo?
If anyone is excited about, and has experience with this kind of stuff, please DM. I have a role open for setting up these kinds of tools and workflows.
I've tried most of the CLI coding tools with the Claude models and I keep coming back to Claude Code. It hits a sweet spot of simple and capable, and right now I'd say it's the best from an "it just works" perspective.
In my experience the CLI tool is part of the secret sauce. I haven't tried switching models per each CLI tool though. I use claude exclusively at work and for personal projects I use claude, codex, gemini.
I'm at the point where I say fuck it, let them sleep.
The tech industry just went through an insane hiring craze and is now thinning out. This will help to separate the chaff from the wheat.
I don't know why any company would want to hire "tech" people who are terrified of tech and completely obstinate when it comes to utilizing it. All the people I see downplaying it take a half-assed approach at using it then disparage it when it's not completely perfect.
I started tinkering with LLMs in 2022. First use case, speak in natural english to the llm, give it a json structure, have it decipher the natural language and fill in that json structure (vacation planning app, so you talk to it about where/how you want to vacation and it creates the structured data in the app). Sometimes I'd use it for minor coding fixes (copy and paste a block into chatgpt, fix errors or maybe just ideation). This was all personal project stuff.
At my job we got LLM access in mid/late 2023. Not crazy useful, but still was helpful. We got claude code in 2024. These days I only have an IDE open so I can make quick changes (like bumping up a config parameter, changing a config bool, etc.). I almost write ZERO code now. I usually have 3+ claude code sessions open.
On my personal projects I'm using Gemini + codex primarily (since I have a google account and chatgpt $20/month account). When I get throttled on those I go to claude and pay per token. I'll often rip through new features, projects, ideas with one agent, then I have another agent come through and clean things up, look for code smells, etc. I don't allow the agents to have full unfettered control, but I'd say 70%+ of the time I just blindly accept their changes. If there are problems I can catch them on the MR/PR.
I agree about the low hanging fruit and I'm constantly shocked at the sheer amount of FUD around LLMs. I want to generalize, like I feel like it's just the mid/jr level devs that speak poorly about it, but there's definitely senior/staff level people I see (rarely, mind you) that also don't like LLMs.
I do feel like the online sentiment is slowly starting to change though. One thing I've noticed a lot of is that when it's an anonymous post it's more likely to downplay LLMs. But if I go on linkedin and look at actual good engineers I see them praising LLMs. Someone speaking about how powerful the LLMs are - working on sophisticated projects at startups or FAANG. Someone with FUD when it comes to LLM - web dev out of Alabama.
I could go on and on but I'm just ranting/venting a little. I guess I can end this by saying that in my professional/personal life 9/10 of the top level best engineers I know are jumping on LLMs any chance they get. Only 1/10 talks about AI slop or bullshit like that.
Not entirely disagreeing with your point but I think they've mostly been forced to pivot recently for their own sakes; they will never say it though. As much as they may seem eager the most public people tend to also be better at outside communication and knowing what they should say in public to enjoy more opportunities, remain employed or for the top engineers to still seem relevant in the face of the communities they are a part of. Its less about money and more about respect there I think.
The "sudden switch" since Opus 4.5 when many were saying just a few months ago "I enjoy actual coding" but now are praising LLM's isn't a one off occurrence. I do think underneath it is somewhat motivated by fear; not for the job however but for relevance. i.e. its in being relevant to discussions, tech talks, new opportunities, etc.
OK, I am gonna be the guy and put my skin in the game here. I kind of get the hype, but the experience with e.g. Claude Code (or Github Copilot previously and others as weel) has so far been pretty unreliable.
I have Django project with 50 kLOC and it is pretty capable of understanding the architecture, style of coding, naming of variables, functions etc. Sometimes it excels on tasks like "replicate this non-trivial functionality for this other model and update the UI appropriately" and leaves me stunned. Sometimes it solves for me tedious and labourous "replace this markdown editor with something modern, allowing fullscreen edits of content" and does annoying mistake that only visual control shows and is not capable to fix it after 5 prompts. I feel as I am becoming tester more than a developer and I do not like the shift. Especially when I do not like to tell someone he did an obvious mistake and should fix it - it seems I do not care if it is human or AI, I just do not like incompetence I guess.
Yesterday I had to add some parameters to very simple Falcon project and found out it has not been updated for several months and won't build due to some pip issues with pymssql. OK, this is really marginal sub-project so I said - let's migrate it to uv and let's not get hands dirty and let the Claude do it. He did splendidly but in the Dockerfile he missed the "COPY server.py /data/" while I asked him to change the path... Build failed, I updated the path myself and moved on.
And then you listen to very smart guys like Karpathy who rave about Tab, Tab, Tab, while not understanding the language or anything about the code they write. Am I getting this wrong?
I am really far far away from letting agents touch my infrastructure via SSH, access managed databases with full access privileges etc. and dread the day one of my silly customers asks me to give their agent permission to managed services. One might say the liability should then be shifted, but at the end of the day, humans will have to deal with the damage done.
My customer who uses all the codebase I am mentioning here asked me, if there is a way to provide "some AI" with item GTINs and let it generate photos, descriptions, etc. including metadata they handcrafted and extracted for years from various sources. While it looks like nice idea and for them the possibility of decreasing the staff count, I caught the feeling they do not care about the data quality anymore or do not understand the problems the are brining upon them due to errors nobody will catch until it is too late.
TL;DR: I am using Opus 4.5, it helps a lot, I have to keep being (very) cautious. Wake up call 2026? Rather like waking up from hallucination.
Everybody says how good Claude is and I go to my code base and I can't get it to correctly update one xaml file for me. It is quicker to make changes myself than to explain exactly what I need or learn how to do "prompt engineering".
Disclaimer: I don't have access to Claude Code. My employer has only granted me Claude Teams. Supposedly, they don't use my poopy code to train their models if I use my work email Claude so I am supposed to use that. If I'm not pasting code (asking general questions) into Claude, I believe I'm allowed to use whatever.
What's even the point of this comment if you self-admittedly don't have access to the flagship tool that everyone has been using to make these big bold coding claims?
Opus 4.5 ate through my Copilot quota last month, and it's already halfway through it for this month. I've used it a lot, for really complex code.
And my conclusion is: it's still not as smart as a good human programmer. It frequently got stuck, went down wrong paths, ignored what I told it to do to do something wrong, or even repeat a previous mistake I had to correct.
Yet in other ways, it's unbelievably good. I can give it a directory full of code to analyze, and it can tell me it's an implementation of Kozo Sugiyama's dagre graph layout algorithm, and immediately identify the file with the error. That's unbelievably impressive. Unfortunately it can't fix the error. The error was one of the many errors it made during previous sessions.
So my verdict is that it's great for code analysis, and it's fantastic for injecting some book knowledge on complex topics into your programming, but it can't tackle those complex problems by itself.
Yesterday and today I was upgrading a bunch of unit tests because of a dependency upgrade, and while it was occasionally very helpful, it also regularly got stuck. I got a lot more done than usual in the same time, but I do wonder if it wasn't too much. Wasn't there an easier way to do this? I didn't look for it, because every step of the way, Opus's solution seemed obvious and easy, and I had no idea how deep a pit it was getting me into. I should have been more critical of the direction it was pointing to.
Copilot and many coding agents truncates the context window and uses dynamic summarization to keep costs low for them. That's how they are able to provide flat fee plans.
If you want the full capability, use the API and use something like opencode. You will find that a single PR can easily rack up 3 digits of consumption costs.
Gerring off of their plans and prompts is so worth it, I know from experience, I'm paying less and getting more so far, paying by token, heavy gemini-3-flash user, it's a really good model, this is the future (distillations into fast, good enough for 90% of tasks), not mega models like Claude. Those will still be created for distillations and the harder problems
Maybe not, then. I'm afraid I have no idea what those numbers mean, but it looks like Gemini and ChatGPT 4 can handle a much larger context than Opus, and Opus 4.5 is cheaper than older versions. Is that correct? Because I could be misinterpreting that table.
People are completely missing the points about agentic development. The model is obviously a huge factor in the quality of the output, but the real magic lies in how the tools are managing and injecting context in to them, as well as the tooling. I switched from Copilot to Cursor at the end of 2025, and it was absolute night and day in terms of how the agents behaved.
Interesting you have this opinion yet you're using Cursor instead of Claude Code. By the same logic, you should get even better results directly using Anthropic's wrapper for their own model.
In my experience GPT-5 is also much more effective in the Cursor context than the Codex context. Cursor deserves props for doing something right under the hood.
yes just using AI for code analysis is way under appreciated I think. Even the most sceptical people on using it for coding should try it out as a tool for Q&A style code interrogation as well as generating documentation. I would say it zero-shots documentation generation better than most human efforts would to the point it begs the question of whether it's worth having the documentation in the first place. Obviously it can make mistakes but I would say they are below the threshold of human mistakes from what I've seen.
(I haven't used AI much, so feel free to ignore me.)
This is one thing I've tried using it for, and I've found this to be very, very tricky. At first glance, it seems unbelievably good. The comments read well, they seem correct, and they even include some very non-obvious information.
But almost every time I sit down and really think about a comment that includes any of that more complex analysis, I end up discarding it. Often, it's right but it's missing the point, in a way that will lead a reader astray. It's subtle and I really ought to dig up an example, but I'm unable to find the session I'm thinking about.
This was with ChatGPT 5, fwiw. It's totally possible that other models do better. (Or even newer ChatGPT; this was very early on in 5.)
Code review is similar. It comes up with clever chains of reasoning for why something is problematic, and initially convinces me. But when I dig into it, the review comment ends up not applying.
It could also be the specific codebase I'm using this on? (It's the SpiderMonkey source.)
>So my verdict is that it's great for code analysis, and it's fantastic for injecting some book knowledge on complex topics into your programming, but it can't tackle those complex problems by itself.
I don't think you've seen the full potential. I'm currently #1 on 5 different very complex computer engineering problems, and I can't even write a "hello world" in rust or cpp. You no longer need to know how to write code, you just need to understand the task at a high level and nudge the agents in the right direction. The game has changed.
All the naysayer here have clearly no idea. Your large matrix multiplication implementation is quite impressive! I have set up a benchmark loop and let GPT-5.1-Codex-Max experiment for a bit (not 5.2/Opus/Gemini, because they are broken in Copilot), but it seems to be missing something crucial. With a bit of encouragement, it has implemented:
- padding from 2000 to 2048 for easier power-of-two splitting
- two-level Winograd matrix multiplication with tiled matmul for last level
- unrolled AVX2 kernel for 64x64 submatrices
- 64 byte aligned memory
- restrict keyword for pointers
- better compiler flags (clang -Ofast -march=native -funroll-loops -std=c++17)
But yours is still easily 25 % faster. Would you be willing to write a bit about how you set up your evaluation and which tricks Claude used to solve it?
How are you qualified to judge its performance on real code if you don't know how to write a hello world?
Yes, LLMs are very good at writing code, they are so good at writing code that they often generate reams of unmaintainable spaghetti.
When you submit to an informatics contest you don't have paying customers who depend on your code working every day. You can just throw away yesterday's code and start afresh.
Claude is very useful but it's not yet anywhere near as good as a human software developer. Like an excitable puppy it needs to be kept on a short leash.
If that is true; then all the commentary around software people having jobs still due to "taste" and other nice words is just that. Commentary. In the end the higher level stuff still needs someone to learn it (e.g. learning ASX2 architecture, knowing what tech to work with); but it requires IMO significantly less practice then coding which in itself was a gate. The skill morphs more into a tech expert rather than a coding expert.
I'm not sure what this means for the future of SWE's though yet. I don't see higher levels of staff in big large businesses bothering to do this, and at some scale I don't see founders still wanting to manage all of these agents, and processes (got better things to do at higher levels). But I do see the barrier of learning to code gone; meaning it probably becomes just like any other job.
None of the problems you've shown there are anything close to "very complex computer engineering problems", they're more like "toy problems with widely-known solutions given to students to help them practice for when they encounter actually complex problems".
I have no idea. Careless use, I guess. I was fixing a bunch of mocks in some once-great but now poorly maintained code, and I wasn't really feeling it so I just fed everything to Claude. Opus, unfortunately. I could easily have downgraded a bit.
If it can consistently verify that the error persists after fix--you can run (ok maybe you can't budget wise but theoretically) 10000 parallel instances of fixer agents then verify afterwards (this is in line with how the imo/ioi models work according to rumors)
What bothers me about posts like this is: mid-level engineers are not tasked with atomic, greenfield projects. If all an engineer did all day was build apps from scratch, with no expectation that others may come along and extend, build on top of, or depend on, then sure, Opus 4.5 could replace them. The hard thing about engineering is not "building a thing that works", its building it the right way, in an easily understood way, in a way that's easily extensible.
No doubt I could give Opus 4.5 "build be a XYZ app" and it will do well. But day to day, when I ask it "build me this feature" it uses strange abstractions, and often requires several attempts on my part to do it in the way I consider "right". Any non-technical person might read that and go "if it works it works" but any reasonable engineer will know that thats not enough.
Not necessarily responding to you directly, but I find this take to be interesting, and I see it every time an article like this makes the rounds.
Starting back in 2022/2023:
- (~2022) It can auto-complete one line, but it can't write a full function.
- (~2023) Ok, it can write a full function, but it can't write a full feature.
- (~2024) Ok, it can write a full feature, but it can't write a simple application.
- (~2025) Ok, it can write a simple application, but it can't create a full application that is actually a valuable product.
- (~2025+) Ok, it can write a full application that is actually a valuable product, but it can't create a long-lived complex codebase for a product that is extensible and scalable over the long term.
It's pretty clear to me where this is going. The only question is how long it takes to get there.
> It's pretty clear to me where this is going. The only question is how long it takes to get there.
I don't think its a guarantee. all of the things it can do from that list are greenfield, they just have increasing complexity. The problem comes because even in agentic mode, these models do not (and I would argue, can not) understand code or how it works, they just see patterns and generate a plausible sounding explanation or solution. agentic mode means they can try/fail/try/fail/try/fail until something works, but without understanding the code, especially of a large, complex, long-lived codebase, they can unwittingly break something without realising - just like an intern or newbie on the project, which is the most common analogy for LLMs, with good reason.
Yeah maybe, but personally it feels more like a plateau to me than an exponential takeoff, at the moment.
And this isn't a pessimistic take! I love this period of time where the models themselves are unbelievably useful, and people are also focusing on the user experience of using those amazing models to do useful things. It's an exciting time!
But I'm still pretty skeptical of "these things are about to not require human operators in the loop at all!".
I haven't seen an AI successfully write a full feature to an existing codebase without substantial help, I don't think we are there yet.
> The only question is how long it takes to get there.
This is the question and I would temper expectations with the fact that we are likely to hit diminishing returns from real gains in intelligence as task difficulty increases. Real world tasks probably fit into a complexity hierarchy similar to computational complexity. One of the reasons that the AI predictions made in the 1950s for the 1960s did not come to be was because we assumed problem difficulty scaled linearly. Double the computing speed, get twice as good at chess or get twice as good at planning an economy. P, NP separation planed these predictions. It is likely that current predictions will run into similar separations.
It is probably the case that if you made a human 10x as smart they would only be 1.25x more productive at software engineering. The reason we have 10x engineers is less about raw intelligence, they are not 10x more intelligent, rather they have more knowledge and wisdom.
Sure, eventually we'll have AGI, then no worries, but in the meantime you can only use the tools that exist today, and dreaming about what should be available in the future doesn't help.
I suspect that the timeline from autocomplete-one-line to autocomplete-one-app, which was basically a matter of scaling and RL, may in retrospect turn out to have been a lot faster that the next LLM to AGI step where it becomes capable of using human level judgement and reasoning, etc, to become a developer, not just a coding tool.
This is disingenuous because LLMs were already writing full, simple applications in 2023.[0]
They're definitely better now, but it's not like ChatGPT 3.5 couldn't write a full simple todo list app in 2023. There were a billion blog posts talking about that and how it meant the death of the software industry.
Plus I'd actually argue more of the improvements have come from tooling around the models rather than what's in the models themselves.
Ok, it can create a long-lived complex codebase for a product that is extensible and scalable over the long term, but it doesn't have cool tattoos and can't fancy a matcha
There are two types of right/wrong ways to build: the context specific right/wrong way to build something and an overly generalized engineer specific right/wrong way to build things.
I've worked on teams where multiple engineers argued about the "right" way to build something. I remember thinking that they had biases based on past experiences and assumptions about what mattered. It usually took an outsider to proactively remind them what actually mattered to the business case.
I remember cases where a team of engineers built something the "right" way but it turned out to be the wrong thing. (Well engineered thing no one ever used)
Sometimes hacking something together messily to confirm it's the right thing to be building is the right way. Then making sure it's secure, then finally paying down some technical debt to make it more maintainable and extensible.
Where I see real silly problems is when engineers over-engineer from the start before it's clear they are building the right thing, or when management never lets them clean up the code base to make it maintainable or extensible when it's clear it is the right thing.
There's always a balance/tension, but it's when things go too far one way or another that I see avoidable failures.
*I've worked on teams where multiple engineers argued about the "right" way to build something. I remember thinking that they had biases based on past experiences and assumptions about what mattered. It usually took an outsider to proactively remind them what actually mattered to the business case.*
Gosh I am so tired with that one - someone had a case that burned them in some previous project and now his life mission is to prevent that from happening ever again, and there would be no argument they will take.
Then you get like up to 10 engineers on typical team and team rotation and you end up with all kinds of "we have to do it right because we had to pull all nighter once, 5 years ago" baked in the system.
Not fun part is a lot of business/management people "expect" having perfect solution right away - there are some reasonable ones that understand you need some iteration.
> ...multiple engineers argued about the "right" way to build something. I remember thinking that they had biases based on past experiences and assumptions about what mattered.
I usually resolve this by putting on the table the consequences and their impacts upon my team that I’m concerned about, and my proposed mitigation for those impacts. The mitigation always involves the other proposer’s team picking up the impact remediation. In writing. In the SOP’s. Calling out the design decision by day of the decision to jog memories and names of those present that wanted the design as the SME’s. Registered with the operations center. With automated monitoring and notification code we’re happy to offer.
Once people are asked to put accountable skin in the sustaining operations, we find out real fast who is taking into consideration the full spectrum end to end consequences of their decisions. And we find out the real tradeoffs people are making, and the externalities they’re hoping to unload or maybe don’t even perceive.
> I've worked on teams where multiple engineers argued about the "right" way to build something. I remember thinking that they had biases based on past experiences and assumptions about what mattered. It usually took an outsider to proactively remind them what actually mattered to the business case.
My first thought was that you probably also have different biases, priorities and/or taste. As always, this is probably very context-specific and requires judgement to know when something goes too far. It's difficult to know the "most correct" approach beforehand.
> Sometimes hacking something together messily to confirm it's the right thing to be building is the right way. Then making sure it's secure, then finally paying down some technical debt to make it more maintainable and extensible.
I agree that sometimes it is, but in other cases my experience has been that when something is done, works and is used by customers, it's very hard to argue about refactoring it. Management doesn't want to waste hours on it (who pays for it?) and doesn't want to risk breaking stuff (or changing APIs) when it works. It's all reasonable.
And when some time passes, the related intricacies, bigger picture and initially floated ideas fade from memory. Now other stuff may depend on the existing implementation. People get used to the way things are done. It gets harder and harder to refactor things.
Again, this probably depends a lot on a project and what kind of software we're talking about.
> There's always a balance/tension, but it's when things go too far one way or another that I see avoidable failures.
I think balance/tension describes it well and good results probably require input from different people and from different angles.
I know what you are talking about, but there is more to life than just product-market fit.
Hardly any of us are working on Postgres, Photoshop, blender, etc. but it's not just cope to wish we were.
It's good to think about the needs to business and the needs of society separately. Yes, the thing needs users, or no one is benefiting. But it also needs to do good for those users, and ultimately, at the highest caliber, craftsmanship starts to matter again.
There are legitimate reasons for the startup ecosystem to focus firstly and primarily on getting the users/customers. I'm not arguing against that. What I am arguing is why does the industry need to be dominated by startups in terms of the bulk of the products (not bulk of the users). It begs the question of how much societally-meaningful programming waiting to be done.
I'm hoping for a world where more end users code (vibe or otherwise) and the solve their own problems with their own software. I think that will make more a smaller, more elite software industry that is more focused on infrastructure than last-mile value capture. The question is how to fund the infrastructure. I don't know except for the most elite projects, which is not good enough for the industry (even this hypothetical smaller one) on the whole.
Another thing that gets me with projects like this, there are already many examples of image converters, minesweeper clones etc that you can just fork on GitHub, the value of the LLM here is largely just stripping the copyright off
It’s kind of funny - there’s another thread up where a dev claimed a 20-50x speed up. To their credit they posted videos and links to the repo of their work.
And when you check the work, a large portion of it was hand rolling an ORM (via an LLM). Relatively solved problem that an LLM would excel at, but also not meaningfully moving the needle when you could use an existing library. And likely just creating more debt down the road.
Have you ever tried to find software for a specific need? I usually spend hours investigating anything I can find only to discover that all options are bad in one way or another and cover my use case partially at best. It's dreadful, unrewarding work that I always fear. Being able to spent those hours to develop custom solution that has exactly what I need, no more, no less, that I can evolve further as my requirements evolve, all that while enjoying myself, is a godsend.
Anecdata but I’ve found Claude code with Opus 4.5 able to do many of my real tickets in real mid and large codebases at a large public startup. I’m at senior level (15+ years). It can browse and figure out the existing patterns better than some engineers on my team. It used a few rare features in the codebase that even I had forgotten about and was about to duplicate. To me it feels like a real step change from the previous models I’ve used which I found at best useless. It’s following style guides and existing patterns well, not just greenfield. Kind of impressive, kind of scary
Same anecdote for me (except I'm +/- 40 years experience). I consider my self a pretty good dev for non-web dev (GPU's, assembly, optimisation,...) and my conclusion is the same as you: impressive and scary. If the somehow the idea of what you want to do is on the web in text or in code, then Claude most likely has it. And its ability to understand my own codebases is just crazy (at my age, memory is declining and having Claude to help is just waow). Of course it fails some times, of course it need direction, but the thing it produces is really good.
I'm seeing this as well. Not huge codebases but not tiny - 4 year old startup. I'm new there and it would have been impossible for me to deliver any value this soon.
12 years experience; this thing is definitely amazing. Combined with a human it can be phenomenal. It also helped me tons with lots of external tools, understand what data/marketing teams are doing and even providing pretty crucial insights to our leadership that Gemini have noticed.
I wouldn't try to completely automate the humans out of the loop though just yet, but this tech for sure is gonna downsize team numbers (and at the same time - allow many new startups to come to life with little capital that eventually might grow and hire people. So unclear how this is gonna affect jobs.)
I've also found it to keep such a constrained context window (on large codebases), that it writes a secondary block of code that already had a solution in a different area of the same file.
Nothing I do seems to fix that in its initial code writing steps. Only after it finishes, when I've asked it to go back and rewrite the changes, this time making only 2 or 3 lines of code, does it magically (or finally) find the other implementation and reuse it.
It's freakin incredible at tracing through code and figuring it out. I <3 Opus. However, it's still quite far from any kind of set-and-forget-it.
Same exist in humans also, I worked with a developer who had 15 year experience and was tech lead in a big Indian firm, We started something together, 3 months back when I checked the Tables I was shocked to see how he fucked up and messed the DB. Finally the only option left with me was to quit because i know it will break in production and if i onboarded a single customer my life would be screwed. He mixed many things with frontend and offloaded even permissions to frontend, and literally copied tables in multiple DB (We had 3 services). I still cannot believe how he worked as a tch lead for 15 years. each DB had more than 100 tables and out of that 20-25 were duplicates. He never shared code with me, but I smelled something fishy when bug fixing was never ending loop and my front end guy told me he cannot do it anymore. Only mistake I did was I trusted him and worst part is he is my cousin and the relation became sour after i confronted him and decided to quit.
This sounds like a culture issue in the development process, I have seen this prevented many times. Sure I did have to roll back a feature I did not sign off just before new years. So as you say it happens.
- How quickly is cost of refactor to a new pattern with functional parity going down?
- How does that change the calculus around tech debt?
If engineering uses 3 different abstractions in inconsistent ways that leak implementation details across components and duplicate functionality in ways that are very hard to reason about, that is, in conventional terms, an existential problem that might kill the entire business, as all dev time will end up consumed by bug fixes and dealing with pointless complexity, velocity will fall to nothing, and the company will stop being able to iterate.
But if claude can reliably reorganize code, fix patterns, and write working migrations for state when prompted to do so, it seems like the entire way to reason about tech debt has changed. And it has changed more if you are willing to bet that models within a year will be much better at such tasks.
And in my experience, claude is imperfect at refactors and still requires review and a lot of steering, but it's one of the things it's better at, because it has clear requirements and testing workflows already built to work with around the existing behavior. Refactoring is definitely a hell of a lot faster than it used to be, at least on the few I've dealt with recently.
In my mind it might be kind of like thinking about financial debt in a world with high inflation, in that the debt seems like it might get cheaper over time rather than more expensive.
> But if claude can reliably reorganize code, fix patterns, and write working migrations for state when prompted to do so, it seems like the entire way to reason about tech debt has changed.
Yup, I recently spent 4 days using Claude to clean up a tool that's been in production for over 7 years. (There's only about 3 months of engineering time spent on it in those years.)
We've known what the tool needed for many years, but ugh, the actual work was fairly messy and it was never a priority. I reviewed all of Opus's cleanup work carefully and I'm quite content with the result. Maybe even "enthusiastic" would be accurate.
So even if Claude can't clean up all the tech debt in a totally unsupervised fashion, it can still help address some kinds of tech debt extremely rapidly.
Good point. Most of the cost in dealing with tech debt is reading the code and noting the issues. I found that Claude can produce much better code when it has a functionally correct reference implementation. Also it's not needed to very specifically point out issues. I once mentioned "I see duplicate keys in X and Y, rework it to reduce repetition and verbosity". It came up with a much more elegant way to implement it.
So maybe doing 2-3 stages makes sense. First stage needs to be functionallty correct, but you accept code smells such as leaky abstractions, verbosity and repetition. In stage 2 and 3 you eliminate all this. You could integrate this all into the initial specification; you won't even see the smelly intermediate code; it only exists as a stepping stone for the model to iteratively refine the code!
> The hard thing about engineering is not "building a thing that works", its building it the right way, in an easily understood way, in a way that's easily extensible.
You’re talking like in the year 2026 we’re still writing code for future humans to understand and improve.
I fear we are not doing that. Right now, Opus 4.5 is writing code that later Opus 5.0 will refactor and extend. And so on.
For one, there are objectively detrimental ways to organize code: tight coupling, lots of mutable shared state, etc. No matter who or what reads or writes the code, such code is more error-prone, and more brittle to handle.
Then, abstractions are tools to lower the cognitive load. Good abstractions reduce the total amount of code written, allow to reason about the code in terms of these abstractions, and do not leak in the area of their applicability. Say Sequence, or Future, or, well, function are examples of good abstractions. No matter what kind of cognitive process handles the code, it benefits from having to keep a smaller amount of context per task.
"Code structure does not matter, LLMs will handle it" sounds a bit like "Computer architectures don't matter, the Turing Machine is proved to be able to handle anything computable at all". No, these things matter if you care about resource consumption (aka cost) at the very least.
Opus 4.5 is writing code that Opus 5.0 will refactor and extend. And Opus 5.5 will take that code and rewrite it in C from the ground up. And Opus 6.0 will take that code and make it assembly. And Opus 7.0 will design its own CPU. And Opus 8.0 will make a factory for its own CPUs. And Opus 9.0 will populate mars. And Opus 10.0 will be able to achieve AGI. And Opus 11.0 will find God. And Opus 12.0 will make us a time machine. And so on.
Up until now, no business has been built on tools and technology that no one understands. I expect that will continue.
Given that, I expect that, even if AI is writing all of the code, we will still need people around who understand it.
If AI can create and operate your entire business, your moat is nil. So, you not hiring software engineers does not matter, because you do not have a business.
In my experience, using LLMs to code encouraged me to write better documentation, because I can get better results when I feed the documentation to the LLM.
Also, I've noticed failure modes in LLM coding agents when there is less clarity and more complexity in abstractions or APIs. It's actually made me consider simplifying APIs so that the LLMs can handle them better.
Though I agree that in specific cases what's helpful for the model and what's helpful for humans won't always overlap. Once I actually added some comments to a markdown file as note to the LLM that most human readers wouldn't see, with some more verbose examples.
I think one of the big problems in general with agents today is that if you run the agent long enough they tend to "go off the rails", so then you need to babysit them and intervene when they go off track.
I guess in modern parlance, maintaining a good codebase can be framed as part of a broader "context engineering" problem.
We don't know what Opus 5.0 will be able to refactor.
If argument is "humans and Opus 4.5 cannot maintain this, but if requirements change we can vibe-code a new one from scratch", that's a coherent thesis, but people need to be explicit about this.
(Instead this feels like the mott that is retreated to, and the bailey is essentially "who cares, we'll figure out what to do with our fresh slop later".)
Ironically, I've been Claude to be really good at refactors, but these are refactors I choose very explicitly. (Such as I start the thing manually, then let it finish.) (For an example of it, see me force-pushing to https://github.com/NixOS/nix/pull/14863 implementing my own code review.)
But I suspect this is not what people want. To actually fire devs and not rely on from-scratch vibe-coding, we need to figure out which refactors to attempt in order to implement a given feature well.
That's a very creative open-ended question that I haven't even tried to let the LLMs take a crack at it, because why I would I? I'm plenty fast being the "ideas guy". If the LLM had better ideas than me, how would I even know? I'm either very arrogant or very good because I cannot recall regretting one of my refactors, at least not one I didn't back out of immediately.
Refactoring does always cost something and I doubt LLMs will ever change that. The more interesting question is whether the cost to refactor or "rewrite" the software will ever become negligible. Until it isn't, it's short-sighted to write code in the manner you're describing. If software does become that cheap, then you can't meaningfully maintain a business on selling software anyway.
This is the question! Your narrative is definitely plausible, and I won't be shocked if it turns out this way. But it still isn't my expectation. It wasn't when people were saying this in 2023 or in 2024, and I haven't been wrong yet. It does seem more likely to me now than it did a couple years ago, but still not the likeliest outcome in the next few years.
Yeah I think it's a mistake to focus on writing "readable" or even "maintainable" code. We need to let go of these aging paradigms and be open to adopting a new one.
A greenfield project is definitely 'easy mode' for an LLM; especially if the problem area is well understood (and documented).
Opus is great and definitely speeds up development even in larger code bases and is reasonably good at matching coding style/standard to that of of the existing code base.
In my opinion, the big issue is the relatively small context that quickly overwhelms the models when given a larger task on a large codebase.
For example, I have a largish enterprise grade code base with nice enterprise grade OO patterns and class hierarchies. There was a simple tech debt item that required refactoring about 30-40 classes to adhere to a slightly different class hierarchy. The work is not difficult, just tedious, especially as unit tests need to be fixed up.
I threw Opus at it with very precise instructions as to what I wanted it to do and how I wanted it to do it. It started off well but then disintegrated once it got overwhelmed at the sheer number of files it had to change. At some point it got stuck in some kind of an error loop where one change it made contradicted with another change and it just couldn't work itself out. I tried stopping it and helping it out but at this point the context was so polluted that it just couldn't see a way out.
I'd say that once an LLM can handle more 'context' than a senior dev with good knowledge of a large codebase, LLM will be viable in a whole new realm of development tasks on existing code bases. That 'too hard to refactor this/make this work with that' task will suddenly become viable.
You have to think of Opus as a developer whose job at your company lasts somewhere between 30 to 60 minutes before you fire them and hire a new one.
Yes, it's absurd but it's a better metaphor than someone with a chronic long term memory deficit since it fits into the project management framework neatly.
So this new developer who is starting today is ready to be assigned their first task, they're very eager to get started and once they start they will work very quickly but you have to onboard them. This sounds terrible but they also happen to be extremely fast at reading code and documentation, they know all of the common programming languages and frameworks and they have an excellent memory for the hour that they're employed.
What do you do to onboard a new developer like this? You give them a well written description of your project with a clear style guide and some important dos and don'ts, access to any documentation you may have and a clear description of the task they are to accomplish in less than one hour. The tighter you can make those documents, the better. Don't mince words, just get straight to the point and provide examples where possible.
The task description should be well scoped with a clear definition of done, if you can provide automated tests that verify when it's complete that's even better. If you don't have tests you can also specify what should be tested and instruct them to write the new tests and run them.
For every new developer after the first you need a record of what was already accomplished. Personally, I prefer to use one markdown document per working session whose filename is a date stamp with the session number appended. Instruct them to read the last X log files where X is however many are relevant to the current task. Most of the time X=1 if you did a good job of breaking down the tasks into discrete chunks. You should also have some type of roadmap with milestones, if this file will be larger than 1000 lines then you should break it up so each milestone is its own document and have a table of contents document that gives a simple overview of the total scope. Instruct them to read the relevant milestone.
Other good practices are to tell them to write a new log file after they have completed their task and record a summary of what they did and anything they discovered along the way plus any significant decisions they made. Also tell them to commit their work afterwards and Opus will write a very descriptive commit message by default (but you can instruct them to use whatever format you prefer). You basically want them to get everything ready for hand-off to the next 60 minute developer.
If they do anything that you don't want them to do again make sure to record that in CLAUDE.md. Same for any other interventions or guidance that you have to provide, put it in that document and Opus will almost always stick to it unless they end up overfilling their context window.
I also highly recommend turning off auto-compaction. When the context gets compacted they basically just write a summary of the current context which often removes a lot of the important details. When this happens mid-task you will certainly lose parts of the context that are necessary for completing the task. Anthropic seems to be working hard at making this better but I don't think it's there yet. You might want to experiment with having it on and off and compare the results for yourself.
If your sessions are ending up with >80% of the context window used while still doing active development then you should re-scope your tasks to make them smaller. The last 20% is fine for doing menial things like writing the summary, running commands, committing, etc.
People have built automated systems around this like Beads but I prefer the hands-on approach since I read through the produced docs to make sure things are going ok and use them as a guide for any changes I need to make mid-project.
With this approach I'm 99% sure that Opus 4.5 could handle your refactor without any trouble as long as your classes aren't so enormous that even working on a single one at a time would cause problems with the context window, and if they are then you might be able to handle it by cautioning Opus to not read the whole file and to just try making targeted edits to specific methods. They're usually quite good at finding and extracting just the sections that they need as long as they have some way to know what to look for ahead of time.
I just did something similar and it went swimmingly by doing this: Keep the plan and status in an md file. Tell it to finish one file at a time and run tests and fix issues and then to ask whether to proceed with the next file. You can then easily start a new chat with the same instructions and plan and status if the context gets poisoned.
"Have an agent investiate issue X in modules Y and Z. The agent should place a report at ./doc/rework-xyz-overview.md with all locations that need refactoring. Once you have the report, have agents refactor 5 classes each in parallel. Each agent writes a terse report in ./doc/rework-xyz/ When they are all done, have another agent check all the work. When that agent reports everything is okay, perform a final check yourself"
> If all an engineer did all day was build apps from scratch, with no expectation that others may come along and extend, build on top of, or depend on, then sure, Opus 4.5 could replace them.
Why do they need to be replaced? Programmers are in the perfect place to use AI coding tools productively. It makes them more valuable.
Their thesis is that code quality does not matter as it is now a cheap commodity. As long as it passes the tests today it's great. If we need to refactor the whole goddamn app tomorrow, no problem, we will just pay up the credits and do it in a few hours.
The fundamental assumption is completely wrong. Code is not a cheap commodity. It is in fact so disastrously expensive that the entire US economy is about to implode while we're unbolting jet engines from old planes to fire up in the parking lots of datacenters for electricity.
It matters for all the things you’d be able to justify paying a programmer for. What’s about to change is that there will be tons of these little one-off projects that previously nobody could justify paying $150/hr for. A mass democratization of software development. We’ve yet to see what that really looks like.
> Their thesis is that code quality does not matter as it is now a cheap commodity.
That's not how I read it. I would say that it's more like "If a human no longer needs to read the code, is it important for it to be readable?"
That is, of course, based on the premise that AI is now capable of both generating and maintaining software projects of this size.
Oh, and it begs another question: are human-readable and AI-readable the same thing? If they're not, it very well could make sense to instruct the model to generate code that prioritizes what matters to LLMs over what matters to humans.
I had Opus write a whole app for me in 30 seconds the other night. I use a very extensive AGENTS.md to guide AI in how I like my code chiseled. I've been happily running the app without looking at a line of it, but I was discussing the app with someone today, so I popped the code open to see what it looked like. Perfect. 10/10 in every way. I would not have written it that good. It came up with at least one idea I would not have thought of.
I'm very lucky that I rarely have to deal with other devs and I'm writing a lot of code from scratch using whatever is the latest version of the frameworks. I understand that gives me a lot of privileges others don't have.
>What bothers me about posts like this is: mid-level engineers are not tasked with atomic, greenfield projects
They get those ocassionally all the time though too. Depends on the company. In some software houses it's constant "greenfield projects", one after another. And even in companies with 1-2 pieces of main established software to maintain, there are all kinds of smaller utilities or pipelines needed.
>But day to day, when I ask it "build me this feature" it uses strange abstractions, and often requires several attempts on my part to do it in the way I consider "right".
In some cases that's legit. In other cases it's just "it did it well, but not how I'd done it", which is often needless stickness to some particular style (often a contention between 2 human programmers too).
Basically, what FloorEgg says in this thread: "There are two types of right/wrong ways to build: the context specific right/wrong way to build something and an overly generalized engineer specific right/wrong way to build things."
And you can always not just tell it "build me this feature", but tell it (high level way) how to do it, and give it a generic context about such preferences too.
> its building it the right way, in an easily understood way, in a way that's easily extensible.
When I worked at Google, people rarely got promoted for doing that. They got promoted for delivering features or sometimes from rescuing a failing project because everyone was doing the former until promotion velocity dropped and your good people left to other projects not yet bogged down too far.
Yeah. Just like another engineer. When you tell another engineer to build you a feature, it's improbable they'll do it they way that you consider "right."
This sounds a lot like the old arguments around using compilers vs hand-writing asm. But now you can tell the LLM how you want to implement the changes you want. This will become more and more relevant as we try to maintain the code it generates.
But, for right now, another thing Claude's great at is answering questions about the codebase. It'll do the analysis and bring up reports for you. You can use that information to guide the instructions for changes, or just to help you be more productive.
Even if you are going green field, you need to build it the way it is likely to be used based a having a deep familiarity with what that customer's problems are and how their current workflow is done. As much as we imagine everything is on the internet, a bunch of this stuff is not documented anywhere. An LLM could ask the customer requirement questions but that familiarity is often needed to know the right questions to ask. It is hard to bootstrap.
Even if it could build the perfect greenfield app, as it updates the app it is needs to consider backwards compatibility and breaking changes. LLMs seem very far as growing apps. I think this is because LLMs are trained on the final outcome of the engineering process, but not on the incremental sub-commit work of first getting a faked out outline of the code running and then slowly building up that code until you have something that works.
This isn't to say that LLMs or other AI approaches couldn't replace software engineering some day, but they clear aren't good enough yet and the training sets they have currently have access to are unlikely to provide the needed examples.
My favorite benchmark for LLMs and agents is to have it port a medium-complexity library to another programming language. If it can do that well, it's pretty capable of doing real tasks. So far, I always have to spend a lot of time fixing errors. There are also often deep issues that aren't obvious until you start using it.
Comments on here often criticise ports as easy for LLMs to do because there's a lot of training and tests are all there, which is not as complex as real word tasks
I find Opus 4.5 very, very strong at matching the prevailing conventions/idioms/abstractions in a large, established codebase. But I guess I'm quite sensitive to this kind of thing so I explicitly ask Opus 4.5 to read adjacent code which is perhaps why it does it so well. All it takes is a sentence or two, though.
I don’t know what I’m doing wrong. Today I tried to get it to upgrade Nx, yarn and some resolutions in a typescript monorepo with about 20 apps at work (Opus 4.5 through Kiro) and it just…couldn’t do it. It hit some snags with some of the configuration changes required by the upgrade and resorted to trying to make unwanted changes to get it to build correctly. I would have thought that’s something it could hit out of the park. I finally gave up and just looked at the docs and some stack overflow and fixed it myself. I had to correct it a few times about correct config params too. It kept imagining config options that weren’t valid.
> ask Opus 4.5 to read adjacent code which is perhaps why it does it so well. All it takes is a sentence or two, though.
People keep telling me that an LLM is not intelligence, it's simply spitting out statistically relevant tokens. But surely it takes intelligence to understand (and actually execute!) the request to "read adjacent code".
>day to day, when I ask it "build me this feature" it uses strange abstractions, and often requires several attempts on my part to do it in the way I consider "right"
Then don't ask it to "build me this feature" instead lay out a software development process with designated human in the loop where you want it and guard rails to keep it on track. Create a code review agent to look for and reject strange abstractions. Tell it what you don't like and it's really good at finding it.
I find Opus 4.5, properly prompted, to be significantly better at reviewing code than writing it, but you can just put it in a loop until the code it writes matches the review.
Based on my experience using these LLMs regularly I strongly doubt it could even build any application with realistic complexity without screwing things up in major ways everywhere, and even on top of that still not meeting all the requirements.
This! I can count on one hand the number of times I've had a chance to spin up a greenfield project, prototype or proof of concept in my 30 year career. Those were always stolen moments, and the bottleneck was never really coding ability. Most professional software development is wading through janky codebases of others' (or your own) creation, trying to iron out weird little glitches of the kind that LLMs can now generate on an industrial scale (and are incapable of fixing).
In my personal experience, Claude is better at greenfield, Codex is better at fitting in. Claude is the perfect tool for a "vibe coder", Codex is for the serious engineer who wants to get great and real work done.
Codex will regularly give me 1000+ line diffs where all my comments (I review every single line of what agents write) are basically nitpicks. "Make this shallow w/ early return, use | None instead of Optional", that sort of thing.
I do prompt it in detail though. It feels like I'm the person coming in with the architecture most of the time, AI "draws the rest of the owl."
Exactly. The main issue IMO is that "software that seems to work" and "software that works" can be very hard to tell apart without validating the code, yet these are drastically different in terms of long-term outcomes. Especially when there's a lot of money, or even lives, riding on these outcomes. Just because LLMs can write software to run the Therac-25 doesn't mean it's acceptable for them to do so.
But... you can ask! Ask claude to use encapsulation, or to write the equivalent of interfaces in the language you using, and to map out dependencies and duplicate features, or to maintain a dictionary of component responsibilities.
AI coding is a multiplier of writing speed but doesn't excuse planning out and mapping out features.
You can have reasonably engineered code if you get models to stick to well designed modules but you need to tell them.
But time I spend asking is time I could have been writing exactly what I wanted in the first place, if I already did the planning to understand what I wanted. Once I know what I want, it doesn't take that long, usually.
Which is why it's so great for prototyping, because it can create something during the planning, when you haven't planned out quite what you want yet.
> The hard thing about engineering is not "building a thing that works", its building it the right way, in an easily understood way, in a way that's easily extensible.
The number of production applications that achieve this rounds to zero
I’ve probably managed 300 brownfield web, mobile, edge, datacenter, data processing and ML applications/products across DoD, B2B, consumer and literally zero of them were built in this way
Another thing these posts assume is a single developer keep working on the product with a number of AI agents, not a large team. I think we need to rethink how teams work with AI. Its probably not gonna be a single developer typing a prompt but a team somehow collaborates a prompt or equivalent. XP on steroids? Programming by committee?
So far, Im not convinced, but lets take a look at fundmentally whats happening and why humans > agents > LLMs.
At its heart, programming is a constraint satisfaction problem.
The more constraints (requirements, syntax, standards, etc) you have, the harder it is to solve them all simultaneously.
New projects with few contributors have fewer constraints.
The process of “any change” is therefore simpler.
Now, undeniably
1) agents have improved the ability to solve constraints by iterating; eg. Generate, test, modify, etc. over raw LLm output.
2) There is an upper bound (context size, model capability) to solve simultaneous constraints.
3) Most people have a better ability to do this than agents (including claude code using opus 4.5).
So, if youre seeing good results from agents, you probably have a smaller set of constraints than other people.
Similarly, if youre getting bad results, you can probably improve them by relaxing some of the constraints (consistent ui, number of contributors, requirements, standards, security requirements, split code into well defined packages).
This will make both agents and humans more productive.
The open question is: will models continue to improve enough to approach or exceed human level ability in this?
Are humans willing to relax the constraints enough for it to be plausible?
I would say currently people clambering about the end of human developers are cluelessly deceived by the “appearance of complexity” which does not match the “reality of constraints” in larger applications.
Opus 4.5 cannot do the work of a human on code bases Ive worked on. Hell, talented humans struggle to work on some of them.
…but that doesnt mean it doesnt work.
Just that, right now, the constraint set it can solve is not large enough to be useful in those situations.
…and increasingly we see low quality software where people care only about speed of delivery; again, lowering the bar in terms of requirements.
So… you know. Watch this space. Im not counting on having a dev job in 10 years. If I do, it might be making a pile of barely working garbage.
…but I have one now, and anyone who thinks that this year people will be largely replaced by AI is probably poorly informed and has misunderstood the capabilities on these models.
Theres only so low you can go in terms of quality.
After recently applying Codex to a gigantic old and hairy project that is as far from greenfield it can be, I can assure you this assertion is false. It’s bonkers seeing 5.2 churn though the complexity and understanding dependencies that would take me days or weeks to wrap my head around.
On the contrary, Opus 4.5 is the best agent I’ve ever used for making cohesive changes across many files in a large, existing codebase. It maintains our patterns and looks like all the other code. Sometimes it hiccups for sure.
If you have microservices architecture in your project you are set for AI. You can swap out any lacking, legacy microservice in your system with "greenfield" vibecoded one.
LLMs are pretty good at picking up existing codebases. Even with cleared context they can do „look at this codebase and this spec doc that created it. I want to add feature x“
Yeah, all of those applications he shows do not really expose any complex business logic.
With all the due respect: a file converter for windows is glueing few windows APIs with the relevant codec.
Now, good luck working on a complex warehouse management application where you need extremely complex logic to sort the order of picking, assembling, packing on an infinite number of variables: weight, amazon prime priority, distribution centers, number and type of carts available, number and type of assembly stations available, different delivery systems and requirements for different delivery operators (such as GLE, DHL, etc) that has to work with N customers all requiring slightly different capabilities and flows, all having different printers and operations, etc, etc. And I ain't even scratching the surface of the business logic complexity (not even mentioning functional requirements) to avoid boring the reader.
Mind you, AI is still tremendously useful in the analysis phase, and can sort of help in some steps of the implementation one, but the number of times you can avoid looking thoroughly at the code for any minor issue or discrepancy is absolutely close to 0.
you can definitely just tell it what abstractions you want when adding a feature and do incremental work on existing codebase. but i generally prefer gpt-5.2
I've been using 5.2 a lot lately but hit my quota for the first time (and will probably continue to hit it most weeks) so I shelled out for claude code. What differences do you notice? Any 'metagame' that would be helpful?
Man, I've been biting my tongue all day with regards to this thread and overall discussion.
I've been building a somewhat-novel, complex, greenfield desktop app for 6 months now, conceived and architected by a human (me), visually designed by a human (me), implementation heavily leaning on mostly Claude Code but with Codex and Gemini thrown in the mix for the grunt work. I have decades of experience, could have built it bespoke in like 1-2 years probably, but I wanted a real project to kick the tires on "the future of our profession".
TL;DR I started with 100% vibe code simply to test the limits of what was being promised. It was a functional toy that had a lot of problems. I started over and tried a CLI version. It needed a therapist. I started over and went back to visual UI. It worked but was too constrained. I started over again. After about 10 complete start-overs in blank folders, I had a better vision of what I wanted to make, and how to achieve it. Since then, I've been working day after day, screen after screen, building, refactoring, going feature by feature, bug after bug, exactly how I would if I was coding manually. Many times I've reached a point where it feels "feature complete", until I throw a bigger dataset at it, which brings it to its knees. Time to re-architect, re-think memory and storage and algorithms and libraries used. Code bloated, and I put it on a diet until it was trim and svelte. I've tried many different approaches to hard problems, some of which LLMs would suggest that truly surprised me in their efficacy, but only after I presented the issues with the previous implementation. There's a lot of conversation and back and forth with the machine, but we always end up getting there in the end. Opus 4.5 has been significantly better than previous Anthropic models. As I hit milestones, I manually audit code, rewrite things, reformat things, generally polish the turd.
I tell this story only because I'm 95% there to a real, legitimate product, with 90% of the way to go still. It's been half a year.
Vibe coding a simple app that you just want to use personally is cool; let the machine do it all, don't worry about under the hood, and I think a lot of people will be doing that kind of stuff more and more because it's so empowering and immediate.
Using these tools is also neat and amazing because they're a force multiplier for a single person or small group who really understand what needs done and what decisions need made.
These tools can build very complex, maintainable software if you can walk with them step by step and articulate the guidelines and guardrails, testing every feature, pushing back when it gets it wrong, growing with the codebase, getting in there manually whenever and wherever needed.
These tools CANNOT one-shot truly new stuff, but they can be slowly cajoled and massaged into eventually getting you to where you want to go; like, hard things are hard, and things that take time don't get done for a while. I have no moral compunctions or philosophical musings about utilizing these tools, but IMO there's still significant effort and coordination needed to make something really great using them (and literally minimal effort and no coordination needed to make something passable)
If you're solo, know what you want, and know what you're doing, I believe you might see 2x, 4x gains in time and efficiency using Claude Code and all of his magical agents, but if your project is more than a toy, I would bet that 2x or 4x is applied to a temporal period of years, not days or months!
"its building it the right way, in an easily understood way, in a way that's easily extensible"
I am in a unique situation where I work with a variety of codebases over the week. I have had no problem at all utilizing Claude Code w/ Opus 4.5 and Gemini CLI w/ Gemini 3.0 Pro to make excellent code that is indisputably "the right way", in an extremely clear and understandable way, and that is maximally extensible. None of them are greenfield projects.
I feel like this is a bit of je ne sais quoi where people appeal to some indemonstrable essence that these tools just can't accomplish, and only the "non-technical" people are foolish enough to not realize it. I'm a pretty technical person (about 30 years of software development, up to staff engineer and then VP). I think they have reached a pretty high level of competence. I still audit the code and monitor their creations, but I don't think they're the oft claimed "junior developer" replacement, but instead do the work I would have gotten from a very experienced, expert-level developer, but instead of being an expert at a niche, they're experts at almost every niche.
Are they perfect? Far from it. It still requires a practitioner who knows what they're doing. But frequently on here I see people giving takes that sound like they last used some early variant of Copilot or something and think that remains state of the art. The rest of us are just accelerating our lives with these tools, knowing that pretending they suck online won't slow their ascent an iota.
You AI hype thots/bots are all the same. All these claims but never backed up with anything to look at. And also alway claiming “you’re holding it wrong”.
Not in terms of knowledge. That was already phenomenal. But in its ability to act independently: to make decisions, collaborate with me to solve problems, ask follow-up questions, write plans and actually execute them.
You have to experience it yourself on your own real problems and over the course of days or weeks.
Every coding problem I was able to define clearly enough within the limits of the context window, the chatbot could solve and these weren’t easy. It wasn’t just about writing and testing code. It also involved reverse engineering and cracking encoding-related problems. The most impressive part was how actively it worked on problems in a tight feedback loop.
In the traditional sense, I haven’t really coded privately at all in recent weeks. Instead, I’ve been guiding and directing, having it write specifications, and then refining and improving them.
Curious how this will perform in complex, large production environments.
Just some examples I’ve already made public. More complex ones are in the pipeline. With [0], I’m trying to benchmark different coding-agents. With [1], I successfully reverse-engineered an old C64 game using Opus 4.5 only.
Yes, feel free to blame me for the fact that these aren’t very business-realistic.
This has always been my problem whether it's Gemini, openai or Claude. Unless you hand-hold it to an extreme degree, it is going to build a mountain next to a molehill.
It may end up working, but the thing is going to convolute apis and abstractions and mix patterns basically everywhere
Difficult and it really depends on the complexity. I definitely work in a spec-driven way, with a step-by-step implementation phase. If it goes the wrong way I prefer to rewrite the spec and throw away the code.
I have it propose several approaches, pick and choose from each, and remove what I don't want done. "Use the general structure of A, but use the validation structure of D. Using a view translation layer is too much, just rely on FastAPI/SQLModel's implicit view conversion."
I personally try to narrow scope as much as possible to prevent this. If a human hands me a PR that is not digestible size-wise and content-wise (to me), I am not reviewing and merging it. Same thing with what claude generates with my guidance.
Instructions, in the system prompt for not doing that
Once more people realize how easy it is to customize and personalized your agent, I hope they will move beyond what cookie cutter Big AI like Anthropic and Google give you.
I suspect most won't though because (1) it means you have to write human language, communication, and this weird form of persuasion, (2) ai is gonna make a bunch of them lazy and big AI sold them on magic solutions that require no effort on your part (not true, there is a lot of customizing and it has huge dividends)
I find my sweet spot is using the Claude web app as a rubber duck as well as feeding it snippets of code and letting it help me refine the specific thing I'm doing.
When I use Claude Code I find that it *can* add a tremendous amount of ability due to its ability to see my entire codebase at once, but the issue is that if I'm doing something where seeing my entire codebase would help that it blasts through my quota too fast. And if I'm tightly scoping it, it's just as easy & faster for me to use the website.
Because of this I've shifted back to the website. I find that I get more done faster that way.
I've had similar experiences but I've been able to start using Claude Code for larger projects by doing some refactoring with the goal of making the codebase understandable by just looking at the interfaces. This along with instructions to prefer looking at the interface for a module unless working directly on the implementation of the module seems to allow further progress to be made within session limits.
By "the website" do you mean you're copy pasting, or are you using the code system where Anthropic clones your code from GitHub and interacts with it in a VM/container for you.
> In the traditional sense, I haven’t really coded privately at all in recent weeks. Instead, I’ve been guiding and directing, having it write specifications, and then refining and improving them.
I've noticed a huge drop in negative comments on HN when discussing LLMs in the last 1-2 months.
All the LLM coded projects I've seen shared so far[1] have been tech toys though. I've watched things pop up on my twitter feed (usually games related), then quietly go off air before reaching a gold release (I manually keep up to date with what I've found, so it's not the algorithm).
I find this all very interesting: LLMs dont change the fundamental drives needed to build successful products. I feel like I'm observing the TikTokification of software development. I dont know why people aren't finishing. Maybe they stop when the "real work" kicks in. Or maybe they hit the limits of what LLMs can do (so far). Maybe they jump to the next idea to keep chasing the rush.
Acquiring context requires real work, and I dont see a way forward to automating that away. And to be clear, context is human needs; i.e. the reasons why someone will use your product. In the game development world, it's very difficult to overstate how much work needs to be done to create a smooth, enjoyable experience for the player.
While anyone may be able to create a suite of apps in a weekend, I think very few of them will have the patience and time to maintain them (just like software development before LLMs! i.e. Linux, open source software, etc.).
[1] yes, selection bias. There are A LOT of AI devs just marketing their LLMs. Also it's DEFINITELY too early to be certain. Take everything Im saying with a one pound grain of salt.
> I've noticed a huge drop in negative comments on HN when discussing LLMs in the last 1-2 months.
real people get fed up of debating the same tired "omg new model 1000x better now" posts/comments from the astroturfers, the shills and their bots each time OpenAI shits out a new model
Simply this ^ I'm tired of debating bots and people paid to grow the hype, so I won't anymore I'll just work and look for the hype passing by from a distance. In the meanwhile I'll keep waiting for people making actual products with LLMs that will kill old generation products like windows, excel, teams, gmail etc that will replace slop with great ui/ux and push really performant apps
Especially when 90% of these articles are based on personal, anecdotally evidence and keep repeating the same points without offering anything new.
If these articles actually provide quantitative results in a study done across an organization and provide concrete suggestions like what Google did a while ago, that would be refreshing and useful.
(Yes, this very article has strong "shill" vibes and fits the patterns above)
You're only hurting yourself if you decide there's some wild conspiracy afoot here to pay shills to tell people that coding agents are useful... as opposed to people finding them useful enough to want to tell other people about it.
This is a cringe comment from an era of when "Micro$oft" was hip and reads like you are a fanboi for Anthropic/Google foaming at the mouth.
Would be far more useful if you provided actual verifiable information and dropped the cringe memes. Can't take seriously someone using "Microslop" in a sentence".
It could be that the people who are focused on building monetizable products with LLMs don't feel the need to share what they are doing - they're too busy quietly getting on with building and marketing their products.
Sharing how you're using these tools is quite a lot of work!
I admit I'm in this boat. I get immense value from LLMs, easily 5x if not more, and the codebases I work in are large, mature and complex. But providing "receipts" as the kids call it these days would be a huge undertaking, with not a lot of upside. In fact, the downsides are considerable. Aside from the time investment, I have no interest in arguing with people about whether what I work on is just CRUD (it's not) or that the problems I work on are not novel (who cares, your product either provides value for your users or it does not).
Agreed! LLMs are a force multiplier for real products too. They're going to augment people who are willing to do the real work.
But, Im also wondering if LLMs are going to create a new generation of software dev "brain rot" (to use the colloquial term), similar to short form videos.
I should mention in the gamedev world, it's quite common share because sharing is marketing, hence my perspective.
The type of people to use AI are necessarily the people who will struggle most when it comes time to do the last essential 20% of the work that AI can't do. Once thinking is required to bring all the parts into a whole, the person who gives over their thinking skills to AI will not be equipped to do the work, either because they never had the capacity to begin with or because AI has smoothed out the ripples of their brain. I say this from experience.
I think you can tell from some answers here that people talk to these models a lot and adapt their language structure :( Means they stop asking themselves whether it makes any sense what they ask the model for. It does not turn middle management into developers it turns developers into middle managers that just shout louder or replace a critical mind with another yesman or the next super best model that finally brings their genius ideas to life. Then well they get to the same wall of having to learn for themselves to reach gold and ofc that's an insult to any manager. Whoever cannot do the insane job has to be wrong, never the one asking for insanity.
Sad i had to scroll so far down to get some fitting description of why those projects all die. Maybe it's not just me leaving all social networks even HN because well you may not talk to 100% bots but you sure talk to 90% of people that talk to models a lot instead of using them as a tool.
Deploying and maintaining something in a production-ready environment is a huge amount of work. It's not surprising that most people give up once they have a tech demo, especially if they're not interested in spending a ton of time maintaining these projects. Last year Karpathy posted about a similar experience, where he quickly vibe coded some tools only to realize that deploying it would take far more effort than he originally anticipated.
I think it's also rewarding to just be able to build something for yourself, and one benefit of scratching your own itch is that you don't have to go through the full effort of making something "production ready". You can just build something that's tailed specifically to the problem you're trying to solve without worrying about edge cases.
Yeah, I do a lot of hobby game making and the 80/20 rule definitely applies. Your game will be "done" in 20% of the time it takes to create a polished product ready for mass consumption.
Stopping there is just fine if you're doing it as a hobby. I love to do this to test out isolated ideas. I have dozens of RPGs in this state, just to play around with different design concepts from technical to gameplay.
Sometimes I feel like a lot of those posts are instances of Kent Brockman:
"I for one, welcome our new insect overlords."
Given the enthusiasm of our ruling class towards automating software development work, it may make sense for a software engineer to publicly signal how much onboard as a professional they are with it.
But, I've seen stranger stuff throughout my professional life: I still remember people enthusiastically defending EJB 2.1 and xdoclet as perfectly fine ways of writing software.
I hacked together a Swift tool to replace a Python automation I had, merged an ARM JIT engine into a 68k emulator, and even got a very decent start on a synth project I’ve been meaning to do for years.
What has become immensely apparent to me is that even gpt-5-mini can create decent Go CLI apps provided you write down a coherent spec and review the code as if it was a peer’s pull request (the VS Code base prompts and tooling steer even dumb models through a pretty decent workflow).
GPT 5.2 and the codex variants are, to me, every bit as good as Opus but without the groveling and emojis - I can ask it to build an entire CI workflow and it does it in pretty much one shot if I give it the steps I want.
So for me at least this model generation is a huge force multiplier (but I’ve always been the type to plan before coding and reason out most of the details before I start, so it might be a matter of method).
To add to the anecdata, today GPT 5.2-whatever hallucinated the existence of two CLI utilities, and when corrected, then hallucinated the existence of non-existent, but plausible, features/options of CLI utilities that do actually exist.
I had to dig through source code to confirm whether those features actually existed. They don't, so the CLI tools GPT recommended aren't actually applicable to my use case.
Yesterday, it hallucinated features of WebDav clients, and then talked up an abandoned and incomplete project on GitHub with a dozen stars as if it was the perfect fit for what I was trying to do, when it wasn't.
I only remember these because they're recent and CLI related, given the topic, but there are experiences like this daily across different subjects and domains.
Were you running it inside a coding agent like Codex?
If so then it should have realized its mistake when it tried to run those CLI commands and saw the error message. Then it can try something different instead.
If you were using a regular chat interface and expecting it to know everything without having an environment to try things out then yeah, you're going to be disappointed.
Yeah, it needs a steady hand on the tiller. However throw together improvements of 70%, -15%, 95%, 99%, -7% across all the steps and overall you're way ahead.
SimonW's approach of having a suite of dynamic tools (agents) grind out the hallucinations is a big improvement.
In this case expressing the feeback validation and investing in the setup may help smooth these sharp edges.
Gemini 3 Pro (High) via Antigravity has been similarly great recently. So have tools that I imagine call out to these higher-power models: Amp and Junie. In a two-week blur I brought forth the bulk of a Ruby library that includes bindings to the Ratatui rust crate for making TUIs in Ruby. During that time I also brought forth documentation, example applications, build and devops tooling, and significant architectural decisions & roadmaps for the future. It's pretty unbelievable, but it's all there in the git and CI history. https://sr.ht/~kerrick/ratatui_ruby/
I think the following things are true now:
- Vibe Coding is, more than ever, "autopilot" in the aviation sense, not the colloquial sense. You have to watch it, you are responsible, the human has do run takeoff/landing (the hard parts), but it significantly eases and reduces risk on a bulk of the work.
- The gulf of developer experience between today's frontier tooling and six months ago is huge. I pushed hard to understand and use these tools throughout last year, and spent months discouraged--back to manual coding. Folks need to re-evaluate by trying premium tools, not free ones.
- Tooling makers have figured out a lot of neat hacks to work around the limitations of LLMs to make it seem like they're even better than they are. Junie integrates with your IDE, Antigravity has multiple agents maintaining background intel on your project and priorities across chats. Antigravity also compresses contexts and starts new ones without you realizing it, calls to sub-agents to avoid context pollution, and other tricks to auto-manage context.
- Unix tools (sed, grep, awk, etc.) and the git CLI (ls-tree, show, --stat, etc.) have been a huge force-multiplier, as they keep the context small compared to raw ingestion of an entire file, allowing the LLMs to get more work done in a smaller context window.
- The people who hire programmers are still not capable of Vibe Coding production-quality web apps, even with all these improvements. In fact, I believe today this is less of a risk than I feared 10 months ago. These are advanced tools that need constant steering, and a good eye for architecture, design, developer experience, test quality, etc. is the difference between my vibe coded Ruby [0] (which I heavily stewarded) and my vibe coded Rust [1] (I don't even know what borrow means).
Were they able to link Antigravity to your paid subscription? I have a Google ultra AI sub and antigrav ran out of credits within 30 minutes for me. Of course that was a few weeks ago, and I’m hoping that they fixed this
I tried generating code with ChatGPT 5.2, but the results weren't that great:
1) It often overcomplicates things for me. After I refactor its code, it's usually half the size and much more readable. It often adds unnecessary checks or mini-features 'just in case' that I don't need.
2) On the other hand, almost every function it produces has at least one bug or ignores at least one instruction. However, if I ask it to review its own code several times, it eventually finds the bugs.
I still find it very useful, just not as a standalone programming agent. My workflow is that ChatGPT gives me a rough blueprint and I iterate on it myself, I find this faster and less error-prone. It's usually most useful in areas where I'm not an expert, such as when I don't remember exact APIs. In areas where I can immediately picture the entire implementation in my head, it's usually faster and more reliable to write the code myself.
Well, like I pointed out somewhere else, VS Code gives it a set of prompts and tools that makes it very effective for me. I see that a lot of people are still copy/pasting stuff instead of having the “integrated” experience, and it makes a real difference.
The thing is that CLI utilities code is probably easier to write for an LLM than most other things. In my experience an LLM does best with backend and terminal things. Anything that resembles boilerplate is great. It does well refactoring unit tests, wrapping known code in a CLI, and does decent work with backend RESTful APIs. Where it fails utterly is things like HTML/CSS layout, JavaScript frontend code for SPAs, and particularly real world UI stuff that requires seeing and interacting with a web page/app where things like network latency and errors, browser UI, etc. can trip it up. Basically when the input and output are structured and known an LLM will do well. When they are “look and feel” they fail and fail until they make the code unmaintainable.
This experience for me is current but I do not normally use Opus so perhaps I should give it a try and figure out if it can reason around problems I myself do not foresee (for example a browser JS API quirk that I had never seen).
I've been having a surprising amount of success recently telling Claude Code to test the frontend it's building using Playwright, including interacting with the UI and having it take its own screenshots to feed into its vision ability to "see" what's going on.
In my experience with a combo of Claude Code and Gemini Pro (and having added Codex to the mix about a week ago as well), it matters less whether it’s CLI, backend, frontend, DB queries, etc. but more how cookiecutter the thing you’re building is. For building CRUD views or common web application flows, it crushes it, especially if you can point it to a folder and just tell it to do more of the same, adapted to a new use case.
But yes, the more specific you get and the more moving pieces you have, the more you need to break things down into baby steps. If you don’t just need it to make A work, but to make it work together with B and C. Especially given how eager Claude is to find cheap workarounds and escape hatches, botching things together in any way seemingly to please the prompter as fast as possible.
Since one of my holiday projects was completely rebuilding the Node-RED dashboard in Preact, I have to challenge that a bit. How were you using the model?
I couldn't disagree more. I've had Claude absolutely demolish large HTML/CSS/JS/React projects. One key is to give it some way to "see" and interact with the page. I usually use Playwright for this. Allowing it to see its own changes and iterate on them was the key unlock for me.
Putting the performance aside for now as I just started trying out Opus 4.5, can't say too much yet, I don't hype or hate AI as of now, it's simply useful.
Time will tell what happens, but if programming becomes "prompt engineering", I'm planning on quitting my job and pivoting to something else. It's nice to get stuff working fast, but AI just sucks the joy out of building for me.
Trying to not feel the pressure/anxiety from this, but every time a new model drops there is this tiny moment where I think "Is it actually different this time?"
I have similar stance to you. LLM has been very useful for me but it doesn't really change the fun-ness of programming since my circumstances has allowed me find programming to be very fun. I also want to pivot out to something else if English prompt becomes the main way to develop complex software. Though my other passion is having worse career horizon in the generative AI world (art making). We'll see.
> Time will tell what happens, but if programming becomes "prompt engineering", I'm planning on quitting my job and pivoting to something else. It's nice to get stuff working fast, but AI just sucks the joy out of building for me.
I hear you but I think many companies will change the role ; you'll get the technical ownership + big chunks of the data/product/devops responsibility. I'm speculating but I think one person can take that on himself with the new tools and deliver tremendous value. I don't know how they'll call this new role though, we'll see.
A couple weeks ago I had Opus 4.5 go over my project and improve anything it could find. It "worked" but the architecture decisions it made were baffling, and had many, many bugs. I had to rewrite half of the code. I'm not an AI hater, I love AI for tests, finding bugs, and small chores. Opus is great for specific, targeted tasks. But don't ask it to do any general architecture, because you'll be soon to regret it.
Instead you should prompt it to come up with suggestions, look for inconsistencies etc. Then you get a list, and you pick the ones you find promising. Then you ask Claude to explain what why and how of the idea. And only then you let it implement something.
these models work best when you know what you want to achieve and it helps you get there while you guide it. "Improve anything you can find" sounds like you didn't really know
As a tool to help developers I think it's really useful. It's great at stuff people are bad at, and bad at stuff people are good at. Use it as a tool, not a replacement.
"Improve anything you can find" is like going to your mechanic and saying "I'm going on a long road trip, can you tell me anything that needs to be fixed?"
In my experience these models (including opus) aren’t very good at “improving” existing code. I’m not exactly sure why, because the code they produce themselves is generally excellent.
I like these examples that predictably show the weaknesses of current models.
This reminds me of that example where someone asked an agent to improve a codebase in a loop overnight and they woke up to 100,000 lines of garbage [0]. Similarly you see people doing side-by-side of their implementation and what an AI did, which can also quite effectively show how AI can make quite poor architecture decisions.
This is why I think the “plan modes” and spec driven development are so important effective for agents, because it helps to avoid one of their main weaknesses.
To me, this doesn't show the weakness of current models, it shows the variability of prompts and the influence on responses. Because without the prompt it's hard to tell what influenced the outcome.
I had this long discussion today with a co-worker about the merits of detailed queries with lots of guidance .md documents, vs just asking fairly open ended questions. Spelling out in great detail what you want, vs just generally describing what you want the outcomes to be in general then working from there.
His approach was to write a lot of agent files spelling out all kinds of things like code formatting style, well defined personas, etc. And here's me asking vague questions like, "I'm thinking of splitting off parts of this code base into a separate service, what do you think in general? Are there parts that might benefit from this?"
>> A couple weeks ago I had Opus 4.5 go over my project and improve anything it could find. It "worked" but the architecture decisions it made were baffling, and had many, many bugs.
So you gave it an poorly defined task, and it failed?
I'm using AI tools to find issues in my code. 9/10 of their suggestions are utter nonsense and fixing them would make my code worse. That said, there are real issues they're finding, so it's worth it.
I wouldn't be surprised to find out that they will find issues infinitely, if looped with fixes.
I've found it to be terrible when you allow it to be creative. Constrain it, and it does much better.
Have you tried the planning mode? Ask it to review the codebase and identify defects, but don't let it make any changes until you've discussed each one or each category and planned out what to do to correct them. I've had it refactor code perfectly, but only when given examples of exactly what you want it to do, or given clear direction on what to do (or not to do).
Once you’ve got Claude Code set up, you can point it at your codebase, have it learn your conventions, pull in best practices, and refine everything until it’s basically operating like a super-powered teammate. The real unlock is building a solid set of reusable “skills” plus a few agents for the stuff you do all the time.
For example, we have a custom UI library, and Claude Code has a skill that explains exactly how to use it. Same for how we write Storybooks, how we structure APIs, and basically how we want everything done in our repo. So when it generates code, it already matches our patterns and standards out of the box.
We also had Claude Code create a bunch of ESLint automation, including custom ESLint rules and lint checks that catch and auto-handle a lot of stuff before it even hits review.
Then we take it further: we have a deep code review agent Claude Code runs after changes are made. And when a PR goes up, we have another Claude Code agent that does a full PR review, following a detailed markdown checklist we’ve written for it.
On top of that, we’ve got like five other Claude Code GitHub workflow agents that run on a schedule. One of them reads all commits from the last month and makes sure docs are still aligned. Another checks for gaps in end-to-end coverage. Stuff like that. A ton of maintenance and quality work is just… automated. It runs ridiculously smoothly.
We even use Claude Code for ticket triage. It reads the ticket, digs into the codebase, and leaves a comment with what it thinks should be done. So when an engineer picks it up, they’re basically starting halfway through already.
There is so much low-hanging fruit here that it honestly blows my mind people aren’t all over it. 2026 is going to be a wake-up call.
(used voice to text then had claude reword, I am lazy and not gonna hand write it all for yall sorry!)
Edit: made an example repo for ya
https://github.com/ChrisWiles/claude-code-showcase
Codex/Claude Code are terrible with C++. It also can't do Rust really well, once you get to the meat of it. Not sure why that is, but they just spit out nonsense that creates more work than it helps me. It also can't one shot anything complete, even though I might feed him the entire paper that explains what the algorithm is supposed to do.
Try to do some OpenGL or Vulkan with it, without using WebGPU or three.js. Try it with real code, that all of us have to deal with every day. SDL, Vulkan RHI, NVRHI. Very frustrating.
Try it with boost, or cmake, or taskflow. It loses itself constantly, hallucinates which version it is working on and ignores you when you provide actual pointers to documentation on the repo.
I've also recently tried to get Opus 4.5 to move the Job system from Doom 3 BFG to the original codebase. Clean clone of dhewm3, pointed Opus to the BFG Job system codebase, and explained how it works. I have also fed it the Fabien Sanglard code review of the job system: https://fabiensanglard.net/doom3_bfg/threading.php
We are not sleeping on it, we are actually waiting for it to get actually useful. Sure, it can generate a full stack admin control panel in JS for my PostgreSQL tables, but is that really "not normal"? That's basic.
With some entirely novel work we're doing, it's actually a hindrance as it consistently tells us the approach isn't valid/won't work (it will) and then enters "absolutely right" loops when corrected.
I still believe those who rave about it are not writing anything I would consider "engineering". Or perhaps it's a skill issue and I'm using it wrong, but I haven't yet met someone I respect who tells me it's the future in the way those running AI-based companies tell me.
I thought I'd revive it, but this time with Vulkan and no third-party dependencies (except for Vulkan)
4.5 Sonet, Opus and Gemini 3.5 flash has helped me write image decoders for dds, png jpg, exr, a wayland window implementation, macOS window implementation, etc.
I find that Gemini 3.5 flash is really good at understanding 3d in general while sonnet might be lacking a little.
All these sota models seem to understand my bespoke Lua framework and the right level of abstraction. For example at the low level you have the generated Vulkan bindings, then after that you have objects around Vulkan types, then finally a high level pipeline builder and whatnot which does not mention Vulkan anywhere.
However with a larger C# codebase at work, they really struggle. My theory is that there are too many files and abstractions so that they cannot understand where to begin looking.
Which makes sense. I'm sure there's lots of training data for React/HTML/CSS/etc. but much less with Swift, especially the newer versions.
I have not had this experience at all. It often doesn't get it right on the first pass, yes, but the advantage with Rust vibecoding is that if you give it a rule to "Always run cargo check before you think you're finished" then it will go back and fix whatever it missed on the first pass. What I find particularly valuable is that the compiler forces it to handle all cases like match arms or errors. I find that it often misses edge cases when writing typescript, and I believe that the relative leniency of the typescript compiler is why.
In a similar vein, it is quite good at writing macros (or at least, quite good given how difficult this otherwise is). You often have to cajole it into not hardcoding features into the macro, but since macros resolve at compile time they're quite well-suited for an LLM workflow as most potential bugs will be apparent before the user needs to test. I also think that the biggest hurdle of writing macros to humans is the cryptic compiler errors, but I can imagine that since LLMs have a lot of information about compilers and syntax parsing in their training corpus, they have an easier time with this than the median programmer. I'm sure an actual compiler engineer would be far better than the LLM, but I am not that guy (nor can I afford one) so I'm quite happy to use LLMs here.
For context, I am purely a webdev. I can't speak for how well LLMs fare at anything other than writing SQL, hooking up to REST APIs, React frontend, and macros. With the exception of macros, these are all problems that have been solved a million times thus are more boilerplate than novelty, so I think it is entirely plausible that they're very poor for different domains of programming despite my experiences with them.
Because types are proofs and require global correctness, you can't just iterate, fix things locally, and wait until it breaks somewhere else that you also have to fix locally.
From what I've seen, CC has troubles with the latest Swift too, partially because of it being latest and partially because it's so convoluted nowadays.
But it's übercharged™ for C#
And I get it. Coding with Claude Code really was prompting something, getting errors, and asking it to fix it. Which was still useful but I could see why a skilled coder adding a feature to a complex codebase would just give up
Opus 4.5 really is at a new tier however. It just...works. The errors are far fewer and often very minor - "careless" errors, not fundamental issues (like forgetting to add "use client" to a nextjs client component.
I even gave -True Vibe Coding- a whirl. Yesterday, from a blank directory and text file list of requirements, I had Opus 4.5 build an Android TV video player that could read a directory over NFS, show a grid view of movie poster thumbnails, and play the selected video file on the TV. The result wasn't exactly full-featured Kodi, but it works in the emulator and actual device, it has no memory leaks, crashes, ANRs, no performance problems, no network latency bugs or anything. It was pretty astounding.
Oh, and I did this all without ever opening a single source file or even looking at the proposed code changes while Opus was doing its thing. I don't even know Kotlin and still don't know it.
This is what people are still doing wrong. Tools in a loop people, tools in a loop.
The agent has to have the tools to detect whatever it just created is producing errors during linting/testing/running. When it can do that, I can loop again, fix the error and again - use the tools to see whether it worked.
I _still_ encounter people who think "AI programming" is pasting stuff into ChatGPT on the browser and they complain it hallucinates functions and produces invalid code.
Well, d'oh.
My team is using it with Claude Code and say it works brilliantly, so I'll be giving it another go.
How much of the value comes from Opus 4.5, how much comes from Claude Code, and how much comes from the combination?
It's not just the deficiencies of earlier versions, but the mismatch between the praise from AI enthusiasts and the reality.
I mean maybe it is really different now and I should definitely try uploading all of my employer's IP on Claude's cloud and see how well it works. But so many people were as hyped by GPT-4 as they are now, despite GPT-4 actually being underwhelming.
Too much hype for disappointing results leads to skepticism later on, even when the product has improved.
Literally tried it yesterday. I didn't see a single difference with whatever model Claude Code was using two months ago. Same crippled context window. Same "I'll read 10 irrelevant lines from a file", same random changes etc.
Deleted Comment
Dead Comment
I had an idea for something that I wanted, and in five scattered hours, I got it good enough to use. I'm thinking about it in a few different ways:
1. I estimate I could have done it without AI with 2 weeks full-time effort. (Full-time defined as >> 40 hours / week.)
2. I have too many other things to do that are purportedly more important that programming. I really can't dedicate to two weeks full-time to a "nice to have" project. So, without AI, I wouldn't have done it at all.
3. I could hire someone to do it for me. At the university, those are students. From experience with lots of advising, a top-tier undergraduate student could have achieved the same thing, had they worked full tilt for a semester (before LLMs). This of course assumes that I'm meeting them every week.
- single scripts. Anything which can be reduced to a single script.
- starting greenfield projects from scratch
- code maintenance (package upgrades, old code...)
- tasks which have a very clear and single definition. This isn't linked to complexity, some tasks can be both very complex but with a single definition.
If your work falls into this list they will do some amazing work (and yours clearly fits that), if it doesn't though, prepare yourself because it will be painful.
Nobody is sleeping. I'm using LLMs daily to help me in simple coding tasks.
But really where is the hurry? At this point not a few weeks go by without the next best thing since sliced bread to come out. Why would I bother "learning" (and there's really nothing to learn here) some tool/workflow that is already outdated by the time it comes out?
> 2026 is going to be a wake-up call
Do you honestly think a developer not using AI won't be able to adapt to a LLM workflow in, say, 2028 or 2029? It has to be 2026 or... What exactly?
There is literally no hurry.
You're using the equivalent of the first portable CD-player in the 80s: it was huge, clunky, had hiccups, had a huge battery attached to it. It was shiny though, for those who find new things shiny. Others are waiting for a portable CD player that is slim, that buffers, that works fine. And you're saying that people won't be able to learn how to put a CD in a slim CD player because they didn't use a clunky one first.
It's also a skill that compounds over time, so if you have two years of experience with them you'll be able to use them more effectively than someone with two months of experience.
In that respect, they're just normal technology. A Python programmer with two years of Python experience will be more effective than a programmer with two months of Python.
That is sleeping.
> But really where is the hurry? At this point not a few weeks go by without the next best thing since sliced bread to come out. Why would I bother "learning" (and there's really nothing to learn here) some tool/workflow that is already outdated by the time it comes out?
You're jumping to conclusions that haven't been justified by any of the development in this space. The learning compounds.
> Do you honestly think a developer not using AI won't be able to adapt to a LLM workflow in, say, 2028 or 2029? It has to be 2026 or... What exactly?
They will, but they'll be competing against people with 2-3 more years of experience in understanding how to leverage these tools.
claude can call ssh and do system admin tasks. It works amazingly well. I have 3 VM's, which depends on each other (proxmox with openwrt, adguard, unbound), and claude can prove to me that my dns chains works perfectly, my firewalls are perfect etc as claude can ssh into each. Setting up services, diagnosing issues, auditing configs... you name it. Just awesome.
claude can call other sh scripts on the machine, so over time, you can create a bunch of scripts that lets claude one shot certain tasks that would normally eat tokens. It works great. One script per intention - don't have a script do more than one thing.
claude can call the compiler, run the debug executable and read the debug logs.. in real time. So claude can read my android apps debug stream via adb.. or my C# debug console because claude calls the compiler, not me. Just ask it to do it and it will diagnose stuff really quickly.
It can also analyze your db tables (give it readonly sql access), look at the application code and queries, and diagnose performance issues.
The opportunities are endless here. People need to wake up to this.
Deleted Comment
Claude set up a Raspberry Pi with a display and conference audio device for me to use as an Alexa replacement tied to Home Assistant.
I gave it an ssh key and gave it root.
Then I told it what I wanted, and it did. It asked for me to confirm certain things, like what I could see on screen, whether I could hear the TTS etc. (it was a bit of a surprise when it was suddenly talking to me while I was minding my own business).
It configured everything, while keeping a meticulous log that I can point it at if I want to set up another device, and eventually turn into a runbook if I need to.
In addition there are instructions on how and where to push the possible fixes and how to check the results.
I've yet to encounter a build failure it couldn't fix automatically.
> have it learn your conventions, pull in best practices
What do you mean by "have it learn your conventions"? Is there a way to somehow automatically extract your conventions and store it within CLAUDE.md?
> For example, we have a custom UI library, and Claude Code has a skill that explains exactly how to use it. Same for how we write Storybooks, how we structure APIs, and basically how we want everything done in our repo. So when it generates code, it already matches our patterns and standards out of the box.
Did you have to develop these skills yourself? How much work was that? Do you have public examples somewhere?
I'll give you an example: I use ruff to format my python code, which has an opinionated way of formatting certain things. After an initial formatting, Opus 4.5, without prompting, will write code in this same style so that the ruff formatter almost never has anything to do on new commits. Sonnet 4.5 is actually pretty good at this too.
I’ve even found it searching node_modules to find the API of non-public libraries.
/init in Claude Code already automatically extracts a bunch, but for something more comprehensive, just tell it which additional types of things you want it to look for and document.
> Did you have to develop these skills yourself? How much work was that? Do you have public examples somewhere?
I don't know about the person above, but I tell Claude to write all my skills and agents for me. With some caveats, you can do this iteratively in a single session ("update the X agent, then re-run it. Repeat until it reliably does Y")
Especially if you're in a place where a lot of time was spent previously revising PRs for best practices, etc, even for human-submitted code, then having the LLM do that for you that saves a bunch of time. Most humans are bad at following those super-well.
There's a lot of stuff where I'm pretty sure I'm up to at least 2x speed now. And for things like making CLI tools or bash scripts, 10x-20x. But in terms of "the overall output of my day job in total", probably more like 1.5x.
But I think we will need a couple major leaps in tooling - probably deterministic tooling, not LLM tooling - before anyone could responsibly ship code nobody has ever read in situations with millions of dollars on the line (which is different from vibe-coding something that ends up making millions - that's a low-risk-high-reward situation, where big bets on doing things fast make sense. if you're already making millions, dramatic changes like that can become high-risk-low-reward very quickly. In those companies, "I know that only touching these files is 99.99% likely to be completely safe for security-critical functionality" and similar "obvious" intuition makes up for the lack of ability to exhaustively test software in a practical way (even with fuzzers and things), and "i didn't even look at the code" is conceding responsibility to a dangerous degree there.)
Reword? But why not just voice to text alone...
Oh but we all read the partially synthetic ad by this point. Psyche.
I agree with this, but I haven't needed to use any advanced features to get good results. I think the simple approach gets you most of the benefits. Broadly, I just have markdown files in the repo written for a human dev audience that the agent can also use.
Basically:
- README.md with a quick start section for devs, descriptions of all build targets and tests, etc. Normal stuff.
- AGENTS.md (only file that's not written for people specifically) that just describes the overall directory structure and has a short step of instructions for the agent: (1) Always read the readme before you start. (2) Always read the relevant design docs before you start. (3) Always run the linter, a build, and tests whenever you make code changes.
- docs/*.md that contain design docs, architecture docs, and user stories, just text. It's important to have these resources anyway, agent or no.
As with human devs, the better the docs/requirements the better the results.
It's a tiny step further, and sub-agents provide a massive benefit the moment you're ready to trust the model even a little bit (relax permissions to not have it prompt you for every little thing; review before committing rather than on every file edit) because they limit what goes into the top level context, and can let the model work unassisted for far longer. I now regularly have it run for hours at a time without stopping.
Running and acting on output from the linter is absolutely an example of that which matters even for much shorter runs.
There's no reason to have all the lint output "polluting" the top level context, nor to have the steps the agent needs to take to fix linter issues that can't be auto-fixed by the linter itself. The top level agent should only need to care about whether the linter run passed or failed (and should know it needs to re-run and possibly investigate if it fails).
Just type /agents, select "Create new agent" and describe a task you often do, and then forget about it (or ask Claude to make changes to it for you)
There are times those extra layers are worth it but it seems LLMs have a bias to add them prematurely and overcomplicate things. You then end up with extra complexity you didn't need.
When needed it can be picked up in a day. Otherwise they are not paid based in tickets solved etc. If the incentives were properly aligned everyone would already use it
I'm not saying there aren't use cases for agents, just that it's normal that most software engineers are sleeping on it.
codex cli even has a skill to create skills; it's super easy to get up to speed with them
https://github.com/openai/skills/blob/main/skills/.system/sk...
Deleted Comment
Dead Comment
take my downvote as hard as you can. this sort of thing is awfully off-putting.
The tech industry just went through an insane hiring craze and is now thinning out. This will help to separate the chaff from the wheat.
I don't know why any company would want to hire "tech" people who are terrified of tech and completely obstinate when it comes to utilizing it. All the people I see downplaying it take a half-assed approach at using it then disparage it when it's not completely perfect.
I started tinkering with LLMs in 2022. First use case, speak in natural english to the llm, give it a json structure, have it decipher the natural language and fill in that json structure (vacation planning app, so you talk to it about where/how you want to vacation and it creates the structured data in the app). Sometimes I'd use it for minor coding fixes (copy and paste a block into chatgpt, fix errors or maybe just ideation). This was all personal project stuff.
At my job we got LLM access in mid/late 2023. Not crazy useful, but still was helpful. We got claude code in 2024. These days I only have an IDE open so I can make quick changes (like bumping up a config parameter, changing a config bool, etc.). I almost write ZERO code now. I usually have 3+ claude code sessions open.
On my personal projects I'm using Gemini + codex primarily (since I have a google account and chatgpt $20/month account). When I get throttled on those I go to claude and pay per token. I'll often rip through new features, projects, ideas with one agent, then I have another agent come through and clean things up, look for code smells, etc. I don't allow the agents to have full unfettered control, but I'd say 70%+ of the time I just blindly accept their changes. If there are problems I can catch them on the MR/PR.
I agree about the low hanging fruit and I'm constantly shocked at the sheer amount of FUD around LLMs. I want to generalize, like I feel like it's just the mid/jr level devs that speak poorly about it, but there's definitely senior/staff level people I see (rarely, mind you) that also don't like LLMs.
I do feel like the online sentiment is slowly starting to change though. One thing I've noticed a lot of is that when it's an anonymous post it's more likely to downplay LLMs. But if I go on linkedin and look at actual good engineers I see them praising LLMs. Someone speaking about how powerful the LLMs are - working on sophisticated projects at startups or FAANG. Someone with FUD when it comes to LLM - web dev out of Alabama.
I could go on and on but I'm just ranting/venting a little. I guess I can end this by saying that in my professional/personal life 9/10 of the top level best engineers I know are jumping on LLMs any chance they get. Only 1/10 talks about AI slop or bullshit like that.
The "sudden switch" since Opus 4.5 when many were saying just a few months ago "I enjoy actual coding" but now are praising LLM's isn't a one off occurrence. I do think underneath it is somewhat motivated by fear; not for the job however but for relevance. i.e. its in being relevant to discussions, tech talks, new opportunities, etc.
I have Django project with 50 kLOC and it is pretty capable of understanding the architecture, style of coding, naming of variables, functions etc. Sometimes it excels on tasks like "replicate this non-trivial functionality for this other model and update the UI appropriately" and leaves me stunned. Sometimes it solves for me tedious and labourous "replace this markdown editor with something modern, allowing fullscreen edits of content" and does annoying mistake that only visual control shows and is not capable to fix it after 5 prompts. I feel as I am becoming tester more than a developer and I do not like the shift. Especially when I do not like to tell someone he did an obvious mistake and should fix it - it seems I do not care if it is human or AI, I just do not like incompetence I guess.
Yesterday I had to add some parameters to very simple Falcon project and found out it has not been updated for several months and won't build due to some pip issues with pymssql. OK, this is really marginal sub-project so I said - let's migrate it to uv and let's not get hands dirty and let the Claude do it. He did splendidly but in the Dockerfile he missed the "COPY server.py /data/" while I asked him to change the path... Build failed, I updated the path myself and moved on.
And then you listen to very smart guys like Karpathy who rave about Tab, Tab, Tab, while not understanding the language or anything about the code they write. Am I getting this wrong?
I am really far far away from letting agents touch my infrastructure via SSH, access managed databases with full access privileges etc. and dread the day one of my silly customers asks me to give their agent permission to managed services. One might say the liability should then be shifted, but at the end of the day, humans will have to deal with the damage done.
My customer who uses all the codebase I am mentioning here asked me, if there is a way to provide "some AI" with item GTINs and let it generate photos, descriptions, etc. including metadata they handcrafted and extracted for years from various sources. While it looks like nice idea and for them the possibility of decreasing the staff count, I caught the feeling they do not care about the data quality anymore or do not understand the problems the are brining upon them due to errors nobody will catch until it is too late.
TL;DR: I am using Opus 4.5, it helps a lot, I have to keep being (very) cautious. Wake up call 2026? Rather like waking up from hallucination.
I shortened it for anyone else that might need it
----
Software engineers are sleeping on Claude Code agents. By teaching it your conventions, you can automate your entire workflow:
Custom Skills: Generates code matching your UI library and API patterns.
Quality Ops: Automates ESLint, doc syncing, and E2E coverage audits.
Agentic Reviews: Performs deep PR checks against custom checklists.
Smart Triage: Pre-analyzes tickets to give devs a head start.
Check out the showcase repo to see these patterns in action.
Disclaimer: I don't have access to Claude Code. My employer has only granted me Claude Teams. Supposedly, they don't use my poopy code to train their models if I use my work email Claude so I am supposed to use that. If I'm not pasting code (asking general questions) into Claude, I believe I'm allowed to use whatever.
And my conclusion is: it's still not as smart as a good human programmer. It frequently got stuck, went down wrong paths, ignored what I told it to do to do something wrong, or even repeat a previous mistake I had to correct.
Yet in other ways, it's unbelievably good. I can give it a directory full of code to analyze, and it can tell me it's an implementation of Kozo Sugiyama's dagre graph layout algorithm, and immediately identify the file with the error. That's unbelievably impressive. Unfortunately it can't fix the error. The error was one of the many errors it made during previous sessions.
So my verdict is that it's great for code analysis, and it's fantastic for injecting some book knowledge on complex topics into your programming, but it can't tackle those complex problems by itself.
Yesterday and today I was upgrading a bunch of unit tests because of a dependency upgrade, and while it was occasionally very helpful, it also regularly got stuck. I got a lot more done than usual in the same time, but I do wonder if it wasn't too much. Wasn't there an easier way to do this? I didn't look for it, because every step of the way, Opus's solution seemed obvious and easy, and I had no idea how deep a pit it was getting me into. I should have been more critical of the direction it was pointing to.
You can see some of the context limits here:
https://models.dev/
If you want the full capability, use the API and use something like opencode. You will find that a single PR can easily rack up 3 digits of consumption costs.
This is one thing I've tried using it for, and I've found this to be very, very tricky. At first glance, it seems unbelievably good. The comments read well, they seem correct, and they even include some very non-obvious information.
But almost every time I sit down and really think about a comment that includes any of that more complex analysis, I end up discarding it. Often, it's right but it's missing the point, in a way that will lead a reader astray. It's subtle and I really ought to dig up an example, but I'm unable to find the session I'm thinking about.
This was with ChatGPT 5, fwiw. It's totally possible that other models do better. (Or even newer ChatGPT; this was very early on in 5.)
Code review is similar. It comes up with clever chains of reasoning for why something is problematic, and initially convinces me. But when I dig into it, the review comment ends up not applying.
It could also be the specific codebase I'm using this on? (It's the SpiderMonkey source.)
I don't think you've seen the full potential. I'm currently #1 on 5 different very complex computer engineering problems, and I can't even write a "hello world" in rust or cpp. You no longer need to know how to write code, you just need to understand the task at a high level and nudge the agents in the right direction. The game has changed.
- https://highload.fun/tasks/3/leaderboard
- https://highload.fun/tasks/12/leaderboard
- https://highload.fun/tasks/15/leaderboard
- https://highload.fun/tasks/18/leaderboard
- https://highload.fun/tasks/24/leaderboard
Yes, LLMs are very good at writing code, they are so good at writing code that they often generate reams of unmaintainable spaghetti.
When you submit to an informatics contest you don't have paying customers who depend on your code working every day. You can just throw away yesterday's code and start afresh.
Claude is very useful but it's not yet anywhere near as good as a human software developer. Like an excitable puppy it needs to be kept on a short leash.
I'm not sure what this means for the future of SWE's though yet. I don't see higher levels of staff in big large businesses bothering to do this, and at some scale I don't see founders still wanting to manage all of these agents, and processes (got better things to do at higher levels). But I do see the barrier of learning to code gone; meaning it probably becomes just like any other job.
Ah yes, well known very complex computer engineering problems such as:
* Parsing JSON objects, summing a single field
* Matrix multiplication
* Parsing and evaluating integer basic arithmetic expressions
And you're telling me all you needed to do to get the best solution in the world to these problems was talk to an LLM?
Try it again using Claude Code and a subscription to Claude. It can run as a chat window in VS Code and Cursor too.
Sure, Copilot charges 3x tokens for using Opus 4.5, but, how were you still able to use up half the allocated tokens not even one week into January?
I thought using up 50% was mad for me (inline completions + opencode), that's even worse
Deleted Comment
No doubt I could give Opus 4.5 "build be a XYZ app" and it will do well. But day to day, when I ask it "build me this feature" it uses strange abstractions, and often requires several attempts on my part to do it in the way I consider "right". Any non-technical person might read that and go "if it works it works" but any reasonable engineer will know that thats not enough.
Starting back in 2022/2023:
- (~2022) It can auto-complete one line, but it can't write a full function.
- (~2023) Ok, it can write a full function, but it can't write a full feature.
- (~2024) Ok, it can write a full feature, but it can't write a simple application.
- (~2025) Ok, it can write a simple application, but it can't create a full application that is actually a valuable product.
- (~2025+) Ok, it can write a full application that is actually a valuable product, but it can't create a long-lived complex codebase for a product that is extensible and scalable over the long term.
It's pretty clear to me where this is going. The only question is how long it takes to get there.
I don't think its a guarantee. all of the things it can do from that list are greenfield, they just have increasing complexity. The problem comes because even in agentic mode, these models do not (and I would argue, can not) understand code or how it works, they just see patterns and generate a plausible sounding explanation or solution. agentic mode means they can try/fail/try/fail/try/fail until something works, but without understanding the code, especially of a large, complex, long-lived codebase, they can unwittingly break something without realising - just like an intern or newbie on the project, which is the most common analogy for LLMs, with good reason.
Case in point: Self driving cars.
Also, consider that we need to pirate the whole internet to be able to do this, so these models are not creative. They are just directed blenders.
We've been having same progression with self driving cars and they are also stuck on the last 10% for last 5 years
And this isn't a pessimistic take! I love this period of time where the models themselves are unbelievably useful, and people are also focusing on the user experience of using those amazing models to do useful things. It's an exciting time!
But I'm still pretty skeptical of "these things are about to not require human operators in the loop at all!".
The trend is definitely here, but even today, heavily depends on the feature.
While extra useful, it requires intense iteration and human insight for > 90% of our backlog. We develop a cybersecurity product.
> The only question is how long it takes to get there.
This is the question and I would temper expectations with the fact that we are likely to hit diminishing returns from real gains in intelligence as task difficulty increases. Real world tasks probably fit into a complexity hierarchy similar to computational complexity. One of the reasons that the AI predictions made in the 1950s for the 1960s did not come to be was because we assumed problem difficulty scaled linearly. Double the computing speed, get twice as good at chess or get twice as good at planning an economy. P, NP separation planed these predictions. It is likely that current predictions will run into similar separations.
It is probably the case that if you made a human 10x as smart they would only be 1.25x more productive at software engineering. The reason we have 10x engineers is less about raw intelligence, they are not 10x more intelligent, rather they have more knowledge and wisdom.
By your logic, does it mean that engineers will never get replaced?
- (~2022) "It's so over for developers". 2022 ends with more professional developers than 2021.
- (~2023) "Ok, now it's really over for developers". 2023 ends with more professional developers than 2022.
- (~2024) "Ok, now it's really, really over for developers". 2024 ends with more professional developers than 2023.
- (~2025) "Ok, now it's really, really, absolutely over for developers". 2025 ends with more professional developers than 2024.
- (~2025+) etc.
Sources: https://www.jetbrains.com/lp/devecosystem-data-playground/#g...
I suspect that the timeline from autocomplete-one-line to autocomplete-one-app, which was basically a matter of scaling and RL, may in retrospect turn out to have been a lot faster that the next LLM to AGI step where it becomes capable of using human level judgement and reasoning, etc, to become a developer, not just a coding tool.
They're definitely better now, but it's not like ChatGPT 3.5 couldn't write a full simple todo list app in 2023. There were a billion blog posts talking about that and how it meant the death of the software industry.
Plus I'd actually argue more of the improvements have come from tooling around the models rather than what's in the models themselves.
[0] eg https://www.youtube.com/watch?v=GizsSo-EevA
I've worked on teams where multiple engineers argued about the "right" way to build something. I remember thinking that they had biases based on past experiences and assumptions about what mattered. It usually took an outsider to proactively remind them what actually mattered to the business case.
I remember cases where a team of engineers built something the "right" way but it turned out to be the wrong thing. (Well engineered thing no one ever used)
Sometimes hacking something together messily to confirm it's the right thing to be building is the right way. Then making sure it's secure, then finally paying down some technical debt to make it more maintainable and extensible.
Where I see real silly problems is when engineers over-engineer from the start before it's clear they are building the right thing, or when management never lets them clean up the code base to make it maintainable or extensible when it's clear it is the right thing.
There's always a balance/tension, but it's when things go too far one way or another that I see avoidable failures.
Gosh I am so tired with that one - someone had a case that burned them in some previous project and now his life mission is to prevent that from happening ever again, and there would be no argument they will take.
Then you get like up to 10 engineers on typical team and team rotation and you end up with all kinds of "we have to do it right because we had to pull all nighter once, 5 years ago" baked in the system.
Not fun part is a lot of business/management people "expect" having perfect solution right away - there are some reasonable ones that understand you need some iteration.
I usually resolve this by putting on the table the consequences and their impacts upon my team that I’m concerned about, and my proposed mitigation for those impacts. The mitigation always involves the other proposer’s team picking up the impact remediation. In writing. In the SOP’s. Calling out the design decision by day of the decision to jog memories and names of those present that wanted the design as the SME’s. Registered with the operations center. With automated monitoring and notification code we’re happy to offer.
Once people are asked to put accountable skin in the sustaining operations, we find out real fast who is taking into consideration the full spectrum end to end consequences of their decisions. And we find out the real tradeoffs people are making, and the externalities they’re hoping to unload or maybe don’t even perceive.
My first thought was that you probably also have different biases, priorities and/or taste. As always, this is probably very context-specific and requires judgement to know when something goes too far. It's difficult to know the "most correct" approach beforehand.
> Sometimes hacking something together messily to confirm it's the right thing to be building is the right way. Then making sure it's secure, then finally paying down some technical debt to make it more maintainable and extensible.
I agree that sometimes it is, but in other cases my experience has been that when something is done, works and is used by customers, it's very hard to argue about refactoring it. Management doesn't want to waste hours on it (who pays for it?) and doesn't want to risk breaking stuff (or changing APIs) when it works. It's all reasonable.
And when some time passes, the related intricacies, bigger picture and initially floated ideas fade from memory. Now other stuff may depend on the existing implementation. People get used to the way things are done. It gets harder and harder to refactor things.
Again, this probably depends a lot on a project and what kind of software we're talking about.
> There's always a balance/tension, but it's when things go too far one way or another that I see avoidable failures.
I think balance/tension describes it well and good results probably require input from different people and from different angles.
Hardly any of us are working on Postgres, Photoshop, blender, etc. but it's not just cope to wish we were.
It's good to think about the needs to business and the needs of society separately. Yes, the thing needs users, or no one is benefiting. But it also needs to do good for those users, and ultimately, at the highest caliber, craftsmanship starts to matter again.
There are legitimate reasons for the startup ecosystem to focus firstly and primarily on getting the users/customers. I'm not arguing against that. What I am arguing is why does the industry need to be dominated by startups in terms of the bulk of the products (not bulk of the users). It begs the question of how much societally-meaningful programming waiting to be done.
I'm hoping for a world where more end users code (vibe or otherwise) and the solve their own problems with their own software. I think that will make more a smaller, more elite software industry that is more focused on infrastructure than last-mile value capture. The question is how to fund the infrastructure. I don't know except for the most elite projects, which is not good enough for the industry (even this hypothetical smaller one) on the whole.
And when you check the work, a large portion of it was hand rolling an ORM (via an LLM). Relatively solved problem that an LLM would excel at, but also not meaningfully moving the needle when you could use an existing library. And likely just creating more debt down the road.
- I used AI-assisted programming to create a project.
Even if the content is identical, or if the AI is smart enough to replicate the project by itself, the latter can be included on a CV.
Nothing I do seems to fix that in its initial code writing steps. Only after it finishes, when I've asked it to go back and rewrite the changes, this time making only 2 or 3 lines of code, does it magically (or finally) find the other implementation and reuse it.
It's freakin incredible at tracing through code and figuring it out. I <3 Opus. However, it's still quite far from any kind of set-and-forget-it.
- How quickly is cost of refactor to a new pattern with functional parity going down?
- How does that change the calculus around tech debt?
If engineering uses 3 different abstractions in inconsistent ways that leak implementation details across components and duplicate functionality in ways that are very hard to reason about, that is, in conventional terms, an existential problem that might kill the entire business, as all dev time will end up consumed by bug fixes and dealing with pointless complexity, velocity will fall to nothing, and the company will stop being able to iterate.
But if claude can reliably reorganize code, fix patterns, and write working migrations for state when prompted to do so, it seems like the entire way to reason about tech debt has changed. And it has changed more if you are willing to bet that models within a year will be much better at such tasks.
And in my experience, claude is imperfect at refactors and still requires review and a lot of steering, but it's one of the things it's better at, because it has clear requirements and testing workflows already built to work with around the existing behavior. Refactoring is definitely a hell of a lot faster than it used to be, at least on the few I've dealt with recently.
In my mind it might be kind of like thinking about financial debt in a world with high inflation, in that the debt seems like it might get cheaper over time rather than more expensive.
Yup, I recently spent 4 days using Claude to clean up a tool that's been in production for over 7 years. (There's only about 3 months of engineering time spent on it in those years.)
We've known what the tool needed for many years, but ugh, the actual work was fairly messy and it was never a priority. I reviewed all of Opus's cleanup work carefully and I'm quite content with the result. Maybe even "enthusiastic" would be accurate.
So even if Claude can't clean up all the tech debt in a totally unsupervised fashion, it can still help address some kinds of tech debt extremely rapidly.
So maybe doing 2-3 stages makes sense. First stage needs to be functionallty correct, but you accept code smells such as leaky abstractions, verbosity and repetition. In stage 2 and 3 you eliminate all this. You could integrate this all into the initial specification; you won't even see the smelly intermediate code; it only exists as a stepping stone for the model to iteratively refine the code!
You’re talking like in the year 2026 we’re still writing code for future humans to understand and improve.
I fear we are not doing that. Right now, Opus 4.5 is writing code that later Opus 5.0 will refactor and extend. And so on.
For one, there are objectively detrimental ways to organize code: tight coupling, lots of mutable shared state, etc. No matter who or what reads or writes the code, such code is more error-prone, and more brittle to handle.
Then, abstractions are tools to lower the cognitive load. Good abstractions reduce the total amount of code written, allow to reason about the code in terms of these abstractions, and do not leak in the area of their applicability. Say Sequence, or Future, or, well, function are examples of good abstractions. No matter what kind of cognitive process handles the code, it benefits from having to keep a smaller amount of context per task.
"Code structure does not matter, LLMs will handle it" sounds a bit like "Computer architectures don't matter, the Turing Machine is proved to be able to handle anything computable at all". No, these things matter if you care about resource consumption (aka cost) at the very least.
Given that, I expect that, even if AI is writing all of the code, we will still need people around who understand it.
If AI can create and operate your entire business, your moat is nil. So, you not hiring software engineers does not matter, because you do not have a business.
Also, I've noticed failure modes in LLM coding agents when there is less clarity and more complexity in abstractions or APIs. It's actually made me consider simplifying APIs so that the LLMs can handle them better.
Though I agree that in specific cases what's helpful for the model and what's helpful for humans won't always overlap. Once I actually added some comments to a markdown file as note to the LLM that most human readers wouldn't see, with some more verbose examples.
I think one of the big problems in general with agents today is that if you run the agent long enough they tend to "go off the rails", so then you need to babysit them and intervene when they go off track.
I guess in modern parlance, maintaining a good codebase can be framed as part of a broader "context engineering" problem.
If argument is "humans and Opus 4.5 cannot maintain this, but if requirements change we can vibe-code a new one from scratch", that's a coherent thesis, but people need to be explicit about this.
(Instead this feels like the mott that is retreated to, and the bailey is essentially "who cares, we'll figure out what to do with our fresh slop later".)
Ironically, I've been Claude to be really good at refactors, but these are refactors I choose very explicitly. (Such as I start the thing manually, then let it finish.) (For an example of it, see me force-pushing to https://github.com/NixOS/nix/pull/14863 implementing my own code review.)
But I suspect this is not what people want. To actually fire devs and not rely on from-scratch vibe-coding, we need to figure out which refactors to attempt in order to implement a given feature well.
That's a very creative open-ended question that I haven't even tried to let the LLMs take a crack at it, because why I would I? I'm plenty fast being the "ideas guy". If the LLM had better ideas than me, how would I even know? I'm either very arrogant or very good because I cannot recall regretting one of my refactors, at least not one I didn't back out of immediately.
But nobody knows for sure!
Deleted Comment
Opus is great and definitely speeds up development even in larger code bases and is reasonably good at matching coding style/standard to that of of the existing code base.
In my opinion, the big issue is the relatively small context that quickly overwhelms the models when given a larger task on a large codebase.
For example, I have a largish enterprise grade code base with nice enterprise grade OO patterns and class hierarchies. There was a simple tech debt item that required refactoring about 30-40 classes to adhere to a slightly different class hierarchy. The work is not difficult, just tedious, especially as unit tests need to be fixed up.
I threw Opus at it with very precise instructions as to what I wanted it to do and how I wanted it to do it. It started off well but then disintegrated once it got overwhelmed at the sheer number of files it had to change. At some point it got stuck in some kind of an error loop where one change it made contradicted with another change and it just couldn't work itself out. I tried stopping it and helping it out but at this point the context was so polluted that it just couldn't see a way out. I'd say that once an LLM can handle more 'context' than a senior dev with good knowledge of a large codebase, LLM will be viable in a whole new realm of development tasks on existing code bases. That 'too hard to refactor this/make this work with that' task will suddenly become viable.
Yes, it's absurd but it's a better metaphor than someone with a chronic long term memory deficit since it fits into the project management framework neatly.
So this new developer who is starting today is ready to be assigned their first task, they're very eager to get started and once they start they will work very quickly but you have to onboard them. This sounds terrible but they also happen to be extremely fast at reading code and documentation, they know all of the common programming languages and frameworks and they have an excellent memory for the hour that they're employed.
What do you do to onboard a new developer like this? You give them a well written description of your project with a clear style guide and some important dos and don'ts, access to any documentation you may have and a clear description of the task they are to accomplish in less than one hour. The tighter you can make those documents, the better. Don't mince words, just get straight to the point and provide examples where possible.
The task description should be well scoped with a clear definition of done, if you can provide automated tests that verify when it's complete that's even better. If you don't have tests you can also specify what should be tested and instruct them to write the new tests and run them.
For every new developer after the first you need a record of what was already accomplished. Personally, I prefer to use one markdown document per working session whose filename is a date stamp with the session number appended. Instruct them to read the last X log files where X is however many are relevant to the current task. Most of the time X=1 if you did a good job of breaking down the tasks into discrete chunks. You should also have some type of roadmap with milestones, if this file will be larger than 1000 lines then you should break it up so each milestone is its own document and have a table of contents document that gives a simple overview of the total scope. Instruct them to read the relevant milestone.
Other good practices are to tell them to write a new log file after they have completed their task and record a summary of what they did and anything they discovered along the way plus any significant decisions they made. Also tell them to commit their work afterwards and Opus will write a very descriptive commit message by default (but you can instruct them to use whatever format you prefer). You basically want them to get everything ready for hand-off to the next 60 minute developer.
If they do anything that you don't want them to do again make sure to record that in CLAUDE.md. Same for any other interventions or guidance that you have to provide, put it in that document and Opus will almost always stick to it unless they end up overfilling their context window.
I also highly recommend turning off auto-compaction. When the context gets compacted they basically just write a summary of the current context which often removes a lot of the important details. When this happens mid-task you will certainly lose parts of the context that are necessary for completing the task. Anthropic seems to be working hard at making this better but I don't think it's there yet. You might want to experiment with having it on and off and compare the results for yourself.
If your sessions are ending up with >80% of the context window used while still doing active development then you should re-scope your tasks to make them smaller. The last 20% is fine for doing menial things like writing the summary, running commands, committing, etc.
People have built automated systems around this like Beads but I prefer the hands-on approach since I read through the produced docs to make sure things are going ok and use them as a guide for any changes I need to make mid-project.
With this approach I'm 99% sure that Opus 4.5 could handle your refactor without any trouble as long as your classes aren't so enormous that even working on a single one at a time would cause problems with the context window, and if they are then you might be able to handle it by cautioning Opus to not read the whole file and to just try making targeted edits to specific methods. They're usually quite good at finding and extracting just the sections that they need as long as they have some way to know what to look for ahead of time.
Hope this helps and happy Clauding!
"Have an agent investiate issue X in modules Y and Z. The agent should place a report at ./doc/rework-xyz-overview.md with all locations that need refactoring. Once you have the report, have agents refactor 5 classes each in parallel. Each agent writes a terse report in ./doc/rework-xyz/ When they are all done, have another agent check all the work. When that agent reports everything is okay, perform a final check yourself"
Why do they need to be replaced? Programmers are in the perfect place to use AI coding tools productively. It makes them more valuable.
That's not how I read it. I would say that it's more like "If a human no longer needs to read the code, is it important for it to be readable?"
That is, of course, based on the premise that AI is now capable of both generating and maintaining software projects of this size.
Oh, and it begs another question: are human-readable and AI-readable the same thing? If they're not, it very well could make sense to instruct the model to generate code that prioritizes what matters to LLMs over what matters to humans.
Dead Comment
I'm very lucky that I rarely have to deal with other devs and I'm writing a lot of code from scratch using whatever is the latest version of the frameworks. I understand that gives me a lot of privileges others don't have.
They get those ocassionally all the time though too. Depends on the company. In some software houses it's constant "greenfield projects", one after another. And even in companies with 1-2 pieces of main established software to maintain, there are all kinds of smaller utilities or pipelines needed.
>But day to day, when I ask it "build me this feature" it uses strange abstractions, and often requires several attempts on my part to do it in the way I consider "right".
In some cases that's legit. In other cases it's just "it did it well, but not how I'd done it", which is often needless stickness to some particular style (often a contention between 2 human programmers too).
Basically, what FloorEgg says in this thread: "There are two types of right/wrong ways to build: the context specific right/wrong way to build something and an overly generalized engineer specific right/wrong way to build things."
And you can always not just tell it "build me this feature", but tell it (high level way) how to do it, and give it a generic context about such preferences too.
When I worked at Google, people rarely got promoted for doing that. They got promoted for delivering features or sometimes from rescuing a failing project because everyone was doing the former until promotion velocity dropped and your good people left to other projects not yet bogged down too far.
This sounds a lot like the old arguments around using compilers vs hand-writing asm. But now you can tell the LLM how you want to implement the changes you want. This will become more and more relevant as we try to maintain the code it generates.
But, for right now, another thing Claude's great at is answering questions about the codebase. It'll do the analysis and bring up reports for you. You can use that information to guide the instructions for changes, or just to help you be more productive.
This thing jumped into a giant JSF (yes, JSF) codebase and started fixing things with nearly zero guidance.
Even if it could build the perfect greenfield app, as it updates the app it is needs to consider backwards compatibility and breaking changes. LLMs seem very far as growing apps. I think this is because LLMs are trained on the final outcome of the engineering process, but not on the incremental sub-commit work of first getting a faked out outline of the code running and then slowly building up that code until you have something that works.
This isn't to say that LLMs or other AI approaches couldn't replace software engineering some day, but they clear aren't good enough yet and the training sets they have currently have access to are unlikely to provide the needed examples.
People keep telling me that an LLM is not intelligence, it's simply spitting out statistically relevant tokens. But surely it takes intelligence to understand (and actually execute!) the request to "read adjacent code".
Then don't ask it to "build me this feature" instead lay out a software development process with designated human in the loop where you want it and guard rails to keep it on track. Create a code review agent to look for and reject strange abstractions. Tell it what you don't like and it's really good at finding it.
I find Opus 4.5, properly prompted, to be significantly better at reviewing code than writing it, but you can just put it in a loop until the code it writes matches the review.
Codex will regularly give me 1000+ line diffs where all my comments (I review every single line of what agents write) are basically nitpicks. "Make this shallow w/ early return, use | None instead of Optional", that sort of thing.
I do prompt it in detail though. It feels like I'm the person coming in with the architecture most of the time, AI "draws the rest of the owl."
Your hobby project, though, knock yourself out.
AI coding is a multiplier of writing speed but doesn't excuse planning out and mapping out features.
You can have reasonably engineered code if you get models to stick to well designed modules but you need to tell them.
Which is why it's so great for prototyping, because it can create something during the planning, when you haven't planned out quite what you want yet.
The number of production applications that achieve this rounds to zero
I’ve probably managed 300 brownfield web, mobile, edge, datacenter, data processing and ML applications/products across DoD, B2B, consumer and literally zero of them were built in this way
When I'm reading piles of LLM slop, I know that just reading it is already more effort than it took to write. It feels like I'm being played.
This is entirely subjective and emotional. But when someone writes something with an LLM in 5 seconds and asks me to spend hours reviewing...fuck off.
So far, Im not convinced, but lets take a look at fundmentally whats happening and why humans > agents > LLMs.
At its heart, programming is a constraint satisfaction problem.
The more constraints (requirements, syntax, standards, etc) you have, the harder it is to solve them all simultaneously.
New projects with few contributors have fewer constraints.
The process of “any change” is therefore simpler.
Now, undeniably
1) agents have improved the ability to solve constraints by iterating; eg. Generate, test, modify, etc. over raw LLm output.
2) There is an upper bound (context size, model capability) to solve simultaneous constraints.
3) Most people have a better ability to do this than agents (including claude code using opus 4.5).
So, if youre seeing good results from agents, you probably have a smaller set of constraints than other people.
Similarly, if youre getting bad results, you can probably improve them by relaxing some of the constraints (consistent ui, number of contributors, requirements, standards, security requirements, split code into well defined packages).
This will make both agents and humans more productive.
The open question is: will models continue to improve enough to approach or exceed human level ability in this?
Are humans willing to relax the constraints enough for it to be plausible?
I would say currently people clambering about the end of human developers are cluelessly deceived by the “appearance of complexity” which does not match the “reality of constraints” in larger applications.
Opus 4.5 cannot do the work of a human on code bases Ive worked on. Hell, talented humans struggle to work on some of them.
…but that doesnt mean it doesnt work.
Just that, right now, the constraint set it can solve is not large enough to be useful in those situations.
…and increasingly we see low quality software where people care only about speed of delivery; again, lowering the bar in terms of requirements.
So… you know. Watch this space. Im not counting on having a dev job in 10 years. If I do, it might be making a pile of barely working garbage.
…but I have one now, and anyone who thinks that this year people will be largely replaced by AI is probably poorly informed and has misunderstood the capabilities on these models.
Theres only so low you can go in terms of quality.
LLMs are pretty good at picking up existing codebases. Even with cleared context they can do „look at this codebase and this spec doc that created it. I want to add feature x“
Copy-paste the bug report and watch it go.
With all the due respect: a file converter for windows is glueing few windows APIs with the relevant codec.
Now, good luck working on a complex warehouse management application where you need extremely complex logic to sort the order of picking, assembling, packing on an infinite number of variables: weight, amazon prime priority, distribution centers, number and type of carts available, number and type of assembly stations available, different delivery systems and requirements for different delivery operators (such as GLE, DHL, etc) that has to work with N customers all requiring slightly different capabilities and flows, all having different printers and operations, etc, etc. And I ain't even scratching the surface of the business logic complexity (not even mentioning functional requirements) to avoid boring the reader.
Mind you, AI is still tremendously useful in the analysis phase, and can sort of help in some steps of the implementation one, but the number of times you can avoid looking thoroughly at the code for any minor issue or discrepancy is absolutely close to 0.
I've been building a somewhat-novel, complex, greenfield desktop app for 6 months now, conceived and architected by a human (me), visually designed by a human (me), implementation heavily leaning on mostly Claude Code but with Codex and Gemini thrown in the mix for the grunt work. I have decades of experience, could have built it bespoke in like 1-2 years probably, but I wanted a real project to kick the tires on "the future of our profession".
TL;DR I started with 100% vibe code simply to test the limits of what was being promised. It was a functional toy that had a lot of problems. I started over and tried a CLI version. It needed a therapist. I started over and went back to visual UI. It worked but was too constrained. I started over again. After about 10 complete start-overs in blank folders, I had a better vision of what I wanted to make, and how to achieve it. Since then, I've been working day after day, screen after screen, building, refactoring, going feature by feature, bug after bug, exactly how I would if I was coding manually. Many times I've reached a point where it feels "feature complete", until I throw a bigger dataset at it, which brings it to its knees. Time to re-architect, re-think memory and storage and algorithms and libraries used. Code bloated, and I put it on a diet until it was trim and svelte. I've tried many different approaches to hard problems, some of which LLMs would suggest that truly surprised me in their efficacy, but only after I presented the issues with the previous implementation. There's a lot of conversation and back and forth with the machine, but we always end up getting there in the end. Opus 4.5 has been significantly better than previous Anthropic models. As I hit milestones, I manually audit code, rewrite things, reformat things, generally polish the turd.
I tell this story only because I'm 95% there to a real, legitimate product, with 90% of the way to go still. It's been half a year.
Vibe coding a simple app that you just want to use personally is cool; let the machine do it all, don't worry about under the hood, and I think a lot of people will be doing that kind of stuff more and more because it's so empowering and immediate.
Using these tools is also neat and amazing because they're a force multiplier for a single person or small group who really understand what needs done and what decisions need made.
These tools can build very complex, maintainable software if you can walk with them step by step and articulate the guidelines and guardrails, testing every feature, pushing back when it gets it wrong, growing with the codebase, getting in there manually whenever and wherever needed.
These tools CANNOT one-shot truly new stuff, but they can be slowly cajoled and massaged into eventually getting you to where you want to go; like, hard things are hard, and things that take time don't get done for a while. I have no moral compunctions or philosophical musings about utilizing these tools, but IMO there's still significant effort and coordination needed to make something really great using them (and literally minimal effort and no coordination needed to make something passable)
If you're solo, know what you want, and know what you're doing, I believe you might see 2x, 4x gains in time and efficiency using Claude Code and all of his magical agents, but if your project is more than a toy, I would bet that 2x or 4x is applied to a temporal period of years, not days or months!
Dead Comment
I am in a unique situation where I work with a variety of codebases over the week. I have had no problem at all utilizing Claude Code w/ Opus 4.5 and Gemini CLI w/ Gemini 3.0 Pro to make excellent code that is indisputably "the right way", in an extremely clear and understandable way, and that is maximally extensible. None of them are greenfield projects.
I feel like this is a bit of je ne sais quoi where people appeal to some indemonstrable essence that these tools just can't accomplish, and only the "non-technical" people are foolish enough to not realize it. I'm a pretty technical person (about 30 years of software development, up to staff engineer and then VP). I think they have reached a pretty high level of competence. I still audit the code and monitor their creations, but I don't think they're the oft claimed "junior developer" replacement, but instead do the work I would have gotten from a very experienced, expert-level developer, but instead of being an expert at a niche, they're experts at almost every niche.
Are they perfect? Far from it. It still requires a practitioner who knows what they're doing. But frequently on here I see people giving takes that sound like they last used some early variant of Copilot or something and think that remains state of the art. The rest of us are just accelerating our lives with these tools, knowing that pretending they suck online won't slow their ascent an iota.
You AI hype thots/bots are all the same. All these claims but never backed up with anything to look at. And also alway claiming “you’re holding it wrong”.
And yes I do make sure it's not generating crazy architecture. It might do that.. if you let it. So don't let it.
Not in terms of knowledge. That was already phenomenal. But in its ability to act independently: to make decisions, collaborate with me to solve problems, ask follow-up questions, write plans and actually execute them.
You have to experience it yourself on your own real problems and over the course of days or weeks.
Every coding problem I was able to define clearly enough within the limits of the context window, the chatbot could solve and these weren’t easy. It wasn’t just about writing and testing code. It also involved reverse engineering and cracking encoding-related problems. The most impressive part was how actively it worked on problems in a tight feedback loop.
In the traditional sense, I haven’t really coded privately at all in recent weeks. Instead, I’ve been guiding and directing, having it write specifications, and then refining and improving them.
Curious how this will perform in complex, large production environments.
Yes, feel free to blame me for the fact that these aren’t very business-realistic.
[0] https://github.com/s-macke/coding-agent-benchmark
[1] https://github.com/s-macke/weltendaemmerung
How do you stop it from over-engineering everything?
It may end up working, but the thing is going to convolute apis and abstractions and mix patterns basically everywhere
Once more people realize how easy it is to customize and personalized your agent, I hope they will move beyond what cookie cutter Big AI like Anthropic and Google give you.
I suspect most won't though because (1) it means you have to write human language, communication, and this weird form of persuasion, (2) ai is gonna make a bunch of them lazy and big AI sold them on magic solutions that require no effort on your part (not true, there is a lot of customizing and it has huge dividends)
When I use Claude Code I find that it *can* add a tremendous amount of ability due to its ability to see my entire codebase at once, but the issue is that if I'm doing something where seeing my entire codebase would help that it blasts through my quota too fast. And if I'm tightly scoping it, it's just as easy & faster for me to use the website.
Because of this I've shifted back to the website. I find that I get more done faster that way.
This is basically all my side projects.
All the LLM coded projects I've seen shared so far[1] have been tech toys though. I've watched things pop up on my twitter feed (usually games related), then quietly go off air before reaching a gold release (I manually keep up to date with what I've found, so it's not the algorithm).
I find this all very interesting: LLMs dont change the fundamental drives needed to build successful products. I feel like I'm observing the TikTokification of software development. I dont know why people aren't finishing. Maybe they stop when the "real work" kicks in. Or maybe they hit the limits of what LLMs can do (so far). Maybe they jump to the next idea to keep chasing the rush.
Acquiring context requires real work, and I dont see a way forward to automating that away. And to be clear, context is human needs; i.e. the reasons why someone will use your product. In the game development world, it's very difficult to overstate how much work needs to be done to create a smooth, enjoyable experience for the player.
While anyone may be able to create a suite of apps in a weekend, I think very few of them will have the patience and time to maintain them (just like software development before LLMs! i.e. Linux, open source software, etc.).
[1] yes, selection bias. There are A LOT of AI devs just marketing their LLMs. Also it's DEFINITELY too early to be certain. Take everything Im saying with a one pound grain of salt.
real people get fed up of debating the same tired "omg new model 1000x better now" posts/comments from the astroturfers, the shills and their bots each time OpenAI shits out a new model
(article author is a Microslop employee)
If these articles actually provide quantitative results in a study done across an organization and provide concrete suggestions like what Google did a while ago, that would be refreshing and useful.
(Yes, this very article has strong "shill" vibes and fits the patterns above)
Would be far more useful if you provided actual verifiable information and dropped the cringe memes. Can't take seriously someone using "Microslop" in a sentence".
Sharing how you're using these tools is quite a lot of work!
But, Im also wondering if LLMs are going to create a new generation of software dev "brain rot" (to use the colloquial term), similar to short form videos.
I should mention in the gamedev world, it's quite common share because sharing is marketing, hence my perspective.
That people making startups is too bussy working to share it on HN or that AI is useless in real projects.
Sad i had to scroll so far down to get some fitting description of why those projects all die. Maybe it's not just me leaving all social networks even HN because well you may not talk to 100% bots but you sure talk to 90% of people that talk to models a lot instead of using them as a tool.
I think it's also rewarding to just be able to build something for yourself, and one benefit of scratching your own itch is that you don't have to go through the full effort of making something "production ready". You can just build something that's tailed specifically to the problem you're trying to solve without worrying about edge cases.
Which is to say, you're absolutely right :).
I interpret it more as spooked silence
Stopping there is just fine if you're doing it as a hobby. I love to do this to test out isolated ideas. I have dozens of RPGs in this state, just to play around with different design concepts from technical to gameplay.
Given the enthusiasm of our ruling class towards automating software development work, it may make sense for a software engineer to publicly signal how much onboard as a professional they are with it.
But, I've seen stranger stuff throughout my professional life: I still remember people enthusiastically defending EJB 2.1 and xdoclet as perfectly fine ways of writing software.
I hacked together a Swift tool to replace a Python automation I had, merged an ARM JIT engine into a 68k emulator, and even got a very decent start on a synth project I’ve been meaning to do for years.
What has become immensely apparent to me is that even gpt-5-mini can create decent Go CLI apps provided you write down a coherent spec and review the code as if it was a peer’s pull request (the VS Code base prompts and tooling steer even dumb models through a pretty decent workflow).
GPT 5.2 and the codex variants are, to me, every bit as good as Opus but without the groveling and emojis - I can ask it to build an entire CI workflow and it does it in pretty much one shot if I give it the steps I want.
So for me at least this model generation is a huge force multiplier (but I’ve always been the type to plan before coding and reason out most of the details before I start, so it might be a matter of method).
I had to dig through source code to confirm whether those features actually existed. They don't, so the CLI tools GPT recommended aren't actually applicable to my use case.
Yesterday, it hallucinated features of WebDav clients, and then talked up an abandoned and incomplete project on GitHub with a dozen stars as if it was the perfect fit for what I was trying to do, when it wasn't.
I only remember these because they're recent and CLI related, given the topic, but there are experiences like this daily across different subjects and domains.
If so then it should have realized its mistake when it tried to run those CLI commands and saw the error message. Then it can try something different instead.
If you were using a regular chat interface and expecting it to know everything without having an environment to try things out then yeah, you're going to be disappointed.
SimonW's approach of having a suite of dynamic tools (agents) grind out the hallucinations is a big improvement.
In this case expressing the feeback validation and investing in the setup may help smooth these sharp edges.
I think the following things are true now:
- Vibe Coding is, more than ever, "autopilot" in the aviation sense, not the colloquial sense. You have to watch it, you are responsible, the human has do run takeoff/landing (the hard parts), but it significantly eases and reduces risk on a bulk of the work.
- The gulf of developer experience between today's frontier tooling and six months ago is huge. I pushed hard to understand and use these tools throughout last year, and spent months discouraged--back to manual coding. Folks need to re-evaluate by trying premium tools, not free ones.
- Tooling makers have figured out a lot of neat hacks to work around the limitations of LLMs to make it seem like they're even better than they are. Junie integrates with your IDE, Antigravity has multiple agents maintaining background intel on your project and priorities across chats. Antigravity also compresses contexts and starts new ones without you realizing it, calls to sub-agents to avoid context pollution, and other tricks to auto-manage context.
- Unix tools (sed, grep, awk, etc.) and the git CLI (ls-tree, show, --stat, etc.) have been a huge force-multiplier, as they keep the context small compared to raw ingestion of an entire file, allowing the LLMs to get more work done in a smaller context window.
- The people who hire programmers are still not capable of Vibe Coding production-quality web apps, even with all these improvements. In fact, I believe today this is less of a risk than I feared 10 months ago. These are advanced tools that need constant steering, and a good eye for architecture, design, developer experience, test quality, etc. is the difference between my vibe coded Ruby [0] (which I heavily stewarded) and my vibe coded Rust [1] (I don't even know what borrow means).
[0]: https://git.sr.ht/~kerrick/ratatui_ruby/tree/stable/item/lib
[1]: https://git.sr.ht/~kerrick/ratatui_ruby/tree/stable/item/ext...
1) It often overcomplicates things for me. After I refactor its code, it's usually half the size and much more readable. It often adds unnecessary checks or mini-features 'just in case' that I don't need.
2) On the other hand, almost every function it produces has at least one bug or ignores at least one instruction. However, if I ask it to review its own code several times, it eventually finds the bugs.
I still find it very useful, just not as a standalone programming agent. My workflow is that ChatGPT gives me a rough blueprint and I iterate on it myself, I find this faster and less error-prone. It's usually most useful in areas where I'm not an expert, such as when I don't remember exact APIs. In areas where I can immediately picture the entire implementation in my head, it's usually faster and more reliable to write the code myself.
(Cue the “you’re holding it wrong meme” :))
This experience for me is current but I do not normally use Opus so perhaps I should give it a try and figure out if it can reason around problems I myself do not foresee (for example a browser JS API quirk that I had never seen).
But yes, the more specific you get and the more moving pieces you have, the more you need to break things down into baby steps. If you don’t just need it to make A work, but to make it work together with B and C. Especially given how eager Claude is to find cheap workarounds and escape hatches, botching things together in any way seemingly to please the prompter as fast as possible.
Time will tell what happens, but if programming becomes "prompt engineering", I'm planning on quitting my job and pivoting to something else. It's nice to get stuff working fast, but AI just sucks the joy out of building for me.
Trying to not feel the pressure/anxiety from this, but every time a new model drops there is this tiny moment where I think "Is it actually different this time?"
On the hobby side (music) I don't feel the pressure as bad but that's because I don't have any commercial aspirations, it's purely for fun.
I hear you but I think many companies will change the role ; you'll get the technical ownership + big chunks of the data/product/devops responsibility. I'm speculating but I think one person can take that on himself with the new tools and deliver tremendous value. I don't know how they'll call this new role though, we'll see.
I enjoy the plan, think, code cycle - it's just fun.
My brain has problems with not understanding how the thing I'm delivering works, maybe I'll get used to it.
A program, by definition, is analyzable and repeatable, whereas prompting is anything but that.
But it is also very early to say, maybe the next iteration of tools will completely change my perspective, I might enjoy it some day!
They're going to find a lot of stuff to fix.
This reminds me of that example where someone asked an agent to improve a codebase in a loop overnight and they woke up to 100,000 lines of garbage [0]. Similarly you see people doing side-by-side of their implementation and what an AI did, which can also quite effectively show how AI can make quite poor architecture decisions.
This is why I think the “plan modes” and spec driven development are so important effective for agents, because it helps to avoid one of their main weaknesses.
[0] https://gricha.dev/blog/the-highest-quality-codebase
I had this long discussion today with a co-worker about the merits of detailed queries with lots of guidance .md documents, vs just asking fairly open ended questions. Spelling out in great detail what you want, vs just generally describing what you want the outcomes to be in general then working from there.
His approach was to write a lot of agent files spelling out all kinds of things like code formatting style, well defined personas, etc. And here's me asking vague questions like, "I'm thinking of splitting off parts of this code base into a separate service, what do you think in general? Are there parts that might benefit from this?"
So you gave it an poorly defined task, and it failed?
I wouldn't be surprised to find out that they will find issues infinitely, if looped with fixes.
Have you tried the planning mode? Ask it to review the codebase and identify defects, but don't let it make any changes until you've discussed each one or each category and planned out what to do to correct them. I've had it refactor code perfectly, but only when given examples of exactly what you want it to do, or given clear direction on what to do (or not to do).