I'm still calibrating myself on the size of task that I can get Claude Code to do before I have to intervene.
I call this problem the "Goldilocks" problem. The task has to be large enough that it outweighs the time necessary to write out a sufficiently detailed specification AND to review and fix the output. It has to be small enough that Claude doesn't get overwhelmed.

The issue is that writing a "sufficiently detailed specification" is task dependent. Sometimes a single sentence is enough, other times a paragraph or two, and sometimes a couple of pages is necessary. The "review and fix" phase is likewise task dependent and hard to predict: I can usually estimate the spec time, but the review-and-fix time is a dice roll that depends on the output of the agent.

And the "overwhelming" threshold is just as unclear. Sometimes Claude Code can crush significant tasks in one shot. Other times it gets stuck or lost. I haven't fully developed an intuition yet for how to differentiate these.

What I can say is that this is an entirely new skill. It isn't like architecting large systems for human development. It isn't like programming. It is its own thing.
This is why I'm still dubious about the overall productivity increase we'll see from AI once all the dust settles.
I think it's undeniable that in narrow, well-controlled use cases the AI does give you a bump. Once you move beyond that, though, the time you have to spend on cleanup starts to seriously eat into any efficiency gains.
And if you're in a domain you know very little about, I think any use case beyond helping you learn a little quicker is a net negative.
Absolutely. And what I find fascinating is that this experience is highly personal. I've read probably 876 different “How I code with LLMs” posts and I can honestly say not a single thing I read and tried (and I tried A LOT) “worked” for me…
> I haven't fully developed an intuition yet for how to differentiate these.

The big issue is that, even though there is a logical side to it, part of it is adapting to a closed system that can change under your feet. New model, new prompt, there goes your practice.
> What I can say is that this is an entirely new skill. It isn't like architecting large systems for human development. It isn't like programming. It is its own thing.
It's management!
I find myself asking very similar questions to you: how much detail is too much? How likely is this to succeed without my assistance? If it does succeed, will I need to refactor? Am I wasting my time delegating or should I just do it?
It's almost identical to when I delegate a task to a junior... only the feedback cycle of "did I guess correctly here" is a lot faster... and unlike a junior, the AI will never get better from the experience.
My experience is that AI-written prompts are overly long and overly specific. I prefer to write the instructions myself and then direct the LLM to ask clarifying questions or provide an implementation plan. Depending on the size of the change, I go 1-3 rounds of clarifications until Claude indicates it is ready and provides a plan that I can review.
I do this in a task_description.md file and I include the clarifications in their own section (the files follow a task.template.md format).
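The thread doesn't show what that template actually contains, so the following is only a hypothetical sketch: a small Python scaffold that writes a new task file with the kind of sections the workflow above implies (goal, context, clarifications, plan). The section names and file paths are assumptions for illustration, not the commenter's real template.

```python
#!/usr/bin/env python3
"""Scaffold a task_description.md in the spirit of a task.template.md workflow.

The section layout below is a guess for illustration; the commenter's actual
template is not shown in the thread.
"""
from datetime import date
from pathlib import Path
import sys

TEMPLATE = """\
# Task: {title}
Date: {today}

## Goal
<one or two sentences describing what "done" looks like>

## Context
<relevant files, constraints, prior decisions>

## Clarifications
<questions the agent asked and the answers given, one bullet per round>

## Implementation plan
<filled in by the agent and reviewed by a human before any code is written>
"""

def main() -> None:
    title = sys.argv[1] if len(sys.argv) > 1 else "untitled task"
    out = Path("task_description.md")
    out.write_text(TEMPLATE.format(title=title, today=date.today().isoformat()))
    print(f"Wrote {out}")

if __name__ == "__main__":
    main()
```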
This illustrates a fundamental truth of maintaining software with LLMs: While programmers can use LLMs to produce huge amounts of code in a short time, they still need to read and understand it. It is simply not possible to delegate understanding a huge codebase to an AI, at least not yet.
In my experience, the real "pain" of programming lies in forcing yourself to absorb a flood of information and connect the dots. Writing code is, in many ways, like taking a walk: you engage in a cognitively light activity that lets ideas shuffle, settle, and mature in the background.

When LLMs write all the code for you, you lose that essential mental rest: the quiet moments where you internalize concepts, spot hidden bugs, and develop a mental map of the system.
Has anyone else had the experience of dreading a session with Claude? His personality is often chirpy and annoying; he's always got positive things to say; and working with him as the main code author actually takes away one of the joys of being a programmer -- the ability to interact with a system that is _not_ like people, one that is rigid and deterministic, not all soft and mushy like human beings.
When I write a piece of code that is elegant, efficient, and -- "right" -- I get a dopamine rush, like I finished a difficult crossword puzzle. Seems like that joy is going to go away, replaced by something more akin to developing a good relationship with a slightly quirky colleague who happens to be real good (and fast) at some things -- especially things management likes, like N LOC per week -- but this colleague sucks up to everyone, always thinks they have the right answer, often seems to understand things on a superficial level, and oh -- works for $200 / month...
That's a great insight -- the problem with LLMs is that they write code and elegant prose for us, so we have more time to do chores. I want it the other way around!!!
Set up Playwright via MCP, have a detailed spec, and tell Claude to generate a detailed test plan for every story in the spec, then keep iterating on a test -> fix -> ... loop until every single component has been fully tested. If you get Claude to write all the components (usually by subfolder) out to todos, there's a good chance it'll go >1 hour before it tries to stop, and if you have an anti-stopping hook it can go quite a bit longer.
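For the "anti-stopping hook" part, here is a minimal sketch of one way it might be wired up, assuming Claude Code's hooks feature: a script registered as a "Stop" hook in .claude/settings.json that refuses to let the session end while unchecked items remain in a todo file. The todos.md path, the "- [ ]" checkbox convention, the stop_hook_active field, and the exit-code-2-blocks-the-stop behavior are all assumptions to verify against the current docs, not anything stated in the thread.

```python
#!/usr/bin/env python3
"""Sketch of an "anti-stopping" Stop hook for Claude Code.

Assumed behavior (check the current hooks docs): the hook payload arrives as
JSON on stdin, exit code 2 blocks the stop, and stderr is fed back to the model.
"""
import json
import sys
from pathlib import Path

TODO_FILE = Path("todos.md")  # hypothetical checklist the agent keeps updated

def main() -> None:
    event = json.load(sys.stdin)

    # If we are already continuing because of this hook, allow the stop
    # so we don't loop forever (field name is an assumption).
    if event.get("stop_hook_active"):
        sys.exit(0)

    if not TODO_FILE.exists():
        sys.exit(0)

    unchecked = [line for line in TODO_FILE.read_text().splitlines()
                 if line.lstrip().startswith("- [ ]")]
    if unchecked:
        print(f"{len(unchecked)} todo item(s) still unchecked in {TODO_FILE}; "
              "keep going through the test -> fix loop.", file=sys.stderr)
        sys.exit(2)  # block the stop; the message above goes back to the model

if __name__ == "__main__":
    main()
```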
Programming and vibe coding are two entirely separate disciplines. The process of producing software, and the end result, are wildly different between them.
People who vibe code don't care about the code, but about producing something that delivers value, whatever that may be. Code is just an intermediate artifact to achieve that goal. ML tools are great for this.
People who program care about the code. They want to understand how it works, what it does, in addition to whether it achieves what they need. They may also care about its quality, efficiency, maintainability, and other criteria. ML tools can be helpful for programming, but they're not a panacea. There is no shortcut for building robust, high quality software. A human still needs to understand whatever the tool produces, and ensure that it meets their quality criteria. Maybe this will change, and future generations of this tech will produce high quality software without hand-holding, but frankly, I wouldn't bet on the current approaches to get us there.
When building a project from scratch using AI, it can be tempting to give in to the vibe, ignore the structure/architecture, and let it evolve naturally. This is a bad idea when humans do it, and it's also a bad idea when LLM agents do it. You have to be considering architecture, dataflow, etc. from the beginning, and always stay on top of it without letting it drift.
I have tried READMEs scattered through the codebase but I still have trouble keeping the agent aware of the overall architecture we built.
This should be called the eternal, unbearable slowness of code review, because the author writes that the AI actually churns out code extremely rapidly. The (hopefully capable, attentive, careful) human is the bottleneck here, as it should be.
> ... I’ll keep pulling PRs locally, adding more git hooks to enforce code quality, and zooming through coding tasks—only to realize ChatGPT and Claude hallucinated library features and I now have to rip out Clerk and implement GitHub OAuth from scratch.
I don't get this: how many git hooks do you need to identify that Claude has hallucinated a library feature? Wouldn't a single hook running your tests identify that?
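For what it's worth, that single hook can be tiny. A minimal sketch, assuming a pytest-based suite and installation as an executable .git/hooks/pre-commit; a hallucinated library feature usually fails the moment it's imported or called, so even a smoke-level run tends to catch it:

```python
#!/usr/bin/env python3
"""Minimal pre-commit hook: abort the commit if the test suite fails.

Save as .git/hooks/pre-commit and mark it executable. Assumes pytest;
"-x -q" stops at the first failure and keeps the output short.
"""
import subprocess
import sys

result = subprocess.run(["pytest", "-x", "-q"])
if result.returncode != 0:
    print("pre-commit: tests failed; commit aborted.", file=sys.stderr)
    sys.exit(1)
```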
I don't have a ton of tests. From what I've seen, Claude will often just update the tests to no-op so tests passing isn't trustworthy.
My workflow is often to plan with ChatGPT, and what I was getting at here is that ChatGPT can often hallucinate features of 3rd-party libraries. I usually dump the plan from ChatGPT straight into Claude Code and only look at the details when I'm testing.

That said, I've become more careful in auditing the plans so I don't run into issues like this.
Tell Claude to use a code-review sub agent after every significant change set, tell the sub agent to run the tests and evaluate the change set, don't tell it that Claude wrote the code, and give it strict review instructions. Works like a charm.
What bothers me is this: Claude & I work hard on a subtle issue; eventually (often after wiping Claude's memory clean and trying again) we collectively come to a solution that works.
But the insights gleaned from that battle are (for Claude) lost forever as soon as I start on a new task.
The way LLMs (fail to) handle memory and in-situ learning (beyond prompt engineering and working within the context window) is just clearly deficient compared to how human minds work.
The reason these tools haven't achieved greatness yet is that 99% of us are struggling at work with domain knowledge - how does this particular project work in the context of this company? If AI tools are unable to "learn the ropes" at a specific company over time, they will never be better than a mid-senior developer on day 1 at the company. They NEED to be able to learn. They NEED to be able to have long-term memory and to read entire codebases.
And the thing is, all these “memory features” don't help either, because the “memory” is either too specific to the task at hand and not generalizable, or it is time-bound and therefore won't be useful later (e.g. “user is searching for a new waterbed with flow master manifolds”). And rarely can you directly edit the memory, so you are stuck with a bunch of potential nonsense polluting your context (with little control over when or why the memory is presented).
Yes, it's a common problem. There are 'memory' plugins that you can use to collect insights and feed them back to the LLM, but I tend just to update an AGENTS.md file (or equivalent).
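As an illustration of that lightweight approach, here is a sketch of a helper that appends a dated "lesson learned" bullet to AGENTS.md so the next session picks it up; run it at the end of a painful session with whatever insight you'd otherwise lose. The file name and section heading are conventions assumed for the example, not something any tool requires.

```python
#!/usr/bin/env python3
"""Append a dated "lesson learned" bullet to AGENTS.md for future sessions.

The AGENTS.md name and the "## Lessons learned" heading are conventions
assumed for this sketch, not something Claude Code or other agents require.
"""
from datetime import date
from pathlib import Path
import sys

MEMORY_FILE = Path("AGENTS.md")
HEADING = "## Lessons learned"

def add_lesson(text: str) -> None:
    existing = MEMORY_FILE.read_text() if MEMORY_FILE.exists() else ""
    if HEADING not in existing:
        existing = (existing.rstrip() + "\n\n" if existing.strip() else "") + HEADING
    # Appends at the end of the file; assumes the heading is the last section.
    MEMORY_FILE.write_text(existing.rstrip() + f"\n- {date.today()}: {text}\n")

if __name__ == "__main__":
    add_lesson(" ".join(sys.argv[1:]) or
               "example: prefer the existing retry helper over ad-hoc loops")
```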
I've no idea why, but the phrase "it's addicting" is really annoying; I'm pretty certain it should be "it's addictive". I've started seeing it everywhere. (Note, I haven't completely lost my mind, it's in that article.)
> Once you move beyond that, though, the time you have to spend on cleanup starts to seriously eat into any efficiency gains.

You articulated what I was wrestling with in the post perfectly.
> ...this colleague sucks up to everyone, always thinks they have the right answer, often seems to understand things on a superficial level, and oh -- works for $200 / month...

Shades of outsourcing to other continents...

Writing code is my favorite part of the job; why would I outsource it so I can spend even more time reading and QAing?
But AI reviewers can do little beyond checking coding standards.
Initially I would barely read any of the generated code, and as my project has grown in size, I have approached the limits of that approach.
Often because Claude Code makes very poor architectural choices.
> Wouldn't a single hook running your tests identify that?

Works every time:

• Good news! The code is compiling successfully (the errors shown are related to an existing macro issue, not our new code).

When in fact, it managed to insert 10 compilation errors that were not at all related to any macros.
> The way LLMs (fail to) handle memory and in-situ learning (beyond prompt engineering and working within the context window) is just clearly deficient compared to how human minds work.

I dunno.