After iterating on that for a while, I did a bunch manually (90) and then gave the LLM a list of pull requests as examples, and asked _it_ to write the prompt. It still failed.
Finally, I broke the problem up and started to ask it to generate tools to perform each step. It started to make progress - each execution gave me a new checkpoint so it wouldn't make new mistakes.
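To give a flavour of what those per-step tools looked like, here's a minimal sketch of one of them for the flag cleanup described further down: a script that mechanically replaces reads of a retired flag with True. Everything in it is hypothetical - the flags.is_enabled("...") read pattern, the src/ layout, and the enable_new_checkout flag name are stand-ins, not the real codebase.

```python
#!/usr/bin/env python3
"""Sketch of one "step tool": replace reads of a retired flag with True.

Hypothetical example -- the flag-reading pattern, source layout, and flag
name are assumptions, not real code.
"""
import re
import sys
from pathlib import Path

def inline_flag_reads(root: str, flag_name: str) -> list[Path]:
    """Rewrite flags.is_enabled("<flag_name>") call sites to the literal True."""
    pattern = re.compile(r'flags\.is_enabled\(\s*"' + re.escape(flag_name) + r'"\s*\)')
    changed = []
    for path in Path(root).rglob("*.py"):
        text = path.read_text(encoding="utf-8")
        new_text, count = pattern.subn("True", text)
        if count:
            path.write_text(new_text, encoding="utf-8")
            changed.append(path)
    return changed

if __name__ == "__main__":
    flag = sys.argv[1] if len(sys.argv) > 1 else "enable_new_checkout"
    for path in inline_flag_reads("src", flag):
        print(f"rewrote {path}")
```

Each tool like that is a checkpoint: its output is deterministic and easy to review, so one step can be verified before moving on to the next.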
Or, any of:
- the problem was too big in scope and needed a stepped plan to refer to and execute step by step
- your instructions weren't clear enough
- the context you provided was missing something crucial it couldn't find agentically, or build knowledge of (in which case, document that part of the codebase first)
- your rules/AGENTS.md/CLAUDE.md need some additions or tweaking
- you may need a more powerful model to plan the implementation first
Just throwing the attempt away and moving on is often the wrong choice, and it means you'll get better at using these tools more slowly. If you're still within the "time it would have taken me to do it myself" window, think about what caused it to go off the rails or fail spectacularly and give it another go (not a follow-up in the same chat: throw away the current results and chat, and try again with the above in mind).
I feel good because real humans are using what I've built and they like it.
We use feature flags. However, cleaning them up is rarely done. It typically takes me ~3 minutes to clean one up.
To clean up the flag:
1) delete the test where the flag is off
2) delete all the code setting the flag to on
3) replace anything reading the value of the flag with true
4) resolve all "true" expressions, cleaning up ifs and now-constant parameters
5) prep a pull request and send it for review
This is all fully supported by the indexing and refactoring tooling in my IDE.
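To make those steps concrete, here's roughly what one call site looks like before and after - a minimal Python sketch with invented names (the flag client, flag name, and checkout functions aren't from the real codebase):

```python
# Hypothetical names throughout: the flag client, flag name, and flow
# functions below are invented for illustration.

def new_checkout_flow(cart):
    return f"new checkout: {cart}"

def legacy_checkout_flow(cart):
    return f"legacy checkout: {cart}"

# Before cleanup: the call site reads the flag (step 3 replaces this read
# with True; steps 1-2 have already deleted the flag-off test and the code
# that turns the flag on).
def checkout_before(cart, flags):
    if flags.is_enabled("enable_new_checkout"):
        return new_checkout_flow(cart)
    return legacy_checkout_flow(cart)

# After step 4: the constant "if True" is resolved, so the dead branch and
# the now-unused flags parameter disappear.
def checkout_after(cart):
    return new_checkout_flow(cart)
```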
However, when I prompted the LLM with those steps (and examples), it failed. Over and over again. It would delete tests where the value was true, forget to resolve the expressions, and try to run grep/find across a ginormous codebase.
If this was an intern, I would only have to correct them once. I would correct the LLM, and then it would make a different mistake. It wouldn't follow the instructions, and it would use tools I told it to not use.
It took 5-10 minutes to make the change, and then required me to spend another couple of minutes fixing things. At that point it wasn't saving me any time.
I've got a TONNE of low-hanging fruit that I can't give to an intern, but could easily sic a tool as capable as an intern on. This was not that.
I've been using Cursor for the last few months and have noticed that, for tasks like this, it helps to give examples of the code you're looking for, tell it more or less how the feature flags are implemented, and have it spit out a list of files it would modify first.
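One cheap way to get that file list yourself (rather than letting the agent grep a huge tree) is to precompute it and paste it into the prompt. A hypothetical helper, assuming the repo is a git checkout and the flag is referenced by its literal name:

```python
#!/usr/bin/env python3
"""Print the files that mention a feature flag, to paste into the prompt.

Hypothetical helper -- assumes a git checkout and that the flag is referenced
by its literal name (enable_new_checkout is a made-up default).
"""
import subprocess
import sys

def files_mentioning(flag_name: str) -> list[str]:
    # `git grep -l -F` lists each file containing the literal string, one per line.
    # check=False because git grep exits non-zero when there are no matches.
    result = subprocess.run(
        ["git", "grep", "-l", "-F", flag_name],
        capture_output=True, text=True, check=False,
    )
    return [line for line in result.stdout.splitlines() if line]

if __name__ == "__main__":
    flag = sys.argv[1] if len(sys.argv) > 1 else "enable_new_checkout"
    for path in files_mentioning(flag):
        print(path)
```

Pasting that output into the prompt keeps the agent from running grep/find across the whole codebase itself.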
Going from "thing in my head that I need to pay someone $100/h to try" to "thing a user can literally use in 3 minutes that will make that hypothetical-but-nonexistent $100/h person cry"... like there is way more texture of roles in that territory than your punchy comment gives credit for. No one cares if it's maintainable if they now know what's possible, and that matters 1000x more than future maintenance concerns. People spend years working up to this step that someone can now simply jank out* in 3 minutes.
* to jank out. verb. 1. to crank out via vibe-coding, in the sense of productive output.