Claude is really good at specific analysis, but really terrible at open-ended problems.
"Hey claude, I get this error message: <X>", and it'll often find the root cause quicker than I could.
"Hey claude, anything I could do to improve Y?", and it'll struggle beyond the basics that a linter might suggest.
It enthusiastically suggested a library for <work domain> and was all "Recommended" about it, but when I pointed out that the library had been considered and rejected because <issue>, it understood and wrote up why that library suffered from that issue and why it was therefore unsuitable.
There's a significant blind-spot in current LLMs related to blue-sky thinking and creative problem solving. It can do structured problems very well, and it can transform unstructured data very well, but it can't deal with unstructured problems very well.
That may well change, so I don't want to embed that thought too deeply into my own priors, because the LLM space seems to evolve rapidly. I wouldn't want to find myself blind to the progress because I write it off from a class of problems.
But right now, the best way to help an LLM is to have a deep understanding of the problem domain yourself, and just leverage it to do the grunt-work that you'd find boring.
That's why you treat it like a junior dev. You do the fun stuff of supervising the product, overseeing design and implementation, breaking up the work, and reviewing the outputs. It does the boring stuff of actually writing the code.
I am phenomenally productive this way, I am happier at my job, and its quality of work is extremely high as long as I occasionally have it stop and self-review its progress against the style principles articulated in its AGENTS.md file. (As it tends to forget a lot of rules like DRY)
Yeah at this point I basically have to dictate all implementation details: do this, but do it this specific way, handle xyz edge cases by doing that, plug the thing in here using that API. Basically that expands 10 lines into 100-200 lines of code.
However if I just say “I have this goal, implement a solution”, chances are that unless it is a very common task, it will come up with a subpar/incomplete implementation.
What’s funny to me is that complexity has inverted for some tasks: it can ace a 1,000-line ML model for a general task I give it, yet will completely fail to come up with a proper solution for a 2D geometric problem that mostly involves high-school-level maths and could be solved in 100 lines.
It may seem decent until you look closer. Just like with a junior dev, you should always review the code very carefully; you can absolutely not trust it. It's not bad at trivial stuff, but it almost always fails when things get more complex, and unlike a junior dev, it does not tell you when things get too complex for it.
In the same breath (same paragraph) you state two polar opposites about working with AI:
- I am phenomenally productive
- "as long as I occasionally have it stop" and "it tends to forget a lot of rules like DRY"
I don't see how you can claim to be "phenomenally productive" when working with a tool you have to babysit because it forgets your instructions all the time.
If it was the "junior dev" you also mention, I suspect you would very quickly invite the "junior dev" to find a job elsewhere.
Cool cool cool. So if you use LLMs as junior devs, let me ask you how future awesome senior devs like you will come around? From WHAT job experience? From what coding struggle?
> That's why you treat it like a junior dev. You do the fun stuff of supervising the product, overseeing design and implementation, breaking up the work, and reviewing the outputs. It does the boring stuff of actually writing the code.
I am so tired of this analogy. Have the people who say this never worked with a junior dev before? If you treat your junior devs as brainless code monkeys who only exist to type out your brilliant senior developer designs and architectures instead of, you know, human beings capable of solving problems, 1) you're wasting your time, because a less experienced dev is still capable of solving problems independently, 2) the juniors working under you will hate it because they get no autonomy, and 3) the juniors working under you will stay junior because they have no opportunity to learn--which means you've failed at one of your most important tasks as a senior developer, which is mentorship.
A few weeks ago I'd have disagreed with you, but recently I've been struggling with concentration and motivation, and now I'm trying to embrace coding with AI. I guide it pretty strictly, try to stick with pure functions, and always read the output thoroughly. For a couple of pieces requiring some carefulness, I wrote them in executable pseudocode (Python) and made the AI translate them to the more boilerplate-y target language.
I don't know if I'm any faster than I would be if I was motivated, but I'm A LOT more productive in my current state. I still hope for the next AI winter though.
I enjoy finding the problem and then telling Claude to fix it: specifying the function and the problem, then going to get a coffee from the breakroom and seeing it finished when I return. The junior dev would have questions if I did that. Claude just fixes it.
I wonder if DRY is still a principle worth holding onto in the AI coding era. It probably is, but this feels like enough of a shift in coding design that re-evaluating principles designed for human-only coding might be worth the effort.
TBH I think its ability to structure unstructured data is what makes it a powerhouse tool, and there is so much juice to squeeze there that we can make process improvements for years even if it doesn't get any better at general intelligence.
If I had a PDF printout of a table, the workflow I used to need to get that back into a table data structure for automation was hard (annoying): dedicated OCR tools with limitations on inputs, and multiple models within those tools for the different ways the paper the table was printed on might be formatted. It took hours for a new input format.
Now I can take a photo of something with my phone and get a data table in about 30 seconds.
People seem so desperate to outsource their thinking to these models and operate at the limits of their capability, but I have been having a blast using them to cut through so much tedium: things that weren't unsolved problems, but that required enough specialized tooling and custom config to be left alone unless you really had to.
This fits into what you're saying about using it to do the grunt work I find boring, I suppose, but it feels like a bit more than that: it has opened a lot of doors to spaces where the grunt work wasn't worth doing for the end result previously, but now it is.
> There's a significant blind-spot in current LLMs related to blue-sky thinking and creative problem solving. It can do structured problems very well, and it can transform unstructured data very well, but it can't deal with unstructured problems very well.
While this is true in my experience, the opposite is not true. LLMs are very good at helping me go through a structured process of thinking about architectural and structural design and then helping build a corresponding specification.
More specifically, the "idea honing" part of this proposed process works REALLY well: https://harper.blog/2025/02/16/my-llm-codegen-workflow-atm/
This:
Each question should build on my previous answers, and our end goal is to have a detailed specification I can hand off to a developer. Let’s do this iteratively and dig into every relevant detail. Remember, only one question at a time.
I've checked the linked page and there's nothing about even learning the domain or learning the tech platform you're going to use. It's all blind faith, just a small step above copying stuff from GitHub or StackOverflow and pushing it to prod.
This is it. It doesn't replace the higher level knowledge part very well.
I asked Claude to fix a pet peeve of mine: spawning a second process inside an existing Wine session (pretty hard if you use umu, since it runs in a user namespace). I asked Claude to write me a Python server to spawn another process to pass through a file handler "in Proton", and it proceeded through a long loop of trying to find a way to launch into an existing Wine session from Linux, with tons of environment variables that didn't exist.
Then I specified "server to run in Wine using Windows Python" and it got more things right. Except it tried to use named pipes for IPC. Which, surprise surprise, doesn't work for talking to the Linux piece. Only after I specified "local TCP socket" did it start to go right. Had I written all those technical constraints and made the design decisions in the first message, it'd have been a one-hit success.
Very easy to write it off when it spins out on the open-ended problems, without seeing just how effective it can be once you zoom in.
Of course, zooming in that far gives back some of the promised gains.
> without seeing just how effective it can be once you zoom in.
The love/hate flame war continues because the LLM companies aren't selling you on this. The hype is all about "this tech will enable non-experts to do things they couldn't do before" not "this tech will help already existing experts with their specific niche," hence the disconnect between the sales hype and reality.
If OpenAI, Anthropic, Google, etc. were all honest and tempered their own hype and misleading marketing, I doubt there would even be a flame war. The marketing hype is "this will replace employees" without the required fine print of "this tool still needs to be operated by an expert in the field and not your average non technical manager."
claude2() {
  claude "$(claude "Generate a prompt and TODO list that works towards this goal: <goal>$*</goal>" -p)"
}
$ claude2 pls give ranked ideas for make code better
until works { try again }
The stuff is getting so cheap and so fast... a sufficient increment in quantity can produce a phase change in quality.
Back in the day, we would just do this with a search engine.
>> But right now, the best way to help an LLM is to have a deep understanding of the problem domain yourself, and just leverage it to do the grunt-work that you'd find boring.
This is exactly how I use it. I prefer Gemini 3 personally.
I try to learn as much as I can about different architectures, usually by reading books or other implementations and coding first principles to build a mental model. I apply the architecture to the problem and the AI fills in the gaps. I try my best to focus on and cover those gaps.
The reason I think it is inconsistent in nailing a variety of tasks is the recipe for training LLMs, which is pre-training + RL. The RL environment sends a training signal to update all the weights in its trajectory for the successful response. Karpathy calls it “sucking supervision through a straw”. This breaks other parts of the model.
Current AI, and in particular RL-based AI, has already achieved or will soon achieve superhuman performance on problems that can be quickly verified and measured [0].
So maths, algorithms, etc., and well-defined bugs fall into that category.
However, architectural decisions, design, and long-term planning, where there is little data, no model allowing synthetic data generation, and long iteration cycles, are not as amenable to it.
[0] https://www.jasonwei.net/blog/asymmetry-of-verification-and-...
It works great in C# (where you have strong typing + strict compiler).
Try this:
Have a look at xyz.cs. Do a full audit of the file and look for any database operations in loops that can be pre-filtered.
Or:
Have a look at folder /folderX/ and add .AsNoTracking() to all read-only database queries. When you are done, run the compiler and fix the errors. Only modify files in /folderX/ and do not go deeper in the call hierarchy. Once you are done, do a full audit of each file and make sure you did not accidentally add .AsNoTracking() to tracked entities. Do not create any new files or make backups, I already created a git branch for you. Do not make any git commits.
Or:
Have a look at the /Controllers/ folder. Do a full audit of each controller file and make sure there are no hard-coded credentials, username, password or tokens.
Or: Have a look at folder /folderX/. Find any repeated hard-coded values, magic values and literals that will make good candidates to extract to Constants.cs. Make sure to add XML comments to the Constants.cs file to document what the value is for. You may create classes within Constants.cs to better group certain values, like AccountingConstants or SystemConstants etc.
These kinds of tasks work amazingly well in Claude Code and can often be one-shotted. Make sure you check your git diffs - you cannot and should not blame AI for shitty code - it's your name next to the commit, make sure it is correct. You can even ask Claude to review the file with you afterwards. I've used this kind of approach to greatly increase our overall code quality & performance tuning - I really don't understand all the negative comments, as this approach has chopped days' worth of refactorings down to minutes and hours.
In places where you see your coding assistant is slow, making mistakes, or going line by line where you know a simple regex find/replace would work instantly, ask it to help you create a shell script that does task xyz, as a tool for itself to call. I've made a couple of scripts using this approach that Claude can call locally to fix certain code patterns in 5 seconds that would've taken it (and me checking it) 30 minutes at least, and it won't eat up context or tokens.
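As a sketch of what such a helper can look like (in Python here; the pattern, rewrite, and folder are made up for illustration - review the diff before committing):

    # Deterministic codemod Claude can invoke instead of editing line by line.
    # Hypothetical example: add .AsNoTracking() in front of .ToListAsync() calls.
    import pathlib, re, sys

    # Negative lookbehind keeps the script idempotent on already-converted calls.
    PATTERN = re.compile(r"(?<!\.AsNoTracking\(\))\.ToListAsync\(\)")
    REPLACEMENT = ".AsNoTracking().ToListAsync()"

    def fix_folder(folder):
        changed = 0
        for path in pathlib.Path(folder).rglob("*.cs"):
            text = path.read_text(encoding="utf-8")
            new_text = PATTERN.sub(REPLACEMENT, text)
            if new_text != text:
                path.write_text(new_text, encoding="utf-8")
                changed += 1
        return changed

    if __name__ == "__main__":
        print(f"rewrote {fix_folder(sys.argv[1])} files")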
The structured vs open-ended distinction here applies to code review too. When you ask an LLM to "find issues in this code", it'll happily find something to say, even if the code is fine. And when there are actual security vulnerabilities, it often gets distracted by style nitpicks and misses the real issues.
Static analysis has the opposite problem - very structured, deterministic, but limited to predefined patterns and overwhelms you in false positives.
The sweet spot seems to be to give structure to what the LLM should look for, rather than letting it roam free on an open-ended "review this" prompt.
We built Autofix Bot[1] around this idea.
[1] https://autofix.bot (disclosure: founder)
Exactly. If you visualize software as a bunch of separate "states" (UI state, app state, DB state), then our job is to mutate states and synchronize those mutations across the system. LLMs are good at mutating a specific state in a specific way. They are trash at designing what data shape a state should be, and they are bad at figuring out how/why to propagate mutations across a system.
I think slash commands are great for helping Claude with this. I have many, like /code:dry and /code:clean-code, each with a semi-long prompt and references to longer docs, to review code from a specific perspective. I think it at least improves Claude a bit in this area - like processes or templates for thinking in broader ways. But yes, I agree it struggles a lot here.
Somewhat tangential but interestingly I'd hate for Claude to make any changes with the intent of sticking to "DRY" or "Clean Code".
Neither of those are things I follow, and either way design is better informed by the specific problems that need to be solved rather than by such general, prescriptive principles.
I remember a problem I had while quickly testing notcurses. I tried ChatGPT, which produced a lot of weird but kinda believable statements: that I had to include wchar and define a specific preprocessor macro, AND that I had to place the includes for notcurses, other includes, and macros in a specific order.
My sentiment was "that's obviously a weird, non-intended hack", but I wanted to test quickly, and well... it worked. Later, reading the man pages, I acknowledged the fact that I needed to declare specific flags for gcc in place of the GPT-advised solution.
I think these kinds of value-based judgements are hard for LLMs to emulate; it's hard for them to identify a single source as the most authoritative in a sea of less authoritative (but more numerous) sources.
Using plan mode in Cursor (or asking Claude to first come up with a plan) makes it pretty good at generic "how can I improve" prompts. It can spend more effort exploring the codebase and thinking before implementing.
> There's a significant blind-spot in current LLMs related to blue-sky thinking and creative problem solving
I'd hesitate to call this a blind spot. LLMs have a lot of actual blind spots - things the people developing them overlook or deprioritize. This strikes me more as something they're acutely aware of and failing at, despite significant efforts to solve it.
> "Hey claude, I get this error message: <X>", and it'll often find the root cause quicker than I could.
This is true, as for "Open Ended" I use Beads with Claude code, I ask it to identify things based on criteria (even if its open ended) then I ask it to make tasks, then when its done I ask it to research and ask clarifying questions for those tasks. This works really well.
I’ve had reasonable success having it ultrathink of every possible X (exhaustively) and their trade-offs, and then give me a ranked list and rationale for its top recommendations. I almost always choose the top one, but just reading the list and then giving it next steps has worked really well for me.
Not at all my experience. I’ve often tried things like telling Claude this SIMD code I wrote performed poorly and I needed some ideas to make it go faster. Claude usually does a good job rewriting the SIMD to use different and faster operations.
I'm not a C++ programmer, but wouldn't your example be a fairly structured problem? You wanted to improve performance of a specific part of your code base.
My experience with Claude has been that having it "review" my code produces some helpful feedback and refactoring suggestions, but it also falls short in other areas.
It's not the same. Recently Opus 4.5 diagnosed and fixed a bug in the F# compiler for me, for example (https://github.com/dotnet/fsharp/pull/19123). The root cause is pretty subtle and very non-obvious, and of course the critical snippet of the stack trace `at FSharp.Compiler.Symbols.FSharpExprConvert.GetWitnessArgs` has no hits on Google other than my own bug report. I would have been completely lost fixing it.
One under-discussed lever that senior / principal engineers can pull is the ability to write linters & analyzers that will stop junior engineers ( or LLMs ) from doing something stupid that's specific to your domain.
Let's say you don't want people to make async calls while owning a particular global resource, it only takes a few minutes to write an analyzer that will prevent anyone from doing so.
Avoid hours of back-and-forth over code review by encoding your preferences and taste into your build pipeline, stopping problems at the source.
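As a sketch of how little code that takes - here in Python, assuming the guarded resource is acquired via a context manager named acquire_global_lock (a hypothetical name):

    # Tiny domain-specific analyzer: fail the build if anything awaits while
    # holding the global resource. Run it over source files in CI.
    import ast, sys

    class AwaitUnderLock(ast.NodeVisitor):
        def __init__(self):
            self.lock_depth = 0
            self.violations = []

        def _visit_with(self, node):
            held = any(
                isinstance(item.context_expr, ast.Call)
                and isinstance(item.context_expr.func, ast.Name)
                and item.context_expr.func.id == "acquire_global_lock"
                for item in node.items
            )
            self.lock_depth += held
            self.generic_visit(node)
            self.lock_depth -= held

        visit_With = visit_AsyncWith = _visit_with

        def visit_Await(self, node):
            if self.lock_depth:
                self.violations.append(node.lineno)
            self.generic_visit(node)

    if __name__ == "__main__":
        failed = False
        for path in sys.argv[1:]:
            checker = AwaitUnderLock()
            checker.visit(ast.parse(open(path).read(), filename=path))
            for lineno in checker.violations:
                print(f"{path}:{lineno}: await while holding the global resource")
                failed = True
        sys.exit(1 if failed else 0)  # non-zero exit stops the pipeline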
I am basically rawdogging Claude these days, I don’t use MCPs or anything else, I just lay down all of the requirements and the suggestions and the hints, and let it go to work.
When I see my colleagues use an LLM they are treating it like a mind reader and their prompts are, frankly, dogshit.
It shows that articulating a problem is an important skill.
That's called job security!
One of my favorite personal evals for llms is testing its stability as a reviewer.
The basic gist of it is to give the llm some code to review and have it assign a grade multiple times. How much variance is there in the grade?
Then, prompt the same llm to be a "critical" reviewer with the same code multiple times. How much does that average critical grade change?
A low variance of grades across many generations and a low delta between "review this code" and "review this code with a critical eye" is a major positive signal for quality.
I've found that gpt-5.1 produces remarkably stable evaluations whereas Claude is all over the place. Furthermore, Claude will completely [and comically] change the tenor of its evaluation when asked to be critical whereas gpt-5.1 is directionally the same while tightening the screws.
You could also interpret these results to be a proxy for obsequiousness.
Edit: One major part of the eval i left out is "can an llm converge on an 'A'?" Let's say the llm gives the code a 6/10 (or B-). When you implement its suggestions and then provide the improved code in a new context, does the grade go up? Furthermore, can it eventually give itself an A, and consistently?
It's honestly impressive how good, stable, and convergent gpt-5.1 is. Claude is not great. I have yet to test it on Gemini 3.
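For anyone who wants to try this on their own code, the harness is only a handful of lines. A rough sketch (OpenAI-style client; the model name and the "integer only" output contract are illustrative, and real runs need a stricter format):

    # Measure grade variance per condition, plus the plain-vs-critical gap.
    import re, statistics
    from openai import OpenAI

    client = OpenAI()

    def grade(code, critical=False, model="gpt-5.1"):
        tone = "with a critical eye " if critical else ""
        resp = client.chat.completions.create(
            model=model,
            messages=[{
                "role": "user",
                "content": f"Review this code {tone}and grade it 1-10. "
                           f"Reply with the integer only.\n\n{code}",
            }],
        )
        return int(re.search(r"\d+", resp.choices[0].message.content).group())

    def stability(code, n=10):
        plain = [grade(code) for _ in range(n)]
        critical = [grade(code, critical=True) for _ in range(n)]
        # Low stdev within each list, and a small gap between the two means,
        # is the positive signal; a big gap is the sycophancy proxy.
        return (statistics.mean(plain), statistics.stdev(plain),
                statistics.mean(critical), statistics.stdev(critical))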
I agree, I mostly use Claude for writing code, but I always get GPT5 to review it. Like you, I find it astonishingly consistent and useful, especially compared to Claude. I like to reset my context frequently, so I’ll often paste the problems from GPT into Claude, then get it to review those fixes (going around that loop a few times), then reset the context and get it to do a new full review. It’s very reassuring how consistent the results are.
You mean literally assign a grade, like B+? This is unlikely to work based on how token prediction & temperature works. You're going to get a probability distribution in the end that is reflective of the model runtime parameters, not the intelligence of the model.
My experience reviewing PRs is that sometimes it says a PR is perfect with some nitpicks, and other times that the same PR is trash and needs a lot of work.
LLMs have this strong bias towards generating code, because writing code is the default behavior from pre-training.
Removing code, renaming files, condensing, and other edits are mostly post-training stuff, supervised-learning behavior. You have armies of developers across the world making 17 to 35 dollars an hour solving tasks step by step, which are then basically used to generate prompt/response pairs of desired behavior for a lot of common development situations, adding desired output for things like tool calling, which is needed for things like deleting code.
A typical human task in post-training dataset generation would involve a scenario like: given this Dockerfile for a Python application, when we try to run pytest it fails with exception "foo not found". The human will notice that package foo is not installed, change the requirements.txt file and write this down, then try pip install, and notice that the foo package requires a certain native library to be installed. The final output of this will be a response with the appropriate tool calls in a structured format.
Given that the amount of unsupervised learning is way bigger than the amount spent on fine-tuning for most models, it is no surprise that, given any ambiguous situation, the model will default to what it knows best.
More post-training will usually improve this, but the quality of the human-generated dataset will probably be the upper bound on output quality, not to mention the risk of overfitting if the foundation-model labs embrace SFT too enthusiastically.
what does this even mean? could you expand on it
During pre-training the model is learning next-token prediction, which is naturally additive. Even if you added DEL as a token, it would still be quite hard to change the data so that it can be used in a next-token prediction task.
Hope that helps
I like to ask LLMs to find problems or improvements in 1-2 files. They are pretty good at finding bugs, but for general code improvements, 50-60% of the edits are trash. They add completely unnecessary stuff. If you ask them to improve pretty well-written code, they rarely say it's good enough already.
For example, in a functional-style codebase, they will try to rewrite everything to a class. I have to adjust the prompt to list things that I'm not interested in. And some inexperienced people are trying to write better code by learning from such changes of LLMs...
I asked Claude the other day to look at one of my hobby projects that has a client/server architecture and a bespoke network protocol, and brainstorm ideas for converting it over to HTTP, JSON-RPC, or something else standards-based. I specifically told it to "go wild" and really explore the space. It thought for a while and provided a decent number of suggestions (several I was unaware of) with "verdicts". Ultimately, though, it concluded that none of them were ideal, and that the custom wire protocol was fine and appropriate for the project. I was kind of shocked at this conclusion: I expected it to behave like that eager intern persona we all have come to expect--ready to rip up the code and "do things."
If you just ask it to find problems, it will do its best to find them - like running a while loop with no return condition. That's why I put a breaker in the prompt, which in this case would be "don't make any improvements if the positive impact is marginal". I've mostly seen it do nothing and just summarize why, followed by some suggestions in case I still want to force the issue.
I guess "marginal impact" for them is a pretty random metric, which will be different on each run. Will try it next time.
Another problem is that they try to add handling of different cases that are never present in my data. I have to mention that there is no need to update handling to be more generalized. For example, my code handles PNG files, and they add JPG handling that never happens.
Yeah. I noticed Claude suffers when it reaches context overload - it's too opinionated, so it shortens its own context with decisions I would never make, yet I see it telling itself that the shortcuts are a good idea because the project is complex... Then it gets into a loop where it second-guesses its own decisions, forgets the context, and continues to spiral uncontrollably into deeper and deeper failures - often missing the obvious glitch and instead looking into imaginary land for answers, constantly diverting the solution from patching to completely rewriting...
I think it suffers from performance anxiety...
----
The only solution I have found is to - rewrite the prompt from scratch, change the context myself, and then clear any "history or memories" and then try again.
I have even gone so far as to open nested folders in separate windows to "lock in" scope better.
As soon as I see the agent say "Wait, that doesn't make sense, let me review the code again", it's cooked.
> Yeah. I noticed Claude suffers when it reaches context overload
All LLMs degrade in quality as soon as you go beyond one user message and one assistant response. If you're looking for accuracy and highest possible quality, you need to constantly redo the conversations from scratch, never go beyond one user message.
If the LLM gets it wrong in their first response, instead of saying "No, what I meant was...", you need to edit your first response, and re-generate, otherwise the conversation becomes "poisoned" almost immediately, and every token generated after that will suffer.
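In API terms the discipline looks something like this sketch (client and model name are illustrative):

    # "Regenerate, don't follow up": fold corrections into a revised first
    # message instead of replying inside a poisoned conversation.
    from openai import OpenAI

    client = OpenAI()

    def one_shot(prompt, model="gpt-5.1"):
        # Always a fresh single-turn conversation: one user message, one answer.
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

    draft = one_shot("Refactor this function to remove the nested loops: ...")

    # Wrong answer? Don't send "No, what I meant was...". Edit and regenerate:
    better = one_shot(
        "Refactor this function to remove the nested loops. "
        "Keep the public signature unchanged: ..."
    )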
Yeah, I used to write some fiction for myself with LLMs as a recreational pastime; it's funny to see how, as the story gets longer, LLMs progressively get dumber, start repeating themselves, or become unhinged.
There's a reason why reasoning models are bad for creative writing. The thinking constrains the output.
I’m keeping Claude’s tasks small and focused, then if I can I clear between.
It’s REAL FUCKING TEMPTING to say "hey Claude, go do this thing that would take me hours and you seconds", because he will happily do it, and it'll kinda work. But one way or another you are going to put those hours in.
It’s like programming… is proof of work.
There’s definitely a certain point I reach when using Claude code where I have to make the specifications so specific that it becomes more work than just writing the code myself
For me, too many compactions throughout the day eventually lead to a decline in Claude's thinking ability. And, during that time, I have given it so much context to help drive the coding interaction. Thus, restarting Claude requires me to remember the small bits of "nuggets" we discovered during the last session so I find myself repeating the same things every day (my server IP is: xxx, my client IP is: yyy, the code should live in directory: a/b/c). Using the resume feature with Claude simply brings back the same decline in thinking that led me to stop it in the first place. I am sure there is a better way to remember these nuggets between sessions but I have not found it yet.
That has been my greatest stumbling block with these AI agents: context. I was trying to have one help vibe code a puzzle game and most of the time I added a new rule it broke 5 existing rules. It also never approached the rules engine with a context of building a reusable abstraction, just Hammer meet Nail.
The point he’s making - that LLMs aren't ready for broadly unsupervised software development - is well made.
It still requires an exhausting amount of thought and energy to make the LLM go in the direction I want, which is to say in a direction which considers the code which is outside the current context window.
I suspect that we will not solve the context window problem for a long time. But we will see a tremendous growth in “on demand tooling” for things which do fit into a context window and for which we can let the AI “do whatever it wants.”
For me, my work product needs to conform to existing design standards and I can’t figure out how to get Claude to not just wire up its own button styles.
But it’s remarkable how—despite all of the nonsense—these tools remain an irreplaceable part of my work life.
I feel like I’ve figured out a good workflow with AI coding tools now. I use it in “Planning mode” to describe the feature or whatever I am working on and break it down into phases. I iterate on the planning doc until it matches what I want to build.
Then, I ask it to execute each phase from the doc one at a time. I review all the code it writes or sometimes just write it myself. When it is done it updates the plan with what was accomplished and what needs to be done next.
This has worked for me because:
- it forces the planning part to happen before coding. A lot of Claude’s “wtf” moments can be caught in this phase, before it writes a ton of gobbledygook code that I then have to clean up
- the code is written in small chunks, usually one or two functions at a time. It’s small enough that I can review all the code and understand before I click accept. There’s no blindly accepting junk code.
- the only context is the planning doc. Claude captures everything it needs there, and it’s able to pick right up from a new chat and keep working.
- it helps my distraction-prone brain make plans and keep track of what I was doing. Even without Claude writing any code, this alone is a huge productivity boost for me. It’s like having a magic notebook that keeps track of where I was in my projects so I can pick them up again easily.
Which is why I think agentic software development is not really worth it today. It can solve well-defined problems and work through issues by rote, but if you give it some task and have it work for a couple of hours, you then have to come in and fix things up.
I think LLMs are still at the 'advanced autocomplete' stage, where the most productive way to use them is to have a human in the loop.
In this, accuracy of following instructions, and short feedback time is much more important than semi-decent behavior over long-horizon tasks.
This is an interesting experiment that we can summarize as "I gave a smart model a bad objective", with the key result at the end:
"...oh and the app still works, there's no new features, and just a few new bugs."
Nobody thinks that doing 200 improvement passes on a functioning code base is a good idea. The prompt tells the model that it is a principal engineer, then contradicts that role with the imperative "We need to improve the quality of this codebase". Determining when code needs to be improved is a responsibility of the principal engineer, but the prompt doesn't tell the model that it can decide the code is good enough. I think we would see different behavior if the prompt were changed to "Inspect the codebase, determine if we can do anything to improve code quality, then immediately implement it." If the model is smart enough, this will increasingly result in passes where the agent decides there is nothing left to do.
In my experience with CC, I get great results when I ask an open-ended question about a large module and instruct it to come back to me with suggestions. Claude generates 5-10 suggestions and ranks them by impact. It's very low-effort from the developer's perspective and it can generate some good ideas.
"Hey claude, I get this error message: <X>", and it'll often find the root cause quicker than I could.
"Hey claude, anything I could do to improve Y?", and it'll struggle beyond the basics that a linter might suggest.
It suggested enthusiastically a library for <work domain> and it was all "Recommended" about it, but when I pointed out that the library had been considered and rejected because <issue>, it understood and wrote up why that library suffered from that issue and why it was therefore unsuitable.
There's a significant blind-spot in current LLMs related to blue-sky thinking and creative problem solving. It can do structured problems very well, and it can transform unstructured data very well, but it can't deal with unstructured problems very well.
That may well change, so I don't want to embed that thought too deeply into my own priors, because the LLM space seems to evolve rapidly. I wouldn't want to find myself blind to the progress because I write it off from a class of problems.
But right now, the best way to help an LLM is have a deep understanding of the problem domain yourself, and just leverage it to do the grunt-work that you'd find boring.
I am phenomenally productive this way, I am happier at my job, and its quality of work is extremely high as long as I occasionally have it stop and self-review it's progress against the style principles articulated in its AGENTS.md file. (As it tends to forget a lot of rules like DRY)
However if I just say “I have this goal, implement a solution”, chances are that unless it is a very common task, it will come up with a subpar/incomplete implementation.
What’s funny to me is that complexity has inverted for some tasks: it can ace a 1000 lines ML model for a general task I give it, yet will completely fail to come up with a proper solution for a 2D geometric problem that mostly has high school level maths that can be solved in 100 lines
It may seem decent until you look closer. Just like with a junior dev, you should always review the code very carefully, you can absolutely not trust it. It's not bad at trivial stuff, but fails almost always if things get more complex and unlike a junior dev, it does not tell you, when things get too complex for it.
In the same breath (same paragraph) you state two polar opposites about working with AI:
I don't see how you can claim to be "phenomenally productive" when working with a tool you have to babysit because it forgets your instructions the whole time.If it was the "junior dev" you also mention, I suspect you would very quickly invite the "junior dev" to find a job elsewhere.
I am so tired of this analogy. Have the people who say this never worked with a junior dev before? If you treat your junior devs as brainless code monkeys who only exist to type out your brilliant senior developer designs and architectures instead of, you know, human beings capable of solving problems, 1) you're wasting your time, because a less experienced dev is still capable of solving problems independently, 2) the juniors working under you will hate it because they get no autonomy, and 3) the juniors working under you will stay junior because they have no opportunity to learn--which means you've failed at one of your most important tasks as a senior developer, which is mentorship.
I don't know if I'm any faster than I would be if I was motivated, but I'm A LOT more productive in my current state. I still hope for the next AI winter though.
Deleted Comment
Principles like DRY
If I had a pdf printout of a table, the workflow i used to have to use to get that back into a table data structure to use for automation was hard (annoying). dedicated OCR tools with limitations on inputs, multiple models in that tool for the different ways the paper the table was on might be formatted. it took hours for a new input format
now i can take a photo of something with my phone and get a data table in like 30 seconds.
people seem so desperate to outsource their thinking to these models and operating at the limits of their capability, but i have been having a blast using it to cut through so much tedium that werent unsolved problems but required enough specialized tooling and custom config to be left alone unless you really had to
this fits into what youre saying with using it to do the grunt work i find boring i suppose, but feels a little bit more than that - like it has opened a lot of doors to spaces that had grunt work that wasnt worth doing for the end result previously but now it is
Deleted Comment
While this is true in my experience, the opposite is not true. LLMs are very good at helping me go through a structure processing of thinking about architectural and structural design and then help build a corresponding specification.
More specifically the "idea honing" part of this proposed process works REALLY well: https://harper.blog/2025/02/16/my-llm-codegen-workflow-atm/
This: Each question should build on my previous answers, and our end goal is to have a detailed specification I can hand off to a developer. Let’s do this iteratively and dig into every relevant detail. Remember, only one question at a time.
Deleted Comment
I asked Claude to fix a pet peeve of mine, spawning a second process inside an existing Wine session (pretty hard if you use umu, since it runs in a user namespace). I asked Claude to write me a python server to spawn another process to pass through a file handler "in Proton", and it proceeded a long loop of trying to find a way to launch into an existing wine session from Linux with tons of environment variables that didn't exist.
Then I specified "server to run in Wine using Windows Python" and it got more things right. Except it tried to use named pipes for IPC. Which, surprise surprise, doesn't work to talk to the Linux piece. Only after I specified "local TCP socket" it started to go right. Had I written all those technical constraints and made the design decisions in the first message it'd have been a one-hit success.
Very easy to write it off when it spins out on the open-ended problems, without seeing just how effective it can be once you zoom in.
Of course, zooming in that far gives back some of the promised gains.
Edit: typo
The love/hate flame war continues because the LLM companies aren't selling you on this. The hype is all about "this tech will enable non-experts to do things they couldn't do before" not "this tech will help already existing experts with their specific niche," hence the disconnect between the sales hype and reality.
If OpenAI, Anthropic, Google, etc. were all honest and tempered their own hype and misleading marketing, I doubt there would even be a flame war. The marketing hype is "this will replace employees" without the required fine print of "this tool still needs to be operated by an expert in the field and not your average non technical manager."
This is exactly how I use it. I prefer Gemini 3 personally.
I try to learn as much as I can about different architectures, usually by reading books or other implementations and coding first principals to build a mental model. I apply the architecture to the problem and the AI fills in the gaps. I try my best to focus and cover those gaps.
The reason I think it is inconsistent in nailing a variety of tasks is the recipe for training LLMs, which is pre-training + RL. The RL environment sends a training signal to update all the weights in its trajectory for the successful response. Karpathy calls it “sucking supervision through a straw”. This breaks other parts of the model.
Current AI, and in particular RL-based, is already or will soon achieve super human performance on problems that can be - quickly - verified and measured.
So maths, algorithms, etc and well defined bugs fall into that category.
However architectural decision, design, long-term planning where there is little data, no model allowing synthetic data generation, and long iteration cycles are not so much amenable to it.
[0] https://www.jasonwei.net/blog/asymmetry-of-verification-and-...
Try this:
Have a look at xyz.cs. Do a full audit of the file and look for any database operations in loops that can be pre-filtered.
Or:
Have a look at folder /folderX/ and add .AsNoTracking() to all read-only database queries. When you are done, run the compiler and fix the errors. Only modify files in /folderX/ and do not go deeper in the call hierarchy. Once you are done, do a full audit of each file and make sure you did not accidentally added .AsNoTracking() to tracked entities. Do no create any new files or make backups, I already created a git branch for you. Do not make any git commits.
Or:
Have a look at the /Controllers/ folder. Do a full audit of each controller file and make sure there are no hard-coded credentials, username, password or tokens.
Or: Have a look at folder /folderX/. Find any repeated hard-coded values, magic values and literals that will make good candidates to extract to Constants.cs. Make sure to add XML comments to the Constants.cs file to document what the value is for. You may create classes within Constants.cs to better group certain values, like AccountingConstants or SystemConstants etc.
These kinds of tasks works amazing in claude code an can often be one shotted. Make sure you check your git diffs - you cannot and should not blame AI for shitty code - its your name next to the commit, make sure it is correct. You can even ask claude to review the file with you afterwards. I've used this kind of approach to greatly increase our overall code quality & performance tuning - I really don't understand all the negative comments as this approach has chopped down days worth of refactorings to a couple of minutes and hours.
In places where you see your coding assistant is slow or making mistakes or it is going line by line where you know a simple regex find/replace would work instantly, ask it to help you create a shell script as a tool for itself to call, that does task xyz that it can call. I've made a couple of scripts that uses this approach that Claude can call locally to fix certain code pattern in 5 seconds that would've taken it (and me checking it) 30 mins at least and it wont eat up context or tokens.
Static analysis has the opposite problem - very structured, deterministic, but limited to predefined patterns and overwhelms you in false positives.
The sweet spot seems to be to give structure to what the LLM should look for, rather than letting it roam free on an open-ended "review this" prompt.
We built Autofix Bot[1] around this idea.
[1] https://autofix.bot (disclosure: founder)
Neither of those are things I follow, and either way design is better informed by the specific problems that need to be solved rather than by such general, prescriptive principles.
My sentiment was "that's obviously a weird non-intended hack" but I wanted to test quickly, and well ... it worked. Later, reading the man-pages I aknowledged the fact that I needed to declare specific flags for gcc in place of the gpt advised solution.
I think these kind of value based judgements are hard to emulate for LLMs, it's hard for them to identifiate a single source as the most authoritative source in a sea of lesser authoritative (but numerous) sources.
I'd hesitate to call this a blind spot. LLMs have a lot of actual blind spots - things people developing them overlook or deprioritize. This strikes me more as something acutely aware of & failing at, despite significant efforts to solve.
This is true, as for "Open Ended" I use Beads with Claude code, I ask it to identify things based on criteria (even if its open ended) then I ask it to make tasks, then when its done I ask it to research and ask clarifying questions for those tasks. This works really well.
thats called job security!
Claude is for getting shit done, it's not at its best at long research tasks.
until works { try again }
The stuff is getting so cheap and so fast... a sufficient increment in quantity can produce a phase change in quality.
Back in the day, we would just do this with a search engine.
One under-discussed lever that senior / principal engineers can pull is the ability to write linters & analyzers that will stop junior engineers ( or LLMs ) from doing something stupid that's specific to your domain.
Let's say you don't want people to make async calls while owning a particular global resource, it only takes a few minutes to write an analyzer that will prevent anyone from doing so.
Avoid hours of back-and-forth over code review by encoding your preferences and taste into your build pipeline and stop it at source.
When I see my colleagues use an LLM they are treating it like a mind reader and their prompts are, frankly, dogshit.
It shows that articulating a problem is an important skill.
Dead Comment
The basic gist of it is to give the llm some code to review and have it assign a grade multiple times. How much variance is there in the grade?
Then, prompt the same llm to be a "critical" reviewer with the same code multiple times. How much does that average critical grade change?
A low variance of grades across many generations and a low delta between "review this code" and "review this code with a critical eye" is a major positive signal for quality.
I've found that gpt-5.1 produces remarkably stable evaluations whereas Claude is all over the place. Furthermore, Claude will completely [and comically] change the tenor of its evaluation when asked to be critical whereas gpt-5.1 is directionally the same while tightening the screws.
You could also interpret these results to be a proxy for obsequiousness.
Edit: One major part of the eval i left out is "can an llm converge on an 'A'?" Let's say the llm gives the code a 6/10 (or B-). When you implement its suggestions and then provide the improved code in a new context, does the grade go up? Furthermore, can it eventually give itself an A, and consistently?
It's honestly impressive how good, stable, and convergent gpt-5.1 is. Claude is not great. I have yet to test it on Gemini 3.
Deleted Comment
There's a reason why reasoning models are bad for creative writing. The thinking constrains the output.
Deleted Comment
Removing code, renaming files, condensing, and other edits is mostly a post-training stuff, supervised learning behavior. You have armies of developers across the world making 17 to 35 dollars an hour solving tasks step by step which are then basically used to generate prompt/responses pairs of desired behavior for a lot of common development situations, adding desired output for things like tool calling, which is needed for things like deleting code.
A typical human working on post-training dataset generation task would involve a scenario like: given this Dockerfile for a python application, when we try to run pytest it fails with exception foo not found. The human will notice that package foo is not installed, change the requirements.txt file and write this down, then he will try pip install, and notice that the foo package requires a certain native library to be installed. The final output of this will be a response with the appropriate tool calls in a structured format.
Given that the amount of unsupervised learning is way bigger than the amount spent on fine-tuning for most models, it is not surprise that given any ambiguous situation, the model will default to what it knows best.
More post-training will usually improve this, but the quality of the human generated dataset probably will be the upper bound of the output quality, not to mention the risk of overfitting if the foundation model labs embrace SFT too enthusiastically.
what does this even mean? could you expand on it
For example, in a functional-style codebase, they will try to rewrite everything to a class. I have to adjust the prompt to list things that I'm not interested in. And some inexperienced people are trying to write better code by learning from such changes of LLMs...
Another problem is that they try to add handling of different cases that are never present in my data. I have to mention that there is no need to update handling to be more generalized. For example, my code handles PNG files, and they add JPG handling that never happens.
I think it suffers from performance anxiety...
----
The only solution I have found is to - rewrite the prompt from scratch, change the context myself, and then clear any "history or memories" and then try again.
I have even gone so far as to open nested folders in separate windows to "lock in" scope better.
As soon as I see the agent say "Wait, that doesnt make sense, let me review the code again" its cooked
All LLMs degrade in quality as soon as you go beyond one user message and one assistant response. If you're looking for accuracy and highest possible quality, you need to constantly redo the conversations from scratch, never go beyond one user message.
If the LLM gets it wrong in their first response, instead of saying "No, what I meant was...", you need to edit your first response, and re-generate, otherwise the conversation becomes "poisoned" almost immediately, and every token generated after that will suffer.
It’s REAL FUCKING TEMPTING to say ”hey Claude, go do this thing that would take me hours and you seconds” because he will happily, and it’ll kinda work. But one way or another you are going to put those hours in.
It’s like programming… is proof of work.
It still requires an exhausting amount of thought and energy to make the LLM go in the direction I want, which is to say in a direction which considers the code which is outside the current context window.
I suspect that we will not solve the context window problem for a long time. But we will see a tremendous growth in “on demand tooling” for things which do fit into a context window and for which we can let the AI “do whatever it wants.”
For me, my work product needs to conform to existing design standards and I can’t figure out how to get Claude to not just wire up its own button styles.
But it’s remarkable how—despite all of the nonsense—these tools remain an irreplaceable part of my work life.
Then, I ask it to execute each phase from the doc one at a time. I review all the code it writes or sometimes just write it myself. When it is done it updates the plan with what was accomplished and what needs to be done next.
This has worked for me because:
- it forces the planning part to happen before coding. A lot of Claude’s “wtf” moments can be caught in this phase before it write a ton of gobbledygook code that I then have to clean up
- the code is written in small chunks, usually one or two functions at a time. It’s small enough that I can review all the code and understand before I click accept. There’s no blindly accepting junk code.
- the only context is the planning doc. Claude captures everything it needs there, and it’s able to pick right up from a new chat and keep working.
- it helps my distraction-prone brain make plans and keep track of what I was doing. Even without Claude writing any code, this alone is a huge productivity boost for me. It’s like have a magic notebook that keeps track of where I was in my projects so I can pick them up again easily.
I think LLMs are still at the 'advanced autocomplete' stage, where the most productive way to use them is to have a human in the loop.
In this, accuracy of following instructions, and short feedback time is much more important than semi-decent behavior over long-horizon tasks.
"...oh and the app still works, there's no new features, and just a few new bugs."
Nobody thinks that doing 200 improvement passes on functioning code base is a good idea. The prompt tells the model that it is a principal engineer, then contradicts that role the imperative "We need to improve the quality of this codebase". Determining when code needs to be improved is a responsibility for the principal engineer but the prompt doesn't tell the model that it can decide the code is good enough. I think we would see a different behavior if the prompt was changed to "Inspect the codebase, determine if we can do anything to improve code quality, then immediately implement it." If the model is smart enough, this will increasingly result in passes where the agent decides there is nothing left to do.
In my experience with CC I get great results where I make an open ended question about a large module and instruct it to come back to me with suggestions. Claude generates 5-10 suggestions and ranks them by impact. It's very low-effort from the developer's perspective and it can generate some good ideas.
LLMs are incapable of reducing entropy in a code base
I've always had this nagging feeling, but I think this really captures the essence of it succinctly.