One weird trick is to tell the LLM to ask you questions about anything that's unclear at this point. I tell it, e.g., to ask up to 10 questions. Often I do multiple rounds of these Q&A, and I'm always surprised at the quality of the questions (with Opus). I get better results that way, just because it reduces the degrees of freedom in which the agent can go off in a totally wrong direction.
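For example, something along these lines (the wording here is illustrative, not an exact prompt):

    Before writing any code or a plan, ask me up to 10 clarifying questions
    about anything in this spec that is ambiguous, underspecified, or where
    you would otherwise have to guess a tradeoff. Wait for my answers before
    proposing an approach.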
This is more or less what the "architect" mode is in KiloCode. It does all the planning and documentation, and then has to be switched to "Code" in order to author any of it. It allows me to ensure we're on the same page, more or less, with intentions and scope before giving it access to writing anything.
It consumes ~30-40% of the tokens associated with a project, in my experience, but they seem to be used in a more productive way long-term, as it doesn't need to rehash anything later on if it got covered in planning. That said, I don't pay too close attention to my consumption, as I found that QwenCoder 30B will run on my home desktop PC (48GB RAM/12GB vRAM) in a way that's plenty functional and accomplishes my goals (albeit a little slower than Copilot on most tasks).
Workflow improvement: use a repo bundler to make a single file and drop your entire codebase into Gemini or ChatGPT. Their whole-codebase comprehension is great, and you can chat for a long time without the API cost. You can even get them to comment on each other's feedback; it's great.
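If you don't have a dedicated bundler handy, the idea is roughly this (a plain-shell sketch; the directories and file globs are placeholders, adjust for your repo):

    # Concatenate source and docs into one text file, with a header per file,
    # so the whole codebase can be pasted into a single chat session.
    find src docs -type f \( -name '*.py' -o -name '*.md' \) | sort | while read -r f; do
        printf '\n===== %s =====\n' "$f"
        cat "$f"
    done > bundle.txt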
This is a little anthropomorphic. The faster option is to tell it to give you the full content of an ideal context for what you’re doing and adjust or expand as necessary. Less back and forth.
It's not, though. One of the key gaps right now is that people do not provide enough direction on the tradeoffs they want to make. Generally LLMs will not ask you about them; they will just go off and build. But if you have them ask, they will often come back with important questions about things you did not specify.
"You can ask the agent for advice on ways to improve your application, but be really careful; it loves to “improve” things, and is quick to suggest adding abstraction layers, etc. Every single idea it gives you will seem valid, and most of them will seem like things that you should really consider doing. RESIST THE URGE..."
A thousand times this. LLMs love to over-engineer things. I often wonder how much of this is attributable to the training data...
They’re not dissimilar to human devs, who also often feel the need to replat, refactor, over-generalize, etc.
The key thing in both cases, human and AI, is to be super clear about goals. Don’t say “how can this be improved”, say “what can we do to improve maintainability without major architectural changes” or “what changes would be required to scale to 100x volume” or whatever.
Open-ended, poorly-defined asks are bad news in any planning/execution based project.
A senior programmer does not suggest adding more complexity/abstraction layers just to say something. An LLM absolutely does, every single time in my experience.
There are however human developers that have built enough general and project-specific expertise to be able to answer these open-ended, poorly-defined requests. In fact, given how often that happens, maybe that’s at the core of what we’re being paid for.
This is something I experienced firsthand a few weeks ago when I first used Claude. I have this recursive-descent parser library I haven't touched in a few years that I want to continue developing but always procrastinate on. It has always been kinda slow, so I wanted to see if Claude could improve the speed. It made very reasonable suggestions, the main one being caching parsing rules based on the leading token kind. It produced code that looked fine and didn't break tests, but when I did a simple timed-loop performance comparison, Claude's changes were slightly slower. Digging through the code, I discovered I was already caching rules in a similar way and had forgotten about it, so the slight performance loss was from doing this twice.
Caching sounds fine, and it is a very potent method. Nevertheless, I avoid using it until I have almost no other options left, and no good ones. You now have to manage that cache, introduce the potential for rare, hard-to-debug runtime timing errors, and add a lot of complexity. For me, adding caching should come at the end, when the whole project is finished, you have exhausted all your architecture options, and you still need more speed. And then I'll add some big warnings and pray I don't run into too many new issues introduced by the caching.
It's better for things that are well isolated and definitely "inside the box", with no apparent way for the effects to leak outside the module. But you never know when you've overlooked something, or when some later refactoring invalidates the originally sane and clean assumptions without anyone noticing, because whoever does the refactoring only looks at a sub-section of the code. So it is not just a question of getting it right for the current system, but of anticipating that anything that can go wrong might actually go wrong if I leave enough opportunities (complexity) lying around, even in modules that are well encapsulated right now.
I mean, it's like having more than one database that you have to use and keep in sync. Who does that voluntarily? There's already caching inside many of the lower levels, from SSDs and CPUs to the OS, and it's complex enough already and can lead to unexpected behavior. Adding even more of that in the app itself does not appeal to me, if I can help it. I'm just way too stupid for all this complexity; I need it nice and simple. Well, as nice and simple as it gets these days; we seem to be moving towards biological-system levels of complexity in larger IT systems.
If you are not writing the end system but a library, there is also the possibility that the actual system will do its own caching at a higher level. I would carefully evaluate whether there is really a need to do any caching inside my library. Depending on how it is used, the higher level doing its own caching would likely make that obsolete, because the library functions will not be called as often as predicted in the first place.
There is also the fact that you need a very different focus and mindset for the caching code compared to the code doing the actual work. For caching, you look at very different things than you think about for the algorithm. For the app you think at a higher level, about how to get work done; for caching you go down into the oily, dirty gearboxes of the machine and check all the shafts and gears and connections. Ideally caching would not be part of the business code at all, but that is hard to avoid, and the result is messy: very different kinds of code, dealing with very different problems, sitting close together or even intertwined.
Am I alone in spending $1k+/month on tokens? It feels like the most useful dollars I've ever spent in my life. The software I've been able to build on a whim over the last 6 months is beyond my wildest dreams from a year or two ago.
I’m unclear how you’re hitting $1k/mo in personal usage. GitHub Copilot charges $0.04 per task with a frontier model in agent mode - and it’s considered expensive. That’s 850 coding tasks per day for $1k/mo, or around 1 per minute in a 16hr day.
I’m not sure a single human could audit & review the output of $1k/mo in tokens from frontier models at the current market rate. I’m not sure they could even audit half that.
I would if there were any positive ROI for these $12k/year, or if it were a small enough fraction of my income. For me, neither are true, so I don’t :).
Like the siblings I would be interested in having your perspective on what kind of thing you do with so many tokens.
I would personally never. Do I want to spend all my time reviewing AI code instead of writing? Not really. I also don't like having a worse mental model of the software.
What kind of software are you building that you couldn't before?
You're not alone in using $1k+/month in tokens. But if you are spending that much, you should definitely be on something like Anthropic's Max plan instead of going full API, since it is a fraction of the cost.
Define heavy... There's a band where the Max subscription makes the most sense. The thread here talks about $1,000/month; the plan beats that. But there's a larger area beyond that where you're back to having to use the API or buy credits.
A full day of Opus 4.1 or GPT-5 with high reasoning doing pair programming or guided code review across multiple issues or PRs in parallel will burn through the Max plan's monthly limits and then either stop you or cost $1,500 in top-up credits for a 15-hour day. Wait, WTF, that's $300k/year! OK, while true, that misses that it's accomplishing 6-8 workstreams in parallel, all day, with no drop in efficacy.
At enterprise procurement cost rates, hiring a {{specific_tech}} expert can run $240/hr or $3,500/day, and that expert is (a) less knowledgeable about the 3+ year old tech the enterprise is actually using, and (b) wants to advise instead of type.
So the question then isn't what it costs, it's what's the cost of being blocked and in turn blocking committers waiting for reviews? Similarly, what's the cost of a Max for a dev that doesn't believe in using it?
TL;DR: At the team level, for guided experts and disbelievers, API likely ends up cheaper again.
> One of the weird things I found out about agents is that they actually give up on fixing test failures and just disable tests. They’ll try once or twice and then give up.
It's important not to think in terms of generalities like this. How they approach this depends on your test framework, and even on the language you use. If disabling tests is easy and common in that language/framework, it's more likely to do it.
For testing a CLI, I currently use a run_tests.sh script, and not once has the agent tried to disable a test. Though that can be its own problem when it hits one it can't debug:
    # run_tests.sh
    # Handle multiple script arguments or default to all .sh files
    scripts=("${@/#/./examples/}")
    [ $# -eq 0 ] && scripts=(./examples/*.sh)
    for script in "${scripts[@]}"; do
        "$script" || exit 1   # loop body was elided in the comment; assumed: run each example, stop on first failure
    done
    echo " OK"
Another tip: for specific tasks, don't bother with "please read file x.md"; Claude Code (and others) accept the @file syntax, which pulls that file into context right away.
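For example (the file paths here are made up):

    Implement the refactor described in @docs/parser-plan.md, and keep the
    public behaviour documented in @README.md unchanged.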
As a human dev, can I humbly ask you to separate out your LLM "readme" from your human README.md? If I see a README.md in a directory I assume that means the directory is a separate module that can be split out into a separate repo or indeed storage elsewhere. If you're putting copy in your codebase that's instructions for a bot, that isn't a README.md. By all means come up with a new convention e.g. BOTS.md for this. As a human dev I know I can safely ignore such a file unless I am working with a bot.
I think things are moving towards using AGENTS.md files: https://agents.md/ . I’d like something like this to become the consensus for most commonly used tools at some point.
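For anyone who hasn't seen one: it's just plain markdown instructions aimed at the agent. A minimal, entirely hypothetical example might look like:

    # AGENTS.md
    ## Build and test
    - Install dependencies with `npm install`; run tests with `npm test`.
    - Never disable or skip failing tests; report them instead.
    ## Conventions
    - Keep changes small; do not add new abstraction layers without asking.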
> If I see a README.md in a directory I assume that means the directory is a separate module that can be split out into a separate repo or indeed storage elsewhere.
While I can understand why someone might develop that first impression, it's never been a safe assumption, especially as one starts working on larger projects or at larger organizations. It's not that unusual for essential sections of the same big project to have their own makefiles, specialized utility scripts, tweaks to the auto-formatter, etc.
In other cases things are together in a repo for reasons of coordination: Consider frontend/backend code which runs with different languages on different computers, with separate READMEs etc. They may share very little in terms of their build instructions, but you want corresponding changes on each end of their API to remain in lockstep.
Another example: one of my employer's projects has special GNU gettext files for translation and internationalization. They live in a subdirectory with its own documentation and support scripts, but that subdirectory absolutely needs to stay within the application that uses it for string conversions.
You're absolutely right - I explained my reasoning poorly. Let me try again: a README.md marks a conceptual project root. It's a way of flagging to developers: "The code in this directory builds, deploys and/or runs separately from other code in this repo." It's a marker that says you need to think of this directory as something that could meaningfully be split out into a separate repo or storage, but isn't, because it only makes sense in the context of this repo. It's common in public repos with docs and examples dirs in the project root: the docs are built separately, and the examples are meant to be standalone and could be implemented that way, even if they actually import code from the parent repo instead of requiring it as a third-party dependency.
> If you are a heavy user, you should use pay-as-you go pricing; TANSTAAFL.
This is very, very wrong. Anthropic's Max plan is something like 10% of the cost of paying for tokens directly if you are a heavy user. And if you still hit the rate limits, Claude Code can roll over to paying for tokens through API credits. Although, I have never hit the rate limits since I upgraded to the $200/month plan.
There are many people who quickly hit the limits of the $200/month plan. I hit the limits of the $20/month plan in less than a day. So I never tried the $200/month plan but I suspect you are wrong.
Also, if you sign up for Anthropic’s feedback program you get a 30% reduction on API usage.
Especially if you hit the rate limits, you should be on the Max plan. It will save you at least $2000/month if you hit the rate-limits on the $200/month plan, and then you can go and spend however much more you want on the API after hitting the rate limits. There are many people showing how they use 3, 4, or even 5 thousand dollars of API credits a month using the Max plan. You're just going to pay those extra thousands of dollars for the sake of it?!
It is insanity to spend thousands of dollars a month when you could be spending hundreds for the exact same product.
It's an absolute no-brainer. And it's not even either/or: you can use both the plan and fall back to the API when you get rate-limited. A 30% discount on tokens cannot match the 90% discount on tokens you get using the plan. The math is so unbelievably in favour of the plan.
There are probably people who are not heavy users where the plan may not make sense. But for heavy users, you are burning piles of your own money by not using Anthropic's Max plan. You only need a week of moderate usage a month and already the plan will have paid for itself compared to paying for API credits directly.
OP here. I was not solicited by anyone nor did I solicit or accept compensation from anyone for this or any other post.
It’s not an advertisement; I apologize if I come off as a Claude Code fanboy.
If you read part 1 of my post (linked in my OP) you will see that I disclosed exactly how much I paid for my usage, and also the reasons that I ended up choosing Claude Code over other agents.
To be fair, allocating some tokens for planning (recursively) helps a lot. It requires more hands-on work, but produces much better results. Clarifying the tasks and breaking them down is very helpful too; you just end up spending a lot of time on it. On the bright side, Qwen3 30B is quite decent, and best of all, "free".
Some of these sample prompts in this blog post are extremely verbose:
If you are considering leveraging any of the documentation or examples, you need to validate that the documentation or example actually matches what is currently in the code.
I have better luck being more concise and avoiding anthropomorphizing. Something like:
"validate documentation against existing code before implementation"
I have the best luck with RFC speak. "You MUST validate that the documentation matches existing code before implementation. You MAY update documentation to correct any mismatches."
But I also use more casual style when investigating. “See what you think about the existing inheritance model, propose any improvements that will make it easier to maintain. I was thinking that creating a new base class for tree and flower to inherit from might make sense, but maybe that’s over complicating things”
(Expressing uncertainty seems to help avoid the model latching on to every idea with “you’re absolutely right!”)
Also, there's a big difference between giving general "always on" context (as in agents.md) for vibe coding - like "validate against existing code" etc - versus bouncing ideas in a chat session like your example, where you don't necessarily have a specific approach in mind and burning a few extra tokens for a one off query is no big deal.
Context isn't free (either literally or in terms of processing time) and there's definitely a balance to be found for a given task.
I've had both experiences. On some projects concise instructions seem to work better. Other times, the LLM seems to benefit from verbosity.
This is definitely a way in which working with LLMs is frustrating. I find them helpful, but I don't know that I'm getting "better" at using them. Every time I feel like I've discovered something, it seems to be situation specific.
Not my experience at all. I find that the shorter my prompts, the more garbage the results. But if I give it a lot of detail and elaborate on my thought process, it performs very well, and often one-shots the solution.
"You can ask the agent for advice on ways to improve your application, but be really careful; it loves to “improve” things, and is quick to suggest adding abstraction layers, etc. Every single idea it gives you will seem valid, and most of them will seem like things that you should really consider doing. RESIST THE URGE..."
A thousand times this. LLMs love to over-engineer things. I often wonder how much of this is attributable to the training data...
The key thing in both cases, human and AI, is to be super clear about goals. Don’t say “how can this be improved”, say “what can we do to improve maintainability without major architectural changes” or “what changes would be required to scale to 100x volume” or whatever.
Open-ended, poorly-defined asks are bad news in any planning/execution based project.
It's better for things that are well isolated and definitely completely "inside the box" with no apparent way for the effects to have an effect outside the module, but you never know when you overlook something, or when some later refactoring leads to the originally sane and clean assumptions to be made invalid without anyone noticing, because whoever does the refactoring only looks at a sub-section of the code. So it is not just a question of getting it right for the current system, but to anticipate that anything that can go wrong might actually indeed go wrong, if I leave enough opportunities (complexity) even in right now well-encapsulated modules.
I mean, it's like having more than one database and you have to use both and keep them in sync. Who does that voluntarily? There's already caching inside many of the lower levels, from SSDs, CPUs, to the OS, and it's complex enough already, and can lead to unexpected behavior. Adding even more of that in the app itself does not appeal to me, if I can help it. I'm just way too stupid for all this complexity, I need it nice and simple. Well, as nice and simple as it gets these days, we seem to be moving towards biological system level complexity in larger IT systems.
If you are not writing the end-system but a library, there is also the possibility that the actual system will do its own caching on a higher level. I would carefully evaluate if there is really a need to do any caching inside my library? Depending on how it is used, the higher level doing it too would likely make that obsolete because the library functions will not be called as often as predicted in the first place.
There is also that you need a very different focus and mindset for the caching code, compared to the code doing the actual work. For caching, you look at very different things than what you think about for the algorithm. For the app you think on a higher level, how to get work done, and for caching you go down into the oily and dirty gear boxes of the machine and check all the shafts and gears and connections. Ideally caching would not be part of the business code at all, but it is hard to avoid and the result is messy, very different kinds of code, dealing with very different problems, close together or even intertwined.
Deleted Comment
> I often wonder how much of this is attributable to the training data...

I'd reckon anywhere between 99.9% and 100%. Give or take.
> Am I alone in spending $1k+/month on tokens?

If you're a heavy user, you should pay for the monthly Claude Code subscription, which is significantly cheaper than API costs.
If you don't mind sharing, I'm really curious - what kind of things do you build and what is your skillset?
> Claude Code (and others) accept the @file syntax which puts that into context right away.
There was a discussion here 3 days ago: https://news.ycombinator.com/item?id=44957443 .
> If I see a README.md in a directory I assume that means the directory is a separate module that can be split out into a separate repo

A README is not a "this is a separate project" marker, and not a "project documentation file". You can have READMEs dotted all over a project if that's necessary. It's simply a file that a previous developer is asking you to read before you start mucking around in that directory.
> for heavy users, you are burning piles of your own money by not using Anthropic's Max plan

I acknowledge that, and I get something like $400 worth of tokens from my $20 Claude Code Pro subscription every month.
I'm building tools I can use when the VC money runs out or a clear winner gets on top and the prices shoot up to realistic levels.
At that point I've hopefully got enough local compute to run a local model though.