A repeated trend is that Claude Code only gets 70-80% of the way, which is fine and something I wish was emphasized more by people pushing agents.
This bullet point is funny:
> Treat it like a slot machine
> Save your state before letting Claude work, let it run for 30 minutes, then either accept the result or start fresh rather than trying to wrestle with corrections. Starting over often has a higher success rate than trying to fix Claude's mistakes.
That's easy to say when the employee is not personally paying the massive amount of compute running Claude Code for a half-hour.
Thanks for the tip - we employees should run and re-run the code generation hundreds of times even if the changes are pretty good. That way, the brass will see a huge bill without many actual commits.
Sorry boss, it looks like we need to hire more software engineers since the AI route still isn't mathing.
> A repeated trend is that Claude Code only gets 70-80% of the way, which is fine and something I wish was emphasized more by people pushing agents.
I have been pretty successful at using LLMs for code generation.
I have a simple rule: something is either >90% AI or not AI at all (excluding inline completions and very obvious text editing).
The model has an inherent understanding of some problems from its training data (e.g. setting up a web server with little to no deps in Go) that it can handle with almost 100% certainty. Those are really easy to blaze through in a few minutes, and then I can set up the architecture for some very flat code flows. This can genuinely improve my output by 30-50%.
Agree with your experience. I've also found that if I build a lightweight skeleton of the structure of the program, it does a much better job. Also, ensuring that it does a full-fledged planning/non-executing step before starting to change things leads to good results.
I have been using Cline in VSCode, and I've been enjoying it a lot.
> A repeated trend is that Claude Code only gets 70-80% of the way, which is fine and something I wish was emphasized more by people pushing agents.
Recently, I realized that this applies not only to the first 70–80% of a project but sometimes also to the final 70-80%.
I couldn’t make progress with Claude on a major refactoring from scratch, so I started implementing it myself. Once I had shaped the idea clearly enough but in a very early state, I handed it back to Claude to finish and it worked flawlessly, down to the last CHANGELOG entry, without any further input from me.
I saw this as a form of extensive guardrails or prompting-by-example.
I need to try this - started using Claude Code a few days ago and have been struggling to get good implementations for some high-complexity refactors. It keeps over-engineering and creating more problems than it solves. It's getting close though, and I think your approach would work very well for this scenario!
The slot machine thing has a pretty compelling corollary: crank the formal systems rigor up as high as you can.
Vibe coding in Python is seductive but ultimately you end up in a bad place with a big bill to show for it.
Vibe coding in Haskell is a "how much money am I willing to pour in per unit clean, correct, maintainable code" exercise. With GHC cranked up to `-Wall -Werror` and some nasty property tests? Watching Claude Code try to weasel out with a mock goes from infuriating to amusing: bam, unused parameter! Now why would the test suite be demanding that a property holds on an unused parameter...
And Haskell is just an example; TypeScript is in some ways even more powerful in its type system, so lots of projects have scope to dabble with what I'm calling "hyper modern vibe coding": just start putting a bunch of really nasty fast-check properties and generic bounds on stuff and watch Claude Code try to cheat. Your move, Claude Code: I know you want to check off that line on the TODO list like I want to breathe, so what's it gonna be?
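To make that concrete, here's a rough sketch of the kind of fast-check property I have in mind (clamp and everything around it is made up for illustration, not from any real project); pair it with `noUnusedParameters` in tsconfig and a lazy stub that ignores an argument gets flagged almost immediately:

    import fc from "fast-check";

    // Illustrative toy: a function the agent might be asked to implement.
    function clamp(lo: number, hi: number, x: number): number {
      return Math.min(hi, Math.max(lo, x));
    }

    // Property: the result stays within [lo, hi], and equals x whenever x is in range.
    // A stub like `return x` (ignoring lo and hi) fails fast under random inputs.
    fc.assert(
      fc.property(fc.integer(), fc.integer(), fc.integer(), (a, b, x) => {
        const [lo, hi] = a <= b ? [a, b] : [b, a];
        const y = clamp(lo, hi, x);
        return y >= lo && y <= hi && (x < lo || x > hi || y === x);
      })
    );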
I find it usually gives up and does the work you paid for.
Interesting, I wonder if there is a way to quantify the value of this technique. Like give Claude the same task in Haskell vs. Python and see which one converges correctly first.
Not to mention, if an employee could usually write pretty good code but maybe 30% of the time they wrote something so non-functional it had to be entirely scrapped, they'd be fired.
This is an easy calculation for everyone. Think about whether Claude is giving you a sufficient boost in performance, and if not... then it's too expensive. No doubt some people are in some combination of domain, legacy, complexity of codebase, etc., where Claude just doesn't cut it.
$200 per month will get you roughly 4-5 hours of non-stop single-threaded usage per day.
A bigger issue here is that a random process is not a good engineering pattern. It's not repeatable, doesn't drive coherent architecture, and struggles with complex problems. In my experience, problem size correlates inversely with generated code quality. Engineering is a process of divide-and-conquer, and there is a good reason people don't use bogo (random) sort in production.
More specifically, if you only look at the final code, you are either spending a lot of time reviewing it or accepting it with less review scrutiny. Carefully reviewing semi-random diffs seems like a poor use of time... so I suspect the default is less review scrutiny and higher tech debt. Interestingly enough, higher tech debt might be an acceptable tradeoff if you believe that Code Assistants will soon be good enough to burn the tech debt down autonomously or with minimal oversight.
On the other hand, if the code you are writing is not allowed to fail, the stakes change and you can't pick the less review option. I never thought to codify it as a process, but here is what I do to guide the development process:
- Start by stating the problem and asking Claude Code to: analyze the existing code, restate the problem in a structured fashion, scan the codebase for existing patterns solving the problem, brainstorm alternative solutions. An enhancement here could be to have a map / list of the codebase to improve the search.
- Evaluate presented solutions and iterate on the list. Add problem details, provide insight, eliminate the solutions that would not work. A lot of times I have enough context to pick a winner here, but if not, I ask for more details about each solution and their relative pros and cons.
- Ask Claude to provide a detailed plan for the down-selected solution. Carefully review the plan (a significantly faster endeavor compared to reviewing the whole diff). Iterate on the plan as needed; after that, tell Claude to save the plan for comparison after the implementation and then to get cracking.
- Review Claude's report of what was implemented vs. what was initially planned. This step is crucial because Claude will try dumb things to get things working, and I've already done the legwork on making sure we're not doing anything dumb in the previous step. Make changes as needed.
- After implementation, I generally do a pass on the unit tests because Claude is extremely prolific with them. You generally need to let it write unit tests to make sure it is on the right track. Here, I ask it to scan all of the unit tests and identify similar or identical code. After that, I ask for refactor options that most importantly maximize clarity, secondly minimize lines of code, and thirdly minimize diffs. Pick the best ones.
Yes, I accept that the above process takes significantly longer for any single change; however, in my experience, it produces far superior results in a bounded amount of time.
P.S. If you got this far, please leave some feedback on how I can improve the flow.
I agree with that list. I would also add that you should explicitly ask the LLM to read the whole of each file at least once before starting edits, because they often have tunnel vision. The project map is auto-generated with a script to avoid reading too many files, but the files to be edited should be fresh in the context, imo.
Funny thing, their recommendation to save state, when Claude Code still has no ability to restore checkpoints (like Cline has) despite it being requested many times. Who are they kidding?
I’ve implemented and maintained an entire web app with CC, and also used many other tools (and took classes and taught workshops on using AI coding tools).
The most effective way I’ve found to use CC so far is this workflow:
Have a detailed and also compressed spec in an md file. It can be called anything, because you’re going to reference it explicitly in every prompt. (CC usually forgets about CLAUDE.md ime)
Start with the user story, and ask it to write a high-level staged implementation plan with atomic steps. Review this plan and have CC rewrite as necessary. (Another md file results.)
Then, based on this file, ask it to write a detailed implementation plan, also with atomic stages. Then review it together and ask if it’s ready to implement.
Then tell Claude to go ahead and implement it on a branch.
Remember the automated tests and functional testing.
Then merge.
Great advice, matches up to my experience. Personally I go a little cheaper and dirtier on the first prompt, then revise as needed. By the way what classes / workshops did you teach?
Thank you for sharing. I taught some workshops on AI-assisted development using Cursor and Windsurf for MIT students (we built an application and wrote a book) and TAed another similar for-credit course. I’ve also been teaching high schoolers how to code, and we use ChatGPT to help us understand and solve leetcode problems by breaking them down into smaller exercises. There’s also now a Harvard CS course on developing with GenAI which I followed along with. The field is exploding.
This matches my experience as well. But what I also found is that I hate this workflow so much that I would almost always rather write the code by hand. Writing specs and user stories was always my least favorite task.
Claude Code works well for lots of things; for example yesterday I asked it to switch weather APIs backing a weather site and it came very close to one-shotting the whole thing even though the APIs were quite different.
I use it at home via the $20/m subscription and am piloting it at work via AWS Bedrock. When used with the Bedrock APIs, at the end of every session it shows you the dollar amount spent, which is a bit disconcerting. I hope the fine-grained metering of inference is a temporary situation; otherwise I think it will have a chilling/discouraging effect on software developers, leading to less experimentation, fewer rewrites, and overall lower quality.
I imagine Anthropic gets to consume it unmetered internally, so they probably avoid this problem completely.
A couple weekends ago I handed it the basic MLB API and asked it to create some widgets for macOS to show me stuff like league/division/wildcard standings, along with basic settings to pick which should be shown. It cranked out a working widget in like a half hour with minimal input.
I know some Swift so I checked on what it was doing. For a quick hack project it did all the work and easily updated things I saw issues with.
For a one-off like that, not bad at all. Not too dissimilar from your example.
> I use it at home via the $20/m subscription and am piloting it at work via AWS Bedrock. When used with the Bedrock APIs, at the end of every session it shows you the dollar amount spent, which is a bit disconcerting. I hope the fine-grained metering of inference is a temporary situation; otherwise I think it will have a chilling/discouraging effect on software developers, leading to less experimentation, fewer rewrites, and overall lower quality.
I’m legitimately surprised at your feeling on this. I might not want the granular cost put in my face constantly but I do like the ability to see how much my queries cost when I am experimenting with prompt setup for agents. Occasionally I find wording things one way or the other has a significantly cheaper cost.
Why do you think it will lead to a chilling effect instead of the normal effect of engineers ruthlessly innovating costs down now that there is a measurable target?
I’ve seen it firsthand at work, where my developers are shy about spending even a single digit number of dollars on Claude Code, even when it saves them 10 times that much in opportunity cost. It’s got to be some kind of psychological loss aversion effect.
I think it’s easy to spend _time_ when the reward is intangible or unlikely, like an evening writing toy applications to learn something new or prototyping some off-the-wall change in a service that might have an interesting performance impact. If development becomes metered in both time and to-the-penny dollars, I at least will have to fight the attitude that the rewards also need to be more concrete and probable.
Once upon a time, engineers often had to concern themselves with datacenter bills, cloud bills, and eventually SaaS bills. We'll probably have 5-10 years of being concerned about AI bills before the AI expense is trivial compared to the human time.
"once upon a time"? Engineers concern themselves with cloud bills right now, today! It's not a niche thing either, probably the majority of AWS consumers have to think about this, regularly.
> it shows you the dollar amount spent which is a bit disconcerting
I can assure you that I don’t at all care about the MAYBE $10 charge my monster Claude Code session billed the company. They also clearly said “don’t worry about cost, just go figure out how to work with it”
Meanwhile I ask it to write what I think are trivial functions and it gets them subtly wrong, but obvious in testing. I would be more suspicious if I were you.
I've been trying Claude Code for a few weeks after using Gemini Cli.
There's something a little better about the tool-use loop, which is nice.
But Claude seems a little dumber and is aggressive about "getting things done", often ignoring common sense or explicit instructions or design information.
If I tell it to make a test pass, it will sometimes change my database structure to avoid having to debug the test. At least twice it deleted protobufs from my project and replaced them with JSON because it struggled to immediately debug a proto issue.
I’ve seen Claude code get halfway through a small sized refactor (function parameters changed shape or something like that), say something that looks like frustration at the amount of time it’s taking, revert all of the good changes, and start writing a bash script to automate the whole process.
In that case, you have to put a stop to it and point out that it would already be done if it hadn’t decided to blow it all up in an effort to write a one-time-use codemod. Of course it agrees with that point, as it agrees with everything. It’s the epitome of strong opinions loosely held.
Claude trying to cheat its way through tests has been my experience as well. Often it’ll delete or skip them and proudly claim all issues have been fixed. This behavior seems to be intrinsic to it since it happens with both Claude Code and Cursor.
Interestingly, it’s the only LLM I’ve seen behave that way. Others simply acknowledge the failure and, after a few hints, eventually get everything working.
Claude just hopes I won’t notice its tricks. It makes me wonder what else it might try to hide when misalignment has more serious consequences.
I just had the same thing happen. Some comprehensive tests were failing, and it decided to write a simple test instead rather than investigate why the more complicated tests were failing. I wonder if the team is trying to save compute by urging it to complete tasks more quickly! Claude seems to be under a compute crunch, as I often get API timeouts/errors.
The hilarious part I’ve found is that when it runs into the least bit of trouble with a step on one of its plans, it will say it has been “Deferred” and then make up an excuse for why that’s acceptable.
It is sometimes acceptable for humans to use judgment and defer work; the machine doesn’t have judgment so it is not acceptable for it to do so.
Talking about hilarious, we had a Close Encounter of the Hallucinating Kind today. We were having mysterious simultaneous gRPC socket-closed exceptions on the client and server side running in Kubernetes talking to each other through an nginx ingress.
We captured debug logs, described the detailed issue to Gemini 2.5 Flash giving it the nginx logs for the one second before and after an example incident, about 10k log entries.
It came back with a clear verdict, saying
"The smoking gun is here:
2025/07/24 21:39:51 [debug] 32#32: *5902095 rport:443 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 10.233.100.128, server: grpc-ai-test.not-relevant.org, request: POST /org.not-relevant.cloud.api.grpc.CloudEventsService/startStreaming HTTP/2.0, upstream: grpc://10.233.75.54:50051, host: grpc-ai-test.not-relevant.org"
and gave me a detailed action plan.
I was thinking this is cool, don't need to use my head on this, until I realized that the log entry simply did not exist. It was entirely made up.
(And yes I admit, I should know better than to do lousy prompting on a cheap foundation model)
My favorite is when you ask Claude to implement two requirements and it implements the first, gets confused by the second, removes the implementation for the first to “focus” on the second, and then finishes by having implemented nothing.
Oh yeah totally. It feels a bit deceptive sometimes.
Like just now it says "great the tests are consistently passing!" So I ran the same test command and 4 of the 7 tests are so broken they don't even build.
Well, I would say that the machine should not override the human input. But if the machine makes up the plans in the first place, then why should it not be allowed to change the plans? I think the hilarious part about modifying tests to make them pass without understanding why they fail is that it probably comes from training on humans.
Also started to suspect that, but I have a bigger problem with the content than styling:
> "Instead of remembering complex Kubernetes commands, they ask Claude for the correct syntax, like "how to get all pods or deployment status," and receive the exact commands needed for their infrastructure work."
Duh, you can ask an LLM tech questions and stuff. What is the point of putting something like that on the tech blog of a company which is supposed to be working on bleeding-edge tech?
To get more people using it, and to get them using it more. I’ve encountered people who don’t use it because they think that it isn’t something that will help them, even in tech. Showing how different groups find value in it might get people in those same positions using it.
Even people who do use it might be thinking about it narrowly. They use it for code generation, but might not think to use it for simplified man pages.
Of course there are people who are the exact opposite and use it for every last thing they do. And maybe from this they learn how to better approach their prompts.
I don't think the problem is using Claude - in fact some of the writing is quite clumsy and amateurish, suggesting an actual human wrote it. The overall post reads like a collection of survey responses, with no overarching organization, and no filtering of repetitive or empty responses. Nobody was in charge.
The first example was helping debug a k8s issue, which was diagnosed as IP pool exhaustion; Claude helped them fix it without needing a network expert.
But, if they had an expert in networking build it in the first place, would they have not avoided the error entirely up front?
I've been pretty happy with the python package hns for this [1]. You can run it from the terminal with `uvx hns` and it will listen until you press enter and then copy the transcription to the clipboard. It's a simple tool that does one thing well and integrates smoothly with a CLI-based workflow.
[1] - https://github.com/primaprashant/hns
The copy aspect was the main value prop for the app I chose: Voice Type. You can do ctrl-v to start recording, again to stop, and it pastes it in the active text box anywhere on your computer.
I often work on large, complicated projects that span the whole codebase and multiple micro services. So it's often a blend of engineering, architectural, and product priorities. I can end up talking for paragraphs or multiple pages to fully explain the context. Then Claude typically has follow-up questions, things that aren't clear, or issues that I didn't catch.
Honestly, I just get sick of typing out "dissertations" every time. It's easier just to have a conversation, save it to a file, and then use that as context to start a new thread and do the work.
Not only do I type faster than I speak, I'm also able to edit as I go along, correcting any mistakes or things I've stumbled over and making them clearer. Half my experience of using even basic voice assistants is starting to ask for something and then going "ugh, no, cancel" because I stumbled over part of a sentence and I know I'll end up with some utter nonsense in my todo list.
> When Kubernetes clusters went down and weren't scheduling new pods, the team used Claude Code to diagnose the issue. They fed screenshots of dashboards into Claude Code, which guided them through Google Cloud's UI menu by menu until they found a warning indicating pod IP address exhaustion. Claude Code then provided the exact commands to create a new IP pool and add it to the cluster, bypassing the need to involve networking specialists.
This seems rather inefficient, and also surprising that Claude Code was even needed for this.
They're subsidizing a world where we need AI instead of understanding, or at the very least knowing who can help us. Eventually we'll be so dumb that we're the AI's slaves.
Well, Anthropic sure thinks that you should. Number go up!
10% is the time it works 100% of the time.
you can do the same for $200/month
Should be the same party as is getting the rewards of the productivity gains.
I've written a little about some of my findings and workflow in detail here: https://github.com/sutt/agro/blob/master/docs/case-studies/a...
The downside is I don’t have as much of a grasp on what’s actually happening in my project, while with hand-written projects I’d know every detail.
- there's a devlog showing all the prompts and accepted outputs: https://github.com/sutt/agro/blob/master/docs/dev-summary-v1...
- and you can look at the ai-generated tests (as is being discussed above) and see they aren't very well thought out for the behavior, but are syntactically impressive: https://github.com/sutt/agro/tree/master/tests
- check out the case-studies in the docs if you're interested in more ideas.
So I guess the blog team also uses Claude
I can just talk to it like a person and explain the full context / history of things. Way faster than typing it all out.
https://apps.apple.com/us/app/voice-type-local-dictation/id6...
The developer is pretty cool too. I found a few bugs here and there and reported them. He responds pretty much immediately.
I highly recommend getting a good microphone, I use a Rode smartlav. It makes a huge difference.
I type a lot faster than I speak :D
https://handy.computer
Is it really a value add to my life that I know some detail on page A or have some API memorized?
I’d rather we put smart people in charge of using AI to build out great products.
It should make things 10000x more competitive. I for one am excited AF for what the future holds.
If people want to be purists and pat themselves on the back, sure. I mean, people have hobbies, like art.