The gambling analogy completely falls apart on inspection. Slot machines have variable reward schedules by design — every element is optimized to maximize time on device. Social media optimizes for engagement, and compulsive behavior is the predictable output. The optimization target produces the addiction.
What's Anthropic's optimization target??? Getting you the right answer as fast as possible! The variability in agent output is working against that goal, not serving it. If they could make it right 100% of the time, they would — and the "slot machine" nonsense disappears entirely. On capped plans, both you and Anthropic are incentivized to minimize interactions, not maximize them. That's the opposite of a casino. It's ... alignment (of a sort).
An unreliable tool that the manufacturer is actively trying to make more reliable is not a slot machine. It's a tool that isn't finished yet.
I've been building a space simulator for longer than some of the people diagnosing me have been programming. I built things obsessively before LLMs. I'll build things obsessively after.
The pathologizing of "person who likes making things chooses making things over Netflix" requires you to treat passive consumption as the healthy baseline, which is obviously a claim nobody in this conversation is bothering to defend.
> What's Anthropic's optimization target??? Getting you the right answer as fast as possible!
What makes you believe this? The current trend among all major providers seems to be: get you to spin up as many agents as possible, so that you get billed more and their request numbers go up.
> Slot machines have variable reward schedules by design
LLMs from all major providers are trained with RLHF, which optimizes them in ways we don't entirely understand to keep you engaged.
These are incredibly naive assumptions. Anthropic/OpenAI/etc don't care if you get your "answer solved quickly", they care that you keep paying and that all their numbers go up. They aren't doing this as a favor to you and there's no reason to believe that these systems are optimized in your interest.
> I built things obsessively before LLMs. I'll build things obsessively after.
The core argument of the "gambling hypothesis" is that many of these people aren't really building things. To be clear, I certainly don't know if this is true of you in particular, it probably isn't. But just because this doesn't apply to you specifically doesn't mean it's not a solid argument.
> The current trend in all major providers seem to be: get you to spin up as many agents as possible so that you can get billed more and their number of requests goes up.
I was surprised when I saw that Cursor added a feature to set the number of agents for a given prompt. I figured it might be a performance thing - fan out complex tasks across multiple agents that can work on the problem in parallel and get a combined solution. I was extremely disappointed when I realized it's just "repeat the same prompt to N separate agents, let each one take a shot and then pick a winner". Especially when some tasks can run for several minutes, rapidly burning through millions of tokens per agent.
At that point it's just rolling dice. If an agent goes so far off-script that its result is trash, I would expect that to mean I need to rework the instructions and context I gave it, not that I should try the same thing again and hope that entropy fixes it. But editing your prompt offline doesn't burn tokens, so it's not what makes them money.
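For what it's worth, the mechanism appears to be plain best-of-N sampling: fan out, score, keep one. A minimal sketch of the idea (`run_agent` and `score_candidate` are hypothetical stand-ins, not Cursor's actual API):

```python
import concurrent.futures

def run_agent(prompt: str, attempt: int) -> str:
    """Hypothetical stand-in for one agent taking a shot at the prompt."""
    raise NotImplementedError

def score_candidate(result: str) -> float:
    """Hypothetical judge: tests passed, lint score, LLM-as-judge, etc."""
    raise NotImplementedError

def best_of_n(prompt: str, n: int = 4) -> str:
    # The same prompt goes to N independent agents; each burns its own tokens.
    with concurrent.futures.ThreadPoolExecutor(max_workers=n) as pool:
        candidates = list(pool.map(lambda i: run_agent(prompt, i), range(n)))
    # "Pick a winner" is just an argmax over whatever the judge measures.
    return max(candidates, key=score_candidate)
```

Note the cost structure: token spend scales linearly with N, while the judge can only ever salvage the best of what the unchanged prompt already produces.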
Simply, cut-throat competition. Given that multiple nations are funding different AI labs, quality of output and speed are among the most important things.
There’s a line to be trodden between returning the best result immediately and forcing multiple attempts. Google got caught red-handed reducing search quality to increase ad impressions; there's no reason to think the AI companies (of which Google is one) won't slowly gravitate to the same.
My (possibly dated) understanding is that OpenAI/Anthropic are charging less than it costs right now to run inference. They are losing money while they build the market.
Assuming that is still true, then they absolutely have an incentive to keep your tokens/requests to the absolute minimum required to solve your problem and wow you.
I'm also seeing a lot of new rambling in Sonnet 4.6 compared to 4.5: more markdown slop, pointing out details and things in the context that aren't too useful, etc...
which then causes increased token usage because you need to prompt multiple times. Idk, maybe it's just me though.
> The gambling analogy completely falls apart on inspection. Slot machines have variable reward schedules by design — every element is optimized to maximize time on device. Social media optimizes for engagement, and compulsive behavior is the predictable output. The optimization target produces the addiction.
Intermittent variable rewards, whether produced by design or merely as a byproduct, will induce compulsive behavior, no matter the optimization target. This applies to Claude.
Sometimes I will go out and I will plant a pepper plant and take care of it all summer long and obsessively ensure it has precisely the right amount of water and compost and so on... and ... for some reason (maybe I was on vacation and it got over 105 degrees?) I don't get a good crop.
Does this mean I should not garden because it's a variable reward? Of course not.
Sometimes I will go out fishing and I won't catch a damn thing. Should I stop fishing?
Obviously no.
So what's the difference? What is the precise mechanism here that you're pointing at? Because "sometimes life is disappointing" is a reason to do nothing. And yet.
> Intermittent variable rewards, whether produced by design or merely as a byproduct, will induce compulsive behavior, no matter the optimization target.
This is an incorrect understanding of intermittent variable reward research.
Claims that it "will induce compulsive behavior" are not consistent with the research. Most rewards in life are variable and intermittent and people aren't out there developing compulsive behavior for everything that fits that description.
There are many counter-examples, such as job searching: It's clearly an intermittent variable reward to apply for a job and get a good offer for it, but it doesn't turn people into compulsive job-applying robots.
The strongest addictions to drugs also have little to do with being intermittent or variable. Someone can take a precisely measured abuse-threshold dose of a drug on a strict schedule and still develop compulsions to take more. Compulsions at a level that eclipses any behavior they'd encounter naturally.
Intermittent variable reward schedules can be a factor in increasing anticipatory behavior and rewards, but claiming that they "will induce compulsive behavior" is a severe misunderstanding of the science.
And that's only bad if it's illusory or fake. This reaction evolved because it's adaptive. In slot machines the brain is tricked into believing there is some strategy or method to crack, and the reward signals make the addict feel there is some kind of progress being made in return for some kind of effort.
The variability in e.g. soccer kicks or basketball throws is also there, but clearly there is a skill element and a potential for progress. Same with many other activities. Coding with LLMs is not so different. There are clearly ways you can do it better and it's not pure randomness.
So you're saying businesses shouldn't hire people either?
Right. A platform that makes money the more you have to use it is definitely optimizing to get you the right answer in as few tokens as possible.
There is absolutely no incentive to do that, for any of these companies. The incentive is to make the model just bad enough you keep coming back, but not so bad you go to a competitor.
We've already seen this play out. We know Google made their search results worse to drive up ad revenue. The exact same incentives are at play here, only worse.
IF I USE FEWER TOKENS, ANTHROPIC GETS MORE MONEY! You are blindly pattern matching to "corporation bad!" without actually considering the underlying structure of the situation. I believe there's a phrase for this to do with probabilistic avians?
https://www-cdn.anthropic.com/58284b19e702b49db9302d5b6f135a...
(cmd-f "slot machine")
> What's Anthropic's optimization target??? Getting you the right answer as fast as possible!
Are you totally sure they are not measuring/optimizing engagement metrics? Because I can bet that at least OpenAI is doing that with every product they have to offer.
Thank you! I don't get how so many people want to see dark patterns everywhere. All arguments miss the big counterargument: in a world where you have competitors, even free ones, you can't fuck around. You need to get it working. It's not a slot machine for me. How on earth are people using it? And if it were, I'd take my money elsewhere (Kimi for example, OpenRouter or whatever). It needs to do my work as correctly as possible. That's the business they are in. Tech folks talking about economics is so cringe. It's always just "corporations bad". As if they exist in a vacuum.
> What's Anthropic's optimization target??? Getting you the right answer as fast as possible!
That is a generous interpretation. Might be correct. But they don't make as much money if you quickly get the right answer. They make more money if you spend as many tokens as possible being on that "maybe next time" hook.
I'm not saying they're actually optimizing for that. But Charlie Munger said "show me the incentives, and I'll show you the outcome".
I know for sure that each and every AI I use wants to write whole novellas in response to every prompt unless I carefully remind it to keep responses short over and over and over again.
This didn't used to be the case, so I assume that it must be intentional.
> The gambling analogy completely falls apart on inspection.
The analogy was too strained to make sense.
Despite being framed as a helpful plea to gambling addicts, I think it’s clear this post was actually targeted at an anti-LLM audience. It’s supposed to make the reader feel good for choosing not to use them by portraying LLM users as poor gambling addicts.
At one point, people said Google's optimization target was giving you the right search results as soon as possible. What will prevent Anthropic from falling into the same pattern of enshittification as its predecessors, optimizing for profit like all other businesses?
I stopped using Google years ago because they stopped trying to provide good search results. If Anthropic stops trying to provide a good coding agent, I'll stop using them too.
I found it interesting that Google removed the "summary cards" supposedly "to improve user experience"; however, the AI overview was added back.
I suspect the AI overview is much more influenceable by advertisement money than the summary cards were.
Doesn't the alignment sort of depend on who is paying for all the tokens?
If Dave the developer is paying, Dave is incentivized to optimize token use along with Anthropic (for the different reasons mentioned).
If Dave's employer, Earl, is paying and is mostly interested in getting Dave to work more, then what incentive does Dave have to minimize tokens? He's mostly incentivized by Earl to produce more code, and now also by Anthropic's accidentally variable-reward coding system, to code more... ?
>The pathologizing of "person who likes making things chooses making things over Netflix" requires you to treat passive consumption as the healthy baseline, which is obviously a claim nobody in this conversation is bothering to defend
I think their greater argument was to highlight how agentic coding is eroding work life balance, and that companies are beginning to make that the norm.
Disagree. Unreliability is intractable because of the human, not the tool.
Even a perfect LLM will not be able to produce perfect outputs because humans will never put in all the context necessary to zero-shot any non-trivial query. LLMs can't read your mind and will always make distasteful assumptions unless driven by users without any unique preferences or a lot of time on their hands to ruminate on exactly how they want something done.
I think it will always be mostly boring back-and-forth until the jackpot comes. Maybe future generations will align their preferences with the default LLM output instead of human preferences in that domain, though.
yeah I think the bluesky embed is much more along the lines of what I'm experiencing than the OP itself.
> "person who likes making things chooses making things over Netflix"
This is subtly different. It's not clear that the people depicted like making things, in the sense of enjoying the process. The narrative is about LLMs fitting into the already-existing startup culture. There's already a blurry boundary between "risky investment" and "gambling", given that most businesses (of all types, not just startups) have a high failure rate. The socially destructive characteristic identified here is: given more opportunity to pull the handle on the gambling machine, people are choosing to do that at the expense of other parts of their life.
But yes, this relies on a subjective distinction between "building, but with unpredictable results" and "gambling, with its associated self-delusions".
It is a business that sells monthly subscriptions.
> What's Anthropic's optimization target??? Getting you the right answer as fast as possible!
Wait, what? Anthropic makes money by getting you to buy and expend tokens. The last thing they want is for you to get the right answer as fast as possible. They want you to sometimes get the right answer unpredictably, but with enough likelihood that this time will work that you keep hitting Enter.
Given that pre-paid plans are the most popular way to subscribe to Claude, it quite plainly is a "the fewer tokens you use, the more money Anthropic makes" kind of situation.
In an environment where providers are almost entirely interchangeable and the tiniest of perceived edges (because there's still no benchmark unambiguously judging which model is "better") make or break user retention, I just don't see how it's not ludicrous on its face that any LLM provider would be incentivized to give unreliable answers at some high-enough probability.
The LLM is not the slot machine. The LLM is the lever of the slot machine, and the slot machine itself is capitalism. Pull the lever, see if it generates a marketable product or moment of virality, get rich if you hit the jackpot. If not, pull again.
I don't know why you were downvoted. This is the FOMO that encourages agent gambling, automated experimentation in the hopes of accidentally striking digital gold before your peers do. A million monkeys racing 24/7 to create the next Harry Potter first.
Ideas are a dime a dozen, now proofs of concept are a load of tokens a dozen.
You may have a point, but either way: immediately taking it personally like this and creating a whole semi-rant that includes something to the effect of "I've been doing this since before you were born" really makes you sound like a person with a gambling problem.
Trust me, we all feel like the house is our friend until it isn't!
I wish the author had stuck to the salient point about work/life balance instead of drifting into the gambling tangent, because the core message is actually more unsettling. With the tech job market being rough and AI tools making it so frictionless to produce real output, the line between work time and personal time is basically disappearing.
To the bluesky poster's point: Pulling out a laptop at a party feels awkward for most; pulling out your phone to respond to claude barely registers. That’s what makes it dangerous: It's so easy to feel some sense of progress now. Even when you’re tired and burned out, you can still make progress by just sending off a quick message. The quality will, of course, slip over time; but far less than it did previously.
Add in a weak labor market and people feel pressure to stay working all the time. Partly because everyone else is (and nobody wants to be at the bottom of the stack ranking), and partly because it’s easier than ever to avoid hitting a wall by just "one more message". Steve Yegge's point about AI vampires rings true to me: A lot of coworkers I’ve talked to feel burned out after just a few months of going hard with AI tools. Those same people are the ones working nights and weekends because "I can just have a back-and-forth with Claude while I'm watching a show now".
The likely result is the usual pattern for increases in labor productivity. People who can’t keep up get pushed out, people who can keep up stay stuck grinding, and companies get to claim the increase in productivity while reducing expenses. Steve's suggestion for shorter workdays sounds nice in theory, but I would bet significant amounts of money the 40-hour work week remains the standard for a long time to come.
Another interesting thing here is that the gap between "burned out but just producing subpar work" and "so crispy I literally cannot work" is even wider with AI. The bar for just firing off prompts is low, but the mental effort required to know the right prompts to ask and then validate is much higher so you just skip that part. You can work for months doing terrible work and then eventually the entire codebase collapses.
> With the tech job market being rough and AI tools making it so frictionless to produce real output, the line between work time and personal time is basically disappearing.
This isn't generally true at all. The "all tech companies are going to 996" meme comes up a lot here but all of the links and anecdotes go back to the same few sources.
It is very true that the tech job market is competitive again after the post-COVID period where virtually nobody was getting fired and jobs were easy to find.
I do not think it's true that the median or even 90th percentile tech job is becoming so overbearing that personal time is disappearing. If you're at a job where they're trying to normalize overwork as something everyone is doing, they're just lying to you to extract more work.
It would never show up as some explicit rule or document. It just sort of happens when a few things line up: execs start off-handedly praising 996, stack ranking is still a thing, and the job market is bad enough that getting fired feels genuinely dangerous.
It starts with people who feel they’ve got more to lose (like those supporting a family) working extra to avoid looking like a low performer, whether that fear is reasonable or not. People aren’t perfectly rational, and job-loss anxiety makes them push harder than they otherwise would. Especially now, when "pushing harder" might just mean sending chat messages to claude during your personal time.
Totally anecdotal (strike 1), and I'm at a FAANG which is definitely not the median tech job (strike 2), but it’s become pretty normal for me to come back Monday to a pile of messages sent by peers over the weekend. A couple years ago even that was extremely unusual; even if people were working on the weekend they at least kept up a facade that they weren't.
I know I'm running a bit late to the party here, but maybe someone can provide some color that I (on the slightly older end of the spectrum when it comes to this) don't fully understand.
When people talk about leaving their agents to run overnight, what are those agents actually doing? The limited utility I've had using agent-supported software development requires a significant amount of hand holding, maybe because I'm in an industry with limited externally available examples to build a model off of (though all of the specifications are public, I've yet to see an agent build an appropriate implementation).
So it's much more transactional...I ask, it does something (usually within seconds), I correct, it iterates again...
What sort of tasks are people putting these agents to? How are people running 'multiple' of these agents? What am I missing here?
My impression so far is that the parallel agent story is a fabrication of "ai influencers" and the labs themselves.
I might run 3-4 claude sessions because that's the only way to have "multiple chats" to e.g. ask unrelated things. Occasionally a task takes long enough to keep multiple sessions busy, but that's rather rare, and if it happens it's because the agent runs a long-running task like the whole test suite.
The story of running multiple agents to build full features in parallel... doesn't really add up in my experience. It kinda works for a bit if you have a green field project where the complexity is still extremely low.
However once you have a feature interaction matrix that is larger than say 3x3 you have to hand hold the system to not make stupid assumptions. Or you prompt very precisely but this also takes time and prevents you from ever running into the parallel situation.
The feature interaction matrix size is my current proxy "pseudo-metric" for when agentic coding might work well and at which abstraction level.
This is exactly my experience as well. The feature interaction matrix is growing as models get better, and I tend to build "prompt library components" for each project which saves time on "you prompt very precisely but this also takes time".
But so far that doesn't change the reality - I can't find any opportunities to let an agent run for more than 30 minutes at best, and parallel agents just seem to confuse each other.
I came from embedded, where I wasn't able to use agents very effectively for anything other than quick round trip iterative stuff. They were still really useful, but I definitely could never envision just letting an agent run unattended.
But I recently switched domains into vaguely "fullstack web" using very popular frameworks. If I spend a good portion of my day going back and forth with an agent, working on a detailed implementation plan that spawns multiple agents, there is seemingly no limit* to the scope of the work they are able to accurately produce. This is because I'm reading through the whole plan and checking for silly gotchas and larger implementation mistakes before I let them run. It's also great because I can see how the work can be parallelized at certain parts, but blocked at others, and see how much work can be parallelized at once.
Once I'm ready, I can usually let it start with not even the latest models, because the actual implementation is so straightforwardly prompted that it gets it close to perfectly right. I usually sit next to it and validate it while it's working, but I could easily imagine someone letting it run overnight to wake up to a fresh PR in the morning.
Don't get me wrong, it's still more work than just "vibing" the whole thing, but it's _so_ much more efficient than actually implementing it, especially when it's a lot of repetitive patterns and boilerplate.
* I think the limit is how much I can actually keep in my brain and spec out in a well thought out manner that doesn't let any corner cases through, which is still a limit, but not necessarily one coming from the agents. Once I have one document implemented, I can move on to the next with my own fresh mental context which makes it a lot easier to work.
Hope it helps!
The amount of boilerplate people talk about seems like the fault of these big modern frameworks honestly. A good system design shouldn't HAVE so much boilerplate. Think people would be better off simplifying and eliminating it deterministically before reaching for the LLM slot machine.
I had a few useful examples of this. In order to make it work you need to define your quality gates and a rather complex spec. I personally use https://github.com/probelabs/visor for creating the gates. It can be a code-review gate, or how well the implementation aligns with the spec, etc. And basically it makes the agent loop until it passes. One of the tips, especially when using Claude Code, is to explicitly ask it to create "tasks", and also to use subagents. For example, say I want to validate and re-structure all my documentation - I would ask it to create a task to research the state of my docs, then a task per specific detail, then a task to re-validate quality after it has finished. You can also play around with the gates with simpler tooling, for example https://probelabs.com/vow/
> One of the tips, especially when using Claude Code, is to explicitly ask it to create "tasks", and also to use subagents. For example, say I want to validate and re-structure all my documentation - I would ask it to create a task to research the state of my docs, then a task per specific detail, then a task to re-validate quality after it has finished.
This is definitely a way to keep those who wear Program and Project manager hats busy.
That is interesting. Never considered trying to throw one or two into a loop together to try to keep it honest. Appreciate the Visor recommendation, I'll give it a look and see if I can make this all 'make sense'.
As I build with agents, I frequently run into new issues that aren't in scope for the task I'm on and would cause context drift. I have the agent create a github issue with a short problem description and keep going on the current task. In another terminal I spin up a new agent and just tell it "investigate GH issue 123" and it starts diving in, finds the root cause, and proposes a fix. Depending on what parts of the code the issue fix touches and what other agents I've got going, I can have 3-4 agents more or less independently closing out issues/creating PRs for review at a time. The agents log their work in a work log - what they did, what worked, what didn't, problems they encountered using tools - and about once a day I have an agent review the work log and update the AGENTS.md with lessons learned.
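A rough sketch of that dispatch loop, assuming the official `gh` CLI and Claude Code's non-interactive `-p` flag (the issue label and the per-agent clone layout are my own invention):

```python
import json
import subprocess

def open_issues(label: str = "agent-task") -> list[dict]:
    # `gh issue list --json` is part of the official GitHub CLI.
    out = subprocess.run(
        ["gh", "issue", "list", "--label", label, "--json", "number,title"],
        capture_output=True, text=True, check=True,
    ).stdout
    return json.loads(out)

def dispatch(issue: dict, workdir: str) -> subprocess.Popen:
    # One agent per issue, each in its own clone so they don't collide.
    # `claude -p` runs a single non-interactive turn in that directory.
    prompt = f"Investigate GH issue {issue['number']}: find the root cause and propose a fix."
    return subprocess.Popen(["claude", "-p", prompt], cwd=workdir)

if __name__ == "__main__":
    clones = ["clones/a", "clones/b", "clones/c"]  # hypothetical layout
    for issue, clone in zip(open_issues(), clones):
        dispatch(issue, clone)
```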
What are you using as the environment for this? I am running into similar issues; I can't really spin up a second agent because they would collide. Just a newly cloned repo?
With 5.3 Codex, the execplans skill, and a well-specified implementation task, you can get a good couple of hours' work in a single turn. That's already in the scope of "set it up before bed and review it in the morning".
If you have a loop set up, e.g., using OpenClaw or a Ralph loop, you can stretch that out further.
I would suggest that when you get to that point, you really want some kind of adversarial system set up with code reviews (e.g., provided by CodeRabbit or Sourcery) and automation to feed that back into the coding agent.
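In loop form, the adversarial setup is roughly this; `generate_patch` and `review_patch` are placeholders for whichever coding agent and reviewer (a second model, CodeRabbit, etc.) you wire up:

```python
def generate_patch(task: str, feedback: str | None) -> str:
    """Placeholder: one coding-agent turn, optionally steered by review feedback."""
    raise NotImplementedError

def review_patch(patch: str) -> tuple[bool, str]:
    """Placeholder: adversarial reviewer. Returns (approved, comments)."""
    raise NotImplementedError

def run(task: str, max_rounds: int = 5) -> str | None:
    feedback = None
    for _ in range(max_rounds):
        patch = generate_patch(task, feedback)
        approved, feedback = review_patch(patch)
        if approved:
            return patch
    return None  # cap the rounds; a human reads the last feedback instead
```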
Providing material for attention-grabbing headlines and blog posts, primarily. Can't (in good conscience, at least) claim you had an agent running all night if you didn't actually run an agent all night.
If you visualize it as AI agents throwing a rope to wrangle a problem, and then visualize a dozen of these agents throwing their ropes around a room, and at each other -- very quickly you'll also visualize the mess of code that a collection of agents creates without oversight. It might even run, and some might say that's the only true point, but... at what cost in code complexity, performance waste, cascading bugs, etc.?
Is it possible? Yes, I've had success with having a model output a 100 step plan that tried to deconflict among multiple agents. Without re-creating 'Gas town', I could not get the agents to operate without stepping on toes. With _me_ as the grand coordinator, I was able to execute and replicate a SaaS product (at a surface level) in about 24hrs. Output was around 100k lines of code (without counting css/js).
Who can prove that it works correctly though? An AI enthusiast will say "as long as you've got test coverage blah blah blah". Those who have worked on large-scale products know that tests passing is basically the bare minimum. So you smoke test it, hope you've got all the paths, and toss it up and try to collect money from people? I don't know. If _this_ is the future, it will collapse under the weight of garbage code, security and privacy breaches, and who knows what else.
I will give you an example I heard from an acquaintance yesterday - this person is very smart but not strictly “technical”.
He is building a trading automation for personal use. In his design he gets a message on whatsapp/signal/telegram and approves/rejects the trade suggestion.
To define specifications for this, he defined multiple agents (a quant, a data scientist, a principal engineer, and trading experts - “warren buffett”, “ray dalio”) and let the agents run until they reached a consensus on what the design should be. He said this ran for a couple of hours (so not strictly overnight) after he went to sleep; in the morning he read and amended the output (10s of pages equivalent) and let it build.
This is not a strictly-defined coding task, but there are now many examples of emerging patterns where you have multiple agents supporting each other, running tasks in parallel, correcting/criticising/challenging each other, until some definition of “done” has been satisfied.
That said, personally my usage is much like yours - I run agents one at a time and closely monitor output before proceeding, to avoid finding a clusterfuck of bad choices built on top of each other. So you are not alone my friend :-)
This is my experience of it too. Perhaps if it was chunking through a large task like upgrading all of our repos to the latest engine supported by our cloud provider, I could leave it overnight. Even then it would just result in a large daylight backlog of "not quite right" to review and redo.
I think that's the issue I have with using these tools so far (definitely professionally, but even in pet projects for embedded systems). The mental load of having to go back through and make sure all of the lines of code do what the agent claims they do, even with tests, is significantly greater than the effort it would take to learn the implementation myself.
I can see the utility in creating very simple web-based tools where there's a monstrous wealth of public resources to build a model off of, but even the most recent models provided by Anthropic, OpenAI, or MSFT seem prone to not-quite-perfection. And every time I find an error I'm left wondering what other bugs I'm not catching.
This is very dependent on what kind of work you're asking the agent to do. For software, I've had quite a bit of success providing detailed API specifications and asking an LLM to build a client library for that. You can leave it running unattended as long as it knows what it's supposed to build and it won't need a lot of correction since you're providing the routes, returned statuses and possible error messages.
Do some people just create complete SaaSlop apps with it overnight? Of course, just put together a plan (by asking the LLM to write the plan) with everything you want the app to do and let it run.
> it won't need a lot of correction since you're providing the routes, returned statuses and possible error messages.
Wouldn’t it be better to set up API docs (Postman, RapidApi, …), extract an OpenAPI version from that, then use a generator for your language of choice (Nswag, …)?
There has only been one instance of coding where I let the agent run for like 7 hours: to generate Playwright tests. Once the scaffolding is done, it is just a matter of writing a test for each component. But yeah, even for that I didn't just fire and forget.
I wrote a program to classify thousands of images but that was using a model running on my gaming PC. Took about 3 days to classify them all. Only cost me the power right?
Power, gaming rig, internet, somewhere to store the rig, probably pay property taxes too.
You can draw the line wherever you want. :) Personally, I wish I'd built a new gaming rig a year ago so I could mess with local models and pay all these same costs.
I have agents run at night to work through complicated TTRPG campaigns. For example, I have a script that runs all night simulating NPCs before a session. The NPCs have character sheets + motivations, and the LLMs do one prompt per NPC in stages so combat can happen after social interactions. If you run enough of these and write the prompts well you can save a lot of time. You can't like... simulate the start of a campaign and then jump in. It's more like: you know there is a big event, you already have characters, you can throw them in a folder to see how things would cook all else being equal, and then use that to riff off of when you actually write your notes.
I think of my agents like golems from Discworld: they are defined by their script. Adding texture to them improves the results, so I usually keep a running tally of what they have worked on and add that to the header. They are a prompt in a folder that a script loops over and sends to Gemini (spawning an agent and moving to the next golem script).
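The nightly loop itself can be very small. A sketch assuming the `google-generativeai` Python client (the folder layout and tally-in-the-header convention are just my framing of what's described above; the staging between social and combat prompts is omitted):

```python
import pathlib
import google.generativeai as genai

genai.configure(api_key="...")  # your key here
model = genai.GenerativeModel("gemini-1.5-flash")

GOLEMS = pathlib.Path("golems")  # one prompt file per NPC "golem"

for script in sorted(GOLEMS.glob("*.txt")):
    golem = script.read_text()  # running tally lives at the top of the file
    result = model.generate_content(golem).text
    log = script.with_suffix(".log")
    old = log.read_text() if log.exists() else ""
    log.write_text(old + "\n---\n" + result)  # morning reading material
```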
I was also curious to see if it could be used for developing some small games. Whenever I would run into a problem I couldn't be bothered to solve, or needed a variety of something, I would let a few LLMs work on it so in the morning I had something to bounce off. I had pretty good success with this for RTS games and shooting games where variety is something well documented and creativity is allowed. I imagine there could be a use here; I've been calling it dredging because I imagine myself casting a net down into the slop to find valuables.
I did have an idea where all my sites and UI would be checked against some UI heuristic like Oregon State's inclusivity heuristic, but results have been mixed so far. The initial reports are fine, the implementation plans are OK, but it seems like the loop of examine, fix, examine... has too much drift? That does seem solvable, but I have a concern that this is like two lines that never touch but get closer as you approach infinity.
There is some usefulness in running these guys all night, but I'm still figuring out when it's useful and when it's a waste of resources.
Spin up a mid-sized linux vm (or any machine with 8 or 12 cores will do, with at least 16GB RAM and NVMe). Add 10 users. Install claude 10 times (one per user). Clone the repo 10 times (one per user). Have a centralized place to get tasks from (db, trello, txt, etc) - this is the memory. Have a cron wake up every 10 minutes and call your script. Your script calls claude in non-interactive mode + auto accept. It grabs a new task, takes a crack at it and creates a pull request. That is 6 tasks per hour per user, times 12 hours. Go from there and refine the harnesses/skills/scripts that the claudes can use.
In my case, I built a small api that claude can call to get tasks. I update the tasks on my phone.
The assumption is that you have a semi-well-structured codebase already (ours is 1M LOC C#). You have to use languages with strong typing + a strict compiler. You have to force claude to frequently build the code (hence the cpu cores + ram + NVMe requirement).
If you have multiple machines doing work, make a single one the master and give claude ssh to the others, and it can configure them and invoke work on them directly. The use case for this is when you have a beefy proxmox server with many smaller containers (think .net + debian). Give the main server access to all the "worker servers". Let claude document this infrastructure too and the different roles each machine plays. Soon you will have a small ranch of AIs doing different things, on different branches, making pull requests and putting feedback back into the task manager for you to upvote or downvote.
Just try it. It works. Your mind will be blown by what is possible.
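For the curious, the per-user script the cron calls can be tiny. A minimal sketch, assuming a hypothetical task endpoint on localhost plus Claude Code's non-interactive `-p` mode and its `--dangerously-skip-permissions` flag for auto-accept (the lock file just keeps cron from stacking runs):

```python
import pathlib
import subprocess
import urllib.request

LOCK = pathlib.Path("/tmp/agent.lock")  # one run at a time per user
REPO = pathlib.Path.home() / "repo"     # this user's own clone

def next_task() -> str | None:
    # Hypothetical task endpoint; could just as well be trello, a db, or a txt file.
    with urllib.request.urlopen("http://localhost:8080/tasks/next") as r:
        body = r.read().decode().strip()
    return body or None

if __name__ == "__main__" and not LOCK.exists():
    task = next_task()
    if task:
        LOCK.touch()
        try:
            # Grab the task, take a crack at it, open a PR.
            subprocess.run(["claude", "-p", task, "--dangerously-skip-permissions"],
                           cwd=REPO, check=False)
            subprocess.run(["gh", "pr", "create", "--fill"], cwd=REPO, check=False)
        finally:
            LOCK.unlink()
```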
I know it's popular comparing coding agents to slot machines right now, but the comparison doesn't entirely hold for me.
It's more like being hooked on a slot machine which pays out 95% of the time because you know how to trick it.
(I saw "no actual evidence pointing to these improvements" with a footnote and didn't even need to click that footnote to know it was the METR thing. I wish AI holdouts would find a few more studies.)
Steve Yegge of all people published something the other day that has similar conclusions to this piece - that the productivity boost for coding agents can lead to burnout, especially if companies use it to drive their employees to work in unsustainable ways: https://steve-yegge.medium.com/the-ai-vampire-eda6e4f07163
Yeah I'm finding that there's "clock time" (hours) and "calendar time" (days/weeks/months) and pushing people to work 'more' is based on the fallacy that our productivity is based on clock time (like it is in a factory pumping out widgets) rather than calendar time (like it is in art and other creative endeavors). I'm finding that even if the LLM can crank out my requested code in an hour, I'll still need a few days to process how it feels to use. The temptation is to pull the lever 10 times in a row because it was so easy, but now I'll need a few weeks to process the changes as a human. This is just for my own personal projects, and it makes sense that the business incentives would be even more intense. But you can't get around the fact that, no matter how brilliant your software or interface, customers are not going to start paying in a few hours.
I can churn out features faster, but that means I don't get time to fully absorb each feature and think through its consequences and relationships to other existing or future features.
If you are really good and fast at validating/fixing code output, or you are actually not validating it beyond making sure it runs (no judging), I can see it paying out 95% of the time.
But from what I've seen validating both my own and others' coding-agent outputs, I'd estimate a much lower percentage (Data Engineering/Science work). And, oh boy, some colleagues are hooked on generating no matter the quality. Workslop is a very real phenomenon.
This matches my experience using LLMs for science. Out of curiosity, I downloaded a randomized study and the CONSORT checklist, and asked Claude Code to do a review using the checklist.
I was really impressed with how it parsed the structured checklist. I was not at all impressed by how it digested the paper. Lots of disguised errors.
> It's more like being hooked on a slot machine which pays out 95% of the time because you know how to trick it.
“It’s not like a slot machine, it’s like… a slot machine… that I feel good using”
That aside, if a slot machine is doing your job correctly 95% of the time, it seems like either you aren't noticing when it's doing your job poorly or you've shifted the way that you work to only allow yourself to do work that the slot machine is good at.
> It's more like being hooked on a slot machine which pays out 95% of the time because you know how to trick it.
I think you are mistaken on what the "payout" is. There's only one reason someone is working all hours and during a party and whatnot: it's to become rich and powerful. The payout is not "more code", it's a big house, fast cars, beautiful women etc. Nobody can trick it into paying out even 1% of the time, let alone 95%.
It's 95% if you're using it for the stuff it's good at. People inevitably try to push it further than that (which is only natural!), and if you're operating at/beyond the capability frontier then the success rate eventually drops.
Being on a $200 plan is a weird motivator. Seeing the unused weekly limit for codex and the clock ticking down, and knowing I can spam GPT 5.2 Pro "for free" because I already paid for it.
> It's more like being hooked on a slot machine which pays out 95% of the time because you know how to trick it
Right but the <100% chance is actually why slot machines are addictive. If it pays out continuously the behaviour does not persist as long. It's called the partial reinforcement extinction effect.
That 95% payout only works if you already know what good looks like. The sketchy part is when you can't tell the diff between correct and almost-correct. That's where stuff goes sideways.
If you are trying to build something well represented in the training data, you could get a usable prototype.
If you are unfamiliar with the various ways that naive code would fail in production, you could be fooled into thinking generated code is all you need.
If you try to hold the hand of the coding agents to bring code to a point where it is production ready, be prepared for a frustrating cycle of models responding with ‘Fixed it!’ while only having introduced further issues.
How are we still citing the (excellent) METR study in support of conclusions about productivity that its authors rightly insist[0] it does not support?
My paraphrase of their caveats:
- experts on their own open source proj are not representative of most software dev
- measuring time undervalues trading time for effort
- tools are noticeably better than they were a year ago when the study was conducted
- it really does take months of use to get the hang of it (or did then, less so now)
Before you respond to these points, please look at the full study’s treatment of the caveats! It’s fantastic, and it’s clear almost no one citing the study actually read it.
This does seem like a person getting hooked on idle games, or mobile/online games with artificially limited progress (that you can pay to lift). It's a type of delayed gratification that makes you anxious to get the next one.
Not everyone gets hooked on those, but I do. I've played a bunch of those long-winded idle games, and it looks like a slight addiction. I would get impatient that it takes so long to progress, and it would add anxiety to e.g. run this during breaks at work, or just before going to sleep. "Just one more click".
And to be perfectly honest, it seems like the artificial limits of Anthropic (5 hour session limits) dig into similar mechanism. I do less non-programming hobbies since I've got myself a subscription.
Rather than let results be random, iteratively and continuously add more and more guardrails and grounding.
Tests, linting, guidance in response to key events (Claude Code hooks are great for this), automatically passing the agent’s code plan to another model invocation and then passing back whatever feedback that model has on the plan so you don’t have to point out the same flaws in plans over and over... custom scripts that iterate over your codebase for antipatterns (they can walk the AST or be regex-based - ask your agent to write them!)
Codify everything you’re looping back to your agent about and make it a guardrail. Give your agent the tools it needs to give itself grounding.
An agent without guardrails or grounding is like a person unconnected to their senses: disconnected from the world, all you do is dream - in a dream anything can happen, there’s nothing to ensure realism. When you look at it that way it’s a miracle coding agents produce anything useful at all :)
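As one concrete flavor of "walk the AST": a small checker that flags bare `except:` clauses, runnable as-is on any Python files (the antipattern is just an example; the point is that the nonzero exit code becomes a guardrail the agent loop has to satisfy):

```python
import ast
import sys

def find_bare_excepts(path: str) -> list[int]:
    """Return the line numbers of bare `except:` handlers in one file."""
    tree = ast.parse(open(path).read(), filename=path)
    return [node.lineno for node in ast.walk(tree)
            if isinstance(node, ast.ExceptHandler) and node.type is None]

if __name__ == "__main__":
    findings = {f: find_bare_excepts(f) for f in sys.argv[1:]}
    for f, lines in findings.items():
        for ln in lines:
            print(f"{f}:{ln}: bare except")
    # Nonzero exit = failed guardrail; a hook or loop sends the agent back for another pass.
    sys.exit(1 if any(findings.values()) else 0)
```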
What's Anthropic's optimization target??? Getting you the right answer as fast as possible! The variability in agent output is working against that goal, not serving it. If they could make it right 100% of the time, they would — and the "slot machine" nonsense disappears entirely. On capped plans, both you and Anthropic are incentivized to minimize interactions, not maximize them. That's the opposite of a casino. It's ... alignment (of a sort)
An unreliable tool that the manufacturer is actively trying to make more reliable is not a slot machine. It's a tool that isn't finished yet.
I've been building a space simulator for longer than some of the people diagnosing me have been programming. I built things obsessively before LLMs. I'll build things obsessively after.
The pathologizing of "person who likes making things chooses making things over Netflix" requires you to treat passive consumption as the healthy baseline, which is obviously a claim nobody in this conversation is bothering to defend.
What makes you believe this? The current trend in all major providers seem to be: get you to spin up as many agents as possible so that you can get billed more and their number of requests goes up.
> Slot machines have variable reward schedules by design
LLMs by all major providers are optimized used RLHF where they are optimized in ways we don't entirely understand to keep you engaged.
These are incredibly naive assumptions. Anthropic/OpenAI/etc don't care if you get your "answer solved quickly", they care that you keep paying and that all their numbers go up. They aren't doing this as a favor to you and there's no reason to believe that these systems are optimized in your interest.
> I built things obsessively before LLMs. I'll build things obsessively after.
The core argument of the "gambling hypothesis" is that many of these people aren't really building things. To be clear, I certainly don't know if this is true of you in particular, it probably isn't. But just because this doesn't apply to you specifically doesn't mean it's not a solid argument.
I was surprised when I saw that Cursor added a feature to set the number of agents for a given prompt. I figured it might be a performance thing - fan out complex tasks across multiple agents that can work on the problem in parallel and get a combined solution. I was extremely disappointed when I realized it's just "repeat the same prompt to N separate agents, let each one take a shot and then pick a winner". Especially when some tasks can run for several minutes, rapidly burning through millions of tokens per agent.
At that point it's just rolling dice. If an agent goes so far off-script that its result is trash, I would expect that to mean I need to rework the instructions and context I gave it, not that I should try the same thing again and hope that entropy fixes it. But editing your prompt offline doesn't burn tokens, so it's not what makes them money.
Simply, cut-throat competition. Given multiple nations are funding different AI-labs, quality of output and speed are one of the most important things.
Assuming that is still true, then they absolutely have an incentive to keep your tokens/requests to the absolute minimum required to solve your problem and wow you.
which then causes increased token usage because you need to prompt multiple times.
Idk, maybe it's just me though.
Dead Comment
Intermittent variable rewards, whether produced by design or merely as a byproduct, will induce compulsive behavior, no matter the optimization target. This applies to Claude
Does this mean I should not garden because it's a variable reward? Of course not.
Sometimes I will go out fishing and I won't catch a damn thing. Should I stop fishing?
Obviously no.
So what's the difference? What is the precise mechanism here that you're pointing at? Because sometimes life is disappointing is a reason to do nothing. And yet.
This is an incorrect understanding of intermittent variable reward research.
Claims that it "will induce compulsive behavior" are not consistent with the research. Most rewards in life are variable and intermittent and people aren't out there developing compulsive behavior for everything that fits that description.
There are many counter-examples, such as job searching: It's clearly an intermittent variable reward to apply for a job and get a good offer for it, but it doesn't turn people into compulsive job-applying robots.
The strongest addictions to drugs also have little to do with being intermittent or variable. Someone can take a precisely measured abuse-threshold dose of a drug on a strict schedule and still develop compulsions to take more. Compulsions at a level that eclipse any behavior they'd encounter naturally.
Intermittent variable reward schedules can be a factor in increasing anticipatory behavior and rewards, but claiming that they "will induce compulsive behavior" is a severe misunderstanding of the science.
The variability in eg soccer kicks or basketball throws is also there but clearly there is a skill element and a potential for progress. Same with many other activities. Coding with LLMs is not so different. There are clearly ways you can do it better and it's not pure randomness.
Deleted Comment
So you're saying businesses shouldn't hire people either?
There is absolutely no incentive to do that, for any of these companies. The incentive is to make the model just bad enough you keep coming back, but not so bad you go to a competitor.
We've already seen this play out. We know Google made their search results worse to drive up and revenue. Exact same incentives are at play here, only worse.
IF I USE LESS TOKENS, ANTHROPIC GETS MORE MONEY! You are blindly pattern matching to "corporation bad!" without actually considering the underlying structure of the situation. I believe there's a phrase for this to do with probabilistic avians?
https://www-cdn.anthropic.com/58284b19e702b49db9302d5b6f135a...
(cmd-f "slot machine")
Are you totally sure they are not measuring/optimizing engagement metrics? Because at least I can bet OpenAI is doing that with every product they have to offer.
Deleted Comment
That is a generous interpretation. Mighr be correct. But they dont make as much money if you quickly get the right answer. They make more money if you spend as many tokens as possible being on that "maybe next time" hook.
Im not saying theyre actually optimizng for that. But charlie munger said "show me the incentives, and ill show you the outcome"
This didn't used to be the case, so I assume that it must be intentional.
The analogy was too strained to make sense.
Despite being framed as a helpful plea to gambling addicts, I think it’s clear this post was actually targeted at an anti-LLM audience. It’s supposed to make the reader feel good for choosing not to use them by portraying LLM users as poor gambling addicts.
I found it interesting that Google removed the "summary cards" supposedly "to improve user experience" however the AI overview was added back.
I suspect the AI overview is much more influenceable by advertisement money then the summary cards where.
Deleted Comment
If Dave the developer is paying, Dave is incentivized to optimize token use along with Anthropic (for the different reasons mentioned).
If the Dave's employer, Earl, is paying and is mostly interested in getting Dave to work more, then what incentive does Dave have to minimize tokens? He's mostly incentivized by Earl to produce more code, and now also by Anthropic's accidentally variable-reward coding system, to code more... ?
I think their greater argument was to highlight how agentic coding is eroding work life balance, and that companies are beginning to make that the norm.
Deleted Comment
Deleted Comment
Even a perfect LLM will not be able to produce perfect outputs because humans will never put in all the context necessary to zero-shot any non-trivial query. LLMs can't read your mind and will always make distasteful assumptions unless driven by users without any unique preferences or a lot of time on their hands to ruminate on exactly how they want something done.
I think it will always be mostly boring back-and-forth until the jackpot comes. Maybe future generations will align their preferences with the default LLM output instead of human preferences in that domain, though.
yeah I think the bluesky embed is much more along the lines of what I'm experiencing than the OP itself.
This is subtly different. It's not clear that the people depicted like making things, in the sense of enjoying the process. The narrative is about LLMs fitting into the already-existing startup culture. There's already a blurry boundary between "risky investment" and "gambling", given that most businesses (of all types, not just startups) have a high failure rate. The socially destructive characteristic identified here is: given more opportunity to pull the handle on the gambling machine, people are choosing to do that at the expense of other parts of their life.
But yes, this relies on a subjective distinction between "building, but with unpredictable results" and "gambling, with its associated self-delusions".
It is a business that sells monthly subscriptions
Wait, what? Anthropic makes money by getting you to buy and expend tokens. The last thing they want is for you to get the right answer as fast as possible. They want you to sometimes get the right answer unpredictably, but with enough likelihood that this time will work that you keep hitting Enter.
In an environment where providers are almost entirely interchangeable and tiniest of perceived edges (because there's still no benchmark unambiguously judging which model is "better") make or break user retention, I just don't see how it's not ludicrous on its face that any LLM provider would be incentivized to give unreliable answers at some high-enough probability.
Ideas are a dime a dozen, now proofs of concept are a load of tokens a dozen.
Dead Comment
Trust me, we all feel like the house is our friend until its isn't!
To the bluesky poster's point: Pulling out a laptop at a party feels awkward for most; pulling out your phone to respond to claude barely registers. That’s what makes it dangerous: It's so easy to feel some sense of progress now. Even when you’re tired and burned out, you can still make progress by just sending off a quick message. The quality will, of course, slip over time; but far less than it did previously.
Add in a weak labor market and people feel pressure to stay working all the time. Partly because everyone else is (and nobody wants to be at the bottom of the stack ranking), and partly because it’s easier than ever to avoid hitting a wall by just "one more message". Steve Yegge's point about AI vampires rings true to me: A lot of coworkers I’ve talked to feel burned out after just a few months of going hard with AI tools. Those same people are the ones working nights and weekends because "I can just have a back-and-forth with Claude while I'm watching a show now".
The likely result is the usual pattern for increases in labor productivity. People who can’t keep up get pushed out, people who can keep up stay stuck grinding, and companies get to claim the increase in productivity while reducing expenses. Steve's suggestion for shorter workdays sound nice in theory, but I would bet significant amounts of money the 40-hour work week remains the standard for a long time to come.
This isn't generally true at all. The "all tech companies are going to 996" meme comes up a lot here but all of the links and anecdotes go back to the same few sources.
It is very true that the tech job market is competitive again after the post-COVID period where virtually nobody was getting fired and jobs were easy to find.
I do not think it's true that the median or even 90th percentile tech job is becoming so overbearing that personal time is disappearing. If you're at a job where they're trying to normalize overwork as something everyone is doing, they're just lying to you to extract more work.
It starts with people who feel they’ve got more to lose (like those supporting a family) working extra to avoid looking like a low performer, whether that fear is reasonable or not. People aren’t perfectly rational, and job-loss anxiety makes them push harder than they otherwise would. Especially now, when "pushing harder" might just mean sending chat messages to claude during your personal time.
Totally anecdotal (strike 1), and I'm at a FAANG which is definitely not the median tech job (strike 2), but it’s become pretty normal for me to come back Monday to a pile of messages sent by peers over the weekend. A couple years ago even that was extremely unusual; even if people were working on the weekend they at least kept up a facade that they weren't.
When people talk about leaving their agents to run overnight, what are those agents actually doing? The limited utility I've had using agent-supported software development requires a significant amount of hand holding, maybe because I'm in an industry with limited externally available examples to build am model off of (though all of the specifications are public, I've yet to see an agent build an appropriate implementation).
So it's much more transactional...I ask, it does something (usually within seconds), I correct, it iterates again...
What sort of tasks are people putting these agents to? How are people running 'multiple' of these agents? What am I missing here?
I might run 3-4 claude sessions because that's the only way to have "multiple chats" to e.g. ask unrelated things. Occasionally a task takes long enough to keep multiple sessions busy, but that's rather rare and if it happens its because the agent runs a long running task like the whole test suite.
The story of running multiple agents to build full features in parallel... doesn't really add up in my experience. It kinda works for a bit if you have a green field project where the complexity is still extremely low.
However once you have a feature interaction matrix that is larger than say 3x3 you have to hand hold the system to not make stupid assumptions. Or you prompt very precisely but this also takes time and prevents you from ever running into the parallel situation.
The feature interaction matrix size is my current proxy "pseudo-metric" for when agentic coding might work well and at which abstraction level.
But so far that doesn't change the reality - I can't find any opportunities to let an agent run for more than 30 minutes at best, and parallel agents just seem to confuse each other.
I came from embedded, where I wasn't able to use agents very effectively for anything other than quick round trip iterative stuff. They were still really useful, but I definitely could never envision just letting an agent run unattended.
But I recently switched domains into vaguely "fullstack web" using very popular frameworks. If I spend a good portion of my day going back and forth with an agent, working on a detailed implementation plan that spawns multiple agents, there is seemingly no limit* to the scope of the work they are able to accurately produce. This is because I'm reading through the whole plan and checking for silly gotchas and larger implementation mistakes before I let them run. It also lets me see which parts of the work can be parallelized, which are blocked on others, and how much can run at once.
Once I'm ready, I can usually let it start with models that aren't even the latest, because the actual implementation is prompted so straightforwardly that it gets things close to perfectly right. I usually sit next to it and validate while it's working, but I could easily imagine someone letting it run overnight to wake up to a fresh PR in the morning.
Don't get me wrong, it's still more work than just "vibing" the whole thing, but it's _so_ much more efficient than actually implementing it myself, especially when it's a lot of repetitive patterns and boilerplate.
* I think the limit is how much I can actually keep in my brain and spec out in a well-thought-out manner that doesn't let any corner cases through, which is still a limit, but not necessarily one coming from the agents. Once I have one document implemented, I can move on to the next with my own fresh mental context, which makes it a lot easier to work.
Hope it helps!
This is definitely a way to keep those who wear Program and Project manager hats busy.
As I build with agents, I frequently run into new issues that aren't in scope for the task I'm on and would cause context drift. I have the agent create a GitHub issue with a short problem description and keep going on the current task. In another terminal I spin up a new agent and just tell it "investigate GH issue 123"; it starts diving in, finds the root cause, and proposes a fix. Depending on what parts of the code the fix touches and what other agents I've got going, I can have 3-4 agents more or less independently closing out issues and creating PRs for review at a time. The agents log their work in a work log (what they did, what worked and what didn't, problems they encountered using tools), and about once a day I have an agent review the work log and update the AGENTS.md with lessons learned.
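For the curious, a minimal sketch of that fan-out, assuming the `gh` CLI is authenticated and that your agent CLI takes a headless prompt flag (Claude Code's -p does); the "agent-triage" label is made up for illustration:

    import json
    import subprocess

    # Pull the open issues the agents should triage (label is illustrative).
    issues = json.loads(subprocess.check_output(
        ["gh", "issue", "list", "--label", "agent-triage",
         "--state", "open", "--json", "number,title"]))

    procs = []
    for issue in issues[:4]:  # cap at 3-4 parallel agents, as above
        prompt = (f"Investigate GH issue {issue['number']} ({issue['title']}). "
                  "Find the root cause, propose a fix, and log what you did, "
                  "what worked, and what didn't to WORKLOG.md.")
        procs.append(subprocess.Popen(["claude", "-p", prompt]))

    for p in procs:
        p.wait()  # block until every agent has finished its issue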
If you have a loop set up, e.g., using OpenClaw or a Ralph loop, you can stretch that out further.
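A Ralph loop, at its simplest, is just re-running the agent against the same standing prompt until the work is done; each pass builds on the previous pass's commits. A rough sketch, again assuming a headless claude -p, with a placeholder stopping condition:

    import pathlib
    import subprocess

    prompt = pathlib.Path("PROMPT.md").read_text()

    while True:
        # Fresh agent, same prompt; the repo state carries the progress.
        subprocess.run(["claude", "-p", prompt], check=True)
        # Placeholder: a real loop breaks on a completion marker,
        # green tests, or an iteration cap.
        if pathlib.Path("DONE").exists():
            break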
I would suggest that when you get to that point, you really want some kind of adversarial system set up, with code reviews (e.g., provided by CodeRabbit or Sourcery) and automation to feed that feedback back into the coding agent.
Providing material for attention-grabbing headlines and blog posts, primarily. Can't (in good conscience, at least) claim you had an agent running all night if you didn't actually run an agent all night.
Is it possible? Yes. I've had success having a model output a 100-step plan that tried to deconflict work among multiple agents. Without re-creating 'Gas town', I could not get the agents to operate without stepping on each other's toes. With _me_ as the grand coordinator, I was able to replicate a SaaS product (at a surface level) in about 24 hours. Output was around 100k lines of code (not counting css/js).
Who can prove that it works correctly, though? An AI enthusiast will say "as long as you've got test coverage blah blah blah". Those who have worked on large-scale products know that tests passing is the bare minimum. So you smoke test it, hope you've covered all the paths, toss it up, and try to collect money from people? I don't know. If _this_ is the future, it will collapse under the weight of garbage code, security and privacy breaches, and who knows what else.
He is building trading automation for personal use. In his design, he gets a message on WhatsApp/Signal/Telegram and approves or rejects the trade suggestion.
To define specifications for this, he defined multiple agents (a quant, a data scientist, a principal engineer, and trading experts: "Warren Buffett", "Ray Dalio") and let them run until they reached consensus on what the design should be. He said this ran for a couple of hours after he went to sleep (so not strictly overnight); in the morning he read and amended the output (tens of pages' worth) and let it build.
This is not a strictly-defined coding task, but there are now many examples of emerging patterns where you have multiple agents supporting each other, running tasks in parallel, correcting/criticising/challenging each other, until some definition of “done” has been satisfied.
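A sketch of what that consensus loop might look like. To be clear, call_model is a hypothetical stand-in for whatever provider API you use, and the personas and stopping rule here are illustrative, not his actual setup:

    # call_model is a hypothetical stub; wire it to your provider's API.
    def call_model(prompt: str) -> str:
        raise NotImplementedError

    PERSONAS = ["a quant", "a data scientist", "a principal engineer",
                "Warren Buffett", "Ray Dalio"]

    design = "Trade suggestions sent via messaging; user approves/rejects."
    for _ in range(20):  # hard cap so the panel can't argue forever
        critiques = [call_model(f"As {p}, critique this design:\n{design}")
                     for p in PERSONAS]
        if all("no objections" in c.lower() for c in critiques):
            break  # crude consensus test; "done" is whatever you define
        design = call_model("Revise the design to address these critiques:\n"
                            + "\n---\n".join(critiques)
                            + "\n\nDesign:\n" + design)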
That said, personally my usage is much like yours - I run agents one at a time and closely monitor output before proceeding, to avoid finding a clusterfuck of bad choices built on top of each other. So you are not alone my friend :-)
I can see the utility in creating very simple web-based tools where there's a monstrous wealth of public resources to train a model on, but even the most recent models from Anthropic, OpenAI, or Microsoft seem prone to falling just short of perfection. And every time I find an error I'm left wondering what other bugs I'm not catching.
Do some people just create complete SaaSlop apps with it overnight? Of course, just put together a plan (by asking the LLM to write the plan) with everything you want the app to do and let it run.
Wouldn't it be better to set up API docs (Postman, RapidAPI, …), extract an OpenAPI spec from that, and then use a generator for your language of choice (NSwag, …)?
You can draw the line wherever you want. :) Personally, I wish I'd built a new gaming rig a year ago so I could mess with local models and pay all these same costs.
I think of my agents like golems from Discworld: they are defined by their script. Adding texture to them improves the results, so I usually keep a running tally of what they have worked on and add that to the header. Each is a prompt in a folder that a script loops over and sends to Gemini (spawning an agent and moving on to the next golem script).
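Something like this, presumably; a sketch assuming the Gemini CLI's -p prompt flag, with a guessed-at folder layout:

    import pathlib
    import subprocess

    # Walk the golem folder and hand each prompt file to an agent,
    # spawning it and moving on without waiting (fire and forget).
    for golem in sorted(pathlib.Path("golems").glob("*.md")):
        subprocess.Popen(["gemini", "-p", golem.read_text()])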
I was also curious to see if it could be used for developing some small games. Whenever I ran into a problem I couldn't be bothered to solve, or needed a variety of something, I would let a few LLMs work on it so that in the morning I had something to bounce off of. I had pretty good success with this for RTS games and shooting games, where variety is well documented and creativity is allowed. I imagine there could be a use here; I've been calling it dredging, because I imagine myself casting a net down into the slop to find valuables.
I did have an idea where all my sites and UIs would be checked against some UI heuristic, like Oregon State's inclusivity heuristic, but results have been mixed so far. The initial reports are fine and the implementation plans are okay, but the examine-fix-examine loop seems to have too much drift. That does seem solvable, but I have a concern that this is like two lines that get ever closer as you approach infinity without ever touching.
There is some usefulness in running these agents all night, but I'm still figuring out when it's useful and when it's a waste of resources.
In my case, I built a small API that Claude can call to get tasks. I update the tasks from my phone.
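For flavor, a minimal sketch of what such a task API could look like (Flask here; the endpoints and fields are made up for illustration, not the actual setup):

    from flask import Flask, jsonify, request

    app = Flask(__name__)
    tasks = []  # in-memory for the sketch; a real one would persist

    @app.post("/tasks")            # new tasks come in from the phone
    def add_task():
        tasks.append({"text": request.json["text"], "status": "open"})
        return jsonify(ok=True)

    @app.get("/tasks/next")        # the agent polls this for work
    def next_task():
        open_tasks = [t for t in tasks if t["status"] == "open"]
        return jsonify(open_tasks[0] if open_tasks else {})

    if __name__ == "__main__":
        app.run(port=8080)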
The assumption is that you have a semi-well-structured codebase already (ours is 1M LOC of C#). You have to use languages with strong typing and a strict compiler. You have to force Claude to build the code frequently (hence the CPU cores + RAM + NVMe requirement).
If you have multiple machines doing work, make a single one the master and give Claude SSH access to the others; it can configure them and invoke work on them directly. The use case for this is when you have a beefy Proxmox server with many smaller containers (think .NET + Debian). Give the main server access to all the "worker servers", and let Claude document this infrastructure too, including the different roles each machine plays. Soon you will have a small ranch of AIs doing different things, on different branches, making pull requests and putting feedback back into the task manager for you to upvote or downvote.
Just try it. It works. Your mind will be blown what is possible.
Generate material for yet another inane twitter hype post.
It's more like being hooked on a slot machine which pays out 95% of the time because you know how to trick it.
(I saw "no actual evidence pointing to these improvements" with a footnote and didn't even need to click that footnote to know it was the METR thing. I wish AI holdouts would find a few more studies.)
Steve Yegge of all people published something the other day that has similar conclusions to this piece - that the productivity boost for coding agents can lead to burnout, especially if companies use it to drive their employees to work in unsustainable ways: https://steve-yegge.medium.com/the-ai-vampire-eda6e4f07163
Yeah I really feel that!
I recently learned the term "cognitive debt" for this from https://margaretstorey.com/blog/2026/02/09/cognitive-debt/ and I think it's a great way to capture this effect.
I can churn out features faster, but that means I don't get time to fully absorb each feature and think through its consequences and relationships to other existing or future features.
But from what I've seen validating both my own and others' coding agent outputs, I'd estimate a much lower percentage (Data Engineering/Science work). And, oh boy, some colleagues are hooked on generating no matter the quality. Workslop is a very real phenomenon.
I was really impressed with how it parsed the structured checklist. I was not at all impressed by how it digested the paper. Lots of disguised errors.
“It’s not like a slot machine, it’s like… a slot machine… that I feel good using”
That aside, if a slot machine is doing your job correctly 95% of the time, it seems like either you aren't noticing when it's doing your job poorly, or you've shifted the way you work to only allow yourself work that the slot machine is good at.
There's also this article on hbr.org https://hbr.org/2026/02/ai-doesnt-reduce-work-it-intensifies...
This is a real thing, and it looks like classic addiction.
I think you are mistaken on what the "payout" is. There's only one reason someone is working all hours and during a party and whatnot: it's to become rich and powerful. The payout is not "more code", it's a big house, fast cars, beautiful women etc. Nobody can trick it into paying out even 1% of the time, let alone 95%.
Claude Code wasting my time with nonsense output one in twenty times seems roughly correct. The rest of the time it's hitting jackpots.
Right, but the <100% chance is actually why slot machines are addictive. If it paid out every time, the behaviour would not persist as long. It's called the partial reinforcement extinction effect.
If you are unfamiliar with the various ways that naive code would fail in production, you could be fooled into thinking generated code is all you need.
If you try to hold the hand of the coding agents to bring code to a point where it is production ready, be prepared for a frustrating cycle of models responding with ‘Fixed it!’ while only having introduced further issues.
My paraphrase of their caveats:
- experts working on their own open source projects are not representative of most software dev
- measuring time undervalues trading time for effort
- tools are noticeably better than they were a year ago when the study was conducted
- it really does take months of use to get the hang of it (or did then, less so now)
Before you respond to these points, please look at the full study’s treatment of the caveats! It’s fantastic, and it’s clear almost no one citing the study actually read it.
[0]: https://metr.org/blog/2025-07-10-early-2025-ai-experienced-o...
Not everyone gets hooked on those, but I do. I've played a bunch of those long-winded idle games, and it looked like a mild addiction. I would get impatient that it took so long to progress, and it added anxiety, e.g. running it during breaks at work, or just before going to sleep. "Just one more click."
And to be perfectly honest, it seems like Anthropic's artificial limits (the 5-hour session windows) tap into a similar mechanism. I've spent less time on non-programming hobbies since I got myself a subscription.
Tests, linting, guidance in response to key events (Claude Code hooks are great for this), automatically passing the agent's plan to another model invocation and feeding that model's critique back so you don't have to point out the same flaws over and over... custom scripts that scan your codebase for antipatterns (they can walk the AST or be regex-based; ask your agent to write them!).
Codify everything you’re looping back to your agent about and make it a guardrail. Give your agent the tools it needs to give itself grounding.
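As one concrete example of such a guardrail: a tiny AST walker that flags bare except clauses and exits nonzero, so a hook or outer loop can feed the findings straight back to the agent. The antipattern checked is just for illustration:

    import ast
    import pathlib
    import sys

    def check(path):
        """Flag bare `except:` clauses in one file."""
        tree = ast.parse(path.read_text())
        return [f"{path}:{node.lineno}: bare except clause"
                for node in ast.walk(tree)
                if isinstance(node, ast.ExceptHandler) and node.type is None]

    findings = [f for p in pathlib.Path(".").rglob("*.py") for f in check(p)]
    print("\n".join(findings))
    sys.exit(1 if findings else 0)  # nonzero exit is what gets looped on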
An agent without guardrails or grounding is like a person cut off from their senses: disconnected from the world, all you do is dream, and in a dream anything can happen; there's nothing to ensure realism. When you look at it that way, it's a miracle coding agents produce anything useful at all :)