Please do not A/B test my workflow

The framing of A/B testing as a "silent experimentation on users" and invoking Meta is a little much. I don't believe A/B testing is an inherent evil, you need to get the test design right, and that would be better framing for the post imo. That being said, vastly reducing an LLM's effectiveness as part of an A/B test isn't acceptable, which appears to be the case here.

SlinkyOnStairs · 10 hours ago

> I don't believe A/B testing is an inherent evil, you need to get the test design right, and that would be better framing for the post imo.

I disagree in the case of LLMs.

AI already has a massive problem in reproducibility and reliability, and AI firms gleefully kick this problem down to the users. "Never trust its output".

It's already enough of a pain in the ass to constrain these systems without the companies silently changing things around.

And this also pretty much ruins any attempt to research Claude Code's long term effectiveness in an organisation. Any negative result can now be thrown straight into the trash because of the chance Anthropic put you on the wrong side of an A/B test.

> That being said, vastly reducing an LLM's effectiveness as part of an A/B test isn't acceptable, which appears to be the case here.

The open question here is whether or not they were doing similar things to their other products. Claude Code shitting out a bad function is annoying but should be caught in review.

People use LLMs for things like hiring. An undeclared A/B test there would be ethically horrendous and a legal nightmare for the client.

DoctorOetker · 11 minutes ago

Would you have a problem with the following scheme?

Every client is free and encouraged to feed back its financial health: profit for that hour/day/month/...

The AB(-X) test run by the LLM provider uses the correlation of a client's profit with its AB(-X) test, so that participating with the testing improves your profit statistically speaking (sometimes up sometimes down, but on average up).

You may say, what about that hiring decision? One thing is certain: when companies make more profit they are more likely to seek and accept more employees.

londons_explore · 10 hours ago

I think you would be hard pushed to find any big tech company which doesn't do some kind of A/B testing. It's pretty much required if you want to build a great product.

steve-atx-7600 · 10 hours ago

Long term effectiveness? LLMs are such a fast moving target. Suppose Anthropic reached out to you and gave you a model id you could pin down for the next year to freeze any A/B tests. Would you really want that? Next month a new model could be released to everyone else - or by a competitor - that’s a big step difference in performance in tasks you care about. You’d rather be on your own path learning about a state of the world that doesn’t exist anymore? Nov-ish 2025 and after, for example, seemed like software engineering changed forever because of improvements in Opus.

garciasn · 10 hours ago

> And this also pretty much ruins any attempt to research Claude Code's long term effectiveness in an organisation. Any negative result can now be thrown straight into the trash because of the chance Anthropic put you on the wrong side of an A/B test.

LLMs are non-deterministic anyway, as you note above with your comment on the 'reproducibility' issue. So any sort of research into CC's long-term effectiveness would already have taken into account that you can run it 15x in a row and get a different response every time.

johnisgood · 10 hours ago

Then do not use LLMs for hiring, or use a specific LLM, or self-host your own!

sfn42 · 6 hours ago

Anyone who trusts LLMs to do anything has shit coming. You can not trust them. If you do, that's on you. I don't care if you want to trust it to manage hiring, you can't. If you do anyway then the ethical problems are squarely on you.

People keep complaining about LLMs taking jobs, meanwhile others complain that they can't take their jobs and here I am just using them as a useful tool more powerful than a simple search engine and it's great. No chance it'll replace me, but it sure helps me do my job better and faster.

airza · 10 hours ago

Isn’t the horrendous ethical and legal decision delegating your hiring process to a black box?

raw_anon_1111 · 10 hours ago

Would you rather they change things for everyone at once without testing?

simianwords · 9 hours ago

Strange! You benefitted from all the previous a/b experiments to give you a somewhat optimal model now. But now it’s too inconvenient for you?

hollow-moe · 10 hours ago

Tech companies really have issues with "informed and conscious consent", don't they

ramoz · 11 hours ago

I apologize for doing this - and I agree. I will revise

s3p · 8 hours ago

I still think you have a point here. Doing this kind of testing on users unwittingly is unethical in my opinion

everdrive · 10 hours ago

>I don't believe A/B testing is an inherent evil,

Evil might be a stretch, but I really hate A/B testing. Some feature or UI component you relied on is now different, with no warning, and you ask a coworker about it, and they have no idea what you're talking about.

Usually, the change is for the worse, but gets implemented anyway. I'm sure the teams responsible have "objective" "data" which "proves" it's the right direction, but the reality of it is often the opposite.

cosmic_cheese · 9 hours ago

> I'm sure the teams responsible have "objective" "data" which "proves" it's the right direction, but the reality of it is often the opposite.

In my experience all manner of analytics data frequently gets misused to support whatever narrative the product manager wants it to support.

With enough massaging you can make “objective” numbers say anything, especially if you do underhanded things like bury a previously popular feature three modals deep or put it behind a flag. “Oh would you look at that, nobody uses this feature any more! Must be safe to remove it.”

xg15 · 3 hours ago

> The framing of A/B testing as a "silent experimentation on users"

Sorry, but how is A/B testing not exactly that? The experiments may be on non-disruptive things like button color, but they're experiments no less.

The users are also rarely informed about the experiment taking place, let alone on the motivation or evaluation criteria.

mschuster91 · 10 hours ago

> The framing of A/B testing as a "silent experimentation on users" and invoking Meta is a little much.

No. Users aren't free test guinea pigs. A/B testing cannot be done ethically unless you actively point out to users that they are being A/B tested and offer them a way to opt out, but that in turn ruins a large part of the promise behind A/B tests.

bcrl · 5 hours ago

Please name a computer science program that has an ethics component.

Yes, I wish software developers were more like actual engineers in this regard.

saltcured · 5 hours ago

Yeah, and if you don't already have an IRB, your organization probably isn't ready to be doing such things responsibly...

tomalbrc · 11 hours ago

Would love to know why you would consider invoking Meta “a little much”. Sounds more than appropriate.

krisbolton · 10 hours ago

Not to start an internet argument -- I don't think it is appropriate in this context. A/B testing the features of a web app is not unexpected or unethical. So invoking the memory of cambridge analytica (etc) is disproportionate. It's far more legitimate to just discuss how much A/B testing should negatively affect a user. I don't have an answer and it's an interesting and relevant question.

cyanydeez · 8 hours ago

Relying on a paid service for anything significant is basically accepting the Company Store feudal serfdom.

Enshittification is coming for AI.

A professional tool is something that provides reliable and replicable results, LLMs offer none of this, and A/B testing is just further proof.

onion2k · 11 hours ago

> A professional tool is something that provides reliable and replicable results, LLMs offer none of this, and A/B testing is just further proof.

The author's complaint doesn't really have anything to do with the LLM aspect of it though. They're complaining that the app silently changes what it's doing. In this case it's the injection of a prompt in a specific mode, but it could be anything really. Companies could use A/B tests on users to make Photoshop silently change the hue a user selects to be a little brighter, or Word could change the look of document titles, or a game could make enemies a bit stronger (fyi, this does actually happen - players get boosts on their first few rounds in online games to stop them being put off playing).

The complaint is about A/B tests with no visible warnings, not AI.

reconnecting · 11 hours ago

There's a distinction worth making here. A/B testing the interface button placement, hue of a UI element, title styling — is one thing. But you wouldn't accept Photoshop silently changing your #000000 to #333333 in the actual file. That's your output, not the UI around it. That's what LLMs do. The randomness isn't in the wrapper, it's in the result you take away.

duskdozer · 11 hours ago

Honestly I find it kind of surprising that anyone finds this surprising. This is standard practice for proprietary software. LLMs are very much not replicable anyway.

dkersten · 11 hours ago

Anthropic have done a lot of things that would give me pause about trusting them in a professional context. They are anything but transparent, for example about the quota limits. Their vibe coded Claude code cli releases are a buggy mess too. Also the model quality inconsistency: before a new model release, there’s a week or two where their previous model is garbage.

A/B testing is fine in itself, you need to learn about improvements somehow, but this seems to be A/B testing cost saving optimisations rather than to provide the user with a better experience. Less transparency is rarely good.

This isn’t what I want from a professional tool. For business, we need consistency and reliability.

r_lee · 10 hours ago

> vibe coded Claude code cli releases are a buggy mess too

this is what gets me.

are they out of money? are they so desperate to penny pinch that they can't just do it properly?

what's going on in this industry?

ordersofmag · 11 hours ago

Any tool that auto-updates carries the implication that behavior will change over time. And one criterion for being a skilled professional is having expert understanding of one's tools. That includes understanding the strengths and weaknesses of the tools (including variability of output) and making appropriate choices as a result. If you don't feel you can produce professional code with LLMs then certainly you shouldn't use them. That doesn't mean others can't leverage LLMs as part of their process and produce professional results. Blindly accepting LLM output and vibe coding clearly doesn't consistently produce professional results. But that's different than saying professionals can't use LLMs in ways that are productive.

johnisgood · 11 hours ago

Well put. I would upvote this many times if I could.

hrmtst93837 · 11 hours ago

Replicability is a spectrum, not a binary, and if you bake in enough eval harnessing plus prompt control you can get LLMs shockingly close to deterministic for a lot of workloads. If the main blocker for "professional" use was unpredictability, the entire finance sector would have shut down years ago from half the data models and APIs they limp along on daily.
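To make "replicability is a spectrum" concrete, here is a minimal sketch of one measurement an eval harness might take: the fraction of repeated runs that agree on the most common output. The `stub_model` function is a hypothetical stand-in for a real model call, not any vendor's API, and the prompt is made up for illustration.

```python
from collections import Counter
from typing import Callable


def agreement_rate(model: Callable[[str, int], str], prompt: str, runs: int) -> float:
    """Return the fraction of runs that produce the most common output."""
    outputs = [model(prompt, seed) for seed in range(runs)]
    most_common_count = Counter(outputs).most_common(1)[0][1]
    return most_common_count / runs


def stub_model(prompt: str, seed: int) -> str:
    # Hypothetical stand-in for an LLM call: every fourth run diverges.
    return "answer-B" if seed % 4 == 3 else "answer-A"


rate = agreement_rate(stub_model, "rename these invoices", runs=20)
print(f"agreement rate: {rate:.2f}")
```

A rate of 1.0 would mean fully replicable output for that prompt; tracking the number over time is one way to notice when a provider changes something underneath you.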

Mtinie · 10 hours ago

What would you do differently if LLM outputs were deterministic?

Perhaps I approach this from a different perspective than you do, so I’m interested to understand other viewpoints.

I review everything that my models produce the same way I review work from my coworkers: Trust but verify.

WillAdams · 11 hours ago

Yeah, I've been using Copilot to process scans of invoices and checks (w/ a pen laid across the account information) converted to a PDF 20 at a time and it's pretty rare for it to get all 20, but it's sufficiently faster than opening them up in batches of 50 and re-saving using the Invoice ID and then using a .bat file to rename them (and remembering to quit Adobe Acrobat after each batch so that I don't run into the bug in it where it stops saving files after a couple of hundred have been so opened and re-saved).

danielbln · 11 hours ago

I don't get your point. Web tools have been doing A/B feature testing all the time, way before we had LLMs.

reconnecting · 11 hours ago

This is very different from the A/B interface testing you're referring to. What LLMs enable is A/B testing the tool's own output: same input, different result.

Your compiler doesn't do that. Your keyboard doesn't do that. The randomness is inside the tool itself, not around it. That's a fundamental reliability problem for any professional context where you need to know that input X produces output X, every time.

freeone3000 · 11 hours ago

Yes! And it was bad then too!!

I want software that does a specific list of things, doesn’t change, and preferentially costs a known amount.

_heimdall · 11 hours ago

LLMs are nondeterministic by design, but that has nothing to do with A/B testing.

croes · 8 hours ago

That’s not a problem of LLMs but of using services provided by others.

How often were features changed or deactivated by cloud services?

NotGMan · 11 hours ago

By that definition humans are not professional since we hallucinate and make mistakes all the time.

krisbolton · 11 hours ago

chrislloyd · 8 hours ago

Hi, this was my test! The plan-mode prompt has been largely unchanged since the 3.x series models and now 4.x models are able to be successful with far less direction. My hypothesis was that shortening the plan would decrease rate-limit hits while helping people still achieve similar outcomes. I ran a few variants, with the author (and a few thousand others) getting the most aggressive, limiting the plan to 40 lines. Early results aren't showing much impact on rate limits so I've ended the experiment.

Planning serves two purposes - helping the model stay on track and helping the user gain confidence in what the model is about to do. Both sides of that are fuzzy, complex and non-obvious!
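For readers unfamiliar with how users typically land in a variant, here is a minimal sketch of deterministic hash bucketing with an opt-out flag. This is a common general technique, not a description of how Anthropic assigns variants; the variant names, the experiment name, and the `opted_out` parameter are all assumptions for illustration.

```python
import hashlib

# Hypothetical variant names; only the 40-line cap is mentioned in the thread.
VARIANTS = ["control", "plan_cap_80", "plan_cap_40"]


def assign_variant(user_id: str, experiment: str, opted_out: bool = False) -> str:
    """Map a user to a stable variant; opted-out users always get control."""
    if opted_out:
        return "control"
    # Hashing experiment and user together keeps assignment stable across
    # sessions and independent between experiments.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(VARIANTS)
    return VARIANTS[bucket]


print(assign_variant("user-123", "plan-length"))
print(assign_variant("user-123", "plan-length", opted_out=True))
```

Because the assignment depends only on the hash, the same user sees the same behavior every session, which is also why a coworker in a different bucket may have no idea what you are describing.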

nextzck · 2 hours ago

The 40-line cap not moving rate limits makes sense - plan text is cheap. The cost is in Phase 1 exploration.

Plan mode spins up to 3 explore subagents before the planner even starts, and the heuristic is "use multiple when scope is uncertain." It won't choose fewer - it's being asked to plan, so scope is always uncertain. Nothing penalizes claude for over-exploring and nothing rewards restraint.

Plan mode also ignores session state. A cold start gets the same fanout as a warm session where the relevant files are already in context. In a warm session the explore pass is pure waste - it re-reads loaded files and feeds the planner lossy summaries that conflict with what it already knows.

More tokens, worse plan.

If exploration were conditional on what's already in context (skip it for warm sessions, keep it for cold starts), that would do more for both rate limits and plan quality than a hard 40-line cap.

Note: plan mode didn’t always have this 3 subagent fan out behavior attached to it, it was introduced around opus 4.6 launch.
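The conditional-exploration idea above could be sketched roughly as follows. This is purely illustrative: the `Session` type, the `explore_agents_needed` function, and the assumption that the set of relevant files is known up front are all hypothetical, and nothing here reflects Claude Code's actual internals.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    # Files already read into context during this session (hypothetical model).
    loaded_files: set[str] = field(default_factory=set)


def explore_agents_needed(session: Session, relevant_files: set[str],
                          max_agents: int = 3) -> int:
    """Return how many explore subagents to spawn, given what is already loaded."""
    missing = relevant_files - session.loaded_files
    if not missing:
        return 0  # warm session: skip exploration entirely
    # Cold or partially warm session: scale fanout with what is missing.
    return min(max_agents, len(missing))


cold = Session()
warm = Session(loaded_files={"api.py", "models.py"})
print(explore_agents_needed(cold, {"api.py", "models.py", "db.py", "cli.py"}))
print(explore_agents_needed(warm, {"api.py", "models.py"}))
```

The cap of 3 mirrors the "up to 3 explore subagents" figure from the comment; everything else is a guess at what "conditional on context" might mean in practice.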

BAM-DevCrew · 7 hours ago

As a divergent thinker with extensive hard constraints in claude.mds and on-boarding commands that force claude to internalize my constraints, that you or some other employee of Anthropic could randomly select me for testing is horrifying. Each unexpected behavior and my corresponding reaction to it can wipe me out, my brain out, completely for hours, days, even weeks. I have in the last year spent hundreds of hours (estimating around 400) establishing and reestablishing a system to protect myself from psychological harm and financial harm. It is twisted that you Anthropic employees do not consider the impact your work has on divergent thinking Claude users, let alone that real work is severely impacted by your work. Totally irresponsible. Offensively so.

shepherdjerred · 5 hours ago

What?

Even without Anthropic's experimentation, anything in the context is completely probabilistic.

You cannot rely on it no matter how/how much you prompt the model

PufPufPuf · 5 hours ago

I can't tell whether something is satire anymore.

bartread · an hour ago

I don't mind you testing stuff out - it's the only sensible way to make the app better - but you need to give people choices to switch to different behaviours if the behaviour you're testing on them isn't working out well for them.

In other news, Claude Code login is down, so if you have time it would be sensible to prioritise fixing that:

Authorization failed Redirect URI http:/localhost:53025/callback is not supported by client.

MacOS Sequoia, VS Code 1.111.0, Firefox 147.0.4 (although also fails on Chrome 145.0.7632.160).

This just started happening as of this evening. I've tried restarting everything, and it doesn't help.

okwhateverdude · 6 hours ago

How can we opt out of these tests? The behavior foibles I've been experiencing over the past month might be directly attributable to these experiments! It can be extremely frustrating. I don't want to be in the beta channel. Please change this to be opt-in.

ramoz · 8 hours ago

Thanks for the transparency. Sorry for the noise.

I think I'd be okay with a smaller, more narrative-detailed plan - not so much about verbosity, more about me understanding what is about to happen & why. There hadn't been much discourse once planning mode entered (ie QA). It would jump into its own planning and idle until I saw only a set of projected code changes.

oakwhiz · 3 hours ago

Shouldn't you be giving people their tokens back when you used their tokens to test on their environment?

rusakov-field · 11 hours ago

On one side I am frustrated with LLMs because they derail you by throwing grammatically correct bullshit and hallucinations at you, where if you slip and entertain some of it momentarily it might slow you down.

But on the other hand they are so useful with boilerplate and connecting you with verbiage quickly that might guide you to the correct path quicker than conventional means. Like a clueless CEO type just spitballing terms they do not understand but still nudging something in your thought process.

But you REALLY need to know your stuff to begin with for them to be of any use. Those who think they will take over are clueless.

qazxcvbnmlp · 10 hours ago

One of the main skills of using the llm well is knowing the difference between useful output and ai slop.

Mc_Big_G · 10 hours ago

>Those who think they will take over are clueless.

You're underestimating where it's headed.

rusakov-field · 10 hours ago

Do you think it will reach "understanding of semantics", true cognition, within our lifetimes? Or performance indistinguishable from that even if not truly that.

Not sure. I am not so optimistic. People got intoxicated with nuclear powered cars, flying cars, bases on the moon, etc., all that technological euphoria from the 50's and 60's that never panned out. This might be like that.

I think we definitely stumbled on something akin to the circuitry in the brain responsible for building language or similar to it. We still have a long way to go until artificial cognition.

EMM_386 · 11 hours ago

> But you REALLY need to know your stuff to begin with for them to be of any use. Those who think they will take over are clueless.

Or - there are enough people who know their stuff that the people who don't will be replaced and they will take over anyway.

risyachka · 11 hours ago

> there are enough people who know their stuff

unless the bar for "know their stuff" is very very low - this is not the case in the nearest future

gnfargbl · 10 hours ago

For anyone else wondering why the article ends in a non-sequitur: it looks like the author wrote about decompiling the Claude Code binaries and (presumably) discovering A/B testing paths in the code.

HN user onion2k pointed out that doing this breaks Anthropic's T&Cs: https://news.ycombinator.com/item?id=47375787

vova_hn2 · 9 hours ago

Two thoughts:

1. Open source tools solve the problem of "critical functions of the application changing without notice, or being signed up for disruptive testing without opt-in".

2. This makes me afraid that it is absolutely impossible for open source tools to ever reach the level of proprietary tools like Claude Code precisely because they cannot do A/B tests like this which means that their design decisions are usually informed by intuition and personal experience but not by hard data collected at scale.

dijit · 9 hours ago

Regarding point 1 specifically, there were so many people seriously miffed at the “man after midnight”[0] time-based easter egg that I would be careful with that reasoning.

Open source doesn’t always mean reproducible.

People don’t enjoy the thought of auditing code… someone else will do it; and it's made somewhat worse with our penchant to pull in half the universe as dependencies (Rust, Go and Javascript tend to lean in this direction to various extremes). But auditing would be necessary in order for your first point here to be as valid as you present.

[0]: https://gitlab.com/man-db/man-db/-/commit/002a6339b1fe8f83f4...

> People don’t enjoy the thought of auditing code… someone else will do it

I think that with modern LLMs auditing a big project personally, instead of relying on someone else to do it, actually became more realistic.

You can ask an LLM to walk you through the code, highlight parts that seem unusual or suspicious, etc.

On the other hand, LLMs also made producing code cheaper than ever, so you can argue that big projects will just become even bigger, which will put them out of reach even for a reviewer who is also armed with an LLM.

BiteCode_dev · 5 hours ago

Let's A/B test the linux kernel, for shits and giggles.

alpaca128 · 9 hours ago

A/B test doesn't necessarily imply improvements for the user. It could be testing of future enshittification methods. See YouTube for an example.

bushido · 10 hours ago

I have no issues with A/B tests.

I do have an issue with the plan mode. And nine out of ten times, it is objectively terrible. The only benefit I've seen in the past from using plan mode is it remembers more information between compactions as compared to the vanilla - non-agent team workflow.

Interestingly, though, if you ask it to maintain a running document of what you're discussing in a markdown file and make it create an evergreen task at the top of its todo list which references the markdown file and instructs itself to read it on every compaction, you get much better results.

mikkupikku · 10 hours ago

Huh, very much not my experience with plan mode. I use plan mode before almost anything more than truly trivial task because I've found it to be far more efficient. I want a chance to see and discuss what claude is planning to do before it races off and does the thing, because there are often different approaches and I only sometimes agree with the approach claude would decide on by itself.

Planning is great. It's plan mode that is unpredictable in how it discusses it and what it remembers from the discussion.

I still have discussions with the agents and agent team members. I just force it to save it in a document in the repo itself and refer back to the document. You can still do the nice parts of clearing context, which is available with plan mode, but you get much better control.

At all times, I make the agents work on my workflow, not try and create their own. This comes with a whole lot of trial and error, and real-life experience.

There are times when you need a tiger team made up of seniors. And others when you want to give an overzealous mid-level engineer who's fast a concrete plan to execute an important feature in a short amount of time.

I'm putting it in non-AI terms because what happens in real life pre-AI is very much what we need to replicate with AI to get the best results. Something which I would have given a bigger team to be done over two to eight sprints will get a different workflow with agent teams or agents than something which I would give a smaller tiger team or a single engineer.

They all need a plan. For me plan mode is insufficient 90% of the time.

I can appreciate that many people will not want to mess around with workflows as much as I enjoy doing.

andrewaylett · 10 hours ago

> on every compaction

I've only hit the compaction limit a handful of times, and my experience degraded enough that I work quite hard to not hit it again.

One thing I like about the current implementation of plan mode is that it'll clear context -- so if I complete a plan, I can use that context to write the next plan without growing context without bound.

samdjstephens · 10 hours ago

I really like this too - having the previous plan and implementation in place to create the next plan, but then clearing context once that next plan exists feels like a great way to have exactly the right context at the right time.

I often do follow ups, that would have been short message replies before, as plans, just so I can clear context once it’s ready. I’m hitting the context limit much less often now too.

Agreed. The only time I don't clear context after a plan has been agreed on is when I'm doing a long series of relatively small but very related changes, such as back-and-forth tweaking when I don't yet know what I really want the final result to be until I've tried stuff out. In those cases, it has very rarely been useful to compact the context, but usually I don't get close.

Apparently the blog stripped the decompilation details for ToS reasons, which sucks because those are exactly the hack-y bits that make this interesting for HN.

> It told me it was following specific system instructions to hard-cap plans at 40 lines, forbid context sections, and “delete prose, not file paths.”

Yeah, would be nice to be able to view and modify these instructions.