The framing of A/B testing as "silent experimentation on users" and invoking Meta is a little much. I don't believe A/B testing is an inherent evil; you need to get the test design right, and that would be better framing for the post, imo. That being said, vastly reducing an LLM's effectiveness as part of an A/B test isn't acceptable, which appears to be the case here.
> I don't believe A/B testing is an inherent evil; you need to get the test design right, and that would be better framing for the post, imo.
I disagree in the case of LLMs.
AI already has a massive problem with reproducibility and reliability, and AI firms gleefully kick this problem down to the users. "Never trust its output".
It's already enough of a pain in the ass to constrain these systems without the companies silently changing things around.
And this also pretty much ruins any attempt to research Claude Code's long term effectiveness in an organisation. Any negative result can now be thrown straight into the trash because of the chance Anthropic put you on the wrong side of an A/B test.
> That being said, vastly reducing an LLM's effectiveness as part of an A/B test isn't acceptable, which appears to be the case here.
The open question here is whether or not they were doing similar things with their other products. Claude Code shitting out a bad function is annoying, but should be caught in review.
People use LLMs for things like hiring. An undeclared A/B test there would be ethically horrendous and a legal nightmare for the client.
Anyone who trusts LLMs to do anything has shit coming. You cannot trust them. If you do, that's on you. I don't care if you want to trust it to manage hiring, you can't. If you do anyway, then the ethical problems are squarely on you.
People keep complaining about LLMs taking jobs, meanwhile others complain that they can't take their jobs, and here I am just using them as a useful tool more powerful than a simple search engine, and it's great. No chance it'll replace me, but it sure helps me do my job better and faster.
Would you have a problem with the following scheme?
Every client is free (and encouraged) to feed back its financial health: profit for that hour/day/month/...
The A/B(-X) test run by the LLM provider uses the correlation between a client's profit and its test arm, so that participating in the testing improves your profit, statistically speaking (sometimes up, sometimes down, but up on average).
You may say: what about that hiring decision? One thing is certain: when companies make more profit, they are more likely to seek and accept more employees.
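Concretely, what I have in mind is something like the following difference-in-means estimate. This is a minimal sketch; every name in it is illustrative, not any real provider API.

```python
# Minimal sketch of the described scheme: clients feed back profit figures,
# the provider groups them by test arm and promotes the arm with higher mean.
# All names here are illustrative assumptions, not a real API.
from statistics import mean

def estimate_lift(reports):
    """reports: list of (arm, profit_delta) pairs fed back by clients."""
    by_arm = {}
    for arm, profit in reports:
        by_arm.setdefault(arm, []).append(profit)
    return {arm: mean(profits) for arm, profits in by_arm.items()}

# Hypothetical feedback: two clients per arm reporting profit deltas.
reports = [("A", 1.2), ("A", -0.4), ("B", 0.9), ("B", 2.1)]
print(estimate_lift(reports))  # the arm with the higher mean "wins"
```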
That sounds like a good way to get extreme short-term optimization.
Say a particular finetune prioritizes profits right now and makes recommendations like "cut down on maintenance, you can make up for it later with your increased profits and their interest". It produces more profit and wins the A/B test. Later, the chickens come home to roost.
You can reduce the problem by using long-term indicators, but then each A/B test is very slow.
I think you would be hard-pressed to find any big tech company that doesn't do some kind of A/B testing. It's pretty much required if you want to build a great product.
A big tech company has ~10k experiments running at once. Some engineers will be kicking off a few experiments every day. Some will be minor things like font sizes or wording of buttons, whilst others will be entirely new features or changes in rules.
Focus groups have their place, but cannot collect nearly the same scale of information.
As someone who works in these orgs: only a small fraction are about user-experience metrics. 90+% are about extracting more short-term value, with unknown second-order effects on usability.
Yeah, that's why we didn't have anything anyone could possibly consider a "great product" until A/B testing existed as a methodology.
Or, you could, you know, try to understand your users without experimenting on them, like countless others have managed to do before, and still shipped "great products".
I know this is a salty take, but reliance on A/B testing to design products is indicative of product deciders who don't know what they are doing and don't know what their product should be. It's like a chef saying "I want to make a pancake", then trying 50 different combinations of ingredients until one of them ends up being a pancake. If you have to test whether a product works / is good / is profitable, then you didn't know what you were doing in the first place.
Using A/B tests to safely deploy and test bug fixes and change requests? Totally different story.
Long-term effectiveness? LLMs are such a fast-moving target. Suppose Anthropic reached out to you and gave you a model id you could pin down for the next year to freeze any A/B tests. Would you really want that? Next month a new model could be released to everyone else (or by a competitor) with a big step difference in performance on tasks you care about. Would you rather be on your own path, learning about a state of the world that doesn't exist anymore? Nov-ish 2025 and after, for example: it seemed like software engineering changed forever because of improvements in Opus.
> Suppose Anthropic reached out to you and gave you a model id you could pin down for the next year to freeze any A/B tests. Would you really want that?
If you really want to keep non-determinism down, you could try (1) pinning the installed version of the Claude Code client app (I haven't looked into the details of preventing auto-updates, because I'm a bleeding-edge person), and (2) pinning to a specific model version, which I'd think would reduce A/B test exposure to some extent: https://support.claude.com/en/articles/11940350-claude-code-...
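For what it's worth, roughly what I mean is something like this. It's a sketch only; the environment variable names and the dated model id are my assumptions from memory, so check them against the docs linked above.

```sh
# Sketch only; verify the exact variable names against the Claude Code docs.
export DISABLE_AUTOUPDATER=1                       # keep the client from updating itself
export ANTHROPIC_MODEL="claude-sonnet-4-20250514"  # pin a dated model snapshot, not a floating alias
claude --model "$ANTHROPIC_MODEL"
```

The important part is pinning a dated snapshot id rather than an alias like "sonnet", since aliases can silently move underneath you.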
> Suppose Anthropic reached out to you and gave you a model id you could pin down for the next year to freeze any A/B tests. Would you really want that?
Yes. I'd like some guarantee that my results are reproducible for some reasonable amount of time. New versions can also introduce regressions. A prompt that works well with today's model might not work with tomorrow's, even if the latter is "better".
> And this also pretty much ruins any attempt to research Claude Code's long term effectiveness in an organisation. Any negative result can now be thrown straight into the trash because of the chance Anthropic put you on the wrong side of an A/B test.
LLMs are non-deterministic anyway, as you note above with your comment on the 'reproducibility' issue. So any sort of research into CC's long-term effectiveness would already have to take into account that you can run it 15x in a row and get a different response every time.
These are two very different things. I suspect that in some cases pointing the finger at a black box instead of explaining your decisions can actually shield you from legal liability...
Hiding something like this in the TOS rather than explicitly asking users to opt in is a dark pattern. You can't gain the moral high ground by cackling that someone should have read the fine print.
All TOS essentially boil down to "we owe you nothing and can change the product at anytime to anything we want at our sole discretion"
Obviously it would be unreasonable to accept such terms without further context. The further context in this case being that Anthropic will maintain Claude as an AI agent and seek to improve its performance. What is at the heart of this issue is whether or not Anthropic's recent A/B testing violated that context, not whether or not they violated the TOS (they didn't, obviously).
I read the article as saying they were testing service changes on paying users without the users' knowledge or explicit consent, and that users had to test and figure out for themselves why their service seemed to have changed.
That is a dark pattern to inflict on users expecting consistent output.
Evil might be a stretch, but I really hate A/B testing. Some feature or UI component you relied on is now different, with no warning, and you ask a coworker about it, and they have no idea what you're talking about.
Usually, the change is for the worse, but gets implemented anyway. I'm sure the teams responsible have "objective" "data" which "proves" it's the right direction, but the reality of it is often the opposite.
> I'm sure the teams responsible have "objective" "data" which "proves" it's the right direction, but the reality of it is often the opposite.
In my experience all manner of analytics data frequently gets misused to support whatever narrative the product manager wants it to support.
With enough massaging you can make “objective” numbers say anything, especially if you do underhanded things like bury a previously popular feature three modals deep or put it behind a flag. “Oh would you look at that, nobody uses this feature any more! Must be safe to remove it.”
> The framing of A/B testing as "silent experimentation on users" and invoking Meta is a little much.
No. Users aren't free test guinea pigs. A/B testing cannot be done ethically unless you actively point out to users that they are being A/B tested and offer them a way to opt out, but that in turn ruins a large part of the promise behind A/B tests.
Meta has had an IRB for well over a decade (following a scandal where they used their users as lab rats), and that didn't stop them from doing any of the BS they've done ever since.
Not to start an internet argument -- I don't think it is appropriate in this context. A/B testing the features of a web app is not unexpected or unethical. So invoking the memory of Cambridge Analytica (etc.) is disproportionate. It's far more legitimate to just discuss how much A/B testing should negatively affect a user. I don't have an answer and it's an interesting and relevant question.
> A/B testing the features of a web app is not unexpected or unethical.
It's not "unexpected", but it is still unethical. In ye olde days, you had something like "release notes" with software, and you could inform yourself about what changed instead of constantly having to question your memory ("didn't a button exist there just yesterday?"). Or you could simply refuse to install the update, or you could run acceptance tests and raise flags with the vendor if your acceptance tests surfaced issues with your workflow.
Now, with everything and their dog turning SaaS for that sweet, sweet recurring revenue, and people jerking themselves off over "rapid deployment", with whoever does the most deployments a day winning the contest? Dozens if not hundreds of "releases" a day, and in the worst case, you learn the new workflow only for it to be reverted without notice again. Or half your users get the A bucket, the other half gets the B bucket, and a few users get the C bucket, so no one can answer questions from users in a different bucket. Gaslighting at million-user scale.
It sucks, and I wish everyone doing this nothing but debilitating pain in their life. Just a bit of revenge for all the pain you caused your users in the endless pursuit of 0.0001% more growth.
> It's far more legitimate to just discuss how much A/B testing should negatively affect a user. I don't have an answer and it's an interesting and relevant question.
You don't have an answer on "how much should A/B testing negatively affect a user"? So "a lot" would be on the table?