The framing of A/B testing as a "silent experimentation on users" and invoking M...

SlinkyOnStairs · 2026-03-14T12:35:31 1773491731

> I don't believe A/B testing is an inherent evil, you need to get the test design right, and that would be better framing for the post imo.

I disagree in the case of LLMs.

AI already has a massive problem in reproducibility and reliability, and AI firms gleefully kick this problem down to the users. "Never trust it's output".

It's already enough of a pain in the ass to constrain these systems without the companies silently changing things around.

And this also pretty much ruins any attempt to research Claude Code's long term effectiveness in an organisation. Any negative result can now be thrown straight into the trash because of the chance Anthropic put you on the wrong side of an A/B test.

> That being said, vastly reducing an LLMs effectiveness as part of an A/B test isn't acceptable which appears to be the case here.

The open question here is whether or not they were doing similar things to their other products. Claude Code shitting out a bad function is annoying but should be caught in review.

People use LLMs for things like hiring. An undeclared A-B test there would be ethically horrendous and a legal nightmare for the client.

sfn42 · 2026-03-14T17:20:42 1773508842

Anyone who trusts LLMs to do anything has shit coming. You can not trust them. If you do, that's on you. I don't care if you want to trust it to manage hiring, you can't. If you do anyway then the ethical problems are squarely on you.

People keep complaining about LLMs taking jobs, meanwhile others complain that they can't take their jobs and here I am just using them as a useful tool more powerful than a simple search engine and it's great. No chance it'll replace me, but it sure helps me do ny job better and faster.

DoctorOetker · 2026-03-14T22:49:11 1773528551

Would you have a problem with the following scheme?

Every client is free and encouraged to feed back its financial health: profit for that hour/day/month/...

The AB(-X) test run by the LLM provider uses the correlation of a client's profit with its AB(-X) test, so that participating with the testing improves your profit statistically speaking (sometimes up sometimes down, but on average up).

You may say, what about that hiring decision? One thing is certain: when companies make more profit they are more likely to seek and accept more employees.

986aignan · 2026-03-15T12:49:13 1773578953

That sounds like a good way to get extreme short-term optimization.

Say a particular finetune prioritizes profits right now and makes recommendations like "cut down on maintenance, you can make up for it later with your increased profits and their interest". It produces more profits, and wins the AB test. Later the chickens come home to roost.

You can reduce the problem by using long-term indicators, but then each AB test is very slow.

londons_explore · 2026-03-14T13:13:37 1773494017

I think you would be hard pushed to find any big tech company which doesn't do some kind of A B testing. It's pretty much required if you want to build a great product.

wavefunction · 2026-03-14T14:36:49 1773499009

A responsible company develops an informed user group they can test new changes with and receive direct feedback they can take action on.

londons_explore · 2026-03-14T16:32:18 1773505938

A big tech company has ~10k experiments running at once. Some engineers will be kicking off a few experiments every day. Some will be minor things like font sizes or wording of buttons, whilst others will be entirely new features or changes in rules.

Focus groups have their place, but cannot collect nearly the same scale of information.

rkomorn · 2026-03-14T16:39:54 1773506394

I think a lot of people (myself included) would just like to not be constantly part of some sort of revenue optimization effort.

I don't care, at all, about the "scale of information" for the company's sake.

londons_explore · 2026-03-14T20:38:32 1773520712

Often the experiments are not for revenue - many of them will be optimizing user experience metrics - ie. Load time or user dropoff rate.

They are clearly good for both user satisfaction and the companies bottom line.

jjj123 · 2026-03-14T21:35:56 1773524156

As someone who works in these orgs, only a small fraction are about user experience metrics. 90+% are extracting more short term value with unknown second order effects on usability.

wavefunction · 2026-03-16T15:14:14 1773674054

Big tech companies are not serving their "users" but advertisers, it's a common mistake.

franktankbank · 2026-03-14T19:48:07 1773517687

If you have 10k experiments running then you are probably p-hacking.

embedding-shape · 2026-03-14T13:33:12 1773495192

Yeah, that's why we didn't have anything anyone could possibly consider as a "great product" until A/B testing existed as a methodology.

Or, you could, you know, try to understand your users without experimenting on them, like countless of others have managed to do before, and still shipped "great products".

ryandrake · 2026-03-14T22:17:35 1773526655

I know this is a salty take, but reliance on A/B testing to design products is indicative of product deciders who don't know what they are doing and don't know what their product should be. It's like a chef saying, I want to make a pancake, but trying 50 different combinations of ingredients until one of them ends up being a pancake. If you have to test whether a product works / is good / is profitable, then you didn't know what you were doing in the first place.

Using A/B tests to safely deploy and test bug fixes and change requests? Totally different story.

coldtea · 2026-03-14T13:49:23 1773496163

A/B testing is the child of profit maximization, engagement farming, and enshittification. Not of "great product building".

steve-atx-7600 · 2026-03-14T13:17:57 1773494277

Long term effectiveness? LLMs are such a fast moving target. Suppose anthropic reached out to you and gave you a model id you could pin down for the next year to freeze any a/b tests. Would you really want that? Next month a new model could be released to everyone else - or by a competitor - that’s a big step difference in performance in tasks you care about. You’d rather be on your own path learning about the state of the world that doesn’t exist anymore? nov-ish 2025 and after, for example, seemed like software engineering changed forever because of improvements in opus.

coldtea · 2026-03-14T13:47:29 1773496049

>Suppose anthropic reached out to you and gave you a model id you could pin down for the next year to freeze any a/b tests. Would you really want that?

Where can I sign up?

steve-atx-7600 · 2026-03-14T13:25:29 1773494729

If you really want to keep non-determinism down, you could try (1) see if you can fix the installed version of the clause code client app (I haven’t looked into the details to prevent auto-updating..because bleeding edge person) and (2) you can pin to a specific model version which you think would have to reduce a/b test exposure to some extent https://support.claude.com/en/articles/11940350-claude-code-...

Edit: how to disable auto updates of the client app https://code.claude.com/docs/en/setup#disable-auto-updates

maleldil · 2026-03-15T03:26:45 1773545205

> Suppose anthropic reached out to you and gave you a model id you could pin down for the next year to freeze any a/b tests. Would you really want that?

Yes. I'd like some guarantee that my results are reproducible for some reasonable amount of time. New versions can also introduce regressions. A prompt that works well with today's model might not work with tomorrow's, even if the latter is "better".

garciasn · 2026-03-14T12:50:23 1773492623

> And this also pretty much ruins any attempt to research Claude Code's long term effectiveness in an organisation. Any negative result can now be thrown straight into the trash because of the chance Anthropic put you on the wrong side of an A/B test.

LLMs are non-deterministic anyway, as you note above with your comment on the 'reproducibility' issue. So; any sort of research into CC's long-term effectiveness would already have taken into account that you can run it 15x in a row and get a different response every time.

johnisgood · 2026-03-14T12:55:26 1773492926

Then do not use LLMs for hiring, or use a specific LLM, or self-host your own!

airza · 2026-03-14T13:27:29 1773494849

Isn’t the horrendous ethical and legal decision delegating your hiring process to a black box?

vova_hn2 · 2026-03-14T13:37:56 1773495476

> ethical and legal decision

These are two very different things. I suspect that in some cases pointing finger at a black box instead of actually explaining your decisions can actually shield you from legal liability...

paulryanrogers · 2026-03-14T13:41:45 1773495705

For some proponents, AI is liability washing

raw_anon_1111 · 2026-03-14T13:00:03 1773493203

Would you rather they change things for everyone at once without testing?

aeinbu · 2026-03-14T13:40:37 1773495637

That is not the only other alternative.

You can do A/B testing splitting up your audience in groups, having some audience use A, and others use B - all the time.

I think the article’s author is frustrated over sometimes getting A and at other times B, and not knowing when he is on either.

simianwords · 2026-03-14T14:01:25 1773496885

Strange! You benefitted from all the previous a/b experiments to give you a somewhat optimal model now. But now it’s too inconvenient for you?

plussed_reader · 2026-03-14T14:13:33 1773497613

Informed consent for a paying user is inconvenient?

doc_ick · 2026-03-14T14:49:15 1773499755

Did you read the TOC?

mrgoldenbrown · 2026-03-14T19:29:04 1773516544

Hiding something like this in the TOC rather than explicitly asking users to opt in is a dark pattern. You can't gain the moral highground by cackling that someone should have read the fine print.

doc_ick · 2026-03-15T13:30:59 1773581459

This is technology, some politics, capitalism, and math that is trained on curiously gained data, where does non-selfish morality come in?

MadnessASAP · 2026-03-14T20:43:29 1773521009

All TOS essentially boil down to "we owe you nothing and can change the product at anytime to anything we want at our sole discretion"

Obviously it would be unreasonable to accept such terms without further context. The further context in this case being that Anthropic will maintain Claude as an AI agent and seek to improve it's performance. What is at the heart of this issue is whether or not Anthropics recent A/B testing violated that context. Not whether or not they violated the TOS (they didn't, obviously)

doc_ick · 2026-03-15T13:29:05 1773581345

Ultimately that just sounds like within their own TOC, they were just working on getting the best operational results.

If you wanted something more deterministic write it yourself or get it verified, all hosted llms as far as know does neither.

plussed_reader · 2026-03-17T03:13:37 1773717217

I read the article saying they were testing service changes on paying users without knowledge or explicit consent of the user, that the user had to test and determine why they perceived their sevice changed.

That is a dark pattern to inflict on users expecting consistent output.

jdbernard · 2026-03-14T14:56:29 1773500189

Does anyone?

xg15 · 2026-03-14T19:57:26 1773518246

We actually didn't, if much of that A/B testing was to find the optimally "engagement maximizing" i.e. maximally addictive UI design.

ramoz · 2026-03-14T12:18:58 1773490738

I apologize for doing this - and I agree. I will revise

s3p · 2026-03-14T14:38:38 1773499118

I still think you have a point here. Doing this kind of testing on users unwittingly is unethical in my opinion

everdrive · 2026-03-14T13:00:34 1773493234

>I don't believe A/B testing is an inherent evil,

Evil might be a stretch, but I really hate A/B testing. Some feature or UI component you relied on is now different, with no warning, and you ask a coworker about it, and they have no idea what you're talking about.

Usually, the change is for the worse, but gets implemented anyway. I'm sure the teams responsible have "objective" "data" which "proves" it's the right direction, but the reality of it is often the opposite.

cosmic_cheese · 2026-03-14T13:54:46 1773496486

> I'm sure the teams responsible have "objective" "data" which "proves" it's the right direction, but the reality of it is often the opposite.

In my experience all manner of analytics data frequently gets misused to support whatever narrative the product manager wants it to support.

With enough massaging you can make “objective” numbers say anything, especially if you do underhanded things like bury a previously popular feature three modals deep or put it behind a flag. “Oh would you look at that, nobody uses this feature any more! Must be safe to remove it.”

hollow-moe · 2026-03-14T13:28:49 1773494929

Tech companies really have issues with "informed and conscious consent" doesn't they

mschuster91 · 2026-03-14T13:05:25 1773493525

> The framing of A/B testing as a "silent experimentation on users" and invoking Meta is a little much.

No. Users aren't free test guinea pigs. A/B testing cannot be done ethically unless you actively point out to users that they are being A/B tested and offering the users a way to opt out, but that in turn ruins a large part of the promise behind A/B tests.

bcrl · 2026-03-14T17:35:53 1773509753

Please name a computer science program that has an ethics component.

Yes, I wish software developers were more like actual engineers in this regard.

gnabgib · 2026-03-14T17:37:08 1773509828

All Computer Engineering & Systems Engineering programs in Canada require two ethics components (once at graduation, once at P.Eng)

ryandrake · 2026-03-14T22:14:35 1773526475

Sadly, in the USA, I believe most engineering ethics classes are optional electives, and it shows when you look at the graduating student body today.

saltcured · 2026-03-14T17:41:38 1773510098

Yeah, and if you don't already have an IRB, your organization probably isn't ready to be doing such things responsibly...

mschuster91 · 2026-03-15T00:58:23 1773536303

Meta has had an IRB for well over a decade (following a scandal where they used their users as lab rats) and that didn't stop them from doing any of the BS they did ever since.

tomalbrc · 2026-03-14T12:23:37 1773491017

Would love to know why you would consider invoking Meta “a little much”. Sounds more than appropriate.

krisbolton · 2026-03-14T12:59:52 1773493192

Not to start an internet argument -- I don't think it is appropriate in this context. A/B testing the features of a web app is not unexpected or unethical. So invoking the memory of cambridge analytica (etc) is disproportionate. It's far more legitimate to just discuss how much A/B testing should negatively affect a user. I don't have an answer and it's an interesting and relevant question.

mschuster91 · 2026-03-14T13:09:03 1773493743

> A/B testing the features of a web app is not unexpected or unethical.

It's not "unexpected" but it is still unethical. In ye olde days, you had something like "release notes" with software, and you could inform yourself what changed instead of having to question your memory "didn't there exist a button just yesterday?" all the time. Or you could simply refuse to install the update, or you could run acceptance tests and raise flags with the vendor if your acceptance tests caused issues with your workflow.

Now with everything and their dog turning SaaS for that sweet sweet recurring revenue and people jerking themselves off over "rapid deployment", with the one doing the most deployments a day winning the contest? Dozens if not hundreds of "releases" a day, and in the worst case, you learn the new workflow only for it to be reverted without notice again. Or half your users get the A bucket, the other half gets the B bucket, and a few users get the C bucket, so no one can answer issues that users in the other bucket have. Gaslighting on a million people scale.

It sucks and I wish everyone doing this only debilitating pain in their life. Just a bit of revenge for all the pain you caused to your users in the endless pursuit for 0.0001% more growth.

xg15 · 2026-03-14T19:55:59 1773518159

> It's far more legitimate to just discuss how much A/B testing should negatively affect a user. I don't have an answer and it's an interesting and relevant question.

You don't have an answer on "how much should A/B testing negatively affect a user"? So "a lot" would be on the table?

xg15 · 2026-03-14T19:51:11 1773517871

> The framing of A/B testing as a "silent experimentation on users"

Sorry, but how is A/B testing not exactly that? The experiments may be on non-disruptive things like button color, but they're experiments no less.

The users are also rarely informed about the experiment taking place, let alone on the motivation or evaluation criteria.

cyanydeez · 2026-03-14T15:22:36 1773501756

Relying on a paid service for anything significant is basically accepting the Company Store feudal serfdom.

Enshittification is coming for AI.