Readit News
ted_dunning · 9 years ago
This is yet another article that ignores the fact that there is a MUCH better approach to this problem.

Thompson sampling avoids the problems of multiple testing, power, early stopping and so on by starting with a proper Bayesian approach. The idea is that the question we want to answer is more "Which alternative is nearly as good as the best with pretty high probability?". This is very different from the question being answered by a classical test of significance. Moreover, it would be good if we could answer the question partially by decreasing the number of times we sample options that are clearly worse than the best. What we want to solve is the multi-armed bandit problem, not the retrospective analysis of experimental results problem.

The really good news is that Thompson sampling is both much simpler than hypothesis testing and can be applied in far more complex situations. It is known to be an asymptotically optimal solution to the multi-armed bandit problem and often takes only a few lines of very simple code to implement.

See http://tdunning.blogspot.com/2012/02/bayesian-bandits.html for an essay and see https://github.com/tdunning/bandit-ranking for an example applied to ranking.
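As a minimal sketch of that "few lines of code" claim (my own toy example, not the code from the linked repo): for Bernoulli rewards, keep a Beta posterior per arm, sample from each posterior, and pull the arm with the highest draw.

```python
import random

def thompson_pick(successes, failures, rng=random):
    """Sample each arm's Beta posterior (uniform prior) and pick the max."""
    draws = [rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return max(range(len(draws)), key=draws.__getitem__)

# Toy simulation: two arms with true conversion rates of 5% and 10%.
rng = random.Random(0)
true_rates = [0.05, 0.10]
successes = [0, 0]
failures = [0, 0]
for _ in range(5000):
    arm = thompson_pick(successes, failures, rng)
    if rng.random() < true_rates[arm]:
        successes[arm] += 1
    else:
        failures[arm] += 1

# The better arm quickly dominates, so few samples are "wasted" on the loser.
pulls = [s + f for s, f in zip(successes, failures)]
```

The exploration/exploitation trade-off is handled automatically: arms with wide posteriors still get sampled occasionally, but clearly inferior arms are pulled less and less often.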

yummyfajitas · 9 years ago
Thompson sampling is a great tool. I've used it to make reasonably large amounts of money. But it does not solve the same problem as A/B testing.

Thompson Sampling (at least the standard approach) assumes that conversion rates do not change. In reality they vary significantly over a week, and this fundamentally breaks bandit algorithms.

https://www.chrisstucchio.com/blog/2015/dont_use_bandits.htm...

Furthermore, you do not need to use Thompson sampling to have a proper Bayesian approach. At VWO we also use a proper Bayesian approach, but we use A/B testing to avoid the various pitfalls of Thompson sampling. Google Optimize uses an approach very similar to ours (although it may be flawed [1]), and so does A/B Tasty (probably not flawed).

https://cdn2.hubspot.net/hubfs/310840/VWO_SmartStats_technic...

Note: I'm the Director of Data Science at VWO. Obviously I'm biased, etc. However my post critiquing bandits was published before I took on this role. It was a followup to a previous post of mine which led people to accidentally misuse bandits: https://www.chrisstucchio.com/blog/2012/bandit_algorithms_vs...

[1] The head of data science at A/B Tasty suggests Google Optimize counts sessions rather than visitors, which would break the IID assumption. https://www.abtasty.com/uk/blog/data-scientist-hubert-google...

paulddraper · 9 years ago
No, no, no, no. https://www.chrisstucchio.com/blog/2015/dont_use_bandits.htm... needs a rebuttal so very, very badly.

> Depending on what your website is selling, people will have a different propensity to purchase on Saturday than they have on Tuesday.

Affects multi-armed bandit and fixed tests. If you do fixed A/B test on Tuesday, your results will also be wrong. Either way, you have to decide on what kind of seasonality your data has, and don't make any adjustments until the period is complete.

If anything, multi-armed bandit shines because it can adapt to trends you don't anticipate.

> Delayed response is a big problem when A/B testing the response to an email campaign.

Affects multi-armed bandit and fixed tests. If you include immature data in your p-test, your results will be wrong. Either way, you have to decide how long it takes to declare an individual success or failure.

> You don't get samples for free by counting visits instead of users

Affects multi-armed bandit and fixed tests. Focusing on relevant data increases the power of your experiment.

---

For every single problem, the author admits "A/B tests have the same problem", and then somehow concludes that multi-armed bandit tests are harder because of these design decisions, despite the fact that they affect any experimental process.

ted_dunning · 9 years ago
Thompson sampling does not need to assume stability. You can inject time features into the model if you want to model seasonality (or, more accurately, ignorance of seasonality), and you can also have a hidden random-walk variable.

Yes, if you assume stability and things vary, you will not have good results. That is true of any statistical method.

xapata · 9 years ago
VWO, you mean Vanguard FTSE Emerging Markets ETF?
developer2 · 9 years ago
This ignores what I have seen in my experience, which is that marketing teams - composed of the people who dictate what A/B tests the business should run - have little to no background in statistics, let alone any interest whatsoever in actually performing legitimate A/B tests.

It's often the case that the decision maker has already decided to move ahead with option A, but performs a minimal "fake" A/B test to put in their report as a way to justify their choice. I've seen A/B tests deployed at 10am and taken down at 1pm with less than a dozen data points collected. The A/B test "owner" is happy to see that option A resulted in 7 conversions, with option B only having 5. Not statistically significant whatsoever, but hey, let's waste developers' time and energy for two days implementing an A/B test in order to help someone else try to nab their quarterly marketing bonus.

paulddraper · 9 years ago
Join us, comrade, in the fight against the statistical blight!

Move your decision process to multi-armed bandit and you never have to decide when to end an A/B test -- math does it for you, in a provably optimal way.

jblow · 9 years ago
But it's DATA SCIENCE. You know it's SCIENCE because they called it SCIENCE.
martingoodson · 9 years ago
This is yet another comment claiming that Thompson sampling is the answer to all of our statistical problems!

Naive Thompson sampling (like the code you linked to) will result in problems equally disastrous to those I wrote about in the Qubit whitepaper. Other comments have highlighted a key problem with simple bandit algorithms - reward distributions which change over time will render their results worthless. You can model these dynamics but not in 'a few lines of very simple code'. It is verging on the irresponsible to suggest otherwise.

I personally favour a bayesian state-space model to elegantly take care of these things - but that's outside the remit of the whitepaper and outside the skill set of most non-statisticians. Frequentist testing, when done properly, is simple to implement and has statistical guarantees that are very attractive in practice.

paulddraper · 9 years ago
+1

I wrote a blog post focusing on the #2 problem mentioned: multiple testing. I simulated the typical mitigating approaches and compared them against Thompson sampling (code in GitHub). https://www.lucidchart.com/blog/2016/10/20/the-fatal-flaw-of...

I'm not a Bayesian fanatic, but given how perfectly A/B test optimization fits in the Bayesian approach, it's a shame it's not yet the de facto standard.

I think the primary reasons are (1) it's not as intuitive (especially for the uninitiated), (2) it's harder to implement an automated feedback mechanism, (3) FUD. E.g. https://www.chrisstucchio.com/blog/2015/dont_use_bandits.htm... lists devastating complications with correct multi-armed bandit tests, and then in fine print admits that traditional tests have all the same complications.

abecedarius · 9 years ago
Is it really less intuitive? I would've said less familiar. Null-hypothesis significance testing is unintuitive: nonexperts seem to explain it wrongly more often than not. (Like "the p-value is the probability the result was random chance".) Probably both approaches are unintuitive to humans unless they're explained really well.
saycheese · 9 years ago
Given that most people don't get A/B testing, it's a stretch for me to believe that someone who doesn't would know about more complex approaches that require more skill.

Firmly believe that there's way more to be gained by more people understanding how to use A/B testing than by more complex solutions.

amasad · 9 years ago
This is a great read. But it left me thinking I'm missing something in the fundamentals. Can you recommend a book (or other posts) on statistics fundamentals for programmers?
mattj · 9 years ago
I agree with you (and love your blog, btw), but I think you're skipping over at least a few benefits you can get out of a mature / well built a/b framework that are hard to build into a bandit approach. The biggest one I've found personally useful is days-in analysis; for example, quantifying the impact of a signup-time experiment on one-week retention. This doesn't really apply to learning ranking functions or other transactional (short-feedback loop) optimization.

That being said, building a "proper" a/b harness is really hard and will be a constant source of bugs / FUD around decision-making (don't believe me? try running an a/a experiment and see how many false positives you get). I've personally built a dead-simple bandit system when starting greenfield and would recommend the same to anyone else.

paulddraper · 9 years ago
Speaking of mature, well-built A/B test frameworks, Google Analytics uses multi-armed bandit.

https://support.google.com/analytics/answer/2844870?hl=en

tedsanders · 9 years ago
Thank you Ted for bringing sanity to this conversation. Terrific point and post.

By the way, I doubt you remember me, but thank you for inviting me on a tour of Veoh ten years ago when I was a young college sophomore. I enjoyed the opportunity as well as our brief chat about Bayesianism.

ted_dunning · 9 years ago
My calendar agrees with you ... we apparently had lunch in December, 2007. Sadly, I don't remember it off-hand.

On the other hand, one reason is that I invited a lot of students to come visit (and a number to intern with us).

tedsanders · 9 years ago
I think the entire approach discussed in this pdf is flawed. (Edit: not saying PDF itself is flawed or wrong, just the hypothesis testing approach to A/B testing.)

The right question to ask is: What is the difference between A and B, and what is our uncertainty on that estimate?

The wrong question to ask is: Is A different/better than B, given some confidence threshold?

The reason this is the wrong question is that it's unnecessarily binary. It is a non-linear transformation of information that undervalues confidence away from the arbitrary threshold and overvalues confidence right at the arbitrary threshold.

A test with only 10 or 100 samples still gives you information. It gives you weak information, sure, but information nonetheless. If you approach the problem from a continuous perspective (asking how big the difference is), you can straightforwardly use the information. But if you approach the problem from a binary hypothesis-testing perspective (asking whether there is a difference), you'll be throwing away lots of weak information when it could be providing real (yet uncertain) value.

Once you switch away from the binary hypothesis-testing framework, you no longer have to worry about silly issues like stopping too early or false positives or false negatives. You simply have a distribution of probabilities over possible effect sizes.
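A sketch of that continuous framing (my own minimal example, assuming Bernoulli conversions and uniform Beta priors): instead of a yes/no verdict, report the posterior probability that one variant beats the other.

```python
import random

def prob_a_beats_b(conv_a, n_a, conv_b, n_b, draws=100_000, seed=0):
    """Monte Carlo estimate of P(rate_A > rate_B) under uniform Beta priors."""
    rng = random.Random(seed)
    wins = sum(
        rng.betavariate(conv_a + 1, n_a - conv_a + 1)
        > rng.betavariate(conv_b + 1, n_b - conv_b + 1)
        for _ in range(draws)
    )
    return wins / draws

# 120/1000 conversions for A vs. 100/1000 for B: not "significant" at the
# usual 95% threshold, yet still actionable evidence that A is probably better.
p = prob_a_beats_b(120, 1000, 100, 1000)
```

The same posterior draws can be kept as a full distribution over the effect size (e.g. the difference of the two sampled rates), rather than collapsed to a single probability.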

Obi_Juan_Kenobi · 9 years ago
That's putting the cart before the horse.

Before you can quantify a difference, you have to determine whether one exists in the first place. That is the purpose of binary testing; without it, you're just looking at noise without any means to decide what is real and what is not.

As a corollary, if you can meaningfully quantify the difference between A and B, then you should have no trouble establishing that they are different. Obviously business decisions are not generally going to uphold the rigor of good science, but what is the purpose of quantifying things when you're as likely to be wrong as you are right?

tedsanders · 9 years ago
Asking whether A and B are different is not a useful question. There will be a difference 100% of the time. (Though of course sampling may not reliably detect the difference for feasible samples.)

The superior question is which is probably better, and by how much. If all you know is that A is 75% likely to be better than B, then go with A. It's useful information, even if it doesn't cross an arbitrary preset threshold of 95% or whatever you use.

You don't need to wait for certainty to act. In fact, all actions are taken under uncertainty. So it feels incredibly artificial and counterproductive to frame these questions in such a binary, nonlinear way. It's clinging to certainty when certainty does not exist.

martingoodson · 9 years ago
I don't know about 'flawed'. Classical hypothesis testing has its place, when used correctly. I guess the point of the article is that it falls apart dramatically when you do certain things wrong. And these things turn out to be very common in practice.

Disclaimer: I wrote that article (a long time ago...)

Silhouette · 9 years ago
If you're running an A/B test, by which I mean precisely that you already have two versions available and are simply trying to decide which is most likely to give better results in the future by comparing how successfully they achieve the desired result with all else being equal, then I'm still waiting for someone to explain to me why the correct question is not simply "Which performs better, A or B?"

The statistical rigour of all the hypothesis tests and Bayesian methods and so on is laudable, but fundamentally any result and any conclusion you draw from those tests can only ever be as good as the underlying assumptions from which the result was derived. For example, if you choose to perform any statistical test that designates one of A and B the default case and tests for evidence in favour of the other, you have made a determination that the situation is not symmetric. You can choose a level of power or a significance level you want to achieve, but the 5% convention is again arbitrary. The same goes for any prior you choose in a Bayesian method.

Fundamentally, if you are truly starting from a neutral situation where you just want to know which of A and B is better, and you truly have no other information to go on other than how they have performed since the start of your measurements with all else being equal, then it remains the case that you still have nothing else to go on no matter what statistical calculations you perform.

jgalt212 · 9 years ago
> The wrong question to ask is: Is A better than B, given some confidence threshold?

That's a perfectly fine question so long as there are three acceptable answers: Yes, No and Unclear.

tedsanders · 9 years ago
Suppose A is significantly better than the control. And B is not significantly better than the control. Does that imply A is significantly better than B? No. Statistical differences do not follow the same logic as normal differences.

It feels so pointless to bin probabilities into yes, no, and unclear. We're throwing away information with value when we compress the problem this way. And I think it leads to more misinterpretations.

korzun · 9 years ago
> A test with only 10 or 100 samples still gives you information. It gives you weak information, sure, but information nonetheless.

If your market consists of 1000 people.

tedsanders · 9 years ago
Could you elaborate? Even if your market consists of infinite people, it gives you real information.
dxhdr · 9 years ago
Maybe I'm misunderstanding but isn't the whole point of A/B testing to answer the question "is A better than B?"
godDLL · 9 years ago
This is very well explained, even if you don't understand statistics. Apparently not many vendors of A/B testing software do.
iaw · 9 years ago
I suspect the people building it do, the people selling it probably do not.
CalRobert · 9 years ago
Having worked between customers and engineering at a place that did this, I can confirm that "our tests are Bayesian!!" was a refrain everyone was taught to repeat, but few if any were taught what it meant.

This video is helpful: https://www.youtube.com/watch?v=Dy_LRK2Pkig

I still don't know a great way to describe it in the 6 or 7 seconds you have before the potential customer's attention starts to flag.

ssharp · 9 years ago
For most businesses, focusing on the math and making incremental improvements to the statistical methodology is a waste of time. There is a "good enough" approach to using tools like Optimizely and VWO.

Instead, they should be focusing on the quality of the tests they run. Quality hypotheses, preferably backed by data-driven insights into behavior that test very clear, if not dramatic, changes to the user flow will be what leads to bottom-line improvements, not increasing the mathematical merit and rigor of poorly-conceived tests. Of course, doing both is ideal but I put more fault on the testing software than the companies using it.

Keep track of your historical conversion rate and account for the noise in it. Build a testing program that focuses on quality tests and you'll likely see an upward trend in that historical conversion rate.

jkuria · 9 years ago
Hmmh, this is interesting. Most A/B software will let you set a level of statistical confidence that needs to be attained before a winner can be declared. For example in Google Analytics two common ones are 95% and 99%. We stop our tests when they reach at least 95% confidence. Is the author saying one must wait for 6000 events even if the difference between A/B is large? The larger the relative difference, the fewer events needed.
squarecog · 9 years ago
Stopping when you hit 95% confidence is a classic failure mode. Yes, if you are doing classic t-test based A/B testing, you have to wait until a pre-determined sample size is reached; otherwise, by looking at the p-value and stopping when it hits 95% confidence, what you are effectively doing is ignoring all negative results and accepting the first positive result you see -- clearly, that's bad science. I'm simplifying the math here, but this is the general notion.

You can see a demonstration of this in practice: http://www.gigamonkeys.com/interruptus/ (sit back and watch your false positive rate, aka "bogus a/b testing results", grow).

tedsanders · 9 years ago
Is it really a failure mode? I think it should be fine to stop whenever you want during a test. You just have to be smart enough not to use a t-test and misinterpret it to mean that A is 95% likelier than B.
contravariant · 9 years ago
The problem is that people often repeatedly check the significance, despite the fact that this test only guarantees a certain false positive rate if you use it once.

If you're planning to stop as soon as you find a positive result you'll need to modify the tests to ensure that the total chance that any test results in a positive is low enough. In general you'll need to keep increasing the significance level as you do more events (if you only plan to test a finite number of times you can keep the significance fixed, but I think this will lead to more false negatives than an 'adaptive' significance).

To illustrate, if you have 6000 events and check for 99% significance after every one, then you'd expect about 60 false positives on average. Of course these false positives aren't distributed uniformly, so it's not like you'll always find 60 false positives, however it's not like you'll only find 0 or 6000 either, meaning that (significantly) more than 1% of the time you'll have at least 1 false positive.
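The inflation from repeated checking is easy to demonstrate with an A/A simulation (a sketch; the z-test helper and the particular run counts and rates are my own choices, not from the thread):

```python
import math
import random

def z_test_pvalue(conv_a, n_a, conv_b, n_b):
    """Two-sided two-proportion z-test p-value (normal approximation)."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_a / n_a - conv_b / n_b) / se
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def peeking_false_positive_rate(runs=200, n=2000, check_every=100,
                                p=0.1, alpha=0.05, seed=1):
    """Fraction of A/A tests declared 'significant' when peeking repeatedly."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(runs):
        conv_a = conv_b = 0
        for i in range(1, n + 1):
            conv_a += rng.random() < p  # both arms share the same true rate
            conv_b += rng.random() < p
            if i % check_every == 0 and z_test_pvalue(conv_a, i, conv_b, i) < alpha:
                hits += 1  # stopped early on a spurious difference
                break
    return hits / runs

rate = peeking_false_positive_rate()
```

With 20 peeks per run, the observed false positive rate lands well above the nominal 5%, even though the two arms are identical.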

paulddraper · 9 years ago
Yep, this is precisely the multiple testing point covered.
tedsanders · 9 years ago
Checking and stopping is only a problem if you use the inappropriate formula. It's not generally a problem as far as I'm aware. What do you think?
daveguy · 9 years ago
From the article:

"You can perform power calculations by using an online calculator or a statistical package like R. If time is short, a simple rule of thumb is to use 6000 conversion events in each group if you want to detect a 5% uplift. Use 1600 if you only want to detect uplifts of over 10%. These numbers will give you around 80% power to detect a true effect."

These numbers assume approximately 10% of changes give true uplift (aka actual improvement). If the incidence of improvement is rarer you would need more than 6000 events to identify a 5% increase.

(OT: that term sounds like cult language -- you have three chances to achieve true uplift)
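The article's rule of thumb can be reproduced from the standard two-proportion sample-size formula (a sketch; the 3% baseline conversion rate is my assumption, but for small baselines the number of conversions needed barely depends on it):

```python
from statistics import NormalDist

def conversions_needed(base_rate, relative_uplift, alpha=0.05, power=0.8):
    """Conversions per group for a two-sided two-proportion z-test."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    p1 = base_rate
    p2 = base_rate * (1 + relative_uplift)
    delta = p2 - p1
    # Visitors per group, then scale by the base rate to get conversions.
    visitors = z ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / delta ** 2
    return visitors * base_rate

small = conversions_needed(0.03, 0.05)  # 5% uplift: roughly 6000 conversions
large = conversions_needed(0.03, 0.10)  # 10% uplift: roughly 1600 conversions
```

Both numbers line up with the 6000/1600 figures quoted from the whitepaper.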

paulddraper · 9 years ago
You can take an adaptive approach, but almost universally it is done wrong. https://www.lucidchart.com/blog/2016/10/20/the-fatal-flaw-of...

FYI, the difference with Google Analytics and most other A/B platforms is that GA uses multi-armed bandit. It adjusts the proportions according to the results. This is nice because it effectively ends the test for you, and you don't have to stress about suboptimal success rates just for the purpose of experimentation. https://support.google.com/analytics/answer/2844870

user5994461 · 9 years ago
> Is the author saying one must wait for 6000 events even if the difference between A/B is large?

Yes.

> The larger the relative difference, the fewer events needed.

Wrong. That's the common mistake that is being addressed.

The relative difference could be anything at any point in time. You always need a lot of events to be somewhat confident that there is really a difference.

Short version: if you want X% confidence on a test, there is a maths formula saying that you need to monitor at least N events. The formula doesn't depend on how large the difference is (which you don't know anyway).

tedsanders · 9 years ago
It's absolutely true that if the difference is larger you need fewer samples to reach statistical significance. This is why small effects require studies with large sample sizes.
contravariant · 9 years ago
Well, in this case the distribution is (assumed to be) binomial, which does allow you to assign a higher significance for cases where there is a strong effect.
Vinnl · 9 years ago
Nice article. One question though:

> Perform a second ‘validation’ test repeating your original test to check that the effect is real

Isn't this just the same as taking a larger sample size?

Obi_Juan_Kenobi · 9 years ago
Almost.

The difference is time, so the populations involved are 'the people that used the site last month' and 'the people that used the site last week'. Usually your assumption is that these are comparable, but that's not necessarily true. Furthermore, the effect is only meaningful if that assumption is true, for most business cases (i.e. you want this effect to hold in the future).

In practice, a lot of effects do disappear when you repeat a test because there was some unaccounted for factor that varied between them. It's a good sanity check.

But you're right, much of the purpose is to discover regression to the mean. This happens more often than you'd expect, because you tend to be focusing on large effects, many of which will simply be due to chance.

Vinnl · 9 years ago
So basically, it means waiting a while before performing the test again? Since "larger sample size" mostly just means "keep the test running for longer", so the difference is that in that case, the "second" test is very close in time to the "first" test?
guico · 9 years ago
I don't get this: "But remember: if you limit yourself to detecting uplifts of over 10% you will also miss negative effects of less than 10%. Can you afford to reduce your conversion rate by 5% without realizing it?"

You'd potentially lose 5% CR if you ship the variant even when it doesn't show a detectable uplift. Why would you do that?

RA_Fisher · 9 years ago
Any other statisticians want to champion this? https://news.ycombinator.com/item?id=13434410