So their argument is that these tests don't provide causal inference, because the platforms can target the two A/B test groups differently.
With that in mind, my reading of this is: if you're a researcher trying to say "Advertisement A is more appealing than Advertisement B", you need causal inference and these tests won't give it to you. If you're a marketer just trying to determine which ad spend is more efficient, you don't need that causal inference.
In other words, they are debunking the use of these tools for research studies, not their use as marketing tools.
I think it's somewhat more important than that. Basically the point is that an A/B test tells you "which ad is the more effective spend on this particular platform, with the particular mix of users you selected". If you try to expand the user group after the A/B test is over, or if you take the ads to another platform, you shouldn't expect the results to hold.
So if you're an ad exec, you should make sure you run separate A/B tests for each specific ad channel you intend to spend money on. An A/B test on Facebook does not tell you anything about how well the same ads would do on Google, and even less so on TV (even assuming you are reaching roughly the same type of audience). This happens because Facebook's targeting mechanisms are not the same as Google's, so the two platforms may optimize delivery differently within the same population groups you selected and give you different results. And TV ads are not targeted at all, even if the audience you are reaching is the same.
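To make that concrete, here's a toy simulation (my own sketch with made-up numbers, not from the paper): the arms are split 50/50, but each platform's delivery model decides who inside each arm actually sees the ad, and that alone is enough to flip which creative looks better.

```python
# Toy simulation (made-up numbers): same two creatives, same audience, but the
# platform's delivery model decides who inside each 50/50 arm actually sees the ad.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
young = rng.random(n) < 0.5   # audience: half "young", half "older" users

# True (unobserved) conversion rates per creative and segment.
p_conv = {
    ("A", True): 0.030, ("A", False): 0.010,   # creative A lands with young users
    ("B", True): 0.012, ("B", False): 0.028,   # creative B lands with older users
}

def run_test(p_show_young):
    """Users are randomized 50/50 into arms, but within each arm the platform
    serves the ad preferentially to the segment its delivery model favors."""
    arm_a = rng.random(n) < 0.5
    observed = {}
    for creative, in_arm in (("A", arm_a), ("B", ~arm_a)):
        seg_young = young[in_arm]
        # The platform shows the ad to a biased subset of the arm.
        shown = rng.random(in_arm.sum()) < np.where(seg_young, p_show_young, 1 - p_show_young)
        p = np.where(seg_young[shown], p_conv[(creative, True)], p_conv[(creative, False)])
        observed[creative] = round((rng.random(shown.sum()) < p).mean(), 4)
    return observed

# A platform that skews delivery toward young users vs. one that skews toward older users:
print("skews young:", run_test(0.8))   # creative A looks better here
print("skews older:", run_test(0.2))   # creative B looks better here
```

Same creatives, same nominal randomization, opposite "winners", purely because of who each platform chose to show the ads to.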
True, but any ad exec worth their salt already knows this, if not because of different targeting algorithms, then at least because of different user and intent profiles (e.g. social users are generally younger and lower-intent).
The issue seems to be that the platforms optimize before showing — presumably because they get paid for click-throughs.
Couldn’t they offer an unbiased randomization option with a different payment model (e.g. based on impressions rather than clicks)? That would preserve their revenue and give researchers a good tool.
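Something like this, maybe (purely hypothetical, no platform offers this as far as I know): assignment is a per-user coin flip that ignores the relevance model entirely, and billing is per impression served.

```python
# Hypothetical sketch of an "unbiased randomization" option: arm assignment is a
# deterministic hash of the user id, independent of any predicted click-through
# rate, and the advertiser is billed per impression rather than per click.
import hashlib

CPM = 5.00  # assumed price per 1,000 impressions; made-up number

def assign_arm(user_id: str, experiment_id: str) -> str:
    """Coin-flip assignment that ignores the platform's relevance model."""
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode()).digest()
    return "A" if digest[0] % 2 == 0 else "B"

def bill(impressions_served: int) -> float:
    """Charge for showings, not clicks, so the platform has no incentive
    to steer delivery toward likely clickers."""
    return impressions_served / 1000 * CPM

# Example: every eligible impression goes to whichever creative the hash picks.
print(assign_arm("user-123", "exp-42"), bill(250_000))
```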
There is a section 2.1.3 "Online platform studies versus lift tests" in the article. For the marketing tools purpose, you can use either (or some mixture of both). There are pros and cons to the choice.
> i.e., the inability to attribute user responses to ad creatives versus the platform’s targeting algorithms
Why would people expect to measure just the creative? How good the platform is able to target is part of what one would want to include in the measurement.
The goal of these experiments is to see which creative pushes the metrics the most.
The reason you need this is that the hypothesis is about the creative, not about the creative-plus-platform combination. You'd want the creative choice to come out the same way across platforms.
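E.g., a trivial check once you have separately run tests on each platform (made-up numbers below): does the A-vs-B decision actually replicate?

```python
# Hypothetical check: does the A-vs-B decision replicate across platforms?
# Inputs are per-platform conversion rates from separately run tests (made-up data).
results = {
    "facebook": {"A": 0.026, "B": 0.015},
    "google":   {"A": 0.014, "B": 0.025},
}

winners = {platform: max(rates, key=rates.get) for platform, rates in results.items()}
if len(set(winners.values())) == 1:
    print(f"Creative choice replicates across platforms: {winners}")
else:
    print(f"Creative choice is platform-dependent: {winners}")
```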
I really found this paper interesting and concerning. I don’t work in marketing or run these kinds of studies, but I do work at Qualtrics and have experience with A/B testing in general. For those who work in this space and can relate to this paper, would it be helpful if Qualtrics developed some kind of audit panel in our product to help surface potential platform bias? For example, sample ratio mismatch or metadata balance checks.
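For the sample ratio mismatch piece specifically, a minimal version is just a chi-square goodness-of-fit test on the arm counts against the planned split. A sketch (not existing Qualtrics functionality, just the standard SRM test):

```python
# Minimal sample ratio mismatch (SRM) check: a chi-square goodness-of-fit test
# on the observed arm counts against the intended split. A tiny p-value means
# users were not delivered to the arms in the planned ratio, so something
# upstream (e.g. platform targeting) is interfering with the randomization.
from scipy.stats import chisquare

def srm_check(count_a: int, count_b: int, planned_split=(0.5, 0.5), alpha=0.001):
    total = count_a + count_b
    expected = [total * planned_split[0], total * planned_split[1]]
    _, p_value = chisquare([count_a, count_b], f_exp=expected)
    return p_value, p_value < alpha

# Example: a 50/50 test that came back 50,900 vs 49,100 users.
p, mismatch = srm_check(50_900, 49_100)
print(f"p = {p:.4g}, sample ratio mismatch detected: {mismatch}")
```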
Perhaps something that highlights the limited scope of a test's predictive power? E.g. for a test run on Facebook: "This test is likely to be very useful for another Facebook ad campaign with the same parameters, and at least somewhat useful for a Google ad campaign with equivalent parameters".
While it would be convenient, platforms aren't the same. You can't just assume people will react the same across platforms.