October 28, 2022

The Temporal Limitations of A/B Testing

With every post I write, I wonder what direction it will take this blog, and what value it provides to analysts. And although I often write about careers, I feel like the blog is lacking in professional content.

Now, there are reasons for that, obviously. The blog started as a Facebook page, and the platform doesn’t allow putting images into the text, so every point had to be explained verbally, which is sometimes complex. But anyway, I wrote a three-post series on a professional topic, so let’s hope it works out!

Recently (just this week, as of writing the draft for this post) I attended a Monday conference, organized by Eldar Muzikansky, which featured two interesting lectures, by Allon Korem and Michael Elbaz.

Both lectures were enriching, touching on plenty of relevant and interesting points, but (as both lecturers would agree) didn’t cover even 0.1% of the interesting problems in the field, which gave me some inspiration. So let’s talk for a bit not about how to perform A/B tests, but about their limitations.

This post covers the temporal limitations of A/B testing.

I would like to state, right off the bat, that all of the examples I’m about to give here are silly. Obviously, that’s because I’m trying to make a point, but I hope that we can learn from that point, identify a pattern, and from that pattern start a discussion with ourselves or with the team members running the test.

For context, let us imagine we have an operations team whose only role is feeding text into web pages, with the hope that during a Google search (what we call in the biz “organic search”) the text would attract to the page the most relevant audience for producing a purchase (or any other kind of conversion, doesn’t really matter).

So this team examines their site in August, and comes up with the following hypothesis:

“If towards the end of September we start placing Christmas related terms over the site, these pages would show up higher on searches for expressions related to the product and Christmas, giving us more relevant entries.”

They define the expected results perfectly, and when the data comes in they decide to run a dependent-samples t-test for a before-and-after comparison. After all of the analytic work, we find that the “after” group is markedly higher, and the difference is statistically significant, and hooray and good work everyone. Everybody is happy, they present it to management, get a round of applause, and here’s the question now: what is the conclusion of the experiment?

The conclusion is, of course, the more Christmas related terms we put on our pages, the higher they’ll rank.

The team members are happy, they write a case study and even present it during a meetup with pizza and beer and networking and all that.

During the meetup, a team from Another Organization hears about the experiment and decides to implement the changes as written. The result is Another Organization’s site getting a Christmasy makeover: a sprinkle of Christmas terms, a red and green color scheme, ho-ho-ho and all that. In May.

Now, we’re all clear that the result of this experiment (at least on the immediate surface level) won’t be beneficial to them, right?

Imagine yourself going into some commerce site full of bats and jack-o-lanterns, but what you’re looking for is chocolate eggs (full disclosure — I’m always looking for chocolate, eggs or otherwise).

The problem is that the conclusions from the experiment weren’t limited with regard to the time axis, and although I gave a classic example, there’s no reason to assume that the time axis doesn’t influence every other test we perform.
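To make the trap a bit more concrete, here is a minimal sketch of that before-and-after comparison on made-up numbers. Everything in it is an assumption for illustration: the 30 pages, the 25% seasonal lift, and the fact that the new text has no effect at all. It computes only the paired t statistic, by hand, using the standard library.

```python
import math
import random
import statistics


def paired_t_statistic(before, after):
    """t statistic of a dependent-samples (paired) t-test."""
    diffs = [a - b for a, b in zip(after, before)]
    mean_diff = statistics.mean(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return mean_diff / std_err


random.seed(0)  # reproducible made-up data

N_PAGES = 30
SEASONAL_LIFT = 1.25  # assumption: traffic rises 25% toward the holidays anyway
TRUE_EFFECT = 1.0     # assumption: the Christmas terms change nothing

# Hypothetical daily organic entries per page, before and after the change
before = [random.gauss(100, 10) for _ in range(N_PAGES)]
after = [b * SEASONAL_LIFT * TRUE_EFFECT + random.gauss(0, 5) for b in before]

t_stat = paired_t_statistic(before, after)
print(f"paired t statistic: {t_stat:.1f}")
```

With 29 degrees of freedom, the two-sided 5% critical value is roughly 2.05, and the statistic here lands far above it even though the change itself did nothing. The test is answering “did the numbers go up?”, not “did our text make them go up?”, and the calendar alone is enough for the first.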

We can expand on this point and say that we all already know the issue. After all, one of the explicit reasons we run tests for two to three weeks is to capture weekly seasonality (because behavior on weekdays and weekends can be different).

So if we can agree that conclusions from A/B testing are limited in terms of time, we can’t keep arguing that “the change group performs better than the control” without considering significant changes in time and in population characteristics.

What am I suggesting? That when we present the conclusion of an A/B test, we have to present all the information that could have influenced the results and that qualifies the conclusions.
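A small sketch of why whole weeks matter, assuming a made-up site where weekends convert at twice the weekday rate (the rates and the helper name are hypothetical):

```python
import statistics

# Hypothetical conversion rates, Monday (index 0) through Sunday (index 6)
WEEKDAY_RATE = [0.02, 0.02, 0.02, 0.02, 0.02, 0.04, 0.04]


def average_rate(start_weekday, n_days):
    """Mean daily rate over a test that starts on start_weekday (0 = Monday)."""
    days = [(start_weekday + i) % 7 for i in range(n_days)]
    return statistics.mean(WEEKDAY_RATE[d] for d in days)


two_full_weeks = average_rate(start_weekday=0, n_days=14)
ten_days_from_monday = average_rate(start_weekday=0, n_days=10)
ten_days_from_friday = average_rate(start_weekday=4, n_days=10)

print(two_full_weeks, ten_days_from_monday, ten_days_from_friday)
```

A ten-day test gives a different average depending on which day it starts, simply because it catches a different number of weekends. A test that runs whole weeks gives the same average no matter when it starts.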

In terms of demographics:

If we created a test and, completely by coincidence, only teenagers took part in it, we can’t apply the conclusions to other age-groups.

If my test, by complete coincidence, had a majority of male participants, applying the conclusion to the entire population is problematic to the point of being dangerous (I’m looking at you, clinical trials with female under-representation).

In terms of the time axis:
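One way to catch this before presenting anything is to compare who actually took part with who we want the conclusion to apply to. A rough sketch, where the function name, the 50% tolerance, and the numbers are all my own assumptions:

```python
from collections import Counter


def underrepresented(sample, population_share, tolerance=0.5):
    """Groups whose share of the sample is below tolerance * their population share."""
    counts = Counter(sample)
    total = len(sample)
    result = {}
    for group, expected in population_share.items():
        observed = counts.get(group, 0) / total
        if observed < tolerance * expected:
            result[group] = (observed, expected)
    return result


# Made-up test: 85 male participants, 15 female, in a 50/50 population
participants = ["male"] * 85 + ["female"] * 15
population = {"male": 0.5, "female": 0.5}

flagged_groups = underrepresented(participants, population)
print(flagged_groups)
```

Any group that shows up here is a group the conclusion shouldn’t be stretched to without saying so.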

Merry Christmas and ho-ho-ho notwithstanding, the purchase of certain items can be influenced by other significant events, like say a war in Europe, even without a direct connection (but also tread lightly with the idea of “the butterfly effect” and how everything can influence everything — keep your limitations reasonable).

In terms of our system:

If we examined a certain element on the page, but our back-end system has changed since then, the results of that test might not be accurate or reliable anymore.

In conclusion, I’m not saying that I’m an A/B testing expert, because I’m not, but I do want you to pay close attention to how you use your tools.
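If it helps, the qualifiers above can travel together with the conclusion itself. Here is a minimal sketch of that idea; the class, its fields, the dates, and the version labels are all hypothetical, and the checks are deliberately naive.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Conclusion:
    """A test conclusion together with the context it was observed in."""
    claim: str
    start: date          # first day of the test
    end: date            # last day of the test
    population: str      # who took part, in plain words
    system_version: str  # hypothetical label for the back-end in use

    def caveats(self, apply_on, system_version):
        """List reasons the conclusion may not carry over to a new situation."""
        tested_months = {
            date.fromordinal(day).month
            for day in range(self.start.toordinal(), self.end.toordinal() + 1)
        }
        found = []
        if apply_on.month not in tested_months:
            found.append("applied in a month the test never covered")
        if system_version != self.system_version:
            found.append("system changed since the test")
        return found


christmas_terms = Conclusion(
    claim="Christmas terms raise organic entries",
    start=date(2022, 9, 25),
    end=date(2022, 10, 16),
    population="organic search visitors",
    system_version="v1",
)

in_may = christmas_terms.caveats(date(2023, 5, 10), "v2")
next_october = christmas_terms.caveats(date(2023, 10, 1), "v1")
print(in_may)
print(next_october)
```

The team from Another Organization would get two warnings for their May makeover, and none for repeating the change the following October on the same system.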
