And here we are: the third (and final?) post of our A/B testing extended universe. After the first post discussed the limits of our conclusions, and the second the greediness of A/B testing, let’s take up a subject which is more cliché, I think, but still a point worthy of shining a spotlight on for a minute.
For the previous articles look here:
Let’s talk about transitivity of A/B testing conclusions
First, let’s talk about transitivity —
What even is that?
A transitive relation is a relation which is relayed (transits) across comparisons. That is, if R is a transitive relation, and it’s true that x stands in relation R to y, and it’s also true that y stands in relation R to z, then it is true that x stands in relation R to z.
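To make the definition concrete, here’s a toy sketch (the function name and the example pairs are mine, not from the post) that checks whether a relation, given as a set of (x, y) pairs, satisfies the rule above:

```python
def is_transitive(pairs):
    """Return True if, whenever (x, y) and (y, z) are in the relation,
    (x, z) is in it too."""
    s = set(pairs)
    return all(
        (x, z) in s
        for (x, y1) in s
        for (y2, z) in s
        if y1 == y2
    )

# "less than" on a few numbers: 1 < 5, 5 < 10, and indeed 1 < 10.
assert is_transitive({(1, 5), (5, 10), (1, 10)})

# Remove the (1, 10) link and the relation is no longer transitive.
assert not is_transitive({(1, 5), (5, 10)})
```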
Where can we find transitivity in real life?
Well, in all sorts of places.
Transitivity inherently exists on ordered axes (if ten is bigger than five and five is bigger than one, then ten is bigger than one).
On interval scales — if the eighth floor is higher than the sixth floor and the sixth floor is higher than the basement, then the eighth floor is higher than the basement.
Now that we’ve established where we naturally find transitivity, let’s give an example of A/B testing.
To reference (not to say blatantly rip off) post no. 2 — the button in red (I’m sticking to this joke, even though by this point I’m the only one who finds it funny, and even that just because it was a long day).
We have a red button and hypothesize that if we change it to green, we would get better results. We run the test, find out we were right and all is well.
The button is green now.
Now we have a green button, and we hypothesize that if we change its color to blue we’d get better results. We run the test and, look at that, we were right. All is well again.
Is a blue button better than a red button?
We can’t be certain.
I think the most intuitive example I’ve seen was at the 2022 hayaData conference, where Igal Goldfine described the lack of transitivity in A/B testing as a game of rock, paper, scissors.
True, the game is a closed system where we know precisely what the influences are, knowledge we don’t have for A/B tests. But the point stands: until we check, we can’t assume transitivity holds.
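The rock, paper, scissors analogy can be spelled out in a few lines. This is my own toy illustration (the names are mine, not from the talk): each option beats exactly one other, so the “beats” relation fails the transitivity test from earlier.

```python
# The "beats" relation of rock, paper, scissors, as a set of pairs.
beats = {
    ("paper", "rock"),
    ("rock", "scissors"),
    ("scissors", "paper"),
}

def beats_rel(a, b):
    """Does option a beat option b?"""
    return (a, b) in beats

assert beats_rel("paper", "rock")        # paper beats rock
assert beats_rel("rock", "scissors")     # rock beats scissors
assert not beats_rel("paper", "scissors")  # ...but paper does NOT beat scissors
```

If variant B beat variant A, and variant C beat variant B, this is exactly the structure that would let variant A still beat variant C.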
If I wanted to give an example where an A/B test almost certainly is transitive, I would give the following:
Let’s run a test on a virtual 1980’s convenience store — we want to check how the prices of Hubba Bubba gum (you know, the one that comes in a round plastic container) influence the sales of that same bubblegum. Precise definitions are important here — we’re examining the correlation between price and number of sales (not, say, income).
The original state is that the price for Hubba Bubba bubble gum is $30 per piece. For our variant group, we’ll lower the price to 5 cents per piece — quite naturally, the number of gums sold will go up.
Our new base state is 5 cents per piece, and we conduct a new test where we drop the price to precisely zero. Even more naturally, the number of gums “sold” will increase again.
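In this case transitivity is inherited from an ordered axis: sales increase monotonically as price drops, so if the 5-cent variant outsells the $30 one, and the free variant outsells the 5-cent one, the free variant outsells the $30 one too. A tiny sketch with made-up sales numbers (the figures are illustrative, not real data):

```python
# Hypothetical demand at each price point tested (units "sold").
# The numbers are invented for illustration only.
sales_at_price = {30.00: 10, 0.05: 500, 0.00: 2000}

prices = sorted(sales_at_price)                 # 0.00 < 0.05 < 30.00
sales = [sales_at_price[p] for p in prices]     # [2000, 500, 10]

# Lower price -> more sales at every adjacent step...
assert sales[0] > sales[1] > sales[2]
# ...so the comparison carries across steps: the total order on prices
# makes "sells more than" transitive here.
assert sales[0] > sales[2]
```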
So I think the options are pretty clear, and so is the rationale behind saying that A/B tests are not transitive.
The takeaway is fairly straightforward — when we want to check what improvement (if any) a change has, we don’t have to examine the new variant against every preceding variant, but we can’t assume that the new variant would be better than anything in the past, simply because it’s in the past.
Again, hope that helps with designing your A/B tests!