Post number two in our “what are we doing wrong with A/B testing, and how can we fix it” series.
If you’re interested in reading the other posts – here are the links:
Whenever A/B testing is discussed, the simplest go-to example is the red button (which is also the title of my A/B-testing-centered noir novel) and the hypothesis that a green button performs better, or vice versa.
And it makes a lot of sense to use this example, because it illustrates the concept of variant vs. control very well. It isn’t complex enough to be confusing, and it even matches our everyday experience with traffic lights and so on.
But what happens when that same example contains a more complex problem?
So let’s step into real life for a minute. We have our red button, it’s very conspicuous, all is well. After a very successful meetup about it, the product person comes to us and says they read that in our specific scenario (which is well defined, with all its constraints), a green button would increase the click-rate and produce a more desirable outcome.
We are lucky enough to work for a really data-driven organization, so we conduct a proper A/B test and wouldn’t you know it, green does turn out to work better for us, results are statistically significant and everybody’s happy.
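A “proper A/B test” here usually means checking whether the difference in click-rate between the two buttons is statistically significant. A minimal sketch, assuming a two-proportion z-test; the counts below are invented for illustration, none of them come from the post:

```python
# Hypothetical sketch: a two-proportion z-test on click-through counts.
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(clicks_a, n_a, clicks_b, n_b):
    """Return the z statistic and two-sided p-value for H0: p_a == p_b."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    # Pooled proportion under the null hypothesis of equal rates.
    p_pool = (clicks_a + clicks_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Red (control) vs. green (variant) -- made-up numbers:
z, p = two_proportion_z_test(clicks_a=480, n_a=10_000, clicks_b=560, n_b=10_000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

With these invented counts the green variant comes out significant at the usual 0.05 level; with your own data, the test decides.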
Then the team leader hears the story over lunch and gets excited: “now do the text in green, too! The same green! Green is good!”
Now, on the one hand you think to yourself: “yeah? Is green text on a green background really what’s going to drive the percentages up? That doesn’t sound right.”
On the other hand, you think, well, green does have a cultural context, maybe the text is superfluous.
You find the justifications and run the test only to discover that if you don’t tell people what they should do — they don’t know what to do (E&OE).
The team leader gets angry, throws a tantrum, and decides to change the button back to red, because green doesn’t really drive improvement.
What’s the methodological (rather than managerial) problem we see here?
A/B tests, just like humans, are naturally greedy. We therefore have to understand that the test itself doesn’t remember or preserve any knowledge of prior states or of the nature of the change. In an unplanned series of tests we can reach the wrong conclusions (such as “green never works”).
We also have to remember that one of the risks of running multiple A/B tests is that a user will be exposed to one variant in the first test and to another in the second. Our data would get mixed up, we wouldn’t be able to point to a direct connection, and we’d suffer noise and mayhem and criminal activity and kids not taking their books back to the library on time, and we don’t want that, do we?
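One common way to avoid that mix-up is deterministic assignment: hash the user ID together with the experiment name, so each user always lands in the same arm within a given test, while different tests split users independently of each other. A minimal sketch; the experiment names and user IDs are invented:

```python
# Deterministic bucketing: the same (user, experiment) pair always maps
# to the same arm, so repeat visits never flip a user between variants.
import hashlib

def assign(user_id: str, experiment: str, arms=("control", "variant")) -> str:
    # Salting the hash with the experiment name makes the splits of
    # different experiments independent of one another.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return arms[int(digest, 16) % len(arms)]

# The same user gets a stable arm per experiment:
print(assign("user-42", "button-color"))  # stable across calls
print(assign("user-42", "button-text"))   # split independently of the first test
```

Because assignment is a pure function of the inputs, no storage is needed to keep a returning user in the same arm.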
So what am I suggesting?
When we have the opportunity, instead of running a single test we can think ahead and plan a long-range series of experiments: a funnel of A/B tests, where earmarked participants from one test feed into the next. For example:
I want to conduct a medium-range experiment in two phases. In the first experiment I want to measure the effect of giving a certain discount on earnings. I use a variant with a 10% discount and a control with a 0% discount.
After concluding the test, I raise a new hypothesis: clients who were exposed to the original discount will spend more when presented with a discount lower than the original.
The rationale, in this case, is that people will remember the higher discount percentage, assume the discount is shrinking and will soon disappear, and conclude that they had better buy now (a similar rationale can be constructed for a higher discount, too).
When we conclude this test and analyze the results, we will be able to say what effect presenting a given discount rate had on each group, and, at the end of the process, we will have very interesting information about how clients respond in subsequent experiments.
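The two-phase funnel above can be sketched as follows. The hash-based assignment, the in-memory exposure log, and all IDs are illustrative assumptions, not a prescribed implementation; a real system would persist the exposure records:

```python
# Hypothetical sketch of the two-phase funnel: phase 1 assigns a 10%
# discount vs. none; phase 2 enrolls only users earmarked as having
# seen the discount, so prior exposure is known by construction.
import hashlib

def assign(user_id: str, experiment: str) -> str:
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return "variant" if int(digest, 16) % 2 else "control"

# experiment -> {user_id: arm}; stands in for a persistent exposure store.
exposure_log = {}

def run_phase_1(user_id: str) -> int:
    arm = assign(user_id, "discount-phase-1")
    exposure_log.setdefault("discount-phase-1", {})[user_id] = arm
    return 10 if arm == "variant" else 0  # discount percent shown

def eligible_for_phase_2(user_id: str) -> bool:
    # Earmark: only users who actually saw the 10% discount move on.
    return exposure_log.get("discount-phase-1", {}).get(user_id) == "variant"

users = [f"user-{i}" for i in range(1000)]
for u in users:
    run_phase_1(u)
cohort = [u for u in users if eligible_for_phase_2(u)]
print(f"{len(cohort)} of {len(users)} users earmarked for phase 2")
```

The key design choice is that eligibility for the second test is derived from logged exposure in the first, so the analysis can always tie a phase-2 result back to a known phase-1 state.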
In this configuration, we use the greediness of the tests to get a much better idea of how our samples behave.
A note about hypotheses: for simplicity and continuity’s sake, I formulated the alternative hypothesis here, but, strictly speaking, when evaluating hypotheses we should be discussing the null hypothesis, i.e. assuming that the change will NOT have a significant effect.
Hope this’ll help you out when designing your A/B tests!