Let’s talk about the importance of statistics for analysts
Statistics (and math in general) shows up almost everywhere in an analyst’s work, so I thought I’d take the time to write a bit about stats and, mostly, about the weird stuff we do with it.
How normal is normal?
The first subject is one that is often overlooked – the assumption of normality.
Many of the statistical tests we learned about in different courses have an unbreakable bond with the assumption that the data we have is normally distributed.
But what if it’s not?
Well, I’ve seen a lot of cases that just “assume” normality where there is none, either by trying to normalize the data or by simply ignoring the question. This can have disastrous consequences for plenty of reasons, but I think the most basic one is poor sampling quality/methodology.
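If you want a quick sanity check before trusting the assumption, here is a minimal sketch using only Python’s standard library. The two data sets are simulated and hypothetical, and skewness is just one rough symptom of non-normality, not a formal test.

```python
import random
import statistics

def skewness(values):
    """Sample skewness: the mean cubed z-score (close to 0 for symmetric data)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical data: one roughly normal metric, one long-tailed metric.
normal_like = [rng.gauss(50, 10) for _ in range(10_000)]
long_tailed = [rng.expovariate(1 / 50) for _ in range(10_000)]

for name, data in [("normal_like", normal_like), ("long_tailed", long_tailed)]:
    print(
        f"{name}: mean={statistics.fmean(data):.1f} "
        f"median={statistics.median(data):.1f} skew={skewness(data):.2f}"
    )
```

A mean that sits far from the median, or a skewness far from zero, is a hint that the “normal” label deserves a second look.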
Assumption is the mother of all f***ups
And it’s no different here at all – if you assume something about your data, be prepared to find that everything in your analysis is fine except the one thing you assumed was correct (and therefore nothing is true).
Which leads us to sampling.
Let’s talk about sampling:
Why do we sample?
- Because we don’t want to perform an action on the entire database.
- Because it’ll be too expensive to perform an action on the entire database.
- Because performing an action on the entire database would defeat our original purpose (imagine that, in order to diagnose a person’s disease, we drew all of their blood instead of a sample).
- Because it’ll take too much time to perform an action on the entire database.
And there are many more reasons.
What, then, is the most important property for a sample to have?
It should represent the population it is drawn from as well as possible.
(Chock-full of rhetorical questions, aren’t I?)
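To make “representative” concrete, here is a small sketch that compares a convenience sample (the first rows of a table) with a simple random sample. The population is simulated and deliberately extreme: it is sorted by the very thing we measure, which is an assumption made purely for illustration.

```python
import random
import statistics

rng = random.Random(7)  # fixed seed so the run is repeatable

# Hypothetical population: 100,000 session lengths in seconds, stored in
# ascending order (as if the table were sorted by the measured column).
population = sorted(rng.expovariate(1 / 120) for _ in range(100_000))

n = 1_000
first_rows = population[:n]              # convenience sample: the first n rows
random_rows = rng.sample(population, n)  # simple random sample, no replacement

pop_mean = statistics.fmean(population)
print(f"population mean:  {pop_mean:.1f}")
print(f"first-rows mean:  {statistics.fmean(first_rows):.1f}")
print(f"random-rows mean: {statistics.fmean(random_rows):.1f}")
```

The first-rows sample only ever sees the shortest sessions, so its mean lands far below the population mean, while the random sample stays close to it.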
This image shows four kinds of distributions –
- Left Skewed
- Right Skewed
- Normal
- Uniform
Each can represent a different phenomenon: right skewed can stand for the time people spend on your page/site, left skewed can be the distribution of prices for an electronics shop, normal can be a satisfaction survey, etc.
If we want to sample these populations, we can’t use the same sampling approach for the left- and right-skewed distributions, right?
The uniform distribution is spread evenly across all values, while the gamma distribution is very high on the low end and very, very low for high values (the so-called long right tail).
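To see that difference in numbers rather than in a picture, here is a sketch that simulates both shapes with the standard library. The parameters (a 0 to 10 range, shape 1.5 and scale 2.0) are arbitrary choices of mine, not values from any real data set.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the run is repeatable

# Hypothetical parameters, chosen only to make the two shapes visible.
uniform = [rng.uniform(0, 10) for _ in range(20_000)]
gamma = [rng.gammavariate(1.5, 2.0) for _ in range(20_000)]  # shape, scale

def summarize(values):
    """Return (mean, median, 99th percentile) of the values."""
    ordered = sorted(values)
    p99 = ordered[int(0.99 * len(ordered))]
    return statistics.fmean(values), statistics.median(values), p99

for name, data in [("uniform", uniform), ("gamma", gamma)]:
    mean, median, p99 = summarize(data)
    print(f"{name}: mean={mean:.2f} median={median:.2f} p99={p99:.2f}")
```

For the uniform data the mean and median nearly coincide. For the gamma data the long right tail drags the mean above the median, and the 99th percentile sits several times higher than the typical value.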
Now, if we want to run some tests (namely A/B tests), we’ll have to run them on a sample (if the change turns out to be catastrophic, better that we don’t lose our pants).
This is exactly where quality sampling, which comes from knowing how the data is distributed, comes into play.
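One possible way to act on that knowledge (my suggestion here, not the only option) is stratified sampling: sort the values, cut them into equal-sized strata, and draw the same number of items from each, so the long tail is guaranteed a seat at the table. The function name, the number of strata, and the simulated population below are all assumptions for illustration.

```python
import random
import statistics

def stratified_sample(values, n_strata, per_stratum, rng):
    """Sort values, cut them into equal-sized strata, and draw from each one."""
    ordered = sorted(values)
    size = len(ordered) // n_strata
    sample = []
    for i in range(n_strata):
        # The last stratum absorbs any remainder.
        end = (i + 1) * size if i < n_strata - 1 else len(ordered)
        stratum = ordered[i * size:end]
        sample.extend(rng.sample(stratum, per_stratum))
    return sample

rng = random.Random(3)  # fixed seed so the run is repeatable

# Hypothetical long-tailed population (gamma with shape 1.5, scale 2.0).
population = [rng.gammavariate(1.5, 2.0) for _ in range(50_000)]

sample = stratified_sample(population, n_strata=10, per_stratum=20, rng=rng)
print(f"population mean: {statistics.fmean(population):.2f}")
print(f"sample mean:     {statistics.fmean(sample):.2f}")
print(f"sample max:      {max(sample):.2f}")
```

Because every decile contributes the same number of items, a 200-item sample always contains values from the top 10% of the population, which a small purely random draw might under-represent by chance.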
Now, to another burning issue with our understanding of statistics.
The Problem with Percentages
“Lies, damned lies, and statistics.”
Most people who use this quote either understand statistics very well or not at all – those in the middle almost never use it (which is a good thing, it’s annoying).
Lately, since I’ve been answering some questions from readers (so yeah, feel free to ask some more here), it looks like I’ve started reaching more and more people – and look at that growth rate! I’m reaching 3000% more people than I did 3 years ago! Hooray! I’m on top of the world and all that.
Well, we can pretty easily see the problem with this unyielding enthusiasm, and maybe stop for a moment and say that the absolute figures are somewhat short of spectacular.
That is, the thing about presenting a ratio is that it presents only that ratio. It carries no nuance and says nothing about the absolute numbers behind it, and I want to encourage you to take this into account any time you see a ratio.
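Here is the arithmetic as a tiny sketch. The reader counts are made up for illustration; only the 3000% figure comes from the paragraph above.

```python
def pct_growth(old, new):
    """Relative growth from old to new, in percent."""
    return (new - old) / old * 100

# Hypothetical reader counts: same ratio, very different scale.
small_blog = (2, 62)
big_site = (200_000, 6_200_000)

for old, new in (small_blog, big_site):
    print(f"{old} -> {new}: +{pct_growth(old, new):.0f}% ({new - old:+} readers)")
```

Both rows report the same 3000% growth, but one of them is 60 extra readers and the other is 6 million. The ratio alone cannot tell you which one you are looking at.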
So, fine, great, you can’t use a ratio to show progress. What’s next?
Well, this is where context comes in – you can use a ratio, but you have to be very mindful of what that ratio actually represents. You have to know how to keep the ratio in proportion (excuse the pun).
That is, we can look at the conversion rate between purchases and purchase sum, but only within the 4 sigmas around the average (assuming a normal distribution*). Or we can take only the products that have at least p purchases, so we know our decisions have some scale behind them.
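Both guards are easy to sketch. Everything below is hypothetical: the function names, the product catalogue, the purchase sums, and the thresholds (k=2 sigmas on each side of the mean, p=50 purchases) are illustrative choices rather than recommendations.

```python
import statistics

def within_k_sigmas(values, k):
    """Keep values no more than k standard deviations from the mean."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) <= k * sd]

def rates_with_min_purchases(products, p):
    """Conversion rate per product, only for products with at least p purchases."""
    return {
        name: purchases / views
        for name, (views, purchases) in products.items()
        if purchases >= p
    }

# Hypothetical catalogue: product -> (page views, purchases).
products = {
    "cable": (4_000, 200),
    "laptop": (10_000, 150),
    "novelty_usb_fan": (3, 2),  # a 67% "conversion rate" on three views
}
rates = rates_with_min_purchases(products, p=50)
print(rates)

# Hypothetical purchase sums with one extreme order.
sums = [20, 22, 25, 19, 21, 23, 24, 20, 22, 5_000]
print(within_k_sigmas(sums, k=2))
```

The minimum-purchases filter drops the product whose impressive ratio rests on three views, and the sigma filter drops the single order that would otherwise dominate the average. Keep in mind that the mean and standard deviation are themselves pulled around by extreme values, so this kind of filter is a rough guard at best.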
The main takeaway here is that data, in itself, is meaningless. It’s numbers in the air (Fugayzi, fugazi and all that jazz). Every piece of information you work with only has meaning if you can create that meaning around it and attach that meaning to your domain.
Just a reminder that I still take questions in PMs, in public and through carrier pigeons (though they’re somewhat less reliable).
* Someone please remind me to discuss that assumption sometime.