This content was originally published in the monthly newsletter of our Product Director, Sandeep Shah. If you would like to receive regular content written by Sandeep, you can email us to subscribe.
What are Bayesian and Frequentist statistics?
In a nutshell, it’s the difference between a percentage chance and a flag. (God, I wish people would condense the hours of articles I’ve read into that.)
Bayesian stats operate off a model of “Chance to beat control” – a percentage-based output where 50% is even (no detectable change), 0% is strongly bad and 100% is strongly good.
Frequentist stats operate off thresholds. You set your tolerances as part of the calculation, and it tells you whether you have or haven’t met the threshold. A simple yes or no.
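To make the contrast concrete, here’s a minimal Python sketch of both readings of the same made-up test data. It assumes a flat Beta(1, 1) prior for the Bayesian “chance to beat control” and a two-tailed two-proportion z-test for the frequentist flag; the visitor and conversion numbers are invented purely for illustration, and this is not the exact calculation our reporting runs.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(42)

# Hypothetical traffic numbers, purely for illustration.
control_visitors, control_conversions = 10_000, 500
variant_visitors, variant_conversions = 10_000, 545

# --- Bayesian view: "chance to beat control" as a percentage ---
# Beta(1, 1) priors updated with observed conversions / non-conversions,
# then compared by sampling from both posteriors.
control_post = rng.beta(1 + control_conversions,
                        1 + control_visitors - control_conversions,
                        size=200_000)
variant_post = rng.beta(1 + variant_conversions,
                        1 + variant_visitors - variant_conversions,
                        size=200_000)
chance_to_beat_control = (variant_post > control_post).mean()
print(f"Chance to beat control: {chance_to_beat_control:.1%}")

# --- Frequentist view: a yes/no flag against a pre-set threshold ---
# Two-tailed two-proportion z-test; the 95% confidence threshold is
# chosen up front, and the output is simply "met it" or "didn't".
p_pooled = (control_conversions + variant_conversions) / (control_visitors + variant_visitors)
std_err = (p_pooled * (1 - p_pooled) * (1 / control_visitors + 1 / variant_visitors)) ** 0.5
z = (variant_conversions / variant_visitors - control_conversions / control_visitors) / std_err
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
met_threshold = p_value < 0.05  # tolerance decided before looking at the results
print(f"p-value: {p_value:.3f} -> significant at 95%? {met_threshold}")
```

The first output moves smoothly between 0% and 100%; the second is just a flag that flips once the threshold you set in advance has been met.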
So often, we’re told to trust only Bayesian stats, and that anything else is counter-intuitive. And that’s what this train of thought considers.
How does your mind work?
The one thing everyone at Webtrends Optimize stands firm on is that numbers are just numbers. They don’t hold any inherent magical powers. We as analysts choose to misinterpret them, abuse them, or learn correctly from them.
Bias is an awful thing in experimentation. People so often don’t pre-determine their tolerance for risk or need for certainty. So when they see results, they decide whether to accept or reject them – not before.
A common conversation I walk people through goes like this:
Q. What level of significance do you need to see to accept your test results?
A. 95%
Q. Why that?
A. We stick with industry best practices.
Q. Ok. So would you reject it at 94.9%?
A. Of course not, that’s 95% more or less.
Q. How about 94.8%? How far away is too far away?
This is a difficult question to answer – and honestly, people just make up the rules as they go along.
This is clearly biased reporting, and a terrible way to operate.
Meeting a threshold
The truth is, we’re all trying to meet a threshold. Frequentist stats get a bad rap, but meeting a number or not is all they really do – and as an analyst, a yes or no is all you’re really looking for.
Perhaps for one test you really care and need certainty, while for another a bit of direction is fine, but the net result you’re looking for is “it worked” or “it didn’t”. “I should respond to this” or “I shouldn’t”.
If that’s the case, why give users the freedom, at the point of reporting, to decide whether or not to conclude a test? That degree of freedom takes away from the science of the whole exercise, which is in fact what everyone wants more of.
We let you decide
The cool thing I see in our reporting is that we let you decide what number to use. Education is important – understand the meaning behind what you’re looking at – but forcing you to adopt one approach or another makes no sense to me.
Different strokes for different folks, as the saying goes.
People don’t always know what they’re talking about
Storytime.
I’ve sat in the audience at two or three conferences. Well known, hundreds of people attending, you know the type. And I’ve been told that the way to avoid content flickering is to make sure you minify your code and compress your images.
It’s not.
But when one person feeds someone else lies, and they share that new-found “knowledge” with two others, and so on, we end up sitting in conference audiences being told to do things that just aren’t right. After five years of constantly reminding people that this is nonsense, it’s only now that I feel people are coming round to it – and even then, usually only after getting me on a call to spell it out with examples.
The point is – don’t expect “experts” to be right. Doing something for 10 years doesn’t mean you’ve been doing it right.
Instead, the best advice I can give you is to read and learn. Decide what works for you, make sure it’s justified, and go with that.
Herd mentality can be an awful thing.