
Are You Making These Two Mistakes in Your A/B Testing?

A few years back, I was working with a startup that builds a platform for travelers to book hotels and create personalised itineraries. The idea was interesting enough to tempt any engineer to work with them, and, as you might already guess, I ended up with a consulting contract for their user experience improvements. Aside from that, I helped them implement part of their JavaScript application with Angular 1.x.

As part of my usual workflow, I went through their existing processes and checked their past results. This gave me a solid basis to plan and implement my ideas around their business model and their customers' pain points.

Digging through their history, I found that they had created numerous hypotheses and spent months testing them. After a few hours going through their test results and their Google Analytics data, I noticed they had made two major mistakes in their testing, which I will share with you here in detail. These issues are common, and your company may well be making some of them too.

The first mistake: they tested cases that already have solutions, in the same context. It sounds simple, but this is a common pitfall that leads early-stage UX teams to waste a huge amount of time unnecessarily. In fact, there are smart people out there who have already worked out clean solutions to most of these problems. Use them.

Instead of spending months at a time testing an idea only to reach the same conclusion in the end, the team could have spent that time on other improvements and, most probably, on other features.

Here is a simple scenario. You run an eCommerce site in the retail sector. You sell a type of product that depends heavily on images, and your customers need to see product photos to be motivated to click through to your product pages. Let us assume you need to decide whether to use a list view or a grid view for your search results page.

My advice is to research your case before implementing the variants and running an A/B test for a few months. Do bear in mind the time it takes for design and development, as well as the cost of the testing tool.

This process takes even longer on lower-traffic websites. As you can see, this is an expensive mistake, and an unnecessary one.
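To get a feel for the cost, here is a minimal sketch of the arithmetic, using the standard two-proportion sample-size formula. The baseline click-through rate, the hoped-for lift, and the daily traffic below are made-up numbers for illustration, not the startup's actual figures.

```python
# Rough estimate of how long a grid-vs-list A/B test would run.
# All input numbers are assumptions for illustration only.
from math import sqrt
from scipy.stats import norm

def sample_size_per_variant(p1, p2, alpha=0.05, power=0.8):
    """Standard two-proportion sample-size formula (normal approximation)."""
    z_alpha = norm.ppf(1 - alpha / 2)   # two-sided significance threshold
    z_beta = norm.ppf(power)            # power requirement
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return numerator / (p1 - p2) ** 2

baseline = 0.03        # assumed 3% click-through from search results to a product page
target = 0.036         # hoping the grid view lifts it to 3.6% (a 20% relative lift)
daily_sessions = 400   # assumed traffic, split evenly between the two variants

n = sample_size_per_variant(baseline, target)
days = 2 * n / daily_sessions
print(f"~{n:,.0f} sessions per variant, roughly {days:.0f} days of testing")
```

With these assumed numbers you would need roughly 14,000 sessions per variant, which is more than two months of traffic, before you could call a 20% relative lift at the usual 95% confidence and 80% power. Halve the traffic and the test runs twice as long.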

Here are some good sources of inspiration for finding out about similar cases.


The second mistake: validating tests with extremely small sample sizes. Sometimes it is okay to use a small sample for cases where the outcome is obvious; however, that is not sufficient data for every case. This was the second major issue I noticed with their analytics team: they validated ideas with extremely small sample sizes.

Data needs to be sufficient in order to be valuable and to be treated as a valid source of information. It is crucial to keep the sample size big enough when taking decisions that affect the business and its users.

I’ve seen many teams get this wrong. They test a hypothesis and, with a total of 50 sessions, reach a conclusion. That is most probably useless.
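To see why, here is a minimal sketch, assuming a made-up outcome of 5 conversions in 50 sessions and using the crude normal-approximation confidence interval (a Wilson or exact interval would be more appropriate at this size, but the conclusion is the same).

```python
# Why ~50 sessions is rarely enough: the uncertainty around the observed
# rate stays huge. The 5-out-of-50 outcome is an invented example.
from math import sqrt
from scipy.stats import norm

conversions, sessions = 5, 50
p_hat = conversions / sessions
z = norm.ppf(0.975)  # 95% two-sided
margin = z * sqrt(p_hat * (1 - p_hat) / sessions)
print(f"observed {p_hat:.0%}, 95% CI roughly {p_hat - margin:.1%} to {p_hat + margin:.1%}")
# -> observed 10%, 95% CI roughly 1.7% to 18.3%
```

An observed rate of 10% could plausibly be anywhere from about 2% to 18%, far too wide a range to pick a winner between two variants.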

Here is a good explanation from Michael Zuschlag (the sketch after the quote double-checks his numbers):

“There is no contradiction between being concerned with statistical significance and conducting usability tests with 3 to 5 users. Technically, “statistical significance” means the results you’re seeing cannot be plausibly attributed to chance. In scientific research, where the cost of reporting spurious results is high, “plausible” is generally defined as a 0.05 probability or higher. There are several issues when applying this to a usability test of as few as three people.

First of all, the significance level of your results depends not only on the sample size but on the magnitude of the observed effect (i.e., how different it is from your null hypothesis). You can have significance with small sample sizes if the magnitude is big enough. So in the case of a usability test, what’s the magnitude? What are you comparing your effect to?

If you run the binomial calculations, it turns out that if 3 out of 3 of your users have a serious problem with your product, then at a significance level of 0.05, at least 36% of the population will also have the same serious problem with your product (one-tailed test). I don’t know about you, but 36% is an awfully big proportion of your users to frustrate, and of course, it could easily be much more. It’s clearly a serious usability problem. What Krug apparently fails to realize is that if you have an issue that “will cause everyone to fail,” then the results from a sample of 3 or so people will be statistically significant for a pragmatic null hypothesis.

Or take the usability testing rule of thumb to have about 5 users per usability test. If a problem affects 30% or more of your users, you have an over 0.83 probability of observing it in one or more users with a sample size of 5. On the other hand, if a problem affects 2% or fewer of your users, then you have less than 0.096 probability of observing it in 1 or more users. So by testing 5 users and attending to anything seen in one or more users, you have an excellent chance of catching the most common problems and little chance of wasting time on problems affecting a tiny minority.

So far from ignoring statistical significance, drawing conclusions from usability tests on 3, 4, or 5 users is actually perfectly consistent with the laws of probability. This is why, empirically, it has worked so well. Additionally, statistical significance only relates to quantitative results. Usability tests also typically include qualitative results which can boost your confidence in the conclusions. You find out not only how many have a problem but, through your observations and debriefing questions, you uncover why. If the apparent reason why is something that is reasonably likely to be relevant to a lot of your users, then you should have more confidence in your results.

That said, there’s a caveat to testing with such small sample sizes that gets back to the issue of the magnitude of the effect: small-sample usability tests are only good for finding big obvious problems – ones that affect a large proportion of users. However, sometimes you have to worry about problems affecting a small proportion. To take an extreme case, if the problem only occurs for 2% of your users but it ends up killing those 2%, then obviously you want to know about it, and obviously a sample size of 5 is not going to cut it.

Similarly, when comparing results of two designs or problems, you can’t confidently state that one is better than the other with a small sample size unless one completely blows the other out of the water. When you need to know at a 0.05 level of significance which problem is greater or which design is performing better, larger sample sizes are called for. As a quick and dirty (and conservative) estimate of the sample size you need, take the precision you want, invert it, and square it. For example, if you want to know the percent of users able to complete a task to within 5%, then you need as many as (1/0.05)² = 400 users!

On the other hand, who says you need significance at the 0.05 level? For the business, what are the consequences of choosing one design to build or one problem to solve over the other? In many situations wouldn’t we be satisfied with a 0.10 probability of pursuing a spurious result? Or even 0.20? The cost of missing a good design or top-priority problem may be much more than the cost of erroneously pursuing something when it doesn’t make any difference. For any given sample size, the larger the real difference in magnitude, the smaller the chance you’re wrong, so if you are wrong in choosing one thing over the other at a 0.20 level of significance, you’re unlikely to be terribly wrong – you’re unlikely to have been much better off going with the other option.

Take another extreme case: You test two icons for something on three users. Two users do well with Icon A while only one does well with Icon B. For a null hypothesis of equal performance of the icons, the two-tailed significance level is 1.0 – it can’t get any more insignificant. But which icon do you choose? One icon doesn’t cost any more to use than the other one and you have to choose one. So of course you choose Icon A. Obviously, you should have low confidence in your choice. Obviously it’s reasonably plausible that the icons could perform equally well in the real world. There’s even a reasonable probability that B is actually better than A. But in the absence of any other data, Icon A is obviously your best bet. In the presence of other data the level of significance does matter – you want to know how much confidence to place in each piece of information you have. However, the point is you don’t always have to be 95% confident about the information for it to be worth considering.

For a more sophisticated analysis, see Lewis, J. R., (2006). Sample sizes for usability tests: Mostly math, not magic. Interactions, 13(6), p29–33.”
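Out of curiosity, here is a small sketch that reproduces the key numbers in the quote using the same binomial reasoning (my own re-derivation, not Zuschlag's original calculation).

```python
# Reproduce the figures quoted above with basic binomial arithmetic.
from scipy.stats import binom

# 3 out of 3 users hit the problem: the smallest true proportion p that
# cannot be rejected at the 0.05 level (one-tailed) satisfies p**3 = 0.05.
p_lower = 0.05 ** (1 / 3)
print(f"3/3 failures: at least {p_lower:.0%} of users affected (alpha = 0.05)")  # ~37%

# With 5 users, probability of seeing a problem in at least one of them.
for rate in (0.30, 0.02):
    p_seen = 1 - binom.pmf(0, 5, rate)  # 1 - P(nobody hits it)
    print(f"problem affecting {rate:.0%} of users: seen at least once with p = {p_seen:.3f}")
# -> 0.832 for the 30% problem, 0.096 for the 2% problem

# Quick-and-dirty sample-size rule from the quote: (1 / precision)**2.
precision = 0.05
print(f"completion rate to within ±{precision:.0%}: about {(1 / precision) ** 2:.0f} users")
```

The output lines up with the figures quoted above: roughly 37% (he states "at least 36%"), 0.83 and 0.096 for the five-user test, and 400 users for ±5% precision.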

If you have any similar experiences, please feel free to share them!

Thanks for reading!

Reference:

https://ux.stackexchange.com/questions/4993/when-does-statistical-significance-matter