Danial Kalbasi

Notes on engineering leadership, building products, and figuring out what matters.

Are you making these two mistakes in your A/B testing?

A few years ago, I was working with a startup that built a platform for travelers to book hotels and create personalized itineraries. The idea was intriguing enough to tempt any engineer to work with them. As you might guess, I ended up with a consulting contract to design their user experience and improve their existing system.

As part of my usual workflow, I reviewed their existing processes and examined their past results. This gave me solid ground to plan and implement my ideas around their business model and their customers' pain points.

Digging through that history, I found that they had created numerous hypotheses and spent months testing them. After a few hours reviewing their test results and Google Analytics data, I noticed two major mistakes in their testing, which I outline in detail below. These issues are common, and your company may be making them too.

Mistake #1: Testing cases that already have solutions in the same context

It sounds simple, but this is a common pitfall that costs early-stage UX teams a considerable amount of time. Many UX issues already have well-researched solutions. Study the existing solutions, check whether they fit your context, and if they do, use them. Don't waste your time reinventing the wheel.

Here's a simple example. You run an eCommerce site selling furniture. Your customers need to see product photos before they'll click through to buy anything. You're trying to decide: should your search results show products in a list view or grid view?

The wrong approach: Spend 3 months running A/B tests, plus weeks of design and development time, plus the cost of testing tools.

The right approach: Search "eCommerce list vs grid view" first. You'll find dozens of studies showing that grid views consistently outperform list views for visual products. For furniture, this is a solved problem.

Why waste months testing something that's already been proven? Use that time to work on problems that actually need solving.

There are plenty of good sources of inspiration out there for finding similar cases.


Mistake #2: Validating tests with small sample sizes

Sometimes it's okay to use small sample sizes for obvious cases when validating ideas. However, a small sample isn't sufficient evidence for every kind of question. This was the second major issue I noticed with their analytics team: they were validating ideas with small sample sizes when they shouldn't have been.

Data must be sufficient to be valuable and to count as a reliable source of information. It's crucial to use a sample size large enough to support decisions that affect the business and its users.

I've seen many teams get this wrong. They test a hypothesis with a total of 50 sessions and then draw conclusions. That's almost certainly useless.
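To make that concrete, here's a quick back-of-the-envelope check. It's only a sketch using the normal approximation for a proportion, and the 10% conversion rate is a made-up number for illustration, not the startup's actual data:

```python
import math

def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for an observed rate p over n sessions
    (normal approximation for a proportion)."""
    return z * math.sqrt(p * (1 - p) / n)

# With 50 sessions and an observed 10% conversion rate, the uncertainty is
# roughly +/- 8 percentage points: the true rate could plausibly be anywhere
# from about 2% to 18%, which swamps any realistic difference between variants.
print(f"+/- {margin_of_error(0.10, 50):.1%}")  # ~ +/- 8.3%
```

With error bars that wide, "variant B won by 3%" simply isn't a conclusion you can draw.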

But here's where it gets interesting - small sample sizes aren't always wrong. For usability testing, there's actually solid reasoning behind using just 3 to 5 users, but only for specific situations.

When you're looking for big, obvious usability problems, small samples work well. If 3 out of 5 users can't figure out how to complete your checkout process, that's 60% of your test group failing. You don't need to test 100 more people to know there's a serious problem. The pattern is clear enough.

The general rule for usability testing is about five users per test. This works because if a problem affects a large portion of your users (say 30% or more), you'll very likely spot it with 5 people. On the flip side, if a problem only affects 2% of users, you'll probably miss it completely with such a small sample.
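If you want to see where that rule of thumb comes from, the arithmetic is simple: when a problem affects a proportion p of your users, the chance that at least one of n test users runs into it is 1 - (1 - p)^n. Here's a minimal sketch, using the same example percentages as above:

```python
def detection_probability(p: float, n: int) -> float:
    """Chance that at least one of n test users hits a problem
    that affects a proportion p of all users."""
    return 1 - (1 - p) ** n

# Five users catch a widespread problem most of the time...
print(f"{detection_probability(0.30, 5):.0%}")  # ~83% for a problem affecting 30% of users
# ...but will usually miss a rare one.
print(f"{detection_probability(0.02, 5):.0%}")  # ~10% for a problem affecting 2% of users
```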

So small sample usability tests are good for finding big, common problems that affect lots of users. But here's the catch - they're not good for everything else.

When you're comparing two designs or trying to figure out which one performs better, you can't confidently say one beats the other with a small sample unless one completely destroys the other. And if you need precise measurements - like knowing task completion rates within 5% accuracy - you might need hundreds of users.
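Here's a rough sense of what "hundreds of users" means. This is only a sketch based on the standard normal-approximation formula for estimating a proportion at 95% confidence, not a substitute for a proper power analysis:

```python
import math

def users_needed(margin: float, p: float = 0.5, z: float = 1.96) -> int:
    """Approximate number of users needed to estimate a task completion rate
    within +/- margin at ~95% confidence (z = 1.96). p = 0.5 is the worst case."""
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)

print(users_needed(0.05))  # ~385 users for +/- 5% accuracy
print(users_needed(0.10))  # ~97 users for +/- 10% accuracy
```

Halving the margin of error roughly quadruples the sample you need, which is why precision gets expensive fast.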

The bottom line: match your sample size to what you're actually trying to learn. Use small samples to find obvious problems. Use bigger samples when you need to measure small differences or make critical business decisions.

For a more sophisticated analysis, see Lewis, J. R. (2006). Sample sizes for usability tests: Mostly math, not magic. Interactions, 13(6), 29–33.

If you have any similar experiences, please feel free to share them!

Thanks for reading!
