
A/B Testing

What is A/B Testing?

Previously, we learned about different types of requests for analytics. In this lesson, we'll learn about a powerful analytics tool: the A/B Test.

A/B Testing is a type of experiment for de-risking a choice between two options, such as a change to a website, the addition of a new feature, or the wording of an email subject line.

Case: article headlines

Suppose we want to pick the most exciting title for a blog post using an A/B Test. We randomly divide our audience into two groups. Each group sees a different title. Eventually, we'll pick the better title and make it permanent.
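Random assignment like this is often implemented by hashing a stable user identifier, so each visitor lands in the same group on every visit without storing any extra state. A minimal sketch in Python (the function and experiment name are illustrative, not part of any particular tool):

```python
import hashlib

def assign_group(user_id: str, experiment: str = "headline-test") -> str:
    """Deterministically assign a user to group A or B.

    Hashing the user id together with an experiment name gives each
    user a stable, roughly 50/50 assignment across visits.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 2
    return "A" if bucket == 0 else "B"

# The same user always lands in the same group:
print(assign_group("user-42"))
```

Because the assignment is a pure function of the user id, no database lookup is needed to decide which title a returning visitor should see.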

A/B Testing Steps

There are four steps to running an A/B test: picking a metric to track, calculating sample size, running the experiment, and checking for significance.

Let's examine each step for our headline example.

Pick a metric to track

First, we pick a measurable metric to track. In this case, we'll examine the percent of people who click on a link with the title of the article.

Calculate the sample size

Next, we'll decide how long to run the experiment. We'll run the experiment until we reach a sample size large enough to be confident that any difference we observe is not due to random chance. The necessary sample size depends on a "baseline metric". In this case, our baseline metric is how often people generally click on a link to one of the articles. If this rate is close to 50%, we'll need a much smaller sample size. If the rate is much larger or much smaller, which is typical for something like clicking a link, then we'll need a larger sample size.

The sample size also depends on how sensitive we need our test to be. A test's sensitivity tells us how small of a change in our metric we are able to detect. Larger sample sizes allow us to detect smaller changes.

You might think that we always want the highest possible sensitivity, but we actually want to optimize for an amount of sensitivity that is meaningful for our business problem. For example, if the first title is clicked on by 5 percent of viewers and the second title is clicked on by 5.01 percent of viewers, we don't actually care about that difference; it's too small to affect our profits. Generally, we care about a relative increase of between 10 percent and 20 percent of the baseline metric.
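Putting the baseline rate and the minimum detectable relative lift together, a rough per-group sample size can be computed with the standard normal-approximation formula for comparing two proportions. A sketch using only the Python standard library (the function name and defaults are illustrative):

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_per_group(baseline: float, relative_lift: float,
                          alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate per-group sample size for a two-proportion test.

    `baseline` is the current click-through rate; `relative_lift` is the
    smallest relative change worth detecting (e.g. 0.10 for +10%).
    """
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_power = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_power * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)

# A 5% baseline rate needs far more samples than a 50% one
# to detect the same 10% relative lift:
print(sample_size_per_group(0.05, 0.10))
print(sample_size_per_group(0.50, 0.10))
```

Running this shows the point from the lesson concretely: the lower the baseline click-through rate, the larger the sample needed to detect the same relative change.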

Run your experiment

We now run our experiment until we reach the calculated sample size. Stopping the experiment early (for example, peeking at the results and quitting as soon as they look good) inflates the chance of a false positive, while running it longer than planned can also bias our conclusions.

Check for significance

Once we've reached the target sample size, we check our metric. We'll likely see some difference between the two titles, but how do we know whether that difference is meaningful? We check by performing a test of statistical significance. If the results are significant, we can be reasonably sure that the difference observed is not due to random chance, but to an actual difference in preference.
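One common choice of significance test here is a two-proportion z-test, which compares the two click-through rates under the null hypothesis that both titles perform equally. A minimal standard-library sketch (the counts below are made up for illustration):

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(clicks_a: int, views_a: int,
                          clicks_b: int, views_b: int) -> float:
    """Return the two-sided p-value for a difference in click-through rates."""
    p_a = clicks_a / views_a
    p_b = clicks_b / views_b
    # Pooled rate under the null hypothesis of no difference
    p_pool = (clicks_a + clicks_b) / (views_a + views_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / views_a + 1 / views_b))
    z = (p_a - p_b) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

p = two_proportion_z_test(clicks_a=120, views_a=2000,
                          clicks_b=165, views_b=2000)
print(f"p-value: {p:.4f}")
if p < 0.05:
    print("Significant: the titles likely differ in click-through rate.")
else:
    print("Not significant: the difference may be random chance.")
```

A p-value below our chosen threshold (commonly 0.05) means the observed difference would be unlikely if the two titles actually performed the same.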

What if the results aren't significant?

If there are any differences in click-through rates between the two titles, they're smaller than the threshold we chose when determining the sensitivity. Running our test longer won't help. It will let us detect a smaller difference, but we already decided that those differences are irrelevant to our business problem!

It's important to remember that there still might be a difference in click-through rates between the two article titles, but that difference is too small to matter for our business problem.