# Introduction

There is no online marketing campaign without A/B testing. Marketers use treatment and control groups to test the effect of a new design or feature on the conversion rate. A well-thought-out test design is essential to yield reliable and meaningful results. As a data scientist, you need to think about a proper test design, including the number of test groups, the size of each group, and other factors such as randomization and influences on the test. In this post, I describe the most important determinants to consider from a statistical perspective.

# Control Groups and their Determinants

Determining the size of a control group sounds like an easy thing to do, but it depends on many factors. Some of them are statistical in nature; others are constraints imposed by marketing managers, who want control groups as small as possible to reduce the risk of decreased conversion. A one-size-fits-all control group is therefore inappropriate. To understand the determinants of the group size, let us revisit some fundamentals of statistical testing. Assume we have two groups of website visitors: one group saw the old website, the second group was redirected to the new website. In each group you observe a number of visitors buying products; the number of purchases divided by the number of visitors yields the conversion rate. For a test, we set as our null hypothesis H0 the equivalence of the conversion rates between our new and old website design. Now we consult the data and see if it can convince us of the opposite. So we test

H0: p_old = p_new against H1: p_old ≠ p_new,

where p_old and p_new denote the conversion rates of the two groups. Four outcomes are possible:

| Test decides in favor of | H0 is true | H1 is true |
| --- | --- | --- |
| H0 | Correct decision. Chances are 1 − α (α being your significance level) | Type 2 error. Chances are β = 1 − power |
| H1 | Type 1 error. Chances are α | Correct decision. Chances are the power, 1 − β |

The significance level α is chosen by you, meaning you set a probability that is acceptable for you as a data scientist to be wrong, i.e. to reject H0 although it is true. In science, 5% or 1% are common values for α; a low value reduces the Type 1 error probability. The opposite mistake, keeping H0 and concluding that our new website design does not help our conversion whereas in fact it does, is called the Type 2 error. We want to avoid this error by choosing a high **power**. The **power** in statistical terms is the probability that H0 is rejected by the data when H0 really is false (H1 is correct and we decide in favor of it). Alternatively, it can be defined as the probability of avoiding a Type 2 error.

These terms are illustrated by Figure 1 below. The power is the area under the blue curve to the right of the black dotted line. The bigger this area, the higher the statistical power; its determinants are described below.

As seen in Figure 1 above, the power is mainly influenced by four factors:

- The difference in effects. If the conversion rate of your treated group differs substantially from that of the control group, your power increases.
- The standard deviation of your effect. If you want to test the effect of a new design on the amount of money spent, you can calculate a deviation around some average. The higher this deviation, the lower the power, because deviation, or variance, means uncertainty about an average behaviour, and it is exactly this average behaviour of the two groups you are interested in.
- The significance level. This is illustrated in the lower right part of Figure 1: the black line is moved slightly to the left to demonstrate a significance level of 10%. The level is chosen by you and is mostly equal to or below 5%. It stands for the level of uncertainty you can accept regarding your conclusion about the null hypothesis: the lower this value, the more convincing the data needs to be before you reject H0. A higher significance level increases the power of your test, at the price of more Type 1 errors.
- Finally, the sample size. The more observations you have, the more certain you are about the results. But as noted above, you want to minimize the loss from suboptimal website designs or features.
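The interplay of these four factors can be sketched in a few lines of pure Python. The snippet below approximates the power of a two-sided two-proportion z-test under the usual normal approximation; the function names and example numbers are my own illustration, not from the original post.

```python
import math

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def power_two_proportions(p1, p2, n, alpha_z=1.96):
    """Approximate power of a two-sided two-proportion z-test
    with n visitors per group (normal approximation)."""
    # Standard error of the difference in conversion rates
    se = math.sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / n)
    z_effect = abs(p1 - p2) / se
    # Probability that the observed difference clears the critical value
    return normal_cdf(z_effect - alpha_z)

# A larger effect or a larger sample both raise the power:
print(power_two_proportions(0.05, 0.06, n=5000))
print(power_two_proportions(0.05, 0.07, n=5000))   # bigger effect
print(power_two_proportions(0.05, 0.06, n=20000))  # bigger sample
```

Playing with the arguments reproduces the list above: increasing the effect size, the sample size, or the significance level (a smaller `alpha_z`) all increase the returned power.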

The formula to calculate the control group size depending on the key factors above is Cochran's sample size formula (1):

(1)   n = Z² · p(1 − p) / d²

where:

- Z is the standard normal quantile for your significance level; Z = 1.96 embodies an alpha of 5% (two-sided).
- d is the minimum difference we want to be able to detect between our two groups.
- p is the conversion rate we expect, meaning the conversion you normally see in your campaigns without any A/B testing.

Using this formula, you obtain the smallest sample size that still guarantees the desired certainty (accuracy) of detecting a difference of at least d.
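As a minimal sketch, formula (1) translates directly into code; the function name and the example values are illustrative assumptions of mine:

```python
import math

def control_group_size(p, d, z=1.96):
    """Minimum group size via Cochran's formula n = z^2 * p * (1 - p) / d^2.

    p: expected baseline conversion rate (e.g. 0.05 for 5%)
    d: minimum detectable difference between the groups
    z: standard normal quantile; 1.96 corresponds to alpha = 5% (two-sided)
    """
    return math.ceil(z ** 2 * p * (1 - p) / d ** 2)

# Example: 5% baseline conversion, detect a 1 percentage point change
print(control_group_size(p=0.05, d=0.01))  # → 1825
```

Note how a smaller minimum detectable difference d blows up the required size quadratically, which is exactly the trade-off marketing managers face when they push for small control groups.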

# Testing Your Results

Depending on the metric you try to optimize (conversion rate, average money spent, buy/quit decision, etc.), you need a suitable test statistic to check for a significant difference in your outcomes. This is why it is important to plan and determine the sample size beforehand. The Wikipedia page on A/B testing sums up some common metrics you may want to optimize and the applicable tests; see the table below.

| Distribution | Example | Standard Test | Alternative Test |
| --- | --- | --- | --- |
| Gaussian | Continuous variables like the average revenue per user, the duration stayed on your page, etc. | Welch's t-test | Student's t-test |
| Binomial | Every metric with two states (did the customer buy, click, etc.) | Fisher's exact test | Barnard's test |
| Multinomial | Measurements of customer choices (multiple mutually exclusive options), e.g. the number of people using distribution channel A, B or C, or the delivery method chosen | Chi-squared test | — |
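As a sketch of the binomial/multinomial row, a Pearson chi-squared test for a 2×2 conversion table can be computed by hand with nothing but the standard library (for small counts, Fisher's exact test from the table above is the better choice; the function name and the example numbers are illustrative):

```python
import math

def chi2_test_2x2(conv_a, n_a, conv_b, n_b):
    """Pearson chi-squared test (1 degree of freedom) for two groups
    with conv_a / conv_b conversions out of n_a / n_b visitors.
    Returns the chi-squared statistic and a two-sided p-value, using
    P(chi2_1 > x) = 2 * (1 - Phi(sqrt(x)))."""
    table = [[conv_a, n_a - conv_a], [conv_b, n_b - conv_b]]
    total = n_a + n_b
    col_totals = [conv_a + conv_b, total - conv_a - conv_b]
    row_totals = [n_a, n_b]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    phi = 0.5 * (1 + math.erf(math.sqrt(chi2) / math.sqrt(2)))
    return chi2, 2 * (1 - phi)

# 5% vs 10% conversion over 1000 visitors each:
chi2, p_value = chi2_test_2x2(50, 1000, 100, 1000)
print(chi2, p_value)
```

If the p-value falls below your chosen significance level, you reject H0 and conclude the two designs convert differently.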

Source: https://en.wikipedia.org/wiki/A/B_testing

# Other Determinants

Keep in mind that non-experimental situations are not suitable for A/B testing. Testing over time can introduce serious biases into the results, for example when different times of day or the summer vacation period fall into the test window, since customer behaviour on websites and elsewhere is affected by these circumstances. Other things to consider are the randomization of the treatment group and stratification; make sure you take this into account when planning your test set-up. Imagine you launch a new design for a news website: showing the new page only to early birds and the old version to night owls can bias your results, because the two groups are not fully random. Additionally, if you want to test two changes, you need more than one test group, one for each feature combination.
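One common way to get a random yet stable split is to hash each visitor's id into a bucket, so that a visitor's group does not depend on the time of day they arrive and stays the same on repeat visits. This is a sketch under the assumption that stable user ids are available; the salt string and function name are made up for illustration:

```python
import hashlib

def assign_group(user_id, treatment_share=0.5, salt="exp-newdesign"):
    """Deterministically assign a visitor to 'treatment' or 'control'.

    Hashing the (salted) user id spreads visitors uniformly over
    10,000 buckets; the same visitor always lands in the same group,
    independent of when they show up."""
    digest = hashlib.sha256((salt + str(user_id)).encode()).hexdigest()
    bucket = int(digest, 16) % 10_000
    return "treatment" if bucket < treatment_share * 10_000 else "control"
```

Changing the salt per experiment re-randomizes the assignment, which avoids carrying over group membership (and its biases) from one test to the next.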

# Conclusion

A/B testing sounds like an easy thing to do, but it needs planning from a statistical perspective to yield reliable results. Make sure you track everything and plan in advance. When it comes to the control group size, one size fits all is not the way to go, because you may waste conversions; a 50/50 split is often too generous. Consider the factors influencing the size and adjust your parameters according to your objective.

**Follow Up Reads**

(1) Cochran, W. G. (1977). Sampling Techniques, 3rd ed., Wiley, New York.