Estimating long-term detection, win, and error rates in A/B testing

This blog post is also available in Quarto notebook form on GitHub.


Why a team that has good (but not great) ideas 75% of the time may see a win rate of just 25%, and why focusing on MDE/statistical power alone may not always be the best approach

Classical statistical guarantees in A/B testing (Type I and II error rates) are great at quantifying the reproducibility of a single experiment. As a reminder:

  • Type I error rate, $\alpha$, guarantees that when the null hypothesis is true (typically: no difference between variants), you will observe statistically significant results (i.e., the observed data says there’s a difference) only $\alpha$ fraction of the time. It’s the false positive rate for an experiment with no effect.

  • Type II error rate, $\beta$, guarantees that when the null hypothesis is false, you will fail to observe statistically significant results only $\beta$ fraction of the time; equivalently, you will detect the effect $1-\beta$ fraction of the time (a.k.a. statistical power), conditional on sample size, effect size of interest, and chosen $\alpha$ rate. It’s the false negative rate for an experiment with an underlying effect. (Both rates are illustrated with a quick simulation right after this list.)
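To make these definitions concrete, here is a minimal simulation sketch (my illustration, not code from the post); the sample size of 500 per variant and the standard normal outcomes are arbitrary assumptions:

# With no true difference between variants, a two-sided t-test at alpha = 0.05
# should reject (i.e., return a "statistically significant" result) in roughly
# 5% of experiments - the Type I error rate.
set.seed(1)
alpha <- 0.05
rejections <- replicate(2000, {
  a <- rnorm(500)  # variant A: no underlying effect
  b <- rnorm(500)  # variant B: identical distribution
  t.test(a, b)$p.value < alpha
})
mean(rejections)   # close to 0.05
# Repeating the same simulation with a true difference added to variant B
# (e.g., b <- rnorm(500, mean = 0.1)) estimates power (1 - beta) instead.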

However, they tell us little about the long-term dynamics of an “experimentation program” (a series of experiments, each estimating the impact of a different treatment). You can’t immediately answer questions such as “What % of experiments can I expect to detect a statistically significant result in?”.

That, however, may be an interesting question to answer. Suppose you’re Head of Data Something at a company that ran 100 experiments last year, of which only 5% yielded positive, statistically significant results (a bit below an oft-quoted figure in the industry). You may be wondering whether that’s normal. Is it really the case that 95% of product changes had zero impact? Perhaps their effects weren’t exactly zero, but simply too small to declare statistically significant with the sample sizes at hand?

If you work in a large company, you may find yourself on the opposite side of the “statistically but not practically significant” problem. After all, a conversion rate improvement of 0.5 percentage points may still mean many $$ annually. However, achieving 80% power on such an effect size requires a 3-4x larger sample than an effect size of 1 percentage point would (required sample size scales roughly with 1/MDE²). Are you willing to make that trade-off and make all experiments 3-4x slower?
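For a rough sense of that trade-off (my numbers, not the post’s), base R’s power.prop.test makes the comparison easy; the 10% baseline conversion rate is an arbitrary assumption:

# Per-group sample size for 80% power at alpha = 0.05,
# assuming a 10% baseline conversion rate.
n_1pp  <- power.prop.test(p1 = 0.10, p2 = 0.110, power = 0.8, sig.level = 0.05)$n
n_05pp <- power.prop.test(p1 = 0.10, p2 = 0.105, power = 0.8, sig.level = 0.05)$n
n_05pp / n_1pp  # close to 4: halving the MDE roughly quadruples the required sample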

Understanding the relationship between experiment-level design parameters (minimum detectable effect (MDE), sample size, power) and overall experimentation program win/detection/error rates can be pretty illuminating. So, let’s figure out how to do it.

This blog post is accompanied by an interactive Streamlit app where you can explore how all the different parameters come to life yourself.

From statistical power to detection/win rates

You may wonder why the answer to long-term detection rates isn’t simply statistical power, as in “We will detect (i.e., declare as statistically significant) differences when they exist 80% of the time”.

That’s because statistical power is conditional on the MDE. Assuming you use the same MDE all the time, you can only claim that when the actual effect is at least as large as the MDE, you will detect it at least 80% of the time.

But what if it’s not? How exactly should one think about the “at least” part of that statement when actual effects are larger or smaller than the MDE? Your power will be exactly 80% only if you always choose an MDE precisely equal to the actual, unobservable difference for that specific experiment. If you could do that, you wouldn’t need to experiment!
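A quick way to see this is to fix the sample size and vary the true effect (a sketch with assumed numbers: a 10% baseline and roughly the per-group n that gives 80% power at a 1 percentage point MDE):

# Power of the same fixed-size experiment at true effects
# smaller than, equal to, and larger than the 1pp MDE.
n <- 15000  # per group; ~80% power for a 1pp lift off a 10% baseline
sapply(c(0.005, 0.010, 0.020), function(effect) {
  power.prop.test(n = n, p1 = 0.10, p2 = 0.10 + effect, sig.level = 0.05)$power
})
# Well below 80% at 0.5pp, ~80% at 1pp, and near 100% at 2pp.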

Frequentist approaches tend to develop tools that find upper/lower bounds of a measure of interest, and statistical power is an excellent example of that (I saw this clearly articulated in an essay by Jacob Steinhardt titled “Beyond Bayesians and Frequentists” - I highly recommend it).

However, we are after an expectation—we want to know the average expected detection rate. For that, we’ll borrow an idea from Bayesian statistics that tends to be better suited for such purposes. Instead of working with fixed values, we will assume a distribution of likely effect sizes (a.k.a., a prior distribution) and then integrate over that distribution to get expected rates.

To be precise, by integrating over a distribution of hypothesized treatment effect sizes $A$ (we’ll call the distribution $F_A$), we can calculate:

  • Expected “detection rate” - the % of experiments we would declare statistically significant given they have a non-zero actual treatment effect.

  • Expected “win rate” - the % of experiments we would declare statistically significant, counting only experiments with a positive actual treatment effect.

Mathematically:

$$\text{expected detection rate} = \int_{-\infty}^{+\infty} F_A(x)\,\Pr(\text{reject } H_0 \mid \text{true effect} = x;\ \alpha, n, \sigma^2)\,dx$$

$$\text{expected win rate} = \int_{0}^{+\infty} F_A(x)\,\Pr(\text{reject } H_0 \mid \text{true effect} = x;\ \alpha, n, \sigma^2)\,dx$$

where $n$ is the sample size, $\sigma^2$ is the variance of the outcome metric, and $\alpha$ is the chosen statistical significance level.
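Here is a minimal sketch of how these integrals can be evaluated numerically (my illustration, not the post’s code): a normal-approximation power curve for a two-sided, two-sample test is integrated against an assumed effect size density. The per-group sample size, metric variance, and the N(0.02, 0.03) prior are all illustrative assumptions.

# Pr(reject H0 | true effect = x) under a normal approximation, integrated
# against an assumed effect size density to get the expected rates.
alpha  <- 0.05
n      <- 10000                 # per-group sample size (assumed)
sigma2 <- 0.10 * (1 - 0.10)     # variance of a ~10% conversion metric (assumed)
se     <- sqrt(2 * sigma2 / n)  # standard error of the difference in means
z      <- qnorm(1 - alpha / 2)
power_at <- function(x) pnorm(x / se - z) + pnorm(-x / se - z)
f_A <- function(x) dnorm(x, mean = 0.02, sd = 0.03)  # density of the assumed F_A
detection_rate <- integrate(function(x) f_A(x) * power_at(x), -Inf, Inf)$value
win_rate       <- integrate(function(x) f_A(x) * power_at(x), 0, Inf)$value

Swapping in a different n, σ², α, or effect size density and re-running the two integrals shows how each input moves the expected rates.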

One immediate thing to note is that detection/win rates do not depend on the MDE or statistical power. The only levers you have to increase them are larger samples, variance reduction techniques (e.g., using covariates, as in CUPED-like methods), better treatments (changes in $F_A$), and different $\alpha$ levels.

If your immediate reaction is, “OMG, a subjective assumption on effect size distribution (gasp!)”, I would like to remind you that choosing the MDE threshold for power calculations is subjective, too. Ideally, one would pick an MDE upfront and tune the sample size to achieve the desired statistical power. However, in practice, there’s always an incentive to cherry-pick an MDE itself, given a fixed sample size. It looks better to report a power of 80% with an MDE of 0.015 than a power of 50% with an MDE of 0.01.

Having said that, if you use the MDE correctly and report statistical power together with it, bad MDE choices are more transparent than flawed effect size distribution assumptions. But if we want to estimate average detection/win rates, that’s the cost we need to pay - there’s no free lunch!

Selecting an effect size distribution

So, how do we select a potential effect size distribution?

  • Historical experiment results could be a good starting point for understanding where most of the distribution density should be. If most experiments see effect sizes of a few percentage points, then the distribution should reflect that, too. Importantly, we should consider observed treatment effects in all experiments, not just the statistically significant ones.

  • Teams have domain expertise and are paid to do their work, so we’d hope they do better than a random flip of a coin - the mean/median of the distribution should be positive.

  • The distribution should not be symmetric. Teams may occasionally experiment with more drastic changes, but safeguards (dogfooding, small-scale pilots, UX research) should prevent ideas with a major negative impact from reaching the experimentation phase.

From a modeling perspective, a (shifted/scaled) Gamma distribution is a good option for modeling such skewed distributions with limited downside and a longer tail of positive effects.

For the sake of this blog post, I will use two distributions representing different hypothetical teams:

  • Mature product team. This team manages a mature product with a well-established user base. Experiments usually represent incremental changes and minor features, and thus an average experiment achieves just a 0.5 percentage point improvement. They rely extensively on dogfooding, UXR, and other means to test ideas before they reach the experimentation phase, thus protecting the downside very well. We’ll model it with a (shifted) Gamma distribution.

  • Startup product team. This team is still in the product-market fit stage. They tend to make wide-ranging changes and like moving fast, without much research before the experimentation phase. Because the product is less polished, experiments tend to produce higher gains (2.0 percentage point improvements, on average), but negative impacts from tested changes are also more common. We’ll model it with a Normal distribution.

Show code
# Two assumed effect size distributions: each entry has a label, a
# human-readable description, a random generator, and a density function.
eff_size_distrs = list(
  list(
    name = 'Startup team',
    distr = 'x ~ N(0.02, 0.03)',
    generator = function() rnorm(1, mean = 0.02, sd = 0.03),
    density = function(x) dnorm(x, mean = 0.02, sd = 0.03)
  ),
  list(
    name = 'Mature team',
    distr = 'x ~ Gamma(2, 200) - 0.005',
    generator = function() rgamma(1, shape = 2, rate = 200) - 0.005,
    density = function(x) dgamma(x + 0.005, shape = 2, rate = 200)
  )
)

Here’s how that looks visually, with some summary statistics below.

Show code
library(tidyverse)  # bind_cols, pivot_longer, tibble, ggplot2

# Draw 50,000 samples from each assumed distribution, in long format
emp = lapply(
  eff_size_distrs,
  function(d) {
    r = list()
    r[[d$name]] = sapply(1:50000, function(i) d$generator())
    r
  }
) |> bind_cols() |> pivot_longer(everything(), names_to = "Assumed distribution")

# Plot the two assumed densities as filled areas, one facet per distribution
ggplot(emp) +
  #geom_density(aes(x=value, color=`Assumed distribution`)) +
  sapply(
    eff_size_distrs,
    function(d) stat_function(
      aes(color = d$name, fill = d$name),
      fun = d$density,
      data = tibble(name = d$name, full_name = paste(d$name, '', d$distr)),
      geom = "area",
      alpha = 0.5
    )
  ) +
  xlim(-0.05, 0.1) +
  labs(
    y = 'Density',
    color = 'Assumed effect size distribution',
    x = 'Effect size',
    title = 'Some possible effect size assumptions'
  ) +
  theme_light() +
  theme(
    panel.grid.major = element_blank(),
    legend.position = "none"
  ) +
  facet_wrap(~full_name) +
  geom_vline(xintercept = 0, linetype = 'dotted')
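The summary statistics mentioned above could be computed along these lines (a sketch of mine using the emp samples generated in the code; the statistics the post actually reports may differ):

# Mean/median effect and the share of negative effects under each assumed distribution
emp |>
  group_by(`Assumed distribution`) |>
  summarise(
    mean_effect    = mean(value),
    median_effect  = median(value),
    share_negative = mean(value < 0)
  )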