P-value is the probability that the treatment effect is larger than zero (under certain conditions)

All code behind this blog post is available on GitHub.

P-value is the probability that the treatment effect is larger than zero. Under certain conditions.

Ah, the beautiful concept of p-value. The probability of observing a result as extreme or more extreme than the data at hand, assuming the null hypothesis is true. Not a single person ever simplified it to something else. Like... I don’t know... the probability that the null hypothesis is true? Or, in an A/B test, the probability that the treatment effect is greater/smaller than zero? 👀

Doing that is equivalent to committing an original data science sin. You’d be banished to an ungodly world full of local optima, spaghetti SQL, Excel downloads, and out-of-date dashboards. Or you simply wouldn’t pass a screening round in a hiring process.

BUT.

P-value can be equal to the probability that the treatment effect is greater/smaller than zero. In the context of A/B testing, where we are in a straightforward univariate hypothesis testing setting, it is as simple as employing a Bayesian approach with an uninformative prior.

This post, I admit, is partly for kicks. But it has some practical implications, too:

  • Most articles on Bayesian A/B testing will tell you that using uninformative priors is not a very good idea. While I agree, the fact that they yield results equivalent to the most popular frequentist procedure means they aren’t a terrible choice either. Some vendors out there scaremonger you into shying away from Bayesian approaches altogether. This post shows that if you choose to go with uninformative priors, you’ll be shooting yourself in the foot only as much as you would with a simple t-test.

  • The bad thing, however, is that p-values carry this interpretation only under very unrealistic assumptions about the prior distribution. In practice, then, interpreting the p-value as a direct measure of evidence, or using uninformative priors, is a bad idea.

The Gaussian case

Let’s consider the following setup:

  • We’re interested in measuring the impact of a simple A/B test.

  • Our outcome metric happens to be normally distributed: $Y_A \sim N(\mu_A, \sigma_A)$ and $Y_B \sim N(\mu_B, \sigma_B)$. For simplicity, we will set $\mu_A = 0$, $\mu_B = 0.05$, and $\sigma_A = \sigma_B = 1$. In other words, equal variance and a treatment effect of $0.05$.

Suppose we ran an experiment and collected 6,000 samples in each experiment arm (that’s 78% power, assuming we set MDE equal to the true unobservable population effect).
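As a quick sanity check on that power figure (assuming a standard two-sample, two-sided test at the 5% level), base R’s power.t.test lands on roughly the same number:

# rough power check: n = 6000 per arm, true effect 0.05, sd = 1
power.t.test(n = 6000, delta = 0.05, sd = 1, sig.level = 0.05)$power
# roughly 0.78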

A quick t-test refresher

If we were to evaluate our results using a t-test, we would:

  • Compute sample means: $\bar{Y_A} = \text{mean}(Y_A)$, $\bar{Y_B} = \text{mean}(Y_B)$

  • Compute sample variances: $s_A^2 = \text{var}(Y_A)$, $s_B^2 = \text{var}(Y_B)$

Then, enabled by the CLT, we can claim that the difference of the means between the two groups is normally distributed (that distribution is referred to as the “sampling distribution”), with the following parameters:

  • Mean: $\bar{Z} = \bar{Y_B} - \bar{Y_A}$

  • Standard error: $s_Z = \sqrt{\frac{s_A^2}{n_A} + \frac{s_B^2}{n_B}}$

Finally, we calculate the test statistic $t = \bar{Z} / s_Z$ and look up the probability of seeing a value at least that extreme under the null distribution (a.k.a. the p-value).

This is how it may look numerically.

Show code
# packages used throughout: tidyverse for wrangling and plotting, kableExtra for tables
library(tidyverse)
library(kableExtra)

set.seed(42)

n = 6000
effect_size = 0.05

A = rnorm(n, 0, 1)
B = rnorm(n, effect_size, 1)

# sampling distribution of the difference in means between two samples
sampling_distribution = function(X, Y) {
  list(
    mean = mean(X) - mean(Y),
    sd = sqrt(var(X) / n + var(Y) / n)
  )
}

sdist = sampling_distribution(B, A)
t = sdist$mean / sdist$sd

as_tibble(sdist) |> 
  pivot_longer(everything(), names_to = "statistic") |> 
  kbl(caption='Summary statistics of the sampling distribution') |> 
  kable_minimal(full_width = T, position='left')
Summary statistics of the sampling distribution
statistic value
mean 0.0419394
sd 0.0183665
Show code
tibble(
  statistic = c('Test statistic', 'P-value'),
  value = c(t, (1 - pnorm(t)) * 2)
) |> kbl(caption='Hypothesis test outcomes') |> 
  kable_minimal(full_width = T, position='left')
Hypothesis test outcomes
statistic value
Test statistic 2.2834717
P-value 0.0224026

As a visual reminder - the p-value is twice the CDF of the sampling distribution over $(-\infty, 0]$ (twice, because we’re testing against a two-sided alternative hypothesis).
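A rough way to sketch that region, reusing the sdist object from above (the shaded area is half of the p-value):

# sampling distribution of the observed difference, with the area below zero shaded
ggplot() +
  geom_function(fun = dnorm, args = sdist) +
  stat_function(
    fun = dnorm, args = sdist, geom = 'area',
    xlim = c(-0.1, 0), fill = 'red', alpha = 0.3
  ) +
  xlim(-0.1, 0.1) +
  geom_vline(xintercept = 0, color = 'red', linetype = 'dashed') +
  xlab("") + ylab("Density") + ggtitle("Sampling distribution of the difference in means") +
  theme_light() + theme(panel.grid = element_blank())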

Mathematically, we don’t need to calculate the test statistic. We can get the CDF directly using the parameters of our sampling distribution.

Show code
# this is also a p-value!
format(pnorm(0, mean = sdist$mean, sd = sdist$sd, lower.tail=T) * 2, digits=4)
[1] "0.0224"

We can double-check using the built-in R functions.

Show code
t.test(B, A)

    Welch Two Sample t-test

data:  B and A
t = 2.2835, df = 11997, p-value = 0.02242
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.00593807 0.07794067
sample estimates:
   mean of x    mean of y 
 0.033779917 -0.008159451 

Great, we can conclude that, if our hypothesis of the treatment having no effect were true, we would observe a difference between the two groups at least as large in magnitude as the one we did (we observed a difference of 0.042) roughly 2% of the time.

Sounds rare enough that we could reject our null hypothesis and conclude that there indeed is a difference.

We also conclude that, if we were to repeat this experiment an infinite number of times, 95% of the intervals constructed this way would contain the true effect. As for this particular run and its interval of 0.005 to 0.078 - well, the true effect is either inside that range or not. T-tests (or frequentist approaches, more generally) don’t tell you anything about this particular experiment. Just long-run probabilities.

A mouthful. But hey, this isn’t a post about issues with null hypothesis testing. I wrote that one already.
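Before moving on, here’s a rough simulation sketch of that long-run coverage claim, reusing n and effect_size from the setup above: across many repeated experiments, roughly 95% of the resulting confidence intervals should cover the true effect of 0.05.

# repeat the experiment many times and check how often the 95% CI covers the true effect
set.seed(1)
covered = replicate(2000, {
  ci = t.test(rnorm(n, effect_size, 1), rnorm(n, 0, 1))$conf.int
  ci[1] <= effect_size & effect_size <= ci[2]
})
mean(covered)  # should land close to 0.95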

A Bayesian approach

Generally, saying “I’m using a Bayesian approach” is like saying “I’m using linear algebra”. It’s not a statistical method in itself. We need an estimand! We will use the simplest one: the probability that the treatment effect is larger than zero.

Next, we will need uninformative priors. We can achieve that by placing normal priors on the group means with variance approaching infinity (making any number on the real line roughly equally likely), i.e. $\mu_A, \mu_B \sim N(0, 1/\tau)$, where $\tau$ is some tiny precision, e.g. 0.001.

Note that we did not choose a prior for the treatment effect itself. Instead, we’re effectively saying that both experiment arms have the same prior mean, and that each arm’s mean can take almost any value with roughly the same prior probability.

However, we can see what kind of treatment effect prior our choice implies by leveraging the fact that the difference of two independent normal random variables is itself normally distributed:

$$\text{if } X \sim N(\mu_x, s_x^2),\ Y \sim N(\mu_y, s_y^2), \text{ then } (X - Y) \sim N(\mu_x - \mu_y,\ s_x^2 + s_y^2)$$

Show code
difference_of_two_normals = function(X, Y) {
  list(
    mean = X$mean - Y$mean,
    tau = 1 / ((1 / X$tau) + (1 / Y$tau)),
    sd = sqrt((1 / X$tau) + (1 / Y$tau))
  )
}

difference_of_two_normals(list(mean=0, tau=0.001), list(mean=0, tau=0.001)) |> 
  as_tibble() |> pivot_longer(everything(), names_to = 'statistic') |>
  kbl(caption='Implied treatment effect prior distribution') |> 
  kable_minimal(full_width = T, position='left')
Implied treatment effect prior distribution
statistic value
mean 0.00000
tau 0.00050
sd 44.72136

We can see that choosing a prior on each arm’s mean with zero mean and $\tau = 0.001$ (or, equivalently, a standard deviation of $\sqrt{1/0.001} = 31.6$) yields a prior on the treatment effect that also has a zero mean but even more uncertainty, with a standard deviation of $44.7$.

Next, we need to get to posterior distributions. Luckily, a normal prior and a normal likelihood form a conjugate pair that yields a normal posterior distribution, so we can do this algebraically.

Specifically, suppose the prior distribution is $P_0 \sim N(\mu_0, \tau_0^{-1})$, the observed mean is $\bar{X}$, and the observed precision is $\bar{\tau} = \frac{1}{s^2_x}$. Then the posterior is $P \sim N\left(\frac{\tau_0 \mu_0 + \bar{\tau} \bar{X} n}{\tau_0 + n \bar{\tau}},\ \left(\tau_0 + n \bar{\tau}\right)^{-1}\right)$. Let’s calculate the posteriors for our samples.

Show code
calculate_normal_posterior = function(X, prior) {
  n = length(X)
  observed_mu = mean(X)
  observed_tau = 1 / var(X)
  
  list(
    mean = (prior$tau * prior$mean + observed_tau * observed_mu * n) / (prior$tau + n * observed_tau),
    tau = (prior$tau + n * observed_tau),
    sd = sqrt(1 / (prior$tau + n * observed_tau))
  )
}

A_posterior = calculate_normal_posterior(A, list(mean=0, tau=0.001))
B_posterior = calculate_normal_posterior(B, list(mean=0, tau=0.001))

bind_rows(A_posterior, B_posterior, .id = 'ID') |> 
  mutate(ID = str_replace_all(ID, c('1'='A', '2'='B'))) |>
  kbl(caption='Posterior distributions of experiment arms') |> 
  kable_minimal(full_width = T, position='left')
Posterior distributions of experiment arms
ID mean tau sd
A -0.0081594 5873.761 0.0130479
B 0.0337799 5985.180 0.0129259

We have posterior distributions for the means of the two samples, but we are after the treatment effect - i.e., the difference between them. We can calculate it using the same approach we used earlier for the implied treatment effect prior, as the two posteriors can be assumed to be independent, too.

Show code
treatment_posterior = difference_of_two_normals(B_posterior, A_posterior)

treatment_posterior |> as_tibble() |> pivot_longer(everything(), names_to = 'statistic') |>
  kbl(caption='Posterior distribution of the treatment effect') |> 
  kable_minimal(full_width = T, position='left')
Posterior distribution of the treatment effect
statistic value
mean 0.0419394
tau 2964.4734513
sd 0.0183665

Let’s visualize it.

Show code
ggplot() + geom_function(
  fun = dnorm, 
  args = within(treatment_posterior, rm("tau"))
) + xlim(-0.1, 0.1) + 
  geom_vline(xintercept=0, color='red', linetype='dashed') + 
  xlab("") + ggtitle("PDF of the treatment effect") + ylab("Density") +
  theme_light() + theme(panel.grid=element_blank())

That looks familiar! Let’s overlay the sampling distribution from the t-test:

Show code
ggplot() + geom_function(
  fun = dnorm, 
  args = within(treatment_posterior, rm("tau")),
  aes(color = 'Posterior of a treatment effect'),
  linetype='dashed'
) + xlim(-0.1, 0.1) + 
  geom_function(
    fun = dnorm, 
    args = sdist,
    aes(color = 'Sampling distribution'),
    linetype='twodash'
  ) + xlab("") + ggtitle("Treatment effect posterior distribution vs. Sampling distribution") + 
  ylab("Density") + labs(color='') +
  theme_light() + theme(panel.grid=element_blank(), legend.position='bottom')

They are identical! This means that the probability of the treatment effect being smaller than zero is—you guessed it—equal to 1/2 of a two-sided p-value—or, in other words, identical to the one-sided p-value.
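To close the loop numerically, here’s a quick check, reusing treatment_posterior and t from above, that the posterior probability of a negative effect matches the one-sided p-value:

# posterior probability of a negative treatment effect vs. the one-sided p-value
# (both come out to roughly 0.0112, i.e. half of the two-sided p-value above)
tibble(
  statistic = c('P(treatment effect < 0 | data)', 'One-sided p-value'),
  value = c(
    pnorm(0, mean = treatment_posterior$mean, sd = treatment_posterior$sd),
    1 - pnorm(t)
  )
) |> kbl(caption='Posterior probability vs. one-sided p-value') |> 
  kable_minimal(full_width = T, position='left')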