All code behind this blog post is available on GitHub.
Ah, the beautiful concept of p-value. The probability of observing the result as extreme or more extreme than the data at hand, assuming the null hypothesis is true. Not a single person ever simplified it to something else. Like.. I don’t know.. the probability that the null hypothesis is true? Or, in an A/B test, the probability that treatment is greater/smaller than zero? 👀
Doing that is equivalent to committing an original data science sin. You’d be banished to an ungodly world full of local optima, spaghetti SQL, Excel downloads, and out-of-date dashboards. Or you simply wouldn’t pass a screening round in a hiring process.
BUT.
A p-value can be equal to the probability that the treatment effect is greater/smaller than zero. In the context of A/B testing, where we find ourselves in a straightforward univariate hypothesis testing setting, it is as simple as employing a Bayesian approach with an uninformative prior.
This post, I admit, is partly for kicks. But it has some practical implications, too:
Most articles on Bayesian A/B testing will tell you that using uninformative priors is not a very good idea. While I agree, the fact that they yield results equivalent to the most popular frequentist procedure means they aren’t a terrible choice either. There are some vendors out there that scaremonger people into shying away from Bayesian approaches. This post shows that if you choose to go with uninformative priors, you’ll be shooting yourself in the foot exactly as much as you would with a simple t-test.
The bad news, however, is that p-values only carry this interpretation under very unrealistic assumptions about the prior distribution. In practice, then, interpreting a p-value as a direct measure of evidence, or using uninformative priors, is a bad idea.
The Gaussian case
Let’s consider the following setup:
We’re interested in measuring the impact of a simple A/B test.
Our outcome metric happens to be normally distributed: $Y_A \sim \mathcal{N}(\mu_A, \sigma_A^2)$ and $Y_B \sim \mathcal{N}(\mu_B, \sigma_B^2)$. For simplicity, we will set $\mu_A = 0$, $\mu_B = 0.05$, and $\sigma_A = \sigma_B = 1$. In other words, equal variance and a treatment effect of $0.05$.
Suppose we ran an experiment and collected 6,000 samples in each experiment arm; that’s roughly 78% power, assuming we set the MDE equal to the true, unobservable population effect (a quick check of that figure follows below).
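That power figure can be sanity-checked with base R’s power.t.test, under the assumptions above (normal outcomes, standard deviation of 1, 5% significance level, two-sided test):

power.t.test(n = 6000, delta = 0.05, sd = 1, sig.level = 0.05)$power
# ~0.78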
A quick t-test refresher
If we were to evaluate our results using a t-test, we would:
Compute sample means: $\bar{x}_A$, $\bar{x}_B$,
Compute sample variances: $s_A^2$, $s_B^2$,
Then, enabled by CLT, we can claim that the difference of the means between the two groups is normally distributed (that distribution is referred to as the “sampling distribution”), with the following parameters:
Mean: $\bar{x}_B - \bar{x}_A$
Standard error: $\sqrt{s_B^2 / n + s_A^2 / n}$
Finally, we calculate the test statistic $t = \dfrac{\bar{x}_B - \bar{x}_A}{\sqrt{s_B^2 / n + s_A^2 / n}}$ and look up the corresponding tail area of the sampling distribution (a.k.a. the p-value).
This is how it may look numerically.
Show code
library(tidyverse)   # tibble, dplyr, tidyr, stringr, ggplot2
library(kableExtra)  # kbl(), kable_minimal()

set.seed(42)
n = 6000
effect_size = 0.05

A = rnorm(n, 0, 1)
B = rnorm(n, effect_size, 1)

sampling_distribution = function(X, Y) {
  list(
    mean = mean(X) - mean(Y),
    sd = sqrt(var(X) / n + var(Y) / n)
  )
}

sdist = sampling_distribution(B, A)
t = sdist$mean / sdist$sd
as_tibble(sdist) |>
pivot_longer(everything(), names_to = "statistic") |>
kbl(caption='Summary statistics of the sampling distribution') |>
kable_minimal(full_width = T, position='left')
statistic | value |
---|---|
mean | 0.0419394 |
sd | 0.0183665 |
Show code
tibble(
  statistic = c('Test statistic', 'P-value'),
  value = c(t, (1 - pnorm(t)) * 2)
) |>
  kbl(caption='Hypothesis test outcomes') |>
  kable_minimal(full_width = T, position='left')
statistic | value |
---|---|
Test statistic | 2.2834717 |
P-value | 0.0224026 |
As a visual reminder: the p-value is twice the probability mass of the sampling distribution over the range $(-\infty, 0]$, i.e. twice its CDF evaluated at zero (twice, because we’re testing against a two-sided alternative hypothesis).
Mathematically, we don’t need to calculate the test statistic. We can get the CDF directly using the parameters of our sampling distribution.
Show code
#this is also a p-value!
format(pnorm(0, mean = sdist$mean, sd = sdist$sd, lower.tail=T) * 2, digits=4)
[1] "0.0224"
We can double-check using the built-in R functions.
Show code
t.test(B, A)
Welch Two Sample t-test
data: B and A
t = 2.2835, df = 11997, p-value = 0.02242
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.00593807 0.07794067
sample estimates:
mean of x mean of y
0.033779917 -0.008159451
Great, we can conclude that, if our hypothesis of the treatment having no effect were true, we would observe a difference between the two groups at least as large as the one we did (0.041) roughly 2% of the time.
Sounds rare enough that we could reject our null hypothesis and conclude that there indeed is a difference.
We also conclude that an interval constructed this way, here spanning 0.005 to 0.078, would contain the true effect 95% of the time if we were to repeat this experiment an infinite number of times. As for this particular run, well, the true effect is either inside that range or not. T-tests (or frequentist approaches more generally) don’t tell you anything about this particular experiment. Just long-run probabilities.
A mouthful. But hey, this isn’t a post about issues with null hypothesis testing. I wrote that one already.
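Before moving on, one way to make that long-run statement concrete is a small simulation sketch (reusing n and effect_size from the setup): repeat the experiment many times and count how often the 95% confidence interval covers the true effect of 0.05.

# repeat the experiment and record whether each 95% CI covers the true effect
covers = replicate(2000, {
  A_rep = rnorm(n, 0, 1)
  B_rep = rnorm(n, effect_size, 1)
  ci = t.test(B_rep, A_rep)$conf.int
  ci[1] <= effect_size && effect_size <= ci[2]
})
mean(covers)  # should land close to 0.95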
A Bayesian approach
Generally, saying “I’m using a Bayesian approach” is like saying “I’m using linear algebra”. It’s not a statistical method. We need an estimand! We will use the simplest one: the probability that the treatment effect is larger than zero.
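In symbols, with $\mu_A$ and $\mu_B$ denoting the mean outcomes in the control and treatment arms, that estimand is $P(\mu_B - \mu_A > 0 \mid \text{data})$.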
Next, we will need uninformative priors. We can achieve that by choosing normal priors for the two arm means with variance approaching infinity (making any number on the real line roughly equally likely), i.e. $\mu_A, \mu_B \sim \mathcal{N}(0,\ 1/\tau_0)$, where the precision $\tau_0$ is some tiny number, e.g. 0.001 (i.e. a prior variance of 1,000).
Note that we did not choose a prior for the treatment effect itself. Instead, we’re effectively saying that both experiment arms have the same mean and can take almost any value with the same likelihood.
However, we can see what kind of treatment effect prior distribution our choice implies by leveraging the fact that a difference between two independent normal distributions is just another normal distribution:
Show code
# the difference of two independent normals is normal:
# means subtract, variances (1 / tau) add
difference_of_two_normals = function(X, Y) {
  list(
    mean = X$mean - Y$mean,
    tau = 1 / ((1 / X$tau) + (1 / Y$tau)),
    sd = sqrt((1 / X$tau) + (1 / Y$tau))
  )
}
difference_of_two_normals(list(mean=0, tau=0.001), list(mean=0, tau=0.001)) |>
as_tibble() |> pivot_longer(everything(), names_to = 'statistic') |>
kbl(caption='Implied treatment effect prior distribution') |>
kable_minimal(full_width = T, position='left')
statistic | value |
---|---|
mean | 0.00000 |
tau | 0.00050 |
sd | 44.72136 |
We can see that choosing a prior with zero mean and $\tau = 0.001$ (or, equivalently, a standard deviation of $\sqrt{1/0.001} \approx 31.6$) for each experiment arm yields a prior on the treatment effect that also has a zero mean but even more uncertainty, with a standard deviation of $\sqrt{2/0.001} \approx 44.7$.
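As a quick arithmetic check on those two numbers (converting precision to standard deviation):

sqrt(1 / 0.001)  # sd implied by each arm's prior, ~31.62
sqrt(2 / 0.001)  # sd of the implied treatment effect prior, ~44.72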
Next, we need to get to posterior distributions. Luckily, a normal prior and a normal likelihood form a conjugate pair that yields a normal posterior distribution, so we can do this algebraically.
Specifically, suppose the prior distribution is $\mathcal{N}(\mu_0, \tau_0)$ (in mean-precision form), and we observe $n$ data points with sample mean $\bar{x}$ and precision $\tau = 1/s^2$. Then the posterior is $\mathcal{N}\!\left(\dfrac{\tau_0 \mu_0 + n\tau\bar{x}}{\tau_0 + n\tau},\ \tau_0 + n\tau\right)$, again in mean-precision form. Let’s calculate the posteriors for our samples.
Show code
calculate_normal_posterior = function(X, prior) {
  n = length(X)
  observed_mu = mean(X)
  observed_tau = 1 / var(X)

  # conjugate normal-normal update, parameterised by precision (tau)
  list(
    mean = (prior$tau * prior$mean + observed_tau * observed_mu * n) / (prior$tau + n * observed_tau),
    tau = (prior$tau + n * observed_tau),
    sd = sqrt(1 / (prior$tau + n * observed_tau))
  )
}

A_posterior = calculate_normal_posterior(A, list(mean=0, tau=0.001))
B_posterior = calculate_normal_posterior(B, list(mean=0, tau=0.001))
bind_rows(A_posterior, B_posterior, .id = 'ID') |>
mutate(ID = str_replace_all(ID, c('1'='A', '2'='B'))) |>
kbl(caption='Posterior distributions of experiment arms') |>
kable_minimal(full_width = T, position='left')
ID | mean | tau | sd |
---|---|---|---|
A | -0.0081594 | 5873.761 | 0.0130479 |
B | 0.0337799 | 5985.180 | 0.0129259 |
We have posterior distributions for the means of the two samples, but we are after the treatment effect, i.e. the difference between them. We can calculate it with the same approach we used for the implied treatment effect prior, since the two posteriors can be assumed to be independent, too.
Show code
treatment_posterior = difference_of_two_normals(B_posterior, A_posterior)

treatment_posterior |> as_tibble() |> pivot_longer(everything(), names_to = 'statistic') |>
  kbl(caption='Posterior distribution of the treatment effect') |>
  kable_minimal(full_width = T, position='left')
statistic | value |
---|---|
mean | 0.0419394 |
tau | 2964.4734513 |
sd | 0.0183665 |
Let’s visualize it.
Show code
ggplot() + geom_function(
  fun = dnorm,
  args = within(treatment_posterior, rm("tau"))  # dnorm() only takes mean and sd
) + xlim(-0.1, 0.1) +
  geom_vline(xintercept=0, color='red', linetype='dashed') +
  xlab("") + ggtitle("PDF of the treatment effect") + ylab("Density") +
  theme_light() + theme(panel.grid=element_blank())
That looks familiar! Let’s overlay the sampling distribution from the t-test:
Show code
ggplot() + geom_function(
  fun = dnorm,
  args = within(treatment_posterior, rm("tau")),
  aes(color = 'Posterior of a treatment effect'),
  linetype='dashed'
) + xlim(-0.1, 0.1) +
  geom_function(
    fun = dnorm,
    args = sdist,
    aes(color = 'Sampling distribution'),
    linetype='twodash'
  ) + xlab("") + ggtitle("Treatment effect posterior distribution vs. Sampling distribution") +
  ylab("Density") + labs(color='') +
  theme_light() + theme(panel.grid=element_blank(), legend.position='bottom')
They are identical! This means that the probability of the treatment effect being smaller than zero is—you guessed it—equal to 1/2 of a two-sided p-value—or, in other words, identical to the one-sided p-value.
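We can verify that numerically: the posterior tail probability and the one-sided p-value, computed directly from the two distributions above, come out the same.

# posterior probability that the treatment effect is below zero
pnorm(0, mean = treatment_posterior$mean, sd = treatment_posterior$sd)

# one-sided p-value, i.e. half of the two-sided p-value we computed earlier
pnorm(0, mean = sdist$mean, sd = sdist$sd)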