Rethinking-inspired¶
For the past few months, I’ve been spending a lot of my time following the Statistical Rethinking course by Richard McElreath. I had heard good things about it, and I wanted to get deeper into Bayesian workflows and get some exposure to causal inference – it’s rare to find a course that delivers both together.
It was totally worth it. Not only did the course help me get much more comfortable with the topics covered, but its usefulness goes beyond that: the “owl drawing framework” / generative thinking is helpful in so many situations, and I now have a bunch of topics I’d like to dive deeper into (state-space models seem particularly exciting).
It so happened that I had a problem at work that was nearly perfect to address with the techniques learned in the course. I figured I’d put a post together – perhaps some will find the problem itself interesting (it’s a pretty common one in subscription businesses, I suppose), while others may get inspired to learn about Bayesian modelling themselves.
P.S. Mid-way through the course, I started converting lecture materials & homeworks into pyMC (5.X), too. Just in case that’s handy for anyone (planning to clean them up & share more widely one day).
P.P.S. This blog post doesn’t contain any code cells. If you’d rather read the notebook with all the code cells visible, you can find it in my blog post repository.
Problem at hand – estimating drop-off rate over tenure¶
Here’s the problem I’ll be solving in this post. Suppose you are in the business of providing a subscription-based service, where “success” means interacting with your service near daily. It could be pretty much anything – chatGPT, Spotify, Netflix, a to-do app. Let’s also suppose that usage of your service exhibits quite strong seasonality. In this example, I’ll be working with day-of-week effects, but it could be anything from intra-day (hourly) to intra-year (monthly) effects. Additionally, suppose user signup patterns are also seasonal (i.e. we get many more signups on some days of the week than on others). Finally, just like any subscription service, you have a large fraction of users who just don’t stick with your service – they try using it for a few days and drop off (even if they have at least a month of commitment).
Understanding the drop-off rate (you could call it “retention curve” – though it’s not exactly retention) in such a business is very important. It helps inform decisions on “how long you have” to engage users before they start dropping off, it is one of the best ways to compare different signup cohorts, and it is a key input to customer lifetime value estimation.
The simplest approach to getting such a curve is simple descriptive analysis. Suppose you have data on whether a user was active on your service on each of the first 28 tenure days, for all users who joined in the last 3 months. You could calculate the average % of users who were active on the service on each tenure day and plot it to get a nice, smooth curve, which you include in the presentation to the CEO and move on with your life.
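To make this concrete, here’s a minimal sketch of that descriptive calculation. It assumes a dataframe named `activity` with one row per user per tenure day and a binary `visit` column – the names are illustrative placeholders, not the notebook’s exact code:

```python
import matplotlib.pyplot as plt

# Hypothetical dataframe: one row per user per tenure day,
# with a binary `visit` column indicating activity that day.
dropoff_curve = activity.groupby("tenure_day")["visit"].mean()

dropoff_curve.plot(marker="o")
plt.xlabel("Tenure day")
plt.ylabel("% of users active")
plt.show()
```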
There’s just one potential problem. Your curve may instead look like this:
What is going on? Why are there these bumps?
The fact that they repeat nearly every 7 days may give away something: it’s seasonality effects. The above curve is calculated from a simulated dataset where there are two “seasonal” effects:
- An unequal number of users start their tenure on a given day of the week (e.g. we get more signups on Monday)
- Users are more likely to be active on certain days of the week, independent of tenure
In fact, a descriptive analysis may fall short in other ways, too.
Here’s a chart from another simulated dataset where only the number of signups varies by day of week. There is no seasonality in visit probabilities, and the tenure effects are strictly monotonic. However, if you were to plot the % of users active by day of week, you’d find they tend to be more active towards the end of the week, and you may even conclude that there are differences in how active users are, on average, over the first 28 days of their tenure depending on the day of the week they signed up.
But both of these conclusions would be false for this dataset. It’s merely a function of concentrated signups (50% of them take place on a Friday, with the rest distributed roughly equally over the week).
This problem is a good example of something that would typically land on a data analyst’s desk, not a data scientist’s. However, getting the right answers requires a bit of “data science” (which is why, in my view, a good data analyst needs to know statistics & ML; it’s just that their focus is on a different kind of problem).
Could we build a (Bayesian) model to uncover the true data generating parameters?
As we care about interpretability, we’ll have to stick to regressions. This is inherently a logistic regression problem (“will a user visit on day X?”), which can also be expressed as a binomial regression as long as we don’t need user-specific parameters and can group observations into groups that share the same characteristics. While my focus is on using a Bayesian approach (using `pymc`), I will also have a more frequently used (pun intended) MLE version modelled with the `statsmodels` package.
The data generating process¶
Throughout this post, I will be working with simulated data. That’s one of the very useful things I took away from the Statistical Rethinking course. If you can control the data, it’s easy to know if a model is recovering the true data generating process. That gives some hope the model will work with real data, too.
So, here’s the setup. First, I create 50,000 users who have 3 attributes:
- Signup day of week. I can control the proportions of users signing up across days of the week. I visualize the weights assigned to each day below.
- Intrinsic motivation, which I can’t ever observe in real life, but which represents the baseline chance that a user will visit the app on any given day. I hypothesise that users come to the service with different motivation levels. I can control motivation levels to be different depending on signup day of the week (e.g. “users who sign up on Friday tend to be more motivated”). Also, I can allow for different levels of heterogeneity in baseline motivation among users. I achieve that by drawing a unique baseline motivation level for each user from a Beta distribution. For starters, I keep motivation levels the same for every signup day of the week, with Beta distribution parameters α=30, β=10 (so the average intrinsic motivation is ~0.75).
- User’s newsletter day of the week. Just to make things a bit more interesting, I imagine each user gets a newsletter (which increases their chance of using the service on the given day), but the day the user may get the newsletter varies. The newsletter “effect” is +10ppt on the chance to visit on that day.
Once I have drawn the users, I simulate their activity data.
- Each day, I draw a Bernoulli variable with probability of success `p`, where `p` is a function of the intrinsic motivation baseline, a tenure effect, a day-of-week effect, and a newsletter effect (if the day is the newsletter day for the user).
- For simplicity, I simply sum up the effects. In cases where the final probability `p` is very low (<0.01) or very high (>0.99), I set it to 0.01 or 0.99, respectively, so that it’s never 100% certain whether a user will use the service or not.
- Day-of-week effects are just arbitrarily chosen offsets (visualized below).
- I model the tenure effect in two components. First, I produce a curve using the function $f(x)= \frac{-4}{x^{0.01}}$, where $x$ is the tenure day, then I rescale it to the 0-1 range, and finally I apply an overall tenure effect. In a nutshell, I get a smooth tenure effect which starts at 0 and reaches the overall effect by the last tenure day. That is visualized below, too, and the whole simulation is sketched in the code block after this list.
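Here is a condensed sketch of that data generating process. It follows the steps above, but the specific day-of-week weights, effect sizes, and variable names are illustrative placeholders rather than the exact values used in the notebook:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

N_USERS, N_TENURE_DAYS = 50_000, 28
DAYS = np.array(["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"])

signup_weights = np.array([0.1, 0.1, 0.1, 0.1, 0.5, 0.05, 0.05])   # Friday-heavy signups
dow_effects = np.array([0.02, 0.0, -0.01, 0.0, 0.03, -0.05, -0.04])  # arbitrary offsets
NEWSLETTER_EFFECT = 0.10
OVERALL_TENURE_EFFECT = -0.35

# Smooth tenure curve: f(x) = -4 / x**0.01, rescaled to [0, 1], times the overall effect,
# so the tenure effect is 0 on day 1 and reaches the overall effect by day 28.
x = np.arange(1, N_TENURE_DAYS + 1)
f = -4 / x**0.01
tenure_curve = OVERALL_TENURE_EFFECT * (f - f.min()) / (f.max() - f.min())

# Draw users: signup day of week, intrinsic motivation, newsletter day
signup_day = rng.choice(7, size=N_USERS, p=signup_weights)
motivation = rng.beta(30, 10, size=N_USERS)          # mean ~0.75
newsletter_day = rng.integers(0, 7, size=N_USERS)

# Simulate daily activity: sum the effects, clip, then draw a Bernoulli outcome
rows = []
for tenure_day in range(1, N_TENURE_DAYS + 1):
    day_of_week = (signup_day + tenure_day - 1) % 7
    is_newsletter = (day_of_week == newsletter_day).astype(int)
    p = (
        motivation
        + tenure_curve[tenure_day - 1]
        + dow_effects[day_of_week]
        + NEWSLETTER_EFFECT * is_newsletter
    )
    p = np.clip(p, 0.01, 0.99)
    rows.append(pd.DataFrame({
        "user_id": np.arange(N_USERS),
        "signup_day_of_week": DAYS[signup_day],
        "newsletter_day_of_week": DAYS[newsletter_day],
        "tenure_day": tenure_day,
        "day_of_week": DAYS[day_of_week],
        "visit": rng.binomial(1, p),
        "is_newsletter_day": is_newsletter,
    }))
activity = pd.concat(rows, ignore_index=True)
```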
Here’s what the simulated data looks like in the end:
|   | user_id | signup_day_of_week | newsletter_day_of_week | tenure_day | day_of_week | visit | is_newsletter_day |
|---|---------|--------------------|------------------------|------------|-------------|-------|-------------------|
| 0 | 0 | Wed | Sun | 1 | Wed | 1 | 0 |
| 1 | 0 | Wed | Sun | 2 | Thu | 1 | 0 |
| 2 | 0 | Wed | Sun | 3 | Fri | 0 | 0 |
| 3 | 0 | Wed | Sun | 4 | Sat | 0 | 0 |
| 4 | 0 | Wed | Sun | 5 | Sun | 0 | 1 |
For modelling purposes, I will be working with data grouped by tenure day, day of week, whether it’s a newsletter day, and user signup day of week. This allows me to reduce the size of the dataset from 1.4M rows (50,000 users x 28 days) to just 252 (all combinations of user-day attributes). I will then fit binomial regression models on it (as opposed to fitting logistic regression models on individual-level observation data).
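A rough sketch of this grouping step in pandas, assuming the raw activity dataframe is called `activity` and is shaped like the per-user table above (names are illustrative):

```python
# Collapse 1.4M user-day rows into binomial counts per attribute combination
grouped = (
    activity
    .groupby(
        ["tenure_day", "day_of_week", "is_newsletter_day", "signup_day_of_week"],
        as_index=False,
    )
    .agg(visited=("visit", "sum"), total_users=("visit", "size"))
)
grouped["not_visited"] = grouped["total_users"] - grouped["visited"]
```

The first few rows of the grouped dataset look like this: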
|   | day_of_week | tenure_day | visited | total_users | is_newsletter_day | signup_day_of_week | not_visited |
|---|-------------|------------|---------|-------------|-------------------|--------------------|-------------|
| 0 | Mon | 1 | 2097 | 2453 | 0 | Mon | 356 |
| 1 | Tue | 1 | 6070 | 7511 | 0 | Tue | 1441 |
| 2 | Wed | 1 | 3779 | 4987 | 0 | Wed | 1208 |
| 3 | Thu | 1 | 3725 | 4994 | 0 | Thu | 1269 |
| 4 | Fri | 1 | 11285 | 15065 | 0 | Fri | 3780 |
Model specifications¶
Modelling tenure effects¶
Most of the model specification here is quite straightforward – we want an intercept, plus indicator variables to capture day-of-week and signup-day effects. There are more complex approaches one could take – for example, if you believe that day-of-week effects are likely more similar between Monday and Tuesday than between Monday and Saturday, you could capture that via a Gaussian Process prior in a Bayesian setting – but I’ll stick to these.
What about tenure effects, however? I’ll explore 3 possibilities:
- MLE option #1 – I’ll treat days as a continuous variable, and fit a single parameter. While it’s linear in logit space, it’s not linear in probability space and may be good enough to capture what’s needed.
- MLE option #2 – I’ll treat days as a discrete variable, with each day getting a dummy indicator.
- Bayesian model – I’ll treat days as a discrete ordinal variable, and force a monotonic effect. This is one of my favourite aspects of Bayesian modelling – if I have specific domain assumptions I want to make, I can bake them into the model. I’ll enforce monotonicity by modelling tenure effects as $T_n = \beta \sum_{i=1}^{n}{p_i}$, where `β` represents the full tenure impact and `p_i` is the proportion that tenure day `i` contributes. This can be easily achieved with a Dirichlet prior on `p`.
MLE model specifications¶
In a nutshell, the MLE models I end up with are:
- MLE #1: `visits | days ~ α + β*tenure_day + I(day_of_week) + I(signup_day_of_week) + I(is_newsletter_day)`
- MLE #2: `visits | days ~ α + I(tenure_day) + I(day_of_week) + I(signup_day_of_week) + I(is_newsletter_day)`
The only thing to keep in mind is that I have to drop one of the levels in each of the dummies (I chose to drop tenure day 1, and Monday for the day-of-week variables), so the intercept will represent “visits on a Monday on the 1st tenure day” in the case of the first model, and “visits on a Monday” in the second model.
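To make the specifications concrete, here is one way they could be fit with `statsmodels`, using the grouped dataframe from the sketch above. This is a sketch of the approach rather than the notebook’s exact code: with a Binomial family, `sm.GLM` accepts a two-column (successes, failures) outcome, and `patsy`’s `C()` handles the dummy coding, dropping the first level of each factor as the reference:

```python
import patsy
import statsmodels.api as sm

# Two-column (successes, failures) outcome for the binomial likelihood
endog = grouped[["visited", "not_visited"]]

# MLE #1: tenure day as a continuous variable
X_cont = patsy.dmatrix(
    "tenure_day + C(day_of_week) + C(signup_day_of_week) + is_newsletter_day",
    grouped, return_type="dataframe",
)
mle_continuous = sm.GLM(endog, X_cont, family=sm.families.Binomial()).fit()

# MLE #2: tenure day as a discrete variable (one dummy per tenure day)
X_disc = patsy.dmatrix(
    "C(tenure_day) + C(day_of_week) + C(signup_day_of_week) + is_newsletter_day",
    grouped, return_type="dataframe",
)
mle_discrete = sm.GLM(endog, X_disc, family=sm.families.Binomial()).fit()

print(mle_discrete.summary())
```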
Bayesian model specification¶
When it comes to the Bayesian model, the approach will be similar, but with a few tweaks:
Instead of having “Monday, tenure day 1” as a reference baked into the intercept, I’ll model the day-of-week effects (both signup day and active day) as hierarchical, sharing a common underlying intercept. Not only does this allow me to avoid choosing an arbitrary reference value, but I can also capture any correlation between day-of-week parameters.
I will force the tenure effects to be monotonic, as already described above.
These structural assumptions will make the Bayesian model behave a bit differently from the MLE ones.
There are also priors to specify:
- For the underlying intercept `α`, I will go with a prior `N(0.5, 2)`. As α is on the logit scale, this corresponds to a base probability of observing a visit of 63% and is wide enough to cover pretty much the entire 0-1 space. I want to set the mean above zero, given I believe that the motivation (== likelihood to log in on day 1) is definitely above 50%.
- For each of the signup day effects `S` and day-of-week effects `D`, I will use a prior `N(0, 0.5)`. I don’t have strong beliefs that these effects are positive or negative (thus a mean of zero), but I do think that the effects, if any, are unlikely to be dramatic. ~95% of `N(0, 0.5)` observations fall into the range `[-1, 1]`, which, for an average `α`, means I expect these effects can individually move the base probability from 50% to as low as 26% (`invlogit(-1)`) or up to 73%. That sounds quite reasonable to me.
- For the newsletter day effect `γ`, I pick a prior `N(1, 0.5)`. This is basically a belief that the effect should be positive, but I am not sure by how much, as this prior largely explores the `[0, 2]` logit space.
- I believe that the tenure effect `β` is negative, and quite strongly so – it represents the decrease in the base probability of a visit by tenure day 28. I’ll set the prior to `Normal(-2, 1)`, which translates to an expectation of an 11% visit probability by day 28. As its standard deviation is 1, it will largely explore the space between [0, -4], which represents my belief that such an effect should be negative. An alternative could be to force it to be negative using a `-Exp()` prior.
- Reasoning about priors for `p ~ Dirichlet()` is not straightforward, as we also need to take the `α` and `β` parameters into account, and this is all in logit space. The easiest approach is to draw from the priors and see what happens. It turns out that `p ~ Dirichlet(1, 1, 1, ..., 1)` is a very reasonable prior, so I will stick with that, though as is quite clear in the illustrations below, other choices may be more appropriate depending on your beliefs.
- Because the hierarchical effects are multivariate, there’s also a prior on the correlation structure between them to specify. I am using the `pymc.LKJCholeskyCov` prior for a non-centered parameterization, which requires an `η` parameter to specify beliefs about correlation strengths. After doing a similar simulation, I chose `η=2`, as it represents a prior concentrated in the [-0.75, 0.75] space. For comparison, `η=1` is a flat prior, while `η=4` is concentrated in the [-0.5, 0.5] space, and `η=10` in the [-0.25, 0.25] space.
How much do these precise choices matter? That’s something I’ll explore in this post. Besides using the “hand-picked” priors listed above, I will also see what happens if I just set `N(0, 5)` priors for most parameters.
This is the full specification of the model: $$ α \sim N(0.5, 2) \\ β \sim N(-2, 1) \\ \gamma \sim N(1, 0.5) \\ D_d, S_s \sim MvN(0, 0.5) \quad \forall \quad d=1..7, s=1..7 \\ T_i \sim \text{Dirichlet}(1,\dots,1) \quad \forall \quad i=2..28 \\ \eta = \alpha + \beta \sum_{i=2}^{t}T_i + \gamma N_t + D_d + S_s \\ \text{total visits} \sim \text{Binomial}(p=\text{invlogit}(\eta), N=\text{total users}) $$
where $N_t$ is an indicator variable for whether tenure day $t$ falls on the user’s newsletter day, $t$ is the current tenure day, $d$ is the current day of the week, and $s$ is the user’s signup day of the week.
Or, if you prefer a more visual view:
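And here is roughly how that specification could be written in pyMC. This is a sketch under a few assumptions: the grouped dataframe is called `df` with integer-coded columns (`day_of_week_idx` and `signup_day_of_week_idx` running 0-6, `tenure_day` running 1-28), and the two sets of day-of-week effects are drawn as correlated 7-dimensional offsets via a non-centered `LKJCholeskyCov` parameterization. The actual model in the notebook may structure the hierarchy slightly differently:

```python
import numpy as np
import pymc as pm
import pytensor.tensor as pt

N_DAYS, N_TENURE = 7, 28

with pm.Model() as model:
    alpha = pm.Normal("alpha", mu=0.5, sigma=2)    # underlying intercept (logit scale)
    beta = pm.Normal("beta", mu=-2, sigma=1)       # full tenure effect reached by day 28
    gamma = pm.Normal("gamma", mu=1, sigma=0.5)    # newsletter-day effect

    # Correlated day-of-week offsets: one 7-vector for the active day (D)
    # and one for the signup day (S), sharing a covariance structure.
    chol, corr, stds = pm.LKJCholeskyCov(
        "chol", n=N_DAYS, eta=2, sd_dist=pm.Exponential.dist(1.0), compute_corr=True
    )
    z_raw = pm.Normal("Z_raw", 0, 1, shape=(2, N_DAYS))
    dow_offsets = pm.Deterministic("Z", pt.dot(z_raw, chol.T))
    D, S = dow_offsets[0], dow_offsets[1]

    # Monotonic tenure effect: beta is spread over days 2..28 via Dirichlet shares,
    # accumulated so the effect is 0 on day 1 and reaches beta by day 28.
    p_tenure = pm.Dirichlet("p_tenure", a=np.ones(N_TENURE - 1))
    tenure_cum = pt.concatenate([pt.zeros(1), pt.cumsum(p_tenure)])

    eta = (
        alpha
        + beta * tenure_cum[df["tenure_day"].values - 1]
        + gamma * df["is_newsletter_day"].values
        + D[df["day_of_week_idx"].values]
        + S[df["signup_day_of_week_idx"].values]
    )

    pm.Binomial(
        "visits",
        n=df["total_users"].values,
        p=pm.math.invlogit(eta),
        observed=df["visited"].values,
    )
```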
Using prior predictive simulations to understand priors¶
Now, I have to admit I did not immediately come up with the handpicked priors above. I kept fiddling with them while writing this post. One helpful tool in that process was prior predictive simulation. The idea is that you sample from the priors and see what kind of predictions your model arrives at before it even tries to learn any parameters. It’s similar to what I did when exploring how to set the Dirichlet prior earlier, but with a focus on predictions. PyMC provides an easy way to obtain prior predictive samples with `pm.sample_prior_predictive()`.
The distribution of predicted values by itself is not that interesting. What helped me the most was to visualize them along a dimension of the data (conditional distributions). Specifically, I chose to do that along tenure – after all, that’s the most interesting aspect in this case study. Below, you can see the distribution of success probability obtained at each tenure day by sampling from the two sets of priors.
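A rough sketch of how such a check could be run, reusing the `model` and grouped dataframe `df` assumed in the earlier sketch. The prior-implied probability per row is approximated here by dividing the simulated visit counts by the group sizes:

```python
import matplotlib.pyplot as plt

with model:
    prior_idata = pm.sample_prior_predictive(500, random_seed=42)

# Simulated visit counts have shape (chains, draws, rows); divide by group
# sizes to get an approximate prior-implied visit probability per row.
visits_prior = prior_idata.prior_predictive["visits"].values
p_implied = visits_prior / df["total_users"].values

# Conditional distributions along tenure: early, mid and late tenure days
fig, ax = plt.subplots()
for day in (1, 14, 28):
    mask = (df["tenure_day"] == day).values
    ax.hist(p_implied[..., mask].ravel(), bins=30, alpha=0.5, density=True, label=f"day {day}")
ax.set_xlabel("prior-implied visit probability")
ax.legend()
plt.show()
```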
Suddenly, it’s crystal clear why wide priors may make inference difficult. They effectively imply that the visit probability is highly bimodal, and I can see why it may make the sampler work overtime. It also shows that my current prior choice could be further improved. While it seems to work well in early and late tenure days, I expected to see a bell-like curve in mid-tenure days; instead, the prior choices yield a largely uniform probability.
It turns out that to achieve such a shape, all I need to do is reduce the variance of the intrinsic motivation prior from `N(0.5, 2)` to `N(0.5, 0.7)`. I found prior predictive simulations very powerful in the prior-setting process!
Running the models¶
Let’s go ahead and run the models. The MLE models fit in under a second – you don’t even need to think about it. When it comes to the pyMC model, even using the blackjax sampler (which is much faster than the pyMC built-in one), it still takes ~25 seconds. Not too bad, but not exactly instantaneous.
Additionally, as Bayesian models require a bit more care, let’s have a look at the r-hat statistics for key variables and their trace plots to make sure that the model converged.
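For reference, the sampling and diagnostics steps could look roughly like this, again reusing the `model` from the earlier sketch. The `nuts_sampler="blackjax"` option needs the optional JAX dependencies and can simply be dropped to use PyMC’s default NUTS sampler:

```python
import arviz as az

with model:
    idata = pm.sample(nuts_sampler="blackjax", random_seed=42)

# r-hat close to 1 and healthy effective sample sizes suggest convergence
print(az.summary(idata, var_names=["alpha", "beta", "gamma"]))

# Visual check: chains should overlap and mix well
az.plot_trace(idata, var_names=["alpha", "beta", "gamma"])
```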
There are some divergences, but nothing too worrisome – let’s proceed to comparing what the models learned.
Comparing the model performance¶
As can be seen from the plots below, both the discrete MLE approach and the Bayesian one yield very similar results. The Bayesian model is more certain about the signup effects being zero, and is a bit less off when estimating the newsletter impact. But overall, both models perform great.
The continuous MLE model, on the other hand, struggles to capture the tenure effects. The linear specification just doesn’t cut it here. Perhaps a more creative approach (a log or polynomial specification) would work better. The upside is that it is the most precise in capturing the newsletter and day-of-week effects.
So is the effort in setting up the Bayesian model worth it? Well, it definitely did better than the continuous MLE model; but the performance compared to the discrete MLE model is nearly identical. It looks like we did not benefit that much from the structures built into it.
The only real advantage is that we were able to estimate the baseline motivation, and the model also learned that there is a lot of variation in it. In fact, the model is still overly confident about it: the 95% credible interval spans roughly the 0.7-0.8 range, while in reality the data generation process implies a broader range (you can look back at the user motivation histogram above for a visual illustration).
Beware of correlations¶
Talking about uncertainty: this turns out to be a great example of why correlations in Bayesian models matter. If you revisit the trace plots, it may strike you as surprising that the Bayesian model shows such low uncertainty about day-of-week and signup-day effects in the marginal effects plots. Specifically, let’s look at the uncertainty of the offset parameters that represent the difference from the intercept:
|   | mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat |
|---|------|----|--------|---------|-----------|---------|----------|----------|-------|
| Z[Mon] | 1.173 | 0.514 | 0.231 | 2.152 | 0.019 | 0.013 | 739.0 | 1143.0 | 1.00 |
| Z[Tue] | 0.753 | 0.438 | -0.081 | 1.563 | 0.017 | 0.012 | 689.0 | 965.0 | 1.00 |
| Z[Wed] | 0.324 | 0.388 | -0.446 | 1.016 | 0.015 | 0.011 | 662.0 | 951.0 | 1.00 |
| Z[Thu] | -0.032 | 0.372 | -0.677 | 0.724 | 0.014 | 0.010 | 680.0 | 1141.0 | 1.00 |
| Z[Fri] | 0.004 | 0.372 | -0.707 | 0.703 | 0.014 | 0.010 | 675.0 | 1081.0 | 1.00 |
| Z[Sat] | -0.785 | 0.430 | -1.599 | 0.022 | 0.015 | 0.010 | 875.0 | 1590.0 | 1.01 |
| Z[Sun] | -1.547 | 0.573 | -2.617 | -0.463 | 0.017 | 0.012 | 1082.0 | 1641.0 | 1.00 |
Standard deviations are at least 0.4. So how come that is not visible in the marginal effects plot? Well, it turns out that the parameters are highly correlated. Whenever in the trace the Monday effect is high, so is the Friday one, and thus the marginal effect that reflects the difference between Monday and Friday is small. We can clearly see that in the pair-wise correlation plots.
The takeaway here is twofold:
- The model may benefit from a specification similar to the one in MLE (setting a reference day). We would lose the ability to measure the underlying motivation, but it would reduce correlations among variables, which may help with sampling.
- It’s important to compute the effects we care about. Below I visualize seemingly the same thing – the “effect of a day of the week” – but in one chart I show it against the baseline motivation, while in the other I use Monday as the reference. While the point estimates are comparable, the uncertainty looks very different! That’s because it represents a different kind of uncertainty. (A small sketch of computing such a derived contrast from the posterior follows below.)
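To illustrate, here’s a small sketch of computing such a contrast directly from the posterior draws. It assumes the trace object is called `idata` and the offsets live in a variable `Z` with a `day` coordinate – naming assumptions for illustration, not the notebook’s exact structure:

```python
import numpy as np

# Difference between the Friday and Monday offsets, computed draw by draw,
# so the correlation between the two parameters carries through.
z = idata.posterior["Z"]                      # dims: (chain, draw, day) -- assumed
fri_vs_mon = (z.sel(day="Fri") - z.sel(day="Mon")).values.ravel()

print(f"Fri - Mon: mean {fri_vs_mon.mean():.2f}, "
      f"94% interval {np.percentile(fri_vs_mon, [3, 97]).round(2)}")
```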
Do priors matter?¶
I promised to try out a wider set of priors, where instead of carefully thinking about all the effects, I set `N(0, 5)` priors for all applicable parameters. What happens then? Turns out that:
- The model still converges (r-hat values < 1.03);
- However, there are many more divergences (the sampler is rejecting unlikely values)
- Inference results are largely identical.
Modelling a more difficult scenario¶
The data used so far had no signup-day effects. Would the models be able to recover them if there were any? Also, the data had a lot of signal. What if we reduce the size of the tenure effect to make it harder to figure out what is what?
I fit two MLE models as before, as well as two Bayesian ones – using my hand-picked priors as well as the wider ones.
Solving convergence issues¶
However, the Bayesian models do not converge.
A peek into the trace shows that 3 out of 4 chains actually converged nicely. However, the last one got completely stuck.
I spent a LOT of time understanding what happens, and I may write it up as a separate blog post. As MCMC is stochastic, debugging was quite a pain. Models would seemingly converge until they would not. One set of priors would work on a given day and not the next. In the end, the main learnings I took away were (in order of importance):
- Chains getting completely stuck was the main issue. Increasing the `target_accept` parameter (decreasing the “step size” in the HMC algorithm) helped solve it. I suppose this issue was conceptually similar to gradient descent getting stuck in a local optimum when the step size / learning rate is too high.
- Using wider priors would increase the risk of problematic divergences. This behavior was similar to what was already visible in the initial scenario, but the scale of the problem went up.
- My initial prior set wasn’t the best for the new dataset, as it encoded an assumption of strong tenure effects (`β ~ N(-2, 1)`). While it did not have a major effect, using something like `β ~ N(-1, 0.5)` yielded the “nicest” traces.
Let’s refit the model using `target_accept=0.95` and a set of handpicked priors (where `β ~ N(-1, 0.5)`).
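In code, the refit boils down to something like this (a sketch; `model_harder` is a placeholder name for the model built on the harder dataset with the adjusted `β` prior):

```python
with model_harder:
    idata_harder = pm.sample(
        target_accept=0.95,       # smaller steps -> fewer stuck chains and divergences
        nuts_sampler="blackjax",
        random_seed=42,
    )
```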
Result time¶
How do the models perform? Again, the performance is very similar between the discrete MLE and the Bayesian model (and the continuous MLE approach just doesn’t work…).
I think it’s worth zooming in on the tenure effects. This is a good illustration of where the extra structure in the Bayesian model finally starts to pay off. While the Bayesian model does not recover a perfectly smooth curve, it is, by design, monotonic. The MLE model, on the other hand, is clearly affected by noise.
Modelling varying tenure effects¶
The last scenario I want to explore is where user motivation has a stronger impact on tenure effects. So far, the data generation process assumed that:
- A user has an intrinsic motivation level `α`
- A fixed tenure impact `β`, which is spread over the tenure period

In the real world, however, that’s probably not how motivation works. If you are motivated, you don’t just have a higher initial motivation level; you likely also lose less motivation over time (`β` is smaller in absolute size). Let’s generate one last dataset where `β` varies by signup day of the week, similar to how `α` is modelled.
How could this be captured in the models? In the case of MLE, the most natural representation is an interaction effect between signup day and the tenure effect. That’s a bit problematic for the discrete MLE model, however, as it means adding `27*6=162` additional parameters to the model.
In the case of the Bayesian model, it’s a bit simpler – we just need to draw a 7-dimensional `β` (so that’s only 6 additional parameters). Not only that, we can draw them from the same multivariate Normal prior as the other day-of-week effects, which should allow capturing the correlation between the intercept and the tenure effect associated with the same signup day of week.
Because this post is already very long, I will just do the Bayesian version. MLE may or may not work, but it’s too cumbersome to work through at this point.
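Here’s a rough sketch of that extension, again under the same assumptions about the grouped dataframe `df` as in the earlier sketches. In this version the signup-day intercept offset and the signup-day tenure effect are drawn jointly from a 2-dimensional multivariate Normal (via `LKJCholeskyCov`) so their correlation can be learned; the notebook’s actual model may wire the hierarchy differently:

```python
import numpy as np
import pymc as pm
import pytensor.tensor as pt

N_DAYS, N_TENURE = 7, 28

with pm.Model() as varying_model:
    alpha = pm.Normal("alpha", mu=0.5, sigma=0.7)
    gamma = pm.Normal("gamma", mu=1, sigma=0.5)
    beta_mu = pm.Normal("beta_mu", mu=-1, sigma=0.5)        # average tenure effect

    dow = pm.Normal("dow", mu=0, sigma=0.5, shape=N_DAYS)   # active-day effects

    # Correlated (signup-day intercept offset, tenure-effect offset) pairs, one per day
    chol, _, _ = pm.LKJCholeskyCov(
        "chol", n=2, eta=2, sd_dist=pm.Exponential.dist(1.0), compute_corr=True
    )
    z_raw = pm.Normal("z_raw", 0, 1, shape=(N_DAYS, 2))
    signup_pairs = pt.dot(z_raw, chol.T)
    S = signup_pairs[:, 0]                     # signup-day intercept offsets
    beta = beta_mu + signup_pairs[:, 1]        # per-signup-day tenure effects

    # Monotonic tenure curve shared across signup days
    p_tenure = pm.Dirichlet("p_tenure", a=np.ones(N_TENURE - 1))
    tenure_cum = pt.concatenate([pt.zeros(1), pt.cumsum(p_tenure)])

    signup_idx = df["signup_day_of_week_idx"].values
    eta = (
        alpha
        + beta[signup_idx] * tenure_cum[df["tenure_day"].values - 1]
        + gamma * df["is_newsletter_day"].values
        + dow[df["day_of_week_idx"].values]
        + S[signup_idx]
    )

    pm.Binomial(
        "visits",
        n=df["total_users"].values,
        p=pm.math.invlogit(eta),
        observed=df["visited"].values,
    )
```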
The model takes a bit longer to run (~40 seconds), but converges without major issues. Trace plots look good, too.
What about the effects? Here’s how they look. Beautiful, isn’t it?
It’s a wrap¶
This post took muuuuuch longer than expected. My personal takeaways are:
- Bayesian models are great, but only when you need the complexity they can support (the last scenario is the best example) or you know you’ll need flexible uncertainty estimates. I haven’t done much of the latter in this post, but answering questions such as “How much more likely is it that a user who signed up on Friday will make at least 10 visits in the first 28 days than one who signed up on Monday?” is a breeze in a Bayesian setting. In other settings, the ROI may just not be there.
- Priors matter, but not so much. I could have used `N(0, 5)` priors for most of the analysis in this blog post, and fiddling with the step size instead could have been sufficient. I would have had to deal with more divergences, but depending on the robustness I am looking for, that may have been OK.
- Reasoning about priors is not that hard, and prior predictive simulations are your friend. Without them, I would have never realized what less informative priors really imply in these models. They may have been uninformative individually, but due to the logit link, they were very informative in probability space.
- The stochastic nature of MCMC can drive you crazy while debugging. There’s a reason I ended up using random seeds everywhere in the code behind this post -_-.
Those are the objective conclusions.
Subjectively, I find it very satisfying to put together a custom model that represents a structure I believe in. It’s like playing with LEGO – you start small, but your imagination (and computational power) is the limit. In this case, one could add other seasonality effects, vary tenure curve shapes, or capture individual-level effects. And it wouldn’t be just throwing an ever-increasing number of indicator variables together.
To me, that’s the beauty of Bayesian models – they allow modelling complexity without losing interpretability. I think they have their place in one’s modelling toolkit for situations when simple GLMs don’t cut it, but ML methods are not appropriate, either.