Rethinking-inspired¶
For the past few months, I’ve been spending a lot of my time following the Statistical Rethinking course by Richard McElreath. I’d heard good things about it before, and I wanted to get deeper into Bayesian workflows and get some exposure to causal inference – and it’s rare to find a course that delivers both together.
It was totally worth it. Not only did the course help me get much more comfortable with the topics covered, but its usefulness goes beyond them: the “owl drawing framework” / generative thinking is helpful in so many situations, and I now have a bunch of topics I’d like to dive deeper into (state-space models in particular seem exciting).
It so happened that I had a problem at work that was nearly perfect to address using the techniques learned in the course. I figured I’d put a post together – perhaps some will find the problem itself interesting (it’s a pretty common one in subscription businesses, I suppose), and others may get inspired to learn about Bayesian modelling themselves.
P.S. Mid-way through the course, I started converting the lecture materials & homeworks into PyMC (5.X), too. Just in case that’s handy for anyone (I’m planning to clean them up & share them more widely one day).
P.P.S. This blog post doesn’t contain any code cells. If you’d rather read the notebook with all the code cells visible, you can find it in my blog post repository.
Problem at hand – estimating drop-off rate over tenure¶
Here’s the problem I’ll be solving in this post. Suppose you are in the business of providing a subscription-based service, where “success” means interacting with your service near daily. It could be pretty much anything – ChatGPT, Spotify, Netflix, a to-do app. Let’s also suppose that usage of your service exhibits quite strong seasonality. In this example, I’ll be working with day-of-week effects, but it could be anything from intra-day (hourly) to intra-year (monthly) effects. Additionally, suppose user signup patterns are also seasonal (i.e. we get many more signups on some days of the week than on others). Finally, just like any subscription service, you have a large fraction of users who just don’t stick with your service – they try it for a few days and drop off (even if they have at least a month of commitment).
Understanding the drop-off rate (you could call it “retention curve” – though it’s not exactly retention) in such a business is very important. It helps inform decisions on “how long you have” to engage users before they start dropping off, it is one of the best ways to compare different signup cohorts, and it is a key input to customer lifetime value estimation.
The simplest approach to getting such a curve is plain descriptive analysis. Suppose you have data on whether a user was active on your service for each of the first 28 tenure days, for all users who joined in the last 3 months. You could calculate the average % of users who were active on the service on each tenure day and plot it to get a nice, smooth curve, which you include in the presentation to the CEO and move on with your life.
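For concreteness, here’s a minimal sketch of that descriptive calculation, assuming a `df` of user-day observations with a 0/1 `visit` column (the same shape as the simulated data introduced later in the post):

```python
# Naive descriptive curve: average share of users active on each tenure day.
naive_curve = df.groupby("tenure_day")["visit"].mean()
naive_curve.plot(xlabel="Tenure day", ylabel="Share of users active")
```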
There’s just one potential problem. Your curve may instead look like this:
What is going on? Why are there these bumps?
The fact that they repeat nearly every 7 days may give away something: it’s seasonality effects. The above curve is calculated from a simulated dataset where there are two “seasonal” effects:
- An unequal number of users start their tenure on a given day of the week (e.g. we get more signups on Monday)
- Users are more likely to be active on certain days of the week, independent of tenure
In fact, a descriptive analysis may fall short in other ways, too.
Here’s a chart from another simulated dataset where only the number of signups varies by day of week. There is no seasonality in visit probabilities, and the tenure effects are strictly monotonic. However, if you were to plot % of users active by day of week, you’d find they tend to be more active towards the end of the week, and you might even conclude that there are differences in how active users are, on average, over the first 28 days of their tenure depending on the day of the week they signed up.
But both of these conclusions would be false for this dataset. It’s merely a function of concentrated signups (50% of them take place on a Friday, with the rest distributed roughly equally over the week).
This problem is a good example of something that would typically land on a data analyst’s desk, not a data scientist’s. However, getting the right answers requires a bit of “data science” (which is why, in my view, a good data analyst needs to know statistics & ML; it’s just that their focus is on a different kind of problem).
Could we build a (Bayesian) model to uncover the true data generating parameters?
As we care about interpretability, we’ll have to stick to regressions. This is inherently a logistic regression problem (“will a user visit on day X?”), which can also be expressed as a binomial regression, as long as we don’t need user-specific parameters and can group observations into cells that share the same characteristics. While my focus is on using a Bayesian approach (with `pymc`), I will also have a more frequently used (pun intended) MLE version modelled with the `statsmodels` package.
The data generating process¶
Throughout this post, I will be working with simulated data. That’s one of the very useful things I took away from the Statistical Rethinking course. If you can control the data, it’s easy to know if a model is recovering the true data generating process. That gives some hope the model will work with real data, too.
So, here’s the setup. First, I create 50,000 users who have 3 attributes:
- Signup day of week. I can control the proportions of users signing up across days of the week. I visualize the weights assigned to each day below.
- Intrinsic motivation, which I can’t ever observe in real life, but which represents the baseline chance that a user will visit the app on any given day. I hypothesise that users come to the service with different motivation levels. I can make motivation levels differ depending on the signup day of the week (e.g. “users who sign up on Friday tend to be more motivated”). Also, I can allow for different levels of heterogeneity in baseline motivation among users. I achieve that by drawing a unique baseline motivation level for each user from a Beta distribution. For starters, I keep motivation levels the same for every day of the week, with Beta distribution parameters α=30, β=10 (so the average intrinsic motivation is ~0.75).
- User’s newsletter day of the week. Just to make things a bit more interesting, I imagine each user gets a newsletter (which increases their chance of using the service on that day), but the day each user gets the newsletter varies. The newsletter “effect” is +10ppt on the chance to visit on that day. (A code sketch of drawing these users follows right after this list.)
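A minimal sketch of drawing these users (the signup weights and column names are my illustrative assumptions; the Beta(30, 10) motivation draw and the Friday-heavy signups come from the text):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_users = 50_000
days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

# Signup weights: ~50% of signups on Friday, the rest roughly equal
# (assumption matching the second simulated dataset described above).
signup_weights = np.array([0.1, 0.1, 0.1, 0.1, 0.5, 0.05, 0.05])

users = pd.DataFrame({
    "user_id": np.arange(n_users),
    "signup_day_of_week": rng.choice(days, size=n_users, p=signup_weights),
    # Intrinsic (unobservable) motivation: Beta(30, 10), mean ~0.75
    "motivation": rng.beta(30, 10, size=n_users),
    # Newsletter day assigned uniformly at random (assumption)
    "newsletter_day_of_week": rng.choice(days, size=n_users),
})
```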
Once I have drawn the users, I simulate their activity data.
- Each day, I draw a Bernoulli variable with probability of success `p`, where `p` is a function of the intrinsic motivation baseline, a tenure effect, a day-of-week effect, and a newsletter effect (if the day is the newsletter day for the user).
- For simplicity, I just sum up the effects. When the final probability `p` is very low (<0.01) or very high (>0.99), I clip it to 0.01 or 0.99, respectively, so that it’s never 100% certain whether a user will use the service or not.
- Day-of-week effects are just arbitrarily chosen offsets (visualized below).
- I model the tenure effect in two components. First, I produce a curve using the function $f(x)= \frac{-4}{x^{0.01}}$, where $x$ is a tenure day; then I rescale it to the 0–1 range; and finally I apply an overall tenure effect. In a nutshell, I get a smooth tenure effect which starts at 0 and reaches the overall effect by the last tenure day. That is visualized below, too. (A sketch of the full simulation follows this list.)
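Continuing from the previous sketch, here’s roughly how that simulation can be written. The day-of-week offsets and the size of the overall tenure effect are illustrative assumptions; the +10ppt newsletter effect, the [0.01, 0.99] clipping and the tenure curve come from the text:

```python
n_days = 28
overall_tenure_effect = -0.5  # assumption: total drop in visit probability by day 28

# Tenure curve: f(x) = -4 / x**0.01, rescaled to [0, 1], times the overall
# effect – it starts at 0 on day 1 and reaches the overall effect on day 28.
x = np.arange(1, n_days + 1)
f = -4 / x ** 0.01
tenure_curve = (f - f.min()) / (f.max() - f.min()) * overall_tenure_effect

# Arbitrary day-of-week offsets (assumption)
dow_effects = dict(zip(days, [0.0, -0.02, -0.02, 0.0, 0.03, 0.05, 0.04]))

records = []
for u in users.itertuples():
    start = days.index(u.signup_day_of_week)
    for t in range(1, n_days + 1):
        dow = days[(start + t - 1) % 7]
        is_newsletter_day = int(dow == u.newsletter_day_of_week)
        p = (
            u.motivation                # user's intrinsic baseline
            + tenure_curve[t - 1]       # monotonic tenure effect
            + dow_effects[dow]          # day-of-week effect
            + 0.10 * is_newsletter_day  # +10ppt newsletter effect
        )
        p = min(max(p, 0.01), 0.99)     # never fully certain either way
        records.append((u.user_id, u.signup_day_of_week, u.newsletter_day_of_week,
                        t, dow, rng.binomial(1, p), is_newsletter_day))

df = pd.DataFrame(records, columns=[
    "user_id", "signup_day_of_week", "newsletter_day_of_week",
    "tenure_day", "day_of_week", "visit", "is_newsletter_day",
])
```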
Here’s how the simulated data looks in the end:
| | user_id | signup_day_of_week | newsletter_day_of_week | tenure_day | day_of_week | visit | is_newsletter_day |
|---|---|---|---|---|---|---|---|
| 0 | 0 | Wed | Sun | 1 | Wed | 1 | 0 |
| 1 | 0 | Wed | Sun | 2 | Thu | 1 | 0 |
| 2 | 0 | Wed | Sun | 3 | Fri | 0 | 0 |
| 3 | 0 | Wed | Sun | 4 | Sat | 0 | 0 |
| 4 | 0 | Wed | Sun | 5 | Sun | 0 | 1 |
For modelling purposes, I will be working with data grouped by tenure day, day of week, whether it’s a newsletter day, and user signup day of week. This allows me to reduce the size of the dataset from 1.4M rows (50,000 users × 28 days) to just 252 (all combinations of user-day attributes). Then I will fit binomial regression models on it (vs. fitting logistic regression models on individual-level observation data). (A sketch of this aggregation follows the table below.)
| | day_of_week | tenure_day | visited | total_users | is_newsletter_day | signup_day_of_week | not_visited |
|---|---|---|---|---|---|---|---|
| 0 | Mon | 1 | 2097 | 2453 | 0 | Mon | 356 |
| 1 | Tue | 1 | 6070 | 7511 | 0 | Tue | 1441 |
| 2 | Wed | 1 | 3779 | 4987 | 0 | Wed | 1208 |
| 3 | Thu | 1 | 3725 | 4994 | 0 | Thu | 1269 |
| 4 | Fri | 1 | 11285 | 15065 | 0 | Fri | 3780 |
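A minimal sketch of that aggregation with pandas, using the column names from the tables above:

```python
# Collapse 1.4M user-day rows into one row per attribute combination,
# with success/failure counts suitable for a binomial regression.
grouped = (
    df.groupby(["tenure_day", "day_of_week", "is_newsletter_day",
                "signup_day_of_week"], as_index=False)
      .agg(visited=("visit", "sum"), total_users=("visit", "size"))
)
grouped["not_visited"] = grouped["total_users"] - grouped["visited"]
```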
Model specifications¶
Modelling tenure effects¶
Most of the model specification here is quite straightforward – we want an intercept, plus indicator variables to capture day-of-week and signup-day effects. There are more complex approaches one could take – for example, if you believe that day-of-week effects are likely more similar between Monday and Tuesday than between Monday and Saturday, you could capture that via a Gaussian Process prior in a Bayesian setting – but I’ll stick to these.
What about tenure effects, however? I’ll explore 3 possibilities:
- MLE option #1 – I’ll treat days as a continuous variable, and fit a single parameter. While it’s linear in logit space, it’s not linear in probability space and may be good enough to capture what’s needed.
- MLE option #2 – I’ll treat days as a discrete variable, with each day getting a dummy indicator.
- Bayesian model – I’ll treat days as a discrete ordinal variable, and force a monotonic effect. This is one of my favourite aspects of Bayesian modelling – if I have specific domain assumptions I want to make, I can bake them into the model. I’ll enforce monotonicity by modelling tenure effects as $T_n = \beta \sum_{i=1}^{n}{p_i}$, where $\beta$ represents the full tenure impact, and $p_i$ is the proportion that tenure day $i$ contributes. This can be easily achieved with a Dirichlet prior on $p$ (see the sketch right after this list).
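A tiny `pymc` fragment illustrating this construction (a standalone sketch with names of my choosing; the full model comes later):

```python
import numpy as np
import pymc as pm

with pm.Model() as monotonic_sketch:
    beta = pm.Normal("beta", -2, 1)          # full tenure impact by day 28
    p = pm.Dirichlet("p", a=np.ones(28))     # each day's share of the impact
    # cumsum(p) rises monotonically from p[0] to exactly 1, so the effect
    # moves monotonically from near 0 towards beta over the 28 tenure days.
    T = pm.Deterministic("T", beta * pm.math.cumsum(p))
```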
MLE model specifications¶
In a nutshell, the MLE models I end up with are:
- MLE #1: `visits | days ~ α + β*tenure_day + I(day_of_week) + I(signup_day_of_week) + I(is_newsletter_day)`
- MLE #2: `visits | days ~ α + I(tenure_day) + I(day_of_week) + I(signup_day_of_week) + I(is_newsletter_day)`
The only thing to keep in mind is that I have to drop one of the levels in each set of dummies (I chose to drop tenure day 1 for the tenure dummies, and Monday for the day-of-week variables), so the intercept represents “visits on a Monday” (with the tenure term at zero) in the case of the first model, and “visits on a Monday on the 1st tenure day” in the second.
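In `statsmodels` terms, these can be sketched roughly as follows. The `visited + not_visited` two-column response for grouped binomial data and the `Treatment` reference levels are my assumptions about the encoding, not the post’s exact code:

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf

dow = "C(day_of_week, Treatment('Mon'))"
signup = "C(signup_day_of_week, Treatment('Mon'))"

# MLE #1: tenure day as a single continuous predictor
mle1 = smf.glm(
    f"visited + not_visited ~ tenure_day + {dow} + {signup} + is_newsletter_day",
    data=grouped, family=sm.families.Binomial(),
).fit()

# MLE #2: tenure day as dummies, with day 1 as the dropped reference level
mle2 = smf.glm(
    f"visited + not_visited ~ C(tenure_day) + {dow} + {signup} + is_newsletter_day",
    data=grouped, family=sm.families.Binomial(),
).fit()

print(mle2.summary())
```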
Bayesian model specification¶
When it comes to the Bayesian model, the approach will be similar, but with a few tweaks:
Instead of having “Monday, tenure day 1” as a reference baked into the intercept, I’ll model the day-of-week effects (both signup day and active day) as hierarchical, sharing a common underlying intercept. Not only does this allow me to avoid choosing an arbitrary reference value, it also lets me capture any correlation between day-of-week parameters.
I will force the tenure effects to be monotonic, as already described above.
These structural assumptions will make the Bayesian model behave a bit differently than the MLE ones.
There are also priors to specify:

- For the underlying intercept `α`, I will go with a prior `N(0.5, 2)`. As `α` is on the logit scale, this corresponds to a base probability of observing a visit of ~62%, and the prior is wide enough to cover pretty much the entire 0–1 space. I set the mean above zero because I believe that the motivation (== likelihood to log in on day 1) is definitely above 50%.
- For each of the signup day effects `S` and day-of-week effects `D`, I will use a prior `N(0, 0.5)`. I don’t have strong beliefs that these effects are positive or negative (thus a mean of zero), but I do think that the effects, if any, are unlikely to be dramatic. ~95% of `N(0, 0.5)` observations fall into the range `[-1, 1]`, which, for an average `α`, means these effects can individually move the base probability from 50% to as low as 27% (`invlogit(-1)`) or as high as 73%. That sounds quite reasonable to me.
- For the newsletter day effect `γ`, I pick a prior `N(1, 0.5)`. This is basically a belief that the effect should be positive, but I am not sure by how much, as this prior largely explores the `[0, 2]` logit space.
- I believe that the tenure effect `β` is negative, and quite strongly so – it represents the decrease in the base probability of a visit by tenure day 28. I’ll set the prior to `Normal(-2, 1)`, which translates to an expectation of roughly 11% visit probability by day 28. As its standard deviation is 1, it will largely explore the `[-4, 0]` space, which reflects my belief that such an effect should be negative. An alternative could be to force it to be negative using a negated exponential (`-Exp()`) prior.
- Reasoning about the prior for `p ~ Dirichlet()` is not straightforward, as we also need to take the `α` and `β` parameters into account, and this is all in logit space. The easiest approach is to draw from the priors and see what happens. It turns out that `p ~ Dirichlet(1, 1, ..., 1)` is a very reasonable prior, so I will stick with that – though, as is quite clear in the illustrations below, other choices may be more appropriate depending on your beliefs.
- Because the hierarchical effects are multivariate, there is also a prior to specify on the correlation structure between them. I am using `pymc.LKJCholeskyCov` for a non-centred parameterization, which requires an `η` parameter to specify beliefs about correlation strengths. After doing a similar simulation, I chose `η=2`, as it represents a prior concentrated in the [-0.75, 0.75] space. For comparison, `η=1` is a flat prior, while `η=4` is concentrated in the [-0.5, 0.5] space, and `η=10` in the [-0.25, 0.25] space.
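Putting it all together, here’s a hedged `pymc` sketch of the model described above. Variable names, the dimension handling, and the `sd_dist` scale prior for the Cholesky factor are my assumptions; the structure – a shared intercept, hierarchical correlated day-of-week effects, the monotonic Dirichlet tenure effect, and the priors – follows the text:

```python
import numpy as np
import pymc as pm

# Integer codes derived from the grouped data (assumed helper encodings)
dow_idx = grouped["day_of_week"].map(days.index).to_numpy()
signup_idx = grouped["signup_day_of_week"].map(days.index).to_numpy()
tenure_idx = grouped["tenure_day"].to_numpy() - 1
newsletter = grouped["is_newsletter_day"].to_numpy()

coords = {"dow": days, "tenure": np.arange(1, 29), "effect": ["active_day", "signup_day"]}

with pm.Model(coords=coords) as model:
    alpha = pm.Normal("alpha", 0.5, 2)   # shared intercept, logit scale
    gamma = pm.Normal("gamma", 1, 0.5)   # newsletter-day effect
    beta = pm.Normal("beta", -2, 1)      # total tenure effect by day 28

    # Correlated day-of-week effects (active day + signup day), non-centred.
    # Exponential(2) scale prior has mean 0.5, roughly matching the N(0, 0.5)
    # effect scale above (assumption).
    chol, corr, stds = pm.LKJCholeskyCov(
        "chol", n=2, eta=2, sd_dist=pm.Exponential.dist(2.0), compute_corr=True
    )
    z = pm.Normal("z", 0, 1, dims=("dow", "effect"))
    dow_eff = pm.Deterministic("dow_eff", pm.math.dot(z, chol.T), dims=("dow", "effect"))

    # Monotonic tenure effect via a Dirichlet over per-day contributions
    p = pm.Dirichlet("p", a=np.ones(28), dims="tenure")
    tenure_eff = pm.Deterministic("tenure_eff", beta * pm.math.cumsum(p), dims="tenure")

    logit_p = (
        alpha
        + dow_eff[dow_idx, 0]      # day-of-week (activity) effect
        + dow_eff[signup_idx, 1]   # signup-day effect
        + gamma * newsletter
        + tenure_eff[tenure_idx]
    )
    pm.Binomial("visits", n=grouped["total_users"].to_numpy(),
                p=pm.math.invlogit(logit_p),
                observed=grouped["visited"].to_numpy())

    idata = pm.sample()
```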