If linear regression is statistics/econometrics 101, then log transformations of dependent and independent variables and the associated interpretations must be statistics/econometrics 102.
Typically, you are told that:
- If you log-transform an independent variable, then the regression coefficient \( b \) associated with that variable can be interpreted as “for every 1% increase in the independent variable, the dependent variable increases by \( \frac{b}{100} \) units”.
- If you log-transform a dependent variable, then the regression coefficient \( b \) tells you that for every unit increase in the independent variable, the dependent variable increases by \( 100b\% \).
- If you log-transform both the dependent and independent variables, then the interpretation is that for every 1% increase in the independent variable, the dependent variable increases by \( b\% \). This is especially useful for economists studying price elasticities, for example.
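These three rules are easy to sanity-check numerically. Below is a minimal sketch in Python; the coefficient \( b \) and the value of \( x \) are arbitrary, and the intercept cancels out of every comparison, so it is omitted.

```python
import numpy as np

b = 0.05   # arbitrary coefficient for illustration
x = 50.0   # arbitrary starting value of the independent variable

# Linear-log: y = a + b*log(x). A 1% increase in x changes y by ~b/100 units.
exact_units = b * np.log(1.01 * x) - b * np.log(x)  # = b*log(1.01)
print(exact_units, b / 100)                         # 0.000497...  vs 0.0005

# Log-linear: log(y) = a + b*x. A unit increase in x changes y by ~100*b percent.
exact_pct = 100 * (np.exp(b) - 1)
print(exact_pct, 100 * b)                           # 5.127...%    vs 5.0%

# Log-log: log(y) = a + b*log(x). A 1% increase in x changes y by ~b percent.
exact_pct = 100 * (1.01 ** b - 1)
print(exact_pct, b)                                 # 0.04976...%  vs 0.05%
```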
If you are lucky, you may be told that you need to be careful with the interpretation of coefficients when your dependent variable is log-transformed and the coefficients are not small. And maybe even given the precise formula for the percentage effect: \( 100 \left( e^{b} - 1 \right) \). That’s exactly what happened in my class, and I am happy it nudged me to investigate this further. I think a lot of people may overestimate how accurate the percentage-based interpretations really are.
To begin with, where do all these percentages come from? The typical explanation relies on the fact that the first derivative of the natural logarithm is \( \frac{1}{x} \). But derivatives are all about “very small” changes in value, and arguably a 1 unit increase or decrease is not that small.
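To make this concrete, the rules of thumb are really just the first-order Taylor approximation of the logarithm:

$$ \log \left( \left( 1 + \delta \right) x \right) = \log \left( x \right) + \log \left( 1 + \delta \right) \approx \log \left( x \right) + \delta \text{ for small } \delta $$

A relative change of \( \delta \) in \( x \) thus shows up as an additive change of roughly \( \delta \) in \( \log \left( x \right) \); the approximation error is exactly the gap between \( \log \left( 1 + \delta \right) \) and \( \delta \), and it grows as \( \delta \) does.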
Here’s what the maths looks like for the simplest model \(y = a + bx\):
Log-transform of an independent variable (linear-log model):
$$ y_0 = a + b \log \left( x \right) \text{ (our “base case”) } $$
$$ y_1 = a + b \log \left( 1.01x \right) \text{ (x increases by 1%) } $$
$$ y_1 - y_0 = b \log \left( 1.01x \right) - b \log \left( x \right) = b \log \left( \frac{1.01x}{x} \right) = b \log \left( 1.01 \right) $$
In other words, when \(x\) increases by 1%, the change in \(y\) is not \( \frac{b}{100} \), but rather \( b \log \left( 1.01 \right) \). What is \( \log \left( 1.01 \right) \)? It’s approximately \( 0.00995 \), close to the \( 0.01 \) behind the rule we all learned, but about 0.5% off. And the gap widens as the change grows: for a 15% increase, the rule of thumb uses \( 0.15 \) where the exact factor is \( \log \left( 1.15 \right) \approx 0.1398 \), roughly 7% off.
Consider the following scenario. You are currently selling goods at a price of $100 and your sales last year were $10m (100,000 units). You are interested in raising the price by 15% and want to know what a reasonable sales budget for next year would be. A data scientist runs a linear-log regression of units sold on log-price, finds a coefficient of -90,000, and thus reports that for every 1% increase in price, you can expect to sell 900 fewer units. For a 15% price increase, a naive answer would be that 13,500 (90,000 * 15 / 100) fewer units will be sold. Your sales budget would thus be 115 * (100,000 - 13,500) ≈ $9.95m. You begrudgingly prepare yourself to explain to the higher-ups why a price increase will not result in higher sales.
A precise answer, however, is that you should expect \( 90{,}000 \log \left( 1.15 \right) \approx 12{,}579 \) fewer units sold. That would result in a sales budget of about $10.05m. A contrived example? Perhaps, but there surely are situations where even such a small difference is important enough.
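Here is the same arithmetic as a quick Python sketch, using the made-up figures from the scenario above:

```python
import numpy as np

price_old, price_new = 100.0, 115.0     # a 15% price increase
units_old = 10_000_000 / price_old      # $10m of sales at $100 -> 100,000 units
b = -90_000.0                           # linear-log coefficient from the scenario

# Naive rule of thumb: every 1% price increase costs -b/100 = 900 units.
units_naive = units_old + b / 100 * 15                        # 86,500 units
# Exact effect implied by the model: b * log(1.15).
units_exact = units_old + b * np.log(price_new / price_old)   # ~87,421 units

print(f"naive budget: ${price_new * units_naive:,.0f}")       # ~$9.95m
print(f"exact budget: ${price_new * units_exact:,.0f}")       # ~$10.05m
```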
Things get even more interesting in the log-linear model:
$$ \log \left( y_0 \right) = a + b x \text{ (our “base case”) } $$
$$ \log \left( y_1 \right) = a + b \left( x + 1 \right) \text{ (let’s add 1 to x) } $$
$$ \frac{y_1 - y_0}{y_0} = \frac { e^{a + b \left( x + 1 \right)} - e^{ a + b x } } { e^{ a + b x } } = \frac { e^{a + b x} \left( e^{b} - 1 \right) } { e^{ a + b x } } = e^{b} - 1 $$
Assuming \( b \) is small, \( e^{b} - 1 \approx b \). But what if it is not? Here’s how it looks graphically. If, say, your coefficient is equal to 1, then what it really means is not that “a 1 unit increase in the independent variable will result in a 100% increase in the response”. It’s actually \( 100 \left( e^{1} - 1 \right) \approx 172\% \). That’s quite a difference.
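A quick numerical comparison shows how fast the rule of thumb drifts as \( b \) grows; the coefficient values below are arbitrary:

```python
import numpy as np

for b in [0.01, 0.1, 0.5, 1.0]:
    naive = 100 * b                  # rule-of-thumb percentage change
    exact = 100 * (np.exp(b) - 1)    # exact percentage change
    print(f"b={b:4}: naive {naive:6.1f}%, exact {exact:6.1f}%")
# b=0.01: 1.0% vs 1.0% (harmless); b=1.0: 100.0% vs 171.8% (very misleading)
```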
As you may guess, in a log-log model the two effects compound, resulting in the following equation:

$$ \frac{y_1 - y_0}{y_0} = e^{b \log \left( 1.01 \right)} - 1 = 1.01^{b} - 1 $$

If you find that your coefficient is, say, -2, then its precise effect on the response is a 1.97% decrease rather than the 2% the rule of thumb suggests.
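The same check for the log-log case, again with arbitrary coefficients, shows the compounding error stays small unless \( \left| b \right| \) gets large:

```python
import numpy as np

for b in [0.5, 1.0, -2.0, 5.0]:
    naive = b                        # rule of thumb: b percent per 1% change in x
    exact = 100 * (1.01 ** b - 1)    # exact percentage change
    print(f"b={b:5}: naive {naive:6.2f}%, exact {exact:6.2f}%")
# b=-2 gives -1.97% instead of -2%; b=5 gives 5.10% instead of 5%
```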
Does it matter all the time? Definitely not. But, at least based on how I was taught these topics, I don’t think many people are aware of the approximation errors behind the simple percentage-based interpretations. There may well be cases where the error is important enough to be aware of.