Interpretation of log transformations in linear models: just how accurate is it?

If linear regression is statistics/econometrics 101, then log transformations of dependent and independent variables and the associated interpretations must be statistics/econometrics 102.

Typically, you are told that:

  • If you log-transform an independent variable, then the regression coefficient \( b \) associated with that variable can be interpreted as “for every 1% increase in the independent variable, the dependent variable increases by \( \frac{b}{100} \) units”.
  • If you log-transform a dependent variable, then the regression coefficient \( b \) tells you that for every unit increase in the independent variable, the dependent variable increases by \( b\% \).
  • If you log-transform both the dependent and independent variables, then the interpretation is that for every 1% increase in the independent variable, the dependent variable increases by \( b\% \). This is especially useful for economists studying price elasticities, for example.
Lecture notes from the Econometrics class in my undergraduate studies
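
Before getting into how accurate these rules are, here is what the three variants look like in practice. A minimal sketch in Python (simulated data with a made-up elasticity; `np.polyfit` stands in for a full regression):

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(1, 100, size=500)                          # e.g. prices
y = 50 * x ** -0.8 * np.exp(rng.normal(0, 0.1, size=500))  # true elasticity of -0.8

# Log-log model: log(y) = a + b * log(x)
b, a = np.polyfit(np.log(x), np.log(y), 1)
print(b)  # ~ -0.8, read as "a 1% increase in x changes y by roughly b percent"
```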

If you are lucky, you may be told that you need to be careful with the interpretation of coefficients when your dependent variable is log-transformed and the coefficients are not small. And maybe even given the precise formula for the percentage effect: \( 100 \left( e^{b} - 1 \right) \). That's exactly what happened in my class, and I am happy it nudged me to investigate this further. I think a lot of people overestimate how accurate the percentage-based interpretations really are.

To begin with, where do all these percentages come from? The typical explanation relies on the fact that the first derivative of the natural logarithm is \( \frac{1}{x} \). But derivatives are all about "very small" changes in value, and a 1 unit (or 1%) increase is arguably not that small.

Here’s what the maths looks like for the simplest model \(y = a + bx\):

Log-transform of an independent variable (linear-log model):

$$ y_0 = a + b \log \left( x \right) \text{ (our “base case”) } $$

$$ y_1 = a + b \log \left( 1.01x \right) \text{ (x increases by 1%) } $$

$$ y_1 - y_0 = b \log \left( 1.01x \right) - b \log \left( x \right) = b \log \left( \frac{1.01x}{x} \right) = b \log \left( 1.01 \right) $$

In other words, when \(x\) increases by 1%, the change in \(y\) is not \( \frac{b}{100} \), but rather \( b \log \left( 1.01 \right) \). What is \( \log \left( 1.01 \right) \)? It's approximately \( 0.00995 \), close to the \( 0.01 \) behind the rule we all learned. But it is about \( 0.5\% \) off! And the gap grows with the size of the change: for a 15% increase, \( \log \left( 1.15 \right) \approx 0.1398 \) versus the naive \( 0.15 \), off by almost 7%.
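
A quick numerical check of that gap (a minimal sketch; the coefficient value is arbitrary):

```python
import numpy as np

b = 5.0                             # arbitrary linear-log coefficient
approx = b / 100                    # textbook rule: b/100 units per 1% increase in x
exact = b * np.log(1.01)            # exact change in y for a 1% increase in x
print(approx, exact)                # 0.05 vs ~0.04975
print(f"{1 - exact / approx:.2%}")  # relative error, ~0.50%
```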

Consider the following scenario. You are currently selling goods at a price of $100, and your sales last year were $10m (100,000 units). You are interested in raising prices by 15% and want to know what a reasonable sales budget for next year would be. A data scientist fits a linear-log model (units sold on log of price), finds that the regression coefficient is -90,000, and thus reports that for every 1% increase in price, you can expect to sell 900 fewer units. For a 15% price increase, a naive answer would be that 13,500 (90,000 * 15 / 100) fewer units will be sold. Your sales budget could thus be 115 * (100,000 - 13,500) ≈ $9.95m. You begrudgingly prepare yourself for explaining to the higher-ups why a price increase will not result in higher sales.

A precise answer, however, would be that you should expect \( 90{,}000 \times \log \left( 1.15 \right) \approx 12{,}579 \) fewer units sold. That would result in a sales budget of 115 * (100,000 - 12,579) ≈ $10.05m, i.e. slightly more revenue than last year, not less. A contrived example? Perhaps, but there surely are situations where even such a small difference is important enough.
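
The same scenario as a short sketch, using the numbers from the example above:

```python
import numpy as np

price, units = 100.0, 100_000   # current price and last year's unit sales
b = -90_000                     # linear-log coefficient: units = a + b * log(price)
increase = 0.15                 # planned 15% price increase

naive_units = units + (b / 100) * 15             # rule of thumb: b/100 units per 1% -> 86,500
exact_units = units + b * np.log(1 + increase)   # exact change -> ~87,421

new_price = price * (1 + increase)
print(f"naive budget: ${naive_units * new_price:,.0f}")  # ~$9.95m
print(f"exact budget: ${exact_units * new_price:,.0f}")  # ~$10.05m
```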

Things get even more interesting in the log-linear model:

$$ \log \left( y_0 \right) = a + b x \text{ (our “base case”) } $$

$$ \log \left( y_1 \right) = a + b \left( x + 1 \right) \text{ (let’s add 1 to x) } $$

$$ \frac{y_1 - y_0}{y_0} = \frac { e^{a + b \left( x + 1 \right)} - e^{ a + b x } } { e^{ a + b x } } = \frac { e^{a + b x} \left( e^{b} - 1 \right) } { e^{ a + b x } } = e^{b} - 1 $$

Assuming \( b \) is small, \( e^{b} - 1 \approx b \). But what if it is not? Here's how it looks graphically. If, say, your coefficient is equal to 1, then what it really means is not that "a 1 unit increase in the independent variable will result in a 100% increase in the response". It's actually about 172%, since \( e^{1} - 1 \approx 1.718 \). That's quite a difference.

(Figure: log-linear model coefficient interpretation, the exact effect \( e^{b} - 1 \) versus the approximation \( b \))
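
Numerically, the gap between the rule of thumb and the exact effect looks like this (a quick sketch over a few arbitrary coefficient values):

```python
import numpy as np

for b in [0.01, 0.1, 0.5, 1.0]:
    approx_pct = 100 * b               # rule of thumb
    exact_pct = 100 * (np.exp(b) - 1)  # exact percentage change per unit increase in x
    print(f"b={b:<5} approx {approx_pct:6.1f}%  exact {exact_pct:6.1f}%")
```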

As you may guess, in a log-log model everything gets compounded, resulting in the equation below. If you find that your coefficient is, say, -2, then its precise effect on the response per 1% increase in the independent variable is not a 2% decrease but rather a 1.97% decrease.

$$ \frac{y_1 - y_0}{y_0} = e^{b \log \left( 1.01 \right)} - 1 = 1.01^{b} - 1 $$
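
And a quick check of the log-log case, using the coefficient of -2 from above:

```python
import numpy as np

b = -2.0
approx_pct = b                                    # rule of thumb: b% per 1% increase in x
exact_pct = 100 * (np.exp(b * np.log(1.01)) - 1)  # same as 100 * (1.01**b - 1)
print(approx_pct, exact_pct)                      # -2.0 vs ~-1.97
```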

Does this matter all the time? Definitely not. But, at least based on how I was taught these topics, I don't think many people are aware of the approximation errors baked into the simple percentage-based interpretations, and there certainly are cases where those errors are large enough to be worth knowing about.
