Interpretation of log transformations in linear models: just how accurate is it?

If linear regression is statistics/econometrics 101, then log transformations of dependent and independent variables and the associated interpretations must be statistics/econometrics 102.

Typically, you are told that:

  • If you log-transform an independent variable, then the regression coefficient \( b \) associated with that variable can be interpreted as “for every 1% increase in the independent variable, the dependent variable increases by \( \frac{b}{100} \) units”.
  • If you log-transform a dependent variable, then the regression coefficient \( b \) tells you that for every unit increase in the independent variable, the dependent variable increases by \( 100 \times b \) percent.
  • If you log-transform both the dependent and independent variables, then the interpretation is that for every 1% increase in the independent variable, the dependent variable increases by \( b\% \). This is especially useful for economists studying price elasticities, for example.
Lecture notes from the Econometrics class in my undergraduate studies

If you are lucky, you may be told that you need to be careful with the interpretation of coefficients when your dependent variable is log-transformed and the coefficients are not small. You may even be given the precise formula for the percentage effect: \( 100 \left( e^{b} - 1 \right) \). That’s exactly what happened in my class, and I am happy it nudged me to investigate this further. I think a lot of people overestimate how accurate the percentage-based interpretations really are.

To begin with, where do all these percentages come from? The typical explanation relies on the fact that the derivative of the natural logarithm is \( \frac{1}{x} \), so a change in \( \log(x) \) is approximately the relative change in \( x \). But derivatives are all about “very small” changes in value, and arguably a 1 unit (or even 1%) increase/decrease is not that small.

Here’s what the maths looks like for the simplest model \(y = a + bx\):

Log-transform of an independent variable (linear-log model):

$$ y_0 = a + b \log \left( x \right) \text{ (our “base case”) } $$

$$ y_1 = a + b \log \left( 1.01x \right) \text{ (x increases by 1%) } $$

$$ y_1 - y_0 = b \log \left( 1.01x \right) - b \log \left( x \right) = b \log \left( \frac{1.01x}{x} \right) = b \log \left(1.01 \right) $$

In other words, when \(x\) increases by 1%, the change in \(y\) is not exactly \( \frac{b}{100} \), but rather \( b \log \left(1.01 \right) \). What is \( \log \left(1.01 \right) \)? It’s approximately \( 0.00995 \), or roughly \( 0.01 \), and hence the \( \frac{b}{100} \) rule we all learned. But the approximation is about \( 0.5\% \) off, and the gap grows with the size of the change!

Consider the following scenario. You are currently selling goods at a price of $100 and your sales last year were $10m (i.e. 100,000 units). You are interested in raising the price by 15% and you want to know what a reasonable sales budget for next year would be. A data scientist runs some analysis, finds that the regression coefficient is -90,000, and thus reports that for every 1% increase in price you can expect to sell 900 fewer units. For a 15% price increase, a naive answer would be that 13,500 (90,000 * 15 / 100) fewer units will be sold. Your sales budget could thus be 115 * (100,000 - 13,500) ~ $9.95m. You begrudgingly prepare yourself for explaining to the higher-ups why a price increase will not result in higher sales.

A precise answer, however, would be that you should expect \( 90000 \log \left( 1.15 \right) \approx 12,579 \) fewer units sold. That would result in a sales budget of roughly $10.05m. A contrived example? Perhaps, but there surely are situations where even such a small difference matters.
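The quick arithmetic behind both numbers, as a minimal R sketch (using only the figures from the scenario above):

b <- -90000       #estimated coefficient of the linear-log model
units <- 100000   #current yearly volume ($10m of sales at $100 per unit)
new_price <- 115  #price after the 15% increase

naive_drop   <- -b * 15 / 100    #rule of thumb: 900 fewer units per 1%
precise_drop <- -b * log(1.15)   #exact effect of a 15% price increase

round(c(naive = naive_drop, precise = precise_drop))
#naive: 13500, precise: 12579

round(new_price * (units - c(naive = naive_drop, precise = precise_drop)))
#implied budgets: ~$9.95m (naive) vs ~$10.05m (precise)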

Things get even more interesting in the log-linear model:

$$ \log \left( y_0 \right) = a + b x \text{ (our “base case”) } $$

$$ \log \left( y_1 \right) = a + b \left( x + 1 \right) \text{ (let’s add 1 to x) } $$

$$ \frac{y_1 - y_0}{y_0} = \frac { e^{a + b \left( x + 1 \right)} - e^{ a + b x } } { e^{ a + b x } } = \frac { e^{a + b x} \left( e^{b} - 1 \right) } { e^{ a + b x } } = e^{b} - 1 $$

Assuming \( b \) is small, \( e^{b} - 1 \approx b \). But what if it is not? Here’s how it looks graphically. If, say, your coefficient is equal to 1, then it does not mean that “a 1 unit increase in the independent variable will result in a 100% increase in the response”. The actual effect is \( e^{1} - 1 \approx 1.72 \), i.e. about a 172% increase. That’s quite a difference.

Log-linear model coefficient interpretation

As you may guess, in a log-log model the two effects compound, resulting in the expression below. If you find that your coefficient is, say, -2, then its precise effect on the response is not a 2% decrease for every 1% increase in the independent variable, but rather a 1.97% decrease.

$$ \frac{y_1 - y_0}{y_0} = e^{b \log \left( 1.01 \right)} - 1 $$
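Both the log-linear and log-log numbers above are easy to verify with a couple of lines of R (a minimal sketch using the example coefficients of 1 and -2):

#log-linear model: % change in y for a 1 unit increase in x
b <- 1
100 * b                          #rule of thumb: 100%
100 * (exp(b) - 1)               #exact: ~171.8%

#log-log model: % change in y for a 1% increase in x
b <- -2
b                                #rule of thumb: -2%
100 * (exp(b * log(1.01)) - 1)   #exact: ~-1.97%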

Does it matter all the time? Definitely not. But, at least based on how I was taught these topics, I don’t think many people are aware of the approximation error hidden in the simple percentage-based interpretation, and there are certainly cases where that error is large enough to matter.

From an Excel schedule to Outlook Calendar with PowerShell

(it was actually quite pleasant!)

I am starting my studies at Georgia Tech (yay!) and while the semester hasn’t even started yet, it already feels a bit overwhelming. Tons of emails, activities, and lots of calculus. More than I was expecting, actually!

Now, one of the ways I deal with such stress is by getting organized. Sometimes I wonder if that’s just a form of procrastination (organizing stuff vs. actually dealing with it), but it helps. When my email inbox is under control and my calendar is up to date, I just feel better. Period.

So it probably won’t be surprising that when an email arrived announcing some additional “TA sessions”, with 3 sessions every week until the end of the semester, I had an immediate urge to *organize it*. Except that the schedule came in the form of an Excel spreadsheet…

Not exactly “Outlook-ready”…

Now, I’ve spent the past few days refreshing my Python skills and setting up a new laptop (WSL2 is great; it looks like I won’t be dual-booting Windows/Ubuntu anymore.. feels slightly morally wrong, though). In all that process, I had a chance to play around with PowerShell, and I was wondering how hard it would be to create .ics files/events in Outlook calendar for all these sessions. For someone who wasn’t really familiar with PowerShell control structures or equality operators (“-ne”, that’s basically Lithuanian..), the answer is ~1 hour hard. That was surprisingly fast.

Most of the credit goes to the author of the New-IcsEvent PowerShell function, which generates .ics files based on the supplied parameters. In my case, I did not even bother reading the Excel file programmatically, as working with lists/strings in PowerShell turned out to be very intuitive.

$weeks = "08-23-21
08-30-21
09-06-21
09-13-21
09-20-21
09-27-21
10-04-21
10-11-21
10-18-21
10-25-21
11-01-21
11-08-21
11-15-21
11-22-21
11-29-21
12-06-21".split("`n") 
#weeks is now an array - just like in Python 
#(except, of course, unlike the rest of the world, we don't use \n for line breaks..)

After some fiddling (and extracting locations and session titles to similar arrays as above), all my code had to do was pretty much this:

#loop over the weeks; $tuesdays, $thursdays and $fridays hold the session titles for each week ("" = no session)
for ($i = 0; $i -lt $weeks.Count; $i++) {
    #each entry in $weeks is the Monday of that week
    $start_week_date = [datetime]::parseexact($weeks[$i], 'MM-dd-yy', $null)
    if ($tuesdays[$i] -ne "") {
        #Tuesday session: Monday + 1 day, 09:15
        $day = $start_week_date.AddDays(1).ToString("yyyy-MM-dd") + " 09:15"
        $topic = "MSA TA: " + $tuesdays[$i]
        Write-Output $day
        New-IcsEvent -Path C:\users\aurim\Desktop -Subject $topic -Location 'location' -StartDate $day -ShowAs Free -Duration '01:00:00'
    }
    if ($thursdays[$i] -ne "") {
        #Thursday session: Monday + 3 days, 09:15
        $day = $start_week_date.AddDays(3).ToString("yyyy-MM-dd") + " 09:15"
        $topic = "MSA TA: " + $thursdays[$i]
        Write-Output $day
        New-IcsEvent -Path C:\users\aurim\Desktop -Subject $topic -Location 'location' -StartDate $day -ShowAs Free -Duration '01:00:00'
    }
    if ($fridays[$i] -ne "") {
        #Friday session: Monday + 4 days, 13:00
        $day = $start_week_date.AddDays(4).ToString("yyyy-MM-dd") + " 13:00"
        $topic = "MSA TA: " + $fridays[$i]
        Write-Output $day
        New-IcsEvent -Path C:\users\aurim\Desktop -Subject $topic -Location 'location' -StartDate $day -ShowAs Free -Duration '01:00:00'
    }
}

Now, this created a bunch of .ics files on my desktop, 44 to be precise. It turns out, however, that a single .ics file can contain multiple events. So all I needed to do was concatenate all the files together, skipping the first 4 lines and the last line of each file. That would be quite simple on Unix, e.g. for f in *.ics; do tail -n +5 "$f" | head -n -1; done > my_master_schedule.ics. I have a feeling that with some fiddling with awk/sed it should even be possible to keep the first 4 lines of the first file and the last line of the very last file, giving us a fully valid .ics file in one go.

BEGIN:VCALENDAR
VERSION:2.0
METHOD:PUBLISH
PRODID:-//JHP//We love PowerShell!//EN
BEGIN:VEVENT
UID: ab4d0283-885b-433f-b269-7ce6b841df46
CREATED: 20210820T005313Z
DTSTAMP: 20210820T005313Z
LAST-MODIFIED: 20210820T005313Z
CLASS:Public
CATEGORIES:
SEQUENCE:0
DTSTART: 20210826T131500Z
DTEND: 20210826T141500Z
DESCRIPTION: 
SUMMARY: MSA TA: R Intro part I
LOCATION: IC215
TRANSP:TRANSPARENT
END:VEVENT
END:VCALENDAR

Nevertheless, I was curious how difficult it would be to do the same in PowerShell. It took some searching, but it turned out to be pretty straightforward, too. By the way, that’s one thing I did not like about PowerShell: there are tons of answers on the web, but quite a few of them are clearly low quality, even on StackOverflow. It reminds me a bit of looking for good VBA guidance. Curse of the masses, I suppose?

ls *.ics | get-content | where {$_.readcount -gt 4 -and $_ -ne "END:VCALENDAR"} > MSA_TA_SESSIONS.icsx                   

As you can see, I cheated a bit: since get-content emits lines one by one, I could not easily refer to each file’s total line count, so I dropped the closing END:VCALENDAR lines by value instead (there definitely are ways to do it by line count, it just seemed to require more steps). Either way, I had all I needed.

Now it was time to finish up the exercise by opening Notepad and adding the missing first 4 lines and the very last line manually. No point in writing code for that.

“Football field” (variable-width bar) charts in R / Power BI

An interesting way to present long-term contracts and an opportunity to try out R visuals in Power BI

On a recent project, our team was asked if we could visualize long-term real estate contracts on a “football field” chart. Now, I’ve heard of snowball charts and waterfall charts, but a football field chart was something new. A quick search, however, showed that I had been missing out.

Yup, it’s a thing.

What exactly is a football field chart? It is effectively a column/bar chart where the individual bars are used to indicate ranges (and thus start/end in arbitrary places). Most examples on the internet relate to valuation ranges. In our case, the idea was to display contracts over time as discrete bars, varying the bar width with contract value changes.

Something along these lines

We were planning to deliver our insights via Power BI reports, so my immediate reaction was to check whether we could create this sort of chart using Power BI’s R visuals and ggplot2. It turns out ggplot2 includes exactly what we needed in this scenario: its geom_tile() / geom_rect() geoms allow drawing arbitrary rectangles. If you prefer Python, the ggplot2-inspired plotnine package supports the same geoms and, admittedly, has way more impressive examples.
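To illustrate the mechanics, here is a minimal, self-contained sketch (toy data, nothing to do with the actual project) of how geom_tile() turns x/y positions plus width/height into arbitrary rectangles:

library(ggplot2)

#toy data: one rectangle per row, positioned by its centre
toy = data.frame(
  mid = c(2, 5, 8),          #x-axis centre of each rectangle
  item = c("A", "B", "C"),   #y-axis category
  width = c(2, 4, 1),        #horizontal extent
  value = c(0.4, 0.9, 0.6)   #vertical extent (the bar "thickness")
)

ggplot(toy) +
  geom_tile(aes(x = mid, y = item, width = width, height = value))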

This is an example of what you can achieve with the ggplot2 / plotnine packages. Technically, this is a bar chart!
Credit: Plotnine geom_tile() documentation

Getting to the right visual in R

For the purpose of this blog post, I generated some synthetic data representing a real estate portfolio in Germany. In case you are interested in how it was created, check out the Python script with the details (it was a lot of fun to play around with various probability distributions; I may write a separate blog post about that one day).

We effectively had: 1) a list of contracts, 2) a list of terms associated with each contract, and 3) a monthly rental portfolio data set derived from the contract/term information.

Getting to the desired visualization was surprisingly easy. First, let’s select the top 10 contracts by their future value (you don’t really want to visualize all contracts at once unless you intend to print the chart on A0-size paper..)

library(tidyverse)
library(data.table)
library(RColorBrewer)

#read in the table that contains individual contract terms with associated start/end dates
terms = fread("~/coding/real-estate/terms.csv")

#read in the table that contains contract-level information (it is joined in below)
#note: the exact filename is assumed here
contracts = fread("~/coding/real-estate/contracts.csv")

#read in the table that contains monthly information for each term (SQM / rent income in a month)
#add proper year/month columns to the dataframe
PL = fread("~/coding/real-estate/monthly_data.csv") %>% 
  mutate(date = as.Date(strptime(paste0(month,"-01"), format="%Y-%m-%d"))) %>%
  mutate(year = year(date), month = month(date))

#calculate future value per contract (assuming we're at the end of 2020)
# filter it down to top10 contracts only
top_10_contracts = PL %>% filter(date > "2020-12-31") %>% 
  inner_join(terms, by='term_id') %>%
  inner_join(contracts, by="contract_id") %>% 
  group_by(contract_id) %>% summarize(future_value = sum(rent), .groups='drop_last') %>% 
  arrange(desc(future_value)) %>% head(10)

Then, we can proceed to get the first iteration of the chart. As geom_tile() requires the “central” position and the width/height of each rectangle, let’s calculate the mid-point of each term to use as the x-axis position; the width will be equal to the number of days in the term, and the height to the monthly rent. Let’s try with one contract first.

chart_df = 
  #filter to 1 contract
  top_10_contracts[1,] %>% 
  #get terms of contract
  inner_join(terms, by="contract_id") %>% 
  #calculate # of days of each term
  mutate(width = as.numeric(difftime(end, start, units="days"))) %>% 
  #calculate the mid-point of each term
  mutate(mid_point = start + width/2) %>% 
  #extensions are numeric, make sure they are treated as discrete variables
  mutate(extension = as.factor(extension)) 

ggplot(chart_df) + 
  geom_tile(aes(
    x=mid_point, 
    y=contract_id, 
    width=width, 
    height=monthly_rent, 
    fill=extension
  ))
Not bad for an initial attempt.

Seems like we’re on the right track, even if not beautiful. We can fix that later. Let’s try with 10 contracts.

Accidental art?

Right. That’s not great. While I initially thought that R, for some reason, places each series on top of the others (i.e. ignores the Y-axis parameter), I later realized the real issue was the bar heights: on a discrete Y axis the contracts sit 1 unit apart, so any bar taller than 1 spills over its neighbours. The heights needed to be re-scaled. Here’s a fixed version where each bar height is scaled to a maximum of 1 (plus some formatting tweaks).

ggplot(chart_df) + 
  geom_tile(aes(
    x=mid_point, 
    y=contract_id, 
    width=width, 
    height=monthly_rent/max(monthly_rent), #that's the fix
    fill=extension)
  ) +
  ylab("") + xlab("Year") + theme_bw() + 
  theme(panel.grid.minor.y = element_blank(), panel.grid.major.y = element_blank())
Much better.

Almost perfect. We just need to add labels that indicate each contract’s total value and sort the contracts from the highest value to the lowest. Here’s the final code that achieves all that.

chart_df = 
  top_10_contracts %>%
  #get terms of top 10 contracts
  inner_join(terms, by="contract_id") %>% 
  #calculate # of days of each term
  mutate(width = as.numeric(difftime(end, start, units="days"))) %>% 
  #calculate the mid-point of each term
  mutate(mid_point = start + width/2) %>% 
  #extensions are numeric, make sure they are treated as discrete variables
  mutate(extension = as.factor(extension)) %>% 
  #reorder contracts
  mutate(contract_id = reorder(as.character(contract_id), future_value, mean)) 

labels = 
  #group by contract
  chart_df %>% group_by(contract_id) %>% 
  #calculate total value and get end date
  summarize(end_date = max(end), value = max(future_value), .groups="drop_last") %>% 
  #format total value
  mutate(label = paste0(round(value / 1000000, 1), "M")) 

ggplot(chart_df) + 
  #create rectangle geoms
  geom_tile(aes(
    x=mid_point, 
    y=contract_id, 
    width=width, 
    height=monthly_rent/max(monthly_rent), 
    fill=extension
  )) + 
  #deal with axis labels
  ylab("") + xlab("Year") + theme_bw() + 
  #remove gridlines on Y-axis
  theme(panel.grid.minor.y = element_blank(), panel.grid.major.y = element_blank()) +
  #add labels at the end of each contract bar
  geom_label(data = labels, mapping=aes(end_date + 400, contract_id, label=label)) +
  #apply a better color palette
  scale_fill_brewer(palette = "RdYlGn")
Here we are.

Edit: I received a comment (thanks Q!) that the chart may look more intuitive with a shadow behind each bar indicating the maximum monthly value on the chart for reference. Here is how that could look (code-wise, it is just another geom_tile() layer).
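A rough sketch of that extra layer, reusing the columns of chart_df from above (the exact styling choices are mine):

#one light-grey "shadow" tile per contract, spanning the full contract duration,
#with height 1 (the maximum bar height on the chart after re-scaling)
shadows = chart_df %>% 
  group_by(contract_id) %>% 
  summarize(start = min(start), end = max(end), .groups = "drop") %>% 
  mutate(width = as.numeric(difftime(end, start, units = "days"))) %>% 
  mutate(mid_point = start + width / 2)

ggplot(chart_df) + 
  #the shadow layer goes first so the actual bars are drawn on top of it
  geom_tile(data = shadows, aes(x = mid_point, y = contract_id, width = width), height = 1, fill = "grey90") +
  geom_tile(aes(x = mid_point, y = contract_id, width = width, height = monthly_rent / max(monthly_rent), fill = extension))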

Putting it all together in Power BI

Power BI’s R visual integration is incredibly simple. Effectively, all you need to do is use the built-in R visual, drop in the fields that you want to have available as part of the dataframe in R (Power BI does the aggregation for you; effectively it passes the same data a matrix visual would show), and then adjust the R code accordingly (some of the data transforms were no longer required).
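For illustration, the skeleton of the R visual’s script could look something like the sketch below. Power BI passes the selected fields to the script as a dataframe called dataset; the column names here simply mirror chart_df from above and are an assumption about what you drop into the visual.

library(tidyverse)

#Power BI provides the selected fields as a dataframe named 'dataset'
chart_df = dataset %>% 
  mutate(start = as.Date(start), end = as.Date(end)) %>% 
  mutate(width = as.numeric(difftime(end, start, units = "days"))) %>% 
  mutate(mid_point = start + width / 2) %>% 
  mutate(extension = as.factor(extension))

ggplot(chart_df) + 
  geom_tile(aes(x = mid_point, y = contract_id, width = width, height = monthly_rent / max(monthly_rent), fill = extension)) + 
  ylab("") + xlab("Year") + theme_bw()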

While not applicable in this case, a couple of caveats about R Visuals in Power BI:

  • R visuals can be cross-filtered but are not interactive themselves and do not cross-filter other visuals in the report;
  • When published to the Power BI Service, they will work as long as you stick to the (long) list of supported packages;
  • There are certain limitations on the number of data points, processing time, etc.

Combine it with a dynamic TopN selection, and voilà. Click to see it in action.

A quick Power BI mock-up with dynamic interactions

All code used in this blog post is available at https://github.com/kamicollo/football-field-charts.