## Introduction¶

One of the most basic problems in data science goes something like this:

I want to fit a curve to data that is flexible enough to capture non-linear relationships (i.e. go through known data points without any error) but I also want confidence intervals for predictions at unknown points.

Most methods give you one or the other. You can go with linear/polynomial regresssions and have confidence intervals, but they won’t perfectly fit your data. Or you can go with splines, LOESS and other non-parametric methods, but then you won’t get confidence intervals out of the box (you still can estimate them with bootstrapping but it’s not always straightforward).

It was in this context that I learned about Gaussian Processes (GPs) first. GPs give you both a perfect interpolating function over known data points while also providing variance estimates at unknown data points. Here’s a very basic (and recognizable) illustration of a GP in action.

Once I learned about Gaussian processes, I started seeing references to them everywhere. Want to optimize hyperparameters? GPs can help. Reading an academic paper doing some cool stuff? Oh look, GPs are involved, too. That’s how I decided to spend some more time to understand them (a bit better), and they did not disappoint.

Cool stuff is meant to be shared, and that is how this post/notebook was born. In it, you will find:

- Brief intro into how GPs work
- GP regression as a way to interpolate in multi-dimensional space
- Importance of kernels in GPs
- GPs for time-series forecasting
- GP regressions with multiple output variables 🤯
- GP regressions as surrogate functions and Bayesian optimization

Be forewarned: it’s a loooong one!

## Data used in this notebook¶

One of the earliest applications of Gaussian Process is known as kriging, named after Danie G. Krige, a mining enginner from South Africa. Staying close to the roots, this notebook uses weather forecasts in Lithuania to illustrate GPs (and it makes for pretty charts, too!). An extract of the dataset is displayed below. This notebook uses GPy library. If you are reading it in post format, some of the code cells are excluded for brevity. You can find the entire notebook and associated data on GitHub.

```
#read in weather forecasts and Lithuania's shape file
lt_forecasts = gpd.read_file("lt_forecasts.geojson")
lithuania = gpd.read_file('lt-shapefile/lt.shp')
lt_forecasts.head(n=5)
```

forecastTimeUtc | airTemperature | windSpeed | cloudCover | totalPrecipitation | location | geometry | |
---|---|---|---|---|---|---|---|

0 | 2022-02-01T06:00:00 | -2.8 | 4.0 | 18.0 | 0.0 | Senoji Radiškė | POINT (23.25380 54.29488) |

1 | 2022-02-04T21:00:00 | 1.7 | 5.0 | 100.0 | 1.2 | Senoji Radiškė | POINT (23.25380 54.29488) |

2 | 2022-02-06T18:00:00 | 2.4 | 7.0 | 97.0 | 3.5 | Senoji Radiškė | POINT (23.25380 54.29488) |

3 | 2022-02-06T12:00:00 | 2.1 | 5.0 | 100.0 | 1.0 | Senoji Radiškė | POINT (23.25380 54.29488) |

4 | 2022-02-06T06:00:00 | 0.3 | 7.0 | 100.0 | 0.9 | Senoji Radiškė | POINT (23.25380 54.29488) |

## How do GPs work?¶

Parametric algorithms typically fit a fixed number of parameters to data (think linear regression and its coefficients). Non-parametric methods, on the other hand, typically use data points themselves as “parameters” (think K-nearest neighbours where classification depends solely on existing data points).

Gaussian Processes fall somewhere in between. Instead of fitting a fixed number of parameters, they fit a distribution of *functions* to data, where these functions are drawn from a multivariate normal distribution (which is what enables having confidence intervals out of the box), while the functions themselves are additionally parameterized by existing data points. That’s not an easy to digest sentence, I know… One of the best definitions I found is from Julia’s GP package:

Gaussian processes are a family of stochastic processes which provide a flexible nonparametric tool for modelling data. A Gaussian Process places a prior over functions, and can be described as an infinite dimensional generalisation of a multivariate Normal distribution. Moreover, the joint distribution of any finite collection of points is a multivariate Normal. This process can be fully characterised by its mean and covariance functions, where the mean of any point in the process is described by the mean function and the covariance between any two observations is specified by the kernel. Given a set of observed real-valued points over a space, the Gaussian Process is used to make inference on the values at the remaining points in the space.

In some ways, GPs thus incorporate both parametric and non-parametric worlds – they have fixed parameters (the mean and the variance of the prior), but they also use data as parameters (as the posterior of the functions is effectively drawn from n-dimensional Gaussian distribution, where n is the number of sample points).

If you are interested to understand GPs deeper (and I definitely suggest that), I highly recommend a tutorial on YouTube by Richard Turner and a visual exploration of Gaussian processes by three researchers from University of Konstanz.

Finally, I cannot leave out two wonderful memes by Richard McElreath…

help pic.twitter.com/wbDo3CqpTK

— Richard McElreath 🦔 (@rlmcelreath) February 21, 2022

If you are wondering how one can fit a “infinite-dimensional” normal distribution, the answer lies in the fact that a marginal distribution of a normal distribution is normal, just like is a joint and a conditional distribution of two normal distributions. Richard Turner does a really great job at illustrating these dynamics in his tutorial. If you don’t watch it now, make sure at least to put it in your bookmarks/open tabs list – it’s a really good one!

For all their coolness, GPs Achilles heel is computational complexity. They scale with amount of data as any other non-parametric methods, with most implementations having $O(n^3)$. Improving that is an active line of research, whereas for every day data scientists concerned with speed can choose between GPU-enabled GPyTorch and efficiency focused celerite libraries. Just keep in mind that GPs are not really an option if you have 1,000,000s of observations.

## 1. Gaussian Process Regression – interpolations¶

Let’s start with the classical application of Gaussian Process in geostatistics. Imagine you are building an application that requires knowing air temperature at any given location in Lithuania. What you have, though, is just the measurements at weather stations. Here’s how it looks visually.

Let’s use Gaussian Process regression to interpolate. We will fit a GP using coordinates as features and air temperature as the dependent variable.

```
#organize data into predictors and independent variables
coordinates = np.zeros((len(morning_forecast),2))
coordinates[:,0] = morning_forecast.geometry.x.values
coordinates[:,1] = morning_forecast.geometry.y.values
actual_temperatures = morning_forecast['airTemperature'].values.reshape(-1, 1)
#fit the simplest GP model with default parameters
interpolation_model = GPy.models.GPRegression(X=coordinates, Y=actual_temperatures)
interpolation_model.optimize_restarts(num_restarts = 1, verbose=False)
#define the grid covering Lithuania
x_space = np.linspace(coordinates[:,0].min(), coordinates[:,0].max(), 100)
y_space = np.linspace(coordinates[:,1].min(), coordinates[:,1].max(), 100)
xx, yy = np.meshgrid(x_space, y_space)
grid = np.concatenate((xx.reshape(-1,1), yy.reshape(-1, 1)), axis=1)
#obtain predictions over the grid - GPY returns both mean and variance
p_means, p_vars = interpolation_model.predict(grid)
```

And, here it is – now we have predictions over the entire country. Looks pretty, doesn’t it?