This tutorial is to show you how to use mgcv for non-linear time series data. With time series, the values you observe at a current time, \(y(t)\), are assumed to depend on the previous values of the time series: y(t) = f(y(t-1), y(t-2),…). Compare this to standard statistical methods, which assume each data point is independent of the others once you account for any covariates, or spatial statistics, which assume that the value at a given point depends on all its neighbours, not just the ones in one direction from it.
A good example of time series data would be a population of animals fluctuating over time. If the population is well-above carrying capacity currently, it’s likely going to be above its carrying capacity the next time we survey it as well. To be able to effectively model this population, we need to understand how it changes over time in response to its own density. Generalized additive models, using mgcv, are very useful for this as they allow you to model current values as a non-linear function of past values.
This tutorial has two major sections. In the first one (part 3), we talk about how to use GAMs to model trends, with autocorrelated background noise. The second section (part 4) focuses on using GAMs to model complex nonlinear dependencies between time points, for cases where you really care about how current observations depend on previous observations (for instance, when trying to estimate population dynamics). You don’t have to do both sections; each stands alone, so feel free to work through whichever of the two is more useful for you.
Here’s a few key ideas and R functions you should familiarize yourself with if you haven’t already encountered them before. For the R functions (which will be) highlighted like this, use ?function_name to look them up. If you’re familiar with time series stats, skip to the next part.
This exercise will use a type of smoother we didn’t cover in the intro: the cyclical smooth. Standard smooth terms assume that the left and right end of the range of your x variable have nothing to do with each other; one can be much higher or lower than the other. However, there’s several types of data where the predictor starts and ends at the same point. Think of time of day: 11:59 and 00:00 are almost the same time, even though they occur on different days. In this tutorial, we’ll work with seasonal (monthly) data, and similarly, month 1 and month 12 are very close to each other. As such, the start and end of such a smooth should also be very close; if our model predicts that it should be 10 degrees C at midnight, it should also predict the same temperature at 11:59:59.
mgcv allows us to easily model this type of data, using a cyclic smooth basis. The standard smooth terms we’ve used so far are built so that their 2nd derivates are zero at the ends of the data. The cyclic smooth instead assumes that the value and 1st and 2nd derivatives are equal at the ends of the data.
To use a cyclic smooth, you need to code the smooth term like this: s(x, bs="cc"). By default, mgcv assumes that the largest and smallest values of x are the end points. If that isn’t true, you can specify the end points of the range as shown below and in the first example using the knots argument.
This code shows how the smooth basis functions vary between the default thin plate (“tp”), cyclic (“cc”) and cyclic smooth with end points outside the range of the data. Note that no matter how you vary the cyclic smooths, the ends will always be equal. For more info on cyclic smooths, refer to ?smooth.construct.cc.smooth.spec
library(mgcv)
dat = data.frame(x = seq(0,1, length = 100))
tp_smooth = smooth.construct2(s(x,bs="tp", k=4),data = dat,knots = NULL)
cc_smooth = smooth.construct2(s(x,bs="cc", k=4),data = dat,knots = NULL)
cc_smooth2 = smooth.construct2(s(x,bs="cc", k=4),data = dat,
                               knots = list(x=c(-0.5,1.5)))
layout(matrix(1:3, nrow=1))
matplot(dat$x,tp_smooth$X, type="l",main="Thin plate spline" )
matplot(dat$x,cc_smooth$X, type="l",main="Cyclic spline" )
matplot(dat$x,cc_smooth2$X, type="l",main="Cyclic spline with wider knots")layout(1)Trend: The long-term value around which a time series is fluctuating. For instance, if the habitat for a population is slowly deteriorating, it will show a decreasing trend.
Seasonal effect: Many time series show a strong seasonal pattern, so the average value of the series depends on time of year. For instance, our hypothetical population may reproduce in spring, so population numbers would be higher then and decline throughout the year. We can use cyclic smoothers to model these effects.
Stationary time series: This is a time series without any trend or seasonal components, so if you looked at multiple realizations of the time series, the mean and variance (and other statistical properties) averaged across realizations would not change with time.
Autocorrelation function (acf): This is a function that describes how correlated a time series is with itself at different time lags; that is, the acf of a time series x is: acf(x,i) = cor(x(t),x(t-i)), where i is a given lag. In R, use the function acf(x) to view the acf function for series x.
gamm: This function fits a generalized additive mixed effects model, using the nlme package. We haven’t discussed it too much in the rest of the course, but it’s a tool to include more complicated mixed effects models than is allowed for by the bs=“re” method we discussed earlier. More importantly for this tutorial, it also allows you to add correlations between errors into your model. This lets you specify how the errors are related to one another, using a range of models. Look up ?corStruct if you’re interested in the range of possible error correlations allowed. The gamm function creates a list object, with two items in it: a gam model, containing the smooth terms, which functions like the gam models you’ve been working with so far, and an lme object, that contains the random effects, and in our case correlation functions.
In many cases, what we’re interested in is estimating how things are changing over time, and the fact that the time series is autocorrelated is a nuisance; it makes it harder to get reliable estimates of the trends we’re actually interested in. In this section, we’ll show you how to estimate non-linear trends, while accounting for autocorrelation in residuals 1.
This example assumes that a time series can be broken down into two main components: a mean effect, that changes over time, and an error term with autocorrelated errors, so the model for the dependent variable looks like:
\[y(t) = f(t) + \epsilon(t) \] \[\epsilon(t) = \sum_{i=1}^{n} \alpha_i \cdot \epsilon(t-i) + rnorm(0, \sigma)\]
Where \(f(t)\) is a smooth function of time, \(\alpha_i\) are the auto-regressive coefficients, and \(\sigma\) is the standard deviation of model errors.
The first time series we’ll look at is a long-term temperature and precipitation data set from the convention center here at Fort Lauderdale, consisting of monthly mean temperature (in degrees c), the number of days it rained that month, and the amount of precipitation that month (in mm). The data range from 1950 to 2015, with a column for year and month (with Jan. = 1). The data is from a great web-app called FetchClim2, and the link for the data I used is here.
Here we’ll load it and plot the air temperature data. Note that air temperature shows a strong seasonal trend, and that air-temperature is strongly auto-correlated.
library(dplyr)
library(mgcv)
library(ggplot2)
library(tidyr)
florida = read.csv("data/time_series/Florida_climate.csv",
                           stringsAsFactors = F)
head(florida)##      lat     lon air_temp    precip year month
## 1 26.099 -80.123 21.56915  18.55344 1950     1
## 2 26.099 -80.123 21.11481  26.17154 1950     2
## 3 26.099 -80.123 21.68798  78.39297 1950     3
## 4 26.099 -80.123 21.38804  49.63116 1950     4
## 5 26.099 -80.123 24.73590 156.15713 1950     5
## 6 26.099 -80.123 26.81385  85.51056 1950     6ggplot(aes(x=month, y=air_temp,color=year,group=year), 
      data= florida)+
  geom_point()+
  geom_line()acf(florida$air_temp,lag.max = 36)Looking at the time series, it does look like later years are warmer, but we should model it to see how much warmer.
The first model we’ll fit will include both a yearly and seasonal trend. Note that the seasonal trend uses a cyclic smoother, as January and December values should be close to one another. Also note that we’ve specified a knots command for the month term, with knots at 0.5 and 12.5 months. This is because if we just left it as is, mgcv would assume that month 1 and month 12 were the end points, and should therefore have the same value. Since January and December will have different mean temperatures, this creates a bias. By specifying knots like this, we are stating that the point of equality should occur equidistant between the middle of January and the middle of December.
model_temp_basic = gamm(air_temp~s(year)+s(month,bs="cc",k=12),data=florida,
                        knots = list(month = c(0.5, 12.5)))
plot(model_temp_basic$gam,page=1,scale=0)summary(model_temp_basic$gam)## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## air_temp ~ s(year) + s(month, bs = "cc", k = 12)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 23.93321    0.03105   770.7   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##            edf Ref.df      F p-value    
## s(year)  6.687  6.687  23.78  <2e-16 ***
## s(month) 8.160 10.000 768.02  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.908   
##   Scale est. = 0.76285   n = 792It appears that there’s about a 8 degree C difference between the warmest and coolest months, and that temperatures haven been rising non-linearly, with a much faster rate of increase since 2010. The model seems to fit well; however, if we look at the acf, there’s still substantial unexplained autocorrelation:
acf(residuals(model_temp_basic$gam),lag.max = 36)To incorporate the auto-correlated error structure, we need to use the slightly more complex gamm function. This uses the nlme package to fit the model, and it allows us to add correlation functions (in this case, an auto-regressive model) to our GAM model. The model we’ll add (an auto-regressive order-p model) assumes that the residuals at time t \(\epsilon_t = \sum_{i=1}^{p} \alpha_p\epsilon_{t-i} + rnorm(0,\sigma)\). The more values you add to the auto-regressive model, the more complex time dependencies you can model. Also note we specify the form of the correlation as form= ~1|year. This means that temperatures will only be assumed to be correlated within a given year, so there’s no assumed correlation between the temperature in December of one year and January of the next. This is just to speed up computations for the workshop; in an actual analysis, we would not recommend ignoring these correlations2.
model_temp_autoreg1 = gamm(air_temp~s(year)+s(month,bs="cc",k=12),
                            correlation = corARMA(form=~1|year,p = 1),
                            data=florida, knots = list(month = c(0.5, 12.5)))
model_temp_autoreg2 = gamm(air_temp~s(year)+s(month,bs="cc",k=12),
                            correlation = corARMA(form=~1|year,p = 2),
                            data=florida, knots = list(month = c(0.5, 12.5)))
layout(matrix(1:6, nrow = 2,byrow = F))
plot(model_temp_basic$gam,scale=0,main="basic model")
plot(model_temp_autoreg1$gam,scale=0,main="lag-1 model")
plot(model_temp_autoreg2$gam,scale=0,main="lag-2 model")layout(1)
layout(matrix(1:3, nrow = 1))
acf(residuals(model_temp_basic$lme,type="normalized"),main="basic model",lag.max = 36)
acf(residuals(model_temp_autoreg1$lme,type="normalized"),main="lag-1 model",lag.max = 36)
acf(residuals(model_temp_autoreg2$lme,type="normalized"),main="lag-2 model",lag.max = 36)layout(1)Note that including the auto-regressive model resulted in a substantially smoother estimate of the long-term trend. Warming appears to still be speeding up, although not at the same rapid rate as in the naive model. Adding the first lag also seems to remove most of the auto-correlation in the data, although there appears to be at least some long-period auto-correlation at the 12-month scale. We can use an ANOVA to test whether adding the first or second autoregressive terms actually result in a better fitting model than the naive one.
anova(model_temp_basic$lme, model_temp_autoreg1$lme, model_temp_autoreg2$lme)##                         Model df      AIC      BIC     logLik   Test
## model_temp_basic$lme        1  5 2098.127 2121.500 -1044.0637       
## model_temp_autoreg1$lme     2  6 1979.365 2007.412  -983.6825 1 vs 2
## model_temp_autoreg2$lme     3  7 1980.249 2012.970  -983.1243 2 vs 3
##                           L.Ratio p-value
## model_temp_basic$lme                     
## model_temp_autoreg1$lme 120.76232  <.0001
## model_temp_autoreg2$lme   1.11647  0.2907It appears that adding the first autoregressive term substantially improved the model, but the second lag term may not be necessary.
Modify the model of temperature and season to allow the seasonal effect to change over time. Is there evidence for changing seasonal patterns? How does this change once temperature autocorrelation is accounted for? Hint: this will require use tensor smooths.
Construct a model to estimate seasonal and long-term trends in precipitation for the convention center. What lags are necessary to include in the model? Remember that precipitation is a positive variable (i.e. cannot go below zero), so you may want to transform the variable to better suit the model’s assumptions.
So far we’ve focused on estimating trends where the presence of auto-correlation is just a nuisance effect we need to account for to get accurate estimates. However, in population and community ecology, we often find the time dependence itself to be interesting, as it can give us information on how the population is changing in response to its own density; that is, it helps us estimate what factors are affecting population growth rates. We can use this information to estimate what the equilibrium (carrying capacity) population density is, whether this equilibrium is stable, and how quickly we expect the population to return to equilibrium after perturbation. We can also use these methods to forecast the population into the future, which is very useful for managers3.
We also know, though, that density dependent effects in ecology are often strongly non-linear, so that observed densities at a given time may be a complex function of lagged density values: \(x(t) = f(x(t-1), x(t-2)...)\). Therefore, the linear correlation equations we were just working with may not suffice to estimate the real population dynamics. This is where mgcv comes in, letting us analyze complex time-dependent relationships.
The second data set we’ll work on is the famous “lynx” data. The data set consists of the annual numbers of lynx caught by Hudson’s Bay trappers from 1821 to 1934. This is one of the most well-known data sets in time series analysis (and in population ecology), as it shows clearly cyclic behaviour, with a period of 9-10 years. It is also one of the most well-studied data sets in time series statistics, as standard (linear) time series methods have a hard time capturing the strong cyclic behaviours.
Here we’ll load it, convert it into a data frame that mgcv is able to handle, and plot the data. Note the strong cyclicity, obvious from the acf function.
library(dplyr)
library(mgcv)
library(ggplot2)
library(tidyr)
data(lynx)
lynx_full = data.frame(year=1821:1934, population = as.numeric(lynx))
head(lynx_full)##   year population
## 1 1821        269
## 2 1822        321
## 3 1823        585
## 4 1824        871
## 5 1825       1475
## 6 1826       2821ggplot(aes(x=year, y=population), data= lynx_full)+
  geom_point()+
  geom_line()acf(lynx_full$population)Next we’ll add some extra columns to the data, representing lagged population values. Here we’ll use the lag function from the dplyr package. Note that we’ve log- transformed the population data for the lag values. This is because our model will assume we’re dealing with log-responses, so it will make interpreting plots easier. Also note that we specified the default argument in the lag function as the mean log population. This will replace all the NAs generated at the start of the time series (where there’s no lagged values possible) with the mean value, so we don’t end up throwing out data, and so we can compare models with different lags (as they’ll have the same number of data points).
We’ll also split the data set into 2: a training data set to fit models, and a 40 year testing data set to test how well our models fit out of sample. This is often a crucial step in fitting this sort of data; it is very easy to overfit time series data, and holding data in reserve gives us something to test for overfitting issues.
mean_pop_l = mean(log(lynx_full$population))
lynx_full = mutate(lynx_full, 
                   popl = log(population),
                   lag1 = lag(popl,1, default = mean_pop_l),
                   lag2 = lag(popl,2, default = mean_pop_l),
                   lag3 = lag(popl,3, default = mean_pop_l),
                   lag4 = lag(popl,4, default = mean_pop_l),
                   lag5 = lag(popl,5, default = mean_pop_l),
                   lag6 = lag(popl,6, default = mean_pop_l))
lynx_train = filter(lynx_full, year<1895)
lynx_test = filter(lynx_full, year>=1895)We’ll start by fitting a series of linear auto-regressive models of increasing order, and seeing how well each forecasts the test data, using one-step-ahead forecasting, where we try to forecast the next step for each time point, given the observed data series:
lynx_lm1 = gam(population~lag1, 
                         data=lynx_train, family = "poisson", method="REML")
lynx_lm2 = update(lynx_lm1,formula = population~lag1+lag2)
lynx_lm3 = update(lynx_lm1,formula = population~lag1+lag2+lag3)
lynx_lm4 = update(lynx_lm1,formula = population~lag1+lag2+lag3+lag4)
round(AIC(lynx_lm1,lynx_lm2,
      lynx_lm3,lynx_lm4))##          df   AIC
## lynx_lm1  2 39574
## lynx_lm2  3 21971
## lynx_lm3  4 17873
## lynx_lm4  5 17736lynx_predict_lm = mutate(
  lynx_full,
  lag1_model = as.vector(exp(predict(lynx_lm1,lynx_full))),
  lag2_model = as.vector(exp(predict(lynx_lm2,lynx_full))),
  lag3_model = as.vector(exp(predict(lynx_lm3,lynx_full))),
  lag4_model = as.vector(exp(predict(lynx_lm4,lynx_full))),
  data_type = factor(ifelse(year<1895, "training","testing"),
                     levels= c("training","testing"))
  )
#Gathers all of the predictions into two columns: the model name and the predicted value
lynx_predict_lm = gather(lynx_predict_lm,key= model, value= pop_est,
                                  lag1_model:lag4_model )%>%
  mutate(pop_est = as.numeric(pop_est))
forecast_accuracy_lm = lynx_predict_lm %>% 
  group_by(model)%>%
  filter(data_type=="testing")%>%
  summarize(out_of_sample_r2 = round(cor(log(pop_est),log(population))^2,2))
print(forecast_accuracy_lm)## Source: local data frame [4 x 2]
## 
##        model out_of_sample_r2
##       (fctr)            (dbl)
## 1 lag1_model             0.59
## 2 lag2_model             0.82
## 3 lag3_model             0.77
## 4 lag4_model             0.77ggplot(aes(x=year, y=population), data= lynx_predict_lm)+
  geom_point()+
  geom_line()+
  geom_line(aes(y=pop_est,color=model))+
  scale_color_brewer(palette="Set1")+
  facet_grid(.~data_type,scales = "free_x")+
  theme_bw()Note that the anova table shows that adding each lag substantially reduced model AIC, up to lag-4. Some work on this time series shows that you need up to 11 (!!) linear lag terms to effectively fit the model. Even the long-lagged models under-estimate population densities when the cycle is at a peak, and seem to miss-estimate the period of the cycle; the predicted peaks in the testing data occur 1-2 years after they do in the observed data. Further, looking at the out-of-sample estimates of \(R^2\), \(R^2\) is highest with a two-lag model.
Now let’s look at how adding non-linear lag terms improves the model. We’ve constrained the degrees of freedom of each smooth term as there’s complex correlations between the lags, and allowing too much freedom for the smooths can allow for complex and difficult to interpret smooths, as well as poor out-of-sample performance.
lynx_gam1 = gam(population~s(lag1,k=5), 
                         data=lynx_train, family = "poisson", method="REML")
lynx_gam2 = update(lynx_gam1,population~s(lag1,k=5)+s(lag2,k=5))
lynx_gam3 = update(lynx_gam1,population~s(lag1,k=5)+s(lag2,k=5)+s(lag3,k=5))
lynx_gam4 = update(lynx_gam1,population~s(lag1,k=5)+s(lag2,k=5)+s(lag3,k=5)+
                              s(lag4,k=5))
round(AIC(lynx_gam1,lynx_gam2,lynx_gam3,lynx_gam4))##           df   AIC
## lynx_gam1  5 39096
## lynx_gam2  9 18676
## lynx_gam3 13 13440
## lynx_gam4 17 13081lynx_predict_gam = mutate(
  lynx_full,
  lag1_model = as.vector(exp(predict(lynx_gam1,lynx_full))),
  lag2_model = as.vector(exp(predict(lynx_gam2,lynx_full))),
  lag3_model = as.vector(exp(predict(lynx_gam3,lynx_full))),
  lag4_model = as.vector(exp(predict(lynx_gam4,lynx_full))),
  data_type = factor(ifelse(year<1895, "training","testing"),
                     levels= c("training","testing"))
  )
#Gathers all of the predictions into two columns: the model name and the predicted value
lynx_predict_gam = gather(lynx_predict_gam,key= model, value= pop_est,
                                  lag1_model:lag4_model)%>%
  mutate(pop_est = as.numeric(pop_est))
forecast_accuracy_gam = lynx_predict_gam %>% 
  group_by(model)%>%
  filter(data_type=="testing")%>%
  summarize(out_of_sample_r2 = round(cor(log(pop_est),log(population))^2,2))
print(forecast_accuracy_gam)## Source: local data frame [4 x 2]
## 
##        model out_of_sample_r2
##       (fctr)            (dbl)
## 1 lag1_model             0.59
## 2 lag2_model             0.85
## 3 lag3_model             0.80
## 4 lag4_model             0.80ggplot(aes(x=year, y=population), data= lynx_predict_gam)+
  geom_point()+
  geom_line()+
  geom_line(aes(y=pop_est,color=model))+
  scale_color_brewer(palette="Set1")+
  facet_grid(.~data_type, scales="free_x")+
  theme_bw()The nonlinear models are more able to effectively capture the timing and magnitude of peaks in the cycle than any of the linear models, and explain more of the variance out-of-sample, accounting for 86% of OOS variance for the 2-lag model.
plot(lynx_gam2, page=1,scale=0)Our best fitting model, the 2-lag nonlinear model, indicates that log-lynx densities increase roughly linearly with the prior year’s log-density, but decrease nonlinearly when population densities two year’s previous are high, decreasing at a faster rate when (log) densities are very high two years previously. You’ll see this sort of lagged response in predator-prey cycles.
Focusing on the two-lag model: is there any evidence of an interaction between the population at lag 1 and the population at lag 2? Make sure to choose appropriate interaction terms, keeping in mind what you know about the data.
We ignored confidence intervals on our forecasts. For the 2-lag linear and nonlinear models, determine the appropriate confidence intervals and determine if the observed out-of-sample population densities fall within them.
We assumed that errors were Poisson-distributed for this exercise. However, we know many natural populations are over-dispersed, that is,they show more variation than we’d expect from a Poisson distribution. Choose a distribution to model this extra variation. How does it affect your model estimates, and the estimated degree of non-linearity present?
CHALLENGING: We focused, for simplicity, on one-step-ahead forecasting. However, we often want to forecast over longer time horizons. If you have the time, try figuring out how you can use this model to forecast whole trajectories of the out-of-sample data. Do the forecast trajectories for the linear or non-linear models have the same dynamic properties as the observed data? Hint: This will likely require a for-loop, and you’ll have to calculate new lagged terms for each step.
This is an expanded example based on one on Gavin’s blog, available here. His example also has some great code on how to model the derivatives of the trend.↩
Also note that if your data is not organized so that data are sequential (so that gamm can assume that the prior data point is the one that occurs immediately before the current one), you’d have to give corARMA a unique term specifying the time each observation occurs at. You could do that for this data using: florida= mutate(florida, time = as.Date(paste(year, month,"15", sep = "-"))) corARMA(form = time|year, p=2)↩
This example is based off an example by the statistician Cosma Shalizi, at Carnegie Mellon University. The original example starts at page 517 of his book Advanced Data Analysis from an Elementary Point of view (the example starts at page 517 of the pdf draft available here). This book is a great learning tool for stats in general, and for more on using GAMs.↩