Introduction

In this example, we look at a spatial data set of forest health data from a stand in Germany. We can treat the individual trees as discrete spatial units and fit a spatial term using a special type of spline basis, a Markov random field.

The data come from a project n the forest of Rothenbuch, Germany, which has been ongoing since 1982. The project studies five species but here we just consider the beech data. Each year the condition of each of 83 trees is categorized into one of nine ordinal classes in terms of the degree (percentage) of defoliation. A defoliation value of 0% indicates a healthy beech tree, whilst a value of 100% indicates a tree is dead. For the purposes of this example, several of the nine ordinal classes are merged to form a three class response variable of interest

  1. 0%, healthy trees that I’ll refer to as the "low" defoliation group
  2. 12.5% – 37.5% defoliation, which I refer to as the "med" group, and
  3. 50% – 100% defoliation, which is the "high" group.

The original classes were 0%, 12.5%, 37.5%, 50%, 62.5%, 75%, 87.5%, 100% and are in variable defol.

Alongside the response variable, a number of continuous and categorical covariates are recorded

  • id — location ID number
  • year — year of recording
  • x and y — the x- and y-coordinates of each location
  • age — average age of trees at a location
  • canopyd — canopy density at the location, in %
  • gradient — slope, in %
  • alt — altitude above sea level in m
  • depth — soil depth in cm
  • ph — soil pH at 0–2cm depth
  • watermoisture — soil moisture in three categories
    1. 1 moderately dry
    2. 2 moderately moist
    3. 3 moist or temporarily wet
  • alkali — fraction of alkali ions in four categories
    1. 1 very low
    2. 2 low
    3. 3 moderate
    4. 4 high
  • humus thickness of the soil humus layer in five categories
    1. 0 0cm
    2. 1 1cm
    3. 2 2cm
    4. 3 3cm
    5. 4 $>$3cm
  • type — type of forest
    • 0 deciduous forest
    • 1 mixed forest
  • fert — fertilization
    • 0 not fertilized
    • 1 fertilized

The aim of the example is to investigate the effect of the measured covariates on the degree of defoliation and to quantify any temporal trend and spatial effect of geographic location in the data set, while adjusting for the effects of the other covariates.

The data are extensively analysed in Fahrmeir, Kneib, Lang, and Marx (2013) Regression: Models, Methods and Applications. Springer.

Getting started

Begin by loading the packages we’ll use in this example

## Load packages
library("mgcv")
## Loading required package: nlme
## This is mgcv 1.8-13. For overview type 'help("mgcv-package")'.
library("ggplot2")

Next, load the forest health data set

## Read in the forest health data
forest <- read.table("./data/forest-health/beech.raw", header = TRUE, na.strings = ".")
head(forest)
##   id year defol   x y age canopyd gradient alt depth   ph watermoisture
## 1  5 1983     0 1.5 5  43     100        2 320    10 4.61             1
## 2  5 1984     0 1.5 5  44     100        2 320    10 4.34             1
## 3  5 1985     0 1.5 5  45     100        2 320    10 4.75             1
## 4  5 1986     0 1.5 5  46     100        2 320    10 3.99             1
## 5  5 1987     0 1.5 5  47     100        2 320    10 5.30             1
## 6  5 1988     0 1.5 5  48     100        2 320    10 3.83             1
##   alkali humus type fert
## 1      3     2    1    1
## 2      3     4    1    1
## 3      3     1    1    1
## 4      3     3    1    1
## 5      3     1    1    1
## 6      3     0    1    1

The data require a little processing to ensure they are correctly coded. The chunk below does

  1. Makes a nicer stand label so that we can sort stands numerically even though the id is character,
  2. Aggregates the defoliation data into an ordered categorical variable for low, medium, or high levels of defoliation. This is converted to a numeric variable for modelling,
  3. Convert the categorical variables to factors. humus is converted to a 4-level factor,
  4. Remove some NAs, and
  5. Summarises the response variable in a table.
forest <- transform(forest, id = factor(formatC(id, width = 2, flag = "0")))

## Aggregate defoliation & convert categorical vars to factors
levs <- c("low","med","high")
forest <- transform(forest,
                    aggDefol = as.numeric(cut(defol, breaks = c(-1,10,45,101),
                                              labels = levs)),
                    watermoisture = factor(watermoisture),
                    alkali = factor(alkali),
                    humus = cut(humus, breaks = c(-0.5, 0.5, 1.5, 2.5, 3.5),
                                labels = 1:4),
                    type = factor(type),
                    fert = factor(fert))
forest <- droplevels(na.omit(forest))
head(forest)
##   id year defol   x y age canopyd gradient alt depth   ph watermoisture
## 1 05 1983     0 1.5 5  43     100        2 320    10 4.61             1
## 3 05 1985     0 1.5 5  45     100        2 320    10 4.75             1
## 4 05 1986     0 1.5 5  46     100        2 320    10 3.99             1
## 5 05 1987     0 1.5 5  47     100        2 320    10 5.30             1
## 6 05 1988     0 1.5 5  48     100        2 320    10 3.83             1
## 7 05 1989     0 1.5 5  49     100        2 320    10 4.05             1
##   alkali humus type fert aggDefol
## 1      3     3    1    1        1
## 3      3     2    1    1        1
## 4      3     4    1    1        1
## 5      3     2    1    1        1
## 6      3     1    1    1        1
## 7      3     1    1    1        1
with(forest, table(aggDefol))
## aggDefol
##    1    2    3 
## 1002  576   48

To view the spatial arrangement of the forest stand, plot the unique x and y coordinate pairs

## Plot data
ggplot(unique(forest[, c("x","y")]), aes(x = x, y = y)) +
    geom_point()

Non-spatial ordered categorical GAM

Our first model will ignore the spatial component of the data. All continuous variables except ph and depth are modelled using splines, and all categorical variables are included in the model. Note that we can speed up fitting by turning on some multi-threaded parallel processing in parts of the fitting algorithm via the nthreads control argument. The ocat family is used and we specify that there are three classes. Smoothness selection is via REML.

## Model
ctrl <- gam.control(nthreads = 3)
forest.m1 <- gam(aggDefol ~ ph + depth + watermoisture + alkali + humus + type + fert +
                    s(age) + s(gradient, k = 20) + s(canopyd) + s(year) + s(alt),
                data = forest, family = ocat(R = 3), method = "REML",
                control = ctrl)

The model summary

summary(forest.m1)
## 
## Family: Ordered Categorical(-1,4.27) 
## Link function: identity 
## 
## Formula:
## aggDefol ~ ph + depth + watermoisture + alkali + humus + type + 
##     fert + s(age) + s(gradient, k = 20) + s(canopyd) + s(year) + 
##     s(alt)
## 
## Parametric coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     2.86977    1.46473   1.959  0.05008 .  
## ph             -0.66048    0.31818  -2.076  0.03791 *  
## depth          -0.05061    0.01053  -4.806 1.54e-06 ***
## watermoisture2  1.20087    0.37392   3.212  0.00132 ** 
## watermoisture3  1.43426    0.36591   3.920 8.86e-05 ***
## alkali2        -1.66310    0.26841  -6.196 5.79e-10 ***
## alkali3        -2.62266    0.43275  -6.060 1.36e-09 ***
## alkali4        -1.20589    0.51273  -2.352  0.01868 *  
## humus2          0.41113    0.19317   2.128  0.03331 *  
## humus3          0.89086    0.21644   4.116 3.86e-05 ***
## humus4          0.49641    0.25131   1.975  0.04824 *  
## type1          -1.40312    0.21748  -6.452 1.11e-10 ***
## fert1          -0.81955    0.39343  -2.083  0.03724 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                edf Ref.df Chi.sq  p-value    
## s(age)       7.428  8.320 105.75  < 2e-16 ***
## s(gradient) 15.750 17.379 157.66  < 2e-16 ***
## s(canopyd)   4.694  5.695  94.64  < 2e-16 ***
## s(year)      7.391  8.371  74.93 1.10e-12 ***
## s(alt)       4.995  6.031  39.99 5.17e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Deviance explained = 47.1%
## -REML = 822.86  Scale est. = 1         n = 1626

indicates all model terms are significant at the 95% level, especially the smooth terms. Don’t pay too much attention to this though as we have not yet accounted for spatial structure in the data and several of the variables are likely to be spatially autocorrelated and hence the identified effects may be spurious.

This is somewhat confirmed by the form of the fitted functions; rather than smooth, monotonic functions man terms are highly non-linear and difficult to interpret.

plot(forest.m1, pages = 1, shade = 2, scale = 0)

We might expect that older trees are more susceptible to damage yet confusingly there is a decreasing effect for the very oldest trees. The smooth of gradient is uninterpretable. Trees in low and high altitude areas are less damaged than those in areas of intermediate elevation, which is also counter intuitive.

Running gam.check()`

gam.check(forest.m1)

## 
## Method: REML   Optimizer: outer newton
## full convergence after 7 iterations.
## Gradient range [-3.614328e-09,6.956565e-08]
## (score 822.8606 & scale 1).
## Hessian positive definite, eigenvalue range [0.9857476,407.0144].
## Model rank =  68 / 68 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##                 k'    edf k-index p-value
## s(age)       9.000  7.428   0.988    0.38
## s(gradient) 19.000 15.750   0.493    0.00
## s(canopyd)   9.000  4.694   0.550    0.00
## s(year)      9.000  7.391   1.009    0.70
## s(alt)       9.000  4.995   0.489    0.00

suggests no major problems, but the residual plot is difficult to interpret owing to the categorical nature of the response variable. The printed output suggests some smooth terms may need their basis dimension increasing. Before we do this however, we should add a spatial effect to the model.

Spatial GAM via a MRF smooth

For these data, it would be more natural to fit a spatial effect via a 2-d smoother as we’ve considered in other examples. However, we can consider the trees as being discrete spatial units and fit a spatial effect via a Markov random field (MRF) smooth. To fit the MRF smoother, we need information on the neighbours of each tree. In this example, any tree within 1.8km of a focal tree was considered a neighbour of that focal tree. This neighbourhood information is stored in a BayesX graph file, which we can convert into the format needed for mgcv’s MRF basis function. To facilitate reading the graph file, load the utility function gra2mgcv()

## souce graph reading function
source("./code_snippets/gra2mgcv.R")

Next we load the .gra file and do some manipulations to match the forest environmental data file

## Read in graph file and output list required by mgcv
nb <- gra2mgcv("./data/forest-health/foresthealth.gra")
nb <- nb[order(as.numeric(names(nb)))]
names(nb) <- formatC(as.numeric(names(nb)), width = 2, flag = "0")

Look at the structure of nb:

head(nb)
## $`01`
## [1] 2 3 4 6 7
## 
## $`02`
## [1] 1 3 6 7 9
## 
## $`03`
## [1]  2  1  4  6  7  8 10
## 
## $`04`
## [1]  3  1  6  7  8 10 11
## 
## $`05`
## [1]  8 11
## 
## $`06`
## [1]  4  3  2  1  7  9 10 13

In mgcv the MRF basis can be specified by one or more of

  1. polys; coordinates of vertices defining spatial polygons for each discrete spatial unit. Any two spatial units that share one or more vertices are considered neighbours,
  2. nb; a list with one component per spatial unit, where each component contains indices of the neighbouring components of the current spatial unit, and/or
  3. penalty; the actual penalty matrix of the MRF basis. This is an \(N\) by \(N\) matrix with the number of neighbours of each unit on the diagonal, and the \(j\)th column of the \(i\)th row set to -1 if the \(j\)th spatial unit is a neighbour of the \(i\)th unit. Elsewhere the penalty matrix is all 0s.

mgcv will create the penalty matrix for you if you supply polys or nb. As we don’t have polygons here, we’ll use the information from the .gra file converted to the format needed by gam() to indicate the neighbourhood.

The nb list is passed along using the xt argument of the s() function. For the MRF basis, xt is a named list with one or more of the components listed above. In the model call below we use xt = list(nb = nb) to pass on the neighbourhood list. To indicate that an MRF smooth is needed, we use bs = "mrf" and the covariate of the smooth is the factor indicating to which spatial unit each observation belongs. In the call below we use the id variable. Note that this doesn’t need to be a factor, just something that can be coerced to one.

All other aspects of the model fit remain the same hence we use update() to add the MRF smooth without repeating everything from the original call.

## Fit model with MRF
## forest.m2 <- gam(aggDefol ~ ph + depth + watermoisture + alkali + humus + type + fert +
##                      s(age) + s(gradient, k = 20) + s(canopyd) + s(year) + s(alt) +
##                      s(id, bs = "mrf", xt = list(nb = nb)),
##                 data = forest, family = ocat(R = 3), method = "REML",
##                 control = ctrl)
forest.m2 <- update(forest.m1, . ~ . + s(id, bs = "mrf", xt = list(nb = nb)))

Look at the model summary:

summary(forest.m2)
## 
## Family: Ordered Categorical(-1,6.27) 
## Link function: identity 
## 
## Formula:
## aggDefol ~ ph + depth + watermoisture + alkali + humus + type + 
##     fert + s(age) + s(gradient, k = 20) + s(canopyd) + s(year) + 
##     s(alt) + s(id, bs = "mrf", xt = list(nb = nb))
## 
## Parametric coefficients:
##                Estimate Std. Error z value Pr(>|z|)   
## (Intercept)    -1.74165    2.74005  -0.636  0.52502   
## ph             -0.12912    0.39361  -0.328  0.74287   
## depth          -0.02665    0.04083  -0.653  0.51387   
## watermoisture2  1.82028    1.63539   1.113  0.26568   
## watermoisture3  3.20482    1.63619   1.959  0.05015 . 
## alkali2        -0.57099    1.00411  -0.569  0.56959   
## alkali3        -1.58087    1.58047  -1.000  0.31719   
## alkali4        -0.18128    1.83951  -0.099  0.92150   
## humus2          0.18027    0.22674   0.795  0.42657   
## humus3          0.80192    0.26132   3.069  0.00215 **
## humus4          0.73747    0.31548   2.338  0.01941 * 
## type1          -1.47082    0.80333  -1.831  0.06712 . 
## fert1          -1.85722    1.49302  -1.244  0.21352   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                edf Ref.df  Chi.sq  p-value    
## s(age)       7.478  8.369  97.018  < 2e-16 ***
## s(gradient)  1.831  1.895   0.512    0.720    
## s(canopyd)   3.665  4.532  27.318 3.86e-05 ***
## s(year)      7.693  8.568  97.004  < 2e-16 ***
## s(alt)       1.001  1.001   0.034    0.853    
## s(id)       56.790 73.000 539.138  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Deviance explained = 64.4%
## -REML =  703.5  Scale est. = 1         n = 1626

What differences do you see between the model with the MRF spatial effect and the first model that we fitted?

Quickly create plots of the fitted smooth functions

plot(forest.m2, pages = 1, shade = 2, scale = 0, seWithMean = TRUE)

Notice that here we use seWithMean = TRUE as one of the terms has been shrunk back to a linear function and would otherwise have a bow tie confidence interval.

Compare these fitted functions with those from the original model. Are these more in keeping with expectations?

Model diagnostics are difficult with models for discrete responses such as these. Instead we can interrogate the model to derive quantities of interest that illustrate or summarise the model fit. First we start by generating posterior probabilities for the three ordinal defoliation/damage categories using predict(). This is similar to the code Eric showed this morning for the ordered categorical family.

fit <- predict(forest.m2, type = "response")
colnames(fit) <- levs
head(fit)
##         low          med         high
## 1 0.9999270 0.0000729851 5.080673e-08
## 3 0.9995600 0.0004396671 3.061755e-07
## 4 0.9976596 0.0023387352 1.631751e-06
## 5 0.9980228 0.0019758656 1.378073e-06
## 6 0.9981903 0.0018084697 1.261111e-06
## 7 0.9990970 0.0009023911 6.286986e-07

The predict() method returns a 3-column matrix, one column per category. The entries in the matrix are the posterior probabilities of class membership, with rows summing to 1. As we’ll see later, we can take a majority wins approach and assign each observation a fitted class membership and compare these to the known classes.

For easier visualisation it would be nicer to have these fitted probabilities in a tidy form, which we now do via a few manipulations

fit <- setNames(stack(as.data.frame(fit)), c("Prob", "Defoliation"))
fit <- transform(fit, Defoliation = factor(Defoliation, labels = levs))
fit <- with(forest, cbind(fit, x, y, year))
head(fit)
##        Prob Defoliation   x y year
## 1 0.9999270         med 1.5 5 1983
## 2 0.9995600         med 1.5 5 1985
## 3 0.9976596         med 1.5 5 1986
## 4 0.9980228         med 1.5 5 1987
## 5 0.9981903         med 1.5 5 1988
## 6 0.9990970         med 1.5 5 1989

The fitted values are now ready for plotting. Here we plot just the years 2000–2004, showing the spatial arrangement of trees, faceted by defoliation/damage category:

ggplot(subset(fit, year >= 2000), aes(x = x, y = y, col = Prob)) +
    geom_point(size = 1.2) +
    facet_grid(Defoliation ~ year) +
    coord_fixed() +
    theme(legend.position = "top")

A more complex use of the model involves predicting change in class posterior probability over time whilst holding all other variables at the observed mean or category mode

N <- 200
pdat <- with(forest, expand.grid(year = seq(min(year), max(year), length = N),
                                 age = mean(age, na.rm = TRUE),
                                 gradient = mean(gradient, na.rm = TRUE),
                                 canopyd = mean(canopyd, na.rm = TRUE),
                                 alt = mean(alt, na.rm = TRUE),
                                 depth = mean(depth, na.rm = TRUE),
                                 ph = mean(ph, na.rm = TRUE),
                                 humus = factor(2, levels = levels(humus)),
                                 watermoisture = factor(2, levels = levels(watermoisture)),
                                 alkali = factor(2, levels = levels(alkali)),
                                 type = factor(0, levels = c(0,1)),
                                 fert = factor(0, levels = c(0,1)),
                                 id = levels(id)))
pred <- predict(forest.m2, newdata = pdat, type = "response", exclude = "s(id)")
colnames(pred) <- levs
pred <- setNames(stack(as.data.frame(pred)), c("Prob","Defoliation"))
pred <- cbind(pred, pdat)
pred <- transform(pred, Defoliation = factor(Defoliation, levels = levs))

A plot of the predictions is produced using

ggplot(pred, aes(x = year, y = Prob, colour = Defoliation)) +
    geom_line() +
    theme(legend.position = "top")

Using similar code, produce a plot for the effect of age holding other values at the mean or mode.

The final summary of the model that we’ll produce is a confusion matrix of observed and predicted class membership. Technically, this is over-optimistic as we are using the same data to both fit and test the model, but it is an illustration of how well the model does are fitting the observed classes.

fit <- predict(forest.m2, type = "response")
fitClass <- factor(levs[apply(fit, 1, which.max)], levels = levs)
obsClass <- with(forest, factor(aggDefol, labels = levs))
sum(fitClass == obsClass) / length(fitClass)
## [1] 0.8659287
table(fitClass, obsClass)
##         obsClass
## fitClass low med high
##     low  920 105    0
##     med   82 461   21
##     high   0  10   27

Just don’t take it as any indication of how well the model will do at predicting the defoliation class for a new year or tree.

How better does the model with the MRF spatial effect fit the data compared to the original model that was fitted? We can use AIC as one means of answering that question

AIC(forest.m1, forest.m2)
##                  df      AIC
## forest.m1  58.19453 1555.807
## forest.m2 101.54578 1262.811

Which model does AIC indicate as fitting the data best?

Exercise

As an additional exercise, replace the MRF smooth with a 2-d spline of the x and y coordinates and produce maps of the class probabilities over the spatial domain. You’ll need to use ideas and techniques from some of the other spatial examples in order to complete this task.