In this example, we look at a spatial data set of forest health data from a stand in Germany. We can treat the individual trees as discrete spatial units and fit a spatial term using a special type of spline basis, a Markov random field.
The data come from a project in the forest of Rothenbuch, Germany, which has been ongoing since 1982. The project studies five species, but here we consider only the beech data. Each year the condition of each of 83 trees is categorized into one of nine ordinal classes according to the degree (percentage) of defoliation. A defoliation value of 0% indicates a healthy beech tree, whilst a value of 100% indicates the tree is dead. For the purposes of this example, the nine ordinal classes are merged to form a three-class response variable of interest: a "low" defoliation group, a "med" group, and a "high" group. The original classes were 0%, 12.5%, 25%, 37.5%, 50%, 62.5%, 75%, 87.5%, and 100%, and are stored in the variable defol.
Alongside the response variable, a number of continuous and categorical covariates are recorded:

id — location ID number
year — year of recording
x and y — the x- and y-coordinates of each location
age — average age of trees at a location
canopyd — canopy density at the location, in %
gradient — slope, in %
alt — altitude above sea level in m
depth — soil depth in cm
ph — soil pH at 0–2cm depth
watermoisture — soil moisture in three categories: 1 (moderately dry), 2 (moderately moist), 3 (moist or temporarily wet)
alkali — fraction of alkali ions in four categories: 1 (very low), 2 (low), 3 (moderate), 4 (high)
humus — thickness of the soil humus layer in five categories: 0 (0cm), 1 (1cm), 2 (2cm), 3 (3cm), 4 (>3cm)
type — type of forest: 0 (deciduous forest), 1 (mixed forest)
fert — fertilization: 0 (not fertilized), 1 (fertilized)

The aim of the example is to investigate the effect of the measured covariates on the degree of defoliation and to quantify any temporal trend and spatial effect of geographic location in the data set, while adjusting for the effects of the other covariates.
The data are extensively analysed in Fahrmeir, Kneib, Lang, and Marx (2013) Regression: Models, Methods and Applications. Springer.
Begin by loading the packages we’ll use in this example
## Load packages
library("mgcv")
## Loading required package: nlme
## This is mgcv 1.8-13. For overview type 'help("mgcv-package")'.
library("ggplot2")
Next, load the forest health data set
## Read in the forest health data
forest <- read.table("./data/forest-health/beech.raw", header = TRUE, na.strings = ".")
head(forest)
## id year defol x y age canopyd gradient alt depth ph watermoisture
## 1 5 1983 0 1.5 5 43 100 2 320 10 4.61 1
## 2 5 1984 0 1.5 5 44 100 2 320 10 4.34 1
## 3 5 1985 0 1.5 5 45 100 2 320 10 4.75 1
## 4 5 1986 0 1.5 5 46 100 2 320 10 3.99 1
## 5 5 1987 0 1.5 5 47 100 2 320 10 5.30 1
## 6 5 1988 0 1.5 5 48 100 2 320 10 3.83 1
## alkali humus type fert
## 1 3 2 1 1
## 2 3 4 1 1
## 3 3 1 1 1
## 4 3 3 1 1
## 5 3 1 1 1
## 6 3 0 1 1
The data require a little processing to ensure they are correctly coded. In the chunks below, the tree id is converted to a zero-padded factor, defol is aggregated into the three-class response aggDefol, the categorical covariates are converted to factors, humus is converted to a 4-level factor, and rows containing NAs are dropped.
## Convert tree IDs to a zero-padded factor
forest <- transform(forest, id = factor(formatC(id, width = 2, flag = "0")))
## Aggregate defoliation & convert categorical vars to factors
levs <- c("low","med","high")
forest <- transform(forest,
aggDefol = as.numeric(cut(defol, breaks = c(-1,10,45,101),
labels = levs)),
watermoisture = factor(watermoisture),
alkali = factor(alkali),
humus = cut(humus, breaks = c(-0.5, 0.5, 1.5, 2.5, 3.5),
labels = 1:4),
type = factor(type),
fert = factor(fert))
forest <- droplevels(na.omit(forest))
head(forest)
## id year defol x y age canopyd gradient alt depth ph watermoisture
## 1 05 1983 0 1.5 5 43 100 2 320 10 4.61 1
## 3 05 1985 0 1.5 5 45 100 2 320 10 4.75 1
## 4 05 1986 0 1.5 5 46 100 2 320 10 3.99 1
## 5 05 1987 0 1.5 5 47 100 2 320 10 5.30 1
## 6 05 1988 0 1.5 5 48 100 2 320 10 3.83 1
## 7 05 1989 0 1.5 5 49 100 2 320 10 4.05 1
## alkali humus type fert aggDefol
## 1 3 3 1 1 1
## 3 3 2 1 1 1
## 4 3 4 1 1 1
## 5 3 2 1 1 1
## 6 3 1 1 1 1
## 7 3 1 1 1 1
with(forest, table(aggDefol))
## aggDefol
## 1 2 3
## 1002 576 48
To view the spatial arrangement of the forest stand, plot the unique x and y coordinate pairs:
## Plot data
ggplot(unique(forest[, c("x","y")]), aes(x = x, y = y)) +
geom_point()
Our first model will ignore the spatial component of the data. All continuous variables except ph and depth are modelled using splines, and all categorical variables are included in the model. Note that we can speed up fitting by turning on multi-threaded parallel processing in parts of the fitting algorithm via the nthreads control argument. The ocat family is used and we specify that there are three classes. Smoothness selection is via REML.
## Model
ctrl <- gam.control(nthreads = 3)
forest.m1 <- gam(aggDefol ~ ph + depth + watermoisture + alkali + humus + type + fert +
s(age) + s(gradient, k = 20) + s(canopyd) + s(year) + s(alt),
data = forest, family = ocat(R = 3), method = "REML",
control = ctrl)
The model summary
summary(forest.m1)
##
## Family: Ordered Categorical(-1,4.27)
## Link function: identity
##
## Formula:
## aggDefol ~ ph + depth + watermoisture + alkali + humus + type +
## fert + s(age) + s(gradient, k = 20) + s(canopyd) + s(year) +
## s(alt)
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.86977 1.46473 1.959 0.05008 .
## ph -0.66048 0.31818 -2.076 0.03791 *
## depth -0.05061 0.01053 -4.806 1.54e-06 ***
## watermoisture2 1.20087 0.37392 3.212 0.00132 **
## watermoisture3 1.43426 0.36591 3.920 8.86e-05 ***
## alkali2 -1.66310 0.26841 -6.196 5.79e-10 ***
## alkali3 -2.62266 0.43275 -6.060 1.36e-09 ***
## alkali4 -1.20589 0.51273 -2.352 0.01868 *
## humus2 0.41113 0.19317 2.128 0.03331 *
## humus3 0.89086 0.21644 4.116 3.86e-05 ***
## humus4 0.49641 0.25131 1.975 0.04824 *
## type1 -1.40312 0.21748 -6.452 1.11e-10 ***
## fert1 -0.81955 0.39343 -2.083 0.03724 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(age) 7.428 8.320 105.75 < 2e-16 ***
## s(gradient) 15.750 17.379 157.66 < 2e-16 ***
## s(canopyd) 4.694 5.695 94.64 < 2e-16 ***
## s(year) 7.391 8.371 74.93 1.10e-12 ***
## s(alt) 4.995 6.031 39.99 5.17e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Deviance explained = 47.1%
## -REML = 822.86 Scale est. = 1 n = 1626
indicates that all model terms are significant at the 95% level, especially the smooth terms. Don’t pay too much attention to this, though: we have not yet accounted for spatial structure in the data, several of the covariates are likely to be spatially autocorrelated, and hence the identified effects may be spurious.
This is somewhat confirmed by the form of the fitted functions; rather than smooth, monotonic functions, many terms are highly non-linear and difficult to interpret.
plot(forest.m1, pages = 1, shade = 2, scale = 0)
We might expect that older trees are more susceptible to damage, yet confusingly there is a decreasing effect for the very oldest trees. The smooth of gradient is uninterpretable. Trees in low- and high-altitude areas are less damaged than those in areas of intermediate elevation, which is also counterintuitive.
Running gam.check()
gam.check(forest.m1)
##
## Method: REML Optimizer: outer newton
## full convergence after 7 iterations.
## Gradient range [-3.614328e-09,6.956565e-08]
## (score 822.8606 & scale 1).
## Hessian positive definite, eigenvalue range [0.9857476,407.0144].
## Model rank = 68 / 68
##
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
##
## k' edf k-index p-value
## s(age) 9.000 7.428 0.988 0.38
## s(gradient) 19.000 15.750 0.493 0.00
## s(canopyd) 9.000 4.694 0.550 0.00
## s(year) 9.000 7.391 1.009 0.70
## s(alt) 9.000 4.995 0.489 0.00
suggests no major problems, but the residual plot is difficult to interpret owing to the categorical nature of the response variable. The printed output suggests some smooth terms may need their basis dimension increasing. Before we do this however, we should add a spatial effect to the model.
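If we did want to follow up on the low k-index values, one option would be to refit with a larger basis dimension for the flagged smooths. The chunk below is only a sketch (not run here, and the particular k values are arbitrary choices, not taken from the original analysis); we hold off because the spatial term added next may change which smooths are needed at all.
## Not run: a possible refit with larger basis dimensions for the flagged smooths
## forest.m1b <- gam(aggDefol ~ ph + depth + watermoisture + alkali + humus + type + fert +
##                       s(age) + s(gradient, k = 30) + s(canopyd, k = 20) +
##                       s(year) + s(alt, k = 20),
##                   data = forest, family = ocat(R = 3), method = "REML",
##                   control = ctrl)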
For these data, it would be more natural to fit a spatial effect via a 2-d smoother as we’ve considered in other examples. However, we can consider the trees as being discrete spatial units and fit a spatial effect via a Markov random field (MRF) smooth. To fit the MRF smoother, we need information on the neighbours of each tree. In this example, any tree within 1.8km of a focal tree was considered a neighbour of that focal tree. This neighbourhood information is stored in a BayesX graph file, which we can convert into the format needed for mgcv’s MRF basis function. To facilitate reading the graph file, load the utility function gra2mgcv()
## source the graph-reading function
source("./code_snippets/gra2mgcv.R")
Next we load the .gra file and do some manipulations to match the forest environmental data file:
## Read in graph file and output list required by mgcv
nb <- gra2mgcv("./data/forest-health/foresthealth.gra")
nb <- nb[order(as.numeric(names(nb)))]
names(nb) <- formatC(as.numeric(names(nb)), width = 2, flag = "0")
Look at the structure of nb:
head(nb)
## $`01`
## [1] 2 3 4 6 7
##
## $`02`
## [1] 1 3 6 7 9
##
## $`03`
## [1] 2 1 4 6 7 8 10
##
## $`04`
## [1] 3 1 6 7 8 10 11
##
## $`05`
## [1] 8 11
##
## $`06`
## [1] 4 3 2 1 7 9 10 13
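As a quick check (not part of the original analysis), we can confirm that the neighbourhood structure is symmetric, i.e. if tree i lists tree j as a neighbour then tree j also lists tree i, which mgcv expects when it builds the MRF penalty. This assumes, as the listing above suggests, that the values in each component index positions within nb.
## Symmetry check: if j is a neighbour of i, then i should be a neighbour of j
all(vapply(seq_along(nb), function(i) {
    all(vapply(nb[[i]], function(j) i %in% nb[[j]], logical(1)))
}, logical(1)))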
In mgcv the MRF basis can be specified by one or more of:

polys — coordinates of vertices defining spatial polygons for each discrete spatial unit. Any two spatial units that share one or more vertices are considered neighbours;
nb — a list with one component per spatial unit, where each component contains the indices of the neighbouring spatial units; and/or
penalty — the actual penalty matrix of the MRF basis. This is an \(N\) by \(N\) matrix with the number of neighbours of each unit on the diagonal, and the \(j\)th column of the \(i\)th row set to -1 if the \(j\)th spatial unit is a neighbour of the \(i\)th unit. Elsewhere the penalty matrix is all 0s.

mgcv will create the penalty matrix for you if you supply polys or nb (a small sketch of this construction is given below). As we don’t have polygons here, we’ll use the information from the .gra file, converted to the format needed by gam(), to indicate the neighbourhood.
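To make the structure of the penalty concrete, here is a minimal sketch of how such a matrix could be built from the nb list. The helper nb2penalty() is hypothetical, written purely for illustration; mgcv performs this construction internally, so we never need to do it ourselves.
## Illustrative only: build an MRF-style penalty from a neighbourhood list
## (number of neighbours on the diagonal, -1 for each neighbour, 0 elsewhere)
nb2penalty <- function(nb) {
    n <- length(nb)
    P <- matrix(0, n, n, dimnames = list(names(nb), names(nb)))
    for (i in seq_len(n)) {
        P[i, i] <- length(nb[[i]])   # number of neighbours of unit i
        P[i, nb[[i]]] <- -1          # mark the neighbours of unit i
    }
    P
}
pen <- nb2penalty(nb)
pen[1:4, 1:4]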
The nb list is passed along using the xt argument of the s() function. For the MRF basis, xt is a named list with one or more of the components listed above. In the model call below we use xt = list(nb = nb) to pass on the neighbourhood list. To indicate that an MRF smooth is needed, we use bs = "mrf", and the covariate of the smooth is the factor indicating to which spatial unit each observation belongs. In the call below we use the id variable. Note that this doesn’t need to be a factor, just something that can be coerced to one.
All other aspects of the model fit remain the same, hence we use update() to add the MRF smooth without repeating everything from the original call.
## Fit model with MRF
## forest.m2 <- gam(aggDefol ~ ph + depth + watermoisture + alkali + humus + type + fert +
## s(age) + s(gradient, k = 20) + s(canopyd) + s(year) + s(alt) +
## s(id, bs = "mrf", xt = list(nb = nb)),
## data = forest, family = ocat(R = 3), method = "REML",
## control = ctrl)
forest.m2 <- update(forest.m1, . ~ . + s(id, bs = "mrf", xt = list(nb = nb)))
Look at the model summary:
summary(forest.m2)
##
## Family: Ordered Categorical(-1,6.27)
## Link function: identity
##
## Formula:
## aggDefol ~ ph + depth + watermoisture + alkali + humus + type +
## fert + s(age) + s(gradient, k = 20) + s(canopyd) + s(year) +
## s(alt) + s(id, bs = "mrf", xt = list(nb = nb))
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.74165 2.74005 -0.636 0.52502
## ph -0.12912 0.39361 -0.328 0.74287
## depth -0.02665 0.04083 -0.653 0.51387
## watermoisture2 1.82028 1.63539 1.113 0.26568
## watermoisture3 3.20482 1.63619 1.959 0.05015 .
## alkali2 -0.57099 1.00411 -0.569 0.56959
## alkali3 -1.58087 1.58047 -1.000 0.31719
## alkali4 -0.18128 1.83951 -0.099 0.92150
## humus2 0.18027 0.22674 0.795 0.42657
## humus3 0.80192 0.26132 3.069 0.00215 **
## humus4 0.73747 0.31548 2.338 0.01941 *
## type1 -1.47082 0.80333 -1.831 0.06712 .
## fert1 -1.85722 1.49302 -1.244 0.21352
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(age) 7.478 8.369 97.018 < 2e-16 ***
## s(gradient) 1.831 1.895 0.512 0.720
## s(canopyd) 3.665 4.532 27.318 3.86e-05 ***
## s(year) 7.693 8.568 97.004 < 2e-16 ***
## s(alt) 1.001 1.001 0.034 0.853
## s(id) 56.790 73.000 539.138 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Deviance explained = 64.4%
## -REML = 703.5 Scale est. = 1 n = 1626
What differences do you see between the model with the MRF spatial effect and the first model that we fitted?
Quickly create plots of the fitted smooth functions
plot(forest.m2, pages = 1, shade = 2, scale = 0, seWithMean = TRUE)
Notice that here we use seWithMean = TRUE as one of the terms has been shrunk back to a linear function and would otherwise have a bow-tie confidence interval.
Compare these fitted functions with those from the original model. Are these more in keeping with expectations?
Model diagnostics are difficult with models for discrete responses such as these. Instead, we can interrogate the model to derive quantities of interest that illustrate or summarise the model fit. We start by generating posterior probabilities for the three ordinal defoliation/damage categories using predict(). This is similar to the code Eric showed this morning for the ordered categorical family.
fit <- predict(forest.m2, type = "response")
colnames(fit) <- levs
head(fit)
## low med high
## 1 0.9999270 0.0000729851 5.080673e-08
## 3 0.9995600 0.0004396671 3.061755e-07
## 4 0.9976596 0.0023387352 1.631751e-06
## 5 0.9980228 0.0019758656 1.378073e-06
## 6 0.9981903 0.0018084697 1.261111e-06
## 7 0.9990970 0.0009023911 6.286986e-07
The predict() method returns a 3-column matrix, one column per category. The entries in the matrix are the posterior probabilities of class membership, with rows summing to 1. As we’ll see later, we can take a majority-wins approach and assign each observation a fitted class membership and compare these to the known classes.
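As a small sanity check (not part of the original analysis), we can confirm that the fitted probabilities do sum to 1 across the three categories:
## each row of the fitted probability matrix should sum (numerically) to 1
range(rowSums(fit))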
For easier visualisation it would be nicer to have these fitted probabilities in a tidy form, which we now do via a few manipulations
fit <- setNames(stack(as.data.frame(fit)), c("Prob", "Defoliation"))
fit <- transform(fit, Defoliation = factor(Defoliation, levels = levs))
fit <- with(forest, cbind(fit, x, y, year))
head(fit)
## Prob Defoliation x y year
## 1 0.9999270 low 1.5 5 1983
## 2 0.9995600 low 1.5 5 1985
## 3 0.9976596 low 1.5 5 1986
## 4 0.9980228 low 1.5 5 1987
## 5 0.9981903 low 1.5 5 1988
## 6 0.9990970 low 1.5 5 1989
The fitted values are now ready for plotting. Here we plot just the years 2000–2004, showing the spatial arrangement of trees, faceted by defoliation/damage category:
ggplot(subset(fit, year >= 2000), aes(x = x, y = y, col = Prob)) +
geom_point(size = 1.2) +
facet_grid(Defoliation ~ year) +
coord_fixed() +
theme(legend.position = "top")
A more complex use of the model involves predicting change in class posterior probability over time whilst holding all other variables at the observed mean or category mode
N <- 200
pdat <- with(forest, expand.grid(year = seq(min(year), max(year), length = N),
age = mean(age, na.rm = TRUE),
gradient = mean(gradient, na.rm = TRUE),
canopyd = mean(canopyd, na.rm = TRUE),
alt = mean(alt, na.rm = TRUE),
depth = mean(depth, na.rm = TRUE),
ph = mean(ph, na.rm = TRUE),
humus = factor(2, levels = levels(humus)),
watermoisture = factor(2, levels = levels(watermoisture)),
alkali = factor(2, levels = levels(alkali)),
type = factor(0, levels = c(0,1)),
fert = factor(0, levels = c(0,1)),
id = levels(id)))
pred <- predict(forest.m2, newdata = pdat, type = "response", exclude = "s(id)")
colnames(pred) <- levs
pred <- setNames(stack(as.data.frame(pred)), c("Prob","Defoliation"))
pred <- cbind(pred, pdat)
pred <- transform(pred, Defoliation = factor(Defoliation, levels = levs))
A plot of the predictions is produced using
ggplot(pred, aes(x = year, y = Prob, colour = Defoliation)) +
geom_line() +
theme(legend.position = "top")
Using similar code, produce a plot for the effect of age, holding other values at the mean or mode.
The final summary of the model that we’ll produce is a confusion matrix of observed and predicted class membership. Technically, this is over-optimistic as we are using the same data to both fit and test the model, but it is an illustration of how well the model does at fitting the observed classes.
fit <- predict(forest.m2, type = "response")
fitClass <- factor(levs[apply(fit, 1, which.max)], levels = levs)
obsClass <- with(forest, factor(aggDefol, labels = levs))
sum(fitClass == obsClass) / length(fitClass)
## [1] 0.8659287
table(fitClass, obsClass)
## obsClass
## fitClass low med high
## low 920 105 0
## med 82 461 21
## high 0 10 27
Just don’t take it as any indication of how well the model will do at predicting the defoliation class for a new year or tree.
How much better does the model with the MRF spatial effect fit the data compared to the original model? We can use AIC as one means of answering that question:
AIC(forest.m1, forest.m2)
## df AIC
## forest.m1 58.19453 1555.807
## forest.m2 101.54578 1262.811
Which model does AIC indicate as fitting the data best?
As an additional exercise, replace the MRF smooth with a 2-d spline of the x and y coordinates and produce maps of the class probabilities over the spatial domain. You’ll need to use ideas and techniques from some of the other spatial examples in order to complete this task.
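If you want a starting point, one possible model call is sketched below (not run here; the default thin plate spline basis and its default basis dimension are assumptions you may want to revisit):
## Not run: a possible spatial model using a 2-d smooth of the coordinates
## forest.m3 <- update(forest.m1, . ~ . + s(x, y))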