This exercise is based on the Appendix of Miller et al. (2013). In this example we're ignoring all kinds of important things like detectability and availability. This should not be treated as a serious analysis of these data! For a more complete treatment of detection-corrected abundance estimation via distance sampling and generalized additive models, see Miller et al. (2013).
From that appendix:
The analysis is based on a dataset of observations of pantropical dolphins in the Gulf of Mexico (shipped with Distance 6.0 and later). For convenience the data are bundled in an R-friendly format, although all of the code necessary for creating the data from the Distance project files is available on GitHub. The OBIS-SEAMAP page for the data may be found at the SEFSC GoMex Oceanic 1996 survey page.
Probably the easiest way to do these exercises is to open this document in RStudio and go through the code blocks one by one (hitting the "play" button in the editor window), filling in the code where necessary. You can then compile the document once you're done to check that everything works.
The data are provided in the data/mexdolphins folder as the file mexdolphins.RData. Loading this we can see what is provided:
load("data/mexdolphins/mexdolphins.RData")
ls()
## [1] "mexdolphins" "pred_latlong" "preddata"
mexdolphins: the data.frame containing the observations and covariates, used to fit the model
pred_latlong: an sp object that has the shapefile for the prediction grid, used for fancy graphs
preddata: the prediction grid without any fancy spatial stuff
Looking further into the mexdolphins frame we see:
str(mexdolphins)
## 'data.frame': 387 obs. of 11 variables:
## $ Sample.Label : chr "19960417-1" "19960417-2" "19960417-3" "19960417-4" ...
## $ longitude : num -86.9 -86.8 -86.7 -86.7 -86.6 ...
## $ latitude : num 29.9 29.8 29.8 29.7 29.6 ...
## $ x : num 836106 846013 855023 864610 873598 ...
## $ y : num -1011416 -1021407 -1029785 -1039168 -1048266 ...
## $ Effort : num 13800 14000 14000 13900 13800 13800 14000 14000 13900 14800 ...
## $ Transect.Label: chr "19960417" "19960417" "19960417" "19960417" ...
## $ depth : num 135 148 152 164 180 ...
## $ count : num 0 0 0 0 0 0 0 0 0 0 ...
## $ off.set : num 18.7 18.7 18.7 18.7 18.7 ...
## $ segment.area : num 2.17e+08 2.20e+08 2.20e+08 2.18e+08 2.17e+08 ...
A brief explanation of each entry:
Sample.Label: identifier for the effort "segment" (approximately square sampling area)
Transect.Label: identifier for the transect that this segment belongs to
longitude, latitude: location of this segment in latitude/longitude
x, y: location in projected coordinates (projected using the North American Lambert Conformal Conic projection)
Effort: the length of the current segment
depth: the bathymetry at the segment's position
count: number of dolphins observed in this segment
segment.area: the area of the segment (Effort multiplied by the width of the segment)
off.set: the logarithm of the segment.area multiplied by a correction for detectability (see the link to the appendix above for more information on this)
Our objective here is to build a spatially explicit model of abundance of the dolphins. In some sense this is a kind of species distribution model.
Our possible covariates to model abundance are location and depth. These are fairly good predictors of the abundance (SPOILER ALERT), though we could probably improve the model further by including things like sea surface temperature and chlorophyll a.
We can begin with the model we showed in the first lecture:
library(mgcv)
## Loading required package: nlme
## This is mgcv 1.8-13. For overview type 'help("mgcv-package")'.
dolphins_depth <- gam(count ~ s(depth) + offset(off.set),
data = mexdolphins,
family = quasipoisson(),
method = "REML")
That is, we fit the counts as a function of depth, using the offset to take effort into account. We use a quasi-Poisson response (i.e., modelling just the mean-variance relationship, with the variance proportional to the mean) and REML for smoothness selection.
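If you want a quick look at the fitted term and its significance before the diagnostics, a minimal sketch (output not shown):
summary(dolphins_depth)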
We can check the assumptions of this model by using gam.check:
gam.check(dolphins_depth)
##
## Method: REML Optimizer: outer newton
## full convergence after 6 iterations.
## Gradient range [-1.608994e-07,8.599046e-10]
## (score 948.2765 & scale 145.3355).
## Hessian positive definite, eigenvalue range [1.146275,192.5329].
## Model rank = 10 / 10
##
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
##
## k' edf k-index p-value
## s(depth) 9.00 6.59 1.02 0.98
As is usual for count data, these plots are a bit tricky to interpret. For example, the residuals vs. linear predictor plot in the top right has a nasty line through it that makes looking for patterns tricky. We can easily see that this line corresponds to the zero-count observations:
# code from the insides of mgcv::gam.check
resid <- residuals(dolphins_depth, type="deviance")
linpred <- napredict(dolphins_depth$na.action, dolphins_depth$linear.predictors)
plot(linpred, resid, main = "Resids vs. linear pred.",
xlab = "linear predictor", ylab = "residuals")
# now add red dots corresponding to the zero counts
points(linpred[mexdolphins$count==0],resid[mexdolphins$count==0],
pch=19, col="red", cex=0.5)
We can use randomised quantile residuals instead of deviance residuals to get around this in some cases (though not quasi-Poisson, as we don’t have a proper likelihood!).
Ignoring the plots for now (as we'll address them in the next section), let's look at the text output. It seems that the k value we set (or rather the default of 10) was adequate.
We could increase the value of k by replacing the s(...) with, for example, s(depth, k=25) (for a possibly very wiggly function) or s(depth, k=3) (for a much less wiggly function). Making k big will create a bigger design matrix and penalty matrix.
Exercise
Look at the differences in the size of the design and penalty matrices by using dim(model.matrix(...)) and dim(model$smooth[[1]]$S[[1]]), replacing ... and model appropriately for models with k=3 and k=30.
dolphins_depth_bigk <- gam(count ~ s(depth, k=30) + offset(off.set),
data = mexdolphins,
family = quasipoisson(),
method = "REML")
dim(model.matrix(dolphins_depth_bigk))
## [1] 387 30
dim(dolphins_depth_bigk$smooth[[1]]$S[[1]])
## [1] 29 29
dolphins_depth_smallk <- gam(count ~ s(depth, k=3) + offset(off.set),
data = mexdolphins,
family = quasipoisson(),
method = "REML")
dim(model.matrix(dolphins_depth_smallk))
## [1] 387 3
dim(dolphins_depth_smallk$smooth[[1]]$S[[1]])
## [1] 2 2
(Don’t worry about the many square brackets etc to get the penalty matrix!)
We can plot the smooth we fitted using plot.
Exercise
Compare the first model we fitted with the two using different k values above. Use par(mfrow=c(1,3)) to put them all in one graphic. Look at ?plot.gam and plot the confidence intervals as a filled "confidence band". Title the plots appropriately so you can check which is which.
par(mfrow=c(1,3))
plot(dolphins_depth, shade=TRUE, main="Default k")
plot(dolphins_depth_bigk, shade=TRUE, main="big k")
plot(dolphins_depth_smallk, shade=TRUE, main="small k")
In general quasi-Poisson doesn’t seem to do too great a job at modelling data with many zeros. Luckily we have a few tricks up our sleeves…
Adding a smooth of x and y to our model with s(x,y), we can then switch the family= argument to use tw() for a Tweedie distribution.
dolphins_xy_tw <- gam(count ~ s(x,y) + s(depth) + offset(off.set),
data = mexdolphins,
family = tw(),
method = "REML")
More information on Tweedie distributions can be found in Foster & Bravington (2012) and Shono (2008).
Exercise
Now do the same using the negative binomial distribution (nb()).
dolphins_xy_nb <- gam(count ~ s(x,y) + s(depth) + offset(off.set),
data = mexdolphins,
family = nb(),
method = "REML")
Looking at the quantile-quantile plots only in the gam.check output for these two models, which do you prefer? Why?
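One way to put just the Q-Q panels side by side is mgcv's qq.gam, which draws only the quantile-quantile panel from gam.check; a minimal sketch:
par(mfrow=c(1,2))
# Q-Q plot of deviance residuals for the Tweedie model
qq.gam(dolphins_xy_tw)
# ... and for the negative binomial model
qq.gam(dolphins_xy_nb)
par(mfrow=c(1,1))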
Looks like Tweedie is better here as the points are closer to the x=y line in the Q-Q plot. Also the histogram of residuals looks more (though not very) normal.
Look at the results of calling summary on both models and note that there are differences in the resulting models, due to the differing mean-variance relationships.
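For example (output not shown):
summary(dolphins_xy_tw)
summary(dolphins_xy_nb)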
Now let’s move onto using different bases for the smoothers. We have a couple of different options here.
By default we use the "tp" basis. This is just plain thin plate regression splines (as defined in Wood, 2003). We can also use the "ts" basis, which is the same but with extra shrinkage on the usually unpenalised parts of the model. In the univariate case this is the linear slope term of the smooth.
Exercise
Compare the results from one of the models above with a version using the thin plate spline with shrinkage, via the bs="ts" argument to s() for both terms.
dolphins_xy_tw_ts <- gam(count ~ s(x,y, bs="ts") + s(depth, bs="ts") +
offset(off.set),
data = mexdolphins,
family = tw(),
method = "REML")
What are the differences (use summary)?
EDFs are different for both terms
What are the visual differences (use plot)?
Not really much difference here!
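A minimal sketch of the comparison (each model contributes two panels, one for s(x,y) and one for s(depth)):
summary(dolphins_xy_tw)
summary(dolphins_xy_tw_ts)
# lay out the four smooths on one page
par(mfrow=c(2,2))
plot(dolphins_xy_tw, shade=TRUE)
plot(dolphins_xy_tw_ts, shade=TRUE)
par(mfrow=c(1,1))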
We can use a soap film smoother (Wood, 2008) to take into account a complex boundary, such as a coastline or islands.
Here I've built a simple coastline of the US states bordering the Gulf of Mexico (see the soap_pred.R file for how this was constructed). We can load up this boundary and the prediction grid from the following RData file:
load("data/mexdolphins/soapy.RData")
Now we need to build knots for the soap film. For this we simply create a grid, then find the grid points inside the boundary. We don't need too many of them.
soap_knots <- expand.grid(x=seq(min(xy_bnd$x), max(xy_bnd$x), length.out=10),
y=seq(min(xy_bnd$y), max(xy_bnd$y), length.out=10))
x <- soap_knots$x; y <- soap_knots$y
ind <- inSide(xy_bnd, x, y)
rm(x,y)
soap_knots <- soap_knots[ind, ]
## inSide doesn't work perfectly, so if you get an error like:
## Error in crunch.knots(ret$G, knots, x0, y0, dx, dy) :
## knot 54 is on or outside boundary
## just remove the offending knot(s) as follows:
soap_knots <- soap_knots[-8, ]
soap_knots <- soap_knots[-54, ]
We can now fit our model. Note that we specify the basis via bs= and the boundary via xt (for extra information) in the s() term. We also include the knots via the knots= argument to gam.
dolphins_soap <- gam(count ~ s(x,y, bs="so", xt=list(bnd=list(xy_bnd))) +
offset(off.set),
data = mexdolphins,
family = tw(),
knots = soap_knots,
method = "REML")
Exercise
Look at the summary output for this model and compare it to the other models.
summary(dolphins_soap)
##
## Family: Tweedie(p=1.352)
## Link function: log
##
## Formula:
## count ~ s(x, y, bs = "so", xt = list(bnd = list(xy_bnd))) + offset(off.set)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.0898 0.2173 -78.64 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(x,y) 7.444 63 0.401 0.000123 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.0519 Deviance explained = 22.8%
## -REML = 350.97 Scale est. = 61.544 n = 387
The plotting function for soap film smooths looks much nicer by default than for other 2D smooths – try it out.
plot(dolphins_soap)
As we saw in the intro to GAMs slides, predict is your friend when it comes to making predictions from a GAM.
We can do this very simply, calling predict as one would with a glm. For example:
pred_qp <- predict(dolphins_depth, preddata, type="response")
Now, this just gives a long vector of numbers for the predicted number of animals per cell. We can find the total abundance using sum.
Exercise
How many dolphins are there in the total area? What is the maximum in a given cell? What is the minimum?
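A minimal sketch of the answer, using the per-cell predictions computed above:
# total predicted abundance over the prediction grid
sum(pred_qp)
# largest and smallest per-cell predictions
max(pred_qp)
min(pred_qp)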
Note that this section requires quite a few additional packages to run the examples, so it may not run the first time. You can use install.packages to grab the packages you need.
Plotting predictions in projected coordinate systems is tricky. I'll show two methods here but not go into too much detail, as that's not the aim of this workshop.
For the non-soap models, we'll use the helper function below to put the predictions into a bunch of squares and then return an appropriate ggplot2 object for us to plot:
library(plyr)
# fill must be in the same order as the polygon data
grid_plot_obj <- function(fill, name, sp){
# strip any names from the supplied fill values
names(fill) <- NULL
row.names(fill) <- NULL
data <- data.frame(fill)
names(data) <- name
spdf <- SpatialPolygonsDataFrame(sp, data)
spdf@data$id <- rownames(spdf@data)
spdf.points <- fortify(spdf, region="id")
spdf.df <- join(spdf.points, spdf@data, by="id")
# seems to store the x/y even when projected as labelled as
# "long" and "lat"
spdf.df$x <- spdf.df$long
spdf.df$y <- spdf.df$lat
geom_polygon(aes_string(x="x",y="y",fill=name, group="group"), data=spdf.df)
}
Then we can plot the predicted abundance:
library(ggplot2)
library(viridis)
library(sp)
library(rgeos)
## rgeos version: 0.3-19, (SVN revision 524)
## GEOS runtime version: 3.4.2-CAPI-1.8.2 r3921
## Linking to sp version: 1.2-3
## Polygon checking: TRUE
library(rgdal)
## rgdal: version: 1.1-10, (SVN revision 622)
## Geospatial Data Abstraction Library extensions to R successfully loaded
## Loaded GDAL runtime: GDAL 1.11.4, released 2016/01/25
## Path to GDAL shared files: /Library/Frameworks/R.framework/Versions/3.3/Resources/library/rgdal/gdal
## Loaded PROJ.4 runtime: Rel. 4.9.1, 04 March 2015, [PJ_VERSION: 491]
## Path to PROJ.4 shared files: /Library/Frameworks/R.framework/Versions/3.3/Resources/library/rgdal/proj
## Linking to sp version: 1.2-3
library(maptools)
## Checking rgeos availability: TRUE
# projection string
lcc_proj4 <- CRS("+proj=lcc +lat_1=20 +lat_2=60 +lat_0=40 +lon_0=-96 +x_0=0 +y_0=0 +ellps=GRS80 +datum=NAD83 +units=m +no_defs ")
# need sp to transform the data
pred.polys <- spTransform(pred_latlong, CRSobj=lcc_proj4)
p <- ggplot() +
# abundance
grid_plot_obj(pred_qp, "N", pred.polys) +
# survey lines
geom_line(aes(x, y, group=Transect.Label), data=mexdolphins) +
# observations
geom_point(aes(x, y, size=count),
data=mexdolphins[mexdolphins$count>0,],
colour="red", alpha=I(0.7)) +
# make the coordinate system fixed
coord_fixed(ratio=1, ylim = range(mexdolphins$y),
xlim = range(mexdolphins$x)) +
# use the viridis colourscheme, which is more colourblind-friendly
scale_fill_viridis() +
# labels
labs(fill="Predicted\ndensity",x="x",y="y",size="Count") +
# keep things simple
theme_minimal()
print(p)
If you have the maps package installed you can also try the following to plot the coastline too:
library(maps)
##
## Attaching package: 'maps'
## The following object is masked from 'package:plyr':
##
## ozone
map_dat <- map_data("usa")
map_sp <- SpatialPoints(map_dat[,c("long","lat")])
# give the sp object a projection
proj4string(map_sp) <-CRS("+proj=longlat +datum=WGS84")
# re-project
map_sp.t <- spTransform(map_sp, CRSobj=lcc_proj4)
map_dat$x <- map_sp.t$long
map_dat$y <- map_sp.t$lat
p <- p + geom_polygon(aes(x=x, y=y, group = group), fill = "#1A9850", data=map_dat)
print(p)
The soap_preddata frame contains a much larger prediction grid, covering a much wider area.
Predicting using the soap film smoother is exactly the same as for the other smoothers:
soap_pred_N <- predict(dolphins_soap, soap_preddata, type="response")
soap_pred_N <- cbind.data.frame(soap_preddata, N=soap_pred_N, model="soap")
Use the above as a template (along with ggplot2::facet_wrap()) to build a comparison plot with the predictions of the soap film and another model.
dolphins_xy_q <- gam(count ~ s(x,y, bs="ts") + offset(off.set),
data = mexdolphins,
family = tw(),
method = "REML")
xy_pred_N <- predict(dolphins_xy_q, soap_preddata, type="response")
xy_pred_N <- cbind.data.frame(soap_preddata, N=xy_pred_N, model="xy")
all_pred_N <- rbind(soap_pred_N, xy_pred_N)
p <- ggplot() +
# abundance
geom_tile(aes(x=x, y=y, width=sqrt(area), height=sqrt(area), fill=N), data=all_pred_N) +
# facet it!
facet_wrap(~model) +
# make the coordinate system fixed
coord_fixed(ratio=1, ylim = range(mexdolphins$y),
xlim = range(mexdolphins$x)) +
# use the viridis colourscheme, which is more colourblind-friendly
scale_fill_viridis() +
# labels
labs(fill="Predicted\ndensity", x="x", y="y") +
# keep things simple
theme_minimal() +
geom_polygon(aes(x=x, y=y, group = group), fill = "#1A9850", data=map_dat)
print(p)
Exercise
We can use the se.fit=TRUE argument to get the per-cell standard errors, then divide these through by the abundances to get a coefficient of variation (CV) per cell. We can then plot that using the same technique as above. Try this out below.
Note that the resulting object from setting se.fit=TRUE is a list with two elements.
pred_se_qp <- predict(dolphins_depth, preddata, type="response", se.fit=TRUE)
cv <- pred_se_qp$se.fit/pred_se_qp$fit
p <- ggplot() +
# abundance
grid_plot_obj(cv, "CV", pred.polys) +
# survey lines
geom_line(aes(x, y, group=Transect.Label), data=mexdolphins) +
# make the coordinate system fixed
coord_fixed(ratio=1, ylim = range(mexdolphins$y),
xlim = range(mexdolphins$x)) +
# use the viridis colourscheme, which is more colourblind-friendly
scale_fill_viridis() +
# labels
labs(fill="CV",x="x",y="y",size="Count") +
# keep things simple
theme_minimal()
print(p)
Compare the uncertainties of the different models you’ve fitted so far.
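A minimal sketch of one such comparison, assuming preddata also contains the x and y columns needed by the bivariate smooth:
# per-cell CVs for the Tweedie s(x,y) + s(depth) model
pred_se_tw <- predict(dolphins_xy_tw, preddata, type="response", se.fit=TRUE)
cv_tw <- pred_se_tw$se.fit / pred_se_tw$fit
# compare the spread of per-cell CVs across the two models
summary(cv)
summary(cv_tw)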
lpmatrix magic
Now for some lpmatrix magic. We said before that the lpmatrix maps the parameters onto the predictions. Let's show that's true:
# make the Lp matrix
lp <- predict(dolphins_depth, preddata, type="lpmatrix")
# get the linear predictor
lin_pred <- lp %*% coef(dolphins_depth)
# apply the inverse link and multiply by the cell area (the offset), which the lpmatrix ignores
pred <- preddata$area * exp(lin_pred)
# all the same?
all.equal(pred[,1], as.numeric(pred_qp), check.attributes=FALSE)
## [1] TRUE
What else can we do? We can also grab the uncertainty for the sum of the predictions:
# extract the variance-covariance matrix
vc <- vcov(dolphins_depth)
# reproject the var-covar matrix to be on the linear predictor scale
lin_pred_var <- tcrossprod(lp %*% vc, lp)
# pre- and post-multiply by the derivative of the inverse link evaluated at the
# linear predictor -- since the inverse link is exp, this is just the predictions
pred_var <- matrix(pred,nrow=1) %*% lin_pred_var %*% matrix(pred,ncol=1)
# we can then calculate a total CV
sqrt(pred_var)/sum(pred)
## [,1]
## [1,] 0.1841984
As you can see, the lpmatrix can be very useful!
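For example, a rough sketch turning the quantities above into an approximate 95% interval for total abundance (assuming approximate normality on the response scale):
# total abundance and its standard error from the delta-method calculation above
N_hat <- sum(pred)
se_N <- as.numeric(sqrt(pred_var))
# approximate 95% confidence interval
c(N_hat - 1.96*se_N, N_hat + 1.96*se_N)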
Use vis.gam (watch out for the view= option) to plot the 2D smooths you've fitted. Check out the too.far= argument.
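A minimal sketch for one of the models above (too.far=0.1 blanks out predictions far from the observed data):
# filled contour plot of the spatial smooth from the Tweedie s(x,y) + s(depth) model
vis.gam(dolphins_xy_tw, view=c("x","y"), plot.type="contour", too.far=0.1)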