This vignette illustrates the use of mitml
for the treatment of missing data at Level 2. Specifically, the vignette addresses the following topics:
Further information can be found in the other vignettes and the package documentation.
For purposes of this vignette, we make use of the leadership
data set, which contains simulated data from 750 employees in 50 groups including ratings on job satisfaction, leadership style, and work load (Level 1) as well as cohesion (Level 2).
The package and the data set can be loaded as follows.
library(mitml)
data(leadership)
In the summary
of the data, it becomes visible that all variables are affected by missing data.
summary(leadership)
# GRPID JOBSAT COHES NEGLEAD WLOAD
# Min. : 1.0 Min. :-7.32934 Min. :-3.4072 Min. :-3.13213 low :416
# 1st Qu.:13.0 1st Qu.:-1.61932 1st Qu.:-0.4004 1st Qu.:-0.70299 high:248
# Median :25.5 Median :-0.02637 Median : 0.2117 Median : 0.08027 NA's: 86
# Mean :25.5 Mean :-0.03168 Mean : 0.1722 Mean : 0.04024
# 3rd Qu.:38.0 3rd Qu.: 1.64571 3rd Qu.: 1.1497 3rd Qu.: 0.79111
# Max. :50.0 Max. :10.19227 Max. : 2.5794 Max. : 3.16116
# NA's :69 NA's :30 NA's :92
The following data segment illustrates this fact, including cases with missing data at Level 1 (e.g., job satisfaction) and 2 (e.g., cohesion).
# GRPID JOBSAT COHES NEGLEAD WLOAD
# 73 5 -1.72143400 0.9023198 0.83025589 high
# 74 5 NA 0.9023198 0.15335056 high
# 75 5 -0.09541178 0.9023198 0.21886272 low
# 76 6 0.68626611 NA -0.38190591 high
# 77 6 NA NA NA low
# 78 6 -1.86298201 NA -0.05351001 high
In the following, we will employ a two-level model to address missing data at both levels simultaneously.
The specification of the two-level model, involves two components, one pertaining to the variables at each level of the sample (Goldstein, Carpenter, Kenward, & Levin, 2009; for further discussion, see also Enders, Mister, & Keller, 2016; Grund, Lüdtke, & Robitzsch, in press).
Specifically, the imputation model is specified as a list with two components, where the first component denotes the model for the variables at Level 1, and the second component denotes the model for the variables at Level 2.
For example, using the formula
interface, an imputation model targeting all variables in the data set can be written as follows.
fml <- list( JOBSAT + NEGLEAD + WLOAD ~ 1 + (1|GRPID) , # Level 1
COHES ~ 1 ) # Level 2
The first component of this list includes the three target variables at Level 1 and a fixed (1
) as well as a random intercept (1|GRPID
). The second component includes the target variable at Level 2 with a fixed intercept (1
).
From a statistical point of view, this specification corresponds to the following model \[ \begin{aligned} \mathbf{y}_{1ij} &= \boldsymbol\mu_{1} + \mathbf{u}_{1j} + \mathbf{e}_{ij} \\ \mathbf{y}_{2j} &= \boldsymbol\mu_{2} + \mathbf{u}_{1j} \; , \end{aligned} \] where \(\mathbf{y}_{1ij}\) denotes the target variables at Level 1, \(\mathbf{y}_{2j}\) the target variables at Level 2, and the right-hand side of the model contains the fixed effects, random effects, and residual terms as mentioned above.
Note that, even though the two components of the model appear to be separated, they define a single (joint) model for all target variables at both Level 1 and 2. Specifically, this model employs a two-level covariance structure, which allows for relations between variables at both Level 1 (i.e., correlated residuals at Level 1) and 2 (i.e., correlated random effects residuals at Level 2).
Because the data contain missing values at both levels, imputations will be generated with jomoImpute
(and not panImpute
). Except for the specification of the two-level model, the syntax is the same as in applications with missing data only at Level 1.
Here, we will run 5,000 burn-in iterations and generate 20 imputations, each 250 iterations apart.
imp <- jomoImpute(leadership, formula=fml, n.burn=5000, n.iter=250, m=20)
By looking at the summary
, we can then review the imputation procedure and verify that the imputation model converged.
summary(imp)
#
# Call:
#
# jomoImpute(data = leadership, formula = fml, n.burn = 5000, n.iter = 250,
# m = 20)
#
# Level 1:
#
# Cluster variable: GRPID
# Target variables: JOBSAT NEGLEAD WLOAD
# Fixed effect predictors: (Intercept)
# Random effect predictors: (Intercept)
#
# Level 2:
#
# Target variables: COHES
# Fixed effect predictors: (Intercept)
#
# Performed 5000 burn-in iterations, and generated 20 imputed data sets,
# each 250 iterations apart.
#
# Potential scale reduction (Rhat, imputation phase):
#
# Min 25% Mean Median 75% Max
# Beta: 1.001 1.001 1.001 1.001 1.001 1.001
# Beta2: 1.001 1.001 1.001 1.001 1.001 1.001
# Psi: 1.000 1.001 1.003 1.001 1.003 1.009
# Sigma: 1.000 1.003 1.004 1.004 1.006 1.009
#
# Largest potential scale reduction:
# Beta: [1,3], Beta2: [1,1], Psi: [4,3], Sigma: [3,1]
#
# Missing data per variable:
# GRPID JOBSAT NEGLEAD WLOAD COHES
# MD% 0 9.2 12.3 11.5 4.0
Due to the greater complexity of the two-level model, the output includes more information than in applications with missing data only at Level 1. For example, the output features the model specification for variables at both Level 1 and 2. Furthermore, it provides convergence statistics for the additional regression coefficients for the target variables at Level 2 (i.e., Beta2
).
Finally, it also becomes visible that the two-level model indeed allows for relations between target variables at Level 1 and 2. This can be seen from the fact that the potential scale reduction factor (\(\hat{R}\)) for the covariance matrix at Level 2 (Psi
) was largest for Psi[4,3]
, which is the covariance between cohesion and the random intercept of work load.
The completed data sets can then be extracted with mitmlComplete
.
implist <- mitmlComplete(imp, "all")
When inspecting the completed data, it is easy to verify that the imputations for variables at Level 2 are constant within groups as intended, thus preserving the two-level structure of the data.
# GRPID JOBSAT NEGLEAD WLOAD COHES
# 73 5 -1.72143400 0.83025589 high 0.9023198
# 74 5 -2.80749991 0.15335056 high 0.9023198
# 75 5 -0.09541178 0.21886272 low 0.9023198
# 76 6 0.68626611 -0.38190591 high -1.0275552
# 77 6 1.52825873 -1.11035850 low -1.0275552
# 78 6 -1.86298201 -0.05351001 high -1.0275552
Enders, C. K., Mistler, S. A., & Keller, B. T. (2016). Multilevel multiple imputation: A review and evaluation of joint modeling and chained equations imputation. Psychological Methods, 21, 222–240. doi: 10.1037/met0000063 (Link)
Goldstein, H., Carpenter, J. R., Kenward, M. G., & Levin, K. A. (2009). Multilevel models with multivariate mixed response types. Statistical Modelling, 9, 173–197. doi: 10.1177/1471082X0800900301 (Link)
Grund, S., Lüdtke, O., & Robitzsch, A. (in press). Multiple imputation of missing data for multilevel models: Simulations and recommendations. Organizational Research Methods. doi: 10.1177/1094428117703686 (Link)
# Author: Simon Grund (grund@ipn.uni-kiel.de)
# Date: 2017-10-19