GLMMselect: Bayesian model selection for generalized linear mixed models

library(GLMMselect)

Introduction

The GLMMselect package provides a novel Bayesian model selection approach for the analysis of Poisson and binary data. GLMMselect contains functions to simultaneously select fixed effects and random effects for non-Gaussian data. In the GLMMselect package, we use pseudo likelihood method to solve the problem that the random effects cannot be analytically integrated out of GLMMs. In addition, we develop a fractional Bayes factor (FBF) approach to obtain posterior probabilities of the competing models. Full details on the methods implemented in GLMMselect can be found in the manuscript (Xu et al. 202X).

This vignette contains a estimated example for count data and a case study presented in the manuscript (Xu et al. 202X) to illustrate how GLMMselect works. For the simulated example, the count data are simulated from Poisson GLMM with four candidate fixed effects and two types of random effects (spatial random effects and overdispersion random effects).

Function

The function GLMMselect() implemented in the GLMMselect package is described below:

  • Function GLMMselect() performs the ARM and HCM (Xu et al. 202X) which performs model selection for generalized linear mixed models, given a numeric vector of binary or count data Y, a matrix of covariates X, a list of covariance matrices for random effects Sigma, a list of design matrices for random effects Z, the description of the error distribution family, and the prior for variance component of random effects prior. Arguments pip_fixed and pip_random are the thresholds that if the posterior inclusion probability of fixed effects or random effects is larger than pip_fixed or pip_random, the corresponding fixed effects and random effects will be included in the best model. In addition, the argument NumofModel can adjust the number of models with the largest posterior probabilities which are printed out in the output. The function GLMMselect() returns a list of the indices of fixed effects and random effects that are identified in the best model, a table of models’ posterior probabilities (MPP) which are sorted in decreasing order, a table for fixed effects’ inference, and a table for random effects’ inference which includes point estimates and standard deviations for coefficients, and posterior inclusion probabilities (PIP) for each effect.

Model

The generalized linear mixed model used in the GLMMselect package is $$ g(\pmb{Y})=X\pmb{\beta}+\sum_{q=1}^Q Z_q \pmb{\alpha}_q,$$ where

  • g() is the link function.
  • Y is an n × 1 vector of observations, either binary data or count data.
  • X is the matrix of covariates.
  • β is the p × 1 vector of regression coefficients.
  • Zq is an incidence matrix relating the vector of random effects αq to the observations.
  • αq is a vector of random effects with covariance matrix τqΣq. Σq is a known semi-positive definite matrix, and τq is an unknown scalar. The common types of αq can be spatial random effects, longitudinal random effects and overdispersion random effects.

Currently, the GLMMselect package can analyze binary responses (family = "bernoulli") and Poisson responses (family = "poisson"). The manuscript that develops the methods in GLMMselect (Xu et al. 202X) provides details on the priors for β and τq, as well as details about the FBF approach used by GLMMselect.

Examples

The GLMMselect() function requires a vector of observations (either Bernoulli or assumed Poisson distributed), a matrix of covariates, a list of design matrices for random effects, and a list of covariance matrices for random effects.

The vector of observations must be a numeric n × 1 vector. In the GLMMselect package, there is an example simulated from the Poisson generalized linear mixed model with four candidate covariates, a vector of spatial random effects, and a vector of overdispersion random effects. Here are the first five elements of the Poisson simulated vector of observations:

data("Y")
Y[1:5]
#> [1]  33  47 187  26  35

The covariate matrix passed to the function contains all candidate covariates. Each column corresponds to a candidate covariate. Each row corresponds to an observation. In this example, the covariate matrix contains four candidate covariates. The covariates in the first two columns are in the true model. Here are the first five rows of the covariate matrix:

data("X")
X[1:5,]
#>            [,1]        [,2]       [,3]        [,4]
#> [1,]  0.3586051 -1.39887035 -1.2441304 -0.97847040
#> [2,] -0.1766957  0.09031447  0.5870933 -1.19242832
#> [3,]  0.7548008  0.54639426 -0.2518881  0.02433473
#> [4,]  0.0854506 -0.97412286  0.1203769  0.04743402
#> [5,]  0.5478787 -1.19069820 -1.3385666 -0.28752443

The design matrices for candidate random effects are passed to the GLMMselect function as a list. In this example, this is a list of two design matrices for two types of random effects. The first component is for spatial random effects. The second component is for overdispersion random effects. Here are the first five rows and the first five columns for each of these design matrices:

data("Z")
Z[[1]][1:5,1:5]
#>      [,1] [,2] [,3] [,4] [,5]
#> [1,]    1    0    0    0    0
#> [2,]    0    1    0    0    0
#> [3,]    0    0    1    0    0
#> [4,]    0    0    0    1    0
#> [5,]    0    0    0    0    1
Z[[2]][1:5,1:5]
#>      [,1] [,2] [,3] [,4] [,5]
#> [1,]    1    0    0    0    0
#> [2,]    0    1    0    0    0
#> [3,]    0    0    1    0    0
#> [4,]    0    0    0    1    0
#> [5,]    0    0    0    0    1

The covariance matrices for the candidate random effects are also passed to the GLMMselect function as a list. In this example, this is a list of two covariance matrices for two types of random effects. The first component is for spatial random effects. The second componet is for overdispersion random effects. Here are the first five rows and the first five columns for each of these covariance matrices:

data("Sigma")
Sigma[[1]][1:5,1:5]
#>           [,1]      [,2]      [,3]      [,4]      [,5]
#> [1,] 1.2861009 0.7911009 0.4988302 0.3047474 0.1670184
#> [2,] 0.7911009 0.9938302 0.5970181 0.3611012 0.2041574
#> [3,] 0.4988302 0.5970181 0.8561012 0.4964282 0.2877719
#> [4,] 0.3047474 0.3611012 0.4964282 0.7827719 0.4448714
#> [5,] 0.1670184 0.2041574 0.2877719 0.4448714 0.7498331
Sigma[[2]][1:5,1:5]
#>      [,1] [,2] [,3] [,4] [,5]
#> [1,]    1    0    0    0    0
#> [2,]    0    1    0    0    0
#> [3,]    0    0    1    0    0
#> [4,]    0    0    0    1    0
#> [5,]    0    0    0    0    1

Data Y,X, Sigma, and Z above are attached in the GLMMselect package.

Example: Analysis of the simulated dataset

In this example, we use GLMMselect to analyze Poisson count data with an approximate reference (AR) prior for the variance components of random effects.

Model_selection_output <- GLMMselect(Y=Y, X=X, Sigma=Sigma, Z=Z, 
                                     family="poisson", prior="AR", offset=NULL)
Model_selection_output$BestModel
#> $covariate_inclusion
#> [1] 1 2
#> 
#> $random_effect_inclusion
#> [1] 1
Model_selection_output$PosteriorProb
#>    x1 x2 x3 x4 r1 r2   MPP
#> 1   1  1  0  0  1  0 0.624
#> 2   1  1  1  0  1  0 0.137
#> 3   1  1  0  1  1  0 0.114
#> 4   1  1  1  1  1  0 0.055
#> 5   1  1  0  0  0  1 0.032
#> 6   1  1  0  0  1  1 0.022
#> 7   1  1  1  0  0  1 0.006
#> 8   1  1  0  1  0  1 0.004
#> 9   1  1  1  0  1  1 0.002
#> 10  1  1  0  1  1  1 0.002
Model_selection_output$FixedEffect
#>       Est    SD   PIP
#> x1  1.028 0.028 1.000
#> x2  0.997 0.027 1.000
#> x3 -0.012 0.020 0.202
#> x4  0.000 0.023 0.177
Model_selection_output$RandomEffect
#>      Est    SD   PIP
#> r1 0.042 0.016 0.956
#> r2 0.012 0.006 0.070

GLMMselect() outputs the indices of the covariates and the indices of the random effects which are in the best model. The true model contains the first two covariates and spatial random effects. GLMMselect correctly identifies these two covariates and the spatial random effects.

Example: Analysis of Scotland lip cancer data

Here is a case study to help illustrate how to use GLMMselect. This dataset provides the number of male lip cancer cases in the 56 counties of Scotland during the period 1975-1980, as well as the percentage of the work force employed in agriculture, fishing, or forestry (AFF) as a covariate. The model we consider is $$ y_i|\mu_i \stackrel{ind}{\sim} \mbox{Poisson}(\mu_i), \ \ \ i=1\dots 56, \\ \log(\mu_i) = \log(n_i)+\pmb{x}_i^T\pmb{\beta}+\alpha_{1i}+\alpha_{2i}, \\ \pmb{\alpha}_1 \sim N(\pmb{0},\tau_1 \Sigma_1), \\ \pmb{\alpha}_2 \sim N(\pmb{0}, \tau_2 \Sigma_2). $$

ni is the expected number of lip cancer cases in the ith county, which are assumed to be known constants. α1 is a vector of spatial random effects. α2 is a vector of overdispersion random effects.

In this example, we use GLMMselect to analyze lip cancer data with a half Cauchy (HC) prior for variance components of random effects. Data lipcancer_Y,lipcancer_X, lipcancer_Sigma, and lipcancer_Z are attached in the GLMMselect package.

lip_cancer_output <- GLMMselect(Y=lipcancer_Y, X=lipcancer_X, Sigma=lipcancer_Sigma, Z=lipcancer_Z,
                                family="poisson", prior="HC", offset=lipcancer_offset)
lip_cancer_output$BestModel
#> $covariate_inclusion
#> [1] 1
#> 
#> $random_effect_inclusion
#> [1] 1
lip_cancer_output$PosteriorProb
#>   x1 r1 r2   MPP
#> 1  1  1  0 0.905
#> 2  0  1  0 0.084
#> 3  1  1  1 0.009
#> 4  0  1  1 0.002
#> 5  1  0  1 0.000
#> 6  0  0  1 0.000
#> 7  1  0  0 0.000
#> 8  0  0  0 0.000
lip_cancer_output$FixedEffect
#>      Est    SD   PIP
#> x1 0.428 0.128 0.914
lip_cancer_output$RandomEffect
#>      Est    SD   PIP
#> r1 0.486 0.157 1.000
#> r2 0.000 0.052 0.011

See also

ref.ICAR provides an objective Bayesian approach for modeling spatially correlated areal data using an intrinsic conditional autoregressive prior on a vector of spatial random effects.

Reference

Xu, Shuangshuang, Ferreira, M. A. R., Porter, E. M., and Franck, C. T. (202X). Bayesian Model Selection for Generalized Linear Mixed Models, Biometrics, under review.