Controlling for Variables

Introduction

This vignette describes how to control for variables. This is a new feature to BGGM (version 2.0.0).

Example 1: Multivariate Regression

When controlling for variables, a multivariate regression is fitted in BGGM. In fact, a GGM can be understood as a multivariate regression with intercepts only models for the predictors.

Notes about Implementation

BGGM does not use the typical approach for multivariate regression in R. This avoids having to write out each outcome variable, of which there are typically many in a GGM. In BGGM, it is assumed that the data matrix includes only the variables to be included in the GGM and the control variables.

Correct

Suppose that we want to control for education level, with five variables included in the GGM.

# data
Y <- bfi[,c(1:5, 27)]

# head
head(Y)

#>       A1 A2 A3 A4 A5 education
#> 61617  2  4  3  4  4        NA
#> 61618  2  4  5  2  5        NA
#> 61620  5  4  5  4  4        NA
#> 61621  4  4  6  5  5        NA
#> 61622  2  3  3  4  5        NA
#> 61623  6  6  5  6  5         3

Notice that Y includes only the five variables and education.

Fit Model

This model can then be fitted with

fit <- explore(Y, formula = ~ as.factor(education))

To show this is indeed a multivariate regression, here are the summarized regression coefficients for the first outcome.

summ_coef <- regression_summary(fit)

# outcome one
summ_coef$reg_summary[[1]]

#>                       Post.mean Post.sd Cred.lb Cred.ub
#> (Intercept)               0.256   0.095   0.072   0.442
#> as.factor(education)2     0.073   0.128  -0.177   0.323
#> as.factor(education)3    -0.202   0.104  -0.405  -0.001
#> as.factor(education)4    -0.462   0.119  -0.691  -0.233
#> as.factor(education)5    -0.578   0.117  -0.815  -0.346

And here are the coefficients from lm (a univariate regression for A1)

round(
  cbind(
    # summary: coef and se
    summary( lm(scale(A1, scale = F) ~ as.factor(education), data = Y))$coefficients[,1:2],
    # confidence interval
    confint( lm(scale(A1, scale = F) ~ as.factor(education), data = Y))
), 3)


#>                       Estimate Std. Error  2.5 % 97.5 %
#> (Intercept)              0.256      0.093  0.073  0.438
#> as.factor(education)2    0.072      0.125 -0.172  0.316
#> as.factor(education)3   -0.203      0.101 -0.401 -0.004
#> as.factor(education)4   -0.461      0.116 -0.690 -0.233
#> as.factor(education)5   -0.578      0.115 -0.804 -0.351

The estimate are very (very) similar.

Summary

Note that all the other functions work just the same. For example, the relations controlling for education are summarized with

summary(fit)

#> BGGM: Bayesian Gaussian Graphical Models 
#> --- 
#> Type: continuous 
#> Analytic: FALSE 
#> Formula: ~ as.factor(education) 
#> Posterior Samples: 5000 
#> Observations (n):
#> Nodes (p): 5 
#> Relations: 10 
#> --- 
#> Call: 
#> estimate(Y = Y, formula = ~as.factor(education))
#> --- 
#> Estimates:
#>  Relation Post.mean Post.sd Cred.lb Cred.ub
#>    A1--A2    -0.239   0.020  -0.278  -0.200
#>    A1--A3    -0.109   0.020  -0.150  -0.070
#>    A2--A3     0.276   0.019   0.239   0.312
#>    A1--A4    -0.013   0.021  -0.055   0.026
#>    A2--A4     0.156   0.020   0.117   0.196
#>    A3--A4     0.173   0.020   0.134   0.214
#>    A1--A5    -0.010   0.020  -0.050   0.029
#>    A2--A5     0.150   0.020   0.111   0.189
#>    A3--A5     0.358   0.018   0.322   0.392
#>    A4--A5     0.121   0.020   0.082   0.159
#> --- 

Incorrect

Now if we wanted to control for education, but also had gender in Y, this would be incorrect

Y <- bfi[,c(1:5, 26:27)]

head(Y)

#>       A1 A2 A3 A4 A5 gender education
#> 61617  2  4  3  4  4      1        NA
#> 61618  2  4  5  2  5      2        NA
#> 61620  5  4  5  4  4      2        NA
#> 61621  4  4  6  5  5      2        NA
#> 61622  2  3  3  4  5      1        NA
#> 61623  6  6  5  6  5      2         3

In this case, with estimate(Y, formula = as.factor(education)), the GGM would also include gender (six variables instead of the desired 5). This is because all variables not included in formula are included in the GGM. This was adopted in BGGM to save the user from having to write out each outcome.

This differs from lm, where each outcome needs to be written out, for example cbind(A1, A2, A3, A4, A4) ~ as.factor(education). This is quite cumbersome for a model that includes many nodes.

Example 2: Multivariate Probit

The above data is ordinal. In this case, it is possible to fit a multivariate probit model. This is also the approach for binary data in BGGM. This is implemented with

fit <- estimate(Y, formula = ~ as.factor(education), 
                type = "ordinal", iter = 1000)

Note that the multivariate probit models can also be summarized with regression_summary.

Example 3: Gaussian Copula Graphical Model

This final example fits a Gaussian copula graphical model that can be used for mixed data. In this case, formula is not used and instead all of the variables are included in the GGM.

Fit Model

This model is estimated with

# data
Y <- na.omit(bfi[,c(1:5, 27)])

# fit type = "mixed"
fit <- estimate(Y, type = "mixed", iter = 1000)

# summary
summary(fit)

#> BGGM: Bayesian Gaussian Graphical Models 
#> --- 
#> Type: mixed 
#> Analytic: FALSE 
#> Formula:  
#> Posterior Samples: 1000 
#> Observations (n):
#> Nodes (p): 6 
#> Relations: 15 
#> --- 
#> Call: 
#> estimate(Y = Y, type = "mixed", iter = 1000)
#> --- 
#> Estimates:
#>       Relation Post.mean Post.sd Cred.lb Cred.ub
#>         A1--A2    -0.217   0.048  -0.294  -0.114
#>         A1--A3    -0.063   0.027  -0.113  -0.011
#>         A2--A3     0.364   0.023   0.317   0.410
#>         A1--A4     0.116   0.038   0.048   0.192
#>         A2--A4     0.241   0.031   0.182   0.303
#>         A3--A4     0.228   0.026   0.174   0.275
#>         A1--A5     0.057   0.031   0.003   0.120
#>         A2--A5     0.186   0.027   0.135   0.241
#>         A3--A5     0.438   0.019   0.399   0.474
#>         A4--A5     0.151   0.025   0.103   0.199
#>  A1--education    -0.016   0.069  -0.125   0.119
#>  A2--education     0.063   0.049  -0.016   0.162
#>  A3--education     0.049   0.025   0.002   0.099
#>  A4--education     0.053   0.026   0.005   0.105
#>  A5--education     0.072   0.024   0.024   0.120
#> --- 

Here it is clear that education is included in the model, as the relations with the other nodes are included in the output.

Select Graph

The graph is selected with

select(fit)

Note

It is possible to control for variable with all methods in BGGM, including when comparing groups, Bayesian hypothesis testing, etc.