--- title: "Controlling for Variables" author: "Donny Williams" date: "5/25/2020" bibliography: ../inst/REFERENCES.bib output: rmarkdown::html_vignette: toc: yes vignette: > %\VignetteIndexEntry{Controlling for Variables} %\VignetteEngine{knitr::rmarkdown} \usepackage[utf8]{inputenc} --- # Introduction This vignette describes how to control for variables. This is a new feature to **BGGM** (version `2.0.0`). # Example 1: Multivariate Regression When controlling for variables, a multivariate regression is fitted in **BGGM**. In fact, a GGM can be understood as a multivariate regression with intercepts only models for the predictors. ## Notes about Implementation **BGGM** does not use the typical approach for multivariate regression in `R`. This avoids having to write out each outcome variable, of which there are typically many in a GGM. In **BGGM**, it is assumed that the data matrix includes only the variables to be included in the GGM and the control variables. ### Correct Suppose that we want to control for education level, with five variables included in the GGM. ```r # data Y <- bfi[,c(1:5, 27)] # head head(Y) #> A1 A2 A3 A4 A5 education #> 61617 2 4 3 4 4 NA #> 61618 2 4 5 2 5 NA #> 61620 5 4 5 4 4 NA #> 61621 4 4 6 5 5 NA #> 61622 2 3 3 4 5 NA #> 61623 6 6 5 6 5 3 ``` Notice that `Y` includes **only** the five variables and `education`. ## Fit Model This model can then be fitted with ``` fit <- explore(Y, formula = ~ as.factor(education)) ``` To show this is indeed a multivariate regression, here are the summarized regression coefficients for the first outcome. ``` summ_coef <- regression_summary(fit) # outcome one summ_coef$reg_summary[[1]] #> Post.mean Post.sd Cred.lb Cred.ub #> (Intercept) 0.256 0.095 0.072 0.442 #> as.factor(education)2 0.073 0.128 -0.177 0.323 #> as.factor(education)3 -0.202 0.104 -0.405 -0.001 #> as.factor(education)4 -0.462 0.119 -0.691 -0.233 #> as.factor(education)5 -0.578 0.117 -0.815 -0.346 ``` And here are the coefficients from `lm` (a univariate regression for `A1`) ``` round( cbind( # summary: coef and se summary( lm(scale(A1, scale = F) ~ as.factor(education), data = Y))$coefficients[,1:2], # confidence interval confint( lm(scale(A1, scale = F) ~ as.factor(education), data = Y)) ), 3) #> Estimate Std. Error 2.5 % 97.5 % #> (Intercept) 0.256 0.093 0.073 0.438 #> as.factor(education)2 0.072 0.125 -0.172 0.316 #> as.factor(education)3 -0.203 0.101 -0.401 -0.004 #> as.factor(education)4 -0.461 0.116 -0.690 -0.233 #> as.factor(education)5 -0.578 0.115 -0.804 -0.351 ``` The estimate are very (very) similar. ## Summary Note that all the other functions work just the same. 
## Summary
Note that all the other functions work just the same. For example, the relations controlling for education are summarized with

```r
summary(fit)

#> BGGM: Bayesian Gaussian Graphical Models
#> ---
#> Type: continuous
#> Analytic: FALSE
#> Formula: ~ as.factor(education)
#> Posterior Samples: 5000
#> Observations (n):
#> Nodes (p): 5
#> Relations: 10
#> ---
#> Call:
#> estimate(Y = Y, formula = ~as.factor(education))
#> ---
#> Estimates:
#>  Relation Post.mean Post.sd Cred.lb Cred.ub
#>    A1--A2    -0.239   0.020  -0.278  -0.200
#>    A1--A3    -0.109   0.020  -0.150  -0.070
#>    A2--A3     0.276   0.019   0.239   0.312
#>    A1--A4    -0.013   0.021  -0.055   0.026
#>    A2--A4     0.156   0.020   0.117   0.196
#>    A3--A4     0.173   0.020   0.134   0.214
#>    A1--A5    -0.010   0.020  -0.050   0.029
#>    A2--A5     0.150   0.020   0.111   0.189
#>    A3--A5     0.358   0.018   0.322   0.392
#>    A4--A5     0.121   0.020   0.082   0.159
#> ---
```

### Incorrect
Now suppose we wanted to control for education, but also had `gender` in `Y`. This would be incorrect:

```r
Y <- bfi[, c(1:5, 26:27)]

head(Y)

#>       A1 A2 A3 A4 A5 gender education
#> 61617  2  4  3  4  4      1        NA
#> 61618  2  4  5  2  5      2        NA
#> 61620  5  4  5  4  4      2        NA
#> 61621  4  4  6  5  5      2        NA
#> 61622  2  3  3  4  5      1        NA
#> 61623  6  6  5  6  5      2         3
```

In this case, with `estimate(Y, formula = ~ as.factor(education))`, the GGM would also include `gender` (six variables instead of the desired five). This is because all variables not included in `formula` are included in the GGM. This convention was adopted in **BGGM** to save the user from having to write out each outcome. It differs from `lm`, where each outcome needs to be written out, for example, `cbind(A1, A2, A3, A4, A5) ~ as.factor(education)`. This is quite cumbersome for a model that includes many nodes.

# Example 2: Multivariate Probit
The above data are ordinal, so it is also possible to fit a multivariate probit model (this is the approach used for binary data in **BGGM** as well). With `Y` again including only the five nodes and `education`, this is implemented with

```r
fit <- estimate(Y, formula = ~ as.factor(education),
                type = "ordinal",
                iter = 1000)
```

Note that multivariate probit models can also be summarized with `regression_summary`.

# Example 3: Gaussian Copula Graphical Model
This final example fits a Gaussian copula graphical model, which can be used for mixed data. In this case, `formula` is not used; instead, all of the variables are included in the GGM.

## Fit Model
This model is estimated with

```r
# data
Y <- na.omit(bfi[, c(1:5, 27)])

# fit type = "mixed"
fit <- estimate(Y, type = "mixed", iter = 1000)

# summary
summary(fit)

#> BGGM: Bayesian Gaussian Graphical Models
#> ---
#> Type: mixed
#> Analytic: FALSE
#> Formula:
#> Posterior Samples: 1000
#> Observations (n):
#> Nodes (p): 6
#> Relations: 15
#> ---
#> Call:
#> estimate(Y = Y, type = "mixed", iter = 1000)
#> ---
#> Estimates:
#>       Relation Post.mean Post.sd Cred.lb Cred.ub
#>         A1--A2    -0.217   0.048  -0.294  -0.114
#>         A1--A3    -0.063   0.027  -0.113  -0.011
#>         A2--A3     0.364   0.023   0.317   0.410
#>         A1--A4     0.116   0.038   0.048   0.192
#>         A2--A4     0.241   0.031   0.182   0.303
#>         A3--A4     0.228   0.026   0.174   0.275
#>         A1--A5     0.057   0.031   0.003   0.120
#>         A2--A5     0.186   0.027   0.135   0.241
#>         A3--A5     0.438   0.019   0.399   0.474
#>         A4--A5     0.151   0.025   0.103   0.199
#>  A1--education    -0.016   0.069  -0.125   0.119
#>  A2--education     0.063   0.049  -0.016   0.162
#>  A3--education     0.049   0.025   0.002   0.099
#>  A4--education     0.053   0.026   0.005   0.105
#>  A5--education     0.072   0.024   0.024   0.120
#> ---
```

Here it is clear that `education` is included in the model, as its relations with the other nodes appear in the output.
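Because `formula` is not used for `type = "mixed"`, the only way to keep `education` out of the graph is to remove it from the data before fitting. A minimal sketch, assuming the five personality items are the only nodes of interest:

```r
# a sketch: drop education before fitting if it should not be a node,
# since formula is not used with type = "mixed"
Y_items <- Y[, c("A1", "A2", "A3", "A4", "A5")]

fit_items <- estimate(Y_items, type = "mixed", iter = 1000)
```

Note that this does not control for education; it simply excludes it from the model.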
## Select Graph
The graph is selected with

```r
select(fit)
```

# Note
It is possible to control for variables with all methods in **BGGM**, including when comparing groups, Bayesian hypothesis testing, etc.
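For example, here is a minimal sketch of controlling for education while comparing two groups. This assumes that `ggm_compare_estimate()` accepts the same `formula` argument as `estimate()` (see `?ggm_compare_estimate`), and uses the `gender` coding from the `bfi` documentation (1 = male, 2 = female).

```r
library(BGGM)

# five nodes, plus gender (grouping variable) and education (control)
Y <- na.omit(bfi[, c(1:5, 26:27)])

# split the data by gender, dropping the grouping column
Y_male   <- subset(Y, gender == 1, select = -gender)
Y_female <- subset(Y, gender == 2, select = -gender)

# compare the two GGMs while controlling for education
# (formula is assumed to work as in estimate())
fit_compare <- ggm_compare_estimate(Y_male, Y_female,
                                    formula = ~ as.factor(education))

# select the graph of group differences
select(fit_compare)
```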