1 A bit of math housekeeping

A matrix with \(n\) rows and \(k\) columns is denoted by a boldface capital letter, with its dimensions written underneath:

\[\underset{n \times k}{\mathbf{X}}\]

In R, selecting the element in row i and column j of a matrix X is: X[i,j].

A (row) vector is denoted by a boldface lowercase letter:

\[ \begin{equation} \mathbf{x} = \begin{bmatrix} x_1 & x_2 & \cdots & x_m \end{bmatrix} \end{equation} \]

The same vector written in column form:

\[ \begin{equation} \mathbf{x} =\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_m \end{bmatrix} \end{equation} \]

The transpose (denoted by a prime) converts a column vector into a row vector, and vice versa:

\[ \begin{equation} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_m \end{bmatrix}' = \begin{bmatrix} x_1 & x_2 & \cdots & x_m \end{bmatrix} \end{equation} \]

I will probably screw these up along the way, but I wanted to mention them up front so that we're on the same page!

2 Overview of covariance, correlation, and regression

We’ll use housing price data from Boston’s 1970 census to review important concepts in correlation and regression. This is a nice dataset for regression because there are many interdependent variables: crime, pollutants, age of properties, etc.

#packages used in this section
library(mlbench)  #BostonHousing2 dataset
library(dplyr)    #data wrangling
library(tibble)   #tribble()
library(ggplot2)  #plots
library(lattice)  #splom() scatterplot matrix
library(modelr)   #add_predictions(), add_residuals(), rsquare()
library(lm.beta)  #standardized regression coefficients
library(relaimpo) #relative importance metrics
library(ppcor)    #partial and semipartial correlations

#example dataset from mlbench package with home prices in Boston by census tract
data(BostonHousing2)
BostonSmall <- BostonHousing2 %>% dplyr::select(
  cmedv, #median value of home in 1000s
  crim, #per capita crime by town
  nox, #nitric oxide concentration
  lstat #proportion of lower status
  )

n <- nrow(BostonSmall) #number of observations
k <- ncol(BostonSmall) #number of variables

#scatterplot matrix of variables
splom(BostonSmall)

Our dataset, BostonSmall, is a data matrix with 506 observations on the rows and 4 variables on the columns. As noted above, we could notate this in matrix form as:

\[\underset{n \times k}{\mathbf{X}}\]

In a data matrix with \(k\) variables, the covariance matrix, \(\mathbf{S_{XX}}\), will be \(k \times k\) in size. That is, we are computing the relationships among all variables. The sample size (number of rows) influences the precision of the variance-covariance estimates.

We can use the cov() function in R to get a sample variance-covariance matrix:

from_r <- cov(BostonSmall)
from_r
##         cmedv   crim     nox   lstat
## cmedv  84.312 -30.77 -0.4568 -48.577
## crim  -30.770  73.99  0.4196  27.986
## nox    -0.457   0.42  0.0134   0.489
## lstat -48.577  27.99  0.4889  50.995

And we can look under the hood to verify that we understand the concept and formula:

\[ s_{XY} = \frac{\sum{(X-\bar{X})(Y-\bar{Y})}}{N-1} \]

Here is the covariance matrix computed manually using the formula above:

mycov <- function(x, y) {
  stopifnot(length(x)==length(y))
  sum((x - mean(x))*(y - mean(y)))/(length(x) - 1)
}

cmat <- matrix(NA, nrow=k, ncol=k)
indices <- which(is.na(cmat), arr.ind=TRUE)
cmat[indices] <- apply(indices, 1, function(v) { 
  mycov(BostonSmall[, v[1] ], BostonSmall[, v[2] ])
})

cmat
##         [,1]   [,2]    [,3]    [,4]
## [1,]  84.312 -30.77 -0.4568 -48.577
## [2,] -30.770  73.99  0.4196  27.986
## [3,]  -0.457   0.42  0.0134   0.489
## [4,] -48.577  27.99  0.4889  50.995

Or we can do the matrix math by hand:

\[ \mathbf{S_{XX}} = \frac{1}{N-1} \mathbf{X}^{*\prime} \mathbf{X}^{*} \]

where \(\mathbf{X^{*}}\) is a matrix that is already in deviation form (i.e., each variable is mean-centered).

#for the mathy among you...
bmat <- as.matrix(BostonSmall) #convert to numeric matrix
n <- nrow(bmat) #number of observations

#1) compute the means for each variable and populate a matrix of the same size
M <- matrix(data=1, nrow=n, ncol=1) %*% colMeans(bmat)

#2) compute the deviation of each score from its variable mean
D <- bmat - M

#3) the covariance matrix is D'D/(n-1)
C <- t(D) %*% D / (n-1)

C
##         cmedv   crim     nox   lstat
## cmedv  84.312 -30.77 -0.4568 -48.577
## crim  -30.770  73.99  0.4196  27.986
## nox    -0.457   0.42  0.0134   0.489
## lstat -48.577  27.99  0.4889  50.995

3 Covariance <-> correlation

The correlation of \(x\) and \(y\) is a covariance that has been standardized by the standard deviations of \(x\) and \(y\). This yields a scale-insensitive measure of the linear association of \(x\) and \(y\).

\[r_{XY}= \frac{s_{XY}}{s_{X} s_{Y}}\]

rmat <- array(NA, dim=dim(cmat)) #initialize an empty correlation matrix
variances <- diag(cmat) #variances of each variable are on the diagonal of the covariance matrix
rmat[indices] <- apply(indices, 1, function(v) { cmat[v[1],v[2]] / sqrt(variances[ v[1] ] * variances[ v[2] ]) })
rmat
##        [,1]   [,2]   [,3]   [,4]
## [1,]  1.000 -0.390 -0.429 -0.741
## [2,] -0.390  1.000  0.421  0.456
## [3,] -0.429  0.421  1.000  0.591
## [4,] -0.741  0.456  0.591  1.000

And it’s easy to generate a correlation matrix from R directly using the cor() function:

cor(BostonSmall, method="pearson")
##        cmedv   crim    nox  lstat
## cmedv  1.000 -0.390 -0.429 -0.741
## crim  -0.390  1.000  0.421  0.456
## nox   -0.429  0.421  1.000  0.591
## lstat -0.741  0.456  0.591  1.000

There is also a handy cov2cor() function for converting a covariance matrix to a correlation matrix:

cov2cor(cmat)
##        [,1]   [,2]   [,3]   [,4]
## [1,]  1.000 -0.390 -0.429 -0.741
## [2,] -0.390  1.000  0.421  0.456
## [3,] -0.429  0.421  1.000  0.591
## [4,] -0.741  0.456  0.591  1.000

Rearranging the equation above, we can convert correlations to covariances by: \[s_{XY}= r_{XY} s_X s_Y\]

And lavaan has a cor2cov() function that takes a correlation matrix and a vector of standard deviations and returns the corresponding covariance matrix.

cormat <- cor(BostonSmall)
sds <- apply(BostonSmall, 2, sd)
lavaan::cor2cov(cormat, sds)
##         cmedv   crim     nox   lstat
## cmedv  84.312 -30.77 -0.4568 -48.577
## crim  -30.770  73.99  0.4196  27.986
## nox    -0.457   0.42  0.0134   0.489
## lstat -48.577  27.99  0.4889  50.995
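
Equivalently, since \(s_{XY} = r_{XY} s_X s_Y\), the same conversion can be written as an elementwise rescaling of the correlation matrix by the outer product of the standard deviations (a quick check using the cormat and sds objects defined above):

#manual equivalent: rescale correlations by the outer product of the SDs
cormat * outer(sds, sds)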

3.1 Factors affecting correlations

  • Linearity: less linear, lower \(r\)
  • Variability: smaller \(s\), smaller \(r\) (see the brief simulation after this list)
  • Distribution of scores: e.g., opposing skews, smaller \(r\)
  • Reliability: lower measurement precision in either variable, lower \(r\)
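
To illustrate the variability point, here is a minimal simulation (simxy is just an illustrative name): restricting the range of \(x\) shrinks its variance and, on average, attenuates the observed correlation with \(y\).

#simulate a bivariate normal with a correlation of exactly 0.6 (empirical=TRUE)
set.seed(1001)
simxy <- data.frame(MASS::mvrnorm(n=5000, mu=c(0, 0),
                                  Sigma=matrix(c(1, 0.6, 0.6, 1), nrow=2), empirical=TRUE))
names(simxy) <- c("x", "y")

cor(simxy$x, simxy$y) #0.6 by construction
with(subset(simxy, x > 0), cor(x, y)) #attenuated: the range (and variance) of x is restricted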

4 Single-predictor regression

Single-predictor regression (sometimes called ‘bivariate regression’, though this is confusing) is simply a re-expression of the covariance between two variables. Importantly, regression places variables into the asymmetric roles of criterion and predictor.

That is, the goal of regression is to predict values of the criterion variable based on one or more predictors. Whereas correlation is inherently non-directional, regression moves us one step closer to positing a belief that one can estimate the values of the criterion if one knows values of the predictors. Although this does not imply strong causality, conceptual clarity about the roles of variables in a regression model is essential. And having a clear head about which variables are predictors and which is the criterion becomes even more essential in SEM!

Here’s the basic equation: \[ \hat{Y} = B_{0} + B_{1}X + \varepsilon \]

The slope \(B_1\) encodes the expected change in \(Y\) for a one-unit change in \(X\), and the intercept \(B_0\) captures the expected value of \(Y\) when \(X = 0\) (without it, we would force the regression line through the origin). We can re-express \(B_1\) in terms of correlation or covariance, as above.

First, let’s set the roles of criterion and predictor. Can we estimate the median home price (criterion) on the basis of crime (predictor)? Let’s fit this equation in R:

\[\hat{\textrm{Price}} = B_{0} + B_{1}\textrm{Crime} + \varepsilon\]

m <- lm(cmedv ~ crim, BostonSmall)
summary(m)
## 
## Call:
## lm(formula = cmedv ~ crim, data = BostonSmall)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -16.95  -5.47  -2.00   2.53  29.81 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  24.0316     0.4082    58.9   <2e-16 ***
## crim         -0.4159     0.0438    -9.5   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.47 on 504 degrees of freedom
## Multiple R-squared:  0.152,  Adjusted R-squared:  0.15 
## F-statistic: 90.2 on 1 and 504 DF,  p-value: <2e-16

We see that \(B_1\) is -.42, indicating a negative relationship between price and crime.

We can obtain the same estimate by knowing the correlation between price and crime, as well as the standard deviations of each:

\[B_1 = r_{XY}\frac{s_Y}{s_X}\]

unname(cormat["cmedv","crim"]*(sds["cmedv"]/sds["crim"]))
## [1] -0.416

We can also visualize the association between these variables as a scatterplot, adding the predicted values \(\hat{\textrm{Price}}\) at every value of \(\textrm{Crime}\) (i.e., the fitted regression line).

BostonSmall <- BostonSmall %>% add_predictions(m)
ggplot(BostonSmall, aes(x=crim, y=cmedv)) + 
  geom_point() + geom_line(aes(y=pred)) + theme_bw()

4.1 Regression diagnostics

This is clearly not a straightforward relationship! The distribution of crime has a long positive tail. We can visualize the residuals of this model, \(\textrm{Price} - \hat{\textrm{Price}}\) to get a sense of things:

ggplot(BostonSmall, aes(x=crim, y=cmedv)) + 
  geom_point() + 
  geom_line(aes(y=pred)) + 
  geom_segment(aes(xend=crim, yend=pred), color="blue", alpha=0.2) +
  theme_bw()

Or the association between the predicted data and residuals:

BostonSmall <- BostonSmall %>% add_residuals(m)
ggplot(BostonSmall, aes(x=pred, y=resid)) + 
  geom_point() + stat_smooth(method="loess")
## `geom_smooth()` using formula 'y ~ x'

Or the distribution of residuals (which should be normal):

ggplot(BostonSmall, aes(x=resid)) + geom_histogram(bins=15)

The residuals do not look like white noise, and there is evidence of heteroscedasticity. The distribution of crime is very difficult to work with, and we won't solve that problem perfectly here.

ggplot(BostonSmall, aes(x=crim)) + geom_histogram(bins=15)

But a log2 transformation of crime moves things in the right direction in terms of normalizing the variance while retaining interpretability (a one-unit change = a doubling of crime). Remember that regression does not assume normality of the predictors, only of the residuals.

ggplot(BostonSmall, aes(x=log2(crim))) + geom_histogram(bins=15)

Let's refit the model using the log2-transformed predictor. This is moving in the right direction and will have to do for now.

BostonSmall$log_crim <- log2(BostonSmall$crim)
m2 <- lm(cmedv ~ log_crim, BostonSmall)
BostonSmall <- BostonSmall %>% add_predictions(m2)
BostonSmall <- BostonSmall %>% add_residuals(m2)
ggplot(BostonSmall, aes(x=log_crim, y=cmedv)) + 
  geom_point() + 
  geom_line(aes(y=pred)) + 
  geom_segment(aes(xend=log_crim, yend=pred), color="blue", alpha=0.2) + 
  theme_bw()

4.2 Orthogonality of residuals with predictor(s)

Let’s return to the relationships among correlation, covariance, and regression. As Kline notes (p. 27), the residuals of a regression should be orthogonal to the predictor. Indeed, the least squares criterion, \(\underset{B_0, B_1}{\textrm{min}} \sum{(Y - \hat{Y})^2}\), guarantees this.

cor.test(residuals(m2), BostonSmall$log_crim)
## 
##  Pearson's product-moment correlation
## 
## data:  residuals(m2) and BostonSmall$log_crim
## t = 1e-15, df = 504, p-value = 1
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.0872  0.0872
## sample estimates:
##      cor 
## 4.34e-17

4.3 Swapping criterion and predictor

Unlike correlation or covariance, which are symmetric and undirected, swapping the criterion and predictor in single-predictor regression changes the slope estimate of their association. If we treat crime as the criterion and housing price as the predictor, the slope now represents the change in log2(crime) for a one-unit change in price (and because crime is on the log2 scale, a one-unit change in the criterion corresponds to a doubling of the crime rate). To wit:

\[ \textrm{log}_2(\hat{\textrm{Crime}}) = B_0 + B_1 \textrm{Price} + \varepsilon \]

m3 <- lm(log_crim ~ cmedv, BostonSmall)
summary(m3)
## 
## Call:
## lm(formula = log_crim ~ cmedv, data = BostonSmall)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.951 -2.186 -0.655  2.521  8.600 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.3738     0.3273    7.25  1.6e-12 ***
## cmedv        -0.1553     0.0135  -11.54  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.78 on 504 degrees of freedom
## Multiple R-squared:  0.209,  Adjusted R-squared:  0.208 
## F-statistic:  133 on 1 and 504 DF,  p-value: <2e-16

We can, however, still recover the correlation based on the unstandardized regression coefficient and the standard deviations of the variables:

\[r_{\textrm{Price,log2(Crime)}} = B_1 \frac{s_{\textrm{Price}}}{s_{\textrm{log2(Crime)}}}\]

(coef(m3)["cmedv"] * sd(BostonSmall$cmedv)) / sd(BostonSmall$log_crim)
##  cmedv 
## -0.457
cor.test(~log_crim + cmedv, BostonSmall)
## 
##  Pearson's product-moment correlation
## 
## data:  log_crim and cmedv
## t = -12, df = 504, p-value <2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.524 -0.386
## sample estimates:
##    cor 
## -0.457

4.4 Standardized parameter estimates

This brings us to standardization in regression, which attempts to re-express the parameter estimates in units that are not specific to the scale of each variable. More specifically, standardized parameter estimates represent the association in standard deviation units. In the case of crime and housing prices, we could ask how many standard deviations lower are housing prices given a one SD increase in crime.

Standardization is often achieved by z-scoring all variables (or at least dividing each by its own SD) prior to calculating the regression:

BostonSmall_std <- BostonSmall %>% select(-pred, -resid) %>% mutate_all(scale)
m4 <- lm(cmedv ~ log_crim, BostonSmall_std)
summary(m4)
## 
## Call:
## lm(formula = cmedv ~ log_crim, data = BostonSmall_std)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.885 -0.563 -0.270  0.290  3.627 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.58e-18   3.96e-02     0.0        1    
## log_crim    -4.57e-01   3.96e-02   -11.5   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.89 on 504 degrees of freedom
## Multiple R-squared:  0.209,  Adjusted R-squared:  0.208 
## F-statistic:  133 on 1 and 504 DF,  p-value: <2e-16

Thus, in the context of single-predictor regression, the standardized regression coefficient is the correlation.

\[r_{XY} = \beta_1 = B_1 \frac{s_X}{s_Y} \]

coef(m2)["log_crim"] * sd(BostonSmall$log_crim) / sd(BostonSmall$cmedv)
## log_crim 
##   -0.457

Standardization can also be achieved after model estimation by dividing by the relevant standard deviations. I’ll skip the details for now, but check out the lm.beta package.

lm.beta(m2)
## 
## Call:
## lm(formula = cmedv ~ log_crim, data = BostonSmall)
## 
## Standardized Coefficients::
## (Intercept)    log_crim 
##       0.000      -0.457

4.5 A few other regression equivalencies

As Kline notes, there are a few other handy properties here:

  1. The unsigned correlation of predictor and criterion is equal to the unsigned correlation of the criterion with the fitted/predicted values

\[ |r_{XY}| = r_{Y\hat{Y}} \]

More generally, as in multiple regression, this should be expressed:

\[ R_{Y \cdot X_{1...p}} = r_{Y\hat{Y}} \]

cor.test(fitted(m2), BostonSmall$cmedv)
## 
##  Pearson's product-moment correlation
## 
## data:  fitted(m2) and BostonSmall$cmedv
## t = 12, df = 504, p-value <2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.386 0.524
## sample estimates:
##   cor 
## 0.457

  2. Total variance in \(Y\) can be decomposed into the predicted part and residual part

\[ s_{Y}^2 = s_{\hat{Y}}^2 + s_{Y - \hat{Y}}^2 \]

var(BostonSmall$cmedv)
## [1] 84.3
var(fitted(m2)) + var(resid(m2))
## [1] 84.3

  3. The proportion of shared variance between \(X\) and \(Y\) is the proportion of \(Y\) variance explained relative to its total variance

\[ r_{XY}^2 = \frac{s_{\hat{Y}}^2}{s_{Y}^2} \]

rsquare(m2, BostonSmall)
## [1] 0.209
var(fitted(m2)) / var(BostonSmall$cmedv)
## [1] 0.209

5 Multiple regression

In single-predictor regression, there are straightforward mappings back to covariance and correlation. As we consider the joint effects of multiple predictors, however, the math gets a bit harder. Conceptually, the goal of multiple (predictor) regression is to quantify the effects of several predictors on a criterion. In multiple regression, each parameter estimate quantifies the partial relationship of a given predictor \(X_1\) with the criterion \(Y\), controlling for all other predictors \(X_{2...p}\). That is, each coefficient tells us how much change in \(Y\) is expected for a unit change in \(X_1\) at a given level of \(X_{2...p}\) (i.e., ‘holding constant’ those predictors).

\[ \hat{Y} = B_0 + B_1 X_1 + B_2 X_2 + B_3 X_3 + ... B_p X_p + \varepsilon \]

Let’s consider the case in which housing prices are predicted by both crime and nitric oxide concentration:

\[\hat{\textrm{Price}} = B_{0} + B_{1}\textrm{log}_2\textrm{(Crime)} + B_2 \textrm{Nox} + \varepsilon\]

m5 <- lm(cmedv ~ log_crim + nox, BostonSmall)
summary(m5)
## 
## Call:
## lm(formula = cmedv ~ log_crim + nox, data = BostonSmall)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -15.97  -4.92  -2.27   2.66  32.96 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   29.470      3.004    9.81  < 2e-16 ***
## log_crim      -0.925      0.188   -4.91  1.2e-06 ***
## nox          -14.391      5.070   -2.84   0.0047 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.12 on 503 degrees of freedom
## Multiple R-squared:  0.222,  Adjusted R-squared:  0.219 
## F-statistic: 71.6 on 2 and 503 DF,  p-value: <2e-16

We can see that there are negative associations of both crime and pollutants with home prices. As mentioned above, the multiple correlation of all predictors with the criterion is given by:

\[ R_{Y \cdot X_{1...p}} = r_{Y\hat{Y}} \]

cor.test(fitted(m5), BostonSmall$cmedv)
## 
##  Pearson's product-moment correlation
## 
## data:  fitted(m5) and BostonSmall$cmedv
## t = 12, df = 504, p-value <2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.400 0.536
## sample estimates:
##   cor 
## 0.471

And the variance explained:

cor(fitted(m5), BostonSmall$cmedv)^2
## [1] 0.222

5.1 Standardized parameters in multiple regression

To extend standardized parameter estimates to the multiple regression context, we have to account for the shared variance among predictors.

\[ \beta_\textrm{Nox} = \frac{r_\textrm{Nox,Price} - r_\textrm{Crime,Price} r_\textrm{Nox,Crime}}{1 - r_\textrm{Nox,Crime}^2} \]

That is, in the numerator we adjust the association of Nox with Price by subtracting the product of the correlation of Crime with Price (the other predictor with the criterion) and the correlation of Nox with Crime (the correlation between predictors). The denominator then rescales this by the proportion of variance in Nox not shared with Crime. In the case of orthogonal (uncorrelated) predictors, the standardized beta simplifies to the bivariate correlation, as in single-predictor regression:

\[ \begin{aligned} r_\textrm{Nox,Crime} :&= 0 \\ \beta_\textrm{Nox} & = r_\textrm{Nox,Price} \end{aligned} \]

Unstandardized betas can be computed from the standardized betas as above:

\[ B_\textrm{Nox} = \beta_\textrm{Nox}\left(\frac{s_\textrm{Price}}{s_\textrm{Nox}}\right) \]
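
As a quick check on these two formulas, here is a sketch (using the m5 model and BostonSmall data from above; r and beta_nox are just local names for this check) that recomputes the standardized and unstandardized coefficients for nox from the correlations and standard deviations:

#standardized partial coefficient for nox computed from the correlation matrix
r <- cor(BostonSmall[, c("cmedv", "log_crim", "nox")])
beta_nox <- (r["nox", "cmedv"] - r["log_crim", "cmedv"] * r["nox", "log_crim"]) /
  (1 - r["nox", "log_crim"]^2)

beta_nox #should match the nox coefficient from lm.beta(m5)
lm.beta(m5)

#convert the standardized coefficient back to the unstandardized one
beta_nox * sd(BostonSmall$cmedv) / sd(BostonSmall$nox) #should match coef(m5)["nox"]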

5.2 Partial and part correlations

Also note that we can consider partial relationships in multiple regression in correlational (standardized) terms. But we have to decide how we wish to think about shared variance among predictors. There are two options:

5.2.1 Partial correlation

The association of \(X_1\) and \(Y\), controlling for shared variance with \(X_2\). Note that the goal here is to remove the influence of \(X_2\) from both variables before considering their association.

\[ r_{Y,X_1 \cdot X_2} = \frac{r_{Y,X_1} - r_{X_1,X_2}r_{Y,X_2}}{\sqrt{(1-r_{X_1,X_2}^2)(1-r_{Y,X_2}^2)}} \]

5.2.2 Part (semipartial)

The association of \(X_1\) with \(Y\), controlling for shared variance between \(X_1\) and \(X_2\). Importantly, the association of \(X_2\) with \(Y\) is not removed.

\[ r_{Y(X_1 \cdot X_2)} = \frac{r_{Y,X_1} - r_{Y,X_2}r_{X_1,X_2}}{\sqrt{1-r_{X_1,X_2}^2}} \]
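
As a concrete example with the Boston data, here are both quantities computed with the ppcor package (also used later in this document), with nox playing the role of \(X_2\); in the semipartial case, nox is removed from the second argument only:

#partial correlation of price and log2(crime): nox removed from both variables
with(BostonSmall, pcor.test(cmedv, log_crim, nox))

#part (semipartial) correlation: nox removed from log_crim only
with(BostonSmall, spcor.test(cmedv, log_crim, nox))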

5.2.3 Graphical expression

Venn diagram of shared variance

  • \(a\) (the variance in \(Y\) shared uniquely with \(X_1\)) is the squared part correlation of \(X_1\) with \(Y\), adjusting for the \(X_1\)-\(X_2\) association.

  • \(\frac{a}{a+d}\), where \(d\) is the variance in \(Y\) not explained by either predictor, is the squared partial correlation of \(X_1\) with \(Y\), controlling for \(X_2\).

5.3 Decomposing variance explained in regression (standardized coefficients, part, or partial correlations)

There is not a single best measure of association between a predictor and criterion in multiple regression. Rather, the challenge is that regression coefficients, part correlations, and partial correlations all represent the association of a predictor with a criterion conditional on the other predictors in the model.

The good news is that standardized regression coefficients are interpretable because they are on the same scale and have a known interpretation (relationship between \(Y\) and \(X_1\) at a given level of \(X_2 ... X_p\)). The bad news is that they do not sufficiently weight the direct effect of a predictor on the criterion, which is often of substantive interest (Johnson and LeBreton, 2004).

There are additional measures of the relative importance of predictors in a regression model that go beyond what we will cover in class. I would suggest reading this paper and using the relaimpo package to examine relative importance of predictors in regression.

One intuitive, but flawed, measure of partial association is the product of the standardized coefficient and the correlation of the predictor with the criterion (Hoffman, 1960):

\[ r^2_{Y,X_1\cdot (X_2...X_p)} = \beta_{X_1}r_{Y,X_1} \]

By using the marginal correlation (i.e., the bivariate relationship, averaging over all other predictors), the sum of these estimates across all predictors equals the model \(R^2\).

If you’re interested in the relative importance of a predictor in a multiple regression model, I would suggest using the Lindeman, Merenda and Gold (‘lmg’) method of computing relative importance over standardized coefficients, Hoffman’s method, part, or partial correlation. It’s described in the paper above. Here’s a quick example:

m_relimport <- lm(cmedv ~ log_crim + nox, BostonSmall)
stdbetas <- lm.beta(m_relimport)$standardized.coefficients #standardized coefficients

stdbetas["log_crim"] * cor(BostonSmall$cmedv, BostonSmall$log_crim)
## log_crim 
##    0.144
stdbetas["nox"] * cor(BostonSmall$cmedv, BostonSmall$nox)
##   nox 
## 0.078
#here is the broader set of relative importance measures to consider (esp. 'lmg')
calc.relimp(m_relimport, type = c("lmg", "first", "last", "betasq", "pratt", "genizi", "car"))
## Warning in rev(variances[[p]]) - variances[[p + 1]]: Recycling array of length 1 in vector-array arithmetic is deprecated.
##   Use c() or as.vector() instead.
## Response variable: cmedv 
## Total response variance: 84.3 
## Analysis based on 506 observations 
## 
## 2 Regressors: 
## log_crim nox 
## Proportion of variance explained by model: 22.2%
## Metrics are not normalized (rela=FALSE). 
## 
## Relative importance metrics: 
## 
##             lmg   last first betasq pratt genizi    car
## log_crim 0.1232 0.0373 0.209 0.0986 0.144 0.1232 0.1310
## nox      0.0984 0.0125 0.184 0.0330 0.078 0.0984 0.0906
## 
## Average coefficients for different model sizes: 
## 
##              1X     2Xs
## log_crim  -1.35  -0.925
## nox      -34.02 -14.391
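
Consistent with the sum property noted above (and with the 'pratt' column in the output), the two products add up to the model \(R^2\):

#the products of standardized betas and marginal correlations sum to the model R^2
sum(stdbetas["log_crim"] * cor(BostonSmall$cmedv, BostonSmall$log_crim),
    stdbetas["nox"] * cor(BostonSmall$cmedv, BostonSmall$nox))
summary(m_relimport)$r.squared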

As a reminder, the complexity of partial relationships in multiple regression largely comes from the conceptual (and statistical) difficulty of handling the relationship among predictors. If the predictors are uncorrelated with each other, then everything you know from single-predictor regression holds true – for example, the equivalence of bivariate correlation and standardized coefficients.

A useful way to think about this is that if the predictors are truly orthogonal to each other, then regressing \(\mathbf{y}\) on \(\mathbf{x}_1\) alone removes none of the variance accounted for by \(\mathbf{x}_2\). Indeed, one could carry the residuals of the \(Y \sim X_1\) regression into a new model:

\[ \begin{align} r_{X_1,X_2} :&= 0 \\ \hat{Y} &= B_0 + B_1 X_1 + \varepsilon \\ R_{X_1} &= Y - \hat{Y} \\ \hat{R}_{X_1} &= B_0 + B_2 X_2 + \varepsilon \end{align} \]

The coefficients for \(B_1\) and \(B_2\) will be identical when estimated jointly:

\[ \hat{Y} = B_0 + B_1 X_1 + B_2 X_2 + \varepsilon \]

5.3.1 Equivalence of coefficients to single-predictor regression when predictors are orthogonal (uncorrelated)

Here’s a quick demo. First, a structure in which \(X_1\) and \(X_2\) are associated with \(Y\), but not with each other. The underlying correlation structure is:

mu <- rep(0,3)
Sigma <- tribble(
  ~y, ~x1, ~x2,
  1, 0.4, 0.4,
  0.4, 1, 0,
  0.4, 0, 1
)

print(Sigma)
## # A tibble: 3 x 3
##       y    x1    x2
##   <dbl> <dbl> <dbl>
## 1   1     0.4   0.4
## 2   0.4   1     0  
## 3   0.4   0     1
sim_nocorr <- data.frame(MASS::mvrnorm(n=5000, mu=mu, Sigma=Sigma, empirical=TRUE))
names(sim_nocorr) <- c("y", "x1", "x2")
cor(sim_nocorr) #verify that x1 and x2 correlate with y, but not each other
##      y       x1       x2
## y  1.0 4.00e-01 4.00e-01
## x1 0.4 1.00e+00 2.73e-16
## x2 0.4 2.73e-16 1.00e+00

First, here are standardized betas under joint versus single-predictor models:

lm.beta(lm(y ~ x1 + x2, sim_nocorr))
## 
## Call:
## lm(formula = y ~ x1 + x2, data = sim_nocorr)
## 
## Standardized Coefficients::
## (Intercept)          x1          x2 
##         0.0         0.4         0.4
lm.beta(lm(y ~ x1, sim_nocorr))
## 
## Call:
## lm(formula = y ~ x1, data = sim_nocorr)
## 
## Standardized Coefficients::
## (Intercept)          x1 
##         0.0         0.4
lm.beta(lm(y ~ x2, sim_nocorr))
## 
## Call:
## lm(formula = y ~ x2, data = sim_nocorr)
## 
## Standardized Coefficients::
## (Intercept)          x2 
##         0.0         0.4

Semipartial correlation of \(Y\) with \(X_1\) net \(X_2\):

with(sim_nocorr, spcor.test(y, x1, x2))
  estimate  p.value  statistic     n  gp   Method
       0.4        0       30.9  5000   1  pearson

Semipartial correlation of \(Y\) with \(X_2\) net \(X_1\):

with(sim_nocorr, spcor.test(y, x2, x1))
  estimate  p.value  statistic     n  gp   Method
       0.4        0       30.9  5000   1  pearson

Bivariate correlation of \(Y\) with \(X_1\):

with(sim_nocorr, cor(y, x1))
## [1] 0.4

Bivariate correlation of \(Y\) with \(X_2\):

with(sim_nocorr, cor(y, x2))
## [1] 0.4

Now, the equivalence of coefficients estimated sequentially (one predictor at a time) versus jointly under orthogonal predictors:

#demonstrate equivalence of sequential versus joint estimation under orthogonal predictors
m1 <- lm(y ~ x1, sim_nocorr)
summary(m1)
## 
## Call:
## lm(formula = y ~ x1, data = sim_nocorr)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.517 -0.617 -0.013  0.621  3.199 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.45e-17   1.30e-02     0.0        1    
## x1          4.00e-01   1.30e-02    30.9   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.917 on 4998 degrees of freedom
## Multiple R-squared:  0.16,   Adjusted R-squared:  0.16 
## F-statistic:  952 on 1 and 4998 DF,  p-value: <2e-16
sim_nocorr$resid_x1 <- resid(m1)
m2 <- lm(resid_x1 ~ x2 - 1, sim_nocorr) #remove intercept since that was pulled out in m1
summary(m2)
## 
## Call:
## lm(formula = resid_x1 ~ x2 - 1, data = sim_nocorr)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.0699 -0.5379  0.0099  0.5450  2.7408 
## 
## Coefficients:
##    Estimate Std. Error t value Pr(>|t|)    
## x2   0.4000     0.0117    34.3   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.825 on 4999 degrees of freedom
## Multiple R-squared:  0.19,   Adjusted R-squared:  0.19 
## F-statistic: 1.18e+03 on 1 and 4999 DF,  p-value: <2e-16
m3 <- lm(y ~ x1 + x2, sim_nocorr)
summary(m3)
## 
## Call:
## lm(formula = y ~ x1 + x2, data = sim_nocorr)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.0699 -0.5379  0.0099  0.5450  2.7408 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.98e-17   1.17e-02     0.0        1    
## x1          4.00e-01   1.17e-02    34.3   <2e-16 ***
## x2          4.00e-01   1.17e-02    34.3   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.825 on 4997 degrees of freedom
## Multiple R-squared:  0.32,   Adjusted R-squared:  0.32 
## F-statistic: 1.18e+03 on 2 and 4997 DF,  p-value: <2e-16

5.3.2 Non-equivalence of coefficients when predictors are correlated

Now add a correlation of 0.6 between the predictors and repeat the various tests:

Sigma <- tribble(
  ~y, ~x1, ~x2,
  1, 0.4, 0.4,
  0.4, 1, 0.6,
  0.4, 0.6, 1
)
print(Sigma)
## # A tibble: 3 x 3
##       y    x1    x2
##   <dbl> <dbl> <dbl>
## 1   1     0.4   0.4
## 2   0.4   1     0.6
## 3   0.4   0.6   1
sim_withcorr <- data.frame(MASS::mvrnorm(n=5000, mu=mu, Sigma=Sigma, empirical=TRUE))
names(sim_withcorr) <- c("y", "x1", "x2")

#now, semipartial and standardized beta are not equivalent
#likewise, bivariate correlation does not match 
lm.beta(lm(y ~ x1 + x2, sim_withcorr))
## 
## Call:
## lm(formula = y ~ x1 + x2, data = sim_withcorr)
## 
## Standardized Coefficients::
## (Intercept)          x1          x2 
##        0.00        0.25        0.25
lm.beta(lm(y ~ x1, sim_withcorr))
## 
## Call:
## lm(formula = y ~ x1, data = sim_withcorr)
## 
## Standardized Coefficients::
## (Intercept)          x1 
##         0.0         0.4
lm.beta(lm(y ~ x2, sim_withcorr))
## 
## Call:
## lm(formula = y ~ x2, data = sim_withcorr)
## 
## Standardized Coefficients::
## (Intercept)          x2 
##         0.0         0.4
with(sim_withcorr, spcor.test(y, x1, x2))
  estimate  p.value  statistic     n  gp   Method
       0.2        0       14.4  5000   1  pearson
with(sim_withcorr, spcor.test(y, x2, x1))
  estimate  p.value  statistic     n  gp   Method
       0.2        0       14.4  5000   1  pearson
with(sim_withcorr, cor(y, x1))
## [1] 0.4
with(sim_withcorr, cor(y, x2))
## [1] 0.4
#demonstrate *non*-equivalence of sequential versus joint estimation under correlated predictors
m1 <- lm(y ~ x1, sim_withcorr)
summary(m1)
## 
## Call:
## lm(formula = y ~ x1, data = sim_withcorr)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -2.946 -0.615 -0.006  0.612  3.393 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -8.98e-17   1.30e-02     0.0        1    
## x1           4.00e-01   1.30e-02    30.9   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.917 on 4998 degrees of freedom
## Multiple R-squared:  0.16,   Adjusted R-squared:  0.16 
## F-statistic:  952 on 1 and 4998 DF,  p-value: <2e-16
sim_withcorr$resid_x1 <- resid(m1)
m2 <- lm(resid_x1 ~ x2 - 1, sim_withcorr) #remove intercept since that was pulled out in m1
summary(m2)
## 
## Call:
## lm(formula = resid_x1 ~ x2 - 1, data = sim_withcorr)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.058 -0.608 -0.001  0.587  3.313 
## 
## Coefficients:
##    Estimate Std. Error t value Pr(>|t|)    
## x2   0.1600     0.0128    12.5   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.902 on 4999 degrees of freedom
## Multiple R-squared:  0.0305, Adjusted R-squared:  0.0303 
## F-statistic:  157 on 1 and 4999 DF,  p-value: <2e-16
m3 <- lm(y ~ x1 + x2, sim_withcorr)
summary(m3)
## 
## Call:
## lm(formula = y ~ x1 + x2, data = sim_withcorr)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.175 -0.598 -0.003  0.590  3.360 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -8.27e-17   1.27e-02     0.0        1    
## x1           2.50e-01   1.58e-02    15.8   <2e-16 ***
## x2           2.50e-01   1.58e-02    15.8   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.895 on 4997 degrees of freedom
## Multiple R-squared:   0.2,   Adjusted R-squared:   0.2 
## F-statistic:  625 on 2 and 4997 DF,  p-value: <2e-16

6 Assumptions of regression (from Kline)

  1. Regression coefficients reflect unconditional linear relations only. Stated differently, the covariance of a predictor \(X\) and the criterion \(Y\) must be constant over \(X\), other predictors \(\mathbf{P}\), and unmeasured variables \(\mathbf{U}\). Thus, unmodeled interactions among predictors, measured or unmeasured, violate this assumption, as do nonlinear associations between \(X\) and \(Y\).

  2. Predictors are measured perfectly. Although we know that psychological variables suffer from measurement error (i.e., we are uncertain about the true value of a predictor \(X\)), we cannot manage this problem in conventional regression. Rather, predictors are assumed to have 0 measurement error. Violating this can lead to bias in parameter estimates, both for the unreliable predictor and other parameters.

  3. Residuals are normally distributed. An important assumption in regression is that the errors (i.e., residuals) are normally distributed conditional on \(\mathbf{X}\) (i.e., the matrix of all predictors). In addition, knowledge about one residual should not provide information about another residual. This is the assumption of iid residuals – independent and identically distributed. The iid assumption can be violated, for example, by clustered data in which the cluster has certain unmodeled properties (e.g., longitudinal data where one person is higher on average than others).

  4. Residuals are homoscedastic. The variance of the residuals should not depend on the value of the predictor. So, if more extreme values tend to be more poorly fit (i.e., have larger residual variance), the errors are heteroscedastic (see the diagnostic plots sketched after this list).

  5. There are no causal effects among the predictors. This assumption highlights that in univariate regression (i.e., one dependent variable), there is no way to model indirect causal effects among predictors. So if one believes that \(X\) causes \(Y\) via its influence on \(M\) (i.e., \(X \rightarrow M \rightarrow Y\)), this requires two regression equations (\(X \rightarrow M\) and \(M \rightarrow Y\)), which cannot be accommodated in standard regression. Indeed, such multivariate regression problems are a major motivation for SEM!

  6. There is no specification error. Many, if not most, statistical models have this assumption, which basically states that the form of the model reflects an accurate representation of the data-generating processes in the population. There are many ways to violate this assumption.

    1. Misspecifying the form of the relationship between a predictor and the criterion (e.g., nonlinear effects).
    2. Using a biased or imprecise statistical estimator
    3. Including predictors that are irrelevant in the population, but could absorb variance by chance.
    4. Omitting variables that have a relationship with the criterion.
    5. Omitting variables that are associated with predictors in the model.
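
As a brief illustration of checking assumptions 3 and 4, base R's plot() method for fitted lm objects produces standard residual diagnostics. Here is a minimal sketch that refits the single-predictor model from section 4.1 (m_diag is just an illustrative name):

#refit the log2(crime) model and inspect residual diagnostics
m_diag <- lm(cmedv ~ log_crim, BostonSmall)
plot(m_diag, which = 1:2) #residuals vs. fitted (heteroscedasticity) and normal QQ plot (normality)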

7 Suppression

This is a complex problem that is addressed in more detail by Cohen, Cohen, West, and Aiken (2003). I will mention a few cases:

7.1 Classical suppression

Including a variable that is uncorrelated with the criterion, but correlated with a useful predictor, can inflate the parameter estimate of the useful predictor. For example:

\[ \begin{align} r_{YX} &= 0.5 \\ r_{YZ} &= 0 \\ r_{XZ} &= 0.4 \\ \hat{Y} &= B_0 + B_1 X + B_2 Z + \varepsilon \end{align} \]

In this circumstance, as we discussed above, the partial regression coefficient adjusts for the associations among predictors.

\[ \beta_\textrm{X} = \frac{r_\textrm{XY} - r_\textrm{YZ} r_\textrm{XZ}}{1 - r_\textrm{XZ}^2} \]

Thus, when there is no association between the criterion \(Y\) and the predictor \(Z\), we get

\[ \beta_\textrm{X} = \frac{r_\textrm{XY}}{1 - r_\textrm{XZ}^2} \]

Relative to the bivariate correlation \(r_{XY}\), the denominator is less than 1 under classical suppression, which makes \(\beta_X\) larger than \(r_{XY}\).
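
Here is a minimal simulation of this scenario, in the style of the mvrnorm examples above (Sigma_sup and sim_sup are illustrative names). With these population correlations, the standardized coefficient for \(X\) works out to \(0.5/0.84 \approx 0.60\), larger than its bivariate correlation of 0.5:

#population correlations: r_yx = 0.5, r_yz = 0, r_xz = 0.4
Sigma_sup <- matrix(c(1.0, 0.5, 0.0,
                      0.5, 1.0, 0.4,
                      0.0, 0.4, 1.0), nrow=3, byrow=TRUE)

sim_sup <- data.frame(MASS::mvrnorm(n=5000, mu=rep(0, 3), Sigma=Sigma_sup, empirical=TRUE))
names(sim_sup) <- c("y", "x", "z")
lm.beta(lm(y ~ x + z, sim_sup)) #beta for x exceeds its bivariate correlation with y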

7.2 Net suppression

Definition: We observe positive bivariate associations among 2+ predictors and a criterion, but one of the predictors has a negative regression coefficient.

\[ \begin{align} r_{YX} &= 0.3 \\ r_{YZ} &= 0.5 \\ r_{XZ} &= 0.4 \\ \hat{Y} &= B_0 + B_1 X + B_2 Z + \varepsilon \end{align} \]

In this case, \(X\) is the predictor with the weaker association with \(Y\); its coefficient in the multiple regression will be much smaller than its bivariate correlation (and becomes negative whenever \(r_{YX} < r_{YZ} r_{XZ}\)), with \(X\) largely serving to remove error variance from \(Z\) rather than contributing to the prediction of \(Y\). This form of suppression is often identified when the standardized beta for a predictor is opposite in sign to the bivariate correlation of that predictor with the criterion.

7.3 Cooperative suppression

Definition: two predictors correlate negatively with each other, but positively with the criterion. This will lead to a high multiple \(R\) because the predictors account for more variance in the criterion when considered together than their bivariate correlations would suggest. Thus, this form of suppression is often identified when the standardized beta for a predictor exceeds its bivariate correlation with the criterion.

\[ \begin{align} r_{YX} &= 0.5 \\ r_{YZ} &= 0.5 \\ r_{XZ} &= -0.4 \\ \hat{Y} &= B_0 + B_1 X + B_2 Z + \varepsilon \end{align} \]
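
Again, a small simulation in the same style (Sigma_coop and sim_coop are illustrative names) makes the point; with these population correlations, both standardized coefficients work out to \(0.7/0.84 \approx 0.83\), well above the bivariate correlations of 0.5:

#population correlations: r_yx = 0.5, r_yz = 0.5, r_xz = -0.4
Sigma_coop <- matrix(c(1.0,  0.5,  0.5,
                       0.5,  1.0, -0.4,
                       0.5, -0.4,  1.0), nrow=3, byrow=TRUE)

sim_coop <- data.frame(MASS::mvrnorm(n=5000, mu=rep(0, 3), Sigma=Sigma_coop, empirical=TRUE))
names(sim_coop) <- c("y", "x", "z")
lm.beta(lm(y ~ x + z, sim_coop)) #both betas exceed their bivariate correlations with y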

More details about suppression here