In multiple regression, we are used to seeing a single regression equation such as:
\[ Y_i = \beta_0 + \beta_1 V_i + \beta_2 W_i + \beta_3 X_i + \varepsilon_i \] where \(i\) denotes the ith individual/observation. The goal of this tutorial is to review how such a regression equation can be represented by a set of matrices and vectors:
\[ \mathbf{Y} = \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix} = \begin{bmatrix} \beta_0 + \beta_1 V_1 + \beta_2 W_1 + \beta_3 X_1 \\ \beta_0 + \beta_1 V_2 + \beta_2 W_2 + \beta_3 X_2 \\ \vdots \\ \beta_0 + \beta_1 V_n + \beta_2 W_n + \beta_3 X_n \\ \end{bmatrix} \enspace + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix} \]
If we factor out the parameters (\(\boldsymbol{\beta}\)) from the data, we can see the difference between the model matrix and the parameter estimates:
\[ \mathbf{Y} = \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix} = \begin{bmatrix} 1 & V_1 & W_1 & X_1 \\ 1 & V_2 & W_2 & X_2 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & V_n & W_n & X_n \\ \end{bmatrix} \enspace \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix} \]
Remember that matrices are defined by rows (the first dimension) and columns (the second dimension):
\[ \underset{m \times n}{\mathbf{A}} = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \\ a_{41} & a_{42} & a_{43} \end{bmatrix} \]
And a position in the matrix is specified by subscripting according to the row and column: \(a_{11}\).
A square matrix has the same number of rows and columns. Covariance matrices are always square.
\[ \underset{n \times n}{\mathbf{A}} = \begin{bmatrix} a_{11} & a_{12} & a_{13} & a_{14} \\ a_{21} & a_{22} & a_{23} & a_{24} \\ a_{31} & a_{32} & a_{33} & a_{34} \\ a_{41} & a_{42} & a_{43} & a_{44} \end{bmatrix} \]
A symmetric matrix is a square matrix that is identical when transposed. That is, flipping the rows and columns has no effect. Another way to think of it is that the upper and lower triangles (the off-diagonal structure) mirror each other.
\[ \begin{align} \underset{n \times n}{\mathbf{A}} &= \begin{bmatrix} a & ab & ac & ad \\ ab & b & bc & bd \\ ac & bc & c & cd \\ ad & bd & cd & d \end{bmatrix} \\ \cr \mathbf{A} &= \mathbf{A}' \end{align} \]
This is pretty close to the structure we’ll see in much of the class – with \(ab\) representing some function of both \(a\) and \(b\) (e.g., covariance).
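As a quick check in R (a minimal sketch using the built-in mtcars data, not the housing data used later), a covariance matrix is square and symmetric:
S <- cov(mtcars[, c("mpg", "hp", "wt")]) #3 x 3 covariance matrix of three mtcars variables
dim(S)          #same number of rows and columns: square
isSymmetric(S)  #TRUE: transposing S has no effect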
A diagonal matrix is a special case of a square symmetric matrix in which there are values along the diagonal, but zeros elsewhere:
\[ \begin{align} \underset{n \times n}{\mathbf{A}} &= \begin{bmatrix} a & 0 & 0 & 0 \\ 0 & b & 0 & 0 \\ 0 & 0 & c & 0 \\ 0 & 0 & 0 & d \end{bmatrix} \\ \cr \mathbf{A} &= \mathbf{A}' \end{align} \]
The trace of a square matrix is the sum of elements along the diagonal:
\[ tr(\mathbf{A}) = a + b + c + d \]
Or more generally, if the matrix is \(n \times n\):
\[ tr(\mathbf{A}) = \sum_{i=1}^{n}{a_{ii}} = a_{11} + a_{22} + ... + a_{nn} \]
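In R, the trace can be computed by summing the diagonal with sum(diag(.)); a minimal sketch:
M <- matrix(1:9, nrow=3)  #a 3 x 3 matrix filled column-wise
sum(diag(M))              #trace = 1 + 5 + 9 = 15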
An identity matrix is a special case of a diagonal matrix in which the elements of the diagonal are all 1:
\[ \underset{n \times n}{\mathbf{I}} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \]
Why would this be useful? Mostly it helps make matrix algebra work out; for now, just remember that any matrix multiplied by an identity matrix is unchanged, just like multiplying a number by 1:
Here’s a square matrix:
A <- matrix(rnorm(25), nrow=5, ncol=5)
print(A)
## [,1] [,2] [,3] [,4] [,5]
## [1,] -0.6208 -0.6728 0.655 1.458 0.3162
## [2,] -0.0968 -0.0921 1.111 -1.590 0.0229
## [3,] 0.8798 0.8030 0.898 -0.529 -1.0620
## [4,] -1.1098 -1.2248 -0.805 -0.277 0.9781
## [5,] -0.9037 1.0291 -1.073 0.337 -1.0197
And now multiplied by \(\mathbf{I}\):
A %*% diag(5)
## [,1] [,2] [,3] [,4] [,5]
## [1,] -0.6208 -0.6728 0.655 1.458 0.3162
## [2,] -0.0968 -0.0921 1.111 -1.590 0.0229
## [3,] 0.8798 0.8030 0.898 -0.529 -1.0620
## [4,] -1.1098 -1.2248 -0.805 -0.277 0.9781
## [5,] -0.9037 1.0291 -1.073 0.337 -1.0197
Matrix addition and subtraction are straightforward. These operations are applied elementwise:
\[ \mathbf{A} = \begin{bmatrix} 10 & 5 \\ 9 & 1 \end{bmatrix} , \enspace \mathbf{B} = \begin{bmatrix} 2 & 1 \\ 20 & 0 \end{bmatrix}, \enspace \textrm{then } \mathbf{A}-\mathbf{B}= \begin{bmatrix} 8 & 4 \\ -11 & 1 \end{bmatrix} \]
Note that matrices must be of the same dimension (i.e., number of rows and columns) to be subtracted or added.
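In R, the + and - operators are already elementwise for matrices, so the example above can be reproduced directly (a small sketch):
A <- matrix(c(10, 9, 5, 1), nrow=2)  #the 2 x 2 matrix A above (filled column-wise)
B <- matrix(c(2, 20, 1, 0), nrow=2)  #the 2 x 2 matrix B above
A - B  #elementwise subtraction: matches the result shown above
A + B  #elementwise addition works the same way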
Multiplication is more complex.
To multiply a matrix \(\mathbf{A}\) by a scalar constant \(k\), one simply multiplies every element of \(\mathbf{A}\) by \(k\):
\[ \mathbf{A} = \begin{bmatrix} 10 & 5 \\ 9 & 1 \end{bmatrix}, \enspace k=2, \enspace k\mathbf{A} = \begin{bmatrix} 20 & 10 \\ 18 & 2 \end{bmatrix} \]
Multiplication is a more complex operation when both objects are matrices. First, the order matters, such that \(\mathbf{AB}\) is not (usually) the same as \(\mathbf{BA}\). This gives rise to the terms ‘pre-multiplication’ and ‘post-multiplication’, though we don’t need those much in SEM. Second, if we are computing \(\mathbf{C} = \mathbf{AB}\), then the number of columns in \(\mathbf{A}\) must match the number of rows in \(\mathbf{B}\):
\[ \underset{n \times k}{\mathbf{C}} = \underset{n \times p}{\mathbf{A}} \cdot \underset{p \times k}{\mathbf{B}} \] Thus, the resulting matrix \(\mathbf{C}\) has the number of rows of \(\mathbf{A}\) and the number of columns of \(\mathbf{B}\). Matrices that can be multiplied are called ‘compatible’ or ‘conformable.’ Matrices in which the inner dimensions (i.e., columns of \(\mathbf{A}\), rows of \(\mathbf{B}\)) do not match are called ‘incompatible’ or ‘non-conformable.’ These cannot be multiplied.
How does matrix multiplication work? One multiplies the elements of the ith row of \(\mathbf{A}\) by the elements of the jth column of \(\mathbf{B}\), then sums up these values into the ith row and jth column of \(\mathbf{C}\). Like so:
\[c_{ij} = \sum_{k=1}^{p} a_{ik} b_{kj}\]
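A small sketch in R (reusing the A and B just defined): the %*% operator implements this row-by-column rule, while * would multiply elementwise.
A %*% B  #matrix product: e.g., element [1,1] is 10*2 + 5*20 = 120
B %*% A  #a different matrix: order matters
A * B    #NOT matrix multiplication: this multiplies elementwise
D <- matrix(1:6, nrow=3)  #a 3 x 2 matrix
D %*% A  #conformable: (3 x 2) times (2 x 2) gives a 3 x 2 result
#A %*% D would fail with a 'non-conformable arguments' error (2 columns vs. 3 rows)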
Matrix division is not commonly discussed, but it is conceptually related to the idea of an inverse. If you think of single numbers:
\[c = \frac{x}{y} = xy^{-1}\] Likewise, matrix division can be thought of as:
\[\mathbf{AB}^{-1}\] As mentioned above, matrix multiplication is not (usually) commutative, so the concept of matrix division is tricky and not especially useful or prevalent.
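A minimal sketch of this idea in R, again using the A and B from above (solve() with a single argument returns a matrix inverse, as we discuss further below):
A %*% solve(B)  #the matrix analogue of 'A divided by B': post-multiply by B's inverse
solve(B) %*% A  #pre-multiplying gives a different answer because order matters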
If you’d like to learn more about matrix multiplication, I would suggest reading this.
Briefly, figuring out the parameter estimates in models containing more than one outcome depends on solving several equations simultaneously. As in high-school algebra, to solve uniquely for \(n\) unknown variables, one needs \(n\) independent equations. One can think of this as a system with zero degrees of freedom.
From Wikipedia: \[ \begin{align} 2x + 3y & = 6 \\ 4x + 9y & = 15 \end{align} \]
This can be re-expressed in matrix form: \[ \begin{bmatrix} 2 & 3 \\ 4 & 9 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 6 \\ 15 \end{bmatrix} \]
More generally, any system of equations can be thought of as a weighted (linear) combination of variables in which a weight matrix \(\mathbf{A}\) multiplies a column vector of variables, \(\mathbf{w}\), with the constants (on the right-hand side) in a column vector \(\mathbf{b}\).
\[ x \begin{bmatrix} 2 \\ 4 \end{bmatrix} + y \begin{bmatrix} 3 \\ 9 \end{bmatrix} = \begin{bmatrix} 6 \\ 15 \end{bmatrix} \]
\[ \mathbf{A} = \begin{bmatrix} 2 & 3 \\ 4 & 9 \end{bmatrix} , \mathbf{w} = \begin{bmatrix} x \\ y \end{bmatrix}, \mathbf{b} = \begin{bmatrix} 6 \\ 15 \end{bmatrix} \] And in compact matrix notation:
\[ \mathbf{A}\mathbf{w} = \mathbf{b} \]
And in R code, we can set up such a system as:
#tribble() comes from the tibble package (part of the tidyverse)
A=tribble(
~x, ~y,
2, 3,
4, 9
)
print(A)
## # A tibble: 2 x 2
## x y
## <dbl> <dbl>
## 1 2 3
## 2 4 9
b=matrix(c(6,15), ncol=1)
print(b)
## [,1]
## [1,] 6
## [2,] 15
And we can solve for \(x\) and \(y\) using the solve() function in R:
solve(A,b)
## [,1]
## x 1.5
## y 1.0
Handy!
How did it do that? A bit of detail here. In this case, if \(\mathbf{A}\) is square (equal number of rows and columns) and all of the rows are independent (i.e., no two equations are the same or multiples of each other), we can solve this system by using the inverse of the \(\mathbf{A}\) matrix, \(\mathbf{A}^{-1}\):
solve(A)
## [,1] [,2]
## x 1.500 -0.500
## y -0.667 0.333
More specifically,
\[ \mathbf{w} = \mathbf{A^{-1}b} \]
solve(A) %*% b
## [,1]
## x 1.5
## y 1.0
We will largely avoid the details of inverting matrices in this class. However, the important conceptual point is that the inverse of a square matrix is another matrix of the same size that when multiplied with the original matrix yields an identity matrix (i.e., 1s on the diagonal, 0s elsewhere).
\[\mathbf{A}^{-1}\mathbf{A}=\mathbf{I}\]
For example, if \(\mathbf{A}\) is \(n \times n\) and \(n = 5\), then multiplying \(A\) by its inverse \(A^{-1}\) yields
\[\mathbf{I} = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \\ \end{bmatrix} \]
Notice how this is akin to the idea that any number multiplied by its inverse equals one:
\[x \cdot \frac{1}{x} = 1\]
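We can verify this with the coefficient matrix from above (a quick sketch; as.matrix() is used because A was created as a tibble and %*% requires a plain matrix):
Ainv <- solve(as.matrix(A))       #inverse of the 2 x 2 coefficient matrix
round(Ainv %*% as.matrix(A), 10)  #the 2 x 2 identity matrix (up to floating-point error)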
The details are beyond the scope of the class, but the OLS estimates of a multiple regression equation can be derived analytically:
\[\mathbf{\hat{b}}=(\mathbf{X'X})^{-1}\mathbf{X'y}\] where \(\mathbf{\hat{b}}\) is a vector of parameters, one for each predictor (i.e., column in the design matrix), \(\mathbf{X}\) is the design matrix consisting of rows of observations on the predictor variables, augmented with a column of 1s (the intercept; mean structure), and \(\mathbf{y}\) is a column vector of values for the criterion variable:
\[ \mathbf{X} = \begin{bmatrix} 1 & x_{11} & x_{12} \\ 1 & x_{21} & x_{22} \\ 1 & x_{31} & x_{32} \end{bmatrix}, \enspace \mathbf{y}=\begin{bmatrix} y_{1} \\ y_{2} \\ y_{3} \end{bmatrix}, \enspace \mathbf{b} = \begin{bmatrix} b_{0} \\ b_{1} \\ b_{2} \end{bmatrix} \]
The important point is that we can estimate a set of unknown parameters, \(\mathbf{b}\), according to a formal understanding of how values of \(\mathbf{X}\) are optimally combined with the estimated parameters \(\mathbf{b}\) to predict the criterion \(\mathbf{y}\). For additional details, see here: https://onlinecourses.science.psu.edu/stat501/node/382
The same idea, albeit in larger form, underlies structural equation modeling, where we have several matrices and free parameters in different parts of the model (e.g., structural versus measurement). The goal is to solve for all of these different parameters at once, often using maximum likelihood estimation.
Remember this analysis from last week? We were interested in the association of median home value with log_2(crime) and nitric oxide concentration.
data(BostonHousing2)
BostonSmall <- BostonHousing2 %>% dplyr::select(
cmedv, #median value of home in 1000s
crim, #per capita crime by town
nox, #nitric oxide concentration
lstat #proportion of lower status
)
BostonSmall$log_crim <- log2(BostonSmall$crim)
m <- lm(cmedv ~ log_crim + nox, BostonSmall)
summary(m)
##
## Call:
## lm(formula = cmedv ~ log_crim + nox, data = BostonSmall)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.97 -4.92 -2.27 2.66 32.96
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.470 3.004 9.81 < 2e-16 ***
## log_crim -0.925 0.188 -4.91 1.2e-06 ***
## nox -14.391 5.070 -2.84 0.0047 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.12 on 503 degrees of freedom
## Multiple R-squared: 0.222, Adjusted R-squared: 0.219
## F-statistic: 71.6 on 2 and 503 DF, p-value: <2e-16
Let’s see how this is solved using matrix algebra.
#add a column of 1s as the first 'predictor' to model mean structure (i.e., the intercept)
X = cbind(intercept=1, as.matrix(select(BostonSmall, log_crim, nox)))
y = as.matrix(select(BostonSmall, cmedv))
#(X'X)-1 * X' y
bvals <- solve(t(X) %*% X) %*% t(X) %*% y
print(bvals)
## cmedv
## intercept 29.470
## log_crim -0.925
## nox -14.391
In short, we’ve identified optimal (OLS) parameters by solving the multiple regression equation for log_2(crime) and nox.
This is heavier math, but in inferential statistics we are often (if not always) interested both in making an inference about a quantity or parameter and in quantifying our uncertainty about that estimate. For example, if I say that, looking over data from the past two years, the probability of dying while cliff diving is approximately 5%, but that it could be as low as 1% or as high as 90%, what would you do? I would certainly want a more precise estimate! If I knew that the margin of error in the estimate was only 1%, I might be more likely to try it.
The analogous concept in regression is the standard error, which reflects how much a parameter estimate would vary if we re-estimated it in repeated independent samples of the same size drawn from the same population. Simply put, the standard error is the standard deviation of this sampling distribution and thus quantifies the uncertainty in our parameter estimate. Our parameter is an estimate of the unknown population parameter – thus, we have sampling error in the estimate, which the standard error quantifies. If we are estimating the mean of a set of scores based on a sample of size \(N\), the standard error of the mean estimate is:
\[ SE_{M} = \frac{s}{\sqrt{N}} \]
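A quick illustration in R (a sketch with simulated scores, not the housing data):
set.seed(11)
scores <- rnorm(100, mean=50, sd=10)  #a sample of N = 100 scores
sd(scores)/sqrt(length(scores))       #standard error of the mean, roughly 1 here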
In the context of regression, standard errors of the parameter estimates, \(\mathbf{\hat{b}}\), are based on the covariance matrix of the parameters (not the data). The covariance matrix of the parameters quantifies how much uncertainty there is about each parameter (i.e., the variances along the diagonal), as well as the joint uncertainty about bivariate combinations of parameters (i.e., the covariances off the diagonal). This matrix is defined as:
\[ \mathbf{S_{\hat{b}}} = \sigma^{2}(\mathbf{X'X})^{-1} \] But what is \(\sigma^{2}\)? It is the variance of the residuals. Returning to the regression equation:
\[ Y = B_0 + B_1 X_1 + B_2 X_2 + B_3 X_3 + \varepsilon \]
one part I haven’t mentioned yet is that the residuals \(\varepsilon\) are assumed to be independent and identically distributed (i.i.d.) according to a normal distribution:
\[\varepsilon \sim \mathcal{N}(0, \sigma^{2}\mathbf{I})\]
That is, the residuals should have a mean of 0 and variance \(\sigma^{2}\). The \(\mathbf{I}\) is not usually written out, but it formalizes the assumption that the errors are uncorrelated across observations and share the common variance \(\sigma^2\). Thus, for a vector of responses \(\mathbf{y} = \{y_1, y_2, y_3\}\), the error covariance matrix is diagonal, like so:
\[ \Sigma = \begin{bmatrix} \sigma^2 & 0 & 0 \\ 0 & \sigma^2 & 0 \\ 0 & 0 & \sigma^2 \end{bmatrix} \]
But how do we estimate \(\sigma^2\)? It’s based on the magnitude of the residuals:
\[ \hat{\sigma}^2 = \frac{ \sum_{i=1}^{N}(y_i - \hat{y}_i)^2 } {N - p - 1} \]
Note that \(\sigma^2\) is also known as mean squared error (MSE), which relates back to the idea of partitioning the total sum of squares into the parts accounted for by the model versus residual, unaccounted for variance:
\[ \begin{align} SS_\textrm{Total} &= \sum_{i=1}^{N}({y_i - \bar{y}})^2 \\ SS_\textrm{Model} &= \sum_{i=1}^{N}({\hat{y}_i - \bar{y}})^2 \\ SS_\textrm{Error} &= \sum_{i=1}^{N}({y_i - \hat{y}_i})^2 = SS_\textrm{Total} - SS_\textrm{Model} \end{align} \]
\[ \begin{align} df_\textrm{Error} &= N - p - 1 \\ MS_\textrm{Error} &= SS_\textrm{Error}/df_\textrm{Error} \end{align} \]
Where the predicted values \(\hat{\mathbf{y}}\) are given by the product of the design matrix, \(\mathbf{X}\) with the parameter estimates \(\mathbf{\hat{b}}\):
\[\mathbf{\hat{y}} = \mathbf{X\hat{b}}\]
And if we want to estimate this in R:
yhat <- X %*% bvals
N = nrow(X) #number of observations
p = ncol(X) #number of predictors in the design matrix (includes intercept term)
dfError <- N - p #note that because intercept is already a column in the design matrix, the extra - 1 is not required
sigmaSq <- sum((y - yhat)^2)/dfError # estimate of sigma-squared/MSerror
#standard errors are defined by the standard deviation of the parameter estimates
#note that S_bhat already adjusts for the df_error since this is in the denominator of sigmaSq
S_bhat <- sigmaSq * solve(t(X) %*% X)
se_bhat <- sqrt(diag(S_bhat))
parmat <- cbind(bvals, se_bhat, bvals/se_bhat)
colnames(parmat) <- c("b", "se", "t")
print(parmat) #matches summary(lm)
## b se t
## intercept 29.470 3.004 9.81
## log_crim -0.925 0.188 -4.91
## nox -14.391 5.070 -2.84
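As a further check (a sketch reusing the X, y, yhat, and sigmaSq objects from above), we can recover the sums-of-squares quantities and match them against summary(m):
ssTotal <- sum((y - mean(y))^2)  #total variation in cmedv
ssError <- sum((y - yhat)^2)     #residual (unexplained) variation
ssModel <- ssTotal - ssError     #variation captured by the model
ssModel/ssTotal                  #R-squared, matching the Multiple R-squared from summary(m)
sqrt(sigmaSq)                    #residual standard error, also matching summary(m)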
Additional details on standard error computation here
And for comparison, the output of summary(m):
summary(m)
##
## Call:
## lm(formula = cmedv ~ log_crim + nox, data = BostonSmall)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.97 -4.92 -2.27 2.66 32.96
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.470 3.004 9.81 < 2e-16 ***
## log_crim -0.925 0.188 -4.91 1.2e-06 ***
## nox -14.391 5.070 -2.84 0.0047 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.12 on 503 degrees of freedom
## Multiple R-squared: 0.222, Adjusted R-squared: 0.219
## F-statistic: 71.6 on 2 and 503 DF, p-value: <2e-16
In SEM, we typically need the covariance matrix to be positive definite. Although the math behind this is more complex than we’ll cover in class, there are several important properties of positive definite matrices that we can verify in our data prior to (or as part of) SEM estimation.
You may have heard about eigenvalues in the context of factor analysis (and you will again in this class), or in matrix algebra, or not at all. Eigendecomposition is a hefty topic that underpins principal components analysis, factor analysis, and other techniques that seek to describe the latent structure of relationships among variables (including SEM). Conceptually, the idea is that any matrix \(\underset{n \times p}{\mathbf{X}}\) can be decomposed into a set of \(p\) unrelated (orthogonal) latent directions that best span the multivariate space of the data.
More simply, consider our housing data:
ggplot(BostonSmall) + aes(x=log_crim, y=nox) + geom_point() + theme_bw(base_size=20) + stat_smooth(method="lm", se=FALSE)
## `geom_smooth()` using formula 'y ~ x'
There is clearly some linear relationship between log_2(crime) and nitric oxide concentration, though it is not perfect. The goal of eigen decomposition in this circumstance would be to identify a new variable that is a linear combination of log_2(crime) and nox that explains their shared variance:
\[d_i= w_1 \textrm{log}_2(\textrm{crime})_i + w_2 \textrm{nox}_i\] Thus, the idea is that the new, estimated variable \(d_i\) is a weighted combination of the observed variables. This is our first encounter with a latent variable in that \(d_i\) is a new ‘direction’ through the data that maximizes the shared variance between the two observed variables.
One can think of eigendecomposition proceeding iteratively (though it is simultaneous in the maths) in order to explain all observed covariance among the variables in a data matrix \(X\). This is conceptually close to Type I sum of squares in regression/ANOVA if that rings a bell. More specifically, imagine that we extract the dominant direction through the data that describes the relationship between log_2(crime) and nitric oxide concentration. In this two-variable case, the dominant latent direction \(\mathbf{b}\) closely matches the best-fitting regression line (here almost exactly, because log_2(crime) has far more variance than nox). This holds because the OLS criterion requires that the parameter estimate summarizing the bivariate relationship explains the maximal variance in \(\mathbf{y}\) attributable to a predictor \(\mathbf{x}_1\).
pcademo <- select(BostonSmall, nox, log_crim)
#compute a principal components analysis (an eigendecomposition of the covariance matrix; scale. = FALSE leaves the variables unstandardized)
mm <- prcomp(pcademo, scale. = FALSE)
#add scores on the first principal component (the dominant direction through the bivariate data)
augment <- pcademo %>% mutate(ev1=mm$x[,1])
ggplot(augment) + aes(x=log_crim, y=nox, color=ev1) + geom_point() + theme_bw(base_size=20)
Related: note the 1.0 correlation between the latent direction \(\mathbf{b}\) and the fitted values of the regression model describing the bivariate relationship:
bvmodel <- lm(nox ~ log_crim, BostonSmall)
cor(fitted(bvmodel), augment$ev1)
## [1] 1
Now consider the leftover variance (the residual \(\mathbf{y} - \mathbf{\hat{y}}\)). The goal of eigendecomposition is to identify any residual structure in those data (i.e., what is the latent direction that explains the remaining covariance?). Thus, the second dominant direction through these bivariate data looks for a combination of the observed variables log_2(crime) and nox that explains the most remaining variance in the relationship.
Note that the second latent direction is synonymous with the magnitude of the residual (the color gradient from left to right):
augment <- augment %>% mutate(ev2=mm$x[,2]) %>% add_predictions(bvmodel) %>% add_residuals(bvmodel)
ggplot(augment) + aes(x=resid, y=pred, color=ev2) + geom_point()
with(augment,cor(resid, ev2))
## [1] -1
Don’t worry about the sign of the correlation – ‘directions’ in this latent space have a start and end (i.e., an arrow pointing one direction), but it is the absolute value that matters.
Importantly, these latent directions are orthogonal to each other. Think of this as taking the residuals of one regression model as the input to the next model. Each one tries to describe leftover variation.
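We can confirm this in R: the component scores produced by prcomp() are uncorrelated with each other (a quick check on the mm object from above):
round(cor(mm$x), 8)  #correlations among component scores: 1s on the diagonal, 0s elsewhere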
Let’s anchor our bivariate regression insights onto the nomenclature of eigendecomposition. A latent direction through the data matrix \(\mathbf{X}\) (\(n \times p\)) is called an eigenvector, which is defined as a weighted sum of the observed data:
\[\nu_{i} = w_{i,1} x_1 + w_{i,2} x_2 + \ldots + w_{i,p} x_p\]
The similarity to a regression equation is notable in that we’ve generated a new criterion of sorts \(\nu_i\) that is a combination of the variables in a matrix \(\mathbf{X}\) that maximizes shared variance. Importantly, though, the order of the eigenvectors matters. The first gets to ‘chew on’ all observed covariation among variables in \(\mathbf{X}\), the second gets its leftovers, the third gets the leftovers of the second, and so on. Thus, the weights for the ith eigenvector are estimated to optimally explain whatever variation remains in the matrix at the ith iteration.
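Concretely, prcomp() stores these weights in its rotation matrix, and the component scores are just the mean-centered data multiplied by those weights; a brief sketch using the mm object from above:
Xc <- scale(as.matrix(pcademo), center=TRUE, scale=FALSE)  #mean-center nox and log_crim
scores <- Xc %*% mm$rotation                               #weighted sums of the observed variables
all.equal(scores, mm$x, check.attributes=FALSE)            #TRUE: identical to prcomp's scores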
Each eigenvector has a corresponding eigenvalue that summarizes how strongly the data coalesce around that direction. Said differently, larger eigenvalues indicate that more shared variability is explained by that eigenvector. Importantly, we need to consider the total variability in the covariance matrix – that is, how much ‘stuff’ is there to explain? This is given by the sum of the variances, which lie along the diagonal of the covariance matrix. Recall: this is the trace of the matrix.
Here’s the covariance matrix of the five-variable housing data:
cmat <- cov(BostonSmall)
print(cmat)
## cmedv crim nox lstat log_crim
## cmedv 84.312 -30.77 -0.4568 -48.577 -13.098
## crim -30.770 73.99 0.4196 27.986 17.882
## nox -0.457 0.42 0.0134 0.489 0.285
## lstat -48.577 27.99 0.4889 50.995 13.957
## log_crim -13.098 17.88 0.2850 13.957 9.729
And thus, the total variation of the covariance matrix is the sum of the variances along the diagonal:
sum(diag(cmat))
## [1] 219
Indeed, there is a mathematical relationship between variance explained and eigenvalues. In an \(n \times n\) matrix, there will always be exactly \(n\) eigenvalues and \(n\) eigenvectors. The eigenvectors jointly explain all variation in the matrix, by definition. You can think of this as a zero degree of freedom solution to the data matrix – it ‘explains’ things perfectly. Thus, the total variance of the matrix is equal to the sum of the eigenvalues \(\mathbf{\lambda}\).
\[ \sum_{i=1}^{n} \lambda_i = tr(\mathbf{A}) \]
Let’s verify this in our covariance matrix. The sum of the eigenvalues is:
eigendecomp <- eigen(cmat) #eigen computes the eigendecomposition of a matrix, here the covariance matrix
sum(eigendecomp$values)
## [1] 219
Therefore, the proportion of variance explained by the ith eigenvalue is simply its value normalized by the sum of all eigenvalues:
\[ r^2_{\lambda_i} = \frac{\lambda_i}{\sum_{k=1}^{n} \lambda_k} \]
round(eigendecomp$values/sum(eigendecomp$values), 3)
## [1] 0.677 0.232 0.074 0.017 0.000
Or the cumulative variance explained:
round(cumsum(eigendecomp$values)/sum(eigendecomp$values), 3)
## [1] 0.677 0.909 0.983 1.000 1.000
One small gotcha: note that the prcomp() function in R, which runs PCA (eigendecomposition) on a data matrix, returns the square roots of the eigenvalues (i.e., the standard deviations of the components), so you have to square them to get the variance explained right:
prcompdemo <- prcomp(BostonSmall)
round(prcompdemo$sdev^2/sum(prcompdemo$sdev^2), 3)
## [1] 0.677 0.232 0.074 0.017 0.000
Above, we stated that a positive definite matrix must have eigenvalues that are all positive. Why is that? Let’s start by noticing that in our eigendecomposition of the Boston housing data, we considered five variables: cmedv, crim, nox, lstat, log_crim. But notice how the fifth eigenvector explained essentially no variance:
round(eigendecomp$values/sum(eigendecomp$values), 3)
## [1] 0.677 0.232 0.074 0.017 0.000
Technically, the fifth eigenvalue is positive, but it is quite small relative to the total variance:
eigendecomp$values[5]
## [1] 0.00457
What does this mean? We can explain virtually all of the covariation in the data using a smaller number of latent directions in the data. This relates back to our discussion of minimizing the difference between the model-predicted covariance matrix \(\mathbf{\Sigma}\) and the observed covariance \(\mathbf{S}\). If we think of the covariance matrix in a form akin to multiple regression, one can think that we are trying to explain each variable in the matrix as a function of itself (i.e., variance) and other variables (i.e., covariances). In a way, then, if we have 10 variables in a covariance matrix, then we have 10 equations and 10 unknowns (a bit of hand-waving here to avoid technical distractions).
However, what if we can explain all of the data (100% of the variance) with only 9 equations? This means that we have a rank-deficient matrix that is not positive definite. Said differently, this means there is not enough data to solve for each variable uniquely (i.e., the model is not identified, as we will discuss later). Thinking in regression terms, it could be a situation in which there are an infinite number of solutions for \(B_1\) and \(B_2\) that give the same fit to the criterion. In linear algebra, this is also called linear dependence and is the extreme end of multicollinearity.
In a covariance matrix, if one variable is a perfect linear combination of other variables (one or more), then it is redundant – one could reproduce the entire covariance matrix knowing nothing about the redundant variable. The simplest case is when one variable is just a multiple of another, such as \(Y = 2X\). Consider how this system of equations cannot be solved (more unknowns than equations):
\[ \begin{align} a &= 2x + y + 3 \\ b &= 9x + 5y - 1 \end{align} \]
whereas this system has an infinite number of solutions: \[ \begin{align} x + y + z &= 1 \\ x + y + 2z &= 3 \end{align} \]
Because \(x = -1 -y\), then \(x\) and \(y\) cannot be uniquely solved. This system is called underdetermined, meaning that there are fewer unique equations than unknowns. More here.
Here, we have something close: crim and log_crim are both in the dataset. Because the log transformation is nonlinear, however, they are not perfectly linearly dependent, and thus not completely redundant.
But let’s mess things up just for fun by adding a redundant variable:
BostonSmall$blowup <- 2*BostonSmall$crim
eigen(cov(BostonSmall))
## eigen() decomposition
## $values
## [1] 4.05e+02 8.92e+01 1.65e+01 3.89e+00 4.57e-03 -2.47e-14
##
## $vectors
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0.23743 0.79497 -0.55586 -0.0517 0.000216 0.00e+00
## [2,] -0.42191 0.14216 0.02618 -0.0331 0.000544 -8.94e-01
## [3,] -0.00277 -0.00374 -0.00882 0.0288 0.999536 6.31e-12
## [4,] -0.20346 -0.51304 -0.79819 -0.2414 -0.002575 -1.86e-14
## [5,] -0.11037 -0.06123 -0.22453 0.9658 -0.030317 -1.91e-13
## [6,] -0.84382 0.28433 0.05236 -0.0662 0.001089 4.47e-01
Bless R, it tried hard. But you’ll see that the 6th eigenvalue is negative. The covariance matrix is now not positive definite (i.e., it is singular) – this is bad and will typically blow up any SEM program. Plus, such a matrix is not interesting in principle.
Take home: Always examine your covariance matrix. And in general, also look at the eigenvalues to ensure that your matrix is positive definite.
There’s also a handy function, is.positive.definite(), in the matrixcalc package for checking the positive definiteness of a matrix:
is.positive.definite(cov(BostonSmall))
## [1] FALSE
We said above that a covariance matrix must be invertible for SEM, which relates to having positive eigenvalues and a nonzero determinant. The determinant of a square matrix is the product of its \(n\) eigenvalues:
\[ \prod_{i=1}^{n} \lambda_i = \lambda_1 \cdot \lambda_2 \cdot \ldots \lambda_n \]
Thus, eigenvalues of zero lead to a zero determinant, negative eigenvalues may lead to a negative determinant, and so on. Sometimes, an eigenvalue is very close to zero (e.g., \(10^{-8}\)), leading to difficulties inverting the matrix, sometimes called a ‘computationally singular’ matrix (i.e., the computer can’t deal with such a tiny eigenvalue).
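A quick sketch checking both points with the objects from above: the determinant of the well-behaved five-variable covariance matrix equals the product of its eigenvalues, while the six-variable matrix containing the redundant blowup column cannot be inverted (try() just catches the error so the code keeps running):
det(cmat)                     #determinant of the 5-variable covariance matrix
prod(eigendecomp$values)      #product of its eigenvalues: the same value (up to rounding)
try(solve(cov(BostonSmall)))  #with blowup included, inversion fails: the matrix is singular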
The eigenvalue summarizes the strength of a latent direction in the data. Each eigenvalue has a corresponding eigenvector, which quantifies the direction – that is, the weighted linear combination of observed variables. Thus, larger eigenvalues indicate latent components (eigenvectors) that explain more of the total variability (covariation).
Note that if you compute an eigendecomposition on a correlation matrix, all variances are 1.0 (the diagonal). Thus, eigenvalues are expressed in terms of the number of variables explained. For example, an eigenvalue of 3 in a correlation matrix of 10 variables indicates that the corresponding eigenvector accounts for 3 variables’ worth of covariation. This is where Kaiser’s rule in factor analysis comes from (as we’ll talk about in the coming weeks).
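For example, converting the five-variable housing covariance matrix to a correlation matrix with cov2cor() (a small sketch):
ecor <- eigen(cov2cor(cmat))  #eigendecomposition of the correlation matrix
sum(ecor$values)              #5: the eigenvalues sum to the number of variables
round(ecor$values, 3)         #each value is in 'variables explained' units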
If all variables are standardized (z-scored, unit-normalized, whatever you want to call it) and are perfectly uncorrelated (orthogonal), then each eigenvalue will be 1.0. Thus, if there are 10 uncorrelated variables, there are 10 completely unique directions in the multivariate space.
Near zero eigenvalues indicate that one or more variables in the covariance matrix can be explained by other variables.
Eigenvectors specify the combination of observed variables that form the direction. Thus, they provide information about what variables are associated to form, potentially, a meaningful latent variable.
Additional details on PCA and eigendecomposition here.