# 1 A bit of math housekeeping

• The use of a ‘hat’ over a variable denotes an estimator of a quantity. So, $$\hat{Y}$$ indicates the predicted values of $$Y$$ from a statistical model.

• Sample means are denoted $$\bar{x}$$.

• Sample variances are $$s^2_x$$ and standard deviations are $$s_x$$.

• Matrices are denoted by boldface capital letters, $$\mathbf{X}$$.

• Vectors are denoted by boldface lowercase letters, $$\mathbf{x}$$.

• The transpose of a matrix flips it along its main diagonal (rows become columns and vice versa). It is denoted with the prime operator, $$\mathbf{X}'$$, and computed with the t() function in R.

• Matrices are composed of rows and columns (we won’t deal with arrays of 3+ dimensions). Rows are the first dimension, columns are the second. So, a matrix composed of $$n$$ observations and $$k$$ variables is denoted:

$\underset{n \times k}{\mathbf{X}}$

In R, selecting the element in row $$i$$ and column $$j$$ of matrix X is written `X[i,j]`.
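As a quick illustration (using a small made-up matrix, not one of the datasets in this document):

```r
#a 3 x 2 matrix; matrix() fills column-wise by default
X <- matrix(1:6, nrow = 3, ncol = 2)
X
#      [,1] [,2]
# [1,]    1    4
# [2,]    2    5
# [3,]    3    6

X[2, 1] #element in row 2, column 1: 2
dim(X)  #dimensions, rows first: 3 2
```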

• A row vector is a collection of numbers having one row and $$m$$ columns (i.e., $$1 \times m$$).

$\begin{equation} \mathbf{x} = \begin{bmatrix} x_1 & x_2 & \cdots & x_m \end{bmatrix} \end{equation}$

• A column vector is a collection of numbers having many rows and one column (i.e., $$m \times 1$$).

$\begin{equation} \mathbf{x} =\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_m \end{bmatrix} \end{equation}$

• Thus, the transpose of a column vector is a row vector, and vice versa:

$\begin{equation} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_m \end{bmatrix}' = \begin{bmatrix} x_1 & x_2 & \cdots & x_m \end{bmatrix} \end{equation}$
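We can verify this in R with t() and a toy column vector:

```r
x <- matrix(1:3, ncol = 1) #a 3 x 1 column vector
t(x)                       #its transpose is a 1 x 3 row vector
dim(t(x))                  #1 3
all(t(t(x)) == x)          #transposing twice recovers the original: TRUE
```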

• Random variables (i.e., variables whose values depend on uncertain processes) are denoted by capital letters (e.g., $$X$$). Most psychometric data are random variables (e.g., height, weight, IQ, depression severity). Realizations of random variables (i.e., draws from the processes that generate the variable) are denoted by lowercase letters (e.g., $$x$$).
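For instance, we can draw realizations $$x$$ of a random variable $$X$$ in R (here assuming, purely for illustration, that $$X$$ is normally distributed like an IQ score):

```r
set.seed(42) #for reproducibility
#five realizations of X ~ N(100, 15^2)
x <- rnorm(5, mean = 100, sd = 15)
x
```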

I will probably screw these up along the way, but I wanted to mention them up front so that we are on the same page!

# 2 Overview of covariance, correlation, and regression

We’ll use housing price data from Boston’s 1970 census to review important concepts in correlation and regression. This is a nice dataset for regression because there are many interdependent variables: crime, pollutants, age of properties, etc.

```r
#example dataset from mlbench package with home prices in Boston by census tract
library(mlbench) #provides the BostonHousing2 dataset
library(dplyr)   #provides select() and %>%
library(lattice) #provides splom()

data(BostonHousing2)
BostonSmall <- BostonHousing2 %>% dplyr::select(
  cmedv, #median value of home in 1000s
  crim,  #per capita crime by town
  nox,   #nitric oxide concentration
  lstat  #proportion of lower status
)

n <- nrow(BostonSmall) #number of observations
k <- ncol(BostonSmall) #number of variables

#scatterplot matrix of variables
splom(BostonSmall)
```
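Before working through the formulas, it can help to see the pairwise correlations among these variables numerically. A minimal sketch, assuming the same four variables selected above:

```r
library(mlbench) #provides the BostonHousing2 dataset
data(BostonHousing2)

#same four variables as above, selected with base R indexing
BostonSmall <- BostonHousing2[, c("cmedv", "crim", "nox", "lstat")]

#pairwise Pearson correlations, rounded for readability
round(cor(BostonSmall), 2)
```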