Adapted from materials developed by Daniel Albohn, Kayla Brown, & Yiming Qian (https://psu-psychology.github.io/r-bootcamp-2019/modules.html#basic_data_analyses)

Load and preprocess data

First, load the required packages for this walkthrough.

pkg_list <- c("tidyverse", "psych", "rcompanion", "knitr", "car", "afex", "ez",
              "ggfortify", "Hmisc", "emmeans", "jtools", "apaTables", "dplyr", "skimr")
# purrr::walk(pkg_list, require, quietly = TRUE, character.only = TRUE)
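# pacman installs (if needed) and loads packages in one call; install pacman itself first:
# install.packages("pacman")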
pacman::p_load(pkg_list, character.only = TRUE)

The data set used in this walkthrough is bundled with the psych package. You can examine what each of the columns represents by looking at the data set’s help page with ?sat.act.

Next, just to get a snapshot of the data, we can use head and str. Notice that sex and education are integer columns despite being categorical variables.

data("sat.act", package="psych")
sat.act <- sat.act %>%
  dplyr::rename(sex = gender) %>%
  mutate(sex = factor(sex, levels = c(1, 2), labels = c("male", "female")))

# Alternatively...
# source("R/load_sat_act.R")
head(sat.act)
##          sex education age ACT SATV SATQ
## 29442 female         3  19  24  500  500
## 29457 female         3  23  35  600  500
## 29498 female         3  20  21  480  470
## 29503   male         4  27  26  550  520
## 29504   male         2  33  31  600  550
## 29518   male         5  26  28  640  640
str(sat.act)
## 'data.frame':    700 obs. of  6 variables:
##  $ sex      : Factor w/ 2 levels "male","female": 2 2 2 1 1 1 2 1 2 2 ...
##  $ education: int  3 3 3 4 2 5 5 3 4 5 ...
##  $ age      : int  19 23 20 27 33 26 30 19 23 40 ...
##  $ ACT      : int  24 35 21 26 31 28 36 22 22 35 ...
##  $ SATV     : int  500 600 480 550 600 640 610 520 400 730 ...
##  $ SATQ     : int  500 500 470 520 550 640 500 560 600 800 ...

Preprocessing integers into factors

Since some of the analyses we will be running require categorical variables, we first need to preprocess some of the integer columns into categories/factors.

labels <- c("none", "some_hs", "high_school",
            "some_college", "college", "graduate")

sat.act <- sat.act %>%
  mutate(.,
         education = factor(education, levels = (0:5), labels = labels)
         # education = case_when(education == 0 ~ "none",
         #                       education == 1 ~ "some_hs",
         #                       education == 2 ~ "high_school",
         #                       education == 3 ~ "some_college",
         #                       education == 4 ~ "college",
         #                       education == 5 ~ "graduate")
  )

Describe data

Before analysis, we can obtain descriptive statistics quickly by utilizing the describe() function in the psych package.

psych::describe(sat.act) %>% as.data.frame() %>%
  dplyr::select(-vars, -trimmed, -mad, -range, -se)
##              n   mean      sd median min max   skew kurtosis
## sex*       700   1.65   0.478      2   1   2 -0.615  -1.6247
## education* 700   4.16   1.425      4   1   6 -0.681  -0.0749
## age        700  25.59   9.499     22  13  65  1.643   2.4243
## ACT        700  28.55   4.824     29   3  36 -0.656   0.5350
## SATV       700 612.23 112.903    620 200 800 -0.644   0.3252
## SATQ       687 610.22 115.639    620 200 800 -0.593  -0.0178

The describe() function also allows us to get descriptive statistics by a grouping variable using the partner function describeBy(). If we wanted to get descriptive statistics by sex, we simply pass that column name to the group argument.

describeBy(sat.act, group = c("sex"), mat = TRUE) %>%
  dplyr::select(-vars, -item, -trimmed, -mad, -range, -se)
##             group1   n   mean     sd median min max   skew kurtosis
## sex*1         male 247   1.00   0.00      1   1   1    NaN      NaN
## sex*2       female 453   2.00   0.00      2   2   2    NaN      NaN
## education*1   male 247   4.00   1.54      4   1   6 -0.542   -0.598
## education*2 female 453   4.26   1.35      4   1   6 -0.737    0.272
## age1          male 247  25.86   9.74     22  14  58  1.428    1.435
## age2        female 453  25.45   9.37     22  13  65  1.766    3.029
## ACT1          male 247  28.79   5.06     30   3  36 -1.057    1.888
## ACT2        female 453  28.42   4.69     29  15  36 -0.392   -0.416
## SATV1         male 247 615.11 114.16    630 200 800 -0.630    0.130
## SATV2       female 453 610.66 112.31    620 200 800 -0.651    0.423
## SATQ1         male 245 635.87 116.02    660 300 800 -0.722   -0.122
## SATQ2       female 442 596.00 113.07    600 200 800 -0.577    0.129

While describe is useful for quick glimpses at the data, it does not always play nicely with the tidyverse. If you wanted to stick with a more “pure” tidyverse approach to descriptive statistics, you can use skimr.

sat.act %>% 
  skimr::skim(.)
Data summary
Name Piped data
Number of rows 700
Number of columns 6
_______________________
Column type frequency:
factor 2
numeric 4
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
sex 0 1 FALSE 2 fem: 453, mal: 247
education 0 1 FALSE 6 som: 275, gra: 141, col: 138, non: 57

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
age 0 1.00 25.6 9.50 13 19 22 29 65 ▇▃▂▁▁
ACT 0 1.00 28.6 4.82 3 25 29 32 36 ▁▁▂▇▇
SATV 0 1.00 612.2 112.90 200 550 620 700 800 ▁▁▅▇▆
SATQ 13 0.98 610.2 115.64 200 530 620 700 800 ▁▁▅▇▆

While skim() doesn’t provide as much descriptive detail as describe, it does provide the basics, and a useful visual representation of the data.

Similarly, for obtaining descriptives by grouping variables you can utilize the group_by() function.

Here we provide descriptives separately by sex:

sat.act %>% 
  group_by(., sex) %>% 
  skimr::skim(.)
Data summary
Name Piped data
Number of rows 700
Number of columns 6
_______________________
Column type frequency:
factor 1
numeric 4
________________________
Group variables sex

Variable type: factor

skim_variable sex n_missing complete_rate ordered n_unique top_counts
education male 0 1 FALSE 6 som: 80, col: 51, gra: 46, non: 27
education female 0 1 FALSE 6 som: 195, gra: 95, col: 87, non: 30

Variable type: numeric

skim_variable sex n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
age male 0 1.00 25.9 9.74 14 19 22 30.0 58 ▇▅▂▁▁
age female 0 1.00 25.4 9.37 13 19 22 28.0 65 ▇▃▂▁▁
ACT male 0 1.00 28.8 5.06 3 25 30 32.5 36 ▁▁▂▅▇
ACT female 0 1.00 28.4 4.69 15 25 29 32.0 36 ▁▃▆▇▇
SATV male 0 1.00 615.1 114.16 200 540 630 700.0 800 ▁▁▆▇▇
SATV female 0 1.00 610.7 112.31 200 550 620 700.0 800 ▁▁▃▇▅
SATQ male 2 0.99 635.9 116.02 300 570 660 720.0 800 ▂▃▅▇▇
SATQ female 11 0.98 596.0 113.07 200 500 600 683.0 800 ▁▁▆▇▅

Correlations

Moving beyond categorical variables, we next test the relationship between numeric values using simple correlation. See the supplemental materials for more in-depth explanations.

Correlation is done using the cor() function. Suppose we want to see if the ACT scores increase with age.

# Covariance
# cov(sat.act$age, sat.act$ACT)

# Correlation
cor(sat.act$age, sat.act$ACT)
## [1] 0.111

A small correlation, but no test of significance. To obtain significance values, you must pass your data to cor.test(). Note that this can be done by passing x and y or using the formula method (which will be used for linear models).

# Default method
# cor.test(sat.act$age, sat.act$ACT)

# Formula method
cor.test(~ age + ACT, data = sat.act)
## 
##  Pearson's product-moment correlation
## 
## data:  age and ACT
## t = 3, df = 698, p-value = 0.003
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.0367 0.1831
## sample estimates:
##   cor 
## 0.111

Visualizing Correlations

To visualize this relationship, we can pass the raw data to ggplot and get a simple regression line using stat_smooth() with the lm method. See supplemental materials for an example.
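For instance, a minimal sketch for the age–ACT relationship examined above:

ggplot(sat.act, aes(x = age, y = ACT)) +
  geom_point() +
  stat_smooth(method = "lm")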

Additional Correlation Functions

You can also pass a dataframe of values to the cor function to get a simple correlation matrix (a la SPSS).

cor(sat.act[c("age","ACT","SATV","SATQ")], use = "pairwise.complete.obs")
##          age   ACT    SATV    SATQ
## age   1.0000 0.111 -0.0424 -0.0339
## ACT   0.1105 1.000  0.5611  0.5871
## SATV -0.0424 0.561  1.0000  0.6443
## SATQ -0.0339 0.587  0.6443  1.0000
# Or, using tidyverse verbs...

# sat.act %>% 
#   select(., age, ACT, SATV, SATQ) %>% 
#   drop_na(., SATQ) %>% 
#   cor()

Optionally, for easier-on-the-eyes output, we can use a number of specialized functions.

Hmisc::rcorr(sat.act[c("age","ACT","SATV","SATQ")] %>% as.matrix(.))
##        age  ACT  SATV  SATQ
## age   1.00 0.11 -0.04 -0.03
## ACT   0.11 1.00  0.56  0.59
## SATV -0.04 0.56  1.00  0.64
## SATQ -0.03 0.59  0.64  1.00
## 
## n
##      age ACT SATV SATQ
## age  700 700  700  687
## ACT  700 700  700  687
## SATV 700 700  700  687
## SATQ 687 687  687  687
## 
## P
##      age    ACT    SATV   SATQ  
## age         0.0034 0.2631 0.3744
## ACT  0.0034        0.0000 0.0000
## SATV 0.2631 0.0000        0.0000
## SATQ 0.3744 0.0000 0.0000

Or directly to APA-acceptable tables. You can pass a filename argument to apa.cor.table to save it directly to file.

apaTables::apa.cor.table(sat.act[c("age","ACT","SATV","SATQ")])
## 
## 
## Means, standard deviations, and correlations with confidence intervals
##  
## 
##   Variable M      SD     1           2          3         
##   1. age   25.59  9.50                                    
##                                                           
##   2. ACT   28.55  4.82   .11**                            
##                          [.04, .18]                       
##                                                           
##   3. SATV  612.23 112.90 -.04        .56**                
##                          [-.12, .03] [.51, .61]           
##                                                           
##   4. SATQ  610.22 115.64 -.03        .59**      .64**     
##                          [-.11, .04] [.54, .63] [.60, .69]
##                                                           
## 
## Note. M and SD are used to represent mean and standard deviation, respectively.
## Values in square brackets indicate the 95% confidence interval.
## The confidence interval is a plausible range of population correlations 
## that could have caused the sample correlation (Cumming, 2014).
##  * indicates p < .05. ** indicates p < .01.
## 
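For example, a sketch of a call that writes the same table to a Word-compatible document (hypothetical file name):

apaTables::apa.cor.table(sat.act[c("age","ACT","SATV","SATQ")], filename = "sat_cor_table.doc")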

Linear models

Introduction

The overall goal is to give you an introduction to conducting regression analyses or linear modeling in R.

Single-predictor (simple) regression

Let’s turn to ‘simple’ linear regression (one predictor, one outcome), then scale to multiple regression (many predictors, one outcome). The standard linear regression model is implemented by the lm function in R. The lm function uses ordinary least squares (OLS), which estimates the parameters by minimizing the sum of squared residuals.
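Concretely, with a single predictor, OLS chooses the intercept and slope that minimize the sum of squared residuals: \(\hat{\beta}_0, \hat{\beta}_1 = \underset{b_0,\, b_1}{\arg\min} \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2\).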

Let’s take a look at the simple case of the association between SAT math and verbal scores.

ggplot(sat.act, aes(x=SATV, y=SATQ)) + 
  geom_point(color='darkblue', size = 3) + 
  geom_smooth(method=lm, se=FALSE, fullrange=TRUE, color='black', size=1.2) +
  labs(x="Math Score", y="Verbal Score")
## `geom_smooth()` using formula 'y ~ x'

This relationship looks quite linear: the higher the verbal score, the higher the math score.

In R regression models, we use the ~ operator to denote ‘regressed on’. It’s not especially intuitive, but we say the criterion is regressed on the predictor. Here, if we think verbal is predictive of math, we’d say ‘math regressed on verbal’ or ‘verbal predicts math.’

In formula terms, this is SATQ ~ SATV, which we pass as the first argument to lm().

lm_SAT <- lm(SATQ ~ SATV, data=sat.act)
summary(lm_SAT)
## 
## Call:
## lm(formula = SATQ ~ SATV, data = sat.act)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -302.1  -46.5    2.4   51.3  282.9 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 207.5253    18.5725    11.2   <2e-16 ***
## SATV          0.6576     0.0298    22.1   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 88.5 on 685 degrees of freedom
##   (13 observations deleted due to missingness)
## Multiple R-squared:  0.415,  Adjusted R-squared:  0.414 
## F-statistic:  486 on 1 and 685 DF,  p-value: <2e-16

The output contains the individual parameter estimates of the model (here, just the intercept and slope), their standard errors, t statistics, and two-tailed p-values. We also get global information such as the residual standard error and the coefficient of determination (\(R^2\)).
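If you want to extract these quantities programmatically, base R provides simple accessors (a quick sketch; output omitted):

coef(lm_SAT)     # point estimates for the intercept and slope
confint(lm_SAT)  # normal-theory 95% confidence intervals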

Regression diagnostics

The ggfortify package provides an autoplot function that renders the standard lm diagnostic plots as handy ggplot-based graphs.

autoplot(lm_SAT)

Bootstrap estimates and confidence intervals

Using functionality from the car and boot packages, we can easily get estimates of the regression coefficients and their standard errors using nonparametric bootstrapping, which relaxes the normal-theory assumptions behind the standard errors and, therefore, the significance tests. In particular, the bootstrap does not assume normally distributed errors.

sat.act.na <- na.omit(sat.act)
lm_SAT.na <- lm(SATQ ~ SATV, data=sat.act.na)
system.time(lm_SAT.boot <- Boot(lm_SAT.na, R=2000))
##    user  system elapsed 
##   2.711   0.036   2.785
summary(lm_SAT.boot, high.moments=TRUE)
## 
## Number of bootstrap replications R = 2000 
##             original  bootBias  bootSE bootMed bootSkew bootKurtosis
## (Intercept)  207.525  0.071309 21.4796 207.396   0.0916       -0.164
## SATV           0.658 -0.000261  0.0336   0.659  -0.0698       -0.188

We can use the object to obtain 95% bootstrapped confidence intervals using the ‘bias corrected, accelerated’ method (aka bca).

confint(lm_SAT.boot, level=.95, type="bca")
## Bootstrap bca confidence intervals
## 
##               2.5 % 97.5 %
## (Intercept) 167.486 251.89
## SATV          0.587   0.72

And we can easily compare the bootstrapped and standard OLS models:

hist(lm_SAT.boot, legend="separate")

Multiple regression

We can easily extend to larger regression models by adding terms to the right side of the formula. For example, in the sat.act dataset, we could examine the extent to which math scores (SATQ) are a function of verbal scores (SATV), age (age), and sex (sex, a factor with levels male and female).

asv_model <- lm(SATQ ~ age + sex + SATV, sat.act)
summary(asv_model)
## 
## Call:
## lm(formula = SATQ ~ age + sex + SATV, data = sat.act)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -290.34  -50.32    5.81   51.51  295.57 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 236.6596    21.2572   11.13  < 2e-16 ***
## age          -0.1267     0.3498   -0.36     0.72    
## sexfemale   -36.8580     6.9202   -5.33  1.4e-07 ***
## SATV          0.6541     0.0293   22.33  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 86.8 on 683 degrees of freedom
##   (13 observations deleted due to missingness)
## Multiple R-squared:  0.438,  Adjusted R-squared:  0.436 
## F-statistic:  178 on 3 and 683 DF,  p-value: <2e-16

Getting results into a tidy, useful format

Note that the broom package is very useful for extracting global and specific statistics from many models in R, including regression models. The introductory vignette provides a number of useful examples: https://cran.r-project.org/web/packages/broom/vignettes/broom.html. Here, what if we want to save the global statistics and parameter estimates into data.frame objects?

We can use the glance function to get the global model statistics.

broom::glance(asv_model)
## # A tibble: 1 x 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC deviance df.residual  nobs
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>    <dbl>       <int> <int>
## 1     0.438         0.436  86.8      178. 3.52e-85     3 -4040. 8089. 8112. 5150990.         683   687

And the tidy function yields the parameter table:

broom::tidy(asv_model)
## # A tibble: 4 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)  237.      21.3       11.1   1.43e-26
## 2 age           -0.127    0.350     -0.362 7.17e- 1
## 3 sexfemale    -36.9      6.92      -5.33  1.36e- 7
## 4 SATV           0.654    0.0293    22.3   2.51e-83

Modeling interactions

We can use the * operator in R to request that both the constituent variables and their interaction(s) be entered into the model. For example:

int_model <- lm(SATQ ~ sex*age*SATV, sat.act)
summary(int_model)
## 
## Call:
## lm(formula = SATQ ~ sex * age * SATV, data = sat.act)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -293.41  -50.48    4.05   52.30  291.29 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -23.01000   86.97955   -0.26    0.791    
## sexfemale          216.92703  108.13088    2.01    0.045 *  
## age                  9.54204    3.39075    2.81    0.005 ** 
## SATV                 1.02899    0.13875    7.42  3.6e-13 ***
## sexfemale:age       -8.86490    4.15128   -2.14    0.033 *  
## sexfemale:SATV      -0.33756    0.17252   -1.96    0.051 .  
## age:SATV            -0.01395    0.00546   -2.55    0.011 *  
## sexfemale:age:SATV   0.01155    0.00668    1.73    0.084 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 86.2 on 679 degrees of freedom
##   (13 observations deleted due to missingness)
## Multiple R-squared:  0.45,   Adjusted R-squared:  0.444 
## F-statistic: 79.4 on 7 and 679 DF,  p-value: <2e-16

This model includes the individual effects of sex, age, and SATV, as well as their two-way interactions (e.g., sexfemale:age) and the three-way interaction. This highlights that the asterisk operator * will compute all possible interactions among the specified predictors. For example, a*b*c*d will generate all effects up through and including the a x b x c x d interaction. By contrast, if you wish to specify a given interaction manually/directly, use the colon operator (e.g., a:b). The downside of the colon operator is that it doesn’t guarantee that the corresponding lower-order effects are included, which is usually a sane default position. As a reminder, you should essentially never include an interaction without including the lower-order effects, because this can misassign the variance.
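To see this expansion concretely, here is a small sketch showing that the two specifications fit the same model:

m1 <- lm(SATQ ~ sex * age, data = sat.act)
m2 <- lm(SATQ ~ sex + age + sex:age, data = sat.act)
all.equal(coef(m1), coef(m2))  # both specifications should yield identical coefficients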

What do we see? Males tend to have higher math scores and maintain these scores with age. Women have lower math scores and show a decreasing trend as they get older.

Contrasts in regression

(Some of the code and text here has been adapted from Russell Lenth’s excellent emmeans documentation: https://cran.r-project.org/web/packages/emmeans/)

One of the handiest packages in the R regression universe is emmeans, which can provide the ‘estimated marginal means’ (EMMs), as well as a host of other contrasts and comparisons. In particular, it is very easy to test simple slopes and pairwise differences. Furthermore, the package works with multcomp to handle correction for multiple comparisons. See the longer documentation at the URL above.

Let’s look at how age and verbal scores may interact to predict math scores.

# creating age groups for pairwise comparisons
sat.act$agef[sat.act$age < 25] <- 1
sat.act$agef[sat.act$age >= 25 & sat.act$age <= 50] <- 2
sat.act$agef[sat.act$age > 50] <- 3
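# A more compact, hypothetical alternative using cut() (same bins as above):
# sat.act$agef <- cut(sat.act$age, breaks = c(-Inf, 24, 50, Inf), labels = c("1", "2", "3"))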

# setting as factors
sat.act$agef <- as.factor(sat.act$agef)
sat.act$gender <- as.factor(sat.act$sex)

# running the model
sat.lm <- lm(SATQ ~ agef + SATV, data = sat.act)
summary(sat.lm)
## 
## Call:
## lm(formula = SATQ ~ agef + SATV, data = sat.act)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -301.62  -45.92    2.07   51.81  283.47 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 206.9031    18.8894   10.95   <2e-16 ***
## agef2        -0.0946     7.1350   -0.01     0.99    
## agef3        15.5437    18.9726    0.82     0.41    
## SATV          0.6579     0.0299   22.02   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 88.6 on 683 degrees of freedom
##   (13 observations deleted due to missingness)
## Multiple R-squared:  0.416,  Adjusted R-squared:  0.413 
## F-statistic:  162 on 3 and 683 DF,  p-value: <2e-16

This output is hard to look at because there are many dummy codes and we have to infer the reference condition for each factor (usually the alphabetically first level). Also, we do not have an intuitive sense of the expected means in each condition because they depend on the sum of the intercept and the specific dummy code for the condition of interest, averaging over the other factor.
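To check the reference level and dummy coding R will use for a factor, you can inspect its contrast matrix (a quick sketch; the first level is the reference):

levels(sat.act$agef)
contrasts(sat.act$agef)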

We can obtain the expected means for each condition.

Expected means for age group

sat.emm.s <- emmeans(sat.lm, "agef")
print(sat.emm.s)
##  agef emmean    SE  df lower.CL upper.CL
##  1       610  4.32 683      601      618
##  2       610  5.67 683      598      621
##  3       625 18.47 683      589      662
## 
## Confidence level used: 0.95

Expected means for verbal scores

sat.emm.p <- emmeans(sat.lm, "SATV")
print(sat.emm.p)
##  SATV emmean   SE  df lower.CL upper.CL
##   612    615 4.32 683      606      623
## 
## Results are averaged over the levels of: agef 
## Confidence level used: 0.95

Means in each cell of the factorial design

print(emmeans(sat.lm, ~agef*SATV))
##  agef SATV emmean    SE  df lower.CL upper.CL
##  1     612    610  4.32 683      601      618
##  2     612    610  5.67 683      598      621
##  3     612    625 18.47 683      589      662
## 
## Confidence level used: 0.95

Pairwise comparisons

If we wanted to compare the pairwise differences in the effect of age group on math scores while controlling for verbal scores (and potentially other variables we add to the model), we could use the pairs function:

sat_pairs <- pairs(sat.emm.s)
print(sat_pairs)
##  contrast estimate    SE  df t.ratio p.value
##  1 - 2        0.09  7.13 683  0.013  1.0000 
##  1 - 3      -15.54 18.97 683 -0.819  0.6910 
##  2 - 3      -15.64 19.32 683 -0.809  0.6970 
## 
## P value adjustment: tukey method for comparing a family of 3 estimates

Note that you can get a sense of the contrasts being tested by emmeans by examining the @linfct slot of the object. I’ve learned a lot by examining these contrast matrices and thinking about how to set up a (focal) contrast of interest. Also note that you get p-value adjustment for free (here, Tukey’s HSD method).

Here are the linear functions defining the predicted mean math score for each age group, controlling for verbal scores:

sat.emm.s@linfct
##      (Intercept) agef2 agef3 SATV
## [1,]           1     0     0  612
## [2,]           1     1     0  612
## [3,]           1     0     1  612

And what do the pairwise contrasts among age groups look like?

sat_pairs@linfct
##      (Intercept) agef2 agef3 SATV
## [1,]           0    -1     0    0
## [2,]           0     0    -1    0
## [3,]           0     1    -1    0

Pairwise differences and simple slopes in regression

Consider a model in which we examine the association between SATQ and SATV across age. Here, we regress SATQ on SATV, age group (three levels), and their interaction.

fit_sat <- lm(SATQ ~ SATV*agef, data = sat.act)
summary(fit_sat)
## 
## Call:
## lm(formula = SATQ ~ SATV * agef, data = sat.act)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -302.51  -50.21    2.09   54.61  256.86 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 178.8432    22.1853    8.06  3.4e-15 ***
## SATV          0.7034     0.0354   19.90  < 2e-16 ***
## agef2       109.8710    42.6641    2.58   0.0102 *  
## agef3        17.2861    95.7898    0.18   0.8568    
## SATV:agef2   -0.1804     0.0690   -2.61   0.0091 ** 
## SATV:agef3   -0.0022     0.1547   -0.01   0.9887    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 88.3 on 681 degrees of freedom
##   (13 observations deleted due to missingness)
## Multiple R-squared:  0.422,  Adjusted R-squared:  0.417 
## F-statistic: 99.3 on 5 and 681 DF,  p-value: <2e-16
car::Anova(fit_sat, type="III") #overall effects of predictors in the model
## Anova Table (Type III tests)
## 
## Response: SATQ
##              Sum Sq  Df F value  Pr(>F)    
## (Intercept)  506336   1   64.99 3.4e-15 ***
## SATV        3084321   1  395.85 < 2e-16 ***
## agef          51806   2    3.32   0.037 *  
## SATV:agef     53928   2    3.46   0.032 *  
## Residuals   5306050 681                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Note that this yields a categorical (age group) x continuous (verbal scores) interaction. The output from car::Anova indicates that the interaction is significant, but we need more detailed guidance on how the slope for verbal scores is moderated by age group.

In a simple slopes test, we might wish to know whether the slope for SATV is non-zero in each age group individually. Let’s start by getting the estimated marginal means for each age group.

emmeans(fit_sat, ~agef)
## NOTE: Results may be misleading due to involvement in interactions
##  agef emmean    SE  df lower.CL upper.CL
##  1       610  4.31 681      601      618
##  2       609  5.66 681      598      620
##  3       626 18.43 681      589      662
## 
## Confidence level used: 0.95

And pairwise differences between age groups:

pairs(emmeans(fit_sat, ~agef))
## NOTE: Results may be misleading due to involvement in interactions
##  contrast estimate    SE  df t.ratio p.value
##  1 - 2        0.62  7.11 681  0.087  0.9960 
##  1 - 3      -15.94 18.92 681 -0.842  0.6770 
##  2 - 3      -16.56 19.28 681 -0.859  0.6660 
## 
## P value adjustment: tukey method for comparing a family of 3 estimates

Turning to SATV, because we are interested in its linear effect (slope), we use the emtrends function to estimate the slope in each age group individually. In terms of simple slopes, we test whether the SATV slope is non-zero in each age group. The infer argument in the summary of emtrends requests t-tests and p-values for the slopes.

summary(emtrends(object = fit_sat, ~agef, var="SATV"), infer=TRUE)
##  agef SATV.trend     SE  df lower.CL upper.CL t.ratio p.value
##  1         0.703 0.0354 681    0.634    0.773 19.900  <.0001 
##  2         0.523 0.0593 681    0.407    0.639  8.820  <.0001 
##  3         0.701 0.1506 681    0.406    0.997  4.660  <.0001 
## 
## Confidence level used: 0.95

Finally, we could examine pairwise differences between slopes among age groups.

pairs(emtrends(object = fit_sat, ~agef, var="SATV"))
##  contrast estimate    SE  df t.ratio p.value
##  1 - 2      0.1804 0.069 681  2.614  0.0250 
##  1 - 3      0.0022 0.155 681  0.014  1.0000 
##  2 - 3     -0.1782 0.162 681 -1.101  0.5140 
## 
## P value adjustment: tukey method for comparing a family of 3 estimates

There is a lot more on probing interactions here: https://cran.r-project.org/web/packages/emmeans/vignettes/interactions.html.

ANOVA

Sometimes we want to test the relationship between categorical variables and numeric variables. One way to do this is analysis of variance (ANOVA), which provides a statistical test of whether two or more population means are equal. Here, ANOVA is done using the aov_ez function from the afex package.

Suppose we want to see if the ACT scores differ with education levels and gender.

Between-subjects factorial ANOVA

A two-way between-subjects ANOVA is used to evaluate simultaneously the effect of two grouping variables (A and B) on a response variable. First, let’s see the summary of each condition.

str(sat.act)
## 'data.frame':    700 obs. of  8 variables:
##  $ sex      : Factor w/ 2 levels "male","female": 2 2 2 1 1 1 2 1 2 2 ...
##  $ education: Factor w/ 6 levels "none","some_hs",..: 4 4 4 5 3 6 6 4 5 6 ...
##  $ age      : int  19 23 20 27 33 26 30 19 23 40 ...
##  $ ACT      : int  24 35 21 26 31 28 36 22 22 35 ...
##  $ SATV     : int  500 600 480 550 600 640 610 520 400 730 ...
##  $ SATQ     : int  500 500 470 520 550 640 500 560 600 800 ...
##  $ agef     : Factor w/ 3 levels "1","2","3": 1 1 1 2 2 2 2 1 1 2 ...
##  $ gender   : Factor w/ 2 levels "male","female": 2 2 2 1 1 1 2 1 2 2 ...
describeBy(sat.act$ACT,list(sat.act$education,sat.act$gender), mat=TRUE,digits=2) %>% 
  dplyr::select(-vars, -item, -trimmed, -mad, -range, -se)
##            group1 group2   n mean   sd median min max  skew kurtosis
## X11          none   male  27 29.0 5.00     29  20  36 -0.30    -1.13
## X12       some_hs   male  20 26.7 7.11     28  15  35 -0.30    -1.51
## X13   high_school   male  23 26.6 6.39     28   3  32 -2.14     5.39
## X14  some_college   male  80 28.6 5.03     30  17  36 -0.45    -0.92
## X15       college   male  51 28.9 4.42     29  16  36 -0.74     0.12
## X16      graduate   male  46 30.8 3.11     32  25  36 -0.38    -0.81
## X17          none female  30 26.1 5.06     26  15  36  0.08    -0.56
## X18       some_hs female  25 28.1 5.13     27  18  36 -0.21    -0.78
## X19   high_school female  21 27.3 5.23     28  15  36 -0.32    -0.34
## X110 some_college female 195 28.2 4.78     29  16  36 -0.46    -0.47
## X111      college female  87 29.4 4.32     30  19  36 -0.27    -0.67
## X112     graduate female  95 29.0 4.19     29  18  36 -0.31    -0.73
sat.act$id <- 1:nrow(sat.act)
sat.act$education <- factor(sat.act$education)
# Compute the analysis of variance
res.aov <- aov_ez(id="id", dv="ACT", data = sat.act, between=c("education","gender"))
## Contrasts set to contr.sum for the following variables: education, gender
# Summary of the analysis
print(res.aov)
## Anova Table (Type 3 tests)
## 
## Response: ACT
##             Effect     df   MSE        F  ges p.value
## 1        education 5, 688 22.56 4.53 *** .032   <.001
## 2           gender 1, 688 22.56     0.87 .001    .351
## 3 education:gender 5, 688 22.56   2.04 + .015    .071
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '+' 0.1 ' ' 1

We can conclude that the effect of education is statistically significant: ACT scores are positively related to education. There is no significant interaction between education and gender.

Pairwise comparison among education levels

ACT scores generally increase with education. If we want to compare each level of the education factor with the others, we can compute pairwise comparisons using emmeans.

sat.emm.aov <- emmeans(res.aov, "education")
## NOTE: Results may be misleading due to involvement in interactions
sat_pairs_aov <- pairs(sat.emm.aov)
print(sat_pairs_aov)
##  contrast                   estimate    SE  df t.ratio p.value
##  none - some_hs                0.142 0.951 688  0.150  1.0000 
##  none - high_school            0.559 0.954 688  0.590  0.9920 
##  none - some_college          -0.822 0.705 688 -1.170  0.8530 
##  none - college               -1.643 0.757 688 -2.170  0.2520 
##  none - graduate              -2.366 0.761 688 -3.110  0.0240 
##  some_hs - high_school         0.417 1.011 688  0.410  0.9980 
##  some_hs - some_college       -0.964 0.779 688 -1.240  0.8190 
##  some_hs - college            -1.785 0.826 688 -2.160  0.2580 
##  some_hs - graduate           -2.508 0.830 688 -3.020  0.0310 
##  high_school - some_college   -1.381 0.783 688 -1.760  0.4900 
##  high_school - college        -2.202 0.830 688 -2.650  0.0860 
##  high_school - graduate       -2.926 0.834 688 -3.510  0.0060 
##  some_college - college       -0.821 0.524 688 -1.570  0.6210 
##  some_college - graduate      -1.545 0.530 688 -2.910  0.0430 
##  college - graduate           -0.724 0.598 688 -1.210  0.8320 
## 
## Results are averaged over the levels of: gender 
## P value adjustment: tukey method for comparing a family of 6 estimates

Repeated measures ANOVA

Repeated measures ANOVA is commonly used in data analysis, for example when participants complete more than one task or condition, or in longitudinal studies where data are collected from each participant more than once. Assume there are different time points in this dataset. After introducing a time point variable into the data, let’s check whether ACT scores vary by time point and education.

set.seed(1999) # Needed for getting same results

sat.act <- sat.act %>% 
  mutate(education = factor(rep(sample(1:5, 140, replace = TRUE), each=5)),
         id = rep(1:140, each = 5),
         time = rep(1:5, 140)
         )

res.aov2 <- aov_ez(
  id = "id",
  dv = "ACT",
  data = sat.act,
  between = c("education"),
  within = c("time")
)
## Contrasts set to contr.sum for the following variables: education
summary(res.aov2)
## 
## Univariate Type III Repeated-Measures ANOVA Assuming Sphericity
## 
##                Sum Sq num Df Error SS den Df  F value Pr(>F)    
## (Intercept)    531504      1     3619    135 19829.43 <2e-16 ***
## education         154      4     3619    135     1.44  0.225    
## time              194      4    11905    540     2.20  0.067 .  
## education:time    387     16    11905    540     1.10  0.355    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## Mauchly Tests for Sphericity
## 
##                Test statistic p-value
## time                    0.874  0.0362
## education:time          0.874  0.0362
## 
## 
## Greenhouse-Geisser and Huynh-Feldt Corrections
##  for Departure from Sphericity
## 
##                GG eps Pr(>F[GG])  
## time            0.936      0.072 .
## education:time  0.936      0.356  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##                HF eps Pr(>F[HF])
## time            0.966     0.0698
## education:time  0.966     0.3556

Visualizing ANOVA

Aside from simply printing a table, we can visualize the mean and SD of each condition with a line plot. See the supplemental materials for an example.
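As a rough sketch, assuming the simulated time variable created above, we could plot condition means with ±1 SD error bars using ggplot2:

sat.act %>%
  group_by(education, time) %>%
  summarise(mean_ACT = mean(ACT), sd_ACT = sd(ACT), .groups = "drop") %>%
  ggplot(aes(x = time, y = mean_ACT, color = education, group = education)) +
  geom_point() +
  geom_line() +
  geom_errorbar(aes(ymin = mean_ACT - sd_ACT, ymax = mean_ACT + sd_ACT), width = 0.2) +
  labs(x = "Time point", y = "Mean ACT score")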