Adapted from materials developed by Daniel Albohn, Kayla Brown, & Yiming Qian (https://psu-psychology.github.io/r-bootcamp-2019/modules.html#basic_data_analyses)
First, load the required packages for this walkthrough.
pkg_list <- c("tidyverse", "psych", "rcompanion", "knitr", "car", "afex", "ez",
"ggfortify", "Hmisc", "emmeans", "jtools", "apaTables", "dplyr", "skimr")
# purrr::walk(pkg_list, require, quietly = TRUE, character.only = TRUE)
pacman::p_load(pkg_list, character.only = TRUE)
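If pacman itself has never been installed, the p_load call above will fail. A minimal guard (our addition, not part of the original bootcamp code) installs it first; run this once before the p_load call:

# Install pacman if it is not already available
if (!requireNamespace("pacman", quietly = TRUE)) install.packages("pacman")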
The data set used for this walkthrough is automatically loaded when psych is loaded. You can examine what each column represents by looking at the data set’s help page with ?sat.act.
Next, just to get a snapshot of the data, we can use head and str. Notice that, in the raw data, sex (originally gender) and education are stored as integer columns despite being categorical variables; the loading code below recodes sex as a factor.
data("sat.act", package="psych")
sat.act <- sat.act %>% dplyr::rename(sex=gender) %>%
mutate(sex = factor(sex, levels = c(1,2), labels = c("male", "female")))
# Alternatively...
# source("R/load_sat_act.R")
head(sat.act)
## sex education age ACT SATV SATQ
## 29442 female 3 19 24 500 500
## 29457 female 3 23 35 600 500
## 29498 female 3 20 21 480 470
## 29503 male 4 27 26 550 520
## 29504 male 2 33 31 600 550
## 29518 male 5 26 28 640 640
str(sat.act)
## 'data.frame': 700 obs. of 6 variables:
## $ sex : Factor w/ 2 levels "male","female": 2 2 2 1 1 1 2 1 2 2 ...
## $ education: int 3 3 3 4 2 5 5 3 4 5 ...
## $ age : int 19 23 20 27 33 26 30 19 23 40 ...
## $ ACT : int 24 35 21 26 31 28 36 22 22 35 ...
## $ SATV : int 500 600 480 550 600 640 610 520 400 730 ...
## $ SATQ : int 500 500 470 520 550 640 500 560 600 800 ...
Since some of the analyses we will be running require categorical variables, we first need to preprocess some of the integer columns into categories/factors.
labels <- c("none", "some_hs", "high_school",
"some_college", "college", "graduate")
sat.act <- sat.act %>%
mutate(.,
education = factor(education, levels = (0:5), labels = labels)
# education = case_when(education == 0 ~ "none",
# education == 1 ~ "some_hs",
# education == 2 ~ "high_school",
# education == 3 ~ "some_college",
# education == 4 ~ "college",
# education == 5 ~ "graduate")
)
Before analysis, we can obtain descriptive statistics quickly by utilizing the describe() function in the psych package.
psych::describe(sat.act) %>% as.data.frame() %>%
dplyr::select(-vars, -trimmed, -mad, -range, -se)
## n mean sd median min max skew kurtosis
## sex* 700 1.65 0.478 2 1 2 -0.615 -1.6247
## education* 700 4.16 1.425 4 1 6 -0.681 -0.0749
## age 700 25.59 9.499 22 13 65 1.643 2.4243
## ACT 700 28.55 4.824 29 3 36 -0.656 0.5350
## SATV 700 612.23 112.903 620 200 800 -0.644 0.3252
## SATQ 687 610.22 115.639 620 200 800 -0.593 -0.0178
The describe() function also allows us to get descriptive statistics by a grouping variable using the partner function describeBy(). If we wanted descriptive statistics by sex, we would simply pass that column to the group argument.
describeBy(sat.act, group = c("sex"), mat = TRUE) %>%
dplyr::select(-vars, -item, -trimmed, -mad, -range, -se)
## group1 n mean sd median min max skew kurtosis
## sex*1 male 247 1.00 0.00 1 1 1 NaN NaN
## sex*2 female 453 2.00 0.00 2 2 2 NaN NaN
## education*1 male 247 4.00 1.54 4 1 6 -0.542 -0.598
## education*2 female 453 4.26 1.35 4 1 6 -0.737 0.272
## age1 male 247 25.86 9.74 22 14 58 1.428 1.435
## age2 female 453 25.45 9.37 22 13 65 1.766 3.029
## ACT1 male 247 28.79 5.06 30 3 36 -1.057 1.888
## ACT2 female 453 28.42 4.69 29 15 36 -0.392 -0.416
## SATV1 male 247 615.11 114.16 630 200 800 -0.630 0.130
## SATV2 female 453 610.66 112.31 620 200 800 -0.651 0.423
## SATQ1 male 245 635.87 116.02 660 300 800 -0.722 -0.122
## SATQ2 female 442 596.00 113.07 600 200 800 -0.577 0.129
While describe is useful for quick glimpses at the data, it does not always play nicely with the tidyverse. If you want to stick with a more “pure” tidyverse approach to descriptive statistics, you can use skimr.
sat.act %>%
skimr::skim(.)
Name | Piped data |
Number of rows | 700 |
Number of columns | 6 |
_______________________ | |
Column type frequency: | |
factor | 2 |
numeric | 4 |
________________________ | |
Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
sex | 0 | 1 | FALSE | 2 | fem: 453, mal: 247 |
education | 0 | 1 | FALSE | 6 | som: 275, gra: 141, col: 138, non: 57 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
age | 0 | 1.00 | 25.6 | 9.50 | 13 | 19 | 22 | 29 | 65 | ▇▃▂▁▁ |
ACT | 0 | 1.00 | 28.6 | 4.82 | 3 | 25 | 29 | 32 | 36 | ▁▁▂▇▇ |
SATV | 0 | 1.00 | 612.2 | 112.90 | 200 | 550 | 620 | 700 | 800 | ▁▁▅▇▆ |
SATQ | 13 | 0.98 | 610.2 | 115.64 | 200 | 530 | 620 | 700 | 800 | ▁▁▅▇▆ |
While skim() doesn’t provide as much descriptive detail as describe, it does provide the basics and a useful visual representation of the data.
Similarly, to obtain descriptives by grouping variables, you can utilize the group_by() function. Here we provide descriptives separately by sex:
sat.act %>%
group_by(., sex) %>%
skimr::skim(.)
Name | Piped data |
Number of rows | 700 |
Number of columns | 6 |
_______________________ | |
Column type frequency: | |
factor | 1 |
numeric | 4 |
________________________ | |
Group variables | sex |
Variable type: factor
skim_variable | sex | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|---|
education | male | 0 | 1 | FALSE | 6 | som: 80, col: 51, gra: 46, non: 27 |
education | female | 0 | 1 | FALSE | 6 | som: 195, gra: 95, col: 87, non: 30 |
Variable type: numeric
skim_variable | sex | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|---|
age | male | 0 | 1.00 | 25.9 | 9.74 | 14 | 19 | 22 | 30.0 | 58 | ▇▅▂▁▁ |
age | female | 0 | 1.00 | 25.4 | 9.37 | 13 | 19 | 22 | 28.0 | 65 | ▇▃▂▁▁ |
ACT | male | 0 | 1.00 | 28.8 | 5.06 | 3 | 25 | 30 | 32.5 | 36 | ▁▁▂▅▇ |
ACT | female | 0 | 1.00 | 28.4 | 4.69 | 15 | 25 | 29 | 32.0 | 36 | ▁▃▆▇▇ |
SATV | male | 0 | 1.00 | 615.1 | 114.16 | 200 | 540 | 630 | 700.0 | 800 | ▁▁▆▇▇ |
SATV | female | 0 | 1.00 | 610.7 | 112.31 | 200 | 550 | 620 | 700.0 | 800 | ▁▁▃▇▅ |
SATQ | male | 2 | 0.99 | 635.9 | 116.02 | 300 | 570 | 660 | 720.0 | 800 | ▂▃▅▇▇ |
SATQ | female | 11 | 0.98 | 596.0 | 113.07 | 200 | 500 | 600 | 683.0 | 800 | ▁▁▆▇▅ |
Moving beyond categorical variables, we next test the relationship between numeric variables using simple correlation. See the supplemental materials for more in-depth explanations.
Correlation is done using the cor() function. Suppose we want to see if ACT scores increase with age.
# Covariance
# cov(sat.act$age, sat.act$ACT)
# Correlation
cor(sat.act$age, sat.act$ACT)
## [1] 0.111
A small correlation, but no test of significance. To obtain a significance test, pass your data to cor.test(). Note that this can be done by passing x and y, or by using the formula method (which will also be used for linear models).
# Default method
# cor.test(sat.act$age, sat.act$ACT)
# Formula method
cor.test(~ age + ACT, data = sat.act)
##
## Pearson's product-moment correlation
##
## data: age and ACT
## t = 3, df = 698, p-value = 0.003
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.0367 0.1831
## sample estimates:
## cor
## 0.111
To visualize this relationship, we can pass the raw data to ggplot and get a simple regression line using stat_smooth() with the lm method. A minimal sketch follows; see the supplemental materials for the full example.
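Here is one possible version of that plot, assuming the same sat.act data (the aesthetic choices are ours, not from the original materials):

ggplot(sat.act, aes(x = age, y = ACT)) +
  geom_point(alpha = 0.5, color = "darkblue") +
  stat_smooth(method = "lm", se = TRUE, color = "black") +
  labs(x = "Age", y = "ACT score")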
You can also pass a data frame of values to the cor function to get a simple correlation matrix (a la SPSS).
cor(sat.act[c("age","ACT","SATV","SATQ")], use = "pairwise.complete.obs")
## age ACT SATV SATQ
## age 1.0000 0.111 -0.0424 -0.0339
## ACT 0.1105 1.000 0.5611 0.5871
## SATV -0.0424 0.561 1.0000 0.6443
## SATQ -0.0339 0.587 0.6443 1.0000
# Or, using tidyverse verbs...
# sat.act %>%
# select(., age, ACT, SATV, SATQ) %>%
# drop_na(., SATQ) %>%
# cor()
Or, for easier-on-the-eyes output, we can use a number of specialized functions.
Hmisc::rcorr(sat.act[c("age","ACT","SATV","SATQ")] %>% as.matrix(.))
## age ACT SATV SATQ
## age 1.00 0.11 -0.04 -0.03
## ACT 0.11 1.00 0.56 0.59
## SATV -0.04 0.56 1.00 0.64
## SATQ -0.03 0.59 0.64 1.00
##
## n
## age ACT SATV SATQ
## age 700 700 700 687
## ACT 700 700 700 687
## SATV 700 700 700 687
## SATQ 687 687 687 687
##
## P
##        age    ACT    SATV   SATQ
## age           0.0034 0.2631 0.3744
## ACT    0.0034        0.0000 0.0000
## SATV   0.2631 0.0000        0.0000
## SATQ   0.3744 0.0000 0.0000
Or go directly to APA-acceptable tables. You can pass a filename argument to apa.cor.table to save the table directly to a file, as sketched after the output below.
apaTables::apa.cor.table(sat.act[c("age","ACT","SATV","SATQ")])
##
##
## Means, standard deviations, and correlations with confidence intervals
##
##
## Variable M SD 1 2 3
## 1. age 25.59 9.50
##
## 2. ACT 28.55 4.82 .11**
## [.04, .18]
##
## 3. SATV 612.23 112.90 -.04 .56**
## [-.12, .03] [.51, .61]
##
## 4. SATQ 610.22 115.64 -.03 .59** .64**
## [-.11, .04] [.54, .63] [.60, .69]
##
##
## Note. M and SD are used to represent mean and standard deviation, respectively.
## Values in square brackets indicate the 95% confidence interval.
## The confidence interval is a plausible range of population correlations
## that could have caused the sample correlation (Cumming, 2014).
## * indicates p < .05. ** indicates p < .01.
##
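For example, a sketch of saving the same table to a Word-compatible file (the file name here is purely illustrative):

# Write the APA-formatted correlation table to a .doc file in the working directory
apaTables::apa.cor.table(sat.act[c("age","ACT","SATV","SATQ")],
                         filename = "sat_cor_table.doc")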
The overall goal here is to introduce regression analyses, or linear modeling, in R. Let’s turn first to ‘simple’ linear regression (one predictor, one outcome), then scale up to multiple regression (many predictors, one outcome). The standard linear regression model is implemented by the lm function in R, which uses ordinary least squares (OLS) to estimate the parameters by minimizing the sum of squared residuals.
Let’s take a look at the simple case of the association between SAT math and verbal scores.
ggplot(sat.act, aes(x=SATV, y=SATQ)) +
geom_point(color='darkblue', size = 3) +
geom_smooth(method=lm, se=FALSE, fullrange=TRUE, color='black', size=1.2) +
labs(x="Math Score", y="Verbal Score")
## `geom_smooth()` using formula 'y ~ x'
This relationship looks quite linear: the higher the verbal score, the higher the math score.
In R regression models, we use the ~ operator to denote ‘regressed on’. It’s not especially intuitive, but we say the criterion is regressed on the predictor. Here, if we think verbal is predictive of math, we’d say ‘math regressed on verbal’ or ‘verbal predicts math.’
In formula terms, this is SATQ ~ SATV, which we pass as the first argument to lm().
lm_SAT <- lm(SATQ ~ SATV, data=sat.act)
summary(lm_SAT)
##
## Call:
## lm(formula = SATQ ~ SATV, data = sat.act)
##
## Residuals:
## Min 1Q Median 3Q Max
## -302.1 -46.5 2.4 51.3 282.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 207.5253 18.5725 11.2 <2e-16 ***
## SATV 0.6576 0.0298 22.1 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 88.5 on 685 degrees of freedom
## (13 observations deleted due to missingness)
## Multiple R-squared: 0.415, Adjusted R-squared: 0.414
## F-statistic: 486 on 1 and 685 DF, p-value: <2e-16
The output contains the individual parameter estimates of the model (here, just the intercept and slope), along with their standard errors and single-degree-of-freedom significance tests (t statistics and p-values). We also get global information such as the residual standard error and the coefficient of determination (\(R^2\)).
The ggfortify package also provides an autoplot function that produces standard regression diagnostic plots within a handy ggplot-based graph.
autoplot(lm_SAT)
Using functionality from the car and boot packages, we can easily get estimates of the regression coefficients and standard errors using nonparametric bootstrapping, which relaxes the normal-theory assumption on the standard errors and, therefore, on the significance tests: the bootstrap does not assume normally distributed errors.
sat.act.na <- na.omit(sat.act)
lm_SAT.na <- lm(SATQ ~ SATV, data=sat.act.na)
system.time(lm_SAT.boot <- Boot(lm_SAT.na, R=2000))
## user system elapsed
## 2.711 0.036 2.785
summary(lm_SAT.boot, high.moments=TRUE)
##
## Number of bootstrap replications R = 2000
## original bootBias bootSE bootMed bootSkew bootKurtosis
## (Intercept) 207.525 0.071309 21.4796 207.396 0.0916 -0.164
## SATV 0.658 -0.000261 0.0336 0.659 -0.0698 -0.188
We can use the resulting object to obtain 95% bootstrapped confidence intervals using the ‘bias-corrected and accelerated’ method (type = "bca").
confint(lm_SAT.boot, level=.95, type="bca")
## Bootstrap bca confidence intervals
##
## 2.5 % 97.5 %
## (Intercept) 167.486 251.89
## SATV 0.587 0.72
And we can easily compare the bootstrapped and standard OLS models:
hist(lm_SAT.boot, legend="separate")
We can easily extend to larger regression models by adding terms to the right-hand side of the formula. For example, in the sat.act dataset, we could examine the extent to which math scores (SATQ) are a function of verbal scores (SATV), age (age), and sex (sex, recoded from the original gender variable: 1 = male, 2 = female).
asv_model <- lm(SATQ ~ age + sex + SATV, sat.act)
summary(asv_model)
##
## Call:
## lm(formula = SATQ ~ age + sex + SATV, data = sat.act)
##
## Residuals:
## Min 1Q Median 3Q Max
## -290.34 -50.32 5.81 51.51 295.57
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 236.6596 21.2572 11.13 < 2e-16 ***
## age -0.1267 0.3498 -0.36 0.72
## sexfemale -36.8580 6.9202 -5.33 1.4e-07 ***
## SATV 0.6541 0.0293 22.33 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 86.8 on 683 degrees of freedom
## (13 observations deleted due to missingness)
## Multiple R-squared: 0.438, Adjusted R-squared: 0.436
## F-statistic: 178 on 3 and 683 DF, p-value: <2e-16
Note that the broom package is very useful for extracting global and specific statistics from many models in R, including regression models. The introductory vignette provides a number of useful examples: https://cran.r-project.org/web/packages/broom/vignettes/broom.html. Here, what if we want to save the global statistics and parameter estimates into data.frame objects?
We can use the glance function to get the global model statistics.
broom::glance(asv_model)
## # A tibble: 1 x 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int>
## 1 0.438 0.436 86.8 178. 3.52e-85 3 -4040. 8089. 8112. 5150990. 683 687
And the tidy function yields the parameter table:
broom::tidy(asv_model)
## # A tibble: 4 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 237. 21.3 11.1 1.43e-26
## 2 age -0.127 0.350 -0.362 7.17e- 1
## 3 sexfemale -36.9 6.92 -5.33 1.36e- 7
## 4 SATV 0.654 0.0293 22.3 2.51e-83
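To keep these as data.frame/tibble objects, per the question above, and optionally export them, a small sketch (the object and file names are illustrative):

asv_global <- broom::glance(asv_model)  # global fit statistics as a tibble
asv_params <- broom::tidy(asv_model)    # parameter table as a tibble
# readr::write_csv(asv_params, "asv_model_parameters.csv")  # optional export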
We can use the * operator in R to ask that both the constituent variables and their interaction(s) be entered into the model. For example:
int_model <- lm(SATQ ~ sex*age*SATV, sat.act)
summary(int_model)
##
## Call:
## lm(formula = SATQ ~ sex * age * SATV, data = sat.act)
##
## Residuals:
## Min 1Q Median 3Q Max
## -293.41 -50.48 4.05 52.30 291.29
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -23.01000 86.97955 -0.26 0.791
## sexfemale 216.92703 108.13088 2.01 0.045 *
## age 9.54204 3.39075 2.81 0.005 **
## SATV 1.02899 0.13875 7.42 3.6e-13 ***
## sexfemale:age -8.86490 4.15128 -2.14 0.033 *
## sexfemale:SATV -0.33756 0.17252 -1.96 0.051 .
## age:SATV -0.01395 0.00546 -2.55 0.011 *
## sexfemale:age:SATV 0.01155 0.00668 1.73 0.084 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 86.2 on 679 degrees of freedom
## (13 observations deleted due to missingness)
## Multiple R-squared: 0.45, Adjusted R-squared: 0.444
## F-statistic: 79.4 on 7 and 679 DF, p-value: <2e-16
This model includes the individual effects of sex, age, and SATV, as well as all of their two- and three-way interactions (e.g., sex:age). This highlights that the asterisk operator * will compute all possible interactions among the specified predictors. For example, a*b*c*d will generate all effects up through and including the a x b x c x d interaction. By contrast, if you wish to specify a given interaction manually/directly, use the colon operator (e.g., a:b). The downside of the colon operator is that it doesn’t guarantee that the corresponding lower-order effects are included, and including them is usually the sane default. As a reminder, you should essentially never include an interaction without including the lower-order effects, because this can misassign the variance.
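To see the * expansion concretely, here is a quick sanity check (the model objects m_star and m_colon are ours, purely for illustration):

# sex * age expands to sex + age + sex:age, so these two fits are identical
m_star  <- lm(SATQ ~ sex * age, data = sat.act)
m_colon <- lm(SATQ ~ sex + age + sex:age, data = sat.act)
all.equal(coef(m_star), coef(m_colon))  # TRUE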
What do we see? Males tend to have higher math scores and maintain these scores with age. Women have lower math scores and show a decreasing trend as they get older.
(Some of the code and text here has been adapted from Russell Lenth’s excellent emmeans documentation: https://cran.r-project.org/web/packages/emmeans/.)
One of the handiest packages in the R regression universe is emmeans, which can provide ‘estimated marginal means’ (EMMs), as well as a host of other contrasts and comparisons. In particular, it is very easy to test simple slopes and pairwise differences. Furthermore, the package works with multcomp to handle correction for multiple comparisons. See the longer documentation at the link above.
Let’s look at how age and verbal scores may interact to predict math scores.
# creating age groups for pairwise comparisons
sat.act$agef[sat.act$age < 25] <- 1
sat.act$agef[sat.act$age >= 25 & sat.act$age <= 50] <- 2
sat.act$agef[sat.act$age > 50] <- 3
# setting as factors
sat.act$agef <- as.factor(sat.act$agef)
sat.act$gender <- as.factor(sat.act$sex)
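# (Optional alternative, our suggestion) the same three age groups in one step with cut():
# sat.act$agef <- cut(sat.act$age, breaks = c(-Inf, 24, 50, Inf), labels = c("1", "2", "3"))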
# running the model
sat.lm <- lm(SATQ ~ agef + SATV, data = sat.act)
summary(sat.lm)
##
## Call:
## lm(formula = SATQ ~ agef + SATV, data = sat.act)
##
## Residuals:
## Min 1Q Median 3Q Max
## -301.62 -45.92 2.07 51.81 283.47
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 206.9031 18.8894 10.95 <2e-16 ***
## agef2 -0.0946 7.1350 -0.01 0.99
## agef3 15.5437 18.9726 0.82 0.41
## SATV 0.6579 0.0299 22.02 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 88.6 on 683 degrees of freedom
## (13 observations deleted due to missingness)
## Multiple R-squared: 0.416, Adjusted R-squared: 0.413
## F-statistic: 162 on 3 and 683 DF, p-value: <2e-16
This output is hard to look at because there are many dummy codes and we have to infer the reference condition for each factor (usually the first level, alphabetically). Also, we do not have an intuitive sense of the expected mean in each condition, because it depends on the sum of the intercept and the specific dummy code for the condition of interest, averaging over the other factor.
We can obtain the estimated marginal means for each condition.
sat.emm.s <- emmeans(sat.lm, "agef")
print(sat.emm.s)
## agef emmean SE df lower.CL upper.CL
## 1 610 4.32 683 601 618
## 2 610 5.67 683 598 621
## 3 625 18.47 683 589 662
##
## Confidence level used: 0.95
sat.emm.p <- emmeans(sat.lm, "SATV")
print(sat.emm.p)
## SATV emmean SE df lower.CL upper.CL
## 612 615 4.32 683 606 623
##
## Results are averaged over the levels of: agef
## Confidence level used: 0.95
print(emmeans(sat.lm, ~agef*SATV))
## agef SATV emmean SE df lower.CL upper.CL
## 1 612 610 4.32 683 601 618
## 2 612 610 5.67 683 598 621
## 3 612 625 18.47 683 589 662
##
## Confidence level used: 0.95
If we wanted to compare the pairwise differences in the effect of age group on math scores while controlling for verbal scores (and potentially other variables we add to the model), we could use the pairs function:
sat_pairs <- pairs(sat.emm.s)
print(sat_pairs)
## contrast estimate SE df t.ratio p.value
## 1 - 2 0.09 7.13 683 0.013 1.0000
## 1 - 3 -15.54 18.97 683 -0.819 0.6910
## 2 - 3 -15.64 19.32 683 -0.809 0.6970
##
## P value adjustment: tukey method for comparing a family of 3 estimates
Note that you can get a sense of the contrasts being tested by emmeans by examining the @linfct slot of the object. I’ve learned a lot by examining these contrast matrices and thinking about how to set up a (focal) contrast of interest. Also note that you get p-value adjustment for free (here, Tukey’s HSD method).
Below is the linear-function matrix that produces the predicted mean math score for each age group, controlling for verbal score:
sat.emm.s@linfct
## (Intercept) agef2 agef3 SATV
## [1,] 1 0 0 612
## [2,] 1 1 0 612
## [3,] 1 0 1 612
And here is the contrast matrix underlying the pairwise age-group comparisons:
sat_pairs@linfct
## (Intercept) agef2 agef3 SATV
## [1,] 0 -1 0 0
## [2,] 0 0 -1 0
## [3,] 0 1 -1 0
Consider a model in which we examine the association between SATQ and SATV across age. Here, we regress SATQ on SATV, age group (three levels), and their interaction.
fit_sat <- lm(SATQ ~ SATV*agef, data = sat.act)
summary(fit_sat)
##
## Call:
## lm(formula = SATQ ~ SATV * agef, data = sat.act)
##
## Residuals:
## Min 1Q Median 3Q Max
## -302.51 -50.21 2.09 54.61 256.86
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 178.8432 22.1853 8.06 3.4e-15 ***
## SATV 0.7034 0.0354 19.90 < 2e-16 ***
## agef2 109.8710 42.6641 2.58 0.0102 *
## agef3 17.2861 95.7898 0.18 0.8568
## SATV:agef2 -0.1804 0.0690 -2.61 0.0091 **
## SATV:agef3 -0.0022 0.1547 -0.01 0.9887
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 88.3 on 681 degrees of freedom
## (13 observations deleted due to missingness)
## Multiple R-squared: 0.422, Adjusted R-squared: 0.417
## F-statistic: 99.3 on 5 and 681 DF, p-value: <2e-16
car::Anova(fit_sat, type="III") #overall effects of predictors in the model
## Anova Table (Type III tests)
##
## Response: SATQ
## Sum Sq Df F value Pr(>F)
## (Intercept) 506336 1 64.99 3.4e-15 ***
## SATV 3084321 1 395.85 < 2e-16 ***
## agef 51806 2 3.32 0.037 *
## SATV:agef 53928 2 3.46 0.032 *
## Residuals 5306050 681
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Note that this yields a categorical (age group) x continuous (verbal scores) interaction. The output from car::Anova indicates that the interaction is significant, but we need more detailed guidance on how the slope for verbal scores is moderated by age group.
In a simple slopes test, we might wish to know whether the slope for SATV is non-zero in each age group individually. Let’s start by getting the estimated marginal means for each age group.
emmeans(fit_sat, ~agef)
## NOTE: Results may be misleading due to involvement in interactions
## agef emmean SE df lower.CL upper.CL
## 1 610 4.31 681 601 618
## 2 609 5.66 681 598 620
## 3 626 18.43 681 589 662
##
## Confidence level used: 0.95
And pairwise differences between age groups:
pairs(emmeans(fit_sat, ~agef))
## NOTE: Results may be misleading due to involvement in interactions
## contrast estimate SE df t.ratio p.value
## 1 - 2 0.62 7.11 681 0.087 0.9960
## 1 - 3 -15.94 18.92 681 -0.842 0.6770
## 2 - 3 -16.56 19.28 681 -0.859 0.6660
##
## P value adjustment: tukey method for comparing a family of 3 estimates
Turning to SATV, because we are interested in its linear effect (slope), we use the emtrends function to estimate the slope in each age group individually. In terms of simple slopes, we test whether the SATV slope is non-zero within each age group. The infer argument in the summary of emtrends requests t-tests and p-values for the slopes.
summary(emtrends(object = fit_sat, ~agef, var="SATV"), infer=TRUE)
## agef SATV.trend SE df lower.CL upper.CL t.ratio p.value
## 1 0.703 0.0354 681 0.634 0.773 19.900 <.0001
## 2 0.523 0.0593 681 0.407 0.639 8.820 <.0001
## 3 0.701 0.1506 681 0.406 0.997 4.660 <.0001
##
## Confidence level used: 0.95
Finally, we could examine pairwise differences between slopes among age groups.
pairs(emtrends(object = fit_sat, ~agef, var="SATV"))
## contrast estimate SE df t.ratio p.value
## 1 - 2 0.1804 0.069 681 2.614 0.0250
## 1 - 3 0.0022 0.155 681 0.014 1.0000
## 2 - 3 -0.1782 0.162 681 -1.101 0.5140
##
## P value adjustment: tukey method for comparing a family of 3 estimates
There is a lot more on probing interactions here: https://cran.r-project.org/web/packages/emmeans/vignettes/interactions.html.
Further, sometimes we want to test the relationship between categorical variables and numeric variables. One way to do this is ANOVA. Analysis of variance (ANOVA) provides a statistical test of whether two or more population means are equal. Here, ANOVA is done using the aov_ez function from the afex package.
Suppose we want to see if ACT scores differ by education level and gender.
A two-way between-subjects ANOVA is used to evaluate simultaneously the effect of two grouping variables (A and B) on a response variable. First, let’s see the summary of each condition.
str(sat.act)
## 'data.frame': 700 obs. of 8 variables:
## $ sex : Factor w/ 2 levels "male","female": 2 2 2 1 1 1 2 1 2 2 ...
## $ education: Factor w/ 6 levels "none","some_hs",..: 4 4 4 5 3 6 6 4 5 6 ...
## $ age : int 19 23 20 27 33 26 30 19 23 40 ...
## $ ACT : int 24 35 21 26 31 28 36 22 22 35 ...
## $ SATV : int 500 600 480 550 600 640 610 520 400 730 ...
## $ SATQ : int 500 500 470 520 550 640 500 560 600 800 ...
## $ agef : Factor w/ 3 levels "1","2","3": 1 1 1 2 2 2 2 1 1 2 ...
## $ gender : Factor w/ 2 levels "male","female": 2 2 2 1 1 1 2 1 2 2 ...
describeBy(sat.act$ACT,list(sat.act$education,sat.act$gender), mat=TRUE,digits=2) %>%
dplyr::select(-vars, -item, -trimmed, -mad, -range, -se)
## group1 group2 n mean sd median min max skew kurtosis
## X11 none male 27 29.0 5.00 29 20 36 -0.30 -1.13
## X12 some_hs male 20 26.7 7.11 28 15 35 -0.30 -1.51
## X13 high_school male 23 26.6 6.39 28 3 32 -2.14 5.39
## X14 some_college male 80 28.6 5.03 30 17 36 -0.45 -0.92
## X15 college male 51 28.9 4.42 29 16 36 -0.74 0.12
## X16 graduate male 46 30.8 3.11 32 25 36 -0.38 -0.81
## X17 none female 30 26.1 5.06 26 15 36 0.08 -0.56
## X18 some_hs female 25 28.1 5.13 27 18 36 -0.21 -0.78
## X19 high_school female 21 27.3 5.23 28 15 36 -0.32 -0.34
## X110 some_college female 195 28.2 4.78 29 16 36 -0.46 -0.47
## X111 college female 87 29.4 4.32 30 19 36 -0.27 -0.67
## X112 graduate female 95 29.0 4.19 29 18 36 -0.31 -0.73
sat.act$id<-1:nrow(sat.act)
sat.act$education<-factor(sat.act$education)
# Compute the analysis of variance
res.aov <- aov_ez(id="id", dv="ACT", data = sat.act, between=c("education","gender"))
## Contrasts set to contr.sum for the following variables: education, gender
# Summary of the analysis
print(res.aov)
## Anova Table (Type 3 tests)
##
## Response: ACT
## Effect df MSE F ges p.value
## 1 education 5, 688 22.56 4.53 *** .032 <.001
## 2 gender 1, 688 22.56 0.87 .001 .351
## 3 education:gender 5, 688 22.56 2.04 + .015 .071
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '+' 0.1 ' ' 1
We can conclude that the effect of education is statistically significant: ACT scores generally increase with education. There is no significant interaction between education and gender.
If we want to compare each level of the education factor with the others, we can compute pairwise comparisons using emmeans.
sat.emm.aov <- emmeans(res.aov, "education")
## NOTE: Results may be misleading due to involvement in interactions
sat_pairs_aov <- pairs(sat.emm.aov)
print(sat_pairs_aov)
## contrast estimate SE df t.ratio p.value
## none - some_hs 0.142 0.951 688 0.150 1.0000
## none - high_school 0.559 0.954 688 0.590 0.9920
## none - some_college -0.822 0.705 688 -1.170 0.8530
## none - college -1.643 0.757 688 -2.170 0.2520
## none - graduate -2.366 0.761 688 -3.110 0.0240
## some_hs - high_school 0.417 1.011 688 0.410 0.9980
## some_hs - some_college -0.964 0.779 688 -1.240 0.8190
## some_hs - college -1.785 0.826 688 -2.160 0.2580
## some_hs - graduate -2.508 0.830 688 -3.020 0.0310
## high_school - some_college -1.381 0.783 688 -1.760 0.4900
## high_school - college -2.202 0.830 688 -2.650 0.0860
## high_school - graduate -2.926 0.834 688 -3.510 0.0060
## some_college - college -0.821 0.524 688 -1.570 0.6210
## some_college - graduate -1.545 0.530 688 -2.910 0.0430
## college - graduate -0.724 0.598 688 -1.210 0.8320
##
## Results are averaged over the levels of: gender
## P value adjustment: tukey method for comparing a family of 6 estimates
Repeated-measures ANOVA is commonly used in data analysis: for example, when participants complete more than one task or condition, or in longitudinal studies where data are collected from each participant more than once. Assume there are different time points in this dataset. After introducing the time-point variable into the data, let’s check whether ACT scores vary by time point and education.
set.seed(1999) # Needed for getting same results
sat.act <- sat.act %>%
mutate(education = factor(rep(sample(1:5, 140, replace = TRUE), each=5)),
id = rep(1:140, each = 5),
time = rep(1:5, 140)
)
res.aov2 <- aov_ez(
id = "id",
dv = "ACT",
data = sat.act,
between = c("education"),
within = c("time")
)
## Contrasts set to contr.sum for the following variables: education
summary(res.aov2)
##
## Univariate Type III Repeated-Measures ANOVA Assuming Sphericity
##
## Sum Sq num Df Error SS den Df F value Pr(>F)
## (Intercept) 531504 1 3619 135 19829.43 <2e-16 ***
## education 154 4 3619 135 1.44 0.225
## time 194 4 11905 540 2.20 0.067 .
## education:time 387 16 11905 540 1.10 0.355
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## Mauchly Tests for Sphericity
##
## Test statistic p-value
## time 0.874 0.0362
## education:time 0.874 0.0362
##
##
## Greenhouse-Geisser and Huynh-Feldt Corrections
## for Departure from Sphericity
##
## GG eps Pr(>F[GG])
## time 0.936 0.072 .
## education:time 0.936 0.356
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## HF eps Pr(>F[HF])
## time 0.966 0.0698
## education:time 0.966 0.3556
Aside from simply printing a table, we can visualize the mean and SD of each condition with a line plot, as sketched below; see the supplemental materials for the full example.
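One possible sketch of such a plot, assuming the simulated data above (the summary and aesthetic choices are ours):

# Mean and SD of ACT at each education level and time point, as lines with error bars
sat.act %>%
  group_by(education, time) %>%
  summarise(mean_ACT = mean(ACT), sd_ACT = sd(ACT), .groups = "drop") %>%
  ggplot(aes(x = time, y = mean_ACT, color = education, group = education)) +
  geom_line() +
  geom_point() +
  geom_errorbar(aes(ymin = mean_ACT - sd_ACT, ymax = mean_ACT + sd_ACT), width = 0.2)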