1 Model evidence, K-L distance

“All models are wrong but some are useful” - George Box

Conceptually, model A is nested in model B if model A contains a subset of the parameters in model B. For example, if we have an MLM with 10 fixed-effect predictors (\(X_1 - X_{10}\)) and then we add an additional predictor \(X_{11}\) as a fixed effect, then the restricted model (lacking the new predictor) is nested in the full model. One way to think of this is that in the restricted model A, the effect of \(X_{11}\) is fixed to be zero – it’s like the predictor is there but has no ability to exert influence on the outcome.
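To make this concrete, here is a minimal sketch in lme4 (the data.frame dat and predictor names are illustrative, and only three of the ten original predictors are shown): the restricted model is simply the full model with the coefficient of \(X_{11}\) constrained to zero.

# restricted model A: omits x11 (equivalently, its effect is fixed to zero)
mA_restricted <- lmer(y ~ x1 + x2 + x3 + (1|id), data=dat)
# full model B: adds x11 as a fixed effect; model A is nested in model B
mB_full <- lmer(y ~ x1 + x2 + x3 + x11 + (1|id), data=dat)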

Importantly, in univariate MLM (one outcome), when we talk about comparing models in terms of their evidence, the dependent variable must be the same for all models being compared. If we switch from modeling reaction times to modeling choices among alternative actions, the dependent variable has changed! This means the likelihood function will have changed, too, and we can only compare alternative models of a given outcome to each other.

A more subtle but perhaps even more crucial requirement is that the number of observations must be the same across the models being compared. For example, if one model of a given outcome has 10,005 observations, while the next model (of the same outcome) has 9,889, then the likelihood functions are not the same and the models cannot be compared directly! Why might this happen? The most common source is missingness on the predictor side that leads to listwise deletion.

For example, consider model A:

mA <- lmer(rt ~ rt_prev + sex + age + (1|id), data=df, na.action=na.omit)
length(fitted(mA))
## [1] 10005

The number of predicted values (\(\hat{Y}\)) is 10,005. Let’s add another predictor, the level of extraversion.

mB <- lmer(rt ~ rt_prev + sex + age + extraversion + (1|id), data=df, na.action=na.omit)
length(fitted(mB))
## [1] 9889
length(na.omit(df$extraversion))
## [1] 9889

As you can see, the number of predicted values has dropped to 9,889, and this matches the number of non-missing values of extraversion! The challenge is that mA and mB can no longer be compared. What to do?

The simplest solution is to filter the data.frame down to the observations available to the most complex model (i.e., the one with the most predictors) so that simpler models are fit to the same rows and do not gain observations; a sketch of this approach follows below. Note that this problem is common when adding lagged predictors to a model since \(t-1\) is undefined on trial 1, and so on.
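Here is one hedged way to do this, assuming the variable names from the example above. The idea is to keep only rows that are complete on every variable used in the most complex model, then refit all models on that reduced data.frame.

# keep only rows with no missing values on any variable used in the most complex model
model_vars <- c("rt", "rt_prev", "sex", "age", "extraversion", "id")
df_complete <- df[complete.cases(df[, model_vars]), ]

# refit both models on the same observations so their likelihoods are comparable
mA <- lmer(rt ~ rt_prev + sex + age + (1|id), data=df_complete)
mB <- lmer(rt ~ rt_prev + sex + age + extraversion + (1|id), data=df_complete)
stopifnot(length(fitted(mA)) == length(fitted(mB)))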

Another common source of error in model comparison occurs when one model transforms the dependent variable while another does not. For example, in the flanker data, can we compare a model of inverse RT (\(-1000/RT\)) to the untransformed \(RT\)?

minv <- lmer(rt_inv ~ 1 + (1|id), flanker)
mun <- lmer(rt ~ 1 + (1|id), flanker)
anova(minv, mun)
## refitting model(s) with ML (instead of REML)
##      npar    AIC    BIC logLik deviance Chisq Df Pr(>Chisq)
## minv    3  16382  16405  -8188    16376    NA NA         NA
## mun     3 190706 190730 -95350   190700     0  0         NA

As you see, R will run this comparison for us (be afraid!), but we would come to a completely erroneous conclusion, namely that the inverse-RT model fits the data many orders of magnitude better than the untransformed model. This is not a fair comparison because the scale of the outcome variable is not the same in the two models and, thus, the likelihood is not the same! I have reviewed a number of papers over the years where authors conclude that a model with a transformed outcome (often sqrt or log) fits the data far better than the untransformed one based on AIC or LRTs. Although we can perhaps compare how well a transformation normalizes a distribution, we cannot compare models of different transformations in terms of their likelihoods (i.e., the likelihood of the data given the model and parameters).
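If the goal is to evaluate whether a transformation improves the distributional behavior of the residuals, an informal check is to compare the residual distributions directly rather than the likelihoods. A quick sketch using the two models above:

# compare residual distributions visually (not via likelihoods)
par(mfrow = c(1, 2))
qqnorm(resid(mun), main = "Untransformed RT"); qqline(resid(mun))
qqnorm(resid(minv), main = "Inverse RT"); qqline(resid(minv))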

Having established whether two models are nested, we can move on to the important consideration of model evidence. In the case of equivalent models, relative evidence cannot be based on global fit given that both models will have the same likelihood and the same number of parameters. But in the case of nested models, we would like to be able to make a statement such as, “Model A fits the data significantly better than Model B, p < .01,” or, “The evidence for Model A is ten times stronger than the evidence for Model B.”

1.1 Binary judgments: likelihood ratios and Vuong tests

The first statement concerns a binary judgment about model fit – that is, we are interested in testing the null hypothesis that two models fit the data equally well. The most common approach for making binary judgments in MLM is the likelihood ratio test (LRT). In the LRT approach, we compare two nested models (using the anova function in lme4) to test the null hypothesis that the model with fewer parameters – the nested model – fits the data as well as the full model. If we reject this null (typically \(p < .05\)), then we conclude that the nested model fits significantly worse than the full model.
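Concretely, the LRT statistic is twice the difference in the maximized log-likelihoods of the two models and is referred to a \(\chi^2\) distribution with degrees of freedom equal to the difference in the number of parameters:

\[ \chi^2_{LRT} = -2 \left( \textrm{LL}_{\textrm{restricted}} - \textrm{LL}_{\textrm{full}} \right), \quad df = k_{\textrm{full}} - k_{\textrm{restricted}} \]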

Let’s return to the flanker data, comparing the unconditional means model (intercept-only) and the model that has condition (incongruent versus congruent) and block (mostly incongruent versus mostly congruent) as fixed-effects predictors.

m1 <- lmer(rt_inv ~ 1 + (1|id), flanker)
summary(m1)
## Linear mixed model fit by REML ['lmerMod']
## Formula: rt_inv ~ 1 + (1 | id)
##    Data: flanker
## 
## REML criterion at convergence: 16381
## 
## Scaled residuals: 
##    Min     1Q Median     3Q    Max 
## -4.741 -0.598  0.012  0.601  4.363 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  id       (Intercept) 0.0918   0.303   
##  Residual             0.1507   0.388   
## Number of obs: 16801, groups:  id, 107
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  -2.6323     0.0295   -89.4
m2 <- lmer(rt_inv ~ block*cond + (1|id), flanker)
summary(m2)
## Linear mixed model fit by REML ['lmerMod']
## Formula: rt_inv ~ block * cond + (1 | id)
##    Data: flanker
## 
## REML criterion at convergence: 15963
## 
## Scaled residuals: 
##    Min     1Q Median     3Q    Max 
## -4.678 -0.593  0.019  0.599  4.138 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  id       (Intercept) 0.0919   0.303   
##  Residual             0.1468   0.383   
## Number of obs: 16801, groups:  id, 107
## 
## Fixed effects:
##                               Estimate Std. Error t value
## (Intercept)                   -2.70605    0.03029  -89.35
## blockmost_con                  0.02340    0.00911    2.57
## condincongruent                0.10884    0.00910   11.96
## blockmost_con:condincongruent  0.05045    0.01290    3.91
## 
## Correlation of Fixed Effects:
##             (Intr) blckm_ cndncn
## blockmst_cn -0.210              
## condncngrnt -0.210  0.699       
## blckmst_cn:  0.148 -0.706 -0.705
anova(m1, m2)
## refitting model(s) with ML (instead of REML)
##    npar   AIC   BIC logLik deviance Chisq Df Pr(>Chisq)
## m1    3 16382 16405  -8188    16376    NA NA         NA
## m2    6 15946 15993  -7967    15934   441  3          0

We see that adding condition and block leads to a highly significant LRT, \(\chi^2(3) = 441, p < 10^{-15}\), indicating that the addition of block, condition, and their interaction improves model fit. Also note that lme4 refits the models using maximum likelihood (ML) instead of restricted maximum likelihood (REML). REML is good for parameter estimation, but bad for model comparison because it does not provide an estimate of the model likelihood (sometimes abbreviated LL or logLik to refer to the log of the likelihood) that is comparable across models with different fixed effects.
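To see where the \(\chi^2\) value comes from, we can reproduce the test by hand. This sketch assumes lme4 is loaded and uses refitML() to refit each model by ML, just as anova() does internally.

m1_ml <- refitML(m1)  # refit under ML so the log-likelihoods are comparable
m2_ml <- refitML(m2)
chisq <- as.numeric(-2 * (logLik(m1_ml) - logLik(m2_ml)))
pchisq(chisq, df = 3, lower.tail = FALSE)  # df = 6 - 3 added parameters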

1.2 Relative evidence and Kullback-Leibler (K-L) distance

This, however, is a binary judgment (is fit better or not?). What about the evidence in favor of the model that has block*cond versus the unconditional means model? This is a dimensional question and could be framed in terms of the probability that one model is the best in the set. For example, is the evidence in favor of the block*cond model five times greater than the evidence for the unconditional means model?

To make judgments about relative evidence, we need a measure of how well one model fits compared to another (or many others).

The goal of most statistical modeling is to approximate the key factors that generated our empirical data. The data generation processes are usually complex and largely unknown, so our models will almost always be deficient in many ways. Nevertheless, we may wish to know how close one model is to the unknown truth relative to a comparator model. This is the substance of model selection theory – i.e., how one selects among models based on relative evidence.

Kullback-Leibler information quantifies the information lost when a simpler model, \(g\), is used to approximate the true model, \(f\):

\[ I(f,g) = \int{f(x) \textrm{log} \left( \frac{f(x)}{g(x|\theta)} \right) dx} \]

This can also be thought of as the distance between our model, \(g\), and the unknown population model, \(f\).
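As a hedged numeric illustration (separate from the flanker example), we can approximate \(I(f,g)\) for two normal densities by numerical integration, treating one as the “true” distribution \(f\) and the other as the approximating model \(g\):

# f: 'true' density; g: approximating density (illustrative choices)
f <- function(x) dnorm(x, mean = 0, sd = 1)
g <- function(x) dnorm(x, mean = 0.5, sd = 1.2)

# K-L information I(f, g), approximated over a wide finite range to avoid
# 0/0 issues in the far tails
kl_integrand <- function(x) f(x) * log(f(x) / g(x))
integrate(kl_integrand, lower = -10, upper = 10)$value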