Comparing estimates and their standard errors from mixed effects and linear models

2017/08/15

Some background

One reason to use mixed effects models is that they help to account for data with a complex structure, such as multiple responses (to questions, for example) from the same people, students grouped into classes, and measures collected over time. One of the ways they account for these complex structures is by reducing bias, which is when a model comes up with an estimate that is off - too large, too small, or maybe too certain (or uncertain) - relative to the true value of the thing being estimated.

So, how are estimates (and their certainty) affected when using a mixed effects model instead of a linear model (i.e., regression)? Let’s first get into two ideas.

Big idea #1: The bias and variance trade off

Looking at models this way, bias represents a model that is “under-fit” (one that does not reflect the underlying patterns in the data), whereas variance represents a model that is “over-fit” (one that fits its underlying data very closely but may not generalize to other data sets). This book by James et al. (2013) describes this well, as does this book by McElreath (2015), which I did not want to like (in part because I read a review by Sweet about it) but did, in part because of its particularly clear language. From this perspective, mixed effects models do not reduce bias so much as balance this trade-off (more on this in a bit).

Let’s set up in R and return to some of the rail-trail data. In brief, the data are reviews for around 200 rail-trails in Michigan, where each trail is associated with one or more (often many more) reviews. The models predict a rail-trail’s reviews based on characteristics of the trail: how long it is, whether it is paved or crushed stone, and so on.

Big idea #2: “Pooling” of estimates

The frame for looking at these models is the idea of complete pooling (what we do when we specify a model with no group information), no pooling (what we do when we specify a model with group information through something like dummy codes), and partial pooling (what we do when we specify a model with group information through a mixed effects model). To establish a few terms - with “group” referring to some unit such as a class (for students grouped into classes) or a person (in the case of repeated measures or longitudinal data) - here is how we model each:

  • Complete pooling: A linear (or regression - we can even think of it as an ANCOVA) model with no dummy codes for groups
  • No pooling: A linear (or regression) model with groups dummy-coded
  • Partial pooling: A regression with intercepts or other coefficients (i.e., slopes) treated as random effects in a mixed-effects (or multi-level) model

In their (sort of classic) book, Gelman and Hill (2007) describe these ideas (partial pooling as a way to think about what mixed effects models do, with regressions representing the complete pooling and no pooling alternatives) in a way that I find particularly easy to understand, after having had a hard time understanding mixed effects models when I initially learned about them. The way they treat partial pooling is from a (bit of a) Bayesian perspective: the group-specific predictions are a function of a prior estimated from the overall data.

The predictions are partially pooled in the direction of the estimates for the overall data, with how much pooling occurring determined not only by the mean values of the observations in each group but also by how much variation exists within the observations in each group. This is just a part of a fully Bayesian approach: partly because only the random effects are estimated with a prior distribution (if I understand correctly, in a fully Bayesian approach every parameter would be estimated with a prior distribution), and partly because the way we interpret (at least the fixed effects in) mixed effects models is the same as for most (non-Bayesian) approaches, like a regression (for which we use the Central Limit Theorem to make inferences), whereas with a Bayesian approach we can make direct statements about how certain or uncertain we are about an estimate.
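To make this concrete, Gelman and Hill (2007) give an approximate formula for the partially pooled estimate of a group’s intercept. Written here for a simple varying-intercept model, and using generic notation rather than anything specific to the rail-trail data, the estimate for group $j$ is roughly

$$\hat{\alpha}_j \approx \frac{\frac{n_j}{\sigma^2_y} \bar{y}_j + \frac{1}{\sigma^2_\alpha} \bar{y}_{\text{all}}}{\frac{n_j}{\sigma^2_y} + \frac{1}{\sigma^2_\alpha}}$$

where $n_j$ is the number of observations in group $j$, $\bar{y}_j$ is that group’s mean, $\bar{y}_{\text{all}}$ is the overall mean, $\sigma^2_y$ is the within-group variance, and $\sigma^2_\alpha$ is the between-group variance. In other words, each group’s estimate is a precision-weighted average of its own mean and the overall mean: groups with fewer (or noisier) observations get pooled more strongly toward the overall estimate.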

This post will be a bit of a draft for now, as I hope to return to these ideas in more depth in the future.

Processing the data

library(tidyverse) # these are packages that are loaded to carry out the analysis
library(lme4)      # for the mixed effects model (lmer())
library(stringr)   # for processing the surface and distance strings
library(forcats)   # for recoding and releveling factors
library(broom)     # for tidying the model output

f <- here::here("static", "data", "mi.rds")
df <- read_rds(f) # this is a file with the rail-trail data - you can get it from here: https://github.com/jrosen48/railtrail

df <- df %>% 
    unnest(raw_reviews) %>%          # one row per review, rather than one row per trail
    filter(!is.na(raw_reviews)) %>%  # drop trails with no reviews
    rename(raw_review = raw_reviews,
           trail_name = name)

We’ll process the data a bit.

df <- df %>% 
    mutate(category = as.factor(category),
           category = fct_recode(category, "Greenway/Non-RT" = "Canal"), # fold "Canal" in with greenways
           mean_review = ifelse(mean_review == 0, NA, mean_review))      # recode mean reviews of 0 as missing

# recode the many specific surface types into three broader categories
df <- mutate(df,
             surface_rc = case_when(
                 surface == "Asphalt" ~ "Paved",
                 surface == "Asphalt, Concrete" ~ "Paved",
                 surface == "Concrete" ~ "Paved",
                 surface == "Asphalt, Boardwalk" ~ "Paved",
                 str_detect(surface, "Stone") ~ "Crushed Stone",
                 str_detect(surface, "Ballast") ~ "Crushed Stone",
                 str_detect(surface, "Gravel") ~ "Crushed Stone",
                 TRUE ~ "Other" # everything else (e.g., dirt)
             )
)

df$surface_rc <- as.factor(df$surface_rc)

# make Crushed Stone the reference level for the surface comparisons
df$surface_rc <- fct_relevel(df$surface_rc,
                             "Crushed Stone")

# extract the numeric mileage from the distance string
df$distance <- as.numeric(str_extract(df$distance, "[[:digit:]]+\\.*[[:digit:]]*"))
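As a quick sanity check of that regular expression (using a made-up distance string, since the exact values in the data will vary):

str_extract("9.7 miles", "[[:digit:]]+\\.*[[:digit:]]*") # returns "9.7"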

Specifying models

We will use the same characteristics to predict how rail-trails are rated as in this earlier post.

Complete pooling

Here is the complete pooling model (no information about the grouping of the data is used, so the group estimates are - in a sense, because we never used group information in the first place! - pooled together):

m_complete_pooling <- lm(raw_review ~ 1 + distance + category + surface_rc, data = df)

No pooling

Here is the no pooling model (information about the grouping of the data is used, to the extent that each group is considered independently, and so the estimates are not pooled together at all):

m_no_pooling <- lm(raw_review ~ 1 + distance + category + surface_rc + trail_name, data = df)
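One way to appreciate how many parameters this model estimates (a quick check; the exact count depends on how many trails are in the data):

# each trail gets its own dummy-coded coefficient, in addition to the
# intercept, distance, category, and surface terms
length(coef(m_no_pooling))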

Partial pooling

And here is the partial pooling model (information about the grouping of the data is used, unlike in the complete pooling model; but, unlike in the no pooling model, some of the overall information is used as well, and thus partial pooling of the estimates takes place):

m_partial_pooling <- lmer(raw_review ~ 1 + distance + category + surface_rc + (1|trail_name), data = df)
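To see how much between-trail variation the model estimates - which, per the pooling idea above, drives how strongly the trail-level estimates are pooled - something like the following should work:

# the trail-level (intercept) standard deviation relative to the residual
# standard deviation indicates how much of the variation is between trails
VarCorr(m_partial_pooling)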

Results from analysis

Let’s take a look at their estimates, standard errors, and test statistics, just for the intercept and the predictors (distance, rail-trail versus greenway, and other or paved surfaces relative to crushed stone):

results_df <- bind_rows(
    mutate(tidy(m_complete_pooling)[, 1:4], group = "Complete Pooling"),
    mutate(tidy(m_partial_pooling)[1:5, 1:4], group = "Partial Pooling"), # the first five rows are the fixed effects
    mutate(tidy(m_no_pooling)[1:5, 1:4], group = "No Pooling")            # drop the many trail dummy codes
)
## Warning in bind_rows_(x, .id): binding factor and character vector,
## coercing into character vector
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector

Let’s then create a table of the estimates with their standard errors for the three options - complete pooling, no pooling, and partial pooling:

results_df %>% 
    mutate(est_se = paste0(round(estimate, 2), " (", round(std.error, 2), ")")) %>% 
    select(term, group, est_se) %>% 
    spread(group, est_se)
## # A tibble: 5 x 4
##   term               `Complete Pooling` `No Pooling` `Partial Pooling`
##   <chr>              <chr>              <chr>        <chr>            
## 1 (Intercept)        3.68 (0.07)        4.05 (0.24)  3.81 (0.22)      
## 2 categoryRail-Trail 0.15 (0.05)        -0.81 (0.25) -0.02 (0.17)     
## 3 distance           0 (0)              0.07 (0.03)  0 (0.01)         
## 4 surface_rcOther    0.41 (0.09)        0.85 (0.18)  -0.24 (0.34)     
## 5 surface_rcPaved    0.47 (0.05)        -0.37 (0.42) 0.41 (0.18)
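
We can also look at the pooling at the level of individual trails. As a sketch (not part of the table above), the per-trail intercepts from the mixed effects model - the fixed intercept combined with each trail’s random effect - can be inspected like this; trails with fewer reviews will tend to sit closer to the overall intercept:

# per-trail coefficients (fixed effects plus random effects) by trail
head(coef(m_partial_pooling)$trail_name)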

Interpretation

For four of the five estimates - (Intercept), which is just the average rating of the trail from 1-5; categoryRail-Trail (an indicator comparing rail-trails to trails technically designated as greenways or pathways); distance (how long the trail is, in miles); and surface_rcPaved (an indicator comparing paved trails to those made up of crushed stone) - the standard errors for the partial pooling model are narrower than those for the no pooling model, but wider than those for the complete pooling model. These standard errors imply confidence intervals: the range in which we would expect 95% of the estimates for other rail-trails exhibiting the same characteristics as those in Michigan to be. For the remaining estimate, surface_rcOther (comparing surfaces like dirt to crushed stone), the confidence interval for the partial pooling model is the widest of the three.

Let’s interpret what these overall patterns suggest. The complete pooling model is the one in which we do not include any group information at all, so the data are pooled together. The no pooling model is the one in which information about the grouping is used, but none of the overall information is.

Regression with no dummy codes for groups can result in a model with higher bias

It would make sense for the complete pooling model to have the narrowest confidence intervals of the three: since none of the group information is used, the model treats every review as if it were independent, and so it is very sure about what is going on (i.e., our estimates have a narrow confidence interval, or a small standard error - one that may be too narrow given that reviews of the same trail are related to one another). In terms of the bias-variance trade off, complete pooling can result in a model that has too high bias: it under-fits by ignoring systematic differences between trails.

Regression with dummy-coded groups can result in a model with higher variance

On the other hand, it also would make sense for the no pooling model to have the widest confidence intervals. In this case, we use the group information, but each group’s estimate rests only on the observations within that group; since none of the overall information is used, for some of the groups we are basing what we think is going on on a small number of data points, and so our estimates are noisy (i.e., they have wide confidence intervals, and accordingly large standard errors). In terms of the trade off, no pooling results in a model that has too high variance: it over-fits to the particulars of each group.
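
To see how little information some of those group estimates rest on, we can count the reviews per trail (a quick check, not something from the original analysis):

# the trails with the fewest reviews will have the noisiest dummy-coded estimates
df %>% count(trail_name) %>% arrange(n) %>% head()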

Mixed effects (or multi-level) models can result in a model that balances bias and variance

In both cases, partial pooling - the way we use the group information when we use a mixed effects model - compromises between these two approaches. Partial pooling balances the bias-variance trade off. One minor mystery at this point has to do with why the estimate and standard error for surface_rcOther do not follow this pattern, unlike the other four estimates and their standard errors. Perhaps this has to do with the relatively small number of observations associated with other surfaces (just over 100 reviews, compared to more than 500 for each of the other surfaces), with the mixed effects model treating those observations as carrying less information, leading to an estimate closer to zero and a wider confidence interval than in the linear models.
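
One way to check the number of reviews associated with each surface type (again, just a quick check):

# counts of reviews per recoded surface category
df %>% count(surface_rc)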

A few other things

The two other posts I have written using mixed effects models are here and here.

I also wrote a post presenting some (similar versions of) these ideas through plots of the estimates for complete, no, and partial pooling models (with some code from Mahr).