Using characteristics of rail-trails to predict how they are rated

2017/08/02

Catching up

I wrote a blog post (one that, to be honest, I liked a lot) on what the best rail-trails are in Michigan (here). A friend and colleague at MSU, Andy, noticed that paved trails seemed to be rated higher, and this as well as my cfriend and colleague Kristy’s comment about how we can use the output of the the previous post sparked my curiosity in trying to figure out what characteristics predict how highly (or not highly) rated trails are.

Let’s start the same way we did before.

library(tidyverse)
library(lme4)
library(stringr)
library(sjPlot)
library(forcats)

f <- here::here("static", "data", "mi.rds")
df <- read_rds(f) # this is a file with the rail-trail data - you can get it from here: https://github.com/jrosen48/railtrail

df <- df %>% 
    unnest(raw_reviews) %>% 
    filter(!is.na(raw_reviews)) %>% 
    rename(raw_review = raw_reviews,
           trail_name = name) %>% 
    mutate(trail_name = str_sub(trail_name, end = -7L),
           distance = str_sub(distance, end = -6L),
           distance = as.numeric(distance),
           n_reviews = str_sub(n_reviews, end = -9L),
           n_reviews = as.numeric(n_reviews))

We’ll process the data a bit.

df <- df %>% 
    mutate(category = as.factor(category),
           category = fct_recode(category, "Greenway/Non-RT" = "Canal"),
           mean_review = ifelse(mean_review == 0, NA, mean_review))

df <- mutate(df,
             surface_rc = case_when(
                 surface == "Asphalt" ~ "Paved",
                 surface == "Asphalt, Concrete" ~ "Paved",
                 surface == "Concrete" ~ "Paved",
                 surface == "Asphalt, Boardwalk" ~ "Paved",
                 str_detect(surface, "Stone") ~ "Crushed Stone",
                 str_detect(surface, "Ballast") ~ "Crushed Stone",
                 str_detect(surface, "Gravel") ~ "Crushed Stone",
                 TRUE ~ "Other"
             )
)

df$surface_rc <- as.factor(df$surface_rc)

df$surface_rc <- fct_relevel(df$surface_rc,
                             "Crushed Stone")

Note that the other category includes surfaces like dirt and grass.

The model we built (take one)

Previously we fit a model like this:

m1 <- lmer(raw_review ~ 1 + (1|trail_name), data = df)

This model basically estimated the rating for each trail, taking account not only of the 1 - 5 ratings and how different they are from the “average” review across every trail. In short, it estimates ratings that are less biased by how many reviews there are.

Building a model (take two)

It’s a bit boring, and to extend this, we can add the variables for surface (paved, crushed stone, or other), category (rail-trail or greenway), and distance. You can focus on the B in the table above. The intercept represents the overall average rating, which is 3.50. The B for distance represents the increase in rating for each 1-mile increase in distance (0.00!). This suggests longer trails are not necessarily more highly rated, and the p (0.895) - which we use conventionally to find out whether the B is significant if it is below 0.05 - supports this interpretation.

Similarly, the B for rail-trail compared to greenways is small (and the p is far greater than 0.05) as is the case for other surfaces compared to crushed stone (B = -0.35, p = 0.318), but paved surfaces do seem different. They are associated with a rating almost half a point (B = 0.37, p = 0.037) higher than other trails, and almost a whole point (0.72) higher than other surfaces.

m2 <- lmer(raw_review ~ 1 + distance + category + surface_rc + (1|trail_name), data = df)

summary(m2)
## Linear mixed model fit by REML ['lmerMod']
## Formula: 
## raw_review ~ 1 + distance + category + surface_rc + (1 | trail_name)
##    Data: df
## 
## REML criterion at convergence: 7599.3
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -4.2026 -0.4539  0.1829  0.6112  2.0698 
## 
## Random effects:
##  Groups     Name        Variance Std.Dev.
##  trail_name (Intercept) 0.5349   0.7314  
##  Residual               0.8319   0.9121  
## Number of obs: 2757, groups:  trail_name, 116
## 
## Fixed effects:
##                      Estimate Std. Error t value
## (Intercept)         3.8056858  0.2156471  17.648
## distance           -0.0007626  0.0052627  -0.145
## categoryRail-Trail -0.0171616  0.1662643  -0.103
## surface_rcOther    -0.2437101  0.3393114  -0.718
## surface_rcPaved     0.4144877  0.1815039   2.284
## 
## Correlation of Fixed Effects:
##             (Intr) distnc ctgR-T srfc_O
## distance    -0.488                     
## ctgryRl-Trl -0.466 -0.267              
## srfc_rcOthr -0.403  0.253 -0.007       
## surfc_rcPvd -0.782  0.400  0.106  0.401
# sjt.lmer(m2, p.kr = F, show.re.var = F, show.ci = F, show.se = T)

Note that the arguments to sjt.lmer() only have to do with what output is produced.

So, where are we (really) riding next?

This suggests that if you want to ride a good rail-trail, first and foremost pick one that is paved, while whether a trail is technically a rail-trail or a greenway, and whether the trail is short or long, do not seem to matter. Although we will explore this more in later posts.