Background
I was curious about which rail trails are the best in Michigan, so I turned to the TrailLink website, sponsored by the Rails-to-Trails Conservancy. I had just purchased a copy of their book Rail-Trails Michigan and Wisconsin and wanted to see whether I could learn more from the website.
To start, I checked whether they had a way to access the reviews on the site through an API. They didn’t, so I checked their robots.txt file at http://traillink.com/robots.txt. They didn’t disallow access to their reviews for each state, so I was able to download all of the reviews for the 259 trails with reviews in Michigan.
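As a rough illustration of what checking a robots.txt file involves, here is a minimal base R sketch. The file contents and the paths below are hypothetical (they are not TrailLink's actual rules), and the check is deliberately simplified: it ignores User-agent grouping, Allow rules, and wildcards.

```r
# Hypothetical robots.txt content, for illustration only; the real file may differ.
robots_txt <- c(
  "User-agent: *",
  "Disallow: /admin/",
  "Disallow: /account/"
)

# Collect the path prefixes listed in "Disallow:" rules.
disallowed <- sub("^Disallow:\\s*", "", grep("^Disallow:", robots_txt, value = TRUE))
disallowed <- disallowed[nzchar(disallowed)]

# Treat a path as allowed if it starts with none of the disallowed prefixes.
is_allowed <- function(path, rules = disallowed) {
  !any(startsWith(path, rules))
}

is_allowed("/trail/some-trail/")  # hypothetical review page path
is_allowed("/admin/settings")     # falls under a disallowed prefix
```

A real check should read the live file and respect the full set of rules that apply to your crawler.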
library(tidyverse)
library(hrbrthemes)
library(viridis)
library(forcats)
library(stringr)
library(lme4)
library(broom)
f <- here::here("static", "data", "mi.rds")
df <- read_rds(f) # this is a file with the rail-trail data - you can get it from here: https://github.com/jrosen48/railtrail
df <- df %>%
unnest(raw_reviews) %>%
filter(!is.na(raw_reviews)) %>%
rename(raw_review = raw_reviews,
trail_name = name) %>%
mutate(trail_name = str_sub(trail_name, end = -7L),
distance = str_sub(distance, end = -6L),
distance = as.numeric(distance),
n_reviews = str_sub(n_reviews, end = -9L),
n_reviews = as.numeric(n_reviews))
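The cleaning step above relies on str_sub() with a negative end, which counts from the end of the string. Here is a small sketch of how that behaves, using made-up strings (the exact text stored in the scraped fields is an assumption here), along with a pattern-based alternative that does not depend on the suffix having a fixed length.

```r
library(stringr)

# Hypothetical raw strings; the scraped fields may be formatted differently.
n_reviews_raw <- c("78 Reviews", "5 Reviews")

# end = -9L keeps everything up to the ninth-from-last character,
# which drops the final eight characters (here, " Reviews").
trimmed <- as.numeric(str_sub(n_reviews_raw, end = -9L))

# Alternative: pull out the first run of digits (and decimal points) directly.
extracted <- as.numeric(str_extract(n_reviews_raw, "[0-9.]+"))

trimmed
extracted
```

The fixed-offset version is compact, but it silently breaks if a suffix changes length, so the pattern-based version may be the safer choice for new data.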
What are the characteristics of the best trails?
On the site, there are “surfaces” (e.g., asphalt and gravel) and “categories” (e.g., rail-trail and paved pathway), so I tried to group them into a few categories.
df <- df %>%
mutate(category = as.factor(category),
category = forcats::fct_recode(category, "Greenway/Non-RT" = "Canal"),
mean_review = ifelse(mean_review == 0, NA, mean_review))
df <- mutate(df,
surface_rc = case_when(
surface == "Asphalt" ~ "Paved",
surface == "Asphalt, Concrete" ~ "Paved",
surface == "Concrete" ~ "Paved",
surface == "Asphalt, Boardwalk" ~ "Paved",
str_detect(surface, "Stone") ~ "Crushed Stone",
str_detect(surface, "Ballast") ~ "Crushed Stone",
str_detect(surface, "Gravel") ~ "Crushed Stone",
TRUE ~ "Other"
)
)
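One thing to keep in mind with case_when() is that conditions are checked in order and the first match wins. The sketch below applies the same rules to a few made-up surface labels (assumed for illustration, not taken from the data) to show how mixed surfaces end up classified.

```r
library(dplyr)
library(stringr)

# Hypothetical surface labels, for illustration; the site's values may differ.
surfaces <- data.frame(
  surface = c("Asphalt", "Asphalt, Crushed Stone", "Gravel, Dirt", "Grass"),
  stringsAsFactors = FALSE
)

recoded <- surfaces %>%
  mutate(
    surface_rc = case_when(
      surface == "Asphalt" ~ "Paved",
      surface == "Asphalt, Concrete" ~ "Paved",
      surface == "Concrete" ~ "Paved",
      surface == "Asphalt, Boardwalk" ~ "Paved",
      str_detect(surface, "Stone") ~ "Crushed Stone",
      str_detect(surface, "Ballast") ~ "Crushed Stone",
      str_detect(surface, "Gravel") ~ "Crushed Stone",
      TRUE ~ "Other"
    )
  )

recoded
```

Because the "Paved" rules are exact matches, a label such as "Asphalt, Crushed Stone" skips them and is picked up by the "Stone" pattern instead, so partly paved trails are grouped with crushed stone.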
Then, I checked out their mean reviews, from one to five stars.
Some trails had a ton of reviews:
df %>%
select(trail_name, surface_rc, category, distance, n_reviews) %>%
distinct() %>%
arrange(desc(n_reviews)) %>%
head(5) %>%
knitr::kable()
trail_name | surface_rc | category | distance | n_reviews |
---|---|---|---|---|
Lakelands Trail State Park | Crushed Stone | Rail-Trail | 26.0 | 78 |
Pere Marquette Rail-Trail | Paved | Rail-Trail | 30.0 | 75 |
Fred Meijer White Pine Trail State Park | Crushed Stone | Rail-Trail | 92.6 | 66 |
William Field Memorial Hart-Montague Trail State Park | Paved | Rail-Trail | 22.7 | 48 |
Kal-Haven Trail Sesquicentennial State Park | Crushed Stone | Rail-Trail | 34.0 | 47 |
And some had very few reviews: 60 of the trails had only one review!
Some of the trails with a single review were rated highly (five stars):
df %>%
select(trail_name, surface_rc, category, distance, n_reviews, mean_review) %>%
distinct() %>%
filter(n_reviews == 1) %>%
arrange(desc(mean_review)) %>%
head(5) %>%
knitr::kable()
trail_name | surface_rc | category | distance | n_reviews | mean_review |
---|---|---|---|---|---|
Big Rapids Riverwalk | Crushed Stone | Greenway/Non-RT | 3.8 | 1 | 5 |
Boardman Lake Trail | Crushed Stone | Rail-Trail | 2.0 | 1 | 5 |
Cannon Township Trail | Paved | Greenway/Non-RT | 4.0 | 1 | 5 |
Chippewa Trail | Paved | Greenway/Non-RT | 4.1 | 1 | 5 |
Grass River Natural Area Rail Trail | Crushed Stone | Rail-Trail | 2.2 | 1 | 5 |
Some of the trails with one review were rated very low:
df %>%
select(trail_name, surface_rc, category, distance, n_reviews, mean_review) %>%
distinct() %>%
filter(n_reviews == 1) %>%
arrange(mean_review) %>%
head(5) %>%
knitr::kable()
trail_name | surface_rc | category | distance | n_reviews | mean_review |
---|---|---|---|---|---|
Alpena to Hillman Trail | Crushed Stone | Rail-Trail | 22.0 | 1 | 1 |
Felch Grade Trail | Crushed Stone | Rail-Trail | 38.0 | 1 | 1 |
Interurban Trail (Kent County) | Paved | Rail-Trail | 2.0 | 1 | 2 |
Linear Trail Park | Paved | Greenway/Non-RT | 16.9 | 1 | 2 |
Albion River Trail | Paved | Rail-Trail | 1.6 | 1 | 3 |
Building a model
To figure out which trails had many good reviews, I did not simply average all of the reviews for each trail. Instead, I used a rating that takes into account the individual reviews for a trail, how much they differ from each other, and how much they differ from the “average” review across every trail.
What if, instead, we just looked at the top-reviewed trails and then sorted them by how many reviews they had? Because many trails’ average review was five, this does not help much.
These ratings (model_based_rating below) are from the mixed effects model specified here:
m1 <- lmer(raw_review ~ 1 + (1|trail_name), data = df)
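To build some intuition for what a random-intercept model like this does, here is a simplified sketch of partial pooling. It assumes the overall mean and the two variances are known, whereas lmer() estimates them from the data, and every number below is hypothetical.

```r
# Hypothetical values, chosen only to illustrate partial pooling.
grand_mean <- 4.0  # overall average review (the fixed intercept)
tau2 <- 0.5        # between-trail variance
sigma2 <- 1.0      # within-trail (residual) variance

# With known variances, a trail's raw deviation from the grand mean is
# multiplied by tau2 / (tau2 + sigma2 / n), which approaches 1 as n grows.
shrunken_rating <- function(raw_mean, n) {
  weight <- tau2 / (tau2 + sigma2 / n)
  grand_mean + weight * (raw_mean - grand_mean)
}

shrunken_rating(raw_mean = 5, n = 1)   # pulled strongly toward the grand mean
shrunken_rating(raw_mean = 5, n = 20)  # stays close to the raw mean
```

Under these assumptions, a single five-star review moves a trail's rating only part of the way from the overall average toward five, while twenty five-star reviews move it most of the way.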
The data has to be merged back into the data frame with the other characteristics of the trail:
# Overall (fixed) intercept: the estimated average review across all trails
m1_fe <- fixef(m1)[["(Intercept)"]]
# Each trail's estimated deviation from that average, plus the average itself
estimated_trail_means <- ranef(m1)$trail_name %>%
rownames_to_column() %>%
as_tibble() %>%
rename(trail_name = rowname, estimated_mean = `(Intercept)`) %>%
mutate(model_based_rating = estimated_mean + m1_fe)
df_ss <- df %>%
group_by(trail_name) %>%
summarize(raw_mean = mean(raw_review))
df_out <- left_join(df_ss, estimated_trail_means, by = "trail_name")
df_out <- left_join(df_out, df, by = "trail_name")
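As a cross-check on the arithmetic behind model_based_rating, lme4's coef() returns the per-group intercepts with the fixed intercept already added in. The sketch below uses simulated data (the trail names, sample sizes, and variances are all made up, and the simulated reviews are continuous rather than one to five stars) to show that the two routes agree.

```r
library(lme4)

set.seed(1)
n_trails <- 20
n_per_trail <- sample(2:30, n_trails, replace = TRUE)
trail_effect <- rnorm(n_trails, mean = 0, sd = 0.7)

# Simulated reviews for hypothetical trails with unequal numbers of reviews.
sim <- data.frame(
  trail_name = rep(paste0("trail_", seq_len(n_trails)), times = n_per_trail)
)
sim$raw_review <- 4 + rep(trail_effect, times = n_per_trail) +
  rnorm(nrow(sim), sd = 0.8)

m_sim <- lmer(raw_review ~ 1 + (1 | trail_name), data = sim)

# Fixed intercept plus each trail's estimated deviation...
by_hand <- fixef(m_sim)[["(Intercept)"]] + ranef(m_sim)$trail_name[["(Intercept)"]]
# ...should match the per-trail intercepts reported by coef().
from_coef <- coef(m_sim)$trail_name[["(Intercept)"]]

all.equal(by_hand, from_coef)
```

Using coef() directly would be a slightly shorter way to get the same ratings, at the cost of not keeping the deviation as its own column.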
So, where are we riding next?
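Here are the top-10 trails of any length. The estimated_mean column is each trail’s estimated deviation from the overall average review, so sorting by it gives the same order as sorting by model_based_rating: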
df_out %>%
select(trail_name, surface_rc, distance, category, estimated_mean, raw_mean, n_reviews) %>%
distinct() %>%
arrange(desc(estimated_mean)) %>%
mutate_if(is.numeric, function(x) round(x, 3)) %>%
head(10) %>%
knitr::kable()
trail_name | surface_rc | distance | category | estimated_mean | raw_mean | n_reviews |
---|---|---|---|---|---|---|
Saginaw Valley Rail Trail | Paved | 11.0 | Rail-Trail | 0.886 | 4.941 | 36 |
Clinton River Park Trail | Paved | 4.5 | Greenway/Non-RT | 0.875 | 4.933 | 17 |
Leelanau Trail | Paved | 16.6 | Rail-Trail | 0.829 | 4.900 | 20 |
Wayne County Metroparks Trail | Paved | 16.3 | Greenway/Non-RT | 0.815 | 4.889 | 9 |
Southern Links Trailway | Other | 10.2 | Rail-Trail | 0.811 | 4.853 | 39 |
Mackinac Island Loop (State Highway 185) | Paved | 8.3 | Greenway/Non-RT | 0.796 | 4.875 | 11 |
Detroit RiverWalk | Paved | 3.5 | Greenway/Non-RT | 0.779 | 5.000 | 3 |
Fred Meijer Pioneer Trail | Paved | 5.4 | Rail-Trail | 0.779 | 5.000 | 3 |
Grand Haven Waterfront Trail | Paved | 2.5 | Rail-Trail | 0.779 | 5.000 | 4 |
Granger Meadows Park Trail | Paved | 1.9 | Greenway/Non-RT | 0.779 | 5.000 | 2 |
What if we wanted to take a shorter trip - one less than 10 miles?
df_out %>%
select(trail_name, surface_rc, distance, category, estimated_mean, raw_mean, n_reviews) %>%
distinct() %>%
filter(distance < 10) %>%
arrange(desc(estimated_mean), desc(n_reviews)) %>%
head(10) %>%
knitr::kable()
trail_name | surface_rc | distance | category | estimated_mean | raw_mean | n_reviews |
---|---|---|---|---|---|---|
Clinton River Park Trail | Paved | 4.5 | Greenway/Non-RT | 0.8747665 | 4.933333 | 17 |
Mackinac Island Loop (State Highway 185) | Paved | 8.3 | Greenway/Non-RT | 0.7962137 | 4.875000 | 11 |
Grand Haven Waterfront Trail | Paved | 2.5 | Rail-Trail | 0.7789488 | 5.000000 | 4 |
Stony Creek Metropark Trail | Paved | 6.2 | Greenway/Non-RT | 0.7789488 | 5.000000 | 4 |
Detroit RiverWalk | Paved | 3.5 | Greenway/Non-RT | 0.7789488 | 5.000000 | 3 |
Fred Meijer Pioneer Trail | Paved | 5.4 | Rail-Trail | 0.7789488 | 5.000000 | 3 |
Granger Meadows Park Trail | Paved | 1.9 | Greenway/Non-RT | 0.7789488 | 5.000000 | 2 |
Western Gateway Trail | Paved | 6.0 | Rail-Trail | 0.7789488 | 5.000000 | 2 |
Paint Creek Trail (MI) | Crushed Stone | 8.9 | Rail-Trail | 0.7301712 | 4.785714 | 26 |
Dequindre Cut Greenway | Paved | 1.8 | Rail-Trail | 0.7091632 | 4.777778 | 12 |
Conclusion
This model-based approach is powerful because we can figure out which trails are rated higher (or lower) while taking into account how many reviews we have for each trail. The approach is useful in research as well: grades for students in classrooms, for example, can be analyzed in the same way if we want to learn which students are consistently performing differently (for better or worse!).
The code to download the reviews is here. The code in this post can be used to do a similar analysis.