How many groups of Star Wars characters are there? R-squared and cross-validation approaches

2017/07/02

Background

How many groups, or types, of Star Wars characters are there? I’ve been wanting to use the starwars dataset built into the dplyr package and, at the same time, have been working hard on an R package to carry out an analysis suited to doing this. Part of the challenge of using the approach in this R package is determining how many groups there are.

Many approaches (Latent Profile Analysis, for example) use Maximum Likelihood estimation, while the approach I’ve developed uses a two-step cluster analysis based around the geometric (and algebraic) idea of “distance”, or how close (similar) observations are. This is easy enough when we’re talking about something like length. If something is 4 long and another thing is 8 long, then what is their distance? (4!) When we’re talking about more than just length - say, length and width - then it’s the exact same idea, except the distance represents how far apart two things are across both measures - length and width.
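To make the idea of distance concrete, here is a minimal sketch using base R’s dist() function. It assumes Euclidean distance, which may or may not be the exact measure used inside the package, and the widths (3 and 6) are made-up values for illustration only.

```r
# One measure: two things with lengths 4 and 8
lengths <- c(4, 8)
one_measure <- as.numeric(dist(lengths))

# Two measures: the same lengths plus made-up widths of 3 and 6
things <- rbind(
  a = c(length = 4, width = 3),
  b = c(length = 8, width = 6)
)
# Euclidean distance: sqrt((8 - 4)^2 + (6 - 3)^2)
two_measures <- as.numeric(dist(things, method = "euclidean"))

one_measure   # 4
two_measures  # 5
```

With one measure the distance is just the difference in length; with two, the differences on both measures are combined into a single number.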

But back to groups of Star Wars characters. How many are there? Let’s see what data we have:

library(dplyr)

starwars
## # A tibble: 87 x 13
##    name  height  mass hair_color skin_color eye_color birth_year gender
##    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
##  1 Luke…    172    77 blond      fair       blue            19   male  
##  2 C-3PO    167    75 <NA>       gold       yellow         112   <NA>  
##  3 R2-D2     96    32 <NA>       white, bl… red             33   <NA>  
##  4 Dart…    202   136 none       white      yellow          41.9 male  
##  5 Leia…    150    49 brown      light      brown           19   female
##  6 Owen…    178   120 brown, gr… light      blue            52   male  
##  7 Beru…    165    75 brown      light      blue            47   female
##  8 R5-D4     97    32 <NA>       white, red red             NA   <NA>  
##  9 Bigg…    183    84 black      light      brown           24   male  
## 10 Obi-…    182    77 auburn, w… fair       blue-gray       57   male  
## # … with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>

It looks like we only have three measures that are numbers (height, mass, and birth_year) - though there are others we could possibly turn into numbers (maybe), and there are other approaches (Latent Class Analysis) that can deal with non-numeric measures (such as hair_color). But we’ll have to stick to the three measures that are numbers, for better or worse, for now.

R-squared

Let’s first take a look at the R-squared values, which are obtained from the second of the two steps of the cluster analysis - the k-means step. (I say this because there are other, perhaps better, ways to calculate the R-squared values, such as from a MANOVA.)
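For readers who want to see the moving parts, here is a rough sketch of a two-step cluster analysis in base R. It is not the prcr implementation: it assumes the first step is hierarchical clustering with Ward’s method, that the group means from that step seed the k-means step, and that R-squared is the between-group sum of squares divided by the total sum of squares. The helper name two_step_r_squared is hypothetical, and the data are the height and mass of the first ten characters printed above.

```r
# Height and mass of the first ten characters in the starwars printout
first_ten <- data.frame(
  height = c(172, 167, 96, 202, 150, 178, 165, 97, 183, 182),
  mass   = c(77, 75, 32, 136, 49, 120, 75, 32, 84, 77)
)

# Hypothetical helper (not part of prcr): returns R-squared for k groups
two_step_r_squared <- function(data, k) {
  x <- scale(data)  # center and scale each measure

  # Step 1: hierarchical clustering on distances, cut into k groups
  tree <- hclust(dist(x), method = "ward.D2")
  start_groups <- cutree(tree, k = k)

  # Group means from step 1 become the starting centers for step 2
  start_centers <- apply(x, 2, function(column) tapply(column, start_groups, mean))

  # Step 2: k-means starting from those centers
  fit <- kmeans(x, centers = start_centers)

  # Proportion of total variability explained by the grouping
  fit$betweenss / fit$totss
}

r_squared <- sapply(2:3, function(k) two_step_r_squared(first_ten, k))
names(r_squared) <- 2:3
r_squared
```

Ten observations are far too few to say anything about the real groups; the point is only to show where an R-squared value for a given number of groups can come from.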

We just list the name of the data and the variables we would like to use. Since birth_year is on a very different metric than the other two variables, we’ll set to_scale and to_center to TRUE. We’ll also return a table, instead of a plot.
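To see what centering and scaling do, here is a small sketch with base R’s scale() on the first five rows printed above. Whether the package calls scale() internally is an assumption; the effect shown (each measure ends up with mean 0 and standard deviation 1) is what centering and scaling generally mean.

```r
# First five characters from the starwars printout above
first_five <- data.frame(
  height     = c(172, 167, 96, 202, 150),
  mass       = c(77, 75, 32, 136, 49),
  birth_year = c(19, 112, 33, 41.9, 19)
)

# Subtract each column's mean, then divide by its standard deviation
scaled <- scale(first_five, center = TRUE, scale = TRUE)

round(colMeans(scaled), 10)  # every column now has mean 0
apply(scaled, 2, sd)         # and standard deviation 1
```

After this step, a measure with large raw values no longer dominates the distances simply because of its units.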

library(prcr)

plot_r_squared(starwars, height, mass, birth_year, to_scale = TRUE, to_center = TRUE, r_squared_table = TRUE)
## ################################
## Clustering data for iteration 2
## Clustering data for iteration 3
## Clustering data for iteration 4
## Clustering data for iteration 5
## Clustering data for iteration 6
## Clustering data for iteration 7
## Clustering data for iteration 8
## Clustering data for iteration 9
## ################################
##   cluster r_squared_value
## 1       2           0.507
## 2       3              NA
## 3       4              NA
## 4       5              NA
## 5       6              NA
## 6       7              NA
## 7       8              NA
## 8       9              NA

Ooh! Not good. Before the second of the two steps settled on the groups, it ended up with a group with no observations. This is probably in part the result of a small sample, and possibly attributable to the measures we used - and maybe some missing data for some of the measures. Let’s take a look at the data:

starwars_ss <- select(starwars, height, mass, birth_year)
skimr::skim(starwars_ss)
Table 1: Data summary

Name                     starwars_ss
Number of rows           87
Number of columns        3
Column type frequency:
  numeric                3
Group variables          None

Variable type: numeric

skim_variable  n_missing  complete_rate    mean      sd  p0    p25  p50    p75  p100  hist
height                 6           0.93  174.36   34.77  66  167.0  180  191.0   264  ▁▁▇▅▁
mass                  28           0.68   97.31  169.46  15   55.6   79   84.5  1358  ▇▁▁▁▁
birth_year            44           0.49   87.57  154.69   8   35.0   52   72.0   896  ▇▁▁▁▁
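Missing-value counts like these can also be checked directly in base R. The sketch below uses only the first ten rows printed earlier, so its counts are much smaller than those for the full data; it also shows that the number of complete rows depends on which columns are included in the check.

```r
# First ten characters from the starwars printout (R5-D4 has no birth_year)
first_ten <- data.frame(
  height     = c(172, 167, 96, 202, 150, 178, 165, 97, 183, 182),
  mass       = c(77, 75, 32, 136, 49, 120, 75, 32, 84, 77),
  birth_year = c(19, 112, 33, 41.9, 19, 52, 47, NA, 24, 57)
)

# Number of missing values in each column
missing_per_column <- colSums(is.na(first_ten))

# Rows with no missing values across all three measures
rows_all_three <- sum(complete.cases(first_ten))

# Rows with no missing values when only height and mass are considered
rows_height_mass <- sum(complete.cases(first_ten[, c("height", "mass")]))

missing_per_column
c(all_three = rows_all_three, height_mass = rows_height_mass)
```

A measure with many missing values shrinks the usable sample for every analysis that includes it.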

It looks like birth_year is missing for a lot - 44 - of the 87 Star Wars characters we have. We’re down to the bare-bones number of measures, but let’s try with just height and mass. We probably don’t need to scale the data now, but the call below keeps to_scale and to_center set to TRUE.

plot_r_squared(starwars, height, mass, to_scale = TRUE, to_center = TRUE, r_squared_table = TRUE)
## ################################
## Clustering data for iteration 2
## Clustering data for iteration 3
## Clustering data for iteration 4
## Clustering data for iteration 5
## Clustering data for iteration 6
## Clustering data for iteration 7
## Clustering data for iteration 8
## Clustering data for iteration 9
## ################################
##   cluster r_squared_value
## 1       2           0.485
## 2       3           0.872
## 3       4              NA
## 4       5              NA
## 5       6           0.977
## 6       7              NA
## 7       8              NA
## 8       9              NA

That’s better - in a sense. We have two-, three-, and six-group solutions. I wouldn’t trust the six-group solution very much. The R-squared value does increase substantially between two and three groups, from 0.485 to 0.872. This suggests maybe there are three groups (when we use just the measures for height and mass).
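One way to read a table like this is to drop the solutions that did not converge and look at how much R-squared gains from one remaining solution to the next. The sketch below re-enters the values from the output above by hand; the column name gain is introduced here for illustration.

```r
# R-squared values copied from the height-and-mass output above
r_squared <- data.frame(
  cluster         = 2:9,
  r_squared_value = c(0.485, 0.872, NA, NA, 0.977, NA, NA, NA)
)

# Keep only the solutions that returned a value
converged <- r_squared[!is.na(r_squared$r_squared_value), ]

# Gain in R-squared relative to the previous converged solution
converged$gain <- c(NA, diff(converged$r_squared_value))
converged
```

Going from two to three groups adds 0.387, while going from three to six adds only 0.105 spread over three extra groups.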

Groups

two_profiles <- create_profiles(starwars, height, mass, n_profiles = 2, to_scale = TRUE, to_center = TRUE)
plot(two_profiles)

three_profiles <- create_profiles(starwars, height, mass, n_profiles = 3, to_scale = TRUE, to_center = TRUE)
plot(three_profiles)

The third group: Massive, not so tall

It looks like there is one very massive (literally) observation that makes up one profile in both the two and three profile solutions. Who is it?

three_profiles$.data %>% 
    filter(cluster == 3) %>% 
    knitr::kable()
name | height | mass | hair_color | skin_color | eye_color | birth_year | gender | homeworld | species | films | vehicles | starships | cluster
Jabba Desilijic Tiure | 175 | 1358 | NA | green-tan, brown | orange | 600 | hermaphrodite | Nal Hutta | Hutt | c("The Phantom Menace", "Return of the Jedi", "A New Hope") | character(0) | character(0) | 3

Jabba. Of course. It looks like with two or three groups, Jabba ends up in a cluster of his own.

The second group: Less massive, small height

What about the seven - who seem to be less massive and with a small height - in the second group?

three_profiles$.data %>% 
    filter(cluster == 2) %>% 
    knitr::kable()
name | height | mass | hair_color | skin_color | eye_color | birth_year | gender | homeworld | species | films | vehicles | starships | cluster
R2-D2 | 96 | 32 | NA | white, blue | red | 33 | NA | Naboo | Droid | c("Attack of the Clones", "The Phantom Menace", "Revenge of the Sith", "Return of the Jedi", "The Empire Strikes Back", "A New Hope", "The Force Awakens") | character(0) | character(0) | 2
R5-D4 | 97 | 32 | NA | white, red | red | NA | NA | Tatooine | Droid | A New Hope | character(0) | character(0) | 2
Yoda | 66 | 17 | white | green | brown | 896 | male | NA | Yoda’s species | c("Attack of the Clones", "The Phantom Menace", "Revenge of the Sith", "Return of the Jedi", "The Empire Strikes Back") | character(0) | character(0) | 2
Wicket Systri Warrick | 88 | 20 | brown | brown | brown | 8 | male | Endor | Ewok | Return of the Jedi | character(0) | character(0) | 2
Sebulba | 112 | 40 | none | grey, red | orange | NA | male | Malastare | Dug | The Phantom Menace | character(0) | character(0) | 2
Dud Bolt | 94 | 45 | none | blue, grey | yellow | NA | male | Vulpter | Vulptereen | The Phantom Menace | character(0) | character(0) | 2
Ratts Tyerell | 79 | 15 | none | grey, blue | unknown | NA | male | Aleen Minor | Aleena | The Phantom Menace | character(0) | character(0) | 2

These seem to be droids, Yoda, and some other tiny characters.

(Some from) the first group: Above average height, below average mass

What about the 51 in the first group, with slightly above average height and slightly below average mass? It’s a big group, so here are just the first six, with a lot of familiar characters:

three_profiles$.data %>% 
    filter(cluster == 1) %>% 
    head() %>% 
    knitr::kable()
name | height | mass | hair_color | skin_color | eye_color | birth_year | gender | homeworld | species | films | vehicles | starships | cluster
Luke Skywalker | 172 | 77 | blond | fair | blue | 19.0 | male | Tatooine | Human | c("Revenge of the Sith", "Return of the Jedi", "The Empire Strikes Back", "A New Hope", "The Force Awakens") | c("Snowspeeder", "Imperial Speeder Bike") | c("X-wing", "Imperial shuttle") | 1
C-3PO | 167 | 75 | NA | gold | yellow | 112.0 | NA | Tatooine | Droid | c("Attack of the Clones", "The Phantom Menace", "Revenge of the Sith", "Return of the Jedi", "The Empire Strikes Back", "A New Hope") | character(0) | character(0) | 1
Darth Vader | 202 | 136 | none | white | yellow | 41.9 | male | Tatooine | Human | c("Revenge of the Sith", "Return of the Jedi", "The Empire Strikes Back", "A New Hope") | character(0) | TIE Advanced x1 | 1
Leia Organa | 150 | 49 | brown | light | brown | 19.0 | female | Alderaan | Human | c("Revenge of the Sith", "Return of the Jedi", "The Empire Strikes Back", "A New Hope", "The Force Awakens") | Imperial Speeder Bike | character(0) | 1
Owen Lars | 178 | 120 | brown, grey | light | blue | 52.0 | male | Tatooine | Human | c("Attack of the Clones", "Revenge of the Sith", "A New Hope") | character(0) | character(0) | 1
Beru Whitesun lars | 165 | 75 | brown | light | blue | 47.0 | female | Tatooine | Human | c("Attack of the Clones", "Revenge of the Sith", "A New Hope") | character(0) | character(0) | 1

Cross-validation

The other technique for determining the number of groups, cross-validation, may be folly because of how it works: Split the data into two, and see how well groups in one half can be reproduced in the other. This may be a problem due to the Jabba-group.

We’ll use the same arguments as before, except for r_squared_table, which we don’t need, plus one more argument, n_profiles, which sets how many groups to cross-validate. We also have to deal with complete cases, which is what the first line below is for:

starwars_ss <- starwars_ss[complete.cases(starwars_ss), ]
cross_validate(starwars_ss, height, mass, n_profiles = 2, to_scale = TRUE, to_center = TRUE)

Not pretty. Convergence issues galore (I decided not to print the messages because there were so many). The Fleiss’ Kappa was close to 0; the percentage agreement 0.61.
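The split-half idea behind cross-validation can be sketched in base R. This is not the cross_validate() implementation: it swaps in plain k-means for the two-step approach, measures agreement as the share of pairs of observations treated the same way by both groupings (a Rand-index-style measure) rather than Fleiss’ Kappa, and uses the built-in iris data so that it runs without extra packages. The helper name nearest_center is hypothetical.

```r
set.seed(2017)

# Two scaled measures from the built-in iris data (stand-in for height and mass)
x <- scale(iris[, c("Sepal.Length", "Petal.Length")])

# Randomly split the rows into two halves
in_half_a <- sample(rep(c(TRUE, FALSE), length.out = nrow(x)))
half_a <- x[in_half_a, ]
half_b <- x[!in_half_a, ]

# Cluster each half separately
k <- 3
fit_a <- kmeans(half_a, centers = k, nstart = 25)
fit_b <- kmeans(half_b, centers = k, nstart = 25)

# Hypothetical helper: index of the closest center for every point
nearest_center <- function(points, centers) {
  n_centers <- nrow(centers)
  d <- as.matrix(dist(rbind(centers, points)))
  d <- d[-seq_len(n_centers), seq_len(n_centers), drop = FALSE]
  apply(d, 1, which.min)
}

# Group half B using the centers found in half A
assigned_from_a <- nearest_center(half_b, fit_a$centers)
own_groups_b <- fit_b$cluster

# Do two observations share a group? Compare the answer under both groupings
together_a <- outer(assigned_from_a, assigned_from_a, "==")
together_b <- outer(own_groups_b, own_groups_b, "==")
pairs <- upper.tri(together_a)
agreement <- mean(together_a[pairs] == together_b[pairs])
agreement
```

Comparing pairs rather than labels sidesteps the fact that group numbers are arbitrary in each half. A group with a single member, like Jabba’s, can land in only one half, which is one reason this kind of check can struggle here.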

Conclusion

Looking at height and mass, we seem to be able to identify three broad groups of Star Wars characters. However, we shouldn’t have a ton of confidence in how well these groups generalize to all Star Wars characters: Our sample is small, the measures we could use were limited, and our cross-validation did not provide us with much evidence to back up our three distinct groups.

On the other hand, we did have a starting point for how many groups to look for from our R-squared values, which was good, and the groups seem interpretable on the basis of the characters in our three groups.

Try it out

The prcr package used to create the groups and calculate the R-squared values is available in R using install.packages("prcr"). An in-development version with the function for cross-validation is available using the following two commands (if you have devtools installed already, then only the second command is needed):

install.packages("devtools")
devtools::install_github("jrosen48/prcr")

Thanks and credit to Rebecca Steingut, now at Teachers College, Columbia University, for contributing to the in-development version of the package and the cross-validation strategy implemented in it.