Writing a recursive function in R - or, tweets on tweets

2020/02/07

Background

Recursion has been a topic I’ve heard about, and even read about it, but have never used. Recently, I was analyzing data for the Next Generation Science Standards hashtag on Twitter, and was puzzled to find that some of the Tweets I had accessed were replies to Tweets which were themselves not in the data set.

Fortunately, the rtweet R package returns data on which Tweet a reply is to; using these, it is possible (using rtweet) to access the original Tweet.

But, this raises an issue: What if the Tweet a reply is to is itself a reply (to another Tweet)? This problem starts to become a little complex. The solution I had tried was to use a while loop; this worked basically fine, but I sensed there might be a better way. I talked this through with a student, and determined to see whether I could get it to work this way.

This post represents my attempt to write a recursive function - one that involves calling itself.

Twitter Data

First, let’s access the data. I use rtweet to access 1,000 recent Tweets with the #rstats hashtag.

library(rtweet)

d <- search_tweets("#rstats", n = 1000)

Writing a recursive function

Here is my attempt at writing a recursive function; it takes one argument, statuses, which is the status ids for the Tweets a reply is to; the reply_to_status_id column in the data collected from rtweet.

get_replies_recursive <- function(statuses, existing_data = NULL) {
  statuses <- statuses[!is.na(statuses)] # this line removes the NAs, which are very common because most Tweets are not replies
  new_data <- lookup_statuses(statuses)
  
  print(paste0("Accessed ", nrow(new_data), " new Tweets"))
  
  new_statuses <- new_data$reply_to_status_id[!is.na(new_data$reply_to_status_id)]
  
  if (is.null(existing_data)) {
    out_data <- new_data
  } else {
    out_data <- rbind(new_data, existing_data)
  }
  
  if (length(new_statuses) > 0) {
    # this is the key line where the function calls itself, but passing new statuses
    get_replies_recursive(new_statuses, existing_data = out_data)
  } else {
    print(paste0("Accessed ", nrow(out_data), " new Tweets in total"))
    return(out_data)
  }
}

Using the function

Using it is simple; you just pass the variable for the status ids for the Tweets a reply is to.

new_replies <- get_replies_recursive(d$reply_to_status_id)
## [1] "Accessed 17 new Tweets"
## [1] "Accessed 4 new Tweets"
## [1] "Accessed 4 new Tweets"
## [1] "Accessed 3 new Tweets"
## [1] "Accessed 1 new Tweets"
## [1] "Accessed 1 new Tweets"
## [1] "Accessed 1 new Tweets"
## [1] "Accessed 31 new Tweets in total"

fin

I’m unsure exactly how much more better (clear, efficient) using recursion is here, compared to the former approach I tried; I think this code is a bit simpler. I am also unsure how generally useful recursion is in real-life programming. Lastly, I’m not sure I did this quite right; I went down a bit of a recursive rabbit holem and the part that is most unusual to me is that the result is something like iteration, arrived at a different way. But, this seemed to work, and I welcome any feedback about it.