In what months are educational psychology jobs posted?

2017/08/15

Division 15 of the American Psychological Association sponsors the Ed Psych Jobs website, an excellent resource for educational psychology job seekers. I thought it would be helpful to see when jobs were posted in past years, to get a better idea of when jobs may be posted this year.

Ed Psych Jobs, Robots (.txt), and paths_allowed, oh my

As this project involves a bit of web scraping, I first checked the robots.txt file (located at http://edpsychjobs.info/robots.txt) to find out whether accessing any or all of the content this way was prohibited. It looks like only the log-in pages are listed as pages that should not be accessed.

If interested, here is a good resource on robots.txt files.

As a shortcut, the robotstxt R package has a neat function, paths_allowed(), that takes the URL of a page on a website and returns TRUE if, according to the site's robots.txt file, accessing that page is allowed. Here is an example, which confirms my manual check of the file (it is better to be safe than sorry with web scraping):

library(robotstxt)
paths_allowed("http://edpsychjobs.info/category/all-jobs/")
## [1] TRUE
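To make the manual check a little more concrete, here is a minimal base R sketch of what reading the Disallow rules amounts to. The robots.txt content below is hypothetical (the real file may differ), is_allowed() is a made-up helper, and the matching is deliberately simplified: it ignores user-agent groups, Allow rules, and wildcards, all of which paths_allowed() is designed to handle.

```r
# Hypothetical robots.txt content, for illustration only
robots_txt <- c(
    "User-agent: *",
    "Disallow: /wp-login.php",
    "Disallow: /wp-admin/"
)

# Keep the lines that are Disallow directives and strip the directive name
disallow_lines <- grep("^disallow:", robots_txt, ignore.case = TRUE, value = TRUE)
disallowed <- trimws(sub("^[^:]*:", "", disallow_lines))

# An empty Disallow value means "nothing is disallowed", so drop empty rules
disallowed <- disallowed[nzchar(disallowed)]

# Simplified check: a path is blocked if it starts with any disallowed prefix
is_allowed <- function(path, rules = disallowed) {
    !any(startsWith(path, rules))
}

is_allowed("/category/all-jobs/")     # TRUE
is_allowed("/wp-admin/options.php")   # FALSE
```

For real scraping, relying on paths_allowed() rather than a hand-rolled check like this is the safer choice.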

Accessing the dates of posts

Let’s get scraping. We will load a few packages and write a short function that scrapes only the dates from one page of results (not the names of the jobs or any other information):

library(rvest)
library(tidyverse)
library(lubridate)
library(hrbrthemes)

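read_the_dates <- function(page, url = "http://edpsychjobs.info/category/all-jobs/page/") {
    # Pause for one second before each request to go easy on the server
    Sys.sleep(1)
    
    # Build the URL for this page of results and download it
    page_url <- paste0(url, page)
    page_html <- read_html(page_url)
    
    # Select the elements with the class "published", which hold the post dates
    date_nodes <- html_nodes(page_html, ".published")
    
    # Return one row per page, with all of that page's dates in a list-column
    tibble(date = list(html_text(date_nodes)))
}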

Notice in this code we used Sys.sleep(1).

Although the robots.txt file did not specify a time delay (some websites request a delay, such as 10 seconds, between page loads for web scrapers), this command adds a 1-second pause before each page load, just to be considerate of the web server's bandwidth.
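When a site does request a delay, it usually does so with a Crawl-delay line in robots.txt. As a rough sketch, the delay could be read from the file and used in place of the hard-coded 1 second. Both the file content and the get_crawl_delay() helper below are hypothetical and only meant to illustrate the idea.

```r
# Hypothetical robots.txt content that does request a delay
robots_with_delay <- c("User-agent: *", "Crawl-delay: 10", "Disallow: /wp-login.php")

# Hypothetical robots.txt content without a delay
robots_without_delay <- c("User-agent: *", "Disallow: /wp-login.php")

# Return the requested delay in seconds, or a default if none is given
get_crawl_delay <- function(lines, default = 1) {
    hits <- grep("^crawl-delay:", lines, ignore.case = TRUE, value = TRUE)
    if (length(hits) == 0) {
        return(default)
    }
    as.numeric(trimws(sub("^[^:]*:", "", hits[1])))
}

get_crawl_delay(robots_with_delay)      # 10
get_crawl_delay(robots_without_delay)   # 1
```

The returned value could then be passed to Sys.sleep() inside the scraping function.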

I checked manually to see how many pages had job postings; there were 76. So I’m just passing the numbers 1 through 76 to the function we wrote above using map_df(), which takes care of the iteration (this is the map part) and returns the results as a single data frame (this is the df part).

Just as a note: another way to do this would be to write a function that keeps requesting pages, starting from page 1, until it reaches a page that does not load. Still another way would be to wrap the function with the very handy possibly() (from the purrr package), or some other function that handles errors (in this case, a page that does not load), and pass a generous range of page numbers, say, from 0 to 100.

output_df <- map_df(1:76, read_the_dates)
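Here is a minimal sketch of the possibly() alternative mentioned above. To keep it runnable without hitting the website, it uses a hypothetical stand-in, fake_read_the_dates(), that pretends only pages 1 through 3 exist and throws an error otherwise; the dates it returns are made up.

```r
library(purrr)
library(tibble)

# Hypothetical stand-in for read_the_dates(): only pages 1-3 "exist",
# and any other page number raises an error, like a page that fails to load
fake_read_the_dates <- function(page) {
    if (page < 1 || page > 3) {
        stop("page ", page, " does not exist")
    }
    tibble(page = page, date = list(c("January 5, 2017", "January 9, 2017")))
}

# possibly() wraps the function so that an error returns `otherwise` instead
safe_read <- possibly(fake_read_the_dates, otherwise = NULL)

# The NULL results for missing pages are dropped when the rows are combined
output_demo <- map_df(0:10, safe_read)
nrow(output_demo)   # 3
```

With the real function, the same pattern would be map_df(0:100, possibly(read_the_dates, otherwise = NULL)), at the cost of requesting pages that turn out not to exist.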

Analysis

Now that we have the data, we can process it a bit. The unnest() call is needed because we created one row in the data frame per page, but each page contains up to six dates; unnest() spreads those dates out so that each one occupies its own row.

As with many tasks in R, this is only one way (though I think a good one!) of many to achieve the same output.

processed_dates <- output_df %>% 
    select(date) %>% 
    unnest(date) %>% 
    mutate(date = mdy(date),
           month = month(date, label = TRUE),
           year = year(date))
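To see what each step does, here is a small sketch on toy data. It assumes the scraped strings look like "August 15, 2017", a format that mdy() can parse; the toy dates themselves are made up.

```r
library(dplyr)
library(tidyr)
library(lubridate)

# Toy version of output_df: one row per page, dates stored in a list-column
toy_df <- tibble(
    date = list(
        c("August 15, 2017", "September 1, 2017"),
        c("October 20, 2016")
    )
)

toy_processed <- toy_df %>%
    unnest(date) %>%                             # one row per date
    mutate(date = mdy(date),                     # character -> Date
           month = month(date, label = TRUE),    # ordered factor of month labels
           year = year(date))                    # numeric year

toy_processed
```

The two rows of toy_df become three rows, one per date, each with its own month and year.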

We can now count up how many posts there are per month and look at their proportion:

processed_dates %>% 
    count(month) %>% 
    mutate(proportion = round(n / sum(n), 2))
## # A tibble: 12 x 3
##    month     n proportion
##    <ord> <int>      <dbl>
##  1 Jan      28       0.06
##  2 Feb      22       0.05
##  3 Mar      25       0.05
##  4 Apr      27       0.06
##  5 May      31       0.07
##  6 Jun      23       0.05
##  7 Jul      21       0.05
##  8 Aug      57       0.12
##  9 Sep      77       0.17
## 10 Oct      74       0.16
## 11 Nov      42       0.09
## 12 Dec      29       0.06
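As a quick check on the table above, the monthly counts can be re-entered by hand to see how much of the total falls in August through October. This is just arithmetic on the printed output, not a new analysis.

```r
# Monthly counts copied from the table above
n_by_month <- c(Jan = 28, Feb = 22, Mar = 25, Apr = 27, May = 31, Jun = 23,
                Jul = 21, Aug = 57, Sep = 77, Oct = 74, Nov = 42, Dec = 29)

# Share of all posts that appeared in August, September, or October
peak_share <- sum(n_by_month[c("Aug", "Sep", "Oct")]) / sum(n_by_month)
round(peak_share, 2)   # 0.46
```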

And the number (and proportion) of posts per year:

processed_dates %>% 
    count(year) %>% 
    mutate(proportion = round(n / sum(n), 2))
## # A tibble: 6 x 3
##    year     n proportion
##   <dbl> <int>      <dbl>
## 1  2013    97       0.21
## 2  2014   113       0.25
## 3  2015    58       0.13
## 4  2016    52       0.11
## 5  2017    59       0.13
## 6  2018    77       0.17

And we can look at differences between years in posts per month:

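df_plot_2 <- processed_dates %>% 
    count(year, month) %>% 
    # Add rows with n = 0 for year-month combinations that have no posts
    complete(year, month, fill = list(n = 0)) %>% 
    mutate(year = as.factor(year)) %>% 
    # Drop the 2017 months that fall after this post's date
    filter(!(year == 2017 & month %in% c("Sep", "Oct", "Nov", "Dec")))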

ggplot(df_plot_2, aes(x = month, y = n, group = year, color = year)) +
    geom_point(alpha = .3) +
    geom_line() +
    xlab(NULL) +
    ylab("Number of Posts") +
    ggtitle("Number of Posts / Month on EdPsychJobs")

What can we learn from this?

Overall, the number of postings seems to align with the academic job cycle, in which jobs are posted roughly a year before their start date: you would expect the announcement for a job with a start date of August 2018 to appear around August through October of the year before. In the monthly counts above, August through October account for about 46% of all posts (208 of 456).

But there are many jobs posted throughout the year, too. And there appear to be differences between years. Of course, there are many possible caveats.

Nevertheless, the analysis may be useful. Any of this code can be used to re-create the data, improve on this, or do something different with it.