Division 15 of the American Psychological Association sponsors the Ed Psych Jobs website, an excellent resource for Ed Psych job seekers. I thought it might be helpful to see when jobs were posted in past years, to get a better idea of when jobs may be posted this year.
Ed Psych Jobs, Robots (.txt), and paths_allowed, oh my
As this project involves a bit of web scraping, I first checked the robots.txt file (located at http://edpsychjobs.info/robots.txt) to find out whether accessing any or all of the content in this way was prohibited. It looks like only the log-in pages are listed as pages that should not be accessed.
If interested, here is a good resource on robots.txt files.
As a shortcut, the robotstxt R package has the function paths_allowed(), which takes the URL of a page on a website and returns TRUE if, according to the robots.txt file, accessing the content at that URL is allowed. Here is an example of using it, which confirms my manual check of the file (it is better to be safe than sorry with web scraping):
library(robotstxt)
paths_allowed("http://edpsychjobs.info/category/all-jobs/")
## [1] TRUE
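To see how such a check works without touching the network, here is a minimal sketch that runs the same kind of test against a robots.txt supplied as text. The file contents below are made up for illustration (they are not the actual Ed Psych Jobs file), and the sketch assumes that the robotstxt() constructor accepts a text argument, as described in the package documentation.

```r
library(robotstxt)

# Made-up robots.txt contents for illustration only;
# this is NOT the actual file from edpsychjobs.info
example_robots <- paste(
  "User-agent: *",
  "Disallow: /wp-login.php",
  sep = "\n"
)

# Build a robots.txt object from the text instead of downloading it
rt <- robotstxt(text = example_robots)

# Check two paths for any bot ("*"); returns one logical value per path
allowed <- rt$check(paths = c("/category/all-jobs/", "/wp-login.php"), bot = "*")
allowed
```

With these made-up rules, the job listing path should come back as allowed and the log-in path as not allowed.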
Accessing the dates of posts
Let’s get scraping. We will load a few packages and write a few lines of code to scrape just the dates (not the names of the jobs or any other information):
library(rvest)
library(tidyverse)
library(lubridate)
library(hrbrthemes)
# Scrape the publication dates from one page of the job listings
read_the_dates <- function(page, url = "http://edpsychjobs.info/category/all-jobs/page/") {
  Sys.sleep(1) # pause between requests to be considerate to the server
  page_url <- paste0(url, page)
  page_html <- read_html(page_url)
  date_nodes <- html_nodes(page_html, ".published")
  # One row per page; that page's dates are stored together in a list-column
  tibble(date = list(html_text(date_nodes)))
}
Notice that in this code we used Sys.sleep(1).
Although the robots.txt file did not specify a time delay (some websites request a delay, such as 10 seconds, between page loads for web scrapers), this command adds a 1-second delay between page loads, just to be considerate of the web server's bandwidth.
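As a rough sketch of how a requested delay could be honored, the base R code below looks for a Crawl-delay line in the text of a robots.txt file and falls back to 1 second when none is present. The file contents are made up for illustration, and the simple pattern match is an assumption that ignores per-user-agent sections; it is not how the robotstxt package does it.

```r
# Made-up robots.txt lines for illustration only
robots_lines <- c(
  "User-agent: *",
  "Crawl-delay: 10",
  "Disallow: /wp-login.php"
)

# Keep only lines that start with "Crawl-delay:" (case-insensitive)
delay_lines <- grep("^crawl-delay:", robots_lines, ignore.case = TRUE, value = TRUE)

# Use the first requested delay if there is one; otherwise default to 1 second
crawl_delay <- if (length(delay_lines) > 0) {
  as.numeric(trimws(sub("^[^:]*:", "", delay_lines[1])))
} else {
  1
}

crawl_delay
## [1] 10

# Inside a scraping function, one would then call Sys.sleep(crawl_delay)
```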
I checked manually to see how many pages had job postings; there were 76, so I’m just passing the numbers 1 through 76 as arguments to the function we wrote above through map_df(), which takes care of the iteration (this is the map part) and outputs the results in a data frame (this is the df part).
Just as a note: another way to do this would be to write a function that starts at page 1 and keeps going until it reaches a page number for which no page loads. Still another way would be to use the very handy possibly() (from the purrr package), or some other function that deals with errors (in this case, a page that does not load), and specify a large range of page numbers, say, from 0 to 100.
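Here is a minimal sketch of the possibly() approach. To keep it runnable offline, it uses a hypothetical stand-in function, fake_read_page(), that pretends only pages 1 to 3 exist and throws an error for any other page; in practice you would wrap read_the_dates() instead.

```r
library(purrr)

# Hypothetical stand-in for read_the_dates(): pretends only pages 1 to 3 exist
fake_read_page <- function(page) {
  if (page < 1 || page > 3) stop("page does not load")
  data.frame(page = page, n_dates = 6)
}

# possibly() returns a version of the function that yields `otherwise`
# (here NULL) instead of stopping when an error occurs
safe_read_page <- possibly(fake_read_page, otherwise = NULL)

# Request a generous range of pages; failed pages come back as NULL
results <- map(0:10, safe_read_page)

# Drop the NULLs and stack the remaining one-row data frames
output <- do.call(rbind, compact(results))
nrow(output)
## [1] 3
```

The trade-off is that every page in the range is still requested, so a very large range means many unnecessary page loads.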
output_df <- map_df(1:76, read_the_dates)
Analysis
Now that we have the data, we can process it a bit. The unnest() call is needed because we created one row in the data frame for each page, but each page contains up to six dates; unnest() spreads those dates over up to six rows instead of one.
As with many tasks in R, this is only one of many ways (but I think a good one!) to achieve the same output.
processed_dates <- output_df %>%
  select(date) %>%
  unnest(date) %>%
  mutate(date = mdy(date),
         month = month(date, label = TRUE),
         year = year(date))
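To see what unnest() and mdy() do in isolation, here is a small sketch on made-up data with the same shape as the scraped output. The date strings are invented, and the sketch assumes an English locale so that the month names parse.

```r
library(tibble)
library(tidyr)
library(dplyr)
library(lubridate)

# Made-up data in the same shape as the scraped output:
# one row per page, with that page's dates stored in a list-column
toy_df <- tibble(
  date = list(
    c("January 5, 2018", "January 3, 2018"),
    "December 20, 2017"
  )
)

toy_dates <- toy_df %>%
  unnest(date) %>%          # one row per date instead of one row per page
  mutate(date = mdy(date),  # parse month-day-year text into Date values
         year = year(date))

toy_dates
```

The two rows of toy_df become three rows in toy_dates, one for each date.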
We can now count up how many posts there are per month and look at their proportion:
processed_dates %>%
  count(month) %>%
  mutate(proportion = round(n / sum(n), 2))
## # A tibble: 12 x 3
##    month     n proportion
##    <ord> <int>      <dbl>
##  1 Jan      28       0.06
##  2 Feb      22       0.05
##  3 Mar      25       0.05
##  4 Apr      27       0.06
##  5 May      31       0.07
##  6 Jun      23       0.05
##  7 Jul      21       0.05
##  8 Aug      57       0.12
##  9 Sep      77       0.17
## 10 Oct      74       0.16
## 11 Nov      42       0.09
## 12 Dec      29       0.06
And how many posts (and what proportion) there are per year:
processed_dates %>%
  count(year) %>%
  mutate(proportion = round(n / sum(n), 2))
## # A tibble: 6 x 3
##    year     n proportion
##   <dbl> <int>      <dbl>
## 1  2013    97       0.21
## 2  2014   113       0.25
## 3  2015    58       0.13
## 4  2016    52       0.11
## 5  2017    59       0.13
## 6  2018    77       0.17
And look at differences between years in posts per month:
df_plot_2 <- processed_dates %>%
  count(year, month) %>%
  complete(year, month, fill = list(n = 0)) %>%
  mutate(year = as.factor(year)) %>%
  filter(!(year == 2017 & month %in% c("Sep", "Oct", "Nov", "Dec")))
ggplot(df_plot_2, aes(x = month, y = n, group = year, color = year)) +
  geom_point(alpha = .3) +
  geom_line() +
  xlab(NULL) +
  ylab("Number of Posts") +
  ggtitle("Number of Posts / Month on EdPsychJobs")
What can we learn from this?
Overall, the number of postings seems to align with the academic job cycle, in which jobs are posted around a year before their start date: you would expect the announcement for a job with a start date of August 2018 to appear around August through October of the year before.
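To put a number on that pattern, this short sketch computes the share of all posts that fall in August through October, using the monthly counts from the table of posts per month above. It is only arithmetic on the counts already reported, not a new analysis.

```r
# Monthly counts copied from the table of posts per month above
monthly_counts <- c(Jan = 28, Feb = 22, Mar = 25, Apr = 27, May = 31, Jun = 23,
                    Jul = 21, Aug = 57, Sep = 77, Oct = 74, Nov = 42, Dec = 29)

peak_posts <- sum(monthly_counts[c("Aug", "Sep", "Oct")])
total_posts <- sum(monthly_counts)

# Share of all posts made in August, September, and October
peak_share <- round(peak_posts / total_posts, 2)
peak_share
## [1] 0.46
```

So a little under half of the posts (208 of 456) appeared in those three months.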
But there are many jobs posted throughout the year, too. And there appear to be differences between years. Of course, there are many possible caveats:
- The extent to which the jobs posted on this site are comprehensive (do they serve as a source of information about all educational psychology jobs, or is it better to think of it as a sample?)
- There are other fields similar to educational psychology that may be worth comparing
- These postings are for all kinds of educational psychology-related jobs - post-docs, research assistant or associate jobs, adjunct, and tenure-track
- The analysis is completely descriptive; you could instead develop a predictive model for how many jobs will be posted in, say, September and October 2017
Nevertheless, it may be useful. Any of this code can be used to re-create the data, improve the analysis, or do something different with it.