Chapter 40 DIY web data

40.1 Interacting with an API

In Chapter 39 we experimented with several packages that “wrapped” APIs. That is, they handled the creation of the request and the formatting of the output. In this chapter we’re going to look at (part of) what these functions were doing.

40.1.1 Load the tidyverse

We will be using the functions from the tidyverse throughout this chapter, so go ahead and load tidyverse package now.

40.1.2 Examine the structure of API requests using the Open Movie Database

First we’re going to examine the structure of API requests via the Open Movie Database (OMDb). OMDb is very similar to IMDb, except it has a nice, simple API. We can go to the website, input some search parameters, and obtain both the XML query and the response from it.

Exercise: determine the shape of an API request. Scroll down to the “Examples” section on the OMDb site and play around with the parameters. Take a look at the resulting API call and the query you get back.

If we enter the following parameters:

  • title = Interstellar,
  • year = 2014,
  • plot = full,
  • response = JSON

Here is what we see:

Example OMDb Query in JSON

Figure 40.1: Example OMDb Query in JSON

The request URL is:

http://www.omdbapi.com/?t=Interstellar&y=2014&plot=full

Notice the pattern in the request. Let’s try changing the response field from JSON to XML.

Example OMDb Query in XML

Figure 40.2: Example OMDb Query in XML

Now the request URL is:

http://www.omdbapi.com/?t=Interstellar&y=2014&plot=full&r=xml

Try pasting these URLs into your browser. You should see this if you tried the first URL:

…and this if you tried the second URL (where r=xml):

40.1.3 Create an OMDb API Key

This tells us that we need an API key to access the OMDb API. We will store our key for the OMDb API in our .Renviron file using the helper function edit_r_environ() from the usethis package. Follow these steps:

  1. Visit this URL and request your free API key: https://www.omdbapi.com/apikey.aspx
  2. Check your email and follow the instructions to activate your key.
  3. Install/load the usethis package and run edit_r_environ() in the R Console:

  4. Add OMDB_API_KEY=<your-secret-key> on a new line, press enter to add a blank line at the end (important!), save the file, and close it.
    • Note that we use <your-secret-key> as a placeholder here and throughout these instructions. Your actual API key will look something like: p319s0aa (no quotes or other characters like < or > should go on the right of the = sign).
  5. Restart R.
  6. You can now access your OMDb API key from the R console and save it as an object:

    #> [1] ""
  7. We can use this to easily add our API key to the request URL. Let’s make this API key an object we can refer to as movie_key:

40.1.3.1 Alternative strategy for keeping keys: .Rprofile

Remember to protect your key! It is important for your privacy. You know, like a key.

Now we follow the rOpenSci tutorial on API keys:

  • Add .Rprofile to your .gitignore !!
  • Make a .Rprofile file (windows tips; mac tips).
  • Write the following in it:
  • Restart R (i.e. reopen your RStudio project).

This code adds another element to the list of options, which you can see by calling options(). Part of the work done by rplos::searchplos() and friends is to go and obtain the value of this option with the function getOption("OMBD_API_KEY"). This indicates two things:

  1. Spelling is important when you set the option in your .Rprofile
  2. You can do a similar process for an arbitrary package or key. For example:

This is a simple means to keep your keys private, especially if you are sharing the same authentication across several projects.

40.1.3.2 A few timely reminders about your .Rprofile

  • It must end with a blank line!
  • It lives in the project’s working directory, i.e. the location of your .Rproj.
  • It must be gitignored.

Remember that using .Rprofile makes your code un-reproducible. In this case, that is exactly what we want!

40.1.4 Recreate the request URL in R

How can we recreate the same request URLs in R? We could use the glue package to paste together the base URL, parameter labels, and parameter values:

This works, but it only works for movie titled Interstellar from 2014 where we want the short plot and the XML format. Let’s try to pull out more variables and paste them in with glue:

We could go even further and make this into a function called omdb() that we can reuse more easily.

40.1.5 Get data using the curl package

Now we have a handy function that returns the API query. We can paste in the link, but we can also obtain data from within R using the curl package. Install/load the curl package first.

Using curl to get the data in XML format:

#> [1] "<?xml version=\"1.0\" encoding=\"UTF-8\"?><root response=\"True\"><movie title=\"Interstellar\" year=\"2014\" rated=\"PG-13\" released=\"07 Nov 2014\" runtime=\"169 min\" genre=\"Adventure, Drama, Sci-Fi\" director=\"Christopher Nolan\" writer=\"Jonathan Nolan, Christopher Nolan\" actors=\"Ellen Burstyn, Matthew McConaughey, Mackenzie Foy, John Lithgow\" plot=\"A team of explorers travel through a wormhole in space in an attempt to ensure humanity's survival.\" language=\"English\" country=\"USA, UK, Canada\" awards=\"Won 1 Oscar. Another 43 wins &amp; 143 nominations.\" poster=\"https://m.media-amazon.com/images/M/MV5BZjdkOTU3MDktN2IxOS00OGEyLWFmMjktY2FiMmZkNWIyODZiXkEyXkFqcGdeQXVyMTMxODk2OTU@._V1_SX300.jpg\" metascore=\"74\" imdbRating=\"8.6\" imdbVotes=\"1,333,291\" imdbID=\"tt0816692\" type=\"movie\"/></root>"

Using curl to get the data in JSON format:

#> [1] "{\"Title\":\"Interstellar\",\"Year\":\"2014\",\"Rated\":\"PG-13\",\"Released\":\"07 Nov 2014\",\"Runtime\":\"169 min\",\"Genre\":\"Adventure, Drama, Sci-Fi\",\"Director\":\"Christopher Nolan\",\"Writer\":\"Jonathan Nolan, Christopher Nolan\",\"Actors\":\"Ellen Burstyn, Matthew McConaughey, Mackenzie Foy, John Lithgow\",\"Plot\":\"A team of explorers travel through a wormhole in space in an attempt to ensure humanity's survival.\",\"Language\":\"English\",\"Country\":\"USA, UK, Canada\",\"Awards\":\"Won 1 Oscar. Another 43 wins & 143 nominations.\",\"Poster\":\"https://m.media-amazon.com/images/M/MV5BZjdkOTU3MDktN2IxOS00OGEyLWFmMjktY2FiMmZkNWIyODZiXkEyXkFqcGdeQXVyMTMxODk2OTU@._V1_SX300.jpg\",\"Ratings\":[{\"Source\":\"Internet Movie Database\",\"Value\":\"8.6/10\"},{\"Source\":\"Rotten Tomatoes\",\"Value\":\"72%\"},{\"Source\":\"Metacritic\",\"Value\":\"74/100\"}],\"Metascore\":\"74\",\"imdbRating\":\"8.6\",\"imdbVotes\":\"1,333,291\",\"imdbID\":\"tt0816692\",\"Type\":\"movie\",\"DVD\":\"31 Mar 2015\",\"BoxOffice\":\"$158,737,441\",\"Production\":\"Paramount Pictures\",\"Website\":\"N/A\",\"Response\":\"True\"}"

We have two forms of data that are obviously structured. What are they?

40.2 Intro to JSON and XML

There are two common languages of web services:

  1. JavaScript Object Notation (JSON)
  2. eXtensible Markup Language (XML)

Here’s an example of JSON (from this wonderful site):

And here is XML (also from this site):

You can see that both of these data structures are quite easy to read. They are “self-describing”. In other words, they tell you how they are meant to be read. There are easy means of taking these data types and creating R objects.

40.2.1 Parsing the JSON response with jsonlite

Our JSON response above can be parsed using jsonlite::fromJSON(). First install/load the jsonlite package.

Parsing our JSON response with fromJSON():

The output is a named list. A familiar and friendly R structure. Because data frames are lists and because this list has no nested lists-within-lists, we can coerce it very simply:

answer_json %>% 
  fromJSON() %>% 
  as_tibble() %>% 
  glimpse()
#> Observations: 3
#> Variables: 25
#> $ Title      <chr> "Interstellar", "Interstellar", "Interstellar"
#> $ Year       <chr> "2014", "2014", "2014"
#> $ Rated      <chr> "PG-13", "PG-13", "PG-13"
#> $ Released   <chr> "07 Nov 2014", "07 Nov 2014", "07 Nov 2014"
#> $ Runtime    <chr> "169 min", "169 min", "169 min"
#> $ Genre      <chr> "Adventure, Drama, Sci-Fi", "Adventure, Drama, Sci-Fi…
#> $ Director   <chr> "Christopher Nolan", "Christopher Nolan", "Christophe…
#> $ Writer     <chr> "Jonathan Nolan, Christopher Nolan", "Jonathan Nolan,…
#> $ Actors     <chr> "Ellen Burstyn, Matthew McConaughey, Mackenzie Foy, J…
#> $ Plot       <chr> "A team of explorers travel through a wormhole in spa…
#> $ Language   <chr> "English", "English", "English"
#> $ Country    <chr> "USA, UK, Canada", "USA, UK, Canada", "USA, UK, Canad…
#> $ Awards     <chr> "Won 1 Oscar. Another 43 wins & 143 nominations.", "W…
#> $ Poster     <chr> "https://m.media-amazon.com/images/M/MV5BZjdkOTU3MDkt…
#> $ Ratings    <df[,2]> <data.frame[3 x 2]>
#> $ Metascore  <chr> "74", "74", "74"
#> $ imdbRating <chr> "8.6", "8.6", "8.6"
#> $ imdbVotes  <chr> "1,333,291", "1,333,291", "1,333,291"
#> $ imdbID     <chr> "tt0816692", "tt0816692", "tt0816692"
#> $ Type       <chr> "movie", "movie", "movie"
#> $ DVD        <chr> "31 Mar 2015", "31 Mar 2015", "31 Mar 2015"
#> $ BoxOffice  <chr> "$158,737,441", "$158,737,441", "$158,737,441"
#> $ Production <chr> "Paramount Pictures", "Paramount Pictures", "Paramoun…
#> $ Website    <chr> "N/A", "N/A", "N/A"
#> $ Response   <chr> "True", "True", "True"

40.2.2 Parsing the XML response using xml2

We can use the xml2 package to wrangle our XML response.

Parsing our XML response with read_xml():

Not exactly the result we were hoping for! However, this does tell us about the XML document’s structure:

  • It has a <root> node, which has a single child node, <movie>.
  • The information we want is all stored as attributes (e.g. title, year, etc.).

The xml2 package has various functions to assist in navigating through XML. We can use the xml_children() function to extract all of the children nodes (i.e. the single child, <movie>):

The xml_attrs() function “retrieves all attribute values as a named character vector”. Let’s use this to extract the information that we want from the <movie> node:

We can transform this named character vector into a data frame with the help of dplyr::bind_rows():

40.3 Introducing the easy way: httr

httr is yet another star in the tidyverse. It is a package designed to facilitate all things HTTP from within R. This includes the major HTTP verbs, which are:

  • GET() - Fetch an existing resource. The URL contains all the necessary information the server needs to locate and return the resource.
  • POST() - Create a new resource. POST requests usually carry a payload that specifies the data for the new resource.
  • PUT() - Update an existing resource. The payload may contain the updated data for the resource.
  • DELETE() - Delete an existing resource.

HTTP is the foundation for APIs; understanding how it works is the key to interacting with all the diverse APIs out there. An excellent beginning resource for APIs (including HTTP basics) is An Introduction to APIs by Brian Cooksey.

httr also facilitates a variety of authentication protocols.

httr contains one function for every HTTP verb. The functions have the same names as the verbs (e.g. GET(), POST()). They have more informative outputs than simply using curl and come with nice convenience functions for working with the output:

Using httr to get the data in JSON format:

#> $Title
#> [1] "Interstellar"
#> 
#> $Year
#> [1] "2014"
#> 
#> $Rated
#> [1] "PG-13"
#> 
#> $Released
#> [1] "07 Nov 2014"
#> 
#> $Runtime
#> [1] "169 min"
#> 
#> $Genre
#> [1] "Adventure, Drama, Sci-Fi"
#> 
#> $Director
#> [1] "Christopher Nolan"
#> 
#> $Writer
#> [1] "Jonathan Nolan, Christopher Nolan"
#> 
#> $Actors
#> [1] "Ellen Burstyn, Matthew McConaughey, Mackenzie Foy, John Lithgow"
#> 
#> $Plot
#> [1] "A team of explorers travel through a wormhole in space in an attempt to ensure humanity's survival."
#> 
#> $Language
#> [1] "English"
#> 
#> $Country
#> [1] "USA, UK, Canada"
#> 
#> $Awards
#> [1] "Won 1 Oscar. Another 43 wins & 143 nominations."
#> 
#> $Poster
#> [1] "https://m.media-amazon.com/images/M/MV5BZjdkOTU3MDktN2IxOS00OGEyLWFmMjktY2FiMmZkNWIyODZiXkEyXkFqcGdeQXVyMTMxODk2OTU@._V1_SX300.jpg"
#> 
#> $Ratings
#> $Ratings[[1]]
#> $Ratings[[1]]$Source
#> [1] "Internet Movie Database"
#> 
#> $Ratings[[1]]$Value
#> [1] "8.6/10"
#> 
#> 
#> $Ratings[[2]]
#> $Ratings[[2]]$Source
#> [1] "Rotten Tomatoes"
#> 
#> $Ratings[[2]]$Value
#> [1] "72%"
#> 
#> 
#> $Ratings[[3]]
#> $Ratings[[3]]$Source
#> [1] "Metacritic"
#> 
#> $Ratings[[3]]$Value
#> [1] "74/100"
#> 
#> 
#> 
#> $Metascore
#> [1] "74"
#> 
#> $imdbRating
#> [1] "8.6"
#> 
#> $imdbVotes
#> [1] "1,333,291"
#> 
#> $imdbID
#> [1] "tt0816692"
#> 
#> $Type
#> [1] "movie"
#> 
#> $DVD
#> [1] "31 Mar 2015"
#> 
#> $BoxOffice
#> [1] "$158,737,441"
#> 
#> $Production
#> [1] "Paramount Pictures"
#> 
#> $Website
#> [1] "N/A"
#> 
#> $Response
#> [1] "True"

Using httr to get the data in XML format:

#> {xml_document}
#> <root response="True">
#> [1] <movie title="Interstellar" year="2014" rated="PG-13" released="07 N ...

httr also gives us access to lots of useful information about the quality of our response. For example, the header:

And also a handy means to extract specifically the HTTP status code:

In fact, we didn’t need to create omdb() at all. httr provides a straightforward means of making an HTTP request with the query argument:

#> $Title
#> [1] "The Martian"
#> 
#> $Year
#> [1] "2015"
#> 
#> $Rated
#> [1] "PG-13"
#> 
#> $Released
#> [1] "02 Oct 2015"
#> 
#> $Runtime
#> [1] "144 min"
#> 
#> $Genre
#> [1] "Adventure, Drama, Sci-Fi"
#> 
#> $Director
#> [1] "Ridley Scott"
#> 
#> $Writer
#> [1] "Drew Goddard (screenplay by), Andy Weir (based on the novel by)"
#> 
#> $Actors
#> [1] "Matt Damon, Jessica Chastain, Kristen Wiig, Jeff Daniels"
#> 
#> $Plot
#> [1] "An astronaut becomes stranded on Mars after his team assume him dead, and must rely on his ingenuity to find a way to signal to Earth that he is alive."
#> 
#> $Language
#> [1] "English, Mandarin"
#> 
#> $Country
#> [1] "UK, USA, Hungary"
#> 
#> $Awards
#> [1] "Nominated for 7 Oscars. Another 37 wins & 185 nominations."
#> 
#> $Poster
#> [1] "https://m.media-amazon.com/images/M/MV5BMTc2MTQ3MDA1Nl5BMl5BanBnXkFtZTgwODA3OTI4NjE@._V1_SX300.jpg"
#> 
#> $Ratings
#> $Ratings[[1]]
#> $Ratings[[1]]$Source
#> [1] "Internet Movie Database"
#> 
#> $Ratings[[1]]$Value
#> [1] "8.0/10"
#> 
#> 
#> $Ratings[[2]]
#> $Ratings[[2]]$Source
#> [1] "Rotten Tomatoes"
#> 
#> $Ratings[[2]]$Value
#> [1] "91%"
#> 
#> 
#> $Ratings[[3]]
#> $Ratings[[3]]$Source
#> [1] "Metacritic"
#> 
#> $Ratings[[3]]$Value
#> [1] "80/100"
#> 
#> 
#> 
#> $Metascore
#> [1] "80"
#> 
#> $imdbRating
#> [1] "8.0"
#> 
#> $imdbVotes
#> [1] "695,039"
#> 
#> $imdbID
#> [1] "tt3659388"
#> 
#> $Type
#> [1] "movie"
#> 
#> $DVD
#> [1] "12 Jan 2016"
#> 
#> $BoxOffice
#> [1] "$202,313,768"
#> 
#> $Production
#> [1] "20th Century Fox"
#> 
#> $Website
#> [1] "N/A"
#> 
#> $Response
#> [1] "True"

With httr, we are able to pass in the named arguments to the API call as a named list. We are also able to use spaces in movie names; httr encodes these in the URL before making the GET request.

It is very good to learn your HTTP status codes.

The documentation for httr includes a vignette of “Best practices for writing an API package”, which is useful for when you want to bring your favourite web resource into the world of R.

40.4 Scraping

What if data is present on a website, but isn’t provided in an API at all? It is possible to grab that information too. How easy that is to do depends a lot on the quality of the website that we are using.

HTML is a structured way of displaying information. It is very similar in structure to XML (in fact many modern html sites are actually XHTML5, which is also valid XML).

From [xkcd](https://imgs.xkcd.com/comics/tags.png)

Figure 40.3: From xkcd

Two pieces of equipment:

  1. The rvest package (CRAN; GitHub). Install via install.packages("rvest)".
  2. SelectorGadget: point and click CSS selectors. Install in your browser.

Before we go any further, let’s play a game together!

40.4.1 Obtain a table

Let’s make a simple HTML table and then parse it.

  1. Make a new, empty project
  2. Make a totally empty .Rmd file and save it as "GapminderHead.Rmd"
  3. Copy this into the body:

Knit the document and click “View in Browser”. It should look like this:

country continent year lifeExp pop gdpPercap
Afghanistan Asia 1952 28.8 8425333 779
Afghanistan Asia 1957 30.3 9240934 821
Afghanistan Asia 1962 32.0 10267083 853
Afghanistan Asia 1967 34.0 11537966 836
Afghanistan Asia 1972 36.1 13079460 740
Afghanistan Asia 1977 38.4 14880372 786

We have created a simple HTML table with the head of gapminder in it! We can get our data back by parsing this table into a data frame again. Extracting data from HTML is called “scraping”, and we can do it in R with the rvest package:

#> [[1]]
#>       country continent year lifeExp      pop gdpPercap
#> 1 Afghanistan      Asia 1952    28.8  8425333       779
#> 2 Afghanistan      Asia 1957    30.3  9240934       821
#> 3 Afghanistan      Asia 1962    32.0 10267083       853
#> 4 Afghanistan      Asia 1967    34.0 11537966       836
#> 5 Afghanistan      Asia 1972    36.1 13079460       740
#> 6 Afghanistan      Asia 1977    38.4 14880372       786

40.6 Random observations on scraping

  • Make sure you’ve obtained ONLY what you want! Scroll over the whole page to ensure that SelectorGadget hasn’t found too many things.
  • If you are having trouble parsing, try selecting a smaller subset of the thing you are seeking (e.g. being more precise).

MOST IMPORTANTLY confirm that there is NO rOpenSci package and NO API before you spend hours scraping (the API was right here).

40.7 Extras

40.7.1 Airports

First, go to this website about Airports. Follow the link to get your API key (you will need to click a confirmation email).

List of all the airports on the planet:

https://airport.api.aero/airport/?user_key={yourkey}

List of all the airports matching Toronto:

https://airport.api.aero/airport/match/toronto?user_key={yourkey}

The distance between YVR and LAX:

https://airport.api.aero/airport/distance/YVR/LAX?user_key={yourkey}

Do you need just the US airports? This API does that (also see this) and is free.