There are three ways to get data from the internet into R: calling a web API directly (e.g. with httr), scraping a web page, or using a purpose-built R package that wraps an API.
Pick one of the exercises below. At least one prompt is given for each of the above approaches.
Due date: Friday 09 December 2016.
If randomness is involved, use set.seed() so that a peer could produce the same random sample.

Avoid unnecessary re-downloading: use make or a makefile-like R script, or put the download inside an if() statement that checks whether the data already exists.
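The if() guard can be sketched like this; note the file name and URL below are hypothetical placeholders for whatever data source you choose:

```r
# Only download if we don't already have the file locally.
data_file <- "raw-data.csv"

if (!file.exists(data_file)) {
  download.file(
    url = "https://example.com/raw-data.csv",  # hypothetical URL
    destfile = data_file
  )
}

dat <- read.csv(data_file, stringsAsFactors = FALSE)
```

This keeps repeated runs of your script fast and avoids hammering the remote server, while still being fully reproducible from a clean checkout.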
Create a dataset with multiple records by requesting data from an API using the httr package.
Inspiration for APIs to call
GET() data from the API and convert it into a clean and tidy data frame. Store that as a file ready for (hypothetical!) downstream analysis. Do just enough basic exploration of the resulting data, possibly including some plots, that you and a reader are convinced you’ve successfully downloaded and cleaned it.
Take as many of these opportunities as you can justify to make your task more interesting and realistic, to use techniques from elsewhere in the course (esp. nested list processing with purrr), and to gain experience with more sophisticated usage of httr.
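A minimal sketch of the httr workflow described above, assuming a hypothetical JSON endpoint that returns a list of records with `id` and `value` fields; swap in the real API and fields you choose:

```r
library(httr)
library(purrr)

# Request records from a (hypothetical) API endpoint.
resp <- GET("https://api.example.com/records", query = list(limit = 50))
stop_for_status(resp)                   # fail loudly on HTTP errors

# Parse the JSON body into nested R lists.
parsed <- content(resp, as = "parsed")

# Flatten the nested lists into a tidy data frame with purrr.
dat <- map_df(parsed, ~ data.frame(
  id    = .x$id,
  value = .x$value,
  stringsAsFactors = FALSE
))

# Store for (hypothetical!) downstream analysis.
write.csv(dat, "api-data.csv", row.names = FALSE)
```

Checking the status with stop_for_status() and parsing with content() before any cleaning makes failures visible early, which matters when a peer re-runs your code.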
Scrape a multi-record dataset off the web! Convert it into a clean and tidy data frame. Store that as a file ready for (hypothetical!) downstream analysis. Do just enough basic exploration of the resulting data, possibly including some plots, that you and a reader are convinced you’ve successfully downloaded and cleaned it.
I think it’s dubious to scrape data that is available through a proper API, so if you do that anyway … perhaps you should get the data both ways and reflect on the comparison. Also, make sure you are not violating a site’s terms of service or your own ethical standards with your webscraping. Just because you can doesn’t mean you should!
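One common way to scrape tabular data is with rvest; here is a minimal sketch, assuming a hypothetical page that contains an HTML table (the URL and CSS selector will differ for the site you choose):

```r
library(rvest)

# Read the page and pull out the first table (adjust the selector as needed).
page <- read_html("https://example.com/some-table-page")
dat <- page %>%
  html_node("table") %>%
  html_table()

# Store for (hypothetical!) downstream analysis.
write.csv(dat, "scraped-data.csv", row.names = FALSE)
```

Expect to do more cleaning than this in practice: scraped tables often arrive with header rows repeated, numbers stored as character, and stray footnote markers.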
Many APIs have purpose-built R packages that make it even easier to get data from them.
If you choose one of these options, then you need to go further and combine two datasets, at least one of which is from the web.
These prompts were developed in 2015 by TA Andrew MacDonald.
Prompt 1: Combine gapminder and data from geonames. Install the geonames package (on CRAN, on GitHub). Make a user account and use geonames to access data about the world’s countries. Use data from gapminder to investigate either of these questions:
```r
library("ggplot2")
library("gapminder")
ggplot(subset(gapminder, continent != "Oceania"),
       aes(x = year, y = pop, group = country, color = country)) +
  geom_line(lwd = 1, show.legend = FALSE) +
  facet_wrap(~ continent) +
  scale_color_manual(values = country_colors) +
  theme_bw() +
  theme(strip.text = element_text(size = rel(1.1))) +
  scale_y_log10()
```
Replace population with population density. To do this, look up the country codes in geonames, obtain the area of each country, and compute density as population divided by area. Tip: check out the handy countrycode package to help you merge country names!
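A sketch of that merge, assuming geonames::GNcountryInfo() returns an areaInSqKm column (verify against the actual output) and that you have set your geonames username with options():

```r
library(geonames)
library(countrycode)
library(gapminder)

options(geonamesUsername = "your_username")  # hypothetical account name

# Country-level info from geonames, including area.
info <- GNcountryInfo()
info$areaInSqKm <- as.numeric(info$areaInSqKm)

# Add ISO country codes to gapminder so the two tables can be merged.
gap <- gapminder
gap$iso2c <- countrycode(gap$country,
                         origin = "country.name", destination = "iso2c")

merged <- merge(gap, info[, c("countryCode", "areaInSqKm")],
                by.x = "iso2c", by.y = "countryCode")
merged$density <- merged$pop / merged$areaInSqKm
```

Merging on a standardized code rather than on country names sidesteps mismatches like “Korea, Rep.” versus “South Korea”.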
Prompt 2: Look at two other rOpenSci packages: rebird and rplos. Both packages are on CRAN, and more info is in their GitHub repo READMEs. Find out what data are available from each, and combine them! Here are three suggestions:
rplos + rebird: How many articles are published on a bird species?
rplos + geonames: Choose a subset of countries. How many papers have been published by people from that country? In that country? How does that relate to GDP?
rebird + geonames: Do countries with more bird species also have more languages?
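For the literature side of these pairings, a hedged sketch with rplos; the argument names follow the rplos README, so verify them against your installed version, and the species name here is just an example:

```r
library(rplos)

# Count PLOS articles mentioning a bird species (example species).
res <- searchplos(q = '"Sturnella neglecta"', fl = "id", limit = 0)

# The total number of matching articles, per the Solr-style response.
res$meta$numFound
```

Repeating this query across a list of species (e.g. with purrr::map) gives you a per-species article count ready to join against bird data from rebird.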
Recall the general homework rubric.
Peers and/or TAs will run the code and try to get the same output. You’ll be evaluated on the clarity and robustness of your workflow.