library(dplyr)
library(knitr)
# library(devtools)
All this and more is described at the rOpenSci repository of R tools for interacting with the internet
There are many ways to obtain data from the Internet; let’s consider four categories:
In the simplest case, the data you need is already on the internet in a tabular format. There are a couple of strategies here:
* Use readr::read_csv to read the data straight into R.
* Use the command line program curl to do that work, and place it in a Makefile or shell script (see the Make lesson for how to do that).
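As a minimal sketch of the first strategy, the snippet below uses base R's read.csv() instead of readr::read_csv(); both accept a URL where a file path would normally go. The inline CSV text and the example.org address are made up for illustration, so the code runs without a network connection.

```r
# Made-up CSV content standing in for a table hosted on the web.
csv_text <- "site,visits\nA,10\nB,25\n"

# read.csv() parses the text exactly as it would parse a downloaded file.
dat <- read.csv(text = csv_text, stringsAsFactors = FALSE)
str(dat)

# With a real, reachable file you would pass its URL instead
# (hypothetical address, shown for shape only):
# dat <- read.csv("https://example.org/data.csv", stringsAsFactors = FALSE)
```

Reading straight from the URL is convenient for exploration; downloading with curl first leaves a local copy that a Makefile can track.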
The second case is most useful when the data you want has been provided in a format that needs cleanup. For example, the World Values Survey makes several datasets available as Excel sheets. The safest option here is to download the .xls file, then read it into R with readxl::read_excel() or something similar. An exception to this is data provided as Google Spreadsheets, which can be read straight into R.
httr::GET: data read this way needs to be parsed later with read.table.
rio::import() can “read a number of common data formats directly from an https:// URL”. Isn’t that very similar to the previous?
What about packages that install data?
Many times, the data that you want is not already organized into one or a few tables that you can read directly into R. More frequently, you find this data is given in the form of an API. Application Programming Interfaces (APIs) are descriptions of the kind of requests that can be made of a certain piece of software, and descriptions of the kind of answers that are returned. Many sources of data – databases, websites, services – have made all (or part) of their data available via APIs over the internet. Computer programs (“clients”) can make requests of the server, and the server will respond by sending data (or an error message). This client can be many kinds of other programs or websites, including R running from your laptop.
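To make the client/server exchange a little more concrete, here is a small sketch of what a client does before it sends anything: it assembles a request URL from an endpoint and some named parameters. The function name build_request_url(), the endpoint, and the parameter names are all hypothetical; real APIs document their own.

```r
# Assemble a request URL from an endpoint and named query parameters.
# The endpoint and parameter names used below are hypothetical.
build_request_url <- function(endpoint, params) {
  # Percent-encode each value so spaces and symbols survive the trip.
  values <- vapply(
    params,
    function(p) utils::URLencode(as.character(p), reserved = TRUE),
    character(1)
  )
  query <- paste(names(params), values, sep = "=", collapse = "&")
  paste0(endpoint, "?", query)
}

request_url <- build_request_url(
  "https://api.example.org/v1/observations",
  list(lat = 49.25, lng = -123.1, q = "snow goose")
)
print(request_url)

# A client such as httr::GET(request_url) would then send this to the server,
# which answers with data or with an error status.
```

The wrapper packages used in the rest of this lesson do this assembly for you, then format the response.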
Many common web services and APIs have been “wrapped”, i.e. R functions have been written around them which send your query to the server and format the response.
Why do we want this?
rebird is on CRAN.
The eBird website categorizes some popular locations as “Hotspots”. These are areas where there are both lots of birds and lots of birders. One such location is at Iona Island, near Vancouver. You can see data for this site at http://ebird.org/ebird/hotspot/L261851
At that link, you can see a page like this:
The data already look to be organized in a data frame!
rebird allows us to read these data directly into R. (The ID code for Iona Island is “L261851”.)
library(rebird)
ebirdhotspot(locID = "L261851") %>% head() %>% kable()
## Warning: `rbind_all()` is deprecated. Please use `bind_rows()` instead.
|2015-11-24 11:41||-123.2111||Iona Island (general)||TRUE||Snow Goose||FALSE||Chen caerulescens||FALSE||40||49.22133||L261851|
|2015-11-24 11:41||-123.2111||Iona Island (general)||TRUE||Gadwall||FALSE||Anas strepera||FALSE||10||49.22133||L261851|
|2015-11-24 11:41||-123.2111||Iona Island (general)||TRUE||American Wigeon||FALSE||Anas americana||FALSE||25||49.22133||L261851|
|2015-11-24 11:41||-123.2111||Iona Island (general)||TRUE||Mallard||FALSE||Anas platyrhynchos||FALSE||NA||49.22133||L261851|
|2015-11-24 11:41||-123.2111||Iona Island (general)||TRUE||Northern Pintail||FALSE||Anas acuta||FALSE||35||49.22133||L261851|
|2015-11-24 11:41||-123.2111||Iona Island (general)||TRUE||Green-winged Teal||FALSE||Anas crecca||FALSE||80||49.22133||L261851|
We can use the function ebirdgeo to get a list for an area. (Note that South and West are negative):
vanbirds <- ebirdgeo(lat = 49.2500, lng = -123.1000)
## Warning: `rbind_all()` is deprecated. Please use `bind_rows()` instead.
vanbirds %>% head %>% kable
|2015-11-25 15:30||-123.1754||Sea Island–Ferguson Rd||TRUE||Song Sparrow||FALSE||Melospiza melodia||FALSE||1||49.20665||L363627|
|2015-11-25 15:30||-123.1754||Sea Island–Ferguson Rd||TRUE||European Starling||FALSE||Sturnus vulgaris||FALSE||5||49.20665||L363627|
|2015-11-25 15:30||-123.1754||Sea Island–Ferguson Rd||TRUE||Northwestern Crow||FALSE||Corvus caurinus||FALSE||2||49.20665||L363627|
|2015-11-25 15:30||-123.1754||Sea Island–Ferguson Rd||TRUE||Short-eared Owl||FALSE||Asio flammeus||FALSE||2||49.20665||L363627|
|2015-11-25 15:30||-123.1754||Sea Island–Ferguson Rd||TRUE||Northern Harrier||FALSE||Circus cyaneus||FALSE||1||49.20665||L363627|
|2015-11-25 15:30||-123.1754||Sea Island–Ferguson Rd||TRUE||Spotted Towhee||FALSE||Pipilo maculatus||FALSE||2||49.20665||L363627|
Note: Check the defaults on this function, e.g. radius of circle, time of year.
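The sign convention mentioned above (South and West are negative) is easy to get wrong when coordinates come written as "49.25 N, 123.10 W". A small sketch of a converter follows; the helper name to_signed_degrees() is hypothetical and not part of rebird.

```r
# Convert a non-negative coordinate plus a hemisphere letter into the signed
# decimal degrees that functions such as ebirdgeo() expect.
to_signed_degrees <- function(degrees, hemisphere) {
  hemisphere <- toupper(hemisphere)
  stopifnot(all(hemisphere %in% c("N", "S", "E", "W")), all(degrees >= 0))
  # South and West flip the sign; North and East keep it.
  ifelse(hemisphere %in% c("S", "W"), -degrees, degrees)
}

lat <- to_signed_degrees(49.25, "N")
lng <- to_signed_degrees(123.10, "W")
c(lat = lat, lng = lng)
# These match the values used above: ebirdgeo(lat = 49.2500, lng = -123.1000)
```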
We can also search by “region”: short codes that serve as common shorthand for different political units. For example, France is represented by the letters FR.
frenchbirds <- ebirdregion("FR")
frenchbirds %>% head() %>% kable()
Find out WHEN a bird has been seen in a certain place! Choosing a name from vanbirds above (the Bald Eagle):
eagle <- ebirdgeo(species = 'Haliaeetus leucocephalus', lat = 42, lng = -76)
eagle %>% head() %>% kable()
rebird knows where you are:
ebirdgeo(species = 'Buteo lagopus')
# install.packages("geonames")
library(geonames)
There are two things we need to do to be able to use this package to access the geonames API
in R. However, this is insecure: we don't want to risk committing this line and pushing it to our public GitHub page! Instead, create a file in the same place as your `.Rproj` file, name it `.Rprofile`, and add
to that file. Important: make sure your `.Rprofile` ends with a blank line!
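If you are unsure whether a file really ends with a newline, you can check its last byte. The following is a minimal sketch; the helper name ends_with_newline() is hypothetical, and the demonstration uses throwaway temporary files rather than a real `.Rprofile`.

```r
# TRUE if the last byte of the file is a newline ("\n", byte value 10).
ends_with_newline <- function(path) {
  size <- file.info(path)$size
  if (is.na(size) || size == 0) return(FALSE)
  bytes <- readBin(path, what = "raw", n = size)
  bytes[length(bytes)] == as.raw(10)
}

# writeLines() terminates every line, so this file ends with a newline.
good <- tempfile()
writeLines('options(demo_option = "x")', good)

# cat() writes exactly what it is given, so this file does not.
bad <- tempfile()
cat('options(demo_option = "x")', file = bad)

ends_with_newline(good)
ends_with_newline(bad)

# On your own project you would run: ends_with_newline(".Rprofile")
```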
What can we do? We can get access to lots of geographical information via the various “web services”.
countryInfo <- GNcountryInfo()
countryInfo %>% head %>% kable
|EU||Andorra la Vella||ca||3041565||42.4284925987684||AND||42.6560438963||AN||84000||1.78654277783198||020||468.0||AD||1.40718671411128||Principality of Andorra||Europe||EUR|
|AS||Abu Dhabi||ar-AE,fa,en,hi,ur||290557||22.6333293914795||ARE||26.0841598510742||AE||4975593||56.3816604614258||784||82880.0||AE||51.5833282470703||United Arab Emirates||Asia||AED|
|AS||Kabul||fa-AF,ps,uz-AF,tk||1149361||29.377472||AFG||38.483418||AF||29121286||74.879448||004||647500.0||AF||60.478443||Islamic Republic of Afghanistan||Asia||AFN|
|NA||Saint John’s||en-AG||3576396||16.996979||ATG||17.729387||AC||86754||-61.672421||028||443.0||AG||-61.906425||Antigua and Barbuda||North America||XCD|
|NA||The Valley||en-AI||3573511||18.166815||AIA||18.283424||AV||13254||-62.971359||660||102.0||AI||-63.172901||Anguilla||North America||XCD|
|EU||Tirana||sq,el||783754||39.648361||ALB||42.665611||AL||2986952||21.068472||008||28748.0||AL||19.293972||Republic of Albania||Europe||ALL|
This country info dataset is very helpful for accessing the rest of the data, because it gives us the standardized codes for country and language.
What are the cities of France?
francedata <- countryInfo %>% filter(countryName == "France")
frenchcities <- with(francedata, GNcities(north = north, east = east, south = south, west = west, maxRows = 500))
frenchcities %>% head %>% kable
|2.3488||2988507||FR||Paris||city, village,…||Paris||capital of a political entity||en.wikipedia.org/wiki/Paris||48.85341||P||2138551||PPLC|
|4.34878349304199||2800866||BE||Brussels||city, village,…||Brussels||capital of a political entity||en.wikipedia.org/wiki/City_of_Brussels||50.8504450552593||P||1019022||PPLC|
|7.44744300842285||2661552||CH||Bern||city, village,…||Bern||capital of a political entity||en.wikipedia.org/wiki/Bern||46.9480943365053||P||121631||PPLC|
|6.13||2960316||LU||Luxembourg||city, village,…||Luxembourg||capital of a political entity||en.wikipedia.org/wiki/Luxembourg_%28city%29||49.6116667||P||76684||PPLC|
|7.4166667||2993458||MC||Monaco||city, village,…||Monaco||capital of a political entity||en.wikipedia.org/wiki/Monaco||43.7333333||P||32965||PPLC|
|-2.10491180419922||3042091||JE||Saint Helier||city, village,…||Saint Helier||capital of a political entity||en.wikipedia.org/wiki/Saint_Helier||49.1880427659223||P||28000||PPLC|
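Notice that the rows above include Brussels, Bern and other capitals outside France. The query passes only north, east, south and west limits, so it presumably returns every city inside that rectangle, neighbours included. A sketch of narrowing the result to France follows. It rebuilds a small data frame from the names and country codes visible in the table, and the column names `name` and `countryCode` are my own labels, since the header row is not shown; check names(frenchcities) for the real ones.

```r
# Names and country codes copied from the table above; the column names
# are assumed labels, not necessarily those returned by GNcities().
cities <- data.frame(
  name = c("Paris", "Brussels", "Bern", "Luxembourg", "Monaco", "Saint Helier"),
  countryCode = c("FR", "BE", "CH", "LU", "MC", "JE"),
  stringsAsFactors = FALSE
)

# Keep only the rows whose country code is France's.
in_france <- cities[cities$countryCode == "FR", ]
in_france$name
```

With dplyr loaded, the same step would be a filter() on the country code column.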
We can use geonames to search for georeferenced Wikipedia articles. Here are those within 20 km of Rio de Janeiro, comparing results for English-language Wikipedia (lang = "en") and Portuguese-language Wikipedia (lang = "pt"):
rio_english <- GNfindNearbyWikipedia(lat = -22.9083, lng = -43.1964, radius = 20, lang = "en", maxRows = 500)
rio_portuguese <- GNfindNearbyWikipedia(lat = -22.9083, lng = -43.1964, radius = 20, lang = "pt", maxRows = 500)
nrow(rio_english)
## [1] 305
nrow(rio_portuguese)
## [1] 349
PLOS ONE is an open-access journal. They allow access to an impressive range of search tools, and allow you to obtain the full text of their articles.
install.packages("rplos") ## Do this only once
library(rplos)
## Loading required package: ggplot2
Immediately we get a message. It’s a link to the tutorial on the rOpenSci website! How nice :)
Sys.setenv(PlosApiKey = "Paste your Key in here!!")
key <- Sys.getenv("PlosApiKey")
Remember to protect your key! It is important for your privacy. You know, like a key.
* Now we follow the rOpenSci tutorial on API keys
* Add .Rprofile to your .gitignore!!
* Make a .Rprofile file (windows tips, mac tips)
* Write the following in it:
options(PlosApiKey = "YOUR_KEY")
This code adds another element to the list of options, which you can see by calling options(). Part of the work done by rplos::searchplos() and friends is to go and obtain the value of this option with the function getOption("PlosApiKey"). This indicates two things:
1. Spelling is important when you set the option in your .Rprofile.
2. You can do a similar process for an arbitrary package or key. For example:
## in .Rprofile
options("this_is_my_key" = "XXXX")
## later, in the R script:
key <- getOption("this_is_my_key")
This is a simple means to keep your keys private, especially if you are sharing the same authentication across several projects.
print("This is Andrew's Rprofile and you can't have it!")
options(PlosApiKey = "XXXXXXXXX")
Remember that using .Rprofile makes your code un-reproducible. In this case, that is exactly what we want!
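One practical consequence is that a script can fail in confusing ways on a machine where the key was never set. The sketch below wraps the lookup so that it fails early with a clear message. The helper get_api_key() is hypothetical (it is not part of rplos or geonames); it looks in options() first and falls back to an environment variable of the same name, echoing the Sys.getenv() approach shown earlier.

```r
# Look up a key in options(), then in the environment; stop if neither has it.
get_api_key <- function(option_name, env_name = option_name) {
  key <- getOption(option_name, default = Sys.getenv(env_name, unset = ""))
  if (!nzchar(key)) {
    stop("No key found: set options(", option_name,
         ' = "...") in your .Rprofile', call. = FALSE)
  }
  key
}

# Normally this line lives in .Rprofile; the value here is a placeholder.
options(this_is_my_key = "not-a-real-key")

key <- get_api_key("this_is_my_key")
```

Because the option name is passed as a string, the same helper works for any package or service, which matches point 2 above.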
Let’s do some searches:
searchplos(q= "Helianthus", fl= "id", limit = 5)
searchplos("materials_and_methods:France", fl = "title, materials_and_methods", key = key)
lat <- searchplos("materials_and_methods:study site", fl = "title, materials_and_methods", key = key)
aff <- searchplos("*:*", fl = "title, author_affiliate", key = key)
aff$author_affiliate[]
searchplos("*:*", fl = "id", key = key)
here is a list of options for the search or can do
out <- highplos(q = 'alcohol', hl.fl = 'abstract', rows = 10, key = key)
highbrow(out)
plot_throughtime(terms = "phylogeny", limit = 200, key = key)
gender throughout US history
The gender package allows you access to American data on the gender of names. Because names change gender over the years, the probability of a name belonging to a man or a woman also depends on the year:
library(gender)
gender("Kelsey")
gender("Kelsey", years = 1940)