B Short random things

B.1 Draw the rest of the owl, a pep talk for building off simple examples

"How to draw an owl" from [imgur](http://imgur.com/gallery/RadSf)"

Figure B.1: “How to draw an owl” from imgur

Figure B.1 is an image that is often used to illustrate how hard it can be to go from simple examples to the real thing you actually want.

I recently needed to draw a f*cking owl in R, so I decided to record the process as an experiment.

When I teach STAT545 or Software Carpentry, I try to convey as much about process as anything else. You can always look up technical details, e.g., syntax, but you don’t usually get to see how other people work. This is also how I approach teaching about writing R functions. Newcomers often look at finished code and assume it flowed perfectly formed out of someone’s fingertips. It probably did not.

My modus operandi: start with something that works and add features in small increments, maniacally checking that everything still works. Other people undoubtedly move faster (and, therefore, crash harder), but I’m OK with that.

B.1.1 Context: writing a function factory

I have an R package googlesheets that gets Google Sheets in and out of R. Lately we’ve had a lot of trouble with Internal Server Error (HTTP 500), which, as you might expect, is an error on the Google server side. All you can do as a user is try, try again. But this is a showstopper for unattended scripts or multi-step operations, like building and checking the package. A single error renders lots of other work moot, which is completely infuriating.

I want to catch these errors and automatically retry the request after an appropriate delay.

The brute force approach would be to literally drop little for or while loops all over the package, to inspect the response and retry if necessary. But I try to follow the DRY principle, so I would prefer to write a new “retry-capable” version of the function that makes these HTTP requests.

It also turns out there’s more than one function for making these requests. I’m talking about the HTTP verbs you use with REST APIs: GET, POST, PATCH, etc. I potentially need to give them all the “retry” treatment. So what I really need is a function factory: an HTTP verb goes in and out comes a retry-capable version of the verb.

It turns out you can write R (or S) for ~20 years and not be very facile with this technique. I certainly am not! But I can read, which is how I got the two circles to start my owl.

B.1.2 Start at the beginning

My reference is the section of Wickham’s Advanced R (2015) that is about closures, “functions written by functions”. Here’s one of the two main examples: a function that creates an exponentiation function.

power <- function(exponent) {
  function(x) {
    x ^ exponent
  }
}
square <- power(2)
square(2)
#> [1] 4
square(4)
#> [1] 16
cube <- power(3)
cube(2)
#> [1] 8
cube(4)
#> [1] 64

I make myself type all this code in and run it. No shortcuts!

What have I learned? I can write a factory, power(), that takes some input, exponent, and gives me back a function, such as square() or cube().

But let’s be honest, this is pretty far from what I need to do.

B.1.3 Can the input be a function?

My problem is different. My input is a function, not an exponent like 2 or 3. Can I even do that?

The simplest thing I could think of that sort of looks like my problem:

The factory takes a function as input.
It returns a function that executes the input function twice, with whatever inputs that function had in the first place.

call_me_twice <- function(f) {
  function(...) {
    f(...)
    f(...)
  }
}

Now I need an input function to be the guinea pig. It needs to take input and be chatty, so I can tell if it’s getting executed. Make sure it works as expected!

jfun <- function(x) cat(x, "\n")
jfun("a")
#> a
jfun(1)
#> 1

Put it all together.

jfunner <- call_me_twice(jfun)
jfunner("a")
#> a 
#> a
jfunner(1)
#> 1 
#> 1

I won’t lie, I’m pleasantly surprised this worked. Morale boost.

B.1.4 Faux VERB

I need a placeholder for the HTTP verbs with these qualities:

Takes some input.
Generates a non-deterministic status.
Returns a list with the input, status, and some content.

VERB <- function(url = "URL!")
  list(url = url,
       status = sample(c(200, 500), size = 1, prob = c(0.6, 0.4)),
       content = rnorm(5))
VERB()
#> $url
#> [1] "URL!"
#> 
#> $status
#> [1] 500
#> 
#> $content
#> [1]  0.773  0.922  2.287  0.608 -0.256
VERB()
#> $url
#> [1] "URL!"
#> 
#> $status
#> [1] 200
#> 
#> $content
#> [1] -1.695  0.156  0.917  1.237  0.163

Oh wait, we have functions in R to do something over and over again.

replicate(5, VERB())
#>         [,1]      [,2]      [,3]      [,4]      [,5]     
#> url     "URL!"    "URL!"    "URL!"    "URL!"    "URL!"   
#> status  200       200       500       500       200      
#> content Numeric,5 Numeric,5 Numeric,5 Numeric,5 Numeric,5

Send VERB() off to the function factory.

VERB_twice <- call_me_twice(VERB)
replicate(5, VERB_twice())
#>         [,1]      [,2]      [,3]      [,4]      [,5]     
#> url     "URL!"    "URL!"    "URL!"    "URL!"    "URL!"   
#> status  500       200       200       200       200      
#> content Numeric,5 Numeric,5 Numeric,5 Numeric,5 Numeric,5

Hmmmm…I only see one output per call of VERB_twice(). But why? Is it because VERB() is only getting called once? That means I’ve screwed up. Or is VERB() getting called twice but I’m only seeing evidence of the second call?

B.1.5 A better faux VERB

VERB <- function(url = "URL!") {
  req <- list(url = url,
              status = sample(c(200, 500), size = 1, prob = c(0.6, 0.4)),
              content = rnorm(5))
  message(req$status)
  req
}
replicate(5, VERB())
#> 200
#> 200
#> 500
#> 200
#> 200
#>         [,1]      [,2]      [,3]      [,4]      [,5]     
#> url     "URL!"    "URL!"    "URL!"    "URL!"    "URL!"   
#> status  200       200       500       200       200      
#> content Numeric,5 Numeric,5 Numeric,5 Numeric,5 Numeric,5

Why is this better? Each call of VERB() causes a message AND returns something.

Send new and improved VERB() off to the function factory.

VERB_twice <- call_me_twice(VERB)
replicate(5, VERB_twice())
#> 200
#> 200
#> 500
#> 200
#> 200
#> 200
#> 200
#> 200
#> 200
#> 200
#>         [,1]      [,2]      [,3]      [,4]      [,5]     
#> url     "URL!"    "URL!"    "URL!"    "URL!"    "URL!"   
#> status  200       200       200       200       200      
#> content Numeric,5 Numeric,5 Numeric,5 Numeric,5 Numeric,5

I like it! What do I like about it?

5 calls produce 10 messages, which tells me VERB() is getting called twice.
5 calls produce 5 outputs, which is good for my eventual goal, where I will only want to return the value of the last call of the enclosed function.
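Messages are one way to see how often the enclosed function runs. Another, sketched here, is to count the calls directly with a second closure. The names count_calls() and doubler_twice() are made up for this illustration, and call_me_twice() is repeated only so the block runs on its own.

```r
# Hypothetical helper: wrap f so that every call bumps a private counter.
count_calls <- function(f) {
  n_calls <- 0
  wrapped <- function(...) {
    n_calls <<- n_calls + 1  # modify the counter in the enclosing environment
    f(...)
  }
  list(f = wrapped, count = function() n_calls)
}

# Same factory as above.
call_me_twice <- function(f) {
  function(...) {
    f(...)
    f(...)
  }
}

counted <- count_calls(function(x) x * 2)
doubler_twice <- call_me_twice(counted$f)

result <- doubler_twice(21)
result            # value of the second (last) call only
#> [1] 42
counted$count()   # but the enclosed function really ran twice
#> [1] 2
```

One call, one returned value, two executions: the same conclusion the messages gave.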

B.1.6 Retry `n` times … and a temporary setback

Instead of hard-wiring 2 calls of the enclosed function f, let’s call it n times via a for loop.

call_me_n <- function(f, n = 3) {
  function(...) for (i in seq_len(n)) f(...)
}

Let’s try my new function factory.

VERB_3 <- call_me_n(VERB)
VERB_3()
#> 500
#> 200
#> 200

That’s disappointing. I see the message, but get no actual output. Is there really no output coming back? Or is it just invisible?

x <- VERB_3()
#> 200
#> 500
#> 200
str(x)
#>  NULL

Nope, there really is no output. Fix that.

call_me_n <- function(f, n = 3) {
  function(...) {
    for (i in seq_len(n)) out <- f(...)
    out
  }
}
VERB_3 <- call_me_n(VERB)
VERB_3()
#> 200
#> 500
#> 200
#> $url
#> [1] "URL!"
#> 
#> $status
#> [1] 200
#> 
#> $content
#> [1] -1.3931 -1.1298 -2.6355  0.7029 -0.0562

YAY! Before I move on, let’s make sure I can actually set the n argument to something other than 3.

VERB_4 <- call_me_n(VERB, 4)
VERB_4()
#> 200
#> 200
#> 200
#> 200
#> $url
#> [1] "URL!"
#> 
#> $status
#> [1] 200
#> 
#> $content
#> [1]  1.480 -0.021  0.475  0.649 -1.092
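Looking back at the setback: a for loop in R always evaluates to NULL (invisibly), whatever happens in its body, so a function whose body is just a loop returns NULL. A minimal sketch of the problem and the fix, using the made-up names loop_value and last_value():

```r
# The value of a for loop is NULL, even though the body computes something.
loop_value <- for (i in 1:3) i
is.null(loop_value)
#> [1] TRUE

# Fix: store the result of each iteration and return it explicitly.
last_value <- function(n) {
  out <- NULL
  for (i in seq_len(n)) out <- i
  out
}
last_value(3)
#> [1] 3
```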

B.1.7 Conditional retries

Almost done!

My real factory needs to use the output of the enclosed HTTP verb to decide whether a retry is sensible. The new name reflects HTTP verb specificity. I now add the actual logic and behavior I need in real life.

VERB_n <- function(VERB, n = 3) {
  force(VERB)
  force(n)
  function(...) {
    for (i in seq_len(n)) {
      out <- VERB(...)
      if (out$status < 499 || i == n) break
      backoff <- runif(n = 1, min = 0, max = 2 ^ i - 1)
      message("HTTP error ", out$status, " on attempt ", i,
              " ... retrying after a back off of ", round(backoff, 2),
              " seconds.")
      Sys.sleep(backoff)
    }
    out
  }
}
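A note on the delay: the wait before each retry is drawn uniformly between 0 and 2 ^ i - 1 seconds, where i is the attempt that just failed, so the ceiling roughly doubles each time. This is one form of exponential backoff with random jitter. The sketch below just tabulates those ceilings for n = 5, where a sleep can follow attempts 1 through 4. The object names are made up for illustration.

```r
# Attempts that can be followed by a sleep when n = 5
# (attempt 5 breaks out of the loop without sleeping).
attempt <- 1:4

# Upper bound used in runif(n = 1, min = 0, max = 2 ^ i - 1).
max_backoff <- 2 ^ attempt - 1

# Expected value of a uniform draw on [0, max].
mean_backoff <- max_backoff / 2

schedule <- data.frame(attempt, max_backoff, mean_backoff)
schedule

# Longest possible total time spent sleeping, in seconds.
worst_case_total <- sum(max_backoff)
worst_case_total
#> [1] 26
```

So even in the unluckiest case, five attempts cost under half a minute of waiting.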

I send my existing faux VERB() off to the new and improved factory. Start providing input again, just to make sure that all still works.

VERB_5 <- VERB_n(VERB, n = 5)
VERB_5("Owls can rotate their necks 270 degrees.")
#> 200
#> $url
#> [1] "Owls can rotate their necks 270 degrees."
#> 
#> $status
#> [1] 200
#> 
#> $content
#> [1]  1.381 -1.811 -0.326 -0.494 -1.732
VERB_5("Owls are cute.")
#> 200
#> $url
#> [1] "Owls are cute."
#> 
#> $status
#> [1] 200
#> 
#> $content
#> [1]  0.365 -0.258  0.574 -0.681  0.347
VERB_5("A group of owls is called a Parliament.")
#> 500
#> HTTP error 500 on attempt 1 ... retrying after a back off of 0.91 seconds.
#> 500
#> HTTP error 500 on attempt 2 ... retrying after a back off of 0.92 seconds.
#> 500
#> HTTP error 500 on attempt 3 ... retrying after a back off of 5.66 seconds.
#> 200
#> $url
#> [1] "A group of owls is called a Parliament."
#> 
#> $status
#> [1] 200
#> 
#> $content
#> [1] -1.1191 -2.3098  0.5061 -0.0282  0.3780

And we have drawn some f*cking owls, with retries!

Figure B.2: From [BuzzFeed](https://img.buzzfeed.com/buzzfeed-static/static/2014-03/enhanced/webdr05/7/10/enhanced-22024-1394207918-30.jpg)

B.1.8 The final result is not that exciting

Now in real life, I create retry-capable HTTP verbs like so: rGET <- VERB_n(httr::GET). Then just replace all instances of httr::GET() with rGET(). It’s terribly anticlimactic.

The final version of the function factory is about a dozen lines of fairly pedestrian code. I probably wrote and discarded at least 10x that. This is typical, so don’t be surprised if this is how it works for you too. Get a working example and take tiny steps to morph it into the thing you need.

The results of this effort are, however, pretty gratifying. I have had zero build/check failures locally and on Travis, since I implemented retries on httr::GET(). Or, to be honest, I’ve had failures, but for other reasons. So it was totally worth it! I also thank Konrad Rudolph and Kevin Ushey for straightening me out on the need to use force() inside the function factory.
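Why force() matters, in miniature: R evaluates function arguments lazily, so without force() a factory’s argument is not looked up until the manufactured function first uses it. If the variable changed in the meantime, you get the new value. The two adder factories below are made up purely to show the difference.

```r
# Without force(): n is a promise, evaluated on first use of the inner function.
make_adder_lazy <- function(n) {
  function(x) x + n
}

# With force(): n is evaluated when the factory is called.
make_adder_forced <- function(n) {
  force(n)
  function(x) x + n
}

k <- 1
lazy <- make_adder_lazy(k)
forced <- make_adder_forced(k)

k <- 100   # change k after both adders were created

lazy(0)    # picks up the later value of k
#> [1] 100
forced(0)  # locked in the value of k at creation time
#> [1] 1
```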

B.2 How to obtain a bunch of GitHub issues or pull requests with R

Using dplyr + purrr + tidyr to analyze data about GitHub repos via the gh package

B.3 How to tame XML with nested data frames and purrr

Using dplyr + purrr + tidyr + xml2 to tame the annoying XML from Google Sheets

B.4 Make browsing your GitHub repos more rewarding

The unreasonable effectiveness of GitHub browsability

One of my favorite aspects of GitHub is the ability to inspect a repository’s files in a browser. Certain practices make browsing more rewarding and can postpone the day when you must create a proper website for a project. Perhaps indefinitely.

B.4.1 Be savvy about your files

Keep files in the plainest, web-friendliest form that is compatible with your main goals. Plain text is the very best. GitHub offers special handling for certain types of files:

Markdown files, which may be destined for conversion into, e.g., HTML
Markdown files named README.md
HTML files, often the result of compiling Markdown files
Source code, such as .R files
Delimited files, containing data one might bring into R via read.table()
PNG files

B.4.2 Get over your hang ups re: committing derived products

Let’s acknowledge the discomfort some people feel about putting derived products under version control. Specifically, if you’ve got an R Markdown document foo.Rmd, it can be knit() to produce the intermediate product foo.md, which can be converted to the ultimate output foo.html. Which of those files are you “allowed” to put under version control? Source-is-real hardliners will say only foo.Rmd but pragmatists know this can be a serious bummer in real life. Just because I can rebuild everything from scratch, it doesn’t mean I want to.

The taboo of keeping derived products under version control originates from compilation of binary executables from source. Software built on a Mac would not work on Windows and so it made sense to keep these binaries out of the holy source code repository. Also, you could assume the people with access to the repository have the full development stack and relish opportunities to use it. None of these arguments really apply to the foo.Rmd --> foo.md --> foo.html workflow. We don’t have to blindly follow traditions from the compilation domain!

In fact, looking at the diffs for foo.md or foo-figure-01.png can be extremely informative. This is also true in larger data analytic projects after a make clean; make all operation. By looking at the diffs in the downstream products, you often catch unexpected changes. This can tip you off to changes in the underlying data and/or the behavior of packages you depend on.

This is a note about cool things GitHub can do with various file types, if they happen to end up in your repo. I won’t ask you how they got there.

B.4.3 Markdown

You will quickly discover that GitHub renders Markdown files very nicely. By clicking on foo.md, you’ll get a decent preview of foo.html. Yay!

Aggressively exploit this handy feature. Make Markdown your default format for narrative text files and use them liberally to embed notes to yourself and others in a repository hosted on GitHub. It’s an easy way to get pseudo-webpages inside a project “for free”. You may never even compile these files to HTML explicitly; in many cases, the HTML preview offered by GitHub is all you ever need.

What does this mean for R Markdown files? Keep intermediate Markdown. Commit both foo.Rmd and foo.md, even if you choose to .gitignore the final foo.html. As of September 2014, GitHub renders R Markdown files nicely, like Markdown, and with proper syntax highlighting, which is great. But, of course, the code blocks just sit there un-executed, so my advice about keeping intermediate Markdown still holds. You want YAML frontmatter that looks something like this for .Rmd:

---
title: "Something fascinating"
author: "Jenny Bryan"
date: "`r format(Sys.Date())`"
output:
  html_document:
    keep_md: TRUE
---

or like this for .R:

#' ---
#' title: "Something fascinating"
#' author: "Jenny Bryan"
#' date: "`r format(Sys.Date())`"
#' output:
#'   html_document:
#'     keep_md: TRUE
#' ---

In RStudio, when editing .Rmd, click on the gear next to “Knit HTML” for YAML authoring help.

For a quick, stand-alone document that doesn’t fit neatly into a repository or project (yet), make it a Gist.

Example: Hadley Wickham’s advice on what you need to do to become a data scientist. Gists can contain multiple files, so you can still provide the R script or R Markdown source and the resulting Markdown, as I’ve done in this write-up of Twitter-sourced tips for cross-tabulation.

B.4.4 `README.md`

You probably already know that GitHub renders README.md at the top-level of your repo as the de facto landing page. This is analogous to what happens when you point a web browser at a directory instead of a specific web page: if there is a file named index.html, that’s what the server will show you by default. On GitHub, files named README.md play exactly this role for directories in your repo.

Implication: for any logical group of files or mini project-within-your-project, create a sub-directory in your repository. And then create a README.md file to annotate these files, collect relevant links, etc. Now when you navigate to the sub-directory on GitHub the nicely rendered README.md will simply appear.

Some repositories consist solely of README.md. Examples: Jeff Leek’s write-ups on How to share data with a statistician or Developing R packages. I am becoming a bigger fan of README-only repos than gists because repo issues trigger notifications, whereas comments on gists do not.

If you’ve got a directory full of web-friendly figures, such as PNGs, you can use code like this to generate a README.md for a quick DIY gallery, as Karl Broman has done with his FruitSnacks. I have also used this device to share Keynote slides on GitHub (mea culpa!). Export them as PNG images and throw ’em into a README gallery: slides on file organization and some on file naming.
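A minimal sketch of such a gallery generator, assuming all the PNGs sit directly in one directory and have URL-friendly names (no spaces). The function name make_png_gallery() and the “Gallery” heading are invented here; this is not the code linked above.

```r
# Hypothetical helper: write a README.md that embeds every PNG found in `dir`.
make_png_gallery <- function(dir = ".") {
  pngs <- sort(list.files(dir, pattern = "\\.png$", ignore.case = TRUE))
  lines <- c(
    "# Gallery",
    "",
    sprintf("![%s](%s)", pngs, pngs)  # relative links, one image per line
  )
  writeLines(lines, file.path(dir, "README.md"))
  invisible(lines)
}

# Try it on a scratch directory holding two empty stand-in files.
demo_dir <- file.path(tempdir(), "gallery-demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, c("slide-01.png", "slide-02.png")))

gallery <- make_png_gallery(demo_dir)
gallery
```

Because the links are relative, the README keeps working when the directory is moved or the repo is forked.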

B.4.5 Finding stuff

OK these are pure GitHub tips but if you’ve made it this far, you’re obviously a keener.

Press t to activate the file finder whenever you’re in a repo’s file and directory view. AWESOME, especially when there are files tucked into lots of subdirectories.
Press y to get a permanent link when you’re viewing a specific file. Watch what changes in the URL. This is important if you are about to link to a file or to specific lines. Otherwise your links will break easily in the future. If the file is deleted or renamed or if lines get inserted or deleted, your links will no longer point to what you intended. Use y to get links that include a specific commit in the URL.

B.4.6 HTML

If you have an HTML file in a GitHub repository, simply visiting the file shows the raw HTML. Here’s a nice ugly example:

https://github.com/STAT545-UBC/STAT545-UBC.github.io/blob/master/bit003_api-key-env-var.html

No one wants to look at that. ~~You can provide this URL to rawgit.com to serve this HTML more properly and get a decent preview.~~

~~You can form two different types of URLs with rawgit.com:~~

~~For sharing low-traffic, temporary examples or demos with small numbers of people, do this:~~
- ~~https://rawgit.com/STAT545-UBC/STAT545-UBC.github.io/master/bit003_api-key-env-var.html~~
- ~~Basically: replace https://github.com/ with https://rawgit.com/~~
~~For use on production websites with any amount of traffic, do this:~~
- ~~https://cdn.rawgit.com/STAT545-UBC/STAT545-UBC.github.io/master/bit003_api-key-env-var.html~~
- ~~Basically: replace https://github.com/ with https://cdn.rawgit.com/~~

2018-10-09 update: RawGit announced that it is in a sunset phase and will soon shut down. They recommended: jsDelivr, GitHub Pages, CodeSandbox, and unpkg as alternatives.

You may also want to check out this Chrome extension or GitHub & BitBucket HTML Preview.

This sort of enhanced link might be one of the useful things to put in a README.md or other Markdown file in the repo.

Sometimes including HTML files will cause GitHub to think that your R repository is HTML. Besides being slightly annoying, this can make it difficult for people to find your work if they are searching specifically for R repos. You can exclude these files or directories from GitHub’s language statistics by adding a .gitattributes file that marks them as ‘documentation’ rather than code. See an example here.
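For what it’s worth, the .gitattributes entry is a single line: a path pattern followed by the linguist-documentation attribute. The sketch below adds that line from R. The helper name mark_html_as_docs() and the "*.html" pattern are assumptions for illustration, so adjust the pattern to wherever your HTML actually lives.

```r
# Hypothetical helper: add a linguist override to a repo's .gitattributes,
# keeping any lines that are already there.
mark_html_as_docs <- function(repo_dir = ".") {
  path <- file.path(repo_dir, ".gitattributes")
  rule <- "*.html linguist-documentation"
  existing <- if (file.exists(path)) readLines(path) else character(0)
  if (!rule %in% existing) writeLines(c(existing, rule), path)
  invisible(path)
}

# Try it in a scratch directory rather than a real repo.
scratch <- file.path(tempdir(), "linguist-demo")
dir.create(scratch, showWarnings = FALSE)
attr_path <- mark_html_as_docs(scratch)
readLines(attr_path)
```

The language statistics only update after the file is committed and pushed.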

B.4.7 Source code

You will notice that GitHub does automatic syntax highlighting for source code. For example, notice the coloring of this R script. The file’s extension is the primary determinant for if/how syntax highlighting will be applied. You can see information on recognized languages, the default extensions and more at github/linguist. You should be doing it anyway, but let this be another reason to follow convention in your use of file extensions.

Note you can click on “Raw” in this context as well, to get just the plain text and nothing but the plain text.

B.4.8 Delimited files

GitHub will nicely render tabular data in the form of .csv (comma-separated) and .tsv (tab-separated) files. You can read more in the blog post announcing this feature in August 2013 or in this GitHub help page.

Advice: take advantage of this! If something in your repo can be naturally stored as delimited data, by all means, do so. Make the comma or tab your default delimiter and use the file suffixes GitHub is expecting. I have noticed that GitHub is more easily confused than R about things like quoting, so always inspect the GitHub-rendered .csv or .tsv file in the browser. You may need to do light cleaning to get the automagic rendering to work properly. Think of it as yet another way to learn about imperfections in your data.
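A small sketch of writing a GitHub-friendly tab-delimited file from R and reading it back. The data frame is a made-up toy and the file goes to a temporary directory. Turning off quoting is a choice made here to keep the file plain, and it only works when no field contains a tab.

```r
# Toy data, invented for this example.
toy <- data.frame(
  id = 1:3,
  label = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

tsv_path <- file.path(tempdir(), "toy.tsv")

# Tab as delimiter, .tsv as suffix, no quotes, no row names.
write.table(toy, tsv_path, sep = "\t", quote = FALSE, row.names = FALSE)

# What GitHub (and anyone clicking "Raw") will see.
readLines(tsv_path)

# Round trip: read it back and compare with the original.
roundtrip <- read.delim(tsv_path, stringsAsFactors = FALSE)
all.equal(toy, roundtrip)
#> [1] TRUE
```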

Here’s an example of a tab delimited file on GitHub: lotr_clean.tsv, originally found ~~here~~ (nope, IBM shut down manyeyes July 2015).

Note you can click on “Raw” in this context as well, to get just the plain text and nothing but the plain text.

B.4.9 PNGs

PNG is the “no brainer” format in which to store figures for the web. But many of us like a vector-based format, such as PDF, for general purpose figures. Bottom line: PNGs will drive you less crazy than PDFs on GitHub. To reduce the aggravation around viewing figures in the browser, make sure to have a PNG version in the repo.

Examples:

This PNG figure just shows up in the browser
A different figure stored as PDF ~~produces the dreaded, annoying “View Raw” speed bump. You’ll have to click through and, on my OS + browser, wait for the PDF to appear in an external PDF viewer.~~ 2015-06-19 update: since I first wrote this GitHub has elevated its treatment of PDFs so YAY. It’s slow but it works.

Hopefully we are moving towards a world where you can have “web friendly” and “vector” at the same time, without undue headaches. As of October 2014, GitHub provides enhanced viewing and diffing of SVGs. So don’t read this advice as discouraging SVGs. Make them! But consider keeping a PNG around as emergency back up for now.

B.4.10 Linking to a ZIP archive of your repo

The browsability of GitHub makes your work accessible to people who care about your content but who don’t (yet) use Git themselves. What if such a person wants all the files? Yes, there is a clickable “Download ZIP” button offered by GitHub. But what if you want a link to include in an email or other document? If you add /archive/master.zip to the end of the URL for your repo, you construct a link that will download a ZIP archive of your repository. Click here to try this out on a very small repo:

https://github.com/jennybc/lotr/archive/master.zip

Go look in your downloads folder!
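If you want to build such links in R, say for several repos at once, here is a minimal sketch. The helper name zip_url() is made up, and the default branch name "master" is an assumption that matches the example above.

```r
# Hypothetical helper: turn a repo URL into its "download ZIP" URL.
zip_url <- function(repo_url, branch = "master") {
  repo_url <- sub("/+$", "", repo_url)  # drop any trailing slash
  paste0(repo_url, "/archive/", branch, ".zip")
}

lotr_zip <- zip_url("https://github.com/jennybc/lotr")
lotr_zip
#> [1] "https://github.com/jennybc/lotr/archive/master.zip"

# To actually fetch the archive (needs a network connection):
# download.file(lotr_zip, destfile = "lotr-master.zip", mode = "wb")
```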

B.4.11 Links and embedded figures

To link to another page in your repo, just use a relative link:
- [admin](courseAdmin/) will link to the courseAdmin/ directory inside the current directory.
- [admin](/courseAdmin/) will link to the top-level courseAdmin/ directory from anywhere in the repo.
The same idea also works for images. ![](image.png) will include image.png located in the current directory.

B.4.12 Let people correct you on the internet

They love that!

You can create a link that takes people directly to an editing interface in the browser. Behind the scenes, assuming the clicker is signed into GitHub but is not you, this will create a fork in their account and send you a pull request. When I click the link below, I am able to actually commit directly to master for this repo.

CLICK HERE to suggest an edit to this page!

Here’s what that link looks like in the Markdown source:

[CLICK HERE to suggest an edit to this page!](https://github.com/rstudio-education/stat545/edit/master/39_appendix.Rmd)

and here it is with placeholders:

[INVITATION TO EDIT](<URL to your repo>/edit/master/<path to your md file>)

AFAIK, to do that in a slick automatic way across an entire repo/site, you need to be using Jekyll or some other automated system. But you could easily handcode such links on a small scale.
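Short of a full Jekyll setup, a few lines of R can at least take the tedium out of handcoding. This sketch fills in the placeholder pattern above. The helper name edit_link() and its defaults are invented for illustration.

```r
# Hypothetical helper: build a Markdown "suggest an edit" link for one file.
edit_link <- function(repo_url, path, branch = "master",
                      text = "CLICK HERE to suggest an edit to this page!") {
  repo_url <- sub("/+$", "", repo_url)  # drop any trailing slash
  sprintf("[%s](%s/edit/%s/%s)", text, repo_url, branch, path)
}

link <- edit_link("https://github.com/rstudio-education/stat545", "39_appendix.Rmd")
link
#> [1] "[CLICK HERE to suggest an edit to this page!](https://github.com/rstudio-education/stat545/edit/master/39_appendix.Rmd)"
```

Since sprintf() is vectorized, passing a character vector of paths returns one link per file.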

B.5 How to send a bunch of emails from R

Workflow for sending email with R and gmailr.

B.6 Store an API key as an environment variable

This can be found here.

B.7 Data Carpentry lesson on tidy data

A lesson I contributed to Data Carpentry on tidying data.

This is a lesson on tidying data. Specifically, what to do when a conceptual variable is spread out over 2 or more variables in a data frame.

Data used: words spoken by characters of different races and gender in the Lord of the Rings movie trilogy

Directory of this lesson in the Data Carpentry GitHub repo.
01-intro shows untidy and tidy data. Then we demonstrate how tidy data is more useful for analysis and visualization. Includes references, resources, and exercises.
02-tidy shows how to tidy data, using gather() from the tidyr package. Includes references, resources, and exercises.
03-tidy-bonus-content is not part of the lesson but may be useful as learners try to apply the principles of tidy data in more general settings. Includes links to packages used.
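To make “a conceptual variable spread out over 2 or more variables” concrete, here is a minimal sketch with gather(). The numbers are invented for illustration and are not the lesson’s data. Here the conceptual variable “words spoken” is split across a Female and a Male column. It assumes the tidyr package is installed.

```r
library(tidyr)

# Made-up word counts: one row per race, one column per gender (untidy).
untidy <- data.frame(
  Race = c("Elf", "Hobbit", "Man"),
  Female = c(10, 20, 30),
  Male = c(40, 50, 60),
  stringsAsFactors = FALSE
)

# Gather the two gender columns into a key column and a value column.
tidy <- gather(untidy, key = "Gender", value = "Words", Female, Male)

names(tidy)
#> [1] "Race"   "Gender" "Words"
nrow(tidy)
#> [1] 6
```

Each row of the result is now one Race and Gender combination with its word count, which is the shape most analysis and plotting functions expect.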

Learner-facing dependencies:

Files in the tidy-data/ sub-directory of the Data Carpentry data/ directory.
tidyr package (only true dependency).
ggplot2 is used for illustration but is not mission critical.
dplyr and reshape2 are used in the bonus content.

Instructor dependencies:

curl if you execute the code to grab the Lord of the Rings data used in examples from GitHub. Note that the files are also included in the datacarpentry/data/tidy-data/ directory, so data download is avoidable.
rmarkdown, knitr, and xtable if you want to compile the Rmd to md and html.