Overview

Although we spend a lot of time working with data interactively, this sort of hands-on babysitting is not always appropriate. We have a philosophy of “source is real” in this class, and that philosophy can be implemented on a grander scale. Just as we save R code in a script so we can replay analytical steps, we can also record how a series of scripts and commands work together to produce a set of analytical results. This is what we mean by automating data analysis or building an analytical pipeline.

Chapter 33 - Why and how we automate data analyses + examples
Chapter 34 - make: special considerations for Windows
- 2015-11-17 NOTE: since we have already set up a build environment for R packages, it is my hope that everyone has make. These instructions were from 2014, when we did everything in a different order. Cross your fingers and ignore!
- (If you are running macOS or Linux, make should already be installed.)
Chapter 35 - Test drive make and RStudio
- Walk before you run! Prove that make is actually installed and that it can be found and executed from the shell and from RStudio. It is also important to tell RStudio NOT to substitute spaces for tabs when editing a Makefile, because make requires each recipe line to begin with a tab character (this applies to any text editor).
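Both checks can be sketched from the shell. This is a minimal sketch that assumes a POSIX-style shell (for example Git Bash on Windows); the helper name `have_tool` and the throwaway Makefile are illustrative and not part of the course materials.

```shell
# Report whether a command can be found on the PATH.
have_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: found"
  else
    echo "$1: NOT found"
  fi
}

have_tool make
have_tool Rscript

# Write a throwaway Makefile (hypothetical content).
# printf '\t' guarantees a real tab in front of the recipe line.
demo_dir=$(mktemp -d)
printf 'hello:\n\t@echo "make works"\n' > "$demo_dir/Makefile"

# Count the lines that begin with a tab; every recipe line must be one of them.
tab_lines=$(grep -c "$(printf '^\t')" "$demo_dir/Makefile")
echo "recipe lines starting with a tab: $tab_lines"

# Only try the build if make was actually found.
if command -v make >/dev/null 2>&1; then
  make -C "$demo_dir" hello
fi
```

If an editor has silently replaced the tab with spaces, the count drops to zero and make will typically stop with a "missing separator" error.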
Chapter 36 - Hands-on activity
- This fully developed example shows you:
  - How to run an R script non-interactively
  - How to use make…
    - To record which files are inputs vs intermediates vs outputs
    - To capture how scripts and commands convert inputs to outputs
    - To re-run parts of an analysis that are out-of-date
  - The intersection of R and make, i.e. how to…
    - Run snippets of R code
    - Run an entire R script
    - Render an R Markdown document (or R script)
  - The interface between RStudio and make
  - How to use make from the shell
  - How Git facilitates the process of building a pipeline
- 2015-11-19 Andrew MacDonald translated the above into a pipeline for the remake package from Rich FitzJohn: see this gist.
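To give a feel for what the activity builds toward, here is a rough sketch of a tiny pipeline, written from the shell. Everything named here is hypothetical (`raw.txt`, `count.R`, `counts.tsv`, `report.Rmd`, and the helpers `rule` and `recipe`); the chapter itself uses its own files. The sketch only assumes that `Rscript` and `rmarkdown::render()` are the tools used to run a script and render a report non-interactively.

```shell
# Sketch only: all file and script names below are hypothetical.
workdir=$(mktemp -d)

# Helpers: a plain Makefile line, and a recipe line that starts with a real tab.
rule()   { printf '%s\n' "$1"; }
recipe() { printf '\t%s\n' "$1"; }

{
  rule   'all: report.html'
  rule   ''
  rule   '# intermediate file: rebuilt when the input or the script changes'
  rule   'counts.tsv: raw.txt count.R'
  recipe 'Rscript count.R'
  rule   ''
  rule   '# output: rendered when the report source or the intermediate changes'
  rule   'report.html: report.Rmd counts.tsv'
  recipe "Rscript -e 'rmarkdown::render(\"report.Rmd\")'"
  rule   ''
  rule   'clean:'
  recipe 'rm -f counts.tsv report.html'
} > "$workdir/Makefile"

# Empty stand-ins for the input and the scripts, so make can inspect timestamps.
touch "$workdir/raw.txt" "$workdir/count.R" "$workdir/report.Rmd"

# Dry run: -n asks make to print the commands it would run without running them,
# so this works even when R is not installed.
if command -v make >/dev/null 2>&1; then
  make -C "$workdir" -n
fi
```

The targets on the left of each colon are the intermediates and outputs, the names on the right are their inputs, and the tab-indented lines record how one is converted into the other. Because make compares file timestamps, touching only `report.Rmd` and re-running would lead it to re-render the report without recomputing `counts.tsv`.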
Chapter 37 - Three more toy pipelines, using the Lord of the Rings data

Resources

xkcd comic on automation. ‘Automating’ comes from the roots ‘auto-’ meaning ‘self-’, and ‘mating’, meaning ‘screwing’.
Karl Broman covers GNU Make in his course Tools for Reproducible Research.
Karl Broman also wrote minimal make: a minimal tutorial on make, aimed at stats / data science types.
Using Make for reproducible scientific analyses, blog post by Ben Morris.
Software Carpentry’s slides on Make.
Zachary M. Jones wrote GNU Make for Reproducible Data Analysis.
Keeping tabs on your data analysis workflow, blog post by Adam Laiacano.
Mike Bostock, of D3.js and New York Times fame, explains Why Use Make: “it’s about the benefits of capturing workflows via a file-based dependency-tracking build system”.
Make for Data Scientists, blog post by Paul Butler, who also made a beautiful map of Facebook connections using R.
Other, more modern data-oriented alternatives to make:
- Drake, a kind of “make for data”
- Nextflow for “data-driven computational pipelines”
- remake, “Make-like declarative workflows in R”
Managing Projects with GNU Make, 3rd Edition by Robert Mecklenburg (2004) is a fantastic book but, sadly, is very focused on compiling software.
littler is an R package maintained by Dirk Eddelbuettel that “provides the r program, a simplified command-line interface for GNU R.”