Overview
Although we spend a lot of time working with data interactively, this sort of hands-on babysitting is not always appropriate. We have a philosophy of “source is real” in this class and that philosophy can be implemented on a grander scale. Just as we save R code in a script so we can replay analytical steps, we can also record how a series of scripts and commands work together to produce a set of analytical results. This is what we mean by automating data analysis or building an analytical pipeline.
- Chapter 33 - Why and how we automate data analyses + examples
- Chapter 34 - `make`: special considerations for Windows
  - 2015-11-17 NOTE: since we have already set up a build environment for R packages, it is my hope that everyone has `make`. These instructions were from 2014, when we did everything in a different order. Cross your fingers and ignore!
  - (If you are running macOS or Linux, `make` should already be installed.)
- Chapter 35 - Test drive `make` and RStudio
  - Walk before you run! Prove that `make` is actually installed and that it can be found and executed from the shell and from RStudio. It is also important to tell RStudio to NOT substitute spaces for tabs when editing a `Makefile` (applies to any text editor).
- Chapter 36 - Hands-on activity
  - This fully developed example shows you:
    - How to run an R script non-interactively
    - How to use `make`…
      - To record which files are inputs vs intermediates vs outputs
      - To capture how scripts and commands convert inputs to outputs
      - To re-run parts of an analysis that are out-of-date
    - The intersection of R and `make`, i.e. how to…
      - Run snippets of R code
      - Run an entire R script
      - Render an R Markdown document (or R script)
    - The interface between RStudio and `make`
    - How to use `make` from the shell
    - How Git facilitates the process of building a pipeline
  - 2015-11-19 Andrew MacDonald translated the above into a pipeline for the `remake` package from Rich Fitzjohn: see this gist.
- Chapter 37 - Three more toy pipelines, using the Lord of the Rings data
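To make the ideas listed for the hands-on activity concrete, here is a minimal sketch of a `Makefile`. It is not the example used in the chapters: every file name (`raw.csv`, `01_tidy.R`, `tidy.csv`, `report.Rmd`) is hypothetical, and it assumes that `01_tidy.R` reads `raw.csv` and writes `tidy.csv`. Each rule is written in the one-line `target: prerequisites ; recipe` form so the sketch survives copy and paste without tab characters; the more common layout puts the recipe on the next line, indented with a literal tab.

```makefile
# Hypothetical file names throughout; adapt them to your own project.

# Default target: running plain `make` builds the final output.
all: report.html

# Intermediate: rebuilt only when the raw input or the script is newer.
# Rscript runs the whole R script non-interactively.
tidy.csv: raw.csv 01_tidy.R ; Rscript 01_tidy.R

# Output: rebuilt when the tidy data or the R Markdown source changes.
# Rscript -e runs a snippet of R code, here rendering the document.
report.html: report.Rmd tidy.csv ; Rscript -e 'rmarkdown::render("report.Rmd")'

# Remove everything make knows how to regenerate.
clean: ; rm -f tidy.csv report.html

# all and clean are commands, not files.
.PHONY: all clean
```

From the shell, `make` builds whatever is out-of-date, `make -n` prints the commands that would run without executing them, and `make clean` removes the intermediate and the output so the pipeline can be replayed from the raw input.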
Resources
- xkcd comic on automation. ‘Automating’ comes from the roots ‘auto-’ meaning ‘self-’, and ‘mating’, meaning ‘screwing’.
- Karl Broman covers GNU Make in his course Tools for Reproducible Research.
- Karl Broman also wrote minimal make: a minimal tutorial on make, aimed at stats / data science types.
- Using Make for reproducible scientific analyses, blog post by Ben Morris.
- Software Carpentry’s Slides on `Make`.
- Zachary M. Jones wrote GNU Make for Reproducible Data Analysis.
- Keeping tabs on your data analysis workflow, blog post by Adam Laiacano.
- Mike Bostock, of D3.js and New York Times fame, explains Why Use Make: “it’s about the benefits of capturing workflows via a file-based dependency-tracking build system”.
- Make for Data Scientists, blog post by Paul Butler, who also made a beautiful map of Facebook connections using R.
- Other, more modern data-oriented alternatives to `make`:
- Managing Projects with GNU Make, 3rd Edition by Robert Mecklenburg (2009) is a fantastic book but, sadly, is very focused on compiling software.
- `littler` is an R package maintained by Dirk Eddelbuettel that “provides the `r` program, a simplified command-line interface for GNU R.”