Chapter 7 Single table dplyr functions

7.1 Where were we?

In Chapter 6, Introduction to dplyr, we used two very important verbs and an operator:

  • filter() for subsetting data with row logic
  • select() for subsetting data variable- or column-wise
  • the pipe operator %>%, which feeds the LHS as the first argument to the expression on the RHS

We also discussed dplyr’s role inside the tidyverse and tibbles:

  • dplyr is a core package in the tidyverse meta-package. Since we often make incidental usage of the others, we will load dplyr and the others via library(tidyverse).
  • The tidyverse embraces a special flavor of data frame, called a tibble. The gapminder dataset is stored as a tibble.

7.4 Use mutate() to add new variables

Imagine we wanted to recover each country’s GDP. After all, the Gapminder data has a variable for population and GDP per capita. Let’s multiply them together.

mutate() is a function that defines and inserts new variables into a tibble. You can refer to existing variables by name.

Hmmmm … those GDP numbers are almost uselessly large and abstract. Consider the advice of Randall Munroe of xkcd:

One thing that bothers me is large numbers presented without context… ‘If I added a zero to this number, would the sentence containing it mean something different to me?’ If the answer is ‘no,’ maybe the number has no business being in the sentence in the first place."

Maybe it would be more meaningful to consumers of my tables and figures to stick with GDP per capita. But what if I reported GDP per capita, relative to some benchmark country. Since Canada is my adopted home, I’ll go with that.

I need to create a new variable that is gdpPercap divided by Canadian gdpPercap, taking care that I always divide two numbers that pertain to the same year.

How I achieve this:

  1. Filter down to the rows for Canada.
  2. Create a new temporary variable in my_gap:
    1. Extract the gdpPercap variable from the Canadian data.
    2. Replicate it once per country in the dataset, so it has the right length.
  3. Divide raw gdpPercap by this Canadian figure.
  4. Discard the temporary variable of replicated Canadian gdpPercap.

Note that, mutate() builds new variables sequentially so you can reference earlier ones (like tmp) when defining later ones (like gdpPercapRel). Also, you can get rid of a variable by setting it to NULL.

How could we sanity check that this worked? The Canadian values for gdpPercapRel better all be 1!

I perceive Canada to be a “high GDP” country, so I predict that the distribution of gdpPercapRel is located below 1, possibly even well below. Check your intuition!

The relative GDP per capita numbers are, in general, well below 1. We see that most of the countries covered by this dataset have substantially lower GDP per capita, relative to Canada, across the entire time period.

Remember: Trust No One. Including (especially?) yourself. Always try to find a way to check that you’ve done what meant to. Prepare to be horrified.

7.5 Use arrange() to row-order data in a principled way

arrange() reorders the rows in a data frame. Imagine you wanted this data ordered by year then country, as opposed to by country then year.

Or maybe you want just the data from 2007, sorted on life expectancy?

Oh, you’d like to sort on life expectancy in descending order? Then use desc().

I advise that your analyses NEVER rely on rows or variables being in a specific order. But it’s still true that human beings write the code and the interactive development process can be much nicer if you reorder the rows of your data as you go along. Also, once you are preparing tables for human eyeballs, it is imperative that you step up and take control of row order.

7.6 Use rename() to rename variables

When I first cleaned this Gapminder excerpt, I was a camelCase person, but now I’m all about snake_case. So I am vexed by the variable names I chose when I cleaned this data years ago. Let’s rename some variables!

I did NOT assign the post-rename object back to my_gap because that would make the chunks in this tutorial harder to copy/paste and run out of order. In real life, I would probably assign this back to my_gap, in a data preparation script, and proceed with the new variable names.

7.7 select() can rename and reposition variables

You’ve seen simple use of select(). There are two tricks you might enjoy:

  1. select() can rename the variables you request to keep.
  2. select() can be used with everything() to hoist a variable up to the front of the tibble.

everything() is one of several helpers for variable selection. Read its help to see the rest.

7.8 group_by() is a mighty weapon

I have found friends and family collaborators love to ask seemingly innocuous questions like, “which country experienced the sharpest 5-year drop in life expectancy?”. In fact, that is a totally natural question to ask. But if you are using a language that doesn’t know about data, it’s an incredibly annoying question to answer.

dplyr offers powerful tools to solve this class of problem:

  • group_by() adds extra structure to your dataset – grouping information – which lays the groundwork for computations within the groups.
  • summarize() takes a dataset with \(n\) observations, computes requested summaries, and returns a dataset with 1 observation.
  • Window functions take a dataset with \(n\) observations and return a dataset with \(n\) observations.
  • mutate() and summarize() will honor groups.
  • You can also do very general computations on your groups with do(), though elsewhere in this course, I advocate for other approaches that I find more intuitive, using the purrr package.

Combined with the verbs you already know, these new tools allow you to solve an extremely diverse set of problems with relative ease.

7.8.1 Counting things up

Let’s start with simple counting. How many observations do we have per continent?

Let us pause here to think about the tidyverse. You could get these same frequencies using table() from base R.

But the object of class table that is returned makes downstream computation a bit fiddlier than you’d like. For example, it’s too bad the continent levels come back only as names and not as a proper factor, with the original set of levels. This is an example of how the tidyverse smooths transitions where you want the output of step i to become the input of step i + 1.

The tally() function is a convenience function that knows to count rows. It honors groups.

The count() function is an even more convenient function that does both grouping and counting.

What if we wanted to add the number of unique countries for each continent? You can compute multiple summaries inside summarize(). Use the n_distinct() function to count the number of distinct countries within each continent.

7.8.2 General summarization

The functions you’ll apply within summarize() include classical statistical summaries, like mean(), median(), var(), sd(), mad(), IQR(), min(), and max(). Remember they are functions that take \(n\) inputs and distill them down into 1 output.

Although this may be statistically ill-advised, let’s compute the average life expectancy by continent.

summarize_at() applies the same summary function(s) to multiple variables. Let’s compute average and median life expectancy and GDP per capita by continent by year…but only for 1952 and 2007.

Let’s focus just on Asia. What are the minimum and maximum life expectancies seen by year?

Of course it would be much more interesting to see which country contributed these extreme observations. Is the minimum (maximum) always coming from the same country? We tackle that with window functions shortly.

7.9 Grouped mutate

Sometimes you don’t want to collapse the \(n\) rows for each group into one row. You want to keep your groups, but compute within them.

7.9.1 Computing with group-wise summaries

Let’s make a new variable that is the years of life expectancy gained (lost) relative to 1952, for each individual country. We group by country and use mutate() to make a new variable. The first() function extracts the first value from a vector. Notice that first() is operating on the vector of life expectancies within each country group.

Within country, we take the difference between life expectancy in year \(i\) and life expectancy in 1952. Therefore we always see zeroes for 1952 and, for most countries, a sequence of positive and increasing numbers.

7.9.2 Window functions

Window functions take \(n\) inputs and give back \(n\) outputs. Furthermore, the output depends on all the values. So rank() is a window function but log() is not. Here we use window functions based on ranks and offsets.

Let’s revisit the worst and best life expectancies in Asia over time, but retaining info about which country contributes these extreme values.

We see that (min = Afghanistan, max = Japan) is the most frequent result, but Cambodia and Israel pop up at least once each as the min or max, respectively. That table should make you impatient for our upcoming work on tidying and reshaping data! Wouldn’t it be nice to have one row per year?

How did that actually work? First, I store and view a partial that leaves off the filter() statement. All of these operations should be familiar.

Now we apply a window function – min_rank(). Since asia is grouped by year, min_rank() operates within mini-datasets, each for a specific year. Applied to the variable lifeExp, min_rank() returns the rank of each country’s observed life expectancy. FYI, the min part just specifies how ties are broken. Here is an explicit peek at these within-year life expectancy ranks, in both the (default) ascending and descending order.

For concreteness, I use mutate() to actually create these variables, even though I dropped this in the solution above. Let’s look at a bit of that.

Afghanistan tends to present 1’s in the le_rank variable, Japan tends to present 1’s in the le_desc_rank variable and other countries, like Thailand, present less extreme ranks.

You can understand the original filter() statement now:

These two sets of ranks are formed on-the-fly, within year group, and filter() retains rows with rank less than 2, which means … the row with rank = 1. Since we do for ascending and descending ranks, we get both the min and the max.

If we had wanted just the min OR the max, an alternative approach using top_n() would have worked.

7.10 Grand Finale

So let’s answer that “simple” question: which country experienced the sharpest 5-year drop in life expectancy? Recall that this excerpt of the Gapminder data only has data every five years, e.g. for 1952, 1957, etc. So this really means looking at life expectancy changes between adjacent timepoints.

At this point, that’s just too easy, so let’s do it by continent while we’re at it.

Ponder that for a while. The subject matter and the code. Mostly you’re seeing what genocide looks like in dry statistics on average life expectancy.

Break the code into pieces, starting at the top, and inspect the intermediate results. That’s certainly how I was able to write such a thing. These commands do not leap fully formed out of anyone’s forehead – they are built up gradually, with lots of errors and refinements along the way. I’m not even sure it’s a great idea to do so much manipulation in one fell swoop. Is the statement above really hard for you to read? If yes, then by all means break it into pieces and make some intermediate objects. Your code should be easy to write and read when you’re done.

In later tutorials, we’ll explore more of dplyr, such as operations based on two datasets.

7.11 Resources

dplyr official stuff:

RStudio Data Transformation Cheat Sheet, covering dplyr. Remember you can get to these via Help > Cheatsheets.

Data transformation chapter of R for Data Science (Wickham and Grolemund 2016).

Excellent slides on pipelines and dplyr by TJ Mahr, talk given to the Madison R Users Group.

Blog post Hands-on dplyr tutorial for faster data manipulation in R by Data School, that includes a link to an R Markdown document and links to videos.

Chapter 15: cheatsheet I made for dplyr join functions (not relevant yet but soon).