Factors: where they fit in

We’ve spent alot of time working with big, beautiful data frames, like the Gapminder data. But we also need to manage the individual variables housed within.

Factors are the variable type that useRs love to hate. It is how we store truly categorical information in R. The values a factor can take on are called the levels. For example, the levels of the factor continent in Gapminder are are “Africa”, “Americas”, etc. and this is what’s usually presented to your eyeballs by R. In general, the levels are friendly human-readable character strings, like “male/female” and “control/treated”. But never ever ever forget that, under the hood, R is really storing integer codes 1, 2, 3, etc.

This Janus-like nature of factors means they are rich with booby traps for the unsuspecting but they are a necessary evil. I recommend you learn how to be the boss of your factors. The pros far outweigh the cons. Specifically in modelling and figure-making, factors are anticipated and accommodated by the functions and packages you will want to exploit.

The worst kind of factor is the stealth factor. The variable that you think of as character, but that is actually a factor (numeric!!). This is a classic R gotcha. Check your variable types explicitly when things seem weird. It happens to the best of us.

Where do stealth factors come from? Base R has a burning desire to turn character information into factor. The happens most commonly at data import via read.table() and friends. But data.frame() and other functions are also eager to convert character to factor. To shut this down, use stringsAsFactors = FALSE in read.table() and data.frame() or – even better – use the tidyverse! For data import, use readr::read_csv(), readr::read_tsv(), etc. For data frame creation, use tibble::tibble(). And so on.

Good articles about how the factor fiasco came to be:

The forcats package

forcats is a non-core package in the tidyverse. It is installed via install.packages("tidyverse"), but not loaded via library(tidyverse). Load it yourself as needed via library(forcats). Main functions start with fct_. There really is no coherent family of base functions that forcats replaces – that’s why it’s such a welcome addition.

Currently this lesson will be mostly code vs prose. See the previous lesson for more discussion during the transition.

Load tidyverse, forcats, gapminder

library(tidyverse)
#> + ggplot2 2.2.1             Date: 2017-10-29
#> + tibble  1.3.4                R: 3.4.1
#> + tidyr   0.7.1               OS: macOS Sierra 10.12.6
#> + readr   1.1.1              GUI: X11
#> + purrr   0.2.3.9000      Locale: en_CA.UTF-8
#> + dplyr   0.7.4               TZ: America/Vancouver
#> + stringr 1.2.0.9000      
#> + forcats 0.2.0
#> Warning: package 'dplyr' was built under R version 3.4.2
#> ── Conflicts ────────────────────────────────────────────────────
#> * filter(),  from dplyr, masks stats::filter()
#> * lag(),     from dplyr, masks stats::lag()
library(forcats)
library(gapminder)

Factor inspection

Get to know your factor before you start touching it! It’s polite. Let’s use gapminder$continent as our example.

str(gapminder$continent)
#>  Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
levels(gapminder$continent)
#> [1] "Africa"   "Americas" "Asia"     "Europe"   "Oceania"
nlevels(gapminder$continent)
#> [1] 5
class(gapminder$continent)
#> [1] "factor"

To get a frequency table as a tibble, from a tibble, use dplyr::count(). To get similar from a free-range factor, use forcats::fct_count().

gapminder %>% 
  count(continent)
#> # A tibble: 5 x 2
#>   continent     n
#>      <fctr> <int>
#> 1    Africa   624
#> 2  Americas   300
#> 3      Asia   396
#> 4    Europe   360
#> 5   Oceania    24
fct_count(gapminder$continent)
#> # A tibble: 5 x 2
#>          f     n
#>     <fctr> <int>
#> 1   Africa   624
#> 2 Americas   300
#> 3     Asia   396
#> 4   Europe   360
#> 5  Oceania    24

Dropping unused levels

Just because you drop all the rows corresponding to a specific factor level, the levels of the factor itself do not change. Sometimes all these unused levels can come back to haunt you later, e.g., in figure legends.

Watch what happens to the levels of country (= nothing) when we filter Gapminder to a handful of countries.

nlevels(gapminder$country)
#> [1] 142
h_countries <- c("Egypt", "Haiti", "Romania", "Thailand", "Venezuela")
h_gap <- gapminder %>%
  filter(country %in% h_countries)
nlevels(h_gap$country)
#> [1] 142

Even though h_gap only has data for a handful of countries, we are still schlepping around all 142 levels from the original gapminder tibble.

How can you get rid of them? The base function droplevels() operates on all the factors in a data frame or on a single factor. The function forcats::fct_drop() operates on a factor.

h_gap_dropped <- h_gap %>% 
  droplevels()
nlevels(h_gap_dropped$country)
#> [1] 5

## use forcats::fct_drop() on a free-range factor
h_gap$country %>%
  fct_drop() %>%
  levels()
#> [1] "Egypt"     "Haiti"     "Romania"   "Thailand"  "Venezuela"

Exercise: Filter the gapminder data down to rows where population is less than a quarter of a million, i.e. 250,000. Get rid of the unused factor levels for country and continent in different ways, such as:

Change order of the levels, principled

By default, factor levels are ordered alphabetically. Which might as well be random, when you think about it! It is preferable to order the levels according to some principle:

First, let’s order continent by frequency, forwards and backwards. This is often a great idea for tables and figures, esp. frequency barplots.

## default order is alphabetical
gapminder$continent %>%
  levels()
#> [1] "Africa"   "Americas" "Asia"     "Europe"   "Oceania"

## order by frequency
gapminder$continent %>% 
  fct_infreq() %>%
  levels()
#> [1] "Africa"   "Asia"     "Europe"   "Americas" "Oceania"

## backwards!
gapminder$continent %>% 
  fct_infreq() %>%
  fct_rev() %>% 
  levels()
#> [1] "Oceania"  "Americas" "Europe"   "Asia"     "Africa"

These two barcharts of frequency by continent differ only in the order of the continents. Which do you prefer?

Now we order country by another variable, forwards and backwards. This other variable is usually quantitative and you will order the factor accoding to a grouped summary. The factor is the grouping variable and the default summarizing function is median() but you can specify something else.

## order countries by median life expectancy
fct_reorder(gapminder$country, gapminder$lifeExp) %>% 
  levels() %>% head()
#> [1] "Sierra Leone"  "Guinea-Bissau" "Afghanistan"   "Angola"       
#> [5] "Somalia"       "Guinea"

## order accoring to minimum life exp instead of median
fct_reorder(gapminder$country, gapminder$lifeExp, min) %>% 
  levels() %>% head()
#> [1] "Rwanda"       "Afghanistan"  "Gambia"       "Angola"      
#> [5] "Sierra Leone" "Cambodia"

## backwards!
fct_reorder(gapminder$country, gapminder$lifeExp, .desc = TRUE) %>% 
  levels() %>% head()
#> [1] "Iceland"     "Japan"       "Sweden"      "Switzerland" "Netherlands"
#> [6] "Norway"

Example of why we reorder factor levels: often makes plots much better! When a factor is mapped to x or y, it should almost always be reordered by the quantitative variable you are mapping to the other one.

Compare the interpretability of these two plots of life expectancy in Asian countries in 2007. The only difference is the order of the country factor. Which one do you find easier to learn from?

gap_asia_2007 <- gapminder %>% filter(year == 2007, continent == "Asia")
ggplot(gap_asia_2007, aes(x = lifeExp, y = country)) + geom_point()
ggplot(gap_asia_2007, aes(x = lifeExp, y = fct_reorder(country, lifeExp))) +
  geom_point()

Use fct_reorder2() when you have a line chart of a quantitative x against another quantitative y and your factor provides the color. This way the legend appears in some order as the data! Contrast the legend on the left with the one on the right.

h_countries <- c("Egypt", "Haiti", "Romania", "Thailand", "Venezuela")
h_gap <- gapminder %>%
  filter(country %in% h_countries) %>% 
  droplevels()
ggplot(h_gap, aes(x = year, y = lifeExp, color = country)) +
  geom_line()
ggplot(h_gap, aes(x = year, y = lifeExp,
                  color = fct_reorder2(country, year, lifeExp))) +
  geom_line() +
  labs(color = "country")

Change order of the levels, “because I said so”

Sometimes you just want to hoist one or more levels to the front. Why? Because I said so. This resembles what we do when we move variables to the front with dplyr::select(special_var, everything()).

h_gap$country %>% levels()
#> [1] "Egypt"     "Haiti"     "Romania"   "Thailand"  "Venezuela"
h_gap$country %>% fct_relevel("Romania", "Haiti") %>% levels()
#> [1] "Romania"   "Haiti"     "Egypt"     "Thailand"  "Venezuela"

This might be useful if you are preparing a report for, say, the Romanian government. The reason for always putting Romania first has nothing to do with the data, it is important for external reasons and you need a way to express this.

Recode the levels

Sometimes you have better ideas about what certain levels should be. This is called recoding.

i_gap <- gapminder %>% 
  filter(country %in% c("United States", "Sweden", "Australia")) %>% 
  droplevels()
i_gap$country %>% levels()
#> [1] "Australia"     "Sweden"        "United States"
i_gap$country %>%
  fct_recode("USA" = "United States", "Oz" = "Australia") %>% levels()
#> [1] "Oz"     "Sweden" "USA"

Exercise: Isolate the data for "Australia", "Korea, Dem. Rep.", and "Korea, Rep." in the 2000x. Revalue the country factor levels to "Oz", "North Korea", and "South Korea".

Grow a factor

Let’s create two data frames, each with data from two countries, dropping unused factor levels.

df1 <- gapminder %>%
  filter(country %in% c("United States", "Mexico"), year > 2000) %>%
  droplevels()
df2 <- gapminder %>%
  filter(country %in% c("France", "Germany"), year > 2000) %>%
  droplevels()

The country factors in df1 and df2 have different levels.

levels(df1$country)
#> [1] "Mexico"        "United States"
levels(df2$country)
#> [1] "France"  "Germany"

Can you just catenate them?

c(df1$country, df2$country)
#> [1] 1 1 2 2 1 1 2 2

Umm, no. That is wrong on many levels! Use fct_c() to do this.

fct_c(df1$country, df2$country)
#> [1] Mexico        Mexico        United States United States France       
#> [6] France        Germany       Germany      
#> Levels: Mexico United States France Germany

Exercise: Explore how different forms of row binding work behave here, in terms of the country variable in the result.

bind_rows(df1, df2)
#> Warning in bind_rows_(x, .id): Unequal factor levels: coercing to character
#> Warning in bind_rows_(x, .id): binding character and factor vector,
#> coercing into character vector

#> Warning in bind_rows_(x, .id): binding character and factor vector,
#> coercing into character vector
#> Warning in bind_rows_(x, .id): Unequal factor levels: coercing to character
#> Warning in bind_rows_(x, .id): binding character and factor vector,
#> coercing into character vector

#> Warning in bind_rows_(x, .id): binding character and factor vector,
#> coercing into character vector
#> # A tibble: 8 x 6
#>         country continent  year lifeExp       pop gdpPercap
#>           <chr>     <chr> <int>   <dbl>     <int>     <dbl>
#> 1        Mexico  Americas  2002  74.902 102479927  10742.44
#> 2        Mexico  Americas  2007  76.195 108700891  11977.57
#> 3 United States  Americas  2002  77.310 287675526  39097.10
#> 4 United States  Americas  2007  78.242 301139947  42951.65
#> 5        France    Europe  2002  79.590  59925035  28926.03
#> 6        France    Europe  2007  80.657  61083916  30470.02
#> 7       Germany    Europe  2002  78.670  82350671  30035.80
#> 8       Germany    Europe  2007  79.406  82400996  32170.37
rbind(df1, df2)
#> # A tibble: 8 x 6
#>         country continent  year lifeExp       pop gdpPercap
#>          <fctr>    <fctr> <int>   <dbl>     <int>     <dbl>
#> 1        Mexico  Americas  2002  74.902 102479927  10742.44
#> 2        Mexico  Americas  2007  76.195 108700891  11977.57
#> 3 United States  Americas  2002  77.310 287675526  39097.10
#> 4 United States  Americas  2007  78.242 301139947  42951.65
#> 5        France    Europe  2002  79.590  59925035  28926.03
#> 6        France    Europe  2007  80.657  61083916  30470.02
#> 7       Germany    Europe  2002  78.670  82350671  30035.80
#> 8       Germany    Europe  2007  79.406  82400996  32170.37