Chapter 11 Character vectors

11.1 Character vectors: where they fit in

We’ve spent a lot of time working with big, beautiful data frames that are clean and wholesome, like the Gapminder data.

But real life will be much nastier. You will bring data into R from the outside world and discover there are problems. You might think: how hard can it be to deal with character data? And the answer is: it can be very hard!

Here we discuss common remedial tasks for cleaning and transforming character data, also known as “strings”. A data frame or tibble will consist of one or more atomic vectors of a certain class. This lesson deals with things you can do with vectors of class character.

11.2 Resources

I start with this because we cannot possibly do this topic justice in a short amount of time. Our goal is to make you aware of broad classes of problems and their respective solutions. Once you have a character problem in real life, these resources will be extremely helpful as you delve deeper.

11.2.1 Manipulating character vectors

  • stringr package.
    • A core package in the tidyverse. It is installed via install.packages("tidyverse") and also loaded via library(tidyverse). Of course, you can also install or load it individually.
    • Main functions start with str_. Auto-complete is your friend.
    • Replacements for base functions re: string manipulation and regular expressions (see below).
    • Main advantages over base functions: greater consistency about inputs and outputs. Outputs are more ready for your next analytical task.
  • tidyr package.
    • Especially useful for functions that split one character vector into many and vice versa: separate(), unite(), extract().
  • Base functions: nchar(), strsplit(), substr(), paste(), paste0().
  • The glue package is fantastic for string interpolation. If stringr::str_interp() doesn’t get your job done, check out the glue package.

11.2.2 Regular expressions resources

Regular expressions are a God-awful and powerful language for expressing patterns to match in text or for search-and-replace. They are frequently described as “write only”, because regular expressions are easier to write than to read/understand. And they are not particularly easy to write.

11.2.4 Character vectors that live in a data frame

  • Certain operations are facilitated by tidyr. These are described below.
  • For a general discussion of how to work on variables that live in a data frame, see Vectors versus tibbles (Appendix A).

11.4 Regex-free string manipulation with stringr and tidyr

Basic string manipulation tasks:

  • Study a single character vector
    • How long are the strings?
    • Presence/absence of a literal string
  • Operate on a single character vector
    • Keep/discard elements that contain a literal string
    • Split into two or more character vectors using a fixed delimiter
    • Snip out pieces of the strings based on character position
    • Collapse into a single string
  • Operate on two or more character vectors
    • Glue them together element-wise to get a new character vector.

fruit, words, and sentences are character vectors that ship with stringr for practicing.
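To make the first two tasks in the list concrete, here is a minimal sketch of detecting and filtering on a literal string. It assumes stringr is installed; `some_fruit` is a small made-up vector standing in for the full `fruit` vector.

```r
library(stringr)

# Hypothetical stand-in for the `fruit` vector that ships with stringr
some_fruit <- c("grapefruit", "ugli fruit", "banana", "date", "kiwi fruit")

# Presence/absence of a literal string: one TRUE/FALSE per element
has_fruit <- str_detect(some_fruit, fixed("fruit"))

# Keep only the elements that contain the literal string
fruit_only <- str_subset(some_fruit, fixed("fruit"))
```

Wrapping the pattern in `fixed()` says "treat this as a literal string, not a regex", which is all we need for now.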

11.4.2 String splitting by delimiter

Use stringr::str_split() to split strings on a delimiter. Some of our fruits are compound words, like “grapefruit”, but some have two words, like “ugli fruit”. Here we split on a single space " ", but we show the use of a regular expression later.

It’s a bummer that we get a list back. But it must be so! In full generality, split strings must return a list, because who knows how many pieces there will be?

If you are willing to commit to the number of pieces, you can use str_split_fixed() and get a character matrix. You’re welcome!
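A minimal sketch of both splitting functions, assuming stringr is installed. The vector `some_fruit` is made up for illustration.

```r
library(stringr)

# Hypothetical example vector: one compound word, two two-word fruits
some_fruit <- c("grapefruit", "ugli fruit", "kiwi fruit")

# str_split() returns a list with one character vector per input string
pieces <- str_split(some_fruit, fixed(" "))

# str_split_fixed() commits to n pieces and returns a character matrix;
# strings with fewer pieces are padded with ""
pieces_mat <- str_split_fixed(some_fruit, fixed(" "), n = 2)
```

Note that "grapefruit" has no space, so its list element has length 1 and its second matrix column is an empty string.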

If the to-be-split variable lives in a data frame, tidyr::separate() will split it into 2 or more variables.
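Here is a rough sketch of the data frame version, assuming tidyr is installed. The data frame `fruit_df` and the new column names `pre` and `post` are made up for illustration.

```r
library(tidyr)

# Hypothetical data frame with one character variable to split
fruit_df <- data.frame(
  fruit = c("ugli fruit", "kiwi fruit", "star fruit"),
  stringsAsFactors = FALSE
)

# Split `fruit` on a single space into two new variables
split_df <- separate(fruit_df, fruit, into = c("pre", "post"), sep = " ")
```

By default the original `fruit` column is dropped and replaced by the two new ones.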

11.4.3 Substring extraction (and replacement) by position

Count characters in your strings with str_length(). Note this is different from the length of the character vector itself.
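A quick sketch of the difference, assuming stringr is installed; `some_fruit` is a made-up vector.

```r
library(stringr)

some_fruit <- c("fig", "banana", "ugli fruit")

# Number of characters in each string (the space counts as a character)
n_chars <- str_length(some_fruit)

# Number of elements in the vector
n_elems <- length(some_fruit)
```

Here `n_chars` has one value per string, while `n_elems` is a single number.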

You can snip out substrings based on character position with str_sub().

The start and end arguments are vectorised. Example: a sliding 3-character window.
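A minimal sketch of both uses, assuming stringr is installed. The example strings are made up.

```r
library(stringr)

some_fruit <- c("apple", "banana", "cherry")

# Same positions for every string: the first three characters
first3 <- str_sub(some_fruit, 1, 3)

# Vectorised start and end: a sliding 3-character window over one string
window <- str_sub("banana", start = 1:4, end = 3:6)
```

In the second call the single string is recycled against the four start/end pairs.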

Finally, str_sub() also works for assignment, i.e. on the left hand side of <-.
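A small sketch of the assignment form, assuming stringr is installed; the vector `x` is made up.

```r
library(stringr)

x <- c("apple", "banana", "cherry")

# Overwrite characters 1 through 3 of every string, in place
str_sub(x, 1, 3) <- "AAA"
```

After the assignment, `x` itself has been modified.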

11.4.4 Collapse a vector

You can collapse a character vector of length n > 1 to a single string with str_c(), which also has other uses (see the next section).
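A minimal sketch, assuming stringr is installed; the vector `x` is made up. The `collapse` argument is what triggers the collapsing.

```r
library(stringr)

x <- c("apple", "banana", "cherry")

# Collapse a length-3 vector into a single string
one_string <- str_c(x, collapse = ", ")
```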

11.4.5 Create a character vector by catenating multiple vectors

If you have two or more character vectors of the same length, you can glue them together element-wise, to get a new vector of that length. Here are some … awful smoothie flavors?

Element-wise catenation can be combined with collapsing.
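A sketch of element-wise catenation, with and without collapsing, assuming stringr is installed. The vectors `a` and `b` are made-up stand-ins.

```r
library(stringr)

a <- c("apple", "banana")
b <- c("kiwi", "mango")

# Element-wise: the result has the same length as the inputs
smoothies <- str_c(a, b, sep = " & ")

# Element-wise, then collapsed into a single string
menu <- str_c(a, b, sep = " & ", collapse = ", ")
```

`sep` goes between the pieces of each element; `collapse` goes between elements.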

If the to-be-combined vectors are variables in a data frame, you can use tidyr::unite() to make a single new variable from them.
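A rough sketch of the data frame version, assuming tidyr is installed. The data frame `fruit_df` and the column name `flavor_combo` are made up for illustration.

```r
library(tidyr)

# Hypothetical data frame with two character variables
fruit_df <- data.frame(
  fruit1 = c("apple", "banana"),
  fruit2 = c("kiwi", "mango"),
  stringsAsFactors = FALSE
)

# Paste fruit1 and fruit2 together into one new variable
united <- unite(fruit_df, "flavor_combo", fruit1, fruit2, sep = " & ")
```

By default the input columns are removed, leaving only the new one.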

11.4.6 Substring replacement

You can replace a pattern with str_replace(). Here we use an explicit string-to-replace, but later we revisit with a regular expression.
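A minimal sketch with a literal string-to-replace, assuming stringr is installed. The vector `x` and the replacement text are made up.

```r
library(stringr)

x <- c("grapefruit", "ugli fruit", "banana")

# Replace the first occurrence of the literal "fruit" in each string
replaced <- str_replace(x, fixed("fruit"), "THINGY")
```

Strings without a match, like "banana", come back unchanged.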

A special case that comes up a lot is replacing NA, for which there is str_replace_na().
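A small sketch, assuming stringr is installed; the vector `x` and the replacement value are made up.

```r
library(stringr)

x <- c("apple", NA, "cherry")

# Turn missing values into an ordinary string
filled <- str_replace_na(x, replacement = "UNKNOWN")
```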

If the NA-afflicted variable lives in a data frame, you can use tidyr::replace_na().
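A rough sketch of the data frame version, assuming tidyr is installed. The data frame `fruit_df` is made up for illustration.

```r
library(tidyr)

# Hypothetical data frame with a missing value in a character variable
fruit_df <- data.frame(
  fruit = c("apple", NA, "cherry"),
  stringsAsFactors = FALSE
)

# Supply a replacement per column, as a named list
filled_df <- replace_na(fruit_df, replace = list(fruit = "UNKNOWN"))
```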

And that concludes our treatment of regex-free manipulations of character data!

11.5 Regular expressions with stringr

Figure 11.1: From @ThePracticalDev (https://twitter.com/ThePracticalDev/status/774309983467016193)

11.5.1 Load gapminder

The country names in the gapminder dataset are convenient for examples. Load it now and store the 142 unique country names to the object countries.

11.5.2 Characters with special meaning

Frequently your string tasks cannot be expressed in terms of a fixed string, but can be described in terms of a pattern. Regular expressions, aka “regexes”, are the standard way to specify these patterns. In regexes, specific characters and constructs take on special meaning in order to match multiple strings.

The first metacharacter is the period ., which stands for any single character, except a newline (which, by the way, is represented by \n). The regex a.b will match all countries that have an a, followed by any single character, followed by b. Yes, regexes are case sensitive, i.e. “Italy” does not match.

Notice that i.a matches “ina”, “ica”, “ita”, and more.
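A minimal sketch of the period metacharacter, assuming stringr is installed. To keep it self-contained, `some_countries` is a small hand-picked vector of country names, not the full set of 142.

```r
library(stringr)

# Hypothetical stand-in for the full `countries` vector
some_countries <- c("Argentina", "Costa Rica", "Italy", "Mauritania", "Chile")

# i, then any single character, then a
matches_ia <- str_subset(some_countries, pattern = "i.a")
```

"Italy" is left out because its "I" is uppercase and the pattern asks for a lowercase i.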

Anchors can be included to express where the expression must occur within the string. The ^ indicates the beginning of string and $ indicates the end.

Note how the regex i.a$ matches many fewer countries than i.a alone. Likewise, more elements of my_fruit match d than ^d, which requires “d” at string start.
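A sketch of both anchors, assuming stringr is installed. The vectors `some_countries` and `some_fruit` are small made-up stand-ins for the vectors used in the lesson.

```r
library(stringr)

some_countries <- c("Argentina", "Jamaica", "Mauritania", "Nicaragua", "Italy")

# i.a anywhere in the string versus i.a at the very end
anywhere <- str_subset(some_countries, pattern = "i.a")
at_end   <- str_subset(some_countries, pattern = "i.a$")

some_fruit <- c("date", "dragonfruit", "avocado", "banana")

# d anywhere in the string versus d at the very start
any_d   <- str_subset(some_fruit, pattern = "d")
start_d <- str_subset(some_fruit, pattern = "^d")
```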

The metacharacter \b indicates a word boundary and \B indicates NOT a word boundary. This is our first encounter with something called “escaping” and right now I just want you to accept that we need to prepend a second backslash to use these sequences in regexes in R. We’ll come back to this tedious point later.
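A small sketch of word boundaries, assuming stringr is installed; the vector `melons` is made up. Note the doubled backslashes in the R strings.

```r
library(stringr)

melons <- c("melon", "watermelon", "rock melon", "canary melon")

# "melon" anywhere
any_melon <- str_subset(melons, pattern = "melon")

# "melon" at the start of a word
word_start <- str_subset(melons, pattern = "\\bmelon")

# "melon" NOT at the start of a word
word_inside <- str_subset(melons, pattern = "\\Bmelon")
```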

11.5.3 Character classes

Characters can be specified via classes. You can make them explicitly “by hand” or use some pre-existing ones. The 2014 STAT 545 regex lesson (Appendix A) has a good list of character classes. Character classes are usually given inside square brackets [], but a few come up so often that we have a metacharacter for them, such as \d for a single digit.

Here we match ia at the end of the country name, preceded by one of the characters in the class. Or, in the negated class, preceded by anything but one of those characters.
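A sketch of a hand-made class and its negation, assuming stringr is installed. The class `[nls]` and the vector `some_countries` are chosen here for illustration.

```r
library(stringr)

some_countries <- c("Albania", "Australia", "Tunisia", "Colombia", "India", "Italy")

# Ends in ia, preceded by n, l, or s
in_class <- str_subset(some_countries, pattern = "[nls]ia$")

# Ends in ia, preceded by anything BUT n, l, or s
not_in_class <- str_subset(some_countries, pattern = "[^nls]ia$")
```

The caret means "start of string" outside square brackets but "not" when it is the first character inside them.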

Here we revisit splitting my_fruit with two more general ways to match whitespace: the \s metacharacter and the POSIX class [:space:]. Notice that we must prepend an extra backslash \ to escape \s and the POSIX class has to be surrounded by two sets of square brackets.
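A minimal sketch of both whitespace patterns, assuming stringr is installed. The vector `some_fruit` is a made-up stand-in for `my_fruit`.

```r
library(stringr)

some_fruit <- c("ugli fruit", "kiwi fruit")

# The \s metacharacter, written with a doubled backslash in R
by_s <- str_split_fixed(some_fruit, pattern = "\\s", n = 2)

# The POSIX class, wrapped in a second set of square brackets
by_posix <- str_split_fixed(some_fruit, pattern = "[[:space:]]", n = 2)
```

Both calls give the same character matrix here, since the only whitespace is a single space.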

Let’s see the country names that contain punctuation.
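A sketch using the POSIX punctuation class, assuming stringr is installed. The vector `some_countries` is a small hand-picked stand-in for the full set of country names.

```r
library(stringr)

some_countries <- c("Congo, Dem. Rep.", "Guinea-Bissau", "Cote d'Ivoire", "Canada")

# Keep names that contain at least one punctuation character
with_punct <- str_subset(some_countries, pattern = "[[:punct:]]")
```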

11.5.4 Quantifiers

You can decorate characters (and other constructs, like metacharacters and classes) with information about how many characters they are allowed to match.

quantifier  meaning      quantifier  meaning
*           0 or more    {n}         exactly n
+           1 or more    {n,}        at least n
?           0 or 1       {,m}        at most m
                         {n,m}       between n and m, inclusive

Explore these by inspecting matches for l followed by e, allowing for various numbers of characters in between.

l.*e will match strings with 0 or more characters in between, i.e. any string with an l eventually followed by an e. This is the most inclusive regex for this example, so we store the result as matches to use as a baseline for comparison.

Change the quantifier from * to + to require at least one intervening character. The strings that no longer match all have a literal le, with no preceding l and no following e.

Change the quantifier from * to ? to require at most one intervening character. In the strings that no longer match, the shortest gap between l and following e is at least two characters.

Finally, we remove the quantifier and allow for no intervening characters. The strings that no longer match lack a literal le.
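The four variations above can be sketched on a tiny vector, assuming stringr is installed. The vector `x` is made up, so the specific strings differ from those in the lesson.

```r
library(stringr)

x <- c("apple", "lemon", "olive", "blueberry", "banana")

# l, then 0 or more characters, then e: the baseline
matches <- str_subset(x, pattern = "l.*e")

# At least one character in between
at_least_one <- str_subset(x, pattern = "l.+e")

# At most one character in between
at_most_one <- str_subset(x, pattern = "l.?e")

# No characters in between: a literal le
none_between <- str_subset(x, pattern = "le")

# Which baseline matches were lost by switching * to +?
lost_with_plus <- setdiff(matches, at_least_one)
```

Comparing each result against `matches` with `setdiff()` is a handy way to see exactly what a stricter quantifier excludes.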

11.5.5 Escaping

You’ve probably caught on by now that there are certain characters with special meaning in regexes, including $ * + . ? [ ] ^ { } | ( ) \.

What if you really need the plus sign to be a literal plus sign and not a regex quantifier? You will need to escape it by prepending a backslash. But wait … there’s more! Before a regex is interpreted as a regular expression, it is also interpreted by R as a string. And backslash is used to escape there as well. So, in the end, you need to prepend two backslashes in order to match a literal plus sign in a regex.

This will be more clear with examples!

11.5.5.1 Escapes in plain old strings

Here is routine, non-regex use of backslash \ escapes in plain vanilla R strings. We intentionally use cat() instead of print() here.
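A minimal base R sketch of string escapes; the example strings are made up. `cat()` shows the string as it really is, while `print()` would show the escaped form.

```r
# An escaped double quote inside a double-quoted string
quoted <- "Do you use \"airquotes\" much?"
cat(quoted, "\n")

# \n is a single newline character, \t is a single tab character
two_lines <- "line one\nline two"
cat(two_lines, "\n")
cat("before the tab\tafter the tab\n")

# To get one literal backslash, escape the backslash itself
one_backslash <- "\\"
cat(one_backslash, "\n")
```

`nchar()` confirms that each escape sequence counts as one character.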

11.5.5.2 Escapes in regular expressions

Examples of using escapes in regexes to match characters that would otherwise have a special interpretation.

We know several gapminder country names contain a period. How do we isolate them? Although it’s tempting, this command str_subset(countries, pattern = ".") won’t work!
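A sketch of the problem and the fix, assuming stringr is installed. The vector `some_countries` is a small hand-picked stand-in for `countries`.

```r
library(stringr)

some_countries <- c("Congo, Dem. Rep.", "Korea, Dem. Rep.", "Canada", "Chile")

# An unescaped period matches any character, so everything is returned
everything <- str_subset(some_countries, pattern = ".")

# Escape the period (two backslashes in the R string) to match it literally
with_period <- str_subset(some_countries, pattern = "\\.")
```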

A last example that matches an actual square bracket.
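A small sketch, assuming stringr is installed; the vector `x` is made up for illustration.

```r
library(stringr)

x <- c("whatever", "X is distributed U[0,1]")

# Escape the opening bracket so it is not read as the start of a class
with_bracket <- str_subset(x, pattern = "\\[")
```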

11.5.6 Groups and backreferences

Your first use of regex is likely to be simple matching: detecting or isolating strings that match a pattern.

But soon you will want to use regexes to transform the strings in character vectors. That means you need a way to address specific parts of the matching strings and to operate on them.

You can use parentheses inside regexes to define groups and you can refer to those groups later with backreferences.
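As a small taste, here is a sketch of groups and backreferences, assuming stringr is installed. The vectors and patterns are made up for illustration and are not from the original lesson.

```r
library(stringr)

some_fruit <- c("ugli fruit", "kiwi fruit")

# Two groups of word characters separated by a space; swap them by
# referring to the groups as \1 and \2 in the replacement
swapped <- str_replace(some_fruit, pattern = "(\\w+) (\\w+)", replacement = "\\2 \\1")

# A backreference inside the pattern itself: any character repeated twice
doubled <- str_subset(c("apple", "banana", "cherry"), pattern = "(.)\\1")
```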

For now, this lesson will refer you to other places to read up on this: