Join two tables
Join (a.k.a. merge) two tables: dplyr join cheatsheet with comic characters and publishers.
Why the cheatsheet
Examples for those of us who don’t speak SQL so good. There are lots of Venn diagrams re: SQL joins on the internet, but I wanted R examples. Those diagrams also utterly fail to show what’s really going on vis-a-vis rows AND columns.
Other great places to read about joins:
The data
Working with two small data frames: superheroes
and publishers
.
library(tidyverse) ## dplyr provides the join functions
superheroes <- tibble::tribble(
~name, ~alignment, ~gender, ~publisher,
"Magneto", "bad", "male", "Marvel",
"Storm", "good", "female", "Marvel",
"Mystique", "bad", "female", "Marvel",
"Batman", "good", "male", "DC",
"Joker", "bad", "male", "DC",
"Catwoman", "bad", "female", "DC",
"Hellboy", "good", "male", "Dark Horse Comics"
)
publishers <- tibble::tribble(
~publisher, ~yr_founded,
"DC", 1934L,
"Marvel", 1939L,
"Image", 1992L
)
Sorry, cheat sheet does not illustrate “multiple match” situations terribly well.
Sub-plot: watch the row and variable order of the join results for a healthy reminder of why it’s dangerous to rely on any of that in an analysis.
inner_join(superheroes, publishers)
inner_join(x, y)
: Return all rows from x
where there are matching values in y
, and all columns from x
and y
. If there are multiple matches between x
and y
, all combination of the matches are returned. This is a mutating join.
We lose Hellboy in the join because, although he appears in x = superheroes
, his publisher Dark Horse Comics does not appear in y = publishers
. The join result has all variables from x = superheroes
plus yr_founded
, from y
.
superheroes |
name |
alignment |
gender |
publisher |
Magneto |
bad |
male |
Marvel |
Storm |
good |
female |
Marvel |
Mystique |
bad |
female |
Marvel |
Batman |
good |
male |
DC |
Joker |
bad |
male |
DC |
Catwoman |
bad |
female |
DC |
Hellboy |
good |
male |
Dark Horse Comics |
|
publishers |
publisher |
yr_founded |
DC |
1934 |
Marvel |
1939 |
Image |
1992 |
|
inner_join(x = superheroes, y = publishers) |
name |
alignment |
gender |
publisher |
yr_founded |
Magneto |
bad |
male |
Marvel |
1939 |
Storm |
good |
female |
Marvel |
1939 |
Mystique |
bad |
female |
Marvel |
1939 |
Batman |
good |
male |
DC |
1934 |
Joker |
bad |
male |
DC |
1934 |
Catwoman |
bad |
female |
DC |
1934 |
|
semi_join(superheroes, publishers)
semi_join(x, y)
: Return all rows from x
where there are matching values in y
, keeping just columns from x
. A semi join differs from an inner join because an inner join will return one row of x
for each matching row of y
, where a semi join will never duplicate rows of x
. This is a filtering join.
We get a similar result as with inner_join()
but the join result contains only the variables originally found in x = superheroes
.
superheroes |
name |
alignment |
gender |
publisher |
Magneto |
bad |
male |
Marvel |
Storm |
good |
female |
Marvel |
Mystique |
bad |
female |
Marvel |
Batman |
good |
male |
DC |
Joker |
bad |
male |
DC |
Catwoman |
bad |
female |
DC |
Hellboy |
good |
male |
Dark Horse Comics |
|
publishers |
publisher |
yr_founded |
DC |
1934 |
Marvel |
1939 |
Image |
1992 |
|
semi_join(x = superheroes, y = publishers) |
name |
alignment |
gender |
publisher |
Magneto |
bad |
male |
Marvel |
Storm |
good |
female |
Marvel |
Mystique |
bad |
female |
Marvel |
Batman |
good |
male |
DC |
Joker |
bad |
male |
DC |
Catwoman |
bad |
female |
DC |
|
left_join(superheroes, publishers)
left_join(x, y)
: Return all rows from x
, and all columns from x
and y
. If there are multiple matches between x
and y
, all combination of the matches are returned. This is a mutating join.
We basically get x = superheroes
back, but with the addition of variable yr_founded
, which is unique to y = publishers
. Hellboy, whose publisher does not appear in y = publishers
, has an NA
for yr_founded
.
superheroes |
name |
alignment |
gender |
publisher |
Magneto |
bad |
male |
Marvel |
Storm |
good |
female |
Marvel |
Mystique |
bad |
female |
Marvel |
Batman |
good |
male |
DC |
Joker |
bad |
male |
DC |
Catwoman |
bad |
female |
DC |
Hellboy |
good |
male |
Dark Horse Comics |
|
publishers |
publisher |
yr_founded |
DC |
1934 |
Marvel |
1939 |
Image |
1992 |
|
left_join(x = superheroes, y = publishers) |
name |
alignment |
gender |
publisher |
yr_founded |
Magneto |
bad |
male |
Marvel |
1939 |
Storm |
good |
female |
Marvel |
1939 |
Mystique |
bad |
female |
Marvel |
1939 |
Batman |
good |
male |
DC |
1934 |
Joker |
bad |
male |
DC |
1934 |
Catwoman |
bad |
female |
DC |
1934 |
Hellboy |
good |
male |
Dark Horse Comics |
NA |
|
anti_join(superheroes, publishers)
anti_join(x, y)
: Return all rows from x
where there are not matching values in y
, keeping just columns from x
. This is a filtering join.
We keep only Hellboy now (and do not get yr_founded
).
superheroes |
name |
alignment |
gender |
publisher |
Magneto |
bad |
male |
Marvel |
Storm |
good |
female |
Marvel |
Mystique |
bad |
female |
Marvel |
Batman |
good |
male |
DC |
Joker |
bad |
male |
DC |
Catwoman |
bad |
female |
DC |
Hellboy |
good |
male |
Dark Horse Comics |
|
publishers |
publisher |
yr_founded |
DC |
1934 |
Marvel |
1939 |
Image |
1992 |
|
anti_join(x = superheroes, y = publishers) |
name |
alignment |
gender |
publisher |
Hellboy |
good |
male |
Dark Horse Comics |
|
inner_join(publishers, superheroes)
inner_join(x, y)
: Return all rows from x
where there are matching values in y
, and all columns from x
and y
. If there are multiple matches between x
and y
, all combination of the matches are returned. This is a mutating join.
In a way, this does illustrate multiple matches, if you think about it from the x = publishers
direction. Every publisher that has a match in y = superheroes
appears multiple times in the result, once for each match. In fact, we’re getting the same result as with inner_join(superheroes, publishers)
, up to variable order (which you should also never rely on in an analysis).
publishers |
publisher |
yr_founded |
DC |
1934 |
Marvel |
1939 |
Image |
1992 |
|
superheroes |
name |
alignment |
gender |
publisher |
Magneto |
bad |
male |
Marvel |
Storm |
good |
female |
Marvel |
Mystique |
bad |
female |
Marvel |
Batman |
good |
male |
DC |
Joker |
bad |
male |
DC |
Catwoman |
bad |
female |
DC |
Hellboy |
good |
male |
Dark Horse Comics |
|
inner_join(x = publishers, y = superheroes) |
publisher |
yr_founded |
name |
alignment |
gender |
DC |
1934 |
Batman |
good |
male |
DC |
1934 |
Joker |
bad |
male |
DC |
1934 |
Catwoman |
bad |
female |
Marvel |
1939 |
Magneto |
bad |
male |
Marvel |
1939 |
Storm |
good |
female |
Marvel |
1939 |
Mystique |
bad |
female |
|
semi_join(publishers, superheroes)
semi_join(x, y)
: Return all rows from x
where there are matching values in y
, keeping just columns from x
. A semi join differs from an inner join because an inner join will return one row of x
for each matching row of y
, where a semi join will never duplicate rows of x
. This is a filtering join.
Now the effects of switching the x
and y
roles is more clear. The result resembles x = publishers
, but the publisher Image is lost, because there are no observations where publisher == "Image"
in y = superheroes
.
publishers |
publisher |
yr_founded |
DC |
1934 |
Marvel |
1939 |
Image |
1992 |
|
superheroes |
name |
alignment |
gender |
publisher |
Magneto |
bad |
male |
Marvel |
Storm |
good |
female |
Marvel |
Mystique |
bad |
female |
Marvel |
Batman |
good |
male |
DC |
Joker |
bad |
male |
DC |
Catwoman |
bad |
female |
DC |
Hellboy |
good |
male |
Dark Horse Comics |
|
semi_join(x = publishers, y = superheroes) |
publisher |
yr_founded |
DC |
1934 |
Marvel |
1939 |
|
left_join(publishers, superheroes)
left_join(x, y)
: Return all rows from x
, and all columns from x
and y
. If there are multiple matches between x
and y
, all combination of the matches are returned. This is a mutating join.
We get a similar result as with inner_join()
but the publisher Image survives in the join, even though no superheroes from Image appear in y = superheroes
. As a result, Image has NA
s for name
, alignment
, and gender
.
publishers |
publisher |
yr_founded |
DC |
1934 |
Marvel |
1939 |
Image |
1992 |
|
superheroes |
name |
alignment |
gender |
publisher |
Magneto |
bad |
male |
Marvel |
Storm |
good |
female |
Marvel |
Mystique |
bad |
female |
Marvel |
Batman |
good |
male |
DC |
Joker |
bad |
male |
DC |
Catwoman |
bad |
female |
DC |
Hellboy |
good |
male |
Dark Horse Comics |
|
left_join(x = publishers, y = superheroes) |
publisher |
yr_founded |
name |
alignment |
gender |
DC |
1934 |
Batman |
good |
male |
DC |
1934 |
Joker |
bad |
male |
DC |
1934 |
Catwoman |
bad |
female |
Marvel |
1939 |
Magneto |
bad |
male |
Marvel |
1939 |
Storm |
good |
female |
Marvel |
1939 |
Mystique |
bad |
female |
Image |
1992 |
NA |
NA |
NA |
|
anti_join(publishers, superheroes)
anti_join(x, y)
: Return all rows from x
where there are not matching values in y
, keeping just columns from x
. This is a filtering join.
We keep only publisher Image now (and the variables found in x = publishers
).
publishers |
publisher |
yr_founded |
DC |
1934 |
Marvel |
1939 |
Image |
1992 |
|
superheroes |
name |
alignment |
gender |
publisher |
Magneto |
bad |
male |
Marvel |
Storm |
good |
female |
Marvel |
Mystique |
bad |
female |
Marvel |
Batman |
good |
male |
DC |
Joker |
bad |
male |
DC |
Catwoman |
bad |
female |
DC |
Hellboy |
good |
male |
Dark Horse Comics |
|
anti_join(x = publishers, y = superheroes) |
publisher |
yr_founded |
Image |
1992 |
|
full_join(superheroes, publishers)
full_join(x, y)
: Return all rows and all columns from both x
and y
. Where there are not matching values, returns NA
for the one missing. This is a mutating join.
We get all rows of x = superheroes
plus a new row from y = publishers
, containing the publisher Image. We get all variables from x = superheroes
AND all variables from y = publishers
. Any row that derives solely from one table or the other carries NA
s in the variables found only in the other table.
superheroes |
name |
alignment |
gender |
publisher |
Magneto |
bad |
male |
Marvel |
Storm |
good |
female |
Marvel |
Mystique |
bad |
female |
Marvel |
Batman |
good |
male |
DC |
Joker |
bad |
male |
DC |
Catwoman |
bad |
female |
DC |
Hellboy |
good |
male |
Dark Horse Comics |
|
publishers |
publisher |
yr_founded |
DC |
1934 |
Marvel |
1939 |
Image |
1992 |
|
full_join(x = superheroes, y = publishers) |
name |
alignment |
gender |
publisher |
yr_founded |
Magneto |
bad |
male |
Marvel |
1939 |
Storm |
good |
female |
Marvel |
1939 |
Mystique |
bad |
female |
Marvel |
1939 |
Batman |
good |
male |
DC |
1934 |
Joker |
bad |
male |
DC |
1934 |
Catwoman |
bad |
female |
DC |
1934 |
Hellboy |
good |
male |
Dark Horse Comics |
NA |
NA |
NA |
NA |
Image |
1992 |
|