1 Learning Objectives

1.0.0.1 Resources

Some useful resources for data frame joins: - Jenny’s Cheatsheet for dplyr join functions. - Possibly look at Jenny’s Tidy data using Lord of the Rings - “two-table verbs”’s vignette on For data reshaping: - The tidyr’s vignette

We are going to use a set of datasets that Joey put together for us. You can install them from github.

# if you have not installed devtools, time to do it (uncomment following line)
# install.packages("devtools")
# uncomment and run the following line to get the `singer` package
devtools::install_github("JoeyBernhardt/singer")

1.1 Join together

For the first part of the lecture, we are going to work a tad on some more joining operation.

The package singer comes with two smallish dataframe about songs and artists. Let’s take a look at them

library(singer)
data("songs")
songs
data("locations")
locations

1.1.1 Challenge 1

What flow would you use to get a dataframe with all the rows from songs where there is matching information the in locations dataframe?

1.1.2 Challenge 2

What flow would you use to get a dataframe with all the rows from songs where there is matching information the in locations dataframe, getting back a dataframe with title, artist_name and year?

1.1.3 Challenge 2

What command(s) would you use to get a dataframe with all the rows from songs where there is matching information the in locations dataframe, getting back a dataframe with title, artist_name and year?

1.1.4 Challenge 3

What flow would you use to get the number of releases (albums) present in this two small datasets per year?

1.2 Reshaping

Let’s consider the bigger datagrame artist_locations

data("singer_locations")
singer_locations

Let’s suppose that we would like to do some plotting, showing the mean of artist_hotttnesss, artist_familiarity and duration for year. Let’s select the appropriate part of the dataframe.

hfd_y <- singer_locations %>%
  select(year, artist_hotttnesss, artist_familiarity, duration)

You may have heard that ggplot prefers long dataframes, what does that mean? Let’s see with an example. The dataframe we have is wide: that means that the observation presents the values of interest on different columns.

hfd_y

To have in on a long format, we can use tidyr’s function gather(). Let’s see:

hfd_y_long <- hfd_y %>%
  gather(key = "measure", value = "Value", artist_hotttnesss:duration)

1.2.1 micro challenge

Observe the number of rows and the first entries: what do you think it happened?


We can go back to the wide format using tidyr’s function spread(). Yet, be braced for a problem:

hfd_y_long %>%
  spread(measure, Value)

What did go wrong here? As we had more than one song per year, when we made it long we lost some information: we don’t know how to reassemble the wide table. Let’s try again, but with additional wizardry: remember that if you want to gather and then spread, you need to have in the dataframe some column(s) uniquely identifying each row. In this case song_id does the magic.

hfd_y_unique <- singer_locations %>%
  select(song_id, year, artist_hotttnesss, artist_familiarity, duration)

Gather:

hfd_y_unique_long <- hfd_y_unique %>%
  gather(key = "measure", value = "Value", artist_hotttnesss:duration)

and then spread

hfd_y_unique_long %>%
  spread(measure, Value)
LS0tCnRpdGxlOiAiU1RBVCA1NDUgQ2xhc3MgTWVldGluZyAxMCIKb3V0cHV0OgogICAgaHRtbF9ub3RlYm9vazoKICAgICAgICB0b2M6IHRydWUKICAgICAgICB0aGVtZTogY2VydWxlYW4KICAgICAgICBudW1iZXJfc2VjdGlvbnM6IHRydWUKZWRpdG9yX29wdGlvbnM6IAogIGNodW5rX291dHB1dF90eXBlOiBpbmxpbmUKLS0tCgojIExlYXJuaW5nIE9iamVjdGl2ZXMKCgpgYGB7ciBzZXR1cCwgaW5jbHVkZT1GQUxTRX0Kc3VwcHJlc3NQYWNrYWdlU3RhcnR1cE1lc3NhZ2VzKGxpYnJhcnkodGlkeXZlcnNlKSkKa25pdHI6Om9wdHNfY2h1bmskc2V0KGZpZy53aWR0aD00LCBmaWcuaGVpZ2h0PTMpCmBgYAoKIyMjIyBSZXNvdXJjZXMKClNvbWUgdXNlZnVsIHJlc291cmNlcyBmb3IgZGF0YSBmcmFtZSBqb2luczoKLSBKZW5ueSdzIFtDaGVhdHNoZWV0XShiaXQwMDFfZHBseXItY2hlYXRzaGVldC5odG1sKSBmb3IgYGRwbHlyYCBqb2luIGZ1bmN0aW9ucy4KLSBQb3NzaWJseSBsb29rIGF0IEplbm55J3MgW1RpZHkgZGF0YSB1c2luZyBMb3JkIG9mIHRoZSBSaW5nc10oaHR0cHM6Ly9naXRodWIuY29tL2plbm55YmMvbG90ci10aWR5KQotICJ0d28tdGFibGUgdmVyYnMiJ3MgW3ZpZ25ldHRlXShodHRwczovL2NyYW4uci1wcm9qZWN0Lm9yZy93ZWIvcGFja2FnZXMvZHBseXIvdmlnbmV0dGVzL3R3by10YWJsZS5odG1sKSBvbgpGb3IgZGF0YSByZXNoYXBpbmc6Ci0gIFRoZSBgdGlkeXJgJ3MgW3ZpZ25ldHRlXShodHRwczovL2NyYW4uci1wcm9qZWN0Lm9yZy93ZWIvcGFja2FnZXMvdGlkeXIvdmlnbmV0dGVzL3RpZHktZGF0YS5odG1sKQoKCldlIGFyZSBnb2luZyB0byB1c2UgYSBzZXQgb2YgZGF0YXNldHMgdGhhdCBKb2V5IHB1dCB0b2dldGhlciBmb3IgdXMuIFlvdSBjYW4gaW5zdGFsbCB0aGVtIGZyb20gZ2l0aHViLgoKYGBge3J9CiMgaWYgeW91IGhhdmUgbm90IGluc3RhbGxlZCBkZXZ0b29scywgdGltZSB0byBkbyBpdCAodW5jb21tZW50IGZvbGxvd2luZyBsaW5lKQojIGluc3RhbGwucGFja2FnZXMoImRldnRvb2xzIikKIyB1bmNvbW1lbnQgYW5kIHJ1biB0aGUgZm9sbG93aW5nIGxpbmUgdG8gZ2V0IHRoZSBgc2luZ2VyYCBwYWNrYWdlCmRldnRvb2xzOjppbnN0YWxsX2dpdGh1YigiSm9leUJlcm5oYXJkdC9zaW5nZXIiKQpgYGAKCiMjIEpvaW4gdG9nZXRoZXIKCkZvciB0aGUgZmlyc3QgcGFydCBvZiB0aGUgbGVjdHVyZSwgd2UgYXJlIGdvaW5nIHRvIHdvcmsgYSB0YWQgb24gc29tZSBtb3JlIGpvaW5pbmcgb3BlcmF0aW9uLgoKVGhlIHBhY2thZ2UgYHNpbmdlcmAgY29tZXMgd2l0aCB0d28gc21hbGxpc2ggZGF0YWZyYW1lIGFib3V0IHNvbmdzIGFuZCBhcnRpc3RzLiBMZXQncyB0YWtlIGEgbG9vayBhdCB0aGVtCgpgYGB7cn0KbGlicmFyeShzaW5nZXIpCmRhdGEoInNvbmdzIikKc29uZ3MKYGBgCgpgYGB7cn0KZGF0YSgibG9jYXRpb25zIikKbG9jYXRpb25zCmBgYAoKIyMjIENoYWxsZW5nZSAxCgpXaGF0IGZsb3cgd291bGQgeW91IHVzZSB0byBnZXQgYSBkYXRhZnJhbWUgd2l0aCBhbGwgdGhlIHJvd3MgZnJvbSBgc29uZ3NgIHdoZXJlIHRoZXJlIGlzIG1hdGNoaW5nIGluZm9ybWF0aW9uIHRoZSBpbiBgbG9jYXRpb25zYCBkYXRhZnJhbWU/CgpgYGB7cn0KCmBgYAoKIyMjIENoYWxsZW5nZSAyCgpXaGF0IGZsb3cgd291bGQgeW91IHVzZSB0byBnZXQgYSBkYXRhZnJhbWUgd2l0aCBhbGwgdGhlIHJvd3MgZnJvbSBgc29uZ3NgIHdoZXJlIHRoZXJlIGlzIG1hdGNoaW5nIGluZm9ybWF0aW9uIHRoZSBpbiBgbG9jYXRpb25zYCBkYXRhZnJhbWUsIGdldHRpbmcgYmFjayBhIGRhdGFmcmFtZSB3aXRoIGB0aXRsZWAsIGBhcnRpc3RfbmFtZWAgYW5kIGB5ZWFyYD8KCmBgYHtyfQoKYGBgCgojIyMgQ2hhbGxlbmdlIDIKCldoYXQgY29tbWFuZChzKSB3b3VsZCB5b3UgdXNlIHRvIGdldCBhIGRhdGFmcmFtZSB3aXRoIGFsbCB0aGUgcm93cyBmcm9tIGBzb25nc2Agd2hlcmUgdGhlcmUgaXMgbWF0Y2hpbmcgaW5mb3JtYXRpb24gdGhlIGluIGBsb2NhdGlvbnNgIGRhdGFmcmFtZSwgZ2V0dGluZyBiYWNrIGEgZGF0YWZyYW1lIHdpdGggYHRpdGxlYCwgYGFydGlzdF9uYW1lYCBhbmQgYHllYXJgPwoKYGBge3J9CgpgYGAKCiMjIyBDaGFsbGVuZ2UgMwoKV2hhdCBmbG93IHdvdWxkIHlvdSB1c2UgdG8gZ2V0IHRoZSBudW1iZXIgb2YgcmVsZWFzZXMgKGFsYnVtcykgcHJlc2VudCBpbiB0aGlzIHR3byBzbWFsbCBkYXRhc2V0cyBwZXIgeWVhcj8KCmBgYHtyfQoKYGBgCgoKIyMgUmVzaGFwaW5nCgpMZXQncyBjb25zaWRlciB0aGUgYmlnZ2VyIGRhdGFncmFtZSBgYXJ0aXN0X2xvY2F0aW9uc2AKCmBgYHtyfQpkYXRhKCJzaW5nZXJfbG9jYXRpb25zIikKc2luZ2VyX2xvY2F0aW9ucwpgYGAKCgpMZXQncyBzdXBwb3NlIHRoYXQgd2Ugd291bGQgbGlrZSB0byBkbyBzb21lIHBsb3R0aW5nLCBzaG93aW5nIHRoZSBtZWFuIG9mIGBhcnRpc3RfaG90dHRuZXNzc2AsIGBhcnRpc3RfZmFtaWxpYXJpdHlgIGFuZCBgZHVyYXRpb25gIGZvciBgeWVhcmAuIExldCdzIHNlbGVjdCB0aGUgYXBwcm9wcmlhdGUgcGFydCBvZiB0aGUgZGF0YWZyYW1lLgoKYGBge3J9CmhmZF95IDwtIHNpbmdlcl9sb2NhdGlvbnMgJT4lCiAgc2VsZWN0KHllYXIsIGFydGlzdF9ob3R0dG5lc3NzLCBhcnRpc3RfZmFtaWxpYXJpdHksIGR1cmF0aW9uKQpgYGAKCgpZb3UgbWF5IGhhdmUgaGVhcmQgdGhhdCBnZ3Bsb3QgcHJlZmVycyBsb25nIGRhdGFmcmFtZXMsIHdoYXQgZG9lcyB0aGF0IG1lYW4/IExldCdzIHNlZSB3aXRoIGFuIGV4YW1wbGUuIFRoZSBkYXRhZnJhbWUgd2UgaGF2ZSBpcyAqKndpZGUqKjogdGhhdCBtZWFucyB0aGF0IHRoZSBvYnNlcnZhdGlvbiBwcmVzZW50cyB0aGUgdmFsdWVzIG9mIGludGVyZXN0IG9uIGRpZmZlcmVudCBjb2x1bW5zLgoKYGBge3J9CmhmZF95CmBgYAoKVG8gaGF2ZSBpbiBvbiBhIGxvbmcgZm9ybWF0LCB3ZSBjYW4gdXNlIGB0aWR5cmAncyBmdW5jdGlvbiBgZ2F0aGVyKClgLiBMZXQncyBzZWU6CgpgYGB7cn0KaGZkX3lfbG9uZyA8LSBoZmRfeSAlPiUKICBnYXRoZXIoa2V5ID0gIm1lYXN1cmUiLCB2YWx1ZSA9ICJWYWx1ZSIsIGFydGlzdF9ob3R0dG5lc3NzOmR1cmF0aW9uKQpgYGAKCiMjIyBtaWNybyBjaGFsbGVuZ2UKCk9ic2VydmUgdGhlIG51bWJlciBvZiByb3dzIGFuZCB0aGUgZmlyc3QgZW50cmllczogd2hhdCBkbyB5b3UgdGhpbmsgaXQgaGFwcGVuZWQ/CgoKLS0tCgpXZSBjYW4gZ28gYmFjayB0byB0aGUgd2lkZSBmb3JtYXQgdXNpbmcgYHRpZHlyYCdzIGZ1bmN0aW9uIGBzcHJlYWQoKWAuIFlldCwgYmUgYnJhY2VkIGZvciBhIHByb2JsZW06CgpgYGB7cn0KaGZkX3lfbG9uZyAlPiUKICBzcHJlYWQobWVhc3VyZSwgVmFsdWUpCmBgYAoKV2hhdCBkaWQgZ28gd3JvbmcgaGVyZT8gQXMgd2UgaGFkIG1vcmUgdGhhbiBvbmUgc29uZyBwZXIgeWVhciwgd2hlbiB3ZSBtYWRlIGl0IGxvbmcgd2UgbG9zdCBzb21lIGluZm9ybWF0aW9uOiB3ZSBkb24ndCBrbm93IGhvdyB0byByZWFzc2VtYmxlIHRoZSB3aWRlIHRhYmxlLiBMZXQncyB0cnkgYWdhaW4sIGJ1dCB3aXRoIGFkZGl0aW9uYWwgd2l6YXJkcnk6IHJlbWVtYmVyIHRoYXQgaWYgeW91IHdhbnQgdG8gZ2F0aGVyIGFuZCB0aGVuIHNwcmVhZCwgeW91IG5lZWQgdG8gaGF2ZSBpbiB0aGUgZGF0YWZyYW1lIHNvbWUgY29sdW1uKHMpIHVuaXF1ZWx5IGlkZW50aWZ5aW5nIGVhY2ggcm93LiBJbiB0aGlzIGNhc2UgYHNvbmdfaWRgIGRvZXMgdGhlIG1hZ2ljLgoKCmBgYHtyfQpoZmRfeV91bmlxdWUgPC0gc2luZ2VyX2xvY2F0aW9ucyAlPiUKICBzZWxlY3Qoc29uZ19pZCwgeWVhciwgYXJ0aXN0X2hvdHR0bmVzc3MsIGFydGlzdF9mYW1pbGlhcml0eSwgZHVyYXRpb24pCmBgYAoKR2F0aGVyOgoKYGBge3J9CmhmZF95X3VuaXF1ZV9sb25nIDwtIGhmZF95X3VuaXF1ZSAlPiUKICBnYXRoZXIoa2V5ID0gIm1lYXN1cmUiLCB2YWx1ZSA9ICJWYWx1ZSIsIGFydGlzdF9ob3R0dG5lc3NzOmR1cmF0aW9uKQpgYGAKCmFuZCB0aGVuIHNwcmVhZAoKYGBge3J9CmhmZF95X3VuaXF1ZV9sb25nICU+JQogIHNwcmVhZChtZWFzdXJlLCBWYWx1ZSkKYGBg