Consult the general homework guidelines.
Due sometime Tuesday 2016-09-27. Before class is better, but get help in class or office hours if you need it.
The goal is to explore a new-to-you dataset and, in particular, to begin to establish a workflow for data frames or “tibbles”. You will use dplyr and ggplot2 to do some description and visualization.
Remember the sampler concept. Your homework should serve as your own personal cheatsheet in the future for things to do with a new dataset. Give yourself the cheatsheet you deserve!
Work with the gapminder data we explored in class. If you really want to, you can explore a different dataset, but get permission from Jenny. Self-assess the suitability of your dataset by reading this issue.
The Gapminder data is distributed as an R package from CRAN.
Install it if you have not done so already and remember to load it.
Install and load dplyr. Probably via the tidyverse meta-package.
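A minimal sketch of the setup steps above, assuming the packages are available from CRAN under the names `gapminder` and `dplyr`. The inspection calls at the end are only a suggestion for a first look at the object; the choice of functions is an assumption, not something the assignment prescribes.

```r
# Run once if the packages are not installed yet (left commented out on purpose):
# install.packages(c("gapminder", "tidyverse"))

library(gapminder)  # provides the gapminder data frame
library(dplyr)      # or library(tidyverse), which loads dplyr and ggplot2 together

# First look at the object: what is it, and how big is it?
gap_class <- class(gapminder)   # class of the object itself
gap_dim   <- dim(gapminder)     # number of rows, number of variables
gap_types <- sapply(gapminder, function(x) class(x)[1])  # class of each variable

print(gap_class)
print(gap_dim)
print(gap_types)
glimpse(gapminder)              # compact overview, one line per variable
```

Recording what these calls report, in your own words, is what turns the output into a cheatsheet.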
Pick at least one categorical variable and at least one quantitative variable to explore.
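As a hedged illustration of characterizing one categorical and one quantitative variable, the sketch below uses `continent` and `lifeExp`. Those two variables are chosen here only as an example; any other pair from the data would do, and the particular summaries are an assumption about what is worth reporting.

```r
library(gapminder)
library(dplyr)

# Categorical variable: which values occur, and how often?
continent_counts <- gapminder %>%
  count(continent)

# Quantitative variable: range, typical value, and spread
life_summary <- gapminder %>%
  summarise(
    min    = min(lifeExp),
    median = median(lifeExp),
    max    = max(lifeExp),
    sd     = sd(lifeExp)
  )

# Both together: the quantitative variable within each level of the categorical one
life_by_continent <- gapminder %>%
  group_by(continent) %>%
  summarise(
    n_obs      = n(),
    median_life = median(lifeExp)
  )

print(continent_counts)
print(life_summary)
print(life_by_continent)
```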
See the ggplot2 tutorial, which also uses the gapminder data, for ideas.
Make a few plots, probably of the same variable you chose to characterize numerically. Try to explore more than one plot type. Just as an example of what I mean:
You don’t have to use all the data in every plot! It’s fine to filter down to one country or small handful of countries.
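One possible way to try more than one plot type is sketched below. The specific plot types, variables, and settings (log scale, `binwidth`, `alpha`) are illustrative assumptions, not requirements of the assignment.

```r
library(gapminder)
library(ggplot2)

# Two quantitative variables: scatterplot, with a log scale for GDP per capita
p_scatter <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(alpha = 0.3) +
  scale_x_log10()

# One quantitative variable: histogram
p_hist <- ggplot(gapminder, aes(x = lifeExp)) +
  geom_histogram(binwidth = 2)

# One quantitative and one categorical variable: boxplots per continent
p_box <- ggplot(gapminder, aes(x = continent, y = lifeExp)) +
  geom_boxplot()

# Printing a plot object draws it
print(p_scatter)
print(p_hist)
print(p_box)
```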
Use filter() to create data subsets that you want to plot.
Practice piping together filter() and select(). Possibly even piping into a plot.
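The sketch below shows one way the pieces can chain together: `filter()` narrows the rows, `select()` narrows the columns, and the result is piped straight into `ggplot()`. The country ("Canada") and the variables are arbitrary choices made for illustration.

```r
library(gapminder)
library(dplyr)
library(ggplot2)

p_canada <- gapminder %>%
  filter(country == "Canada") %>%      # keep the rows for a single country
  select(year, lifeExp) %>%            # keep only the columns the plot needs
  ggplot(aes(x = year, y = lifeExp)) + # the piped data frame becomes the plot data
  geom_line() +
  geom_point()

print(p_canada)
```

Note the switch from `%>%` to `+` once the pipeline reaches `ggplot()`: dplyr verbs are joined with the pipe, ggplot2 layers with `+`.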
For people who want to take things further.
Evaluate this code and describe the result. Presumably the analyst’s intent was to get the data for Rwanda and Afghanistan. Did she succeed? Why or why not? If not, what is the correct way to do this?
filter(gapminder, country == c("Rwanda", "Afghanistan"))
Read What I do when I get a new data set as told through tweets from SimplyStatistics to get some ideas!
Present numerical tables in a more attractive form.
Use more of the dplyr functions for operating on a single table.
Adapt exercises from the chapters in the “Explore” section of R for Data Science to the Gapminder dataset.
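For the suggestion above about presenting numerical tables more attractively, one option is `knitr::kable()`. Treat that choice as an assumption made for this sketch; other table packages would work as well. The summary columns `mean_life` and `n_countries` are names invented for the example.

```r
library(gapminder)
library(dplyr)
library(knitr)

life_table <- gapminder %>%
  group_by(continent) %>%
  summarise(
    mean_life   = mean(lifeExp),
    n_countries = n_distinct(country)
  ) %>%
  arrange(desc(mean_life)) %>%  # arrange() is another single-table dplyr verb
  kable(digits = 1)             # render as a formatted table, rounded to 1 decimal

print(life_table)
```

In an R Markdown document the table renders as formatted output rather than raw console text.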
Reflect on what was hard/easy, problems you solved, helpful tutorials you read, etc. What things were hard, even though you saw them in class? What was easy(-ish) even though we haven’t done it in class?
Follow instructions on How to submit homework
Our general rubric applies, but also …
Check minus: There are some mistakes or omissions, such as the number of rows or variables in the data frame. Or maybe no confirmation of its class or that of the variables inside. There are no plots or there’s just one type of plot (and it’s probably a scatterplot). There’s no use of select(). It’s hard to figure out which file I’m supposed to be looking at. Maybe the student forgot to commit and push the figures to GitHub.
Check: Hits all the elements. No obvious mistakes. Pleasant to read. No heroic detective work required. Solid.
Check plus: Some “above and beyond”, creativity, etc. You learned something new from reviewing their work and you’re eager to incorporate it into your work now. Use of dplyr goes beyond select(). The ggplot2 figures are quite diverse. The repo is very organized and it’s a breeze to find the file for this homework specifically.