P4-01: Smaller investments and bigger payoffs of using R in intro courses via "tame data"

By Albert Y. Kim (Amherst College)


"Should I use R in my intro stats courses?" is a question that has confronted many statistics instructors. We argue that this decision should be made in light of the ratio of investments to payoffs of using R. On the one hand, there are input costs, in particular having to teach students basic command-line coding. On the other hand, there are payoffs, such as R being free and having an extensive ecosystem of open-source packages. While many recent advances have tilted this ratio in favor of R, such as more user-friendly R packages and online platforms for learning R like DataCamp, we propose what we feel is another modest advance: the use of carefully curated datasets. In particular, we propose a set of "data taming" principles to be applied to datasets as "they exist in the wild" so that just enough scaffolding is provided to make the data accessible to novices, but not so much that the true nature of rich, real, and realistic data is betrayed. Kim, Ismay, and Chunn implemented these principles while building the fivethirtyeight R package of data from articles on FiveThirtyEight.com, with the singular goal of making the data accessible to novices and simpler for instructors to teach.