Teaching Data Cleaning and Wrangling with R’s data.table Package


By Erin Franke (Carnegie Mellon University)


Information

Data wrangling is one of the least discussed yet most important parts of a statistician's work. The R package data.table, which extends base R's data.frame, is a powerful tool for fast, memory-efficient data manipulation, with concise syntax and minimal dependencies. In this presentation, we describe and demonstrate the active learning materials, including visual examples and try-it-yourself questions, that we have developed for teaching data.table in introductory statistics and data science classrooms. We have used these materials to teach a session on data.table in a Research Experience for Undergraduates (REU) program with about 40 undergraduate students from diverse coding backgrounds. The active learning examples introduce data cleaning from the ground up, covering the “six main verbs” (known in the dplyr package as select, filter, mutate, arrange, summarize, and group_by) and how to implement each of them in R with data.table. The try-it-yourself questions give students an opportunity to practice wrangling larger datasets and do not assume extensive prior coding experience.
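As a rough illustration of how the six verbs map onto data.table's `DT[i, j, by]` syntax, here is a minimal sketch. It is not taken from the presentation materials: the `penguins` table and its values are invented for this example, and only standard data.table features (`.()`, `:=`, `.N`, `order()`, `by`) are assumed.

```r
library(data.table)

# Hypothetical toy data, invented for illustration only
penguins <- data.table(
  species = c("Adelie", "Adelie", "Gentoo", "Gentoo", "Chinstrap"),
  mass_g  = c(3750, 3800, 5000, 5400, 3700),
  year    = c(2007, 2008, 2007, 2008, 2007)
)

# select: keep only some columns (the j argument)
selected <- penguins[, .(species, mass_g)]

# filter: keep rows that satisfy a condition (the i argument)
heavy <- penguins[mass_g > 3750]

# mutate: add a column by reference with := (modifies penguins in place)
penguins[, mass_kg := mass_g / 1000]

# arrange: sort rows, here by descending body mass
sorted <- penguins[order(-mass_g)]

# summarize: collapse the whole table to a single summary row
overall <- penguins[, .(mean_mass = mean(mass_g))]

# group_by + summarize: one summary row per species (the by argument)
by_species <- penguins[, .(mean_mass = mean(mass_g), n = .N), by = species]
```

One design point that may be worth flagging for students coming from dplyr: `:=` changes the table in place rather than returning a modified copy, so `penguins` gains the `mass_kg` column without any reassignment.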