Tech Talk: Teaching web scraping - Integrating data science into statistics

Mine Dogucu (New College of Florida); Mine Cetinkaya-Rundel (Duke University)


While working on statistics projects, students often turn to the internet as a source of data. They locate a data table online and try to copy and paste it into Excel. Scraping data from the web by hand can produce oddly formatted Excel files full of hyperlinks and empty cells, which often require further manual, non-reproducible cleaning before the data fit into a tidy data frame suitable for statistical analysis. Web scraping helps prevent foreseeable errors, automates the repetitive task of copying and pasting across multiple webpages, and provides access to larger amounts of data in a short span of time. We will share our experiences and examples of web scraping activities from our classrooms and show how we link this relatively modern technique with traditional statistics topics such as multiple regression.
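
As a rough illustration of the contrast between copying by hand and scraping, the sketch below turns an HTML table into a list of tidy records using only the Python standard library. The abstract does not say which tools are used in the classroom, so the language, the sample table, and the names `SAMPLE_HTML`, `TableParser`, and `scrape_table` are all assumptions made for this example, not the authors' materials.

```python
from html.parser import HTMLParser

# Hypothetical page fragment standing in for a table a student might find online.
SAMPLE_HTML = """
<table>
  <tr><th>Team</th><th>Wins</th></tr>
  <tr><td><a href="/teams/a">Team A</a></td><td>10</td></tr>
  <tr><td><a href="/teams/b">Team B</a></td><td></td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect the visible text of each table cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text pieces of the cell being read, or None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Keep only text inside a cell; link targets live in attributes and are dropped.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def scrape_table(html):
    """Return one dict per data row, keyed by the header row; empty cells become None."""
    parser = TableParser()
    parser.feed(html)
    header, *body = parser.rows
    return [dict(zip(header, (cell or None for cell in row))) for row in body]


records = scrape_table(SAMPLE_HTML)
```

Because the cleaning rules (drop hyperlinks, mark empty cells as missing) are written down as code, the same function can be rerun on every page of interest, which is what makes the process reproducible.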