By Ronald Yurko & Rebecca Nugent (Carnegie Mellon University)
As part of a revamp of the general-education introductory statistics course at Carnegie Mellon, we built an interactive data explorer platform that lets students engage with the entire data analysis workflow without relying on a particular programming language. Its functionality includes tracking actions and storing answers, including responses to open-ended questions in which students describe graphs and interpret results. Under the assumption that free text gives a richer picture of student comprehension than a right/wrong multiple-choice question, we apply several text analysis methods to compare responses and detect differences over the course of the semester. Initially, we focus on 10 labs completed by 71 students in Fall 2017; all of these labs incorporated text responses, ranging in complexity and structure. Using bag-of-words techniques, topic models, and spherical k-means, we describe patterns in student responses and identify students who answered “differently” than the rest of the group. We also observe how this structure changes over the semester, ideally indicating that students are mastering the terminology and the material, while also flagging labs or questions that are potentially misleading or poorly written. This approach could serve as a first-step autograder or summary of students’ written work, allowing the instructor to focus attention more efficiently. We also discuss implications of our results for better understanding the science of data science, that is, how students from different backgrounds approach introductory statistics and data analysis.
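As a rough illustration of the clustering step described above, the sketch below is not the platform’s code: it builds bag-of-words count vectors for a few hypothetical toy lab responses and clusters them with a minimal spherical k-means, i.e., k-means by cosine similarity on unit-normalized vectors. The function names, example sentences, and the deterministic farthest-point initialization are all our assumptions for the example.

```python
import numpy as np

def bag_of_words(docs):
    """Build a vocabulary and a document-by-term count matrix."""
    vocab = sorted({w for d in docs for w in d.split()})
    index = {w: i for i, w in enumerate(vocab)}
    X = np.zeros((len(docs), len(vocab)))
    for r, d in enumerate(docs):
        for w in d.split():
            X[r, index[w]] += 1
    return X, vocab

def spherical_kmeans(X, k, iters=25):
    """Cluster rows of X by cosine similarity: project rows onto the
    unit sphere, then alternate assignment and mean-direction updates."""
    U = X / np.linalg.norm(X, axis=1, keepdims=True)
    # Deterministic farthest-point initialization (an assumption for
    # this sketch; real spherical k-means often uses random restarts).
    idx = [0]
    while len(idx) < k:
        sims = U @ U[idx].T
        idx.append(int(np.argmin(sims.max(axis=1))))
    centers = U[idx].copy()
    for _ in range(iters):
        labels = np.argmax(U @ centers.T, axis=1)  # nearest by cosine
        for j in range(k):
            members = U[labels == j]
            if len(members):
                m = members.sum(axis=0)
                centers[j] = m / np.linalg.norm(m)  # mean direction
    return labels

# Toy responses: two describe a graph, two interpret a test result.
docs = ["the histogram is skewed right",
        "the histogram looks skewed left",
        "the p value is small so reject",
        "small p value reject the null"]
X, vocab = bag_of_words(docs)
labels = spherical_kmeans(X, k=2)
```

In practice the responses would first be tokenized and weighted (e.g., TF-IDF) rather than raw-counted, but the same unit-normalization step is what makes the clustering “spherical,” so responses are grouped by which terms they use rather than by how long they are.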