P2-24: Databases! A Visual Introduction to the Data Science Techniques of Database Querying and Design

By Jennifer Broatch (Arizona State University)


Data wrangling and database skills are essential to statistics careers, yet many students (both majors and non-majors) are not exposed to these concepts. Horton et al. (2015) suggest that students develop data science skills early and often, beginning with the introductory course, and that early exposure is critical. Similarly, the Curriculum Guidelines for Undergraduate Programs in Statistical Science prepared by the American Statistical Association (2014) emphasize the increased importance of data science and the need for students to be “facile with database systems” (for an early discussion see Higgins (1999)). We present a set of three separate, interactive modules (“animations”) that introduce the concepts of relational databases, querying using SQL, and database design (NSF DUE-1431848/DUE-1431661). The concepts of relational databases and querying are shown to be essential for data manipulation and management. In a review and discussion of seven exemplar data science courses, Hardin et al. (2015) note that all of these examples include relational databases and SQL as a topic in a semester-long data science course (links to course materials and the syllabus for each full data science course are included in Hardin et al.). Unlike the full data science courses reviewed in Hardin et al., our three interactive animations are a learning tool that instructors in many disciplines, including introductory statistics, can use within their courses as an at-home activity. (Animations and resources are freely available at http://databasesmanymajors.faculty.asu.edu/)

The three visual animations engage students in three major topics: 1) Introduction to Relational Databases, 2) Introduction to Querying, and 3) Database Conceptual Design. The first introduces relational databases and how they differ from spreadsheets. The second covers querying of relational databases. The third discusses the conceptual design of data, explaining how to model data and then map the design to a relational database schema. The visualizations are customizable, if desired, and are available in several application domains. Customizations currently available on the site include Astronomy, Computational Molecular Biology, Environmental Science/Ecology, Forensics, Geographic Information Systems, and Sports Statistics. Multiple domains attract students of all majors (not just statistics or data science majors) and promote relevance to a variety of students. This early introduction to database topics has a broad audience and can be used in any introductory statistics, data science, or science course, or in any other course that aims to promote early data science skills. The animations can also assist students in senior capstone courses, like those reviewed in Martonosi and Williams (2016), by bridging the gap between students’ statistical training and the data manipulation and management challenges of the real world.
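The contrast between a spreadsheet and a relational design can be sketched in a few lines. The example below is only an illustration and is not taken from the animations: the sports data, the table names (team, player), and the helper build_database are all hypothetical. It starts from spreadsheet-style rows in which team details repeat on every player row, then maps them to two related tables so that each team is stored once.

```python
import sqlite3

# Hypothetical spreadsheet-style rows (player, team, city):
# the team's city is repeated on every row for that team.
flat_rows = [
    ("Ana", "Hawks", "Phoenix"),
    ("Ben", "Hawks", "Phoenix"),
    ("Cy", "Owls", "Tempe"),
]


def build_database(rows):
    """Map the flat rows to a two-table relational schema (illustrative only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # ask SQLite to enforce foreign keys
    conn.execute(
        "CREATE TABLE team (team_name TEXT PRIMARY KEY, city TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE player ("
        " player_name TEXT PRIMARY KEY,"
        " team_name TEXT NOT NULL REFERENCES team(team_name))"
    )
    for player, team, city in rows:
        # Each team is stored once; repeated team rows are skipped.
        conn.execute("INSERT OR IGNORE INTO team VALUES (?, ?)", (team, city))
        conn.execute("INSERT INTO player VALUES (?, ?)", (player, team))
    return conn


conn = build_database(flat_rows)
team_count = conn.execute("SELECT COUNT(*) FROM team").fetchone()[0]
player_count = conn.execute("SELECT COUNT(*) FROM player").fetchone()[0]
print(team_count, player_count)
```

In this sketch the three spreadsheet rows become two team rows and three player rows, so a change to a team's city would be made in one place rather than on every player row.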

Faculty will be able to supplement their curricula with these self-contained database animations. Each animation takes students about an hour to complete. We also provide instructor resources for introducing the animations to students, along with “Cooperative Learning Exercises” for following up on learning in class, if desired. To ensure that students understand the topics presented within each animation outside of class, self-assessment checkpoints are included at the end of each topic, with feedback provided on each question. This type of formative feedback promotes learning (McMillan and Hearn, 2008). Dietrich et al. showed that the visualizations were an effective pedagogical tool for students in various courses across two major universities, for both computing and non-computing majors.

The visualizations introduce fundamental database concepts that can be applied across platforms. The first animation introduces tables that are conceptually linked using primary and foreign keys. Although the querying visualization focuses on SQL, Hardin et al. note (when referring to a course by Wickham) that “while each language may have its own syntax the underlying operation that is being performed on the data is the same.” Thus, the concepts presented provide a foundation for learning database querying in any language, including R (see Wickham, 2014 for an R example).
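The point that the underlying operation is the same across languages can be illustrated with a minimal sketch. The tables, column names, and data below are hypothetical and do not come from the animations; Python with its built-in sqlite3 module is used here only for convenience. Two tables are linked by a primary key (team.team_id) and a foreign key (player.team_id), and the same join-and-filter operation is expressed once in SQL and once in plain Python.

```python
import sqlite3

# Hypothetical tables linked by a primary key and a foreign key.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE team (team_id INTEGER PRIMARY KEY, team_name TEXT);
    CREATE TABLE player (
        player_id INTEGER PRIMARY KEY,
        player_name TEXT,
        team_id INTEGER REFERENCES team(team_id)
    );
    INSERT INTO team VALUES (1, 'Hawks'), (2, 'Owls');
    INSERT INTO player VALUES (10, 'Ana', 1), (11, 'Ben', 1), (12, 'Cy', 2);
""")

# SQL version: join player to team on the key, then filter by team name.
sql_result = conn.execute(
    "SELECT p.player_name, t.team_name "
    "FROM player AS p JOIN team AS t ON p.team_id = t.team_id "
    "WHERE t.team_name = 'Hawks' ORDER BY p.player_name"
).fetchall()

# Plain-Python version of the same operation on the same data.
teams = {1: "Hawks", 2: "Owls"}  # team_id -> team_name
players = [(10, "Ana", 1), (11, "Ben", 1), (12, "Cy", 2)]
python_result = sorted(
    (name, teams[team_id])           # look up the team through the foreign key
    for _, name, team_id in players
    if teams[team_id] == "Hawks"     # same filter as the WHERE clause
)

print(sql_result == python_result)
```

Both versions match each player to a team through the shared key and keep only the rows for one team; only the syntax differs.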

These visualizations introduce databases to students with no previous database experience, without requiring instructors to develop new course materials themselves. Horton et al. (2015) specifically recommend that introductory database skills, such as the ability to query a relational database management system (RDBMS), be introduced in an introductory course. The animations presented satisfy that need and allow instructors to introduce these essential data science topics easily.