Highlighting real statistical studies

Kevin Ross – Cal Poly, San Luis Obispo

One of the recommendations of the GAISE report is to “use real data where possible.” While this is great advice, perhaps an even better recommendation is to “always use real statistical studies.” This post describes some ways I highlight real studies in my courses. While my approach might not be novel, I hope you find some of these ideas useful.

For students, it doesn’t get more real than studies about themselves. I start by collecting data on students before the first class through an online survey. Variables often include number of Facebook friends, number of text messages sent yesterday, hours of sleep on a typical weeknight, and music preference (pop, rock, country, etc.). This data set is then available on the first day of class to provide a real context for concepts like observational units, variables and their types, descriptive statistics, and outliers (it seems that every class has a student who sends about 500 texts). As we use these data and collect more throughout the course, students enjoy posing their own questions and learning about each other.

I also collect data on students to replicate real research studies. For example, since 2012 Microsoft has sponsored the “Bing It On” challenge, a blind comparison test of search engine results from Bing and Google. You might remember the substantial ad campaign in which Microsoft claimed that people who take the challenge prefer Bing nearly 2 to 1 over Google. I have my students take the challenge and then use the data to perform a significance test of Microsoft’s claim (which is usually soundly rejected). This particular study is well suited to a simulation-based curriculum; the null distribution of the sample proportion, under the assumption that the population proportion is 2/3, can be simulated in class by rolling dice. One other nice feature of this study is that there are usually a handful of students whose result is “draw”. Students question how the draws should be counted: For Bing? For Google? Not counted at all? We can then discuss how the different rules for counting draws affect our analysis. In this way, we can investigate how changing the sample proportion or sample size affects the p-value in a real context, rather than just “imagine that the sample size had been…” Finally, we compare our study to similar research regarding Microsoft’s claim so students can witness that the same methods they learn in class are used in practice by real researchers.
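For readers who want to try the dice idea on a computer, here is a minimal sketch rather than my actual classroom activity. It assumes a die face of 1 to 4 stands for “prefers Bing” (probability 4/6 = 2/3), a one-sided alternative that the true proportion is below 2/3, and made-up class counts that are purely hypothetical. The function names are my own for illustration.

```python
import random


def simulate_null_proportions(n, reps=10_000, seed=1):
    """Simulate sample proportions under the null p = 2/3 by 'rolling dice'."""
    rng = random.Random(seed)
    props = []
    for _ in range(reps):
        # A die face of 1-4 (probability 4/6 = 2/3) stands for "prefers Bing".
        bing = sum(1 for _ in range(n) if rng.randint(1, 6) <= 4)
        props.append(bing / n)
    return props


def lower_tail_p_value(bing, n, reps=10_000, seed=1):
    """Share of simulated proportions at or below the observed proportion."""
    observed = bing / n
    props = simulate_null_proportions(n, reps, seed)
    return sum(1 for p in props if p <= observed) / reps


# Hypothetical class results (NOT from a real class): 12 Bing, 14 Google, 4 draws.
bing, google, draws = 12, 14, 4

# Three possible rules for handling draws: (Bing count, sample size).
rules = {
    "draws dropped": (bing, bing + google),
    "draws count for Bing": (bing + draws, bing + google + draws),
    "draws count for Google": (bing, bing + google + draws),
}

for rule, (successes, n) in rules.items():
    p_val = lower_tail_p_value(successes, n)
    print(f"{rule}: p-hat = {successes / n:.3f}, approx p-value = {p_val:.4f}")
```

Running the loop with each rule lets students see directly how the sample proportion and sample size move the approximate p-value, which is the same discussion we have in class with physical dice.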

Another way to incorporate real studies is to have students perform their own. The benefits of projects in which students propose their own research questions and collect their own data are many and have been discussed elsewhere, so I’ll just highlight one here: Past student projects become real statistical studies for future students. So let students do the work of finding real data for you!

The main hurdle to using real studies is finding good studies. It is not easy to find a study that (1) is real, (2) is of general interest to students, and (3) achieves a pedagogical goal. Valuable sources include colleagues, textbooks, and the internet. One hesitation I initially had to using real research studies was that the statistical methods used are often more advanced than those covered in introductory statistics. But eventually the realization dawned on me: I might not be able to replicate the study, but I can still use the research question and the data! Analyzing the data through the less sophisticated methods covered in class provides opportunities to look back and ahead to consider the limitations in our analysis and what we might want to investigate in the future. (I should also mention here that using real studies includes using real bad studies. For example, if you ever want to demonstrate survey bias just head to the online polls page on Fox Nation.)

In the middle of lecture, I briefly present a “statistical study of the day” which is current and of general interest to students. Sometimes I select studies that correspond to current course topics. For example, I presented an article discussing the association between age and number of tattoos while covering two-way tables and chi-square procedures. (In general, fivethirtyeight.com is a great source, and the Significant Digits feature is tailor-made for a “statistic of the day”.)

However, most often I deliberately select statistical studies of the day which are completely tangential to the course. A few examples include:

The searchable Rap Stats database, which contains frequency of words in rap lyrics over time, demonstrates interactive graphics and time series (so students can witness the death of “beepers” and the rise of “twerking”).
A paper that uses Bayesian hierarchical modeling to (spoiler alert?) predict which characters will die in the next Game of Thrones novel provides an introduction to different interpretations of probability.
A recent study of the relationship between language used on Twitter and rates of coronary heart disease introduces analysis of text data. (Figure 1 of the research paper contains the most NSFW word clouds I’ve ever seen.)

While these presentations are at a very superficial level, their goal is to give students a broader perspective of what the science of data can do.

I have one final comment about “highlighting real data”. While I am a proponent of the simulation-based curriculum, I have noticed that students tend to blur the line between the data and the simulation output. Therefore, one aspect of highlighting real data is to clearly emphasize the real data before jumping straight to inference, by

Spending time discussing how the data were collected. Students must understand that a lot more work goes into collecting data than just flipping coins or pushing buttons on an applet.
Describing the data. I prefer to give students the data file and have them explore it in JMP® the day before the applet-based inference activity. My hope is that this separation emphasizes what is real and what is simulated.

Highlighting real data in our teaching is extremely important. However, perhaps a better goal is to highlight real statistical studies – the real research questions, the people who conducted the study, the methods of data collection, the conclusions and their limitations, the way the studies are reported in the media. And, of course, the real data.

Simulation-based statistical inference

A blog about teaching introductory statistics with simulation-based inference
