Mine Cetinkaya-Rundel, Duke University
Just a couple years ago I would have answered the question “Why simulation based?” with the following:
- opportunity to introduce inference before (or without) discussing details of probability distributions
- conceptual understanding of p-values – both the “assume the null hypothesis is true” part and the “observed or more extreme” part
[pullquote]Being able to introduce computation as an essential tool for conducting statistical inference is a huge benefit of simulation based inference. [/pullquote]These are the reasons why in the first chapter of OpenIntro Statistics (link), a textbook I co-authored, we decided to include a section on randomization tests. The Introductory Statistics with Randomization and Simulation (link) textbook takes these ideas a step further and provides an introduction to statistical inference completely from a simulation based perspective. I believe these are important reasons for teaching simulation based inference, and many have already discussed them at length. However, for this post I’d like to focus on a lesser-discussed reason for teaching simulation based inference: it provides an opportunity to teach computation.
For the last two years I have been teaching a course called “Better Living Through Data Science: Exploring / Modeling / Predicting / Understanding” (link) (man that’s a mouthful!). The course combines techniques from statistics, math, computer science, and the social sciences to learn how to use data to understand natural phenomena, explore patterns, model outcomes, and make predictions. The target audience of the course is first-year undergraduates, and the goal is to get them interested in statistics and data science as early as possible in their academic careers. The four main themes of the course are data wrangling and munging, data visualization and exploratory data analysis, statistical inference, and modeling. There is a heavy computation component to the course, and the students learn R not just as a data analysis tool but also as a programming language. Attempting to also teach inference along with these more programming / data science topics is challenging, but I believe that a simulation based approach works perfectly within the more computation focused curriculum.
In teaching simulation based inference I like to focus on a physical construct first. In class we use data from the “Is yawning contagious?” bit from MythBusters that aims to test whether a person can be subconsciously influenced into yawning if another person near them yawns. I believe I first came across this dataset in the Rossman/Chance applet collection (link). I start by showing a clip from the show (link), which is a great way to break up the class. (As an aside: I also give students a mini-challenge: not yawning throughout the entire class. It’s a nice distraction to call on those who do, and sometimes they catch me yawning while watching the video and call me out on it!) We then summarize the data from the study:
50 people were randomly assigned to two groups – 34 to a group where a person near them yawned (treatment) and 16 to a group where they didn’t see someone yawn (control). 10 people in the treatment group and 4 in the control group yawned.
We briefly discuss whether the difference in the two sample proportions (roughly 29% yawners in the treatment group and 25% yawners in the control group) is “substantial”, saving the term “significant” till later.
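For readers following along at a keyboard, the arithmetic is easy to check. Here is a quick Python sketch (the course itself uses R; the variable names are my own) with the counts from the study:

```python
# Counts from the MythBusters study
treatment_n, treatment_yawners = 34, 10
control_n, control_yawners = 16, 4

p_treatment = treatment_yawners / treatment_n  # ~0.294
p_control = control_yawners / control_n        # 0.25
print(round(p_treatment - p_control, 3))       # 0.044, the ~4% observed difference
```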
Then I pass out some playing cards to use for physically simulating the experiment. Here is the setup:
- Take out two aces from the deck of cards and set them aside.
- The remaining 50 playing cards represent the participants in the study:
  - 14 cards (the 12 face cards plus the 2 aces still in the deck) represent the people who yawn.
  - 36 non-face cards represent the people who don’t yawn.
And here is the simulation scheme:
- Shuffle the 50 cards at least 7 times (link: http://www.dartmouth.edu/~chance/course/topics/winning_number.html) to ensure that the cards counted out are from a random process.
- Split the deck into two: 34 cards (treatment group) and 16 cards (control group).
- Count the number of face cards (yawners) in each deck.
- Calculate the difference in proportions of yawners (treatment – control).
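The scheme above translates almost line-for-line into code. As a minimal sketch of a single shuffle-and-split, in Python rather than the R the course uses, and with a 1/0 encoding for yawners that is my own convention:

```python
import random

# 14 "face cards" stand in for yawners (1), 36 non-face cards for non-yawners (0)
deck = [1] * 14 + [0] * 36

random.shuffle(deck)                       # shuffle the 50 cards
treatment, control = deck[:34], deck[34:]  # deal 34 to treatment, 16 to control

# difference in proportions of yawners (treatment - control)
diff = sum(treatment) / 34 - sum(control) / 16
print(diff)  # one dot on the class dot plot
```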
While the students are busy shuffling I sketch a number line on the board, and each student marks their difference on this line, creating a stacked dot plot. We discuss how shuffling the cards and randomly splitting them into the two groups means we had no control over how many yawners (face cards) and how many non-yawners (non-face cards) ended up in each deck, i.e. everything was left up to chance. We also calculate the p-value as the proportion of dots on the plot that are greater than or equal to 4% (the observed difference in the original data).
Next, we talk about automating this process, joking that statisticians don’t sit in their offices shuffling cards all day. We have 50 observations in the dataset, and for each observation we have two attributes recorded: group (treatment / control) and outcome (yawn / no yawn). The first computational task is to create a data frame that the students will populate with the simulation results. Then, the students need to think about how to translate each step of the physical activity into code:
- physical shuffling and splitting into two decks -> randomly sampling the group label
- calculating the differences in simulated proportions -> grouping by group and taking the difference in proportions of yawners
- repeating this process many times -> nesting the above steps in a for loop
- marking each difference on the board’s dot plot -> plotting the simulated differences in proportions
- calculating the proportion of simulated differences at least as extreme as the one observed -> using if statements
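Putting those pieces together, one pass at the full simulation might look like the following Python sketch (again, the course uses R; the 10,000 repetitions and the seed are illustrative choices, not from the class):

```python
import random

# 1 = yawner, 0 = non-yawner; 14 yawners among 50 participants
deck = [1] * 14 + [0] * 36
obs_diff = 10 / 34 - 4 / 16  # observed difference, roughly 4%

random.seed(1)  # for reproducibility of this sketch
sim_diffs = []
for _ in range(10_000):                        # repeat many times -> for loop
    random.shuffle(deck)                       # shuffle -> random group assignment
    treatment, control = deck[:34], deck[34:]  # split into the two decks
    sim_diffs.append(sum(treatment) / 34 - sum(control) / 16)

# p-value: proportion of simulated differences at least as extreme as observed
p_value = sum(d >= obs_diff for d in sim_diffs) / len(sim_diffs)
print(p_value)
```

Plotting `sim_diffs` as a histogram reproduces the stacked dot plot from the board, and shading the region at or beyond `obs_diff` visualizes the p-value.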
The fact that there is a physical parallel to every main computational step makes it much easier to introduce a somewhat sophisticated but essential programming construct like the for loop, as well as challenging statistical concepts like p-values and null distributions.
Being able to introduce computation as an essential tool for conducting statistical inference is a huge benefit of simulation based inference. Sure, one could teach for loops just for the sake of teaching them, but I believe that providing this context is very important in a data science course (as opposed to a computer science course). Additionally, once the students have implemented this task, extending it to hypothesis tests in other settings and with other types of variables, as well as to constructing bootstrap intervals, is a small leap.
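As a taste of that leap, here is a hedged Python sketch of a percentile bootstrap interval for the difference in proportions, resampling within each group; the group encodings and the 10,000 replicates are again my own illustrative choices:

```python
import random

# Re-encode the data by group: 1 = yawner, 0 = non-yawner
treatment = [1] * 10 + [0] * 24  # 34 treatment participants
control = [1] * 4 + [0] * 12     # 16 control participants

random.seed(1)
boot_diffs = []
for _ in range(10_000):
    # resample each group with replacement -- the bootstrap analogue of shuffling
    t = random.choices(treatment, k=len(treatment))
    c = random.choices(control, k=len(control))
    boot_diffs.append(sum(t) / len(t) - sum(c) / len(c))

# percentile method: the middle 95% of the bootstrap distribution
boot_diffs.sort()
lower = boot_diffs[int(0.025 * len(boot_diffs))]
upper = boot_diffs[int(0.975 * len(boot_diffs))]
print(lower, upper)
```

The for loop is the same construct as in the randomization test; only what happens inside it changes, which is exactly why the extension feels like a small leap to students.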