**Matt Beckman, Penn State University**

*What this is & what this isn’t*

This post is intended to share some pragmatic thoughts about teaching SBI in a large class, not about converting your curriculum to the SBI framework. A number of suggestions on the latter have been published in this blog and elsewhere. Besides, my colleague, Kari Lock Morgan, had already done a remarkable job accomplishing that feat in the course described here before I arrived. What follows are simply remarks about rubber-meets-the-road strategies from teaching an SBI course with 225 students, intended to either capitalize on large class size or at least help navigate some logistical challenges that surface with increased enrollment.


*A familiar activity scaled for a large class*

While teaching smaller classes, I adopted a popular illustration with M&M’s (sometimes Skittles) to introduce bootstrapping. I didn’t invent or perfect the activity, but here’s a summary: each student gets a fun-size bag of M&M’s, calculates the proportion of blue, and then marks the result on a class dotplot in front of the room. We can now have a conversation about a sampling distribution under the assumption that each fun-size bag represents a random sample of M&M’s. We emphasize that each mark on the dotplot represents a statistic calculated from an actual sample in the room, and perhaps point out a few (say, extremes) and identify the responsible students to drive the point home.

Students then sample with replacement from their own bag to build a bootstrap distribution. They do this a few times by hand, and then we introduce software to speed things up. We emphasize that each dot in the bootstrap distribution is now a proportion of blue M&M’s out of 16 draws with replacement from their own fun-size bag (assuming the bag contains 16). We highlight similarities and differences between the sampling distribution and bootstrap distributions. For example, bootstrap distributions will generally be centered in different places, but still wind up with a useful estimate of standard error.

I tend to get a lot of mileage out of this activity as the semester progresses. When I sense that students are losing sight of the fundamentals, we periodically back up and talk about the M&M’s again to get back on solid footing. The thought of scaling this for 225 students was daunting at first, but since I like having this example in my back pocket later in the semester, it was worth a shot. To start, I basically buy the entire stock of fun-size M&M’s available at a super-store in town. We’ll have plenty of M&M’s in the room to make our point, so I also supplement with an alternative of some kind for those who don’t want or can’t eat M&M’s.
The student dotplot is really the main hurdle here. With 225 students in fixed seating, a human wave filing down the aisles to mark the chalk board is a non-starter so we use Google Sheets or Forms. Students use a smart phone to access a shortened link (e.g. tiny.cc) or scan a QR code displayed on the front screen to access the spreadsheet/form and enter their result from their seats. After a quick filter, I cut and paste the data into software, and make our class dotplot. From start to finish, this method in a large class might even be faster than the manual approach I had used in smaller classes.
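For instructors who want the software step to be transparent to students, the resampling behind each bootstrap dot can be sketched in a few lines. This is an illustrative sketch only (the bag's blue count below is invented), not the class's actual workflow or software:

```python
import random

# Hypothetical fun-size bag: 16 M&M's, 5 of them blue (counts are made up)
bag = ["blue"] * 5 + ["other"] * 11

def bootstrap_prop(bag):
    """One bootstrap statistic: proportion blue in 16 draws with replacement."""
    resample = random.choices(bag, k=len(bag))
    return resample.count("blue") / len(bag)

random.seed(1)
boot_dist = [bootstrap_prop(bag) for _ in range(1000)]

# The spread of the bootstrap distribution estimates the standard error
# of the sample proportion, even though its center sits at this bag's p-hat
mean = sum(boot_dist) / len(boot_dist)
se = (sum((p - mean) ** 2 for p in boot_dist) / (len(boot_dist) - 1)) ** 0.5
print(round(mean, 3), round(se, 3))
```

Each element of `boot_dist` is one dot on a student's bootstrap distribution; pooling every student's dotplot mark instead gives the class's approximate sampling distribution.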

*A few themes . . .*

Of course this is just one specific activity, but there are some themes here that apply to a lot of activities.

*Try it anyway.* The “worst case scenarios” that I concoct when I imagine attempting an activity developed for small classes never really seem to happen. I probably fear that the mild noise, mess, and chaos of smaller classes will escalate into a completely wasted class period with so many students. Thankfully, that hasn’t happened (. . . yet?) largely because a little technology can neutralize the few inflection points where things would most likely derail (e.g. 225 students physically marking the chalkboard).

*Redeem the smart phone!* Don’t underestimate the tiny computers that most of your students already bring to class. QR codes–those pixelated square codes you scan–and shortened URLs (e.g. tiny.cc; bit.ly) can direct students to an applet, Google Sheet (Form, Doc, etc), or anything else on the web right from their seats. Students can even share with a neighbor or look over the shoulder in front of them–particularly in stadium seat lecture halls. As another example, my class uses the Lock5 text and the accompanying StatKey software which works great on a smart phone. I can introduce authentic data analysis tasks during “lecture” for students to tackle in pairs; one runs StatKey, the other takes notes.

*The large enrollment actually has some perks.* Since the class is large, class data sets are that much closer to Asymptopia. I’m sure I’m not the only one who’s attempted to use a class plot to lay the groundwork for the CLT and wound up with a skewed or bimodal-looking mess that looks nothing like the bell-shape I had in mind. . . Furthermore, rare events and oddities that lead to interesting discussions almost always show up. Someone will enter the count rather than proportion, and someone else usually observes something rare just by chance. Since I’m the gatekeeper for the class plot, I decide whether there’s time to discuss data entry errors or if we need to filter them out and plow ahead in the interest of time. Also, a student with no blue M&Ms, for example, isn’t hypothetical or a mistake. Rare events in the tails of the distribution really do happen!

*Keep an eye toward the big picture.* You’ll rarely need 100% cooperation to make your point. For example, in the M&M activity some students eat the candy before we start, others won’t get to the spreadsheet in time, and some didn’t get M&M’s to begin with! Even if only 50 students fully participate (< 25% in my case), the illustration still serves its purpose and benefits the whole class.

As statisticians, we tend to think that if we just have enough data in front of us, we can get at the heart of what is going on in any scenario, and many statistics educators want to know what is going on with student learning outcomes under different curricula. So the solution is simple, right? Just collect a bunch of data on our students’ learning outcomes under different curricula and identify the strongest pedagogy. We can even get fancy and toss in some experimental design to structure the application of treatments to our experimental units and support causal conclusions about impacts on learning outcomes. Alright, I am being facetious here. It is never that straightforward. I will admit that this was my first instinct when I set out to do educational research as a graduate student. A number of issues constrain plans for what would otherwise be a tidy, straightforward educational experiment: defining the curricular treatments, assigning students to curricula, applying the curricular treatments, and measuring learning outcomes.


First there is the problem of defining our treatments: each one is an entire curriculum! This includes lessons, lectures, labs, discussions, homework, exams, etc., which creates a mountain of new prep-work for the instructor(s) involved and leads to questions about what exactly is being compared. More importantly, it dilutes the findings, because we will need to attribute observed learning differences to the entire array of curricular differences, precluding conclusions about the efficacy of particular components. Then there is the challenge of assigning students to curricula. Logistical and institutional limitations make true random assignment of students to classes infeasible, so we face potentially biased results. After preparing the curricular treatments and getting our students into classrooms, we still need to teach the class. Almost certainly we will apply a curriculum to an entire classroom but measure learning at the student level: the dreaded mismatch of experimental and observational units. Now we need multiple classrooms with each curriculum to properly account for inter- and intra-group variability. We could always aggregate at the classroom level and discuss overall learning outcomes, but then the necessary number of classrooms grows even further. Additionally, the instructor effect will be confounded with the classroom. Lastly, we have to measure student learning outcomes in a way that other educators will trust and respect, so that the results are not discarded.

This was the laundry list of challenges Dennis Lock and I faced when we started our comparison of learning outcomes under simulation-based inference and traditional introductory statistics curricula.[1] We were graduate assistants, each assigned as the instructor for an intro-stat section of around 60 students, and we could not change the meeting times for which students had enrolled. Given the challenges listed above (coupled with our limited influence as graduate students), we needed to find creative workarounds. Before registration we arranged for our two sections to have the same meeting time, allowing us to randomly reassign their classroom *locations*. For the first half of the semester (prior to the inference unit), both sections met in one large classroom that Dennis and I co-taught with one set of lectures, labs, homework, and exams. During the inference unit the students split into two classrooms that were kept as similar as possible in terms of data sources, progression of concepts, and homework/test schedules, in order to focus the results on how simulation-based pedagogy impacted student learning. We also rotated weekly between the two classrooms in an attempt to avoid confounding the instructor effect with the treatment effect. We elected to evaluate student learning on the final exam using both a widely recognized instrument (the ARTIST scaled question sets) and our own question sets.

There is one hurdle that our small-scale experiment could not overcome – replication over classrooms. In our study we had only one classroom per treatment to work with, so we had no way to account for within classroom dependence structure in our analysis. Our simulation study showed that if student scores from the same classroom were more related than students across different classrooms then our Type 1 error rate inflates – presenting a real danger of misinterpreting classroom clusters floating around the same average learning score as a curricular effect.
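The inflation described above is easy to demonstrate with a small simulation. The sketch below is hypothetical (all parameter values are invented, and it is not our study's actual simulation): students within a classroom share a common classroom-level effect, there is one classroom per "curriculum" with no true curricular effect, and a naive two-sample t-test ignores the clustering:

```python
import math
import random
import statistics

def simulate_rejection_rate(classroom_sd, n_students=60, reps=2000):
    """Two classrooms, one per 'curriculum', no true curricular effect.
    A shared classroom-level shift makes students within a room correlated."""
    rejections = 0
    for _ in range(reps):
        room_a = random.gauss(0, classroom_sd)  # classroom-level effect
        room_b = random.gauss(0, classroom_sd)
        a = [room_a + random.gauss(0, 1) for _ in range(n_students)]
        b = [room_b + random.gauss(0, 1) for _ in range(n_students)]
        # Pooled two-sample t statistic that ignores the clustering
        sp2 = (statistics.variance(a) + statistics.variance(b)) / 2
        t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(sp2 * 2 / n_students)
        if abs(t) > 1.98:  # approximate two-sided 5% cutoff, df = 118
            rejections += 1
    return rejections / reps

random.seed(7)
rate_indep = simulate_rejection_rate(classroom_sd=0.0)  # independence holds: near 0.05
rate_clustered = simulate_rejection_rate(classroom_sd=0.3)  # clustering: inflated
print(rate_indep, rate_clustered)
```

Even a modest classroom-level standard deviation relative to student-level noise pushes the Type 1 error rate well above the nominal 5%, which is exactly the danger of mistaking classroom clusters for a curricular effect.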

In order to reinforce the analyses from small-scale educational experiments like ours, we need to find a way to either eliminate or account for the classroom-based dependence structures. We can strive to structure the course so that the assumption of independence across students is actually true, requiring one-on-one style teaching. Online courses do not remove the burden of developing multiple curricula, but could make use of random assignment to curricula that could be effectively executed through course management software. The alternative is to account for the classroom effects in our models. While it is impossible to estimate classroom parameters in the covariance structure without classroom replication, we could feasibly plug in reasonable approximations. This would require a much stronger understanding of how inter and intra classroom covariance structures behave. One idea is to conduct a series of uniformity studies where many classrooms are given the same curriculum and we examine covariance structures for student scores with our intended learning assessment tool. It may seem silly to argue that “in order to do *small*-scale research, we need to first run a *large*-scale uniformity study”; but there are many larger colleges and universities already running many parallel sections with identical curriculum that would simply need to run the assessments and disseminate the data. Sharing these results would allow many small-scale researchers to plug in reasonable proxies to protect their analyses from Type 1 errors. While it is difficult to run curricular assessment without access to lots of resources and many course sections, it is important that we push to conduct small-scale curricular assessment that provides robust results.

**David Diez, OpenIntro**

The percentile bootstrap approach has made inroads into introductory statistics courses, sometimes with the incorrect declaration that it can be used without checking any conditions. Unfortunately, for small samples of numerical data the percentile bootstrap performs worse than methods based on the t-distribution. I would wager that the large majority of statisticians believe the opposite to be true, and I think this misplaced faith has created a small epidemic.


A few years ago I created this spreadsheet to compare the percentile bootstrap to classical methods. For numerical data, the t-confidence interval outperforms the percentile bootstrap through a sample size of about 30. The difference is particularly stark when the population is skewed and the sample size is very small. Tim Hesterberg published a much more comprehensive investigation of multiple classical and bootstrap methods in 2014. He found similar results for small samples: the t-confidence interval outperformed the percentile bootstrap until the sample size reached about 35.
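A coverage simulation in the spirit of these comparisons can be sketched as follows. This is illustrative only, not a reproduction of either study: the exponential population, sample size, replication counts, and rough t critical values are all assumptions made here for the sketch:

```python
import random
import statistics

def coverage(n, ci_method, reps=1000, boot=500):
    """Share of intervals that capture the true mean (1.0) of an Exp(1) population."""
    true_mean = 1.0
    hits = 0
    for _ in range(reps):
        sample = [random.expovariate(1.0) for _ in range(n)]
        lo, hi = ci_method(sample, boot)
        hits += lo <= true_mean <= hi
    return hits / reps

def t_interval(sample, _boot):
    n = len(sample)
    m = statistics.mean(sample)
    se = statistics.stdev(sample) / n ** 0.5
    t_star = 2.26 if n == 10 else 2.0  # rough critical values for this sketch
    return m - t_star * se, m + t_star * se

def percentile_bootstrap(sample, boot):
    n = len(sample)
    means = sorted(statistics.mean(random.choices(sample, k=n)) for _ in range(boot))
    return means[int(0.025 * boot)], means[int(0.975 * boot) - 1]  # middle 95%

random.seed(3)
cov_t = coverage(10, t_interval)
cov_boot = coverage(10, percentile_bootstrap)
print(cov_t, cov_boot)  # both fall short of 95% for n = 10, the bootstrap more so
```

With a skewed population and n = 10, both intervals undercover a nominal 95% target, and the percentile bootstrap tends to come out worse, consistent with the comparisons described above.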

Teaching the percentile bootstrap without thoughtfully explaining the conditions, particularly as a replacement for classical methods, seems like one step forward and two steps back. The percentile bootstrap is nothing new, but its weaknesses remain largely unknown in the community. I find myself wrestling with several considerations whenever I think about this topic.

**The percentile bootstrap is a stepping stone.** I don’t think the percentile bootstrap method should be taught as “the” bootstrap method. It’s too unreliable. The percentile bootstrap should be taught as a first step towards better methods and/or as a first tool for students to start exploring a wider range of analyses, e.g. of the median, standard deviation, and IQR.

**There are better bootstrap methods.** Tim’s excellent paper found that the *bootstrap t-interval* is much more robust than the percentile bootstrap, and the bootstrap t-interval is even much more robust than the classical methods for small samples and skewed data. (Research opportunity #1)

**The bootstrap opens the door to more statistics.** The reason why I remain bullish on the long-term value of advanced bootstrap methods is that they ease the analysis of a wider range of statistics, such as the standard deviation and IQR.

**We need to establish appropriate conditions for the bootstrap.** Every statistical tool fails in many ways, and we need to better understand when methods fail before they are taught to the next generation of statisticians. As a starting point, I suggest a rule of thumb for the percentile bootstrap below. To be clear, more thoughtful work is required here and appropriate conditions are far from settled. (Research opportunity #2)

**Shifts in pedagogy are costly, so let’s do our homework first.** A shift in how intro statistics is taught on a large scale is very expensive. It requires teaching tens of thousands of teachers the new pedagogy, getting those teachers to buy into the change, and then pushing schools (or students) to buy new textbooks. I think we owe it to schools, teachers, and most of all students to have solid evidence (data!) from a diverse set of studies showing the practical benefits of the bootstrap method before we ask them to incur the costs of this transition. (Research opportunity #3)

I want to wrap up with my rule of thumb for the percentile bootstrap. *If I’d be comfortable applying the Z-test or Z-confidence interval to a data set, then I think it’s safe to use the percentile bootstrap for the mean or median.* In most introductory courses, that usually means (1) the data are from a simple random sample or from random assignment in an experiment, (2) there are at least 30 observations in the sample, and (3) the distribution is not too strongly skewed.

**Kari Lock Morgan, Assistant Professor of Statistics, Penn State University**

Computers (or miniature versions such as smart phones) are necessary to do simulation-based inference. How then can we assess knowledge and understanding of these methods *without* computers? Never fear, this can be done! I personally *choose* to give exams without technology, despite teaching in a computer classroom once a week, largely to avoid the headache of proctoring a large class with internet access. Here are some general tips I’ve found helpful for assessing SBI without technology:

**Much of the understanding to be assessed is NOT specific to SBI.** In any given example, calculating a p-value or interval is but one small part of a larger context that often includes scope of inference, defining parameter(s), stating hypotheses, interpreting plot(s) of the data, calculating the statistic, interpreting the p-value or interval in context, and making relevant conclusions. The assessment of this content can be largely independent of whether SBI is used.

**In lieu of technology, give pictures of randomization and bootstrap distributions.** Eyeballing an interval or p-value from a picture of a bootstrap or randomization distribution can be difficult for students, difficult to grade, and an irrelevant skill to assess. Here are several alternative approaches to get from a picture and observed statistic to a p-value or interval without technology:

- *Use a countable number of dots in the tail(s).* For p-values, this means choosing an example with only a small (easily countable) number of dots beyond the observed statistic, which students can count and divide by the total. For intervals, you might generate 1000 bootstrap samples and ask for a 99% interval, requiring students to count only 5 dots in each tail.
- *Have students circle the relevant part of the distribution.*
- *Choose examples with obviously small or not small p-values.*
- *Choose the interval/p-value from a list of options.* While precise answers can be difficult to eyeball, a student reasoning correctly should be able to choose the correct answer from a list of possible options.
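The dot-counting arithmetic behind the 99% interval example is just percentile bookkeeping, which a short sketch makes concrete. The bootstrap statistics below are invented for illustration:

```python
import random

# Hypothetical: 1000 bootstrap statistics (values made up for illustration)
random.seed(42)
boot_stats = sorted(random.gauss(0.30, 0.04) for _ in range(1000))

# A 99% percentile interval leaves 0.5% = 5 dots in each tail,
# so students count off 5 dots from each end and cut there.
lower = boot_stats[5]    # 6th smallest value: 5 dots lie below it
upper = boot_stats[-6]   # 6th largest value: 5 dots lie above it
print(round(lower, 3), round(upper, 3))
```

This is exactly what students do by eye on the printed distribution: count five dots in from each tail and read off the boundary values.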

**Emphasize concepts or material that remain necessary with technology.** SBI has an advantage here over traditional inference, where it can be all too tempting to create assessments centered on plugging summary statistics into the correct formula. For example, specifying a relevant parameter/statistic and type of plot based on the variable type(s), and interpreting results in context, all remain necessary skills with technology available, while plugging numbers into formulas and using paper distribution tables do not. If an assessment item would be irrelevant with technology, it may not be important enough to assess without technology. I’m not advocating avoiding all details that technology automates, but rather suggesting that we not assess these details just for the sake of having something convenient to assess.

**Ask students to describe what a single dot represents.** Disclaimer: assessing the mechanics underlying the simulation has actually become less and less important to me over time; I personally care more about students understanding that the p-value measures the extremity of the observed statistic if the null hypothesis is true, than the specific method used in a particular simulation. Nonetheless, asking students what a single dot on a randomization or bootstrap distribution represents, or asking how they would generate an additional dot, assesses understanding of the underlying simulation process. Asking this question as free-response can be very enlightening, but [WARNING!] can also be hard to grade. If you have large classes, you might consider making this multiple choice.

Here is a sketch of a generic randomization test exam question incorporating the above tips (exact questions will vary with context, but this provides a rough template):

(a) Define, with notation, the parameter of interest.
(b) State the null and alternative hypotheses.
(c) What type of graph would you use to visualize these data? (or interpret a plot)
(d) Give the relevant sample statistic.
(e) Describe what one of the dots in the randomization distribution represents.
(f) Estimate the p-value from the randomization distribution shown.
(g) Make a generic conclusion about the null hypothesis based on (f).
(h) Make a conclusion in context based on (g).
(i) Can we make conclusions about causality? Why or why not?

Note that almost all parts remain relevant in the presence of technology, with the exception of potentially (d) and (f). Part (d) may involve calculating a difference in proportions from a two-way table or identifying relevant statistics from software output. Part (f) may include counting dots, circling part of the distribution, or choosing from possible options, as discussed above.

This template could also apply when technology *is* available, with only minor tweaks (I’ve used a similar template for lab exams in the past). In this case, only (c), (d), and (f) would differ, as students could actually generate the plot, statistic, and randomization distribution. The assessment without technology loses the ability to assess whether students can get the software to work, but I believe it does not fundamentally compromise the ability to assess knowledge and understanding.

If your students *do* have access to technology for assessments, see the parallel blog post by Robin Lock.

**Robin Lock, Burry Professor of Statistics, St. Lawrence University**

I have the luxury of teaching in a computer classroom with 28 workstations embedded in desks with glass tops that show the monitor below the work surface. This setup has several advantages (in addition to enforcing a max class size of 28): computing is readily available at any point in class, yet I can easily see all of the students, they can see me (no peeking around monitors), and they still have a nice big flat surface to spread out notes, handouts, and, occasionally, a textbook (although many students now use an e-version of the text). I also have software on the instructor’s station (*Smart Sync*) that shows a thumbnail view of every student screen. Since the class is set up to use technology whenever needed and appropriate, it is natural to extend this to quizzes and exams, so my students routinely expect to use software as part of those activities.


This is useful when assessing simulation-based inference (SBI) methods, since I can ask students to actually carry out the procedures, but it raises some challenges for constructing assessments that are doable in a relatively short amount of time (as opposed to projects that can go into greater depth), address the concepts I want to assess, and, don’t forget this one, are still relatively efficient to grade! The last point can be a bit of a challenge, since SBI methods will generally not yield a single correct answer, yet with a little practice one can get pretty good at quickly distinguishing reasonable answers from responses that show errors in procedure or reasoning.

**Using a Scaleless Distribution**

Ideally I’d like to see what each student produces on the screen and how they interpret the output to make statistical conclusions, but it’s not practical to look over everyone’s shoulder as they work. I still do traditional paper/pencil quizzes and exams, so it’s also not feasible to have students electronically cut/paste graphics and other output. Having them draw a rough sketch is one option, but another workaround I’ve used is to include a “generic” distribution on the quiz with no scale shown and ask students to fill in the scale based on their simulated distribution. Here are a couple of examples from this semester’s quizzes that illustrate this approach. We use the StatKey software package (*http://lock5stat.com/statkey*) for generating bootstrap and randomization distributions.

**Sample Question #1: Multiple Choices** Some people say that, if you are randomly guessing on a multiple choice test, the correct answer is more likely to be a middle choice than at either extreme. For example, if the five choices are A, B, C, D and E, you should avoid picking A or E. Let’s try testing this theory.

(a) If the five multiple choice options are equally likely to be correct, what proportion of questions should have E as the correct choice?

(b) Suppose that a sample of n=400 multiple choice questions from AP exams had E as the correct choice for 68 questions. What is the proportion of questions with E correct in this sample? (Use good statistical notation to label your answer).

(c) Write down the hypotheses if the question of interest is whether there is evidence that the proportion of questions with E correct (call it p_{E} ) is less than would be expected when answers are randomly assigned.

(d) Use StatKey to produce a randomization distribution for this test, based on the sample of 68 E answers out of 400 questions. On the plot below label the center of your randomization distribution and enough values on the horizontal axis to show the scale.

(e) What does a single dot in the plot above represent?

(f) Use StatKey to find the p-value for this test and show on the graph above how this looks.

(g) Assuming a 5% significance level, write a sentence that interprets what this test tells you in the context of this problem.

A quick glance at the scale shows whether it is centered in the proper place (p_{0}=0.2), whether students have forgotten to change the default p_{0}=0.5 in StatKey, or whether they have produced a bootstrap distribution that is centered at the sample proportion (\hat{p}=0.17) instead. We can also easily check whether students are using the distribution properly to find the p-value using the proportion of samples with randomization proportion ≤ 0.17.
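For readers without StatKey at hand, the randomization distribution behind Question #1 can be approximated with a short simulation. This is a sketch of the underlying idea, not StatKey itself:

```python
import random

# Question #1 setup: n = 400 questions, 68 observed with E correct (p-hat = 0.17);
# null hypothesis p_E = 0.2 with the one-sided alternative p_E < 0.2
random.seed(5)
n, observed = 400, 68

def one_randomization():
    """Count of E-answers in 400 questions when each is E with probability 0.2."""
    return sum(random.random() < 0.2 for _ in range(n))

# Each count below corresponds to one dot on the randomization distribution
counts = [one_randomization() for _ in range(5000)]
p_value = sum(c <= observed for c in counts) / len(counts)
print(round(p_value, 3))
```

The distribution centers near 400 × 0.2 = 80 E-answers, and the p-value is the share of randomization samples at or below the observed count of 68.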

**Sample Question #2: Restaurant Tips** We collected a random sample of n=157 restaurant bills from the First Crush bistro in Potsdam, NY and recorded the size of the tip on each bill. The data should be available in StatKey (if you are in the proper procedure; ask for help, with a one-point penalty, if you can’t find it).

(a) Use StatKey to construct a bootstrap distribution of *mean tip size* based on this sample. Your plot should look similar to the one below (Note: You could use the same plot as in the question above or generate a new one specifically for this dataset.). Label values on the horizontal axis to indicate the scale you see in your plot, including the center point.

(b) Find an 80% confidence interval for the mean tip size at First Crush, based on your bootstrap distribution. Indicate on your plot above how you find the interval and don’t forget the interpretive sentence.

Using StatKey’s built-in datasets is very convenient for exam purposes, but it is also very easy to enter new sample data, especially for inference involving proportions (as in Question #1) where students only need to enter the appropriate counts. For quantitative data, StatKey now allows users to upload their own file in .csv or .txt format, so it is relatively easy to give students access to a data file in an appropriate format to upload during the quiz. For readers who want the full **RestaurantTips** dataset, the file can be downloaded from *http://lock5stat.com*.

Don’t have access to computers for quizzes and exams? Check the companion to this blog post by Kari Lock Morgan for tips on assessing SBI concepts without technology.

Many of us will agree that using tactile demonstrations is super fun and can also be an excellent way to teach a particular concept. Students engage with the material differently when they can touch, smell, or taste the objects as opposed to only seeing or listening to a demonstration. The SBI blog has had many excellent articles describing in-class tactile simulations, see here and here and here.

However, sometimes the logistical constraints of setting up the demonstration take away too much time from an already packed 50-minute class session. And those details get even harder with large classes. One of the biggest challenges comes from collecting data or getting results back from the students. Although some classes have sophisticated clickers that make data collection easier, setting up and using clickers is also a logistical challenge (well worth it for using all semester, but not for a one-day class demonstration).


My view on classroom demonstrations is that doing most of the tactile demonstration can communicate the vast majority of the pedagogical ideas. I will demonstrate what I mean with examples using chocolate chips. I have used chocolate chips in class many times to teach two different concepts: (1) censored data analyzed with survival analysis, and (2) paired data analyzed with a paired t-test (examples drawn from the texts in the references below).

In both classroom experiences, I provide small Dixie cups with two visibly different types of chocolate chips (typically two of either white chocolate, milk chocolate, semi-sweet chocolate, or peanut-butter chips).

Then we spend some time as a class talking about the experiment and the goal of the experiment. For both of the statistical methods I cover in class, the goal is to compare the length of time taken to melt the two different types of chips (the length of time is described by: the average if doing t-tests; the distribution if doing signed-rank test; the survival curve if doing survival analysis). I will give some details of the class interactions below, but many of these ideas are also discussed in the texts mentioned above and in the references. For simplicity, I will describe using a paired t-test to determine if milk or white chocolate chips melt faster, on average.

I always start with a basic question: What will we do to determine which chocolate chip melts faster on average?

Inevitably, I get a basic answer in return: Put chocolate chips in mouth, see how long it takes to melt, record data, decide which one melts faster.

And then I stop and ask them how in the world they’ll know how to do what they just suggested? There is a tremendous amount of additional information necessary before running a viable experiment. Among the important decisions to be made (by class consensus) are:

- How is the chocolate chip going to reside in the mouth? Can you chew? Can you use your tongue? Can you move the chip?
- Who gets which color? (Everyone gets both if paired!)
- In what order do the chips get melted? Does everyone melt the same chip first?
- How will the melting be timed?
- How long do we wait between chips?

The conversation that ensues about the experimental design is incredibly valuable for understanding paired design (and the motivation for the pairing) or survival analysis (and the need for tools to analyze censored data). The process of coming to a class consensus about the chip experiment requires the students to justify their decisions to their classmates. For example, the class notes that if you hold the chip at the top of your mouth, it’ll melt faster. Oh yeah, and that brings up the fact that some people will have mouth environments where chips melt faster (which is why we pair)!

I almost never collect the data that my students generate. Partly because melting times for chips aren’t substantially different (so power is pretty low). But also because collecting the data, entering it into the computer, and running the appropriate test does not provide the same pedagogical-idea-per-class-minute value that the earlier discussion on design did. I can make up data, use the textbook’s data, or use student data from a previous year (if I have it). And the students can then *quickly* see the analysis done on screen.
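When the made-up (or prior-year) data go on screen, the analysis itself is only a few lines. Here is a minimal sketch in Python; the melting times below are entirely hypothetical numbers of my own, not classroom data:

```python
# Paired t-test on hypothetical melting times (in seconds).
# Each student melts one milk and one white chip, so the two
# measurements are paired within a student.
from scipy import stats

milk = [52, 61, 47, 58, 66, 50, 55, 63, 49, 57]   # hypothetical times
white = [55, 64, 52, 57, 70, 54, 60, 62, 53, 61]  # hypothetical times

# ttest_rel tests whether the mean within-student difference is 0.
t_stat, p_value = stats.ttest_rel(milk, white)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```

The point for students is that the within-student differences, not the two raw columns, drive the test, which is exactly why the design conversation about pairing matters.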

Don’t get me wrong, there are large benefits to collecting class data so that students can understand first hand how variability plays a role. I believe that collecting student data is particularly valuable when each student gets a different *statistic* (for example, number of successes in a binomial trial) which can be put together as a class generated sampling distribution. But in the chip example, the vast majority of the learning happens with understanding the experimental design. And because we must continually make choices in our classes, we should not be afraid to cut out the parts of the demonstration (here collecting data) we believe are less important to the learning.

References:

*Investigating Statistical Concepts, Applications, and Methods* by Beth Chance and Allan Rossman, http://www.rossmanchance.com/iscam3/

*Practicing Statistics* by Shonda Kuiper and Jeff Sklar, http://web.grinnell.edu/individuals/kuipers/stat2labs/

A chimpanzee named Sarah was the subject in a study of whether chimpanzees can solve problems. Sarah was shown 30-second videos of a human actor struggling with one of several problems (for example, not able to reach bananas hanging from the ceiling). Then Sarah was shown two photographs, one that depicted a solution to the problem (like stepping onto a box) and one that did not match that scenario. Researchers watched Sarah select one of the photos, and they kept track of whether Sarah chose the correct photo depicting a solution to the problem. Sarah chose the correct photo in 7 of the 8 scenarios she was presented with. In order to judge whether Sarah understands how to solve problems, we will define π to be the probability Sarah will pick the photo of the correct solution.


- Write out the null and alternative hypothesis for this study (in words and/or symbols).
- If you conduct a test of significance using simulation, what values would you use in the one-proportion applet?
- Assume you conduct a test of significance using simulation and get the following null distribution. (Note: this null distribution uses only **100** simulated samples and not the usual 1000 or 5000.) Based on the null distribution, what is the p-value for the test? Circle the p-value on the plot.
- Based on your p-value, do you have strong evidence that Sarah is not just guessing about which photograph belongs to each scenario? Explain briefly.
- What does a single dot represent in the null distribution shown above?

- A simulation for the results of Sarah completing one trial, if she chooses the correct picture half of the time in the long run.
- A simulation for the results of Sarah completing one trial, if she chooses the correct picture more than half the time in the long run.
- A simulation for the proportion of times Sarah chooses the correct pictures out of 8, if she chooses the correct picture half the time in the long run.
- A simulation for the proportion of times Sarah chooses the correct picture out of 8, if she chooses the correct picture more than half of the time in the long run.

Of course there are many more important reasons that I like this question. It asks the students to write the null and alternative hypotheses. They can only do this if they understand that the null hypothesis is the “by chance” model – if they understand that the null hypothesis is that Sarah is just guessing about which picture is correct. And they need to put the hypotheses in context in order to decide whether we should use a one-sided or two-sided alternative in this situation. And just in case they can write the correct null hypothesis, but don’t understand how that hypothesis is used in the simulation, I ask them to tell me what value is placed in each of the boxes of the applet.

This is usually tough for the students. They often get confused early in the course about where π and p̂ are supposed to go. What are the differences in these values? What happens when they put the same values in both the first and last box? (I often ask them that on a quiz too. Sometimes I even ask what the p-value would be if π = p̂.) Actually, they usually want to put 0.5 in the first box every time they use the applet. So I wish I could come up with more problems where the “by chance” model didn’t have π = ½, but instead had π = ⅓, or π = ¼, or some other value. There are some examples in the ISI textbook (adjustments on Rock/Paper/Scissors for example) that I do use, but sometimes the backgrounds are too complicated for testing environments.

Later in the course I would expect *them* to define the parameter π in words, but at this point, they are usually still having trouble with that. So I define it for them. I do, however, expect them to recognize that they need to convert from the count of 7 successes to the sample proportion of ⅞ = 0.875. I also expect them to know where to put this in the applet, and most importantly, I expect them to know that this is the critical part of finding the p-value. They need to be able to find that value on the horizontal axis of the null distribution, count the dots above that value, and divide that number of dots by 100. I am always tickled when they can do this early in the course. If they can’t do this, I need to do some more work with them.
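The applet's simulation can also be written out in a few lines, which some students find demystifying. This is a sketch under my own naming, not the applet's internals: draw 100 samples of 8 trials under the "just guessing" null of π = 0.5, then count the dots at or beyond the observed p̂ of 7/8 = 0.875:

```python
# Simulate the null distribution for Sarah's 8 trials under pi = 0.5.
import random

random.seed(1)
n_trials, n_samples, observed = 8, 100, 7 / 8

# Each simulated proportion is one "dot" in the null distribution.
null_props = [sum(random.random() < 0.5 for _ in range(n_trials)) / n_trials
              for _ in range(n_samples)]

# p-value: fraction of dots at or beyond the observed proportion 0.875.
p_value = sum(p >= observed for p in null_props) / n_samples
print(f"p-value = {p_value:.2f}")
```

This mirrors exactly what I ask students to do by eye: locate 0.875 on the horizontal axis, count the dots at or beyond it, and divide by 100.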

I usually have a question similar to this on each test and the final exam. The context changes, and the parameter(s)/variable(s) changes, but I give them a null distribution and few enough dots that they can count (or a statistic far enough in the tail that they can see the p-value is zero). I don’t let them get away with just claiming that the p-value is some particular number – they have to explain how they know it is that number. By the end of the course, nearly every student who is attending class and really wants to pass can answer that question. Many can even extend to a new context that they haven’t seen before, for example with a new statistic such as the mean/median.

The last, multiple-choice part of this question tries to make sure that the students understand exactly what the null distribution represents. There are lots of different ways I might ask this over the course of a semester. By the end of the semester, I’m usually asking them to tell me what one dot on the plot represents. But early on, I like to give them options from which to choose. The distinctions between choices may be subtle. And they can lead to lively discussions about how the null distribution is a plot of outcomes assuming the null hypothesis is true, not of what might happen if the alternative hypothesis is true.

An additional question I sometimes ask is what do they expect the center of this null distribution to be? I’ve even given them several plots to choose from (usually after we’ve learned more about spreads). But they should always know that the center should be roughly what they are assuming π to be in the null distribution. In this case, with the small sample size, they should expect the spread to be fairly large.

Of course all of my student assessments are works in progress. Every time I administer one, I learn about improvements I can make the next time. But I’m sure I’ll continue to provide null distributions and ask students to locate and estimate the p-value on those distributions.



The question I am sharing was written to test students’ understanding of randomization tests as well as some other basic statistical literacy topics (e.g., types of studies). Additionally, the question is based on a REAL experiment (with random assignment) and is something my students can relate to. I try to use real datasets whenever I can, which can be messy.

In the experiment, participants listened to a lecture and were tested on the material they learned. Some of the participants were seated in a location where they would be able to view other students surfing the web, and some participants were seated in a location where they could not easily view another student surfing the web. There was a fairly small sample size (19 in each treatment group) and it is possible the population data are skewed due to a ceiling effect common with test scores, so a traditional *t*-test would not be appropriate. If you are teaching *t*-tests in addition to randomization tests, this study provides the opportunity to add additional questions about why it would be more appropriate to conduct a randomization test than a *t*-test. I have also been known to give my students output from both the randomization test and the *t*-test, ask them which method is more appropriate, and then have them answer questions based on the method they chose.

When describing the context of the problem for any activity or test question, I try to pose an overarching research question. These descriptions tend to be fairly long, due to the messiness of real data. I then ask students multiple questions about different aspects of the study ending by having them answer the research question. I think it is important to note that I am not just asking them to conduct a randomization test, but am also asking them for interpretations and to think about how study design affects our conclusions. Asking them a multitude of questions for one context also saves me time for coming up with different contexts and reduces cognitive load for my students.

You may be wondering how long it takes to create a question such as the one I have shared here. To be honest, it does take some time. A majority of the time, maybe a couple of hours, is spent trying to find a context and data to use. For this particular experiment, I contacted the first author of the journal publication and was given access to the data within a couple of days. Unfortunately, many times, I cannot gain access to the real data in the time frame that I need it, so I simulate data based on statistics provided in the article. With practice, I promise it becomes easier to search and find new examples to use in test questions. Once you have your new contexts and data, the actual questions will be similar year to year.

Is it worth it to go through the effort to find real examples to use in test questions? I will end by saying yes. Perhaps students will even learn something from taking the test, such as realizing that it is not respectful to their classmates to check Facebook during lecture!

Research Question: Do undergraduate students’ grades suffer when neighboring students use laptops during class?

A study [1] was conducted at a university with a sample of 38 volunteer undergraduate students (referred to as participants). The participants were randomly assigned a seat in a lecture hall and listened to a 45-minute lecture on meteorology. All participants were told they could not use laptops during the lecture, but they could take notes using pencil and paper.

Additional students (referred to as multitasking peers) were scattered throughout the classroom who did use laptops. These multitasking peers were told to pretend to take notes on a laptop as well as to browse the Internet and visit websites such as Facebook. Half of the participants were placed in seats where they were able to clearly view the laptops used by multitasking peers (group 1) and the other half of the participants were placed in seats where they were not able to clearly view the laptops used by the multitasking peers (group 2).

After the lecture, the multitasking peers left the room and the 38 participants completed a 30-minute multiple-choice test with 48 questions. The percent correct was reported for each participant. The average percent correct for the “in-view” treatment group (group 1) was 56% and the average percent correct for the “not-in-view” treatment group (group 2) was 73%.

- Is this an experimental study or an observational study? Explain.
- What is the statistic of interest? Describe it in words and provide a value.
- Based on your statistic alone, can we conclude that being in view of multitasking peers leads to a decrease in test scores?

A simulation for a randomization test was conducted to see if the average score for participants not in view (group 2) was significantly higher than participants in view (group 1). A randomization distribution for 1000 simulated randomization samples was created and plotted below.

- What are the null and alternative hypotheses being tested?
- Why is the randomization distribution centered at 0?
- Provide an estimate of the *p*-value.
- Write a sentence interpreting your *p*-value.
- What would be the appropriate decision based on the randomization test?
- Provide an answer to the research question using the results and decision from your randomization test.
- Would it be appropriate to generalize these results to the population of all undergraduate students?
- Would it be appropriate to say that being in view of multitasking peers caused the mean score to be less than not being in view?

[1] Sana, F., Weston, T., & Cepeda, N. J. (2013). Laptop multitasking hinders classroom learning for both users and nearby peers. *Computers & Education*, *62*, 24-31. doi: 10.1016/j.compedu.2012.10.003
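The randomization test described above is easy to sketch in code. The study's raw data are not reproduced here, so the scores below are placeholder values I generated to roughly match the reported group means (56% and 73%, 19 per group); the variable names and the spread of 12 are my own assumptions, not the study's. The re-randomization logic, though, is the standard one:

```python
# Randomization test for a difference in group means, on placeholder data.
import random

random.seed(2)
in_view = [random.gauss(56, 12) for _ in range(19)]      # hypothetical scores
not_in_view = [random.gauss(73, 12) for _ in range(19)]  # hypothetical scores

def mean(xs):
    return sum(xs) / len(xs)

observed = mean(not_in_view) - mean(in_view)

# Under the null hypothesis the group labels are exchangeable, so we
# repeatedly reshuffle the labels; the simulated differences center at 0.
pooled = in_view + not_in_view
diffs = []
for _ in range(1000):
    random.shuffle(pooled)
    diffs.append(mean(pooled[19:]) - mean(pooled[:19]))

# One-sided p-value: fraction of re-randomizations at least as extreme.
p_value = sum(d >= observed for d in diffs) / len(diffs)
print(f"observed difference = {observed:.1f}, p = {p_value:.3f}")
```

Note that the re-randomization step is also the answer to the question "Why is the randomization distribution centered at 0?": shuffling the labels enforces the null of no treatment effect.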

- Hope College students (in 2003) were wondering if there are any gender differences when it comes to how long people talk on their cell phones. They surveyed a sample of other students, recording each student’s gender (0=female, 1=male) and the length of their last cell phone call in seconds (students could find this recorded on their phones). Dealing with the outlier in this data set makes it interesting. cellphonedata

In Canada, school curricula differ by province, but most Canadian mathematics curricula include glimpses of statistical thinking, typically in the middle grades. In the province of Ontario, tracing the statistics part of the curriculum through the grades reveals a progression in sophistication of tools for summarizing data, with some scattered mentions of the ideas of informal inference. Students are encouraged to make inferences from their observations, but typically without tools to support their generalizability. Teachers are aware that there are important statistical ideas their students need to understand to do this well. For example, they know that a larger sample size is usually better, but they don’t know how to show their students the effects of sample size on the inferences they can make. In addition, teachers often have the challenge of irregular access to technology and uneven expertise and support. In this context, I recently worked with a group of 15 middle school teachers on an activity that uses multiple random samples to better understand the effect of sample size, with only minimal need for technology.


The activity starts with class participation in Census at School. Census at School is an international project designed to engage school students and their teachers in learning activities that are both interesting to the students and promote the advancement of statistical thinking. Participating students complete a questionnaire that has been designed to provide data of various types for analysis and to provoke interesting questions to investigate. In Canada, the Census at School project is operated by the Statistical Society of Canada at www.censusatschool.ca; in the US, the project is operated by the American Statistical Association at www.amstat.org/censusatschool.

My favorite feature of Census at School, the random sampler, is underused by many teachers. With the random sampler, students can draw random samples of data from the accumulated databases of questionnaire responses from students from participating countries. They can explore how their class compares to a sample of students from their own country or another part of the world, how statistics vary from random sample to random sample, and use these explorations to consider some aspects of the confidence they should have in their inferences. Random samples from American students can be drawn at https://www.amstat.org/censusatschool/RandomSampleForm.cfm. And random samples from other participating countries, including Canada, can be drawn at http://datatool.censusatschool.org.uk/datatool.swf

On the Census at School questionnaire, students are asked if they are right-handed, left-handed or ambidextrous. It is common knowledge that right-handedness is by far the most common, but are there differences by sex? In this activity, we investigated whether boys and girls are equally likely to be right-handed.

The activity progresses from looking at the relevant data from one class (one small sample), comparing that to national data (one large sample), and then exploring what values are possible from many random samples, representing various classes.

After calculating the frequency and relative frequency of right-handedness for boys and girls in their class, students can consider whether or not they have evidence that either sex is more likely to be right-handed. Students’ intuition typically makes them uncomfortable with making general statements from their class data, and discussion can elicit whether their class data is sufficiently representative and whether it includes enough observations. The Canadian Census at School project publishes summary statistics each year for all participating Canadian students and the students can compare their class result to the relevant Canadian summary statistics. They can consider why there might be differences, and whether their class results or the Canadian results give a more compelling argument for a general statement about whether boys or girls are more likely to be right-handed.

Examining repeated samples from the Census at School database forms the final part of the activity. This allows the students to consider how the percentages of right-handed boys and girls might vary from class to class. Students can start by considering randomly chosen classes of 20 students, 10 girls and 10 boys, by using the Census at School random sampler to have each student collect his or her own sample of 10 girls and 10 boys from the Canadian data the same age as their class, or have a randomly chosen “class” provided to each student by the teacher. They can discuss the type of graphical display that would best illustrate the results, including the comparison between boys and girls. Figure 1 shows a low-tech possibility; to create it, each student quickly contributes two sticky notes representing the boy and girl estimates for their sample. To see the effects of a larger sample size, students can next imagine a school with single-sex classes of size 25. They can collect random samples of a girls’ class and a boys’ class their age, with 25 students in each class, and re-create their graphical display with these new data. Figure 2 shows an example resulting plot. Even with this simple activity and a small number of samples, the effect of the number of observations on variability and the resulting ability to see differences between groups becomes clear.
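For teachers who do have a bit of technology available, the sticky-note comparison can be mimicked with a short simulation. This sketch assumes a right-handedness rate of about 88% purely as a stand-in value (not a figure from the Census at School database) and draws 15 "classes" at each sample size:

```python
# Compare the spread of sample proportions at two sample sizes.
import random
from statistics import pstdev

random.seed(3)

def sample_proportions(n, classes=15, rate=0.88):
    """Proportion right-handed in each of `classes` samples of size n."""
    return [sum(random.random() < rate for _ in range(n)) / n
            for _ in range(classes)]

small = sample_proportions(10)   # like Figure 1's samples of size 10
large = sample_proportions(25)   # like Figure 2's samples of size 25

print("spread, n=10:", round(pstdev(small), 3))
print("spread, n=25:", round(pstdev(large), 3))
```

The larger samples produce visibly less spread from class to class, which is the same lesson the sticky-note plots convey.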

The teachers at the workshop felt that they could easily adapt this activity for use with their middle school students, and were happy to have an idea for how to take their students’ statistical reasoning to a higher level.

**Figure 1**: Comparing proportions of right-handed boys and girls in 15 samples of size 10 of grade 6 students of each sex.

**Figure 2**: Comparing proportions of right-handed boys and girls in 15 samples of size 25 of grade 6 students of each sex.
