I teach simulation-based statistical inference methods (using R) in my 100-level Introduction to Data Science course. This course is the required first course for all Data Science minors, and a service course to numerous departments. I love teaching statistical inference this way because it reconnects me (and my students) with Fisher’s original ideas and methods, and expresses Tukey’s ideas that we learn about populations by being in dialogue with data. In the context of this welcome return to the empirical framework through which we understand and teach statistical inference, I wonder why we still teach students null hypothesis significance testing (NHST) in the same old way. I expect we’re all aware of the vast literature accumulated over the past 40 years that is critical of NHST and its role in the reproducibility crises in many disciplines. I feel that an introductory statistics or data science course that embraces simulation-based inference (SBI) should also move away from teaching students conventional NHST methods for learning about populations.

Although I am still working out this larger project in my teaching, I offer these thoughts in the context of teaching inference in the 2 independent-sample design with the mean difference statistic (M_{1} – M_{2}) in an Introduction to Data Science class:

First, I help my students think about all the possible values of the mean difference, one of which is the parameter (μ_{1} – μ_{2}), and how ridiculously implausible it would be for that parameter to be exactly 0. If the true mean difference isn’t 0, then the remaining possible values vary only in the sign and magnitude of the mean difference. This points our interest toward estimation and away from significance, and sets up the task of estimating the parameter rather than establishing that it (probably) isn’t 0. I’ll paraphrase one of my idols, Jacob Cohen (from his famous *The Earth Is Round (p < .05)* paper): null hypotheses are rarely true, so rejecting them is hardly surprising.

Second, I help the students think about what kind of probability distribution we need to estimate the parameter. We talk about the importance of having (or assuming we have) data from a random sample for this task, what M_{1} – M_{2} might be if we had a *different* random sample from this population, and how those random differences in M_{1} – M_{2} are important to estimation. Using R, we generate the probability distribution under H_{A}, not under the chance or null model used in NHST. Through resampling, students create a picture of the population of interest and its parameters. Once created, the distribution allows students to do inference through confidence interval estimates of the parameter using various methods (e.g., normal-theory, percentile).

Third, students should see that the distribution described above is a probability distribution for testing all sorts of hypotheses including, if we’re interested, the null (or Cohen would say, nil-null) hypothesis in which μ_{1} – μ_{2} = 0. Students can find the probability of any hypothesized μ_{1} – μ_{2} occurring in *this* population. We can then point our students toward hypothesis testing that sets evidence thresholds with practically or clinically significant standards, rather than a standard of differing from 0. For example, if the data in our 2-sample study evaluated the effect of exercise on blood pressure, we could get students to consider questions like: a) whether this effect was large enough to merit changing one’s behavior, b) what other behavioral interventions might be as, or more, effective, and c) how a replication of (or a failure to replicate) this sized outcome in another sample would change our parameter estimate and/or our confidence in the estimate.
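The resampling described in the second and third points can be sketched in a few lines. The course itself uses R; what follows is only a minimal Python illustration using the standard library. The blood pressure values, the 5-unit practical threshold, and all function names are hypothetical and chosen purely for illustration.

```python
import random
import statistics

def bootstrap_mean_diff(group1, group2, reps=5000, seed=1):
    """Resample each group with replacement; return the resampled M1 - M2 values."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(reps):
        resample1 = rng.choices(group1, k=len(group1))
        resample2 = rng.choices(group2, k=len(group2))
        diffs.append(statistics.mean(resample1) - statistics.mean(resample2))
    return diffs

def percentile_interval(values, level=0.95):
    """Cut an equal share of resampled values from each tail."""
    ordered = sorted(values)
    tail = (1 - level) / 2
    lower = ordered[int(tail * len(ordered))]
    upper = ordered[int((1 - tail) * len(ordered)) - 1]
    return lower, upper

def normal_theory_interval(observed, values, level=0.95):
    """Observed statistic plus or minus z times the bootstrap standard error."""
    z = statistics.NormalDist().inv_cdf(1 - (1 - level) / 2)
    se = statistics.stdev(values)
    return observed - z * se, observed + z * se

# Hypothetical systolic blood pressure readings (not real data).
control = [128, 135, 141, 122, 138, 130, 144, 127, 133, 139]
exercise = [121, 126, 134, 118, 129, 125, 131, 120, 124, 132]

observed = statistics.mean(control) - statistics.mean(exercise)
diffs = bootstrap_mean_diff(control, exercise)
pct = percentile_interval(diffs)
norm = normal_theory_interval(observed, diffs)

# Share of resampled differences at least as large as a hypothetical
# practically meaningful threshold, instead of comparing against 0.
THRESHOLD = 5.0
share_at_least = sum(d >= THRESHOLD for d in diffs) / len(diffs)

print("observed difference:", round(observed, 2))
print("percentile interval:", pct)
print("normal-theory interval:", norm)
print("share of resamples >= threshold:", share_at_least)
```

The same list of resampled differences serves both the interval estimates and the threshold question, which is the point of building the distribution once and then interrogating it.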

I’m not dumping on hypothesis testing; inference by hypothesis testing is a valuable way to learn about a population of interest and how it differs from other populations. I’m just encouraging us to think about whether the formal, reflexive method of classical NHST fits within an SBI pedagogical framework. Cohen and many others have urged us to replace NHST with inferential tools such as parameter estimation, effect size estimation, replication, and meta-analysis, tools that help us learn much more about our population of interest.

We met Buzz and Doris when we wanted to learn statistics. They are dolphins who were trying to earn rewards by communicating with each other, while we were learning how to test statistically whether they were in fact communicating. In 16 trials, Doris signaled to Buzz which button to press, and Buzz pushed the correct button in 15 of the 16 trials. Still not convinced that they were communicating, we assumed that it was just a lucky day for them and simulated pure guessing by tossing 16 coins to see whether we could get 15 heads out of 16 tosses. The first time we got only 9 heads out of 16, the second time we got 8 heads, and we continued until we had done 100 repetitions. The most we got was 12 heads out of 16 tosses. We then continued to 1000 repetitions, and only 1 simulation out of 1000 gave us 15 heads out of 16 tosses. It now seemed very unlikely that the dolphins had just had a lucky day: they were doing something more than guessing which button to press. Since that day, we have understood the p-value and the null hypothesis much better.
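The coin-tossing simulation above is easy to reproduce on a computer. The following is a minimal Python sketch, assuming a fair coin stands in for pure guessing; the function names and the seed are illustrative, and the exact binomial tail probability is added only as a cross-check on the simulation.

```python
import random
from math import comb

def count_heads(rng, tosses=16):
    """Toss a fair coin `tosses` times and count the heads."""
    return sum(rng.random() < 0.5 for _ in range(tosses))

def simulate(reps, tosses=16, seed=2024):
    """Repeat the 16-toss experiment `reps` times under the guessing model."""
    rng = random.Random(seed)
    return [count_heads(rng, tosses) for _ in range(reps)]

heads = simulate(1000)

# Simulated p-value: share of repetitions with at least 15 heads.
extreme = sum(h >= 15 for h in heads)
p_sim = extreme / len(heads)

# Exact probability of 15 or 16 heads in 16 fair tosses: 17 / 65536.
p_exact = (comb(16, 15) + comb(16, 16)) / 2 ** 16

print("most heads seen in any repetition:", max(heads))
print("simulated p-value:", p_sim)
print("exact p-value:", p_exact)
```

Because the exact probability is roughly 0.00026, many runs of 1000 repetitions will contain no result as extreme as 15 heads at all, which is itself a useful talking point about simulation error.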

The above was my first experience in teaching statistics with simulation-based inference. These dolphins taught my students (and me) how to do hypothesis testing more effectively than other statistics teachers could. My students will never forget it. We then learned about hypothesis testing in other settings, such as the one-mean test, two-mean test, two-proportion test, paired-means test, and regression, in a similar manner, and they really liked it.

Simulation-based inference methods work best when my students do not have a strong mathematics background. I was teaching classes for the biology education department and for the civil engineering department, and these students seemed to have neither enough mathematical background nor any liking for mathematical statistics. Many of them had not wanted to learn mathematics or statistics in the first place. It turns out that they can still follow what I teach with simulation-based methods. My students feel that the lessons are very intuitive, and they like the fact that it is problem-based learning rather than a traditional approach. Some of the students who were repeating the course were quite amazed that I did not teach them how to compute probabilities from a normal distribution and other distributions. They had been taught traditional statistics the year before and did not pass.

Apart from teaching statistics with simulation-based inference, we are also very active in giving workshops about Buzz and Doris to other statistics lecturers in Indonesia. So far, my colleagues at Universitas Pelita Harapan and I have conducted 5 workshops to share the good news about teaching statistics with simulation-based inference. Most of the participants were quite excited about it and wanted to implement this curriculum at their universities. We agreed that changes have to be made in order to improve statistical literacy in our society.

This post is based on joint work with Oliver Gansser, Matthias Gehrke, Bianca Krol, and Norman Markgraf.

The FOM is a private University of Applied Sciences in Germany for people who study while working. We offer several bachelor’s and master’s programs, mainly related to economics, in 29 study centers across Germany. The size of the courses with statistical content varies from 15 to 150 students, or even more.

We used a relaunch of our BA degree in Summer 2016 to rethink and rebuild our curriculum in the different introductory statistics courses.

Inspired by Diez, Barr, & Çetinkaya-Rundel (2014), Ismay & Kim (2018), and Pruim, Horton, & Kaplan (2015), we introduced simulation-based inference with the help of the R package mosaic. As our program is designed for extra-occupational students, we stress the application of statistical methods in the students’ professional lives. We chose the “mosaic approach” by Pruim, Horton, and Kaplan because of its simple and coherent *literate programming* style of data analysis, in the sense described by Donald Knuth:

Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do.

By the way: our introductory example is a triangle test, performed in a pub. This setting seems to be quite engaging … Our didactical approach is built on interaction and activation; for example, almost every fourth slide is a quiz or an exercise.

As already pointed out, we run different courses at both Bachelor and Master levels in numerous study centers. We therefore applied a modular lecture slide concept (for the different curricula in the different majors) via so-called child chunks in RMarkdown, jointly developed in a GitHub repository. In our pretest we, and our students, were quite excited to find that the concepts of statistics become more accessible and can be applied to real-life problems.

But how can we convince our heterogeneous group of colleagues about SBI (and R)? In February 2018 we organized a central workshop, which was also streamed on the internet, where we introduced the concepts and teaching materials. There we presented all the arguments, such as those of Hesterberg (2014) or Chance, Wong & Tintle (2016), and the results of our own pretests. It was nice to see that the concept was supported both by those who focus on statistical theory and by those who focus on (computational) application, but I think the final icebreaker was the song “The Bootstrap Begins” by Giles Hooker, available on CAUSEweb.org. With all these available resources, we were starting to fly from a well-feathered nest. A big thank-you to all of you! One remaining problem for our students is that, to the best of my knowledge, there is so far no German textbook for SBI.

Of course there is still much room for improvement and evaluation, and we are still learning about our students’ misconceptions (for example, that bootstrapping is used to obtain a normal distribution), but we think we are doing much better than the old statistics lectures centered on formula sheets and pocket calculators.

With the (very) basic statistical computing and the formula interface in R mosaic, we are paving the way to topics and concepts more related to data science, like data wrangling or algorithmic modeling (Breiman (2001)). In times of Big Data and Artificial Intelligence we feel obliged to teach very basic ideas of these rather new strains of thought, while not forgetting (or maybe even focusing more strongly on) the epistemological background and logical foundations of inference and probability.

We gratefully acknowledge that our work was supported by an internal teaching innovation grant by our institution.

**Matt Beckman, Penn State University**

*What this is & what this isn’t*

This post is intended to share some pragmatic thoughts for teaching SBI in a large class, and not necessarily for converting your curriculum to the SBI framework. A number of suggestions on the latter have been published in this blog and elsewhere. Besides, my colleague, Kari Lock Morgan, had already done a remarkable job accomplishing that feat in the course to be described before I arrived. What follows are simply remarks about rubber-meets-the-road strategies from teaching an SBI course with 225 students to either capitalize on large class size or at least help navigate some logistical challenges that surface with increased enrollment.

*A familiar activity scaled for a large class*

While teaching smaller classes, I adopted a popular illustration with M&M’s (sometimes Skittles) to introduce bootstrapping. I didn’t invent or perfect the activity, but here’s a summary: Each student gets a fun size bag of M&M’s, calculates the proportion of blue, and then marks the result on a class dotplot in front of the room. We can now have a conversation about a sampling distribution under the assumption that each fun size bag represents a random sample of M&M’s. We emphasize the point that each mark on the dotplot represents a statistic calculated from an actual sample in the room, and perhaps point out a few (say, extremes) and identify the responsible students to emphasize the point. Students then sample with replacement from their own bag to build a bootstrap distribution. Students do this a few times by hand, and then we introduce software to speed things up. We emphasize that each dot in the bootstrap distribution is now a proportion of blue M&M’s out of 16 draws with replacement from their own fun size bag (assuming each bag contains 16 M&M’s). We highlight similarities and differences between the sampling distribution and the bootstrap distributions. For example, each student’s bootstrap distribution will generally be centered in a different place (at that student’s own sample proportion), but each still winds up with a useful estimate of the standard error. I tend to get a lot of mileage out of this activity as the semester progresses. When I sense that students are losing sight of the fundamentals, we periodically back up and talk about the M&M’s again to get back on solid footing.

The thought of scaling this for 225 students was daunting at first, but since I like having this example in my back pocket later in the semester, it was worth a shot. To start, I basically buy the entire stock of fun-size M&M’s available at a super-store in town. We’ll have plenty of M&M’s in the room to make our point, so I also supplement with an alternative of some kind for those who don’t want to or can’t eat M&M’s.

The student dotplot is really the main hurdle here. With 225 students in fixed seating, a human wave filing down the aisles to mark the chalkboard is a non-starter, so we use Google Sheets or Forms. Students use a smart phone to access a shortened link (e.g. tiny.cc) or scan a QR code displayed on the front screen to access the spreadsheet/form and enter their result from their seats. After a quick filter, I cut and paste the data into software, and make our class dotplot. From start to finish, this method in a large class might even be faster than the manual approach I had used in smaller classes.
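For readers who want to preview the activity before buying candy, here is a minimal Python sketch of both distributions. It assumes 225 bags of 16 M&M’s each and a made-up population proportion of blue (0.24 is purely hypothetical, as are the function names); it is not the software used in class.

```python
import random
import statistics

BAG_SIZE = 16       # assumed number of M&M's per fun size bag
TRUE_BLUE = 0.24    # hypothetical population proportion of blue
CLASS_SIZE = 225

rng = random.Random(7)

def fill_bag(rng):
    """One random sample of candies: 1 = blue, 0 = any other color."""
    return [1 if rng.random() < TRUE_BLUE else 0 for _ in range(BAG_SIZE)]

def bootstrap_props(bag, reps, rng):
    """Proportion of blue in `reps` resamples drawn with replacement from one bag."""
    n = len(bag)
    return [sum(rng.choices(bag, k=n)) / n for _ in range(reps)]

# Class dotplot: one sample proportion per student.
bags = [fill_bag(rng) for _ in range(CLASS_SIZE)]
class_props = [sum(bag) / BAG_SIZE for bag in bags]

# One student's bootstrap distribution, built from that student's bag only.
my_bag = bags[0]
my_prop = sum(my_bag) / BAG_SIZE
boot_props = bootstrap_props(my_bag, 2000, rng)

print("center of class dotplot:", round(statistics.mean(class_props), 3))
print("spread of class dotplot:", round(statistics.stdev(class_props), 3))
print("this student's proportion:", my_prop)
print("center of bootstrap distribution:", round(statistics.mean(boot_props), 3))
print("bootstrap standard error:", round(statistics.stdev(boot_props), 3))
```

Changing the index in `bags[0]` shows how different students get bootstrap distributions centered at different values. A student whose bag has no blue M&M’s gets a bootstrap distribution with no spread at all, which makes a good discussion point.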

*A few themes...*

Of course this is just one specific activity, but there are some themes here that apply to a lot of activities.

*Try it anyway.* The “worst case scenarios” that I concoct when I imagine attempting an activity developed for small classes never really seem to happen. I probably fear that the mild noise, mess, and chaos of smaller classes will escalate into a completely wasted class period with so many students. Thankfully, that hasn’t happened (. . . yet?) largely because a little technology can neutralize the few inflection points where things would most likely derail (e.g. 225 students physically marking the chalkboard).

*Redeem the smart phone!* Don’t underestimate the tiny computers that most of your students already bring to class. QR codes–those pixelated square codes you scan–and shortened URLs (e.g. tiny.cc; bit.ly) can direct students to an applet, Google Sheet (Form, Doc, etc), or anything else on the web right from their seats. Students can even share with a neighbor or look over the shoulder in front of them–particularly in stadium seat lecture halls. As another example, my class uses the Lock5 text and the accompanying StatKey software which works great on a smart phone. I can introduce authentic data analysis tasks during “lecture” for students to tackle in pairs; one runs StatKey, the other takes notes.

*The large enrollment actually has some perks.* Since the class is large, class data sets are that much closer to Asymptopia. I’m sure I’m not the only one who’s attempted to use a class plot to lay the groundwork for the CLT and wound up with a skewed or bimodal-looking mess that looks nothing like the bell-shape I had in mind. . . Furthermore, rare events and oddities that lead to interesting discussions almost always show up. Someone will enter the count rather than proportion, and someone else usually observes something rare just by chance. Since I’m the gatekeeper for the class plot, I decide whether there’s time to discuss data entry errors or if we need to filter them out and plow ahead in the interest of time. Also, a student with no blue M&Ms, for example, isn’t hypothetical or a mistake. Rare events in the tails of the distribution really do happen!

*Keep an eye toward the big picture.* You’ll rarely need 100% cooperation to make your point. For example, in the M&M activity some students eat the candy before we start, others won’t get to the spreadsheet in time, and some didn’t get M&M’s to begin with! Even if only 50 students fully participate (< 25% in my case), the illustration still serves its purpose and benefits the whole class.

As statisticians, we tend to think that if we just have enough data in front of us, then we can get at the heart of what is going on in any scenario, and many statistics educators want to know what is going on with student learning outcomes from different curricula. So the solution is simple, right? Just collect a bunch of data on our students’ learning outcomes under different curricula and identify the strongest pedagogy. We can even get fancy and toss in some experimental design to structure the application of treatments to our experimental units to support causal conclusions about impacts on learning outcomes. Alright, I am being facetious here. It is never that straightforward. I will admit that this was my first instinct when I set out to do educational research as a graduate student. There are a number of issues that constrain plans for what would be a tidy and straightforward educational experiment: defining the curricular treatments, assigning students to curricula, applying the curricular treatments, and measuring learning outcomes.

First there is the problem of defining our treatments: each is an entire curriculum! This will include lessons, lectures, labs, discussions, homework, exams, etc., which creates a mountain of new prep work for the instructor(s) involved and leads to questions about what exactly is being compared. More importantly, it dilutes the findings because we will need to attribute observed learning differences to the entire array of curricular differences, precluding conclusions on the efficacy of particular components. Then there is the challenge of assigning students to curricula. There are logistical and institutional limitations that make true random assignment of students to classes infeasible, so we face potentially biased results. After preparing the curricular treatments and getting our students into classrooms, we still need to teach the class. Almost certainly we will apply a curriculum to an entire classroom but measure learning on the student level: the dreaded mismatch of experimental and observational units. Now we need multiple classrooms with each curriculum to properly account for inter/intra group variability. We could always aggregate on the classroom level and discuss overall learning outcomes, but then our necessary number of classrooms grows even further. Additionally, the instructor effect will be confounded with the classroom. Lastly, we have to find a way to measure student learning outcomes in a way that other educators will trust and/or respect, so that the results are not discarded.

This was the laundry list of challenges that Dennis Lock and I faced when we started our comparison of learning outcomes under simulation-based inference and traditional introductory statistics curricula.[1] We were graduate assistants, each assigned as the instructor for an intro-stat section of around 60 students, and we could not change the times for which students had enrolled. Given the challenges listed above (coupled with our limited influence as graduate students), we needed to find creative workarounds. Before registration we arranged for our two sections to have the same meeting time, allowing us to randomly reassign their classroom *locations*. For the first half of the semester, prior to the inference unit, both sections met in one large classroom that Dennis and I co-taught with one set of lectures, labs, homework, and exams. During the inference unit the students split into two classrooms that were kept as similar as possible for things like data sources, progression of concepts, and homework/test schedules, in order to focus the results on how simulation-based pedagogy impacted student learning. We also rotated weekly between the two classrooms in an attempt to keep the instructor effect from being confounded with the treatment effect. We elected to evaluate student learning on the final exam using both a widely recognized metric (the ARTIST scaled question sets) and our own question sets.

There is one hurdle that our small-scale experiment could not overcome – replication over classrooms. In our study we had only one classroom per treatment to work with, so we had no way to account for within classroom dependence structure in our analysis. Our simulation study showed that if student scores from the same classroom were more related than students across different classrooms then our Type 1 error rate inflates – presenting a real danger of misinterpreting classroom clusters floating around the same average learning score as a curricular effect.
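The inflation is easy to see in a toy simulation. The sketch below is not the simulation study from our paper; it is a minimal Python illustration under assumed values (two classrooms of 60 students, student-level standard deviation of 10, a shared classroom shift with standard deviation 3, and a large-sample z-test standing in for a t-test). There is no curricular effect anywhere in the simulated data, so every rejection is a Type 1 error.

```python
import random
import statistics

STANDARD_NORMAL = statistics.NormalDist()

def one_experiment(rng, class_sd, n=60, student_sd=10.0):
    """Simulate two classrooms with NO true curricular effect.

    Returns True if a naive two-sample z-test, which treats students as
    independent, rejects at the 0.05 level.
    """
    classrooms = []
    for _ in range(2):
        shift = rng.gauss(0, class_sd)  # effect shared by everyone in the room
        classrooms.append([70 + shift + rng.gauss(0, student_sd) for _ in range(n)])
    a, b = classrooms
    se = (statistics.variance(a) / n + statistics.variance(b) / n) ** 0.5
    z = (statistics.mean(a) - statistics.mean(b)) / se
    p_value = 2 * (1 - STANDARD_NORMAL.cdf(abs(z)))
    return p_value < 0.05

def type1_rate(class_sd, reps=2000, seed=11):
    """Share of null experiments in which the naive test rejects."""
    rng = random.Random(seed)
    return sum(one_experiment(rng, class_sd) for _ in range(reps)) / reps

independent = type1_rate(class_sd=0.0)  # students truly independent
clustered = type1_rate(class_sd=3.0)    # students share a classroom effect

print("rejection rate, independent students:", independent)
print("rejection rate, clustered students:", clustered)
```

With one classroom per treatment, the classroom shift cannot be separated from a curricular effect, so the naive test mistakes ordinary room-to-room variation for a treatment difference.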

In order to reinforce the analyses from small-scale educational experiments like ours, we need to find a way to either eliminate or account for the classroom-based dependence structures. We can strive to structure the course so that the assumption of independence across students is actually true, requiring one-on-one style teaching. Online courses do not remove the burden of developing multiple curricula, but could make use of random assignment to curricula that could be effectively executed through course management software. The alternative is to account for the classroom effects in our models. While it is impossible to estimate classroom parameters in the covariance structure without classroom replication, we could feasibly plug in reasonable approximations. This would require a much stronger understanding of how inter and intra classroom covariance structures behave. One idea is to conduct a series of uniformity studies where many classrooms are given the same curriculum and we examine covariance structures for student scores with our intended learning assessment tool. It may seem silly to argue that “in order to do *small*-scale research, we need to first run a *large*-scale uniformity study”; but there are many larger colleges and universities already running many parallel sections with identical curriculum that would simply need to run the assessments and disseminate the data. Sharing these results would allow many small-scale researchers to plug in reasonable proxies to protect their analyses from Type 1 errors. While it is difficult to run curricular assessment without access to lots of resources and many course sections, it is important that we push to conduct small-scale curricular assessment that provides robust results.

**David Diez, OpenIntro**

The percentile bootstrap approach has made inroads into introductory statistics courses, sometimes with the incorrect declaration that it can be used without checking any conditions. Unfortunately, the percentile bootstrap performs worse than methods based on the t-distribution for small samples of numerical data. I would wager that the large majority of statisticians believe the opposite to be true, and I think this misplaced faith has created a small epidemic.

A few years ago I created this spreadsheet to compare the percentile bootstrap to classical methods. For small samples, the t-confidence interval outperforms the percentile bootstrap through a sample size of 30 for numerical data. The difference is particularly stark when the population is skewed and the sample size is very small. Tim Hesterberg published a much more comprehensive investigation of multiple classical and bootstrap methods in 2014. He found similar results for small samples, where the t-confidence interval outperformed the percentile bootstrap until the sample size was 35 or larger.
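A comparison of this kind can be sketched with a small coverage simulation. The Python code below is not the spreadsheet or Hesterberg’s study; it is a minimal illustration under assumed settings: samples of size 10 from an exponential population with mean 1, nominal 95% intervals, 400 resamples per bootstrap interval, and 1000 simulated samples. The constant 2.262 is the 97.5th percentile of the t-distribution with 9 degrees of freedom.

```python
import random
import statistics

N = 10               # assumed sample size
T_CRIT_DF9 = 2.262   # 97.5th percentile of t with N - 1 = 9 degrees of freedom
TRUE_MEAN = 1.0      # mean of the Exponential(rate = 1) population

def t_interval(sample):
    """Classical 95% t-confidence interval for the mean."""
    center = statistics.mean(sample)
    half_width = T_CRIT_DF9 * statistics.stdev(sample) / len(sample) ** 0.5
    return center - half_width, center + half_width

def percentile_bootstrap_interval(sample, rng, reps=400):
    """95% percentile bootstrap interval for the mean."""
    n = len(sample)
    means = sorted(sum(rng.choices(sample, k=n)) / n for _ in range(reps))
    return means[int(0.025 * reps)], means[int(0.975 * reps) - 1]

def coverage(trials=1000, seed=3):
    """Share of simulated samples whose interval captures the true mean."""
    rng = random.Random(seed)
    t_hits = 0
    boot_hits = 0
    for _ in range(trials):
        sample = [rng.expovariate(1.0) for _ in range(N)]
        low, high = t_interval(sample)
        t_hits += low <= TRUE_MEAN <= high
        low, high = percentile_bootstrap_interval(sample, rng)
        boot_hits += low <= TRUE_MEAN <= high
    return t_hits / trials, boot_hits / trials

t_cov, boot_cov = coverage()
print("t-interval coverage:", t_cov)
print("percentile bootstrap coverage:", boot_cov)
```

Both intervals are applied to the same simulated samples, so the two coverage rates can be compared directly. Changing `N` requires changing the t critical value to match.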

Teaching the percentile bootstrap without thoughtfully explaining the conditions, particularly as a replacement for classical methods, seems like one step forward and two steps back. The percentile bootstrap is nothing new, but its weaknesses remain largely unknown in the community. I find myself wrestling with several considerations whenever I think about this topic.

- **The percentile bootstrap is a stepping stone.** I don’t think the percentile bootstrap method should be taught as “the” bootstrap method. It’s too unreliable. The percentile bootstrap should be taught as a first step towards better methods and / or as a first tool for students to start exploring a wider range of analyses, e.g. of the median, standard deviation, and IQR.
- **There are better bootstrap methods.** Tim’s excellent paper found that the *bootstrap t-interval* is much more robust than the percentile bootstrap, and the bootstrap t-interval is even much more robust than the classical methods for small samples and skewed data. (Research opportunity #1)
- **The bootstrap opens the door to more statistics.** The reason why I remain bullish on the long term value of advanced bootstrap methods is that they ease the analysis of a wider range of statistics, such as the standard deviation and IQR.
- **We need to establish appropriate conditions for the bootstrap.** Every statistical tool fails in many ways, and we need to better understand when methods fail before they are taught to the next generation of statisticians. As a starting point, I suggest a rule of thumb for the percentile bootstrap below. To be clear, more thoughtful work is required here and appropriate conditions are far from settled. (Research opportunity #2)
- **Shifts in pedagogy are costly, so let’s do our homework first.** A shift in how intro statistics is taught on a large scale is very expensive. It requires teaching tens of thousands of teachers the new pedagogy, getting those teachers to buy into the change, and then pushing schools (or students) to buy new textbooks. I think we owe it to schools, teachers, and most of all students to have solid evidence (data!) from a diverse set of studies showing the practical benefits of the bootstrap method before we ask them to incur the costs of this transition. (Research opportunity #3)

I want to wrap up with my rule of thumb for the percentile bootstrap. *If I’d be comfortable applying the Z-test or Z-confidence interval to a data set, then I think it’s safe to use the percentile bootstrap for the mean or median.* In most introductory courses, that usually means (1) the data are from a simple random sample or from random assignment in an experiment, (2) there are at least 30 observations in the sample, and (3) the distribution is not too strongly skewed.

**Kari Lock Morgan, Assistant Professor of Statistics, Penn State University**

Computers (or miniature versions such as smart phones) are necessary to do simulation-based inference. How then can we assess knowledge and understanding of these methods *without* computers? Never fear, this can be done! I personally *choose* to give exams without technology, despite teaching in a computer classroom once a week, largely to avoid the headache of proctoring a large class with internet access. Here are some general tips I’ve found helpful for assessing SBI without technology:

**Much of the understanding to be assessed is NOT specific to SBI.** In any given example, calculating a p-value or interval is but one small part of a larger context that often includes scope of inference, defining parameter(s), stating hypotheses, interpreting plot(s) of the data, calculating the statistic, interpreting the p-value or interval in context, and making relevant conclusions. The assessment of this content can be largely independent of whether SBI is used.

**In lieu of technology, give pictures of randomization and bootstrap distributions.** Eyeballing an interval or p-value from a picture of a bootstrap or randomization distribution can be difficult for students, difficult to grade, and an irrelevant skill to assess. Here are several alternative approaches to get from a picture and observed statistic to a p-value or interval without technology:

*Use a countable number of dots in the tail(s).* For p-values, this would mean choosing an example with only a small (easily countable) number of dots beyond the observed statistic, which students could count and divide by the total. For intervals, you might generate 1000 bootstrap samples and ask for a 99% interval, requiring students to count only 5 dots in each tail.

*Have students circle the relevant part of the distribution.*

*Choose examples with obviously small or not small p-values.*

*Choose the interval/p-value from a list of options.* While precise answers can be difficult to eyeball, a student reasoning correctly should be able to choose the correct answer from a list of possible options.
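The arithmetic behind the dot-counting approach is the same arithmetic software performs, which can help when preparing an answer key. The following minimal Python sketch uses made-up dot values as stand-ins for a plotted distribution; the function names are illustrative only.

```python
def p_value_from_dots(dots, observed):
    """Upper-tail p-value: dots at or beyond the observed statistic, over the total."""
    beyond = sum(d >= observed for d in dots)
    return beyond / len(dots)

def dots_per_tail(total, level):
    """How many dots a student must count off in each tail for a given level."""
    return round(total * (1 - level) / 2)

def interval_by_counting(dots, per_tail):
    """Drop `per_tail` dots from each end; the next dots in are the endpoints."""
    ordered = sorted(dots)
    return ordered[per_tail], ordered[-per_tail - 1]

# Made-up stand-ins for the dots on a printed plot.
randomization_dots = list(range(200))   # 200 simulated statistics
bootstrap_dots = list(range(1000))      # 1000 bootstrap statistics

p_value = p_value_from_dots(randomization_dots, observed=197)
tail_count = dots_per_tail(len(bootstrap_dots), level=0.99)
interval = interval_by_counting(bootstrap_dots, tail_count)

print("p-value:", p_value)              # 3 of 200 dots are at or beyond 197
print("dots to count per tail:", tail_count)
print("99% interval:", interval)
```

With 1000 bootstrap dots and a 99% level, `dots_per_tail` returns 5, matching the count students are asked to make by hand.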

**Emphasize concepts or material that remain necessary with technology.** SBI has an advantage here over traditional inference, where it can be all too tempting to create assessments centered on plugging summary statistics into the correct formula. For example, specifying a relevant parameter/statistic and type of plot, based on the variable type(s), and interpreting results in context, all remain necessary skills with technology available, while plugging numbers into formulas and using paper distribution tables do not. If an assessment item would be irrelevant with technology, it may not be important enough to assess without technology. I’m not advocating avoiding all details that technology automates, but rather suggesting that we not assess these details just for the sake of having something convenient to assess. For example, I think it

**Ask students to describe what a single dot represents.** Disclaimer: assessing the mechanics underlying the simulation has actually become less and less important to me over time; I personally care more about students understanding that the p-value measures the extremity of the observed statistic if the null hypothesis is true, than the specific method used in a particular simulation. Nonetheless, asking students what a single dot on a randomization or bootstrap distribution represents, or asking how they would generate an additional dot, assesses understanding of the underlying simulation process. Asking this question as free-response can be very enlightening, but [WARNING!] can also be hard to grade. If you have large classes, you might consider making this multiple choice.

Here is a sketch of a generic randomization test exam question incorporating the above tips (exact questions will vary with context, but this provides a rough template):

- (a) Define, with notation, the parameter of interest.
- (b) State the null and alternative hypotheses.
- (c) What type of graph would you use to visualize these data? (or interpret a plot)
- (d) Give the relevant sample statistic.
- (e) Describe what one of the dots in the randomization distribution represents.
- (f) Estimate the p-value from the randomization distribution shown.
- (g) Make a generic conclusion about the null hypothesis based on (f).
- (h) Make a conclusion in context based on (g).
- (i) Can we make conclusions about causality? Why or why not?

Note that almost all parts remain relevant in the presence of technology, with the exception of potentially (d) and (f). Part (d) may involve calculating a difference in proportions from a two-way table or identifying relevant statistics from software output. Part (f) may include counting dots, circling part of the distribution, or choosing from possible options, as discussed above.

This template could also apply when technology *is* available, with only minor tweaks (I’ve used a similar template for lab exams in the past). In this case, only (c), (d), and (f) would differ, as students could actually generate the plot, statistic, and randomization distribution. The assessment without technology loses the ability to assess whether students can get the software to work, but I believe it does not fundamentally compromise the ability to assess knowledge and understanding.

If your students *do* have access to technology for assessments, see the parallel blog post by Robin Lock.

**Robin Lock, Burry Professor of Statistics, St. Lawrence University**

I have the luxury of teaching in a computer classroom with 28 workstations that are embedded in desks with glass tops to show the monitor below the work surface. This setup has several advantages (in addition to enforcing a maximum class size of 28): computing is readily available at any point in class, yet I can easily see all of the students, they can see me (no peeking around monitors), and they still have a nice big flat surface to spread out notes, handouts, and occasionally a textbook (although many students now use an e-version of the text). I also have software on the instructor’s station (*Smart Sync*) that shows a thumbnail view of what’s on all student screens. Since the class is set up to use technology whenever needed and appropriate, it is natural to extend this to quizzes and exams, so my students routinely expect to use software as part of those activities.

This is useful when assessing simulation-based inference (SBI) methods since I can ask students to actually carry out the procedures, but it raises some challenges for constructing assessments that are doable in a relatively short amount of time (as opposed to projects that can go into greater depth), address the concepts I want to assess, and, don’t forget this one, are still relatively efficient to grade! The last point can be a bit of a challenge, since SBI methods will generally not yield a single correct answer, yet with a little practice one can get pretty good at quickly distinguishing reasonable answers from responses that show errors in procedures or reasoning.

**Using a Scaleless Distribution**

Ideally I’d like to see what each student produces on the screen and how they interpret the output to make statistical conclusions, but it’s not practical to look over everyone’s shoulder as they work. I still do traditional paper/pencil quizzes and exams, so it’s also not feasible to have students electronically cut/paste graphics and other output. Having them draw a rough sketch is one option, but another workaround I’ve used is to include a “generic” distribution on the quiz with no scale shown and ask students to fill in the scale based on their simulated distribution. Here are a couple of examples from this semester’s quizzes that illustrate this approach. We use the StatKey software package (*http://lock5stat.com/statkey*) for generating bootstrap and randomization distributions.

**Sample Question #1: Multiple Choices** Some people say that, if you are randomly guessing on a multiple choice test, the correct answer is more likely to be a middle choice than at either extreme. For example, if the five choices are A, B, C, D and E, you should avoid picking A or E. Let’s try testing this theory.

(a) If the five multiple choice options are equally likely to be correct, what proportion of questions should have E as the correct choice?

(b) Suppose that a sample of n=400 multiple choice questions from AP exams had E as the correct choice for 68 questions. What is the proportion of questions with E correct in this sample? (Use good statistical notation to label your answer).

(c) Write down the hypotheses if the question of interest is whether there is evidence that the proportion of questions with E correct (call it p_{E} ) is less than would be expected when answers are randomly assigned.

(d) Use StatKey to produce a randomization distribution for this test, based on the sample of 68 E answers out of 400 questions. On the plot below label the center of your randomization distribution and enough values on the horizontal axis to show the scale.

(e) What does a single dot in the plot above represent?

(f) Use StatKey to find the p-value for this test and show on the graph above how this looks.

(g) Assuming a 5% significance level, write a sentence that interprets what this test tells you in the context of this problem.

A quick glance at the scale shows whether it is centered in the proper place (p_{0}=0.2), or whether students have forgotten to change the default p_{0}=0.5 in StatKey or produced a bootstrap distribution that is centered at the sample proportion (p̂ = 0.17) instead. We can also easily check whether students are using the distribution properly to find the p-value as the proportion of randomization samples with p̂ ≤ 0.17.
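For readers who want to see roughly what StatKey is doing in Sample Question #1, here is a minimal Python sketch of a left-tail randomization test for one proportion. It uses only the numbers given in the question (68 of 400, null value 0.2); the function name, the number of simulated samples, and the seed are illustrative assumptions and are not part of StatKey.

```python
import random


def randomization_test_one_proportion(successes, n, p0, reps=2000, seed=1):
    """Left-tail randomization test for a single proportion.

    Returns (observed proportion, center of the null distribution, p-value).
    The defaults for reps and seed are arbitrary choices for this sketch.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    # Each entry is one "dot": the number of successes in a sample of size n
    # simulated under the null hypothesis that the true proportion is p0.
    null_counts = [
        sum(rng.random() < p0 for _ in range(n)) for _ in range(reps)
    ]
    center = sum(null_counts) / (reps * n)
    # The alternative is "less than", so count dots at or below the observed count.
    p_value = sum(count <= successes for count in null_counts) / reps
    return successes / n, center, p_value


if __name__ == "__main__":
    p_hat, center, p_value = randomization_test_one_proportion(68, 400, 0.2)
    print(f"sample proportion: {p_hat:.3f}")
    print(f"center of randomization distribution: {center:.3f}")
    print(f"left-tail p-value: {p_value:.3f}")
```

The center should land near 0.2 rather than 0.17 or 0.5, which is the same check a grader makes by glancing at the labeled scale. Exact p-values will differ slightly from run to run and from student to student.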

**Sample Question #2: Restaurant Tips** We collected a random sample of n=157 restaurant bills from the First Crush bistro in Potsdam, NY and recorded the size of the tip on each bill. The data should be available in StatKey (if you are in the proper procedure – ask for a one-point penalty if you can’t find it).

(a) Use StatKey to construct a bootstrap distribution of *mean tip size* based on this sample. Your plot should look similar to the one below (Note: You could use the same plot as in the question above or generate a new one specifically for this dataset.). Label values on the horizontal axis to indicate the scale you see in your plot, including the center point.

(b) Find an 80% confidence interval for the mean tip size at First Crush, based on your bootstrap distribution. Indicate on your plot above how you find the interval and don’t forget the interpretive sentence.
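The percentile method behind part (b) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tip amounts below are randomly generated placeholders, not the RestaurantTips data, and the function name is hypothetical rather than anything StatKey exposes.

```python
import random


def bootstrap_percentile_interval(data, level=0.80, reps=1000, seed=1):
    """Percentile bootstrap interval for a mean.

    Returns (lower endpoint, upper endpoint, dots cut from each tail).
    """
    rng = random.Random(seed)
    n = len(data)
    # Each bootstrap mean is one "dot": resample n values with replacement.
    means = sorted(sum(rng.choices(data, k=n)) / n for _ in range(reps))
    tail = round(reps * (1 - level) / 2)  # number of dots to cut from each tail
    return means[tail], means[reps - tail - 1], tail


if __name__ == "__main__":
    # Placeholder values only: 157 made-up tip amounts, NOT the real dataset.
    placeholder_rng = random.Random(0)
    tips = [round(placeholder_rng.uniform(1.0, 8.0), 2) for _ in range(157)]
    low, high, tail = bootstrap_percentile_interval(tips, level=0.80)
    print(f"cut {tail} dots from each tail; interval: ({low:.2f}, {high:.2f})")
```

With 1000 bootstrap samples, an 80% interval cuts 100 dots from each tail, while a 99% interval would cut only 5, which is why the latter is easier to count by hand.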

Using StatKey’s built-in datasets is very convenient for exam purposes, but it is also very easy to enter new sample data, especially for inference involving proportions (as in Question #1) where students only need to enter the appropriate counts. For quantitative data, StatKey now allows users to upload their own file in .csv or .txt format, so it is relatively easy to give students access to a data file in an appropriate format to upload during the quiz. For readers who want the full **RestaurantTips** dataset, the file can be downloaded from *http://lock5stat.com*.

Don’t have access to computers for quizzes and exams? Check the companion to this blog post by Kari Lock Morgan for tips on assessing SBI concepts without technology.

Many of us will agree that using tactile demonstrations is super fun and can also be an excellent way to teach a particular concept. Students engage with the material differently when they can touch, smell, or taste the objects as opposed to only seeing or listening to a demonstration. The SBI blog has had many excellent articles describing in-class tactile simulations, see here and here and here.

However, sometimes the logistical constraints of setting up the demonstration take away too much from an already packed 50-minute class session. And those details get even harder with large classes. One of the biggest challenges comes from collecting data or getting results back from the students. Although some classes have sophisticated clickers that make data collection easier, setting up and using clickers is also a logistical challenge (well worth it for using all semester, but not for a one-day class demonstration).

My view on classroom demonstrations is that doing most of the tactile demonstration can communicate the vast majority of the pedagogical ideas. I will demonstrate what I mean with examples using chocolate chips. I have used chocolate chips in class many times to teach two different concepts: (1) censored data analyzed with survival analysis (example taken from

In both classroom experiences, I provide small Dixie cups with two visibly different types of chocolate chips (typically two of either white chocolate, milk chocolate, semi-sweet chocolate, or peanut-butter chips).

Then we spend some time as a class talking about the experiment and the goal of the experiment. For both of the statistical methods I cover in class, the goal is to compare the length of time taken to melt the two different types of chips (the length of time is described by: the average if doing a t-test; the distribution if doing a signed-rank test; the survival curve if doing survival analysis). I will give some details of the class interactions below, but many of these ideas are also discussed in the texts mentioned above and in the references. For simplicity, I will describe using a paired t-test to determine if milk or white chocolate chips melt faster, on average.

I always start with a basic question: What will we do to determine which chocolate chip melts faster on average?

Inevitably, I get a basic answer in return: Put chocolate chips in mouth, see how long it takes to melt, record data, decide which one melts faster.

And then I stop and ask them how in the world they’ll know how to do what they just suggested? There is a tremendous amount of additional information necessary before running a viable experiment. Among the important decisions to be made (by class consensus) are:

- How is the chocolate chip going to reside in the mouth? Can you chew? Can you use your tongue? Can you move the chip?
- Who gets which color? (Everyone gets both if paired!)
- In what order do the chips get melted? Does everyone melt the same chip first?
- How will the melting be timed?
- How long do we wait between chips?

The conversation that ensues about the experimental design is incredibly valuable for understanding paired design (and the motivation for the pairing) or survival analysis (and the need for tools to analyze censored data). The process of coming to a class consensus about the chip experiment requires the students to justify their decisions to their classmates. For example, the class notes that if you hold the chip at the top of your mouth, it’ll melt faster. Oh yeah, and that brings up the fact that some people will have mouth environments where chips melt faster (which is why we pair)!

I almost never collect the data that my students generate. Partly because melting times for chips aren’t substantially different (so power is pretty low). But also because collecting the data, entering it into the computer, and running the appropriate test does not provide the same pedagogical-idea-per-class-minute value that the earlier discussion on design did. I can make up data, use the textbook’s data, or use student data from a previous year (if I have it). And the students can then *quickly* see the analysis done on screen.

Don’t get me wrong, there are large benefits to collecting class data so that students can understand firsthand how variability plays a role. I believe that collecting student data is particularly valuable when each student gets a different *statistic* (for example, number of successes in a binomial trial) which can be put together as a class-generated sampling distribution. But in the chip example, the vast majority of the learning happens with understanding the experimental design. And because we must continually make choices in our classes, we should not be afraid to cut out the parts of the demonstration (here collecting data) we believe are less important to the learning.

References:

*Investigating Statistical Concepts, Applications, and Methods* by Beth Chance and Allan Rossman, http://www.rossmanchance.com/iscam3/

*Practicing Statistics* by Shonda Kuiper and Jeff Sklar, http://web.grinnell.edu/individuals/kuipers/stat2labs/

A chimpanzee named Sarah was the subject in a study of whether chimpanzees can solve problems. Sarah was shown 30-second videos of a human actor struggling with one of several problems (for example, not able to reach bananas hanging from the ceiling). Then Sarah was shown two photographs, one that depicted a solution to the problem (like stepping onto a box) and one that did not match that scenario. Researchers watched Sarah select one of the photos, and they kept track of whether Sarah chose the correct photo depicting a solution to the problem. Sarah chose the correct photo in 7 of 8 scenarios that she was presented. In order to judge whether Sarah understands how to solve problems we will define π to be the probability Sarah will pick the photo of the correct solution.

- Write out the null and alternative hypothesis for this study (in words and/or symbols).
- If you conduct a test of significance using simulation, what values would you use in the one-proportion applet?
- Assume you conduct a test of significance using simulation and get the following null distribution. (Note: this null distribution uses only **100** simulated samples and not the usual 1000 or 5000.) Based on the null distribution, what is the p-value for the test? Circle the p-value on the plot.
- Based on your p-value, do you have strong evidence that Sarah is not just guessing about which photograph belongs to each scenario? Explain briefly.
- What does a single dot represent in the null distribution shown above?

- A simulation for the results of Sarah completing one trial, if she chooses the correct picture half of the time in the long run.
- A simulation for the results of Sarah completing one trial, if she chooses the correct picture more than half the time in the long run.
- A simulation for the proportion of times Sarah chooses the correct picture out of 8, if she chooses the correct picture half the time in the long run.
- A simulation for the proportion of times Sarah chooses the correct picture out of 8, if she chooses the correct picture more than half the time in the long run.

Of course there are many more important reasons that I like this question. It asks the students to write the null and alternative hypotheses. They can only do this if they understand that the null hypothesis is the “by chance” model – if they understand that the null hypothesis is that Sarah is just guessing about which picture is correct. And they need to put the hypotheses in context in order to decide whether we should use a one-sided or two-sided alternative in this situation. And just in case they can write the correct null hypothesis, but don’t understand how that hypothesis is used in the simulation, I ask them to tell me what value is placed in each of the boxes of the applet.

This is usually tough for the students. They often get confused early in the course about where π and p̂ are supposed to go. What are the differences in these values? What happens when they put the same values in both the first and last box? (I often ask them that on a quiz too. Sometimes I even ask what the p-value would be if π = p̂.) Actually, they usually want to put 0.5 in the first box every time they use the applet. So I wish I could come up with more problems where the “by chance” model didn’t have π = ½, but instead had π = ⅓, or π = ¼, or some other value. There are some examples in the ISI textbook (adjustments on Rock/Paper/Scissors for example) that I do use, but sometimes the backgrounds are too complicated for testing environments.

Later in the course I would expect *them* to define the parameter π in words, but at this point, they are usually still having trouble with that. So I define it for them. I do, however, expect them to recognize that they need to convert from the count of 7 successes to the sample proportion of ⅞ = 0.875. I also expect them to know where to put this in the applet, and most importantly, I expect them to know that this is the critical part of finding the p-value. They need to be able to find that value on the horizontal axis of the null distribution, count the dots at or above that value, and divide that number of dots by 100. I am always tickled when they can do this early in the course. If they can’t do this, I need to do some more work with them.
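The dot-counting step can be checked against an exact calculation. The sketch below is a hedged Python illustration, not the one-proportion applet itself: it computes the exact binomial probability of 7 or more correct out of 8 under π = 0.5, then mimics a 100-dot null distribution and counts the dots at or above the observed result. The function names and the seed are assumptions made for this example.

```python
import random
from math import comb


def exact_upper_p_value(successes, n, pi0):
    """Exact binomial probability of `successes` or more out of n under pi0."""
    return sum(
        comb(n, k) * pi0**k * (1 - pi0) ** (n - k)
        for k in range(successes, n + 1)
    )


def dot_count_p_value(successes, n, pi0, reps=100, seed=1):
    """Simulate `reps` dots under pi0 and count those at or above the observed."""
    rng = random.Random(seed)
    # One dot = the number correct in n trials if Sarah is just guessing.
    dots = [sum(rng.random() < pi0 for _ in range(n)) for _ in range(reps)]
    extreme = sum(d >= successes for d in dots)
    return extreme, extreme / reps


if __name__ == "__main__":
    print(exact_upper_p_value(7, 8, 0.5))  # 9/256 = 0.03515625
    extreme, p_value = dot_count_p_value(7, 8, 0.5)
    print(f"{extreme} of 100 dots at or above 7 correct, p-value = {p_value}")
```

With only 100 dots, the simulated p-value will typically be a handful of dots divided by 100, so it can differ noticeably from the exact 9/256; that variability is part of what students should be able to explain.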

I usually have a question similar to this on each test and the final exam. The context changes, and the parameter(s)/variable(s) changes, but I give them a null distribution and few enough dots that they can count (or a statistic far enough in the tail that they can see the p-value is zero). I don’t let them get away with just claiming that the p-value is some particular number – they have to explain how they know it is that number. By the end of the course, nearly every student who is attending class and really wants to pass can answer that question. Many can even extend to a new context that they haven’t seen before, for example with a new statistic such as the mean/median.

The last, multiple choice part of this question tries to make sure that the students understand exactly what the null distribution represents. There are lots of different ways I might ask this over the course of a semester. By the end of the semester, I’m usually asking them to tell me what one dot on the plot represents. But early on, I like to give them options from which to choose. The distinctions between choices may be subtle. And they can lead to lively discussions about how the null distribution is a plot of outcomes assuming the null hypothesis is true, not of what might happen if the alternative hypothesis is true.

An additional question I sometimes ask is what do they expect the center of this null distribution to be? I’ve even given them several plots to choose from (usually after we’ve learned more about spreads). But they should always know that the center should be roughly what they are assuming π to be in the null distribution. In this case, with the small sample size, they should expect the spread to be fairly large.

Of course all of my student assessments are works in progress. Every time I administer one, I learn about improvements I can make the next time. But I’m sure I’ll continue to provide null distributions and ask students to show me and estimate the p-value on that distribution.
