Assessing Knowledge and Understanding of Simulation-Based Inference Without Technology

Kari Lock Morgan, Assistant Professor of Statistics, Penn State University

Computers (or miniature versions such as smartphones) are necessary to do simulation-based inference (SBI). How then can we assess knowledge and understanding of these methods without computers? Never fear, this can be done! I personally choose to give exams without technology, despite teaching in a computer classroom once a week, largely to avoid the headache of proctoring a large class with internet access. Here are some general tips I’ve found helpful for assessing SBI without technology:

Much of the understanding to be assessed is NOT specific to SBI.  In any given example, calculating a p-value or interval is but one small part of a larger context that often includes scope of inference, defining parameter(s), stating hypotheses, interpreting plot(s) of the data, calculating the statistic, interpreting the p-value or interval in context, and making relevant conclusions.  The assessment of this content can be largely independent of whether SBI is used.

In lieu of technology, give pictures of randomization and bootstrap distributions. Eyeballing an interval or p-value from a picture of a bootstrap or randomization distribution can be difficult for students, difficult to grade, and an irrelevant skill to assess.  Here are several alternative approaches to get from a picture and observed statistic to a p-value or interval without technology:


• Use a countable number of dots in the tail(s).  For p-values, this would mean choosing an example with only a small (easily countable) number of dots beyond the observed statistic, which students could count and divide by the total.  For intervals, you might generate 1000 bootstrap samples and ask for a 99% interval, requiring students to count only 5 dots in each tail.
• Have students circle the relevant part of the distribution. Because not all examples lend themselves to a countable number of dots in the tail(s), an alternative is to have students circle the relevant part of the distribution, labeling any relevant quantities (this helps with grading). (Note: I also use this technique for normal and t-based inference, giving blank distributions to be shaded.) This will not yield a numeric answer and rough estimation can be difficult, so…
• Choose examples with obviously small or not small p-values. Test conclusions depend only on whether the p-value is small or not small, so unless you give the p-value or have students count dots, choose examples in which it is immediately obvious from the picture whether the p-value is small: the observed statistic is either near the middle of the randomization distribution or way out in the tail. This allows students to make conclusions without ever actually calculating an exact p-value.
• Choose the interval/p-value from a list of options. While precise answers can be difficult to eyeball, a student reasoning correctly should be able to choose the correct answer from a list of possible options.
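
For readers who want the dot-counting arithmetic spelled out, here is a minimal Python sketch. It is an illustration only, not how StatKey or any other package computes these values. The function names, the made-up dots, and the choice to count a dot exactly equal to the observed statistic as "beyond" it are all assumptions made for illustration, and only a right-tail p-value is shown.

```python
def tail_p_value(dots, observed):
    """Right-tail p-value: the proportion of dots at or beyond the observed statistic.

    Assumption: a dot exactly equal to the observed statistic counts as "beyond" it.
    """
    dots_in_tail = sum(1 for dot in dots if dot >= observed)
    return dots_in_tail / len(dots)


def interval_by_counting(dots, dots_per_tail):
    """Percentile-style interval: drop dots_per_tail dots from each end, keep the rest."""
    ordered = sorted(dots)
    return ordered[dots_per_tail], ordered[-dots_per_tail - 1]


# Hypothetical randomization distribution: 100 dots, 3 of them at or beyond 5.
randomization_dots = [0] * 97 + [5, 6, 7]
print(tail_p_value(randomization_dots, observed=5))  # 0.03

# Hypothetical bootstrap distribution: 1000 dots, so a 99% interval
# leaves 0.5% of 1000 = 5 dots in each tail.
bootstrap_dots = list(range(1000))
print(interval_by_counting(bootstrap_dots, dots_per_tail=5))  # (5, 994)
```

This is the same count-and-divide work a student would do by hand on a printed dotplot, which is why a small number of dots in the tail(s) keeps the exam question manageable.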

Emphasize concepts or material that remain necessary with technology. SBI has an advantage here over traditional formula-based inference, where it can be all too tempting to create assessments centered on plugging summary statistics into the correct formula. For example, specifying a relevant parameter/statistic and type of plot based on the variable type(s), and interpreting results in context, remain necessary skills with technology available, while plugging numbers into formulas and using paper distribution tables do not. If an assessment item would be irrelevant with technology, it may not be important enough to assess without technology. I’m not advocating avoiding all details that technology automates, but rather suggesting that we not assess these details just for the sake of having something convenient to assess. For example, I think it is important that students know to find a p-value as the proportion in the tail(s) of a randomization distribution beyond the observed statistic, even if their software spits out a p-value automatically (it is no coincidence that this is exactly the knowledge students need to find a p-value in StatKey).

Ask students to describe what a single dot represents. Disclaimer: assessing the mechanics underlying the simulation has actually become less and less important to me over time; I personally care more about students understanding that the p-value measures how extreme the observed statistic would be if the null hypothesis were true than about the specific method used in a particular simulation. Nonetheless, asking students what a single dot on a randomization or bootstrap distribution represents, or asking how they would generate an additional dot, assesses understanding of the underlying simulation process. Asking this question as free-response can be very enlightening, but [WARNING!] can also be hard to grade. If you have large classes, you might consider making this multiple choice.
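
To make "one dot" concrete, here is a rough Python sketch of how a single dot could be generated in two common settings. Both settings are assumptions chosen for illustration: a randomization test for a difference in means between two groups in a randomized experiment, and a bootstrap distribution for a single mean. The function names and data are hypothetical, and other designs or statistics would call for a different simulation.

```python
import random


def randomization_dot(group_a, group_b, rng):
    """One randomization dot: reshuffle the pooled responses into two groups
    of the original sizes, then recompute the difference in means."""
    pooled = list(group_a) + list(group_b)
    rng.shuffle(pooled)
    new_a = pooled[:len(group_a)]
    new_b = pooled[len(group_a):]
    return sum(new_a) / len(new_a) - sum(new_b) / len(new_b)


def bootstrap_dot(sample, rng):
    """One bootstrap dot: resample with replacement, using the original
    sample size, then recompute the mean."""
    resample = rng.choices(sample, k=len(sample))
    return sum(resample) / len(resample)


rng = random.Random(2024)  # seeded only so the illustration is repeatable

# Hypothetical data, made up for this sketch.
treatment = [12, 15, 14, 18]
control = [10, 11, 13, 9]

print(randomization_dot(treatment, control, rng))  # one dot, assuming no group difference
print(bootstrap_dot(treatment, rng))               # one dot, centered near the sample mean
```

A student answer worth full credit would describe these same steps in words: what gets shuffled or resampled, and which statistic is computed from the result.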

Here is a sketch of a generic randomization test exam question incorporating the above tips (exact questions will vary with context, but this provides a rough template):

(a) Define, with notation, the parameter of interest.
(b) State the null and alternative hypotheses.
(c) What type of graph would you use to visualize these data? (or interpret a plot)
(d) Give the relevant sample statistic.
(e) Describe what one of the dots in the randomization distribution represents.
(f) Estimate the p-value from the randomization distribution shown.
(g) Make a generic conclusion about the null hypothesis based on (f).
(h) Make a conclusion in context based on (g).
(i) Can we make conclusions about causality?  Why or why not?

Note that almost all parts remain relevant in the presence of technology, with the possible exception of (d) and (f). Part (d) may involve calculating a difference in proportions from a two-way table or identifying relevant statistics from software output. Part (f) may include counting dots, circling part of the distribution, or choosing from possible options, as discussed above.
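
As a small illustration of the arithmetic that part (d) might require, here is a Python sketch of a difference in proportions computed from a two-way table. The table layout, group and outcome labels, counts, and function name are all hypothetical, made up for this example.

```python
def difference_in_proportions(table, group_1, group_2, outcome):
    """Proportion with the given outcome in group_1 minus that in group_2.

    `table` maps each group name to a dict of outcome counts (an assumed layout).
    """
    def proportion(group):
        counts = table[group]
        return counts[outcome] / sum(counts.values())

    return proportion(group_1) - proportion(group_2)


# Hypothetical two-way table of counts.
table = {
    "treatment": {"yes": 15, "no": 5},
    "control": {"yes": 5, "no": 15},
}

# 15/20 - 5/20 = 0.75 - 0.25
print(difference_in_proportions(table, "treatment", "control", "yes"))  # 0.5
```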

This template could also apply when technology is available, with only minor tweaks (I’ve used a similar template for lab exams in the past).  In this case, only (c), (d), and (f) would differ, as students could actually generate the plot, statistic, and randomization distribution. The assessment without technology loses the ability to assess whether students can get the software to work, but I believe it does not fundamentally compromise the ability to assess knowledge and understanding.

If your students do have access to technology for assessments, see the parallel blog post by Robin Lock.