**Karsten Maurer, Assistant Professor of Statistics, The Miami University**

As statisticians, we tend to think that if we just have enough data in front of us, we can get at the heart of what is going on in any scenario. Many statistics educators want to know what is going on with student learning outcomes under different curricula. So the solution is simple, right? Just collect a bunch of data on our students’ learning outcomes under different curricula and identify the strongest pedagogy. We can even get fancy and toss in some experimental design to structure the application of treatments to our experimental units, supporting causal conclusions about impacts on learning outcomes. Alright, I am being facetious here. It is never that straightforward, though I will admit that this was my first instinct when I set out to do educational research as a graduate student. A number of issues constrain plans for what would otherwise be a tidy educational experiment: defining the curricular treatments, assigning students to curricula, applying the curricular treatments, and measuring learning outcomes.

First there is the problem of defining our treatments: each one is an entire curriculum! That includes lessons, lectures, labs, discussions, homework, exams, etc., which creates a mountain of new prep work for the instructor(s) involved and raises questions about what exactly is being compared. More importantly, it dilutes the findings, because we must attribute observed learning differences to the entire array of curricular differences, which precludes conclusions about the efficacy of particular components. Then there is the challenge of assigning students to curricula. Logistical and institutional limitations make true random assignment of students to classes infeasible, so we face potentially biased results. After preparing the curricular treatments and getting our students into classrooms, we still need to teach the class. Almost certainly we will apply a curriculum to an entire classroom but measure learning at the student level: the dreaded mismatch of experimental and observational units. Now we need multiple classrooms for each curriculum to properly account for between- and within-group variability. We could always aggregate at the classroom level and discuss overall learning outcomes, but then the necessary number of classrooms grows even further. Additionally, the instructor effect will be confounded with the classroom effect. Lastly, we have to measure student learning outcomes in a way that other educators will trust and respect, so that the results are not discarded.

This was the laundry list of challenges that Dennis Lock and I faced when we started our comparison of learning outcomes under simulation-based inference and traditional introductory statistics curricula.[1] We were graduate assistants, each assigned as the instructor for an intro-stat section of around 60 students, and we could not change the meeting times for which students had enrolled. Given the challenges listed above (coupled with our limited influence as graduate students), we needed to find creative workarounds. Before registration we arranged for our two sections to have the same meeting time, allowing us to randomly reassign students' classroom *locations*. For the first half of the semester, prior to the inference unit, both sections met in one large classroom that Dennis and I co-taught with one set of lectures, labs, homework, and exams. During the inference unit the students split into two classrooms that were kept as similar as possible in data sources, progression of concepts, and homework/test schedules, in order to focus the results on how simulation-based pedagogy impacted student learning. We also rotated weekly between the two classrooms in an attempt to keep the instructor effect from being confounded with the treatment effect. We elected to evaluate student learning on the final exam using both a widely recognized metric (the ARTIST scaled question sets) and our own question sets.

There is one hurdle that our small-scale experiment could not overcome: replication over classrooms. In our study we had only one classroom per treatment to work with, so we had no way to account for within-classroom dependence structure in our analysis. Our simulation study showed that if scores of students from the same classroom are more strongly related than scores of students from different classrooms, then our Type 1 error rate inflates. This presents a real danger of misinterpreting the tendency of each classroom's scores to cluster around its own shared average as a curricular effect.
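
The inflation can be illustrated with a small simulation. The sketch below is not the simulation from the study; it is a minimal stand-in under assumed conditions: two classrooms of 60 students, no true curricular effect, normally distributed scores with total variance 1, and a shared random classroom shift whose variance equals the intraclass correlation (`icc`). The function names and default values are hypothetical choices for illustration.

```python
import math
import random

# Two-sided 5% critical value for a t distribution with 118 degrees of
# freedom (two classrooms of 60 students); approximately 1.980.
# Assumption: only appropriate for the default n_students=60.
T_CRIT = 1.980


def mean_and_var(xs):
    """Return the sample mean and unbiased sample variance of xs."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / (n - 1)
    return mean, var


def pooled_t(a, b):
    """Pooled two-sample t statistic, treating every student as independent."""
    mean_a, var_a = mean_and_var(a)
    mean_b, var_b = mean_and_var(b)
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * var_a + (nb - 1) * var_b) / (na + nb - 2)
    return (mean_a - mean_b) / math.sqrt(pooled_var * (1 / na + 1 / nb))


def rejection_rate(icc, n_students=60, n_sims=2000, seed=1):
    """Share of null simulations where the naive t-test rejects at the 5% level.

    Each score is a classroom shift (variance icc) plus student noise
    (variance 1 - icc). There is no treatment effect, so every rejection
    is a Type 1 error.
    """
    rng = random.Random(seed)
    sd_class = math.sqrt(icc)
    sd_student = math.sqrt(1 - icc)
    rejections = 0
    for _ in range(n_sims):
        rooms = []
        for _ in range(2):  # one classroom per "curriculum"
            shift = rng.gauss(0, sd_class)
            rooms.append([shift + rng.gauss(0, sd_student)
                          for _ in range(n_students)])
        if abs(pooled_t(rooms[0], rooms[1])) > T_CRIT:
            rejections += 1
    return rejections / n_sims


if __name__ == "__main__":
    for icc in (0.0, 0.05, 0.10):
        print(f"ICC = {icc:.2f}: naive rejection rate = {rejection_rate(icc):.3f}")
```

With `icc` set to zero the rejection rate should sit near the nominal 5%. Under these assumptions, even a modest within-classroom correlation pushes it well above that, because the naive test uses only student-level noise to judge a difference that also contains two classroom shifts.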

In order to reinforce the analyses from small-scale educational experiments like ours, we need to find a way to either eliminate or account for the classroom-based dependence structures. We can strive to structure the course so that the assumption of independence across students is actually true, which would require one-on-one style teaching. Online courses do not remove the burden of developing multiple curricula, but they could make use of random assignment of students to curricula, executed effectively through course management software. The alternative is to account for the classroom effects in our models. While it is impossible to estimate classroom parameters in the covariance structure without classroom replication, we could feasibly plug in reasonable approximations. This would require a much stronger understanding of how inter- and intra-classroom covariance structures behave. One idea is to conduct a series of uniformity studies in which many classrooms are given the same curriculum and we examine the covariance structure of student scores on our intended learning assessment tool. It may seem silly to argue that “in order to do *small*-scale research, we need to first run a *large*-scale uniformity study,” but many larger colleges and universities already run many parallel sections with an identical curriculum; they would simply need to run the assessments and disseminate the data. Sharing these results would allow many small-scale researchers to plug in reasonable proxies to protect their analyses from Type 1 errors. While it is difficult to run curricular assessment without access to lots of resources and many course sections, it is important that we push to conduct small-scale curricular assessment that provides robust results.
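
The plug-in idea above can be sketched in code. This is one possible approach under stated assumptions, not a method taken from the study: it assumes a random-intercept model (score = classroom shift + student noise), equally sized classrooms, and that the intraclass correlation (ICC) estimated from a uniformity study carries over to the small experiment. The data here are simulated, and every function name is hypothetical.

```python
import math
import random


def simulate_uniformity_study(k, m, icc, seed=0):
    """Fake data for illustration: k classrooms of m students, one curriculum."""
    rng = random.Random(seed)
    sd_class, sd_student = math.sqrt(icc), math.sqrt(1 - icc)
    rooms = []
    for _ in range(k):
        shift = rng.gauss(0, sd_class)
        rooms.append([shift + rng.gauss(0, sd_student) for _ in range(m)])
    return rooms


def anova_icc(classrooms):
    """One-way ANOVA estimate of the ICC for equally sized classrooms."""
    k = len(classrooms)
    m = len(classrooms[0])
    if k < 2 or m < 2 or any(len(room) != m for room in classrooms):
        raise ValueError("need at least 2 classrooms of equal size >= 2")
    means = [sum(room) / m for room in classrooms]
    grand = sum(means) / k
    # Mean squares between and within classrooms.
    msb = m * sum((mu - grand) ** 2 for mu in means) / (k - 1)
    msw = sum((x - mu) ** 2
              for room, mu in zip(classrooms, means)
              for x in room) / (k * (m - 1))
    # Negative estimates are truncated to zero.
    return max(0.0, (msb - msw) / (msb + (m - 1) * msw))


def variance_inflation(icc, m):
    """Factor by which a naive pooled test understates Var(mean difference).

    With one classroom of m students per curriculum, the true variance is
    2 * (var_class + var_student / m), while the naive test uses
    2 * var_student / m. Their ratio is 1 + m * icc / (1 - icc).
    """
    if not 0 <= icc < 1:
        raise ValueError("icc must be in [0, 1)")
    return 1 + m * icc / (1 - icc)


def adjusted_t(naive_t, icc, m):
    """Shrink a naive pooled t statistic using a plugged-in ICC."""
    return naive_t / math.sqrt(variance_inflation(icc, m))


if __name__ == "__main__":
    rooms = simulate_uniformity_study(k=40, m=60, icc=0.05, seed=3)
    icc_hat = anova_icc(rooms)
    print(f"estimated ICC: {icc_hat:.3f}")
    print(f"naive t = 2.50 becomes {adjusted_t(2.5, icc_hat, 60):.2f}")
```

The adjusted statistic is only a rough guard. It treats the plugged-in ICC as known, so it ignores the uncertainty in the uniformity-study estimate and any differences between the institutions that supplied it and the classrooms under study.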