**David Diez, OpenIntro**

The percentile bootstrap approach has made inroads into introductory statistics courses, sometimes with the incorrect declaration that it can be used without checking any conditions. Unfortunately, the percentile bootstrap performs worse than methods based on the t-distribution for small samples of numerical data. I would wager that the large majority of statisticians believe, and teach, the opposite, and I think this misplaced faith has created a small epidemic.

The percentile bootstrap is nothing new, but its weaknesses remain largely unknown in the community. I find myself wrestling with several considerations whenever I think about this topic.

A few years ago I created this spreadsheet to compare the percentile bootstrap to classical methods. For numerical data, the t-confidence interval outperforms the percentile bootstrap up through a sample size of about 30. The difference is particularly stark when the population is skewed and the sample size is very small. Tim Hesterberg published a much more comprehensive investigation of multiple classical and bootstrap methods in 2014. He found similar results for small samples: the t-confidence interval outperformed the percentile bootstrap until the sample size reached 35 or larger.
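For readers who want to see the effect themselves, here is a small simulation in the same spirit as the spreadsheet (not its actual contents). The exponential population, the sample size of 10, and the replication counts are all assumptions chosen for illustration, and the hard-coded t critical value is simply t with 9 degrees of freedom:

```python
import numpy as np

rng = np.random.default_rng(42)

def coverage(n=10, sims=1000, boots=500):
    """Estimate coverage of 95% t-intervals and percentile bootstrap
    intervals for the mean of an Exponential(1) population."""
    t_crit = 2.262  # t quantile at 0.975 with df = 9 (n = 10)
    mu = 1.0        # true mean of Exponential(1)

    # Draw `sims` independent samples of size n from the skewed population.
    samples = rng.exponential(1.0, size=(sims, n))
    xbar = samples.mean(axis=1)
    s = samples.std(axis=1, ddof=1)

    # Classical t-interval: xbar +/- t * s / sqrt(n).
    half = t_crit * s / np.sqrt(n)
    t_cover = np.mean((xbar - half <= mu) & (mu <= xbar + half))

    # Percentile bootstrap: resample each sample with replacement and take
    # the 2.5th and 97.5th percentiles of the bootstrap means.
    idx = rng.integers(0, n, size=(sims, boots, n))
    boot_means = np.take_along_axis(samples[:, None, :], idx, axis=2).mean(axis=2)
    lo = np.percentile(boot_means, 2.5, axis=1)
    hi = np.percentile(boot_means, 97.5, axis=1)
    pb_cover = np.mean((lo <= mu) & (mu <= hi))

    return t_cover, pb_cover

t_cov, pb_cov = coverage()
print(f"t-interval coverage:          {t_cov:.3f}")
print(f"percentile bootstrap coverage: {pb_cov:.3f}")
```

With settings like these, both intervals fall short of the nominal 95%, and the percentile bootstrap typically falls shorter, which matches the pattern described above.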

Teaching the percentile bootstrap without thoughtfully explaining the conditions, particularly as a replacement for classical methods, seems like one step forward and two steps back.

**The percentile bootstrap is a stepping stone.** I don’t think the percentile bootstrap method should be taught as “the” bootstrap method. It’s too unreliable. The percentile bootstrap should be taught as a first step towards better methods and/or as a first tool for students to start exploring a wider range of analyses, e.g. of the median, standard deviation, and IQR.

**There are better bootstrap methods.** Tim’s excellent paper found that the *bootstrap t-interval* is much more robust than the percentile bootstrap, and it is even more robust than the classical methods for small samples and skewed data. (Research opportunity #1)

**The bootstrap opens the door to more statistics.** The reason why I remain bullish on the long-term value of advanced bootstrap methods is that they ease the analysis of a wider range of statistics, such as the standard deviation and IQR.

**We need to establish appropriate conditions for the bootstrap.** Every statistical tool fails in many ways, and we need to better understand when methods fail before they are taught to the next generation of statisticians. As a starting point, I suggest a rule of thumb for the percentile bootstrap below. To be clear, more thoughtful work is required here and appropriate conditions are far from settled. (Research opportunity #2)

**Shifts in pedagogy are costly, so let’s do our homework first.** A shift in how intro statistics is taught on a large scale is very expensive. It requires teaching tens of thousands of teachers the new pedagogy, getting those teachers to buy into the change, and then pushing schools (or students) to buy new textbooks. I think we owe it to schools, teachers, and most of all students to have solid evidence (data!) from a diverse set of studies showing the practical benefits of the bootstrap method before we ask them to incur the costs of this transition. (Research opportunity #3)
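To make the bootstrap t-interval concrete, here is a minimal sketch of the standard construction (not code from Tim’s paper): studentize each bootstrap mean, take the empirical quantiles of those t-statistics, and flip them around the sample mean. The exponential sample, seed, and replication count are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(7)

def bootstrap_t_interval(x, boots=5000, conf=0.95):
    """Bootstrap t-interval for the mean (standard construction)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    xbar, s = x.mean(), x.std(ddof=1)

    # Resample with replacement and studentize each bootstrap mean.
    idx = rng.integers(0, n, size=(boots, n))
    bs = x[idx]
    bmean = bs.mean(axis=1)
    bse = bs.std(axis=1, ddof=1) / np.sqrt(n)
    tstar = (bmean - xbar) / bse

    alpha = 1 - conf
    q_lo, q_hi = np.percentile(tstar, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    se = s / np.sqrt(n)
    # Note the quantiles swap: the upper t* quantile sets the lower endpoint.
    return xbar - q_hi * se, xbar - q_lo * se

data = rng.exponential(1.0, size=20)  # hypothetical skewed sample
lo, hi = bootstrap_t_interval(data)
print(f"bootstrap t-interval: ({lo:.2f}, {hi:.2f})")
```

Because the t* quantiles come from the data rather than a t table, the interval can be asymmetric around the sample mean, which is exactly what helps with skewed populations.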

I want to wrap up with my rule of thumb for the percentile bootstrap. *If I’d be comfortable applying the Z-test or Z-confidence interval to a data set, then I think it’s safe to use the percentile bootstrap for the mean or median.* In most introductory courses, that usually means (1) the data are from a simple random sample or from random assignment in an experiment, (2) there are at least 30 observations in the sample, and (3) the distribution is not too strongly skewed.
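For completeness, here is a minimal sketch of the percentile bootstrap itself, applied to a hypothetical sample that satisfies the conditions in the rule of thumb above; the data, seed, and replication count are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def percentile_bootstrap_ci(x, stat=np.mean, boots=5000, conf=0.95):
    """Percentile bootstrap interval: the middle `conf` fraction of the
    bootstrap statistics. Per the rule of thumb above, reserve this for
    randomly sampled/assigned data, 30+ observations, and modest skew."""
    x = np.asarray(x, dtype=float)
    n = x.size
    idx = rng.integers(0, n, size=(boots, n))
    stats = stat(x[idx], axis=1)
    alpha = 1 - conf
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi

# Hypothetical sample of 40 observations meeting the conditions.
sample = rng.normal(10, 2, size=40)
mean_lo, mean_hi = percentile_bootstrap_ci(sample, np.mean)
med_lo, med_hi = percentile_bootstrap_ci(sample, np.median)
print(f"mean CI:   ({mean_lo:.2f}, {mean_hi:.2f})")
print(f"median CI: ({med_lo:.2f}, {med_hi:.2f})")
```

The same function handles the mean and the median with one argument change, which is the appeal noted above: the bootstrap opens the door to statistics that classical formulas handle awkwardly or not at all.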


Bob Hayden: Thank you, David, for your work on these simulations and for posting this here. I was introduced to the bootstrap by Peter Bruce back when his Resampling Statistics program ran under DOS. A measure of variability for the bootstrap distribution seemed to me a reasonable descriptive statistic for the variability of the sampling distribution, much as s over sqrt(n) is, but I could not see any reason to think that 95% of the percentile bootstrap intervals would capture a population parameter. With good reason, apparently! I also could not see how it could work for ANY statistic — it seemed unlikely to work for the max, min, or range for example. So I would want to add research opportunity 1.5 to see which statistics it does work for.

I also have had pedagogical concerns. When we use the bootstrap for inference, we are essentially using it as some sort of approximation to the sampling distribution. Sampling distributions are hard enough without this added layer of complexity. I suspect that the bootstrap seems easier to students because they are just thinking, “95% of the bootstrap means are in the middle 95% of the bootstrap distribution,” which is just a tautology not relevant to understanding inference. So the bootstrap REPLACES understanding of sampling distributions.

While the bootstrap t may work better, it is not clear how it could be used to INTRODUCE inference, as it seems to depend on learning t methods first.