### "Statistics in the Supermarket: An Intellectually Nutritious Activity"

Alan Reifman & Zhen Cong, Texas Tech University

#### Abstract

Statistics educators have long advocated the use of real, as opposed to artificially created, datasets for teaching purposes. Singer and Willett's (1990) article in *The American Statistician* articulates this position, but with more emphasis on having students re-analyze existing data (e.g., from the U.S. Census or previously published articles) than on collecting new data. Having students in a class collect new data has the advantage of giving them greater ownership of the data collection and analyses. However, common forms of data collection, such as administering surveys, present their own challenges and sources of delay (e.g., finding potential respondents; seeking human-subjects approval under some circumstances).

The present activity allows students to collect their own data, free from these common logistical challenges. Further, the activity contains a detective-like component, requiring students to use statistical analyses to estimate answers to real-world questions. The activity revolves around the NuVal® rating system, which derives a single value from 1 to 100 to indicate a given food product's nutritional value, with higher scores indicating healthier foods (www.nuval.com). Some supermarket chains, including one in our home city, display NuVal scores on shelves below all the food products they sell. The NuVal website (FAQ section) notes that the system "takes more than 30 different nutrients and nutrition factors into account," which are submitted to a "complex algorithm." The algorithm itself is not observable to shoppers, but the nutritional information on product packages (e.g., percent daily value of carbohydrates, saturated fat, and different vitamins) does allow one to estimate correlations (or regression coefficients) with products' overall NuVal scores. Although trying to discover the complete algorithm through statistical analysis was beyond the scope of our activity, students could use available information to address questions such as: Do products with greater Vitamin A and lower saturated fat receive higher overall NuVal scores?

We assigned students in a 2012 graduate-level multiple-regression course to use the technique to estimate how 10 specific nutritional quantities (the independent variables) contributed to the overall NuVal score (the dependent variable). To begin, each student was instructed to go to a NuVal-participating supermarket and record the nutritional information along a particular aisle (different for each student). Students were told how to randomly sample products on a given aisle, resulting in approximately 48 records for each student, where each record contained an overall NuVal score and 10 specific nutritional quantities for a product. Each student then conducted a multiple-regression analysis with his or her own dataset. In this way, students could see that if a particular nutritional property showed a positive regression coefficient in relation to NuVal scores, then a higher percent daily value of that ingredient, holding the other nine quantities constant, would raise the overall NuVal score (and by how much). Conversely, a negative regression coefficient would signal that a higher percent daily value of an ingredient would lower the overall NuVal score. Students successfully completed the data collection, analysis, and interpretation.

Beyond the basic question of how the 10 nutritional components related to the overall NuVal score, the activity lends itself to discussion of further issues. One is why the R-squared for the 10-predictor equation would be expected to fall well short of 1.00 (and indeed did): the NuVal score is based on more than 30 contributing factors, possibly with nonlinear relations, so 10 linear predictors cannot account for all of its variation. Another is how the relation of a given nutritional property to the overall NuVal score may differ across classes of food type (e.g., cookies/crackers; frozen vegetables) when the data are analyzed separately by food type. Though implemented with graduate students, the present activity would also be appropriate for advanced undergraduate students, in our view.
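
The R-squared point can be illustrated with a small simulation. The sketch below is not the authors' analysis and uses no NuVal data: it generates a hypothetical score for 48 synthetic products from 10 observed and 5 hidden quantities, then regresses the score on the 10 observed ones only. The weights, the value ranges, the number of hidden factors, and the helper names `fit_ols` and `r_squared` are all assumptions made for illustration. It uses only the Python standard library; in a course, a statistics package would normally do the fitting.

```python
import random

def fit_ols(X, y):
    """Ordinary least squares via the normal equations.

    Returns [intercept, b1, ..., bk] for predictor rows X and outcomes y.
    """
    rows = [[1.0] + list(r) for r in X]  # prepend an intercept column
    k = len(rows[0])
    # Augmented matrix [X'X | X'y]
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(k):
        p = max(range(c, k), key=lambda idx: abs(A[idx][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

def r_squared(X, y, coefs):
    """Proportion of variance in y explained by the fitted equation."""
    preds = [coefs[0] + sum(b * x for b, x in zip(coefs[1:], row)) for row in X]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, preds))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

# Hypothetical setup: 48 products, 10 recorded quantities, 5 unrecorded ones.
random.seed(0)
N_PRODUCTS, N_OBSERVED, N_HIDDEN = 48, 10, 5
observed_w = [random.uniform(-1, 1) for _ in range(N_OBSERVED)]  # made-up weights
hidden_w = [random.uniform(-1, 1) for _ in range(N_HIDDEN)]      # made-up weights

X, y = [], []
for _ in range(N_PRODUCTS):
    observed = [random.uniform(0, 30) for _ in range(N_OBSERVED)]  # synthetic % daily values
    hidden = [random.uniform(0, 30) for _ in range(N_HIDDEN)]      # never shown to the model
    score = (50
             + sum(w * v for w, v in zip(observed_w, observed))
             + sum(w * v for w, v in zip(hidden_w, hidden)))
    X.append(observed)
    y.append(score)

coefs = fit_ols(X, y)       # intercept plus 10 slopes
r2 = r_squared(X, y, coefs)
print(f"R-squared with 10 of 15 factors observed: {r2:.3f}")
```

Because the hidden quantities contribute variation that the 10 recorded predictors cannot reproduce, the printed R-squared falls below 1 even though the simulated score is an exact linear function of all 15 inputs.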

#### Materials

- Download slides (PowerPoint)

#### Recording

Having trouble viewing? Try: Download (.mp4)

#### Comments

**Nicholas Horton:**

I'd not heard of the "NuVal" scores, but I really like the idea that students are challenged to try to understand factors that are part of this complex algorithm by use of multiple regression. I've advocated for the incorporation of some multiple regression topics in the introductory statistics course (potentially just using descriptive inference) and think that this is a great motivating example.

I'd be interested in seeing a pairs plot for these data, and potentially analyzing them myself. Have you considered submitting this to the Datasets and Stories Department of the Journal of Statistics Education? @askdrstats

**Alan Reifman:**

Thanks for the kind words and suggestion regarding the JSE Datasets archive.

**Jennifer Kaplan:**

I would also love to have one of the data sets. I think I could use it in my course in the fall (so I would rather not wait for it to appear in JSE -- though I do encourage you to submit it to JSE as a dataset and story).

**Dennis Pearl:**

Neat project. Another extension might be to examine the issue of over-fitting by having each student's regression equation applied to a different student's data to see how well it fits an independent set.

**Alan Reifman:**

Thanks!