Strategies to find real data from genuine studies

Soma Roy – Cal Poly, San Luis Obispo

I firmly believe that the key to getting students to appreciate what the discipline of Statistics does is to show them examples: lots of examples of a variety of real studies that investigate real research questions, and to have students analyze the data from such studies. So in all of my classes I use data from real research studies to help students understand that Statistics is about things that matter and that it applies to the real world, which they tend to think of as separate from their statistics class, especially when it is a General Education class. Below I list a few strategies I use to give students experience with real data from genuine studies, a few resources where you can find such data and studies, and a few examples of studies I use in class.

Strategy 1: Always being on the lookout

Many of the studies I use in my classes came about because I read something in the news or heard something on the radio, and then looked for related articles on the Internet.

Example: This morning (March 31, 2015), while browsing a Health news section, I saw the headline “The Soda-Cancer Connection”, which led me to this article (Loftfield et al., 2015) from the Journal of the National Cancer Institute. The article describes a prospective cohort study of 447,357 non-Hispanic whites followed for a median of 10.5 years; during follow-up, 2,904 cases of malignant melanoma were detected. Even though the researchers performed analyses that are much more sophisticated than we expect our STAT 101 students to do, we can still use the summary data provided in the article to have STAT 101 students do simple analyses. For example, using the information in Table 1 and Figure 1 of the article, I came up with the following two-way table that cross-classifies the participants by whether or not they developed melanoma and how much coffee they drank per day.

                                       Coffee Intake
                                       ≤ 1 cup/day  2-3 cups/day   4+ cups/day
Developed malignant melanoma                   942          1253           399
Did not develop malignant melanoma          139901        186767         73521
Total                                       140843        188020         73920


This example can be used to explore several different ideas:

  • Conditional proportions and segmented bar charts
  • Inference methods for comparing several groups on a categorical response (simulation-based and theory-based methods)
  • Scope of conclusion – to whom are the study results applicable; can cause-and-effect conclusions be drawn?
  • Using odds ratios/relative risk for dose-response modeling
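As a rough illustration of the first, second, and fourth ideas, the counts in the two-way table can be analyzed in a few lines of Python. This is only a sketch: the variable names are illustrative, the chi-square test stands in for whichever theory-based method a course uses, and the p-value shortcut relies on the fact that a chi-square distribution with 2 degrees of freedom has survival function exp(-x/2).

```python
from math import exp

# Counts copied from the two-way table (columns: <=1, 2-3, 4+ cups/day).
groups = ["<=1 cup/day", "2-3 cups/day", "4+ cups/day"]
melanoma = [942, 1253, 399]
no_melanoma = [139901, 186767, 73521]

col_totals = [m + n for m, n in zip(melanoma, no_melanoma)]
grand_total = sum(col_totals)

# Conditional proportion who developed melanoma within each intake group.
risk = [m / t for m, t in zip(melanoma, col_totals)]

# Relative risk of each group compared with the lowest-intake group.
relative_risk = [r / risk[0] for r in risk]

# Chi-square statistic: sum of (observed - expected)^2 / expected over all
# six cells, with expected counts computed under "no association".
chi_sq = 0.0
for row in (melanoma, no_melanoma):
    row_total = sum(row)
    for observed, col_total in zip(row, col_totals):
        expected = row_total * col_total / grand_total
        chi_sq += (observed - expected) ** 2 / expected

# Degrees of freedom = (2 - 1) * (3 - 1) = 2, so the p-value is exp(-x/2).
p_value = exp(-chi_sq / 2)

for name, r, rr in zip(groups, risk, relative_risk):
    print(f"{name}: proportion = {r:.5f}, relative risk = {rr:.3f}")
print(f"chi-square = {chi_sq:.2f}, p-value = {p_value:.5f}")
```

A small p-value here speaks only to association; because the study is observational, the scope-of-conclusion question in the list above still has to be discussed separately.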

Social media can be a useful source, too. I often find interesting posts on Facebook that turn into examples to be used in class.

Strategy 2: Patiently searching on the Internet

The Internet can be a very useful tool when looking for data – as long as you know the right words to type into the search engine. I am happy to say that with time and practice I have become quite good at finding exactly the kind of studies I want, when teaching certain topics. For example, when I search for “effect of exercise on blood pressure” on Google Scholar many journal articles turn up, and one among them is “Effects of Regular Exercise on Blood Pressure and Left Ventricular Hypertrophy in African-American Men with Severe Hypertension” from The New England Journal of Medicine, 1995. The article provides means and SDs of a few variables (systolic BP, diastolic BP, etc.) for the two treatment groups, as well as the sample sizes, and these statistics can be used to carry out statistical inference procedures.
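When an article reports only means, SDs, and sample sizes for two groups, a two-sample t statistic can still be computed. The sketch below uses Welch's version (no equal-variance assumption), which is one reasonable choice among several. The function name is hypothetical and the numbers in the example call are placeholders, not values from the NEJM article.

```python
from math import sqrt

def welch_t_from_summary(mean1, sd1, n1, mean2, sd2, n2):
    """Return (t statistic, approximate degrees of freedom) for comparing
    two group means using only summary statistics (Welch's t-test)."""
    v1 = sd1 ** 2 / n1  # squared standard error of the first group mean
    v2 = sd2 ** 2 / n2  # squared standard error of the second group mean
    t_stat = (mean1 - mean2) / sqrt(v1 + v2)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t_stat, df

# Placeholder summary statistics for illustration only.
t_stat, df = welch_t_from_summary(mean1=10, sd1=2, n1=25,
                                  mean2=8, sd2=2, n2=25)
print(f"t = {t_stat:.3f}, df = {df:.1f}")
```

The resulting t statistic and degrees of freedom can then be looked up in a t table or passed to any statistics package to get a p-value.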

If, like me, you want to run simulation-based inference on the data, then the summary statistics are not as helpful as the raw data would be. I have found that most authors of recent research studies are happy to share the raw data with you if you email them.
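To show why raw data matter for simulation-based inference, here is a minimal sketch of a permutation (randomization) test for a difference in two group means. It needs the individual observations so that they can be reshuffled between groups. The function name, the number of repetitions, and the data values are all assumptions made up for illustration.

```python
import random
from statistics import mean

def permutation_test(group_a, group_b, reps=10_000, seed=0):
    """Two-sided permutation test for a difference in group means.

    Returns the observed difference (mean of a minus mean of b) and the
    proportion of shuffles whose difference is at least as extreme."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(reps):
        rng.shuffle(pooled)  # re-randomize group labels
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        # Small tolerance guards against floating-point ties.
        if abs(diff) >= abs(observed) - 1e-12:
            extreme += 1
    return observed, extreme / reps

# Made-up changes in a response variable for two treatment groups.
treatment = [-12, -8, -15, -5, -10, -7, -9, -11]
control = [-2, 1, -4, 3, 0, -3, 2, -1]

observed_diff, p_value = permutation_test(treatment, control)
print(f"observed difference = {observed_diff}, p-value = {p_value}")
```

With only means and SDs, the `pooled` list cannot be built, which is the practical reason for asking authors for the raw data.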

Strategy 2 does require lots of patience, and I often have to go through many articles before I find something that fits the objective(s) I have in mind. On the plus side, I often find articles that, though not suitable for the topic I have in mind at the time, do have other things to offer. I archive these articles for future use, saving them in folders such as “Multiple groups, categorical response” or “Two groups, quantitative response.”

Other online sources of data:

Strategy 3: Asking colleagues

I am very fortunate to have many great colleagues in my department with whom I get to talk about teaching statistics all the time, and I find them to be great resources when it comes to finding real data and studies to use in class. I understand that not all statistics instructors are as lucky as I am, but that’s why we have online communities and websites that serve statistics educators: the isostat listserv, ASA’s Section on Statistical Education listserv, and of course, the sbi listserv to name a few.

Strategy 4: Collecting data on students 

In all my classes students frequently investigate research questions by using data that they have collected on themselves. For example, they collect and analyze data to investigate whether people can remember more information when the information is presented in smaller recognizable chunks rather than larger unrecognizable chunks. My hope is that this study’s results will give students helpful hints about study habits. Students also collect data on whether heart rates differ after sitting versus jumping for 30 seconds. The benefit of using data collected on the students themselves is that it makes the investigation that much more relevant to them.

I often conduct online surveys in my classes where students are asked questions such as: How many hours of sleep did you get last night? Do you eat breakfast? How far from home did you have to travel to go to school here? Do you consider yourself an early bird or a night owl? How many hours per week outside of class do you plan on spending studying for your statistics course? Do you live on campus or off campus? Then, as we cover various data types and inference methods, students analyze these data. For example, at my school the expectation is that students will spend at least 8 hours per week outside of class on a 4-unit course. Students use their class data to test whether, on average, the students in their class are planning on spending less than the recommended amount. Students also investigate whether students who live on campus tend to get more sleep than those who live off campus.
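The study-hours question can be sketched as a one-sample t statistic comparing the class mean with the 8-hour expectation. The responses below are invented for illustration, and the function name is hypothetical; a simulation-based approach would work equally well here.

```python
from math import sqrt
from statistics import mean, stdev

def one_sample_t(values, null_mean):
    """Return (t statistic, degrees of freedom) for testing whether the
    mean of `values` differs from `null_mean`."""
    n = len(values)
    standard_error = stdev(values) / sqrt(n)  # sample SD over sqrt(n)
    return (mean(values) - null_mean) / standard_error, n - 1

# Invented survey responses: planned weekly study hours for eight students.
planned_hours = [6, 8, 7, 5, 9, 7, 6, 8]

t_stat, df = one_sample_t(planned_hours, null_mean=8)
print(f"t = {t_stat:.3f} with {df} degrees of freedom")
```

A clearly negative t statistic would be evidence that the class plans to study less than the recommended amount, subject to the usual caution about whether one class is representative of a larger population.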

At the end of the quarter this gives me the opportunity to ask students to identify something they learned in the class not related to statistics or some example/study they found memorable as a measure of whether I was successful in engaging them in some of these contexts.

In one of my introductory statistics classes, students collect data daily on the amount of time they spend on different activities, such as being in class, preparing/studying for class, working out, hanging out with friends, etc. At the end of the quarter they analyze the data to see whether and how the time spent on various activities changed over the duration of the term. My hope is that this will help students realize that time management is a skill to be worked on, and that their data will guide them towards honing this skill.

Strategy 5: Student projects

To provide students with additional opportunities to holistically practice methods of data collection, analysis, and reporting, I include a project component in all my classes. Students work in teams to first come up with research questions and data collection plans. Then they collect and analyze the data, and write a report summarizing the research and the findings. This gives students a chance to apply statistics to real-world data that matters to them, and to practice their written communication skills, especially technical writing. As part of their presentation, students have to convince me why their research question matters, and also do a brief literature review of other similar studies. One of my favorite student projects (Bacon, Boggan, Burton, and Stamer, 2012) is one that investigated whether men with children tend to live longer than men without children, and I use that dataset in class now.

In some of my upper-level statistics courses, I have projects that are solely about reviewing articles from peer-reviewed journals; for these projects students have to find studies that answer scientific questions using specific statistical methods, and then write up the methods and materials, as well as the findings.

Note: If you want to read more about how to incorporate student projects in your courses, we have several posts on this page.

If you have strategies that you use to find interesting datasets and real research studies to use in class, I would love to hear about them! Please share your ideas by posting comments!

One thought on “Strategies to find real data from genuine studies”

  1. Megan Olson Hunt

    I found a nice example of a paper utilizing Fisher’s Exact Test that I like for a few reasons:

    Wolkenstein et al. (1998). Randomised comparison of thalidomide versus placebo in toxic epidermal necrolysis. The Lancet 352, 1586-1589.

    1) It’s a good illustration of a case where it’s difficult to collect data and thus a small sample size occurs. Specifically, it’s a rare disease, then you have to get patient consent even once you find those with this condition.

    2) It’s a case of using Fisher’s Exact Test in a real scientific study.

    3) The study had to be stopped because more people were dying from the active treatment than placebo (10/12 vs. 3/10). Thus, it’s a nice opportunity to talk about ethics when studying human subjects.


