"Big Data Generates Beguiling Coincidences"
with Milo Schield, Augsburg College
Hosted by: Sam Morris, North Carolina State University
Today's data users face a data deluge: data is everywhere in massive amounts. Big data leads to the omni-presence of coincidence which leads people to conclude that there is something more going on than "mere" chance. Educators often see this differently, and ponder how to lead students to a more accurate idea of "expected." This presentation argues that coincidences are more likely because of what is unseen and presents a probabilistic approach to "expected." Spreadsheets are presented that help make the unseen more visible and help students challenge and develop their notion of "expected". These spreadsheets demonstrate coincidence with runs with coins, with linear and non-linear clusters in a two-dimensional grid, and with the Birthday problem. Coincidences are explained mathematically and geographically. Participants will access the ideas and the materials and assess their inclusion in an intro stats course.
Really enjoyed this. Looking forward to using the spreadsheets in my class.
I'd appreciate hearing how it works in your class.
I'm not sure we've really woken up to big data yet (in terms of the theory to deal with it). I think this is a great start. I read Ben Goldacre's "Bad Science" and hear what he says about pharmaceutical firms running subset analysis after subset analysis until they get one they like. So I think these kinds of activities are a really important start (you can only do so much with coins and runs). Thanks!
The bigger the dataset, the more opportunities there are for coincidences.
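One way to see this is with the Birthday problem mentioned in the abstract. The short Python sketch below is not taken from the presentation's spreadsheets; the function name `p_shared_birthday` and the assumption of 365 equally likely birthdays are illustrative choices. It computes the chance that at least two people in a group of n share a birthday, which rises quickly as the group grows.

```python
def p_shared_birthday(n, days=365):
    """Chance that at least two of n people share a birthday.

    Assumes `days` equally likely birthdays and independent people
    (an idealization, used here only for illustration).
    """
    if n > days:
        return 1.0  # pigeonhole: a match is certain
    p_all_distinct = 1.0
    for i in range(n):
        # The (i+1)-th person must avoid the i birthdays already taken.
        p_all_distinct *= (days - i) / days
    return 1.0 - p_all_distinct


if __name__ == "__main__":
    for n in (5, 10, 23, 50, 100):
        print(n, round(p_shared_birthday(n), 4))
```

With 23 people the chance of a match is already a little over one half, even though any one particular pair matches with probability only 1/365.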
I'm having fun pondering the "grains of rice" experiment. Will you clarify: does each cell represent a trial, and the number in that cell represents the # of the square the grain of rice landed on (on some other, numbered, grid)? I'm trying to verify that P(4 cells touching) = 10^(-4) as this would be the probability of a "run" of four 9's, but it doesn't seem like that's this situation. Thanks for the amusement!
I'm glad you enjoyed the activity. In the "grains of rice" spreadsheet, each cell represents a trial that draws a random integer between 0 and 9. If the value is 9, it is treated as though a grain of rice had landed there (one chance in 10), and the cell is formatted in red to indicate the rice. So long as a "cluster" is defined as just those red cells that touch, the chance of a particular cluster is 10^-k, where k is the number of red cells touching. Because touching is allowed horizontally, vertically and diagonally, a cluster has many more ways to form than a run in a single direction.
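For anyone who wants to try this outside a spreadsheet, here is a minimal Python sketch of the same idea. It is not the spreadsheet itself: the function names `make_grid` and `cluster_sizes`, the 20 by 20 grid size, and the seed are all assumptions chosen for illustration. Each cell gets a random integer from 0 to 9, a 9 counts as a grain of rice, and cells that touch horizontally, vertically or diagonally are grouped into one cluster.

```python
import random


def make_grid(rows, cols, rng):
    """Fill a rows x cols grid with random integers 0..9 (9 = grain of rice)."""
    return [[rng.randint(0, 9) for _ in range(cols)] for _ in range(rows)]


def cluster_sizes(grid):
    """Return the size of every cluster of 9s.

    Two 9s belong to the same cluster if they touch horizontally,
    vertically or diagonally (8-neighbor connectivity).
    """
    rows, cols = len(grid), len(grid[0])
    seen = set()
    sizes = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != 9 or (r, c) in seen:
                continue
            # Flood-fill outward from this unvisited rice cell.
            stack = [(r, c)]
            seen.add((r, c))
            size = 0
            while stack:
                cr, cc = stack.pop()
                size += 1
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        nr, nc = cr + dr, cc + dc
                        if (0 <= nr < rows and 0 <= nc < cols
                                and grid[nr][nc] == 9
                                and (nr, nc) not in seen):
                            seen.add((nr, nc))
                            stack.append((nr, nc))
            sizes.append(size)
    return sizes


if __name__ == "__main__":
    rng = random.Random(2012)  # arbitrary seed so a run can be repeated
    grid = make_grid(20, 20, rng)
    sizes = cluster_sizes(grid)
    print("grains of rice:", sum(sizes))
    print("largest cluster:", max(sizes, default=0))
```

Rerunning with different seeds or a larger grid gives a feel for how often clusters of three or four touching cells turn up somewhere on the grid, even though any one specific set of four cells is all rice only once in 10^4 tries.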
Really interesting, Milo. The notion of coincidences and calculating the true probabilities of such events has long been an interest of mine. I'm so far behind in my reading that I've just gotten to the February 2012 Significance Magazine. There is a nice article about coincidences there -- in case you are even further behind in reading than I am!