Chance News 55: Difference between revisions

From ChanceWiki

Revision as of 12:41, 25 September 2009

Quotations

Populism, in its latest manifestation, celebrates ignorant opinion and undifferentiated rage. .... The typical opinion poll … doesn’t trouble to ask whether the respondent knows the first thing about the topic being opined upon, and no conventional poll disqualifies an answer on the ground of mere total ignorance. The premise of opinion polling is that people are, and of right ought to be, omni-opinionated – that they should have views on all subjects at all times – and that all such views are equally valid. …. So, given the prominence of polls in our political culture, it’s no surprise that people have come to believe that their opinions on the issues of the day need not be fettered by either facts or reflection. …. Now there’s the intellectual free lunch: I’m entitled to vociferous opinions on any subject, without having to know, or even think, about it.

Michael Kinsley, The New Yorker, February 6, 1995

We live in a world of real dangers and imagined fears. …. We are hounded by what I call “psycho-facts”: beliefs that, though not supported by hard evidence, are taken as real because their constant repetition changes the way we experience life. …. We act as if there’s a constitutional right to immortality and that anything that raises risk should be outlawed. ….

Robert J. Samuelson, Newsweek, May 9, 1994

In a September 18 Statesman Journal story, “Ducks’ defense faces tough challenge”, a coach was quoted:

The only statistic that counts is winning and losing …. We don't get caught up in that. .... How many yards and those things.


A retiring associate professor of math at BVU was described in a Storm Lake Pilot Tribune article [1] of September 17:

His love for math outweighed his love of sports by a few percentage points.

Forsooths

Responding to a Canadian viewer who pointed out that "life expectancy in Canada under our health system is higher than the USA," Fox's Bill O'Reilly on 7/27/09 said,

Well, that's to be expected, Peter, because we have 10 times as many people as you do. That translates to 10 times as many accidents, crimes, down the line.


According to a September 18 FOX8 News WVUE-TV story, “Chance for rain”, the following information was published in a cover story in an early 2009 bulletin of the American Meteorological Society:

[Researchers at the University of Washington] found people in Seattle didn't have much of a grasp for what the probability forecast [of rain] really means, but found the numbers helpful in planning their day.


Hanna Karp, in “What’s the Point of Cheerleading?”, The Wall Street Journal, September 17, 2009, states:

Risk-assessment experts say it’s hard to get a handle on the perils of cheerleading.


An advertisement in The Wall Street Journal, of September 22, 2009, contained a chart with an interesting legend. See the chart “Effectiveness of virtual vs. in-person meetings,” in “The Return on Investment of U.S. Business Travel”, prepared by Oxford Economics USA, September 2009, document page 21/pdf page 20.
Students might find it challenging to describe in one sentence what it says. They also might be asked to re-create the chart so that it would convey the message more effectively, that is, pass the interocular trauma test.

Breaking News

The Wall Street Journal of September 8, 2009 reports on a study in the Journal of Bone and Joint Surgery: “The researchers compared the outcomes of patients who underwent surgery between 6 a.m. and 4 p.m. for fractures of the femur or tibia to those who had comparable surgeries for similar fractures outside those normal hours.”

Sample                 Reoperations Needed   Sample Size   Sample Proportion
Outside Normal Hours   28                    82            .3415
Within Normal Hours    12                    70            .1714

The results are:

Difference = p (1) - p (2)
Estimate for difference: 0.170035
95% CI for difference: (0.0346494, 0.305420)
Test for difference = 0 (vs not = 0): Z = 2.37 P-Value = 0.018

Fisher's exact test: P-Value = 0.026
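
The figures above can be approximately reproduced with a short Python sketch. The function names are hypothetical, and the choice of standard errors is an assumption: the article does not say how the software computed them, but an unpooled standard error for the interval and a pooled one for the test appear to match the reported numbers. The two-sided Fisher p-value here sums the probabilities of all tables no more likely than the observed one, which is one common convention and may not be the one used for the 0.026 figure.

```python
import math


def two_proportion_summary(x1, n1, x2, n2, z_crit=1.959964):
    """Normal-approximation comparison of two sample proportions."""
    p1, p2 = x1 / n1, x2 / n2
    diff = p1 - p2
    # Assumption: the confidence interval uses the unpooled standard error.
    se_unpooled = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    ci = (diff - z_crit * se_unpooled, diff + z_crit * se_unpooled)
    # Assumption: the test of "difference = 0" uses the pooled standard error.
    pooled = (x1 + x2) / (n1 + n2)
    se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = diff / se_pooled
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
    return diff, ci, z, p_value


def fisher_exact_two_sided(x1, n1, x2, n2):
    """Two-sided Fisher exact p-value from the hypergeometric distribution."""
    total = x1 + x2
    denom = math.comb(n1 + n2, total)

    def prob(k):
        # Probability that k of the `total` reoperations fall in sample 1.
        return math.comb(n1, k) * math.comb(n2, total - k) / denom

    p_obs = prob(x1)
    k_min, k_max = max(0, total - n2), min(n1, total)
    # Sum every table that is no more likely than the observed one.
    return sum(prob(k) for k in range(k_min, k_max + 1)
               if prob(k) <= p_obs * (1 + 1e-9))


diff, ci, z, p_value = two_proportion_summary(28, 82, 12, 70)
print(diff, ci, z, p_value)
print(fisher_exact_two_sided(28, 82, 12, 70))
```

Only the standard library is used, so the sketch runs without a statistics package.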

Discussion

1. Why is the Fisher exact test P-Value (0.026) to be preferred to the other P-Value mentioned (0.018)?

2. The Wall Street Journal mentioned several caveats “making it difficult to determine the underlying reasons for the after-hours patients’ poor outcomes.” List a few practical significance hedges to the statistically significant result.

Contributed by Paul Alper

Amazon River at age 1,000,003 years

“Metrics mania: Are Americans too reliant on numbers?”
by John Yemma, The Christian Science Monitor, September 16, 2009

The author first reminds readers of an old joke:

A guy strikes up a conversation with another guy on a long plane flight to South America. They are over the Amazon.

Guy 1: “Did you know that the Amazon is 1,000,003 years old?”
Guy 2: “Really? How can you be so precise?”

Guy 1: “I was on this same flight three years ago, and a geologist told me the Amazon was a million years old.”

He then discusses the difficulty with “metrics-based management” efforts, but concludes, in a hopeful vein, with a formula and some encouragement:

Metrics + Grain of Salt = Somewhat Useful Information.
Still, even if we can’t trust data absolutely, we can extract meaning. We may not know how old the Amazon really is, but we know one thing for certain: It is three years older than when Guy 1 first flew over it.

A blogger comments [2],

So true. I am an European who has lived in the US for almost 20 years. I am constantly amazed at the ‘number obsession’ that seems to rule all areas of society. It may be because this country is so big, that a common measure can only be found in quantities, not qualities.

Gompertz Law of human mortality

“You’re Likely to Live!”
by “Freakonomics,” The New York Times, September 14, 2009
This very brief article describes the “Gompertz Law of human mortality,” provides some statistics about the different chances of dying at different ages, and refers readers to three websites:
(a) Article with Gompertz Law details and graphs: “Your body wasn’t built to last: a lesson from human mortality rates”, "gravity and levity" blog, July 8, 2009.
(b) Applet that gives life expectancy at user-selected age: “Death Probability Calculator”, undated.
(c) TED video of songs, the first of which relates to aging: “Time is marching on”, March 2007.

Things that go bump

“Bumped Passengers Learn a Cruel Flying Lesson”
by Scott McCartney, The Wall Street Journal, September 17, 2009

This article discusses the recent spike in the rates of passenger-bumping by airlines, despite the increased penalties that the federal government requires the airlines to pay bumped-but-ticketed passengers. Although bumping affects fewer than 2 passengers out of every 10,000, that rate rose by 40% in the second quarter of 2009 over the rate for the second quarter of 2008.

"It's pretty simple: It's just because planes are more full than last year," says [a US Airways official, whose airline] had the highest bumping rate among major airlines, at 1.88 passengers per 10,000 in the second quarter.
This summer, the nine major airlines filled 85.5% of their seats, up from 84.1% last summer. The peak was July, with 86.7% of seats filled.

Federal rules allow airlines to overbook in order to compensate for no-shows. The recent increase in bumping rates may be explained by the reduced demand for air travel, especially by business customers.

The [Department of Transportation] says it isn't concerned about the rise in bumping because the rates are still lower than historical highs. During the 1970s and 1980s, bumping rates were routinely four times as high as today's rate.

Discussion

Suppose that, on average, 85% of ticket-holders show up for their flights. Assume that the distribution of the number of ticket-holders who show up is binomial (in particular, that every ticket-holder has the same chance of showing up, independently of the others) and that a ticket-holder is bumped only due to lack of a seat.

1. For each n tickets sold, or over-sold, for a 200-seat plane, find the number of ticket-holders an airline could expect to show up, on average.
(a) n = 200 (b) n = 210 (c) n = 220 (d) n = 230 (e) n = 240 (f) n = 250.

2. It appears that the airline would not have to bump any ticket-holders for some values of n. Is that a statistically correct inference, based on your understanding of expected value? Even if those expected values always “came true,” what problem would remain for the airline?

3. For each n tickets sold, or over-sold, find the probability of at least one ticket-holder being bumped off the 200-seat plane.
(a) n = 200 (b) n = 210 (c) n = 220 (d) n = 230 (e) n = 240 (f) n = 250.

4. For which value(s) of n would you have a negligible risk of being bumped? Under what circumstances might any risk be too great?

5. The more tickets an airline sells, the more likely it is to fill the plane and thus maximize its revenue for a flight. However, at some point, the increased revenue may be offset by losses of future dollars from angry ticket-holders and compensation payouts to increasing numbers of bumped ticket-holders. What other information would you want/need to know before deciding how many tickets to sell for a 200-seat plane?

6. Do you agree with the "pretty simple" reason given for the increased rate of bumping?
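
For readers who want to check their answers to questions 1 and 3 numerically, the following Python sketch evaluates the binomial model stated above. The names SEATS, SHOW_UP_PROB, expected_show_ups and prob_at_least_one_bumped are hypothetical, and the model is only the simplified one in the discussion (independent ticket-holders, each showing up with probability 0.85).

```python
from math import comb

SEATS = 200          # seats on the plane, as in the discussion questions
SHOW_UP_PROB = 0.85  # assumed chance that any one ticket-holder shows up


def expected_show_ups(n, p=SHOW_UP_PROB):
    """Mean of a binomial(n, p) count of ticket-holders who show up."""
    return n * p


def prob_at_least_one_bumped(n, seats=SEATS, p=SHOW_UP_PROB):
    """P(more than `seats` of the n ticket-holders show up)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(seats + 1, n + 1))


for n in (200, 210, 220, 230, 240, 250):
    print(n, expected_show_ups(n), prob_at_least_one_bumped(n))
```

Comparing the two printed columns shows why an expected value at or below 200 does not by itself guarantee that nobody is bumped.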

The Bulgarian Toto 6 of 42 lottery

The Bulgarian Toto 6 of 42 lottery was the subject of an investigation after the [3] same set of six numbers {4, 15, 23, 24, 35, 42} was drawn in two successive lotteries on September 6 and September 10, 2009. The article cites a mathematician as stating that the probability of picking the same six numbers twice in a row is 4,200,000:1. We wondered how he arrived at this number. What is the probability that a specified set of six numbers will repeat consecutively?

There are <math>{42 \choose 6} = 5245786</math> different sets of six numbers and the probability that a SPECIFIED set will occur in the next two consecutive draws is <math>1/5245786^2</math>. The probability that SOME set will occur in the next two consecutive draws is <math>5245786 \times 1/5245786^2 = 1/5245786</math>.

But now, suppose the lottery has been running continuously for <math>m</math> draws and we ask what the chance is that during this period there were consecutive draws of the same set. As before, first consider a fixed set of six numbers.

There are <math>m-1</math> opportunities for this set to be drawn twice in succession (beginning with the second drawing). The probability that this will happen is then the probability of the union <math>P(A) = P(\cup_i A_i A_{i+1}) </math> where <math>A_i</math> is the event that this set of numbers is drawn on the ith draw.

Bonferroni's inequality gives the upper bound <math>P(A) \le \sum_i P(A_i A_{i+1})</math> while Hunter's inequality gives the lower bound <math>P(A) \ge \sum_i P(A_i A_{i+1}) - \sum_i P(A_i A_{i+1}A_{i+2}).</math>

We assume (!) that the events <math>A_i</math> are independent and identically distributed with probability <math>p = 1/5245786</math> leading to <math>(m-1) p^2 - (m-2) p^3 \le P(A) \le (m-1) p^2</math>. Since <math>p</math> is very small the <math>p^3</math> term can be ignored giving <math>P(A) \approx (m-1)/5245786^2.</math>

It appears that the draws are held twice per week so for one year <math>m = 104</math> giving the probability <math>3.74 \times 10^{-12}</math> that a specified set of numbers will be drawn twice in succession. According to a spokeswoman the lottery has been taking place for 52 years. Using <math>m = 104 \times 52 = 5408</math>, the probability that a specified set of numbers will be drawn twice in succession over this period is <math>(5408-1)/5245786^2 \approx 1.96 \times 10^{-10}</math>, still very small.
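
The bounds for a specified set can be evaluated directly. This Python sketch is only an illustration; the names N_SETS and specified_set_bounds are hypothetical, and the draws are assumed independent and uniform over all sets, as in the text.

```python
from math import comb

N_SETS = comb(42, 6)  # number of distinct six-number sets


def specified_set_bounds(m, n_sets=N_SETS):
    """Bounds (m-1)p^2 - (m-2)p^3 and (m-1)p^2 with p = 1/n_sets."""
    p = 1 / n_sets
    upper = (m - 1) * p**2
    lower = upper - (m - 2) * p**3
    return lower, upper


draws_per_year = 104  # two draws per week, as assumed in the text
for years in (1, 52):
    m = draws_per_year * years
    print(m, specified_set_bounds(m))
```

The two bounds agree to many significant digits, which is why the <math>p^3</math> term can be dropped.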

But now let's ask the question, not for a fixed set of numbers but for some set of numbers. After all, in discussing this coincidence, the repeated set arises by chance alone and is not specified in advance.

In <math>m</math> drawings, what is the probability that SOME set of six numbers will be repeated in consecutive draws?

There are 5245786 possible sets of numbers that could be repeated. Enumerate the sets by integers <math>1 \le k \le 5245786</math> with <math>E_k</math> the event that set <math>k</math> repeats consecutively sometime during these <math>m</math> drawings. The probability of the union <math>P(\cup E_k)</math> is needed. Each of the 5245786 events <math>E_k</math> has probability <math>(m-1)/ 5245786^2</math> and if they were independent we could evaluate the probability using complements as <math>P(\cup E_k) = 1 - (1- (m-1)/5245786^2)^{5245786} \approx 1 - e^{-(m-1)/5245786}</math>. The events are dependent, however, but as long as <math>m</math> is small relative to 5245786, Bonferroni's and Hunter's bounds can once again be used to estimate <math>P(\cup E_k) \approx (m-1)/5245786.</math> For <math>m = 5408</math> this is 0.0010307. (Note that assuming independence gives 0.0010302.)
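
The two estimates for a single lottery can be compared numerically. In this Python sketch the function names are hypothetical; some_set_linear is the first-order estimate <math>(m-1)/5245786</math> and some_set_independent is the complement formula that treats the events <math>E_k</math> as if they were independent.

```python
import math

N_SETS = math.comb(42, 6)  # number of distinct six-number sets


def some_set_linear(m, n_sets=N_SETS):
    """First-order estimate (m-1)/n_sets of P(some set repeats consecutively)."""
    return (m - 1) / n_sets


def some_set_independent(m, n_sets=N_SETS):
    """1 - (1 - (m-1)/n_sets^2)^n_sets, pretending the E_k are independent."""
    q = (m - 1) / n_sets**2
    # log1p/expm1 keep precision when q is tiny.
    return -math.expm1(n_sets * math.log1p(-q))


m = 104 * 52  # 52 years of twice-weekly draws
print(some_set_linear(m), some_set_independent(m))
```

The two values differ only in the seventh decimal place, so the independence shortcut does little harm at this scale.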

This probability relates to one lottery. Suppose we consider all lotteries worldwide and ask for the probability that in some lottery, somewhere, some set of numbers will be repeated consecutively. All lotteries are variants of Toto with different numbers involved. Each lottery will have had its own cumulative number of drawings. In order to gauge the magnitude of the probability wanted, assume that there are <math>x</math> lotteries, each one sharing the same numerical characteristics as the Bulgarian one.

This time we can use independence. The probability that some set will be repeated is 1 minus the probability that in no lottery is a set of numbers selected on two consecutive drawings <math>= 1 - (1 - (m -1)/5245786)^x</math>. For <math>x = 50</math> this is 0.0503 while for <math>x = 100</math> the probability is 0.0980. (An approximation to one significant digit for this range of values of interest is <math>x(m-1)/5245786.</math>)
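
A small Python sketch of this last calculation follows. The function name prob_some_lottery_repeats is hypothetical, and every lottery is assumed to be a 6 of 42 game with the same <math>m = 5408</math> draws, which is the simplification made in the text rather than a description of real lotteries.

```python
from math import comb

N_SETS = comb(42, 6)  # number of distinct six-number sets


def prob_some_lottery_repeats(x, m=5408, n_sets=N_SETS):
    """P(at least one of x independent, identical lotteries has a consecutive repeat)."""
    q = (m - 1) / n_sets  # single-lottery estimate from the text
    return 1 - (1 - q) ** x


for x in (50, 100):
    # Second value printed is the linear approximation x(m-1)/n_sets.
    print(x, prob_some_lottery_repeats(x), x * (5408 - 1) / N_SETS)
```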

For a different problem that discusses "very big numbers" see the article about a double lottery winner.

Questions.

1. Instead of Hunter's lower bound, what would the second Bonferroni bound give?

2. How many years would the Bulgarian lottery need to be running in order to have the same probability that some set of numbers will appear three times in succession?

3. Instead of demanding that the same set of numbers appear twice in succession, what is the probability that some set of numbers will repeat during <math>m</math> drawings? (This is simpler and is the famous birthday problem.)

4. The second application of Hunter's bound requires estimating <math>\sum P( E_{k} E_{k+1} )</math> which involves terms of the form <math>P(A_i A_{i+1} B_j B_{j+1} )</math> where <math>A_i</math> is the event that the set <math>k</math> occurs on draw <math>i</math> and <math>B_j</math> is the event that the set <math>k+1</math> occurs on draw <math>j</math>. Each of these terms has probability <math>1/5245786^4</math>. Count the number of terms to validate the claim that <math>P(\cup_k E_k) \approx (m-1)/5245786</math>.
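
As a numerical companion to question 3, the birthday-problem probability can be computed with a few lines of Python. The function name prob_any_repeat is hypothetical, and the sketch assumes independent draws that are uniform over all <math>{42 \choose 6}</math> sets.

```python
from math import comb, expm1, log1p

N_SETS = comb(42, 6)  # number of distinct six-number sets


def prob_any_repeat(m, n_sets=N_SETS):
    """P(some set is drawn at least twice, not necessarily consecutively, in m draws)."""
    if m > n_sets:
        return 1.0  # pigeonhole: a repeat is certain
    # log of P(no repeat) = sum over draws of log(1 - i/n_sets)
    log_no_repeat = sum(log1p(-i / n_sets) for i in range(m))
    return -expm1(log_no_repeat)


print(prob_any_repeat(104), prob_any_repeat(5408))
```

Working with logarithms avoids multiplying thousands of factors that are each very close to 1.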

Baby, it’s cold outside

“New Light on the Plight of Winter Babies”
by Justin Lahart, The Wall Street Journal, September 22, 2009

Two Notre Dame economists “may have uncovered an overlooked explanation for why season of birth matters” with respect to the often reported poor test results, less healthiness, reduced longevity, and lower school completion rates and earnings of children born in the winter. See “Season of Birth and Later Outcomes: Old Questions, New Answers”, by Kasey Buckles and Daniel Hungerman, National Bureau of Economic Research, December 2008.

Working independently, Hungerman found that “children in the same families tend to be born at the same time of year,” and Buckles found a “tendency that less educated mothers were having children in winter.” They put their heads together and concluded that:

A key assumption of much of [the previous] research is that the backgrounds of children born in the winter are the same as the backgrounds of children born at other times of the year.

Some previous explanations for seasonal birth differences were school attendance laws, the amount of sunshine available in a season, or the level of pesticides in the water in a season. With respect to the first explanation, economists Joshua Angrist of MIT and Alan Krueger of Princeton posited in 1991 that, since winter babies can drop out of school earlier because they reach their 16th birthdays earlier, those babies have lower education levels that, in turn, lead to lower earnings.

Upon examination of CDC birth-certificate data for virtually all 52 million children born during the period 1989-2001, the Notre Dame researchers noted:

The percentage of children born to unwed mothers, teenage mothers and mothers who hadn't completed high school kept peaking in January every year. Over the 13-year period, for example, 13.2% of January births were to teen mothers, compared with 12% in May -- a small but statistically significant difference, they say.

A Columbia University economist comments about how striking the Notre Dame results are: "You can take a look at those graphs and see the clear pattern and that it's remarkably stable over time." See graphs [4] of January and May births with respect to birth mother’s marital status, age, and education.

Angrist disagrees, stating "The bottom line is a slight change in the estimate. …. It hardly overturns our finding."

Buckles and Hungerman are now working on finding an explanation of why a mother’s socioeconomic status is related to a child’s birth month.

(As of September 24, there were 298 blogs [5] responding to this article!)

U.S. Census: 2008 sampling results released

The U.S. Census Bureau has released the 2008 results of its ongoing "American Community Survey".