Chance News 86

From ChanceWiki
Revision as of 13:44, 9 July 2012 by Simon66217 (Talk | contribs) (Crowdsourcing and its failed prediction on health care law)

Jump to: navigation, search


"Asymptotically we are all dead."

--paraphrase of Keynes, sometimes attributed to Melvin R. Novick

Submitted by Paul Alper

"To err is human, to forgive divine but to include errors in your design is statistical."

--Leslie Kish, in Chance, statistics, and statisticians, 1977 ASA Presidential Address

Submitted by Bill Peterson

"The only statistical test one ever needs is the IOTT or 'interocular trauma test.' The result just hits one between the eyes. If one needs any more statistical analysis, one should be working harder to control sources of error, or perhaps studying something else entirely."

--David H. Krantz, in The null hypothesis testing in psychology, JASA, December 1999, p. 1373.

Krantz is describing how some psychologists view statistical testing. On the same page he describes another viewpoint:

"Nothing is due to chance. This is the Freudian stance...but fortunately, it has little support among researchers."

Submitted by Paul Alper


“The survey interviewed 991 Americans online from June 28-30. The precision of the Reuters/Ipsos online polls is measured using a credibility interval. In this case, the poll has a credibility interval of plus or minus 3.6 percentage points.” (emphasis added)

“Obamacare Support Rises After Supreme Court Ruling, Poll Finds”
Huffington Post, July 1, 2012

Submitted by Margaret Cibes


The science of ‘gaydar’
by Joshua A. Tabak and Vivian Zayas, New York Times, 3 June 2012

The definition of GAYDAR is the "Ability to sense a homosexual" according to

In their NYT article, Tabak and Zayas write

Should you trust your gaydar in everyday life? Probably not. In our experiments, average gaydar judgment accuracy was only in the 60 percent range. This demonstrates gaydar ability — which is far from judgment proficiency. But is gaydar real? Absolutely.

At PLoS One is their complete research paper where subjects viewed facial photographs of homosexuals and straights; the subjects then had a short time to decide the sexual orientation. In their first experiment, the faces were shown upright only:

Twenty-four University of Washington students (19 women; age range = 18–22 years) participated in exchange for extra course credit. Data from seven additional participants were excluded from analyses due to failure to follow instructions (n = 4) or computer malfunction (n = 3).

In the second experiment, faces were shown upright and upside-down and the subjects had a short time to decide the sexual orientation:

One hundred twenty-nine University of Washington students (92 women; age range = 18–25 years) participated in exchange for extra course credit. Data from 16 additional participants were excluded from analyses due to failure to follow instructions (n = 12) or average reaction times more than 3 SD above the mean (n = 4).

According to the authors, “there are two components of “accuracy”: the hit rate which is “the proportion of gay faces correctly perceived as gay, and the false alarm rate” which is “the proportion of straight faces incorrectly perceived as gay.” The figure reproduced below (full version here) indicates that the accuracy was better when the target gender was women as opposed to men and the accuracy was better when the faces were upright as opposed to upside down. Presumably, random guessing would produce an “accuracy” of about .5.

TabakZayas fig3.png
Figure 3. Accuracy of detecting sexual orientation from upright and upside-down faces (Experiment 2).
Mean accuracy (A′) in judging sexual orientation from faces presented for 50 milliseconds as a function of the target’s gender and spatial orientation (upright or upside-down; Experiment 2). Judgments of upright faces are based on both configural and featural processing, whereas judgments of upside-down faces are based only on featural face processing. Error bars represent ±1 SEM.

Reproduced below are the detailed results for the second experiment:

TabakZayas table1.png
Table 1. Hit and False Alarm Rates for Snap Judgments of Sexual Orientation in Experiment 2.

The author’s conclude with the following statement:

The present research is the first to demonstrate (a) that configural face processing significantly contributes to perception of sexual orientation, and (b) that sexual orientation is inferred more easily from women’s vs. men’s faces. In light of these findings, it is interesting to note the popular desire to learn to read faces like books. Considering how challenging it is to read a book upside-down, it seems that we read faces better than we read books.


1. Gaydar "accuracy" seems to be defined in the paper as hit rate / (hit rate + false alarm rate) or, to use terms common in medical tests, positive predictive value = true positive / (true positive + false positive). The paper makes no mention of negative predictive value = true negative / (true negative + false negative). As is illustrated in the Wikipedia article, legitimate medical tests will tolerate a low positive predictive value because a more expensive test exists in the rare case that the disease is actually present; negative predictive values must be high to avoid potentially deadly false optimism. The situation here is somewhat different because the subjects were exposed to an approximately equal number of gays and straights whereas in medical tests, most people in the population do not have the “disease.”

2. Perhaps the analogy with medical testing is inappropriate. That is, an error is an error and no distinction should be made between the two types of errors. Consider the above table for the case of women and upright spatial orientation. The hit rate is .36 and the false alarm rate is .22. If we assume that the 67 subjects viewed 100 gay faces and 100 straight faces, then we obtain the following table for average values:

Orientation + - Total
Gay 36 64 100
Straight 22 78 100
Total 58 142 200

This leads to Prob (success) = (36 + 78)/200 = .57; Prob (error) = (64 +22)/200 = .43 In effect, the model could be hidden tosses of a coin and the subjects, in an ESP fashion, guess heads or tails before the toss. Of course, a Bayesian would then assume a prior distribution and combine that with the results of the study to obtain a posterior probability and avoid any mention of p-value based on .5 as the null.

3. Why might the following be a useful way of assessing gaydar via a modification of the procedure used in the article? Repeat the second experiment with no gays.

4. Why might the following be a useful way of assessing gaydar via a modification of the procedure used in the article? Repeat the second experiment with no straights.

5. In case the reader feels that Discussion #3 and #4 are deceptive, see this Psychwiki which looks at the history of deception in social psychology:

deception can often be seen in the “cover story” for the study, which provides the participant with a justification for the procedures and measures used. The ultimate goal of using deception in research is to ensure that the behaviors or reactions observed in a controlled laboratory setting are as close to those behaviors and reactions that occur outside of the laboratory setting.

Submitted by Paul Alper

Coin experiments

“The Pleasures of Suffering for Science”, The Wall Street Journal, June 8, 2012

This story, about scientists experimenting on themselves, included a reference to a coin-tossing experiment:

Even mathematics offers an example of physical self-sacrifice, through repetitive stress injury. University of Georgia professor Pope R. Hill flipped a coin 100,000 times to prove that heads and tails would come up an approximately equal number of times. The experiment lasted a year. He fell sick but completed the count, though he had to enlist the aid of an assistant near the end.

A Google search for Prof. Hill turned up the following story at the “Weird Science” website:

If you repeatedly flip a coin, the law of probability states that approximately half the time you should get heads and half the time tails. But does this law hold true in practice?

Pope R. Hill, a professor at the University of Georgia during the 1930s, wanted to find out. But he thought coin-flipping was too imprecise a measurement, since any one coin might be imbalanced, causing it to favor heads or tails.
Instead, he filled a can with 200 pennies. Half were dated 1919, half dated 1920. He shook up the can, withdrew a coin, and recorded its date. Then he returned the coin to the can. He repeated this procedure 100,000 times!

Of the 100,000 draws, 50,145 came out 1920. 49,855 came out 1919. Hill concluded that the law of half and half does work out in practice.


1. Do you think that drawing a single coin from among 1919 and 1920 coins - even in a perfectly shaken can - would solve the problem of potential imbalance between heads and tails on any single coin toss? Can you think of any possible imbalance in the former case?
2. In the second story, there is a remarkable relationship between Hill’s final counts. What questions, if any, might it raise in your mind about the experiment?
3. Which is the more accurate expectation from a coin-tossing experiment: (a) “heads and tails would come up an approximately equal number of times” (first story) or (b) “approximately half the time you should get heads and half the time tails” (second story)?

Submitted by Margaret Cibes

Note: In the archives of the Chance Newsletter there is a description of some other historical attempts to empirically demonstrate the chance of heads. We read there:

The French naturalist Count Buffon (1707-1788), know to us from the Buffon needle problem, tossed a coin 4040 times with heads coming up 2048 or 50.693 percent of the time. Karl Pearson tossed a coin 24,000 times with head coming up 12,012 or 50.05 percent of the time. While imprisoned by the Germans in the second world war, South African mathematician John Kerrich tossed a coin 10,000 times with heads coming up 5067 or 50.67 percent of the time. You can find his data in the classic text, Statistics, by Freedman, Pisani and Purves.

New presidential poll may be outlier

“Bloomberg Poll Shows Big But Questionable Obama Lead”
Huffington Post, June 20, 2012

A Bloomberg News national poll shows Obama leading his Republican challenger by a “surprisingly large margin of 53 to 40 percent,” instead of the (at most) single-digit margin shown in other recent polls.

While a Bloomberg representative expressed the same surprise as others, she stated that this result is based on a sample with the same demographics as its previous polls and on its usual methodology.

The article’s author states:

The most likely possibility is that this poll simply represents a statistical outlier. Yes, with a 3 percent margin of error, its Obama advantage of 53 to 40 percent is significantly different than the low single-digit lead suggested by the polling averages. However, that margin of error assumes a 95 percent level of confidence, which in simpler language means that one poll estimate in 20 will fall outside the margin of error by chance alone.

See Bloomberg’s report about the poll here.

Submitted by Margaret Cibes

Further discussion from FiveThirtyEight

Outlier polls are no substitute for news
by Nate Silver, FiveThirtyEight blog, New York Times, 20 June 2012

Silver identifies two options for dealing with such a poll, which a number of news sources have describe as an "outlier." One could simply choose to disregard it, or else "include it in some sort of average and then get on with your life." He with the following excellent advice:

My general that you should not throw out data without a good reason. If cherry-picking the two or three data points that you like the most is a sin of the first order, disregarding the two or three data points that you like the least will lead to many of the same problems.

In the case of the Bloomberg poll, because the organization has a good record on accuracy, he has chosen to include it the overall average of poll results that he uses for FiveThirtyEight forecasts.

Description of one of the further adjustments that Silver makes in his model can be found in his later post Calculating ‘house effects’ of polling firms (22 June 2012). Silver explains that what often is interpreted as movement in public opinion as measured in two different polls can instead be a reflection of systematic tendencies of polling organizations to favor either Democratic or Republican candidates. Reproduced below is a chart from the post that shows the size and direction of this house effect for some major organizations:


As described there, "The house effect adjustment is calculated by applying a regression analysis that compares the results of different polling firms’ surveys in the same states...The regression analysis makes these comparisons across all combinations of polling firms and states, and comes up with an overall estimate of the house effect as a result." Looking at the table, it is interesting to note that these effects are comparable to the stated margin of sampling error for typical national polls.

Submitted by Bill Peterson

Rock-paper-scissors in Texas elections

“Elections are a Crap Shoot in Texas, Where a Roll of the Dice Can Win”
by Nathan Koppel, The Wall Street Journal, June 19, 2012

The state of Texas permits tied candidates to agree to “settle the matter by a game of chance.” The article describes instances of candidates using a die or a coin to decide an election.

In one case, “leaving nothing to chance, the city attorney drafted a three-page agreement ahead of time detailing how the flip would be conducted.”

However, not any game is permitted:

Tonya Roberts, city secretary for Rice … consulted the Texas secretary of state's office after a city-council race ended last month in a 25-25 tie. She asked whether the race could be settled with a game of "rock, paper, scissors" but was told no. "I guess some people do consider that a game of skill," she said.

For some suggested strategies for winning this game, see “How to Win at Rock, Paper, Scissors” in wikiHow, and/or “To win at rock-paper-scissors, put on a blindfold”, in Discover Magazine.


Assume that the use of the rock-paper-scissors game had not been suggested by one of the Rice candidates, who might have been an experienced player. Do you think that a one-time play of this game, between random strangers, could have been considered a game of chance? Why or why not?

Submitted by Margaret Cibes

NSF may stop funding a “soft” science

“Political Scientists Are Lousy Forecasters”
by Jacqueline Stevens, The New York Times, June 23, 2012

A Northwestern University political science professor has written an op-ed piece responding to a House-passed amendment that would eliminate NSF grants to political scientists. To date the Senate has not voted on the bill.

She provides several anecdotes about political scientists having made incorrect predictions and states that she is “sympathetic with the [group] behind this amendment.” She feels that:

[T]he government — disproportionately — supports research that is amenable to statistical analyses and models even though everyone knows the clean equations mask messy realities that contrived data sets and assumptions don’t, and can’t, capture. …. It’s an open secret in my discipline: in terms of accurate political predictions …, my colleagues have failed spectacularly and wasted colossal amounts of time and money. …. Many of today’s peer-reviewed studies offer trivial confirmations of the obvious and policy documents filled with egregious, dangerous errors. ….I look forward to seeing what happens to my discipline and politics more generally once we stop mistaking probability studies and statistical significance for knowledge.


1. The author makes a number of categorical statements based on anecdotal evidence. Could her conclusions about political science research be an example of the “availability heuristic/fallacy”?
2. Do you think that the problems the author identifies are limited to, or at least more common in, the area of political science than in the other "soft," or even any "hard," sciences? What information would you need in order to confirm/reject your opinion?

(Disclosure: The submitter's spouse is a political scientist, whose Ph.D. program, including stats, was entirely funded by a government act (National Defense Education Act), but who is also skeptical about some social science research.)

Submitted by Margaret Cibes at the suggestion of James Greenwood

Crowdsourcing and its failed prediction on health care law

When the Crowd Isn't Wise by David Leonhardt, The New York Times, July 7, 2012.

Many people were surprised at the U.S. Supreme Court ruling that upheld most aspects of the Affordable Care Act (ACA). That included a prominent source that relied on crowdsoucing. Intrade, an online prediction market. Intrade results indicated that the individual insurance mandate would be ruled unconstitutional. This prediction held steady in spite of some last minute rumors about the Supreme Court decision.

With the rumors swirling, I began to check the odds at Intrade, the online prediction market where people can bet on real-world events, several times a day. The odds had barely budged. They continued to show about a 75 percent chance that the law’s so-called mandate would be ruled unconstitutional, right up until the morning it was ruled constitutional.

The concept of crowdsourcing has been discussed on Chance News before. The basic idea is that individual experts have systematic biases, but these biases cancel out when averaged over a large number of people.

Crowdsourcing does have some notable successes, but it isn't perfect.

For one thing, many of the betting pools on Intrade and Betfair attract relatively few traders, in part because using them legally is cumbersome. (No, I do not know from experience.) The thinness of these markets can cause them to adjust too slowly to new information.
And there is this: If the circle of people who possess information is small enough — as with the selection of a vice president or pope or, arguably, a decision by the Supreme Court — the crowds may not have much wisdom to impart. “There is a class of markets that I think are basically pointless,” says Justin Wolfers, an economist whose research on prediction markets, much of it with Eric Zitzewitz of Dartmouth, has made him mostly a fan of them. “There is no widely available public information.”

So, should you return to the individual expert for prediction? Maybe not.

Mutual fund managers, as a class, lose their clients’ money because they do not outperform the market and charge fees for their mediocrity. Sports pundits have a dismal record of predicting games relative to the Las Vegas odds, which are just another market price. As imperfect as prediction markets are in forecasting elections, they have at least as good a recent record as polls. Or consider the housing bubble: both the market and most experts missed it.

Mr. Leonhardt offers a middle path.

The answer, I think, is to take the best of what both experts and markets have to offer, realizing that the combination of the two offers a better window onto the future than either alone. Markets are at their best when they can synthesize large amounts of disparate information, as on an election night. Experts are most useful when a system exists to identify the most truly knowledgeable — a system that often resembles a market.

This last sentence introduces the thought that you use crowdsourcing to find the best experts. Social media like Twitter allows people an interesting way to identify experts who are truly experts.

Think for a moment about what a Twitter feed is: it’s a personalized market of experts (and friends), in which you can build your own focus group and listen to its collective analysis about the past, present and future. An RSS feed, in which you choose blogs to read, works similarly. You make decisions about which experts are worthy of your attention, based both on your own judgments about them and on other experts’ judgments.
Their predictions now face a market discipline that did not always exist before the Internet came along. “Experts exist,” as Mr. Wolfers says, “but they’re not necessarily the same as the guys on TV.”


1. How bad did Intrade really perform on the Supreme Court decision on ACA? How large a probability does a system like Intrade have to place on a bad prediction to cause you to lose faith in it?

2. What are some of the potential problems with identifying experts by the number of retweets that they get in Twitter?