Chance News 20
Like dreams, statistics are a form of wish fulfillment. - Jean Baudrillard
According to an article in the WSJ criticizing alternative medicine by Dr. Jerome Groopman of the Harvard Medical School, a framed quotation hangs on the office wall of Dr. Stephen Straus, who directs NCCAM (formerly the Office of Alternative Medicine, within the National Institutes of Health): "The plural of anecdote is not evidence." This useful and insightful aphorism appears in various versions, as can be seen at this website.
"People who live longer have a greater chance of developing cancer in old age." Heard on the "Today" news programme on BBC Radio 4 and reported to the MEDSTATS discussion group by Ted Harding.
The next two Forsooths are from the September RSS NEWS.
The number of motorists willing to pay to travel on Britain's roads is falling, a survey out today reveals. More than one in four drivers were willing to pay to use city centre roads in 2002, but that figure fell to just 36 per cent in 2005, a study for the RAC said.
16 March 2006
At present, Labour has a majority of 64, which means it holds 32 more seats than the other parties combined.
Times Online, 20 March 2006
A car talk puzzle
Week of 08-21-07
The bullet holes were all over the place on the R.A.F. planes -- in the wings and the fuselage, and seemingly distributed randomly on the undersides. So, where did the R.A.F. mathematician recommend extra armor, to save future missions?
A clumsy attempt at anonymization
A Face is Exposed for AOL Searcher No. 4417749. Michael Barbaro and Tom Zeller, Jr. The New York Times (August 9, 2006).
Statisticians frequently deal with confidentiality issues when deciding what type of data and what amount of detail should be withheld to protect sensitive information about individual patients or institutions. It's not an easy task and there are some subtle traps. And sometimes there are not so subtle traps.
At the request of some researchers, America Online (AOL) released data on 20 million web searches performed by 650 thousand AOL users over a three-month span. They released the data not just to those researchers, but to the general public. AOL quickly realized that this was a bad idea and removed the database, but it had already been copied to many locations. It is unlikely that they will ever be able to persuade the owners of all those other sites to take the files offline.
The data was anonymized by replacing the user name with a random number. This is important, because some of the search terms are for rather sensitive items. Examples of things that people searched on are
- "can you adopt after a suicide attempt" or
- "how to tell your family you're a victim of incest."
But replacing a name by a number did not come even close to anonymizing all of the records. The problem is that people do web searches that reveal hints about themselves. Actual searches listed in the database included things like geographic locations:
- "gynecology oncologists in new york city,"
- "orange county california jails inmate information,"
- "employment needed- louisville ky," or
- "salem probate court decisions,"
or places where the searchers shopped or banked or got health care,
- "gerards restaurant in dc,"
- "st. margaret's hospital washington d.c.,"
- "l&n federal credit union," or
- "mustang sally gentlemans club,"
or products that the searchers owned,
- "cheap rims for a ford focus," or
- "how to change brake pads on scion xb,"
or their hobbies,
- "knitting stitches," or
- "texas hold'em poker on line seminars."
It gets even more revealing when people do web searches on their relatives or even themselves.
These individual searches are, according to one report, like individual pieces in a mosaic. Put enough of them together and you can get a really clear picture of who the searcher is. Can you actually identify people from their web searches? The answer is yes.
According to the New York Times report, one user, with the id number 4417749, searched for
- "landscapers in Lilburn, Ga," and
- "homes sold in shadow lake subdivision gwinnett county georgia,"
as well as the names of several people, all of whose last names were Arnold. It didn't take long for the New York Times to track down a 62-year-old widow named Thelma Arnold.
Ms. Arnold, who agreed to discuss her searches with a reporter, said she was shocked to hear that AOL had saved and published three months’ worth of them. “My goodness, it’s my whole personal life,” she said. "I had no idea somebody was looking over my shoulder."
This is an important lesson that statisticians have been aware of for some time. An individual piece of information by itself may not compromise someone's privacy, but will do so when it is combined with other pieces of information. Knowing that someone lives in a small town still preserves anonymity, but when that small town name appears in a database of all pediatric heart transplant cases, you have a problem.
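The point about combining pieces of information can be sketched in a few lines of Python. The records below are invented for illustration (they do not come from the AOL data or any real registry), and the helper names `group_sizes` and `unique_combinations` are hypothetical. The sketch counts how many records share each combination of fields; a combination that matches exactly one record identifies that record.

```python
from collections import Counter

# Made-up records: (town, age band, condition). Purely illustrative.
records = [
    ("Springfield", "0-9", "heart transplant"),
    ("Springfield", "0-9", "asthma"),
    ("Springfield", "10-19", "asthma"),
    ("Smallville", "0-9", "heart transplant"),
    ("Smallville", "30-39", "asthma"),
    ("Springfield", "0-9", "heart transplant"),
]


def group_sizes(rows, fields):
    """Count how many records share each combination of the chosen fields."""
    return Counter(tuple(row[i] for i in fields) for row in rows)


def unique_combinations(rows, fields):
    """Return the field combinations that match exactly one record."""
    return [combo for combo, n in group_sizes(rows, fields).items() if n == 1]


# Field indices: 0 = town, 1 = age band, 2 = condition.
by_town = unique_combinations(records, [0])
by_all = unique_combinations(records, [0, 1, 2])

print("unique by town alone:", by_town)
print("unique by town + age band + condition:", by_all)
```

In this toy data no town is unique on its own, but several combinations of town, age band and condition point to a single record. That is the same pattern as the small town appearing in a transplant database.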
1. List some of the other things that people might search on that would potentially reveal their identities.
2. Could this data set be cleaned up to the point where it could be truly thought to be anonymized?
3. Why would a researcher be interested in what people search for on the Internet? What sort of information would be useful for someone in Marketing?
Submitted by Steve Simon
Mean vs. Median
Who's Counting: It's Mean to Ignore the Median
ABCNews.com, 6 August 2006
John Allen Paulos
This latest installment of "Who's Counting" focuses on the distinction between the mean and median. Paulos begins with the familiar example of housing prices, and goes on to discuss the implications for interpreting newly released data on the performance of the US economy for 2004. Republicans point out that the economy grew at a rate of 4.2%, and complain that they are not getting enough credit for the good news. Democrats counter that real median income is falling and poverty is rising. How can both be true? Just as a few expensive houses in a neighborhood can pull the mean substantially above the median, gains by a wealthy few at the top of the income ladder can pull up the mean, even if most people are not benefiting.
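A small numerical sketch shows how the mean and the median can move in opposite directions. The incomes below are invented for illustration and are not taken from any economic data. The top earner gains 50% while everyone else loses 2%.

```python
from statistics import mean, median

# Hypothetical incomes in thousands of dollars, invented for illustration.
before = [20, 30, 40, 50, 60, 70, 80, 90, 100, 1000]

# One year later: the top earner gains 50%, everyone else loses 2%.
after = [round(x * 0.98, 2) for x in before[:-1]] + [before[-1] * 1.5]


def pct_change(old, new):
    """Percentage change from old to new."""
    return 100.0 * (new - old) / old


mean_change = pct_change(mean(before), mean(after))
median_change = pct_change(median(before), median(after))

print(f"mean:   {mean(before):.2f} -> {mean(after):.2f} ({mean_change:+.1f}%)")
print(f"median: {median(before):.2f} -> {median(after):.2f} ({median_change:+.1f}%)")
```

Here the mean rises sharply even though nine of the ten incomes fell, while the median falls by 2%. Both the "growth" story and the "falling median income" story are accurate descriptions of the same list.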
To show that this is happening, Paulos cites work on income distribution by economists Thomas Piketty and Emmanuel Saez. According to their calculations, the richest one percent, whose incomes exceed $315,000, gained on average nearly 17% over the year in question. However, the good news did not extend very far down the income distribution. Looking at the top five percent of all incomes, the average gain is described as "minimal." This means that the gains were concentrated near the very top. In fact, even among the top one percent, Piketty and Saez found that half of income gains went to the top tenth of the group.
Paulos points out that the pattern of the income distribution can be described mathematically in terms of so-called "power laws," which apply to a variety of observed phenomena, including Internet surfing and investing. A general description of power laws from Wikipedia can be found here.
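As a rough illustration of what a power law implies, the sketch below uses the standard closed-form results for a Pareto distribution with minimum value x_min and shape alpha: the median is x_min * 2^(1/alpha), the mean is alpha * x_min / (alpha - 1) for alpha > 1, and the share of the total held by the top fraction p is p^(1 - 1/alpha). The shape value used here is an assumption chosen to reproduce the familiar "80/20" pattern; it is not estimated from any income data.

```python
import math


def pareto_median(x_min, alpha):
    """Median of a Pareto distribution with minimum x_min and shape alpha."""
    return x_min * 2 ** (1 / alpha)


def pareto_mean(x_min, alpha):
    """Mean of a Pareto distribution; finite only when alpha > 1."""
    if alpha <= 1:
        raise ValueError("the mean is infinite for alpha <= 1")
    return alpha * x_min / (alpha - 1)


def top_share(p, alpha):
    """Share of the total held by the top fraction p (requires alpha > 1)."""
    return p ** (1 - 1 / alpha)


# Assumed shape: the value for which the top 20% hold 80% of the total.
alpha = math.log(5) / math.log(4)
x_min = 1.0  # arbitrary unit

print(f"alpha  = {alpha:.3f}")
print(f"median = {pareto_median(x_min, alpha):.2f}")
print(f"mean   = {pareto_mean(x_min, alpha):.2f}")
for p in (0.20, 0.01, 0.001):
    print(f"top {p:.1%} hold {top_share(p, alpha):.1%} of the total")
```

With a heavy tail like this, the mean sits far above the median, and a very small fraction at the top accounts for a large share of the total.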
Submitted by Bill Peterson
A Reader's Guide to Polls
Precisely False vs. Approximately Right: A Reader's Guide To Polls
The New York Times, August 27, 2006, The Public Editor
Jack Rosenthal, a former New York Times senior editor filling in as the guest "Public Editor", is concerned that the media often report the outcomes of a poll without explaining how the poll should be interpreted and without alerting readers when there are serious problems with the way the poll was carried out. He provides the following example:
Last March, the American Medical Association reported an alarming rate of binge drinking and unprotected sex among college women during spring break. The report was based on a survey of "a random sample" of 644 women and supplied a scientific-sounding "margin of error of +/– 4.00 percent." Television, columnists and comedians embraced the racy report. The New York Times did not publish the story, but did include some of the data in a chart.
The sample, it turned out, was not random. It included only women who volunteered to answer questions — and only a quarter of them had actually ever taken a spring break trip. They hardly constituted a reliable cross section, and there is no way to calculate a margin of sampling error for such a "sample."
For more information about this AMA survey, Rosenthal refers readers to the polling blog Mystery Pollster, maintained by Mark Blumenthal, a pollster for the Democratic Party. Here we read:
Cliff Zukin, the current president of the American Association for Public Opinion Research (AAPOR), saw the survey results printed in the Times, and wondered about how the survey had been conducted. He contacted the AMA and was referred to the methodology section of their online release. He saw the following description (which has since been scrubbed):
The American Medical Association commissioned the survey. Fako & Associates, Inc., of Lemont, Illinois, a national public opinion research firm, conducted the survey online. A nationwide random sample of 644 women age 17 - 35 who currently attend college, graduated from college or attended, but did not graduate from college within the United States were surveyed. The survey has a margin of error of +/- 4.00 percent at the 95 percent level of confidence [emphasis added].
Zukin then contacted Janet Williams at the AMA asking for more details on how the study was carried out. She responded:
The poll was conducted in the industry standard for internet polls -- this was not academic research -- it was a public opinion poll that is standard for policy development and used by politicians and nonprofits.
Zukin replied:
I'm very troubled by this methodology. As an opt-in non-probability sample, it lacks scientific validity in that your respondents are not generalizable to the population you purport to make inferences about. As such the report of the findings may be seriously misleading. I do not accept the distinction you make between academic research and a "public opinion" survey.
In her reply Williams said:
As far as the methodology, it is the standard in the industry and does generalize for the population. Apparently I need to reiterate that this is not an academic study and will [not ?] be published in any peer reviewed journal; this is a standard media advocacy tool that is regularly used by the American Lung Association, American Heart Association, American Cancer Society and others.
Rosenthal gives another example:
Another example surfaced last week in The Wall Street Journal. It examined a “landmark survey,” conducted for liquor retailers, claiming to show that “millions of kids” buy alcohol online. A random sample? The pollster paid the teenage respondents and included only Internet users.
This survey is critiqued in Carl Bialik's "Numbers Guy" column in the Wall Street Journal Online, August 18, 2006.
Such misrepresentations help explain why The Times recently issued a seven-page paper on polling standards for editors and reporters. "Keeping poorly done survey research out of the paper is just as important as getting good survey research into the paper," the document said.
Rosenthal says "readers, too, need to know something about polls--at least enough to sniff out good polls from bad" and so he provides a brief reader's guide. This includes understanding margin of error and being aware of problems in the way the questions are asked such as: use of double negatives, the order of the questions, the effect of strength of feeling about an issue etc.
Mystery Pollster remarks that the Times document on polling standards is apparently not public, whereas ABC News has made its standards public in its report ABC News' Polling Methodology and Standards, and suggests that the Times should also make its standards for editors and reporters public.
(1) The first item in the Reader's guide is to beware of too much precision. The following example is given:
A recent Zogby Interactive poll, for instance, showed that the candidates for the Senate in Missouri were separated by 3.8 percentage points. Yet the stated margin of sampling error meant the difference between the candidates could be seven points. The survey would have to interview unimaginably many thousands for that zero point eight to be useful.
Why should we beware of too much precision?
(2) The second item deals with sampling error. We read:
The Times and other media accompany poll reports with a box explaining how the random sample was selected and stating the sampling error. Error is actually a misnomer. What this figure actually describes is a range of approximation.
For a typical election sample of 1,000, the error rate is plus or minus three percentage points for each candidate, meaning that a 50-50 race could actually differ by 53 to 47.
Do you agree that the error in "sampling error" is a misnomer? Do you see anything wrong with the second sentence?
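One way to check the figures in item (2) is the usual large-sample formula for a 95% margin of error, z * sqrt(p(1 - p)/n). The sketch below assumes simple random sampling, which real polls only approximate, and the function names are hypothetical. It also computes the margin for the lead (the difference between two candidates' shares in the same sample), whose variance is (p1 + p2 - (p1 - p2)^2)/n.

```python
import math

Z_95 = 1.96  # normal quantile for a 95% confidence level


def margin_of_error(p, n, z=Z_95):
    """Margin of error for one proportion p from a simple random sample of n."""
    return z * math.sqrt(p * (1 - p) / n)


def margin_of_lead(p1, p2, n, z=Z_95):
    """Margin of error for p1 - p2 when both shares come from the same sample."""
    variance = (p1 + p2 - (p1 - p2) ** 2) / n
    return z * math.sqrt(variance)


n = 1000
single = margin_of_error(0.5, n)
lead = margin_of_lead(0.5, 0.5, n)

print(f"n = {n}: each share   +/- {100 * single:.1f} points")
print(f"n = {n}: lead (p1-p2) +/- {100 * lead:.1f} points")
```

For a two-way race near 50-50, the margin on the lead is about twice the margin on each candidate's share. This is worth keeping in mind when reading both the Zogby example and the "53 to 47" sentence.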
(3) Rosenthal says:
There’s also a formula for calculating the error in comparing one survey with another. For instance, last May, a Times/CBS News survey found that 31 percent of the public approved of President Bush’s performance; in the survey published last Wednesday, the number was 36 percent. Is that a real change? Yes. After adjustment for comparative error, the approval rating has gained by at least one point.
What was the sample size?
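The "formula for calculating the error in comparing one survey with another" is presumably the standard one for the difference between two independent proportions: z * sqrt(p1(1 - p1)/n1 + p2(1 - p2)/n2). The sketch below applies it to the 31 and 36 percent approval figures from the quotation. The sample sizes are hypothetical, since the article does not state them, and equal sizes in both surveys are assumed.

```python
import math


def margin_of_change(p1, n1, p2, n2, z=1.96):
    """95% margin of error for p2 - p1 from two independent random samples."""
    return z * math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)


p_may, p_aug = 0.31, 0.36  # approval ratings quoted in the article
change = p_aug - p_may

# Hypothetical sample sizes; the same size is assumed for both surveys.
for n in (500, 1000, 1500, 2000):
    m = margin_of_change(p_may, n, p_aug, n)
    print(f"n = {n:4d}: change = {100 * change:.0f} points, "
          f"margin = +/- {100 * m:.1f}, lower bound = {100 * (change - m):.1f}")
```

Reading down the table shows how the lower bound on the change depends on the sample size, which is one way to approach the question above.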
Submitted by Laurie Snell
One Million Ways to Die, Ryan Singel, Wired.com, 11 Sep 2006.
This on-line article compares official mortality data with the number of Americans who have been killed inside the United States by terrorism since 1995. It highlights that many threats are far more likely to kill an American than any terrorist -- at least, statistically speaking. For example, it claims that your appendix is more likely to kill you than al-Qaida is.
The rankings are:
SEVERE
- Driving off the road: 254,419
- Falling: 146,542
- Accidental poisoning: 140,327
HIGH
- Dying from work: 59,730
- Walking down the street: 52,000
- Accidentally drowning: 38,302
ELEVATED
- Killed by the flu: 19,415
- Dying from a hernia: 16,742
GUARDED
- Accidental firing of a gun: 8,536
- Electrocution: 5,171
LOW
- Being shot by law enforcement: 3,949
- Terrorism: 3,147
- Carbon monoxide in products: 1,554
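To put a few of the 11-year totals above on a more familiar scale, the sketch below converts them to average annual deaths per 100,000 people. The population figure of 280 million is an assumption (a rough US population for the period, not a number from the article), and using a single population for every cause ignores differences in exposure.

```python
# 11-year totals (1995 through 2005) copied from the rankings above.
deaths = {
    "Driving off the road": 254_419,
    "Falling": 146_542,
    "Accidental poisoning": 140_327,
    "Killed by the flu": 19_415,
    "Electrocution": 5_171,
    "Terrorism": 3_147,
}

YEARS = 11
POPULATION = 280_000_000  # assumption: rough US population, not from the article


def annual_rate_per_100k(total, years=YEARS, population=POPULATION):
    """Average annual deaths per 100,000 people over the period."""
    return total / years / population * 100_000


rates = {cause: annual_rate_per_100k(total) for cause, total in deaths.items()}

for cause, rate in sorted(rates.items(), key=lambda item: -item[1]):
    print(f"{cause:22s} {rate:6.2f} per 100,000 per year")
```

A crude rate like this treats everyone as equally exposed to every threat, so it is only a starting point for thinking about how the figures might be standardised.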
- The rankings are based on the number of mortalities in each category throughout the 11-year period spanning 1995 through 2005 (extrapolated from best available data). What issues might arise from extrapolation of data? Is the past data a good guide to future exposures for all of these risks?
- Are the underlying populations from which the data are compiled really comparable? If you think the exposures to risk vary by threat, what adjustments might be made to standardise the data?
- Why do you think the risk from certain threats is perceived to be greater or less than the statistics suggest?
- If these point estimates included some estimates of variation, such as a full probability distribution, what differences might you expect to see between such distributions? Do you think that that extra information might influence your perception of risk, or even how you might define risk in the first place?
- National Highway and Safety Agency (.pdf)
- National Vital Statistics Reports, Vol. 50, No. 15 (09/16/2002) (.pdf)
- US Consumer Product Safety Commission
- the Insurance Information Institute.
Submitted by John Gavin.