Chance News 20

Revision as of 20:08, 31 August 2006

Quotations

Like dreams, statistics are a form of wish fulfillment. - Jean Baudrillard

According to a WSJ article criticizing alternative medicine by Dr. Jerome Groopman of the Harvard Medical School, a framed quotation hangs on the office wall of Dr. Stephen Straus, who directs NCCAM (formerly the Office of Alternative Medicine, within the National Institutes of Health): "The plural of anecdote is not evidence." This useful and insightful aphorism appears in various versions, as can be seen at this website here.

Forsooth

"People who live longer have a greater chance of developing cancer in old age." Heard on the "Today" news programme on BBC Radio 4 and reported to the MEDSTATS discussion group by Ted Harding.

A clumsy attempt at anonymization

A Face is Exposed for AOL Searcher No. 4417749. Michael Barbaro and Tom Zeller, Jr. The New York Times (August 9, 2006).

Statisticians frequently deal with confidentiality issues when deciding what type of data and what amount of detail should be withheld to protect sensitive information about individual patients or institutions. It's not an easy task and there are some subtle traps. And sometimes there are not so subtle traps.

At the request of some researchers, America Online (AOL) released data on 20 million web searches performed by 650 thousand AOL users over a three-month span. They released the data, not just to those researchers, but to the general public. AOL quickly realized that this was a bad idea and removed the database, but it had already been copied to many locations. It is unlikely that they will ever be able to persuade the web owners at all the other locations to take the files offline.

The data was anonymized by replacing the user name with a random number. This is important, because some of the search terms are for rather sensitive items. Examples of things that people searched on are

- "can you adopt after a suicide attempt" or

- "how to tell your family you're a victim of incest."

But replacing a name by a number did not come even close to anonymizing all of the records. The problem is that people do web searches about things that reveal hints about themselves. Actual searches listed in the database included things like geographic locations:

- "gynecology oncologists in new york city,"

- "orange county california jails inmate information,"

- "employment needed- louisville ky," or

- "salem probate court decisions,"

or places where the searchers shopped or banked or got health care,

- "gerards restaurant in dc,"

- "st. margaret's hospital washington d.c.,"

- "l&n federal credit union," or

- "mustang sally gentlemans club,"

or products that the searchers owned,

- "cheap rims for a ford focus," or

- "how to change brake pads on scion xb,"

or their hobbies,

- "knitting stitches," or

- "texas hold'em poker on line seminars."

It gets even more revealing when people do web searches on their relatives or even themselves.

These individual searches are, according to one report, like individual pieces in a mosaic. Put enough of them together and you can get a really clear picture of who the searcher is. Can you actually identify people from their web searches? The answer is yes.

According to the New York Times report, one user, with the id number 4417749, searched for

- "landscapers in Lilburn, Ga," and

- "homes sold in shadow lake subdivision gwinnett county georgia,"

as well as the names of several people, all of whose last names were Arnold. It didn't take long for the New York Times to track down a 62-year-old widow named Thelma Arnold.

Ms. Arnold, who agreed to discuss her searches with a reporter, said she was shocked to hear that AOL had saved and published three months’ worth of them. “My goodness, it’s my whole personal life,” she said. "I had no idea somebody was looking over my shoulder."

This is an important lesson that statisticians have been aware of for some time. An individual piece of information by itself may not compromise someone's privacy, but it can do so when it is combined with other pieces of information. Knowing that someone lives in a small town still preserves anonymity, but when that small town name appears in a database of all pediatric heart transplant cases, you have a problem.
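The mosaic effect described above can be sketched in a few lines of Python. This is only an illustration: the six records, the field names, and the helper `candidates` are all invented here and have nothing to do with the actual AOL data. Each clue on its own leaves several people in the pool, while the combination leaves one.

```python
# Hypothetical toy population: every record is invented for illustration,
# and none of it comes from the AOL release.
population = [
    {"id": 1, "town": "Springfield", "car": "ford focus", "hobby": "knitting"},
    {"id": 2, "town": "Springfield", "car": "ford focus", "hobby": "poker"},
    {"id": 3, "town": "Springfield", "car": "scion xb", "hobby": "knitting"},
    {"id": 4, "town": "Shelbyville", "car": "ford focus", "hobby": "knitting"},
    {"id": 5, "town": "Shelbyville", "car": "scion xb", "hobby": "poker"},
    {"id": 6, "town": "Springfield", "car": "scion xb", "hobby": "poker"},
]


def candidates(records, **known):
    """Return the records consistent with every known attribute."""
    return [
        r for r in records
        if all(r[key] == value for key, value in known.items())
    ]


# Each clue alone is weak; together they single out one record.
for clues in (
    {"town": "Springfield"},
    {"car": "ford focus"},
    {"hobby": "knitting"},
    {"town": "Springfield", "car": "ford focus", "hobby": "knitting"},
):
    print(clues, "->", len(candidates(population, **clues)), "candidate(s)")
```

In a real search log the "attributes" are not tidy fields but free-text queries, so the narrowing is messier, but the logic is the same.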

Questions

1. List some of the other things that people might search on that would potentially reveal their identities.

2. Could this data set be cleaned up to the point where it could be truly thought to be anonymized?

3. Why would a researcher be interested in what people search for on the Internet? What sort of information would be useful for someone in Marketing?

Submitted by Steve Simon

Mean vs. Median

Who's Counting: It's Mean to Ignore the Median
ABCNews.com, 6 August 2006
John Allen Paulos

This latest installment of "Who's Counting" focuses on the distinction between the mean and median. Paulos begins with the familiar example of housing prices, and goes on to discuss the implications for interpreting newly released data on the performance of the US economy for 2004. Republicans point out that the economy grew at a rate of 4.2%, and complain that they are not getting enough credit for the good news. Democrats counter that real median income is falling and poverty is rising. How can both be true? Just as a few expensive houses in a neighborhood can pull the mean substantially above the median, gains by a wealthy few at the top of the income ladder can pull up the mean, even if most people are not benefiting.
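A small numerical sketch may make the point concrete. The incomes below are invented (in thousands of dollars) and are not taken from Paulos or from any real data set. The single top earner gains 17 percent while everyone else loses a little, so the mean rises even though the median falls.

```python
from statistics import mean, median

# Hypothetical incomes in thousands of dollars; invented for illustration.
before = [20, 30, 40, 50, 60, 70, 80, 90, 100, 1000]

# Assumed scenario: the top earner gains 17% (1000 -> 1170),
# and each of the other nine loses 1 (thousand dollars).
after = [19, 29, 39, 49, 59, 69, 79, 89, 99, 1170]

mean_growth = mean(after) / mean(before) - 1

print("mean:  ", mean(before), "->", mean(after))
print("median:", median(before), "->", median(after))
print("growth in mean: {:.1%}".format(mean_growth))
```

Here the mean grows by roughly ten percent while the median slips from 65 to 64, so both the "economy grew" and the "median income fell" descriptions are true of the same data.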

To show that this is happening, Paulos cites work on income distribution by economists Thomas Piketty and Emmanuel Saez. According to their calculations, the richest one percent, whose incomes exceed $315,000, gained on average nearly 17% over the year in question. However, the good news did not extend very far down the income distribution. Looking at the top five percent of all incomes, the average gain is described as "minimal." This means that the gains were concentrated near the very top. In fact, even among the top one percent, Piketty and Saez found that half of the income gains went to the top tenth of the group.

Paulos points out that the pattern of the income distribution can be described mathematically in terms of so-called "power laws," which apply to a variety of observed phenomena, including Internet surfing and investing. A general description of power laws from Wikipedia can be found here.

Submitted by Bill Peterson

A Reader's Guide to Polls

Precisely False vs. Approximately Right: A Reader's Guide To Polls
The New York Times, August 27, 2006, The Public Editor
Jack Rosenthal

Jack Rosenthal, a former New York Times senior editor filling in as the guest "Public Editor," is concerned that the media often report the outcomes of a poll without explaining how the poll should be interpreted and without alerting readers when there are serious problems with the way the poll was carried out. He provides the following example:

LAST March, the American Medical Association reported an alarming rate of binge drinking and unprotected sex among college women during spring break. The report was based on a survey of "a random sample" of 644 women and supplied a scientific-sounding "margin of error of +/– 4.00 percent." Television, columnists and comedians embraced the racy report. The New York Times did not publish the story, but did include some of the data in a chart.

The sample, it turned out, was not random. It included only women who volunteered to answer questions — and only a quarter of them had actually ever taken a spring break trip. They hardly constituted a reliable cross section, and there is no way to calculate a margin of sampling error for such a "sample."
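For reference, the textbook margin of sampling error for a proportion from a simple random sample can be sketched as follows. The function name `margin_of_error` and its defaults (worst case p = 0.5, z = 1.96 for 95 percent confidence) are choices made here, not anything taken from the AMA release. That the quoted "+/- 4.00 percent" came from a formula of this kind is an assumption. As the passage above stresses, the formula has no justification for a volunteer sample.

```python
import math


def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of sampling error for a proportion.

    Valid only under simple random sampling; p=0.5 is the worst case.
    """
    return z * math.sqrt(p * (1 - p) / n)


# n = 644 is the sample size quoted in the survey description.
print("n = 644:  +/- {:.1%}".format(margin_of_error(644)))
# n = 1000 is the "typical election sample" size.
print("n = 1000: +/- {:.1%}".format(margin_of_error(1000)))
```

For n = 644 this gives a little under four percentage points, which is close to the figure the survey reported. The number is easy to compute, but it is meaningful only when the respondents were actually chosen at random.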

Rosenthal refers readers to the Mystery Pollster, a polling blog, for more information about this survey. On this website we read:

Cliff Zukin, the current president of the American Association for Public Opinion Research (AAPOR), saw the survey results printed in the Times, and wondered about how the survey had been conducted. He contacted the AMA and was referred to the methodology section of their online release. He saw the following description (which has since been scrubbed):
The American Medical Association commissioned the survey. Fako & Associates, Inc., of Lemont, Illinois, a national public opinion research firm, conducted the survey online. A nationwide random sample of 644 women age 17 - 35 who currently attend college, graduated from college or attended, but did not graduate from college within the United States were surveyed. The survey has a margin of error of +/- 4.00 percent at the 95 percent level of confidence [emphasis added].

Zukin then contacted Janet Williams at the AMA asking for more details on how the study was carried out. She responded:

The poll was conducted in the industry standard for internet polls -- this was not academic research -- it was a public opinion poll that is standard for policy development and used by politicians and nonprofits.

Zukin replied:

I'm very troubled by this methodology. As an opt-in non-probability sample, it lacks scientific validity in that your respondents are not generalizable to the population you purport to make inferences about. As such the report of the findings may be seriously misleading. I do not accept the distinction you make between academic research and a "public opinion" survey.

In her reply Williams said:

As far as the methodology, it is the standard in the industry and does generalize for the population. Apparently I need to reiterate that this is not an academic study and will [not ?] be published in any peer reviewed journal; this is a standard media advocacy tool that is regularly used by the American Lung Association, American Heart Association, American Cancer Society and others.

We recommend reading the full discussion, Part 1 and Part 2, of The AMA Spring Break Survey.
Rosenthal gives another example, writing:

Another example surfaced last week in The Wall Street Journal. It examined a “landmark survey,” conducted for liquor retailers, claiming to show that “millions of kids” buy alcohol online. A random sample? The pollster paid the teenage respondents and included only Internet users.

The survey is criticized in Carl Bialik's "Numbers Guy" column in the Wall Street Journal Online, August 18, 2006.
Rosenthal remarks:

Such misrepresentations help explain why The Times recently issued a seven-page paper on polling standards for editors and reporters. "Keeping poorly done survey research out of the paper is just as important as getting good survey research into the paper," the document said.

He then says "readers, too, need to know something about polls--at least enough to sniff out good polls from bad" and so he provides a brief reader's guide. This includes understanding margin of error and being aware of problems in the way the questions are asked, such as the use of double negatives, the order of the questions, the effect of strength of feeling about an issue, etc.
The MysteryPollster remarks that ABC has made their standards public in their report ABC News' Polling Methodology and Standards and suggests that the Times should also make their paper available to the public.

Discussion questions

(1) The first item in the Reader's guide is to beware of too much precision. The following example is given:
A recent Zogby Interactive poll, for instance, showed that the candidates for the Senate in Missouri were separated by 3.8 percentage points. Yet the stated margin of sampling error meant the difference between the candidates could be seven points. The survey would have to interview unimaginably many thousands for that zero point eight to be useful.
Why should we beware of too much precision?
(2) The second item deals with sampling error. We read
The Times and other media accompany poll reports with a box explaining how the random sample was selected and stating the sampling error. Error is actually a misnomer. What this figure actually describes is a range of approximation.
and

For a typical election sample of 1,000, the error rate is plus or minus three percentage points for each candidate, meaning that a 50-50 race could actually differ by 53 to 47.

Comment on these two statements.
(3) We also read:

There’s also a formula for calculating the error in comparing one survey with another. For instance, last May, a Times/CBS News survey found that 31 percent of the public approved of President Bush’s performance; in the survey published last Wednesday, the number was 36 percent. Is that a real change? Yes. After adjustment for comparative error, the approval rating has gained by at least one point.

What was the sample size?
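
As a starting point for question (3), here is a sketch of the usual formula for the margin of error of a difference between two independent proportions. It assumes both polls are simple random samples. The sample size of 1,000 per poll is purely an assumed value for illustration, since the quoted passage does not state the actual sizes, and the name `moe_of_difference` is made up here.

```python
import math


def moe_of_difference(p1, n1, p2, n2, z=1.96):
    """Approximate 95% margin of error for p2 - p1 from two independent
    simple random samples of sizes n1 and n2."""
    return z * math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)


n_assumed = 1000          # assumption: not stated in the quoted passage
p_may, p_now = 0.31, 0.36  # approval figures quoted in the passage

change = p_now - p_may
moe = moe_of_difference(p_may, n_assumed, p_now, n_assumed)

print("observed change: {:.1%}".format(change))
print("margin of error of the change: +/- {:.1%}".format(moe))
print("lower bound on the change: {:.1%}".format(change - moe))
```

Trying other values of `n_assumed` shows how the lower bound on the change moves with the sample size.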