Chance News 20: Difference between revisions

Revision as of 19:34, 16 August 2006

Quotations

Like dreams, statistics are a form of wish fulfillment. - Jean Baudrillard

According to an article in the WSJ by Dr. Jerome Groopman of the Harvard Medical School criticizing alternative medicine, the following framed quotation hangs on the office wall of Dr. Stephen Straus, who directs NCCAM (formerly the Office of Alternative Medicine, which is within the National Institutes of Health): "The plural of anecdote is not evidence." This useful and insightful aphorism appears in various versions, as can be seen at this website.

Forsooth


A clumsy attempt at anonymization

A Face is Exposed for AOL Searcher No. 4417749. Michael Barbaro and Tom Zeller, Jr. The New York Times (August 9, 2006).

Statisticians frequently deal with confidentiality issues when deciding what type of data and what amount of detail should be withheld to protect sensitive information about individual patients or institutions. It's not an easy task and there are some subtle traps. And sometimes there are not-so-subtle traps.

At the request of some researchers, America Online (AOL) released data on 20 million web searches performed by 650 thousand AOL users over a three-month span. They released the data not just to those researchers, but to the general public. AOL quickly realized that this was a bad idea and removed the database, but it had already been copied to many locations. It is unlikely that they will ever be able to persuade the web owners at all the other locations to take the files offline.

The data was anonymized by replacing the user name with a random number. This is important, because some of the search terms are for rather sensitive items. Examples of things that people searched on are

- "can you adopt after a suicide attempt" or

- "how to tell your family you're a victim of incest."

But replacing a name with a number did not come even close to anonymizing all of the records. The problem is that people will do web searches about things that reveal hints about themselves. Actual searches listed in the database included things like geographic locations:

- "gynecology oncologists in new york city,"

- "orange county california jails inmate information,"

- "employment needed- louisville ky," or

- "salem probate court decisions,"

or places where the searchers shopped or banked or got health care,

- "gerards restaurant in dc,"

- "st. margaret's hospital washington d.c.,"

- "l&n federal credit union," or

- "mustang sally gentlemans club,"

or products that the searchers owned,

- "cheap rims for a ford focus," or

- "how to change brake pads on scion xb,"

or their hobbies,

- "knitting stitches," or

- "texas hold'em poker on line seminars."

It gets even more revealing when people do web searches on their relatives or even themselves.

These individual searches are, according to one report, like individual pieces in a mosaic. Put enough of them together and you can get a really clear picture of who the searcher is. Can you actually identify people from their web searches? The answer is yes.
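The mosaic effect can be sketched in a few lines of Python. This is only an illustration, not a description of how AOL actually processed its logs: the user names, the assignment of queries to users, and the function names `pseudonymize` and `build_profiles` are all assumptions made for the example, although the query strings themselves are taken from the article. The point is that a random number hides the name but still ties every search by the same person together.

```python
import random
from collections import defaultdict

# Hypothetical log of (user name, query) pairs. The user names and the
# grouping of queries under each user are invented for illustration.
SAMPLE_LOG = [
    ("user_a", "employment needed- louisville ky"),
    ("user_b", "knitting stitches"),
    ("user_a", "l&n federal credit union"),
    ("user_a", "cheap rims for a ford focus"),
]


def pseudonymize(log, seed=0):
    """Replace each user name with a random number.

    The same user always gets the same number, which is what keeps
    the searches linkable after the name is gone.
    """
    rng = random.Random(seed)
    ids = {}
    used = set()
    released = []
    for user, query in log:
        if user not in ids:
            new_id = rng.randrange(10**7)
            while new_id in used:  # avoid giving two users the same number
                new_id = rng.randrange(10**7)
            used.add(new_id)
            ids[user] = new_id
        released.append((ids[user], query))
    return released


def build_profiles(released):
    """Group the released queries by pseudonymous ID."""
    profiles = defaultdict(list)
    for user_id, query in released:
        profiles[user_id].append(query)
    return dict(profiles)


if __name__ == "__main__":
    for user_id, queries in build_profiles(pseudonymize(SAMPLE_LOG)).items():
        print(user_id, queries)
```

In this made-up log, one pseudonymous ID ends up attached to a city, a credit union, and a car model, which is already a much narrower description than any single query gives.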

According to the New York Times report, one user, with the ID number 4417749, searched for

- "landscapers in Lilburn, Ga," and

- "homes sold in shadow lake subdivision gwinnett county georgia,"

as well as the names of several people, all of whose last names were Arnold. It didn't take long for the New York Times to track down a 62-year-old widow named Thelma Arnold.

Ms. Arnold, who agreed to discuss her searches with a reporter, said she was shocked to hear that AOL had saved and published three months’ worth of them. “My goodness, it’s my whole personal life,” she said. “I had no idea somebody was looking over my shoulder.”

This is an important lesson that statisticians have been aware of for some time. An individual piece of information by itself may not compromise someone's privacy, but will do so when it is combined with other pieces of information. Knowing that someone lives in a small town still preserves anonymity, but when that small town name appears in a database of all pediatric heart transplant cases, you have a problem.
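A minimal sketch of this idea, using entirely hypothetical records, is to count how many records share each combination of attributes. The town names, the record list, and the helper names `group_sizes` and `unique_records` are assumptions for illustration only. A record that is the only one with its combination of values is a candidate for re-identification, even if each value by itself is shared by several records.

```python
from collections import Counter

# Hypothetical records, invented for illustration only.
RECORDS = [
    {"town": "Town A", "procedure": "tonsillectomy"},
    {"town": "Town A", "procedure": "tonsillectomy"},
    {"town": "Town A", "procedure": "pediatric heart transplant"},
    {"town": "City B", "procedure": "tonsillectomy"},
    {"town": "City B", "procedure": "tonsillectomy"},
    {"town": "City B", "procedure": "pediatric heart transplant"},
    {"town": "City B", "procedure": "pediatric heart transplant"},
]


def group_sizes(records, keys):
    """Count how many records share each combination of the given fields."""
    return Counter(tuple(r[k] for k in keys) for r in records)


def unique_records(records, keys):
    """Return the records that are the only ones with their combination of values."""
    sizes = group_sizes(records, keys)
    return [r for r in records if sizes[tuple(r[k] for k in keys)] == 1]


if __name__ == "__main__":
    print(unique_records(RECORDS, ["town"]))
    print(unique_records(RECORDS, ["procedure"]))
    print(unique_records(RECORDS, ["town", "procedure"]))
```

In this toy data, no record is unique by town alone or by procedure alone, but the combination of the small town and the rare procedure singles out one record.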

Questions

1. List some of the other things that people might search on that would potentially reveal their identities.

2. Could this data set be cleaned up to the point where it could be truly thought to be anonymized?

3. Why would a researcher be interested in what people search for on the Internet? What sort of information would be useful for someone in Marketing?

Submitted by Steve Simon

item 2