Difference between revisions of "Chance News 20"

From ChanceWiki
Revision as of 19:09, 16 August 2006

Quotations

Like dreams, statistics are a form of wish fulfillment. - Jean Baudrillard

According to a WSJ article by Dr. Jerome Groopman of the Harvard Medical School criticizing alternative medicine, the following framed quotation hangs on the office wall of Dr. Stephen Straus, who directs NCCAM (formerly the Office of Alternative Medicine, which is within the National Institutes of Health): "The plural of anecdote is not evidence." This useful and insightful aphorism appears in various versions, as can be seen at this website.

Forsooth


A clumsy attempt at anonymization

[www.nytimes.com/2006/08/09/technology/09aol.html A Face is Exposed for AOL Searcher No. 4417749.] Michael Barbaro and Tom Zeller, Jr. The New York Times (August 9, 2006). Note: Available only for a few days more for free.

Statisticians frequently deal with confidentiality issues when deciding what type of data and what amount of detail should be withheld to protect sensitive information about individual patients or institutions. It's not an easy task, and there are some subtle traps. Sometimes the traps are not so subtle.

At the request of some researchers, America Online (AOL) released data on 20 million web searches performed by 650 thousand AOL users over a three-month span. They released the data not just to those researchers, but to the general public. AOL quickly realized that this was a bad idea and removed the database, but it had already been copied to many locations. It is unlikely that AOL will ever be able to persuade the owners of all those other sites to take the files offline.

The data was anonymized by replacing the user name with a random number. This is important, because some of the search terms concern rather sensitive matters. Examples of things that people searched for are

- "can you adopt after a suicide attempt" or
- "how to tell your family you're a victim of incest."

But replacing a name with a number did not come even close to anonymizing all of the records. The problem is that people do web searches about things that reveal hints about themselves. Actual searches listed in the database included things like geographic locations:

- "gynecology oncologists in new york city,"
- "orange county california jails inmate information,"
- "employment needed- louisville ky," or
- "salem probate court decisions,"

or places where the searchers shopped or banked or got health care,

- "gerards restaurant in dc,"
- "st. margaret's hospital washington d.c.,"
- "l&n federal credit union," or
- "mustang sally gentlemans club,"

or products that the searchers owned,

- "cheap rims for a ford focus," or
- "how to change brake pads on scion xb,"

or their hobbies,

- "knitting stitches," or
- "texas hold'em poker on line seminars."

It gets even more revealing when people do web searches on their relatives or even themselves.

These individual searches are, according to one report, like individual pieces in a mosaic. Put enough of them together and you can get a really clear picture of who the searcher is. Can you actually identify people from their web searches? The answer is yes.

According to the New York Times report, one user, with the ID number 4417749, searched for

- "landscapers in Lilburn, Ga," and
- "homes sold in shadow lake subdivision gwinnett county georgia,"

as well as the names of several people, all of whose last names were Arnold. It didn't take long for the New York Times to track down a 62-year-old widow named Thelma Arnold.

<blockquote>Ms. Arnold, who agreed to discuss her searches with a reporter, said she was shocked to hear that AOL had saved and published three months’ worth of them. “My goodness, it’s my whole personal life,” she said. “I had no idea somebody was looking over my shoulder.”</blockquote>

This is an important lesson that statisticians have been aware of for some time. An individual piece of information by itself may not compromise someone's privacy, but it can do so when it is combined with other pieces of information. Knowing that someone lives in a small town still preserves anonymity, but when that small town's name appears in a database of all pediatric heart transplant cases, you have a problem.
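
The way combined pieces of information narrow down a person can be illustrated with a minimal sketch. Everything in it is hypothetical: the user names, the queries, and the function names are invented for illustration and are not taken from the AOL data, and a simple counter stands in for the random numbers AOL used. The sketch replaces each name with a number, then counts how many numbered users match what an outsider already knows about a person.

```python
import itertools
from collections import defaultdict

# Hypothetical search log of (user name, query) pairs; not real AOL records.
raw_log = [
    ("alice", "knitting stitches"),
    ("alice", "landscapers in smalltown"),
    ("bob", "knitting stitches"),
    ("bob", "cheap rims for a ford focus"),
    ("carol", "landscapers in smalltown"),
]

def pseudonymize(log):
    """Replace each user name with a number, one fixed number per user.

    A counter is used here instead of a random number so the example is
    reproducible; the linkage problem is the same either way.
    """
    ids = {}
    counter = itertools.count(1)
    released = []
    for user, query in log:
        if user not in ids:
            ids[user] = next(counter)
        released.append((ids[user], query))
    return released

def queries_by_id(released):
    """Group the released queries by pseudonymous ID."""
    grouped = defaultdict(set)
    for uid, query in released:
        grouped[uid].add(query)
    return dict(grouped)

def matching_ids(grouped, known_queries):
    """Return the IDs whose history contains every one of the known queries."""
    known = set(known_queries)
    return [uid for uid, queries in grouped.items() if known <= queries]

grouped = queries_by_id(pseudonymize(raw_log))
one_clue = matching_ids(grouped, ["knitting stitches"])
two_clues = matching_ids(
    grouped, ["knitting stitches", "landscapers in smalltown"]
)
print(len(one_clue), len(two_clues))
```

With one clue, two numbered users remain candidates; with two clues, only one does. Because the same number is attached to every query a user made, each additional clue can only shrink the set of candidates.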

item 2