# Chance News 56



## Revision as of 09:24, 30 October 2009

## Contents

- 1 Quotations
- 2 Forsooths
- 3 Minimizing the number of coins jingling in your pocket
- 4 Failure to disclose
- 5 More on AIDS Vaccine
- 6 Carrying a gun increases risk of getting shot and killed
- 7 Identifying financial market cycles - or not
- 8 Learning by the petabyte
- 9 The unluckiest fan
- 10 More on who's happier
- 11 Disc Fragments
- 12 Hyping counts
- 13 Sneaking into a clinical trial
- 14 Judging wine judges

## Quotations

I can calculate the motion of heavenly bodies but not the madness of people.

After losing a fortune in the South Sea Company bubble of 1720

Trying is the first step towards failure.

At the website Language Log [1], Mark Liberman posted the following quotation from Miranda Robertson, in “Ockham’s broom,” a new series said to have been introduced on October 16, 2009, in the *Journal of Biology*:

[I]t is probably safe to assume that most readers are familiar with Ockham’s razor – roughly, the principle whereby gratuitous suppositions are shaved from the interpretation of facts …. Ockham's broom is a somewhat more recent conceit, attributable to Sydney Brenner, and embodies the principle whereby inconvenient facts are swept under the carpet in the interests of a clear interpretation of a messy reality. (Or, some – possibly including Sydney Brenner – might say, in order to generate a publishable paper.)

Jeff Witmer posted this quotation on the ISOSTAT listserv. It's Malcolm Gladwell's response [2] to an interviewer, reported in *TIME*, October 20, 2009.

If I was studying today, I would go get a master's in statistics, and maybe do a bunch of accounting courses and then write from that perspective. I think that's the way to survive. The role of the generalist is diminishing. Journalism has to get smarter.

This illustrates why the controversy over statistical significance is exaggerated. Whether you consider the first or second analysis, the observed effect of the Thai candidates was either just above or below the level of statistical significance. Statisticians will tell you it is possible to observe an effect and have reason to think it’s real even if it’s not statistically significant. And if you think it’s real, you ought to examine it carefully.

Seth Berkley, op-ed contributor, *New York Times*

Submitted by Paul Alper

## Forsooths

This forsooth is from the October 2009 RSS Forsooth.

Of course in those days we worked on the assumption that everything was normally distributed and we have seen in the last few months that there is no such thing as a normal distribution.

*Scientific Computing World*, February/March 2009

You can see the context of this comment here.

University of North Dakota researchers found that pilots who ate the fattiest foods such as butter or gravy had the quickest response times in mental tests and made fewer mistakes when flying in tricky cloud conditions.

In a *New Yorker* (October 12, 2009) review [3] of Matthew Stewart's *The Management Myth: Why the Experts Keep Getting It Wrong*, Stewart tells a story about how "his boss taught his twenty-something[-old] trainees ... how to conduct a 'two-handed regression'":

"When a scatter plot failed to show the significant correlation between two variables that we all knew was there, he would place a pair of meaty hands over the offending clouds of data points and thereby reveal the straight line hiding from conventional mathematics." Management consulting isn't a science, Stewart says; it's a party trick.

## Minimizing the number of coins jingling in your pocket

Do We Need a 37-Cent Coin? Steven D. Levitt, October 6, 2009, Freakonomics Blog, The New York Times.

The current system of coins in the United States is inefficient. Patrick DeJarnette studied this problem and his work was highlighted in the Freakonomics blog. Dr. DeJarnette makes two assumptions.

1. Some combination of coins must reach every integer value in [0,99].

2. Probability of a transaction resulting in value v is uniform from [0,99].

Under the current system (penny, nickel, dime, quarter), the average number of coins that you would receive in change during a random transaction would be 4.7. The system that would work better is rather bizarre.

The most efficient systems? The penny, 3-cent piece, 11-cent piece, 37-cent piece, and (1,3,11,38) are tied at 4.10 coins per transaction.

Such a set of coins would be evocative of the monetary system in the Harry Potter books.

The article goes on to discuss systems where the coins are more conveniently priced and which single change in coins would lead to the greatest savings.
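The averages quoted above can be reproduced with a short sketch. It assumes that change is made greedily (largest coin first) and that every amount from 0 to 99 cents is equally likely, as in the two assumptions listed earlier. The function names are illustrative, and greedy change is not guaranteed to use the fewest coins for every conceivable coin set.

```python
def greedy_coin_count(value, denominations):
    """Coins handed over when change is made largest-coin-first (an assumption)."""
    count = 0
    for coin in sorted(denominations, reverse=True):
        used, value = divmod(value, coin)
        count += used
    return count


def total_coins(denominations, max_value=99):
    """Total coins needed across all amounts 0..max_value."""
    return sum(greedy_coin_count(v, denominations) for v in range(max_value + 1))


def average_coins(denominations, max_value=99):
    """Average coins per transaction under a uniform distribution of amounts."""
    return total_coins(denominations, max_value) / (max_value + 1)


systems = {
    "current U.S. coins": (1, 5, 10, 25),
    "penny, 3, 11, 37": (1, 3, 11, 37),
    "penny, 3, 11, 38": (1, 3, 11, 38),
}

for name, coins in systems.items():
    print(f"{name}: {average_coins(coins):.2f} coins per transaction")
```

Swapping in another tuple of denominations is a quick way to explore the third question below about adding a single coin to the current mix.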

Submitted by Steve Simon

### Questions

1. Minimizing the number of coins received in change is not the only criterion for a set of coin denominations. What other criteria make sense?

2. Is it logical to assume a uniform distribution in this problem?

3. What coin could be added to the current mix of coins to minimize the number of coins given in change?

## Failure to disclose

“Data Call Into Question HIV Study Results”

by Gautam Naik and Mark Schoofs, *The Wall Street Journal*, October 10, 2009

Researchers from the U.S. Army and Thailand failed to disclose that some results of a potential HIV vaccine trial were not statistically significant, although they had this information when they announced the discovery.

"We thought very hard about how to provide the clearest, most honest message," [one researcher] said. "We stand by the fact that this is a vaccine with a modest protective effect." He called the trial results "complex."

The first analysis, a “modified intent to treat” analysis, included “virtually everyone who enrolled in the study, regardless of whether they ended up getting the full course of the vaccine. …. By this measure, the vaccine tested in Thailand reduced by 31% the chance of infection with HIV ….”

New infections occurred in 51 of the 8,197 people who got the vaccine, compared with 74 of the 8,198 volunteers who got placebo shots. Statistical calculations showed there was a 3.9% probability that chance accounted for the difference. In drug and vaccine trials, anything above a 5% probability of a chance result is deemed statistically insignificant.

The second analysis, a “per protocol” analysis, included only the “study participants who got the full regimen of vaccine shots at the right time.” Apparently, for this group, in which 86 people were infected, there is a “16% chance the study results were a fluke.” In this analysis, the vaccine reduced the chance of infection with HIV by 26%.

The article’s authors comment:

It isn't clear why the vaccine was seemingly ineffective among participants who followed the guidelines to the letter.

Submitted by Margaret Cibes

## More on AIDS Vaccine

“Hardly ever believe what you read” is a maxim that will stand you in good stead. Googling “aids vaccine Thailand” will get 248,000 hits, most of which are misleading. In essence, the URLs say that for the first time an effective vaccine against AIDS has been manufactured. But that was last month. Reality has now set in.

The following chart found in the Wall Street Journal of October 9, 2009 paints a different picture. “New infections occurred in 51 of the 8,197 people who got the vaccine, compared with 74 of the 8,198 volunteers who got placebo shots.” Note that the “125” infections represent “51 + 74.”

The announcement on September 24, 2009 indicated that the p-value is 3.9%. A Minitab run shows that, in fact, the p-value is higher (i.e., worse) as indicated by the Fisher exact test. However, the .048 is still under the mystical .05:

**Test and CI for Two Proportions**

| Sample | X | N | Sample p |
|---|---|---|---|
| 1 | 51 | 8197 | 0.006222 |
| 2 | 74 | 8198 | 0.009027 |

Difference = p (1) - p (2)

Estimate for difference: -0.00280480

95% CI for difference: (-0.00546736, -0.000142249)

Test for difference = 0 (vs not = 0): Z = -2.06 P-Value = 0.039

Fisher's exact test: P-Value = 0.048
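For readers without Minitab, the two tests can be sketched in plain Python. This is an illustration under stated assumptions, not Minitab's own code: the z-test uses the unpooled standard error (which reproduces the confidence interval shown above), and the two-sided Fisher p-value sums every table whose probability is no larger than that of the observed table. The function names are made up for this sketch.

```python
from math import comb, erfc, sqrt


def two_proportion_z(x1, n1, x2, n2):
    """Unpooled two-proportion z statistic and two-sided normal p-value."""
    p1, p2 = x1 / n1, x2 / n2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = (p1 - p2) / se
    return z, erfc(abs(z) / sqrt(2))  # erfc(|z|/sqrt(2)) = 2 * P(Z > |z|)


def fisher_exact_two_sided(x1, n1, x2, n2):
    """Two-sided Fisher exact p-value: sum of table probabilities <= observed."""
    total = x1 + x2
    denom = comb(n1 + n2, total)

    def prob(k):
        # Hypergeometric probability of k events in group 1, margins fixed.
        return comb(n1, k) * comb(n2, total - k) / denom

    observed = prob(x1)
    low, high = max(0, total - n2), min(total, n1)
    # Small relative tolerance so that ties with the observed table count.
    return sum(p for p in map(prob, range(low, high + 1)) if p <= observed * (1 + 1e-9))


z, p_normal = two_proportion_z(51, 8197, 74, 8198)
p_fisher = fisher_exact_two_sided(51, 8197, 74, 8198)
print(f"Z = {z:.2f}, normal p-value = {p_normal:.3f}, Fisher p-value = {p_fisher:.3f}")
```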

“Efficacy” of 31.2% seems to be determined from

(74 - 51)/ 74 = .311

In the final column of the chart--“Strictly adheres to trial design”--appears the unreleased “per protocol” version. According to Science Magazine:

- The second analysis is called “per protocol” and adheres strictly to how the trial was designed by only including the study participants who got the full regimen of vaccine shots at the right time. Because it excludes study participants who didn't get the full vaccine regimen, it usually provides corroboration to the looser “intent to treat” findings.

The article doesn’t say what the breakdown of the 86 infections is. Nevertheless, it indicates that the p-value of 16% puts a damper on enthusiasm for the vaccine.

- “The press conference was not a scholarly, rigorously honest presentation,” said one leading HIV/AIDS investigator, who like others asked that his name not be used. “It doesn’t meet the standards that have been set for other trials, and it doesn’t fully present the borderline results. It’s wrong.”

### Discussion

1. “Strictly adheres to trial design” has an efficacy of 26.2% and 86 infections. Show that this leads approximately to 36 and 50 infections, respectively.

2. The articles fail to tell us the number of participants in the “per protocol” situation. However, use the 36 and 50 cited above and show via a statistics package such as Minitab that the Fisher exact test comes up with about 16% for the p-value regardless of whether the sample sizes are the original ones or 4000 each, 5000 each, etc.

3. The “researchers with the U.S. Army who helped run the study, strongly objected to the assertion that they gave the data a positive spin… The debate over the way the results were presented will have no immediate practical impact because even under the most optimistic assessment, the vaccine offered too little protection to be a serious candidate for widespread use.” If this is so, why was there so much positive publicity in September?
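As a starting point for the first two questions, here is a rough sketch. It assumes the two per-protocol arms are about equal in size, so that efficacy is 1 minus the ratio of vaccine to placebo infections. It also replaces Fisher's test with a binomial approximation: given 86 infections in total and no vaccine effect, each infection is equally likely to fall in either arm. That approximation does not involve the arm sizes at all, which hints at why the Fisher p-value barely changes with them when the arms are large.

```python
from math import comb

efficacy = 0.262
total_infections = 86

# Solve 1 - vaccine/placebo = efficacy together with vaccine + placebo = total.
placebo = total_infections / (2 - efficacy)
vaccine = total_infections - placebo
print(f"placebo = {placebo:.1f}, vaccine = {vaccine:.1f}")


def binomial_two_sided(k, n):
    """Two-sided exact p-value for k successes out of n when p = 1/2."""
    probs = [comb(n, i) / 2**n for i in range(n + 1)]
    observed = probs[k]
    return sum(p for p in probs if p <= observed * (1 + 1e-9))


p_value = binomial_two_sided(36, total_infections)
print(f"approximate two-sided p-value = {p_value:.3f}")
```

The solved counts are not whole numbers, which is why the question says "approximately" 36 and 50.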

Submitted by Paul Alper

## Carrying a gun increases risk of getting shot and killed

Ewen Callaway, *New Scientist*, October 6, 2009

In this article we read:

People who carry guns are far likelier to get shot – and killed – than those who are unarmed, a study of shooting victims in Philadelphia, Pennsylvania, has found. It would be impractical – not to say unethical – to randomly assign volunteers to carry a gun or not and see what happens. So Charles Branas's team at the University of Pennsylvania analyzed 677 shootings over two-and-a-half years to discover whether victims were carrying at the time, and compared them to other Philly residents of similar age, sex and ethnicity. The team also accounted for other potentially confounding differences, such as the socioeconomic status of their neighborhood.

Their article will appear in the American Journal of Public Health. The current version of this article can be found here and the most recent abstract can be found here. In this abstract we read:

Objectives. We investigated the possible relationship between being shot in an assault and possession of a gun at the time.

Methods. We enrolled 677 case participants that had been shot in an assault and 684 population-based control participants within Philadelphia, PA, from 2003 to 2006. We adjusted odds ratios for confounding variables.

Results. After adjustment, individuals in possession of a gun were 4.46 (P<.05) times more likely to be shot in an assault than those not in possession. Among gun assaults where the victim had at least some chance to resist, this adjusted odds ratio increased to 5.45 (P<.05).

Conclusions. On average, guns did not protect those who possessed them from being shot in an assault. Although successful defensive gun uses occur each year, the probability of success may be low for civilian gun users in urban areas. Such users should reconsider their possession of guns or, at least, understand that regular possession necessitates careful safety countermeasures.

### Discussion

Why do you think New Scientist and others discussing this study titled their articles "Carrying a gun increases risk of getting shot and killed" rather than using the title of the original article, "Investigating the Link Between Gun Possession and Gun Assault"?

Of course this is the kind of article that lends itself to interesting comments. For example:

I am definitely going to have to find the complete article. I want to see how they determined which victims of being shot were included in the study and how they determined which civilians would be included in the study. With out that information, this study doesn't really mean anything.

Follow this advice and see if you think the study really means anything.

Sounds to me like a completely ignorant study and weighted to get the result they want. If you check a place like Philidelphia, of course this is the result you would get, because the people carrying guns are more likely to be involved in crimes or living in crime ridden areas. Check Dallas, or Oklahoma City. You wouldn't get that result at all. And that's because dang near everybody has guns, and we have far fewer shootings.

Does this suggest that the study is completely ignorant?

This article was suggested by Gordon Fox

## Identifying financial market cycles - or not

“The Secret Cycle”, by Nick Paumgarten, *The New Yorker*, October 12, 2009

This article focuses on the work of Martin Armstrong, a technical financial analyst, who found that, "on average, there had been a panic every 8.6 years" over the period 1683-1907:

He discerned a recurrence of major turning points in the economy and in world affairs that followed a distinct and unwavering 8.6-year rhythm.

Then he found that the October 1987 crash “took place on the minor halfway point up the first leg of the 8.6-year cycle, at 2.15 years,” noting that "8.6 years was exactly … 3,141 [days], the number pi times a thousand.”

Eventually:

The model … failed, among other things, to foresee its developer’s demise. In September, 1999, Armstrong was charged with defrauding Japanese investors of nearly a billion dollars. …. The upshot, though, is that he has now spent more than nine years in jail – a pi cycle and then some.

The article includes discussions of Fibonacci-based market behavior models and the "reasoning" behind them.

Submitted by Margaret Cibes

## Learning by the petabyte

Training to Climb an Everest of Digital Data. Ashlee Vance, The New York Times, October 11, 2009.

Some Statistics textbooks have been criticized for having small "toy" problems that do not reflect the complexity of data analysis out in the real world. What sort of data sets are out in the real world?

Facebook, for example, uses more than 1 petabyte of storage space to manage its users’ 40 billion photos. It was not long ago that the notion of one company having anything close to 40 billion photos would have seemed tough to fathom. Google, meanwhile, churns through 20 times that amount of information every single day just running data analysis jobs. In short order, DNA sequencing systems too will generate many petabytes of information a year.

Even at the best universities, students are not asked to handle data sets this large. And this is a problem.

For the most part, university students have used rather modest computing systems to support their studies. They are learning to collect and manipulate information on personal computers or what are known as clusters, where computer servers are cabled together to form a larger computer. But even these machines fail to churn through enough data to really challenge and train a young mind meant to ponder the mega-scale problems of tomorrow. "If they imprint on these small systems, that becomes their frame of reference and what they’re always thinking about," said Jim Spohrer, a director at I.B.M.'s Almaden Research Center.

Two companies with lots of experience tackling petabyte-sized data sets want to change this.

Two years ago, I.B.M. and Google set out to change the mindset at universities by giving students broad access to some of the largest computers on the planet. The companies then outfitted the computers with software that Internet companies use to tackle their toughest data analysis jobs. And, rather than building a big computer at each university, the companies created a system that let students and researchers tap into giant computers over the Internet. This year, the National Science Foundation, a federal government agency, issued a vote of confidence for the project by splitting $5 million among 14 universities that want to teach their students how to grapple with big data questions.

Submitted by Steve Simon

### Questions

1. What is the size of the largest data set that you have ever analyzed? Did the size of the data set force you to use a different computing system, different software, or a different statistical method?

2. Could a random sample of a few megabytes from a petabyte of data be sufficiently useful to learn on? Note that a megabyte is nine orders of magnitude smaller than a petabyte. Is it possible to have a representative sample with a data set sampled this sparsely?

3. Moore's Law says (more or less) that computing capacity doubles every two years (some sources say 18 months). If Moore's Law applies, calculate how long it will take before we see petabyte-sized hard drives on laptop computers.
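A back-of-the-envelope sketch for the third question follows. The starting capacity of 500 gigabytes for a 2009 laptop drive is an assumption made only for illustration, and the calculation treats Moore's Law as if it applied directly to disk capacity.

```python
from math import log2

start_capacity_bytes = 500e9   # assumed 2009 laptop drive: 500 GB (hypothetical)
target_capacity_bytes = 1e15   # one petabyte

# Number of doublings needed to grow from the starting size to a petabyte.
doublings = log2(target_capacity_bytes / start_capacity_bytes)

years_needed = {}
for months_per_doubling in (24, 18):
    years_needed[months_per_doubling] = doublings * months_per_doubling / 12
    print(f"doubling every {months_per_doubling} months: "
          f"about {years_needed[months_per_doubling]:.0f} years")
```

Changing the assumed starting capacity by a factor of two shifts the answer by only one doubling period, so the conclusion is not very sensitive to that guess.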

## The unluckiest fan

Nats follower may be unluckiest fan

All Things Considered, NPR, 16 October 2009.

The Washington Nationals baseball team posted a dismal won-lost record of 59-103 for the 2009 season. From the link above, you can listen to an interview with season-ticket holder Stephen Krupin, who watched the team lose all 19 games he attended this year. The host speculates that this must be a record for bad luck. In fact, Mr. Krupin reports that his cousin, a PhD economist, calculated the chance that this would happen as 1 in 131,204.

In comments posted on the NPR site, several listeners attempt to reproduce this calculation, but find that the event appears to be more likely than reported. It turns out that their analyses are based on the full season record--which seems natural since that record is featured so prominently in the story. However, it comes out in the interview that Mr. Krupin attended only home games. From the Major League Baseball standings we see that the Nationals were 33–48 at home and 26–55 on the road. The chance that 19 randomly selected home games are all losses is $\binom{48}{19} \big/ \binom{81}{19}$, which equals 1 in 131203.8, in agreement with Mr. Krupin's report.
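The calculation is easy to check. The sketch below assumes, as above, that the 19 games attended are a simple random sample without replacement from the games played. The function name is made up for the illustration.

```python
from math import comb


def all_losses_probability(losses, games, attended):
    """Chance that `attended` games drawn without replacement are all losses."""
    return comb(losses, attended) / comb(games, attended)


home_only = all_losses_probability(48, 81, 19)      # 33-48 record at home
full_season = all_losses_probability(103, 162, 19)  # 59-103 record overall

print(f"home games only: 1 in {1 / home_only:,.1f}")
print(f"full season:     1 in {1 / full_season:,.1f}")
```

The second line shows the figure that the NPR commenters were effectively computing when they used the full-season record.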

We had another curious experience trying to get the data to match the calculation. An initial try found a sortable schedule on the Washington Nationals web site. Selecting the home games produces 85 entries: 34 wins, 48 losses and 3 postponements. Baseball fans will recognize that 34 plus 48 gives one too many home games, but how do we account for the extra win? It turns out that the May 5 game against the Astros was suspended by rain in the 11th inning, with the score tied 10-10. The game was completed on July 9, with the Nationals ultimately winning 11-10. This result appears twice in the schedule, once on each date.

Submitted by Bill Peterson, based on a suggestion from Jeanne Albert.

## More on who's happier

“The Happiness Gap is back is back is back is back”

by Mark Liberman, Language Log (online blog), September 20, 2009

This is an update of Liberman's 2007 Language Log blog “The ‘Happiness Gap’ and the Rhetoric of Statistics”, which was posted in reaction to David Leonhardt’s 2007 *New York Times* article “He’s Happier, She’s Less So”.

He writes now in reaction to recently updated data and to the renewed spate of 2009 articles on this topic:

(a) NYT’s Ross Douthat in “Liberated and Unhappy”

(b) Huffington Post’s “The Sad, Shocking Truth About How Women Are Feeling”, “What’s Happening To Women’s Happiness?”, *etc.*

(c) NYT’s Maureen Dowd in “Blue Is the New Black”.

Here are the original and updated percents:

| Period | Group | Very happy | Pretty happy | Not too happy |
|---|---|---|---|---|
| 1972-74 | Men | 31.9 | 53.0 | 15.1 |
| 1972-74 | Women | 37.0 | 49.4 | 13.6 |
| 2004-08 | Men | 29.8 | 56.1 | 14.0 |
| 2004-08 | Women | 31.2 | 54.9 | 13.9 |

Liberman refers readers to his more detailed 2007 discussion of the statistical issues – sample size, self-reporting, statistical vs. practical significance – including references to lots of other articles related to these survey results, in “The ‘Gender Happiness Gap’: Statistical, practical and rhetorical Significance”.

Submitted by Margaret Cibes

## Disc Fragments

The dream of every clinical trial is to come up with something which is inexpensive, definitive and likely to result in media publicity. “Improved outcome after lumbar microdiscectomy in patients shown their excised disc fragments: a prospective, double blind, randomized, controlled trial” by M.J. Tait, et al here fulfills the desire.

According to a local Twin Cities website, under the heading “Seeing, it appears, is believing when it comes to back surgery”: “British surgeons report that patients who underwent a surgical procedure (lumbar microdiscectomy) for back pain caused by a spinal disc tear (“slipped disc”) had better outcomes when they received fragments of their removed disc after the operation. That’s right. Simply taking home a souvenir of the operation in a pot of saline solution improved the patients’ recovery. They reported less leg and back pain, less leg weakness and less “pins and needles” sensations (paresthesia). They also took fewer pain medications after the surgery.”

The surgeons “said they decided to do the study for two main reasons: They knew that a patient’s anxiety and depression going into surgery for a spinal disc tear has a big impact on the recovery process. They had also noticed, anecdotally, that many of their patients who responded best to the surgery — and who seemed to experience the least anxiety and depression afterwards — were those who had been given their disc fragments.”

The abstract of the journal article notes low p-values to make their case “that presenting the removed disc material to patients after LMD improves patient outcome.”:

Lumbar microdiscectomy (LMD) is a commonly performed neurosurgical procedure. We set up a prospective, double blind, randomised, controlled trial to test the hypothesis that presenting the removed disc material to patients after LMD improves patient outcome. METHODS: Adult patients undergoing LMD for radiculopathy caused by a prolapsed intervertebral disc were randomised into one of two groups, termed experimental and control. Patients in the experimental group were given their removed disc fragments whereas patients in the control group were not. Patients were unaware of the trial hypothesis and investigators were blinded to patient group allocation. Outcome was assessed between 3 and 6 months after LMD. Primary outcome measures were the degree of improvement in sciatica and back pain reported by the patients. Secondary outcome measures were the degree of improvement in leg weakness, paraesthesia, numbness, walking distance and use of analgesia reported by the patients. RESULTS: Data from 38 patients in the experimental group and 36 patients in the control group were analysed. The two groups were matched for age, sex and preoperative symptoms. More patients in the experimental compared with the control group reported improvements in leg pain (91.5 vs 80.4%; p<0.05), back pain (86.1 vs 75.0%; p<0.05), limb weakness (90.5 vs 56.3%; p<0.02), paraesthesia (88 vs 61.9%; p<0.05) and reduced analgesic use (92.1 vs 69.4%; p<0.02) than preoperatively. CONCLUSION: Presentation of excised disc fragments is a cheap and effective way to improve outcome after LMD.

The entire paper is only three pages in length and so its calculations can be checked. Below are the calculation results for the three secondary outcomes for which the paper claims statistical significance:

1. Improved Leg Weakness--The paper states that the p-value is less than .02. Minitab shows that the p-value from Fisher’s exact test is .024.

**Test and CI for Two Proportions** [leg weakness]

| Sample | X | N | Sample p |
|---|---|---|---|
| 1 | 9 | 16 | 0.562500 |
| 2 | 19 | 21 | 0.904762 |

Difference = p (1) - p (2)

Estimate for difference: -0.342262

95% CI for difference: (-0.615844, -0.0686795)

Test for difference = 0 (vs not = 0): Z = -2.45 P-Value = 0.014

Fisher's exact test: P-Value = 0.024

2. Paraesthesia--The paper states that the p-value is less than .05. Minitab shows that the p-value from Fisher’s exact test is .08.

**Test and CI for Two Proportions** [paraesthesia]

| Sample | X | N | Sample p |
|---|---|---|---|
| 1 | 22 | 25 | 0.880000 |
| 2 | 13 | 21 | 0.619048 |

Difference = p (1) - p (2)

Estimate for difference: 0.260952

95% CI for difference: (0.0173021, 0.504603)

Test for difference = 0 (vs not = 0): Z = 2.10 P-Value = 0.036

Fisher's exact test: P-Value = 0.080

3. Reduced Analgesic Use--The paper states that the p-value is less than .02. Minitab shows that the p-value from Fisher’s exact test is .017.

**Test and CI for Two Proportions** [reduced analgesic use]

| Sample | X | N | Sample p |
|---|---|---|---|
| 1 | 35 | 38 | 0.921053 |
| 2 | 25 | 36 | 0.694444 |

Difference = p (1) - p (2)

Estimate for difference: 0.226608

95% CI for difference: (0.0534229, 0.399793)

Test for difference = 0 (vs not = 0): Z = 2.49 P-Value = 0.013

Fisher's exact test: P-Value = 0.017

The primary outcomes, leg pain and (low) back pain for the treatment vs. the control, were not calculated in the same way as the secondary outcomes. Instead of using a two-sample test of proportions, the results for “pain” were calculated by having five categories: “Much better,” “Little better,” “Same,” “Little worse,” and “Much worse.” That is, an ordinal scale was employed. Because the accompanying graphs, Figure 1A and 1B in the paper, are not precise enough to determine the number in each category, a nonparametric calculation is hard to carry out.

Nevertheless, ignoring the breakdown into five categories, here are Minitab results for leg pain and back pain, respectively; note that the p-values are much different from the claimed <.05:

**Test and CI for Two Proportions** [leg pain]

| Sample | X | N | Sample p |
|---|---|---|---|
| 1 | 35 | 38 | 0.921053 |
| 2 | 29 | 36 | 0.805556 |

Difference = p (1) - p (2)

Estimate for difference: 0.115497

95% CI for difference: (-0.0396318, 0.270626)

Test for difference = 0 (vs not = 0): Z = 1.46 P-Value = 0.144

Fisher's exact test: P-Value = 0.185

**Test and CI for Two Proportions** [back pain]

| Sample | X | N | Sample p |
|---|---|---|---|
| 1 | 33 | 38 | 0.868421 |
| 2 | 27 | 36 | 0.750000 |

Difference = p (1) - p (2)

Estimate for difference: 0.118421

95% CI for difference: (-0.0592271, 0.296069)

Test for difference = 0 (vs not = 0): Z = 1.31 P-Value = 0.191

Fisher's exact test: P-Value = 0.242
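The Fisher p-values above can be cross-checked without Minitab. The sketch below is an illustration rather than Minitab's own routine: it uses the common two-sided convention of summing the probabilities of all tables that are no more likely than the observed one, and it takes its counts from the Minitab output above. The dictionary labels and function name are made up for the example.

```python
from math import comb


def fisher_exact_two_sided(x1, n1, x2, n2):
    """Two-sided Fisher exact p-value for x1/n1 versus x2/n2."""
    total = x1 + x2
    denom = comb(n1 + n2, total)

    def prob(k):
        # Hypergeometric probability of k successes in sample 1, margins fixed.
        return comb(n1, k) * comb(n2, total - k) / denom

    observed = prob(x1)
    low, high = max(0, total - n2), min(total, n1)
    # Small relative tolerance so that ties with the observed table count.
    return sum(p for p in map(prob, range(low, high + 1)) if p <= observed * (1 + 1e-9))


# (X1, N1, X2, N2) for each outcome, as listed in the Minitab output.
outcomes = {
    "leg weakness": (9, 16, 19, 21),
    "paraesthesia": (22, 25, 13, 21),
    "reduced analgesic use": (35, 38, 25, 36),
    "leg pain": (35, 38, 29, 36),
    "back pain": (33, 38, 27, 36),
}

p_values = {name: fisher_exact_two_sided(*counts) for name, counts in outcomes.items()}
for name, p in p_values.items():
    print(f"{name}: Fisher exact p-value = {p:.3f}")
```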

### Discussion

1. Why might an individual report a better outcome because he was handed his disc fragment? Why might he feel worse?

2. Assuming that the p-values reported in the article are correct, what criticism might still remain?

3. A disc fragment is one form of excised body part. What other excised body part might have a similar positive result? What other excised body part might have a distinctly negative result?

4. This study took place in London, England. Why might patient reaction be different in, let us say, Asia or Africa?

Submitted by Paul Alper

## Hyping counts

“Prostitution and trafficking - the anatomy of a moral panic”

by Nick Davies, *The Guardian*, October 20, 2009

This article describes the saga of counting sex traffickers in the UK and the accompanying media amplification of the apparently distorted counts.

Behind the confusion among estimates of the size of the problem is the lack of agreement upon a definition of a sex trafficker. According to international law (2000 Palermo protocol), sex trafficking involves the “use of force, fraud or coercion to transport an unwilling victim into sexual exploitation.” A looser definition refers to the “movement of all sex workers, including willing professionals who are simply travelling in search of a better income.”

In any case, based on a police report, academics at the University of North London estimated 71 as the number of women trafficked in the UK during 1998, and they suggested in 2000 that the true count might be between 142 and 1,420. They acknowledged that any count was "problematic."

In 2003 a second team of researchers proposed an upper bound of 3,812 women involved in the sex trade, with caveats:

The researchers ringed this figure with warnings. The data, they said, was "very poor" and quantifying the subject was "extremely difficult". Their final estimate was "very approximate", "subject to a very large margin of error" and "should be treated with great caution" and the figure of 3,812 "should be regarded as an upper bound."

Before the report had been published the figure was rounded up to 4,000.

In 2007 a politician used the figure “25,000 sex slaves,” citing Home Office estimates, despite the fact that there was no Home Office source for that figure.

"I used to work for the Daily Mirror, so I trust the report," … [one reader] said.

That same politician then cited an 18,000 figure several times. However, one person stated, "None of us knew where that came from."

An anti-prostitution spokesperson stated:

"I realise that the 25,000 figure, which is one that has been bandied about in the media, is one that doesn't really have much of an evidence base and may be slightly subject to media hype. There is an awful lot of confusion in the media and other places between trafficking (unwilling victims) and smuggling (willing passengers). People do get confused and they are two very different things."

The article’s author concluded:

For the police, the misinformation has succeeded in diverting resources away from other victims.

Submitted by Margaret Cibes

## Sneaking into a clinical trial

Bending the Rules of Clinical Trials. Pauline W. Chen, M.D., The New York Times, October 29, 2009.

When a patient comes to you with a terminal illness, and the only treatments available are experimental, you're supposed to encourage them to enroll in a clinical trial. But what about a patient who is so ill that they can't meet the eligibility requirements? This is the quandary that the author, Pauline Chen, faced with one of her patients, Louise.

No standard therapy would work, so I wondered if a clinical trial, a study of a new drug or treatment, might hold some promise. I also knew that every research trial maintained strict criteria for enrollment and that Louise was hardly the ideal candidate and might not qualify. While there were trials that accepted patients with diminished liver function, her organ was on the precipice of all-out failure. If she took part in a trial and her liver failed, she could muddy the data, perhaps even alter the trial’s outcome. The investigators would assume that liver failure was a side effect of the experimental drug, a complication potentially so significant it could prevent future use of the drug. But the reality was that her liver failure would have probably had nothing or very little to do with the drug; it would have been the result of not having had enough normal liver to begin with.

It's not hard to trick the system.

Such violations could include altering medical records in order to get a patient into an H.I.V. treatment trial, downplaying a substance abuse history in order to help a patient enroll in a trial on depression, or artificially improving an otherwise poor kidney function test by having a patient drink a gallon of water the night before the study’s blood draw.

So what would you do? Most doctors would bend the rules and try to sneak the patient into the trial. A survey published in the bioethics journal IRB: Ethics & Human Research (only the abstract is available for free) characterizes the vast extent to which this is true.

90 percent believed that ignoring certain entry criteria was acceptable if a patient could, in their estimation, benefit from the trial. In addition, over 60 percent of those surveyed also believed that researchers should deviate from study rules if doing so might improve a patient’s care.

The New York Times blog on health issues continued the discussion and invited reader comments. Most of the readers were upset at the willingness of doctors to violate the rules of clinical trials.

Submitted by Steve Simon

### Questions

1. Who is harmed when an ineligible patient enrolls in a clinical trial?

2. Who's the villain: the doctors who violate the rules of clinical trials or the scientists who write overly restrictive entry criteria?

3. Would the harm of enrolling an ineligible patient be nullified if he/she were just as likely to end up in the treatment group as in the control group?

## Judging wine judges

These two articles present detailed statistical analyses of how consistent wine judges are.

“An Analysis of the Concordance Among 13 U.S. Wine Competitions”

by Robert T. Hodgson, *Journal of Wine Economics*, Fall 2008

"**Abstract**: An analysis of over 4000 wines entered in 13 U.S. wine competitions shows little concordance among the venues in awarding Gold medals. Of the 2,440 wines entered in more than three competitions, 47 percent received Gold medals, but 84 percent of these same wines also received no award in another competition. Thus, many wines that are viewed as extraordinarily good at some competitions are viewed as below average at others. An analysis of the number of Gold medals received in multiple competitions indicates that the probability of winning a Gold medal at one competition is stochastically independent of the probability of receiving a Gold at another competition, indicating that winning a Gold medal is greatly influenced by chance alone."

“An Examination of Judge Reliability at a major U.S. Wine Competition”

by Robert T. Hodgson, *Journal of Wine Economics*, Spring 2009

"**Abstract**: Wine judge performance at a major wine competition has been analyzed from 2005 to 2008 using replicate samples. Each panel of four expert judges received a flight of 30 wines imbedded with triplicate samples poured from the same bottle. Between 65 and 70 judges were tested each year. About 10 percent of the judges were able to replicate their score within a single medal group. Another 10 percent, on occasion, scored the same wine Bronze to Gold. Judges tend to be more consistent in what they don’t like than what they do. An analysis of variance covering every panel over the study period indicates only about half of the panels presented awards based solely on wine quality."

Submitted by Margaret Cibes
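The independence claim in the first abstract can be illustrated with a small calculation. The sketch below is not Hodgson's analysis. It assumes that every competition awards a Gold with the same probability and independently of the others, so the number of Golds a wine collects is binomial. The values of 5 competitions and a 10 percent Gold rate are hypothetical and are not taken from the paper.

```python
import math

def gold_count_distribution(n_competitions, p_gold):
    """P(k Golds), k = 0..n, if each competition awards Gold independently."""
    return [
        math.comb(n_competitions, k)
        * p_gold ** k
        * (1 - p_gold) ** (n_competitions - k)
        for k in range(n_competitions + 1)
    ]

# Hypothetical inputs chosen only for illustration
dist = gold_count_distribution(5, 0.10)
p_at_least_one = 1 - dist[0]

for k, prob in enumerate(dist):
    print(f"{k} Golds: {prob:.4f}")
print(f"At least one Gold: {p_at_least_one:.4f}")
```

Under this chance-only model a wine entered in several competitions has a sizable probability of winning a Gold somewhere while winning nothing elsewhere, which is the pattern the abstract describes.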