# Chance News 38

## Quotations

If the devil exists, he no doubt has a high IQ and an Ivy League degree. It's clear that having an educational pedigree is no prophylactic against greed and bad behavior.

Tom Donaldson, professor of ethics and law at Wharton
Star Tribune, July 3, 2008

Paul Alper found in Wikipedia the following quotations from well-known statisticians relating to types of error.

In 1948, Frederick Mosteller argued that a "third kind of error" was required to describe circumstances he had observed, namely:

Type I error: rejecting the null hypothesis when it is true.

Type II error: accepting the null hypothesis when it is false.

Type III error: correctly rejecting the null hypothesis for the wrong reason.

In 1957, Allyn W. Kimball, a statistician with the Oak Ridge National Laboratory, proposed a different kind of error to stand beside "the first and second types of error in the theory of testing hypotheses". Kimball defined this new "error of the third kind" as being "the error committed by giving the right answer to the wrong problem".

Mathematician Richard Hamming expressed his view that

It is better to solve the right problem the wrong way than to solve the wrong problem the right way.

The famous Harvard economist Howard Raiffa describes an occasion when he, too, "fell into the trap of working on the wrong problem".

In 1974, Ian Mitroff and Tom Featheringham extended Kimball's category, arguing that:

One of the most important determinants of a problem's solution is how that problem has been represented or formulated in the first place.


They defined type III errors as either "the error ... of having solved the wrong problem ... when one should have solved the right problem" or "the error ... [of] choosing the wrong problem representation ... when one should have ... chosen the right problem representation".

In 1969, the Harvard economist Howard Raiffa jokingly suggested "a candidate for the error of the fourth kind: solving the right problem too late".

In 1970, Marascuilo and Levin proposed a "fourth kind of error" -- a "Type IV error" -- which they defined in a Mosteller-like manner as being the mistake of "the incorrect interpretation of a correctly rejected hypothesis"; which, they suggested, was the equivalent of "a physician's correct diagnosis of an ailment followed by the prescription of a wrong medicine".

See the Wikipedia article for references.

## Forsooth

Deborah Alper suggested the following Forsooth:

Roughly one-third of all eligible Americans, 64 million people, are not registered to vote. This percentage is even higher for African-Americans (30 percent) and Hispanics (40 percent).

The Nation
July 21/28, 2008, Page 32

Paul Alper suggested the following two Forsooths:

We want to persuade you of one claim: that William Sealy Gosset (1876-1937)--aka "Student" of Student's t-test--was right and that his difficult friend, Ronald A. Fisher, though a genius, was wrong.

From the preface of Cult of Statistical Significance:

Deirdre Nansen McCloskey and Steve Ziliak

Feb 19, 2008

"There's this cluster of interrelated findings", said Richard A. Lippa, a professor of psychology at California State University at Fullerton, who has found evidence that in gay men, the hair on the back of the head is more likely to curl counterclockwise than in straight men. "These are all biological markers that something must have gone on early in development".

From an article by Rob Stein in the Washington Post,
February 5, 2008

## Irreligion

Irreligion: A Mathematician Explains Why the Arguments for God Just Don't Add Up. By John Allen Paulos. 158 pp. Hill & Wang. \$20.

John suggested that Chance News readers might enjoy some of the arguments in his book that rely on probability concepts. We give a sample below, and you can see more of his probability arguments in a talk he gave at the recent conference "Beyond Belief: Enlightenment 2.0" sponsored by The Science Network.

A common creationist argument goes roughly like the following. A very long sequence of individually improbable mutations must occur in order for a species or a biological process to evolve. If we assume these are independent events, then the probability of all of them occurring and occurring in the right order is the product of their respective probabilities, which is always a tiny number. Thus, for example, the probability of getting a 3, 2, 6, 2, and 5 when rolling a single die five times is 1/6 x 1/6 x 1/6 x 1/6 x 1/6 or 1/7,776 - one chance in 7,776. The much longer sequences of fortuitous events necessary for a new species or a new process to evolve leads to the minuscule probabilities that creationists argue prove that evolution is so wildly improbable as to be essentially impossible.

This line of argument, however, is deeply flawed. Leaving aside the issue of independent events, I note that there are always a fantastically huge number of evolutionary paths that might be taken by an organism (or a process), but there is only one that actually will be taken. So if, after the fact, we observe the particular evolutionary path actually taken and then calculate the a priori probability of its being taken, we will get the minuscule probability that creationists mistakenly attach to the process as a whole.
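The arithmetic behind this point can be checked directly. The following is a minimal sketch, not taken from the book: the function name `sequence_probability` is purely illustrative, and the die is assumed to be fair with independent rolls. It confirms the 1/7,776 figure and shows that every possible five-roll sequence is equally improbable, even though one of them is certain to occur.

```python
from fractions import Fraction
from itertools import product

def sequence_probability(rolls, sides=6):
    """Probability of seeing exactly this sequence of fair-die rolls, in this order."""
    return Fraction(1, sides) ** len(rolls)

# The particular sequence used in the text.
observed = (3, 2, 6, 2, 5)
p_observed = sequence_probability(observed)
print(p_observed)  # 1/7776

# Enumerate every possible five-roll sequence. Each has the same tiny
# probability, yet together they account for all of the probability.
all_sequences = list(product(range(1, 7), repeat=5))
total = sum(sequence_probability(s) for s in all_sequences)
print(len(all_sequences), total)  # 7776 1
```

Computing the probability of the observed sequence after the fact says nothing about whether some sequence of that kind was likely to occur.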

A related creationist argument is supplied by Michael Behe, a key supporter of intelligent design. Behe likens what he terms the "irreducible complexity" of phenomena such as the clotting of blood to the irreducible complexity of a mousetrap. If just one of the trap's pieces is missing -- whether it be the spring, the metal platform, or the board -- the trap is useless. The implicit suggestion is that all the parts of a mousetrap would have had to come into being at once, an impossibility unless there were an intelligent designer. Design proponents argue that what's true for the mousetrap is all the more true for vastly more complex biological phenomena. If any of the 20 or so proteins involved in blood clotting is absent, for example, clotting doesn't occur, and so, the creationist argument goes, these proteins must have all been brought into being at once by a designer.

But the theory of evolution does explain the evolution of complex biological organisms and phenomena, and the Paley argument from design has been decisively refuted. Natural selection acting on the genetic variation created by random mutation and genetic drift results in those organisms with more adaptive traits differentially surviving and reproducing. (Interestingly, that we and all life have evolved from simpler forms by natural selection disturbs fundamentalists who are completely unfazed by the Biblical claim that we come from dirt.) Further rehashing of defenses of Darwin or refutations of Paley is not my goal, however. Those who reject evolution are usually immune to such arguments anyway. Rather, my intention here is to develop some loose analogies between these biological issues and related economic ones and, secondarily, to show that these analogies point to a surprising crossing of political lines.

Paul Alper suggested that readers might enjoy the following:

Paulos often writes about unlikely events and how quickly the public tends to assume something supernatural is taking place. On page 52 of Irreligion he muses on numerological coincidences involving 9/11. He starts with 9/11 being "the telephone code for emergencies." The digits 9 + 1 + 1 sum to 11 and September 11 is the 254th day of the year so that 2 + 5 + 4 sum to 11. Further, there are another 111 days to the end of the year. The first plane to crash into the towers was flight number 11. The Pentagon, Afghanistan and New York City each have 11 letters. Moreover, any three-digit number when multiplied by 91 and 11 results in a six-digit number where digits four, five and six repeat digits one, two and three, respectively; in particular, starting with 911 results in 911,911. A few pages later he notes that on September 11, 2002 "the New York State lottery numbers were 911." The day before that, "the closing value of the September S&P 500 futures contracts" was 911. And to cinch it all, Johnny Unitas, the number one quarterback ever, died on September 11 and wore 19 on his jersey.
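Several of these coincidences are pure arithmetic and easy to verify. The sketch below is an illustration added here, not something from the book. It assumes the year 2001 (a non-leap year) and counts letters with spaces removed. It also shows why the multiplication trick works for every three-digit number: 91 x 11 = 1001.

```python
from datetime import date

def digit_sum(n):
    """Sum of the decimal digits of a non-negative integer."""
    return sum(int(c) for c in str(n))

d = date(2001, 9, 11)
day_of_year = d.timetuple().tm_yday          # position of the date in the year
days_left = (date(2001, 12, 31) - d).days    # days remaining after September 11

print(digit_sum(911), day_of_year, digit_sum(day_of_year), days_left)

# Letter counts, ignoring spaces.
for name in ("The Pentagon", "Afghanistan", "New York City"):
    print(name, len(name.replace(" ", "")))

# 91 * 11 = 1001, so multiplying any three-digit number abc by both
# factors gives abc * 1000 + abc, which is the digits abc written twice.
repeats = all(n * 91 * 11 == int(str(n) * 2) for n in range(100, 1000))
print(91 * 11, 911 * 91 * 11, repeats)
```

The last check makes the point that the 911,911 result is a property of the number 1001 rather than of 911.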

## An improbable event and a coincidence

I have an example of an improbable event and a coincidence; it shows the difference between them. At Forrest's graduation last night, all of the seniors marched, in alphabetical order, to the stage to receive their diplomas. The women were wearing gray gowns and the men were wearing black gowns. I was careful to note any siblings (as far as I could tell, there were none). GREAT! So now we have a random sequence of coin tosses of length about 310, and the coin is pretty close to fair. The longest sequence of consecutive men I observed was 9; this is somewhat longer than the expected length of the longest run of heads, which is about 7, and somewhat longer than the expected length of the longest run of either heads or tails, which is about 8. So I observed a fairly unusual event. The coincidence is that Forrest was in the longest run of men.

An email from Charles Grinstead to Laurie Snell about his son's graduation.
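The expected run lengths quoted in the email can be approximated by simulation. The following is a rough sketch under stated assumptions: 310 independent tosses of a perfectly fair coin, with "heads" standing in for men. The function names, trial count and seed are arbitrary choices made for illustration.

```python
import random

def longest_run(seq, symbol=None):
    """Longest run of `symbol` in seq, or of any repeated value if symbol is None."""
    best = current = 0
    previous = object()  # sentinel that compares unequal to every toss
    for x in seq:
        if x == previous:
            current += 1
        else:
            current, previous = 1, x
        if symbol is None or x == symbol:
            best = max(best, current)
    return best

def simulate(n=310, trials=2000, seed=38):
    """Estimate mean longest runs and the share of trials with a heads run of 9 or more."""
    rng = random.Random(seed)
    heads_runs, either_runs = [], []
    for _ in range(trials):
        tosses = [rng.random() < 0.5 for _ in range(n)]  # True = heads
        heads_runs.append(longest_run(tosses, True))
        either_runs.append(longest_run(tosses))
    mean_heads = sum(heads_runs) / trials
    mean_either = sum(either_runs) / trials
    share_9_plus = sum(r >= 9 for r in heads_runs) / trials
    return mean_heads, mean_either, share_9_plus

mean_heads, mean_either, share_9_plus = simulate()
print(f"mean longest run of heads:  {mean_heads:.2f}")
print(f"mean longest run of either: {mean_either:.2f}")
print(f"share of trials with a heads run of 9 or more: {share_9_plus:.2f}")
```

The last figure estimates how unusual a run of 9 or more is under these idealized assumptions. A real alphabetical list of graduates need not behave exactly like independent fair tosses.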

## The Drunkard's Walk: How Randomness Rules Our Lives

Leonard Mlodinow
Pantheon Books, New York, 2008

There are not many writers who can successfully write about mathematics for the general public but Leonard Mlodinow is one of them. He is a physicist who has written a number of successful books on physics and mathematics for the general public. He has also been an editor for Star Trek.

The Drunkard's Walk is his most recent book. In this book he shows that we all have a hard time understanding probability and yet it plays an important role in our daily lives. To show that we are not wired to understand probability, he has only to show us the birthday problem, the Monty Hall problem, the two sisters problem, the Linda problem, the two-envelope problem, etc.

Of course to understand how probability affects our lives we have to understand some basic probability. Mlodinow makes this more interesting by explaining probability along with the history of its development. He starts with Cardano introducing the sample space and solving dice problems. At the same time he discusses Cardano's colorful life. He then discusses Pascal and Fermat's solution to the problem of points. He continues with Bernoulli, de Moivre, and Bayes and explains their contributions, including the law of large numbers, the central limit theorem, and conditional probability. Of course none of this is new, but what makes this book so interesting is that while Mlodinow discusses probability concepts and applications he also explains how they can affect our lives. For example, the argument that there is no such thing as a hot hand in basketball might also be made about your stock adviser.

To hear this in action, listen to Mlodinow himself here.

You can also read a review of this book and other similar books here by the well-known probabilist David J. Aldous in the Berkeley Statistics Department. David teaches an undergraduate seminar, "From Undergraduate Probability Theory to the Real World". You will also find on his website a talk, "The top ten things that math probability says about the real world".

Submitted by Laurie Snell

## Researchers Fail to Reveal Full Drug Pay

New York Times, June 8, 2008
Gardiner Harris and Benedict Carey

The authors say:

A world-renowned Harvard child psychiatrist whose work has helped fuel an explosion in the use of powerful antipsychotic medicines in children earned at least \$1.6 million in consulting fees from drug makers from 2000 to 2007 but for years did not report much of this income to university officials, according to information given Congressional investigators.

By failing to report income, the psychiatrist, Dr. Joseph Biederman, and a colleague in the psychiatry department at Harvard Medical School, Dr. Timothy E. Wilens, may have violated federal and university research rules designed to police potential conflicts of interest, according to Senator Charles E. Grassley, Republican of Iowa. Some of their research is financed by government grants.

Like Dr. Biederman, Dr. Wilens belatedly reported earning at least \$1.6 million from 2000 to 2007, and another Harvard colleague, Dr. Thomas Spencer, reported earning at least \$1 million after being pressed by Mr. Grassley’s investigators. But even these amended disclosures may understate the researchers’ outside income because some entries contradict payment information from drug makers, Mr. Grassley found.

In one example, Dr. Biederman reported no income from Johnson & Johnson for 2001 in a disclosure report filed with the university. When asked to check again, he said he received \$3,500. But Johnson & Johnson told Mr. Grassley that it paid him \$58,169 in 2001, Mr. Grassley found.

The Harvard group’s consulting arrangements with drug makers were already controversial because of the researchers’ advocacy of unapproved uses of psychiatric medicines in children.

In addition to money that they get from the drug companies, researchers often get additional support from the National Institutes of Health, which has some responsibility to monitor conflicts of interest. Since neither the universities nor the NIH seem to be doing their duty, Senator Grassley is asking Congress and the NIH to do something about this. You can read his proposal here.

## Does the internet help or confuse medical decisions?

I have been diagnosed as having Mild Cognitive Impairment (MCI). This is a transition stage between the cognitive changes of normal aging and the more serious problems caused by Alzheimer's disease (AD). It has been recommended that I take two medications, Aricept (donepezil) and Exelon (rivastigmine), which do not improve memory but are thought to delay the occurrence of Alzheimer’s disease. Since the possible side effects are unpleasant, I decided to look on the web for studies on the effectiveness of these drugs.

I found that the recommendation for donepezil is often based on the article:

Vitamin E and Donepezil for the Treatment of Mild Cognitive Impairment.
New England Journal of Medicine, June 9, 2005.
Ronald C. Petersen and others.

This is a well-designed experiment, and the authors describe their study as follows:

A total of 769 subjects were enrolled, and possible or probable Alzheimer's disease developed in 212. The overall rate of progression from mild cognitive impairment to Alzheimer's disease was 16 percent per year. As compared with the placebo group, there were no significant differences in the probability of progression to Alzheimer's disease in the vitamin E group (hazard ratio, 1.02; 95 percent confidence interval, 0.74 to 1.41; P=0.91) or the donepezil group (hazard ratio, 0.80; 95 percent confidence interval, 0.57 to 1.13; P=0.42) during the three years of treatment. Prespecified analyses of the treatment effects at 6-month intervals showed that, as compared with the placebo group, the donepezil group had a reduced likelihood of progression to Alzheimer's disease during the first 12 months of the study (P=0.04), a finding supported by the secondary outcome measures. Among carriers of one or more apolipoprotein E ε4 alleles, the benefit of donepezil was evident throughout the three-year follow-up. There were no significant differences in the rate of progression to Alzheimer's disease between the vitamin E and placebo groups at any point, either among all patients or among apolipoprotein E ε4 carriers.

Their conclusion was:

Vitamin E had no benefit in patients with mild cognitive impairment. Although donepezil therapy was associated with a lower rate of progression to Alzheimer's disease during the first 12 months of treatment, the rate of progression to Alzheimer's disease after three years was not lower among patients treated with donepezil than among those given placebo.

The three-year result was the primary outcome, whereas the one-year result was an exploratory outcome, so we have to be cautious about it.

Bob Norman pointed out another thing we might worry about. He suggested the following picture:
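One reason for caution about a result found by looking at several time points is that each extra look is another chance for a false positive. The sketch below is a simplified illustration, not the trial's actual analysis. It assumes independent looks, each tested at the 0.05 level, with no true treatment effect. Looks at accumulating data from the same patients are correlated, so the figure it produces is an illustration of the direction of the problem rather than the trial's real error rate. Three years at 6-month intervals gives six looks.

```python
def prob_at_least_one_false_positive(looks, alpha=0.05):
    """Chance that at least one of `looks` independent tests is 'significant'
    at level alpha when there is in fact no effect."""
    return 1 - (1 - alpha) ** looks

for looks in (1, 2, 6):
    p = prob_at_least_one_false_positive(looks)
    print(f"{looks} look(s): {p:.3f}")
```

With a single prespecified look the chance stays at the nominal 5 percent, which is why the primary three-year outcome carries more weight than a difference that shows up at one of several interim time points.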

http://www.dartmouth.edu/~chance/forwiki/memory.jpg

The top line is the rate of progression of Alzheimer's for the placebo group and the bottom line for the treated group. This shows the rate of progression of Alzheimer's for the treated group decreasing in the first year and then increasing until the third year. But after that the treated group increases faster than the placebo group!

The most recent study for rivastigmine seems to be the following:

Effect of rivastigmine on delay to diagnosis of Alzheimer's disease from mild cognitive impairment:
Lancet Neurology, June 2007
Howard Feldman and others.

The authors describe their study as follows:

Of 1018 study patients enrolled, 508 were randomly assigned to rivastigmine and 510 to placebo; 17·3% of patients on rivastigmine and 21·4% on placebo progressed to AD (hazard ratio 0·85 [95% CI 0·64–1·12]; p=0·225). There was no significant difference between the rivastigmine and placebo groups on the standardized Z score for the cognitive test battery measured as mean change from baseline to endpoint (−0·10 [95% CI −0·63 to 0·44], p=0·726). Serious adverse events were reported by 141 (27·9%) rivastigmine-treated patients and 155 (30·5%) patients on placebo; adverse events of all types were reported by 483 (95·6%) rivastigmine-treated patients and 472 (92·7%) placebo-treated patients. The predominant adverse events were cholinergic: the frequencies of nausea, vomiting, diarrhoea, and dizziness were two to four times higher in the rivastigmine group than in the placebo group.

And their interpretation was:

There was no significant benefit of rivastigmine on the progression rate to AD (Alzheimer's) or on cognitive function over 4 years. The overall rate of progression from MCI to AD in this randomized clinical trial was much lower than predicted. Rivastigmine treatment was not associated with any significant safety concerns.

Ronald C. Petersen (lead author of the New England Journal of Medicine study) wrote a critique of this study:
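A reader without access to the raw data can still sanity-check a reported hazard ratio against its confidence interval. The sketch below uses a common back-of-the-envelope method and is not the authors' analysis. It assumes the estimate is approximately normal on the log scale and that the interval is a symmetric 95 percent interval. The function name `p_from_ratio_ci` is an illustrative choice. The numbers are the ones quoted from the two abstracts above.

```python
import math

def p_from_ratio_ci(estimate, lower, upper, z_crit=1.959964):
    """Approximate z statistic and two-sided p-value implied by a ratio
    estimate and its 95% confidence interval (normality on the log scale)."""
    se = (math.log(upper) - math.log(lower)) / (2 * z_crit)  # SE of log ratio
    z = math.log(estimate) / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
    return z, p

# (hazard ratio, CI lower, CI upper, p-value as reported in the abstract)
reported = {
    "vitamin E (NEJM 2005)": (1.02, 0.74, 1.41, 0.91),
    "donepezil (NEJM 2005)": (0.80, 0.57, 1.13, 0.42),
    "rivastigmine (Lancet Neurol 2007)": (0.85, 0.64, 1.12, 0.225),
}

for name, (hr, lo, hi, p_reported) in reported.items():
    z, p = p_from_ratio_ci(hr, lo, hi)
    print(f"{name}: z = {z:.2f}, implied p = {p:.2f}, reported p = {p_reported}")
```

The implied values need not match the published ones, since a paper may use a different test or adjust for multiple comparisons, and the interval endpoints are rounded. In all three cases the interval includes 1, which is what "no significant difference" means here.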

MCI treatment trials: failure or not? The Lancet Neurology - Volume 6, Issue 6 (June 2 Ronald C Petersen

In this critique he writes:

The 3-year duration of anticipated therapeutic effect applied to this study was, in retrospect, overly ambitious: the treatment did not work for this duration in the donepezil and vitamin E trial, and no cholinesterase inhibitor has been shown to work for 3 years, even in AD. Therefore, this duration of effect would not be expected at the MCI stage. This, coupled with the subtherapeutic doses used, contributed to the treatment failure.

In spite of all these challenges, there was a glimmer of efficacy—a trend towards a positive rivastigmine effect. Some of the MRI measures suggested a therapeutic response during the 1 year to 2 year window, which is similar to the mild efficacy effect in the donepezil and vitamin E trial?

Note: Ronald C. Petersen is the Director of the Mayo Alzheimer's Disease Research Center.

Well, where does that leave me? Answer: Confused.

--- Discussion ---

(1) What are exploratory outcomes and why do we have to worry about them?

(2) What do doctors know that the Internet doesn't?

(3) How do you think we should use the internet in making a medical decision?

Submitted by Laurie Snell

## A Controversy About Doping

Detecting drug cheats is a key issue in the lead-up to the Beijing Olympics this summer. This is no surprise, given the battering of top-flight sports by successive doping scandals. In June, cyclist Floyd Landis officially lost his 2006 Tour de France title, almost two years after his urine tested positive for synthetic testosterone. WADA (World Anti-Doping Agency) has accredited 33 labs around the world to perform anti-doping tests, including the test for the performance-enhancing drug known as EPO. Recently, some Danish researchers rained on the parade, claiming that their experiment proved that the EPO test had poor “detection power”. (For a good summary, see “The Validity of EPO Testing for Athletes”, ScienceDaily.com, June 28, 2008.)

In their paper, researchers at the Copenhagen Muscle Research Center made two specific claims: (a) that the detection power of the EPO test is poor; and (b) that agreement between results from two WADA-accredited labs is very poor. They concluded that due to a high false negative rate (low sensitivity), this test would fail to catch many drug cheats. They called into question WADA’s ability to detect EPO abuse at the Beijing Olympics.

In the experiment, they recruited eight (8) healthy college students, all non-athletes, to follow a program of EPO injection and exercise over a seven-week period, divided into three phases (boosting/higher dose, maintenance/lower dose, post treatment/off cycle). Eight, 16 and 24 urine samples were collected in these respective phases, in addition to eight samples taken before the EPO program to serve as a baseline. Each sample was split in half, and one half was submitted to each of two WADA-accredited labs (known as “Lab A” and “Lab B”) for EPO testing.

An excerpt from the table of results from the original paper is given below:

| Phase | Pre | Boosting | Maintenance1 | Maintenance2 | Post1 | Post2 | Post3 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Samples tested | 8 | 8 | 8 | 8 | 7* | 6* | 7* |
| Lab A + | 0 | 8 | 4 | 2 | 2 | 0 | 0 |
| Lab B + | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

\* Note: The reduction in samples in the post-treatment period was not explained.

Source: Lundby, et al., “Testing for recombinant human erythropoietin in urine: problems associated with current anti doping testing”, PresS. J Appl. Physiol, June 26, 2008, online.
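The counts in the excerpt can be turned into the share of samples each lab declared positive in each phase. The sketch below only restates the table as proportions. The variable names are illustrative, and the "suspicious" category mentioned in the paper is not included because the excerpt reports positives only.

```python
phases = ["Pre", "Boosting", "Maintenance1", "Maintenance2", "Post1", "Post2", "Post3"]
tested = [8, 8, 8, 8, 7, 6, 7]
lab_a_pos = [0, 8, 4, 2, 2, 0, 0]
lab_b_pos = [0, 0, 0, 0, 0, 0, 0]

# Share of samples declared positive by each lab, phase by phase.
for phase, n, a, b in zip(phases, tested, lab_a_pos, lab_b_pos):
    print(f"{phase:<13} n={n}  Lab A: {a}/{n} = {a / n:.0%}   Lab B: {b}/{n} = {b / n:.0%}")

# Pool the two maintenance columns, as the paper's discussion does.
maint = [i for i, p in enumerate(phases) if p.startswith("Maintenance")]
maint_tested = sum(tested[i] for i in maint)
maint_a = sum(lab_a_pos[i] for i in maint)
print(f"Maintenance pooled, Lab A: {maint_a}/{maint_tested} = {maint_a / maint_tested:.1%}")
```

The pooled maintenance figure treats 16 samples as one group even though they come from the same eight subjects measured twice.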

--- Discussion ---

(1) Describe a pair of metrics that statisticians use to measure the accuracy (“detection power”) of diagnostic tests. This experiment addressed only one of these metrics. Which one? Why couldn’t this experimental design capture the other metric? Why is it important to know the other metric before passing judgment on test accuracy?

(2) Inside the paper, the authors revealed that each lab classified each sample into one of three categories: positive, suspicious and negative. Suspicious cases are subject to further confirmatory testing (although we do not know if this was done). We were told that Lab A indicated 2 maintenance samples and 3 post-treatment samples “suspicious”; Lab B indicated 7 boosting samples and 5 maintenance samples “suspicious”; Lab B called 1 boosting sample negative. Do the additional data affect your opinion of the study’s conclusions?

(3) Comment on the sample size and the selection mechanism. How comfortable are you making statistical inferences from these data?

(4) In the Discussion, the authors used language like “In the ‘maintenance’ period, laboratory A found six positive results … in a total of 16 samples.” Explain why special testing methodology must be used if you were to treat this data as an n=16 sample. What key assumption is violated when ordinary testing is invoked?

--- Further Discussion ---

It has been widely reported that the EPO test is highly “unreliable” because this study proved that two labs, both WADA certified, could produce divergent analytical results. In response, Olivier Rabin, WADA’s scientific director, doubted that this study “reflected the true state of EPO testing”. (“Study Shows Problems in Olympic-Style Tests”, New York Times, June 26, 2008) Dr. Rabin raised a core question in statistical inference: given that the true state is unknown, one would like to see if the experimental data provides sufficient statistical evidence to prove/disprove one’s hypothesis.

(5) State null and alternative hypotheses for this problem. Identify the population and the sample in the Danish experiment. What was their sample size?

(6) Under what conditions are you willing to generalize the result of this test of Lab A vs. Lab B? (Or design a different experiment to test your stated hypotheses.)

Submitted by Kaiser Fung