# Chance News 66

## Quotations

"It ain’t what you don’t know that gets you into trouble. It’s what you know for sure that just ain’t so."
--Mark Twain

Quoted by Richard H. Thaler in The overconfidence problem in forecasting, New York Times, 21 August 2010.

Submitted by Paul Alper

## Forsooth

The following Forsooth is from the August 23, 2010 RSS News.

An editorial was published in the Journal of the National Cancer Institute (Volume 101, no 23, 2 December 2009). It announced some online resources for journalists, including a statistics glossary, which gave the following definitions:

• P value.  Probability that an observed effect size is due to chance alone.

if p ≥ 0.05, we say 'due to chance', 'not statistically significant'

if p < 0.05, we say 'not due to chance', 'statistically significant'

• Confidence interval (95% CI)

Because the observed value is only an estimate of the truth, we know it has a 'margin of error'.

The range of plausible values around the observed value that will contain the truth 95% of the time.

The journal subsequently (vol. 102, no. 11, 2 June 2010) published a letter commenting on the editorial and the statistics glossary. The authors of the original editorial replied as follows:

Dr Lash correctly points out that the descriptions of p values and 95% confidence intervals do not meet the formal frequentist statistical definitions.

[…] We were not convinced that working journalists would find these definitions user-friendly, so we sacrificed precision for utility.

Submitted by Laurie Snell

## Risk reduction

Burger and a statin to go? Or hold that, please?
by Kate Kelland and Genevra Pittman, Reuters, 13 August 2010

Dr. Darrel Francis is the leader of "a study published in the American Journal of Cardiology, [in which] scientists from the National Heart and Lung Institute at Imperial College London calculated that the reduction in heart disease risk offered by a statin could offset the increase in risk from eating a cheeseburger and a milkshake."

Further, "When people engage in risky behaviors like driving or smoking, they're encouraged to take measures that minimize their risk, like wearing a seatbelt or choosing cigarettes with filters. Taking a statin is a rational way of lowering some of the risks of eating a fatty meal."

Discussion

1. Obviously, the above comments and analogies are open to debate. Defend and criticize the comparisons made.

2. Risk analysis is very much in the domain of statistics. How would you estimate the risk of driving, smoking, eating a cheeseburger or taking a statin? Read the article in the American Journal of Cardiology to see how Francis and his co-authors estimated risk.

3. Pascal’s Wager is famous in religion, philosophy and statistics; his wager is the ultimate in risk analysis.

Historically, Pascal's Wager was groundbreaking as it had charted new territory in probability theory, was one of the first attempts to make use of the concept of infinity, [and] marked the first formal use of decision theory.

The wager is renowned for discussing the risk of not believing in God, presumably the Christian concept of God.

[A] person should wager as though God exists, because living life accordingly has everything to gain, and nothing to lose.

Read the Wikipedia article and discuss the risk tables put forth.

Submitted by Paul Alper

## Teaching with infographics

Teaching with infographics: Places to start
by Katherine Schulten, New York Times, The Learning Network blog, 23 August 2010

Each day of this week, the blog will present commentary on some aspect of infographics. Starting with discussion of what infographics are, the series proceeds through applications ranging from sciences and social sciences to the fine arts. Each category will feature examples that have appeared in the Times. The link above for the first day contains an index to the whole series. There is a wealth of material to browse here.

Submitted by Bill Peterson

## Subverting the Data Safety Monitoring Board

Don't Mess with the DSMB, Jeffrey M. Drazen and Alastair J.J. Wood. N Engl J Med 2010; 363:477-478, July 29, 2010.

The Data Safety Monitoring Board (DSMB) is supposed to be an independent group charged with interim review of data from a clinical trial to decide whether to stop a study early because of evidence that continuation of the trial would be unethical. Trials are commonly halted early because of sufficient evidence that a new drug is clearly superior/inferior to the comparison drug, or because of serious concerns about safety.

The independent function of a DSMB is vital.

Since the DSMB (data and safety monitoring board) is charged with ensuring that clinical equipoise is maintained as trial data are accrued, it is considered very bad, even self-destructive, behavior for people who are involved with the study to interact with DSMB members on trial-related issues. Traditionally, there has been a wall between investigators, sponsors, and the DSMB. This wall prevents preliminary findings from leaking out in ways that would prejudice the trial. For example, if it was known that the DSMB was examining a marginal increase in cardiovascular risk in a trial, then trial investigators might bias future recruitment by excluding patients at risk for such events.

In the real world, though, problems with the DSMB occur. In one case, a drug company bypassed the DSMB and conducted an in-house examination of data in a trial and quickly published the data to counter a recently published meta-analysis that suggested safety issues associated with that company's drug. Why is this a problem?

The DSMB should have been informed of our May 2007 article and checked the trial data to be sure that patients receiving rosiglitazone in the RECORD trial were not having adverse events at an unacceptable rate. If clinical equipoise was still in play, the trial should have been allowed to continue undisturbed (i.e., without publication of the RECORD interim analysis), without public comment from the DSMB, without communication with investigators, and without disturbing the integrity of the trial. On the other hand, if in the opinion of the DSMB equipoise no longer existed, then the trial should have been terminated — that is the way it is supposed to work. The DSMB protects the participants in a trial.

This editorial described a second trial where interim data that should have been blinded from everyone except the DSMB became publicly known. A detailed description of this trial can be found in an earlier NEJM article.

Another concern about revelation of data in a DSMB involves the potential for insider trading. There is a nice description of the problems that disclosure can have in this Seattle Times article from 2005.

### Questions

1. Some DSMBs analyze data that is blinded by coding the two arms of the study with generic letters like A and B. With generic letters, though, it still may be possible to guess which group is which. How?

2. There are also examples where the DSMB is presented only with aggregate data across both arms of the study. What types of safety issues could be analyzed with only aggregate data? What types of safety issues would be impossible to analyze with only aggregate data?

3. If the rules for stopping a study are specified in detail prior to data collection, would a DSMB still be needed?

Submitted by Steve Simon

## Think the answer's clear? Look again

The New York Times Science
by Katie Hafner
August 30, 2010

Presidential elections can be fatal.

Win an Academy Award and you’re likely to live longer than had you been a runner-up.

Interview for medical school on a rainy day, and your chances of being selected could fall.

Such are some of the surprising findings of Dr. Donald A. Redelmeier, a physician-researcher and perhaps the leading debunker of preconceived notions in the medical world.

Readers of Chance News will recall that the claim that Oscar winners live longer was itself debunked.

See these

Submitted by Laurie Snell

## Is the United States a religious outlier?

Religious outlier by Charles Blow, The New York Times, September 4, 2010.

The following image was published on the New York Times website.

The author, Charles Blow, is the visual OpEd columnist for the New York Times. His comments about the graph are rather brief.

With all of the consternation about religion in this country, it’s sometimes easy to lose sight of just how anomalous our religiosity is in the world. A Gallup report issued on Tuesday underscored just how out of line we are. Gallup surveyed people in more than 100 countries in 2009 and found that religiosity was highly correlated to poverty. Richer countries in general are less religious. But that doesn’t hold true for the United States.

### Questions

1. Does the United States look like an outlier to you? Are there any other outliers on this graph?

2. Why would there be a relationship between GDP and percentage of people who call themselves religious? Does a higher GDP cause lower religiosity? Does a lower religiosity cause a higher GDP? What sort of data could you collect that might help answer this question?

3. Do you like how Mr. Blow presented this data? What would you change, if anything, in this graph?

Submitted by Steve Simon

Notice how many dimensions are included in addition to the axes (percentage who say religion is important and G.D.P. per capita). The Gallup poll from which the graph came can be found here. Gallup says:

Results are based on telephone and face-to-face interviews conducted in 2009 with approximately 1,000 adults in each country. For results based on the total sample of national adults, one can say with 95% confidence that the maximum margin of sampling error ranges from ±5.3 percentage points in Lithuania to ±2.6 percentage points in India. In addition to sampling error, question wording and practical difficulties in conducting surveys can introduce error or bias into the findings of public opinion polls.
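As a rough check on these figures, the sketch below computes the textbook 95% margin of error for a proportion. It assumes simple random sampling and the worst case p = 0.5; Gallup's own calculation is not described in the quote and may include design effects or different sample sizes. The function names are illustrative only.

```python
import math

Z_95 = 1.96  # normal critical value for 95% confidence


def margin_of_error(n, p=0.5, z=Z_95):
    """Margin of error for a proportion under simple random sampling (an assumption)."""
    return z * math.sqrt(p * (1 - p) / n)


def implied_sample_size(moe, p=0.5, z=Z_95):
    """Simple-random-sample size that would give the stated margin of error."""
    return p * (1 - p) * (z / moe) ** 2


# Textbook margin for the "approximately 1,000 adults" mentioned in the quote.
print(f"n = 1000: +/-{100 * margin_of_error(1000):.1f} points")

# Sample sizes implied by the two extreme margins Gallup reports.
for country, moe in [("Lithuania", 0.053), ("India", 0.026)]:
    n_equiv = implied_sample_size(moe)
    print(f"{country}: +/-{100 * moe:.1f} points, equivalent n of about {n_equiv:.0f}")
```

Comparing the implied sample sizes with the nominal 1,000 interviews per country is one way to start thinking about what else, besides sample size, goes into a reported margin of error.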

What might be "practical difficulties in conducting surveys"? How might "question wording" cause problems? What effect might these have on the findings?

Submitted by Paul Alper

## Perfect Handshake formula

“Scientists Create Formula for Perfect Handshake”
Newspress, July 15, 2010

A University of Manchester psychologist has developed a formula for the Perfect Handshake. The formula was devised as part of a project for UK Chevrolet, who wanted a handshake training guide to be used by its sales force in promoting a new warranty plan.

An edited version of the formula is given by:

PH^2 = (e^2 + ve^2)(d^2) + (cg + dr)^2 + pi[(4s^2)(4p^2)]^2 + (vi + t + te)^2 + [(4c^2)(4du^2)]^2

where the following variables are measured on a scale of 1 to 5 for low to high traits, respectively:

(a) optimum score 5: e = eye contact; ve = verbal greeting; d = Duchenne smile; cg = completeness of grip
(b) optimum score 4: dr = dryness of hand
(c) optimum score 3: s = strength; p = position of hand; vi = vigor; t = temperature of hands; te = texture of hands; c = control; du = duration.

The article gives more details about the rating standards, as well as a phone number and email address with which to obtain a copy of the handshake training guide.
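For readers who want to experiment with the formula, here is a minimal Python sketch. It rests on two assumptions about the edited formula: the bracketed terms are read as pi[(4s^2)(4p^2)]^2 and [(4c^2)(4du^2)]^2, and a term such as 4s^2 means 4 times s squared. The function and variable names are illustrative and do not come from the article.

```python
import math

# "Optimum" ratings as listed in the article (each variable is rated 1 to 5).
OPTIMUM = {
    "e": 5, "ve": 5, "d": 5, "cg": 5,
    "dr": 4,
    "s": 3, "p": 3, "vi": 3, "t": 3, "te": 3, "c": 3, "du": 3,
}


def ph_squared(r):
    """Evaluate PH^2 for a dict of ratings, under the reading assumed above."""
    return (
        (r["e"] ** 2 + r["ve"] ** 2) * r["d"] ** 2
        + (r["cg"] + r["dr"]) ** 2
        + math.pi * ((4 * r["s"] ** 2) * (4 * r["p"] ** 2)) ** 2
        + (r["vi"] + r["t"] + r["te"]) ** 2
        + ((4 * r["c"] ** 2) * (4 * r["du"] ** 2)) ** 2
    )


def ph(r):
    """Perfect Handshake score: the square root of PH^2."""
    return math.sqrt(ph_squared(r))


# Extreme rating profiles for comparison with the stated optimum.
all_low = {name: 1 for name in OPTIMUM}
all_high = {name: 5 for name in OPTIMUM}

for label, ratings in [("all 1s", all_low), ("optimum", OPTIMUM), ("all 5s", all_high)]:
    print(f"{label:8s} PH = {ph(ratings):,.1f}")
```

Changing one rating at a time shows which terms dominate the score, which may be useful for the discussion questions that follow.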

Discussion
1. Interpret the phrase "optimum score" for each variable.
2. Find the minimum, optimum, and maximum Perfect Handshake scores.
3. Comment on the range of possible scores.

Submitted by Margaret Cibes

“Wonderlic Test”
Wikipedia, retrieved September 12, 2010

The Wonderlic[1] Personnel Test is a 12-minute, 50-question multiple-choice test of English and math, which is used to help employers evaluate the general aptitude of job candidates in many occupations.

A candidate’s score is the total number of correct answers, with a score of 20 indicating average intelligence. For NFL pre-draft candidates, average scores range from 16 (halfback) to 26 (offensive tackle).

Pat McInally, of Harvard, holds the record for a perfect score of 50. However, Dan Marino and Vince Young both scored 16 on the test. (See “So, how do you score?” for sample questions from a Wonderlic 2007 test.)

Business professor McDonald Mirabile is said to have compiled Wonderlic scores for 241 NFL quarterbacks in 2010, and found a mean score of 25.22 and a standard deviation of 7.46. Assuming that the standard deviation of any subgroup will not differ significantly from that of the population as a whole, the Wikipedia article suggests an equation relating a Wonderlic score to a standard IQ test score:

IQ = 100 + [(W − 20) / 7.46] * 15.
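The equation is a standard-score (z-score) rescaling: subtract the Wonderlic mean, divide by a Wonderlic standard deviation, then express the result on the IQ scale. A minimal sketch follows, using the values quoted above (mean 20, SD 7.46) together with the conventional IQ mean of 100 and SD of 15. The function names are illustrative, not part of the Wikipedia article.

```python
def rescale(score, from_mean, from_sd, to_mean, to_sd):
    """Map a score from one scale to another by matching z-scores."""
    z = (score - from_mean) / from_sd
    return to_mean + z * to_sd


def wonderlic_to_iq(w):
    """Apply the equation quoted above: IQ = 100 + [(W - 20) / 7.46] * 15."""
    return rescale(w, from_mean=20, from_sd=7.46, to_mean=100, to_sd=15)


# Scores mentioned in the text: 16, the stated average of 20, 26, and a perfect 50.
for w in (16, 20, 26, 50):
    print(f"Wonderlic {w:2d} -> IQ scale {wonderlic_to_iq(w):.1f}")
```

Note that the sketch simply mirrors the quoted equation, including its use of an SD estimated from NFL quarterbacks alongside a mean of 20 taken to represent average intelligence; whether that mix is appropriate is part of what the discussion questions ask.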

Note: Professor Mirabile’s 2005 study[2] of 84 drafted and signed quarterbacks from 1989 to 2004 showed “no statistically significant relationship between intelligence and collegiate passing performance.”

Discussion
1. Explain the role of each number, and numerical expression, in the equation relating a Wonderlic score to a standard IQ test score.
2. An SAT Reasoning Test (called "Scholastic Aptitude Test" pre-2005) score is scaled to a mean of 500 and a standard deviation of 100. Suggest an equation relating a Wonderlic score to an SAT Reasoning score.
3. While any pair of these scores can be related to each other via an equation, do you believe that they should be, i.e., that such relationships would be meaningful? What else would you need to know in order to decide?
4. Can you find any evidence, online or elsewhere, that a Wonderlic test is a valid and/or reliable measure of "aptitude" for jobs?

Submitted by Margaret Cibes

## Too much data on risks of BPA

In Feast of Data on BPA Plastic, No Final Answer. Denise Grady, The New York Times, September 6, 2010.

The more research there is in an area, the greater the chances of reaching a consensus. That seems intuitive enough, but in the case of BPA, more data have not helped to resolve the controversy.

The research has been going on for more than 10 years. Studies number in the hundreds. Millions of dollars have been spent. But government health officials still cannot decide whether the chemical bisphenol-A, or BPA, a component of some plastics, is safe.

There are plenty of examples where one research study contradicts another, but usually scientists agree that one of the studies trumps the results of the other one because of superior research design. This does not appear to be the case with BPA.

The mountains of data produced so far show conflicting results as to whether BPA is dangerous, in part because different laboratories have studied the chemical in different ways. Animal strains, doses, methods of exposure and the results being measured — as crude as body weight or as delicate as gene expression in the brain — have all varied, making it difficult or impossible to reconcile the findings. In science, no experiment is taken seriously unless other researchers can reproduce it, and difficulties in matching BPA studies have led to fireworks.

Scientists are arguing over which set of studies to believe.

Most of the evidence against BPA comes from studies that find harmful effects in rats and mice at low doses comparable to the levels to which people are exposed. Sometimes the results seem downright weird, indicating that low doses could be worse than higher ones. There is sharp disagreement among scientists about how to interpret some research. The disputes arise in part because scientists from different disciplines — endocrinologists versus toxicologists, academic researchers versus those at regulatory agencies — do research in different ways that can make findings hard to reconcile.

There are some patterns to the findings.

She and other scientists said studies by university labs tended to find low-dose effects, and studies by government regulatory agencies and industry tended not to find them. The split occurs in part because the studies are done differently. Universities, Dr. Birnbaum said, “have moved rapidly ahead with advances in science,” while regulators have used “older methods.” Some researchers consider the regulatory studies more reliable because they generally use much larger numbers of animals and adhere to formal guidelines called “good laboratory practices,” but Dr. Birnbaum described those practices as “good record-keeping” and said, “That doesn’t mean the right questions were being asked.” The low-dose studies are newer and have raised safety issues that need to be resolved, she said. Last year, a scientific group called the Endocrine Society issued a 34-page report expressing serious concerns about endocrine-disrupting compounds, including BPA, dioxins, PCBs, DDT, the plasticizers known as phthalates and DES.

### Questions

1. Are there other areas of science where research studies fail to reach the same conclusion?

2. Is the inability to use randomized trials in this area a possible explanation for why one set of studies is not considered to be definitive? What are some other possible explanations?

3. When a study fails to show a dose-response relationship (a greater effect at a higher dose than at a lower dose), that is considered a serious problem. Is there a possible explanation for the lack of dose-response in these studies?

Submitted by Steve Simon

## All the news that the data tell us is fit to print

Some Newspapers, Tracking Readers Online, Shift Coverage, Jeremy W. Peters, The New York Times, September 5, 2010.

Newspapers are not like other businesses.

In most businesses, not knowing how well a particular product is performing would be almost unthinkable. But newspapers have always been a peculiar business, one that has stubbornly, proudly clung to a sense that focusing too much on the bottom line can lead nowhere good.

But that may be changing.

Now, because of technology that can pinpoint what people online are viewing and commenting on, how much time they spend with an article and even how much money an article makes in advertising revenue, newspapers can make more scientific decisions about allocating their ever scarcer resources.

This is quite different from reader polls about which comic strips to keep or drop. The article goes on to describe how major newspapers like the Wall Street Journal and the Washington Post use web traffic data to make decisions about how and what to cover. There are dissenting voices, and the New York Times, where this article originated, says that it does not make decisions about coverage based on statistics.

Submitted by Steve Simon