Chance News 66
August 13, 2010 to September 30, 2010
- 1 Quotations
- 2 Forsooth
- 3 Risk reduction
- 4 Teaching with infographics
- 5 Subverting the Data Safety Monitoring Board
- 6 Debunking medical claims
- 7 Is the United States a religious outlier?
- 8 Perfect Handshake formula
- 9 Wondering about NFL IQs?
- 10 Too much data on risks of BPA
- 11 All the news that the data tell us is fit to print
- 12 Data matters
- 13 Getting caught in the wrong arm of a randomized trial
- 14 Craps record analyzed
"It ain’t what you don’t know that gets you into trouble. It’s what you know for sure that just ain’t so."
Quoted by Richard H. Thaler in The overconfidence problem in forecasting, New York Times, 21 August 2010.
Submitted by Paul Alper
"Nine out of 10 ‘SHOCKING’ poll results aren’t shocking if you’re paying attention."
Writing in his FiveThirtyEight blog on September 22 (he was making a point about random variation and potential technical difficulties in polling, issues that are often overlooked in an unscientific reading of polling results).
Submitted by Bill Peterson
The following Forsooth is from the August 23, 2010 RRS News.
An editorial was published in the Journal of the National Cancer Institute (Volume 101, no 23, 2 December 2009). It announced some online resources for journalists, including a statistics glossary, which gave the following definitions:
- P value. Probability that an observed effect size is due to chance alone.
if p ≥ 0.05, we say 'due to chance', 'not statistically significant'
if p < 0.05, we say 'not due to chance', 'statistically significant'
- Confidence interval (95% CI)
Because the observed value is only an estimate of the truth, we know it has a 'margin of error'.
The range of plausible values around the observed value that will contain the truth 95% of the time.
The journal subsequently (vol. 102, no. 11, 2 June 2010) published a letter commenting on the editorial and the statistics glossary. The authors of the original editorial replied as follows:
Dr Lash correctly points out that the descriptions of p values and 95% confidence intervals do not meet the formal frequentist statistical definitions.
[…] We were not convinced that working journalists would find these definitions user-friendly, so we sacrificed precision for utility.
Submitted by Laurie Snell
"Two years ago the sausage roll was the number one snack, but is now in second place with 53% of sales."
Submitted by Jeremy Miles
Burger and a statin to go? Or hold that, please?
by Kate Kelland and Genevra Pittman, Reuters, 13 August 2010
Dr. Darrel Francis is the leader of "a study published in the American Journal of Cardiology, [in which] scientists from the National Heart and Lung Institute at Imperial College London calculated that the reduction in heart disease risk offered by a statin could offset the increase in risk from eating a cheeseburger and a milkshake."
Further, "When people engage in risky behaviors like driving or smoking, they're encouraged to take measures that minimize their risk, like wearing a seatbelt or choosing cigarettes with filters. Taking a statin is a rational way of lowering some of the risks of eating a fatty meal."
1. Obviously, the above comments and analogies are open to debate. Defend and criticize the comparisons made.
2. Risk analysis is very much in the domain of statistics. How would you estimate the risk of driving, smoking, eating a cheeseburger or taking a statin? Read the article in the American Journal of Cardiology to see how Francis and his co-authors estimated risk.
3. Pascal’s Wager is famous in religion, philosophy and statistics; his wager is the ultimate in risk analysis.
Historically, Pascal's Wager was groundbreaking as it had charted new territory in probability theory, was one of the first attempts to make use of the concept of infinity, [and] marked the first formal use of decision theory.
The wager is renowned for discussing the risk of not believing in God, presumably the Christian concept of God.
[A] person should wager as though God exists, because living life accordingly has everything to gain, and nothing to lose.
Read the Wikipedia article and discuss the risk tables put forth.
Submitted by Paul Alper
Teaching with infographics
Teaching with infographics: Places to start
by Katherine Schulten, New York Times, The Learning Network blog, 23 August 2010
Each day of this week, the blog will present commentary on some aspect of infographics. Starting with discussion of what infographics are, the series proceeds through applications ranging from sciences and social sciences to the fine arts. Each category will feature examples that have appeared in the Times. The link above for the first day contains an index to the whole series. There is a wealth of material to browse here.
Another nice resource is the Google Public Data Explorer, which uses Gapminder's "Trendalyzer" presentation tool to facilitate data exploration activities. (Gapminder was discussed in an earlier post; see CN 63.) Gapminder Labs provides some more advanced tools for creating web-based presentations, and includes sample lesson plans for teachers.
Submitted by Bill Peterson
Subverting the Data Safety Monitoring Board
Don't Mess with the DSMB, Jeffrey M. Drazen and Alastair J.J. Wood, N Engl J Med 2010; 363:477-478, July 29, 2010.
The Data Safety Monitoring Board (DSMB) is supposed to be an independent group charged with interim review of data from a clinical trial to decide whether to stop a study early because of evidence that continuation of the trial would be unethical. Trials are commonly halted early because of sufficient evidence that a new drug is clearly superior/inferior to the comparison drug, or because of serious concerns about safety.
The independent function of a DSMB is vital.
Since the DSMB (data and safety monitoring board) is charged with ensuring that clinical equipoise is maintained as trial data are accrued, it is considered very bad, even self-destructive, behavior for people who are involved with the study to interact with DSMB members on trial-related issues. Traditionally, there has been a wall between investigators, sponsors, and the DSMB. This wall prevents preliminary findings from leaking out in ways that would prejudice the trial. For example, if it was known that the DSMB was examining a marginal increase in cardiovascular risk in a trial, then trial investigators might bias future recruitment by excluding patients at risk for such events.
In the real world, though, problems with the DSMB occur. In one case, a drug company bypassed the DSMB and conducted an in-house examination of data in a trial and quickly published the data to counter a recently published meta-analysis that suggested safety issues associated with that company's drug. Why is this a problem?
The DSMB should have been informed of our May 2007 article and checked the trial data to be sure that patients receiving rosiglitazone in the RECORD trial were not having adverse events at an unacceptable rate. If clinical equipoise was still in play, the trial should have been allowed to continue undisturbed (i.e., without publication of the RECORD interim analysis), without public comment from the DSMB, without communication with investigators, and without disturbing the integrity of the trial. On the other hand, if in the opinion of the DSMB equipoise no longer existed, then the trial should have been terminated — that is the way it is supposed to work. The DSMB protects the participants in a trial.
This editorial described a second trial where interim data that should have been blinded from everyone except the DSMB became publicly known. A detailed description of this trial can be found in an earlier NEJM article.
Another concern about revelation of data in a DSMB involves the potential for insider trading. There is a nice description of the problems that disclosure can have in this Seattle Times article from 2005.
1. Some DSMBs analyze data that is blinded by coding the two arms of the study with generic letters like A and B. With generic letters, though, it still may be possible to guess which group is which. How?
2. There are also examples where the DSMB is presented only with aggregate data across both arms of the study. What types of safety issues could be analyzed with only aggregate data? What types of safety issues would be impossible to analyze with only aggregate data?
3. If the rules for stopping a study are specified in detail prior to data collection, would a DSMB still be needed?
Submitted by Steve Simon
Debunking medical claims
Think the answer's clear? Look again
by Katie Hafner, New York Times, 30 August 2010
Here we read:
Presidential elections can be fatal.
Win an Academy Award and you’re likely to live longer than had you been a runner-up.
Interview for medical school on a rainy day, and your chances of being selected could fall.
Such are some of the surprising findings of Dr. Donald A. Redelmeier, a physician-researcher and perhaps the leading debunker of preconceived notions in the medical world.
Readers of Chance News will recall that it was the claim that Oscar winners live longer that was debunked. See these links:
- Oscar winners do not live longer
- McGill researchers debunk Oscar-winner 'longevity bonus'
- Do Oscar winners live longer than less successful peers? A reanalysis of the evidence
Submitted by Laurie Snell
Is the United States a religious outlier?
Religious outlier by Charles Blow, The New York Times, September 4, 2010.
The following image was published on the New York Times website.
The author, Charles Blow, is the visual OpEd columnist for the New York Times. His comments about the graph are rather brief.
With all of the consternation about religion in this country, it’s sometimes easy to lose sight of just how anomalous our religiosity is in the world. A Gallup report issued on Tuesday underscored just how out of line we are. Gallup surveyed people in more than 100 countries in 2009 and found that religiosity was highly correlated to poverty. Richer countries in general are less religious. But that doesn’t hold true for the United States.
1. Does the United States look like an outlier to you? Are there any other outliers on this graph?
2. Why would there be a relationship between GDP and percentage of people who call themselves religious? Does a higher GDP cause lower religiosity? Does a lower religiosity cause a higher GDP? What sort of data could you collect that might help answer this question?
3. Do you like how Mr. Blow presented this data? What would you change, if anything, in this graph?
Submitted by Steve Simon
Notice how many dimensions are included in addition to the axes (percentage who say religion is important and G.D.P. per capita). The Gallup poll from which the graph came can be found here. Gallup says:
Results are based on telephone and face-to-face interviews conducted in 2009 with approximately 1,000 adults in each country. For results based on the total sample of national adults, one can say with 95% confidence that the maximum margin of sampling error ranges from ±5.3 percentage points in Lithuania to ±2.6 percentage points in India. In addition to sampling error, question wording and practical difficulties in conducting surveys can introduce error or bias into the findings of public opinion polls.
What might be the "practical difficulties in conducting surveys"? What problems might "question wording" cause? What effect might these have on the findings?
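As a rough check on the sampling error quoted by Gallup, here is a minimal sketch of the textbook margin-of-error calculation. It assumes simple random sampling and the worst case p = 0.5; Gallup's own figures presumably also reflect each country's actual sample size and survey design, which this sketch ignores. The function names are our own.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion.

    Assumes a simple random sample of size n; p = 0.5 gives the maximum.
    """
    return z * math.sqrt(p * (1 - p) / n)

def sample_size_for(moe, p=0.5, z=1.96):
    """Simple-random-sample size that would give the stated margin of error."""
    return z ** 2 * p * (1 - p) / moe ** 2

# Margin of error for a nominal sample of 1,000 adults.
print(f"n = 1000: +/- {100 * margin_of_error(1000):.1f} points")

# Sample sizes implied by the two extremes Gallup reports,
# under the simple-random-sampling assumption only.
for moe in (0.053, 0.026):
    print(f"+/- {100 * moe:.1f} points: n of about {sample_size_for(moe):.0f}")
```

Comparing the implied sample sizes with the "approximately 1,000 adults" in Gallup's statement is one way to start thinking about how survey design affects precision.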
Submitted by Paul Alper
Perfect Handshake formula
“Scientists Create Formula for Perfect Handshake”
Newspress, July 15, 2010
A University of Manchester psychologist has developed a formula for the Perfect Handshake. The formula was devised as part of a project for UK Chevrolet, who wanted a handshake training guide to be used by its sales force in promoting a new warranty plan.
An edited version of the formula is given by:
PH^2 = (e^2 + ve^2)(d^2) + (cg + dr)^2 + pi[(4s^2)(4p^2)]^2 + (vi + t + te)^2 + [(4c^2)(4du^2)]^2
where the following variables are measured on a scale of 1 to 5 for low to high traits (optimum scores in parentheses):
e = eye contact (5); ve = verbal greeting (5); d = Duchenne smile (5); cg = completeness of grip (5); dr = dryness of hand (4); s = strength (3); p = position of hand (3); vi = vigor (3); t = temperature of hands (3); te = texture of hands (3); c = control (3); du = duration (3).
The article gives more details about the rating standards, as well as a phone number and email address with which to obtain a copy of the handshake training guide.
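For readers who want to experiment with the formula, here is a minimal sketch that evaluates it for a given set of ratings. It assumes the brackets group as pi[(4s^2)(4p^2)]^2 and [(4c^2)(4du^2)]^2, and that a term such as 4s^2 means 4 times s squared; both readings are our interpretation of the edited formula. The function and variable names are our own.

```python
import math

# Optimum ratings quoted in the text (each trait is rated from 1 to 5).
OPTIMUM = {"e": 5, "ve": 5, "d": 5, "cg": 5, "dr": 4, "s": 3,
           "p": 3, "vi": 3, "t": 3, "te": 3, "c": 3, "du": 3}

def ph_squared(r):
    """Right-hand side of the edited formula for a dict of ratings r.

    Assumption: 4s^2 is read as 4 * s**2 (likewise for p, c and du).
    """
    return ((r["e"] ** 2 + r["ve"] ** 2) * r["d"] ** 2
            + (r["cg"] + r["dr"]) ** 2
            + math.pi * ((4 * r["s"] ** 2) * (4 * r["p"] ** 2)) ** 2
            + (r["vi"] + r["t"] + r["te"]) ** 2
            + ((4 * r["c"] ** 2) * (4 * r["du"] ** 2)) ** 2)

def ph(r):
    """Perfect Handshake score: the square root of PH^2."""
    return math.sqrt(ph_squared(r))

# Compare the score at the quoted optimum with all-lowest and all-highest ratings.
lowest = dict.fromkeys(OPTIMUM, 1)
highest = dict.fromkeys(OPTIMUM, 5)
for label, ratings in [("all 1s", lowest), ("optimum", OPTIMUM), ("all 5s", highest)]:
    print(f"{label}: PH = {ph(ratings):,.1f}")
```

Changing one rating at a time shows how strongly the two bracketed terms dominate the total, which bears on the discussion questions below.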
1. Interpret the phrase "optimum score" for each variable.
2. Find the minimum, optimum, and maximum Perfect Handshake scores.
3. Comment on the range of possible scores.
Submitted by Margaret Cibes
Wondering about NFL IQs?
Wikipedia, retrieved September 12, 2010
The Wonderlic Personnel Test is a 12-minute, 50-question multiple-choice test of English and math, which is used to help employers evaluate the general aptitude of job candidates in many occupations.
A candidate’s score is the total number of correct answers, with a score of 20 indicating average intelligence. For NFL pre-draft candidates, average scores range from 16 (halfback) to 26 (offensive tackle).
Pat McInally, of Harvard, holds the record with a perfect score of 50. However, Dan Marino and Vince Young both scored 16 on the test. (See “So, how do you score?” for sample questions from a Wonderlic 2007 test.)
Business professor McDonald Mirabile is said to have compiled Wonderlic scores for 241 NFL quarterbacks in 2010, and found a mean score of 25.22 and a standard deviation of 7.46. Assuming that the standard deviation of any subgroup will not differ significantly from that of the population as a whole, the Wikipedia article suggests an equation relating a Wonderlic score to a standard IQ test score:
IQ = 100 + [(W − 20) / 7.46] * 15.
Note: Professor Mirabile’s 2005 study of 84 drafted and signed quarterbacks from 1989 to 2004 showed “no statistically significant relationship between intelligence and collegiate passing performance.”
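The conversion above is a standard-score rescaling, and a minimal sketch may make the arithmetic concrete. It uses only the numbers quoted in the text (Wonderlic mean 20 and standard deviation 7.46, IQ mean 100 and standard deviation 15); the function names are our own, and nothing here says whether the conversion is meaningful.

```python
def rescale(score, src_mean, src_sd, dst_mean, dst_sd):
    """Map a score to another scale by matching standard (z) scores."""
    z = (score - src_mean) / src_sd      # standard deviations from the source mean
    return dst_mean + z * dst_sd         # same z-score on the destination scale

def wonderlic_to_iq(w):
    """IQ = 100 + [(W - 20) / 7.46] * 15, as quoted from the Wikipedia article."""
    return rescale(w, 20, 7.46, 100, 15)

# Wonderlic scores mentioned in the text.
for w in (16, 20, 26, 50):
    print(f"Wonderlic {w:2d} -> IQ {wonderlic_to_iq(w):.1f}")
```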
1. Explain the role of each number, and numerical expression, in the equation relating a Wonderlic score to a standard IQ test score.
2. An SAT Reasoning Test (called "Scholastic Aptitude Test" pre-2005) score is scaled to a mean of 500 and a standard deviation of 100. Suggest an analogous equation relating a Wonderlic score to an SAT Reasoning score.
3. While any pair of these scores can be related to each other via an equation, do you believe that they should be, i.e., that such relationships would be meaningful? What else would you need to know in order to decide?
4. Suppose that the Wonderlic test is not related to passing performance at all, at least for quarterbacks. What other aspect(s) of good quarterbacking, if any, might it measure?
Submitted by Margaret Cibes
Too much data on risks of BPA
In Feast of Data on BPA Plastic, No Final Answer, Denise Grady, The New York Times, September 6, 2010.
The more research there is in an area, the greater the chances of reaching a consensus. That seems intuitive enough, but in the case of BPA, more data does not seem to help resolve this contentious area.
The research has been going on for more than 10 years. Studies number in the hundreds. Millions of dollars have been spent. But government health officials still cannot decide whether the chemical bisphenol-A, or BPA, a component of some plastics, is safe.
There are plenty of examples where one research study contradicts another, but usually scientists agree that one of the studies trumps the results of the other one because of superior research design. This does not appear to be the case with BPA.
The mountains of data produced so far show conflicting results as to whether BPA is dangerous, in part because different laboratories have studied the chemical in different ways. Animal strains, doses, methods of exposure and the results being measured — as crude as body weight or as delicate as gene expression in the brain — have all varied, making it difficult or impossible to reconcile the findings. In science, no experiment is taken seriously unless other researchers can reproduce it, and difficulties in matching BPA studies have led to fireworks.
Scientists are arguing over which set of studies to believe.
Most of the evidence against BPA comes from studies that find harmful effects in rats and mice at low doses comparable to the levels to which people are exposed. Sometimes the results seem downright weird, indicating that low doses could be worse than higher ones. There is sharp disagreement among scientists about how to interpret some research. The disputes arise in part because scientists from different disciplines — endocrinologists versus toxicologists, academic researchers versus those at regulatory agencies — do research in different ways that can make findings hard to reconcile.
There are some patterns to the findings.
She and other scientists said studies by university labs tended to find low-dose effects, and studies by government regulatory agencies and industry tended not to find them. The split occurs in part because the studies are done differently. Universities, Dr. Birnbaum said, “have moved rapidly ahead with advances in science,” while regulators have used “older methods.” Some researchers consider the regulatory studies more reliable because they generally use much larger numbers of animals and adhere to formal guidelines called “good laboratory practices,” but Dr. Birnbaum described those practices as “good record-keeping” and said, “That doesn’t mean the right questions were being asked.” The low-dose studies are newer and have raised safety issues that need to be resolved, she said. Last year, a scientific group called the Endocrine Society issued a 34-page report expressing serious concerns about endocrine-disrupting compounds, including BPA, dioxins, PCBs, DDT, the plasticizers known as phthalates and DES.
1. Are there other areas of science where research studies fail to reach the same conclusion?
2. Is the inability to use randomized trials in this area a possible explanation for why one set of studies is not considered to be definitive? What are some other possible explanations?
3. When a study fails to show a dose-response relationship (a greater effect at a higher dose than at a lower dose), that is considered a serious problem. Is there a possible explanation for the lack of dose-response in these studies?
Submitted by Steve Simon
All the news that the data tell us is fit to print
Some Newspapers, Tracking Readers Online, Shift Coverage, Jeremy W. Peters, The New York Times, September 5, 2010.
Newspapers are not like other businesses.
In most businesses, not knowing how well a particular product is performing would be almost unthinkable. But newspapers have always been a peculiar business, one that has stubbornly, proudly clung to a sense that focusing too much on the bottom line can lead nowhere good.
But that may be changing.
Now, because of technology that can pinpoint what people online are viewing and commenting on, how much time they spend with an article and even how much money an article makes in advertising revenue, newspapers can make more scientific decisions about allocating their ever scarcer resources.
This is quite different from reader polls about which comic strips to keep or delete. The article goes on to describe how major newspapers like the Wall Street Journal and the Washington Post use web traffic data to make decisions about how and what to cover. There are dissenting voices, and the New York Times, where this article originated, says that it does not make decisions about coverage based on statistics.
Submitted by Steve Simon
Evidence-based medicine “or Evidence-based practice (EBP) aims to apply the best available evidence gained from the scientific method to clinical decision making.” EBP can be viewed as a technique for distinguishing what is true from what is merely plausible. Often, careful research protocols indicate that such items as vitamin E, beta carotene, fiber, massive screening for cancer, etc., fail to live up to expectations even though we intuitively feel they must be beneficial.
A recent article, Phys Ed: Does Stretching Before Running Prevent Injuries? (New York Times, 1 September 2010), cites another case where EBP counters intuition and plausibility. We all “know” that stretching before running is beneficial. Yet, when 1,400 runners varying in age from “13 to past 60” were randomly assigned to (static) stretching or to not stretching:
About 16 percent of the group that didn’t stretch were hobbled badly enough to miss training for at least three days (the researchers’ definition of a running injury), while about 16 percent of the group that did stretch were laid up for the same amount of time.
However, in order to overthrow our intuition in favor of what EBP is telling us, we have to believe the data is honestly presented. Such belief can be naïve, especially when money plays a vital role. The case of Timothy Kuklo has appeared several times in Chance News (see entries in CN 50, CN 49, and CN 48). Other conflicts of interest, and failures to state such conflicts, have been an ongoing embarrassment for medical journals. See Medical Industry Ties Often Undisclosed in Journals (New York Times, 13 September 2010) for recent egregious instances.
Unfortunately, fraudulent data also arises in areas outside of medicine and the academic world. Especially if data is taken over time, the urge to fake it in subsequent periods is overwhelming, as indicated by Sir Walter Scott’s famous quotation, “Oh what a tangled web we weave when first we practice to deceive.” The depressing story of NYPD’s continuing corruption and increased data faking may be found in The Village Voice:
As a result, the tapes show, the rank-and-file NYPD street cop experiences enormous pressure in a strange catch-22: He or she is expected to maintain high "activity"—including stop-and-frisks—but, paradoxically, to record fewer actual crimes.
Even more frightening is to listen via the streaming audio to The Right to Remain Silent (episode number 414 of the public radio program This American Life):
For 17 months, New York police officer Adrian Schoolcraft recorded himself and his fellow officers on the job, including their supervisors ordering them to do all sorts of things that police aren't supposed to do. For example, downgrading real crimes into lesser ones, so they wouldn't show up in the crime statistics and make their precinct look bad.
1. The full study on stretching may be found here. Take note of the many caveats, such as the finding that runners who ordinarily stretched but were randomly assigned to forgo stretching were more likely to be injured.
2. There is enormous controversy regarding annual mammograms for women less than 50 years old. Assuming that mammography is not efficacious for women under 50, why would any organization keep insisting it should be done?
3. Adrian Schoolcraft is suing NYPD and Jamaica Hospital for $50 million. It is alleged that
[O]n October 31, 2009, several high ranking NYPD officials illegally entered PO Schoolcraft’s home, forcibly removed him in handcuffs, seized his personal effects, including evidence he had gathered documenting NYPD corruption, and had him admitted to Jamaica Hospital Center against his will, under the false pretense that he was “emotionally disturbed.”
As noted above, Schoolcraft had “hard” evidence (his recordings) of data manipulation. Nonetheless, in this age of electronic manipulation, should his recordings be trusted? Be sure to listen to the end of the radio program: the police are angry when they discover his secret recording device, which falls out of his pocket during the rough interrogation, yet they miss another recording device in the room!
Submitted by Paul Alper
Getting caught in the wrong arm of a randomized trial
New Drugs Stir Debate on Rules of Clinical Trials, Amy Harmon, The New York Times, September 18, 2010.
This is the story of two cousins, both with a deadly disease and both enrolled in a clinical trial. One gets into the treatment arm and does really well, and the other gets in the control arm and does very poorly.
“Dude, you have to get on these superpills,” Thomas McLaughlin, then 24, whose melanoma was diagnosed first, urged his cousin, Brandon Ryan. Mr. McLaughlin’s tumors had stopped growing after two months of taking the pills. But when Mr. Ryan, 22, was admitted to the trial in May, he was assigned by a computer lottery to what is known as the control arm. Instead of the pills, he was to get infusions of the chemotherapy drug that has been the notoriously ineffective recourse in treating melanoma for 30 years.
Once randomized, you are stuck, it seems.
Even if it became clear that the chemotherapy could not hold back the tumors advancing into his lungs, liver and, most painfully, his spine, he would not be allowed to switch, lest it muddy the trial’s results.
It may seem harsh, but there is a price to be paid if drugs like this are not studied in rigorous and carefully controlled trials.
Defenders of controlled trials say they are crucial in determining whether a drug really does extend life more than competing treatments. Without the hard proof the trials can provide, doctors are left to prescribe unsubstantiated hope — and an overstretched health care system is left to pay for it. In melanoma, in particular, no drug that looked promising in early trials had ever turned out to prolong lives.
But critics of the trials argue that the new science behind the drugs has eclipsed the old rules — and ethics — of testing them. They say that in some cases, drugs under development, PLX4032 among them, may be so much more effective than their predecessors that putting half the potential beneficiaries into a control group, and delaying access to the drug to thousands of other patients, causes needless suffering.
It's dangerous to assume that the new drug is superior based on these two cousins.
Of course, no single pair of patients can fairly represent the outcomes of a trial whose results are not yet known. Rather, the story of Thomas McLaughlin and Brandon Ryan is one of entwined paths that suddenly diverged, with a roll of the dice.
Is there an option for patients like Brandon Ryan?
Dr. Chapman of Sloan-Kettering came up with a new tack: an unconventional bid to speed the drug’s approval, rooted in the observation that patients weeks or days from death could get out of bed and off oxygen when given PLX4032, sometimes for months. The doctors working with the drug referred to this as the Lazarus effect; it was unheard of with dacarbazine. A trial that cataloged PLX4032’s effect on the well-being of the sickest patients, Dr. Chapman argued, would probably yield fast, tangible results. For him, it represented a chance to give patients symptomatic relief, even if the drug turned out not to prolong life.
The drug company did not want to pursue this option.
But company officials feared that might lead to approval for only a narrow group of the sickest patients. The surest way to get the F.D.A’s endorsement for a broader market was a controlled trial. And with its competitors rushing to get similar drugs to market, the findings of such a trial might give Roche an advantage in marketing its version as the only one proven to prolong survival.
The article presents many additional comments, both supportive and critical of the clinical trial. It is an excellent and even-handed presentation of a very controversial area. There are also four interesting letters that appeared a week later.
1. Is it fair to make a promising new drug available only through a clinical trial until the clinical trial proves (or disproves) efficacy for this drug?
2. In the article, Brandon Ryan's mother is quoted as saying "What gives them the right to play God? It doesn’t make sense to say, 'We want you for a statistic' instead of giving them a chance at life." Do you agree with her?
Submitted by Steve Simon
Craps record analyzed
A world record in Atlantic City and the length of the shooter’s hand at craps,
by S.N. Ethier and Fred M. Hoppe, Mathematical Intelligencer, published online 13 August 2010
In an earlier post (see CN 49: A new record in craps), we discussed the story of a New Jersey woman who established a new world record in craps, rolling the dice 154 consecutive times before "sevening out." To find the probability of this event, we numerically computed transition probabilities for a Markov chain model of the play. Ethier and Hoppe credit Peter Griffin with first proposing this model for craps.
The Intelligencer article (pre-publication version here) presents an elegant algebraic solution of the model, based on eigenvalue analysis and even some Galois theory. With the help of Mathematica, the authors obtain a closed-form expression for the distribution of the duration of play as a linear (though notably not convex) combination of four geometric distributions. Readers interested in seeing an application of some attractive mathematics are encouraged to have a look at the full paper.
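For readers who would like to try the numerical Markov chain approach mentioned above, here is a minimal sketch. The state layout (a come-out state plus one state per point) and the function names are our own illustration, not code from the earlier post or from Ethier and Hoppe. It assumes fair dice and the usual rules: the shooter's hand ends only when a 7 is rolled while a point is established.

```python
from fractions import Fraction

# Number of ways (out of 36) to roll each total with two fair dice.
WAYS = {2: 1, 3: 2, 4: 3, 5: 4, 6: 5, 7: 6, 8: 5, 9: 4, 10: 3, 11: 2, 12: 1}
POINTS = (4, 5, 6, 8, 9, 10)

def step(dist):
    """Advance the distribution over non-absorbed states by one roll.

    States: None means a come-out roll; a number means that point is established.
    Probability that sevens out is dropped, so the total shrinks over time.
    """
    new = {}
    for state, prob in dist.items():
        for total, ways in WAYS.items():
            q = prob * Fraction(ways, 36)
            if state is None:
                nxt = total if total in POINTS else None   # establish a point or roll again
            elif total == 7:
                continue                                   # seven out: the hand ends
            elif total == state:
                nxt = None                                 # point made: back to come-out
            else:
                nxt = state                                # point still on
            new[nxt] = new.get(nxt, 0) + q
    return new

def prob_hand_at_least(n):
    """P(hand lasts at least n rolls) = P(no seven-out in the first n - 1 rolls)."""
    dist = {None: Fraction(1)}
    for _ in range(n - 1):
        dist = step(dist)
    return sum(dist.values())

print(float(prob_hand_at_least(154)))
```

Exact fractions are used so that rounding error is not a concern for such a small tail probability; floating-point arithmetic would also work for a quick estimate.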
Submitted by Bill Peterson