Chance News 12
- 1 Screening
- 2 Is the human brain a Bayesian-reasoning machine?
- 3 Superfluous Medical Studies
- 4 Are We Descended from Cannibals?
- 5 Data Mining 101: Finding Subversives with Amazon Wishlists
- 6 Can dogs sniff out cancer?
From the doctors' perspective, early detection has other appealing features: ordering a test is quick and easy, and it has an established billing process--unlike health promotion counseling.
--H. Gilbert Welch
For a related story, see this page.
One thing almost all people know is that it is prudent to be screened for diseases because that will add to their longevity. However, according to H. Gilbert Welch, a medical doctor at Dartmouth College, it isn't necessarily so.
His book, Should I Be Tested For Cancer? Maybe Not And Here's Why [University of California Press, 2004], focuses on screening, which is a particular form of testing, and deals exclusively with cancer as opposed to other afflictions. Screening "means the systematic examination of asymptomatic people to detect and treat disease." His contention is that screening for cancer is inefficient in that very few people who actually have the particular cancer are both discovered and then cured. Moreover, the false positives result in many problems of which the general public is not aware. On the other hand, false negatives of cancer screening are barely mentioned in his book "because we do not biopsy people with negative screening tests." That is, we can't distinguish between a false negative and a rapidly growing cancer that emerges in between screenings.
In a nutshell, randomized clinical screening trials for the cancers discussed in the book--lung cancer, cervical cancer, breast cancer, prostate cancer and colon cancer--have shown that screening provides very little benefit in terms of mortality. Welch argues that with the new, exquisite devices such as CAT scans, MRIs, etc., now available, it is possible to detect cancer earlier, so that 5-year survival rates seem to have improved; patients appear to live longer after diagnosis not because the treatments are better but only because the diagnoses were made earlier. Further, these devices are detecting what he calls "pseudodiseases," cancers which will never progress far enough to cause a problem. Detecting cancers which would never have been discovered years ago, when the technology was lacking, further inflates the 5-year survival rate, a figure of merit which he would like to see abolished because it is so misleading.
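The lead-time argument can be made concrete with a toy simulation. The sketch below is only an illustration under invented assumptions: the exponential survival time with a 4-year mean and the 3-year lead time are hypothetical and are not taken from Welch's book. Screening moves the date of diagnosis earlier but leaves the date of death untouched, yet the 5-year survival figure rises.

```python
import random

def five_year_survival(lead_time_years, n=100_000, seed=1):
    """Share of patients still alive 5 years after diagnosis.

    Hypothetical model: without screening, death follows symptomatic
    diagnosis after an exponentially distributed time (mean 4 years).
    Screening moves the diagnosis earlier by `lead_time_years` but does
    not change when the patient dies.
    """
    rng = random.Random(seed)
    alive = 0
    for _ in range(n):
        survival_after_symptoms = rng.expovariate(1 / 4.0)
        survival_after_diagnosis = survival_after_symptoms + lead_time_years
        if survival_after_diagnosis >= 5.0:
            alive += 1
    return alive / n

no_screening = five_year_survival(lead_time_years=0.0)
with_screening = five_year_survival(lead_time_years=3.0)
print(f"5-year survival without screening: {no_screening:.2f}")
print(f"5-year survival with a 3-year lead time: {with_screening:.2f}")
```

Nobody in this simulated cohort lives a day longer, which is why mortality rather than survival after diagnosis is the outcome the randomized trials compare.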
He argues that the side effects of a false positive are not to be taken lightly. Chapters 2 and 3 are entitled "You may have a cancer 'scare' and face an endless cycle of testing" and "You may receive unnecessary treatment," respectively. Certainly, in bygone days being told that you had cancer was frightening in the extreme. Perhaps not so much in these enlightened times, but a stay in a hospital, especially for an unnecessary procedure, can definitely lead to unpleasant side effects such as infection or worse.
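Why false positives can outnumber true detections follows from Bayes' rule when the disease is rare among the people screened. The figures below are hypothetical and chosen only for illustration; they do not come from the book.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """Probability of disease given a positive screening test (Bayes' rule)."""
    true_pos = prevalence * sensitivity
    false_pos = (1 - prevalence) * (1 - specificity)
    return true_pos / (true_pos + false_pos)

# Hypothetical programme: 5 cancers per 1,000 people screened, a test that
# catches 90% of them and correctly clears 95% of healthy people.
ppv = positive_predictive_value(prevalence=0.005, sensitivity=0.90, specificity=0.95)
print(f"Share of positive results that are real cancers: {ppv:.1%}")
```

With these made-up inputs roughly one positive result in twelve is a real cancer; the rest are the "scares" that Chapter 2 describes.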
Welch points out that there are vested interests in the screening industry: doctors, hospitals, clinics, insurance companies and lay organizations which depend for their existence, financial and otherwise, on keeping Americans fully screened and uninformed about the problems connected with screening. For example, although it has been statistically shown via randomized clinical screening trials that mammography, an unpleasant procedure at best, is not useful for women under 50, the "mammography lobby," made up of manufacturers, radiologists, ideologues and feminists who considered the studies to be a male plot, went ballistic and wanted to substitute emotion for science: The National Cancer Institute reconsidered and by 17 to 1 decided "in favor of recommending mammography to all women in their 40s."
The same sort of situation applies to prostate cancer. The accepted, conventional wisdom in the United States is that screening must be worthwhile because it is self-evident, even though a careful look at the data points in the opposite direction. Watchful waiting, a much-used medical treatment in Europe for prostate cancer, is frequently ridiculed in this country by both laymen and urologists.
Welch fully realizes his thesis--screening for most cancers is, by and large, ineffective and/or harmful--will not go over well because it "flies in the face of medical dogma." His "book is not about what to do if you know you have cancer; it is about informing the decision of whether to look for cancer when you are well." This distinction has been lost on the people I have spoken to. The conventional wisdom that cancer screening must be desirable is a notion that, as far as I can tell from my experience when discussing it with others, is unchallengeable. To be even more cynical, any doctor who doesn't order a screening test for a patient who eventually gets cancer is likely to be sued successfully, so ingrained is the conventional wisdom among the general public and judges alike.
Submitted by Paul Alper
Is the human brain a Bayesian-reasoning machine?
Bayes rules, Jan 5th 2006, The Economist.
The lead article in this week's Science & Technology section of The Economist claims that Bayesian statistics may help to explain how the mind works and even argues that the human mind is a Bayesian one.
The Economist article begins with a summary of Bayes' ideas:
[Bayes' ideas] about the prediction of future events from one or two examples were popular for a while, and have never been fundamentally challenged. But they were eventually overwhelmed by those of the frequentist school, which developed the methods based on sampling from a large population that now dominate the field and are used to predict things as diverse as the outcomes of elections and preferences for chocolate bars.
But, Bayes has recently started a comeback, among computer scientists designing software with human-like intelligence, such as internet search engines and automated 'help wizards'. In many situations, the true answer cannot be determined based on the limited data available, yet common sense suggests at least a reasonable guess. For example,
- how much longer will a 60-year-old man live?
- can you identify a three-dimensional object from a two-dimensional diagram?
- what will be the total gross of a movie that has made $40m at the box office so far?
That has prompted some psychologists to ask if the human brain itself might be a Bayesian-reasoning machine. Accounts of human perception and memory suggest that these systems effectively approximate optimal statistical inference, correctly combining new data with an accurate probabilistic model of the environment. The Economist article suggests that
The Bayesian capacity to draw strong inferences from sparse data could be crucial to the way the mind perceives the world, plans actions, comprehends and learns language, reasons from correlation to causation, and even understands the goals and beliefs of other minds.
It goes on to summarise how Bayesian reasoning works:
The key to successful Bayesian reasoning is not in having an extensive, unbiased sample, which is the eternal worry of frequentists, but rather in having an appropriate “prior”, as it is known to the cognoscenti. This prior is an assumption about the way the world works--in essence, a hypothesis about reality--that can be expressed as a mathematical probability distribution of the frequency with which events of a particular magnitude happen.
It claims that frequentism is thus a more robust approach but it is not well suited to making decisions on the basis of limited information - which is something that people have to do all the time - and this is where Bayesian statistics excels.
The article discusses four prior distributions (Gaussian, Poisson, Erlang and power-law) and an experiment that the scientists, Thomas Griffiths at Brown and Joshua Tenenbaum at MIT, conducted by giving individual nuggets of information to each of the participants in their study and asking them to draw a general conclusion.
The experiment found that people could make accurate predictions about the duration or extent of everyday phenomena from limited data. The quantities predicted are listed below; the authors used publicly available data to identify the true prior distributions, which are shown in brackets.
- the total box-office “gross” takings of a movie, given its takings so far, even though participants were not told how long it had been on release (power-law)
- the number of lines in a poem, given how far into the poem a single line is (power-law)
- the time it takes to bake a cake, given how long it has already been in the oven (a complex and irregular distribution, according to the authors)
- the total length of the term that would be served by an American congressman, given how long he has already been in the House of Representatives (Erlang)
- an individual's lifespan given his current age (approx Gaussian)
- the run-time of a film (approx Gaussian)
- the amount of time spent on hold in a telephone queuing system (traditionally a Poisson, but the experiment's results suggest a power-law distribution, which matches other recent research)
- reigns of Pharaohs (approx Erlang)
People’s prediction functions took on very different shapes in domains characterized by Gaussian, power-law, or Erlang priors, just as expected under the ideal Bayesian analysis.
There were exceptions, such as an inability of the human brain to estimate the length of the reign of an Egyptian Pharaoh in the fourth millennium BC. People consistently overestimated this. The analysis showed that the prior they were applying was an Erlang distribution, which was the correct type. They just got the parameters wrong, presumably through lack of knowledge of political and medical conditions in fourth-millennium BC Egypt.
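To see how the shape of the prior drives the prediction, the sketch below computes the posterior median of a total quantity given the amount observed so far. It assumes that the moment of observation is equally likely to fall anywhere within the total span, so the likelihood of the observed value t is 1/t_total. The prior parameters (a power-law exponent of 1.5, a Gaussian with mean 75 and standard deviation 16) and the numerical grid are illustrative choices made here, not values or code from the paper.

```python
import math

def posterior_median(t, prior, upper, n_steps=400_000):
    """Median of the posterior over t_total, given the observed value t.

    Assumes the observation point is uniform on [0, t_total], so the
    likelihood of t is 1 / t_total for t_total >= t and 0 otherwise.
    The posterior is integrated numerically on [t, upper].
    """
    step = (upper - t) / n_steps
    grid = [t + (i + 0.5) * step for i in range(n_steps)]  # interval midpoints
    weights = [prior(x) / x for x in grid]                  # prior times likelihood
    half = sum(weights) / 2.0
    running = 0.0
    for x, w in zip(grid, weights):
        running += w
        if running >= half:
            return x
    return grid[-1]

def power_law_prior(x, gamma=1.5):            # gamma is an illustrative value
    return x ** -gamma

def gaussian_prior(x, mean=75.0, sd=16.0):    # illustrative lifespan parameters
    return math.exp(-0.5 * ((x - mean) / sd) ** 2)

movie_guesses = {
    takings: posterior_median(takings, power_law_prior, upper=2000 * takings)
    for takings in (10.0, 40.0)
}
lifespan_guesses = {
    age: posterior_median(age, gaussian_prior, upper=150.0)
    for age in (18.0, 60.0, 90.0)
}
for takings, guess in movie_guesses.items():
    print(f"movie at ${takings:.0f}m so far -> predicted total ${guess:.0f}m")
for age, guess in lifespan_guesses.items():
    print(f"person aged {age:.0f} -> predicted lifespan {guess:.0f} years")
```

Under the power-law prior the prediction is a constant multiple of the amount seen so far (the closed-form median is t times 2 to the power 1/gamma), whereas under the Gaussian prior it stays near the mean for young ages and sits only a little above the current age for old ones. Changing `mean` and `sd` shifts the predictions without changing their shape, which is the kind of error described for the Pharaohs.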
The authors claim that
everyday cognitive judgments follow the same optimal statistical principles as perception and memory [which are often explained as optimal statistical inferences, informed by accurate prior probabilities], and reveal a close correspondence between people’s implicit probabilistic models and the statistics of the world.
How the priors are themselves constructed in the mind has yet to be investigated in detail. Obviously they are learned by experience, but the exact process is not properly understood. The Economist article finishes with a cautionary note for both Bayesians and frequentists
Things don't always go smoothly with a Bayesian approach. Sometimes the process goes further and further off-track and the authors speculate that that might explain the emergence of superstitious behaviour, with an accidental correlation or two being misinterpreted by the brain as causal. A frequentist way of doing things would reduce the risk of that happening. But by the time the frequentist had enough data to draw a conclusion, he might already be dead.
- Bayes rules, Jan 5th 2006, The Economist. - the full article is worth reading.
- Optimal predictions in everyday cognition, Thomas L. Griffiths, Department of Cognitive and Linguistic Sciences, Brown University & Joshua B. Tenenbaum, Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology.
- The paper shows the empirical distributions for each of the variables being estimated, along with more details about the experiment.
Submitted by John Gavin.
Superfluous Medical Studies
When a patient volunteers for a randomized clinical trial, he or she strikes an implicit bargain with the researcher. The patient may benefit, but even if he does not, others will. That is because the study will produce new knowledge. But if the question is already settled, then the patient's sacrifice and altruism are for naught.
Steven N. Goodman, Johns Hopkins University biostatistician
Clinical trials have been the bread and butter for many a statistician. A frequent tagline to such studies is, "More research needs to be done," which implies further employment for statisticians. If the results are overall underwhelming, perhaps the procedure/medication works better on women, or Hispanics, or the elderly or some other subgroup, and so the studies proliferate. David Brown's article in the Washington Post of January 2, 2006 looks at several instances where, on the contrary, the evidence is so convincing that no more studies need or should be done. As he puts it, "What part of 'yes' don't doctors understand?" Specifically, he cites the use of aprotinin in heart surgery, SIDS (sudden infant death syndrome) prevention and the use of streptokinase to treat heart attacks.
According to Brown, there have been 64 studies of aprotinin since 1987 but by the 12th in 1992 it was clear that aprotinin reduced bleeding. "On average, each new paper listed only one-fifth of the previous studies in its references." Although "Being given a placebo long after aprotinin's value had been proved probably did not cost lives, the same cannot be said of medicine's failure to pay attention to studies of infant sleep position."
A child health expert alleges that "if researchers had pooled the results of the oldest studies [40 studies back to 1965] and analyzed them, they might have gotten a big hint by 1970 that putting babies to sleep on their stomachs raised the risk of SIDS" sevenfold. By the 1990s, "at least 50,000 excess [SIDS] deaths were attributable to harmful health advice." With regard to streptokinase, it lowered death rates by 25%; "that conclusion and the percentage, did not budge while 34,542 more patients were enrolled in 25 more trials of streptokinase over the next 15 years" from 1973 to 1988.
In order to rectify this excessive zeal on the part of researchers, "The Lancet, a British journal, announced last summer that it will require that authors submitting papers show they performed a meta-analysis of previous research or consulted an existing one." Goodman claims that "In 10 years we are going to look back on this time, and we won't believe this wasn't done as a matter of course."
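The idea of pooling earlier trials as each new one arrives can be sketched as a cumulative meta-analysis. The example below uses a fixed-effect, inverse-variance pooling of log odds ratios, which is one common textbook approach and not necessarily the method used in the studies Brown cites. The four trials are invented for illustration.

```python
import math

# Hypothetical trials in publication order:
# (events_treated, n_treated, events_control, n_control)
trials = [
    (12, 100, 18, 100),
    (30, 250, 41, 250),
    (9, 80, 13, 80),
    (55, 500, 74, 500),
]

def log_odds_ratio(a, n1, c, n2):
    """Log odds ratio and its variance for one 2x2 table."""
    b, d = n1 - a, n2 - c
    return math.log((a * d) / (b * c)), 1 / a + 1 / b + 1 / c + 1 / d

def pooled_estimate(tables):
    """Fixed-effect, inverse-variance pooled odds ratio with a 95% interval."""
    effects = [log_odds_ratio(*t) for t in tables]
    weights = [1 / var for _, var in effects]
    pooled = sum(w * est for w, (est, _) in zip(weights, effects)) / sum(weights)
    se = math.sqrt(1 / sum(weights))
    return tuple(math.exp(x) for x in (pooled, pooled - 1.96 * se, pooled + 1.96 * se))

# Re-pool the evidence after each additional trial.
cumulative = [pooled_estimate(trials[:k]) for k in range(1, len(trials) + 1)]
for k, (odds_ratio, low, high) in enumerate(cumulative, start=1):
    print(f"after {k} trial(s): OR = {odds_ratio:.2f}, 95% CI {low:.2f} to {high:.2f}")
```

In this made-up sequence the first trial alone is inconclusive because its interval includes 1, while the pooled interval after four trials excludes 1. A researcher who checked the running total before enrolling new patients would see at which point further placebo-controlled trials stopped adding information.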
Submitted by Paul Alper
Are We Descended from Cannibals?
Are We Descended from Cannibals? Michael Balter, ScienceNOW Daily News, 6 January 2006.
A study published 2 years ago in Science (25 April 2003, p. 640), led by John Collinge of University College London (UCL), claimed that modern humans harbor a gene that allowed our ancestors to engage in cannibalism. The gene, called PRNP, codes for prions, thought to be responsible for several neurodegenerative diseases, including Creutzfeldt-Jakob Disease (CJD) and kuru. Individuals with certain variations in this gene are more resistant to those diseases.
The claim was based on a sample of 1,000 people from populations around the world and suggested that variations on this gene had survived for 500,000 years. The researchers hypothesized that the gene survived due to widespread cannibalistic practices that had made early humans susceptible to prion diseases.
But a recent second paper in Genome Research, by Jaume Bertranpetit and his coworkers at the Pompeu Fabra University in Barcelona, contradicts this result. It rejects the model of selection and claims that the Science paper was statistically skewed because its study ignored low-frequency variations of the gene, an error known as ascertainment bias. This second paper used a sample of 174 people from around the world.
Lead author of the Science paper, Simon Mead of UCL, stands by his original claim and argues that his paper's conclusions were based on several different lines of evidence that trump criticisms of ascertainment bias.
- Is a sample size of 1,000 people sufficient to extrapolate to the world population over the last 500,000 years?
- Is the much smaller sample size of 174 in the second paper justifiable?
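One way to start thinking about these questions is to ask how likely a sample of a given size is to contain a low-frequency variant at all. The sketch below assumes chromosomes are drawn independently from a single well-mixed population, which a worldwide sample certainly is not, and the allele frequencies are hypothetical. It says nothing about how either paper chose which variants to genotype.

```python
def prob_variant_seen(allele_freq, n_people):
    """Chance that an allele of the given frequency appears at least once
    among the 2 * n_people chromosomes of a random diploid sample."""
    return 1 - (1 - allele_freq) ** (2 * n_people)

detection = {
    (n, freq): prob_variant_seen(freq, n)
    for n in (174, 1000)
    for freq in (0.001, 0.01)
}
for (n, freq), p in detection.items():
    print(f"n = {n:4d}, allele frequency {freq}: P(seen at least once) = {p:.3f}")
```

Under these simplifying assumptions a 1% variant is very likely to turn up in either sample, while a 0.1% variant is more likely missed than found among 174 people.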
Incidentally, the Wikipedia link above to prions warns 'This article has been identified as possibly containing errors', referring to a study in the journal Nature comparing Wikipedia to Britannica. This comparison was the subject of a previous Chance News item.
Submitted by John Gavin.
Data Mining 101: Finding Subversives with Amazon Wishlists
Data Mining 101: Finding Subversives with Amazon Wishlists, Tom Owad. applefritter.com, January 4, 2006.
This article explains a novel source for data mining, the information contained in the popular Amazon wishlists, and discusses the political implications of its use. It is not written from a statistical point of view but it offers an interesting case study in data-mining and exploratory data analysis (EDA).
The author uses readily-available, open-source software to access over 260,000 wishlists from U.S. citizens. He says
All the tools used in this project are standard and free. The services, likewise, are all free. The technical skills required to implement this project are well within the abilities of anybody who has done any programming.
Owad suggests that based on this information, it is possible to compile a list of people who expressed an interest in certain books. The author offers a sample of the list he compiled and invites everyone to make up their own list and explore the data. As an example he asks
What books are most dangerous? Send it to the FBI. I'm sure they'll appreciate your help in fighting terrorism.
Owad offers some examples of 'subversive' authors, such as Michael Moore (the fringe left) or Rush Limbaugh (the fringe right).
As part of his EDA, he impressively converted city and state information on each person to latitude and longitude coordinates, using the free on-line Ontok Geocoder service, and then mapped those locations using Google's Maps API. For example, you could see the locations of all people who expressed an interest in a certain book and live in a certain city or even a certain street. Two interactive examples are offered which plot all of the locations on a satellite image of the United States that can be zoomed in to house level.
There are many comments on this article posted on the same webpage.
Submitted by John Gavin.
Can dogs sniff out cancer?
McCulloch M, Jezierski T, Broffman M, Hubbard A, Kirk Turner, Janecki T. (2006) Diagnostic Accuracy of Canine Scent Detection in Early- and Late-Stage Lung and Breast Cancers. Integrative Cancer Therapies 2006: 5(1); 1-10. Early release in PDF format
Dogs have an unusually sensitive sense of smell and might be able to diagnose cancer by sniffing breath samples from human patients. This is rather intriguing, since dogs have already been trained to locate explosives, cadavers, drugs, and so forth.
The researchers collected breath samples from 55 patients with lung cancer, 31 patients with breast cancer, and 83 volunteers with no prior cancer history.
Eligible patients were men and women older than 18 years with a very recent biopsy-confirmed conventional diagnosis of lung or breast cancer. We specifically requested that recruitment centers refer patients as soon as possible following definitive diagnosis so that breath sampling would not interfere with or delay planned conventional treatment. As we suspected that chemotherapy treatment would change the exhaled chemicals in cancer patients, we sought patients who had not yet undergone chemotherapy treatment. As we also suspected that patients with more advanced disease, and thus larger tumors, might be exhaling higher concentrations of the chemicals associated with cancer cells and would therefore be more easily identified by the dogs, we sought patients with any stage disease.
The collection of breath samples was quite simple.
For breath sampling, we obtained a cylindrical polypropylene organic vapor sampling tube (Defencetek, Pretoria, South Africa). Each tube is open at either end, is 6 inches long, has an outer diameter of 1 inch, has an inner diameter of 0.75 inches, and has removable end caps. A removable 2-inch-long insert of silicone oil-coated polypropylene “wool” captures volatile organic compounds in exhaled breath as breath passes through the tube. To collect breath samples, we asked donors to exhale 3 to 5 times through the tube. We then fitted the tubes with their end caps and sealed them in ordinary grocery store Ziplock-style bags at room temperature between the time of breath sampling and presentation to the dogs.
Each patient and control contributed multiple breath samples to the study, ranging from 4 to 18 samples per person.
The dogs had to be trained to recognize cancer samples, and in the training sessions, the trainer had to be unblinded to the location of the cancer sample, so they could reward the dogs when they identified the cancer samples correctly. The dogs were trained to indicate a positive result by sitting down by the canister that had the cancer breath sample.
During phase 1 of training, the location of the cancer breath sample was known by both experimenter and trainer. One station contained a cancer breath sample, and the remaining 4 stations contained blank sample tubes that had not been used in any breath sampling. To encourage the dogs to seek out the exhaled chemicals associated with cancer, we placed a piece of dog food in the station with the cancer breath sample and covered the container with a piece of paper so the food would not be visible.
The second phase of training still used four blank canisters and food rewards in the cancer breath sample canister.
During phase 2 of training, only the experimenter was aware of the location of the cancer breath sample and apart from encouraging the dog with encouraging phrases such as “go to work,” gave no “sit” or other verbal commands to the dog. Clicker signal by the experimenter and subsequent food reward and praise by the trainer were given only after the dog correctly indicated on the cancer breath sample. When the dog indicated incorrectly on a control, the experimenter would not signal with the clicker and the handler would remain silent, not give the dog any praise reward, and mildly rebuke the dog by saying “no.” Samples used in phases 1 and 2 (contaminated with food scent) were not used again.
The third phase of training was similar to the second, except there were no food rewards in the canister with the cancer breath sample. After the dogs had performed sufficiently well during the training session, they were evaluated in a single blind phase.
During the single-blinded canine scent-testing experiment, using samples previously used in phase 3 of training, the level of challenge to the dogs was increased by placing a cancer breath sample in 1 station and control subject breath samples in the remaining 4 stations. Thus, dogs now had to distinguish cancer patient breath samples from those of healthy controls. Furthermore, the handler was blinded to the location and status of patient and control breath samples. Although the experimenter did not know the location and status of patient and control breath samples during the single-blinded experiments, the possibility of the experimenter giving the dogs cues was minimized by positioning the experimenter in an adjacent room, behind an opaque curtain that almost completely covered the doorway between the training and observation rooms.
This was followed by a double blind phase, the phase used to evaluate sensitivity and specificity.
We designed our double-blinded experiment so that each dog would have the opportunity to sniff breath samples from each subject and each control. During the entire double-blinded testing phase, all breath samples sniffed by dogs, for both cases and controls, were from completely different subjects not previously encountered by the dogs during training or single-blinded testing. Furthermore, all of these breath samples used during double-blinded testing, for both cases and controls, contribute to the overall results reported in Table 3. For each trial, we used a random number table to determine the location of the sample being tested in the lineup.
All other methods were identical to the single-blinded testing phase, except that we now (1) placed the target breath sample of interest, whether from patient or control, within the lineup along with 4 other controls and (2) blinded both the experimenters and dog handlers to the status of that target sample in the lineup. Whereas in the single-blinded experiments only the dog handler was blinded to knowledge of the target sample, in the double-blinded experiments, both handler and experimenter were blinded to ensure that neither experimenters nor handlers could be giving any clues to the dogs. Since the experimenters now no longer knew the status of the target breath sample, they did not activate the clicker device after a sitting indication by the dog, and therefore the handler did not reward the dog with any food. After being given the opportunity to sniff and indicate on samples, the dog was simply led out of the room. Only after leaving the training room was the dog acknowledged with the phrase “good work!” During double-blinded testing, each tube was used a median of 20 times (mean = 32.35, SD = 24.46; range, 4-99).
Blinding is very important in a trial like this because of the "Clever Hans" effect, which is the ability of animals to pick up subtle and even subconscious nonverbal cues from the people around them.
In the trials involving lung cancer patients, 708 of the 712 control canisters were properly identified, and 564 of the 574 cancer canisters were identified. In the trials involving breast cancer patients, 260 of the 275 control canisters were properly identified and 110 of the 116 cancer canisters were identified.
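These counts translate directly into raw sensitivity (the share of cancer canisters identified) and specificity (the share of control canisters identified). The sketch below simply divides the counts quoted above; the results are unadjusted proportions and may differ from model-based estimates reported in the paper.

```python
# Counts quoted above: (correctly identified, total) for each canister type.
counts = {
    "lung":   {"cancer": (564, 574), "control": (708, 712)},
    "breast": {"cancer": (110, 116), "control": (260, 275)},
}

def raw_rates(group):
    """Unadjusted sensitivity and specificity from correct/total counts."""
    cancer_right, cancer_total = group["cancer"]
    control_right, control_total = group["control"]
    return cancer_right / cancer_total, control_right / control_total

rates = {name: raw_rates(group) for name, group in counts.items()}
for name, (sensitivity, specificity) in rates.items():
    print(f"{name}: sensitivity = {sensitivity:.3f}, specificity = {specificity:.3f}")
```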
It is unclear how these results were tabulated. One possible method would be the following: If the dog did not sit down at any canister, and the fifth canister was a control breath sample, that trial was labeled a true negative. If the dog sat down at one of the four control canisters or hesitated, that trial was labeled a false positive or false negative depending on the contents of the fifth canister. Another interpretation would be that if the dog sat down at any control canister, that was considered a false positive for that canister, and failure to sit down at any control canister was considered a true negative.
In other words, is it possible for a dog to make only one mistake in a trial or up to five? The wording of the paper seems to favor the latter interpretation:
The dogs’ response to each of the 5 samples sniffed was included in our analysis; dogs were allowed the opportunity to visit each sample station and thus could have potentially indicated every one of the samples in a trial, although in our experiments, this never occurred. Dog handlers did not try to prevent dogs from visiting any individual station. Therefore, since each individual sample station was considered as a unit of analysis, the use of 4 control subject breath samples along with a cancer patient sample in each experimental trial would not change sensitivity or specificity.
On the other hand, the number of control samples during the double blind phase was 987 compared to 690 cancer samples, and it is hard to reconcile these numbers with the fact that at least four control samples were tested in each trial. The ratio of controls to cancers should be at least four to one and probably closer to ten to one.
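The arithmetic behind this puzzle can be checked in a few lines. The sketch assumes each lineup holds four control samples plus one target sample that is either a cancer or a control sample, as the double-blinded protocol describes. The share of trials with a cancer target is not reported in the passages quoted here, so it is treated as a free parameter.

```python
controls = 712 + 275   # lung + breast control canisters reported
cancers = 574 + 116    # lung + breast cancer canisters reported
observed_ratio = controls / cancers

def expected_ratio(share_cancer_targets, other_controls=4):
    """Controls per cancer sample if every station in the lineup is counted."""
    controls_per_trial = other_controls + (1 - share_cancer_targets)
    cancers_per_trial = share_cancer_targets
    return controls_per_trial / cancers_per_trial

print(f"observed ratio of controls to cancers: {observed_ratio:.2f}")
for share in (1.0, 0.5):
    print(f"cancer target in {share:.0%} of trials -> expected ratio {expected_ratio(share):.1f}")

# If only the target sample were tallied in each trial, the totals would
# imply this share of trials had a cancer target.
implied_share = cancers / (controls + cancers)
print(f"implied share of cancer targets if only targets are counted: {implied_share:.2f}")
```

The observed ratio of about 1.4 falls well short of what counting every station would give. One reading consistent with the totals is that only the target sample was tallied in each trial, but that is a guess and not something the paper states.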
Because of the number of tests performed, individual patients were used multiple times in the study and even individual breathing tubes were re-used many times.
During double-blinded testing, each tube was used a median of 20 times (mean = 32.35, SD = 24.46; range, 4-99).
To account for this, the researchers used "general estimating equations (GEE) random effects linear regression, with standard errors adjusted for clustering on donor." The researchers re-analyzed the data including only the first dog-donor combination in each trial of the double blind phase, and found comparable results.
The GEE estimates were also adjusted for current smoking status since there was more smoking among the lung cancer volunteers than the control volunteers.
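Why clustering matters can be seen from a back-of-the-envelope design-effect calculation. This is not the GEE model the authors fitted; it is the simpler Kish approximation for equally sized clusters, and both the number of trials per donor and the intraclass correlation (ICC) values below are hypothetical.

```python
def design_effect(mean_cluster_size, icc):
    """Kish design effect for equally sized clusters."""
    return 1 + (mean_cluster_size - 1) * icc

def effective_sample_size(n_observations, mean_cluster_size, icc):
    """Number of independent observations carrying the same information."""
    return n_observations / design_effect(mean_cluster_size, icc)

n_cancer_trials = 574 + 116   # cancer canisters reported above
trials_per_donor = 30         # hypothetical average cluster size
for icc in (0.0, 0.05, 0.2):
    n_eff = effective_sample_size(n_cancer_trials, trials_per_donor, icc)
    print(f"ICC = {icc:.2f}: effective sample size about {n_eff:.0f}")
```

Even a modest correlation among repeated trials on the same donor shrinks the effective sample size considerably, which is why standard errors that ignore the clustering would be too small.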
This research used a case-control design to estimate sensitivity and specificity, which is acceptable for a "proof of concept" study, but the authors do discuss the problem of spectrum bias in this research.
However, our specificity may be overestimated because we used only healthy controls (rather than a broad spectrum of subjects that included, for example, those with bronchitis or emphysema as controls for lung cancer or those with fibrocystic breast disease or mastitis as controls for breast cancer). These questions could be better understood by further study in a prospective cohort design that included both cases and controls representing the full spectrum of disease severity seen in the general population.
There are additional limitations to this research which the authors discuss at the end of the article.
1. This is a fascinating study, but the results do need to be replicated in a more rigorous design. What changes would you make to the research design?
2. Does the fact that a cancer sample never appears more than once among the five canisters cause any bias in the estimates of sensitivity or specificity?
3. The GEE model accounts for a cluster effect by donor. Are there other cluster effects that could/should be modeled in this analysis?
Submitted by Steve Simon.