# Chance News 22

## Quotation

I think you're begging the question, said Haydock, and I can see looming ahead one of those terrible exercises in probability where six men have white hats and six men have black hats and you have to work it out by mathematics how likely it is that the hats will get mixed up and in what proportion. If you start thinking about things like that, you would go round the bend. Let me assure you of that!

Agatha Christie
The Mirror Crack'd

From the Probability Web Quotations

## Forsooth

The following Forsooths are from the November 2006 RSS NEWS.

At St John's Wood station alone, the number of CCTV cameras has jumped from 20 to 57, an increase of 300 per cent.

Metro

3 May 2006

Now 78% of female veterinary medicine students are women, almost a complete turn-around from the previous situation.

The Herald (Glasgow)

4 May 2006

Drought to ravage half the world within 100 years

Half the world's surface will be gripped by drought by the end of the century, the Met Office said yesterday.

Times online

6 October 2006

## Estimating the diversity of dinosaurs

Proceedings of the National Academy of Sciences
Published online before print September 5, 2006
Steve C. Wang and Peter Dodson

Fossil hunters told: Dig deeper
Tom Avril

Steve Wang is a statistician at Swarthmore College and Peter Dodson is a paleontologist at the University of Pennsylvania. Their study was widely reported in the media. You can find references to the media coverage and comments by Steve here.

In their paper the authors provided the following description of their results. A few definitions might be helpful: genera (singular: genus), groups that collect like species together; nonavian, not including birds; fossiliferous, containing fossils; rock outcrop, the part of a rock formation that appears above the surface of the surrounding land.

Despite current interest in estimating the diversity of fossil and extant groups, little effort has been devoted to estimating the diversity of dinosaurs. Here we estimate the diversity of nonavian dinosaurs at 1,850 genera, including those that remain to be discovered. With 527 genera currently described, at least 71% of dinosaur genera thus remain unknown. Although known diversity declined in the last stage of the Cretaceous, estimated diversity was steady, suggesting that dinosaurs as a whole were not in decline in the 10 million years before their ultimate extinction. We also show that known diversity is biased by the availability of fossiliferous rock outcrop. Finally, by using a logistic model, we predict that 75% of discoverable genera will be known within 60-100 years and 90% within 100-140 years. Because of nonrandom factors affecting the process of fossil discovery (which preclude the possibility of computing realistic confidence bounds), our estimate of diversity is likely to be a lower bound.
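The 71% figure follows from the two counts quoted above, as a quick check shows:

$1 - \frac{527}{1850} \approx 0.715$

That is, about 1,323 of the estimated 1,850 genera are still undescribed.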

In this problem we have a sample of dinosaurs that lived on the earth. These dinosaurs are classified into groups called genera. We can count the number of specimens of each genus in our sample. From this we want to estimate the total number of dinosaur genera that have roamed the earth, including those not yet discovered. Many different methods for doing this have been developed, and the authors of this study use one of the newer methods. We have discussed other examples of this problem in previous issues of Chance News, and it might help to review these briefly.

One of the first statistical studies of species was carried out by R.A. Fisher and illustrated in terms of determining the number of species of Malayan butterflies. His method is described in the paper 'The Relation Between the Number of Species and the Number of Individuals in a Random Sample of an Animal Population', R.A. Fisher, A. Steven Corbet and C.B. Williams, The Journal of Animal Ecology, Vol. 12, No. 1, pp. 42-58 (available from JSTOR).

Corbet provided the following data from his sampling of the Malayan butterflies:

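| n | observed | expected number |
|---:|---:|---:|
| 1 | 118 | 156.44 |
| 2 | 74 | 74.52 |
| 3 | 44 | 47.33 |
| 4 | 24 | 33.82 |
| 5 | 29 | 25.77 |
| 6 | 22 | 20.46 |
| 7 | 20 | 16.71 |
| 8 | 19 | 13.93 |
| 9 | 20 | 11.79 |
| 10 | 15 | 10.11 |
| 11 | 12 | 8.76 |
| 12 | 14 | 7.65 |
| 13 | 6 | 6.73 |
| 14 | 12 | 5.95 |
| 15 | 6 | 5.29 |
| 16 | 9 | 4.73 |
| 17 | 9 | 4.24 |
| 18 | 6 | 3.81 |
| 19 | 10 | 3.44 |
| 20 | 10 | 3.11 |
| 21 | 11 | 2.83 |
| 22 | 5 | 2.57 |
| 23 | 3 | 2.34 |
| 24 | 3 | 2.14 |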

In this table n is the number of times a species occurs in the sample. The second column gives the number of species that occur n times in the sample. So we see that 118 species occurred once in the sample, 74 twice and 44 three times. The third column gives the expected number of species that occur n times using Fisher's model, which we will explain next. Thus the expected numbers for n = 1, 2, 3 are 156.44, 74.52 and 47.33.
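As a quick check on the table, the observed column can be tallied in a few lines of Python. This is only a sketch: the list is copied from the observed column above, and the variable names are ours, not the authors'.

```python
# observed[i] is the number of species seen exactly i + 1 times
# in the sample (n = 1..24), copied from the table above.
observed = [118, 74, 44, 24, 29, 22, 20, 19, 20, 15, 12, 14,
            6, 12, 6, 9, 9, 6, 10, 10, 11, 5, 3, 3]

# Number of distinct species tabulated: one per entry counted in each row.
species_total = sum(observed)

# Number of individual butterflies: a species seen n times contributes n.
individuals_total = sum(n * count for n, count in enumerate(observed, start=1))

print(species_total, individuals_total)  # prints: 501 3306
```

The two totals are the number of species tabulated and the number of individuals they account for.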

Fisher's model assumes that the number of times a species occurs in a sample has a Poisson distribution:

$e^{-m}\frac{m^n}{n!}$

For a given species, m is the expected number of individuals of that species in a sample. Since m can be expected to vary among species, Fisher treats it as a random variable. He chooses a distribution for m that leads to the following estimate for the expected number of species that appear n times in a random sample:

$\frac{\alpha}{n} x^n$

Here $\alpha$ and x are parameters. If S is the number of species observed and N the sample size, $\alpha$ and x can be determined as the values that satisfy the following two equations:

$S = -\alpha \log (1-x), \quad N = \alpha x/(1-x)$
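One way to solve this pair, offered as a sketch rather than as Fisher's own procedure, is to divide the first equation by the second so that $\alpha$ cancels:

$\frac{S}{N} = -\frac{(1-x)\log(1-x)}{x}$

The right-hand side decreases from 1 to 0 as x runs from 0 to 1, so for any sample with S < N there is a single solution x, which can be found numerically. The second equation then gives $\alpha = N(1-x)/x$.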

From our data we find that S = 501 and N = 3306. Using these values we find that x = 0.95268 and $\alpha = 164.21$. These do not agree with the values obtained by the authors, but we believe our values to be correct.
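The calculation can be reproduced numerically. The sketch below is an illustration under our own assumptions, not the method used in the original paper. It eliminates $\alpha$ by dividing the two equations, solves the resulting equation for x by bisection, and then evaluates $\frac{\alpha}{n} x^n$ for n = 1, 2, 3. The function and variable names are ours.

```python
import math

S = 501    # number of species observed (from the text)
N = 3306   # sample size (from the text)

def ratio(x):
    """Value of S/N implied by the two equations once alpha is eliminated."""
    return -(1.0 - x) * math.log(1.0 - x) / x

# ratio(x) decreases from 1 toward 0 on (0, 1), so bisection brackets the root.
lo, hi = 1e-9, 1.0 - 1e-12
for _ in range(200):
    mid = 0.5 * (lo + hi)
    if ratio(mid) > S / N:
        lo = mid   # implied ratio still too large: move right
    else:
        hi = mid   # implied ratio too small: move left

x = 0.5 * (lo + hi)
alpha = N * (1.0 - x) / x   # from N = alpha * x / (1 - x)

# Expected number of species seen n times under the model, for n = 1, 2, 3.
expected = [alpha * x**n / n for n in range(1, 4)]

print(round(x, 5), round(alpha, 2), [round(e, 2) for e in expected])
```

The printed values should land close to the x and $\alpha$ quoted above and to the first three entries of the expected column in the table, up to rounding.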

Fisher was interested in finding a distribution that could approximate the distribution of the number of times a species occurs in a sample, and the distribution that he proposed has been widely used in species studies. Another interesting question is whether one can estimate the total number of species of Malayan butterflies from a sample. This is the kind of question that Wang and Dodson addressed for dinosaur genera in their study. Among the first to tackle this problem were I. J. Good and G. H. Toulmin in their paper "The Number of New Species, and the Increase in Population Coverage, when a Sample is Increased", Biometrika, Vol. 43 (June 1956), pp. 45-63.

To be continued