Chance News 37: Difference between revisions

From ChanceWiki
m (new article ==Word frequency==)
==Word frequency==
[http://www.wordcount.org/main.php WordCount] is an interactive website that
offers a neat visualisation of the 86,800 most frequently used English words,
ranked in order of frequency.
For example, relative frequency is proportional to font size,
to emphasise each word's level of usage.
Some sample word rankings (taken from the top of [http://chance.dartmouth.edu/chancewiki/index.php/Chance_News_36 Chance News 36]) are:
statistics (3,010), teaching (1,134), number (171), numbers (894), jackpot (25,498).
The website author, Jonathan Harris, explains:
<blockquote>
[http://www.wordcount.org/about.html WordCount] was designed with a minimalist aesthetic, to let the information speak for itself.
...
The goal is for the user to feel embedded in the language,
sifting through words like an archaeologist through sand, awaiting the unexpected find.
Observing closely ranked words tells us a great deal about our culture.
For instance, 'God' is one word from 'began',
two words from 'start' and six words from 'war'.
Another sequence is "america ensure oil opportunity".
...
As ever, the more one explores, the more is revealed.
</blockquote>
===Questions===
* WordCount includes all words that occur at least twice in the [http://www.natcorp.ox.ac.uk/ <em>British National Corpus (BNC)</em>,] a 100 million word collection; yet, WordCount only contains 86,800 words.
** What can you infer from this about the distribution of word frequencies?
** WordCount is a full enumeration of the BNC and the BNC is such a large sample, so does that mean that the estimates of word rankings are accurate? For example, six words chosen from [http://chance.dartmouth.edu/chancewiki/index.php/Chance_News_36 Chance News 36] included the word 'forsooth', which seems to be [http://sara.natcorp.ox.ac.uk/cgi-bin/saraWeb?qy=Forsooth in the BNC] five times but is not in [http://www.wordcount.org/main.php WordCount.] (The other five words and their rankings are listed above.)
** What other information would you like to have to investigate variations in word frequencies and where could you start looking?
* The layout presents the data (word counts) as a density distribution, with a lookup for individual word rankings. It doesn’t display word frequencies or percentiles. Would they be more helpful or informative than rankings?
* What is your estimate of the rank of words like 'chance' and 'probability' or 'statistician' and 'mathematician' in the BNC?
** Have you any confidence in your prediction? You might want to consult the recent Plus Magazine article [http://plus.maths.org/issue46/risk/index.html Understanding uncertainty: The Premier League,] by Mike Pearson and David Spiegelhalter, before producing a confidence interval for rankings.
* The author plans to apply the technique to any text, such as a website or the whole internet. What standard statistical techniques can you think of to highlight differences in word count distributions across different sources? Do any emphasise a visual, interactive approach to data analysis, as is done with WordCount? If not, why not?
* Do you agree with the statement <em>observing closely ranked words tells us a great deal about our culture</em>?
** Find 'chance' in the rankings. How likely is it that there is a 'logical' link between 'chance' and its nearest neighbours? If you now expand the neighbourhood around 'chance', how quickly do you think your chances of finding a 'logical' link increase? Would it matter if 'chance' could be substituted by conceptually equivalent words, like 'odds' or 'probability'?
*** For example, The Washington Post's Sunday humor/wordplay contest ran a challenge to [http://www.washingtonpost.com/wp-dyn/content/article/2005/08/11/AR2005081100915.html write a four-line poem] incorporating any four or more successive WordCount words in order (but not necessarily adjacently). Can you use those results as an example of a logical link and how likely it is to occur?
*** See [http://chance.dartmouth.edu/chancewiki/index.php/Chance_News_32#A_coincidence.3F A coincidence?,] from [http://chance.dartmouth.edu/chancewiki/index.php/Chance_News_32 Chance News 32] for a related article.
* While [http://www.wordcount.org/main.php WordCount] tracks the way we use language, [http://www.wordcount.org/querycount.php QueryCount] is a related website that tracks the way WordCount is used, by rearranging its word rankings based on the number of times each word has been queried in WordCount. So QueryCount contains statistics of search usage with WordCount.
** What differences do you expect to find between these two distributions, if any? For example, how much more likely is it that your first name has a higher ranking in QueryCount than in WordCount, on the premise that people are more likely to look up their own name (contributing to its QueryCount ranking) than that word is likely to occur in the BNC (contributing to its WordCount ranking), which changes more slowly over time? Can you infer anything about how representative the BNC is of English language usage on the internet?
** How likely are you to find a word in [http://www.wordcount.org/main.php WordCount] that is not in [http://www.wordcount.org/querycount.php QueryCount?]
===Further reading===
* Here are two web pages ([http://www.number27.org/projects/wordcount/conspiracy.html Conspiracy Game] and [http://www.wordcount.org/namegame/index.html 1970s Movie Characters]) devoted to finding neighbouring words that are 'connected' (it is claimed).
* Aside: The [http://www.natcorp.ox.ac.uk/ <em>BNC</em>] is a potentially useful [http://sara.natcorp.ox.ac.uk/cgi-bin/saraWeb?qy=statistician source of quotes] about your favourite topic.
Submitted by John Gavin.

Revision as of 14:26, 18 May 2008

Quotations

How dare we speak of the laws of chance? Is not chance the antithesis of all law?

Boethius (ca. 480-525)

If we can increase IQ by three to four points in the whole population, we can have fewer children at the low end and more Einsteins at the high end.

Dr. Michael Kramer, a professor of pediatrics at McGill University, the lead author of a study in the Archives of General Psychiatry involving about 17,000 children in Belarus.

Human milk also contains cholesterol, while formula doesn't. We learned to fear cholesterol and yet cholesterol is very important for brain tissue, it's very important for nerve tissue. That's why human milk is a better nutrient to support brain growth.

Dr. Ruth Lawrence, a member of the American Academy of Pediatrics executive committee section on breast-feeding.

Submitted by Paul Alper

Breastfeeding

Except possibly for the manufacturers of formula milk, most people believe that breast milk is superior to infant formula with regard to the physical health of the child. According to "Breastfeeding and Child Cognitive Development," by Michael Kramer, et al., Archives of General Psychiatry, Vol. 65, (No. 5), May 2008, 578-584, breastfeeding is also superior for the mental development of the child.

The numbers are impressive: 17 authors, 17,046 infants enrolled, of whom 13,889 were followed up at age 6.5 years, at which time, according to HealthDay Reporter of May 5, 2008:

Those children who were exclusively breast-fed scored, on average, 7.5 points higher in verbal intelligence, 2.9 points higher in nonverbal intelligence, and 5.9 points higher in overall intelligence.

Nevertheless, on a closer look at the journal article, some of the numbers fade. Two of the three measures of intelligence just mentioned turn out not to be statistically significant. All mothers in the control group also breastfed their infants; the only difference was that they breastfed for fewer months than those in the (encouraged to breastfeed) treatment group. "[B]linding of the pediatricians [who administered the IQ test] to the experimental vs control group assignment was infeasible."

Discussion

1. Not so long ago, breast milk was not considered superior to infant formula. Make a case for the superiority of infant formula.

2. This study was carried out entirely in Belarus where, according to the article, "> 95% of mothers in Belarus" choose to initiate breastfeeding. If the percentage in the U.S. is vastly different, how does this affect the generality of the conclusions?

3. The treatment group was "encouraged" to continue breastfeeding; the control group was neither encouraged nor discouraged. At the end of 12 months, those still breastfeeding were 19.7% and 11.4%, respectively. Ask a friendly librarian to find the comparison within the U.S. after 12 months.

4. IQ and intelligence are often conflated and elided. Use a search engine or that friendly librarian to find out how many different kinds of IQ tests there are. In addition, determine the strengths and weaknesses of the WASI test, the one used in the breastfeeding study.

5. The children in the study had their IQ measured via WASI at the age of 6.5 years. Did you ever have an IQ test? How old were you? Were you or your parents informed of your score? Were you ever retested? If so, did you go up or down? Do you feel that as far as intelligence is concerned, an individual is completely determined by the age of 6.5?

6. Although intelligence testing was originally proposed as a means of helping those who need help, IQ testing is often used as a form of rank ordering because of its precision and presumed accuracy. Richard Feynman, generally conceded to be the most prominent physicist of the second half of the 20th century, had an IQ one point lower than his sister's. Do a literature search to determine his IQ. Likewise, do a literature search to determine what his sister did with that one point advantage.

7. The lead author of the study backs off from the simplistic claim that "Long and exclusive breast-feeding makes kid smarter." Obtain the articles mentioned to see what else he says might be the causal reason for IQ improvement.

Submitted by Paul Alper

Longer limbs mean less risk of dementia

Longer limbs 'mean less risk of dementia'

Ian Sample, science correspondent

Guardian, Tuesday May 6 2008

Sample writes:

Leggy women and gangly men are less likely to develop Alzheimer's, according to a study that suggests a healthy upbringing protects against the degenerative disease. Researchers took limb measurements of 2,798 men and women with an average age of 72 and monitored them for five years. At the end of the study 480 had developed Alzheimer's or other types of dementia.

The study showed that women with longer legs had a much lower risk of dementia, with every extra inch of leg reducing their risk by 16%. Women with the shortest arms were 50% more likely to develop the disease than those with the longest arms. The study, which appears in the journal Neurology, revealed that only arm length was linked to men's risk of Alzheimer's, with every extra inch lowering their risk by 6%. Scientists who ran the study at Johns Hopkins University in Baltimore believe the link may be explained by poor nutrition in early life.

A second report in the same journal studied the effect of the painkiller ibuprofen on Alzheimer's disease. Doctors at Boston University Medical School found that people who used ibuprofen for at least five years had a 40% lower risk of dementia. The risk was lower among those who took the drug over longer periods. Because the effect is tentative, the scientists said ibuprofen should not be administered specifically to prevent dementia.

To be continued

Cold hit DNA matches

Debate on analyzing 'cold hit' DNA matches swirls in case before California Supreme Court. A long-time scientific controversy centers on how to calculate the probability that such a match would be the result of coincidence.
Los Angeles Times, May 9, 2008
Jason Felch and Maura Dolan

To be continued


Word frequency

WordCount is an interactive website that offers a neat visualisation of the 86,800 most frequently used English words, ranked in order of frequency. For example, relative frequency is proportional to font size, to emphasise each word's level of usage. Some sample word rankings (taken from the top of Chance News 36) are: statistics (3,010), teaching (1,134), number (171), numbers (894), jackpot (25,498).
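
As a rough illustration of what "ranked in order of frequency" involves, the following Python sketch ranks the words of a small text by their counts. It is not WordCount's own code: the function name rank_words, the tokenising pattern, and the alphabetical tie-breaking are all assumptions made for this example, and the min_count=2 default simply mirrors the "at least twice" rule mentioned in the questions below.

```python
from collections import Counter
import re


def rank_words(text, min_count=2):
    """Return (rank, word, count) tuples, most frequent word first.

    Hypothetical helper for illustration only. Ties are broken
    alphabetically, which is an assumption, not WordCount's rule.
    """
    # Lower-case the text and pull out runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    # Keep only words that occur at least min_count times.
    kept = [(word, n) for word, n in counts.items() if n >= min_count]
    # Sort by descending count, then alphabetically.
    kept.sort(key=lambda pair: (-pair[1], pair[0]))
    return [(rank, word, n) for rank, (word, n) in enumerate(kept, start=1)]


sample = "the cat and the dog and the bird saw a cat"
ranking = rank_words(sample)
for rank, word, n in ranking:
    print(rank, word, n)
```

Note that tied words still receive distinct ranks here, so small changes in the counts can reorder neighbouring words. That is worth bearing in mind when reading meaning into closely ranked words.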

The website author, Jonathan Harris, explains:

WordCount was designed with a minimalist aesthetic, to let the information speak for itself. ... The goal is for the user to feel embedded in the language, sifting through words like an archaeologist through sand, awaiting the unexpected find. Observing closely ranked words tells us a great deal about our culture. For instance, 'God' is one word from 'began', two words from 'start' and six words from 'war'. Another sequence is "america ensure oil opportunity". ... As ever, the more one explores, the more is revealed.

Questions

  • WordCount includes all words that occur at least twice in the British National Corpus (BNC), a 100 million word collection; yet, WordCount only contains 86,800 words.
    • What can you infer from this about the distribution of word frequencies?
    • WordCount is a full enumeration of the BNC and the BNC is such a large sample, so does that mean that the estimates of word rankings are accurate? For example, six words chosen from Chance News 36 included the word 'forsooth', which seems to be in the BNC five times but is not in WordCount. (The other five words and their rankings are listed above.)
    • What other information would you like to have to investigate variations in word frequencies and where could you start looking?
  • The layout presents the data (word counts) as a density distribution, with a lookup for individual word rankings. It doesn’t display word frequencies or percentiles. Would they be more helpful or informative than rankings?
  • What is your estimate of the rank of words like 'chance' and 'probability' or 'statistician' and 'mathematician' in the BNC?
    • Have you any confidence in your prediction? You might want to consult the recent Plus Magazine article Understanding uncertainty: The Premier League, by Mike Pearson and David Spiegelhalter, before producing a confidence interval for rankings.
  • The author plans to apply the technique to any text, such as a website or the whole internet. What standard statistical techniques can you think of to highlight differences in word count distributions across different sources? Do any emphasise a visual, interactive approach to data analysis, as is done with WordCount? If not, why not?
  • Do you agree with the statement observing closely ranked words tells us a great deal about our culture?
    • Find 'chance' in the rankings. How likely is it that there is a 'logical' link between 'chance' and its nearest neighbours? If you now expand the neighbourhood around 'chance', how quickly do you think your chances of finding a 'logical' link increase? Would it matter if 'chance' could be substituted by conceptually equivalent words, like 'odds' or 'probability'?
      • For example, The Washington Post's Sunday humor/wordplay contest ran a challenge to write a four-line poem incorporating any four or more successive WordCount words in order (but not necessarily adjacently). Can you use those results as an example of a logical link and how likely it is to occur?
      • See A coincidence?, from Chance News 32 for a related article.
  • While WordCount tracks the way we use language, QueryCount is a related website that tracks the way WordCount is used, by rearranging its word rankings based on the number of times each word has been queried in WordCount. So QueryCount contains statistics of search usage with WordCount.
    • What differences do you expect to find between these two distributions, if any? For example, how much more likely is it that your first name has a higher ranking in QueryCount than in WordCount, on the premise that people are more likely to look up their own name (contributing to its QueryCount ranking) than that word is likely to occur in the BNC (contributing to its WordCount ranking), which changes more slowly over time? Can you infer anything about how representative the BNC is of English language usage on the internet?
    • How likely are you to find a word in WordCount that is not in QueryCount?
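
One standard technique for the question about comparing word count distributions across sources is Pearson's chi-square statistic on a 2 x k table of counts (two sources, k words). The sketch below is only one possible starting point. The function name chi_square_homogeneity and the toy counts are invented for illustration, and the statistic treats word occurrences as independent draws, which running text does not really satisfy.

```python
from collections import Counter


def chi_square_homogeneity(counts_a, counts_b):
    """Pearson chi-square statistic and degrees of freedom for two sources.

    Hypothetical helper: counts_a and counts_b map words to counts.
    Assumes both sources are non-empty and that every listed word
    has a positive count in at least one of them.
    """
    vocab = sorted(set(counts_a) | set(counts_b))
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    grand_total = total_a + total_b
    statistic = 0.0
    for word in vocab:
        observed_a = counts_a.get(word, 0)
        observed_b = counts_b.get(word, 0)
        column_total = observed_a + observed_b
        for observed, row_total in ((observed_a, total_a), (observed_b, total_b)):
            # Expected count if both sources shared one word distribution.
            expected = row_total * column_total / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic, len(vocab) - 1


# Toy counts, made up for illustration (not taken from the BNC).
source_a = Counter({"chance": 30, "odds": 10})
source_b = Counter({"chance": 10, "odds": 30})
statistic, dof = chi_square_homogeneity(source_a, source_b)
print(statistic, dof)
```

With very large corpora almost any difference will look "significant" on this statistic, so an effect-size measure or a visual comparison of the two rankings may be more informative than the test itself.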

Further reading

  • Here are two web pages (Conspiracy Game and 1970s Movie Characters) devoted to finding neighbouring words that are 'connected' (it is claimed).
  • Aside: The BNC is a potentially useful source of quotes about your favourite topic.

Submitted by John Gavin.