Difference between revisions of "Chance News 27"

From ChanceWiki
(new article - correlation and tonal languages.)
Line 188: Line 188:
 
when it's pronounced with a single high-pitched tone, "ma" means "mother."  
 
when it's pronounced with a single high-pitched tone, "ma" means "mother."  
 
But when it has a low-pitched lilt in the middle, it means "horse", making it a word you don't want to mispronounce.  
 
But when it has a low-pitched lilt in the middle, it means "horse", making it a word you don't want to mispronounce.  
 +
In contrast, English is a non-tonal language.
  
In contrast, English is a non-tonal language.
 
 
This split is unevenly distributed: tonal languages are the norm in sub-Saharan Africa and are common in Southeast Asia and among Native American languages especially in parts of Central and South America. Non-tonal languages are the norm in Europe and Central, South and West Asia, and among the aboriginal languages of Australia.  
 
This split is unevenly distributed: tonal languages are the norm in sub-Saharan Africa and are common in Southeast Asia and among Native American languages especially in parts of Central and South America. Non-tonal languages are the norm in Europe and Central, South and West Asia, and among the aboriginal languages of Australia.  
 +
If your ancestors were all European, with mostly nontonal languages in Europe, you have a better than even chance of carrying the two new genes.
  
 
The two genes in question have emerged in the human population only very recently, 6,000 to 37,000 years ago.
 
The two genes in question have emerged in the human population only very recently, 6,000 to 37,000 years ago.
Line 196: Line 197:
 
Furthermore, these new genes seem to be spreading quickly in the human species,
 
Furthermore, these new genes seem to be spreading quickly in the human species,
 
suggesting that they favoured by natural selection.
 
suggesting that they favoured by natural selection.
However, they are unevenly distributed in the world’s populations: rare in sub-Saharan Africa, most common in Europe, North Africa and Western Asia.
 
If your ancestors were all European, with mostly nontonal languages, you have a better than even chance of carrying the two new genes.
 
  
 
The authors hypothesis concerns the relationship between  
 
The authors hypothesis concerns the relationship between  
Line 205: Line 204:
 
They gathered genetic data, in the form of allele frequencies, and linguistic data, in the form of values of typological features from a variety of sources.
 
They gathered genetic data, in the form of allele frequencies, and linguistic data, in the form of values of typological features from a variety of sources.
 
Next they calculated the correlation between tone and each of the alleles separately.
 
Next they calculated the correlation between tone and each of the alleles separately.
These correlations are between the tone status of the language of each population and the corresponding allele frequency in that population, based on the 49 populations in their sample.
 
  
 
[[Image:GenLingCorrs.jpg|frame|right|The distribution of the correlations between all pairs of genetic markers and linguistic features in the database.
 
[[Image:GenLingCorrs.jpg|frame|right|The distribution of the correlations between all pairs of genetic markers and linguistic features in the database.
Line 212: Line 210:
 
But it may be that genes generally correlate with the typological features of language because of patterns of past migration and contact between peoples speaking different languages.   
 
But it may be that genes generally correlate with the typological features of language because of patterns of past migration and contact between peoples speaking different languages.   
 
Because of this possibility, Dediu and Ladd wanted to make sure that our correlation is 'signficant' not only in this standard statistical sense, but also that it is unusual when compared to other correlations between genes and typological features.  
 
Because of this possibility, Dediu and Ladd wanted to make sure that our correlation is 'signficant' not only in this standard statistical sense, but also that it is unusual when compared to other correlations between genes and typological features.  
So for the same 49 populations they gathered frequency data for 983 alleles and values for 26 typological features, such as the number of consonants or the use of inflections, and 983 genetic variants,  
+
So for the same 49 populations they gathered frequency data for 983 alleles and values for 26 typological features, such as the number of consonants or the use of inflections and 983 genetic variants,  
 
in order to see in which manner genes and typological features tend to behave.  
 
in order to see in which manner genes and typological features tend to behave.  
They show that the correlations between all pairs of genes and typological features follows a normal distribution around 0.  
+
They show that the correlations between all pairs of genes and typological features follows a normal distribution around zero.  
 +
 
 
This means that, as expected, there is no general correlation between genes and linguistic features, so language features are unlikely to be affected by genes.
 
This means that, as expected, there is no general correlation between genes and linguistic features, so language features are unlikely to be affected by genes.
 
But the relation between tone and the two genes under study was confirmed to be especially strong in all the analyses.  
 
But the relation between tone and the two genes under study was confirmed to be especially strong in all the analyses.  
 
<blockquote>
 
<blockquote>
It’s because there generally isn’t a correlation between population genetics and language typology that the correlation we’ve found may be interesting.
+
It’s because there generally isn’t a correlation between population genetics and language typology that the correlation we’ve found may be interesting
 
</blockquote>
 
</blockquote>
 +
one of the authors claims.
  
 
Pearson's correlation was calculated when considered each of the two genes separately. The authors used logistic regression when considering the impact of both genes together, to estimate how the frequency of the two genes in a population predicts its language tonality.
 
Pearson's correlation was calculated when considered each of the two genes separately. The authors used logistic regression when considering the impact of both genes together, to estimate how the frequency of the two genes in a population predicts its language tonality.
Line 226: Line 226:
 
while in the top-left quadrant there is a balanced mixture.  
 
while in the top-left quadrant there is a balanced mixture.  
 
The bottom-right quadrant contains no populations in our sample and the reason is not known.]]
 
The bottom-right quadrant contains no populations in our sample and the reason is not known.]]
 
Because they compute a large number of statistical measures, the significances (p-values) of the results had to be adjusted. (When computing, for example, 1000 correlations, just by chance 10 will turn out to be significant for an alpha level of 0.01.) So the paper adjusts all p-values using [http://en.wikipedia.org/wiki/Holm_Bonferroni_method Holm's multiple comparisons correction.]
 
 
 
Ladd said
 
Ladd said
 
<blockquote>
 
<blockquote>
Line 234: Line 231:
 
</blockquote>
 
</blockquote>
  
 +
Because they compute a large number of statistical measures, the significances (p-values) of the results had to be adjusted. (When computing, for example, 1000 correlations, just by chance 10 will turn out to be significant for an alpha level of 0.01.) So the paper adjusts all p-values using [http://en.wikipedia.org/wiki/Holm_Bonferroni_method Holm's multiple comparisons correction.]
 
Remarakable correlations could happen by chance. In this case, other factors that might cause an unusual correlation to occur include prehistoric migration, that are currently unknown. Ladd said
 
Remarakable correlations could happen by chance. In this case, other factors that might cause an unusual correlation to occur include prehistoric migration, that are currently unknown. Ladd said
 
<blockquote>
 
<blockquote>
Line 243: Line 241:
 
* Speculate how you think the authors take into account the fact that neighbouring populations tend to share both the same genes and languages? In addition to Person's correlation, the paper also uses [Manel correlation http://en.wikipedia.org/wiki/Mantel_test], are you familiar with this type of correlation? What other kinds of correlation can you think of? The authors transform their data into distances to control for geography and history. They based geographic distance between populations on land distances rather than 'as the crow flies', why do you think this measure might be more appropriate?
 
* Speculate how you think the authors take into account the fact that neighbouring populations tend to share both the same genes and languages? In addition to Person's correlation, the paper also uses [Manel correlation http://en.wikipedia.org/wiki/Mantel_test], are you familiar with this type of correlation? What other kinds of correlation can you think of? The authors transform their data into distances to control for geography and history. They based geographic distance between populations on land distances rather than 'as the crow flies', why do you think this measure might be more appropriate?
 
* Languages from the Americas were excluded. Why do you think the authors did this?
 
* Languages from the Americas were excluded. Why do you think the authors did this?
 +
* Can you think of any other tests the authors might have considered?
 
===Further reading===
 
===Further reading===
 
* The [http://www.pnas.org/cgi/content/abstract/0610848104v1 academic paper] by Dan Dediu and D. Robert Ladd, University of Edinburgh, is available from Proc. Natl. Acad. Sci. USA (subscription required).
 
* The [http://www.pnas.org/cgi/content/abstract/0610848104v1 academic paper] by Dan Dediu and D. Robert Ladd, University of Edinburgh, is available from Proc. Natl. Acad. Sci. USA (subscription required).

Revision as of 14:42, 16 June 2007

Quotation

I could prove God statistically. Take the human body alone - the chances that all the functions of an individual would just happen is a statistical monstrosity.

George Gallup

This is from an interesting article about the life of George Gallup: "The Human Yardstick", Williston Rich, Saturday Evening Post, January 21, 1939, p. 71. The article ends with:

His greatest delusion is that he can forecast the stock market. His greatest fear, that a competitor will enter his field and be dishonest with the figures. His greatest devotion, to his family, his home and his church. He says "I could prove God statistically--.

Submitted by Laurie Snell

Forsooth

The following Forsooths were in the May 2007 RSS News:

The Times leader of 28 February tells us that taking a regular dose of certain vitamins 'can actually increase the risk of mortality by five per cent'. Since the 'risk of mortality' is already 100 percent this is very worrying.

AWF Edwards
Cambridge

Da springen drei Rosen, Halb rot und halb weiß

19. Die schöne Müllerin
Franz Schubert,

Submitted and sung by Laurie Snell


I'm not sure if this is a "Forsooth" but at the minimum, it is an example of a poorly written survey question. The Kansas City Star newspaper has a poll on its website asking people their opinions about the Harry Potter series of books. The first question asks "Is Snape good or bad?" and offers the choices "Yes" or "No". As of June 7, there were 33 (29%) votes for "No".

Submitted by Steve Simon.

How to own a random number

God created the integers; all else is the work of man - Leopold Kronecker.

AACS is the copy protection technology used on HD-DVD and Blu-ray discs. The consortium that owns this technology is apparently trying to stop websites and newspapers from publishing a specific 128-bit integer that, with suitable software, enables the decryption of video content on most existing HD-DVD and Blu-ray discs. As part of this effort, they have claimed ownership of the encryption key, which means that you cannot use, without written permission, that particular integer (a 128-bit number has up to 39 digits in base 10) and several million other unknown keys that they are apparently also claiming to own. Not only that, but the numbers in question were chosen randomly, so there is no simple way of knowing if your random choice conflicts with theirs, even if their choices were public knowledge.

Further reading

  • How to own a random number, BoingBoing blog, 7 May 2007.
  • You Can Own an Integer Too — Get Yours Here, Ed Felten, May 7, 2007 — this professor of Computer Science and Public Affairs at Princeton University suggests a way that you too can own your own random integer. Hundreds have now claimed their own random number. He even suggests a use for your number:

Did we mention that a shiny new integer would make a perfect Mother’s Day gift?

Submitted by John Gavin.

Two probability problems

IBM has a monthly "Ponder this challenge" which is often a probability problem. The April 2007 Challenge was the following random walk problem.

This month's puzzle concerns a frog who is hopping on the integers from minus infinity to plus infinity. Each hop is chosen at random (with equal probability) to be either +2 or -1. So the frog will make steady but irregular progress in the positive direction. The frog will hit some integers more than once and miss others entirely. What fraction of the integers will the frog miss entirely? You may consider the answer to be the limit as N goes to infinity of the fraction of integers between -N and N a frog starting at -N and randomly hopping as described misses on average. An answer correct to six decimal places is good enough.
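One way to get a feel for the frog problem before attempting an exact answer is to simulate a long walk and count the integers that were never landed on. The sketch below is only a rough Monte Carlo estimate, not a solution: the function name, the number of hops, the seed and the `margin` used to trim both ends of the walk are all assumptions made for illustration, and the estimate will only be good to two or three decimal places, far short of the six the puzzle asks for.

```python
import random


def estimate_missed_fraction(num_hops=200_000, margin=1_000, seed=0):
    """Simulate one long frog walk and return the share of integers never hit."""
    rng = random.Random(seed)
    position = 0
    visited = {0}
    for _ in range(num_hops):
        # Each hop is +2 or -1 with equal probability.
        position += 2 if rng.random() < 0.5 else -1
        visited.add(position)

    # Only count integers well away from the start and from the final
    # position: near the start the result depends on where the frog began,
    # and near the end the frog could still hop back onto unvisited integers.
    low, high = margin, position - margin
    if high <= low:
        raise ValueError("walk too short for the chosen margin")

    missed = sum(1 for n in range(low, high + 1) if n not in visited)
    return missed / (high - low + 1)


if __name__ == "__main__":
    print(f"estimated fraction missed: {estimate_missed_fraction():.4f}")
```

Running the function with several different seeds gives a sense of how much the estimate varies from walk to walk.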

You might also want to try the puzzle for February 2007, which provides another example of the occurrence of the Golden Mean:

Consider the following two person game. Each player receives a random number uniformly distributed between 0 and 1. Each player can choose to discard his number and receive a new random number between 0 and 1. This choice is made without knowing the other player's number or whether the other player chose to replace his number. After each player has had an opportunity to replace his number the numbers are compared and the player with the higher number wins. What strategy should a player follow to ensure he will win at least 50% of the time?
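Candidate strategies for this game can be compared by simulation. The sketch below assumes that each player uses a simple cutoff rule (discard the first number if it falls below a cutoff, otherwise keep it); that restriction, the function names and the trial count are assumptions for illustration, and the puzzle itself does not say that the best strategy has this form.

```python
import random


def final_number(cutoff, rng):
    """Draw a number and replace it once if it falls below the cutoff."""
    first = rng.random()
    return rng.random() if first < cutoff else first


def win_rate(cutoff_a, cutoff_b, trials=200_000, seed=0):
    """Estimate how often player A (cutoff_a) beats player B (cutoff_b)."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        if final_number(cutoff_a, rng) > final_number(cutoff_b, rng):
            wins += 1
    return wins / trials


if __name__ == "__main__":
    # Compare a few hypothetical cutoffs against an opponent who never replaces.
    for cutoff in (0.3, 0.5, 0.618, 0.8):
        print(cutoff, round(win_rate(cutoff, 0.0), 3))
```

Pitting each cutoff against a range of opposing cutoffs, rather than a single opponent, shows whether any one value avoids losing on average to all of the others.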

Why have Americans stopped growing?

Bad Health Care, Deficient Welfare Keep Americans Short, Spiegel International, 22 May 2007.

This on-line article, one in a series, claims that US citizens were the tallest in the world up to World War II but since then, US heights have stagnated while Europeans have been getting taller. Furthermore, the average American is now between two and six centimeters shorter than his European counterpart. The article cites a new study which conjectures that this phenomenon might be explained by differences between health care and the social network systems in various countries. It also suggests a similar result for life expectancy.

The underlying academic paper studies long-term trends in the heights of the US population by combining the results of cross-sectional surveys. The authors' analysis is based on the complete set of NHES and NHANES data collected between 1959 and 2004. They used regression analysis to estimate the trend in U.S. heights stratified by gender and ethnicity, holding income and educational attainment constant.

The article asks why the historical correlation between height and wealth is breaking down:

The correlation between wealth and height has long been understood, the most recent example coming as Eastern Europeans shot up following the collapse of communism. But why, in the richest country in the world, should growth rates (in height) be stagnating?

This academic study argues that politics may offer an answer. The paper claims that the US average is pulled down by those who struggle to get by, in a country with drastic differences between rich and poor. It claims

whereas in the US, some 15 percent of the population has no health insurance and those on welfare can barely get by, almost all citizens of northern and western European countries enjoy universal health care and a generous social net. The result is that even those children dependent on welfare in Europe have a sufficient living standard.

The paper mentions that taller people have higher incomes, on average, but is unsure about the direction of causation. In spite of this, the US population height has been stable since 1950 even though prosperity continued to rise, something that will require further work to explain, according to the authors:

Quite a bit more needs to be done to determine the relationship between social standards and height. In short, the richest are neither the tallest nor the healthiest. Why that is so must be explained.

Questions

  • Today, Americans are between two and six centimeters shorter than the Dutch but in the mid-19th century the reverse was the case. How reliable do you think data from 150 years ago are, in this case? What questions might you ask to ensure that any comparisons between cross-sectional studies, over such a long period of time, are fair? How large do you think the sample sizes might have to be to justify such claims?
  • The authors claim an association between height and health but also mention an association between health and life expectancy. Height is easy to measure and fixed (between ages 20-50) but life expectancy appears not to be. Is it plausible to speculate on what average height might tell us about average life expectancy? Might it be easier for health educators and policy makers to focus on height as a simpler and more transparent measure of health than life expectancy?
  • Similar to the previous question, there is a negative correlation between population height and population illiteracy and population height and income inequality. Would height serve as a good proxy for the overall well-being of a population? If so, why aren't trends in population height used by policy makers more often?
  • Other factors that might influence the relation between height and health, such as nutritional intake, incidence of disease and availability of medical services, were not felt to be material in developed countries. Why might this be? Might these factors have been more relevant in the past? How difficult do you think it would be to reliably extract such measures from historical data?
  • The paper briefly mentions a negative association between population height and population density, allowing for social status. Speculate on how you might investigate such a relationship.
  • If height seems like an attractive proxy for health, how might you handle the fact that a population's average height grows with age between 0 and about 20 and shrinks with age above about age 50? Without such adjustments, is it necessary to wait 20 years for each generation's final height to stabilise before drawing conclusions about their future health and life expectancy?
  • For future studies, speculate on how genetic information might gradually replace height as a proxy for future health. What are the relative merits of genetic information compared to height?

Further reading

Submitted by John Gavin.

A short film exploring relative sizes

On the topics of visualisation and scaling, this short film visualises the size of 'things' and the effect of then adding a zero to the scale. With every passing 10 seconds, the view is 10 times wider: starting with one meter - a person lying on the ground - to 10^24 - the universe. It then reverses the view down to a subatomic particle, 10^-14. A journey of 40 powers of ten.

  • The video was made by Charles and Ray Eames at IBM.

Further reading

  • If you have trouble with the link above, try searching for Powers of Ten.
  • I found the original link at specialten.tv - wait for the page to load, then click on the link labelled Top Ten, in the top left of the page, to get a drop down list, then click on Powers of ten, which is fifth on the list.

Submitted by John Gavin.

Statistics dampen feelings of compassion

Genocide: When compassion fails, Paul Slovic, New Scientist, 07 April 2007.

If I look at the mass, I will never act. If I look at the one, I will. Mother Teresa

This New Scientist article asserts that people do not value lives consistently, when donating to charitable causes. In particular, the authors claim that statistics can dissipate any emotion we might feel towards a victim, by comparing a 'statistical' victim to an identifiable victim.

Examples of statistical victims are:

  • Food shortages in Malawi are affecting more than 3 million children.
  • In Zambia, severe rainfall deficits have resulted in a 42 percent drop in maize production from 2000. As a result, an estimated 3 million Zambians face hunger.
  • Four million Angolans -- one third of the population -- have been forced to flee their homes.
  • More than 11 million people in Ethiopia need immediate food assistance.

In contrast, an example of an identifiable victim is:

  • Any money that you donate will go to Rokia, a 7-year-old girl from Mali, Africa. Rokia is desperately poor, and faces a threat of severe hunger or even starvation. Her life will be changed for the better as a result of your financial gift.

Studies have shown how people are more willing to aid specific individuals than those who are unidentified or simply listed as statistics, the 'identifiable victim' effect.

In a series of field experiments, the authors claim that teaching people to recognize the discrepancy in giving toward identifiable and statistical victims had perverse effects: individuals gave less to identifiable victims but did not increase giving to statistical victims, resulting in an overall reduction in caring and giving. Thus, it appears that, when thinking analytically, people discount sympathy towards identifiable victims but fail to generate sympathy toward statistical victims.

In a related paper by Small and Loewenstein, people felt less compassion and donated less aid towards a pair of victims than to either individual alone.

Questions

  • Do you agree with the article's assertion that people are more generous toward specific identified victims than toward unidentifiable 'statistical' victims? If so, why do you think this might be? For example, as individuals, do we feel powerless to fight systemic evils but able to help individual people?
  • What techniques might be used when presenting statistics so that an emotional response, sufficient to motivate action, can be created and maintained?
  • Joseph Stalin was quoted as saying "A single death is a tragedy; a million deaths is a statistic." Is Stalin commenting on the same issue as the New Scientist article?

Further reading

Submitted by John Gavin.

Correlation and tonal languages

Words in code, The Economist, 31st May 2007.
Speaking in tones? Blame it on your genes, Mark Henderson, Science Editor, The Times Online (UK)
A Genetic Basis for Language Tones?, Nikhil Swaminathan, Scientific American, 29 May 2007.
Is Your Tongue in Your Genes?, Michael Balter, ScienceNOW Daily News, 29 May 2007.
Genes May Influence Language Learning, Study Suggests, Mason Inman, National Geographic News, 29 May 2007.

A widely-reported academic paper discusses a statistical study of the relationship between the geographical distribution of two human genes and the geographical distribution of tonal languages. From this, the authors, Dan Dediu and D. Robert Ladd, at the University of Edinburgh, assert a genetic basis for people's ability to learn a tonal language. Ladd said

I looked at maps of the distributions of the old and new versions of the genes. And I said, that looks like the distribution of tonal languages.

In effect, the language people speak is at least partly determined by genes, rather than just experience alone.

About half the world's languages are tonal languages, in which the pitch or tone of words and syllables makes a difference to word meaning. Examples include: Chinese, Thai, Yoruba, and Zulu. In Mandarin Chinese, for example, the syllable "ma" can take on several unique meanings: when it's pronounced with a single high-pitched tone, "ma" means "mother." But when it has a low-pitched lilt in the middle, it means "horse", making it a word you don't want to mispronounce. In contrast, English is a non-tonal language.

This split is unevenly distributed: tonal languages are the norm in sub-Saharan Africa and are common in Southeast Asia and among Native American languages especially in parts of Central and South America. Non-tonal languages are the norm in Europe and Central, South and West Asia, and among the aboriginal languages of Australia. If your ancestors were all European, with mostly nontonal languages in Europe, you have a better than even chance of carrying the two new genes.

The two genes in question have emerged in the human population only very recently, 6,000 to 37,000 years ago. This implies that the very first human languages were probably tonal, sounding more like Zulu or Chinese than French or English. Furthermore, these new genes seem to be spreading quickly in the human species, suggesting that they are favoured by natural selection.

The authors' hypothesis concerns the relationship between a typological linguistic feature (namely, tone) and the "derived" alleles (variants) of two human genes. They tested the hypothesis using linguistic and genetic data from 49 populations around the world. They gathered genetic data, in the form of allele frequencies, and linguistic data, in the form of values of typological features, from a variety of sources. Next they calculated the correlation between tone and each of the alleles separately.

The distribution of the correlations between all pairs of genetic markers and linguistic features in the database. The horizontal axis represents the strength of the correlation (Pearson's r, between -1 and +1, where 0 means no correlation). It can be seen that most correlations are around zero, but that the correlations between tone and the two genes (ASPM and Microcephalin) are very improbable (stronger than 98.6% of all the correlations).

But it may be that genes generally correlate with the typological features of language because of patterns of past migration and contact between peoples speaking different languages. Because of this possibility, Dediu and Ladd wanted to make sure that their correlation is 'significant' not only in the standard statistical sense, but also unusual when compared to other correlations between genes and typological features. So for the same 49 populations they gathered frequency data for 983 alleles (genetic variants) and values for 26 typological features, such as the number of consonants or the use of inflections, in order to see how genes and typological features tend to behave. They show that the correlations between all pairs of genes and typological features follow a normal distribution around zero.
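The idea of judging one correlation against the distribution of all gene-feature correlations can be sketched in a few lines. The code below is an illustration only: the data are randomly generated rather than taken from the paper, the sizes are scaled down, the function names are invented, and comparing absolute values of r is an assumption about what "stronger" means.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson's correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant variable has no defined correlation
    return sxy / math.sqrt(sxx * syy)


def fraction_weaker(target_r, all_rs):
    """Share of correlations whose absolute value is below |target_r|."""
    return sum(abs(r) < abs(target_r) for r in all_rs) / len(all_rs)


if __name__ == "__main__":
    rng = random.Random(1)
    n_pop, n_alleles, n_features = 49, 100, 26  # synthetic, scaled down
    alleles = [[rng.random() for _ in range(n_pop)] for _ in range(n_alleles)]
    features = [[rng.randint(0, 1) for _ in range(n_pop)] for _ in range(n_features)]
    all_rs = [pearson_r(a, f) for a in alleles for f in features]
    print("share of pairs weaker than r = 0.4:", fraction_weaker(0.4, all_rs))
```

With purely random data the correlations cluster around zero, so a value that sits in the far tail of this empirical distribution stands out in the sense described above.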

This means that, as expected, there is no general correlation between genes and linguistic features, so language features are unlikely to be affected by genes. But the relation between tone and the two genes under study was confirmed to be especially strong in all the analyses.

It’s because there generally isn’t a correlation between population genetics and language typology that the correlation we’ve found may be interesting

one of the authors claims.

Pearson's correlation was calculated when considering each of the two genes separately. The authors used logistic regression when considering the impact of both genes together, to estimate how the frequency of the two genes in a population predicts its language tonality.

Tone languages are represented by empty squares and non-tone languages by black squares. It can be seen that in the bottom-left quadrant there are only tone languages, in the top-right quadrant only non-tone languages, while in the top-left quadrant there is a balanced mixture. The bottom-right quadrant contains no populations in our sample and the reason is not known.

Ladd said

What we have found suggests that these genes might have a very small effect on individuals, and a larger effect on the populations in which they live. As the language is passed on culturally, it would then be more likely to develop along one path than the other.

Because they compute a large number of statistical measures, the significances (p-values) of the results had to be adjusted. (When computing, for example, 1000 correlations, about 10 are expected to turn out significant at an alpha level of 0.01 just by chance, even if no real association exists.) So the paper adjusts all p-values using Holm's multiple comparisons correction. Remarkable correlations can still happen by chance. In this case, other factors that are currently unknown, such as prehistoric migration, might also have produced an unusual correlation. Ladd said
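The arithmetic behind the multiple-comparisons concern, and the step-down adjustment usually attributed to Holm, can be written out as follows. This is a minimal sketch of the standard textbook procedure, not the authors' code; the function names and the example p-values are made up for illustration.

```python
def expected_false_positives(num_tests, alpha):
    """Expected number of 'significant' results when no effect is real."""
    return num_tests * alpha


def holm_adjust(p_values):
    """Return Holm-adjusted p-values in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Adjusted values must not decrease as the raw p-values increase.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


if __name__ == "__main__":
    print(expected_false_positives(1000, 0.01))
    print(holm_adjust([0.01, 0.04, 0.03]))
```

A result is then declared significant only if its adjusted p-value is below the chosen alpha, which makes it harder for chance findings to pass when many tests are run.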

The research had so far found only an association that appears to be more than chance, and that more work was needed to confirm a causal effect.

Questions

  • Speculate on how the authors take into account the fact that neighbouring populations tend to share both the same genes and languages. In addition to Pearson's correlation, the paper also uses the Mantel correlation (http://en.wikipedia.org/wiki/Mantel_test). Are you familiar with this type of correlation? What other kinds of correlation can you think of? The authors transform their data into distances to control for geography and history. They based geographic distance between populations on land distances rather than 'as the crow flies'. Why do you think this measure might be more appropriate?
  • Languages from the Americas were excluded. Why do you think the authors did this?
  • Can you think of any other tests the authors might have considered?
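For readers unfamiliar with the Mantel test mentioned in the first question, the sketch below shows the basic idea: correlate the entries of two distance matrices, then shuffle the labels of one matrix many times to see how often a correlation at least as large arises by chance. It is a simplified, one-sided version written for illustration; the function names, the example coordinates and the number of permutations are assumptions, and the paper's own implementation may differ.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson's correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return 0.0 if sxx == 0 or syy == 0 else sxy / math.sqrt(sxx * syy)


def distance_matrix(points):
    """Euclidean distances between all pairs of 2-D points."""
    return [[math.dist(p, q) for q in points] for p in points]


def upper_triangle(matrix, order):
    """Entries above the diagonal after relabelling rows and columns by order."""
    n = len(order)
    return [matrix[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]


def mantel_test(dist_a, dist_b, permutations=999, seed=0):
    """Return the matrix correlation and a one-sided permutation p-value."""
    rng = random.Random(seed)
    identity = list(range(len(dist_a)))
    flat_a = upper_triangle(dist_a, identity)
    observed = pearson_r(flat_a, upper_triangle(dist_b, identity))
    at_least_as_large = 0
    order = identity[:]
    for _ in range(permutations):
        rng.shuffle(order)  # relabel the populations in the second matrix
        if pearson_r(flat_a, upper_triangle(dist_b, order)) >= observed:
            at_least_as_large += 1
    return observed, (at_least_as_large + 1) / (permutations + 1)


if __name__ == "__main__":
    places = [(0, 0), (1, 5), (2, 1), (7, 3), (4, 4), (9, 0), (3, 8), (6, 6)]
    geo = distance_matrix(places)
    print(mantel_test(geo, geo))
```

Permuting whole rows and columns together, rather than individual entries, is what lets the test respect the fact that the distances in a matrix are not independent of one another.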

Further reading

Submitted by John Gavin