Tuesday, 2 January 2018

A neat result in Mendelian genetics, thanks to (a+b)^2=a^2+2ab+b^2

Having heard of Hardy's law, which shows that the proportions of a gene's dominant and recessive alleles, and hence of the associated phenotypes, remain stable over time, and knowing that Hardy was a renowned mathematician, I became curious about the maths behind the result. The surprise was that no modelling with complex random processes is involved: the result rests on the "remarkable identity" (as French textbooks call it) known to every secondary-school pupil: (a+b)^2=a^2+2ab+b^2.
Although the problem turns out to be easy in the end, before Hardy's publication it was the subject of a controversy among experts in statistics and genetics, and Hardy looked into it only by chance, as a favour to his cricket partner.
Consider a gene carried as two alleles, one inherited from each parent, each allele having two possible forms denoted A and a.
For example, A could stand for brown eyes and a for blue eyes.
The genotype frequencies in the population are written as follows (Aa is distinguished from aA in order to track meiosis, at the end of which each parent has contributed a single allele):

AA - p, Aa - q, aA - q, aa -  r

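Taking two individuals, under the so-called panmictic hypothesis (random mating), the frequencies of the crosses are counted in a multiplication table, as Hardy put it. In each cell, the first letter is the allele passed on by Parent 1 and the second letter the allele passed on by Parent 2.

Parent 1 \ Parent 2   AA (p)   Aa (q)   aA (q)   aa (r)
AA (p)                AA       Aa       AA       Aa
Aa (q)                AA       Aa       AA       Aa
aA (q)                aA       aa       aA       aa
aa (r)                aA       aa       aA       aa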

To lighten the notation, x2 denotes a quantity x squared, i.e. x^2, not to be confused with 2x, the quantity doubled. F is the frequency of each genotype among the offspring of the first generation.

F(AA) = P = p2+pq+qp+q2 = p2+2pq+q2 = (p+q)2
F(Aa) = Q = pq+q2+pr+qr = (p+q)(q+r)
F(aA) = Q = qp+q2+rp+rq = (q+r)(p+q)
F(aa) = R = q2+rq+qr+r2 = q2+2qr+r2 = (q+r)2
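
As a quick check of the table, the following Python sketch enumerates the sixteen crosses and accumulates the offspring frequencies. The function name `next_generation`, the starting values of p, q and r, and the reading of the table (Parent 1 passes on the first letter of its genotype, Parent 2 the second) are assumptions made for illustration only.

```python
from fractions import Fraction
from itertools import product


def next_generation(p, q, r):
    """Offspring genotype frequencies obtained by enumerating all crosses."""
    freq = {"AA": p, "Aa": q, "aA": q, "aa": r}
    child = {genotype: Fraction(0) for genotype in freq}
    for parent1, parent2 in product(freq, repeat=2):
        # Assumed convention: first letter from parent 1, second from parent 2.
        offspring = parent1[0] + parent2[1]
        child[offspring] += freq[parent1] * freq[parent2]
    return child


# Hypothetical starting frequencies, chosen so that p + 2q + r = 1.
p, q, r = Fraction(1, 2), Fraction(1, 10), Fraction(3, 10)
child = next_generation(p, q, r)

for genotype, frequency in child.items():
    print(genotype, frequency)
```

Exact fractions are used so that the sums can be compared with (p+q)2, (p+q)(q+r) and (q+r)2 without rounding.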

Hypothesis: suppose q2=pr (we will see below what this amounts to).
Substituting pr for q2 in the expressions for P and R gives:
P=p2+2pq+pr
R=pr+2rq+r2
and hence P/R = [p(p+2q+r)]/[r(p+2q+r)] = p/r. To reach the result, knowing that p+2q+r=P+2Q+R=1, one more relation is needed: (Q/R)2=(q/r)2=p/r. Indeed, on the one hand, dividing the hypothesis q2=pr by r2 gives (q/r)2=p/r; on the other hand, the expressions for P, Q and R show directly that Q2=PR, and dividing by R2 gives (Q/R)2=P/R=p/r.

In summary, under the hypothesis q2=pr we have P/R=p/r and Q/R=q/r, and since the sums P+2Q+R and p+2q+r both equal 1, the only possible solution is P=p, Q=q, R=r.

Nature is ingenious: although there is no reason for q2=pr to hold at the start, the calculation nevertheless becomes valid from the next generation onwards, because Q2=PR always holds by construction. The English mathematician Hardy put it this way.
«The interesting question is — in what circumstances will this distribution be the same as that in the generation before? It is easy to see that the condition for this is q^2 = pr. And since q_1^2 = p_1r_1, whatever the values of p, q, and r may be, the distribution will in any case continue unchanged after the second generation ».
The allele proportions stabilise immediately, through the mechanism of a remarkable identity.
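
The stabilisation can be checked numerically with a minimal Python sketch. The function name `step` and the starting frequencies are hypothetical; the update rule is simply P=(p+q)2, Q=(p+q)(q+r), R=(q+r)2.

```python
from fractions import Fraction


def step(p, q, r):
    """One generation of random mating: returns (P, Q, R)."""
    return (p + q) ** 2, (p + q) * (q + r), (q + r) ** 2


# Hypothetical starting point with p + 2q + r = 1 and q^2 != p*r.
g0 = (Fraction(1, 2), Fraction(1, 10), Fraction(3, 10))
g1 = step(*g0)
g2 = step(*g1)

for name, (p, q, r) in (("g0", g0), ("g1", g1), ("g2", g2)):
    # The last column tells whether the equilibrium condition q^2 = p*r holds.
    print(name, p, q, r, q * q == p * r)
```

With these values the condition fails for the starting generation, holds for the first generation of offspring, and the second generation repeats the first exactly.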
Hardy, a specialist in number theory and mentor of the brilliant Indian mathematician Ramanujan, was not particularly impressed by his discovery, whose mathematics he considered very simple. But for those of us with memories of school, here at last is perhaps a justification for the adjective "remarkable".
The consequence of the "law", which would probably be better called a theorem, is that in a closed, large population not subject to natural selection, there is no single phenotype towards which the whole population converges over time, contrary to what one might spontaneously believe. Blue eyes do not disappear despite their well-known recessive character, which is good news; on the other hand, some diseases persist through the same mechanism of two recessive alleles meeting. Nature seems to prefer a kind of status quo that resists change, but what would life be without stability?
PS: I have discovered that "allele evolution" is now taught as early as the première class (the penultimate year of French secondary school), but I do not know whether the very accessible proof of the result is presented as well.




Tuesday, 15 August 2017

Yet another contribution to the P-value discussion: probabilities are maths, not logic

Hi,
for some reason I ran into the problem of Null Hypothesis Significance Testing (NHST),
also known as significance testing or the P-value approach, in the social and medical sciences.

What is it? It is an old discussion that can be traced back to the origins of statistical testing
in the 1930s, and it is still alive as more and more scientific publications rely on statistical evidence. According to many qualified observers, the misuse of NHST allows a lot of bad science, in which evidence supposedly drawn from the facts is actually derived from faulty reasoning.

Claim: in everyday language, the word "probability" lacks the rigorous definition it has in the mathematical theory. Going back to that definition, we see that the vague notion of a probability is not the same thing as a probability function defined on measurable sets. Once this is said, no other approach is needed: classical probability theory stands.

Here is a link to an article on the history, ironically titled "The Earth Is Round (p < .05)":
http://ist-socrates.berkeley.edu/~maccoun/PP279_Cohen1.pdf

and here is a collection of quotes from famous statisticians on the subject:
http://www.indiana.edu/~stigtsts/quotsagn.html

Probabilities can be tricky. The usual misuse of null hypothesis testing can be illustrated with a simple example (using the 5% threshold commonly applied in research experiments):

- H0: the patient is Normal,
- H1: the patient is Sick.
Given a positive test whose probability of being a false positive is only 4%,
--> H0 is rejected and the patient is considered Sick.

This only seems to be a logical conclusion, because in everyday language probability is a synonym for degree of truth: "less than 5%" is read as an objective evaluation of H0's truth, of its adequacy to reality. That degree of truth being too low, and since what is not true is false, H0 is declared false. But converting probabilities into logical values can lead us astray, as we will now see.

Probabilities are mappings from a source into [0,1]. Sources can be different subsets of the population (the Universe), and a different source implies a different probability mapping.

On reflection, 96% is the probability of testing positive only among the people who are actually Sick; it is not the probability of being Sick among the people with a Positive test result. These are simply not the same mathematical function. We can see this by writing p(Negative | Sick) + p(Positive | Sick) = 1, a sum taken within the Sick group only, and, just to dot the i's, p(Sick and Normal) = 0.
As we know, this is the reason for writing conditional probabilities with subscripts: to make it clear that they are not the same probability.

The so-called "confusion" matrix, or table, makes it clearer.

Actual \ Tested   Negative result   Positive result
Normal            900               100
Sick              4                 96

The probability of being Sick given a Positive test is easily calculated: 96/(100+96) ≈ 49%, i.e. below 50%.
In this mock-up, a positive test result implies almost equal chances of being sick or not.
(The risk is still much higher than for those with a negative test: p = 4/904, about 0.4%.)
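
The same arithmetic can be written as a short Python sketch, which makes the change of "source" explicit: one function divides by a row total, the other by a column total. The function names are hypothetical; the counts are those of the confusion table.

```python
from fractions import Fraction

# Counts from the confusion table: actual status -> test result -> people.
table = {
    "Normal": {"Negative": 900, "Positive": 100},
    "Sick": {"Negative": 4, "Positive": 96},
}


def p_result_given_actual(result, actual):
    """Probability of a test result within one actual group (row total)."""
    row = table[actual]
    return Fraction(row[result], sum(row.values()))


def p_actual_given_result(actual, result):
    """Probability of an actual status within one test result (column total)."""
    column_total = sum(row[result] for row in table.values())
    return Fraction(table[actual][result], column_total)


positive_if_sick = p_result_given_actual("Positive", "Sick")
sick_if_positive = p_actual_given_result("Sick", "Positive")
sick_if_negative = p_actual_given_result("Sick", "Negative")

print(float(positive_if_sick), float(sick_if_positive), float(sick_if_negative))
```

The two functions read the same cell of the table (96) but divide it by different totals, which is exactly why the two probabilities differ.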

Unlike truth, a probability is relative to the set on which it is measured.

I think this is just how it was defined by Kolmogorov in the 1930s, in the first rigorous formalism, which is still in use.

Here is an inspiring article on the same topic in its newest guise, tests vs confidence intervals: "Confidence Intervals: Fad or Fashion" (Econometric Sense).

Friday, 19 August 2011

Simultaneous equations model for passenger demand

Our purpose is to estimate how much more (or less) air travel demand an increase (or decrease) in GDP produces.
We use DOT data for the USA-France travel market to show how econometrics can answer this question.

Demand, Traffic, Supply, Market shares. 
 
People in airlines make a distinction between:

- Demand (for travelling from A to B).
- Traffic or actual number of passengers carried between two points, by a given service.

Demand is like fish in the sea. Airlines capture it using aircraft as nets.
The bigger the net (aircraft), the higher the number of passengers. The chart below displays
the log of an airline's share of passengers against the log of its share of seats
(Source: US Department of Transportation, quarterly data, 2003Q1 - 2010Q1).
Observations line up, displaying a stable relation over the years:

[Chart: log of passenger share against log of seat share, quarterly, 2003Q1 - 2010Q1]

For airlines, empty seats are wasted cost, and passengers turned away are lost income.
So it pays to know the market's potential when building a schedule and choosing the right
aircraft size and number of flights.

Elasticity to GDP:

For the USA-France market, we expect a positive correlation between the change in the GDPs of France
and the USA and the variation in demand for air travel between the two countries.
This is more than a correlation: a causal relation exists, because GDP is an indicator of available income.

We can write down a first equation (the dummy variables D1-D3 capture seasonal effects; only three quarters are needed since the model already contains a constant Kd):
D(t) = Kd . GDP(t)^a  . D1(t) . D2(t) . D3(t)

(We use only one parameter for the GDP, the average of France and USA, since we do not know the number of tickets sold in each country)
 
The multiplicative form is standard because, to a first-order approximation, it simplifies to:
[% change in Demand]  =  a x [% change in GDP]

The equation above is the reason for calling the coefficient "a" an "elasticity".
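
The first-order approximation can be checked with a small Python sketch. Everything here is illustrative: the constant `K`, the base GDP level and the 1% shock are hypothetical, the seasonal terms are left out, and the exponent is set to the 1.43 estimate reported further down in this post.

```python
A_EXPONENT = 1.43  # GDP exponent "a" (the estimate quoted later in the post)
K = 2.0            # hypothetical constant Kd; it cancels out in % changes


def demand(gdp):
    """Multiplicative demand model without the seasonal terms."""
    return K * gdp ** A_EXPONENT


def pct_change(new, old):
    return 100.0 * (new - old) / old


base_gdp = 100.0               # hypothetical GDP level
shocked_gdp = base_gdp * 1.01  # a 1% increase in GDP

exact_pct = pct_change(demand(shocked_gdp), demand(base_gdp))
approx_pct = A_EXPONENT * pct_change(shocked_gdp, base_gdp)

print(f"exact: {exact_pct:.4f}%  first-order: {approx_pct:.4f}%")
```

For small changes the two numbers are close; the gap widens as the GDP change grows, which is why the elasticity reading is only a first-order statement.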
The GDP exponent "a" could be estimated by ordinary least squares (OLS, i.e. linear regression), but two things prevent us from doing this.


Problem 1: No demand statistics for a country pair (A,B)


Airlines report the passengers they carry to the DOT, but a passenger departing from CDG and connecting at Heathrow will appear as a passenger originating from the UK, not from France.
We might consider using traffic data as an approximation, but then there is another problem.




Problem 2: Planners use GDP forecasts to design schedules.

Thus, a change in GDP causes a change in traffic through two different channels: demand and supply. We therefore cannot use traffic data alone to estimate the GDP demand factor.
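
A toy illustration of the two channels, in Python. The log-linear equations and all coefficients below are assumptions made up for this sketch, not the model fitted in this post: seats are assumed to follow GDP with exponent `S_SUPPLY`, and traffic is assumed to depend on GDP (through demand) with exponent `A_DEMAND` and on seats with exponent `B_SEATS`. The data are synthetic and noiseless.

```python
import math

A_DEMAND = 1.4  # hypothetical demand response to GDP
S_SUPPLY = 0.8  # hypothetical response of seats to GDP forecasts
B_SEATS = 0.5   # hypothetical response of traffic to seats offered


def ols_slope(xs, ys):
    """Slope of the least-squares line of ys on xs (single regressor)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Synthetic series: 20 periods of steadily growing GDP.
log_gdp = [math.log(100 + 2 * t) for t in range(20)]
log_seats = [1.0 + S_SUPPLY * g for g in log_gdp]
log_traffic = [0.5 + A_DEMAND * g + B_SEATS * s
               for g, s in zip(log_gdp, log_seats)]

# Single equation: traffic regressed on GDP only.
naive_slope = ols_slope(log_gdp, log_traffic)
print(naive_slope, A_DEMAND + B_SEATS * S_SUPPLY)
```

In this setup the single-equation slope recovers the sum of both channels, `A_DEMAND + B_SEATS * S_SUPPLY`, rather than the demand exponent alone.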

Solution with multiple equations and conclusion.


Instead of using demand, we use traffic and supply to fit a simultaneous equations model. This enables us to obtain an unbiased estimate of the demand exponent.

Nevertheless, we rely on another (reasonable) hypothesis: the average proportion of seats used by connecting passengers, i.e. people originating from countries other than France and the USA, does not change dramatically over time. Otherwise, a specific airline would have to subtract its own connecting passengers from the traffic, which is in any case a much simpler operation than collecting data from all other carriers on the geographical origin of their passengers on a particular route.

Seats and traffic data for the USA-France market come from the DOT, and the OECD provides GDPs for France and the USA.

With this obviously too simple model, the estimated elasticity to GDP (the exponent a) is 1.43.

Here are results for single equation alternatives:
Using traffic data from a specific airline with a single equation:     1.93
Using aggregated traffic from all carriers with a single equation:     1.11

At least in this particular example, fitting a model that uses all the available data compensates for the absence of more relevant data.


Application: Predicting traffic for a single airline 

This model can also be used to predict traffic. The chart below illustrates the goodness of fit for a single airline. Forecasts would not necessarily be as good as this: a simple caveat is that you need an accurate GDP forecast to build your traffic forecast.