In the introductory chapter, we proposed a classification of data based on the distinction between binary and continuous variables. When both variables to be correlated are binary, the phi coefficient of correlation, described by Udny Yule in his 1912 article in the Journal of the Royal Statistical Society, is the appropriate statistic to use. The numerical value of the phi coefficient is identical to that obtained from the Pearson product-moment coefficient of correlation. In the present computer era, the phi coefficient is therefore a concept rather than a computational device. The derivation of the phi coefficient from the Pearson product-moment coefficient will be used here to gain additional insight into the properties, assumptions, and limitations of the product-moment coefficient of correlation.
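The phi coefficient can be derived from the formula for the product-moment coefficient of correlation for obtained scores
$$ r_{xy} = \frac{\dfrac{\sum XY}{N} - \dfrac{\sum X}{N}\cdot\dfrac{\sum Y}{N}}{\sqrt{\dfrac{\sum X^2}{N} - \left(\dfrac{\sum X}{N}\right)^2}\;\sqrt{\dfrac{\sum Y^2}{N} - \left(\dfrac{\sum Y}{N}\right)^2}} $$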
by considering several equivalences that hold only for binary data. The conditions surrounding use of the phi coefficient are perhaps best explained with the example of a data matrix containing only unique response patterns for binary variables X and Y,

| Subject | X | Y |
|---|---|---|
| Allen | 0 | 1 |
| Beth | 1 | 1 |
| Cathy | 0 | 0 |
| Debra | 1 | 0 |
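Since all variables in the above table are binary, their variances can be expressed using the pq notation, and the formula for the coefficient of correlation can be rewritten as
$$ r_{\phi} = \frac{p_{xy} - p_x p_y}{\sqrt{p_x q_x\, p_y q_y}} $$
where $p_x$ and $p_y$ are the proportions of 1s in X and Y, $q_x = 1 - p_x$, $q_y = 1 - p_y$, and $p_{xy}$ is the proportion of subjects with the (1,1) response pattern.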
The r symbol for the product-moment correlation is subscripted with the Greek character phi to stress that to use the above formula, the variables correlated must be binary.
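The equivalence between the 'pq' form and the ordinary product-moment formula can be checked numerically. The sketch below is a minimal illustration, not the author's own procedure: the function names `pearson_r` and `phi_pq` and the binary scores are invented for this example, and population (divide-by-N) moments are assumed.

```python
import math

def pearson_r(x, y):
    """Product-moment correlation from raw scores, using divide-by-N moments."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum(a * b for a, b in zip(x, y)) / n - mean_x * mean_y
    var_x = sum(a * a for a in x) / n - mean_x ** 2
    var_y = sum(b * b for b in y) / n - mean_y ** 2
    return cov / math.sqrt(var_x * var_y)

def phi_pq(x, y):
    """Phi coefficient in the 'pq' notation; x and y must hold only 0s and 1s."""
    n = len(x)
    p_x, p_y = sum(x) / n, sum(y) / n
    p_xy = sum(a * b for a, b in zip(x, y)) / n  # proportion of (1,1) patterns
    return (p_xy - p_x * p_y) / math.sqrt(p_x * (1 - p_x) * p_y * (1 - p_y))

# Hypothetical binary scores, invented for illustration only.
x = [1, 1, 1, 0, 0, 0, 1, 0]
y = [1, 1, 0, 0, 0, 1, 1, 0]

# For binary data X**2 == X, so the mean of X**2 equals p and the variance equals pq.
p_x = sum(x) / len(x)
var_x = sum(a * a for a in x) / len(x) - p_x ** 2

r_raw = pearson_r(x, y)
r_phi = phi_pq(x, y)
```

For these made-up scores both functions return the same value, which is the point of the derivation: the 'pq' form is the product-moment formula with the binary equivalences substituted in.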
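A graphical rendering of the binary data matrix for the current example is a scatter plot with X on the horizontal axis and Y on the vertical axis, on which Allen falls at (0,1), Beth at (1,1), Cathy at (0,0), and Debra at (1,0).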

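Since, in the case of two binary variables, the test scores can occupy only four points on the scatter plot, it is possible to envelop the Cartesian coordinates with a four-fold table, shown as

| | X = 0 | X = 1 |
|---|---|---|
| Y = 1 | Allen (0,1) | Beth (1,1) |
| Y = 0 | Cathy (0,0) | Debra (1,0) |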

In the four-fold table, each cell corresponds to a specific response pattern. Allen's cell contains the (0,1) response pattern, Beth's cell the (1,1) response pattern, Cathy's cell the (0,0) response pattern, and Debra's cell the (1,0) response pattern. Let us borrow the initials of our subjects' names and call these response patterns A, B, C, and D, respectively.
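The labeling of cells can be made concrete with a small tally. This is a sketch under stated assumptions: the four cells are taken to hold the patterns (0,1), (1,1), (0,0), and (1,0), written as (X, Y), and the names `CELL_OF_PATTERN` and `four_fold_counts` are hypothetical helpers introduced only for this illustration.

```python
from collections import Counter

# Response patterns (X, Y) of the four subjects named in the text.
subjects = {"Allen": (0, 1), "Beth": (1, 1), "Cathy": (0, 0), "Debra": (1, 0)}

# Cell labels borrowed from the subjects' initials:
# A = (0,1), B = (1,1), C = (0,0), D = (1,0).
CELL_OF_PATTERN = {(0, 1): "A", (1, 1): "B", (0, 0): "C", (1, 0): "D"}

def four_fold_counts(patterns):
    """Return the frequencies A, B, C, D for an iterable of (x, y) pairs."""
    tally = Counter(CELL_OF_PATTERN[p] for p in patterns)
    return {cell: tally.get(cell, 0) for cell in "ABCD"}

counts = four_fold_counts(subjects.values())
```

With one subject per pattern, every cell frequency is 1. The same function tallies any list of binary (X, Y) pairs into the four frequencies used in the 'ABCD' notation.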
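To translate the phi coefficient from the 'pq' notation to the 'ABCD' notation, consider the data matrix in the above diagram. Using the 'ABCD' notation, in which A, B, C, and D indicate the frequencies of a four-fold data table, the phi coefficient of correlation can be rewritten from its 'pq' notation
$$ r_{\phi} = \frac{p_{xy} - p_x p_y}{\sqrt{p_x q_x\, p_y q_y}} $$
into the 'ABCD' notation as
$$ r_{\phi} = \frac{\dfrac{B}{N} - \dfrac{B+D}{N}\cdot\dfrac{A+B}{N}}{\sqrt{\dfrac{B+D}{N}\cdot\dfrac{A+C}{N}\cdot\dfrac{A+B}{N}\cdot\dfrac{C+D}{N}}} $$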
Expressions in both the numerator and the denominator of the above compound fraction can be put over a common denominator as
$$ r_{\phi} = \frac{\dfrac{BN - (A+B)(B+D)}{N^2}}{\dfrac{\sqrt{(A+B)(C+D)(A+C)(B+D)}}{N^2}} $$
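Simplifying the above expression results in
$$ r_{\phi} = \frac{BN - (A+B)(B+D)}{\sqrt{(A+B)(C+D)(A+C)(B+D)}} $$
Since N equals the sum of all matrix frequencies (i.e., N = A + B + C + D), the phi coefficient can be written as
$$ r_{\phi} = \frac{B(A+B+C+D) - (A+B)(B+D)}{\sqrt{(A+B)(C+D)(A+C)(B+D)}} $$
and, further simplified (the terms AB, B², and BD cancel in the numerator), this equation can be expressed as
$$ r_{\phi} = \frac{BC - AD}{\sqrt{(A+B)(C+D)(A+C)(B+D)}} $$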
The above formula expresses the phi coefficient of correlation in terms of a simple frequency count of the (0,1), (1,1), (0,0), and (1,0) binary response patterns. In terms of proportions instead of frequencies, and adopting the convention of writing frequencies in upper-case and proportions in lower-case letters, the above formula translates to
$$ r_{\phi} = \frac{bc - ad}{\sqrt{(a+b)(c+d)(a+c)(b+d)}} $$
Since the above proportions are also the means of the binary indicators of their corresponding response patterns, the phi coefficient of correlation can equally be written in terms of those means.
A classic heredity vs. environment experiment involved kittens reared together with a rat or with other kittens. When the kittens reached adulthood, they were placed with a rodent and observed to see whether they killed it. Results of this type of research are usually presented as a four-fold table
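| | Killed the rodent | Did not kill the rodent |
|---|---|---|
| Reared with a rodent | 0 | 3 |
| Reared with other kittens | 2 | 1 |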
The above table is an illustrative rendering of this experiment (which actually involved 30 kittens), conducted by Z. Y. Kuo and described in his classic article "The genesis of the cat's response to the rat," published in the Journal of Comparative Psychology, 1930, 11, 1-30.
To analyze this type of data, all we have to do is orient the four-fold table correctly with respect to the Cartesian coordinates
and unravel it into a form suitable for computing a coefficient of correlation. Let us define the rearing circumstances as the predictor variable X and enter 0 for kittens reared with a rodent and 1 for kittens reared with other kittens. Next, define the criterion variable, Y, and enter 1 when the cat killed the rodent and 0 when it did not.
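The unraveling step can be sketched in a few lines. The function name `unravel` is hypothetical, and the frequencies passed to it (A = 0, B = 2, C = 3, D = 1) are the illustrative six-cat frequencies used in this section's four-fold computation, with cells labeled A = (0,1), B = (1,1), C = (0,0), D = (1,0).

```python
def unravel(a, b, c, d):
    """Expand four-fold frequencies into paired binary vectors X and Y.

    Cell labels, written as (X, Y): A = (0,1), B = (1,1), C = (0,0), D = (1,0).
    """
    pairs = [(0, 1)] * a + [(1, 1)] * b + [(0, 0)] * c + [(1, 0)] * d
    x = [pair[0] for pair in pairs]
    y = [pair[1] for pair in pairs]
    return x, y

# X = 1: reared with other kittens; Y = 1: killed the rodent.
x, y = unravel(a=0, b=2, c=3, d=1)
```

Each cat becomes one row of the data matrix, so any routine that computes the product-moment correlation of `x` and `y` can then be applied without modification.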
The phi coefficient of correlation is computed from the four-fold table of frequencies as [2(3) - 0(1)] / (2*4*3*3)^(1/2), which equals 6 / 8.49, or .71, or from the above table as (.33 - (.50)(.33)) / [(.50)(.47)], which is also .71. The coefficient of determination equals .50. Cats reared together with rodents often overcome the inherited tendency to kill the rodent. Experiments like Kuo's are relevant within the context of the nature-nurture controversy.
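Both routes to the coefficient can be reproduced as a check on the arithmetic. This is a minimal sketch: the function names are invented for the example, and the frequencies A = 0, B = 2, C = 3, D = 1 are read off the numerator 2(3) - 0(1) and the proportions .50 and .33 quoted above, which give marginal totals of 2, 4, 3, and 3.

```python
import math

def phi_from_counts(a, b, c, d):
    """Phi from four-fold frequencies, with A=(0,1), B=(1,1), C=(0,0), D=(1,0)."""
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (b * c - a * d) / denom

def phi_from_proportions(p_xy, p_x, p_y):
    """Phi in the 'pq' notation from the (1,1) proportion and the two marginals."""
    return (p_xy - p_x * p_y) / math.sqrt(p_x * (1 - p_x) * p_y * (1 - p_y))

a, b, c, d = 0, 2, 3, 1
n = a + b + c + d

phi_counts = phi_from_counts(a, b, c, d)
phi_props = phi_from_proportions(b / n, (b + d) / n, (a + b) / n)

print(round(phi_counts, 2), round(phi_props, 2), round(phi_counts ** 2, 2))
# prints: 0.71 0.71 0.5
```

The two functions agree because the frequency form is the 'pq' form multiplied through by N squared.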
The assumptions of normality and homogeneity can be violated when the categories are extremely uneven, as in the case of proportions close to .90, .95 or .10, .05. In these cases, the phi coefficient can be markedly attenuated. The assumption of linearity cannot be violated within the context of the phi coefficient of correlation.
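One way to see how uneven categories can limit the phi coefficient is to compute the largest value it can reach for given proportions of 1s. This is offered as an illustration rather than as the author's own argument. It relies only on the fact that the proportion of (1,1) patterns cannot exceed the smaller of the two marginal proportions, and the function name `max_phi` is hypothetical.

```python
import math

def max_phi(p_x, p_y):
    """Largest phi attainable for given proportions of 1s in X and Y.

    The (1,1) proportion cannot exceed the smaller marginal, so the
    ceiling is reached at p_xy = min(p_x, p_y).
    """
    p_xy = min(p_x, p_y)
    return (p_xy - p_x * p_y) / math.sqrt(p_x * (1 - p_x) * p_y * (1 - p_y))

# X split evenly; Y increasingly uneven.
for p_y in (0.50, 0.90, 0.95):
    print(p_y, round(max_phi(0.50, p_y), 2))
```

With X split evenly, the ceiling falls from 1.0 to about .33 and .23 as the proportion of 1s in Y moves to .90 and .95, so even a perfectly consistent relationship would yield a modest coefficient.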
In the case of binary variables, the prototypical formula of statistical significance
$$ z^2 = \frac{r^2}{1 - r^2}\, N $$
can be further simplified as
$$ \chi^2 = r_{\phi}^2\, N $$
The Greek symbol on the left side of the above formula is called chi, and the index is called the chi-square. To understand the nature of the chi-square distribution, let us first describe the family of gamma distributions.
In the chapter on the phi coefficient of correlation, we described a classic heredity vs. environment experiment involving kittens reared together with a rat or with other kittens. When the kittens reached adulthood, they were placed with a rodent and observed to see whether they killed it. The 1s and 0s of the predictor vector X index the rearing circumstances. The criterion variable Y indexes whether or not the cat killed the rat.
The regression analysis for the example is shown in the table above. The coefficient of determination was computed as .11 / .22, which equals .5. The coefficient of alienation also equals .5. The z-square ratio was computed as (.5 / .5)(6), which is 6. The z score equals 2.45; its associated probability is approximately .007. Using the chi-square test of statistical significance, the chi-square equals .5(6), which is 3.0; with one degree of freedom, its associated probability is approximately .08.
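The two indices quoted for the example can be recomputed from the coefficient of determination and N. The sketch below makes two assumptions beyond the text: the function names are hypothetical, and the chi-square is referred to a distribution with one degree of freedom, whose upper-tail probability can be written with the complementary error function.

```python
import math

def z_square(r_squared, n):
    """Ratio of determination to alienation, multiplied by N (as used in the text)."""
    return r_squared / (1 - r_squared) * n

def chi_square(r_squared, n):
    """Chi-square obtained as the coefficient of determination times N."""
    return r_squared * n

def chi_square_p_value_1df(chi2):
    """Upper-tail probability of a chi-square with one degree of freedom (assumed)."""
    return math.erfc(math.sqrt(chi2 / 2))

r_squared, n = 0.5, 6
z2 = z_square(r_squared, n)       # (.5 / .5) * 6
chi2 = chi_square(r_squared, n)   # .5 * 6
p_chi2 = chi_square_p_value_1df(chi2)
```

Dividing `chi2` by `n` returns the coefficient of determination, which is the relationship between the chi-square and the strength of association used in this section.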
The coefficient of determination can be obtained from the value of the chi-square
$$ \chi^2 = r_{\phi}^2\, N $$
as
$$ r_{\phi}^2 = \frac{\chi^2}{N} $$
For the example, the strength of the relationship can be computed as 3 / 6, equal to .50.
The correlation between continuous variables is captured by the
product-moment correlation coefficient. The correlation between a continuous and
a binary variable can be computed by the point biserial correlation. The
correlation between binary variables is conceptualized by the phi correlation
coefficient. All the coefficients of Pearson's family of product-moment
coefficients are algebraically equivalent and give identical numerical results.
There is no need for a computer program dedicated to computing the point biserial
coefficient of correlation. Any computer program that calculates Pearson's
product-moment coefficient of correlation will also correctly calculate the
point biserial coefficient of correlation. Together with the phi coefficient of
correlation, the point biserial belongs to the family of Pearson's coefficients
of correlation.
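The claim that a general product-moment routine also returns the point biserial can be checked directly. This is a sketch under stated assumptions: the data are invented, the function names are hypothetical, and the point biserial is written in its common textbook form, the difference between the two group means divided by the population standard deviation of the scores and multiplied by the square root of pq.

```python
import math

def pearson_r(x, y):
    """Product-moment correlation using divide-by-N moments."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    return cov / (sd_x * sd_y)

def point_biserial(binary, scores):
    """(M1 - M0) / SD * sqrt(pq), with SD the population SD of all scores."""
    n = len(scores)
    group1 = [s for b, s in zip(binary, scores) if b == 1]
    group0 = [s for b, s in zip(binary, scores) if b == 0]
    p = len(group1) / n
    mean = sum(scores) / n
    sd = math.sqrt(sum((s - mean) ** 2 for s in scores) / n)
    m1 = sum(group1) / len(group1)
    m0 = sum(group0) / len(group0)
    return (m1 - m0) / sd * math.sqrt(p * (1 - p))

# Hypothetical data: a binary grouping variable and a continuous score.
group = [0, 0, 0, 1, 1, 1, 1]
score = [2.0, 3.0, 5.0, 4.0, 6.0, 7.0, 9.0]

r_general = pearson_r(group, score)
r_pb = point_biserial(group, score)
```

The two values agree to floating-point precision, because the covariance of a binary variable with a continuous one reduces to pq times the difference between the group means.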
The importance of the point biserial as a labor-saving solution for obtaining a
correlation coefficient in those special cases where one variable is binary
and the other is continuous has diminished with the advent of computerization.
However, the point biserial correlation is of considerable theoretical
importance in that it provides a basis for the translation between the
proportions of variance accounted for by the coefficients of determination and
alienation and the variances of their constituent variables. Understanding the
principles behind the computation of the point biserial coefficient is also essential
for understanding the theory behind the tests of statistical significance.
The different renderings of the correlation coefficient are indispensable for
understanding the relationship between correlational measures of relationships
and tests of statistical significance, to be discussed later. The phi
coefficient of correlation is associated with the chi-square test of
significance. The point biserial coefficient is associated with the t-test.
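The link between the point biserial coefficient and the t-test can be illustrated numerically. The sketch below assumes the standard pooled-variance t statistic for two independent groups and the standard conversion t = r * sqrt((N - 2) / (1 - r^2)). The text's own treatment of the t-test comes later and may be laid out differently, and the data and function names here are invented for the example.

```python
import math

def pearson_r(x, y):
    """Product-moment correlation using divide-by-N moments."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    return cov / (sd_x * sd_y)

def pooled_t(group0, group1):
    """Independent-groups t statistic with a pooled variance estimate."""
    n0, n1 = len(group0), len(group1)
    m0, m1 = sum(group0) / n0, sum(group1) / n1
    ss0 = sum((v - m0) ** 2 for v in group0)
    ss1 = sum((v - m1) ** 2 for v in group1)
    pooled_var = (ss0 + ss1) / (n0 + n1 - 2)
    return (m1 - m0) / math.sqrt(pooled_var * (1 / n0 + 1 / n1))

def t_from_r(r, n):
    """Convert a correlation based on n pairs into a t statistic."""
    return r * math.sqrt((n - 2) / (1 - r * r))

# Hypothetical data: scores of two groups, coded 0 and 1.
scores0 = [2.0, 3.0, 5.0]
scores1 = [4.0, 6.0, 7.0, 9.0]
group = [0] * len(scores0) + [1] * len(scores1)
score = scores0 + scores1

t_groups = pooled_t(scores0, scores1)
t_corr = t_from_r(pearson_r(group, score), len(score))
```

Both routes give the same t, so testing the difference between two group means and testing the point biserial correlation are the same test in different notation.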
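The family of Pearson's coefficients of correlation can be summarized as

| Variable X | Variable Y | Coefficient of correlation |
|---|---|---|
| continuous | continuous | product-moment |
| binary | continuous | point biserial |
| binary | binary | phi |

with all coefficients in the above table being different algebraic renderings of
the product-moment coefficient, returning identical numerical values.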