In the introductory chapter, we proposed a classification of data based on the distinction between binary and continuous variables. When both variables to be correlated are binary, the phi coefficient of correlation, described by Udny Yule in his 1912 article in the Journal of the Royal Statistical Society, is the correct statistic to use. The numerical value of the phi coefficient of correlation is identical to that obtained by the Pearson product-moment coefficient of correlation. In the present computer era, the phi coefficient of correlation is a concept rather than a computational device. The derivation of the phi coefficient from the Pearson product-moment coefficient of correlation will be used here to gain additional insight into the properties, assumptions, and limitations of the product-moment coefficient of correlation.
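The phi coefficient can be derived from the formula for the product-moment coefficient of correlation for obtained scores

$$r_{XY} = \frac{\frac{\sum XY}{N} - \bar{X}\,\bar{Y}}{\sqrt{\left(\frac{\sum X^2}{N} - \bar{X}^2\right)\left(\frac{\sum Y^2}{N} - \bar{Y}^2\right)}}$$
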
by considering several equivalencies that are true only for binary data. The contingencies surrounding use of the phi coefficient are perhaps best explained with the example of a data matrix containing only the unique response patterns for binary variables X and Y:

| Subject | X | Y |
|---|---|---|
| Allen | 0 | 1 |
| Beth | 1 | 1 |
| Cathy | 0 | 0 |
| Debra | 1 | 0 |

Since all variables in the above table are binary, their variances can be expressed using the pq notation, and the formula for the coefficient of correlation can be rewritten as

$$r_{\phi} = \frac{p_{XY} - p_X\,p_Y}{\sqrt{p_X\,q_X\,p_Y\,q_Y}}$$

The r symbol for the product-moment correlation is subscripted with the Greek character phi to stress that to use the above formula, the variables correlated must be binary.
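The identity between the pq form and the product-moment formula can be checked numerically. The following is a minimal sketch, assuming the pq form is (p_XY - p_X p_Y) / sqrt(p_X q_X p_Y q_Y); the function names and the eight binary scores are hypothetical and do not come from the text.

```python
import math

def pearson_r(x, y):
    """Product-moment correlation computed from obtained (raw) scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    mean_xy = sum(a * b for a, b in zip(x, y)) / n
    sd_x = math.sqrt(sum(a * a for a in x) / n - mean_x ** 2)
    sd_y = math.sqrt(sum(b * b for b in y) / n - mean_y ** 2)
    return (mean_xy - mean_x * mean_y) / (sd_x * sd_y)

def phi_pq(x, y):
    """Phi coefficient in pq notation; x and y must contain only 0s and 1s."""
    n = len(x)
    p_x = sum(x) / n                               # proportion of 1s in X
    p_y = sum(y) / n                               # proportion of 1s in Y
    p_xy = sum(a * b for a, b in zip(x, y)) / n    # proportion of (1,1) patterns
    return (p_xy - p_x * p_y) / math.sqrt(p_x * (1 - p_x) * p_y * (1 - p_y))

# Hypothetical binary scores for eight subjects (illustration only).
x = [0, 0, 0, 1, 1, 1, 1, 0]
y = [0, 1, 0, 1, 1, 0, 1, 0]
print(pearson_r(x, y), phi_pq(x, y))
```

Both functions should return the same value for any pair of binary vectors, because for 0/1 scores the mean of the squared scores equals the mean of the scores, so each variance reduces to pq.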
A graphical rendering of the binary data matrix for the current example is shown as

Since, in the case of two binary variables, the test scores can occupy only four points on the scatter-plot, it is possible to envelop the Cartesian coordinates in a four-fold table, shown as

In the four-fold table, each cell corresponds to a specific response pattern. Allen's cell contains the (0,1)-response pattern, Beth's cell the (1,1)-response pattern, Cathy's cell the (0,0)-response pattern, and Debra's cell the (1,0)-response pattern. Let us borrow the initials of our subjects' names and call these response patterns A, B, C, and D, respectively.
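The assignment of response patterns to cells can be sketched as a simple tally. This is an illustration only, taking the four cells to hold the patterns (0,1), (1,1), (0,0), and (1,0) for A, B, C, and D; the function name `four_fold_counts` is hypothetical.

```python
def four_fold_counts(x, y):
    """Count the response patterns A=(0,1), B=(1,1), C=(0,0), D=(1,0)."""
    labels = {(0, 1): "A", (1, 1): "B", (0, 0): "C", (1, 0): "D"}
    counts = {"A": 0, "B": 0, "C": 0, "D": 0}
    for pair in zip(x, y):
        counts[labels[pair]] += 1
    return counts

# One subject per unique response pattern, in the order Allen, Beth, Cathy, Debra.
x = [0, 1, 0, 1]
y = [1, 1, 0, 0]
counts = four_fold_counts(x, y)
print(counts)
```

With one subject per pattern, every cell frequency equals 1; with real data the same tally yields the frequencies A, B, C, and D of the four-fold table.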
To translate the phi coefficient from the 'pq' notation to the 'ABCD' notation, consider the data matrix in the above diagram. Using the 'ABCD' notation, in which the letters indicate the frequencies of a four-fold data table, the phi coefficient of correlation can be rewritten from its 'pq' notation

$$r_{\phi} = \frac{p_{XY} - p_X\,p_Y}{\sqrt{p_X\,q_X\,p_Y\,q_Y}}$$

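into the 'ABCD' notation as

$$r_{\phi} = \frac{\frac{B}{N} - \frac{B+D}{N}\cdot\frac{A+B}{N}}{\sqrt{\frac{B+D}{N}\cdot\frac{A+C}{N}\cdot\frac{A+B}{N}\cdot\frac{C+D}{N}}}$$
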
Expressions in both the numerator and the denominator of the above compound fraction can be put over a common denominator as

$$r_{\phi} = \frac{\dfrac{BN - (A+B)(B+D)}{N^2}}{\dfrac{\sqrt{(A+B)(C+D)(A+C)(B+D)}}{N^2}}$$

Simplifying the above expression results in

$$r_{\phi} = \frac{BN - (A+B)(B+D)}{\sqrt{(A+B)(C+D)(A+C)(B+D)}}$$

Since N equals the sum of all matrix frequencies (i.e., N = A+B+C+D), the phi coefficient can be written as

$$r_{\phi} = \frac{B(A+B+C+D) - (A+B)(B+D)}{\sqrt{(A+B)(C+D)(A+C)(B+D)}}$$

and, further simplified, this equation can be expressed as

$$r_{\phi} = \frac{BC - AD}{\sqrt{(A+B)(C+D)(A+C)(B+D)}}$$

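The last simplification step can be verified numerically. The sketch below assumes the cell layout A = (0,1), B = (1,1), C = (0,0), D = (1,0) and takes the frequency form of phi to be (BC - AD) / sqrt((A+B)(C+D)(A+C)(B+D)); the function name and the cell frequencies are hypothetical.

```python
import math

def phi_from_counts(a, b, c, d):
    """Return phi before and after N = A+B+C+D is substituted and cancelled."""
    n = a + b + c + d
    denominator = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    with_n = (b * n - (a + b) * (b + d)) / denominator   # form that still contains N
    simplified = (b * c - a * d) / denominator           # form with N eliminated
    return with_n, simplified

# Hypothetical cell frequencies (illustration only).
with_n, simplified = phi_from_counts(3, 7, 6, 4)
print(with_n, simplified)
```

Expanding B(A+B+C+D) - (A+B)(B+D) leaves only BC - AD, which is why the two returned values agree.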
The frequency formula expresses the phi coefficient of correlation in terms of a simple frequency count of the (0,1), (1,1), (0,0), and (1,0) binary response patterns. In terms of proportions instead of frequencies, and adopting the convention of writing frequencies in upper case and proportions in lower case letters, the formula translates to

$$r_{\phi} = \frac{bc - ad}{\sqrt{(a+b)(c+d)(a+c)(b+d)}}$$

Since the above proportions are also the means of their corresponding response patterns, the phi coefficient of correlation can also be written as
A classic heredity vs. environment experiment involved kittens reared together with a rat or with other kittens. When the kittens reached adulthood, they were placed with a rodent, and it was recorded whether they killed it. Results of this type of research are usually presented as a four-fold table:

| | Reared Together | Reared Apart |
|---|---|---|
| Killed the Rodent | 0 | 2 |
| Did Not Kill the Rodent | 3 | 1 |

The above table is an illustrative rendering of this experiment (which actually involved 30 kittens), conducted by Z. Y. Kuo and described in his classic article "The genesis of the cat's response to the rat," published in the Journal of Comparative Psychology, 1930, 11, 1-30.
To analyze this type of data, all we have to do is to correctly orient the four-fold table with respect to Cartesian coordinates
and unravel it into a form suitable for computing a coefficient of correlation. Let us define the rearing circumstances as the predictor variable, X, and enter 0 for kittens reared with a rodent and 1 for kittens reared with other kittens. Next, define the criterion variable, Y, and enter 1 when the cat killed the rodent and 0 when it did not.
The phi coefficient of correlation is computed from the four-fold table of frequencies as $[2(3) - 0(1)] / (2 \cdot 4 \cdot 3 \cdot 3)^{1/2} = 6 / 8.49$, which equals .71, or from the above table as $(.33 - (.50)(.33)) / ((.50)(.47))$, which is also .71. The coefficient of determination equals .50. Cats reared together with rodents often overcome the inherited tendency to kill the rodent. Experiments like Kuo's are relevant within the context of the nature-nurture controversy.
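The worked example can be reproduced in a few lines. This is a minimal sketch, assuming the coding described in the text (X = 0 for kittens reared with a rodent, X = 1 for kittens reared with other kittens; Y = 1 when the rodent was killed) and the cell layout A = (0,1), B = (1,1), C = (0,0), D = (1,0); the variable names are illustrative.

```python
import math

# Cell frequencies of the illustrative four-fold table.
a, b, c, d = 0, 2, 3, 1   # A: together/killed, B: apart/killed, C: together/not, D: apart/not

# Unravel the table into one (X, Y) pair per kitten.
x = [0] * a + [1] * b + [0] * c + [1] * d
y = [1] * a + [1] * b + [0] * c + [0] * d
n = len(x)

# Phi from the frequencies.
phi_abcd = (b * c - a * d) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))

# Phi from the unravelled scores, in pq notation.
p_x = sum(x) / n
p_y = sum(y) / n
p_xy = sum(u * v for u, v in zip(x, y)) / n
phi_pq = (p_xy - p_x * p_y) / math.sqrt(p_x * (1 - p_x) * p_y * (1 - p_y))

print(round(phi_abcd, 2), round(phi_pq, 2), round(phi_abcd ** 2, 2))  # 0.71 0.71 0.5
```

The product under the square root is the product of the four marginal totals, here 2, 4, 3, and 3.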
The assumptions of normality and homogeneity can be violated when the categories are extremely uneven, as in the case of proportions close to .90 or .95 (equivalently, .10 or .05). In these cases, the phi coefficient can be markedly attenuated. The assumption of linearity cannot be violated within the context of the phi coefficient of correlation.
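The attenuation can be illustrated by asking how large phi can become for given marginal proportions. The sketch below is not taken from the text; it assumes the pq form of phi and uses the fact that the proportion of (1,1) patterns cannot exceed the smaller of the two marginal proportions. The function name `max_phi` is hypothetical.

```python
import math

def max_phi(p_x, p_y):
    """Largest phi attainable when the proportions of 1s are p_x and p_y."""
    p_xy = min(p_x, p_y)   # the most (1,1) overlap the marginals allow
    return (p_xy - p_x * p_y) / math.sqrt(p_x * (1 - p_x) * p_y * (1 - p_y))

# X is split evenly; Y becomes progressively more uneven.
for p_y in (0.50, 0.90, 0.95):
    print(p_y, round(max_phi(0.50, p_y), 2))
```

Under these assumptions, an even split on both variables allows phi to reach 1, whereas a .95/.05 split on one variable caps phi well below that, however strong the underlying relationship.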
In the case of binary variables, the prototypical formula of statistical significance

$$z^2 = \frac{r^2}{1 - r^2}\,N$$

can be further simplified as

$$\chi^2 = r_{\phi}^2\,N$$

The Greek symbol on the left side of the above formula is called chi; the index itself is called the chi square. To understand the nature of the chi square distribution, let us first describe the family of the gamma distributions.
In the chapter on the phi coefficient of correlation, we described a classic heredity vs. environment experiment involving kittens reared together with a rat or with other kittens. When the kittens reached adulthood, they were placed with a rodent, and it was recorded whether they killed it. The 1s and 0s of the predictor vector X index the rearing circumstances. The criterion variable Y indexes whether or not the cat killed the rat.
The regression analysis for the example is shown in the table above. The coefficient of determination was computed as .11 / .22, which equals .5. The coefficient of alienation equals .5. The z square ratio was computed as (.5 / .5)(6), that is, 6. The z score equals 2.45; its associated one-tailed probability is approximately .007. Using the chi square test of statistical significance, the chi-square equals .5(6), that is, 3.0; with one degree of freedom, its associated probability is approximately .083.
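The arithmetic of this example can be retraced as follows. This is a hedged sketch: it assumes a one-tailed probability for the z score and one degree of freedom for the chi-square, and it obtains both tail probabilities from the complementary error function rather than from printed tables. The variable names are illustrative.

```python
import math

n = 6
r_squared = 0.5               # coefficient of determination from the example
k_squared = 1 - r_squared     # coefficient of alienation

z_squared = (r_squared / k_squared) * n
z = math.sqrt(z_squared)
chi_square = r_squared * n

# Upper-tail probability of a standard normal deviate.
p_z = 0.5 * math.erfc(z / math.sqrt(2))

# Upper-tail probability of a chi-square with one degree of freedom,
# using P(chi2 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
p_chi = math.erfc(math.sqrt(chi_square / 2))

print(round(z, 2), round(p_z, 4), chi_square, round(p_chi, 3))
```

For a chi-square with one degree of freedom, the tail probability is the two-tailed normal probability of the square root of the chi-square value, which is what the last expression computes.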
The coefficient of determination can be obtained from the value of the chi-square
as

$$r_{\phi}^2 = \frac{\chi^2}{N}$$

For the example, the strength of the relationship can be computed as 3 / 6, equal to .50.
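The same chi-square value can also be reached from the four-fold table directly. The sketch below swaps in Pearson's observed-versus-expected chi-square, which the text does not spell out, and then divides by N to recover the coefficient of determination; the function name is hypothetical.

```python
def pearson_chi_square(table):
    """Pearson chi-square for a table given as a list of rows of frequencies."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n   # frequency expected under independence
            chi2 += (observed - expected) ** 2 / expected
    return chi2, n

# Rows: killed / did not kill; columns: reared together / reared apart.
table = [[0, 2], [3, 1]]
chi2, n = pearson_chi_square(table)
print(chi2, chi2 / n)  # 3.0 0.5
```

For a four-fold table, this chi-square equals N times the squared phi coefficient, so dividing it by N returns the strength of the relationship.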
The correlation between continuous variables is captured by the product-moment correlation coefficient. The correlation between a continuous and a binary variable can be computed by the point biserial correlation. The correlation between binary variables is conceptualized by the phi correlation coefficient. All the coefficients of Pearson's family of product-moment coefficients are algebraically equivalent and give identical numerical results. There is no need for a computer program dedicated to computing the point biserial coefficient of correlation: any program that calculates Pearson's product-moment coefficient of correlation will also correctly calculate the point biserial coefficient. Together with the phi coefficient of correlation, the point biserial belongs to the family of Pearson's coefficients of correlation.
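The equivalence of the point biserial and the product-moment coefficient can be checked numerically. The sketch below assumes the common textbook form of the point biserial, (M1 - M0) / s * sqrt(pq), with s the standard deviation of all scores computed with N in the denominator; the function names and the data are hypothetical.

```python
import math

def pearson_r(x, y):
    """Product-moment correlation computed from raw scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    return cov / (sd_x * sd_y)

def point_biserial(binary, scores):
    """Point biserial correlation between a 0/1 variable and continuous scores."""
    n = len(scores)
    ones = [s for g, s in zip(binary, scores) if g == 1]
    zeros = [s for g, s in zip(binary, scores) if g == 0]
    p = len(ones) / n            # proportion of cases coded 1
    q = 1 - p
    mean_all = sum(scores) / n
    sd_all = math.sqrt(sum((s - mean_all) ** 2 for s in scores) / n)
    mean_diff = sum(ones) / len(ones) - sum(zeros) / len(zeros)
    return mean_diff / sd_all * math.sqrt(p * q)

# Hypothetical data: group membership (binary) and a continuous score.
group = [0, 0, 0, 1, 1, 1, 1, 0]
score = [12, 15, 11, 18, 20, 14, 19, 13]
print(pearson_r(group, score), point_biserial(group, score))
```

Passing the 0/1 codes to the general product-moment routine gives the same number as the dedicated formula, which is why no separate point biserial program is needed.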
The importance of the point biserial as a labor-saving solution for obtaining a correlation coefficient in those special cases where one variable is binary and the other is continuous has diminished with the advent of computerization. However, the point biserial correlation is of considerable theoretical importance in that it provides a theoretical basis for the translation between proportions of variance accounted for by the coefficients of determination and alienation and the variances of their constituent variables. Understanding the principles behind the computation of the point biserial coefficient is also essential for understanding the theory behind the tests of statistical significance.
The different renderings of the correlation coefficient are indispensable for understanding the relationship between correlational measures of relationships and tests of statistical significance to be discussed later. The phi coefficient of correlation is associated with the chi square test of significance. The point biserial coefficient is associated with the t-test.
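The link between the point biserial coefficient and the t-test can be previewed numerically. The sketch below relies on the standard identity t = r * sqrt(N - 2) / sqrt(1 - r^2), which is not stated in the text, and compares it with the pooled-variance two-sample t computed from the group means; all names and data are hypothetical.

```python
import math

def pearson_r(x, y):
    """Product-moment correlation computed from raw scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    return cov / (sd_x * sd_y)

def pooled_t(group0, group1):
    """Two-sample t statistic with a pooled variance estimate."""
    n0, n1 = len(group0), len(group1)
    m0, m1 = sum(group0) / n0, sum(group1) / n1
    ss0 = sum((v - m0) ** 2 for v in group0)
    ss1 = sum((v - m1) ** 2 for v in group1)
    pooled_var = (ss0 + ss1) / (n0 + n1 - 2)
    return (m1 - m0) / math.sqrt(pooled_var * (1 / n0 + 1 / n1))

def t_from_r(r, n):
    """t statistic implied by a correlation r based on n cases."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

# Hypothetical data: group membership (binary) and a continuous score.
group = [0, 0, 0, 1, 1, 1, 1, 0]
score = [12, 15, 11, 18, 20, 14, 19, 13]
g0 = [s for g, s in zip(group, score) if g == 0]
g1 = [s for g, s in zip(group, score) if g == 1]

r_pb = pearson_r(group, score)
print(t_from_r(r_pb, len(score)), pooled_t(g0, g1))
```

The two printed values coincide, which is one way to see that testing the difference between two group means and testing the point biserial correlation are the same test.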
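The family of Pearson's coefficients of correlation can be summarized as

| Variable X | Variable Y | Coefficient |
|---|---|---|
| Continuous | Continuous | Product-moment |
| Binary | Continuous | Point biserial |
| Binary | Binary | Phi |

with all coefficients in the above table being different algebraic renderings of the product-moment coefficient, returning identical numerical values.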