Crosstabulation

In the introductory chapter, we proposed a classification of data based on the distinction between binary and continuous variables. When both variables to be correlated are binary, the phi coefficient of correlation, described by Udny Yule in his 1912 article in the Journal of the Royal Statistical Society, is the correct statistic to use. The numerical value of the phi coefficient is identical to that obtained from the Pearson product-moment coefficient of correlation. In the present computer era, the phi coefficient is a concept rather than a computational device. The derivation of the phi coefficient from the Pearson product-moment coefficient of correlation will be used here to gain additional insight into the properties, assumptions, and limitations of the product-moment coefficient of correlation.

Yule's Conceptualization of the Phi Coefficient

The phi coefficient can be derived from the formula for the product-moment coefficient of correlation for obtained scores

$$ r_{xy} = \frac{\overline{XY} - \bar{X}\,\bar{Y}}{\sigma_x \sigma_y} $$

by considering several equivalencies that are true only for binary data. The contingencies surrounding the use of the phi coefficient are perhaps best explained with a data matrix containing only unique response patterns for binary variables X and Y,

Subject   X   Y
Allen     0   1
Beth      1   1
Cathy     0   0
Debra     1   0

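Since all variables in the above table are binary, their means are the proportions $p$ of 1-responses and their variances can be expressed using the pq notation (variance $= pq$, where $q = 1 - p$). With $p_{xy}$ denoting the proportion of subjects scoring 1 on both variables, the formula for the coefficient of correlation can be rewritten as

$$ r_\phi = \frac{p_{xy} - p_x p_y}{\sqrt{p_x q_x\, p_y q_y}} $$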

The r symbol for the product-moment correlation is subscripted with the Greek character phi to stress that to use the above formula, the variables correlated must be binary.
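The equivalence between the two formulas can be checked numerically. The following is a minimal sketch, assuming the pq form is (p_xy - p_x p_y) / sqrt(p_x q_x p_y q_y) and using hypothetical scores for eight subjects that do not come from the text; the function names are illustrative only.

```python
from math import sqrt

def pearson_r(x, y):
    """Product-moment correlation from means and population standard deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    mean_xy = sum(a * b for a, b in zip(x, y)) / n
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    return (mean_xy - mean_x * mean_y) / (sd_x * sd_y)

def phi_pq(x, y):
    """Phi from proportions; valid only when x and y contain just 0s and 1s."""
    n = len(x)
    p_x = sum(x) / n                                # proportion of 1s in X
    p_y = sum(y) / n                                # proportion of 1s in Y
    p_xy = sum(a * b for a, b in zip(x, y)) / n     # proportion of (1,1) patterns
    return (p_xy - p_x * p_y) / sqrt(p_x * (1 - p_x) * p_y * (1 - p_y))

# Hypothetical binary scores for eight subjects (an assumption, not data from the text).
x = [1, 1, 1, 0, 0, 0, 1, 0]
y = [1, 1, 0, 0, 0, 1, 1, 0]

r = pearson_r(x, y)
r_phi = phi_pq(x, y)
print(round(r, 4), round(r_phi, 4))
```

Both functions return the same value for these scores, which is the sense in which the phi coefficient is a rewriting of the product-moment coefficient rather than a separate statistic.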

The Four-Fold Tables

A graphical rendering of the binary data matrix for the current example is shown as

Since in the case of two binary variables the test scores can occupy only four points on the scatter-plot, it is possible to envelop the Cartesian coordinates in a four-fold table, shown as

In the four-fold table, each cell corresponds to a specific response pattern. Allen's cell contains the (0,1)-response pattern, Beth's cell the (1,1)-response pattern, Cathy's cell the (0,0)-response pattern, and Debra's cell the (1,0)-response pattern. Let us borrow the initials of our subjects' names and call these response patterns A, B, C, and D, respectively.

 

 

 

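To translate the phi coefficient from the 'pq' notation to the 'ABCD' notation, consider the data matrix in the above diagram. Using the 'ABCD' notation for the frequencies of a four-fold table, the phi coefficient of correlation can be rewritten from its 'pq' notation

$$ r_\phi = \frac{p_{xy} - p_x p_y}{\sqrt{p_x q_x\, p_y q_y}} $$

into the 'ABCD' notation as

$$ r_\phi = \frac{\dfrac{B}{N} - \dfrac{B+D}{N}\cdot\dfrac{A+B}{N}}{\sqrt{\dfrac{B+D}{N}\cdot\dfrac{A+C}{N}\cdot\dfrac{A+B}{N}\cdot\dfrac{C+D}{N}}} $$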

The expressions in both the numerator and the denominator of the above compound fraction can be put over a common denominator as

$$ r_\phi = \frac{\dfrac{BN - (A+B)(B+D)}{N^2}}{\dfrac{\sqrt{(A+B)(C+D)(A+C)(B+D)}}{N^2}} $$

Simplifying the above expression results in

$$ r_\phi = \frac{BN - (A+B)(B+D)}{\sqrt{(A+B)(C+D)(A+C)(B+D)}} $$

Since N equals the sum of all matrix frequencies (i.e., N = A+B+C+D), the phi coefficient can be written as

$$ r_\phi = \frac{B(A+B+C+D) - (A+B)(B+D)}{\sqrt{(A+B)(C+D)(A+C)(B+D)}} $$

and, further simplified, this equation can be expressed as

$$ r_\phi = \frac{BC - AD}{\sqrt{(A+B)(C+D)(A+C)(B+D)}} $$

The above formula expresses the phi coefficient of correlation in terms of a simple frequency count of the [0,1], [1,1], [0,0], and [1,0] binary response patterns. In terms of proportions instead of frequencies, and adopting the convention of writing frequencies in upper-case and proportions in lower-case letters, the above formula translates to

$$ r_\phi = \frac{bc - ad}{\sqrt{(a+b)(c+d)(a+c)(b+d)}} $$

Since the above proportions are also the means of their corresponding response patterns, the phi coefficient of correlation can also be written as

 

 

About Kittens Killing Rodents: Will They Give Quarter?

A classic heredity vs. environment experiment involved kittens reared together with a rat or with other kittens. When the kittens reached adulthood, they were placed with a rodent and observed to see whether they killed it. Results of this type of research are usually presented as a four-fold table

 

 

                            Reared Together    Reared Apart
Killed the Rodent                  0                 2
Did Not Kill the Rodent            3                 1

 

The above table is an illustrative rendering of this experiment (which actually involved 30 kittens) conducted by Z.Y. Kuo, and described in his classical article The genesis of the cat's response to the rat, published in the Journal of Comparative Psychology, 1930, 11, 1-30.

To analyze this type of data, all we have to do is correctly orient the four-fold table with respect to the Cartesian coordinates

 

 

and unravel it into a form suitable for computing a coefficient of correlation. Let us define the rearing circumstances as the predictor variable X and enter 0 for kittens reared with a rodent and 1 for kittens reared with other kittens. Next, define the criterion variable, Y, and enter 1 when the cat killed the rodent and 0 when it did not.

Kitten    X     Y     XY
1         0     0     0
2         0     0     0
3         0     0     0
4         1     1     1
5         1     1     1
6         1     0     0
Mean     .50   .33   .33
SD       .50   .47

The phi coefficient of correlation is computed from the four-fold table of frequencies as [2(3) - 0(1)] / (2*4*3*3)^(1/2), which equals .71, or from the above table as (.33 - (.50)(.33)) / [(.50)(.47)], which is also .71. The coefficient of determination equals .50. Cats reared together with rodents often overcome the inherited tendency to kill the rodent. Experiments like Kuo's are relevant within the context of the nature-nurture controversy.
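The two routes to the same number can be traced in a few lines. This is a sketch under the assumption that the cells are labeled as in the chapter (A = (0,1), B = (1,1), C = (0,0), D = (1,0)) and that the frequency formula is (BC - AD) / sqrt((A+B)(C+D)(A+C)(B+D)); the helper names are hypothetical.

```python
from math import sqrt

# Cell frequencies from the illustrative kitten table:
# A = reared together, killed; B = reared apart, killed;
# C = reared together, did not kill; D = reared apart, did not kill.
A, B, C, D = 0, 2, 3, 1

def phi_from_counts(a, b, c, d):
    """Phi coefficient from the four cell frequencies of a four-fold table."""
    return (b * c - a * d) / sqrt((a + b) * (c + d) * (a + c) * (b + d))

def unravel(a, b, c, d):
    """Expand cell frequencies into one (x, y) score pair per subject."""
    pairs = [(0, 1)] * a + [(1, 1)] * b + [(0, 0)] * c + [(1, 0)] * d
    x = [pair[0] for pair in pairs]
    y = [pair[1] for pair in pairs]
    return x, y

def pearson_r(x, y):
    """Product-moment correlation from means and population standard deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    mean_xy = sum(p * q for p, q in zip(x, y)) / n
    sd_x = sqrt(sum((p - mean_x) ** 2 for p in x) / n)
    sd_y = sqrt(sum((q - mean_y) ** 2 for q in y) / n)
    return (mean_xy - mean_x * mean_y) / (sd_x * sd_y)

phi = phi_from_counts(A, B, C, D)
x, y = unravel(A, B, C, D)
r = pearson_r(x, y)
print(round(phi, 2), round(r, 2), round(phi ** 2, 2))
```

The frequency formula and the product-moment formula applied to the unraveled scores agree, and squaring either gives the coefficient of determination.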

Limitations of the Phi Coefficient of Correlation

The assumptions of normality and homogeneity can be violated when the categories are extremely uneven, as with proportions close to .90 or .95 (equivalently, .10 or .05). In these cases, the phi coefficient can be markedly attenuated. The assumption of linearity, by contrast, cannot be violated within the context of the phi coefficient of correlation, because each variable takes only two values.
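One way to see the attenuation is to ask how large phi can become for a given pair of marginal splits. The sketch below rests on an assumption not stated in the text: that the ceiling arises because the proportion of (1,1) patterns cannot exceed the smaller of the two marginal proportions. The function name max_phi is hypothetical.

```python
from math import sqrt

def max_phi(p_x, p_y):
    """Largest phi attainable for given proportions of 1s in X and in Y."""
    # The joint (1,1) proportion cannot exceed either marginal proportion.
    p_xy_max = min(p_x, p_y)
    return (p_xy_max - p_x * p_y) / sqrt(p_x * (1 - p_x) * p_y * (1 - p_y))

# X is split evenly; Y becomes progressively more uneven.
for p_y in (0.50, 0.10, 0.05):
    print(p_y, round(max_phi(0.50, p_y), 3))
```

With an even split on X, a 10 percent split on Y caps phi at about .33 even when the two variables agree as closely as the marginals allow, whereas matching even splits allow phi to reach 1.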

The Chi Square Test of Statistical Significance

In the case of binary variables, the prototypical formula of statistical significance

$$ z^2 = \frac{r^2}{1 - r^2}\, N $$

can be further simplified as

$$ \chi^2 = r_\phi^2\, N $$

The Greek symbol on the left side of the above formula is called chi, and the index itself is called the chi square. To understand the nature of the chi square distribution, let us first describe the family of the gamma distributions.

Kittens Killing Rodents Revisited

In the chapter on the phi coefficient of correlation, we described a classic heredity vs. environment experiment involving kittens reared together with a rat or with other kittens. When the kittens reached adulthood, they were placed with a rodent and observed to see whether they killed it. The 1s and 0s of the predictor variable X index the rearing circumstances. The criterion variable Y indexes whether or not the cat killed the rat.

 

 

 

The regression analysis for the example is shown in the table above. The coefficient of determination was computed as .11 / .22, which equals .5. The coefficient of alienation also equals .5. The z square ratio was computed as (.5 / .5)(6), which is 6. The z score equals 2.45; its associated probability equals .00724. Using the chi square test of statistical significance, the chi-square equals .5(6), which is 3.0; with one degree of freedom, its associated probability is approximately .08.

Strength of the Relationship

The coefficient of determination can be obtained from the value of the chi-square

$$ \chi^2 = r_\phi^2\, N $$

as

$$ r_\phi^2 = \frac{\chi^2}{N} $$

For the example, the strength of the relationship can be computed as 3 / 6, equal to .50.
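The arithmetic for the kitten example can be sketched as follows, assuming the chi-square is obtained as N times the squared phi coefficient and evaluated with one degree of freedom, as is standard for a four-fold table. The upper-tail probability uses the identity that a chi-square with one degree of freedom is a squared standard normal score; the function names are hypothetical.

```python
from math import erfc, sqrt

def chi_square_from_phi(phi, n):
    """Chi-square for a four-fold table, computed as N times phi squared."""
    return n * phi ** 2

def p_value_one_df(chi_square):
    """Upper-tail probability of a chi-square with one degree of freedom."""
    # P(chi2 > c) = P(|Z| > sqrt(c)) = erfc(sqrt(c / 2)) for a standard normal Z.
    return erfc(sqrt(chi_square / 2))

def determination_from_chi_square(chi_square, n):
    """Recover the coefficient of determination from the chi-square."""
    return chi_square / n

n = 6                      # number of kittens in the illustrative table
phi = 6 / sqrt(72)         # (BC - AD) / sqrt of the product of the marginal totals
chi_square = chi_square_from_phi(phi, n)
strength = determination_from_chi_square(chi_square, n)
p_value = p_value_one_df(chi_square)
print(round(chi_square, 2), round(strength, 2), round(p_value, 3))
```

Dividing the chi-square by N returns the coefficient of determination, so the test statistic and the strength of the relationship carry the same information once the sample size is known.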

The Family of Pearson's Coefficients of Correlation

The correlation between continuous variables is captured by the product-moment correlation coefficient. The correlation between a continuous and a binary variable can be computed by the point biserial correlation. The correlation between binary variables is conceptualized by the phi correlation coefficient. All the coefficients in Pearson's family of product-moment coefficients are algebraically equivalent and give identical numerical results. There is no need for a computer program dedicated to computing the point biserial coefficient of correlation: any program that calculates Pearson's product-moment coefficient of correlation will also correctly calculate the point biserial coefficient.
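The claim that a general product-moment routine also yields the point biserial can be illustrated with a short sketch. It assumes the usual textbook form of the point biserial, (M1 - M0) / s * sqrt(pq), with s the population standard deviation of the continuous scores; the scores are hypothetical and the function names are illustrative only.

```python
from math import sqrt

def pearson_r(x, y):
    """Product-moment correlation from means and population standard deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    mean_xy = sum(a * b for a, b in zip(x, y)) / n
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    return (mean_xy - mean_x * mean_y) / (sd_x * sd_y)

def point_biserial(group, score):
    """Point biserial from the two group means, the overall SD, and sqrt(pq)."""
    n = len(score)
    ones = [s for g, s in zip(group, score) if g == 1]
    zeros = [s for g, s in zip(group, score) if g == 0]
    p = len(ones) / n
    q = 1 - p
    mean_all = sum(score) / n
    sd = sqrt(sum((s - mean_all) ** 2 for s in score) / n)
    mean_1 = sum(ones) / len(ones)
    mean_0 = sum(zeros) / len(zeros)
    return (mean_1 - mean_0) / sd * sqrt(p * q)

# Hypothetical data: a binary grouping variable and a continuous score.
group = [0, 0, 0, 1, 1, 1]
score = [2.0, 3.0, 4.0, 5.0, 6.0, 7.0]

r = pearson_r(group, score)
r_pb = point_biserial(group, score)
print(round(r, 4), round(r_pb, 4))
```

The dedicated formula and the general routine return the same value, which is why the point biserial survives as a concept rather than as a separate computation.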

The importance of the point biserial as a labor-saving solution for obtaining a correlation coefficient in those special cases where one variable is binary and the other is continuous has diminished with the advent of computerization. However, the point biserial correlation is of considerable theoretical importance in that it provides a basis for the translation between the proportions of variance accounted for by the coefficients of determination and alienation and the variances of their constituent variables. Understanding the principles behind the computation of the point biserial coefficient is also essential for understanding the theory behind the tests of statistical significance.

The different renderings of the correlation coefficient are indispensable for understanding the relationship between correlational measures of relationships and tests of statistical significance to be discussed later. The phi coefficient of correlation is associated with the chi square test of significance. The point biserial coefficient is associated with the t-test.
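The link between the point biserial and the t-test can be made concrete with a standard identity, t = r * sqrt((N - 2) / (1 - r^2)), which is brought in here as an assumption because the text defers the details to a later discussion. The sketch compares it with a pooled-variance two-group t computed directly; the data and function names are hypothetical.

```python
from math import sqrt

def pearson_r(x, y):
    """Product-moment correlation from means and population standard deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    mean_xy = sum(a * b for a, b in zip(x, y)) / n
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    return (mean_xy - mean_x * mean_y) / (sd_x * sd_y)

def t_from_r(r, n):
    """t statistic implied by a correlation r based on n pairs of scores."""
    return r * sqrt((n - 2) / (1 - r ** 2))

def pooled_t(group, score):
    """Two-group t statistic with a pooled variance estimate."""
    ones = [s for g, s in zip(group, score) if g == 1]
    zeros = [s for g, s in zip(group, score) if g == 0]
    n1, n0 = len(ones), len(zeros)
    mean_1 = sum(ones) / n1
    mean_0 = sum(zeros) / n0
    ss_1 = sum((s - mean_1) ** 2 for s in ones)
    ss_0 = sum((s - mean_0) ** 2 for s in zeros)
    pooled_var = (ss_1 + ss_0) / (n1 + n0 - 2)
    return (mean_1 - mean_0) / sqrt(pooled_var * (1 / n1 + 1 / n0))

# Hypothetical data: a binary grouping variable and a continuous score.
group = [0, 0, 0, 1, 1, 1]
score = [2.0, 3.0, 4.0, 5.0, 6.0, 7.0]

t_r = t_from_r(pearson_r(group, score), len(score))
t_groups = pooled_t(group, score)
print(round(t_r, 4), round(t_groups, 4))
```

Both routes give the same t for these scores, illustrating why the point biserial and the t-test can be treated as two renderings of one relationship.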

Summary

The family of Pearson's coefficients of correlation can be summarized as

                 Continuous Y        Binary Y
Continuous X     product-moment      point biserial
Binary X         point biserial      phi

with all coefficients in the above table being different algebraic renderings of the product-moment coefficient, returning identical numerical values.