The Phi Coefficient of Correlation

 

Correlate Two Binary Variables

In the introductory chapter, we proposed a classification of data based on the distinction between binary and continuous variables. When both variables to be correlated are binary, the phi coefficient of correlation, described by Udny Yule in his 1912 article in the Journal of the Royal Statistical Society, is the appropriate statistic to use.

The numerical value of the phi coefficient of correlation is identical to that obtained by computing the Pearson product-moment coefficient of correlation on the same binary data. In the present computer era, the phi coefficient is therefore a concept rather than a computational device. The derivation of the phi coefficient from the Pearson product-moment coefficient of correlation will be used here to gain additional insight into the properties, assumptions, and limitations of the product-moment coefficient of correlation.

 

Yule's Conceptualization of the Phi Coefficient

The phi coefficient can be derived from the formula for the product-moment coefficient of correlation for obtained scores

$$r_{XY} = \frac{\frac{1}{n}\sum XY - \bar{X}\,\bar{Y}}{s_X\, s_Y}$$

 

 

by considering several equivalencies that are true only for binary data. The conditions surrounding the use of the phi coefficient are perhaps best illustrated with a data matrix containing only unique response patterns for binary variables X and Y,

 

 

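Since all variables in the above table are binary, their variances can be expressed using the pq notation and the formula for the coefficient of correlation can be rewritten as

$$r_\phi = \frac{p_{xy} - p_x p_y}{\sqrt{p_x q_x}\,\sqrt{p_y q_y}}$$

where p_x and p_y are the proportions of 1s on X and Y, q = 1 - p, and p_xy is the proportion of subjects scoring 1 on both variables.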

 

 

The r symbol for the product-moment correlation is subscripted with the Greek character phi to stress that to use the above formula, the variables correlated must be binary.
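The equivalence between the pq form and the ordinary product-moment formula can be checked numerically. The following is a minimal sketch, assuming the pq form is (p_xy - p_x p_y) / sqrt(p_x q_x p_y q_y); the function names and the eight binary scores are hypothetical and serve only as an illustration.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation computed from raw scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    return cov / sqrt(var_x * var_y)


def phi_pq(x, y):
    """Phi in pq notation; valid only when x and y contain 0s and 1s."""
    n = len(x)
    p_x, p_y = sum(x) / n, sum(y) / n
    p_xy = sum(a * b for a, b in zip(x, y)) / n  # proportion of (1,1) patterns
    return (p_xy - p_x * p_y) / sqrt(p_x * (1 - p_x) * p_y * (1 - p_y))


# Hypothetical binary scores for eight subjects (illustration only).
x = [1, 0, 1, 1, 0, 0, 1, 0]
y = [1, 0, 0, 1, 0, 1, 1, 0]

r = pearson_r(x, y)
phi = phi_pq(x, y)
print(round(r, 4), round(phi, 4))
```

Both functions return the same value for any pair of binary variables, because for 0/1 scores the mean is p, the variance is pq, and the mean cross-product is p_xy.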

 

The Four-Fold Tables

A graphical rendering of the binary data matrix for the current example is shown as  

 

Since, in the case of two binary variables, the test scores can occupy only four points on the scatter-plot, it is possible to envelop the Cartesian coordinates in a four-fold table, shown as

 

 

In the four-fold table, each cell corresponds to a specific response pattern. Allen's cell contains the (0,1) response pattern, Beth's cell the (1,1) response pattern, Cathy's cell the (0,0) response pattern, and Debra's cell the (1,0) response pattern. Let us borrow the initials of our subjects' names and call these response patterns A, B, C, and D, respectively.

 

 

To translate the phi coefficient from the 'pq' notation to the 'ABCD' notation, consider the data matrix in the above diagram. Using the 'ABCD' notation, which indicates the frequencies of a four-fold data table, the phi coefficient of correlation can be rewritten from its 'pq' notation

 

 

into the 'ABCD' notation as

 

 

Expressions in both the numerator and the denominator of the above compound fraction can be put under the common denominator as

 

 

Simplifying the above expression results in

 

 

Since n equals the sum of all matrix frequencies (i.e., n = A+B+C+D), the phi coefficient can be written as

 

 

and, further simplified, this equation can be expressed as

 

 

The above formula expresses the phi coefficient of correlation in terms of a simple frequency count of the (0,1), (1,1), (0,0), and (1,0) binary response patterns.
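The steps described above can be written out explicitly. The following is a reconstruction, assuming that A, B, C, and D are the frequencies of the (0,1), (1,1), (0,0), and (1,0) patterns, so that p_xy = B/n, p_x = (B+D)/n, p_y = (A+B)/n, q_x = (A+C)/n, and q_y = (C+D)/n.

```latex
\begin{aligned}
r_\phi &= \frac{p_{xy} - p_x p_y}{\sqrt{p_x q_x \, p_y q_y}} \\[4pt]
       &= \frac{\dfrac{B}{n} - \dfrac{B+D}{n}\cdot\dfrac{A+B}{n}}
               {\sqrt{\dfrac{B+D}{n}\cdot\dfrac{A+C}{n}\cdot\dfrac{A+B}{n}\cdot\dfrac{C+D}{n}}} \\[4pt]
       &= \frac{\dfrac{nB - (A+B)(B+D)}{n^{2}}}
               {\dfrac{\sqrt{(A+B)(C+D)(A+C)(B+D)}}{n^{2}}} \\[4pt]
       &= \frac{nB - (A+B)(B+D)}{\sqrt{(A+B)(C+D)(A+C)(B+D)}} \\[4pt]
       &= \frac{(A+B+C+D)B - (A+B)(B+D)}{\sqrt{(A+B)(C+D)(A+C)(B+D)}} \\[4pt]
       &= \frac{BC - AD}{\sqrt{(A+B)(C+D)(A+C)(B+D)}}
\end{aligned}
```

In the last step the terms AB, B², and BD cancel, leaving only the two cross-products BC and AD in the numerator.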

 

                                   Row Total
                A         B        A+B
                C         D        C+D
Column Total    A+C       B+D

 

In terms of proportions instead of frequencies, and adopting the convention of writing frequencies in upper case and proportions in lower case letters, the above formula translates to

$$r_\phi = \frac{bc - ad}{\sqrt{(a+b)(c+d)(a+c)(b+d)}}$$

 

 

Since the above proportions are also the means of their corresponding response patterns, the phi coefficient of correlation can also be written as

 

 

As the algebraic derivation shows, the phi coefficient belongs to the family of Pearson's product-moment coefficients of correlation.

 

About Kittens Killing Rodents: Will They Give Quarter?

A classic heredity vs. environment experiment involved kittens reared together with a rat or with other kittens. When the kittens reached adulthood, they were placed with a rodent and observed to see whether they killed it. This experiment (which actually involved 30 kittens) was conducted by Z. Y. Kuo and described in his classic article "The genesis of the cat's response to the rat," published in the Journal of Comparative Psychology, 1930, 11, 1-30.

 

Coding

Let's define the rearing circumstances as the variable X and enter 0 for kittens reared with a rodent and 1 for kittens reared with other kittens. Next, define the killing behavior as the variable Y and enter 1 when the cat killed the rodent and 0 when it did not.

 

 

Combinations

There are two rearing conditions (kitten reared with a rodent and kitten reared with other kittens) and the variable killing behavior has two levels (cat killed the rodent and cat did not kill the rodent). Thus, there are four combinations (2*2 = 4). The four combinations with respect to Cartesian coordinates can be positioned as shown below.

 

Y   Kill the Rodent (1)            (0, 1)               (1, 1)
    Did not Kill the Rodent (0)    (0, 0)               (1, 0)
                                   Rear Together (0)    Rear Apart (1)
                                   X

 

ABCD Notation

Record the response frequencies corresponding to response patterns (0,1), (1,1), (0,0), and (1,0).

 

Pattern      01    11    00    10
Frequency    0     2     3     1

   

 

They are usually presented as a four-fold table:

                               Reared Together (0)    Reared Apart (1)
Killed the Rodent (1)          0                      2
Did Not Kill the Rodent (0)    3                      1

 

Note that no cat reared with the rodent killed the rodent.  For the cats reared with other cats, two of them killed the rodent and one did not. 

To analyze this type of data, all we have to do is correctly orient the four-fold table with respect to Cartesian coordinates


 

and unravel it into the form suitable to compute a coefficient of correlation. 


                               Rear Together (0)    Rear Apart (1)    Row Total
Kill the Rodent (1)            0                    2                 2
Did not Kill the Rodent (0)    3                    1                 4
Column Total                   3                    3                 6

 

The phi coefficient of correlation is computed from the four-fold table of frequencies as [2(3) - 0(1)] / (3*3*2*4)^(1/2), which equals .71.
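The same computation can be sketched in a few lines of code. The function name is hypothetical; the cell labels assume that A, B, C, and D are the frequencies of the (0,1), (1,1), (0,0), and (1,0) response patterns.

```python
from math import sqrt


def phi_from_fourfold(a, b, c, d):
    """Phi from four-fold frequencies: A=(0,1), B=(1,1), C=(0,0), D=(1,0)."""
    denominator = sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (b * c - a * d) / denominator


# Frequencies from the kitten example: no (0,1), two (1,1), three (0,0), one (1,0).
phi = phi_from_fourfold(a=0, b=2, c=3, d=1)
print(round(phi, 2))  # 0.71
```

The denominator is the square root of the product of the two row totals (2 and 4) and the two column totals (3 and 3).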

 

PQ Notation 

Let's define the rearing circumstances as the predictor variable X and enter 0 for kittens reared with a rodent and 1 for kittens reared with other kittens. Next, define the killing behavior as the criterion variable Y and enter 1 when the cat killed the rodent and 0 when it did not.

 

 

The phi coefficient of correlation is computed from the above table as

 

 

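(.33 - (.50)(.33)) / [(.50)(.47)], which also equals .71. Note that (pq)^(1/2) is the standard deviation of a binary variable, so .50 and .47 are the standard deviations of X and Y.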
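The pq computation can be retraced from the raw scores. This is a minimal sketch: the two score vectors are rebuilt from the four-fold frequencies given earlier, and the order of the six cases is an assumption that does not affect the result.

```python
from math import sqrt

# Raw scores rebuilt from the frequencies in the text (case order is assumed).
x = [0, 0, 0, 1, 1, 1]  # 0 = reared with a rodent, 1 = reared with other kittens
y = [0, 0, 0, 0, 1, 1]  # 1 = killed the rodent, 0 = did not

n = len(x)
p_x = sum(x) / n                              # proportion reared apart
p_y = sum(y) / n                              # proportion that killed
p_xy = sum(a * b for a, b in zip(x, y)) / n   # proportion of (1,1) patterns
sd_x = sqrt(p_x * (1 - p_x))                  # (pq)^(1/2) for X
sd_y = sqrt(p_y * (1 - p_y))                  # (pq)^(1/2) for Y

phi = (p_xy - p_x * p_y) / (sd_x * sd_y)
print(round(p_xy, 2), round(p_x, 2), round(p_y, 2))
print(round(sd_x, 2), round(sd_y, 2), round(phi, 2))
```

The rounded intermediate values match those in the text: .33, .50, and .33 for the proportions, and .50 and .47 for the standard deviations.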

Experiments like Kuo's are relevant within the context of the nature-nurture controversy.

 

Limitations of the Phi Coefficient of Correlation

The assumptions of normality and homogeneity can be violated when the categories are extremely uneven, as in the case of proportions close to .90, .95 or .10, .05. In these cases, the phi coefficient can be markedly attenuated. The assumption of linearity cannot be violated within the context of the phi coefficient of correlation.

 

The Chi Square Test of Statistical Significance

The phi coefficient of correlation is associated with the chi square test of significance. Is there any relation between the rearing circumstances of the kittens and their killing behavior?

Hypotheses

The null hypothesis states that there is no relation between the rearing conditions and the killing behavior. The alternative hypothesis states that there is a relation between the rearing conditions and the killing behavior.

The Chi-Square Test

The phi square and chi square coefficients are related. The chi square can be expressed using the square of the phi coefficient as

$$\chi^2 = n\,\phi^2$$

The Greek symbol on the left side of the above formula is called chi and the index is called the chi square. To understand the nature of the chi square distribution, let us first review the family of the gamma distributions.

 

Kittens Killing Rodents Revisited

In the chapter on the phi coefficient of correlation, we described a classic heredity vs. environment experiment involving kittens reared together with a rat or with other kittens. When the kittens reached adulthood, they were placed with a rodent and observed to see whether they killed it.

Coding

The 1s and 0s of the predictor variable X index the rearing circumstances. The criterion variable Y indexes whether or not the cat killed the rat.

 

 

The regression analysis for the example is shown in the table below.

 

   

Compute the Chi-Square Value

The coefficient of determination is computed as .11 / .22, which equals .5. Using the chi square test of statistical significance,

$$\chi^2 = n\,r_\phi^2$$

the chi-square equals (.5)(6), that is, 3.0.
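The values .11, .22, and .5 can be retraced with a small regression of Y on X. This is a sketch under stated assumptions: the score vectors are rebuilt from the four-fold frequencies (case order assumed), variances are computed with n in the denominator, and the coefficient of determination is taken as the ratio of predicted to total variance.

```python
# Kitten data rebuilt from the four-fold frequencies (case order is assumed).
x = [0, 0, 0, 1, 1, 1]  # rearing: 0 = with a rodent, 1 = with other kittens
y = [0, 0, 0, 0, 1, 1]  # outcome: 1 = killed the rodent, 0 = did not

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
var_x = sum((a - mean_x) ** 2 for a in x) / n
var_y = sum((b - mean_y) ** 2 for b in y) / n
cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n

# Least-squares regression of Y on X.
slope = cov_xy / var_x
intercept = mean_y - slope * mean_x
predicted = [intercept + slope * a for a in x]

# Coefficient of determination: predicted variance over total variance.
mean_pred = sum(predicted) / n
var_pred = sum((p - mean_pred) ** 2 for p in predicted) / n
r_squared = var_pred / var_y

chi_square = n * r_squared
print(round(var_pred, 2), round(var_y, 2), round(r_squared, 2), round(chi_square, 2))
```

The predicted variance rounds to .11, the total variance of Y to .22, their ratio to .5, and the chi-square to 3.0, in agreement with the text.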

Note that the t-square ratio is characteristic of the Fisherian conceptualization of statistical inference, with the degrees of freedom used throughout all computations leading to the t-square ratio. The chi-square ratio is characteristic of the Pearsonian conceptualization of statistical inference, where the degrees of freedom are introduced only during the last phase of the computation of the probability associated with the chi-square.

Degrees of Freedom

The degrees of freedom for a contingency table can be computed as (number of rows - 1)(number of columns - 1). Thus, for a two-by-two table, (2-1)(2-1) = 1.

Visualize the Chi-Square Distribution

Locate the position of the obtained chi-square value of 3 in the chi-square distribution with one degree of freedom.

 

 

The associated probability is .08. Because this observed (exact) probability is larger than .05, we cannot reject the null hypothesis. In conclusion, there is no significant difference between the two rearing circumstances with respect to the proportion of cats that killed a rodent.
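The probability of .08 can be checked without a statistical table. The sketch below relies on the fact that a chi-square variate with one degree of freedom is the square of a standard normal variate, so its upper-tail probability equals erfc(sqrt(x / 2)); the function name is hypothetical, and the shortcut applies only to one degree of freedom.

```python
from math import erfc, sqrt


def chi_square_p_value_df1(chi_square):
    """Upper-tail probability of a chi-square variate with one degree of freedom."""
    return erfc(sqrt(chi_square / 2))


p_value = chi_square_p_value_df1(3.0)
print(round(p_value, 2))  # 0.08
```

Since .08 exceeds the conventional .05 level, the decision not to reject the null hypothesis follows directly.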

 

Strength of the Relationship

The square of the phi coefficient of correlation can be written in terms of the chi square as

$$\phi^2 = \frac{\chi^2}{n}$$

 

 

The phi and chi square coefficients indicate jointly the strength and the significance of a relationship. For the example, the strength of the relationship can be computed as 3 / 6, equal to .50. The phi correlation equals .71.

Report the Result

A two-way contingency table analysis was conducted to evaluate whether there was any relation between the rearing circumstances and the killing behavior. The two variables were the rearing circumstances with two levels (kittens reared together with a rat or with other kittens) and the killing behavior with two levels (the cat killed the rat or it did not). The rearing circumstances and the killing behavior (for this arbitrary example) were not significantly related, Pearson χ²(1, n = 6) = 3.00, p > .05, φ = .71.

 

The Family of Pearson's Coefficients of Correlation

The correlation between continuous variables is captured by the product-moment correlation coefficient. The correlation between a continuous and a binary variable can be computed by the point biserial correlation. The correlation between binary variables is conceptualized by the phi correlation coefficient.

All the coefficients of Pearson's family of product-moment coefficients are algebraically equivalent and give identical numerical results. There is no need for a computer program dedicated to compute the point biserial coefficient of correlation. Any computer program calculating the Pearson's product-moment coefficient of correlation will also correctly calculate the point biserial coefficient of correlation. Together with the phi coefficient of correlation, the point biserial belongs to the family of Pearson's coefficients of correlation.
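The claim that a general product-moment routine also yields the point biserial coefficient can be illustrated with a short sketch. The point biserial form used here, (M1 - M0) / s_Y * sqrt(pq) with s_Y computed with n in the denominator, is the standard textbook formula rather than one given in this chapter; the function names and the six scores are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation computed from raw scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    return cov / sqrt(var_x * var_y)


def point_biserial(x, y):
    """Point biserial correlation for binary x (0/1) and continuous y."""
    n = len(x)
    group1 = [b for a, b in zip(x, y) if a == 1]
    group0 = [b for a, b in zip(x, y) if a == 0]
    p = len(group1) / n
    mean1 = sum(group1) / len(group1)
    mean0 = sum(group0) / len(group0)
    mean_y = sum(y) / n
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    return (mean1 - mean0) / sd_y * sqrt(p * (1 - p))


# Hypothetical data: a binary grouping variable and a continuous score.
x = [0, 0, 0, 1, 1, 1]
y = [2.0, 3.0, 4.0, 5.0, 7.0, 9.0]

r = pearson_r(x, y)
r_pb = point_biserial(x, y)
print(round(r, 4), round(r_pb, 4))
```

Both routines return the same value, which is why a dedicated point biserial program is unnecessary.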

The importance of the point biserial as a labor-saving solution for obtaining a correlation coefficient in those special cases where one variable is binary and the other is continuous has diminished with the advent of computerization. However, the point biserial correlation is of considerable theoretical importance in that it provides a theoretical basis for the translation between proportions of variance accounted for by the coefficients of determination and alienation and the variances of their constituent variables.

Understanding the principles behind the computation of the point biserial coefficient is also essential for understanding the theory behind the tests of statistical significance. Also, the different renderings of the correlation coefficient are indispensable for understanding the relationship between correlational measures of relationships and tests of statistical significance. The phi coefficient of correlation is associated with the chi square test of significance. The point biserial coefficient is associated with the t-test.

 

Summary

The family of Pearson's coefficients of correlation can be summarized as

 

 

with all coefficients in the above table being different algebraic renderings of the product-moment coefficient, returning identical numerical values.