Categorization

The guaranteed fulfillment of the linearity assumption for two groups, suggested previously, has considerable theoretical importance within the framework of the general linear model. Algebraically, this theoretical advantage can be realized by restricting the variability of the predictor variable, that is, by defining it as a binary variable: a variable that defines two categories, signified by one and zero. The coefficient of correlation between a binary (categorical) variable X and a continuous variable Y is called the point biserial coefficient.

Conceptual Definition of the Point Biserial

The dichotomous nature of a binary variable X allows the Y scores to be classified into two categories, each with its own n, mean, and variance. Consider the problem of computing a correlation coefficient between a binary variable X = [0 0 1 1 1] and a continuous variable Y = [1 2 3 4 5]. The framework for the development of a concise computational algorithm for this special case is outlined as

Category   Y scores        n   Mean   Variance
X = 0      1, 2            2   1.5    0.25
X = 1      3, 4, 5         3   4.0    0.67
Total      1, 2, 3, 4, 5   5   3.0    2.00

(variances computed with n in the denominator)

The concept of the point biserial is based on a rendering of the coefficient of correlation as the slope of a regression line. For each category of a binary predictor variable, the predicted score obtained by conditional bivariate regression is the mean of the Y scores in that category (0 or 1). The regression line therefore passes through the two category means, the points (0, $M_0$) and (1, $M_1$). Together with the horizontal run from X = 0 to X = 1 and the vertical rise from $M_0$ to $M_1$, the line forms a right triangle.

The slope of the regression line, B, is the ratio of the opposite and adjacent legs of the triangle. This slope can be calculated by

$$B = \frac{M_1 - M_0}{1 - 0} = M_1 - M_0$$

where $M_1$ and $M_0$ signify the means of the Y scores in the 1 and 0 categories, respectively. The above equation would equal the point biserial coefficient of correlation if the slope were expressed in standard scores. However, the equation is rendered in obtained scores. To derive the formula for the point biserial coefficient of correlation, we must transform the above formula from obtained score form to standard score form.
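The claim that the regression slope for a binary predictor is simply the difference between the two category means can be checked numerically. The sketch below uses the chapter's example data; the function names `regression_slope` and `mean_difference` are illustrative and do not come from the text, and the slope is computed with the ordinary least-squares formula (sum of cross-products over sum of squares of X).

```python
from statistics import fmean

def regression_slope(x, y):
    """Least-squares slope of y on x in obtained scores."""
    mx, my = fmean(x), fmean(y)
    # Sum of cross-products of deviations, and sum of squared deviations of x.
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx

def mean_difference(x, y):
    """Difference M1 - M0 between the mean Y scores of categories 1 and 0."""
    y1 = [yi for xi, yi in zip(x, y) if xi == 1]
    y0 = [yi for xi, yi in zip(x, y) if xi == 0]
    return fmean(y1) - fmean(y0)

x = [0, 0, 1, 1, 1]
y = [1, 2, 3, 4, 5]

slope = regression_slope(x, y)
diff = mean_difference(x, y)
print(round(slope, 6), round(diff, 6))
```

For these data both quantities come out at 2.5 (up to floating-point rounding), that is, 4 minus 1.5.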

Derivation of the Point Biserial

To accomplish the necessary modifications of the equation for the slope of the regression line defined in the above section, let us start with the equation of a regression line in standard score form

$$\hat{z}_Y = r \, z_X$$

This equation can be expressed in deviation score form as

$$\frac{\hat{y}}{\sigma_Y} = r \, \frac{x}{\sigma_X}$$

and simplified to

$$\hat{y} = r \, \frac{\sigma_Y}{\sigma_X} \, x$$

The above equation can be compared with the analytical equation of a line in deviation scores

$$\hat{y} = b \, x$$

Equating the slopes of the analytical and statistical equations of a line

$$b = r \, \frac{\sigma_Y}{\sigma_X}$$

and multiplying both sides of this equation by the standard deviation of the predictor variable X, the following equation results

$$b \, \sigma_X = r \, \sigma_Y$$

The coefficient of correlation, as isolated from the above equation, can be written as

$$r = \frac{b \, \sigma_X}{\sigma_Y}$$

At this point we can substitute the slope of the regression line B, as defined in the preceding section, for the slope of the regression line b in the above equation, since the slopes of regression lines in obtained and deviation scores are identical. This results in the equation

$$r = \frac{(M_1 - M_0) \, \sigma_X}{\sigma_Y}$$

The Point Biserial in the PQ Notation

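We can replace the standard deviation of the predictor variable X, written in sigma notation, with the square root of its variance written in the 'pq' notation, where p is the proportion of cases in category 1 and q = 1 - p is the proportion in category 0, so that the variance of X equals pq. The formula for the point biserial coefficient of correlation, as derived from Pearson's product-moment coefficient of correlation, is

$$r_{pb} = \frac{(M_1 - M_0) \sqrt{pq}}{\sigma_Y}$$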

The point biserial coefficient of correlation can also be written in the form of a coefficient of determination

$$r_{pb}^2 = \frac{pq \, (M_1 - M_0)^2}{\sigma_Y^2}$$

Computation of the Point Biserial

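Let us reconsider the example introduced at the beginning of the chapter, X = [0 0 1 1 1] and Y = [1 2 3 4 5], for which

$$p = \tfrac{3}{5}, \quad q = \tfrac{2}{5}, \quad M_1 = 4, \quad M_0 = 1.5, \quad \sigma_Y^2 = 2$$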

The point biserial coefficient of determination can be computed as (3/5)(2/5)(4 - 1.5)²/2, which equals .75. This result can be verified by standardizing the variance of the predicted scores, that is, by dividing it by the variance of Y: 1.5/2, which indeed equals .75.
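The arithmetic above can be reproduced in a few lines. The following is a minimal sketch, assuming population variances (n in the denominator), which is what the value 2 for the variance of Y implies; the function names are illustrative and not part of the text. It computes the coefficient of determination three ways: from the pq formula, from the ratio of the variance of the predicted scores to the variance of Y, and from the product-moment correlation.

```python
from math import sqrt
from statistics import fmean, pvariance

def point_biserial_r2(x, y):
    """Coefficient of determination from the pq formula: pq(M1 - M0)^2 / var(Y)."""
    y1 = [yi for xi, yi in zip(x, y) if xi == 1]
    y0 = [yi for xi, yi in zip(x, y) if xi == 0]
    p = len(y1) / len(y)   # proportion of cases in category 1
    q = 1 - p              # proportion of cases in category 0
    return p * q * (fmean(y1) - fmean(y0)) ** 2 / pvariance(y)

def predicted_variance_ratio(x, y):
    """Variance of predicted scores (the category means) over the variance of Y."""
    m1 = fmean([yi for xi, yi in zip(x, y) if xi == 1])
    m0 = fmean([yi for xi, yi in zip(x, y) if xi == 0])
    predicted = [m1 if xi == 1 else m0 for xi in x]
    return pvariance(predicted) / pvariance(y)

def pearson_r(x, y):
    """Product-moment correlation computed from deviation scores."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)

x = [0, 0, 1, 1, 1]
y = [1, 2, 3, 4, 5]

r2_pq = point_biserial_r2(x, y)
r2_pred = predicted_variance_ratio(x, y)
r2_pearson = pearson_r(x, y) ** 2
print(round(r2_pq, 4), round(r2_pred, 4), round(r2_pearson, 4))
```

All three values agree at .75 for the example data, up to floating-point rounding.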

Summary

The preferred conceptualization of the point biserial coefficient of correlation is in its determination form, as

$$r_{pb}^2 = \frac{pq \, (M_1 - M_0)^2}{\sigma_Y^2}$$

The values of the point biserial are numerically equivalent to those that would be obtained by computing the product-moment coefficient of correlation from the same data. However, the point biserial provides a link between correlation methods and the statistical methods that estimate the probability that differences between two or more means are large enough to be statistically significant.
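One way to illustrate this link, offered here as a sketch rather than as the chapter's own development, is the standard identity relating a correlation to a t statistic with n - 2 degrees of freedom, t = r sqrt(n - 2) / sqrt(1 - r²). For a binary predictor this matches the pooled-variance t statistic for the difference between two independent means. The function names below are illustrative assumptions, and the example data are those used throughout the chapter.

```python
from math import sqrt
from statistics import fmean

def pearson_r(x, y):
    """Product-moment correlation computed from deviation scores."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)

def t_from_r(r, n):
    """t statistic implied by a correlation r based on n cases (n - 2 df)."""
    return r * sqrt(n - 2) / sqrt(1 - r ** 2)

def pooled_t(x, y):
    """Pooled-variance t statistic for the difference between two group means."""
    y1 = [yi for xi, yi in zip(x, y) if xi == 1]
    y0 = [yi for xi, yi in zip(x, y) if xi == 0]
    m1, m0 = fmean(y1), fmean(y0)
    # Within-group sums of squares, pooled over n1 + n0 - 2 degrees of freedom.
    ss1 = sum((v - m1) ** 2 for v in y1)
    ss0 = sum((v - m0) ** 2 for v in y0)
    pooled_var = (ss1 + ss0) / (len(y1) + len(y0) - 2)
    se = sqrt(pooled_var * (1 / len(y1) + 1 / len(y0)))
    return (m1 - m0) / se

x = [0, 0, 1, 1, 1]
y = [1, 2, 3, 4, 5]

t_corr = t_from_r(pearson_r(x, y), len(x))
t_means = pooled_t(x, y)
print(round(t_corr, 4), round(t_means, 4))
```

Both routes give a t of about 3 for the example data, so the significance test of the correlation and the test of the mean difference are the same test.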