The guaranteed fulfillment of the linearity assumption for two groups, suggested previously, has considerable theoretical importance within the framework of the general linear model. Algebraically, this advantage is realized by restricting the predictor to a binary variable, i.e., a variable that defines two categories, coded one and zero. The coefficient of correlation between a binary categorical variable X and a continuous variable Y is called the point biserial coefficient.
The dichotomous nature of binary variables allows for the classification of both the X and Y variables into two categories, with separate ns, means, and variances. Consider the problem of computing a correlation coefficient between a binary variable X = [0 0 1 1 1], and a continuous variable Y = [1 2 3 4 5]. The framework for the development of a concise computational algorithm for this special case is outlined as
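As an illustration of the classification described above, the following minimal Python sketch splits the Y scores by X category and reports each category's n, mean, and variance. The function name `describe_by_category` is hypothetical, and the use of the population variance (dividing by n) is an assumption chosen to match the variance of 2 used for Y later in the text.

```python
from statistics import mean, pvariance

# Example data from the text: binary predictor X and continuous variable Y.
X = [0, 0, 1, 1, 1]
Y = [1, 2, 3, 4, 5]

def describe_by_category(x, y):
    """Return {category: (n, mean, population variance)} of the Y scores in each X category."""
    summary = {}
    for category in sorted(set(x)):
        scores = [yi for xi, yi in zip(x, y) if xi == category]
        # pvariance divides by n (an assumption, see the lead-in).
        summary[category] = (len(scores), mean(scores), pvariance(scores))
    return summary

group_stats = describe_by_category(X, Y)
for category, (n, m, var) in group_stats.items():
    print(f"X = {category}: n = {n}, mean = {m}, variance = {var}")
```

The two category means produced here (1.5 and 4) are the values used in the worked example later in the chapter.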
The concept of the point biserial is based on a rendering of the coefficient of correlation as a slope of a regression line. For each category of a binary predictor variable, the predicted score can be calculated by the regression on categories method as a mean of the scores in either 0 or 1 category. The slope of a regression line of the point biserial coefficient of correlation can be plotted as
The slope of the regression line, B, is the ratio of the opposite and adjacent legs of the triangle. Because the two categories are coded 1 and 0, the adjacent leg has a length of one, and the slope can be calculated by

$$B = \frac{M_1 - M_0}{1 - 0} = M_1 - M_0$$

where $M_1$ and $M_0$ signify the means of the Y scores corresponding to the 1 and 0 categories, respectively. The above equation would equal the point biserial coefficient of correlation if the slope were expressed in standard scores. However, the equation is rendered in obtained scores. To derive the formula for the point biserial coefficient of correlation, we must transform the above formula from obtained score form to standard score form.
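The claim that the slope is simply the difference between the two category means can be checked numerically. The sketch below compares that difference with an ordinary least-squares slope computed from the same data. Both function names are hypothetical, and the least-squares formula (sum of cross-products over the sum of squares of X) is a standard result swapped in for the comparison, not something taken from the text.

```python
from statistics import mean

X = [0, 0, 1, 1, 1]
Y = [1, 2, 3, 4, 5]

def mean_difference_slope(x, y):
    """Slope B as rise over run between the category means (run = 1 - 0)."""
    m1 = mean([yi for xi, yi in zip(x, y) if xi == 1])
    m0 = mean([yi for xi, yi in zip(x, y) if xi == 0])
    return (m1 - m0) / (1 - 0)

def least_squares_slope(x, y):
    """Least-squares slope: sum of cross-products divided by sum of squares of X."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx

B = mean_difference_slope(X, Y)
b_ols = least_squares_slope(X, Y)
print(B, b_ols)
```

For the example data both routes give a slope of 2.5 (up to floating-point rounding), which is the difference between the category means 4 and 1.5.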
To accomplish the necessary modifications of the equation for the slope of the regression line defined in the above section, let us start with the equation of a regression line in standard score form

$$z_{y'} = r\,z_x$$

This equation can be expressed in deviation score form, with $x$ and $y'$ denoting deviations from the respective means, as

$$\frac{y'}{\sigma_y} = r\,\frac{x}{\sigma_x}$$

and simplified to

$$y' = r\,\frac{\sigma_y}{\sigma_x}\,x$$

The above equation can be compared with the analytical equation of a line in deviation scores

$$y' = b\,x$$

Equating the slopes of the analytical and statistical equations of a line

$$b = r\,\frac{\sigma_y}{\sigma_x}$$

and multiplying both sides of this equation by the standard deviation of the predictor variable X, the following equation results

$$b\,\sigma_x = r\,\sigma_y$$

The coefficient of correlation, as isolated from the above equation, can be written as

$$r = \frac{b\,\sigma_x}{\sigma_y}$$

At this point we can substitute the slope of the regression line $B = M_1 - M_0$, as defined in the preceding section, for the slope of the regression line $b$ in the above equation, since the slopes of regression lines in obtained and deviation scores are identical. This results in the equation

$$r = \frac{(M_1 - M_0)\,\sigma_x}{\sigma_y}$$

We can replace the standard deviation of the predictor variable X, written in sigma notation, with the variance written in the 'pq' notation, that is, $\sigma_x = \sqrt{pq}$, where $p$ and $q$ are the proportions of cases in the 1 and 0 categories. The formula for the point biserial coefficient of correlation, as derived from Pearson's product-moment coefficient of correlation, is

$$r_{pb} = \frac{(M_1 - M_0)\sqrt{pq}}{\sigma_y}$$
The point biserial coefficient of correlation can also be written in the form of a coefficient of determination:

$$r_{pb}^2 = \frac{pq\,(M_1 - M_0)^2}{\sigma_y^2}$$
Let us reconsider the example introduced at the beginning of the chapter
The point biserial coefficient of determination can be computed as (3/5)(2/5)(4 − 1.5)²/2, which equals .75. This result can be verified by standardizing the variance of the predicted scores, 1.5/2, which indeed equals .75.
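The arithmetic above can be reproduced with a short Python sketch. It computes the coefficient of determination from the 'pq' form and then again as the ratio of the variance of the predicted scores to the variance of Y. The variable names are illustrative, and population variances (dividing by n) are assumed because they reproduce the values 1.5 and 2 quoted in the text.

```python
from statistics import mean, pvariance

X = [0, 0, 1, 1, 1]
Y = [1, 2, 3, 4, 5]

n = len(X)
p = sum(X) / n  # proportion of cases in category 1 (3/5)
q = 1 - p       # proportion of cases in category 0 (2/5)
m1 = mean([y for x, y in zip(X, Y) if x == 1])  # mean of Y in category 1
m0 = mean([y for x, y in zip(X, Y) if x == 0])  # mean of Y in category 0
var_y = pvariance(Y)                            # population variance of Y

# Route 1: the 'pq' form of the coefficient of determination.
r2_formula = p * q * (m1 - m0) ** 2 / var_y

# Route 2: variance of the predicted scores (category means) over variance of Y.
predicted = [m1 if x == 1 else m0 for x in X]
r2_predicted = pvariance(predicted) / var_y

print(r2_formula, r2_predicted)
```

Both routes agree at .75 up to floating-point rounding, which mirrors the verification described in the text.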
The preferred conceptualization of the point biserial coefficient of correlation is in its determination form, as
The values of the point biserial are numerically equivalent to those that would be obtained by the product-moment coefficient of correlation computed from the same data.
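This equivalence can be checked on the example data. The sketch below computes a product-moment correlation from the covariance and standard deviations, and a point biserial value from the category means and the proportions p and q. The function names `pearson_r` and `point_biserial_r` are hypothetical, and population standard deviations (dividing by n) are assumed throughout.

```python
from math import sqrt
from statistics import mean, pstdev

X = [0, 0, 1, 1, 1]
Y = [1, 2, 3, 4, 5]

def pearson_r(x, y):
    """Product-moment correlation: covariance over the product of standard deviations."""
    mx, my = mean(x), mean(y)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / len(x)
    return cov / (pstdev(x) * pstdev(y))

def point_biserial_r(x, y):
    """Point biserial form: (M1 - M0) * sqrt(pq) / sigma_y, with x coded 0/1."""
    p = sum(x) / len(x)
    q = 1 - p
    m1 = mean([yi for xi, yi in zip(x, y) if xi == 1])
    m0 = mean([yi for xi, yi in zip(x, y) if xi == 0])
    return (m1 - m0) * sqrt(p * q) / pstdev(y)

r_pm = pearson_r(X, Y)
r_pb = point_biserial_r(X, Y)
print(r_pm, r_pb)
```

The two values coincide up to floating-point rounding, and squaring either one recovers the coefficient of determination of .75 from the worked example.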
The point biserial correlation is conceptually important, as it helps to understand the main principles of the tests of statistical significance, especially how the coefficient of correlation can be used to measure a difference between two means.