Regression on Multiple Categories

 

The correlation coefficient can be used correctly only when the relationship between two variables is approximately linear. When the assumption of linearity is violated, Pearson's product-moment coefficient of correlation will underestimate the strength of the relationship. What, then, should be done when the variables to be correlated are not linearly related? Let us see whether we can derive a coefficient of correlation that is independent of the assumption of linearity.

 

Absence of a Relationship

As a prolegomenon to our discussion, let us consider one of the two instances in which the assumption of linearity is irrelevant and cannot be violated. The first of these two exceptional cases arises when the predictor and criterion variables are not correlated. In this case, we may choose to ignore the predictor variable, as suggested by the question marks in the following table

 

 

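The values of the predictor variable X, which correlates zero with the criterion variable Y, are marked with question marks to stress their irrelevance. The predicted scores are calculated using the equation

Y' = r (s_Y / s_X) (X - M_X) + M_Y

where Y' is the predicted score, r is the correlation between X and Y, s_X and s_Y are the standard deviations, and M_X and M_Y are the means. With r = 0 this becomes

Y' = 0 (s_Y / s_X) (X - M_X) + M_Y

which simplifies to

Y' = M_Y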

 

As an example, consider the following data where the variables X and Y are not correlated.

 

 

Mean and Standard Error of Prediction

Several points may be stressed for this special case. The above equation asserts that in the absence of any relevant information, the best prediction of the outcome is the mean.

The standard error of this prediction equals the standard deviation of the criterion variable Y. As illustrated, the deviation scores of a single variable can be conceptualized as error scores, which capture all the variance in the specification equation

variance of Y = variance of predicted scores + variance of error scores

For the example, 3.2 = 0 + 3.2.

This variance is the minimum possible. You may substitute any other number for the arithmetic mean and compute the deviation scores by subtracting that number from each score. Squaring and averaging these hypothetical deviation scores will always yield a value larger than the variance obtained with the arithmetic mean. The arithmetic mean is therefore the optimal location of a distribution of scores in the least squares sense.
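The substitution experiment described above can be tried directly. The following is a minimal sketch; the five scores and the helper name mean_squared_deviation are assumptions made for illustration and are not the data of the chapter's example.

```python
from statistics import fmean

def mean_squared_deviation(scores, center):
    """Average squared deviation of the scores from an arbitrary center."""
    return fmean((y - center) ** 2 for y in scores)

# Hypothetical scores, chosen only for illustration (not the chapter's data).
scores = [1, 2, 3, 4, 5]
mean = fmean(scores)

# Try the arithmetic mean and a few other candidate centers.
candidates = [mean - 1, mean - 0.5, mean, mean + 0.5, mean + 1]
results = {c: mean_squared_deviation(scores, c) for c in candidates}

for center, msd in results.items():
    print(f"center = {center:.1f}  mean squared deviation = {msd:.2f}")

# The candidate with the smallest mean squared deviation.
best_center = min(results, key=results.get)
print("best center:", best_center)
```

Among the candidates tried, the smallest value is obtained at the arithmetic mean. This follows from the identity that the mean squared deviation about any number c equals the variance plus the squared distance between c and the mean.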

 

Eta Square

If the values of variable X indicate categories, each category contains scores on variable Y from which a mean can be computed. We can then use these category means as the predicted scores for each category.

Linear Relationship

Consider data comprising three categories

 

 

together with the scatterplot.

 

 

Note that the relationship between the two variables is linear.

Bivariate Regression Analysis

Regression analysis requires computing the means, standard deviations, and correlation of variables X and Y. These values are substituted into the equation for the predicted scores

Y' = r (s_Y / s_X) (X - M_X) + M_Y

Errors of the prediction are computed by the equation

E = Y - Y'

The coefficient of correlation between variables X and Y equals .71, so the coefficient of determination equals .50. The standard deviations of variables X and Y are .82 and 1.15, and their means are 2 and 3, respectively. The predicted scores can be computed as .71 (1.15 / .82) (X - 2) + 3, which simplifies to X + 1. The error scores were calculated by subtracting the predicted scores from the criterion scores. The results of the regression analysis are shown below
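The computation described above can be sketched in Python. This is a minimal sketch under two assumptions: that the nine scores are those implied by the category means worked out in this chapter (Y = 1, 2, 3 for X = 1; Y = 2, 3, 4 for X = 2; Y = 3, 4, 5 for X = 3), and that the standard deviations use the divide-by-N form, which reproduces the values quoted in the text. All variable names are illustrative.

```python
from statistics import fmean, pstdev, pvariance

# Assumed data: the nine (X, Y) pairs implied by the chapter's category means.
x = [1, 1, 1, 2, 2, 2, 3, 3, 3]
y = [1, 2, 3, 2, 3, 4, 3, 4, 5]

mean_x, mean_y = fmean(x), fmean(y)
sd_x, sd_y = pstdev(x), pstdev(y)  # divide-by-N standard deviations

# Pearson correlation: mean cross-product of deviations over the SD product.
covariance = fmean((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
r = covariance / (sd_x * sd_y)

# Predicted scores Y' = r (s_Y / s_X) (X - M_X) + M_Y and errors E = Y - Y'.
slope = r * sd_y / sd_x
predicted = [slope * (xi - mean_x) + mean_y for xi in x]
errors = [yi - pi for yi, pi in zip(y, predicted)]

print(f"r = {r:.2f}, r squared = {r * r:.2f}")
print(f"SD of X = {sd_x:.2f}, SD of Y = {sd_y:.2f}")
print("predicted:", [round(p, 2) for p in predicted])
print("errors:   ", [round(e, 2) for e in errors])
print(f"variance of predicted = {pvariance(predicted):.2f}, "
      f"variance of errors = {pvariance(errors):.2f}")
```

Under these assumptions the predicted scores agree with X + 1, and the variances of the predicted and error scores add up to the variance of Y.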

  

 

The scores from the above table are plotted as



 

Regression on Categories Solution

Now, compute the predicted scores by using the means of the values of Y corresponding to each category signified by X. For X=1, (1+2+3)/3=2. For X=2, (2+3+4)/3=3. For X=3, (3+4+5)/3=4. Next, compute the error scores by subtracting the predicted scores from the obtained scores Y. To complete the regression analysis, compute the means and variances of the predicted and error scores. Computing the correlation between X and Y is not necessary. The results of this solution are shown below

 

 

The scores from the above table are plotted as

 

 

Since this type of regression analysis does not involve computation of the correlation coefficient, it does not depend on the assumptions of that coefficient.

Note that in the case of linear relationships, results from both regression methods are the same.
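The agreement of the two methods can be checked with a short sketch. It assumes the nine scores listed in the category means above and compares the category means with the simplified regression equation X + 1 given earlier; the variable names are illustrative.

```python
from collections import defaultdict
from statistics import fmean

# Assumed data: the nine (X, Y) pairs implied by the chapter's category means.
x = [1, 1, 1, 2, 2, 2, 3, 3, 3]
y = [1, 2, 3, 2, 3, 4, 3, 4, 5]

# Collect the Y scores that fall into each X category.
groups = defaultdict(list)
for xi, yi in zip(x, y):
    groups[xi].append(yi)

# Regression on categories: the prediction is the mean of Y in each category.
category_means = {cat: fmean(vals) for cat, vals in groups.items()}
predicted_by_category = [category_means[xi] for xi in x]

# Bivariate regression: the chapter's simplified equation Y' = X + 1.
predicted_by_line = [xi + 1 for xi in x]

print("category means:", category_means)
print("same predictions:", predicted_by_category == predicted_by_line)
```

Because the category means fall exactly on the regression line, the error scores of the two methods coincide as well.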

Nonlinear Relationship

Let us observe what happens if the data violate the assumption of linearity. Consider data comprising three categories

 

together with the scatterplot.

 

 

Note that the relationship between the two variables is not linear.

 

Bivariate Regression Analysis

The predicted scores were computed by the following equation.

Y' = r (s_Y / s_X) (X - M_X) + M_Y

Since the correlation between X and Y is zero, this reduces to

Y' = M_Y

so the predicted scores equal the mean of the variable Y.

 

 

The scores from the above table were plotted as

 

 

Note that the correlation coefficient did not reflect this relationship because the relationship is not linear.
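A zero correlation in the presence of a clear nonlinear pattern can be illustrated with a small sketch. The data below are hypothetical: they are an assumption chosen so that the middle category has a higher mean than the two outer ones, and they are not taken from the chapter's table.

```python
from statistics import fmean, pstdev

# Hypothetical data (an assumption): Y rises from X = 1 to X = 2, then falls again.
x = [1, 1, 1, 2, 2, 2, 3, 3, 3]
y = [1, 2, 3, 2, 3, 4, 1, 2, 3]

mean_x, mean_y = fmean(x), fmean(y)
sd_x, sd_y = pstdev(x), pstdev(y)  # divide-by-N standard deviations

# Pearson correlation: mean cross-product of deviations over the SD product.
covariance = fmean((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
r = covariance / (sd_x * sd_y)

# With r = 0 the regression equation predicts the mean of Y for every X.
slope = r * sd_y / sd_x
predicted = [slope * (xi - mean_x) + mean_y for xi in x]

print(f"r = {r:.2f}")
print("predicted:", [round(p, 2) for p in predicted])
print(f"mean of Y = {mean_y:.2f}")
```

For these assumed scores the positive and negative cross-products cancel, so the linear prediction is the same for every category even though the category means differ.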

 

Regression on Categories

Compute the predicted scores by substituting the mean of Y in each category for the scores in that category.

 

 

The scores from the above table were plotted as

 

Eta Square

Compute the variances of the criterion variable, the predicted variable, and the error variable. 

 

 

The ratio of the variance of the predicted variable (.22) to the variance of the criterion variable (.89) equals .25. Since we disregarded the assumption of linearity, we cannot call this ratio the coefficient of determination, and we have to give it a new name. This new index is called the eta square ratio and is defined as

eta square = variance of predicted scores / variance of criterion scores

Eta square, also called the correlation ratio, is free of the assumption of linearity.
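The variance ratio can be sketched as follows. The nine scores are hypothetical: they are an assumption, chosen because they reproduce the variances quoted above (.89 and .22), and may differ from the chapter's actual table. The function name category_predictions is likewise illustrative.

```python
from collections import defaultdict
from statistics import fmean, pvariance

def category_predictions(x, y):
    """Predict each Y score by the mean of Y within its X category."""
    groups = defaultdict(list)
    for xi, yi in zip(x, y):
        groups[xi].append(yi)
    means = {cat: fmean(vals) for cat, vals in groups.items()}
    return [means[xi] for xi in x]

# Hypothetical data (an assumption) consistent with the variances in the text.
x = [1, 1, 1, 2, 2, 2, 3, 3, 3]
y = [1, 2, 3, 2, 3, 4, 1, 2, 3]

predicted = category_predictions(x, y)
errors = [yi - pi for yi, pi in zip(y, predicted)]

# Divide-by-N variances of the criterion, predicted, and error scores.
var_y = pvariance(y)
var_predicted = pvariance(predicted)
var_error = pvariance(errors)

# Eta square: share of the criterion variance carried by the category means.
eta_square = var_predicted / var_y

print(f"variance of Y = {var_y:.2f}")
print(f"variance of predicted = {var_predicted:.2f}")
print(f"variance of errors = {var_error:.2f}")
print(f"eta square = {eta_square:.2f}")
```

In this sketch the predicted and error variances add up to the variance of Y, so eta square can equally be read as one minus the proportion of error variance.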

In the case of linear relationships, the correlation ratio equals the coefficient of determination. For nonlinear relationships, the correlation ratio is not equal to the coefficient of determination. Since the eta square ratio is independent of the linearity assumption, and the category means are the best predictions in the least squares sense, it will always be greater than or equal to the coefficient of determination.

Advantages and Disadvantages

In sum, if the relationship is linear, eta square will equal the coefficient of determination. If the relationship is not linear, eta square will be greater than the coefficient of determination because, for non-linear relationships, the coefficient of correlation is attenuated and underestimates the magnitude of the relationship.

Despite the advantage of being free of the linearity assumption, regression on categories cannot be used routinely in lieu of linear regression analysis, since it depends heavily on having sufficient frequencies of Y scores in every category. In the limiting case, when every unique score in X corresponds to a single score in Y, eta square will, unrealistically, always be equal to one.
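The limiting case can be seen in a short sketch. The data and the function name eta_square are assumptions made for illustration: every X value occurs exactly once, so each category holds a single Y score.

```python
from collections import defaultdict
from statistics import fmean, pvariance

def eta_square(x, y):
    """Variance of the category means of Y divided by the variance of Y."""
    groups = defaultdict(list)
    for xi, yi in zip(x, y):
        groups[xi].append(yi)
    means = {cat: fmean(vals) for cat, vals in groups.items()}
    predicted = [means[xi] for xi in x]
    return pvariance(predicted) / pvariance(y)

# Hypothetical data: each X value is unique, so each category has one score.
x = [1, 2, 3, 4, 5]
y = [4, 1, 5, 2, 2]

# Each category mean is the single score itself, so no error is left over.
print("eta square:", eta_square(x, y))

# Pairing the same Y scores with X in reverse order gives the same result.
print("eta square, Y reversed:", eta_square(x, y[::-1]))
```

Because a category with one score predicts that score perfectly, the result says nothing about the strength of the relationship, which is why sufficient frequencies per category are needed.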

 

Irrelevance of the Linearity Assumption in the Case of Two Categories

The absence of a relationship, discussed in the opening paragraphs of this chapter, is one circumstance that makes the assumption of linearity irrelevant. There is another, in which linearity ceases to be an assumption and becomes a certainty. Consider the following example.

 

This example is based on the previous example of the non-linear relationship. By combining data into only two categories, the relationship turns into a linear one, as shown below

 

For the above data, predicted and error scores computed by either method are identical. The coefficient of determination is equal to eta square.
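The equivalence for two categories can be checked with a sketch. The data are hypothetical: they assume that the two outer categories of a three-category nonlinear example were merged into one category coded 1, with the remaining category coded 2. The chapter's actual grouping may differ.

```python
from collections import defaultdict
from statistics import fmean, pstdev, pvariance

# Hypothetical data (an assumption): two categories coded 1 and 2.
x = [1, 1, 1, 1, 1, 1, 2, 2, 2]
y = [1, 2, 3, 1, 2, 3, 2, 3, 4]

# Bivariate regression: Y' = r (s_Y / s_X) (X - M_X) + M_Y.
mean_x, mean_y = fmean(x), fmean(y)
sd_x, sd_y = pstdev(x), pstdev(y)  # divide-by-N standard deviations
covariance = fmean((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
r = covariance / (sd_x * sd_y)
predicted_by_line = [r * sd_y / sd_x * (xi - mean_x) + mean_y for xi in x]

# Regression on categories: predict each score by its category mean.
groups = defaultdict(list)
for xi, yi in zip(x, y):
    groups[xi].append(yi)
category_means = {cat: fmean(vals) for cat, vals in groups.items()}
predicted_by_category = [category_means[xi] for xi in x]

r_square = r * r
eta_square = pvariance(predicted_by_category) / pvariance(y)

print(f"coefficient of determination = {r_square:.2f}")
print(f"eta square = {eta_square:.2f}")
```

With only two categories the regression line passes through both category means, so the two sets of predicted scores agree up to rounding error.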

The guaranteed fulfillment of the linearity assumption for the case of two groups, based on Euclid's postulate that a line is defined by two points, allows the use of the coefficient of determination and eta square interchangeably. This equivalence is the theoretical basis of several designs within the general linear model to be discussed in chapters to follow.

 

Summary

Relationships between correlation and the correlation ratio depend on whether the relationship is linear or non-linear.

 

Linear

Curvilinear

 

 

 

 

When the relationship is linear or approximately linear, the coefficient of determination and the correlation ratio are equivalent measures. As the relationship departs from linearity, the coefficient of determination underestimates the degree of the relationship. Since determination and alienation are complementary measures, the coefficient of alienation is greater than its correlation ratio counterpart when the relationship is markedly nonlinear.