Nonlinear Regression Analysis

The correlation coefficient can be used correctly only when the relationship between two variables is approximately linear. When the assumption of linearity is violated, Pearson's product-moment coefficient of correlation will underestimate the strength of the relationship. What then should be done in cases where the variables to be correlated are not linearly related? Let us see whether we can derive a coefficient of correlation that is independent of the assumption of linearity.

Absence of a Relationship

As a prolegomenon to our discussion, let us consider one of the two instances in which the assumption of linearity is irrelevant and cannot be violated. The first of these two exceptional cases arises when the predictor and criterion variables are not correlated. In this case, we may choose to ignore the predictor variable, as suggested by the question marks in the following table.

 

 

 

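The values of the predictor variable X, which correlates zero with the criterion variable Y, are marked with question marks to stress their irrelevance. The predicted scores are calculated using the equation

$$Y' = r_{XY} \frac{s_Y}{s_X} (X - \bar{X}) + \bar{Y}$$

as

$$Y' = 0 \cdot \frac{s_Y}{s_X} (X - \bar{X}) + \bar{Y}$$

which simplifies to

$$Y' = \bar{Y}$$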

As an example, consider the following data where the variables X and Y are not correlated.

 

 

 

 

Several points may be stressed for this special case. The above equation asserts that, in the absence of any relevant information, the best prediction of the outcome is the mean. The standard error of this prediction equals the standard deviation of the criterion variable. As illustrated, the deviation scores of a single variable can be conceptualized as error scores, which capture all the variance in the specification equation

$$s_Y^2 = s_{Y'}^2 + s_e^2$$

For the example, 3.2 = 0 + 3.2. This variance is the minimum possible. You may try substituting any other number for the arithmetic mean and computing the deviation scores by subtracting it. Squaring and averaging these hypothetical deviation scores will result in a variance larger than the one obtained with the arithmetic mean. This observation lends credence to the conclusion that the arithmetic mean is the optimal locus of the distribution of scores in the least squares sense.
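The claim that no other number yields a smaller average squared deviation than the arithmetic mean can be checked numerically. The following is a minimal Python sketch. The scores are illustrative values, not the data of the example above, and the helper name mean_squared_deviation is an assumption introduced here.

```python
from statistics import fmean

# Illustrative scores only; these are not the data from the chapter's example.
scores = [1, 2, 3, 4, 5]

def mean_squared_deviation(values, center):
    """Average squared deviation of the values from an arbitrary center."""
    return fmean([(v - center) ** 2 for v in values])

mean_y = fmean(scores)
variance_about_mean = mean_squared_deviation(scores, mean_y)

# Substitute other candidate centers for the mean and recompute.
candidates = [mean_y + step / 10 for step in range(-20, 21)]
spread = {c: mean_squared_deviation(scores, c) for c in candidates}
best_center = min(spread, key=spread.get)

print(f"mean = {mean_y}, variance about the mean = {variance_about_mean}")
print(f"center with the smallest average squared deviation = {best_center}")
```

Every candidate other than the mean produces a larger value, because the average squared deviation about any center c equals the variance plus the squared distance between c and the mean.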

The Eta Square Coefficient

If the values of variable X indicated categories, we could compute the category means, which constitute the predicted variable Y', directly. We could also disregard the overall relationship between X and Y. Consider data comprising three categories.

 

 

 

The above data set was plotted in the diagram that follows.

 

Regression analysis requires computing the means, variances, and correlation of variables X and Y. These values are substituted into the equation for the predicted scores

$$Y' = r_{XY} \frac{s_Y}{s_X} (X - \bar{X}) + \bar{Y}$$

Errors of the prediction are computed by the equation

$$e = Y - Y'$$

The coefficient of correlation between variables X and Y equals .71. The coefficient of determination is equal to .50. The standard deviations of variables X and Y are .82 and 1.15, respectively. The predicted scores can be computed as Y' = .71 (1.15 / .82) (X - 2) + 3, which simplifies to Y' = X + 1. The error scores were calculated by subtracting the predicted variable from the criterion variable.

Now compute the predicted scores by using the means of the Y values within each category of X. Compute the error scores by subtracting the predicted scores from the obtained scores Y. To complete the regression analysis, just compute the means and variances of the predicted and error scores. Computing the correlation between X and Y is not necessary.
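The two ways of obtaining predicted scores can be compared in a short Python sketch. The data below are hypothetical values chosen to reproduce the statistics quoted in the text (r = .71, standard deviations of .82 and 1.15, means of 2 and 3); they are not taken from the chapter's table. The variances divide by N, which is the convention these figures imply.

```python
from statistics import fmean

# Hypothetical data chosen to match the statistics quoted in the text;
# the chapter's own table is not reproduced here.
x = [1, 1, 1, 2, 2, 2, 3, 3, 3]
y = [1, 2, 3, 2, 3, 4, 3, 4, 5]

def variance(values):
    """Variance with N in the denominator."""
    m = fmean(values)
    return fmean([(v - m) ** 2 for v in values])

# Method 1: linear regression through the correlation coefficient.
mean_x, mean_y = fmean(x), fmean(y)
s_x, s_y = variance(x) ** 0.5, variance(y) ** 0.5
covariance = fmean([(a - mean_x) * (b - mean_y) for a, b in zip(x, y)])
r = covariance / (s_x * s_y)
pred_linear = [r * (s_y / s_x) * (xi - mean_x) + mean_y for xi in x]

# Method 2: the mean of Y within each category of X; no correlation needed.
groups = {}
for xi, yi in zip(x, y):
    groups.setdefault(xi, []).append(yi)
category_mean = {g: fmean(ys) for g, ys in groups.items()}
pred_means = [category_mean[xi] for xi in x]
errors = [yi - pi for yi, pi in zip(y, pred_means)]

print("r, s_x, s_y:", round(r, 2), round(s_x, 2), round(s_y, 2))
print("predicted (regression):", [round(p, 2) for p in pred_linear])
print("predicted (category means):", pred_means)
print("variance of predicted and error scores:",
      round(variance(pred_means), 2), round(variance(errors), 2))
```

For these linear data the two sets of predicted scores coincide, and the variances of the predicted and error scores add up to the variance of Y.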

Since this type of regression analysis does not involve computing the correlation coefficient, it should not depend on that coefficient's assumptions. Let us observe what happens if the data violate the assumption of linearity. In the table below, the predicted scores were computed by the equation

$$Y' = r_{XY} \frac{s_Y}{s_X} (X - \bar{X}) + \bar{Y}$$

and, since the correlation between X and Y is zero, equal the mean of variable Y.

 

 

 

The scores from the above table were plotted as

Compute the predicted scores by substituting the mean of each category. The ratio of the variance of the predicted variable (.22) to the variance of the criterion variable (.89) equals .25. Since we disregarded the assumption of linearity, we cannot call this ratio the coefficient of determination, so it needs a new name. This index is called the eta square ratio and is defined as

$$\eta^2 = \frac{s_{Y'}^2}{s_Y^2}$$

Eta square, also called the correlation ratio, is free of the assumption of linearity. In the case of linear relationships, the correlation ratio equals the coefficient of determination; for nonlinear relationships, it does not. Since the eta square ratio is independent of the linearity assumption and is the best solution in the least squares sense, it will always be greater than or equal to the coefficient of determination.
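A small Python sketch can illustrate the gap between the two indices for a curvilinear relationship. The U-shaped data below are hypothetical values chosen to reproduce the figures quoted in the text (zero correlation, predicted variance of .22, criterion variance of .89); the function names eta_squared and r_squared are assumptions introduced for illustration.

```python
from statistics import fmean

# Hypothetical U-shaped data chosen to match the figures quoted in the text.
x = [1, 1, 1, 2, 2, 2, 3, 3, 3]
y = [1, 2, 3, 0, 1, 2, 1, 2, 3]

def variance(values):
    """Variance with N in the denominator."""
    m = fmean(values)
    return fmean([(v - m) ** 2 for v in values])

def eta_squared(x, y):
    """Variance of the category means of Y divided by the variance of Y."""
    groups = {}
    for xi, yi in zip(x, y):
        groups.setdefault(xi, []).append(yi)
    predicted = [fmean(groups[xi]) for xi in x]
    return variance(predicted) / variance(y)

def r_squared(x, y):
    """Squared Pearson correlation (the coefficient of determination)."""
    mean_x, mean_y = fmean(x), fmean(y)
    covariance = fmean([(a - mean_x) * (b - mean_y) for a, b in zip(x, y)])
    return covariance ** 2 / (variance(x) * variance(y))

print("coefficient of determination:", round(r_squared(x, y), 2))
print("eta square:", round(eta_squared(x, y), 2))
```

Here the linear index misses the relationship entirely, while eta square still registers that a quarter of the criterion variance is associated with category membership.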

 

 

 

 

 

If the relationship is linear, eta square will equal the coefficient of determination. If the relationship is not linear, eta square will be greater than the coefficient of determination because, for nonlinear relationships, the coefficient of correlation is attenuated and underestimates the magnitude of the relationship.

Despite the advantage of being free of the linearity assumption, nonlinear regression analysis cannot be used routinely in lieu of linear regression analysis, since it depends heavily on a sufficient frequency of Y scores in every category. In the limiting case, when every unique score in X corresponds to a single score in Y, eta square will, unrealistically, always equal one.
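The limiting case is easy to demonstrate. In the Python sketch below, every value of X occurs only once, so each category mean is simply the single Y score in that category. The values are arbitrary illustrations, and eta_squared is a hypothetical helper name.

```python
from statistics import fmean

def variance(values):
    """Variance with N in the denominator."""
    m = fmean(values)
    return fmean([(v - m) ** 2 for v in values])

def eta_squared(x, y):
    """Variance of the category means of Y divided by the variance of Y."""
    groups = {}
    for xi, yi in zip(x, y):
        groups.setdefault(xi, []).append(yi)
    predicted = [fmean(groups[xi]) for xi in x]
    return variance(predicted) / variance(y)

# Arbitrary illustrative scores: one observation per category of X.
x = [1, 2, 3, 4, 5]
y = [2, 9, 4, 7, 1]

# The predicted scores reproduce Y exactly, so no error variance remains.
print("eta square:", eta_squared(x, y))
```

The result is one regardless of how the Y scores are arranged, which is why the index is uninformative when the categories are too sparsely populated.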

Irrelevance of the Assumption of Linearity

Aside from the absence of a relationship, discussed in the opening paragraphs of this chapter, there is another circumstance in which the assumption of linearity becomes irrelevant: one in which linearity ceases to be assumed and becomes a certainty. Consider the following example.

 

 

 

This example is based on the previous example of the nonlinear relationship. Combining the data into only two categories turns the relationship into a linear one, as shown below.

 

For the above data, the coefficient of correlation between variables X and Y can be computed as (4.33 - 2.33 (1.67)) / (.94 (.94)), and equals .50. The coefficient of correlation, together with the means and standard deviations of variables X and Y, is used to construct the prediction equation Y' = .50 (.94 / .94) (X - 2.33) + 1.67, which simplifies to Y' = .5X + .5. Predicted and error scores computed by either method are identical, as are the coefficient of determination (.25) and eta square (.22 / .89 = .25).

       The guaranteed fulfillment of the linearity assumption for the case of two groups, based on Euclid's postulate that a line is defined by two points, allows the use of the coefficient of determination and eta square interchangeably. This equivalence is the theoretical basis of several designs within the general linear model to be discussed in chapters to follow.
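The equivalence for two groups can be verified with a brief Python sketch. The data are hypothetical: they recode an assumed three-category data set into two categories (coded 1 and 3) so that the means, standard deviations, and correlation match the values quoted in the text. The actual coding used in the chapter's table is not known.

```python
from statistics import fmean

# Hypothetical two-category data chosen to match the quoted statistics
# (mean X = 2.33, mean Y = 1.67, mean XY = 4.33, both s = .94).
x = [3, 3, 3, 1, 1, 1, 3, 3, 3]
y = [1, 2, 3, 0, 1, 2, 1, 2, 3]

def variance(values):
    """Variance with N in the denominator."""
    m = fmean(values)
    return fmean([(v - m) ** 2 for v in values])

mean_x, mean_y = fmean(x), fmean(y)
mean_xy = fmean([a * b for a, b in zip(x, y)])
s_x, s_y = variance(x) ** 0.5, variance(y) ** 0.5

# Correlation from the raw-score formula used in the text.
r = (mean_xy - mean_x * mean_y) / (s_x * s_y)
slope = r * s_y / s_x
intercept = mean_y - slope * mean_x

# Predictions from the regression line and from the two group means.
pred_line = [slope * xi + intercept for xi in x]
group_mean = {g: fmean([yi for xi, yi in zip(x, y) if xi == g]) for g in set(x)}
pred_group = [group_mean[xi] for xi in x]

r_squared = r ** 2
eta_squared = variance(pred_group) / variance(y)
print("r, slope, intercept:", round(r, 2), round(slope, 2), round(intercept, 2))
print("r squared, eta square:", round(r_squared, 2), round(eta_squared, 2))
```

With only two categories the regression line necessarily passes through both group means, so the two sets of predictions, and hence the two indices, cannot differ.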

Summary

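Relationships between correlation and the correlation ratio depend on whether the relationship is linear or nonlinear.

                  Linear                        Curvilinear
Determination     $r^2 = \eta^2$                $r^2 < \eta^2$
Alienation        equal to its counterpart      greater than its counterpart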

When the relationship is linear or approximately linear, the coefficient of determination and the correlation ratio are equivalent measures. As the relationship departs from linearity, the coefficient of determination underestimates the degree of the relationship. Since determination and alienation are complementary measures, the coefficient of alienation is greater than its correlation ratio counterpart when the relationship is markedly nonlinear.