Cruise Scientific: Visual Statistics Studio

Correlation: Assumptions and Limitations

The correct use of the coefficient of correlation depends heavily on the assumptions made about the nature of the data to be correlated and on an understanding of how this index of association is formed. Correlation is a central measure within the general linear model of statistics and can be used to measure relationships in countless applied settings. Where its assumptions are violated, however, correlation becomes inadequate to explain a given relationship. These assumptions require that the distributions of both variables related by the coefficient of correlation be normal and that their scatter-plot be linear and homoscedastic. Referring to diagrams of data typical of various magnitudes of the coefficient of correlation,


Positive Correlation

 


Zero Correlation

 


Negative Correlation

one may notice that the assumption of linearity pertains to the main axis of the ellipse enclosing the data points: this main axis should be approximately a straight line. The assumption of homoscedasticity pertains to the secondary axis of this ellipse: the scatter across the main axis should be approximately the same along its whole length. To the extent that any of these assumptions is violated, the coefficient of correlation does not correctly reflect the relationship.

The Assumption of Normality: The Case of the Pentadactyl Limbs

The assumption of normality requires that the distribution of each variable approximate the normal distribution and not be skewed in either the positive or the negative direction. Consider an applied setting in which a biologist specializing in comparative morphology counts the number of digits in the anterior (X) and posterior (Y) limbs of a group of vertebrates. The observations are tabulated as

Suppose that the biologist is interested in the theory that both the front and hind limbs of vertebrates developed from the pentadactyl limb (Gr. pentadaktylos; pente, five; daktylos, finger or toe) and should therefore have the same number of fingers and toes. Even though visual inspection of the above data indicates that the relationship between the number of fingers and toes for the tabulated vertebrates is perfect, the correlation coefficient does not confirm this observation. Using the formula for correlation computed at the level of the obtained scores, the coefficient for the data is (25 - 5(5))/(0(0)) = 0/0, which is undefined. This perhaps surprising outcome is the consequence of an extreme violation of the assumption of normality. Since all values in distributions X and Y are the same, both standard deviations are zero, and the assumption that the variables are distributed normally is not defensible. There is a one-to-one relationship between the number of digits in the anterior and posterior extremities of the group of vertebrates measured.

Due to the violation of the assumption of normality, however, Pearson's product-moment coefficient of correlation does not reflect this relationship. Some other relational index should be used.
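The 0/0 outcome can be reproduced with a minimal sketch. It assumes five vertebrates that each have five digits on both limbs; these values are illustrative (the original table is not reproduced here) and were chosen only to match the means of 5 and standard deviations of 0 that appear in the computation. The function name `pearson_r` is hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson's r from obtained scores:
    (mean(XY) - mean(X) * mean(Y)) / (sd_X * sd_Y).

    Uses population standard deviations (divide by N). Returns None when
    either variable has zero variance, because the ratio is then 0/0.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    mean_xy = sum(a * b for a, b in zip(x, y)) / n
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    if sd_x == 0 or sd_y == 0:
        return None  # undefined: the denominator is zero
    return (mean_xy - mean_x * mean_y) / (sd_x * sd_y)

# Illustrative data (assumed): every animal has five digits on each limb.
anterior = [5, 5, 5, 5, 5]
posterior = [5, 5, 5, 5, 5]

result = pearson_r(anterior, posterior)
print(result)  # None, i.e. the coefficient is undefined
```

Returning `None` rather than raising an error is a design choice of this sketch; it simply makes the undefined case explicit.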

The Assumption of Linearity: About the Anxiety of Fighter Pilots

An aviation psychologist is interested in the relationship between the number of practice landings on the deck of an aircraft carrier (X) and the anxiety experienced by the pilots as a result of such exercises (Y). Imaginary observations for this experiment are presented in the table below.

Using the formula for computation of correlation for obtained scores, the coefficient is [5,400 - 30(180)] / [14.14(74.83)] = (5,400 - 5,400) / 1,058 = 0 / 1,058 = .00.
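The arithmetic above can be retraced with a short sketch. The five data pairs are an assumption: the original table is not reproduced here, so the values were chosen to be consistent with the quantities in the computation (means of 30 and 180, standard deviations of 14.14 and 74.83, and a mean cross-product of 5,400). The variable names are likewise illustrative.

```python
import math

# Assumed data, chosen to match the summary values in the text.
landings = [10, 20, 30, 40, 50]        # X: number of practice landings
anxiety = [100, 200, 300, 200, 100]    # Y: anxiety score (rises, then falls)

n = len(landings)
mean_x = sum(landings) / n
mean_y = sum(anxiety) / n
mean_xy = sum(x * y for x, y in zip(landings, anxiety)) / n

# Population standard deviations (divide by N, not N - 1).
sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in landings) / n)
sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in anxiety) / n)

# Correlation from obtained scores.
r = (mean_xy - mean_x * mean_y) / (sd_x * sd_y)

print(f"mean XY = {mean_xy:.0f}, mean X = {mean_x:.0f}, mean Y = {mean_y:.0f}")
print(f"sd X = {sd_x:.2f}, sd Y = {sd_y:.2f}")
print(f"r = {r:.2f}")
```

With these assumed values the numerator is exactly zero, so the coefficient is zero even though anxiety clearly depends on the number of landings.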

The aviation psychologist entertained a theory that pilot anxiety should be moderate at first. As pilots realize the danger of landing a jet on the rocking runway of an aircraft carrier, their anxiety level should skyrocket, only to subside with prolonged practice. The data from the experiment matched the theory rather nicely. However, the coefficient of correlation turned out to be zero, indicating an absence of a relationship. Correlation did not reflect the relationship because it is not linear, as can be observed in the figure below.

 

Increased practice does not reduce anxiety in a linear fashion; anxiety initially increases and later decreases. Although the observations fit the theory, Pearson's product-moment coefficient of correlation is not the correct index to capture a nonlinear relationship.

The Assumption of Homoscedasticity: About Nickels and Dimes

Jobs of toll collectors on the Chicago turnpikes were short-lived. A group of industrial psychologists developed a test battery to select applicants who were likely to stay on the job. An ability test was one of the predictor variables. Scores on this ability test, A, and the length of stay on the job, L, are shown in the table below.

The coefficient of correlation for the above data equals .13; the corresponding coefficient of determination equals .02, so ability accounts for only 2% of the variance in length of stay. The industrial psychologists' hypothesis was that toll collectors who scored lower on the ability test had difficulty giving correct change, partly because nickels, being larger than dimes, suggest greater value.

 

When the obtained relationship was plotted, an interesting pattern emerged. The ability to give correct change was a good predictor of tenure as a toll collector only for persons scoring low on this scale. Beyond a threshold, however, this variable no longer mattered. The overall relationship, as depicted in the above diagram, is nonhomoscedastic. For a relationship to be homoscedastic, it should have the same (homo) scatter (scedasticity) throughout. In the above figure, the scatter in the 70 to 90 range approximates a line, whereas in the 100 to 120 range it approximates a circle.
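One way to make unequal scatter visible numerically is to compute the coefficient separately within each range of the predictor. The sketch below does this for hypothetical data: the (ability, tenure) pairs are invented for illustration, are not the original table, and do not reproduce the coefficient of .13 reported above. The names `pearson_r` and `r_for` are also hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson's r using population (divide-by-N) moments."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sd_x = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
    sd_y = math.sqrt(sum((b - my) ** 2 for b in y) / n)
    return cov / (sd_x * sd_y)

# Hypothetical (ability score, months on the job) pairs.
low = [(70, 2), (75, 4), (80, 6), (85, 8), (90, 10)]   # points fall on a line
high = [(100, 12), (120, 12), (110, 7), (110, 17)]     # points form a ring

def r_for(pairs):
    """Split (ability, tenure) pairs into two sequences and correlate them."""
    ability, tenure = zip(*pairs)
    return pearson_r(ability, tenure)

r_low = r_for(low)
r_high = r_for(high)
r_all = r_for(low + high)

print(f"r, ability 70-90:   {r_low:.2f}")
print(f"r, ability 100-120: {r_high:.2f}")
print(f"r, all applicants:  {r_all:.2f}")
```

In this sketch the two within-range coefficients differ sharply, which is what unequal scatter looks like in numbers; a single overall coefficient conceals that difference.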

Summary

The assumptions underlying the coefficient of correlation are those of linearity, normality, and homoscedasticity. These assumptions, or a subset of them, are shared by most methods of the general linear model of statistics.