Cruise Scientific Visual Statistics Studio
The correct use of the coefficient of
correlation depends heavily on the assumptions made with
respect to the nature of data to be correlated and on
understanding the principles of forming this index of
association. Correlation is a central measure within the
general linear model of statistics. It can be employed for
measurement of relationships in countless applied settings.
However, in situations where its assumptions are violated,
correlation becomes inadequate to explain a given
relationship. These assumptions require that the
distributions of both variables related by the coefficient
of correlation be normal and that the scatter plot be
linear and homoscedastic. In diagrams of data typical of
various magnitudes of the coefficient of correlation,
one may notice that the assumption of
linearity pertains to the main axis of the ellipse enclosing
the data points: the points should follow an approximately
straight line along this axis. The assumption of
homoscedasticity pertains to the secondary axis of this
ellipse: the scatter of the points, measured in the
direction of the secondary axis, should be approximately
the same all along the main axis. To the extent that any of
these assumptions is violated, the coefficient of
correlation does not correctly reflect the relationship.
The assumption of normality requires that the distribution of both variables approximates the normal distribution and is not skewed in either the positive or the negative direction. Consider an applied setting wherein a biologist specializing in comparative morphology counts the number of digits in the anterior (X) and posterior (Y) limbs of a group of vertebrates. The observations are tabulated as
Suppose that the biologist is interested in the theory that both the front and hind limbs of vertebrates developed from the pentadactyl limb (Gr. pentadaktylos; pente, five; daktylos, finger or toe) and should therefore have the same number of fingers and toes. Even though visual inspection of the above data indicates that the relationship between the number of fingers and toes for the tabulated vertebrates is perfect, the correlation coefficient does not confirm this observation. Using the formula for correlation computed at the level of the obtained scores, the coefficient for the data is computed as [25 - 5(5)] / [0(0)] = 0/0, which is undefined. This perhaps surprising outcome is the consequence of an extreme violation of the assumption of normality. Since all values in distributions X and Y are the same, both standard deviations are zero, and the assumption that the variables are distributed normally is not defensible. There is a one-to-one relationship between the number of digits in the anterior and posterior extremities of the group of vertebrates measured.

Due to the violation of the assumption of normality, however, Pearson's product-moment coefficient of correlation does not reflect this relationship. Some other relational index should be used.
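The 0/0 outcome can be reproduced with a minimal Python sketch of the obtained-score formula used above, (mean of XY minus the product of the means) divided by the product of the standard deviations. The function name `pearson_r` and the sample of six vertebrates are assumptions made for illustration; the original table is not reproduced here, and only the fact that every count equals five is taken from the text.

```python
import math

def pearson_r(xs, ys):
    """Pearson r from obtained scores:
    (mean(XY) - mean(X) * mean(Y)) / (SD_X * SD_Y)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    mean_xy = sum(x * y for x, y in zip(xs, ys)) / n
    # Population standard deviations (divide by n), as in the obtained-score formula.
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    denominator = sd_x * sd_y
    if denominator == 0:
        # At least one variable has no variance: the ratio is undefined.
        return None
    return (mean_xy - mean_x * mean_y) / denominator

# Hypothetical sample: six vertebrates, five digits on every limb.
anterior = [5, 5, 5, 5, 5, 5]
posterior = [5, 5, 5, 5, 5, 5]
print(pearson_r(anterior, posterior))  # None
```

Returning `None` rather than raising an error is a design choice for this sketch; it simply marks the coefficient as undefined when either variable is constant.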
An aviation psychologist is interested
in the relationship between the number of practice landings
(X) on the deck of an aircraft carrier and the anxiety (Y)
experienced by the pilots as a result of such exercises.
Imaginary observations for this experiment are presented in
the table below.

Using the formula for computation of correlation for obtained scores, the coefficient is [5,400 - 30(180)] / [14.14(74.83)] = (5,400 - 5,400) / 1,058 = 0 / 1,058 = .00.
The aviation psychologist entertained a theory that, initially, pilot anxiety should be moderate. As the pilots realize the danger of landing a jet on the rocking runway of an aircraft carrier, their anxiety level should skyrocket, only to be subdued by prolonged practice. The data from the experiment matched the theory rather nicely. However, the coefficient of correlation turned out to be zero, indicating the absence of a relationship. Correlation did not reflect the relationship because it is not linear, as can be observed in the figure below.

Increased practice does not reduce anxiety in a linear fashion; initially the anxiety increases, and later it decreases. Although the observations fit the theory, Pearson's product-moment coefficient of correlation is not the correct index to capture a nonlinear relationship.
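The zero coefficient can be checked with a short Python sketch. The scores below are an assumption: the original table is not available here, so the values were chosen to reproduce the summary figures quoted in the computation (means of 30 and 180, standard deviations of 14.14 and 74.83, and a mean cross-product of 5,400). The second coefficient, computed on the distance of X from its midpoint, is only an illustration that the data are highly systematic; it is not presented as the author's recommended index.

```python
import math

def pearson_r(xs, ys):
    """Pearson r from obtained scores, using population standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    mean_xy = sum(x * y for x, y in zip(xs, ys)) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return (mean_xy - mean_x * mean_y) / (sd_x * sd_y)

# Assumed scores, consistent with the summary figures quoted in the text.
landings = [10, 20, 30, 40, 50]       # X: number of practice landings
anxiety = [100, 200, 300, 200, 100]   # Y: anxiety rises, then falls

r_linear = pearson_r(landings, anxiety)
print(round(r_linear, 2))  # 0.0

# Illustration only: anxiety against the distance of X from its midpoint.
midpoint = sum(landings) / len(landings)
distance = [abs(x - midpoint) for x in landings]
r_folded = pearson_r(distance, anxiety)
print(round(r_folded, 2))  # -1.0
```

With these assumed scores, the rise and fall of anxiety cancel exactly in the linear coefficient, while the folded predictor shows that anxiety is perfectly predictable from how far a pilot is from the midpoint of practice.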
Jobs of toll collectors on the Chicago turnpikes were short-lived. A group of industrial psychologists developed a test battery to select applicants who were likely to stay on the job. An ability test was one of the predictor variables. Scores on this ability test, A, and the length of stay on the job, L, are shown in the table below.
The coefficient of correlation for the above data equals
.13; the corresponding coefficient of determination equals
.02, so the ability test accounts for only 2% of the
variance in length of stay. The industrial psychologists'
hypothesis was that toll collectors who scored lower on the
ability test had difficulties giving correct change, partly
because nickels, being larger than dimes, convey an
implication of greater value.
Plotting the obtained relationship revealed an interesting pattern. The ability to give correct change was a good predictor of tenure as a toll collector only for persons scoring low on this scale. After reaching a threshold, however, this variable no longer mattered. The overall relationship, as depicted in the above diagram, is nonhomoscedastic. For a relationship to be homoscedastic, it should have the same (homo) scatter (scedasticity) throughout. In the above figure, the scatter in the 70 to 90 range approximates a line, whereas in the 100 to 120 range it approximates a circle; the relationship is therefore nonhomoscedastic.
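One rough way to see unequal scatter is to compute the coefficient separately within each range of the predictor. The Python sketch below does this for the 70 to 90 and 100 to 120 bands named above. All scores are hypothetical, since the original table is not reproduced here, and they are not meant to recover the overall coefficient of .13; the helper names `pearson_r` and `band_correlation` are likewise assumptions for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson r from obtained scores, using population standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    mean_xy = sum(x * y for x, y in zip(xs, ys)) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return (mean_xy - mean_x * mean_y) / (sd_x * sd_y)

def band_correlation(pairs, low, high):
    """Correlation restricted to pairs whose ability score lies in [low, high]."""
    band = [(a, l) for a, l in pairs if low <= a <= high]
    ability = [a for a, _ in band]
    tenure = [l for _, l in band]
    return pearson_r(ability, tenure)

# Hypothetical (ability score A, length of stay L in months) pairs.
pairs = [
    (70, 2), (75, 4), (80, 6), (85, 8), (90, 10),   # low band: points on a line
    (100, 8), (110, 12), (120, 8), (110, 4),        # high band: circular scatter
]

r_low = band_correlation(pairs, 70, 90)
r_high = band_correlation(pairs, 100, 120)
print(round(r_low, 2), round(r_high, 2))  # 1.0 0.0
```

A marked difference between the two within-band coefficients, as in this made-up example, signals that the scatter is not the same throughout and that a single overall coefficient summarizes the relationship poorly.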
The assumptions underlying the coefficient of correlation are those of linearity, normality, and homoscedasticity. These assumptions, or a subset of them, are shared by most methods of the general linear model of statistics.