The concept of
correlation is based on Galton and Pearson's notion that there is a category
broader than causation, of which causation is only the limit. The world of Galton and
Pearson was the Newtonian world: orderly, categorizing, the world of Pope's
poems and exciting new scientific discoveries occurring in the physical
sciences. Galton and Pearson hoped to bring the quantitative rigor of the
physical sciences to the social sciences that were at that time dominated by
qualitative descriptions and philosophical speculations. The method of
correlation opened new vistas for quantitative social science. The notion of
causality as the sole explanatory principle of events was broadened to include
the notion of association between events. The expectation arose that these
quantified associations would be elaborated into nomological networks,
encompassing relationships between elements of complex systems. As the
association between events becomes stronger, the probability that those events
are also influenced by unknown factors lessens.
Hopes for a new, quantitative social
science were tempered by the observation that correlational studies place
rigorous demands on the researcher: the underlying assumptions must be correct,
the factors included in the study must be relevant, and the experimental
framework must be plausible.
Failures to meet these conditions frequently inspired critics of correlational
methods to conjure examples of patently erroneous conclusions based upon
correlations between superficially related events.
The
methodological issues associated with correlational analyses are quite complex
and hard to understand without an intimate knowledge of the technique itself.
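The formula for computing Pearson's product-moment coefficient of correlation
preserves the form of the coefficient of covariance in deviation scores,

$$ \mathrm{cov}_{xy} = \frac{\sum xy}{N} $$

and substitutes standard scores for the deviation scores:

$$ r_{xy} = \frac{\sum z_x z_y}{N} $$
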
This
definitional formula of the coefficient of correlation, together with the
formulae for the mean and the variance, is among the most important formulae of
the general linear model described thus far. It is a definitional formula in
the sense that it cannot be readily derived from some other, more basic
expression. Its form suggests the full name of this statistic: the
product-moment coefficient of correlation. All other renderings of the
coefficient of correlation can be algebraically derived from this basic form.
To gain insight into the concept of the coefficient of correlation and its
properties, let us consider an example of two rating scales, listed below.
I would like to be a librarian
I like poetry
Subject responses were recorded as
The question to be answered is whether the answers to these two questions are related.
Computation of the product-moment coefficient of correlation is outlined as
As displayed in
the tabular presentation of the computational example, the obtained scores X
and Y are translated to deviation scores x and y by subtracting their
respective means (3; 3). The variances are then computed (2; 2) by squaring
their deviation scores and computing their means. Taking the square roots of
both variances, their standard deviations (1.41; 1.41) are obtained. Dividing
the deviation scores by their corresponding standard deviations results in
standard scores zx and zy. By forming the product of the standard scores, summing
them (2.50) and computing the mean of this product, the coefficient of
correlation (.50) is obtained.
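
The standard-score computation described above can be sketched in a few lines of Python. The original table of responses is not reproduced in this text, so the scores below are hypothetical: they are an assumption, chosen only because they reproduce the reported summary values (means of 3, variances of 2, a sum of standard-score products of 2.50). The names `X`, `Y`, and `standard_scores` are illustrative as well.

```python
import math

# Hypothetical responses (NOT the original data): chosen so that both means
# are 3, both variances are 2, and the correlation comes out at .50.
X = [1, 2, 3, 4, 5]
Y = [3, 2, 1, 5, 4]


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def standard_scores(values):
    """Convert obtained scores to standard scores (variance divides by N)."""
    m = mean(values)
    deviations = [v - m for v in values]          # deviation scores
    variance = mean([d * d for d in deviations])  # mean of squared deviations
    sd = math.sqrt(variance)                      # standard deviation
    return [d / sd for d in deviations]


zx = standard_scores(X)
zy = standard_scores(Y)

# Definitional formula: the mean of the products of the standard scores.
products = [a * b for a, b in zip(zx, zy)]
sum_of_products = sum(products)
r_standard = mean(products)

print(round(sum_of_products, 2), round(r_standard, 2))
```

Note that the variance here divides by N rather than N - 1, which is what the reported values (variance 2, standard deviation 1.41) imply.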
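The definitional
formula for Pearson's product-moment coefficient of correlation can be translated
into a formula for the coefficient of correlation in deviation scores by
substituting

$$ z_x = \frac{x}{s_x} \quad \text{and} \quad z_y = \frac{y}{s_y}, $$

where $s_x$ and $s_y$ are the standard deviations of X and Y.
These substitutions result in a formula for the coefficient of correlation
expressed in deviation scores

$$ r_{xy} = \frac{\sum xy}{N s_x s_y} $$
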
The necessary
steps for computing the coefficient of correlation, using deviation scores, are
The
computational procedure, expressed in shorthand by the formula for computing
the coefficient of correlation using deviation scores and summarized in
the above table, can be explained verbally as follows. As a first step, the
means of the X and Y variables are computed, and the obtained scores are
transformed to deviation scores. The mean of the product of the deviation
scores is computed as 1.00. Division by the product of both standard deviations
(1.41 × 1.41 = 2) gives the value of the coefficient of correlation as .50. This value is
identical to the value obtained from the standard score formula.
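
As a rough illustration of the deviation-score route, the sketch below repeats the computation without forming standard scores. The data are hypothetical (an assumption, since the original response table is not available here) and were picked to match the reported intermediate values: a mean deviation-score product of 1.00 and standard deviations of 1.41.

```python
import math

# Hypothetical responses (NOT the original data), matching the reported
# means (3; 3), variances (2; 2) and correlation (.50).
X = [1, 2, 3, 4, 5]
Y = [3, 2, 1, 5, 4]
N = len(X)

# Step 1: means and deviation scores.
Mx = sum(X) / N
My = sum(Y) / N
x = [v - Mx for v in X]
y = [v - My for v in Y]

# Step 2: mean of the products of the deviation scores.
mean_xy = sum(a * b for a, b in zip(x, y)) / N

# Step 3: standard deviations (variance divides by N).
sx = math.sqrt(sum(a * a for a in x) / N)
sy = math.sqrt(sum(b * b for b in y) / N)

# Step 4: divide by the product of the standard deviations.
r_deviation = mean_xy / (sx * sy)

print(mean_xy, round(sx, 2), round(sy, 2), round(r_deviation, 2))
```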
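Changing the
deviation scores into obtained scores within the formula for the coefficient of
correlation in deviation scores, i.e., substituting $X - M_x$ and $Y - M_y$ for the deviation
scores x and y, gives

$$ r_{xy} = \frac{\sum (X - M_x)(Y - M_y)}{N s_x s_y} $$

The expression

$$ \frac{\sum (X - M_x)(Y - M_y)}{N} $$

can be expanded as

$$ \frac{\sum XY}{N} - \frac{\sum X M_y}{N} - \frac{\sum M_x Y}{N} + \frac{\sum M_x M_y}{N} $$

Since the mean
is a constant, the above expression can be written as

$$ \frac{\sum XY}{N} - M_y \frac{\sum X}{N} - M_x \frac{\sum Y}{N} + M_x M_y $$

and, because $\sum X / N = M_x$ and $\sum Y / N = M_y$, simplified
as

$$ r_{xy} = \frac{\dfrac{\sum XY}{N} - M_x M_y}{s_x s_y} $$
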
For the current
example, the computational operations outlined by the above formula are
Variances of
variables X and Y can be computed, for the variable X, as

$$ s_x^2 = \frac{\sum X^2}{N} - M_x^2 $$

For the example
11 - 9 = 2, and for the variable Y as

$$ s_y^2 = \frac{\sum Y^2}{N} - M_y^2 $$

For the example
the variance of Y also equals 2. Taking the square root, the standard deviation
of both variables is equal to 1.41. The coefficient of correlation is then
computed as (10 - (3)(3))/((1.41)(1.41)), which equals .50.
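
The obtained-score computation can be checked with a short sketch that never forms deviation scores. As before, the data are hypothetical (an assumption standing in for the missing response table); they were chosen so that the mean of XY is 10 and the means of X² and Y² are 11, as in the worked example.

```python
import math

# Hypothetical responses (NOT the original data), matching the reported
# intermediate values of the worked example.
X = [1, 2, 3, 4, 5]
Y = [3, 2, 1, 5, 4]
N = len(X)

Mx = sum(X) / N
My = sum(Y) / N

# Raw-score quantities: means of XY, X squared and Y squared.
mean_XY = sum(a * b for a, b in zip(X, Y)) / N
mean_X2 = sum(a * a for a in X) / N
mean_Y2 = sum(b * b for b in Y) / N

# Variances from obtained scores: mean of squares minus squared mean.
var_x = mean_X2 - Mx ** 2
var_y = mean_Y2 - My ** 2

# Correlation from obtained scores.
r_raw = (mean_XY - Mx * My) / (math.sqrt(var_x) * math.sqrt(var_y))

print(mean_XY, mean_X2, var_x, var_y, round(r_raw, 2))
```

All three routes (standard scores, deviation scores, obtained scores) are algebraically equivalent, so any difference between them in practice would come only from floating-point rounding.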
Let us consider
jointly the deviation score formulae for the correlation and covariance
coefficients. Since the correlation in deviation scores equals

$$ r_{xy} = \frac{\sum xy}{N s_x s_y} $$

and the
covariance (also expressed in deviation scores) equals

$$ \mathrm{cov}_{xy} = \frac{\sum xy}{N} $$

the correlation
coefficient can be expressed as the standardized covariance

$$ r_{xy} = \frac{\mathrm{cov}_{xy}}{s_x s_y} $$

From the above
expression, the coefficient of covariance may be expressed as

$$ \mathrm{cov}_{xy} = r_{xy} s_x s_y $$

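The above term, $r_{xy} s_x s_y$,
is often part of statistical formulae, notably of formulae for computing
the variance of a sum and of a difference. Thus the formula for the variance of a
sum, introduced in the previous chapter,

$$ s_{x+y}^2 = s_x^2 + s_y^2 + 2\,\mathrm{cov}_{xy} $$

can be written
as an expression containing the coefficient of correlation,

$$ s_{x+y}^2 = s_x^2 + s_y^2 + 2\, r_{xy} s_x s_y $$

The formula for
the variance of a difference

$$ s_{x-y}^2 = s_x^2 + s_y^2 - 2\,\mathrm{cov}_{xy} $$

can be written
as

$$ s_{x-y}^2 = s_x^2 + s_y^2 - 2\, r_{xy} s_x s_y $$
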
These
alternative expressions of the variance of sums and differences are frequently
encountered in the discussion of properties of methods for statistical analysis
of data.
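
A quick numerical check may make these identities more concrete. The sketch below compares the variance of X + Y and of X - Y computed directly with the values given by the correlation-based expressions. The data are hypothetical (an assumption, not the original responses), and `statistics.pvariance` is used because the chapter's variances divide by N.

```python
import math
import statistics

# Hypothetical responses (NOT the original data).
X = [1, 2, 3, 4, 5]
Y = [3, 2, 1, 5, 4]
N = len(X)

var_x = statistics.pvariance(X)  # population variance (divides by N)
var_y = statistics.pvariance(Y)
sx, sy = math.sqrt(var_x), math.sqrt(var_y)

# Covariance in deviation scores, then correlation as standardized covariance.
Mx, My = statistics.fmean(X), statistics.fmean(Y)
cov_xy = sum((a - Mx) * (b - My) for a, b in zip(X, Y)) / N
r_xy = cov_xy / (sx * sy)

# Direct computation on the summed and differenced scores.
var_sum_direct = statistics.pvariance([a + b for a, b in zip(X, Y)])
var_diff_direct = statistics.pvariance([a - b for a, b in zip(X, Y)])

# The same quantities from the correlation-based expressions.
var_sum_formula = var_x + var_y + 2 * r_xy * sx * sy
var_diff_formula = var_x + var_y - 2 * r_xy * sx * sy

print(var_sum_direct, var_sum_formula)
print(var_diff_direct, var_diff_formula)
```

With these scores the covariance is 1 and the correlation is .50, so the sum has variance 2 + 2 + 2 = 6 and the difference has variance 2 + 2 - 2 = 2.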
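The formulae for
the coefficient of correlation and formulae for translations between covariance
and correlation formulae are summarized as

| Obtained Scores | Deviation Scores | Standard Scores |
| --- | --- | --- |
| $r_{xy} = \dfrac{\frac{\sum XY}{N} - M_x M_y}{s_x s_y}$ | $r_{xy} = \dfrac{\sum xy}{N s_x s_y}$ | $r_{xy} = \dfrac{\sum z_x z_y}{N}$ |
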
These formulae
are the basic building blocks of the general linear model, capturing the
quantitative aspects of relationships between variables. The formulae capturing
the relationship between covariance and correlation are shown below.
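
| | Covariance | Correlation |
| --- | --- | --- |
| Covariance | $\mathrm{cov}_{xy} = \dfrac{\sum xy}{N}$ | $\mathrm{cov}_{xy} = r_{xy} s_x s_y$ |
| Correlation | $r_{xy} = \dfrac{\mathrm{cov}_{xy}}{s_x s_y}$ | $r_{xy} = \dfrac{\sum z_x z_y}{N}$ |
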
Covariance and
correlation are fundamental tools of statistical analysis, the principal
building blocks of the general linear model. They are used in the course of
theory development as well as in applied computations. Of these two
indices, correlation is the more frequently used. The additional properties of the
coefficient of correlation will be discussed in detail in the chapters to
follow.