Correlation Analysis

The concept of correlation is based on Galton and Pearson's notion that there is a category beyond causation, of which causation is only a limit. The world of Galton and Pearson was the Newtonian world: orderly and categorizing, the world of Pope's poems and of exciting new discoveries in the physical sciences. Galton and Pearson hoped to bring the quantitative rigor of the physical sciences to the social sciences, which were at that time dominated by qualitative descriptions and philosophical speculations. The method of correlation opened new vistas for quantitative social science. The notion of causality as the sole explanatory principle of events was broadened to include the notion of association between events. The expectation arose that these quantified associations would be elaborated into nomological networks, encompassing relationships between elements of complex systems. As the association between events becomes stronger, the proportion of their variation left to unknown factors lessens.

Hopes for a new, quantitative social science were tempered by the observation that correlational studies make rigorous demands: the underlying assumptions must be correct, the factors included in the study must be relevant, and the experimental framework must be plausible. Failures to meet these conditions frequently inspired critics of correlational methods to conjure examples of patently erroneous conclusions based upon correlation between superficially related events.

Pearson Product-Moment Coefficient of Correlation

The methodological issues associated with correlational analyses are complex and hard to understand without an intimate knowledge of the technique itself. The formula for computing Pearson's product-moment coefficient of correlation preserves the form of the coefficient of covariance, where x and y are deviation scores and N is the number of subjects,

$$c_{xy} = \frac{\sum xy}{N}$$

and substitutes standard scores for deviation scores

$$r_{xy} = \frac{\sum z_x z_y}{N}$$

This definitional formula of the coefficient of correlation, together with the formulae for the mean and the variance, is among the most important formulae of the general linear model described thus far. It is a definitional formula in the sense that it cannot be readily derived from some other, more basic expression. Its form suggests the full name of this statistic: the product-moment coefficient of correlation. All other renderings of the coefficient of correlation can be algebraically derived from this basic form. To gain insight into the concept of the coefficient of correlation and its properties, let us consider an example of two rating scales, listed below.

 

I would like to be a librarian

 

 

I like poetry

 

 

Subject responses were recorded as

 

 

 

The question to be answered is whether the answers to these two questions are related. Computation of the product-moment coefficient of correlation is outlined as

 

 

 

As displayed in the tabular presentation of the computational example, the obtained scores X and Y are translated to deviation scores x and y by subtracting their respective means (3; 3). The variances are then computed (2; 2) by squaring their deviation scores and computing their means. Taking the square roots of both variances, their standard deviations (1.41; 1.41) are obtained. Dividing the deviation scores by their corresponding standard deviations results in standard scores zx and zy. By forming the product of the standard scores, summing them (2.50) and computing the mean of this product, the coefficient of correlation (.50) is obtained.
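The standard-score computation described above can be sketched in a few lines of Python. The ratings used here are hypothetical: they are not the original data table, and were chosen only so that the means (3; 3), variances (2; 2), sum of products (2.50) and coefficient (.50) agree with the values quoted in the text. The helper names `mean` and `standard_scores` are likewise illustrative.

```python
import math

# Hypothetical ratings (assumption, not the original data): picked so the
# summary values match the text (means 3, variances 2, r = .50).
X = [1, 2, 3, 4, 5]
Y = [1, 3, 5, 2, 4]
N = len(X)


def mean(values):
    return sum(values) / len(values)


def standard_scores(values):
    """Convert obtained scores to z scores, dividing by N (not N - 1)."""
    m = mean(values)
    deviations = [v - m for v in values]          # deviation scores
    variance = mean([d * d for d in deviations])  # mean squared deviation
    sd = math.sqrt(variance)
    return [d / sd for d in deviations]


zx = standard_scores(X)
zy = standard_scores(Y)

# Sum the products of paired standard scores, then take their mean.
sum_of_products = sum(a * b for a, b in zip(zx, zy))
r = sum_of_products / N

print(round(sum_of_products, 2), round(r, 2))  # prints: 2.5 0.5
```

Note that the variance is divided by N rather than N - 1, following the text's use of means of squared deviations.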

Correlation in Deviation Scores

The definitional formula for Pearson's product-moment coefficient of correlation can be translated into a formula for deviation scores by substituting $z_x = x / s_x$ and $z_y = y / s_y$, where $s_x$ and $s_y$ are the standard deviations of X and Y. These substitutions result in a formula for the coefficient of correlation expressed in deviation scores

$$r_{xy} = \frac{\sum xy}{N s_x s_y}$$

The necessary steps for computing the coefficient of correlation, using deviation scores, are

 

 

 

The computational procedure expressed in shorthand by the deviation-score formula, and summarized in the above table, can be described verbally as follows. As a first step, the means of the X and Y variables are computed, and the obtained scores are transformed to deviation scores. The mean of the product of the deviation scores is computed as 1.00. Division by the product of both standard deviations gives the value of the coefficient of correlation as .50. This value is identical to the value obtained from the standard score formula.
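The same steps can be traced in code. This is a minimal sketch using hypothetical ratings (an assumption, chosen to reproduce the summary values quoted in the text, not the original data); the variable names are illustrative.

```python
import math

# Hypothetical ratings (assumption): means 3, variances 2, r = .50.
X = [1, 2, 3, 4, 5]
Y = [1, 3, 5, 2, 4]
N = len(X)

# Step 1: means and deviation scores.
mx = sum(X) / N
my = sum(Y) / N
x = [v - mx for v in X]
y = [v - my for v in Y]

# Step 2: mean of the products of deviation scores (the covariance).
mean_product = sum(a * b for a, b in zip(x, y)) / N

# Step 3: standard deviations, dividing by N.
sx = math.sqrt(sum(a * a for a in x) / N)
sy = math.sqrt(sum(b * b for b in y) / N)

# Step 4: divide by the product of the standard deviations.
r = mean_product / (sx * sy)

print(mean_product, round(r, 2))  # prints: 1.0 0.5
```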

Correlation in Obtained Scores

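Converting deviation scores into obtained scores within the deviation-score formula for the coefficient of correlation, i.e., substituting X - Mx and Y - My for the deviation scores x and y, as

$$x = X - M_x, \qquad y = Y - M_y$$

the expression

$$r_{xy} = \frac{\sum (X - M_x)(Y - M_y)}{N s_x s_y}$$

can be expanded as

$$r_{xy} = \frac{\sum \left( XY - X M_y - M_x Y + M_x M_y \right)}{N s_x s_y}$$

Since the mean is a constant, the above expression can be written as

$$r_{xy} = \frac{\sum XY - M_y \sum X - M_x \sum Y + N M_x M_y}{N s_x s_y}$$

and, because $\sum X = N M_x$ and $\sum Y = N M_y$, simplified as

$$r_{xy} = \frac{\frac{\sum XY}{N} - M_x M_y}{s_x s_y}$$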

For the current example, the computational operations outlined by the above formula are

 

 

 

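The variance of variable X can be computed from obtained scores as

$$s_x^2 = \frac{\sum X^2}{N} - M_x^2$$

For the example, 11 - 9 = 2. The variance of variable Y is computed as

$$s_y^2 = \frac{\sum Y^2}{N} - M_y^2$$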

For the example, the variance of Y also equals 2. Taking the square root, the standard deviation of both variables equals 1.41. The coefficient of correlation is then computed as (10 - (3)(3)) / ((1.41)(1.41)), which equals .50.
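The obtained-score computation can be checked directly from raw scores, without forming deviation scores first. The sketch below again uses hypothetical ratings (an assumption: they reproduce the quoted values 10, 11, 9, 2 and .50 but are not the original data).

```python
import math

# Hypothetical ratings (assumption): chosen to match the quoted values.
X = [1, 2, 3, 4, 5]
Y = [1, 3, 5, 2, 4]
N = len(X)

mean_x = sum(X) / N                             # 3.0
mean_y = sum(Y) / N                             # 3.0
mean_xy = sum(a * b for a, b in zip(X, Y)) / N  # mean of raw products

# Variance from obtained scores: mean of squares minus squared mean.
var_x = sum(a * a for a in X) / N - mean_x ** 2  # 11 - 9
var_y = sum(b * b for b in Y) / N - mean_y ** 2  # 11 - 9

r = (mean_xy - mean_x * mean_y) / (math.sqrt(var_x) * math.sqrt(var_y))

print(mean_xy, var_x, var_y, round(r, 2))  # prints: 10.0 2.0 2.0 0.5
```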

Covariance / Correlation Translations

Let us consider jointly the deviation score formulae for the correlation and covariance coefficients. Since the correlation in deviation scores equals

$$r_{xy} = \frac{\sum xy}{N s_x s_y}$$

and the covariance (also expressed in deviation scores) equals

$$c_{xy} = \frac{\sum xy}{N}$$

the correlation coefficient can be expressed as the standardized covariance

$$r_{xy} = \frac{c_{xy}}{s_x s_y}$$

From the above expression, the coefficient of covariance may be expressed as

$$c_{xy} = r_{xy} s_x s_y$$

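The above term is often part of statistical formulae, notably of formulae for computing the variance of a sum and of a difference. Thus the formula for the variance of a sum, introduced in the previous chapter,

$$s_{x+y}^2 = s_x^2 + s_y^2 + 2 c_{xy}$$

can be written as an expression containing the coefficient of correlation,

$$s_{x+y}^2 = s_x^2 + s_y^2 + 2 r_{xy} s_x s_y$$

The formula for the variance of a difference

$$s_{x-y}^2 = s_x^2 + s_y^2 - 2 c_{xy}$$

can be written as

$$s_{x-y}^2 = s_x^2 + s_y^2 - 2 r_{xy} s_x s_y$$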

These alternative expressions of the variance of sums and differences are frequently encountered in the discussion of properties of methods for statistical analysis of data.
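These identities are easy to confirm numerically. The following sketch assumes hypothetical ratings (not the original data) and illustrative helper names `variance` and `covariance`, both dividing by N; it checks that the covariance is recovered from the correlation, and that the variances of X + Y and X - Y agree with the correlation-based expressions.

```python
import math

# Hypothetical ratings (assumption): means 3, variances 2, r = .50.
X = [1, 2, 3, 4, 5]
Y = [1, 3, 5, 2, 4]


def variance(values):
    """Mean squared deviation from the mean (divides by N)."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def covariance(a, b):
    """Mean product of paired deviation scores (divides by N)."""
    ma = sum(a) / len(a)
    mb = sum(b) / len(b)
    return sum((p - ma) * (q - mb) for p, q in zip(a, b)) / len(a)


sx = math.sqrt(variance(X))
sy = math.sqrt(variance(Y))
c = covariance(X, Y)
r = c / (sx * sy)  # correlation as standardized covariance

# Covariance recovered from the correlation.
assert math.isclose(c, r * sx * sy)

# Variance of a sum and of a difference, computed directly ...
var_sum = variance([a + b for a, b in zip(X, Y)])
var_diff = variance([a - b for a, b in zip(X, Y)])

# ... and via the expressions containing the coefficient of correlation.
assert math.isclose(var_sum, variance(X) + variance(Y) + 2 * r * sx * sy)
assert math.isclose(var_diff, variance(X) + variance(Y) - 2 * r * sx * sy)

print(var_sum, var_diff)  # prints: 6.0 2.0
```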

Summary

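The formulae for the coefficient of correlation, and the formulae for translating between covariance and correlation, are summarized as

Obtained Scores

$$r_{xy} = \frac{\frac{\sum XY}{N} - M_x M_y}{s_x s_y}$$

Deviation Scores

$$r_{xy} = \frac{\sum xy}{N s_x s_y}$$

Standard Scores

$$r_{xy} = \frac{\sum z_x z_y}{N}$$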

These formulae are the basic building blocks of the general linear model, capturing the quantitative aspects of relationships between variables. The formulae capturing the relationship between covariance and correlation are shown below.

Covariance from Correlation

$$c_{xy} = r_{xy} s_x s_y$$

Correlation from Covariance

$$r_{xy} = \frac{c_{xy}}{s_x s_y}$$

Covariance and correlation are fundamental tools of statistical analysis and the principal building blocks of the general linear model. They are used in theory development as well as in applied computations. Of the two indices, correlation is the more frequently used. Additional properties of the coefficient of correlation will be discussed in detail in the chapters to follow.