Cruise Scientific        Visual Statistics Studio

Correlation Analysis

Galton and Pearson

The concept of correlation is based on Galton and Pearson's notion that there is a category beyond causation of which causation is only a limit. The world of Galton and Pearson was the Newtonian world: orderly, categorizing, the world of Pope's poems and exciting new scientific discoveries occurring in the physical sciences. Galton and Pearson hoped to bring the quantitative rigor of the physical sciences to the social sciences that were at that time dominated by qualitative descriptions and philosophical speculations.

Association

The method of correlation opened new vistas for quantitative social science. The notion of causality as the sole explanatory principle of events was broadened to include the notion of association between events. The expectation arose that these quantified associations would be elaborated into nomological networks encompassing the relationships between elements of complex systems. As the association between events becomes stronger, the probability that those events are also influenced by unknown factors lessens.

Hopes for a new, quantitative social science were tempered by the rigorous demands that correlational studies must meet: the assumptions must be correct, the factors included in the study must be relevant, and the experimental framework must be plausible. Failures to meet these conditions frequently inspired critics of correlational methods to conjure examples of patently erroneous conclusions based upon correlation between superficially related events.

Pearson Product-Moment Coefficient of Correlation
   Visual Statistics Studio - Correlation in Standard Scores

The methodological issues associated with correlational analyses are quite complex and hard to understand without an intimate knowledge of the technique itself. The formula for computing Pearson's product-moment coefficient of correlation preserves the form of the coefficient of covariance,

$$c_{xy} = \frac{\sum xy}{N}$$

where x and y are deviation scores and N is the number of paired observations, and substitutes standard scores for deviation scores:

$$r_{xy} = \frac{\sum z_x z_y}{N}$$

This definitional formula of the coefficient of correlation, together with the formulae for the mean and the variance, makes up the most important formulae of the general linear model described thus far. It is a definitional formula in the sense that it cannot be readily derived from some other, more basic expression. Its form suggests the full name of this statistic: the product-moment coefficient of correlation. All other renderings of the coefficient of correlation can be algebraically derived from this basic form.

While the coefficient of covariance has no upper and lower limits, the coefficient of correlation can vary from positive one (indicating a perfect positive relationship), through zero (indicating the absence of a relationship), to negative one (indicating a perfect negative relationship). To gain insight into the concept of the coefficient of correlation and its properties, let us consider an example of two rating scales, listed below. 

I would like to be a librarian

I like poetry

The responses were recorded as

together with its scatterplot

   

The question to be answered is whether the answers to these two questions are related. Computation of the product-moment coefficient of correlation is outlined as

 

Compute the deviation scores and the standard deviation

As displayed in the tabular presentation of the computational example, the obtained scores X and Y are translated to deviation scores x and y by subtracting their respective means (3; 3). The variances are then computed by squaring their deviation scores and computing their means (2; 2). Taking the square roots of both variances, their standard deviations (1.41; 1.41) are obtained.

Compute the standard scores

Dividing the deviation scores by their corresponding standard deviations results in standard scores zx and zy.

Mean of the product of the standard scores

By forming the products of the standard scores, summing them (2.50), and dividing the sum by the number of cases (5), the coefficient of correlation (.50) is obtained.

The correlation coefficient is positive. Higher scores on the variable X are associated with higher scores on the variable Y.  Lower scores on the variable X are associated with lower scores on the variable Y.  
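The three steps above can be sketched in a few lines of Python. The original response table is not reproduced in this text, so the ratings below are hypothetical values, chosen only so that the means (3; 3), variances (2; 2), and sum of products (2.50) agree with the figures quoted in the walkthrough. The helper names `mean` and `standard_scores` are likewise illustrative.

```python
import math

# Hypothetical ratings (assumption): the original data table is not available.
# They were picked to reproduce the summary values quoted in the text.
X = [1, 2, 3, 4, 5]
Y = [1, 3, 5, 2, 4]


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def standard_scores(values):
    """Convert obtained scores to standard scores (population SD, divisor N)."""
    m = mean(values)
    deviations = [v - m for v in values]                 # deviation scores
    sd = math.sqrt(mean([d * d for d in deviations]))    # standard deviation
    return [d / sd for d in deviations]


zx = standard_scores(X)
zy = standard_scores(Y)

# Mean of the products of the standard scores is the correlation coefficient.
products = [a * b for a, b in zip(zx, zy)]
r_standard = mean(products)

print(round(sum(products), 2), round(r_standard, 2))
```

With these assumed ratings, the sum of the products and the coefficient agree with the 2.50 and .50 reported above, up to floating-point rounding.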

Correlation in Deviation Scores

The coefficient of correlation remains invariant with respect to a change of the measurement unit. The definitional formula for Pearson's product-moment coefficient of correlation can be translated into a formula of the coefficient of correlation for the deviation scores by substituting

$$z_x = \frac{x}{s_x}$$

and

$$z_y = \frac{y}{s_y}$$

where s_x and s_y are the standard deviations of X and Y. These substitutions result in a formula for the coefficient of correlation expressed in deviation scores:

$$r_{xy} = \frac{\sum xy}{N s_x s_y}$$

The necessary steps for computing the coefficient of correlation, using deviation scores, are summarized as

  Computational Procedures

The computational procedures, expressed in shorthand by the formula for computing the coefficient of correlation using deviation scores and summarized in the above table, can be explained verbally as follows.

As a first step, the means of the X and Y variables are computed, and the obtained scores are transformed to deviation scores. The mean of the product of the deviation scores is computed as 1.00. Division by the product of both standard deviations gives the value of the coefficient of correlation as .50. This value is identical to the value obtained from the standard score formula.
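A minimal sketch of the deviation-score procedure follows. The ratings are again hypothetical stand-ins for the missing data table, chosen to match the quoted summaries, and the function name `correlation_from_deviations` is an illustrative assumption. The last lines also check the claim that the coefficient does not change when the measurement unit changes, here by multiplying X by 10.

```python
import math

# Hypothetical ratings (assumption): chosen to match the summaries in the text.
X = [1, 2, 3, 4, 5]
Y = [1, 3, 5, 2, 4]


def correlation_from_deviations(xs, ys):
    """Correlation as the mean product of deviation scores over s_x * s_y."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    x = [v - mx for v in xs]                              # deviation scores of X
    y = [v - my for v in ys]                              # deviation scores of Y
    mean_product = sum(a * b for a, b in zip(x, y)) / n   # this is the covariance
    sx = math.sqrt(sum(a * a for a in x) / n)             # population SD of X
    sy = math.sqrt(sum(b * b for b in y) / n)             # population SD of Y
    return mean_product / (sx * sy)


r_deviation = correlation_from_deviations(X, Y)

# Changing the unit of X (for example, multiplying every score by 10)
# rescales both the numerator and s_x, so the coefficient stays the same.
X_rescaled = [10 * v for v in X]
r_rescaled = correlation_from_deviations(X_rescaled, Y)

print(round(r_deviation, 2), round(r_rescaled, 2))
```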

Correlation in Obtained Scores

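Changing the deviation scores into obtained scores within the formula for the coefficient of correlation in deviation scores, i.e., substituting X - Mx and Y - My for the deviation scores x and y, as

$$x = X - M_x, \qquad y = Y - M_y$$

the expression

$$r_{xy} = \frac{\sum (X - M_x)(Y - M_y)}{N s_x s_y}$$

is obtained. This expression can be simplified as

$$r_{xy} = \frac{\frac{\sum XY}{N} - M_x M_y}{s_x s_y}$$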

Under the common denominator, the formula for the coefficient of correlation in obtained scores can be written as

$$r_{xy} = \frac{\sum XY - N M_x M_y}{N s_x s_y}$$

or, alternatively,

$$r_{xy} = \frac{N \sum XY - \sum X \sum Y}{N^2 s_x s_y}$$

For the current example, the computational operations outlined by the above formulae are summarized as

 

Initially, it is necessary to compute the variances of variables X and Y as

$$s_x^2 = \frac{\sum X^2}{N} - M_x^2$$

and

$$s_y^2 = \frac{\sum Y^2}{N} - M_y^2$$

For the example, the variance of X equals 2 (55/5 - 3² = 2). The variance of Y also equals 2. Taking the square root, the standard deviation of both variables is equal to 1.41. The coefficient of correlation is then computed as (10 - (3)(3)) / ((1.41)(1.41)), which equals .50.
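The obtained-score computation can be sketched directly from raw sums, without forming deviation scores first. As before, the ratings are hypothetical values standing in for the missing data table. They were chosen so that the sum of squares (55) and the mean cross-product (10) match the numbers used in the text. All variable names are illustrative.

```python
import math

# Hypothetical ratings (assumption): chosen so that sum(X^2) = 55 and
# sum(XY) / N = 10, the values used in the worked example.
X = [1, 2, 3, 4, 5]
Y = [1, 3, 5, 2, 4]

n = len(X)
mean_x = sum(X) / n
mean_y = sum(Y) / n

# Variance from obtained scores: mean of the squares minus the squared mean.
var_x = sum(v * v for v in X) / n - mean_x ** 2
var_y = sum(v * v for v in Y) / n - mean_y ** 2

# Mean cross-product of the obtained scores.
mean_xy = sum(a * b for a, b in zip(X, Y)) / n

# Correlation in obtained scores.
r_obtained = (mean_xy - mean_x * mean_y) / (math.sqrt(var_x) * math.sqrt(var_y))

print(var_x, var_y, mean_xy, round(r_obtained, 2))
```

Because the sketch never computes deviation scores, it works from running totals of X, Y, X², Y², and XY alone.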

Covariance / Correlation Translations

Let us consider jointly the deviation score formulae for the correlation and covariance coefficients. Since the correlation in deviation scores equals

$$r_{xy} = \frac{\sum xy}{N s_x s_y}$$

and the covariance (also expressed in deviation scores) equals

$$c_{xy} = \frac{\sum xy}{N}$$

the correlation coefficient can be expressed as

$$r_{xy} = \frac{c_{xy}}{s_x s_y}$$

From the above expression, the coefficient of covariance may be isolated and redefined as the product of the coefficient of correlation and the standard deviations of its constituent scores:

$$c_{xy} = r_{xy} s_x s_y$$

The term $r_{xy} s_x s_y$, which expresses the covariance through the correlation, is often part of statistical formulae, notably of formulae for computing the variance of a sum and of a difference. The formula for the variance of a sum, introduced in the previous chapter,

$$s_{x+y}^2 = s_x^2 + s_y^2 + 2 c_{xy}$$

can be written as an expression containing the coefficient of correlation,

$$s_{x+y}^2 = s_x^2 + s_y^2 + 2 r_{xy} s_x s_y$$

The formula for the variance of a difference

$$s_{x-y}^2 = s_x^2 + s_y^2 - 2 c_{xy}$$

can be written as

$$s_{x-y}^2 = s_x^2 + s_y^2 - 2 r_{xy} s_x s_y$$

These alternative expressions of the variance of sums and differences are frequently encountered in the discussion of properties of methods for statistical analysis of data.
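As a numerical check on these translations, the sketch below computes the variance of the sum and of the difference twice: once directly from the summed and differenced scores, and once from the expressions containing the coefficient of correlation. The ratings are hypothetical (the original data table is not available), and the helper names `mean`, `variance`, and `covariance` are illustrative. All use the population divisor N, as in the rest of this section.

```python
import math

# Hypothetical ratings (assumption): chosen to match the summaries in the text.
X = [1, 2, 3, 4, 5]
Y = [1, 3, 5, 2, 4]


def mean(values):
    return sum(values) / len(values)


def variance(values):
    """Population variance: mean of the squared deviation scores."""
    m = mean(values)
    return mean([(v - m) ** 2 for v in values])


def covariance(xs, ys):
    """Population covariance: mean of the products of deviation scores."""
    mx, my = mean(xs), mean(ys)
    return mean([(a - mx) * (b - my) for a, b in zip(xs, ys)])


s_x = math.sqrt(variance(X))
s_y = math.sqrt(variance(Y))
c_xy = covariance(X, Y)
r_xy = c_xy / (s_x * s_y)          # correlation from covariance

# Direct computation on the summed and differenced scores.
var_sum_direct = variance([a + b for a, b in zip(X, Y)])
var_diff_direct = variance([a - b for a, b in zip(X, Y)])

# The same quantities from the expressions containing the correlation.
var_sum_formula = variance(X) + variance(Y) + 2 * r_xy * s_x * s_y
var_diff_formula = variance(X) + variance(Y) - 2 * r_xy * s_x * s_y

print(round(var_sum_direct, 2), round(var_sum_formula, 2))
print(round(var_diff_direct, 2), round(var_diff_formula, 2))
```

For positively correlated variables such as these, the variance of the sum exceeds the variance of the difference, which is what the sign of the last term implies.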

Summary

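The formulae for the coefficient of correlation, and for translating between covariance and correlation, are summarized as

Obtained Scores

$$r_{xy} = \frac{\frac{\sum XY}{N} - M_x M_y}{s_x s_y}$$

Deviation Scores

$$r_{xy} = \frac{\sum xy}{N s_x s_y}$$

Standard Scores

$$r_{xy} = \frac{\sum z_x z_y}{N}$$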

These formulae are the basic building blocks of the general linear model, capturing the quantitative aspects of relationships between variables. The formulae capturing the relationship between covariance and correlation are shown below.

Covariance

$$c_{xy} = r_{xy} s_x s_y$$

Correlation

$$r_{xy} = \frac{c_{xy}}{s_x s_y}$$

Covariance and correlation are fundamental tools of statistical analysis, the principal building blocks of the general linear model. They are used in the course of theory development as well as in applied computations. Of the two indices, correlation is the more frequently used. Additional properties of the coefficient of correlation will be discussed in detail in the chapters to follow.