Cruise Scientific        Visual Statistics Studio

Measures of Variability 

While the measures of central tendency convey information about the commonalities of measured properties, the measures of variability quantify the degree to which they differ. If the values in a data set are not all the same, variability exists. The measures of central tendency should be complemented by measures of variability for the same reason objective descriptions of events should contain accounts of both centripetal and centrifugal forces, of consenting and opposing opinions, of shared and conflicting views. The variability of data is measured by the statistic called variance, symbolized by the lowercase Greek letter sigma squared, σ².

Within mathematics, the coefficient of variance was formulated by von Andrae, Helmert, and Jordan between 1872 and 1876 in terms of distances between values of a variable. Karl Pearson introduced this concept into statistics in a series of articles published in Philosophical Transactions and Biometrika between 1896 and 1906, pertaining to subjects as diverse as panmixia, personal equation, and the study of wasps. Pearson also coined the usage of the lowercase Greek letter sigma squared to signify the variance. The concept of variance is a basic component of statistical theory.

Example

The two variables, X and Y, have the same mean. However, their variances are different.

Deviation Scores and Their Properties

The algebraic expression of variance subsumes the concept of deviation scores. To compute deviation scores, the mean is subtracted from each individual score in the original data set. The scores within the original data set are called the obtained or raw scores. Subtraction of the mean transforms the obtained scores into deviation scores. This linear transformation preserves the distances between the raw scores, and hence the shape of their distribution, but it shifts the origin of the scale so that the mean falls at zero. Deviation scores sum to zero and thus their mean is always zero. The transformation to deviation scores changes obtained scores below the mean into negative numbers, scores above the mean into positive numbers, and anchors the mean of the distribution at the zero point.

An Example

By convention, obtained scores are written in upper case letters (e.g., X), and deviation scores in lower case letters (e.g., x). The linear transformation of a variable X into deviation scores x is shown below for the example variable X [1 2 3 4 5], whose mean is 3, as

X:    1    2    3    4    5
x:   -2   -1    0    1    2

The linear transformation of obtained scores X into deviation scores x can be written, using formal notation, as

\[ x = X - M \]

where M denotes the arithmetic mean of the obtained scores.

As contrasted with formulae for the computation of statistical indices such as the arithmetic mean, a single number, the above formula is a transformation formula. That is, it changes values of a variable throughout its whole range.
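As a minimal sketch of this transformation, the Python snippet below subtracts the mean from each obtained score. The data are the chapter's example variable X [1 2 3 4 5]; the function name deviation_scores is an illustrative choice and does not come from the original text.

```python
# Obtained (raw) scores from the chapter's running example.
X = [1, 2, 3, 4, 5]


def deviation_scores(scores):
    """Subtract the arithmetic mean from every obtained score."""
    mean = sum(scores) / len(scores)
    return [score - mean for score in scores]


x = deviation_scores(X)
print(x)       # [-2.0, -1.0, 0.0, 1.0, 2.0]
print(sum(x))  # 0.0, since deviation scores sum to zero
```

Note that the function returns a whole list rather than a single number, which is what distinguishes a transformation formula from the formula for an index such as the mean.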

Distances from the Mean

The concept of deviation scores from the arithmetic mean can be contrasted with scores defined in terms of distances from an absolute zero, as typified by the Kelvin scale of temperature. Measuring from the mean has definite merits in the social sciences, where it is typically difficult to find an absolute zero point for most measured properties. Is there such a thing as zero dominance, zero love, or zero hate? These are things to contemplate.

Definition of Variance

Transformation of the obtained scores to the deviation scores is one of the essential procedures of the data analysis. Initially, the computation of the mean has the highest priority. After the mean is known, it is removed from the data and the next most important index, the variance, is computed, as discussed in the following sections.

Variance is defined as the mean of the squared deviation scores

\[ \sigma^2 = \frac{\sum x^2}{n} \]

Most algebraic formulae may be viewed as blueprints of operations to be performed to obtain the desired outcome. The following example summarizes the operations.

Compute the Mean

To compute the mean, obtained scores must be summed and divided by n, as

\[ M = \frac{\sum X}{n} = \frac{1 + 2 + 3 + 4 + 5}{5} = \frac{15}{5} = 3 \]

Compute the Deviation Scores

The mean must then be removed from the obtained scores by subtraction, as

\[ x = X - M = X - 3 \]

This operation results in a set of deviation scores. 

Compute the Mean of the Squared Deviation Scores

The deviation scores are then squared, summed, and averaged, as

\[ \sigma^2 = \frac{\sum x^2}{n} = \frac{4 + 1 + 0 + 1 + 4}{5} = \frac{10}{5} = 2 \]

Summary

For the example, obtained scores were summed and averaged. The obtained mean, 3, was subtracted from each of the obtained scores and the resulting deviation scores were squared and summed. This sum, 10, was divided by n, 5, to get the variance, 2. The square root of variance is called the standard deviation. For our example, the standard deviation equals 1.41.
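The three steps summarized above can be sketched in Python as follows. This is only an illustration using the example data X [1 2 3 4 5]; the variable names are chosen here for readability and do not come from the original text.

```python
import math

X = [1, 2, 3, 4, 5]
n = len(X)

# Step 1: compute the mean of the obtained scores.
mean = sum(X) / n

# Step 2: remove the mean to get deviation scores.
deviations = [score - mean for score in X]

# Step 3: square, sum, and average the deviation scores.
sum_of_squares = sum(d ** 2 for d in deviations)
variance = sum_of_squares / n

# The standard deviation is the square root of the variance.
std_dev = math.sqrt(variance)

print(mean)               # 3.0
print(sum_of_squares)     # 10.0
print(variance)           # 2.0
print(round(std_dev, 2))  # 1.41
```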

Variance Computed Directly from the Obtained Scores

In the preceding section, we introduced the definitional formula for variance as

\[ \sigma^2 = \frac{\sum x^2}{n} \]

where

\[ x = X - M \]

Substituting the right side of the above equation into the variance formula results in

\[ \sigma^2 = \frac{\sum (X - M)^2}{n} \]

This formula can be expanded using the familiar algebraic identity

\[ (a - b)^2 = a^2 - 2ab + b^2 \]

where X and M take the place of a and b. Also, note that the summation signs are associated with the expanded terms as

\[ \sigma^2 = \frac{\sum X^2}{n} - \frac{2M \sum X}{n} + \frac{\sum M^2}{n} \]

First, simplify the middle term. The middle term of the above equation contains two means.

\[ \frac{2M \sum X}{n} = 2M \cdot \frac{\sum X}{n} \]

The mean appears once as the sum of the obtained scores divided by n and once as M. Substituting M for the sum of the obtained scores divided by n simplifies the above expression as

\[ \sigma^2 = \frac{\sum X^2}{n} - 2M^2 + \frac{\sum M^2}{n} \]

Second, simplify the last term. The formula for the arithmetic mean, written in complete notation, is

\[ M = \frac{1}{n} \sum_{i=1}^{n} X_i \]

The last term of the equation we are trying to simplify, written in complete notation as

\[ \frac{1}{n} \sum_{i=1}^{n} M^2 \]

differs from the formula for the arithmetic mean in one important respect. While the formula for the arithmetic mean states that the variable X should be summed and averaged over its whole range, the above formula states that the square of the mean, a constant value, should be summed and averaged over the whole range of the variable X. This means, in terms of the current example, that the square of the mean, a constant number, should be summed five times, as 9 + 9 + 9 + 9 + 9. This operation can be simplified as 5(9). Thus we can write the expression

\[ \frac{1}{n} \sum_{i=1}^{n} M^2 \]

as

\[ \frac{n M^2}{n} = M^2 \]

and the equation

\[ \sigma^2 = \frac{\sum X^2}{n} - 2M^2 + \frac{\sum M^2}{n} \]

can be simplified as

\[ \sigma^2 = \frac{\sum X^2}{n} - 2M^2 + M^2 \]

Thus, the variance can be computed directly from the obtained scores as

\[ \sigma^2 = \frac{\sum X^2}{n} - M^2 \]

Examples

Examples illustrating the computation of variance directly from the obtained scores are shown below for the continuous variable X and the binary variable Y.

Variable X

The variance calculated directly from the obtained scores for the variable X as 11 - 3² equals 2.

The value is identical to that obtained by computing variance by using the deviation score formula.

Variable Y

The variance calculated directly from the obtained scores for the variable Y as .6 - .6² equals .24.
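To check that the two routes agree, the sketch below computes the variance both from deviation scores and directly from the obtained scores, for X [1 2 3 4 5] and the binary variable Y [0 0 1 1 1]. It assumes integer scores and uses Python's fractions module so the results are exact; the function names are illustrative and not part of the original text.

```python
from fractions import Fraction


def variance_from_deviations(scores):
    """Definitional formula: mean of the squared deviation scores."""
    n = len(scores)
    mean = Fraction(sum(scores), n)
    return sum((s - mean) ** 2 for s in scores) / n


def variance_from_raw(scores):
    """Raw-score formula: mean of the squared scores minus the squared mean."""
    n = len(scores)
    mean = Fraction(sum(scores), n)
    return Fraction(sum(s ** 2 for s in scores), n) - mean ** 2


X = [1, 2, 3, 4, 5]
Y = [0, 0, 1, 1, 1]

print(variance_from_raw(X), variance_from_deviations(X))  # 2 2
print(variance_from_raw(Y), variance_from_deviations(Y))  # 6/25 6/25
```

Exact fractions are used here only to avoid floating-point rounding in the comparison; 6/25 is the .24 reported for Y.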

Variance of Binary Variables

Statistical analysis relies to some extent on squaring the elements of data matrices. If the data matrix contains variables with only binary elements, i.e., only the numbers one and zero, squaring these numbers leaves these variables unchanged.

Thus, simpler formulae describe statistical indices and operations pertaining to binary data. For this reason, within statistics, numbers are often classified as continuous and binary. Continuous numbers are defined as numbers taking on any values. An example of a continuous variable is a variable [1 2 3 4 5]. Binary numbers are defined as numbers taking on only values of 'zero' and 'one.' An example of a binary variable is a variable [0 0 1 1 1]. Binary variables are often encountered in the case of instruments where the test items are scored as correct - incorrect, or true - false. Within this context, one typically stands for a true or correct response and zero for a false or incorrect response. Since squaring a binary variable X leaves its values unchanged, the expression

\[ \frac{\sum X^2}{n} \]

in the variance formula

\[ \sigma^2 = \frac{\sum X^2}{n} - M^2 \]

can be written as

\[ \frac{\sum X}{n} = M \]

and the formula can be simplified as

\[ \sigma^2 = M - M^2 \]

and

\[ \sigma^2 = M(1 - M) \]

PQ Formula

To further simplify the above formula, let us introduce a notational convention expressing the total number of cases, N, as a sum of cases with scores equal to zero and cases with scores equal to one, where

\[ N = n_0 + n_1 \]

and note that, for the binary variables, both the sum of ones and the sum of their binary complements, zeroes, can be conceptualized as simple counts as shown below. 

The number of cases with scores equal to zero (n0) is 2. The number of cases with scores equal to one (n1) is 3. The sum of cases with scores equal to zero and cases with scores equal to one (n0 + n1 = N) is 5.

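The mean of the binary scores, p, can be written as

\[ p = \frac{n_1}{N} \]

its complement, q, as

\[ q = \frac{n_0}{N} = 1 - p \]

and the variance of the binary variables can be written as

\[ \sigma^2 = pq \]

The pq variance formula permits rapid calculations of variance of binary variables. For each variable, simply count the number of ones and zeroes, divide each total by N, and then multiply both fractions, as

\[ \sigma^2 = pq = \frac{3}{5} \cdot \frac{2}{5} = \frac{6}{25} = .24 \]

The above method for calculation of variance of binary variables can be contrasted with using the formula

\[ \sigma^2 = \frac{\sum X^2}{n} - M^2 \]

with a larger number of arithmetic operations to perform, as

\[ \sigma^2 = \frac{0^2 + 0^2 + 1^2 + 1^2 + 1^2}{5} - \left(\frac{3}{5}\right)^2 = \frac{3}{5} - \frac{9}{25} = \frac{6}{25} \]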

All the formulae for computation of variance discussed so far are algebraic transformations of the definitional variance formula and should provide identical numerical results, provided the data are of the appropriate type for the formula used. The above table contains an empirical example to illustrate this point. Variance computed by the formula for computation of variance from the obtained scores (3/5 - 9/25 = 15/25 - 9/25 = 6/25) and by the formula for computation of variance from the deviation scores (6/25) is numerically equal to the variance computed by the 'pq' formula for computation of variance of binary variables ((3/5)(2/5) = 6/25).
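The counting procedure behind the pq formula can be sketched in Python as below. The function name pq_variance and the input check are illustrative assumptions rather than part of the original text; the check is included because the formula is valid only for scores of zero and one.

```python
from fractions import Fraction


def pq_variance(binary_scores):
    """Variance of a 0/1 variable as the product of p and q."""
    N = len(binary_scores)
    n1 = sum(1 for s in binary_scores if s == 1)  # count of ones
    n0 = sum(1 for s in binary_scores if s == 0)  # count of zeroes
    if n0 + n1 != N:
        raise ValueError("the pq formula applies only to 0/1 scores")
    p = Fraction(n1, N)
    q = Fraction(n0, N)
    return p * q


Y = [0, 0, 1, 1, 1]
print(pq_variance(Y))  # 6/25
```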

True and Unbiased Variance

As you may observe by looking at the keyboard of a typical scientific hand calculator, there are two kinds of variance. The variance defined as

\[ \sigma^2 = \frac{\sum x^2}{n} \]

is called the population, or true, variance and can be contrasted with the variance defined as

\[ \sigma^2_{\text{unbiased}} = \frac{\sum x^2}{n - 1} \]

called the sample, or unbiased, variance estimate.

The computations of both true and unbiased variance coefficients are illustrated below. For this example, the variance is either 2.0 or 2.5, depending on whether the sum of squared deviation scores (10) is divided by n, here equal to 5, or by n-1, here equal to 4. In the former case, the obtained variance is the true variance. In the latter case, the variance is 'unbiased'.
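Python's standard statistics module offers the same two choices as the calculator keyboard: pvariance divides by n and variance divides by n-1. The short sketch below applies both to the example variable X [1 2 3 4 5]; the variable names are illustrative.

```python
import statistics

X = [1, 2, 3, 4, 5]

# Population ("true") variance: sum of squared deviations divided by n.
true_variance = statistics.pvariance(X)

# Sample ("unbiased") variance: the same sum divided by n - 1.
unbiased_variance = statistics.variance(X)

print(true_variance, unbiased_variance)
```

For this example the two values equal 2 and 2.5, matching the division of 10 by 5 and by 4.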

Degrees of Freedom

The n-1 term in the denominator of the unbiased variance formula is referred to as degrees of freedom, signified as df or ν (the Greek letter nu). To illustrate this notion, let us consider the numbers 1, 2, 3, 4, 5, assigned to Allen, Beth, Cathy, Debra, and Edgar at the beginning of our discussion. No one actually asked these five subjects whether they liked poetry. In fact, these subjects are purely fictitious and the assignment of the numbers 1, 2, 3, 4, and 5 to each subject was done because of computational convenience. The point here is that we were free to select these numbers at will, and, in this instance, the number of degrees of freedom equaled n, the number of cases.

Now, imagine that this book was written by using only deviation scores. The authors of this hypothetical book could assign numbers 1, 2, 3, 4 to Allen, Beth, Cathy, and Debra. So far, they were free to assign to these fictitious subjects any numbers they wished. However, in the case of Edgar, they would no longer be free to assign to him any number they wished.

They would have to assign to him the number -10, since the deviation scores must sum to zero. After selecting the first four numbers as 1, 2, 3, and 4, the last number has to be -10 in order for the total sum to equal zero. Thus, the number of degrees of freedom associated with the deviation scores is n-1 and, for this example, equals 4.
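The constraint in Edgar's case can be illustrated with a few lines of Python. This is only a sketch; the function name last_deviation_score is hypothetical and chosen for this example.

```python
def last_deviation_score(freely_chosen):
    """Return the one remaining deviation score, fixed by the sum-to-zero rule."""
    return -sum(freely_chosen)


free = [1, 2, 3, 4]                 # scores chosen at will for the first four subjects
edgar = last_deviation_score(free)  # the fifth score is no longer free

print(edgar)                # -10
print(sum(free + [edgar]))  # 0
print(len(free))            # 4 degrees of freedom, i.e. n - 1
```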

Unbiased Variance Computed Directly from Obtained Scores

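In the previous section, the true variance was computed using the formula for computation of variance from the obtained scores.

\[ \sigma^2 = \frac{\sum X^2}{n} - M^2 \]

Substituting the right side of the equation for the computation of the mean results in

\[ \sigma^2 = \frac{\sum X^2}{n} - \left(\frac{\sum X}{n}\right)^2 = \frac{n \sum X^2 - \left(\sum X\right)^2}{n \cdot n} \]

The above formula can be changed into a formula expressing the unbiased variance by substituting the n-1 expression for one of the two ns in the denominator, as

\[ \sigma^2_{\text{unbiased}} = \frac{n \sum X^2 - \left(\sum X\right)^2}{n(n - 1)} \]

Consider a variable X [1 2 3 4 5]. Squares of its values can be computed as

X:     1    2    3    4    5     sum = 15
X²:    1    4    9   16   25     sum = 55

The unbiased variance is computed as 5 times 55, minus 15 squared, divided by 5 times 4, which equals 2.5.

\[ \sigma^2_{\text{unbiased}} = \frac{5(55) - 15^2}{5(4)} = \frac{275 - 225}{20} = \frac{50}{20} = 2.5 \]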
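The same computation can be sketched in Python using only the two sums of the obtained scores. The function name unbiased_variance_from_raw is an illustrative choice, not a name from the original text.

```python
def unbiased_variance_from_raw(scores):
    """Unbiased variance from the sum of scores and the sum of squared scores."""
    n = len(scores)
    sum_x = sum(scores)                  # sum of X, 15 for the example
    sum_x2 = sum(s * s for s in scores)  # sum of X squared, 55 for the example
    return (n * sum_x2 - sum_x ** 2) / (n * (n - 1))


X = [1, 2, 3, 4, 5]
print(unbiased_variance_from_raw(X))  # 2.5
```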

Translations between True and Unbiased Variance

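What is the relationship between the true and unbiased coefficients of variance? To answer this question, let us form the ratio of unbiased and true variances as

\[ \frac{\sigma^2_{\text{unbiased}}}{\sigma^2} = \frac{\sum x^2 / (n - 1)}{\sum x^2 / n} \]

The above formula can be simplified as

\[ \frac{\sigma^2_{\text{unbiased}}}{\sigma^2} = \frac{n}{n - 1} \]

Thus, the translation from the true to the unbiased form can be accomplished as

\[ \sigma^2_{\text{unbiased}} = \frac{n}{n - 1}\, \sigma^2 \]

and the translation of the unbiased variance into the true form as

\[ \sigma^2 = \frac{n - 1}{n}\, \sigma^2_{\text{unbiased}} \]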

For the current example, the unbiased variance (2.5) can be obtained from the true variance (2) as (5/4)(2) = 2.5 and the true variance can be obtained from the unbiased variance as (4/5)(2.5) = 2. The ratios of degrees of freedom to n and of n to degrees of freedom, translating variance from the unbiased form to the true form and vice versa, are frequently encountered in statistical analyses.
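The two translations amount to multiplying by n/(n-1) or by (n-1)/n. A minimal Python sketch, with function names chosen here purely for illustration:

```python
def true_to_unbiased(true_variance, n):
    """Multiply the true variance by n / (n - 1)."""
    return true_variance * n / (n - 1)


def unbiased_to_true(unbiased_variance, n):
    """Multiply the unbiased variance by (n - 1) / n."""
    return unbiased_variance * (n - 1) / n


print(true_to_unbiased(2, 5))    # 2.5
print(unbiased_to_true(2.5, 5))  # 2.0
```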

Summary

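The variance formulae summarized here are of fundamental importance and will be repeatedly encountered in the course of our narrative. Formulae describing the transformation of variables from the obtained into deviation scores and back from the deviation scores to the obtained scores are

\[ x = X - M \qquad X = x + M \]

Formulae for the true variance expressed in the obtained, deviation, and binary scores are summarized as

\[ \sigma^2 = \frac{\sum X^2}{n} - M^2 \qquad \sigma^2 = \frac{\sum x^2}{n} \qquad \sigma^2 = pq \]

Formulae for the unbiased variance expressed in the obtained, deviation, and binary scores are summarized as

\[ \sigma^2_{\text{unbiased}} = \frac{n \sum X^2 - \left(\sum X\right)^2}{n(n - 1)} \qquad \sigma^2_{\text{unbiased}} = \frac{\sum x^2}{n - 1} \qquad \sigma^2_{\text{unbiased}} = \frac{N}{N - 1}\, pq \]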

Inspection of the above formulae shows that the introduction of the degrees of freedom complicates the variance expressions. The notational description of the general linear model using unbiased variances at the level of obtained scores is not parsimonious. Parsimonious presentations of statistical theory use the true variance throughout, translating the true variance into an unbiased form

\[ \sigma^2_{\text{unbiased}} = \frac{n}{n - 1}\, \sigma^2 \]

only when necessary.