Krus, D. J. (2006) The true and the unbiased variance. Journal of Visual Statistics at www.VisualStatistics.net (February 11, 2006)

The true and the unbiased variance
David J. Krus
Arizona State University

The keyboard of a scientific calculator often has a key engraved with σ² and a key engraved with s². Let us call the value returned by the σ² key the true variance and the value returned by the s² key the unbiased variance. Entering a few numbers that are not all equal and pressing the s² key will return a number (the unbiased variance) that is greater than the number (the true variance) returned by the σ² key. The more numbers you enter, the smaller the relative difference becomes. When you enter more than 30 numbers, this difference is negligible. However, to get identical values of these two kinds of variance, you would have to enter an infinitely large number of numbers.
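The two calculator keys have direct counterparts in Python's standard statistics module, where pvariance divides by n and variance divides by n − 1. The following is only a minimal sketch: the helper name variance_gap, the seed, and the sample sizes are assumptions chosen for illustration, not part of the original article.

```python
import random
import statistics

def variance_gap(values):
    """Return (true variance, unbiased variance, their difference)."""
    true_var = statistics.pvariance(values)     # divides by n, like the sigma^2 key
    unbiased_var = statistics.variance(values)  # divides by n - 1, like the s^2 key
    return true_var, unbiased_var, unbiased_var - true_var

rng = random.Random(0)  # arbitrary fixed seed so that the run is repeatable
for n in (3, 10, 30, 300):
    values = [rng.gauss(0, 1) for _ in range(n)]
    true_var, unbiased_var, gap = variance_gap(values)
    print(f"n={n:4d}  sigma^2={true_var:.4f}  s^2={unbiased_var:.4f}  difference={gap:.4f}")
```

The unbiased variance always exceeds the true variance by the factor n/(n − 1), so the printed difference should shrink relative to the variance as n grows.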

Computationally, these two kinds of variance are obtained by dividing a certain quantity, described later, either by n, the size of the population measured, or by the degrees of freedom associated with the sample from which the parameters of a population are estimated. The degrees of freedom are often, but not always, equal to n − 1. The concept of true variance is often poorly understood: some think that a population has to be infinitely large, or that the variance must always be computed with n − 1 in the denominator. If a population of measurable events or phenomena exists, it is always finite, as the concept of infinity, like that of eternity, is only a fiction. Consider, for instance, reporting on a population of an endangered species with only four animals remaining: all four can be measured, so the variance is divided by n. In this context Press, Teukolsky, Vetterling and Flannery (1992, p. 605) comment that "the n - 1 should be changed to n if you are ever in the situation of measuring the variance of a distribution whose mean is known a priori rather than being estimated from the data."
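The remark about a mean known a priori can be checked with a short simulation. This is a sketch under stated assumptions: the function names, the seed, the sample size of 5, and the 20,000 replications are illustrative choices, and the values are drawn from a normal distribution with mean 0 and variance 1.

```python
import random

def variance_known_mean(values, mu):
    """Mean squared deviation from a mean mu that is known a priori (divisor n)."""
    return sum((v - mu) ** 2 for v in values) / len(values)

def average_estimate(n, replications, seed=0):
    """Average variance_known_mean over many samples drawn with mean 0, variance 1."""
    rng = random.Random(seed)  # the seed is an arbitrary choice for repeatability
    total = 0.0
    for _ in range(replications):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        total += variance_known_mean(sample, 0.0)
    return total / replications

# With the mean known to be 0, dividing by n should average out close to 1.
print(average_estimate(n=5, replications=20000))
```

Because the deviations are taken from the known mean rather than from the sample mean, no degree of freedom is used up, and the divisor n gives an estimate that is not biased downward.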

Computation of the true and unbiased variance

Mathematical formulae defining the true and the unbiased variance use the Greek letter Σ, which means "sum all values of a variable." The variable in this context is the lowercase Latin character x, which denotes the deviation scores. The number of values of the variable X is signified as n. The values of the variable X are the obtained values, sometimes also called the obtained scores, i.e., values obtained by quantifying properties of some entity or attribute. The deviation values, also called the deviation scores (values that deviate from the mean), are obtained by subtracting the mean, M, from each obtained score X; i.e., x = X − M. The convention of signifying the deviation scores by a lowercase letter reflects the notion that the obtained scores, signified by capital letters, are 'diminished' in size by subtraction of the arithmetic mean. The true variance of a variable X is defined as

$$\sigma^2 = \frac{\sum x^2}{n}$$

and the unbiased variance of a variable X is defined as

$$s^2 = \frac{\sum x^2}{n - 1}$$

where the expression n − 1 signifies the number of degrees of freedom, sometimes also signified by the Greek character ν (nu). For instance, the obtained scores X = [0, 1, 2, 3] can be transformed into the vector of deviation scores x = [−1.5, −.5, .5, 1.5] by subtracting the mean of the variable X (1.5) from each score.

 

The sum of the deviation scores must be zero. The squared deviation scores (2.25, .25, .25, 2.25) have to be summed (5) and divided by n (4) to obtain the true variance (1.25) or divided by n − 1 (3) to obtain the unbiased variance (1.67).
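The arithmetic of this example can be retraced step by step in a few lines of Python. The function names below are assumptions introduced for illustration; the data are the obtained scores 0, 1, 2, 3 from the text.

```python
def deviation_scores(scores):
    """Subtract the arithmetic mean M from every obtained score X."""
    mean = sum(scores) / len(scores)
    return [value - mean for value in scores]

def true_variance(scores):
    """Sum of squared deviation scores divided by n."""
    return sum(d * d for d in deviation_scores(scores)) / len(scores)

def unbiased_variance(scores):
    """Sum of squared deviation scores divided by n - 1."""
    return sum(d * d for d in deviation_scores(scores)) / (len(scores) - 1)

X = [0, 1, 2, 3]
x = deviation_scores(X)
print(x)                               # [-1.5, -0.5, 0.5, 1.5]
print(sum(x))                          # 0.0
print(sum(d * d for d in x))           # 5.0
print(true_variance(X))                # 1.25
print(round(unbiased_variance(X), 2))  # 1.67
```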

Changing true variance to unbiased variance and vice versa

The variance can be easily changed from the true variance to the unbiased variance, as

$$s^2 = \frac{n}{n - 1}\,\sigma^2$$

and from the unbiased variance to the true variance, as

$$\sigma^2 = \frac{n - 1}{n}\,s^2$$

For the example, the true variance (1.25) can be changed to the unbiased variance as (4/3)(1.25) = 1.67 and the unbiased variance (1.67) can be changed to the true variance as (3/4)(1.67) = 1.25.
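These two conversions are simple enough to write as a pair of helper functions. The names to_unbiased and to_true are hypothetical, chosen here only for illustration.

```python
def to_unbiased(true_var, n):
    """Change a true variance (divisor n) into an unbiased variance (divisor n - 1)."""
    if n < 2:
        raise ValueError("the unbiased variance needs at least two values")
    return true_var * n / (n - 1)

def to_true(unbiased_var, n):
    """Change an unbiased variance (divisor n - 1) into a true variance (divisor n)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return unbiased_var * (n - 1) / n

print(round(to_unbiased(1.25, 4), 2))  # 1.67
print(round(to_true(5 / 3, 4), 2))     # 1.25
```

Note that converting the rounded value 1.67 back gives 1.2525 rather than exactly 1.25; the round trip is exact only when the unrounded value 5/3 is used.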

Degrees of freedom

The n − 1 term in the denominator of the unbiased variance formula is referred to as the degrees of freedom, signified as df or by the Greek letter ν. The notion of the degrees of freedom is related to the concept of the random normal variable. To illustrate this concept, let us consider the numbers 0, 1, 2, 3 assigned to the four subjects in our illustrative example. These subjects are fictitious, as are the numbers 0, 1, 2, and 3. Don't be misled by their orderly sequence: in a recent lottery, the winning numbers were 3, 4, 5, 6, 7, 17, and 34.

The point here is that we were free to select these numbers at will and, in this instance, the number of degrees of freedom we had equaled n, the number of cases. This is also why the arithmetic mean is computed by dividing the sum of the obtained values by n and not by n − 1. Now, imagine that this example had been written using only the deviation scores. The authors of this hypothetical example could have assigned the numbers 0, 1, and 2 to the first three subjects; so far, they were free to assign to these fictitious subjects any numbers they wished. However, in the case of the fourth subject, they were no longer free to do so: the deviation scores must sum to zero, and thus the only possible number to assign to the fourth subject was −3. Thus, the number of degrees of freedom associated with the deviation scores is n − 1 and, for this example, equals 3.
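The constraint on the last deviation score can be made explicit in code. In this small sketch the function name last_deviation is an assumption made for illustration: given any n − 1 freely chosen deviation scores, the remaining one is fixed by the requirement that all of them sum to zero.

```python
def last_deviation(free_deviations):
    """Return the one deviation score that is forced once the other n - 1 are chosen."""
    return -sum(free_deviations)

free = [0, 1, 2]               # chosen at will for the first three subjects
forced = last_deviation(free)  # no freedom is left for the fourth subject
print(forced)                  # -3
print(sum(free + [forced]))    # 0
print(len(free))               # 3, the degrees of freedom (n - 1)
```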

Degrees of freedom: Monte Carlo simulation

Simulations using random number generators are often called Monte Carlo experiments. A common experiment of this type is to generate random variables with an expected mean of 0 and an expected variance of 1 and to observe the differences between the expected statistics and the obtained statistics. Results of one of these simulation experiments (with 100,000 generated random variables) are shown below for n equal to 100, 30, 10, 5, and 3. As you may observe, for ns greater than 30, the differences between the true and unbiased variances are negligible. Even for an n as small as 10, the differences are next to negligible. However, for the n of 5 and the n of 3, the differences were substantial.

Considering that very few real-life measurements are done with groups of subjects as small as 5 or 3, why even bother to differentiate between the true and unbiased variance? The answer is that, within the context of descriptive statistics on samples with n greater than 30, the numerical differences between these two kinds of variance are indeed negligible. However, within the framework of statistical tests of significance, the concept of the degrees of freedom is central, and the use of the unbiased variance is necessary. There, the degrees of freedom, ν, often equal k − 1 or n − k, where k refers to the number of groups in an experimental design.
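A Monte Carlo experiment of this general kind can be sketched with Python's standard random module. This is not the original simulation: the function name simulate, the seed, and the 5,000 replications per sample size are assumptions chosen to keep the run short.

```python
import random

def simulate(n, replications, rng):
    """Average the true and the unbiased variance over many samples of size n."""
    total_true = 0.0
    total_unbiased = 0.0
    for _ in range(replications):
        # Draw n values with expected mean 0 and expected variance 1.
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        mean = sum(sample) / n
        sum_sq = sum((v - mean) ** 2 for v in sample)
        total_true += sum_sq / n            # divisor n
        total_unbiased += sum_sq / (n - 1)  # divisor n - 1
    return total_true / replications, total_unbiased / replications

rng = random.Random(2006)  # arbitrary fixed seed so that the run is repeatable
for n in (100, 30, 10, 5, 3):
    avg_true, avg_unbiased = simulate(n, 5000, rng)
    print(f"n={n:3d}  true={avg_true:.3f}  unbiased={avg_unbiased:.3f}")
```

The averaged unbiased variance should stay close to the expected value of 1 for every n, while the averaged true variance should fall toward (n − 1)/n, which is about .67 for n = 3 and .99 for n = 100.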

 

References