Krus, D. J. (2006) The true and the unbiased variance. Journal of Visual Statistics at www.VisualStatistics.net (February 11, 2006)
The true and the unbiased variance
Inspection of the keyboard of a scientific calculator will often show a key engraved with σ² and a key engraved with s². Let us call the value returned by the σ² key the true variance and the value returned by the s² key the unbiased variance. Entering a few numbers and pressing the s² key will return a number (the unbiased variance) that is greater than the number (the true variance) you get when pressing the σ² key. The more numbers you enter, the smaller this difference will be. When you enter more than 30 numbers, this difference will be negligible. However, to get identical values of these two kinds of variance, you would have to enter an infinitely large number of numbers.
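The shrinking difference can be checked numerically. The following is a minimal sketch, assuming that Python's standard `statistics` module stands in for the two calculator keys (`pvariance` divides by n, `variance` divides by n − 1); the helper name `relative_gap` and the test data (the integers 0 to n − 1) are illustrative choices, not part of the original article.

```python
import statistics

def relative_gap(values):
    """Fraction by which the unbiased variance exceeds the true variance."""
    sigma2 = statistics.pvariance(values)  # divides by n (the "σ² key")
    s2 = statistics.variance(values)       # divides by n - 1 (the "s² key")
    return (s2 - sigma2) / sigma2

# Illustrative data only: the integers 0, 1, ..., n - 1 for several sizes n.
for n in (2, 5, 30, 1000):
    data = list(range(n))
    print(n, round(relative_gap(data), 4))
```

For any data with nonzero variance the gap equals 1/(n − 1), so it is 100% for two numbers, about 3.4% for 30 numbers, and about 0.1% for 1000 numbers.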
Computationally,
these two kinds of variance are obtained by dividing a certain quantity,
described later, by either n, indicating the size of the population measured,
or by the degrees of freedom, associated with the size of the sample from which
the parameters of a population are estimated, sometimes, but not always, equal
to n - 1. The concept of true variance is often poorly understood, as some
think that a population has to be infinitely large or that the variance must
be computed by using the n -
Mathematical formulae defining the true and the unbiased variance use the Greek letter Σ, which means "sum all values of a variable." The variable in this context is the lowercase Latin character x, which denotes the deviation scores. The number of values of the variable X is signified as n. The values of the variable X are the obtained values, sometimes also called the obtained scores, i.e., values obtained from quantification of properties of some entity or some attribute. The deviation values, also called the deviation scores (values that deviate from the mean), are obtained from the obtained scores X by subtracting the mean, M, from each of them; i.e., x = X − M. The convention of signifying the deviation scores by a lowercase letter stems from the notion that the obtained scores, signified by capital letters, are 'diminished' in size by subtraction of the arithmetic mean. The true variance of a variable X is defined as
$$\sigma^2 = \frac{\sum x^2}{n}$$
and the unbiased variance of a variable X is defined as
$$s^2 = \frac{\sum x^2}{n - 1}$$
where the expression n − 1 signifies the number of degrees of freedom, sometimes also signified by the Greek character ν (nu). For instance, a variable X = [ 0, 1, 2, 3 ] (obtained scores) can be transformed into the vector of deviation scores x = [ −1.5, −.5, .5, 1.5 ] by subtracting the mean of the variable X (1.5).
The sum of the deviation scores must be zero. The squared deviation scores have to be summed (5) and divided by n (4) to obtain the true variance (1.25) or divided by n − 1 (3) to obtain the unbiased variance (1.67).
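The worked example can be reproduced in a few lines. This is a minimal sketch of the computation described above; the function names are illustrative assumptions, and the standard-library functions `statistics.pvariance` and `statistics.variance` are used only as a cross-check.

```python
import statistics

def deviation_scores(obtained):
    """Return the deviation scores x = X - M for the obtained scores X."""
    mean = sum(obtained) / len(obtained)
    return [value - mean for value in obtained]

def true_variance(obtained):
    """Sum of squared deviation scores divided by n."""
    x = deviation_scores(obtained)
    return sum(d * d for d in x) / len(obtained)

def unbiased_variance(obtained):
    """Sum of squared deviation scores divided by n - 1."""
    x = deviation_scores(obtained)
    return sum(d * d for d in x) / (len(obtained) - 1)

X = [0, 1, 2, 3]
print(deviation_scores(X))            # [-1.5, -0.5, 0.5, 1.5]
print(sum(deviation_scores(X)))       # 0.0
print(true_variance(X))               # 1.25
print(round(unbiased_variance(X), 2)) # 1.67

# Cross-check against the standard library.
print(statistics.pvariance(X), round(statistics.variance(X), 2))
```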
The variance can be easily changed from the true variance to the unbiased variance, as
$$s^2 = \frac{n}{n - 1}\,\sigma^2$$
and from the unbiased variance to the true variance, as
$$\sigma^2 = \frac{n - 1}{n}\,s^2$$
For the example, the true variance (1.25) can be changed to the unbiased variance as (4/3)(1.25) = 1.67 and the unbiased variance (1.67) can be changed to the true variance as (3/4)(1.67) = 1.25.
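The two conversions amount to multiplying by n/(n − 1) or by (n − 1)/n. A minimal sketch follows; the function names and the guard against n < 2 are assumptions added for illustration.

```python
def true_to_unbiased(true_var, n):
    """Rescale a variance computed with divisor n to divisor n - 1."""
    if n < 2:
        raise ValueError("at least two values are needed")
    return true_var * n / (n - 1)

def unbiased_to_true(unbiased_var, n):
    """Rescale a variance computed with divisor n - 1 to divisor n."""
    if n < 2:
        raise ValueError("at least two values are needed")
    return unbiased_var * (n - 1) / n

s2 = true_to_unbiased(1.25, 4)
print(round(s2, 2))                       # 1.67
print(round(unbiased_to_true(s2, 4), 2))  # 1.25
```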
The n − 1 term in the denominator of the unbiased variance formula is referred to as the degrees of freedom, signified as df or by the Greek letter ν. The notion of the degrees of freedom is related to the concept of the random normal variable. To illustrate this concept, let us consider the numbers 0, 1, 2, 3 assigned to four subjects in our illustrative example. These subjects are fictitious, as are the numbers 0, 1, 2, and 3. Don't be misled by their orderly sequence; in a recent lottery, the winning numbers were 3, 4, 5, 6, 7, 17, and 34.
The point here is that we were free to select these numbers at will, and, in this instance, the number of degrees of freedom we had equaled n, the number of cases. This is also why the arithmetic mean is computed by dividing the sum of the obtained values by n and not by n − 1. Now, imagine that this example was written using only the deviation scores. The authors of this hypothetical example could have assigned the numbers 0, 1, 2 to the first three subjects; so far, they were free to assign to these fictitious subjects any numbers they wished. However, in the case of the fourth subject, they were no longer free to do so: the deviation scores must sum to zero, and thus the only possible number to assign to the fourth subject was −3. Thus, the number of degrees of freedom associated with the deviation scores is n − 1 and, for this example, equals 3.
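The zero-sum constraint can be made concrete in code. This is a minimal sketch, assuming a hypothetical helper `complete_deviation_scores` that takes the freely chosen deviation scores and appends the single value the constraint forces.

```python
def complete_deviation_scores(free_scores):
    """Append the one deviation score forced by the zero-sum constraint."""
    forced = -sum(free_scores)  # the last score is not free to vary
    return list(free_scores) + [forced]

scores = complete_deviation_scores([0, 1, 2])
print(scores)           # [0, 1, 2, -3]
print(sum(scores))      # 0
print(len(scores) - 1)  # 3 freely chosen values, i.e. n - 1 degrees of freedom
```

Whatever three values are chosen first, the fourth is fully determined, which is why only n − 1 of the n deviation scores carry independent information.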
Simulations
using the random number generators are often called the