While the measures of central tendency convey information about the
commonalities of measured properties, the measures of variability quantify the
degree to which they differ. If the values of a variable are not all the same,
variability exists.
The measures of central tendency should be complemented by measures of
variability for the same reason that objective descriptions of events should
contain accounts of both centripetal and centrifugal forces, of consenting and
opposing opinions, of shared and conflicting views. The variability of data is
measured by the statistic called variance, symbolized by the lowercase Greek
letter sigma squared (σ²).
Within astronomy, the
coefficient of variance was formulated by von Andrae, Helmert, and Jordan
between 1872 and 1876 in terms of differences between values of a variable.
Karl Pearson introduced this concept into statistics in a series of articles
published in Philosophical Transactions
and Biometrika between 1896 and 1906.
Pearson also coined the usage of the lowercase Greek letter sigma squared to
signify the variance. Variance is the central concept of statistical theory.
Variance in terms of differences between values of a variable
Consider all differences
between values of a variable X [1 2 3 4 5], as shown below
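$$
X_i - X_j =
\begin{bmatrix}
0 & -1 & -2 & -3 & -4 \\
1 & 0 & -1 & -2 & -3 \\
2 & 1 & 0 & -1 & -2 \\
3 & 2 & 1 & 0 & -1 \\
4 & 3 & 2 & 1 & 0
\end{bmatrix}
$$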
The above matrix of differences is called a skew-symmetric matrix, since each
positive element has a corresponding negative element of the same magnitude.
If we are interested only in the distances between the values of the variable
X and not in the direction of these distances, the above matrix can be
triangularized as
$$
\begin{bmatrix}
0 & 1 & 2 & 3 & 4 \\
  & 0 & 1 & 2 & 3 \\
  &   & 0 & 1 & 2 \\
  &   &   & 0 & 1 \\
  &   &   &   & 0
\end{bmatrix}
$$
and the variance can be
defined as an average of these squared distances.
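$$
\begin{bmatrix}
0 & 1 & 4 & 9 & 16 \\
  & 0 & 1 & 4 & 9 \\
  &   & 0 & 1 & 4 \\
  &   &   & 0 & 1 \\
  &   &   &   & 0
\end{bmatrix}
$$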
For the example, the sum of the squared distances in the above matrix is 50.
Averaging over the n² = 25 cells of the full 5 × 5 matrix gives 50/25, so the
variance of the variable X [1 2 3 4 5] equals 2.
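The pairwise definition can be checked numerically. The following is a minimal Python sketch, not part of the original text; the function name `variance_from_pairs` is an assumption chosen for illustration. It sums the squared distances over each distinct pair of values (the triangular matrix) and divides by n squared.

```python
from itertools import combinations


def variance_from_pairs(values):
    """Sum of squared distances over all distinct pairs, divided by n squared."""
    n = len(values)
    # combinations() yields every unordered pair exactly once,
    # which corresponds to the triangular matrix of distances.
    total = sum((a - b) ** 2 for a, b in combinations(values, 2))
    return total / n ** 2


X = [1, 2, 3, 4, 5]
print(variance_from_pairs(X))  # 50 / 25 = 2.0
```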
Variance in terms of differences between values of a variable and the mean of the variable
Remember that the Pythagoreans defined the mean as a quantity that exceeds the
smaller value by the same amount as the larger value exceeds the mean. Thus it
seems plausible to assume that we would get the same variance of a variable if
we averaged all squared differences between its values and their mean. For the
example, the differences between the values of the variable X [1 2 3 4 5] and
its mean [3] are
$$
X_i - M =
\begin{bmatrix}
-2 & -2 & -2 & -2 & -2 \\
-1 & -1 & -1 & -1 & -1 \\
0 & 0 & 0 & 0 & 0 \\
1 & 1 & 1 & 1 & 1 \\
2 & 2 & 2 & 2 & 2
\end{bmatrix}
$$
Since the arithmetic mean is
a constant and not a variable, columns of the above matrix are the same. The
value of the variance thus should be the same if we consider only a single
column of the above matrix
$$
X - M =
\begin{bmatrix}
-2 \\ -1 \\ 0 \\ 1 \\ 2
\end{bmatrix}
$$
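and average its squared scores

| X | X − M | (X − M)² |
|---|---|---|
| 1 | −2 | 4 |
| 2 | −1 | 1 |
| 3 | 0 | 0 |
| 4 | 1 | 1 |
| 5 | 2 | 4 |
| Sum | 0 | 10 |

to obtain the same value of the variance (10 / 5 = 2) that we obtained by
averaging the squared differences between the values of the variable X
(50 / 25 = 2).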
The above table can be
simplified if we define deviation scores (deviations from the arithmetic mean)
as
$$
x = X - M
$$
Note the subtle difference in the above formula, which is a prototype of the
transformation formulae. In contrast to formulae for the computation of
statistical indices such as the arithmetic mean, which yield a single number,
the above formula defines the changes in the values of a variable throughout
its whole range. Using the above definition of the deviation scores, we can
rewrite the above table as

| X | x | x² |
|---|---|---|
| 1 | −2 | 4 |
| 2 | −1 | 1 |
| 3 | 0 | 0 |
| 4 | 1 | 1 |
| 5 | 2 | 4 |
| Sum | 0 | 10 |
Thus variance can be defined
as
$$
\sigma^2 = \frac{\sum x^2}{n}
$$
and its square root, called
the standard deviation, as
$$
\sigma = \sqrt{\frac{\sum x^2}{n}}
$$
For the example, the
variance equals 2 and the standard deviation equals 1.41.
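The deviation-score definition above can be traced step by step in code. This is a small illustrative Python sketch under the chapter's example data; the variable names are assumptions, with lowercase `x` standing for the deviation scores and `M` for the mean.

```python
import math

X = [1, 2, 3, 4, 5]
n = len(X)

M = sum(X) / n                          # arithmetic mean: 3.0
x = [value - M for value in X]          # deviation scores: X - M
variance = sum(d ** 2 for d in x) / n   # average of squared deviations
std_dev = math.sqrt(variance)           # square root of the variance

print(x)                    # [-2.0, -1.0, 0.0, 1.0, 2.0]
print(sum(x))               # 0.0, deviation scores sum to zero
print(variance)             # 2.0
print(round(std_dev, 2))    # 1.41
```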
In statistics, values of
variables within the data set are called the obtained or raw scores.
Subtraction of the mean transforms the obtained scores into deviation scores.
This linear transformation preserves all properties of the original set of raw
scores save the mean. Deviation scores sum to zero and thus their mean is
always zero. The transformation to deviation scores changes obtained scores
below the mean into negative numbers, scores above the mean into positive
numbers, and anchors the mean of the distribution at the zero point.
The concept of deviation scores from the arithmetic mean can be contrasted
with scores defined in terms of distances from an absolute zero, as typified
by the Kelvin scale of temperature. Deviation from the mean has definite
merits in the social sciences, where it is typically difficult to find an
absolute zero point for most measured properties. Is there such a thing as zero
dominance, zero love, or zero hate? These are things to contemplate.
Transformation of the obtained scores to deviation scores is one of the
essential procedures of data analysis. Typically, the arithmetic mean is
computed first. Once the mean is known, it is subtracted (removed) from the
data, and the next most important index, the variance, is computed.
In a preceding section, we
introduced the definitional formula for variance as
$$
\sigma^2 = \frac{\sum x^2}{n}
$$
where
$$
x = X - M
$$
Substituting the right side
of the above equation into the formula for the computation of variance results
in
$$
\sigma^2 = \frac{\sum (X - M)^2}{n}
$$
This formula can be expanded
as a familiar algebraic formula
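$$
(a - b)^2 = a^2 - 2ab + b^2
$$
where X and M are used in lieu of a and b. Also, note that the summation
signs are associated with the expanded terms as
$$
\sigma^2 = \frac{\sum X^2}{n} - 2M\frac{\sum X}{n} + \frac{\sum M^2}{n}
$$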
The middle term on the right-hand side of the above equation contains the mean
twice: once written as the sum of the obtained scores divided by n, and once
as M. Substituting M for the sum of the obtained scores divided by n
simplifies the above expression to
$$
\sigma^2 = \frac{\sum X^2}{n} - 2M^2 + \frac{\sum M^2}{n}
$$
The formula for the
arithmetic mean, written in complete notation, is
$$
M = \frac{1}{n}\sum_{i=1}^{n} X_i
$$
The last term of the
equation we try to simplify, written in complete notation as
$$
\frac{1}{n}\sum_{i=1}^{n} M^2
$$
differs from the formula for
the arithmetic mean in one important respect. While the formula for the
arithmetic mean states that the variable
X should be summed and averaged over its whole range, the above formula states
that the square of the mean, a constant
value, should be summed and averaged over the whole range of the variable X.
This means, in terms of the current example, that the square of the mean, a
constant number, should be summed five times, as 9 + 9 + 9 + 9 + 9. This
operation can be simplified as 5(9). Thus, we can write the above expression as
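$$
\frac{1}{n}\sum_{i=1}^{n} M^2 = \frac{nM^2}{n} = M^2
$$
and the equation
$$
\sigma^2 = \frac{\sum X^2}{n} - 2M^2 + \frac{\sum M^2}{n}
$$
can be simplified as
$$
\sigma^2 = \frac{\sum X^2}{n} - 2M^2 + M^2
$$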
Thus, the variance can be
computed directly from the obtained scores as
$$
\sigma^2 = \frac{\sum X^2}{n} - M^2
$$
which can be also written as
$$
\sigma^2 = \frac{\sum X^2}{n} - \left(\frac{\sum X}{n}\right)^2
$$
For the example of the
variable X [1 2 3 4 5], the computation of variance from the data obtained from
five subjects is outlined as
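| X | X² |
|---|---|
| 1 | 1 |
| 2 | 4 |
| 3 | 9 |
| 4 | 16 |
| 5 | 25 |
| ΣX = 15 | ΣX² = 55 |

$$
M = \frac{15}{5} = 3, \qquad \sigma^2 = \frac{55}{5} - 3^2 = 11 - 9 = 2
$$
The variance calculated directly from the obtained scores for the variable X
is 11 − 9, which equals 2, a value identical to that obtained by computing the
variance with the deviation score formula.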
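The raw-score route can be compared with the deviation-score route in a few lines. The sketch below is illustrative only; the function names `variance_from_raw` and `variance_from_deviations` are assumptions introduced here, not names from the text.

```python
def variance_from_raw(values):
    """True variance from obtained scores: mean of squares minus squared mean."""
    n = len(values)
    mean = sum(values) / n
    return sum(v ** 2 for v in values) / n - mean ** 2


def variance_from_deviations(values):
    """True variance from deviation scores: mean of squared deviations."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n


X = [1, 2, 3, 4, 5]
print(variance_from_raw(X))         # 11.0 - 9.0 = 2.0
print(variance_from_deviations(X))  # 10 / 5 = 2.0
```

With floating-point data of large magnitude, the raw-score form subtracts two nearly equal numbers and can lose precision, so the deviation-score form is usually the safer choice in software.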
As you may observe by
looking at the keyboard of a typical scientific hand calculator, there are two
kinds of variance. The variance defined as
$$
\sigma^2 = \frac{\sum x^2}{n}
$$
is called the population, or true, variance and can be contrasted with the
variance defined as
$$
s^2 = \frac{\sum x^2}{n - 1}
$$
called the sample or
unbiased variance estimate. The computations of both true and unbiased variance
coefficients are illustrated below.
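| X | x | x² |
|---|---|---|
| 1 | −2 | 4 |
| 2 | −1 | 1 |
| 3 | 0 | 0 |
| 4 | 1 | 1 |
| 5 | 2 | 4 |
| ΣX = 15 | Σx = 0 | Σx² = 10 |

True variance: 10 / 5 = 2.0
Unbiased variance: 10 / 4 = 2.5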
For the example of variable X [1 2 3 4 5], the variance was computed either as
2.0 or as 2.5, depending on whether the sum of squared deviation scores (10)
was divided by n, equal to 5 in the example, or by n-1, equal to 4. In the
former case, the obtained variance is the true variance (10/5 = 2.0). In the
latter case, the variance (10/4 = 2.5) is 'unbiased'.
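The same two quantities are available in Python's standard library, which mirrors the two calculator keys. The following is a brief sketch; the variable names are chosen here for illustration.

```python
import statistics

X = [1, 2, 3, 4, 5]

# pvariance divides the sum of squared deviations by n (the true variance).
true_variance = statistics.pvariance(X)

# variance divides the sum of squared deviations by n - 1 (the unbiased estimate).
unbiased_variance = statistics.variance(X)

print(true_variance, unbiased_variance)
```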
The n-1 term in the denominator of the unbiased variance formula is referred
to as degrees of freedom, signified as df or ν (Greek letter nu). The notion
of the degrees of freedom is related to the
concept of the random normal variable. To illustrate the notion of the random
normal variable, let us consider the numbers 1, 2, 3, 4, 5, assigned to Allen,
Beth, Cathy, Debra, and Edgar at the beginning of our discussion. No one ever
actually asked these five subjects whether they liked poetry. In fact, these
subjects are purely fictitious and the assignment of the numbers 1, 2, 3, 4,
and 5 to each subject was done because of computational convenience. Don't be
misled by the ordinality of the numbers 1, 2, 3, 4, 5. In a recent lottery, the
winning numbers were 3, 4, 5, 6, 7, 17 and 34. The point here is that we were
free to select these numbers at will, and, in this instance, the number of
degrees of freedom we had equaled n, the number of cases. Now, imagine that
this book had been written using only deviation scores. The authors of this
hypothetical book could assign the numbers 1, 2, 3, and 4 to Allen, Beth,
Cathy, and Debra; so far, they would be free to assign to these fictitious
subjects any numbers they wished. In Edgar's case, however, they would no
longer be free: because deviation scores must sum to zero and the first four
numbers sum to 10, Edgar would have to be assigned the number -10. Thus, the
number of degrees of freedom associated with the deviation scores is n-1 and,
for this example, equals 4.
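The constraint in Edgar's case can be expressed in a few lines of code. This is an illustrative Python sketch of the argument above; the variable names are assumptions introduced here.

```python
# Deviation scores chosen freely for Allen, Beth, Cathy, and Debra.
free_scores = [1, 2, 3, 4]

# Edgar's score is not free: deviation scores must sum to zero,
# so it is fully determined by the four scores already chosen.
last_score = -sum(free_scores)

deviation_scores = free_scores + [last_score]
degrees_of_freedom = len(deviation_scores) - 1

print(last_score)             # -10
print(sum(deviation_scores))  # 0
print(degrees_of_freedom)     # 4
```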
In the previous sections, the
arithmetic mean was defined as
$$
M = \frac{\sum X}{n}
$$
and the true variance as
$$
\sigma^2 = \frac{\sum X^2}{n} - M^2
$$
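Substituting the right side of the equation for the mean into the above
formula results in
$$
\sigma^2 = \frac{\sum X^2}{n} - \left(\frac{\sum X}{n}\right)^2
         = \frac{n\sum X^2 - \left(\sum X\right)^2}{n^2}
$$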
The above formula can be
changed into a formula expressing the unbiased variance by substituting the n-1
expression for one of the two ns in the denominator, as
$$
s^2 = \frac{n\sum X^2 - \left(\sum X\right)^2}{n(n - 1)}
$$
The above formula is the prototype of the “sum-of-squares mean squares”
approach to statistics, as this approach stresses that all that is needed are
the sums and the sums of the squared values of the variable. For the example
of the variable X [1 2 3 4 5], squares of its values can be computed as

| X | X² |
|---|---|
| 1 | 1 |
| 2 | 4 |
| 3 | 9 |
| 4 | 16 |
| 5 | 25 |
| ΣX = 15 | ΣX² = 55 |

For the example, the unbiased variance can be computed by the above formula as
5 times 55 (275), minus 15 squared (225), divided by 5 times 4 (20), i.e.,
(275 − 225)/20, which equals 2.5. However, this initial simplification of the
computational operations obscures the statistical concepts and hinders
understanding of the meaning of the statistical analysis of data.
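The sums-only computation can be written so that the data are touched just to accumulate the sum and the sum of squares. A minimal Python sketch follows; the function name `unbiased_variance_from_sums` is an assumption made for illustration.

```python
def unbiased_variance_from_sums(values):
    """Unbiased variance using only n, the sum, and the sum of squares."""
    n = len(values)
    sum_x = sum(values)                    # 15 for the example
    sum_x2 = sum(v * v for v in values)    # 55 for the example
    return (n * sum_x2 - sum_x ** 2) / (n * (n - 1))


X = [1, 2, 3, 4, 5]
print(unbiased_variance_from_sums(X))  # (275 - 225) / 20 = 2.5
```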
What is the relationship
between the true and unbiased coefficients of variance? To answer this
question, let us form the ratio of unbiased and true variances as
$$
\frac{s^2}{\sigma^2} = \frac{\sum x^2 / (n - 1)}{\sum x^2 / n}
$$
The above formula can be
simplified as
$$
\frac{s^2}{\sigma^2} = \frac{n}{n - 1}
$$
Thus, the translation from
the true to the unbiased form can be accomplished as
$$
s^2 = \frac{n}{n - 1}\,\sigma^2
$$
and the translation of the
unbiased variance into its true form as
$$
\sigma^2 = \frac{n - 1}{n}\,s^2
$$
For the current example, the unbiased variance (2.5) can be obtained from the
true variance (2) as (5/4)(2) = 2.5, and the true variance can be obtained
from the unbiased variance as (4/5)(2.5) = 2. The ratios of degrees of freedom
to n and of n to degrees of freedom, which translate the variance from the
unbiased form to the true form and vice versa, are frequently encountered in
statistical analyses.
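The two translations can be sketched as a pair of helper functions. The names `true_to_unbiased` and `unbiased_to_true` are hypothetical, introduced here only to illustrate the ratios described above.

```python
def true_to_unbiased(true_variance, n):
    """Multiply the true variance by n / (n - 1)."""
    return true_variance * n / (n - 1)


def unbiased_to_true(unbiased_variance, n):
    """Multiply the unbiased variance by (n - 1) / n."""
    return unbiased_variance * (n - 1) / n


print(true_to_unbiased(2, 5))    # 2.5
print(unbiased_to_true(2.5, 5))  # 2.0
```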
Summary
The variance formulae summarized
here are of fundamental importance and will be repeatedly encountered in the
course of our narrative. Formulae describing the transformation of variables
from the obtained into deviation scores and back from the deviation scores to
the obtained scores are
| Transformation | Formula |
|---|---|
| Obtained Scores to Deviation Scores | x = X − M |
| Deviation Scores to Obtained Scores | X = x + M |
Formulae for the true
variance expressed in the obtained and deviation scores are summarized as
| | Deviation Scores | Obtained Scores |
|---|---|---|
| True Variance | σ² = Σx² / n | σ² = ΣX² / n − M² |
Formulae for the unbiased
variance expressed in the obtained and deviation scores are summarized as
| | Deviation Scores | Obtained Scores |
|---|---|---|
| Unbiased Variance | s² = Σx² / (n − 1) | s² = (nΣX² − (ΣX)²) / (n(n − 1)) |
Inspection of the above formulae shows that the introduction of the degrees of
freedom complicates the variance expressions. Parsimonious presentations of
statistical theory use the true variance throughout, translating the true
variance into an unbiased form only when necessary:

| | Translation |
|---|---|
| True Variance | σ² = ((n − 1) / n) s² |
| Unbiased Variance | s² = (n / (n − 1)) σ² |