The description of a variable
usually begins with the specification of its single most representative value,
often called the measure of location, or central tendency. There are several
measures for this statistic; we will limit our discussion to the mean and the
median.
The
concept of the arithmetic mean is a very old one, formulated by a group of
pre-Socratic philosophers, the Pythagoreans. The Pythagoreans were interested,
among other things, in the numerical relationships governing the harmony in
music. They originally described the arithmetic mean in a treatise On Music. This first description of the
mean involved only two numbers. The mean was defined as a quantity that exceeds
the smaller value by the same amount as the larger value exceeds the mean.
Some historians maintain that the median
was introduced by Gauss in 1816. However, it was Fechner who, around 1878,
called the attention of the scientific community to this concept. The reason for
symbolizing the median by the letter C is that Fechner called the median Centralwerth, the central value of an
ordered series. Fechner also described the relationship between the mean and the
median in asymmetric distributions.
The arithmetic mean is a measure
of central tendency commonly referred to as an average. The mean is the sum of scores, divided by their number.
Consider a variable X, indexing the scores of five subjects on a scale
measuring their liking of poetry. The subjects responded to the question
I like poetry
Responses of the subjects,
answering the question 'I like poetry' by using a five-step rating scale, are
recorded as
[Table: recorded responses of the five subjects]
The mean of variable X can be
computed by using the formula
$$M = \frac{\sum X}{n}$$
where M denotes the mean of variable X,
and n is the number of subjects. The
Greek capital letter sigma indicates that values of variable X should be summed. For the example, the
mean equals 15 / 5 = 3.
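The formula can be sketched in a few lines of Python. The recorded responses are not reproduced in this text, so the five scores below are hypothetical values chosen only because they fit a five-step scale and sum to 15; the function name `arithmetic_mean` is likewise an illustrative choice.

```python
def arithmetic_mean(scores):
    """Return the sum of the scores divided by their number (M = sum(X) / n)."""
    if not scores:
        raise ValueError("at least one score is required")
    return sum(scores) / len(scores)


# Hypothetical responses of five subjects (assumed values, sum = 15).
x = [1, 2, 3, 4, 5]
m = arithmetic_mean(x)
print(m)  # 3.0
```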
The median, signified by the
capital letter C, is defined as that point below which fifty percent of the
cases fall. In other words, the median represents the midpoint of an ordered
series. When the scores are not equally distributed along the whole range of a
variable, the median is likely a more appropriate measure of the central
tendency than the mean. Consider the ordered distribution of scores [1 2 3 4
10]. To compute the median, count simultaneously from both sides of this series
toward the middle. If the number of scores, n, is odd, as in this example, then
the median is the value in the series where both counts meet. For our example,
the median is 3.
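The counting procedure for an odd number of scores might be sketched as follows. This is a minimal illustration, not a general-purpose routine; the function name `median_odd` is an assumption made for the example, and the data are the scores [1 2 3 4 10] from the text.

```python
def median_odd(scores):
    """Median of an odd number of scores, found by counting inward from both ends."""
    ordered = sorted(scores)
    if len(ordered) % 2 == 0:
        raise ValueError("this sketch handles an odd number of scores only")
    low, high = 0, len(ordered) - 1
    # Step one position inward from each side until both counts meet.
    while low < high:
        low += 1
        high -= 1
    return ordered[low]


scores = [1, 2, 3, 4, 10]
c = median_odd(scores)
m = sum(scores) / len(scores)
print(c)  # 3
print(m)  # 4.0
```

For these scores the single value of 10 pulls the mean up to 4, while the median stays at 3.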
When a distribution contains a few extremely high or extremely low scores, the
mean is biased by these outermost values and the median is a better statistic,
as shown in the figure below.
[Figure: the mean and the median in a distribution with extreme scores]
An example might be a distribution
of salaries within a corporation where a few top managers get very high salaries.
In this case, the arithmetic mean is biased upwards and the median better
reflects the typical salary within the organization.
If the number of scores in the distribution is even, the median is the middle
value interpolated from the scores adjacent to the theoretical midpoint of the
distribution. This interpolation is usually accomplished by averaging the two
adjacent scores, but other procedures, such as the geometric mean or graphic
interpolation of the observed trend, may be used. Consider a data set [3 1 2 4].
To compute the median, first order the distribution, [1 2 3 4], and then
average the two adjacent middle values (2 and 3). The median of this
distribution equals 2.50.
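Both cases, odd and even, can be combined in one short function. The sketch below assumes the averaging convention for an even number of scores (the other procedures mentioned above are not covered), and compares the result with `statistics.median` from the Python standard library, which uses the same convention.

```python
import statistics


def median(scores):
    """Median of a list of scores; averages the two middle values when n is even."""
    ordered = sorted(scores)
    n = len(ordered)
    if n == 0:
        raise ValueError("at least one score is required")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


data = [3, 1, 2, 4]
print(median(data))             # 2.5
print(statistics.median(data))  # 2.5
```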
Measures of central tendency are
fundamental statistical indices. While the median is used primarily within the
confines of descriptive statistics, the mean is universally used within the
general linear model and is an integral part of most statistical procedures.
When a distribution is symmetric, the mean and the median coincide. If the
distribution is not symmetric, as is often the case, the mean and median differ
with respect to distances between the center of the distribution and its
individual values.
In asymmetric distributions, if the center of the distribution is defined by the
arithmetic mean, M, the sum of squared distances
between the center of the distribution and its individual values is as small as
possible. For the distribution [1 2 3 4 10] introduced earlier, whose mean is 4,
the sum of squared distances from the mean is 50 and the sum of absolute
distances from the mean is 12.
If the center of an asymmetric
distribution is defined by the median, C, the sum of absolute distances between
the center of the distribution and its individual
values is as small as possible. For the distribution [1 2 3 4 10], whose median
is 3, the sum of squared distances from the median is 55, while the sum of
absolute distances from the median is 11.
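The four sums quoted above can be checked numerically. The sketch below assumes the example refers to the distribution [1 2 3 4 10] used earlier, which reproduces the quoted values; the helper names are illustrative.

```python
def sum_squared_distances(scores, center):
    """Sum of squared distances between each score and the given center."""
    return sum((x - center) ** 2 for x in scores)


def sum_absolute_distances(scores, center):
    """Sum of absolute distances between each score and the given center."""
    return sum(abs(x - center) for x in scores)


scores = [1, 2, 3, 4, 10]
mean = sum(scores) / len(scores)            # 4.0
median = sorted(scores)[len(scores) // 2]   # 3 (odd number of scores)

print(sum_squared_distances(scores, mean))     # 50.0
print(sum_absolute_distances(scores, mean))    # 12.0
print(sum_squared_distances(scores, median))   # 55
print(sum_absolute_distances(scores, median))  # 11
```

The mean gives the smaller sum of squared distances (50 versus 55), and the median gives the smaller sum of absolute distances (11 versus 12).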
The universal acceptance of the
arithmetic mean is due to its fundamental property that it is the measure of
central tendency that is best in the least squares
sense, a criterion used by most methods of the general linear model. The mean
minimizes the sum of squared distances between the values of the distribution
and itself. The median minimizes the sum of absolute distances between the
values of the distribution and itself. If the distribution is symmetric, the
mean and the median coincide, and both the absolute distances and the squared
distances from the center of the distribution are as small as possible. If the
distribution is asymmetric, the median is, as a descriptive statistic, superior
to the arithmetic mean.
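The least squares property can be verified with a short derivation. The notation is introduced here for illustration only: c denotes a candidate center, X_i the individual scores, and S(c) the sum of squared distances.

$$S(c) = \sum_{i=1}^{n} (X_i - c)^2, \qquad \frac{dS}{dc} = -2 \sum_{i=1}^{n} (X_i - c) = 0 \;\Longrightarrow\; c = \frac{\sum_{i=1}^{n} X_i}{n} = M$$

Because the second derivative, 2n, is positive, this stationary point is a minimum. A similar argument applies to the sum of absolute distances: moving c past a score changes the slope of that sum by 2, and the slope is zero when as many scores lie below c as above it, which is the definition of the median, C.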