Cruise Scientific        Visual Statistics Studio

Normal Distribution

In the preceding chapters, we often used as an example a variable X equal to [1 2 3 4 5]. This example was used primarily for its brevity and simplicity. In the analysis of real data, you are more likely to encounter variables with repeating values, such as [1 2 2 2 3 3 3 3 3 4 4 4 5]. Frequencies of repeated values can be plotted as a histogram approximating a binomial distribution, discussed in detail in the appendices. In turn, the binomial distribution is idealized by the normal distribution.
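The frequencies of the repeated values in the example variable can be tallied in a few lines. The following is only a minimal sketch in Python (the language and the text rendering of the histogram are our assumptions, not part of the original material):

```python
from collections import Counter

# Example variable with repeating values, taken from the text above.
x = [1, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 5]

# Count how often each distinct value occurs.
frequencies = Counter(x)

# Print a crude text histogram: one asterisk per occurrence.
for value in sorted(frequencies):
    print(f"{value} | {'*' * frequencies[value]}")
```

The counts rise toward the middle value and fall off symmetrically on both sides, which is the shape the rest of this chapter idealizes.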

Ideal Shape

Most distributions of test scores show remarkable similarities. If plotted as a histogram, typically there are relatively few low and high scores; most scores are clustered around the mean with score frequencies decreasing toward both ends of the distribution. As the number of scores increases, the histogram begins to look like a bell or the hump of a camel. The idealized shape of this bell distribution was described by Gauss as the 'curve of errors', later called the 'normal distribution.'

Curve of Errors

Gauss served for many years as the director of the Goettingen astronomical observatory. His attention was drawn to a report that the director of the Greenwich observatory had fired his assistant for reporting observations of star transitions that differed from his own readings. Accurate readings of these transitions were critical for the determination of sidereal time, the time based upon the axial and orbital rotation of the earth with reference to the background of the stars. In Gauss's time, the exact determination of sidereal time was of crucial importance for maritime navigation. An error of a few seconds would translate to an error of several nautical miles when determining the longitude of a ship's position.

Gauss suspected that the different readings of sidereal transitions were caused by individual differences in the reaction time of the observers. In this sense, they are akin to the distribution of test scores.

Model of the Tautology of Results 

Gauss conceptualized the normal distribution as a curve of errors. In this capacity, the normal distribution serves as the prototype benchmark for tests of statistical significance. Within this context, the normal distribution is used as a model of the tautology of results. A statement is tautological if it is true because it provides for all logical possibilities. To the degree that the obtained results are not tautological, they acquire meaning. If an observed difference markedly departs from the model of all possible differences, it is interpreted as unique, significant, and meaningful.

Another major domain of application of the normal distribution was introduced by Quetelet, who used the normal distribution as the basis of his concept of the 'average man' ('l'homme moyen'). In this sense, the deviations of scores from the mean of the normal distribution are used to classify individuals based on interpretations of test scores.

The 'curve of errors' was renamed 'normal curve' by Karl Pearson. Writing about the normal distribution, Galton asserted that 'if the Greeks had known it, they would have deified it. It reigns with serenity and in complete self-effacement amidst the wildest confusion. The more huge the mob and the greater the apparent anarchy, the more perfect is its sway. It is the supreme Law of Unreason. Whenever a large sample of chaotic elements are taken in hand and marshaled in the order of their magnitude, an unsuspected and most beautiful form of regularity proves to be latent all along.'

 

Analytical Formulation of the Normal Distribution

In 1809, Gauss published the analytical formula of the normal distribution in his Theoria motus corporum coelestium. For standard scores z, the formula reads

$$f(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}$$

In the above formula, both pi and e are constants: pi equals about 3.14 and e equals about 2.718. The formula can also be written as

$$f(z) = (2\pi)^{-1/2} \exp\left(-\tfrac{z^2}{2}\right)$$

and plotted as a bell-shaped curve centered on z equal to zero.
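The height of the curve at any standard score can also be evaluated numerically. The sketch below assumes the usual standard normal density, exp(-z²/2) divided by the square root of 2·pi; the function name normal_density is ours and purely illustrative:

```python
import math


def normal_density(z):
    """Height (ordinate) of the standard normal curve at standard score z."""
    return math.exp(-z * z / 2.0) / math.sqrt(2.0 * math.pi)


# Evaluate the curve at a few standard scores on both sides of zero.
for z in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"z = {z:+.1f}  height = {normal_density(z):.4f}")
```

Plotting these heights against z, with a finer grid of z values, traces out the familiar bell shape.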

 

Properties of the Normal Distribution

The standard normal distribution has its maximum height (ordinate) at z equal to zero and is symmetrical about that ordinate. The curve changes from convex to concave at the points on the abscissa where z equals plus one and minus one. In its mathematical idealization, this curve stretches from negative infinity to positive infinity and covers a unit area. Integrating this area allows us to associate every z score with the area from minus infinity up to the specified z score, as shown in the table below.

Z        Area
-∞       .000
-3.00    .001
-2.33    .01
-2.00    .02
-1.65    .05
-1.28    .10
-1.00    .16
-.84     .20
-.52     .30
-.25     .40
.00      .50
.25      .60
.52      .70
.84      .80
1.00     .84
1.28     .90
1.65     .95
2.00     .98
2.33     .99
3.00     .999
+∞       1.00

For example, about 5% of the area under the normal curve is below a z score of -1.65.
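Areas of this kind can be reproduced without a printed table. The following sketch relies on the error function in Python's standard math module; the helper name area_below is our own, and the identity used (area = 0.5 · (1 + erf(z / √2))) is the standard relation between the error function and the cumulative normal area:

```python
import math


def area_below(z):
    """Area under the standard normal curve from minus infinity up to z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Recompute a few cumulative areas for selected z scores.
for z in (-3.00, -1.65, -1.00, 0.00, 1.00, 1.65, 3.00):
    print(f"z = {z:+.2f}  area = {area_below(z):.3f}")
```

Rounded to two or three decimals, the printed areas should agree with the tabled values for the same z scores.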


 

Another view of the normal distribution is in terms of its central areas. Almost 68% of the total area covered by the normal distribution is located between the z-scores of plus and minus one.

 

 

Approximately 50% of the area under the standard normal distribution is between the z-scores of plus and minus .67. Close to 95% of the total area of the normal distribution is between the z-scores of plus and minus two. Some of these central areas are shown in the table below.
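Central areas can be checked the same way. This is a minimal sketch, assuming the standard identity that the area between -z and +z equals erf(z / √2); the function name central_area is hypothetical:

```python
import math


def central_area(z):
    """Area under the standard normal curve between -z and +z."""
    return math.erf(z / math.sqrt(2.0))


# Central areas for the z scores quoted in the text, plus z = 3.
for z in (0.67, 1.00, 2.00, 3.00):
    print(f"between -{z:.2f} and +{z:.2f}: {central_area(z):.3f}")
```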

 

Why the Normal Distribution

There are many explanations for the ubiquity of the normal distribution. The majority of physical and mental traits tend to be distributed so as to approximate it. Stretching from minus to plus infinity and covering a unit area, interpreted as the probability of occurrence of the universe of traits or events it describes, the normal distribution is an ideal theoretical model of a world where most things are possible, but not all of them probable.

For example, flip a coin seven times. It is possible to obtain seven heads in a row. However, it is rare (improbable, not likely to happen). The probability of an event varies from infinitesimally small to almost certain, with absolute certainty and absolute doubt disappearing in either direction into plus or minus infinity. One is reminded in this respect of the passage from Tsao Hsueh Chin's classical Chinese novel Dream of the Red Chamber where Shih Yin approaches the Great Void Illusion Land, whose gate bears the inscription 'when the unreal is taken for the real, then the real becomes unreal.' However, within our only too real world, the normal distribution symbolizes the age-old strategy known to living beings to continue themselves and their genus. The logic here is that anything which is possible may and will be tried. An important corollary not to be forgotten is that the degree of environmental urgency is matched by the degree to which the response is usual or unusual, tried or untried, moderate or extreme.
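To put a number on the coin example above: assuming a fair coin and independent flips (our assumptions, since the text does not state them), the probability of each possible count of heads follows the binomial distribution. A short sketch, with the helper name prob_heads chosen only for illustration:

```python
from math import comb


def prob_heads(heads, flips=7):
    """Probability of exactly `heads` heads in `flips` tosses of a fair coin."""
    return comb(flips, heads) / 2 ** flips


# Full distribution of the number of heads in seven flips.
for heads in range(8):
    print(f"{heads} heads: {prob_heads(heads):.4f}")
```

Seven heads in a row is one outcome out of 2 to the 7th power, or 128, equally likely sequences, so it is possible but improbable, while three or four heads are the most likely results.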


True and Unbiased Variance

During the first decade of the 20th century, an interesting observation was made while simulating the standard normal distribution, defined as having a mean of zero and a standard deviation of one.

Over many trials, the means of random normal deviates indeed approximated the expected mean of zero. The standard deviations of the random normal deviates also approximated the expected standard deviation of one.

Biased vs. Unbiased Index

However, as the sample sizes became very small, the standard deviations of the random normal deviates were consistently less than one, even though their means correctly approximated the expected mean of zero. During these simulation experiments, the true (biased) variance was defined as

$$\sigma^2 = \frac{\sum (X - \bar{X})^2}{n}$$

The true standard deviation was defined as the square root of the above equation.

A question naturally arises whether some index other than the true variance could approximate the expected value better. Since the expected standard deviation was consistently underestimated only for small sample sizes, a prime candidate for a new (unbiased) variance index was a variance defined as

$$s^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$$

since division by a smaller value makes the value of the fraction larger. Decrementing n by 1 seemed a logical candidate, since for large n, division by n - 1 instead of n produces only a small, negligible increase in the value of the fraction. However, for small n, the increase in the value of the fraction with n decremented by 1 can be large.

For example, define the sum of squared deviation scores in the numerator of the variance expression to be some arbitrary value, say 10. Division of 10 by 30 is .33. Division of 10 by 29 is .34. Decrementing n by 1 increased the fraction by about .01. Now, divide 10 by 5. The result is 2.0. Divide 10 by 4, and the result is 2.50. Decrementing n by 1 increased the fraction by .50.

When the definition of the variance was changed so that the sum of the squared deviation scores was divided by n - 1, the standard deviations of the random normal deviates started to approximate the expected value of one even for small sample sizes. The new index was called the unbiased variance (s²). Its square root was called the unbiased standard deviation (s).
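The two definitions differ only in the divisor, which a short sketch can make concrete. The function names below are ours; the data are the variable X = [1 2 3 4 5] used in earlier chapters, whose sum of squared deviation scores happens to equal 10:

```python
def sum_of_squares(scores):
    """Sum of squared deviations of the scores from their mean."""
    mean = sum(scores) / len(scores)
    return sum((score - mean) ** 2 for score in scores)


def true_variance(scores):
    """True (biased) variance: sum of squares divided by n."""
    return sum_of_squares(scores) / len(scores)


def unbiased_variance(scores):
    """Unbiased variance: sum of squares divided by n - 1."""
    return sum_of_squares(scores) / (len(scores) - 1)


x = [1, 2, 3, 4, 5]
print("sum of squares:   ", sum_of_squares(x))
print("true variance:    ", true_variance(x))
print("unbiased variance:", unbiased_variance(x))
```

Python's standard statistics module provides the same two divisors as pvariance (n) and variance (n - 1), so the hand-written functions here serve only to expose the arithmetic.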

Monte Carlo Simulation

Simulations using random number generators are often called Monte Carlo experiments, as Monte Carlo is to Europe as Las Vegas is to the United States. In our Monte Carlo experiment, let's generate sets of random variables with an expected mean of 0 and an expected variance of 1.

Results of this simulation experiment are shown below for n equal to 100, 30, 10, 5, and 3. As you may observe, for ns greater than 30, the differences between true and unbiased variances are negligible. Even for ns as small as 10, the differences are very small. However, for ns of 5 and 3, the differences are substantial.
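An experiment of this kind can be approximated with a few lines of code. The sketch below is not the original simulation: the number of trials (5000), the seed, and the function name simulate are all our assumptions, chosen only so the run is repeatable and quick.

```python
import random


def simulate(n, trials=5000, seed=1):
    """Average true and unbiased variances over many random samples of size n."""
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatable runs
    true_total = 0.0
    unbiased_total = 0.0
    for _ in range(trials):
        # Draw n random normal deviates with mean 0 and standard deviation 1.
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        mean = sum(sample) / n
        ss = sum((value - mean) ** 2 for value in sample)
        true_total += ss / n            # divisor n
        unbiased_total += ss / (n - 1)  # divisor n - 1
    return true_total / trials, unbiased_total / trials


for n in (100, 30, 10, 5, 3):
    true_avg, unbiased_avg = simulate(n)
    print(f"n = {n:3d}  true = {true_avg:.3f}  unbiased = {unbiased_avg:.3f}")
```

With enough trials, the averaged unbiased variance should stay near 1 at every sample size, while the averaged true variance should drift toward (n - 1)/n, which is a small shortfall at n = 100 and a large one at n = 3.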

Considering that very few real-life experiments are done with groups of subjects so small, why even bother to introduce a new index for the variance? You are absolutely right. The unbiased variance index is not a necessary alternative to the true index of variance. However, what is necessary within the context of inferential statistics is the concept of degrees of freedom.

 

 

The degrees of freedom, ν, depending on circumstances, often equal n - 1, k - 1, or n - k. Remember how we defined one of the variance components within the context of the analysis of variance as the variance between the means? While in the course of statistical measurements sample sizes so small as to warrant the use of the unbiased variance virtually never occur, experiments comparing two or three means are common. Consider an experiment involving a control and an experimental group, with the number of groups designated by k. Computing the variance between the means using k (2) or k - 1 (1) as the divisor results in a difference in the variance estimates that is definitely not negligible.
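The effect of the divisor on the variance between two means can be illustrated with made-up numbers. In the sketch below, the group means (10 and 12) are hypothetical values invented for illustration, and the function is a simplified, unweighted variance of the means rather than a full analysis of variance:

```python
def variance_between_means(means, divisor):
    """Sum of squared deviations of the means from the grand mean, over a divisor."""
    grand_mean = sum(means) / len(means)
    ss = sum((m - grand_mean) ** 2 for m in means)
    return ss / divisor


group_means = [10.0, 12.0]  # hypothetical control and experimental means
k = len(group_means)

print("divisor k:    ", variance_between_means(group_means, k))
print("divisor k - 1:", variance_between_means(group_means, k - 1))
```

With only two groups, switching the divisor from k to k - 1 doubles the estimate, which is why the choice of degrees of freedom matters here even though it is negligible for large samples.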

 

Summary

The normal distribution is described by the function

$$f(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}$$

About 68% of the total area covered by the normal distribution is located between the z-scores of plus and minus one. Approximately 50% of the area under the standard normal distribution is between the z-scores of plus and minus .67. Close to 95% of the total area of the normal distribution is between the z-scores of plus and minus two.