Appendix 2: Statistical Distributions
Binomial Distribution
The binomial distribution is the basic
distribution within the general linear model of statistics.
The early work on the binomial distribution can be traced to
De Moivre's Approximatio
ad Summam Terminorum Binomii in Seriem Expansi,
published in 1733.
De Moivre earlier published a manual for gamblers where he
described arithmetic principles for betting strategies and
probabilities of outcomes of many games of chance.
Principle of Multiplication
Probability is defined as a ratio of
the number of expected outcomes to the number of possible
outcomes. To ascertain the number of possible outcomes, one
typically has to count the number of permutations or
combinations of some set of events. A permutation is an arrangement of
events following a definite order. Permutations of n events
taken k at a time are typically counted as
$$P(n, k) = \frac{n!}{(n-k)!}$$
A combination is a selection of events
without regard to their order. Combinations of a set of n
events observed k times are counted as
$$C(n, k) = \binom{n}{k} = \frac{n!}{k!\,(n-k)!}$$
The above expression also defines the binomial coefficient.
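Both counts can be checked with a short Python sketch. It relies on the standard-library functions math.perm and math.comb; the values n = 3 and k = 2 are illustrative assumptions, not taken from the text.

```python
import math

n, k = 3, 2  # illustrative values, not from the text

# Ordered arrangements of k events drawn from n: n! / (n - k)!
permutations = math.perm(n, k)

# Unordered selections of k events drawn from n: n! / (k! (n - k)!)
combinations = math.comb(n, k)

print(permutations, combinations)
```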
Both combinations and permutations are
based on the principle of multiplication. The principle of
multiplication states that if an operation can be performed
in n_1 ways, a second operation in n_2
ways, and so on, and if outcomes of these operations follow
each other, then the number of outcomes is the product
(n_1)(n_2), and so on.
For example, in how many ways can three
true-false items A, B, C be answered? According to the
principle of multiplication, the answer is (2)(2)(2). Within
a three item true-false test we can, theoretically, observe
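8 response patterns. Let's throw three coins 8 times and
record heads as 1 and tails as 0. Actual outcomes of this
experiment will vary; however, we can construct the table of
expected outcomes as

| Coin 1 | Coin 2 | Coin 3 |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 0 | 1 |
| 0 | 1 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 0 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
| 1 | 1 | 1 |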
The above table was constructed in the
following manner. Each of the three events (coins) has 2
possible outcomes, which gives 2^3 = 8 different
outcomes. Half of the entries in each column will be 0, the
other half 1. Enter four zeroes and four ones in the first
column. Split the four zeroes in the first column into two
zeroes and two ones and record them in the second column. Do
the same for the ones. In the last, third column, record
alternating zeroes and ones. This algorithm will work for any
number of binary outcomes of the n events, resulting in 2^n
different response patterns. Note that the frequencies of the heads
(1) and tails (0) in each column of the above example
form a rectangular distribution and
that the correlations between the variables
are all zero. As the throws of coins
represent random events, the variables in the above
table should not be, and are not, correlated.
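Next, let us sum the outcomes of each
trial. For the eight expected patterns, the sums are
0, 1, 1, 2, 1, 2, 2, 3.
To get the binomial distribution, the outcomes
of each trial must be summed and recorded as frequencies:
the sums 0, 1, 2, 3 occur 1, 3, 3, 1 times, respectively.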
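The enumeration, the zero correlations, and the summed frequencies can be reproduced with a minimal Python sketch. The helper name correlation and the use of itertools.product are choices made here for illustration; product happens to list the patterns in the same order as the column-splitting algorithm described above.

```python
from collections import Counter
from itertools import product

# All 2**3 response patterns for three coins (1 = heads, 0 = tails).
patterns = list(product([0, 1], repeat=3))

def correlation(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

columns = list(zip(*patterns))       # one tuple of outcomes per coin
sums = [sum(p) for p in patterns]    # number of heads in each trial
frequencies = Counter(sums)          # how often each sum occurs

print(correlation(columns[0], columns[1]))
print(sorted(frequencies.items()))
```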
Pascal's Triangle
We can also get the above frequencies
directly, by constructing Pascal's triangle, as

      1
     1 1
    1 2 1
   1 3 3 1

To construct Pascal's triangle,
start with 1 on top of the triangle, and append two 1s on
the next line. Continue to append 1s on the extremes of
following lines while constructing the in-between terms by
summing the adjacent terms above them. Notice that the row
sums of Pascal's triangle are the powers of 2, i.e., [1
2 4 8 ...]. Dividing the rows of Pascal's triangle by
their row sums,
one gets the probabilities
associated with Pascal's triangle. For our repeated
throws of 3 coins, the expected probabilities are
.125, .375, .375, .125, i.e., the row 1 3 3 1 divided by its
row sum of 8.
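The construction rule can be written as a short Python sketch. The function name pascal_rows is hypothetical and chosen only for this illustration.

```python
def pascal_rows(n_rows):
    """Return the first n_rows rows of Pascal's triangle."""
    rows = [[1]]
    for _ in range(n_rows - 1):
        prev = rows[-1]
        # 1s on the extremes; in-between terms sum the adjacent terms above.
        inner = [prev[i] + prev[i + 1] for i in range(len(prev) - 1)]
        rows.append([1] + inner + [1])
    return rows

rows = pascal_rows(4)
row_sums = [sum(r) for r in rows]                     # powers of 2
probabilities = [c / row_sums[-1] for c in rows[-1]]  # last row / its sum
print(rows[-1], row_sums, probabilities)
```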
Binomial Coefficients
The binomial model derives its name
from the fact that the binomial coefficients are terms of
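expanded binomial equations. Consider a series of 0, 1, 2,
3, ... expansions of the (a + b) binomial, solved as
$$(a+b)^0 = 1$$
$$(a+b)^1 = a + b$$
$$(a+b)^2 = a^2 + 2ab + b^2$$
$$(a+b)^3 = a^3 + 3a^2b + 3ab^2 + b^3$$
and separate the expanded equations
into a component containing progressions of decreasing and
increasing exponents, shown here for the third expansion,
$$a^3 b^0,\; a^2 b^1,\; a^1 b^2,\; a^0 b^3$$
and the component containing the
binomial coefficients
$$1,\; 3,\; 3,\; 1$$
and their associated probabilities
$$.125,\; .375,\; .375,\; .125$$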
Binomial Equation
The fundamental distribution of the
general linear model of data analysis is the binomial
distribution. The equation for the binomial distribution is
$$y = \binom{n}{k}\, p^k q^{n-k}$$
which can be also written as
$$y = \frac{n!}{k!\,(n-k)!}\, p^k (1-p)^{n-k}$$
In the above expressions, y is the
binomial variable containing probabilities of k successes in
n trials, p is the probability of success, and q = 1 - p is the
probability of failure. For the case
when p = q, the above formula can be further simplified as
$$y = \binom{n}{k}\, p^n$$
For example, the binomial coefficients
for the probable outcomes of repeated throws of three coins
are 1, 3, 3, 1,
and
the values of the binomial variable can be computed as
follows:
y_1 = 1 (.5^0 × .5^3) = .125,
y_2 = 3 (.5^1 × .5^2) = .375,
y_3 = 3 (.5^2 × .5^1) = .375,
y_4 = 1 (.5^3 × .5^0) = .125.
Since for the example p = q, and .5^3
equals .125, using the simplified form of the binomial
equation, the binomial variable can be also computed as
y_1 = 1 (.125) = .125,
y_2 = 3 (.125) = .375,
y_3 = 3 (.125) = .375,
y_4 = 1 (.125) = .125.
These results are identical to results
obtained by computing probabilities directly from the
Pascal's triangle.
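The same computation can be sketched in Python. The function name binomial_pmf is hypothetical; math.comb supplies the binomial coefficient.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    q = 1.0 - p
    return comb(n, k) * p**k * q**(n - k)

# Three coins with p = q = .5, as in the worked example.
y = [binomial_pmf(k, 3, 0.5) for k in range(4)]
print(y)
```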
Gamma Function
Note
that the length of the binomial variable is n+1, the
argument of the gamma function. When this argument is an
integer, the gamma function is just the factorial function
offset by one,
$$\Gamma(n+1) = n!$$
The gamma function is a key parameter
within the family of gamma distributions, including the
normal, t, F and chi square distributions.
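The offset-by-one relation between the gamma function and the factorial can be checked with Python's standard library. This is only a numerical illustration using math.gamma and math.factorial.

```python
import math

# For integer arguments, gamma(n + 1) equals n!.
for n in range(6):
    print(n, math.gamma(n + 1), math.factorial(n))

# The gamma function is also defined for non-integer arguments.
half = math.gamma(0.5)   # equals the square root of pi
print(half)
```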
Mean and Variance of the Binomial Distribution
The mean of the binomial distribution
is
$$\mu = np$$
and
its variance is
$$\sigma^2 = npq$$
To compute the mean and variance of the
binomial distribution, the distribution has to be changed
from a frequency count into a variable where each repeated
frequency is a separate value. For the example, the values
are 0, 1, 1, 1, 2, 2, 2, 3;
the
mean equals 3(.5) which is 1.5. The variance equals
3(.5)(.5) which is .75.
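A minimal Python sketch of this step, assuming the frequencies 1, 3, 3, 1 for 0 to 3 heads from the three-coin example. It expands the frequency count into separate values and compares the result with np and npq, using the population variance (statistics.pvariance).

```python
from statistics import mean, pvariance

# Frequencies of the sums 0..3, expanded so that each repeated
# frequency becomes a separate value.
frequencies = {0: 1, 1: 3, 2: 3, 3: 1}
values = [k for k, f in frequencies.items() for _ in range(f)]

n, p = 3, 0.5
print(values)
print(mean(values), n * p)                  # mean vs. np
print(pvariance(values), n * p * (1 - p))   # variance vs. npq
```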
Binomial Distribution within the Microsoft Excel Framework
Using De Moivre's formula in Microsoft
Excel, one has to define the location of the argument, the
length of the argument, the probability p, and whether the
distribution should be cumulative, or not. For the example
where the argument of the binomial function was generated in
the column a1:a4 as 0,1,2,3, the probability p equal to .5,
and standard, non-cumulative binomial function, the formula
was written as
binomdist(a1,3,0.5,false)
The function was generated in the
column b1:b4 as .125, .375, .375, .125
and is plotted in the figure below
Another example of the binomial
function
was generated as follows. The argument
was generated in the column a1:a30 as 1, 2, 3, ..., 30, the
probability p set to .5, and the formula for the non-cumulative
binomial function was written
as
binomdist(a1,30,0.5,false)
Galton's Quincunx
Galton's Quincunx is an apparatus with
a single top compartment that contains a handful of marbles
and a maze of ducts leading to several compartments at the
bottom of the instrument.
If this apparatus is set upright, the
marbles will fall through the ducts and mimic the pattern of
probabilities contained in the Pascal's triangle. When the
marbles reach the bottom compartments, the upper contour of
the stack of marbles will bear a strong resemblance to a
normal curve. Galton's original device used pins positioned
on a board and resembling an ornamental arrangement of five
bushes; hence the name quincunx.
The probabilities of a falling marble
entering a certain path help
to explain the binomial distribution. In the middle of the
top compartment that contains the marbles there is an
opening with a middle partition. A marble falling through
this partition has an equal probability of falling into one
of two lower compartments. Each of these compartments has an
opening with a partition in its middle. Below this, there
are four more compartments. The probability of a marble
falling into the leftmost compartment is half of .50,
associated with the compartment above, i.e., .25. The
probability of a marble falling into the two inner
compartments is the sum of .25 from the first above
compartment and .25 from the second, i.e., .50. The
probability of a marble falling into the rightmost
compartment is again .25. This branching pattern is repeated
over several rows. The pattern of probabilities over the
rows corresponds to the pattern of probabilities obtained by
dividing each coefficient of Pascal's Triangle by its row
sum. The same pattern of probabilities could have been
obtained by repeated expansions of the (.5 + .5)^n
expression, as
(.5 + .5)^0 = 1
(.5 + .5)^1 = .5 + .5
(.5 + .5)^2 = .25 + .5 + .25
(.5 + .5)^3 = .125 + .375 + .375 + .125
etc. The exponent n can theoretically
grow without bounds. As it approaches infinity, the binomial
distribution approaches the normal distribution.
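The behavior of the apparatus can be imitated with a small simulation. This is a sketch under stated assumptions: the function name quincunx, the number of marbles, and the fixed random seed are all illustrative, and each partition is assumed to send a marble left or right with equal probability.

```python
import random
from collections import Counter

def quincunx(n_marbles, n_rows, seed=0):
    """Drop marbles through n_rows of partitions; return bin counts.

    At each partition a marble goes left (0) or right (1) with equal
    probability, so its final bin is its number of right turns.
    """
    rng = random.Random(seed)   # seeded so that runs are repeatable
    bins = Counter()
    for _ in range(n_marbles):
        bins[sum(rng.randint(0, 1) for _ in range(n_rows))] += 1
    return [bins[k] for k in range(n_rows + 1)]

counts = quincunx(n_marbles=8000, n_rows=3)
print([c / 8000 for c in counts])   # compare with .125 .375 .375 .125
```

With more rows, the bin counts trace out an increasingly bell-shaped contour.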
Normal Distribution
The idealized binomial distribution is
called the normal distribution.
For the standard scores z, the formula reads
$$y = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}$$
The above formula can be also written
as
$$y = .3989\, e^{-.5 z^2}$$
and plotted as
Normal Distribution within the Microsoft Excel Framework
In
Microsoft Excel, the normal distribution can be generated as
(1 / Sqrt(2 * Pi())) * Exp(-((A1) ^ 2) / 2)
This
is equivalent to using Excel's generic normdist
function. Generating the z scores in the column a1:a61 as
-3.0, -2.9, ..., 3.0, plotting the z scores on the
horizontal coordinate and the y values on the vertical
coordinate, results in the standard normal distribution
shown above.
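For readers working outside a spreadsheet, the same worksheet formula can be sketched in Python. The function name normal_pdf is hypothetical; the list of z scores mirrors the 61 values described for the column a1:a61.

```python
import math

def normal_pdf(z):
    """Standard normal density at z."""
    return (1 / math.sqrt(2 * math.pi)) * math.exp(-(z ** 2) / 2)

# z scores from -3.0 to 3.0 in steps of .1 (61 values).
z_scores = [round(-3.0 + 0.1 * i, 1) for i in range(61)]
density = [normal_pdf(z) for z in z_scores]
print(max(density))   # the peak of the curve, at z = 0
```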
Comparison of Binomial and Normal Distributions
For n greater than 30, there is not too
much difference between the binomial and normal
distributions, as shown in the figure below.
As n gets smaller, the
binomial distribution shifts noticeably to the left,
as
shown in the above figure where the n of the binomial
distribution equals 10.
This shift is due to the fact that the abscissa for the
normal distribution is a continuous variable whereas the
abscissa for the binomial distribution is a discrete
variable, a point series. Consider that it is not possible
to have any values between, say, 2 and 3 heads of a coin.
Within this context, for small values of n,
a shift of +.5 is sometimes applied to the values of the
binomial distribution. This is called a correction
for continuity.
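The effect of the correction can be illustrated numerically. The sketch below is an assumption-laden example rather than the author's procedure: it compares the exact cumulative binomial probability with normal approximations evaluated with and without the +.5 shift, for the illustrative values n = 10, p = .5, and k = 3.

```python
import math

def binomial_cdf(k, n, p):
    """Exact P(X <= k) for a binomial variable."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def normal_cdf(x, mean, sd):
    """Normal cumulative probability via the error function."""
    return 0.5 * (1 + math.erf((x - mean) / (sd * math.sqrt(2))))

n, p, k = 10, 0.5, 3                        # illustrative values
mean, sd = n * p, math.sqrt(n * p * (1 - p))

exact = binomial_cdf(k, n, p)
uncorrected = normal_cdf(k, mean, sd)
corrected = normal_cdf(k + 0.5, mean, sd)   # shift of +.5
print(exact, uncorrected, corrected)
```

For these values the shifted approximation lies much closer to the exact probability than the unshifted one.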
Euler's Constant Of Growth And Decay
The
shape of the normal distribution is mainly due to the
transcendental number e, named epsilon e by Euler. Epsilon is often Latinized as e; its origin is rather
obscure, being introduced into mathematics shortly before
Napier used it as a basis of his system of natural
logarithms. However, it was Euler who popularized its use
and named it, as many suspect, after the initial of his own
name. Euler, a prolific mathematician credited with 886
published books and articles and averaging over 700 printed
pages per year, used e to show the connection between
exponential and trigonometric functions. The e to a positive
power is often used to describe growth processes; raised to
a negative power, it describes processes of decay. To
provide for a continuous growth of an investment, interest
is computed continuously and added to a principal. One
dollar at 100% interest compounded annually yields two
dollars. The interest compounded semiannually, at midyear,
is 50 cents. Added to the principal, the amount loaned is
$1.50. At the end of the year, the interest is $.75; the
total yield is $2.25. Compounded quarterly, the interest
augments the principal as $1.00 + $.25 + $.31 +$.39 + $.49 =
$2.44. This compounding process can be formalized by using
the expansion of a series of binomials: (1 + 1/1)^1 = 2.00,
(1 + 1/2)^2 = 2.25, (1 + 1/4)^4 = 2.44, and so on. Increasing
the frequency of compounding to infinity defines the e as
$$e = \lim_{n \to \infty} \left(1 + \frac{1}{n}\right)^n \approx 2.71828$$
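The convergence of the compounding series toward e can be followed numerically. In this sketch the function name compound and the chosen compounding frequencies are illustrative assumptions.

```python
import math

def compound(periods):
    """Value of one dollar after one year at 100% nominal interest,
    compounded `periods` times per year."""
    return (1 + 1 / periods) ** periods

for periods in (1, 2, 4, 12, 365, 1_000_000):
    print(periods, round(compound(periods), 5))
print(math.e)
```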
Continuously compounded interest is the prototype of
continuous growth. The size of a quantity growing like
continuously compounded interest is given by the exponential
function
$$y = c\, e^{pt}$$
where
c is the initial size, p is the nominal growth rate given as
a proportion of the unit growth rate, and t is the number of
time periods. The definition of continuous decay differs
from the definition of the continuous growth only by the
sign of the exponent
$$y = c\, e^{-pt}$$
An
illustration of the positive and negative growth may help to
clarify the above discussion.
About Bacterial Cultures And Penicillin
Suppose
a bacterial culture of 1000 bacteria increases at a rate of
30% per day. Assume that penicillin decreases the size of
the culture at the same nominal rate and is added to the
culture at the end of the fourth day. The rise and fall of
the bacterial culture can be calculated and is summarized in
Table 21.1.

Table 21.1 Growth and Decay of Bacterial Cultures
Both the positive and negative growth of
the bacterial culture are presented in Figure 21.2. After an
initial rapid growth the trend is reversed after the
introduction of penicillin. At the end of the ninth day the
culture is back to its original size.

Figure 21.2 Accelerated and Decelerated Growth Function
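The rise and fall of the culture can be sketched in Python using continuous growth and decay of the exponential form described earlier. This is a minimal sketch: the function name culture_size, counting elapsed days from 0 at the start, and switching to decay exactly at day 4 are assumptions made for illustration, so the printed values need not match the day numbering used in Table 21.1.

```python
import math

def culture_size(day, initial=1000.0, rate=0.30, switch_day=4):
    """Size of the culture after `day` elapsed days.

    Continuous growth at `rate` until `switch_day`, then continuous
    decay at the same nominal rate (an assumed reading of the text).
    """
    if day <= switch_day:
        return initial * math.exp(rate * day)
    peak = initial * math.exp(rate * switch_day)
    return peak * math.exp(-rate * (day - switch_day))

for day in range(9):
    print(day, round(culture_size(day)))
```

Under these assumptions the culture peaks at about 3320 bacteria and returns to 1000 after four further days of decay.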
The Gamma Distributions
Functions
can be classified as algebraic and transcendental. An
algebraic function is a function that is a root of a
polynomial equation. A function that is not a root of a
polynomial equation is called transcendental. Most of the
functions that describe natural phenomena turn out to be
transcendental functions as are the trigonometric,
logarithmic, exponential, and hyperbolic functions. The
theory of higher transcendental functions was elaborated by Euler (1707-1783), who also introduced the beta and gamma
transcendental functions. Most sampling distributions of
inferential statistics belong to the family of the gamma
density functions. Some textbooks on statistics ascribe the
t-distribution to Student and the F distribution to Snedecor.
These statisticians only called attention to the
applicability of some of the higher transcendental functions
to the theory of statistical inference. However, the gamma
density functions are due to Euler. These functions have a
general form
Examples
of gamma density functions are
approximating
the normal distribution, whose equation is
The
y1, y2, and y3 gamma
functions were plotted below, as

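The t-Distribution
The t-distribution belongs to
Euler's family of the gamma distributions. The density
function for the t-distribution, associated with a certain
number of degrees of freedom signified by the Greek letter
ν,
is shown below.
$$y = \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\;\Gamma\left(\frac{\nu}{2}\right)} \left(1 + \frac{t^2}{\nu}\right)^{-\frac{\nu+1}{2}}$$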
In the above equation
Γ(ν) = (ν - 1)!
When the argument is an integer, the
gamma function is just a factorial offset by one; however,
the gamma function extends the factorial to all
positive real numbers.
When the degrees of freedom grow large, the
t-distribution changes to normal distribution. In the above
equation, the constant a in Euler's Gamma Equation
can
be written as
$$a = \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\;\Gamma\left(\frac{\nu}{2}\right)}$$
This constant, for degrees of freedom
approaching infinity, approximates .3989. Thus
In the above equation the limit of the
expression in the square brackets is Euler's e and the limit
of the exponent in the oblique brackets equals -.5, thus
and the equation for the
t-distribution, with a large number of degrees of freedom, can
be written within the framework of Euler's Gamma Equations
as
$$y = .3989\, e^{-.5 t^2}$$
As the above equation signifies the
normal distribution, Gauss (1777-1855) could have claimed
predominance in describing astronomical applications of the
normal distribution, but hardly the primacy in description
of its analytical form.
The t-Distribution within the Microsoft Excel Framework
Within
the Microsoft Excel computing environment, using the natural
logarithm of the gamma function, Gammaln,
circumvents this difficulty, since the number e raised to
the power Gammaln(n) returns, if n
is an integer, the same result as n
decremented by one, factorial. However, the Gammaln
function also works with arguments that are not integer
numbers. For five degrees of freedom, Microsoft Excel's
formula for the t-distribution can be written as
= (1 / Sqrt(5 * Pi())) * (2 / Exp(Gammaln(2.5))) * (1 + (A1 ^ 2) / 5) ^ -3
and plotted as in the figure below.
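The worksheet formula can be cross-checked in Python, where math.lgamma plays the role of Gammaln. The sketch assumes the standard form of the t density; the function names t_pdf and excel_like_t5 are hypothetical.

```python
import math

def t_pdf(t, df):
    """Density of the t-distribution with df degrees of freedom.

    exp(lgamma(x)) equals the gamma function for positive x, so the
    leading constant is assembled on the log scale.
    """
    log_a = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_a) * (1 + t * t / df) ** (-(df + 1) / 2)

def excel_like_t5(t):
    """Direct transcription of the five-degrees-of-freedom formula."""
    return ((1 / math.sqrt(5 * math.pi))
            * (2 / math.exp(math.lgamma(2.5)))
            * (1 + t ** 2 / 5) ** -3)

print(t_pdf(0, 5), excel_like_t5(0))
print(t_pdf(0, 10_000))   # close to .3989 for many degrees of freedom
```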
Theoretically, the normal distribution
and the t-distribution are identical only for the infinite
number of the degrees of freedom. Practically, you may see
for yourself that the differences between the normal and
t-distributions are not so large. Arguably, for sample sizes where n is
greater than 30, and undoubtedly, for sample sizes greater
than 60, the differences between the t-distribution and the
normal distribution are negligible.
The critical values for one-tailed
t and z tests at the .05 significance level, for
different degrees of freedom, are shown in the table below.
Convergence of Critical Values of t and z
The critical values for one-tailed t tests at the .05 significance level, for
selected degrees of freedom, are reported in the table below.
| df | t | t^2 |
|---|---|---|
| 1 | 6.31 | 39.87 |
| 2 | 2.92 | 8.53 |
| 3 | 2.35 | 5.54 |
| 4 | 2.13 | 4.54 |
| 5 | 2.02 | 4.06 |
| 6 | 1.94 | 3.78 |
| 7 | 1.90 | 3.59 |
| 8 | 1.86 | 3.46 |
| 9 | 1.83 | 3.36 |
| 10 | 1.81 | 3.28 |
| 15 | 1.75 | 3.07 |
| 20 | 1.72 | 2.97 |
| 30 | 1.70 | 2.88 |
| 40 | 1.68 | 2.84 |
| 60 | 1.67 | 2.79 |
| 120 | 1.66 | 2.75 |
| ∞ | 1.64 | 2.71 |
Values
of t and t^2 corresponding to the five percent
area of the t-distribution for selected degrees of freedom
(one-tailed test). The degrees of freedom equal n - 2.
The t^2 equals F for 1 degree of freedom. For
infinitely large degrees of freedom, t equals z.
Using the t distribution for estimation
of probability associated with the strength of a
relationship in lieu of the normal distribution increases
the threshold of the significance criterion and thus makes
results less likely to be significant when a small number of
subjects is used for analysis. For groups of subjects larger
than 60, the z-test and t-tests can be used interchangeably.
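The convergence of the critical values can be reproduced approximately without statistical libraries. The sketch below is one possible approach, not the method used to build the table: it assumes the standard t density, integrates it with Simpson's rule, and finds the one-tailed .05 critical value by bisection. All function names, the integration step count, and the search interval are illustrative choices.

```python
import math

def t_pdf(t, df):
    """Density of the t-distribution with df degrees of freedom."""
    log_a = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_a) * (1 + t * t / df) ** (-(df + 1) / 2)

def upper_tail(t, df, steps=1000):
    """P(T > t) for t >= 0, using Simpson's rule on [0, t]."""
    if t == 0:
        return 0.5
    h = t / steps                        # steps must be even
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

def critical_t(df, alpha=0.05):
    """One-tailed critical value found by bisection on [0, 100]."""
    lo, hi = 0.0, 100.0
    for _ in range(40):
        mid = (lo + hi) / 2
        if upper_tail(mid, df) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

critical = {df: critical_t(df) for df in (1, 5, 30)}
for df, value in critical.items():
    print(df, round(value, 3))
```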
The F Distribution
Among
the higher transcendental functions, a frequently used
function within the area of statistical inference is the
inverted beta distribution, also called, as coined by
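Snedecor, the F distribution. Like other probability
distributions, the F distribution belongs to the family of
gamma functions. The density function for the
F-distribution, associated with certain numbers of degrees of
freedom signified by the Greek letters
ν_1 and ν_2,
is
$$y = \frac{\Gamma\left(\frac{\nu_1+\nu_2}{2}\right)}{\Gamma\left(\frac{\nu_1}{2}\right)\Gamma\left(\frac{\nu_2}{2}\right)} \left(\frac{\nu_1}{\nu_2}\right)^{\nu_1/2} F^{\,\nu_1/2 - 1} \left(1 + \frac{\nu_1}{\nu_2}F\right)^{-\frac{\nu_1+\nu_2}{2}}$$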
The F Distribution within the Microsoft Excel Framework
In
the above equation
Γ(ν) = (ν - 1)!
With both degrees of freedom equal to 10, the above
equation was written for Microsoft Excel as
=630 * a1^4 * (1 + a1) ^ -10
The constant 630
within the above expression was computed as 9! / (4! 4!). This F(10,10)
distribution is shown in the figure below.
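The constant 630 and the worksheet formula can be checked against a general F density in Python. The sketch assumes the standard form of the F density; the function names f_pdf and excel_like_f_10_10 are hypothetical.

```python
import math

def f_pdf(x, df1, df2):
    """Density of the F distribution with (df1, df2) degrees of freedom."""
    log_const = (math.lgamma((df1 + df2) / 2)
                 - math.lgamma(df1 / 2) - math.lgamma(df2 / 2)
                 + (df1 / 2) * math.log(df1 / df2))
    return (math.exp(log_const) * x ** (df1 / 2 - 1)
            * (1 + df1 * x / df2) ** (-(df1 + df2) / 2))

def excel_like_f_10_10(x):
    """Direct transcription of the F(10,10) worksheet formula."""
    return 630 * x ** 4 * (1 + x) ** -10

constant = math.factorial(9) // (math.factorial(4) * math.factorial(4))
print(constant)
print(f_pdf(1.0, 10, 10), excel_like_f_10_10(1.0))
```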

The ease with
which Microsoft Excel permits visualizing higher transcendental functions
removes much of the mythology and obfuscation from statistical data
analysis.
Values
of F for selected degrees of freedom at the five percent
level of significance (one-tailed test) are shown in the
table below.
| Denominator df \ Numerator df | 1 | 2 | 3 | ∞ |
|---|---|---|---|---|
| 1 | 39.87 | 49.5 | 53.6 | 63.3 |
| 2 | 8.53 | 9.00 | 9.16 | 9.49 |
| 3 | 5.54 | 5.46 | 5.39 | 5.13 |
| 4 | 4.54 | 4.32 | 4.19 | 3.76 |
| 5 | 4.06 | 3.78 | 3.62 | 3.11 |
| 6 | 3.78 | 3.46 | 3.29 | 2.72 |
| 7 | 3.59 | 3.26 | 3.07 | 2.47 |
| 8 | 3.46 | 3.11 | 2.92 | 2.29 |
| 9 | 3.36 | 3.01 | 2.81 | 2.16 |
| 10 | 3.28 | 2.92 | 2.73 | 2.06 |
| 15 | 3.07 | 2.70 | 2.49 | 1.76 |
| 20 | 2.97 | 2.59 | 2.38 | 1.61 |
| 30 | 2.88 | 2.49 | 2.28 | 1.49 |
| 40 | 2.84 | 2.44 | 2.23 | 1.38 |
| 60 | 2.79 | 2.39 | 2.18 | 1.29 |
| 120 | 2.75 | 2.35 | 2.13 | 1.19 |
| ∞ | 2.71 | 2.30 | 2.08 | 1.00 |
For one degree of freedom, F equals t^2.
Critical Values in the F Distribution
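The Chi Square Distribution
The
equation for the chi square distribution, with x denoting the
chi square value and ν the degrees of freedom, is
$$y = \frac{1}{2^{\nu/2}\,\Gamma(\nu/2)}\; x^{\nu/2 - 1}\, e^{-x/2}$$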
The
above equation conforms to the general form of the Euler's
gamma function
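The
constant a equals, for ν degrees of freedom,
$$a = \frac{1}{2^{\nu/2}\,\Gamma(\nu/2)}$$
the
constant b equals
$$b = \frac{\nu}{2} - 1$$
the
constant c equals .5 and the constant d equals 2.
For
example, for 10 degrees of freedom
$$a = \frac{1}{2^5\,\Gamma(5)} = \frac{1}{768}$$
b
equals 4, c equals .5 and d equals 2.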
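The 10-degrees-of-freedom case can be checked numerically. The sketch assumes the standard chi square density; the leading constant 1/768 = 1/(2^5 × 4!) is computed here from that assumption rather than quoted from the text, and the function names are hypothetical.

```python
import math

def chi_square_pdf(x, df):
    """Density of the chi square distribution with df degrees of freedom."""
    log_a = -(df / 2) * math.log(2) - math.lgamma(df / 2)
    return math.exp(log_a) * x ** (df / 2 - 1) * math.exp(-x / 2)

def chi_square_pdf_10(x):
    """Same density for 10 degrees of freedom with the constants
    written out: power b = 4 and decay rate c = .5."""
    return (1 / 768) * x ** 4 * math.exp(-0.5 * x)

print(chi_square_pdf(8.0, 10), chi_square_pdf_10(8.0))
```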
As shown in the table (p = .05) below,
| df | Chi square |
|---|---|
| 1 | 3.841 |
| 3 | 7.815 |
| 5 | 11.07 |
| 10 | 18.307 |
| 20 | 31.410 |
| 30 | 43.773 |
|
40
|
<