Appendix 2: Statistical Distributions


Binomial Distribution

The binomial distribution is the basic distribution within the general linear model of statistics. The early work on the binomial distribution can be traced to De Moivre's Approximatio ad Summam Terminorum Binomii in Seriem Expansi, published in 1733. De Moivre earlier published a manual for gamblers where he described arithmetic principles for betting strategies and probabilities of outcomes of many games of chance.

 

Principle of Multiplication

Probability is defined as the ratio of the number of expected outcomes to the number of possible outcomes. To ascertain the number of possible outcomes, one typically has to count the number of permutations or combinations of some set of events. A permutation is an arrangement of events following a definite order. Permutations of n events taken k at a time are typically counted as

\[ P(n, k) = \frac{n!}{(n-k)!} \]

A combination is a selection of events without regard to their order. Combinations of a set of n events taken k at a time are counted as

\[ C(n, k) = \frac{n!}{k!\,(n-k)!} \]

The above expression also defines the binomial coefficient

\[ \binom{n}{k} = \frac{n!}{k!\,(n-k)!} \]

 

Both combinations and permutations are based on the principle of multiplication. The principle of multiplication states that if one operation can be performed in \(n_1\) ways, a second operation in \(n_2\) ways, and so on, and if these operations follow one another, then the total number of outcomes is the product \(n_1 n_2 \cdots\).

For example, in how many ways can three true-false items A, B, and C be answered? According to the principle of multiplication, the answer is (2)(2)(2) = 8. Within a three-item true-false test we can, theoretically, observe 8 response patterns. Let's throw three coins 8 times and record heads as 1 and tails as 0. Actual outcomes of this experiment will vary; however, we can construct the table of expected outcomes as

 

The above table was constructed in the following manner. Each of the three events (coins) has 2 possible outcomes, which gives \(2^3 = 8\) different outcomes. In every column, half of the entries will be 0 and the other half 1. Enter four zeroes followed by four ones in the first column. In the second column, split each block of four into two zeroes followed by two ones. In the third and last column, record alternating zeroes and ones. This algorithm works for any number n of binary events, resulting in \(2^n\) different response patterns. Note that the frequencies of the heads (1) and tails (0) for the above example, shown below

 

 

form a rectangular distribution and that the correlations between the variables

 

 

are all zeroes. As the throws of coins represent the random events, the variables in the above table should not be and are not correlated.

Next, let us sum outcomes of each trial, as

 

 

 To get the binomial distribution, as shown below, the outcomes of each trial must be summed and recorded as frequencies.
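The construction described above can be checked with a short script. This is a minimal sketch, not the procedure used to produce the original tables; the names `patterns`, `correlation`, and `frequencies` are illustrative assumptions.

```python
from itertools import product
from collections import Counter

# All 2**3 response patterns for three coins (1 = heads, 0 = tails).
# itertools.product yields them in the same order as the column-filling
# rule in the text: 00001111, 00110011, 01010101.
patterns = list(product([0, 1], repeat=3))

# Number of heads in each column: half of the 8 entries.
column_heads = [sum(row[j] for row in patterns) for j in range(3)]


def correlation(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


columns = list(zip(*patterns))
correlations = [correlation(columns[i], columns[j])
                for i, j in [(0, 1), (0, 2), (1, 2)]]

# Sum each trial (number of heads) and tally the sums as frequencies.
frequencies = Counter(sum(row) for row in patterns)

print(column_heads)
print(correlations)
print(sorted(frequencies.items()))
```

The tallied sums give the frequencies 1, 3, 3, 1 for 0, 1, 2, and 3 heads, which is the binomial distribution for three coins.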

 

 

Pascal's Triangle

We can also get the above frequencies directly, by constructing Pascal's triangle, as

 

 

To construct Pascal's triangle, start with 1 on top of the triangle, and append two 1s on the next line. Continue to append 1s on the extremes of the following lines while constructing the in-between terms by summing the adjacent terms above them. Notice that the row sums of Pascal's triangle are the powers of 2, i.e., [1 2 4 8…]. Dividing the rows of Pascal's triangle by their row sums,

 

 

one may get the probabilities associated with Pascal's triangle. For our repeated throws of 3 coins, the expected probabilities are shown on the bottom line of the above figure.
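The construction rule translates directly into code. The following sketch builds the first four rows and divides each row by its sum; the function name `pascal_rows` is an illustrative assumption.

```python
def pascal_rows(n_rows):
    """Return the first n_rows rows of Pascal's triangle."""
    rows = [[1]]
    for _ in range(n_rows - 1):
        prev = rows[-1]
        # In-between terms are sums of the two adjacent terms above them.
        inner = [prev[i] + prev[i + 1] for i in range(len(prev) - 1)]
        rows.append([1] + inner + [1])
    return rows


rows = pascal_rows(4)
row_sums = [sum(row) for row in rows]
probabilities = [[term / sum(row) for term in row] for row in rows]

for row, probs in zip(rows, probabilities):
    print(row, probs)
```

The last row, [1, 3, 3, 1], divided by its row sum of 8, gives the probabilities for the three-coin example.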

 

Binomial Coefficients

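The binomial model derives its name from the fact that the binomial coefficients are the coefficients of the terms of expanded binomials. Consider the series of expansions of the binomial (a + b) raised to the powers 0, 1, 2, 3, …, solved as

\[ (a + b)^0 = 1 \]
\[ (a + b)^1 = a + b \]
\[ (a + b)^2 = a^2 + 2ab + b^2 \]
\[ (a + b)^3 = a^3 + 3a^2 b + 3ab^2 + b^3 \]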

and separate the expanded equations into a component containing progressions of decreasing and increasing exponents,

 

 

and the component containing the binomial coefficients

 

 

and its associated probabilities

 

 

Binomial Equation

The fundamental distribution of the general linear model of data analysis is the binomial distribution. The equation for the binomial distribution is

\[ y = \binom{n}{k}\, p^k q^{n-k} \]

which can be also written as

\[ y = \frac{n!}{k!\,(n-k)!}\, p^k q^{n-k} \]

In the above expressions, y is the binomial variable containing the probabilities of k successes in n trials, p is the probability of success, and q = 1 - p is the probability of failure. For the case when p = q, the above formula can be further simplified as

\[ y = \binom{n}{k}\, p^n \]

For example, the binomial coefficients for the probable outcomes of repeated throws of three coins are

\[ \binom{3}{0} = 1, \quad \binom{3}{1} = 3, \quad \binom{3}{2} = 3, \quad \binom{3}{3} = 1 \]

and the values of the binomial variable can be computed as follows:

\(y_1 = 1\,(.5^0 \times .5^3) = .125\),

\(y_2 = 3\,(.5^1 \times .5^2) = .375\),

\(y_3 = 3\,(.5^2 \times .5^1) = .375\),

\(y_4 = 1\,(.5^3 \times .5^0) = .125\).

 

Since for the example p = q, and \(.5^3\) equals .125, using the simplified form of the binomial equation, the binomial variable can be also computed as

\(y_1 = 1\,(.125) = .125\),

\(y_2 = 3\,(.125) = .375\),

\(y_3 = 3\,(.125) = .375\),

\(y_4 = 1\,(.125) = .125\).

 

These results are identical to the results obtained by computing probabilities directly from Pascal's triangle.

 

Gamma Function

Note that the length of the binomial variable is n+1, the argument of the gamma function. When this argument is an integer, the gamma function is just the factorial function offset by one,

\[ \Gamma(n + 1) = n! \]

The gamma function is a key parameter within the family of gamma distributions, including the normal, t, F and chi square distributions.
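The offset-by-one relation between the gamma function and the factorial is easy to verify numerically. The sketch below uses Python's standard `math.gamma` and `math.factorial`; the variable names are illustrative.

```python
import math

# Gamma(n + 1) equals n! for non-negative integers n.
pairs = [(math.gamma(n + 1), math.factorial(n)) for n in range(8)]
for n, (gamma_value, factorial_value) in enumerate(pairs):
    print(n, gamma_value, factorial_value)

# Unlike the factorial, the gamma function is also defined between
# the integers; for example, Gamma(0.5) equals the square root of pi.
gamma_half = math.gamma(0.5)
print(gamma_half, math.sqrt(math.pi))
```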

 

Mean and Variance of the Binomial Distribution

The mean of the binomial distribution is

\[ \mu = np \]

and its variance is

\[ \sigma^2 = npq \]

To compute the mean and variance of the binomial distribution, the distribution has to be changed from a frequency count into a variable where each repeated frequency is a separate value. For the example,

 

 

 the mean equals 3(.5) which is 1.5. The variance equals 3(.5)(.5) which is .75.
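The change from a frequency count to a variable with one value per repeated frequency can be sketched as follows. The frequencies 1, 3, 3, 1 are those of the three-coin example; the variance is computed as a population variance (dividing by the number of values), which is an assumption about the intended computation.

```python
# Frequencies of 0, 1, 2, and 3 heads in the three-coin example.
frequencies = {0: 1, 1: 3, 2: 3, 3: 1}

# Expand the frequency count: each value is repeated as often as it occurs.
values = [k for k, count in frequencies.items() for _ in range(count)]

mean = sum(values) / len(values)
variance = sum((v - mean) ** 2 for v in values) / len(values)

# Closed-form results for n = 3 and p = q = .5.
n, p = 3, 0.5
q = 1 - p

print(values)
print(mean, n * p)
print(variance, n * p * q)
```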

 

Binomial Distribution within the Microsoft Excel Framework

Using De Moivre's formula in Microsoft Excel, one has to define the location of the argument, the length of the argument, the probability p, and whether the distribution should be cumulative, or not. For the example where the argument of the binomial function was generated in the column a1:a4 as 0,1,2,3, the probability p equal to .5, and standard, non-cumulative binomial function, the formula was written as

 

binomdist(a1,3,0.5,false)

 

The function was generated in the column b1:b4 as

 

 

and is plotted in the figure below

 

 

Another example of the binomial function was generated as follows. The argument was generated in the column a1:a30 as 1, 2, 3, …, 30, the probability p was set to .5, and the formula for the non-cumulative binomial function was written as

binomdist(a1,30,0.5,false)
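For readers without Excel, the same non-cumulative values can be computed in Python. This is a sketch of an equivalent computation, not a reproduction of Excel's internals; `binom_pmf` is a hypothetical helper name.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n trials (non-cumulative)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


# First example: argument 0, 1, 2, 3 with n = 3 and p = .5.
three_coins = [binom_pmf(k, 3, 0.5) for k in range(4)]

# Second example: argument 1, 2, ..., 30 with n = 30 and p = .5.
thirty_trials = [binom_pmf(k, 30, 0.5) for k in range(1, 31)]

print(three_coins)
print(max(thirty_trials))
```

Note that the second argument range starts at 1, so the probability of zero successes is not included in the plotted values.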

 

Galton's Quincunx

Galton's Quincunx is an apparatus with a single top compartment that contains a handful of marbles and a maze of ducts leading to several compartments at the bottom of the instrument.

 

 

If this apparatus is set upright, the marbles will fall through the ducts and mimic the pattern of probabilities contained in Pascal's triangle. When the marbles reach the bottom compartments, the upper contour of the stack of marbles will bear a strong resemblance to a normal curve. Galton's original device used pins positioned on a board and resembling an ornamental arrangement of five bushes; hence the name quincunx.

The probabilities with which falling marbles enter particular paths help to explain the binomial distribution. In the middle of the top compartment that contains the marbles there is an opening with a middle partition. A marble falling through this partition has an equal probability of falling into one of two lower compartments. Each of these compartments has an opening with a partition in its middle. Below this, there are four more compartments. The probability of a marble falling into the leftmost compartment is half of the .50 associated with the compartment above, i.e., .25. The probability of a marble falling into the two inner compartments is the sum of .25 from the first compartment above and .25 from the second, i.e., .50. The probability of a marble falling into the rightmost compartment is again .25. This branching pattern is repeated over several rows. The pattern of probabilities over the rows corresponds to the pattern of probabilities obtained by dividing each coefficient of Pascal's triangle by its row sum. The same pattern of probabilities could have been obtained by repeated expansions of the \((.5 + .5)^n\) expression, as

\((.5 + .5)^0 = 1\)

\((.5 + .5)^1 = .5 + .5\)

\((.5 + .5)^2 = .25 + .5 + .25\)

\((.5 + .5)^3 = .125 + .375 + .375 + .125\)

etc. The exponent n can theoretically grow without bounds. As n approaches infinity, the binomial distribution approaches the normal distribution.
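The halving of probabilities from row to row can be traced with a few lines of code. This sketch propagates exact probabilities rather than simulating individual marbles; `drop_one_row` is a hypothetical helper name.

```python
def drop_one_row(probabilities):
    """Split each compartment's probability equally between the two below."""
    below = [0.0] * (len(probabilities) + 1)
    for i, p in enumerate(probabilities):
        below[i] += p / 2      # marble deflected to the left
        below[i + 1] += p / 2  # marble deflected to the right
    return below


row = [1.0]          # all marbles start in the single top compartment
rows = [row]
for _ in range(3):
    row = drop_one_row(row)
    rows.append(row)

for r in rows:
    print(r)
```

Each printed row matches the corresponding expansion of (.5 + .5) raised to the row number.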

 

Normal Distribution

The idealized binomial distribution is called the normal distribution. For the standard scores z, the formula reads

\[ y = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2} \]

The above formula can be also written as

 

and plotted as

Normal Distribution within the Microsoft Excel Framework

In Microsoft Excel, the normal distribution can be generated as

 

 (1 / Sqrt (2 * Pi() )) * Exp (-((A1) ^ 2) / 2)

 

This is equivalent to using Excel's generic normdist function with a mean of 0, a standard deviation of 1, and the cumulative option set to false. Generating the z scores in the column a1:a61 as -3.0, -2.9, …, 3.0, and plotting the z scores on the horizontal coordinate and the y values on the vertical coordinate, results in the standard normal distribution shown above.
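The same column of values can be generated outside Excel. The sketch below mirrors the spreadsheet formula term by term; the names `normal_density`, `z_scores`, and `densities` are illustrative.

```python
import math


def normal_density(z):
    """Height of the standard normal curve at the standard score z."""
    return (1 / math.sqrt(2 * math.pi)) * math.exp(-(z ** 2) / 2)


# 61 standard scores from -3.0 to 3.0 in steps of 0.1.
z_scores = [round(-3.0 + 0.1 * i, 1) for i in range(61)]
densities = [normal_density(z) for z in z_scores]

for z, y in zip(z_scores[::10], densities[::10]):
    print(z, round(y, 4))
```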

Comparison of Binomial and Normal Distributions

For n greater than 30, there is not too much difference between the binomial and normal distributions, as shown in the figure below.

 

 

 

As n gets smaller, the binomial distribution moves noticeably to the left

 

as shown in the above figure, where the n of the binomial distribution equals 10. This shift is due to the fact that the abscissa for the normal distribution is a continuous variable whereas the abscissa for the binomial distribution is a discrete variable, a point series. Consider that it is not possible to have any values between, say, 2 and 3 heads of a coin. Within this context, for small n, a shift of +.5 is sometimes applied to the values of the binomial distribution. This is called a correction for continuity.
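The effect of the correction can be illustrated for n = 10. The sketch below assumes the correction is applied in the usual way, by shifting the cutoff by half a unit when a cumulative binomial probability is approximated by the normal curve; the text does not spell out the computation, and the function names are illustrative.

```python
from math import comb, erf, sqrt


def binom_cdf(k, n, p):
    """Exact probability of k or fewer successes in n trials."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))


def normal_cdf(x, mean, sd):
    """Area under the normal curve to the left of x."""
    return 0.5 * (1 + erf((x - mean) / (sd * sqrt(2))))


n, p = 10, 0.5
mean, sd = n * p, sqrt(n * p * (1 - p))

exact = binom_cdf(3, n, p)               # 176 / 1024
uncorrected = normal_cdf(3, mean, sd)    # cutoff at 3
corrected = normal_cdf(3.5, mean, sd)    # cutoff shifted by +.5

print(exact, uncorrected, corrected)
```

In this example the corrected approximation lies much closer to the exact binomial probability than the uncorrected one.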

 

Euler's Constant Of Growth And Decay

The shape of the normal distribution is mainly due to the transcendental number e. Its origin is rather obscure; the constant entered mathematics implicitly around the time Napier developed his system of logarithms. However, it was Euler who popularized its use and named it e, as many suspect, after the initial of his own name. Euler, a prolific mathematician credited with 886 published books and articles and averaging over 700 printed pages per year, used e to show the connection between exponential and trigonometric functions. The number e raised to a positive power is often used to describe processes of growth; raised to a negative power, it describes processes of decay.

To provide for a continuous growth of an investment, interest is computed continuously and added to the principal. One dollar at 100% interest compounded annually yields two dollars. With interest compounded semiannually, the interest at midyear is 50 cents. Added to the principal, the amount loaned is $1.50. At the end of the year, the interest on that amount is $.75; the total yield is $2.25. Compounded quarterly, the interest augments the principal as $1.00 + $.25 + $.31 + $.39 + $.49 = $2.44. This compounding process can be formalized by using the expansion of a series of binomials: \((1 + 1/1)^1 = 2.00\), \((1 + 1/2)^2 = 2.25\), \((1 + 1/4)^4 = 2.44\), and so on. Increasing the frequency of compounding to infinity defines e as

\[ e = \lim_{n \to \infty} \left(1 + \frac{1}{n}\right)^n \approx 2.71828 \]
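The compounding arithmetic can be followed numerically. This sketch computes the year-end value of one dollar at 100% nominal interest for an increasing number of compounding periods; `year_end_value` is a hypothetical helper name.

```python
import math


def year_end_value(periods):
    """Value of one dollar after a year at 100% interest, compounded
    the given number of times per year."""
    return (1 + 1 / periods) ** periods


for periods in (1, 2, 4, 12, 365, 1_000_000):
    print(periods, round(year_end_value(periods), 5))

print("e =", round(math.e, 5))
```

The values rise from 2.00 (annual) through 2.25 (semiannual) and 2.44 (quarterly) and level off near e as the number of periods grows.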

 

 

Continuously compounded interest is the prototype of continuous growth. The size of a quantity growing like continuously compounded interest is given by the exponential function

\[ y = c\,e^{pt} \]

where c is the initial size, p is the nominal growth rate given as a proportion of the unit growth rate, and t is the number of time periods. The definition of continuous decay differs from the definition of continuous growth only by the sign of the exponent

\[ y = c\,e^{-pt} \]

An illustration of the positive and negative growth may help to clarify the above discussion.

 

About Bacterial Cultures And Penicillin

Suppose a bacterial culture of 1000 bacteria increases at a rate of 30% per day. Assume that penicillin decreases the size of the culture at the same nominal rate and is added to the culture at the end of the fourth day. The rise and fall of the bacterial culture can be calculated and is summarized in Table 21.1.

 

 

Table 21.1   Growth and Decay of Bacterial Cultures

 

Both the positive and the negative growth of the bacterial culture are presented in Figure 21.2. After an initial rapid growth, the trend is reversed by the introduction of penicillin. At the end of the ninth day the culture is back to its original size.
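Since the values of Table 21.1 are not reproduced here, the following sketch recomputes a comparable table under stated assumptions: continuous growth at a nominal rate of .30 per day for four days, followed by continuous decay at the same rate, with elapsed days counted from zero. How these elapsed days map onto the day numbering of the original table is not recoverable, and the names used are illustrative.

```python
import math

INITIAL_SIZE = 1000   # bacteria at the start
RATE = 0.30           # nominal growth (and decay) rate per day
GROWTH_DAYS = 4       # penicillin is added after this many days


def culture_size(elapsed_days):
    """Size of the culture after the given number of elapsed days."""
    if elapsed_days <= GROWTH_DAYS:
        return INITIAL_SIZE * math.exp(RATE * elapsed_days)
    peak = INITIAL_SIZE * math.exp(RATE * GROWTH_DAYS)
    return peak * math.exp(-RATE * (elapsed_days - GROWTH_DAYS))


table = [(day, round(culture_size(day))) for day in range(9)]
for day, size in table:
    print(day, size)
```

Because growth and decay use the same nominal rate, the decay phase in this sketch mirrors the growth phase and needs as many days to undo the growth as the growth took.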

 

 

Figure 21.2   Accelerated and Decelerated Growth Function

 

The Gamma Distributions

Functions can be classified as algebraic and transcendental. An algebraic function is a function that is a root of a polynomial equation. A function that is not a root of a polynomial equation is called transcendental. Most of the functions that describe natural phenomena turn out to be transcendental, as are the trigonometric, logarithmic, exponential, and hyperbolic functions. The theory of higher transcendental functions was elaborated by Euler (1707-1783), who also introduced the beta and gamma transcendental functions. Most sampling distributions of inferential statistics belong to the family of the gamma density functions. Some textbooks on statistics ascribe the t-distribution to Student and the F distribution to Snedecor. These statisticians only called attention to the applicability of some of the higher transcendental functions to the theory of statistical inference; the gamma density functions themselves are due to Euler. These functions have a general form

 

 

Examples of gamma density functions are

 

 

 

 

approximating the normal distribution, whose equation is

 

 

The y1, y2, and y3 gamma functions were plotted below, as

 

 

The t-Distribution

The t-distribution belongs to Euler's family of the gamma distributions. The density function for the t-distribution, associated with a certain number of degrees of freedom signified by the Greek letter \(\nu\), is shown below.

\[ y = \frac{\Gamma\left(\frac{\nu + 1}{2}\right)}{\sqrt{\nu\pi}\;\Gamma\left(\frac{\nu}{2}\right)} \left(1 + \frac{t^2}{\nu}\right)^{-\frac{\nu + 1}{2}} \]

In the above equation \(\Gamma(\nu) = (\nu - 1)!\). When this argument is an integer, the gamma function is just a factorial offset by one; however, the gamma function extends the factorial to all positive real numbers.

When the degrees of freedom grow large, the t-distribution changes to normal distribution. In the above equation, the constant a in Euler's Gamma Equation

 

can be written as

 

This constant, for degrees of freedom approaching infinity, approximates .3989. Thus

 

 

In the above equation the limit of the expression in the square brackets is Euler's e and the limit of the exponent in the oblique brackets equals -.5, thus

 

 

and the equation for the t-distribution, with large number of degrees of freedom, can be written within the framework of Euler's Gamma Equations as

 

 

As the above equation signifies the normal distribution, Gauss (1777-1855) could have claimed predominance in describing astronomical applications of the normal distribution, but hardly primacy in the description of its analytical form.

 

The t-Distribution within the Microsoft Excel Framework

Within the Microsoft Excel computing environment, the gamma function of a non-integer argument cannot be obtained from the factorial. Using Gammaln, the natural logarithm of the gamma function, circumvents this difficulty: the number e raised to the power Gammaln(n) returns, if n is an integer, the same result as the factorial of n decremented by one. However, the Gammaln function also works with arguments that are not integers. For five degrees of freedom, Microsoft Excel's formula for the t-distribution can be written as

= (1 / Sqrt ( 5 * Pi() )) * (2 / Exp ( Gammaln (2.5))) * (1 + (A1 ^ 2) / 5) ^ -3.

 

and plotted as in the figure below.

 

Theoretically, the normal distribution and the t-distribution are identical only for an infinite number of degrees of freedom. Practically, you may see for yourself that the differences between the normal and t-distributions are not so large. Arguably, for sample sizes greater than 30, and undoubtedly, for sample sizes greater than 60, the differences between the t-distribution and the normal distribution are negligible.
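The convergence can be inspected numerically. The sketch below assumes the standard form of the t density, which agrees with the Excel formula for five degrees of freedom given above; it uses `math.lgamma`, Python's counterpart of Gammaln. The function names are illustrative.

```python
import math


def t_density(t, df):
    """Height of the t-distribution curve at t for df degrees of freedom."""
    log_constant = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                    - 0.5 * math.log(df * math.pi))
    return math.exp(log_constant) * (1 + t ** 2 / df) ** (-(df + 1) / 2)


def excel_style_t5(t):
    """Term-by-term transcription of the spreadsheet formula for df = 5."""
    return ((1 / math.sqrt(5 * math.pi))
            * (2 / math.exp(math.lgamma(2.5)))
            * (1 + t ** 2 / 5) ** -3)


def normal_density(z):
    return math.exp(-(z ** 2) / 2) / math.sqrt(2 * math.pi)


for df in (5, 30, 60, 1_000_000):
    print(df, round(t_density(0.0, df), 4), round(t_density(2.0, df), 4))
print("normal", round(normal_density(0.0), 4), round(normal_density(2.0), 4))
```

As the degrees of freedom grow, the height of the curve at zero approaches .3989, the constant of the normal distribution.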

Convergence of Critical Values of t and z

The critical values for one-tailed t and z tests at the .05 significance level, for selected degrees of freedom, are reported in the table below.

df      t       t²
1       6.31    39.87
2       2.92    8.53
3       2.35    5.54
4       2.13    4.54
5       2.02    4.06
6       1.94    3.78
7       1.90    3.59
8       1.86    3.46
9       1.83    3.36
10      1.81    3.28
15      1.75    3.07
20      1.72    2.97
30      1.70    2.88
40      1.68    2.84
60      1.67    2.79
120     1.66    2.75
∞       1.64    2.71

Values of t and t² corresponding to the five percent area of the t-distribution for selected degrees of freedom (one-tailed test). The degrees of freedom equal n - 2. The t² equals F for 1 degree of freedom. For infinitely large degrees of freedom, t equals z.

Using the t distribution for estimation of probability associated with the strength of a relationship in lieu of the normal distribution increases the threshold of the significance criterion and thus makes results less likely to be significant when a small number of subjects is used for analysis. For groups of subjects larger than 60, the z-test and t-tests can be used interchangeably.

 

The F Distribution

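Among the higher transcendental functions, one frequently used within the area of statistical inference is the inverted beta distribution, also called, as coined by Snedecor, the F distribution. Like other probability distributions, the F distribution belongs to the family of gamma functions. The density function for the F distribution, associated with two numbers of degrees of freedom signified by the Greek letters \(\nu_1\) and \(\nu_2\), is

\[ y = \frac{\Gamma\left(\frac{\nu_1 + \nu_2}{2}\right)}{\Gamma\left(\frac{\nu_1}{2}\right)\Gamma\left(\frac{\nu_2}{2}\right)} \left(\frac{\nu_1}{\nu_2}\right)^{\nu_1/2} F^{\,\nu_1/2 - 1} \left(1 + \frac{\nu_1}{\nu_2} F\right)^{-\frac{\nu_1 + \nu_2}{2}} \]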

The F Distribution within the Microsoft Excel Framework


In the above equation \(\Gamma(\nu) = (\nu - 1)!\). For both degrees of freedom equal to 10, the above equation was written for Microsoft Excel as

=630 * a1^4 * (1 + a1) ^ -10

The constant 630 within the above expression was computed as 9! / (4! 4!). This F(10,10) distribution is shown in the figure below.
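The constant and the spreadsheet formula can be checked against the general form of the F density. The sketch below assumes the standard F density; the function names are illustrative.

```python
import math


def f_density(x, df1, df2):
    """Height of the F distribution curve at x (standard form assumed)."""
    log_constant = (math.lgamma((df1 + df2) / 2)
                    - math.lgamma(df1 / 2) - math.lgamma(df2 / 2)
                    + (df1 / 2) * math.log(df1 / df2))
    return (math.exp(log_constant) * x ** (df1 / 2 - 1)
            * (1 + df1 * x / df2) ** (-(df1 + df2) / 2))


def excel_style_f10_10(x):
    """Term-by-term transcription of the spreadsheet formula for F(10,10)."""
    return 630 * x ** 4 * (1 + x) ** -10


constant = math.factorial(9) / (math.factorial(4) * math.factorial(4))
print(constant)

for x in (0.5, 1.0, 2.0, 3.0):
    print(x, round(f_density(x, 10, 10), 5), round(excel_style_f10_10(x), 5))
```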

The ease with which Microsoft Excel permits the visualization of higher transcendental functions removes much of the mythology and obfuscation from statistical data analysis.

Values of F for selected degrees of freedom at the five percent level of significance (one-tailed test) are shown in the table below.

                  Numerator df
Denominator df    1        2        3        ∞
1                 39.87    49.5     53.6     63.3
2                 8.53     9.00     9.16     9.49
3                 5.54     5.46     5.39     5.13
4                 4.54     4.32     4.19     3.76
5                 4.06     3.78     3.62     3.11
6                 3.78     3.46     3.29     2.72
7                 3.59     3.26     3.07     2.47
8                 3.46     3.11     2.92     2.29
9                 3.36     3.01     2.81     2.16
10                3.28     2.92     2.73     2.06
15                3.07     2.70     2.49     1.76
20                2.97     2.59     2.38     1.61
30                2.88     2.49     2.28     1.49
40                2.84     2.44     2.23     1.38
60                2.79     2.39     2.18     1.29
120               2.75     2.35     2.13     1.19
∞                 2.71     2.30     2.08     1.00

For one degree of freedom, F equals t².

Critical Values in the F Distribution

 

 

 

 

The Chi Square Distribution

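The equation for the chi square distribution is

\[ y = \frac{x^{\nu/2 - 1}\, e^{-x/2}}{2^{\nu/2}\,\Gamma(\nu/2)} \]

where x is the value of chi square and \(\nu\) is the number of degrees of freedom.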

The above equation conforms to the general form of the Euler's gamma function

 

For \(\nu\) degrees of freedom, the constant a equals \(1 / (2^{\nu/2}\,\Gamma(\nu/2))\), the constant b equals \(\nu/2 - 1\), the constant c equals .5 and the constant d equals 2.

For example, for 10 degrees of freedom a equals 1/768, b equals 4, c equals .5 and d equals 2.

As shown in the table (p = .05) below,

df      chi-square
1       3.841
3       7.815
5       11.07
10      18.307
20      31.410
30      43.773
40      55.759
50      67.505

for one degree of freedom, chi-square equals z-square. For an infinite number of degrees of freedom, chi-square, divided by the degrees of freedom, equals F with both of its degrees of freedom equal to infinity.

 

The Chi Square Distribution within the Microsoft Excel Framework

 

 

 The above figure was plotted by using Microsoft Excel, using the equation (a1^4*2.71828^ (-a1/2))/768.
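The divisor 768 in the spreadsheet equation follows from the general chi-square density for 10 degrees of freedom. The sketch below assumes the standard form of that density; the function names are illustrative.

```python
import math


def chi_square_density(x, df):
    """Height of the chi-square curve at x (standard form assumed)."""
    log_constant = -(df / 2) * math.log(2) - math.lgamma(df / 2)
    return math.exp(log_constant) * x ** (df / 2 - 1) * math.exp(-x / 2)


def excel_style_chi10(x):
    """Term-by-term transcription of the spreadsheet equation for df = 10."""
    return (x ** 4 * 2.71828 ** (-x / 2)) / 768


# For df = 10 the divisor is 2**5 times Gamma(5), i.e. 32 * 4!.
divisor = 2 ** 5 * math.gamma(5)
print(divisor)

for x in (2.0, 8.0, 18.307):
    print(x, round(chi_square_density(x, 10), 5), round(excel_style_chi10(x), 5))
```

The two columns agree to the precision allowed by the rounded value 2.71828 used for e in the spreadsheet equation.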

 

Perspective on Gamma Distributions

Within statistical computer programs, the probabilities associated with the z, t, F, and chi-square ratios may be calculated by a single subroutine. This subroutine normalizes the F distribution as

 

 

After the normalization, this subroutine uses the polynomial approximations to find areas under the normal distribution, corresponding to standard z scores, as

 

 

where c1 = .196854, c2 = .115194, c3 = .000344, and c4 = .019527. 

For the example of the F(10,10) distribution, the conversion equation was written for Microsoft Excel as


=((0.9778*A2^(1/3)-0.9778) / (0.9778*A2^(2/3)+0.9778)^(1/2))*10


and the distribution was standardized as

The slight irregularity in the left tail of the standardized distribution is, in real-life computer implementations, removed by Kelley's correction.

 

Probability Associated with the z-Ratio

Since F equals z-square with (1, infinity) degrees of freedom, the probability associated with the z-square ratio can be obtained as p = fSig(1, 1000, z-Square). Infinity is represented by a large number, usually equal to 1,000.

Probability Associated with the t-Ratio

Since F equals t-square with (1, df) degrees of freedom, the probability associated with the t-square ratio can be obtained as p = fSig(1, df, t-Square).

Probability Associated with the F Ratio

This probability is obtained by calling the fSig subroutine as p = fSig(df1, df2, F).

Probability Associated with the Chi Square Ratio

Since chi-square divided by its degrees of freedom equals F with (df, infinity) degrees of freedom, the probability associated with the chi-square ratio can be obtained as p = fSig(df, 1000, Chi-Square / df).
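How a single routine can serve all four ratios may be sketched as follows. The body of the fSig subroutine is not given in the text, so this is an assumption-laden illustration rather than the actual routine: it assumes a cube-root normalization of F to a standard score and a fourth-power polynomial approximation of the normal tail area that uses the constants c1 to c4 quoted above. Kelley's correction is omitted, and all function names are hypothetical.

```python
import math

# Polynomial constants c1..c4 as quoted in the text.
C1, C2, C3, C4 = 0.196854, 0.115194, 0.000344, 0.019527

INFINITE_DF = 1000  # large number standing in for infinity


def upper_tail_normal(z):
    """Approximate area above z under the normal curve (assumed form)."""
    a = abs(z)
    tail = 0.5 / (1 + a * (C1 + a * (C2 + a * (C3 + a * C4)))) ** 4
    return tail if z >= 0 else 1 - tail


def f_sig(df1, df2, f):
    """Approximate probability of an F ratio at least as large as f."""
    if f <= 0:
        return 1.0
    a = 2 / (9 * df1)
    b = 2 / (9 * df2)
    # Assumed cube-root normalization of F to an approximate z score.
    z = ((1 - b) * f ** (1 / 3) - (1 - a)) / math.sqrt(b * f ** (2 / 3) + a)
    return upper_tail_normal(z)


def z_sig(z):
    return f_sig(1, INFINITE_DF, z ** 2)


def t_sig(t, df):
    return f_sig(1, df, t ** 2)


def chi_square_sig(chi_square, df):
    return f_sig(df, INFINITE_DF, chi_square / df)


print(round(f_sig(10, 10, 2.98), 3))
print(round(z_sig(1.96), 3))
print(round(t_sig(2.228, 10), 3))
print(round(chi_square_sig(18.307, 10), 3))
```

Because the z and t ratios enter as squares, the probabilities returned for them are two-tailed. All four calls use tabled .05 critical values and return probabilities close to .05, with the approximation being least accurate when the first degrees of freedom equal 1.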

 

Pearsonian and Fisherian Conceptualizations of Statistical Inference

The above section provides insight into the apparent inconsistency in the conceptualization of the t-square ratio

 

and the chi-square ratio where

 

The t-square ratio is characteristic of the Fisherian conceptualization of statistical inference, with the degrees of freedom used throughout all computations leading to the t-square ratio. The chi-square ratio is characteristic of the Pearsonian conceptualization of statistical inference, where the degrees of freedom are introduced only during the last phase of the computation of the probability associated with the chi-square.

 

Logical Basis of Statistical Distributions

A seminal idea in statistical mechanics is that of Maxwell's demon. Named after the Scottish physicist James Clerk Maxwell, Maxwell's demon is a hypothetical homunculus that is considered to admit or block passage of individual molecules between adjacent compartments. If provided with information about the speed of individual molecules, Maxwell's demon would be able to violate the second law of thermodynamics.

The notion of Maxwell's demon can be adapted for use within the classical theory of statistics and its associated theories of probability distributions and scaling. Within this context, let us assume that a group of Maxwell's demons operate within an environment of gates and compartments, provided by Galton's Quincunx. Let us further assume that each demon occupies a single decision node in the Quincunx and acts in accordance with the principles of formal logic, as defined by functions of propositional calculus. The experimenter can select functions, determining the demons' behavior, for each experimental run of the Quincunx. In this paper we describe the results of three trial runs of the above-defined Quincunx using different logical functions for each trial.

 

The Maxwell's Demons Quincunx

Aside from its characteristic honeycomb lattice of decision points connected with bottom compartments characteristic of Galton's Quincunx, the Quincunx of Maxwell's demons also contains an incipient data matrix of all possible responses to a set of binary scored questions. This data matrix is called plenum and is defined as a truth table of formal logic. A plenum of possible responses to four binary variables p, q, r, and s is shown on the left side of the following diagrams. On the right side of these figures can be observed a matrix of outcomes containing response patterns congruent with the logical function sent to Maxwell's demons. This matrix of outcomes defines the behavior of a ball moving through the grid of Quincunx's decision points. The elements of the data matrix of the outcomes containing ones (corresponding to true values of the logical truth tables) signify a path leading toward the right side of the Quincunx. The elements containing zeroes (corresponding to false values of the logical truth tables) signify a path leading toward the left side. The trajectory of the ball, traveling through the Quincunx is controlled by a group of Maxwell's demons operating the gate mechanism of the Quincunx according to principles of Boolean algebra.

 

Simulating a Binomial Distribution

This simulation corresponds to Galton's original model. Let us submit to the demons a tautological function f = taut (p, q, r, s). The computerized version of Maxwell's demons operates in this case as follows.

 

 

The plenum of responses to a set of binary scored questions is solved as if it were a truth table of formal logic. The solution is tautological, shown as a column of true (1) values in the second column of the diagram. The response patterns corresponding to the true values of the tautology function are replicated in the matrix of outcomes, as shown in the third column of the diagram. The values of the data matrix of outcomes are associated with a display of the hexagonal layers of gates and bottom bins that comprise the Quincunx. This matrix of outcomes controls the movement of the balls traveling through the layers of the hexagonal lattice. Zeroes move the ball toward the left, ones toward the right. The balls are stacked inside the compartments located underneath the hexagonal lattice. The resulting distribution is the same as that corresponding to the frequency counts in the last column of the diagram.

The outcome of the simulation is a binomial distribution, approximating the normal distribution. Both the binomial and normal distributions reflect the influence of causal determinants on the outcome of events. According to the binomial model, a phenomenon that has one determinant has two possible outcomes, a phenomenon with two determinants has four outcomes, etc. Within this context, the logical model may help to understand the ubiquity of the binomial and normal distributions as reflections of determinants of such diverse phenomena as biological characteristics, physical and mental traits, and societal events. For example, allowing Maxwell's demons to realize possible combinations of determinants within the phylogenetic and ontogenetic repertoire of organisms reflects the optimum strategy of species survival. The logic here is that anything that is possible may and will be tried. Analogous outcomes can also be observed for individual behavior and the behavior of societies. Laws of society and its ethical precepts may curtail manifestations of some outcomes; however, the magnitude of the environmental urgency is typically matched by the degree to which the outcome is usual or unusual, expected or unexpected, moderate or extreme.
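The tautology run can be imitated in a few lines. This sketch is not the computerized version referred to above; it assumes only that every response pattern admitted by the logical function sends one ball through the lattice and that the number of ones in the pattern determines the bin. The names are illustrative.

```python
from itertools import product
from collections import Counter

# Plenum: all 16 response patterns for the binary variables p, q, r, s.
plenum = list(product([0, 1], repeat=4))


def taut(p, q, r, s):
    """Tautology: true for every response pattern."""
    return True


# Patterns admitted by the demons form the matrix of outcomes.
outcomes = [row for row in plenum if taut(*row)]

# Each 1 moves the ball one step to the right, so the number of ones
# in a pattern is the index of the bin in which the ball lands.
bins = Counter(sum(row) for row in outcomes)

for bin_index in range(5):
    print(bin_index, "*" * bins[bin_index])
```

The bin counts 1, 4, 6, 4, 1 are the binomial coefficients for four binary variables.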

 

Simulating a Rectangular Distribution

If we submit to Maxwell's demons a logical function f = (p -> q) & (q -> r) & (r -> s), the outcome results in a rectangular distribution.

 

 

 

The solution to the conjunction of implication functions is shown in the second column of the diagram and the table of outcomes is shown toward the right side of the diagram. The implication (->) returns a false (0) value only in the case of the (1,0) response pattern. The conjunction (&) returns a true (1) value only in the case of the (1,1) response pattern. The arguments within the parentheses are solved first, the conjunctions of implications next.

On a trial run, the balls moving through the lattice of decision points will form a rectangular distribution that corresponds to an idealized version of a perfect Guttman scale, also called an implicational scale. From the standpoint of formal logic, the definition of the Guttman scales as conjunctions of implication functions indicates that implicational scales are renderings of Aristotelian syllogisms.
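The conjunction of implications can be evaluated over the plenum in the same way. The sketch below is an illustration under the same assumptions as before (one ball per admitted pattern, bin index equal to the number of ones); the function names are hypothetical.

```python
from itertools import product
from collections import Counter

plenum = list(product([0, 1], repeat=4))


def implies(a, b):
    """Implication: false only for the (1, 0) pattern."""
    return (not a) or bool(b)


def implication_chain(p, q, r, s):
    """f = (p -> q) & (q -> r) & (r -> s)."""
    return implies(p, q) and implies(q, r) and implies(r, s)


outcomes = [row for row in plenum if implication_chain(*row)]
bins = Counter(sum(row) for row in outcomes)

for row in outcomes:
    print(row)
print([bins[i] for i in range(5)])
```

Only the five cumulative patterns 0000, 0001, 0011, 0111, and 1111 survive, one per bin, which is the rectangular distribution of a perfect Guttman scale.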

 

Simulating Statistical Significance

To simulate a test of statistical significance, the behavior of Maxwell's demons has to be determined by two logical functions, f1 = .not. p & taut(q, r, s), and f2= p & taut(q, r, s). The algorithm for this simulation is shown in the diagram below. The solution for the first function is presented in the second column and the solution for the second function in the third column. The outcomes are shown on the right side of the diagram.

 

 

 

 

On a trial run, the balls moving through the lattice of the Quincunx decision points will form two binomial distributions, shifted to the extent that the determining attribute of a particular outcome is present. From the standpoint of formal logic, a test of statistical significance suggests that a particular determinant has a significant influence on the outcome of an event. Within the context of an experiment, the question of whether a treatment determines a yield, a factor an outcome, or the independent variable the dependent variable, reflects the same type of logical reasoning.
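The two shifted distributions can be produced by evaluating both functions over the plenum. As before, this is a sketch under the assumption that each admitted pattern sends one ball to the bin indexed by its number of ones; the names are illustrative.

```python
from itertools import product
from collections import Counter

plenum = list(product([0, 1], repeat=4))


def f1(p, q, r, s):
    """f1 = .not. p & taut(q, r, s): the determinant p is absent."""
    return not p


def f2(p, q, r, s):
    """f2 = p & taut(q, r, s): the determinant p is present."""
    return bool(p)


bins_absent = Counter(sum(row) for row in plenum if f1(*row))
bins_present = Counter(sum(row) for row in plenum if f2(*row))

first = [bins_absent[i] for i in range(5)]
second = [bins_present[i] for i in range(5)]
print(first)
print(second)
```

Both runs yield the binomial frequencies 1, 3, 3, 1, but the second distribution is displaced by one bin to the right, reflecting the presence of the determinant p.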