Based on Krus, D.J., & Ceurvorst, R.W. (1979). Dominance, information, and hierarchical scaling of variance space. Applied Psychological Measurement, 3, 515-527.
A novel conceptualization of matrix subtraction can be used to compute variance from all possible differences between data elements. Also discussed are the linkage of variance to information and the visualization of hierarchical structures of data elements.
The concept of variance, coined in terms of all possible differences between values of a variable, was introduced by von Andrae (1872) and Helmert (1876) in a series of articles in Astronomische Nachrichten. In the middle of the last century, the use of all possible differences between values of a variable as a foundation of statistical theory was contemplated by Kendall (1943, p. 47), who defined a coefficient, called here $u^2$, as
(1)
For the discontinuous infinite case, the above equation can be written as
(2)
and for the finite case as
(3)
where the summed term in the above equation is a vector of all possible differences between elements of variable x. Pointing out that the value of the $u^2$ coefficient is “dependent on the spread of the variate-values among themselves and not on the deviations from some central value” (p. 47), Kendall shows that $u^2 = 2\sigma^2$, concludes that the initial defining formula is “nothing but twice the variance” (p. 47), and abandons the idea. One can only wonder which direction statistics could have taken if
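Kendall's conclusion can be checked numerically. The sketch below assumes that the finite-case coefficient averages the squared differences over all n × n ordered pairs (including each element paired with itself); the function name `mean_squared_difference` and the data are illustrative and do not come from the original paper.

```python
from itertools import product
from statistics import pvariance


def mean_squared_difference(x):
    """Average of (a - b)^2 over all n*n ordered pairs of elements of x.

    Assumption: every ordered pair counts, including each element
    paired with itself (which contributes zero).
    """
    n = len(x)
    return sum((a - b) ** 2 for a, b in product(x, repeat=2)) / (n * n)


scores = [1, 2, 3, 4, 5]  # illustrative data
u2 = mean_squared_difference(scores)

# Under the stated assumption, u2 equals twice the population variance.
print(u2, 2 * pvariance(scores))
```

Under a different normalization (for example, division by n(n - 1)), the factor of two would apply to the unbiased rather than the maximum likelihood variance.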
The idea that analysis of variance can be linked with the mathematical theory of information appeared shortly after Shannon and Weaver (1949) founded the discipline (Miller & Madow, 1954; Garner & McGill, 1956). However, the initial interest in this relationship waned, as expressing information in terms of base-two logarithms made this index incompatible with the mainstream methods of data analysis.
In a similar vein, initial interest in the matrix algebra rendering of analysis of variance designs, following publication of Horst’s (1963, p. 271) Matrix Algebra for Social Scientists, subsided with the realization that Horst’s expression of variance in matrix algebra terms, as
$$\sigma^2 = \frac{\mathbf{x}'(\mathbf{I} - \mathbf{1}\mathbf{1}'/n)\,\mathbf{x}}{n}$$ (4)
lacks a theoretically interesting interpretation of the $\mathbf{I} - \mathbf{1}\mathbf{1}'/n$ term.
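To make the matrix expression concrete, the following sketch assumes that Horst's formula has the familiar form x'(I - 11'/n)x / n and evaluates it with plain Python lists. The function name `horst_variance` is hypothetical and used only for this illustration.

```python
def horst_variance(x):
    """Variance computed as x'(I - 11'/n)x / n (assumed form of the formula)."""
    n = len(x)
    # C = I - 11'/n: (1 - 1/n) on the diagonal and -1/n everywhere else.
    c = [[(1.0 if i == j else 0.0) - 1.0 / n for j in range(n)]
         for i in range(n)]
    # The product Cx replaces every score by its deviation from the mean.
    cx = [sum(c[i][j] * x[j] for j in range(n)) for i in range(n)]
    # x'(Cx) then equals the sum of squared deviations from the mean.
    return sum(x[i] * cx[i] for i in range(n)) / n


print(horst_variance([1, 2, 3, 4, 5]))
```

In this arithmetic the I - 11'/n term does nothing more than subtract the mean from each score, so the expression reduces to the usual sum of squared deviations divided by n.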
My interest in these old issues was aroused by the observation of a subtle inconsistency in the conceptualization of basic matrix algebra operations: textbooks on matrix algebra routinely describe major and minor vector products but do not suggest analogous operations for the major and minor sums and differences of summands, minuends, and subtrahends. These operations are easy to imagine and are not discussed because most of their potential applications can be accomplished equally well by multiplication by unit vectors. However, on close scrutiny, the matrix algebra operations of addition and subtraction (of vectors, not elements of vectors; of matrices, not elements of matrices) can be used for concise expression of several key algorithms of statistical theory and the theory of probability. This paper is a 2006 rewrite of my paper published with Robert Ceurvorst in 1979, where these issues were discussed in seminal form.
Consider a vector $\mathbf{x}$ of $n$ test scores. A major difference matrix is defined as
$$\mathbf{D} = [d_{ij}], \qquad d_{ij} = x_i - x_j$$ (5)
Since the elements of $\mathbf{D}$ are symmetric about the zero-filled principal diagonal but carry opposite signs, squaring each element would render this skew-symmetric matrix symmetric and thus 50% redundant. To eliminate this redundancy (i.e., in effect, to use each pairwise difference only once), all negative elements in $\mathbf{D}$ are set equal to zero. If the elements of $\mathbf{x}$ are arranged in ascending or descending order, this results in a triangular matrix $\mathbf{D}^{+}$, i.e.,
$$d^{+}_{ij} = \max(d_{ij},\, 0)$$ (6)
If a matrix $\mathbf{S}$ is defined, where
$$s_{ij} = \left(d^{+}_{ij}\right)^2$$ (7)
the maximum likelihood (true) variance of $\mathbf{x}$ can be written as
$$\sigma^2 = \frac{\mathbf{1}'\mathbf{S}\mathbf{1}}{n^2}$$ (8)
where $\mathbf{1}$ is a column vector of unities and $\mathbf{1}'$ is its transpose. Using summation notation, Equation 8 is equivalent to
$$\sigma^2 = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j<i}(x_i - x_j)^2$$ (9)
A formal proof that variance, as defined in Equation 9, equals the more common variance formula
$$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$$ (10)
was provided by
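The equivalence can also be verified directly. The derivation below is a short sketch, not the published proof; it assumes that Equation 9 sums the squared differences over the n(n - 1)/2 distinct pairs and divides by n², and that Equation 10 is the sum of squared deviations from the mean divided by n.

```latex
\begin{align*}
\sum_{i<j}(x_i - x_j)^2
  &= \tfrac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}(x_i - x_j)^2
     && \text{(each pair counted twice, diagonal terms are zero)}\\
  &= \tfrac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\left(x_i^2 - 2x_i x_j + x_j^2\right)\\
  &= n\sum_{i=1}^{n}x_i^2 - \Bigl(\sum_{i=1}^{n}x_i\Bigr)^2\\
  &= n\Bigl(\sum_{i=1}^{n}x_i^2 - n\bar{x}^2\Bigr)
   = n\sum_{i=1}^{n}(x_i - \bar{x})^2 .
\end{align*}
% Dividing both sides by n^2 gives
\[
\frac{1}{n^2}\sum_{i<j}(x_i - x_j)^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2 .
\]
```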
DIFFERENCES BETWEEN DATA ELEMENTS AND THEIR MEAN
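Consider, for instance, a vector $\mathbf{x}' = [1\ 2\ 3\ 4\ 5]$ with mean equal to 3 and true variance equal to 2. The variance is typically computed as shown below.
$$\sigma^2 = \frac{(1-3)^2 + (2-3)^2 + (3-3)^2 + (4-3)^2 + (5-3)^2}{5} = \frac{10}{5} = 2$$ (11)
The matrix D, with elements $d_{ij} = x_i - x_j$, can be computed for this instance as
$$\mathbf{D} = \begin{bmatrix} 0 & -1 & -2 & -3 & -4 \\ 1 & 0 & -1 & -2 & -3 \\ 2 & 1 & 0 & -1 & -2 \\ 3 & 2 & 1 & 0 & -1 \\ 4 & 3 & 2 & 1 & 0 \end{bmatrix}$$ (12)
The above matrix can be triangularized
$$\mathbf{D}^{+} = \begin{bmatrix} 0 & 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 \\ 2 & 1 & 0 & 0 & 0 \\ 3 & 2 & 1 & 0 & 0 \\ 4 & 3 & 2 & 1 & 0 \end{bmatrix}$$ (13)
and its corresponding matrix S computed as
$$\mathbf{S} = \begin{bmatrix} 0 & 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 \\ 4 & 1 & 0 & 0 & 0 \\ 9 & 4 & 1 & 0 & 0 \\ 16 & 9 & 4 & 1 & 0 \end{bmatrix}$$ (14)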
The variance of x' = [1 2 3 4 5] can be computed for this instance by using Eq. 8 as 50/25 = 2. The matrix D contains information about all differences between the elements of x. It seems plausible to assume that this information can also be obtained from a matrix of all possible differences between the elements of x and its mean. Thus matrix M can be constructed as
(15)
Its corresponding matrix can be obtained by squaring its elements,
(16)
illustrating why the variance of the vector x can be computed either as $\mathbf{1}'\mathbf{S}\mathbf{1}/n^2$ (for the example, 50/25 = 2) or as $\sum_i (x_i - \bar{x})^2/n$ (for the example, 10/5 = 2). These historical antecedents of the conceptualization of variance help to understand its true meaning.
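The two routes to the variance can be reproduced for the example vector with a few lines of code. This is a minimal sketch with hypothetical helper names (`major_difference`, `positive_part`, `squared`, `total`); it assumes the sign convention d[i][j] = x[i] - x[j] for the difference matrix.

```python
def major_difference(x):
    """Matrix of all pairwise differences, d[i][j] = x[i] - x[j] (assumed sign)."""
    return [[xi - xj for xj in x] for xi in x]


def positive_part(d):
    """Set negative entries to zero so that each pair is used only once."""
    return [[v if v > 0 else 0 for v in row] for row in d]


def squared(m):
    """Square every element of a matrix."""
    return [[v * v for v in row] for row in m]


def total(m):
    """Sum of all entries, the scalar that 1'M1 produces."""
    return sum(sum(row) for row in m)


x = [1, 2, 3, 4, 5]
n = len(x)

# Route 1: squared pairwise differences, each pair once, divided by n^2.
S = squared(positive_part(major_difference(x)))
var_from_pairs = total(S) / n ** 2

# Route 2: squared deviations from the mean, divided by n.
mean = sum(x) / n
var_from_mean = sum((v - mean) ** 2 for v in x) / n

print(total(S), var_from_pairs, var_from_mean)
```

Route 1 never computes the mean, which is the point of the pairwise formulation; Route 2 needs only n terms instead of n(n - 1)/2.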
VARIANCE AND INFORMATION
Initially, the above conceptualization of variance may appear obtuse; however, it offers a possibility to link variance to measures of information, not by defining information by Shannon’s equation $H = \log_2 m$, where $m$ is the number of equiprobable alternatives, as done by Garner & McGill (1956), but by defining information in terms of the 1-0 changes. This preserves the basic definition of bits in information theory in a way that is congruent with how information is conceptualized within statistical theory. The key relationship between the above skew-symmetric matrix and the theory of information can be found within Guttman’s (1946) theory of implicational scales, as elaborated by Krus (1977). Let us express variable x, using the binary units of information theory, as a matrix of implicative relationships iX, for the current example
(17)
The row sums of the binary matrix iX are the values of the variable x' = [1 2 3 4 5]. The binary matrix iX can also be used to define the variance of the variable x, since
(18)
and, for the example,
(19)
The matrix in Eq. 19 is identical to the matrix in Eq. 13, suggesting a relationship between information, defined in terms of the 1-0 bits of information theory
(20)
and variance, as used within statistics and data analysis
(21)
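One way to see how a count of 1-0 changes can recover the triangular difference matrix is sketched below. The layout of the binary matrix (row i holds x[i] ones followed by zeros) and the counting rule are assumptions made for this illustration, and the function names `guttman_matrix` and `one_zero_changes` are hypothetical; the original equations may be written differently. The sketch applies only to non-negative integer scores.

```python
def guttman_matrix(x):
    """Binary matrix whose i-th row holds x[i] ones followed by zeros.

    Assumed layout of a perfect implicational scale.
    """
    width = max(x)
    return [[1 if k < xi else 0 for k in range(width)] for xi in x]


def one_zero_changes(b):
    """c[i][j] = number of columns where row i has a 1 and row j has a 0."""
    return [[sum(bi[k] * (1 - bj[k]) for k in range(len(bi))) for bj in b]
            for bi in b]


x = [1, 2, 3, 4, 5]
iX = guttman_matrix(x)

row_sums = [sum(row) for row in iX]   # recovers the values of x
changes = one_zero_changes(iX)        # equals max(x[i] - x[j], 0) for this layout

# Squaring the counts and dividing their total by n^2 gives the variance.
variance = sum(v * v for row in changes for v in row) / len(x) ** 2

print(row_sums, variance)
```

For a perfect scale of this kind, the count of 1-0 changes between two rows is exactly their positive score difference, which is why the resulting matrix coincides with the triangular matrix obtained from the pairwise differences.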
HIERARCHICAL STRUCTURE OF DATA VECTORS
The directional differences (or dominance relations) among the row marginal referents of the vector x also imply the hierarchical structure of this data vector. For the example, this structure corresponds to the triangular matrix of Eq. 13, if conceptualized as a matrix adjacent to an ordered graph (Fig. 1).
Fig. 1. Dendrogram constructed from the skew-symmetric matrix D, triangulated into its positive form (Eq. 13), and conceptualized as an adjacency matrix to an ordered graph.
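The reading of the triangular matrix as an adjacency matrix can be illustrated as follows. This sketch is one possible interpretation, not the procedure used to draw the figure: it treats every positive entry as a dominance edge and then keeps only the direct links (a transitive reduction), which yields the ordered chain of elements. The function names are hypothetical.

```python
def dominance_edges(d_plus):
    """Pairs (i, j) such that element i dominates element j (positive entry)."""
    n = len(d_plus)
    return {(i, j) for i in range(n) for j in range(n) if d_plus[i][j] > 0}


def transitive_reduction(edges, n):
    """Keep i -> j only when no k lies between them (i -> k and k -> j)."""
    return sorted(
        (i, j) for (i, j) in edges
        if not any((i, k) in edges and (k, j) in edges for k in range(n))
    )


x = [1, 2, 3, 4, 5]
# Triangular matrix of positive differences, d_plus[i][j] = max(x[i] - x[j], 0).
d_plus = [[max(a - b, 0) for b in x] for a in x]

edges = dominance_edges(d_plus)
hierarchy = transitive_reduction(edges, len(x))

# Each pair is (index of dominant element, index of dominated element).
print(hierarchy)
```

For the example vector, every element dominates all smaller ones, so the reduced graph is a single chain linking each element to the next smaller one.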
Andrae, von (1872). Über die Bestimmung des wahrscheinlichen Fehlers durch die gegebenen Differenzen vom gleich genauen Beobachtungen einer Unbekannten. Astronomische Nachrichten, vol. 84.
Garner, W.R., & McGill, W.J. (1956). The relation between information and variance analysis. Psychometrika, 21, 219-228.
Guttman, L. (1946). An approach for quantifying paired comparisons and rank order. Annals of Mathematical Statistics, 17, 144-163.
Helmert, F.R. (1876). Die Berechnung des wahrscheinlichen Beobachtungsfehlers aus den ersten Potenzen der Differenzen gleichgenauer directer Beobachtungen. Astronomische Nachrichten, vol. 88.
Horst, P. (1963). Matrix algebra for social scientists.
Krus, D.J. (1977). Order analysis: An inferential model of dimensional analysis and scaling. Educational and Psychological Measurement, 37, 587-601.
Krus, D.J., & Bart, W.M. (1974). An ordering-theoretic method of multidimensional scaling of items. Educational and Psychological Measurement, 34, 525-535.
Krus, D.J., & Wilkinson, S.M. (1986). Matrix differencing as a concise expression of test variance. Educational and Psychological Measurement, 46, 179-183.
Miller, G.A., & Madow, W.G. (1954). On the maximum likelihood estimate of the Shannon-Wiener measure of information.