Liang, K.H., Krus, D.J., & Webb, J.M. (1995) K-fold crossvalidation in canonical analysis. Multivariate
Behavioral Research, 30, 539-545.
K-Fold Cross Validation in Canonical Analysis
Kun-Hsia Liang, David J. Krus, and J. M. Webb
A computer-assisted, k-fold cross validation algorithm is discussed within the framework of canonical correlation analysis of random data. Results of the analysis suggest that this algorithm effectively reduces the contamination of canonical variates and canonical correlations by the sample-specific variance components.
Hotelling's (1935, 1936) canonical analysis, like many other methods of the general linear model, is not exempt from the problem of sample-specific variance (Stevens, 1986, pp. 373-40; Thompson, 1991). In an attempt to minimize this variance, and due to the nature of this method, the need for a double or triple cross validation of canonical analysis has been suggested (Lee, McCabe & Graham, 1983; Thorndike & Weiss, 1983). The present study describes a multiple (k-fold) cross validation algorithm for canonical analysis and examines its effectiveness.
Description of the Algorithm
Once the cross-validated canonical correlations have been computed, they are normalized and fitted into a running composite. This procedure is repeated until the differences between successive iterations are negligible. At this point, the running composite of the cross-validated correlations is de-normalized, resulting in a set of cross-validated canonical correlations.
The Weight Interchange Process
The core operation of this multiple cross-validation process is the mutual exchange of canonical weights between the subsamples. In canonical analysis, the canonical variates are derived from the weighted composites of predictor and criterion variables; that is, the variate scores are composed of linear combinations of standard scores, each multiplied by its standardized weight. To illustrate this process, let A and B signify the subsamples of data, C and D represent the weighted canonical composites corresponding to sets A and B, and X and Y signify the standard scores of sets A and B. The canonical variate scores can then be written as
[Two equations not reproduced.]

The matrix of correlations, R, between these two canonical variates can be defined as

[Equation not reproduced.]

where the canonical correlations are located along the diagonal elements of R. The cross-validated canonical correlations for the set A then will be

[Equation not reproduced.]

The cross-validated canonical correlations for the set B then can be obtained by interchanging canonical weights with the set A.
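The weight interchange can be illustrated with a minimal NumPy sketch. It is not the authors' program: the function names (`standardize`, `inv_sqrt`, `canonical_weights`, `variate_correlations`), the whitening-plus-SVD solution of the canonical problem, and the 40-case example data are all assumptions made for illustration. Each subsample's standard scores are weighted with the other subsample's canonical weights, and the diagonal of the resulting variate correlation matrix is read off as the cross-validated canonical correlations.

```python
import numpy as np

def standardize(M):
    """Column-wise standard scores (sample standard deviation)."""
    return (M - M.mean(axis=0)) / M.std(axis=0, ddof=1)

def inv_sqrt(S):
    """Inverse symmetric square root of a positive-definite matrix."""
    vals, vecs = np.linalg.eigh(S)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

def canonical_weights(X, Y):
    """Canonical weights (Wx, Wy) and correlations r for standardized X, Y."""
    n = X.shape[0]
    Kx = inv_sqrt(X.T @ X / (n - 1))
    Ky = inv_sqrt(Y.T @ Y / (n - 1))
    Rxy = X.T @ Y / (n - 1)
    U, s, Vt = np.linalg.svd(Kx @ Rxy @ Ky, full_matrices=False)
    return Kx @ U, Ky @ Vt.T, s

def variate_correlations(X, Y, Wx, Wy):
    """Diagonal of the correlation matrix between variates X @ Wx and Y @ Wy."""
    C, D = X @ Wx, Y @ Wy
    m = C.shape[1]
    R = np.corrcoef(C, D, rowvar=False)[:m, m:]
    return np.diag(R)

# Hypothetical example data: 40 cases, 3 predictors and 3 criteria, two subsamples.
rng = np.random.default_rng(0)
data = rng.standard_normal((40, 6))
A, B = data[:20], data[20:]
XA, YA = standardize(A[:, :3]), standardize(A[:, 3:])
XB, YB = standardize(B[:, :3]), standardize(B[:, 3:])

WxA, WyA, rA = canonical_weights(XA, YA)
WxB, WyB, rB = canonical_weights(XB, YB)

# Interchange: score subsample A with B's weights, and B with A's weights.
rA_cv = variate_correlations(XA, YA, WxB, WyB)
rB_cv = variate_correlations(XB, YB, WxA, WyA)
print("A, own weights:      ", np.round(rA, 3))
print("A, weights from B:   ", np.round(rA_cv, 3))
print("B, weights from A:   ", np.round(rB_cv, 3))
```

Because the first canonical correlation is the largest correlation attainable by any pair of linear composites within a sample, the first cross-validated value can never exceed it in absolute size; with random data it is usually much smaller.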
Iteration and Normalization
After the first step in the k-fold cross-validation of canonical analysis, the cross-validated canonical correlations are normalized by a hyperbolic arctangent transformation

$$z = \tanh^{-1}(r) = \tfrac{1}{2}\ln\frac{1+r}{1-r}$$

to prevent distorted estimates of their magnitudes. In the course of subsequent iterations, the normalized cross-validated canonical correlations are fitted into the running composites, as

$$\bar{z}_{I+1} = \frac{I\,\bar{z}_I + z_{I+1}}{I+1}$$

and the standard deviations of this running composite are computed as

[Equation not reproduced.]

where $I$ is the number of iterations completed, $\bar{z}_I$ is an initial mean, $z_{I+1}$ is a new cross-validated and normalized canonical correlation, and $\bar{z}_{I+1}$ is an updated mean of the cross-validated and normalized canonical correlations (Krus and Ceurvorst, 1978, pp. 816-817). The iterations continue until a prespecified number of iterations is reached, or until the change in the running composite between successive iterations falls below a small, arbitrarily defined constant. At the termination of the iteration process, the resulting multiply cross-validated, normalized composite $\bar{z}$ is translated back as

$$r = \tanh(\bar{z})$$

where $r$ is the multiple cross-validated canonical correlation.
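The iteration and normalization step can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the procedure of Krus and Ceurvorst (1978): the function name `running_fisher_mean`, the tolerance `tol`, and the use of Welford's running update for the standard deviation are choices made here, not taken from the article. Each incoming correlation is transformed by the hyperbolic arctangent, folded into a running mean, and the final mean is transformed back by the hyperbolic tangent.

```python
import math

def running_fisher_mean(correlations, tol=1e-4, max_iter=1000):
    """Average correlations in hyperbolic-arctangent (z) space.

    Stops when the running mean changes by less than `tol` or after
    `max_iter` values. Returns (r_bar, sd_z, iterations_used).
    """
    mean = 0.0   # running mean of the normalized correlations
    m2 = 0.0     # running sum of squared deviations (Welford's update)
    count = 0
    for r in correlations:
        z = math.atanh(r)            # normalize
        count += 1
        delta = z - mean
        mean += delta / count        # same as ((count - 1) * old_mean + z) / count
        m2 += delta * (z - mean)
        if count > 1 and abs(delta / count) < tol:
            break                    # successive composites differ negligibly
        if count >= max_iter:
            break
    sd = math.sqrt(m2 / (count - 1)) if count > 1 else 0.0
    return math.tanh(mean), sd, count  # de-normalize the composite

# Hypothetical cross-validated correlations from successive iterations.
r_bar, sd_z, used = running_fisher_mean([0.10, -0.05, 0.20, 0.00], tol=0.0)
print(round(r_bar, 4), round(sd_z, 4), used)
```

Averaging in the transformed space rather than averaging the raw correlations is the point of the normalization: the sampling distribution of a correlation is skewed away from zero, and the transformed values are closer to symmetric.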
Analysis of Random Data
A set of normally distributed random variables (k = 6 variables, n = 20 observations) was generated and analyzed by canonical analysis. One-half of the random variables were included in the predictor set of variables; the other half defined the criterion set. The initial and cross-validated solutions of this random data matrix are presented in Table 1.
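An experiment of this kind can be approximated with the following self-contained sketch. Several details are assumptions, since the article does not give them in this copy: the data are split at random into two halves on every iteration, a fixed number of iterations (`n_iter`) is used instead of a convergence test, and the helper names (`zscore`, `inv_sqrt`, `cca`, `cross_r`, `kfold_canonical`) are hypothetical. The numbers it prints will not reproduce Table 1, because the random data differ.

```python
import numpy as np

def zscore(M):
    """Column-wise standard scores."""
    return (M - M.mean(axis=0)) / M.std(axis=0, ddof=1)

def inv_sqrt(S):
    """Inverse symmetric square root of a positive-definite matrix."""
    vals, vecs = np.linalg.eigh(S)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

def cca(X, Y):
    """Canonical weights and correlations for standardized X and Y."""
    n = X.shape[0]
    Kx = inv_sqrt(X.T @ X / (n - 1))
    Ky = inv_sqrt(Y.T @ Y / (n - 1))
    U, s, Vt = np.linalg.svd(Kx @ (X.T @ Y / (n - 1)) @ Ky, full_matrices=False)
    return Kx @ U, Ky @ Vt.T, s

def cross_r(X, Y, Wx, Wy):
    """Correlations between matching variates formed with the given weights."""
    C, D = X @ Wx, Y @ Wy
    m = C.shape[1]
    return np.diag(np.corrcoef(C, D, rowvar=False)[:m, m:])

def kfold_canonical(data, n_pred, n_iter=200, seed=0):
    """Average interchanged-weight correlations over repeated random halves."""
    rng = np.random.default_rng(seed)
    n, k = data.shape
    z_sum = np.zeros(min(n_pred, k - n_pred))
    count = 0
    for _ in range(n_iter):
        idx = rng.permutation(n)                 # assumed: fresh random split
        a, b = idx[: n // 2], idx[n // 2:]
        XA, YA = zscore(data[a, :n_pred]), zscore(data[a, n_pred:])
        XB, YB = zscore(data[b, :n_pred]), zscore(data[b, n_pred:])
        WxA, WyA, _ = cca(XA, YA)
        WxB, WyB, _ = cca(XB, YB)
        for r in (cross_r(XA, YA, WxB, WyB), cross_r(XB, YB, WxA, WyA)):
            z_sum += np.arctanh(np.clip(r, -0.999999, 0.999999))  # normalize
            count += 1
    return np.tanh(z_sum / count)                # de-normalize the composite

# Six random normal variables, twenty observations, three variables per set.
rng = np.random.default_rng(42)
data = rng.standard_normal((20, 6))
_, _, r_initial = cca(zscore(data[:, :3]), zscore(data[:, 3:]))
r_cv = kfold_canonical(data, n_pred=3)
print("initial r:        ", np.round(r_initial, 3))
print("cross-validated r:", np.round(r_cv, 3))
```

With only ten cases per half, the within-half canonical correlations are strongly inflated, which is the condition under which averaging the interchanged-weight correlations is expected to pull the estimates toward zero.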
Table 1. Canonical analysis and multiply cross-validated canonical analysis of random data.
| Eigenvalue | df | Initial λ | Initial r | Initial (unlabeled) | Initial p | Crossvalidated λ | Crossvalidated r | Crossvalidated (unlabeled) | Crossvalidated p |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 9 | .522 | .722 | .116 | .166 | .006 | -.077 | .105 | .999 |
| 2 | 4 | .093 | .305 | .825 | .825 | .000 | .020 | .012 | .999 |
| 3 | 1 | .000 | .012 | .961 | .961 | .000 | .019 | .006 | .999 |

For the random data used in the analysis, the eigenvalues of the multiply cross-validated canonical analysis shrank from .522, .093, and .000 to .006, .000, and .000 (Table 1), closely approximating their true value of zero.
References
Hotelling, H. (1935) The most predictable criterion. Journal of Educational Psychology, 26, 139-142.
Hotelling, H. (1936) Relations between two sets of variates. Biometrika, 28, 321-377.
Lee, R., McCabe, D.J., & Graham, W. K. (1983) Multivariate relationships between job characteristics and job satisfaction in the public sector: a triple cross validation study. Multivariate Behavioral Research, 18, 47-62.
Stevens, J. (1986) Applied multivariate statistics for the social sciences.
Thompson, B. (1991) Invariance of multivariate results: a
Thorndike, R. M., & Weiss, D. J. (1983) An empirical investigation of step-down canonical correlation with cross validation. Multivariate Behavioral Research, 18, 183-196.