Liang, K.H., Krus, D.J., & Webb, J.M. (1995) K-fold crossvalidation in canonical analysis. Multivariate Behavioral Research, 30, 539-545.
 

K-Fold Cross Validation in Canonical Analysis


Kun-Hsia Liang, David J. Krus, and James M. Webb
Arizona State University
 

A computer-assisted, k-fold cross validation algorithm is discussed within the framework of canonical correlation analysis of random data. Results of the analysis suggest that this algorithm effectively reduces the contamination of canonical variates and canonical correlations by the sample-specific variance components.

Hotelling's (1935, 1936) canonical analysis, like many other methods of the general linear model, is not exempt from the problem of sample-specific variance (Stevens, 1986, pp. 373-40; Thompson, 1991). To minimize this variance, a double or triple cross validation of canonical analysis has been suggested (Lee, McCabe & Graham, 1983; Thorndike & Weiss, 1983). The present study describes a multiple (k-fold) cross validation algorithm for canonical analysis and examines its effectiveness.

Description of the Algorithm

The suggested procedure for the k-fold cross validation of canonical analysis begins with the random division of the data set into two subsamples. The canonical weights of each subsample are then extracted and exchanged. The cross-validated canonical correlations are computed as correlations between the obtained and the predicted canonical variates, where the predicted variates are derived by applying the canonical weights obtained in one subsample to the data of the other.

Once computed, the cross-validated canonical correlations are normalized and fitted into a running composite. The procedure described above is repeated until the differences between successive iterations are negligible. At this point, the running composite of the cross-validated correlations is de-normalized, resulting in a set of cross-validated canonical correlations.


The Weight Interchange Process
 

The core operation of this multiple cross-validation process is the mutual exchange of canonical weights between the subsamples. In canonical analysis, the canonical variates are derived from the weighted composites of predictor and criterion variables; that is, the variate scores are composed of linear combinations of standard scores, each multiplied by its standardized weight. To illustrate this process, let A and B signify the subsamples of data, C and D represent the weighted canonical composites corresponding to sets A and B, and X and Y signify the standard scores of sets A and B. The canonical variate scores can then be written as

and

 


The matrix of correlations, R, between these two canonical variates can be defined as

 

 


where the canonical correlations lie along the diagonal of R. The cross-validated canonical correlations for set A will then be
 

 


The cross-validated canonical correlations for the set B then can be obtained by interchanging canonical weights with the set A.
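The weight interchange described above can be sketched in Python with NumPy. This is a minimal illustration under one reading of the text, not the authors' program: the weights extracted from one subsample are applied to the standard scores of the other, and the resulting predictor and criterion variates are correlated pair by pair. The function names (`standardize`, `inv_sqrt`, `canonical_weights`, `crossvalidated_r`), the sample sizes, and the random seed are assumptions made for the example.

```python
import numpy as np

def standardize(M):
    """Column-wise standard scores (sample standard deviation)."""
    return (M - M.mean(axis=0)) / M.std(axis=0, ddof=1)

def inv_sqrt(S):
    """Inverse symmetric square root of a positive-definite matrix."""
    vals, vecs = np.linalg.eigh(S)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

def canonical_weights(X, Y):
    """Return predictor weights, criterion weights, and canonical correlations."""
    X, Y = standardize(X), standardize(Y)
    n = X.shape[0]
    Rxx, Ryy, Rxy = X.T @ X / (n - 1), Y.T @ Y / (n - 1), X.T @ Y / (n - 1)
    Kx, Ky = inv_sqrt(Rxx), inv_sqrt(Ryy)
    U, r, Vt = np.linalg.svd(Kx @ Rxy @ Ky, full_matrices=False)
    return Kx @ U, Ky @ Vt.T, r

def crossvalidated_r(X, Y, C, D):
    """Correlate the variates formed from (X, Y) using weights (C, D)
    that were extracted from a different subsample."""
    U, V = standardize(X) @ C, standardize(Y) @ D
    return np.array([np.corrcoef(U[:, j], V[:, j])[0, 1]
                     for j in range(U.shape[1])])

# Hypothetical data: 40 cases, 3 predictors, 3 criteria, split at random.
rng = np.random.default_rng(0)
X, Y = rng.standard_normal((40, 3)), rng.standard_normal((40, 3))
idx = rng.permutation(40)
a, b = idx[:20], idx[20:]

C_a, D_a, r_a = canonical_weights(X[a], Y[a])
C_b, D_b, r_b = canonical_weights(X[b], Y[b])

r_cv_a = crossvalidated_r(X[a], Y[a], C_b, D_b)  # set A scored with B's weights
r_cv_b = crossvalidated_r(X[b], Y[b], C_a, D_a)  # set B scored with A's weights
```

Applying a subsample's own weights to its own scores reproduces its initial canonical correlations, so any drop observed in `r_cv_a` or `r_cv_b` comes from the exchange of weights alone.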
 

Iteration and Normalization

After the first step in the k-fold cross-validation of canonical analysis, the cross-validated canonical correlations, r, are normalized by a hyperbolic arctangent transformation

$$z = \tanh^{-1}(r) = \frac{1}{2}\ln\frac{1+r}{1-r}$$

so that averaging them does not distort the estimates of their magnitudes. In the course of subsequent iterations, the normalized cross-validated canonical correlations are fitted into the running composites, as

 

 
and the standard deviations of this running composite are computed as


 

 


where I is the number of iterations completed, and each update combines the current mean with a newly cross-validated and normalized canonical correlation to yield an updated mean of the cross-validated and normalized canonical correlations (Krus and Ceurvorst, 1978, pp. 816-817). The iterations continue until a prespecified number of iterations is reached, or until successive values of the running composite differ by less than a small, arbitrarily defined constant. At the termination of the iteration process, the resulting multiply cross-validated, normalized composite is translated back as

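
$$r_{cv} = \tanh(\bar{z})$$

where $r_{cv}$ is the multiple cross-validated canonical correlation and $\bar{z}$ is the final value of the normalized running composite.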
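The normalization, running composite, and back-transformation can be sketched as follows. This is an illustration only: the running mean and standard deviation use Welford's incremental update, which is swapped in here and may differ in form from the Krus and Ceurvorst (1978) formulas. The class name `RunningComposite`, the helper `composite_until_stable`, and the defaults `eps` and `max_iter` are assumptions made for the example.

```python
import math

class RunningComposite:
    """Running mean and standard deviation of normalized correlations."""

    def __init__(self):
        self.count = 0      # number of iterations completed
        self.mean = 0.0     # running mean of the normalized values
        self._m2 = 0.0      # running sum of squared deviations

    def update(self, r):
        """Add one cross-validated correlation; return the change in the mean."""
        z = math.atanh(r)                  # hyperbolic arctangent normalization
        self.count += 1
        delta = z - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (z - self.mean)
        return abs(delta / self.count)

    @property
    def sd(self):
        """Sample standard deviation of the normalized values."""
        if self.count < 2:
            return 0.0
        return math.sqrt(self._m2 / (self.count - 1))

    def correlation(self):
        """De-normalize the running mean back to the correlation metric."""
        return math.tanh(self.mean)

def composite_until_stable(correlations, eps=1e-4, max_iter=1000):
    """Feed correlations until the mean changes by less than eps
    or max_iter values have been used."""
    comp = RunningComposite()
    for i, r in enumerate(correlations, start=1):
        change = comp.update(r)
        if i >= max_iter or (i > 1 and change < eps):
            break
    return comp
```

In practice, `correlations` would be the stream of cross-validated canonical correlations for one canonical root, one value per random split. A single small change can occur by chance, so a stricter rule (for example, several consecutive small changes) may be preferable to the simple stopping test sketched here.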
 

Analysis of Random Data


A set of six normally distributed random variables with 20 observations each (k = 6, n = 20) was generated and submitted to canonical analysis. Three of the random variables formed the predictor set and the other three defined the criterion set. The initial and cross-validated solutions of this random data matrix are presented in Table 1.
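A comparable random data set can be generated and analyzed as in the sketch below. It is not the data set reported here: the seed is arbitrary, the function name `canonical_correlations` is an assumption, and the canonical correlations are obtained from the singular values of the product of the orthonormal bases of the two centered variable sets, one standard computational route among several.

```python
import numpy as np

def canonical_correlations(X, Y):
    """Canonical correlations between the column sets X and Y."""
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    Qx, _ = np.linalg.qr(Xc)   # orthonormal basis of the predictor space
    Qy, _ = np.linalg.qr(Yc)   # orthonormal basis of the criterion space
    return np.linalg.svd(Qx.T @ Qy, compute_uv=False)

rng = np.random.default_rng(1995)       # arbitrary seed
data = rng.standard_normal((20, 6))     # n = 20 cases, k = 6 variables
predictors, criteria = data[:, :3], data[:, 3:]

r = canonical_correlations(predictors, criteria)
eigenvalues = r ** 2                    # squared canonical correlations
print(np.round(r, 3), np.round(eigenvalues, 3))
```

Because the population canonical correlations of independent random variables are zero, any sizeable value of `r[0]` obtained this way reflects sample-specific variance.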

Table 1. Canonical analysis and multiply cross-validated canonical analysis of random data.

                          Initial Solution            Crossvalidated Solution
Eigenvalue      df       λ       r               p       λ       r               p
         1       9    .522    .722    .116    .166    .006   -.077    .105    .999
         2       4    .093    .305    .825    .825    .000    .020    .012    .999
         3       1    .000    .012    .961    .961    .000    .019    .006    .999


For the random data used in the analysis, the eigenvalues of the multiply cross-validated canonical analysis shrank substantially (from .522, .093, and .000 to .006, .000, and .000), approaching zero, their true value for random data.

References
 

Hotelling, H. (1935) The most predictable criterion. Journal of Educational Psychology, 26, 139-142.

Hotelling, H. (1936) Relations between two sets of variates. Biometrika, 28, 321-377.

Lee, R., McCabe, D. J., & Graham, W. K. (1983) Multivariate relationships between job characteristics and job satisfaction in the public sector: a triple cross validation study. Multivariate Behavioral Research, 18, 47-62.

Stevens, J. (1986) Applied multivariate statistics for the social sciences. Hillsdale, NJ: Lawrence Erlbaum.

Thompson, B. (1991) Invariance of multivariate results: a Monte Carlo study of canonical function and structure coefficients. Journal of Experimental Education, 59, 367-382.

Thorndike, R. M., & Weiss, D. J. (1983) An empirical investigation of step-down canonical correlation with cross validation. Multivariate Behavioral Research, 18, 183-196.