Visual Statistics Studio ♦ Outline of Visual Statistics ♦ Library |
COMPUTER ASSISTED MULTICROSSVALIDATION IN
REGRESSION ANALYSIS
David J. Krus and
An algorithm for multicrossvalidation of multiple regression analysis is described and compared with analytical “formulae” methods for estimating the shrinkage of multiple correlation coefficients, within the framework of amorphous and pre-structured data sets.
The desirability of cross validation has been repeatedly stressed. As Tatsuoka (1969, p. 26) commented,
“a careful researcher would not put much stock in a regression equation until he has cross-validated it on a sample other than the one on which it was based.”
The literature on multiple regression analysis is replete with caveats pertaining to the tendency of the least-squares solutions to capitalize on chance. As observed by Weiss (1976, p. 333),
“when regression weights developed on one group are applied to data from another group, the multiple correlation usually ‘shrinks’, in some cases to non-significant values.”
The pervasive nature of these warnings was captured by Cooley and Lohnes (1971, p. 57) who reiterated that
“we will not repeat this dictum ad nauseam, but we believe that the researcher who wants his linear components taken seriously must either base them on very large and very representative samples, or demonstrate their validity on replication samples.”
Despite perennial exhortations, cross validated multiple correlations are not frequently used. The foremost reason for this continuing reluctance to report cross validated R’s routinely is perhaps that cross validation is tedious. Two groups of subjects must be utilized. Regression weights developed on the screening group have to be applied to the calibration sample. A product-moment correlation coefficient between the criterion variable and the predicted criterion variable in the calibration sample has to be computed. This ordinary correlation coefficient is the cross validated multiple R.
A logical extension of this procedure is to compute the regression weights for both samples and to apply them to the predictor variables of the other sample, as suggested by Mosier (1951). Unfortunately, this procedure results in two cross validated multiple R’s, often of quite discrepant values.
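The single and double cross validation steps described above can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function names (`fit_weights`, `cross_validated_r`, `double_cross_validation`) and the simulated data are assumptions introduced here, not part of the original study.

```python
import numpy as np

def fit_weights(X, y):
    """Least-squares regression weights (intercept first) for predictors X and criterion y."""
    design = np.column_stack([np.ones(len(X)), X])
    weights, *_ = np.linalg.lstsq(design, y, rcond=None)
    return weights

def cross_validated_r(X_fit, y_fit, X_new, y_new):
    """Apply weights developed on one sample to another sample and return the
    product-moment correlation between the criterion and the predicted criterion."""
    weights = fit_weights(X_fit, y_fit)
    predicted = np.column_stack([np.ones(len(X_new)), X_new]) @ weights
    return np.corrcoef(y_new, predicted)[0, 1]

def double_cross_validation(X_a, y_a, X_b, y_b):
    """Interchange the weights of two samples; returns the two cross validated R's."""
    return (cross_validated_r(X_a, y_a, X_b, y_b),
            cross_validated_r(X_b, y_b, X_a, y_a))

# Hypothetical data: 200 subjects, 3 predictors, a noisy linear criterion.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([0.5, -0.3, 0.2]) + rng.normal(size=200)

r_ab, r_ba = double_cross_validation(X[:100], y[:100], X[100:], y[100:])
print(round(r_ab, 3), round(r_ba, 3))  # two values that generally differ
```

The two printed coefficients illustrate the reporting problem: the procedure yields a pair of cross validated R's rather than a single summary value.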
There is no consensus concerning how to report the results of double cross validation. Should only one cross validated R be reported, or both? If only one is reported, which one? Kerlinger and Pedhazur (1973, p. 284) suggested that one should
“study the differences between the R’s as well as the differences in the two regression equations. If the results are close, one may combine the samples and calculate the regression equation to be used in prediction.”
The question naturally arises of how close is close, and of what to do if the coefficients are discrepant. Again, the suggested procedure is equivocal.
The main objections to the empirical cross validation and double cross validation techniques are those stressing the superiority of the analytical formulae for estimation of the degree to which the empirical multiple R is spuriously inflated. Horst (1966, p. 379), for example, argued that
“analytical instead of empirical procedures should be used for the estimation of the shrinkage of the multiple R. As the analytical solutions permit all observations to be used in the estimation at once, this procedure obviates the necessity of splitting the sample of subjects into developmental (screening) and cross validation (calibration) samples. Because analytical methods leave sample size intact, one should expect ipso facto that the resulting estimate will be more precise.”
This position was recently reiterated by Cattin (1980, p. 407) who maintained that
“the advantages [of analytical formulae for estimation of the predictive power of a regression model] over cross validation are that they are less cumbersome to use and that they produce more precise estimates.”
However, the opinion that the analytical solution is preferable to the empirical is not universally shared. Lord and Novick (1968, pp. 334-335) have observed that
“satisfactory methods for predicting the degree of shrinkage in cross validation are not currently available.”
Also, after discussing formulae for correcting the overestimation of R, Kerlinger and Pedhazur (1973, p. 283) concluded that
“probably the best method for estimating the degree of shrinkage is to perform a cross validation.”
In the authors’ experience, the preference for either empirical cross validation or the analytical estimation of the expected amount of “shrinkage” is determined more by the theoretical orientation of the researcher than by factual evidence favoring one or the other approach. To use Cronbach’s (1957) terminology, experimentalists tend to prefer the analytical “formulae” approach, whereas correlationists tend to favor either cross validation or double cross validation methods.
The purpose of the present paper was to discuss an extension of Mosier’s (1951) technique of double cross validation. The name multicrossvalidation is used to reflect the repeated applications of the double cross validation technique to randomly selected subsamples of data. A related purpose was to provide for an empirical validation of the relative merits of the empirical vs. analytical methods of shrinkage estimation. The rigorous comparison of both approaches is possible, as the technique of multicrossvalidation, unlike its double cross validation precursor, returns a single cross-validated multiple R. This single, empirically obtained R can be compared with the value of the analytically obtained R by keeping the sample size and the number of predictor variables constant. Also, the method of multicrossvalidation is discussed within the framework of its computer implementation, as it is practically meaningless without the availability of a high-speed, large-memory computing device.
Description of the Multicrossvalidation Algorithm
The multicrossvalidation algorithm randomly splits the sample of subjects into two subsamples. The regression weights are calculated, interchanged, and used for prediction of the criterion variable from the data of the other sample. The cross validated multiple R’s are normalized by converting them to their hyperbolic arctangents:

$$z = \tanh^{-1} R = \tfrac{1}{2}\ln\frac{1+R}{1-R}$$
and the process is repeated. Running composites of the normalized R, here written $z$, and of its standard error are computed as follows: after each iteration 1, 2, …, i, …, n the mean $M_i$ of the normalized series of $z$ and its corresponding standard error $SE_i$ are computed, respectively, as

$$M_i = \frac{\sum z}{i}$$

and

$$SE_i = \sqrt{\frac{\sum z^2 - \left(\sum z\right)^2 / i}{i\,(i-1)}}$$

where $\sum z = \sum_{j=1}^{i} z_j$ and $\sum z^2 = \sum_{j=1}^{i} z_j^2$ are running sums (Krus and Ceurvorst, 1978, pp. 816-817). The iteration process is terminated after a pre-specified number of iterations is reached, or after the difference $|M_i - M_{i-1}|$ becomes smaller than an arbitrarily selected constant over a period of several iterations.
At the termination of the iteration process, the resulting multicrossvalidated normalized mean $M$ is translated back to its original form as

$$R = \tanh M$$

and printed out. This normalization-denormalization sequence follows the procedure suggested by Fisher (1970, p. 232).
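The steps of the algorithm can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' program: the function name `multicrossvalidate` and its parameters (`max_iter`, `tol`, `patience`, `seed`) are hypothetical, both cross validated R's from each split are assumed to enter the running series, the stopping rule is assumed to compare successive running means, and the standard error is computed in the normalized (hyperbolic arctangent) metric.

```python
import numpy as np

def _cross_r(X_fit, y_fit, X_new, y_new):
    """Correlation between the criterion and its prediction from interchanged weights."""
    design = np.column_stack([np.ones(len(X_fit)), X_fit])
    w, *_ = np.linalg.lstsq(design, y_fit, rcond=None)
    pred = np.column_stack([np.ones(len(X_new)), X_new]) @ w
    return np.corrcoef(y_new, pred)[0, 1]

def multicrossvalidate(X, y, max_iter=50, tol=None, patience=3, seed=0):
    """Repeat double cross validation on random half-splits (max_iter >= 1).

    Returns (multicrossvalidated R, standard error in the z metric, number of z values).
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    z_sum = z_sq_sum = 0.0          # running sums of z and z squared
    count, calm, prev_mean = 0, 0, None
    for _ in range(max_iter):
        order = rng.permutation(n)  # random split into two subsamples
        a, b = order[: n // 2], order[n // 2:]
        for fit, new in ((a, b), (b, a)):
            r = _cross_r(X[fit], y[fit], X[new], y[new])
            # Clip so that |r| = 1 does not map to an infinite z.
            z = np.arctanh(np.clip(r, -0.999999, 0.999999))
            count += 1
            z_sum += z
            z_sq_sum += z * z
        mean = z_sum / count
        if tol is not None and prev_mean is not None:
            # Assumed stopping rule: running mean stable for `patience` iterations.
            calm = calm + 1 if abs(mean - prev_mean) < tol else 0
            if calm >= patience:
                break
        prev_mean = mean
    variance = max(z_sq_sum - z_sum ** 2 / count, 0.0) / (count - 1)
    return np.tanh(mean), np.sqrt(variance / count), count

# Hypothetical random data: 100 subjects, first column as criterion, 19 predictors.
noise = np.random.default_rng(42).normal(size=(100, 20))
r_random, se_random, n_values = multicrossvalidate(noise[:, 1:], noise[:, 0])
print(round(r_random, 3), round(se_random, 3), n_values)
```

Because the random splits reuse the same subjects, the values in the series are not independent, so the standard error computed here is best read as a descriptive index of the stability of the composite rather than as a strict sampling error.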
Validation of the Multicrossvalidation Algorithm
Data matrices can be placed on a continuum from random to determined. Both extremes of this continuum were utilized in the present study. The random pole was represented by a 100 by 20 matrix of random numbers. These numbers were normalized by the

$$z = \frac{\sum_{i=1}^{k} X_i - k/2}{\sqrt{k/12}}$$

transformation, where X is a uniformly distributed random number and k is a constant, usually set to 12.
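A matrix of this kind can be generated as sketched below, assuming the familiar sum-of-uniforms approximation in which k uniform numbers on (0, 1) are summed, centered at k/2, and scaled by the square root of k/12. The function name `approx_normal` and the seed are illustrative assumptions.

```python
import numpy as np

def approx_normal(shape, k=12, seed=0):
    """Approximately standard normal deviates from sums of k uniform(0, 1) numbers."""
    rng = np.random.default_rng(seed)
    u = rng.random(size=(*shape, k))     # k uniform numbers per matrix cell
    return (u.sum(axis=-1) - k / 2) / np.sqrt(k / 12)

data = approx_normal((100, 20))          # 100 by 20 matrix, as in the random pole
print(data.shape, round(data.mean(), 2), round(data.std(), 2))
```

With k = 12 the scaling factor equals one, so the transformation reduces to the sum of twelve uniform numbers minus six.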
To approximate the determined pole of the continuum, Thurstone’s (1947, pp. 140-143) prestructured data set was elected. This set of data consisted of physical measurements of 20 cubic objects comprising the well known “Thurstone’s box problem.” The criterion variable was the volume of these cubes, computed as the product of their respective length, width, and height dimensions.
The selection of both data sets was based on the assumption that the discussed algorithm should correctly return the value of the multicrossvalidated R as equal to zero in the case of the random data set and show no shrinkage in the case of determined data. The number of iterations was set at 50 for both experiments to improve comparability of results.
For the random data, the multiple regression coefficient was .462. The multicrossvalidated R was .062, a close approximation of the expected zero value. The standard error of the multicrossvalidated R was .118. When Thurstone’s “box problem” data were used in the prediction situation, the original multiple regression coefficient was .971. The multicrossvalidated R was .965 with a standard error of .015. As contrasted with shrinkage of .400 for multiple R on the random data set, the shrinkage of the determined data R was only .006.
The multicrossvalidation algorithm was also compared with the analytical methods, specifically with Wherry’s (1931) correction for shrinkage

$$\hat{R}^2 = 1 - (1 - R^2)\,\frac{N-1}{N-p-1}$$

and the approximation of Olkin and Pratt’s (1958) unbiased estimate of the squared multiple correlation (cf. Lord and Novick, 1968, p. 286)

$$\hat{\rho}^2 = 1 - \frac{N-3}{N-p-1}\,(1-R^2)\left[1 + \frac{2\,(1-R^2)}{N-p+1}\right]$$

where p is the number of predictors and N is the size of the sample. For Thurstone’s box problem the analytical corrections and the multicrossvalidation procedure produced comparable results. Wherry’s $\hat{R}$ was .965, Olkin and Pratt’s unbiased estimator of $\rho$ was .969, and the multicrossvalidated R was .965. However, for the random data set, Wherry’s correction returned a spuriously inflated value of .165, and Olkin and Pratt’s formula estimated $\rho$ at .169. Only the multicrossvalidated R (.062) closely approximated the expected value of zero. The above observations are summarized in Table 1.
Table 1. Results of the validation study on the prestructured and random data sets.

| | Prestructured Data: Expected | Prestructured Data: Obtained | Random Data: Expected | Random Data: Obtained |
| --- | --- | --- | --- | --- |
| Multiple Regression R | .999 | .971 | .000 | .462 |
| Wherry's Correction for Shrinkage of R | .999 | .965 | .000 | .165 |
| Olkin & Pratt's Unbiased Estimate of R | .999 | .969 | .000 | .169 |
| Multicrossvalidated R | .999 | .965 | .000 | .062 |
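The analytical entries of Table 1 can be checked approximately with the sketch below. It assumes the commonly cited forms of the two corrections: Wherry's adjustment multiplies 1 − R² by (N − 1)/(N − p − 1), and the Olkin and Pratt approximation multiplies 1 − R² by (N − 3)/(N − p − 1) and by 1 + 2(1 − R²)/(N − p + 1). The predictor counts used here (p = 3 with N = 20 for the box problem, p = 19 with N = 100 for the random data) are inferred from the description of the data sets and are assumptions, as are the function names.

```python
import math

def wherry_r(r, n, p):
    """Shrunken multiple R from Wherry's correction (negative estimates set to zero)."""
    r2_adj = 1 - (1 - r ** 2) * (n - 1) / (n - p - 1)
    return math.sqrt(max(r2_adj, 0.0))

def olkin_pratt_r(r, n, p):
    """Square root of the approximate Olkin-Pratt estimate of the squared multiple correlation."""
    u = 1 - r ** 2
    rho2 = 1 - (n - 3) / (n - p - 1) * u * (1 + 2 * u / (n - p + 1))
    return math.sqrt(max(rho2, 0.0))

# Box problem (assumed N = 20, p = 3) and random data (assumed N = 100, p = 19).
print(round(wherry_r(0.971, 20, 3), 3), round(olkin_pratt_r(0.971, 20, 3), 3))
print(round(wherry_r(0.462, 100, 19), 3), round(olkin_pratt_r(0.462, 100, 19), 3))
```

Under these assumptions the box problem values agree with the tabled .965 and .969, and the random data values fall within a few thousandths of the tabled .165 and .169; the small gap is consistent with the published R of .462 having been rounded.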
Discussion
The traditional modus operandi of psychometrics has involved the structural approach without stochastic estimations. To introduce statistical inference into the multiple regression model, a researcher may choose to emphasize a theoretical probabilistic estimate of the extent to which the obtained data may be expected to yield a multiple R representative of the population parameter. In such a situation analytical formulae, typically based on the sample size and number of variables used for prediction, constitute the method of choice. Examples of this approach have appeared in the recent work of Browne (1975) and Rozeboom (1978).
Although the analytical methods work well with large data sets, research in education and psychology frequently lacks the large sample sizes required for this approach. For smaller data sets, the empirical rather than the analytical approach to crossvalidation of multiple regression coefficients, as typified by recent work of Gollob (1967) and Drehmer and Morris (1981), appears to be more viable.
The multicrossvalidation algorithm offers the researcher the opportunity to describe the actual measurements more precisely, a description which is based on a more thorough analysis of the obtained data, rather than on sample size and number of variables. Even though the multicrossvalidation algorithm typically requires large amounts of computer time, this seeming disadvantage should not be objectionable when subsequent decisions, based on information provided by multiple regression, are to be used for selection, diagnosis, or prognosis of real life events. Considered within this pragmatic framework, the technique of multicrossvalidation should be a routine addition to extant programs for multiple regression analysis.
REFERENCES
Browne, M. W. Predictive validity of a linear regression equation. British Journal of Mathematical and Statistical Psychology, 1975, 28, 79-87.
Cattin, P. Estimation of the predictive power of a regression model. Journal of Applied Psychology, 1980, 63, 407-414.
Cooley, W. W. and Lohnes, P. R. Multivariate data analysis.
Cronbach, L. J. The two disciplines of scientific psychology. American Psychologist, 1957, 12, 671-684.
Drehmer, D. E. and Morris, G. W. Cross validation with small samples: An algorithm for computing Gollob’s estimator. Educational and Psychological Measurement, 1981, 41, 195-200.
Fisher, R. A. Statistical methods for research workers. (14th edition). New York: Hafner, 1970.
Gollob, H. F. Cross validation using samples of size one. Paper presented at the American Psychological Association meeting in
Horst, P. An overview of the essentials of multivariate analysis methods. In R. B. Cattell (Ed.) Handbook of multivariate experimental psychology.
Kerlinger, F. N. and Pedhazur, E. J. Multiple regression in behavioral research.
Krus, D. J. and Ceurvorst, R. W. Computer assisted construction of variable norms. Educational and Psychological Measurement, 1978, 38, 815-818.
Lord, F. M. and Novick, M. R. Statistical theories of mental test scores.
Mosier, C. I. Problems and designs of crossvalidation. Educational and Psychological Measurement, 1951, 11, 5-11.
Olkin,
Rozeboom, W. W. The estimation of crossvalidated multiple correlation: A clarification. Psychological Bulletin, 1978, 85, 1348-1351.
Tatsuoka, M. M. The use of multiple regression equations.
Thurstone, L. L. Multiple factor analysis.
Weiss, D. J. Multivariate procedure. In M. D. Dunnette (Ed.) Handbook of industrial and organizational psychology.
Wherry, R. J. A new formula for predicting the shrinkage of the coefficient of multiple correlation. Annals of Mathematical Statistics, 1931, 2, 440-451.