Regression Analysis

Once a line is positioned in Cartesian space, knowing the value of a variable X allows unequivocal determination of the corresponding value of variable Y. The relationship between X and Y is perfect, so errorless linear transformations between the two variables are possible. This situation is typical in the physical sciences, where knowing one variable and its relationship to another lets us ascertain, with a high degree of certainty, how changes in one variable affect the state of the other.

Imperfectly Related Variables

In the social sciences, relationships between two variables or events are usually far from perfect. When manifestations of two imperfectly related events are plotted, the result is not a line, but a scatter-plot. If the relationship between these two variables is approximately linear, it takes on the form of an ellipse, as schematically depicted in the figures below.

 

If the ellipse is relatively narrow (i.e., the degree of the relationship between the two variables is high), it is relatively easy to position a line through these points so that it approximates the measured relationship. However, as the degree of the relationship diminishes and the scatter-plot comes to resemble a circle more than an ellipse, it becomes increasingly difficult to position a line unequivocally.

Ideal and Actual Loci

Consider the problem of optimally positioning a line through the swarm of points that are presented in the following table.

 

 

 

To visualize the problem of positioning a line of best fit through the scatter-plot of actual data points, these data were plotted in the subsequent figure. The line of best fit to data points is positioned only tentatively. Since the plot is in the standard scores, we know that the line must go through the origin. However, the slope of the line, signified by the Greek letter beta is, so far, unknown.

Since the relationship between variables X and Y is not perfect, it is necessary to distinguish between the actual locations of the values of the Y variable and the idealized locations of these values on a line of best fit to be positioned through the scatter-plot. To distinguish between the actual and idealized locations, let us adopt the following notational convention. A prime mark will signify variables containing the idealized locations, and a hat mark will signify variables containing the distances between the actual and idealized locations.

Criterion of Least Squares

The problem is to determine the slope of the prediction line. As the slope changes, so do the distances between the data points and the vertical projections of their actual locations onto their idealized locations on the line of best fit. Using formal notation, it is possible to write these distances as

$$z_{\hat{y}} = z_y - z_{y'}$$

To satisfy the criterion of least squares, frequently credited to Carl Friedrich Gauss but with publication priority held by Adrien-Marie Legendre, these distances must be kept at a minimum. To avoid negative numbers, a qualification is added ‘to keep the squared distances between the idealized and the actual loci at a minimum.’ To form a statistical index, an additional qualification is added ‘to keep the mean of the squared distances between the idealized and the actual loci at a minimum.’ Using algebraic notation, the statistical criterion of least squares is defined as

$$s^2_{z_{\hat{y}}} = \frac{\sum (z_y - z_{y'})^2}{n} = \text{minimum}$$

The line of best fit is also called the regression line, defined by the idealized loci. Distances between the idealized and actual loci define the error variance. The above equation defines the variance of the error scores. The criterion of the least squares can also be phrased as determining the location of the regression line so as to keep the variance of the error scores at a minimum.
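The criterion can be tried out numerically. The following is a minimal sketch, not a definitive implementation: the raw scores are hypothetical values invented for illustration, the function names `standardize` and `error_variance` are assumptions of this sketch, and the idealized locus is taken as the slope times the standard score of X, since in standard scores the line passes through the origin.

```python
from math import sqrt

def standardize(values):
    """Convert raw scores to standard scores (population form, dividing by n)."""
    n = len(values)
    mean = sum(values) / n
    sd = sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]

def error_variance(zx, zy, beta):
    """Mean squared vertical distance between the actual and idealized loci."""
    n = len(zx)
    return sum((y - beta * x) ** 2 for x, y in zip(zx, zy)) / n

# Hypothetical raw scores, invented for illustration only.
x = [1, 2, 3, 4, 5, 6]
y = [2, 1, 4, 3, 6, 5]

zx, zy = standardize(x), standardize(y)

# Try a few tentative slopes; the criterion prefers the smallest value.
for beta in (0.0, 0.5, 1.0):
    print(beta, round(error_variance(zx, zy, beta), 4))
```

Each tentative slope yields a different error variance, which is why the slope has to be chosen so that this quantity is as small as possible.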

Within the general linear model it is assumed that the relationship between the X and Y variables is linear, so that

$$Y' = a + bX$$

The above equation can be written for the deviation scores as,

$$y' = bx$$

and for the standard scores as,

$$z_{y'} = \beta z_x$$

Substituting the right side of the above equation for the last term in the numerator of the expression defining the criterion of least squares results in

$$s^2_{z_{\hat{y}}} = \frac{\sum (z_y - \beta z_x)^2}{n}$$

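Expanding the binomial leads to

$$s^2_{z_{\hat{y}}} = \frac{\sum z_y^2}{n} - 2\beta \frac{\sum z_x z_y}{n} + \beta^2 \frac{\sum z_x^2}{n}$$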

In this expression, the first and last terms contain the variances of the standard scores. Remember that the variance of any variable in standard scores always equals 1. The formula for the coefficient of correlation can be recognized within the middle term, and the expression can be simplified to

$$s^2_{z_{\hat{y}}} = 1 - 2\beta r + \beta^2$$

This expression can be conceptualized as a loss function, psi. To find the minimum of this function, one has to differentiate it with respect to beta (this step may be skipped if you are not familiar with calculus) as

$$\frac{d\psi}{d\beta} = -2r\beta^0 + 2\beta^1$$

The differentiation was accomplished by disregarding the constant (1), multiplying each expression containing beta by its exponent, and diminishing the exponent of beta by one. By setting the right-hand side of the equation to zero (the condition for a minimum) and by remembering that any nonzero number to the zero power equals 1, the first derivative of the loss function can be written as

$$-2r + 2\beta = 0$$

Solving the above equation for the beta term results in the fundamental equivalence of the general linear model

$$\beta = r$$

          This equivalence is at the heart of the general linear model. It indicates that the coefficient of correlation can structure the space defined by the elements in the data matrices. The addition of more variables to this simple bivariate model will change the regression line into a regression plane, a regression space and, finally, into a subspace within hyperspace. The principles governing the geometric properties of correlation will not change. They represent a solid foundation upon which the methods of data analysis are built.
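The equivalence can also be checked without calculus. The sketch below, offered only as an illustration, scans candidate slopes and keeps the one with the smallest loss, using the loss expression derived above (one minus twice beta times r, plus beta squared). The function names `loss` and `slope_at_minimum` and the step size of 0.01 are assumptions of this sketch.

```python
def loss(beta, r):
    """Error variance in standard scores: 1 - 2*beta*r + beta**2."""
    return 1 - 2 * beta * r + beta ** 2

def slope_at_minimum(r, step=0.01):
    """Scan candidate slopes from -1 to 1 and return the one with the smallest loss."""
    steps = int(round(2 / step))
    candidates = [-1 + i * step for i in range(steps + 1)]
    return min(candidates, key=lambda b: loss(b, r))

r = 0.50  # the correlation used in the chapter's example
best = slope_at_minimum(r)
print(round(best, 2), round(loss(best, r), 4))
```

For r = .50 the scan settles on a slope equal to r, in agreement with the result obtained by differentiation.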

Coefficient of Correlation as Slope of Regression Line

In standard scores, the line of best fit, also called the regression line, can be plotted by connecting the origin of the system of coordinates with the point located at unit distance from the origin along the abscissa and at the value of the correlation coefficient along the ordinate. Using the trigonometric functions, one can also plot the regression line within the standard coordinate system. Trigonometry is a branch of geometry that involves the measurement of the sides and angles of triangles. Through the use of trigonometric functions, you can determine the lengths of sides and the sizes of angles in a right triangle if you know the lengths of two of the sides, or the length of one side and the size of one angle other than the right angle.

Natural trigonometric functions are the sine, cosine, and tangent. The sine is defined as the altitude divided by the hypotenuse. The cosine is defined as the base divided by the hypotenuse, and the tangent is defined as the altitude divided by the base. The inverses of these functions, symbolized by the -1 power, are called the arc sine, arc cosine, and arc tangent, expressing the angle that corresponds to a particular value of these functions. Thus the angular separation of the regression line from the abscissa can be written as

$$\theta = \tan^{-1} r$$

since the tangent, as well as the slope, is defined by a ratio of the altitude to the base of a right triangle, i.e., by the ratio of change in the predicted value, located on the ordinate, to a unit change in the value of the predictor variable, located on the abscissa.  Arc tangent values for selected coefficients of correlation are shown in the table that follows.

 

 

For the example, the coefficient of correlation, r, was computed as equal to .50. Thus, theta equals the arc tangent of .50. Using your calculator or consulting the above table, the angular separation of the regression line from the abscissa can be found as equal to approximately 26.57 degrees (26 degrees, 34 minutes), measured counterclockwise from the abscissa. The angular separation of the line of best fit is shown in the figure below.
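The angle can be computed directly with Python's standard `math` module. This is a small sketch; the helper names `regression_angle` and `to_degrees_minutes` are assumptions introduced here for illustration.

```python
import math

def regression_angle(r):
    """Angle between the regression line and the abscissa, in decimal degrees."""
    return math.degrees(math.atan(r))

def to_degrees_minutes(angle):
    """Split a non-negative decimal angle into whole degrees and rounded minutes."""
    degrees = int(angle)
    minutes = round((angle - degrees) * 60)
    if minutes == 60:  # carry when rounding reaches a full degree
        degrees, minutes = degrees + 1, 0
    return degrees, minutes

theta = regression_angle(0.50)
print(round(theta, 2), to_degrees_minutes(theta))
```

A correlation of 1.0 gives an angle of 45 degrees, and a correlation of zero leaves the regression line lying on the abscissa.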

Predictor, Criterion, Predicted, and Error Variables

Once a relationship is quantified and its magnitude is found not to be zero, the coefficient of correlation can be used for the prediction of change in one variable from change in the other. Initially, the coefficient of correlation is computed between a predictor variable X and a criterion variable Y. Subsequently, a prediction can be made from the predictor variable to the predicted variable, and the error of this prediction can be ascertained. The variance of predicted scores can be computed from the equation for computing predicted scores. This equation can be obtained from the equation of the line in standard scores

$$z_{y'} = \beta z_x$$

by substituting the correlation coefficient for the slope of the line, beta, as specified by the fundamental equivalence of the general linear model. The equation for computing predicted scores is one of the key equations of the general linear model. Using notation introduced in the previous section, it is written as

$$z_{y'} = r z_x$$

The variance of predicted scores can then be computed from the above equation by squaring, summing, and averaging both sides as

$$\frac{\sum z_{y'}^2}{n} = r^2 \frac{\sum z_x^2}{n}$$

Recalling that the variance of obtained scores in standard form is always one, the variance of the predicted scores can be written as

$$s^2_{z_{y'}} = r^2$$

Thus, the variance of the predicted standard scores equals the coefficient of determination. This equivalence is one of the key properties of the general linear model.
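This property is easy to confirm on data. The sketch below uses hypothetical raw scores invented for illustration and Python's standard `statistics` module with population formulas (dividing by n), to match the averaging used in the text. The `standardize` helper is an assumption of this sketch.

```python
from statistics import mean, pstdev, pvariance

def standardize(values):
    """Standard scores using the population standard deviation (divide by n)."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

# Hypothetical raw scores, invented for illustration only.
x = [2, 4, 5, 7, 9]
y = [1, 3, 2, 6, 5]

zx, zy = standardize(x), standardize(y)

# Correlation as the mean cross-product of standard scores.
r = sum(a * b for a, b in zip(zx, zy)) / len(zx)

# Predicted standard scores: each standard score of X multiplied by r.
z_pred = [r * a for a in zx]

# The two printed values should agree: variance of predictions and r squared.
print(round(pvariance(z_pred), 4), round(r ** 2, 4))
```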

If the correlation coefficient is less than one in absolute value, the prediction is not perfect. From the definition of error, used for the development of the criterion of least squares, the equation for this error component can be written as

$$z_{\hat{y}} = z_y - z_{y'}$$

Substituting the equation for computing predicted scores for the last term in the above equation, and squaring, summing, and dividing by n,

$$\frac{\sum z_{\hat{y}}^2}{n} = \frac{\sum (z_y - r z_x)^2}{n}$$

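the right side of the above equation can be expanded, as

$$\frac{\sum z_{\hat{y}}^2}{n} = \frac{\sum z_y^2}{n} - 2r \frac{\sum z_x z_y}{n} + r^2 \frac{\sum z_x^2}{n}$$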

Substituting 1 for the standard form of the variances of X and Y, and r for the coefficient of correlation, the equation becomes

$$s^2_{z_{\hat{y}}} = 1 - 2r^2 + r^2$$

This equation, which expresses the variance of the error scores associated with prediction based on least squares, can be further simplified as

$$s^2_{z_{\hat{y}}} = 1 - r^2$$

The right-hand side of the above equation can be recognized as the coefficient of alienation. The coefficient of alienation thus can be equated to the variance of error scores of bivariate prediction.
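As a worked check with the chapter's example value, r = .50, and writing the predicted scores with a prime and the error scores with a hat as in the notational convention adopted earlier, the two variance components come out as

$$s^2_{z_{y'}} = r^2 = (.50)^2 = .25, \qquad s^2_{z_{\hat{y}}} = 1 - r^2 = 1 - .25 = .75$$

so that, with a correlation of .50, one quarter of the unit variance of the criterion is predictable and three quarters is error.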

Specification Equation of Regression Analysis

At this point the integration of previous findings is in order. We know that the variance of criterion scores, in standard form, is always equal to one:

$$s^2_{z_y} = 1$$

The variance of predicted scores equals the coefficient of determination

$$s^2_{z_{y'}} = r^2$$

and the variance of error scores equals the coefficient of alienation,

$$s^2_{z_{\hat{y}}} = 1 - r^2$$

Thus, the sum of the coefficients of determination and alienation must equal one

$$1 = r^2 + (1 - r^2)$$

From the previous discussion we have learned that the coefficient of determination equals the variance of predicted scores and the coefficient of alienation equals the variance of error scores. Substituting these variance components into the above equation we have

$$1 = s^2_{z_{y'}} + s^2_{z_{\hat{y}}}$$

Since the variance of standard scores always equals one, it is possible to replace the one on the left-hand side of the above equation with the variance of the criterion standard scores as

$$s^2_{z_y} = s^2_{z_{y'}} + s^2_{z_{\hat{y}}}$$

The above equation is the specification equation for bivariate prediction and is one of the fundamental equations of the general linear model. It postulates that the unit variance of the criterion variable Y can be partitioned into determined and alienated components, into the predictable known and the unpredictable unknown. Multiplying the above equation by  leads to

 

 

 

which can be also written as

 

 

 

Alternatively,

 

 

The partitioning of variance by these specification equations is based on the criterion of least squares, minimizing the error term.
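The partitioning can be illustrated in raw scores as well. The sketch below assumes that the multiplier intended in the step above is the raw-score variance of Y; this is an assumption of the sketch, not something the text confirms. The data are hypothetical, and predicted raw scores are obtained by rescaling the predicted standard scores back to the metric of Y.

```python
from statistics import mean, pstdev, pvariance

# Hypothetical raw scores, invented for illustration only.
x = [10, 12, 15, 19, 24]
y = [3, 7, 6, 10, 9]

n = len(x)
mx, my = mean(x), mean(y)
sx, sy = pstdev(x), pstdev(y)

# Standard scores and their mean cross-product, the correlation r.
zx = [(v - mx) / sx for v in x]
zy = [(v - my) / sy for v in y]
r = sum(a * b for a, b in zip(zx, zy)) / n

# Predicted raw scores: rescale the predicted standard scores r * z_x.
y_pred = [my + sy * r * a for a in zx]
y_err = [actual - pred for actual, pred in zip(y, y_pred)]

total = pvariance(y)           # variance of the criterion
determined = pvariance(y_pred) # predictable component
alienated = pvariance(y_err)   # error component

# The total should equal the sum of the two components.
print(round(total, 4), round(determined + alienated, 4))
```

Under this assumption the determined component equals r squared times the variance of Y, and the alienated component equals one minus r squared times the variance of Y.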

Summary

The specification equations discussed in this chapter are summarized in the following table. They are elegant in their simplicity, expressing partitioning of variance as simple sums of component variances.

 

 

 

According to the general rule for addition of variance components, the variance sums should contain the covariance terms. The covariance terms are missing in all specification equations of the general linear model. This indicates that the correlation between the components of the specification equations is zero. Thus, the components of the specification equations must be orthogonal. The importance of this point cannot be overstressed and shall be recognized in the course of future discussions of the general linear model in statistical analysis.
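The orthogonality of the components can be verified numerically. The sketch below, using hypothetical data and a `pcovariance` helper that is an assumption of this sketch, computes the covariance between the predicted and error standard scores and inserts it into the general rule for the variance of a sum.

```python
from statistics import mean, pstdev, pvariance

def pcovariance(a, b):
    """Population covariance: mean cross-product of deviations from the means."""
    ma, mb = mean(a), mean(b)
    return sum((u - ma) * (v - mb) for u, v in zip(a, b)) / len(a)

def standardize(values):
    """Standard scores using the population standard deviation (divide by n)."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

# Hypothetical raw scores, invented for illustration only.
x = [3, 5, 6, 8, 11, 12]
y = [4, 4, 7, 6, 10, 8]

zx, zy = standardize(x), standardize(y)
r = sum(a * b for a, b in zip(zx, zy)) / len(zx)

z_pred = [r * a for a in zx]                  # predicted standard scores
z_err = [b - p for b, p in zip(zy, z_pred)]   # error standard scores

# General rule: var(A + B) = var(A) + var(B) + 2 * cov(A, B)
cov = pcovariance(z_pred, z_err)
total = pvariance(z_pred) + pvariance(z_err) + 2 * cov

print(abs(cov) < 1e-9, round(total, 6))
```

Because the covariance term vanishes, the unit variance of the criterion is recovered from the two component variances alone.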