Regression: Approximation of Assumed Structures

 

Perfect Relationship

When every data point falls on a single line in Cartesian space, knowing the value of variable X determines the corresponding value of variable Y unequivocally. The relationship between X and Y is perfect, so errorless linear transformations between the two variables are possible.

 

 

This situation is typical in the physical sciences, where knowing one variable and its relationship to another allows us to ascertain, with a high degree of certainty, how changes in one variable affect the state of the other.

 

Imperfectly Related Variables

Scatterplot

In the social sciences, relationships between two variables or events are usually far from perfect. When manifestations of two imperfectly related events are plotted, the result is not a line, but a scatterplot.

If the relationship between these two variables is approximately linear, it takes on the form of an ellipse, as schematically depicted in the figures below.

 

 

If the ellipse is relatively narrow (i.e., the degree of the relationship between the two variables is high), it is relatively easy to position a line through the points so that it approximates the measured relationship. However, as the degree of the relationship diminishes and the scatterplot looks more like a circle than an ellipse, it becomes increasingly difficult to position a line unequivocally.

 

Ideal and Actual Loci

Consider the problem of optimally positioning a line through the points that are presented in the following table.

Data Set

First, convert the obtained scores to standard scores.
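As a minimal sketch of this conversion, the Python snippet below standardizes a small set of obtained scores. The scores in `X` and `Y` are hypothetical (they are not the data set from the table), and the helper name `standardize` is an assumption. The population standard deviation (dividing by n) is used, on the assumption that variances are computed by dividing by n.

```python
from statistics import fmean, pstdev


def standardize(scores):
    """Convert obtained scores to standard (z) scores."""
    mean = fmean(scores)
    sd = pstdev(scores)  # population SD: divides by n, not n - 1
    return [(score - mean) / sd for score in scores]


# Hypothetical obtained scores, not the data set from the text.
X = [1, 2, 3, 4, 5]
Y = [3, 1, 4, 2, 5]

Zx = standardize(X)
Zy = standardize(Y)

for x, y, zx, zy in zip(X, Y, Zx, Zy):
    print(f"X={x}  Y={y}  Zx={zx:+.3f}  Zy={zy:+.3f}")
```

After the conversion each variable has a mean of 0 and a variance of 1, which is why a line of best fit drawn in standard scores must pass through the origin.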
 

 

The associated scatterplot is shown below.



 

Note that the relationship between the two variables is not perfect: the plotted points form a scatterplot rather than a line.

Unknown Slope of a Line of Best Fit

To visualize the problem of positioning a line of best fit through the scatterplot of actual data points, the data are plotted in the following figure.

 

 

The line of best fit to data points is positioned only tentatively. Since the plot is in standard scores, we know that the line must go through the origin. However, the slope of the line, signified by the Greek letter beta, is so far unknown. Two approximate lines of best fit are positioned as shown below.


 

Which line best fits the points on the scatterplot? Before answering this question, we first need to distinguish between the actual and the idealized locations of the data points.

Idealized Locations

Since the relationship between variables Zx and Zy is not perfect, it is necessary to distinguish between the actual locations of the values, shown as the red data points,



 

and the idealized locations, shown as the blue data points on a line of best fit, positioned through the scatterplot. 



Notations

Predictor Variable and Criterion Variable

The variable used to help make the prediction is called the predictor variable. Usually we label the predictor variable as X, x, or Zx. The variable we make predictions about is called the criterion variable. Usually we label the criterion variable as Y, y, or Zy.

Regression Line and Predicted Variable

A line of best fit called the regression line is used for predicting values on the criterion variable from values on the predictor variable. To distinguish between the actual and idealized locations, let us adopt the following notational convention. A top bar (-) or a prime (') will mark variables containing the idealized locations.

Error Variable

Next, a hat (^) will mark variables containing information about the separation between the actual and idealized locations. The distances between the actual and idealized locations (Zy^ = Zy - Zy') can be visualized as

 

 

Obtained, deviation, and standard variables containing information about the location of idealized data points will thus be written with a top bar or, equivalently, with a prime (Y', y', and Zy'). Variables containing information about the separation of the idealized and actual locations of these points will be written with a hat (Y^, y^, and Zy^).

 

Criterion of Least Squares

The criterion of least squares is instrumental in determining the slope of the regression line. As the slope changes, so do the vertical distances between the actual locations of the data points and their idealized locations on the line of best fit. Using formal notation, it is possible to write these distances as

$$Z_{\hat{y}} = Z_y - Z_{y'}$$

Minimum Distances

To satisfy the criterion of least squares, frequently credited to Carl Friedrich Gauss but first published by Adrien-Marie Legendre, these distances must be kept at a minimum.

Squared Distances

To avoid negative numbers, a qualification is added ‘to keep the squared distances between the idealized and the actual loci at a minimum.’ 

Mean of Squared Distances

To form a statistical index, an additional qualification is added ‘to keep the mean of the squared distances between the idealized and the actual loci at a minimum.’ Using algebraic notation, the statistical criterion of the least squares is defined as

$$\frac{\sum (Z_y - Z_{y'})^2}{n} = \text{minimum}$$

The term within the parentheses in the numerator of the above equation defines the error scores and the entire equation defines the variance of the error scores. Thus, the criterion of the least squares can be phrased as determining the location of the regression line so as to keep the variance of the error scores at a minimum.
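The criterion can be illustrated with a short Python sketch that computes the variance of the error scores for several candidate slopes. The data are hypothetical (chosen so that the correlation is .50, not taken from the text's table), and the function names `standardize` and `error_variance` are assumptions made for this illustration.

```python
from statistics import fmean, pstdev


def standardize(scores):
    """Convert obtained scores to standard (z) scores (population SD)."""
    mean = fmean(scores)
    sd = pstdev(scores)
    return [(score - mean) / sd for score in scores]


def error_variance(zx, zy, beta):
    """Mean squared vertical distance between actual and idealized locations."""
    return fmean([(y - beta * x) ** 2 for x, y in zip(zx, zy)])


# Hypothetical data with a correlation of .50.
Zx = standardize([1, 2, 3, 4, 5])
Zy = standardize([3, 1, 4, 2, 5])

# Compare three tentative lines through the origin.
for beta in (0.3, 0.5, 0.8):
    print(f"slope {beta:.1f}: error variance {error_variance(Zx, Zy, beta):.3f}")
```

The candidate slope with the smallest error variance is the one that best satisfies the criterion among those tried.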

Regression Equation and Predicted Values

The location of a regression line in the scatterplot is determined by the regression equation. A regression equation is a formula used for computing predicted values. In standard scores form, the predicted values can be directly computed as

$$Z_{y'} = r_{xy} Z_x$$

Proof (Optional)

The equation of a line, written in obtained scores is 

Regression Line

Within the general linear model it is assumed that the relationship between the X and Y variables is linear, so that

 

 

The above equation can be written for the deviation scores as,

 

 

and for the standard scores as

$$Z_{y'} = \beta Z_x$$

Criterion of the Least Squares

Substituting the right side of the above equation for the last term in the numerator of the expression defining the criterion of the least squares results in

$$\frac{\sum (Z_y - \beta Z_x)^2}{n} = \text{minimum}$$

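Expanding the binomial leads to

$$\frac{\sum Z_y^2}{n} - 2\beta \frac{\sum Z_x Z_y}{n} + \beta^2 \frac{\sum Z_x^2}{n} = \text{minimum}$$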

In this expression, the first and last terms contain the variances of the standard scores Zy and Zx. Remember that the variance of any variable in standard scores always equals 1. The formula for the coefficient of correlation can be recognized within the middle term, and the expression can be simplified to

$$1 - 2\beta r_{xy} + \beta^2 = \text{minimum}$$

This expression can be conceptualized as a loss function y. To find the minimum of this function, one has to differentiate it with respect to beta (this step may be skipped if you are not familiar with calculus):

$$y = 1 - 2\beta r_{xy} + \beta^2, \qquad \frac{dy}{d\beta} = -2 r_{xy} \beta^{0} + 2\beta^{1}$$

The differentiation was accomplished by disregarding the constant (1), multiplying the expressions containing beta by their exponent, and diminishing the exponent of beta by one. By setting the derivative to zero (the condition for a minimum) and by remembering that any nonzero number raised to the zero power equals 1, the first derivative of the loss function can be written as

$$0 = -2 r_{xy} + 2\beta$$

Solving the above equation for the beta term results in the fundamental equivalence of the general linear model

$$\beta = r_{xy}$$

This equivalence is at the heart of the general linear model. It indicates that the coefficient of correlation can structure the space defined by the elements in the data matrices. The addition of more variables to this simple bivariate model will change the regression line into a regression plane, a regression space and, finally, into a subspace within hyperspace. The principles governing the geometric properties of correlation will not change. They represent a solid foundation upon which the methods of data analysis are built.
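For readers who skipped the calculus, the result can be checked numerically. The sketch below assumes the simplified loss function 1 - 2βr + β² obtained from the expansion described in the proof and scans a grid of candidate slopes; the function names `loss` and `best_slope` and the grid resolution are assumptions made for this illustration.

```python
def loss(beta, r):
    """Error variance in standard scores for a line with slope beta."""
    return 1 - 2 * beta * r + beta ** 2


def best_slope(r, steps=200):
    """Scan slopes from -1 to 1 and return the one with the smallest loss."""
    candidates = [-1 + 2 * i / steps for i in range(steps + 1)]
    return min(candidates, key=lambda beta: loss(beta, r))


for r in (0.0, 0.25, 0.5, 0.9):
    print(f"r = {r:.2f}  best slope = {best_slope(r):.2f}")
```

Within the resolution of the grid, the slope that minimizes the loss coincides with the correlation coefficient in every case tried.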

 

Coefficient of Correlation

The Coordinates of Two Points

In standard scores, the slope of the regression line can be plotted by finding the coordinates of two points. The first point is anchored by the origin of the system of coordinates. The coordinates of the second point can be located by plotting the value of the correlation coefficient (.50) as a distance on the ordinate at the unit distance from the origin, measured on the abscissa.


 

Trigonometric Functions (Optional)

The regression line within the standard coordinate system can also be plotted by using trigonometric functions. Trigonometry is a branch of geometry that involves the measurement of the sides and angles of triangles. Through the use of trigonometric functions, you can determine the lengths of sides and the sizes of angles in a right triangle if you know the lengths of two of the sides, or the length of one side and the size of one angle other than the right angle.

Natural trigonometric functions are the sine, cosine, and tangent. The sine is defined as the altitude divided by the hypotenuse. The cosine is defined as the base divided by the hypotenuse, and the tangent is defined as the altitude divided by the base. The inverses of these functions, symbolized by the -1 power, are called the arc sine, arc cosine, and arc tangent; each expresses the angle that corresponds to a particular value of the function. Thus the angular separation of the regression line from the abscissa can be written as

$$\theta = \tan^{-1}(r_{xy})$$

since the tangent, as well as the slope, is defined by a ratio of the altitude to the base of a right triangle, i.e., by the ratio of change in the predicted value, located on the ordinate, to a unit change in the value of the predictor variable, located on the abscissa. Arc tangent values for selected coefficients of correlation are shown in the table that follows.

 

 

For the example, the coefficient of correlation, r, was computed as .50. Thus, theta equals the arc tangent of .50, and the angular separation of the regression line from the abscissa is approximately 26 degrees, 34 minutes (26.57 degrees), measured counterclockwise from the abscissa.
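Arc tangent values for any coefficient of correlation can also be computed directly. The following Python sketch uses the standard `math` module; the function names `regression_angle` and `degrees_and_minutes`, and the selection of coefficients, are assumptions made for this illustration, and the conversion to degrees and minutes is intended for non-negative correlations only.

```python
import math


def regression_angle(r):
    """Angle in decimal degrees between the regression line and the abscissa."""
    return math.degrees(math.atan(r))


def degrees_and_minutes(angle):
    """Split a non-negative angle in decimal degrees into (degrees, minutes)."""
    total_minutes = round(angle * 60)
    return divmod(total_minutes, 60)


for r in (0.10, 0.25, 0.50, 0.75, 1.00):
    angle = regression_angle(r)
    deg, minutes = degrees_and_minutes(angle)
    print(f"r = {r:.2f}  theta = {angle:6.2f} degrees = {deg} deg {minutes} min")
```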

 

 

The angular separation of the line of best fit is shown in the figure below.

 

 

Variance Components

Once a relationship is quantified and the magnitude of this relationship is found not to be zero, the coefficient of correlation can be used for the prediction of change in one variable from change in the other.

Data Set

The predictor variable is Zx and the criterion variable is Zy.

 

Total Variance of the Criterion Variable

Ignore the predictor variable Zx. In the absence of any relevant information, the best prediction of the outcome is the overall mean. The mean of the variable Zy equals 0.



The total variance of the variable Zy is equal to 1. The distances between the individual values of the variable Zy and the overall mean (Mzy) can be graphed as

 

 

Correlation Coefficient

Next, the analysis will involve two variables, Zx and Zy. Initially, the coefficient of correlation is computed between a predictor variable Zx and a criterion variable Zy.

 


The correlation between Zx and Zy is .50. Because the correlation coefficient is not zero, a prediction can be made from the predictor variable, and the error of this prediction can be described by the error variable.

Predicted Variable

The regression equation is a formula used for computing predicted values. In standard scores form, the predicted values can be directly computed as 

$$Z_{y'} = r_{xy} Z_x$$

For our example (where rxy = .50), the results are shown below

 

 

 

together with a plot of the regression line.


Predictable Variance

The predictable variance is computed as the mean of the squared differences between the predicted values and the mean of the predicted variable. In this example, the predictable variance equals .25. The distances between the predicted values and the mean are shown below

 

The predictable variance represents the variance in the criterion variable accounted for by a predictor variable. Recall that the total variance of the criterion variable Zy is 1. Thus, 25% (.25/1 = .25) of the variance in the criterion variable can be explained by the predictor variable Zx.
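The computation of predicted values and of the predictable variance can be sketched in Python as follows. The scores are hypothetical (chosen so that the correlation is .50, not copied from the text's data set), and the helper names `standardize` and `correlation` are assumptions made for this illustration.

```python
from statistics import fmean, pstdev


def standardize(scores):
    """Convert obtained scores to standard (z) scores (population SD)."""
    mean = fmean(scores)
    sd = pstdev(scores)
    return [(score - mean) / sd for score in scores]


def correlation(zx, zy):
    """Correlation as the mean cross-product of standard scores."""
    return fmean([x * y for x, y in zip(zx, zy)])


# Hypothetical data with a correlation of .50.
Zx = standardize([1, 2, 3, 4, 5])
Zy = standardize([3, 1, 4, 2, 5])

r = correlation(Zx, Zy)
predicted = [r * x for x in Zx]  # Zy' = r * Zx

# Mean of squared differences between predicted values and their mean.
mean_predicted = fmean(predicted)
predictable_variance = fmean([(p - mean_predicted) ** 2 for p in predicted])

print(f"r = {r:.2f}, predictable variance = {predictable_variance:.2f}")
```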

Error Variable

The error variable can be computed as

$$Z_{\hat{y}} = Z_y - Z_{y'}$$

The results are shown below.

 

 

Error Variance

The distances between the actual values (Zy) and the predicted values (Zy') can be visualized as  


 

The error variance represents the amount of error in our prediction, and in this example it is equal to .75. Recall that the total variance of the criterion variable Zy is 1. Thus, 75% (.75/1 = .75) of the variance in the criterion variable cannot be explained by the predictor variable Zx.
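A minimal Python sketch of the error variable and its variance is given below. As before, the data are hypothetical (chosen to give a correlation of .50), and the helper name `standardize` is an assumption; `pvariance` from the standard `statistics` module computes the population variance (dividing by n).

```python
from statistics import fmean, pstdev, pvariance


def standardize(scores):
    """Convert obtained scores to standard (z) scores (population SD)."""
    mean = fmean(scores)
    sd = pstdev(scores)
    return [(score - mean) / sd for score in scores]


# Hypothetical data with a correlation of .50.
Zx = standardize([1, 2, 3, 4, 5])
Zy = standardize([3, 1, 4, 2, 5])

r = fmean([x * y for x, y in zip(Zx, Zy)])

# Error scores: actual location minus idealized (predicted) location.
errors = [y - r * x for x, y in zip(Zx, Zy)]
error_variance = pvariance(errors)

print(f"mean error = {fmean(errors):.2f}, error variance = {error_variance:.2f}")
```

In this sketch the error scores average to zero, so their variance is simply the mean of their squares.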

The Equation for the Predicted Variance Component

Another way to compute the variance of predicted scores is from the equation for computing predicted scores.

The Equation of A Line in Standard Scores

Recall that the equation of the line in standard scores is

$$Z_{y'} = \beta Z_x$$

Regression Equation in Standard Scores

The regression equation can be obtained from the above equation by substituting the correlation coefficient for the slope of the line, beta, as specified by the fundamental equivalence of the general linear model. Using the notation introduced in the previous section, it is written as

$$Z_{y'} = r_{xy} Z_x$$

The equation for computing predicted scores is one of the key equations of the general linear model.

Variance of the Predicted Variable

The variance of predicted scores can be computed from the above equation by squaring both sides, summing, and dividing by n:

$$\frac{\sum Z_{y'}^2}{n} = r_{xy}^2 \frac{\sum Z_x^2}{n}$$

Recall that the variance of standard scores is always one.

$$\frac{\sum Z_x^2}{n} = 1$$

The variance of the predicted scores can be written as

$$\frac{\sum Z_{y'}^2}{n} = r_{xy}^2$$

Thus, the variance of the predicted standard scores equals the coefficient of determination. This equivalence is one of the key properties of the general linear model.

The Equation for the Error Variance Component

If the absolute value of the correlation coefficient is less than one, the prediction is not perfect. From the definition of error, used for the development of the criterion of the least squares, the equation for this error component can be written as

$$Z_{\hat{y}} = Z_y - Z_{y'}$$

Substituting the equation for computing predicted scores for the last term in the above equation, and by squaring, summing, and dividing by n,

$$\frac{\sum Z_{\hat{y}}^2}{n} = \frac{\sum (Z_y - r_{xy} Z_x)^2}{n}$$

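the right side of the above equation can be expanded as

$$\frac{\sum Z_{\hat{y}}^2}{n} = \frac{\sum Z_y^2}{n} - 2 r_{xy} \frac{\sum Z_x Z_y}{n} + r_{xy}^2 \frac{\sum Z_x^2}{n}$$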

Substituting 1 for the variances of the standard scores Zx and Zy, and r for the coefficient of correlation, the equation becomes

$$\frac{\sum Z_{\hat{y}}^2}{n} = 1 - 2 r_{xy} r_{xy} + r_{xy}^2$$

This equation, which expresses the variance of error scores associated with a prediction based on least squares, can be further simplified as

$$\frac{\sum Z_{\hat{y}}^2}{n} = 1 - r_{xy}^2$$

The right-hand side of the above equation can be recognized as the coefficient of alienation. The coefficient of alienation can thus be equated to the variance of error scores of bivariate prediction.

 

Specification Equation of Regression Analysis

Coefficients of Determination and Alienation

At this point the integration of previous findings is in order. We know that the variance of criterion scores, in standard form, is always equal to one.
 

 

Next, the coefficient of correlation is computed between a predictor variable Zx and a criterion variable Zy.

Third, predict Zy from Zx. In standard scores form, the predicted values can be directly computed as

$$Z_{y'} = r_{xy} Z_x$$

and the error variable can be computed as

 

$$Z_{\hat{y}} = Z_y - Z_{y'}$$

 

The variance of predicted scores equals the coefficient of determination

$$\frac{\sum Z_{y'}^2}{n} = r_{xy}^2$$

and the variance of error scores equals the coefficient of alienation,

$$\frac{\sum Z_{\hat{y}}^2}{n} = 1 - r_{xy}^2$$

Thus, the coefficients of determination and alienation must add up to one

$$1 = r_{xy}^2 + (1 - r_{xy}^2)$$

True Variance Components

From the previous discussion we have learned that the coefficient of determination equals the variance of predicted scores and the coefficient of alienation equals the variance of error scores. Substituting these variance components into the above equation we have

$$1 = \frac{\sum Z_{y'}^2}{n} + \frac{\sum Z_{\hat{y}}^2}{n}$$

Since the variance of standard scores always equals one, the one on the left-hand side of the above equation can be replaced by the variance of the criterion variable in standard scores:

$$\frac{\sum Z_y^2}{n} = \frac{\sum Z_{y'}^2}{n} + \frac{\sum Z_{\hat{y}}^2}{n}$$

The above equation is the specification equation for bivariate prediction and is one of the fundamental equations of the general linear model. It postulates that the unit variance of the criterion variable Y can be partitioned into determined and alienated components, into the predictable known and the unpredictable unknown.

Thus, multiplying the equation

 

 

by  leads to

 

 

which can also be written as

 

 

or, since the variances of deviation and obtained scores are identical, the specification equation can also be written as

 

 

Alternatively,

 

The partitioning of variance by these specification equations is based on the criterion of least squares, minimizing the error term.
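The same partitioning can be checked numerically in obtained scores. The Python sketch below uses hypothetical data and assumes the usual least-squares line in obtained scores, Y' = a + bX with b = r(s_y/s_x) and a = M_y - b M_x; these formulas and all variable names are assumptions of this illustration rather than equations quoted from the text.

```python
from statistics import fmean, pstdev, pvariance

# Hypothetical obtained scores, not the data set from the text.
X = [1, 2, 3, 4, 5]
Y = [3, 1, 4, 2, 5]

mean_x, mean_y = fmean(X), fmean(Y)
sd_x, sd_y = pstdev(X), pstdev(Y)

# Correlation as the mean cross-product of standard scores.
r = fmean([((x - mean_x) / sd_x) * ((y - mean_y) / sd_y) for x, y in zip(X, Y)])

# Assumed least-squares slope and intercept in obtained scores.
b = r * sd_y / sd_x
a = mean_y - b * mean_x

predicted = [a + b * x for x in X]
errors = [y - p for y, p in zip(Y, predicted)]

total_variance = pvariance(Y)
predicted_variance = pvariance(predicted)
error_variance = pvariance(errors)

print(f"total = {total_variance:.2f}")
print(f"predicted + error = {predicted_variance + error_variance:.2f}")
```

In this sketch the predicted variance equals the coefficient of determination times the total variance, and the error variance accounts for the remainder.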

 

Summary

The specification equations discussed in this chapter are summarized in the following table. They are elegant in their simplicity, expressing partitioning of variance as simple sums of component variances.

 

 

Variances

Coefficients of Determination and Alienation

Standard Scores

 

 

Deviation Scores

 

Obtained Scores

 

 

 

According to the general rule for addition of variance components, the variance sums should contain the covariance terms. The covariance terms are missing in all specification equations of the general linear model. This indicates that the correlation between the components of the specification equations is zero. Thus, the components of the specification equations must be orthogonal. The importance of this point cannot be overstressed and shall be recognized in the course of future discussions of the general linear model in statistical analysis.
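The orthogonality of the components can be verified numerically. The following Python sketch, based on hypothetical data and an assumed helper named `standardize`, computes the covariance between the predicted and the error scores and shows that it vanishes, so the covariance term drops out of the variance sum.

```python
from statistics import fmean, pstdev, pvariance


def standardize(scores):
    """Convert obtained scores to standard (z) scores (population SD)."""
    mean = fmean(scores)
    sd = pstdev(scores)
    return [(score - mean) / sd for score in scores]


# Hypothetical data with a correlation of .50.
Zx = standardize([1, 2, 3, 4, 5])
Zy = standardize([3, 1, 4, 2, 5])

r = fmean([x * y for x, y in zip(Zx, Zy)])
predicted = [r * x for x in Zx]
errors = [y - p for y, p in zip(Zy, predicted)]

# Both components have a mean of zero, so the covariance is the mean cross-product.
covariance = fmean([p * e for p, e in zip(predicted, errors)])

# General rule: var(A + B) = var(A) + var(B) + 2 * cov(A, B).
variance_sum = pvariance(predicted) + pvariance(errors) + 2 * covariance

print(f"covariance = {covariance:.6f}, variance sum = {variance_sum:.2f}")
```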