When a line is positioned in Cartesian space, knowing the value of a variable X
allows the corresponding value of variable Y to be determined unequivocally.
The relationship between X and Y is perfect, so errorless linear
transformations between the two variables are possible. This situation is
typical in the physical sciences, where knowing one variable and its
relationship to another lets us ascertain, with a high degree of certainty,
how changes in one variable affect the state of the other.

In the social sciences, relationships between two variables or events are
usually far from perfect. When manifestations of two imperfectly related events
are plotted, the result is not a line but a scatter-plot. If the relationship
between the two variables is approximately linear, the scatter-plot takes on
the form of an ellipse, as schematically depicted in the figures below.


If the ellipse is relatively narrow (i.e., the degree of the relationship
between the two variables is high), it is relatively easy to position a line
through these points so that it approximates the measured relationship.
However, as the degree of the relationship between the two variables
diminishes and the scatter-plot comes to resemble a circle rather than an
ellipse, it becomes increasingly difficult to position a line unequivocally.

Consider
the problem of optimally positioning a line through the swarm of points that
are presented in the following table.
To visualize the problem of positioning a line of best fit through the
scatter-plot of actual data points, these data were plotted in the subsequent
figure. The line of best fit to the data points is positioned only tentatively.
Since the plot is in standard scores, we know that the line must go through
the origin. However, the slope of the line, signified by the Greek letter beta
($\beta$), is so far unknown.

Since the relationship between variables X and Y is not perfect, it is
necessary to distinguish between the actual locations of the values of the Y
variable and the idealized locations of these values on a line of best fit to
be positioned through the scatter-plot. To distinguish between the actual and
idealized locations, let us adopt the following notational convention. A prime
mark will signify variables containing the idealized locations ($z'_y$), and a
hat mark will signify variables containing the distances between the actual
and idealized locations ($\hat{z}_y$).

The problem is to determine the slope of the prediction line. As the slope
changes, so do the vertical distances between the actual locations of the data
points and their idealized locations on the line of best fit. Using formal
notation, it is possible to write these distances as

$$\hat{z}_y = z_y - z'_y$$

To satisfy the criterion of least squares, frequently credited to Carl
Friedrich Gauss but first published by Adrien-Marie Legendre,
these distances must be kept at a minimum. To avoid negative numbers, a
qualification is added: ‘to keep the squared
distances between the idealized and the actual loci at a minimum.’ To form a
statistical index, an additional qualification is added: ‘to keep the mean of
the squared distances between the idealized and the actual loci at a minimum.’
Using algebraic notation, the statistical criterion of least squares is
defined as

$$\frac{\sum \hat{z}_y^2}{n} = \frac{\sum (z_y - z'_y)^2}{n} = \min$$

The
line of best fit is also called the regression line, defined by the idealized
loci. Distances between the idealized and actual loci define the error
variance. The above equation defines the variance of the error scores. The
criterion of the least squares can also be phrased as determining the location
of the regression line so as to keep the variance of the error scores at a
minimum.
Within the general linear model it is assumed that the relationship between
the X and Y variables is linear, so that

$$Y' = a + bX$$

The above equation can be written for the deviation scores as

$$y' = bx$$

and for the standard scores as

$$z'_y = \beta z_x$$

Substituting
the right side of the above equation for the last term on the right side in the
numerator of the expression defining the criterion of the least squares results
in

$$\frac{\sum (z_y - \beta z_x)^2}{n} = \min$$

Expanding
the binomial leads to

$$\frac{\sum z_y^2}{n} - 2\beta\,\frac{\sum z_x z_y}{n} + \beta^2\,\frac{\sum z_x^2}{n} = \min$$

In
this expression, the first and last terms stand for the variance of the
standard scores. Remember that the variance of any variable in standard scores
always equals 1. The formula for the coefficient of correlation can be
recognized within the middle term, and the expression can be simplified to

$$1 - 2\beta r + \beta^2 = \min$$

This expression can be conceptualized as a loss function y. To find the
minimum of this function, one differentiates it with respect to beta (this
step may be skipped if you are not familiar with calculus):

$$\frac{dy}{d\beta} = -2r\beta^{0} + 2\beta^{1}$$

The differentiation was accomplished by dropping the constant (1), multiplying
each term containing beta by its exponent, and reducing the exponent of beta
by one. Setting the right-hand side of the equation to zero (the condition for
a minimum) and remembering that any nonzero number raised to the zero power
equals 1, the first derivative of the loss function can be written as

$$-2r + 2\beta = 0$$

Solving
the above equation for the beta term results in the fundamental equivalence of
the general linear model

$$\beta = r$$

This equivalence is at the heart of
the general linear model. It indicates that the coefficient of correlation can
structure the space defined by the elements in the data matrices. The addition
of more variables to this simple bivariate model will change the regression
line into a regression plane, a regression space and, finally, into a subspace
within hyperspace. The principles governing the geometric properties of
correlation will not change. They represent a solid foundation upon which the
methods of data analysis are built.
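
The equivalence between the least-squares slope and the correlation coefficient can be checked numerically. The sketch below is a minimal illustration, assuming a small made-up data set (the values of `x` and `y` are hypothetical and not taken from the table in the text) and a simple grid scan over candidate slopes in place of calculus. The helper names `standardize` and `mean_squared_error` are likewise illustrative.

```python
import math


def standardize(values):
    """Convert raw values to standard scores using the population SD (divide by n)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]


def mean_squared_error(zx, zy, beta):
    """Mean squared vertical distance between actual z_y and the line beta * z_x."""
    return sum((b - beta * a) ** 2 for a, b in zip(zx, zy)) / len(zx)


# Hypothetical illustrative data, not taken from the text.
x = [1, 2, 3, 4, 5, 6]
y = [2, 1, 4, 3, 6, 5]

zx, zy = standardize(x), standardize(y)

# Correlation coefficient as the mean cross-product of standard scores.
r = sum(a * b for a, b in zip(zx, zy)) / len(zx)

# Scan candidate slopes from -1.000 to 1.000 in steps of 0.001.
candidates = [i / 1000 for i in range(-1000, 1001)]
best_beta = min(candidates, key=lambda beta: mean_squared_error(zx, zy, beta))

print(f"r = {r:.4f}, slope with smallest mean squared error = {best_beta:.3f}")
```

Under these assumptions, the slope found by the scan should agree with r to within one grid step, which is what the derivation predicts.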
In standard scores, the line of best fit, also called the regression line, can
be plotted by connecting the origin of the system of coordinates with the
point located at unit distance from the origin along the abscissa and at the
value of the correlation coefficient on the ordinate, i.e., the point (1, r).
Using the trigonometric functions, one can also plot the regression line
within the standard coordinate system.
Trigonometry is a branch of geometry that involves the measurement of the sides
and angles of triangles. Through the use of trigonometric functions, you can
determine the lengths of sides and the sizes of angles in a right triangle if
you know the length of two of the sides or the length of one side and the size
of one angle, other than the right angle.
Natural trigonometric functions are
the sine, cosine, and tangent. The sine is defined as the altitude divided by
the hypotenuse. The cosine is defined as the base divided by the hypotenuse and
the tangent is defined as the altitude divided by the base. The inverses of
these functions, symbolized by the -1 power, are called the arc sine, arc
cosine, and arc tangent; each returns the angle that corresponds to a
particular value of its function. Thus the angular separation of the
regression line from the abscissa can be written as

$$\theta = \tan^{-1} r$$

since
the tangent, as well as the slope, is defined by a ratio of the altitude to the
base of a right triangle, i.e., by the ratio of change in the predicted value,
located on the ordinate, to a unit change in the value of the predictor
variable, located on the abscissa. Arc
tangent values for selected coefficients of correlation are shown in the table
that follows.
For the example, the coefficient of correlation, r, was computed as equal to
.50. Thus, theta equals the arc tangent of .50. Using your calculator or
consulting the above table, the angular separation of the regression line from
the abscissa can be found as approximately 26 degrees, 34 minutes (26.57
degrees), measured counterclockwise from the abscissa. The angular separation
of the line of best fit is shown in the figure below.
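
The conversion from a correlation coefficient to an angle can be reproduced with a few lines of code. This is a minimal sketch using only Python's standard `math` module; the variable names are illustrative, and r = .50 is the value from the example above.

```python
import math

r = 0.50  # correlation coefficient from the example in the text

# Arc tangent of the slope gives the angle from the abscissa, in decimal degrees.
theta_deg = math.degrees(math.atan(r))

# Split the decimal angle into whole degrees and minutes of arc.
whole_degrees = int(theta_deg)
minutes = (theta_deg - whole_degrees) * 60

print(f"theta = {theta_deg:.2f} degrees = {whole_degrees} degrees, {minutes:.0f} minutes")
# prints: theta = 26.57 degrees = 26 degrees, 34 minutes
```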

Once
a relationship is quantified and the magnitude of this relationship is found
not to be zero, the coefficient of correlation can be used for the prediction
of change in one variable from change in the other. Initially, the coefficient
of correlation is computed between a predictor variable X and a criterion
variable Y. Subsequently, a prediction can be made from the predictor variable
to the predicted variable, and the error of this prediction can be ascertained.
The variance of predicted scores can be computed from the equation for computing
predicted scores. This equation can be obtained from the equation of the line
in standard scores

$$z'_y = \beta z_x$$

by substituting the correlation coefficient for the slope of the line, beta,
as specified by the fundamental equivalence of the general linear model. The
equation for computing predicted scores is one of the key equations of the
general linear model. Using notation introduced in the previous section, it is
written as

$$z'_y = r z_x$$

The variance of predicted scores can then be computed from the above equation
by squaring, summing, and averaging both sides:

$$\frac{\sum z'^{2}_y}{n} = r^2\,\frac{\sum z_x^2}{n}$$

Recalling that the variance of obtained scores in standard form is always one,
the variance of the predicted scores can be written as

$$s^2_{z'_y} = r^2$$

Thus,
the variance of the predicted standard scores equals the coefficient of
determination. This equivalence is one of the key properties of the general
linear model.
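
This property is easy to verify on a small example. The sketch below assumes a made-up data set (the values of `x` and `y` are hypothetical, not the data from the text) and uses the population variance, dividing by n, to match the derivation; `standardize` and `variance` are illustrative helper names.

```python
import math


def standardize(values):
    """Convert raw values to standard scores using the population SD (divide by n)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]


def variance(values):
    """Population variance: mean squared deviation from the mean."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n


# Hypothetical illustrative data, not taken from the text.
x = [1, 2, 3, 4, 5, 6]
y = [2, 1, 4, 3, 6, 5]

zx, zy = standardize(x), standardize(y)
r = sum(a * b for a, b in zip(zx, zy)) / len(zx)

# Predicted standard scores: each z_x multiplied by the correlation coefficient.
z_pred = [r * a for a in zx]

print(f"variance of predicted scores = {variance(z_pred):.4f}")
print(f"coefficient of determination = {r ** 2:.4f}")
```

The two printed values should match up to floating-point rounding.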
If the correlation coefficient is less than one, the prediction is not
perfect. From the definition of error, used for the development of the
criterion of least squares, the equation for this error component can be
written as

$$\hat{z}_y = z_y - z'_y$$

Substituting
the equation for computing predicted scores for the last term in the above
equation, and by squaring, summing, and dividing by n,

$$\frac{\sum \hat{z}_y^2}{n} = \frac{\sum (z_y - r z_x)^2}{n}$$

the
right side of the above equation can be expanded, as
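
$$\frac{\sum \hat{z}_y^2}{n} = \frac{\sum z_y^2}{n} - 2r\,\frac{\sum z_x z_y}{n} + r^2\,\frac{\sum z_x^2}{n}$$
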
Substituting 1 for the standard form of the variances of X and Y, and r for
the coefficient of correlation, the equation becomes

$$s^2_{\hat{z}_y} = 1 - 2r^2 + r^2$$

This equation expresses the variance of the error scores associated with
prediction based on least squares. It can be further simplified as

$$s^2_{\hat{z}_y} = 1 - r^2$$

The right-hand side of the above equation can be recognized as the coefficient
of alienation. The coefficient of alienation can thus be equated to the
variance of error scores of bivariate prediction.
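
The error variance can be computed directly by squaring, summing, and dividing by n, as in the derivation. The following is a minimal sketch under the same kind of assumptions as before: the data values are hypothetical, and the names `standardize` and `z_err` are illustrative only.

```python
import math


def standardize(values):
    """Convert raw values to standard scores using the population SD (divide by n)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]


# Hypothetical illustrative data, not taken from the text.
x = [1, 2, 3, 4, 5, 6]
y = [2, 1, 4, 3, 6, 5]

zx, zy = standardize(x), standardize(y)
n = len(zx)
r = sum(a * b for a, b in zip(zx, zy)) / n

# Error scores: actual z_y minus predicted r * z_x.
z_err = [b - r * a for a, b in zip(zx, zy)]

# Square, sum, and divide by n (the error scores have a mean of zero).
error_variance = sum(e ** 2 for e in z_err) / n

print(f"variance of error scores = {error_variance:.4f}")
print(f"1 - r squared            = {1 - r ** 2:.4f}")
```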
At this point the integration of previous findings is in order. We know that
the variance of criterion scores, in standard form, is always equal to one:

$$s^2_{z_y} = 1$$

The variance of predicted scores equals the coefficient of determination,

$$s^2_{z'_y} = r^2$$

and the variance of error scores equals the coefficient of alienation,

$$s^2_{\hat{z}_y} = 1 - r^2$$

Thus, the sum of the coefficients of determination and alienation must equal
one:

$$1 = r^2 + (1 - r^2)$$

From the previous discussion we have learned that the coefficient of
determination equals the variance of predicted scores and the coefficient of
alienation equals the variance of error scores. Substituting these variance
components into the above equation we have

$$1 = s^2_{z'_y} + s^2_{\hat{z}_y}$$

Since the variance of standard scores always equals one, the variance of the
criterion standard scores can be substituted for the one on the left-hand side
of the above equation:

$$s^2_{z_y} = s^2_{z'_y} + s^2_{\hat{z}_y}$$

The
above equation is the specification equation for bivariate prediction and is
one of the fundamental equations of the general linear model. It postulates
that the unit variance of the criterion variable Y can be partitioned into
determined and alienated components, into the predictable known and the
unpredictable unknown. Multiplying the above equation by
leads to
which
can be also written as
Alternatively,
![]()
The
partitioning of variance by these specification equations is based on the
criterion of least squares, minimizing the error term.
The
specification equations discussed in this chapter are summarized in the
following table. They are elegant in their simplicity, expressing partitioning
of variance as simple sums of component variances.
According
to the general rule for addition of variance components, the variance sums
should contain the covariance terms. The covariance terms are missing in all
specification equations of the general linear model. This indicates that the
correlation between the components of the specification equations is zero.
Thus, the components of the specification equations must be orthogonal. The
importance of this point cannot be overstressed and shall be recognized in the
course of future discussions of the general linear model in statistical
analysis.
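
The role of the missing covariance term can be illustrated numerically. By the general rule, the variance of a sum of two components equals the sum of their variances plus twice their covariance. The sketch below assumes hypothetical data and illustrative helper names (`standardize`, `variance`, `covariance`), and checks that the covariance between predicted and error scores vanishes, so that the two variances alone add up to the variance of the criterion.

```python
import math


def standardize(values):
    """Convert raw values to standard scores using the population SD (divide by n)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]


def covariance(a, b):
    """Population covariance: mean cross-product of deviations from the means."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b)) / n


def variance(values):
    """Population variance, i.e. the covariance of a variable with itself."""
    return covariance(values, values)


# Hypothetical illustrative data, not taken from the text.
x = [1, 2, 3, 4, 5, 6]
y = [2, 1, 4, 3, 6, 5]

zx, zy = standardize(x), standardize(y)
r = sum(a * b for a, b in zip(zx, zy)) / len(zx)

z_pred = [r * a for a in zx]                   # predicted (determined) component
z_err = [b - p for b, p in zip(zy, z_pred)]    # error (alienated) component

cov = covariance(z_pred, z_err)
total = variance(z_pred) + variance(z_err) + 2 * cov

print(f"covariance of predicted and error scores = {cov:.6f}")
print(f"sum of components = {total:.4f}, variance of z_y = {variance(zy):.4f}")
```

Because the covariance is zero up to rounding, dropping it from the sum changes nothing, which is the orthogonality the specification equations rely on.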