LESSON FIVE

EXPLORE DATA

 

Example 1. Thirty students earned the following scores on their final exam:

25, 35, 21, 14, 33, 22, 12, 19, 17, 24, 11, 31, 24, 27, 26,

28, 16, 15, 8, 27, 20, 23, 21, 6, 10, 30, 19, 25, 23, 22.

The researcher would like to screen the data first.

  

SPSS for Windows

A. Enter data. Define the variable name (score).

B. Choose the Explore procedure.

From the menus choose: Analyze / Descriptive Statistics / Explore

To learn more about the Explore procedure, click the Help button. The help topic will appear.

 

Close the help topic window when you are done.

a. Next, click `score` and the > pushbutton to move `score` to the Dependent List.

b. Click on the Plots button. Select Histogram. Boxplots and Stem-and-leaf are displayed by default. Click Continue. Click OK.  

 

SPSS Printout

 

A. Descriptive Statistics

   

The mean or the median is used to determine the single most representative value, the location of a distribution of scores. The variance, standard deviation, and the range tell you how spread out the scores are. 

1. The mean value was 21.133 and the median was 22. 

The greater the difference between the mean and the median, the greater the skew. For our example, the difference is very small. 

2. The range is sensitive to extremes. The minimum value was 6 and the maximum value was 35. The range was 35 - 6 = 29.

The interquartile range, defined as Q3 - Q1, is 10.50. It is resistant to the effects of extreme scores.

The standard deviation is based on every score and is more stable from sample to sample. The sample standard deviation, computed with n - 1 in the denominator, is 7.328. 

3. We may compute z scores by using the above mean and the standard deviation. For example, the largest value of the variable Score was 35. A score of 35 is ____ standard deviations above the mean. 

z = (35 - 21.133)/7.328 = 1.89
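
The descriptive statistics above can also be checked outside SPSS. The following is a minimal sketch in Python using only the standard library. It assumes that the quartile rule `method="exclusive"` (positions at (n + 1)p) matches the percentile rule SPSS used for this printout; for these data it gives the same interquartile range.

```python
import statistics

# Final exam scores from Example 1 (n = 30).
scores = [25, 35, 21, 14, 33, 22, 12, 19, 17, 24, 11, 31, 24, 27, 26,
          28, 16, 15, 8, 27, 20, 23, 21, 6, 10, 30, 19, 25, 23, 22]

mean = statistics.mean(scores)
median = statistics.median(scores)
value_range = max(scores) - min(scores)
sd = statistics.stdev(scores)  # sample SD, n - 1 in the denominator

# Quartiles by the (n + 1)p interpolation rule (an assumption about SPSS).
q1, _, q3 = statistics.quantiles(scores, n=4, method="exclusive")
iqr = q3 - q1

# z score of the largest observation.
z_max = (max(scores) - mean) / sd

print(f"mean={mean:.3f} median={median} range={value_range}")
print(f"IQR={iqr} SD={sd:.3f} z(max)={z_max:.2f}")
```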
 

Suppose that the above data set is a random sample from a target population and the population mean is unknown.

1. Estimate Unknown Population Mean

Point Estimation

The single best guess for the unknown population mean is the sample mean. For our example, the sample mean equals 21.1333. Thus, a point estimate of the population mean would be ______.

2. Theoretical Sampling Distribution Of the Mean

If you take all possible samples of size 30 and compute their means, a theoretical sampling distribution of means can be formed.

Sampling Variability

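The standard deviation of the sampling distribution of the mean is called the standard error of the mean. It can be computed as

$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$$

where $\sigma$ is the population standard deviation and $n$ is the sample size.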

3. Interval Estimation

It is unlikely that the sample mean exactly equals the population value. Based on the sample mean and the standard error of the mean, we can calculate a range of values that will include the population mean with a specified level of confidence.

Computational Procedures

Compute a 95% confidence interval step by step.
 

(1) Estimated Standard Error of the Mean

a. The population standard deviation is unknown. The sample standard deviation (s) is 7.3284.

b. The formula used to compute the estimated standard error of the mean is

$$s_{\bar{X}} = \frac{s}{\sqrt{n}}$$

Thus,

$$s_{\bar{X}} = \frac{7.3284}{\sqrt{30}} = 1.338$$

(2) Construct a 95% confidence interval for the mean.

a. The normal distribution or the t-distribution

Since the population standard deviation is unknown, we will use the t distribution. 

The advantage of the t distribution over the z ratio is the increased precision of the probability estimates for small sample sizes. The disadvantages are that the number of cases must be replaced by the degrees of freedom and that there is no single t-distribution but an infinite number of t-distributions, one for each number of degrees of freedom.

Degrees of freedom: 30 - 1 = 29.  

Theoretically, the normal distribution and the t-distribution are identical only when the degrees of freedom are infinite. In practice, you may see for yourself that for all but very small sample sizes, the differences between the normal and t-distributions are not substantial.

b. Critical values for middle 95% of the area

The middle 95% of the area under the t distribution with 29 degrees of freedom is between plus and minus 2.045 standard errors of the mean.

c. Construct a distance

-2.045 standard errors: (-2.045)(1.338) = -2.736

+2.045 standard errors: (+2.045)(1.338) = +2.736

d. The center of the confidence interval is the sample mean. Extend a distance equal to 2.736 in both directions. Compute the lower limit and the upper limit.

21.133 - (2.045)(1.338) = 18.40

21.133 + (2.045)(1.338) = 23.87
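
The computational steps for the confidence interval can be sketched in Python as follows. This is only an illustration: the critical value 2.045 is copied from the text (t distribution, 29 degrees of freedom) rather than computed, and the variable names are our own.

```python
import math
import statistics

# Final exam scores from Example 1 (n = 30).
scores = [25, 35, 21, 14, 33, 22, 12, 19, 17, 24, 11, 31, 24, 27, 26,
          28, 16, 15, 8, 27, 20, 23, 21, 6, 10, 30, 19, 25, 23, 22]

n = len(scores)
mean = statistics.mean(scores)
s = statistics.stdev(scores)          # sample standard deviation

se = s / math.sqrt(n)                 # estimated standard error of the mean
t_crit = 2.045                        # 95% two-tailed critical value, df = 29 (from the text)

margin = t_crit * se                  # distance extended on each side of the mean
lower, upper = mean - margin, mean + margin

print(f"SE = {se:.3f}")
print(f"95% CI = ({lower:.2f}, {upper:.2f})")
```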
 

(3) Interpretation

a. If we take another sample of size 30 from the same population and compute the mean, the sample mean is likely to be different. Consequently, the constructed confidence interval will be different, too.

b. If all possible samples of size 30 are taken from the same population and a 95% confidence interval is calculated for each sample mean, approximately 95% of all possible confidence intervals would include the population mean.
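
The coverage interpretation can be illustrated with a small simulation. We cannot enumerate all possible samples, so this sketch draws 2,000 random samples of size 30 from a hypothetical normal population (mean 20, standard deviation 7; both values are assumptions chosen only for illustration) and counts how often the 95% interval contains the population mean.

```python
import math
import random
import statistics

random.seed(1)                     # fixed seed so the run is repeatable

MU, SIGMA = 20.0, 7.0              # hypothetical population parameters
N = 30                             # sample size, as in Example 1
T_CRIT = 2.045                     # 95% critical value for df = 29 (from the text)
TRIALS = 2000

def interval_covers_mu():
    """Draw one sample and report whether its 95% CI contains MU."""
    sample = [random.gauss(MU, SIGMA) for _ in range(N)]
    m = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(N)
    return m - T_CRIT * se <= MU <= m + T_CRIT * se

coverage = sum(interval_covers_mu() for _ in range(TRIALS)) / TRIALS
print(f"Proportion of intervals containing the population mean: {coverage:.3f}")
```

The proportion should be close to, but not exactly, .95, because only a finite number of samples is drawn.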

(4) The Width of A Confidence Interval

Wide confidence intervals suggest that the estimation of the population mean is relatively imprecise.

a. Sample Size

The confidence interval becomes wider as the sample size decreases (e.g., from 30 to 10).

Why?

Examine the formula. As the sample size decreases, the standard error increases.

Consequently, the confidence interval is wider.

21.133 - (2.045)(increase) = ?

21.133 + (2.045)(increase) = ?

b. Confidence Level

The confidence interval becomes wider as the confidence level increases (e.g., from 95% to 99%).

Why?

The critical values for a 99% confidence interval are larger than the critical values for a 95% confidence interval. Consequently, the confidence interval is wider.

21.133 - (increase)(1.338) = ?

21.133 + (increase)(1.338) = ?
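
Both effects on the width can be checked numerically. The sketch below holds the sample standard deviation fixed at 7.3284 (an assumption made only for the comparison; a real sample of 10 would have its own standard deviation) and uses critical values taken from a standard t table.

```python
import math

s = 7.3284  # sample SD from Example 1, held fixed for the comparison

def ci_width(s, n, t_crit):
    """Full width of the confidence interval: 2 * t * s / sqrt(n)."""
    return 2 * t_crit * s / math.sqrt(n)

# Two-tailed critical values from a standard t table.
w_95_n30 = ci_width(s, 30, 2.045)   # 95%, df = 29
w_95_n10 = ci_width(s, 10, 2.262)   # 95%, df = 9  (smaller sample)
w_99_n30 = ci_width(s, 30, 2.756)   # 99%, df = 29 (higher confidence level)

print(f"95%, n = 30: {w_95_n30:.2f}")
print(f"95%, n = 10: {w_95_n10:.2f}")
print(f"99%, n = 30: {w_99_n30:.2f}")
```

Note that a smaller sample widens the interval twice over: the standard error increases and the critical value increases as well.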
 

A value of zero for the skewness indicates a symmetric distribution. A value of zero for the kurtosis indicates a shape (peakedness) close to normal.  

1. Skewness = -.255 and Standard Error = .427.

The sign of the skewness is negative. The distribution is negatively skewed. Recall that the mean is slightly less than the median.

Is the distribution severely skewed? It is not severely skewed. The value of the skewness is close to zero. We will visualize the distribution by plotting a histogram and a boxplot later.

2. Kurtosis = -.421 and Standard Error = .833.

The sign of the kurtosis is negative. The distribution is platykurtic. However, it is not severe. The value of the kurtosis is close to zero.  We will visualize the distribution by plotting a histogram and a boxplot later.
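
The skewness and kurtosis coefficients and their standard errors can be reproduced by hand. The sketch below assumes the usual small-sample-adjusted formulas; they reproduce the printout values for these data, but the function name and layout are our own.

```python
import math
import statistics

# Final exam scores from Example 1 (n = 30).
scores = [25, 35, 21, 14, 33, 22, 12, 19, 17, 24, 11, 31, 24, 27, 26,
          28, 16, 15, 8, 27, 20, 23, 21, 6, 10, 30, 19, 25, 23, 22]

def skewness_kurtosis(data):
    """Return (skewness, SE of skewness, kurtosis, SE of kurtosis)."""
    n = len(data)
    m = statistics.mean(data)
    s = statistics.stdev(data)                      # n - 1 denominator
    z3 = sum(((x - m) / s) ** 3 for x in data)      # sum of cubed z scores
    z4 = sum(((x - m) / s) ** 4 for x in data)      # sum of z scores to the 4th power
    skew = n / ((n - 1) * (n - 2)) * z3
    kurt = (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * z4
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))
    se_skew = math.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
    se_kurt = 2 * se_skew * math.sqrt((n * n - 1) / ((n - 3) * (n + 5)))
    return skew, se_skew, kurt, se_kurt

skew, se_skew, kurt, se_kurt = skewness_kurtosis(scores)
print(f"Skewness = {skew:.3f} (SE = {se_skew:.3f})")
print(f"Kurtosis = {kurt:.3f} (SE = {se_kurt:.3f})")
```

The standard errors depend only on the sample size, which is why any sample of 30 cases has the same values (.427 and .833).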

 

B. Examine the histogram (frequency distributions).

It is necessary to split the data into intervals (or classes) prior to plotting. Note that the numbers below the bars indicate the middle value of each interval. Each bar represents the number of cases falling within each interval.

Examine the shape. The distribution is slightly skewed, but the skew is not severe.

 

C. Examine the stem-and-leaf plot

The stem-and-leaf plot represents cases with numeric values. Divide the values into two parts: the leading digit, called the stem, and the trailing digit, called the leaf. (Refer to SPSS User's Guide)

For example, 

Actual Value   Stem   Leaf   Marker                                 Stem & Leaf
6              0      6      Leaf between 5-9, indicated by "."     0.6
20             2      0      Leaf between 0-4, indicated by "*"     2*0
35             3      5      Leaf between 5-9, indicated by "."     3.5

Examine the shape of the stem-and-leaf plot.

Compare it with the histogram. They are similar.

What is the advantage of using the stem-and-leaf plot?

(The stem-and-leaf plot provides more information on the actual values than does the histogram.)
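
To see how the stems and leaves are formed, the display can be built by hand. This is a minimal sketch for two-digit scores only; the exact layout of the SPSS plot (frequency column, stem width) is not reproduced here.

```python
from collections import defaultdict

# Final exam scores from Example 1 (n = 30).
scores = [25, 35, 21, 14, 33, 22, 12, 19, 17, 24, 11, 31, 24, 27, 26,
          28, 16, 15, 8, 27, 20, 23, 21, 6, 10, 30, 19, 25, 23, 22]

rows = defaultdict(list)
for x in sorted(scores):
    stem, leaf = divmod(x, 10)               # leading digit, trailing digit
    marker = "*" if leaf <= 4 else "."       # "*" for leaves 0-4, "." for leaves 5-9
    rows[(stem, marker)].append(str(leaf))

# "*" sorts before "." in ASCII, so each stem's 0-4 row precedes its 5-9 row.
lines = [f"{stem}{marker} {''.join(leaves)}"
         for (stem, marker), leaves in sorted(rows.items())]
print("\n".join(lines))
```

Each row keeps the actual leaf digits, so the original scores can be read back from the display.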

 

D. Examine the boxplot:

Divide an ordered data set into four quarters. A boxplot plots the 25th percentile, the median (the 50th percentile), the 75th percentile, and outlying or extreme values. (Refer to SPSS User's Guide)

The length of the box represents the difference between the 25th and 75th percentiles. The larger the box, the greater the spread of the data. 

The horizontal line inside the box represents the median.

If the median is not in the center of the box, the distribution may be skewed. Note that the distribution was slightly skewed.

Whiskers. Lines are drawn from both ends of the box to the largest and smallest values that are not outliers. These lines are called whiskers.

Outliers and Extremes. Case numbers are used to label outliers (o) and extremes (*). Outliers are cases with values between 1.5 and 3 box-lengths from the 75th percentile or the 25th percentile. Extreme values are cases with values more than 3 box-lengths from the 75th percentile or the 25th percentile. Note that the above boxplot did not detect any outliers or extremes.
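
The 1.5 and 3 box-length rules can be applied directly. The following sketch uses the 25th and 75th percentiles as described above, computed with Python's `statistics.quantiles`; SPSS may compute the box edges slightly differently, so values near a boundary could be classified differently. The function name `classify` is our own.

```python
import statistics

def classify(data):
    """Return (outliers, extremes) using the 1.5 and 3 box-length rules."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="exclusive")
    box = q3 - q1                            # box length (interquartile range)
    outliers, extremes = [], []
    for x in data:
        dist = max(q1 - x, x - q3)           # distance beyond the nearer box edge
        if dist > 3 * box:
            extremes.append(x)
        elif dist > 1.5 * box:
            outliers.append(x)
    return outliers, extremes

# Final exam scores from Example 1 (n = 30).
scores = [25, 35, 21, 14, 33, 22, 12, 19, 17, 24, 11, 31, 24, 27, 26,
          28, 16, 15, 8, 27, 20, 23, 21, 6, 10, 30, 19, 25, 23, 22]

print(classify(scores))                      # no outliers or extremes expected
print(classify(scores + [50, 70]))           # two hypothetical unusual scores added
```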

What might cause the outliers?

They may be due to recoding errors, due to the sample being drawn from a skewed population distribution or not being drawn from the same population, or simply due to the small sample size. Once you know the reason, you may take an appropriate action to correct the problem.


Readings: Box-and-Whisker Plots by Oswego City School District

 


 

Checking Assumptions

Statistical tests generally make assumptions about the distribution of the target population. For example, the t test and ANOVA assume that the data are sampled from one or more normal distributions and that the variances of the different populations are the same. If the assumptions are violated, the test results may not be valid.


Example 2. Open an existing SPSS data file -- Employee data. Apply the Explore procedure to examine whether the variable Beginning Salary (salbegin) is normally distributed.

A. From the SPSS Data Editor menus choose: File / Open / Data. Select Employee data.sav.

B. Choose the Explore procedure.

From the menus choose: Analyze / Descriptive Statistics / Explore

a. Click `Beginning Salary` and the > pushbutton to move it to the Dependent List.

b. In the Display area, click on the Plots radio button.


Boxplots and Stem-and-leaf will be displayed by default.

c. Click the Plots pushbutton.

The Explore: Plots dialog box will appear. To produce a normal probability plot and a detrended normal plot, click the check box next to Normality plots with tests. Click Continue. Click OK.

 

SPSS Output

Checking Normality

A. Tests of Normality

The hypothesis of normality was rejected, p < .001.


B. Normal Q-Q Plot: Observed Value on the X axis vs. Expected Value from the Normal Distribution on the Y axis

What to look for

If the sample is from a normal population, the data points should fall on a straight line.

 

Is the sample from a normal population? (No.)


C. Detrended Normal Plot: Are the deviations from a straight line randomly distributed around 0?

What to look for

If the sample is from a normal population, the data points are expected to cluster around a horizontal line through 0. Also, there should be no pattern.

  

Are the deviations from a straight line randomly distributed around 0? (No.)
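
The quantities behind the two normality plots can be sketched as follows. The Employee data are not reproduced here, so the exam scores from Example 1 are used instead. The plotting position (r - 1/2) / n is an assumption borrowed from the Rankit formula; SPSS may use a different proportion estimate for these plots.

```python
import statistics
from statistics import NormalDist

# Final exam scores from Example 1 (n = 30).
scores = [25, 35, 21, 14, 33, 22, 12, 19, 17, 24, 11, 31, 24, 27, 26,
          28, 16, 15, 8, 27, 20, 23, 21, 6, 10, 30, 19, 25, 23, 22]

n = len(scores)
m = statistics.mean(scores)
s = statistics.stdev(scores)
std_normal = NormalDist()                    # mean 0, standard deviation 1

points = []
for rank, x in enumerate(sorted(scores), start=1):
    p = (rank - 0.5) / n                     # assumed plotting position
    expected_z = std_normal.inv_cdf(p)       # expected normal value (Q-Q plot)
    observed_z = (x - m) / s                 # standardized observed value
    deviation = observed_z - expected_z      # plotted in the detrended plot
    points.append((x, expected_z, deviation))

for x, expected_z, deviation in points[:3]:
    print(f"observed={x:2d} expected z={expected_z:6.3f} deviation={deviation:6.3f}")
```

For a sample from a normal population, the deviations should be small and show no systematic pattern.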

 

D. Remedy

Ask Why

When an assumption is violated, the correct course of action is to find a reason why it happened. Once you know the reason, the further course of analysis becomes obvious.  


Normalize the Skewed Distribution

If the coefficient of skewness is not statistically significant, the departure of the distribution from normality can be attributed to random factors, and the distribution can be normalized. In general, area transformations normalize data better than other methods such as the square root transformation, the logarithmic transformation, or the often-used arcsine transformation.

If the coefficient of skewness is statistically significant, other avenues leading to normality should be explored. Normalizing markedly skewed distributions may obscure factors making the distribution skewed to begin with. These factors may, in some cases, be of crucial importance.

Rank the Data

If the distribution does not appear to be normal and the sample size is small, we may consider statistical procedures that do not require the assumption of normality (distribution-free or nonparametric tests) and transform the interval or ratio data to ordinal data. The scores will be ranked from the smallest to the largest value.
 

The Mann-Whitney test (Wilcoxon rank-sum test) is a nonparametric analog of the independent-samples t test. However, if the distribution is normal, the t test will have more power than the corresponding nonparametric test.
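
The ranking step and the Mann-Whitney U statistic can be sketched as follows. The two groups are hypothetical scores invented for illustration, tied values are assumed to share the mean of their ranks, and no p value is computed.

```python
def average_ranks(values):
    """Rank from smallest to largest; tied values share the mean rank."""
    return [sum(v < x for v in values) + (sum(v == x for v in values) + 1) / 2
            for x in values]

def mann_whitney_u(group_a, group_b):
    """Return the smaller of the two U statistics."""
    ranks = average_ranks(list(group_a) + list(group_b))   # rank the pooled scores
    n_a, n_b = len(group_a), len(group_b)
    rank_sum_a = sum(ranks[:n_a])                          # sum of ranks in group A
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    u_b = n_a * n_b - u_a
    return min(u_a, u_b)

# Hypothetical scores for two independent groups.
group_a = [12, 15, 11, 18, 14]
group_b = [22, 19, 25, 15, 27]

print(average_ranks(group_a + group_b))
print(mann_whitney_u(group_a, group_b))
```

A small U means that the ranks of one group are concentrated at one end of the pooled ordering.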
 


 

Example 3. Normalize the distribution of the variable "Beginning Salary".

In general, area transformations normalize data better than other methods such as the square root, logarithmic, or arcsine transformations.

 

A. From the SPSS Data Editor menus choose: File / Open / Data. Select Employee data.sav.

B. First, examine the distribution of the variable "Beginning Salary".

Choose Analyze / Descriptive Statistics / Explore. Click the Reset button to clear the previous selections. Select the variable "Beginning Salary": Click Beginning Salary from the variable list and click the right arrow button to move it to the Dependent List. Next, click the Statistics button. Select Outliers as shown below. 

 

 

Click Continue. Then click the Plots button.

 

 

Select Histogram and Normality plots with tests as shown below.

 

 

Click Continue and OK. The output window will appear.


SPSS Output

1. Skewness and Kurtosis

A value of zero for the skewness indicates a symmetric distribution. A value of zero for the kurtosis indicates a shape (peakedness) close to normal.  

Skewness = 2.853. The distribution is positively skewed. We will visualize the distribution by plotting a histogram and a normal Q-Q plot.

Kurtosis = 12.390. The distribution is leptokurtic. Note that the coefficient of kurtosis is very large. We will visualize the distribution by plotting a histogram and a normal Q-Q plot.

The five largest and five smallest values are listed below.

What might cause the outliers?

Note that the outliers may be due to recoding errors, due to the sample being from a skewed population distribution or not being from the same population, or simply due to the small sample size.  Once you know the reason, you may take an appropriate action to correct the problem.

If you cannot determine the cause, you may report two analyses (one with the outlying cases included and the other with the outlying cases deleted). If you decide the cases with unusual values should remain, you may need to transform the data to reduce the impact of extreme values.
 


2. Tests of Normality: The hypothesis of normality was rejected, p < .001. 



 

Note that when the sample size is large, even small departures from normality will be statistically significant. You should also examine the actual departure from normality by plotting histograms and the normal plots as shown below.

3. Examine the distribution of the variable "Beginning Salary".

 

 

The distribution of the variable "Beginning Salary" was positively skewed.

4. Normal Q-Q Plots: Observed Value on the X axis vs. Expected Value from the Normal Distribution on the Y axis

The normal Q-Q plots showed that the values deviated from the straight line.

 

 

To normalize the positively skewed distribution, a square root transformation is used.

 

Square Root Transformations

Switch to the SPSS Data Editor Window: Window / SPSS Data Editor.

From the menus choose: Transform / Compute.

To create a new variable, sqrt, type the new variable name in the Target Variable textbox. Next, scroll through a list of functions. Click the SQRT function and click the up arrow button as shown below.



 

Then, click Beginning Salary from the list of variable names. Click the right arrow button as shown below. 

 

 

The numeric expression will look like this:

 

 

Finally, click the OK button. The new variable "sqrt" will appear in the Data Editor window.
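
The effect of the Compute step can be sketched outside SPSS. The salaries below are hypothetical, positively skewed values invented for illustration (they are not taken from Employee data.sav), and the skewness formula is the small-sample-adjusted coefficient assumed earlier in this lesson.

```python
import math
import statistics

def skewness(data):
    """Small-sample-adjusted skewness coefficient."""
    n = len(data)
    m = statistics.mean(data)
    s = statistics.stdev(data)
    return n / ((n - 1) * (n - 2)) * sum(((x - m) / s) ** 3 for x in data)

# Hypothetical positively skewed salaries (not the Employee data values).
salaries = [15000, 15750, 16500, 17250, 18000, 18750, 19500,
            21000, 24000, 27000, 31500, 45000, 80000]

# Equivalent of the numeric expression SQRT(salbegin).
sqrt_salaries = [math.sqrt(x) for x in salaries]

skew_raw = skewness(salaries)
skew_sqrt = skewness(sqrt_salaries)
print(f"Skewness before: {skew_raw:.3f}")
print(f"Skewness after:  {skew_sqrt:.3f}")
```

The square root pulls in large values more than small ones, so the positive skew is reduced, although it need not be removed.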
 

Descriptive Statistics

Choose Analyze / Descriptive Statistics / Explore. First, move the variable Beginning Salary back to the variable list by clicking on it and clicking the left arrow button.

 

 

Next, scroll down the variable list and select the variable "sqrt".

 

 

Click the right arrow button to move it to the Dependent list. Click OK. 

 

Results

Examine the skewness and the kurtosis. The values of skewness and kurtosis should be closer to zero after a successful transformation.

Skewness = 1.926

Kurtosis = 4.944

Tests of Normality:  The hypothesis of normality was still rejected, p < .001. The square root transformation was not successful.

 


 

Examine the histogram and the normal Q-Q plot. The square root transformation was not successful.

 

 

In general, area transformations normalize data better than other methods such as the square root transformation.

 

Area Transformations

Switch to the SPSS Data Editor Window: Window / SPSS Data Editor. From the menus choose: Transform / Rank Cases. Select the variable "Beginning Salary". Next, click on the Rank Types button. Rank is checked by default. Click on the More>> button. 

(1) Choose Proportion estimates and Normal scores.

(2) Proportion Estimate Formula. Choose Rankit as shown below.

Note that Rankit uses the formula (r - 1/2) / w, where w is the number of observations and r is the rank.

Click Continue. Click OK.  

Switch to the Data Editor window. Three new variables, rsalbegi, psalbegi, and nsalbegi, will be added. The new variable rsalbegi contains the ranks. The new variable psalbegi contains the estimate of the cumulative proportion (area) of the distribution corresponding to a particular rank. The variable nsalbegi contains the z scores from the standard normal distribution that correspond to the estimated cumulative proportion.
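
The three new variables can be reproduced step by step. The sketch below applies the Rankit formula (r - 1/2) / w to five hypothetical salaries; tied values are assumed to receive the mean of their ranks, and the function name is our own.

```python
from statistics import NormalDist

def area_transform(values):
    """Return (ranks, proportion estimates, normal scores) using Rankit."""
    w = len(values)
    # Rank from smallest to largest; ties share the mean rank (assumption).
    ranks = [sum(v < x for v in values) + (sum(v == x for v in values) + 1) / 2
             for x in values]
    proportions = [(r - 0.5) / w for r in ranks]               # cumulative area
    normal_scores = [NormalDist().inv_cdf(p) for p in proportions]  # matching z scores
    return ranks, proportions, normal_scores

# Hypothetical salaries (not the Employee data values).
salaries = [15000, 18000, 21000, 27000, 80000]

ranks, proportions, normal_scores = area_transform(salaries)
for x, r, p, z in zip(salaries, ranks, proportions, normal_scores):
    print(f"{x:6d}  rank={r:.1f}  proportion={p:.2f}  normal score={z:6.3f}")
```

Because the normal scores depend only on the ranks, the extreme salary of 80000 is no farther from the center than the smallest salary. This is why the transformation removes skewness, and also why it can hide the factors that produced the skew.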

Descriptive Statistics 

Choose Analyze / Descriptive Statistics / Explore. Move the variable "sqrt" back to the variable list. Select the NORMAL of SALBEGIN (nsalbegi) variable to be analyzed. Click OK.

 

 

The skewness and kurtosis values were much closer to zero.

Skewness = .026

Kurtosis = -.089

Tests of Normality

According to the Shapiro-Wilk test, the hypothesis of normality was not rejected, p > .05.

 

The histogram is much closer to normal.

The normal Q-Q plot showed that the values fall on the straight line.

 

The area transformation was successful. However, normalizing markedly skewed distributions may obscure factors making the distribution skewed to begin with. These factors may, in some cases, be of crucial importance.

 


 

Example 4. Do groups come from normal populations with the same variance?

Use the Employee data. Select the variable 'Beginning Salary'. Plot the values of spread (variability) and level (mean beginning salary) for each employment category.

A. Apply the Explore procedure to produce a spread-versus-level plot.

From the menus choose: Analyze / Descriptive Statistics / Explore. Click the Reset button to clear all the previous selections.

a. Click `Beginning Salary` and the > pushbutton to move it to the Dependent List.

b. Click `Employment Category` and the > pushbutton to move it to the Factor List.

c. In the Display area, click on the Plots radio button. Boxplots and Stem-and-leaf will be displayed by default.

d. Click the Plots button. The Explore: Plots dialog box will appear.

(a) Click Histogram. To produce a normal probability plot and a detrended normal plot, click the check box next to Normality plots with tests.

(b) In the Spread vs. Level with Levene Test area, select Untransformed.

Click Continue. Click OK.

 

SPSS Output

 

Test of Homogeneity of Variance

The null hypothesis that the group variances are equal was rejected, p < .05. The assumption of homogeneity of variance was not met.
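
The Levene statistic itself is simple to compute. The sketch below uses the version based on absolute deviations from each group mean and two small hypothetical groups; it returns only the statistic W, which is compared with an F distribution with k - 1 and N - k degrees of freedom, and does not compute the p value.

```python
import statistics

def levene_w(groups):
    """Levene's W based on absolute deviations from each group mean."""
    # Absolute deviation of every score from its own group mean.
    z = [[abs(x - statistics.mean(g)) for x in g] for g in groups]
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(zi) for zi in z) / n_total
    between = sum(len(zi) * (statistics.mean(zi) - grand_mean) ** 2 for zi in z)
    within = sum((v - statistics.mean(zi)) ** 2 for zi in z for v in zi)
    return (n_total - k) / (k - 1) * between / within

# Hypothetical salaries in thousands (not the Employee data values).
clerical = [27, 28, 29, 30, 31]
manager = [40, 55, 70, 85, 100]

w = levene_w([clerical, manager])
print(f"Levene W = {w:.2f}")
```

Groups with the same spread give W = 0 regardless of their means, because the test compares the deviations, not the levels.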

 

Boxplot

 

Variability

The length of the box represents the difference between the 25th and 75th percentiles. The larger the box, the greater the spread of the data. Note that the manager group had the largest variability. Also, note that outliers and extremes were detected. Case numbers are used to label outliers (o) and extremes (*). 

 

Check Homogeneity of variance

What to look for

Is there a relationship between the group mean or level of the employment category and their associated variability (spread)?

There is a relationship between the group mean (levels) and their associated variability (spread). Note that the highest group mean is associated with the largest variability.

Do groups come from normal populations with the same variance? (No.)  


Reading