LESSON FOUR

DESCRIPTIVE STATISTICS

In this lesson, we will learn how to obtain descriptive statistics and produce a boxplot and an error bar chart.

Example 1. Seven subjects have taken two tests, test X and test Y. Their raw scores are as follows:

 

 Obtain the mean, unbiased standard deviation, and true variance for each test.

SPSS for Windows

A. Open a new data editor window.

B. Enter data

        

 C. Choose the Descriptives procedure

From the menus choose: Analyze / Descriptive Statistics / Descriptives.

To learn more about the Descriptives procedure, click the Help button as shown above. The help topic will appear.

Close the help topic window when you are done.

Next, select the variables to be analyzed, X and Y: move X and Y to the Variable(s) box, then click OK.

 D. How to list cases

 From the menus choose: Analyze / Reports / Case summaries

The variables in your data file will be displayed in the source list. Select the variables for which you want case listings. Click OK. The results will appear in the Viewer window.

E. Print the case listing: File / Print. Click on OK.

 

SPSS Printout

 A. For Test X

          a. What are the mean and unbiased standard deviation for test X?  (Mx = 3, SD = 1.291)

          b. True and Unbiased Variances

(a) If the researcher considers the subjects to be a random sample from a target population (e.g., all the fourth graders in a school district), the unbiased variance is the appropriate estimate of the variance of that larger population. It divides the sum of squared deviations from the mean by n - 1. Notice that SPSS computes only unbiased variances and unbiased standard deviations by default.

(b) If the researcher only wants to describe the set of scores at hand, the true variance is the appropriate formula to use. It divides the sum of squared deviations from the mean by n, so it can be obtained from the SPSS output by multiplying the unbiased variance by (n - 1)/n.
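
The difference between the two variances can be checked with a short Python sketch that uses only the standard library. The seven scores below are hypothetical: they were chosen because they reproduce the mean (3) and unbiased standard deviation (1.291) reported for test X, and they are not necessarily the raw scores used in this lesson.

```python
import statistics

# Hypothetical scores (an assumption, not the lesson's raw data).
scores = [1, 2, 3, 3, 3, 4, 5]

mean = statistics.mean(scores)
unbiased_variance = statistics.variance(scores)  # divides by n - 1
unbiased_sd = statistics.stdev(scores)           # square root of the above
true_variance = statistics.pvariance(scores)     # divides by n

print(f"mean              = {mean}")
print(f"unbiased variance = {unbiased_variance:.3f}")
print(f"unbiased SD       = {unbiased_sd:.3f}")
print(f"true variance     = {true_variance:.3f}")
```

For these scores the true variance is smaller than the unbiased variance by the factor (n - 1)/n = 6/7.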

 

B. For Test Y  

What are the mean and unbiased standard deviation for test Y? (My = 3, SD = 1.414)

C. Which distribution, X or Y, is more variable? Y. Compare the (unbiased) standard deviations: 1.414 > 1.291.

 

Example 2. Produce a boxplot that shows the median and variability of the current salary in each job category.

Open the data file.

Open an existing data file -- Employee data. From the menus of the Data Editor window choose: File / Open / Data. Click Employee data in the list, then click Open.

What is a boxplot? 

From the menus, choose Help / Topics. Type boxplots in the text box. Click Display. Select Box Plots, defined. Click Display. The definition will be displayed.

Create a boxplot.

From the menus choose: Graphs / Boxplot.

Type of the boxplot: Simple. It is the default.

Data in Chart Are: Summaries for groups of cases

Click Define.

Variable: Select Current Salary [salary]

Category Axis: Select Employment Category [jobcat]. Click OK.

Boxplot

The boxplot provides a vertical view of the data. A boxplot plots the 25th percentile, the median (the 50th percentile), the 75th percentile, and outlying or extreme values. 



 

Variability

The length of the box represents the difference between the 25th and 75th percentiles. From the length of the box, you can determine the variability. The larger the box, the greater the spread of the data. Which group has the largest variability? (Manager)

Central Tendency

The horizontal line inside the box represents the median. If the median is not in the center of the box, the distribution may be skewed. 

Whiskers

Whiskers are the lines drawn from the ends of the box to the largest and smallest values that are not outliers (refer to the SPSS User's Guide).

Identify Outliers and Extremes

Case numbers are used to label outliers (o) and extremes (*). The boxplot shown above detected both outliers and extremes. Outliers are cases with values between 1.5 and 3 box-lengths from the 75th percentile or the 25th percentile. Extremes are cases with values more than 3 box-lengths from the 75th percentile or the 25th percentile (refer to the SPSS User's Guide).
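
The 1.5 and 3 box-length rules can be illustrated with a minimal Python sketch. The function name classify_boxplot_points and the salary figures are hypothetical, and the percentiles come from Python's statistics.quantiles with the "inclusive" method. SPSS may use a different percentile rule for the box edges, so values close to a cutoff could be labeled differently there.

```python
import statistics

def classify_boxplot_points(values):
    """Label each value as 'inside', 'outlier', or 'extreme'."""
    # 25th percentile, median, and 75th percentile (assumed percentile rule).
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    box_length = q3 - q1
    labels = []
    for value in values:
        # Distance beyond the nearer edge of the box (0 if within the box).
        distance = max(q1 - value, value - q3, 0)
        if distance > 3 * box_length:
            labels.append((value, "extreme"))
        elif distance > 1.5 * box_length:
            labels.append((value, "outlier"))
        else:
            labels.append((value, "inside"))
    return labels

# Hypothetical salaries in thousands of dollars (not from Employee data).
salaries = [20, 22, 24, 26, 28, 30, 32, 34, 36, 60, 120]
flagged = [(value, label)
           for value, label in classify_boxplot_points(salaries)
           if label != "inside"]
print(flagged)
```

With these numbers the box runs from 25 to 35, so the box length is 10. The value 60 lies 25 above the box (between 1.5 and 3 box-lengths) and 120 lies 85 above it (more than 3 box-lengths).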

What might cause the outliers? Once you know the reason, you may correct the problem.

Possible reasons include:

(1) Recording errors.
(2) The sample was drawn from a skewed population distribution.
(3) The sample was not drawn from a single population.
(4) A small sample size.

   Optional Reading: Outliers and Data Screening by Mike Wulder

 

Example 3. Produce an error bar chart that shows the mean and variability of the current salary in each job category.

Use the same data file -- Employee data. From the menus choose: Graphs / Error Bar

Type of the error bar: Simple. It is the default.

Data in Chart Are: Summaries for groups of cases

Click Define

Variable: Select Current Salary [salary]

Category Axis: Select Employment Category [jobcat].

Bars Represent: Click the down arrow to open the list, scroll down if necessary, and select Standard deviation. The Multiplier box then becomes available. The default is 2 (two standard deviations on either side of the mean). Click OK.

This chart shows the mean of each group, with bars that stretch two standard deviations on either side of the mean.
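
The limits of each error bar are simply the mean minus and plus the multiplier times the standard deviation. The following Python sketch computes them for three groups. The helper error_bar_limits and the salary values (in thousands of dollars) are hypothetical, and the sketch assumes the unbiased (n - 1) standard deviation is the one plotted.

```python
import statistics

def error_bar_limits(values, multiplier=2):
    """Return (lower limit, mean, upper limit) for one group."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # unbiased (n - 1) standard deviation
    return mean - multiplier * sd, mean, mean + multiplier * sd

# Hypothetical salaries per job category (not from Employee data).
salaries_by_category = {
    "Clerical": [24, 26, 28, 30, 32],
    "Custodial": [29, 30, 31, 30, 30],
    "Manager": [40, 55, 65, 80, 110],
}

for category, salaries in salaries_by_category.items():
    lower, mean, upper = error_bar_limits(salaries)
    print(f"{category}: mean = {mean:.1f}, bar from {lower:.1f} to {upper:.1f}")
```

A longer bar indicates a larger standard deviation, which is how the chart lets you compare variability across groups at a glance.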

The current salary in the manager category has the largest variability and the highest mean. The current salary in the custodial category has the least variability.

 

Example 4. Produce a clustered boxplot of beginning salaries for males and females across three job categories.

Use the same data file -- Employee data. From the menus choose: Graphs / Boxplot.

Type of the boxplot: Click the picture next to Clustered.

Data in Chart Are: Summaries for groups of cases

Click Define

Variable: Select Beginning Salary [salbegin].

Category Axis: Select Employment Category [jobcat].

Define Clusters by: Select Gender [gender]. Click OK.

Compare the beginning salaries for males and females across three job categories. What do you observe? 
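
A clustered boxplot draws one box for every combination of category and cluster variable. The grouping behind it can be sketched in Python as follows. The records and the function summarize_clusters are hypothetical illustrations, not the contents of the Employee data file, and only the count and median of each cluster are computed.

```python
import statistics
from collections import defaultdict

# Hypothetical (job category, gender, beginning salary) records.
records = [
    ("Clerical", "Female", 12000),
    ("Clerical", "Female", 13500),
    ("Clerical", "Female", 15000),
    ("Clerical", "Male", 12500),
    ("Clerical", "Male", 14000),
    ("Clerical", "Male", 14500),
    ("Manager", "Female", 27000),
    ("Manager", "Female", 31000),
    ("Manager", "Male", 25000),
    ("Manager", "Male", 33000),
]

def summarize_clusters(rows):
    """Return {(jobcat, gender): (count, median)} for each cluster."""
    clusters = defaultdict(list)
    for jobcat, gender, salbegin in rows:
        clusters[(jobcat, gender)].append(salbegin)
    return {key: (len(values), statistics.median(values))
            for key, values in sorted(clusters.items())}

summary = summarize_clusters(records)
for (jobcat, gender), (count, median) in summary.items():
    print(f"{jobcat:<10} {gender:<7} n = {count}  median = {median}")
```

Each key in the result corresponds to one box in the clustered chart, so the printed medians are the values you would compare across the paired boxes.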

What to look for  (Refer to SPSS for Windows Base System User's Guide)

Boundaries of the box: The 25th percentile and the 75th percentile. The larger the box, the greater the spread of the data. 

The horizontal line inside the box represents the median.

The median is defined as the point below which fifty percent of the cases fall. If the median is not in the center of the box, the distribution may be skewed.

a. If the median is closer to the top of the box, the distribution may be negatively skewed. Why? The scores tend more toward the higher end of the distribution.

b. If the median is closer to the bottom of the box, the distribution may be positively skewed. Why? The scores tend more toward the lower end of the distribution.
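
The two rules above can be expressed as a comparison of the parts of the box above and below the median. The Python sketch below does this. The function skew_hint and the data are hypothetical, the percentiles use the "inclusive" method of statistics.quantiles (an assumption that may differ from SPSS), and the result is only a hint, not a test of skewness.

```python
import statistics

def skew_hint(values):
    """Suggest a skew direction from where the median sits in the box."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    upper_part = q3 - median  # box length above the median
    lower_part = median - q1  # box length below the median
    if upper_part > lower_part:
        # Median is closer to the bottom of the box.
        return "may be positively skewed"
    if upper_part < lower_part:
        # Median is closer to the top of the box.
        return "may be negatively skewed"
    return "no skew suggested by the box"

# Hypothetical scores for illustration.
print(skew_hint([1, 2, 2, 3, 3, 4, 6, 9, 15]))
print(skew_hint([1, 7, 10, 12, 13, 13, 14, 14, 15]))
print(skew_hint([1, 2, 3, 4, 5, 6, 7, 8, 9]))
```

The whisker lengths carry additional information about the tails that this sketch does not use, so read the hint together with the rest of the plot.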
 

Whiskers: The lines drawn from the boundaries of the box to the largest and smallest values that are not outliers. If the whiskers are of unequal length, the distribution may be skewed.

Outliers and Extremes: Case numbers are used to label outliers and extremes. Outliers (o) are cases with values between 1.5 and 3 box-lengths from the boundaries of the box. Extremes (*) are cases with values more than 3 box-lengths from the boundaries of the box.

What might cause the outliers?

They may be due to recording errors, due to the sample being drawn from a skewed population distribution or being drawn from different populations, or simply due to the small sample size. Once you know the reason, you may correct the problem.

         Optional Reading: Outliers and Data Screening by Mike Wulder