Introduction to Data Analysis

 

There are three basic methods of data collection: counting, ranking, and measuring. Historically, number was first associated with counting, and later with length. Ranking can be viewed as a transition from the concept of a number as a countable entity to a concept of numbers as located along a continuum where each number has its predecessor and successor. Ranking procedures are based on observations such as 'less than,' 'greater than,' and 'equal to.' Measuring is closely associated with the concept of the number as a length. Related to these classifications of numbers is a classification of data into nominal, ordinal, interval, and ratio categories.

Statistical analysis of data is concerned with procedures applied after the observations are made, experiments performed, and data recorded; it pertains to methods for obtaining information from data matrices. To conduct an experiment or a quantitative study, you must eventually create a data matrix, enter it into a computer, and analyze it. A data matrix is a rectangular arrangement of numbers symbolizing properties of phenomena under scrutiny.

These numbers are located at the intersections of the data matrix's rows and columns. In most instances, the only characteristic of the numbers in a data matrix that is apparent to a casual observer is that they differ; numbers that are not all the same are said to vary. The single columns of data matrices are called variables or, more generally, data vectors. A variable is a symbol that takes on different values. By convention, obtained scores or raw scores are written in upper case letters (e.g., X, Y, or Z). Because the numbers in the data set are not the same, they carry information about the phenomena they describe.
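The idea of a data matrix whose columns are variables can be sketched in a few lines of Python. This is only an illustration outside the Visual Statistics Studio: the numbers and the names data_matrix, X, and Y are hypothetical and chosen for the example.

```python
# Hypothetical data matrix: each row is an entity, each column an attribute.
# The values are invented for illustration only.
data_matrix = [
    [1, 10],
    [2, 6],
    [3, 8],
]

# A variable is a single column of the matrix (a data vector).
X = [row[0] for row in data_matrix]
Y = [row[1] for row in data_matrix]

n_rows = len(data_matrix)      # number of entities
n_cols = len(data_matrix[0])   # number of attributes (variables)

print("X =", X)
print("Y =", Y)
print("rows:", n_rows, "columns:", n_cols)
```

Each value sits at the intersection of one row and one column, so data_matrix[1][0] is the score of the second entity on the first variable.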

Statistical methods can be classified according to the number of variables to be jointly analyzed. Univariate methods describe variables in a data matrix, one at a time. Bivariate methods describe relationships between pairs of variables. Multivariate methods describe relationships and structures among groups of variables.

Scalars-Vectors-Matrices Hierarchy of Data

Advanced data analysis is best conceptualized in matrix algebra terms. However, the basic units of most methods of data analysis are variables, the column vectors of data matrices. The scalars-vectors-matrices hierarchy of data is reflected in the design of the Visual Statistics Studio.

Marginal Referents of a Data Matrix

Marginal referents of a data matrix index the rows and columns found within the data matrix. Some authors prefer to call the marginal referents of data matrices entities and attributes.

However, from the standpoint of formal data analysis, the exact character of the marginal referents is immaterial. Statistical data analysis is not restricted by discipline, or by the character of measured entities or attributes. 

Five Modules

Relationships among properties of attributes and entities of concrete or abstract phenomena can be described on several levels. On one level, these properties can be described by algebraic formulae and operated upon by algebraic algorithms. On another level, these properties can be visualized by the graphs of analytic geometry. However, underlying these and other levels are the relationships defined in terms of the propositional calculus of formal logic.
Thus, the main components of the Visual Statistics Studio are the Logical, Scalar, Vector, Matrix, and Data Visualization modules (Graphs, Stereographs, Stereoimages, and Stereoviews).

Statistical Data Analysis

Measurement is the process of assigning, by rule, a numerical description to an observation of some attribute of an object, person, or event. There is a dictum that 'anything which exists, exists in some amount, and therefore can be measured.' Statistical data analysis attempts to translate measurements into statistics that can be interpreted and structures that can be described. The Visual Statistics Studio provides a collection of statistical procedures listed below. Following data analysis, a story can be told and a paper written that is closer to reality and more believable.

Analysis I

Analysis II

Analysis III

 

Getting Visual Statistics Studio up and Running

Visual Statistics Studio is designed for Microsoft Windows XP and Vista. After installing the program, choose (Start, All Programs, Cruise Scientific Visual Statistics Studio). The welcome splash window will inform you of the dimensions of the Vector module. Click OK to close the welcome splash window.


Initially, the Visual Statistics Studio will open with a preloaded variable X in the Vector Display window. Note that the Vector Display uses a non-scrollable canvas window with data and other information painted on it. In the case of a larger data matrix, a scrollable spreadsheet can be opened for viewing. Also, the dimensions of the Vector Display can be adjusted to fit your data set. This is especially important when there is a large discrepancy between the number of rows and the number of columns, as in the case of many simulation studies. The maximum dimensions the program can handle depend on the amount of memory your computer has. However, the larger the dimensions, the slower the performance. By default, the number of cases (n), the arithmetic mean (M), and the true variance (s2) of each variable will be displayed toward the bottom.

 Why are these descriptors important?

Task A

Suppose the above variable X represents the scores earned by five subjects on a quiz. Your task is to compute the arithmetic mean for this data set. After completing the task, you will know how to access a list of descriptors, how to transfer these descriptors between the Vector module and the Scalar module, and how to produce descriptive statistics for a data set.

1. The Arithmetic Mean

Description of a variable usually begins with the specification of its single most representative value, often called the measure of location, or central tendency. The arithmetic mean is a measure of central tendency commonly referred to as an average. It can be defined as the sum of all scores, divided by the number of scores.

2. Add a new Descriptor

The Greek capital letter sigma, ∑, signifies summation. It tells you to sum all values of the variable. The sum for the variable X is 15 (1+2+3+4+5 = 15). To add a new descriptor, the sum, click on any existing descriptor (n, M, or s2). A list of descriptors will be displayed. Select the box next to Sum and click the Accept button. The new descriptor will be added to the Vector Display.

 

To define different colors for numbers and descriptors on the Vector Display, select (Colors, Define Display Colors) from the top menu. Choose your favorite colors from the list and click the Accept button.

 

3. Transfer between the Modules

We will transfer the descriptors, sum and n, to the Scalar module and compute the mean.

(1) Before the descriptors can be transferred to the Scalar module, we need to launch the Scalar module first. The main menu at the top of the Vector Display window looks like this:

Select (Transfers, Launch Scalar Module). Note that the Scalar module has nine available memory cells for us to use.

If necessary, move the Scalar Module window to a preferred position by dragging the Scalar Module bar at the bottom.

(2) Next, select (Transfers, Descriptors to Scalars). Note that the information button located at the lower left corner of each submenu provides instructions on how to use the menu. Click "Information" to display the instructions. Click it again to close the information panel. Finally, choose Sum and Number of Cases and click the Accept button.

 

(3) Select the variable X and click Accept. The descriptors of the variable X will be transferred to scalar cell 1 and cell 2, respectively. Click on the Format Cells button to format the numbers to 3 decimal places.

(4) Drag and drop the values from the Scalar Memory Cells to the Scalar Calculator display window. To compute the mean, drag 15.000 (the sum) from memory cell 2 to the calculator display window, click / (divide), drag 5 (n), and click the Equals button. The mean is 3. Close the Scalar module window.
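The same calculation can be checked outside the program. The following is a minimal Python sketch, not part of the Visual Statistics Studio, that reproduces the Sum and n descriptors for the variable X and divides one by the other. The variable names are illustrative.

```python
# Variable X from Task A: the quiz scores of five subjects.
X = [1, 2, 3, 4, 5]

total = sum(X)     # the Sum descriptor
n = len(X)         # the number of cases, n
mean = total / n   # the arithmetic mean, M

# Format to 3 decimal places, as in the Scalar module cells.
print(f"{total:.3f} / {n} = {mean:.3f}")  # prints "15.000 / 5 = 3.000"
```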

4. Produce Descriptive Statistics Output

To obtain a list of the data values and their associated descriptive statistics, choose (Analysis I, Descriptive Statistics). Select the variable X and click Accept. Next, select n, Mean, Minimum, Maximum, Range, and List Data. Click Enter.


The output window will appear. Note that the range is the difference between the maximum and the minimum values in a data set (5 - 1 = 4). You can print or save the result. You may also copy the result and paste it to any word processor.
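For comparison, the descriptive statistics selected above can be computed by hand. The sketch below is plain Python and only mirrors the names of the selected options; it does not reproduce the layout of the program's output window.

```python
# Variable X from Task A.
X = [1, 2, 3, 4, 5]

descriptives = {
    "n": len(X),
    "Mean": sum(X) / len(X),
    "Minimum": min(X),
    "Maximum": max(X),
    "Range": max(X) - min(X),  # maximum minus minimum: 5 - 1 = 4
}

print("Data:", X)
for name, value in descriptives.items():
    print(f"{name}: {value}")
```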

 

5. Open the Notebook

For your convenience, the Visual Statistics Studio provides a word processor called Notebook. First, highlight the content to be copied and select (Edit, Copy) from the output window menu as shown above. Next, close the output window, since you have copied the content. Third, open the Notebook by choosing (Instruments, Notebook) from the Statistics Studio top menu bar. Finally, choose (Edit, Paste) from the Notebook menu to paste the descriptive statistics into the Notebook.

To avoid too many open windows on your screen, minimize the Notebook window by clicking its underscore button (_).

Task B

The annual number of movies produced in a European country from 1910 to 1968 was recorded. A line chart is often used to visualize a trend in data over a long period of time. Your task is to produce a line chart with the year on the horizontal axis and the number of movies on the vertical axis. After completing the task, you will know how to start a new project, how to open a Vector module viewer, and how to create a graph.

1. Open an Existing Project

To start a new project, choose (Projects, New Project) from the top menu. To open an existing project, select (Projects, Open Project Files). Next, select Longitudinal Studies, 901 Movies (1910 - 1968).longitudinal, click Open and Replace. The data set will appear on the Vector Display.

 

Note that there are 59 cases (n = 59). However, only the first 6 cases (1910 to 1915) are visible on the Vector Display.

2. Open a Vector Module Viewer

To view the large data set in a scrollable window, select Vector from the Modules bar.

First, resize the View Vector module window to your desired size by dragging the borders of the window. Next, drag the scroll box downward to examine the data set. Note that the number of movies dropped from 39 to 5 during World War II (1939-1945). When you are done, close the scrollable window.

3. The Graphs Menu

(1) Choose Graphs from the Modules bar menu as shown below.

Select Line Graphs under the Graphs Indexed by Attributes category. There are two variables, YEAR and MOVIES, on the Vector Display. We will create a line chart with the year on the horizontal axis and the number of movies on the vertical axis. To plot graphs indexed by attributes, the variable defining the horizontal axis (YEAR) must be located to the left of the variable defining the vertical axis (MOVIES). Otherwise, the axes will be reversed.

 

(2) The horizontal axis of a graph is known as the abscissa and the vertical axis of a graph is known as the ordinate. Select the variable YEAR as the abscissa and the variable MOVIES as the ordinate. Click Accept.

(3) The line graph is a two-dimensional graph, but the chart opens in 3D by default. Click the 3D/2D icon on the tool bar to switch to the 2D view.

Extend the horizontal axis to the proper length by dragging the right side border of the chart window. The line graph would look like this.

To add a scroll bar to the chart window, choose (View, Scroll Bars) from the top menu. Now you can scroll through the records. Note that movie production decreased dramatically during World War II (1939-1945). Click (View, Scroll Bars) again to uncheck this option and return to the regular chart window.

(4) Enter the Chart Title and Label the Axes.

First, minimize the Visual Statistics Studio window by clicking its underscore button (_). Next, move the Chart window to the preferred position by dragging its top blue bar. To set the Chart Properties, select the Properties icon from the tool bar as shown below.

The Chart Properties dialog box will open. It has four tabs: General, Series, Axes, and 3D. The General tab is shown first. Click inside the Title textbox and type “Annual Movie Production, 1910-1968” as shown below.

Next, label the Y-axis as Number of Movies. To do this, select the Axes tab. Note that the “Y Axis” has been selected by default. Click on the Details... button located at the bottom right corner.

The “Y-Axis Properties” dialog box will appear. Click the Labels tab. Click inside the Title textbox and type “Number of Movies”. Click Apply and OK to return to the Chart Properties dialog box. Last, label the X-axis as Years. To do this, click on the Axes tab, click on the down arrow button, and select “X Axis”. Next, click the Details… button and the Labels tab. Type “Years” in the Title textbox. Click Apply and OK to return to the Chart Properties dialog box. Finally, click OK to end the task. The resulting graph would look like this.

You may save the file, print the chart, or copy the chart to clipboard and transfer the chart to any word processor. The corresponding icons on the tool bar are shown below.

To copy the chart, click the Copy to Clipboard icon and select As a Bitmap.

Statistical analysis often begins with description of typical values of variables, their means, medians, and modes, also called the measures of central tendency. The arithmetic mean is a measure of central tendency commonly referred to as an 'average.' The median was discussed by Gauss in 1816 and the idea was elaborated by Fechner in 1878. Fechner called the median Centralwerth, the central value of an ordered series of scores, symbolized by the letter C. The mode, Dichtestewerth, was defined as the locus of a distribution where it is densest, symbolized by the letter D.

Learning Statistics by Doing Statistics

Task C

Professor Stanley administered a statistics test to seven students in his class. The students earned the following scores: 10, 6, 6, 6, 8, 9, 7. Our first task is to find the mean, the median, and the mode.

1. Data Entry

Start a new project by choosing (Projects, New Project). Next, choose (Data, Enter) to bring up the Data Entry window. Click on the default letter A and label the first variable as X. Press the Enter key. The cursor will be advanced to the first data cell.

Enter the following seven scores (10, 6, 6, 6, 8, 9, 7). Remember to press the Enter key following each data entry. Note that the cursor should be on row 8, marked with the pencil tip.

Click the Accept button and the data set will be transferred to the Vector Display. By default, the associated descriptors will be displayed toward the bottom. Note that there are 7 cases (n = 7). However, only six cases are visible. Click on Expand to show all 7 cases.

Drag the top blue bar to move the Vector Display window to a preferred position.

 

2. Name the Project

Choose (Projects, Project Name). Type Measures of Central Tendency in the textbox. Click Accept.

3. The Median and the Mode

A distribution means the arrangement of any set of scores in order of magnitude. First, arrange the scores in order from the smallest to the largest value. Select (Modify, Sort). Click Sort Order and the Sort in Ascending Order option will appear. Select the variable X and click Append. The ordered series of scores will be added, and the new variable is automatically labeled SortAsc.

Note that this variable has an odd number of cases (n = 7) and that the median represents the midpoint of a distribution. Thus, the median (C) is the fourth of the seven ordered scores, which is 7. Half of the other scores are below it and half are above it. Also, notice that the score of 6 occurs three times. It is the most frequently occurring score in the distribution of the variable. The mode, D, is 6.
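As a cross-check outside the program, the sorting step and the resulting median and mode can be sketched with Python's standard statistics module. The name sort_asc simply echoes the SortAsc label; nothing here is part of the Visual Statistics Studio.

```python
import statistics

# Test scores of the seven students in Task C.
X = [10, 6, 6, 6, 8, 9, 7]

# Arrange the scores in ascending order to form the distribution.
sort_asc = sorted(X)

median = statistics.median(sort_asc)  # middle score of an odd-length series
mode = statistics.mode(sort_asc)      # most frequently occurring score

print("SortAsc:", sort_asc)
print("Median (C):", median)
print("Mode (D):", mode)
```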

4. The Mean and the Median

Select Descriptors: Click on any descriptors on the Vector Display to bring up optional descriptors. Select Sum, Median, and click Accept.

    

The mean (M) can be defined as the sum (∑) of all scores, divided by the number of cases (n). Thus, M = 52/7 = 7.429. Note that the mean is pulled above the median by the three highest scores (8, 9, and 10).
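The arithmetic above, and the comparison of the mean with the median, can be verified with a short Python sketch. It is an independent check, not output from the program.

```python
import statistics

X = [10, 6, 6, 6, 8, 9, 7]

total = sum(X)   # sum of all scores
n = len(X)       # number of cases
mean = total / n
median = statistics.median(X)

print(f"M = {total}/{n} = {mean:.3f}")
print("Mean exceeds median:", mean > median)
```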

5. Create a Frequency Table

The frequency of a given score is the number of times the score occurs. A frequency table can be constructed by listing scores in ascending order with their corresponding frequencies. Choose (Frequencies, Frequencies) from the top menu. Select the variable SortAsc and click Accept. Examine the frequency table. The lowest value is 6 and the highest value is 10. Note that three students earned a score of 6, one student earned a score of 7, one student earned a score of 8, and so on. The mode is 6.
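A frequency table of this kind can be sketched in Python with collections.Counter. The layout below is illustrative and does not reproduce the program's table.

```python
from collections import Counter

X = [10, 6, 6, 6, 8, 9, 7]

# Count how many times each score occurs, then list scores in ascending order.
freq = Counter(X)
table = sorted(freq.items())

print("Score  Frequency")
for score, count in table:
    print(f"{score:>5}  {count:>9}")

# The mode is the score with the highest frequency.
mode = max(freq, key=freq.get)
print("Mode:", mode)
```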

  

6. Rename a Variable Using a Shortcut

Click on any variable name on the Vector Display to bring up the Specify Column Names dialog box. Erase the variable name argFreq and rename it Test Scores. Click Accept.

7. Visualize a Frequency Distribution

Create a line chart with the test scores plotted on the horizontal axis and the frequency plotted on the vertical axis. These two variables, Test Scores and Frequency, are required to plot the line graph. Select Graphs from the Modules bar. Click Line Graphs under the Graphs Indexed by Attributes category. Abscissa: select the variable Test Scores. Ordinate: select the variable Frequency. Click Accept. Change the default 3D graph to a 2D graph by clicking on the 3D/2D icon. The resulting line graph would look like this.

Note that the chart has a single peak. The distribution is not symmetric. The tail is toward larger values. The distribution is skewed to the right. It has a positive skew. In general, the mean will be higher than the median when a distribution has a positive skew.
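The direction of the skew can also be checked numerically. The sketch below uses the moment coefficient of skewness (third central moment divided by the second central moment raised to the power 1.5). This particular coefficient is an assumption made for the illustration; the text does not say which skewness measure, if any, the program reports.

```python
X = [10, 6, 6, 6, 8, 9, 7]

n = len(X)
mean = sum(X) / n

# Second and third central moments (dividing by n).
m2 = sum((x - mean) ** 2 for x in X) / n
m3 = sum((x - mean) ** 3 for x in X) / n

# Moment coefficient of skewness: positive means a tail toward larger values.
skewness = m3 / m2 ** 1.5

print(f"skewness = {skewness:.3f}")
print("Positive skew:", skewness > 0)
```

A positive value agrees with the chart: the tail of the distribution extends toward the larger scores.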

8. Label the Axes and the Chart Using Shortcuts

To label the X axis, right click a data value on the X-axis to bring up a shortcut menu. Select Edit title and type Test Scores in the textbox. Click anywhere outside the textbox to exit.

To label the Y axis, right click a data value on the Y-axis to bring up a shortcut menu. Select Edit title and type Frequency in the textbox. To label the chart, right click the Chart area (the gray area outside the plot) to bring up a shortcut menu and select Edit title. Type Distribution of Test Scores in the textbox. You may copy and paste the chart to a word processor. Close the Chart window.

9. Produce Descriptive Statistics and Frequency Tables

Choose (Analysis I, Descriptive Statistics). Select the variable X. Select n, Median, Mean, List Data, and click Enter. The data set and the associated descriptive statistics will be displayed in the output window. To obtain the frequency table, choose (Frequencies, List Frequencies). Select the variable X and click Accept. The frequency table will appear, along with other related tables (cumulative frequencies and proportions). Copy and paste the results to a word processor.

 

Cumulative Frequency: The cumulative frequency is the running total of frequencies. For example, the cumulative frequency for the score of 7 is 3 + 1 = 4. The cumulative frequency for the score of 9 is 3 + 1 + 1 + 1 = 6.

Relative Frequency: Frequency counts can be expressed as proportions or percentages. For example, three students earned a score of 6. The total number of students is 7. Thus, 3 / 7 = .43 = 43%. About 43 percent of the students scored 6 on the test.

Cumulative Proportions: For our example, approximately 86% of the students had a score of 9 or less.  
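The three related tables can be reproduced with a short Python sketch that builds cumulative frequencies as a running total and divides by n to obtain proportions. The column layout is illustrative only.

```python
from collections import Counter
from itertools import accumulate

X = [10, 6, 6, 6, 8, 9, 7]
n = len(X)

table = sorted(Counter(X).items())
scores = [score for score, _ in table]
counts = [count for _, count in table]

# Cumulative frequency is the running total of the frequencies.
cum_counts = list(accumulate(counts))

# Relative frequency and cumulative proportion.
proportions = [count / n for count in counts]
cum_proportions = [cum / n for cum in cum_counts]

print("Score  Freq  CumFreq  Prop  CumProp")
for row in zip(scores, counts, cum_counts, proportions, cum_proportions):
    score, count, cum, prop, cum_prop = row
    print(f"{score:>5}  {count:>4}  {cum:>7}  {prop:.2f}  {cum_prop:.2f}")
```

The row for the score of 9 shows a cumulative frequency of 6 and a cumulative proportion of about .86, matching the figures in the text.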

10. Compute the Median for Even Number of Cases

The median is computed differently for odd and even numbers of cases. If the number of scores in the distribution is even, the median is the middle value interpolated between the two adjacent scores at the theoretical midpoint of the distribution. This interpolation is usually accomplished by averaging the two adjacent scores.

To create a set with an even number of scores, use the Truncate command to shorten the length of the variable X.

(1) First, delete the variables on the Vector Display except the variable X. The three variables to be cleared are SortAsc, argFreq, and Frequency. Choose (Data, Delete). Click the variable SortAsc and hold down the Shift key while pressing the Down Arrow key twice to highlight the variables we want to remove. Release the Shift key and click Clear and Compact or Clear Selected Variables as shown below. 

 

 


(2) Truncate the variable X: Adjust the length of the variable X to six cases. Choose (Reshape, Truncate). Select the variable X. Define Length: 6. Click Accept.

(3) Sort Data: A distribution means the arrangement of any set of scores in order of magnitude. To form a distribution of the scores, sort the scores in ascending order. Choose (Modify, Sort). Click Sorting Order to select Sort in Ascending Order. Select the variable X and click Append.

(4) Compute the median for the even set of numbers. Examine the ordered series of scores. There are six scores (n = 6). The two middle values are 6 and 8. To find the point halfway between them, add them together and divide by 2. C = (6 + 8) / 2 = 7.
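Steps (2) through (4) can be sketched in Python as a check. The sketch assumes that Truncate keeps the first six cases of X, which is consistent with the middle values of 6 and 8 reported above; the program's exact behavior is not documented here.

```python
import statistics

X = [10, 6, 6, 6, 8, 9, 7]

# Assumption: truncating to length 6 keeps the first six cases.
truncated = X[:6]

# Sort in ascending order to form the distribution.
ordered = sorted(truncated)

# With an even number of cases, average the two middle scores.
mid = len(ordered) // 2
lower, upper = ordered[mid - 1], ordered[mid]
median = (lower + upper) / 2

print("Ordered scores:", ordered)
print(f"C = ({lower} + {upper}) / 2 = {median}")

# The standard library applies the same rule for even-length data.
assert median == statistics.median(ordered)
```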

Task D

Throw a coin 15 times. How many heads are you likely to see? To summarize and describe the result of your experiment, create a frequency table and a bar chart.

1. Animated Coin

Start a new project. Choose (Animations, Animated Coin) to start tossing a coin. Click the End button to terminate the experiment when the number of throws is 15.

The result will appear on the Vector Display. There are two possible outcomes for each throw. Heads are coded as 1. Tails are coded as 0. Because the throws are random, your result will probably differ from the one shown below.
 

 

A binary variable takes on only the values 0 and 1. The variable Throw is a binary variable. Count the number of zeros. Count the number of ones. Choose (Frequencies, Frequencies) and select the variable Throw to obtain a frequency table.

Frequency counts can be expressed as proportions or percentages. Next, choose (Frequencies, Proportions) and select the variable Throw.
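The coin experiment can also be imitated in Python. The sketch below is an independent simulation, not the Animated Coin itself: it draws 15 pseudo-random throws coded 1 for heads and 0 for tails, then tabulates frequencies and proportions. The seed value is arbitrary and is fixed only so that repeated runs give the same throws.

```python
import random
from collections import Counter

# Arbitrary fixed seed so the simulated experiment is repeatable.
rng = random.Random(15)

# 15 throws: 1 = heads, 0 = tails.
throws = [rng.randint(0, 1) for _ in range(15)]

freq = Counter(throws)
n = len(throws)

print("Throw:", throws)
print("Value  Frequency  Proportion")
for value in (0, 1):
    count = freq[value]
    print(f"{value:>5}  {count:>9}  {count / n:>10.1%}")
```

Removing the seed (random.Random() with no argument) gives a different sequence on each run, which is closer to tossing the animated coin.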

Create a bar chart with the number of heads on the horizontal axis and the percentage frequency on the vertical axis. Choose Graphs. Select Bar Graphs under the Graphs Indexed by Attributes category. Abscissa: arg Prop. Ordinate: Prop. Right-click a data value on the Y-axis to bring up a shortcut menu. Choose Properties. Select the Scale tab, click the drop-down arrow next to Format, and select Percentage. Finally, label the chart and the axes as shown below.