Telling Stories with your Data

Telling Stories with your Data - Graphs, Tables and Basic, Basic Statistics with SAS Enterprise Guide ®

AnnMaria DeMars, The Julia Group & 7 Generation Games, Santa Monica, CA

ABSTRACT

You’ve spent a lot of time and effort on design and data collection; now what’s next? There is a lot of talk about BIG data, but many of us spend our days with not-very-big-at-all data where a few errant records here and there can throw off results. This presentation takes that first look at your data - exploratory data analysis, with a focus on the basics. Using data from the pilot study for Spirit Lake: The Game, an educational game for students in grades four through six, you’ll see how to use SAS Enterprise Guide to take a first look at your data, filter data sets, and use a super-simple method for getting tables of descriptive statistics. Next, you’ll see how to use SAS Enterprise Guide for a second look, with cross-tabulations, graphics, summary tables and a t-test thrown in for good measure, to answer the questions brought up in your first pass through the data.

INTRODUCTION

There is nothing more fun and exciting than first getting up to your elbows in your data. You’ve spent a lot of time and effort on design and data collection, and now here you are. What’s next? Today’s workshop takes that first look at your data.

A few words of background ... These data came from the pilot study for Spirit Lake: The Game, an educational game for students in grades four through six. It was tested last year with students from six classrooms from two schools. In the elementary school, three fourth-grade classrooms participated as a whole class. At a neighboring middle school, three fifth-grade teachers each sent five students to the computer lab to play the game three times per week, supervised by a lab monitor. The control group was a school on the same American Indian reservation, with one fourth-grade and one fifth-grade classroom.

As with many pilot studies, we basically threw ourselves on the mercy of the schools and begged their permission to do our study. So, we were able to collect data at their convenience. Let’s dive right in with a SAS data set and see what we've got.

EXERCISE 1: READY

Open SAS Enterprise Guide and then open the data set by going to FILE > OPEN > DATA. Then navigate to where the data are located on your computer. You’re now looking at your data set like this:

To make life easier in outputting our data for, say, a paper like this, let’s change the output option to RTF. Go to TOOLS, then OPTIONS.

From the window that pops up, click the tab Results General, and then click RTF under Result Formats. Click OK.

Great. You have data and you are set to have pretty results. What now?

I always recommend doing the Characterize Data task first. Go to TASKS > DESCRIBE > CHARACTERIZE DATA

Just click through the windows and accept all of the defaults. Especially this one (you’ll find out why shortly)

So, now we have results ... and stuff doesn't look like it should ....

On the very first page, we see that there is a teacher named “test.” That was actually the answer key, which we shouldn't have left in there. I left this in here deliberately because the whole POINT of exploratory data analysis is to find out about your data, including problems in it.

We also see that the teacher variable is missing for 57 of the students. That is sort of okay. First, our control group school only had one teacher for each grade, so we didn't need to collect teacher name. Second, in our experimental school, all of the fifth-graders used the game monitored by our site supervisor.

EXERCISE 2: SET

Before we do anything else, it makes sense to filter out the “test” observation. You can either double-click on the data set to open it and then click on the tab that says Filter and Sort, or go to the TASKS menu, select DATA and then FILTER AND SORT. Either one brings up this window. Click on the double arrows in the middle to select all of the variables.

Next ... after you have selected all of the variables, click on the Filter tab at the top.

To select the filter you want to apply, select Username as the variable to filter by, from the first drop down menu. From the second drop down menu, select not equal to.

Then, click on the three dots next to the empty box. That will bring up a window where you can click on Add Values in the lower left. This will bring up all of the possible values. Scroll down and select “TEST”

Click OK and this brings you back to the first Filter Window. Click OK.
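If you think better in code than in dialog boxes, the filter built above amounts to "keep every record whose username is not equal to TEST". Here is a minimal Python sketch of that logic. It is not the code Enterprise Guide generates, and the record layout and the student usernames are made up for illustration; only the TEST value comes from the data set.

```python
# Hypothetical records; only the "TEST" username comes from the paper.
records = [
    {"username": "TEST", "pretotal": 24},
    {"username": "student01", "pretotal": 11},
    {"username": "student02", "pretotal": 15},
]


def drop_answer_key(rows, key_name="TEST"):
    """Keep every row whose username is not equal to key_name."""
    return [row for row in rows if row["username"] != key_name]


filtered = drop_answer_key(records)
print(len(records), "records before,", len(filtered), "after")
```

The original list is left untouched, much as Filter and Sort writes a new data set rather than changing the one you started with.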

EXERCISE 3: GO!

Now that you have the Test subject filtered out of your data set, run the Characterize Data task again.

Don’t overlook what you can learn from descriptive statistics.

Take a look at your data again. We have gender for 67 subjects but grade level (4th or 5th) for 83. Why is that?

Variable   N    NMiss   Total   Min   Mean     Median   Max   StdMean
Gender     67   21      99      1     1.4776   1.0      2     0.06148
Grade      83   5       362     4     4.3614   4.0      5     0.05305

Well, we naively believed that there would be little change over a 10-week period. So, we cleverly thought of splitting the demographic questions between the pre-test and the post-test, reasoning that each sitting would take half as much time, and there would be less fatigue effect from filling out all of these questions. We underestimated the degree of absenteeism, transfers between schools on the reservation and other reasons for kids missing from the sample, including suspension, being called out to the office during the test and, in two cases, "He's just gone".

Why am I telling you this? Because real data analysis is more than pointing and clicking; it’s learning to ask questions of the results you get and to learn from the answers to those questions. Lesson learned: On our Phase II grant, in progress, we are collecting demographic data upfront.
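The Gender and Grade summary rows can be checked by hand. Assuming Gender is coded only 1 or 2 and Grade only 4 or 5 (which is what the Min and Max columns suggest), the N and Total columns imply 35 ones and 32 twos for Gender, and 53 fours and 30 fives for Grade. The Python sketch below rebuilds the columns from those inferred counts, not from the raw data, and recomputes Mean and StdMean, the standard error of the mean. Missing values are represented as None.

```python
import math
import statistics


def describe(values):
    """Return N, NMiss, Total, Mean and StdMean, ignoring missing (None) values."""
    present = [v for v in values if v is not None]
    n = len(present)
    mean = statistics.mean(present)
    # Standard error of the mean: sample standard deviation / sqrt(N).
    std_mean = statistics.stdev(present) / math.sqrt(n)
    return {
        "N": n,
        "NMiss": len(values) - n,
        "Total": sum(present),
        "Mean": round(mean, 4),
        "StdMean": round(std_mean, 5),
    }


# Counts inferred from the N and Total columns; an assumption, not raw data.
gender = [1] * 35 + [2] * 32 + [None] * 21
grade = [4] * 53 + [5] * 30 + [None] * 5

gender_stats = describe(gender)
grade_stats = describe(grade)
print(gender_stats)
print(grade_stats)
```

Under those assumptions the recomputed values agree with the two rows reported by the Characterize Data task.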

Data visualization at its most basic

The Characterize Data task gives you a lot of output - in this case, 92 pages of it. Go back and forth between graphs and tables. Scan the graphs and if something looks funny, go back to the table to inspect it more carefully. I can look at the graphs of pre- and post-test and see that it looks like the distribution shifted a little to the right, but it’s hard to compare because the Y axis isn’t the same for the two charts.

Take note!

Yes, literally. If you’re exploring a large amount of data, you’ll undoubtedly have a number of thoughts like this: “I might want to compare these later using the graph task. I should make a note of that.” One of the most overlooked options of SAS Enterprise Guide is the note feature. Simply go to FILE > NEW NOTE. This will bring up an electronic sticky note, where you can jot down your initial thoughts on the data, to review later. Right-click on the note to bring up a menu that allows you to link it to the procedure that inspired your note.

EXERCISE 4: TABLE ANALYSIS - LOOKING A LITTLE CLOSER

Start with TASKS > DESCRIBE > TABLE ANALYSIS. This window comes up. Make sure that you have the right data set. If it shows the allprepost data set, that’s a mistake.

Click on the EDIT button. Pull down to select the data set FILTER_FOR_ALLPREPOS

Once you’ve selected the correct data set, click OK.

Drag the variables to be used in the table, in this case School and Grade, under the Table variables heading

Click on the next tab, Tables. Only the variables you selected in the previous step will show up in the pane Variables permitted in table. Drag “School” to the top of the table as your column variable and “grade” to the side as your row variable.

Click on the RUN button.

We can see that our control and experimental group schools both have more fourth-graders than fifth-graders, but the experimental school has proportionally more fourth-graders. Now that we've looked at the data by school and grade, let's see if the unequal distribution by grade persists when we control for missing data, whether pretest or post-test. To do this, we'll simply right-click on the Table Analysis icon and select Modify Table Analysis.

Table of grade by School
(each cell shows Frequency, then Col Pct)

grade     CONTROL        EXPERIME       Total
4         15   55.56     38   67.86     53
5         12   44.44     18   32.14     30
Total     27             56             83

Frequency Missing = 5
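The Col Pct figures in the grade by School table are simply each cell divided by its column (school) total. A short Python sketch, using only the frequencies from that table, reproduces them. The function name is my own, not anything Enterprise Guide provides.

```python
# Cell frequencies copied from the Table of grade by School.
counts = {
    "CONTROL": {4: 15, 5: 12},
    "EXPERIME": {4: 38, 5: 18},
}


def column_percents(table):
    """Express each cell as a percentage of its column (school) total."""
    result = {}
    for school, by_grade in table.items():
        total = sum(by_grade.values())
        result[school] = {
            grade: round(100 * n / total, 2) for grade, n in by_grade.items()
        }
    return result


col_pct = column_percents(counts)
for school, pcts in col_pct.items():
    print(school, pcts)
```

Comparing column percentages rather than raw counts is what lets us say the experimental school has proportionally more fourth-graders even though the two schools differ in size.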

To our original analysis, we're going to drag the variable “missdata” under Group Analysis By and click Run.

When prompted whether to replace the previous results, we’re going to click NO.

We can see from these results that we have far more missing data than we would like. It is not simply 21 students: if you look carefully at the bottom of the second table, you'll see that for five students we don't even have the grade or school.

Table of grade by School
(each cell shows Frequency, then Col Pct)

grade     CONTROL        EXPERIME       Total
4         4    80.00     10   62.50     14
5         1    20.00     6    37.50     7
Total     5              16             21

Frequency Missing = 5

Our reaction: SORT – and then DO something

Let's see who they are. A really simple way is to just sort our data set by school and grade and see who is at the top. Right-click on the data set we want, which is the filter_for_allprepost. From the drop down menu select Filter and Sort. As before, click on the double arrows to select all of the variables. Click on the SORT tab, and from the drop down menus select the variables “school” and “grade”. Click OK.
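The reason this trick works is that SAS treats a missing value as smaller than any non-missing value, so in an ascending sort the records with missing school or grade float to the top. Python has no such rule built in, so the sketch below spells it out with a sort key. The four records are hypothetical and None stands in for a SAS missing value.

```python
# Hypothetical records; None stands in for a SAS missing value.
students = [
    {"usernum": 1, "school": "EXPERIME", "grade": 5},
    {"usernum": 2, "school": None, "grade": None},
    {"usernum": 3, "school": "CONTROL", "grade": 4},
    {"usernum": 4, "school": "CONTROL", "grade": None},
]


def missing_first(value):
    """Sort key that places None ahead of every real value."""
    if value is None:
        return (0, 0)
    return (1, value)


sorted_students = sorted(
    students,
    key=lambda s: (missing_first(s["school"]), missing_first(s["grade"])),
)
for row in sorted_students:
    print(row)
```

The record missing both school and grade comes first, followed by the CONTROL record missing only grade, which is exactly where you would look for the students to follow up on.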

When the sort runs, you'll see at the top of your data set that you have five students missing information for grade. To ensure anonymity and protect student data, we never had the students' names; the teachers had a roster of students matched with username. The experimental school was able to fill in the blanks for all students missing grade data, but the control group school was not.

This isn't something we just waved our hands about and moved on. It seriously concerned us. The degree of missing data overall disturbed us, as did the fact that we didn't seem to be able to get follow-up data. It concerned us enough that we hired a data coordinator on each reservation where we are testing in the upcoming year, and we will be analyzing the pretest data as it comes in and trying to update any missing data we can in the same week it is collected. This is the purpose of pilot studies: to find problems and fix them.

EXERCISE 5: ITEM ANALYSIS TWO WAYS WITH THE CHARACTERIZE DATA TASK

What is an item analysis and how is it helpful? There are two types of item analysis.

The first type is an examination of the distribution of responses. The variables q1 - q24 record which choice (“a”, “b”, “c” or “d”) the student selected as the correct answer. You can simply go back to your CHARACTERIZE DATA results and look at the plots. The main thing I’m looking to see is whether one of the distractors gets selected more often than the correct answer. If so, it might indicate a poorly-worded test question, or, if that happened only for a single class, a problem in how the concept was taught.

The second type is item difficulty, done by examining what percentage of students answered each item correctly. Item difficulty analysis is one basic means of establishing test validity. One would expect that items at the second-grade level would have the lowest level of difficulty, being answered correctly by the largest percentage of our students, and at the other end, the items at the fifth-grade level would have the highest difficulty, and be answered correctly by the fewest students. Since the items are scored 0 = wrong, 1 = right, we can use the means to see what percentage of students answered correctly. A summary table can give you a nicely formatted table for a report, but here we're just exploring our data, so using the univariate statistics you already have is easier.
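Why does the mean of a 0/1 item equal the proportion correct? Because adding up the ones counts the right answers, and dividing by N turns that count into a proportion. A tiny Python sketch makes the point. The item names follow the sc naming used in this data set, but the responses themselves are invented for illustration.

```python
import statistics

# Hypothetical scored responses (1 = right, 0 = wrong, None = missing).
scored = {
    "sc2": [1, 1, 1, 0, 1, None],
    "sc5": [0, 1, 0, 0, None, 0],
}


def item_difficulty(responses):
    """Proportion of non-missing responses that are correct."""
    answered = [r for r in responses if r is not None]
    return statistics.mean(answered)


difficulty = {item: item_difficulty(r) for item, r in scored.items()}
for item, p in difficulty.items():
    print(item, p)
```

Note that students with a missing response drop out of the denominator, which is also why N differs between pre-test and post-test items in the real output.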

Item Difficulty Analysis in Six (or fewer) Easy Steps

1. Click on the univariate statistics data set produced by the CHARACTERIZE DATA TASK to select it.

2. From the top menu, select TASKS > DESCRIBE > LIST DATA.

3. From the Variables to assign pane, select the ones you want in your report, in this case Variable, N, NMISS, Mean, Min and Max.

4. Select the records you want in your report. Now this part is a bit confusing because there is a variable named “variable”. Your univariate statistics data set has a column named “variable” and in it is the name of each variable for which you will be listing the N, NMISS, mean, etc. I only want the scored variables in my analysis, where they were scored 0 for incorrect and 1 for correct. Click on EDIT from the button you can't see in the screen shot above because I cut it off, but there really is an edit button, I promise. Exactly as you did in Exercise 2, you’re going to select out the records you don’t want included in your analysis. From the first drop down menu, select Variable, from the next select Not In A List, then click on the three dots to bring up a new window. In that window, click on the bottom left where it says Add Values. Select q1 – q24, gender, missdata, age, pretotal, posttotal and usernum. Click OK.

5. Format the columns in the report. It’s going to be easier for me to read the data without six decimal places or so for the mean, so I change the format by right-clicking on “Mean” and selecting Properties.

I click the CHANGE button next to format

Then I click on Numeric for the format category, and scroll down to w.d. Under attributes, I put 8 for width and 2 for the number of decimal places. Then click OK.

6. Next, just to make the report even easier to read, I click Options and un-check the box next to Row Numbers

Click RUN to run the task
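Steps 4 and 5 above boil down to two ideas: drop every row whose variable name is in an exclusion list, and display the mean in a field eight characters wide with two decimal places. Here is a rough Python sketch of both. The rows imitate the univariate statistics data set, the unrounded means are hypothetical, and the exclusion list leaves out q1 through q24 for brevity.

```python
# Rows shaped like the univariate statistics data set.
# The unrounded Mean values are hypothetical.
stats_rows = [
    {"Variable": "gender", "N": 67, "NMiss": 21, "Mean": 1.477612},
    {"Variable": "postsc2", "N": 68, "NMiss": 20, "Mean": 0.882353},
    {"Variable": "sc3", "N": 82, "NMiss": 6, "Mean": 0.804878},
]

# Part of the "Not In A List" filter from step 4 (q1 - q24 omitted here).
exclude = {"gender", "missdata", "age", "pretotal", "posttotal", "usernum"}

kept = [row for row in stats_rows if row["Variable"] not in exclude]


def format_row(row):
    """Lay out one report line; the mean uses width 8 and 2 decimals."""
    return (
        f'{row["Variable"]:<10}{row["N"]:>4}{row["NMiss"]:>6}{row["Mean"]:8.2f}'
    )


report = [format_row(row) for row in kept]
for line in report:
    print(line)
```

The Python format spec 8.2f plays the same role here as the w.d format with a width of 8 and 2 decimal places: the stored value is unchanged, only its display is rounded.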

You don't need to always export your output files to share them with another program. At this point, I selected all of these data from the output open in Enterprise Guide and copied, and then pasted them into an OpenOffice Calc file (Excel would work just as well).

I sorted them in descending order and here is a partial picture of the result. I also changed the name from “variable” to “item” to make it less confusing. (Also, I wanted an example of an Excel file to import for the next exercise!)

Item       N    NMiss   Mean   Min   Max
postsc2    68   20      0.88   0     1
postsc3    68   20      0.87   0     1
postsc4    68   20      0.81   0     1
sc3        82   6       0.8    0     1
sc2        82   6       0.78   0     1
sc4        82   6       0.78   0     1
postsc18   68   20      0.74   0     1
postsc8    68   20      0.68   0     1
postsc1    68   20      0.65   0     1

It's clear that the post-test and pre-test do not have the same number of people, so I need to be cautious about comparing them directly. However, within-test comparisons are fine. The test items are in order of grade level, beginning with second-grade level through fifth-grade. The first few items should be answered correctly by the most people. We can see that is true both for the post-test and pre-test, although it's not perfect. Three items at the second-grade level were answered correctly by over 80% of the students who took the post-test. We can also see that, generally, a higher percentage of students answered the post-test questions correctly than the pretest, as we would hope.

If you scroll down to the bottom, you'll find that items 5 and 6 have some of the lowest percentage correct of any item. We make another note to examine those items in more detail.

EXERCISE 6: GRAPHING ITEM DIFFICULTY

Go to FILE > IMPORT DATA and select the file saved from the previous exercise. I called it items.xls. I just click NEXT through all of the screens to accept the defaults.

Go to TASKS > GRAPH.

Select horizontal bar chart.

Drag the variable “item” under Column to Chart and drag “mean” under “Sum of”. (If you forget the Sum variable, you'll just get a chart that shows each item occurred in the data set once. Not very helpful.)

Click EDIT to select the items you want, in this case all of the postsc variables.

Click the Layout tab under Appearance and pull down from the menu under Order to select Descending Bar Height

Give a title and footnote

To compare this with the pre-test data, I can just right-click on the bar chart icon in my process flow and select Modify. Three modifications are needed:

1. Click on the EDIT button and change the filter. First click the X at the end of the row to delete the current filter. Then select “item” and “in a list” and the variables sc1 – sc24 for your items to chart.

2. You want the X axis to be the same, from 0 to 1, so you can compare the two charts, so you need to set the X axis. Click on Major Ticks and then under Major Horizontal Ticks click Specify. In the input box on the top right, enter each of the major ticks you want (0, .2, .4 and so on up to 1) and click ADD.

3. Change the chart title.

The output is shown below. The bars in the post-test chart are longer, which is what we would expect, as more students had the right answer on the post-test. Item difficulty was similar between pretest and posttest. That is, the easier items on the pretest also seemed to be the easier items on the posttest.

Item Difficulty Analysis Post-test

Item Difficulty Analysis Pre-Test

EXERCISE 7: GETTING DOWN TO BUSINESS WITH T-TESTS

So far, we've been just getting a feel for the data. We've found one error, that the answer key was left in as a record. We've seen that we have an issue with missing data that needs to be fixed to the extent possible. It appears that the test is reasonably reliable, although, of course, more sophisticated statistics are needed to examine that issue (not discussed in this workshop). We've also realized that we can't really compare the pretest and post-test since we have a large proportion of missing subjects. We need to match pre- and post-test scores.

Next, we're going to open a new project and open a data set named matchedmath. Choose TASKS > ANOVA > T-test. Click the button next to Paired T-test.

Under Paired Variables, drag pretotal and posttotal. Under Group Analysis by, drag school.

Click on the Plots tab and check the box next to Summary plot. Click RUN.

Take a look at the output. The first chart is for the control group. Not only does the table tell us there is no significant difference between pretest and posttest for the control group, but we can also see by looking at the graph that the difference between the scores is clustered around zero.
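If you are curious what the paired t-test is doing behind the button, it works on the difference between each student's two scores: the t statistic is the mean difference divided by its standard error, with N - 1 degrees of freedom. The Python sketch below computes just that. The five pairs of scores are hypothetical, not data from the study, and I subtract pretest from posttest, so a positive value means improvement (the sign flips if you subtract the other way).

```python
import math
import statistics


def paired_t(pre, post):
    """Return (mean difference, t statistic, degrees of freedom)."""
    diffs = [after - before for before, after in zip(pre, post)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    # Standard error of the mean difference.
    std_err = statistics.stdev(diffs) / math.sqrt(n)
    return mean_diff, mean_diff / std_err, n - 1


# Hypothetical matched scores for five students; not data from the study.
pretotal = [10, 12, 9, 14, 11]
posttotal = [13, 14, 10, 17, 12]

mean_diff, t_stat, df = paired_t(pretotal, posttotal)
print(mean_diff, round(t_stat, 3), df)
```

This sketch stops at the t statistic. Getting the p-value requires the t distribution, which the T-test task handles for you.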

If you are not familiar with box plots -- that diamond is the mean, the box is from the 25th to the 75th percentile. The line inside the box is the median. The whiskers, those two lines at either end, extend from the box as far as the extremes of the data up to a maximum of 1.5 times the inter-quartile range. If any observations occur further than that, these would be considered outliers and show up as an asterisk past the end of the whisker.
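To make the 1.5 times the inter-quartile range rule concrete, here is a small Python sketch that finds the box, the whisker ends and any outliers for a list of difference scores. The scores are hypothetical. Be aware that statistical packages use slightly different definitions of a percentile, so the quartiles computed here with Python's "inclusive" method will not necessarily match the box edges SAS draws for the same data.

```python
import statistics


def box_plot_summary(values):
    """Quartiles, whisker ends and outliers using the 1.5 * IQR rule."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    return {
        "mean": statistics.mean(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        # Whiskers stop at the most extreme values still inside the fences.
        "whiskers": (min(inside), max(inside)),
        "outliers": [v for v in values if v < low_fence or v > high_fence],
    }


# Hypothetical post-test minus pre-test differences.
differences = [-2, -1, 0, 0, 1, 1, 2, 3, 12]

summary = box_plot_summary(differences)
print(summary)
```

In this made-up example the box runs from 0 to 2, the whiskers reach -2 and 3, and the single score of 12 falls beyond the upper fence, so it would be plotted on its own past the end of the whisker.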

Now, take a look at the experimental group, in the second chart. It's clear that there is a difference overall. It is statistically significant, and you can see that there are a lot of students whose pretest scores were substantially lower than their post-test scores. There is also an interesting group here who are around zero, though. We should investigate that more. (Actually, we did and we found those were students who used the game less.)

Control Group

It’s also worth pointing out that there are no outliers in either plot. Since we have a relatively small sample size, our results are susceptible to the effect of a few extreme scores. By looking at these charts, we can see that it wasn’t one or two outliers explaining the better performance of the experimental group, but rather, a definite shift of the distribution.

CONCLUSION

Exploratory data analysis is a key first step in any study, even if simply to determine the quality of the data. A few simple tasks in SAS Enterprise Guide can go a surprisingly long way towards answering preliminary questions about test validity, threats to research design validity, such as missing data, pre-existing group differences and presence of outliers. While exploring your data, it’s crucial to note the concerns raised, follow-up questions and policy recommendations that come out of your analysis. These are the valuable skills an analyst needs to learn now that SAS has made the analysis the easy part.

ACKNOWLEDGMENTS

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are registered trademarks or trademarks of their respective companies.

This research was made possible by a Small Business Innovation Research award from the U.S. Department of Agriculture.