structuring data to facilitate analysis jerry j. vaske jay beaman colorado state university warner...
TRANSCRIPT
Structuring Data to Facilitate Analysis
Jerry J. VaskeJay Beaman
Colorado State UniversityWarner College of Natural Resources
Human Dimensions of Natural ResourcesFort Collins, CO
Workshop at the 2008 Pathways to Success Conference:Integrating Human Dimensions into Fish & Wildlife Mgmt.
Workshop Foundation
Workshop Objectives
• Illustrate strategy for:
– Facilitating analysis of 2006 National Survey ofFishing, Hunting, and Wildlife-Associated Recreation (FHWAR)
– Increasing the usability of FHWAR data formanagement, planning & policy
• Compare two types of data structures:
– Flat files
– Relational Entities
Traditional Flat File
Rows = RespondentsColumns = Variables
Flat Files – Journal Article Example
Every journal article has:
• One or more authors
• Title
• Journal name
• Specifics about date of publication:YearVolume numberIssue numberPage numbers
• Potentially keywords
Flat File Data Structure for Journal Articles
Potential Issues with Flat Files
• Problem– Diefenbach et al (2005) article had 7 co-authors– 7 columns (variables) necessary to accommodate
all authors’ last names– 19 of 26 articles in flat file had only 1 or 2 authors– 67% of author fields empty– If first names included – more empty fields
• Solution – Relational database
Relational Databases• Definition
– Set of tables containing data for predefined categories– Data stored in separate files (tables) that are linked
• Terminology– Table = Entity (E)
– Rows (tuples) in table = information about an object(e.g., journal article or respondent)
– Columns (attributes) = variables
– Two types of relations (R)1. Set of tuples – a table with attributes (these R’s store data)2. Algebraic (Person ID in Table A = Person ID in Table B)
(these relations use data stored in entities)
Relational Data Structure for Journal Articles
Article EntityArticle IDJournal IDArticle titleYear, Issue, Pages
Journal EntityJournal IDJournal namePublisher info
Author EntityAuthor IDLast, First nameContact info
Keyword EntityKeyword ID(attitudes, norms)
(R2)Relation
Journal ID
(R3)Relational Table
Article IDKeyword ID
(R1)Relational Table
Article IDAuthor ID
Comparison Flat File vs. Relational Database
Flat file
Relational Database R1 = table multiple authors (AuthorID) linked to given article (ArticleID)
ArticleID AuthorID
2059 314
2059 59
2059 233
Author entity and R1 (author–article relation)can have any number of rowsso all authors of an articlecan be identified
FHWAR Flat File Example
• Fishing, Hunting & Wildlife-Associated Recreation (FHWAR)
• National Survey – Conducted about every 5 years– 1955 – first survey
– 2006 – most recent survey
• Data on hunters, anglers, wildlife watchers:– Sportsperson expenditures
– Species sought in different states
• Data collection costs (1991–2006) $55 million (in 2008 $)
• 1991-2006 data comparable within limits but not integrated
2006 – FHWAR Flat File Data
Data distributed on CD containing 3 ASCII text files:
1. Screening data
2. Sportsperson (hunting & fishing) data
3. Wildlife Watcher data
Data file # of Records # of Variables
Screening 144,509 56
Sportsperson 21,942 3,765
Wildlife Watcher 11,285 772
FHWAR Flat File – Analytical Issues
• Important issues– 4,500+ vars with obtuse variable names
(e.g., NCU_STD1)
– 200 pages of documentation
– Census conversion programs do not create variable labels or value labels
• Major issues– Data compression
– Conceptual complexity
Analytical Issues Affecting Use
• Data compression– No hunters hunt in all 50 states (at most 8 in 2006 data)
– To avoid numerous empty cells data are compressed(e.g., the values for 3 vars are combined into a single var)
– For example:“days” of participation is combined with an “activity” (e.g., big game or small game hunting)
in a given “state” (in the order states are mentioned)
– Compressed vars cannot be directly analyzed by SAS or SPSS • Conceptual complexity
– When uncompressed to blocks of 50 states ≈ 20,000 variables
– Difficult to visualize analysis strategy
– Flat FHWAR files hide data structure
Relational File Structure Illustration
Entity
1. PERSON
2. SPORTSPERSON
3. HUNTING_ACTIVITY
4. TRIP_EXPENDITURES
Based on flat file:
Screening
Sportsperson
Sportsperson
Sportsperson
Four entities ≈ half of the 2006 FHWAR flat file data
PERSON Entity
• 6 control variables (e.g., Person_Weight)
• 10 demographic variables (e.g., Age, Sex)
• 8 hunting variables (e.g., Hunted_2005)
• 8 fishing variable (e.g., Fished_2005)
• 6 residential wildlife watching variables(e.g., Home_Observe_2005)
• 5 non-residential wildlife watching variables(e.g., Trip_Watch_2005)
SPORTSPERSON Entity
• 6 control variables(e.g., Person_ID, Sportsperson_Weight)
• 11 demographic variables (e.g., Age, Sex)
• 15 national summary variables (e.g., Hunted_2006)
PERSON variables in SPORTSPERSON could be “obtained” from PERSON but also included SPORTSPERSON to simplify analyses
TRIP_EXPENDITURES Entity
TRIP_EXPENDITURES entity reduces 844 compressed vars to 10 vars
Person_ID Unique person ID
Sportsperson_Weight Sportsperson weight
Spender Person in TRIP_EXPENDITURES entity
State_of_Residence State of residence
Location_Trip_Spending State of spending
Spend_State_of_Residence Expenditure in state of residence
Fish_Hunt Fishing or hunting expenditure
Fish_Hunt_Type Fishing or hunting type
Trip_Expend_Category Sportsperson expenditure categories
Dollars Amount spent in dollars
HUNTING_ACTIVITY EntityPerson_ID Unique person ID
Sportsperson_Weight Sportsperson weight
Hunter Person in HUNTING_ACTIVITY entity
Table_Cell_Description Description of tables (e.g., relation to FH3 variables)
Sub_Table_ID Sportsperson sub-table identifier
State_of_Residence State of residence (start Wave 3)
In_State_of_Residence Participation in state of residence
Activity_Location Geographic location for activity (USA or a State)
Private_Public Activity of private or any public land
Fish_Hunt_Type Fishing or hunting type
Response_Unit Participation = 1, Days = 2, Trips = 3
Response Participation (1 = Yes), # of days, or # of trips
HUNTING_ACTIVITY entity reduces 840 compressed vars to 12 vars
Variable: Sub_Table_ID
HUNTING_ACTIVITY entityA collection of state-level sub-tables to facilitate analysis
Visualizing the 4 Entities
Person(Screening data)
Control Variables Person ID Person Weight
Demographics (11)
Hunting (8) Hunted Ever Hunt Intentions
Fishing (8)
Wildlife Watching Residential (6) Trips (5)
Sportsperson(Sportsperson data)
Control Variables Person ID Sportsperson Weight
Demographics (11)
National summary “species” variables (15) Hunted 2006 Big game hunted Days big game hunted Trips hunting big game
Hunting Activity(Sportsperson data)
Person IDSportsperson WeightSub Table IDResponse UnitResponse
Trip Expenditure(Sportsperson data)
Person IDSportsperson WeightSpending categoriesDollars
Summary
• About 1,750 flat file variables reduced to < 60
• Obtuse variable names replaced with intuitive names
• Compressed flat file variables cannot be directly used in SPSS or SAS
Variables in relational entities can be used in analysis
• Details in Beaman & Vaske (2008)
Entity Data FilesEntity SAS filename SPSS filename
PERSON Person.sas7bdat Person.sav
SPORTSPERSON Sportsperson.sas7bdat Sportsperson.sav
HUNTING_ACTIVITY Hunting_Activity.sas7bdat Hunting_Activity.sav
TRIP_EXPENDITURES Trip_Expenditures.sas7bdat Trip_Expenditures.sav
(http://welcome.warnercnr.colostate.edu/~jerryv/)
To simplify analyses 2 additional entities:Hunting_Activity_and_DemographicsTrip_Expenditures_and_Demographics
SAS Code & SPSS Syntax
Figure number: SAS code SPSS syntax
Figure 3 Figure_3_Syntax.sps
Figure 4 Figure_4.sas Figure_4_Syntax.sps
Figure 5 Figure_5_Syntax.sps
Figure 6 Figure_6.sas
Figure 7 Figure_7_Syntax.sps
Figure numbers based on Beaman & Vaske (2008)
http://welcome.warnercnr.colostate.edu/~jerryv/
Example – Hypothesis
Average days of elk huntingvaries between Colorado vs. Wyoming
and by hunter’s sex
Flat File to Entity for HypothesisData FHWAR6.Hunt_BGspecies_States ;
Length Person_ID 5 Sportsperson_Weight 4 Sex 3 State_of_Residence Activity_Location Fish_Hunt_Type Response_Unit Response 4 ;
Set FHWAR6.fh3 (rename = (sex = xsex)) ;
Keep Person_ID Sportsperson_Weight Sex State_of_Residence In_State Response Activity_Location Fish_Hunt_Type Response_Unit ; Person_ID = PersonID ; Sportsperson_Weight = spwgt ; Sex = Xsex ; State_of_Residence = put (resstate, $st2num2.) ;
* Array stores info to identify state when decompressing ;Array a1( 2, 8 ) HUNTSTD1-HUNTSTD8 STDAYSHD1-STDAYSHD8 ;
* Array stores info to associate species with variables ;
Array gam1( 9) g1-g9 ;Retain g1 1 g2 2 g3 3 g4 4 g5 5 g6 6 g7 7 g8 40 g9 41 ;Array a7( 2, 9 , 8 ) bgame1d1--bgdifday9d8 ;Do m = 1 To 2 ; Do j=1 To 9 ; Do k=1 To 8 ;If a1( 1, k) = ' ' Then Goto End7 ; Fish_Hunt_Type = gam1(j) ; If m = 1 Then Do ; Response_Unit = 1 ; End ; Else Do ; Response_Unit = 2 ; End ; Response = a7(m, j, k) ; Activity_Location = put(a1( 1, k), $st2num2. ) ; If Activity_Location = State_of_Residence Then In_State = 1 ; Else In_State = 0 ;
* Outputs data for hypothesis;
If Response > 0 Then Output ;End7: End ; End ; End ;run ;
SAS Entity to SPSS Entity
Get SAS Data = ‘C:\Hunt_BGspecies_States.sas7bdat’.
Add Value labels
Save Outfile = ‘C\Hunt_BGspecies_States.sav’.
Testing Hypothesis with Relational EntityGET File = 'C:\Hunt_BGspecies_States.sav'.
WEIGHT BY Sportsperson_Weight.
Select if (Activity_Location = 8 or Activity_Location = 56).
Select if (Fish_Hunt_Type = 2).Select if (Response_Unit = 2).
UNIANOVA Response BY Sex Activity_Location.
Opens data
Weights data
CO huntersWY hunters
Elk huntersDays of participation
ANOVA
GET File = 'C:\FHWAR\Hunting_Activity.sav'.Select if (Sub_Table_ID = 10).
WEIGHT BY Sportsperson_Weight.
Select if (Activity_Location = 8 or Activity_Location = 56).
Select if (Fish_Hunt_Type = 2).Select if (Response_Unit = 2).
UNIANOVA Response BY Sex Activity_Location.
Results
Conclusions
• Analyses that are difficult to perform with flat file data are possible with relational structure
• Restructuring all of 2006 FHWAR data as well as data from 1991, 1996, & 2001 would:
– Yield similar analysis capabilities
– Allow for trend analysis
– New practical opportunities for state agencies
Practical Opportunity• State agencies have accurate records of license
sales (e.g., hunting only, fishing only, combos)
• With potentially 100s of licenses, permits, & stamps sold, not practical to ask about specific licenses in a flat file
• Moving to relational structure for obtaining license data has advantages …
Advantages of Relational License Data
1. Can ask about actual state license salesAll state license info can be “pre-stored” in one entitySize of entity would not impact other data entities
2. Questions about specific license cost not necessary; correct information pre-stored
3. Establishing relationship between state specific license sales & FHWAR dataprovides foundation for benchmarking / calibratingmeaningful estimates based on FHWAR
From Analysis to Data Collection
• Entity based models:– facilitate analyses– can also enhance data collection
• Currently working with software company Techneos (www.techenos.com) toimplement pilot models that yield:– more consistent and – accurate data collection
Questions?