exploring data (2008) - mdthinducollege.org · psychology): exploring data ... were some missing...
TRANSCRIPT
C8057(ResearchMethodsinPsychology):ExploringData
©Dr.AndyField,2008 Page1
ExploringData
AssumptionsofParametricData
If you use a parametric test and certain assumptions are not met then the results are likely to be inaccurate.Therefore, it isvery important thatyoucheck theassumptionsbeforedecidingwhichstatistical test isappropriate.Theassumptionsofparametrictestsbasedonthenormaldistributionareasfollows:
→ Normallydistributeddata:Thisassumptionistrickyandmisunderstoodbecauseitmeansdifferentthingsindifferent contexts (see Field, 2009). In short, the rationale behind hypothesis testing relies on havingsomethingthatisnormallydistributed:insomecasesit’sthesamplingdistribution,inotherstheerrorsinthemodelandsoon. Ifthisassumptionisnotmetthenyourstatisticalmodelortest isusuallyflawedinsomeway. The reason why this assumption is misunderstood is because people often believe that the datathemselvesneed tobenormallydistributed;actually, the rawdata rarelyneed tobenormallydistributed,butiftheyarethenitusuallymeansthatthethingsthatweactuallywanttobenormallydistributedaretoo.Confused? Well, it is confusing. In many statistical tests (e.g. the t‐test) we assume that the samplingdistributionisnormallydistributed.Thisisaproblembecausewedon’thaveaccesstothisdistribution—wecan’tsimplylookatitsshapeandseewhetheritisnormallydistributed.However,weknowfromthecentrallimittheorem(seemybookifyoudon’tknowwhatthatis)thatifthesampledataareapproximatelynormalthenthesamplingdistributionwillbealso.Therefore,peopletendtolookattheirsampledatatoseeiftheyare normally distributed. If so, then they have a little party to celebrate and assume that the samplingdistribution(whichiswhatactuallymatters)isalso.Wealsoknowfromthecentrallimittheoremthatinbigsamples the sampling distribution tends to be normal anyway— regardless of the shape of the data weactually collected (and the sampling distribution will tend to be normal regardless of the populationdistribution insamplesof30ormore).Asoursamplegetsbiggerthen,wecanbemoreconfidentthatthesamplingdistributionisnormallydistributed.
→ Homogeneity of variance: This assumptionmeans that the variances should be the same throughout thedata. Indesigns inwhichyou test severalgroupsofparticipants thisassumptionmeans thateachof thesesamplescomefrompopulationswiththesamevariance.Incorrelationaldesigns,thisassumptionmeansthatthevarianceofonevariableshouldbestableatalllevelsoftheothervariable.
→ Intervaldata:Datashouldbemeasuredatleastattheintervallevel.Thismeansthatthedistancebetweenpointsofyourscaleshouldbeequalatallpartsalongthescale.Forexample, ifyouhada10‐pointanxietyscale,thenthedifferenceinanxietyrepresentedbyachangeinscorefrom2to3shouldbethesameasthatrepresentedbyachangeinscorefrom9to10.
→ Independence:Thisassumptionisthatdatafromdifferentparticipantsare independent,whichmeansthatthebehaviourofoneparticipantdoesnotinfluencethebehaviourofanother.Inrepeatedmeasuresdesigns(in which participants are measured in more than one experimental condition), we expect scores in theexperimental conditions to be non‐independent for a given participant, but behaviour between differentparticipantsshouldbeindependent.
Theassumptionsofintervaldataandindependentmeasurementsare,unfortunately,testedonlybycommonsense.These assumptions of homogeneity of variance and Normal Distribution can be tested more objectively. Theremainderofthishandoutexplainhowtoexploredataandtotesttheseassumptions.
UsingHistogramstoSpottheObviousMistakes
Abiologistwasworriedaboutthepotentialhealtheffectsofmusicfestivals.So,oneyearshewenttotheDownloadMusicFestival (http://www.downloadfestival.co.uk)andmeasured thehygieneof810concertgoersover the threedays of the festival. In theory eachpersonwasmeasuredon eachday but because itwas difficult to track peopledown,thereweresomemissingdataondays2and3.Hygienewasmeasuredusingastandardisedtechnique(don’tworryitwasn’t lickingtheperson’sarmpit)thatresultsinascorerangingbetween0(yousmelllikearottingcorpsethat’s hiding up a skunk’s arse) and 5 (you smell of sweet roses on a fresh spring day). Now I know from bitter
C8057(ResearchMethodsinPsychology):ExploringData
©Dr.AndyField,2008 Page2
experiencethatsanitationisnotalwaysgreatattheseplaces(Readingfestivalseemsparticularlybad…)andsothisresearcherpredictedthatpersonalhygienewouldgodowndramaticallyoverthethreedaysofthefestival.ThedatafileiscalledDownloadFestival.sav.
ToplotahistogramweuseChartBuilder(seelastweek’shandout)whichisaccessedthroughthe menu.SelectHistograminthelistlabelledChoosefromtobringupthegalleryshowninFigure1.Thisgalleryhasfouriconsrepresentingdifferenttypesofhistogram,andyoushouldselecttheappropriateoneeitherbydouble‐clickingonit,orbydraggingitontothecanvasintheChartBuilder:
→ Simplehistogram:Usethisoptionwhenyoujustwanttoseethefrequenciesofscoresforasinglevariable(i.e.mostofthetime).
→ Stackedhistogram: Ifyouhadagroupingvariable (e.g.whethermenorwomenattendedthe festival)youcouldproduceahistograminwhicheachbarissplitbygroup.Intheexampleofgender,thisisagoodwaytocomparetherelativefrequencyofscoresacrossgroups(e.g.weretheremoresmellywomenthanmen?).
→ Frequency Polygon: This option displays the samedata as the simple histogramexcept that it uses a lineinsteadofbarstoshowthefrequency,andtheareabelowthelineisshaded.
→ PopulationPyramid:Likeastackedhistogramthisshowstherelativefrequencyofscoresintwopopulations.Itplotsthevariable(inthiscasehygiene)ontheverticalaxisandthefrequenciesforeachpopulationonthehorizontal:thepopulationsappearbacktobackonthegraph.Ifthebarseithersideofthedividinglineareequallylongthenthedistributionshaveequalfrequencies.
Figure1:Thehistogramgallery
Wearegoingtodoasimplehistogramsodouble‐clickontheiconforasimplehistogram(Figure1).TheChartBuilderdialogboxwillnowshowapreviewofthegraphinthecanvasarea.Atthemomentit’snotveryexciting(topofFigure)becausewehaven’ttoldSPSS/PASWwhichvariableswewanttoplot.Notethatthevariablesinthedataeditorarelistedontheleft‐handsideoftheChartBuilder,andanyofthesevariablescanbedraggedintoanyofthedropzones(spacessurroundedbybluedottedlines).
Ahistogramplots a single variable (x‐axis) against the frequencyof scores (y‐axis), soallweneed todo is select avariablefromthelistanddragitinto .Let’sdothisforthehygienescoresonday1ofthefestival.Clickon this variable in the list and drag it to as shown in Figure 2; you will now find the histogrampreviewedonthecanvas.Todrawthehistogramclickon .
TheresultinghistogramisshowninFigure3andthefirstthingthatshouldleapoutatyouisthatthereappearstobeonecasethat isverydifferenttotheothers.Allofthescoresappeartobesquisheduponeendofthedistributionbecause they are all less than5 (yieldingwhat is knownas a leptokurtic distribution!) except forone,whichhas avalue of 20! This is anoutlier: a score very different to the rest. Outliers bias themean and inflate the standarddeviationandscreeningdataisanimportantwaytodetectthem.What’soddaboutthisoutlieristhatithasascoreof20,whichisabovethetopofourscale(rememberourhygienescalerangedonlyfrom0‐5)andsoitmustbeamistake(or thepersonhadobsessivecompulsivedisorderandhadwashed themselves intoa stateofextremecleanliness).However,with810cases,howonearthdowefindoutwhichcaseitwas?Youcouldjustlookthroughthedata,butthatwouldcertainlygiveyouaheadacheandsoinsteadwecanuseaboxplot.
SimpleHistogram
StackedHistogram
FrequencyPolygonHistogra
m
PopulationPyramid
C8057(ResearchMethodsinPsychology):ExploringData
©Dr.AndyField,2008 Page3
Figure2:DefiningahistogramintheChartBuilder
Figure3
Boxplots
You encountered boxplots or box‐whisker diagrams in first year. At the centre of the plot is themedian, which issurroundedbyaboxthetopandbottomofwhicharethelimitswithinwhichthemiddle50%ofobservationsfall(theinterquartile range).Stickingoutof thetopandbottomof theboxaretwowhiskerswhichextendtothemostandleastextremescoresrespectively.
Access theChartBuilder ( ), selectBoxplot in the list labelledChoose from tobringup thegalleryshowninFigure4.Therearethreetypesofboxplotyoucanchoose:
→ Simpleboxplot:Usethisoptionwhenyouwanttoplotaboxplotofasinglevariable,butyouwantdifferentboxplots produced for different categories in thedata (for thesehygienedatawe couldproduce separateboxplotsformenandwomen).
→ Clustered boxplot: This option is the same as the simple boxplot except that you can select a secondcategorical variable onwhich to split thedata. Boxplots for this second variable are produced in differentcolours.Forexample,wemighthavemeasuredwhetherourfestival‐goerwasstaying inatentoranearby
ClickontheHygieneDay1variableanddragittothis
DropZone.
C8057(ResearchMethodsinPsychology):ExploringData
©Dr.AndyField,2008 Page4
hotel during the festival. We could produce boxplots not just for men and women, but within men andwomenwecouldhavedifferent‐colouredboxplots for thosewho stayed in tentsand thosewho stayed inhotels.
→ 1‐DBoxplot:Usethisoptionwhenyoujustwanttoseeaboxplotforasinglevariable.(Thisdiffersfromthesimpleboxplotonlyinthatnocategoricalvariableisselectedforthex‐axis.)
Figure4:Theboxplotgallery
In the data file of hygiene scores we also have information about the gender of the concert‐goer. Let’s plot thisinformation aswell. Tomake our boxplot of the day 1 hygiene scores formales and females, double‐click on thesimple boxplot icon (Figure 4), then from the variable list select the hygiene day 1 score variable and drag it into
andselectthevariablegenderanddragitto .ThedialogshouldnowlooklikeFigure5—notethatthevariablenamesaredisplayedinthedropzones,andthecanvasnowdisplaysapreviewofourgraph(e.g.therearetwoboxplotsrepresentingeachgender).Clickon to produce the graph.
Figure5:Completeddialogboxforasimpleboxplot Figure 6: Boxplot of hygiene scores on day 1 of theDownloadFestivalsplitbygender
TheresultingboxplotisshowninFigure6.Itshowsaseparateboxplotforthemenandwomeninthedata.Youmayrememberthatthewholereasonthatwegotintothisboxplotmalarkeywastohelpustoidentifyanoutlierfromourhistogram(ifyouhaveskippedstraighttothissectionthenyoumightwanttobacktrackabit).Theimportantthingtonoteisthattheoutlierthatwedetectedinthehistogramisshownupasanasterisk(*)ontheboxplotandnexttoitisthenumberofthecase(611)that’sproducingthisoutlier.(Wecanalsotellthatthiscasewasafemale.)Ifwegoto
thedataeditor (dataview),wecan locatethiscasequicklybyclickingon andtyping611 in thedialogboxthatappears.Thattakesusstraighttocase611.Lookingatthiscaserevealsascoreof20.02,whichisprobablyamistyping
SimpleBoxplot
ClusteredBoxplot
1‐DBoxplot
C8057(ResearchMethodsinPsychology):ExploringData
©Dr.AndyField,2008 Page5
of2.02.We’dhavetogobacktotherawdataandcheck.We’llassumewe’vecheckedtherawdataanditshouldbe2.02,soreplacethevalue20.02withthevalue2.02beforewecontinuethisexample.
Nowthatwe’veremovedthemistakelet’sre‐plotthehistogram.Whilewe’reatitweshouldplothistogramsforthedatafromday2andday3ofthefestivalaswell.
Figure 7 shows the resulting three histograms. The first thing to note is that the data fromday 1 look a lotmorehealthynowwe’veremovedthedatapointthatwasmis‐typed.Infactthedistributionisamazinglynormal‐looking:itisnicelysymmetricalanddoesn’tseemtoopointyorflat—thesearegoodthings!However,thedistributionsfordays2and3arenotnearlysosymmetrical. Infact,theybothlookpositivelyskewed.Thissuggeststhatgenerallypeoplebecamesmellierasthefestivalprogressed.Theskewoccursbecauseasubstantialminorityinsistedonupholdingtheirlevelsofhygiene(againstallodds!)overthecourseofthefestival(babywet‐wipesareindispensableIfind).However,theseskeweddistributionsmightcauseusaproblemifwewanttouseparametrictests.Inthenextsectionwe’lllookatwaystotrytoquantifytheskewnessandkurtosis(i.e.howpointyorflatthedistributionis)ofthesedistributions.
Figure7:HistogramsofthehygienescoresoverthethreedaysoftheDownloadFestival
Quantifyingnormalitywithnumbers
SkewandKurtosis
To further explore the distribution of the variables, we can use the frequencies command (). The main dialog box is shown in Figure 8. The variables in the data
editor are listedon the left‐hand side, and they canbe transferred to thebox labelledVariable(s) by clickingon a
variable(orhighlightingseveralwiththemouse)andthenclickingon .IfavariablelistedintheVariable(s)boxisselectedusingthemouse,itcanbetransferredbacktothevariablelistbyclickingonthearrowbutton(whichshouldnowbepointingintheoppositedirection).Bydefault,SPSS/PASWproducesafrequencydistributionofallscoresintableform.However,therearetwootherdialogboxesthatcanbeselectedthatprovideotheroptions.Thestatisticsdialogboxisaccessedbyclickingon ,andthechartsdialogboxisaccessedbyclickingon .
C8057(ResearchMethodsinPsychology):ExploringData
©Dr.AndyField,2008 Page6
Thestatisticsdialogboxallowsyoutoselectseveralwaysinwhichadistributionofscorescanbedescribed,suchasmeasures of central tendency (mean,mode,median),measures of variability (range, standard deviation, variance,quartile splits), measures of shape (kurtosis and skewness). To describe the characteristics of the data we shouldselect themean, mode, median, standard deviation, variance and range. To check that a distribution of scores isnormal,weneedtolookatthevaluesofkurtosisandskewness.Thechartsoptionprovidesasimplewaytoplotthefrequencydistributionofscores(asabarchart,apiechartorahistogram).We’vealreadyplottedhistogramsofourdatasowedon’tneed toselect theseoptions,butyoucoulduse theseoptions in futureanalyses.Whenyouhaveselectedtheappropriateoptions,returntothemaindialogboxbyclickingon .Onceinthemaindialogbox,clickon toruntheanalysis.
Figure8
SPSS/PASWOutput1showsthetableofdescriptivestatisticsforthethreevariablesinthisexample.Fromthistable,wecanseethat,onaverage,hygienescoreswere1.77(outof5)onday1ofthefestival,butwentdownto0.96and0.98onday2and3respectively.Theotherimportantmeasuresforourpurposesaretheskewnessandthekurtosis,bothofwhichhaveanassociatedstandarderror.
The values of skewness and kurtosis should be zero in a normal distribution. Positive values of skewnessindicateapile‐upofscoresontheleftofthedistribution,whereasnegativevaluesindicateapile‐upontheright.
Positivevaluesofkurtosisindicateapointydistributionwhereasnegativevaluesindicateaflatdistribution.
Thefurtherthevalueisfromzero,themorelikelyitisthatthedataarenotnormallydistributed.
However,theactualvaluesofskewnessandkurtosisarenot,inthemselves,informative.Instead,weneedtotakethevalueandconvertittoaz‐scorebysubtractingthemean(whichis,inbothcaseszero)anddividingbythestandarderror.
C8057(ResearchMethodsinPsychology):ExploringData
©Dr.AndyField,2008 Page7
Intheaboveequations,thevaluesofS(skewness)andK(kurtosis)andtheirrespectivestandarderrorsareproducedby SPSS/PASW. These z‐scores can be compared against known values for the normal distribution shown in Field(2009).Anabsolutevaluegreaterthan1.96issignificantatp<.05,above2.58issignificantatp<.01andabsolutevaluesaboveabout3.29aresignificantatp<.001.Largesampleswillgiverisetosmallstandarderrorsandsowhensamplesizesarebigsignificantvaluesarisefromevensmalldeviationsfromnormality.Inmostsamplesit’sOktolookforvaluesabove1.96,however, in largesamplesthiscriterionshouldbeincreasedto2.58or3.29andinverylargesamples,becauseoftheproblemofsmallstandarderrorsthatI’vedescribed,nocriterionshouldbeappliedatall!
UsingthevaluesinSPSS/PASWOutput1,calculatethez‐scoresforskewnessandKurtosisforeachdayoftheDownloadfestival.
YourAnswers:
SPSS/PASWOutput1
TheKolmogorov‐SmirnovTest
TheKolmogorov‐SmirnovandShapiro‐Wilk testscomparethescores inthesampletoanormallydistributedsetofscoreswiththesamemeanandstandarddeviation.
→ Ifthetestisnon‐significant(p>.05)ittellsusthatthedistributionofthesampleisnotsignificantlydifferentfromanormaldistribution(i.e.itisprobablynormal).
→ If,however, thetest issignificant (p< .05)thenthedistribution inquestion issignificantlydifferentfromanormaldistribution(i.e.itisnon‐normal).
These tests seem great: in one easy procedure they tell us whether our scores are normally distributed (nice!).However,theyhavetheirlimitationsbecausewithlargesamplesizesitisveryeasytogetsignificantresultsfromsmalldeviationsfromnormality.Thetakehomemessageisthatbyallmeansusethesetests,butplotyourdataaswellandtrytomakeaninformeddecisionabouttheextentofnon‐normality.
C8057(ResearchMethodsinPsychology):ExploringData
©Dr.AndyField,2008 Page8
The Kolmogorov‐Smirnov (K‐S from now on) test can be accessed through the explore command (
).Figure9showsthedialogboxesfortheexplorecommand.First,enteranyvariablesof interest in thebox labelledDependent List byhighlighting themon the left‐hand sideand transferringthembyclickingon (ordraggingthemacross).Forthisexample,selectthehygienescoresforeachday.It isalsopossibletoselectafactor(orgroupingvariable)bywhichtosplittheoutput(so,wecouldtransfergenderintotheboxlabelled Factor List and SPSS/PASW will produce exploratory analysis for each group—a bit like the split filecommand).Ifyouclickon adialogboxappears,butthedefaultoptionisfine.Themoreinterestingoptionforourpurposes isaccessedbyclickingon . Inthisdialogboxselecttheoption ,andthiswillproduceboththeK‐Stest.Bydefault,SPSS/PASWwillproduceboxplots(splitaccordingtogroupifafactorhasbeen specified)and stemand leafdiagramsaswell. If you clickon you’ll see that,bydefault, SPSS/PASW‘excludescaseslistwise’.Thismeansthatifacaseofdataismissingforanyofthevariableswehaveselecteditwillexcludethatcaseentirely!Thisisnotdesirableherebecauseforhygieneonday1(forexample)wewanttoseethedistributionofallpeople,not just thepeoplewhoweretestedonall threedays.So,changethisoptionto ‘Excludecasespairwise’ and then for each variableweanalyse, SPSS/PASWwill excludepeopleonly if there is notdata forthemon that particular variable. Click on to return to themain dialog box and then click to run theanalysis.
Figure9
AretheresultsoftheK‐Stestssurprisinggiventhehistogramswehavealreadyseen?
YourAnswers:
C8057(ResearchMethodsinPsychology):ExploringData
©Dr.AndyField,2008 Page9
ThefirsttableproducedbySPSS/PASWcontainsdescriptivestatistics(meanetc.)andshouldhavethesamevaluesasthetablesobtainedusingthefrequenciesprocedureearlier.The importanttable isthatoftheKolmogorov‐Smirnovtest(SPSS/PASWOutput2).Thistableincludestheteststatisticitself,thedegreesoffreedom(whichshouldequalthesamplesize)andthesignificancevalueofthistest.Rememberthatasignificantvalue(Sig. lessthan.05) indicatesadeviation fromnormality.TheK‐S test ishighlysignificant for the last twodistributions indicating that theyarenotnormal.
SPSS/PASWOutput2
The test statistic for the K‐S test is denoted byD and so we can report these results as: Thehygienescoresonday1,D(810)=0.03,ns,didnotsignificantlydeviatefromnormality;however,day2,D(264)=0.12,p< .001,andday3,D(123)=0.14,p< .001scoresweresignificantlynon‐normal.Thenumbersinbracketsarethedegreesoffreedom(df)fromthetable.
HomogeneityofVariance
Different statistical procedures have their own unique way to look for homogeneity of variance: in correlationalanalysis sucha regressionwe tend tousegraphs (see lectures/handoutsonMultipleRegression)and forgroupsofdatawetendtouseatestcalledLevene’stest.Levene’stestteststhehypothesisthatthevariancesinthegroupsareequal(i.e.thedifferencebetweenthevariancesiszero).Therefore,
→ IfLevene’stestissignificantatp≤.05thenwecanconcludethatthenullhypothesisisincorrectandthatthevariances are significantly different—therefore, the assumption of homogeneity of variances has beenviolated.
→ If,however, Levene’s test isnon‐significant (i.e.p > .05) thenwemustaccept thenullhypothesis that thedifferencebetweenthevariancesiszero—thevariancesareroughlyequalandtheassumptionistenable.
Although Levene’s test can be selected as an option inmany of the statistical tests that require it, it can also beexaminedwhenyou’reexploringdata (andstrictlyspeaking it’sbetter toexamine itnowthanwaituntilyourmainanalysis).However,aswiththeK‐Stest(andothertestsofnormality),whenthesamplessizeislarge,smalldifferencesingroupvariancescanproduceaLevene’stestthatissignificantwhenthevariancesarenotparticularlydifferent.
Whatistheassumptionofhomogeneityofvariance?
YourAnswers:
WecangetLevene’stestusingtheexploremenuthatweusedintheprevioussection.Forthisexample,wellusethechick flick data from last week (if you didn’t save your own file last week, you can find the data in the file
C8057(ResearchMethodsinPsychology):ExploringData
©Dr.AndyField,2008 Page10
ChickFlick.sav).Oncethedataareloaded,use toopenthedialogboxin Figure7. Transfer thearousal variable from the liston the left‐hand side to thebox labelledDependent List: by
clickingthe next tothisbox,andbecausewewanttosplit theoutputbythegroupingvariable tocomparethe
variances,selectthevariablegenderandtransferittotheboxlabelledFactorListbyclickingontheappropriate .Thenclickon toopentheotherdialogbox inFigure10.TogetLevene’stestweneedtoselectoneoftheoptionswhereitsaysSpreadvs.LevelwithLevene’stest.Ifyouselect theLevene’stestiscarriedoutontherawdata(agoodplacetostart).Whenyou’vefinishedwiththisdialogboxclickon toreturntothemainexploredialogboxandthenclick toruntheanalysis.
Figure10
SPSS/PASWOutput3
SPSS/PASWOutput3showstheresults. I’mnotgoingtodwellonthisbecausewewillcomeacrossLevene’stest infutureseminars,sowe’lljustnotethatLevene’stestisnonsignificant(valuesinthecolumnlabelledSig.aremorethan.05)indicatingthatthevariancesarenotsignificantlydifferent(i.e.theyaresimilarandthehomogeneityofvarianceassumptionistenable).
Levene’stestcanbedenotedwiththeletterFandtherearetwodifferentdegreesoffreedom.Assuchyoucanreportit,ingeneralform,asF(df1,df2)=value,sig.So,wecouldsaythevariancesareequal,F(1,38)=0.87,ns.
CorrectingproblemsintheData
Ifyourdataarenotnormallydistributed,therearethingsyoucandototrytocorrecttheproblem.Themainthingistransforming the data. Although some students often (understandably) think that transforming data sounds dodgy(thephrase‘fudgingyourresults’springstosomepeople’sminds!),infactitisn’tbecauseyoudothesamethingtoallof your data. As such, transforming the data won’t change the relationships between variables (the relativedifferences between people for a given variable stay the same), however, it does change the differences betweendifferentvariables(becauseitchangestheunitsofmeasurement).Therefore,evenifyouhaveonlyonevariablethathasaskeweddistribution,youshouldstilltransformanyothervariables inyourdataset ifyou’regoingtocompare
C8057(ResearchMethodsinPsychology):ExploringData
©Dr.AndyField,2008 Page11
differencesbetweenthatvariableandothersthatyouintendtotransform.So,forexample,withourhygienedata,wemightwanttolookathowhygienelevelschangedacrossthethreedays(thatis,comparethemeanonday1tothemeansondays2and3).
Thedataforday2and3wereskewedandneedtobetransformed,butbecausewemightlatercomparethedatatoscoresonday1,wewouldalsohavetotransformtheday1data(eventhoughitisn’tskewed).Ifwedon’tchangetheday1dataaswell,thenanydifferencesinhygienescoreswefindfromday1today2or3willsimplybeduetoustransformingonevariableandnottheothers.
Whenyouhavepositivelyskeweddata(aswedo)thatneedscorrecting,therearethreetransformationsthatcanbeuseful(andasyou’llseethesecanbeadaptedtodealwithnegativelyskeweddatatoo):
1. Log Transformation (log(Xi)): Taking the logarithm of a set of numbers squashes the right tail of thedistribution.Assuchit’sagoodwaytoreducepositiveskew.However,youcan’tgetaLogvalueofzeroornegativenumbers,soifyourdataincludeanyzerosorproducenegativenumbersyouneedtoaddaconstanttoallofthedatabeforeyoudothetransformation.Forexample,ifyouhavezerosinthedatathendolog(Xi+1),orifyouhavenegativenumbersaddwhatevervaluemakesthesmallestnumberinthedatasetpositive.
2. SquareRootTransformation(√Xi):Takingthesquarerootoflargevalueshasmoreofaneffectthantakingthesquarerootofsmallvalues.Consequently, taking thesquarerootofeachofyourscoreswillbringanylarge scores closer to the centre—rather like the log transformation. As such, this can be a usefulway toreduceapositively‐skeweddata;however,youstillhavethesameproblemwithnegativenumbers(negativenumbersdon’thaveasquareroot).
3. Reciprocal Transformation (1/Xi): Dividing 1 by each score also reduces the impact of large scores. Thetransformedvariablewillhavealowerlimitof0(verylargenumberswillbecomeclosetozero).Onethingtobear in mind with this transformation is that it reverses the scores in the sense that scores that wereoriginally large in thedatasetbecomesmall (close tozero)after the transformation,but scores thatwereoriginallysmallbecomebigafterthetransformation.Forexample,imaginetwoscoresof1and10,afterthetransformation theybecome1/1=1,and1/10=0.1: thesmall scorebecomesbigger than the largescoreafterthetransformation.However,youcanavoidthisbyreversingthescoresbeforethetransformation,byfindingthehighestscoreandchangingeachscoretothehighestscoreminusthescoreyou’relookingat.So,youdoatransformation1/(XHighest–Xi).
Anyoneofthesetransformationscanalsobeusedtocorrectnegativelyskeweddata,butfirstyou’dhavetoreversethescores(that is,subtracteachscorefromthehighestscoreobtained)—ifyoudothis,don’tforgettoreversethescoresbackafterwards!
UsingSPSS/PASW’sComputecommand
Thecomputecommandenablesustocarryoutvariousfunctionsoncolumnsofdatainthedataeditor.Sometypicalfunctionsareaddingscoresacrossseveralcolumns,takingthesquarerootofthescoresinacolumn,orcalculatingthe
meanofseveralvariables.Toaccessthecomputedialogbox,usethemousetoselect .TheresultingdialogboxisshowninFigure11;ithasalistoffunctionsontheright‐handside,acalculator‐likekeyboardinthecentreandablankspacethatI’velabelledthecommandarea.ThebasicideaisthatyoutypeanameforanewvariableinthearealabelledTargetVariableandthenyouwritesomekindofcommandinthecommandareatotellSPSS/PASWhowtocreatethisnewvariable.Youuseacombinationofexistingvariablesselectedfromthelistontheleftandnumericexpressions.So,forexample,youcoulduseitlikeacalculatortoaddvariables(i.e.addtwocolumnsinthedataeditortomakeathird).Therearehundredsofbuilt‐infunctionsthatSPSS/PASWhasgroupedtogether.Inthe dialog box it lists these groups in the area labelled Function group; upon selecting a function group, a list ofavailable functionswithin thatgroupwillappear in thebox labelledFunctionsandSpecialVariables. If youselectafunction, thenadescriptionof that functionappears in thegreybox indicated inFigure11.Youcanenter variablenames into the command area by selecting the variable required from the variables list and then clicking on .Likewise,youcanselectacertainfunctionfromthelistofavailablefunctionsandenteritintothecommandareabyclickingon .
The basic procedure is to first type a variable name in the box labelled Target Variable. You can then click onandanotherdialogboxappears,whereyoucangivethevariableadescriptive label,andwhereyoucan
specifywhetheritisanumericorstringvariable(seeyourhandoutfromweek1).Thenwhenyouhavewrittenyourcommand for SPSS/PASW to execute, click on to run the command and create the new variable. There are
C8057(ResearchMethodsinPsychology):ExploringData
©Dr.AndyField,2008 Page12
functions for calculatingmeans, standarddeviationsandsumsof columns.We’regoing touse thesquare rootandlogarithmfunctions,whichareusefulfortransformingdatathatareskewed(seeField,2009,formoredetail).
Figure11:Dialogboxforthecomputefunction
LogTransformation
Let’s return to our Download festival data. To transform these data, first open the main compute dialog box by
selecting .Enter thename logday1 into thebox labelledTargetVariableandthenclickon and give the variable a more descriptive name such as Log transformed hygiene scores for day 1 of
Download festival. In the listbox labelledFunctiongroup clickonArithmeticandthen in thebox labelledFunctionsandSpecialVariablesclickonLg10(thisisthelogtransformationtobase10,Lnisthenaturallog)andtransferittothecommandareabyclickingon .Whenthecommandistransferred,itappearsinthecommandareaas‘LG10(?)’andthequestionmark shouldbe replacedwitha variablename (which canbe typedmanuallyor transferred from thevariables list). So replace the questionmarkwith the variableday1 by either selecting the variable in the list anddraggingitacross,clickingon ,orjusttyping‘day1’wherethequestionmarkis.
There isavalueof0 in theoriginaldata (onday2),andthere isno logarithmof thevalue0.Toovercomethisweshouldaddaconstant toouroriginalscoresbeforewetakethe logof thosescores.Anyconstantwilldo,providedthatitmakesallofthescoresgreaterthan0.Inthiscaseourlowestscoreis0inthedatasetsowecansimplyadd1toallofthescoresandthatwillensurethatallscoresaregreaterthanzero.Todothis,makesurethecursorisstillinside the brackets and click on and then . The final dialog box should look like Figure 11. Note that theexpressionreadsLG10(day1+1);thatis,SPSS/PASWwilladdonetoeachoftheday1scoresandthentakethelogoftheresultingvalues.Clickon tocreateanewvariablelogday1containingthetransformedvalues.
Command area
Variable list
Categories of
functions
Functions within the selected category
Use this dialog box to select certain
cases in the data
Description of selected function
C8057(ResearchMethodsinPsychology):ExploringData
©Dr.AndyField,2008 Page13
Nowyouhaveagoatcreatingsimilarvariables logday2and logday3 for theday2andday3data!
SquareRootTransformation
Tousethesquareroottransformation,wecouldrunthroughthesameprocess,byusinganamesuchassqrtday1andselecting the SQRT(numexpr) function from the list. This will appear in the box labelled Numeric Expression: asSQRT(?),andyoucansimplyreplacethequestionmarkwiththevariableyouwanttochange—inthiscaseday1.ThefinalexpressionwillreadSQRT(day1).
Tryrepeatingthisforday2andday3tocreatevariablescalledsqrtday2andsqrtday3.
ReciprocalTransformation
Finally,ifwewantedtouseareciprocaltransformationonthedatafromday1,wecoulduseanamesuchasrecday1andthensimplyclickon andthen .Ordinarilyyouwouldselectthevariablenamethatyouwanttotransformfromthelistanddragitacross,clickon orjusttypethenameofthevariable.However,theday2datacontainazerovalueandifwetrytodivide1by0thenwe’llgetanerrormessage(youcan’tdivideby0).Assuchweneedtoaddaconstanttoourvariable justaswedidforthe logtransformation.Anyconstantwilldo,but1 isaconvenientnumberforthesedata.So, insteadofselectingthevariablewewanttotransform,clickon .Thisplacesapairofbrackets into the box labelledNumeric Expression; thenmake sure the cursor is between these two brackets andselectthevariableyouwanttotransformfromthelistandtransferitacrossbyclickingon (ortypethenameofthevariable manually). Now click on and then (or type + 1 using your keyboard). The box labelled NumericExpression should now contain the text 1/(day1 + 1). Click on to create a new variable containing thetransformedvalues.
Tryrepeatingthisforday2andday3tocreatevariablescalledrecday2andrecday3.
TheEffectofTransformingData
PlotHistogramsofyourtransformedvalues!
Let’s have a look at our newdistributions (Figure 12).Compare these with theuntransformed distributions inFigure 7. Now, you can seethat all three transformationshave cleaned up the hygienescores for day 2: the positiveskew is gone (the square roottransformation in particularhas been useful). However,
C8057(ResearchMethodsinPsychology):ExploringData
©Dr.AndyField,2008 Page14
becauseourhygienescoresonday 1 were more or lesssymmetrical to begin with,theyhavenowbecomeslightlynegatively skewed for the logand square roottransformation, and positivelyskewed for the reciprocaltransformation1! Ifwe’reusingscores from day 2 alone thenwe could use the transformedscores,however,ifwewanttolook at the change in scoresthen we’d have to weigh upwhether the benefits of thetransformation for the day 2scoresoutweighstheproblemsitcreatesintheday1scores—dataanalysiscanbefrustratingsometimes!
Figure12:Day1(left)andday22(right)hygienescoresfromtheDownloadFestival
MultipleChoiceTest
Gotohttp://www.uk.sagepub.com/field3e/MCQ.htmandtestyourselfonthemultiplechoicequestionsforChapter5. If youget anywrong, re‐read thishandout (or Field, 2009,Chapter5) anddo themagainuntil youget themallcorrect.
Thishandoutcontainsmaterialfrom:
Field,A.P.(2009).DiscoveringstatisticsusingSPSS:andsexanddrugsandrock‘n’roll(3rdEdition).London:Sage.
ThismaterialiscopyrightAndyField(2000,2005,2009).
1Thereversaloftheskewforthereciprocaltransformationisbecause,asImentionedearlier,thereciprocalhastheeffectofreversingthescores.