exploring data (2008) - mdthinducollege.org · psychology): exploring data ... were some missing...

C8057(ResearchMethodsinPsychology):ExploringData

©Dr.AndyField,2008 Page1

ExploringData

AssumptionsofParametricData

If you use a parametric test and certain assumptions are not met then the results are likely to be inaccurate.Therefore, it isvery important thatyoucheck theassumptionsbeforedecidingwhichstatistical test isappropriate.Theassumptionsofparametrictestsbasedonthenormaldistributionareasfollows:

→ Normallydistributeddata:Thisassumptionistrickyandmisunderstoodbecauseitmeansdifferentthingsindifferent contexts (see Field, 2009). In short, the rationale behind hypothesis testing relies on havingsomethingthatisnormallydistributed:insomecasesit’sthesamplingdistribution,inotherstheerrorsinthemodelandsoon. Ifthisassumptionisnotmetthenyourstatisticalmodelortest isusuallyflawedinsomeway. The reason why this assumption is misunderstood is because people often believe that the datathemselvesneed tobenormallydistributed;actually, the rawdata rarelyneed tobenormallydistributed,butiftheyarethenitusuallymeansthatthethingsthatweactuallywanttobenormallydistributedaretoo.Confused? Well, it is confusing. In many statistical tests (e.g. the t‐test) we assume that the samplingdistributionisnormallydistributed.Thisisaproblembecausewedon’thaveaccesstothisdistribution—wecan’tsimplylookatitsshapeandseewhetheritisnormallydistributed.However,weknowfromthecentrallimittheorem(seemybookifyoudon’tknowwhatthatis)thatifthesampledataareapproximatelynormalthenthesamplingdistributionwillbealso.Therefore,peopletendtolookattheirsampledatatoseeiftheyare normally distributed. If so, then they have a little party to celebrate and assume that the samplingdistribution(whichiswhatactuallymatters)isalso.Wealsoknowfromthecentrallimittheoremthatinbigsamples the sampling distribution tends to be normal anyway— regardless of the shape of the data weactually collected (and the sampling distribution will tend to be normal regardless of the populationdistribution insamplesof30ormore).Asoursamplegetsbiggerthen,wecanbemoreconfidentthatthesamplingdistributionisnormallydistributed.

→ Homogeneity of variance: This assumptionmeans that the variances should be the same throughout thedata. Indesigns inwhichyou test severalgroupsofparticipants thisassumptionmeans thateachof thesesamplescomefrompopulationswiththesamevariance.Incorrelationaldesigns,thisassumptionmeansthatthevarianceofonevariableshouldbestableatalllevelsoftheothervariable.

→ Intervaldata:Datashouldbemeasuredatleastattheintervallevel.Thismeansthatthedistancebetweenpointsofyourscaleshouldbeequalatallpartsalongthescale.Forexample, ifyouhada10‐pointanxietyscale,thenthedifferenceinanxietyrepresentedbyachangeinscorefrom2to3shouldbethesameasthatrepresentedbyachangeinscorefrom9to10.

→ Independence:Thisassumptionisthatdatafromdifferentparticipantsare independent,whichmeansthatthebehaviourofoneparticipantdoesnotinfluencethebehaviourofanother.Inrepeatedmeasuresdesigns(in which participants are measured in more than one experimental condition), we expect scores in theexperimental conditions to be non‐independent for a given participant, but behaviour between differentparticipantsshouldbeindependent.

Theassumptionsofintervaldataandindependentmeasurementsare,unfortunately,testedonlybycommonsense.These assumptions of homogeneity of variance and Normal Distribution can be tested more objectively. Theremainderofthishandoutexplainhowtoexploredataandtotesttheseassumptions.

UsingHistogramstoSpottheObviousMistakes

Abiologistwasworriedaboutthepotentialhealtheffectsofmusicfestivals.So,oneyearshewenttotheDownloadMusicFestival (http://www.downloadfestival.co.uk)andmeasured thehygieneof810concertgoersover the threedays of the festival. In theory eachpersonwasmeasuredon eachday but because itwas difficult to track peopledown,thereweresomemissingdataondays2and3.Hygienewasmeasuredusingastandardisedtechnique(don’tworryitwasn’t lickingtheperson’sarmpit)thatresultsinascorerangingbetween0(yousmelllikearottingcorpsethat’s hiding up a skunk’s arse) and 5 (you smell of sweet roses on a fresh spring day). Now I know from bitter



experiencethatsanitationisnotalwaysgreatattheseplaces(Readingfestivalseemsparticularlybad…)andsothisresearcherpredictedthatpersonalhygienewouldgodowndramaticallyoverthethreedaysofthefestival.ThedatafileiscalledDownloadFestival.sav.

ToplotahistogramweuseChartBuilder(seelastweek’shandout)whichisaccessedthroughthe menu.SelectHistograminthelistlabelledChoosefromtobringupthegalleryshowninFigure1.Thisgalleryhasfouriconsrepresentingdifferenttypesofhistogram,andyoushouldselecttheappropriateoneeitherbydouble‐clickingonit,orbydraggingitontothecanvasintheChartBuilder:

→ Simplehistogram:Usethisoptionwhenyoujustwanttoseethefrequenciesofscoresforasinglevariable(i.e.mostofthetime).

→ Stackedhistogram: Ifyouhadagroupingvariable (e.g.whethermenorwomenattendedthe festival)youcouldproduceahistograminwhicheachbarissplitbygroup.Intheexampleofgender,thisisagoodwaytocomparetherelativefrequencyofscoresacrossgroups(e.g.weretheremoresmellywomenthanmen?).

→ Frequency Polygon: This option displays the samedata as the simple histogramexcept that it uses a lineinsteadofbarstoshowthefrequency,andtheareabelowthelineisshaded.

→ PopulationPyramid:Likeastackedhistogramthisshowstherelativefrequencyofscoresintwopopulations.Itplotsthevariable(inthiscasehygiene)ontheverticalaxisandthefrequenciesforeachpopulationonthehorizontal:thepopulationsappearbacktobackonthegraph.Ifthebarseithersideofthedividinglineareequallylongthenthedistributionshaveequalfrequencies.

Figure1:Thehistogramgallery

Wearegoingtodoasimplehistogramsodouble‐clickontheiconforasimplehistogram(Figure1).TheChartBuilderdialogboxwillnowshowapreviewofthegraphinthecanvasarea.Atthemomentit’snotveryexciting(topofFigure)becausewehaven’ttoldSPSS/PASWwhichvariableswewanttoplot.Notethatthevariablesinthedataeditorarelistedontheleft‐handsideoftheChartBuilder,andanyofthesevariablescanbedraggedintoanyofthedropzones(spacessurroundedbybluedottedlines).

Ahistogramplots a single variable (x‐axis) against the frequencyof scores (y‐axis), soallweneed todo is select avariablefromthelistanddragitinto .Let’sdothisforthehygienescoresonday1ofthefestival.Clickon this variable in the list and drag it to as shown in Figure 2; you will now find the histogrampreviewedonthecanvas.Todrawthehistogramclickon .

TheresultinghistogramisshowninFigure3andthefirstthingthatshouldleapoutatyouisthatthereappearstobeonecasethat isverydifferenttotheothers.Allofthescoresappeartobesquisheduponeendofthedistributionbecause they are all less than5 (yieldingwhat is knownas a leptokurtic distribution!) except forone,whichhas avalue of 20! This is anoutlier: a score very different to the rest. Outliers bias themean and inflate the standarddeviationandscreeningdataisanimportantwaytodetectthem.What’soddaboutthisoutlieristhatithasascoreof20,whichisabovethetopofourscale(rememberourhygienescalerangedonlyfrom0‐5)andsoitmustbeamistake(or thepersonhadobsessivecompulsivedisorderandhadwashed themselves intoa stateofextremecleanliness).However,with810cases,howonearthdowefindoutwhichcaseitwas?Youcouldjustlookthroughthedata,butthatwouldcertainlygiveyouaheadacheandsoinsteadwecanuseaboxplot.

SimpleHistogram

StackedHistogram

FrequencyPolygonHistogra

m

PopulationPyramid



Figure2:DefiningahistogramintheChartBuilder

Figure3

Boxplots

You encountered boxplots or box‐whisker diagrams in first year. At the centre of the plot is themedian, which issurroundedbyaboxthetopandbottomofwhicharethelimitswithinwhichthemiddle50%ofobservationsfall(theinterquartile range).Stickingoutof thetopandbottomof theboxaretwowhiskerswhichextendtothemostandleastextremescoresrespectively.

Access theChartBuilder ( ), selectBoxplot in the list labelledChoose from tobringup thegalleryshowninFigure4.Therearethreetypesofboxplotyoucanchoose:

→ Simpleboxplot:Usethisoptionwhenyouwanttoplotaboxplotofasinglevariable,butyouwantdifferentboxplots produced for different categories in thedata (for thesehygienedatawe couldproduce separateboxplotsformenandwomen).

→ Clustered boxplot: This option is the same as the simple boxplot except that you can select a secondcategorical variable onwhich to split thedata. Boxplots for this second variable are produced in differentcolours.Forexample,wemighthavemeasuredwhetherourfestival‐goerwasstaying inatentoranearby

ClickontheHygieneDay1variableanddragittothis

DropZone.



hotel during the festival. We could produce boxplots not just for men and women, but within men andwomenwecouldhavedifferent‐colouredboxplots for thosewho stayed in tentsand thosewho stayed inhotels.

→ 1‐DBoxplot:Usethisoptionwhenyoujustwanttoseeaboxplotforasinglevariable.(Thisdiffersfromthesimpleboxplotonlyinthatnocategoricalvariableisselectedforthex‐axis.)

Figure4:Theboxplotgallery

In the data file of hygiene scores we also have information about the gender of the concert‐goer. Let’s plot thisinformation aswell. Tomake our boxplot of the day 1 hygiene scores formales and females, double‐click on thesimple boxplot icon (Figure 4), then from the variable list select the hygiene day 1 score variable and drag it into

andselectthevariablegenderanddragitto .ThedialogshouldnowlooklikeFigure5—notethatthevariablenamesaredisplayedinthedropzones,andthecanvasnowdisplaysapreviewofourgraph(e.g.therearetwoboxplotsrepresentingeachgender).Clickon to produce the graph.

Figure5:Completeddialogboxforasimpleboxplot Figure 6: Boxplot of hygiene scores on day 1 of theDownloadFestivalsplitbygender

TheresultingboxplotisshowninFigure6.Itshowsaseparateboxplotforthemenandwomeninthedata.Youmayrememberthatthewholereasonthatwegotintothisboxplotmalarkeywastohelpustoidentifyanoutlierfromourhistogram(ifyouhaveskippedstraighttothissectionthenyoumightwanttobacktrackabit).Theimportantthingtonoteisthattheoutlierthatwedetectedinthehistogramisshownupasanasterisk(*)ontheboxplotandnexttoitisthenumberofthecase(611)that’sproducingthisoutlier.(Wecanalsotellthatthiscasewasafemale.)Ifwegoto

thedataeditor (dataview),wecan locatethiscasequicklybyclickingon andtyping611 in thedialogboxthatappears.Thattakesusstraighttocase611.Lookingatthiscaserevealsascoreof20.02,whichisprobablyamistyping

SimpleBoxplot

ClusteredBoxplot

1‐DBoxplot



of2.02.We’dhavetogobacktotherawdataandcheck.We’llassumewe’vecheckedtherawdataanditshouldbe2.02,soreplacethevalue20.02withthevalue2.02beforewecontinuethisexample.

Nowthatwe’veremovedthemistakelet’sre‐plotthehistogram.Whilewe’reatitweshouldplothistogramsforthedatafromday2andday3ofthefestivalaswell.

Figure 7 shows the resulting three histograms. The first thing to note is that the data fromday 1 look a lotmorehealthynowwe’veremovedthedatapointthatwasmis‐typed.Infactthedistributionisamazinglynormal‐looking:itisnicelysymmetricalanddoesn’tseemtoopointyorflat—thesearegoodthings!However,thedistributionsfordays2and3arenotnearlysosymmetrical. Infact,theybothlookpositivelyskewed.Thissuggeststhatgenerallypeoplebecamesmellierasthefestivalprogressed.Theskewoccursbecauseasubstantialminorityinsistedonupholdingtheirlevelsofhygiene(againstallodds!)overthecourseofthefestival(babywet‐wipesareindispensableIfind).However,theseskeweddistributionsmightcauseusaproblemifwewanttouseparametrictests.Inthenextsectionwe’lllookatwaystotrytoquantifytheskewnessandkurtosis(i.e.howpointyorflatthedistributionis)ofthesedistributions.

Figure7:HistogramsofthehygienescoresoverthethreedaysoftheDownloadFestival

Quantifyingnormalitywithnumbers

SkewandKurtosis

To further explore the distribution of the variables, we can use the frequencies command (). The main dialog box is shown in Figure 8. The variables in the data

editor are listedon the left‐hand side, and they canbe transferred to thebox labelledVariable(s) by clickingon a

variable(orhighlightingseveralwiththemouse)andthenclickingon .IfavariablelistedintheVariable(s)boxisselectedusingthemouse,itcanbetransferredbacktothevariablelistbyclickingonthearrowbutton(whichshouldnowbepointingintheoppositedirection).Bydefault,SPSS/PASWproducesafrequencydistributionofallscoresintableform.However,therearetwootherdialogboxesthatcanbeselectedthatprovideotheroptions.Thestatisticsdialogboxisaccessedbyclickingon ,andthechartsdialogboxisaccessedbyclickingon .



Thestatisticsdialogboxallowsyoutoselectseveralwaysinwhichadistributionofscorescanbedescribed,suchasmeasures of central tendency (mean,mode,median),measures of variability (range, standard deviation, variance,quartile splits), measures of shape (kurtosis and skewness). To describe the characteristics of the data we shouldselect themean, mode, median, standard deviation, variance and range. To check that a distribution of scores isnormal,weneedtolookatthevaluesofkurtosisandskewness.Thechartsoptionprovidesasimplewaytoplotthefrequencydistributionofscores(asabarchart,apiechartorahistogram).We’vealreadyplottedhistogramsofourdatasowedon’tneed toselect theseoptions,butyoucoulduse theseoptions in futureanalyses.Whenyouhaveselectedtheappropriateoptions,returntothemaindialogboxbyclickingon .Onceinthemaindialogbox,clickon toruntheanalysis.

Figure8

SPSS/PASWOutput1showsthetableofdescriptivestatisticsforthethreevariablesinthisexample.Fromthistable,wecanseethat,onaverage,hygienescoreswere1.77(outof5)onday1ofthefestival,butwentdownto0.96and0.98onday2and3respectively.Theotherimportantmeasuresforourpurposesaretheskewnessandthekurtosis,bothofwhichhaveanassociatedstandarderror.

The values of skewness and kurtosis should be zero in a normal distribution. Positive values of skewnessindicateapile‐upofscoresontheleftofthedistribution,whereasnegativevaluesindicateapile‐upontheright.

Positivevaluesofkurtosisindicateapointydistributionwhereasnegativevaluesindicateaflatdistribution.

Thefurtherthevalueisfromzero,themorelikelyitisthatthedataarenotnormallydistributed.

However,theactualvaluesofskewnessandkurtosisarenot,inthemselves,informative.Instead,weneedtotakethevalueandconvertittoaz‐scorebysubtractingthemean(whichis,inbothcaseszero)anddividingbythestandarderror.



Intheaboveequations,thevaluesofS(skewness)andK(kurtosis)andtheirrespectivestandarderrorsareproducedby SPSS/PASW. These z‐scores can be compared against known values for the normal distribution shown in Field(2009).Anabsolutevaluegreaterthan1.96issignificantatp<.05,above2.58issignificantatp<.01andabsolutevaluesaboveabout3.29aresignificantatp<.001.Largesampleswillgiverisetosmallstandarderrorsandsowhensamplesizesarebigsignificantvaluesarisefromevensmalldeviationsfromnormality.Inmostsamplesit’sOktolookforvaluesabove1.96,however, in largesamplesthiscriterionshouldbeincreasedto2.58or3.29andinverylargesamples,becauseoftheproblemofsmallstandarderrorsthatI’vedescribed,nocriterionshouldbeappliedatall!

UsingthevaluesinSPSS/PASWOutput1,calculatethez‐scoresforskewnessandKurtosisforeachdayoftheDownloadfestival.

YourAnswers:

SPSS/PASWOutput1

TheKolmogorov‐SmirnovTest

TheKolmogorov‐SmirnovandShapiro‐Wilk testscomparethescores inthesampletoanormallydistributedsetofscoreswiththesamemeanandstandarddeviation.

→ Ifthetestisnon‐significant(p>.05)ittellsusthatthedistributionofthesampleisnotsignificantlydifferentfromanormaldistribution(i.e.itisprobablynormal).

→ If,however, thetest issignificant (p< .05)thenthedistribution inquestion issignificantlydifferentfromanormaldistribution(i.e.itisnon‐normal).

These tests seem great: in one easy procedure they tell us whether our scores are normally distributed (nice!).However,theyhavetheirlimitationsbecausewithlargesamplesizesitisveryeasytogetsignificantresultsfromsmalldeviationsfromnormality.Thetakehomemessageisthatbyallmeansusethesetests,butplotyourdataaswellandtrytomakeaninformeddecisionabouttheextentofnon‐normality.



The Kolmogorov‐Smirnov (K‐S from now on) test can be accessed through the explore command (

).Figure9showsthedialogboxesfortheexplorecommand.First,enteranyvariablesof interest in thebox labelledDependent List byhighlighting themon the left‐hand sideand transferringthembyclickingon (ordraggingthemacross).Forthisexample,selectthehygienescoresforeachday.It isalsopossibletoselectafactor(orgroupingvariable)bywhichtosplittheoutput(so,wecouldtransfergenderintotheboxlabelled Factor List and SPSS/PASW will produce exploratory analysis for each group—a bit like the split filecommand).Ifyouclickon adialogboxappears,butthedefaultoptionisfine.Themoreinterestingoptionforourpurposes isaccessedbyclickingon . Inthisdialogboxselecttheoption ,andthiswillproduceboththeK‐Stest.Bydefault,SPSS/PASWwillproduceboxplots(splitaccordingtogroupifafactorhasbeen specified)and stemand leafdiagramsaswell. If you clickon you’ll see that,bydefault, SPSS/PASW‘excludescaseslistwise’.Thismeansthatifacaseofdataismissingforanyofthevariableswehaveselecteditwillexcludethatcaseentirely!Thisisnotdesirableherebecauseforhygieneonday1(forexample)wewanttoseethedistributionofallpeople,not just thepeoplewhoweretestedonall threedays.So,changethisoptionto ‘Excludecasespairwise’ and then for each variableweanalyse, SPSS/PASWwill excludepeopleonly if there is notdata forthemon that particular variable. Click on to return to themain dialog box and then click to run theanalysis.

Figure9

AretheresultsoftheK‐Stestssurprisinggiventhehistogramswehavealreadyseen?

YourAnswers:



ThefirsttableproducedbySPSS/PASWcontainsdescriptivestatistics(meanetc.)andshouldhavethesamevaluesasthetablesobtainedusingthefrequenciesprocedureearlier.The importanttable isthatoftheKolmogorov‐Smirnovtest(SPSS/PASWOutput2).Thistableincludestheteststatisticitself,thedegreesoffreedom(whichshouldequalthesamplesize)andthesignificancevalueofthistest.Rememberthatasignificantvalue(Sig. lessthan.05) indicatesadeviation fromnormality.TheK‐S test ishighlysignificant for the last twodistributions indicating that theyarenotnormal.

SPSS/PASWOutput2

The test statistic for the K‐S test is denoted byD and so we can report these results as: Thehygienescoresonday1,D(810)=0.03,ns,didnotsignificantlydeviatefromnormality;however,day2,D(264)=0.12,p< .001,andday3,D(123)=0.14,p< .001scoresweresignificantlynon‐normal.Thenumbersinbracketsarethedegreesoffreedom(df)fromthetable.

HomogeneityofVariance

Different statistical procedures have their own unique way to look for homogeneity of variance: in correlationalanalysis sucha regressionwe tend tousegraphs (see lectures/handoutsonMultipleRegression)and forgroupsofdatawetendtouseatestcalledLevene’stest.Levene’stestteststhehypothesisthatthevariancesinthegroupsareequal(i.e.thedifferencebetweenthevariancesiszero).Therefore,

→ IfLevene’stestissignificantatp≤.05thenwecanconcludethatthenullhypothesisisincorrectandthatthevariances are significantly different—therefore, the assumption of homogeneity of variances has beenviolated.

→ If,however, Levene’s test isnon‐significant (i.e.p > .05) thenwemustaccept thenullhypothesis that thedifferencebetweenthevariancesiszero—thevariancesareroughlyequalandtheassumptionistenable.

Although Levene’s test can be selected as an option inmany of the statistical tests that require it, it can also beexaminedwhenyou’reexploringdata (andstrictlyspeaking it’sbetter toexamine itnowthanwaituntilyourmainanalysis).However,aswiththeK‐Stest(andothertestsofnormality),whenthesamplessizeislarge,smalldifferencesingroupvariancescanproduceaLevene’stestthatissignificantwhenthevariancesarenotparticularlydifferent.

Whatistheassumptionofhomogeneityofvariance?

YourAnswers:

WecangetLevene’stestusingtheexploremenuthatweusedintheprevioussection.Forthisexample,wellusethechick flick data from last week (if you didn’t save your own file last week, you can find the data in the file



ChickFlick.sav).Oncethedataareloaded,use toopenthedialogboxin Figure7. Transfer thearousal variable from the liston the left‐hand side to thebox labelledDependent List: by

clickingthe next tothisbox,andbecausewewanttosplit theoutputbythegroupingvariable tocomparethe

variances,selectthevariablegenderandtransferittotheboxlabelledFactorListbyclickingontheappropriate .Thenclickon toopentheotherdialogbox inFigure10.TogetLevene’stestweneedtoselectoneoftheoptionswhereitsaysSpreadvs.LevelwithLevene’stest.Ifyouselect theLevene’stestiscarriedoutontherawdata(agoodplacetostart).Whenyou’vefinishedwiththisdialogboxclickon toreturntothemainexploredialogboxandthenclick toruntheanalysis.

Figure10

SPSS/PASWOutput3

SPSS/PASWOutput3showstheresults. I’mnotgoingtodwellonthisbecausewewillcomeacrossLevene’stest infutureseminars,sowe’lljustnotethatLevene’stestisnonsignificant(valuesinthecolumnlabelledSig.aremorethan.05)indicatingthatthevariancesarenotsignificantlydifferent(i.e.theyaresimilarandthehomogeneityofvarianceassumptionistenable).

Levene’stestcanbedenotedwiththeletterFandtherearetwodifferentdegreesoffreedom.Assuchyoucanreportit,ingeneralform,asF(df1,df2)=value,sig.So,wecouldsaythevariancesareequal,F(1,38)=0.87,ns.

CorrectingproblemsintheData

Ifyourdataarenotnormallydistributed,therearethingsyoucandototrytocorrecttheproblem.Themainthingistransforming the data. Although some students often (understandably) think that transforming data sounds dodgy(thephrase‘fudgingyourresults’springstosomepeople’sminds!),infactitisn’tbecauseyoudothesamethingtoallof your data. As such, transforming the data won’t change the relationships between variables (the relativedifferences between people for a given variable stay the same), however, it does change the differences betweendifferentvariables(becauseitchangestheunitsofmeasurement).Therefore,evenifyouhaveonlyonevariablethathasaskeweddistribution,youshouldstilltransformanyothervariables inyourdataset ifyou’regoingtocompare



differencesbetweenthatvariableandothersthatyouintendtotransform.So,forexample,withourhygienedata,wemightwanttolookathowhygienelevelschangedacrossthethreedays(thatis,comparethemeanonday1tothemeansondays2and3).

Thedataforday2and3wereskewedandneedtobetransformed,butbecausewemightlatercomparethedatatoscoresonday1,wewouldalsohavetotransformtheday1data(eventhoughitisn’tskewed).Ifwedon’tchangetheday1dataaswell,thenanydifferencesinhygienescoreswefindfromday1today2or3willsimplybeduetoustransformingonevariableandnottheothers.

Whenyouhavepositivelyskeweddata(aswedo)thatneedscorrecting,therearethreetransformationsthatcanbeuseful(andasyou’llseethesecanbeadaptedtodealwithnegativelyskeweddatatoo):

1. Log Transformation (log(Xi)): Taking the logarithm of a set of numbers squashes the right tail of thedistribution.Assuchit’sagoodwaytoreducepositiveskew.However,youcan’tgetaLogvalueofzeroornegativenumbers,soifyourdataincludeanyzerosorproducenegativenumbersyouneedtoaddaconstanttoallofthedatabeforeyoudothetransformation.Forexample,ifyouhavezerosinthedatathendolog(Xi+1),orifyouhavenegativenumbersaddwhatevervaluemakesthesmallestnumberinthedatasetpositive.

2. SquareRootTransformation(√Xi):Takingthesquarerootoflargevalueshasmoreofaneffectthantakingthesquarerootofsmallvalues.Consequently, taking thesquarerootofeachofyourscoreswillbringanylarge scores closer to the centre—rather like the log transformation. As such, this can be a usefulway toreduceapositively‐skeweddata;however,youstillhavethesameproblemwithnegativenumbers(negativenumbersdon’thaveasquareroot).

3. Reciprocal Transformation (1/Xi): Dividing 1 by each score also reduces the impact of large scores. Thetransformedvariablewillhavealowerlimitof0(verylargenumberswillbecomeclosetozero).Onethingtobear in mind with this transformation is that it reverses the scores in the sense that scores that wereoriginally large in thedatasetbecomesmall (close tozero)after the transformation,but scores thatwereoriginallysmallbecomebigafterthetransformation.Forexample,imaginetwoscoresof1and10,afterthetransformation theybecome1/1=1,and1/10=0.1: thesmall scorebecomesbigger than the largescoreafterthetransformation.However,youcanavoidthisbyreversingthescoresbeforethetransformation,byfindingthehighestscoreandchangingeachscoretothehighestscoreminusthescoreyou’relookingat.So,youdoatransformation1/(XHighest–Xi).

Anyoneofthesetransformationscanalsobeusedtocorrectnegativelyskeweddata,butfirstyou’dhavetoreversethescores(that is,subtracteachscorefromthehighestscoreobtained)—ifyoudothis,don’tforgettoreversethescoresbackafterwards!

UsingSPSS/PASW’sComputecommand

Thecomputecommandenablesustocarryoutvariousfunctionsoncolumnsofdatainthedataeditor.Sometypicalfunctionsareaddingscoresacrossseveralcolumns,takingthesquarerootofthescoresinacolumn,orcalculatingthe

meanofseveralvariables.Toaccessthecomputedialogbox,usethemousetoselect .TheresultingdialogboxisshowninFigure11;ithasalistoffunctionsontheright‐handside,acalculator‐likekeyboardinthecentreandablankspacethatI’velabelledthecommandarea.ThebasicideaisthatyoutypeanameforanewvariableinthearealabelledTargetVariableandthenyouwritesomekindofcommandinthecommandareatotellSPSS/PASWhowtocreatethisnewvariable.Youuseacombinationofexistingvariablesselectedfromthelistontheleftandnumericexpressions.So,forexample,youcoulduseitlikeacalculatortoaddvariables(i.e.addtwocolumnsinthedataeditortomakeathird).Therearehundredsofbuilt‐infunctionsthatSPSS/PASWhasgroupedtogether.Inthe dialog box it lists these groups in the area labelled Function group; upon selecting a function group, a list ofavailable functionswithin thatgroupwillappear in thebox labelledFunctionsandSpecialVariables. If youselectafunction, thenadescriptionof that functionappears in thegreybox indicated inFigure11.Youcanenter variablenames into the command area by selecting the variable required from the variables list and then clicking on .Likewise,youcanselectacertainfunctionfromthelistofavailablefunctionsandenteritintothecommandareabyclickingon .

The basic procedure is to first type a variable name in the box labelled Target Variable. You can then click onandanotherdialogboxappears,whereyoucangivethevariableadescriptive label,andwhereyoucan

specifywhetheritisanumericorstringvariable(seeyourhandoutfromweek1).Thenwhenyouhavewrittenyourcommand for SPSS/PASW to execute, click on to run the command and create the new variable. There are



functions for calculatingmeans, standarddeviationsandsumsof columns.We’regoing touse thesquare rootandlogarithmfunctions,whichareusefulfortransformingdatathatareskewed(seeField,2009,formoredetail).

Figure11:Dialogboxforthecomputefunction

LogTransformation

Let’s return to our Download festival data. To transform these data, first open the main compute dialog box by

selecting .Enter thename logday1 into thebox labelledTargetVariableandthenclickon and give the variable a more descriptive name such as Log transformed hygiene scores for day 1 of

Download festival. In the listbox labelledFunctiongroup clickonArithmeticandthen in thebox labelledFunctionsandSpecialVariablesclickonLg10(thisisthelogtransformationtobase10,Lnisthenaturallog)andtransferittothecommandareabyclickingon .Whenthecommandistransferred,itappearsinthecommandareaas‘LG10(?)’andthequestionmark shouldbe replacedwitha variablename (which canbe typedmanuallyor transferred from thevariables list). So replace the questionmarkwith the variableday1 by either selecting the variable in the list anddraggingitacross,clickingon ,orjusttyping‘day1’wherethequestionmarkis.

There isavalueof0 in theoriginaldata (onday2),andthere isno logarithmof thevalue0.Toovercomethisweshouldaddaconstant toouroriginalscoresbeforewetakethe logof thosescores.Anyconstantwilldo,providedthatitmakesallofthescoresgreaterthan0.Inthiscaseourlowestscoreis0inthedatasetsowecansimplyadd1toallofthescoresandthatwillensurethatallscoresaregreaterthanzero.Todothis,makesurethecursorisstillinside the brackets and click on and then . The final dialog box should look like Figure 11. Note that theexpressionreadsLG10(day1+1);thatis,SPSS/PASWwilladdonetoeachoftheday1scoresandthentakethelogoftheresultingvalues.Clickon tocreateanewvariablelogday1containingthetransformedvalues.

Command area

Variable list

Categories of

functions

Functions within the selected category

Use this dialog box to select certain

cases in the data

Description of selected function



Nowyouhaveagoatcreatingsimilarvariables logday2and logday3 for theday2andday3data!

SquareRootTransformation

Tousethesquareroottransformation,wecouldrunthroughthesameprocess,byusinganamesuchassqrtday1andselecting the SQRT(numexpr) function from the list. This will appear in the box labelled Numeric Expression: asSQRT(?),andyoucansimplyreplacethequestionmarkwiththevariableyouwanttochange—inthiscaseday1.ThefinalexpressionwillreadSQRT(day1).

Tryrepeatingthisforday2andday3tocreatevariablescalledsqrtday2andsqrtday3.

ReciprocalTransformation

Finally,ifwewantedtouseareciprocaltransformationonthedatafromday1,wecoulduseanamesuchasrecday1andthensimplyclickon andthen .Ordinarilyyouwouldselectthevariablenamethatyouwanttotransformfromthelistanddragitacross,clickon orjusttypethenameofthevariable.However,theday2datacontainazerovalueandifwetrytodivide1by0thenwe’llgetanerrormessage(youcan’tdivideby0).Assuchweneedtoaddaconstanttoourvariable justaswedidforthe logtransformation.Anyconstantwilldo,but1 isaconvenientnumberforthesedata.So, insteadofselectingthevariablewewanttotransform,clickon .Thisplacesapairofbrackets into the box labelledNumeric Expression; thenmake sure the cursor is between these two brackets andselectthevariableyouwanttotransformfromthelistandtransferitacrossbyclickingon (ortypethenameofthevariable manually). Now click on and then (or type + 1 using your keyboard). The box labelled NumericExpression should now contain the text 1/(day1 + 1). Click on to create a new variable containing thetransformedvalues.

Tryrepeatingthisforday2andday3tocreatevariablescalledrecday2andrecday3.

TheEffectofTransformingData

PlotHistogramsofyourtransformedvalues!

Let’s have a look at our newdistributions (Figure 12).Compare these with theuntransformed distributions inFigure 7. Now, you can seethat all three transformationshave cleaned up the hygienescores for day 2: the positiveskew is gone (the square roottransformation in particularhas been useful). However,



becauseourhygienescoresonday 1 were more or lesssymmetrical to begin with,theyhavenowbecomeslightlynegatively skewed for the logand square roottransformation, and positivelyskewed for the reciprocaltransformation1! Ifwe’reusingscores from day 2 alone thenwe could use the transformedscores,however,ifwewanttolook at the change in scoresthen we’d have to weigh upwhether the benefits of thetransformation for the day 2scoresoutweighstheproblemsitcreatesintheday1scores—dataanalysiscanbefrustratingsometimes!

Figure12:Day1(left)andday22(right)hygienescoresfromtheDownloadFestival

MultipleChoiceTest

Gotohttp://www.uk.sagepub.com/field3e/MCQ.htmandtestyourselfonthemultiplechoicequestionsforChapter5. If youget anywrong, re‐read thishandout (or Field, 2009,Chapter5) anddo themagainuntil youget themallcorrect.

Thishandoutcontainsmaterialfrom:

Field,A.P.(2009).DiscoveringstatisticsusingSPSS:andsexanddrugsandrock‘n’roll(3rdEdition).London:Sage.

ThismaterialiscopyrightAndyField(2000,2005,2009).

1Thereversaloftheskewforthereciprocaltransformationisbecause,asImentionedearlier,thereciprocalhastheeffectofreversingthescores.

exploring data (2008) - mdthinducollege.org · psychology): exploring data ... were some missing...

Documents