straightforward statistics using excel and tableau: a ...index-of.co.uk/ofimatica/straightforward...
TRANSCRIPT
BusinessstatisticswithExcelandTableauAhands-onguidewithscreencastsanddata
StephenPeplow
©2014-2015StephenPeplow
WiththankstothestudentswhohavehelpedmeatImperialCollegeLondon,theUniversityofBritishColumbiaandKwantlenPolytechnicUniversity.
Contents
1.Howtousethisbook1
1.1Thisbookisalittledifferent 2
1.2Chapterdescriptions 2
2.VisualizationandTableau:telling(true)storieswith
data9
3.Writingupyourfindings15
3.1Planofattack.Followthesesteps. 17
3.2Presentingyourwork 17
4.Dataandhowtogetit18
4.1Bigdata 21
4.2Someusefulsites 21
5.Testingwhetherquantitiesarethesame23
5.1ANOVASingleFactor 23
5.2ANOVA:withmorethanonefactor 29
6.Regression:Predictingwithcontinuousvariables... 31
6.1Layoutofthechapter 33
6.2Introducingregression 33
6.3Truckingexample 36
6.4Howgoodisthemodel?—r-squared 39
6.5Predictingwiththemodel 40
6.6Howitworks:Least-squaresmethod 41
6.7Addinganothervariable 42
6.8Dummyvariables 43
6.9Severaldummyvariables 49
6.10Curvilinearity 50
6.11Interactions 52
6.12Themulticollinearityproblem 58
6.13Howtopickthebestmodel 58
6.14Thekeypoints 59
6.15Workedexamples 60
7.Checkingyourregressionmodel61
7.1Statisticalsignificance 61
7.2Thestandarderrorofthemodel 64
7.3Testingtheleastsquaresassumptions 66
7.4Checkingtheresiduals 67
7.5Constructingastandardizedresidualsplot 68
7.6Correctingwhenanassumptionisviolated 71
7.7Lackoflinearity 71
7.8Whatelsecouldpossiblygowrong? 73
7.9Linearitycondition 73
7.10Correlationandcausation 74
7.11Omittedvariablebias 74
7.12Multicollinearity 75
7.13Don’textrapolate 75
8.TimeSeriesIntroductionandSmoothingMethods. . 76
8.1Layoutofthechapter 76
8.2TimeSeriesComponents 77
8.3Whichmethodtouse? 78
8.4Naiveforecastingandmeasuringerror 79
8.5Movingaverages 81
8.6Exponentiallyweightedmovingaverages 83
9.TimeSeriesRegressionMethods87
9.1Quantifyingalineartrendinatimeseries usingregression87
9.2Measuringseasonality 89
9.3Curvilineardata 91
10.Optimization95
10.1Howlinearprogrammingworks 95
10.2Settingupanoptimizationproblem 96
10.3Exampleofmodeldevelopment. 97
10.4Writingtheconstraintequations 97
10.5Writingtheobjectivefunction 98
10.6OptimizationinExcel(withtheSolveradd-in)...98
10.7Sensitivityanalysis 100
10.8InfeasibilityandUnboundedness 102
10.9Workedexamples 102
11.Morecomplexoptimization107
11.1Proportionality 107
11.2Supplychainproblems 109
11.3Blendingproblems 110
12.Predictingitemsyoucancountonebyone114
12.1Predictingwiththebinomialdistribution 115
12.2PredictingwiththePoissondistribution 118
13.Choiceunderuncertainty119
13.1Influencediagrams 119
13.2Expectedmonetaryvalue 121
13.3Valueofperfectinformation 123
13.4Risk-returnRatio 124
13.5Minimaxandmaximin 124
13.6Workedexamples 125
14.Accountingforrisk-preferences127
14.1Outlineofthechapter 128
14.2Wheredotheutilitynumberscomefrom? 130
14.3Convertinganexpectedutilitynumberintoacer-
taintyequivalent. 136
15.Glossary137
1.HowtousethisbookIbegan business life as an entrepreneur in business in HongKong. I ran tradeexhibitions,importedcoffeefromKenya,and startedandoperatedtworestaurants,oneofwhich(unusuallyforHongKong)servedvegetarianfood.Thesebusinesseswereprofitable,butIwouldhavesavedmyselfagreatdealofstress,anddonebetterif I had used some business intelligence informed by statistics to improve mydecision-making.Lookingback,IwishIhadbeenable tothinkthroughandwriteanalysesontopicssuchastheseamongmanyothers:
a.calculationofoptimalrestaurantstaffinglevels
b.analysisofsalesovertimeandseasonalityintrends
c.receiptspercustomerbyrestauranttypeandanalyzedstrengthandtypeofanydifferences
d.predictedsalesdataforpotentialexhibitorswithvisualiza-tionsofvarious‘whatif’scenarios
e.gaineddeeperinsightsfromvisualizingmydata
f.wonovermorepartnersandinvestorswithbettervisualandwrittenpresentations
I’msurethattherearemanymorepeoplelikeme,awarethattheyoughttobedoingmorewiththedatatheycollectaspartofnormalbusinessoperations,butuncertainofhowtogoaboutit.There is no shortage of textbooks andmanuals, but thesedon’tseem toget to thehands-onapplicationsquicklyenough.Thisbook is forpeoplesuchasme,intwocomponents:theanalysis,findingouttheunderlyingstoryfromthedata,andthenthepresentationofthestory.
1
1.1Thisbookisalittledifferent
Mostbooksonstatsintroducestatisticaltechniquesonachapterbychapterbasis.Instead,I’vewrittenthisbooksothatitreflects thewaymany people learn. Thechaptersarestructuredbythetypeofquestionyoumightwanttoask.Examplesofthe typeofquestionsaresummarizedbelow,andatthebeginningof eachchapter.Thechaptersthemselvesdoesn’tincludemuchmathand technicaldetails. Insteadtheseareplacedtowardstheendofthebookinaglossary.
ManyoftheExcelproceduresarelinkedtoscreencastspreparedbymetoillustratethat particular procedure. The data-sets used in the book are available from myDropboxpublicfolder:justclickonthehyperlinkthatappearsnexttoeachworkedexample.Withthedata-setsopeninExcel,youcanfollowalongatyourownpace.(And then do the same with your own data). I have used Tableau for somevisualizations,andyouwillfindlinkstomy workbooksandscreencasts.
1.2Chapterdescriptions
Ifyou’rereadingthisbook,thenyouareverylikelyalreadyengagedinbusinessandwouldliketoknowhowtotakethatbusinesstonewheights.Takealookthroughthechapterdescriptionsthat followandthengostraighttothatchapter.Ifyouthinkyoumightneedabitofastatisticsrefresher,lookthroughtheGlossaryinChapter16first.AverygoodfreestatisticsOpenIntrotextbookisavailablehere¹
Chapter2introducesTableauPublic²whichisafreedatavisu-alizationtool.WhileExceldoeshavegraphingtoolswhichareeasytouse,theresultscanlookalittleclunky.Tableauhelps
¹https://www.openintro.org/stat/²http://www.tableausoftware.com/public/
ustomergedatafromdifferentsourcesandcreateremarkablevisualizationswhichcanthenbeeasilyshared.I’llbe illustratingresultsthroughoutthisbookwitheitherTableau,Excelorboth.One caution:ifyoupublishyourresultstothewebusingTableauPublic,yourdataisalsopublished.Ifthisisaproblem,thereisanoption topay foranenhancedversionofTableau.ThescreenshotbelowshowsworkdonechangesonlanduseintheDeltaregionofBritishColumbia.Byclickingonthelink,youcanopentheworkbookandalterthesettings.Youcanfiltertheyear (see topright)andalsolandusetype.ThankstoMalcolmLittleforhisworkonthisproject.TableauworkbookforDeltaLandCoverChanges³
Tableaushowingchangesinlandcover
Tableauhastraininganddemonstrationvideosavailableon itswebsite, and thereare plenty of examples out there. The screen casts which Tableau provides(available at their homepage) are probably enough.Where I have found sometechnique(suchasboxplots)particularlytrickyIhavecreatedscreencastsforthisbook.Theimagebelowisdatawewilluseintheregressionchapter.Youruna
³https://public.tableau.com/views/DeltaCropCategories1996-2011/Sheet1?:embed=y&:showTabs=y&:display_count=yes
truckingcompanyandfromyourlogbooksextractthedistanceanddurationandnumberofdeliveriesforsomedeliveryjobs.Theimageshowsatrendline,withdistanceonthehorizontalaxisandtimetakenontheverticalaxis.Thepointsonthescatterplotare sizedbynumberofdeliveries.Youcanseethatmoredeliveriesincreasesthetime,asonewouldexpect.Usingregression,Chapter6,wewillworkoutamodelwhichcanshowhowmuchextrayoushouldchargeforeachdelivery.
HereisaYouTubefortruckingscatterplot⁴
Tableaushowingdistances,times,anddeliveries
Chapter3.Writingupyour findings. Inmostcases statistical analysis isdone inordertohelpadecision-makerdecidewhattodo.Iamassumingthatitisyourjobtothinkthroughtheproblemandassistthedecision-makerbyassemblingandanalyzingthedataonhis/herbehalf.Thischaptersuggestswaysinwhichyoumightwriteupyourreport,accompaniedbyvisualizationstogetacrossyourmessage.Therearealsolinkstosomehelpfulwebsiteswhich
⁴https://youtu.be/oFACLZLJWZI
discussthepreparationofslidesandhowtomakepresentations.
Chapter4:Dataandhowtogetit.Ifyouareconsideringcollectingdatayourself,throughasurveyforexample,thenyou’llfindthischapteruseful.So-calledbigdataisahottopic,andsoIincludesomediscussion.Thechapteralsoincludeslinkstopubliclyavail-abledatasetswhichmightbehelpful.
Chapter 5 deals with tests for whether two or more quantities are the same ordifferent.Forexample,youareafranchiseewiththree coffee-shops.Youwant toknowwhetherdailysalesarethesameordifferent,andperhapswhatfactorscauseanydifference thatyou find. Youwant to know whether there is a statisticallysignificantcommonfactorornot,toeliminatethepossibilitythatthedifferenceyouseeoccurredpurelybychance.Perhapstheaverageageofthecustomersmakesadifferenceinthesales?ANOVAusesverysimilartheorytoregression,thesubjectofthenextchapterandperhapsthemostimportantinthebook.
Chapter 6: Regression is almost certainly the most important statis- tical toolcovered in thisbook. Inoneformoranother,regression isbehindagreatdealofapplied statistics. The example presented in this book imagines you running atrucking business and wanting to be able to provide quotations for jobs morequickly. It turnsout thatifyouhavesomepastlog-bookdatatouse,such as thedistance of various trips, the time that they took and howmany deliverieswereinvolved,anaccuratemodelcanbemade.Regressioncanfindtheaveragetimetakenforeveryextraunitofdistance,aswellas other variables such as the number ofstops. This chapter shows how to create such a model and how to use it forprediction.Withsuchamodel,youcanmakequotationsreallyquickly.Thischapteralsocoversmorecomplexregressionandhowtogoaboutmodel-building.
Otheruses:youwanttocalculatethebetaofastock,comparingthereturnsofoneparticularstockwiththeS&P500.
Chapter7isabouttestingwhethertheregressionmodelswebuilt
inChapters6and7are in fact anygood.Regressionappears so easy to do thateverybody does it,without checking the validity of the answers. Following theprocedures in this chapterwill help to ensure that yourwork is taken seriously.Thischapterismoreofaguidetothinkingcriticallyabouttheresultsother peoplemight have. Is there missing variable bias? For example, you notice a strongcorrelationbetweensalesofice-creamsanddeathsbydrowning.Canyouthereforesaythatreducingice-creamsaleswillmakethewatersafer?Err–no.Themissingorlurkingvariable is temperature.Both drowning deaths and ice cream sales are afunctionoftemperature,notofeachother.
Chapter8isabouttime-seriesandshowshowwecandetecttrendsindataovertime,and make predictions. The smoothing methods, such as moving averages andexponentiallyweightedmovingaver-agescoveredinthischapterarefinefordatawhichlacksseasonality andwhenonlyarelativelyshort-termforecast is needed.Whenyouhavelonger-rundataandformakingpredictions,thefollowing chapterhasthetechniquesyouneed.
Chapter 9 describes the regression-based approach to time series analysis andforecasting.Thisapproachispowerfulwhenthereissomeseasonalitytoyourdata,forexamplesalesofTVsetsshowadistinctquarterlypattern(asdo umbrellas!).
SalesofTVsetsshowingaquarterlyseasonality
Usingregressionwe candetect peaks and troughs, connect them to seasonsandcalculatetheirstrength.Theresultisamodelwhichwecanusetopredictsalesintothefuture.
Chapter 10 is about optimizationormaking thebest useof a limitednumberofresources.Thisishighlyusefulinmanybusinesssituations.Forexample,youneedto staff a factory with workers of different skills and pay-levels. There is aminimumamount of skills requiredforeachclassofworker,and perhaps also aminimumnumberofhoursforeachworker.Usingoptimization,youcancalculatethenumberofhourstoallocatetoeachworker.
Chapter11Optimizationcanalsobeusedformorecomplex‘blend- ing’problems.Example:Yourunapaperrecyclingbusiness. Youtakeinpapersandotherfiberssuchascardboardboxes.Howcanyoumixtogetherthevariousinputssothatyouroutputmeetsminimumqualityrequirements,minimizeswastage,andgeneratesthemostprofit?
Another example: what mix or blend of investments would best suit yourrequirements?
Chapter 12 concerns calculating the probability of items you can count one byone.Forexample,whatistheprobabilitythatmore
thanfivepeoplewillcometotheservicedeskinthenexthalf-hour?What is theprobabilitythatallofthenextthreecustomerswillmakeapurchase?WecansolvetheseproblemsusingthebinomialandthePoisson distributions.
Chapter13concernschoiceunder uncertainty: if you have a choice of differentactions,eachofwhichhasanuncertainoutcome,whichactionshouldyouchoosetomaximize the expectedmonetary value?Afarmerknowsthepayoffshe/shewillmake from different crops provided the weather is in one state or another(sunny/wet) butatthetimewhenthecropdecisionhastobemade,heobviouslydoesn’t know what the actual state will be. Which crop should he plant? Amanufacturerneedstodecidewhethertoinvestinconstructinganewfactoryatatimeofeconomicuncertainty:whatshouldhedo?
Chapter 14adds thedecision-maker’s riskprofile tohisorher decisionprocess.Mostpeopleareriskaverse,andarewillingto tradeoffsomerisk inexchangeforcertainty. This chapter shows how to construct a utility curve which maps riskattitudes,andthenprioritizethedecisionsintermsofmaximumexpectedutility.
Chapter15isaGlossaryandcontainssomebasicstatisticsinforma- tion,primarilydefinitions of terms that the book uses frequently and which have a particularmeaninginstatistics(forexample‘population’).TheGlossaryalsodiscusseswhytheinferentialtech-niquesusedinstatisticsaresopowerful,allowingusto makeinferencesaboutapopulationbasedonwhatappearstobeaverysmallsample.
UnderEforExcelintheGlossary,you’llfindsomelinkstoscreencastsonsubjectsnotdirectlycoveredinthisbook,butwhichyoumightfindhelpful.
2.VisualizationandTableau:telling(true)storieswithdata
In this book, we’ll work on gaining insights from data by visual- ization andquantitativeanalysis,andthenpresentingtheresultstoothers.RecentlyapowerfulnewtoolcalledTableauhasbecomeavailable.TableauPublic¹isafreeversion.Butbecarefulwithyourdatabecausewhenyoupublishtotheweb,asyoumustdowithTableauPublic,thenyourdataalsobecomesavailabletoanybody.Ifyourequirethatyourdatabeprotected,youcanpayfortheprivateversion.Tableau9hasjustbeenmadeavailable.
Many of the worked examples in this book are accompanied by a link to acompleted Tableau workbook, showing how you could have used Tableau topresent your findings. Tableau cannot perform easily the more technicalhypothesis-testingaspectsof statisticssuchasregression,butitcanhelpyoutogetyourpointacrossclearly.TableauhasalsoprovidedausefulWhitePaper²onvisualbestpracticeswhichiswellworthreading
In business intelligence we are generally interesting in detecting patterns andrelationships.Thismightseemobvious,becauseashumanswearealwayslookingforthesephenomena.I’djust liketoaddthattheabsenceofapatternorrelationshipmightbejustasinformativeasfindingone.Anexcellentfirststepistotakealookatyourdatawithanexpositorygraphbeforemovingontomoreformaldataanalysis.Belowareexamplesofthetypesofchartsorgraphsmostcommonlyusedineitherexaminingdatafirstoff,
¹http://www.tableausoftware.com/public/²http://www.tableausoftware.com/whitepapers/visual-analysis-everyone
9
justto‘seehowitlooks’.Checkinganexpositorygraphshowsup problems thatmightthrowyou off later:missing data, outliers or,more excitingly, unexpectedandintriguingrelationships.
Histograms
Histogramsbreakthedataintoclassesandshowthedistributionofthedata.Whichclasses(sometimesknownasbins)aremostcommon.Thehistogramshowsusthe‘shape’ofthedata.Aretheremanysmallmeasurements,ordoesthedatalookasthough it is normally distributed and consequently mound-shaped. SeeDistributionintheGlossaryformoreon this.Theplotbelow isof deaths in caraccidentsonaweeklybasisintheUnitedStatesover theperiod1973-1978.Thehistogramshowsonlythedistributionandnotthesequence.
From the histogram only,wecould say thatthemostcommonnumberofdeathsinaweekis between8000and 9000, be- cause thebars of the histogram arehighest there. They arehigh- est because theycount thenumberof timesin the time-period thatthere have been (forexample 8000 deathsoccurredin15weeks).
What the histogramdoesn’tshowis that
deathshaveactuallybeendecreasingsince
Histogramofcardeaths,reportedweekly
about1976.Thereisprobablyasimpleexplanationforthis:compul-soryseatbeltsperhaps?Thetake-homefromthisisthatexpositorygraphsare reallyeasy todo,and that you should try different methodswith the same data to tease out theinsights.
Cardeathsovertime
Scatterplots
Themostcommonandmostusefulgraphisprobably the scatter- plot. It isusedwhenwehavetwocontinuousvariables,suchas twoquantities(weightofcarandmileage)andwewanttoseetherelationshipbetweenthem.Thescatterplotisalsousefulforplottingtimeseries,wheretimeisoneofthevariables.
YoucanuseExcelforthis,andalsoTableau.InExcel,arrangethecolumnsofyourdatasothatthevariablewhichyouwantonthehorizontalxaxis is theoneon theleft.Thisistheindependentvariable.Puttheothervariable,thedependentvariable,ontheverticalyaxis.Itisusualpracticetoputthevariablewhichwethinkisdoingtheexplainingonthexaxisandtheresponsevariableon
theyaxis.Here isascatterplotofEuropeanUnionExpenditureper student as apercentageofGDP:
EU%ofGDPspentonPrimaryEducation
There is a gradual increase over the years 1996 to 2011 as one would expect;educational expenditures tend to remain a relatively fixed share of the budget.However, there was a blip in 2006 which is interesting. Because the data ispercentageofGDP, thedropmight have been caused by an increase inGDP oralternativelybyanactualeducationalspendingdecrease.We’dneedtolookatGDPfiguresfortheperiod.IgotEuropeanGDPfigurespercapitaandusingTableauputthemonthesameplot,withdifferentaxesofcourse.Theresultisbelow
EuropeanGDPandPrimary Expenditures
GDPdroppedaround2008/2009becauseoftheworld financialcrisis.Theblipinprimaryexpendituresisstillunexplained.
Maps
Tableaumakesthedisplayofspatialdatarelativelyeasy,providedyoutellTableauwhichofyourdimensionscontainthegeographicalvariable.Theplotbelowshowschangesingreenhousegas emis-sionsfromagricultureintheEuropeanUnion.Ilinkedworldbankdatabycountry.Theworkbookishere³.
³https://public.tableausoftware.com/views/EuropeanGHGfromAgriculture2010-2011/Sheet1?:embed=y&:display_count=no
Unfortunately, not all geographic entities are available for linkage in Tableau.States and counties in theUnited States are certainly available and provinces inCanada,aswellascountriesinEurope.TherearewaystoimportArcGISshapefilesintoTableau.Ashape-fileisalistofverticeswhichdelimitthepolygonsthatdefinethegeographicalentity.
GHGemissionsfromagricultureintheEuropeanUnion
3.WritingupyourfindingsInchapter1,ImentionedsomeofthequestionsIwishedIhadbeenabletoanswerwhenIwasoperatingabusiness.NowIwanttosuggestwaysinwhichyoumightstructureyourresponseto suchquestions.Theactualanswerscomewhenyou’vedonethestatisticalwork(carriedoutinthefollowingchapters),butitmighthelptohaveatleastastructureonwhichyoucanbeginbuildingyourwork.
Beforeyoudoanyworkontheproblem,getstraightinyourmindexactlywhatyouarebeingaskedtodo.Youneedtobeclearandsodoesthedecision-makertowhomyouaregoingtosubmityourfindings.Hereisonewaytodothis:
Copydownthekeyquestionthatyouarebeingaskedinexactlythewaythatithasbeengiventoyou.Forexample,letussaythatyouhavebeenaskedtoanswerthisadmittedly toughquestion: ‘What factorsareimportantfor the success of a newsupermarket?’
Nowkeeprewritingitinyourownwordssothatitbecomesas clearaspossible.Makesurethateachandeverywordisclear.Whatdoes‘success’actuallymean?—inthebusinesscontextprobably‘mostprofitable’.Whatdoes‘new’mean?Isthisacompletelynewsupermarketoranewoutletofanexistingbrand?Youmightendupwith:‘whenweareplanningalocationforanewoutletforourbrand,whatfactorscontributemosttoprofitability?’Doingallthishelpsyoutoidentifythestatisticalmethodmostsuitableforthetask.Youalsocanincludeherethehypotheses(ifany)thatyouwillbetesting.
Hereisasuggestedsectionorderforyourreport.However,thisprobablywon’tbetheorderofthework.TheorderinwhichyoudothestatisticalworkisinSection3,
yourplanofattack.Sotheorder
15
ofdoingtheactualworkisthatyoucarryouttheplanofattack,and thenwriteupyourresultsfollowingthesectionbysectionsequenceIamgivingyounow.
Section1.Statethequestiontobeansweredclearlyandsuccinctly,asresultoftherewriting you did above.Doing thismakes sure that youclearlyunderstand theproblemandwhat is being askedof you. Writingout thequestion and stating itclearlyrightatthebeginning of your report also ensures that the decision-maker(thepersonwhoposedthequestioninthefirstplace)andyouunderstandeachother.Youcanalsowriteinthehypothesiswhichyouaretesting.
Section2.Provideanexecutivesummaryoftheresults.Thisshouldbeperhapssixlineslongandcontainthekeyfindingsfromyourwork.Don’tputinanytechnicalwords, jargonorcomplicatedresults.Justenoughsothatabusypersoncanskimthroughitandgetthegeneralideaofyourfindingsbeforegoingmoredeeplyintoyourhardwork.
Section3.Outlineyourplanofattack ,describinghowyouhave approachedtheproblem.Youalsocanincludeherethehypotheses(ifany)thatyouwillbetesting.Atypicalplanofattackfollowslateroninthischapter.
Section4.Data.Provideabriefdescriptionofthedatathatyouhaveused.Thesourceofthedata, howyou found it,whether you think it is reliable andwhether it issufficientforthetask.Itisagoodideato includesummarystatisticsat thispoint.This includes information such as the number of observations, and what thevariablesare.
Section5.Statisticalmethodology.Describethestatisticalmethod-ologywhichyouare going to use, for exampleANOVAto testwhetherornot themeansare thesame.
Section6.Carryoutthetests,makingsuretoincludeadescriptionofwhetherornotthehypothesescanberejected.Thisis thecentralpartofthereport.Justmakesuretofocusonansweringthequestion.
Section7.Writeaconclusionwhichdescribestheresultsof the test and how theresults answer the question that was asked. You can also include here anyshortcomingsinthedataorinyourmethodologywhichmightaffectthevalidityofthe results. The conclusion is very important because it ties together all theprevioussections.HereyoucouldincludealinktoaTableau‘story’asanadditionaloralternativewayofpresentingresults.
Section8.References/end notes or further information, especially on sources ofdata, should end the document. If youwant tomake your document look reallygood, and you have lots of references, consider using the Zotero plug-in forFirefox.Itisanexcellentfreewaytomanageyourbibliography.
3.1Planofattack.Followthesesteps.
a.Identificationofthestatisticalmethodthatyouwilluseandwhyyouchosethosemethods
b.Whatdataisrequired,andwherecanitbefound?
c.Conductthestatisticaltests
d.Discusswhethertheresultsarereliable/answerthequestion.(Ifnot,startagain!)
3.2Presentingyourwork
While it is probably best to concentrate on written workbecause the decision-makermightwant toreadyourworkindetail, and discuss it with colleagues, aTableau presentation is an excellent way ofmaking your point over again. SeeChapter2formoreonvisualizationandTableau.
Here is a link to some excellent notes by Professor Andrew Gelman on givingresearchpresentations¹
¹http://andrewgelman.com/2014/12/01/quick-tips-giving-research-presentations/
4.DataandhowtogetitInthischapter,we’lllookatthedatacollectionprocess, theprob- lemsthatmightaccompanypoorlycollecteddata,discussso-calledbigdataandprovide links tosomeusefulpublicsourcesofdata.
As you might expect, this book works through the analysis of numbers —quantitativedata—butitisimportanttonotethattheanalysisofqualitativedataisarapidlygrowingarea.Unfortunately theanalysisofqualitativedataisbeyondthisbookandthesoftwareavailabletous.
Quantitativedatacomesfromtwomainsources. Primarydataiscollectedbyyouorthecompanyyouareworkingfor.Forexample,themarketresearchyoudotofindoutthepossiblemarketshareforyourproductprovidesprimarydata.Primarydataincludes bothinternalcompanydataanddatafromautomatedequipmentsuchaswebsitehits.Collectingdata is expensive and highly proprietary. It is thereforeunlikelytobepublishedandavailableoutsidetheenterprise.
By contrast, Secondary data is plentiful and mostly free. It is collected bygovernmentsofallsizes,andalsobymany non-governmentorganizationsaswell.There is an increasing trend towards the liberation of data under various opengovernmentinitiatives.Governmentdataisusuallyreliable,butbesuretochecktheaccompanyingnoteswhichwarnofanyproblems, suchas limitedsamplesizeorchangeinclassificationorcollectionmethods over time.Thereare some links tosecondarydataattheendofthischapter.
18
Experimentaldesign
Experimentsdon’tnecessarilyhave tobeconducted in laboratories bypeople inwhitecoats:asurveyinshoppingmallisalsoaformofexperiment,asisanalyzingthe results of your favorite cricket team. There are two different designtypes: experimental and observational . The keydifference is the amount ofrandomization thatisbuiltintotheexperiment.Ingeneral,morerandomizationisgood,becauseithelpstoremovetheinfluenceoffixedeffects,suchasthequalityofthesoilinaparticularlocation.
Here’sanexampletomakeclearthedifference.Imaginea re-searcherwantingtotest the effect of a new drug onmice. In an experimental design, the mice areexamined before the drug is administered, and then the drug is applied to atreatmentgroupofmiceandacontrolgroup.Thecontrolgroupreceivesnodrugs.Itsjobistoactasareferencegroup.Allthemicearekeptinidenticalconditionsapartfromtheapplicationofthedrug.Theallocationofmicetothetreatmentgroupandtothecontrolgroupisentirelyrandom.
Whilewe can (anddo) carryout drug experiments onmice, itwouldclearlybeverywrongtotrytodothesamewithhumansassubjects.Insteadwecollectdataorobservationsandthenanalyzethoseobservationslookingford i ffe rences .
Tocompensateforthelackofrandomization,wecontrolforob-serveddifferencesbyincludingasmanyrelevantvariablesaspossi-ble.IfweknewthatpersonXhadreceivedaparticulardrugandhaddevelopedaparticularcondition,wewouldwanttocomparepersonXwithsomebodyelsewhohadnotdeveloped that condition.Relevantvariablesthatwewouldwanttoknowmightbe age,genderandpossiblypre-existinghealthconditions.Includingthese variables reduces fixedeffectsandallowsustoconcentrateontheeffectofthedrug.
Problemswithdata
It is obvious that to be credible, your analyses must be based on reliable data.Problemswithdataareusuallyconnectedtopoorsamplingandexperimentaldesigntechniques,especially:
Samplesizetoosmall .Therelativesizeof thesample to thepopulationusuallydoesnotmatter.Itistheabsolutesizeofthesamplethatcounts.Youusuallywantatleastseveralhundredobservations.
Population of interest not clearly defined. It is clear that we need to take asamplefromapopulation,butwhatexactlyisthepopulation?Here’sanexample.Youwanttosurveyshoppersinashoppingmallregardingyournewproduct.Butofthepeopleinside themall,whoexactlyareyourpopulation?Peoplejustentering,peoplejustleaving,peoplehavingtheirlunchinthefoodcourt? Singles, couples,elderlypeopleortheteenagershanging aroundoutsidethedoor?Youcanseethatpickinganyoneofthesegroupsonitsownwillleadtoabiasedsample.
Non-response bias . Many people don’t answer those irritating telephone callswhich come in the evening because they’re busy withdinner.As a result, onlyanswers from those who do choose to answer the survey are counted. Thoserespondentsmostlikelyaren’trepresentativeof thepopulation.Perhaps they livealoneordonothavetoomuchtodo.I’mnotsayingthattheyshouldnotbe in thesample,justthatincludingonlythosewhodorespondmaybiasyoursample.
Voluntaryresponsebias .Ifyoufeelstronglyaboutanissue,then you aremorelikelytorespondthanifyouareindifferent.That’ssimplyhumannature.Asaresult,thesurveyresultswillbeskewedbytheviewsofthosewhofeelmostpassionately.Thisishardlyarepresentativesamplebecausethestrongly-heldviewsdrownoutthemoremoderate voices.
4.1Bigdata
Primarydatafrequentlycomesfromautomatedcollectiondevices,suchasscanners,websites,socialmedia,andthelike.Thevolumeofsuchdataisenormous,andisaptlycalledbigdata.Bigdataisthetermusedtodescribelargedatasetsgeneratedbytraditionalbusi-nessactivitiesandfromnewsourcessuchassocialmedia.Typicalbig data includes information from store point-of-sale terminals, bank ATMs,FacebookpostsandYouTubevideos.
One of the apparently attractive features of big data is simply its size, whichsupposedlyenablesdeeperinsightsandrevealsconnectionswhichwouldnotappearinsmallersamples.Thisargu-mentneglectsthepowerofstatistics,andinparticularinferential statistics. A small sample, properly collected, can yield superiorinsightstoaverylargepoorlycollectedsample.Thinkofitthisway:whichisbetter:averylargesampleinwhichalltherespondentsareinthesameage-groupandofthesamegender;orasmalleronewhichmoreaccuratelyreflectsthepopulation?
4.2Someusefulsites
YoucanofcourseeasilyjustGooglefordata,orlookatthesemorefocusedsites:
Gapminderdata:freetousebutbesuretoattribute¹Workeremploymentandcompensation²
Interestingandwide-ranginghistorical data³GooglePublicData⁴
¹http://www.gapminder.org/data/²http://www.bls.gov/fls/country/canada.htm³http://www.historicalstatistics.org/
⁴http://www.google.com/publicdata/directory
InternationalMonetaryFund⁵
FoodandAgricultureOrganisation⁶UnitedNationsData⁷
TheWorldBank⁸
⁵http://www.imf.org/external/data.htm
⁶http://www.fao.org/statistics/en/
⁷http://data.un.org/
⁸http://databank.worldbank.org/data/home.aspx
5.Testingwhetherquantitiesarethesame
This chapter concerns testing whether the population means of two or morequantitiesarethesameornot:andiftheyareinfactdifferent,whetheranyvariablecanbeidentifiedasbeingassociatedwith thedifference.The test we will use isANOVA,(AnalysisofVariance).ThetestwasdevelopedbytheBritishstatisticianSirRonaldFisher,andtheF-testwhichANOVAusesisnamedinhishonor.FisheralsodevelopedmuchofthetheoreticalworkbehindexperimentdesignduringhistimeattheRothamstedResearchStationinEngland.
5.1ANOVASingleFactor
ThemoststraightforwardapplicationofANOVAiswhenwesimply want to testwhetherornottwoormoremeansarethesame.Inthisworkedexample,wehavethreedifferenttypesofwheatfertilizer(Wolfe,WhiteandKorosa)andwewouldliketoknowwhethertheirapplicationproducesequalordifferentyields.
AsthesectiononexperimentaldesigninChapter3emphasized,theexperimentmustbe designed to isolate the effect of the fertilizer, and this is achieved byrandomization. To control for differences in site-specific growing qualities, weselectplotsoflandwhichareassimilaraspossible:exposuretosunlight,drainage,slopeandother relevantqualities.Werandomlyassignfertilizerstoplots,andtheyieldsaremeasured.Thefirstfewlinesofatypicaldatasetappearbelow.
23
Thefertilizerdataset
There are two columns in the dataset: the factor , which is the name of thefertilizer,andtheyieldattributedtoeachplot.Inthiscase,eachfertilizerwastestedonfourplots,providing3x4=12observations.Later,whenusingExceltoruntheANOVA test, it will be necessary to change the format so that the yields aregroupedundereachfactor.
Agoodfirststepistovisualizethedata.HereisaboxplotdrawnwithTableau.
Fertilizer boxplot
Thedatasetishere:fertilizerdataset¹
andhereisaYoutubeofcreatingaboxplotinTableau².
¹https://dl.dropboxusercontent.com/u/23281950/fertilizer.xlsx ²http://youtu.be/3QohthWXp1M
I’llmakeafrankadmissionrightnow:ittookmethebestpartofamorningtogetthetechniqueofdrawingaboxplotinTableaudown, sohopefully thisYouTubewillhelpyoutoavoidspendingsomuchtimeonthis.
The boxplot reveals these useful measurements: the smallest and largestobservation,ortherange.Themedian,whichisthehori-zontallinewithinthebox;andthe25%quartile(upperlineofthebox;and75%quartile,lowerlineofthebox.Therefore50%(75-25)of thedataarecontainedwithin the box. The ‘whiskers’markoffdatawhich are outliers,meaning that any datapointwhich is outside awhiskerisanoutlier.Thisdoesn’tmeantosaythatit issomehowwrong:perhapssomeofthemostinterestingdiscoveriescome fromlookingatoutliers.However,an outliermight also be the result of carelessdataentryandshouldthereforebechecked.Itisclearfromtheplotsthattheyieldsarebynomeansthesame.
Inthisexample,wearemeasuringonlyonefactor :theeffectofthefertilizeronthemeanyieldofeachplot,andsothetestwewanttoconductissinglefactorANOVAwith a completely randomized design. It is completely randomized because theallocationoffertil-izertoplotwasrandom.Wewantanydifferencestobeduetothefertilizerandthefertilizeralone.
Becauseourtestiswhetherthemeansarethesameordifferent,thehypothesisis:
Ho : µ 1= µ 2= µ 3= .. =µ n
withthealternativehypothesisthatnotallthemeansarethesame. That isn’t thesameassayingthattheyarealldifferent;justthatat leastoneisdifferentfromtheothers.
Therejectionrulestatesthatifthepvaluewhichcomesoutofthe testissmallerthan0.05,thenwerejectthenullhypothesis.Ifwe reject the null, thenwemustacceptthealternativehypothesis.
WetestthehypothesisusingtheANOVASingleFactortoolwithinDataAnalysis.First,thetwo-columnstructureofthedata hastobetransformedintocolumnsforeachofthethreefertilizers.ThatiseasilyaccomplishedusingthePIVOTTABLEfunction.Thisyoutube³takesyouthroughtheprocessofchangingthestructureofthedataandrunningtheANOVAtest.Theresultarebelow.
ANOVAoutputforfertilizertest
Thekeystatisticisthep-value.Itismuchsmallerthan0.05andsowerejectthenullhypothesis.Themeansarenotallthesame.Theyaredifferent.ThesummaryoutputtellsusthatWhitehasthelargestmeanyieldandthisagreeswiththeboxplot.Inthiscase,separationofresultsintoarankingofyieldisrelativelyeasybecausetheyaresodistinct.UnfortunatelyExcellacksawayofeasilytestingwhetheranyotherpairarethesameordifferent.Allwecansayforsureisthattheyarenotallthesame.
Here is a slightlymore complicated and realistic example.You are designing anadvertisingcampaignandyouhavemodelswithdifferenteyecolors:blue,brown,andgreen,andalsooneshot in
³http://youtu.be/wevrSWYBl8U
whichthemodelislookingdown.Youmeasuretheresponsetoeach arrangement.Doestheeyecoloraffecttheresponse?Thedatasetiscalledadcolor⁴.Thefirstfewlinesarehere:
Thefirstfewlinesoftheadcolordataset
Thecoloristhefactor,andwe’llneedtousePIVOTTABLEtotabulatethedatasothattheeyecolorbecomesthecolumns.I’vedonethathere….
Theresultis
Theeyecolorresults
⁴https://dl.dropboxusercontent.com/u/23281950/adcolour.xls
Thepvalueis0.036184,smallerthan0.05.Asaresultwecanstatethatdifferentcolors are associated with different responses. Looking at the summary output,greenhasthehighestaverageat3.85974,sowecouldprobablyselectgreen.
5.2ANOVA:withmorethanonefactor
TheFertilizertestaboveshowedhowtotesttheeffectofasinglefactor.TheANOVAresultsshowedthatthemeanswerenot the same.By inspecting the boxplot andalsotheExceloutput, it isclearthatthevarietyWHITEhasahigheryield.Whatmight be interesting is finding whether another factor also has a statisticallysignificanteffectonyields,andwhetherthetwofactorsinteractedtogether.
ThecaseinquestioninvolvespreparationfortheGMAT,anexamrequiredbysomegraduate schools. The GMAT is a test of logical thinking and is therefore notdependentonspecificpriorlearning.Weknowthetestscoresofsomeapplicants,andwhetherthosestudentscamefromBusiness,EngineeringorArtsand Sciencesfaculties.Thestudentshadalsotakenpreparationcourses,ranging froma 3-hourreview to a 10-week course. The question is: did taking the preparation coursematter;anddidfacultymatter?Herewehavetwofactors:facultyandpreparationcourse.Thedataisalreadyarrangedincolumnsandsowecangostraightinwithatwo-waywithreplication.Lookatthedata⁵.Itisreplicatedbecause there are twosetsofobservationsforeachpreparationtype.Theoutputishere:
⁵https://dl.dropboxusercontent.com/u/23281950/testscores.xlsx
TestscoresANOVAoutput
Wehavethepreparationtypeintherows,andtherelevantpvaluethereis0.358,sowefailtorejectthenull.Wecannotsaythatthereisanydifferenceinscoresasaresult of preparation course type; in other words, there is no effect on scoresresultingfrompreparationcoursetype.However,forfaculty(incolumns) there isdistinctdifference,withapvalueof0.004.Fromthesummary output, it looks asthoseintheEngineeringfacultyhadthehighestscore.
6.Regression:Predictingwithcontinuousvariables
This chapter is about discovering the relationships that might or (equallywell)mightnotexistbetweenthevariables in thedataset ofinterest.Here’satypicalifrathersimplisticexampleof regression in action: the operator of some deliverytruckswantstopredictthe average time taken by a delivery truck to complete agivenroute.Theoperatorneedsthisinformationbecausehechargesbythehourandneedstobeabletoprovidequotationsrapidly.Thecustomerprovidesthedistancetobedriven,andthenumberofstopsenroute.Thetaskistodevelopamodelsothatthe truck operator can predict the time takengiven that information.Regressionprovides a mathematical ‘model’ or equation from which we can very quicklypredictjourneytimesgivenrelevantinformationsuchasthedistance,andnumberofstops.Thisisextremelyusefulwhenquotingforjobsorforauditpurposes.
Weknow the sizeof the inputvariables—thedistance andthedeliveries,butwedon’tknowtherateofchangebetweenthemandthedependentorresponsevariable:whatistheeffectontimeofincreasingdistancebyacertainamount?Oraddingonemorestop?Usingthetechniqueof regression ,wemakeuseofa setofhistoricalrecords,perhaps the truck’s log-book, tocalculate the average time takenby thetrucktocoveranygivendistance.Aswithanyattempttopredictthefuturebasedonthepast,thepredictionsfromregressiondependonunchangedconditions:nonewroadworks(whichmightspeedupordelaythejourney),nochangeintheskillsofthedriver.
31
In the language of regression, the time taken in the truck example isthe response or dependent variable, and the distance to be driven is theexplanatory or independent variable. While we will only ever have just oneresponsevariable, thenumberof explanatoryvariablesisunlimited.Thenumberofstopsthetruckhastomakewillimpactjourneytime,aswillvariablessuchastheageofthetruck,weatherconditions,andwhetherthejourneyisurbanorrural.Ifweknow these variables, we also can include them in themodel and gain deeperinsightsintowhatdoes(andequallyimportantdoesnot)affectjourneytime.
Regressionmodelsareusedprimarilyfortwotasks:toexplainevents in thepast;and tomakepredictions.Regression providescoefficientswhichprovidethesizeanddirectionofthechangein theresponsevariableforaoneunitchangeinoneormoreoftheexplanatory variables.
Regressionmakesaprediction forhowlonga truckwill take to make a certainnumberof deliveries and also states how accurate thepredictionwillbe.Witharegression model in hand, we can make quotations accurately and quickly. Inaddition,wecandetect anomaliesinrecordsandreportsbecausewecancalculatehowlongajourneyshouldhavetakenundervariouswhat-ifconditions.
Here I have used a straightforward example of calculating a timeproblemforatrucking firm, but regression is much more powerful than that. Some form ofregressionunderliesagreatdealofappliedresearch.Whenyoureadaboutincreasedriskduetosmoking(orwhateverelseisthelatestdanger)thatriskwascalculatedwith regression. In this chapter,we’ll calculate the differencebetween usedandnewmarioKartsalesonebay,estimatestrokeriskfactors,andgenderinequalitiesinpay,allwithregression.
Regression isnotverydifficult todo,but theproblemis thateverybodydoes it.Ordinary Least Squares (OLS), linear regression’s proper name, rests on someassumptionswhichshouldbecheckedforvalidity,butoften aren’t.We cover theassumptions,andwhat
todoifthey’renotmet,inthefollowingchapter.Thischapterisaboutthehands-onapplicationsofregression.
6.1Layoutofthechapter
The chapter begins with some background on how regression works. We’llillustratethetheorywiththetruckingexamplemen-tionedabove,beforegoingontoaddingmoreexplanatoryvariablestoimprovetheaccuracyoftheprediction.
Particularly useful explanatory variables are dummy variables,which take on acategoricalvalues,typicallyzeroor1.Forexample,wecouldcodeemployeesasmaleandfemale,anddiscoverfromthiswhetherthereisagenderdifferenceinpay,andcalculatetheeffect. Inaworkedexample in the text,weanalyze the salesofmarioKartonebay,andusedummyvariablestofindtheaveragedifferenceincostbetweenanewandausedversion.
Sometimestwovariablesinteract:higherorlowerlevelsinonevariablechangetheeffectofanothervariable.Inthetext,weshowthattheeffectofadvertisingchangesasthelistpriceoftheitemincreases.
Finally,we’lldiscusscurvilinearity .Therelationshipbetweensalaryandexperienceisnon-linear.Aspeoplegetolderandmoreexpe-rienced,theirsalariesfirstmovequicklyupwards,andthenflattenoutorplateau.Wecancapturethatnon-linearfunctiontomakepredictionmore accurate.
6.2Introducingregression
Atschoolyouprobablylearnedhowto calculate the slopeofa line as‘riseoverrun’.Let’ssayyouwanttogotoParisforavacation.Youhaveuptoaweek.Therearetwomainexpenses,theairfareand thecostofaccommodationpernight.Theairfareisfixedand
staysthesamenomatterwhetheryougoforonenightorseven.Thehotel(plusyourmealchargesetc)is$200pernight.Youcouldwriteupasmalldatasetandgraphitlikethis:
TotalParistripexpenses
Theequationforis:Totalexpenses=1000+200*Nights
Basically,youhave justwrittenyour first regressionmodel. Themodelcontainstwocoefficientsofinterest:
• theinterceptwhichisthevalueoftheresponsevariable(cost)whentheexplanatoryvariableiszero.Heretheinterceptis
$1000.Thatistheexpensewithzeronights.Itisthepointon theverticalyaxiswherethetrendlinecutsthroughit,wherex=0.Youstillhavetopaytheairfareregardlessofwhetheryoustayzeronightsormore.
• theslopeof the linewhich is$200.Foreveryoneunit increase in theexplanatoryvariable(nights)thedependent variable (total cost) increasesbythisamount.
In regression, we usually know the dependent variable and the independentvariable, but we don’t know the coefficients.Weuse regressionto extract thosecoefficientsfromthedatasothatwecanmakepredictions.
Howdoweknowwhethertheslopeoftheregressionlinereflectsthedata?Trythisthoughtexperiment:ifyouplottedthedataand
thetrendlinewasflat,whatwouldthatmean?Norelationship.You couldstay inParisforever,free!Aflatlinehaszeroslope,andsoanextranightwouldnotcostanymore.
More formally, thehypothesis testingprocedure tests thenullhypothesisthat theslopeiszero.Ifwecanrejectthenull,thenwearerequiredtoacceptthealternativehypothesis,whichisthattheslopeisnotzero.Ifitisn’tzero(thetrendlinecouldslopeupordown)thenwehavesomething toworkwith.We test the hypothesisusingthedatafoundfromthesample,andExcelgivesusapvalue.Ifthepvalueissmallerthan0.05,werejectthenullhypothesis.Ifwerejectthenullwecansaythatthereisaslopeandtheanalysisisworthwhile.
The Paris example had only one independent variable (number of nights) butregression can includemanymore, as the trucking examplebelowwillshow.Ifthereismorethanonevariable,wecannotshowtherelationshipwithonetrendlineintwodimen-sions.Instead,therelationshipisahyperplaneorsurface.Theplotbelow shows the effect on taste ratings (the dependent variable) of increasingamountsoflacticacidandH2Sincheesesamples.
3Dhyperplaneforresponsestocheese
Hereisanotherexample.Wehavethedataonthesizeof thepopulationofsometowns and also pizza sales. How can we predict sales given population. PizzaYouTubehere¹
6.3Truckingexample
The records of some journeys undertaken are available in an Excel file named‘trucking’².
Takealookatthedata.AsImentionedatthebeginningofthebook, itisalwaysagoodideatobeginwithatleastasimpleexploratoryplotofthevariablesofinterestinyourdata.WithExcelwecanquicklydrawaroughplotsothatwecanseewhatisgoingon.Whatwewanttodoispredicttimewhenweknowthedistance.The
¹https://www.youtube.com/watch?v=ib1BfdVxcaQ²https://dl.dropboxusercontent.com/u/23281950/trucking.xlsx
scatter plot below shows time as a function of distance.YouTube: scatterplot oftrucking data³
Trucking scatterplot
Time is the dependent variable and distance is the independent variable. Theindependentvariablegoesonthehorizontalx axis,thedependentvariableontheverticalyaxis.
Fromtheplotitisclearthatthereisapositiverelationshipbetweenthetwovariables.Asthedistanceincreases,thensodoesthetime.Thisishardlyasurprise.Butwewantmorethanthis:wewanttoquantifytherelationshipbetweenthetimetakenandthedistance traveled.Ifwecanmodel this relationship thatwill be usefulwhencustomersaskfor quotations.
Thedatasetcontainsonefurthervariable,whichisthenumberofdeliveriesontheroute.We’llusethatlateron.Fornowwe’llusejustthetimeandthedistance.
Tobuildourmodel,wewanttobuildanequationwhichlookslikethisinsymbolicform
³https://www.youtube.com/watch?v=HSpY12mLXoU
y = b o + b 1 x 1+…b n x n +e
yhat(spoken‘yhat’) is the dependent variable. It is called ‘hat’ because it isanestimate. b0 is the intercept, or the point where the trend line (also called theregression line) passes through the vertical y axis. This is the point where theindependentvariableiszero.Youperhapsrememberasimilarequationfromschool:y=mx
+b.
Inthetruckingexample,theintercept(b0)mightbethe timetakenwarmingupthetruckandcheckingdocumentation. Timeis running,but the truck isnotmoving.b1 is the ‘coefficient’fortheindependentvariablebecauseitprovidesthechangeinthedependentvariableforaone-unitchangeinthe independentvariable.x1istheindependentvariable,inthiscasedistance.Wewanttobeabletopluginsomevalueoftheindependentvariableandgetbackapredictedtimeforthatdistance.
Thebandthexbothhaveasubscriptof1becausetheyarethefirst(andsofaronlyindependentvariable).Ihaveputinmorevariables just to indicate thatwe couldhavemany.
Thecoefficient,b1,providestherateofchangeofthedependentvariableforaoneunitchangeintheindependentvariable.Youcanthinkofthisastheslopeoftheline:asteeperslopemeansagreaterincreaseinyforaone-unitincreaseinx.
WecannowfindtheestimatedregressionequationusingtheregressionapplicationinExcel.First,weusetheregressfunctioninExcel’sdataanalysistooltoregressdistanceontime.YouTube⁴
Theregressionoutputisasbelow:
⁴https://www.youtube.com/watch?v=xKsYfa7YGgE
Theregressionoutput
Ihavemarkedsomeofthekeyresultsinred,andIexplainthembelow.
6.4Howgoodisthemodel?—r-squared
Thereisaredcirclearoundtheadjustedr-squaredvalue,here0.622.r-squaredisameasureofhowgoodthemodelis.r-squaredgoesfrom0to1,witha1meaningaperfectmodel, which is extremelyrareinappliedworksuchasthis.Zeromeansthereisnorelationshipatall.Theresulthereof0.62meansthat62%ofthevariancein the dependent variable is explained by themodel. This isn’t bad, but it’s notgreat either. Below,we’ll add more independent variables and show how theaccuracy of the model improves.Theadjusted r-squared takes into account thenumberofvariablesintheregressionequationandalsothesamplesize,whichiswhyit is slightly smaller than the r-squared value. It doesn’t have quite the sameinterpretationasr-squared,butitisveryusefulwhen comparing models. Like r-squared,wewantashighavalueaspossible:higherisbetterbecauseitmeansthatthemodelisdoingamoreaccuratejob.
Alsocircledarethewordsinterceptanddistance.Thesegiveusthe
coefficientsweneedtowritethemodel.ExtractingthecoefficientsfromtheExceloutput,wecanwritetheestimatedregressionmodel.
y =1 . 274+0 . 068∗Dist
whereyhatisthe estimated or predicted response time. If you took avery largenumberofjourneysofthesamedistance,thiswouldbetheaverageofthetimetaken.
Theinterceptisaconstant,itdoesn’tchange.Itistheamountoftimetakenbeforeasinglekilometerhasbeendriven.Thedistancecoefficientof0.068 is thepieceofinformationthatwe reallywant.Thisistherateofchangeoftimeforaoneunitchange in the dependent variable, distance. An increase of one kilometer indistanceincreasesthepredictedtimeby0.0678hours,and vice-versaofcourse.Dothemathandyou’llseethattheaveragespeedis14.7mph.
6.5Predictingwiththemodel
Amodel such as this makes estimating and quoting for jobs much easier. If amanagerwantedtoknowhowlongitwouldtakeforatrucktomakeajourneyof2.5kmsforexample,allheorshewouldneedtodoistoplug2.5intodistanceandget:
predictedtime=1.27391+0.06783*2.5=1.4435hours.
multiplythisresultbythecostperhourofthedriver,addthemiscellaneouschargesandyou’redone.
Itisunwisetheextrapolate.Onlymakepredictionswithintherangeofthedatawithwhichyoucalculatedyourmodel(seealsoChapter4onthis.
6.6Howitworks:Least-squaresmethod
Themethod we just used to find the coefficients is called themethodof leastsquares .Itworks like this: the software tries to find a straight line through thepointsthatminimizestheverticaldistancebetweenthepredictedyvalueforeachx(thevalueprovidedbythetrendline)andtheactualorobservedvalueofeachyforthatx.Ittriestogoascloseas itcan toallof thepoints.The verticaldistanceissquaredsothattheamountsarealwayspositive.
Theplotbelowisthesameastheoneabove,exceptthatIhaveaddedthetrendline.
Theblacklineillustratestheerror
Theslopeofthetrendlineisthecoefficient0.0678.Thatistherateofchangeofthedependentvariableforaone-unitchangeinthe independent variable. Think “riseoverrun”.Iextendedthetrendlinebackwardstoillustratethemeaningofintercept.Thevalueoftheinterceptis1.27391,whichisthevalueofywhenxiszero.Thisisactually a form of extrapolation, and because we have no observations of zerodistance,thisisanunreliableestimate.
Nowlookatthepointwherex=100.Thereareseveralyvalues representingtimetakenforthisdistance.Ihavedrawnablackline
betweenoneparticularyvalueandthetrendline.Thisverticaldistancerepresentsanerrorinprediction:ifthemodelwasperfect, all theobservedpointswould liealongthetrendline.Themethodofleastsquaresworksbyminimizingtheverticaldistance. Itispossibletodothecalculationsbyhand,buttheyaretediousandmostpeopleusesoftwareforpracticaluse.Thegapbetweenpredictedandobservedisknownasaresidualanditisthe‘e’terminthegeneralformequationabove.
Theerrordiscussedjustaboveistheverticaldistancebetweenthepredictedandtheobserved values for every x value. The amount of error is indicated by the r-squaredvalue .r-squaredrunsfrom0(acompletelyuselessmodel)to1(perfectfit).
Intheglossary,underRegression,Ihavewrittenupthemaththatunderpins theseresults.
6.7Addinganothervariable
Ther-squaredof0.66wefoundwithoneindependentvariableisreasonableinsuchcircumstances, indicating that our model explains 66% of the variability in theresponsevariable.Butwemight dobetter by adding another variable to explainmoreofthevariability.
Thetruckingdatasetalsoprovidesthenumberofdeliveries that the driver has tomake.Clearly,thesewillhaveaneffectonthetimetaken.Let’sadddeliveriestotheregressionmodel.
Notethatyou’llneedtocutandpastesothattheexplanatoryvariablesareadjacenttoeachother.Itdoesn’tmatterwheretheresponsevariable is,but theexplanatoryvariablesmustbe adjacentinoneblock.Thenewresultisbelow:
Nowwithdeliveries
Notethatther-squaredhasincreasedto0.903, so the newmodel explains about25%moreofthevariationintime.Theimprovedmodelis:
y =− 0 . 869+0 . 06∗Dist+0 . 92∗Del
A few things to note here: the values of the coefficients have changed. This isbecausetheinterpretationofthecoefficientsinamultiplemodellikethisisbasedononlyonevariablechanging .Forexample, thecoefficientofdistanceis0.92,almostonehourforeachdeliveryassumingthatthedistancedoesn’talsochange.Weshouldinterpretthecoefficientsundertheassumptionthatalltheothervariablesare‘heldsteady’,apartfromthecoefficientofinterest.
6.8Dummyvariables
Above, we saw how adding a further variable has dramatically improve theaccuracyofamodel.Adummyvariableisanadditional variablebutone thatweconstructourselvesasa resultofdividing data into two classes, for example bygender.Dummyvariablesare
powerful because they allow us to measure the effect of a binaryvariable, known as a dummy or sometimes indicator variable. Adummyvariabletakesonavalueofzeroorone,andthuspartitionsthedataintotwoclasses.Oneclassiscodedwithazero,andiscalledthereference group . The other classes are coded with a one andsuccessive numbers.
Theremay bemore than two groups, but therewill always be onereference group. W eare generating an extra variable, so theregressionequationlookslike this:
y =b 0+b 1 x 1+b 2 x 2
In the case of observations which have been coded x1=0 (thereference group), then the b1 term will disappear because it ismultipliedbyzero.Theb2termremains.Forthereferencegroup,theestimatedregressionequationthensimplifiesto
y ˆ reference=b 0+b 2 x 2
whileforthenon-referencegroup,itis
y ˆ nonreference=b 0+b 1 x 1+b 2 x 2
Thesizeofb1x1represents thedifferencebetweenthe averagesizeofthereferencelevelandwhatevergroupisthenon-referencegroup.Let’sworkthroughanexample.Creatingadummyvariable⁵
Thedataset[‘gender’](https://dl.dropboxusercontent.com/u/23281950/gender.xlsx)containsrecordsofsalariespaid,yearsofexperienceandgender.
Wemightwantto knowwhethermen andwomen receive the samesalarygiventhesameyearsofexperience.Loadthedata,then run alinear regression of Salary on Years of Experience, using years ofexperienceasthesoleexplanatoryvariable.Theresultisbelow:
⁵https://www.youtube.com/watch?v=TBJsEb2UCPs
Thegenderregression
The interpretation is that the interceptof$30949 is the aver- age starting salary,withyearsofexperiencezero.Thecoefficient
$4076meansthateveryadditionalyearofexperienceincreasestheworker’ssalarybythisamount.
Ther-squaredis0.83,sothesimpleone-variablemodel explainsabout83%ofthevariation in the dependent variable, salary. We have one more variable in thedataset,gender.Addthisasadummyvariableandruntheregressionagain.Notethatyouwillhavetocutandpaste thevariablesso that theexplanatoryvariables areadjacent.
Withthegenderdummyadded
Ther-squaredhasincreasedtonearly1,soournewmodelwiththeinclusionofgenderisveryaccurate.Theimportantnewvariableisgender,codedasfemale=0andmale=1.Nothingsexistaboutthis,wecouldequallywellhavereversedthecoding.Ifyouarefemale,thenthemodelforyoursalaryis:23607.5+4076*Years
ifyouaremale,thenthemodelforyoursalaryis23607.5+114683.67+ 4076Years
Eachextrayearofexperienceprovidesthesamesalaryincrease,butonaveragemalesreceive$14683.67moreearnings.
Hereisanotherexample,usingthemaintenancedataset.Dummyvariable⁶
Anotherdummyvariableexampleandacautionarytale
Themariokartdatasetcamefrom[OpenIntroStatistics](www.openintro.org)awonderfulfreetextbookforentrylevelstudents.Thedataset
⁶https://www.youtube.com/watch?v=Yv681upodDI
contains informationon thepriceofMarioKart in ebay auctions.First let’s testwhethercondition‘new’or‘used’makesdifference.ConstructanewcolumncalledCONDUMMY,codednew=1andused=0.Nowrunaregressionwithyournewdummyvariableagainsttotalprice.Theoutputisbelow
Theconditiondummyresults
Thisresultistremendouslybad.Thepvalueis0.129,meaningthatconditionisnotastatistically significant predictor of price. Surely the conditionmust have somesignificant effect?Wait—we forgot to do some visualization. To the right is ahistogramofthetotalpricevariable.
Lookslikewemighthaveaproblemwithoutliers…someof the observations aremuchlargerthantheothers.Takeanotherlookwithabarchartatthehigherprices.
marioKarttotalprices
Wecanidentifytheseoutliersusingzscores(seetheGlossary).Orwecouldjustsortthembysizeandthenmake
avaluejudgementbasedonthedescription.That’swhatIhavedoneinthisYouTube.Outlierremovaland regres-sion⁷
Thetotalpricesasabarchart
It seems that two of the items listed were for grouped items which were quitedifferentfromtheothers.Thereisthereforealegitimate reasonforexcludingthem.Belowarethenewresults:
⁷https://www.youtube.com/watch?v=ivLGteHuu3Q
CorrectedmarioKart output
Theestimatedregressionequation is
y =42 . 87+10 . 89∗CONDUMMY
Theconditiondummywascodedasnew=1,used=0.Ifamariokartisused,thenitsaveragepriceis42.87,ifitnewthentheaveragepriceis42.87+10.89.Theaveragedifferenceinpricebetweenoldandnewisnearly$11.Makessense.
Take-home:checkyourdatabeforedoingtheregression.
6.9Severaldummyvariables
ThegenderandthemarioKartexamplesabovecontainedjustonedummyvariable.But it is possible to havemore. For example, your sales territory contains fourdistinctregions.Ifyoumakeoneofthefourthereferencelevel,andthendivideupthedatawithdummies
for the remaining three regions, you can compare performance in each of theregionsbothtoeachotherandtothereferencelevel.
Thedatasetmaintenancecontainstwovariableswhichyoucanconverttodummies,followingthisYouTube.
MaintenanceRegression⁸
6.10Curvilinearity
Sofar,wehaveassumedthattherelationshipbetweenthedepen-dentvariableandtheindependentvariablewaslinear. However,inmanysituationsthisassumptiondoes not hold. The plot below showsethanolproduction inNorthAmerica overtime.
NorthAmericanethanolproduction.Source:BP
ThesourceofthedataisBP.Itisclearthatproductionofethanolisincreasingyearlybutinanon-linearfashion.Justdrawingastraightlinethroughthedatawillmisstheincreasingrateof production.Wecancapturetheincreasingratewithaquadraticterm,which is simply the time element squared. I have created a new variablewhichindexestheyears,whichIhavecalledt,andafurthervariable
⁸https://www.youtube.com/watch?v=xWdcT7u9YFE&feature=em-upload_owner
whichistsquared.
urvilinearregressionYouTube⁹
Thedatasetwithanindexfortime.Source:BP
Aregressionoftagainstoutputhasanr-squaredvalueof0.797.Inclusionofthetsquaredtermincreasesther-squaredmarkedlyto
0.98.Theregressionoutputisbelow.
Regressionoutputwiththequadraticterm
Wewouldwritetheestimatedregressionequationas
y =4504−1301 t+194 . 9 t 2
Noticethatforearlysmallervaluesoft,theeffectofthequadratic
⁹https://www.youtube.com/watch?v=jgjWSpyPBqg
termisnegligible.Astgetslargerthenthequadraticswampsthelineartterm.
Makingaprediction .Let’spredictethanolproductionafter5years.Then
y =4505−1301∗5+194 . 9∗25=2872 . 5
Theplotbelowshowspredictedagainstobservedvalues.While the fit is clearlyimperfect,itiscertainlybetterthanastraightline.
Predictedagainstobservedethanolproduction.
6.11Interactions
Increasingthepriceofagoodusuallyreducessalesvolume(al- thoughof courseprofitmightnotchangeifthepriceincreasessufficientlytooffsetthelossofsales.Advertisingalsousuallyincreasessales,otherwisewhywouldwedoit?
What about the joint effect of the two variables? How about reducingthepriceandincreasing theadvertising?The jointeffect is called an interaction and caneasilybeincludedintheexplanatoryvariablesasan extra term.Theoutput belowistheresultof
regressingsalesonPriceandAdsforaluxurytoiletriescompany.Theestimatedregressionmodel is
y =864−281 P+4 . 48 Ads
Regressionoutputforjustpriceandads
Sofarsogood.Thesignsofthecoefficientsareaswewouldexpectfromeconomictheory.Salesgodownaspricerises(negativesignonthecoefficient).Salesgoupwithadvertising(positivesign onthecoefficient).Caution:donotpaytoomuchattention to the absolute size of the coefficients.The fact that price has a muchlargercoefficientthanadvertisingisirrelevant.Thecoefficientisalsorelatedtothechoiceofunitsused.
Nowcreateanothertermwhichispricemultipliedbyads.Callthis termPAD.Thefirstfewlinesofthedatasetarebelow.
Firstfewlineswiththenewinteractionvariable
Nowdotheregressionagain,includingthenewterm.
Inclusionoftheinteractionterm
The new term is statistically significant and the r-squared has increased to0.978109.Thenewmodeldoesabetterjobofexplainingthevariability.Howcomepriceisnowpositiveandtheinteractionterm is negative?We have to look at theresults as a whole rememberingthatthesignsworkwhenalltheothertermsare‘heldconstant’.Theexplanation:asthepriceincreases,theeffectofadvertisingonsalesisLESS.Youmightwanttolowertheprice andseeiftheincreasedvolumecompensates.
Anotherinteractionexample
Here’sanotherexample,thisonerelatingtogenderandpay. Thedatasetiscalled‘paygender’ and contains information on the gen- der of the employee, his/herreview score (performance), years of experience and pay increase. We want toknow:
•Isthereagenderbiasinawardingsalaryincreases?
•Inthereagenderbiasinawardingsalaryincreasesbasedontheinteractionbetweengenderandreviewscore?
First,let’sregresssalaryincreasesonthedummyvariableofgenderandalsoReview.Myresultsarebelow(IhavecreatedanewvariablewhichIhavecalledG.Itisjustthegendervariablecodedwithmale
=0andfemale=1.
GdummyandReview
Nothingverysurprisinghere:womengetpaidonaverage233.286 lessthanmen.And—holdinggendersteady–eachpoint increase in Reviewgives an increase insalaryof2.24689.
How do I know thatwomen are paid less than men? The estimated regressionequationis:
y =204 . 5061 − 233 . 286 ∗ ( x=1 ifF )−233 . 286∗( x=0ifM )+2 . 24689∗Review
Rememberhowwecodedmenandwomen?Thex=0ifMtermdisappears,soweareleftwiththisequationformen:
y =204 . 5061+2 . 24689∗Review
andthisoneforwomen(I’vedonethesubtraction)togive
y =− 28 . 779+2 . 24689∗Review
Conclusion:thereisagenderbiasagainstwomen.ForthesameReviewstandard,onaveragewomenarepaid233.286lessthanmen.
How about the second question — possibility that the gender biasincreases with Review level? Wecan test this with an interactionvariable.MultiplytogetheryourgenderdummyandtheReviewscoretocreateanewvariablecalledInteract.Thendothe regression againwith salary regressed against G, Review and Interact. My output isbelow.
Interactionoutput
Formen,theequationisSalary=59.94472+4.848995(becausetheGandInteracttermsgotozerobecausewecodedMale=0.SoforeveryextraincreaseinReview,aman’ssalaryincreasesby4.848995.
Forwomen,theequationisSalary=59.94472-29.714-1*(4.848995-4.05468)soforeveryone increase inReview,awomen’ssalary increasesbyonly 4.848995-4.05468=0.79. Notice that the adjusted r- squared for thismodel is higher (it is0.927339comparedto0.810309) thanthemodelwithjustGandReview.So theinteractionisbothstatisticallysignificant,pvalue0.000316)andhasthemeaningthatforwomen,improvingtheReviewscoredoesn’tincreasethesalaryasmuchasformen.
ConclusionTheanalysisshowedthatwomenarebeingtreatedunfairly.Thereisagenderbiasagainsttheminaveragesalariesfor thesameperformancereview;andeach increase in reviewpoints earn themconsiderably less insalary thanfor theequivalentmale.Caution: thisconclusion isbasedsolelyon the limitedevidenceprovidedbythissmalldataset.
6.12Themulticollinearityproblem
Multicollinearityreferstowayinwhichtwoormorevariables ‘explain’ thesameaspectof thedependentvariable.Forexample, let’ssaythatwehad a regressionmodelwhichwasexplainingsomeone’ssalary.Employeestendtogetpaidmoreastheygetolderandalsoastheiryearsofexperienceincreases.Soifwehadbothageand years of experience in the list of independent variables we would almostcertainlysufferfrommulticollinearity.
Multicollinearitycanleadtosomefrustratingandperplexing re-sults.Yourunaregression.Oneofthevariableshasapvaluelargerthan0.05,soyoudecidetotakeitout.Youruntheregressionagainand—-thesignand/or significanceofanothervariables changes. This happens because the two variables were explaining thesameaspectofthedependentvariablejointly.
Howtosolvethisproblem:checkthecorrelationoftheindependentvariablesfirst,beforeputtingthemintoyourmodel.Chapter14hasasectiononcorrelation.Ifyoufindthatthecorrelationofanytwovariablesishigherthan0.7,besuspicious.Thesetwomaybringyoursomegrief!
6.13Howtopickthebestmodel
Youwillbetryingoutdifferentformulations,addingandremovingvariablestotrytocaptureasmuchexplanatorypowerasyoucan.Howyoudodecideifonemodelisbetterthananother?Therearetwoapproaches:
• Lookonlyattheadjustedr-squaredvalue,evenif yourmodel containsvariableswith a p value larger than 0.05. If the adjustedr-squaredvaluegoesup,leavesuchvariablesin.
•Pruneyourmodelsothatitcontainsonlyvariableswithapvalue<=0.05.Youstillwantashighanadjustedr-squaredas
youcanget,butyoualsowantallyourexplanatoryvariablestobestatisticallysignificant.
Itakethelatterapproach.Youusuallyhavefewervariablesbutallof themhaveareason(statisticalsignificance)forbeinginyourmodelandyoucaninterprettheirmeaningintelligently.Aparsimoniousmodelisbetter thanacomplexmodel thatfitsyourdataverywell.Simpleandrobustisgood.Thedatasetthatyouusedtofityourmodelisonlyasamplefromapopulation.Anoverlycomplexmodelmaynotworkwellwhenpresentedwithadifferentsamplefromthepopulation.
6.14Thekeypoints
• Think throughyourmodelbeforeyoustart including vari-ables.Whatvariablesdoyouthinkwillhaveaneffectonthedependentvariableandinwhichdirection(plusorminus).Itistemptingtojustputineverythingandhope for the best but this rarely works. Some software is able to dostepwiseregression,pullingoutinsignificantvariablesforyou, butExcelisnotoneofthem.
• Keepyourmodelassimpleaspossible.Complicatedmodels rarelyworkwell.
• Visualiseyourdatafirst,evenwithasimplescatterplotaswehavedonethroughoutthischapter.
•Checkwhetheryouhaveallthevariablesthatyoumightneed.Ifyouweretrying topredictwhetherashopselling expensive jewellerywouldmakesufficientsales,youmightwanttoknowtheaverageincomeofresidents.Ifyoudon’thaveit—getit.Thereisahugeamountofdatalyingaboutwhichyoucanobtain either free or quite cheaply. I’veincludedaverybrieflistofURLsinChapter12.
6.15Workedexamples
1.Theestimatedregressionequationforamodelwithtwoindependentvariablesand10observationsisasfollows:
y =29+0 . 59 x 1+4 . 9 x 2
Whataretheinterpretationsofb1andb2inthisestimatedregres-sion equation?
Answer:thedependentvariablechangesbyonaverage0.59whenx1changebyoneunit,holdingx2constant.Similarlyforb2.
Predictthevalueoftheindependentvariablewhenx1is175andx2is290:
y =29+0 . 59(175)+4 . 9(290)=1553 . 25
7.Checkingyour regressionmodelItisn’tdifficulttobuildaregressionmodelasthepreviouschapterhasshown.Butthatispartoftheproblem:everybodydoesit.Butnoteverybodytakesthetroubletochecktheresults.Checkingtheresultsisanimportantstepfortworeasons:first,tomakesureyourcalculationsandpredictionsarecorrect;andsecondlytoshow thirdparties thatyourwork is solid. In this chapterwe’llworkondoing just that. Todecidewhetherthemodelswehavebeenworkingonareanygoodweneedtolookattwoareas:
•Isyourmodelstatisticallysignificant?
•Aretheassumptionsbehindtheleastsquaresmethodmet?Inparticular,linearityandresidualdistribution.
Thefirstareaiseasiertoworkthroughthanthesecond,whichisprobablywhyonedoesn’talwaysseeresidualanalysisdiscussedwhenresultsarepresented.Thisisashamebecausethereisagreatdealtobelearnedfrompickingthroughresiduals.Youranalysiswillbegreatlyimprovedbyacloseattentiontothisarea.
Belowwe’llworkthroughstatisticalsignificanceandthentesttheassumptions.
7.1Statisticalsignificance
Theregressiontrendlinedisplaysarateofchangebetween twovariables.Thelineslopesupwardswhenthedependent variableincreasesforaone-unitincreaseinthe
independentvariable(
61
positive relationship) and vice-versa (negative relationship). What we want toknowis this:did thatslopeupwardsordownwards occur by chance; ifwe tookanothersamplefromthepopulationwouldweachieveasimilarresult?Keypoint:weare usuallyworkingwithasampledrawnfromapopulation.Thesamplein thetruckingexamplewasthefirm’slogbook.
Thenullhypothesisisthatthereisnoslope;withthealternativethatthereisinfactaslope.Thehypotheses(seetheGlossaryforhelponhypotheses)totestwhetherthereisastatisticallysignificantslopeare:
Thenull:
H o : β =0
andthe alternative:
H a : β = 0
where
β
isthesymbolforthecoefficientsofthepopulationparameter of the independentvariableofinterest.Inwords,thehypothesisistestingwhetherbetaiszeroornot.Ifit is zero, then we are ‘flatlining’ and there is no point in continuing with theanalysis.Ifhoweverwecanrejectthenull,thenweacceptthealternativewhichisthatbetaisnotzero.Itmightbenegativeorpositive,thatdoesn’tmatter:whatmattersiswhether it is zero or not. Note that we use a Greek letterwhendiscussingapopulationparameterwhichwillprobablyneverbeaccuratelyknown.Weuse theLatin letter ‘b’ when we have estimated it using a sample drawn from thepopulation.
Exceldoesallthetestingofstatisticalsignificanceforusintwoways.
First,itteststheindividualvariablesandprovidesapvalue.Belowistheresultfromthefirsttruckingregressionwedid.Notethatthepvaluefordistanceis0.004.Itiscustomarytouseacut-offof0.05.Themeaningofthisresultisthat,followingtherejectionrule ,werejectthenullifthepvalueissmallerthan0.05.Because0.004issmallerthan0.05,werejectthenullwe therefore conclude that there is in factastatisticallysignificantslope.Insummary,wetestedthehypothesis that the slopewaszero,andrejectedit.Thereforeweacceptthealternativewhichisthatthereisaslopeofsomesort.
Thetruckingregressionoutput
Thistest tellsusthat thereisastatisticallysignificantslope.Itdoesn’ttellusthesignoftheslope(upordown)oritsmagnitude.Butthefactthatthereisasignificantslopeisimportantinformation.Itisworthcarryingonwiththeanalysisbecausethevariablesactuallymeansomething.Bytheway,ignorethepvaluefortheintercept.Itusuallyhasnosubstantivemeaning.
ThesecondwaythatExcelteststhevalidityofthemodelistotestwhetherthemodelasawholeissignificant.ThetestiscalledtheFtestafterthestatisticianRAFisher.Forthetruckingexample,lookundersignificanceF.Theresultisapvalue,inthiscasethesame
pvalueforthevariableDISTANCE.Wehaveonlyoneexplanatoryvariableandsothepvaluewillbethesame.Wehopethatthepvalueissmallerthan0.05,whichenablesustoconcludethatatleastoneoftheslopesisnon-zero.
The section above examined the first of the tests, which was for the statisticalsignificanceofthemodel.Nowwemoveontothesecondarea.
7.2Thestandarderrorofthemodel
ThestandarderrorinExcelisfoundjustundertheadjustedr-squaredoutput.Itistheestimatedstandarddeviationoftheamountofthedependentvariablewhichisnotexplainablebythemodel.Inotherwords,itisthestandarddeviationoftheresiduals.
Undertheassumptionthatthemodeliscorrect,itisthelowerboundonthestandarddeviationofanyofthemodel’sforecasterrors.Thefigurebelowshowstheoutputofregressionestimatingriskofastroke(multipliedby100)againstbloodpressureandage,withadummyvariableforsmokingornot.
Exceloutputforstrokelikelihood
Thestandarderroris6(roundedup).Foranormally-distributedvariable,95%ofthe observations will be within two standard deviations of the mean. For ourpurposesthatmeansthat2x6=12shouldbeaddedorsubtractedtothepredictedvaluetofindaconfidenceintervalforpredictions.Theestimatedregressionequationfromthestrokemodelaboveis:
y =− 91+1 . 08 Age+0 . 25Pressure+8 . 74 Smokedummy
Animaginarypersonwhosmokes,isaged68andhasa bloodpressureof175willhaveariskof34(allfiguresrounded).Wecanbe95%confidentthatthisestimatewillfallsomewherebetween34-12and34+12or22to46.Theseareratherwideconfidenceintervalsandsowemightwanttoworkatimprovingthemodelbyforexampleincreasingthesamplesizeoraddingmoreexplanatoryvariables.
7.3Testingtheleastsquaresassumptions
Therearetwokeyassumptionsthattheleastsquaresmethodreliesonwhichweneedtocheck.Aftercheckingwe’llworkthroughwaysofretrievingthesituationshouldtheassumptionsbeviolated.Theassumptionsare:
• Therelationshipbetweenthedependentandtheindependentvariablesislinear.Fortunatelythisiseasytocheckandalsotofixifitisn’tsatisfied.
• The residualshaveanon-constantvariance. (You’ll some- timesseeinotherstatisticstextbooksotherstricterrequire-ments,suchastheresidualsbeingnormallydistributedandwithameanofzero.Formostpurposesjustchecking for non-constant variance is enough). What doesnon-constantvariance mean? We’ll work through this by defining a resid- ual andidentifyingwhetherornottheassumptionhasbeenmet.Andfinallywhattodoaboutit.Butfirst,checkingforlinearity.
Checkingfor linearity
Therelationshipbetweenthedependentvariableandtheindepen- dentvariableisassumedtobelinear.Thisisimportantbecausethemodelgivesusarateofchange:thecoefficientshowsthechangeinthedependentvariableforaone-unitchangeintheindependentvariable.Iftherelationshipisnon-linear,thatcoefficientwill notbevalidinsomeportionsoftherangeofthevariables.
Wewanttoseeastraightline(eitherupordown)onascatterplot.Putthedependentvariable on the vertical y axis, and one of the independent variables on thehorizontal x axis. If you include the trendline,youcanobservehowclosely theobservedvaluesmatchthepredicted.
Wecanalsocheckforlinearitybyplottingthepredictedvaluesand theobservedvalues.Theplotbelowdoesthisforriskofstroke:
Predictedagainstobservedrisk
Thisisareasonableresult,indicatingthatthelinearityconditionismet.Mostofthepointsareinastraightline,althoughIwould be concerned about the lower risklevels,especiallyaroundrisk=20.Thereappearstobeconsiderablevarianceatthispoint.
7.4Checkingtheresiduals
Aresidualisthedifferencebetweentheobservedvalueofyforanygivenxvalue,andthepredictedvalueofyforthatsamexvalue.Itisthereforeapredictionerror,andisgiventhenotationefortheGreekletter‘epsilon’.Itistheverticaldistancebetweentheactualandpredictedvaluesofthedependentvariableforthesamevalueoftheindependentvariable.Wediscussedthisinthepreviouschapter inconnectionwiththeleastsquaresmethod,wherewewrotetheestimationequation:
y = b 0+b 1 x
Thehatony,thedependentvariable,indicatesthatitisanestimateofthedependentvariable.Weknowthattheestimatecannotbecorrectunlessallthepredictedvaluesandtheobservedvaluesmatchupexactly.Iftheydon’t,thenthereareresiduals.Ifwecalltheerrorsε(epsilon)whichistheGreeklettermatchingourlettere,thenwecanrewritetheregressionequationas:
y=β o+β 1x+ε
Theerrorshavenowbeenabsorbedintoepsilonandsowecanremovethehatfromy.Itisthebehaviorofepsilonthatisofinterest, becausetheleast-squaresmethodrestsontheassumptionthatthe errors absorbed into epsilon have a non-constantvariance. Thismeansthattherenorelationshipbetweenthesizeoftheerrortermandthesizeoftheindependentvariable.Therefore,ifweplottedtheerrorsagainsttheindependentvariables,weshouldseenoclearpattern.Howtodo this isdescribedbelow.
7.5Constructingastandardizedresidualsplot
Excelprovideswhatiscallsstandardresidualsaspartofitsregres-sionoutput,andwewillusethese.Notehoweverthatthesearenot‘true’standardizedresiduals,buttheyareprobablycloseenough.MakesureyouchecktheStandardizedResidualsboxwhensettingupyourregression.
Checkthestandardizedresidualsbox
Wewant toplot thestandardresiduals against the predicted value yhat.Wewillcreateanewcolumn to the left of the columnStandard Residuals, and copy thecolumnofpredicted times into that new column. Then create a scatter plot ofpredictedtimeandstandard residuals.Youshould end upwith the image below.Standardizedresidualplot¹
¹https://www.youtube.com/watch?v=1wuyZfB39p4
Standardresiduals
TheYaxisindicatesthenumberofstandarddeviationsthateachresidualisawayfromthemean,plottedagainstthepredictedvaluesonthehorizontalaxis.Noneoftheresidualsaremorethat2standarddeviationsawayfromthemeanofzero,sotheresultsaregenerallysatisfactory,althoughthereisoneobservationinexcessof1.5.Westill have a worrying fan shape which would seem to indicate non-constantvariance:wecanobserveapattern.
Let’sruntheregressionagain,butnowincluding the second vari- able,which isdeliveries.Theresidualsplotisobtainedinthesameway,andhereistheresult.
Standardizedresidualswithtwovariables
Thefanproblemseemstohaveimprovedalittle.Ther-squaredof themodelwithtwovariableswashigher,meaningthattheresidualsweresmaller (because there islessunexplainederror).
7.6Correctingwhenanassumptionisviolated
Above,weexaminedpossibleviolationsoftwooftheassumptionsunderlyinglinearregression:linearityandnon-constantvariance.Nowlet’slookatwhattodowhenthese assumptions are found to havebeenviolated.Firstsomegoodnews: linearregressionisquiterobusttosuchviolations,andevensotheyarequiteeasytocorrectfor.We’ll deal with the problems in the same order: linearity and non-constantvariance.
7.7Lackoflinearity
The firstcheck is to lookat a scatter plot of the dependent variable against theindependentvariable.Theplotbelowshows volume of sales and length of timeemployed,togetherwithatrendline.
Acurvilinearrelationship
Thereareseveraldifferentwaysinwhichalinearrelationshipcanbeachievedsothat we can use the least squares method. A common method is to include aquadraticterm.Thismeansadding thesquaredvalueoftheindependentvariabletothelistofindependent variables.This iseasilydonebycreatinganothercolumn,consistingofsquaredvaluesofthefirstvariable.
Anewcolumnoftheindependentvariablesquared
Todothis,createa newcolumnand then label it.Clickon the first value in thevariablewhichyouwanttosquare,andaddˆ2.Enter. Then drag downwards. Theplotbelowshowsthepredicted andobservedsales.Themodelisclearlysuperior.
Predictedandobservedafterinclusionofquadraticterm
Ifthedependentvariablehasalargenumberoflowvalues,andisheavilyskewedtotheright,thenagoodsolutionistotransformthedependentvariableintoitsnaturallogarithm.
7.8Whatelsecouldpossiblygowrong?
Regressionisaverycommonlyusedanalyticaltoolandyouaremostlikelyeithergoingtouseityourselforexaminetheworkofotherswhohaveusedregression.Below is a list of common mistakes to watch for. And if I have made themsomewhereinthisbook,I’msureyou’llbethefirsttoletmeknow!
7.9Linearitycondition
Regressionassumesthattherelationshipbetweenthevariablesis linear: the trendlinethatthesoftwaretriestofindtominimizethesquareddifferencebetweentheobservedandpredictedvaluesisstraight.Soifyoutrytorunaregressiononnon-lineardata,you’llgetaresultbutitwillbemeaningless.
Action :alwaysvisualizethedatabeforedoinganyanalysis.Ifyouseethatthedataisnon-linear,youmaybeabletotransformitusing the techniquesdescribed in thepreviouschapter.Asanexample: thecurvilinearexamplewhichwastransformedbyincludingaquadratictermasanexplanatoryvariable.
7.10Correlationandcausation
Regressionisaspecialcaseofcorrelation,andasweallknowcorrelationdoesn’tmeancausation(seetheGlossary).Inregression,nomatterhowgoodthemodel,allthat we have been able to show is that a change in an explanatory variable isassociatedbyachangeinthedependentvariable.Forexample,yourecordthehoursyouputintostudyingandyourgrades.Surprise!Morestudying=bettergrades?Orperhapsnot….couldhavebeenabetter instructor.Wecannotsaythatonecausedtheother.Sowhenwritingupresultsorinterpretingthoseofothers,beverycarefulnottoclaimmorethanyouareableto.
There is a related problem which is ‘reverse causation’. You study more, yourgradesgoup.Temptingtothinkthatonepossiblycausedtheother.Butperhapsitistheotherwayround?Yourgradeswerepoor,youstudiedharder?Oryouhadgoodgradesandthenledyouintotakingthecourseseriously.
Action : try not to include explanatory variables which are affected by thedependentvariable.Youcouldalsotryto‘lag’onevariable.Cutandpastethehoursofstudyingvariablesothatitisonetime-periodbehindthegradesandthenruntheregressionagain.
7.11Omittedvariablebias
Under Correlation in the Glossary I give some examples of lurking variables.Regression, like correlation, is susceptible to the same problem.Example: younoticethatinhotterweathertherearemore
deathsbydrowning.Didthehotterweathercausethedrownings?Wellno,theextraswimmingcausedbytheheatpresentedmoreriskscenarios.Thelurkingvariableishoursspentswimmingratherthantemperature.Trytogettotherealvariableifyoucan.
7.12Multicollinearity
Whenpredicting the price of a house, square footage and number of roomsarelikelytobehighlycorrelatedbecausetheyarebothex-plainingthesamething.Thisproblemresultsinunusualbehaviorintheregressionmodel.
Action : correlate your explanatory variables BEFORE doing any regression.Watchoutifyouhaveapairwhichhasanrvalueof
0.7orhigher.Thisdoesn’tmean thatyou shouldn’tuse them, but it’s ared flag.Thisiswhatmighthappen.Youtakeoutononethevariablesina regression, theonethat’sleftreversesitssignorsuddenlybecomesstatisticallyinsignificant.
Ifyoudoruntheregressionandyougetanunlikelyresult,choosethevariablewiththehighesttvalue(orsmallestpvalue)andditchtheother.
7.13Don’textrapolate
Thecoefficientsfromtheregressionarecalculatedbasedonthedatayouprovided.Ifyoutrytopredictforavaluebeyondtherangeof thatdata, the results will beunreliableifnottotallywrong.
In theyearsofexperienceandsalaryexample inChapter5, theco- efficient foryearsofexperienceprovidedthechangeinsalarythatanextrayearofexperiencewouldgive.Wouldyoufeelcomfortablepredictingthesalaryofsomeonewith95yearsofexperience?
Action :beforerunningthenumbers,checkthattheinputsarewithintherangeforwhichyoucalculatedthemodel.
8.TimeSeries IntroductionandSmoothingMethods
Atimeseriesisasetofobservationsonthesamevariablemeasured atconsistentintervalsoftime.Thevariableofinterestmightbemonthlysalesvolume,websitehitsoranyotherdataofrelevance to thebusiness.Using timeseriesanalysiswecan detect patterns and trends and–just possibly–make forecasts. Forecasting istrickybecause (ofcourse!)weonlyhavehistoricaldata togoonand there isnoassurance that the same pattern will repeat itself. Despite all this, time seriesanalysishasdevelopedintoahugetopic, fortunatelywithalargenumberoffreelyavailabledata-setstouse.Linkstosomeofthesedata-setsareprovidedattheendofthedatachapter(Chapter3).
There are two basic approaches: the smoothing approach and the regressionmethod .Bothmethodsattempt to eliminate the background noise. This chaptercoverssmoothingmethods, thefollowingchaptertheregressionapproach.
8.1Layoutofthechapter
Herewewill
1.identifythefourcomponentsthatmakeupatimeseries
2.introducethenaive,movingaveragemethodsandexponen-tiallyweightedsmoothingmethodstomakeaforecast
3.discusswaysinwhichtheaccuracyoftheforecastscanbecalculated
76
8.2TimeSeriesComponents
There are four potential components of a time series. One or more of thecomponents may co-exist. The components are: trend compo- nent; seasonalcomponent;cyclicalcomponent;randomcomponent.
•trendcomponent :theoverallpatternapparentinatimeseries.Theplotbelow shows French military expenditure as a percentage of GDP. Thedownwardtrendisapparent.
FrenchmilitaryexpenditureasapercentageofGDP
• seasonalcomponent :salesofskisorChristmasdecorationsareclearlyhigherinthewintermonths,whileicecreamandsuntancreammovemorequicklyinthesummer.Ifweknow the seasonal component, thenwe canuse this information for prediction. The time between the peaks of aseasonalcomponentisknownastheperiod .Noteherethatseasondoesn’tnecessarilymeantheseasonoftheyear.
TheplotbelowshowssalesofTVsetsbyquarter.HerethereisaconstantupwardtrendbecausemorepeoplearebuyingTV sets,butthereisalsoaseasonaleffect.
TVSalesbyquarter
•cyclicalcomponent :sometimeserieshaveperiods last-inglongerthanone year.Business cycles such as the50- year KondratieffWave are anexample.Thesearealmost impossibletomodel,but that doesn’t stop thehopefulfromtrying.KondratieffhimselfendedupavictimofJosephStalinbecausehissuggestionthatcapitalisteconomieswentinwavesunderminedStalin’sviewthatcapitalismwasdoomed,andhewasexecutedin1938.
•randomcomponent :therandomcomponentisjustthat:thenoiseorbitsleftoverafterwehaveaccountedforeverythingelse.Thetell-talesignsofarandom component are: a con- stant mean; no systematic pattern ofobservations;constantlevelofvariation.
8.3Whichmethodtouse?
In the next two chapters, we’ll work through the two basic ap- proaches: thesmoothingapproachandtheregression approach.Howtodecidewhichtouse?
Firststepasusualis tomakeasimplescatterplotofyourdata,with timeon thehorizontalxaxisandthevariableofinterestontheverticalyaxis.Ifthe result issomethingresemblingastraightline, forexampletheFrenchmilitaryexpenditure,thenuseasmoothingapproach.Ifthereisevidenceofsomeseasonality,suchaswiththeTVsalesdata,thenregressionisthewaytogo.Thisisespeciallythecaseifyouwanttocalculatethesizeoftheseasonaleffect.
8.4Naiveforecastingandmeasuringerror
Forecastingisjustthat:aneducatedguessaboutwhatmighthappenatsomepointinthefuture.Forecastsarebasedonhistoricalevents, andwe are hoping that somepatternofbehaviorwillrepeatitselfwhichwillmaketheforecast‘true’.Thebestthatwecandoisto tryoutdifferentforecastingmethodsandworkout howwelltheypredicteddatawhichwealreadyknewabout.Thenwetakealeapintothedarkandhopethatthebestofthosemethodswilldoagoodjobondatathatwedon’tknowabout.Thissectionconcernsthemeasurementofforecastinga c c u r a c y .
We’regoingtoconstructanaiveforecastandusethatforecastto demonstratetwocommonmethodsofaccuracychecking:theMean SquaredError(MSE)and theMeanAbsoluteDeviation(MAD)approaches.
Anaiveforecastassumesthatthevalueofthevariableatt+1willbethesameasatt.Thevaluesarejustcarriedforwardbyonetimeperiod.Hereisanexample:
ImageGaspriceswithnaiveforecast
The naive approach is sometimes surprisingly effective, perhaps because of itssimplicity.Ingeneralsimplemodelsperformwellperhapsbecauseoftheirlackofassumptions about the future.Now we will measure the accuracy of the NaiveForecastwithMADandMSE.Inbothmethodstheerroriscalculatedbysubtractingtheforecastorpredictedvaluefromtheobservedvalue.Themethodsdifferinwhatisdonewiththoseerrors.
MADmeasurestheabsolutesizeoftheerrors,sumsthemandthendividesbythenumberofforecasts.Theabsolutevalueoftheerrorisjustitssize,withoutthesign.InExcel, you can find an absolute value with =ABS(F1 - F3) where FI is theobservedvalueandFEthepredictedvalue.Usethelittlecorneroftheformulaboxtodrag it down. Then find the sum and divide by the n, which is the number ofobservations.Inmaththisis
MAD=
∑( abserror )
n
InMSE,theerrorissquaredbeforebeingsummedanddivided.
Squaring the error removes the problem of the negative numbers, but createsanotherone:largeerrorsaregivenmoreweightingbecauseofthesquaring.Thiscandistorttheaccuracyoftheresults.ThemathsfortheMSEisbelow
MSE=
∑( error )2
n
Theplotbelowshowstheerrorsassociatedwiththenaiveforecast-ing,theabsolutevaluesoftheerrorsandtheMAD.InExcel,use
=abs()toconverttoanabsolutevalue.
ImageErrorsandMeanAbsoluteDeviation
8.5Movingaverages
Slightlymorecomplexthanthenaiveapproachisthemovingaverageapproach,whichreliesonthefactthatthemeanhasless
variance than individualobservations.By focusingon themean,we can see thegeneraltrend,lessdistractedbynoise.
Theanalystdecideshowfarbackintimetheaveragewillgo.Thelongerbackintime,thenthemorevaluesoftheobservation are averaged, resulting in a flattermore smoothed result. This is good for observing long-term trends, but lesssatisfactoryifyouareinterestedinmorerecenthistory.
The number of observations which are to be averaged is known as k, and thisnumber is chosenby theanalyst basedonexperience. In Excel’s Data AnalysisToolpakthereisaMovingAveragetoolwhichcandoallthework.Positioningtheoutputisslightlytricky.Therearektimeperiodsbeingusedforthecalculationofthe first forecast;sothefirstforecastshouldbeatk+1becausektimeperiods arebeingusedtofindthefirstvalue.Whereexactlytoputthefirstcelloftheoutputisabitfiddly.Iftherearektimeperiodsbeingused, placethefirstcell at k-1.Excelprovidesachartoutput,examplebelow.ThereisaYouTubehere¹
ImageMovingaveragewithk=3forgasprices
Themovingaveragetechniqueisusedbyinvestorstodetectthe‘goldencross’andthe‘deathcross’.Theyexamine50dayand200daymovingaverages.Whenthe50daymovingaverageis above
¹http://youtu.be/zrioQOWfxjY
the200daymovingaveragethispointistheso-calledgoldencrossandasignaltobuy.Bycontrast,whentheoppositehappensthisis the‘deathcross’…timetogetout!
8.6Exponentiallyweightedmovingaverages
The exponential smoothingmethod (EWMA) is a further step forward from thenaiveandthemovingaverageapproaches.Themovingaverageforgetsdataolderthanthek timeperiodsspecified, while the EWMA incorporates both the mostrecentandmorehistoricalobservationstoconstructaforecast.IntheGlossaryyou’llfindmoredetailedmathshowinghowtheEWMAbringsforwardallpasthistory.Ingeneralthefurtherbackintimeyouget,thelessinfluencetheobservationshave.
UsingEWMAwechooseasmoothingconstant,alpha,whichsetstheweightgivento themost recent observation againstprevious forecasts. Ifwe thought that theforecast(t+1)wouldbesimilartothecurrentobservation(t=0)andwethoughtthatolderforecastswereof littlevalue, thenwewouldchooseahighvalueofalpha.Alpharunsfrom0to1.TheEWMAformulais
F t +1=αY t+(1−α ) F t
where F is the forecast, andY is the actual value of the variable. alpha is thesmoothingconstant,ranginginvaluebetween0and1,andchosenbytheanalyst.
Inwords, this equationmeans that the forecast value (at t+1) is the smoothingconstantalphamultiplyingtheactualvalueat t=0plus(1-alpha)timestheforecastatt=0.Soifthepreviousforecastwastotallycorrect,theerrorwouldbezeroandtheforecastwouldbeexactlywhatwehavetoday.Itturnsoutthattheexponential
smoothingforecastforanyperiodisconstructedfromaweightedaverageofallthepreviousactualvaluesofthetimeseries.
Theequationaboveshowsthatthesizeofalpha,thesmoothing constant,controlsthebalancebetweenweightinggiventothemostpreviousobservationandpreviousobservations.Ifthealpha issmall,thentheamountofweightgiventoYat t=0issmallandtheweightgiventopreviousobservationsislargeandvice-versa.Ifalpha=1,thenpreviousobservationsaregivennoweightatall,andweassumethat thefutureisthesameasthepast.Thisachievesthe same result as naive forecastingdiscussedabove.Typically quitesmallvalueofalphaareused,suchas0.1.
##ApplicationofExcel’sexponentialsmoothingtool
An exponential smoothing tool is available in the Analysis ToolPak. We’lluseCanadian Gross Domestic Product per capita as an example. Here is the dataplottedonitsown.Youtube²
CanadianGDPPerCapita
Thereisasteadyupwardstrend,butwithadipin2009duetotheworldfinancialcrisis.Excelasksforadampingfactor.Thisis1-alpha.Ihavedonetheforecaststwice,oncewithanalphaof0.1
²http://youtu.be/w-GmxjrX1qg
andagainwithanalphaof0.9.Thereforethedampingfactorswere
0.9and0.1.Theresultforalpha=0.9isshownbelow.
Forecastwithalpha=0.9
This is pretty good forecast, with the forecast tracking the actual observationsclosely.
Theplotbelowshowssheepnumbersascountsbyhead inCanada from1961to2006.First,let’splotthisusing0.3asalpha(arbitrarilychosen).Theplotisbelow,withadampingfactorof1-0.3=0.7.
Sheepnumberswithalpha=0.3
9.TimeSeriesRegressionMethodsThesmoothingmethodsweworkedinthepreviouschapterarefinewhenthereisnoevidenceofseasonality,forexamplewith the sheepdata.However,smoothinghas only limited use for longer term prediction and analysis. Data that is moreinterestingforusas analystsmaybenon-linear in trend, andalso have seasonalpeaks and troughs. Using regression, we canmeasure the quantitative effect ofseasonalityandtrend,eithertogetherorseparately.Theresultisamodelwhichcanbeusedforprediction.
Belowwewill:
1.useregressiontoquantifyatrendinatimeseries
2.introduceaquadratictermtoaccountfornon-linearityinthetimeseries
3.usedummyvariablestomeasureseasonality
9.1Quantifyingalineartrendinatimeseriesusingregression
TheplotbelowshowslifeexpectancyatbirthforCanadians.
87
Canadianlifeexpectancy
This is pretty much a straight line as one might expect in adevelopedcountrywithalargepercapitaexpenditureon healthcare.Notethatthisisforbothsexes:forwomenonlywemightexpecttoseeevenbetterfigures.IrantheregressionwithYearastheindependentvariable,explaininglifeexpectancy.Theresultisbelow.
Lifeexpectancyregressionresults
NoticethattheadjustedR-squaredvalueisclosetounity,reflecting
thestraightlinethatweseeonthegraph.Thecoefficients fortheinterceptandtheindependentvariableprovidethisestimatedregressionequation
y t=58 . 47+0 . 000565∗YearThemeaning is that for everyyear after 1960 that a childwas born, his/her lifeexpectancyincreasedby0.000565years,orabout5hours.
9.2Measuringseasonality
ThedatasetforTVsales(fromModernBusinessStatistics byAndersonSweeneyandWilliams)containssalesbyquarterforfouryears.There are therefore sixteenobservations.
IcreatedacolumnwhichIcalledindexsothatwecanseesales ina consecutivefashion.ThefirstquarterisSpringandthelastquarter iswinter.Itisapparentthatquarterswhicharedivisiblebyfourarehigherthanothers,andsoforth.Perhapssalesarehigherinwinter?
WecanusethedummyvariablemethodofChapter5todeterminewhetherthisistrueandalsotheextentofthedifference.WewillcreatedummiesforSummer,FallandWinter,leavingSpringasthereferencelevel.Recallthatwehavefourpossiblestatesoftheseasonvariable,andsowewillneedk-1,or4-1=3dummies.Itusuallydoesn’tmatterwhich state you choose as your reference level: just don’t forgetwhichoneyoupicked.Thedataset is below,with a columncalled Index, whichwe’llusetomeasurethetimetrend.
TVSaleswiththequarterlydummies
Iwanttofindouttwothings:
whethersalesareincreasingovertime.Icandothisbyincludingtheindexasanexplanatoryvariable.Becausetheregressiontool requiresthatalltheindependentvariablebeinoneblock,IhavecopiedandpastedtheIndexcolumntotheright.
Theregressionoutput is
TVSalesregressionout
Let’swalkthrough this output line by line. First notice that the p values for allindependentvariablesaresmallerthan0.05.Thereforeallaresignificant.
ThereferencelevelisSpring,andthereforethereisnocoefficientfor this quarter.Summer has a negative sign in front of its coefficient, meaning that sales forSummerare smaller than those forSpring.Fallispositive,somoreTV sales aresold in the Fall than in the Spring. Not unexpectedly, Winter has the largestcoefficientofall,
reflectingtheplot.Theindexhasapositivesign,meaningthataverageTVsalesareincreasingwithtime.
Therearethereforetwocomponentsinthisseries:atrendandaseasonalcomponent.Youtube¹
**Makingpredictions**fromtheseresults.Let’spredictthesalesforFallinsevenquarterstime.Thisis
y =4 . 7+1 . 05875+7∗0 . 147=6 . 78
Theobservedvaluewas6.8,sothepredictionwasreasonableifslightly pessimistic.
9.3Curvilineardata
Thedataabovefollowedalineartrend,whichisperfectforordinary least squares,which assumes that the relationship between the dependent variable and theindependent variable is linear. But some interesting data does not follow a neatlineartrend.Theplotbelowshowscrudeoilandgaspricesovertheperiod1949to2003.
¹http://youtu.be/wyKIHInMbY8
Crudeandgaspricesshowingcrudepricecurvilinearity
However,ifweshowthetwoseriesonthesamegraph,asontheplotIdidinTableau,withseparateyaxesforeachtypeoffuel,wecanseethattheshapesareveryclose.
Crudeandgasondifferentaxes
Thecurvilinearityofcrudepricesisclear.Thepricescametoapeak
inthe1970sasaresultofOPEC’sdecisiontorestrictsupply. Inany event,crudepricescannot be described as linear. The solution is to add an extra term to theregressionof price on time, and that is time squared. The first few lines of thedatasetarebelow.Ihaveaddedacolumncalledtorepresenttheyear,andaddedtsqwhichisjusttsquared.
Datawithtimesquared
Nowregresscrudeagainsttime and time squared, and the pleasing result belowappears
Thecurvilinearregressionforcrude
The adjusted r-squared is high at 0.88 and the two time predictors are highlysignificantstatistically.Noticethatthetwopredictorsarequitedifferentinsizeandalsohaveoppositesigns.Whentissmall,thenthevariabletdominatesandthetrendis upward. However, as t gets larger, then t-squared gets even larger still. Thenegativesignont-squaredpullstheregressionlinedown.
10.OptimizationLinearProgramming (LP) is the tool we use to optimize a particular objectivefunction .Forexample,amanufacturerofcarpetswantstogetthemostprofitoutof his rawmaterials and labor force. The objective function is the relationshipbetween the inputs and his costs and thus his profits. Optimization helps themanufacturerfindthemostefficientallocationofresources.Theobjectivefunctiondoesn’tnecessarilyhavetobeintermsofmoney;itcouldverywellbetoallocateworkinghoursefficientlysothateveryonehasmore timeoff.Frequentlywewanttomaximize theobjective function(perhapsmakemoreprofit)butwemightalsowanttominimizeanobjectivefunction,suchascosts.
Optimizationisnotparticularlydifficult,andwewillusetheSolveradd-intoExcelto do themathematically tricky parts.What is important is thewriting up of acorrectmodelinthefirstplace.
Layoutofthechapter
Linearprogramming isnotespeciallydifficult,but theworkhas to bedone in alogicalandorderlyfashion.Inthischapter,wewill
*spendsometimeworking throughthebasicsteps in setting up an optimizationproblem *work through themeaning of the Solver output in termsofwhat thecoefficientvaluesactuallymean
10.1Howlinearprogrammingworks
Linearprogrammingworksbysolvingasetofsimultaneousequa-tions.Theproblemtobesolved—tomaximizeastockreturnfor
95
example—iswrittenasasetofsimultaneousequations.Theequa- tions may bequitesimple,buttheremaybemanyofthem.LinearProgrammingfindsthebestsolutiontotheequations.Thejobistofindthecoefficientsforthevariablesintheequationwhichmaximizeorminimizetheoutcome.
We’llworkthroughanexamplefirstonpaper,andthenputit intoExcel’sSolveradd-intogetthesolution.Thethreeimportantstepsare:
•Writethesimultaneousequations,basicallyputtingintomaththewordingoftheproblem.Thisisprobablythetoughestpart.
•RuntheOptimizationinSolver,tofindtheoptimalsolution.
• Sensitivity analysis to find out how much effect a unit change in aconstraintwouldhave.
10.2Settingupanoptimizationproblem
Optimizationproblemsareusuallypresentedaswrittenquestions.Thebestapproachistodeterminewhicharethe
•decisionvariablesthosevariableswhichyoucancontrol.Forexample,inthecarpetfactoryexample,thenumberofhoursgiventoeachworkeris adecisionvariable.
Itisusuallythesizeofthedecisionvariablethatwearetryingtooptimize.Fortheworkers,wewouldwantthemtohaveexactlythenumberofhourswhichproducesthe most profitable output. Too few hours and the factory is working belowcapacity.On theother hand, toomanyandmoneyisbeingwasted.The decisionvariablesgointowhatSolvercallsthechangingcells.Theconventionisto color-codethesecellsinExcelinredcolor.
• constraints are constants or givens which we cannot change. Thejurisdictionwhere thefactory is locatedmighthave legislation restrictingthemaximumnumber of hours that an employee canwork per day.Wewouldneedtoadd aconstrainingequationtotakeaccountofthisfact,evenifthesolutionwasfinanciallysub-optimal.
10.3Exampleofmodeldevelopment.
Afarmerhas50acresoflandathis/herdisposal.Healsohasup to 150hours oflabor and up to 200 tonnes of fertilizer.He can plant either cotton or corn or amixture.Cottonproducesaprofitof$400peracre,corn$200peracre.Eachacreofcottonrequires5laborhoursand6tonnesoffertilizer.Forcorntheequivalentis3and2.Howmuchlandshouldthefarmerallocatetoeachcrop?
Let’scallacresofcottonxandacresofcorny.FforfertilizerandLforlabor.Thesefourvariablesarehisdecisionvariablesbecausehecanallocatecropstolandandhecanalsodecidehowmuchlabor and fertilizer to deploy.He is searching for thecombinationofthefourdecisionvariableswhichmaximizeshisprofit.
Weknowhehas50acres,sothetotalacreagecannotexceedthisamount.Thisisaconstraint,whichcanbewrittenasanequation,asshownbelow.Otherconstraintsarethemaximumlaborandfertilizer.Obviouslyhecannotusemorethanhehas.
10.4Writingtheconstraintequations
Thetotalacreagedevotedtoeachcropcannotexceed50:
x+y≤50
Thelaborandfertilizerarealsoconstraints.Referbackto thequestiontogettheamountsoflaborandfertilizerperacrepercrop.
Forlabor,theconstraintequationis:
5 x+3 y≤150
Thisequationcomesaboutbecause for each acre of cotton, the farmer needs 5hours of labor,while for corn it is 3.Wedonotknowthevaluesofxandy,butwhatevertheyare,weknowthat 5xplus3ymust be less than or equal to 150,becauseweonlyhave150hoursavailable.
Forfertilizer,theconstraintequationis:
6 x+2 y≤200
10.5Writingtheobjectivefunction
Whatdowewanttogetoutofthis:whatistheobjective?Clearly it is the mostefficientmixtureofinputs,subjecttoconstraints,producingthelargestprofit.Theobjectivefunctionis:
MaxProfit=400x+200y
Inwords,findthecombinationofxandythatmaximizestheprofit, subject totheconstraintswehavewritten.ThenexttaskistoputallthisintoSolver.
10.6OptimizationinExcel(withtheSolveradd-in)
WewilluseanExcelspreadsheetandSolvertoachieveallthreesteps.
OptimizationYouTubeforthefarmerproblem¹
¹https://www.youtube.com/watch?v=WeTgK6wmvSY
OpenanExcelspreadsheetsothatyoucanfollowalong.
Cellcoloringconventions
Isuggestyoufollowtheseconventionsforcolor-codingthecells:
•Inputcells.Thesecontainallthenumericdatagiveninthestatementoftheproblem.ColorinputcellsinBLUE.
•Changingcells.Thevaluesinthesecellschangetooptimizetheobjective.CodechangingcellsinRED.
•Objectivecell.Onecellcontainsthevalueoftheobjective.ColortheobjectivecellinGREY.
NowI’llworkthroughthefarmingproblemabovestepbystep.
MakesureyouhavetheSolveradd-inloaded.Tocheck,openExcel, thenclickontheDataTab.IfSolverisloaded,youwillseetheSolvernametotheright.
TheSolvertab
BelowistheExcelspreadsheetwiththeinformationthatweknowalreadytypedin.Ihaveputrandomvaluesinthe red-coloredchangingcells,justasplaceholders.Thesenumberswill changewhenExcelsolvesforthemostprofitableallocation.
•TypetheaddressoftheobjectiveintotheSolverdialoguebox,andmakesuretheMaxradiobuttonisselected.
•Typeintherangeofthechangingcells.
•Workthroughtheconstraints.Therearethree:labor;fertil-izer;andland.
•SelectSimplexLPastheSolvingmethod.
•Presssolve.
Thefarmingsolution
Solverwillchangethevaluesinthechangingcellstomaximizetheobjectivecell.Ifoundthat30acresofcottonandnoneofcornprovidedaprofitof$12000.
10.7Sensitivityanalysis
Constraintscanbeeitherbindingorslack .Ifaconstraintisbinding,thatmeansthatallofthatparticularresourceisbeingused,andmorecouldbeemployedifitshouldbecomeavailable.Wecanfindoutwhetherconstraintsarebindingandalso theirshadow
pricebypressingSensitivityAnalysisafterrunningSolver again.TheSensitivityanalysiswill appear as a tab at the bottom of your worksheet. For the farmingproblem,itlookslikethis:
Farmingsensitivityreport
Let’sletattheConstraintssection.Laboruses150hoursandhasashadowpriceof$80.Theshadowpricesindicatestheper-unitvalueoftheconstrainedcommodityiftheconstraintwasincreasedbyoneunit.Ifwecouldhaveonemorehouroflabor,thentheprofitwouldincreaseby$80.Landandfertilizerhaveashadowpricesofzerobecausetheconstraint isslackornon-binding.We are not completelyusingtheseresourcesandsowedonotneedanyextrainputs.Therearesacksoffertilizerlyingaroundunused.Noneedtobuymore.
The columnsAllowableIncrease andAllowableDecrease are rele- vantbecausetheytellushowmuchmoreofabindingconstraintwecouldusebeforerunningthemodelagain.Forlabor,theamountis16.67(rounded).IfweeasedtheconstraintbymorethanthisamountthenwewouldneedtorunthenewmodelinSolveragain.Thesamelogicfordecrease.
10.8InfeasibilityandUnboundedness
Solverisquiterobust,buttwoproblemsmayoccur.Asolutionto anoptimizationproblemisfeasibleifitsatisfiesalltheconstraints.Butitispossiblefornofeasiblesolutiontoexist.Thisoccurs ifyoumakeamistake inwriting themodel,or themodelistootightlyconstrained.
Unboundedness
Unboundedness occurs when you have missed out a constraint. There is nomaximum(orminimum).Trychangingallconstraintsto>=insteadof=<.
10.9Workedexamples
Thedemandingmother
Mostmothersarekeentokeepcontactwiththeirchildren.Oneparticularmotherisratherdemanding.Sherequiresatleast 500minutesperweekofcontactwithyou.Thiscanbethroughtele-phone,visitingorletter-writing(yes!Somepeoplestilldothat!).Herweeklyminimumforphoneis200minutes,visiting40minutesandletters200minutes.Youassignacosttotheseactivities.Forphone:$5perminute;visiting$10perminute; letter-writing$20perminute.Howdoyouallocateyour time sothatyourcostistheminimum?
Writetheequationsfirst.Whatistheobjective?
Letxstandforminutesofphone;yminutesofvisiting;and zminutesofletters.Youwanttominimizethecostoftheseactivities, so theobjective is tominimise200x+40y+200z.Theconstraintsare:x+y+zmustbemorethanorequalto500.
Spreadsheetfordemandingmother
Noticethatwewanttominimizethetime,sochangetheradiobuttontominratherthanmax.Andwhenyouare typing in the constraints,checkthatthesignoftheinequalityistherightwayround.
Thetheatermanager
Youarethemanagerofatheaterwhichisinfinancialtrouble.Youhavetooptimizethecombinationofplaysthatyouwillputontomakethemostprofit.Youhavefiveplaysinyourrepertoire,A,B,C,D,E.Theyhavedifferentdrawpoints(appealtothepublic)andthereforeticketpricestomatch.Adrawpointishowattractive theplayistothegeneralpublic.Thedatalookslikethis:
Play Draw Ticket
A 2 20
B 3 20
C 1 20
D 5 35
Play Draw Ticket
E 7 40
Youhavetheseconstraints:youhaveonly40possibleslots.Andnoplaycanbeputonlessthantwiceormorethantentimes(givestheactorsareasonableturn).Eachperformance costs $10 per ticket sold, regardless of the ticket price (covers theelectricityetc).Thetotaldrawpointshastoexceed160tokeepthecriticshappy.
Howdoyoudistributeyourperformances?Whichcombinationofplaysproducesthehighestprofitandsatisfiestheconstraints?
Wearetryingtomaximizeprofit,solookforanobjective function thatdoesjustthat:20A+20B+20C+35D+40E.Butwait—-wehavethe$10cost.Sobettertakethatoutfirst,leaving10A+10B+10C+25D+30E.
Withthese constraints:
Thedrawpoints(theattractivenessoftheplaytocritics):2A+3B+1C+5D+7E>=160
andnoplaycanbeputonmorethantwiceormorethantentimes.A<=2andA<=10B<=2andB<=10C<=2andC<=10D<=2
andD<=10
andwehaveonly40slots,soA+B+C+D+E<=40
Spreadsheetfortheatreproblem
Moreworked examples
19thcenturyfarmer
Iama19thcenturyEnglishfarmer.I cangrowwheatorbarley.Wheatyields10bushelsanacre,barley8bushelsanacre.Thepriceofwheatistenshillingsabushel,barley5shillingsabushel.The labourcostsforwheatare3 shillingsanacre, forbarley2shillingsanacre.Thetransportationcosttomarketforwheatis1shillingabushel,forbarley½shillingabushel.Ihave100acres.(abushelisameasureofvolume;anacreisameasureofarea;ashillingisacurrencyunit).
Questions:HowdoIsplitupmyland?
IfIcouldgetonemoreacreofland,howmuchwouldthatbeworthtome?
Yetanotherfarmingquestion(Iusethesebecausemostpeoplecanimaginefieldsofland,cropsgrowingandthelike.Butofcoursethesame techniquesareapplicabletootherbusinesssituations).
Afarmerhas100acresofland.Hecanplantcropsorraisesheep. Eachhectareofcropsprovides$100butrequireslaborof$50peracre.Hecanspendamaximumof$500oncroplabor.Eachacreofsheepprovides$40butneedsonly$10 in laborcharges.Thelaborbudgetisunlimited.
Worked solution
CallareaincropsXandin sheepY.ThenX+Y=<100 and50X<= 500Theobjectiveis100X-50X+40Y-10Ysimplifiesto50X+30Y.MyExcelspreadsheetisbelow.
Spreadsheetfortheaboveproblem
11.Morecomplexoptimization
Optimization using Solver is a powerful method of solving common businessresource-allocationproblems.Inthepreviouschapter theproblemswererelativelysimpleconcerning theallocationof land ortime.Butlinearprogrammingcanbeusedtosolvemorecomplexandworthwhileproblemsaswe’llseebelow.Thekeyrequirement isthattheanalystisabletodefinetheproblemasasetofequations.Thereisnoonemethodexceptforthinkingcarefullyandwritingouttheproblemasasetofequations.
Todemonstrate,we’llworkthroughthreedifferenttypesofprob-lem:
*problems concerning proportionality: you need to allocate money to differentinvestmentswhileminimizing riskandkeeping returns above a certain amount.Whatproportiondoyouputineachinvestment?
*supplychain problems
*blendingproblemswhereyouneedtomixtogetherinputsfromdifferentsources
11.1Proportionality
Investmentdecisions example
Thisexampleshowshowwecan‘weight’theinputsaccordingtosomecriteria,inthiscasetheirrisk.Wewanttominimizetheriskbutensurethatthereturnisabovesomeminimumlevel.Whatisthemixtureorblendofinvestmentsthatcandothat?
107
The problem: you have an inheritance of $300,000 from an uncle but there aresomerestrictions:youmustinvestallthemoneyinfourfunds;yourannualreturnhastobeatleast5%.Andyoumustminimizeyourrisk.Thefourfundsyourunclehasspecifiedare:
Fund Risk Return
X1 10.7 4.2
X2 5 4
X3 6 5.6
X4 6.2 4
Let’sdealwiththeobjective function first.That is tominimize the risk.Wecanweightthesizeof the investment ineach fundwith its risk index.So,weightingeachinvestmentandthendividingbythe totalinvestmentgivestheamountoftherisk:whichisexactlywhatwewanttominimize.SowewanttominimizeR(forRisk)likethis:
R=10 . 7 X 1+5 X 2+6 X 3+6 . 2 X 4
300 ,000
Ifyoudon’tgetthis,thinkaboutwhatwouldhappenifall thefundswereasriskyasX1:thetotalriskwouldincrease.Anotherwaytothinkofthis:ifwedecreasethenumberofsharesinX1andinsteadincreasethenumberofX4,whatwillhappen?Theriskwilldecrease.
Theconstraints:thetotalinvestmenthastoaddupto$300,000,sothisconstraintis
X 1+X 2+X 3+X 4=300 ,000
Wealsohavetoachieveareturnofatleast5%(noeasymatter these days).Thisconstraintis
4 . 2 X 1+4 X 2+5 . 6 X 3+4 X 4 ≥5
Belowismyspreadsheetfromthisproblem:
Investmentblending
11.2Supplychainproblems
Workingouthowmuchtoshipfromproductioncenterstodemand locations is acommonprobleminsupplychainoptimization.AsIhavebeenstressing,thekeytosolvingproblemsofthistypeistowriteouttheequationswhichdefinethemodel.Hereisanexample:
You runacompanywhichhas bakeries in locationA and B. The bakeries shipcartons of bread to your retail stores at locations X, Y, Z. The bakeries havedifferent capacities and each retail store has different demands. The costs ofdeliverypercartonfromeachbakerytoeachstoreisbelow
Bakery X Y Z Capacity
A 12 13 11 150
B 9 17 17 200
Demand 50 100 90
Notation:adeliveryfrombakeryAtolocationXisAxandsoforth.
Theobjectivefunctionistominimizethedeliverycosts.Sowewant tominimize:12Ax+13Ay+11Az+9Bx+17By+17Bz
Eachbakeryhasafixedcapacity,soAx+Ay+Az<=150andBx+By+Bz<=200
Supplyhastoexactlymatchdemand
Ax+Bx=50Ay+By=100Az+Bz=90
Noticethestrictequalitysign.Myresultsarebelow:
Distributionproblem
11.3Blendingproblems
Abovewediscussedlinearprogrammingmodelswhichwere simplebuteffective.There are other types of Optimization model which are helpful, especiallyBlendingModels,whichcanalsobesolvedbylinearprogramming.
Blendingmodelsareusedinsituationswherewehavetwoormoreinputswhichhave tobemixed to some formula. Through Optimizationwecanfind themostprofitablemixture.Wine,metals,oil,sausages,recycledpaper—-thisisapowerfultechnique. Youcouldprobablyuseitformarketingcampaigns.We’llworkthoughanexamplewhichisforoil.
Theoilblendingproblem
Theproblem:anoilcompanyhas15000barrelsofCrudeoil1and20000barrelsofCrudeoil2onhand.Thecompanysellsgasolineandheatingoil.TheseproductsaremadebyblendingtogetherCrudeoil1andCrudeoil2.EachbarrelofCrudeoil1hasaqualitylevelof10,andeachbarrelofCrudeoil2hasaqualitylevelof5.Thegasolinethatweproducemusthaveaqualitylevelofatleast8.Theheatingoilmusthaveaqualitylevelofatleast6.Gasolinesellsfor$75abarrel,heatingoilfor$60.Howcanweblend theoils together in such a way that meets minimum qualityrequirementsandmaximizesprofit?
Oilblendingsolution
First,let’sthinkthroughwhatthedecisionvariables(whatgoesintothechangingcells)mightbe.Youmight very well think (as I did first off) that the decisionvariableswouldbetheamountsofthetwooilsusedandtheamountsproduced.Butthisisn’tenough:wehavetoblendtogetherthetwotypesofoil.Theyhavetobemixedbeforetheycanbesold,andthemixturehastoreachsomeminimumqualitystandard.Thecompanyneedsablendingplan.
Theinputs :
selling prices (here gasoline = $75, heating oil = $60) availability of oil fromsuppliersqualitylevelofcrudeoils:Crude1is10andCrude2is5
Theconstraints
Gasolinequality>=8Heatingoilquality>=6QuantityofCrude1
=15000QuantityofCrude2=20000
Theblendingplan :
Gasolinehastohaveaminimumqualityofatleast8,andheating oilmusthaveaminimumqualityofatleast6.TheCrudeoil1we
haveonhandhasaqualitylevelof10whileCrudeoil2hasaqualitylevelof5.Wewanttoblendthesetwocrudeoilstobothachievetheminimumqualitystandardsandmakethegreatestprofit.
Let’sattacktheproblembycreatingtotal‘qualitypoints’(QPs)whichrepresentthequalityofoilinabarrelmultipliedbythenumberofbarrelsofthatoil.
Writeequationstocalculatethequalitypoints:
TotalQPsinthegasoline=10*amountofOil1+5*amountofOil2
Ifforexample,wemixedtogether50barrelsofCrudeoil1and40barrelsofCrudeoil2,thetotalQPswouldbe50x10+40x5=700.Theaverageperbarrelwouldbe700/90=7.78.Thisistoolowforgasoline(needs8)butacceptableforheatingoil.Twopoints:
we could sell the oil as heating oil, but it is exceeding the minimum qualityrequirement.Wecouldmakemoreprofitbyreducing thequalitytotheminimum,orchargeapremium.Butthisisprescrip-tivework,andwehavetoworkwithintheinputsgiventous).
TheonlywaytogettheoiluptogasolinestandardsistoincreasetheamountofCrudeoil1inthemixture.
Ifyoudon’tgetthis,tryathoughtexperiment:forgasoline,iftherewasnooilatallfromOil2,whatwouldbetheQP?Itwouldbe10*thequantityofoilfromOil1.Again,howaboutifweblended together1000barrelsfromeachtypeofoil: howmanyQPswouldbeproduced?Itwouldbe10*1000+5*1000=15000.Readthisthroughagain…itisimportant.
The blending plan provides us with the constraints we need to ensure that theminimumqualitylevelsareachieved.Intheexample justabove,weblended2000barrelstoprovideaQPof15000.Thisisanaverageof7.5.Goodenoughforheatingoil,notgoodenoughforgasoline.
Oilproblemsolution
Discussionofthesolution:notethatgasolinesellsformoremoneythanheatingoil,buttheoptimalsolutionsuggeststhatweshouldsellmoreheatingoilthangasoline.Thisisbecauseoftheconstraintsonquality.
12.Predictingitemsyoucancountonebyone
Why you need to know this . Many business decisions involve counts: eitherbinary(yes/no)orwithintimeorspace.Itwouldbegoodtoknowtheprobabilityofacertainnumberofcustomerscom-inguptoaservicedeskinacertainlengthoftime;ortheprobabilityofacertainnumberofcaraccidentsatagivenintersection.Wearelooking for a discrete probability distribution; discrete because the number ofoccurrencesisaninteger.
Chapter4onregressionshowedhowtopredictthesizeof anoutcomewhichwascontinuous(moneyortimeperhaps).Thedependentvariable—whatweweretryingtopredict—couldtakeonalmostanyvalue.Nowwewanttopredictprobabilitiesfor the occurrence of an independent variable which is an integer. Below we’llwork through some hands-on applications, with the theory available in theGlossary.
We’llbreakthisintotwoparts:
Theprobabilityofa binaryoutcome .Theprobabilitiesoftwofaultyitemsoutofthe next twenty on the production line. Or, the probability of at least fivecustomersoutof thenext fiftyactuallybuyingsomething.Thisisestimatedwiththebinomialdistribution .
The probability of a particular count of occurrences over time or area. TheprobabilityofthreeorfewerpeoplearrivingatyourCustomerServiceDeskwithinthenexthalfhour.Ormorethan twomistakesinthenexttenlinesofcode.ThisisestimatedwiththePoissondistribution .
114
12.1Predictingwiththebinomialdistribution
Afterthenormaldistribution(seetheGlossaryforadefinition), thebinomialisthemostimportantinstatistics.ThemathforthebinomialdistributionisalsodefinedintheGlossary.
The binomial distribution provides the probability of a ‘success’ in a certainnumber of ‘trials’. For example, you can calculate theprobabilityofmore thansevenoutofthenexttwentypeople through thedooractuallybuying something.Herea‘success’issomebodybuyingsomething;whilethenumberof‘trials’isthenumberofpeoplecomingthroughthedoor(hereitistwenty).
Somedefinitions:
Whatwearelookingforistheprobabilityofapre-definednumberof‘successes’.Notethatthedefinitionofsuccessisuptotheanalyst. Itcouldbe ‘spendingmorethan$20’orwearingahat.
Intheexamplewe’llworkthroughbelowyourunastore.Youknowthatthelong-runprobability of somebodybuying something is 0.6. You obtained this number bycountingtotalnumbersofcustomersand,outofthose, thenumberswho actuallyboughtsomething(successes).
Workedexample:Youwanttoknowtheprobabilityofexactlythreepeoplethroughthedooroutofthenextfivebuyingsomething.Note that‘success’justmeansthatthedefinedeventhappens.Whetherornotitisa‘goodthing’isn’trelevant.
ForExcel,theargumentsrequiredare: therandomvariablewhose probabilitywewant to predict; number of trials, the long-run probability, and a true/falsestatement.Intheexample, thenumber of trials is five.The randomvariable (X)whoseprobabilitywewanttopredictis3.Wealsoneedthelong-runprobabilityofasuccess.Intheexample,p=0.6.Wealsowanttheprobabilityofexactly3,sousethefalsestatement.I’llgointothisinsomedetailshortly.
Open an Excel spreadsheet, and type in =BINOM.DIST(3,5,0.6,false) and youshouldget this:0.3456.This is theprobabilityof exactly threepeopleoutof thenextfivebuyingsomething(numberofsuc-cessesoutofthenextfivetrials).NoticetheorderoftheargumentsintheExcelfunction.Andespeciallythelastone,whichintheexampleaboveisfalse.(Thealternativeistrue).Thedifferenceisimportantbecausetheresultsarequitedifferent.
Trueandfalseargument .Definingtheargumentasfalseprovidestheprobabilityof exactly the random variable. Defining the argu- ment as true provides acumulativeprobability.
Excel adds up probabilities from the left, so changing the argu- ments to=BINOM.DIST(3,5,0.6,TRUE)=0.66304isthesumoftheprobabilitiesofX=0+X=1andsoonuptoanincludingX=3.Theprobabilityof0.66304isthereforetheprobabilityofthreeorfewercustomersbuyingsomething.ThetablebelowshowstheprobabilitiesofvariousvaluesofX,both‘false’and‘cumulative’.Asyoucansee, thecumulative is just thecontinuedaddition of eachsuccessiveprobability.Thereisafurtherexamplehere¹
Probabilitiescalculatedbothfalseandtrue
Howaboutmorethanthreecustomersbuyingsomething?Inmathnotation,we’relookingforP(X>=3).Weknowthat probabilitiesmustsumupto1.Weknowthatthreeorfeweris0.66304.Somorethanfivehastobe:1-0.66304=0.33696.
¹https://www.youtube.com/watch?v=oZ1DmNQ8wW4
Aslightlyharderexample:theprobabilitythatatleastfourcus- tomersoutofthenexttenwillbuysomething?We’relookingfor P(X >=4). Look carefully at thenotation.X>=4impliesthatthedistributionofthetencustomersissplitintotwohalves:lessthanfourandfourormore.Itisthelatterhalfwhoseprobabilitywe’relookingfor.
Ifwecanfindtheprobabilityof0+1+2+3,thatistheprobabilityoflessthanfour(weonlywant integers here). So find that probability and then subtract from 1,makinguseofthefactthatprobabilitiesmustsumupto1.
Let’sdothisstepbystep.
First,findP(X=<3),thatistheprobabilitythatXisthreeorless:
=BINOM.DIST (3,10,0.6,true) = 0.054762.Notice the ‘true’ which givesusthecumulativeprobability.
Wewantfourormore,sosubtractfrom1likethis:1-0.054762=0.945238.
Anotherexample.It’swinterandyouneedtowearasweatereveryday.Youhavetwoblueandthreeredsweaters.Calculatetheprobabilitythatduringtheweekyouwillweararedsweater:
Exactly twiceintheweek
Answer:thelong-runprobabilityofpickingaredsweateris3/5
=0.6becauseyouhavefivesweaters,andthreeofthemarered.Thenumberoftrialsis7becausethereare7daysinaweek.Thewordingof thequestioncontains theword ‘exactly’ whichmeans thatwe don’t want a cumulative answer, so we’llinclude the FALSE argument. Therefore the answer is =binom.dist(2, 7, 0.6,FALSE)=0.077414
Morethanthreetimes.Whenyouseewordssuchas‘morethan’ that’saclue thatyou’relookingforacumulative probability.In math notation, we’re looking forP(X>3). So if we find the cumulative probability up to and including two, andthensub-tractfromone,we’redone.Thecumulativeprobabilityoftwo
or fewer is P(X=0) + P(X=1) + P(X=2). So the answer is = 1 -binom.dist(2,7,0.6,true)=0.903744
Thekeys:
Writeoutwhatyouaretryingtopredictinmathnotation.Thisforcesyoutobeclear.Drawalittlesketch(hopefullybetterthanmine!)ifyougetconfused.
Ifthewordingofyourproblemcontains‘exactly’orrequirestheprobabilityofjustoneparticularoutcome(egP(x=3))thenyouwanttousefalse.
12.2PredictingwiththePoissondistribution
Thebinomialdistributiongaveustheprobabilityofabinaryoutcome(yes/no)outof a certain number of trials. The Poisson gives us the probability of a certainnumberwithinaspecifiedtime-frameorarea.
Here’stheexample.You run theCustomerServiceDesk.Youwant to know theprobabilityoffiveorfewercustomersarrivinginthenexthalfhour.Thatwouldbeusefulforstaffing,wouldn’tit?Youknowthattheonaverage,20customersarriveeveryhalfhour.Youknowthatbecausehavecountedthem.
Wewant:P(X>=5).Likethebinomialabove,thePoissonhasatrue/falseargumentforwhetherwewantcumulativeprobabilitiesor‘exact’.Thenotation include a>sign, which implies cumulative, so we use TRUE. In Excel:=POISSON.DIST(5,20,true).Theanswer is7.19088E-05.This looks a bitweird,butitisjustmathnotation.TheE-05meansthatyoushouldmovethedecimalpoint5placestotheleft.Sotherearefourzeroesinfrontoftheleading7.Inotherwords,anextremelysmallprobability.Perhapssendsome staff out for lunch? It is verysmallbecause5isalongwayfromyourlongrunaverageof20.
13.Choiceunderuncertainty
Virtually all business decisions involve at least some degree of uncertainty. Forexample,youdon’tknow(andpresumably canneverknow)somefuture‘stateofnature’whichmightbeanexchangerateorbusinessrental.Afarmerdoesnotknowtheprice of wheat at harvest time, but has to decide howmuch to plant manymonths ahead. His decision is relatively simple compared to a decision-makerconfrontedwithanopponentwhowill reacttotakeadvantageofwhateverdecisionyoudomake.Thefirstsituation iscalled‘non-strategic’and iseasilyhandledbytheexpectedmonetaryvaluemethodwhichwe’llworkthroughinthechapter.Thesecond‘strategic’situationismorecomplicatedbutcanbesolvedbyanapplicationofgametheory.We’llcovertheEMVmethodinsomedetailbelow.Gametheoryisanenormoussubjectandwedon’tcoverithere,butI’llgiveanexampleattheendofthechapteralongwithsomerecommendedreading.
13.1Influencediagrams
Aninfluencediagramisasketchoftheinfluencesandoutcomesofadecision.Theinfluencediagramallowsustothinkthrough anddepict the forces that influenceourchoiceswithouthavingtoassignprobabilities.Itiscustomarytouseovalshapesforvariableswhichareuncertain,aboxforadecisionnode;andlozengesfortheoutcome.
119
VacationInfluencediagram
Youareanoutdoors-typeandsothechoiceofactivityisaffectedbytheweather.Theweatherforecastisaffectedbytheweathercondition (onehopes!).Butyoudon’tknowtheactualweathercondition,justtheforecast(whichiswhyitisaforecast).Youchooseyouractivity(box)basedontheforecast.Theamountofsatisfactionyouget (lozenge) depends on the activity you chose AND the actual weathercondition:observethetwolines leadingtothelozenge.Thereisastandardformat:
1.DecisionNodesarepresentedinrectangles.Intheexampleontheslides,thedecisioniswhatactivitytopursuewhenonvacation(climbamountain?readabook?).Whichchoicewemakemaytosomeextentbedeterminedbytheweatherforecast.
2. UncertaintyNodes are presented in ovals. The weather fore- cast isuncertainandsoistheactualweatheritself.
3.OutcomeNodesarepresentedinlozenges.Theoutcomeinthevacationexampleishowmuchsatisfactionyoutookfrom thechoiceyoumadeforvacationactivity.Thisisautility.
4. Arcs display the flow of the influence. The actualweather conditionaffectstheweatherforecast,whichaffects yourplanning.Noticethatyouare takingadecisiononactivity in advance of knowing what the actualweatherwillbe.Thatiswhythereisnoarcbetweenweatherconditionandvacationactivity.
YoucandrawinfluencediagramsquiteeasilyinExcelusingtheshapesdrop-downtool.
Therearedifferentoutcomeshere,whichwecanbe represented as payoffs.Thehighestpayoffcomesfrommatchingyourchoiceofactivitytotheweatherforecastandthenfindingthattheforecastmatched reality.
13.2Expectedmonetaryvalue
theexpectedmonetaryvalue,orEMV,isjustthevalueofsomeout-comegivenaparticularstateofnature,multipliedbytheprobabilityofthatstateofnature.Here,Iam using the customary expression ‘state of nature’ not necessarily to refer toanythinginthenaturalworld,buttoaparticularsetofconditions.
Thestateofnaturehastobemutuallyexclusivefortheprobabilities towork.Forexample, if we have separate probabilities for rain and sunshine, then cannotcalculateforrainANDsunshine.Thepayoffswhichresultfromeachstateofnaturearedifferent,dependingonthestateofnature.
Let’smotivatethiswithanexampleusingafriendlyfarmerasthedecision-maker.Hehasthechoiceofplantingwheat,raisingcattle,orsomemixtureofboth.Thosethreepotentialdecisionsrepresenthisrangeofchoices.Fromexperience,heknowshowmuchhecanexpecttoreceivedependingontheweatheratthetimeofharvest.Thatamountiscalledthepayoff .
Hefacesthreestatesofnature,representingdifferentweatherconditionsatharvest.Theseare rain,clear,andsunny.Foreach choiceand foreachstateof nature heknowsthepayoffswhicharerepresentedinthecellsbelow.
Thepayofftableforthefarmer
Therearethreedecisionsandthreeoutcomes,makingatotalof9possiblepayoffs.Payoffscanbenegative,andarenotalwaysintermsofmoney.Theycouldbetime,oranyother appropriate metric. In the rowswe put the choices available to thedecision-maker.Inthecolumnsweputthe‘statesofnature’whicharethe futureeventsnotunderthecontrolofthedecision-maker.
The farmer calculates the probability of each of the three outcomes by workingthrougholdweatherrecords.Hefinds:
probabilityofrainingis0.4probabilityofclearis0.3probabilityofsunnyis0.3(ofcoursetheseprobabilitiesmustalladdupto1.Therecouldbemanymore‘statesofnature’butIhavekeptthemtothreeforclarity.
TheExpectedValueisthesumofeachprobabilitymultipliedby theoutcome. Inmathnotationthisis:
EMV=∑p x∗x
which inwords is: the expectedmonetary value is the sumof all the outcomesmultipliedbytheirprobability.InExcel,usetheformula
=sumproduct(array1, array2). This is a mean, or average, with each outcomeweightedbyitsprobability.Indecisionsinvolvingmoney,
weusuallycalltheExpectedValuetheExpectedMonetary Value(EMV).Forthefarmer,theEMVsareasbelow:
TheEMVsforthefarmer
Imultiplied(horizontallybydecision)thepayoffbytheprobability ofthatpayoffandthensummed.Usuallywechoosethedecision with thehighestEMV, so thefarmershouldchoosewheat.ExpectedvalueYoutube¹
Note that a ‘good’ decision is making the best decision at the time with theinformationtohandatthattime.Theremaybeunluckyconsequencesbutprovidedtheanalysthasdoneathoroughjobinselecting theoptimaloutcomeat the time,thensheshouldnotbeblamed!
Itwillbehighlyunusualforanoutcomeof30toactuallyoccur.TheEMVisjustaweightedaverageandnotaprediction.
13.3Valueofperfectinformation
The farmermight be tempted to ask for advice from a consultant. What is themaximumheorsheshouldpayfortheadvice?Itisthedifferencebetweenthebestcase that the farmer calculates and how much he would receive with perfectinformation.WecalculatethevalueofperfectinformationbycomparingtheEMVofthebestchoiceineachstateofnaturewiththeEMVwithoutinformation.Ifyouknewthattheweatherwouldbepoor,youwouldselectCattle;clearwheat;sunnywheatagain.ThenfindtheEMVwithperfectinformationbymultiplyingthesebestoptionsbytheir probabilities.
¹https://www.youtube.com/watch?v=fR7cBts1C1w
Theworkingis:100.4+300.3+70*0.3=4+9+21=34.Thebestwecouldcomeupwithoutperfectinformationwas30,sothevalueofperfectinformationis34-30=4.Soifsomeoneofferedyouperfectinformationforaprice,thehighestthatyouwouldpayfortheperfectinformationwouldbenotmorethan4.
13.4Risk-returnRatio
There are other ways of choosing the optimal outcome other than the EMV,althoughtheEMVremainsthemostcommonlyused.OnecommonmeasureistheReturntoRiskRatio ,whichprovidesthedollarsreturnedperdollarputatrisk.Foreachdecision,wedivide theExpectedValueof thedecisionbytheStandardDeviationoftheoutcomeforthatdecision.WeusuallychoosethedecisionwiththehighestRRR,because then thedollar returnforeachdollarputat riskishighest.Thefarmer’sworksheetisbelow.
Mixedhasazerobecausethepayoffsareallthesame.Thereisnorisk.Cattlehasthehighestat0.715.This ishigher than theRRR for wheat althoughwheat has thehighestEMV.Thefarmermightwanttothinkmoredeeplyabouttheprobabilityofsunnyweatheratharvesttime,becausethismakesahugedifference.
13.5Minimaxandmaximin
A non-probabilistic approach to making choices is the use of maximin andmaximax.Youcan see thatyour choicedependson whether you are pessimistic(maximin)oroptimistic(maximax).Ifwecansomehowobtainprobabilities,thenaprobabilistic approachworksbest.
13.6Workedexamples
Youare planning to market a new coffee-flavored drink. You have a choicebetweenpackingthedrinkinareturnableornon- returnablepackaging.Yourlocalgovernmentisdebatingwhethernon-returnablebottles shouldbeprohibited.Thetablebelowshowsyourprofits.Ifthenon-returnablelawispassed,youwillstillgetsomesalesfromexports.
Decision Lawpassed Lawnotpassed
Returnable 80 40
Nonreturnable 25 60
Example1
1. What would be the best decision based on maximin and maximaxcriteria?
2. Alobbyisttellsyouthattheprobabilityofthegovernmentbanningnon-returnablesis0.7.Assumingthelobbyistisright,whatisthebestdecisionbasedontheEMV?
3.Atwhatlevelofprobabilitywouldyourdecisionchange?
4.Howmuchwouldyoupayforperfectinformation?
Answers:
1. Maximin: theworstare25and40.Thebest is40,meaningpackagewithreturnables.Maximax:thebestare80and60.Againreturnables.
1. TheEMVforreturnableis800.7+400.3=68.Fornonreturn-ableitis250.7+600.3=35.5.UnderEMV,youshouldusereturnablebottles.
2. Setthisupasequation,withptheunknownwhichhastobalancebothsides80p+40(1-p)=25p+60(1-p).Solvefor
pandget0.267.Soifyouhearrumoursthatsuggesttheprobabilityiscomingdown,thinkagain?
3. Ifwehadperfectinformation,theEVPIis800.7+600.3=74.Soyoushouldnotpaymorethan74-68=6
Example2
Yourunabank.Acustomerwantstoborrow$15,000for1year at 10% interest.Youbelievethatthereisa5%chancethatthecustomerwilldefaultontheloan,inwhichcaseyouwillloseallthemoney.Ifyoudon’tlendthemoney,youwillinsteadplacethe$15,000inbondswhichreturn6%butarerisk-free.
1.WhataretheEMVsofloanandnotloan?
2. Youhave a credit investigation department which can help you withmoreaccuracyintheprobabilityofdefault.Whatisthemostyoushouldbewillingtopayfortheiradvice?
3. Calculatethelevelofprobabilityofdefaultatwhichlending themoneyandinvestinginbondshaveequalEMV.
14.Accountingforrisk-preferences
What this chapter is about: the previous chapter showed howwe could pick themostattractivechoicegivenpayoffsandprobabil- ities.But itmighthavestruckyouthat therewaslittlediscussionoftherisksinvolvedandhowdifferentpeoplerelate to risk. In this chapter we are going to work through picking optimaldecisions,takingintoaccounttheriskpreferencesofthedecisionmakers.
Hereisaquestionforyou:youaregivenalotteryticketwhichhasa0.5probabilityofwinning$10,000anda0.5probabilityofzero.TheEMVistherefore10,0000.5+ 00.5 = $5000. Someone comes along and offers you $3000 for the ticketguaranteed. Would take the sure thing? Or would you hope that you win the$10,000?Most peopleare‘riskaverse’andwouldprefer togiveupsomeof theEMVinexchangeforcertainty.The$3000(orhowevermuchitis)is calledyourCertaintyEquivalent (CE) in thisparticulargamble.The difference between theEMVandtheCEiscalledtheRiskPremium.So
RP=EMV-CE
Youcanthinkofthecertaintyequivalentasasellingprice.Itisusuallysmallwhenthesizeofthegambleissmall,butincreasesas thegamblegetslarger.HereIamusing the term ‘gamble’ because there are probabilities involved. The certaintyequivalentisausefulconceptwhich,amongstotherqualities,allowsustocategorizeapproachestorisk.
•Risk-averse .IfyourCEislessthantheEMVofagamble,thenyouarerisk-averse (and most people are, so nothing to worry about. You’re‘normal’)
127
•Risk-neutral .YourCEmatchestheEMV.Youplaytheaver-ages.
• Risk-seeking .Youenjoythegamble inherent in theEMVand need tohaveaveryhighCEbeforeyou’llgiveitup.Mostpeopleinbusinesswhoare risk-seeking usually don’t last verylong.Althoughwesometimesseerisk-seekingdecisionswhenyourbusinessisintroubleandahugegambleistheonlypossiblesolution
Businessdecisionsareallabout risk,and theprobabilities them-selvesareoftenunreliableandhardtofind.Thefarmerwhosecrop- ping decision we studied inChapter2doesn’tknow tomorrow’sweather,letalonetheweatherinsixmonths.WegenerallyprefertogiveupsomeoftheEMVinreturnformorecertaintybecauseweareriskaverse.That’swhywebuyinsurance.IknowthatIwillbecoveredifIsmashmycar,andIcanevenputupwiththeknowledge thatpeopleinZuricharegettingricherbecauseofmyaversiontorisk.
14.1Outlineofthechapter
Here’swhatwe’regoingtodo:
•Describeutilityandcalculateutilities
•Showhowtousetheexponentialutilityfunctiontocalculateutilitiesbasedonrisktolerance
• Convertthoseutilities to certainty equivalentswhichwe can rank andcomparewiththeEMVsofthesamedecision
Torank choices in terms of their certainty equivalent ratherthan their EMV,weneedtheconceptofutility,whichisa numericalmeasureofhowwellagoodorservice satisfiesaneedorwant.Do you prefer to be reading this book, outsideeatinganice-cream,or
talkingwithafriend?Youcan rank these choices quickly in your head.Youcansubconsciouslyallocateutilitynumberstothechoicesandthenpickwhicheverhasthehighestutility.
Utility numbers help us to rank our preferences, but there are no units and therankingisindividual-specific.Inotherwords,givenasetofalternatives,differentpeoplemaywell havedifferentrankingsfor them.For themoment, let’sassumethatitisjustyouwhoseutilitiesweareinterestedin.Ifwecansomehowfigureoutyour utilitiesfor thedifferentoutcomesofabusinessdecision,wecanuse thoseutilitiestohelpyoumakeamorepsychologicallysatisfyingchoice.
Utility
Theplotbelowshowsahypotheticalutilitycurve.Theutilityvalueisontheverticalyaxisandtheoutcomeofagambleison thehorizontalxaxis.Noticetwothings:
Atypicalutilitycurve
*Theslopeofthelineisupwards.Thisreflectsthefactthateveryone prefersmoremoneytoless
• The rate of increase of the line is decreasing or slowing down. Itwassteepearlyon,buttowardstheenditisalmostaplateau.Thisisbecausethevalueofeachextra dollar isslightlylessthanthatof the previousdollar.Thisisthemarginalutilityofmoney.
14.2Wheredotheutilitynumberscomefrom?
Therearetwoapproaches.
Thefirstinvolvesaskingthedecision-makerhis/herchoicebetweentwogambles.
The seconduses a utility function, amathematical function which converts twoinputvariables(risktoleranceandthesizeoftheoutcome)intooneutilitynumber.
WithautilitynumberinhandwecanbegintheprocessofrankingthedecisionsintermsofutilityratherthanEMV.
Theintuitivemodel
We’llworkthroughtheintuitivemodelfirstandthenapplythesamethinkingtotheequationmodel.Onceyouget thehangof it, theequationmodel is quicker andprobably more accurate. We’ll use the payoff table from the farmer example,repeatedbelow.
Thepayofftableforthefarmer
We’llassignautilitynumberof1tothehighestpayoffandthenzerotothelowest.Thebeginningsoftheutilitytablelookslikethis:
PayoffUtility
701
40
30
20
10
-10
-300
Nowweneedtofillinthegaps.Whatwe’lldoistouseeachpayoff asacertaintyequivalent,andaskthisquestion:
Whatvalueofpwouldmakeyouindifferentbetweeneitherreceiv-ingaguaranteedpayoffofthecertaintyequivalent(inthefirstcase
40)oracceptingagambleofreceivingthehighestpayoff(70)withprobabilityporlosing30(thelowestpayoff)withprobability1-p.Let’ssaythatyouanswerp=0.9.Theexpectedvalueofthegambleis70*0.9+0.1(-30)=60.
This is higher than your certainty equivalent of 40, indicating that you are riskaverse.Thedifferencebetweentheexpectedvalueandthecertaintyequivalentistheriskpremium,andistheamounttheindividualiswillingtogiveuptoforgorisk.
Continuedownwards,certaintyequivalentbycertaintyequivalentandcompletethetable.I’vemadeupthenumbersjustforillustra-tion.Aplotoftheutilityvaluesisalsoshown.
Payoff Utility
70 1
40 0.9
30 0.8
20 0.65
10 0.4
-10 0.35
-30 0
Farmer’sutilitycurve
Ifyouwererisk-neutral,youwouldprefertotakethegambleandsowouldbeanEMVmaximizer.Thechoiceoftherisk-neutralperson isindicatedbythestraightline.
Now,let’sreplacethepayoffvaluesinthefarmertablewithutilities.WecancalculatetheexpectedutilitiesinthesamewayaswefoundtheEMVs.Thetableshowsbothforcomparison.Wemultiplytheutilityofeachoutcomewithitsprobability.
Expectedutilities
The famer’s EMV decision would have been wheat (EMV=14)but taking intoaccount his risk preferences, mixed gives the highest expected utility. It makessensestospreadtheriskbetween twocompletelydifferent crops.
##Theexponentialutilitycurvemethod
It is rather tediousaskingsomanyquestions,pluspeopleget tired andconfusedratherquickly.Anattractivealternativeistousea
mathematicalfunction, theexponentialutilitycurve.Thisrequires onlyoneinputvariable,R,thetoleranceforrisk.Theequationisbelow:
U x=1−e− x / R
Herex is the payoff; Ux is the utility of the payoff x; and R is theindividualsrisktolerance.Apersonwitha large value ofR ismorelikelytotakerisksthansomeonewithasmallerRvalue.AstheRvalueincreases,thebehaviorapproachesthatoftheEMVmaximizer.TheplotforsomeonewitharisktoleranceofR=1000is
below. ActuallyperformingthecalculationstofindutilityateachlevelofpayoffiseasyusingExcel,aswe’llseebelow.ButfirstwehavetodetermineR,theindividual’stoleranceforrisk.
Findingrisktolerance
Therearetwoapproaches
Byaskingthisquestion:consideragamblewhichhadequalchances ofmakingaprofitofXoralossofX/2.Whatisthevalueofx forwhichyouwouldn’tcarewhetheryouhadthegamble.Inotherwords,whatisthevalueofxforwhichthecertaintyequivalentis
zero?Theexpectedvalueofthealternativeis0.5xX-0.5*(X/2)=0.25X.Aslongas X >0 then the decision-maker is displaying risk- averse behavior, becausehis/hercertaintyequivalentislessthantheexpectedvalueofthegamble.ThegreatertheR,themoretolerantofrisk.Inhisbook,ThinkingFast,ThinkingSlow,DanielKahnemannotes that the ‘2’ in thedenominatorcanvarya tad. It isn’tapreciseformula,butapparentlyitcomesclose:Indiagramform:
Anal- ternativeapproach,moreusedinbusinessandfinance,istoemploy guidelinenumberscalculatedbyProfessorRonHoward¹,apioneerofdecision-analysis.Basedonhisyearsofexperience,HowardsuggeststhattheRvalueforacompanyis:
•6.4%ofnewsales
•124%ofnetincome
•15.7%ofequity
##Exampledecisionusingutilitymaximisation
Whatwe’ll dohere is to examine adecision from both the EMV and ExpectedUtilityangles,using theexponentialutility function. Thecalculationofexpectedvaluesisthesame:thedifferenceisthatinsteadofmultiplyingeachmonetarypayoffbyitsprobabilityandthenaddingup,wemultiplytheutility.
¹https://profiles.stanford.edu/ronald-howard
The decision set-up. All money figures in millions. I run a coffee importingbusiness.Ineedaspecialimportlicensetobringinaparticulartypeofcoffee.Theequityofmycompanyis12.74.So,myRvalueis15.74%of12.74=2.
Ihaveachoicebetween‘rush’and‘wait’forthepermit.
IfIrush,Ihavetopayaspecialfeeof5,andthereisa50/50chanceofthepermitbeinggranted.Ifitisgranted,Iwillhavesalesof8,givingmeanetprofitof8-5=3.Ifitisn’tgranted,Istillhavetopaythe5,butwillhavesalesofonly6,givingmeanetprofitof2.First,findtheEMVandEUofrushing.TheEMVis0.5x3+0.5(2)=2.5.
Usingtheexponentialcurveformula,theutilityof3is=1-exp(- 3/2)=0.77687.Andtheutilityof2is=1-exp(-2/2)=0.632121.TheEUforrushingis0.5(0.77687)+0.5(0.632121)=0.704496.Allfiguresareinmillions.
Nowforthealternative,whichistowait.Inthisscenario theprobabilitiesarethesameat50/50,butthecostisonly3ifitisissued,nothingifitisn’t.Thesalesarethesameat8and4.Thenetprofitsare8-3=5and4.TheEMVis0.55+0.54=4.5.TheutilitiesareU(5)=1-exp(-5/2)=0.918andU(4)=1-exp(-4/2)=0.865.TheEUis0.50.918+0.50.865=0.892.
ThetablebelowshowstheEMVsandEUs.
Spreadsheetfortherushdecision
Butwecangofurtherusingcertaintyequivalents,subjectofthenext section.
14.3Convertinganexpectedutilitynumberintoacertaintyequivalent.
Fortunately there is an easyway to convert expected utilities back to certaintyequivalents. It is then straightforward to rank the decisions by certaintyequivalents. Remember the concept of the certainty equivalent? The amount ofmoneywhichitwould takeforyou to sellyourgamble? It turnsout thatwecanconvertanexpectedutilitynumberintoacertaintyequivalentusinganequation:
CE=− R∗ln (1−EU)wherelnisthenaturallogarithm.Weusethisequationwherewearetalkingprofitsandwantmore.Inthecaseofcosts,takeoffthenegativesigninfrontofR.
AbovewefoundEU=0.704496fortherushdecisionbranch.ThatisaCEof
=-2*LN(1-0.704496)= 2.43814
using Excel’s ln function. For the wait decision, the EU was 0.892, giving acertaintyequivalentof=-2*LN(1-0.892)=4.451248
LargerCEsarepreferredwhentheoutcomesareprofits,andsmallerCEswhentheoutcomesarecosts.SoI’dwait!ThisconfirmsthedecisionmadewiththeExpectedUtilities.
15.GlossaryThisisahands-onguidetodoingsomebasic analyticworkwith data.However,statisticshasitsownsetofvocabularyandtech-niques.Whenyouarepresentingyour results tootherpeople,youmaywish to followestablished techniques andterminology. Also, here you’ll find more of the underlying theory if you areinterestedinhowExcelgetstheresultsitdoes.
Binomialdistribution
The binomial distribution is used to find the probability of the occurrence of acertainnumberofevents(calledsuccesses,althoughtheycouldbeanythingclearlydefined)withinanumberoftrials.Thebinomialdistributionwillgive,forexample,theprobabilityofatleasttenpartsbeingdefectiveoutofthenexthundred,providedthelong-termdefectiverateisknown.Thebinomial requires some conditions toworkproperly.These are:
asequenceofnidenticaltrialstherearetwooutcomesforeachtrial,denotedsuccessandfailuretheprobabilityofsuccessdoesn’tchangethetrialsareindependent.
It isquiteeasytocalculatetheprobabilitieswithExcelbutfor illustrationhereistheequationwhichExceluses:
f( x )=
(n)
x
p x (1 −p )
( n − x )
wherexisthenumberofsuccesses;ptheprobabilityofsuccessononetrial;nthenumberoftrials,andf(x) theprobabilityofx successes inn trials.Thisnotationpointsupthefactthattheresultisafunctionofx.
137
Centrallimittheorem(CLT)
Thecentral limit theoremisfundamental instatistics.What the theoremstates isthis:let’ssayyouhavealargepopulationand you keep taking samples from thepopulationandmeasuring themeanofeachsample.Thenyoudrawahistogramofthemeanssothateachmeanisavariableinthedistribution.Inotherwords,insteadof plotting each observation independently, we plot them by themeans of eachsample.Thehistogramwhichappearswillbeapproximatelynormallydistributed,orintheshapeofthewell- known bell curve. This is wonderful news because thenormaldistributioniswellbehaved,andsowecancalculatetheareaunderanypartofit.
Correlation
Youhaveapairofvariables:forexample:websitehitsand actualsales.Youwantto knowwhether there is a relationship between thevariables.Andwhetheritis‘statisticallysignificant’,orperhapswhatyouthoughtwasarelationshipoccurredjustbychance.
Correlation is the study of the linear relationship between two or more randomvariables.Correlationanalysisprovidesboththedirectionand thestrengthof therelationshipbetweenthevariables.Forexample,thereisarelationshipbetweentheheightandweightofindividuals.There is apparently also a relationshipbetweenconsumptionofsugarandobesityratesatthenationallevel.Cor-relationanalysiscantestwhethertherelationshipsarespuriousorreal.
It’strueandworthrepeating: correlationdoesnotimplycau-sation .Itdoesnotnecessarily follow that onevariable causes the other even though theymight behighly correlated. There are many examples of such spurious correlations, forexamplenational consumptionof chocolate andnumber ofNobel prizeswon. Ifyou do find a correlation, first think about whether you have a lurking variablediscussed below. The two examples given above for height/weight and sugarconsumptionmightverywellinclude
causaleffects,butthisisnotsomethingthatcorrelationanalysiscanestablishonitsown.Indeed,somephilosophers(such asDavid Hume) claim that causation canneverbedecisivelyestablished.
Nowwewill:
•Visualizeacorrelationwithascatterplot
•explainwhythecoefficientofcorrelationworks
•interpretthevalueofthecoefficientofcorrelation
•testforthestatisticalsignificanceofthecoefficientofcorre-lation
Scatterplotsandcorrelation
Acanningfactoryhasdataonnumberofworkersemployedandoutputofcans.Thedatasetis‘canning’.Thefactoryisinterestedintherelationshipbetweenthenumberofworkersandoutput.LoadthedataintoExcel.
Firstfewlineofthecanningdata
Thecolumnsofdata representsnumbersofworkers andcansmanufactured.Forthenotationwe’llcalltheseXandY.WhenImeanthenameofthevariable,I’llusetheuppercase.Forindividualobservations,suchasx=23,I’lluselowercase.WewanttoseetherelationshipbetweenXandY.TodothisweneedtohavethedatalaidoutincolumnssothateachobservationofXandYareonthesameline.It’salwaysagoodthingtovisualisetherelationshipwith a scatter plot and the scatter plot isbelow.
Thecansscatterplot
I’vedrawnaverticalandhorizontallinewhichgoesthroughthemeansofthetwovariables.Wecandividetheobservationsinto quadrants,withquadrantonebeingtoprightandsoforth.Iftherelationshipispositivethenwewouldexpecttoseemostof the observations in quadrant one and quadrant three. If the relationship wasnegativethenmostoftheobservationswouldbeinquadrantstwoandfour.
TherelationshipbetweenXandYisnotperfectlylinear.Ifitwas all the pairs ofpointswould lineupalonga straight line.Weneedsomewayofmeasuringhowfar off being in a straight line these observations are. There are two measures,covarianceandcoefficientofcorrelation.Thereissomemathsinwhatfollowsbutwe’llwalkthroughitslowly.
Covariance:Trythisthoughtexperiment:ifalltheYpointswereonaverticalline,thenthesubtractionofthemeanfromeachYvaluewouldgiveuszero:
( y i− y )=0
I’veputinalittleifortheyvaluestoshowthatitcouldbeany
ofthevaluesintheYvariable.SimilarlyfortheXvalues,ifallthepointswereonaverticallinethen
( x i−x )=0
Wecanseethatwedon’thaveverticallinesandthereissomedif- ferencebetweeneachobservedvalueandthemean.Ifwemultiplytogetherthedifferencesandthenfind theaveragebydividingbyn-1wecanfindthecovariance.Wedividebyn-1becauseweareinmostcasesdealingwithasample.Subtracting1adjustsforthisfact.ThesamplecovarianceofXandYistherefore:
S XY=
( x−x )( y−y )
n−1
Thebigproblemwithcovarianceisthattheresultisverydependentontheunits.Wecan get round this problem by standardizing the differences between eachobservationand itsmean.Wedo thatby dividing the difference by the standarddeviation of the variable. You can think of this technique as providing thedifferenceinunitsofstandarddeviations,so
( x−x ¯
s xandthesamefortheYvariable.Nowwenolongerneedtoworryabouttheunits.Thecoefficient of correlation is the covariance now between the standardiseddifferences,not thedifferences them- selves.The equation for the coefficient ofcorrelationofasampleis
r XY=∑
( x−x )( y−y )( n−1) S XS Y
Performingthiscalculationisextremelytediousandweusuallyletthecomputerdo
thework.Typein=correl(array1,array2)andhit
enter.Forthecansdatatheresultis0.991083.ThisisthePearsonProductMomentor‘r’value.
CorrelationYoutube¹
Valuesofr:valuesofrarerestrictedtotherange-1to+1.Anegativevaluemeansthat the slope is negative: as one variable increases thentheotherdecreases.Apositivevaluemeansthatbothvariables increasetogether.Thelargertheabsolutevalueofr,thenthecloser thecorrelation.Acorrelationofzeromeansthat all thepointsarescatteredaboutwithnopatternwhatsoever.Thervalueof0.991083forcansmeansthatthepointsarelinedupalongastraightline,whichiswhatwewouldexpectfromlookingatthescatterplot.
Testingr:theobservationsthatwehaveusedtocalculaterconsistofasamplefromapopulation.Becauseitisonlyasample,wedon’tknowwhetherthecorrelationwouldholdup ifweweresomehowabletogainaccesstothepopulation.ris thenotationforthecoefficientofcorrelationforasample.ForapopulationweusetheGreekletterρ“rho”.Generally,Greeklettersareusedfor population parameters.ThisdistinguishesthemfromtheRomanlettersusedforestimates.
Thehypothesiswe’retestingis:
againstthenull
H 0:ρ ≤0
H a :ρ =0
Wefindtheteststatisticusingthisequation
r √ n −2
t =√
−r 2
¹https://www.youtube.com/watch?v=QDj8aYfvccQ
We’llfirstworkoutthevalueoftheteststatisticusingthervaluewefoundabove.Thenwe’llfindthepvaluefromthis.
Thervaluefoundfromthecorrelationwas0.99and therewere 24 observations(n=24).Plugginginthenumbers,wefindthatt=32.86.Thisisaonetailedtest,souse=t.dist(yourvaluefort,n-1,andfalse) tofindapvalueof2.67343E-21.The -21meansmovethedecimalplace21placestotheleft,sothisnumberis effectivelyzero.Thervaluewefoundisstatisticallysignificant.Itismuchsmaller than0.05,sowecanrejectthenullhypothesis.Thereisacorrelation.
Descriptivestatistics usesgraphsand tables topresent whatweknowabout thedata. This is a powerful method of gaining insights into the data, and also formakingpresentations.However,descriptivestatisticslimitsitselftothedataavailabledoesnotmakeanyinferencesbeyondwhatwehavefromthedatatohand.
Distribution of the observations within a dataset describes the density of theobservations:aretheyspreadaboutevenly,orpeakedinthemiddleandthenspreadoutoneithersideoftheirmean?
One of the most common and useful distributions is the normal distribution,sometimesknownas thebellcurve.Manythings in thenaturalworldfollowthisdistribution.Forexample,ifyouwereabletomeasuretheheightsofalargenumberofmenorwomen,you’dfindthattheyfollowedanormaldistribution.
TheplottotherightshowsthedistributionofrecordsofheightsofparentscollectedbyFrancisGaltoninthe19thcentury.Galtonwastryingtofindouttherelationshipbetweentheheightofparentsandtheirchildren.Alongtheway,hediscoveredthephenomenonknownas‘regressiontothemean’.
Ihaveplottedtheheightsasahistogramandaddedanormalcurvewiththesamemeanandstandarddeviationastheoriginaldata.Thisclearlyfollowsthebellcurve.
Thenormaldistributionisveryimportantinstatisticsforseveralreasons
-thenormaldistribu-tionis‘well-behaved’inthesensethatital-waysfollowsthesamemathematicalfunction.Theequationisalit-tlecomplex,butwecandrawanynormaldistributionprovidedweknowitsmeanandvariance.Themeanistheaverageoftheob-
Galton’sheightobservations
servations,andinanormaldistributionoccursrightinthemiddle.Thevarianceisameasureoftheaveragedistanceofeachobserva-tionfromthemean.Thelargerthevariance,themore‘spreadout’ordispersedthedata.
Thegraphshereshowtwodatasets,withidenticalmeans butdifferentvariances.In most of this book we won’t use the term variance, but instead standarddeviation .Thestandarddeviationisjustthesquarerootofthevariance.Weusethestandarddeviationprimarilysothatwecanusethesameunitsasthemean.Thefirstplothasastandarddeviationof5,andthesecondoneastandarddeviationof2.Theplotswerecreatedbygeneratingarandomsetof1000numbers,withameanoftenandthestandarddeviationsjustdescribed.
-the second reason is connected with the most important result in statistics,the CentralLimitTheorem ,orCLT.This statesthatifyoukeepdrawingsampleswithreplacementfromalargepopulation, themeans of the samples will followa normaldistribution .Theimportantthingisthatthedistributionoftheoriginalpopulationdoesn’tmatteras longaswedrawat least30samples.Infact, itgetsbetterthanthat:iftheoriginalpopulationisnotnormallydistributed,thenwedon’tneedasmanyas30inoursample.Asfewas10willdothetrick.
Standarddeviation=5
Becausethenormaldistributionfollowsanequationsowell,wecan find the areaunderanypartofitveryeasily.
Standarddeviation=2
Themean splits the area under the curve into halves,meaning that50%of theobservationsaresmallerthanthemean,and50%larger.Acommonpracticeis tomarkoff thehorizontalaxis inunits of standard deviation. The further from themean,thegreaterthestandarddeviation.Thishelpsbothinvisualizingtheshapeofthedata,butalsoinusingtheempiricalrule .
Empirical rule : a ‘rule of thumb’ which states that when data are normallydistributed,
-64%ofthedataarewithinonestandarddeviationofthemean-95%ofthedataarewithin twostandarddeviations of themean -99.7% of the data arewithin threestandarddeviationsofthemean
Thismakesintuitivesense:only100-99.7=0.3%ofthedataaremorethanthreestandard deviations from the mean. They would therefore be in either of theextremeendsortailsofthedistribution.Theprobabilityoftheiroccurrencewouldbeverysmall.Bycontrast,observationsclosertothemean, those located onlyasmallnumberofstandarddeviationsfromthemean,aremuchmorelikelytooccur.Thatiswhythehistogramishighestonatthemean.Thoughtexperiment:veryshortorverytallpeopleareunusual(whichiswhyyoustareatthem).Bycontrast,peoplearoundthemeanheightarenotatallunusualandyoujustpassthemby.
Estimation : The statistical process is circular: we start off by choosing thecharacteristic of the population that interests us. That characteristic is called aparameter.Wedrawarandomsamplefromthatpopulation.Weuse thestatistic toinfer information about the same quantifiable feature in the population. Thestatistic(for ex-ampleanaverageoraproportion)isanestimateofthepopulationparameter. The process of finding how close the estimate to the parameter, andbuildingaconfidence interval around the estimate, is inference .We test claimsabouttheparameterusinghypothesis-testing,whichiscloselyrelatedtoinference.Hypothesis-testingisthefinalpartofthischapter.
Excel –morescreencasts
Insertingthedataanalysistoolpakadd-in²Thenormaldistribution³
²https://www.youtube.com/watch?v=Ods1Z5BLD9g&list=UUVZFhqXDgJgIRq-ljotClgQ
³https://i1.ytimg.com/vi/1qn0EIm5-PQ/mqdefault.jpg
ExponentialSmoothing
Hypotheses and p values A hypothesis is a claim someone makes about apopulation.Forexample,acoffee importer (me!) is toldby the supplier that themeanweightofthesacksofcoffeebeansis40kg.Thatistheclaim.Iwanttotestthat claim because I am paying for the coffee. All the sacks of coffee in theshipmentarethe‘population’.Itakearandomsampleofsackstocheck.Ifind thatthemeanweightofthesampleis38kgs.DoIrejectthewholeshipment,orsayImusthavegot a low sample, let’s take them? The hypotheses here are: thenullhypothesis:thepopulationmeanweightis40kg.Meanwhilethealternativeothepopulationmeanweightisnot40kgs.Hypothesistestingprovidesananswertothetestintheformofapvalue,definedbelow.
ApvalueistheprobabilityoffindingasamplewiththemeasuredcharacteristicsIFthe null hypothesiswas true. In the case of the coffee above, let’s say that wecalculatedthepvalue(SeeChapter
9)anditwas0.1.Thatmeansthatthereisa10%chanceoffindingasamplewithameanweightof38kgsiftherealtruemeanweightoftheshipmentwas40kgs.Themostfrequentlyusedcut-offpointis0.05or5%.Ifthepvaluewassmallerthan0.05thenwewouldsaythatthereisalessthan5%chancethatthissamplecamefromapopulationwiththeclaimedweight.Therefore the shipment isn’t of the claimedweightandshouldberejected.
Hypothesiswritingandtesting
Ahypothesisisastatementorclaim.Forexample,‘themeanweight of thecoffeepacksis3kg’isahypothesisorclaim.Thehypothesisneedstobetestedsothatweknowwhethertheclaimstandsorisshowntobefalse.
Hypothesesarewrittensothattheclaim,andthecounter-claim,aredistinct.Therearetwohypotheses,onefortheclaimandthe other for the counterclaim.Ho, the‘nullhypothesis’,containstheclaim.Ha,the‘alternativehypothesis’,containsthecounterclaim.Wewritehypothesesbyassumingthatthenullhypothesisistrue.
Thenullhypothesesforthecoffeeexampleabovewouldbe
Ho :µ =3 kg
Thisisreadas:Thenullhypothesisisthatthepopulationmeanweight(mu)ofthecoffeeisequalto3kg.
Thealternativehypothesis (Ha) is that thepopulationmean weight is not 3 kg,written:
Ha :µ =3 kg
Wetestthehypothesesusinginformation(statistics)fromthesam- ple (themeanweight,thesamplesize,andthestandarddeviation).Notethatthecounterclaimsaysnothingaboutthemeanweightinthepopulationbeingmorethanorlessthan3kg,itsimplysaysthatitnotequalto3kg.Also,important,theequalitysignappearsinthenull.Thisisalwaysthecase,whethertheinequalityis‘strict’or‘weak’.
Wecanalsotestwhetherthepopulationmeanissmallerthanor largerthansomeconstant.Theshipperclaimsthattheweightis‘atleast’3kg.Wewritethisas:
H 0:µ ≥3 kg
Notethattheequalityremainswiththenull.Thealternativeisthen thenegationofthenull,andthismustbethattheweightis‘lessthan3kg’.
H a :µ<3 kg
Convinceyourself that theonlypossiblewaytonegateHoiswith thealternativehypothesis.Iftheclaimwasthatthattheaverageweightislessthanorequalto3kg,wesimplyswitchoverthe
inequalitysigns.Thereareotherformsofhypothesiswritingwhichwewillworkthroughinthisbook.However,theyallsharetherulethattheequalitysigngoesinthenull;andthealternativenegatesthenull.
Inference is thekeystatisticalprocess. It is theway in which information aboutpopulationscanbegleanedfromsurprisinglysmallsamples.Anexampleispollingbeforeanelection.Providedthesampleisdrawnrandomlyandisrepresentativeofthepop-ulation,asampleofperhapsonly1,000peoplecanprovidequiteaccurateestimates of the proportion of the population predicted to vote for a particularpoliticalparty.Theestimatesareusedtomakeinferencesaboutthepopulation.Wewon’tknowuntilelectiondaywhetherornottheinferenceswerecorrectornot.
Inferentialstatisticsisaverypowerfultechniquewhichallowsustomakethejumpfromasampletoapopulation.Wecanuseasmallsampletomakeinferencesaboutthecharacteristicsofapopulationaboutwhichweperhapsknowverylittle.Thekeyistoobtainasamplewhichrepresentsthepopulationasaccuratelyaspossible.Forthis, both randomization and good surveydesign are essential. Wewill discussthesetwoimportantissueslaterinthechapter.Recently,athirdclassofstatisticalmethodologieshasarisen.Thisismachinelearning ,whichtakesadvantageofnewanalytictechniquesandgreatercomputingpower.Wewon’tbeable togointothesetechniquesinthisbook,unfortunately.
Lurkingvariables : Ithappens that there isclosemonthlycor- relationbetweentheconsumptionoficecreamanddeathsbydrowning.Isitthatpeopleeatanicecream,goswimmingandthendrown?Thecorrelationmissesthelurkingvariablewhichistemperature.Inhotweather,peoplebothgoswimmingmoreoftenandbuyice creams. The ice cream and the drowning variables are both correlated withtemperatureandsoarecorrelatedwitheachother.Anotherexample:itseemsthatthereisacorrelationbetween kidsneeding readingglassesandsleepingwith thelighton.Soyou
should switch the light off? Not so fast. The lurking variable is the short-sightednessoftheparents,notthekids.Theparentsleavethelightonsotheycanseethekid.Thekidsinheritbadeyesightfromtheirparents.
Parameter : a quantifiable characteristic of interest in the popula- tion. Theaverage weight of all of the fish in the Fraser Riveris a parameter. The samecharacteristicinthesampleisknownasastatistic.Wecaneasilyfindthisstatisticbecausethesampleisonlyasubset,andthereforesmaller,thanthepopulation.Weusuallyneverknowtheactualvalueofaparameter,butthatdoesn’tmatterbecausewecanestimatefromthesample.
Pointestimate :astatistic,suchasthemeanoraverage.Itiscalledapointbecauseitisanexactnumbersuchas4.21.Itisnotarange,itisapoint.Itiscalledanestimatebecauseitisanestimateofthepopulationparameter.Therefore4.21isanestimateofwhatthesamequantifiablefeaturewouldbeinthepopulation.Youcanthinkofapointestimateasbeingthe‘bestguess’forthepopulationparameter.
Ifwe selectedadifferent sample from the population, then almost certainly thepointestimatewouldbedifferent,givinganewbestguess.Ifwecontinuetotakesamples,thenwearegoingtogetasmanybestguessesaswehavesamples.Therangeofbestguessesisthesamplingvariation .
Poissondistribution
ThePoissonprobabilitydistributionis,aswiththeBinomial,adis-creteprobabilitytool.Wecanuseitforcalculatingtheprobabilityofeventswhichoccurovertimeorspace.Forexample,numberofmistakesinagivennumberof linesof code.Theformulais:
f( x )=
µ x e− xx !
wheref(x)istheprobabilityofxoccurrencesinagiveninterval.mu
istheexpectedvalueormeannumberofoccurrencesinthegiveninterval(orarea).eisaconstant,value2.71828.
Population : the group of individuals about whom we want some information.Each individual in the population is called an element or sometimes a case .Examplesofpopulationsareall the fish in the FraserRiver,Toyotacarsmadein2012,orindeedthepopulationofacountry.Itisusuallynotpossibletodealwithdataatthepopulation levelbecauseitiseitherexpensiveorimpossibletoobtain.Asaresultwedrawasample,orsubset,fromthepopulation.
Randomisationisthetechniqueofselectingelementsfromthepopulationtobuildasamplesothateachelementhasthesame known probability of being selected.This technique greatly reduces bias, described in more detail in the followingchapter.
Regression
Samplingframe :alistoftheitemstobesampled.Imaginethatyouwishtosurvey100studentsfromaUniversity.TheUniversityprovidesyouwithalistofstudentsandtheirIDnumbers.Thestudentsformthepopulation,andthelististhesamplingframe.Youwouldusearandomsamplingmethod,coveredbelow,torandomlyselect100studentsfromtheUniversity.The100studentsformthesample.
Standarderror :Trythis thoughtexperiment.Let’ssay that thereare1000jellybabiesinabowlandforsomereasonyou’dliketoknownthemeanweightofajellybabywithouthavingtoweighall ofthem.Youhaveachoiceoftakingarandomsampleofeithern
= 100 or n = 500. Which sample would produce the most accurate estimate:obviouslythesamplesizeof500becauseitis‘closer’tothepopulation size.
Yetitstillwon’tbecompletelyaccuratebecauseitcouldhappenthatthesampleyouselect is biased in someway: perhaps you selected all theheavyones?We canmeasuretheamountoferrorwiththestandarderror,whichhelpsustodeterminehow‘close’wearetothe
unknownparameter.BelowIshowhowtocalculatethestandarderror(SE).Hereisthestandarderrorforastatisticsuchasamean:
σSE =√ n
Inwords, the standard error is the population standard deviation dividedby thesquarerootofthesamplesize,denotedbyn.Notethatasthesamplesizeincreases,thenthestandarderrordecreases.
Thisis in linewithyourearlier intuition that the larger the sample size then themoreaccuratetheestimate.Usuallywedon’t know‘sigma’,σ ,sowereplaceitwith‘s’whichisthestandarddeviationofthesample.
Transformations :thisisatechniquesometimesusedinregression. Theobjectisusually to transform thedistributionof the dependent variables so that it moreclosely resembles a normal distribution. We do this because the theoreticalunderpinning of linear regression is basedonanassumption of normality in thedependentvariable.Thedependentvariable is typically transformedby taking itsnaturallogarithmandthenusingthetransformedvariableintheregression.Thisisoften used when the dependent variable has only positive values and is highlyskewed to the right. Independent variables can also be transformed such as byaddingasquaredversionof the variable in the regression. Independent variablescanalso betransformedby
zscores :A z score is ameasure of how far an observation contained within avariable is from themean of that variable, in termsof standard deviations. It isquiteeasytocalculatewithinExcel,andthisYoutubeshowsyouhow
zscoresYoutube⁴.
Whydothis?Bydividingthedifferencebetweenthevalueofthe
⁴https://www.youtube.com/watch?v=FSZpynSBev8
observationandthemeanofthevariablebythestandarddeviationofthevariable,likethis:
z i=
x i−x ¯
s
thedifferencesareinunitsofstandarddeviations.Wecanusethisfor:
•Comparingthedistributionsoftwocompletelydifferentvariables
• Detectingoutliers.Anoutlierisanobservationwhichisveryfarfromtherestofthedistribution.Usually,99.7%of the observationswillbewithinthreestandarddeviationsofthemean.Soanobservationwhichismorethanthreestandarddeviationsfromthemeanislikelytobequestionable. Thiscanbegoodorbad:
•Badbecauseperhapssomeonemadeadataentryerrorwhichwecancatch.
• Good because perhaps there is something anomalous and interestingaboutthat‘weird’observation.