straightforward statistics using excel and tableau: a ...index-of.co.uk/ofimatica/straightforward...

BusinessstatisticswithExcelandTableauAhands-onguidewithscreencastsanddata

StephenPeplow

©2014-2015StephenPeplow

WiththankstothestudentswhohavehelpedmeatImperialCollegeLondon,theUniversityofBritishColumbiaandKwantlenPolytechnicUniversity.

Contents

1.Howtousethisbook1

1.1Thisbookisalittledifferent 2

1.2Chapterdescriptions 2

2.VisualizationandTableau:telling(true)storieswith

data9

3.Writingupyourfindings15

3.1Planofattack.Followthesesteps. 17

3.2Presentingyourwork 17

4.Dataandhowtogetit18

4.1Bigdata 21

4.2Someusefulsites 21

5.Testingwhetherquantitiesarethesame23

5.1ANOVASingleFactor 23

5.2ANOVA:withmorethanonefactor 29

6.Regression:Predictingwithcontinuousvariables... 31

6.1Layoutofthechapter 33

6.2Introducingregression 33

6.3Truckingexample 36

6.4Howgoodisthemodel?—r-squared 39

6.5Predictingwiththemodel 40

6.6Howitworks:Least-squaresmethod 41

6.7Addinganothervariable 42

6.8Dummyvariables 43

6.9Severaldummyvariables 49

6.10Curvilinearity 50

6.11Interactions 52

6.12Themulticollinearityproblem 58

6.13Howtopickthebestmodel 58

6.14Thekeypoints 59

6.15Workedexamples 60

7.Checkingyourregressionmodel61

7.1Statisticalsignificance 61

7.2Thestandarderrorofthemodel 64

7.3Testingtheleastsquaresassumptions 66

7.4Checkingtheresiduals 67

7.5Constructingastandardizedresidualsplot 68

7.6Correctingwhenanassumptionisviolated 71

7.7Lackoflinearity 71

7.8Whatelsecouldpossiblygowrong? 73

7.9Linearitycondition 73

7.10Correlationandcausation 74

7.11Omittedvariablebias 74

7.12Multicollinearity 75

7.13Don’textrapolate 75

8.TimeSeriesIntroductionandSmoothingMethods. . 76

8.1Layoutofthechapter 76

8.2TimeSeriesComponents 77

8.3Whichmethodtouse? 78

8.4Naiveforecastingandmeasuringerror 79

8.5Movingaverages 81

8.6Exponentiallyweightedmovingaverages 83

9.TimeSeriesRegressionMethods87

9.1Quantifyingalineartrendinatimeseries usingregression87

9.2Measuringseasonality 89

9.3Curvilineardata 91

10.Optimization95

10.1Howlinearprogrammingworks 95

10.2Settingupanoptimizationproblem 96

10.3Exampleofmodeldevelopment. 97

10.4Writingtheconstraintequations 97

10.5Writingtheobjectivefunction 98

10.6OptimizationinExcel(withtheSolveradd-in)...98

10.7Sensitivityanalysis 100

10.8InfeasibilityandUnboundedness 102


11.Morecomplexoptimization107

11.1Proportionality 107

11.2Supplychainproblems 109

11.3Blendingproblems 110

12.Predictingitemsyoucancountonebyone114

12.1Predictingwiththebinomialdistribution 115

12.2PredictingwiththePoissondistribution 118

13.Choiceunderuncertainty119

13.1Influencediagrams 119

13.2Expectedmonetaryvalue 121

13.3Valueofperfectinformation 123

13.4Risk-returnRatio 124

13.5Minimaxandmaximin 124


14.Accountingforrisk-preferences127

14.1Outlineofthechapter 128

14.2Wheredotheutilitynumberscomefrom? 130

14.3Convertinganexpectedutilitynumberintoacer-

taintyequivalent. 136

15.Glossary137

1.HowtousethisbookIbegan business life as an entrepreneur in business in HongKong. I ran tradeexhibitions,importedcoffeefromKenya,and startedandoperatedtworestaurants,oneofwhich(unusuallyforHongKong)servedvegetarianfood.Thesebusinesseswereprofitable,butIwouldhavesavedmyselfagreatdealofstress,anddonebetterif I had used some business intelligence informed by statistics to improve mydecision-making.Lookingback,IwishIhadbeenable tothinkthroughandwriteanalysesontopicssuchastheseamongmanyothers:

a.calculationofoptimalrestaurantstaffinglevels

b.analysisofsalesovertimeandseasonalityintrends

c.receiptspercustomerbyrestauranttypeandanalyzedstrengthandtypeofanydifferences

d.predictedsalesdataforpotentialexhibitorswithvisualiza-tionsofvarious‘whatif’scenarios

e.gaineddeeperinsightsfromvisualizingmydata

f.wonovermorepartnersandinvestorswithbettervisualandwrittenpresentations

I’msurethattherearemanymorepeoplelikeme,awarethattheyoughttobedoingmorewiththedatatheycollectaspartofnormalbusinessoperations,butuncertainofhowtogoaboutit.There is no shortage of textbooks andmanuals, but thesedon’tseem toget to thehands-onapplicationsquicklyenough.Thisbook is forpeoplesuchasme,intwocomponents:theanalysis,findingouttheunderlyingstoryfromthedata,andthenthepresentationofthestory.

1

1.1Thisbookisalittledifferent

Mostbooksonstatsintroducestatisticaltechniquesonachapterbychapterbasis.Instead,I’vewrittenthisbooksothatitreflects thewaymany people learn. Thechaptersarestructuredbythetypeofquestionyoumightwanttoask.Examplesofthe typeofquestionsaresummarizedbelow,andatthebeginningof eachchapter.Thechaptersthemselvesdoesn’tincludemuchmathand technicaldetails. Insteadtheseareplacedtowardstheendofthebookinaglossary.

ManyoftheExcelproceduresarelinkedtoscreencastspreparedbymetoillustratethat particular procedure. The data-sets used in the book are available from myDropboxpublicfolder:justclickonthehyperlinkthatappearsnexttoeachworkedexample.Withthedata-setsopeninExcel,youcanfollowalongatyourownpace.(And then do the same with your own data). I have used Tableau for somevisualizations,andyouwillfindlinkstomy workbooksandscreencasts.

1.2Chapterdescriptions

Ifyou’rereadingthisbook,thenyouareverylikelyalreadyengagedinbusinessandwouldliketoknowhowtotakethatbusinesstonewheights.Takealookthroughthechapterdescriptionsthat followandthengostraighttothatchapter.Ifyouthinkyoumightneedabitofastatisticsrefresher,lookthroughtheGlossaryinChapter16first.AverygoodfreestatisticsOpenIntrotextbookisavailablehere¹

Chapter2introducesTableauPublic²whichisafreedatavisu-alizationtool.WhileExceldoeshavegraphingtoolswhichareeasytouse,theresultscanlookalittleclunky.Tableauhelps

¹https://www.openintro.org/stat/²http://www.tableausoftware.com/public/

https://www.openintro.org/stat/


http://www.tableausoftware.com/public/



ustomergedatafromdifferentsourcesandcreateremarkablevisualizationswhichcanthenbeeasilyshared.I’llbe illustratingresultsthroughoutthisbookwitheitherTableau,Excelorboth.One caution:ifyoupublishyourresultstothewebusingTableauPublic,yourdataisalsopublished.Ifthisisaproblem,thereisanoption topay foranenhancedversionofTableau.ThescreenshotbelowshowsworkdonechangesonlanduseintheDeltaregionofBritishColumbia.Byclickingonthelink,youcanopentheworkbookandalterthesettings.Youcanfiltertheyear (see topright)andalsolandusetype.ThankstoMalcolmLittleforhisworkonthisproject.TableauworkbookforDeltaLandCoverChanges³

Tableaushowingchangesinlandcover

Tableauhastraininganddemonstrationvideosavailableon itswebsite, and thereare plenty of examples out there. The screen casts which Tableau provides(available at their homepage) are probably enough.Where I have found sometechnique(suchasboxplots)particularlytrickyIhavecreatedscreencastsforthisbook.Theimagebelowisdatawewilluseintheregressionchapter.Youruna

³https://public.tableau.com/views/DeltaCropCategories1996-2011/Sheet1?:embed=y&:showTabs=y&:display_count=yes

https://public.tableau.com/views/DeltaCropCategories1996-2011/Sheet1?%3Aembed=y&%3AshowTabs=y&%3Adisplay_count=yes



truckingcompanyandfromyourlogbooksextractthedistanceanddurationandnumberofdeliveriesforsomedeliveryjobs.Theimageshowsatrendline,withdistanceonthehorizontalaxisandtimetakenontheverticalaxis.Thepointsonthescatterplotare sizedbynumberofdeliveries.Youcanseethatmoredeliveriesincreasesthetime,asonewouldexpect.Usingregression,Chapter6,wewillworkoutamodelwhichcanshowhowmuchextrayoushouldchargeforeachdelivery.

HereisaYouTubefortruckingscatterplot⁴

Tableaushowingdistances,times,anddeliveries

Chapter3.Writingupyour findings. Inmostcases statistical analysis isdone inordertohelpadecision-makerdecidewhattodo.Iamassumingthatitisyourjobtothinkthroughtheproblemandassistthedecision-makerbyassemblingandanalyzingthedataonhis/herbehalf.Thischaptersuggestswaysinwhichyoumightwriteupyourreport,accompaniedbyvisualizationstogetacrossyourmessage.Therearealsolinkstosomehelpfulwebsiteswhich

⁴https://youtu.be/oFACLZLJWZI

https://youtu.be/oFACLZLJWZI

https://youtu.be/oFACLZLJWZI

discussthepreparationofslidesandhowtomakepresentations.

Chapter4:Dataandhowtogetit.Ifyouareconsideringcollectingdatayourself,throughasurveyforexample,thenyou’llfindthischapteruseful.So-calledbigdataisahottopic,andsoIincludesomediscussion.Thechapteralsoincludeslinkstopubliclyavail-abledatasetswhichmightbehelpful.

Chapter 5 deals with tests for whether two or more quantities are the same ordifferent.Forexample,youareafranchiseewiththree coffee-shops.Youwant toknowwhetherdailysalesarethesameordifferent,andperhapswhatfactorscauseanydifference thatyou find. Youwant to know whether there is a statisticallysignificantcommonfactorornot,toeliminatethepossibilitythatthedifferenceyouseeoccurredpurelybychance.Perhapstheaverageageofthecustomersmakesadifferenceinthesales?ANOVAusesverysimilartheorytoregression,thesubjectofthenextchapterandperhapsthemostimportantinthebook.

Chapter 6: Regression is almost certainly the most important statistical toolcovered in thisbook. Inoneformoranother,regression isbehindagreatdealofapplied statistics. The example presented in this book imagines you running atrucking business and wanting to be able to provide quotations for jobs morequickly. It turnsout thatifyouhavesomepastlog-bookdatatouse,such as thedistance of various trips, the time that they took and howmany deliverieswereinvolved,anaccuratemodelcanbemade.Regressioncanfindtheaveragetimetakenforeveryextraunitofdistance,aswellas other variables such as the number ofstops. This chapter shows how to create such a model and how to use it forprediction.Withsuchamodel,youcanmakequotationsreallyquickly.Thischapteralsocoversmorecomplexregressionandhowtogoaboutmodel-building.

Otheruses:youwanttocalculatethebetaofastock,comparingthereturnsofoneparticularstockwiththeS&P500.

Chapter7isabouttestingwhethertheregressionmodelswebuilt

inChapters6and7are in fact anygood.Regressionappears so easy to do thateverybody does it,without checking the validity of the answers. Following theprocedures in this chapterwill help to ensure that yourwork is taken seriously.Thischapterismoreofaguidetothinkingcriticallyabouttheresultsother peoplemight have. Is there missing variable bias? For example, you notice a strongcorrelationbetweensalesofice-creamsanddeathsbydrowning.Canyouthereforesaythatreducingice-creamsaleswillmakethewatersafer?Err–no.Themissingorlurkingvariable is temperature.Both drowning deaths and ice cream sales are afunctionoftemperature,notofeachother.

Chapter8isabouttime-seriesandshowshowwecandetecttrendsindataovertime,and make predictions. The smoothing methods, such as moving averages andexponentiallyweightedmovingaver-agescoveredinthischapterarefinefordatawhichlacksseasonality andwhenonlyarelativelyshort-termforecast is needed.Whenyouhavelonger-rundataandformakingpredictions,thefollowing chapterhasthetechniquesyouneed.

Chapter 9 describes the regression-based approach to time series analysis andforecasting.Thisapproachispowerfulwhenthereissomeseasonalitytoyourdata,forexamplesalesofTVsetsshowadistinctquarterlypattern(asdo umbrellas!).

SalesofTVsetsshowingaquarterlyseasonality

Usingregressionwe candetect peaks and troughs, connect them to seasonsandcalculatetheirstrength.Theresultisamodelwhichwecanusetopredictsalesintothefuture.

Chapter 10 is about optimizationormaking thebest useof a limitednumberofresources.Thisishighlyusefulinmanybusinesssituations.Forexample,youneedto staff a factory with workers of different skills and pay-levels. There is aminimumamount of skills requiredforeachclassofworker,and perhaps also aminimumnumberofhoursforeachworker.Usingoptimization,youcancalculatethenumberofhourstoallocatetoeachworker.

Chapter11Optimizationcanalsobeusedformorecomplex‘blending’problems.Example:Yourunapaperrecyclingbusiness. Youtakeinpapersandotherfiberssuchascardboardboxes.Howcanyoumixtogetherthevariousinputssothatyouroutputmeetsminimumqualityrequirements,minimizeswastage,andgeneratesthemostprofit?

Another example: what mix or blend of investments would best suit yourrequirements?

Chapter 12 concerns calculating the probability of items you can count one byone.Forexample,whatistheprobabilitythatmore

thanfivepeoplewillcometotheservicedeskinthenexthalf-hour?What is theprobabilitythatallofthenextthreecustomerswillmakeapurchase?WecansolvetheseproblemsusingthebinomialandthePoisson distributions.

Chapter13concernschoiceunder uncertainty: if you have a choice of differentactions,eachofwhichhasanuncertainoutcome,whichactionshouldyouchoosetomaximize the expectedmonetary value?Afarmerknowsthepayoffshe/shewillmake from different crops provided the weather is in one state or another(sunny/wet) butatthetimewhenthecropdecisionhastobemade,heobviouslydoesn’t know what the actual state will be. Which crop should he plant? Amanufacturerneedstodecidewhethertoinvestinconstructinganewfactoryatatimeofeconomicuncertainty:whatshouldhedo?

Chapter 14adds thedecision-maker’s riskprofile tohisorher decisionprocess.Mostpeopleareriskaverse,andarewillingto tradeoffsomerisk inexchangeforcertainty. This chapter shows how to construct a utility curve which maps riskattitudes,andthenprioritizethedecisionsintermsofmaximumexpectedutility.

Chapter15isaGlossaryandcontainssomebasicstatisticsinforma- tion,primarilydefinitions of terms that the book uses frequently and which have a particularmeaninginstatistics(forexample‘population’).TheGlossaryalsodiscusseswhytheinferentialtech-niquesusedinstatisticsaresopowerful,allowingusto makeinferencesaboutapopulationbasedonwhatappearstobeaverysmallsample.

UnderEforExcelintheGlossary,you’llfindsomelinkstoscreencastsonsubjectsnotdirectlycoveredinthisbook,butwhichyoumightfindhelpful.

2.VisualizationandTableau:telling(true)storieswithdata

In this book, we’ll work on gaining insights from data by visualization andquantitativeanalysis,andthenpresentingtheresultstoothers.RecentlyapowerfulnewtoolcalledTableauhasbecomeavailable.TableauPublic¹isafreeversion.Butbecarefulwithyourdatabecausewhenyoupublishtotheweb,asyoumustdowithTableauPublic,thenyourdataalsobecomesavailabletoanybody.Ifyourequirethatyourdatabeprotected,youcanpayfortheprivateversion.Tableau9hasjustbeenmadeavailable.

Many of the worked examples in this book are accompanied by a link to acompleted Tableau workbook, showing how you could have used Tableau topresent your findings. Tableau cannot perform easily the more technicalhypothesis-testingaspectsof statisticssuchasregression,butitcanhelpyoutogetyourpointacrossclearly.TableauhasalsoprovidedausefulWhitePaper²onvisualbestpracticeswhichiswellworthreading

In business intelligence we are generally interesting in detecting patterns andrelationships.Thismightseemobvious,becauseashumanswearealwayslookingforthesephenomena.I’djust liketoaddthattheabsenceofapatternorrelationshipmightbejustasinformativeasfindingone.Anexcellentfirststepistotakealookatyourdatawithanexpositorygraphbeforemovingontomoreformaldataanalysis.Belowareexamplesofthetypesofchartsorgraphsmostcommonlyusedineitherexaminingdatafirstoff,

¹http://www.tableausoftware.com/public/²http://www.tableausoftware.com/whitepapers/visual-analysis-everyone

9


http://www.tableausoftware.com/whitepapers/visual-analysis-everyone


http://www.tableausoftware.com/whitepapers/visual-analysis-everyone

justto‘seehowitlooks’.Checkinganexpositorygraphshowsup problems thatmightthrowyou off later:missing data, outliers or,more excitingly, unexpectedandintriguingrelationships.

Histograms

Histogramsbreakthedataintoclassesandshowthedistributionofthedata.Whichclasses(sometimesknownasbins)aremostcommon.Thehistogramshowsusthe‘shape’ofthedata.Aretheremanysmallmeasurements,ordoesthedatalookasthough it is normally distributed and consequently mound-shaped. SeeDistributionintheGlossaryformoreon this.Theplotbelow isof deaths in caraccidentsonaweeklybasisintheUnitedStatesover theperiod1973-1978.Thehistogramshowsonlythedistributionandnotthesequence.

From the histogram only,wecould say thatthemostcommonnumberofdeathsinaweekis between8000and 9000, because thebars of the histogram arehighest there. They arehighest because theycount thenumberof timesin the time-period thatthere have been (forexample 8000 deathsoccurredin15weeks).

What the histogramdoesn’tshowis that

deathshaveactuallybeendecreasingsince

Histogramofcardeaths,reportedweekly

about1976.Thereisprobablyasimpleexplanationforthis:compul-soryseatbeltsperhaps?Thetake-homefromthisisthatexpositorygraphsare reallyeasy todo,and that you should try different methodswith the same data to tease out theinsights.

Cardeathsovertime

Scatterplots

Themostcommonandmostusefulgraphisprobably the scatterplot. It isusedwhenwehavetwocontinuousvariables,suchas twoquantities(weightofcarandmileage)andwewanttoseetherelationshipbetweenthem.Thescatterplotisalsousefulforplottingtimeseries,wheretimeisoneofthevariables.

YoucanuseExcelforthis,andalsoTableau.InExcel,arrangethecolumnsofyourdatasothatthevariablewhichyouwantonthehorizontalxaxis is theoneon theleft.Thisistheindependentvariable.Puttheothervariable,thedependentvariable,ontheverticalyaxis.Itisusualpracticetoputthevariablewhichwethinkisdoingtheexplainingonthexaxisandtheresponsevariableon

theyaxis.Here isascatterplotofEuropeanUnionExpenditureper student as apercentageofGDP:

EU%ofGDPspentonPrimaryEducation

There is a gradual increase over the years 1996 to 2011 as one would expect;educational expenditures tend to remain a relatively fixed share of the budget.However, there was a blip in 2006 which is interesting. Because the data ispercentageofGDP, thedropmight have been caused by an increase inGDP oralternativelybyanactualeducationalspendingdecrease.We’dneedtolookatGDPfiguresfortheperiod.IgotEuropeanGDPfigurespercapitaandusingTableauputthemonthesameplot,withdifferentaxesofcourse.Theresultisbelow

EuropeanGDPandPrimary Expenditures

GDPdroppedaround2008/2009becauseoftheworld financialcrisis.Theblipinprimaryexpendituresisstillunexplained.

Maps

Tableaumakesthedisplayofspatialdatarelativelyeasy,providedyoutellTableauwhichofyourdimensionscontainthegeographicalvariable.Theplotbelowshowschangesingreenhousegas emis-sionsfromagricultureintheEuropeanUnion.Ilinkedworldbankdatabycountry.Theworkbookishere³.

³https://public.tableausoftware.com/views/EuropeanGHGfromAgriculture2010-2011/Sheet1?:embed=y&:display_count=no

https://public.tableausoftware.com/views/EuropeanGHGfromAgriculture2010-2011/Sheet1?%3Aembed=y&%3Adisplay_count=no



Unfortunately, not all geographic entities are available for linkage in Tableau.States and counties in theUnited States are certainly available and provinces inCanada,aswellascountriesinEurope.TherearewaystoimportArcGISshapefilesintoTableau.Ashape-fileisalistofverticeswhichdelimitthepolygonsthatdefinethegeographicalentity.

GHGemissionsfromagricultureintheEuropeanUnion

3.WritingupyourfindingsInchapter1,ImentionedsomeofthequestionsIwishedIhadbeenabletoanswerwhenIwasoperatingabusiness.NowIwanttosuggestwaysinwhichyoumightstructureyourresponseto suchquestions.Theactualanswerscomewhenyou’vedonethestatisticalwork(carriedoutinthefollowingchapters),butitmighthelptohaveatleastastructureonwhichyoucanbeginbuildingyourwork.

Beforeyoudoanyworkontheproblem,getstraightinyourmindexactlywhatyouarebeingaskedtodo.Youneedtobeclearandsodoesthedecision-makertowhomyouaregoingtosubmityourfindings.Hereisonewaytodothis:

Copydownthekeyquestionthatyouarebeingaskedinexactlythewaythatithasbeengiventoyou.Forexample,letussaythatyouhavebeenaskedtoanswerthisadmittedly toughquestion: ‘What factorsareimportantfor the success of a newsupermarket?’

Nowkeeprewritingitinyourownwordssothatitbecomesas clearaspossible.Makesurethateachandeverywordisclear.Whatdoes‘success’actuallymean?—inthebusinesscontextprobably‘mostprofitable’.Whatdoes‘new’mean?Isthisacompletelynewsupermarketoranewoutletofanexistingbrand?Youmightendupwith:‘whenweareplanningalocationforanewoutletforourbrand,whatfactorscontributemosttoprofitability?’Doingallthishelpsyoutoidentifythestatisticalmethodmostsuitableforthetask.Youalsocanincludeherethehypotheses(ifany)thatyouwillbetesting.

Hereisasuggestedsectionorderforyourreport.However,thisprobablywon’tbetheorderofthework.TheorderinwhichyoudothestatisticalworkisinSection3,

yourplanofattack.Sotheorder

15

ofdoingtheactualworkisthatyoucarryouttheplanofattack,and thenwriteupyourresultsfollowingthesectionbysectionsequenceIamgivingyounow.

Section1.Statethequestiontobeansweredclearlyandsuccinctly,asresultoftherewriting you did above.Doing thismakes sure that youclearlyunderstand theproblemandwhat is being askedof you. Writingout thequestion and stating itclearlyrightatthebeginning of your report also ensures that the decision-maker(thepersonwhoposedthequestioninthefirstplace)andyouunderstandeachother.Youcanalsowriteinthehypothesiswhichyouaretesting.

Section2.Provideanexecutivesummaryoftheresults.Thisshouldbeperhapssixlineslongandcontainthekeyfindingsfromyourwork.Don’tputinanytechnicalwords, jargonorcomplicatedresults.Justenoughsothatabusypersoncanskimthroughitandgetthegeneralideaofyourfindingsbeforegoingmoredeeplyintoyourhardwork.

Section3.Outlineyourplanofattack ,describinghowyouhave approachedtheproblem.Youalsocanincludeherethehypotheses(ifany)thatyouwillbetesting.Atypicalplanofattackfollowslateroninthischapter.

Section4.Data.Provideabriefdescriptionofthedatathatyouhaveused.Thesourceofthedata, howyou found it,whether you think it is reliable andwhether it issufficientforthetask.Itisagoodideato includesummarystatisticsat thispoint.This includes information such as the number of observations, and what thevariablesare.

Section5.Statisticalmethodology.Describethestatisticalmethod-ologywhichyouare going to use, for exampleANOVAto testwhetherornot themeansare thesame.

Section6.Carryoutthetests,makingsuretoincludeadescriptionofwhetherornotthehypothesescanberejected.Thisis thecentralpartofthereport.Justmakesuretofocusonansweringthequestion.

Section7.Writeaconclusionwhichdescribestheresultsof the test and how theresults answer the question that was asked. You can also include here anyshortcomingsinthedataorinyourmethodologywhichmightaffectthevalidityofthe results. The conclusion is very important because it ties together all theprevioussections.HereyoucouldincludealinktoaTableau‘story’asanadditionaloralternativewayofpresentingresults.

Section8.References/end notes or further information, especially on sources ofdata, should end the document. If youwant tomake your document look reallygood, and you have lots of references, consider using the Zotero plug-in forFirefox.Itisanexcellentfreewaytomanageyourbibliography.

3.1Planofattack.Followthesesteps.

a.Identificationofthestatisticalmethodthatyouwilluseandwhyyouchosethosemethods

b.Whatdataisrequired,andwherecanitbefound?

c.Conductthestatisticaltests

d.Discusswhethertheresultsarereliable/answerthequestion.(Ifnot,startagain!)

3.2Presentingyourwork

While it is probably best to concentrate on written workbecause the decision-makermightwant toreadyourworkindetail, and discuss it with colleagues, aTableau presentation is an excellent way ofmaking your point over again. SeeChapter2formoreonvisualizationandTableau.

Here is a link to some excellent notes by Professor Andrew Gelman on givingresearchpresentations¹

¹http://andrewgelman.com/2014/12/01/quick-tips-giving-research-presentations/

http://andrewgelman.com/2014/12/01/quick-tips-giving-research-presentations/

http://andrewgelman.com/2014/12/01/quick-tips-giving-research-presentations/

4.DataandhowtogetitInthischapter,we’lllookatthedatacollectionprocess, theprob- lemsthatmightaccompanypoorlycollecteddata,discussso-calledbigdataandprovide links tosomeusefulpublicsourcesofdata.

As you might expect, this book works through the analysis of numbers —quantitativedata—butitisimportanttonotethattheanalysisofqualitativedataisarapidlygrowingarea.Unfortunately theanalysisofqualitativedataisbeyondthisbookandthesoftwareavailabletous.

Quantitativedatacomesfromtwomainsources. Primarydataiscollectedbyyouorthecompanyyouareworkingfor.Forexample,themarketresearchyoudotofindoutthepossiblemarketshareforyourproductprovidesprimarydata.Primarydataincludes bothinternalcompanydataanddatafromautomatedequipmentsuchaswebsitehits.Collectingdata is expensive and highly proprietary. It is thereforeunlikelytobepublishedandavailableoutsidetheenterprise.

By contrast, Secondary data is plentiful and mostly free. It is collected bygovernmentsofallsizes,andalsobymany non-governmentorganizationsaswell.There is an increasing trend towards the liberation of data under various opengovernmentinitiatives.Governmentdataisusuallyreliable,butbesuretochecktheaccompanyingnoteswhichwarnofanyproblems, suchas limitedsamplesizeorchangeinclassificationorcollectionmethods over time.Thereare some links tosecondarydataattheendofthischapter.

18

Experimentaldesign

Experimentsdon’tnecessarilyhave tobeconducted in laboratories bypeople inwhitecoats:asurveyinshoppingmallisalsoaformofexperiment,asisanalyzingthe results of your favorite cricket team. There are two different designtypes: experimental and observational . The keydifference is the amount ofrandomization thatisbuiltintotheexperiment.Ingeneral,morerandomizationisgood,becauseithelpstoremovetheinfluenceoffixedeffects,suchasthequalityofthesoilinaparticularlocation.

Here’sanexampletomakeclearthedifference.Imaginea re-searcherwantingtotest the effect of a new drug onmice. In an experimental design, the mice areexamined before the drug is administered, and then the drug is applied to atreatmentgroupofmiceandacontrolgroup.Thecontrolgroupreceivesnodrugs.Itsjobistoactasareferencegroup.Allthemicearekeptinidenticalconditionsapartfromtheapplicationofthedrug.Theallocationofmicetothetreatmentgroupandtothecontrolgroupisentirelyrandom.

Whilewe can (anddo) carryout drug experiments onmice, itwouldclearlybeverywrongtotrytodothesamewithhumansassubjects.Insteadwecollectdataorobservationsandthenanalyzethoseobservationslookingford i ffe rences .

Tocompensateforthelackofrandomization,wecontrolforob-serveddifferencesbyincludingasmanyrelevantvariablesaspossi-ble.IfweknewthatpersonXhadreceivedaparticulardrugandhaddevelopedaparticularcondition,wewouldwanttocomparepersonXwithsomebodyelsewhohadnotdeveloped that condition.Relevantvariablesthatwewouldwanttoknowmightbe age,genderandpossiblypre-existinghealthconditions.Includingthese variables reduces fixedeffectsandallowsustoconcentrateontheeffectofthedrug.

Problemswithdata

It is obvious that to be credible, your analyses must be based on reliable data.Problemswithdataareusuallyconnectedtopoorsamplingandexperimentaldesigntechniques,especially:

Samplesizetoosmall .Therelativesizeof thesample to thepopulationusuallydoesnotmatter.Itistheabsolutesizeofthesamplethatcounts.Youusuallywantatleastseveralhundredobservations.

Population of interest not clearly defined. It is clear that we need to take asamplefromapopulation,butwhatexactlyisthepopulation?Here’sanexample.Youwanttosurveyshoppersinashoppingmallregardingyournewproduct.Butofthepeopleinside themall,whoexactlyareyourpopulation?Peoplejustentering,peoplejustleaving,peoplehavingtheirlunchinthefoodcourt? Singles, couples,elderlypeopleortheteenagershanging aroundoutsidethedoor?Youcanseethatpickinganyoneofthesegroupsonitsownwillleadtoabiasedsample.

Non-response bias . Many people don’t answer those irritating telephone callswhich come in the evening because they’re busy withdinner.As a result, onlyanswers from those who do choose to answer the survey are counted. Thoserespondentsmostlikelyaren’trepresentativeof thepopulation.Perhaps they livealoneordonothavetoomuchtodo.I’mnotsayingthattheyshouldnotbe in thesample,justthatincludingonlythosewhodorespondmaybiasyoursample.

Voluntaryresponsebias .Ifyoufeelstronglyaboutanissue,then you aremorelikelytorespondthanifyouareindifferent.That’ssimplyhumannature.Asaresult,thesurveyresultswillbeskewedbytheviewsofthosewhofeelmostpassionately.Thisishardlyarepresentativesamplebecausethestrongly-heldviewsdrownoutthemoremoderate voices.

4.1Bigdata

Primarydatafrequentlycomesfromautomatedcollectiondevices,suchasscanners,websites,socialmedia,andthelike.Thevolumeofsuchdataisenormous,andisaptlycalledbigdata.Bigdataisthetermusedtodescribelargedatasetsgeneratedbytraditionalbusi-nessactivitiesandfromnewsourcessuchassocialmedia.Typicalbig data includes information from store point-of-sale terminals, bank ATMs,FacebookpostsandYouTubevideos.

One of the apparently attractive features of big data is simply its size, whichsupposedlyenablesdeeperinsightsandrevealsconnectionswhichwouldnotappearinsmallersamples.Thisargu-mentneglectsthepowerofstatistics,andinparticularinferential statistics. A small sample, properly collected, can yield superiorinsightstoaverylargepoorlycollectedsample.Thinkofitthisway:whichisbetter:averylargesampleinwhichalltherespondentsareinthesameage-groupandofthesamegender;orasmalleronewhichmoreaccuratelyreflectsthepopulation?

4.2Someusefulsites

YoucanofcourseeasilyjustGooglefordata,orlookatthesemorefocusedsites:

Gapminderdata:freetousebutbesuretoattribute¹Workeremploymentandcompensation²

Interestingandwide-ranginghistorical data³GooglePublicData⁴

¹http://www.gapminder.org/data/²http://www.bls.gov/fls/country/canada.htm³http://www.historicalstatistics.org/

⁴http://www.google.com/publicdata/directory

http://www.gapminder.org/data/

http://www.bls.gov/fls/country/canada.htm

http://www.historicalstatistics.org/

http://www.google.com/publicdata/directory

http://www.gapminder.org/data/

http://www.bls.gov/fls/country/canada.htm

http://www.historicalstatistics.org/

http://www.google.com/publicdata/directory

InternationalMonetaryFund⁵

FoodandAgricultureOrganisation⁶UnitedNationsData⁷

TheWorldBank⁸

⁵http://www.imf.org/external/data.htm

⁶http://www.fao.org/statistics/en/

⁷http://data.un.org/

⁸http://databank.worldbank.org/data/home.aspx

http://www.imf.org/external/data.htm

http://www.fao.org/statistics/en/

http://data.un.org/

http://databank.worldbank.org/data/home.aspx

http://www.imf.org/external/data.htm

http://www.fao.org/statistics/en/

http://data.un.org/

http://databank.worldbank.org/data/home.aspx

5.Testingwhetherquantitiesarethesame

This chapter concerns testing whether the population means of two or morequantitiesarethesameornot:andiftheyareinfactdifferent,whetheranyvariablecanbeidentifiedasbeingassociatedwith thedifference.The test we will use isANOVA,(AnalysisofVariance).ThetestwasdevelopedbytheBritishstatisticianSirRonaldFisher,andtheF-testwhichANOVAusesisnamedinhishonor.FisheralsodevelopedmuchofthetheoreticalworkbehindexperimentdesignduringhistimeattheRothamstedResearchStationinEngland.

5.1ANOVASingleFactor

ThemoststraightforwardapplicationofANOVAiswhenwesimply want to testwhetherornottwoormoremeansarethesame.Inthisworkedexample,wehavethreedifferenttypesofwheatfertilizer(Wolfe,WhiteandKorosa)andwewouldliketoknowwhethertheirapplicationproducesequalordifferentyields.

AsthesectiononexperimentaldesigninChapter3emphasized,theexperimentmustbe designed to isolate the effect of the fertilizer, and this is achieved byrandomization. To control for differences in site-specific growing qualities, weselectplotsoflandwhichareassimilaraspossible:exposuretosunlight,drainage,slopeandother relevantqualities.Werandomlyassignfertilizerstoplots,andtheyieldsaremeasured.Thefirstfewlinesofatypicaldatasetappearbelow.

23

Thefertilizerdataset

There are two columns in the dataset: the factor , which is the name of thefertilizer,andtheyieldattributedtoeachplot.Inthiscase,eachfertilizerwastestedonfourplots,providing3x4=12observations.Later,whenusingExceltoruntheANOVA test, it will be necessary to change the format so that the yields aregroupedundereachfactor.

Agoodfirststepistovisualizethedata.HereisaboxplotdrawnwithTableau.

Fertilizer boxplot

Thedatasetishere:fertilizerdataset¹

andhereisaYoutubeofcreatingaboxplotinTableau².

¹https://dl.dropboxusercontent.com/u/23281950/fertilizer.xlsx ²http://youtu.be/3QohthWXp1M

https://dl.dropboxusercontent.com/u/23281950/fertilizer.xlsx

http://youtu.be/3QohthWXp1M

https://dl.dropboxusercontent.com/u/23281950/fertilizer.xlsx

http://youtu.be/3QohthWXp1M

I’llmakeafrankadmissionrightnow:ittookmethebestpartofamorningtogetthetechniqueofdrawingaboxplotinTableaudown, sohopefully thisYouTubewillhelpyoutoavoidspendingsomuchtimeonthis.

The boxplot reveals these useful measurements: the smallest and largestobservation,ortherange.Themedian,whichisthehori-zontallinewithinthebox;andthe25%quartile(upperlineofthebox;and75%quartile,lowerlineofthebox.Therefore50%(75-25)of thedataarecontainedwithin the box. The ‘whiskers’markoffdatawhich are outliers,meaning that any datapointwhich is outside awhiskerisanoutlier.Thisdoesn’tmeantosaythatit issomehowwrong:perhapssomeofthemostinterestingdiscoveriescome fromlookingatoutliers.However,an outliermight also be the result of carelessdataentryandshouldthereforebechecked.Itisclearfromtheplotsthattheyieldsarebynomeansthesame.

Inthisexample,wearemeasuringonlyonefactor :theeffectofthefertilizeronthemeanyieldofeachplot,andsothetestwewanttoconductissinglefactorANOVAwith a completely randomized design. It is completely randomized because theallocationoffertil-izertoplotwasrandom.Wewantanydifferencestobeduetothefertilizerandthefertilizeralone.

Becauseourtestiswhetherthemeansarethesameordifferent,thehypothesisis:

Ho : µ 1= µ 2= µ 3= .. =µ n

withthealternativehypothesisthatnotallthemeansarethesame. That isn’t thesameassayingthattheyarealldifferent;justthatat leastoneisdifferentfromtheothers.

Therejectionrulestatesthatifthepvaluewhichcomesoutofthe testissmallerthan0.05,thenwerejectthenullhypothesis.Ifwe reject the null, thenwemustacceptthealternativehypothesis.

WetestthehypothesisusingtheANOVASingleFactortoolwithinDataAnalysis.First,thetwo-columnstructureofthedata hastobetransformedintocolumnsforeachofthethreefertilizers.ThatiseasilyaccomplishedusingthePIVOTTABLEfunction.Thisyoutube³takesyouthroughtheprocessofchangingthestructureofthedataandrunningtheANOVAtest.Theresultarebelow.

ANOVAoutputforfertilizertest

Thekeystatisticisthep-value.Itismuchsmallerthan0.05andsowerejectthenullhypothesis.Themeansarenotallthesame.Theyaredifferent.ThesummaryoutputtellsusthatWhitehasthelargestmeanyieldandthisagreeswiththeboxplot.Inthiscase,separationofresultsintoarankingofyieldisrelativelyeasybecausetheyaresodistinct.UnfortunatelyExcellacksawayofeasilytestingwhetheranyotherpairarethesameordifferent.Allwecansayforsureisthattheyarenotallthesame.

Here is a slightlymore complicated and realistic example.You are designing anadvertisingcampaignandyouhavemodelswithdifferenteyecolors:blue,brown,andgreen,andalsooneshot in

³http://youtu.be/wevrSWYBl8U

http://youtu.be/wevrSWYBl8U



whichthemodelislookingdown.Youmeasuretheresponsetoeach arrangement.Doestheeyecoloraffecttheresponse?Thedatasetiscalledadcolor⁴.Thefirstfewlinesarehere:

Thefirstfewlinesoftheadcolordataset

Thecoloristhefactor,andwe’llneedtousePIVOTTABLEtotabulatethedatasothattheeyecolorbecomesthecolumns.I’vedonethathere….

Theresultis

Theeyecolorresults

⁴https://dl.dropboxusercontent.com/u/23281950/adcolour.xls

https://dl.dropboxusercontent.com/u/23281950/adcolour.xls

https://dl.dropboxusercontent.com/u/23281950/adcolour.xls

Thepvalueis0.036184,smallerthan0.05.Asaresultwecanstatethatdifferentcolors are associated with different responses. Looking at the summary output,greenhasthehighestaverageat3.85974,sowecouldprobablyselectgreen.

5.2ANOVA:withmorethanonefactor

TheFertilizertestaboveshowedhowtotesttheeffectofasinglefactor.TheANOVAresultsshowedthatthemeanswerenot the same.By inspecting the boxplot andalsotheExceloutput, it isclearthatthevarietyWHITEhasahigheryield.Whatmight be interesting is finding whether another factor also has a statisticallysignificanteffectonyields,andwhetherthetwofactorsinteractedtogether.

ThecaseinquestioninvolvespreparationfortheGMAT,anexamrequiredbysomegraduate schools. The GMAT is a test of logical thinking and is therefore notdependentonspecificpriorlearning.Weknowthetestscoresofsomeapplicants,andwhetherthosestudentscamefromBusiness,EngineeringorArtsand Sciencesfaculties.Thestudentshadalsotakenpreparationcourses,ranging froma 3-hourreview to a 10-week course. The question is: did taking the preparation coursematter;anddidfacultymatter?Herewehavetwofactors:facultyandpreparationcourse.Thedataisalreadyarrangedincolumnsandsowecangostraightinwithatwo-waywithreplication.Lookatthedata⁵.Itisreplicatedbecause there are twosetsofobservationsforeachpreparationtype.Theoutputishere:

⁵https://dl.dropboxusercontent.com/u/23281950/testscores.xlsx

https://dl.dropboxusercontent.com/u/23281950/testscores.xlsx

https://dl.dropboxusercontent.com/u/23281950/testscores.xlsx

TestscoresANOVAoutput

Wehavethepreparationtypeintherows,andtherelevantpvaluethereis0.358,sowefailtorejectthenull.Wecannotsaythatthereisanydifferenceinscoresasaresult of preparation course type; in other words, there is no effect on scoresresultingfrompreparationcoursetype.However,forfaculty(incolumns) there isdistinctdifference,withapvalueof0.004.Fromthesummary output, it looks asthoseintheEngineeringfacultyhadthehighestscore.

6.Regression:Predictingwithcontinuousvariables

This chapter is about discovering the relationships that might or (equallywell)mightnotexistbetweenthevariables in thedataset ofinterest.Here’satypicalifrathersimplisticexampleof regression in action: the operator of some deliverytruckswantstopredictthe average time taken by a delivery truck to complete agivenroute.Theoperatorneedsthisinformationbecausehechargesbythehourandneedstobeabletoprovidequotationsrapidly.Thecustomerprovidesthedistancetobedriven,andthenumberofstopsenroute.Thetaskistodevelopamodelsothatthe truck operator can predict the time takengiven that information.Regressionprovides a mathematical ‘model’ or equation from which we can very quicklypredictjourneytimesgivenrelevantinformationsuchasthedistance,andnumberofstops.Thisisextremelyusefulwhenquotingforjobsorforauditpurposes.

Weknow the sizeof the inputvariables—thedistance andthedeliveries,butwedon’tknowtherateofchangebetweenthemandthedependentorresponsevariable:whatistheeffectontimeofincreasingdistancebyacertainamount?Oraddingonemorestop?Usingthetechniqueof regression ,wemakeuseofa setofhistoricalrecords,perhaps the truck’s log-book, tocalculate the average time takenby thetrucktocoveranygivendistance.Aswithanyattempttopredictthefuturebasedonthepast,thepredictionsfromregressiondependonunchangedconditions:nonewroadworks(whichmightspeedupordelaythejourney),nochangeintheskillsofthedriver.

31

In the language of regression, the time taken in the truck example isthe response or dependent variable, and the distance to be driven is theexplanatory or independent variable. While we will only ever have just oneresponsevariable, thenumberof explanatoryvariablesisunlimited.Thenumberofstopsthetruckhastomakewillimpactjourneytime,aswillvariablessuchastheageofthetruck,weatherconditions,andwhetherthejourneyisurbanorrural.Ifweknow these variables, we also can include them in themodel and gain deeperinsightsintowhatdoes(andequallyimportantdoesnot)affectjourneytime.

Regressionmodelsareusedprimarilyfortwotasks:toexplainevents in thepast;and tomakepredictions.Regression providescoefficientswhichprovidethesizeanddirectionofthechangein theresponsevariableforaoneunitchangeinoneormoreoftheexplanatory variables.

Regressionmakesaprediction forhowlonga truckwill take to make a certainnumberof deliveries and also states how accurate thepredictionwillbe.Witharegression model in hand, we can make quotations accurately and quickly. Inaddition,wecandetect anomaliesinrecordsandreportsbecausewecancalculatehowlongajourneyshouldhavetakenundervariouswhat-ifconditions.

Here I have used a straightforward example of calculating a timeproblemforatrucking firm, but regression is much more powerful than that. Some form ofregressionunderliesagreatdealofappliedresearch.Whenyoureadaboutincreasedriskduetosmoking(orwhateverelseisthelatestdanger)thatriskwascalculatedwith regression. In this chapter,we’ll calculate the differencebetween usedandnewmarioKartsalesonebay,estimatestrokeriskfactors,andgenderinequalitiesinpay,allwithregression.

Regression isnotverydifficult todo,but theproblemis thateverybodydoes it.Ordinary Least Squares (OLS), linear regression’s proper name, rests on someassumptionswhichshouldbecheckedforvalidity,butoften aren’t.We cover theassumptions,andwhat

todoifthey’renotmet,inthefollowingchapter.Thischapterisaboutthehands-onapplicationsofregression.

6.1Layoutofthechapter

The chapter begins with some background on how regression works. We’llillustratethetheorywiththetruckingexamplemen-tionedabove,beforegoingontoaddingmoreexplanatoryvariablestoimprovetheaccuracyoftheprediction.

Particularly useful explanatory variables are dummy variables,which take on acategoricalvalues,typicallyzeroor1.Forexample,wecouldcodeemployeesasmaleandfemale,anddiscoverfromthiswhetherthereisagenderdifferenceinpay,andcalculatetheeffect. Inaworkedexample in the text,weanalyze the salesofmarioKartonebay,andusedummyvariablestofindtheaveragedifferenceincostbetweenanewandausedversion.

Sometimestwovariablesinteract:higherorlowerlevelsinonevariablechangetheeffectofanothervariable.Inthetext,weshowthattheeffectofadvertisingchangesasthelistpriceoftheitemincreases.

Finally,we’lldiscusscurvilinearity .Therelationshipbetweensalaryandexperienceisnon-linear.Aspeoplegetolderandmoreexpe-rienced,theirsalariesfirstmovequicklyupwards,andthenflattenoutorplateau.Wecancapturethatnon-linearfunctiontomakepredictionmore accurate.

6.2Introducingregression

Atschoolyouprobablylearnedhowto calculate the slopeofa line as‘riseoverrun’.Let’ssayyouwanttogotoParisforavacation.Youhaveuptoaweek.Therearetwomainexpenses,theairfareand thecostofaccommodationpernight.Theairfareisfixedand

staysthesamenomatterwhetheryougoforonenightorseven.Thehotel(plusyourmealchargesetc)is$200pernight.Youcouldwriteupasmalldatasetandgraphitlikethis:

TotalParistripexpenses

Theequationforis:Totalexpenses=1000+200*Nights

Basically,youhave justwrittenyour first regressionmodel. Themodelcontainstwocoefficientsofinterest:

• theinterceptwhichisthevalueoftheresponsevariable(cost)whentheexplanatoryvariableiszero.Heretheinterceptis

$1000.Thatistheexpensewithzeronights.Itisthepointon theverticalyaxiswherethetrendlinecutsthroughit,wherex=0.Youstillhavetopaytheairfareregardlessofwhetheryoustayzeronightsormore.

• theslopeof the linewhich is$200.Foreveryoneunit increase in theexplanatoryvariable(nights)thedependent variable (total cost) increasesbythisamount.

In regression, we usually know the dependent variable and the independentvariable, but we don’t know the coefficients.Weuse regressionto extract thosecoefficientsfromthedatasothatwecanmakepredictions.

Howdoweknowwhethertheslopeoftheregressionlinereflectsthedata?Trythisthoughtexperiment:ifyouplottedthedataand

thetrendlinewasflat,whatwouldthatmean?Norelationship.You couldstay inParisforever,free!Aflatlinehaszeroslope,andsoanextranightwouldnotcostanymore.

More formally, thehypothesis testingprocedure tests thenullhypothesisthat theslopeiszero.Ifwecanrejectthenull,thenwearerequiredtoacceptthealternativehypothesis,whichisthattheslopeisnotzero.Ifitisn’tzero(thetrendlinecouldslopeupordown)thenwehavesomething toworkwith.We test the hypothesisusingthedatafoundfromthesample,andExcelgivesusapvalue.Ifthepvalueissmallerthan0.05,werejectthenullhypothesis.Ifwerejectthenullwecansaythatthereisaslopeandtheanalysisisworthwhile.

The Paris example had only one independent variable (number of nights) butregression can includemanymore, as the trucking examplebelowwillshow.Ifthereismorethanonevariable,wecannotshowtherelationshipwithonetrendlineintwodimen-sions.Instead,therelationshipisahyperplaneorsurface.Theplotbelow shows the effect on taste ratings (the dependent variable) of increasingamountsoflacticacidandH2Sincheesesamples.

3Dhyperplaneforresponsestocheese

Hereisanotherexample.Wehavethedataonthesizeof thepopulationofsometowns and also pizza sales. How can we predict sales given population. PizzaYouTubehere¹

https://www.youtube.com/watch?v=ib1BfdVxcaQ

6.3Truckingexample

The records of some journeys undertaken are available in an Excel file named‘trucking’².

Takealookatthedata.AsImentionedatthebeginningofthebook, itisalwaysagoodideatobeginwithatleastasimpleexploratoryplotofthevariablesofinterestinyourdata.WithExcelwecanquicklydrawaroughplotsothatwecanseewhatisgoingon.Whatwewanttodoispredicttimewhenweknowthedistance.The

¹https://www.youtube.com/watch?v=ib1BfdVxcaQ²https://dl.dropboxusercontent.com/u/23281950/trucking.xlsx

https://dl.dropboxusercontent.com/u/23281950/trucking.xlsx

https://www.youtube.com/watch?v=ib1BfdVxcaQ

https://dl.dropboxusercontent.com/u/23281950/trucking.xlsx

scatter plot below shows time as a function of distance.YouTube: scatterplot oftrucking data³

Trucking scatterplot

Time is the dependent variable and distance is the independent variable. Theindependentvariablegoesonthehorizontalx axis,thedependentvariableontheverticalyaxis.

Fromtheplotitisclearthatthereisapositiverelationshipbetweenthetwovariables.Asthedistanceincreases,thensodoesthetime.Thisishardlyasurprise.Butwewantmorethanthis:wewanttoquantifytherelationshipbetweenthetimetakenandthedistance traveled.Ifwecanmodel this relationship thatwill be usefulwhencustomersaskfor quotations.

Thedatasetcontainsonefurthervariable,whichisthenumberofdeliveriesontheroute.We’llusethatlateron.Fornowwe’llusejustthetimeandthedistance.

Tobuildourmodel,wewanttobuildanequationwhichlookslikethisinsymbolicform

³https://www.youtube.com/watch?v=HSpY12mLXoU

https://www.youtube.com/watch?v=HSpY12mLXoU



y = b o + b 1 x 1+…b n x n +e

yhat(spoken‘yhat’) is the dependent variable. It is called ‘hat’ because it isanestimate. b0 is the intercept, or the point where the trend line (also called theregression line) passes through the vertical y axis. This is the point where theindependentvariableiszero.Youperhapsrememberasimilarequationfromschool:y=mx

+b.

Inthetruckingexample,theintercept(b0)mightbethe timetakenwarmingupthetruckandcheckingdocumentation. Timeis running,but the truck isnotmoving.b1 is the ‘coefficient’fortheindependentvariablebecauseitprovidesthechangeinthedependentvariableforaone-unitchangeinthe independentvariable.x1istheindependentvariable,inthiscasedistance.Wewanttobeabletopluginsomevalueoftheindependentvariableandgetbackapredictedtimeforthatdistance.

Thebandthexbothhaveasubscriptof1becausetheyarethefirst(andsofaronlyindependentvariable).Ihaveputinmorevariables just to indicate thatwe couldhavemany.

Thecoefficient,b1,providestherateofchangeofthedependentvariableforaoneunitchangeintheindependentvariable.Youcanthinkofthisastheslopeoftheline:asteeperslopemeansagreaterincreaseinyforaone-unitincreaseinx.

WecannowfindtheestimatedregressionequationusingtheregressionapplicationinExcel.First,weusetheregressfunctioninExcel’sdataanalysistooltoregressdistanceontime.YouTube⁴

Theregressionoutputisasbelow:

⁴https://www.youtube.com/watch?v=xKsYfa7YGgE

https://www.youtube.com/watch?v=xKsYfa7YGgE

https://www.youtube.com/watch?v=xKsYfa7YGgE

Theregressionoutput

Ihavemarkedsomeofthekeyresultsinred,andIexplainthembelow.

6.4Howgoodisthemodel?—r-squared

Thereisaredcirclearoundtheadjustedr-squaredvalue,here0.622.r-squaredisameasureofhowgoodthemodelis.r-squaredgoesfrom0to1,witha1meaningaperfectmodel, which is extremelyrareinappliedworksuchasthis.Zeromeansthereisnorelationshipatall.Theresulthereof0.62meansthat62%ofthevariancein the dependent variable is explained by themodel. This isn’t bad, but it’s notgreat either. Below,we’ll add more independent variables and show how theaccuracy of the model improves.Theadjusted r-squared takes into account thenumberofvariablesintheregressionequationandalsothesamplesize,whichiswhyit is slightly smaller than the r-squared value. It doesn’t have quite the sameinterpretationasr-squared,butitisveryusefulwhen comparing models. Like r-squared,wewantashighavalueaspossible:higherisbetterbecauseitmeansthatthemodelisdoingamoreaccuratejob.

Alsocircledarethewordsinterceptanddistance.Thesegiveusthe

coefficientsweneedtowritethemodel.ExtractingthecoefficientsfromtheExceloutput,wecanwritetheestimatedregressionmodel.

y =1 . 274+0 . 068∗Dist

whereyhatisthe estimated or predicted response time. If you took avery largenumberofjourneysofthesamedistance,thiswouldbetheaverageofthetimetaken.

Theinterceptisaconstant,itdoesn’tchange.Itistheamountoftimetakenbeforeasinglekilometerhasbeendriven.Thedistancecoefficientof0.068 is thepieceofinformationthatwe reallywant.Thisistherateofchangeoftimeforaoneunitchange in the dependent variable, distance. An increase of one kilometer indistanceincreasesthepredictedtimeby0.0678hours,and vice-versaofcourse.Dothemathandyou’llseethattheaveragespeedis14.7mph.

6.5Predictingwiththemodel

Amodel such as this makes estimating and quoting for jobs much easier. If amanagerwantedtoknowhowlongitwouldtakeforatrucktomakeajourneyof2.5kmsforexample,allheorshewouldneedtodoistoplug2.5intodistanceandget:

predictedtime=1.27391+0.06783*2.5=1.4435hours.

multiplythisresultbythecostperhourofthedriver,addthemiscellaneouschargesandyou’redone.

Itisunwisetheextrapolate.Onlymakepredictionswithintherangeofthedatawithwhichyoucalculatedyourmodel(seealsoChapter4onthis.

6.6Howitworks:Least-squaresmethod

Themethod we just used to find the coefficients is called themethodof leastsquares .Itworks like this: the software tries to find a straight line through thepointsthatminimizestheverticaldistancebetweenthepredictedyvalueforeachx(thevalueprovidedbythetrendline)andtheactualorobservedvalueofeachyforthatx.Ittriestogoascloseas itcan toallof thepoints.The verticaldistanceissquaredsothattheamountsarealwayspositive.

Theplotbelowisthesameastheoneabove,exceptthatIhaveaddedthetrendline.

Theblacklineillustratestheerror

Theslopeofthetrendlineisthecoefficient0.0678.Thatistherateofchangeofthedependentvariableforaone-unitchangeinthe independent variable. Think “riseoverrun”.Iextendedthetrendlinebackwardstoillustratethemeaningofintercept.Thevalueoftheinterceptis1.27391,whichisthevalueofywhenxiszero.Thisisactually a form of extrapolation, and because we have no observations of zerodistance,thisisanunreliableestimate.

Nowlookatthepointwherex=100.Thereareseveralyvalues representingtimetakenforthisdistance.Ihavedrawnablackline

betweenoneparticularyvalueandthetrendline.Thisverticaldistancerepresentsanerrorinprediction:ifthemodelwasperfect, all theobservedpointswould liealongthetrendline.Themethodofleastsquaresworksbyminimizingtheverticaldistance. Itispossibletodothecalculationsbyhand,buttheyaretediousandmostpeopleusesoftwareforpracticaluse.Thegapbetweenpredictedandobservedisknownasaresidualanditisthe‘e’terminthegeneralformequationabove.

Theerrordiscussedjustaboveistheverticaldistancebetweenthepredictedandtheobserved values for every x value. The amount of error is indicated by the r-squaredvalue .r-squaredrunsfrom0(acompletelyuselessmodel)to1(perfectfit).

Intheglossary,underRegression,Ihavewrittenupthemaththatunderpins theseresults.

6.7Addinganothervariable

Ther-squaredof0.66wefoundwithoneindependentvariableisreasonableinsuchcircumstances, indicating that our model explains 66% of the variability in theresponsevariable.Butwemight dobetter by adding another variable to explainmoreofthevariability.

Thetruckingdatasetalsoprovidesthenumberofdeliveries that the driver has tomake.Clearly,thesewillhaveaneffectonthetimetaken.Let’sadddeliveriestotheregressionmodel.

Notethatyou’llneedtocutandpastesothattheexplanatoryvariablesareadjacenttoeachother.Itdoesn’tmatterwheretheresponsevariable is,but theexplanatoryvariablesmustbe adjacentinoneblock.Thenewresultisbelow:

Nowwithdeliveries

Notethatther-squaredhasincreasedto0.903, so the newmodel explains about25%moreofthevariationintime.Theimprovedmodelis:

y =− 0 . 869+0 . 06∗Dist+0 . 92∗Del

A few things to note here: the values of the coefficients have changed. This isbecausetheinterpretationofthecoefficientsinamultiplemodellikethisisbasedononlyonevariablechanging .Forexample, thecoefficientofdistanceis0.92,almostonehourforeachdeliveryassumingthatthedistancedoesn’talsochange.Weshouldinterpretthecoefficientsundertheassumptionthatalltheothervariablesare‘heldsteady’,apartfromthecoefficientofinterest.

6.8Dummyvariables

Above, we saw how adding a further variable has dramatically improve theaccuracyofamodel.Adummyvariableisanadditional variablebutone thatweconstructourselvesasa resultofdividing data into two classes, for example bygender.Dummyvariablesare

powerful because they allow us to measure the effect of a binaryvariable, known as a dummy or sometimes indicator variable. Adummyvariabletakesonavalueofzeroorone,andthuspartitionsthedataintotwoclasses.Oneclassiscodedwithazero,andiscalledthereference group . The other classes are coded with a one andsuccessive numbers.

Theremay bemore than two groups, but therewill always be onereference group. W eare generating an extra variable, so theregressionequationlookslike this:

y =b 0+b 1 x 1+b 2 x 2

In the case of observations which have been coded x1=0 (thereference group), then the b1 term will disappear because it ismultipliedbyzero.Theb2termremains.Forthereferencegroup,theestimatedregressionequationthensimplifiesto

y ˆ reference=b 0+b 2 x 2

whileforthenon-referencegroup,itis

y ˆ nonreference=b 0+b 1 x 1+b 2 x 2

Thesizeofb1x1represents thedifferencebetweenthe averagesizeofthereferencelevelandwhatevergroupisthenon-referencegroup.Let’sworkthroughanexample.Creatingadummyvariable⁵

Thedataset[‘gender’](https://dl.dropboxusercontent.com/u/23281950/gender.xlsx)containsrecordsofsalariespaid,yearsofexperienceandgender.

Wemightwantto knowwhethermen andwomen receive the samesalarygiventhesameyearsofexperience.Loadthedata,then run alinear regression of Salary on Years of Experience, using years ofexperienceasthesoleexplanatoryvariable.Theresultisbelow:

⁵https://www.youtube.com/watch?v=TBJsEb2UCPs

https://www.youtube.com/watch?v=TBJsEb2UCPs

https://www.youtube.com/watch?v=TBJsEb2UCPs

Thegenderregression

The interpretation is that the interceptof$30949 is the average starting salary,withyearsofexperiencezero.Thecoefficient

$4076meansthateveryadditionalyearofexperienceincreasestheworker’ssalarybythisamount.

Ther-squaredis0.83,sothesimpleone-variablemodel explainsabout83%ofthevariation in the dependent variable, salary. We have one more variable in thedataset,gender.Addthisasadummyvariableandruntheregressionagain.Notethatyouwillhavetocutandpaste thevariablesso that theexplanatoryvariables areadjacent.

Withthegenderdummyadded

Ther-squaredhasincreasedtonearly1,soournewmodelwiththeinclusionofgenderisveryaccurate.Theimportantnewvariableisgender,codedasfemale=0andmale=1.Nothingsexistaboutthis,wecouldequallywellhavereversedthecoding.Ifyouarefemale,thenthemodelforyoursalaryis:23607.5+4076*Years

ifyouaremale,thenthemodelforyoursalaryis23607.5+114683.67+ 4076Years

Eachextrayearofexperienceprovidesthesamesalaryincrease,butonaveragemalesreceive$14683.67moreearnings.

Hereisanotherexample,usingthemaintenancedataset.Dummyvariable⁶

Anotherdummyvariableexampleandacautionarytale

Themariokartdatasetcamefrom[OpenIntroStatistics](www.openintro.org)awonderfulfreetextbookforentrylevelstudents.Thedataset

⁶https://www.youtube.com/watch?v=Yv681upodDI

https://www.youtube.com/watch?v=Yv681upodDI



contains informationon thepriceofMarioKart in ebay auctions.First let’s testwhethercondition‘new’or‘used’makesdifference.ConstructanewcolumncalledCONDUMMY,codednew=1andused=0.Nowrunaregressionwithyournewdummyvariableagainsttotalprice.Theoutputisbelow

Theconditiondummyresults

Thisresultistremendouslybad.Thepvalueis0.129,meaningthatconditionisnotastatistically significant predictor of price. Surely the conditionmust have somesignificant effect?Wait—we forgot to do some visualization. To the right is ahistogramofthetotalpricevariable.

Lookslikewemighthaveaproblemwithoutliers…someof the observations aremuchlargerthantheothers.Takeanotherlookwithabarchartatthehigherprices.

marioKarttotalprices

Wecanidentifytheseoutliersusingzscores(seetheGlossary).Orwecouldjustsortthembysizeandthenmake

avaluejudgementbasedonthedescription.That’swhatIhavedoneinthisYouTube.Outlierremovaland regres-sion⁷

Thetotalpricesasabarchart

It seems that two of the items listed were for grouped items which were quitedifferentfromtheothers.Thereisthereforealegitimate reasonforexcludingthem.Belowarethenewresults:

⁷https://www.youtube.com/watch?v=ivLGteHuu3Q

https://www.youtube.com/watch?v=ivLGteHuu3Q




CorrectedmarioKart output

Theestimatedregressionequation is

y =42 . 87+10 . 89∗CONDUMMY

Theconditiondummywascodedasnew=1,used=0.Ifamariokartisused,thenitsaveragepriceis42.87,ifitnewthentheaveragepriceis42.87+10.89.Theaveragedifferenceinpricebetweenoldandnewisnearly$11.Makessense.

Take-home:checkyourdatabeforedoingtheregression.

6.9Severaldummyvariables

ThegenderandthemarioKartexamplesabovecontainedjustonedummyvariable.But it is possible to havemore. For example, your sales territory contains fourdistinctregions.Ifyoumakeoneofthefourthereferencelevel,andthendivideupthedatawithdummies

for the remaining three regions, you can compare performance in each of theregionsbothtoeachotherandtothereferencelevel.

Thedatasetmaintenancecontainstwovariableswhichyoucanconverttodummies,followingthisYouTube.

MaintenanceRegression⁸

https://www.youtube.com/watch?v=xWdcT7u9YFE&feature=em-upload_owner

6.10Curvilinearity

Sofar,wehaveassumedthattherelationshipbetweenthedepen-dentvariableandtheindependentvariablewaslinear. However,inmanysituationsthisassumptiondoes not hold. The plot below showsethanolproduction inNorthAmerica overtime.

NorthAmericanethanolproduction.Source:BP

ThesourceofthedataisBP.Itisclearthatproductionofethanolisincreasingyearlybutinanon-linearfashion.Justdrawingastraightlinethroughthedatawillmisstheincreasingrateof production.Wecancapturetheincreasingratewithaquadraticterm,which is simply the time element squared. I have created a new variablewhichindexestheyears,whichIhavecalledt,andafurthervariable

⁸https://www.youtube.com/watch?v=xWdcT7u9YFE&feature=em-upload_owner

https://www.youtube.com/watch?v=xWdcT7u9YFE&feature=em-upload_owner

whichistsquared.

urvilinearregressionYouTube⁹

Thedatasetwithanindexfortime.Source:BP

Aregressionoftagainstoutputhasanr-squaredvalueof0.797.Inclusionofthetsquaredtermincreasesther-squaredmarkedlyto

0.98.Theregressionoutputisbelow.

Regressionoutputwiththequadraticterm

Wewouldwritetheestimatedregressionequationas

y =4504−1301 t+194 . 9 t 2

Noticethatforearlysmallervaluesoft,theeffectofthequadratic

⁹https://www.youtube.com/watch?v=jgjWSpyPBqg

https://www.youtube.com/watch?v=jgjWSpyPBqg

https://www.youtube.com/watch?v=jgjWSpyPBqg

termisnegligible.Astgetslargerthenthequadraticswampsthelineartterm.

Makingaprediction .Let’spredictethanolproductionafter5years.Then

y =4505−1301∗5+194 . 9∗25=2872 . 5

Theplotbelowshowspredictedagainstobservedvalues.While the fit is clearlyimperfect,itiscertainlybetterthanastraightline.

Predictedagainstobservedethanolproduction.

6.11Interactions

Increasingthepriceofagoodusuallyreducessalesvolume(al- thoughof courseprofitmightnotchangeifthepriceincreasessufficientlytooffsetthelossofsales.Advertisingalsousuallyincreasessales,otherwisewhywouldwedoit?

What about the joint effect of the two variables? How about reducingthepriceandincreasing theadvertising?The jointeffect is called an interaction and caneasilybeincludedintheexplanatoryvariablesasan extra term.Theoutput belowistheresultof

regressingsalesonPriceandAdsforaluxurytoiletriescompany.Theestimatedregressionmodel is

y =864−281 P+4 . 48 Ads

Regressionoutputforjustpriceandads

Sofarsogood.Thesignsofthecoefficientsareaswewouldexpectfromeconomictheory.Salesgodownaspricerises(negativesignonthecoefficient).Salesgoupwithadvertising(positivesign onthecoefficient).Caution:donotpaytoomuchattention to the absolute size of the coefficients.The fact that price has a muchlargercoefficientthanadvertisingisirrelevant.Thecoefficientisalsorelatedtothechoiceofunitsused.

Nowcreateanothertermwhichispricemultipliedbyads.Callthis termPAD.Thefirstfewlinesofthedatasetarebelow.

Firstfewlineswiththenewinteractionvariable

Nowdotheregressionagain,includingthenewterm.

Inclusionoftheinteractionterm

The new term is statistically significant and the r-squared has increased to0.978109.Thenewmodeldoesabetterjobofexplainingthevariability.Howcomepriceisnowpositiveandtheinteractionterm is negative?We have to look at theresults as a whole rememberingthatthesignsworkwhenalltheothertermsare‘heldconstant’.Theexplanation:asthepriceincreases,theeffectofadvertisingonsalesisLESS.Youmightwanttolowertheprice andseeiftheincreasedvolumecompensates.

Anotherinteractionexample

Here’sanotherexample,thisonerelatingtogenderandpay. Thedatasetiscalled‘paygender’ and contains information on the gender of the employee, his/herreview score (performance), years of experience and pay increase. We want toknow:

•Isthereagenderbiasinawardingsalaryincreases?

•Inthereagenderbiasinawardingsalaryincreasesbasedontheinteractionbetweengenderandreviewscore?

First,let’sregresssalaryincreasesonthedummyvariableofgenderandalsoReview.Myresultsarebelow(IhavecreatedanewvariablewhichIhavecalledG.Itisjustthegendervariablecodedwithmale

=0andfemale=1.

GdummyandReview

Nothingverysurprisinghere:womengetpaidonaverage233.286 lessthanmen.And—holdinggendersteady–eachpoint increase in Reviewgives an increase insalaryof2.24689.

How do I know thatwomen are paid less than men? The estimated regressionequationis:

y =204 . 5061 − 233 . 286 ∗ ( x=1 ifF )−233 . 286∗( x=0ifM )+2 . 24689∗Review

Rememberhowwecodedmenandwomen?Thex=0ifMtermdisappears,soweareleftwiththisequationformen:

y =204 . 5061+2 . 24689∗Review

andthisoneforwomen(I’vedonethesubtraction)togive

y =− 28 . 779+2 . 24689∗Review

Conclusion:thereisagenderbiasagainstwomen.ForthesameReviewstandard,onaveragewomenarepaid233.286lessthanmen.

How about the second question — possibility that the gender biasincreases with Review level? Wecan test this with an interactionvariable.MultiplytogetheryourgenderdummyandtheReviewscoretocreateanewvariablecalledInteract.Thendothe regression againwith salary regressed against G, Review and Interact. My output isbelow.

Interactionoutput

Formen,theequationisSalary=59.94472+4.848995(becausetheGandInteracttermsgotozerobecausewecodedMale=0.SoforeveryextraincreaseinReview,aman’ssalaryincreasesby4.848995.

Forwomen,theequationisSalary=59.94472-29.714-1*(4.848995-4.05468)soforeveryone increase inReview,awomen’ssalary increasesbyonly 4.848995-4.05468=0.79. Notice that the adjusted r- squared for thismodel is higher (it is0.927339comparedto0.810309) thanthemodelwithjustGandReview.So theinteractionisbothstatisticallysignificant,pvalue0.000316)andhasthemeaningthatforwomen,improvingtheReviewscoredoesn’tincreasethesalaryasmuchasformen.

ConclusionTheanalysisshowedthatwomenarebeingtreatedunfairly.Thereisagenderbiasagainsttheminaveragesalariesfor thesameperformancereview;andeach increase in reviewpoints earn themconsiderably less insalary thanfor theequivalentmale.Caution: thisconclusion isbasedsolelyon the limitedevidenceprovidedbythissmalldataset.

6.12Themulticollinearityproblem

Multicollinearityreferstowayinwhichtwoormorevariables ‘explain’ thesameaspectof thedependentvariable.Forexample, let’ssaythatwehad a regressionmodelwhichwasexplainingsomeone’ssalary.Employeestendtogetpaidmoreastheygetolderandalsoastheiryearsofexperienceincreases.Soifwehadbothageand years of experience in the list of independent variables we would almostcertainlysufferfrommulticollinearity.

Multicollinearitycanleadtosomefrustratingandperplexing re-sults.Yourunaregression.Oneofthevariableshasapvaluelargerthan0.05,soyoudecidetotakeitout.Youruntheregressionagainand—-thesignand/or significanceofanothervariables changes. This happens because the two variables were explaining thesameaspectofthedependentvariablejointly.

Howtosolvethisproblem:checkthecorrelationoftheindependentvariablesfirst,beforeputtingthemintoyourmodel.Chapter14hasasectiononcorrelation.Ifyoufindthatthecorrelationofanytwovariablesishigherthan0.7,besuspicious.Thesetwomaybringyoursomegrief!

6.13Howtopickthebestmodel

Youwillbetryingoutdifferentformulations,addingandremovingvariablestotrytocaptureasmuchexplanatorypowerasyoucan.Howyoudodecideifonemodelisbetterthananother?Therearetwoapproaches:

• Lookonlyattheadjustedr-squaredvalue,evenif yourmodel containsvariableswith a p value larger than 0.05. If the adjustedr-squaredvaluegoesup,leavesuchvariablesin.

•Pruneyourmodelsothatitcontainsonlyvariableswithapvalue<=0.05.Youstillwantashighanadjustedr-squaredas

youcanget,butyoualsowantallyourexplanatoryvariablestobestatisticallysignificant.

Itakethelatterapproach.Youusuallyhavefewervariablesbutallof themhaveareason(statisticalsignificance)forbeinginyourmodelandyoucaninterprettheirmeaningintelligently.Aparsimoniousmodelisbetter thanacomplexmodel thatfitsyourdataverywell.Simpleandrobustisgood.Thedatasetthatyouusedtofityourmodelisonlyasamplefromapopulation.Anoverlycomplexmodelmaynotworkwellwhenpresentedwithadifferentsamplefromthepopulation.

6.14Thekeypoints

• Think throughyourmodelbeforeyoustart including vari-ables.Whatvariablesdoyouthinkwillhaveaneffectonthedependentvariableandinwhichdirection(plusorminus).Itistemptingtojustputineverythingandhope for the best but this rarely works. Some software is able to dostepwiseregression,pullingoutinsignificantvariablesforyou, butExcelisnotoneofthem.

• Keepyourmodelassimpleaspossible.Complicatedmodels rarelyworkwell.

• Visualiseyourdatafirst,evenwithasimplescatterplotaswehavedonethroughoutthischapter.

•Checkwhetheryouhaveallthevariablesthatyoumightneed.Ifyouweretrying topredictwhetherashopselling expensive jewellerywouldmakesufficientsales,youmightwanttoknowtheaverageincomeofresidents.Ifyoudon’thaveit—getit.Thereisahugeamountofdatalyingaboutwhichyoucanobtain either free or quite cheaply. I’veincludedaverybrieflistofURLsinChapter12.

6.15Workedexamples

1.Theestimatedregressionequationforamodelwithtwoindependentvariablesand10observationsisasfollows:

y =29+0 . 59 x 1+4 . 9 x 2

Whataretheinterpretationsofb1andb2inthisestimatedregres-sion equation?

Answer:thedependentvariablechangesbyonaverage0.59whenx1changebyoneunit,holdingx2constant.Similarlyforb2.

Predictthevalueoftheindependentvariablewhenx1is175andx2is290:

y =29+0 . 59(175)+4 . 9(290)=1553 . 25

7.Checkingyour regressionmodelItisn’tdifficulttobuildaregressionmodelasthepreviouschapterhasshown.Butthatispartoftheproblem:everybodydoesit.Butnoteverybodytakesthetroubletochecktheresults.Checkingtheresultsisanimportantstepfortworeasons:first,tomakesureyourcalculationsandpredictionsarecorrect;andsecondlytoshow thirdparties thatyourwork is solid. In this chapterwe’llworkondoing just that. Todecidewhetherthemodelswehavebeenworkingonareanygoodweneedtolookattwoareas:

•Isyourmodelstatisticallysignificant?

•Aretheassumptionsbehindtheleastsquaresmethodmet?Inparticular,linearityandresidualdistribution.

Thefirstareaiseasiertoworkthroughthanthesecond,whichisprobablywhyonedoesn’talwaysseeresidualanalysisdiscussedwhenresultsarepresented.Thisisashamebecausethereisagreatdealtobelearnedfrompickingthroughresiduals.Youranalysiswillbegreatlyimprovedbyacloseattentiontothisarea.

Belowwe’llworkthroughstatisticalsignificanceandthentesttheassumptions.

7.1Statisticalsignificance

Theregressiontrendlinedisplaysarateofchangebetween twovariables.Thelineslopesupwardswhenthedependent variableincreasesforaone-unitincreaseinthe

independentvariable(

61

positive relationship) and vice-versa (negative relationship). What we want toknowis this:did thatslopeupwardsordownwards occur by chance; ifwe tookanothersamplefromthepopulationwouldweachieveasimilarresult?Keypoint:weare usuallyworkingwithasampledrawnfromapopulation.Thesamplein thetruckingexamplewasthefirm’slogbook.

Thenullhypothesisisthatthereisnoslope;withthealternativethatthereisinfactaslope.Thehypotheses(seetheGlossaryforhelponhypotheses)totestwhetherthereisastatisticallysignificantslopeare:

Thenull:

H o : β =0

andthe alternative:

H a : β = 0

where

β

isthesymbolforthecoefficientsofthepopulationparameter of the independentvariableofinterest.Inwords,thehypothesisistestingwhetherbetaiszeroornot.Ifit is zero, then we are ‘flatlining’ and there is no point in continuing with theanalysis.Ifhoweverwecanrejectthenull,thenweacceptthealternativewhichisthatbetaisnotzero.Itmightbenegativeorpositive,thatdoesn’tmatter:whatmattersiswhether it is zero or not. Note that we use a Greek letterwhendiscussingapopulationparameterwhichwillprobablyneverbeaccuratelyknown.Weuse theLatin letter ‘b’ when we have estimated it using a sample drawn from thepopulation.

Exceldoesallthetestingofstatisticalsignificanceforusintwoways.

First,itteststheindividualvariablesandprovidesapvalue.Belowistheresultfromthefirsttruckingregressionwedid.Notethatthepvaluefordistanceis0.004.Itiscustomarytouseacut-offof0.05.Themeaningofthisresultisthat,followingtherejectionrule ,werejectthenullifthepvalueissmallerthan0.05.Because0.004issmallerthan0.05,werejectthenullwe therefore conclude that there is in factastatisticallysignificantslope.Insummary,wetestedthehypothesis that the slopewaszero,andrejectedit.Thereforeweacceptthealternativewhichisthatthereisaslopeofsomesort.

Thetruckingregressionoutput

Thistest tellsusthat thereisastatisticallysignificantslope.Itdoesn’ttellusthesignoftheslope(upordown)oritsmagnitude.Butthefactthatthereisasignificantslopeisimportantinformation.Itisworthcarryingonwiththeanalysisbecausethevariablesactuallymeansomething.Bytheway,ignorethepvaluefortheintercept.Itusuallyhasnosubstantivemeaning.

ThesecondwaythatExcelteststhevalidityofthemodelistotestwhetherthemodelasawholeissignificant.ThetestiscalledtheFtestafterthestatisticianRAFisher.Forthetruckingexample,lookundersignificanceF.Theresultisapvalue,inthiscasethesame

pvalueforthevariableDISTANCE.Wehaveonlyoneexplanatoryvariableandsothepvaluewillbethesame.Wehopethatthepvalueissmallerthan0.05,whichenablesustoconcludethatatleastoneoftheslopesisnon-zero.

The section above examined the first of the tests, which was for the statisticalsignificanceofthemodel.Nowwemoveontothesecondarea.

7.2Thestandarderrorofthemodel

ThestandarderrorinExcelisfoundjustundertheadjustedr-squaredoutput.Itistheestimatedstandarddeviationoftheamountofthedependentvariablewhichisnotexplainablebythemodel.Inotherwords,itisthestandarddeviationoftheresiduals.

Undertheassumptionthatthemodeliscorrect,itisthelowerboundonthestandarddeviationofanyofthemodel’sforecasterrors.Thefigurebelowshowstheoutputofregressionestimatingriskofastroke(multipliedby100)againstbloodpressureandage,withadummyvariableforsmokingornot.

Exceloutputforstrokelikelihood

Thestandarderroris6(roundedup).Foranormally-distributedvariable,95%ofthe observations will be within two standard deviations of the mean. For ourpurposesthatmeansthat2x6=12shouldbeaddedorsubtractedtothepredictedvaluetofindaconfidenceintervalforpredictions.Theestimatedregressionequationfromthestrokemodelaboveis:

y =− 91+1 . 08 Age+0 . 25Pressure+8 . 74 Smokedummy

Animaginarypersonwhosmokes,isaged68andhasa bloodpressureof175willhaveariskof34(allfiguresrounded).Wecanbe95%confidentthatthisestimatewillfallsomewherebetween34-12and34+12or22to46.Theseareratherwideconfidenceintervalsandsowemightwanttoworkatimprovingthemodelbyforexampleincreasingthesamplesizeoraddingmoreexplanatoryvariables.

7.3Testingtheleastsquaresassumptions

Therearetwokeyassumptionsthattheleastsquaresmethodreliesonwhichweneedtocheck.Aftercheckingwe’llworkthroughwaysofretrievingthesituationshouldtheassumptionsbeviolated.Theassumptionsare:

• Therelationshipbetweenthedependentandtheindependentvariablesislinear.Fortunatelythisiseasytocheckandalsotofixifitisn’tsatisfied.

• The residualshaveanon-constantvariance. (You’ll some- timesseeinotherstatisticstextbooksotherstricterrequire-ments,suchastheresidualsbeingnormallydistributedandwithameanofzero.Formostpurposesjustchecking for non-constant variance is enough). What doesnon-constantvariance mean? We’ll work through this by defining a resid- ual andidentifyingwhetherornottheassumptionhasbeenmet.Andfinallywhattodoaboutit.Butfirst,checkingforlinearity.

Checkingfor linearity

Therelationshipbetweenthedependentvariableandtheindepen- dentvariableisassumedtobelinear.Thisisimportantbecausethemodelgivesusarateofchange:thecoefficientshowsthechangeinthedependentvariableforaone-unitchangeintheindependentvariable.Iftherelationshipisnon-linear,thatcoefficientwill notbevalidinsomeportionsoftherangeofthevariables.

Wewanttoseeastraightline(eitherupordown)onascatterplot.Putthedependentvariable on the vertical y axis, and one of the independent variables on thehorizontal x axis. If you include the trendline,youcanobservehowclosely theobservedvaluesmatchthepredicted.

Wecanalsocheckforlinearitybyplottingthepredictedvaluesand theobservedvalues.Theplotbelowdoesthisforriskofstroke:

Predictedagainstobservedrisk

Thisisareasonableresult,indicatingthatthelinearityconditionismet.Mostofthepointsareinastraightline,althoughIwould be concerned about the lower risklevels,especiallyaroundrisk=20.Thereappearstobeconsiderablevarianceatthispoint.

7.4Checkingtheresiduals

Aresidualisthedifferencebetweentheobservedvalueofyforanygivenxvalue,andthepredictedvalueofyforthatsamexvalue.Itisthereforeapredictionerror,andisgiventhenotationefortheGreekletter‘epsilon’.Itistheverticaldistancebetweentheactualandpredictedvaluesofthedependentvariableforthesamevalueoftheindependentvariable.Wediscussedthisinthepreviouschapter inconnectionwiththeleastsquaresmethod,wherewewrotetheestimationequation:

y = b 0+b 1 x

Thehatony,thedependentvariable,indicatesthatitisanestimateofthedependentvariable.Weknowthattheestimatecannotbecorrectunlessallthepredictedvaluesandtheobservedvaluesmatchupexactly.Iftheydon’t,thenthereareresiduals.Ifwecalltheerrorsε(epsilon)whichistheGreeklettermatchingourlettere,thenwecanrewritetheregressionequationas:

y=β o+β 1x+ε

Theerrorshavenowbeenabsorbedintoepsilonandsowecanremovethehatfromy.Itisthebehaviorofepsilonthatisofinterest, becausetheleast-squaresmethodrestsontheassumptionthatthe errors absorbed into epsilon have a non-constantvariance. Thismeansthattherenorelationshipbetweenthesizeoftheerrortermandthesizeoftheindependentvariable.Therefore,ifweplottedtheerrorsagainsttheindependentvariables,weshouldseenoclearpattern.Howtodo this isdescribedbelow.

7.5Constructingastandardizedresidualsplot

Excelprovideswhatiscallsstandardresidualsaspartofitsregres-sionoutput,andwewillusethese.Notehoweverthatthesearenot‘true’standardizedresiduals,buttheyareprobablycloseenough.MakesureyouchecktheStandardizedResidualsboxwhensettingupyourregression.

Checkthestandardizedresidualsbox

Wewant toplot thestandardresiduals against the predicted value yhat.Wewillcreateanewcolumn to the left of the columnStandard Residuals, and copy thecolumnofpredicted times into that new column. Then create a scatter plot ofpredictedtimeandstandard residuals.Youshould end upwith the image below.Standardizedresidualplot¹

¹https://www.youtube.com/watch?v=1wuyZfB39p4

https://www.youtube.com/watch?v=1wuyZfB39p4



Standardresiduals

TheYaxisindicatesthenumberofstandarddeviationsthateachresidualisawayfromthemean,plottedagainstthepredictedvaluesonthehorizontalaxis.Noneoftheresidualsaremorethat2standarddeviationsawayfromthemeanofzero,sotheresultsaregenerallysatisfactory,althoughthereisoneobservationinexcessof1.5.Westill have a worrying fan shape which would seem to indicate non-constantvariance:wecanobserveapattern.

Let’sruntheregressionagain,butnowincluding the second variable,which isdeliveries.Theresidualsplotisobtainedinthesameway,andhereistheresult.

Standardizedresidualswithtwovariables

Thefanproblemseemstohaveimprovedalittle.Ther-squaredof themodelwithtwovariableswashigher,meaningthattheresidualsweresmaller (because there islessunexplainederror).

7.6Correctingwhenanassumptionisviolated

Above,weexaminedpossibleviolationsoftwooftheassumptionsunderlyinglinearregression:linearityandnon-constantvariance.Nowlet’slookatwhattodowhenthese assumptions are found to havebeenviolated.Firstsomegoodnews: linearregressionisquiterobusttosuchviolations,andevensotheyarequiteeasytocorrectfor.We’ll deal with the problems in the same order: linearity and non-constantvariance.

7.7Lackoflinearity

The firstcheck is to lookat a scatter plot of the dependent variable against theindependentvariable.Theplotbelowshows volume of sales and length of timeemployed,togetherwithatrendline.

Acurvilinearrelationship

Thereareseveraldifferentwaysinwhichalinearrelationshipcanbeachievedsothat we can use the least squares method. A common method is to include aquadraticterm.Thismeansadding thesquaredvalueoftheindependentvariabletothelistofindependent variables.This iseasilydonebycreatinganothercolumn,consistingofsquaredvaluesofthefirstvariable.

Anewcolumnoftheindependentvariablesquared

Todothis,createa newcolumnand then label it.Clickon the first value in thevariablewhichyouwanttosquare,andaddˆ2.Enter. Then drag downwards. Theplotbelowshowsthepredicted andobservedsales.Themodelisclearlysuperior.

Predictedandobservedafterinclusionofquadraticterm

Ifthedependentvariablehasalargenumberoflowvalues,andisheavilyskewedtotheright,thenagoodsolutionistotransformthedependentvariableintoitsnaturallogarithm.

7.8Whatelsecouldpossiblygowrong?

Regressionisaverycommonlyusedanalyticaltoolandyouaremostlikelyeithergoingtouseityourselforexaminetheworkofotherswhohaveusedregression.Below is a list of common mistakes to watch for. And if I have made themsomewhereinthisbook,I’msureyou’llbethefirsttoletmeknow!

7.9Linearitycondition

Regressionassumesthattherelationshipbetweenthevariablesis linear: the trendlinethatthesoftwaretriestofindtominimizethesquareddifferencebetweentheobservedandpredictedvaluesisstraight.Soifyoutrytorunaregressiononnon-lineardata,you’llgetaresultbutitwillbemeaningless.

Action :alwaysvisualizethedatabeforedoinganyanalysis.Ifyouseethatthedataisnon-linear,youmaybeabletotransformitusing the techniquesdescribed in thepreviouschapter.Asanexample: thecurvilinearexamplewhichwastransformedbyincludingaquadratictermasanexplanatoryvariable.

7.10Correlationandcausation

Regressionisaspecialcaseofcorrelation,andasweallknowcorrelationdoesn’tmeancausation(seetheGlossary).Inregression,nomatterhowgoodthemodel,allthat we have been able to show is that a change in an explanatory variable isassociatedbyachangeinthedependentvariable.Forexample,yourecordthehoursyouputintostudyingandyourgrades.Surprise!Morestudying=bettergrades?Orperhapsnot….couldhavebeenabetter instructor.Wecannotsaythatonecausedtheother.Sowhenwritingupresultsorinterpretingthoseofothers,beverycarefulnottoclaimmorethanyouareableto.

There is a related problem which is ‘reverse causation’. You study more, yourgradesgoup.Temptingtothinkthatonepossiblycausedtheother.Butperhapsitistheotherwayround?Yourgradeswerepoor,youstudiedharder?Oryouhadgoodgradesandthenledyouintotakingthecourseseriously.

Action : try not to include explanatory variables which are affected by thedependentvariable.Youcouldalsotryto‘lag’onevariable.Cutandpastethehoursofstudyingvariablesothatitisonetime-periodbehindthegradesandthenruntheregressionagain.

7.11Omittedvariablebias

Under Correlation in the Glossary I give some examples of lurking variables.Regression, like correlation, is susceptible to the same problem.Example: younoticethatinhotterweathertherearemore

deathsbydrowning.Didthehotterweathercausethedrownings?Wellno,theextraswimmingcausedbytheheatpresentedmoreriskscenarios.Thelurkingvariableishoursspentswimmingratherthantemperature.Trytogettotherealvariableifyoucan.

7.12Multicollinearity

Whenpredicting the price of a house, square footage and number of roomsarelikelytobehighlycorrelatedbecausetheyarebothex-plainingthesamething.Thisproblemresultsinunusualbehaviorintheregressionmodel.

Action : correlate your explanatory variables BEFORE doing any regression.Watchoutifyouhaveapairwhichhasanrvalueof

0.7orhigher.Thisdoesn’tmean thatyou shouldn’tuse them, but it’s ared flag.Thisiswhatmighthappen.Youtakeoutononethevariablesina regression, theonethat’sleftreversesitssignorsuddenlybecomesstatisticallyinsignificant.

Ifyoudoruntheregressionandyougetanunlikelyresult,choosethevariablewiththehighesttvalue(orsmallestpvalue)andditchtheother.

7.13Don’textrapolate

Thecoefficientsfromtheregressionarecalculatedbasedonthedatayouprovided.Ifyoutrytopredictforavaluebeyondtherangeof thatdata, the results will beunreliableifnottotallywrong.

In theyearsofexperienceandsalaryexample inChapter5, thecoefficient foryearsofexperienceprovidedthechangeinsalarythatanextrayearofexperiencewouldgive.Wouldyoufeelcomfortablepredictingthesalaryofsomeonewith95yearsofexperience?

Action :beforerunningthenumbers,checkthattheinputsarewithintherangeforwhichyoucalculatedthemodel.

8.TimeSeries IntroductionandSmoothingMethods

Atimeseriesisasetofobservationsonthesamevariablemeasured atconsistentintervalsoftime.Thevariableofinterestmightbemonthlysalesvolume,websitehitsoranyotherdataofrelevance to thebusiness.Using timeseriesanalysiswecan detect patterns and trends and–just possibly–make forecasts. Forecasting istrickybecause (ofcourse!)weonlyhavehistoricaldata togoonand there isnoassurance that the same pattern will repeat itself. Despite all this, time seriesanalysishasdevelopedintoahugetopic, fortunatelywithalargenumberoffreelyavailabledata-setstouse.Linkstosomeofthesedata-setsareprovidedattheendofthedatachapter(Chapter3).

There are two basic approaches: the smoothing approach and the regressionmethod .Bothmethodsattempt to eliminate the background noise. This chaptercoverssmoothingmethods, thefollowingchaptertheregressionapproach.

8.1Layoutofthechapter

Herewewill

1.identifythefourcomponentsthatmakeupatimeseries

2.introducethenaive,movingaveragemethodsandexponen-tiallyweightedsmoothingmethodstomakeaforecast

3.discusswaysinwhichtheaccuracyoftheforecastscanbecalculated

76

8.2TimeSeriesComponents

There are four potential components of a time series. One or more of thecomponents may co-exist. The components are: trend component; seasonalcomponent;cyclicalcomponent;randomcomponent.

•trendcomponent :theoverallpatternapparentinatimeseries.Theplotbelow shows French military expenditure as a percentage of GDP. Thedownwardtrendisapparent.

FrenchmilitaryexpenditureasapercentageofGDP

• seasonalcomponent :salesofskisorChristmasdecorationsareclearlyhigherinthewintermonths,whileicecreamandsuntancreammovemorequicklyinthesummer.Ifweknow the seasonal component, thenwe canuse this information for prediction. The time between the peaks of aseasonalcomponentisknownastheperiod .Noteherethatseasondoesn’tnecessarilymeantheseasonoftheyear.

TheplotbelowshowssalesofTVsetsbyquarter.HerethereisaconstantupwardtrendbecausemorepeoplearebuyingTV sets,butthereisalsoaseasonaleffect.

TVSalesbyquarter

•cyclicalcomponent :sometimeserieshaveperiods last-inglongerthanone year.Business cycles such as the50- year KondratieffWave are anexample.Thesearealmost impossibletomodel,but that doesn’t stop thehopefulfromtrying.KondratieffhimselfendedupavictimofJosephStalinbecausehissuggestionthatcapitalisteconomieswentinwavesunderminedStalin’sviewthatcapitalismwasdoomed,andhewasexecutedin1938.

•randomcomponent :therandomcomponentisjustthat:thenoiseorbitsleftoverafterwehaveaccountedforeverythingelse.Thetell-talesignsofarandom component are: a constant mean; no systematic pattern ofobservations;constantlevelofvariation.

8.3Whichmethodtouse?

In the next two chapters, we’ll work through the two basic approaches: thesmoothingapproachandtheregression approach.Howtodecidewhichtouse?

Firststepasusualis tomakeasimplescatterplotofyourdata,with timeon thehorizontalxaxisandthevariableofinterestontheverticalyaxis.Ifthe result issomethingresemblingastraightline, forexampletheFrenchmilitaryexpenditure,thenuseasmoothingapproach.Ifthereisevidenceofsomeseasonality,suchaswiththeTVsalesdata,thenregressionisthewaytogo.Thisisespeciallythecaseifyouwanttocalculatethesizeoftheseasonaleffect.

8.4Naiveforecastingandmeasuringerror

Forecastingisjustthat:aneducatedguessaboutwhatmighthappenatsomepointinthefuture.Forecastsarebasedonhistoricalevents, andwe are hoping that somepatternofbehaviorwillrepeatitselfwhichwillmaketheforecast‘true’.Thebestthatwecandoisto tryoutdifferentforecastingmethodsandworkout howwelltheypredicteddatawhichwealreadyknewabout.Thenwetakealeapintothedarkandhopethatthebestofthosemethodswilldoagoodjobondatathatwedon’tknowabout.Thissectionconcernsthemeasurementofforecastinga c c u r a c y .

We’regoingtoconstructanaiveforecastandusethatforecastto demonstratetwocommonmethodsofaccuracychecking:theMean SquaredError(MSE)and theMeanAbsoluteDeviation(MAD)approaches.

Anaiveforecastassumesthatthevalueofthevariableatt+1willbethesameasatt.Thevaluesarejustcarriedforwardbyonetimeperiod.Hereisanexample:

ImageGaspriceswithnaiveforecast

The naive approach is sometimes surprisingly effective, perhaps because of itssimplicity.Ingeneralsimplemodelsperformwellperhapsbecauseoftheirlackofassumptions about the future.Now we will measure the accuracy of the NaiveForecastwithMADandMSE.Inbothmethodstheerroriscalculatedbysubtractingtheforecastorpredictedvaluefromtheobservedvalue.Themethodsdifferinwhatisdonewiththoseerrors.

MADmeasurestheabsolutesizeoftheerrors,sumsthemandthendividesbythenumberofforecasts.Theabsolutevalueoftheerrorisjustitssize,withoutthesign.InExcel, you can find an absolute value with =ABS(F1 - F3) where FI is theobservedvalueandFEthepredictedvalue.Usethelittlecorneroftheformulaboxtodrag it down. Then find the sum and divide by the n, which is the number ofobservations.Inmaththisis

MAD=

∑( abserror )

n

InMSE,theerrorissquaredbeforebeingsummedanddivided.

Squaring the error removes the problem of the negative numbers, but createsanotherone:largeerrorsaregivenmoreweightingbecauseofthesquaring.Thiscandistorttheaccuracyoftheresults.ThemathsfortheMSEisbelow

MSE=

∑( error )2

n

Theplotbelowshowstheerrorsassociatedwiththenaiveforecast-ing,theabsolutevaluesoftheerrorsandtheMAD.InExcel,use

=abs()toconverttoanabsolutevalue.

ImageErrorsandMeanAbsoluteDeviation

8.5Movingaverages

Slightlymorecomplexthanthenaiveapproachisthemovingaverageapproach,whichreliesonthefactthatthemeanhasless

variance than individualobservations.By focusingon themean,we can see thegeneraltrend,lessdistractedbynoise.

Theanalystdecideshowfarbackintimetheaveragewillgo.Thelongerbackintime,thenthemorevaluesoftheobservation are averaged, resulting in a flattermore smoothed result. This is good for observing long-term trends, but lesssatisfactoryifyouareinterestedinmorerecenthistory.

The number of observations which are to be averaged is known as k, and thisnumber is chosenby theanalyst basedonexperience. In Excel’s Data AnalysisToolpakthereisaMovingAveragetoolwhichcandoallthework.Positioningtheoutputisslightlytricky.Therearektimeperiodsbeingusedforthecalculationofthe first forecast;sothefirstforecastshouldbeatk+1becausektimeperiods arebeingusedtofindthefirstvalue.Whereexactlytoputthefirstcelloftheoutputisabitfiddly.Iftherearektimeperiodsbeingused, placethefirstcell at k-1.Excelprovidesachartoutput,examplebelow.ThereisaYouTubehere¹

ImageMovingaveragewithk=3forgasprices

Themovingaveragetechniqueisusedbyinvestorstodetectthe‘goldencross’andthe‘deathcross’.Theyexamine50dayand200daymovingaverages.Whenthe50daymovingaverageis above

¹http://youtu.be/zrioQOWfxjY

http://youtu.be/zrioQOWfxjY

http://youtu.be/zrioQOWfxjY

the200daymovingaveragethispointistheso-calledgoldencrossandasignaltobuy.Bycontrast,whentheoppositehappensthisis the‘deathcross’…timetogetout!

8.6Exponentiallyweightedmovingaverages

The exponential smoothingmethod (EWMA) is a further step forward from thenaiveandthemovingaverageapproaches.Themovingaverageforgetsdataolderthanthek timeperiodsspecified, while the EWMA incorporates both the mostrecentandmorehistoricalobservationstoconstructaforecast.IntheGlossaryyou’llfindmoredetailedmathshowinghowtheEWMAbringsforwardallpasthistory.Ingeneralthefurtherbackintimeyouget,thelessinfluencetheobservationshave.

UsingEWMAwechooseasmoothingconstant,alpha,whichsetstheweightgivento themost recent observation againstprevious forecasts. Ifwe thought that theforecast(t+1)wouldbesimilartothecurrentobservation(t=0)andwethoughtthatolderforecastswereof littlevalue, thenwewouldchooseahighvalueofalpha.Alpharunsfrom0to1.TheEWMAformulais

F t +1=αY t+(1−α ) F t

where F is the forecast, andY is the actual value of the variable. alpha is thesmoothingconstant,ranginginvaluebetween0and1,andchosenbytheanalyst.

Inwords, this equationmeans that the forecast value (at t+1) is the smoothingconstantalphamultiplyingtheactualvalueat t=0plus(1-alpha)timestheforecastatt=0.Soifthepreviousforecastwastotallycorrect,theerrorwouldbezeroandtheforecastwouldbeexactlywhatwehavetoday.Itturnsoutthattheexponential

smoothingforecastforanyperiodisconstructedfromaweightedaverageofallthepreviousactualvaluesofthetimeseries.

Theequationaboveshowsthatthesizeofalpha,thesmoothing constant,controlsthebalancebetweenweightinggiventothemostpreviousobservationandpreviousobservations.Ifthealpha issmall,thentheamountofweightgiventoYat t=0issmallandtheweightgiventopreviousobservationsislargeandvice-versa.Ifalpha=1,thenpreviousobservationsaregivennoweightatall,andweassumethat thefutureisthesameasthepast.Thisachievesthe same result as naive forecastingdiscussedabove.Typically quitesmallvalueofalphaareused,suchas0.1.

##ApplicationofExcel’sexponentialsmoothingtool

An exponential smoothing tool is available in the Analysis ToolPak. We’lluseCanadian Gross Domestic Product per capita as an example. Here is the dataplottedonitsown.Youtube²

CanadianGDPPerCapita

Thereisasteadyupwardstrend,butwithadipin2009duetotheworldfinancialcrisis.Excelasksforadampingfactor.Thisis1-alpha.Ihavedonetheforecaststwice,oncewithanalphaof0.1

²http://youtu.be/w-GmxjrX1qg

http://youtu.be/w-GmxjrX1qg

http://youtu.be/w-GmxjrX1qg

andagainwithanalphaof0.9.Thereforethedampingfactorswere

0.9and0.1.Theresultforalpha=0.9isshownbelow.

Forecastwithalpha=0.9

This is pretty good forecast, with the forecast tracking the actual observationsclosely.

Theplotbelowshowssheepnumbersascountsbyhead inCanada from1961to2006.First,let’splotthisusing0.3asalpha(arbitrarilychosen).Theplotisbelow,withadampingfactorof1-0.3=0.7.

Sheepnumberswithalpha=0.3

9.TimeSeriesRegressionMethodsThesmoothingmethodsweworkedinthepreviouschapterarefinewhenthereisnoevidenceofseasonality,forexamplewith the sheepdata.However,smoothinghas only limited use for longer term prediction and analysis. Data that is moreinterestingforusas analystsmaybenon-linear in trend, andalso have seasonalpeaks and troughs. Using regression, we canmeasure the quantitative effect ofseasonalityandtrend,eithertogetherorseparately.Theresultisamodelwhichcanbeusedforprediction.

Belowwewill:

1.useregressiontoquantifyatrendinatimeseries

2.introduceaquadratictermtoaccountfornon-linearityinthetimeseries

3.usedummyvariablestomeasureseasonality

9.1Quantifyingalineartrendinatimeseriesusingregression

TheplotbelowshowslifeexpectancyatbirthforCanadians.

87

Canadianlifeexpectancy

This is pretty much a straight line as one might expect in adevelopedcountrywithalargepercapitaexpenditureon healthcare.Notethatthisisforbothsexes:forwomenonlywemightexpecttoseeevenbetterfigures.IrantheregressionwithYearastheindependentvariable,explaininglifeexpectancy.Theresultisbelow.

Lifeexpectancyregressionresults

NoticethattheadjustedR-squaredvalueisclosetounity,reflecting

thestraightlinethatweseeonthegraph.Thecoefficients fortheinterceptandtheindependentvariableprovidethisestimatedregressionequation

y t=58 . 47+0 . 000565∗YearThemeaning is that for everyyear after 1960 that a childwas born, his/her lifeexpectancyincreasedby0.000565years,orabout5hours.

9.2Measuringseasonality

ThedatasetforTVsales(fromModernBusinessStatistics byAndersonSweeneyandWilliams)containssalesbyquarterforfouryears.There are therefore sixteenobservations.

IcreatedacolumnwhichIcalledindexsothatwecanseesales ina consecutivefashion.ThefirstquarterisSpringandthelastquarter iswinter.Itisapparentthatquarterswhicharedivisiblebyfourarehigherthanothers,andsoforth.Perhapssalesarehigherinwinter?

WecanusethedummyvariablemethodofChapter5todeterminewhetherthisistrueandalsotheextentofthedifference.WewillcreatedummiesforSummer,FallandWinter,leavingSpringasthereferencelevel.Recallthatwehavefourpossiblestatesoftheseasonvariable,andsowewillneedk-1,or4-1=3dummies.Itusuallydoesn’tmatterwhich state you choose as your reference level: just don’t forgetwhichoneyoupicked.Thedataset is below,with a columncalled Index, whichwe’llusetomeasurethetimetrend.

TVSaleswiththequarterlydummies

Iwanttofindouttwothings:

whethersalesareincreasingovertime.Icandothisbyincludingtheindexasanexplanatoryvariable.Becausetheregressiontool requiresthatalltheindependentvariablebeinoneblock,IhavecopiedandpastedtheIndexcolumntotheright.

Theregressionoutput is

TVSalesregressionout

Let’swalkthrough this output line by line. First notice that the p values for allindependentvariablesaresmallerthan0.05.Thereforeallaresignificant.

ThereferencelevelisSpring,andthereforethereisnocoefficientfor this quarter.Summer has a negative sign in front of its coefficient, meaning that sales forSummerare smaller than those forSpring.Fallispositive,somoreTV sales aresold in the Fall than in the Spring. Not unexpectedly, Winter has the largestcoefficientofall,

reflectingtheplot.Theindexhasapositivesign,meaningthataverageTVsalesareincreasingwithtime.

Therearethereforetwocomponentsinthisseries:atrendandaseasonalcomponent.Youtube¹

**Makingpredictions**fromtheseresults.Let’spredictthesalesforFallinsevenquarterstime.Thisis

y =4 . 7+1 . 05875+7∗0 . 147=6 . 78

Theobservedvaluewas6.8,sothepredictionwasreasonableifslightly pessimistic.

http://youtu.be/wyKIHInMbY8

9.3Curvilineardata

Thedataabovefollowedalineartrend,whichisperfectforordinary least squares,which assumes that the relationship between the dependent variable and theindependent variable is linear. But some interesting data does not follow a neatlineartrend.Theplotbelowshowscrudeoilandgaspricesovertheperiod1949to2003.

¹http://youtu.be/wyKIHInMbY8

http://youtu.be/wyKIHInMbY8

Crudeandgaspricesshowingcrudepricecurvilinearity

However,ifweshowthetwoseriesonthesamegraph,asontheplotIdidinTableau,withseparateyaxesforeachtypeoffuel,wecanseethattheshapesareveryclose.

Crudeandgasondifferentaxes

Thecurvilinearityofcrudepricesisclear.Thepricescametoapeak

inthe1970sasaresultofOPEC’sdecisiontorestrictsupply. Inany event,crudepricescannot be described as linear. The solution is to add an extra term to theregressionof price on time, and that is time squared. The first few lines of thedatasetarebelow.Ihaveaddedacolumncalledtorepresenttheyear,andaddedtsqwhichisjusttsquared.

Datawithtimesquared

Nowregresscrudeagainsttime and time squared, and the pleasing result belowappears

Thecurvilinearregressionforcrude

The adjusted r-squared is high at 0.88 and the two time predictors are highlysignificantstatistically.Noticethatthetwopredictorsarequitedifferentinsizeandalsohaveoppositesigns.Whentissmall,thenthevariabletdominatesandthetrendis upward. However, as t gets larger, then t-squared gets even larger still. Thenegativesignont-squaredpullstheregressionlinedown.

10.OptimizationLinearProgramming (LP) is the tool we use to optimize a particular objectivefunction .Forexample,amanufacturerofcarpetswantstogetthemostprofitoutof his rawmaterials and labor force. The objective function is the relationshipbetween the inputs and his costs and thus his profits. Optimization helps themanufacturerfindthemostefficientallocationofresources.Theobjectivefunctiondoesn’tnecessarilyhavetobeintermsofmoney;itcouldverywellbetoallocateworkinghoursefficientlysothateveryonehasmore timeoff.Frequentlywewanttomaximize theobjective function(perhapsmakemoreprofit)butwemightalsowanttominimizeanobjectivefunction,suchascosts.

Optimizationisnotparticularlydifficult,andwewillusetheSolveradd-intoExcelto do themathematically tricky parts.What is important is thewriting up of acorrectmodelinthefirstplace.

Layoutofthechapter

Linearprogramming isnotespeciallydifficult,but theworkhas to bedone in alogicalandorderlyfashion.Inthischapter,wewill

*spendsometimeworking throughthebasicsteps in setting up an optimizationproblem *work through themeaning of the Solver output in termsofwhat thecoefficientvaluesactuallymean

10.1Howlinearprogrammingworks

Linearprogrammingworksbysolvingasetofsimultaneousequa-tions.Theproblemtobesolved—tomaximizeastockreturnfor

95

example—iswrittenasasetofsimultaneousequations.Theequa- tions may bequitesimple,buttheremaybemanyofthem.LinearProgrammingfindsthebestsolutiontotheequations.Thejobistofindthecoefficientsforthevariablesintheequationwhichmaximizeorminimizetheoutcome.

We’llworkthroughanexamplefirstonpaper,andthenputit intoExcel’sSolveradd-intogetthesolution.Thethreeimportantstepsare:

•Writethesimultaneousequations,basicallyputtingintomaththewordingoftheproblem.Thisisprobablythetoughestpart.

•RuntheOptimizationinSolver,tofindtheoptimalsolution.

• Sensitivity analysis to find out how much effect a unit change in aconstraintwouldhave.

10.2Settingupanoptimizationproblem

Optimizationproblemsareusuallypresentedaswrittenquestions.Thebestapproachistodeterminewhicharethe

•decisionvariablesthosevariableswhichyoucancontrol.Forexample,inthecarpetfactoryexample,thenumberofhoursgiventoeachworkeris adecisionvariable.

Itisusuallythesizeofthedecisionvariablethatwearetryingtooptimize.Fortheworkers,wewouldwantthemtohaveexactlythenumberofhourswhichproducesthe most profitable output. Too few hours and the factory is working belowcapacity.On theother hand, toomanyandmoneyisbeingwasted.The decisionvariablesgointowhatSolvercallsthechangingcells.Theconventionisto color-codethesecellsinExcelinredcolor.

• constraints are constants or givens which we cannot change. Thejurisdictionwhere thefactory is locatedmighthave legislation restrictingthemaximumnumber of hours that an employee canwork per day.Wewouldneedtoadd aconstrainingequationtotakeaccountofthisfact,evenifthesolutionwasfinanciallysub-optimal.

10.3Exampleofmodeldevelopment.

Afarmerhas50acresoflandathis/herdisposal.Healsohasup to 150hours oflabor and up to 200 tonnes of fertilizer.He can plant either cotton or corn or amixture.Cottonproducesaprofitof$400peracre,corn$200peracre.Eachacreofcottonrequires5laborhoursand6tonnesoffertilizer.Forcorntheequivalentis3and2.Howmuchlandshouldthefarmerallocatetoeachcrop?

Let’scallacresofcottonxandacresofcorny.FforfertilizerandLforlabor.Thesefourvariablesarehisdecisionvariablesbecausehecanallocatecropstolandandhecanalsodecidehowmuchlabor and fertilizer to deploy.He is searching for thecombinationofthefourdecisionvariableswhichmaximizeshisprofit.

Weknowhehas50acres,sothetotalacreagecannotexceedthisamount.Thisisaconstraint,whichcanbewrittenasanequation,asshownbelow.Otherconstraintsarethemaximumlaborandfertilizer.Obviouslyhecannotusemorethanhehas.

10.4Writingtheconstraintequations

Thetotalacreagedevotedtoeachcropcannotexceed50:

x+y≤50

Thelaborandfertilizerarealsoconstraints.Referbackto thequestiontogettheamountsoflaborandfertilizerperacrepercrop.

Forlabor,theconstraintequationis:

5 x+3 y≤150

Thisequationcomesaboutbecause for each acre of cotton, the farmer needs 5hours of labor,while for corn it is 3.Wedonotknowthevaluesofxandy,butwhatevertheyare,weknowthat 5xplus3ymust be less than or equal to 150,becauseweonlyhave150hoursavailable.

Forfertilizer,theconstraintequationis:

6 x+2 y≤200

10.5Writingtheobjectivefunction

Whatdowewanttogetoutofthis:whatistheobjective?Clearly it is the mostefficientmixtureofinputs,subjecttoconstraints,producingthelargestprofit.Theobjectivefunctionis:

MaxProfit=400x+200y

Inwords,findthecombinationofxandythatmaximizestheprofit, subject totheconstraintswehavewritten.ThenexttaskistoputallthisintoSolver.

10.6OptimizationinExcel(withtheSolveradd-in)

WewilluseanExcelspreadsheetandSolvertoachieveallthreesteps.

OptimizationYouTubeforthefarmerproblem¹

¹https://www.youtube.com/watch?v=WeTgK6wmvSY

https://www.youtube.com/watch?v=WeTgK6wmvSY

https://www.youtube.com/watch?v=WeTgK6wmvSY

OpenanExcelspreadsheetsothatyoucanfollowalong.

Cellcoloringconventions

Isuggestyoufollowtheseconventionsforcolor-codingthecells:

•Inputcells.Thesecontainallthenumericdatagiveninthestatementoftheproblem.ColorinputcellsinBLUE.

•Changingcells.Thevaluesinthesecellschangetooptimizetheobjective.CodechangingcellsinRED.

•Objectivecell.Onecellcontainsthevalueoftheobjective.ColortheobjectivecellinGREY.

NowI’llworkthroughthefarmingproblemabovestepbystep.

MakesureyouhavetheSolveradd-inloaded.Tocheck,openExcel, thenclickontheDataTab.IfSolverisloaded,youwillseetheSolvernametotheright.

TheSolvertab

BelowistheExcelspreadsheetwiththeinformationthatweknowalreadytypedin.Ihaveputrandomvaluesinthe red-coloredchangingcells,justasplaceholders.Thesenumberswill changewhenExcelsolvesforthemostprofitableallocation.

•TypetheaddressoftheobjectiveintotheSolverdialoguebox,andmakesuretheMaxradiobuttonisselected.

•Typeintherangeofthechangingcells.

•Workthroughtheconstraints.Therearethree:labor;fertil-izer;andland.

•SelectSimplexLPastheSolvingmethod.

•Presssolve.

Thefarmingsolution

Solverwillchangethevaluesinthechangingcellstomaximizetheobjectivecell.Ifoundthat30acresofcottonandnoneofcornprovidedaprofitof$12000.

10.7Sensitivityanalysis

Constraintscanbeeitherbindingorslack .Ifaconstraintisbinding,thatmeansthatallofthatparticularresourceisbeingused,andmorecouldbeemployedifitshouldbecomeavailable.Wecanfindoutwhetherconstraintsarebindingandalso theirshadow

pricebypressingSensitivityAnalysisafterrunningSolver again.TheSensitivityanalysiswill appear as a tab at the bottom of your worksheet. For the farmingproblem,itlookslikethis:

Farmingsensitivityreport

Let’sletattheConstraintssection.Laboruses150hoursandhasashadowpriceof$80.Theshadowpricesindicatestheper-unitvalueoftheconstrainedcommodityiftheconstraintwasincreasedbyoneunit.Ifwecouldhaveonemorehouroflabor,thentheprofitwouldincreaseby$80.Landandfertilizerhaveashadowpricesofzerobecausetheconstraint isslackornon-binding.We are not completelyusingtheseresourcesandsowedonotneedanyextrainputs.Therearesacksoffertilizerlyingaroundunused.Noneedtobuymore.

The columnsAllowableIncrease andAllowableDecrease are rele- vantbecausetheytellushowmuchmoreofabindingconstraintwecouldusebeforerunningthemodelagain.Forlabor,theamountis16.67(rounded).IfweeasedtheconstraintbymorethanthisamountthenwewouldneedtorunthenewmodelinSolveragain.Thesamelogicfordecrease.

10.8InfeasibilityandUnboundedness

Solverisquiterobust,buttwoproblemsmayoccur.Asolutionto anoptimizationproblemisfeasibleifitsatisfiesalltheconstraints.Butitispossiblefornofeasiblesolutiontoexist.Thisoccurs ifyoumakeamistake inwriting themodel,or themodelistootightlyconstrained.

Unboundedness

Unboundedness occurs when you have missed out a constraint. There is nomaximum(orminimum).Trychangingallconstraintsto>=insteadof=<.

10.9Workedexamples

Thedemandingmother

Mostmothersarekeentokeepcontactwiththeirchildren.Oneparticularmotherisratherdemanding.Sherequiresatleast 500minutesperweekofcontactwithyou.Thiscanbethroughtele-phone,visitingorletter-writing(yes!Somepeoplestilldothat!).Herweeklyminimumforphoneis200minutes,visiting40minutesandletters200minutes.Youassignacosttotheseactivities.Forphone:$5perminute;visiting$10perminute; letter-writing$20perminute.Howdoyouallocateyour time sothatyourcostistheminimum?

Writetheequationsfirst.Whatistheobjective?

Letxstandforminutesofphone;yminutesofvisiting;and zminutesofletters.Youwanttominimizethecostoftheseactivities, so theobjective is tominimise200x+40y+200z.Theconstraintsare:x+y+zmustbemorethanorequalto500.

Spreadsheetfordemandingmother

Noticethatwewanttominimizethetime,sochangetheradiobuttontominratherthanmax.Andwhenyouare typing in the constraints,checkthatthesignoftheinequalityistherightwayround.

Thetheatermanager

Youarethemanagerofatheaterwhichisinfinancialtrouble.Youhavetooptimizethecombinationofplaysthatyouwillputontomakethemostprofit.Youhavefiveplaysinyourrepertoire,A,B,C,D,E.Theyhavedifferentdrawpoints(appealtothepublic)andthereforeticketpricestomatch.Adrawpointishowattractive theplayistothegeneralpublic.Thedatalookslikethis:

Play Draw Ticket

A 2 20

B 3 20

C 1 20

D 5 35

Play Draw Ticket

E 7 40

Youhavetheseconstraints:youhaveonly40possibleslots.Andnoplaycanbeputonlessthantwiceormorethantentimes(givestheactorsareasonableturn).Eachperformance costs $10 per ticket sold, regardless of the ticket price (covers theelectricityetc).Thetotaldrawpointshastoexceed160tokeepthecriticshappy.

Howdoyoudistributeyourperformances?Whichcombinationofplaysproducesthehighestprofitandsatisfiestheconstraints?

Wearetryingtomaximizeprofit,solookforanobjective function thatdoesjustthat:20A+20B+20C+35D+40E.Butwait—-wehavethe$10cost.Sobettertakethatoutfirst,leaving10A+10B+10C+25D+30E.

Withthese constraints:

Thedrawpoints(theattractivenessoftheplaytocritics):2A+3B+1C+5D+7E>=160

andnoplaycanbeputonmorethantwiceormorethantentimes.A<=2andA<=10B<=2andB<=10C<=2andC<=10D<=2

andD<=10

andwehaveonly40slots,soA+B+C+D+E<=40

Spreadsheetfortheatreproblem

Moreworked examples

19thcenturyfarmer

Iama19thcenturyEnglishfarmer.I cangrowwheatorbarley.Wheatyields10bushelsanacre,barley8bushelsanacre.Thepriceofwheatistenshillingsabushel,barley5shillingsabushel.The labourcostsforwheatare3 shillingsanacre, forbarley2shillingsanacre.Thetransportationcosttomarketforwheatis1shillingabushel,forbarley½shillingabushel.Ihave100acres.(abushelisameasureofvolume;anacreisameasureofarea;ashillingisacurrencyunit).

Questions:HowdoIsplitupmyland?

IfIcouldgetonemoreacreofland,howmuchwouldthatbeworthtome?

Yetanotherfarmingquestion(Iusethesebecausemostpeoplecanimaginefieldsofland,cropsgrowingandthelike.Butofcoursethesame techniquesareapplicabletootherbusinesssituations).

Afarmerhas100acresofland.Hecanplantcropsorraisesheep. Eachhectareofcropsprovides$100butrequireslaborof$50peracre.Hecanspendamaximumof$500oncroplabor.Eachacreofsheepprovides$40butneedsonly$10 in laborcharges.Thelaborbudgetisunlimited.

Worked solution

CallareaincropsXandin sheepY.ThenX+Y=<100 and50X<= 500Theobjectiveis100X-50X+40Y-10Ysimplifiesto50X+30Y.MyExcelspreadsheetisbelow.

Spreadsheetfortheaboveproblem

11.Morecomplexoptimization

Optimization using Solver is a powerful method of solving common businessresource-allocationproblems.Inthepreviouschapter theproblemswererelativelysimpleconcerning theallocationof land ortime.Butlinearprogrammingcanbeusedtosolvemorecomplexandworthwhileproblemsaswe’llseebelow.Thekeyrequirement isthattheanalystisabletodefinetheproblemasasetofequations.Thereisnoonemethodexceptforthinkingcarefullyandwritingouttheproblemasasetofequations.

Todemonstrate,we’llworkthroughthreedifferenttypesofprob-lem:

*problems concerning proportionality: you need to allocate money to differentinvestmentswhileminimizing riskandkeeping returns above a certain amount.Whatproportiondoyouputineachinvestment?

*supplychain problems

*blendingproblemswhereyouneedtomixtogetherinputsfromdifferentsources

11.1Proportionality

Investmentdecisions example

Thisexampleshowshowwecan‘weight’theinputsaccordingtosomecriteria,inthiscasetheirrisk.Wewanttominimizetheriskbutensurethatthereturnisabovesomeminimumlevel.Whatisthemixtureorblendofinvestmentsthatcandothat?

107

The problem: you have an inheritance of $300,000 from an uncle but there aresomerestrictions:youmustinvestallthemoneyinfourfunds;yourannualreturnhastobeatleast5%.Andyoumustminimizeyourrisk.Thefourfundsyourunclehasspecifiedare:

Fund Risk Return

X1 10.7 4.2

X2 5 4

X3 6 5.6

X4 6.2 4

Let’sdealwiththeobjective function first.That is tominimize the risk.Wecanweightthesizeof the investment ineach fundwith its risk index.So,weightingeachinvestmentandthendividingbythe totalinvestmentgivestheamountoftherisk:whichisexactlywhatwewanttominimize.SowewanttominimizeR(forRisk)likethis:

R=10 . 7 X 1+5 X 2+6 X 3+6 . 2 X 4

300 ,000

Ifyoudon’tgetthis,thinkaboutwhatwouldhappenifall thefundswereasriskyasX1:thetotalriskwouldincrease.Anotherwaytothinkofthis:ifwedecreasethenumberofsharesinX1andinsteadincreasethenumberofX4,whatwillhappen?Theriskwilldecrease.

Theconstraints:thetotalinvestmenthastoaddupto$300,000,sothisconstraintis

X 1+X 2+X 3+X 4=300 ,000

Wealsohavetoachieveareturnofatleast5%(noeasymatter these days).Thisconstraintis

4 . 2 X 1+4 X 2+5 . 6 X 3+4 X 4 ≥5

Belowismyspreadsheetfromthisproblem:

Investmentblending

11.2Supplychainproblems

Workingouthowmuchtoshipfromproductioncenterstodemand locations is acommonprobleminsupplychainoptimization.AsIhavebeenstressing,thekeytosolvingproblemsofthistypeistowriteouttheequationswhichdefinethemodel.Hereisanexample:

You runacompanywhichhas bakeries in locationA and B. The bakeries shipcartons of bread to your retail stores at locations X, Y, Z. The bakeries havedifferent capacities and each retail store has different demands. The costs ofdeliverypercartonfromeachbakerytoeachstoreisbelow

Bakery X Y Z Capacity

A 12 13 11 150

B 9 17 17 200

Demand 50 100 90

Notation:adeliveryfrombakeryAtolocationXisAxandsoforth.

Theobjectivefunctionistominimizethedeliverycosts.Sowewant tominimize:12Ax+13Ay+11Az+9Bx+17By+17Bz

Eachbakeryhasafixedcapacity,soAx+Ay+Az<=150andBx+By+Bz<=200

Supplyhastoexactlymatchdemand

Ax+Bx=50Ay+By=100Az+Bz=90

Noticethestrictequalitysign.Myresultsarebelow:

Distributionproblem

11.3Blendingproblems

Abovewediscussedlinearprogrammingmodelswhichwere simplebuteffective.There are other types of Optimization model which are helpful, especiallyBlendingModels,whichcanalsobesolvedbylinearprogramming.

Blendingmodelsareusedinsituationswherewehavetwoormoreinputswhichhave tobemixed to some formula. Through Optimizationwecanfind themostprofitablemixture.Wine,metals,oil,sausages,recycledpaper—-thisisapowerfultechnique. Youcouldprobablyuseitformarketingcampaigns.We’llworkthoughanexamplewhichisforoil.

Theoilblendingproblem

Theproblem:anoilcompanyhas15000barrelsofCrudeoil1and20000barrelsofCrudeoil2onhand.Thecompanysellsgasolineandheatingoil.TheseproductsaremadebyblendingtogetherCrudeoil1andCrudeoil2.EachbarrelofCrudeoil1hasaqualitylevelof10,andeachbarrelofCrudeoil2hasaqualitylevelof5.Thegasolinethatweproducemusthaveaqualitylevelofatleast8.Theheatingoilmusthaveaqualitylevelofatleast6.Gasolinesellsfor$75abarrel,heatingoilfor$60.Howcanweblend theoils together in such a way that meets minimum qualityrequirementsandmaximizesprofit?

Oilblendingsolution

First,let’sthinkthroughwhatthedecisionvariables(whatgoesintothechangingcells)mightbe.Youmight very well think (as I did first off) that the decisionvariableswouldbetheamountsofthetwooilsusedandtheamountsproduced.Butthisisn’tenough:wehavetoblendtogetherthetwotypesofoil.Theyhavetobemixedbeforetheycanbesold,andthemixturehastoreachsomeminimumqualitystandard.Thecompanyneedsablendingplan.

Theinputs :

selling prices (here gasoline = $75, heating oil = $60) availability of oil fromsuppliersqualitylevelofcrudeoils:Crude1is10andCrude2is5

Theconstraints

Gasolinequality>=8Heatingoilquality>=6QuantityofCrude1

=15000QuantityofCrude2=20000

Theblendingplan :

Gasolinehastohaveaminimumqualityofatleast8,andheating oilmusthaveaminimumqualityofatleast6.TheCrudeoil1we

haveonhandhasaqualitylevelof10whileCrudeoil2hasaqualitylevelof5.Wewanttoblendthesetwocrudeoilstobothachievetheminimumqualitystandardsandmakethegreatestprofit.

Let’sattacktheproblembycreatingtotal‘qualitypoints’(QPs)whichrepresentthequalityofoilinabarrelmultipliedbythenumberofbarrelsofthatoil.

Writeequationstocalculatethequalitypoints:

TotalQPsinthegasoline=10*amountofOil1+5*amountofOil2

Ifforexample,wemixedtogether50barrelsofCrudeoil1and40barrelsofCrudeoil2,thetotalQPswouldbe50x10+40x5=700.Theaverageperbarrelwouldbe700/90=7.78.Thisistoolowforgasoline(needs8)butacceptableforheatingoil.Twopoints:

we could sell the oil as heating oil, but it is exceeding the minimum qualityrequirement.Wecouldmakemoreprofitbyreducing thequalitytotheminimum,orchargeapremium.Butthisisprescrip-tivework,andwehavetoworkwithintheinputsgiventous).

TheonlywaytogettheoiluptogasolinestandardsistoincreasetheamountofCrudeoil1inthemixture.

Ifyoudon’tgetthis,tryathoughtexperiment:forgasoline,iftherewasnooilatallfromOil2,whatwouldbetheQP?Itwouldbe10*thequantityofoilfromOil1.Again,howaboutifweblended together1000barrelsfromeachtypeofoil: howmanyQPswouldbeproduced?Itwouldbe10*1000+5*1000=15000.Readthisthroughagain…itisimportant.

The blending plan provides us with the constraints we need to ensure that theminimumqualitylevelsareachieved.Intheexample justabove,weblended2000barrelstoprovideaQPof15000.Thisisanaverageof7.5.Goodenoughforheatingoil,notgoodenoughforgasoline.

Oilproblemsolution

Discussionofthesolution:notethatgasolinesellsformoremoneythanheatingoil,buttheoptimalsolutionsuggeststhatweshouldsellmoreheatingoilthangasoline.Thisisbecauseoftheconstraintsonquality.

12.Predictingitemsyoucancountonebyone

Why you need to know this . Many business decisions involve counts: eitherbinary(yes/no)orwithintimeorspace.Itwouldbegoodtoknowtheprobabilityofacertainnumberofcustomerscom-inguptoaservicedeskinacertainlengthoftime;ortheprobabilityofacertainnumberofcaraccidentsatagivenintersection.Wearelooking for a discrete probability distribution; discrete because the number ofoccurrencesisaninteger.

Chapter4onregressionshowedhowtopredictthesizeof anoutcomewhichwascontinuous(moneyortimeperhaps).Thedependentvariable—whatweweretryingtopredict—couldtakeonalmostanyvalue.Nowwewanttopredictprobabilitiesfor the occurrence of an independent variable which is an integer. Below we’llwork through some hands-on applications, with the theory available in theGlossary.

We’llbreakthisintotwoparts:

Theprobabilityofa binaryoutcome .Theprobabilitiesoftwofaultyitemsoutofthe next twenty on the production line. Or, the probability of at least fivecustomersoutof thenext fiftyactuallybuyingsomething.Thisisestimatedwiththebinomialdistribution .

The probability of a particular count of occurrences over time or area. TheprobabilityofthreeorfewerpeoplearrivingatyourCustomerServiceDeskwithinthenexthalfhour.Ormorethan twomistakesinthenexttenlinesofcode.ThisisestimatedwiththePoissondistribution .

114

12.1Predictingwiththebinomialdistribution

Afterthenormaldistribution(seetheGlossaryforadefinition), thebinomialisthemostimportantinstatistics.ThemathforthebinomialdistributionisalsodefinedintheGlossary.

The binomial distribution provides the probability of a ‘success’ in a certainnumber of ‘trials’. For example, you can calculate theprobabilityofmore thansevenoutofthenexttwentypeople through thedooractuallybuying something.Herea‘success’issomebodybuyingsomething;whilethenumberof‘trials’isthenumberofpeoplecomingthroughthedoor(hereitistwenty).

Somedefinitions:

Whatwearelookingforistheprobabilityofapre-definednumberof‘successes’.Notethatthedefinitionofsuccessisuptotheanalyst. Itcouldbe ‘spendingmorethan$20’orwearingahat.

Intheexamplewe’llworkthroughbelowyourunastore.Youknowthatthelong-runprobability of somebodybuying something is 0.6. You obtained this number bycountingtotalnumbersofcustomersand,outofthose, thenumberswho actuallyboughtsomething(successes).

Workedexample:Youwanttoknowtheprobabilityofexactlythreepeoplethroughthedooroutofthenextfivebuyingsomething.Note that‘success’justmeansthatthedefinedeventhappens.Whetherornotitisa‘goodthing’isn’trelevant.

ForExcel,theargumentsrequiredare: therandomvariablewhose probabilitywewant to predict; number of trials, the long-run probability, and a true/falsestatement.Intheexample, thenumber of trials is five.The randomvariable (X)whoseprobabilitywewanttopredictis3.Wealsoneedthelong-runprobabilityofasuccess.Intheexample,p=0.6.Wealsowanttheprobabilityofexactly3,sousethefalsestatement.I’llgointothisinsomedetailshortly.

Open an Excel spreadsheet, and type in =BINOM.DIST(3,5,0.6,false) and youshouldget this:0.3456.This is theprobabilityof exactly threepeopleoutof thenextfivebuyingsomething(numberofsuc-cessesoutofthenextfivetrials).NoticetheorderoftheargumentsintheExcelfunction.Andespeciallythelastone,whichintheexampleaboveisfalse.(Thealternativeistrue).Thedifferenceisimportantbecausetheresultsarequitedifferent.

Trueandfalseargument .Definingtheargumentasfalseprovidestheprobabilityof exactly the random variable. Defining the argument as true provides acumulativeprobability.

Excel adds up probabilities from the left, so changing the argu- ments to=BINOM.DIST(3,5,0.6,TRUE)=0.66304isthesumoftheprobabilitiesofX=0+X=1andsoonuptoanincludingX=3.Theprobabilityof0.66304isthereforetheprobabilityofthreeorfewercustomersbuyingsomething.ThetablebelowshowstheprobabilitiesofvariousvaluesofX,both‘false’and‘cumulative’.Asyoucansee, thecumulative is just thecontinuedaddition of eachsuccessiveprobability.Thereisafurtherexamplehere¹

Probabilitiescalculatedbothfalseandtrue

Howaboutmorethanthreecustomersbuyingsomething?Inmathnotation,we’relookingforP(X>=3).Weknowthat probabilitiesmustsumupto1.Weknowthatthreeorfeweris0.66304.Somorethanfivehastobe:1-0.66304=0.33696.

¹https://www.youtube.com/watch?v=oZ1DmNQ8wW4

https://www.youtube.com/watch?v=oZ1DmNQ8wW4

https://www.youtube.com/watch?v=oZ1DmNQ8wW4

Aslightlyharderexample:theprobabilitythatatleastfourcus- tomersoutofthenexttenwillbuysomething?We’relookingfor P(X >=4). Look carefully at thenotation.X>=4impliesthatthedistributionofthetencustomersissplitintotwohalves:lessthanfourandfourormore.Itisthelatterhalfwhoseprobabilitywe’relookingfor.

Ifwecanfindtheprobabilityof0+1+2+3,thatistheprobabilityoflessthanfour(weonlywant integers here). So find that probability and then subtract from 1,makinguseofthefactthatprobabilitiesmustsumupto1.

Let’sdothisstepbystep.

First,findP(X=<3),thatistheprobabilitythatXisthreeorless:

=BINOM.DIST (3,10,0.6,true) = 0.054762.Notice the ‘true’ which givesusthecumulativeprobability.

Wewantfourormore,sosubtractfrom1likethis:1-0.054762=0.945238.

Anotherexample.It’swinterandyouneedtowearasweatereveryday.Youhavetwoblueandthreeredsweaters.Calculatetheprobabilitythatduringtheweekyouwillweararedsweater:

Exactly twiceintheweek

Answer:thelong-runprobabilityofpickingaredsweateris3/5

=0.6becauseyouhavefivesweaters,andthreeofthemarered.Thenumberoftrialsis7becausethereare7daysinaweek.Thewordingof thequestioncontains theword ‘exactly’ whichmeans thatwe don’t want a cumulative answer, so we’llinclude the FALSE argument. Therefore the answer is =binom.dist(2, 7, 0.6,FALSE)=0.077414

Morethanthreetimes.Whenyouseewordssuchas‘morethan’ that’saclue thatyou’relookingforacumulative probability.In math notation, we’re looking forP(X>3). So if we find the cumulative probability up to and including two, andthensub-tractfromone,we’redone.Thecumulativeprobabilityoftwo

or fewer is P(X=0) + P(X=1) + P(X=2). So the answer is = 1 -binom.dist(2,7,0.6,true)=0.903744

Thekeys:

Writeoutwhatyouaretryingtopredictinmathnotation.Thisforcesyoutobeclear.Drawalittlesketch(hopefullybetterthanmine!)ifyougetconfused.

Ifthewordingofyourproblemcontains‘exactly’orrequirestheprobabilityofjustoneparticularoutcome(egP(x=3))thenyouwanttousefalse.

12.2PredictingwiththePoissondistribution

Thebinomialdistributiongaveustheprobabilityofabinaryoutcome(yes/no)outof a certain number of trials. The Poisson gives us the probability of a certainnumberwithinaspecifiedtime-frameorarea.

Here’stheexample.You run theCustomerServiceDesk.Youwant to know theprobabilityoffiveorfewercustomersarrivinginthenexthalfhour.Thatwouldbeusefulforstaffing,wouldn’tit?Youknowthattheonaverage,20customersarriveeveryhalfhour.Youknowthatbecausehavecountedthem.

Wewant:P(X>=5).Likethebinomialabove,thePoissonhasatrue/falseargumentforwhetherwewantcumulativeprobabilitiesor‘exact’.Thenotation include a>sign, which implies cumulative, so we use TRUE. In Excel:=POISSON.DIST(5,20,true).Theanswer is7.19088E-05.This looks a bitweird,butitisjustmathnotation.TheE-05meansthatyoushouldmovethedecimalpoint5placestotheleft.Sotherearefourzeroesinfrontoftheleading7.Inotherwords,anextremelysmallprobability.Perhapssendsome staff out for lunch? It is verysmallbecause5isalongwayfromyourlongrunaverageof20.

13.Choiceunderuncertainty

Virtually all business decisions involve at least some degree of uncertainty. Forexample,youdon’tknow(andpresumably canneverknow)somefuture‘stateofnature’whichmightbeanexchangerateorbusinessrental.Afarmerdoesnotknowtheprice of wheat at harvest time, but has to decide howmuch to plant manymonths ahead. His decision is relatively simple compared to a decision-makerconfrontedwithanopponentwhowill reacttotakeadvantageofwhateverdecisionyoudomake.Thefirstsituation iscalled‘non-strategic’and iseasilyhandledbytheexpectedmonetaryvaluemethodwhichwe’llworkthroughinthechapter.Thesecond‘strategic’situationismorecomplicatedbutcanbesolvedbyanapplicationofgametheory.We’llcovertheEMVmethodinsomedetailbelow.Gametheoryisanenormoussubjectandwedon’tcoverithere,butI’llgiveanexampleattheendofthechapteralongwithsomerecommendedreading.

13.1Influencediagrams

Aninfluencediagramisasketchoftheinfluencesandoutcomesofadecision.Theinfluencediagramallowsustothinkthrough anddepict the forces that influenceourchoiceswithouthavingtoassignprobabilities.Itiscustomarytouseovalshapesforvariableswhichareuncertain,aboxforadecisionnode;andlozengesfortheoutcome.

119

VacationInfluencediagram

Youareanoutdoors-typeandsothechoiceofactivityisaffectedbytheweather.Theweatherforecastisaffectedbytheweathercondition (onehopes!).Butyoudon’tknowtheactualweathercondition,justtheforecast(whichiswhyitisaforecast).Youchooseyouractivity(box)basedontheforecast.Theamountofsatisfactionyouget (lozenge) depends on the activity you chose AND the actual weathercondition:observethetwolines leadingtothelozenge.Thereisastandardformat:

1.DecisionNodesarepresentedinrectangles.Intheexampleontheslides,thedecisioniswhatactivitytopursuewhenonvacation(climbamountain?readabook?).Whichchoicewemakemaytosomeextentbedeterminedbytheweatherforecast.

2. UncertaintyNodes are presented in ovals. The weather forecast isuncertainandsoistheactualweatheritself.

3.OutcomeNodesarepresentedinlozenges.Theoutcomeinthevacationexampleishowmuchsatisfactionyoutookfrom thechoiceyoumadeforvacationactivity.Thisisautility.

4. Arcs display the flow of the influence. The actualweather conditionaffectstheweatherforecast,whichaffects yourplanning.Noticethatyouare takingadecisiononactivity in advance of knowing what the actualweatherwillbe.Thatiswhythereisnoarcbetweenweatherconditionandvacationactivity.

YoucandrawinfluencediagramsquiteeasilyinExcelusingtheshapesdrop-downtool.

Therearedifferentoutcomeshere,whichwecanbe represented as payoffs.Thehighestpayoffcomesfrommatchingyourchoiceofactivitytotheweatherforecastandthenfindingthattheforecastmatched reality.

13.2Expectedmonetaryvalue

theexpectedmonetaryvalue,orEMV,isjustthevalueofsomeout-comegivenaparticularstateofnature,multipliedbytheprobabilityofthatstateofnature.Here,Iam using the customary expression ‘state of nature’ not necessarily to refer toanythinginthenaturalworld,buttoaparticularsetofconditions.

Thestateofnaturehastobemutuallyexclusivefortheprobabilities towork.Forexample, if we have separate probabilities for rain and sunshine, then cannotcalculateforrainANDsunshine.Thepayoffswhichresultfromeachstateofnaturearedifferent,dependingonthestateofnature.

Let’smotivatethiswithanexampleusingafriendlyfarmerasthedecision-maker.Hehasthechoiceofplantingwheat,raisingcattle,orsomemixtureofboth.Thosethreepotentialdecisionsrepresenthisrangeofchoices.Fromexperience,heknowshowmuchhecanexpecttoreceivedependingontheweatheratthetimeofharvest.Thatamountiscalledthepayoff .

Hefacesthreestatesofnature,representingdifferentweatherconditionsatharvest.Theseare rain,clear,andsunny.Foreach choiceand foreachstateof nature heknowsthepayoffswhicharerepresentedinthecellsbelow.

Thepayofftableforthefarmer

Therearethreedecisionsandthreeoutcomes,makingatotalof9possiblepayoffs.Payoffscanbenegative,andarenotalwaysintermsofmoney.Theycouldbetime,oranyother appropriate metric. In the rowswe put the choices available to thedecision-maker.Inthecolumnsweputthe‘statesofnature’whicharethe futureeventsnotunderthecontrolofthedecision-maker.

The farmer calculates the probability of each of the three outcomes by workingthrougholdweatherrecords.Hefinds:

probabilityofrainingis0.4probabilityofclearis0.3probabilityofsunnyis0.3(ofcoursetheseprobabilitiesmustalladdupto1.Therecouldbemanymore‘statesofnature’butIhavekeptthemtothreeforclarity.

TheExpectedValueisthesumofeachprobabilitymultipliedby theoutcome. Inmathnotationthisis:

EMV=∑p x∗x

which inwords is: the expectedmonetary value is the sumof all the outcomesmultipliedbytheirprobability.InExcel,usetheformula

=sumproduct(array1, array2). This is a mean, or average, with each outcomeweightedbyitsprobability.Indecisionsinvolvingmoney,

weusuallycalltheExpectedValuetheExpectedMonetary Value(EMV).Forthefarmer,theEMVsareasbelow:

TheEMVsforthefarmer

Imultiplied(horizontallybydecision)thepayoffbytheprobability ofthatpayoffandthensummed.Usuallywechoosethedecision with thehighestEMV, so thefarmershouldchoosewheat.ExpectedvalueYoutube¹

Note that a ‘good’ decision is making the best decision at the time with theinformationtohandatthattime.Theremaybeunluckyconsequencesbutprovidedtheanalysthasdoneathoroughjobinselecting theoptimaloutcomeat the time,thensheshouldnotbeblamed!

Itwillbehighlyunusualforanoutcomeof30toactuallyoccur.TheEMVisjustaweightedaverageandnotaprediction.

https://www.youtube.com/watch?v=fR7cBts1C1w


13.3Valueofperfectinformation

The farmermight be tempted to ask for advice from a consultant. What is themaximumheorsheshouldpayfortheadvice?Itisthedifferencebetweenthebestcase that the farmer calculates and how much he would receive with perfectinformation.WecalculatethevalueofperfectinformationbycomparingtheEMVofthebestchoiceineachstateofnaturewiththeEMVwithoutinformation.Ifyouknewthattheweatherwouldbepoor,youwouldselectCattle;clearwheat;sunnywheatagain.ThenfindtheEMVwithperfectinformationbymultiplyingthesebestoptionsbytheir probabilities.

¹https://www.youtube.com/watch?v=fR7cBts1C1w


Theworkingis:100.4+300.3+70*0.3=4+9+21=34.Thebestwecouldcomeupwithoutperfectinformationwas30,sothevalueofperfectinformationis34-30=4.Soifsomeoneofferedyouperfectinformationforaprice,thehighestthatyouwouldpayfortheperfectinformationwouldbenotmorethan4.

13.4Risk-returnRatio

There are other ways of choosing the optimal outcome other than the EMV,althoughtheEMVremainsthemostcommonlyused.OnecommonmeasureistheReturntoRiskRatio ,whichprovidesthedollarsreturnedperdollarputatrisk.Foreachdecision,wedivide theExpectedValueof thedecisionbytheStandardDeviationoftheoutcomeforthatdecision.WeusuallychoosethedecisionwiththehighestRRR,because then thedollar returnforeachdollarputat riskishighest.Thefarmer’sworksheetisbelow.

Mixedhasazerobecausethepayoffsareallthesame.Thereisnorisk.Cattlehasthehighestat0.715.This ishigher than theRRR for wheat althoughwheat has thehighestEMV.Thefarmermightwanttothinkmoredeeplyabouttheprobabilityofsunnyweatheratharvesttime,becausethismakesahugedifference.

13.5Minimaxandmaximin

A non-probabilistic approach to making choices is the use of maximin andmaximax.Youcan see thatyour choicedependson whether you are pessimistic(maximin)oroptimistic(maximax).Ifwecansomehowobtainprobabilities,thenaprobabilistic approachworksbest.

13.6Workedexamples

Youare planning to market a new coffee-flavored drink. You have a choicebetweenpackingthedrinkinareturnableornon- returnablepackaging.Yourlocalgovernmentisdebatingwhethernon-returnablebottles shouldbeprohibited.Thetablebelowshowsyourprofits.Ifthenon-returnablelawispassed,youwillstillgetsomesalesfromexports.

Decision Lawpassed Lawnotpassed

Returnable 80 40

Nonreturnable 25 60

Example1

1. What would be the best decision based on maximin and maximaxcriteria?

2. Alobbyisttellsyouthattheprobabilityofthegovernmentbanningnon-returnablesis0.7.Assumingthelobbyistisright,whatisthebestdecisionbasedontheEMV?

3.Atwhatlevelofprobabilitywouldyourdecisionchange?

4.Howmuchwouldyoupayforperfectinformation?

Answers:

1. Maximin: theworstare25and40.Thebest is40,meaningpackagewithreturnables.Maximax:thebestare80and60.Againreturnables.

1. TheEMVforreturnableis800.7+400.3=68.Fornonreturn-ableitis250.7+600.3=35.5.UnderEMV,youshouldusereturnablebottles.

2. Setthisupasequation,withptheunknownwhichhastobalancebothsides80p+40(1-p)=25p+60(1-p).Solvefor

pandget0.267.Soifyouhearrumoursthatsuggesttheprobabilityiscomingdown,thinkagain?

3. Ifwehadperfectinformation,theEVPIis800.7+600.3=74.Soyoushouldnotpaymorethan74-68=6

Example2

Yourunabank.Acustomerwantstoborrow$15,000for1year at 10% interest.Youbelievethatthereisa5%chancethatthecustomerwilldefaultontheloan,inwhichcaseyouwillloseallthemoney.Ifyoudon’tlendthemoney,youwillinsteadplacethe$15,000inbondswhichreturn6%butarerisk-free.

1.WhataretheEMVsofloanandnotloan?

2. Youhave a credit investigation department which can help you withmoreaccuracyintheprobabilityofdefault.Whatisthemostyoushouldbewillingtopayfortheiradvice?

3. Calculatethelevelofprobabilityofdefaultatwhichlending themoneyandinvestinginbondshaveequalEMV.

14.Accountingforrisk-preferences

What this chapter is about: the previous chapter showed howwe could pick themostattractivechoicegivenpayoffsandprobabil- ities.But itmighthavestruckyouthat therewaslittlediscussionoftherisksinvolvedandhowdifferentpeoplerelate to risk. In this chapter we are going to work through picking optimaldecisions,takingintoaccounttheriskpreferencesofthedecisionmakers.

Hereisaquestionforyou:youaregivenalotteryticketwhichhasa0.5probabilityofwinning$10,000anda0.5probabilityofzero.TheEMVistherefore10,0000.5+ 00.5 = $5000. Someone comes along and offers you $3000 for the ticketguaranteed. Would take the sure thing? Or would you hope that you win the$10,000?Most peopleare‘riskaverse’andwouldprefer togiveupsomeof theEMVinexchangeforcertainty.The$3000(orhowevermuchitis)is calledyourCertaintyEquivalent (CE) in thisparticulargamble.The difference between theEMVandtheCEiscalledtheRiskPremium.So

RP=EMV-CE

Youcanthinkofthecertaintyequivalentasasellingprice.Itisusuallysmallwhenthesizeofthegambleissmall,butincreasesas thegamblegetslarger.HereIamusing the term ‘gamble’ because there are probabilities involved. The certaintyequivalentisausefulconceptwhich,amongstotherqualities,allowsustocategorizeapproachestorisk.

•Risk-averse .IfyourCEislessthantheEMVofagamble,thenyouarerisk-averse (and most people are, so nothing to worry about. You’re‘normal’)

127

•Risk-neutral .YourCEmatchestheEMV.Youplaytheaver-ages.

• Risk-seeking .Youenjoythegamble inherent in theEMVand need tohaveaveryhighCEbeforeyou’llgiveitup.Mostpeopleinbusinesswhoare risk-seeking usually don’t last verylong.Althoughwesometimesseerisk-seekingdecisionswhenyourbusinessisintroubleandahugegambleistheonlypossiblesolution

Businessdecisionsareallabout risk,and theprobabilities them-selvesareoftenunreliableandhardtofind.Thefarmerwhosecrop- ping decision we studied inChapter2doesn’tknow tomorrow’sweather,letalonetheweatherinsixmonths.WegenerallyprefertogiveupsomeoftheEMVinreturnformorecertaintybecauseweareriskaverse.That’swhywebuyinsurance.IknowthatIwillbecoveredifIsmashmycar,andIcanevenputupwiththeknowledge thatpeopleinZuricharegettingricherbecauseofmyaversiontorisk.

14.1Outlineofthechapter

Here’swhatwe’regoingtodo:

•Describeutilityandcalculateutilities

•Showhowtousetheexponentialutilityfunctiontocalculateutilitiesbasedonrisktolerance

• Convertthoseutilities to certainty equivalentswhichwe can rank andcomparewiththeEMVsofthesamedecision

Torank choices in terms of their certainty equivalent ratherthan their EMV,weneedtheconceptofutility,whichisa numericalmeasureofhowwellagoodorservice satisfiesaneedorwant.Do you prefer to be reading this book, outsideeatinganice-cream,or

talkingwithafriend?Youcan rank these choices quickly in your head.Youcansubconsciouslyallocateutilitynumberstothechoicesandthenpickwhicheverhasthehighestutility.

Utility numbers help us to rank our preferences, but there are no units and therankingisindividual-specific.Inotherwords,givenasetofalternatives,differentpeoplemaywell havedifferentrankingsfor them.For themoment, let’sassumethatitisjustyouwhoseutilitiesweareinterestedin.Ifwecansomehowfigureoutyour utilitiesfor thedifferentoutcomesofabusinessdecision,wecanuse thoseutilitiestohelpyoumakeamorepsychologicallysatisfyingchoice.

Utility

Theplotbelowshowsahypotheticalutilitycurve.Theutilityvalueisontheverticalyaxisandtheoutcomeofagambleison thehorizontalxaxis.Noticetwothings:

Atypicalutilitycurve

*Theslopeofthelineisupwards.Thisreflectsthefactthateveryone prefersmoremoneytoless

• The rate of increase of the line is decreasing or slowing down. Itwassteepearlyon,buttowardstheenditisalmostaplateau.Thisisbecausethevalueofeachextra dollar isslightlylessthanthatof the previousdollar.Thisisthemarginalutilityofmoney.

14.2Wheredotheutilitynumberscomefrom?

Therearetwoapproaches.

Thefirstinvolvesaskingthedecision-makerhis/herchoicebetweentwogambles.

The seconduses a utility function, amathematical function which converts twoinputvariables(risktoleranceandthesizeoftheoutcome)intooneutilitynumber.

WithautilitynumberinhandwecanbegintheprocessofrankingthedecisionsintermsofutilityratherthanEMV.

Theintuitivemodel

We’llworkthroughtheintuitivemodelfirstandthenapplythesamethinkingtotheequationmodel.Onceyouget thehangof it, theequationmodel is quicker andprobably more accurate. We’ll use the payoff table from the farmer example,repeatedbelow.

Thepayofftableforthefarmer

We’llassignautilitynumberof1tothehighestpayoffandthenzerotothelowest.Thebeginningsoftheutilitytablelookslikethis:

PayoffUtility

701

40

30

20

10

-10

-300

Nowweneedtofillinthegaps.Whatwe’lldoistouseeachpayoff asacertaintyequivalent,andaskthisquestion:

Whatvalueofpwouldmakeyouindifferentbetweeneitherreceiv-ingaguaranteedpayoffofthecertaintyequivalent(inthefirstcase

40)oracceptingagambleofreceivingthehighestpayoff(70)withprobabilityporlosing30(thelowestpayoff)withprobability1-p.Let’ssaythatyouanswerp=0.9.Theexpectedvalueofthegambleis70*0.9+0.1(-30)=60.

This is higher than your certainty equivalent of 40, indicating that you are riskaverse.Thedifferencebetweentheexpectedvalueandthecertaintyequivalentistheriskpremium,andistheamounttheindividualiswillingtogiveuptoforgorisk.

Continuedownwards,certaintyequivalentbycertaintyequivalentandcompletethetable.I’vemadeupthenumbersjustforillustra-tion.Aplotoftheutilityvaluesisalsoshown.

Payoff Utility

70 1

40 0.9

30 0.8

20 0.65

10 0.4

-10 0.35

-30 0

Farmer’sutilitycurve

Ifyouwererisk-neutral,youwouldprefertotakethegambleandsowouldbeanEMVmaximizer.Thechoiceoftherisk-neutralperson isindicatedbythestraightline.

Now,let’sreplacethepayoffvaluesinthefarmertablewithutilities.WecancalculatetheexpectedutilitiesinthesamewayaswefoundtheEMVs.Thetableshowsbothforcomparison.Wemultiplytheutilityofeachoutcomewithitsprobability.

Expectedutilities

The famer’s EMV decision would have been wheat (EMV=14)but taking intoaccount his risk preferences, mixed gives the highest expected utility. It makessensestospreadtheriskbetween twocompletelydifferent crops.

##Theexponentialutilitycurvemethod

It is rather tediousaskingsomanyquestions,pluspeopleget tired andconfusedratherquickly.Anattractivealternativeistousea

mathematicalfunction, theexponentialutilitycurve.Thisrequires onlyoneinputvariable,R,thetoleranceforrisk.Theequationisbelow:

U x=1−e− x / R

Herex is the payoff; Ux is the utility of the payoff x; and R is theindividualsrisktolerance.Apersonwitha large value ofR ismorelikelytotakerisksthansomeonewithasmallerRvalue.AstheRvalueincreases,thebehaviorapproachesthatoftheEMVmaximizer.TheplotforsomeonewitharisktoleranceofR=1000is

below. ActuallyperformingthecalculationstofindutilityateachlevelofpayoffiseasyusingExcel,aswe’llseebelow.ButfirstwehavetodetermineR,theindividual’stoleranceforrisk.

Findingrisktolerance

Therearetwoapproaches

Byaskingthisquestion:consideragamblewhichhadequalchances ofmakingaprofitofXoralossofX/2.Whatisthevalueofx forwhichyouwouldn’tcarewhetheryouhadthegamble.Inotherwords,whatisthevalueofxforwhichthecertaintyequivalentis

zero?Theexpectedvalueofthealternativeis0.5xX-0.5*(X/2)=0.25X.Aslongas X >0 then the decision-maker is displaying riskaverse behavior, becausehis/hercertaintyequivalentislessthantheexpectedvalueofthegamble.ThegreatertheR,themoretolerantofrisk.Inhisbook,ThinkingFast,ThinkingSlow,DanielKahnemannotes that the ‘2’ in thedenominatorcanvarya tad. It isn’tapreciseformula,butapparentlyitcomesclose:Indiagramform:

Anal- ternativeapproach,moreusedinbusinessandfinance,istoemploy guidelinenumberscalculatedbyProfessorRonHoward¹,apioneerofdecision-analysis.Basedonhisyearsofexperience,HowardsuggeststhattheRvalueforacompanyis:

•6.4%ofnewsales

•124%ofnetincome

•15.7%ofequity

##Exampledecisionusingutilitymaximisation

Whatwe’ll dohere is to examine adecision from both the EMV and ExpectedUtilityangles,using theexponentialutility function. Thecalculationofexpectedvaluesisthesame:thedifferenceisthatinsteadofmultiplyingeachmonetarypayoffbyitsprobabilityandthenaddingup,wemultiplytheutility.

¹https://profiles.stanford.edu/ronald-howard

https://profiles.stanford.edu/ronald-howard

https://profiles.stanford.edu/ronald-howard

The decision set-up. All money figures in millions. I run a coffee importingbusiness.Ineedaspecialimportlicensetobringinaparticulartypeofcoffee.Theequityofmycompanyis12.74.So,myRvalueis15.74%of12.74=2.

Ihaveachoicebetween‘rush’and‘wait’forthepermit.

IfIrush,Ihavetopayaspecialfeeof5,andthereisa50/50chanceofthepermitbeinggranted.Ifitisgranted,Iwillhavesalesof8,givingmeanetprofitof8-5=3.Ifitisn’tgranted,Istillhavetopaythe5,butwillhavesalesofonly6,givingmeanetprofitof2.First,findtheEMVandEUofrushing.TheEMVis0.5x3+0.5(2)=2.5.

Usingtheexponentialcurveformula,theutilityof3is=1-exp(- 3/2)=0.77687.Andtheutilityof2is=1-exp(-2/2)=0.632121.TheEUforrushingis0.5(0.77687)+0.5(0.632121)=0.704496.Allfiguresareinmillions.

Nowforthealternative,whichistowait.Inthisscenario theprobabilitiesarethesameat50/50,butthecostisonly3ifitisissued,nothingifitisn’t.Thesalesarethesameat8and4.Thenetprofitsare8-3=5and4.TheEMVis0.55+0.54=4.5.TheutilitiesareU(5)=1-exp(-5/2)=0.918andU(4)=1-exp(-4/2)=0.865.TheEUis0.50.918+0.50.865=0.892.

ThetablebelowshowstheEMVsandEUs.

Spreadsheetfortherushdecision

Butwecangofurtherusingcertaintyequivalents,subjectofthenext section.

14.3Convertinganexpectedutilitynumberintoacertaintyequivalent.

Fortunately there is an easyway to convert expected utilities back to certaintyequivalents. It is then straightforward to rank the decisions by certaintyequivalents. Remember the concept of the certainty equivalent? The amount ofmoneywhichitwould takeforyou to sellyourgamble? It turnsout thatwecanconvertanexpectedutilitynumberintoacertaintyequivalentusinganequation:

CE=− R∗ln (1−EU)wherelnisthenaturallogarithm.Weusethisequationwherewearetalkingprofitsandwantmore.Inthecaseofcosts,takeoffthenegativesigninfrontofR.

AbovewefoundEU=0.704496fortherushdecisionbranch.ThatisaCEof

=-2*LN(1-0.704496)= 2.43814

using Excel’s ln function. For the wait decision, the EU was 0.892, giving acertaintyequivalentof=-2*LN(1-0.892)=4.451248

LargerCEsarepreferredwhentheoutcomesareprofits,andsmallerCEswhentheoutcomesarecosts.SoI’dwait!ThisconfirmsthedecisionmadewiththeExpectedUtilities.

15.GlossaryThisisahands-onguidetodoingsomebasic analyticworkwith data.However,statisticshasitsownsetofvocabularyandtech-niques.Whenyouarepresentingyour results tootherpeople,youmaywish to followestablished techniques andterminology. Also, here you’ll find more of the underlying theory if you areinterestedinhowExcelgetstheresultsitdoes.

Binomialdistribution

The binomial distribution is used to find the probability of the occurrence of acertainnumberofevents(calledsuccesses,althoughtheycouldbeanythingclearlydefined)withinanumberoftrials.Thebinomialdistributionwillgive,forexample,theprobabilityofatleasttenpartsbeingdefectiveoutofthenexthundred,providedthelong-termdefectiverateisknown.Thebinomial requires some conditions toworkproperly.These are:

asequenceofnidenticaltrialstherearetwooutcomesforeachtrial,denotedsuccessandfailuretheprobabilityofsuccessdoesn’tchangethetrialsareindependent.

It isquiteeasytocalculatetheprobabilitieswithExcelbutfor illustrationhereistheequationwhichExceluses:

f( x )=

(n)

x

p x (1 −p )

( n − x )

wherexisthenumberofsuccesses;ptheprobabilityofsuccessononetrial;nthenumberoftrials,andf(x) theprobabilityofx successes inn trials.Thisnotationpointsupthefactthattheresultisafunctionofx.

137

Centrallimittheorem(CLT)

Thecentral limit theoremisfundamental instatistics.What the theoremstates isthis:let’ssayyouhavealargepopulationand you keep taking samples from thepopulationandmeasuring themeanofeachsample.Thenyoudrawahistogramofthemeanssothateachmeanisavariableinthedistribution.Inotherwords,insteadof plotting each observation independently, we plot them by themeans of eachsample.Thehistogramwhichappearswillbeapproximatelynormallydistributed,orintheshapeofthewell- known bell curve. This is wonderful news because thenormaldistributioniswellbehaved,andsowecancalculatetheareaunderanypartofit.

Correlation

Youhaveapairofvariables:forexample:websitehitsand actualsales.Youwantto knowwhether there is a relationship between thevariables.Andwhetheritis‘statisticallysignificant’,orperhapswhatyouthoughtwasarelationshipoccurredjustbychance.

Correlation is the study of the linear relationship between two or more randomvariables.Correlationanalysisprovidesboththedirectionand thestrengthof therelationshipbetweenthevariables.Forexample,thereisarelationshipbetweentheheightandweightofindividuals.There is apparently also a relationshipbetweenconsumptionofsugarandobesityratesatthenationallevel.Cor-relationanalysiscantestwhethertherelationshipsarespuriousorreal.

It’strueandworthrepeating: correlationdoesnotimplycau-sation .Itdoesnotnecessarily follow that onevariable causes the other even though theymight behighly correlated. There are many examples of such spurious correlations, forexamplenational consumptionof chocolate andnumber ofNobel prizeswon. Ifyou do find a correlation, first think about whether you have a lurking variablediscussed below. The two examples given above for height/weight and sugarconsumptionmightverywellinclude

causaleffects,butthisisnotsomethingthatcorrelationanalysiscanestablishonitsown.Indeed,somephilosophers(such asDavid Hume) claim that causation canneverbedecisivelyestablished.

Nowwewill:

•Visualizeacorrelationwithascatterplot

•explainwhythecoefficientofcorrelationworks

•interpretthevalueofthecoefficientofcorrelation

•testforthestatisticalsignificanceofthecoefficientofcorre-lation

Scatterplotsandcorrelation

Acanningfactoryhasdataonnumberofworkersemployedandoutputofcans.Thedatasetis‘canning’.Thefactoryisinterestedintherelationshipbetweenthenumberofworkersandoutput.LoadthedataintoExcel.

Firstfewlineofthecanningdata

Thecolumnsofdata representsnumbersofworkers andcansmanufactured.Forthenotationwe’llcalltheseXandY.WhenImeanthenameofthevariable,I’llusetheuppercase.Forindividualobservations,suchasx=23,I’lluselowercase.WewanttoseetherelationshipbetweenXandY.TodothisweneedtohavethedatalaidoutincolumnssothateachobservationofXandYareonthesameline.It’salwaysagoodthingtovisualisetherelationshipwith a scatter plot and the scatter plot isbelow.

Thecansscatterplot

I’vedrawnaverticalandhorizontallinewhichgoesthroughthemeansofthetwovariables.Wecandividetheobservationsinto quadrants,withquadrantonebeingtoprightandsoforth.Iftherelationshipispositivethenwewouldexpecttoseemostof the observations in quadrant one and quadrant three. If the relationship wasnegativethenmostoftheobservationswouldbeinquadrantstwoandfour.

TherelationshipbetweenXandYisnotperfectlylinear.Ifitwas all the pairs ofpointswould lineupalonga straight line.Weneedsomewayofmeasuringhowfar off being in a straight line these observations are. There are two measures,covarianceandcoefficientofcorrelation.Thereissomemathsinwhatfollowsbutwe’llwalkthroughitslowly.

Covariance:Trythisthoughtexperiment:ifalltheYpointswereonaverticalline,thenthesubtractionofthemeanfromeachYvaluewouldgiveuszero:

( y i− y )=0

I’veputinalittleifortheyvaluestoshowthatitcouldbeany

ofthevaluesintheYvariable.SimilarlyfortheXvalues,ifallthepointswereonaverticallinethen

( x i−x )=0

Wecanseethatwedon’thaveverticallinesandthereissomedif- ferencebetweeneachobservedvalueandthemean.Ifwemultiplytogetherthedifferencesandthenfind theaveragebydividingbyn-1wecanfindthecovariance.Wedividebyn-1becauseweareinmostcasesdealingwithasample.Subtracting1adjustsforthisfact.ThesamplecovarianceofXandYistherefore:

S XY=

( x−x )( y−y )

n−1

Thebigproblemwithcovarianceisthattheresultisverydependentontheunits.Wecan get round this problem by standardizing the differences between eachobservationand itsmean.Wedo thatby dividing the difference by the standarddeviation of the variable. You can think of this technique as providing thedifferenceinunitsofstandarddeviations,so

( x−x ¯

s xandthesamefortheYvariable.Nowwenolongerneedtoworryabouttheunits.Thecoefficient of correlation is the covariance now between the standardiseddifferences,not thedifferences them- selves.The equation for the coefficient ofcorrelationofasampleis

r XY=∑

( x−x )( y−y )( n−1) S XS Y

Performingthiscalculationisextremelytediousandweusuallyletthecomputerdo

thework.Typein=correl(array1,array2)andhit

enter.Forthecansdatatheresultis0.991083.ThisisthePearsonProductMomentor‘r’value.

CorrelationYoutube¹

Valuesofr:valuesofrarerestrictedtotherange-1to+1.Anegativevaluemeansthat the slope is negative: as one variable increases thentheotherdecreases.Apositivevaluemeansthatbothvariables increasetogether.Thelargertheabsolutevalueofr,thenthecloser thecorrelation.Acorrelationofzeromeansthat all thepointsarescatteredaboutwithnopatternwhatsoever.Thervalueof0.991083forcansmeansthatthepointsarelinedupalongastraightline,whichiswhatwewouldexpectfromlookingatthescatterplot.

Testingr:theobservationsthatwehaveusedtocalculaterconsistofasamplefromapopulation.Becauseitisonlyasample,wedon’tknowwhetherthecorrelationwouldholdup ifweweresomehowabletogainaccesstothepopulation.ris thenotationforthecoefficientofcorrelationforasample.ForapopulationweusetheGreekletterρ“rho”.Generally,Greeklettersareusedfor population parameters.ThisdistinguishesthemfromtheRomanlettersusedforestimates.

Thehypothesiswe’retestingis:

againstthenull

H 0:ρ ≤0

H a :ρ =0

Wefindtheteststatisticusingthisequation

r √ n −2

t =√

−r 2

¹https://www.youtube.com/watch?v=QDj8aYfvccQ

https://www.youtube.com/watch?v=QDj8aYfvccQ

https://www.youtube.com/watch?v=QDj8aYfvccQ

We’llfirstworkoutthevalueoftheteststatisticusingthervaluewefoundabove.Thenwe’llfindthepvaluefromthis.

Thervaluefoundfromthecorrelationwas0.99and therewere 24 observations(n=24).Plugginginthenumbers,wefindthatt=32.86.Thisisaonetailedtest,souse=t.dist(yourvaluefort,n-1,andfalse) tofindapvalueof2.67343E-21.The -21meansmovethedecimalplace21placestotheleft,sothisnumberis effectivelyzero.Thervaluewefoundisstatisticallysignificant.Itismuchsmaller than0.05,sowecanrejectthenullhypothesis.Thereisacorrelation.

Descriptivestatistics usesgraphsand tables topresent whatweknowabout thedata. This is a powerful method of gaining insights into the data, and also formakingpresentations.However,descriptivestatisticslimitsitselftothedataavailabledoesnotmakeanyinferencesbeyondwhatwehavefromthedatatohand.

Distribution of the observations within a dataset describes the density of theobservations:aretheyspreadaboutevenly,orpeakedinthemiddleandthenspreadoutoneithersideoftheirmean?

One of the most common and useful distributions is the normal distribution,sometimesknownas thebellcurve.Manythings in thenaturalworldfollowthisdistribution.Forexample,ifyouwereabletomeasuretheheightsofalargenumberofmenorwomen,you’dfindthattheyfollowedanormaldistribution.

TheplottotherightshowsthedistributionofrecordsofheightsofparentscollectedbyFrancisGaltoninthe19thcentury.Galtonwastryingtofindouttherelationshipbetweentheheightofparentsandtheirchildren.Alongtheway,hediscoveredthephenomenonknownas‘regressiontothemean’.

Ihaveplottedtheheightsasahistogramandaddedanormalcurvewiththesamemeanandstandarddeviationastheoriginaldata.Thisclearlyfollowsthebellcurve.

Thenormaldistributionisveryimportantinstatisticsforseveralreasons

-thenormaldistribu-tionis‘well-behaved’inthesensethatital-waysfollowsthesamemathematicalfunction.Theequationisalit-tlecomplex,butwecandrawanynormaldistributionprovidedweknowitsmeanandvariance.Themeanistheaverageoftheob-

Galton’sheightobservations

servations,andinanormaldistributionoccursrightinthemiddle.Thevarianceisameasureoftheaveragedistanceofeachobserva-tionfromthemean.Thelargerthevariance,themore‘spreadout’ordispersedthedata.

Thegraphshereshowtwodatasets,withidenticalmeans butdifferentvariances.In most of this book we won’t use the term variance, but instead standarddeviation .Thestandarddeviationisjustthesquarerootofthevariance.Weusethestandarddeviationprimarilysothatwecanusethesameunitsasthemean.Thefirstplothasastandarddeviationof5,andthesecondoneastandarddeviationof2.Theplotswerecreatedbygeneratingarandomsetof1000numbers,withameanoftenandthestandarddeviationsjustdescribed.

-the second reason is connected with the most important result in statistics,the CentralLimitTheorem ,orCLT.This statesthatifyoukeepdrawingsampleswithreplacementfromalargepopulation, themeans of the samples will followa normaldistribution .Theimportantthingisthatthedistributionoftheoriginalpopulationdoesn’tmatteras longaswedrawat least30samples.Infact, itgetsbetterthanthat:iftheoriginalpopulationisnotnormallydistributed,thenwedon’tneedasmanyas30inoursample.Asfewas10willdothetrick.

Standarddeviation=5

Becausethenormaldistributionfollowsanequationsowell,wecan find the areaunderanypartofitveryeasily.

Standarddeviation=2

Themean splits the area under the curve into halves,meaning that50%of theobservationsaresmallerthanthemean,and50%larger.Acommonpracticeis tomarkoff thehorizontalaxis inunits of standard deviation. The further from themean,thegreaterthestandarddeviation.Thishelpsbothinvisualizingtheshapeofthedata,butalsoinusingtheempiricalrule .

Empirical rule : a ‘rule of thumb’ which states that when data are normallydistributed,

-64%ofthedataarewithinonestandarddeviationofthemean-95%ofthedataarewithin twostandarddeviations of themean -99.7% of the data arewithin threestandarddeviationsofthemean

Thismakesintuitivesense:only100-99.7=0.3%ofthedataaremorethanthreestandard deviations from the mean. They would therefore be in either of theextremeendsortailsofthedistribution.Theprobabilityoftheiroccurrencewouldbeverysmall.Bycontrast,observationsclosertothemean, those located onlyasmallnumberofstandarddeviationsfromthemean,aremuchmorelikelytooccur.Thatiswhythehistogramishighestonatthemean.Thoughtexperiment:veryshortorverytallpeopleareunusual(whichiswhyyoustareatthem).Bycontrast,peoplearoundthemeanheightarenotatallunusualandyoujustpassthemby.

Estimation : The statistical process is circular: we start off by choosing thecharacteristic of the population that interests us. That characteristic is called aparameter.Wedrawarandomsamplefromthatpopulation.Weuse thestatistic toinfer information about the same quantifiable feature in the population. Thestatistic(for ex-ampleanaverageoraproportion)isanestimateofthepopulationparameter. The process of finding how close the estimate to the parameter, andbuildingaconfidence interval around the estimate, is inference .We test claimsabouttheparameterusinghypothesis-testing,whichiscloselyrelatedtoinference.Hypothesis-testingisthefinalpartofthischapter.

Excel –morescreencasts

Insertingthedataanalysistoolpakadd-in²Thenormaldistribution³

²https://www.youtube.com/watch?v=Ods1Z5BLD9g&list=UUVZFhqXDgJgIRq-ljotClgQ

³https://i1.ytimg.com/vi/1qn0EIm5-PQ/mqdefault.jpg

https://www.youtube.com/watch?v=Ods1Z5BLD9g&list=UUVZFhqXDgJgIRq-ljotClgQ

https://i1.ytimg.com/vi/1qn0EIm5-PQ/mqdefault.jpg



https://i1.ytimg.com/vi/1qn0EIm5-PQ/mqdefault.jpg

ExponentialSmoothing

Hypotheses and p values A hypothesis is a claim someone makes about apopulation.Forexample,acoffee importer (me!) is toldby the supplier that themeanweightofthesacksofcoffeebeansis40kg.Thatistheclaim.Iwanttotestthat claim because I am paying for the coffee. All the sacks of coffee in theshipmentarethe‘population’.Itakearandomsampleofsackstocheck.Ifind thatthemeanweightofthesampleis38kgs.DoIrejectthewholeshipment,orsayImusthavegot a low sample, let’s take them? The hypotheses here are: thenullhypothesis:thepopulationmeanweightis40kg.Meanwhilethealternativeothepopulationmeanweightisnot40kgs.Hypothesistestingprovidesananswertothetestintheformofapvalue,definedbelow.

ApvalueistheprobabilityoffindingasamplewiththemeasuredcharacteristicsIFthe null hypothesiswas true. In the case of the coffee above, let’s say that wecalculatedthepvalue(SeeChapter

9)anditwas0.1.Thatmeansthatthereisa10%chanceoffindingasamplewithameanweightof38kgsiftherealtruemeanweightoftheshipmentwas40kgs.Themostfrequentlyusedcut-offpointis0.05or5%.Ifthepvaluewassmallerthan0.05thenwewouldsaythatthereisalessthan5%chancethatthissamplecamefromapopulationwiththeclaimedweight.Therefore the shipment isn’t of the claimedweightandshouldberejected.

Hypothesiswritingandtesting

Ahypothesisisastatementorclaim.Forexample,‘themeanweight of thecoffeepacksis3kg’isahypothesisorclaim.Thehypothesisneedstobetestedsothatweknowwhethertheclaimstandsorisshowntobefalse.

Hypothesesarewrittensothattheclaim,andthecounter-claim,aredistinct.Therearetwohypotheses,onefortheclaimandthe other for the counterclaim.Ho, the‘nullhypothesis’,containstheclaim.Ha,the‘alternativehypothesis’,containsthecounterclaim.Wewritehypothesesbyassumingthatthenullhypothesisistrue.

Thenullhypothesesforthecoffeeexampleabovewouldbe

Ho :µ =3 kg

Thisisreadas:Thenullhypothesisisthatthepopulationmeanweight(mu)ofthecoffeeisequalto3kg.

Thealternativehypothesis (Ha) is that thepopulationmean weight is not 3 kg,written:

Ha :µ =3 kg

Wetestthehypothesesusinginformation(statistics)fromthesam- ple (themeanweight,thesamplesize,andthestandarddeviation).Notethatthecounterclaimsaysnothingaboutthemeanweightinthepopulationbeingmorethanorlessthan3kg,itsimplysaysthatitnotequalto3kg.Also,important,theequalitysignappearsinthenull.Thisisalwaysthecase,whethertheinequalityis‘strict’or‘weak’.

Wecanalsotestwhetherthepopulationmeanissmallerthanor largerthansomeconstant.Theshipperclaimsthattheweightis‘atleast’3kg.Wewritethisas:

H 0:µ ≥3 kg

Notethattheequalityremainswiththenull.Thealternativeisthen thenegationofthenull,andthismustbethattheweightis‘lessthan3kg’.

H a :µ<3 kg

Convinceyourself that theonlypossiblewaytonegateHoiswith thealternativehypothesis.Iftheclaimwasthatthattheaverageweightislessthanorequalto3kg,wesimplyswitchoverthe

inequalitysigns.Thereareotherformsofhypothesiswritingwhichwewillworkthroughinthisbook.However,theyallsharetherulethattheequalitysigngoesinthenull;andthealternativenegatesthenull.

Inference is thekeystatisticalprocess. It is theway in which information aboutpopulationscanbegleanedfromsurprisinglysmallsamples.Anexampleispollingbeforeanelection.Providedthesampleisdrawnrandomlyandisrepresentativeofthepop-ulation,asampleofperhapsonly1,000peoplecanprovidequiteaccurateestimates of the proportion of the population predicted to vote for a particularpoliticalparty.Theestimatesareusedtomakeinferencesaboutthepopulation.Wewon’tknowuntilelectiondaywhetherornottheinferenceswerecorrectornot.

Inferentialstatisticsisaverypowerfultechniquewhichallowsustomakethejumpfromasampletoapopulation.Wecanuseasmallsampletomakeinferencesaboutthecharacteristicsofapopulationaboutwhichweperhapsknowverylittle.Thekeyistoobtainasamplewhichrepresentsthepopulationasaccuratelyaspossible.Forthis, both randomization and good surveydesign are essential. Wewill discussthesetwoimportantissueslaterinthechapter.Recently,athirdclassofstatisticalmethodologieshasarisen.Thisismachinelearning ,whichtakesadvantageofnewanalytictechniquesandgreatercomputingpower.Wewon’tbeable togointothesetechniquesinthisbook,unfortunately.

Lurkingvariables : Ithappens that there isclosemonthlycor- relationbetweentheconsumptionoficecreamanddeathsbydrowning.Isitthatpeopleeatanicecream,goswimmingandthendrown?Thecorrelationmissesthelurkingvariablewhichistemperature.Inhotweather,peoplebothgoswimmingmoreoftenandbuyice creams. The ice cream and the drowning variables are both correlated withtemperatureandsoarecorrelatedwitheachother.Anotherexample:itseemsthatthereisacorrelationbetween kidsneeding readingglassesandsleepingwith thelighton.Soyou

should switch the light off? Not so fast. The lurking variable is the short-sightednessoftheparents,notthekids.Theparentsleavethelightonsotheycanseethekid.Thekidsinheritbadeyesightfromtheirparents.

Parameter : a quantifiable characteristic of interest in the population. Theaverage weight of all of the fish in the Fraser Riveris a parameter. The samecharacteristicinthesampleisknownasastatistic.Wecaneasilyfindthisstatisticbecausethesampleisonlyasubset,andthereforesmaller,thanthepopulation.Weusuallyneverknowtheactualvalueofaparameter,butthatdoesn’tmatterbecausewecanestimatefromthesample.

Pointestimate :astatistic,suchasthemeanoraverage.Itiscalledapointbecauseitisanexactnumbersuchas4.21.Itisnotarange,itisapoint.Itiscalledanestimatebecauseitisanestimateofthepopulationparameter.Therefore4.21isanestimateofwhatthesamequantifiablefeaturewouldbeinthepopulation.Youcanthinkofapointestimateasbeingthe‘bestguess’forthepopulationparameter.

Ifwe selectedadifferent sample from the population, then almost certainly thepointestimatewouldbedifferent,givinganewbestguess.Ifwecontinuetotakesamples,thenwearegoingtogetasmanybestguessesaswehavesamples.Therangeofbestguessesisthesamplingvariation .

Poissondistribution

ThePoissonprobabilitydistributionis,aswiththeBinomial,adis-creteprobabilitytool.Wecanuseitforcalculatingtheprobabilityofeventswhichoccurovertimeorspace.Forexample,numberofmistakesinagivennumberof linesof code.Theformulais:

f( x )=

µ x e− xx !

wheref(x)istheprobabilityofxoccurrencesinagiveninterval.mu

istheexpectedvalueormeannumberofoccurrencesinthegiveninterval(orarea).eisaconstant,value2.71828.

Population : the group of individuals about whom we want some information.Each individual in the population is called an element or sometimes a case .Examplesofpopulationsareall the fish in the FraserRiver,Toyotacarsmadein2012,orindeedthepopulationofacountry.Itisusuallynotpossibletodealwithdataatthepopulation levelbecauseitiseitherexpensiveorimpossibletoobtain.Asaresultwedrawasample,orsubset,fromthepopulation.

Randomisationisthetechniqueofselectingelementsfromthepopulationtobuildasamplesothateachelementhasthesame known probability of being selected.This technique greatly reduces bias, described in more detail in the followingchapter.

Regression

Samplingframe :alistoftheitemstobesampled.Imaginethatyouwishtosurvey100studentsfromaUniversity.TheUniversityprovidesyouwithalistofstudentsandtheirIDnumbers.Thestudentsformthepopulation,andthelististhesamplingframe.Youwouldusearandomsamplingmethod,coveredbelow,torandomlyselect100studentsfromtheUniversity.The100studentsformthesample.

Standarderror :Trythis thoughtexperiment.Let’ssay that thereare1000jellybabiesinabowlandforsomereasonyou’dliketoknownthemeanweightofajellybabywithouthavingtoweighall ofthem.Youhaveachoiceoftakingarandomsampleofeithern

= 100 or n = 500. Which sample would produce the most accurate estimate:obviouslythesamplesizeof500becauseitis‘closer’tothepopulation size.

Yetitstillwon’tbecompletelyaccuratebecauseitcouldhappenthatthesampleyouselect is biased in someway: perhaps you selected all theheavyones?We canmeasuretheamountoferrorwiththestandarderror,whichhelpsustodeterminehow‘close’wearetothe

unknownparameter.BelowIshowhowtocalculatethestandarderror(SE).Hereisthestandarderrorforastatisticsuchasamean:

σSE =√ n

Inwords, the standard error is the population standard deviation dividedby thesquarerootofthesamplesize,denotedbyn.Notethatasthesamplesizeincreases,thenthestandarderrordecreases.

Thisis in linewithyourearlier intuition that the larger the sample size then themoreaccuratetheestimate.Usuallywedon’t know‘sigma’,σ ,sowereplaceitwith‘s’whichisthestandarddeviationofthesample.

Transformations :thisisatechniquesometimesusedinregression. Theobjectisusually to transform thedistributionof the dependent variables so that it moreclosely resembles a normal distribution. We do this because the theoreticalunderpinning of linear regression is basedonanassumption of normality in thedependentvariable.Thedependentvariable is typically transformedby taking itsnaturallogarithmandthenusingthetransformedvariableintheregression.Thisisoften used when the dependent variable has only positive values and is highlyskewed to the right. Independent variables can also be transformed such as byaddingasquaredversionof the variable in the regression. Independent variablescanalso betransformedby

zscores :A z score is ameasure of how far an observation contained within avariable is from themean of that variable, in termsof standard deviations. It isquiteeasytocalculatewithinExcel,andthisYoutubeshowsyouhow

zscoresYoutube⁴.

Whydothis?Bydividingthedifferencebetweenthevalueofthe

⁴https://www.youtube.com/watch?v=FSZpynSBev8

https://www.youtube.com/watch?v=FSZpynSBev8

https://www.youtube.com/watch?v=FSZpynSBev8

observationandthemeanofthevariablebythestandarddeviationofthevariable,likethis:

z i=

x i−x ¯

s

thedifferencesareinunitsofstandarddeviations.Wecanusethisfor:

•Comparingthedistributionsoftwocompletelydifferentvariables

• Detectingoutliers.Anoutlierisanobservationwhichisveryfarfromtherestofthedistribution.Usually,99.7%of the observationswillbewithinthreestandarddeviationsofthemean.Soanobservationwhichismorethanthreestandarddeviationsfromthemeanislikelytobequestionable. Thiscanbegoodorbad:

•Badbecauseperhapssomeonemadeadataentryerrorwhichwecancatch.

• Good because perhaps there is something anomalous and interestingaboutthat‘weird’observation.

straightforward statistics using excel and tableau: a ...index-of.co.uk/ofimatica/straightforward...

Documents