Stat 306: Finding Relationships in Data. Lecture 13 Section 3.11 Interpretations


Post on 07-Feb-2021


  • Stat 306: Finding Relationships in Data.

    Lecture 13 Section 3.11 Interpretations

  • Section 3.11 Interpretations

    •  4 categories of study
    •  Three issues:
       – 1. Regression to the mean
       – 2. Unobserved confounding
       – 3. Multiple comparisons
    •  Look through some papers

  • Four categories of scientific study

                             Observational   Experimental
    Goal is Explanation      1.              2.
    Goal is Prediction       3.              4.

  • 1. Studies with the goal of explaining a phenomenon

    Observational studies are defined by having no intervention by researchers. The explanatory variables in the model (X) are not determined by the researchers. Often the data come from surveys or databases. Observational studies are important in the following fields:

    - macro-economics
    - epidemiology / public health
    - public policy
    - political science
    - sociology
    - criminology

  • 2. Studies with the goal of explaining a phenomenon

    Experimental studies are defined by having a specific intervention by researchers. At least one explanatory variable in the model (X) is determined for each observation by the researchers. This is often done by randomization. Data are collected by researchers. Experimental studies are important in the following fields:

    - medicine (clinical trials)
    - educational research
    - psychology

  • Studies with the goal of predicting future events

    3. Observational studies for prediction are important in the following fields:

    - economics
    - transportation research
    - real estate
    - financials
    - insurance

    4. Experimental studies for prediction are important in the following fields:

    - A/B testing (online advertising, website optimization)

  • 1. Regression to the mean

    Example: On Day 1, students take a multiple choice test and fill out the answers randomly. We look at the results in a histogram of test scores:

    [Figure: Histogram of day1test. x-axis: test score (0 to 100); y-axis: Frequency]

  • 1. Regression to the mean

    Example: On Day 2, students take another multiple choice test and fill out the answers randomly. We look at the results in a histogram of test scores:

    [Figure: Histogram of day2test. x-axis: test score (0 to 100); y-axis: Frequency]

  • 1. Regression to the mean

    Example: Let’s look at a scatterplot of each student’s two test scores:

    [Figure: scatterplot of the two test scores, axes labelled x and y]

  • 1. Regression to the mean

    Example: Those who did worst on the first test tended to improve their score on the second test.

    [Figure: scatterplot of the two test scores, axes labelled x and y]

  • 1. Regression to the mean

    Let’s imagine someone has a new “treatment” to help students who do poorly on multiple choice tests get better grades. What would happen if we tested this treatment? Would we see any improvement?

    [Figure: scatterplot of the two test scores, axes labelled x and y]
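The two-test example can be reproduced with a small simulation. The following is a minimal sketch, not the lecture's own code: the class size, the number of questions and the guessing probability are assumptions, since the slides do not state them. It shows the students who scored worst on day 1 improving on day 2 although nothing was done to them.

```python
import random

rng = random.Random(306)  # fixed seed so the run is reproducible

N_STUDENTS = 200   # assumed class size (not given in the slides)
N_QUESTIONS = 25   # assumed number of questions per test
P_CORRECT = 0.5    # assumed chance that a random guess is right


def random_test_score(rng):
    """Percentage score of one student who guesses every answer."""
    correct = sum(rng.random() < P_CORRECT for _ in range(N_QUESTIONS))
    return 100 * correct / N_QUESTIONS


def mean(values):
    return sum(values) / len(values)


# Two independent tests: nobody has any skill, and no treatment is given.
day1 = [random_test_score(rng) for _ in range(N_STUDENTS)]
day2 = [random_test_score(rng) for _ in range(N_STUDENTS)]

# Rank students by their day 1 score and pick the 20 worst and 20 best.
order = sorted(range(N_STUDENTS), key=lambda i: day1[i])
worst, best = order[:20], order[-20:]

worst_day1 = mean([day1[i] for i in worst])
worst_day2 = mean([day2[i] for i in worst])
best_day1 = mean([day1[i] for i in best])
best_day2 = mean([day2[i] for i in best])

print(f"worst 20 on day 1: mean {worst_day1:.1f} on day 1, {worst_day2:.1f} on day 2")
print(f"best 20 on day 1:  mean {best_day1:.1f} on day 1, {best_day2:.1f} on day 2")
```

Any "treatment" handed to the worst 20 students between the two tests would appear to work, because their day 2 scores drift back toward the overall mean on their own.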

  • 1. Regression to the mean

    More Examples from Wikipedia

    •  The hottest place in the country today is more likely to be cooler tomorrow than hotter, as compared to today.
    •  The best performing mutual fund over the last three years is more likely to see relative performance decline than improve over the next three years.
    •  The most successful Hollywood actor of this year is likely to have less gross than more gross for his or her next movie.
    •  The baseball player with the greatest batting average by the All-Star break is more likely to have a lower average than a higher average over the second half of the season.

    https://en.wikipedia.org/wiki/Regression_toward_the_mean

  • 1. Regression to the mean

    •  “Regression to the mean” can be a problem for observational studies, depending on which observations are included in the analysis.
    •  “Regression to the mean” can be a problem for experimental studies if subjects are used as their own control. In other words, if you simply compare the outcome post-treatment to pre-treatment, you will likely see “regression to the mean” and could mistake this for a treatment effect.
    •  The best way to avoid this problem in experimental studies is to randomize subjects to two groups: a treatment group and a control group.
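To see why a randomized control group helps, here is a hedged sketch comparing the two designs on simulated data. Everything numeric in it is an assumption made for illustration (the measurement scale, the noise level, the enrolment threshold), and the hypothetical treatment is given a true effect of zero.

```python
import random

rng = random.Random(13)  # fixed seed so the run is reproducible

N_SCREENED = 5000     # assumed number of people measured at the start
TREATMENT_EFFECT = 0  # the hypothetical treatment truly does nothing
THRESHOLD = 120       # assumed cut-off: only extreme cases are enrolled


def mean(values):
    return sum(values) / len(values)


def measure(true_level, rng):
    """One noisy measurement of a stable underlying level."""
    return true_level + rng.gauss(0, 10)


true_level = [rng.gauss(100, 10) for _ in range(N_SCREENED)]
pre = [measure(level, rng) for level in true_level]

# Enrol only subjects whose first measurement was extreme.
enrolled = [i for i in range(N_SCREENED) if pre[i] > THRESHOLD]

# Randomize the enrolled subjects into a treatment and a control group.
rng.shuffle(enrolled)
half = len(enrolled) // 2
treated, control = enrolled[:half], enrolled[half:]

post = {i: measure(true_level[i] + TREATMENT_EFFECT, rng) for i in treated}
post.update({i: measure(true_level[i], rng) for i in control})

# Design A: subjects as their own control (post minus pre, treated only).
pre_post_change = mean([post[i] - pre[i] for i in treated])

# Design B: randomized comparison (treated minus control, post only).
randomized_diff = mean([post[i] for i in treated]) - mean([post[i] for i in control])

print(f"pre/post change in treated group: {pre_post_change:.1f}")
print(f"treated minus control at follow-up: {randomized_diff:.1f}")
```

In this sketch the pre/post comparison shows a sizeable drop even though the treatment does nothing, while the treated-versus-control difference stays near zero, because both groups regress toward the mean by the same amount.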

  • 1. Regression to the mean

    The measurement of blood pressure serves as a good example. If blood pressure is initially measured in a group of patients and then re-measured after a period of time, people with extreme blood pressure at Time 1 will tend to be closer to the average level at Time 2. This improvement is not due to any treatment, only due to random error. People usually seek treatment when their symptoms are particularly severe. If treatment is sought when these symptoms are at their worst, these symptoms should be less severe simply by random fluctuations and natural recovery, even when no treatment is used.

    Yu and Chen (2015) https://www.frontiersin.org/articles/10.3389/fpsyg.2014.01574/full

  • 1. Regression to the mean

    Regression to the mean can be a problem for observational studies and experimental studies (that have no control group).

    [Figure: axis label day1test]

  • 2. Unobserved Confounding

    Most famous example:

    The Nurses’ Health Study (NHS) was one of the largest and most influential observational studies in health. The NHS began in 1976 and subsequently followed more than 120,000 married female registered nurses. The NHS published results in 1991 and found that hormone therapy in post-menopausal women was associated with a substantial reduction in the development of heart disease. In 1998, the Heart and Estrogen-progestin Replacement Study (HERS) randomized 2,763 women to receive either hormone therapy or placebo. It concluded that hormone therapy increased, not decreased, the risk of heart disease.

  • 2. Unobserved Confounding

    [Diagram relating X, Y and Z]

    X: Variable that is known and measured
    Y: Outcome variable
    Z: Variable that is unknown and/or unmeasured

  • 2. Unobserved Confounding

    [Diagram relating X, Y and Z]

    X: Size of garage (0 = no garage, 1 = 1-car garage, 2 = 2-car garage)
    Y: Sale price ($)
    Z: Location (Downtown vs. Suburbs)
    n = 200 houses for sale

    What is the effect of having a larger garage on the sale price of the house?
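The garage question can be explored with simulated data. This is only a sketch under stated assumptions: the true garage effect, the downtown price premium and the way garage size depends on location are all invented for illustration and do not come from the lecture. It compares the slope obtained when location (Z) is ignored with the slopes obtained within each location.

```python
import random

rng = random.Random(2021)  # fixed seed so the run is reproducible

N_HOUSES = 200              # n = 200 houses, as on the slide
GARAGE_EFFECT = 10_000      # assumed true effect of one extra garage space ($)
DOWNTOWN_PREMIUM = 200_000  # assumed price premium for a downtown location ($)


def slope(x, y):
    """Least-squares slope from a simple linear regression of y on x."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx


houses = []
for _ in range(N_HOUSES):
    downtown = rng.random() < 0.5
    # Assumption: downtown houses tend to have smaller garages.
    weights = [0.6, 0.3, 0.1] if downtown else [0.1, 0.3, 0.6]
    garage = rng.choices([0, 1, 2], weights=weights)[0]
    price = (300_000 + GARAGE_EFFECT * garage
             + DOWNTOWN_PREMIUM * downtown + rng.gauss(0, 10_000))
    houses.append((downtown, garage, price))

# Z unobserved: regress price on garage size using all houses together.
naive = slope([g for _, g, _ in houses], [p for _, _, p in houses])

# Z observed: fit the same regression separately within each location.
within = {}
for label, flag in (("downtown", True), ("suburbs", False)):
    subset = [(g, p) for d, g, p in houses if d == flag]
    within[label] = slope([g for g, _ in subset], [p for _, p in subset])

print(f"slope ignoring location: {naive:,.0f} $ per garage space")
for label, value in within.items():
    print(f"slope within {label}: {value:,.0f} $ per garage space")
```

Under these assumptions the pooled slope comes out negative even though every extra garage space adds value, because garage size is standing in for location. If location had never been recorded, the data alone would give no warning of this.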

  • 2. Unobserved Confounding

    [Diagram relating X, Y and Z]

    X: Police budget
    Y: Murder rate
    Z: Crime level
    n = 60 cities

    What is the effect of increasing or decreasing the police budget on the murder rate?

  • 3. “Multiple comparisons”

    Type 1 error = Pr(reject H0 | H0 is true)

    For linear regression:

    Type 1 error = Pr(conclude βj is not zero | βj = 0) = Pr(p-value for βj is small | βj = 0)

    0.05 > Pr(p-value for βj
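The slide text breaks off here, so the following sketch rests on an assumption about where the argument goes: that each individual test keeps its Type 1 error at 0.05, but the chance of at least one false positive grows with the number of coefficients tested. The number of predictors and the independence of the tests are assumptions made for illustration.

```python
import random

rng = random.Random(3)  # fixed seed so the run is reproducible

ALPHA = 0.05        # significance level used on the slide
N_PREDICTORS = 20   # assumed number of coefficients tested per study
N_STUDIES = 10_000  # number of simulated studies


def any_false_positive(rng):
    """True if at least one of the null p-values falls below ALPHA."""
    # Assumption: every null hypothesis is true and the tests are independent,
    # so each p-value is uniformly distributed on [0, 1].
    return any(rng.random() < ALPHA for _ in range(N_PREDICTORS))


simulated = sum(any_false_positive(rng) for _ in range(N_STUDIES)) / N_STUDIES
analytic = 1 - (1 - ALPHA) ** N_PREDICTORS

print(f"simulated chance of at least one false positive: {simulated:.3f}")
print(f"analytic value 1 - (1 - alpha)^m:               {analytic:.3f}")
```

With 20 independent null tests at level 0.05, the analytic chance of at least one "significant" coefficient is about 0.64, far above the 0.05 that applies to any single test.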

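  • Age vs. Money

    [Figure: a population and a random sample of n = 9, with a scatterplot and fitted linear regression. x-axis: Age in Years (PREDICTOR variable X); y-axis: cash ($) on hand (RESPONSE variable Y). Data labels as extracted: 82 22 4571 29 129 1824; 71 54 43452111304510]

    Population parameters: β0, β1, σ²
    Sample statistics (n = 9): b0 = 17.7, b1 = 0.55, s = 15.5, R² = 0.49
    Hypothesis test: H0: β1 = 0, H1: β1 ≠ 0

    For statistic β1: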

  • Age vs. Money

    Objective: The purpose of this observational study was to demonstrate if, and to what extent, age is associated with use of cash.

    Design and Methods: We collected a random sample of individuals and for each determined their age (recorded in years) and the amount of cash (in dollars) they had on hand. Analysis of the data was done using linear regression.

    Results: We obtained a random sample of n = 9 subjects. There is a statistically significant association between age and money (p-value = 0.036). For every additional year in age, an individual’s amount of money increases on average by an estimated $0.55 (95% C.I. = [$0.05, $1.05]).

    Conclusions: We found that, as hypothesized, age is associated with cash use. In our sample age accounted for about half of the variability observed in money (R² = 0.49). We predict that a 50 year old will have $45.1 (95% P.I. = [$5.6, $84.5]), whereas a 40 year old will have $39.6 (95% P.I. = [$0.8, $78.4]).

    Small Print: The analysis rests on the following assumptions:

    - the observations are independently and identically distributed.
    - the response variable, money, is normally distributed.
    - Homoscedasticity of residuals or equal variance.
    - the relationship between response and predictor variables is linear.

    For parameter β1:
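As a reading aid, the reported numbers can be checked against each other using only the summary statistics on the slides. This is a sketch, not the lecture's analysis: the raw data are not recoverable from the transcript, and the critical value 2.365 is the standard table value for a t distribution with 7 degrees of freedom, hard-coded here rather than computed.

```python
import math

# Values reported on the slides.
n = 9
b0, b1 = 17.7, 0.55
ci_low, ci_high = 0.05, 1.05
r_squared = 0.49

# 97.5% quantile of the t distribution with n - 2 = 7 degrees of freedom
# (standard table value, hard-coded to keep the sketch dependency-free).
T_CRIT = 2.365

# Standard error of b1 recovered from the half-width of the 95% C.I.
se_b1 = (ci_high - ci_low) / 2 / T_CRIT
t_from_ci = b1 / se_b1

# In simple linear regression, t^2 = R^2 (n - 2) / (1 - R^2).
t_from_r2 = math.sqrt(r_squared * (n - 2) / (1 - r_squared))


def predict(age):
    """Point prediction of cash on hand from the fitted line."""
    return b0 + b1 * age


print(f"t statistic implied by the C.I.: {t_from_ci:.2f}")
print(f"t statistic implied by R^2:      {t_from_r2:.2f}")
print(f"predicted cash at age 50: {predict(50):.1f}")
print(f"predicted cash at age 40: {predict(40):.1f}")
```

Both routes give a t statistic of about 2.6, just above the critical value, which agrees with a p-value slightly below 0.05. The rounded coefficients give predictions of 45.2 and 39.7, close to the reported $45.1 and $39.6; the small gap is presumably due to rounding of b0 and b1.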

  • Giuli et al. (2014)

  • Haby et al. (2011)

  • Promislow et al. (2002)

  • Fallowfield et al. (2002)

  • The art of linear regression

    •  Categorical predictors
    •  Quadratic (polynomial) relationships
    •  Outliers
    •  How to fix heterogeneity
    •  Regression to the mean
    •  Simpson’s Paradox
    •  Unobserved Confounding