Stat 306: Finding Relationships in Data. 1. Regression to the mean. Regression to the mean can be a...
TRANSCRIPT
-
Stat 306: Finding Relationships in Data.
Lecture 13 Section 3.11 Interpretations
-
Section 3.11 Interpretations
• 4 categories of study
• Three issues:
  – 1. Regression to the mean
  – 2. Unobserved confounding
  – 3. Multiple comparisons
• Look through some papers
-
Four categories of scientific study

                      Observational   Experimental
Goal is Explanation   1.              2.
Goal is Prediction    3.              4.
-
1. Studies with the goal of explaining a phenomenon
Observational studies are defined by having no intervention by researchers. The explanatory variables in the model (X) are not determined by the researchers. Often the data come from surveys or databases. Observational studies are important in the following fields:
- macro-economics
- epidemiology / public health
- public policy
- political science
- sociology
- criminology
-
2. Studies with the goal of explaining a phenomenon
Experimental studies are defined by having a specific intervention by researchers. At least one explanatory variable in the model (X) is determined for each observation by the researchers. This is often done by randomization. Data is collected by researchers. Experimental studies are important in the following fields:
- medicine (clinical trials)
- educational research
- psychology
-
Studies with the goal of predicting future events
3. Observational studies for prediction are important in the following fields:
- economics
- transportation research
- real estate
- financials
- insurance
4. Experimental studies for prediction are important in the following fields:
- A/B testing (online advertising, website optimization)
-
1. Regression to the mean
Example: On Day 1, students take a multiple choice test and fill out the answers randomly. We look at the results, a histogram of test scores:
[Figure: Histogram of day1test; x-axis: test score (0 to 100), y-axis: Frequency]
-
1. Regression to the mean
Example: On Day 2, students take another multiple choice test and fill out the answers randomly. We look at the results, a histogram of test scores:
[Figure: Histogram of day2test; x-axis: test score (0 to 100), y-axis: Frequency]
-
1. Regression to the mean
Example: Let’s look at a scatterplot of each student’s two test scores:
[Figure: scatterplot; x-axis: x (30 to 70), y-axis: y (30 to 80)]
-
1. Regression to the mean
Example: Those who did worst on the first test tended to improve their score on the second test.
[Figure: scatterplot; x-axis: x (30 to 70), y-axis: y (30 to 80)]
-
1. Regression to the mean
Let’s imagine someone has a new “treatment” to help students who do poorly on multiple choice tests get better grades. What would happen if we tested this treatment? Would we see any improvement?
[Figure: scatterplot; x-axis: x (30 to 70), y-axis: y (30 to 80)]
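The thought experiment above can be tried out in a small simulation. This is only a sketch under stated assumptions: the class size (200), the test length (100 questions), and the 50% chance of guessing correctly are illustrative values that do not come from the lecture, and no treatment of any kind is applied between the two tests.

```python
import random
import statistics

N_STUDENTS = 200    # assumed class size (not from the slides)
N_QUESTIONS = 100   # assumed test length (not from the slides)
P_CORRECT = 0.5     # assumed chance of guessing one question correctly

rng = random.Random(306)  # fixed seed so the run is reproducible

def random_test_score():
    """Score of one student who guesses every answer at random."""
    return sum(rng.random() < P_CORRECT for _ in range(N_QUESTIONS))

# Each student takes two independent tests: (day 1 score, day 2 score).
scores = [(random_test_score(), random_test_score()) for _ in range(N_STUDENTS)]

# "Enrol" the students who did worst on day 1 (bottom 10%).
worst = sorted(scores, key=lambda pair: pair[0])[: N_STUDENTS // 10]

day1_mean = statistics.mean(day1 for day1, _ in worst)
day2_mean = statistics.mean(day2 for _, day2 in worst)

print(f"Worst group, day 1 mean: {day1_mean:.1f}")
print(f"Worst group, day 2 mean: {day2_mean:.1f}")
print(f"Apparent 'improvement' with no treatment: {day2_mean - day1_mean:.1f}")
```

Because both scores are pure chance, the selected group drifts back toward the overall average on day 2, so any "treatment" given to them in between would appear to work.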
-
1. Regression to the mean
More examples from Wikipedia
• The hottest place in the country today is more likely to be cooler tomorrow than hotter, as compared to today.
• The best performing mutual fund over the last three years is more likely to see relative performance decline than improve over the next three years.
• The most successful Hollywood actor of this year is likely to have less gross than more gross for his or her next movie.
• The baseball player with the greatest batting average by the All-Star break is more likely to have a lower average than a higher average over the second half of the season.
https://en.wikipedia.org/wiki/Regression_toward_the_mean
-
1. Regression to the mean
• “Regression to the mean” can be a problem for observational studies, depending on which observations are included in the analysis.
• “Regression to the mean” can be a problem for experimental studies if subjects are used as their own control. In other words, if you simply compare the outcome post-treatment to pre-treatment, you will likely see “regression to the mean” and could mistake this for a treatment effect.
• The best way to avoid this problem in experimental studies is to randomize subjects to two groups: a treatment group and a control group.
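The contrast between the two designs in the bullets above can be sketched in a short simulation. Everything numeric here is an assumption made for illustration (the score scale, the noise level, the enrolment cut-off of 40, and a treatment whose true effect is exactly zero); none of it comes from the lecture.

```python
import random
import statistics

rng = random.Random(13)
TRUE_EFFECT = 0.0   # assumption: the hypothetical treatment does nothing

def measure(true_level, noise_sd=10.0):
    """One noisy measurement of a subject's stable underlying level."""
    return true_level + rng.gauss(0.0, noise_sd)

# Hypothetical subjects: a stable true level, measured with noise each time.
true_levels = [rng.gauss(50.0, 5.0) for _ in range(4000)]
baseline = [measure(level) for level in true_levels]

# Enrol only subjects whose baseline measurement looks poor.
enrolled = [i for i, score in enumerate(baseline) if score < 40.0]

# Randomize the enrolled subjects into treatment and control groups.
rng.shuffle(enrolled)
half = len(enrolled) // 2
treated, control = enrolled[:half], enrolled[half:]

follow_up_treated = [measure(true_levels[i]) + TRUE_EFFECT for i in treated]
follow_up_control = [measure(true_levels[i]) for i in control]

# Design 1: subjects as their own control (post minus pre, treated only).
pre_post_change = (statistics.mean(follow_up_treated)
                   - statistics.mean(baseline[i] for i in treated))

# Design 2: randomized comparison of treatment against control.
randomized_estimate = (statistics.mean(follow_up_treated)
                       - statistics.mean(follow_up_control))

print(f"Pre/post change in treated group: {pre_post_change:+.1f}")
print(f"Treatment minus control:          {randomized_estimate:+.1f}")
```

In this sketch the pre/post comparison shows a large apparent gain even though the treatment is inert, while the randomized comparison stays near zero because both groups regress toward the mean by the same amount.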
-
1. Regression to the mean
The measurement of blood pressure serves as a good example. If blood pressure is initially measured in a group of patients and then re-measured after a period of time, people with extreme blood pressure at Time 1 will tend to be closer to the average level at Time 2. This improvement is not due to any treatment, only due to random error. People usually seek treatment when their symptoms are particularly severe. If treatment is sought when these symptoms are at their worst, these symptoms should be less severe simply by random fluctuations and natural recovery, even when no treatment is used.
Yu and Chen (2015) https://www.frontiersin.org/articles/10.3389/fpsyg.2014.01574/full
-
1. Regression to the mean
Regression to the mean can be a problem for observational studies and experimental studies (that have no control group).
[Figure: plot with axis label day1test]
-
2. Unobserved Confounding
Most famous example:
The Nurses’ Health Study (NHS) was one of the largest and most influential observational studies in health. The NHS began in 1976 and subsequently followed more than 120,000 married female registered nurses. The NHS published results in 1991 and found that hormone therapy in post-menopausal women was associated with a substantial reduction in the development of heart disease. In 1998, the Heart and Estrogen-progestin Replacement Study (HERS) randomized 2,763 women to receive either hormone therapy or placebo. It concluded that hormone therapy increased, not decreased, the risk of heart disease.
-
2. Unobserved Confounding
[Diagram of three variables: X, Y, and Z]
X: Variable that is known and measured
Y: Outcome variable
Z: Variable that is unknown and/or unmeasured
-
2. Unobserved Confounding
What is the effect of having a larger garage on the sale price of the house?
n = 200 houses for sale
X: Size of garage (0 = no garage, 1 = 1-car garage, 2 = 2-car garage)
Y: Sale price ($)
Z: Location (Downtown vs. Suburbs)
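One way to see how location could distort the answer is to simulate the garage example. This is a sketch with invented numbers: the true garage effect ($10,000 per space), the downtown premium ($200,000), the base price, the garage-size probabilities, and the noise level are all assumptions chosen for illustration, and only n = 200 comes from the slide.

```python
import random
import statistics

rng = random.Random(2024)

GARAGE_EFFECT = 10_000      # assumed true effect of one extra garage space ($)
DOWNTOWN_PREMIUM = 200_000  # assumed price premium for a downtown location ($)

def ols_slope(x, y):
    """Least-squares slope of y on a single predictor x."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

houses = []
for k in range(200):                      # n = 200 houses, as in the example
    downtown = k < 100                    # assumed: half the houses are downtown
    # Assumed: downtown houses tend to have smaller garages.
    weights = [0.6, 0.3, 0.1] if downtown else [0.1, 0.3, 0.6]
    garage = rng.choices([0, 1, 2], weights=weights)[0]
    price = (300_000 + GARAGE_EFFECT * garage
             + (DOWNTOWN_PREMIUM if downtown else 0)
             + rng.gauss(0, 10_000))
    houses.append((downtown, garage, price))

def slope_for(rows):
    return ols_slope([g for _, g, _ in rows], [p for _, _, p in rows])

pooled = slope_for(houses)                               # Z ignored
downtown_only = slope_for([h for h in houses if h[0]])   # Z held fixed
suburbs_only = slope_for([h for h in houses if not h[0]])

print(f"Slope ignoring location: {pooled:,.0f}")
print(f"Slope within downtown:   {downtown_only:,.0f}")
print(f"Slope within suburbs:    {suburbs_only:,.0f}")
```

Under these assumptions the pooled slope comes out negative, suggesting garages lower the price, while the slope within each location is positive. If Z were unmeasured, only the misleading pooled estimate would be available.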
-
2. Unobserved Confounding
What is the effect of increasing or decreasing the police budget on the murder rate?
n = 60 cities
X: Police budget
Y: Murder rate
Z: Crime level
-
3. “Multiple comparisons”
Type 1 error = Pr(reject H0 | H0 is true)
For linear regression:
Type 1 error = Pr(conclude βj is not zero | βj = 0)
             = Pr(p-value for βj is small | βj = 0)
0.05 > Pr(p-value for βj
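To see why testing several coefficients at once matters, the chance of at least one Type 1 error can be computed directly. This sketch assumes m independent tests, each run at level 0.05, with every null hypothesis true; the independence assumption and the Bonferroni threshold shown alongside are standard textbook devices, not something stated on the slide.

```python
def familywise_error_rate(m, alpha=0.05):
    """Pr(at least one false rejection) over m independent tests,
    each at level alpha, when every null hypothesis is true."""
    return 1 - (1 - alpha) ** m

def bonferroni_threshold(m, alpha=0.05):
    """Per-test threshold that keeps the overall rate at or below alpha."""
    return alpha / m

for m in (1, 5, 10, 20):
    rate = familywise_error_rate(m)
    corrected = familywise_error_rate(m, alpha=bonferroni_threshold(m))
    print(f"{m:>2} tests: uncorrected {rate:.3f}, Bonferroni {corrected:.3f}")
```

With 10 such tests the chance of at least one spurious "significant" coefficient is already about 0.40, and with 20 it is about 0.64.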
-
Age vs. Money
[Diagram: Population → Sample, n = 9, with a table and plot of the nine (X, y) observations]
PREDICTOR variable X: Age in Years
RESPONSE variable Y: cash ($) on hand
Population parameters: β0, β1, σ2
Sample statistics (linear regression): b0 = 17.7, b1 = 0.55, s = 15.5, R2 = 0.49
Hypothesis Test: H0: β1 = 0, H1: β1 ≠ 0
-
Age vs. Money
Objective: The purpose of this observational study was to demonstrate if, and to what extent, age is associated with use of cash.
Design and Methods: We collected a random sample of individuals and for each determined their age (recorded in years) and the amount of cash (in dollars) they had on hand. Analysis of the data was done using linear regression.
Results: We obtained a random sample of n = 9 subjects. There is a statistically significant association between age and money (p-value = 0.036). For every additional year in age, an individual’s amount of money increases on average by an estimated $0.55 (95% C.I. = [$0.05, $1.05]).
Conclusions: We found that, as hypothesized, age is associated with cash use. In our sample, age accounted for about half of the variability observed in money (R2 = 0.49). We predict that a 50 year old will have $45.1 (95% P.I. = [$5.6, $84.5]), whereas a 40 year old will have $39.6 (95% P.I. = [$0.8, $78.4]).
Small Print: The analysis rests on the following assumptions:
- the observations are independently and identically distributed.
- the response variable, money, is normally distributed.
- homoscedasticity of residuals, or equal variance.
- the relationship between response and predictor variables is linear.
For parameter β1:
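The point predictions quoted in the Conclusions can be checked against the fitted line, using the reported coefficients b0 = 17.7 and b1 = 0.55. This is a small arithmetic sketch; the function name is made up for illustration, and the small gap between these values and the reported $45.1 and $39.6 is presumably because the slide shows rounded coefficients.

```python
B0 = 17.7   # reported intercept b0 ($)
B1 = 0.55   # reported slope b1 ($ per additional year of age)

def predicted_cash(age_years):
    """Point prediction from the fitted line: b0 + b1 * age."""
    return B0 + B1 * age_years

for age in (40, 50):
    print(f"Age {age}: predicted cash = ${predicted_cash(age):.2f}")

# The reported 95% C.I. for the slope should be centred on b1.
ci_low, ci_high = 0.05, 1.05
ci_midpoint = (ci_low + ci_high) / 2
print(f"Midpoint of the C.I. for the slope: {ci_midpoint:.2f}")
```

The difference between the two predictions is 10 × b1 = $5.50, which matches the gap between the reported $45.1 and $39.6.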
-
Giuli et al. (2014)
-
Haby et al. (2011)
-
Promislow et al. (2002)
-
Fallowfield et al. (2002)
-
The art of linear regression
• Categorical predictors
• Quadratic (polynomial) relationships
• Outliers
• How to fix heterogeneity
• Regression to the mean
• Simpson’s Paradox
• Unobserved Confounding