p8130: biostatistical methods i · 2017-11-29 · measures of location: median • compared to the...

P8130: Biostatistical Methods I Lecture 2: Descriptive Statistics Cody Chiuzan, PhD Department of Biostatistics Mailman School of Public Health (MSPH)

Post on 17-Mar-2020


P8130: Biostatistical Methods I
Lecture 2: Descriptive Statistics

Cody Chiuzan, PhD
Department of Biostatistics
Mailman School of Public Health (MSPH)

Lecture 1: Recap

• Intro to Biostatistics

• Types of Data

• Study Designs

Descriptive Statistics

• The collection and presentation of the data through graphical and numerical displays

• Look for patterns in the data and summarize information

• Measures of location

• Measures of dispersion

• Graphical display

Measures of Location

• Measures of location or central tendency indicate the center of the data

• Mean (average)

• Median (the 50th percentile)

• Mode

Measures of Location: Mean

Definition: the arithmetic mean represents the sum of all observations divided by the number of observations

Sample mean for a sample of n observations is given by:

$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$

Sample mean is used to estimate the population mean μ, which is typically unknown
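The sample mean formula can be sketched in a few lines of Python. The function name `sample_mean` and the blood pressure readings are made up for illustration; they are not part of the lecture.

```python
def sample_mean(xs):
    """Arithmetic mean: the sum of the observations divided by their count n."""
    n = len(xs)
    if n == 0:
        raise ValueError("the mean of an empty sample is undefined")
    return sum(xs) / n


# Hypothetical systolic blood pressure readings (mmHg), invented for this example.
readings = [118, 122, 130, 125, 115]
print(sample_mean(readings))  # 122.0
```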

Measures of Location: Mean

• The most commonly used measure of location

• Overly sensitive to outliers (unusual observations), thus not recommended if the data are skewed

• Not appropriate for nominal or categorical variables

Measures of Location: Median

Definition: The sample median is computed as:

1. If n is odd, median is computed as the $\left(\frac{n+1}{2}\right)$th largest item in the sample

2. If n is even, median is computed as the average between the $\left(\frac{n}{2}\right)$th and $\left(\frac{n}{2}+1\right)$th largest items

Example:

Given n = 7 (odd) total sample observations, median is the $\frac{7+1}{2} = 4$th largest item

Given n = 10 (even) total sample observations, median is the average of the $\frac{10}{2} = 5$th and $\frac{10}{2}+1 = 6$th largest items
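The two-case definition above translates directly into code. This is a minimal sketch; `sample_median` and the data are hypothetical names and values chosen to match the n = 7 and n = 10 examples.

```python
def sample_median(xs):
    """Median by the odd/even rule: order the data, then pick the middle item(s)."""
    ordered = sorted(xs)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty sample is undefined")
    mid = n // 2
    if n % 2 == 1:
        # n odd: the (n+1)/2-th item in 1-based counting is index n//2 in 0-based counting.
        return ordered[mid]
    # n even: average the (n/2)-th and (n/2 + 1)-th items (0-based indices mid-1 and mid).
    return (ordered[mid - 1] + ordered[mid]) / 2


seven = [9, 3, 7, 1, 5, 11, 13]   # n = 7, so the 4th ordered item is the median
ten = list(range(1, 11))          # n = 10, so average the 5th and 6th ordered items
print(sample_median(seven))  # 7
print(sample_median(ten))    # 5.5
```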

Measures of Location: Median

• Compared to the mean, the median is not affected by every value in the data set, so it is far less sensitive to outliers

• The median is defined as the middle value or the 50th percentile

• This means that at least half of the data are less than or equal to it, and at least half are greater than or equal to it

• Median calculation starts by first ordering the data (increasing order)

• Appropriate measure for ordinal data
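A small numerical comparison makes the contrast with the mean concrete. The values below are invented; the sketch uses Python's standard `statistics` module.

```python
from statistics import mean, median

# Hypothetical lengths of hospital stay (days); the second list replaces one
# value with an extreme observation.
typical = [2, 3, 4, 5, 6]
with_outlier = [2, 3, 4, 5, 60]

print(mean(typical), median(typical))            # 4 4
print(mean(with_outlier), median(with_outlier))  # 14.8 4
```

Changing a single observation moves the mean from 4 to 14.8, while the median stays at 4.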

Other Measures of Location

Percentiles: median is the 50th percentile

• In general: the kth percentile is a value such that at most k% of the data are smaller than it and at most (100-k)% are larger

• Deciles: 10th, 20th, 30th, …

• Quartiles: 25th (Q1), 50th, 75th (Q3)

• Question: what does it mean if your GRE score is in the 90th percentile?
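One way to compute a percentile is sketched below. It assumes a common textbook convention that reduces to the median rule when k = 50: if nk/100 is not a whole number, take the next ordered item up; if it is, average that item with the following one. Statistical software uses several different interpolation rules, so other tools may return slightly different values. The function name `percentile` and the scores are hypothetical.

```python
def percentile(xs, k):
    """k-th percentile (integer k, 1..99) under the convention described above."""
    if not (isinstance(k, int) and 0 < k < 100):
        raise ValueError("k must be an integer between 1 and 99")
    ordered = sorted(xs)
    n = len(ordered)
    if n == 0:
        raise ValueError("percentiles of an empty sample are undefined")
    # Integer arithmetic avoids floating-point trouble when testing for a whole number.
    q, r = divmod(n * k, 100)
    if r != 0:
        # nk/100 is not whole: take the (q+1)-th ordered item (0-based index q).
        return ordered[q]
    # nk/100 is whole: average the q-th and (q+1)-th ordered items.
    return (ordered[q - 1] + ordered[q]) / 2


scores = list(range(10, 101, 10))   # hypothetical scores: 10, 20, ..., 100
print(percentile(scores, 25))  # 30
print(percentile(scores, 50))  # 55.0
print(percentile(scores, 75))  # 80
print(percentile(scores, 90))  # 95.0
```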

Measures of Location: Mode

Definition: the most frequently occurring value in the data

• You can have multiple modes or none (really?)

• Problematic if there is a large number of possible values with infrequent occurrence
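The "multiple modes or none" point can be explored with `statistics.multimode` from the Python standard library (Python 3.8+). The data are invented. When every value occurs exactly once, `multimode` returns all of them; whether to call that "no mode" is a matter of convention.

```python
from statistics import multimode

# Hypothetical blood types: "A" and "O" are tied for most frequent.
blood_types = ["A", "O", "B", "O", "A", "AB"]
print(multimode(blood_types))  # ['A', 'O']

# Hypothetical continuous measurements with no repeated value: every value is
# tied, which illustrates why the mode is problematic for such data.
weights = [61.2, 70.5, 58.9, 82.3]
print(multimode(weights))  # [61.2, 70.5, 58.9, 82.3]
```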

Measures of Dispersion

Describe the spread of the data:

• Range

• Inter-quartile range (IQR)

• Variance / Standard deviation

• Coefficient of variation (CV)

Measures of Dispersion

Range: Max – Min

Inter-quartile range: IQR = 75th (Q3) – 25th (Q1)

Since the range only depends on the minimum and maximum values, it can be influenced by the extremes

Solution? Use the IQR
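The effect of an extreme value on the range versus the IQR can be sketched as follows. The helper names and data are hypothetical, and the quartiles come from `statistics.quantiles` with `method="inclusive"`, which is only one of several quartile conventions.

```python
from statistics import quantiles


def data_range(xs):
    """Range: maximum minus minimum."""
    return max(xs) - min(xs)


def iqr(xs):
    """Inter-quartile range: Q3 minus Q1 (inclusive quantile method)."""
    q1, _, q3 = quantiles(xs, n=4, method="inclusive")
    return q3 - q1


typical = [1, 2, 3, 4, 5, 6, 7, 8, 9]
with_extreme = [1, 2, 3, 4, 5, 6, 7, 8, 90]   # largest value replaced by an extreme

print(data_range(typical), iqr(typical))            # 8 4.0
print(data_range(with_extreme), iqr(with_extreme))  # 89 4.0
```

The range jumps from 8 to 89 while the IQR is unchanged.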

Measures of Dispersion

Population Variance is the average squared deviation from the mean:

$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$

Population Standard Deviation is just the square root of the variance:

$\sigma = \sqrt{\sigma^2}$

Values often unknown and then we refer back to sample…

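Measures of Dispersion

Sample Variance is (approximately) the average squared deviation from the sample mean, with n - 1 rather than n in the denominator:

$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$

Sample Standard Deviation is just the square root of the sample variance:

$s = \sqrt{s^2}$

Lots of changes in notation and also formula!!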
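The difference between the two formulas is only the divisor, N versus n - 1. A minimal sketch with hypothetical function names and made-up data:

```python
from math import sqrt


def population_variance(xs):
    """Average squared deviation from the mean, dividing by N."""
    n = len(xs)
    mu = sum(xs) / n
    return sum((x - mu) ** 2 for x in xs) / n


def sample_variance(xs):
    """Sum of squared deviations from the sample mean, dividing by n - 1."""
    n = len(xs)
    if n < 2:
        raise ValueError("the sample variance needs at least two observations")
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)


data = [2, 4, 4, 4, 5, 5, 7, 9]   # mean 5, sum of squared deviations 32

print(population_variance(data), sqrt(population_variance(data)))  # 4.0 2.0
print(round(sample_variance(data), 4))                             # 4.5714
```

The same eight numbers give 32/8 = 4 when treated as a whole population and 32/7 ≈ 4.57 when treated as a sample.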

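Measures of Dispersion

Mean and standard deviation are the most used measures of location and spread. Why? It’s all about the…

Property: linear transformations do affect these measures, in a predictable way

Let $Y = cX + b$ be a linear transformation of a variable X

Mean: $\bar{y} = c\bar{x} + b$

Standard Deviation: $s_Y = |c|\,s_X$ (the shift b does not change the spread)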
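The property can be checked numerically. The sketch below uses a Celsius-to-Fahrenheit conversion (c = 1.8, b = 32) on invented temperatures, plus a negative multiplier to show why the absolute value of c appears in the standard deviation rule.

```python
from math import isclose
from statistics import mean, stdev

celsius = [36.5, 37.0, 37.5, 38.0]   # hypothetical body temperatures
c, b = 1.8, 32.0
fahrenheit = [c * x + b for x in celsius]

# Mean of Y equals c * mean(X) + b.
mean_matches = isclose(mean(fahrenheit), c * mean(celsius) + b)
# Standard deviation of Y equals |c| * sd(X); the shift b drops out.
sd_matches = isclose(stdev(fahrenheit), abs(c) * stdev(celsius))

# With a negative multiplier the spread is still scaled by |c| = 2.
flipped = [-2 * x + 1 for x in celsius]
flipped_sd_matches = isclose(stdev(flipped), 2 * stdev(celsius))

print(mean_matches, sd_matches, flipped_sd_matches)  # True True True
```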

Measures of Dispersion

Coefficient of Variation (CV) is a measure that relates the standard deviation to the mean.

• Sometimes the variance changes with the mean

• Population: $CV = \frac{\sigma}{\mu} \times 100\%$

• Sample: $CV = \frac{s}{\bar{x}} \times 100\%$

• CV is unitless and can be interpreted as the variability relative to the average
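A short sketch of the sample CV, using invented heights recorded in two different units to illustrate that the result does not depend on the unit of measurement. `sample_cv` is a hypothetical helper name.

```python
from statistics import mean, stdev


def sample_cv(xs):
    """Sample coefficient of variation: s / x-bar, expressed as a percentage."""
    return stdev(xs) / mean(xs) * 100


heights_cm = [160, 170, 180]
heights_m = [h / 100 for h in heights_cm]   # same data in metres

print(round(sample_cv(heights_cm), 2))  # 5.88
print(round(sample_cv(heights_m), 2))   # 5.88
```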

Graphical Display

• A picture is worth a thousand words (sometimes)

• Bar graphs

• Histograms

• Box-plots

• Scatterplots (later in linear regression)

Bar Graph

• Data are divided into groups and frequencies are determined for each group

• Rectangles are constructed with the base of constant width and heights proportional to the frequencies
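The frequency step behind a bar graph can be sketched with `collections.Counter`; the text bars below stand in for the rectangles. The smoking-status categories and responses are made up for illustration.

```python
from collections import Counter

# Hypothetical survey responses for a categorical variable.
responses = ["never", "former", "current", "never", "never", "former"]

# Step 1: divide the data into groups and count the frequency of each group.
freq = Counter(responses)

# Step 2: draw one bar per group, with length proportional to its frequency.
for group, count in freq.most_common():
    print(f"{group:<8} {'#' * count} ({count})")
```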

Histogram

• Numerical values are grouped into measurement classes, defined by equal-length intervals along the numerical scale

• Each value belongs to only one class

• Usually 5-12 classes

• Like bar graph, this plot has frequencies on the vertical axis

• If the mean > median: right skew

• If the mean < median: left skew
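The grouping into equal-length classes and the mean-versus-median skew check can be sketched as follows. `histogram_counts` is a hypothetical helper, the data are invented, and placing the maximum in the last class is one possible boundary convention.

```python
from statistics import mean, median


def histogram_counts(xs, num_classes):
    """Count values in equal-width classes spanning min(xs) to max(xs)."""
    lo, hi = min(xs), max(xs)
    if hi == lo:
        raise ValueError("all values are identical; no class width can be formed")
    width = (hi - lo) / num_classes
    counts = [0] * num_classes
    for x in xs:
        # Each value falls in exactly one class; the maximum goes in the last class.
        idx = min(int((x - lo) / width), num_classes - 1)
        counts[idx] += 1
    return counts


data = [0, 1, 1, 2, 2, 2, 3, 3, 4, 10]
counts = histogram_counts(data, 5)
print(counts)  # [3, 5, 1, 0, 1]

for i, count in enumerate(counts):
    print(f"[{2 * i:>2}, {2 * i + 2:>2}) {'#' * count}")   # class width is 2 for this data

print(mean(data), median(data))  # 2.8 2.0
```

Here the mean (2.8) exceeds the median (2.0), matching the long right tail in the counts.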

Box-plot

• Extends from the Q1 (25th) to the Q3 (75th) quartile – the box

• The ‘whiskers’ extend from the smallest to the largest values that are not outliers

• If one of the whiskers is long, it indicates skewness in that direction

• If a data value is less than Q1 – 1.5(IQR) or greater than Q3 + 1.5(IQR), then it is considered an outlier and given a separate mark on the boxplot
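The 1.5(IQR) outlier rule can be sketched in code. `boxplot_outliers` is a hypothetical helper and the data are invented; the quartiles use `statistics.quantiles` with `method="inclusive"`, and other quartile conventions may shift the fences slightly.

```python
from statistics import quantiles


def boxplot_outliers(xs):
    """Return (outliers, (lower_fence, upper_fence)) using the 1.5 * IQR rule."""
    q1, _, q3 = quantiles(xs, n=4, method="inclusive")
    spread = q3 - q1                      # the IQR
    lower = q1 - 1.5 * spread
    upper = q3 + 1.5 * spread
    outliers = [x for x in xs if x < lower or x > upper]
    return outliers, (lower, upper)


values = [1, 2, 3, 4, 5, 6, 7, 8, 30]
outliers, fences = boxplot_outliers(values)
print(fences)    # (-3.0, 13.0)
print(outliers)  # [30]
```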

Readings

Rosner, Fundamentals of Biostatistics, Chapter 2

• Sections: 2.2 – 2.6

• Sections: 2.9 – 2.10