Quantifying central tendency and variability

Post on 06-Feb-2021


  • Quantifying central tendency and variability

  • Outline for today

    Big Data Baseball – chapter 3

    Better know a player: Bo Jackson

    Review:

    •  Plotting categorical and quantitative data

    More descriptive statistics:

    •  Measures of central tendency for quantitative data

    •  Measures of variability

    Worksheet 2: due midnight Wednesday Feb 15th

  • DataFest 2017

    •  March 31st to April 2nd

    Amherst is having pre-DataFest workshops

    •  First one is at 7:30 on February 15th

    If you are interested in participating in DataFest, let me know

  • Big Data Baseball Chapter 3

    •  Thoughts?

  • Better know a player

    Bo Jackson

  • Review

  • Categorical and quantitative data

    Descriptive statistics describes the sample of data you have

    Categorical variables: fall into distinct categories

    E.g., team (Red Sox, Yankees, Mets, etc.)

    Quantitative variables: numerical data

    E.g., number of home runs

  • Categorical variables: proportion

    The proportion of a category is found by:

    Proportion of category = (Number in that category) / (Total number)

    Example: proportion of hits that are home runs

    160 hits total

    30/160 = .1875

                  1B     2B     3B     HR
    Count         90     38     2      30
    Proportion    0.56   0.24   0.01   0.19

    > hit.types <- c(90, 38, 2, 30)   # counts of 1B, 2B, 3B, HR from the table
    > hit.types / sum(hit.types)

  • R: barplot(x)

    R: pie(x)

    Plotting categorical data
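As a rough sketch of how the proportion calculation and the two plotting calls fit together, the hit counts from the proportion slide can be stored as a named vector. The names `hit.types` (used on the slide) and `hit.props` (introduced here for illustration) and the axis labels are assumptions, not part of the original slides.

```r
# Counts of each hit type, taken from the table on the proportion slide
hit.types <- c("1B" = 90, "2B" = 38, "3B" = 2, "HR" = 30)

# Proportion of each category = number in that category / total number
hit.props <- hit.types / sum(hit.types)
round(hit.props, 2)

# Bar plot of the proportions and pie chart of the counts
barplot(hit.props, xlab = "Hit type", ylab = "Proportion")
pie(hit.types)
```

Naming the vector elements lets `barplot()` and `pie()` label each category automatically.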

  • Describing quantitative variables

    R: stem(x)

    R: hist(x)
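To make the two calls concrete, here is a minimal sketch that applies them to the 11 David Ortiz home run totals from the in-class worksheet in these slides. The variable names `ortiz.hr` and `h` and the plot labels are illustrative choices.

```r
# Home runs per season for David Ortiz (values from the in-class worksheet)
ortiz.hr <- c(54, 35, 23, 28, 32, 29, 23, 30, 35, 37, 38)

# Stem-and-leaf plot: prints a text display of the values to the console
stem(ortiz.hr)

# Histogram: each bar counts the seasons falling in a range of home runs
h <- hist(ortiz.hr, xlab = "HR", main = "Histogram of HR per season")

# The object returned by hist() stores the count in each bin
h$counts
```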

  • hist(player.data.2013$HR, n = 30, xlab = "HR", main = "Histogram of HRs for 2013 players with over 300 PA")

    Observations about the distribution?

  • Dotplot for individuals’ HR in 2013

    When we have discrete data, we can also use dotplots to get a sense of how the data is distributed

  • Dotplot for individuals’ HR in 2013

    One way to get a sense of the shape of a distribution is to use a dotplot

    R: mosaic::dotPlot(x)

    Miguel Cabrera

    Chris Davis
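If the mosaic package is not installed, a similar picture can be sketched in base R. The example below swaps in `stripchart()` with stacked points as a stand-in for `mosaic::dotPlot()`; it is not the function used on the slide. The data are the David Ortiz home run totals from the worksheet in these slides, and the names `ortiz.hr` and `hr.counts` are illustrative.

```r
# Home runs per season for David Ortiz (values from the in-class worksheet)
ortiz.hr <- c(54, 35, 23, 28, 32, 29, 23, 30, 35, 37, 38)

# Count how many times each distinct value occurs
hr.counts <- table(ortiz.hr)
hr.counts

# Base R stand-in for a dot plot: one dot per season, repeats stacked
stripchart(ortiz.hr, method = "stack", pch = 16, xlab = "HR")
```

Repeated values (23 and 35 each occur twice) show up as stacks of two dots, and the 54 home run season sits well to the right of the rest.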

  • Common shapes for distributions

  • Statistics measuring the center of a distribution

    Graphs are useful for visualizing data to get a sense of what the data look like

    We can also summarize data numerically

    A numerical summary (function) of a sample is called a statistic

    Two important statistics that can be used to describe the center of the data are the mean and the median.

  • The mean

    Mean = (Sum of all data values) / (Number of data values)

    $\text{Mean} = \dfrac{x_1 + x_2 + x_3 + \dots + x_n}{n} = \dfrac{\sum x_i}{n}$

    R: mean(x)

    If your data has missing values use:

    mean(x, na.rm = TRUE)
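A short sketch of the formula and of the `na.rm` argument, using the David Ortiz home run totals from the worksheet in these slides. The appended `NA` is a hypothetical missing season added only to show the behavior, and the variable names are illustrative.

```r
# Home runs per season for David Ortiz (values from the in-class worksheet)
ortiz.hr <- c(54, 35, 23, 28, 32, 29, 23, 30, 35, 37, 38)

# Mean by the formula: sum of all data values / number of data values
sum(ortiz.hr) / length(ortiz.hr)   # 364 / 11

# Same result with the built-in function
mean(ortiz.hr)

# Hypothetical: one season with a missing value
hr.with.na <- c(ortiz.hr, NA)
mean(hr.with.na)                   # NA, because one value is missing
mean(hr.with.na, na.rm = TRUE)     # drops the NA before averaging
```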

  • Mean number of games played (G)

    Can you calculate the mean number of games played for players who had 300 plate appearances in 2013?

    > players.2013 <- …
    > mean(players.2013$G)

    Do you think the mean number of games played would be higher if we calculated it from only players who had 500 plate appearances?

  • Sample vs. population mean

    The mean for a sample is denoted x̄ (pronounced “x-bar”)

    The mean for a population is denoted μ, which is the Greek letter “mu”

  • Give the proper notation: μ vs. x̄?

    We randomly select 50 baseball players and take their mean height?

    We look at all professional baseball players and take their mean height?

  • The median

    The median of a data set of size n is

    •  If n is odd: the middle value of the sorted data

    •  If n is even: the average of the middle two values of the sorted data

    The median splits the data in half

    R: median(x)
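The odd and even cases can be checked by hand against `median()`. This sketch uses the 11 David Ortiz home run totals from the worksheet in these slides for the odd case, and (as an assumption made only for illustration) the first 10 of those values for the even case. All variable names are illustrative.

```r
# Odd n: 11 seasons of home runs (values from the in-class worksheet)
ortiz.hr <- c(54, 35, 23, 28, 32, 29, 23, 30, 35, 37, 38)
sorted.hr <- sort(ortiz.hr)
n <- length(sorted.hr)

# The middle value of the sorted data is at position (n + 1) / 2
median.odd <- sorted.hr[(n + 1) / 2]
median.odd            # 32
median(ortiz.hr)      # same value

# Even n: keep only the first 10 seasons (illustrative subset)
even.hr <- ortiz.hr[1:10]
sorted.even <- sort(even.hr)
m <- length(sorted.even)

# Average of the two middle values, at positions m / 2 and m / 2 + 1
median.even <- (sorted.even[m / 2] + sorted.even[m / 2 + 1]) / 2
median.even           # 31
median(even.hr)       # same value
```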

  • Resistance

    We say that a statistic is resistant if it is relatively unaffected by extreme values (outliers).

    The median is resistant, whereas the mean is not
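One way to see resistance is to add a single extreme value to a small data set and compare how far each statistic moves. The salaries below (in millions of dollars) are hypothetical numbers invented for this sketch, not real data, and the variable names are illustrative.

```r
# Hypothetical salaries in millions of dollars (made up for illustration)
salaries <- c(0.5, 0.6, 0.8, 0.9, 1.2)
mean(salaries)      # 0.8
median(salaries)    # 0.8

# Add one hypothetical star player paid 20 million
with.star <- c(salaries, 20)
mean(with.star)     # jumps to 4
median(with.star)   # barely moves, to 0.85
```

The single outlier multiplies the mean by five while the median shifts by only 0.05, which is the sense in which the median is resistant and the mean is not.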

  • Football player salary examples

    Some NFL football players are paid a lot more than others (star quarterbacks can be paid more than $20 million)

    Mean NFL salary = $1.87 million

    Median NFL salary = $838,000

    Mean and median salary for all US workers?

    Distribution of salaries for US workers?

  • Which is the mean A or B?

    A B

  • Summary statistics quantifying the spread of quantitative variables

  • Standard deviation

    R: sd(x)

  • In-class worksheet: computing the standard deviation

    Number of home runs David Ortiz had in the last 11 seasons:

    54 35 23 28 32
    29 23 30 35 37
    38

    Values   Deviations   Squared Deviations
    54       20.91        437.19
    35       1.91         3.64
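The worksheet steps can be sketched in R. The code below assumes the usual sample standard deviation, $s = \sqrt{\dfrac{\sum (x_i - \bar{x})^2}{n - 1}}$, which is what `sd()` computes. The formula itself did not survive in this transcript, so treat it as an assumption about what the slide showed. Variable names are illustrative.

```r
# Home runs per season for David Ortiz (values from the worksheet)
ortiz.hr <- c(54, 35, 23, 28, 32, 29, 23, 30, 35, 37, 38)
n <- length(ortiz.hr)

# Step 1: deviation of each value from the sample mean
deviations <- ortiz.hr - mean(ortiz.hr)

# Step 2: square each deviation
squared.devs <- deviations^2

# First two rows of the worksheet table, rounded to two decimals
round(cbind(Values = ortiz.hr,
            Deviations = deviations,
            SquaredDeviations = squared.devs)[1:2, ], 2)

# Step 3: sum the squared deviations, divide by n - 1, take the square root
s.by.hand <- sqrt(sum(squared.devs) / (n - 1))
s.by.hand

# The built-in function gives the same result
sd(ortiz.hr)
```

The first two rows reproduce the values already filled in on the worksheet (20.91 and 437.19 for 54, then 1.91 and 3.64 for 35).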