Basic Statistics Review

Upload: maitraya

Post on 14-Apr-2018


7/29/2019 Basic statistics Review

Review: Some Basic Statistical Concepts


Random Variable

A random variable is a variable whose value is unknown until it is observed; its value results from an experiment. The term "random variable" implies the existence of some known or unknown probability distribution defined over the set of all possible values of that variable. In contrast, an arbitrary variable does not have a probability distribution associated with its values.


In a controlled experiment, the values of the explanatory variables are chosen with great care in accordance with an appropriate experimental design. In an uncontrolled experiment, the values of the explanatory variables consist of nonexperimental observations over which the analyst has no control.


Discrete Random Variable

A discrete random variable can take only a countable number of values, which can be counted using the positive integers.

Example: prize money from the following lottery is a discrete random variable:
first prize: Rs.1,000
second prize: Rs.50
third prize: Rs.5.75
since it has only four (a finite number; count: 1, 2, 3, 4) possible outcomes:
Rs.0.00, Rs.5.75, Rs.50.00, Rs.1,000.00


Continuous Random Variable

A continuous random variable can take any real value (not just whole numbers) in at least one interval on the real line.

Examples: gross national product (GNP), money supply, interest rates, price of eggs, household income, expenditure on clothing.


Dummy Variable

A discrete random variable that is restricted to two possible values (usually 0 and 1) is called a dummy variable (also a binary or indicator variable). Dummy variables account for qualitative differences:
gender (0 = male, 1 = female),
race (0 = white, 1 = nonwhite),
citizenship (0 = Indian, 1 = not Indian),
income class (0 = poor, 1 = rich).


A list of all possible values taken by a discrete random variable, along with their chances of occurring, is called a probability function or probability density function (pdf).

die         x    f(x)
one dot     1    1/6
two dots    2    1/6
three dots  3    1/6
four dots   4    1/6
five dots   5    1/6
six dots    6    1/6
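The die table above can be reproduced exactly with rational arithmetic; a minimal sketch, confirming each face has probability 1/6 and the probabilities sum to one:

```python
from fractions import Fraction

# pdf of a fair six-sided die: each of the six faces has probability 1/6
f = {x: Fraction(1, 6) for x in range(1, 7)}

assert all(p == Fraction(1, 6) for p in f.values())
assert sum(f.values()) == 1  # probabilities over all outcomes sum to one
```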


A discrete random variable X has pdf f(x), which is the probability that X takes on the value x:

f(x) = P(X = x_i)

Therefore, 0 ≤ f(x) ≤ 1.

If X takes on the n values x_1, x_2, ..., x_n, then f(x_1) + f(x_2) + ... + f(x_n) = 1. For example, in a throw of one die:

(1/6) + (1/6) + (1/6) + (1/6) + (1/6) + (1/6) = 1


In a throw of two dice, the discrete random variable X is the number shown (the sum of the dots), with possible outcomes x = 2, 3, ..., 12:

x    =   2     3     4     5     6     7     8     9    10    11    12
f(x) =  1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36

[Figure: the pdf f(x) shown as a bar at each x from 2 to 12, with probability represented by height.]
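The two-dice pdf above can be derived by enumerating the 36 equally likely outcomes; a minimal sketch:

```python
from fractions import Fraction
from collections import Counter

# enumerate the 36 equally likely (die1, die2) outcomes and count each sum
counts = Counter(d1 + d2 for d1 in range(1, 7) for d2 in range(1, 7))
f = {x: Fraction(counts[x], 36) for x in range(2, 13)}

assert f[2] == Fraction(1, 36)
assert f[7] == Fraction(6, 36)  # six of the 36 outcomes sum to 7
assert sum(f.values()) == 1
```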


A continuous random variable uses the area under a curve, rather than the height f(x), to represent probability.

[Figure: pdf f(x) of per capita income, X, in the United States, with Rs.34,000 and Rs.55,000 marked on the X axis; the red area is 0.1324 and the green area is 0.8676.]


Since a continuous random variable has an uncountably infinite number of values, the probability of any one value occurring is zero:

P[X = a] = P[a ≤ X ≤ a] = 0

Probability is represented by area, and height alone has no area; an interval for X is needed to get an area under the curve.


The area under a curve is the integral of the equation that generates the curve:

P[a < X < b] = ∫_a^b f(x) dx

For continuous random variables it is the integral of f(x), and not f(x) itself, that defines the area and, therefore, the probability.
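The integral can be approximated numerically; a minimal sketch using the trapezoid rule on a hypothetical triangular density f(x) = 2x on [0, 1] (not from the slides), for which the exact answer is b² − a²:

```python
def prob_interval(f, a, b, n=10_000):
    """Approximate P[a < X < b] = integral of f from a to b (trapezoid rule)."""
    h = (b - a) / n
    area = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        area += f(a + i * h)
    return area * h

# hypothetical density f(x) = 2x on [0, 1]; exact probability is b^2 - a^2
f = lambda x: 2 * x
p = prob_interval(f, 0.2, 0.5)
assert abs(p - (0.5**2 - 0.2**2)) < 1e-9  # 0.21
```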


Rules of Summation

Rule 1: Σ_{i=1}^{n} x_i = x_1 + x_2 + ... + x_n

Rule 2: Σ_{i=1}^{n} k x_i = k Σ_{i=1}^{n} x_i

Rule 3: Σ_{i=1}^{n} (x_i + y_i) = Σ_{i=1}^{n} x_i + Σ_{i=1}^{n} y_i

Note that summation is a linear operator, which means it operates term by term.


Rules of Summation (continued)

Rule 4: Σ_{i=1}^{n} (a x_i + b y_i) = a Σ_{i=1}^{n} x_i + b Σ_{i=1}^{n} y_i

Rule 5: x̄ = (1/n) Σ_{i=1}^{n} x_i = (x_1 + x_2 + ... + x_n) / n

The definition of x̄ given in Rule 5 implies the following important fact:

Σ_{i=1}^{n} (x_i - x̄) = 0


Rules of Summation (continued)

Rule 6: Σ_{i=1}^{n} f(x_i) = f(x_1) + f(x_2) + ... + f(x_n)

Notation: Σ_x f(x_i) = Σ_i f(x_i) = Σ_{i=1}^{n} f(x_i)

Rule 7: Σ_{i=1}^{n} Σ_{j=1}^{m} f(x_i, y_j) = Σ_{i=1}^{n} [ f(x_i, y_1) + f(x_i, y_2) + ... + f(x_i, y_m) ]

The order of summation does not matter:

Σ_{i=1}^{n} Σ_{j=1}^{m} f(x_i, y_j) = Σ_{j=1}^{m} Σ_{i=1}^{n} f(x_i, y_j)


The Mean of a Random Variable

The mean or arithmetic average of a random variable is its mathematical expectation or expected value, E(X).


Expected Value

There are two entirely different, but mathematically equivalent, ways of determining the expected value:

1. Empirically: the expected value of a random variable, X, is the average value of the random variable in an infinite number of repetitions of the experiment. In other words, draw an infinite number of samples and average the values of X that you get.


Expected Value (continued)

2. Analytically: the expected value of a discrete random variable, X, is determined by weighting all the possible values of X by the corresponding probability density function values, f(x), and summing them up. In other words:

E(X) = x_1 f(x_1) + x_2 f(x_2) + ... + x_n f(x_n)


Empirical vs. Analytical

In the empirical case, as the sample goes to infinity, the values of X occur with a frequency equal to the corresponding f(x) in the analytical expression. As sample size goes to infinity, the empirical and analytical methods will produce the same value.


Empirical (sample) mean:

x̄ = (1/n) Σ_{i=1}^{n} x_i

where n is the number of sample observations.

Analytical mean:

E(X) = Σ_{i=1}^{n} x_i f(x_i)

where n is the number of possible values of x_i. Notice how the meaning of n changes.
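Both means can be computed for a fair die; a minimal sketch, where the empirical mean of many simulated rolls approaches the analytical mean of 3.5:

```python
import random

# analytical mean: E(X) = sum of x * f(x) over the six possible values
faces = range(1, 7)
analytical = sum(x * (1 / 6) for x in faces)
assert abs(analytical - 3.5) < 1e-12

# empirical mean: average of many simulated rolls (n = number of observations)
random.seed(0)
rolls = [random.randint(1, 6) for _ in range(100_000)]
empirical = sum(rolls) / len(rolls)
assert abs(empirical - analytical) < 0.05  # the two agree as n grows
```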


The expected value of X:

E(X) = Σ_{i=1}^{n} x_i f(x_i)

The expected value of X-squared:

E(X²) = Σ_{i=1}^{n} x_i² f(x_i)

The expected value of X-cubed:

E(X³) = Σ_{i=1}^{n} x_i³ f(x_i)

It is important to notice that f(x_i) does not change!


E(X) = 0(.1) + 1(.3) + 2(.3) + 3(.2) + 4(.1)
     = 0 + .3 + .6 + .6 + .4
     = 1.9

E(X²) = 0²(.1) + 1²(.3) + 2²(.3) + 3²(.2) + 4²(.1)
      = 0 + .3 + 1.2 + 1.8 + 1.6
      = 4.9

E(X³) = 0³(.1) + 1³(.3) + 2³(.3) + 3³(.2) + 4³(.1)
      = 0 + .3 + 2.4 + 5.4 + 6.4
      = 14.5
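The three expectations above can be verified in a few lines; a minimal sketch using the same pdf:

```python
# pdf from the worked example: P(X=0)=.1, P(X=1)=.3, P(X=2)=.3, P(X=3)=.2, P(X=4)=.1
pdf = {0: .1, 1: .3, 2: .3, 3: .2, 4: .1}

def expect(g, pdf):
    """E[g(X)] = sum of g(x) * f(x); note that f(x) never changes."""
    return sum(g(x) * p for x, p in pdf.items())

EX  = expect(lambda x: x,    pdf)
EX2 = expect(lambda x: x**2, pdf)
EX3 = expect(lambda x: x**3, pdf)
assert round(EX, 10)  == 1.9
assert round(EX2, 10) == 4.9
assert round(EX3, 10) == 14.5
```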


The expected value of a function of X:

E[g(X)] = Σ_{i=1}^{n} g(x_i) f(x_i)

If g(X) = g_1(X) + g_2(X), then

E[g(X)] = Σ_{i=1}^{n} [g_1(x_i) + g_2(x_i)] f(x_i)
        = Σ_{i=1}^{n} g_1(x_i) f(x_i) + Σ_{i=1}^{n} g_2(x_i) f(x_i)

E[g(X)] = E[g_1(X)] + E[g_2(X)]


Adding and Subtracting Random Variables

E(X + Y) = E(X) + E(Y)
E(X - Y) = E(X) - E(Y)


Adding a constant to a variable will add that constant to its expected value:

E(X + a) = E(X) + a

Multiplying by a constant will multiply its expected value by that constant:

E(bX) = b E(X)


Variance

var(X) = average of the squared deviations around the mean of X
       = expected value of the squared deviations around the expected value of X:

var(X) = E[(X - E(X))²]


Variance of a discrete random variable X:

var(X) = Σ_{i=1}^{n} (x_i - EX)² f(x_i)

The standard deviation is the square root of the variance.


Calculate the variance for a discrete random variable X:

x_i   f(x_i)   x_i - EX          (x_i - EX)² f(x_i)
2      .1      2 - 4.3 = -2.3    5.29(.1) = .529
3      .3      3 - 4.3 = -1.3    1.69(.3) = .507
4      .1      4 - 4.3 =  -.3     .09(.1) = .009
5      .2      5 - 4.3 =   .7     .49(.2) = .098
6      .3      6 - 4.3 =  1.7    2.89(.3) = .867

Σ_{i=1}^{n} x_i f(x_i) = .2 + .9 + .4 + 1.0 + 1.8 = 4.3

Σ_{i=1}^{n} (x_i - EX)² f(x_i) = .529 + .507 + .009 + .098 + .867 = 2.01
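The variance calculation above can be checked directly; a minimal sketch with the same pdf:

```python
# pdf from the worked variance example
pdf = {2: .1, 3: .3, 4: .1, 5: .2, 6: .3}

EX = sum(x * p for x, p in pdf.items())
var = sum((x - EX) ** 2 * p for x, p in pdf.items())
sd = var ** 0.5  # standard deviation is the square root of the variance

assert round(EX, 10) == 4.3
assert round(var, 10) == 2.01
```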


If Z = a + cX, then

var(Z) = var(a + cX) = E[((a + cX) - E(a + cX))²] = c² var(X)

var(a + cX) = c² var(X)


Covariance

The covariance between two random variables, X and Y, measures the linear association between them:

cov(X,Y) = E[(X - EX)(Y - EY)]

Note that variance is a special case of covariance:

cov(X,X) = var(X) = E[(X - EX)²]


cov(X,Y) = E[(X - EX)(Y - EY)]
         = E[XY - X EY - Y EX + EX EY]
         = E(XY) - EX EY - EY EX + EX EY
         = E(XY) - 2 EX EY + EX EY
         = E(XY) - EX EY

cov(X,Y) = E(XY) - EX EY


Covariance example:

           Y = 1   Y = 2
X = 0       .45     .15    .60
X = 1       .05     .35    .40
            .50     .50

EX = 0(.60) + 1(.40) = .40
EY = 1(.50) + 2(.50) = 1.50
E(XY) = (0)(1)(.45) + (0)(2)(.15) + (1)(1)(.05) + (1)(2)(.35) = .75
EX EY = (.40)(1.50) = .60

cov(X,Y) = E(XY) - EX EY = .75 - .60 = .15
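The covariance example can be reproduced by summing over the joint pdf; a minimal sketch:

```python
# joint pdf from the table: keys are (x, y) pairs
joint = {(0, 1): .45, (0, 2): .15, (1, 1): .05, (1, 2): .35}

EX  = sum(x * p for (x, y), p in joint.items())
EY  = sum(y * p for (x, y), p in joint.items())
EXY = sum(x * y * p for (x, y), p in joint.items())
cov = EXY - EX * EY  # cov(X,Y) = E(XY) - EX*EY

assert round(EX, 10) == 0.40
assert round(EY, 10) == 1.50
assert round(EXY, 10) == 0.75
assert round(cov, 10) == 0.15
```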


Joint pdf

A joint probability density function, f(x,y), provides the probabilities associated with the joint occurrence of all of the possible pairs of X and Y.


Survey of College City, Pilani

Joint pdf f(x,y), where X = vacation homes owned and Y = college grads in household:

             Y = 1          Y = 2
X = 0   f(0,1) = .45   f(0,2) = .15
X = 1   f(1,1) = .05   f(1,2) = .35


Calculating the expected value of functions of two random variables:

E[g(X,Y)] = Σ_i Σ_j g(x_i, y_j) f(x_i, y_j)

E(XY) = Σ_i Σ_j x_i y_j f(x_i, y_j)

E(XY) = (0)(1)(.45) + (0)(2)(.15) + (1)(1)(.05) + (1)(2)(.35) = .75


Marginal pdf

The marginal probability density functions, f(x) and f(y), for discrete random variables can be obtained by summing the joint pdf f(x,y) over the values of Y to obtain f(x), and over the values of X to obtain f(y):

f(x_i) = Σ_j f(x_i, y_j)        f(y_j) = Σ_i f(x_i, y_j)


              Y = 1        Y = 2      marginal pdf for X:
X = 0          .45          .15       f(X=0) = .60
X = 1          .05          .35       f(X=1) = .40

marginal    f(Y=1)=.50   f(Y=2)=.50
pdf for Y:
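The marginals in the table follow by summing the joint pdf over the other variable; a minimal sketch:

```python
joint = {(0, 1): .45, (0, 2): .15, (1, 1): .05, (1, 2): .35}

# marginal pdfs: sum the joint pdf over the other variable
fX = {x: sum(p for (xi, y), p in joint.items() if xi == x) for x in (0, 1)}
fY = {y: sum(p for (x, yj), p in joint.items() if yj == y) for y in (1, 2)}

assert round(fX[0], 10) == .60 and round(fX[1], 10) == .40
assert round(fY[1], 10) == .50 and round(fY[2], 10) == .50
```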


Conditional pdf

The conditional probability density functions of X given Y = y, f(x|y), and of Y given X = x, f(y|x), are obtained by dividing f(x,y) by f(y) to get f(x|y), and by f(x) to get f(y|x):

f(x|y) = f(x,y) / f(y)        f(y|x) = f(x,y) / f(x)


Conditional pdfs for the example:

              Y = 1   Y = 2   marginal
X = 0          .45     .15      .60
X = 1          .05     .35      .40
marginal       .50     .50

f(Y=1|X=0) = .45/.60 = .75      f(Y=2|X=0) = .15/.60 = .25
f(Y=1|X=1) = .05/.40 = .125     f(Y=2|X=1) = .35/.40 = .875
f(X=0|Y=1) = .45/.50 = .90      f(X=1|Y=1) = .05/.50 = .10
f(X=0|Y=2) = .15/.50 = .30      f(X=1|Y=2) = .35/.50 = .70
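The conditional values above come from dividing each joint probability by the appropriate marginal; a minimal sketch:

```python
joint = {(0, 1): .45, (0, 2): .15, (1, 1): .05, (1, 2): .35}
fX = {0: .60, 1: .40}  # marginals of the table above
fY = {1: .50, 2: .50}

# conditional pdfs: divide the joint pdf by the appropriate marginal
f_y_given_x = {(y, x): joint[(x, y)] / fX[x] for (x, y) in joint}
f_x_given_y = {(x, y): joint[(x, y)] / fY[y] for (x, y) in joint}

assert round(f_y_given_x[(1, 0)], 10) == .75    # f(Y=1|X=0)
assert round(f_y_given_x[(2, 1)], 10) == .875   # f(Y=2|X=1)
assert round(f_x_given_y[(0, 2)], 10) == .30    # f(X=0|Y=2)
# each conditional pdf sums to one over its remaining variable
assert round(f_y_given_x[(1, 0)] + f_y_given_x[(2, 0)], 10) == 1
```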


Independence

X and Y are independent random variables if their joint pdf, f(x,y), is the product of their respective marginal pdfs, f(x) and f(y):

f(x_i, y_j) = f(x_i) f(y_j)

For independence this must hold for all pairs of i and j.


The example above is not independent:

              Y = 1        Y = 2      marginal pdf for X:
X = 0          .45          .15       f(X=0) = .60
X = 1          .05          .35       f(X=1) = .40
marginal    f(Y=1)=.50   f(Y=2)=.50
pdf for Y:

Required for independence:
f(0,1) = .50 × .60 = .30    f(0,2) = .50 × .60 = .30
f(1,1) = .50 × .40 = .20    f(1,2) = .50 × .40 = .20

These calculations show the numbers required to have independence; since the actual joint probabilities differ from them, X and Y are not independent.
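The independence check can be automated by comparing f(x,y) with f(x)f(y) for every pair; a minimal sketch:

```python
joint = {(0, 1): .45, (0, 2): .15, (1, 1): .05, (1, 2): .35}
fX = {0: .60, 1: .40}
fY = {1: .50, 2: .50}

# independence requires f(x, y) = f(x) * f(y) for every pair (i, j)
required = {(x, y): fX[x] * fY[y] for (x, y) in joint}
independent = all(abs(joint[k] - required[k]) < 1e-12 for k in joint)

assert round(required[(0, 1)], 10) == .30  # would need .30, but actual is .45
assert not independent
```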


Correlation

The correlation between two random variables X and Y is their covariance divided by the square roots of their respective variances:

ρ(X,Y) = cov(X,Y) / √(var(X) var(Y))

Correlation is a pure number falling between -1 and 1.


Correlation example (same joint pdf as above):

EX = .40,  EY = 1.50,  cov(X,Y) = .15

E(X²) = 0²(.60) + 1²(.40) = .40
var(X) = E(X²) - (EX)² = .40 - (.40)² = .24

E(Y²) = 1²(.50) + 2²(.50) = .50 + 2.0 = 2.50
var(Y) = E(Y²) - (EY)² = 2.50 - (1.50)² = .25

ρ(X,Y) = cov(X,Y) / √(var(X) var(Y)) = .15 / √(.24 × .25) ≈ .61
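The correlation of about .61 can be verified end to end from the joint pdf; a minimal sketch:

```python
joint = {(0, 1): .45, (0, 2): .15, (1, 1): .05, (1, 2): .35}

EX  = sum(x * p for (x, y), p in joint.items())        # .40
EY  = sum(y * p for (x, y), p in joint.items())        # 1.50
EX2 = sum(x**2 * p for (x, y), p in joint.items())     # .40
EY2 = sum(y**2 * p for (x, y), p in joint.items())     # 2.50
varX = EX2 - EX**2                                     # .24
varY = EY2 - EY**2                                     # .25
cov = sum(x * y * p for (x, y), p in joint.items()) - EX * EY  # .15

rho = cov / (varX * varY) ** 0.5
assert round(varX, 10) == .24 and round(varY, 10) == .25
assert abs(rho - .6124) < 1e-4
```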


Zero Covariance & Correlation

Independent random variables have zero covariance and, therefore, zero correlation. The converse is not true: zero covariance does not imply independence.


The expected value of a weighted sum of random variables is the sum of the expectations of the individual terms. Since expectation is a linear operator, it can be applied term by term:

E[c_1 X + c_2 Y] = c_1 EX + c_2 EY

In general, for random variables X_1, ..., X_n:

E[c_1 X_1 + ... + c_n X_n] = c_1 EX_1 + ... + c_n EX_n


The variance of a weighted sum of random variables is the sum of the variances, each times the square of its weight, plus twice the covariance of each pair of random variables times the product of their weights.

Weighted sum of random variables:

var(c_1 X + c_2 Y) = c_1² var(X) + c_2² var(Y) + 2 c_1 c_2 cov(X,Y)

Weighted difference of random variables:

var(c_1 X - c_2 Y) = c_1² var(X) + c_2² var(Y) - 2 c_1 c_2 cov(X,Y)
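The weighted-sum formula can be checked against a direct enumeration over the joint pdf of the running example; a minimal sketch with hypothetical weights c1 = 2 and c2 = 3 (not from the slides):

```python
joint = {(0, 1): .45, (0, 2): .15, (1, 1): .05, (1, 2): .35}
varX, varY, cov = .24, .25, .15  # values computed earlier for this table
c1, c2 = 2, 3                    # hypothetical weights

# formula: var(c1*X + c2*Y) = c1^2 var(X) + c2^2 var(Y) + 2 c1 c2 cov(X,Y)
by_formula = c1**2 * varX + c2**2 * varY + 2 * c1 * c2 * cov

# direct check: enumerate Z = c1*X + c2*Y over the joint pdf
EZ  = sum((c1 * x + c2 * y) * p for (x, y), p in joint.items())
EZ2 = sum((c1 * x + c2 * y) ** 2 * p for (x, y), p in joint.items())
direct = EZ2 - EZ**2

assert abs(by_formula - 5.01) < 1e-9
assert abs(direct - by_formula) < 1e-9
```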


The Normal Distribution

Y ~ N(μ, σ²)

f(y) = (1 / √(2πσ²)) exp( -(y - μ)² / (2σ²) ),   -∞ < y < ∞
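The density formula can be evaluated directly; a minimal sketch, checking the peak height 1/√(2πσ²) at y = μ and the symmetry about μ:

```python
import math

def normal_pdf(y, mu, sigma2):
    """f(y) = 1/sqrt(2*pi*sigma^2) * exp(-(y - mu)^2 / (2*sigma^2))"""
    return math.exp(-(y - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

mu, sigma2 = 0.0, 1.0
# at y = mu the exponent is zero, so the height is 1/sqrt(2*pi*sigma^2)
assert abs(normal_pdf(mu, mu, sigma2) - 1 / math.sqrt(2 * math.pi)) < 1e-12
# the density is symmetric about mu
assert abs(normal_pdf(1.7, mu, sigma2) - normal_pdf(-1.7, mu, sigma2)) < 1e-12
```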


If Y_1 ~ N(μ_1, σ_1²), Y_2 ~ N(μ_2, σ_2²), ..., Y_n ~ N(μ_n, σ_n²), then any linear combination

W = c_1 Y_1 + c_2 Y_2 + ... + c_n Y_n

of jointly normally distributed random variables is itself normally distributed:

W ~ N[ E(W), var(W) ]


Chi-Square

If Z_1, Z_2, ..., Z_m denote m independent N(0,1) random variables, and V = Z_1² + Z_2² + ... + Z_m², then V ~ χ²(m). V is chi-square with m degrees of freedom.

mean: E[V] = E[χ²(m)] = m
variance: var[V] = var[χ²(m)] = 2m
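The mean m and variance 2m can be seen by simulation; a minimal sketch, building chi-square draws as sums of squared standard normals:

```python
import random

random.seed(1)
m, n = 5, 100_000  # degrees of freedom, number of simulated draws

# V = Z1^2 + ... + Zm^2 with Zi independent N(0,1)
draws = [sum(random.gauss(0, 1) ** 2 for _ in range(m)) for _ in range(n)]

mean = sum(draws) / n
var = sum((v - mean) ** 2 for v in draws) / n
assert abs(mean - m) < 0.1       # E[V] = m
assert abs(var - 2 * m) < 0.5    # var[V] = 2m
```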


Student-t

If Z ~ N(0,1) and V ~ χ²(m), and if Z and V are independent, then

t = Z / √(V/m) ~ t(m)

t is Student-t with m degrees of freedom.

mean: E[t] = E[t(m)] = 0 (symmetric about zero)
variance: var[t] = var[t(m)] = m / (m - 2)
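The construction t = Z/√(V/m) can be simulated directly; a minimal sketch checking the stated mean and variance for m = 10:

```python
import random

random.seed(2)
m, n = 10, 100_000  # degrees of freedom, number of simulated draws

def t_draw(m):
    z = random.gauss(0, 1)                              # Z ~ N(0,1)
    v = sum(random.gauss(0, 1) ** 2 for _ in range(m))  # V ~ chi-square(m)
    return z / (v / m) ** 0.5

draws = [t_draw(m) for _ in range(n)]
mean = sum(draws) / n
var = sum((t - mean) ** 2 for t in draws) / n
assert abs(mean) < 0.05               # E[t] = 0 (symmetric about zero)
assert abs(var - m / (m - 2)) < 0.1   # var[t] = m/(m-2) = 1.25
```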


F Statistic

If V_1 ~ χ²(m_1) and V_2 ~ χ²(m_2), and if V_1 and V_2 are independent, then

F = (V_1 / m_1) / (V_2 / m_2) ~ F(m_1, m_2)

F is an F statistic with m_1 numerator degrees of freedom and m_2 denominator degrees of freedom.
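The F construction can likewise be simulated as a ratio of scaled chi-squares; a minimal sketch checking that the simulated mean is close to m_2/(m_2 - 2), the known mean of F(m_1, m_2) for m_2 > 2:

```python
import random

random.seed(3)
m1, m2, n = 4, 20, 50_000  # numerator df, denominator df, simulated draws

def chi2(m):
    return sum(random.gauss(0, 1) ** 2 for _ in range(m))

# F = (V1/m1) / (V2/m2) with independent chi-square numerator and denominator
draws = [(chi2(m1) / m1) / (chi2(m2) / m2) for _ in range(n)]
mean = sum(draws) / n
assert abs(mean - m2 / (m2 - 2)) < 0.1  # E[F] = m2/(m2-2) when m2 > 2
```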