data analysis1

Upload: tripty-khanna

Post on 05-Apr-2018


TRANSCRIPT

  • 8/2/2019 Data Analysis1

    1/60

    Data Analysis

    Kulwant Singh Kapoor


    Data Structure

    The process of arranging data in groups or classes according to resemblances and similarities is technically called classification.

    Types of Classification:

    Geographical

    Chronological

    Qualitative

    Quantitative


    Geographical Data

    In geographical classification, data are classified on the basis of place.

    Example: geographical distribution of national income

    COUNTRY         INCOME IN US DOLLARS
    Canada          7950
    USA             7880
    West Germany    7510
    France          6730
    USSR            2800
    India           500


    Chronological Data

    When data are classified on the basis of time, the classification is chronological; such data are also known as a time series.

    Example: production of polio vaccine by a company X.

    YEAR    No. of Vaccines
    2005    12,800
    2006    15,600
    2007    18,200
    2008    16,600
    2009    20,000
    2010    20,800


    Qualitative Data

    When data are classified on the basis of descriptive characteristics or attributes.

    Examples:

    Male/ Female

    Strongly agree/ Agree/Disagree/Strongly Disagree

    Low/Medium/High

    Diabetic/Non-Diabetic

    Hypertensive/Mildly Hypertensive/Non-Hypertensive


    Quantitative Classification

    When classification is based on characteristics which are capable of quantitative measurement.

    Example:

    Height/Weight

    Income/Expenditure

    Blood Pressure

    Body Temperature

    Blood Count


    Quantitative Data

    Ungrouped: raw data

    Grouped: discrete data, continuous data


    MEASURE OF CENTRAL TENDENCY

    Mean

    Median

    Mode

    Quartile

    Percentile


    MEAN

    The arithmetic mean of a given set of observations is their sum divided by the number of observations. For example, if X1, X2, X3, ..., Xn are the given n observations, then their arithmetic mean, denoted by \bar{X}, is

    \[ \bar{X} = \frac{X_1 + X_2 + \dots + X_n}{n} = \frac{1}{n}\sum_{i=1}^{n} X_i \]


    EXAMPLE 1: MARKS OF 24 STUDENTS

    12 43 54 67 87 98 65 43
    54 67 89 90 98 76 54 56
    54 98 89 78 90 98 99 87

    TOTAL                1746
    # OF OBSERVATIONS    24
    MEAN                 72.75
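
The arithmetic in Example 1 can be checked with a few lines of Python. This is only an illustrative sketch; the variable names are not part of the slides.

```python
# Marks of the 24 students from Example 1.
marks = [
    12, 43, 54, 67, 87, 98, 65, 43,
    54, 67, 89, 90, 98, 76, 54, 56,
    54, 98, 89, 78, 90, 98, 99, 87,
]

total = sum(marks)   # sum of all observations
n = len(marks)       # number of observations
mean = total / n     # arithmetic mean = sum / count

print(total, n, mean)  # 1746 24 72.75
```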


    Arithmetic Mean for Ungrouped Series

    Employee    Income    X-A
    1           1000      -
    2           1500      -
    3           800       -
    4           1200      -
    5           900       -


    For discrete data, the mean is calculated with respect to frequencies. In the case of continuous data, the value of X is taken as the mid-value of the corresponding class.

    \[ \bar{X} = \frac{f_1 X_1 + f_2 X_2 + \dots + f_n X_n}{f_1 + f_2 + \dots + f_n} = \frac{\sum_{i=1}^{n} f_i X_i}{\sum_{i=1}^{n} f_i} \]


    EXAMPLE 2: NUMBER OF STUDENTS ABSENT IN A YEAR

    X        f      Xf
    1        8      8
    2        9      18
    3        21     63
    4        32     128
    5        12     60
    6        22     132
    7        24     168
    8        37     296
    9        15     135
    10       20     200
    TOTAL    200    1208

    MEAN     6.04
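
As a minimal sketch, the frequency-weighted mean of Example 2 can be reproduced as follows. The helper name `frequency_mean` is hypothetical and chosen only for this illustration.

```python
# Example 2: each value X mapped to its frequency f.
freq = {1: 8, 2: 9, 3: 21, 4: 32, 5: 12, 6: 22, 7: 24, 8: 37, 9: 15, 10: 20}

def frequency_mean(table):
    """Mean of a discrete frequency distribution: sum(f * x) / sum(f)."""
    total_f = sum(table.values())
    total_fx = sum(x * f for x, f in table.items())
    return total_fx / total_f

print(frequency_mean(freq))  # 6.04
```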


    Marks    Students    X-40
    X        f           d       f*d
    20       8           -       -
    30       12          -       -
    40       20          -       -
    50       10          -       -
    60       6           -       -
    70       4           -       -
    Total    60


    EXAMPLE 3: DISTRIBUTION OF NUMBER OF PROCESSED ARTICLES PER DAY PER PERSON

    LIMITS     f      X      fX
    80-100     7      90     630
    100-120    50     110    5500
    120-140    80     130    10400
    140-160    60     150    9000
    160-180    3      170    510
    TOTAL      200           26040

    MEAN       130.2
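
For a continuous distribution the only extra step is replacing each class by its mid-value. The sketch below assumes the classes of Example 3 are given as (lower limit, upper limit, frequency) triples; `grouped_mean` is an illustrative name, not something defined in the slides.

```python
# Example 3: (lower limit, upper limit, frequency) for each class.
classes = [(80, 100, 7), (100, 120, 50), (120, 140, 80),
           (140, 160, 60), (160, 180, 3)]

def grouped_mean(rows):
    """Mean of a continuous distribution, using class mid-values as X."""
    total_f = 0
    total_fx = 0.0
    for lower, upper, f in rows:
        mid = (lower + upper) / 2   # mid-value of the class
        total_f += f
        total_fx += f * mid
    return total_fx / total_f

print(grouped_mean(classes))  # 130.2
```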


    Mathematical Properties of Arithmetic Mean

    Property 1: The algebraic sum of the deviations of a given set of observations from their arithmetic mean is zero.

    Property 2: If the sizes and the means of two component series are known, then the mean of the resultant series obtained by combining them can be found.
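
Property 1 follows in one line from the definition of the mean, since the sum of the observations equals n times the mean. A short derivation, using the same symbols as the definition of the arithmetic mean:

```latex
\[
\sum_{i=1}^{n} \left( X_i - \bar{X} \right)
  = \sum_{i=1}^{n} X_i - n\bar{X}
  = n\bar{X} - n\bar{X}
  = 0
\]
```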


    Merits and Demerits of Arithmetic Mean

    Merits:

    i. It is rigidly defined.

    ii. It is easy to calculate and understand.

    iii. It is based on all the observations.

    iv. It is suitable for further mathematical treatment.

    v. Of all the averages, the arithmetic mean is affected least by fluctuations of sampling; in other words, it is a stable average.

    (contd.)


    Merits and Demerits of Arithmetic Mean

    Demerits:

    i. It is affected by extreme observations.

    ii. It cannot be used in the case of open-end classes, such as "less than 10" and "more than 70".

    iii. It cannot be determined by inspection, nor can it be located graphically.

    iv. It cannot be used in dealing with qualitative characteristics.

    v. It cannot be obtained if a single observation is missing or lost.

    vi. In an extremely skewed distribution it is not representative of the distribution and hence is not a suitable measure of location.

    vii. It may lead to wrong conclusions if the details of the data from which it is obtained are not available.

    viii. The arithmetic mean may not be one of the values which the variable actually takes; in that case it is termed a fictitious mean.


    Mean For Combined Data

    If \bar{X}_1 is the mean of n_1 observations and \bar{X}_2 is the mean of n_2 observations, the combined mean is given by

    \[ \bar{X} = \frac{n_1 \bar{X}_1 + n_2 \bar{X}_2}{n_1 + n_2} \]


    Example

    The mean height of 25 male workers in a factory is 61 inches, and the mean height of 35 female workers in the same factory is 58 inches. Find the combined mean height of the 60 workers.
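
The slide leaves this exercise unsolved. One way to work it, applying the combined-mean formula directly, is sketched below; `combined_mean` is a hypothetical helper name used only here.

```python
def combined_mean(n1, mean1, n2, mean2):
    """Mean of two groups pooled together, weighted by group size."""
    return (n1 * mean1 + n2 * mean2) / (n1 + n2)

# 25 male workers averaging 61 inches, 35 female workers averaging 58 inches.
print(combined_mean(25, 61, 35, 58))  # 59.25
```

The result lies closer to 58 than to 61 because the larger group carries more weight.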


    Median

    Median is that value of the variable which divides the group into two equal parts, one part comprising all the values greater than the median and the other all the values less than the median.

    Median is only a positional average, i.e., its value depends on the position occupied by a value in the frequency distribution.


    Calculation of Median

    Case I: Ungrouped data: If the number of observations is odd, the median is the middle value after the observations have been arranged in ascending or descending order of magnitude. If the number of observations is even, the median is taken as the arithmetic mean of the two middle values.

    Case II: Discrete distribution: In the case of a frequency distribution where the variable takes the values X1, X2, ..., Xn with respective frequencies f1, f2, ..., fn and Σf = N, the total frequency, the median is the size of the (N+1)/2-th item or observation. In this case the use of the cumulative frequency (c.f.) distribution facilitates the calculations.


    EXAMPLE 4

    MARKS OF 10 STUDENTS ARE    4 7 6 8 9 4 3 2 7 8
    IN ORDER                    2 3 4 4 6 7 7 8 8 9
    MEDIAN                      6.5

    MARKS OF 11 STUDENTS ARE    4 7 6 8 9 4 3 2 7 8 4
    IN ORDER                    2 3 4 4 4 6 7 7 8 8 9
    MEDIAN                      6
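
Both halves of Example 4 can be checked with Python's standard `statistics.median`, which sorts the data and averages the two middle values when the count is even. This is a sketch for verification only.

```python
from statistics import median

marks_10 = [4, 7, 6, 8, 9, 4, 3, 2, 7, 8]
marks_11 = marks_10 + [4]

# Even count: mean of the two middle values (6 and 7).
print(sorted(marks_10), median(marks_10))
# Odd count: the single middle value.
print(sorted(marks_11), median(marks_11))
```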


    ... AND THE NUMBER OF HEADS IS NOTED.

    THE EXPERIMENT IS REPEATED 256 TIMES.

    # HEADS    FREQUENCY
    X          f            CF     xf
    0          1            1      0
    1          9            10     9
    2          26           36     52
    3          59           95     177
    4          72           167    288
    5          52           219    260
    6          29           248    174
    7          7            255    49
    8          1            256    8
    TOTAL      256                 1017

    N/2        128
    MEDIAN     4
    MEAN       3.972656
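
The cumulative-frequency lookup used in this table can be sketched as below. The function name `discrete_median` is an assumption for illustration, and the sketch deliberately ignores the special case where the (N+1)/2-th position falls exactly between two different values.

```python
from itertools import accumulate

heads = [0, 1, 2, 3, 4, 5, 6, 7, 8]
freq = [1, 9, 26, 59, 72, 52, 29, 7, 1]

def discrete_median(values, freqs):
    """Value whose cumulative frequency first reaches the (N+1)/2-th item."""
    n_total = sum(freqs)
    position = (n_total + 1) / 2
    for value, cf in zip(values, accumulate(freqs)):
        if cf >= position:
            return value
    raise ValueError("empty distribution")

mean = sum(x * f for x, f in zip(heads, freq)) / sum(freq)
print(discrete_median(heads, freq), mean)  # 4 3.97265625
```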


    Case III: Continuous distribution:

    Compute the cumulative frequency (cf).

    Find N/2.

    Find the cf just greater than N/2.

    The corresponding class contains the median value and is called the median class.

    \[ \text{Median} = l + \frac{h}{f}\left(\frac{N}{2} - C\right) \]

    where
    l is the lower limit of the median class,
    f is the frequency of the median class,
    h is the magnitude (width) of the median class,
    N is the total frequency,
    C is the cf of the class preceding the median class.
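
The steps above can be sketched in code. The slides give no worked example for this case, so the sketch borrows the class table of Example 3 purely as test data; the function name and the resulting value 130.75 are not from the slides.

```python
def grouped_median(rows):
    """rows: (lower_limit, upper_limit, frequency) triples in ascending order."""
    n_total = sum(f for _, _, f in rows)
    half = n_total / 2                  # N/2
    cum_before = 0                      # C: cf of the preceding class
    for lower, upper, f in rows:
        if cum_before + f > half:       # first cf just greater than N/2
            h = upper - lower           # magnitude of the median class
            return lower + (h / f) * (half - cum_before)
        cum_before += f
    raise ValueError("no median class found")

classes = [(80, 100, 7), (100, 120, 50), (120, 140, 80),
           (140, 160, 60), (160, 180, 3)]
print(grouped_median(classes))  # 130.75
```

Here N/2 = 100, the median class is 120-140 (cf 137), and C = 57, so the median is 120 + (20/80)(100 - 57).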


    Merits And Demerits

    Merits:

    i. It is rigidly defined.

    ii. It is easy to understand and calculate, even for a non-medical person.

    iii. It is not affected by extreme observations and as such is very useful in the case of skewed distributions.

    iv. It can be computed even for distributions with open-end classes.

    v. It can sometimes be located by simple inspection and can also be computed graphically.

    vi. It is the only average to be used while dealing with qualitative characteristics which cannot be measured quantitatively but can still be arranged in ascending or descending order of magnitude.


    Merits And Demerits

    Demerits:

    i. In the case of an even number of observations of ungrouped data, it cannot be determined exactly.

    ii. It is not based on each and every item of the distribution.

    iii. It is not suitable for further mathematical treatment.

    iv. It is relatively less stable than the mean, particularly for small samples.


    Quartile

    The values which divide the given data into four equal parts are known as quartiles. Therefore, there will be only three such points Q1, Q2 and Q3, such that Q1 ≤ Q2 ≤ Q3, termed the three quartiles.

    Q1, known as the lower or first quartile, is the value which has 25% of the items of the distribution below it and consequently 75% of the items greater than it. Q2, the second quartile, coincides with the median and has an equal number of observations above and below it. Q3, the upper or third quartile, has 75% of the observations below it and consequently 25% of the observations above it.


    \[ Q_1 = l + \frac{h}{f}\left(\frac{N}{4} - C\right) \]

    \[ Q_3 = l + \frac{h}{f}\left(\frac{3N}{4} - C\right) \]


    Percentile

    Percentiles are the values which divide the series into 100 equal parts. So there are 99 percentiles P1, P2, ..., P99 such that P1 ≤ P2 ≤ ... ≤ P99. The i-th percentile value is:

    \[ P_i = l + \frac{h}{f}\left(\frac{iN}{100} - C\right) \]
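
Since Q1 and Q3 are simply the 25th and 75th percentiles, one routine can cover quartiles and percentiles alike. The sketch below is an illustration only: `grouped_percentile` is a hypothetical name, and the class table is borrowed from Example 3 because the slides give no worked percentile example.

```python
def grouped_percentile(rows, i):
    """i-th percentile (0 < i < 100) of a continuous frequency distribution.

    rows: (lower_limit, upper_limit, frequency) triples in ascending order.
    """
    n_total = sum(f for _, _, f in rows)
    target = i * n_total / 100          # iN/100
    cum_before = 0                      # C: cf of the preceding class
    for lower, upper, f in rows:
        if cum_before + f > target:     # class containing the percentile
            return lower + ((upper - lower) / f) * (target - cum_before)
        cum_before += f
    raise ValueError("i must be below 100")

classes = [(80, 100, 7), (100, 120, 50), (120, 140, 80),
           (140, 160, 60), (160, 180, 3)]
q1 = grouped_percentile(classes, 25)    # lower quartile
q3 = grouped_percentile(classes, 75)    # upper quartile
print(round(q1, 2), round(q3, 2))
```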


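    MODE

    Mode is the value which has the greatest frequency density.

    Mode for a continuous distribution is given by

    \[ \text{Mode} = l + \frac{h\,(f_1 - f_0)}{(f_1 - f_0) - (f_2 - f_1)} \]

    where l is the lower limit of the modal class, h is its magnitude, f_1 is its frequency, and f_0 and f_2 are the frequencies of the classes preceding and succeeding it.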


    EXAMPLE 7

    CLASS      f     x      xf
    10-20      4     15     60
    20-30      6     25     150
    30-40      5     35     175
    40-50      10    45     450
    50-60      20    55     1100
    60-70      22    65     1430
    70-80      21    75     1575
    80-90      6     85     510
    90-100     2     95     190
    100-110    1     105    105
    TOTAL      97           5745

    MEAN = 59.2268

    f1 = 22, f0 = 20, f2 = 21, l = 60, h = 10

    MODE = 66.6666667
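
The mode in Example 7 comes from substituting the listed values into the formula for a continuous distribution. A minimal sketch, with a hypothetical function name:

```python
def grouped_mode(l, h, f1, f0, f2):
    """Mode of a continuous distribution.

    l: lower limit of the modal class, h: its magnitude,
    f1: its frequency, f0 / f2: frequencies of the classes before / after it.
    """
    return l + h * (f1 - f0) / ((f1 - f0) - (f2 - f1))

# Modal class 60-70 of Example 7.
print(round(grouped_mode(l=60, h=10, f1=22, f0=20, f2=21), 4))  # 66.6667
```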


    Measures of Dispersion

    Range

    Quartile deviation

    Mean Deviation

    Variance

    Standard deviation


    RANGE

    Range is the difference between the two extreme observations of the distribution, that is, the difference between the greatest (maximum) and the smallest (minimum) observation.

    \[ \text{Range} = X_{\max} - X_{\min} \]

    It is the simplest but a crude measure of dispersion. It is rigidly defined, readily comprehensible and the easiest to compute, requiring very little calculation.


    EXAMPLE

    MARKS OF STUDENTS

    ROLL NO.    MARKS    SORTED
    123         98       52
    125         95       56
    126         96       56
    127         87       66
    128         56       78
    134         52       87
    135         89       89
    136         78       95
    137         56       96
    138         66       98

    RANGE       98 - 52 = 46


    Merits and Demerits of Range

    It is not based on the entire set of data.

    Its value varies very widely from sample to sample.

    If Xmax and Xmin remain unaltered and all the other values are replaced by another set of observations, the range of the distribution remains the same.

    It cannot be used when dealing with open-end classes.

    It is not suitable for mathematical treatment.

    It is very sensitive to the size of the sample.

    It is too indefinite to be used as a practical measure of dispersion.


    QUARTILE DEVIATION

    It is a measure of dispersion based on the upper quartile Q3 and the lower quartile Q1.

    Inter-quartile Range = Q3 - Q1

    Quartile Deviation is obtained by dividing the inter-quartile range by 2.

    \[ \text{Quartile Deviation} = \frac{Q_3 - Q_1}{2} \]


    Merits and Demerits of Quartile Deviation

    Merits:

    It is quite easy to understand and calculate.

    It makes use of 50% of the data and as such is a better measure than the range.

    As it ignores 25% of the data from the beginning and 25% from the top end, it is not affected at all by extreme observations.

    It can be computed from a frequency distribution with open-end classes.

    (Contd.)


    Merits and Demerits of Quartile Deviation

    Demerits:

    It is not based on all observations.

    It is affected considerably by fluctuations of sampling.

    It is not suitable for further mathematical treatment.

    EXAMPLE


    DISTRIBUTION OF MONTHLY EARNING

    MONTH    EARNING
    1        10239
    2        10250
    3        10251
    4        10251
    5        10257
    6        10258
    7        10260
    8        10261
    9        10262
    10       10262
    11       10273
    12       10275

    Q1                    10251
    Q3                    10262
    QUARTILE DEVIATION    5.5
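
The quartiles in this table can be reproduced with the standard-library `statistics.quantiles`. Conventions for locating quartiles in raw data differ between textbooks and software, and the slides do not say which one was used; for this data set the library's default method happens to give the same Q1 and Q3 as the slide. A sketch:

```python
from statistics import quantiles

earnings = [10239, 10250, 10251, 10251, 10257, 10258,
            10260, 10261, 10262, 10262, 10273, 10275]

# quantiles(..., n=4) returns the three cut points Q1, Q2, Q3.
q1, q2, q3 = quantiles(earnings, n=4)
quartile_deviation = (q3 - q1) / 2      # half the inter-quartile range

print(q1, q3, quartile_deviation)  # 10251.0 10262.0 5.5
```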


    MEAN DEVIATION

    Average or mean deviation is the average amount of scatter of the items in a distribution from either the mean or the median, ignoring the signs of the deviations. The average that is taken of the scatter is an arithmetic mean, which accounts for the fact that this measure is often called the mean deviation.

    For ungrouped data

    \[ \text{Mean Deviation} = \frac{1}{n}\sum_{i} \left| X_i - \bar{X} \right| \]

    For grouped data

    \[ \text{Mean Deviation} = \frac{1}{N}\sum_{i} f_i \left| X_i - \bar{X} \right| \]


    EXAMPLE

    DISTRIBUTION OF SERIES OF DAILY RENTS

    HOUSE    RENT     |RENT - MEAN|
    1        3000     1819
    2        3000     1819
    3        3000     1819
    4        3750     1069
    5        4000     819.4
    6        4000     819.4
    7        4000     819.4
    8        4500     319.4
    9        4750     69.44
    10       5000     180.6
    11       5000     180.6
    12       5000     180.6
    13       5250     430.6
    14       5250     430.6
    15       5500     680.6
    16       6250     1431
    17       6500     1681
    18       9000     4181
    TOTAL    86750    18750

    MEAN     4819.4
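
The table stops at the column totals. Dividing the total absolute deviation (18750) by the number of houses (18) gives the mean deviation, which the following sketch computes directly from the rents; the figure 1041.67 is derived here and does not appear on the slide.

```python
rents = [3000, 3000, 3000, 3750, 4000, 4000, 4000, 4500, 4750,
         5000, 5000, 5000, 5250, 5250, 5500, 6250, 6500, 9000]

def mean_deviation(data):
    """Average absolute deviation of the observations from their mean."""
    mean = sum(data) / len(data)
    return sum(abs(x - mean) for x in data) / len(data)

print(round(sum(rents) / len(rents), 1))  # 4819.4
print(round(mean_deviation(rents), 2))    # 1041.67
```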


    EXAMPLE

    DISTRIBUTION OF HEIGHTS OF STUDENTS

    HEIGHT    # OF STUDENTS
    X         f      fX       f|X - MEAN|
    158       15     2370     49.1667
    159       20     3180     45.5556
    160       32     5120     40.8889
    161       35     5635     9.72222
    162       33     5346     23.8333
    163       21     3423     36.1667
    164       10     1640     27.2222
    165       8      1320     29.7778
    166       6      996      28.3333
    TOTAL     180    29030    290.667

    MEAN      161.278
    MD        1.61481


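    STANDARD DEVIATION

    It is defined as the positive square root of the mean of the squares of the deviations of the given observations from their mean.

    For ungrouped data

    \[ \text{Standard Deviation} = \sqrt{\frac{1}{n}\sum_{i} \left( X_i - \bar{X} \right)^2} \]

    For grouped data

    \[ \text{Standard Deviation} = \sqrt{\frac{1}{N}\sum_{i} f_i \left( X_i - \bar{X} \right)^2} \]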


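    VARIANCE

    It is the square of the standard deviation and is denoted by \sigma^2.

    For ungrouped data

    \[ \text{Variance} = \sigma^2 = \frac{1}{n}\sum_{i} \left( X_i - \bar{X} \right)^2 \]

    For grouped data

    \[ \text{Variance} = \sigma^2 = \frac{1}{N}\sum_{i} f_i \left( X_i - \bar{X} \right)^2 \]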


    PROPERTIES OF STANDARD DEVIATION

    PROPERTY 1

    It is independent of change of origin but not of scale.

    PROPERTY 2

    It is the minimum value of the root mean square deviation, that is, the root mean square deviation is least when the deviations are taken from the mean.

    PROPERTY 3

    It is suitable for further mathematical treatment.

    PROPERTY 4

    SD ≤ Range


    MERITS AND DEMERITS OF SD

    It is the most important and widely used measure of dispersion.

    It is based on all the observations.

    The squaring of the deviations removes the drawback of ignoring the signs of deviations in computing the mean deviation.

    It is affected least by fluctuations of sampling.


    EXAMPLE

    X        (X-MEAN)^2
    12       13.69
    15       0.49
    24       68.89
    12       13.69
    13       7.29
    15       0.49
    14       2.89
    12       13.69
    16       0.09
    24       68.89
    TOTAL    157    190.1

    MEAN        15.7
    VARIANCE    19.01
    SD          4.36
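
These figures use the divide-by-n definition given earlier, which corresponds to the population functions `pvariance` and `pstdev` in Python's `statistics` module (not `variance` and `stdev`, which divide by n - 1). A verification sketch:

```python
from statistics import pvariance, pstdev

x = [12, 15, 24, 12, 13, 15, 14, 12, 16, 24]

# Population forms (divide by n), matching the slide's definition.
print(round(pvariance(x), 2))  # 19.01
print(round(pstdev(x), 2))     # 4.36
```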


    EXAMPLE

    # LETTERS IN WORD    FREQUENCY             X-MEAN
    X                    f            fX       d         fd^2
    1                    3            3        -3.277    32.208
    2                    8            16       -2.277    41.463
    3                    9            27       -1.277    14.667
    4                    10           40       -0.277    0.765
    5                    5            25       0.723     2.617
    6                    4            24       1.723     11.880
    7                    3            21       2.723     22.251
    8                    1            8        3.723     13.864
    9                    3            27       4.723     66.932
    10                   1            10       5.723     32.757
    TOTAL                47           201                239.404

    MEAN        4.277
    VARIANCE    5.094


    EXAMPLE

    CLASS    f     BOUNDARIES    x       xf        d^2        fd^2
    30-39    1     29.5-39.5     34.5    34.5      1128.96    1128.96
    40-49    4     39.5-49.5     44.5    178       556.96     2227.84
    50-59    14    49.5-59.5     54.5    763       184.96     2589.44
    60-69    20    59.5-69.5     64.5    1290      12.96      259.2
    70-79    22    69.5-79.5     74.5    1639      40.96      901.12
    80-89    12    79.5-89.5     84.5    1014      268.96     3227.52
    90-99    2     89.5-99.5     94.5    189       696.96     1393.92
    TOTAL    75                          5107.5               11728

    MEAN        68.1
    VARIANCE    156
    SD          12.5
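
The grouped-data variance in this table can be recomputed from the mid-values and frequencies alone. The sketch below uses a hypothetical helper name; it returns about 156.37 for the variance, which the slide shows rounded to 156.

```python
from math import sqrt

# (class mid-value x, frequency f) taken from the example table.
rows = [(34.5, 1), (44.5, 4), (54.5, 14), (64.5, 20),
        (74.5, 22), (84.5, 12), (94.5, 2)]

def grouped_variance(table):
    """Variance of a frequency distribution: sum(f * (x - mean)^2) / N."""
    n_total = sum(f for _, f in table)
    mean = sum(f * x for x, f in table) / n_total
    return sum(f * (x - mean) ** 2 for x, f in table) / n_total

variance = grouped_variance(rows)
print(round(variance, 2), round(sqrt(variance), 2))
```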


    CORRELATION

    When the relationship is of a quantitative nature, the appropriate statistical tool for discovering and measuring the relationship and expressing it in a brief formula is known as correlation.

    It is defined as an analysis of the covariation between two or more variables.


    Types of Correlation

    a) Positive and negative correlation

    b) Linear and non-linear correlation


    METHODS OF STUDYING CORRELATION

    1. Scatter diagram

    2. Karl Pearson's coefficient of correlation

    3. Bivariate correlation method

    4. Rank correlation


    Scatter Diagram


    Karl Pearson's Coefficient of Correlation

    It is a numerical measure of the linear relationship between two variables X and Y, and is defined as the ratio of the covariance between X and Y to the product of their standard deviations.

    \[ r = \frac{\operatorname{Cov}(x, y)}{\sigma_x \sigma_y} \]


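    \[ r = \frac{\frac{1}{n}\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\frac{1}{n}\sum (x - \bar{x})^2 \cdot \frac{1}{n}\sum (y - \bar{y})^2}} \]

    \[ r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}} \]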


    EXAMPLE

    ADVERTISING EXPENSES (x) AND SALES (y)

    x      y      x-mx    y-my    dx^2    dy^2    dxdy
    39     47     -26     -19     676     361     494
    65     53     0       -13     0       169     0
    62     58     -3      -8      9       64      24
    90     86     25      20      625     400     500
    82     62     17      -4      289     16      -68
    75     68     10      2       100     4       20
    25     60     -40     -6      1600    36      240
    98     91     33      25      1089    625     825
    36     51     -29     -15     841     225     435
    78     84     13      18      169     324     234
    650    660    0       0       5398    2224    2704

    mx = 65
    my = 66
    r = 0.78
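
The coefficient in this example is 2704 / sqrt(5398 × 2224). A minimal sketch that rebuilds the deviation columns and the result from the raw x and y values (the function name `pearson_r` is chosen here for illustration):

```python
from math import sqrt

advertising = [39, 65, 62, 90, 82, 75, 25, 98, 36, 78]   # x
sales = [47, 53, 58, 86, 62, 68, 60, 91, 51, 84]         # y

def pearson_r(xs, ys):
    """Covariance of x and y divided by the product of their standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))   # sum of dx*dy
    sxx = sum((x - mx) ** 2 for x in xs)                     # sum of dx^2
    syy = sum((y - my) ** 2 for y in ys)                     # sum of dy^2
    return sxy / sqrt(sxx * syy)   # the 1/n factors cancel

print(round(pearson_r(advertising, sales), 2))  # 0.78
```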
