CHS 221 VISUALIZING DATA

Week 3

Dr. Wajed Hatamleh

http://staff.ksu.edu.sa/whatamleh/en

Upload: kristian-jacobs

Post on 03-Jan-2016



VISUALIZING DATA

• Depict the nature or shape of the data distribution

• In a graph: different graphs are used for different types of data

HISTOGRAM

Another common graphical presentation of quantitative data is a histogram.

The variable of interest is placed on the horizontal axis. A rectangle is drawn above each class interval with its height corresponding to the interval’s frequency, relative frequency, or percent frequency.

HISTOGRAMS

Histograms: Used for quantitative data

Similar to a bar graph, with an X and Y axis, but adjacent values are on a continuum, so the bars touch one another

Data values on the X axis are arranged from lowest to highest

Bars are drawn to a height that shows the frequency or percentage (Y axis)
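As a rough illustration of where the bar heights come from, the sketch below tallies quantitative data into class intervals. The heart-rate readings, the class width of 5, and the function name `class_frequencies` are assumptions made up for this example; they are not values from the slides.

```python
from collections import Counter

def class_frequencies(values, start, width):
    """Count how many integer values fall in each class of the given width.

    Assumes every value is >= start. Returns (lower, upper, frequency) tuples.
    """
    counts = Counter((v - start) // width for v in values)
    n_classes = max(counts) + 1
    return [(start + i * width, start + (i + 1) * width - 1, counts.get(i, 0))
            for i in range(n_classes)]

# Hypothetical heart rates in bpm (illustrative only)
heart_rates = [55, 58, 61, 62, 62, 64, 65, 65, 66, 67, 68, 70, 71, 74]

for lower, upper, f in class_frequencies(heart_rates, start=55, width=5):
    # Each '#' stands for one observation, so the bar length shows the frequency
    print(f"{lower}-{upper}: {'#' * f} ({f})")
```

Because the classes sit side by side on a continuous scale, a real histogram would draw these bars touching one another.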

HISTOGRAMS (CONT’D)

Example of a histogram: Heart rate data

[Figure: histogram of heart rate in bpm (X axis, 55 to 74) against frequency f (Y axis, 0 to 12)]

Histogram

A bar graph in which the horizontal scale represents the classes of data values and the vertical scale represents the frequencies.

Figure 2-1

Relative Frequency Histogram

Has the same shape and horizontal scale as a histogram, but the vertical scale is marked with relative frequencies.
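A minimal sketch of the conversion from frequencies to relative frequencies is given here. The class frequencies and the function name `relative_frequencies` are hypothetical and chosen only for illustration.

```python
def relative_frequencies(frequencies):
    """Convert class frequencies to relative frequencies (proportions of the total)."""
    total = sum(frequencies)
    return [f / total for f in frequencies]

# Hypothetical class frequencies (illustrative only)
frequencies = [2, 4, 5, 3, 6]
rel = relative_frequencies(frequencies)

print([round(r, 2) for r in rel])
print(round(sum(rel), 2))  # relative frequencies always total 1 (i.e., 100%)
```

Every bar is divided by the same total, which is why the shape of the histogram does not change; only the vertical scale does.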

Figure 2-2

Histogram and Relative Frequency Histogram

Figure 2-1 Figure 2-2

Ogive

An ogive is a graph of a cumulative distribution.

The data values are shown on the horizontal axis.

Shown on the vertical axis are the:

• cumulative frequencies, or

• cumulative relative frequencies, or

• cumulative percent frequencies

The frequency (one of the above) of each class is plotted as a point.

The plotted points are connected by straight lines.
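The points of an ogive can be sketched as running totals of the class frequencies. The class boundaries and frequencies below are hypothetical, chosen only to show the arithmetic.

```python
from itertools import accumulate

# Hypothetical class upper boundaries and class frequencies (illustrative only)
upper_boundaries = [59, 64, 69, 74]
frequencies = [2, 4, 5, 3]

cumulative = list(accumulate(frequencies))            # running totals
total = cumulative[-1]
cumulative_percent = [100 * c / total for c in cumulative]

# Each (x, y) pair is one plotted point; joining them with straight lines gives the ogive
ogive_points = list(zip(upper_boundaries, cumulative))
print(ogive_points)
print([round(p, 1) for p in cumulative_percent])
```

The last cumulative frequency always equals the number of observations, so the last cumulative percent is 100.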


Ogive

A line graph that depicts cumulative frequencies

Figure 2-4

BAR GRAPHS

Bar graphs: Used for qualitative data.

Bar graphs have a horizontal dimension (X axis) that specifies categories (i.e., data values)

The vertical dimension (Y axis) specifies either frequencies or percentages

Bars for each category are drawn to the height that indicates the frequency or %

BAR GRAPHS

Example of a bar graph

Note that the bars do not touch each other

PIE CHART

Pie charts: Also used for qualitative data.

The circle is divided into pie-shaped wedges corresponding to percentages for a given category or data value

All pieces add up to 100%

Place wedges in order, with the biggest wedge starting at “12 o’clock”
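A small sketch of how wedge sizes could be derived from category counts follows. The categories and counts are hypothetical and are not the marital status data shown in the slides.

```python
# Hypothetical category counts (illustrative only)
counts = {"Single": 30, "Married": 45, "Widowed": 10, "Divorced": 15}

total = sum(counts.values())

# Convert counts to percentages and sort so the biggest wedge comes first
wedges = sorted(
    ((category, 100 * n / total) for category, n in counts.items()),
    key=lambda item: item[1],
    reverse=True,
)

for category, percent in wedges:
    print(f"{category}: {percent:.1f}%")
```

Whatever the counts are, the percentages computed this way add up to 100%.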

PIE CHART

Example of a pie chart, for the same marital status data

Recap

In this section we have discussed graphs that are pictures of distributions.

Keep in mind that the object of this section is not just to construct graphs, but to learn something about the data sets – that is, to understand the nature of their distributions.

CHARACTERISTICS OF A DATA DISTRIBUTION

Central tendency

Variability

Both central tendency and variability can be expressed by indexes that are descriptive statistics

CENTRAL TENDENCY

Indexes of central tendency provide a single number to characterize a distribution

Measures of central tendency come from the center of the distribution of data values, indicating what is “typical,” and where data values tend to cluster

Popularly called an “average”

CENTRAL TENDENCY INDEXES

Three alternative indexes:

The mode

The median

The mean

THE MODE

The mode is the score value with the highest frequency; the most “popular” score

Age: 26 27 27 28 29 30 31

Mode = 27

THE MODE: ADVANTAGES

Can be used with data measured on any measurement level (including nominal level)

Easy to “compute”

Reflects an actual value in the distribution, so it is easy to understand

Useful when there are 2+ “popular” scores (i.e., in multimodal distributions)

Mode

Denoted by M

A data set may be:

Bimodal

Multimodal

No Mode

The only measure of central tendency that can be used with qualitative data

Examples

a. 5.40 1.10 0.42 0.73 0.48 1.10 (Mode is 1.10)

b. 27 27 27 55 55 55 88 88 99 (Bimodal: 27 & 55)

c. 1 2 3 6 7 8 9 10 (No Mode)
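The three examples can be reproduced with a short sketch. The function name `modes` is an assumption for illustration, and so is the convention it encodes: when every distinct value occurs equally often, the data set is reported as having no mode.

```python
from collections import Counter

def modes(values):
    """Return the list of modes; an empty list means 'no mode'.

    Assumed convention: if all distinct values occur equally often,
    the data set has no mode.
    """
    counts = Counter(values)
    highest = max(counts.values())
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([5.40, 1.10, 0.42, 0.73, 0.48, 1.10]))   # example a
print(modes([27, 27, 27, 55, 55, 55, 88, 88, 99]))   # example b
print(modes([1, 2, 3, 6, 7, 8, 9, 10]))              # example c
```

Returning a list rather than a single number keeps the bimodal and multimodal cases visible instead of silently picking one value.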


THE MODE: DISADVANTAGES

Ignores most information in the distribution

Tends to be unstable (i.e., value varies a lot from one sample to the next)

Some distributions may not have a mode (e.g., 10, 10, 11, 11, 12, 12)

THE MEDIAN

The median is the score that divides the distribution into two equal halves

50% are below the median, 50% above

Age: 26 27 27 28 29 30 31

Median (Mdn) = 28

5.40 1.10 0.42 0.73 0.48 1.10 0.66

0.42 0.48 0.66 0.73 1.10 1.10 5.40 (in order, odd number of values)

The exact middle value is 0.73, so the MEDIAN is 0.73

5.40 1.10 0.42 0.73 0.48 1.10

0.42 0.48 0.73 1.10 1.10 5.40 (in order, even number of values: no exact middle, so the middle is shared by two numbers)

(0.73 + 1.10) / 2 = 0.915

MEDIAN is 0.915
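The odd and even cases can be sketched as follows. The function name `median` is chosen for this example; the data are the two sets worked through on this slide.

```python
def median(values):
    """Middle value of the sorted data; mean of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd n: exact middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even n: average the two middle values

print(median([5.40, 1.10, 0.42, 0.73, 0.48, 1.10, 0.66]))      # odd number of values
print(round(median([5.40, 1.10, 0.42, 0.73, 0.48, 1.10]), 3))  # even number of values
```

Sorting first is the essential step: the median depends on position in the ordered list, not on the order in which the values were recorded.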


THE MEDIAN: ADVANTAGES

Not influenced by outliers

Particularly good index of what is “typical” when distribution is skewed

Easy to “compute”


THE MEDIAN: DISADVANTAGES

Does not take actual data values into account—only an index of position

Value of median not necessarily an actual data value, so it is more difficult to understand than mode


THE MEAN

The mean is the arithmetic average

Data values are summed and divided by N

Age: 26 27 27 28 29 30 31

Mean = 28.3

THE MEAN (CONT’D)

Most frequently used measure of central tendency

Equation: M = ΣX ÷ N

Where:

M = sample mean

Σ = the sum of

X = actual data values

N = number of people
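As a quick check of the equation, the sketch below applies M = ΣX ÷ N to the age data used earlier in the deck. The function name `mean` is chosen for illustration.

```python
def mean(values):
    """M = ΣX ÷ N: sum the data values and divide by how many there are."""
    return sum(values) / len(values)

ages = [26, 27, 27, 28, 29, 30, 31]

print(sum(ages), len(ages))    # ΣX and N
print(round(mean(ages), 1))    # M, rounded to one decimal place
```

Here ΣX = 198 and N = 7, so M = 198 ÷ 7, which rounds to 28.3.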

THE MEAN: ADVANTAGES

The balance point in the distribution: the sum of deviations above the mean always exactly balances those below it

Does not ignore any information

The most stable index of central tendency

Many inferential statistics are based on the mean
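The balance-point property can be illustrated numerically. This is only a sketch using the age data from the earlier slides; the variable names are arbitrary.

```python
ages = [26, 27, 27, 28, 29, 30, 31]
m = sum(ages) / len(ages)

deviations = [x - m for x in ages]
above = sum(d for d in deviations if d > 0)    # total distance above the mean
below = sum(-d for d in deviations if d < 0)   # total distance below the mean

# The two totals agree (up to floating-point rounding)
print(round(above, 6), round(below, 6))
```

The same balance holds for any data set, because the deviations from the mean always sum to zero.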


THE MEAN: DISADVANTAGES

Sensitive to outliers

Gives a distorted view of what is “typical” when data are skewed

Value of mean is often not an actual data value
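To see the sensitivity to outliers, the sketch below adds one extreme value to the age data and compares the mean with the median. The outlier of 95 is a hypothetical value added purely for illustration.

```python
from statistics import mean, median

ages = [26, 27, 27, 28, 29, 30, 31]
with_outlier = ages + [95]   # hypothetical extreme value (illustrative only)

print(round(mean(ages), 1), median(ages))                  # without the outlier
print(round(mean(with_outlier), 1), median(with_outlier))  # with the outlier
```

A single extreme value pulls the mean up by several years, while the median moves by only half a year.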


THE MEAN: SYMBOLS

Sample means:

In reports, usually symbolized as M

In statistical formulas, usually symbolized as x̄ (pronounced “X bar”)

Population means:

The Greek letter μ (mu)

Notation

µ is pronounced ‘mu’ and denotes the mean of all values in a population:

$\mu = \frac{\sum x}{N}$

x̄ is pronounced ‘x-bar’ and denotes the mean of a set of sample values:

$\bar{x} = \frac{\sum x}{n}$

Best Measure of Center


Definitions

Symmetric: Data is symmetric if the left half of its histogram is roughly a mirror image of its right half.

Skewed: Data is skewed if it is not symmetric and if it extends more to one side than the other.

Skewness

Figure 2-11

Recap

In this section we have discussed:

Types of Measures of Center: Mean, Median, Mode

Mean from a frequency distribution

Best Measures of Center

Skewness

MEASURES OF VARIATION

Because this section introduces the concept of variation, this is one of the most important sections in the entire book


DEFINITION

The range of a set of data is the difference between the highest value and the lowest value

range = (highest value) − (lowest value)
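A minimal sketch of the range calculation, using the age data from the earlier slides. The function name `data_range` is an assumption for this example.

```python
def data_range(values):
    """Range = highest value minus lowest value."""
    return max(values) - min(values)

ages = [26, 27, 27, 28, 29, 30, 31]
print(max(ages), min(ages), data_range(ages))
```

Only the two extreme values enter the calculation, so the range says nothing about how the remaining values are spread between them.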


DEFINITION

The standard deviation of a set of sample values is a measure of variation of values about the mean

SAMPLE STANDARD DEVIATION FORMULA

$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$

SAMPLE STANDARD DEVIATION (SHORTCUT FORMULA)

$s = \sqrt{\frac{n\left(\sum x^2\right) - \left(\sum x\right)^2}{n(n - 1)}}$
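The definition formula and the shortcut formula should give the same result, which the sketch below checks on the age data from the earlier slides. The function names `sd_definition` and `sd_shortcut` are chosen for this example.

```python
import math

def sd_definition(values):
    """s = sqrt( Σ(x - x̄)² / (n - 1) )"""
    n = len(values)
    x_bar = sum(values) / n
    return math.sqrt(sum((x - x_bar) ** 2 for x in values) / (n - 1))

def sd_shortcut(values):
    """s = sqrt( [n(Σx²) - (Σx)²] / [n(n - 1)] )"""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    return math.sqrt((n * sum_x2 - sum_x ** 2) / (n * (n - 1)))

ages = [26, 27, 27, 28, 29, 30, 31]
print(round(sd_definition(ages), 2), round(sd_shortcut(ages), 2))
```

The shortcut form needs only Σx and Σx², so it can be computed in a single pass without first finding the mean; both functions divide by n − 1 because they describe a sample rather than a population.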


Standard Deviation - Key Points

The standard deviation is a measure of variation of all values from the mean

The value of the standard deviation s is usually positive; it is zero only when all of the data values are the same number, and it is never negative

The value of the standard deviation s can increase dramatically with the inclusion of one or more outliers (data values far away from all others)

The units of the standard deviation s are the same as the units of the original data values


Definition

Empirical (68-95-99.7) Rule

For data sets having a distribution that is approximately bell shaped, the following properties apply:

About 68% of all values fall within 1 standard deviation of the mean

About 95% of all values fall within 2 standard deviations of the mean

About 99.7% of all values fall within 3 standard deviations of the mean
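The three percentages can be recovered from an exactly normal distribution, which is the idealized version of "approximately bell shaped". The sketch below assumes that idealization and uses Python's standard-library `statistics.NormalDist`; real data sets will match the rule only approximately.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

for k in (1, 2, 3):
    # Proportion of a normal distribution within k standard deviations of the mean
    share = standard_normal.cdf(k) - standard_normal.cdf(-k)
    print(f"within {k} SD: {100 * share:.1f}%")
```

The printed values are slightly more precise than the rounded 68, 95, and 99.7 used in the rule's name.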


The Empirical Rule

FIGURE 2-13


ARE YOU READY

Post test Time

Which measure of center is the only one that can be used with data at the categorical level of measurement?

A. Mean

B. Median

C. Mode


Which of the following measures of center is not affected by outliers?

A. Mean

B. Median

C. Mode


Find the mode(s) for the given sample data.

79, 25, 79, 13, 25, 29, 56, 79

A. 79

B. 48.1

C. 42.5

D. 25


Which is not true about the variance?

A. It is the square of the standard deviation.

B. It is a measure of the spread of data.

C. The units of the variance are different from the units of the original data set.

D. It is not affected by outliers.


EXERCISE TIME


EXERCISE 1

1. The following 10 data values are diastolic blood pressure readings. Compute the mean, the range, and the SD for these data.

130 110 160 120 170 120 150 140 160 140


EXERCISE 2

The following are the fasting blood glucose levels of 10 children:

 1. 56     6. 56
 2. 62     7. 65
 3. 63     8. 68
 4. 65     9. 70
 5. 65    10. 72

Compute the: a. range b. standard deviation


EXERCISE 3

3. The fifteen patients making initial visits to a rural health department travelled these distances. Find: a. Range, b. Standard Deviation

Patient   Distance (Miles)     Patient   Distance (Miles)
1         5                    8         6
2         9                    9         13
3         11                   10        7
4         3                    11        3
5         12                   12        15
6         13                   13        12
7         12                   14        15
                               15        5


ANSWER

1. Range = 60 ; SD = 20

2. Range = 16 ; SD = 4.4

3. Range = 12 ; SD = 4.2