Chapter 2: Descriptive Statistics PART 1

Upload: jason-edington

Post on 12-Apr-2017


TRANSCRIPT

Page 1: Chapter 2 part 1

Chapter 2: Descriptive Statistics PART 1

Page 2: Chapter 2 part 1

Objective: Summarize the main characteristics of a data set such as

Measures of location: Where are the data centered (midpoint)?

Measures of variation: Dispersion

Distribution: What is the shape of the distribution of the data?

Outliers: Are there extreme observations?

Temporal or spatial patterns: Do the data exhibit changes over time/space?

We will learn both graphical and numerical ways to explore these concepts

Page 3: Chapter 2 part 1

Stem-and-Leaf Graphs (Stem Plots), Line Graphs, and Bar Graphs

The Stem-and-Leaf Graph, or Stem Plot, comes from the field of exploratory data analysis

It’s a good choice when the data sets are small

To create the plot, divide each observation of data into a stem and a leaf

The leaf consists of a final significant digit

Consider the following scores on a test:

90, 63, 76, 70, 85, 90, 43, 75, 95, 88, 76, 90, 82, 90, 101, 100, 95, 76, 38, 65, 84, 91

It can be helpful to reorder them: 38, 43, 63, 65, 70, 75, 76, 76, 76, 82, 84, 85, 88, 90, 90, 90, 90, 91, 95, 95, 100, 101

Stem | Leaf
   3 | 8
   4 | 3
   5 |
   6 | 3 5
   7 | 0 5 6 6 6
   8 | 2 4 5 8
   9 | 0 0 0 0 1 5 5
  10 | 0 1

This is a great way to visualize the shape of the data

You can look for an overall pattern, as well as any outliers

An outlier is an observation of data that does not fit the pattern of the graph

What are the outliers here?
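The stem plot above can also be built programmatically. The following is a minimal sketch, assuming whole-number scores where the leaf is the ones digit and the stem is everything to its left; the function name `stem_plot` is hypothetical and not from the slides.

```python
from collections import defaultdict


def stem_plot(values):
    """Return the rows of a stem-and-leaf plot for whole-number data.

    Assumption: the leaf is the final (ones) digit and the stem is the
    remaining leading digits, as in the test-score example.
    """
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)

    rows = []
    # Include empty stems (such as 5) so gaps in the data stay visible.
    for stem in range(min(stems), max(stems) + 1):
        leaves = " ".join(str(leaf) for leaf in stems.get(stem, []))
        rows.append(f"{stem:>2} | {leaves}".rstrip())
    return rows


scores = [90, 63, 76, 70, 85, 90, 43, 75, 95, 88, 76,
          90, 82, 90, 101, 100, 95, 76, 38, 65, 84, 91]

rows = stem_plot(scores)
for row in rows:
    print(row)
```

Keeping the empty stem 5 in the output is deliberate: the gap between the 30s/40s and the 60s is what makes 38 and 43 stand out from the rest of the scores.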

Page 4: Chapter 2 part 1

Outliers

An outlier is also sometimes called an extreme value

Some outliers are due to mistakes (writing down the wrong number, for example)

The presence of an outlier (or outliers) can also mean that something unusual is happening

You try it: The following data show the distances (in miles) from homes of off-campus statistics students to their college.

Create a stemplot using the data and identify any outliers

0.5; 0.7; 1.1; 1.2; 1.2; 1.3; 1.3; 1.5; 1.5; 1.7; 1.7; 1.8; 1.9; 2.0; 2.2; 2.5; 2.6; 2.8; 2.8; 2.8; 3.5; 3.8; 4.4; 4.9; 4.9; 5.2; 5.5; 5.7; 5.8; 8.0

Stem | Leaf
   0 | 5 7
   1 | 1 2 2 3 3 5 5 7 7 8 9
   2 | 0 2 5 6 8 8 8
   3 | 5 8
   4 | 4 9 9
   5 | 2 5 7 8
   6 |
   7 |
   8 | 0

Page 5: Chapter 2 part 1

Side-by-side stem-and-leaf plot of ages of presidents at their inauguration and death

                           Ages at Inauguration | Stem | Ages at Death
                              9 9 8 7 7 7 6 3 2 |  4   | 6 9
8 7 7 7 7 6 6 6 5 5 5 5 4 4 4 4 4 2 1 1 1 1 1 0 |  5   | 3 6 6 7 7 8
                              9 5 4 4 2 1 1 1 0 |  6   | 0 0 3 3 4 4 5 6 7 7 7 8
                                                |  7   | 0 0 0 1 1 1 4 7 8 8 9
                                                |  8   | 0 1 3 5 8
                                                |  9   | 0 0 3 3

Page 6: Chapter 2 part 1

Figure 2.2

Another type of graph that is useful for specific data values is a line graph

In a survey, 40 mothers were asked how many times per week a teenager must be reminded to do his or her chores.

In this line graph, the x-axis (horizontal axis) consists of data values and the y-axis (vertical axis) consists of frequency points

The frequency points are then connected using line segments

Number of times teenager is reminded | Frequency
                                   0 | 2
                                   1 | 5
                                   2 | 8
                                   3 | 14
                                   4 | 7
                                   5 | 4

Page 7: Chapter 2 part 1

Bivariate Data (Two variables)

Sometimes we want to look at two variables at once

Suppose we want to study the connection between people’s ages and the number of pets they have

Here, the ordered pair is (age, # of pets)

(19, 2), (23, 2), (18, 4), (18, 2), (28, 0), (19, 3), (37, 1), (20, 0), (34, 0), (40, 1), (18, 27), (19, 0), (18, 2), (18, 1), (18, 4), (20, 1), (19, 3), (26, 2), (23, 2), (29, 1), (23, 0), (19, 5), (19, 10), (29, 0), (19, 2), (19, 0)

This is called a Scatter Plot

Can you see what (18, 27) does to the graph?

What are the outliers here? It’s easy to see that (18, 27) is an outlier; are there others?

Let’s take this one out and see what things look like now…

[Scatter plot of the ordered pairs: horizontal axis (age) from 15 to 45, vertical axis (Number of Pets) from 0 to 30]

Page 8: Chapter 2 part 1

Bivariate Data (Two variables)

This makes it a little easier to see what kind of variability is going on

Was this ‘OK’ to do?

Outliers happen, sometimes from mistakes, sometimes simply because they do exist

If you remove an outlier to look at the data, you should note that you have done so

[Scatter plot with (18, 27) removed: horizontal axis (age) from 15 to 45, vertical axis (Number of Pets) from 0 to 12]
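The effect of dropping (18, 27) can be checked numerically as well as visually. This is a small sketch using the ordered pairs from the slide; the variable names are illustrative, and the removed point is kept in its own list so the removal stays documented.

```python
# (age, number of pets) pairs from the slide
pairs = [
    (19, 2), (23, 2), (18, 4), (18, 2), (28, 0), (19, 3), (37, 1),
    (20, 0), (34, 0), (40, 1), (18, 27), (19, 0), (18, 2), (18, 1),
    (18, 4), (20, 1), (19, 3), (26, 2), (23, 2), (29, 1), (23, 0),
    (19, 5), (19, 10), (29, 0), (19, 2), (19, 0),
]

# Keep a record of what was removed instead of silently deleting it.
removed = [p for p in pairs if p == (18, 27)]
trimmed = [p for p in pairs if p != (18, 27)]

max_pets_before = max(pets for _, pets in pairs)
max_pets_after = max(pets for _, pets in trimmed)

print(f"points: {len(pairs)} -> {len(trimmed)}, removed: {removed}")
print(f"largest pet count: {max_pets_before} -> {max_pets_after}")
```

With the outlier gone, the largest pet count drops from 27 to 10, which is why the vertical axis of the second plot can be so much shorter.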

Page 9: Chapter 2 part 1

Bar Graphs

Bar Graphs consist of bars that are separated from each other

The bars can be rectangles or they can be rectangular boxes (3-D Plots)

The bars can be vertical or horizontal

[Bar graph: category labels as extracted: Purple Blue Pink Black Green Orange Yellow Burgundy Gold Silver Turquoise Wine Red; vertical axis from 0 to 7]

Page 10: Chapter 2 part 1

Histograms

For most of the work we’ll do in this book, we’ll use a histogram to display the data

One advantage of a histogram is that it can readily display large data sets

A histogram consists of adjoining (touching) boxes. It has both a horizontal axis and a vertical axis

The horizontal axis is labeled with what the data represent

The vertical axis is labeled either with frequency or relative frequency (or percent frequency or probability)

The histogram can give you the shape of the data, the center, and the spread of the data

Here are our class’s heights divided into 5 classes

[Histogram of class heights in 5 classes; vertical axis: Relative Frequency; bar labels: 19%, 38%, 27%, 12%, 3.8%]

Page 11: Chapter 2 part 1

Histograms

The relative frequency is equal to the frequency for an observed value of the data, divided by the total number of data values in the sample

Recall, frequency is defined as the number of times an answer occurs

f = frequency
n = total number of data values (or sum of the individual frequencies)
RF = relative frequency
RF = f/n

For example, if there are 30 students in this class, and four of you get a grade of A (between 90% and 100%), then f = 4, n = 30, and RF = 4/30 ≈ 0.133, so about 13.3% of the class received 90-100%.
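The RF = f/n calculation is simple enough to check in a few lines. The sketch below reproduces the grade example and, as a second illustration, applies the same formula to the frequency table from the chores survey of 40 mothers; the helper name `relative_frequency` is an assumption, not something from the slides.

```python
def relative_frequency(f, n):
    """RF = f / n, where f is a frequency and n the total number of data values."""
    return f / n


# Grade example: 4 A grades out of 30 students.
rf_a = relative_frequency(4, 30)
print(f"RF = {rf_a:.3f} = {rf_a:.1%}")

# Chores survey: reminders per week (0 through 5) and their frequencies.
frequencies = [2, 5, 8, 14, 7, 4]
n = sum(frequencies)  # n is the sum of the individual frequencies
relative = [relative_frequency(f, n) for f in frequencies]
print(n, relative)
```

Relative frequencies always sum to 1 (up to rounding), which is a quick sanity check on any table built this way.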


Page 12: Chapter 2 part 1

Constructing a Histogram

First, decide how many bars or intervals, called classes, represent the data

Many histograms contain between 5 and 15 bars or classes for clarity

The number of bars needs to be chosen

One method is to take the square root of the number of data values and round. So, if you had 150 data values, you might choose 12 classes (the square root of 150 is about 12.2)

Choose a starting point for the first interval that is less than the smallest data value

A convenient starting point is a lower value carried out to one more decimal place than the value with the most decimal places

Suppose you have a data point that is 11.5, and your smallest value is 7. The data have at most one decimal place, so subtract 0.05: 7 - 0.05 = 6.95 would be a good place to start
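Both rules of thumb above can be written as small helper functions. This is a sketch under stated assumptions: the function names are hypothetical, and `max_decimals` (the largest number of decimal places appearing in the data) is supplied by the caller rather than detected automatically.

```python
import math


def suggested_class_count(n):
    """Square-root rule of thumb: round sqrt(n) to the nearest whole number."""
    return round(math.sqrt(n))


def convenient_start(smallest, max_decimals):
    """Starting point just below the smallest value.

    Subtracts 5 in the decimal place one beyond the data's precision,
    for example 0.05 when the data have at most one decimal place.
    """
    offset = 5 / 10 ** (max_decimals + 1)
    return round(smallest - offset, max_decimals + 1)


classes = suggested_class_count(150)
start = convenient_start(7, 1)
print(classes, start)
```

The square-root rule is only one of several conventions for choosing the number of classes; it is used here because it is the one the slide describes.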

Take a look at page 75 on your own to see other examples of finding the convenient starting points (read through last paragraph on the page)

Let’s take a look at Example 2.7 on page 76

Page 13: Chapter 2 part 1

Example 2.7 – Heights of 100 male semiprofessional soccer players

Smallest data value is 60, and the data has one decimal place

The starting point will have two decimal places: 60 - 0.05 = 59.95 (starting value)

Largest data value is 74, so 74 + 0.05 = 74.05 will be our ending value

Calculate the width of each bar or class interval. You’ll need to decide on the number of bars to use. For this example, we’ll choose 8

Width = (74.05 - 59.95) / 8 = 1.7625, or about 1.76

We’ll round up to two (2) and make each bar two units wide

We do this to prevent a value from falling on a boundary

Then, the boundaries will be: 59.95, 61.95, 63.95…75.95 (covers all the data, goes up by 2)

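Gather all of the values that fall between each pair of consecutive boundaries and count them

[Histogram of Heights; horizontal axis: class boundaries, vertical axis: frequency from 0 to 45]

Class (Heights)  | Frequency
59.95 to 61.95   | 5
61.95 to 63.95   | 3
63.95 to 65.95   | 15
65.95 to 67.95   | 40
67.95 to 69.95   | 17
69.95 to 71.95   | 12
71.95 to 73.95   | 7
73.95 to 75.95   | 1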
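The class width and boundaries for Example 2.7 can be reproduced with a short calculation. The sketch below uses the starting value, ending value, and number of bars from the slide; the frequencies are the eight bar labels read from the figure, and the variable names are illustrative only.

```python
import math

start, end, bars = 59.95, 74.05, 8

# Raw width is about 1.76; round up to the next whole number.
raw_width = (end - start) / bars
width = math.ceil(raw_width)

# bars + 1 boundaries; round() removes floating-point noise such as 61.950000000001.
boundaries = [round(start + i * width, 2) for i in range(bars + 1)]
print(width, boundaries)

# Frequencies per class, read from the bar labels in the figure.
frequencies = [5, 3, 15, 40, 17, 12, 7, 1]
total = sum(frequencies)
relative = [f / total for f in frequencies]
for lo, hi, f, rf in zip(boundaries, boundaries[1:], frequencies, relative):
    print(f"{lo:.2f} to {hi:.2f}: f = {f:>2}, RF = {rf:.2f}")
```

Rounding the width up rather than to the nearest value guarantees that the last boundary (75.95) lies beyond the largest data value, so every observation lands in some class.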

Page 14: Chapter 2 part 1

Figure 2.5

You build a relative frequency histogram in a similar manner, but instead of plotting how many values fall in each class, you plot that count divided by the total number of data values. (In this case the percentages match the counts, since there were 100 players.)

Page 15: Chapter 2 part 1

Measures of the Location of Data

Quartiles and Percentiles

Find the IQR for 1, 4, 7, 7, 9, 12, 25

Are there any potential outliers?

To find the kth percentile:

k = the kth percentile
i = the index (ranking or position of a data value)
n = the total number of data values

Order the data from smallest to largest

Calculate i = (k/100)(n + 1)

If i is an integer, then the kth percentile is the data value in the ith position in the ordered set of data

If i is not an integer, then round i up and round i down to the nearest integers. Average the two data values in these two positions in the ordered data set.
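The percentile steps can be applied to the IQR exercise on this slide (1, 4, 7, 7, 9, 12, 25). The sketch below rests on two assumptions that are not spelled out in the transcript: the index is computed as i = (k/100)(n + 1), and potential outliers are flagged with the common 1.5 × IQR rule. The function name `percentile` is hypothetical.

```python
import math


def percentile(data, k):
    """kth percentile using the index i = (k/100)(n + 1) (assumed formula).

    If i is a whole number, return the value in position i (1-based).
    Otherwise average the values in the positions just below and above i.
    """
    ordered = sorted(data)
    n = len(ordered)
    i = k * (n + 1) / 100
    # Clamp to valid 1-based positions for very small or very large k.
    lo = min(max(math.floor(i), 1), n)
    hi = min(max(math.ceil(i), 1), n)
    return (ordered[lo - 1] + ordered[hi - 1]) / 2


data = [1, 4, 7, 7, 9, 12, 25]
q1 = percentile(data, 25)
q3 = percentile(data, 75)
iqr = q3 - q1

# Assumed rule: values beyond 1.5 * IQR from the quartiles are potential outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
potential_outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, q3, iqr, potential_outliers)
```

For this data set the quartiles are 4 and 12, the IQR is 8, and the fences sit at -8 and 24, so 25 is the only value flagged as a potential outlier.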

Let’s look at Try it 2.17 on page 91