chapter 2 part 1
Chapter 2: Descriptive Statistics, Part 1
Objective: Summarize the main characteristics of a data set, such as:
Measures of location: Where are the data centered (midpoint)?
Measures of variation: Dispersion
Distribution: What is the shape of the distribution of the data?
Outliers: Are there extreme observations?
Temporal or spatial patterns: Do the data exhibit changes over time/space?
We will learn both graphical and numerical ways to explore these concepts
Stem-and-Leaf Graphs (Stem Plots), Line Graphs, and Bar Graphs
The Stem-and-Leaf Graph, or Stem Plot, comes from the field of exploratory data analysis
It’s a good choice when the data sets are small
To create the plot, divide each observation of data into a stem and a leaf
The leaf consists of the final significant digit
Consider the following scores on a test:
90, 63, 76, 70, 85, 90, 43, 75, 95, 88, 76, 90, 82, 90, 101, 100, 95, 76, 38, 65, 84, 91
It can be helpful to reorder them:
38, 43, 63, 65, 70, 75, 76, 76, 76, 82, 84, 85, 88, 90, 90, 90, 90, 91, 95, 95, 100, 101
Stem | Leaf
   3 | 8
   4 | 3
   5 |
   6 | 3 5
   7 | 0 5 6 6 6
   8 | 2 4 5 8
   9 | 0 0 0 0 1 5 5
  10 | 0 1
This is a great way to visualize the shape of the data
You can look for an overall pattern, as well as any outliers
An outlier is an observation of data that does not fit the pattern of the graph
What are the outliers here?
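The stem-and-leaf construction can be sketched in a few lines of Python. This is a minimal sketch, assuming whole-number data where the leaf is the last digit and the stem is everything before it; the function name `stem_plot` is made up for illustration.

```python
from collections import defaultdict

def stem_plot(values):
    """Return the rows of a stem-and-leaf plot for non-negative whole numbers.

    Assumption: the leaf is the final digit, the stem is everything before it.
    """
    leaves = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        leaves[stem].append(leaf)
    rows = []
    # Walk every stem from smallest to largest so empty stems (gaps) stay visible.
    for stem in range(min(leaves), max(leaves) + 1):
        row = " ".join(str(leaf) for leaf in leaves.get(stem, []))
        rows.append(f"{stem:>2} | {row}".rstrip())
    return rows

scores = [90, 63, 76, 70, 85, 90, 43, 75, 95, 88, 76,
          90, 82, 90, 101, 100, 95, 76, 38, 65, 84, 91]

for row in stem_plot(scores):
    print(row)
```

Keeping the empty stem 5 in the output is deliberate: the gap is part of what makes the low scores stand apart from the rest.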
Outliers
An outlier is also sometimes called an extreme value
Some outliers are due to mistakes (writing down the wrong number, for example)
The presence of an outlier (or outliers) can also mean that something unusual is happening
You try it: The following data show the distances (in miles) from homes of off-campus statistics students to their college.
Create a stemplot using the data and identify any outliers
0.5; 0.7; 1.1; 1.2; 1.2; 1.3; 1.3; 1.5; 1.5; 1.7; 1.7; 1.8; 1.9; 2.0; 2.2; 2.5; 2.6; 2.8; 2.8; 2.8; 3.5; 3.8; 4.4; 4.9; 4.9; 5.2; 5.5; 5.7; 5.8; 8.0
Stem | Leaf
   0 | 5 7
   1 | 1 2 2 3 3 5 5 7 7 8 9
   2 | 0 2 5 6 8 8 8
   3 | 5 8
   4 | 4 9 9
   5 | 2 5 7 8
   6 |
   7 |
   8 | 0
Side-by-side stem-and-leaf plot of ages of presidents at their inauguration and death
Ages at Inauguration | Stem | Ages at Death
9 9 8 7 7 7 6 3 2 | 4 | 6 9
8 7 7 7 7 6 6 6 5 5 5 5 4 4 4 4 4 2 1 1 1 1 1 0 | 5 | 3 6 6 7 7 8
9 5 4 4 2 1 1 1 0 | 6 | 0 0 3 3 4 4 5 6 7 7 7 8
 | 7 | 0 0 0 1 1 1 4 7 8 8 9
 | 8 | 0 1 3 5 8
 | 9 | 0 0 3 3
Figure 2.2
Another type of graph that is useful for specific data values is a line graph
In a survey, 40 mothers were asked how many times per week a teenager must be reminded to do his or her chores.
In this line graph, the x-axis (horizontal axis) consists of data values and the y-axis (vertical axis) consists of frequency points
The frequency points are then connected using line segments
Number of times teenager is reminded | Frequency
0 | 2
1 | 5
2 | 8
3 | 14
4 | 7
5 | 4
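The points of the line graph can be listed from the frequency table. This is a small sketch, reading the table as the values 0 through 5 with frequencies 2, 5, 8, 14, 7, 4; the variable names are only for illustration.

```python
# Frequency table from the survey: data value -> number of mothers giving that answer.
frequency = {0: 2, 1: 5, 2: 8, 3: 14, 4: 7, 5: 4}

# Each point of the line graph is (data value, frequency), ordered by data value.
points = sorted(frequency.items())

# Consecutive points are joined by line segments.
segments = list(zip(points, points[1:]))

# Sanity check: the frequencies should add up to the number of mothers surveyed.
total = sum(frequency.values())

print(points)
print(total)  # 40
```

The total of 40 matches the number of mothers in the survey, which is a quick check that the table was read correctly.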
Bivariate Data (Two variables)
Sometimes we want to look at two variables at once
Suppose we want to study the connection between people’s ages and the number of pets they have
Here, the ordered pair is (age, # of pets):
(19, 2), (23, 2), (18, 4), (18, 2), (28, 0), (19, 3), (37, 1), (20, 0), (34, 0), (40, 1), (18, 27), (19, 0), (18, 2), (18, 1), (18, 4), (20, 1), (19, 3), (26, 2), (23, 2), (29, 1), (23, 0), (19, 5), (19, 10), (29, 0), (19, 2), (19, 0)
Can you see what (18, 27) does to the graph?
This is called a Scatter Plot
What are the outliers here?
It’s easy to see that (18, 27) is an outlier; are there others?
Let’s take this one out and see what things look like now…
[Scatter plot: ages from 15 to 45 on the horizontal axis; Number of Pets from 0 to 30 on the vertical axis]
This allows us to see what kind of variability is going on a little more easily
Was this ‘OK’ to do?
Outliers happen, sometimes from mistakes, sometimes simply because they do exist
You should note that you have removed an outlier to look at the data
[Scatter plot with (18, 27) removed: ages from 15 to 45 on the horizontal axis; Number of Pets from 0 to 12 on the vertical axis]
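Why the vertical scale shrinks once (18, 27) is taken out can be checked directly from the data. This is a minimal sketch; it simply drops the one point named in the slides and keeps a record of what was removed, and it is not a general rule for detecting outliers.

```python
# (age, number of pets) pairs from the slides.
pairs = [(19, 2), (23, 2), (18, 4), (18, 2), (28, 0), (19, 3), (37, 1), (20, 0),
         (34, 0), (40, 1), (18, 27), (19, 0), (18, 2), (18, 1), (18, 4), (20, 1),
         (19, 3), (26, 2), (23, 2), (29, 1), (23, 0), (19, 5), (19, 10), (29, 0),
         (19, 2), (19, 0)]

outlier = (18, 27)

# Keep the removed point on record so the change to the data is documented.
removed = [p for p in pairs if p == outlier]
trimmed = [p for p in pairs if p != outlier]

max_pets_all = max(pets for _, pets in pairs)
max_pets_trimmed = max(pets for _, pets in trimmed)

print(max_pets_all)      # 27
print(max_pets_trimmed)  # 10
print(removed)           # [(18, 27)]
```

With the point included, the vertical axis has to reach 27, which squeezes every other point toward the bottom; without it the largest value is 10, so the rest of the data spreads out.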
Bar Graphs
Bar Graphs consist of bars that are separated from each other
The bars can be rectangles or they can be rectangular boxes (3-D Plots)
The bars can be vertical or horizontal
[Bar graph: one bar per color (Purple, Blue, Pink, Black, Green, Orange, Yellow, Burgundy, Gold, Silver, Turquoise, Wine, Red); vertical axis from 0 to 7]
Histograms
For most of the work we’ll do in this book, we’ll use a histogram to display the data
One advantage of a histogram is that it can readily display large data sets
A histogram consists of adjoining (touching) boxes
It has both a horizontal axis and a vertical axis
The horizontal axis is labeled with what the data represent
The vertical axis is labeled either with frequency or relative frequency (or percent frequency or probability)
The histogram can give you the shape of the data, the center, and the spread of the data
Here are our class’s heights divided into 5 classes
[Histogram: vertical axis Relative Frequency; bar labels 19%, 38%, 27%, 12%, 3.8%]
The relative frequency is equal to the frequency for an observed value of the data, divided by the total number of data values in the sample
Recall, frequency is defined as the number of times an answer occurs
f = frequency
n = total number of data values (or sum of the individual frequencies)
RF = relative frequency
RF = f/n
For example, if there are 30 students in this class, and four of you get a grade of A (between 90% and 100%), then f = 4, n = 30, and RF = 4/30 ≈ 0.133, so 13.3% of the class received 90-100%.
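The formula RF = f/n is easy to check with a couple of lines of Python. This is a minimal sketch; the function name `relative_frequency` is made up for illustration, and the second part reuses the chores survey (40 mothers) from earlier in the chapter.

```python
import math

def relative_frequency(f, n):
    """RF = f / n, where f is a frequency and n is the total number of data values."""
    return f / n

# The grade example: 4 students out of 30 earned an A.
rf = relative_frequency(4, 30)
print(f"{rf:.3f}")  # 0.133
print(f"{rf:.1%}")  # 13.3%

# Relative frequencies for a whole table (chores survey, n = 40).
frequency = {0: 2, 1: 5, 2: 8, 3: 14, 4: 7, 5: 4}
n = sum(frequency.values())
rel = {value: relative_frequency(f, n) for value, f in frequency.items()}

# The relative frequencies of a complete table add up to 1 (100%).
print(math.isclose(sum(rel.values()), 1.0))  # True
```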
Constructing a Histogram
First, decide how many bars or intervals, called classes, represent the data
Many histograms contain between 5 and 15 bars or classes for clarity
The number of bars needs to be chosen
One method used is to take the square root of the number of data values, and round
So, if you had 150 data values, you might choose 12 classes (√150 ≈ 12.2)
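The square-root method is a one-liner. This is a minimal sketch of that rule of thumb only; the function name `suggested_classes` is made up, and the result is a starting suggestion, not a requirement.

```python
import math

def suggested_classes(n_values):
    """Square-root rule of thumb: round sqrt(n) to the nearest whole number."""
    return round(math.sqrt(n_values))

print(suggested_classes(150))  # 12
```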
Choose a starting point for the first interval to be less than the smallest data value
A convenient starting point is a lower value carried out to one more decimal place than the value with the most decimal places
Suppose the value with the most decimal places is 11.5 (one decimal place), and your smallest value is 7. Then 7 - 0.05 = 6.95 would be a good place to start.
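The convenient starting point can be worked out mechanically. This is a sketch under one assumption: the data are passed in as strings so that the number of recorded decimal places is known. The function name `starting_point` is made up for illustration.

```python
from decimal import Decimal

def starting_point(values):
    """Smallest value minus half a unit in the next decimal place.

    Assumption: values are strings such as "11.5", so recorded decimals are kept.
    """
    decimals = [Decimal(v) for v in values]
    # Largest number of decimal places found anywhere in the data (at least 0).
    places = max(0, max(-d.as_tuple().exponent for d in decimals))
    # Half a unit one decimal place further out, e.g. 0.05 for one-decimal data.
    offset = Decimal(5).scaleb(-(places + 1))
    return min(decimals) - offset

print(starting_point(["11.5", "7"]))  # 6.95
print(starting_point(["2", "5", "9"]))  # 1.5
```

Using `Decimal` instead of floats keeps results like 6.95 exact, which matters because the whole point is that no data value should land on a boundary.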
Take a look at page 75 on your own to see other examples of finding the convenient starting points (read through last paragraph on the page)
Let’s take a look at Example 2.7 on page 76
Example 2.7 – Heights of 100 male semiprofessional soccer players
Smallest data value is 60, and the data has one decimal place
Starting point will have two decimal places
Start with 60 - 0.05 = 59.95 (starting value)
Largest data value is 74
74 + 0.05 = 74.05 will be our ending value
Calculate the width of each bar or class interval
You’ll need to decide on the number of bars to use
For this example, we’ll choose 8
Width = (74.05 - 59.95) / 8 = 1.7625
We’ll round up to two (2) and make each bar two units wide
We do this to prevent a value from falling on a boundary
Then, the boundaries will be: 59.95, 61.95, 63.95…75.95 (covers all the data, goes up by 2)
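The boundaries for this example can be generated in Python. This is a minimal sketch, assuming the width is found as (ending value - starting value) / number of classes and then rounded up to the next whole number, which is consistent with the width of 2 used here.

```python
from decimal import Decimal
import math

start = Decimal("59.95")   # starting value
end = Decimal("74.05")     # ending value
n_classes = 8              # number of bars chosen for this example

raw_width = (end - start) / n_classes  # 1.7625
width = math.ceil(raw_width)           # rounded up to 2

# n_classes bars need n_classes + 1 boundaries.
boundaries = [start + i * width for i in range(n_classes + 1)]

print(raw_width)
print(width)
print([str(b) for b in boundaries])
```

Because the width was rounded up, the last boundary (75.95) lands past the ending value of 74.05, so the eight classes cover all of the data.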
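Gather (count) all of the values that fall between each pair of consecutive boundaries
[Histogram: horizontal axis Heights with class boundaries 59.95, 61.95, …, 75.95; vertical axis from 0 to 45; bar labels (frequencies), left to right: 5, 3, 15, 40, 17, 12, 7, 1]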
Figure 2.5
You find relative frequency in a similar manner, but instead of plotting how many are in each class, plot how many are in the class divided by the total number of data values. (In this case the counts and the percentages are the same numbers, since there were 100 players.)
Measures of the Location of Data
Quartiles and Percentiles
Find the IQR for 1, 4, 7, 7, 9, 12, 25
Are there any potential outliers?
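One way to work this example in Python is sketched below. Two assumptions are made that the slide does not spell out: quartiles are taken as the medians of the lower and upper halves (leaving the overall median out when n is odd), and a value is flagged as a potential outlier when it lies more than 1.5 × IQR below Q1 or above Q3. Other quartile conventions can give slightly different answers.

```python
from statistics import median

def quartiles(data):
    """Return (Q1, median, Q3) using medians of the lower and upper halves.

    Assumption: when n is odd, the overall median is left out of both halves.
    Needs at least two data values.
    """
    s = sorted(data)
    half = len(s) // 2
    lower = s[:half]
    upper = s[-half:]
    return median(lower), median(s), median(upper)

data = [1, 4, 7, 7, 9, 12, 25]
q1, q2, q3 = quartiles(data)
iqr = q3 - q1

# Assumed rule: potential outliers lie more than 1.5 * IQR beyond the quartiles.
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
potential_outliers = [x for x in data if x < low_fence or x > high_fence]

print(q1, q2, q3)             # 4 7 12
print(iqr)                    # 8
print(low_fence, high_fence)  # -8.0 24.0
print(potential_outliers)     # [25]
```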
To find the kth percentile:
k = the kth percentile
i = the index (ranking or position of a data value)
n = the total number of data values
Order the data from smallest to largest
Calculate i = (k/100)(n + 1)
If i is an integer, then the kth percentile is the data value in the ith position in the ordered set of data
If i is not an integer, then round i up and round i down to the nearest integers
Average the two data values in these two positions in the ordered data set
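The percentile steps can be sketched in Python as follows. This assumes the index formula i = (k/100)(n + 1) and a whole-number k; the function name `kth_percentile` is made up for illustration, and statistical software often uses other percentile conventions.

```python
from fractions import Fraction
import math

def kth_percentile(data, k):
    """kth percentile using the index i = (k/100)(n + 1), with 1-based positions.

    Assumption: k is a whole number between 1 and 99.
    """
    ordered = sorted(data)
    n = len(ordered)
    i = Fraction(k, 100) * (n + 1)  # exact arithmetic, no float round-off
    lo, hi = math.floor(i), math.ceil(i)
    if lo < 1 or hi > n:
        raise ValueError("index falls outside the data for this k and n")
    if lo == hi:
        # i is an integer: take the value in the ith position.
        return ordered[lo - 1]
    # i is not an integer: average the values on either side of it.
    return (ordered[lo - 1] + ordered[hi - 1]) / 2

data = [1, 4, 7, 7, 9, 12, 25]
print(kth_percentile(data, 25))  # 4
print(kth_percentile(data, 50))  # 7
print(kth_percentile(data, 60))  # 8.0
```

For k = 60 and n = 7, i = 4.8, so the result is the average of the 4th and 5th ordered values, (7 + 9) / 2 = 8.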
Let’s look at Try it 2.17 on page 91