
Algebra 2: Statistics


Random Raw Data: Since the dawn of civilization, people have had the urge to count things. In fact, the earliest forms of writing were invented for counting. As civilizations grew, so did the number of things that needed to be counted. This created a new challenge. Sometimes it’s impossible to count all the things you want to know about, which is why, long ago, someone dreamed up the strategy of studying a SAMPLE to learn something about an entire POPULATION.

There are a few facts to keep in mind before we actually start taking samples. First, it is impossible to use a sample to achieve absolute certainty about a population. That is why statistics is about making our best possible guess, and never about being certain. Second, if we’re stuck with a single sample, we’d better be sure we collected it CAREFULLY, because any mistakes we make when we collect our sample can totally screw up what we conclude about the larger population.

Perhaps the biggest challenge in collecting a sample is figuring out exactly what to include in it.

The goal is to avoid any BIAS in our sample that might lead us to MISCHARACTERIZE THE POPULATION.

Ideally, we’d like to gather a sample that accurately mirrors the population. To avoid bias, we ALWAYS

collect samples RANDOMLY.
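One way to picture random collection is the minimal Python sketch below. The population of 1,000 ID numbers, the sample size of 50, and the seed are all assumptions made up for illustration; they are not from these notes.

```python
import random

# Hypothetical population: ID numbers 1 through 1000 (assumed for illustration).
population = list(range(1, 1001))

# A fixed seed makes the example repeatable; drop it for a fresh draw each run.
rng = random.Random(42)

# rng.sample picks 50 distinct members, each equally likely to be chosen,
# so no member of the population is favored over another.
sample = rng.sample(population, 50)

print(len(sample), len(set(sample)))  # sample size and number of distinct IDs
```

Because every ID has the same chance of being picked, nothing about how we choose can tilt the sample toward one part of the population.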

Types of Statistical Studies: The way you collect data falls into one of 4 different types of studies: Observational, Experimental, Simulation and Survey.

Observational:

A researcher observes and measures characteristics of interest in part of a population, but does not interfere with or change existing conditions.

Example: A person sits on the side of a road counting vehicles running a red light at a busy intersection.

Experimental:

A treatment is applied to part of a population and responses are observed. Another part of the population may be used as a control group, in which no treatment is applied.

Example: A study was performed where diabetics took cinnamon extract daily while a control group took none. After 4 days, the diabetics who took the cinnamon reduced their risk of heart disease, whereas the control group experienced no change.

Simulation:

The use of a mathematical or physical model to reproduce the conditions of a situation or process. Collecting data often involves the use of computers. Simulations allow you to study situations that are impractical or even dangerous to create in real life, and often they save time and money.



Example: The insurance institute uses crash test dummies to determine the physical damage

done to the human body due to side impact crashes in a Smart Car.

Survey:

An investigation of one or more characteristics of a population. Most often, surveys are carried out on people by asking them questions. The most common types of surveys are done by interview, mail, computer or telephone. In designing a survey, it is important to word the questions so that they do not lead to biased results, which would not be representative of the population.

Example: A questionnaire was sent out to new physicians to determine whether the primary reason for their career choice is financial stability.

Sorting the Data:

Categorical Data:

When we’re studying features that we can describe only with words or a yes/no answer, the result is considered categorical data. After we gather categorical data we can easily pile it or slice it to give us a sense of proportions in our sample.

Numerical Data:

When we’re studying features that we can compare using numbers, the result is considered numerical data. As we’ll see in part two, all these numbers make numerical data much more useful overall.

The crucial difference between the two types of data is that we can’t do math on categorical data, but we can do math on numerical data!

For better or worse, most of our brains aren’t great at processing large piles of raw numbers. So the first thing we do after we’ve collected a big mess of numerical data is draw pictures with it.

The Histogram:

To draw a histogram of our sample, we start with a number line



Then we pile our data on top of it, piece by piece.
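The piling idea can be sketched in a few lines of Python. This is only a rough text version of a histogram, and the sample values are made up for illustration: each row is one spot on the number line, and each star is one data point piled on top of it.

```python
from collections import Counter

def text_histogram(values):
    """Return one line per whole-number value, with one '*' per data point."""
    counts = Counter(values)
    lines = []
    # Walk the number line from the smallest to the largest value,
    # including values that never occur (their pile is empty).
    for v in range(min(values), max(values) + 1):
        lines.append(f"{v:>3} | " + "*" * counts[v])
    return lines

# Hypothetical sample of whole-number measurements (assumed for illustration).
data = [2, 3, 3, 4, 4, 4, 5, 5, 7]
for line in text_histogram(data):
    print(line)
```

The tallest pile here sits at 4, and the empty row at 6 shows a gap in the data.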

The Box Plot:

Another useful way to visualize numerical data is with a BOX PLOT. To draw a box plot of our sample, we start with the same number line.

But in this case we cram the middle 50% of our sample values into one big box.

In general, we draw histograms when we want a complete portrait of our entire pile of data that includes precise details. On the other hand, box plots can be especially useful when we want an overview of our data, or want to compare different samples or groups. Box plots can give us a quick sense of how data clumps together and whether it trails off in one direction or another.

[Figure: box plot. Callouts: the box holds the middle of the data; then we indicate the minimum and the maximum individual values with bars.]
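The numbers behind a box plot can be sketched in Python as a five-number summary. This is a minimal sketch with made-up data. Note that quartile conventions vary between textbooks and calculators; `method="inclusive"` is just one choice and may not match the rule used in class.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    ordered = sorted(values)
    # quantiles(..., n=4) returns the three cut points that split the
    # data into quarters; the box spans from the first to the third.
    q1, med, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, med, q3, ordered[-1]

# Hypothetical sample (assumed for illustration).
data = [7, 1, 9, 3, 5, 2, 8, 4, 6]
print(five_number_summary(data))
```

The box runs from Q1 to Q3 (the middle 50% of the values), and the bars reach out to the minimum and maximum.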



Analyzing Data: Analyzing data is like solving a mystery. Our ultimate goal is to gather evidence from one random sample, and use it to piece together a story about a population. When we start to investigate any pile of data, we always look at four primary characteristics.

a. Sample size: How much data is in there?
   a. In general, a larger sample size is better
   b. The size of the sample is directly related to the level of confidence we can have about a population
   c. The size of a sample is always limited by something
b. Shape: What does the pile look like?
   a. Flat graph: all possible outcomes are equally likely
   b. Normal distribution (bell shape): something is causing the data to clump around one particular value

      i. Z-SCORE
         1. $z = \dfrac{x - \mu}{\sigma}$, where x is a single data point, μ is the population mean, and σ is the standard deviation.

      ii. Here is how to interpret z-scores.
         A z-score less than 0 represents a data point less than the mean.
         A z-score greater than 0 represents a data point greater than the mean.
         A z-score equal to 0 represents a data point equal to the mean.
         A z-score equal to 1 represents a data point that is 1 standard deviation greater than the mean; a z-score equal to 2, 2 standard deviations greater than the mean; etc.
         A z-score equal to -1 represents a data point that is 1 standard deviation less than the mean; a z-score equal to -2, 2 standard deviations less than the mean; etc.
         If the number of data points in the set is large and the data are approximately normal, about 68% of the elements have a z-score between -1 and 1; about 95% have a z-score between -2 and 2; and about 99.7% have a z-score between -3 and 3.



   c. Skewed: something is causing the data to trail off more in one direction than the other.
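The z-score percentages quoted for bell-shaped data can be checked with Python's standard library. This is a small sketch that assumes a perfectly normal distribution, which real data only approximates; the helper name `share_within` is made up for this example.

```python
from statistics import NormalDist

# The standard normal distribution: mean 0, standard deviation 1,
# so its values are already z-scores.
standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Fraction of a normal distribution with z-scores between -k and k."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(share_within(k) * 100, 1))
```

The three results come out close to the familiar 68%, 95% and 99.7% figures.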

c. Location: Where is it exactly on the number line?
   a. The measure of where the bulk of the data sits on the number line
   b. Defining location with words can be tricky, so often we describe it with a single number: the AVERAGE (AKA MEAN).
      i. To calculate the average, we simply add up all the data values, then divide by the number of data values

Example 1:

You take the SAT and score 1100. The mean score for the SAT is 1026 and the standard deviation is 209. How well did you score on the test compared to the average test taker?

Step 1: Write your x-value into the z-score equation. For this sample question the x-value is your SAT score, 1100.

Step 2: Write the mean, μ = 1026, into the z-score equation.

Step 3: Write the standard deviation, σ = 209, into the z-score equation.

Step 4: Calculate the answer using a calculator:

(1100 - 1026) / 209 = 0.354. This means that your score was 0.354 standard deviations above the mean.

Step 5: (Optional) Look up your z-value in the z-table to see what percentage of test-takers scored below you. Rounding z to 0.35, the table gives .1368 (the area between the mean and z). Adding the .5000 that lies below the mean gives .1368 + .5000 = .6368, or 63.68%.
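The same calculation can be sketched in Python. The function name `z_score` is made up for this example, and `NormalDist().cdf` stands in for the z-table lookup, assuming SAT scores are approximately normal.

```python
from statistics import NormalDist

def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / std_dev

# Values from Example 1: score 1100, mean 1026, standard deviation 209.
z = z_score(1100, 1026, 209)

# Share of a normal distribution that falls below this z-score.
share_below = NormalDist().cdf(z)

print(round(z, 3), round(share_below, 4))
```

The share printed here is slightly higher than the table's 63.68%, because the table lookup rounds z to 0.35 while the code uses the unrounded value.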



      ii. $\bar{x} = \dfrac{x_1 + x_2 + x_3 + \cdots + x_n}{n}$

   c. Although the average is useful and precise as a measure of location, it is not perfect. Unfortunately, averages can be deceptive. For example, if a pile of data is skewed, the average value can be seriously misleading. With skewed data, the MEDIAN is often more revealing as a measure of location because it can give a better sense of a “typical” value.
      i. To find the median, place the numbers in value order and find the middle.
      ii. BUT, with an even amount of numbers things are slightly different. In that case we find the middle pair of numbers, and then find the value that is halfway between them. This is easily done by adding them together and dividing by two.
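The two median rules above can be sketched in Python as follows. The data lists are made up for illustration, and the last one shows how a single extreme value drags the average but not the median.

```python
def median(values):
    """Middle value of the sorted data; for an even count, the midpoint of the middle pair."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: there is a single middle value.
        return ordered[mid]
    # Even count: add the middle pair together and divide by two.
    return (ordered[mid - 1] + ordered[mid]) / 2

# Hypothetical skewed data (assumed for illustration): one value is far larger.
skewed = [1, 2, 3, 4, 100]
average = sum(skewed) / len(skewed)

print(median([5, 1, 3]), median([4, 1, 3, 2]))
print(average, median(skewed))
```

For the skewed list the average is 22 while the median is 3, which is much closer to a “typical” value.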

Example 2:

The frequency table shows the number of job offers received by each student within two months of graduating with a mathematics degree from a small college. What is the mean number of job offers per student?

Job offers  0  1  2  3  4
Students    2  2  4  5  2

Mean: $\bar{x} = \dfrac{2(0) + 2(1) + 4(2) + 5(3) + 2(4)}{15} = \dfrac{33}{15} = 2.2$

The mean of this data set is 2.2.

** Multiply each value by the number of times it occurs. For example, 4 students each got 2 job offers, so that was a total of 8 job offers given.
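Here is a short Python sketch of the same frequency-table mean. The dictionary name `offers_to_students` is made up for this example; it maps each number of job offers to the number of students who received it.

```python
# Frequency table from Example 2: job offers -> number of students.
offers_to_students = {0: 2, 1: 2, 2: 4, 3: 5, 4: 2}

# Multiply each value by how often it occurs, then add the products.
total_offers = sum(value * count for value, count in offers_to_students.items())

# The number of data values is the total number of students.
total_students = sum(offers_to_students.values())

mean = total_offers / total_students
print(total_offers, total_students, mean)
```

This gives 33 offers shared among 15 students, matching the mean of 2.2.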



d. Spread: Where does the data start and end?
   a. The measure of the width of a pile of data, but also a measure of variation.
   b. The most common measure of spread is the STANDARD DEVIATION (s for a sample; σ, sigma, for a population)
      i. Our goal when we calculate standard deviation is to get a sense of the typical distance from the average value. Here’s how to do it (mostly in plain English :)
         1. Calculate the distance between each measurement x and the sample average $\bar{x}$. We call this distance the DEVIATION
         2. Square each deviation
         3. Add up all the squared deviations
         4. Divide the sum by n-1 (if we stop here we get what’s called the VARIANCE)
         5. Take the square root of the whole shebang.
      ii. $s = \sqrt{\dfrac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}}$
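The five steps can be followed one at a time in Python. This is a minimal sketch; the function name `sample_std_dev` and the data are made up for illustration.

```python
import math

def sample_std_dev(values):
    """Sample standard deviation, following the five steps one at a time."""
    n = len(values)
    mean = sum(values) / n
    deviations = [x - mean for x in values]   # Step 1: distance from the average
    squared = [d ** 2 for d in deviations]    # Step 2: square each deviation
    total = sum(squared)                      # Step 3: add up the squares
    variance = total / (n - 1)                # Step 4: divide by n - 1 (the variance)
    return math.sqrt(variance)                # Step 5: take the square root

# Hypothetical sample (assumed for illustration).
data = [1, 2, 3, 4, 5]
print(round(sample_std_dev(data), 3))
```

For these five values the squared deviations add up to 10, the variance is 10 / 4 = 2.5, and the standard deviation is the square root of 2.5.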

Example 3:

The frequency table shows the number of job offers received by each student within two months of graduating with a mathematics degree from a small college. What is the median number of job offers per student?

Job offers  0  1  2  3  4
Students    2  2  4  5  2

Median: 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4

n = 15, so the median is the middle term, term (15 + 1) / 2 = 8.

The median of the data is 2.

** List each value the number of times it occurs. For example, there are 5 students with 3 job offers, so 3 appears 5 times in the list.
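Listing each value once per occurrence can be sketched in Python. The names `offers_to_students` and `expanded` are made up for this example.

```python
import statistics

# Frequency table from the example: job offers -> number of students.
offers_to_students = {0: 2, 1: 2, 2: 4, 3: 5, 4: 2}

# Write each value once per student who received it, in value order.
expanded = [
    value
    for value, count in sorted(offers_to_students.items())
    for _ in range(count)
]

print(expanded)
print(len(expanded), statistics.median(expanded))
```

The expanded list has 15 entries, so the median is the 8th one.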



e. The last thing to consider when looking at data is the MODE
   a. The mode is the number that occurs the most often in the data set.
   b. There can be 2 modes (BIMODAL) or 3 modes (TRIMODAL).
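Python's standard library can report every mode at once, which makes bimodal data easy to spot. The data below are made up for illustration.

```python
from statistics import multimode

# Hypothetical data sets (assumed for illustration).
one_mode = [1, 2, 2, 3]
two_modes = [1, 2, 2, 3, 3, 4]

# multimode returns a list of all values tied for the highest count.
print(multimode(one_mode))
print(multimode(two_modes))
```

A result with two entries means the data set is bimodal; three entries would mean trimodal.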

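Example 4:

The frequency table shows the number of job offers received by each student within two months of graduating with a mathematics degree from a small college. What is the standard deviation of the number of job offers per student?

Job offers  0  1  2  3  4
Students    2  2  4  5  2

Standard deviation: $s = \sqrt{\dfrac{(0-2.2)^2 + (0-2.2)^2 + (1-2.2)^2 + (1-2.2)^2 + (2-2.2)^2 + \cdots + (4-2.2)^2 + (4-2.2)^2}{15 - 1}} = \sqrt{\dfrac{22.4}{14}} \approx 1.265$

n = 15

The standard deviation of the data is about 1.265.

** Include each squared deviation the number of times its value occurs. For example, there are 5 students with 3 job offers, so you should have $(3-2.2)^2$ five times.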
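This calculation can be checked with a Python sketch that works straight from the frequency table. Instead of writing a squared deviation several times, it multiplies by the count, which gives the same sum. The variable names are made up for this example.

```python
import math

# Frequency table from the example: job offers -> number of students.
offers_to_students = {0: 2, 1: 2, 2: 4, 3: 5, 4: 2}

n = sum(offers_to_students.values())
mean = sum(value * count for value, count in offers_to_students.items()) / n

# Each squared deviation is counted once per student with that value.
sum_squared = sum(
    count * (value - mean) ** 2
    for value, count in offers_to_students.items()
)

# Sample standard deviation: divide by n - 1, then take the square root.
s = math.sqrt(sum_squared / (n - 1))
print(n, round(sum_squared, 1), round(s, 3))
```

There are 15 students, so the sum of squared deviations is divided by 14.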

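Example 5:

The frequency table shows the number of job offers received by each student within two months of graduating with a mathematics degree from a small college. What is the mode of the number of job offers per student?

Job offers  0  1  2  3  4
Students    2  2  4  5  2

Mode: 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4

The mode of the data is 3, since it occurs the most often.

** List each value the number of times it occurs. For example, there are 5 students with 3 job offers, so 3 appears 5 times in the list.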



Using the TI calculator to find the mean, median and standard deviation:

Step 1: Use STAT EDIT to enter the data in L1 (dependent values)

Step 2: In STAT CALC select 1-Var Stats.

Example 6:

The table displays the number of US hurricane strikes by decade from 1851 to 2000. What are the mean and standard deviation for the data set?

Decade    1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
Strikes  19  15  20  22  21  18  21  13  19  24  17  14  12  15  14

Step 1: Use STAT EDIT to enter the data in L1



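The dependent data for this data set are the strikes; enter these into L1.

Step 2: Use STAT CALC and select 1-Var Stats.

Press ENTER until the 1-Var Stats results screen appears.

[Calculator screenshot: 1-Var Stats output. Callouts: sample mean, sample Std Dev, population Std Dev, population size. Scroll down on the calculator while still looking at 1-Var Stats to find the median.]

ANS: The mean is 17.6 and the standard deviation is about 3.5 (this is the population standard deviation; the sample standard deviation is about 3.6).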
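If no TI calculator is at hand, the same summary numbers can be cross-checked with Python's standard `statistics` module. This is a sketch for checking the answer, not part of the calculator procedure.

```python
import statistics

# Hurricane strikes per decade, 1851 to 2000, from the table in Example 6.
strikes = [19, 15, 20, 22, 21, 18, 21, 13, 19, 24, 17, 14, 12, 15, 14]

mean = statistics.mean(strikes)
population_sd = statistics.pstdev(strikes)  # divides by n
sample_sd = statistics.stdev(strikes)       # divides by n - 1
median = statistics.median(strikes)

print(len(strikes), mean, round(population_sd, 2), round(sample_sd, 2), median)
```

The population and sample standard deviations differ slightly because one divides by n = 15 and the other by n - 1 = 14.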
