preliminary definitions



INTRODUCTION

© 2013 Radha Bose Florida State University Department of Statistics

1/16

PRELIMINARY DEFINITIONS

Data is information. When we collect data, we are collecting information.

Subjects: The objects from whom, or about whom, data is collected. The objects may be living or nonliving, tangible or intangible, and may consist of events or abstract concepts. To identify the subjects in a story, first identify the sample size n, then ask yourself "n of whom/what are we getting information about?".

Observation / Response / Data Point: A single collected data value (a piece of information). To identify the observations, first identify the subjects, then ask yourself "from each subject what piece of information are we getting?". Sometimes the words used to describe the subjects and the observations are the same.

Population: The entire group of subjects that we wish to get information about in a single study. In reality, it is often time-consuming or costly to obtain data from the entire population under consideration, so we usually work with samples.

Sample: A subgroup of subjects within a population. In real life, we would obtain data from a sample instead of a population, and then use our knowledge of statistics to make inferences about the population from which the sample was drawn.

Sample size: The number of subjects involved in a study, i.e., the number of subjects in the sample.

Parameter: A numerical measurement that describes a characteristic of a population.

Statistic: A numerical measurement that describes a characteristic of a sample. Very often sample statistics are used as estimates for the corresponding population parameters.
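To make the difference between a parameter and a statistic concrete, here is a minimal Python sketch. The household ages are made-up values for illustration only, not data from these notes, and the names population, mu and x_bar are chosen just for this example.

```python
import random

# Hypothetical population: ages (in years) of everyone in one household.
# These values are invented for illustration only.
population = [4, 9, 15, 20, 22, 41, 45, 70]
N = len(population)               # population size
mu = sum(population) / N          # parameter: population mean

# Draw a random sample of n subjects without replacement.
rng = random.Random(1)            # fixed seed so the run is repeatable
n = 3
sample = rng.sample(population, n)
x_bar = sum(sample) / n           # statistic: sample mean, an estimate of mu

print(f"parameter mu = {mu}, statistic x_bar = {x_bar:.2f} from sample {sample}")
```

Changing the seed changes the sample, and therefore the statistic, while the parameter stays fixed.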

The picture below may help you to understand the concept of population, sample, parameter and

statistic better.


The table below shows the symbols used to represent some parameters and statistics.

                        Symbol
Name                    Population Parameter         Sample Statistic
                        (generally Greek letters)    (generally Latin letters)
Size                    N                            n
Mean (average)          µ                            x̄
Variance                σ²                           s²
Standard Deviation      σ                            s
Proportion / Percent    π                            p
Correlation             ρ                            r

Note: "N" often denotes total sample size when more than one sample is involved.

1. We want to know what percent of this term's Statistics students are taking French. We

randomly select 40 students in our Statistics class and ask if they are taking French, answer YES

or NO. It turns out that 15 of them say YES so we calculate that 15/40=37.5% of them are also

taking French. FSU administrators want a truer percent so they poll all of this term's Statistics

students and find that 29.8% of them are also taking French. Identify the following in the story.

Subjects:

Observations:

Sample size:

Parameter:

Statistic:

2. You want to know the mean age (average age) of all the persons living in your household so

you do the calculation and the mean turns out to be 16.4 years. Your teacher wants to estimate

the mean age of all the persons living in your household so she asks you your age, and you

happen to be 20 years old, so she says that her estimate is 20 years. Identify the following in the

story.

Subjects:

Observations:

Sample size:

Parameter:

Statistic:


Variable: A characteristic of the subjects. When we collect data about the subjects, we are

collecting values of variables. Different subjects often give rise to different values of the same

variable.

Suppose we want to know about the number of Technology Enhanced Classrooms (TECs)

in the buildings on FSU campus. The buildings would be the subjects and the number of TECs in

each building would be the variable being measured (the characteristic we are interested in

knowing about).

TWO TYPES OF VARIABLE

Categorical — Non-numerical observations that are never used directly in calculations.

Observations consist of names of categories or labels given to categories, each subject falls in

exactly one category. The YES/NO data in #1 above are an example of categorical data.

Basically, we are categorizing the subjects into those who are also taking French (YES) and those

who are not (NO).

Quantitative — Numerical observations that can be used directly in calculations. The age data in

#2 above are an example of quantitative data. Quantitative data always have a unit of

measurement, though sometimes it may not be obvious what the unit of measurement is.

~ If numbers are used only as labels for categories, then they are considered categorical data.

~ If quantitative data are used for the purpose of categorization, then they are considered

categorical in that context.
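The second note can be illustrated with a short Python sketch. The ages and the cut-off of 18 years are hypothetical choices made only for this example.

```python
# Hypothetical ages in years (quantitative data), invented for illustration.
ages = [12, 17, 18, 25, 40]

# Used directly in a calculation, the ages are quantitative.
mean_age = sum(ages) / len(ages)

# Used only to sort subjects into groups, the same ages yield categorical data.
# The cut-off of 18 is an arbitrary choice for this illustration.
groups = ["minor" if age < 18 else "adult" for age in ages]

print(mean_age)   # a number with a unit (years)
print(groups)     # labels; averaging them would be meaningless
```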

3. Webflicks, a web-based DVD mail-rental company, collects the following data about each of its movies: genre, number of copies owned, duration, primary language, MPAA rating, current retail price.

(a) Who/what are the subjects?

(b) List the variables along with type, and units (if any). For this question, if a variable can be

considered quantitative at all, list it as quantitative.


4. (i) Classify the following data as categorical (C) or quantitative (Q). If a variable can be used

quantitatively at all, label it as quantitative.

(a) The amount of air inside the balloons at a party.

(b) The ice-cream flavors favored by FSU students.

(c) The ranks held by the officers at a military base.

(d) The number of textbooks in each room at a dormitory.

(e) The current temperature inside the classrooms on campus.

(f) The time it takes to fully charge your cell phone every day.

(g) The maximum legal highway speed in major European cities.

(h) The most-watched TV channel in each household in your neighborhood.

(i) The country that earned the highest number of gold medals at each Olympic games.

(j) The systems of government (presidential republic, constitutional monarchy, etc) in the

countries of the world.

(ii) For each data description above, identify the subjects and the observations by writing out

something along the lines of "for each subject, we record the observation." For example, for part

(a), we might say "for each balloon, we record the amount of air in it."


Distribution: The distribution of a variable tells us what values the variable can possibly take

and how often it takes these values. Distributions can be set out in tables, charts, graphs, or

written in "shorthand" form using statistical notation. We begin with the table. Frequency tables

or relative frequency tables can be used for a single categorical or a single quantitative variable.

Frequency is count. Relative frequency is proportion or percentage, and is calculated as count ÷ total. The example given below is for a single categorical variable.

Suppose we were recording the genders of all the students in our class. The values of the

variable "Gender" would be "female" and "male" and the counts of females and males in our class

would tell us how often the variable "Gender" took the values "female" and "male", respectively.

Fill in the missing numbers.

Frequency Table for the variable GENDER

Gender    Count
Female    174
Male      66
Total     240

Relative Frequency Table for the variable GENDER

Gender    Percentage
Female    72.5%
Male      27.5%
Total     100%
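The calculation relative frequency = count ÷ total can be checked with a few lines of Python. This is only a sketch; the dictionary name counts and the print layout are choices made for this example, while the counts themselves come from the frequency table for GENDER.

```python
# Counts from the frequency table for the variable GENDER in these notes.
counts = {"Female": 174, "Male": 66}

total = sum(counts.values())

# Relative frequency = count / total, shown here as a percentage.
relative = {category: 100 * count / total for category, count in counts.items()}

for category, pct in relative.items():
    print(f"{category:<8}{counts[category]:>6}{pct:>8.1f}%")
print(f"{'Total':<8}{total:>6}{sum(relative.values()):>8.1f}%")
```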

GRAPHS FOR CATEGORICAL DATA

These include Bar Charts and Pie Charts, among others.

BAR CHARTS
- area under graph represents percentage of data
- allow us to compare counts/percentages among categories
- each bar represents one category
- categories may be listed in any order along the variable axis; this is why there are gaps between the bars
- height of bar represents count/percentage of the category
- bars need to be of equal width, otherwise area will not be proportional to percentage of data: when bars are of equal width, area is proportional to height, and height is proportional to count/percentage, so area is proportional to count/percentage of data
- counts/percentages must start at zero
- can be used to display complete or partial distributions without recalculation of percentages

PIE CHARTS
- area under graph represents percentage of data
- allow us to compare proportions relative to the whole (only percentages can be displayed)
- each sector represents one category
- categories may be displayed in any order around the pie
- proportion of area occupied by sector equals proportion of data represented by that sector
- there must be no gaps between sectors, otherwise area will not be proportional to percentage of data
- all percentages must add up to 100%
- can be used to display complete or partial distributions, but percentages will have to be calculated out of the partial total for a partial distribution

5. Data compiled from The CIA World Factbook https://www.cia.gov/library/publications/the-world-factbook/.

The table and pie chart below show the make-up of the US labor force (2006 estimate excluding

unemployed). Label the sectors of the pie with the occupation categories.

Subjects: ____________________________ Observations: ____________________________

Occupation Percentage

Farming, forestry and fishing 0.7%

Manufacturing, extraction, transportation and crafts 22.7%

Managerial, professional and technical 34.9%

Sales and office 25.5%

Other services 16.3%

Total 100.1%

The percentages add up to 100.1% due to rounding error.
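As a quick check of that total, the sketch below adds the percentages exactly as printed in the table. Using Python's decimal module is a choice made for this example, so that the one-decimal values are added exactly and no binary floating-point noise is mixed in with the rounding already present in the published figures.

```python
from decimal import Decimal

# Percentages as printed in the labor-force table in these notes.
percentages = {
    "Farming, forestry and fishing": Decimal("0.7"),
    "Manufacturing, extraction, transportation and crafts": Decimal("22.7"),
    "Managerial, professional and technical": Decimal("34.9"),
    "Sales and office": Decimal("25.5"),
    "Other services": Decimal("16.3"),
}

# Decimal arithmetic keeps the one-decimal values exact, so the only
# discrepancy left is the rounding in the published figures themselves.
total = sum(percentages.values())
excess = total - Decimal("100")

print(f"Total = {total}%, excess over 100% = {excess}%")
```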

6. Data from FRS Bulletin Winter 2008.

The pie chart below shows the proportions of Florida Retirement System (FRS) members who

belong to certain employment groups. Given that 3% of members are Community College

employees and 21% of members are State and State University employees, which ONE of the

following gives the correct percentages of the other categories?

Subjects: ____________________________ Observations: ____________________________


(A) Cities and Special Districts, etc = 5%, County Governments = 25%, School Boards = 49%

(B) Cities and Special Districts, etc = 2%, County Governments = 21%, School Boards = 53%

(C) Cities and Special Districts, etc = 9%, County Governments = 17%, School Boards = 50%

(D) Cities and Special Districts, etc = 4%, County Governments = 23%, School Boards = 49%

[Pie chart: FRS members by employment group. Sector labels: School Boards; County Governments; State and State Universities; Cities and Special Districts, etc; Community Colleges.]

7. Use the bar chart below to answer this question.

Subjects: ____________________________ Observations: ____________________________

(a) If 108 students were Chemistry majors, what was the total number of students?

(b) How many students were Computing majors?

[Bar chart: Distribution of the six most common majors at a small science college. Vertical axis: Percentage of Graduates, 0% to 40% in steps of 4%. Horizontal axis: Majors, 1-Mathematics, 2-Biology, 3-Physics, 4-Astronomy, 5-Chemistry, 6-Computing.]


8. For the distribution displayed in the bar chart below, classify the following statements as true

(T) or false (F).

Subjects: ____________________________ Observations: ____________________________

(a) The 8:00, 9:05 and 1:25 recitations together had more than half the students.

(b) The difference between the number of students in the 10:10 and 11:15 recitations was less

than the difference between the number of students in the 8:00 and 11:15 recitations.

(c) The 12:20 recitation was less popular than the 1:25.

(d) The 11:15 recitation had the most students.

[Bar chart: Distribution of the number of students in recitations. Vertical axis: Number of students. Horizontal axis: Recitations, 1=8:00, 2=9:05, 3=10:10, 4=11:15, 5=12:20, 6=1:25.]

9. Data compiled from The CIA World Factbook https://www.cia.gov/library/publications/the-world-factbook/.

The bar chart below shows the types of ship owned by the USA (2010 figures).

Subjects: ____________________________ Observations: ____________________________

(i) Which type of ship occurs the most? ______________________________

(ii) Which types of ship occur about as often as the Passenger/cargo?

(iii) The five greatest categories make up about 25% / 50% / 75% of all the ships.

(iv) The total percentage of Carrier and Refrigerated Cargo ships is the same as the percentage of

______________________________ ships.

(v) More ships carry both passengers and cargo rather than passengers only. The number of

ships that carry both passengers and cargo is less than double / double / more than double the

number that carry passengers only.


[Bar chart: types of ship owned by the USA (2010 figures). Categories, left to right: Barge carrier, Bulk carrier, Cargo, Carrier, Chemical tanker, Container, Passenger, Passenger/cargo, Petroleum tanker, Refrigerated cargo, Roll on / roll off, Vehicle carrier.]

Below is the same ship data shown in a Pareto chart. A Pareto chart is a special bar chart with

the bars arranged in order of height, highest one leftmost. Arranging the bars like this allows us

to see at a glance which categories are the major contributors, i.e., the categories that carry the

bulk of the distribution.

[Pareto chart of the same ship data. Categories, left to right (tallest bar first): Container, Bulk carrier, Cargo, Passenger/cargo, Petroleum tanker, Chemical tanker, Roll on / roll off, Vehicle carrier, Passenger, Barge carrier, Carrier, Refrigerated cargo.]
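The ordering used by a Pareto chart can be sketched in Python. The ship counts are not given in the text of these notes, so this example reuses the labor-force percentages from question 5 instead. The running total is an addition made for this sketch to show how quickly the leading categories account for the bulk of the distribution.

```python
# Labor-force percentages from question 5 of these notes.
occupations = {
    "Farming, forestry and fishing": 0.7,
    "Manufacturing, extraction, transportation and crafts": 22.7,
    "Managerial, professional and technical": 34.9,
    "Sales and office": 25.5,
    "Other services": 16.3,
}

# Pareto order: tallest bar first (leftmost), shortest bar last.
pareto = sorted(occupations.items(), key=lambda item: item[1], reverse=True)

# A running total shows how quickly the leading categories
# account for the bulk of the distribution.
running = 0.0
for name, pct in pareto:
    running += pct
    print(f"{name:<55}{pct:>6.1f}%   cumulative {running:>6.1f}%")
```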


10. Data from the CIA World Factbook https://www.cia.gov/library/publications/the-world-factbook/index.html.

(i) Read the description that follows, then match the religions to the bars of the Pareto chart

below. When two or more categories have equal percentages, the categories are arranged in

alphabetical order on the chart, left to right.

Subjects: ____________________________ Observations: ____________________________

There were more Mormons than Jews, and there were equal numbers of Jews and

Muslims. But most Americans are Protestants, with Roman Catholics coming in right after the

Protestants. There were equal numbers of persons who followed other religions and who followed

no religion, and both of these were higher than the number of Mormons.

(ii) What percentage of persons were neither Protestant nor Roman Catholic?

(A) Less than 50%. (B) About 50%. (C) More than 50%.

[Pareto chart: Distribution of religious following in the USA (2002 estimate). Vertical axis: Percentage. Horizontal axis: bars numbered 1 to 7.]

GRAPHS FOR QUANTITATIVE DATA

These include Histograms, Stemplots and Boxplots, among others. Histograms and Boxplots will

be covered in a later chapter. Please study Stemplots after we cover Histograms in class.

STEMPLOTS

- display individual data values along with classes of values
- are quick and easy to construct for small data sets
- stems should be in ascending order from top to bottom with none missing between the first and last
- leaves should be listed in ascending order from left to right with no gaps
- leaves in different rows should be aligned vertically
- vertically aligned leaves ensure that "area" under the graph represents percentage of data
- should be accompanied by a key which shows how to reconstruct the raw data
- show features of distributions in the same way histograms do: to "see" the features, rotate the page 90° anticlockwise so that the stems become a horizontal axis with the leaves forming ascending columns like the bars of a histogram

Splitting or Combining Stems

If it turns out that most of the leaves belong to just one or two stems, then it is possible to get a better picture of the distribution if we increase the number of stems by splitting them so that each stem occurs twice in the Stem column. The first occurrence of the stem carries leaves 0-4 and the second occurrence carries leaves 5-9. If it turns out that the majority of stems have no leaves or only one leaf, then it is possible to get a better picture of the distribution if we decrease the number of stems by putting two consecutive stems in one row and separating them by a hyphen " – ". The leaves of the two stems are then listed in the same row and are separated by an asterisk " * ".
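Splitting a stem can be sketched in Python as follows. The function name split_stem is hypothetical, and the example leaves are those of stem 14 in the baby-weight data of question 11. Combining stems is not shown.

```python
def split_stem(stem, leaves):
    """Split one stem into two rows: leaves 0-4 first, then leaves 5-9."""
    low = [leaf for leaf in sorted(leaves) if leaf <= 4]
    high = [leaf for leaf in sorted(leaves) if leaf >= 5]
    return [(stem, low), (stem, high)]

# Leaves of stem 14, taken from the baby-weight data in question 11 of these notes.
rows = split_stem(14, [1, 0, 9, 5, 8, 3, 6, 4, 8, 9])

for stem, leaves in rows:
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```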

11. Below are the weights (in lbs) of the 4-month-old babies seen by a certain pediatrician.

14.1 12.1 15.7 14.0 15.8 12.6 11.3 14.9 12.0 14.5

15.6 15.3 12.9 14.8 11.4 16.8 14.3 11.4 15.0 14.6

12.6 14.4 16.2 15.2 16.4 14.8 11.6 14.9 16.7 15.2

Here is the stemplot.

Notice that if you rotate the page 90° anticlockwise, you can see a "histogram" with six data classes and five bars (there is one empty data class). The six data classes are [11,12), [12,13), [13,14), [14,15), [15,16) and [16,17). The [13,14) data class is empty. The key allows us to reconstruct the raw data just by looking at the numbers in the stemplot.

KEY: 11|3 = 11.3 lbs

Stem | Leaves
11   | 3 4 4 6
12   | 0 1 6 6 9
13   |
14   | 0 1 3 4 5 6 8 8 9 9
15   | 0 2 2 3 6 7 8
16   | 2 4 7 8
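For readers who want to reproduce the stemplot by computer, here is a minimal Python sketch. The function name stemplot and the choice to work in integer tenths of a pound are assumptions of this example, not part of the original notes.

```python
from collections import defaultdict

# Weights (lbs) of the 4-month-old babies from question 11 of these notes.
weights = [
    14.1, 12.1, 15.7, 14.0, 15.8, 12.6, 11.3, 14.9, 12.0, 14.5,
    15.6, 15.3, 12.9, 14.8, 11.4, 16.8, 14.3, 11.4, 15.0, 14.6,
    12.6, 14.4, 16.2, 15.2, 16.4, 14.8, 11.6, 14.9, 16.7, 15.2,
]

def stemplot(values):
    """Return {stem: sorted leaves}; stem = whole lbs, leaf = tenths of a lb."""
    leaves_by_stem = defaultdict(list)
    for value in values:
        # Work in integer tenths to avoid floating-point surprises.
        stem, leaf = divmod(round(value * 10), 10)
        leaves_by_stem[stem].append(leaf)
    # Include every stem between the first and last, even empty ones.
    stems = range(min(leaves_by_stem), max(leaves_by_stem) + 1)
    return {stem: sorted(leaves_by_stem.get(stem, [])) for stem in stems}

plot = stemplot(weights)
print("KEY: 11|3 = 11.3 lbs")
for stem, leaves in plot.items():
    print(f"{stem} | {' '.join(map(str, leaves))}")
```

The empty stem 13 is kept on purpose, following the guideline that no stems should be missing between the first and last.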

Here is a description of the distribution obtained mainly by just looking at the graph.

Center: roughly 14.5 lbs (this is the midpoint of the “14” data class)
Shape: somewhat bell-shaped
Spread: 17 - 11 = 6 lbs
Outliers: not likely
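The judgments of center and spread above are read off the graph using class boundaries. For comparison, the sketch below computes two exact measures from the raw data. Using the median for center and the range (maximum minus minimum) for spread is an assumption of this example, since the notes do not say which exact measures to use.

```python
import statistics

# Weights (lbs) of the 4-month-old babies from question 11 of these notes.
weights = [
    14.1, 12.1, 15.7, 14.0, 15.8, 12.6, 11.3, 14.9, 12.0, 14.5,
    15.6, 15.3, 12.9, 14.8, 11.4, 16.8, 14.3, 11.4, 15.0, 14.6,
    12.6, 14.4, 16.2, 15.2, 16.4, 14.8, 11.6, 14.9, 16.7, 15.2,
]

center = statistics.median(weights)     # half the babies weigh more than this
spread = max(weights) - min(weights)    # range of the observed values

print(f"median = {center:.1f} lbs, range = {spread:.1f} lbs")
```

This prints a median of 14.7 lbs and a range of 5.5 lbs, close to the graphical estimates of 14.5 lbs and 6 lbs.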


You try this: use the judgments of shape, center, spread and outliers above to complete the

report below.

It seems that 4-month-old babies generally weigh between _____ lbs and _____ lbs, with

about half of them weighing more than _____ lbs. This doctor sees / does not see a baby whose

weight is outside of the range of expected weights. It appears that the weights of 4-month-old

babies are collected around the lower end / middle / higher end of the distribution.

12. Data obtained at http://www.eia.gov/ipm/supply.html, December 2010 International Petroleum Monthly, Table 1.3.

Below are the US monthly natural gas production amounts (in thousand-barrels) for the years

2001 and 2002. The numbers have been put in ascending order so they are no longer in time

order.

Subjects: ____________________________ Observations: ____________________________

1398, 1732, 1760, 1827, 1831, 1833, 1846, 1870, 1875, 1889, 1891, 1898, 1899, 1900, 1901,

1908, 1912, 1925, 1936, 1937, 1955, 2001, 2025, 2034

(a) Construct a stemplot to display the data - a grid is provided to help you align the leaves

vertically.

KEY: ________________________________________

(b) Obtain a description of the distribution mainly by just looking at the graph.

Shape: __________ Center: __________ Spread: __________ Outliers: __________


(c) Use your judgments of shape, center, spread and outliers to complete the report below.

It seems that production is generally between __________ thousand barrels and

__________ thousand barrels, and about half the time it’s more than __________ thousand

barrels. The USA produces / does not produce amounts that are outside of the range of expected

amounts. It appears that production is concentrated around the lower end / middle / higher end

of the distribution.

13. Snapshot below obtained at: http://en.wikipedia.org/wiki/List_of_highest_mountains.

The stemplot below was not drawn according to the guidelines in these notes, however, you

should still be able to interpret it.

(i) Who/what are the subjects of the study?

(ii) How many subjects were in the sample?

(iii) What is the variable in question? What are the units, if any?

(iv) Is the variable categorical or quantitative?

(v) What shape does the distribution have?

(vi) What is the approximate center?

(vii) What is the approximate spread?

(viii) Does it appear that the distribution has outliers?

(ix) If yes, what is your best guess as to the number of outliers?

(x) Why might the stemplot have been drawn the way it was?


ANSWERS

1. Subjects: This term's Statistics students.

Observations: The YES/NO answers.

Sample size: 40

Parameter: π = 29.8% (population proportion)

Statistic: p = 15/40 = 37.5% (sample proportion)

2. Subjects: The persons living in your household.

Observations: Their ages.

Sample size: 1 (just you!)

Parameter: µ = 16.4 years (population mean)

Statistic: x̄ = 20 years (sample mean)

3. (a) The movies.

(b) Genre — categorical

No. copies — quantitative

Duration — quantitative (minutes)

Language — categorical

MPAA Rating — categorical

Price — quantitative (dollars)

4. (i) Q, C, C, Q, Q, Q, Q, C, C, C

(ii)

(a) for each balloon, we record the amount of air in it

(b) for each FSU student, we record her/his favorite ice-cream flavor

(c) for each officer at the base, we record her/his military rank

(d) for each room in the dorm, we record the number of textbooks in it

(e) for each classroom on campus, we record the current temperature inside it

(f) for each day, we record the time it takes to fully charge my cell phone

(g) for each major European city, we record the maximum legal highway speed

(h) for each household in my neighborhood, we record the most-watched TV channel

(i) for each (occurrence of the) Olympic games, we record the country that earned the highest

number of gold medals

(j) for each country of the world, we record its system of government


5. Subjects: persons in the US labor force, Observations: their occupations

6. Subjects: persons who belong to the FL Retirement System

Observations: their employment groups

(D) Cities and Special Districts, etc = 4%, County Governments = 23%, School Boards = 49%

7. Subjects: students at the college, Observations: their majors

(a) 36% = 108 students, so 100% = (108 / 36) × 100 = 300 students. Total = 300 students.

(b) Computing = 4% of 300 = (4 / 100) × 300 = 12 students.

8. Subjects: students, Observations: their recitation times

Recitation bar chart: (a) F, (b) T, (c) F, (d) T

9. Subjects: ships owned by the USA, Observations: their types

(i) Container (ii) Bulk Carrier and Cargo (iii) 75% (iv) Barge Carrier (v) more than double

10. Subjects: persons who live in the USA, Observations: their religious affiliations

(i) 1: Protestant, 2: Roman Catholic, 3: No religion, 4: Other religion, 5: Mormon, 6: Jewish, 7: Muslim

(ii) A

11. in this order: 11, 17, 14.5, does not see, middle.

13. (i) Mountains (ii) 117 (iii) Height, meters (iv) Quantitative (v) Right-skewed

(vi) roughly 7400 m (vii) 8900 - 7200 = 1700 m (viii) yes (ix) five

(x) to draw attention to the highest mountains by putting them physically at the top of the chart


12. Subjects: the 24 months in the two years 2001 and 2002

Observations: natural gas production, in thousand-barrels

(a) KEY: 13|98 = 1398 thousand barrels

Stem | Leaves
13   | 98
14   |
15   |
16   |
17   | 32 60
18   | 27 31 33 46 70 75 89 91 98 99
19   | 00 01 08 12 25 36 37 55
20   | 01 25 34

(b) Shape: left-skewed
Center: roughly 1850 thousand barrels
Spread: 2100 - 1300 = 800 thousand barrels
Outliers: possible

(c) in this order: 1700, 2100, 1850, produces, higher end

(Since the “1300” amount appears to be an outlier, we would say that, generally speaking, the

lower bound is 1700 thousand barrels as opposed to 1300 thousand barrels.)
