INTRODUCTION
© 2013 Radha Bose Florida State University Department of Statistics
PRELIMINARY DEFINITIONS
Data is information. When we collect data, we are collecting information.
Subjects: The objects from whom, or about whom, data is collected. The objects may be living
or nonliving, tangible or intangible, may consist of events or abstract concepts. To identify the
subjects in a story, first identify the sample size n, then ask yourself "n of whom/what are we
getting information about?".
Observation / Response / Data Point: A single collected data value (a piece of information).
To identify the observations, first identify the subjects, then ask yourself "from each subject what
piece of information are we getting?". Sometimes the words used to describe the subjects and
the observations are the same.
Population: The entire group of subjects that we wish to get information about in a single
study. In reality, it is often time-consuming or costly to obtain data from the entire population
under consideration, so we usually work with samples.
Sample: A subgroup of subjects within a population. In real life, we would obtain data from a
sample instead of a population, and then use our knowledge of statistics to make inferences
about the population from which the sample was drawn.
Sample size: The number of subjects involved in a study, i.e., the number of subjects in the
sample.
Parameter: A numerical measurement that describes a characteristic of a population.
Statistic: A numerical measurement that describes a characteristic of a sample. Very often
sample statistics are used as estimates for the corresponding population parameters.
The picture below may help you to understand the concept of population, sample, parameter and
statistic better.
The table below shows the symbols used to represent some parameters and statistics.
Name                   Symbol for Population Parameter   Symbol for Sample Statistic
                       (generally Greek letters)         (generally Latin letters)
Size                   N                                 n
Mean (average)         µ                                 x̄
Variance               σ²                                s²
Standard Deviation     σ                                 s
Proportion / Percent   π                                 p
Correlation            ρ                                 r
Note: "N" often denotes total sample size when more than one sample is involved.
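To see the notation in action, here is a minimal Python sketch. The ages are made-up values for a hypothetical club (they do not come from these notes), and the names mu and x_bar are just stand-ins for the population mean µ and the sample mean x̄.

```python
from statistics import mean

# Hypothetical population: ages of all 8 members of a small club (made-up values).
population_ages = [18, 19, 19, 20, 21, 22, 24, 25]
# Hypothetical sample: 4 of those 8 members.
sample_ages = [19, 20, 22, 25]

N = len(population_ages)   # population size
n = len(sample_ages)       # sample size

mu = mean(population_ages)   # parameter: describes the whole population
x_bar = mean(sample_ages)    # statistic: describes the sample only

print(f"N = {N}, population mean = {mu}")
print(f"n = {n}, sample mean = {x_bar}")
```

The sample mean is close to, but not equal to, the population mean, which is why a statistic is described as an estimate of the corresponding parameter.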
1. We want to know what percent of this term's Statistics students are taking French. We
randomly select 40 students in our Statistics class and ask if they are taking French, answer YES
or NO. It turns out that 15 of them say YES so we calculate that 15/40=37.5% of them are also
taking French. FSU administrators want a truer percent so they poll all of this term's Statistics
students and find that 29.8% of them are also taking French. Identify the following in the story.
Subjects:
Observations:
Sample size:
Parameter:
Statistic:
2. You want to know the mean age (average age) of all the persons living in your household so
you do the calculation and the mean turns out to be 16.4 years. Your teacher wants to estimate
the mean age of all the persons living in your household so she asks you your age, and you
happen to be 20 years old, so she says that her estimate is 20 years. Identify the following in the
story.
Subjects:
Observations:
Sample size:
Parameter:
Statistic:
Variable: A characteristic of the subjects. When we collect data about the subjects, we are
collecting values of variables. Different subjects often give rise to different values of the same
variable.
Suppose we want to know about the number of Technology Enhanced Classrooms (TECs)
in the buildings on FSU campus. The buildings would be the subjects and the number of TECs in
each building would be the variable being measured (the characteristic we are interested in
knowing about).
TWO TYPES OF VARIABLE
Categorical — Non-numerical observations that are never used directly in calculations.
Observations consist of names of categories or labels given to categories, each subject falls in
exactly one category. The YES/NO data in #1 above are an example of categorical data.
Basically, we are categorizing the subjects into those who are also taking French (YES) and those
who are not (NO).
Quantitative — Numerical observations that can be used directly in calculations. The age data in
#2 above are an example of quantitative data. Quantitative data always have a unit of
measurement, though sometimes it may not be obvious what the unit of measurement is.
~ If numbers are used only as labels for categories, then they are considered categorical data.
~ If quantitative data are used for the purpose of categorization, then they are considered
categorical in that context.
3. Webflicks, a web-based DVD mail-rental company, collects the following data about each of its
movies ― genre, number of copies owned, duration, primary language, MPAA rating, current
retail price.
(a) Who/what are the subjects?
(b) List the variables along with type, and units (if any). For this question, if a variable can be
considered quantitative at all, list it as quantitative.
4. (i) Classify the following data as categorical (C) or quantitative (Q). If a variable can be used
quantitatively at all, label it as quantitative.
(a) The amount of air inside the balloons at a party.
(b) The ice-cream flavors favored by FSU students.
(c) The ranks held by the officers at a military base.
(d) The number of textbooks in each room at a dormitory.
(e) The current temperature inside the classrooms on campus.
(f) The time it takes to fully charge your cell phone every day.
(g) The maximum legal highway speed in major European cities.
(h) The most-watched TV channel in each household in your neighborhood.
(i) The country that earned the highest number of gold medals at each Olympic games.
(j) The systems of government (presidential republic, constitutional monarchy, etc) in the
countries of the world.
(ii) For each data description above, identify the subjects and the observations by writing out
something along the lines of "for each subject, we record the observation." For example, for part
(a), we might say "for each balloon, we record the amount of air in it."
Distribution: The distribution of a variable tells us what values the variable can possibly take
and how often it takes these values. Distributions can be set out in tables, charts, graphs, or
written in "shorthand" form using statistical notation. We begin with the table. Frequency tables
or relative frequency tables can be used for a single categorical or a single quantitative variable.
Frequency is count. Relative frequency is proportion or percentage, and is calculated as
count ÷ total. The example given below is for a single categorical variable.
Suppose we were recording the genders of all the students in our class. The values of the
variable "Gender" would be "female" and "male" and the counts of females and males in our class
would tell us how often the variable "Gender" took the values "female" and "male", respectively.
Fill in the missing numbers.
Frequency Table               Relative Frequency Table
for the variable GENDER       for the variable GENDER
Gender    Count               Gender    Percentage
Female    174                 Female    72.5%
Male      66                  Male      27.5%
Total     240                 Total     100%
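The relative frequencies in the table can be checked with a short Python sketch. It uses the counts from the frequency table above; the variable names are illustrative only.

```python
# Counts taken from the frequency table for the variable GENDER.
counts = {"Female": 174, "Male": 66}

total = sum(counts.values())  # total number of students in the class

# Relative frequency = count / total, stored here as a proportion.
relative = {gender: count / total for gender, count in counts.items()}

for gender, count in counts.items():
    # The ".1%" format shows the proportion as a percentage with one decimal place.
    print(f"{gender:<8}{count:>6}{relative[gender]:>8.1%}")
print(f"{'Total':<8}{total:>6}{sum(relative.values()):>8.1%}")
```

Dividing each count by the same total is what guarantees that the relative frequencies of a complete distribution add up to 100%.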
GRAPHS FOR CATEGORICAL DATA
These include Bar Charts and Pie Charts, among others.
BAR CHARTS
~ area under graph represents percentage of data
~ allow us to compare counts/percentages among categories
~ each bar represents one category
~ categories may be listed in any order along the variable axis; this is why there are gaps between the bars
~ height of bar represents count/percentage of the category
~ bars need to be equal width, otherwise area will not be proportional to percentage of data: when bars are equal width, area is proportional to height, and height is proportional to count/percentage, so area is proportional to count/percentage of data
~ counts/percentages must start at zero
~ can be used to display complete or partial distributions without recalculation of percentages
PIE CHARTS
~ area under graph represents percentage of data
~ allow us to compare proportions relative to the whole (only percentages can be displayed)
~ each sector represents one category
~ categories may be displayed in any order around the pie
~ proportion of area occupied by sector equals proportion of data represented by that sector
~ there must be no gaps between sectors, otherwise area will not be proportional to percentage of data
~ all percentages must add up to 100%
~ can be used to display complete or partial distributions, but percentages will have to be calculated out of the partial total for a partial distribution
5. Data compiled from The CIA World Factbook https://www.cia.gov/library/publications/the-world-factbook/.
The table and pie chart below show the make-up of the US labor force (2006 estimate excluding
unemployed). Label the sectors of the pie with the occupation categories.
Subjects: ____________________________ Observations: ____________________________
Occupation Percentage
Farming, forestry and fishing 0.7%
Manufacturing, extraction, transportation and crafts 22.7%
Managerial, professional and technical 34.9%
Sales and office 25.5%
Other services 16.3%
Total 100.1%
The percentages add up to 100.1% due to rounding error.
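The 100.1% total can be reproduced with a small Python sketch. It uses Decimal rather than float so that the sum of the one-decimal percentages is exact; that choice is an assumption made for this illustration, not something the notes prescribe.

```python
from decimal import Decimal

# Rounded percentages exactly as printed in the table above.
percentages = {
    "Farming, forestry and fishing": Decimal("0.7"),
    "Manufacturing, extraction, transportation and crafts": Decimal("22.7"),
    "Managerial, professional and technical": Decimal("34.9"),
    "Sales and office": Decimal("25.5"),
    "Other services": Decimal("16.3"),
}

total = sum(percentages.values())   # sum of the rounded values
excess = total - Decimal("100")     # amount attributable to rounding

print(f"Total = {total}%, which is {excess} percentage points over 100%")
```

Each published percentage was rounded to one decimal place, so the small rounding errors can accumulate and push the total slightly above or below 100%.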
6. Data from FRS Bulletin Winter 2008.
The pie chart below shows the proportions of Florida Retirement System (FRS) members who
belong to certain employment groups. Given that 3% of members are Community College
employees and 21% of members are State and State University employees, which ONE of the
following gives the correct percentages of the other categories?
Subjects: ____________________________ Observations: ____________________________
(A) Cities and Special Districts, etc = 5%, County Governments = 25%, School Boards = 49%
(B) Cities and Special Districts, etc = 2%, County Governments = 21%, School Boards = 53%
(C) Cities and Special Districts, etc = 9%, County Governments = 17%, School Boards = 50%
(D) Cities and Special Districts, etc = 4%, County Governments = 23%, School Boards = 49%
[Pie chart: sectors labeled School Boards, County Governments, State and State Universities, Cities and Special Districts etc, and Community Colleges]
7. Use the bar chart below to answer this question.
Subjects: ____________________________ Observations: ____________________________
(a) If 108 students were Chemistry majors, what was the total number of students?
(b) How many students were Computing majors?
[Bar chart: Distribution of the six most common majors at a small science college. Vertical axis: Percentage of Graduates, from 0% to 40% in steps of 4%. Horizontal axis: Majors: 1-Mathematics, 2-Biology, 3-Physics, 4-Astronomy, 5-Chemistry, 6-Computing]
8. For the distribution displayed in the bar chart below, classify the following statements as true
(T) or false (F).
Subjects: ____________________________ Observations: ____________________________
(a) The 8:00, 9:05 and 1:25 recitations together had more than half the students.
(b) The difference between the number of students in the 10:10 and 11:15 recitations was less
than the difference between the number of students in the 8:00 and 11:15 recitations.
(c) The 12:20 recitation was less popular than the 1:25.
(d) The 11:15 recitation had the most students.
[Bar chart: Distribution of the number of students in recitations. Vertical axis: Number of students. Horizontal axis: Recitations: 1=8:00, 2=9:05, 3=10:10, 4=11:15, 5=12:20, 6=1:25]
9. Data compiled from The CIA World Factbook https://www.cia.gov/library/publications/the-world-factbook/.
The bar chart below shows the types of ship owned by the USA (2010 figures).
Subjects: ____________________________ Observations: ____________________________
(i) Which type of ship occurs the most? ______________________________
(ii) Which types of ship occur about as often as the Passenger/cargo?
(iii) The five greatest categories make up about 25% / 50% / 75% of all the ships.
(iv) The total percentage of Carrier and Refrigerated Cargo ships is the same as the percentage of
______________________________ ships.
(v) More ships carry both passengers and cargo rather than passengers only. The number of
ships that carry both passengers and cargo is less than double / double / more than double the
number that carry passengers only.
[Bar chart: types of ship owned by the USA. Categories along the horizontal axis, left to right: Barge carrier, Bulk carrier, Cargo, Carrier, Chemical tanker, Container, Passenger, Passenger/cargo, Petroleum tanker, Refrigerated cargo, Roll on / roll off, Vehicle carrier]
Below is the same ship data shown in a Pareto chart. A Pareto chart is a special bar chart with
the bars arranged in order of height, highest one leftmost. Arranging the bars like this allows us
to see at a glance which categories are the major contributors, i.e., the categories that carry the
bulk of the distribution.
[Pareto chart: types of ship owned by the USA. Categories along the horizontal axis, left to right: Container, Bulk carrier, Cargo, Passenger/cargo, Petroleum tanker, Chemical tanker, Roll on / roll off, Vehicle carrier, Passenger, Barge carrier, Carrier, Refrigerated cargo]
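The ordering step that turns a bar chart into a Pareto chart can be sketched in Python. The ship counts are not given in the text, so this sketch reuses the labor-force percentages from question 5 instead; the text-based bars are only a rough stand-in for a real chart.

```python
# Percentages from the labor-force table in question 5 of these notes.
labor_force = {
    "Farming, forestry and fishing": 0.7,
    "Manufacturing, extraction, transportation and crafts": 22.7,
    "Managerial, professional and technical": 34.9,
    "Sales and office": 25.5,
    "Other services": 16.3,
}

# Pareto ordering: largest percentage first (the leftmost bar on the chart).
pareto_order = sorted(labor_force.items(), key=lambda item: item[1], reverse=True)

for category, pct in pareto_order:
    bar = "#" * round(pct)  # one '#' per percentage point, rounded
    print(f"{pct:5.1f}%  {bar}  {category}")
```

With the categories sorted this way, the major contributors appear first and can be picked out at a glance.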
10. Data from the CIA World Factbook https://www.cia.gov/library/publications/the-world-factbook/index.html.
(i) Read the description that follows, then match the religions to the bars of the Pareto chart
below. When two or more categories have equal percentages, the categories are arranged in
alphabetical order on the chart, left to right.
Subjects: ____________________________ Observations: ____________________________
There were more Mormons than Jews, and there were equal numbers of Jews and
Muslims. But most Americans are Protestants, with Roman Catholics coming in right after the
Protestants. There were equal numbers of persons who followed other religions and who followed
no religion, and both of these were higher than the number of Mormons.
(ii) What percentage of persons were neither Protestant nor Roman Catholic?
(A) Less than 50%. (B) About 50%. (C) More than 50%.
[Pareto chart: Distribution of religious following in the USA (2002 estimate). Vertical axis: Percentage. Bars numbered 1 to 7 along the horizontal axis]
GRAPHS FOR QUANTITATIVE DATA
These include Histograms, Stemplots and Boxplots, among others. Histograms and Boxplots will
be covered in a later chapter. Please study Stemplots after we cover Histograms in class.
STEMPLOTS
— display individual data values along with classes of values
— are quick and easy to construct for small data sets
— stems should be in ascending order from top to bottom with none missing between the first
and last
— leaves should be listed in ascending order from left to right with no gaps
— leaves in different rows should be aligned vertically
— vertically aligned leaves ensure that "area" under the graph represents percentage of data
— should be accompanied by a key which shows how to reconstruct the raw data
— show features of distributions in the same way histograms do: to "see" the features, rotate the
page 90° anticlockwise so that the stems become a horizontal axis with the leaves forming
ascending columns like the bars of a histogram
Splitting or Combining Stems
If it turns out that most of the leaves belong to just one or two stems, then it is possible to get a better picture of the
distribution if we increase the number of stems by splitting them so that each stem occurs twice in the Stem column. The
first occurrence of the stem carries leaves 0-4 and the second occurrence carries leaves 5-9. If it turns out that the
majority of stems have none or one leaf, then it is possible to get a better picture of the distribution if we decrease the
number of stems by putting two consecutive stems in one row and separating them by a hyphen " – ". The leaves of the
two stems are then listed in the same row and are separated by an asterisk " * ".
11. Below are the weights (in lbs) of the 4-month-old babies seen by a certain pediatrician.
14.1 12.1 15.7 14.0 15.8 12.6 11.3 14.9 12.0 14.5
15.6 15.3 12.9 14.8 11.4 16.8 14.3 11.4 15.0 14.6
12.6 14.4 16.2 15.2 16.4 14.8 11.6 14.9 16.7 15.2
Here is the stemplot.
Notice that if you rotate the page 90° anticlockwise, you can see a "histogram" with six data
classes and five bars (there is one empty data class). The six data classes are [11,12), [12,13),
[13,14), [14,15), [15,16) and [16,17). The [13,14) data class is empty. The key allows us to
reconstruct the raw data just by looking at the numbers in the stemplot.
KEY: 11|3 = 11.3 lbs
Stem Leaves
11 3 4 4 6
12 0 1 6 6 9
13
14 0 1 3 4 5 6 8 8 9 9
15 0 2 2 3 6 7 8
16 2 4 7 8
Here is a description of the distribution obtained mainly by just looking at the graph.
Center: roughly 14.5 lbs (this is the midpoint of the "14" data class)
Shape: somewhat bell-shaped
Spread: 17 - 11 = 6 lbs
Outliers: not likely
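The construction of this stemplot can be sketched in Python. The function name stemplot and its layout are illustrative assumptions; the sketch only handles values recorded to one decimal place, like the baby weights above.

```python
from collections import defaultdict

# Weights (in lbs) of the 30 babies, as listed in question 11.
weights = [
    14.1, 12.1, 15.7, 14.0, 15.8, 12.6, 11.3, 14.9, 12.0, 14.5,
    15.6, 15.3, 12.9, 14.8, 11.4, 16.8, 14.3, 11.4, 15.0, 14.6,
    12.6, 14.4, 16.2, 15.2, 16.4, 14.8, 11.6, 14.9, 16.7, 15.2,
]

def stemplot(values):
    """Return {stem: sorted leaves} for values with one decimal place."""
    leaves = defaultdict(list)
    for value in values:
        # Work in tenths so that 11.3 becomes stem 11, leaf 3.
        stem, leaf = divmod(round(value * 10), 10)
        leaves[stem].append(leaf)
    # List every stem from the first to the last, even if it has no leaves.
    return {s: sorted(leaves.get(s, [])) for s in range(min(leaves), max(leaves) + 1)}

plot = stemplot(weights)

print("KEY: 11|3 = 11.3 lbs")
for stem, row in plot.items():
    print(f"{stem:>2} | {' '.join(str(leaf) for leaf in row)}")
```

Because each leaf is printed with the same width, the leaves line up vertically, and the empty "13" stem is kept so that the gap in the distribution stays visible.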
You try this: use the judgments of shape, center, spread and outliers above to complete the
report below.
It seems that 4-month-old babies generally weigh between _____ lbs and _____ lbs, with
about half of them weighing more than _____ lbs. This doctor sees / does not see a baby whose
weight is outside of the range of expected weights. It appears that the weights of 4-month-old
babies are collected around the lower end / middle / higher end of the distribution.
12. Data obtained at http://www.eia.gov/ipm/supply.html, December 2010 International Petroleum Monthly, Table 1.3.
Below are the US monthly natural gas production amounts (in thousand-barrels) for the years
2001 and 2002. The numbers have been put in ascending order so they are no longer in time
order.
Subjects: ____________________________ Observations: ____________________________
1398, 1732, 1760, 1827, 1831, 1833, 1846, 1870, 1875, 1889, 1891, 1898, 1899, 1900, 1901,
1908, 1912, 1925, 1936, 1937, 1955, 2001, 2025, 2034
(a) Construct a stemplot to display the data - a grid is provided to help you align the leaves
vertically.
KEY: ________________________________________
(b) Obtain a description of the distribution mainly by just looking at the graph.
Shape: __________ Center: __________ Spread: __________ Outliers: __________
(c) Use your judgments of shape, center, spread and outliers to complete the report below.
It seems that production is generally between __________ thousand barrels and
__________ thousand barrels, and about half the time it’s more than __________ thousand
barrels. The USA produces / does not produce amounts that are outside of the range of expected
amounts. It appears that production is concentrated around the lower end / middle / higher end
of the distribution.
13. Snapshot below obtained at: http://en.wikipedia.org/wiki/List_of_highest_mountains.
The stemplot below was not drawn according to the guidelines in these notes; however, you
should still be able to interpret it.
(i) Who/what are the subjects of the study?
(ii) How many subjects were in the sample?
(iii) What is the variable in question? What are the units, if any?
(iv) Is the variable categorical or quantitative?
(v) What shape does the distribution have?
(vi) What is the approximate center?
(vii) What is the approximate spread?
(viii) Does it appear that the distribution has outliers?
(ix) If yes, what is your best guess as to the number of outliers?
(x) Why might the stemplot have been drawn the way it was?
ANSWERS
1. Subjects: This term's Statistics students.
Observations: The YES/NO answers.
Sample size: 40
Parameter: π = 29.8% (population proportion)
Statistic: p = 15/40 = 37.5% (sample proportion)
2. Subjects: The persons living in your household.
Observations: Their ages.
Sample size: 1 (just you!)
Parameter: µ =16.4 years (population mean)
Statistic: x̄ = 20 years (sample mean)
3. (a) The movies.
(b) Genre — categorical
No. copies — quantitative
Duration — quantitative (minutes)
Language — categorical
MPAA Rating — categorical
Price — quantitative (dollars)
4. (i) Q, C, C, Q, Q, Q, Q, C, C, C
(ii)
(a) for each balloon, we record the amount of air in it
(b) for each FSU student, we record her/his favorite ice-cream flavor
(c) for each officer at the base, we record her/his military rank
(d) for each room in the dorm, we record the number of textbooks in it
(e) for each classroom on campus, we record the current temperature inside it
(f) for each day, we record the time it takes to fully charge my cell phone
(g) for each major European city, we record the maximum legal highway speed
(h) for each household in my neighborhood, we record the most-watched TV channel
(i) for each (occurrence of the) Olympic games, we record the country that earned the highest
number of gold medals
(j) for each country of the world, we record its system of government
5. Subjects: persons in the US labor force, Observations: their occupations
6. Subjects: persons who belong to the FL Retirement System
Observations: their employment groups
(D) Cities and Special Districts, etc = 4%, County Governments = 23%, School Boards = 49%
7. Subjects: students at the college, Observations: their majors
(a) 36% = 108 students
100% = (100 ÷ 36) × 108 = 300 students in total
(b) 4% = (4 ÷ 100) × 300 = 12 Computing students
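The arithmetic for question 7 can be checked with a few lines of Python. This is a minimal sketch that assumes, as the answer key does, that the bar chart shows 36% for Chemistry and 4% for Computing; the variable names are illustrative.

```python
chemistry_count = 108   # given in the question
chemistry_pct = 36      # Chemistry bar height, as used in the answer key
computing_pct = 4       # Computing bar height, as used in the answer key

# If 36% of the total is 108 students, then 100% is 108 * 100 / 36.
total_students = chemistry_count * 100 / chemistry_pct

# Computing majors are 4% of that total.
computing_count = total_students * computing_pct / 100

print(f"Total students: {total_students:.0f}")
print(f"Computing majors: {computing_count:.0f}")
```

The same two steps, scaling a known count up to 100% and then taking a percentage of the total, work for any category on a percentage bar chart.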
8. Subjects: students, Observations: their recitation times
(a) F, (b) T, (c) F, (d) T
9. Subjects: ships owned by the USA, Observations: their types
(i) Container (ii) Bulk Carrier and Cargo (iii) 75% (iv) Barge Carrier (v) more than double
10. Subjects: persons who live in the USA, Observations: their religious affiliations
(i) 1: Protestant, 2: Roman Catholic, 3: No religion, 4: Other religion, 5: Mormon, 6: Jewish, 7: Muslim
(ii) A
11. in this order: 11, 17, 14.5, does not see, middle.
12. Subjects: the 24 months in the two years 2001 and 2002
Observations: natural gas production, in thousand-barrels
(a) KEY: 13|98 = 1398 thousand barrels
Stem Leaves
13 98
14
15
16
17 32 60
18 27 31 33 46 70 75 89 91 98 99
19 00 01 08 12 25 36 37 55
20 01 25 34
(b) Shape: left-skewed Center: roughly 1850 thousand barrels
Spread: 2100 - 1300 = 800 thousand barrels Outliers: possible
(c) in this order: 1700, 2100, 1850, produces, higher end
(Since the "1300" amount appears to be an outlier, we would say that, generally speaking, the
lower bound is 1700 thousand barrels as opposed to 1300 thousand barrels.)
13. (i) Mountains (ii) 117 (iii) Height, meters (iv) Quantitative (v) Right-skewed
(vi) roughly 7400 m (vii) 8900 - 7200 = 1700 m (viii) yes (ix) five
(x) to draw attention to the highest mountains by putting them physically at the top of the chart
______________________________________________________________________________