module - math & statistics department
Post on 23-Oct-2021
Module 1
Stat 2300 Descriptive Statistics
Upon completion of this lesson you will be able to
◦ Identify the population, parameter of interest, sample, and sample statistic in real-world situations
◦ Identify the data type in real-world situations
◦ Calculate numerical measures of center and variability for small data sets using the statistical functions of a calculator
◦ Calculate numerical measures of center and variability and create graphical representations for large data sets using RStudio
The use of specialized language in a domain (like statistics) can cause a subject to seem more difficult than it actually is.
For example…
Image source: http://nursingcrib.com/anatomy-and-physiology/anatomy-and-physiology-cells/
The Good
◦ No (or few) new words to learn to pronounce
◦ Can use prior knowledge to connect to new concepts
The Bad
◦ Old/common definitions may interfere with your understanding of the statistical term
When words that are part of everyday English are used differently in a domain, these words are said to have lexical ambiguity.
Word | My definition/use | Statistical definition/use
Get a piece of paper or open a file on your computer.
Take each of the words on the flashcard list below and either
◦ use it in a sentence, or
◦ write a short definition based on your most common use of the word.
Pay attention and keep track.
Notify the instructor of any words you think should be on lists like these but aren’t!
Population | Sample
All US citizens | All of your friends who live in the US
All students taking the SAT on a given day | The students who took the SAT exam at your local high school on that given day
All students at a certain university | The students at the university who are taking an online class
Parameter | Statistic
The median age of all US citizens (FYI: 37.2) | The median age of your friends who live in the US
The average SAT Math score of all test takers on a given day | The average SAT Math score of the test takers at your local high school on that given day
The actual proportion of female students at a certain university | The proportion of female students who are enrolled in at least one online class at a certain university
Image source: http://www.cultureeveryday.com/education-cultural-literacy/the-most-important-question-about-cultural-literacy/attachment/question-mark-in-the-sky
Nominal Data
Ordinal Data
Interval Data
Also called
◦ Categorical Data
◦ Qualitative Data
Data set: MLB
◦ Contains salary data for Major League Baseball players in the year 2010
Variables
◦ Player Name
◦ Team
◦ Position
◦ Salary
Nominal Data
◦ Bar Charts: MLB Positions
A sample of 230 applicants to a university’s business school was asked to report their undergraduate degree. Data was recorded using these codes:
◦ 1=BA
◦ 2=BBA
◦ 3=B.Eng
◦ 4=BSc
◦ 5=Other
Nominal Data
◦ Bar Charts: Degree Types
Freshman = 1, Sophomore = 2, Junior = 3, Senior = 4
How is this different from the degree types example?
IDEA Evaluations
◦ Choices are 1=No apparent progress, 2=Slight progress, 3=Moderate progress, 4=Substantial progress, 5=Exceptional progress
Substantial progress (4) is more than Slight progress (2) but is it exactly twice as much?
Someone with a salary of 4000 makes twice as much as someone with a salary of 2000.
Interval Data
◦ Histograms
Uses numerical scale on horizontal axis
Bars touch!
Mean Median Mode
Symbol:
◦ μ if the data is from the entire population
◦ x̄ if the data is from a sample
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$$
Here are the scores on the first exam in an introductory statistics course for 6 students:
Calculate the mean.
78 73 92 85 75 98
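The mean calculation for these six scores can be sketched in a few lines. The course itself uses a calculator and RStudio; Python is used here only as an assumption for illustration, and the variable names are made up for this sketch.

```python
# Exam scores from the slide.
scores = [78, 73, 92, 85, 75, 98]

# Mean = sum of the observations divided by the number of observations.
mean_score = sum(scores) / len(scores)

print(mean_score)  # 83.5
```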
Symbol: Med, Md or Q2
◦ Arrange all the data in order from smallest to largest. The median is the number in the middle.
◦ If there are an even number of observations, the median is the average of the two middle observations.
Here are the scores on the first exam in an introductory statistics course for 6 students:
Calculate the median.
78 73 92 85 75 98
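The two median rules (middle value for an odd count, average of the two middle values for an even count) might be written out as follows. This is a minimal sketch in Python, which is an assumption of this example rather than the course's own tool.

```python
# Exam scores from the slide.
scores = [78, 73, 92, 85, 75, 98]

# Step 1: arrange the data from smallest to largest.
ordered = sorted(scores)
n = len(ordered)
mid = n // 2

# Step 2: take the middle value, or average the two middle values if n is even.
if n % 2 == 1:
    median_score = ordered[mid]
else:
    median_score = (ordered[mid - 1] + ordered[mid]) / 2

print(ordered)       # [73, 75, 78, 85, 92, 98]
print(median_score)  # 81.5
```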
◦ If there is just one mode, the data is unimodal
◦ If there are two modes, the data is bimodal
◦ If more than two modes, the data is multimodal
◦ Not every data set has a mode
Only allowed measure of central tendency for nominal data.
Here are the scores on the first exam in an introductory statistics course for 6 students:
Calculate the mode.
78 73 92 85 75 98
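Finding the mode amounts to counting how often each value occurs. The sketch below (Python, chosen here as an assumption) treats a data set in which no value repeats as having no mode, matching the "not every data set has a mode" bullet above.

```python
from collections import Counter

# Exam scores from the slide.
scores = [78, 73, 92, 85, 75, 98]

# Count how many times each score occurs.
counts = Counter(scores)
highest = max(counts.values())

# A value is a mode only if it occurs more often than once;
# if every value occurs exactly once, report no mode.
if highest > 1:
    modes = sorted(value for value, count in counts.items() if count == highest)
else:
    modes = []

print(modes)  # [] : every score appears once, so there is no mode
```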
What if we add one data point to the list we had before? Say this person doesn’t do well.
Student scores:
78 73 92 85 75 98 22
Recalculate the mean. What happens?
Recalculate the median. What happens?
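To see the effect of the extra low score, the mean and median can be recomputed side by side. This sketch assumes Python's standard `statistics` module; the course itself works with a calculator and RStudio.

```python
import statistics

# Original six scores, then the same list with the low score of 22 added.
before = [78, 73, 92, 85, 75, 98]
after = before + [22]

mean_before = statistics.mean(before)
mean_after = statistics.mean(after)
median_before = statistics.median(before)
median_after = statistics.median(after)

print(mean_before, round(mean_after, 2))
print(median_before, median_after)
```

The mean falls from 83.5 to about 74.7, while the median moves only from 81.5 to 78, which illustrates how much more sensitive the mean is to a single extreme value.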
Mean is usually the first choice for interval and ordinal data
◦ Calculation includes all data points
◦ Mean is sensitive to extreme values
Median is preferred when there are extreme values in the data set
◦ Income data
◦ Stat 2300 exam scores!
Mode
◦ Not generally used for interval data
◦ The only choice for nominal data
• http://www.math.usu.edu/~schneit/CTIS/MM/
Range Variance Standard Deviation
Knowing the center of the data is not enough.
Both graphs below have the same measures of center. Which would you rather use?
Range
Advantages
◦ Quick
Disadvantages
◦ Only uses 2 observations from the whole set
◦ Very sensitive to extreme values
78 73 92 85 75 98
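The range is simply the largest value minus the smallest. A short sketch (in Python, as an assumption of this example) also shows its sensitivity to extreme values by adding the low score of 22 used earlier in the lesson.

```python
# Exam scores from the slide.
scores = [78, 73, 92, 85, 75, 98]

# Range = max - min; only two observations are used.
score_range = max(scores) - min(scores)
print(score_range)  # 25

# Adding one extreme value changes the range dramatically.
with_low_score = scores + [22]
range_with_low_score = max(with_low_score) - min(with_low_score)
print(range_with_low_score)  # 76
```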
Deviation from average
◦ Take each number on a list and subtract the average. This is its deviation.
◦ Recall: average = 83.5
What is the typical size of the deviations?
◦ What happens when you take the average of the deviations?
Number: 78 73 92 85 75 98
Deviation:
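The deviations, and the answer to the question about their average, can be checked with a short sketch. Python is assumed here purely for illustration.

```python
# Exam scores from the slide.
scores = [78, 73, 92, 85, 75, 98]
average = sum(scores) / len(scores)  # 83.5

# Deviation = each number minus the average.
deviations = [x - average for x in scores]
print(deviations)  # [-5.5, -10.5, 8.5, 1.5, -8.5, 14.5]

# Positive and negative deviations cancel, so their sum (and average) is zero.
sum_of_deviations = sum(deviations)
print(sum_of_deviations)  # 0.0
```

Because the deviations always average to zero, they cannot measure spread on their own, which is why the next step squares them.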
Square the deviations to make them all positive.
Compute the average of the squared deviations. This is the variance.
Number | Deviation | Squared Deviation
78 | -5.5 |
73 | -10.5 |
92 | 8.5 |
85 | 1.5 |
75 | -8.5 |
98 | 14.5 |
We calculate the variance differently for data from a population and data from a sample.
Why?
◦ We typically want to use a sample variance to estimate a population variance
◦ Populations usually have larger variance than a sample
◦ When we have data from a sample, we have to inflate our estimate to more closely match the variance of the population
Divide by n-1 instead of n
Number | Deviation | Squared Deviation
78 | -5.5 | 30.25
73 | -10.5 | 110.25
92 | 8.5 | 72.25
85 | 1.5 | 2.25
75 | -8.5 | 72.25
98 | 14.5 | 210.25
Sum | 0 | 497.5
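The table's arithmetic, and the difference between dividing by n and by n-1, can be reproduced as a sketch. Python is an assumption of this example, not the course's tool, and the variable names are illustrative.

```python
# Exam scores from the slide.
scores = [78, 73, 92, 85, 75, 98]
n = len(scores)
average = sum(scores) / n  # 83.5

# Square each deviation so that negatives do not cancel positives.
squared_deviations = [(x - average) ** 2 for x in scores]
sum_of_squares = sum(squared_deviations)
print(sum_of_squares)  # 497.5

# Treating the data as a whole population: divide by n.
population_variance = sum_of_squares / n

# Treating the data as a sample: divide by n - 1, which inflates the estimate.
sample_variance = sum_of_squares / (n - 1)

print(round(population_variance, 2))  # 82.92
print(sample_variance)                # 99.5
```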
Population Variance
$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$$
Sample Variance
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$$
Variance is calculated in square units
◦ i.e. s² = 99.5 points squared
◦ Hard or impossible to imagine
Use standard deviation instead
Population standard deviation: $\sigma = \sqrt{\sigma^2}$
Sample standard deviation: $s = \sqrt{s^2}$
Number | Deviation | Squared Deviation
78 | -5.5 | 30.25
73 | -10.5 | 110.25
92 | 8.5 | 72.25
85 | 1.5 | 2.25
75 | -8.5 | 72.25
98 | 14.5 | 210.25
Sum | 0 | 497.5
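Taking the square root of the variance returns the measure of spread to the original units (points). The sketch below assumes Python's standard `statistics` and `math` modules and cross-checks the by-hand value against the library functions.

```python
import math
import statistics

# Exam scores from the slide.
scores = [78, 73, 92, 85, 75, 98]

# By hand: sum of squared deviations is 497.5, so s^2 = 497.5 / (6 - 1) = 99.5.
sample_sd_by_hand = math.sqrt(497.5 / (len(scores) - 1))

# Library versions: stdev divides by n - 1, pstdev divides by n.
sample_sd = statistics.stdev(scores)
population_sd = statistics.pstdev(scores)

print(round(sample_sd_by_hand, 2))  # about 9.97 points
print(round(sample_sd, 2))          # same value from the library
print(round(population_sd, 2))      # about 9.11 points
```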
The more tools you have in your tool belt, the more flexibly you can work to solve problems.
If you (only) have a hammer…then all your problems look like nails. –Neil deGrasse Tyson
Good for small data sets and for when you already have the summary statistics.
Portable, easy to use.
Secure for testing: no internet access.
Recommended: TI-83, TI-84, or TI-89.
◦ The TI-85 and TI-86 do not have the functions you need for this class
◦ Other calculators may, but I do not provide instructions for them.
Many choices
◦ SAS
◦ SPSS
◦ Minitab
◦ Excel
Not going to use these…
Open source = free
Powerful, widely used in industry
Platform independent: looks the same on Macs & PCs
Exposure to programming language
◦ It’s good for you (and your future job prospects!)
Knowledge transfer
◦ All stat software is similar
◦ Learning one program makes learning another easier
I promise to hold your hand
R is a programming language!
R is case sensitive!
Spelling and punctuation matter!
Crawl: Learn the formulas, the ideas.
◦ Do some simple calculations “by hand”
Walk: Automate the formulas using a calculator
◦ When summary statistics are available
◦ With small data sets (n<20 or so)
Run: Use statistical software
◦ Visualize large data sets
◦ Do more complicated calculations
◦ Investigate alternative hypotheses
The single-season home run record was broken by Barry Bonds of the San Francisco Giants in 2001, when he hit 73 home runs. Here are Bonds’ home run totals from 1986 to 2003:
Calculate measures of center and variability for this data.
16 25 24 19 33 25 34 46 37
33 42 40 37 34 49 73 46 45
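One way to carry out this exercise in software is sketched below. The course uses RStudio for this step; Python's standard `statistics` module is substituted here as an assumption, and the data are treated as a sample (dividing by n-1).

```python
import statistics

# Bonds' home run totals, 1986 to 2003, as listed on the slide.
home_runs = [16, 25, 24, 19, 33, 25, 34, 46, 37,
             33, 42, 40, 37, 34, 49, 73, 46, 45]

# Measures of center.
hr_mean = statistics.mean(home_runs)
hr_median = statistics.median(home_runs)

# Measures of variability.
hr_range = max(home_runs) - min(home_runs)
hr_variance = statistics.variance(home_runs)  # sample variance (n - 1)
hr_sd = statistics.stdev(home_runs)           # sample standard deviation

print(round(hr_mean, 2))   # 36.56
print(hr_median)           # 35.5
print(hr_range)            # 57
print(round(hr_variance, 2), round(hr_sd, 2))
```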
Boxplots and the 5 number summary
◦ Minimum
◦ Q1 – first quartile
◦ Median
◦ Q3 – third quartile
◦ Maximum
[Figure: boxplot with labels min, Q1, median, Q3, max]
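A five-number summary for the Bonds home run data might be computed as in the sketch below. Quartile conventions differ between calculators and software packages; this example assumes the "median of each half" rule (the median itself is left out of both halves when n is odd), so other tools may report slightly different Q1 and Q3. The function name is hypothetical.

```python
import statistics

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For odd n, skip the middle value so it belongs to neither half.
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return (ordered[0],
            statistics.median(lower),
            statistics.median(ordered),
            statistics.median(upper),
            ordered[-1])

# Bonds' home run totals, 1986 to 2003, as listed earlier in the lesson.
home_runs = [16, 25, 24, 19, 33, 25, 34, 46, 37,
             33, 42, 40, 37, 34, 49, 73, 46, 45]

summary = five_number_summary(home_runs)
print(summary)  # (16, 25, 35.5, 45, 73)
```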
Boxplots can be horizontal or vertical
Both show how the data is distributed
Histograms give a more detailed view
The interquartile range is the difference between Q3 and Q1
◦ Range = max - min
◦ IQR = Q3 - Q1
Outliers:
◦ Data points more than Q3+1.5(IQR)
◦ Data points less than Q1-1.5(IQR)
When outliers are present, the “whiskers” will be at these cutoffs instead of min and max.
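The 1.5(IQR) rule can be tried on the seven exam scores from earlier in the lesson, including the low score of 22. This is a sketch in Python (an assumption of this example) using the median-of-halves quartile convention; other quartile conventions can move the cutoffs slightly. Here the whiskers are drawn to the most extreme data points inside the cutoffs.

```python
import statistics

# Exam scores from earlier in the lesson, including the low score of 22.
scores = sorted([78, 73, 92, 85, 75, 98, 22])
n = len(scores)
half = n // 2

# Median-of-halves quartiles; the middle value is excluded when n is odd.
q1 = statistics.median(scores[:half])
q3 = statistics.median(scores[half:] if n % 2 == 0 else scores[half + 1:])
iqr = q3 - q1

# Cutoffs for outliers.
lower_cutoff = q1 - 1.5 * iqr
upper_cutoff = q3 + 1.5 * iqr

outliers = [x for x in scores if x < lower_cutoff or x > upper_cutoff]
inside = [x for x in scores if lower_cutoff <= x <= upper_cutoff]

print(q1, q3, iqr)                 # 73 92 19
print(lower_cutoff, upper_cutoff)  # 44.5 120.5
print(outliers)                    # [22]
print(min(inside), max(inside))    # whisker ends: 73 98
```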
[Figure: boxplot with labels min, Q1, median, Q3, max; the upper whisker ends at Q3+1.5(IQR) or the last data point below Q3+1.5(IQR)]
What it looks like when a data set has lots of outliers:
A common use of boxplots is to look for differences in the distribution of an interval variable based on a nominal variable (factor)