BIO5312 Biostatistics
Lecture 1: Introduction
Yujin Chung
August 30th, 2016
Fall 2016
Yujin Chung Lec1: Descriptive Statistics Fall 2016 1/29
Basic information
Read the syllabus on Blackboard (learn.temple.edu)
Instructors:
- Junchao Xia: junchao.xia@temple.edu
- Yujin Chung: yujin.chung@temple.edu
- William Flynn: tuf31071@temple.edu
Weekly homework: posted on Blackboard on Tuesday, due the following Tuesday before class. Upload your answers to Blackboard:
- A document file (PDF, Word, etc.) including your answers
- An R code file
- Extra files (e.g., plots) if necessary
Final grade: all homework assignments contribute equally to the final grade. No exam.
What are Statistics and Biostatistics?
From Wikipedia: Statistics is the study of the collection, analysis, interpretation, presentation, and organization of data.
Statistics is the science whereby inferences are made about specific random phenomena on the basis of relatively limited sample material.
- Data exploration and analysis
- Quantification of uncertainty with probability
Biostatistics is the branch of applied statistics that applies statistical methods to medical and biological problems. For some biostatistical applications, standard methods do not apply and must be modified. In these circumstances, biostatisticians are involved in developing new methods.
Example 1: Effects of lead exposure
Research question: what are the effects of exposure to lead on neurological and psychological function in children?
Data:
Children who lived near a lead smelter in El Paso, Texas
- Exposed group: 46 children with blood levels of lead ≥ 40 µg/mL
- Control group: 78 children with blood levels of lead < 40 µg/mL
Measures of neurological function
- the number of finger-wrist taps
Measures of psychological function
- IQ
More data such as gender, age, etc.
Neurological function:
Are the numbers of finger-wrist taps in the control and exposed groups different?
What would be a good measure that represents/summarizes the numbers of finger-wrist taps?
Let’s say we use the mean. Are the means different? How large must a difference be to count as a real difference?
Goals for the course
Basics of statistics
- Probability distributions
- Statistical inference: confidence intervals, hypothesis testing
- Modern statistical practice
Statistical graphics
Using R for statistical analysis
Descriptive Statistics
Descriptive statistics: numeric or graphic display of data
to describe the data in some concise manner.
to indicate principal trends in data
Purpose: Initial data analysis and exploratory data analysis
identifying missing values, outliers, errors, etc.
checking assumptions required for model fitting and hypothesis testing
finding trends and patterns in data that merit further study
possibly formulating new hypotheses
Types of Data
Continuous data forms a continuum
Example: blood pressure, IQ
Discrete data
- Count data: obtained by counting
  ex) the number of births/deaths
- Categorical data
  - nominal data has two or more categories
    ex) blood types, genotypes
  - binary data: two categories or levels
    ex) exposed group vs control group, gender
  - ordinal data has ordered or ranked categories
ex)
A mix of them?
Ambiguities in classifying a type of data!
Q) What is the type of the number of finger-wrist taps?
Measures of Location
A measure of location is a type of measure useful for data summarization that defines the center or middle of the sample $(x_1, \ldots, x_n)$ of size $n$.
• Continuous or count data
Arithmetic mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Median:
- if $n$ is odd, the $\left(\frac{n+1}{2}\right)$th largest observation;
- if $n$ is even, the average of the $\left(\frac{n}{2}\right)$th and $\left(\frac{n}{2}+1\right)$th largest observations.
Geometric mean: the $n$th root of the product of the sample
• Discrete data
Mode: the most frequently occurring value among all the observations in a sample
Measures of Location: Arithmetic mean
Arithmetic mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
The sum of all the observations divided by the number of observations.
Linear transformation: for constants $a$ and $c$,
- If $y_i = x_i + c$ for $i = 1, \ldots, n$, then $\bar{y} = \bar{x} + c$ (location)
- If $y_i = a x_i$ for $i = 1, \ldots, n$, then $\bar{y} = a\bar{x}$ (scale)
- If $y_i = a x_i + c$ for $i = 1, \ldots, n$, then $\bar{y} = a\bar{x} + c$ (location & scale)
Oversensitive to extreme values; when extreme values are present, the mean may not be representative of the location of the majority of sample points.
- The mean of sample (1, 2, 3, 4, 5) is $\bar{x} = 3$
- The mean of sample (1, 2, 3, 4, 100) is $\bar{x} = 22$
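The two properties above can be checked numerically. The following is a minimal sketch in Python (used here only for illustration; the course itself uses R), and the constants `a` and `c` are arbitrary values chosen for the example, not taken from the lecture.

```python
from statistics import mean

x = [1, 2, 3, 4, 5]
a, c = 2, 10  # arbitrary constants, assumed for illustration

# Linear transformation y_i = a * x_i + c
y = [a * xi + c for xi in x]

mean_x = mean(x)  # arithmetic mean of the original sample
mean_y = mean(y)  # should equal a * mean_x + c

# A single extreme value pulls the mean away from the bulk of the data
mean_with_outlier = mean([1, 2, 3, 4, 100])

print(mean_x, mean_y, mean_with_outlier)
```

Replacing 5 with 100 changes only one observation, yet the mean moves from 3 to 22, which is larger than four of the five sample points.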
Measures of Location: Median
Median: Suppose there are $n$ observations in a sample. If these observations are ordered from smallest to largest, then the median is defined as follows:
- if $n$ is odd, the $\left(\frac{n+1}{2}\right)$th largest observation;
- if $n$ is even, the average of the $\left(\frac{n}{2}\right)$th and $\left(\frac{n}{2}+1\right)$th largest observations.
The rationale for these definitions is to ensure an equal number of sample points on both sides of the sample median.
Resistant to extreme values
- The median of sample (1, 2, 3, 4, 5) is 3
- The median of sample (1, 2, 3, 4, 100) is 3
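The odd/even definition above translates directly into code. This is a minimal Python sketch for illustration (the course uses R); the function name `sample_median` is hypothetical.

```python
def sample_median(sample):
    """Median by the definition above: sort, then take the middle value
    (odd n) or the average of the two middle values (even n)."""
    ordered = sorted(sample)
    n = len(ordered)
    if n == 0:
        raise ValueError("sample must not be empty")
    if n % 2 == 1:
        # ((n + 1) / 2)th observation, converted to a 0-based index
        return ordered[(n + 1) // 2 - 1]
    # average of the (n / 2)th and (n / 2 + 1)th observations
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2


print(sample_median([1, 2, 3, 4, 5]))    # odd n
print(sample_median([1, 2, 3, 4, 100]))  # same median despite the outlier
print(sample_median([1, 2, 3, 4]))       # even n
```

Both five-point samples give the same median, which illustrates the resistance to extreme values.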
Measures of Location: Geometric mean
Geometric mean: the $n$th root of the product of the sample

$$\sqrt[n]{\prod_{i=1}^{n} x_i} = \exp\left\{\frac{1}{n}\sum_{i=1}^{n} \log x_i\right\} = \exp\left\{\overline{\log x}\right\}$$

The geometric mean is used when a logarithmic transformation is appropriate (for example, when the distribution has a long right tail).
The inequality of arithmetic and geometric means: the arithmetic mean of a list of non-negative real numbers is greater than or equal to the geometric mean of the same list; further, the two means are equal if and only if every number in the list is the same.

$$\sqrt[n]{\prod_{i=1}^{n} x_i} \le \bar{x}$$
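The log form of the geometric mean and the inequality above can be illustrated with a short Python sketch (Python is used only for illustration; the course uses R). The function name `geometric_mean` and the sample values are assumptions made for the example.

```python
import math
from statistics import mean


def geometric_mean(sample):
    """exp of the mean of the logs; defined here for strictly positive values."""
    values = list(sample)
    if not values or any(v <= 0 for v in values):
        raise ValueError("sample must be non-empty and strictly positive")
    return math.exp(mean([math.log(v) for v in values]))


x = [1, 2, 4, 8]               # hypothetical right-skewed sample
gm = geometric_mean(x)         # equals (1 * 2 * 4 * 8) ** (1 / 4)
am = mean(x)                   # arithmetic mean of the same sample

print(gm, am, gm <= am)
```

Working on the log scale avoids overflow when the product of many observations would be very large.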
Comparisons of arithmetic mean, median and geometric mean
[Figure: three histograms (vertical axis: Frequency)]
Symmetric: AM = 100.109, median = 100.153, GM = 100.104
Positively skewed: AM = 0.101, median = 0.07, GM = 0.054
Negatively skewed: AM = 108.408, median = 113.646, GM = 101.174
Measures of Location: mode
Mode: the most frequently occurring value among all the observations in a sample
Count data: the mode is 28,
Nominal data: the mode is blood type O,

blood type   O     A     B     AB
%            45%   33%   17%   5%
Data distributions may have one or more modes.
One mode = unimodal; two modes = bimodal; three modes = trimodal, and so on.
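A small Python sketch of finding modes (for illustration only; the course uses R). The sample of 20 blood types is hypothetical, made up to roughly follow the percentages in the table above, and the bimodal sample is likewise invented.

```python
from collections import Counter
from statistics import multimode

# Hypothetical sample of 20 people (assumed, not the lecture's data)
blood = ["O"] * 9 + ["A"] * 7 + ["B"] * 3 + ["AB"] * 1

counts = Counter(blood)        # frequency of each category
modes = multimode(blood)       # list of all most frequent values

# A made-up count sample with two equally frequent values (bimodal)
bimodal_modes = multimode([1, 2, 2, 3, 4, 4, 5])

print(counts, modes, bimodal_modes)
```

`multimode` returns every value that attains the highest frequency, so it also covers samples with more than one mode.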
Measures of Location
All of the different measures of the “center” of a distribution are legitimate.
Which is the best measure of the “typical” value (for your situation)?
Be sure to make clear which “average” you use.
[Figure: histograms of IQ and finger-wrist tapping (vertical axis: Frequency)]
IQ: AM = 91.08, median = 91, GM = 89.91
Finger-wrist tapping: AM = 61.44, median = 56, GM = 57.289
Measures of Spread
Continuous and count data
Range: the difference between the largest and smallest observations in a sample
Quartiles/Percentiles
Interquartile range
Variance and standard deviation
Coefficient of variation (CV)
Measures of Spread: Range
Range: the difference between the largest and smallest observations in a sample
Range is very sensitive to extreme observations.
- The range of sample (1, 2, 3, 4, 5) is 5 − 1 = 4
- The range of sample (1, 2, 3, 4, 100) is 100 − 1 = 99
The larger the sample size ($n$), the larger the range tends to be, which makes it harder to compare ranges from data sets of different sizes.
Measures of Spread: Quartiles/Percentiles
The $p$th percentile is defined, with the observations ordered from smallest to largest, by
- the $(k+1)$th largest sample point if $np/100$ is not an integer (where $k$ is the largest integer less than $np/100$)
- the average of the $(np/100)$th and $(np/100 + 1)$th largest observations if $np/100$ is an integer.
Quartiles: 1st quartile (Q1, 25th percentile), 2nd quartile (Q2, 50th percentile, median), 3rd quartile (Q3, 75th percentile)
Less sensitive to extreme values
- If sample is (1, 2, 3, 4, 5), Q1 = 2, Q2 = 3, Q3 = 4
- If sample is (1, 2, 3, 4, 100), Q1 = 2, Q2 = 3, Q3 = 4
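The percentile rule above can be sketched in Python (for illustration only; the course uses R). The function name `percentile` is hypothetical, and `p` is assumed to be a whole number so that the integer check on np/100 is exact. Note that built-in quantile functions in statistical software often use other interpolation rules and may give slightly different quartiles.

```python
def percentile(sample, p):
    """p-th percentile (whole number p, 0 < p < 100) by the rule above."""
    ordered = sorted(sample)
    n = len(ordered)
    if n == 0 or not 0 < p < 100:
        raise ValueError("need a non-empty sample and 0 < p < 100")
    # k is the integer part of n*p/100; remainder is 0 exactly when n*p/100 is an integer
    k, remainder = divmod(n * p, 100)
    if remainder != 0:
        # (k + 1)th ordered value, which is index k when counting from 0
        return ordered[k]
    # average of the kth and (k + 1)th ordered values
    return (ordered[k - 1] + ordered[k]) / 2


for sample in ([1, 2, 3, 4, 5], [1, 2, 3, 4, 100]):
    print([percentile(sample, p) for p in (25, 50, 75)])
```

For n = 5, np/100 is 1.25, 2.5 and 3.75 for the three quartiles, so the rule picks the 2nd, 3rd and 4th ordered values in both samples.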
Measures of Spread: Interquartile range
The interquartile range (IQR) of a sample is Q3 − Q1. Unlike the total range, the interquartile range has a breakdown point of 25%, and is thus often preferred to the total range.
Less sensitive to extreme values, robust measure of spread
- If sample is (1, 2, 3, 4, 5), IQR = 4 − 2 = 2
- If sample is (1, 2, 3, 4, 100), IQR = 4 − 2 = 2
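To compare the two measures of spread side by side, here is a minimal Python sketch (for illustration only; the course uses R). The helper names are hypothetical, and the quartiles are computed with the percentile rule from the lecture rather than a library default.

```python
def ordered_percentile(ordered, p):
    """Percentile of an already sorted list, for whole-number p with 0 < p < 100."""
    k, remainder = divmod(len(ordered) * p, 100)
    if remainder != 0:
        return ordered[k]
    return (ordered[k - 1] + ordered[k]) / 2


def range_and_iqr(sample):
    """Return (total range, interquartile range) of a sample."""
    ordered = sorted(sample)
    total_range = ordered[-1] - ordered[0]
    iqr = ordered_percentile(ordered, 75) - ordered_percentile(ordered, 25)
    return total_range, iqr


print(range_and_iqr([1, 2, 3, 4, 5]))
print(range_and_iqr([1, 2, 3, 4, 100]))
```

The extreme value changes the range from 4 to 99 while the IQR stays at 2.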
Measures of Spread: Variance and standard deviation
Deviations are the differences between individual sample points and the arithmetic mean; that is, $x_1 - \bar{x}, x_2 - \bar{x}, \ldots, x_n - \bar{x}$.
Variance ($s^2$) is the sum of the squared deviations from the sample mean divided by $n - 1$, which is roughly their average.

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$$

A rationale for using $n - 1$ in the denominator rather than $n$ is presented in the discussion of estimation in Chapter 6.
Standard deviation is

$$s = \sqrt{s^2} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$$
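The formulas above can be computed step by step. This is a minimal Python sketch for illustration (the course uses R); the function name `sample_variance` is hypothetical, and the result is compared with the standard library's `stdev`, which also divides by n − 1.

```python
import math
from statistics import stdev


def sample_variance(sample):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two observations")
    xbar = sum(sample) / n
    return sum((xi - xbar) ** 2 for xi in sample) / (n - 1)


x = [1, 2, 3, 4, 5]
s2 = sample_variance(x)   # deviations -2, -1, 0, 1, 2 give 10 / 4
s = math.sqrt(s2)         # standard deviation

print(s2, s, stdev(x))
```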
Measures of Spread: Variance and standard deviation II
Linear transformation: for constants $a$ and $c$,
If $y_i = x_i + c$ for $i = 1, \ldots, n$, then $s_y^2 = s_x^2$, $s_y = s_x$ (location)
If $y_i = a x_i$ for $i = 1, \ldots, n$, then $s_y^2 = a^2 s_x^2$, $s_y = |a| s_x$ (scale)
If $y_i = a x_i + c$ for $i = 1, \ldots, n$, then $s_y^2 = a^2 s_x^2$, $s_y = |a| s_x$ (location & scale)
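A quick numerical check of these rules, as a minimal Python sketch (for illustration only; the course uses R). The constants `a` and `c` are arbitrary; a negative `a` is chosen on purpose to show that the standard deviation scales by the absolute value of `a`.

```python
import math
from statistics import stdev, variance

x = [1, 2, 3, 4, 5]
a, c = -3, 10  # arbitrary constants, assumed for illustration

shifted = [xi + c for xi in x]      # location change only
scaled = [a * xi + c for xi in x]   # location and scale change

var_x = variance(x)
var_shifted = variance(shifted)     # unchanged by adding c
var_scaled = variance(scaled)       # multiplied by a squared
sd_scaled = stdev(scaled)           # multiplied by |a|

print(var_x, var_shifted, var_scaled, sd_scaled)
```

Adding a constant moves every observation and the mean by the same amount, so the deviations, and hence the variance, do not change.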
[Figure: distributions illustrating a location change and a scale change (vertical axis: Proportion)]
Measures of Spread: Coefficient of Variation (CV)
Coefficient of Variation (CV) is $(s/\bar{x}) \times 100\%$.
Unit free
Useful in comparing variability of different samples with different arithmetic means
Useful for comparing the reproducibility of different variables
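The unit-free property can be seen by computing the CV of the same measurements in two units. This is a minimal Python sketch for illustration (the course uses R); the function name and the height values are hypothetical.

```python
import math
from statistics import mean, stdev


def coefficient_of_variation(sample):
    """CV in percent: 100 * s / x-bar (assumes a non-zero mean)."""
    xbar = mean(sample)
    if xbar == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return stdev(sample) / xbar * 100


heights_cm = [150, 160, 170, 180, 190]       # made-up heights in centimeters
heights_m = [h / 100 for h in heights_cm]    # the same heights in meters

cv_cm = coefficient_of_variation(heights_cm)
cv_m = coefficient_of_variation(heights_m)

print(cv_cm, cv_m)
```

Changing units rescales both s and the mean by the same factor, so their ratio stays the same up to floating-point rounding.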
Graphic Methods
Continuous data
Box plots
Histograms
Scatter plots
Discrete data
Bar graphs
Graphic Methods: Bar graphs
A bar plot is a chart that presents categorical or count data with rectangular bars with lengths proportional to the values that they represent.
[Figure: bar plot of blood types O, A, B, AB]
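The idea of bar lengths proportional to values can be sketched with a plain-text bar chart in Python (for illustration only; the course uses R). The percentages are the blood-type figures from the mode example; whether the lecture's bar plot shows exactly these values is an assumption, and the function name `text_bar_chart` is hypothetical.

```python
# Blood-type percentages from the mode example
percent = {"O": 45, "A": 33, "B": 17, "AB": 5}


def text_bar_chart(values, width=45):
    """Return one text line per category, with bar length proportional to its value."""
    largest = max(values.values())
    lines = []
    for label, value in values.items():
        bar = "#" * round(value / largest * width)
        lines.append(f"{label:>2} | {bar} {value}%")
    return lines


chart = text_bar_chart(percent)
print("\n".join(chart))
```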
Graphic Methods: Box plots
A box plot is a standard way of displaying the distribution of data based on the five-number summary: minimum, first quartile, median, third quartile, and maximum.
Graphic Methods: Box plots II
The finger-wrist tapping scores (MAXFWT) and full-scale IQ scores (IQF) seem slightly lower in the exposed group than in the control group.
Graphic Methods: Scatter plots
Scatter plots use horizontal and vertical axes to plot data points, with one variable on each axis. Their specific purpose is to show how two variables are related. The relationship between two variables is called their correlation (Chapters 5 and 11).
[Figure: scatter plot of IQ (vertical axis) against FWT (horizontal axis)]
Summary
Numeric or graphic methods for displaying data help in
- quickly summarizing a data set
- and/or presenting results to others
Steps for exploratory data analysis
1. Identify data types
   - Continuous data: mean, median, quartiles, variance, box plot, scatter plot
   - Categorical data: mode, contingency table, bar plot
   - Count data: either set of methods may apply, depending on the properties of the data
2. Do not rely on one measure or one graphic display
3. Report several measures and graphs if they provide different information
4. After statistical inference, come back to the numerical and graphical analysis and confirm that your inference agrees with your descriptive statistics!