
BIO5312 Biostatistics Lecture 1: Introduction

Yujin Chung

August 30th, 2016

Fall 2016

Yujin Chung Lec1: Descriptive Statistics Fall 2016 1/29

Basic information

Read the syllabus on Blackboard (learn.temple.edu).

Instructors:
• Junchao Xia: junchao.xia@temple.edu
• Yujin Chung: yujin.chung@temple.edu
• William Flynn: tuf31071@temple.edu

Weekly homework: posted on Tuesday on Blackboard, due the following Tuesday before class. Upload your answers to Blackboard:
• A document file (PDF, Word, etc.) with your answers
• An R code file
• Extra files (e.g., plots) if necessary

Final grade: all homework assignments contribute equally to the final grade. No exam.

What are Statistics and Biostatistics?

From Wikipedia: Statistics is the study of the collection, analysis, interpretation, presentation, and organization of data.

Statistics is the science whereby inferences are made about specific random phenomena on the basis of relatively limited sample material.
• Data exploration and analysis
• Quantification of uncertainty with probability

Biostatistics is the branch of applied statistics that applies statistical methods to medical and biological problems. In some biostatistical applications, standard methods do not apply and must be modified. In these circumstances, biostatisticians are involved in developing new methods.

Example 1: Effects of lead exposure

Research question: what are the effects of exposure to lead on neurological and psychological function in children?

Data:
• Children who lived near a lead smelter in El Paso, Texas
  • Exposed group: 46 children with blood levels of lead ≥ 40 µg/mL
  • Control group: 78 children with blood levels of lead < 40 µg/mL
• Measures of neurological function
  • the number of finger-wrist taps
• Measures of psychological function
  • IQ
• More data, such as gender and age


Neurological function:

Are the numbers of finger-wrist taps in the control and exposed groups different?

What would be a good measure that represents/summarizes the numbers of finger-wrist taps?

Suppose we use the mean. Are the group means different? How large a difference counts as a real difference?

Goals for the course

Basics of statistics
• Probability distributions
• Statistical inference: confidence intervals, hypothesis testing
• Modern statistical practice

Statistical graphics

Using R for statistical analysis

Descriptive Statistics

Descriptive statistics: numeric or graphic display of data

to describe the data in some concise manner.

to indicate principal trends in data

Purpose: Initial data analysis and exploratory data analysis

identifying missing values, outliers, errors, etc.

checking assumptions required for model fitting and hypothesis testing

finding trends and patterns in data that merit further study

possibly formulating new hypotheses

Types of Data

Continuous data forms a continuum

Example: blood pressure, IQ

Discrete data

Count data: obtained by counting
• ex) the number of births/deaths

Categorical data
• nominal data: two or more categories with no intrinsic order
  ex) blood types, genotypes
• binary data: two categories or levels
  ex) exposed group vs control group, gender
• ordinal data: ordered or ranked categories

A mix of them?

Ambiguities arise in classifying the type of data!
Q) What is the type of the number of finger-wrist taps?

Measures of Location

A measure of location is a type of measure useful for data summarization that defines the center or middle of the sample $(x_1, \dots, x_n)$ of size $n$.

• Continuous or count data

Arithmetic mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$

Median:
  • if $n$ is odd, the $\left(\frac{n+1}{2}\right)$th largest observation;
  • if $n$ is even, the average of the $\left(\frac{n}{2}\right)$th and $\left(\frac{n}{2}+1\right)$th largest observations.

Geometric mean: the $n$th root of the product of the sample values

• Discrete data

Mode: the most frequently occurring value among all the observations in a sample

Measures of Location: Arithmetic mean

Arithmetic mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$

The sum of all the observations divided by the number of observations.

Linear transformation: for constants $a$ and $c$,
• If $y_i = x_i + c$ for $i = 1, \dots, n$, then $\bar{y} = \bar{x} + c$ (location)
• If $y_i = a x_i$ for $i = 1, \dots, n$, then $\bar{y} = a\bar{x}$ (scale)
• If $y_i = a x_i + c$ for $i = 1, \dots, n$, then $\bar{y} = a\bar{x} + c$ (location & scale)

Oversensitive to extreme values, in which case it may not be representative of the location of the majority of sample points.
• The mean of sample (1, 2, 3, 4, 5) is $\bar{x} = 3$
• The mean of sample (1, 2, 3, 4, 100) is $\bar{x} = 22$
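The properties above can be checked numerically. The following is a minimal Python sketch (the course itself uses R; the choice of Python, the helper name `arithmetic_mean`, and the constants `a = 2`, `c = 10` are assumptions made purely for illustration). It reuses the two samples from the slide.

```python
def arithmetic_mean(xs):
    """Sum of the observations divided by the number of observations."""
    return sum(xs) / len(xs)


sample = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 100]

mean_sample = arithmetic_mean(sample)
mean_outlier = arithmetic_mean(with_outlier)

# Hypothetical constants for the location and scale change y_i = a * x_i + c.
a, c = 2, 10
transformed = [a * x + c for x in sample]
mean_transformed = arithmetic_mean(transformed)

print(mean_sample, mean_outlier)               # a single extreme value moves the mean
print(mean_transformed, a * mean_sample + c)   # the two values agree
```

Replacing the single value 5 with 100 pulls the mean well above four of the five observations, which is the sensitivity the slide warns about.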

Measures of Location: Median

Median: Suppose there are $n$ observations in a sample. If these observations are ordered from smallest to largest, then the median is defined as follows:

• if $n$ is odd, the $\left(\frac{n+1}{2}\right)$th largest observation;
• if $n$ is even, the average of the $\left(\frac{n}{2}\right)$th and $\left(\frac{n}{2}+1\right)$th largest observations.

The rationale for these definitions is to ensure an equal number of sample points on both sides of the sample median.

Resistant to extreme values
• The median of sample (1, 2, 3, 4, 5) is 3
• The median of sample (1, 2, 3, 4, 100) is 3
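The odd and even cases of the definition can be sketched in a few lines of Python (an illustration only; the course uses R, and the function name `median` and the even-sized sample are assumptions, not part of the lecture).

```python
def median(xs):
    """Sample median: middle value of the ordered sample (odd n),
    or the average of the two middle values (even n)."""
    ordered = sorted(xs)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # 0-based index mid is the ((n + 1) / 2)th observation in 1-based counting.
        return ordered[mid]
    # Average of the (n / 2)th and (n / 2 + 1)th observations in 1-based counting.
    return (ordered[mid - 1] + ordered[mid]) / 2


median_sample = median([1, 2, 3, 4, 5])
median_outlier = median([1, 2, 3, 4, 100])
median_even = median([4, 1, 3, 2])   # hypothetical even-sized sample

print(median_sample, median_outlier, median_even)
```

Both five-point samples give the same median, whereas their arithmetic means are 3 and 22.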

Measures of Location: Geometric mean

Geometric mean: the $n$th root of the product of the sample values

$$\sqrt[n]{\prod_{i=1}^{n} x_i} = \exp\left\{\frac{1}{n}\sum_{i=1}^{n} \log x_i\right\} = \exp\left\{\overline{\log x}\right\}$$

The geometric mean is used when a logarithmic transformation is appropriate (for example, when the distribution has a long right tail).

The inequality of arithmetic and geometric means: the arithmetic mean of a list of non-negative real numbers is greater than or equal to the geometric mean of the same list; the two means are equal if and only if every number in the list is the same.

$$\sqrt[n]{\prod_{i=1}^{n} x_i} \le \bar{x}$$
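As a rough numerical illustration of the log form and of the inequality, here is a short Python sketch. Python, the helper names, and both samples are assumptions chosen for illustration (the course uses R). The log form requires strictly positive values, so the sketch rejects anything else.

```python
import math


def arithmetic_mean(xs):
    return sum(xs) / len(xs)


def geometric_mean(xs):
    """exp of the mean of the logs; equals the nth root of the product."""
    if any(x <= 0 for x in xs):
        raise ValueError("the log form needs strictly positive values")
    return math.exp(sum(math.log(x) for x in xs) / len(xs))


skewed = [1, 2, 4, 8, 16]    # hypothetical sample with a long right tail
constant = [5, 5, 5, 5]      # hypothetical sample with identical values

gm_skewed = geometric_mean(skewed)   # 5th root of 1*2*4*8*16 = 1024
am_skewed = arithmetic_mean(skewed)

print(gm_skewed, am_skewed)                                    # GM is below AM
print(geometric_mean(constant), arithmetic_mean(constant))     # equal up to rounding
```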

Comparison of arithmetic mean, median and geometric mean

[Figure: three example distributions]
• Symmetric: AM = 100.109, median = 100.153, GM = 100.104
• Positively skewed: AM = 0.101, median = 0.07, GM = 0.054
• Negatively skewed: AM = 108.408, median = 113.646, GM = 101.174

Measures of Location: mode

Mode: the most frequently occurring value among all the observations in a sample

Count data: the mode is 28 [Figure: count data example]

Nominal data: the mode is blood type O

blood type   O     A     B     AB
%            45%   33%   17%   5%

Data distributions may have one or more modes.

One mode = unimodal; two modes = bimodal; three modes = trimodal; and so on.
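A small Python sketch of finding every mode, which also covers the bimodal case. This is illustrative only: the course uses R, the helper name `modes` is hypothetical, the blood-type list simply treats the percentages above as counts out of 100, and the bimodal sample is made up.

```python
from collections import Counter


def modes(xs):
    """Return all values that attain the highest frequency."""
    counts = Counter(xs)
    top = max(counts.values())
    return [value for value, count in counts.items() if count == top]


# Percentages from the blood-type table, treated as counts out of 100 (assumption).
blood_types = ["O"] * 45 + ["A"] * 33 + ["B"] * 17 + ["AB"] * 5
bimodal = [1, 2, 2, 3, 4, 4, 5]   # hypothetical sample with two modes

blood_mode = modes(blood_types)
two_modes = modes(bimodal)

print(blood_mode, two_modes)
```

Returning a list rather than a single value keeps the unimodal and multimodal cases under one interface.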


The different possible measures of the “center” of the distribution are all allowable.

Which is the best measure of the “typical” value (for your situation)?

Be sure to make clear which “average” you use.

[Figure: two panels, with the axis label "finger-wrist tapping"]
• Panel with axis range 40 to 140: AM = 91.08, median = 91, GM = 89.91
• Panel with axis range 20 to 100: AM = 61.44, median = 56, GM = 57.289

Measures of Spread

Continuous and count data

Range: the difference between the largest and smallest observations in a sample

Quartiles/Percentiles

Interquartile range

Variance and standard deviation

Coefficient of variation (CV)

Measures of Spread: Range

Range: the difference between the largest and smallest observations in a sample

The range is very sensitive to extreme observations.
• The range of sample (1, 2, 3, 4, 5) is 5 − 1 = 4
• The range of sample (1, 2, 3, 4, 100) is 100 − 1 = 99

The larger the sample size ($n$), the larger the range tends to be, which makes it difficult to compare ranges from data sets of different sizes.
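Both points can be seen in a short Python sketch (illustrative only; the course uses R, and the helper name `sample_range` and the list `data` are assumptions). The second part shows that the range can only stay the same or grow as observations are added to a sample.

```python
def sample_range(xs):
    """Largest observation minus smallest observation."""
    return max(xs) - min(xs)


range_sample = sample_range([1, 2, 3, 4, 5])
range_outlier = sample_range([1, 2, 3, 4, 100])

# Hypothetical observations, added one at a time to a growing sample.
data = [12, 15, 11, 19, 14, 25, 9]
running = [sample_range(data[:k]) for k in range(2, len(data) + 1)]

print(range_sample, range_outlier)
print(running)   # never decreases as the sample grows
```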

Measures of Spread: Quartiles/Percentiles

With the observations ordered from smallest to largest, the $p$th percentile is defined by

• The $(k + 1)$th largest sample point if $np/100$ is not an integer (where $k$ is the largest integer less than $np/100$)
• The average of the $(np/100)$th and $(np/100 + 1)$th largest observations if $np/100$ is an integer.

Quartiles: 1st quartile (Q1, 25th percentile), 2nd quartile (Q2, 50th percentile, median), 3rd quartile (Q3, 75th percentile)

Less sensitive to extreme values
• If the sample is (1, 2, 3, 4, 5), Q1 = 2, Q2 = 3, Q3 = 4
• If the sample is (1, 2, 3, 4, 100), Q1 = 2, Q2 = 3, Q3 = 4
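The two-case rule above translates fairly directly into code. The sketch below is one possible Python reading of it (the course uses R; the function name `percentile` and the restriction to integer `p` strictly between 0 and 100 are assumptions made to keep the arithmetic exact). Positions are counted from the smallest observation, which is the reading that reproduces the quartiles in the examples. Statistical software often uses other interpolation rules, so results from built-in quantile functions may differ.

```python
def percentile(sample, p):
    """p-th percentile under the two-case rule, for integer p with 0 < p < 100."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    ordered = sorted(sample)
    n = len(ordered)
    # k is the integer part of n*p/100; remainder == 0 means n*p/100 is an integer.
    k, remainder = divmod(n * p, 100)
    if remainder == 0:
        # Average of the k-th and (k+1)-th ordered observations (1-based).
        return (ordered[k - 1] + ordered[k]) / 2
    # The (k+1)-th ordered observation (1-based) is index k when 0-based.
    return ordered[k]


quartiles_sample = [percentile([1, 2, 3, 4, 5], p) for p in (25, 50, 75)]
quartiles_outlier = [percentile([1, 2, 3, 4, 100], p) for p in (25, 50, 75)]
median_even = percentile([1, 2, 3, 4], 50)   # hypothetical even-sized sample

print(quartiles_sample, quartiles_outlier, median_even)
```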

Measures of Spread: Interquartile range

The interquartile range (IQR) of a sample is Q3 − Q1. Unlike the total range, the interquartile range has a breakdown point of 25%, and is thus often preferred to the total range. It is less sensitive to extreme values and is a robust measure of spread.

• If the sample is (1, 2, 3, 4, 5), IQR = 4 − 2 = 2
• If the sample is (1, 2, 3, 4, 100), IQR = 4 − 2 = 2
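To contrast the IQR with the total range, here is a brief Python sketch using the standard library (an illustration; the course uses R, and the helper name `iqr` is hypothetical). Note the assumption: `statistics.quantiles` with `method="inclusive"` uses an interpolation rule that is not the percentile definition given in the lecture, although it yields the same quartiles for these two samples.

```python
import statistics


def iqr(xs):
    """Interquartile range Q3 - Q1, using the library's 'inclusive' quartiles."""
    q1, _q2, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    return q3 - q1


sample = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 100]

iqr_sample = iqr(sample)
iqr_outlier = iqr(with_outlier)
range_outlier = max(with_outlier) - min(with_outlier)

print(iqr_sample, iqr_outlier, range_outlier)   # the IQR is unchanged by the outlier
```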

Measures of Spread: Variance and standard deviation

Deviations are the differences between individual sample points and the arithmetic mean; that is, $x_1 - \bar{x}, x_2 - \bar{x}, \dots, x_n - \bar{x}$.

Variance ($s^2$) is, up to the divisor $n - 1$ in place of $n$, the average of the squares of the deviations from the sample mean.

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$$

A rationale for using $n - 1$ in the denominator rather than $n$ is presented in the discussion of estimation in Chapter 6.

Standard deviation is $s = \sqrt{s^2} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$
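A direct Python transcription of the formulas, checked against the standard library (illustrative only; the course uses R, and the helper names are assumptions). The samples are the ones used elsewhere in the lecture.

```python
import math
import statistics


def sample_variance(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


def sample_sd(xs):
    return math.sqrt(sample_variance(xs))


sample = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 100]

var_sample = sample_variance(sample)         # squared deviations 4 + 1 + 0 + 1 + 4, over 4
var_outlier = sample_variance(with_outlier)

print(var_sample, sample_sd(sample))
print(var_outlier, sample_sd(with_outlier))  # one extreme value inflates both
print(statistics.variance(sample))           # the library also divides by n - 1
```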

Measures of Spread: Variance and standard deviation II

Linear transformation: for constants $a$ and $c$,

• If $y_i = x_i + c$ for $i = 1, \dots, n$, then $s_y^2 = s_x^2$, $s_y = s_x$ (location)
• If $y_i = a x_i$ for $i = 1, \dots, n$, then $s_y^2 = a^2 s_x^2$, $s_y = |a| s_x$ (scale)
• If $y_i = a x_i + c$ for $i = 1, \dots, n$, then $s_y^2 = a^2 s_x^2$, $s_y = |a| s_x$ (location & scale)

[Figure: effect of location and scale changes on a distribution; legend: location, scale]
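These rules can be verified numerically. The Python sketch below is an illustration under assumptions: the course uses R, and the constants `a = -3` and `c = 7` are hypothetical. A negative `a` is chosen on purpose, because the standard deviation cannot be negative and therefore scales by the absolute value of `a`.

```python
import math
import statistics

x = [1, 2, 3, 4, 5]
a, c = -3, 7                      # hypothetical scale and location constants

shifted = [xi + c for xi in x]    # location change only
scaled = [a * xi + c for xi in x] # location and scale change

var_x = statistics.variance(x)
sd_x = statistics.stdev(x)

print(statistics.variance(shifted), var_x)         # unchanged by the shift
print(statistics.variance(scaled), a ** 2 * var_x) # multiplied by a squared
print(statistics.stdev(scaled), abs(a) * sd_x)     # multiplied by |a|
```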

Measures of Spread: Coefficient of Variation (CV)

Coefficient of Variation (CV) is $(s/\bar{x}) \times 100\%$.

Unit free

Useful in comparing variability of different samples with different arithmetic means

Useful for comparing the reproducibility of different variables
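The unit-free property is easy to demonstrate: expressing the same measurements in different units changes the standard deviation but not the CV. The Python sketch below uses hypothetical height data and a hypothetical helper name (the course uses R).

```python
import math
import statistics


def coefficient_of_variation(xs):
    """Standard deviation as a percentage of the arithmetic mean."""
    return statistics.stdev(xs) / statistics.mean(xs) * 100


heights_cm = [150, 160, 170, 180, 190]       # hypothetical heights in centimetres
heights_m = [h / 100 for h in heights_cm]    # the same heights in metres

cv_cm = coefficient_of_variation(heights_cm)
cv_m = coefficient_of_variation(heights_m)

print(statistics.stdev(heights_cm), statistics.stdev(heights_m))  # differ by a factor of 100
print(cv_cm, cv_m)                                                # agree up to rounding
```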

Graphic Methods

Continuous data

Box plots

Histograms

Scatter plots

Discrete data

Bar graphs

Graphic Methods: Bar graphs

A bar plot is a chart that presents categorical or count data with rectangular bars whose lengths are proportional to the values that they represent.

[Figure: bar plot of blood types O, A, B, AB]

Graphic Methods: Box plots

A box plot is a standard way of displaying the distribution of data based on the five-number summary: minimum, first quartile, median, third quartile, and maximum.

Graphic Methods: Box plots II

The finger-wrist tapping scores (MAXFWT) and full-scale IQ scores (IQF) seem slightly lower in the exposed group than in the control group.

Graphic Methods: Scatter plots

Scatter plots use horizontal and vertical axes to plot data points, one point per pair of observations. Their specific purpose is to show how two variables vary together. The relationship between two variables is called their correlation (Chapters 5 and 11).

[Figure: scatter plot]

Summary

Numeric or graphic methods for displaying data help in

quickly summarizing a data set

And/or presenting results to others

Steps for exploratory data analysis
1 Identify data types
  • Continuous data: mean, median, quartiles, variance, box plot, scatter plot
  • Categorical data: mode, contingency table, bar plot
  • Count data: maybe both, depending on the properties of the data
2 Do not rely on one measure or one graphic display
3 Report several measures and graphs if they provide different information
4 After statistical inference, come back to the numerical and graphical analysis and confirm your inference agrees with your descriptive statistics!
