
Course Overview

STT 864: Statistical Methods II


General information

Time: M-W, 12:40-2:00pm

Place: A220 Wells Hall

Instructor: Ping-Shou Zhong

Office Hours: Tues, 12:30-2:30pm at C418 Wells Hall and by appointment

E-mail: [email protected]


References

- Faraway, J. (2005), Extending the Linear Model with R: Generalized Linear, Mixed Effects and Nonparametric Regression Models, Chapman and Hall/CRC.
- McCulloch, C. and Searle, S. (2001), Generalized, Linear, and Mixed Models, Wiley & Sons.
- Hardin, J. and Hilbe, J. (2007), Generalized Linear Models and Extensions, 2nd Edition, Stata Press.


Main topics

- Review of linear models
- Non-linear models
- Generalized linear models
- Linear mixed models
- Generalized linear mixed models


Laboratory

- Main purpose: demonstrate the practical implementation of the statistical methods and give you opportunities to analyze data in class.
- There will be four to five labs this semester.
- Lab location: B110G (tentative, TBA).


Homework

- Homework will typically be assigned on Wednesdays.
- You will need R for some of the computations, so please install R on your personal computer.
- You may discuss the homework with your classmates, but you must complete it independently.


Grading

- Homework: 40%
- Course Project: 30%
- Final Exam: 30%


Course overview: Linear models

- Linear models are used to study the relationship between a set of predictors and a response.
- In general, $Y$ denotes the response variable, the variable we would like to predict. $X = (X_1, X_2, \ldots, X_p)^T$ are the predictors (covariates) used to predict $Y$.
- Typically, the response and the predictors are observed for $n$ subjects; $n$ is called the sample size.


Example: Beverage study data set


- This data set comes from a study conducted by Baty et al. (2006).
- The original purpose of the study was to measure the influence of beverages on blood gene expression.
- A further aim was to explore the underlying mechanisms of the cardioprotective effects of beverages.


Some Fundamentals of Microarray Biology

[Figure: a cell, with labels for the nucleus, a chromosome and DNA strands]


DNA contains genes that code for proteins.

DNA → (transcription) → RNA → (translation) → protein

Proteins perform essential biological functions.


Microarray technology

- Microarrays allow researchers to measure the abundance of thousands of mRNA transcripts in multiple biological samples.
- By understanding how transcript abundance changes across experimental conditions, researchers gain clues about gene function and learn how genes work together to carry out biological processes.


Example: Beverage study data set

- Six healthy individuals participated in the experiment.
- Four different beverages (500 mL each: grape juice, red wine, 40 g diluted ethanol, water) were evaluated in the study.
- Blood samples were taken after the participants drank the beverages.
- Gene expression for 22,238 genes was measured from the blood samples.


Build a linear model for the beverage study data set

- What is the response?
  - Gene expression is the response.
- What are the covariates/predictors?
  - The types of beverage are the predictors/covariates.


Define response and predictors

- Response variable: $Y_i$ is the expression level of one particular gene measured on the $i$-th individual.
- How do we define the covariates/predictors?
  - The types of beverage are the predictors/covariates.
  - Let $X_i$ denote the beverage taken by the $i$-th individual. Categorical data are typically represented with dummy variables, that is, $X_i = (X_{i1}, X_{i2}, X_{i3}, X_{i4})^T$.
  - $X_{i1}$ is the dummy variable for grape juice: $X_{i1} = 1$ if the $i$-th individual drinks grape juice and $X_{i1} = 0$ otherwise. Similarly, $X_{i2}$ is the dummy variable for red wine, $X_{i3}$ for diluted ethanol, and $X_{i4}$ for water.
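The dummy coding described on this slide can be sketched in a few lines of Python. The list of beverages taken is made up for illustration, and the helper name `dummy_code` is hypothetical; only the four beverage levels come from the slides.

```python
# Beverage levels in the order used on the slide: X_i1, X_i2, X_i3, X_i4.
BEVERAGES = ["grape juice", "red wine", "diluted ethanol", "water"]

def dummy_code(beverage):
    """Return (X_i1, X_i2, X_i3, X_i4): 1 in the slot of the beverage taken, 0 elsewhere."""
    if beverage not in BEVERAGES:
        raise ValueError(f"unknown beverage: {beverage!r}")
    return tuple(1 if beverage == level else 0 for level in BEVERAGES)

# Hypothetical assignments for six observations (invented, not the study data).
taken = ["water", "red wine", "grape juice", "diluted ethanol", "red wine", "water"]
X = [dummy_code(b) for b in taken]

for b, row in zip(taken, X):
    print(f"{b:16s} -> {row}")

# Each row contains exactly one 1, so the four dummies always sum to 1.
row_sums = [sum(row) for row in X]
print(row_sums)
```

Because the four dummies sum to 1 for every observation, they are linearly dependent on a constant (intercept) column. Software therefore usually drops one level or imposes a constraint when an intercept is also included.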


Example: A linear model

A linear model for studying the relationship between beverages and gene expression is

$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \beta_3 X_{i3} + \beta_4 X_{i4} + \varepsilon_i, \quad i = 1, \ldots, n,$$

where $\varepsilon_i$ is the measurement error, typically assumed to be normally distributed with mean 0 and unknown variance $\sigma^2$, and $\beta_0, \beta_1, \beta_2, \beta_3, \beta_4$ are unknown parameters.


Linear models in a matrix form

- Let $Y = (Y_1, \ldots, Y_n)^T$ be the $n \times 1$ response vector.
- Let $X_i = (1, X_{i1}, \ldots, X_{i4})^T$ be the $5 \times 1$ predictor vector obtained from the $i$-th individual, and let $X = (X_1, \ldots, X_n)^T$ be the $n \times 5$ design matrix.
- Let $\beta = (\beta_0, \ldots, \beta_4)^T$ and $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)^T$.
- The linear model can then be written in matrix form as $Y = X\beta + \varepsilon$.
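To make the matrix form concrete, the sketch below builds a small design matrix and checks a least-squares fit against the normal equations $X^T(Y - X\hat\beta) = 0$. Several things here are assumptions rather than part of the slides: the eight response values are invented, and the water dummy is dropped (water serves as the baseline) so that the design matrix has four linearly independent columns instead of the five on the slide.

```python
# Toy data, invented for illustration: expression of one gene for 8 observations.
levels = ["water", "grape juice", "red wine", "diluted ethanol"]  # water = baseline (assumption)
taken = ["water", "water", "grape juice", "grape juice",
         "red wine", "red wine", "diluted ethanol", "diluted ethanol"]
Y = [7.0, 7.2, 6.8, 7.0, 7.3, 7.5, 6.9, 6.7]

# Design matrix rows: (1, dummy for grape juice, dummy for red wine, dummy for diluted ethanol).
X = [[1] + [1 if b == lev else 0 for lev in levels[1:]] for b in taken]

def group_mean(level):
    """Average response over the observations that received the given beverage."""
    values = [y for y, b in zip(Y, taken) if b == level]
    return sum(values) / len(values)

# For this one-factor design the least-squares solution has a closed form:
# intercept = baseline mean, other coefficients = differences from the baseline mean.
beta = [group_mean("water")] + [group_mean(lev) - group_mean("water") for lev in levels[1:]]

fitted = [sum(x_ij * b_j for x_ij, b_j in zip(row, beta)) for row in X]  # X beta
resid = [y - f for y, f in zip(Y, fitted)]                               # Y - X beta

# Normal equations: X^T (Y - X beta) should be zero, one entry per column of X.
normal_eq = [sum(row[j] * r for row, r in zip(X, resid)) for j in range(len(beta))]
print([round(b, 3) for b in beta])
print(all(abs(v) < 1e-9 for v in normal_eq))
```

In practice a statistics package such as R solves the least-squares problem directly; the closed form is used here only to keep the sketch free of external dependencies.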


Basic assumptions

- $Y \mid X$ is normally distributed;
- $E(Y \mid X) = X\beta$ is a linear function of $\beta$;
- $\varepsilon_1, \ldots, \varepsilon_n$ are independent, that is, $Y_1, \ldots, Y_n$ are independent.


Outline of the course

- Generalized linear models, which allow $Y \mid X$ to follow binary, count and other distributions, and also allow $E(Y \mid X)$ to be a non-linear function of the unknown parameters.
- Non-linear models, which assume $E(Y \mid X)$ to be a non-linear function of the unknown parameters.
- Linear mixed models and generalized linear mixed models, which allow some dependence among the observations $Y_1, \ldots, Y_n$.


Example for generalized linear models

Consider a breast cancer study conducted by Richardson et al. (2006). The study aims to provide insight into the molecular pathogenesis of sporadic basal-like cancers (BLC), a distinct class of human breast cancers.

Forty-seven subjects participated in this study. For each patient, single nucleotide polymorphism (SNP) array data and microarray gene expression data were measured. The original data consist of 7 normal specimens, 2 BRCA-associated breast cancer specimens, 18 sporadic BLC specimens and 20 non-BLC specimens.


Questions

If we would like to find out which genes are associated with BLC,

- which response should be used?
- what are the covariates?
- can we fit the data using linear models?


Questions

If we would like to find out which genes are associated with all four types of breast cancers,

- which response should be used?
- what are the covariates?
- can we fit the data using linear models?


Example for non-linear models

Let us consider the beverage study in more detail. In the experiment, for each individual and each beverage, blood samples were taken at baseline (0 hours, before drinking the beverage) and at 1, 2, 4 and 12 hours after the drink, which was taken together with standardized nutrition. RNA from 120 samples was hybridized on Affymetrix microarrays.


Gene expression profile for gene 1 in Alcohol group

[Figure: gene expression for gene 1 for individuals with alcohol (y-axis, about 6.8 to 7.2) plotted against hours (x-axis, 0 to 12)]


Nonlinear relationship

- Consider time in hours as the covariate and the expression of gene 1 as the response.
- The profile suggests that $E(Y \mid X)$ is not linear in $X$; that is, we cannot write $E(Y \mid X) = \beta^T X$.
- A non-linear regression may be more appropriate, that is, a model assuming $E(Y \mid X) = g(X; \beta)$, where $g(X; \beta)$ is a non-linear function of $X$ and $\beta$.
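As a point of comparison, the two kinds of mean function can be written side by side. The exponential form below is only a hypothetical choice of $g$ for a single covariate $t$ (hours); the slides do not specify which non-linear function is used.

```latex
% Linear in beta: the mean is an inner product of beta and X.
E(Y \mid X) = \beta^T X

% Hypothetical non-linear example with a single covariate t (hours):
E(Y \mid t) = g(t; \beta) = \beta_0 + \beta_1 \exp(-\beta_2 t)

% Here \partial g / \partial \beta_1 = \exp(-\beta_2 t) still depends on \beta_2,
% so g is not a linear function of \beta.
```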


Example for linear mixed models

Let us now examine the structure of the beverage study data set more carefully. The design of the experiment can be illustrated in the following plot:

[Figure: design of the beverage experiment]


Dependence among observations

- Consider the gene expression observed for the $j$-th gene at the $k$-th hour for the $i$-th individual, and denote it by $Y_{ijk}$.
- Observations of the same gene from the same individual at different hours are dependent; that is, $Y_{ijk}$, $k = 0, 1, 2, 4, 12$, are dependent on each other.
- Observations from the same individual for different genes are also dependent; that is, $Y_{ijk}$, $j = 1, 2, 3, 4, \ldots$, are dependent.
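One common way to model the within-individual dependence described above is a random intercept. The sketch below is for a fixed gene $j$ and rests on assumptions not stated in the slides: the symbols $\mu_{jk}$, $b_i$ and $\sigma_b^2$ are introduced here for illustration, and $b_i$ and the errors $\varepsilon_{ijk}$ are taken to be mutually independent.

```latex
% Random-intercept sketch for a fixed gene j: b_i is shared by all hours k of individual i.
Y_{ijk} = \mu_{jk} + b_i + \varepsilon_{ijk}, \qquad
b_i \sim N(0, \sigma_b^2), \qquad \varepsilon_{ijk} \sim N(0, \sigma^2)

% The shared b_i induces dependence between different hours k \neq k':
\mathrm{Cov}(Y_{ijk}, Y_{ijk'}) = \mathrm{Var}(b_i) = \sigma_b^2

\mathrm{Corr}(Y_{ijk}, Y_{ijk'}) = \frac{\sigma_b^2}{\sigma_b^2 + \sigma^2}
```

Under these assumptions, observations from different individuals remain independent, while any two observations on the same individual share the positive covariance $\sigma_b^2$.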