Post on 10-Jul-2020

1/29

Course Overview

STT 864: Statistical Methods II

2/29

General information

Time: M-W, 12:40-2:00pm

Place: A220 Wells Hall

Instructor: Ping-Shou Zhong

Office Hours: Tues, 12:30-2:30pm at C418 Wells Hall and by appointment

E-mail: pszhong@stt.msu.edu

3/29

References

- Faraway, J. (2005), Extending the Linear Model with R: Generalized Linear, Mixed Effects and Nonparametric Regression Models, Chapman and Hall/CRC.
- McCulloch, C. and Searle, S. (2001), Generalized, Linear, and Mixed Models, Wiley & Sons.
- Hardin, J. and Hilbe, J. (2007), Generalized Linear Models and Extensions, 2nd Edition, Stata Press.

4/29

Main topics

- Review of linear models
- Non-linear models
- Generalized linear models
- Linear mixed models
- Generalized linear mixed models

5/29

Laboratory

- Main purpose: demonstrate the practical implementation of the statistical methods and give you opportunities to analyze data in class.
- There will be four to five labs this semester.
- Lab location: B110G (tentative, TBA).

6/29

Homework

- Homework will typically be assigned on Wednesdays.
- You will need R for some of the computation, so install R on your personal computer.
- You may discuss the homework with your classmates, but you must complete it independently.

7/29

Grading

- Homework: 40%
- Course Project: 30%
- Final Exam: 30%

8/29

Course overview: Linear models

- Linear models are used to study the relationship between a set of predictors and a response.
- In general, $Y$ denotes the response variable, which is the variable we would like to predict. $X = (X_1, X_2, \ldots, X_p)^T$ are the predictors (covariates) used to predict the response variable $Y$.
- Typically, the response variable and the predictors are observed for $n$ subjects. Here $n$ is called the sample size.

9/29

Example: Beverage study data set

- This data set comes from a study conducted by Baty et al. (2006).
- The original purpose of the study was to measure the influence of beverages on blood gene expression.
- The aim was to explore the mechanisms underlying the cardioprotective effects of beverages.

12/29

Some Fundamentals of Microarray Biology

[Figure: a cell, with the nucleus, a chromosome, and DNA strands labeled.]

13/29

DNA contains genes that code for proteins.

DNA → (transcription) → RNA → (translation) → protein

Proteins perform essential biological functions.

14/29

Microarray technology

- Microarrays allow researchers to measure the abundance of thousands of mRNA transcripts in multiple biological samples.
- By understanding how transcript abundance changes across experimental conditions, researchers gain clues about gene function and learn how genes work together to carry out biological processes.

15/29

Example: Beverage study data set

- Six healthy individuals participated in the experiment.
- Four different beverages (500mL each: grape juice, red wine, 40g diluted ethanol, water) were evaluated in the study.
- Blood samples were taken after the participants drank the beverages.
- Gene expression data for 22,238 genes were measured from the blood samples.

16/29

Build a linear model for the beverage study data set

- What is the response?
  - Gene expression is the response.
- What are the covariates/predictors?
  - The types of beverage are the predictors/covariates.

17/29

Define response and predictors

- Response variable: $Y_i$ is the gene expression measurement for one particular gene obtained from the $i$-th individual.
- How should the covariates/predictors be defined?
  - The types of beverage are the predictors/covariates.
- Let $X_i$ denote the beverage taken by the $i$-th individual. Categorical data are typically represented by dummy variables, that is, $X_i = (X_{i1}, X_{i2}, X_{i3}, X_{i4})^T$.
- $X_{i1}$ is the dummy variable for grape juice: $X_{i1} = 1$ if the $i$-th individual drinks grape juice and $X_{i1} = 0$ otherwise. Likewise, $X_{i2}$ is the dummy variable for red wine, $X_{i3}$ is the dummy variable for diluted ethanol, and $X_{i4}$ is the dummy variable for water.
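The dummy coding described above can be sketched in a few lines. The course itself uses R; this is a minimal Python illustration using only the standard library. The function name `dummy_code` and the list `assigned` are hypothetical and are not part of the study data.

```python
# Beverage labels in the order used on the slide: X_i1, X_i2, X_i3, X_i4.
BEVERAGES = ("grape juice", "red wine", "diluted ethanol", "water")


def dummy_code(beverage):
    """Return (X_i1, X_i2, X_i3, X_i4) for one individual's beverage."""
    if beverage not in BEVERAGES:
        raise ValueError(f"unknown beverage: {beverage!r}")
    # Exactly one indicator equals 1; the other three are 0.
    return tuple(1 if beverage == level else 0 for level in BEVERAGES)


# Hypothetical assignment of beverages to four individuals (illustration only).
assigned = ["grape juice", "water", "red wine", "diluted ethanol"]
for i, beverage in enumerate(assigned, start=1):
    print(f"individual {i}: {beverage:16s} -> {dummy_code(beverage)}")
```

Each individual receives exactly one 1 among the four indicators, which is what makes the vector a coding of a single categorical variable.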

18/29

Example: A linear model

A linear model for studying the relationship between beverages and gene expression is

$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \beta_3 X_{i3} + \beta_4 X_{i4} + \varepsilon_i, \quad i = 1, \ldots, n,$

where $\varepsilon_i$ is measurement error. The measurement error $\varepsilon_i$ is typically assumed to be normally distributed with mean 0 and unknown variance $\sigma^2$. $\beta_0, \beta_1, \beta_2, \beta_3, \beta_4$ are unknown parameters.
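As a worked consequence of this model: exactly one dummy variable equals 1 for each individual and the error has mean 0, so the model implies one mean expression level per beverage. The display below is derived from the equation above and is not taken from the slides.

```latex
\begin{align*}
E(Y_i \mid \text{grape juice})     &= \beta_0 + \beta_1, \\
E(Y_i \mid \text{red wine})        &= \beta_0 + \beta_2, \\
E(Y_i \mid \text{diluted ethanol}) &= \beta_0 + \beta_3, \\
E(Y_i \mid \text{water})           &= \beta_0 + \beta_4.
\end{align*}
```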

19/29

Linear models in a matrix form

- Let $Y = (Y_1, \ldots, Y_n)^T$ be the $n \times 1$ response vector.
- Let $X_i = (1, X_{i1}, \ldots, X_{i4})^T$ be the $5 \times 1$ predictor vector obtained from the $i$-th individual. Let $X = (X_1, \ldots, X_n)^T$ be the $n \times 5$ design matrix.
- Let $\beta = (\beta_0, \ldots, \beta_4)^T$ and $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)^T$.
- The linear model can then be written in matrix form as $Y = X\beta + \varepsilon$.
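The construction of the design matrix can be sketched as follows, in plain Python with hypothetical beverage assignments (`assigned` is made up for illustration). The sketch also checks a property of this particular coding: the intercept column equals the sum of the four dummy columns, so the five columns are linearly dependent and $\beta$ is not uniquely determined without a constraint. Dropping one dummy to serve as a baseline, shown in the last line, is one common remedy and is an assumption here, not something the slides prescribe.

```python
BEVERAGES = ("grape juice", "red wine", "diluted ethanol", "water")


def design_row(beverage):
    """Return X_i = (1, X_i1, X_i2, X_i3, X_i4) for one individual."""
    return [1] + [1 if beverage == level else 0 for level in BEVERAGES]


# Hypothetical beverages for n = 6 individuals (illustration only).
assigned = ["grape juice", "red wine", "diluted ethanol", "water",
            "grape juice", "water"]

# X stacks the rows X_1^T, ..., X_n^T, giving an n x 5 matrix.
X = [design_row(b) for b in assigned]
n, p = len(X), len(X[0])
print(f"X is {n} x {p}")
for row in X:
    print(row)

# In every row the intercept entry equals the sum of the four dummies,
# so the five columns of X are linearly dependent.
intercept_is_sum = all(row[0] == sum(row[1:]) for row in X)
print("intercept column = sum of dummy columns:", intercept_is_sum)

# Dropping the water column makes water the baseline level.
X_baseline = [row[:4] for row in X]
```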

20/29

Basic assumptions

- $Y \mid X$ is normally distributed;
- $E(Y \mid X) = X\beta$ is a linear function of $\beta$;
- $\varepsilon_1, \ldots, \varepsilon_n$ are independent, namely, $Y_1, \ldots, Y_n$ are independent.
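Taken together with the common error variance $\sigma^2$ from the model above, these three assumptions can be summarized in a single distributional statement, where $I_n$ denotes the $n \times n$ identity matrix (notation introduced here, not on the slides):

```latex
Y \mid X \sim N_n\!\left(X\beta,\ \sigma^2 I_n\right)
```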

21/29

Outline of the course

- Generalized linear models, which allow $Y \mid X$ to follow binary, count, and other distributions, and also allow $E(Y \mid X)$ to be a non-linear function of the unknown parameters.
- Non-linear models, which assume $E(Y \mid X)$ to be a non-linear function of the unknown parameters.
- Linear mixed models and generalized linear mixed models. These models allow some dependence among the observations $Y_1, \ldots, Y_n$.
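To make the first item concrete, one standard generalized linear model for a binary response is logistic regression. It is given here only as an illustration and is not stated on the slide. The mean of $Y_i$ is a non-linear function of $\beta$, although it depends on $\beta$ only through the linear predictor $X_i^T \beta$:

```latex
Y_i \mid X_i \sim \text{Bernoulli}(\pi_i), \qquad
\log\frac{\pi_i}{1 - \pi_i} = X_i^T \beta
\quad\Longleftrightarrow\quad
E(Y_i \mid X_i) = \pi_i = \frac{\exp(X_i^T \beta)}{1 + \exp(X_i^T \beta)}.
```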

22/29

Example for generalized linear models

Consider a breast cancer study conducted by Richardson et al. (2006). The study aims to provide insight into the molecular pathogenesis of sporadic basal-like cancers (BLC), a distinct class of human breast cancers.

Forty-seven subjects participated in this study. For each patient, single nucleotide polymorphism (SNP) array data and microarray gene expression were measured. The original data consist of 7 normal specimens, 2 BRCA-associated breast cancer specimens, 18 sporadic BLC specimens and 20 non-BLC specimens.

23/29

Questions

If we would like to find out which genes are associated with BLC cancers,

- what response should be used?
- what are the covariates?
- can we fit them using linear models?

24/29

Questions

If we would like to find out which genes are associated with all four types of breast cancer,

- what response should be used?
- what are the covariates?
- can we fit them using linear models?

25/29

Example for non-linear models

Let us consider the beverage study in more detail. In the experiment, for each individual and each beverage, blood samples were taken at baseline (0 hours, before drinking the beverage) and at 1, 2, 4, and 12 hours after the drink together with standardized nutrition. RNA from 120 samples was hybridized on Affymetrix microarrays.
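The figure of 120 samples follows from the design: six individuals, four beverages, and five time points. A short check in Python, assuming one sample per combination:

```python
from itertools import product

individuals = range(1, 7)  # six participants
beverages = ("grape juice", "red wine", "diluted ethanol", "water")
hours = (0, 1, 2, 4, 12)   # baseline plus four follow-up times

# One blood sample per (individual, beverage, hour) combination.
samples = list(product(individuals, beverages, hours))
print(len(samples))  # 6 * 4 * 5 = 120
```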

26/29

Gene expression profile for gene 1 in Alcohol group

[Figure: gene expression for gene 1 for individuals with alcohol (y-axis, ticks from 6.8 to 7.2) plotted against hours (x-axis, 0 to 12).]

27/29

Nonlinear relationship

- Consider time in hours as the covariate and the gene expression for gene 1 as the response.
- The expression profile suggests that $E(Y \mid X)$ is not linear in $X$. Namely, we cannot write $E(Y \mid X) = \beta^T X$.
- A non-linear regression may be more appropriate. That is, assume $E(Y \mid X) = g(X; \beta)$, where $g(X; \beta)$ is a non-linear function of $X$ and $\beta$.
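As an illustration of what such a $g(X; \beta)$ might look like, the sketch below uses a hypothetical rise-and-decay curve. Both the functional form and the parameter values are assumptions made for illustration; they are not the model fitted in the course or in the original study. The last two lines show that the function is not linear in $\beta$: doubling $\beta$ does not double the mean.

```python
import math


def g(x, beta):
    """Hypothetical mean function: baseline plus a rise-and-decay term."""
    b0, b1, b2 = beta
    return b0 + b1 * x * math.exp(-b2 * x)


hours = (0, 1, 2, 4, 12)
beta = (7.0, 0.2, 0.5)  # made-up values, chosen only for illustration

for x in hours:
    print(f"hour {x:2d}: g = {g(x, beta):.3f}")

# A mean of the form beta^T X would satisfy g(x; 2*beta) == 2 * g(x; beta).
doubled = tuple(2 * b for b in beta)
print(g(2, doubled), 2 * g(2, beta))
```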

28/29

Example for linear mixed models

Let us now examine the data structure of the beverage study data set more carefully. The design of the experiment can be illustrated in the following plot:

29/29

Dependence among observations

- Consider the gene expression measurement for the $j$-th gene at the $k$-th hour for the $i$-th individual. Denote it by $Y_{ijk}$.
- The observations of the same gene from the same individual at different hours are dependent. Namely, $Y_{ijk}$, $k = 0, 1, 2, 4, 12$, are dependent on each other.
- The observations of gene expression from the same individual for different genes are also dependent. Namely, $Y_{ijk}$, $j = 1, 2, 3, 4, \ldots$, are dependent.
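One common way such dependence arises is through an individual-specific effect shared by all of that individual's measurements. The simulation below is a sketch under that assumption (a random intercept with made-up standard deviations), not an analysis of the beverage data. Two time points from the same simulated individual share the effect `b_i`, and the sample correlation between them comes out well above zero. With these settings the theoretical correlation is 1 / (1 + 0.25) = 0.8.

```python
import random

random.seed(864)


def simulate(n_individuals, sd_individual=1.0, sd_noise=0.5):
    """Simulate two time points per individual with a shared random effect."""
    first, second = [], []
    for _ in range(n_individuals):
        b_i = random.gauss(0.0, sd_individual)  # shared by both time points
        first.append(7.0 + b_i + random.gauss(0.0, sd_noise))
        second.append(7.0 + b_i + random.gauss(0.0, sd_noise))
    return first, second


def correlation(u, v):
    """Sample Pearson correlation of two equal-length lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / (suu * svv) ** 0.5


y_hour1, y_hour2 = simulate(5000)
r = correlation(y_hour1, y_hour2)
print(f"sample correlation between the two hours: {r:.2f}")
```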
