linear regression with r 1

Post on 20-Jun-2015

Category:

Education

Linear Regression with R

2012-12-07 @HSPH

Kazuki Yoshida, M.D. MPH-CLE student

FREEDOM TO KNOW

1: Prepare data/specify model/read results

Group Website is at:

http://rpubs.com/kaz_yos/useR_at_HSPH

Previously in this group

• Introduction
• Reading Data into R (1)
• Reading Data into R (2)
• Descriptive, continuous
• Descriptive, categorical
• Deducer
• Graphics
• Groupwise, continuous

Menu

• Linear regression

Ingredients

Statistics

• Data preparation
• Model formula

Programming

• within()
• factor(), relevel()
• lm()
• formula = Y ~ X1 + X2
• summary()
• anova(), car::Anova()

Open R Studio

Create a new script and save it.

lowbwt.dat

http://www.umass.edu/statdata/statdata/data/lowbwt.txt

http://www.umass.edu/statdata/statdata/data/lowbwt.dat

We will use the lowbwt dataset used in BIO213

lbw <- read.table("http://www.umass.edu/statdata/statdata/data/lowbwt.dat", header = TRUE, skip = 4)

Load dataset from web

header = TRUE to pick up variable names

skip = 4 to skip the first 4 rows
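The effect of header and skip can be tried without a network connection. The following is a minimal sketch that assumes a made-up file with a four-line preamble; the preamble text, column names, and values are illustrative only and are not taken from lowbwt.dat.

```r
## Hypothetical file contents: four preamble lines, a header row, two data rows
raw <- c("Preamble line 1",
         "Preamble line 2",
         "Preamble line 3",
         "Preamble line 4",
         "ID LOW AGE BWT",
         "1 0 20 2500",
         "2 1 25 2400")

## text = lets read.table() parse a character vector instead of a file or URL
demo <- read.table(text = raw, header = TRUE, skip = 4)

names(demo)  # "ID" "LOW" "AGE" "BWT"
nrow(demo)   # 2
```

Without skip = 4 the preamble lines would be parsed as data, and without header = TRUE the variable names would become the first data row.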

lbw[c(10,39), "BWT"] <- c(2655, 3035)

“Fix” dataset

Replace data points to make the dataset identical to the BIO213 dataset

10th, 39th rows

BWT column

Lower case variable names

names(lbw) <- tolower(names(lbw))

Convert variable names to lower case

Put them back into variable names

See overview

library(gpairs)
gpairs(lbw)

Recoding

Changing and creating variables

dataset <- within(dataset, {
    _variable manipulations_
})

Take dataset

Name of newly created dataset (here replacing original)

Perform variable manipulation

You can specify by variable name only. No need for dataset$var_name
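A small illustration of the within() pattern may help. This is a sketch on a hypothetical toy data frame (toy, height_cm, weight_kg are made-up names, not part of the lbw data).

```r
## Hypothetical toy data frame (not the lbw data)
toy <- data.frame(height_cm = c(150, 160, 170),
                  weight_kg = c(50, 60, 70))

toy <- within(toy, {
    ## Inside the braces, columns are referred to by bare name
    height_m <- height_cm / 100
    bmi      <- weight_kg / height_m^2
})

## The same calculation without within() needs the toy$ prefix each time
toy$bmi2 <- toy$weight_kg / (toy$height_cm / 100)^2

all.equal(toy$bmi, toy$bmi2)  # TRUE
```

Variables created inside the braces (here height_m and bmi) become new columns of the returned data frame.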

lbw <- within(lbw, {
    ## Relabel race
    race.cat <- factor(race, levels = 1:3, labels = c("White","Black","Other"))

    ## Categorize ftv (frequency of visit)
    ftv.cat <- cut(ftv, breaks = c(-Inf, 0, 2, Inf), labels = c("None","Normal","Many"))
    ftv.cat <- relevel(ftv.cat, ref = "Normal")

    ## Dichotomize ptl
    preterm <- factor(ptl >= 1, levels = c(FALSE,TRUE), labels = c("0","1+"))
})

Categorize race and label:

1 to White, 2 to Black, 3 to Other

Numeric to categorical: element by element

1st will be reference

Explained more in depth

lbw <- within(lbw, {
    ## Relabel race
    race.cat <- factor(race, levels = 1:3, labels = c("White","Black","Other"))
})

factor() to create categorical variable

Take race variable

Order levels 1, 2, 3

Make 1 reference level

Label levels 1, 2, 3 as White, Black, Other

Create new variable named race.cat
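To see the mapping in isolation, here is a minimal sketch on a short made-up vector standing in for the race column.

```r
## Made-up numeric codes standing in for the race column
race <- c(1, 3, 2, 1)

race.cat <- factor(race, levels = 1:3, labels = c("White", "Black", "Other"))

as.character(race.cat)  # "White" "Other" "Black" "White"
levels(race.cat)        # "White" "Black" "Other" (the first level is the reference)
table(race.cat)
```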

Numeric to categorical: range to element

ftv.cat <- cut(ftv, breaks = c(-Inf, 0, 2, Inf), labels = c("None","Normal","Many"))

How breaks work

(-Inf, 0] None: ftv = 0
(0, 2] Normal: ftv = 1, 2
(2, Inf] Many: ftv = 3, 4, 5, 6

1st will be reference
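The interval boundaries can be checked directly. The sketch below applies the same breaks to the values 0 through 6; the vector is made up for illustration.

```r
## Made-up visit counts covering the range shown on the slide
ftv <- 0:6

## Default right = TRUE: each interval includes its upper bound, e.g. (0, 2]
ftv.cat <- cut(ftv, breaks = c(-Inf, 0, 2, Inf),
               labels = c("None", "Normal", "Many"))

data.frame(ftv, ftv.cat)
## 0 falls in None; 1 and 2 fall in Normal; 3 to 6 fall in Many
```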

Reset reference level

ftv.cat <- relevel(ftv.cat, ref = "Normal")

Change reference level of ftv.cat variable from None to Normal
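A quick way to confirm what relevel() does is to compare the level order before and after. This sketch uses a short made-up factor.

```r
## Made-up factor with None as the first (reference) level
ftv.cat <- factor(c("None", "Normal", "Many", "Normal"),
                  levels = c("None", "Normal", "Many"))
levels(ftv.cat)    # "None" "Normal" "Many"

## relevel() moves the chosen level to the front; the data are unchanged
ftv.cat2 <- relevel(ftv.cat, ref = "Normal")
levels(ftv.cat2)   # "Normal" "None" "Many"
```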

Numeric to Boolean to Category

preterm <- factor(ptl >= 1, levels = c(FALSE,TRUE), labels = c("0","1+"))

ptl >= 1 creates a TRUE/FALSE vector; levels and labels then turn it into categories

ptl < 1 to FALSE, then to “0”

ptl >= 1 to TRUE, then to “1+”
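The two steps can be run separately to see the intermediate logical vector. The ptl values below are made up for illustration.

```r
## Made-up counts of previous preterm labours
ptl <- c(0, 1, 0, 3)

## Step 1: numeric to logical
is.preterm <- ptl >= 1
is.preterm  # FALSE TRUE FALSE TRUE

## Step 2: logical to labelled factor; FALSE comes first, so "0" is the reference
preterm <- factor(is.preterm, levels = c(FALSE, TRUE), labels = c("0", "1+"))
as.character(preterm)  # "0" "1+" "0" "1+"
```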

Binary 0,1 to No,Yes

One-by-one method

lbw <- within(lbw, {
    ## Categorize smoke ht ui
    smoke <- factor(smoke, levels = 0:1, labels = c("No","Yes"))
    ht    <- factor(ht,    levels = 0:1, labels = c("No","Yes"))
    ui    <- factor(ui,    levels = 0:1, labels = c("No","Yes"))
})

Loop method

## Alternative to above. Run one method or the other, not both:
## re-applying factor(levels = 0:1) to the already relabeled columns gives NA.
lbw[,c("smoke","ht","ui")] <- lapply(lbw[,c("smoke","ht","ui")], function(var) {
    factor(var, levels = 0:1, labels = c("No","Yes"))
})

model formula

outcome ~ predictor1 + predictor2 + predictor3

formula

SAS equivalent: model outcome = predictor1 predictor2 predictor3;

In the case of t-test

age ~ zyg

age: continuous variable to be compared (variable to be explained)

zyg: grouping variable to separate groups (variable used to explain)
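The same outcome ~ group pattern can be tried with a dataset that ships with R. The age and zyg variables are not defined in this transcript, so the sketch below assumes the built-in sleep data as a stand-in, with extra as the continuous variable and group as the grouping variable.

```r
## Built-in sleep data used as a stand-in (age and zyg are not available here)
## extra: continuous variable to be compared; group: two-level grouping factor
tt <- t.test(extra ~ group, data = sleep)

tt$estimate  # mean of extra in each group
tt$p.value
```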

Y ~ X1 + X2

linear sum

• .       All variables except for the outcome
• + X2    Add X2 term
• - 1     Remove intercept
• X1:X2   Interaction term between X1 and X2
• X1*X2   Main effects and interaction term

Interaction term

Y ~ X1 + X2 + X1:X2

Main effects (X1 + X2) and interaction (X1:X2)

Y ~ X1 * X2

Main effects & interaction
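That the two spellings describe the same model can be checked without fitting anything, by asking terms() how each formula expands. Y, X1, and X2 are placeholder names, as on the slides.

```r
f.long  <- Y ~ X1 + X2 + X1:X2
f.short <- Y ~ X1 * X2

## Both formulas expand to the same set of terms
attr(terms(f.long),  "term.labels")  # "X1" "X2" "X1:X2"
attr(terms(f.short), "term.labels")  # "X1" "X2" "X1:X2"

## - 1 removes the intercept (attribute is 1 with an intercept, 0 without)
attr(terms(Y ~ X1 + X2),     "intercept")  # 1
attr(terms(Y ~ X1 + X2 - 1), "intercept")  # 0
```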

Y ~ X1 + I(X2 * X3)

On-the-fly variable manipulation

New variable (X2 times X3) created on-the-fly and used

I(): Inhibit formula interpretation. For math manipulation
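The difference I() makes shows up in how the formula expands and in the design matrix. This is a sketch on a hypothetical four-row data frame; toy and its values are made up.

```r
## Hypothetical toy data
toy <- data.frame(Y  = c(1, 2, 3, 4),
                  X1 = c(1, 2, 3, 4),
                  X2 = c(1, 2, 1, 2),
                  X3 = c(2, 2, 3, 3))

## Without I(), * is a formula operator: main effects plus interaction
attr(terms(Y ~ X1 + X2 * X3), "term.labels")     # "X1" "X2" "X3" "X2:X3"

## With I(), * is ordinary multiplication: a single product term
attr(terms(Y ~ X1 + I(X2 * X3)), "term.labels")  # "X1" "I(X2 * X3)"

## The design matrix contains the product computed on the fly
mm <- model.matrix(Y ~ X1 + I(X2 * X3), data = toy)
mm[, "I(X2 * X3)"]  # 2 4 3 6
```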

lm.full <- lm(bwt ~ age + lwt + smoke + ht + ui + ftv.cat + race.cat + preterm,
              data = lbw)

Fit a model

lm.full

See model object

Call: command repeated

Coefficient for each variable
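For readers who cannot download the lbw data, the same fit-then-print workflow can be tried on a dataset bundled with R. The sketch below assumes mtcars as a stand-in; the variables and model are illustrative and unrelated to birth weight.

```r
## mtcars (built in) as a stand-in for lbw; recode am the same way as smoke/ht/ui
cars <- within(mtcars, {
    am <- factor(am, levels = 0:1, labels = c("Automatic", "Manual"))
})

fit <- lm(mpg ~ wt + hp + am, data = cars)

fit          # prints the call and the coefficients
coef(fit)    # coefficients as a named numeric vector
names(coef(fit))  # "(Intercept)" "wt" "hp" "amManual"
```

The name amManual shows the naming rule for dummy variables: variable name followed by the non-reference level.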

summary(lm.full)

See summary

Call: command repeated

Model F-test

Residual distribution

Dummy variables created

R^2 and adjusted R^2

Coef/SE = t

ftv.catNone: people with no 1st trimester visits compared to people with a normal number of 1st trimester visits (reference level)

ftv.catMany: people with many 1st trimester visits compared to people with a normal number of 1st trimester visits (reference level)

race.catBlack: Black people compared to White people (reference level)

race.catOther: people of other races compared to White people (reference level)
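The pieces of the summary output can also be pulled out as objects, which makes the Coef/SE = t relationship easy to verify. This sketch again assumes the built-in mtcars data as a stand-in for lbw.

```r
## mtcars as a stand-in; factor(cyl) is a three-level categorical predictor
fit <- lm(mpg ~ wt + factor(cyl), data = mtcars)
s   <- summary(fit)

## Coefficient table: one row per dummy variable, reference level (4) omitted
co <- coef(s)
rownames(co)  # "(Intercept)" "wt" "factor(cyl)6" "factor(cyl)8"

## t value is the estimate divided by its standard error
all.equal(co[, "t value"], co[, "Estimate"] / co[, "Std. Error"])  # TRUE

s$r.squared      # R^2
s$adj.r.squared  # adjusted R^2
s$fstatistic     # model F-test: F value, numerator df, denominator df
```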

confint(lm.full)

Confidence intervals

Lower boundary

Upper boundary
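The lower and upper boundaries are the estimate minus and plus a t quantile times the standard error. The sketch below checks this by hand, assuming the built-in mtcars data as a stand-in for lbw.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

## 95% confidence intervals by default
ci <- confint(fit)
ci

## Manual calculation: estimate +/- t quantile * standard error
est <- coef(fit)
se  <- sqrt(diag(vcov(fit)))
tq  <- qt(0.975, df = df.residual(fit))
manual <- cbind(est - tq * se, est + tq * se)

all.equal(unname(ci), unname(manual))  # TRUE
```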

anova(lm.full)

ANOVA table (type I)

Degrees of freedom

Sequential SS

Mean SS = SS/DF

F = Mean SS / Mean SS of residual

Type I = Sequential SS

1 age: 1st gets all in type I

2 lwt: 2nd gets all but the overlap with the 1st in type I

3 smoke: last gets the remaining part only in type I
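Because type I SS are sequential, the order of terms in the formula matters. A minimal sketch, assuming the built-in mtcars data as a stand-in, fits the same two predictors in both orders.

```r
## Same model, two term orders
a1 <- anova(lm(mpg ~ wt + hp, data = mtcars))  # wt entered first
a2 <- anova(lm(mpg ~ hp + wt, data = mtcars))  # wt entered after hp

## SS credited to wt depends on its position
a1["wt", "Sum Sq"]  # wt gets everything it can explain
a2["wt", "Sum Sq"]  # wt gets only what hp has not already explained

## The residual SS is the same either way
all.equal(a1["Residuals", "Sum Sq"], a2["Residuals", "Sum Sq"])  # TRUE
```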

library(car)
Anova(lm.full, type = 3)

ANOVA table (type III)

Degrees of freedom

Marginal SS

F = Mean SS / Mean SS of residual

Multi-category variables tested as one

Type III = Marginal SS

1 age: 1st gets margin only in type III

2 lwt: 2nd gets margin only in type III

3 smoke: last gets margin only in type III
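Marginal SS can also be obtained in base R with drop1(), which removes each term in turn from the full model. This is a sketch under two assumptions: the built-in mtcars data stand in for lbw, and the model has no interaction terms, in which case these marginal SS match what a type III table reports for each predictor. It is not a replacement for car::Anova().

```r
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

seq.ss  <- anova(fit)              # sequential (type I) SS
marg.ss <- drop1(fit, test = "F")  # marginal SS: each term dropped from the full model

seq.ss
marg.ss

## The last term gets the same SS under both schemes
all.equal(seq.ss["qsec", "Sum Sq"], marg.ss["qsec", "Sum of Sq"])  # TRUE

## Earlier terms generally do not
seq.ss["wt", "Sum Sq"]
marg.ss["wt", "Sum of Sq"]
```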

Comparison: Type I vs. Type III

library(effects)
plot(allEffects(lm.full), ylim = c(2000,4000))

Effect plot

Fix Y-axis values for all plots

Effect of a variable with other covariates set at average
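The idea of varying one variable while holding the others at their average can be sketched with predict() in base R. This is a simplified illustration that assumes the built-in mtcars data and numeric covariates only; it does not reproduce everything the effects package does.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

## Vary wt over its observed range; hold hp at its mean
grid <- data.frame(wt = seq(min(mtcars$wt), max(mtcars$wt), length.out = 5),
                   hp = mean(mtcars$hp))
grid$fitted <- predict(fit, newdata = grid)
grid

## With hp fixed, the fitted line has slope equal to the wt coefficient
diff(grid$fitted) / diff(grid$wt)
coef(fit)["wt"]
```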

Interaction

lm.full.int <- lm(bwt ~ age*lwt + smoke + ht + ui + age*ftv.cat + race.cat*preterm,
                  data = lbw)

age*lwt: Continuous * Continuous

age*ftv.cat: Continuous * Categorical

race.cat*preterm: Categorical * Categorical

This model is for demonstration purposes only.

Anova(lm.full.int, type = 3)

Degrees of freedom

Marginal SS

F = Mean SS / Mean SS of residual

Interaction terms

Continuous * Continuous

plot(effect("age:lwt", lm.full.int))

lwt level

Continuous * Categorical

plot(effect("age:ftv.cat", lm.full.int), multiline = TRUE)

Categorical * Categorical

plot(effect(c("race.cat*preterm"), lm.full.int),
     x.var = "preterm", z.var = "race.cat", multiline = TRUE)
