18 Modelling


Page 1: 18 Modelling

Hadley Wickham

Stat405: Quantitative models

Wednesday, 28 October 2009

Page 2: 18 Modelling

1. Strategy for analysing large data.

2. Introduction to the Texas housing data.

3. What’s happening in Houston?

4. Using models as tools.

5. Using models in their own right.

Page 3: 18 Modelling

Large data strategy

Start with a single unit, and identify interesting patterns.

Summarise patterns with a model.

Apply model to all units.

Look for units that don’t fit the pattern.

Summarise with a single model.
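To make the strategy concrete, here is a toy illustration (my sketch, not from the slides; it uses the ChickWeight data that ships with R, and the plyr functions introduced later in the deck):

library(plyr)

# Start with a single unit and look for a pattern
one <- subset(ChickWeight, Chick == 1)
lm(weight ~ Time, data = one)

# Apply the same model to every unit
models <- dlply(ChickWeight, "Chick",
  function(df) lm(weight ~ Time, data = df))

# Summarise each fit, and look for units that don't fit the pattern
fit <- ldply(models, function(mod) c(rsq = summary(mod)$r.squared))
head(fit[order(fit$rsq), ])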


Page 4: 18 Modelling

Texas housing data

For each of the 45 metropolitan areas in Texas, for each of the 112 months from 2000 to 2009:

Number of houses listed and sold

Total value of houses, and average sale price

Average time on market

Photo CC BY http://www.flickr.com/photos/imagesbywestfall/3510831277/

Page 5: 18 Modelling

Strategy

Start with a single city (Houston).

Explore patterns & fit models.

Apply models to all cities.


Page 6: 18 Modelling

[Figure: avgprice (160,000 to 220,000) vs date, 2000 to 2008, for Houston]

Page 7: 18 Modelling

[Figure: sales (3,000 to 8,000) vs date, 2000 to 2008, for Houston]

Page 8: 18 Modelling

[Figure: onmarket (4.0 to 6.5) vs date, 2000 to 2008, for Houston]

Page 9: 18 Modelling

Seasonal trends

The seasonal pattern makes it much harder to see the long-term trend. How can we remove it?

(Many sophisticated techniques from time series, but what’s the simplest thing that might work?)


Page 10: 18 Modelling

[Figure: avgprice (160,000 to 220,000) vs month (1 to 12), one line per year]

Page 11: 18 Modelling

[Figure: as on the previous slide, with the monthly mean overlaid as a thick red line]

qplot(month, avgprice, data = houston, geom = "line", group = year) +
  stat_summary(aes(group = 1), fun.y = "mean", geom = "line",
    colour = "red", size = 2, na.rm = TRUE)

Page 12: 18 Modelling

Challenge

What does the following function do?

deseas <- function(var, month) {
  resid(lm(var ~ factor(month))) + mean(var, na.rm = TRUE)
}

How could you use it in conjunction with transform to deseasonalise the data? What if you wanted to deseasonalise every city?
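As a quick check of what deseas does (my sketch, on simulated data, not from the slides): regressing on factor(month) and taking residuals removes the monthly means, and adding back mean(var) restores the overall level.

set.seed(1)
month <- rep(1:12, times = 5)
x <- 100 + 10 * sin(2 * pi * month / 12) + rnorm(60)
x_ds <- deseas(x, month)

tapply(x_ds, month, mean)  # every month now has the same mean
mean(x_ds) - mean(x)       # overall level unchanged (zero, up to rounding)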


Page 13: 18 Modelling

houston <- transform(houston,
  avgprice_ds = deseas(avgprice, month),
  listings_ds = deseas(listings, month),
  sales_ds = deseas(sales, month),
  onmarket_ds = deseas(onmarket, month))

qplot(month, sales_ds, data = houston, geom = "line", group = year) + avg

Page 14: 18 Modelling

[Figure: avgprice_ds (150,000 to 210,000) vs month, one line per year]

Page 15: 18 Modelling

Models as tools

Here we're using the linear model as a tool: we don't care about the coefficients or the standard errors, we're just using it to remove a striking pattern.

Tukey called this process "residuals and reiteration": by removing a striking pattern, we can see more subtle patterns.


Page 16: 18 Modelling

[Figure: avgprice_ds (150,000 to 210,000) vs date, 2000 to 2008]

Page 17: 18 Modelling

[Figure: sales_ds (4,000 to 7,000) vs date, 2000 to 2008]

Page 18: 18 Modelling

[Figure: onmarket_ds (4.0 to 6.5) vs date, 2000 to 2008]

Page 19: 18 Modelling

Summary

Most variables seem to be a combination of a strong seasonal pattern and a weaker long-term trend.

How do these patterns hold up for the rest of Texas? We’ll focus on sales.


Page 20: 18 Modelling

[Figure: sales (2,000 to 8,000) vs date, 2000 to 2008, one line per city]

Page 21: 18 Modelling

[Figure: sales_ds (1,000 to 7,000) vs date, 2000 to 2008, one line per city]

tx <- ddply(tx, "city", transform, sales_ds = deseas(sales, month))
qplot(date, sales_ds, data = tx, geom = "line", group = city)

Page 22: 18 Modelling

[Figure and code as on the previous slide]

Is this still such a good idea? What do we lose? Is there anything else we could do to improve the plot?


Page 23: 18 Modelling

[Figure: log10(sales) (0.5 to 3.5) vs date, 2000 to 2008, one line per city]

qplot(date, log10(sales), data = tx, geom = "line", group = city)

Page 24: 18 Modelling

It works, but...

Instead of throwing the models away and just using the residuals, let’s keep the models and explore them in more depth.


Page 25: 18 Modelling

Two new tools

dlply: takes a data frame, splits it up in the same way as ddply, applies a function to each piece, and combines the results into a list.

ldply: takes a list, applies a function to each element, and then combines the results into a data frame.

dlply + ldply = ddply
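A toy example of this relationship (my sketch, not from the slides):

library(plyr)

df <- data.frame(g = rep(c("a", "b"), each = 3), x = 1:6)
f <- function(d) data.frame(mean_x = mean(d$x))

pieces <- dlply(df, "g", f)  # data frame -> list, one piece per group
ldply(pieces, identity)      # list -> data frame; plyr restores the split
                             # labels, so this matches ddply(df, "g", f)
ddply(df, "g", f)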


Page 26: 18 Modelling

models <- dlply(tx, "city", function(df)
  lm(sales ~ factor(month), data = df))

models[[1]]
coef(models[[1]])

ldply(models, coef)


Page 27: 18 Modelling

Labelling

Notice we didn't have to do anything to have the coefficients labelled correctly.

Behind the scenes, plyr records the labels used for the split step, and ensures they are preserved across multiple plyr calls.

Page 28: 18 Modelling

Back to the model

What are some problems with this model? How could you fix them?

Is the format of the coefficients optimal?

Turn to the person next to you and discuss for 2 minutes.

Page 29: 18 Modelling

qplot(date, log10(sales), data = tx, geom = "line", group = city)

models2 <- dlply(tx, "city", function(df)
  lm(log10(sales) ~ factor(month), data = df))

coef2 <- ldply(models2, function(mod) {
  data.frame(
    month = 1:12,
    effect = c(0, coef(mod)[-1]),
    intercept = coef(mod)[1])
})

Page 30: 18 Modelling

(Code as on the previous slide.)

Put the coefficients in rows, so they can be plotted more easily.

Page 31: 18 Modelling

[Figure: effect (−0.1 to 0.4) vs month, one line per city]

qplot(month, effect, data = coef2, group = city, geom = "line")

Page 32: 18 Modelling

[Figure: 10^effect (1.0 to 2.5) vs month, one line per city]

qplot(month, 10 ^ effect, data = coef2, group = city, geom = "line")

Page 33: 18 Modelling

[Figure: 10^effect vs month, one panel per city, Abilene to Wichita Falls]

qplot(month, 10 ^ effect, data = coef2, geom = "line") + facet_wrap(~ city)

Page 34: 18 Modelling

What should we do next?

What do you think?

You have 30 seconds to come up with (at least) one idea.


Page 35: 18 Modelling

My ideas

Fit a single model, log(sales) ~ city * factor(month), and look at the residuals.

Fit individual models, log(sales) ~ factor(month) + ns(date, 3), and look for cities that don't fit.

Page 36: 18 Modelling

# One approach: fit a single model
mod <- lm(log10(sales) ~ city + factor(month), data = tx)

tx$sales2 <- 10 ^ resid(mod)
qplot(date, sales2, data = tx, geom = "line", group = city)
last_plot() + facet_wrap(~ city)

(Since the model is fit on the log10 scale, 10 ^ resid(mod) gives sales relative to the model's prediction for that city and month, with 1 meaning "as predicted".)

Page 37: 18 Modelling

[Figure: sales2 (0.5 to 3.5) vs date, one line per city]

qplot(date, sales2, data = tx, geom = "line", group = city)

Page 38: 18 Modelling

[Figure: sales2 vs date, one panel per city]

last_plot() + facet_wrap(~ city)

Page 39: 18 Modelling

# Another approach: the essence of most cities is a seasonal
# term plus a smooth long-term trend. We could fit this
# model to each city, and then look for models which don't
# fit well.
library(splines)
models3 <- dlply(tx, "city", function(df) {
  lm(log10(sales) ~ factor(month) + ns(date, 3), data = df)
})

# Extract the R-squared from each model
rsq <- function(mod) c(rsq = summary(mod)$r.squared)
quality <- ldply(models3, rsq)

Page 40: 18 Modelling

[Figure: rsq (0.5 to 0.9) for each city, cities in alphabetical order]

qplot(rsq, city, data = quality)

Page 41: 18 Modelling

[Figure: rsq (0.5 to 0.9) for each city, cities ordered by rsq]

qplot(rsq, reorder(city, rsq), data = quality)

How are the good fits different from the bad fits?

Page 42: 18 Modelling

quality$poor <- quality$rsq < 0.7
tx2 <- merge(tx, quality, by = "city")

cities <- dlply(tx, "city")

mfit <- mdply(cbind(mod = models3, df = cities), function(mod, df) {
  data.frame(
    city = df$city,
    date = df$date,
    resid = resid(mod),
    pred = predict(mod))
})
tx2 <- merge(tx2, mfit)

Page 43: 18 Modelling

(Code as on the previous slide.)

mdply takes each row of its input and feeds it to the function as arguments.
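A minimal illustration of mdply (my sketch, adapted from the plyr documentation, not from the slides):

library(plyr)

# Each row supplies the arguments for one call: rnorm(n = 2, mean = m, sd = s)
mdply(data.frame(mean = 1:3, sd = 1:3), rnorm, n = 2)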


Page 44: 18 Modelling

[Figure: raw data, log10(sales) vs date, one panel per city]

Page 45: 18 Modelling

[Figure: predictions (pred) vs date, one panel per city]

Page 46: 18 Modelling

[Figure: residuals (resid, −0.4 to 0.4) vs date, one panel per city]

Page 47: 18 Modelling

Your turn

Pick one of the other variables and perform a similar exploration, working through the same steps.
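One possible starting point (my sketch, assuming the tx columns used earlier, with onmarket swapped in for sales; whether the log10 transformation helps for this variable is part of the exercise):

library(plyr)
library(splines)
library(ggplot2)

models_om <- dlply(tx, "city", function(df) {
  lm(log10(onmarket) ~ factor(month) + ns(date, 3), data = df)
})
quality_om <- ldply(models_om,
  function(mod) c(rsq = summary(mod)$r.squared))
qplot(rsq, reorder(city, rsq), data = quality_om)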


Page 48: 18 Modelling

Conclusions

A simple (and relatively small) example, but one that shows how collections of models can be useful for gaining understanding.

Each attempt illustrated something new about the data.

plyr made it easy to create and summarise a collection of models, so we didn't have to worry about the mechanics.
